sorting text in a file

Adam Akhtar · Mar 26, 2008

Hi, I've been hacking away at this all morning and getting nowhere fast.
I'm relatively new to Ruby and I'm not so hot at regex.

I'm trying to grab text data from a website that shows events and then
put each event into its own class. I figured out how to get the
screen-scraped stuff into a clean state. It's just processing it into my
class that I'm having problems with.

Here are a few events in their natural format:

---start of file----
Toto and Boz Scaggs

Seminal American rock band with the talented blues-rock musician. Mar
21, 7pm, ¥13,000. JCB Hall, Suidobashi. Tel: Udo 03-3402-5999.

Kreva

Hip-hop track maker. Mar 21, 7pm, ¥5,000. Akasaka Blitz.

Tel: Disk Garage 03-5436-9600.

Blood Red Shoes

Rock duo from the UK. Mar 21, 7pm, ¥5,000. Shibuya Club Quattro.
Tel: Creativeman 03-3462-6969.

etcetcetc
---end-----

First I grab the file into a string. As all the concerts are separated
by 4 newlines, I use

concertevents = filetext.split(/\n\n\n\n/)

to get an array of events.

I'd then like to process these further by keeping the group name separate
from the rest of the details. So I thought I'd do

artist = conevt.slice(/[^\n]*/) #get artist info

which assumes the group name will only be on one line. Fine for this
prototype.

The details are a bit trickier as some spill onto a second line (but
separated by a blank line). The second event is like that. I tried

description = conevt.slice(/.*\n\n(.*\n\n.*)/,1) #get desc

Although my RegexCoach program says it works with the first event, when
I run the program, slice returns nil for description. It
definitely works for the second event, which takes up 3 lines.

So the first question is how I should alter the above regex to make it
work for the cases above. Any hints, tips or, if you feel like it,
answers are welcome. At this stage I'm up for easier, longer ways rather
than the shorter, more cryptic ones.

Second, am I going about this the right way? Should I have just avoided
regex and simply read the file line by line, using if structures to
figure out which lines go with which event?

Does anyone know of any good resources, e.g. tutorials, on this subject,
i.e. screen scraping, cleaning the grabbed text and then processing it
into your own classes?

Wow, it's a long post... I'll leave it at that.

7stud -- · Mar 26, 2008

Adam said:
The details are a bit trickier as some spill onto a second line (but
separated by a blank line).

Then you should have posted an example file with all the possibilities.

Second, am I going about this the right way?

Probably not.

Should I have just avoided
regex and simply read the file line by line using if structures to
figure out which lines go with which event?

That is one way. On the website, the information is probably contained
in different HTML tags. So scraping the website, then joining all the
data together, then trying to separate the data is not a good plan.
You should be able to pick the pieces from the website directly.
However, you have to know how HTML is written and how it is structured.
Ruby has several gems, e.g. Hpricot, that make it easy to pick out
pieces of information on a website, but you sort of have to know how
HTML works in order to pick out the data you want.

If that sounds too confusing, then just deal with the text file you
have, and YES, you should avoid regexes whenever possible. So reading
the file line by line would be much easier.

7stud -- · Mar 26, 2008

Ack! Let's try that again:

That is one way. On the website, the data is probably contained in
different HTML tags. So scraping the website, then joining all the data
together, then trying to separate the data back out again is not a very
good plan. You should be able to pick out the pieces of the data you
want directly from the HTML. However, you have to know how HTML is
written and how HTML is structured. Ruby has several gems, e.g.
Hpricot, that make it easy to pick out pieces of information from a page
of HTML.

If that sounds too confusing, then just deal with the text file you have
already, and YES, you should avoid regexes whenever possible. Reading
the file line by line would be better and probably easier.
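
For what it's worth, here is a minimal sketch of the line-by-line idea. It assumes the layout described in the original post: the artist is the first non-blank line of an event, and a new event starts after three or more blank lines in a row (i.e. four newlines). The method name parse_events, the hash keys and the shortened sample text are made up for illustration, not taken from the real data.

```ruby
# Hypothetical helper: walks the text line by line and groups lines into
# events. Assumes a new event begins after 3+ consecutive blank lines.
def parse_events(text)
  events = []
  current = nil
  blank_run = 0

  text.each_line do |raw|
    line = raw.strip
    if line.empty?
      blank_run += 1
      next
    end

    if current.nil? || blank_run >= 3
      # First non-blank line of an event is taken to be the artist.
      current = { artist: line, details: [] }
      events << current
    else
      # Anything else belongs to the current event's description.
      current[:details] << line
    end
    blank_run = 0
  end

  events.map { |e| { artist: e[:artist], description: e[:details].join(" ") } }
end

# Shortened, made-up sample in the same shape as the posted file.
sample = "Toto and Boz Scaggs\n\nSeminal American rock band. Mar\n21, 7pm.\n\n\n\n" \
         "Kreva\n\nHip-hop track maker. Akasaka Blitz.\n\nTel: Disk Garage 03-5436-9600.\n"

events = parse_events(sample)
events.each { |e| puts "ARTIST: #{e[:artist]}", "DESC: #{e[:description]}", "" }
```

Counting blank lines keeps a single blank line inside a description (as in the Kreva event) from being mistaken for the gap between events.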

Zoltan Dezso · Mar 26, 2008

Adam said:
Hi, I've been hacking away at this all morning and getting nowhere fast.
I'm relatively new to Ruby and I'm not so hot at regex.

Hi,

How about something like this quick script?
Don't forget the /m modifier, which lets "." match newlines as well.
(It assumes that there is no newline in the artist name part, though.)

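File.open('events.txt', 'r') do |f|
  contents = f.read
  # Events are separated by four newlines.
  contents.split(/\n\n\n\n/).each do |conevt|
    # First line is the artist; with /m, (.*) captures the rest, newlines included.
    if conevt =~ /([^\n]*)\n\n(.*)/m
      artist = $1
      description = $2
      print "ARTIST: #{artist}\nDESC: #{description}\n\n"
    end
  end
end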
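
In case it helps to see why the original slice came back nil: without /m, "." never matches a newline, so /.*\n\n(.*\n\n.*)/ can only match when the description itself contains a blank line. A small sketch, using two shortened, made-up events in the same shape as the posted ones (the variable names are just for illustration):

```ruby
# Shortened, hypothetical events in the layout from the original post.
first  = "Toto and Boz Scaggs\n\nSeminal American rock band. Mar\n21, 7pm, JCB Hall."
second = "Kreva\n\nHip-hop track maker. Akasaka Blitz.\n\nTel: Disk Garage."

original = /.*\n\n(.*\n\n.*)/     # needs a blank line inside the description
with_m   = /\A([^\n]*)\n\n(.*)/m  # /m lets "." match newlines too

p first.slice(original, 1)   # nil: this description has no blank line in it
p second.slice(original, 1)  # matches: this description contains "\n\n"

[first, second].each do |conevt|
  if (match = with_m.match(conevt))
    artist, description = match[1], match[2]
    puts "ARTIST: #{artist}", "DESC: #{description}", ""
  end
end
```

The first description wraps with a single newline, which is exactly the case the original pattern cannot cross.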

Second, am I going about this the right way? Should I have just avoided
regex and simply read the file line by line, using if structures to
figure out which lines go with which event?

From what I can see, I guess this is the format you have to deal with.
In this case, I believe regexps are the way to go, and you will only hurt
yourself in the long term with switch-case spaghetti.

If you can get your hands on other formats, or if you are in
charge of creating the data, I wouldn't recommend
using plain text at all (use YAML, XML, INI, whichever you like
best), but I think that is not an option for you.
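
Just to illustrate what a structured format buys you, here is a rough sketch using the yaml library that ships with Ruby. The field names (artist, description, date, venue) are made up for the example; nothing in the thread defines a schema.

```ruby
require "yaml"

# Hypothetical field names; the original post does not define a schema.
events = [
  { "artist" => "Kreva", "description" => "Hip-hop track maker.",
    "date" => "Mar 21, 7pm", "venue" => "Akasaka Blitz" }
]

yaml_text = events.to_yaml     # serialize to a YAML string
puts yaml_text

loaded = YAML.load(yaml_text)  # parse it back into arrays and hashes
puts loaded.first["artist"]
```

With the fields already separated, there is nothing left to pick apart with a regex.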

Zaki

Adam Akhtar · Mar 27, 2008

Thanks very much for your replies, 7stud and Zaki. I am tinkering with
Hpricot now. I'll see how working with the HTML tags in place goes.
