Text File Parsing

gregarican

I am trying to create a routine that will parse a text file and break
down the various fields into an array. Here's the basic layout:

element1 | element2 | element3
element4 | element5 | element6

As you can tell it's pretty straightforward. I can just #split things
using the pipe as the delimiter. But every now and again the last
element on the line is actually thrown down to the next line, like:

element7 | element8 |
element9
element10 | element11 | element12
element13 | element14 |
element15

Can anyone suggest an easy way to parse things so that the "dangling"
elements are brought back to the preceding lines? In the example above
I would need to bring element9 up to the last pipe on the preceding
line. And same with bringing element15 to the last pipe on its
preceding line.
 
Dave Burt

# s is assumed to hold the whole file as one string
@delim = /\s+\|\s+/
@tok = /\w+/
s.scan(/(#@tok)#@delim(#@tok)#@delim(#@tok)/m) do |a,b,c|
  p [a,b,c]
end

Cheers,
Dave
 
Phil Robyn

c:\cmd>for /f "tokens=1-3 delims=|" %a in (
c:\temp\PipeDelimited.txt
) do @echo %a^|%b^|%c
element1 | element2 | element3
element4 | element5 | element6
element7 | element8 | element9
element10 | element11 | element12
element13 | element14 | element15
 
Phil Robyn

Sorry, wrong NG!
 
William James

fs = /\s+\|\s+/
rec = ""
IO.foreach("data1") { |line|
  rec += line
  # A line that ends with the delimiter continues on the next line.
  if line !~ /#{fs}$/
    p rec.chomp.split( fs )
    rec = ""
  end
}
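
The same line-accumulating idea can be wrapped in a method that takes a string, which makes it easy to try without a data file. This is only a sketch: the name parse_records and the inline sample are made up for illustration, and it assumes a record continues on the next line whenever a line ends with a pipe.

```ruby
# Hypothetical helper (not from the thread): joins "dangling" lines onto
# the record they belong to, then splits each record on the pipe.
def parse_records(text)
  records = []
  pending = String.new
  text.each_line do |line|
    pending << line.chomp
    # A trailing pipe means the last field is on the next line.
    next if pending =~ /\|\s*\z/
    records << pending.split(/\s*\|\s*/).map(&:strip) unless pending.strip.empty?
    pending = String.new
  end
  # Keep whatever is left if the input ends in the middle of a record.
  records << pending.split(/\s*\|\s*/).map(&:strip) unless pending.strip.empty?
  records
end

sample = <<~DATA
  element1 | element2 | element3
  element7 | element8 |
  element9
DATA

parse_records(sample).each { |fields| p fields }
```

Because the split happens only after a record is complete, a third field made of several words stays in one piece.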
 
Robert Klemme

If the file is reasonably small you could do something like this (untested):

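# The first two fields must stay on one line; \s* after the second pipe
# lets the third field sit on the following line.
File.read("foo.txt").scan(%r{[^|\n]+\|[^|\n]+\|\s*[^|\n]+}) do |line|
  items = line.split(/\s*\|\s*/)
  ...
end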

Kind regards

robert
 
gregarican

I had to modify what you submitted a bit. Here's the version I have,
where 'infile' represents the source text file:

--------------------------
infile.readlines.collect {|line|
  contents << line
}

contents.scan(/(\w+)\|(\w+)\|(\w+)/m) do |a,b,c|
  p [a,b,c]
end
--------------------------

Where I run into a problem is that the third token I need to get (in
this case the local block variable 'c') can be a sentence composed of
multiple words. I will need to revisit my 'Mastering Regular
Expressions' book, as I am a bit rusty at regexes, which is likely
apparent by the trouble I am running into accomplishing the task at
hand :-/
 
Dave Burt

OK, let me help!

First, let's look at your first block of code. It does this:
* infile: assumed to be an open input file handle
* readlines: read the file into an array of lines
* collect: produce another array consisting of the entire file's data
repeated for each line in the file. (each is a little more appropriate
for this kind of use, where you don't care about the result.)
* contents: add each line successively into a single string

If all you want to do is get the file's data into a string, the
following alternative:
* avoids the need to open and close file handles
* avoids producing 2 extra arrays
* should be slightly quicker
* is shorter

contents = IO.read(filename)
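
As a rough check that the one-call read gives the same string as the line-by-line loop, here is a small sketch. The helper name read_by_lines and the throwaway temp file are assumptions made for the example; File.read is the same method as IO.read, called through the File subclass.

```ruby
require "tempfile"

# Hypothetical helper: builds the string line by line, as the loop above does.
def read_by_lines(path)
  contents = String.new
  File.open(path) { |infile| infile.each_line { |line| contents << line } }
  contents
end

# Write a small throwaway file so the comparison can be run as-is.
Tempfile.create("pipe_demo") do |f|
  f.write("a|b|c\nd|e|\nf\n")
  f.flush
  puts read_by_lines(f.path) == File.read(f.path)
end
```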

Now, the regexp. If \w isn't broad enough, use . (to match any
character). That will match | too, so we'll add ^...$ to make sure the
match starts at the start of a line and ends at the end of a line. We
also need to make it non-greedy (otherwise, for example, "a | b | c\nd |
e | f\n" would be matched as ["a | b | c\nd ", " e ", " f\n"]). Finally,
a \s* after the second pipe swallows the line break when the third field
has been pushed down to the next line; without it, the lazy third group
would stop, empty, at the end of the first line.

contents.scan(/^(.*?)\|(.*?)\|\s*(.*?)$/m) do |a,b,c|
  p [a,b,c]
end
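
To see how a whole-string scan copes with both a multi-word third field and a dangling one, here is a small self-contained sketch. The pattern is a variant chosen for illustration, not taken from the thread: it keeps the first two fields on one line with [^|\n] and allows optional whitespace, including a line break, before the third. The constant name RECORD and the sample data are made up.

```ruby
# Assumed pattern for illustration: the first two fields stay on one line,
# and \s* after the second pipe may swallow a line break before the third.
RECORD = /^([^|\n]*)\|([^|\n]*)\|\s*([^|\n]*)$/

contents = "id1|alpha|a short sentence\nid2|beta|\nanother sentence here\n"

# scan returns one array of captures per record; strip trims stray padding.
rows = contents.scan(RECORD).map { |fields| fields.map(&:strip) }
rows.each { |row| p row }
```

Excluding the newline from each field avoids relying on non-greedy matching, at the cost of assuming that only the third field can wrap onto the next line.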

Cheers,
Dave
 
