Extract range of lines from a text file


Amer Neely

Dr.Ruud said:
Amer Neely wrote:



Or use a simplified state machine.

my $state = -1;
my $line  = -1;

while (<>) {
    chomp;  # s/^\s+//; s/\s+$//;

    if (-1 == $state) {  # looking for the start marker
        if (/^Transaction Time:/) {
            ++$state;
        }
    }
    elsif (0 == $state) {  # expecting the blank line after the marker
        if (/^$/) {
            ++$state;
            $line = 0;
        }
        else {
            die "$state: <$_>?";
        }
    }
    elsif (1 == $state) {  # in address
        if (/^$/) {
            # skip
        }
        elsif (/^\d{3}-\d{3}-\d{4}$/) {
            $state = -1;
            $line  = -1;
        }
        else {
            ++$line;
            print "$line: $_\n";
        }
    }
    else {
        die "$state: <$_>?";
    }
}

(untested)

Very interesting. It's a little more complex than I need (see the reply
by Xicheng Jia dated today 11:57). That works for me.

I did adapt your code (with a few minor fixes) to my situation, but got
an error when I ran it on my test input file:
Sun Apr 9 12:29:32 2006
0: <xxxxxxxxxxxx
xxxxxxxxxxxxxxx
SAULT STE MARIE Ontario
P6A 3P4
CANADA>? at parse_mail7.pl line 39, <IN> chunk 2.

Thank you for this very different approach. I will keep it in mind for
other situations.
--
Amer Neely
Home of Spam Catcher
W: www.softouch.on.ca
E: (e-mail address removed)
Perl | MySQL | CGI programming for all data entry forms.
"We make web sites work!"
 

Amer Neely

Xicheng said:
Here is a test code which uses paragraph-mode to extract info and try
to insert into your database (tested under WinXP)..
--------------------------
use strict;
use warnings;

local $/ = "";    # paragraph mode: records are separated by blank lines

while ( <DATA> ) {
    if ( /^Transaction Time:/ .. /^\d\d\d-\d\d\d-\d\d\d\d\s*$/ ) {
        my $lines = tr/\n//;
        next if $lines < 6;
        my ( $name, $addr1, $addr2, $city, $code, $cont );
        if ( $lines == 6 ) {
            ( $name, $addr1, $city, $code, $cont ) = split "\n";
            $addr2 = "";
        } elsif ( $lines == 7 ) {
            ( $name, $addr1, $addr2, $city, $code, $cont ) = split "\n";
        }
        # to INSERT INTO mytable from mydb.
        #$sth->execute( $name, $addr1, $addr2, $city, $code, $cont );
        print <<TEST;
name = $name
addr1 = $addr1
addr2 = $addr2
city = $city
code = $code
country = $cont

TEST
    }
}

__DATA__
one block
one block
one block
one block
one block

Transaction Time: 18:45:55

Amer Neely
POB 1481 Station Main
AMS dept
North Bay ON
P1B 8K7
CANADA

123-456-7890

some other blocks
some other blocks
some other blocks
some other blocks
some other blocks
some other blocks

Transaction Time: 18:45:34

Bmer Neely
POB 123
South
ABC 879
USA

800-346-7890

another block
another block
another block
another block
another block
another block
another block

Transaction Time: 18:45:55

Amer Neely
POB 1481 Station Main
North Bay ON
P1B 8K7
CANADA

123-456-7890

more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks

Slight hiccup.
This works perfectly when I run it on my file of extracted lines from
the real file.

When I run this code on the real mail file, I get no output, and $lines
is 0.

 

Xicheng Jia

Amer said:
Slight hiccup.
This works perfectly when I run it on my file of extracted lines from
the real file.

When I run this code on the real mail file, I get no output, and $lines
is 0.

You may need to check several things in your real email files:

1) Whether your records are separated by blank lines (paragraph mode
depends on this).

2) Whether the patterns /^Transaction Time:/ and
/^\d\d\d-\d\d\d-\d\d\d\d\s*$/ match your real lines. For example, check
the line right before the "Transaction Time: *****" line to see whether
it contains whitespace. If it is not a truly blank line, "Transaction
Time:" no longer sits at the start of a paragraph, and you may need to
add the 'm' modifier to your regex, so change

/^Transaction Time:/

to

/^Transaction Time:/m

Do the same for /^\d\d\d-\d\d\d-\d\d\d\d\s*$/.
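
To illustrate why /m can matter here, a minimal sketch. The sample text
and the paragraphs() helper are made up for this example and are not
taken from the real mail file. In paragraph mode a record can span
several lines, and without /m the ^ anchor matches only at the very
start of the record.

```perl
use strict;
use warnings;

# Hypothetical sample: the "Transaction Time:" line is NOT preceded by a
# blank line, so in paragraph mode it lands in the middle of a record.
my $text = "Some header line\nTransaction Time: 18:45:55\n\nAmer Neely\n";

# Read a string in paragraph mode and return its records.
sub paragraphs {
    my ($string) = @_;
    local $/ = "";                      # paragraph mode
    open my $fh, '<', \$string or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my ($first) = paragraphs($text);

# Without /m, ^ anchors only at the start of the whole record.
my $plain = ( $first =~ /^Transaction Time:/ )  ? 1 : 0;
# With /m, ^ also matches right after any embedded newline.
my $multi = ( $first =~ /^Transaction Time:/m ) ? 1 : 0;

print "without /m: $plain\n";
print "with /m:    $multi\n";
```

Here the first pattern fails and the second succeeds, which would be
consistent with getting no output from the real file until /m is added.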

In any case, if the records you want are not separated by blank lines
from the other text, then split "\n" may not work for your data, since
in paragraph mode your records will be mixed up with the surrounding
text.

The other way is to use $/ = "Transaction" as the input record
separator, filter out the unnecessary lines, and then split the result
up again.
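
A rough sketch of that second approach, under some assumptions: the
sample data and the address_blocks() helper are invented for
illustration, and the word "Transaction" is assumed to appear nowhere
else in the file (any other occurrence would also split a record).

```perl
use strict;
use warnings;

# Hypothetical sample data, modelled on the records in this thread.
my $text = <<'END';
noise line
Transaction Time: 18:45:55

Amer Neely
POB 1481 Station Main
North Bay ON
P1B 8K7
CANADA

123-456-7890
more noise
END

# Read a string in chunks that end at the literal separator. For each
# chunk holding a complete record, collect the non-blank lines found
# between the " Time:" line and the phone number.
sub address_blocks {
    my ($string) = @_;
    local $/ = "Transaction";           # custom input record separator
    open my $fh, '<', \$string or die "Cannot open in-memory file: $!";
    my @blocks;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                   # strips a trailing "Transaction"
        # Keep only chunks that start with " Time:" and contain a phone line.
        next unless $chunk =~ /\A Time:[^\n]*\n(.*?)^\d{3}-\d{3}-\d{4}\s*$/ms;
        my @lines = grep { /\S/ } split /\n/, $1;
        push @blocks, \@lines;
    }
    close $fh;
    return @blocks;
}

my @blocks = address_blocks($text);
print join( " | ", @$_ ), "\n" for @blocks;
```

Each element of @blocks is a reference to an array of address lines, so
5-line and 6-line addresses come out the same way without counting
newlines.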

Anyway, if you can post some sample email files here, I think you may
get more useful suggestions. :)

Xicheng
 

Amer Neely

Thank you for your suggestions. I would post the real file, but
obviously can't for privacy reasons. In looking at the file, it seems
each message starts with 'From - ' and then a timestamp. This is the
mail file from the Unix server where I will eventually run this script
if I ever get it working :) I'm developing on Win2K with ActiveState.

 

Amer Neely

I added the /m to both patterns as above, and it now works on my real
mail file. Again, thank you very much for all the time you've spent on this.

 

Tad McClellan

Amer Neely said:
The problem seems to be that $CustData holds all 5 lines. I need to
break out each of the lines into a separate string variable


my @separate_strings = split /\n/, $CustData;
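
As a small illustration of that one-liner (the contents of $CustData
and the variable names on the left-hand side are assumptions made up
for this sketch):

```perl
use strict;
use warnings;

# Hypothetical value for $CustData: five address lines in one string.
my $CustData = "Amer Neely\nPOB 1481 Station Main\nNorth Bay ON\nP1B 8K7\nCANADA\n";

# split /\n/ returns one element per line; trailing empty fields are dropped.
my @separate_strings = split /\n/, $CustData;

# Assign the pieces to named variables (names chosen for illustration).
my ( $name, $addr1, $city, $code, $country ) = @separate_strings;

print scalar(@separate_strings), " lines\n";
print "name = $name, country = $country\n";
```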
 

MSG

Amer said:
while (<IN>)
{
my @CustData=();

Sorry! I was wrong on this earlier. The array has to be declared
outside of the loop, otherwise it gets reset on every iteration.

Amer said:
Also had to change your ( @CustData[2 to ( $CustData[2 otherwise
I got no output at all.

@CustData[2..5] is an array slice, while $CustData[2..5] is not: there
the subscript is evaluated in scalar context, so it does not select
elements 2 through 5.
You should really put these two lines at the top of your code all the
time:
use strict;
use warnings;
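
To make the slice versus element distinction concrete, a minimal sketch
(the contents of @CustData are a made-up example shaped like one
captured block):

```perl
use strict;
use warnings;

# Hypothetical captured block, one element per input line.
my @CustData = (
    "Transaction Time: 18:45:55", "",
    "Amer Neely", "POB 1481 Station Main", "North Bay ON", "P1B 8K7", "CANADA",
    "", "123-456-7890",
);

my $one   = $CustData[2];                       # $ sigil: a single element
my @fixed = @CustData[ 2 .. 6 ];                # @ sigil: a slice of elements 2..6
my @body  = @CustData[ 2 .. $#CustData - 2 ];   # slice that adapts to the block length

print "$one\n";
print join( ", ", @body ), "\n";
```

The last form drops the first two and last two elements whatever the
block length, so it should cope with both 5-line and 6-line addresses.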

Anyway, here is a way of using an array for your problem. I think its
logic is cleaner than using $/ combined with the range operator. (That
feels redundant, because both $/ and the range operator give you a
block of lines, and $/ introduces more regex complexity, such as the
need for /m.) With an array, you can easily manipulate it to get
5-line, 6-line or N-line address blocks.

use strict;
use warnings;

my @CustData = ();
while ( <DATA> ) {
    if ( /^Transaction Time:/ .. /^\d\d\d-\d\d\d-\d\d\d\d$/ ) {
        push @CustData, $_;
        if ( /^\d\d\d-\d\d\d-\d\d\d\d$/ ) {
            print "---\n";
            # skip the marker line and blank line at the start, and the
            # blank line and phone number at the end
            print $_ for ( @CustData[ 2 .. $#CustData - 2 ] );
            print "\n";
            @CustData = ();
        }
    }
}


__DATA__
one block
one block
one block
one block
one block

Transaction Time: 18:45:55

Amer Neely
POB 1481 Station Main
AMS dept
North Bay ON
P1B 8K7
CANADA

123-456-7890

some other blocks
some other blocks
some other blocks
some other blocks
some other blocks
some other blocks

Transaction Time: 18:45:34

John Doe
POB 123
West Side
ABC 879
USA

800-346-7890

another block
another block
another block
another block
another block
another block
another block

Transaction Time: 18:45:55

Perl Hacker
#1 Main St.
South West
P5B 8K7
CANADA

000-999-7987

more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
more blocks
 

robic0

MSG said:
You should really put these two lines at the top of your code all the
time:
use strict;
use warnings;
From another post mentioning warnings:

Lukas Mai (e-mail address removed) wrote on MMMMDCIV September MCMXCIII:
<> Mark Hobley (e-mail address removed) wrote:
<>
<> > I now add brackets to change the precedence:
<> >
<> > print (4 + 2) * 3; # This unexpectedly gives 6
<>
<> Use warnings.

Actually, the fact this gives a warning is a damn good reason to NOT
use warnings.

Because Perl is more often wrong about this warning than it is right.
If I were to advocate against using Perl, I'd point out the warnings
it gives with 'print', and then rest my case.
[snip]

Perl also tries to be smart by looking what's following the parenthesis.
Only to end up looking like an utter fool.
[snip]

Yeah. It fails to explain why while you can make this mistake with any
function in Perl, it only warns for three of them. And most of the time
in a wrong way.


Don't use warnings. Unless you rip out this stupid warning from your
copy of Perl. And I'm not joking.



Abigail
 

Amer Neely

MSG said:
You should really put at the top of your code these two lines all the
time:
use strict;
use warnings;

Yes, I always do. I also add this snippet at the top during development.
BEGIN
{
    open( STDERR, '>>', "$0-err.txt" ) or die "Cannot redirect STDERR: $!";
    print STDERR "\n", scalar localtime, "\n";
}

It writes the errors to a file named after the script, with '-err.txt'
appended. Easier to find that way.

I'll see how this works, but I have a working version now, thanks to
everyone in the group. Xicheng Jia provided the right direction and
examples to get me on my way. Thanks for taking some time to put into
this for me.
 

Amer Neely

Tad said:
my @separate_strings = split /\n/, $CustData;

I think I ended up using something like that after munging around with
the code. I realized that I still had some 'splitting' to do. Now have a
working version. It walks through my mailbox file and pulls out the data
I was having trouble with. Now I can pull out the easy stuff. Thanks for
taking some time with this.
 
