Regular expression question.

MENTAT

Hi,

I have a log file that looks something like this

2005-03-29 17:17:11.293|DEBUG|Line 1|Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
2005-03-29 17:17:11.293|DEBUG|Line 9|Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
2005-03-29 17:17:11.309|INFO|Line 4|Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
2005-03-29 17:17:11.319|DEBUG|Line 9|Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<

I am trying to write a regular expression that extracts all the log
entries for a given value of "Line".

So if I wanted to look at the entries for Line 4 I want the output to
look something like

2005-03-29 17:17:11.309|INFO|Line 4|Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<

Note that I want everything in the line before the "Line 4" text as
well, such as time.

I tried using "<<<<<<<(.*?\|Line 4.*?)>>>>>>>(.*?)<<<<<<<" as the
regular expression (with the ms global modifiers), but the problem is
it matches everything from the beginning of the file (basically the
first <<<<<<<) to the "Line 4". Making .* non-greedy doesn't help
because after it finds the first <<<<<<< the non-greedy match goes all
the way up to the first "Line 4".

Replacing <<<<<<< with the beginning of line (^) doesn't make any
difference either because after the start of the first line, the .*?
still matches everything until "Line 4".

I tried using lookahead and lookbehind assertions as well, but to no
avail. This "<<<<<<<$\(.(?!$)*\|Line 4.*?)>>>>>>>(.*?)<<<<<<<" doesn't
match anything.

Of course, if I remove the s global modifier, I can easily match it
using "(^.*\|Line 4.*)", but then I can't get all the (variable) lines
between <<<<<<< and >>>>>>>. The .* won't match across new line.

Any idea how this problem could be solved? Any help is much
appreciated.

Thanks in advance.
 
Gunnar Hjalmarsson

MENTAT said:
I have a log file that looks something like this

2005-03-29 17:17:11.293|DEBUG|Line 1|
Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
2005-03-29 17:17:11.293|DEBUG|Line 9|
Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<

I am trying to write a regular expression that extracts all the log
entries for a given value of "Line".

Set the input record separator:

my $line = 4;
local $/ = "<<<\n";
/\|Line $line\|/ and print while <LOG>;

See "perldoc perlvar".
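
A minimal, self-contained sketch of the record-separator idea above. It reads a shortened sample from an in-memory filehandle rather than a real log file; the helper name entries_for_line and the sample data are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: return every log entry whose header contains "|Line $line|".
sub entries_for_line {
    my ($text, $line) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "<<<<<<<\n";    # each read now returns one whole log entry
    my @hits;
    while (my $entry = <$fh>) {
        push @hits, $entry if $entry =~ /\|Line \Q$line\E\|/;
    }
    close $fh;
    return @hits;
}

my $sample = <<'END';
2005-03-29 17:17:11.293|DEBUG|Line 1|
Actual Log output line 1
<<<<<<<
2005-03-29 17:17:11.309|INFO|Line 4|
Actual Log output line 1
Actual Log output line 2
<<<<<<<
END

print entries_for_line($sample, 4);
```

Because $/ is localized inside the helper, the rest of the program keeps reading line by line.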
 
axel

MENTAT said:
I have a log file that looks something like this

2005-03-29 17:17:11.293|DEBUG|Line 1|Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
2005-03-29 17:17:11.293|DEBUG|Line 9|
[...]

So if I wanted to look at the entries for Line 4 I want the output to
look something like
2005-03-29 17:17:11.309|INFO|Line 4|
Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
Note that I want everything in the line before the "Line 4" text as
well, such as time.
I tried using "<<<<<<<(.*?\|Line 4.*?)>>>>>>>(.*?)<<<<<<<" as the
regular expression (with the ms global modifiers), but the problem is
it matches everything from the beginning of the file (basically the
first <<<<<<<) to the "Line 4". Making .* non-greedy doesn't help
because after it finds the first <<<<<<< the non-greedy match goes all
the way up to the first "Line 4".
Replacing <<<<<<< with the beginning of line (^) doesn't make any
difference either because after the start of the first line, the .*?
still matches everything until "Line 4".

You could instead of using .*? specify exactly where a line end may
occur... along the lines of:

'<<<<<<<\n([\w |.:-]*?\|Line 4.*?)>>>>>>>(.*?)<<<<<<<'

Although from the sample data that you have provided, this would not
work if the data sought is at the beginning of the file. So perhaps:

'^([\w |.:-]*?\|Line 1.*?)>>>>>>>(.*?)<<<<<<<'

would be better. Of course you would have to check that the [\w |.:-]
part covers everything that will occur on that line and that there will
not be clashes with the 'Actual Log output' lines.

Axel
 
Tad McClellan

MENTAT said:
I have a log file that looks something like this

2005-03-29 17:17:11.293|DEBUG|Line 1|
Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<
2005-03-29 17:17:11.293|DEBUG|Line 9|
Actual Log output line 1
Actual Log output line 2
Actual Log output line 3
Actual Log output line ...
<<<<<<<

I am trying to write a regular expression that extracts all the log
entries for a given value of "Line".


Would a much easier way that makes no use of regular expressions be OK?

Of course, if I remove the s global modifier, I can easily match it
using "(^.*\|Line 4.*)", but then I can't get all the (variable) lines
between <<<<<<< and >>>>>>>. The .* won't match across new line.


There are several ways to write "any character" (which includes
newline) that remain unaffected by the m//s modifier.

[\000-\377]
[\d\D]
[\w\W]
[\s\S]
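
A short sketch of the difference, assuming a made-up one-entry record (the string below is illustrative, not taken from a real log). It compares "." without /s against the [\s\S] class from the list above.

```perl
use strict;
use warnings;

# Illustrative record; the content is an assumption for this example.
my $record = "header|Line 4|\nbody line 1\nbody line 2\n<<<<<<<\n";

# Without /s, "." stops at the first newline, so only the rest of the
# header line is captured.
my ($dot) = $record =~ /(\|Line 4.*)/;

# [\s\S] matches any character, newline included, with no modifier needed.
my ($class) = $record =~ /(\|Line 4[\s\S]*)/;

print "dot:   ", length($dot),   " characters\n";
print "class: ", length($class), " characters\n";
```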

Any idea how this problem could be solved?


Setting

$/ = "<<<<<<<\n";

before reading the input would help a lot.
 
MENTAT

Thanks Guys. It works with the input record separator. That was the
missing key. The following code works.

$required_pattern = "(\\|Line 4)";
if (-e $file_name)
{
    open (THEFILE, $file_name) or die "Unable to open file $file_name";
    $/ = "<<<<<<<\n"; # set the input record separator to this string.

    while (<THEFILE>)
    {
        if ($_ =~ m/$required_pattern/ms)
        {
            print $_;
        }
    }

    close (THEFILE);
}


Thanks again ...

 
MENTAT

PS: Tad, what was the other approach that doesn't use regular expressions?

 
A. Sinan Unur

(e-mail address removed) (MENTAT) wrote in

[ top-posting fixed. please don't do that. ]
....

....


Thanks Guys. It works with the input record separator. That was the
missing key. The following code works.

use strict;
use warnings;
$required_pattern = "(\\|Line 4)";

my $required_pattern = '(\|Line 4)';

Why are you capturing?
if (-e $file_name)

This is a useless test.
{
open (THEFILE, $file_name) or die "Unable to open file $file_name";

Because open will fail if the file does not exist. BTW, you should
include the reason open failed in the error message:

open my $file, '<', $file_name
or die "Unable to open file $file_name: $!";
if ($_ =~ m/$required_pattern/ms)

By default, m// matches against $_, so no need to explicitly specify it.

What do you think using both the m and s options for the match above
achieves?

From perldoc perlop:

m Treat string as multiple lines.
s Treat string as single line.

Which one is it?
{
print $_;
}

The whole thing can be written as

print if /$required_pattern/ose;

Sinan
 
John Bokma

MENTAT said:
PS: Tad, what was the other approach that doesn't use regular
expressions?

Not Tad, but perldoc -f index

print if index( $_, '|Line 4' ) >= 0;

Unlike a regex, index takes its argument literally, so the | needs no escaping.
 
John W. Krahn

A. Sinan Unur said:
use strict;
use warnings;


my $required_pattern = '(\|Line 4)';

Or even better:

my $required_pattern = qr'(?:\|Line 4)';

Why are you capturing?
Indeed.

[snip]


if ($_ =~ m/$required_pattern/ms)

By default, m// matches against $_, so no need to explicitly specify it.

What do you think using both the m and s options for the match above
achieves?

From perldoc perlop:

m Treat string as multiple lines.
s Treat string as single line.

Which one is it?

According to the OP's pattern he doesn't need either.

The whole thing can be written as

print if /$required_pattern/ose;

/s ??? /e ???

There are no periods in the pattern for /s and there are no expressions for /e
to evaluate. (And if he uses qr// to compile the regexp there is no need for /o.)



John
 
A. Sinan Unur

A. Sinan Unur wrote:

/s ??? /e ???

There are no periods in the pattern for /s and there are no
expressions for /e to evaluate. (And if he uses qr// to compile the
regexp there is no need for /o.)

Indeed :)

Dunno what I was thinking.

Sinan
 
A. Sinan Unur

It is not useless.
....

Remove the test and those semantics change.

I see your point. I would prefer to handle the case where the file did
not exist, if that is an important special case, as part of handling the
failure from open.
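
One way this could look, as a rough sketch: let open fail, then check whether the reason was a missing file. The helper name open_if_present and the path are hypothetical; %! is Perl's standard errno hash.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a filehandle, or nothing if the file is absent.
sub open_if_present {
    my ($file_name) = @_;
    my $fh;
    return $fh if open $fh, '<', $file_name;
    return if $!{ENOENT};                        # no such file: do nothing
    die "Unable to open file $file_name: $!";    # any other failure: complain
}

# Assumed not to exist on the machine running this example.
my $fh = open_if_present('/nonexistent-dir-for-example/app.log');
print defined $fh ? "opened\n" : "skipped\n";
```

This keeps the three outcomes (missing, unreadable, readable) without a separate -e test, and avoids the gap between testing and opening.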

On the other hand, in the OP's code, if the file did not exist, the
program did not convey this information to the user. Given that this
might be one of the most common ways an open might fail, it would have been
better to 'tell' the user why open failed and be done with it.
(the OP needs neither of course)


This illustrates precisely why I don't like the doc's treatment
of these two modifiers. I'm sure the docs do it that way for
mnemonic reasons.

But it falsely implies that they are mutually exclusive.

Yeah, as I said, I don't know what I was thinking when I wrote that
part. Thanks for the correction.

Sinan
 
Tad McClellan

John W. Krahn said:
A. Sinan Unur wrote:


there are no expressions for /e
to evaluate.


It is worse than that. It won't even compile, since /e is only
valid for s/// not for m//. :)
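
A small sketch that illustrates this. The match is compiled inside a string eval so that the compile-time error is trapped instead of stopping the script; the strings used are arbitrary examples.

```perl
use strict;
use warnings;

# String eval: a compile error in the quoted code sets $@ and returns undef.
my $ok = eval q{ my $hit = "abc" =~ /b/e; 1 };
print $ok ? "compiled\n" : "rejected: $@";

# The same modifier is accepted on a substitution, where the
# replacement part is evaluated as Perl code.
(my $copy = "abc") =~ s/b/uc "b"/e;
print "$copy\n";
```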
 
Tad McClellan

A. Sinan Unur said:
(e-mail address removed) (MENTAT) wrote in



This is a useless test.


Because open will fail if the file does not exist.


It is not useless.

If the file does not exist: do nothing.

If the file exists but cannot be opened: complain and exit.

If the file exists and can be opened: normal processing.

Remove the test and those semantics change.

What do you think using both the m and s options for the match above
achieves?


(the OP needs neither of course)

From perldoc perlop:

m Treat string as multiple lines.
s Treat string as single line.

Which one is it?


This illustrates precisely why I don't like the doc's treatment
of these two modifiers. I'm sure the docs do it that way for
mnemonic reasons.

But it falsely implies that they are mutually exclusive.

There are times when you might use both modifiers.

So I'd prefer to give up on the mnemonicness:


m Makes ^ and $ match begin/end of line (rather than of string)
s Makes . match a newline
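
A brief sketch of a case where both modifiers are wanted together, using a made-up string: /m so that ^ can match at the start of an inner line, and /s so that . can run across newlines to the terminator.

```perl
use strict;
use warnings;

# Illustrative text; the content is an assumption for this example.
my $text = "first\nLine 4 start\nmore\n<<<<<<<\n";

# /m lets ^ match right after an embedded newline; /s lets . cross newlines.
my ($both) = $text =~ /^(Line 4.*)<<<<<<</ms;

# With /m alone, . cannot cross the newline, so the terminator is never reached.
my ($m_only) = $text =~ /^(Line 4.*)<<<<<<</m;

print defined $both   ? "ms matched\n"      : "ms failed\n";
print defined $m_only ? "m alone matched\n" : "m alone failed\n";
```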
 
Brian McCauley

John said:
Or even better:

my $required_pattern = qr'(?:\|Line 4)';

Or even better:

my $required_pattern = qr/\|Line 4/;

When there's no benefit gained from using non-standard delimiters on
qr// it's best not to do so (IMNSHO).

Precompiling a regex with qr// implicitly has the same effect as
wrapping it in (?:...).
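
A small sketch of why that implicit grouping matters, using a made-up alternation (the words are arbitrary). If the pattern were pasted in as plain text, the anchors would bind only to the first and last alternatives.

```perl
use strict;
use warnings;

my $alt = qr/cat|dog/;

# As plain text this would read /\Acat|dog\z/, which matches "catalog".
# Because the qr// object behaves as one group, the anchors apply to
# the whole alternation instead.
my @results = map { /\A$alt\z/ ? 1 : 0 } qw(cat dog catalog);

print "@results\n";
```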
 
John W. Krahn

Brian said:
Or even better:

my $required_pattern = qr/\|Line 4/;

Then you would have to backslash the backslash character because qr//
interpolates its contents.

my $required_pattern = qr/\\|Line 4/;

When there's no benefit gained from using non-standard delimiters on
qr// it's best not to do so (IMNSHO).

I used the single quotes to avoid interpolation. :)


John
 
nobull

John said:
I used the single quotes to avoid interpolation. :)

There were no characters in that pattern that would be interpreted as
interpolation.
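
This can be checked with a quick sketch; the two test strings are made up. With either delimiter, \| is a literal pipe. Doubling the backslash, by contrast, changes the meaning to "a literal backslash, or Line 4".

```perl
use strict;
use warnings;

my $slashes = qr/\|Line 4/;     # ordinary delimiters
my $quotes  = qr'\|Line 4';     # single quotes: no variable interpolation
my $doubled = qr/\\|Line 4/;    # literal backslash OR "Line 4"

my $header = '2005-03-29 17:17:11.309|INFO|Line 4|';
my $other  = 'see Line 4 below';    # no pipe in front of "Line 4"

# 1 where the pattern matches, 0 where it does not.
my @header_hits = map { $header =~ $_ ? 1 : 0 } ($slashes, $quotes, $doubled);
my @other_hits  = map { $other  =~ $_ ? 1 : 0 } ($slashes, $quotes, $doubled);

print "header: @header_hits\n";
print "other:  @other_hits\n";
```

The first two patterns agree on both strings; only the doubled form also matches the string that has no pipe.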

If you use m'' rather than // throughout your code whenever you want a
regex but don't need interpolation then that would be perfectly
reasonable.

If you don't habitually use m'' then using qr'' seems inconsistent.
 
