Regexp - alternate match and grouping

Witold Rugowski

Hi!

I need to do some grouping in a regexp, but the data can come in different formats. I'm gathering data from syslog servers and want to extract the client hostname (from FreeBSD syslog) or the client's IP (from Webtrends syslog).

The first ones look like:
Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011 [cut]

And from Webtrends:
WTsyslog[2006-02-26 23:59:59 ip=IP_ADDRESS pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016: [cut]

Currently I'm matching it with:
/(?:([\w\d\-\_\.]*) |ip=([.\d]*).*?)(\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2})[\w\d\-\_\.: ]*?%PIX[and more]/

But this means that either $1 or $2 is defined, depending on the input format. Is there a better way to do it? Better for me means that $1 is always the HOSTNAME or IP address and $2 is always the date...
 
Paul Lalli

Witold said:
I need to do some grouping in a regexp, but the data can come in different formats. I'm gathering data from syslog servers and want to extract the client hostname (from FreeBSD syslog) or the client's IP (from Webtrends syslog).

The first ones look like:
Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011 [cut]

And from Webtrends:
WTsyslog[2006-02-26 23:59:59 ip=IP_ADDRESS pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016: [cut]

Currently I'm matching it with:
/(?:([\w\d\-\_\.]*) |ip=([.\d]*).*?)(\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2})[\w\d\-\_\.: ]*?%PIX[and more]/

Yuck yuck yuck. Use the /x modifier to increase the readability of
your regexp.
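To illustrate the /x suggestion, here is a rough sketch of the header part of the pattern spread over several lines. It stops at %PIX because the rest of the original pattern was not posted. The names $header_re and parse_header are made up for this example, and the sample line uses HOSTNAME as a stand-in for a real host.

```perl
use strict;
use warnings;

# The header part of the original pattern, spread out with the x modifier.
# Under x, a literal space has to be written as [ ] or as an escaped space.
my $header_re = qr/
    (?:
        ( [\w.-]* ) [ ]       # group 1: hostname, FreeBSD syslog form
      |
        ip= ( [.\d]* ) .*?    # group 2: IP address, Webtrends form
    )
    ( \w{3} [ ] \d{2} [ ] \d{4} [ ] \d{2}:\d{2}:\d{2} )   # group 3: date
    [-\w.:\x20]*?             # filler between the date and the message
    %PIX                      # start of the PIX message, rest not shown
/x;

# Hypothetical helper: returns the three groups, or an empty list on no match.
sub parse_header {
    my ($line) = @_;
    return unless $line =~ $header_re;
    return ($1, $2, $3);
}

my ($host, $ip, $date) = parse_header(
    'Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011');
my $client = defined $host ? $host : $ip;
print "$client at $date\n";
```

The comments inside the pattern avoid the slash, dollar and at signs on purpose: a slash would end the qr// early, and the other two would be interpolated.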

Witold said:
But this means that either $1 or $2 is defined, depending on the input format. Is there a better way to do it? Better for me means that $1 is always the HOSTNAME or IP address and $2 is always the date...

Why? Why are you feeling the need to make this one massive regexp?
You're matching two completely different formats. It makes no sense
that one regexp should be able to handle both.

[untested]

my $date_pat = qr/[a-z]{3} \d{2} \d{2}:\d{2}:\d{2}/;
my ($host_or_ip, $date);
if (/^([\w.-]+) ($date_pat)/) {
    # hostname form (FreeBSD syslog)
    ($host_or_ip, $date) = ($1, $2);
} elsif (/ip=([.\d]+).*?($date_pat)/) {
    # ip= form (Webtrends syslog)
    ($host_or_ip, $date) = ($1, $2);
} else {
    die "Unknown format in log file";
}


Paul Lalli
 
Witold Rugowski

Paul said:
Why? Why are you feeling the need to make this one massive regexp?
You're matching two completely different formats. It makes no sense
that one regexp should be able to handle both.

Because currently I need to handle only two formats, but there may be more formats in the future. Creating new branches IMO makes the code harder to maintain and develop further...
 
Paul Lalli

Witold said:
Because currently I need to handle only two formats, but there may be more formats in the future. Creating new branches IMO makes the code harder to maintain and develop further...

As opposed to Frankenstein-ing your one single regular expression even
further? You have an odd view of what kind of code is difficult to
maintain, IMHO.

Paul Lalli
 
Paul Lalli

Paul said:
Witold said:
Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011 [cut]

WTsyslog[2006-02-26 23:59:59 ip=IP_ADDRESS pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016: [cut]

my $date_pat = qr/[a-z]{3} \d{2} \d{2}:\d{2}:\d{2}/;

Should have the /i modifier there.

Paul said:
my ($host_or_ip, $date);
if (/^([\w.-]+) ($date_pat)/){

Should not have the ^ anchor there.

Apologies for those errors.
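
With both corrections folded in, the two-branch version might look like the sketch below. It is wrapped in a made-up helper, host_and_date, so it can be tried on its own. One further assumption goes beyond the corrections: the date pattern here also includes the four-digit year that appears in both sample lines, and the IP address is a placeholder.

```perl
use strict;
use warnings;

# Assumption: the date to extract includes the four-digit year seen in both samples.
my $date_pat = qr/[a-z]{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}/i;

# Hypothetical helper: returns (host_or_ip, date), or an empty list
# if neither form matches.
sub host_and_date {
    my ($line) = @_;
    if ($line =~ /([\w.-]+) ($date_pat)/) {
        # hostname directly in front of the date (FreeBSD syslog)
        return ($1, $2);
    }
    elsif ($line =~ /ip=([.\d]+).*?($date_pat)/) {
        # ip= field somewhere before the date (Webtrends syslog)
        return ($1, $2);
    }
    return;
}

for my $line (
    'Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011',
    'WTsyslog[2006-02-26 23:59:59 ip=192.0.2.10 pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016:',
) {
    my ($host_or_ip, $date) = host_and_date($line)
        or die "Unknown format in log file";
    print "$host_or_ip | $date\n";
}
```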

Paul Lalli
 
Witold Rugowski

Paul said:
As opposed to Frankenstein-ing your one single regular expression even
further?

It won't be such a monster ;-)) since the format differs only in the header, so most of the information is captured by the part of the regexp I didn't show (the PIX syslog messages).

This case is probably easier to handle with several regexps, as you suggest (the file format is constant, so an if can recognize the right format at the beginning, and then everything else can be done without additional ifs). However, I can imagine a data format for which such alternate grouping would be very convenient.

Is this possible?
 
Paul Lalli

Witold said:
This case is probably easier to handle with several regexps, as you suggest (the file format is
constant, so an if can recognize the right format at the beginning, and then everything else can
be done without additional ifs). However, I can imagine a data format for which such alternate
grouping would be very convenient.
Is this possible?

The only thing that comes to mind immediately is some fiddling with the
$+ variable, which you can read about in
perldoc perlvar
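
As a rough illustration of what that fiddling could look like: $+ holds the text of the highest-numbered group that took part in the last successful match, so it can stand in for "whichever alternative matched". The helper name client_id is made up, and the date pattern with a year is an assumption based on the sample lines.

```perl
use strict;
use warnings;

# Assumption: same date shape as in the sample lines, including the year.
my $date_pat = qr/[a-z]{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}/i;

# Hypothetical helper: returns the client hostname or IP, whichever
# alternative matched, or an empty list if neither did.
sub client_id {
    my ($line) = @_;
    # Group 1 is the IP form, group 2 the hostname form;
    # only one of them takes part in any given match.
    return unless $line =~ /ip=([.\d]+)|([\w.-]+) $date_pat/;
    return $+;    # text of the highest-numbered group that matched
}

print client_id('Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011'), "\n";
print client_id('WTsyslog[2006-02-26 23:59:59 ip=192.0.2.10 pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016:'), "\n";
```

This only works while the alternated groups are the last ones in the pattern. If a date group followed them, $+ would return the date instead, so the date is left to a separate match here.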

Paul Lalli
 
Anno Siegel

Witold Rugowski said:
It won't be such a monster ;-)) since the format differs only in the
header, so most of the information is captured by the part of the regexp
I didn't show (the PIX syslog messages).

This case is probably easier to handle with several regexps, as you
suggest (the file format is constant, so an if can recognize the right
format at the beginning, and then everything else can be done without
additional ifs). However, I can imagine a data format for which such
alternate grouping would be very convenient.

Is this possible?

If the corresponding parts are in the same sequence in all cases,
maybe. But, if I remember your data right, they aren't. Since
parentheses also count from left to right, you can't capture a
late match in an early capture, so no.
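
A small sketch of the numbering point, using made-up input strings and a made-up helper named captures: groups are numbered by their opening parenthesis, so each alternative keeps its own slot and the slot of the alternative that did not match stays undefined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every capture slot of the match, in group order.
sub captures {
    my ($text) = @_;
    return $text =~ /name=(\w+)|id=(\d+)/;
}

for my $text ('name=alice', 'id=42') {
    # Show undefined slots explicitly instead of as empty strings.
    my @slots = map { defined($_) ? $_ : 'undef' } captures($text);
    print "$text => (@slots)\n";
}
```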

Anyway, when using regular expressions, your first interest (after
making them work) is to keep them manageable, which means keeping them
small. It is just a bad idea to combine regexes without need.

I'd go along these lines: Build a regex for each case that matches
the specific case, and let it capture the relevant parts whichever
way. Then assign the captures to named variables that are listed
in the sequence the specific pattern of captures needs. Here is
an example using simplified data:

while ( <DATA> ) {
    my( $animal, $fruit);
    ( $animal, $fruit) = /^case1\s+(\w+)\s+(\w+)/ or
    ( $fruit, $animal) = /^case2\s+(\w+)\s+(\w+)/ or next;
    print "animal: $animal, fruit: $fruit\n";
}

__DATA__
case1 horse orange
case2 apple cat
case2 banana snake
case1 cow tomato

That way the regexes for different cases stay apart, and you get to
maintain the regex and the corresponding sequence of variables on a
single line.
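
If more formats do turn up later, one possible way to carry the same idea further is to keep each regex next to its field order in a table, so that a new format is one more entry rather than one more branch. This is only a sketch on the same simplified data; @formats and parse_line are hypothetical names.

```perl
use strict;
use warnings;

# Each entry: a pattern plus the field names in the order its groups capture them.
my @formats = (
    [ qr/^case1\s+(\w+)\s+(\w+)/, [qw(animal fruit)] ],
    [ qr/^case2\s+(\w+)\s+(\w+)/, [qw(fruit animal)] ],
);

# Hypothetical helper: returns a hash reference of named fields,
# or nothing if no known format matches the line.
sub parse_line {
    my ($line) = @_;
    for my $format (@formats) {
        my ($re, $names) = @$format;
        my @caps = $line =~ $re or next;   # try the next format on failure
        my %rec;
        @rec{@$names} = @caps;             # pair each capture with its name
        return \%rec;
    }
    return;
}

for my $line ('case1 horse orange', 'case2 apple cat') {
    my $rec = parse_line($line) or next;
    print "animal: $rec->{animal}, fruit: $rec->{fruit}\n";
}
```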

Anno
 
