Regexp - alternate match and grouping

Witold Rugowski · Feb 28, 2006

Hi!

I need to do some grouping in regexps, but the data can come in different formats. I'm trying to gather some data from syslog servers, extracting either the client hostname (from FreeBSD syslog) or the client's IP (from Webtrends syslog).

The first ones look like:
Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011 [cut]

And from Webtrends:
WTsyslog[2006-02-26 23:59:59 ip=IP_ADDRESS pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016: [cut]

Currently I'm matching it with:
/(?:([\w\d\-\_\.]*) |ip=([.\d]*).*?)(\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2})[\w\d\-\_\.: ]*?%PIX[and more]/

But this means that either $1 or $2 is defined, depending on the input data format. Is there a better way to do it? Better for me means that $1 is always the HOSTNAME or IP address and $2 is always the date...

Paul Lalli · Feb 28, 2006

Witold said:
I need to do some grouping in regexps, but the data can come in different formats. I'm trying to gather some data from syslog servers, extracting either the client hostname (from FreeBSD syslog) or the client's IP (from Webtrends syslog).

The first ones look like:
Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011 [cut]

And from Webtrends:
WTsyslog[2006-02-26 23:59:59 ip=IP_ADDRESS pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016: [cut]

Currently I'm matching it with:
/(?:([\w\d\-\_\.]*) |ip=([.\d]*).*?)(\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2})[\w\d\-\_\.: ]*?%PIX[and more]/

Yuck yuck yuck. Use the /x modifier to increase the readability of
your regexp.

But this means that either $1 or $2 is defined, depending on the input data format. Is there a better way to do it? Better for me means that $1 is always the HOSTNAME or IP address and $2 is always the date...

Why? Why are you feeling the need to make this one massive regexp?
You're matching two completely different formats. It makes no sense
that one regexp should be able to handle both.

[untested]

my $date_pat = qr/[a-z]{3} \d{2} \d{2}:\d{2}:\d{2}/;
my ($host_or_ip, $date);
if (/^([\w.-]+) ($date_pat)/){
    ($host_or_ip, $date) = ($1, $2);
} elsif (/ip=([.\d]+).*?($date_pat)/){
    ($host_or_ip, $date) = ($1, $2);
} else {
    die "Unknown format in log file";
}

Paul Lalli

Witold Rugowski · Feb 28, 2006

Paul said:
Why? Why are you feeling the need to make this one massive regexp?
You're matching two completely different formats. It makes no sense
that one regexp should be able to handle both.

Because currently I need to handle only two formats, but there may be more formats in the future. Creating new branches IMO makes it harder to maintain and develop further...

Paul Lalli · Feb 28, 2006

Witold said:
Because currently I need to handle only two formats, but there may be more formats in the future. Creating new branches IMO makes it harder to maintain and develop further...

As opposed to Frankenstein-ing your one single regular expression even
further? You have an odd view of what kind of code is difficult to
maintain, IMHO.

Paul Lalli
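
One way to keep a separate pattern per format without writing a new branch for each one is a list of patterns tried in order. The following is only a sketch under assumptions: the names `@formats` and `parse_header`, the format labels, and the sample lines (host `fw1.example.com`, address `192.0.2.10`) are made up for illustration, and the date pattern assumes the "Mon DD YYYY HH:MM:SS" form seen in both sample lines.

```perl
use strict;
use warnings;

# Assumed date layout, e.g. "Feb 28 2006 01:00:00".
my $date_pat = qr/[a-z]{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}/i;

# One entry per header format. Every pattern captures the host or IP
# first and the date second, so callers never care which one matched.
my @formats = (
    [ freebsd   => qr/([\w.-]+) ($date_pat)/ ],
    [ webtrends => qr/ip=([\d.]+).*?($date_pat)/ ],
);

# Returns (format_name, host_or_ip, date), or an empty list if no
# pattern matches the line.
sub parse_header {
    my ($line) = @_;
    for my $format (@formats) {
        my ($name, $re) = @$format;
        if (my ($host_or_ip, $date) = $line =~ $re) {
            return ($name, $host_or_ip, $date);
        }
    }
    return;
}

# Made-up sample lines in the two formats from the thread.
my @lines = (
    'Feb 28 00:00:00 fw1.example.com Feb 28 2006 01:00:00 fw1.example.com : %PIX-6-305011',
    'WTsyslog[2006-02-26 23:59:59 ip=192.0.2.10 pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016:',
);

for my $line (@lines) {
    my ($name, $host_or_ip, $date) = parse_header($line)
        or next;
    print "$name: $host_or_ip | $date\n";
}
```

Supporting a further format then means adding one entry to `@formats`; the rest of the code stays unchanged.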

Paul Lalli · Feb 28, 2006

Paul said:
Witold said:

Feb 28 00:00:00 HOSTNAME Feb 28 2006 01:00:00 HOSTNAME : %PIX-6-305011 [cut]

WTsyslog[2006-02-26 23:59:59 ip=IP_ADDRESS pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016: [cut]

my $date_pat = qr/[a-z]{3} \d{2} \d{2}:\d{2}:\d{2}/;

Should have the /i modifier there.

my ($host_or_ip, $date);
if (/^([\w.-]+) ($date_pat)/){

Should not have the ^ anchor there.

Apologies for those errors.

Paul Lalli
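
Putting both corrections together gives roughly the following. This is a sketch, not the code as posted: it wraps the branches in a hypothetical `parse_header` sub, and it adds `\d{4}` to the date pattern on the assumption that the date to capture is the "Feb 28 2006 01:00:00" form that carries the year. The sample lines use a made-up host and address.

```perl
use strict;
use warnings;

# /i as per the correction; \d{4} for the year is an added assumption.
my $date_pat = qr/[a-z]{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}/i;

# Returns (host_or_ip, date), or an empty list for an unknown format.
sub parse_header {
    my ($line) = @_;
    if ($line =~ /([\w.-]+) ($date_pat)/) {          # no ^ anchor
        return ($1, $2);
    }
    elsif ($line =~ /ip=([.\d]+).*?($date_pat)/) {
        return ($1, $2);
    }
    return;
}

# Made-up sample lines in the two formats from the thread.
my @lines = (
    'Feb 28 00:00:00 fw1.example.com Feb 28 2006 01:00:00 fw1.example.com : %PIX-6-305011',
    'WTsyslog[2006-02-26 23:59:59 ip=192.0.2.10 pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016:',
);

for my $line (@lines) {
    my ($host_or_ip, $date) = parse_header($line)
        or die "Unknown format in log file";
    print "$host_or_ip | $date\n";
}
```

Without the anchor, the first pattern skips the leading syslog timestamp and matches at the hostname that directly precedes the dated part of the line.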

Witold Rugowski · Feb 28, 2006

Paul said:
As opposed to Frankenstein-ing your one single regular expression even

It won't be such a monster ;-)) since the format changes only in the header, so most of the information is captured by the part of the regexp not shown (PIX syslog messages).

This case is probably easier to do with several regexps, as you suggest (the file format is constant, so the ifs can recognize the right format at the beginning, and then everything can be done without additional ifs). However, I can imagine a data format for which such alternate grouping would be very convenient.

Is this possible?

Paul Lalli · Feb 28, 2006

Witold said:
This case is probably easier to do with several regexps, as you suggest (the file format
is constant, so the ifs can recognize the right format at the beginning, and then everything
can be done without additional ifs). However, I can imagine a data format for which such
alternate grouping would be very convenient.

Is this possible?

The only thing that comes to mind immediately is some fiddling with the
$+ variable, which you can read about in
perldoc perlop

Paul Lalli
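
As a rough illustration of the `$+` idea: `$+` holds the text of the highest-numbered capture group that took part in the last successful match, so with two alternatives that each capture once, it returns whichever one matched. The sub name `host_or_ip`, the lookahead used to find the hostname, and the sample lines are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Returns the client hostname or IP address, whichever alternative
# matched, or an empty list if neither did.
sub host_or_ip {
    my ($line) = @_;
    # Group 1: a hostname followed by a "Mon DD YYYY " date.
    # Group 2: the address after "ip=". Only one of them gets set.
    if ($line =~ /([\w.-]+) (?=[A-Za-z]{3} \d{2} \d{4} )|ip=([\d.]+)/) {
        return $+;
    }
    return;
}

# Made-up sample lines in the two formats from the thread.
my @lines = (
    'Feb 28 00:00:00 fw1.example.com Feb 28 2006 01:00:00 fw1.example.com : %PIX-6-305011',
    'WTsyslog[2006-02-26 23:59:59 ip=192.0.2.10 pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016:',
);

for my $line (@lines) {
    my $client = host_or_ip($line);
    print defined $client ? "$client\n" : "no match\n";
}
```

The limitation is that `$+` only helps for the last thing captured. If the date were captured by a third group in the same pattern, `$+` would return the date, so the date has to be matched separately.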

DJ Stunks · Feb 28, 2006

Paul said:
The only thing that comes to mind immediately is some fiddling with the
$+ variable, which you can read about in
perldoc perlop

perlvar?

-jp

Anno Siegel · Feb 28, 2006

Witold Rugowski said:
It won't be such a monster ;-)) since the format changes only in the
header, so most of the information is captured by the part of the regexp
not shown (PIX syslog messages).

This case is probably easier to do with several regexps, as you suggest
(the file format is constant, so the ifs can recognize the right format
at the beginning, and then everything can be done without additional
ifs). However, I can imagine a data format for which such alternate
grouping would be very convenient.

Is this possible?

If the corresponding parts are in the same sequence in all cases,
maybe. But, if I remember your data right, they aren't. Since
parentheses also count from left to right, you can't capture a
late match in an early capture, so no.

Anyway, when using regular expressions, your first interest (after
making them work) is to keep them manageable, which means keeping them
small. It is just a bad idea to combine regexes without need.

I'd go along these lines: Build a regex for each case that matches
the specific case, and let it capture the relevant parts whichever
way. Then assign the captures to named variables that are listed
in the sequence the specific pattern of captures needs. Here is
an example using simplified data:

while ( <DATA> ) {
    my( $animal, $fruit);
    ( $animal, $fruit) = /^case1\s+(\w+)\s+(\w+)/ or
    ( $fruit, $animal) = /^case2\s+(\w+)\s+(\w+)/ or next;
    print "animal: $animal, fruit: $fruit\n";
}

__DATA__
case1 horse orange
case2 apple cat
case2 banana snake
case1 cow tomato

That way the regexes for different cases stay apart, and you get to
maintain the regex and the corresponding sequence of variables on a
single line.

Anno
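
For readers on Perl 5.10 or later (released after this thread), the branch reset group `(?|...)` covers the original question when the parts come in the same order: every alternative inside it numbers its captures from the same starting point, so the hostname and the IP address both land in $1 and the date that follows is always $2. The sketch below assumes the "Mon DD YYYY HH:MM:SS" date form from the sample lines; the names `$header_re` and `parse_header` and the sample host and address are made up.

```perl
use strict;
use warnings;
use 5.010;    # branch reset groups need Perl 5.10 or later

# Assumed date layout, e.g. "Feb 28 2006 01:00:00".
my $date_pat = qr/[a-z]{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}/i;

my $header_re = qr/
    (?|                      # branch reset: both alternatives fill group 1
        ([\w.-]+) [ ]        # FreeBSD syslog: hostname, then a space
      |
        ip=([\d.]+) .*?      # Webtrends: address after ip=, then filler
    )
    ($date_pat)              # date is group 2 whichever branch matched
/x;

# Returns (host_or_ip, date), or an empty list if the line does not match.
sub parse_header {
    my ($line) = @_;
    return $line =~ $header_re ? ($1, $2) : ();
}

# Made-up sample lines in the two formats from the thread.
my @lines = (
    'Feb 28 00:00:00 fw1.example.com Feb 28 2006 01:00:00 fw1.example.com : %PIX-6-305011',
    'WTsyslog[2006-02-26 23:59:59 ip=192.0.2.10 pri=6] <14>Feb 26 2006 23:59:59: %PIX-6-302016:',
);

for my $line (@lines) {
    my ($host_or_ip, $date) = parse_header($line) or next;
    print "$host_or_ip | $date\n";
}
```

This does not lift the left-to-right restriction described above: it only helps when the corresponding parts appear in the same sequence in every format.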

Paul Lalli · Feb 28, 2006

DJ said:
perlvar?

Whoops. Yes, thank you for that correction.

Paul Lalli
