Regex comparison

Les Peters · Mar 1, 2004

Hey:

I am working on a bit of code that will compare two regexen
to see if one will match a superset of the other. Trying to
take pattern-1 and transform it into a string, then try to match
it against pattern-2 is getting really hard... any better methods?

TIA,
Les

Gunnar Hjalmarsson · Mar 1, 2004

Les said:
I am working
Good.

on a bit of code that will compare two regexen to see if one will
match a superset of the other. Trying to take pattern-1 and
transform it into a string, then try to match it against pattern-2
is getting really hard...

Is it? Maybe. Why don't you show us the code you have so far, and it
will be much easier to understand how hard it is.

any better methods?

Methods for doing what? You didn't tell us what it is you are trying
to accomplish!

Les Peters · Mar 1, 2004

Gunnar said:
Good.

Yes, it is good.

Is it? Maybe. Why don't you show us the code you have so far, and it
will be much easier to understand how hard it is.

Ok...

sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
print "I $p1\n";
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*
$s1 =~ s/\\s[\+\*]?/ /g; # replace \s, \s+, \s*
$s1 =~ s/\\S[\+\*]?/-/g; # replace \S, \S+, \S*
$s1 =~ s/\.[\+\*]/-/g; # replace .+, .*
$s1 =~ s/\\\+/+/g; # replace \+
while ($s1 =~ /([^\\])\[(.)[^]]*?\]/) {
$s1 =~ s/([^\\])\[(.)[^]]*?\]/\1\2/; # replace character range
}
$s1 =~ s/\\\[(.)[^]]*?\\\]/\[\1\]/g; # replace bracketed character range

$s1 =~ s/$(.+?)\|.+?$/\1/g; # replace alternation
$s1 =~ s/([^\\])[()]/\1/g;
$s1 =~ s/\$[\[\]\($\/\.'"])/\1/g; # replace backslashed []()/.'"

print "O $s1\n";
if ($s1 =~ /$p1/) {
print "match\n";
} else {
print "NO\n";
}
print "\n";

# attempt to match s1 with p2
# if successful, conflict

}

Methods for doing what? You didn't tell us what it is you are trying
to accomplish!

I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

Les

Tad McClellan · Mar 1, 2004

Les Peters said:
Trying to
take pattern-1 and transform it into a string,

so for /a+/, you need to expand it to:

a
aa
aaa
aaaa
aaaaa
aaaaaa
aaaaaaa
aaaaaaaa
...

Let us know when you get that one finished.

Gunnar Hjalmarsson · Mar 1, 2004

Les said:
sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
print "I $p1\n";
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*
$s1 =~ s/\\s[\+\*]?/ /g; # replace \s, \s+, \s*
$s1 =~ s/\\S[\+\*]?/-/g; # replace \S, \S+, \S*
$s1 =~ s/\.[\+\*]/-/g; # replace .+, .*
$s1 =~ s/\\\+/+/g; # replace \+
while ($s1 =~ /([^\\])\[(.)[^]]*?\]/) {
$s1 =~ s/([^\\])\[(.)[^]]*?\]/\1\2/; # replace character range
}
$s1 =~ s/\\\[(.)[^]]*?\\\]/\[\1\]/g; # replace bracketed character
range

$s1 =~ s/$(.+?)\|.+?$/\1/g; # replace alternation
$s1 =~ s/([^\\])[()]/\1/g;
$s1 =~ s/\$[\[\]\($\/\.'"])/\1/g; # replace backslashed []()/.'"

print "O $s1\n";
if ($s1 =~ /$p1/) {
print "match\n";
} else {
print "NO\n";
}
print "\n";

# attempt to match s1 with p2
# if successful, conflict

}

Wow, you were really working.

Even if I'm afraid that this is somewhat above my head (not the first
time in this group), I do have a feeling that what you are trying to
do - kind of reversing the regex engine - would result in anything but
robust code.

I am trying to properly order a series of regexen so that a general
pattern will not match all the lines of a more specific pattern, as
part of a log monitoring script.

Please consider this approach:

my $string = 'abc123###def456ghi789###jkl012mno';

my $genre = qr/\w+/;
my $subre = qr/\d+/;

my (@genmatches, @submatches);

while ($string =~ /($genre)/g) {
my $part = $1;
push @submatches, $part =~ /($subre)/g;
push @genmatches, split /$subre/, $part;
}

print "@genmatches\n";
print "@submatches\n";

Ben Morrow · Mar 1, 2004

Les Peters said:
I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

This problem is not well-defined: which is the more specific of these

/[abc]/ and /[cd]/

?

Ben

Les Peters · Mar 1, 2004

Ben said:
Les Peters said:

I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

Click to expand...

This problem is not well-defined: which is the more specific of these

/[abc]/ and /[cd]/

Those two patterns would have equal specificity.

The problem lies with patterns like:

/abc/ and /abcde/

if /abc/ is reached first, it will match the lines that /abcde/ would match,
therefore, /abcde/ should be used first, then /abc/.

Here is an update to the routine (I will be collapsing some of
this after it works for a significantly complex problem set):

sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
$s1 =~ s/^\^//; # replace ^
if ($s1 =~ /\\d{(\d+)}/) {
my $patch = "0" x $1;
$s1 =~ s/\\d{(\d+)}/$patch/g;
}
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*
$s1 =~ s/\\D[\+\*]?/-/g; # replace \D, \D+, \D*
$s1 =~ s/\\s[\+\*]?/ /g; # replace \s, \s+, \s*
$s1 =~ s/\\S[\+\*]?/-/g; # replace \S, \S+, \S*
$s1 =~ s/\\w[\+\*]?/a/g; # replace \w, \w+, \w*
$s1 =~ s/\\W[\+\*]?/-/g; # replace \W, \W+, \W*
$s1 =~ s/([^\\])\+/\1/g; # replace <non-backslash>+
$s1 =~ s/([^\\])\*//g; # replace <non-backslash>+
$s1 =~ s/\\\+/+/g; # replace \+
$s1 =~ s/\\\$/\$/g; # replace \$
$s1 =~ s/\\\*/*/g; # replace \*
$s1 =~ s/\\\././g; # replace \.
$s1 =~ s/\\</</g; # replace \<
$s1 =~ s/\\>/>/g; # replace \>
$s1 =~ s/ +[\+\*]/ /g; # replace <space>+, <space>*

while ($s1 =~ /([^\\])\[(.)[^]]*?\][\+\*]?/) {
$s1 =~ s/([^\\])\[(.)[^]]*?\][\+\*]?/\1\2/; # replace character range
}

if ($s1 =~ /[^\\]\|/) {
($begin, $end) = split(/\|/,$s1);
@chars = split(//,$begin);
$sparen_count = scalar grep(/$/,@chars);
$eparen_count = scalar grep(/$/,@chars);
if ((($sparen_count - $eparen_count) % 2) == 0) {
$s1 = $begin;
}
}

while ($s1 =~ /$(.+?)\|.+?$/) {
$s1 =~ s/$(.+?)\|.+?$/\1/g; # replace alternation
}
while ($s1 =~ /(.+?)\|.+?/) {
$s1 =~ s/(.+?)\|.+?/\1/g; # replace alternation
}

$s1 =~ s/([^\\])[()]/\1/g;
$s1 =~ s/\$[\-\[\]\($\/\.'"?@#{}])/\1/g; # replace backslashed -[]()/.'"?@#{}
$s1 =~ s/\$$//; # replace $

if ($s1 !~ /$p1/) {
print "I /$p1/\n";
print "O '$s1'\n";
print "NO\n";
print "\n";
}

# attempt to match s1 with p2
# if successful, conflict

}

At the moment, the code is tripping over this pattern:
/login\[[\d]+\]: failed: ^C on /dev/ttyd\d|login\[[\d]+\]: failed: on /dev/ttyd\d|login\[[\d]+\]: Locked ^C account|login\[[\d]+\]:
Locked account/

Specifically, the first caret is giving it fits.

Les

Malcolm Dew-Jones · Mar 1, 2004

Les Peters ([email protected]) wrote:
: Hey:

: I am working on a bit of code that will compare two regexen
: to see if one will match a superset of the other. Trying to
: take pattern-1 and transform it into a string, then try to match
: it against pattern-2 is getting really hard... any better methods?

You build the directed graph that represents the first regular expression,
and then you build the directed graph that represents the second regular
expression.

Then you check if one graph is a subset of the other.

However, I have no suggestion on the perl code to do this, though it
sounds like an interesting problem.

Ben Morrow · Mar 1, 2004

Les Peters said:
Ben said:

Les Peters said:

I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

Click to expand...

This problem is not well-defined: which is the more specific of these

/[abc]/ and /[cd]/

Click to expand...

Those two patterns would have equal specificity.

The problem lies with patterns like:

/abc/ and /abcde/

if /abc/ is reached first, it will match the lines that /abcde/ would match,
therefore, /abcde/ should be used first, then /abc/.

I realise this. What you are attempting to do is make the choice of
pattern for a given string independant of the order of patterns in the
file, right? Consider the two patterns I gave you, and the string "c".
It matches both. Which should be chosen in this case?

Here is an update to the routine (I will be collapsing some of
this after it works for a significantly complex problem set):

sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
$s1 =~ s/^\^//; # replace ^

^ does not necessarily have to occur at the beginning of a pattern.
Consider /a|^b/.

if ($s1 =~ /\\d{(\d+)}/) {
my $patch = "0" x $1;
$s1 =~ s/\\d{(\d+)}/$patch/g;
}
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*

This is wrong. /\d/ is more specific than /\d+/ is more specific than
/\d*/, by any reasonable definition of 'specific'.

$s1 =~ s/\\\+/+/g; # replace \+
$s1 =~ s/\\\$/\$/g; # replace \$
$s1 =~ s/\\\*/*/g; # replace \*
$s1 =~ s/\\\././g; # replace \.
$s1 =~ s/\\</</g; # replace \<
$s1 =~ s/\\>/>/g; # replace \>

These are also wrong: you need to replace all \X with X unless \X is
significant in regexen. Note also that < and > are not special in a

regex[1] said:
At the moment, the code is tripping over this pattern:
/login\[[\d]+\]: failed: ^C on /dev/ttyd\d|login\[[\d]+\]: failed: on
/dev/ttyd\d|login\[[\d]+\]: Locked ^C account|login\[[\d]+\]: Locked
account/

Specifically, the first caret is giving it fits.

This isn't much of a problem, as this pattern won't match anything at
all: indeed, it isn't even valid Perl. The ^s and /s need escaping.

I think, as I said before, that your problem is either unsolvable or
*extremely* difficult. I would go with a heuristic approach: first,
split your pattern into its top-level alternatives: you will be best off
using Text::Balanced for this. Then split each alternative into atoms,
an atom being

1. a contiguous literal string, such as "login\[" at the start of the
pattern above; or

2. a zero-width metachar, such as ^ or $ or \b; or

3. a repeat, such as +, *, +? or {1,2}; or

4. a bracketed sub-expression.

Now calculate the 'complexity' of each alternative by some set of rules
like:

1. a literal or a zero-width metachar has a complexity of 1.

2. a bracketed sub-expression has the complexity of the pattern inside
it.

3. an atom with a {} or ? repeat count has its complexity doubled; with
+ or +?, trebled; with * or *?, quadrupled. This is obviously entirely
arbitrary, and may well need adjusting; but this order should be
preserved.

The complexity of the pattern as a whole is then the maximum complexity
of its alternatives (or some such: maybe the sum of the complexities
would be better?). Order the patterns by complexity.

This approach fails in many obvious ways, but may provide a reasonable
guess for sensibly written expressions.

Ben

[1] yet...

Ben Morrow · Mar 1, 2004

Les Peters ([email protected]) wrote:
: Hey:

: I am working on a bit of code that will compare two regexen
: to see if one will match a superset of the other. Trying to
: take pattern-1 and transform it into a string, then try to match
: it against pattern-2 is getting really hard... any better methods?

You build the directed graph that represents the first regular expression,
and then you build the directed graph that represents the second regular
expression.

Then you check if one graph is a subset of the other.

However, I have no suggestion on the perl code to do this, though it
sounds like an interesting problem.

I would start by writing a bit of XS to get at the compiled representation of
the pattern.

Ben

Les Peters · Mar 2, 2004

Ben said:
I realise this. What you are attempting to do is make the choice of
pattern for a given string independant of the order of patterns in the
file, right?
Right.

Consider the two patterns I gave you, and the string "c".
It matches both. Which should be chosen in this case?

I would probably say that /[abc]/ can match more things, and therefore is
more general than /[cd]/, so the order would be /[cd]/, /[abc]/.
I had misread your previous post, and was thinking of /abc/ and /de/.

^ does not necessarily have to occur at the beginning of a pattern.
Consider /a|^b/.

This pattern would have to be dealt with, obviously, as an alternation.

This is wrong. /\d/ is more specific than /\d+/ is more specific than
/\d*/, by any reasonable definition of 'specific'.

I agree. The same would apply to \D, \s, \S, etc.

These are also wrong: you need to replace all \X with X unless \X is
significant in regexen. Note also that < and > are not special in a
regex[1], so there's no need to write them as \< and \>.

I agree with you here as well, but I am working from an existing set
of patterns, and this unnecessary backslashing occurs there, so I am
managing it in this way.

I think, as I said before, that your problem is either unsolvable or
*extremely* difficult. I would go with a heuristic approach: first,
split your pattern into its top-level alternatives: you will be best off
using Text::Balanced for this. Then split each alternative into atoms,
an atom being

I am getting a better understanding of the scope of the problem: it's
huge! I am effectively reverse-engineering the regex engine, and only
an approach that takes that into consideration will have any chance of
a moderate level of success.

Thank you for your assistance,
Les

Bart Lateur · Mar 3, 2004

Ben said:
I would start by writing a bit of XS to get at the compiled representation of
the pattern.

Heh? Why that? I'd start by looking at YAPE::Regex (on CPAN) instead. A
demo for use of the module can be found in another module
YAPE::Regex::Explain.

SQL Connection string regex pattern to parse sections	1	May 9, 2024
My regex kung-fu is not strong =(	0	Apr 4, 2020
Picture Comparison Code Not Working Properly	1	Jul 24, 2021
Image overlay and comparison code error.	2	Jul 1, 2021
regex question	7	Jun 20, 2013
Clickable link conversion regex?	0	Nov 30, 2012
Seeking co-founders for my company.	3	Sep 8, 2024
Regex: match double OR single quote	4	Jul 12, 2012

Regex comparison

Les Peters

Gunnar Hjalmarsson

Les Peters

Tad McClellan

Gunnar Hjalmarsson

Ben Morrow

Les Peters

Malcolm Dew-Jones

Ben Morrow

Ben Morrow

Les Peters

Bart Lateur

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads