Regex comparison

L

Les Peters

Hey:

I am working on a bit of code that will compare two regexen
to see if one will match a superset of the other. Trying to
take pattern-1 and transform it into a string, then try to match
it against pattern-2 is getting really hard... any better methods?

TIA,
Les
 
G

Gunnar Hjalmarsson

Les said:
I am working
Good.

on a bit of code that will compare two regexen to see if one will
match a superset of the other. Trying to take pattern-1 and
transform it into a string, then try to match it against pattern-2
is getting really hard...

Is it? Maybe. Why don't you show us the code you have so far, and it
will be much easier to understand how hard it is.
any better methods?

Methods for doing what? You didn't tell us what it is you are trying
to accomplish!
 
L

Les Peters

Gunnar said:

Yes, it is good.
Is it? Maybe. Why don't you show us the code you have so far, and it
will be much easier to understand how hard it is.

Ok...

sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
print "I $p1\n";
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*
$s1 =~ s/\\s[\+\*]?/ /g; # replace \s, \s+, \s*
$s1 =~ s/\\S[\+\*]?/-/g; # replace \S, \S+, \S*
$s1 =~ s/\.[\+\*]/-/g; # replace .+, .*
$s1 =~ s/\\\+/+/g; # replace \+
while ($s1 =~ /([^\\])\[(.)[^]]*?\]/) {
$s1 =~ s/([^\\])\[(.)[^]]*?\]/\1\2/; # replace character range
}
$s1 =~ s/\\\[(.)[^]]*?\\\]/\[\1\]/g; # replace bracketed character range

$s1 =~ s/\((.+?)\|.+?\)/\1/g; # replace alternation
$s1 =~ s/([^\\])[()]/\1/g;
$s1 =~ s/\\([\[\]\(\)\/\.'"])/\1/g; # replace backslashed []()/.'"

print "O $s1\n";
if ($s1 =~ /$p1/) {
print "match\n";
} else {
print "NO\n";
}
print "\n";

# attempt to match s1 with p2
# if successful, conflict

}

Methods for doing what? You didn't tell us what it is you are trying
to accomplish!

I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

Les
 
T

Tad McClellan

Les Peters said:
Trying to
take pattern-1 and transform it into a string,


so for /a+/, you need to expand it to:

a
aa
aaa
aaaa
aaaaa
aaaaaa
aaaaaaa
aaaaaaaa
...


Let us know when you get that one finished. :)
 
G

Gunnar Hjalmarsson

Les said:
sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
print "I $p1\n";
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*
$s1 =~ s/\\s[\+\*]?/ /g; # replace \s, \s+, \s*
$s1 =~ s/\\S[\+\*]?/-/g; # replace \S, \S+, \S*
$s1 =~ s/\.[\+\*]/-/g; # replace .+, .*
$s1 =~ s/\\\+/+/g; # replace \+
while ($s1 =~ /([^\\])\[(.)[^]]*?\]/) {
$s1 =~ s/([^\\])\[(.)[^]]*?\]/\1\2/; # replace character range
}
$s1 =~ s/\\\[(.)[^]]*?\\\]/\[\1\]/g; # replace bracketed character
range

$s1 =~ s/\((.+?)\|.+?\)/\1/g; # replace alternation
$s1 =~ s/([^\\])[()]/\1/g;
$s1 =~ s/\\([\[\]\(\)\/\.'"])/\1/g; # replace backslashed []()/.'"

print "O $s1\n";
if ($s1 =~ /$p1/) {
print "match\n";
} else {
print "NO\n";
}
print "\n";

# attempt to match s1 with p2
# if successful, conflict

}

Wow, you were really working. :)

Even if I'm afraid that this is somewhat above my head (not the first
time in this group), I do have a feeling that what you are trying to
do - kind of reversing the regex engine - would result in anything but
robust code.
I am trying to properly order a series of regexen so that a general
pattern will not match all the lines of a more specific pattern, as
part of a log monitoring script.

Please consider this approach:

my $string = 'abc123###def456ghi789###jkl012mno';

my $genre = qr/\w+/;
my $subre = qr/\d+/;

my (@genmatches, @submatches);

while ($string =~ /($genre)/g) {
my $part = $1;
push @submatches, $part =~ /($subre)/g;
push @genmatches, split /$subre/, $part;
}

print "@genmatches\n";
print "@submatches\n";
 
B

Ben Morrow

Les Peters said:
I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

This problem is not well-defined: which is the more specific of these

/[abc]/ and /[cd]/

?

Ben
 
L

Les Peters

Ben said:
Les Peters said:
I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.


This problem is not well-defined: which is the more specific of these

/[abc]/ and /[cd]/

Those two patterns would have equal specificity.

The problem lies with patterns like:

/abc/ and /abcde/

if /abc/ is reached first, it will match the lines that /abcde/ would match,
therefore, /abcde/ should be used first, then /abc/.

Here is an update to the routine (I will be collapsing some of
this after it works for a significantly complex problem set):

sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
$s1 =~ s/^\^//; # replace ^
if ($s1 =~ /\\d{(\d+)}/) {
my $patch = "0" x $1;
$s1 =~ s/\\d{(\d+)}/$patch/g;
}
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*
$s1 =~ s/\\D[\+\*]?/-/g; # replace \D, \D+, \D*
$s1 =~ s/\\s[\+\*]?/ /g; # replace \s, \s+, \s*
$s1 =~ s/\\S[\+\*]?/-/g; # replace \S, \S+, \S*
$s1 =~ s/\\w[\+\*]?/a/g; # replace \w, \w+, \w*
$s1 =~ s/\\W[\+\*]?/-/g; # replace \W, \W+, \W*
$s1 =~ s/([^\\])\+/\1/g; # replace <non-backslash>+
$s1 =~ s/([^\\])\*//g; # replace <non-backslash>+
$s1 =~ s/\\\+/+/g; # replace \+
$s1 =~ s/\\\$/\$/g; # replace \$
$s1 =~ s/\\\*/*/g; # replace \*
$s1 =~ s/\\\././g; # replace \.
$s1 =~ s/\\</</g; # replace \<
$s1 =~ s/\\>/>/g; # replace \>
$s1 =~ s/ +[\+\*]/ /g; # replace <space>+, <space>*

while ($s1 =~ /([^\\])\[(.)[^]]*?\][\+\*]?/) {
$s1 =~ s/([^\\])\[(.)[^]]*?\][\+\*]?/\1\2/; # replace character range
}

if ($s1 =~ /[^\\]\|/) {
($begin, $end) = split(/\|/,$s1);
@chars = split(//,$begin);
$sparen_count = scalar grep(/\(/,@chars);
$eparen_count = scalar grep(/\)/,@chars);
if ((($sparen_count - $eparen_count) % 2) == 0) {
$s1 = $begin;
}
}

while ($s1 =~ /\((.+?)\|.+?\)/) {
$s1 =~ s/\((.+?)\|.+?\)/\1/g; # replace alternation
}
while ($s1 =~ /(.+?)\|.+?/) {
$s1 =~ s/(.+?)\|.+?/\1/g; # replace alternation
}

$s1 =~ s/([^\\])[()]/\1/g;
$s1 =~ s/\\([\-\[\]\(\)\/\.'"?@#{}])/\1/g; # replace backslashed -[]()/.'"?@#{}
$s1 =~ s/\$$//; # replace $

if ($s1 !~ /$p1/) {
print "I /$p1/\n";
print "O '$s1'\n";
print "NO\n";
print "\n";
}

# attempt to match s1 with p2
# if successful, conflict

}

At the moment, the code is tripping over this pattern:
/login\[[\d]+\]: failed: ^C on /dev/ttyd\d|login\[[\d]+\]: failed: on /dev/ttyd\d|login\[[\d]+\]: Locked ^C account|login\[[\d]+\]:
Locked account/

Specifically, the first caret is giving it fits.

Les
 
M

Malcolm Dew-Jones

Les Peters ([email protected]) wrote:
: Hey:

: I am working on a bit of code that will compare two regexen
: to see if one will match a superset of the other. Trying to
: take pattern-1 and transform it into a string, then try to match
: it against pattern-2 is getting really hard... any better methods?

You build the directed graph that represents the first regular expression,
and then you build the directed graph that represents the second regular
expression.

Then you check if one graph is a subset of the other.

However, I have no suggestion on the perl code to do this, though it
sounds like an interesting problem.
 
B

Ben Morrow

Les Peters said:
Ben said:
Les Peters said:
I am trying to properly order a series of regexen so that a general pattern
will not match all the lines of a more specific pattern, as part of a log
monitoring script.

This problem is not well-defined: which is the more specific of these

/[abc]/ and /[cd]/

Those two patterns would have equal specificity.

The problem lies with patterns like:

/abc/ and /abcde/

if /abc/ is reached first, it will match the lines that /abcde/ would match,
therefore, /abcde/ should be used first, then /abc/.

I realise this. What you are attempting to do is make the choice of
pattern for a given string independant of the order of patterns in the
file, right? Consider the two patterns I gave you, and the string "c".
It matches both. Which should be chosen in this case?
Here is an update to the routine (I will be collapsing some of
this after it works for a significantly complex problem set):

sub pattern_check {
#my ($p1, $p2) = @_;
my ($p1) = @_;

# transform p1 into s1 that matches p1

my $s1;
$s1 = $p1;
$s1 =~ s/^\^//; # replace ^

^ does not necessarily have to occur at the beginning of a pattern.
Consider /a|^b/.
if ($s1 =~ /\\d{(\d+)}/) {
my $patch = "0" x $1;
$s1 =~ s/\\d{(\d+)}/$patch/g;
}
$s1 =~ s/\\d[\+\*]?/0/g; # replace \d, \d+, \d*

This is wrong. /\d/ is more specific than /\d+/ is more specific than
/\d*/, by any reasonable definition of 'specific'.
$s1 =~ s/\\\+/+/g; # replace \+
$s1 =~ s/\\\$/\$/g; # replace \$
$s1 =~ s/\\\*/*/g; # replace \*
$s1 =~ s/\\\././g; # replace \.
$s1 =~ s/\\</</g; # replace \<
$s1 =~ s/\\>/>/g; # replace \>

These are also wrong: you need to replace all \X with X unless \X is
significant in regexen. Note also that < and > are not special in a
regex[1] said:
At the moment, the code is tripping over this pattern:
/login\[[\d]+\]: failed: ^C on /dev/ttyd\d|login\[[\d]+\]: failed: on
/dev/ttyd\d|login\[[\d]+\]: Locked ^C account|login\[[\d]+\]: Locked
account/

Specifically, the first caret is giving it fits.

This isn't much of a problem, as this pattern won't match anything at
all: indeed, it isn't even valid Perl. The ^s and /s need escaping.

I think, as I said before, that your problem is either unsolvable or
*extremely* difficult. I would go with a heuristic approach: first,
split your pattern into its top-level alternatives: you will be best off
using Text::Balanced for this. Then split each alternative into atoms,
an atom being

1. a contiguous literal string, such as "login\[" at the start of the
pattern above; or

2. a zero-width metachar, such as ^ or $ or \b; or

3. a repeat, such as +, *, +? or {1,2}; or

4. a bracketed sub-expression.

Now calculate the 'complexity' of each alternative by some set of rules
like:

1. a literal or a zero-width metachar has a complexity of 1.

2. a bracketed sub-expression has the complexity of the pattern inside
it.

3. an atom with a {} or ? repeat count has its complexity doubled; with
+ or +?, trebled; with * or *?, quadrupled. This is obviously entirely
arbitrary, and may well need adjusting; but this order should be
preserved.

The complexity of the pattern as a whole is then the maximum complexity
of its alternatives (or some such: maybe the sum of the complexities
would be better?). Order the patterns by complexity.

This approach fails in many obvious ways, but may provide a reasonable
guess for sensibly written expressions.

Ben

[1] yet... :)
 
B

Ben Morrow

Les Peters ([email protected]) wrote:
: Hey:

: I am working on a bit of code that will compare two regexen
: to see if one will match a superset of the other. Trying to
: take pattern-1 and transform it into a string, then try to match
: it against pattern-2 is getting really hard... any better methods?

You build the directed graph that represents the first regular expression,
and then you build the directed graph that represents the second regular
expression.

Then you check if one graph is a subset of the other.

However, I have no suggestion on the perl code to do this, though it
sounds like an interesting problem.

I would start by writing a bit of XS to get at the compiled representation of
the pattern.

Ben
 
L

Les Peters

Ben said:
I realise this. What you are attempting to do is make the choice of
pattern for a given string independant of the order of patterns in the
file, right?
Right.

Consider the two patterns I gave you, and the string "c".
It matches both. Which should be chosen in this case?

I would probably say that /[abc]/ can match more things, and therefore is
more general than /[cd]/, so the order would be /[cd]/, /[abc]/.
I had misread your previous post, and was thinking of /abc/ and /de/.
^ does not necessarily have to occur at the beginning of a pattern.
Consider /a|^b/.

This pattern would have to be dealt with, obviously, as an alternation.
This is wrong. /\d/ is more specific than /\d+/ is more specific than
/\d*/, by any reasonable definition of 'specific'.

I agree. The same would apply to \D, \s, \S, etc.
These are also wrong: you need to replace all \X with X unless \X is
significant in regexen. Note also that < and > are not special in a
regex[1], so there's no need to write them as \< and \>.

I agree with you here as well, but I am working from an existing set
of patterns, and this unnecessary backslashing occurs there, so I am
managing it in this way.
I think, as I said before, that your problem is either unsolvable or
*extremely* difficult. I would go with a heuristic approach: first,
split your pattern into its top-level alternatives: you will be best off
using Text::Balanced for this. Then split each alternative into atoms,
an atom being

I am getting a better understanding of the scope of the problem: it's
huge! I am effectively reverse-engineering the regex engine, and only
an approach that takes that into consideration will have any chance of
a moderate level of success.

Thank you for your assistance,
Les
 
B

Bart Lateur

Ben said:
I would start by writing a bit of XS to get at the compiled representation of
the pattern.

Heh? Why that? I'd start by looking at YAPE::Regex (on CPAN) instead. A
demo for use of the module can be found in another module
YAPE::Regex::Explain.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,146
Messages
2,570,832
Members
47,374
Latest member
EmeliaBryc

Latest Threads

Top