Regexp: Lazy match workaround?

  • Thread starter R. Rajesh Jeba Anbiah

R. Rajesh Jeba Anbiah

This question was originally posted to comp.lang.php by one of the
regulars <http://groups.google.com/[email protected]>
I tried to solve it myself, but ran into the same problem as the OP.

$str = <<<EOT
n2 = new something(){
with n2{
__add (a);
__add (d);
}



n3 = new somethinge_else(){
with n3{
__add (x);
__add (y);
}

EOT;

In this string the OP wants matches like n2, something, a, d and n3,
something_else, x, y

Both my regex pattern and the OP's match n2, something, a and then n3,
something_else, x (omitting d and y)

Here is my pattern:
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is
^^^^^^^^^^^^^^^^^^^^

Any comments or suggestions are highly appreciated. TIA
 
gnari

[snip]

show us code.
with n2{
__add (a);
__add (d);
}

are there always 2 sets of __add ?
Here is my pattern:
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is

for starters the (__add should have been :)__add
or you'll be catching the whole grouping

did you mean to use /g ?

gnari
 
gnari

gnari said:
[snip,snip]

for starters the (__add should have been :)__add
or you'll be catching the whole grouping

of course i meant (?: but i have changed my mind.
just use:
@r=$str=~/(\w+) = new (.*?)\(\).*?__add \((.*?)\);\s+__add \((.*?)\);\s+.*?\}/isg;
or:
@r=$str=~/(\w+)
\ =\ new\ (.*?)
\(\).*?
__add\ \((.*?)\);\s+
__add\ \((.*?)\);\s+
.*?\}
/isgx;


gnari
 
Anno Siegel

R. Rajesh Jeba Anbiah said:
This question was originally posted to comp.lang.php by one of the
regulars
<http://groups.google.com/[email protected]>
I tried to solve it myself, but ran into the same problem as the OP.

$str = <<<EOT
n2 = new something(){
with n2{
__add (a);
__add (d);
}



n3 = new somethinge_else(){
with n3{
__add (x);
__add (y);
}

EOT;

In this string the OP wants matches like n2, something, a, d and n3,
something_else, x, y

Both my regex pattern and the OP's match n2, something, a and then n3,
something_else, x (omitting d and y)

Here is my pattern:
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is

my $add = qr/\W*__add\s*\((\w+)\)/;
my $group = qr/(\w+)\s*=\s*new\s+(\w+)\(\).*?with\s*\1$add$add/s;

print "$1 $2 $3 $4\n" while $str =~ /$group/g;

Anno
 
Brian McCauley

I tried to solve it myself, but ran into the same problem as the OP.

[ snip example code that won't compile ]

Please express your problem wherever possible as real code that you
have actually run and found to reproduce the symptoms you describe, and
cut-and-paste it into the posting.

Otherwise, each person who wants to help you has to correct the typos
and construct by hand the script you should have posted. This is not
efficient.

Here's what I constructed from your description.

my $str = <<EOT;
n2 = new something(){
with n2{
__add (a);
__add (d);
}



n3 = new somethinge_else(){
with n3{
__add (x);
__add (y);
}

EOT

print join ', ', map "'$_'",
$str =~ /(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is;
print "\n";
__END__
In this string the OP wants matches like n2, something, a, d and n3,
something_else, x, y

Both my regex pattern and the OP's match n2, something, a and then n3,
something_else, x (omitting d and y)

No they don't. Your pattern finds:

'n2', 'something', '__add (a);', 'a'

That's all.

Random shot in the dark: You want to be able to capture any number of
__add() lines. This is most simply done with two m// operators: one
to capture everything within the {...} and another to capture the
argument of each __add within that.

while ($str =~ /(\w+) = new (.*?)\(\)(.*?)\}/isg ) {
my @match = ($1,$2);
push @match, $3 =~ /__add \((.*?)\)/ig;
print join ', ', map "'$_'", @match;
print "\n";
}

This gives:

'n2', 'something', 'a', 'd'
'n3', 'somethinge_else', 'x', 'y'

I'm sure one could do it all in a single m// using (?{}) but that would
make for hard to read/maintain code.
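
For readers who still want a single pattern without (?{}), one possible sketch is to let an alternation match either a "new" header or an __add line and sort the hits in the loop body. This does not come from any poster in the thread; the name @records and the restriction of class names and arguments to \w+ are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample data from the thread (the "somethinge_else" spelling is kept as posted).
my $str = <<'EOT';
n2 = new something(){
with n2{
__add (a);
__add (d);
}

n3 = new somethinge_else(){
with n3{
__add (x);
__add (y);
}
EOT

my @records;    # hypothetical name: one array ref per "new" block

while ($str =~ /(\w+) = new (\w+)\(\)|__add \((\w+)\)/g) {
    if (defined $1) {
        # First alternative matched: start a new record.
        push @records, [$1, $2];
    }
    elsif (@records) {
        # Second alternative matched: attach the argument to the latest record.
        push @{ $records[-1] }, $3;
    }
}

print join(', ', map { "'$_'" } @$_), "\n" for @records;
```

Because this needs only alternation and a global match, the same idea should carry over to PCRE, though that has not been checked here.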
Any comments or suggestions are highly appreciated. TIA

In future please post an actual script you have run. Describe what it
does and how this is different from what you want. Often the process
of preparing such a script will lead you to the solution yourself.

This and much other useful advice can be found in the posting
guidelines.

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 
R. Rajesh Jeba Anbiah

Brian McCauley said:
(e-mail address removed) (R. Rajesh Jeba Anbiah) writes:

<snip>

Many thanks to all the experts who answered in this thread. My
original code was in PHP with PCRE. I thought I would get more help
with regular expressions in a Perl group, and so posted here. Sorry to
bug you all.
Here's what I constructed from your description.

my $str = <<EOT;
n2 = new something(){
with n2{
__add (a);
__add (d);
}



n3 = new somethinge_else(){
with n3{
__add (x);
__add (y);
}

EOT

print join ', ', map "'$_'",
$str =~ /(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is;
print "\n";
__END__

No they don't. Your pattern finds:

'n2', 'something', '__add (a);', 'a'

Yes, that is my problem. I couldn't get it to match the next
'__add (d);' and 'd'.
Random shot in the dark: You want to be able to capture any number of
__add() lines. This is most simply done with two m// operators: one
to capture everything within the {...} and another to capture the
argument of each __add within that.

while ($str =~ /(\w+) = new (.*?)\(\)(.*?)\}/isg ) {
my @match = ($1,$2);
push @match, $3 =~ /__add \((.*?)\)/ig;
print join ', ', map "'$_'", @match;
print "\n";
}

This gives:

'n2', 'something', 'a', 'd'
'n3', 'somethinge_else', 'x', 'y'

Yes, I understand, you're suggesting to use two patterns.
I'm sure one could do it all in a single m// using (?{}) but that would
make for hard to read/maintain code.

Yes, that is what I was trying. I couldn't understand why a single
pattern didn't capture every __add (). In the string, __add () appears
twice, but my pattern captured only one of them.
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is
^^^^^^^^^^^^^^^^^^^^
As you see, I have used (__add \((.*?)\).+?)+ and I have also tried
(__add \((.*?)\).+?)* and (__add \((.*?)\).+?)*? as well. The group
captures every occurrence only when used as a separate expression, but
if we use it in
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is
^^^^^^^^^^^^^^^^^^^^
it doesn't work. I'm quite puzzled by this behavior.

Many thanks for all your patience.
 
Anno Siegel

R. Rajesh Jeba Anbiah said:
[...]

Yes, I understand, you're suggesting to use two patterns.
I'm sure one could do it all in a single m// using (?{}) but that would
make for hard to read/maintain code.

Yes, that is what I was trying. I couldn't understand why a single
pattern didn't capture every __add (). In the string, __add () appears
twice, but my pattern captured only one of them.
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is
^^^^^^^^^^^^^^^^^^^^
As you see, I have used (__add \((.*?)\).+?)+ and I have also tried
(__add \((.*?)\).+?)* and (__add \((.*?)\).+?)*? as well. The group
captures every occurrence only when used as a separate expression, but
if we use it in
/(\w+) = new (.*?)\(\).*?(__add \((.*?)\).+?)+.*?\}/is
^^^^^^^^^^^^^^^^^^^^
it doesn't work. I'm quite puzzled by this behavior.

That's because the same pair of capturing parentheses matches both
occurrences of "__add ()". In effect only the second match is captured.
In isolation this may be clearer:

$_ = 'xxAxxB';
no warnings 'uninitialized';
print "match: \$1: |$1|, \$2: |$2|\n" if /(xx.)+/;
print "match: \$1: |$1|, \$2: |$2|\n" if /(xx.)(xx.)/;
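
A hedged illustration of the isolated point about repeated groups, using the same 'xxAxxB' test string: a quantified group leaves only its final iteration in $1, whereas an unquantified group matched with /g in list context returns every occurrence. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = 'xxAxxB';

# One quantified group: $1 holds only what the last iteration captured.
my ($last) = $text =~ /(xx.)+/;

# Unquantified group with /g in list context: one element per match.
my @all = $text =~ /(xx.)/g;

print "last iteration: $last\n";
print "all matches: @all\n";
```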

Anno
 
Brian McCauley

I couldn't understand why a single

That's because the same pair of capturing parentheses matches both
occurrences of "__add ()". In effect only the second match is captured.
In isolation this may be clearer:

$_ = 'xxAxxB';
no warnings 'uninitialized';
print "match: \$1: |$1|, \$2: |$2|\n" if /(xx.)+/;
print "match: \$1: |$1|, \$2: |$2|\n" if /(xx.)(xx.)/;

Whilst what you say is important and relevant, it's not actually
right in this particular case.

In this particular case it captures only the first.

This is because in a regex the leftmost greedy/non-greedy quantifier
takes precedence.

Consider

' A C !' =~ /(A.*?)+.*!/;

Here the repeated group matches only 'A'. It does not match the 'C'
because the non-greediness of the '*?' is more important than the
greediness of the '+'. Since it is possible to find a match in which
the '.*?' matches '', that is the best match.


 
Brian McCauley

Yes, that is what I was trying.

I very much doubt that you were using (?{}) since it does not appear
in any of your code. (?{}) is a very advanced feature allowing you
to insert bits of Perl code that are to be executed during the regex
matching operation.

I also very much doubt that you were trying to make hard to
read/maintain code. :)

 
Brian McCauley

Brian McCauley said:
Consider

' A C !' =~ /(A.*?)+.*!/;

Here the repeated group matches only 'A'. It does not match the 'C'
because the non-greediness of the '*?' is more important than the
greediness of the '+'.

I meant, of course, consider

' A C !' =~ /(\w.*?)+.*!/;

Obviously /A/ won't match 'C' ever!
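
The corrected example can be run as the small sketch below; the variable names are chosen here for illustration. It shows the group completing after a single iteration because the lazy '.*?' is satisfied with the empty string.

```perl
use strict;
use warnings;

my $text = ' A C !';
my ($whole, $group1);

if ($text =~ /(\w.*?)+.*!/) {
    # The lazy .*? settles for the empty string, so the first iteration is
    # just 'A'. A second iteration would need \w to match the following
    # space, which fails, so the trailing .* consumes ' C ' instead.
    ($whole, $group1) = ($&, $1);
    print "whole match: '$whole'\n";
    print "group 1: '$group1'\n";
}
```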

 
Ben Morrow

Quoth Brian McCauley said:
I very much doubt that you were using (?{}) since it does not appear
in any of your code. (?{}) is a very advanced feature allowing you
to insert bits of Perl code that are to be executed during the regex
matching operation.

...and (obviously) isn't supported by PCRE.

Ben
 
R. Rajesh Jeba Anbiah

Brian McCauley said:
I meant, of course, consider

' A C !' =~ /(\w.*?)+.*!/;

Obviously /A/ won't match 'C' ever!

Again, many thanks to all the experts. I understand what you mean,
for example in the following case:
Target string: XabcABCX
Regex Pattern: /X(abc)+X/i
Matches : XabcABCX, ABC
NOT: XabcABCX, abc, ABC
^^^
Here, only the 'ABC' gets matched, but not the first 'abc'. This
behavior is indeed a bit difficult to understand :-(
 
nobull

Again, many thanks to all the experts. I understand what you mean,
for example in the following case:
Target string: XabcABCX
Regex Pattern: /X(abc)+X/i
Matches : XabcABCX, ABC
NOT: XabcABCX, abc, ABC
^^^
Here, only the 'ABC' gets matched, but not the first 'abc'. This
behavior is indeed a bit difficult to understand :-(

Indeed it would be - but that is not what happens. Go back and
re-read what Anno said.

The repeated capturing subexpression /(abc)/i does indeed match and
capture both 'abc' and then also 'ABC'. But upon completion of the
pattern match the special variable $1 ( or the first element of the
list context value of the m// operator ) will contain the _last_ thing
to be captured (i.e. 'ABC').

The only way you could see that 'abc' had been captured would be to
look at the value of $1 part way through the pattern match operation.
This is where (?{}) would come in.
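
As a rough sketch of that idea, a (?{ }) block placed right after the group can record the capture on every iteration. This relies on Perl's embedded-code construct, which is Perl-specific; @seen and $final are names invented here, and $^N is used inside the block because it is the documented way to read the most recently closed group (the same text $1 holds at that moment).

```perl
use strict;
use warnings;

my $text = 'XabcABCX';
my @seen;     # hypothetical name: captures observed while the match runs
my $final;    # hypothetical name: value of $1 once the match has finished

# The code block runs each time the group closes, so it sees every iteration.
if ($text =~ /X(?:(abc)(?{ push @seen, $^N }))+X/i) {
    $final = $1;
    print "final \$1: $final\n";
    print "seen during match: @seen\n";
}
```

With this input the pattern matches on the first attempt without backtracking over the group, so the block runs exactly once per iteration; a pattern that backtracks could run it more often.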
 
R. Rajesh Jeba Anbiah

Indeed it would be - but that is not what happens. Go back and
re-read what Anno said.

The repeated capturing subexpression /(abc)/i does indeed match and
capture both 'abc' and then also 'ABC'. But upon completion of the
pattern match the special variable $1 ( or the first element of the
list context value of the m// operator ) will contain the _last_ thing
to be captured (i.e. 'ABC').

Indeed a nice explanation. Many thanks for all your comments and
help.
The only way you could see that 'abc' had been captured would be to
look at the value of $1 part way through the pattern match operation.
This is where (?{}) would come in.

Thanks for pointing that out. But it is not available in PCRE, as
someone said. Thanks.
 
