regular expression strangeness

greendogday

How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

outputs this:

is
is" some "text

Why doesn't the 2nd one work the same as the first? How did it skip
over the quotes in the middle when it is meant to match with
non-quotes?

Thanks,
 
Matt Garrish

greendogday said:
How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

Out of curiosity why are you capturing a character that doesn't change?
You also should assume the first match worked when performing the
second, and you should use the proper $1 when referring to matches (\1 is
for backreferencing a match).
Why doesn't the 2nd one work the same as the first?

Because you're greedily matching anything that isn't a 1 after the
first double quote up to the last.

Matt
 
Matt Garrish

Matt said:
greendogday said:
How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

Out of curiosity why are you capturing a character that doesn't change?
You also should assume the first match worked when performing the
second

"shouldn't" of course...

Matt
 
greendogday

Out of curiosity why are you capturing a character that doesn't change?

It's just a cut down version of a problem, to show where the problem
lies.
The original had a match for either a double or single quote.
You also should assume the first match worked when performing the
second, and you should use the proper $1 when referring to matches (\1 is
for backreferencing a match).

I don't understand that. I am backreferencing a match - the double
quote.
Because you're greedily matching anything that isn't a 1 after the
first double quote up to the last.

But it should be "not a double quote", shouldn't it? Not a 1.
 
Matt Garrish

Matt said:
greendogday said:
How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

Out of curiosity why are you capturing a character that doesn't change?
You also should assume the first match worked when performing the
second, and you should use the proper $1 when referring to matches (\1 is
for backreferencing a match).

Sorry, guilty of skimming. I thought you were trying to reference the
match in the first from the second. You can't use a backreference
inside a character class because character classes are meant to contain
literal characters so you can't build them dynamically during
evaluation. My point still stands, you're telling perl to find anything
that is not a 1 [^\1]. You could do [^$1] (which is what I thought you
were trying to do) because $1 is set in the preceding match so it gets
interpolated when the regular expression is compiled, but that's
obviously not possible from a single substitution and not what you
need. Try a negative lookahead assertion instead.

Matt
 
Peter J. Holzer

How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

outputs this:

is
is" some "text

Why doesn't the 2nd one work the same as the first? How did it skip
over the quotes in the middle when it is meant to match with
non-quotes?

I don't think \1 is supposed to be a backreference inside a character
class (what if the first () matched more than one character?).


if ($s =~ /(")(.*?)\1/) { print "$2\n" }

works as expected.

hp
 
Matt Garrish

A. Sinan Unur said:
[ Please do not snip attributions when you reply ]
It's just a cut down version of a problem, to show where the problem
lies. The original had a match for either a double or single quote.

I think you are asking a FAQ in disguise:

perldoc -q inside
But it should be "not a double quote", shouldn't it? Not a 1.

See

perldoc perlreref

See the paragraph starting with "The following sequences work within or
without a character class."

#!/usr/bin/perl

use strict;
use warnings;

my $s = 'here "is" some "text" stuff';

if ( $s =~ /(")([^"]*)"/ ) {
    print "$2\n";
}

if ( $s =~ /(")([^$1]*)"/ ) {
    print "$2\n";
}

I'm learning something new today. I didn't think you could dynamically
interpolate into a character class, but it appears you can, but only if
$1 has been set before you try and do it. If you take your example
above and remove the first pattern match, you'll get a compilation
error. I assumed perl would only accept $1 from the first expression
when compiling the second, but so long as $1 has been set it doesn't
matter what it has been set to as the first set of parens will
override:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^$1]*)"/) { print "$2\n" }

outputs:
Unmatched [ in regex; marked by <-- HERE in m/(")([ <-- HERE ^]*)"/ at

however:

my $s = 'here "is" some "text" stuff';
if ($s =~ /( )([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^$1]*)"/) { print "$2\n" }

outputs:

is

Is this a bug in perl or a feature?

Matt
 
Matt Garrish

Matt said:
I'm learning something new today. I didn't think you could dynamically
interpolate into a character class, but it appears you can, but only if
$1 has been set before you try and do it. If you take your example
above and remove the first pattern match, you'll get a compilation
error. I assumed perl would only accept $1 from the first expression
when compiling the second, but so long as $1 has been set it doesn't
matter what it has been set to as the first set of parens will
override:

Hmm, didn't learn anything but I've been away from perl too long. A bad
assumption on my part. I was right the first time, but used a bad test
case and thought I was wrong. Switching my example to capture (i)
proved that it is picking up from the first pattern match and you can't
create dynamically as I thought. I'd still be interested in hearing why
the pattern only compiles if $1 has been set.

Matt
 
Matt Garrish

A. Sinan Unur said:
Sure you can. Regexen behave as double-quoted strings.


I think that's a runtime error caused by the fact that $1 is not
defined:

Bad wording on my part, I think. I meant that you can't create the
regex from captured groups in the same regex, and I believe I'm right.

I don't deny that you can interpolate when the regex is compiled, but
that's not what the OP was asking. He wants to reference the first set
of parens in the match, not the set from the preceding match, which is
what your example takes advantage of.
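
To make the compile-time point concrete: a variable is interpolated before the match starts, so the quote character has to be known beforehand. Here is a minimal sketch, assuming a two-step approach is acceptable for the original either-quote problem. The sub name first_quoted is made up for illustration and does not come from anyone's post.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first quoted string,
# whichever quote character (single or double) opens it.
sub first_quoted {
    my ($text) = @_;

    # Step 1: a separate match finds the opening quote character.
    return undef unless $text =~ /(["'])/;
    my $q = quotemeta $1;

    # Step 2: $q is interpolated when this pattern is compiled, so the
    # character class really does mean "anything but that quote".
    return $text =~ /$q([^$q]*)$q/ ? $1 : undef;
}

print first_quoted(q{here "is" some "text" stuff}), "\n";   # prints: is
```

This only helps when two separate matches are acceptable. It does not let a single regex refer to its own first group from inside a character class.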

You're right about the regex class not being defined. I was thinking
that was happening during compilation (too much compiling code of
late!). But at least I don't feel so dumb now about being incredulous
that you could reference a match in a character class during execution
of the regex. I still like being awed if anyone has a way... : )

Matt
 
Matt Garrish

A. Sinan Unur said:
You are, and I now understand what you meant.


I don't think so, but then I really don't know much.

Hey, I've been all over the map on this one proving I've forgotten
everything I thought I once knew. A couple of bad test cases and I was
thinking I had a whole new power over regexes. Some days just aren't as
good as others... :p

Matt
 
Ilya Zakharevich

A. Sinan Unur said:
Sure you can. Regexen behave as double-quoted strings.

<pedantic>
... as far as variable interpolation goes.
</pedantic>

In other respects (e.g., backslash interpolation) the behaviour is
quite different (and not fully documented yet, AFAIK. I tried to do
it in "gory details", but made some goofs, which were not fixed yet -
at least several years ago.)

Hope this helps,
Ilya
 
greendogday

My point still stands, you're telling perl to find anything
that is not a 1 [^\1].

But if I put a 1 in the string, like so:
$s = 'here "is" 1 some "text" stuff';

then I get:

is" 1 some "text

from the 2nd expression. So it doesn't seem to be searching
for not 1's.
You can't use a backreference
inside a character class

So, as a matter of interest, what does the [^\1] end up as?
What is it "not looking for" when it gets to this bit?
 
anno4000

greendogday said:
My point still stands, you're telling perl to find anything
that is not a 1 [^\1].

But if I put a 1 in the string, like so:
$s = 'here "is" 1 some "text" stuff';

then I get:

is" 1 some "text

from the 2nd expression. So it doesn't seem to be searching
for not 1's.
You can't use a backreference
inside a character class

So, as a matter of interest, what does the [^\1] end up as?
What is it "not looking for" when it gets to this bit?

It's looking for "\1". That is the character whose ord() is (octal) 1.
Set

$s = qq(here "is" \1 some "text" stuff);

to see that.
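
A small check of this, as a sketch only. The sub name second_capture is made up, the strings are the ones from this thread, and the control character is written as \x01 to make it visible.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the second pattern from the original post.
sub second_capture {
    my ($text) = @_;

    # Inside the brackets, \1 is the octal escape for chr(1),
    # not a backreference to the first group.
    return $text =~ /(")([^\1]*)"/ ? $2 : undef;
}

# No chr(1) in the string: the class matches everything, quotes included,
# so the greedy match runs to the last double quote.
print second_capture('here "is" some "text" stuff'), "\n";

# With chr(1) after the first quoted word, the class has to stop there,
# and the match backs off to the nearest earlier double quote.
print second_capture(qq(here "is" \x01 some "text" stuff)), "\n";
```

The first call gives back the long string from the original post, and the second gives back just the first quoted word.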

Anno
 
anno4000

Chris Mattern said:
greendogday said:
It's just a cut down version of a problem, to show where the problem
lies.
The original had a match for either a double or single quote.

I don't understand that. I am backreferencing a match - the double
quote.

No, you aren't. You only think you are. \[1-9] is only "special"
in the second part of a search and replace, and can only refer to
the first part of the same S&R.

No, that's wrong too.

The one-digit backreferences *are* a replacement for $1 .. $9 in a
regex (and the regex part of a s///) where $1 .. $9 can't be used.
Using them on the replacement side of a s/// works, but is considered
bad style.

They are match-time interpolated in the regex, but only in the parts
that are literal matching text. They are not interpolated in character
classes, nor in {,}-quantifiers, and probably a lot more places
that aren't matched literally. The non-interpolation in character
classes was the source of the confusion.

BTW, the existence of backreferences is what makes computer-style
regexes fundamentally different from their mathematical model. In
mathematics a "regular expression" disallows backreferences. That
limits the set of languages they describe in (mathematically)
interesting ways.
In a simple match, it doesn't mean
anything. It is a literal "1", which you escaped with a backslash,
making it still a literal "1".

Well, no. Here is a regex

/^(.)\1*$/

that uses a backreference to match all strings that are a repetition
of a single character, no matter which. This is something mathematicians
like to prove regexes cannot do.

print "'$_': ", /^(.)\1*$/ ? 'yes' : 'no', "\n" for
'', qw( a ab aaaa aaab XXX ;;;;; ;;;:;;);

Anno
 
Ben Morrow

Peter J. Holzer said:
How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

outputs this:

is
is" some "text

Why doesn't the 2nd one work the same as the first? How did it skip
over the quotes in the middle when it is meant to match with
non-quotes?

I don't think \1 is supposed to be a backreference inside a character
class (what if the first () matched more than one character?).


if ($s =~ /(")(.*?)\1/) { print "$2\n" }

works as expected.

...but only if that is the whole regex; e.g.

/(") (.*?) \1 foo/x

does *not* match "a double-quoted string followed by 'foo'". Applied to
the string

"xxx"bar "yyy"foo

$2 will be 'xxx"bar "yyy', which is (probably) not what was meant. In
the general case you need a negative look-ahead:

m{ (['"]) ( (?:(?! \1).)* ) \1 foo }x

For matching actual quoted strings you really want to use
Text::Balanced: go read the FAQ.
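
For comparison, here are the two patterns above side by side on the example string, as a rough sketch. The sub names are made up for illustration, and the look-ahead is written as (?!\1) without the inner space.

```perl
use strict;
use warnings;

# Hypothetical helper: non-greedy body, closing quote via backreference.
sub lazy_before_foo {
    my ($text) = @_;
    return $text =~ m{ (") (.*?) \1 foo }x ? $2 : undef;
}

# Hypothetical helper: each body character is checked with a negative
# look-ahead, so the body can never contain the opening quote character.
sub lookahead_before_foo {
    my ($text) = @_;
    return $text =~ m{ (['"]) ( (?: (?!\1) . )* ) \1 foo }x ? $2 : undef;
}

my $text = q{"xxx"bar "yyy"foo};
print lazy_before_foo($text),      "\n";   # prints: xxx"bar "yyy
print lookahead_before_foo($text), "\n";   # prints: yyy
```

The non-greedy version may still run past a closing quote to make the rest of the pattern succeed. The look-ahead version cannot, so the engine has to restart at a later quote instead.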

Ben
 
Peter J. Holzer

...but only if that is the whole regex; e.g.

/(") (.*?) \1 foo/x

does *not* match "a double-quoted string followed by 'foo'". Applied to
the string

"xxx"bar "yyy"foo

$2 will be 'xxx"bar "yyy', which is (probably) not what was meant.

I don't know what was meant, but this is what I would expect. It's the
same result as you get with

/(") (.*?) " foo/x

so the backreference works "as expected".
In the general case you need a negative look-ahead:

In the general case there is probably also some escape mechanism which
you need to consider.
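
As an illustration only: assuming the escape mechanism is a backslash in front of the quote (nothing in this thread says which convention the real data uses), the look-ahead pattern could be extended roughly like this. The sub name is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: match a single- or double-quoted string in which
# a backslash escapes the following character (an assumed convention).
sub quoted_with_escapes {
    my ($text) = @_;
    return $text =~ m{
        (['"])              # opening quote, remembered as \1
        (
            (?:
                \\ .        # a backslash plus whatever it escapes
              |
                (?!\1) [^\\]  # or any character that is neither the
                              # opening quote nor a backslash
            )*
        )
        \1                  # the matching closing quote
    }xs ? $2 : undef;
}

print quoted_with_escapes(q{say "a \"b\" c" end}), "\n";   # prints: a \"b\" c
```

The captured text still contains the backslashes. Removing them would be a separate step, and Text::Balanced remains the safer choice for real input.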

hp
 
