regular expression strangeness

greendogday · Aug 16, 2006

How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

outputs this:

is
is" some "text

Why doesn't the 2nd one work the same as the first? How did it skip
over the quotes in the middle when it is meant to match with
non-quotes?

Thanks,

Matt Garrish · Aug 16, 2006

greendogday said:
How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

Out of curiosity why are you capturing a character that doesn't change?
You also should assume the first match worked when performing the
second, and you should use the proper $1 when referring to matches (\1 is
for backreferencing a match).

Why doesn't the 2nd one work the same as the first?

Because you're greedily matching anything that isn't a 1 after the
first double quote up to the last.

Matt

Matt Garrish · Aug 16, 2006

Matt said:
greendogday said:

How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

Out of curiosity why are you capturing a character that doesn't change?
You also should assume the first match worked when performing the
second

"shouldn't" of course...

Matt

greendogday · Aug 16, 2006

Out of curiosity why are you capturing a character that doesn't change?

It's just a cut down version of a problem, to show where the problem
lies.
The original had a match for either a double or single quote.

You also should assume the first match worked when performing the
second, and you should use the proper $1 when referring to matches (\1 is
for backreferencing a match).

I don't understand that. I am backreferencing a match - the double
quote.

Because you're greedily matching anything that isn't a 1 after the
first double quote up to the last.

But it should be "not a double quote", shouldn't it? Not a 1.

Matt Garrish · Aug 16, 2006

Matt said:
greendogday said:

How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

Out of curiosity why are you capturing a character that doesn't change?
You also should assume the first match worked when performing the
second, and you should use the proper $1 when referring to matches (\1 is
for backreferencing a match).

Sorry, guilty of skimming. I thought you were trying to reference the
match in the first from the second. You can't use a backreference
inside a character class because character classes are meant to contain
literal characters so you can't build them dynamically during
evaluation. My point still stands, you're telling perl to find anything
that is not a 1 [^\1]. You could do [^$1] (which is what I thought you
were trying to do) because $1 is set in the preceding match so it gets
interpolated when the regular expression is compiled, but that's
obviously not possible from a single substitution and not what you
need. Try a negative lookahead assertion instead.

Matt

Peter J. Holzer · Aug 17, 2006

How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

outputs this:

is
is" some "text

Why doesn't the 2nd one work the same as the first? How did it skip
over the quotes in the middle when it is meant to match with
non-quotes?

I don't think \1 is supposed to be a backreference inside a character
class (what if the first () matched more than one character?).

if ($s =~ /(")(.*?)\1/) { print "$2\n" }

works as expected.

hp
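
A minimal sketch of the backreference approach above, extended to the "either a double or single quote" case mentioned earlier in the thread. The helper name first_quoted is hypothetical and only there for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of the first quoted run in $text,
# where the closing quote must be the same character as the opening one.
sub first_quoted {
    my ($text) = @_;
    # (['"]) captures the opening quote. Outside a character class, \1 is
    # a genuine backreference, so it only matches that same character.
    return $text =~ /(['"])(.*?)\1/ ? $2 : undef;
}

print first_quoted(q{here "is" some "text" stuff}), "\n";   # is
print first_quoted(q{say 'it"s' now}), "\n";                # it"s
```

The lazy .*? keeps the body as short as possible, which is what makes this behave like [^"]* in the simple case.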

Matt Garrish · Aug 17, 2006

A. Sinan Unur said:
[ Please do not snip attributions when you reply ]

It's just a cut down version of a problem, to show where the problem
lies. The original had a match for either a double or single quote.

I think you are asking a FAQ in disguise:

perldoc -q inside

But it should be "not a double quote", shouldn't it? Not a 1.

See

perldoc perlreref

See the paragraph starting with "The following sequences work within or
without a character class."

#!/usr/bin/perl

use strict;
use warnings;

my $s = 'here "is" some "text" stuff';

if ( $s =~ /(")([^"]*)"/ ) {
    print "$2\n";
}

if ( $s =~ /(")([^$1]*)"/ ) {
    print "$2\n";
}

I'm learning something new today. I didn't think you could dynamically
interpolate into a character class, but it appears you can, but only if
$1 has been set before you try and do it. If you take your example
above and remove the first pattern match, you'll get a compilation
error. I assumed perl would only accept $1 from the first expression
when compiling the second, but so long as $1 has been set it doesn't
matter what it has been set to as the first set of parens will
override:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^$1]*)"/) { print "$2\n" }

outputs:
Unmatched [ in regex; marked by <-- HERE in m/(")([ <-- HERE ^]*)"/ at

however:

my $s = 'here "is" some "text" stuff';
if ($s =~ /( )([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^$1]*)"/) { print "$2\n" }

outputs:

is

Is this a bug in perl or a feature?

Matt

Matt Garrish · Aug 17, 2006

Matt said:
A. Sinan Unur said:

[ Please do not snip attributions when you reply ]

Out of curiosity why are you capturing a character that doesn't
change?

It's just a cut down version of a problem, to show where the problem
lies. The original had a match for either a double or single quote.

I think you are asking a FAQ in disguise:

perldoc -q inside

Why doesn't the 2nd one work the same as the first?

Because you're greedily matching anything that isn't a 1 after the
first double quote up to the last.

But it should be "not a double quote", shouldn't it? Not a 1.

See

perldoc perlreref

See the paragraph starting with "The following sequences work within or
without a character class."

#!/usr/bin/perl

use strict;
use warnings;

my $s = 'here "is" some "text" stuff';

if ( $s =~ /(")([^"]*)"/ ) {
    print "$2\n";
}

if ( $s =~ /(")([^$1]*)"/ ) {
    print "$2\n";
}

I'm learning something new today. I didn't think you could dynamically
interpolate into a character class, but it appears you can, but only if
$1 has been set before you try and do it. If you take your example
above and remove the first pattern match, you'll get a compilation
error. I assumed perl would only accept $1 from the first expression
when compiling the second, but so long as $1 has been set it doesn't
matter what it has been set to as the first set of parens will
override:

Hmm, didn't learn anything but I've been away from perl too long. A bad
assumption on my part. I was right the first time, but used a bad test
case and thought I was wrong. Switching my example to capture (i)
proved that it is picking up from the first pattern match and you can't
create dynamically as I thought. I'd still be interested in hearing why
the pattern only compiles if $1 has been set.

Matt
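
The behaviour described above can be made visible by building the pattern with qr// and printing it. This is only a sketch: it uses a capture of (i) as the preceding match, as in the test mentioned above, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $s = 'here "is" some "text" stuff';

# A first, unrelated match leaves the capture variable set to 'i'.
$s =~ /(i)/ or die "no match";

# The variable is interpolated while the pattern is being built, so this
# compiles with [^i] in the class. It does not refer to the (") group of
# the same regex.
my $re = qr/(")([^$1]*)"/;
print "compiled pattern: $re\n";

my $captured;
if ( $s =~ $re ) {
    $captured = $2;
    print "captured: [$captured]\n";   # captured: [ some "text]
}
```

If no earlier match has set $1, the interpolation yields an empty string and the class degenerates to [^]*)", which is where the "Unmatched [" message quoted above comes from.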

Matt Garrish · Aug 17, 2006

A. Sinan Unur said:
Sure you can. Regexen behave as double-quoted strings.

I think that's a runtime error caused by the fact that $1 is not
defined:

Bad wording on my part, I think. I meant that you can't create the
regex from captured groups in the same regex, and I believe I'm right.

I don't deny that you can interpolate when the regex is compiled, but
that's not what the OP was asking. He wants to reference the first set
of parens in the match, not the set from the preceding match, which is
what your example takes advantage of.

You're right about the regex class not being defined. I was thinking
that was happening during compilation (too much compiling code of
late!). But at least I don't feel so dumb now about being incredulous
that you could reference a match in a character class during execution
of the regex. I still like being awed if anyone has a way... : )

Matt

Matt Garrish · Aug 17, 2006

A. Sinan Unur said:
You are, and I now understand what you meant.

I don't think so, but then I really don't know much.

Hey, I've been all over the map on this one proving I've forgotten
everything I thought I once knew. A couple of bad test cases and I was
thinking I had a whole new power over regexes. Some days just aren't as
good as others...

Matt

Ilya Zakharevich · Aug 17, 2006

[A complimentary Cc of this posting was sent to
A. Sinan Unur]

Sure you can. Regexen behave as double-quoted strings.

<pedantic>
... as far as variable interpolation goes.
</pedantic>

In other respects (e.g., backslash interpolation) the behaviour is
quite different (and not fully documented yet, AFAIK. I tried to do
it in "gory details", but made some goofs, which were not fixed yet -
at least several years ago.)

Hope this helps,
Ilya
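
One well-known illustration of that difference is \b. The escape is an example picked for this sketch, not one named in the post, and the variable names are made up:

```perl
use strict;
use warnings;

# "\b" in a double-quoted string is a backspace character (chr 8).
my $backspace = "\b";
printf "string \\b is chr(%d)\n", ord $backspace;

# /\b/ in a regex is a zero-width word boundary, not a backspace.
my $boundary_matches  = 'word' =~ /\bword\b/ ? 1 : 0;
my $backspace_matches = "a${backspace}b" =~ /a\bb/ ? 1 : 0;

# Inside a character class, \b means backspace again.
my $class_matches = "a${backspace}b" =~ /a[\b]b/ ? 1 : 0;

print "boundary: $boundary_matches, bare \\b: $backspace_matches, ",
      "class: $class_matches\n";
```

The same split applies to \1: a backreference in the body of a regex, but an octal character escape inside a character class and in a double-quoted string.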

greendogday · Aug 17, 2006

My point still stands, you're telling perl to find anything
that is not a 1 [^\1].

But if I put a 1 in the string, like so:
$s = 'here "is" 1 some "text" stuff';

then I get:

is" 1 some "text

from the 2nd expression. So it doesn't seem to be searching
for not 1's.

You can't use a backreference
inside a character class

So, as a matter of interest, what does the [^\1] end up as?
What is it "not looking for" when it gets to this bit?

anno4000 · Aug 17, 2006

greendogday said:
My point still stands, you're telling perl to find anything
that is not a 1 [^\1].

But if I put a 1 in the string, like so:
$s = 'here "is" 1 some "text" stuff';

then I get:

is" 1 some "text

from the 2nd expression. So it doesn't seem to be searching
for not 1's.

You can't use a backreference
inside a character class

So, as a matter of interest, what does the [^\1] end up as?
What is it "not looking for" when it gets to this bit?

It's looking for "\1". That is the character whose ord() is (octal) 1.
Set

$s = qq(here "is" \1 some "text" stuff);

to see that.

Anno
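
Spelled out as a runnable sketch (the variable $captured is only there for the example):

```perl
use strict;
use warnings;

# qq() follows double-quoted-string rules, so \1 here is the character
# with code 1, sitting between the two quoted words.
my $s = qq(here "is" \1 some "text" stuff);

# Inside the class, \1 is that same character, so [^\1]* cannot run past it.
# The greedy match stops in front of chr(1), and backtracking then settles
# on the nearest closing quote.
my $captured = $s =~ /(")([^\1]*)"/ ? $2 : undef;
print defined $captured ? "captured: $captured\n" : "no match\n";   # captured: is
```

With chr(1) in the way, the second pattern from the original post gives the same result as the first one.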

anno4000 · Aug 17, 2006

Chris Mattern said:
greendogday said:

It's just a cut down version of a problem, to show where the problem
lies.
The original had a match for either a double or single quote.

I don't understand that. I am backreferencing a match - the double
quote.

No, you aren't. You only think you are. \[1-9] is only "special"
in the second part of a search and replace, and can only refer to

No, that's wrong too.

The one-digit backreferences *are* a replacement for $1 .. $9 in a
regex (and the regex part of a s///) where $1 .. $9 can't be used.
Using them on the replacement side of a s/// works, but is considered
bad style.

They are match-time interpolated in the regex, but only in the parts
that are literal matching text. They are not interpolated in character
classes, and neither in {,}-quantifiers and probably a lot more places
that aren't matched literally. The non-interpolation in character
classes was the source of the confusion.

BTW, the existence of backreferences is what makes computer-style
regexes fundamentally different from their mathematical model. In
mathematics a "regular expression" disallows backreferences. That
limits the set of languages they describe in (mathematically)
interesting ways.

the first part of the same S&R. In a simple match, it doesn't mean
anything. It is a literal "1", which you escaped with a backslash,
making it still a literal "1".

Well, no. Here is a regex

/^(.)\1*$/

that uses a backreference to match all strings that are a repetition
of a single character, no matter which. This is something mathematicians
like to prove regexes cannot do.

print "'$_': ", /^(.)\1*$/ ? 'yes' : 'no', "\n" for
    '', qw( a ab aaaa aaab XXX ;;;;; ;;;:; );

Anno

Ben Morrow · Aug 17, 2006

Peter J. Holzer said:
How come the following:

my $s = 'here "is" some "text" stuff';
if ($s =~ /(")([^"]*)"/) { print "$2\n" }
if ($s =~ /(")([^\1]*)"/) { print "$2\n" }

outputs this:

is
is" some "text

Why doesn't the 2nd one work the same as the first? How did it skip
over the quotes in the middle when it is meant to match with
non-quotes?

I don't think \1 is supposed to be a backreference inside a character
class (what if the first () matched more than one character?).

if ($s =~ /(")(.*?)\1/) { print "$2\n" }

works as expected.

...but only if that is the whole regex; e.g.

/(") (.*?) \1 foo/x

does *not* match "a double-quoted string followed by 'foo'". Applied to
the string

"xxx"bar "yyy"foo

$2 will be 'xxx"bar "yyy', which is (probably) not what was meant. In
the general case you need a negative look-ahead:

m{ (['"]) ( (?: (?! \1).)* ) \1 foo }x

For matching actual quoted strings you really want to use
Text::Balanced: go read the FAQ.

Ben
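
A short sketch that runs both patterns against the sample string from this post, so the difference can be checked directly. The variable names are made up for the example:

```perl
use strict;
use warnings;

my $s = q{"xxx"bar "yyy"foo};

# Lazy version: .*? is still allowed to run across a closing quote if that
# is the only way to reach the trailing foo, so the capture spans two
# quoted strings.
my $lazy = $s =~ /(") (.*?) \1 foo/x ? $2 : undef;

# Look-ahead version: (?! \1) . consumes one character at a time and refuses
# to consume the opening quote character, so the body cannot cross a quote.
my $tempered = $s =~ m{ (['"]) ( (?: (?! \1) . )* ) \1 foo }x ? $2 : undef;

print "lazy:     $lazy\n";       # lazy:     xxx"bar "yyy
print "tempered: $tempered\n";   # tempered: yyy
```

The look-ahead form is the single-regex equivalent of [^"]* when the quote character is only known at match time.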

Peter J. Holzer · Aug 19, 2006

...but only if that is the whole regex; e.g.

/(") (.*?) \1 foo/x

does *not* match "a double-quoted string followed by 'foo'". Applied to
the string

"xxx"bar "yyy"foo

$2 will be 'xxx"bar "yyy', which is (probably) not what was meant.

I don't know what was meant, but this is what I would expect. It's the
same result as you get with

/(") (.*?) " foo/x

so the backreference works "as expected".

In the general case you need a negative look-ahead:

In the general case there is probably also some escape mechanism which
you need to consider.

hp
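
As a rough sketch of what handling an escape mechanism could look like, here is a pattern that assumes a backslash escapes whatever character follows it. That convention is an assumption made for the example; the thread does not say which one the real data uses.

```perl
use strict;
use warnings;

# Assumption for illustration: a backslash escapes the next character.
my $quoted = qr{
    (['"])                   # opening quote, remembered as \1
    (                        # body of the quoted string
      (?: \\ .               #   a backslash plus the character it escapes
        | (?! \1 ) [^\\]     #   or one character that is neither the
      )*                     #   opening quote nor a backslash
    )
    \1                       # the same quote closes the string
}xs;

my $s = q{say "a \"quoted\" word" and 'more'};
my $body = $s =~ $quoted ? $2 : undef;
print "$body\n";   # a \"quoted\" word
```

The captured body still contains the backslashes; removing them is a separate step that depends on the escape rules in use.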
