Tough (for me) regex case

Brian McGonigle

Actually, all you need is the last two expressions. Here it is revised:

#!/usr/bin/perl


$text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';


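# Walk every "..." span; $1 is the text between one pair of quotes.
while ($text =~ /"(.*?)"/g) {
    # If that text also appears wrapped in doubled quotes, capture it with them.
    if ($text =~ /".*?(""$1"").*?"/) {
        push @matches, $1;
        print "REGEX 1: $1 \n";
    }
    else {
        push @matches, $1;
        print "REGEX 2: $1\n";
    }
}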


print "MATCHES: @matches\n";


It prints...

REGEX 2: quick
REGEX 2: fox jumped
REGEX 1: ""over""
REGEX 2: the
MATCHES: quick fox jumped ""over"" the



Brian said:
Got it!!!

$text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

while ($text =~ /"(.*?)"/g) {
    # $_ is never set in this script, so it interpolates as an empty string.
    if ($text =~ /"$_("".*?"")/) {
        push @matches, ($1);
        print "FOUND: $1\n";
    }
    elsif ($text =~ /(""$1"")(.*?)"/) {
        push @matches, $1;
        print "FOUND: $1 \n";
    }
    else {
        push @matches, $1;
        print "FOUND: $1\n";
    }
}

print "MATCHES: @matches\n";


Prints...

FOUND: quick
FOUND: fox jumped
FOUND: ""over""
FOUND: the
MATCHES: quick fox jumped ""over"" the


mortb said:
...and
(?<=")(""|[^"])+(?=")

gets me...

(1) quick
(2) brown (notice the space before)
(3) fox jumped ""over"" the

this got rid of the quotes but introduced the " brown" error
Perhaps you can live with the quotes....

/mortb

"(""|[^"])+"

gets me:
(1) "quick"
(2) "fox jumped ""over"" the"

- the | is an OR-operator which means that after " you match either "" or
any character except "

I've tried to get rid of the initial and ending quotes -- I think it's
possible in the same expression -- but I haven't succeeded -- yet.
tricky? -- yes!

/mortb
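
One possible way to drop the outer quotes in the same expression, sketched here in Perl: keep the alternation, make it non-capturing, and wrap it in a capturing group that sits between the two outer quotes. The variable names are illustrative and not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

# The outer quotes are matched but stay outside the capturing group.
# Inside, (?:""|[^"])+ accepts a doubled quote or any non-quote character.
my @fields = $text =~ /"((?:""|[^"])+)"/g;

print "($_)\n" for @fields;
# (quick)
# (fox jumped ""over"" the)
```

Because a doubled quote is consumed as one unit, a lone quote is the only thing that can end a field, so " brown " is never picked up as a match.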
 
Rob Perkins

Matt Garrish said:
Second and third passes? Yuck!

The quotes could be removed in one regex, but in this case the match pattern
*does not* produce the results the OP claims and so no other processing
should be necessary.

The match pattern produced the first set of results I listed. I needed
the second set.

If there are quotation marks at the beginning and end
of the strings, I would hazard a guess that the OP added them somewhere in his
code.

Nope.

Rob
 
Matt Garrish

Rob Perkins said:
Who's wrong? You'll doubtless say Microsoft did it wrong, I'm sure,
since you seem to care about believing that they can't do it right.

They do many things wrong, but that's not my point. Regular expressions are
*largely* the same between implementations, but each language has its own
idiosyncrasies you have to be aware of. I'll give you a quick breakdown of
why the quotes *shouldn't* be captured:

/(?<!")"(?!")(.*?)(?<!")"(?!")/

(?<!")"(?!") -- find a quote that isn't preceded or followed by a quote (see
the perlre documentation). Note that the quotation mark is not in parens,
and so should *not* be captured by the regex (which was my point from the
outset).

(.*?)(?<!")"(?!") -- Now do a non-greedy match of everything up to the next
quotation mark that is not preceded or followed by a quotation mark. Again,
this final quotation mark is not part of the match pattern, so it should
*not* be included in the match (your match is everything (.*?) between the
quotation marks).

Does it make a little more sense now why Microsoft's implementation is
wrong?

Matt
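
The breakdown above can be written out with Perl's /x flag so that each piece carries its own comment. This is a minimal sketch run against the sample sentence used elsewhere in the thread; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

my @captures = $text =~ m{
    (?<!") " (?!")    # opening quote, not preceded or followed by a quote
    (.*?)             # group 1: as little text as possible
    (?<!") " (?!")    # closing quote, same restriction
}gx;

print "[$_]\n" for @captures;
# [quick]
# [fox jumped ""over"" the]
```

Only group 1 is returned in list context, so the two standalone quotes take part in the match but never appear in @captures.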
 
Tad McClellan

Rob Perkins said:
I had *no idea* that there were different implementations of "regular
expressions". The name belies the very idea of differing
implementations.

Well, for what it's worth, I ran the code through ActivePerl, with the
results you predicted. It seems to me reading the regex that if Perl's
evaluator strips the surrounding quotes,


It does not "strip the surrounding quotes", they are still there
in the original string where they always were.

It "fails to match the surrounding quotes".

m//g in a list context returns a list of the memories in the pattern.
The pattern had no memories that would match the leading/trailing quotes.
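
A short sketch of that list-context rule, using a shortened sample string chosen for illustration: with a capturing group, m//g returns what the group remembered; with no group at all, it returns each whole match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'The "quick" brown "fox"';

# One capturing group: the list holds only the remembered text.
my @with_group = $text =~ /"(.*?)"/g;

# No capturing group: the list holds each entire match, quotes included.
my @without_group = $text =~ /".*?"/g;

print "with group:    @with_group\n";      # with group:    quick fox
print "without group: @without_group\n";   # without group: "quick" "fox"
```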

it's *wrong*, since nothing
in that regex should consume the character,


There is no "consuming" going on, only "matching" (or not matching).

and ALL implementations
should give the same match, IMO.


Every operator in every language must mean the same thing?

That's too limiting to be widely adopted I'm afraid.
 
Tad McClellan

Chris Sells' RegexDesigner.NET shows that with that input string and
that regex, the quotemarks are sucked up in the match.


Different languages have different behaviors. Get used to it.

Of course, if you'd rather be dogmatic about it, I guess I'll have to
leave you alone, and slink away with my question unanswered.


It is "pragmatic" rather than "dogmatic".

Since different languages have different behaviors, it is kinda
required that the particular language be known in order to
provide a useable answer.

If you ask for an answer in Perl, but hope to really use some
other language, then it is up to YOU to translate it. We speak
Perl here in the Perl newsgroup.

If you can't be bothered with doing the translation, then don't
ask in other-language newsgroups.
 
Tad McClellan

Rob Perkins said:
I had *no idea* that there were different implementations of "regular
expressions". The name belies the very idea of differing
implementations.


Then you misunderstand the use of the word "regular" here.

It has a precise mathematical meaning, as in "regular language",
"regular grammar", etc.

http://en.wikipedia.org/wiki/Regular_language


It should also be noted that regexes in Perl no longer meet
the rules to be mathematically "regular"; i.e., the name is
historical rather than accurate.
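
One illustration of that point, offered as a sketch rather than a proof: a backreference lets a Perl pattern accept exactly the strings made of the same word twice, and that set of strings is not a regular language. The inputs are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @inputs = ('abc abc', 'abc abd');

# \1 must repeat whatever group 1 matched, which no finite automaton can check
# for words of unbounded length.
my %is_match = map { $_ => ($_ =~ /^(\w+) \1$/ ? 1 : 0) } @inputs;

printf "%-8s %s\n", $_, ($is_match{$_} ? 'match' : 'no match') for @inputs;
# abc abc  match
# abc abd  no match
```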
 
Rob Perkins

[x-posted to m.p.d.f because it concerns the .NET Framework's regex-er
as well...]

Matt Garrish said:
Does it make a little more sense now why Microsoft's implementation is
wrong?

I'm not ready to call it "wrong", but I'm getting close. OK, so we
start with:

/(?<!")"(?!")(.*?)(?<!")"(?!")/

Removing the lookahead and lookbehind stuff, (in other words, don't
worry about the paired doublequote case) I get a pattern which reads:

/"(.*?)"/

...which includes the quotes in the match, in the .NET implementation.
In Perl, the quotes get consumed before the match is constructed. But
if I do this:

/".*?"/

Then the regex matches include the quote characters, in either
implementation. So apparently in the .NET implementation there is no
semantic difference between the two smaller cases.
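
Perl itself exposes both views of a match: $& holds the entire matched text, while $1 holds only the first group. The sketch below shows the two side by side. As an assumption (the thread does not show the .NET code or tool output), a tool that displays the whole match rather than group 1 would show the quotes for both patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'The "quick" brown fox';
my ($whole, $group);

if ($text =~ /"(.*?)"/) {
    $whole = $&;   # everything the pattern matched
    $group = $1;   # only what the parentheses matched
}

print "whole match: $whole\n";   # whole match: "quick"
print "group 1:     $group\n";   # group 1:     quick
```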

And... now it begins to make a bit more sense. One implementor decided
there was no distinction in that difference. Another did.

It makes me wonder if this .NET implementation approach is shared by
other implementations. IOW, is the desirable (for my problem) behavior
unique to Perl 5, or is the undesirable behavior unique to .NET?

TMTOWTDI. But it represents a case which works desirably for me under
Perl, and generates a bit more work for me under the .NET Framework's
regex engine.

OK, so that leads me then to a case where this particular regex fails,
even in the Perl implementation. Consider the case of:

The "quick" brown "fox jumped ""over""" the lazy dog.

The desirable matches are:

quick
fox jumped ""over""

but this regex returns only

quick

If I stick whitespace between the second and third quote after "over"
then it returns:

quick
fox jumped ""over""<space>

Again, the plain-english description is "all text between a pair of
doublequote characters, except that paired doublequotes inside a
quoted string are part of the match."
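
The failure described above can be reproduced with a short script. This is a sketch using the lookaround pattern quoted earlier in this post; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re = qr/(?<!")"(?!")(.*?)(?<!")"(?!")/;

my $tight  = 'The "quick" brown "fox jumped ""over""" the lazy dog.';
my $spaced = 'The "quick" brown "fox jumped ""over"" " the lazy dog.';

my @tight  = $tight  =~ /$re/g;
my @spaced = $spaced =~ /$re/g;

print "tight:  [$_]\n" for @tight;    # tight:  [quick]
print "spaced: [$_]\n" for @spaced;   # spaced: [quick] then [fox jumped ""over"" ]
```

In the first string every quote after "over" touches another quote, so none of them passes the lookarounds and the second field never closes.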

What do you think the regex will be?

Rob
 
Matt Garrish

Rob Perkins said:
[x-posted to m.p.d.f because it concerns the .NET Framework's regex-er
as well...]

Matt Garrish said:
Does it make a little more sense now why Microsoft's implementation is
wrong?

I'm not ready to call it "wrong", but I'm getting close. OK, so we
start with:

/(?<!")"(?!")(.*?)(?<!")"(?!")/

Removing the lookahead and lookbehind stuff, (in other words, don't
worry about the paired doublequote case) I get a pattern which reads:

/"(.*?)"/

...which includes the quotes in the match, in the .NET implementation.
In Perl, the quotes get consumed before the match is constructed. But
if I do this:

/".*?"/

Then the regex matches include the quote characters, in either
implementation. So apparently in the .NET implementation there is no
semantic difference between the two smaller cases.

And... now it begins to make a bit more sense. One implementor decided
there was no distinction in that difference. Another did.

It makes me wonder if this .NET implementation approach is shared by
other implementations. IOW, is the desirable (for my problem) behavior
unique to Perl 5, or is the undesirable behavior unique to .NET?

To put it bluntly, who cares?

You should figure out if you're using the .net framework or if you're
writing a perl script and write your code according to the rules of the
language. I have no idea how regular expressions are implemented within
.net, and am not going to figure them out for you.

Chances are you've misread the documentation (i.e., for all I know there may
be an implicit capture around the entire pattern in .net). It could also be
a poorly written program you're using to test with. If you run into any perl
problems with your regex feel free to post again, otherwise please stick to
the appropriate forum.

Matt
 
Rob Perkins

Matt Garrish said:
To put it bluntly, who cares?

comp.lang.perl... what again?

Ohyeah! MISC.

Hmm.

You should figure out if you're using the .net framework or if you're
writing a perl script and write your code according to the rules of the
language. I have no idea how regular expressions are implemented within
.net, and am not going to figure them out for you.

What of those using ActivePerl on the .NET framework? Which
implementation will they hit?

Chances are you've misread the documentation (i.e., for all I know there may
be an implicit capture around the entire pattern in .net). It could also be
a poorly written program you're using to test with.

Anything, I see, except the notion that Perl's regex implementation
might be bugged. Well, OK, if that's the way you want it, but...

If you run into any perl
problems with your regex feel free to post again, otherwise please stick to
the appropriate forum.

You're leaving me with a really bad impression of the perl aficionados
around here. People must be far more interested in evangelism and
computing mysticism here than in solving decently complex problems.

You saw the thread title. You saw that I'd asked the question of two
groups who might know practical somethings about regex's. For the
record, the folks over in the .NET groups have not berated me for also
talking about Perl.

The best approach would likely have been a silent plonking, I'd
estimate, either of the thread or of me, but the public "get offa my
cloud" is a bit much, dontcha think?

Rob
 
Matt Garrish

Rob Perkins said:
comp.lang.perl... what again?

Ohyeah! MISC.

Hmm.

Yes, it's for miscellaneous *Perl* questions. Stop being a dolt and figure
out that Perl and .net are not the same thing. Python has regular
expressions, why aren't you bugging the Python people? Even javascript has
them. I'm sure the people in those groups would love to hear about your .net
woes.

What of those using ActivePerl on the .NET framework? Which
implementation will they hit?

Are you really this stupid, or are you just trolling?

Anything, I see, except the notion that Perl's regex implementation
might be bugged. Well, OK, if that's the way you want it, but...

There is nothing buggy about Perl's implementation. It does exactly what
it's supposed to do. You have yet to show how it doesn't. According to you,
Perl's regexes are buggy because they don't capture text they're not
supposed to capture. All that has come from this thread is confirmation that
you are an idiot.

You're leaving me with a really bad impression of the perl aficionados
around here. People must be far more interested in evangelism and
computing mysticism here than in solving decently complex problems.

There is nothing complex about your problem, and it was answered a long time
ago. That it was not answered to *your* satisfaction is no one's problem but
your own.

Matt
 
Rob Perkins

Matt Garrish said:
Yes, it's for miscellaneous *Perl* questions. Stop being a dolt and figure
out that Perl and .net are not the same thing. Python has regular
expressions, why aren't you bugging the Python people?

Didn't know Python had regular expressions.

Even javascript has
them.

Didn't know that, either.

Are you really this stupid, or are you just trolling?

Neither. I'm genuinely curious about that one, what with my current
Perl engine being the ActivePerl stuff.

All that has come from this thread is confirmation that
you are an idiot.

Well, we'll just let stand that I haven't called anyone any names yet,
I guess.

And you're wrong about one thing: The other thing that came from this
thread was a workable regular expression that I could use for my
problem, and the revelation that regular doesn't mean "the same across
platforms."

For that, I thank all those who took time to educate me without
casting aspersions. I've learned quite a lot.

There is nothing complex about your problem, and it was answered a long time
ago. That it was not answered to *your* satisfaction is no one's problem but
your own.

Then I have to ask, rhetorically of course, since I'm going to stick
your name in my bit bucket right after this, why do you keep
responding?

And I'll add that even the most self-important of those on the .NET/MS
newsgroups have never taken a misplaced question or a simplistic (to
them) problem and used that as a springboard to call me a dolt, an
idiot, a simpleton, a bad programmer, a non-reader of documentation,
or any other dehumanizing and overgeneralizing name.

There's just a precious few (two so far, around here) who insist that
the novices at their specialty are idiots, or that those who use
Microsoft stuff are clearly "agin' us."

Bubbye, Matt! <plonk>

Rob
 
Dave Cross

Didn't know Python had regular expressions.


Didn't know that, either.


Neither. I'm genuinely curious about that one, what with my current
Perl engine being the ActivePerl stuff.

ActivePerl is Perl. It is an implementation of the Perl language.
Therefore it uses regular expressions in the way defined by Perl.

Dave...
 
Steven Kuo

On Tue, 13 Apr 2004, Rob Perkins wrote:


(snipped)
TMTOWTDI. But it represents a case which works desirably for me under
Perl, and generates a bit more work for me under the .NET Framework's
regex engine.

OK, so that leads me then to a case where this particular regex fails,
even in the Perl implementation. Consider the case of:

The "quick" brown "fox jumped ""over""" the lazy dog.


The fragility of the (previous) solution would indicate that this
type of problem isn't best handled with regular expressions. You
should look at different approaches as shown by others in this
thread.
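
As one example of a different approach, here is a minimal sketch of a character-by-character scanner. The helper name quoted_fields and its handling of an unterminated field are assumptions made for illustration; nothing like it was posted in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of each quoted field, keeping
# doubled quotes ("") inside a field as part of that field.
sub quoted_fields {
    my ($text) = @_;
    my @chars = split //, $text;
    my (@fields, $current);
    my $in_quotes = 0;

    for (my $i = 0; $i < @chars; $i++) {
        my $c = $chars[$i];
        if (!$in_quotes) {
            # Outside a field: only an opening quote matters.
            if ($c eq '"') { $in_quotes = 1; $current = ''; }
        }
        elsif ($c ne '"') {
            $current .= $c;
        }
        elsif ($i + 1 < @chars && $chars[$i + 1] eq '"') {
            # Doubled quote inside a field: keep both characters.
            $current .= '""';
            $i++;
        }
        else {
            # Lone quote: the field ends here.
            push @fields, $current;
            $in_quotes = 0;
        }
    }
    return @fields;    # an unterminated field is silently dropped
}

my @fields = quoted_fields('The "quick" brown "fox jumped ""over""" the lazy dog.');
print "$_\n" for @fields;
# quick
# fox jumped ""over""
```

The explicit state makes each decision visible, at the cost of more lines than a single pattern.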

The desirable matches are:

quick
fox jumped ""over""

but this regex returns only

quick

If I stick whitespace between the second and third quote after "over"
then it returns:

quick
fox jumped ""over""<space>

Again, the plain-english description is "all text between a pair of
doublequote characters, except that paired doublequotes inside a
quoted string are part of the match."



Your specification of the problem is also incomplete. For example,
how would you parse this string?

$_ = q{""""};

One can claim that no matches are to be found as the string is two
pairs of quotes; one can also equally claim that it's a single pair
of quotes enclosed within quotes.
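
Both readings can be produced in Perl, depending on the pattern. This sketch uses two simple patterns chosen for illustration (neither is the long expression given further down) to show that the ambiguity is real.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{""""};

# Reading 1: two separate, empty quoted strings.
my @lazy = $s =~ /"(.*?)"/g;

# Reading 2: one quoted string whose content is a doubled quote.
my @doubled = $s =~ /"((?:""|[^"])*)"/g;

printf "lazy:    %d match(es)\n", scalar @lazy;       # lazy:    2 match(es)
printf "doubled: %d match(es)\n", scalar @doubled;    # doubled: 1 match(es)
```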

What do you think the regex will be?


Regardless, if you're just interested in regular expressions, you
may want to try:

$_ = 'The "quick" brown "fox jumped ""over""" the lazy dog.';

# very ugly:

my @arr = m/(?<!")"(?=(?:"")*(?!"))(.*?(?:(?<!")(?:"")*|(?<!")))"(?!")/g;

print join "\n", @arr;

Again, this is an (untested) solution in Perl. Any resemblance to
.NET is coincidental.
 
Rob Perkins

Steven Kuo said:
Your specification of the problem is also incomplete. For example,
how would you parse this string?

$_ = q{""""};

Depending on the implementation of regexes, based on what I've learned
here, I'd expect a single match off of that string, with no text in
it. Which is what your Perl regex delivers, it seems.

Again, this is an (untested) solution in Perl. Any resemblance to
.NET is coincidental.

Works very nicely, thank you!

These

Rob
 
Joe Smith

Rob said:
Didn't know Python had regular expressions.

Didn't know that, either.

Lots of programs have Perl-Compatible Regular Expressions.
http://www.pcre.org/

Perl does _NOT_ use the system-supplied routines. It uses its own.
It has to, in order to provide consistent results on all platforms.

Check the PLATFORMS section of "perldoc perlport" to get an idea
of all the non-.NET platforms perl has compatibility with.

-Joe
 
John W. Kennedy

Dave said:
ActivePerl is Perl. It is an implementation of the Perl language.
Therefore it uses regular expressions in the way defined by Perl.

More than that, ActivePerl is (/inter alia/) perl.
 
