Tough (for me) regex case

Rob Perkins

Hello,

I know I'm not a regular, and I'm new to the arcana of regular
expressions, so I'm a little stuck with two specific cases and I'm
hoping for a genius:

The case I'm most stumped on is an input string like this:

The "quick" brown "fox jumped ""over"" the" lazy dog.

Where what I want to have matched is the quoted strings, except that
paired doublequotes don't count, and I don't want to capture the
quotemarks. In other words, my desired matches are:

<>
quick
fox jumped ""over"" the
</>

If I use: /".+?"/, I get:

<>
"quick"
"fox jumped "
"over"
" the "
</>

...which isn't right. If I use /".+"/, I get:

<>
"quick" brown "fox jumped ""over"" the"
</>

...which also isn't right. So I don't know how to proceed and get the
match of the strings contained in doublequotes, with the paired
doublequotes escaped, and the matches without the quotes.

How would you do it?

Rob
 
Steven Kuo

On Thu, 1 Apr 2004, Rob Perkins wrote:

...
The case I'm most stumped on is an input string like this:

The "quick" brown "fox jumped ""over"" the" lazy dog.

Where what I want to have matched is the quoted strings, except that
paired doublequotes don't count, and I don't want to capture the
quotemarks. In other words, my desired matches are:

<>
quick
fox jumped ""over"" the
</>

If I use: /".+?"/, I get:

<>
"quick"
"fox jumped "
"over"
" the "
</>

...which isn't right. If I use /".+"/, I get:

<>
"quick" brown "fox jumped ""over"" the"
</>

...which also isn't right. So I don't know how to proceed and get the
match of the strings contained in doublequotes, with the paired
doublequotes escaped, and the matches without the quotes.

How would you do it?

Rob




One way to do it is:

$_ = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

my @matches = m/(?<!")"(?!")(.*?)(?<!")"(?!")/g;


See 'perldoc perlre' for look-ahead and look-behind assertions.
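
For readers following along outside Perl, here is a minimal sketch of the same look-around idea in Python. The variable names are made up for illustration. It relies on the pattern having exactly one capture group, so only the text between the delimiting quotes comes back. As far as I can tell it assumes a quoted string is never empty and never starts or ends with a doubled quote.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# A delimiter is a double quote with no double quote directly
# before it (look-behind) or after it (look-ahead).
LONE_QUOTE = r'(?<!")"(?!")'
pattern = re.compile(LONE_QUOTE + r'(.*?)' + LONE_QUOTE)

# With a single capture group, findall returns the group contents,
# not the whole match, so the delimiting quotes are left out.
matches = pattern.findall(text)
print(matches)  # ['quick', 'fox jumped ""over"" the']
```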
 
John Bokma

Rob said:
Hello,

I know I'm not a regular, and I'm new to the arcana of regular
expressions, so I'm a little stuck with two specific cases and I'm
hoping for a genius:

The case I'm most stumped on is an input string like this:

The "quick" brown "fox jumped ""over"" the" lazy dog.

Where what I want to have matched is the quoted strings, except that
paired doublequotes don't count, and I don't want to capture the
quotemarks. In other words, my desired matches are:

What I do sometimes is:

[1] replace the problem character(s) with something that, according to
the specs, can never occur in the string (e.g. s/.../\000/g; where
... stands for the problem char(s), not three literal dots).
[2] do your thing
[3] undo step 1 (e.g. s/\000/.../g; see the remark in step [1] about ...)
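
The three steps above can be sketched roughly as follows, in Python rather than Perl. This assumes the NUL character never occurs in the input, and that an empty quoted string ("") never needs to be matched, since it would be masked along with the doubled quotes. PLACEHOLDER and the other names are hypothetical.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'
PLACEHOLDER = "\x00"  # assumption: never present in the input

# [1] hide the doubled quotes behind the placeholder
masked = text.replace('""', PLACEHOLDER)

# [2] do the simple match: a quote, anything that is not a quote, a quote
found = re.findall(r'"([^"]*)"', masked)

# [3] undo step 1 inside each match
matches = [m.replace(PLACEHOLDER, '""') for m in found]
print(matches)  # ['quick', 'fox jumped ""over"" the']
```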
 
Rob Perkins

Steven Kuo said:
One way to do it is:

$_ = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

my @matches = m/(?<!")"(?!")(.*?)(?<!")"(?!")/g;

Thank you very much, and thanks for the perlre reference.

/(?<!")"(?!")(.*?)(?<!")"(?!")/, on my sample string, produces:

<>
"quick"
"fox jumped ""over"" the"
</>

How should I modify the regex to get:
<>
quick
fox jumped ""over"" the
</>

...in other words, without the quotes as first and last characters in
the matches?

Rob
 
ko

Rob said:
Hello,

I know I'm not a regular, and I'm new to the arcana of regular
expressions, so I'm a little stuck with two specific cases and I'm
hoping for a genius:

The case I'm most stumped on is an input string like this:

The "quick" brown "fox jumped ""over"" the" lazy dog.

Where what I want to have matched is the quoted strings, except that
paired doublequotes don't count, and I don't want to capture the
quotemarks. In other words, my desired matches are:

<>
quick
fox jumped ""over"" the
</>

If I use: /".+?"/, I get:

<>
"quick"
"fox jumped "
"over"
" the "
</>

...which isn't right. If I use /".+"/, I get:

<>
"quick" brown "fox jumped ""over"" the"
</>

...which also isn't right. So I don't know how to proceed and get the
match of the strings contained in doublequotes, with the paired
doublequotes escaped, and the matches without the quotes.

How would you do it?

Rob

Take a look at the Text::Balanced module - here's a short example:

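use strict;
use warnings;
use Text::Balanced qw[ extract_delimited ];

my $text = q[The "quick" brown "fox jumped ""over"" the" lazy dog.];

# In list context extract_delimited returns (extracted, remainder, prefix).
# Arguments: text, delimiter, prefix pattern to skip, escape character.
while (my ($extracted, $remainder) =
       extract_delimited($text, '"', '[^"]+', '"') )
{
    # Stop when nothing was extracted; otherwise strip the outer quotes.
    last unless defined $extracted && $extracted =~ s#^"(.*)"$#$1#;
    print "EXTRACTED TEXT: $extracted\n";
    $text = $remainder;
}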

HTH - keith
 
mortb

"(""|[^"])+"

gets me:
(1) "quick"
(2) "fox jumped ""over"" the"

- the | is an OR-operator which means that after " you match either "" or
any character except "

I've tried to get rid of the initial and ending quotes -- I think it's
possible in the same expression -- but I haven't succeeded -- yet.
tricky? -- yes!

/mortb
 
mortb

...and
(?<=")(""|[^"])+(?=")

gets me...

(1) quick
(2) brown (notice the space before)
(3) fox jumped ""over"" the

this got rid of the quotes but introduced the " brown" error
Perhaps you can live with the quotes....

/mortb
 
Rob Perkins

Perhaps you can live with the quotes....

Not really. I'm localizing an application. The regex is part of the
parser I'm using to identify string constants buried in the code, and
replace them with calls into a hashtable which uses the English string
as the source data for

I worked around it by taking the resulting match and using string
length and positioning data to remove the quotes. Not a big deal, but
I was hoping for something really regex-elegant. Everyone who loves
regexes seems to think it's possible, but no one I know personally has
figured out how to do it quite yet...

Rob
 
Rob Perkins

Rob Perkins said:
and
replace them with calls into a hashtable which uses the English string
as the source data for

should have been

and replace them with calls into a hashtable which uses the English
string as the source data for the hash function.

Rob
 
Brian Davis

Does my earlier suggestion about using a named group not work?

The problem with using look-ahead/behind is that since the quotes do not
actually get consumed by the match, they remain available for the next
match. This is why " brown" appears to be a match when it is not. You must
consume the quotes with the match so they will not be re-used. Grouping
constructs can then be used to extract only the part of the match you need.
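
A small Python sketch of the effect described above, under the assumption that the engine behaves like Python's re module. The first pattern only looks at the quotes, so the quote that closes "quick" is still available to open the next match. The second pattern consumes both quotes and uses a group for the contents. The variable names are invented for the example.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# Look-around only: the quotes are tested but never consumed.
lookaround_only = re.findall(r'(?<=")(?:""|[^"])+(?=")', text)
print(lookaround_only)
# ['quick', ' brown ', 'fox jumped ""over"" the']

# Quotes consumed by the match; the group keeps only the contents.
consuming = re.findall(r'"((?:""|[^"])*)"', text)
print(consuming)
# ['quick', 'fox jumped ""over"" the']
```

Note that the spurious match comes back with a space on both sides, because everything between the two re-used quotes is taken.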

Brian Davis
http://www.knowdotnet.com
 
Rob Perkins

Brian Davis said:
Does my earlier suggestion about using a named group not work?

It looked like it would work, though I can't claim a lot of expertise.

I made use of

(?<!")"(?!")(.*?)(?<!")"(?!")

(offered by Steven Kuo on comp.lang.perl.misc)

...which works nicely, and then just used
Microsoft.VisualBasic.Mid(s,2,Microsoft.VisualBasic.Len(s)-2) to strip
the first and last characters, which with that regex are always
doublequotes.

Might be slow 'n' ugly, but I'm not releasing this code.

I would have tried your named group suggestion, but I had to move on
to manipulating resx files so the compiler doesn't barf on 'em, and it
came in later than the other suggestion.

Thank you, though!

Rob
 
mortb

Sorry Brian,
I tested your expression "(?<no_quotes>(""|[^"])*)" and it also rendered

(1) "quick"
(2) "fox jumped ""over"" the"

thus still leaving the initial and ending quotes in the strings.

cheers,
mortb
 
Richard Morse

Rob Perkins said:
/(?<!")"(?!")(.*?)(?<!")"(?!")/, on my sample string, produces:

<>
"quick"
"fox jumped ""over"" the"
</>

How should I modify the regex to get:
<>
quick
fox jumped ""over"" the
</>

...in other words, without the quotes as first and last characters in
the matches?

You could add a second and third pass:
s/^"//;
s/"$//;

HTH,
Ricky
 
Brian Davis

The match itself contains the quotes, but the named group 'no_quotes'
contains only the text within the quotes.

As I mentioned in the reply, the match should consume the quotes so they
will not be re-used in other matches (the " brown" problem). You can then
use a named group to extract only a portion of the actual match.
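
To make the difference between the match and the group concrete, here is a hedged sketch in Python. Python spells a named group (?P<name>...) rather than the .NET (?<name>...), and the inner alternation is made non-capturing here, which is a choice of this example and not part of the expression posted above.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# Python syntax for the named group; the inner (?:...) is non-capturing.
pattern = re.compile(r'"(?P<no_quotes>(?:""|[^"])*)"')

whole = []   # the full matches, quotes included
inner = []   # the contents of the named group only
for m in pattern.finditer(text):
    whole.append(m.group(0))
    inner.append(m.group('no_quotes'))

print(whole)  # ['"quick"', '"fox jumped ""over"" the"']
print(inner)  # ['quick', 'fox jumped ""over"" the']
```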


Brian Davis
http://www.knowdotnet.com


mortb said:
Sorry Brian,
I tested your expression "(?<no_quotes>(""|[^"])*)" and it also rendered

(1) "quick"
(2) "fox jumped ""over"" the"

thus still leaving the initial and ending quotes in the strings.

cheers,
mortb
 
Kevin Collins

Brian Davis said:
You can use the following expression:

"(?<no_quotes>(""|[^"])*)"

Simply access the value of the named group "no_quotes" for each match
returned.


Brian Davis
http://www.knowdotnet.com



Rob Perkins said:
Hello,

I know I'm not a regular, and I'm new to the arcana of regular
expressions, so I'm a little stuck with two specific cases and I'm
hoping for a genius:

The case I'm most stumped on is an input string like this:

The "quick" brown "fox jumped ""over"" the" lazy dog.

Where what I want to have matched is the quoted strings, except that
paired doublequotes don't count, and I don't want to capture the
quotemarks. In other words, my desired matches are:

-SNIP-

Can you point me to some documentation (man page, etc) that describes
a "named group"? I've searched the perlre man page and cannot seem to
find any reference to named groups or an example similar to yours.

How about showing us (me?) how to access the named group?

Thanks,

Kevin
 
ko

Brian Davis said:
You can use the following expression:

"(?<no_quotes>(""|[^"])*)"

Simply access the value of the named group "no_quotes" for each match
returned.


Brian Davis
http://www.knowdotnet.com
[snip]

Can you point me to some documentation (man page, etc) that describes
a "named group"? I've searched the perlre man page and cannot seem to
find any reference to named groups or an example similar to yours.

How about showing us (me?) how to access the named group?

Thanks,

Kevin

I couldn't find a reference either - also checked 'perlreref' and
'perlrequick'.

The OP posted to microsoft.public.dotnet.framework besides
comp.lang.perl.misc, so did a quick index check of the second edition
of 'Mastering Regular Expressions'. It states that named capture is
specific to .NET and Python.

HTH - keith
 
Matt Garrish

Richard Morse said:
You could add a second and third pass:
s/^"//;
s/"$//;

Second and third passes? Yuck!

The quotes could be removed in one regex, but in this case the match pattern
*does not* produce the results the OP claims and so no other processing
should be necessary. If there are quotation marks at the beginning and end
of the strings, I would hazard a guess that the OP added them somewhere in
his code.
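
One possible explanation, sketched in Python: the whole match and the first capture group are different things. Reading the whole match (Perl's $&, or Match.Value in .NET) includes the delimiting quotes, while the group ($1, or Groups[1].Value) does not. If only the whole match is at hand, both quotes can still be removed in a single substitution. The names below are illustrative only.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'
pattern = re.compile(r'(?<!")"(?!")(.*?)(?<!")"(?!")')

# group(0) is the whole match; group(1) is what the parentheses captured.
whole = [m.group(0) for m in pattern.finditer(text)]
captured = [m.group(1) for m in pattern.finditer(text)]

# One substitution that drops a leading and a trailing quote together.
stripped = [re.sub(r'^"|"$', '', w) for w in whole]

print(whole)     # ['"quick"', '"fox jumped ""over"" the"']
print(captured)  # ['quick', 'fox jumped ""over"" the']
```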

Matt
 
Matt Garrish

ko said:
Kevin Collins said:
Brian Davis said:
You can use the following expression:

"(?<no_quotes>(""|[^"])*)"

Simply access the value of the named group "no_quotes" for each match
returned.
Can you point me to some documentation (man page, etc) that describes
a "named group"? I've searched the perlre man page and cannot seem to
find any reference to named groups or an example similar to yours.

How about showing us (me?) how to access the named group?

I couldn't find a reference either - also checked 'perlreref' and
'perlrequick'.

The OP posted to microsoft.public.dotnet.framework besides
comp.lang.perl.misc, so did a quick index check of the second edition
of 'Mastering Regular Expressions'. It states that named capture is
specific to .NET and Python.

They allow you to give a name to your captured text as you're writing your
expression. So, instead of using $1, $2, etc. you could just reference the
captured text by the name you gave it. Not the most useful of additions, in
my opinion.

Matt
 
Brian McGonigle

Got it!!!

$text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

while ($text =~ /"(.*?)"/g) {
    if ($text =~ /"$_("".*?"")/) {
        push @matches, ($1);
        print "FOUND: $1\n";
    }
    elsif ($text =~ /(""$1"")(.*?)"/) {
        push @matches, $1;
        print "FOUND: $1 \n";
    }
    else {
        push @matches, $1;
        print "FOUND: $1\n";
    }
}

print "MATCHES: @matches\n";


Prints...

FOUND: quick
FOUND: fox jumped
FOUND: ""over""
FOUND: the
MATCHES: quick fox jumped ""over"" the


mortb said:
...and
(?<=")(""|[^"])+(?=")

gets me...

(1) quick
(2) brown (notice the space before)
(3) fox jumped ""over"" the

this got rid of the quotes but introduced the " brown" error
Perhaps you can live with the quotes....

/mortb

mortb said:
"(""|[^"])+"

gets me:
(1) "quick"
(2) "fox jumped ""over"" the"

- the | is an OR-operator which means that after " you match either "" or
any character except "

I've tried to get rid of the initial and ending quotes -- I think it's
possible in the same expression -- but I haven't succeeded -- yet.
tricky? -- yes!

/mortb
 
