Regex parsing question


Paul Hanchett

Why does this:

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

t = text.sub( regex, "z" );
print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

$1=AA<X>BB
$2=CC</X>DD</X>EE
$3=
$4=

Instead of:

$1=AA
$2=BB<X>CC</X>DD</X>EE
$3=
$4=

And how would I fix it?

Paul
 

James Edward Gray II

Why does this:

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

t = text.sub( regex, "z" );
print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

$1=AA<X>BB
$2=CC</X>DD</X>EE
$3=
$4=

Because the construct .* means "zero or more non-newline characters,
but as many as I can get". We say the * operator is "greedy".

Instead of:

$1=AA
$2=BB<X>CC</X>DD</X>EE
$3=
$4=

And how would I fix it?

One way would be to switch from the greedy * to the non-greedy
(sometimes called "lazy") *?, which matches as few characters as it can.
That would have your Regexp looking like this:

%r{(.*?)<X>(.*)}
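
To see the two side by side, here is a small sketch using the text from the question. It uses String#match and MatchData#captures rather than sub and the $1/$2 globals, and the variable names greedy and lazy are only for illustration:

```ruby
text = "AA<X>BB<X>CC</X>DD</X>EE"

# Greedy: the first group grabs as much as it can, so it ends at the LAST <X>.
greedy = text.match(%r{(.*)<X>(.*)})

# Non-greedy: the first group stops at the earliest <X> that lets the match succeed.
lazy = text.match(%r{(.*?)<X>(.*)})

p greedy.captures  # => ["AA<X>BB", "CC</X>DD</X>EE"]
p lazy.captures    # => ["AA", "BB<X>CC</X>DD</X>EE"]
```

Note that the second group can stay greedy: once the first <X> has been matched, you want it to take everything that is left.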

Another way is to use split() with a limit:

irb(main):001:0> text= "AA<X>BB<X>CC</X>DD</X>EE"
=> "AA<X>BB<X>CC</X>DD</X>EE"
irb(main):002:0> first, rest = text.split(/<X>/, 2)
=> ["AA", "BB<X>CC</X>DD</X>EE"]
irb(main):003:0> first
=> "AA"
irb(main):004:0> rest
=> "BB<X>CC</X>DD</X>EE"
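
One thing worth checking with either approach is input that contains no <X> at all. A quick sketch, where the input string is made up for illustration:

```ruby
text = "AABB"  # hypothetical input with no <X> in it

# split() returns a one-element Array, so the second variable ends up nil.
first, rest = text.split(/<X>/, 2)
p first  # => "AABB"
p rest   # => nil

# The non-greedy Regexp simply fails to match in that case.
no_match = text.match(%r{(.*?)<X>(.*)})
p no_match  # => nil
```

So in both cases you would want a nil check before using the second piece.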

Hope that helps.

James Edward Gray II
 

David A. Black

Hi --

Why does this:

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

t = text.sub( regex, "z" );
print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

$1=AA<X>BB
$2=CC</X>DD</X>EE
$3=
$4=

Instead of:

$1=AA
$2=BB<X>CC</X>DD</X>EE
$3=
$4=

Because * is "greedy" -- meaning, it eats up as many characters as
possible, from left to right, while still allowing for a successful
match overall.
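
Here is a tiny illustration of "as many as possible, while still allowing for a successful match overall". The string and pattern are made up for the demonstration and are not from the original post:

```ruby
# The greedy a* first takes all three "a"s, then gives one back so that
# the literal "a" after it still has something to match.
m = "aaa".match(/(a*)(a)/)
p m[1]  # => "aa"
p m[2]  # => "a"

# With nothing required after it, the same a* keeps everything.
whole = "aaa"[/a*/]
p whole  # => "aaa"
```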

So your first .* eats up everything until it reaches as far right as
it possibly can -- namely, just before the second <X> (which it then
leaves intact so that it can be matched by the literal <X> in your
regex). It even eats up the first <X>.

And how would I fix it?

Use *? instead of * -- like this:

regex = %r{(.*?)<X>(.*)}


David
 

Nikolai Weibull

* Paul Hanchett (Mar 31, 2005 23:00):
text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

use

regex = %r{(.*?)<X>(.*)}

The greedy .* will match the first <X> and will only relinquish the
second so that an overall match can be made (for the <X>-part of the
regex). The non-greedy .*? stops at the first <X> instead.

nikolai
 
