Regex parsing question

Paul Hanchett · Mar 31, 2005

Why does this:

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

t = text.sub( regex, "z" );
print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

$1=AA<X>BB
$2=CC</X>DD</X>EE
$3=
$4=

Instead of:

$1=AA
$2=BB<X>CC</X>DD</X>EE
$3=
$4=

And how would I fix it?

Paul

James Edward Gray II · Mar 31, 2005

Why does this:

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

t = text.sub( regex, "z" );
print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

$1=AA<X>BB
$2=CC</X>DD</X>EE
$3=
$4=

Because the construct .* means, "Zero or more non-newline characters,
but as many as I can get". We say the * operator is "greedy".
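
A minimal sketch of that greedy behavior, using the text and regex from the question. The names m and greedy_captures are only illustrative; they are not in the original post:

```ruby
text = "AA<X>BB<X>CC</X>DD</X>EE"

# Greedy: the first .* runs to the end of the string, then backs up
# only as far as the last "<X>" it can find.
m = text.match(%r{(.*)<X>(.*)})
greedy_captures = m.captures

p greedy_captures  # => ["AA<X>BB", "CC</X>DD</X>EE"]
```

Note that "</X>" never matches the literal <X>, so the last place the regex can anchor is the second <X>.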

Instead of:

$1=AA
$2=BB<X>CC</X>DD</X>EE
$3=
$4=

And how would I fix it?

One way would be to switch from the greedy * to the conservative
(non-greedy) *?, which matches as few characters as it can.
That would have your Regexp looking like this:

%r{(.*?)<X>(.*)}
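
Dropped into the original script, that should behave roughly as sketched below. The names lazy, replaced, first and rest are only illustrative, and the captures are copied into locals right after the sub() call because $1 and $2 change on the next match:

```ruby
text = "AA<X>BB<X>CC</X>DD</X>EE"
lazy = %r{(.*?)<X>(.*)}

# The non-greedy (.*?) stops at the first "<X>"; the trailing (.*)
# is still greedy and takes everything that is left.
replaced = text.sub(lazy, "z")
first, rest = $1, $2

print "$1=#{first}\n$2=#{rest}\n"
# $1=AA
# $2=BB<X>CC</X>DD</X>EE
```

Since the pattern covers the whole string, sub() replaces all of it and replaced ends up as just "z".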

Another way is to use split() with a limit:

irb(main):001:0> text= "AA<X>BB<X>CC</X>DD</X>EE"
=> "AA<X>BB<X>CC</X>DD</X>EE"
irb(main):002:0> first, rest = text.split(/<X>/, 2)
=> ["AA", "BB<X>CC</X>DD</X>EE"]
irb(main):003:0> first
=> "AA"
irb(main):004:0> rest
=> "BB<X>CC</X>DD</X>EE"

Hope that helps.

James Edward Gray II

David A. Black · Mar 31, 2005

Hi --

Why does this:

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

t = text.sub( regex, "z" );
print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

$1=AA<X>BB
$2=CC</X>DD</X>EE
$3=
$4=

Instead of:

$1=AA
$2=BB<X>CC</X>DD</X>EE
$3=
$4=

Because * is "greedy" -- meaning, it eats up as many characters as
possible, from left to right, while still allowing for a successful
match overall.
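
A small sketch of the "while still allowing for a successful match" part, using two shortened strings that are made up for illustration (they are not from the original post):

```ruby
pattern = %r{(.*)<X>(.*)}

# With a single "<X>", the greedy .* first swallows the whole string,
# then gives characters back until the literal "<X>" can match.
one_tag = "AA<X>BB".match(pattern)
one_tag_captures = one_tag.captures
p one_tag_captures  # => ["AA", "BB"]

# With no "<X>" at all, no amount of giving back helps: no match.
no_tag = "AABB".match(pattern)
p no_tag  # => nil
```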

So your first .* eats up everything until it reaches as far right as
it possibly can -- namely, just before the second <X> (which it then
leaves intact so that it can be matched by the literal <X> in your
regex). It even eats up the first <X>.

And how would I fix it?

Use *? instead of * -- like this:

regex = %r{(.*?)<X>(.*)}

David

Nikolai Weibull · Mar 31, 2005

* Paul Hanchett (Mar 31, 2005 23:00):

text= "AA<X>BB<X>CC</X>DD</X>EE"
regex = %r{(.*)<X>(.*)}

Use

regex = %r{(.*?)<X>(.*)}

The .* will match the first <X> and will only relinquish the second so
that an overall match can be made (for the <X>-part of the regex).

nikolai

Paul Hanchett · Mar 31, 2005

Thanks all for the help. I understand better now.

Paul
