Regex Question

S

seijin

I recently started studying Ruby (1.8.6) and things are going well.

However, something bothered me when I got to the regular expression
section. This is my first time deal with regular expressions directly
so I'm completely new to things. I understand the idea behind regular
expressions and most of the usage but two of them stuck out for me.

I'm studying the 2nd edition Pragmatic book and using their
show_regexp method...

def show_regexp(a, re)
if a =~ re
"#{$`}<<#{$&}>>#{$'}"
else
"no match"
end
end

....hopefully I don't get in trouble posting that method here. So here
are my two problems...

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Thanks a lot!
 
R

Rob Biedenharn

I recently started studying Ruby (1.8.6) and things are going well.

However, something bothered me when I got to the regular expression
section. This is my first time deal with regular expressions directly
so I'm completely new to things. I understand the idea behind regular
expressions and most of the usage but two of them stuck out for me.

I'm studying the 2nd edition Pragmatic book and using their
show_regexp method...

def show_regexp(a, re)
if a =~ re
"#{$`}<<#{$&}>>#{$'}"
else
"no match"
end
end

...hopefully I don't get in trouble posting that method here. So here
are my two problems...

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

Well, it would but the * is 'zero or more times' and it can match
zero "an"s before it ever needs to get past the 'b'. If you change
the '*' to a '+' and force 'one or more times', then it has to find
the first "an" and will then greedily consume the next "an" also.
Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Thanks a lot!

The '+' is greedy so I think it tries \w+ as Mississippi and finds
that there's no following copy, backs up to try Mississipp and finds
that 'i' isn't a copy, and so on back to 'M' before moving past 'M'
to 'ississippi', 'ississipp', 'ississip', etc. until 'iss' has a
following copy and a match is declared. Think of it this way: Until
the regexp gets to the end, it doesn't know that the string isn't
"MississippiMississipp".

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
B

Bill Kelly

From: said:
def show_regexp(a, re)
if a =~ re
"#{$`}<<#{$&}>>#{$'}"
else
"no match"
end
end

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

It is greedy, but it's giving you the *first* match it
can find. Since it's allowed to match zero-or-more times
(because of the *), it's finding a match at the beginning
of the string. From that standpoint, as greedy as it can
be is to match zero characters.

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Essentially it scans the string for the first match it can find
that satisfies the expression.

Maybe it's like trying all possible locks that fit the key. ;)

I don't know for sure what order it tries the combinations, but
it's probably something like:

MM
MiMi
MisMis
MissMiss
...
MississippiMississippi
<no match, so advance to next character in string>
ii
isis
ississ
<match found>


Hope this helps,

Bill
 
M

Martin DeMello

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

It does, but it also returns the *first* available match, which in the
case of 'banana' is the empty string. Note that * matches 0 or more
iterations. + will work as expected:
=> "b said:
Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

(\w+)\1 matches one or more letters, twice in a row. The empty string
doesn't match the "one or more bit". This is just the opposite case to
your last question - (\w*)\1 matches the empty string. What the engine
does is match (\w+) against M, then \1 against another M which fails.
Likewise for Mi, Mis, Miss etc, till it hits the end and moves on to
i, is, iss... when it suddenly succeeds. Illustrative examples:
=> "<<bb>>MississippiMississippi"

martin

martin
 
M

Martin DeMello

your last question - (\w*)\1 matches the empty string. What the engine
does is match (\w+) against M, then \1 against another M which fails.
Likewise for Mi, Mis, Miss etc, till it hits the end and moves on to
i, is, iss... when it suddenly succeeds. Illustrative examples:

Oops - Rob is right, of course - + being greedy, it tries \w+ =
Mississippi first, then shrinks from there:
=> "<<MiMississippiMiMississippi>>"

martin
 
S

seijin

Oh, I see. So using the "*" with just one ... errr... thing will
never match it? "banana" =~ /a*/ will always be 0 but "banana" =~ /
a*n/ will be 1 (one) ? It's just confusing because I think of
"greedy" as it trying to match as many characters as possible. So
that although "*" tell it to find 0 or more that it would work hard to
return the maximum number of matches. Perhaps the difference between
"Find as many a's as possible" and "Find as many a's or nothing as
possible" ? If you see what I'm trying to say. No, I think I just
figured out what my problem is. I'm taking it out of context. The
"*" was created for use other arguments. I keep getting stuck on
using it with just one argument which is not what it was intended for.

And for Mississippi - is that method of matching only used when using
back references via the \1-9 ? If so, I can set my mind at ease as
that makes sense. The thing that was bothering me was that I didn't
see why it would work so hard to find the match. I thought it was a
simple... (\w+) matches "Mississippi" but there is no
"MississippiMississippi" so I'll return nil... type of thing. I
didn't realize it'd start scanning the string aggresively.

Thanks for the reply!
 
D

dblack

Hi --

Oh, I see. So using the "*" with just one ... errr... thing will
never match it? "banana" =~ /a*/ will always be 0 but "banana" =~ /
a*n/ will be 1 (one) ? It's just confusing because I think of
"greedy" as it trying to match as many characters as possible. So
that although "*" tell it to find 0 or more that it would work hard to
return the maximum number of matches.

Keep in mind, though, that it's looking for a pattern-match starting
at the left. If it finds it, it's finished; it's not going to keep
scanning the string. In the leftmost position, it finds a match for
a* -- namely, 0 occurrences of a. So it's done.

Here's an example that might help clarify it:

"xxyyxxxyyy" =~ /x*/ # 0

The longest stretch of x's is at position 5. However, the job of the
match engine is not to find the maximum number of x's; it's to find
the first match for /x*/, starting at the left. It finds that match
at position 0. There are only two x's, but that's neither here nor
there.


David

--
* Books:
RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242)
RUBY FOR RAILS (http://www.manning.com/black)
* Ruby/Rails training
& consulting: Ruby Power and Light, LLC (http://www.rubypal.com)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,190
Members
46,736
Latest member
zacharyharris

Latest Threads

Top