Do You Understand Regular Expressions?

growlatoe

Hi all.

I'm pretty new to Ruby and that sort of thing, and I'm having a few
problems understanding regular expressions. I'm hoping one of you can
point me in the right direction.

I want to replace an entire string with another string. I know you
don't need regular expressions for that, but it's part of a more
generic approach. Anyway, the problem I'm having is that my regular
expressions are finding two matches instead of one, and I don't
understand why. I've narrowed down my confusion to the following code,
which shows some output from irb:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

The same thing can be seen when substituting - this is closer to how
I'm using regular expressions in my code:

irb(main):001:0> "hello".gsub(/.*/, "P")
=> "PP"

Two substitutions are made and I expected one. So am I right or wrong
to expect one substitution?

Please help - this is driving me nuts!

And in case it helps...

$ ruby --version
ruby 1.8.5 (2006-08-25) [i486-linux]


Thanks in advance.
 
Tim Hunter

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

Try anchoring the match: /^.*/
 
Axel Etzold

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

You can search for at least one occurrence like this:

"hello".scan(/.+/)

"hello".gsub(/.+/, "P") => 'P'

As an introduction, I find

http://www.regular-expressions.info/ruby.html

quite instructive for the use of regexps in Ruby.
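
For what it's worth, here is a minimal sketch of the .* versus .+ difference discussed above. The variable names are only for illustration, and the outputs in the comments are what the thread reports for these calls:

```ruby
# * allows zero characters, so scan also reports the empty match
# found at the very end of the string.
star_matches = "hello".scan(/.*/)
# + requires at least one character, so the empty match is rejected.
plus_matches = "hello".scan(/.+/)

# gsub makes one substitution per match, hence "PP" versus "P".
star_sub = "hello".gsub(/.*/, "P")
plus_sub = "hello".gsub(/.+/, "P")

p star_matches  # ["hello", ""]
p plus_matches  # ["hello"]
p star_sub      # "PP"
p plus_sub      # "P"
```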

Best regards,

Axel
 
Daniel DeLorme

Axel said:
irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel
 
Ryan Mcdonald

Daniel said:
Axel said:
irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

I agree. Can someone explain why gsub, sub or scan matches with * are
different from =~ matches with *?

puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>
puts "hello".gsub(/.*/, '<\1>') # <><>
print "before: #{$`}\n" # before: hello
print "match: #{$&}\n" # match:
print "after: #{$'}\n" # after:

puts "hello" =~ (/.*/) # 0
print "before: #{$`}\n" # before:
print "match: #{$&}\n" # match: hello
print "after: #{$'}\n" # after:


thanks!
 
Wild Karl-Heinz

Hello Ryan

In message "Do You Understand Regular Expressions?"

RM> I agree. Can someone explain why gsub, sub or scan matches with * are
RM> different than =~ matches with *

RM> puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )

Please note the () around the expression.
After that you can refer to the matched
letter with \\1.


RM> puts "hello".gsub(/.*/, '<\1>') # <><>

irb(main):029:0> "hello".gsub(/(.*)/, '<\1>')
=> "<hello><>"
irb(main):030:0> "hello".gsub(/(.+)/, '<\1>')
=> "<hello>"

RM> print "before: #{$`}\n" # before: hello

irb(main):031:0> $`
=> ""

RM> print "match: #{$&}\n" # match:

irb(main):032:0> $&
=> "hello"

RM> print "after: #{$'}\n" # after:

irb(main):033:0> $'
=> ""
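
One plausible reading of the different outputs is that scan and gsub perform several matches in a row, so $`, $& and $' afterwards describe only the most recent one, whereas =~ performs a single match. A small sketch that records the variables for every match (the names records and single are made up for this example):

```ruby
# Record the special variables for every match that scan finds.
records = []
"hello".scan(/.*/) { records << [$`, $&, $'] }
p records  # [["", "hello", ""], ["hello", "", ""]]

# =~ stops after the first match, so it only ever sees the first entry.
"hello" =~ /.*/
single = [$`, $&, $']
p single   # ["", "hello", ""]
```

The second entry in records is the empty match at the end of the string, which is what the "before: hello" output in the question was showing.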


hope this helps.

regards.
Karl-Heinz
 
Stephen Ball

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

To prevent this you need to indicate to your regular expression that
you only want the subset of 'everything' that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn't return the absence

/^.*/ will start the search at the start of the pattern, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

/..*/ will match everything after something, this is a modified form
of the above that isn't tied to the start of the string
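
A quick sketch of the three alternatives listed above, run against the same string. The hash and variable names are illustrative only, and transform_values assumes Ruby 2.4 or later:

```ruby
# The three suggested patterns, keyed by their source text.
patterns = {
  ".+"  => /.+/,
  "^.*" => /^.*/,
  "..*" => /..*/
}

# Each one yields a single match for "hello", with no trailing "".
results = patterns.transform_values { |re| "hello".scan(re) }
results.each { |source, matches| puts "#{source}: #{matches.inspect}" }

# Caveat: ^ matches at the start of every line, so a multi-line
# string still produces one match per line.
per_line = "a\nb".scan(/^.*/)
p per_line  # ["a", "b"]
```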

-- Stephen
 
Rob Biedenharn

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.
...
-- Stephen

That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
dblack

Hi --

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

To prevent this you need to indicate to your regular expression that
you only want the subset of 'everything' that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn't return the absence

/^.*/ will start the search at the start of the pattern, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

Here, again, "hello" is first, so /^.*/ matches it but doesn't match
the second time ("") because the "" isn't anchored to ^.
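
To see the second attempt in isolation, one can start a match by hand at offset 5, just past "hello". This is only a sketch: it assumes Regexp#match with a start position, which exists in Ruby 1.9 and later and may not be available on the 1.8.5 mentioned in this thread.

```ruby
# Start matching at offset 5, i.e. right after the final "o".
anchored   = /^.*/.match("hello", 5)
unanchored = /.*/.match("hello", 5)

p anchored             # nil: offset 5 is not the start of a line
p unanchored[0]        # "": the empty match that scan reports
p unanchored.begin(0)  # 5
```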


David

--
* Books:
RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242)
RUBY FOR RAILS (http://www.manning.com/black)
* Ruby/Rails training
& consulting: Ruby Power and Light, LLC (http://www.rubypal.com)
 
Brian Adkins

Hello Ryan

In message "Do You Understand Regular Expressions?"

RM> I agree. Can someone explain why gsub, sub or scan matches with * are
RM> different than =~ matches with *

RM> puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )

Please note the () around the expression.
After that you can refer with \\1 to the found
letters.

Why not simply change the 1 to a 0 ?

irb(main):001:0> puts "hello".gsub(/[aeiou]/, '<\0>')
h<e>ll<o>
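
The variants mentioned so far can be put side by side. This is just a sketch with made-up variable names; the block form is an additional option not discussed above:

```ruby
# \0 refers to the whole match, so no parentheses are needed.
whole_match = "hello".gsub(/[aeiou]/, '<\0>')
# \1 refers to the first group, which must exist in the pattern.
first_group = "hello".gsub(/([aeiou])/, '<\1>')
# The block form receives the matched text directly.
block_form  = "hello".gsub(/[aeiou]/) { |m| "<#{m}>" }
# Without a group, \1 is empty, which explains the h<>ll<> output.
no_group    = "hello".gsub(/[aeiou]/, '<\1>')

p whole_match  # "h<e>ll<o>"
p first_group  # "h<e>ll<o>"
p block_form   # "h<e>ll<o>"
p no_group     # "h<>ll<>"
```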
 
Stephen Ball

On 6/21/07 said:
It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Ah, but notice:

"hello".scan(/.*$/)
=> ["hello", ""]

"hello".scan(/^.*/)
=> ["hello"]

Strange indeed, but it seems that's how it's working. Although I
suspect I'm not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

So, if that's correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

-- Stephen
 
dblack

Hi --

On 6/21/07 said:
It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Ah, but notice:

"hello".scan(/.*$/)
=> ["hello", ""]

"hello".scan(/^.*/)
=> ["hello"]

Strange indeed, but it seems that's how it's working. Although I
suspect I'm not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

So, if that's correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

Yes, that was what I was mostly going by :)


David

 
Sami Samhuri

On 6/21/07 said:
The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

^ anchors the match to beginning of a line or the beginning of the
string. The second match fails because it's starting from the first
point after "hello", where it left off. It says nothing about the
content that follows.

"".scan /^.*/ => [""]

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

This is correct. First it finds the longest match it can in "hello".
Then it finds nothing, but still anchored at the end of the line. Note
that $ does not anchor the end of the string, but the end of each line
within the string or the very end. \z matches the actual end of
string, while \A does the same for the beginning.
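
A short sketch of the line anchors versus the string anchors, using a two-line string chosen for illustration:

```ruby
text = "a\nb"

end_of_line     = text =~ /a$/   # $ also matches just before a newline
end_of_string   = text =~ /a\z/  # \z only matches at the very end
start_of_line   = text =~ /^b/   # ^ also matches right after a newline
start_of_string = text =~ /\Ab/  # \A only matches at the very start

p end_of_line      # 0
p end_of_string    # nil
p start_of_line    # 2
p start_of_string  # nil
```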

Hope this helps.
 
Mariusz Pękala


It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

I would say the condition is checked at the right time; it's just that
the condition is different: it allows an empty-string match at the end
of a just-matched string, but it does not allow an empty-string match
right after another empty-string match.

The interesting behaviour is:

irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]

The /.*?/ matches 'zero or more characters, preferring the shortest
match'. One could ask - where have the actual characters gone?
Note that it's not an infinite loop of empty strings.
After matching 'nothing', the start-position for next match is
increased, skipping one character, to prevent infinite loop of matching
nothing again.
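
As a sketch of where those six empty matches sit, the start offset of each one can be collected inside a scan block (positions is a name invented for this example):

```ruby
# Collect the offset at which each empty match was found.
positions = []
"hello".scan(/.*?/) { positions << Regexp.last_match.begin(0) }

p positions  # [0, 1, 2, 3, 4, 5]
```

So the characters are not lost; they are simply stepped over between one empty match and the next.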

*This* behaviour may be considered weird, or buggy, and the results
are probably not what was expected.

But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here 'nothing' matches many times, and definitely this *is* the expected
behaviour.



 
growlatoe

It's because the pattern /.*/ matches everything, including the
It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Oh right, I think I get it now. If you try to match anything with *
then a match is guaranteed, because if there's nothing to match, then
you'll just match nothing?

Like this:

irb(main):001:0> "hello".scan(/h*/)
=> ["h", "", "", "", "", ""]

And this:

irb(main):002:0> "hello".scan(/P*/)
=> ["", "", "", "", "", ""]


Until now I've always assumed, and used, .* to match everything,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd...

Thanks everyone for your help.
 
Rob Biedenharn

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

I would say the condition is checked at the right time; it's just that
the condition is different: it allows an empty-string match at the end
of a just-matched string, but it does not allow an empty-string match
right after another empty-string match.

The interesting behaviour is:

irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]

The /.*?/ matches 'zero or more characters, preferring the shortest
match'. One could ask - where have the actual characters gone?
Note that it's not an infinite loop of empty strings.
After matching 'nothing', the start-position for next match is
increased, skipping one character, to prevent infinite loop of matching
nothing again.

*This* behaviour may be considered weird, or buggy, and the results
are probably not what was expected.

A great example which I *do* consider to be buggy. The similar
example from perl is something like:
$ perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
[][h][][e][][l][][l][][o][]

It matches the empty string at the beginning, between each character,
and at the end, but it does consume the actual characters of the
string. Even if not what one would anticipate, it's not too hard to
justify the result. (Something that can't be said for ruby's
["","","","","",""].)

The other versions from perl are enlightening:
$ perl -e '$h = "hello"; $h =~ s/.?/[$&]/g; print "$h\n";'
[h][e][l][l][o][]

$ perl -e '$h = "hello"; $h =~ s/.*/[$&]/g; print "$h\n";'
[hello][]

Both succeed in a zero-character match at the end. These are
equivalent in ruby (1.8.5):

$ ruby -e 'puts "hello".scan(/.?/).inspect'
["h", "e", "l", "l", "o", ""]

$ ruby -e 'puts "hello".scan(/.*/).inspect'
["hello", ""]

I thought I'd see what Oniguruma (5.8.0; with 1.1.0 gem) had to say:

irb> require 'oniguruma'
=> true
irb> reluctant = Oniguruma::ORegexp.new('.*?')
=> /.*?/
irb> greedy = Oniguruma::ORegexp.new('.*')
=> /.*/
irb> greedyq = Oniguruma::ORegexp.new('.?')
=> /.?/
irb> reluctant.scan("hello")
=> [#<MatchData:0x10b9aa4>, #<MatchData:0x10b9a7c>, #<MatchData:0x10b9a68>,
#<MatchData:0x10b9a40>, #<MatchData:0x10b9a18>, #<MatchData:0x10b99f0>]
irb> reluctant.scan("hello").map{|md|md[0]}
=> ["", "", "", "", "", ""]
irb> greedy.scan("hello").map{|md|md[0]}
=> ["hello", ""]
irb> greedyq.scan("hello").map{|md|md[0]}
=> ["h", "e", "l", "l", "o", ""]

OK, the same result as the ruby Regexp. Including, that .*? produces
[""]*6 which is the "before each character and at the end" locations
of the zero-length matches from perl, but the individual single-byte
matches are missing.

I presume that there's some justification for these behaviors, but I
can't figure out what it might be.
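
For comparison with the perl substitutions, here is a sketch of the same three patterns run through Ruby's gsub in block form (variable names invented for illustration). The outputs in the comments are what I would expect from Ruby's usual handling of empty matches in gsub, where the character that gets stepped over is copied through unchanged:

```ruby
# Wrap every match in brackets, as the perl one-liners do with [$&].
reluctant = "hello".gsub(/.*?/) { |m| "[#{m}]" }
optional  = "hello".gsub(/.?/)  { |m| "[#{m}]" }
greedy    = "hello".gsub(/.*/)  { |m| "[#{m}]" }

p reluctant  # "[]h[]e[]l[]l[]o[]"
p optional   # "[h][e][l][l][o][]"
p greedy     # "[hello][]"
```

The first result has the same six empty matches as scan, with the original characters left in place between them rather than matched individually as in perl.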

-Rob
But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here 'nothing' matches many times, and definitely this *is* the expected
behaviour.

I agree that those results are exactly what I'd expect.

 
Robert Klemme

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.
...
-- Stephen

That still doesn't really explain why "hello".scan(/.*/) => ["hello", ""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "", "",
... ] since I (or rather the OP) could continue to match zero characters
(bytes) at the end of the string forever? It does seem that it might be
that a termination condition is checked a bit later than it should be in
this case.

As far as I remember it works like this: first .* matches the whole
sequence. Then the "cursor" is placed behind the match, i.e. after the
last char of the match and matching starts again. At this place the
empty sequence matches because we're at the end of the match. After
that match the cursor is advanced one step (to avoid endless
repetitions) and - alas! - we're at the end of the string and matching
stops.
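
The cursor description above can be sketched as a hand-written loop. This is not how String#scan is actually implemented; scan_sketch is a hypothetical helper for patterns without groups, and it assumes Regexp#match with a start position (Ruby 1.9 and later):

```ruby
# Hypothetical helper that imitates String#scan for group-free patterns.
def scan_sketch(str, re)
  matches = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)  # search from the cursor onwards
    break unless m
    matches << m[0]
    # A non-empty match leaves the cursor at its end; an empty match
    # pushes the cursor one character further to avoid looping forever.
    pos = m[0].empty? ? m.end(0) + 1 : m.end(0)
  end
  matches
end

p scan_sketch("hello", /.*/)   # ["hello", ""]
p scan_sketch("hello", /.*?/)  # ["", "", "", "", "", ""]
p scan_sketch("hello", /h*/)   # ["h", "", "", "", "", ""]
```

With /.*/ the first match leaves the cursor at 5, the empty match there moves it to 6, and the loop ends because 6 is past the end of the string.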

For learning regular expressions this is a great program: it lets you
step through the matching process graphically:
http://weitz.de/regex-coach/

See also this thread:
http://groups.google.de/group/comp....2390ff905f?lnk=st&q=&rnum=10#f759612390ff905f

Btw, for replacing the whole string this is much better:

irb(main):001:0> s = "foo"
=> "foo"
irb(main):002:0> s.object_id
=> 1073540760
irb(main):003:0> s.replace "bar"
=> "bar"
irb(main):004:0> s.object_id
=> 1073540760
irb(main):005:0> s
=> "bar"
irb(main):006:0>

Kind regards

robert
 
Robert Klemme

It's the other way around, though; it matches "hello" *first*, and
then "". So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

Oh right, I think I get it now. If you try to match anything with *
then a match is guaranteed, because if there's nothing to match, then
you'll just match nothing?

Like this:

irb(main):001:0> "hello".scan(/h*/)
=> ["h", "", "", "", "", ""]

And this:

irb(main):002:0> "hello".scan(/P*/)
=> ["", "", "", "", "", ""]


Until now I've always assumed, and used, .* to match everything,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd...

".*" has its uses but it's generally overrated, i.e. used more often than
needed or wanted. If you show a more concrete example of what you are
doing we might be able to come up with better suggestions. If you are
really interested in diving into the matter then I suggest "Mastering
Regular Expressions", which is an excellent book for the money.

Kind regards

robert
 
