Regexp Error?

Robert Klemme · May 14, 2004

What's wrong here?

irb(main):022:0> RUBY_VERSION
=> "1.8.1"
irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"
irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"
irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"
irb(main):026:0> "123-456".gsub(/^.*$/, 'X')
=> "X"

I'd have expected "X" as the result of gsub in 23 and 25, because .* is greedy.

robert

ts · May 14, 2004

R> irb(main):023:0> "123-456".gsub(/.*/, 'X')
R> => "XX"

* first match : "123-456"
* now it's at end
* second match with the empty string

R> irb(main):024:0> "123-456".gsub(/^.*/, 'X')
R> => "X"

* first match : "123-456"
* now it's at end
* it can't match the empty string because there is ^ in the regexp
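
The two match sequences above can be made visible with String#scan, which walks the string the same way gsub does. This is only a minimal sketch, assuming a reasonably recent Ruby rather than the 1.8.1 from the original session; the variable names are illustrative.

```ruby
s = "123-456"

# Collect [start offset, matched text] for every match gsub would replace.
unanchored = []
s.scan(/.*/) { unanchored << [$~.begin(0), $~[0]] }

anchored = []
s.scan(/^.*/) { anchored << [$~.begin(0), $~[0]] }

p unanchored  # [[0, "123-456"], [7, ""]]
p anchored    # [[0, "123-456"]]
```

The second entry for the unanchored pattern is the empty match at offset 7, which is where the second "X" comes from.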

Guy Decoux

Robert Klemme · May 14, 2004

ts said:
> R> irb(main):023:0> "123-456".gsub(/.*/, 'X')
> R> => "XX"
>
> * first match : "123-456"
> * now it's at end
> * second match with the empty string
>
> R> irb(main):024:0> "123-456".gsub(/^.*/, 'X')
> R> => "X"
>
> * first match : "123-456"
> * now it's at end
> * it can't match the empty string because there is ^ in the regexp

That explains it. I still have to say that I find this a bit surprising.
Moreover, this case strikes me as an error:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that "$" is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return "XX"?

Regards

robert

ts · May 14, 2004

R> irb(main):025:0> "123-456".gsub(/.*$/, 'X')
R> => "XX"

R> Applying your explanation, one would have to say that "$" is matched
R> twice. IMHO this is not correct. Or is there another explanation why
R> this gsub happens to return "XX"?

This is the same explanation. Your regexp means: match a string which is at
the end (where the end can be before a \n or, as in this case, the end of the
string)

* first match : "123-456"
* now it's at end
* it can match the empty string

$ is like ^ : it doesn't match a character but a position in the string
(sort of ...)
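
That ^ and $ match positions rather than characters can be checked directly. A small sketch, again assuming a recent Ruby; the variable names are made up for illustration.

```ruby
s = "123-456"

# ^ and $ match positions, so replacing them inserts text without
# removing any character.
at_start = s.gsub(/^/, 'X')   # "X123-456"
at_end   = s.gsub(/$/, 'X')   # "123-456X"

# The offset of the $ match equals the string length: nothing is consumed.
end_offset = (s =~ /$/)       # 7

# "End" can also be the position just before a newline.
per_line = "ab\ncd".gsub(/$/, '!')   # "ab!\ncd!"

p at_start, at_end, end_offset, per_line
```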

Guy Decoux

Ara.T.Howard · May 14, 2004

Robert Klemme said:
> That explains it. I still have to say that I find this a bit surprising.
> Moreover, this case strikes me as an error:
>
> irb(main):025:0> "123-456".gsub(/.*$/, 'X')
> => "XX"
>
> Applying your explanation, one would have to say that "$" is matched
> twice. IMHO this is not correct. Or is there another explanation why
> this gsub happens to return "XX"?
>
> Regards
>
> robert

^ and $ are special and they consume no chars and so are not really 'matched'
in the same way...

your regex says 'zero or more chars before the end of a string' so you get

^ 1 2 3 - 4 5 6 $
---------------
^

the first go; then scanning starts again - the problem is that it's then
looking for something potentially zero-width followed by something
zero-width - which is always going to match (again). i guess the difference for
the second match is that it does not advance the scanner ptr and can therefore
know it's done... it does seem odd, but without that behaviour it would be
hard to match empty strings, line boundaries, and other zero-width things...
for instance if you did this

"123-456".gsub(/.*|$/, 'X')

you would expect 'XX', where the second 'X' is inserted into a zero-width
position and '.*' does not include '$', and yet this is really the exact same
behaviour - scanning is done again from the non-space before the end of line,
allowing you to finally match '$' which '.*' did not consume.
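
a quick way to compare the two patterns side by side (a sketch only, assuming a recent ruby; the variable names are illustrative and the \b example is an extra illustration of a zero-width match that is not from the thread):

```ruby
s = "123-456"

# Both patterns can match the empty string at offset 7, so a second
# replacement happens there after ".*" has consumed "123-456".
with_alternation = s.gsub(/.*|$/, 'X')   # "XX"
with_anchor      = s.gsub(/.*$/, 'X')    # "XX"

# Zero-width matches are what make insertions at boundaries possible.
word_boundaries = "a b".gsub(/\b/, '|')  # "|a| |b|"

p with_alternation, with_anchor, word_boundaries
```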

regexs can be so tricky, i _try_ to use these rules with them

* always use both ^ and $ (this makes it a lot harder to write the expression
too!)

* never use .* (or * at all really)

the last is actually pretty important - we use a product here, ldm (local data
manager), that scans a huge memory-mapped queue full of data products, matching
a list of actions against the product tags. the list of actions uses regexps
and all of ours had '.*' in them. top showed the ldm process at around 30%
cpu - reworking the patterns to not include '.*' dropped it off the radar.
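
applied to the example from this thread, the two rules might look like the sketch below. it assumes a recent ruby, and it only shows how the double replacement goes away; it says nothing about the cpu numbers above. the sub variant is an extra option that is not one of the two rules.

```ruby
s = "123-456"

both_anchors = s.gsub(/^.*$/, 'X')   # rule 1: ^ blocks a second match at the end
non_empty    = s.gsub(/.+/, 'X')     # rule 2: .+ cannot match the empty string
once         = s.sub(/.*/, 'X')      # sub replaces only the first match

p [both_anchors, non_empty, once]    # ["X", "X", "X"]
```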

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
| URL :: http://www.ngdc.noaa.gov/stp/
| TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================

Kristof Bastiaensen · May 14, 2004

ts said:
> R> irb(main):023:0> "123-456".gsub(/.*/, 'X')
> R> => "XX"
>
> * first match : "123-456"
> * now it's at end
> * second match with the empty string

I would say the empty string is included with "123-456",
so it shouldn't give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof

Ara.T.Howard · May 14, 2004

Kristof Bastiaensen said:
> I would say the empty string is included with "123-456",
> so it shouldn't give another match:
>
> echo "123-456" | sed 's/.*/X/g'
> X
>
> Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

and sed regexps are not the same as perl/ruby right?

-a

Dave Burt · May 14, 2004

Perl and Javascript (MSIE) do it too, so I don't propose changing it, but it
seems a strange (wrong) behaviour.

I would have thought that the second match you refer to should have been
included in the first; that is, that the greedy match should match (and then
replace) the whole string, including the 0 characters between the last
character and the end of the string.

But apparently it's not like that.

Kristof Bastiaensen · May 14, 2004

Ara.T.Howard said:
> yes but
>
> ~ > echo "123-456" | perl -npe 's/.*$/X/g'
> XX

gawk also returns X:

$ echo "123-456" | gawk '{ gsub(/.*/, "X"); print }'
X

> and sed regexps are not the same as perl/ruby right?
>
> -a

No, but I would expect the basic ones to behave the same
way (sed and gawk were there before perl/ruby/javascript).
Is that not an expectation that can be trusted?

Simon Strandgaard · May 14, 2004

Ara.T.Howard wrote:
> yes but
>
> ~ > echo "123-456" | perl -npe 's/.*$/X/g'
> XX
>
> and sed regexps are not the same as perl/ruby right?

This is a widespread problem with regexps: when dealing with the Kleene star,
it's tricky to determine when to stop looping. I have put a lot of effort into
investigating where to stop in my engine, so that the output is the most
desirable one.

Unfortunately, Ruby's native regexp engines (GNU or Oniguruma) attempt to be
Perl compatible, and thus sometimes emulate undesired behavior.

Robert Klemme · May 14, 2004

Ara.T.Howard said:
> ^ and $ are special and they consume no chars and so are not really 'matched'
> in the same way...
>
> your regex says 'zero or more chars before the end of a string' so you get
>
> ^ 1 2 3 - 4 5 6 $
> ---------------
> ^
>
> the first go; then scanning starts again - the problem is that it's then
> looking for something potentially zero-width followed by something
> zero-width - which is always going to match (again). i guess the difference for
> the second match is that it does not advance the scanner ptr and can therefore
> know it's done... it does seem odd,

Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the match ends at the end of the
string.

> but without that behaviour it would be
> hard to match empty strings, line boundaries, and other zero-width
> things...

You mean because then it would immediately stop without matching anything.
Yeah, might be true.

The sed and awk examples show that apparently there's disagreement on how
this should be handled. I just wonder why I didn't step into this pitfall
earlier. Apparently I never felt the need for .* in a replacement context
before.

Thx all!

Kind regards

robert

ts · May 14, 2004

R> Definitely! What strikes me odd is, that the engine must know start and
R> end of the match. So it could realize that end is at the end.

Well, you can find another explanation (or another way to see it) in
'man 7 regex' (Linux):

Match lengths are measured in characters, not collating elements. A
null string is considered longer than no match at all.

This is the case of null string vs. no match.
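
The difference between a null-string match and no match at all can be seen by starting the search at the end of the string. A minimal sketch, assuming a Ruby version where String#match accepts a start position; the variable names are illustrative.

```ruby
s = "123-456"
at_end = s.length   # 7, the position right after the last character

empty_match = s.match(/.*/, at_end)    # succeeds with a null string
no_match    = s.match(/^.*/, at_end)   # fails: ^ cannot match at offset 7
needs_char  = s.match(/.+/, at_end)    # fails: .+ needs at least one character

p empty_match[0]   # ""
p no_match         # nil
p needs_char       # nil
```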

Guy Decoux

Dave Burt · May 15, 2004

I would have thought, from that logic, that you could just as well expect an
infinite loop ("XXXXXX...") rather than just "XX" - why does /.*/ not keep
matching that same 0-char gap at the end?
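
One way to see why the result is "XX" and not an endless run of X is to write the scanning loop out by hand. The sketch below is only an approximation: `emulate_gsub` is a hypothetical helper, not Ruby's actual implementation, and it assumes that after an empty match the scanner copies one character and steps past it before searching again.

```ruby
# Hand-rolled approximation of a gsub-style scanning loop (illustrative only).
def emulate_gsub(str, pattern, replacement)
  out = "".dup
  pos = 0
  while pos <= str.length
    m = pattern.match(str, pos)
    break unless m
    # Copy the unmatched text before the match, then the replacement.
    out << str[pos...m.begin(0)] << replacement
    if m.end(0) == m.begin(0)
      # Empty match: copy one character and step past it, so the same
      # zero-width position is never matched twice.
      out << str[m.begin(0), 1].to_s
      pos = m.end(0) + 1
    else
      # Non-empty match: continue right after it. An empty match at this
      # new position is still allowed, which yields the second "X".
      pos = m.end(0)
    end
  end
  out << str[pos..-1].to_s
  out
end

p emulate_gsub("123-456", /.*/, 'X')    # "XX"
p emulate_gsub("123-456", /^.*/, 'X')   # "X"
```

After the empty match at offset 7 the position moves to 8, which is past the end of the string, so the loop stops.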

Robert Klemme · May 15, 2004

Dave Burt said:
> I would have thought, from that logic, that you could just as well expect an
> infinite loop ("XXXXXX...") rather than just "XX" - why does /.*/ not keep
> matching that same 0-char gap at the end?

I thought that for a moment, too. But he gave the answer already:

Regards

robert

Robert Klemme · May 15, 2004

ts said:
> R> Definitely! What strikes me odd is, that the engine must know start and
> R> end of the match. So it could realize that end is at the end.
>
> Well you can have another explanation in 'man 7 regex' (linux) (or another
> way to see it)
>
> Match lengths are measured in characters, not collating elements. A
> null string is considered longer than no match at all.
>
> It's in the case of null string vs no match

This reminds me of the mathematician who found an epsilon so small that,
if you divided it in half, it was already negative.

Cheers

robert

Simon Strandgaard · May 15, 2004

Robert said:
> This reminds me of the mathematician who found an epsilon so small that,
> if you divided it in half, it was already negative.

Epsilon transitions are a very interesting feature of regexps. I like them.
However, variable-width lookbehind with subcaptures and backreferences is
even more amazing (that would be suitable for a small research project).
