Regexp Error?


Robert Klemme

What's wrong here?

irb(main):022:0> RUBY_VERSION
=> "1.8.1"
irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"
irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"
irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"
irb(main):026:0> "123-456".gsub(/^.*$/, 'X')
=> "X"

I'd have expected "X" as the result of gsub in lines 23 and 25 because .* is greedy.

robert
 

ts

R> irb(main):023:0> "123-456".gsub(/.*/, 'X')
R> => "XX"

* first match : "123-456"
* now it's at end
* second match with the empty string

R> irb(main):024:0> "123-456".gsub(/^.*/, 'X')
R> => "X"

* first match : "123-456"
* now it's at end
* it can't match the empty string because there is ^ in the regexp
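
A quick way to check this explanation is to list the matches themselves with String#scan, which finds the same successive matches that gsub replaces. This is only a minimal sketch; the variable names are illustrative.

```ruby
str = "123-456"

# scan returns every match that gsub would replace, in order.
greedy   = str.scan(/.*/)    # the whole string, then the empty string at the end
anchored = str.scan(/^.*/)   # the whole string only; ^ cannot match at the end

p greedy
p anchored

# gsub replaces each match once, so the number of X's equals the number of matches.
p str.gsub(/.*/, 'X')
p str.gsub(/^.*/, 'X')
```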



Guy Decoux
 

Robert Klemme

ts said:
> R> irb(main):023:0> "123-456".gsub(/.*/, 'X')
> R> => "XX"
>
> * first match : "123-456"
> * now it's at end
> * second match with the empty string
>
> R> irb(main):024:0> "123-456".gsub(/^.*/, 'X')
> R> => "X"
>
> * first match : "123-456"
> * now it's at end
> * it can't match the empty string because there is ^ in the regexp

That explains it. I still have to say that I find this a bit surprising.
Moreover, this case strikes me as an error:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that "$" is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return "XX"?

Regards

robert
 

ts

R> irb(main):025:0> "123-456".gsub(/.*$/, 'X')
R> => "XX"

R> Applying your explanation, one would have to say that "$" is matched
R> twice. IMHO this is not correct. Or is there another explanation why
R> this gsub happens to return "XX"?

This is the same explanation. Your regexp means: match a string which is at
the end (where the end can be before a \n or, as in this case, the end of the string)

* first match : "123-456"
* now it's at end
* it can match the empty string


$ is like ^ : it doesn't match a character but a position in the string
(sort of ...)
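
The "position, not character" point can be seen directly in irb. The following is a small sketch using only core String and MatchData methods; nothing here comes from the original post.

```ruby
str = "123-456"

m = str.match(/$/)
p m[0]          # the matched text is the empty string
p m.begin(0)    # the match sits at the index just past the last character

# Replacing a zero-width match inserts text without removing anything.
p str.gsub(/$/, 'X')

# $ also matches before a newline, so a two-line string has two such positions.
p "a\nb".gsub(/$/, '!')
```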


Guy Decoux
 

Ara.T.Howard

> That explains it. I still have to say that I find this a bit surprising.
> Moreover, this case strikes me as an error:
>
> irb(main):025:0> "123-456".gsub(/.*$/, 'X')
> => "XX"
>
> Applying your explanation, one would have to say that "$" is matched
> twice. IMHO this is not correct. Or is there another explanation why
> this gsub happens to return "XX"?
>
> Regards
>
> robert

^ and $ are special and they consume no chars and so are not really 'matched'
in the same way...

your regex says 'zero or more chars before the end of a string' so you get


^ 1 2 3 - 4 5 6 $
---------------
^

the first go, then scanning starts again - the problem is that it's then
looking for something potentially zero-width followed by something zero-width
- which is always going to match (again). i guess the difference for
the second match is that it does not advance the scanner ptr and can therefore
know it's done... it does seem odd, but without that behaviour it would be
hard to match empty strings, line boundaries, and other zero-width things...
for instance if you did this

"123-456".gsub(/.*|$/, 'X')

you would expect 'XX', where the second 'X' is inserted into a zero-width
position and '.*' does not include '$' and yet this is really the same exact
behaviour - scanning is done again from the zero-width position before the end
of line, allowing you to finally match '$' which '.*' did not consume.

regexes can be so tricky, i _try_ to use these rules with them

* always use both ^ and $ (this makes it a lot harder to write the expression
too!)

* never use .* (or * at all really)

the last is actually pretty important - we use a product here, ldm (local data
manager), that scans a huge memory-mapped queue full of data products, matching
a list of actions against the product tags. the list of actions uses regexps
and all of ours had '.*' in them. top showed the ldm process at around 30%
cpu - reworking the patterns to not include '.*' dropped it off the radar.
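
For the replacement case discussed in this thread, a few ways to keep '.*' from producing the extra empty match can be sketched as follows. These are plain Ruby illustrations of the behaviour, with made-up variable names, and are not the patterns used in ldm.

```ruby
str = "123-456"

# .+ needs at least one character, so the empty match at the end is impossible.
plus = str.gsub(/.+/, 'X')

# \A only matches at the very start of the string, so there is no second match.
anchored = str.gsub(/\A.*/, 'X')

# sub replaces just the first match and never looks for a second one.
first_only = str.sub(/.*/, 'X')

p plus, anchored, first_only
```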

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
| URL :: http://www.ngdc.noaa.gov/stp/
| TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================
 

Kristof Bastiaensen

> R> irb(main):023:0> "123-456".gsub(/.*/, 'X')
> R> => "XX"
>
> * first match : "123-456"
> * now it's at end
> * second match with the empty string

I would say the empty string is included with "123-456",
so it shouldn't give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof
 

Ara.T.Howard

> I would say the empty string is included with "123-456",
> so it shouldn't give another match:
>
> echo "123-456" | sed 's/.*/X/g'
> X
>
> Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

and sed regexps are not the same as perl/ruby right?

-a
 

Dave Burt

Perl and Javascript (MSIE) do it too, so I don't propose changing it, but it
seems a strange (wrong) behaviour.

I would have thought that the second match you refer to should have been
included in the first; that is, that the greedy match should match (and then
replace) the whole string, including the 0 characters between the last
character and the end of the string.

But apparently it's not like that.
 

Kristof Bastiaensen

> yes but
>
> ~ > echo "123-456" | perl -npe 's/.*$/X/g'
> XX

gawk also returns X:

$ echo "123-456" | gawk '{ gsub(/.*/, "X"); print }'
X

> and sed regexps are not the same as perl/ruby right?
>
> -a

No, but I would expect the basic ones to behave the same
way. (sed and gawk were there before perl/ruby/javascript.)
Is that not an expectation that can be trusted?
 

Simon Strandgaard

Ara.T.Howard wrote:
> yes but
>
> ~ > echo "123-456" | perl -npe 's/.*$/X/g'
> XX
>
> and sed regexps are not the same as perl/ruby right?


This is a widespread problem with regexps: when dealing with the Kleene star, it's
tricky to determine when to stop looping. I have put a lot of effort into investigating
where to stop in my engine, so that the output is the most desirable.

Unfortunately, Ruby's native regexp engines (GNU or Oniguruma) attempt to be
Perl-compatible, and thus sometimes emulate undesired behavior.
 

Robert Klemme

Ara.T.Howard said:
> ^ and $ are special and they consume no chars and so are not really 'matched'
> in the same way...
>
> your regex says 'zero or more chars before the end of a string' so you get
>
>
> ^ 1 2 3 - 4 5 6 $
> ---------------
> ^
>
> the first go, then scanning starts again - the problem is that it's then
> looking for something potentially zero-width followed by something zero-width
> - which is always going to match (again). i guess the difference for
> the second match is that it does not advance the scanner ptr and can therefore
> know it's done... it does seem odd,

Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end of the match is at the end
of the string.

> but without that behaviour it would be
> hard to match empty strings, line boundaries, and other zero-width
> things...

You mean because then it would immediately stop without matching anything.
Yeah, might be true.

The sed and awk examples show that apparently there's disagreement on how
this should be handled. I just wonder why I didn't step into this pitfall
earlier. Apparently I never felt the need for .* in a replacement context
before. :)

Thx all!

Kind regards

robert
 

ts

R> Definitely! What strikes me as odd is that the engine must know the start and
R> end of the match. So it could realize that the end of the match is at the end
R> of the string.

Well you can have another explanation in 'man 7 regex' (linux) (or another
way to see it)

Match lengths are measured in characters, not collating elements. A
null string is considered longer than no match at all.

It's in the case of null string vs no match
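
The difference between a null-string match and no match at all shows up directly in Ruby. A small sketch, with illustrative variable names:

```ruby
empty_ok = /.*/.match("")   # .* is satisfied by zero characters
no_match = /.+/.match("")   # .+ needs at least one character

p empty_ok    # a MatchData object whose matched text is ""
p no_match    # nil, i.e. no match at all
```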


Guy Decoux
 

Dave Burt

I would have thought, from that logic, that you could just as well expect an
infinite loop ("XXXXXX...") rather than just "XX" - why does /.*/ not keep
matching that same 0-char gap at the end?
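
One way to picture why the result is "XX" and not an endless run of X's: a global replace has to move its start position forward after every match, and after an empty match it moves forward by one character. The loop below is a rough sketch of that idea written with Regexp#match; `match_offsets` is a hypothetical helper, and this is not how Ruby's gsub is actually implemented.

```ruby
# Collect [start, end] offsets of successive matches, the way a global
# replace would visit them. Hypothetical helper for illustration only.
def match_offsets(str, re)
  offsets = []
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    offsets << [m.begin(0), m.end(0)]
    # After an empty match, step forward one character so the scan cannot stall.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  offsets
end

p match_offsets("123-456", /.*/)    # the full match, then one empty match at the end
p match_offsets("123-456", /^.*/)   # only the full match
```

Because the start position passes the end of the string right after the empty match, the loop stops there instead of matching the same gap again.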
 

Robert Klemme

Dave Burt said:
> I would have thought, from that logic, that you could just as well expect an
> infinite loop ("XXXXXX...") rather than just "XX" - why does /.*/ not keep
> matching that same 0-char gap at the end?

I thought that for a moment, too. But he gave the answer already:

Regards

robert
 

Robert Klemme

ts said:
> R> Definitely! What strikes me as odd is that the engine must know the start and
> R> end of the match. So it could realize that the end of the match is at the end
> R> of the string.
>
> Well you can have another explanation in 'man 7 regex' (linux) (or another
> way to see it)
>
> Match lengths are measured in characters, not collating elements. A
> null string is considered longer than no match at all.
>
> It's in the case of null string vs no match

This reminds me of the mathematician that found an epsilon so small, that -
if you divided it in halves - it was already negative. :)

Cheers

robert
 

Simon Strandgaard

Robert said:
> This reminds me of the mathematician that found an epsilon so small, that -
> if you divided it in halves - it was already negative. :)

Epsilon transitions are a very interesting feature of regexps. I like them.
However, variable-width lookbehind with subcaptures and backreferences is
even more amazing (that would be suitable for a small research project).
 
