Explain this Ruby regex

renton.dan

Can someone explain this regex ...

"one two".scan(/\w*/).length

returns 4. I can see it matching the two words and the space, but what
else is it matching? Is there a null terminator? I thought Ruby strings
were not null-terminated.
 
Ben Bleything

Try replacing #length with #inspect and seeing what the output of scan
is. You'll find that it's returning two empty strings as well. I
suspect what you really want is \w+...
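
For what it's worth, here is a small sketch of that suggestion, comparing
the two quantifiers side by side. The helper name scan_words is made up
for illustration; it only wraps String#scan.

```ruby
# Hypothetical helper: returns every match String#scan finds for the pattern.
def scan_words(text, pattern)
  text.scan(pattern)
end

# \w* means "zero or more word characters", so empty matches are allowed.
p scan_words("one two", /\w*/)  # prints ["one", "", "two", ""]

# \w+ means "one or more word characters", so only real words come back.
p scan_words("one two", /\w+/)  # prints ["one", "two"]
```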

Ben
 
renton.dan

Yeah, you're right, \w+ will pull out the words, which is what I want
anyway. I'm still trying to understand what \w* is doing, though.

irb(main):015:0> "one two".scan(/\w*/).inspect
=> "[\"one\", \"\", \"two\", \"\"]"

My question is: what is the last \"\", and where does it come from?
 
Patrick He

\w* does not match the space between "one" and "two". It matches
"one", <empty string after "one">, "two", <empty string after "two">.

Here are some other examples:

irb(main):004:0> "one".scan(/^\w*/)
=> ["one"]
irb(main):005:0> "one".scan(/\w*$/)
=> ["one", ""]
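
To see where each of those matches sits in the string, one option is to
record the start offset of every match. This is only a sketch: the helper
name match_positions is invented here, and it relies on String#scan
setting the last-match data before each call of its block.

```ruby
# Hypothetical helper: collects [start offset, matched text] for each match.
def match_positions(text, pattern)
  positions = []
  text.scan(pattern) do |matched|
    # Regexp.last_match holds the MatchData for the match just yielded.
    positions << [Regexp.last_match.begin(0), matched]
  end
  positions
end

p match_positions("one two", /\w*/)
# prints [[0, "one"], [3, ""], [4, "two"], [7, ""]]

p match_positions("one", /\w*$/)
# prints [[0, "one"], [3, ""]]
```

The empty strings sit at offset 3 (just before the space) and at offset 7
(the very end of the string), not on the space itself.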


--
Patrick


Patrick Doyle

[Note: parts of this message were removed to make it a legal post.]

The key idea here is that "*" means "match zero or more of", whereas "+"
means "match one or more of". So, when you match \w* against "one two",
the scan first finds zero or more word characters (3, in fact: 'o', 'n',
and 'e'), and that produces one result. At the next position, the space,
there are zero word characters, but since you asked for "zero or more of",
you get an empty string as a result. Then rinse and repeat for the "two"
part: "two" itself, followed by an empty match at the end of the string.
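
The walk described above can be sketched as a hand-written loop. This is
not how Ruby implements #scan internally; scan_by_hand is a made-up name,
and stepping forward one character after an empty match is an assumption
chosen to reproduce the observed output.

```ruby
# Rough imitation of String#scan (no capture groups), for illustration only.
def scan_by_hand(text, pattern)
  results = []
  pos = 0
  while pos <= text.length
    m = pattern.match(text, pos)  # search from pos onward
    break unless m
    results << m[0]
    # An empty match consumes nothing, so step one character forward
    # to keep the loop from matching at the same spot forever.
    pos = m[0].empty? ? m.end(0) + 1 : m.end(0)
  end
  results
end

p scan_by_hand("one two", /\w*/)  # prints ["one", "", "two", ""]
p scan_by_hand("one two", /\w+/)  # prints ["one", "two"]
```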

FWIW, instead of looking at the result with #inspect, I found it more
informative to look at the result returned from #scan by itself, e.g.

irb> "one two".scan(/\w*/)
=> ["one", "", "two", ""]

--wpd


On Sat, Oct 04, 2008, (e-mail address removed) wrote:

"one two".scan(/\w*/).length

returns 4. I can see it matching the two words and the space, but what
else is it matching? Is there a null terminator? I thought Ruby strings
were not null-terminated.

Brian Candler

FWIW, instead of looking at the result with #inspect, I found it more
informative to look at the result returned from #scan by itself, e.g.

irb> "one two".scan(/\w*/)
=> ["one", "", "two", ""]

irb displays the expression value using "inspect", so you are using
inspect even though you didn't ask for it :)
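
A small sketch of that point, outside irb. The helper irb_echo is a
made-up stand-in for what irb prints after "=>", on the assumption that
irb shows value.inspect for simple values like these.

```ruby
# Hypothetical stand-in for the text irb shows after "=>".
def irb_echo(value)
  value.inspect
end

words = "one two".scan(/\w*/)

# Echoing the array itself: inspect is applied once.
puts irb_echo(words)          # prints ["one", "", "two", ""]

# Echoing words.inspect: the string is inspected a second time,
# which is where the backslash-escaped quotes come from.
puts irb_echo(words.inspect)  # prints "[\"one\", \"\", \"two\", \"\"]"
```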
 
Robert Klemme

The key idea here is that "*" means "match zero or more of" whereas "+"
means "match one or more of". So, when you match \w* against "one two",
there are zero or more instances of a word character (3, in fact, 'o', 'n',
and 'e'), so that produces one result. Following that result, there are
zero matches of a word character, but since you asked for "zero or more of",
you get that empty string result. Later, rinse, repeat for the "two" part.

It boils down to this statement: a subexpression with "*" potentially
matches an _empty string anywhere_ in a string.
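
A quick way to see this is to scan for something that never occurs in the
string at all. The helper name empty_match_count below is invented for
the example.

```ruby
# Hypothetical helper: counts how many of scan's results are empty strings.
def empty_match_count(text, pattern)
  text.scan(pattern).count(&:empty?)
end

# "abc" contains no "x", yet x* matches the empty string at every position.
p "abc".scan(/x*/)               # prints ["", "", "", ""]
p "abc".gsub(/x*/, "-")          # prints "-a-b-c-"

p empty_match_count("one two", /\w*/)  # prints 2
p empty_match_count("one two", /\w+/)  # prints 0
```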

Kind regards

robert
 
