Novice Q: What's the difference between /\s*/ and /(\s)*/?

Mike Meng

Hi,
I'm new to Ruby and am reading 'Programming Ruby 2/e' now. I encountered
a tricky problem while reading the 'String' section of chapter 5. Here is
the problem:

# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/\s*\|\s*/)
# code end

Run the code we get:
file=='/jazz/j00319.mp3'
duration=='2:58'
artist=='Louis Armstrong'
title=='Wonderful World'

While if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
that is,
# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/(\s)*\|(\s)*/)
# code end

We get:
file=='/jazz/j00319.mp3'
duration==' '
artist==' '
title=='2:58'

What makes the difference? Any comments are appreciated.
 
William James

Mike said:
> [...]
> While if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
> [...]
> What makes the difference? Any comments are appreciated.

Without the captures, the substrings on which the string is split
are discarded. When you include captures, they are included in
the resulting array. Which makes sense: why would you include
captures if you didn't want to do something with them?
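
A small sketch of this point, reusing Mike's line (the variable names plain and capturing are only illustrative). String#split drops the separators when the pattern has no groups and interleaves the captured text when it does. Note that (\s)* keeps only the last repetition of the group, so each capture here is a single space.

```ruby
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'

# No capture groups: the separators are discarded.
plain = line.split(/\s*\|\s*/)
p plain
# => ["/jazz/j00319.mp3", "2:58", "Louis Armstrong", "Wonderful World"]

# Two capture groups: each separator contributes two extra elements.
capturing = line.split(/(\s)*\|(\s)*/)
p capturing.first(4)
# => ["/jazz/j00319.mp3", " ", " ", "2:58"]
p capturing.size
# => 10
```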
 
Mike Meng

Thank you, William.

Is this behavior defined by the regex spec or by String#split? Where can I
find a detailed explanation?

mike
 
daz

Mike said:
> # code
> line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
> file, duration, artist, title = line.chomp.split(/\s*\|\s*/)
> # code end
>
> [...]
> While if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
> [...]
> What makes the difference? Any comments are appreciated.

Hi Mike,

You're not seeing the difference because of your assignments.
Try playing with this:

#-----------------------------------------------------------
def splt(patt)
  res = LINE.split(patt)
  print "#-> (%s)\n#-> %2d: " % [patt.inspect, res.size]
  res.to_a.each {|col| print ' (%s)' % [col]}
  puts; puts
end

LINE = 'ABC_:_KLM_:_NOP_:_XYZ'

splt(/_:_/)
splt(/(_:_)/)
splt(/_(:)_/)
splt(/(_):(_)/)
splt(/((_):(_))/)
splt(/((_)(:)(_))/)
splt(/_:_K/)
splt(/(_:_)K/)
splt(/(_:_K)/)
splt(/((_:_K))/)
splt(/(((_:_K)))/)
#-----------------------------------------------------------

#-> (/_:_/)
#-> 4: (ABC) (KLM) (NOP) (XYZ)

#-> (/(_:_)/)
#-> 7: (ABC) (_:_) (KLM) (_:_) (NOP) (_:_) (XYZ)

#-> (/_(:)_/)
#-> 7: (ABC) (:) (KLM) (:) (NOP) (:) (XYZ)

#-> (/(_):(_)/)
#-> 10: (ABC) (_) (_) (KLM) (_) (_) (NOP) (_) (_) (XYZ)

#-> (/((_):(_))/)
#-> 13: (ABC) (_:_) (_) (_) (KLM) (_:_) (_) (_) (NOP) (_:_) (_) (_) (XYZ)

#-> (/((_)(:)(_))/)
#-> 16: (ABC) (_:_) (_) (:) (_) (KLM) (_:_) (_) (:) (_) (NOP) (_:_) (_) (:) (_) (XYZ)

#-> (/_:_K/)
#-> 2: (ABC) (LM_:_NOP_:_XYZ)

#-> (/(_:_)K/)
#-> 3: (ABC) (_:_) (LM_:_NOP_:_XYZ)

#-> (/(_:_K)/)
#-> 3: (ABC) (_:_K) (LM_:_NOP_:_XYZ)

#-> (/((_:_K))/)
#-> 4: (ABC) (_:_K) (_:_K) (LM_:_NOP_:_XYZ)

#-> (/(((_:_K)))/)
#-> 5: (ABC) (_:_K) (_:_K) (_:_K) (LM_:_NOP_:_XYZ)
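
To connect these results with the original question: a parallel assignment to four variables silently keeps only the first four elements, which is why the captured separators ended up in duration and artist. A rough illustration using the same test string (the lowercase variable names are hypothetical):

```ruby
sample = 'ABC_:_KLM_:_NOP_:_XYZ'

# Four targets, ten elements: everything after the fourth is dropped.
a, b, c, d = sample.split(/(_):(_)/)
p [a, b, c, d]
# => ["ABC", "_", "_", "KLM"]

# A splat on the last target keeps the remainder visible.
first, *rest = sample.split(/(_):(_)/)
p first
# => "ABC"
p rest.size
# => 9
```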


daz
 
Julian Leviston

I'm not sure if someone's already answered this, but...

putting parentheses around things groups them... and it's treated as
though it's a single regexp...

so:
/\s*/ means match a space, zero or more times to the extent of the
contiguous spaces...

but
/(\s)*/ means "match a space, zero or more times to the extent of
THIS CONTIGUOUS MATCH". It first matches zero spaces, then the limit
of the zero spaces is ... (funnily enough) zero spaces, so it doesn't
go any further. You don't want to use parentheses.

There have been whole books written on regular expressions. If you're
going to use them well, they're worth reading, I'd suggest.

Julian.
 
Gavin Kistner

Julian said:
> I'm not sure if someone's already answered this, but...
>
> putting parentheses around things groups them... and it's treated
> as though it's a single regexp...
>
> so:
> /\s*/ means match a space, zero or more times to the extent of the
> contiguous spaces...
>
> but
> /(\s)*/ means "match a space, zero or more times to the extent of
> THIS CONTIGUOUS MATCH". It first matches zero spaces, then the limit
> of the zero spaces is ... (funnily enough) zero spaces, so it
> doesn't go any further. You don't want to use parentheses.

Actually, using parentheses here will not affect what is matched,
only what is saved. Even with the parens, each time the accumulator
is run it re-matches the character class. Either that, or I'm
misinterpreting the results below:


" \t\nHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
" \t\nHello".match( /^(\s)*(\w+)/ ) #=> "\n" , "Hello"
" \t\nHello".match( /^(\s*)(\w+)/ ) #=> " \t\n" , "Hello"
"\t Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\t Hello".match( /^(\s)*(\w+)/ ) #=> " " , "Hello"
"\t Hello".match( /^(\s*)(\w+)/ ) #=> "\t " , "Hello"
"\n \tHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\n \tHello".match( /^(\s)*(\w+)/ ) #=> "\t" , "Hello"
"\n \tHello".match( /^(\s*)(\w+)/ ) #=> "\n \t" , "Hello"

(Results are the first two saved subexpressions of the match.)



strings = [
  " \t\nHello",
  "\t Hello",
  "\n \tHello"
]

patterns = [
  /^\s*(\w+)/,
  /^(\s)*(\w+)/,
  /^(\s*)(\w+)/
]

strings.each_with_index{ |str, str_num|
  patterns.each_with_index{ |re, re_num|
    if match = str.match( re )
      info = [ str.inspect, re.inspect, match[1].inspect, match[2].inspect ]
      puts "%s.match( %-14s ) #=> %-8s, %-5s" % info
    end
  }
}
puts "\n(Results are the first two saved subexpressions of the match.)"
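
A minimal check of the claim that the parentheses change only what is saved, not what is matched (variable names are illustrative). The overall match is identical with or without the group, and (\s)* keeps only its last repetition. The last two lines probe the zero-repetition case: a group that never ran is nil, while (\s*) matching nothing yields an empty string.

```ruby
str = " \t\nHello"

plain   = str.match(/^\s*(\w+)/)
grouped = str.match(/^(\s)*(\w+)/)

# Element 0 is the whole match; it is the same with or without the group.
p plain[0] == grouped[0]
# => true

# A repeated group keeps only its final repetition.
p grouped[1]
# => "\n"

# Zero repetitions: the group never participated, so it is nil ...
p "Hello".match(/^(\s)*(\w+)/)[1]
# => nil

# ... whereas a group that matched the empty string yields "".
p "Hello".match(/^(\s*)(\w+)/)[1]
# => ""
```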
 
Gavin Kistner

" \t\nHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
" \t\nHello".match( /^(\s)*(\w+)/ ) #=> "\n" , "Hello"
" \t\nHello".match( /^(\s*)(\w+)/ ) #=> " \t\n" , "Hello"
"\t Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\t Hello".match( /^(\s)*(\w+)/ ) #=> " " , "Hello"
"\t Hello".match( /^(\s*)(\w+)/ ) #=> "\t " , "Hello"
"\n \tHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\n \tHello".match( /^(\s)*(\w+)/ ) #=> "\t" , "Hello"
"\n \tHello".match( /^(\s*)(\w+)/ ) #=> "\n \t" , "Hello"

Three more pertinent data points:

"Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"Hello".match( /^(\s)*(\w+)/ ) #=> nil , "Hello"
"Hello".match( /^(\s*)(\w+)/ ) #=> "" , "Hello"
 
Jeff Wood

Ok, now for a clean and simple answer...

The parentheses create "groups". Groups give you the ability to save
away parts of the matched pattern for easy access after a match has
been made. And yes, the O'Reilly book "Mastering Regular Expressions"
would be a good read; it explains this concept quite completely.

When you use Regexp#match, if you find a match, you are returned a
MatchData object which provides an [] API for standard array style
access.

Element 0 ( [0] ) provides the COMPLETE match. If you defined any
groups in the regular expression, they will then appear as additional
elements in the MatchData [] array. Each group is then accessible
using ordinals ( [1] .. [9] I don't know what happens after 9, I've
never needed that many ) ... Remember that if you define a group to be
inside an optional region of the regular expression, that group will
return nil.

So, if you were parsing phone numbers in the format of:
###-###-####

You could save yourself a bit of code and define your regex to be:

a = "My phone number is : 800-555-1212"
b = /(\d{3})\-(\d{3})\-(\d{4})/

c = b.match( a )

if c
  puts c[0] # returns the complete match : 800-555-1212
  puts c[1] # returns group 1 : 800
  puts c[2] # returns group 2 : 555
  puts c[3] # returns group 3 : 1212
else
  puts "Not a match"
end

I hope that helps. Remember that [0] always exists, but the other
items only exist if you define groups within your regular expression.
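
As a small extension of the example above, MatchData#captures returns just the groups (without element 0), which pairs well with parallel assignment. The names text, area, prefix and number are made up for this sketch:

```ruby
text = "My phone number is : 800-555-1212"
m = /(\d{3})-(\d{3})-(\d{4})/.match(text)

if m
  # captures skips m[0] and returns only the groups, in order.
  area, prefix, number = m.captures
  p m.captures
  # => ["800", "555", "1212"]
  p m.to_a.size
  # => 4
end
```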

j.

 
David A. Black

Hi --

Jeff said:
> Ok, now for a clean and simple answer...
>
> The parentheses create "groups". Groups give you the ability to save
> away parts of the matched pattern for easy access after a match has
> been made. And yes, the O'Reilly book "Mastering Regular Expressions"
> would be a good read; it explains this concept quite completely.
>
> When you use Regexp#match, if you find a match, you are returned a
> MatchData object which provides an [] API for standard array style
> access.
>
> Element 0 ( [0] ) provides the COMPLETE match. If you defined any
> groups in the regular expression, they will then appear as additional
> elements in the MatchData [] array. Each group is then accessible
> using ordinals ( [1] .. [9] I don't know what happens after 9, I've
> never needed that many ) ...

I believe that would be 10 :)

irb(main):003:0> m = /(((((((((((a)))))))))))/.match("a")
=> #<MatchData:0xbf4bbd84>
irb(main):004:0> $10
=> "a"
irb(main):005:0> m[10]
=> "a"

> Remember that if you define a group to be
> inside an optional region of the regular expression, that group will
> return nil.

I'm not sure what you mean there.

irb(main):009:0> /((a)?)?/.match("a").to_a
=> ["a", "a", "a"]


David
 
Jeff Wood

Regarding nil values for groups

if you define your regular expressions like this :

a = "My phone number is : 555-1212"
b = /((\d{3})\-)?(\d{3})\-(\d{4})/
c = b.match( a )
puts c.to_a

c should be [ "555-1212", nil, nil, "555", "1212" ]

(The character positions below refer to the regexp source.)

[0] holds a copy of the complete match
[1] matches the parens from char 0 through char 10
- The following question mark states that either 0 or 1 instance of the
previous group should be accepted in the pattern match.
[2] matches the parens from char 1 through char 7
[3] matches the parens from char 12 through char 18
[4] matches the parens from char 21 through char 27
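
Printing with p instead of puts shows the nil entries explicitly. The sketch below also runs the same pattern against a number that includes an area code, where the optional group does take part in the match; the variable names are illustrative:

```ruby
pattern = /((\d{3})-)?(\d{3})-(\d{4})/

short = pattern.match("My phone number is : 555-1212")
p short.to_a
# => ["555-1212", nil, nil, "555", "1212"]

long = pattern.match("My phone number is : 800-555-1212")
p long.to_a
# => ["800-555-1212", "800-", "800", "555", "1212"]
```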

I hope that made sense too.

j.


David A. Black said:
> On Thu, 25 Aug 2005, Jeff Wood wrote:
>> [...]
>> Remember that if you define a group to be
>> inside an optional region of the regular expression, that group will
>> return nil.
>
> I'm not sure what you mean there.
>
> irb(main):009:0> /((a)?)?/.match("a").to_a
> => ["a", "a", "a"]
 
ts

D> I'm not sure what you mean there.

moulon% ruby -e '"The gateway is broken ? yes / no ?" =~ /(yes)|(no)/; p $1,$2'
"yes"
nil
moulon%
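
The same effect in script form, as a rough sketch: with alternation, only the branch that actually matched fills its group, and the other group is nil. The second input string is an invented example.

```ruby
m = /(yes)|(no)/.match("The gateway is broken ? yes / no ?")
p m.captures
# => ["yes", nil]

# Hypothetical input where the second branch is the one that matches.
m2 = /(yes)|(no)/.match("no, not yet")
p m2.captures
# => [nil, "no"]
```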


Guy Decoux
 
David A. Black

Hi --

Jeff said:
> Regarding nil values for groups
>
> if you define your regular expressions like this :
>
> a = "My phone number is : 555-1212"
> b = /((\d{3})\-)?(\d{3})\-(\d{4})/
> c = b.match( a )
> puts c.to_a
>
> c should be [ "555-1212", nil, nil, "555", "1212" ]

Oh, well, yes -- if there's no match for the group. I was taking you
very literally:

You didn't include the "if there's no match" bit :)


David
 
Mike Meng

Thank you, Julian.

I picked up O'Reilly's Mastering Regular Expressions by Jeff Friedl. On
page 326, it says:

"Capture parentheses change the whole face of split. When they are
used, the return list has additional, independent elements interjected
for the items captured by the parentheses."
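
If parentheses are needed only for grouping, a non-capturing group (?:...) leaves the split result unchanged. A minimal sketch with the original line from the first post:

```ruby
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'

# (?:...) groups without capturing, so split adds no extra elements.
file, duration, artist, title = line.chomp.split(/(?:\s)*\|(?:\s)*/)
p [file, duration, artist, title]
# => ["/jazz/j00319.mp3", "2:58", "Louis Armstrong", "Wonderful World"]
```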
 
