Novice Q: What's the difference between /\s*/ and /(\s)*/?

Mike Meng · Aug 18, 2005

Hi,
I'm new to Ruby and am reading 'Programming Ruby 2/e' now. I ran into
a tricky problem while reading the 'String' section of chapter 5. Here
is the problem:

# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/\s*\|\s*/)
# code end

Running the code, we get:
file=='/jazz/j00319.mp3'
duration=='2:58'
artist=='Louis Armstrong'
title=='Wonderful World'

But if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
that is,
# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/(\s)*\|(\s)*/)
# code end

We get:
file=='/jazz/j00319.mp3'
duration==' '
artist==' '
title=='2:58'

What makes the difference? Any comments are appreciated.

William James · Aug 18, 2005

Without the captures, the substrings on which the string is split
are discarded. When you include captures, they are included in
the resulting array. Which makes sense: why would you include
captures if you didn't want to do something with them?
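
A minimal sketch of the point above, using a shortened version of Mike's line. The helper name `fields` is made up for this illustration, and the non-capturing form `(?:...)` is not mentioned in the thread; it is standard Ruby regexp syntax for grouping without saving.

```ruby
# Hypothetical helper (not from the thread): split a record on a pattern.
def fields(line, pattern)
  line.chomp.split(pattern)
end

line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong'

# No capture groups: the delimiters are discarded.
p fields(line, /\s*\|\s*/)
# → ["/jazz/j00319.mp3", "2:58", "Louis Armstrong"]

# Capture groups: whatever each group captured is spliced into the result.
p fields(line, /(\s)*\|(\s)*/)
# → ["/jazz/j00319.mp3", " ", " ", "2:58", " ", " ", "Louis Armstrong"]

# Non-capturing groups: grouping only, so nothing extra is returned.
p fields(line, /(?:\s)*\|(?:\s)*/)
# → ["/jazz/j00319.mp3", "2:58", "Louis Armstrong"]
```

With a four-variable assignment, the second result puts ' ' into the second and third variables and '2:58' into the fourth, which is what Mike observed.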

Mike Meng · Aug 18, 2005

Thank you, William.

Is this behavior defined by the regex spec or by String#split? Where
can I find a detailed explanation?

mike

daz · Aug 18, 2005

Mike said:
# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/\s*\|\s*/)
# code end

[...]
While if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
[...]
What makes the difference? Any comments are appreciated.

Hi Mike,

You're not seeing the difference because of your assignments.
Try playing with this:

#-----------------------------------------------------------
def splt(patt)
  res = LINE.split(patt)
  print "#-> (%s)\n#-> %2d: " % [patt.inspect, res.size]
  res.to_a.each {|col| print ' (%s)' % [col]}
  puts; puts
end

LINE = 'ABC_:_KLM_:_NOP_:_XYZ'

splt(/_:_/)
splt(/(_:_)/)
splt(/_(:)_/)
splt(/(_):(_)/)
splt(/((_):(_))/)
splt(/((_)(:)(_))/)
splt(/_:_K/)
splt(/(_:_)K/)
splt(/(_:_K)/)
splt(/((_:_K))/)
splt(/(((_:_K)))/)
#-----------------------------------------------------------

#-> (/_:_/)
#-> 4: (ABC) (KLM) (NOP) (XYZ)

#-> (/(_:_)/)
#-> 7: (ABC) (_:_) (KLM) (_:_) (NOP) (_:_) (XYZ)

#-> (/_(:)_/)
#-> 7: (ABC) (:) (KLM) (:) (NOP) (:) (XYZ)

#-> (/(_):(_)/)
#-> 10: (ABC) (_) (_) (KLM) (_) (_) (NOP) (_) (_) (XYZ)

#-> (/((_):(_))/)
#-> 13: (ABC) (_:_) (_) (_) (KLM) (_:_) (_) (_) (NOP) (_:_) (_) (_) (XYZ)

#-> (/((_)(:)(_))/)
#-> 16: (ABC) (_:_) (_) (:) (_) (KLM) (_:_) (_) (:) (_) (NOP) (_:_) (_) (:) (_) (XYZ)

#-> (/_:_K/)
#-> 2: (ABC) (LM_:_NOP_:_XYZ)

#-> (/(_:_)K/)
#-> 3: (ABC) (_:_) (LM_:_NOP_:_XYZ)

#-> (/(_:_K)/)
#-> 3: (ABC) (_:_K) (LM_:_NOP_:_XYZ)

#-> (/((_:_K))/)
#-> 4: (ABC) (_:_K) (_:_K) (LM_:_NOP_:_XYZ)

#-> (/(((_:_K)))/)
#-> 5: (ABC) (_:_K) (_:_K) (_:_K) (LM_:_NOP_:_XYZ)

daz

Julian Leviston · Aug 24, 2005

I'm not sure if someone's already answered this, but...

putting parentheses around things groups them... and it's treated as
though it's a single regexp...

so:
/\s*/ means match a space, zero or more times to the extent of the
contiguous spaces...

but
/(\s)*/ means match a space, zero or more times to the extent of
THIS CONTIGUOUS MATCH. It first matches zero spaces, then the limit
of the zero spaces is ... (funnily enough) zero spaces, so it doesn't
go any further. You don't want to use parentheses.

There have been whole books written on regular expressions. If you're
going to use them well, they're worth reading, I'd suggest.

Julian.

Gavin Kistner · Aug 24, 2005

Actually, using parentheses here will not affect what is matched,
only what is saved. Even with the parens, each repetition of the
quantifier re-matches the character class and overwrites the saved
group. Either that, or I'm misinterpreting the results below:

" \t\nHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
" \t\nHello".match( /^(\s)*(\w+)/ ) #=> "\n" , "Hello"
" \t\nHello".match( /^(\s*)(\w+)/ ) #=> " \t\n" , "Hello"
"\t Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\t Hello".match( /^(\s)*(\w+)/ ) #=> " " , "Hello"
"\t Hello".match( /^(\s*)(\w+)/ ) #=> "\t " , "Hello"
"\n \tHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\n \tHello".match( /^(\s)*(\w+)/ ) #=> "\t" , "Hello"
"\n \tHello".match( /^(\s*)(\w+)/ ) #=> "\n \t" , "Hello"

(Results are the first two saved subexpressions of the match.)

strings = [
  " \t\nHello",
  "\t Hello",
  "\n \tHello"
]

patterns = [
  /^\s*(\w+)/,
  /^(\s)*(\w+)/,
  /^(\s*)(\w+)/
]

strings.each_with_index{ |str, str_num|
  patterns.each_with_index{ |re, re_num|
    if match = str.match( re )
      info = [ str.inspect, re.inspect, match[1].inspect, match[2].inspect ]
      puts "%s.match( %-14s ) #=> %-8s, %-5s" % info
    end
  }
}
puts "\n(Results are the first two saved subexpressions of the match.)"
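
To make the "what is matched" versus "what is saved" distinction concrete, here is a small sketch. The helper `whole_and_first` is a made-up name for this illustration; it relies only on String#match and MatchData#[].

```ruby
# Hypothetical helper: return [whole match, group 1] for a pattern.
# MatchData#[] returns nil for a group that does not exist or did not match.
def whole_and_first(str, re)
  m = str.match(re)
  [m[0], m[1]]
end

str = " \t\nHello"

no_group     = whole_and_first(str, /^\s*\w+/)    # no groups at all
group_inside = whole_and_first(str, /^(\s)*\w+/)  # group inside the repetition
group_around = whole_and_first(str, /^(\s*)\w+/)  # repetition inside the group

p no_group      # → [" \t\nHello", nil]
p group_inside  # → [" \t\nHello", "\n"]
p group_around  # → [" \t\nHello", " \t\n"]
```

All three patterns consume the same text; only group 1 differs. With `(\s)*` the group ends up holding the last whitespace character that was matched.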

Gavin Kistner · Aug 24, 2005

Three more pertinent data points:

"Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"Hello".match( /^(\s)*(\w+)/ ) #=> nil , "Hello"
"Hello".match( /^(\s*)(\w+)/ ) #=> "" , "Hello"

Jeff Wood · Aug 24, 2005

Ok, now for a clean and simple answer...

Parentheses create "groups". Groups give you the ability to save
away parts of the matched pattern for easy access after a match has
been made. And yes, the O'Reilly book "Mastering Regular Expressions"
would be a good read; it explains this concept quite completely.

When you use Regexp#match and a match is found, you are returned a
MatchData object, which provides a [] API for standard array-style
access.

Element 0 ( [0] ) provides the COMPLETE match. If you defined any
groups in the regular expression, they then appear as additional
elements in the MatchData [] array. Each group is accessible
using ordinals ( [1] .. [9]; I don't know what happens after 9, I've
never needed that many ) ... Remember that if you define a group to be
inside an optional region of the regular expression, that group will
return nil.

So, if you were parsing phone numbers in the format of:
###-###-####

You could save yourself a bit of code and define your regex to be:

a = "My phone number is : 800-555-1212"
b = /(\d{3})\-(\d{3})\-(\d{4})/

c = b.match( a )

if c
  puts c[0] # returns the complete match : 800-555-1212
  puts c[1] # returns group 1 : 800
  puts c[2] # returns group 2 : 555
  puts c[3] # returns group 3 : 1212
else
  puts "Not a match"
end

I hope that helps. Remember that [0] always exists, but the other
items only exist if you define groups within your regular expression.

j.

--
"So long, and thanks for all the fish"

Jeff Wood

David A. Black · Aug 24, 2005

Hi --

Jeff said:
[...]
Each group is then accessible using ordinals ( [1] .. [9]; I don't
know what happens after 9, I've never needed that many ) ...

I believe that would be 10

irb(main):003:0> m = /(((((((((((a)))))))))))/.match("a")
=> #<MatchData:0xbf4bbd84>
irb(main):004:0> $10
=> "a"
irb(main):005:0> m[10]
=> "a"

Jeff said:
Remember that if you define a group to be
inside an optional region of the regular expression, that group will
return nil.

I'm not sure what you mean there.

irb(main):009:0> /((a)?)?/.match("a").to_a
=> ["a", "a", "a"]

David

Jeff Wood · Aug 24, 2005

Regarding nil values for groups:

If you define your regular expression like this:

a = "My phone number is : 555-1212"
b = /((\d{3})\-)?(\d{3})\-(\d{4})/
c = b.match( a )
puts c.to_a

c.to_a should be [ "555-1212", nil, nil, "555", "1212" ]

[0] holds a copy of the complete match
[1] is captured by the parens from char 0 through char 10 of the pattern
- The question mark that follows says that either 0 or 1 instance of
the previous group should be accepted in the pattern match.
[2] is captured by the parens from char 1 through char 7
[3] is captured by the parens from char 12 through char 18
[4] is captured by the parens from char 21 through char 27

I hope that made sense too.
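
A hedged sketch of the same idea, checking both the short and the long form of the number. `PHONE` and `phone_parts` are names invented here for illustration; the pattern itself is the one from the post.

```ruby
# Pattern from the post: an optional "###-" prefix, then "###-####".
PHONE = /((\d{3})\-)?(\d{3})\-(\d{4})/

# Hypothetical helper: return the match as an array, or nil if no match.
def phone_parts(text)
  m = PHONE.match(text)
  m && m.to_a
end

# The optional prefix group does not take part in the match,
# so groups 1 and 2 come back as nil.
p phone_parts("My phone number is : 555-1212")
# → ["555-1212", nil, nil, "555", "1212"]

# With an area code present, every group has a value.
p phone_parts("My phone number is : 800-555-1212")
# → ["800-555-1212", "800-", "800", "555", "1212"]
```

Note that `puts c.to_a` prints one element per line (an empty line for each nil), while `p c.to_a` shows the array form used above.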

j.

ts · Aug 24, 2005

D> I'm not sure what you mean there.

moulon% ruby -e '"The gateway is broken ? yes / no ?" =~ /(yes)|(no)/; p $1,$2'
"yes"
nil
moulon%

Guy Decoux

David A. Black · Aug 24, 2005

Hi --

Jeff said:
Regarding nil values for groups

if you define your regular expressions like this :

a = "My phone number is : 555-1212"
b = /((\d{3})\-)?(\d{3})\-(\d{4})/
c = b.match( a )
puts c.to_a

c should be [ "555-1212", nil, nil, "555", "1212" ]

Oh, well, yes -- if there's no match for the group. I was taking you
very literally:

You didn't include the "if there's no match" bit

David

Mike Meng · Aug 29, 2005

Thank you, Julian.

I looked it up in O'Reilly's Mastering Regular Expressions by Jeff
Friedl. On page 326, it says:

"Capture parentheses change the whole face of split. When they are
used, the return list has additional, independent elements interjected
for the items captured by the parentheses."
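
As an illustration of why this behavior can be wanted, here is a sketch that keeps the delimiters on purpose. The `tokens` helper and the sample expression are invented for this example; they come neither from the book nor from the thread.

```ruby
# Hypothetical tokenizer: split on arithmetic operators, and keep each
# operator in the result because it sits inside a capture group.
# The surrounding \s* is outside the group, so the whitespace is dropped.
def tokens(expr)
  expr.split(/\s*([+\-*\/])\s*/)
end

p tokens('12 + 3*4 - 5')
# → ["12", "+", "3", "*", "4", "-", "5"]
```

Here the "additional, independent elements" are exactly the operators, which a plain split would have thrown away.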
