Puzzling regex behaviour

Ian Macdonald

Hello,

Can anyone explain this to me?

$ echo $LANG
nl_NL
$ irb -f
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil
irb(main):003:0> foo =~ /\W/
=> 2

First question: Why does the final statement return 2 instead of nil?
All characters in foo are alphabetic characters in this locale.

Then:

$ echo $LANG
nl_NL
$ cat ./foo
#!/usr/bin/ruby -w

foo = "préférées"
p foo =~ /[^[:alnum:]]/
p foo =~ /\W/
$ ./foo
2
2

Huh?

Second question: Why does the first regex match now return 2 instead of
nil?

To my way of thinking, both statements should always return nil, whether
or not they are typed into irb or run in a stand-alone script. At the
very least, both statements should return the same answer, regardless of
the context.

What am I missing here?

Ian
--
Ian Macdonald | tachyon emissions overloading the system
(e-mail address removed) |
http://www.caliban.org/ |
|
|
 
Robert Klemme

Maybe there is an initialization in .irbrc that leads to a changed
locale inside IRB. Or your IRB belongs to a different Ruby version on
that system.

Other than that, I guess you tripped into the wide and wild country of
i18n - many strange things can be found there. Maybe \w and \W only
treat ASCII [a-z] characters as word characters.

Kind regards

robert
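
The guess about \w can be checked directly. The following is a minimal sketch, assuming Ruby 1.9 or later, where \w and \W are documented as ASCII-only ([a-zA-Z0-9_]) and POSIX bracket classes are Unicode-aware for UTF-8 strings. That is not the locale-driven Ruby 1.8 behaviour discussed in this thread, and the variable names are illustrative only.

```ruby
# Assumption: Ruby 1.9+; the string is written with \u escapes so the
# result does not depend on how this file was saved.
word = "pr\u00e9f\u00e9r\u00e9es"

# \W is the complement of the ASCII-only set [a-zA-Z0-9_],
# so the first accented letter (character index 2) matches.
non_word_index = word =~ /\W/

# [^[:alnum:]] rejects nothing here, because every character in the
# UTF-8 string counts as alphanumeric for the POSIX bracket class.
non_alnum_index = word =~ /[^[:alnum:]]/

p non_word_index
p non_alnum_index
```

On such a Ruby the two patterns disagree in the same way as the irb session in the original post, but for a different reason: the encoding of the string decides, not the process locale.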
 
Ian Macdonald

Maybe there is an initialization in .irbrc that leads to a changed
locale inside IRB.

Nope; I had hoped it would be that easy, but as you can see from my
snippet of output, I started irb with -f, which bypasses ~/.irbrc.
ENV['LANG'] also prints nl_NL in irb, so that can't be it.

Or your IRB belongs to a different Ruby version on that system.

I compiled it myself, so there has been no mix-and-matching.

Other than that, I guess you tripped into the wide and wild country of
i18n - many strange things can be found there. Maybe \w and \W only
treat ASCII [a-z] characters as word characters.

It does seem that way, as Perl also appears to treat them this way.

However, I'm still puzzled why there's a difference between irb and a
stand-alone script.

Ian
--
Ian Macdonald | If you are what you eat, I guess that makes
(e-mail address removed) | me a cheese danish. -- Anonymous
http://www.caliban.org/ |
|
|
 
David Balmain

However, I'm still puzzled why there's a difference between irb and a
stand-alone script.

Maybe your editor saves the script in UTF-8 format. The irb example
clearly encodes the string in ISO-8859-1. That could explain the
difference.
 
David Balmain

Maybe your editor saves the script in UTF-8 format. The irb example
clearly encodes the string in ISO-8859-1. That could explain the
difference.

For example;

~$ echo $LANG
en_US.ISO-8859-1
~$ irb -f
irb(main):001:0> "pr\351f\351r\351es" =~ /[^[:alnum:]]/
=> nil
irb(main):002:0> "pr\303\251f\303\251r\303\251es" =~ /[^[:alnum:]]/
=> 3

Not exactly what you had but it probably has something to do with the
encoding of the é.

--
Dave Balmain
http://www.davebalmain.com/
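
The two byte sequences in that example are the ISO-8859-1 and UTF-8 encodings of the same word. A small sketch that reproduces both dumps, assuming Ruby 1.9 or later (the helper name octal_dump is made up for illustration):

```ruby
# Hypothetical helper: print non-ASCII bytes as octal escapes,
# the way Ruby 1.8's inspect shows them in this thread.
def octal_dump(str)
  str.bytes.map { |byte| byte < 128 ? byte.chr : format("\\%03o", byte) }.join
end

# The word as Latin-1 bytes (0xE9 is octal 351).
latin1 = "pr\xE9f\xE9r\xE9es".b.force_encoding(Encoding::ISO_8859_1)

# The same characters transcoded to UTF-8 (each accented letter becomes 0xC3 0xA9).
utf8 = latin1.encode(Encoding::UTF_8)

puts octal_dump(latin1)  # pr\351f\351r\351es
puts octal_dump(utf8)    # pr\303\251f\303\251r\303\251es
```

In the UTF-8 form the byte \251 sits at byte offset 3, which is consistent with the 3 returned above if each byte is classified on its own as a Latin-1 character.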
 
Ian Macdonald

Maybe your editor saves the script in UTF-8 format. The irb example
clearly encodes the string in ISO-8859-1. That could explain the
difference.

For example;

~$ echo $LANG
en_US.ISO-8859-1
~$ irb -f
irb(main):001:0> "pr\351f\351r\351es" =~ /[^[:alnum:]]/
=> nil
irb(main):002:0> "pr\303\251f\303\251r\303\251es" =~ /[^[:alnum:]]/
=> 3

Not exactly what you had but it probably has something to do with the
encoding of the é.

My editor is vim and I run it in the nl_NL locale, so it doesn't start
in UTF-8 mode. To double-check:

:set encoding?
encoding=latin1

And if we dump my little script:

$ od -c foo
0000000 # ! / u s r / b i n / r u b y
0000020 - w \n \n f o o = " p r 351 f 351
0000040 r 351 e s " \n p f o o = ~ /
0000060 [ ^ [ : a l n u m : ] ] / \n p
0000100 f o o = ~ / \ W / \n

You can see that it is, indeed, saved as Latin-1, not UTF-8.

The mystery continues. ;-)
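
The same check can be made from Ruby rather than od. This is only a sketch, assuming Ruby 1.9 or later; valid_utf8? is a hypothetical helper, and the two sample lines are typed in as escapes instead of being read from the actual script.

```ruby
# Hypothetical helper: true when the raw bytes form well-formed UTF-8.
def valid_utf8?(raw_bytes)
  raw_bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# The assignment line as it appears in the od dump (Latin-1, one byte per accent).
latin1_line = "foo = \"pr\xE9f\xE9r\xE9es\"\n".b

# The same line as a UTF-8 editor would have saved it.
utf8_line = "foo = \"pr\xC3\xA9f\xC3\xA9r\xC3\xA9es\"\n".b

p valid_utf8?(latin1_line)  # false: 0xE9 followed by an ASCII letter is not valid UTF-8
p valid_utf8?(utf8_line)    # true
```

A pure-ASCII file also passes this test, so it can only rule UTF-8 out, not confirm Latin-1.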

Ian
--
Ian Macdonald | It's not whether you win or lose, it's how
(e-mail address removed) | you place the blame.
http://www.caliban.org/ |
|
|
 
Ian Macdonald

I can reproduce this 1.8.4

Just to be clear, you are confirming that the following code:

foo = "préférées"
p foo =~ /[^[:alnum:]]/

prints nil in irb and 2 in a stand-alone script when in both cases your
locale is preset to nl_NL?

Ian
--
Ian Macdonald | On a clear disk you can seek forever.
(e-mail address removed) |
http://www.caliban.org/ |
|
|
 
Rob Biedenharn

I can reproduce this 1.8.4

Just to be clear, you are confirming that the following code:

foo = "préférées"
p foo =~ /[^[:alnum:]]/

prints nil in irb and 2 in a stand-alone script when in both cases your
locale is preset to nl_NL?

Ian
--
Ian Macdonald | On a clear disk you can seek forever.
(e-mail address removed) |
http://www.caliban.org/ |

I'm beginning to wonder if the original question is even accurate.
Doing nothing more than changing the encoding and re-saving the file
(where the value for foo was a cut-n-paste from the email), there
doesn't seem to be any discrepancy between ruby and irb. (This
output is from ruby 1.8.5, but 1.8.2 was the same)

rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: ISO-8859 text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "pr?f?r?es"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n => #{foo.inspect}"
[ alnum, dubya ].each do |re|
puts "foo =~ #{re}\n => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
=> "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
rab:code/ruby $ irb -r regexp_and_alnum_versus_w.rb
foo
=> "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
NameError: undefined local variable or method `eixt' for main:Object
from (irb):1
rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: UTF-8 Unicode text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "préférées"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n => #{foo.inspect}"
[ alnum, dubya ].each do |re|
puts "foo =~ #{re}\n => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
=> "pr\303\251f\303\251r\303\251es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
rab:code/ruby $ irb -r regexp_and_alnum_versus_w.rb
foo
=> "pr\303\251f\303\251r\303\251es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
Robert Klemme

I can reproduce this 1.8.4

Just to be clear, you are confirming that the following code:

foo = "préférées"
p foo =~ /[^[:alnum:]]/

prints nil in irb and 2 in a stand-alone script when in both cases your
locale is preset to nl_NL?

Another idea: maybe the readline lib interferes with encodings somehow
in IRB? What happens if you invoke your script from within IRB via "load"?

Kind regards

robert
 
Ian Macdonald

Another idea: maybe the readline lib interferes with encodings somehow
in IRB? What happens if you invoke your script from within IRB via "load"?

It runs as if run from the command line:

irb(main):001:0> load 'foo'
2
2

Ian
--
Ian Macdonald | Man who falls in vat of molten optical
(e-mail address removed) | glass makes spectacle of self.
http://www.caliban.org/ |
|
|
 
Ian Macdonald

I'm beginning to wonder if the original question is even accurate.
Doing nothing more than changing the encoding and re-saving the file
(where the value for foo was a cut-n-paste from the email), there
doesn't seem to be any discrepancy between ruby and irb. (This
output is from ruby 1.8.5, but 1.8.2 was the same)

rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: ISO-8859 text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "pr?f?r?es"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n => #{foo.inspect}"
[ alnum, dubya ].each do |re|
puts "foo =~ #{re}\n => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
=> "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
=> 2

What is your locale? I strongly suspect it's either unset or set to C.
In those cases, I get the same results as you.

If you use en_US or nl_NL, you'll find (or at least, I find) that
'foo =~ /[^[:alnum:]]/' returns nil in irb and 2 from a stand-alone
script.

In fact, even irb returns a different value from the command line. This
is bizarre:

$ irb -f < foo2
foo = "préférées"
"pr\351f\351r\351es"
foo =~ /[^[:alnum:]]/
2
foo =~ /\W/
2

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil
irb(main):003:0> foo =~ /\W/
=> 2

As you can see, interactively irb returns nil for that first regex match.

Ian
--
Ian Macdonald | The cost of living hasn't affected its
(e-mail address removed) | popularity.
http://www.caliban.org/ |
|
|
 
Phrogz

As you can see, interactively irb returns nil for that first regex match.

Aren't you making the assumption that it's the regex at fault here,
and not the encoding of the string when you enter it in irb?

What if you do:

gavinkistner$ cat set_foo.rb
$foo = "préférés"

gavinkistner$ irb
irb(main):001:0> load 'set_foo.rb'
=> true
irb(main):001:0> $foo =~ /[^[:alnum:]]/
???
 
Ian Macdonald

As you can see, interactively irb returns nil for that first regex match.

Aren't you making the assumption that it's the regex at fault here,
and not the encoding of the string when you enter it in irb?

What if you do:

gavinkistner$ cat set_foo.rb
$foo = "préférés"

gavinkistner$ irb
irb(main):001:0> load 'set_foo.rb'
=> true
irb(main):001:0> $foo =~ /[^[:alnum:]]/
???

$ irb
irb(main):001:0> load 'foo'
=> true
irb(main):002:0> $foo
=> "pr\351f\351r\351es"
irb(main):003:0> $foo =~ /[^[:alnum:]]/
=> nil

It's still nil, I'm afraid.

Ian
--
Ian Macdonald | ..disk or the processor is on fire.
(e-mail address removed) |
http://www.caliban.org/ |
|
|
 
Ian Macdonald

It runs as if run from the command line:

irb(main):001:0> load 'foo'
2
2

I beg your pardon. I must have had the locale set incorrectly on that
run. It runs as if typed interactively into irb:

$ irb
irb(main):001:0> load 'foo'
nil
2

Ian
--
Ian Macdonald | A small town that cannot support one lawyer
(e-mail address removed) | can always support two.
http://www.caliban.org/ |
|
|
 
Phrogz

I beg your pardon. I must have had the locale set incorrectly on that
run. It runs as if typed interactively into irb:

$ irb
irb(main):001:0> load 'foo'
nil
2

Phewsh. Combined with the behavior you reported for loading a global
and then matching in IRB, I had feared the world had gone insane. At
least it's consistently weird and the regexp match is, in fact, the
culprit.
 
Rob Biedenharn

Phewsh. Combined with the behavior you reported for loading a global
and then matching in IRB, I had feared the world had gone insane. At
least it's consistently weird and the regexp match is, in fact, the
culprit.

Why don't you just find out which characters are in the [:alnum:] and
\w sets?
alnums = (0..0377).select {|c| c.chr =~ /[[:alnum:]]/ }.map {|c|
c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

$ LANG=nl_NL irb
alnums = (0..0377).select {|c| c.chr =~ /[[:alnum:]]/ }.map {|c|
c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\252
\265\272\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316
\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336\337\340
\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361
\362\363\364\365\366\370\371\372\373\374\375\376\377"
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

sheesh!

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
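
The same enumeration can be tried on a newer interpreter. The sketch below assumes Ruby 1.9 or later, where the character classes follow the encoding of the string being matched instead of the process locale as in the 1.8 sessions above. The helper name class_members is hypothetical, and the exact set printed depends on the interpreter's regex engine.

```ruby
# Hypothetical helper: every character in the 0..0377 range of the given
# encoding (Latin-1 by default) that matches the pattern.
def class_members(pattern, encoding = Encoding::ISO_8859_1)
  (0..0377).map { |byte| byte.chr(encoding) }.select { |char| char =~ pattern }
end

alnums = class_members(/[[:alnum:]]/)
words  = class_members(/\w/)

# Transcode to UTF-8 so the result is readable on a modern terminal.
puts alnums.join.encode(Encoding::UTF_8)
puts words.join.encode(Encoding::UTF_8)

# Characters accepted by only one of the two classes.
only_words  = words - alnums
only_alnums = alnums - words
puts only_words.join.encode(Encoding::UTF_8)
puts only_alnums.join.encode(Encoding::UTF_8)
```

Comparing the two difference lists shows at a glance where \w and [:alnum:] disagree, which is the question the one-liners above were meant to answer.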
 
Ian Macdonald

Why don't you just find out which characters are in the [:alnum:] and
\w sets?

$ LANG=nl_NL irb
alnums = (0..0377).select {|c| c.chr =~ /[[:alnum:]]/ }.map {|c|
c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\252
\265\272\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316
\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336\337\340
\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361
\362\363\364\365\366\370\371\372\373\374\375\376\377"
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

Yes, but all this really does is indicate that the irb behaviour is
the correct one.

When I run this in a stand-alone script, I get this:

$ LANG=nl_NL ./foo
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

It's almost as if the locale isn't being propagated to the process via
the environment. But...

$ LANG=nl_NL ruby -e "puts ENV['LANG']"
nl_NL

...it _is_ being propagated.

Is it the same for you?

Ian
--
Ian Macdonald | When a man is tired of London, he is tired
(e-mail address removed) | of life. -- Samuel Johnson
http://www.caliban.org/ |
|
|
 
