Multibyte regexps...

Horacio Sanson

I am having some issues with regular expressions when working with Japanese
strings.

Using ruby-1.8.3 on Windows XP home (Japanese version) I have this test:

irb(main):271:0> s = "鞄"
=> "\212\223"
irb(main):272:0> l = "行"
=> "\215s"
irb(main):273:0> l =~ /s/
=> 1
irb(main):274:0> puts "#{$`}<<#{$&}>>#{$'}"
E<s>>
=> nil
irb(main):275:0> "#{$`}<<#{$&}>>#{$'}"
=> "\215<<s>>"
irb(main):276:0> s =~ /l/
=> nil


As you can see, comparing two totally different characters (kanji) gives me a
match. Reversing the match gives nil.


How can I get Ruby to match things correctly?

regards,
Horacio

 
Chintan Trivedi

l =~ /s/ ??

It will try to find the character "s" in the string l, not the value held in
the variable s.
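
To match against the contents of a variable rather than a literal letter, the variable has to be interpolated into the pattern. A minimal sketch, assuming Ruby 1.9 or later and UTF-8 strings written with \u escapes (the thread itself uses Ruby 1.8 with Shift_JIS data, where the literal match behaves differently); the variable names on the left are only for illustration:

```ruby
s = "\u9784"   # 鞄
l = "\u884C"   # 行

# /s/ is a pattern for the ASCII letter "s"; it never reads the variable s.
literal_match = l =~ /s/

# Interpolating (and escaping) the variable builds a pattern from its contents.
variable_match = l =~ /#{Regexp.escape(s)}/
self_match     = s =~ /#{Regexp.escape(s)}/

p literal_match, variable_match, self_match
```

Regexp.escape keeps any regexp metacharacters in the variable from being interpreted as pattern syntax.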


 
Yukihiro Matsumoto

Hi,

In message "Re: Multibyte regexps..."

|I am having some issues with regular expressions when working with japanese
|strings.
|
|Using ruby-1.8.3 on Windows XP home (Japanese version) I have this test:
|
|irb(main):271:0> s = "鞄"
|=> "\212\223"
|irb(main):272:0> l = "行"
|=> "\215s"
|irb(main):273:0> l =~ /s/
|=> 1
|irb(main):274:0> puts "#{$`}<<#{$&}>>#{$'}"
|E<s>>
|=> nil
|irb(main):275:0> "#{$`}<<#{$&}>>#{$'}"
|=> "\215<<s>>"
|irb(main):276:0> s =~ /l/
|=> nil

The encoding seems to be Shift_JIS. You have to specify the encoding
before doing regular expression matching. Put the s option after every
regular expression.

$KCODE="sjis" # to make p work right
p s = "鞄"
p l = "行"
p l =~ /s/s
puts "#{$`}<<#{$&}>>#{$'}"
p "#{$`}<<#{$&}>>#{$'}"
p s =~ /l/s

matz.
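
The reason for the false match can be seen at the byte level: in Shift_JIS, 行 is the two bytes 0x8D 0x73, and 0x73 is also the ASCII code for "s". A minimal sketch of that, assuming Ruby 1.9 or later (where each String carries its own encoding, so neither $KCODE nor the s option is involved); the variable names are only for illustration:

```ruby
# Assumes Ruby 1.9 or later; "\u884C" is 行, converted here to Shift_JIS bytes.
l = "\u884C".encode("Shift_JIS")

sjis_bytes = l.bytes.to_a                 # the two bytes of the character
trail_is_s = (sjis_bytes[1] == "s".ord)   # the trail byte equals ASCII "s"

# Byte-wise view: drop the encoding tag and match on raw bytes, which is
# how an encoding-unaware match sees the data.
bytewise = l.dup.force_encoding("BINARY") =~ /s/

# Character-wise view: the string is tagged Shift_JIS, so the matcher
# does not look inside a two-byte character.
charwise = l =~ /s/

p sjis_bytes.map { |b| "0x%02X" % b }, trail_is_s, bytewise, charwise
```

The byte-wise match reports offset 1, the same result as in the irb session, while the character-wise match reports nil.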
 
Horacio Sanson

Thanks a lot... this seems to work ok.

Where can I find documentation about the $KCODE global variable and the "s"
that goes after each regexp? What exactly does the s mean?

Do I have to put it only on regexps with Japanese characters, or on every
regexp? I tried both and saw no difference.

When using Regexp.new to construct the regular expression, how can I add the
s option?

Sorry for so many questions, but I can't seem to find any docs about these
options.


Horacio
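
On the Regexp.new question: Ruby 1.8 documents an optional third argument, Regexp.new(pattern, options, lang), where a lang beginning with "s" selects Shift_JIS. Later Ruby versions dropped that argument, so the sketch below assumes Ruby 1.9 or later instead, where a regexp built from a Shift_JIS string containing non-ASCII characters takes that encoding from the string. The variable names are only for illustration.

```ruby
# Ruby 1.8 form, shown for reference only (not valid on current Ruby):
#   re = Regexp.new("s", nil, "s")

# Assumes Ruby 1.9 or later.
gyou_sjis  = "\u884C".encode("Shift_JIS")   # 行
kaban_sjis = "\u9784".encode("Shift_JIS")   # 鞄

re = Regexp.new(gyou_sjis)    # pattern string is Shift_JIS
re_encoding = re.encoding

hit  = gyou_sjis  =~ re       # same character
miss = kaban_sjis =~ re       # different character

p re_encoding, hit, miss
```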

 
Horacio Sanson

I found some documentation about this. Thanks.

Just one question: it seems I can do two different things to allow Regexps
to handle multibyte Shift_JIS strings. One is to set the $KCODE global
variable to "sjis", and the other is to use the "s" modifier when
constructing the regular expression.

The question is, do I use only one of the two methods, or should I use the
"s" modifier even if I set $KCODE to "sjis"?

My testing tells me that setting the $KCODE global variable alone is enough
to get Shift_JIS strings and Regexps to work correctly, but I just want to
make sure.

thanks,
Horacio
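
Whichever switch is used, the effect discussed in this thread is that the matcher steps over whole Shift_JIS characters instead of single bytes. A minimal hand-written sketch of that stepping on raw byte arrays follows; sjis_lead_byte? and sjis_index are hypothetical helpers written for this illustration, not part of Ruby, and the lead-byte ranges used (0x81-0x9F and 0xE0-0xFC) are the usual Shift_JIS ones, with everything else treated as a single byte.

```ruby
# Hypothetical helper: true if b starts a two-byte Shift_JIS character.
def sjis_lead_byte?(b)
  (0x81..0x9F).include?(b) || (0xE0..0xFC).include?(b)
end

# Hypothetical helper: byte offset of a single-byte character in
# Shift_JIS data, never matching inside a two-byte character.
def sjis_index(bytes, target)
  i = 0
  while i < bytes.length
    if sjis_lead_byte?(bytes[i])
      i += 2                        # skip lead byte and trail byte together
    else
      return i if bytes[i] == target
      i += 1
    end
  end
  nil
end

gyou = [0x8D, 0x73]                           # 行; trail byte equals "s"
naive = gyou.index(0x73)                      # byte-wise search
aware = sjis_index(gyou, 0x73)                # character-wise search
mixed = sjis_index([0x8D, 0x73, 0x73], 0x73)  # 行 followed by a real "s"

p naive, aware, mixed
```

The byte-wise search finds the trail byte at offset 1, the character-wise search finds nothing in 行 alone, and it finds the real "s" at offset 2 in the last case.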

 
