String#split regex \W on non-ASCII text

Fritz Anderson

Ruby 1.9.2-p0, built from the tarball on Mac OS X 10.6.4 with the Xcode
3.2.5 tools.

Consider the following string:
STR = "sarà la cortesia del gran Lombardo"

Ruby (through irb) correctly identifies this as a Unicode string; the
first word (in case something swallows it on the way to your screen)
ends with an a-grave.

I'd like to naïvely split this line into words. The obvious way to do
this is:

words = STR.split /\W+/
# adding the u qualifier to the regexp doesn't matter

words becomes
=> ["sar", "la", "cortesia", "del", "gran", "Lombardo"]

The a-grave gets interpreted as a non-word (and therefore separator)
character. Apparently the regex implementation doesn't know about
non-ASCII letter classes. Unicode word separation is hard, but Ruby is
famous enough for its Unicode support that I'd hoped it would handle it.
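To narrow it down, here is a small check that can be pasted into irb. It is only a sketch: the variable names are made up for illustration, and the a-grave is written as the escape "\u00E0" so that the source-file encoding doesn't come into play. As far as I can tell, in Ruby 1.9 the \w and \W shorthands cover only ASCII word characters, while the Unicode property \p{L} and the POSIX bracket class [[:alpha:]] do recognize the a-grave as a letter.

```ruby
# The a-grave, written as an escape so the source encoding does not matter.
a_grave = "\u00E0"

# \w is limited to [a-zA-Z0-9_], so there is no match (nil).
ascii_word = a_grave =~ /\w/

# The Unicode "Letter" property matches at index 0.
unicode_letter = a_grave =~ /\p{L}/

# The POSIX bracket class is Unicode-aware as well and matches at index 0.
posix_alpha = a_grave =~ /[[:alpha:]]/

p ascii_word
p unicode_letter
p posix_alpha
```

If that holds on your build too, the limitation would seem to lie in the \w and \W shorthands rather than in the regex engine as a whole.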

Changing the separator regex to something like /[- .,';: ]+/ will do as
a workaround, if I'm willing to iterate my code till I've found all the
separators. But I hate non-general solutions.

Is there any way to fix this? I can rebuild Ruby if need be, though I
see nothing obvious in "./configure --help".
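One candidate that wouldn't need a rebuild, offered as a sketch under assumptions rather than a definitive answer: swap \W for a negated class built from Unicode properties, or collect the words with scan instead of naming the separators. The names line, words_by_split and words_by_scan are purely illustrative, and I'm assuming "letters, digits and underscore" is close enough to what \w was meant to cover.

```ruby
# Same text as STR above; the a-grave is escaped to sidestep source-encoding issues.
line = "sar\u00E0 la cortesia del gran Lombardo"

# Split on runs of anything that is not a Unicode letter, digit, or underscore.
words_by_split = line.split(/[^\p{L}\p{N}_]+/)

# Alternatively, pull out runs of letters directly.
words_by_scan = line.scan(/[[:alpha:]]+/)

p words_by_split
p words_by_scan
```

Both approaches keep the first word intact here. Note that split returns an empty first element if the line starts with a separator, which scan avoids.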

-- F

--
Posted via http://www.ruby-forum.com/.
 
