String#split regex \W on non-ASCII text

Fritz Anderson · Nov 9, 2010

Ruby 1.9.2-p0, built from the tarball on Mac OS X 10.6.4 with the Xcode
3.2.5 tools.

Consider the following string:
STR = "sarà la cortesia del gran Lombardo"

Ruby (through irb) correctly identifies this as a Unicode string; the
first word (in case something swallows it on the way to your screen)
ends with an a-grave.

I'd like to naïvely split this line into words. The obvious way to do
this is:

words = STR.split /\W+/
# adding the u qualifier to the regexp doesn't matter

words becomes
=> ["sar", "la", "cortesia", "del", "gran", "Lombardo"]

The a-grave gets interpreted as a non-word (and therefore separator)
character. Apparently the regex implementation doesn't know about
non-ASCII letter classes. Unicode word separation is hard, but Ruby is
famous enough for its Unicode support that I'd hoped it would handle it.
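
A quick way to confirm the diagnosis is to test the a-grave against a few character classes. This is only a sketch: the helper name letter_class_report is made up for illustration, and it assumes Ruby 1.9 or later with a UTF-8 source file, where \w covers only ASCII letters, digits and underscore while \p{L} and [[:alpha:]] are Unicode-aware.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: reports which character classes match a
# single character. Not part of Ruby's standard library.
def letter_class_report(char)
  {
    :ascii_word     => !(/\w/ =~ char).nil?,          # \w is [a-zA-Z0-9_] only
    :unicode_letter => !(/\p{L}/ =~ char).nil?,       # any Unicode letter
    :posix_alpha    => !(/[[:alpha:]]/ =~ char).nil?  # Unicode-aware POSIX class
  }
end

report = letter_class_report("\u00E0")  # U+00E0, precomposed a-grave
puts report.inspect
```

If :ascii_word comes back false while the other two are true, the split is behaving as the regex engine defines \W, not misreading the string's encoding.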

Changing the separator regex to something like /[- .,';: ]+/ will do as
a workaround, if I'm willing to iterate my code till I've found all the
separators. But I hate non-general solutions.

Is there any way to fix this? I can rebuild Ruby if need be, though I
see nothing obvious in "./configure --help".

- F

--
Posted via http://www.ruby-forum.com/.

Fritz Anderson · Nov 9, 2010

Character properties are what I needed, thanks. I really did want
non-word separators, but you led me to \P{L}.
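
For anyone landing here later, this is roughly what the \P{L} version looks like. Treat it as a sketch: split_words is a name invented for this example, and the string is written with a \u escape so the a-grave is certain to be the precomposed U+00E0.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: splits on runs of characters that are NOT
# Unicode letters (\P{L} is the negation of \p{L}).
def split_words(text)
  text.split(/\P{L}+/)
end

line = "sar\u00E0 la cortesia del gran Lombardo"
puts split_words(line).inspect
# => ["sarà", "la", "cortesia", "del", "gran", "Lombardo"]
```

One caveat: if the text is in decomposed form (a plain "a" followed by a combining grave, U+0300), the combining mark is not a letter and would still act as a separator. Splitting on /[^\p{L}\p{M}]+/ should keep such marks attached to their base letter. Digits also count as separators here, which may or may not be what you want.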

Thanks again.

- F

