Fritz Anderson
Ruby 1.9.2-p0, built from the tarball on Mac OS X 10.6.4 with the Xcode
3.2.5 tools.
Consider the following string:
STR = "sarà la cortesia del gran Lombardo"
Ruby (through irb) correctly identifies this as a Unicode string; the
first word (in case something swallows it on the way to your screen)
ends with an a-grave.
I'd like to naïvely split this line into words. The obvious way to do
this is:
words = STR.split /\W+/
# adding the u qualifier to the regexp doesn't matter
words becomes
=> ["sar", "la", "cortesia", "del", "gran", "Lombardo"]
The a-grave gets interpreted as a non-word (and therefore separator)
character. Apparently the regex implementation doesn't know about
non-ASCII letter classes. Unicode word separation is hard, but Ruby is
famous enough for its Unicode support that I'd hoped it would handle it.
Changing the separator regex to something like /[- .,';: ]+/ will do as
a workaround, if I'm willing to iterate my code till I've found all the
separators. But I hate non-general solutions.
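For comparison, here is a minimal sketch of the three splits side by side. It assumes the source file is UTF-8 and that the regex engine accepts the Unicode property class \p{L} (any letter); that class is not part of the original post and is shown only as one possible more general separator. The local variable line holds the same text as STR above.

```ruby
# encoding: utf-8
line = "sarà la cortesia del gran Lombardo"

# \W is the complement of the ASCII-only \w, so the a-grave
# counts as a separator and is dropped from the first word.
ascii_words = line.split(/\W+/)

# The hand-listed separator class from the workaround above.
listed_words = line.split(/[- .,';: ]+/)

# Assumption: \p{L} matches any Unicode letter, so splitting on
# runs of non-letters keeps the a-grave inside its word.
unicode_words = line.split(/[^\p{L}]+/)

puts ascii_words.inspect
puts listed_words.inspect
puts unicode_words.inspect
```

Note that the \p{L} variant also splits on digits and underscores, which \W+ does not, so it is not a drop-in replacement if those should stay inside words.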
Is there any way to fix this? I can rebuild Ruby if need be, though I
see nothing obvious in "./configure --help".
- F
--
Posted via http://www.ruby-forum.com/.