State of Unicode support

Chad Perrin

I've heard rumors that "oniguruma fixes everything", and the like. I'm
sure that's a touch of hyperbole, but in any case:

What's the current state of Unicode support in Ruby? My recollection is
of Unicode support being somewhat lacking.
 
why the lucky stiff

Oh man, I really don't have the energy for this thread again :) Chad: if you
get a straight answer about this, let me know. Others: Is there a simple,
straightforward FAQ entry somewhere that says "to use Unicode you have the
following choices"? This keeps coming up.

This isn't a complete answer, but it's the best I can do to help Chad out.
If you really want to solve the question now, Chad, I'd read Julian Tarkhanov's
UNICODE_PRIMER[1].

First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, and it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.

Ruby's strings are not Unicode-aware. There is a library called 'jcode', which
comes with Ruby and tries to help out, but it's very simple, only good for a
few things like counting characters and iterating through them. Again,
UTF-8 only.

Ruby itself also understands UTF-8 regular expressions to a degree, using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of
str.scan(/./u), which returns an array of strings, each containing a single
multibyte character. (Also: str.unpack('U*').)
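A quick sketch of those two idioms (in 1.8 you would also set $KCODE = 'u'; later Rubys need no setup):

```ruby
# -*- coding: utf-8 -*-
# Split a UTF-8 string into its characters with the /u regex modifier.
str = "résumé"
chars = str.scan(/./u)

# Or unpack it into an array of integer Unicode code points.
points = str.unpack('U*')

puts chars.length   # 6 characters, even though the string is 8 bytes
puts points.length  # 6 code points
```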

If you are using Unicode strings in Rails, check out Julian's unicode_hacks
plugin: <http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/>
They have a channel on irc.freenode.net: #multibyte_rails.

The unicode_hacks plugin is interesting in that it tries to load one of several
Ruby Unicode extensions before falling back to str.unpack('U*') mode.

Here are the extensions it prefers, in order:

* icu4r: a Ruby extension to IBM's ICU library. Adds UString, URegexp, etc.
  classes for containing Unicode stuffs. (project page[3] and docs[4])
* utf8proc: a small library for iterating through characters and converting
  ints to code points. Adds String#utf8map and Integer#utf8, for example.
  (download[5])
* unicode: a little extension by Yoshida Masato which adds Unicode class
  methods for `strcmp`, `[de]compose`, normalization and case conversion for
  UTF-8. (download[6] and readme[7])

So, many options, some massive, but most only partial and in their infancy.

The most recent entrant into this race, though, is Nikolai Weibull's
ruby-character-encodings library, which aims to get complete multibyte support
into Ruby 1.8's string class. If you use it, it will probably break a lot of
libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]

Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.

This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring over READMEs just to get a
picture of what's happening these days.

Signed in elaborate calligraphy with a picture of grapes at the end,

_why

[1] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/UNICODE_PRIMER
[2] http://www.geocities.jp/kosako3/oniguruma/
[3] http://rubyforge.org/projects/icu4r/
[4] http://icu4r.rubyforge.org/
[5] http://www.flexiguided.de/publications.utf8proc.en.html
[6] http://www.yoshidam.net/Ruby.html
[7] http://www.yoshidam.net/unicode.txt
[8] http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
[9] http://git.bitwi.se/?p=ruby-character-encodings.git;a=summary
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html
 
Chad Perrin

This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring over READMEs just to get a
picture of what's happening these days.

That was most excellent. Thank you for your kind assistance: it answers
my question quite well, and I appreciate your effort.

Signed in elaborate calligraphy with a picture of grapes at the end,

. . and as always, you manage to entertain in the process.
 
Matt Todd

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This presumably includes search algorithms (for regexes, et al), then?

Or is my understanding warped and wrong?

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).

Cheers, folks; remember to be nice. We're on the same team.

M.T.
 
Eric Armstrong

Spectacular summary. As a lurker on this thread,
I greatly appreciate it.
 
Tim Bray

First, Oniguruma[2] is a regular expression engine. It supports Unicode
regular expressions under many encodings, it's very handy. If all you want
to do is search strings for Unicode text, then great, use it.

Er, uh, well, it doesn't do Unicode properties, so you can't use things
like \p{L}, which, once you've found them, quickly come to feel
essential. Anytime you write [a-zA-Z] in a regex, you've probably
just uttered a bug. So I would say that Oniguruma has holes.
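To make the [a-zA-Z] bug concrete, here is a sketch (note: \p{L} does work in modern Rubys, whose Oniguruma-descended engine ships full property tables; at the time of this post it did not):

```ruby
# -*- coding: utf-8 -*-
# [a-zA-Z] only matches ASCII letters; \p{L} matches any Unicode letter.
word = "naïve"

ascii_letters   = word.scan(/[a-zA-Z]/)   # silently drops the ï
unicode_letters = word.scan(/\p{L}/)      # keeps every letter

puts ascii_letters.join   # "nave"
puts unicode_letters.join # "naïve"
```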

Otherwise, a very useful landslide indeed. -Tim
 
Michal Suchanek

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This presumably includes search algorithms (for regexes, et al), then?

Or is my understanding warped and wrong?

Regexes in 1.8 can do UTF-8.
_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).

The problem is with compatibility. In 1.8 it is expected that strings
are arrays of bytes. You can split them into characters with a regex or
convert them into a sequence of codepoints. But no standard library or
function will understand that (except the single one that is there
for undoing the transformation).
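A sketch of that round trip, using the one standard pair that does understand codepoints - String#unpack and Array#pack:

```ruby
# -*- coding: utf-8 -*-
# Convert a UTF-8 string into a sequence of integer code points...
str = "žluť"
codepoints = str.unpack('U*')

# ...and the single standard function for undoing the transformation:
restored = codepoints.pack('U*')

puts codepoints.length
puts restored == str   # true
```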

So one choice is to work with UTF-8 strings and regexes, and
convert the strings whenever you want to get at characters.

Or you can use a special unicode string class (such as from icu4r)
that no standard functions understand. Some may be able to do to_s but
you get a normal string then.

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory because it is far from
_transparent_ unicode support in the standard string class. That is
planned for 2.0.

Thanks

Michal
 
Alex Young

Tim said:
First, Oniguruma[2] is a regular expression engine. It supports Unicode
regular expressions under many encodings, it's very handy. If all you want
to do is search strings for Unicode text, then great, use it.

Er, uh, well, it doesn't do Unicode properties, so you can't use things
like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?
 
Tim Bray

Er, uh, well, it doesn't do Unicode properties, so you can't use
things like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?

Unicode characters have named properties. "L" means it's a letter.
There are sub-properties like Lu and Ll for upper and lower case.
There are lots more properties for things like being numbers, being
white-space, combining forms and particular properties of Asian
characters and so on. Tremendously useful in regexes, particularly
for those of us round-eye gringos who are prone to write [a-zA-Z] and
think we're matching letters, which we're not. If you don't support
properties, you don't support Unicode. -Tim
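A short sketch of a few of those property classes in action (again assuming a property-aware engine, as in modern Rubys):

```ruby
# -*- coding: utf-8 -*-
s = "Ærøskøbing 42 ¾"

puts s.scan(/\p{Lu}/).join  # upper-case letters: "Æ"
puts s.scan(/\p{Ll}/).join  # lower-case letters: "røskøbing"
puts s.scan(/\p{N}/).join   # numbers, including the fraction ¾
```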
 
Julian 'Julik' Tarkhanov

Ruby itself also understands UTF-8 regular expressions to a degree,
using the 'u' modifier. Many Ruby-based UTF-8 hacks are based on the
idea of str.scan(/./u), which returns an array of strings, each
containing a multibyte character. (Also: str.unpack('U*').)

Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as
does a library posted on ruby-talk a while ago (with proper text
boundary handling).
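Julian's point, sketched with a combining accent (String#grapheme_clusters is a much later Ruby addition, shown here only to make the contrast concrete):

```ruby
# -*- coding: utf-8 -*-
# "é" written as base letter e + combining acute accent: one character
# to a reader, but two code points.
s = "e\u0301"

puts s.scan(/./u).length         # 2 -- split between code points
puts s.unpack('U*').length       # 2
puts s.grapheme_clusters.length  # 1 -- the real character count (Ruby 2.5+)
```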
 
Alex Young

Tim said:
Unicode characters have named properties. "L" means it's a letter.
There are sub-properties like Lu and Ll for upper and lower case.
If you don't support properties, you don't support Unicode. -Tim

Gotcha. Thanks for that.
 
Julian 'Julik' Tarkhanov

If you don't support properties, you don't support Unicode.

That's one of the reasons why you _need_ tables when working with
Unicode, and you _will_ spend memory on them. What Ruby does now is
nowhere near that, and Matz wrote that he didn't include complete
tables for Oniguruma in 1.9 yet.

With proper regex support other funky things become possible, for
instance {all_cyrillic_letters} in a regex, etc.
 
Paul Battley

Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as
does a library posted on ruby-talk a while ago (with proper text
boundary handling).

Whilst it's certainly useless for a lot of tasks, I'm not sure that
Ruby is any worse than other languages in this regard. As far as I'm
aware, most languages that 'support' Unicode don't handle grapheme
clusters without using additional libraries.
I, for one, am very saddened every time the topic comes up because I'm
sick of the brokenness (I actually start looking at these Other
Languages and Other Frameworks that take l10n and i18n seriously).

Actually, that's a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

Paul.
 
Julian 'Julik' Tarkhanov

Whilst it's certainly useless for a lot of tasks, I'm not sure that
Ruby is any worse than other languages in this regard. As far as I'm
aware, most languages that 'support' Unicode don't handle grapheme
clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).
Actually, that's a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :))
 
Paul Battley

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).

That's what I mean: ICU is a separate library, not part of a language
core. We can use ICU in Ruby too - it's still pre-alpha and not
seamless, but the possibility exists. From what I've read, Python
doesn't do the heavyweight stuff natively, either. (Please tell me if
I'm wrong - I don't use Python.)
To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?
But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :))

I promised I'd try :) Thanks for the reminder, though! I'll get on with it.

Paul.
 
Julian 'Julik' Tarkhanov

That's what I mean: ICU is a separate library, not part of a language
core.

PHP took the best of both - they are integrating ICU into the core.
Although I always hated their tendency to bloat the core, this is one
of the cases of bloat that I would want to applaud as a gesture of
sanity and common sense.
We can use ICU in Ruby too - it's still pre-alpha and not
seamless, but the possibility exists.

Except for the fact that the maintainer has abandoned it and nobody
has stepped in. I don't do C.
From what I've read, Python
doesn't do the heavyweight stuff natively, either. (Please tell me if
I'm wrong - I don't use Python.)

It depends on what you call "heavyweight". For the purists out there,
I gather, even including a complete Unicode table with
codepoint properties might be "heavyweight".
I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks.

The problem being, my "Right Examples" are nowhere near others'
"Right Examples", which in turn spurs flamewars.
My "right example" is simple - Unicode on no terms, no encoding
choice, characters only - but most are already dissatisfied with such
an attitude, and the issue has been discussed in detail with no
solution satisfying all parties being devised. Too much compromise.
You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

ICU in all its incarnations (Java and C), compulsory character-
oriented Strings without choice of encoding in Java, and the upcoming
Unicode support in Python (again - compulsory Unicode for all
strings, byte arrays for everything else). Perl's regex support. I
know everyone will disagree (how do I match a PNG header in a
string???) but that's what I consider good.

As to localization - resource bundles are good, and of course I
commend all languages that _did_ bother to print localized dates.
Shame on Ruby.
I promised I'd try :) Thanks for the reminder, though! I'll get on
with it.

Gotcha :)
 
Daniel DeLorme

Paul said:
I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

I second that. I see a lot of people asking for "transparent" unicode support
but I don't see how that is possible. To me it's like asking for a language that
has transparent bug recovery. I know that ruby has weaknesses when it comes to
multibyte encodings, but the main problem is human in nature; too many people
assume that char==byte, which results in bugs when someone unexpectedly uses
"weird" characters. IMHO no amount of "transparent support" will change that.
But I would love to be shown otherwise with examples of languages that "do it
right".

Daniel
 
Michal Suchanek

I second that. I see a lot of people asking for "transparent" unicode support
but I don't see how that is possible. To me it's like asking for a language that
has transparent bug recovery. I know that ruby has weaknesses when it comes to
multibyte encodings, but the main problem is human in nature; too many people
assume that char==byte, which results in bugs when someone unexpectedly uses
"weird" characters. IMHO no amount of "transparent support" will change that.
But I would love to be shown otherwise with examples of languages that "do it
right".
By transparent I mean that I can iterate, compare, match, index, ...
not only bytes but also at least code points (and grapheme clusters if
somebody is so nice as to implement that - but for me it is not very
important now), using the standard string class that all standard
functions accept.

In Ruby 1.8, working with anything but bytes is like scratching your
right ear with your left hand .. or leg.
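For contrast, a sketch of the transparency Michal is asking for, roughly what Ruby eventually shipped in 1.9+, where the standard String class is character-aware:

```ruby
# -*- coding: utf-8 -*-
s = "žluťoučký"

puts s.length          # 9 characters, not the byte count
puts s.bytesize        # 13 bytes in UTF-8
puts s[4]              # "o" -- indexing by character, not byte
puts s.include?("ť")   # true -- matching works on characters
```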

Thanks

Michal
 
Michal Suchanek

PHP took the best of both - they are integrating ICU into the core.
Although I always hated their tendency to bloat the core, this is one
of the cases of bloat that I would want to applaud as a gesture of
sanity and common sense.

Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat :)
Except for the fact that the maintainer has abandoned it and nobody
has stepped in. I don't do C.


It depends on what you call "heavyweight". For the purists out there,
I gather, even including a complete Unicode table with
codepoint properties might be "heavyweight".

I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it "heavyweight". It would be a reason to start "optional
standard libraries" I guess :)
The problem being, my "Right Examples" are nowhere near others'
"Right Examples", which in turn spurs flamewars.
My "right example" is simple - Unicode on no terms, no encoding
choice, characters only - but most are already dissatisfied with such
an attitude, and the issue has been discussed in detail with no
solution satisfying all parties being devised. Too much compromise.

It's also been said that giving more options does not stop you from
using only Unicode. If your "right example" is only about restricting
choice then there is really not much to it.

The "right examples" people were interested in are probably more like
the libraries/languages that implement enough functionality to give
you full unicode support for your definition of "full".

Thanks

Michal
 
Julian 'Julik' Tarkhanov

Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat :)

It still is. And it's huge and takes ages to build. If only I knew
something much lighter and better I would have dismissed it.
I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it "heavyweight". It would be a reason to start "optional
standard libraries" I guess :)

I'm stopping right here. Unicode is not an option.
It's been also said that giving more options does not stop you from
using only unicode.

In 90% of the cases giving more options means programmers ignore
Unicode, for reasons ranging from speed
to ignorance. My user experience over the years has proven it.

But then again, I stop right here. And I urge you to do the same :)
 