Oh man, I really don't have the energy for this thread again.
Chad: if you get a straight answer about this, let me know. Others: Is there a
simple, straightforward FAQ entry somewhere that says "to use Unicode you have
the following choices"? This keeps coming up.
This isn't a complete answer, but it's the best I can do to help Chad out.
If you really want to solve the question now, Chad, I'd read Julian Tarkhanov's
UNICODE_PRIMER[1].
First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, and it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.
Ruby's strings are not Unicode-aware. There is a library called 'jcode' that
comes with Ruby and tries to help out, but it's very simple, only good for a
few things like counting characters and iterating through characters. Again,
UTF-8 only.
Ruby itself also understands UTF-8 regular expressions to a degree, using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of
str.scan(/./u), which returns an array of strings, each string containing a
single (possibly multibyte) character. (Also: str.unpack('U*'), which returns
an array of integer code points.)
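Here's a rough sketch of those two tricks, just to make them concrete. The helper names (utf8_chars, utf8_codepoints, utf8_reverse) are made up for this example and don't come from any library. The sample string is built with pack('U*') so the snippet doesn't depend on the encoding of the source file.

```ruby
# Hypothetical helpers wrapping the two idioms; the names are invented here.

# Split a UTF-8 string into an array of one-character strings.
# Note: /./ does not match newlines, so those are dropped.
def utf8_chars(str)
  str.scan(/./u)
end

# Turn a UTF-8 string into an array of integer code points.
def utf8_codepoints(str)
  str.unpack('U*')
end

# Reverse by character rather than by byte.
def utf8_reverse(str)
  utf8_chars(str).reverse.join
end

# "cafe" with an acute accent on the e, built from code points.
word = [99, 97, 102, 233].pack('U*')

puts utf8_chars(word).length          # counts characters, not bytes
puts utf8_codepoints(word).inspect
puts utf8_reverse(word)
```

Array#pack('U*') goes the other way, so you can take a string apart into code points, fiddle with them, and put it back together.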
If you are using Unicode strings in Rails, check out Julian's unicode_hacks
plugin:
<http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/>
They have a channel on irc.freenode.net: #multibyte_rails.
The unicode_hacks plugin is interesting in that it tries to load one of several
Ruby Unicode extensions before falling back to str.unpack('U*') mode.
Here are the extensions it prefers, in order:
* icu4r: a Ruby extension to IBM's ICU library. Adds UString, URegexp, etc.
classes for containing Unicode stuffs.
(project page[3] and docs[4])
* utf8proc: a small library for iterating through characters and converting
ints to code points. Adds String#utf8map and Integer#utf8, for example.
(download[5])
* unicode: a little extension by Yoshida Masato which adds Unicode class
methods for `strcmp`, `[de]compose`, normalization and case conversion for
UTF-8.
(download[6] and readme[7])
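The "try each extension, then fall back" idea can be sketched roughly like this. This is not the plugin's actual code. The method name pick_unicode_backend and the require names in the list are assumptions made for illustration, so check each extension's docs for the real name to require.

```ruby
# Hypothetical sketch of a preference-ordered loader; not taken from
# unicode_hacks. The require names below are assumptions.
PREFERRED_BACKENDS = %w[icu4r utf8proc unicode]

# Try each library in order and return the first one that loads.
# If none can be loaded, return :unpack to signal the plain
# str.unpack('U*') fallback.
def pick_unicode_backend(candidates = PREFERRED_BACKENDS)
  candidates.each do |lib|
    begin
      require lib
      return lib.to_sym
    rescue LoadError
      next # not installed, try the next candidate
    end
  end
  :unpack
end

puts "Using backend: #{pick_unicode_backend}"
```

The nice part of this shape is that the calling code only has to care about which symbol came back, not about which gems happen to be installed on the box.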
So, many options, some massive, but most only partial and in their infancy.
The most recent entrant into this race, though, is Nikolai Weibull's
ruby-character-encoding library, which aims to get complete multibyte support
into Ruby 1.8's string class. If you use it, it will probably break a lot of
libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]
Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.
This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring over READMEs just to get a
picture of what's happening these days.
Signed in elaborate calligraphy with a picture of grapes at the end,
_why
[1] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/UNICODE_PRIMER
[2] http://www.geocities.jp/kosako3/oniguruma/
[3] http://rubyforge.org/projects/icu4r/
[4] http://icu4r.rubyforge.org/
[5] http://www.flexiguided.de/publications.utf8proc.en.html
[6] http://www.yoshidam.net/Ruby.html
[7] http://www.yoshidam.net/unicode.txt
[8] http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
[9] http://git.bitwi.se/?p=ruby-character-encodings.git;a=summary
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html