Marnen said:
Huh? Normalization transformations should be pretty easy to implement.
But the point is, you can't do anything useful with this until you
*transcode* it anyway, which you can do using Iconv (in either 1.8 or
1.9).
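(For the record, normalization did eventually land in the standard library: Ruby 2.2 and later have String#unicode_normalize, so no Iconv is needed there. A quick sketch, assuming one of those versions:)

```ruby
# Decomposed form: "n" followed by a combining tilde (codepoint 771 / U+0303).
decomposed = "espan\u0303ol.lng"

# NFC normalization composes it into the precomposed form.
composed = decomposed.unicode_normalize(:nfc)

composed                            # "español.lng" with a single-codepoint ñ
composed.codepoints.include?(0xF1)  # true: precomposed ñ is U+00F1
composed == decomposed              # false: different codepoint sequences
```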
ruby 1.9's flagship feature of being able to store a string in its
original form, tagged with its encoding, doesn't help the OP much here,
even if the string had been tagged correctly.
I mean, to a degree ruby 1.9 already supports UTF-8-MAC as an
'encoding'. For example:
decomp = [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103].
           map { |x| x.chr("UTF-8-MAC") }.join
=> "español.lng"
decomp.codepoints.to_a
=> [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
decomp.encoding
=> #<Encoding:UTF8-MAC>
Notice that the accented n is displayed as a single character by the
terminal, even though it is two codepoints (110, 771).
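(You can see the two-codepoints-one-glyph situation without any
UTF-8-MAC involvement at all; plain UTF-8 has the same byte layout:)

```ruby
# "n" plus combining tilde U+0303, built codepoint by codepoint.
tilde_n = 110.chr("UTF-8") + 771.chr("UTF-8")

tilde_n            # renders as "ñ" in a capable terminal
tilde_n.length     # 2 codepoints
tilde_n.bytes      # [110, 204, 131] -- "n" plus the two-byte combining mark

"\u00F1".length    # 1 codepoint (precomposed ñ)
"\u00F1".bytes     # [195, 177]
```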
So you could argue that Dir[] on the Mac is at fault here, for tagging
the string as UTF-8 when it should be UTF-8-MAC.
But you still need to transcode to UTF-8 before doing anything useful
with this string. Consider a string containing decomposed characters
tagged as UTF-8-MAC:
(1) The regexp /./ should match a series of decomposed codepoints as a
single 'character'; str[n] should fetch the nth 'character'; and so on.
I don't think this would be easy to implement, since a character
boundary is no longer the same as a codepoint boundary.
What you actually get is this:
=> ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Aside: note that "̃ is actually a single character, a double quote with
the accent applied!
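(Later Rubies did grow a grapheme-aware tool here: the Onigmo regexp
escape \X matches a whole grapheme cluster rather than a single
codepoint, and Ruby 2.5 added String#each_grapheme_cluster on top of it.
A sketch, assuming one of those versions, in plain UTF-8:)

```ruby
decomp = "espan\u0303ol.lng"   # decomposed: n + combining tilde

decomp.scan(/./).length        # 12: one match per codepoint, as shown above
decomp.scan(/\X/).length       # 11: the n + tilde pair counts as one cluster
decomp.scan(/\X/)[4]           # the two-codepoint "ñ" as a single match
```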
(2) The OP wanted to match the regexp containing a single codepoint /ñ/
against the decomposed representation, which isn't going to work anyway.
That is, ruby 1.9 does not automatically transcode strings so that they
are compatible; it just raises an exception if they are not:
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with UTF8-MAC string)
from (irb):5
from /usr/local/bin/irb19:12:in `<main>'
(3) Since ruby 1.9 has a UTF-8-MAC encoding, it *should* be able to
transcode it to UTF-8 without using Iconv. However, this is simply
broken, at least in the version I'm trying here:
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `codepoints'
from (irb):10:in `each'
from (irb):10:in `to_a'
from (irb):10
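(For what it's worth, this transcode does work in later Ruby versions,
and it composes the characters on the way through, which also makes the
regexp match from point (2) possible. A sketch, assuming a Ruby where
the UTF8-MAC converter is fixed:)

```ruby
# Decomposed bytes, as a Mac filesystem would hand them back,
# tagged with the encoding Dir[] arguably should have used.
decomp = "espan\u0303ol.lng".force_encoding("UTF8-MAC")

composed = decomp.encode("UTF-8")
composed              # "español.lng", now with a single precomposed ñ
composed.codepoints   # [101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103]
/ñ/ =~ composed       # matches at index 4; no CompatibilityError
```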
(4) If general support for decomposed forms were added as further
'encodings', there would be an explosion of them: UTF-8-D,
UTF-16LE-D, UTF-16BE-D etc., and that's ignoring the "compatibility"
versus "canonical" composed and decomposed forms.
(5) It is going to be very hard (if not impossible) to make a source
code string or regexp literal containing decomposed "n" and "̃"
distinct from a literal containing a composed "ñ". Try it and see.
(In the above paragraph, the decomposed accent is applied to the
double-quote; that is, "̃ is actually a single 'character'). Most
editors are going to display both the composed and decomposed forms
identically.
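(One escape hatch: from ruby 1.9 on, \u escapes let you spell out
exactly which form you mean, so the literal stays unambiguous even when
the editor renders both forms identically:)

```ruby
composed   = "espa\u00F1ol"    # precomposed ñ, one codepoint
decomposed = "espan\u0303ol"   # n + combining tilde, two codepoints

composed == decomposed         # false, despite identical rendering
composed.length                # 7
decomposed.length              # 8
```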
I think this just shows that ruby 1.9's complexity is not helping in the
slightest. If you have to transcode to UTF-8 composed form anyway, then
ruby 1.8 does this just as well (and then you only need to tag the
regexp as UTF-8 using //u).