Using unicode in YAML

baumanj

I've done my research, and it appears that the current ruby YAML
implementation doesn't really grok unicode. What I want to know is
whether anything has changed in this regard, or is likely to in the
future. It doesn't appear that syck has been updated since May '05,
but if it's something I could fix, I'd be willing to do it.

In any case, here's what I need to do: read in YAML files containing
strings in various languages including Japanese and write the same
strings back out unmolested. UTF-8 seems like the natural choice, but
the encoding could be different, so long as I can do some processing
and keep the strings human readable when I spit them back out. I don't
even need to modify the strings themselves, just modify sets of them
and output.

Are ruby and YAML just not an option here? Any other suggestions?

If you're not familiar, here's the basic problem. I have the following
YAML:

jp: "はい"

(If that doesn't display right, it's just the Japanese characters for
the word "yes".)

I read it in via YAML.load and I get:

{"\357\273\277jp"=>"\343\201\257\343\201\204"}

OK, not so bad, the UTF-8 indicator is on the front there, but I can
deal with that, and the six bytes in octal do indeed correspond to the
UTF-8 codes for the two characters I expect. The problem is when I try
to put this back out, and YAML decides to take my string and convert
it to binary data:
 
Rainer

Hello baumanj,

\xEF\xBB\xBF in your example above is the byte order mark (BOM) that
some editors put at the start of UTF-8 files; \357\273\277 is just the
same three bytes written in octal instead of hex. Have you tried
removing these three bytes before parsing the file and writing your
string back out as YAML? I currently have a similar problem, and I just
found out that YAML.load_file imports these three bytes (wrongly, I
think) into one of my objects (in your case: the "jp" key).
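
To make the octal/hex point concrete, here is a small sketch. The variable names are only for illustration, and the key is copied from the inspect output in the first post:

```ruby
# The octal escapes from the inspect output and the hex escapes
# describe the same three bytes.
octal = "\357\273\277"
hex   = "\xEF\xBB\xBF"
same_bytes = (octal.bytes.to_a == hex.bytes.to_a)

# The key as it came back from YAML.load in the first post:
# the BOM bytes are glued to the front of "jp".
key = "\357\273\277jp"
bom_on_key = (key.bytes.to_a[0, 3] == [0xEF, 0xBB, 0xBF])

puts same_bytes
puts bom_on_key
```

Both checks should print true, which is why a lookup for the plain key "jp" will not find the entry until the BOM is gone.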

So try this:

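require 'yaml'

# read the whole file; the block form closes it for us
raw = File.open("jp.txt", "r") { |f| f.read }
# remove the BOM, but only if the file really starts with one
raw_without_bom = raw.sub(/\A\xEF\xBB\xBF/, "")
# now parse the YAML
hash = YAML.load(raw_without_bom)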
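
If you can use a newer Ruby (1.9 or later, where strings carry an encoding and YAML is backed by Psych instead of syck), a sketch along these lines might be all you need. I'm assuming the file is UTF-8; the file name and the YES constant are made up for the example, and the sample file is written by the script itself so that it runs on its own:

```ruby
require 'yaml'
require 'tmpdir'

# The two characters for "yes" from the first post, written as escapes
# so the script does not depend on the editor's encoding.
YES = "\u306F\u3044"

loaded = nil
dumped = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "jp.yml")   # hypothetical file name

  # Write a sample file that starts with a UTF-8 BOM (U+FEFF).
  File.open(path, "w:utf-8") { |f| f.write("\uFEFFjp: \"#{YES}\"\n") }

  # "bom|utf-8" asks Ruby to skip a leading BOM, if there is one,
  # and to tag the text as UTF-8.
  raw = File.open(path, "r:bom|utf-8") { |f| f.read }

  loaded = YAML.load(raw)     # the key should now be plain "jp"
  dumped = YAML.dump(loaded)  # the string should stay human readable
end

puts dumped
```

The point of the "bom|utf-8" mode is that you don't have to count bytes yourself, and a file without a BOM is read unchanged.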


Hope that helps.

Happy new year!

Rainer
 
