Ruby1.9 Encoding

  • Thread starter Juliano 준호

Juliano 준호

Hey, guys!

I've just started learning Ruby, coming from Python, and recently posted a
question here that was promptly and effectively answered (thanks,
Glenn!), so I decided to come here once more. I hope I'll soon be able
to answer some of the questions myself. ;)

Here it is.

I'm writing a wrapper for a Korean morphological parser that only
works with EUC-KR encoding and has some trouble with longer texts. So,
first, I have to preprocess the input text: divide it into sentences,
remove Unicode characters that are not related to Korean, and save
them so they can be reinserted in the postprocessing stage. This has
worked out wonderfully and, I should say, was easier than what I'd
done in Python.

The problem I'm having now is converting the string from UTF-8 (I'm
running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
morphological parser, reading its output and processing it. I have to
parse a whole bunch of data "spidered" from the internet, and it works
great until the encoder comes across a Korean typo... Let me make this
point clearer: the EUC-KR encoding does not cover all the possible
combinations of initial and final consonants + vowels, as Unicode
does. To explain: Unicode has 11,172 precomposed hangul syllables,
whereas EUC-KR has only 2,350. In fact, most of the extra chars are not
used in everyday life, but they can still show up as abbreviations,
slang, smileys or typos...

In Python, I would simply throw these chars away but I really didn't
manage to understand the "Ruby encoding way"... I know I'm missing
something, but I can't seem to find enough info around... Google
doesn't seem to know much of this either... So, I'm coming here to ask
for your enlightenment, dear rubyist friends!

The part of my code which deals with this is as follows:

def run(txt)
  txt = txt.encode("EUC-KR")
  kts_file = Tempfile::new('kts_text')
  kts_file = open(kts_file.path, "w:EUC-KR")
  kts_file << "#{txt}\n"
  kts_file.close
  cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
  IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
end

I found something about "ignoring" the non-existent codepoints, but it
doesn't work... I'm even thinking that my Ruby installation might have
gotten corrupted somehow... Every time I think I've done it right, I
still get The Exception popping up on the screen...

Thanks for your patience reading this looong post.

Juliano
 
Axel Etzold

-------- Original Message --------
Date: Thu, 10 Sep 2009 18:20:06 +0900
From: "Juliano 준호"
To: (e-mail address removed)
Subject: Ruby1.9 Encoding

Dear Juliano,

A disclaimer first: I know no Korean, so what's below might not work.

I recently had to do some coding to resolve Arabic ligatures (combinations
of two letters). Similar to what you describe, most of the time there is
no need to use a special combined form, and unfortunately the same word
is sometimes spelled one way and sometimes the other, which produces
a list of duplicate words.

I used a list of Unicode characters with the names of the individual
characters to solve that problem.

You might find the table on this page useful:

http://www.kfunigraz.ac.at/~katzer/korean_hangul_unicode.html

I don't know if that list is exhaustive, but you could try to convert
each of the syllables listed there individually from Unicode to EUC-KR
and, if that doesn't work, decide what to do with the particular
combination of signs based on its Latin transcription, building a
transform hash for these encodings yourself.
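
As a rough sketch of that idea in Ruby 1.9 or later (untested against your
parser): test each character on its own and, where the conversion fails,
look it up in a hand-made hash. The helper names and the single entry in
SUBSTITUTES are made up for illustration; the real table would have to
come from someone who knows which replacements make sense in Korean.

```ruby
# Hypothetical helpers, not part of any library.

# True if the single character can be represented in the target encoding.
def encodable?(char, target = "EUC-KR")
  char.encode(target)
  true
rescue Encoding::UndefinedConversionError
  false
end

# Distinct characters of text that the target encoding cannot represent.
def unencodable_chars(text, target = "EUC-KR")
  text.each_char.reject { |c| encodable?(c, target) }.uniq
end

# Hand-made transform hash: unencodable character => replacement.
# The single entry below is only a placeholder.
SUBSTITUTES = { "\u{1F600}" => ":)" }

# Replace every unencodable character by its substitute (or drop it when
# the hash has no entry), then convert the cleaned-up text to EUC-KR.
def to_euc_kr(text, substitutes = SUBSTITUTES)
  safe = text.each_char.map { |c| encodable?(c) ? c : substitutes.fetch(c, "") }.join
  safe.encode("EUC-KR")
end
```

Running unencodable_chars over a sample of the spidered data should give
the list of characters that need an entry in the hash. Every replacement
must itself be representable in EUC-KR, otherwise the final encode call
still raises.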

There might also be some locale- or OS-related problems with Iconv::IGNORE.
There's some discussion of this here:
http://aspn.activestate.com/ASPN/Mail/Message/ruby-talk/3189105

Best regards,

Axel
 
James Edward Gray II

> Hey, guys!
>
> I've just started learning Ruby, coming from Python, and recently posted a
> question here that was promptly and effectively answered (thanks,
> Glenn!), so I decided to come here once more. I hope I'll soon be able
> to answer some of the questions myself. ;)

Welcome to Ruby.

> The problem I'm having now is converting the string from UTF-8 (I'm
> running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
> morphological parser, reading its output and processing it. I have to
> parse a whole bunch of data "spidered" from the internet, and it works
> great until the encoder comes across a Korean typo... Let me make this
> point clearer: the EUC-KR encoding does not cover all the possible
> combinations of initial and final consonants + vowels, as Unicode
> does. To explain: Unicode has 11,172 precomposed hangul syllables,
> whereas EUC-KR has only 2,350. In fact, most of the extra chars are not
> used in everyday life, but they can still show up as abbreviations,
> slang, smileys or typos...
>
> In Python, I would simply throw these chars away but I really didn't
> manage to understand the "Ruby encoding way"...

I think we can throw them away in Ruby too. See below.

> I know I'm missing something, but I can't seem to find enough info
> around... Google doesn't seem to know much of this either...

I wrote a lot about Ruby's encoding engine on my blog:

http://blog.grayproductions.net/articles/understanding_m17n

> The part of my code which deals with this is as follows:
>
> def run(txt)
>   txt = txt.encode("EUC-KR")

Try replacing the above line with:

  txt = txt.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

>   kts_file = Tempfile::new('kts_text')
>   kts_file = open(kts_file.path, "w:EUC-KR")
>   kts_file << "#{txt}\n"
>   kts_file.close
>   cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
>   IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
> end
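
Put together, the whole method might look something like the sketch below.
I haven't run it against ktspell, so treat it as a starting point only. The
helper name to_euc_kr_lossy and the command parameter are my own additions
(the parameter is there so the method can be tried with another filter,
such as cat), and it assumes a Unix-like shell. It also writes through the
Tempfile object itself instead of reopening its path, so the file cannot
be cleaned up before the command has read it.

```ruby
require "tempfile"
require "shellwords"

# Convert to EUC-KR, silently dropping anything that is invalid in the
# source or has no EUC-KR equivalent.
def to_euc_kr_lossy(text)
  text.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")
end

# Feed text to an external filter that speaks EUC-KR and return its output
# as UTF-8. "ktspell" is the default only because the original code uses it.
def run(text, command = "ktspell")
  Tempfile.open("kts_text") do |file|
    file.binmode                              # write the EUC-KR bytes untouched
    file.write(to_euc_kr_lossy(text) + "\n")
    file.flush                                # make sure the command sees the data
    cmd = "#{command} < #{Shellwords.escape(file.path)}"
    IO.popen(cmd, "r:EUC-KR") { |io| io.read }.encode("UTF-8")
  end
end
```

Calling run(some_text, "cat") is a quick way to check the round trip
without the parser: whatever comes back is the input minus the characters
EUC-KR could not hold.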

Hope that helps.

James Edward Gray II
 
