Juliano 준호
Hey, guys!
I've just started learning Ruby, coming from Python, and recently
posted a question here that was promptly and effectively answered
(thanks, Glenn!), so I decided to come here once more. I hope I'll
soon be able to answer some of the questions myself.
Here it is.
I'm writing a wrapper for a Korean Morphological Parser that only
works with EUC-KR encoding and has some trouble with longer texts. So,
first, I have to preprocess the input text: divide it into sentences,
remove the Unicode characters that are not related to Korean, and save
them so they can be reinserted in the postprocessing stage.
This has worked out wonderfully and, I should say, was easier than
what I'd done in Python.
The problem I'm having now is with converting the string from UTF-8
(I'm running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running
the Morphological Parser, reading its output and processing it. I have
to parse a whole bunch of data "spidered" from the internet and it
works great until the encoder comes across a Korean typo... Let me
make this point clearer: the EUC-KR encoding does not cover all the
possible combinations of initial and final consonants + vowels, as
Unicode does. To explain: Unicode has 11,172 precomposed hangul
syllables whereas EUC-KR has only 2,350. In fact, most of the extra
chars are not used in everyday life, but they can still show up as
abbreviations, slang, smileys or typos...
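To make the coverage gap concrete, here is a minimal sketch that lists
the characters of a string which EUC-KR cannot represent. It assumes
Ruby 1.9 or later and a UTF-8 input string; the method name
unconvertible_chars is made up for this illustration and is not part
of the wrapper.

```ruby
# Hypothetical helper: return the characters that have no EUC-KR mapping.
def unconvertible_chars(text)
  text.each_char.reject do |ch|
    begin
      ch.encode("EUC-KR") # raises if the character is not in EUC-KR
      true                # convertible, so reject it from the result
    rescue Encoding::UndefinedConversionError
      false               # not convertible, so keep it in the result
    end
  end
end
```

Calling it on a scraped sentence returns an array of the offending
characters, which could then be saved for the postprocessing stage.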
In Python, I would simply throw these chars away but I really didn't
manage to understand the "Ruby encoding way"... I know I'm missing
something, but I can't seem to find enough info around... Google
doesn't seem to know much of this either... So, I'm coming here to ask
for your enlightenment, dear rubyist friends!
The part of my code which deals with this is as follows:
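require 'tempfile'

def run(txt)
  # This is the call that raises when txt contains a char EUC-KR lacks
  txt = txt.encode("EUC-KR")
  kts_file = Tempfile.new('kts_text')
  # Write the parser input in EUC-KR; the block closes the file
  File.open(kts_file.path, "w:EUC-KR") { |f| f << "#{txt}\n" }
  cmd = "ktspell < #{kts_file.path}" # 2> /dev/null
  # Read the parser output as EUC-KR and convert it back to UTF-8
  IO.popen(cmd, "r:EUC-KR") { |io| io.read.encode("UTF-8") }
end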
I found something about "ignoring" the non-existent codepoints, but it
doesn't work... I'm even starting to think that my Ruby installation
might have gotten corrupted somehow... Every time I think I did it
right, I still get The Exception popping up on the screen...
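For reference, "ignoring" in this sense can be sketched with the
options that String#encode accepts in Ruby 1.9 and later: with
:undef => :replace and an empty :replace string, characters with no
EUC-KR mapping are dropped instead of raising. The method name
to_euckr_lossy is only an illustration, and whether silently dropping
characters is acceptable for the parser is an assumption.

```ruby
# Hypothetical helper: convert to EUC-KR, dropping whatever cannot be
# converted instead of raising an exception.
def to_euckr_lossy(text)
  text.encode("EUC-KR",
              invalid: :replace, # malformed bytes in the source string
              undef:   :replace, # characters EUC-KR has no mapping for
              replace: "")       # replace both with nothing
end
```

In run(txt) this would stand in for the plain txt.encode("EUC-KR")
call; the rest of the method stays the same.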
Thanks for your patience reading this looong post.
Juliano