Parsing Japanese Language and Some Ruby Trivia

Michael Sullivan

All this talk about Unicode support and HTML parsing got me to
wondering about how to parse Japanese text. There are no spaces to
separate words, and though there are some modifiers, or particles, in
the Japanese language, they are sometimes used inconsistently. I could
quote examples, but if you can't read Kanji, Hiragana, and Katakana
they would most likely be meaningless.

So, knowing what little I do of Japanese (been studying for a while
and living in Japan for close to four years), I was wondering how
search engines like Google and Yahoo parse Japanese text, much less
web pages. There are numerous filters to extract text from web
pages, but parsing Japanese text is another matter altogether.

So, I have found one Open Source project which seems to be addressing
this, but I was wondering if there is a solution for Ruby?

Now for the trivia... I've been reading some Japanese text:
"Hiragana Times", a magazine which prints its articles in Japanese
and English as a learning tool, and my newspaper "The Japan Times",
which has a weekly section devoted to bilingual education, as well as
my class textbooks. I've also read some Manga. They generally
present the Kanji with tiny Hiragana characters above them which are
the phonetic equivalent of the Kanji.

Guess what these tiny Hiragana helpers are called... you guessed it:
"Ruby Annotation". Check out what I found on the W3C site:
http://www.w3.org/TR/ruby/

Coincidence?

Mike

--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055

"Any sufficiently advanced technology is indistinguishable from
magic..."
- A. C. Clarke




David Vallner

Michael said:
[snip]

Coincidence?

Absolutely. I suspect the term and concept of "Ruby annotation" are a
lot older than the programming language, and AFAIK, the name of the Ruby
programming language is a reference to its roots in Perl. The fact that
the gemstone the language is named after is ruby might, but doesn't have
to be, intentionally, subconsciously, or coinkydinkally related to Ruby
annotation. If you want to know for certain, subject Matz to regression
hypnosis and take him back to the time he was deciding on a name for the
language.

David Vallner
 

John Fry

Michael Sullivan said:
I was wondering how search engines like Google and Yahoo parse
Japanese text, much less web pages. There are numerous filters to
extract text from web pages, but parsing Japanese text is another
matter altogether.

I'm not sure what you mean by "parsing", but if you mean segmentation
and morphological analysis of Japanese, then two popular packages for
doing this are ChaSen and MeCab.

Best,

John
 

Gene Tani

David said:
[snip]

[going OT] I've noted that Google has gotten much better at separating
hits for Sam Ruby's pages from pages that refer to the Ruby language. Must
be all that Python programming they're doing ;-p
 

Michael Sullivan


Hi,

I posted this last night and probably didn't hit the correct
audience. I got one relevant answer and need to go check the
recommended packages. But for those in Asian time zones, I'll ask
again about parsing Japanese text.

And to clarify, I am looking for a way to extract "words" from the
text for cataloging in a database.

Cheers,
Mike

--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055

"The two most common elements in the universe are hydrogen... and
stupidity."
- Harlan Ellison


Begin forwarded message:
From: Michael Sullivan <[email protected]>
Date: January 12, 2006 12:09:40 AM JST
To: (e-mail address removed) (ruby-talk ML)
Subject: Parsing Japanese Language and Some Ruby Trivia

[original message quoted in full - snipped]
 

Mauricio Fernandez

Michael Sullivan said:
[snip]
And to clarify, I am looking for a way to extract "words" from the
text for cataloging in a database.

http://raa.ruby-lang.org/cache/ruby-chasen/
-> It looks old and the source code is not very enticing, though.

This is probably better:
http://chasen.org/~taku/software/mecab/bindings.html


I wrote something similar to what you want long ago; for some reason I ended
up parsing the output of mecab instead of using the bindings (can't remember
why atm., nor if it still applies).

The following (old, ugly) code collects some words (names, verbs, "i adjectives")
from a utf8 string held in 'text':

require 'iconv'
require 'tempfile'

text = Iconv.iconv("eucjp", "utf-8", text).first
temp = Tempfile.new "jphints"
temp.puts text
temp.close
analysis = `mecab #{temp.path}`
output = Iconv.iconv("utf-8", "eucjp", analysis).first
temp.close!
hints = []
output.each_line do |line|
  break if /\AEOS\s$/u.match line
  hint, nature, canonical = /\A(.+)\s+([^,]+),[^,]+,[^,]+,[^,]+,[^,]+,[^,]+,([^,]+),/u.match(line).captures # UGH
  case nature
  when %w[e5 90 8d e8 a9 9e].map{|x| x.to_i(16)}.pack("c*"), # noun
       %w[e5 8b 95 e8 a9 9e].map{|x| x.to_i(16)}.pack("c*"), # verb
       %w[e5 bd a2 e5 ae b9 e8 a9 9e].map{|x| x.to_i(16)}.pack("c*") # adj i
    puts "REG HINT #{hint} -> #{canonical}\t #{nature}"
    hints << canonical
  else
    # puts "IGNORED #{hint}"
  end
end

# now the words are in hints, as utf8 strings
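As an aside, the packed byte arrays in the `when` clauses are just hard-coded UTF-8 for MeCab's part-of-speech names. A small sketch showing what they spell out (the `force_encoding` call is only needed on Ruby 1.9+, where packed strings come out as binary):

```ruby
# The %w[...].pack trick builds the UTF-8 part-of-speech names byte by byte.
# On a UTF-8 source file the same strings could be written as plain literals.
noun = %w[e5 90 8d e8 a9 9e].map { |x| x.to_i(16) }.pack("c*").force_encoding("UTF-8")
verb = %w[e5 8b 95 e8 a9 9e].map { |x| x.to_i(16) }.pack("c*").force_encoding("UTF-8")
adj  = %w[e5 bd a2 e5 ae b9 e8 a9 9e].map { |x| x.to_i(16) }.pack("c*").force_encoding("UTF-8")

puts noun  # 名詞 (noun)
puts verb  # 動詞 (verb)
puts adj   # 形容詞 (i-adjective)
```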



Hope this helps.
 

Michael Sullivan

Thanks, this looks like what I'm looking for.

Cheers,

Mike

--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055

"Haggis... uh, I was briefed on haggis.... No!"
G W Bush (dubya) - Japan Times, 12 July 2005



Horacio Sanson

To delimit words in a Japanese text you can use MeCab and/or Kakasi (google
them).

Kakasi has Ruby bindings:
http://raa.ruby-lang.org/project/ruby-kakasi/

MeCab also has bindings for several scripting languages (Ruby included):
http://chasen.org/~taku/software/mecab/bindings.html

If you want database text search support for Japanese you can use Tsearch2
with the Teramoto SQL function that uses Kakasi to index Japanese words.

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2j.html

Finally, MSSQL's text search function works with Japanese (with the Japanese
version of MSSQL, of course).

Hope this helps....

Horacio

On Thursday 12 January 2006 15:48, Michael Sullivan wrote:
Thanks, this looks like what I'm looking for.

Cheers,

Mike


Brylie Oxley

Hi,
I have a similar task. I have attached a text file with Japanese
characters. Some of the words are fake and made to look similar to
actually occurring words, as an experiment in linguistics.

Essentially what we are trying to accomplish is to get a count of every
time one of the hiragana characters occurs adjacent to another hiragana
character (including itself) within the context of a word. The majority
of the data in the file is frequency counts for the word over the course
of several years; the numbers can be ignored for the purposes of this
count. At this point we are only concerned with the co-occurrence of
hiragana characters, but katakana and kanji may eventually be useful.

I currently have Ruby 1.8.7 installed and when I paste the characters
into irb they print out as bytes and I am not sure where to begin
figuring out how to write an effective regexp.
--Brylie

Attachments:
http://www.ruby-forum.com/attachment/3880/tinyJPsamplecorpus2.txt
 

Heesob Park

Hi,

2009/7/18 Brylie Oxley said:
[snip]
I currently have Ruby 1.8.7 installed and when I paste the characters
into irb they print out as bytes and I am not sure where to begin
figuring out how to write an effective regexp.


According to your attachment, I guess you want to handle Shift_JIS encoded text.

Here are some clues.

str.scan(/\w+/s)                   # => scan for words, including Japanese
str.scan(/./s)                     # => scan for characters, including Japanese
str.scan(/\x82[\x9f-\xf1]/)        # => scan for one hiragana character
str.scan(/\x83[\x40-\x96]/)        # => scan for one katakana character
str.scan(/[\x88-\xee][\x40-\xfc]/) # => scan for one kanji character (roughly)
str.scan(/(?:\x82[\x9f-\xf1])+/)   # => scan for two or more hiragana characters

Refer to
http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18
http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml

Regards,

Park Heesob
 

Harry Kakueki

Essentially what we are trying to accomplish is to get a count of every
time one of the hiragana characters occurs adjacent to another hiragana
character (including itself) within the context of a word.

I'm not sure what you want if there are more than 2 hiragana.
For example, this "らしつづける"
or this
["ら", "し"]
["し", "つ"]
["つ", "づ"]
["づ", "け"]
["け", "る"]

Don't trust this too much. I'm just playing around a bit. Maybe it
will give you some ideas.

$KCODE = 'u'
p str.scan(/[あ-ん]{2,}/)

OR

require 'enumerator'
$KCODE = 'u'
str.scan(/[あ-ん]{2,}/).each {|x| x.split(//).each_cons(2){|a| p a}}
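To turn those printed pairs into a running tally, a hash with a default value of 0 is enough. A minimal sketch (the sample string is hypothetical; assumes the text has already been converted to UTF-8):

```ruby
require 'enumerator'                  # needed for each_cons on 1.8; built in on 1.9+
$KCODE = 'u' if RUBY_VERSION < '1.9'  # 1.9+ handles UTF-8 natively

counts = Hash.new(0)                  # missing keys default to 0, so += just works
str = "くらしをくらしと"              # hypothetical sample text
str.scan(/[あ-ん]{2,}/) do |run|      # each run of 2+ hiragana
  run.split(//).each_cons(2) { |a, b| counts[a + b] += 1 }
end
counts.each { |pair, n| puts "#{pair}\t#{n}" }
```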


Harry
 

Brylie Oxley

Harry said:
I'm not sure what you want if there are more than 2 hiragana.
I think that a string, or array, of each duple from a string of 2 or
more hiragana would be desired. We want to count each occurrence of the
hiragana duples as a running tally through a large document.

There will be duplicates, so perhaps each hiragana duple could be a
variable? e.g. "暮ら += 1" whenever a new instance of 暮ら is found. Would
there be a more efficient way of counting the co-occurrences?

$KCODE = 'u'
p str.scan(/[あ-ん]{2,}/)

OR

require 'enumerator'
$KCODE = 'u'
str.scan(/[あ-ん]{2,}/).each {|x| x.split(//).each_cons(2){|a| p a}}
Harry

Also, I don't think that the encoding is Unicode. I have opened the
document in OpenOffice.org and jEdit using the Shift-JIS (as well as 'Apple
Macintosh' and 'Windows-932') encoding(s) and the characters seem to
render correctly. I am unsure how to convert this text to Unicode for
proper analysis in Ruby.
--Brylie
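For the Shift-JIS-to-Unicode conversion: on Ruby 1.8 this is Iconv's job, and on 1.9+ String#encode does it directly. A minimal sketch of both (the sample bytes are hard-coded for illustration):

```ruby
# Convert Shift_JIS bytes to UTF-8 so /[あ-ん]/-style regexps work directly.
# Ruby 1.8:  require 'iconv'
#            utf8 = Iconv.iconv("utf-8", "shift_jis", sjis).first
# Ruby 1.9+:
sjis = "\x82\xA0\x82\xA2"                 # "あい" as raw Shift_JIS bytes
utf8 = sjis.force_encoding("Shift_JIS").encode("UTF-8")
```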
 

Harry Kakueki

There will be duplicates, so perhaps each hiragana duple could be a
variable? e.g. "暮ら += 1" whenever a new instance of 暮ら is found. Would
there be a more efficient way of counting the co-occurrences?

I guess I'm missing something.
It seems to me that you are trying to find groups of 2 or more
hiragana and calling each group a word.
But, many words contain no hiragana at all.
And a group of consecutive hiragana could be 1 word, 2 words, or
several words, or no word.
Some words are just 1 hiragana.

I don't know about the statistics you are working on. I guess you do.
Are you just interested in finding patterns without concern for the word status?

If I just stated the obvious, sorry for the noise.

Harry
 
