separate Chinese and English! with Ruby

John Joyce

Many characters of these two sets of Chinese (in fact, including Chinese
characters used in Japanese and Korean...) are the same. Aren't they
encoded to the same codes when they are identical?

Yes. There is a lot of overlap, so there is not always a clean
separation line. But the Japanese and Korean phonetic characters
will be in a range. You might never use all the kanji/hanzi Chinese
characters, and a few are Japanese-only (very few).
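A minimal Ruby sketch of this point (not from the original posts), assuming the standard Unicode block boundaries: the phonetic scripts sit in their own blocks, separate from the shared ideographs, so a code-point range check can tell them apart.

```ruby
# Classify a code point by Unicode block (ranges from the Unicode charts).
def script_of(codepoint)
  case codepoint
  when 0x3040..0x309F then :hiragana          # Japanese phonetic
  when 0x30A0..0x30FF then :katakana          # Japanese phonetic
  when 0xAC00..0xD7A3 then :hangul_syllable   # Korean phonetic
  when 0x4E00..0x9FFF then :cjk_ideograph     # shared kanji/hanzi/hanja
  else :other
  end
end

"あ水한".unpack("U*").map { |cp| script_of(cp) }
# => [:hiragana, :cjk_ideograph, :hangul_syllable]
```

The helper name `script_of` is made up for illustration; only the block ranges are real.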


Yes that's exactly what he means.
.... hmmmmm.. if only I could find out what it does...
John Joyce wrote:

I took a look at it. It's the database of characters, sort of. It is
a big text file list. Not a proper gem at all, actually. The same db
file can be downloaded from Unicode.org separately. It doesn't
contain the actual characters, just their codes and some comments and
groupings.
An interesting subject indeed.

Today I tried this (!!!! under the RoR console !!!!):

｛ ； ‘ ！ ＠ ＃ ＄ ％ … 勿 叿 哿 囿 姿 寿 崁 忄忿 濗 瀖 燿 狧 珗 痿 眀 秊 竗 阀 韗 饧 骠 鶆 龥}
=> ["“", "”。", "，", "！", "＜", "｛", "；", "‘",
"！", "＠", "＃", "＄", "％",
"…", "＊", "（", "）", "一", "俿", "倀", "凿", "勿",
"叿", "哿", "囿", "姿", "寿",
"崁", "忄忿", "恘", "扉", "掵", "曆", "桶", "檗", "泗",
"濗", "瀖", "燿", "狧", "珗",
"痿", "眀", "秊", "竗", "篿", "紀", "翹", "退", "釽",
"鎷", "閈", "阀", "韗", "饧",
"骠", "鶆", "龥"]
c.collect.map{|o| o[0]}
=> [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
233, 233, 233]
c.collect.map{|o| o[0]}.sort
=> [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
239, 239, 239]
c.collect.map{|o| o[0]}.sort.uniq
=> [226, 228, 229, 230, 231, 233, 239]

The punctuation marks are those commonly used in China.
The Chinese characters were randomly picked from
http://www.khngai.com/chinese/charmap/tbluni.php?page=0
(from all six pages.)

Maybe 226 to 239 is the range I need.

Posted via http://www.ruby-forum.com/.

If you have access to a Macintosh, the character palette is pretty
helpful for exploring CJK character ranges as subgroupings within the
range.
 

Nanyang Zhan

Eden said:
[...] for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12

In the RoR console, I can see the string I put in:
s = "大智若愚" => "大智若愚"
s[0] => 229
s.length
=> 12
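Worth noting: the transcripts above are from Ruby 1.8, where `String#[]` returns a byte and `length` counts bytes. In Ruby 1.9 and later, strings are encoding-aware, so the same session looks like this (a sketch, not part of the original thread):

```ruby
s = "大智若愚"
s.length        # => 4, characters rather than bytes in Ruby 1.9+
s.bytesize      # => 12, the UTF-8 byte count that 1.8's length reported
s[0]            # => "大", a one-character string rather than the byte 229
s.unpack("U*")  # => [22823, 26234, 33509, 24858], the Unicode code points
```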

Zev said:
If the goal is to separate the western languages from the Japanese
Kanji and Kana, then it appears to not be too bad when using a lib
like this:

http://raa.ruby-lang.org/project/moji/

http://gimite.net/gimite/rubymess/moji.html

Thanks, Zev, but my current problem is about Chinese.
I am trying to figure out a way to separate a Chinese string from a
string mixed with other characters.
What I mean by other characters are letters from English and/or other
languages, like Ô, é, á... (may I call them western words?)

A string may contain no Chinese:
"String without Chinese" - I don't need to do anything about it, other
than identify such strings.
"中文 Western Words" #Chinese characters + space + western words.
"中文・另一些中文 western words" #Chinese characters may be separated by
punctuation and/or spaces, like:
"中文 前有空格 western words"
Almost all Chinese phrases are at the beginning of the strings.
But some may contain numbers, like:
"2007年的日记 diary of 2007"
or sometimes English letters are used as part of the Chinese
phrase, like:
"BB日记 diary of my baby"

Eden said:
Nooo! Those are the first BYTES of the UTF-8 encoding of the
punctuation that you listed. So if you remove them from a given string,
you're going to get back a poorly encoded UTF-8 string.

Finally, I know what those numbers are. Thanks.

In fact, I wanted to use those numbers to test whether a character is
Chinese or not (if 'character[0]' fits the range [226, 228, 229, 230,
231, 233, 239], then it is likely to be Chinese). (Now I know this may
be wrong.)
Then, depending on this judgment, if a part of the string (the string
would be split by spaces, divided into parts at the beginning) contains
more than X%, say 60%, of this kind of character, I would mark this
part as a Chinese phrase and take it out of the string.

I still want to use this strategy. But, as you point out, [226, 228,
229, 230, 231, 233, 239] is not safe for identifying Chinese. Is there
any other easy way to identify Chinese characters?
If you want to split on those separators, then why not do so
explicitly?

# fill up c as you've done below
"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

I don't get it. What does this code do?
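A quick check of why the first-byte list is unsafe (my own sketch, not from the original posts): plenty of non-Chinese characters share those leading bytes.

```ruby
# First UTF-8 bytes collide across unrelated characters:
"…".unpack("C*").first  # => 226, an ellipsis, not Chinese at all
"€".unpack("C*").first  # => 226, the euro sign
"中".unpack("C*").first  # => 228, an actual Chinese character
# A test on byte 226 would misclassify "…" and "€" as Chinese.
```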
 

Mariusz Pękala


Nanyang Zhan wrote:
[...]
I still want to use this strategy. But, as you point out, [226, 228,
229, 230, 231, 233, 239] is not safe for identifying Chinese. Is there
any other easy way to identify Chinese characters?

Just a random idea - maybe, if there is a problem with finding Chinese
characters, you can define the range of non-Chinese (defined for this
purpose as western) characters instead?
Maybe just finding words composed of only Latin and Common scripts would
be enough?
Or do you plan to handle Chinese and Japanese together? You talked about
'western words' and your examples were in English..


 

Nanyang Zhan

Michal said:
I guess this should give you what you want:

irb(main):001:0> s = "大智若愚 asdfaf sdgs"
=> "\345\244\247\346\231\272\350\213\245\346\204\232 asdfaf sdgs"
irb(main):002:0> s.unpack "U*"
=> [22823, 26234, 33509, 24858, 32, 97, 115, 100, 102, 97, 102, 32,
115, 100, 103, 115]

Michal, thanks!
Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.

I just used my strategy combined with this new way (unpack "U*") to
identify Chinese. It picked out 100% of the Chinese phrases from the
strings. (1000 strings were tested.)
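The range test can be written directly against the code points that `unpack("U*")` returns; a minimal sketch (mine, using the two ranges just mentioned; the helper name `chinese_codepoint?` is made up):

```ruby
# true if the code point falls in CJK Unified Ideographs (U+4E00..U+9FA5)
# or CJK Symbols and Punctuation (U+3000..U+303F)
def chinese_codepoint?(cp)
  (0x4E00..0x9FA5).cover?(cp) || (0x3000..0x303F).cover?(cp)
end

"中文 abc。".unpack("U*").count { |cp| chinese_codepoint?(cp) }
# => 3  (中, 文, and the ideographic full stop 。)
```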

All of you that have replied and helped, thank you! Enjoy!
 

John Joyce


NZ, could you share your final combined code? It might be useful to
anyone working with CJK; since Ruby originates in Japan, a lot of
people might find it useful. You might consider making a little gem out
of it.
 

eden li

If you want to split on those separators, then why not do so
explicitly?
# fill up c as you've done below
"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

I don't get it. What does this code do?

This code just splits the string at any separator listed in c (no
matter how long each separator is, byte-wise). I was guessing at what
you were trying to do, but I understand now. It looks like you've got
all you need now :)
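A side note (mine, not from the original posts): `Regexp.union` builds the same alternation from the array and escapes any regex metacharacters, which is safer if a separator ever contains a special character.

```ruby
seps = ["；", "，", "…"]  # hypothetical contents of c
"asdf；asdfasdf".split(Regexp.union(seps))
# => ["asdf", "asdfasdf"]
```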
Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.

There are also a few other ranges, but I'm not sure how popular they
are (from http://www.fileformat.info/info/unicode/block/index.htm):
CJK Compatibility Forms: U+FE30..U+FE4F (32)
CJK Compatibility Ideographs: U+F900..U+FAFF (467)
CJK Compatibility: U+3300..U+33FF (256)
CJK Unified Ideographs Extension A: U+3400..U+4DBF (6582)
CJK Unified Ideographs Extension B: U+20000..U+2A6DF (42711)
CJK Compatibility Ideographs Supplement: U+2F800..U+2FA1F (542)
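Those blocks can be collected into one membership test; a minimal sketch using the ranges listed above (the names `CJK_RANGES` and `cjk?` are made up for illustration):

```ruby
# CJK-related blocks from the list above, as code-point ranges
CJK_RANGES = [
  0x3000..0x303F,   # CJK Symbols and Punctuation
  0x3300..0x33FF,   # CJK Compatibility
  0x3400..0x4DBF,   # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF,   # CJK Unified Ideographs
  0xF900..0xFAFF,   # CJK Compatibility Ideographs
  0xFE30..0xFE4F,   # CJK Compatibility Forms
  0x20000..0x2A6DF, # CJK Unified Ideographs Extension B
  0x2F800..0x2FA1F, # CJK Compatibility Ideographs Supplement
]

def cjk?(cp)
  CJK_RANGES.any? { |r| r.cover?(cp) }
end

"中a".unpack("U*").map { |cp| cjk?(cp) }  # => [true, false]
```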
 

Nanyang Zhan

John Joyce said:
NZ, could you share your final combined code? It might be useful to
anyone working with CJK; since Ruby originates in Japan, a lot of
people might find it useful. You might consider making a little gem out
of it.


I don't think it will be very helpful, because it only solves a very
specific problem.
But anyway, I'll paste it here. Maybe it could inspire somebody... who
knows...

#This !!!!RoR!!!! snippet is used to separate Chinese phrases from
#specially formatted strings.
#These strings may contain no Chinese:
#"a string without Chinese"
#or Chinese characters + space + western words: "中文 Western Words".
#"中文・另一些中文 western words" #Chinese characters may be separated by
#punctuation and/or spaces, like:
#"中文 前有空格 western words"
#Almost all Chinese phrases are at the beginning of the strings.
#But some may contain numbers, like:
#"2007年的日记 diary of 2007"
#or sometimes English letters are used as part of the Chinese
#phrase, like:
#"BB日记 diary of my baby"
#
#usage:
#separate_chinese("a string without Chinese")
#  => "|||a string without Chinese"
#separate_chinese("2007年的日记 diary of 2007")
#  => "2007年的日记|||diary of 2007"
#chinese_str, other_str =
#  separate_chinese("中文 前有空格 western words").split("|||")
#chinese_str => "中文 前有空格"
#other_str => "western words"

class Foo < ActiveRecord::Base
  def self.separate_chinese(n)
    ns = n.split(" ")
    i = ns.size
    ns.reverse.each do |p|
      i -= 1
      if is_chinese(p)
        return ns.values_at(0..i).join(" ") + "|||" +
               ns.values_at((i + 1)..(ns.size - 1)).join(" ")
      end
    end
    "|||" << n
  end

  def self.is_chinese(n)
    cs = n.unpack("U*")
    chinese_character_num = 0
    cs.each do |unicode|
      # Compare the character's code point to test whether it is Chinese:
      # 19968..40869 (U+4E00..U+9FA5): Unicode Chinese characters
      # 12288..12351 (U+3000..U+303F): Unicode CJK symbols and punctuation
      # Note: as Eden Li has mentioned, there are a few more ranges that
      # could be used in a Chinese document.
      chinese_character_num += 1 if (unicode >= 19968 and unicode <= 40869) or
                                    (unicode >= 12288 and unicode <= 12351)
    end
    # If more than 29% of the characters a phrase contains are Chinese, it
    # is a Chinese phrase. The value 29% serves well for my purpose, but
    # use whatever you like.
    return true if chinese_character_num.to_f / cs.size > 0.29
    nil
  end
end
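For readers outside Rails, the same algorithm works as a plain-Ruby sketch (mine, slightly restructured with `rindex`, not NZ's exact code; the `chinese?` helper name is made up):

```ruby
# Phrase-level test: more than 29% of the code points fall in the
# Chinese character or CJK punctuation ranges.
def chinese?(word)
  cps = word.unpack("U*")
  hits = cps.count { |u| (19968..40869).cover?(u) || (12288..12351).cover?(u) }
  hits.to_f / cps.size > 0.29
end

# Split after the last space-delimited word that looks Chinese,
# marking the boundary with "|||" as in the thread's snippet.
def separate_chinese(s)
  words = s.split(" ")
  idx = words.rindex { |w| chinese?(w) }  # last Chinese word, if any
  return "|||" + s unless idx
  words[0..idx].join(" ") + "|||" + words[(idx + 1)..-1].join(" ")
end

separate_chinese("2007年的日记 diary of 2007")
# => "2007年的日记|||diary of 2007"
```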
 
