separate Chinese and English! with Ruby

John Joyce

Many characters of these two sets of Chinese (in fact, including Chinese
characters used in Japanese and Korean...) are the same. Aren't they
encoded to the same codes when they are identical?

Yes. There is a lot of overlap, so there is not always a clean
separation line. But the Japanese and Korean phonetic characters
will be in a range. You might never use all the kanji/hanzi Chinese
characters, and a few are Japanese-only (very few).
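A minimal Ruby sketch of this point (not from the original posts), assuming the standard Unicode block boundaries: the phonetic scripts sit in their own blocks, separate from the shared ideographs, so a code-point range check can tell them apart.

```ruby
# Classify a code point by Unicode block (ranges from the Unicode charts).
def script_of(codepoint)
  case codepoint
  when 0x3040..0x309F then :hiragana          # Japanese phonetic
  when 0x30A0..0x30FF then :katakana          # Japanese phonetic
  when 0xAC00..0xD7A3 then :hangul_syllable   # Korean phonetic
  when 0x4E00..0x9FFF then :cjk_ideograph     # shared kanji/hanzi/hanja
  else :other
  end
end

"あ水한".unpack("U*").map { |cp| script_of(cp) }
# => [:hiragana, :cjk_ideograph, :hangul_syllable]
```

The helper name `script_of` is made up for illustration; only the block ranges are real.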


Yes that's exactly what he means.
.... hmmmmm.. if only I could find out what it does...
John Joyce wrote:

I took a look at it. It's the database of characters, sort of. It is
a big text file list. Not a proper gem at all, actually. The same db
file can be downloaded from Unicode.org separately. It doesn't
contain the actual characters, just their codes and some comments and
groupings.
An interesting subject indeed.

Today I tried this (!!!! under the RoR console !!!!):

｛ ； ‘ ！ ＠ ＃ ＄ ％ … 勿 叿 哿 囿 姿 寿 崁 忄忿 濗 瀖 燿 狧 珗 痿 眀 秊 竗 阀 韗 饧 骠 鶆 龥}
=> ["“", "”。", "，", "！", "＜", "｛", "；", "‘",
"！", "＠", "＃", "＄", "％",
"…", "＊", "（", "）", "一", "俿", "倀", "凿", "勿",
"叿", "哿", "囿", "姿", "寿",
"崁", "忄忿", "恘", "扉", "掵", "曆", "桶", "檗", "泗",
"濗", "瀖", "燿", "狧", "珗",
"痿", "眀", "秊", "竗", "篿", "紀", "翹", "退", "釽",
"鎷", "閈", "阀", "韗", "饧",
"骠", "鶆", "龥"]
c.collect.map{|o| o[0]}
=> [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
233, 233, 233]
c.collect.map{|o| o[0]}.sort
=> [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
239, 239, 239]
c.collect.map{|o| o[0]}.sort.uniq
=> [226, 228, 229, 230, 231, 233, 239]

The punctuation marks are those commonly used in China.
The Chinese characters were randomly picked from
http://www.khngai.com/chinese/charmap/tbluni.php?page=0
(from all six pages.)

Maybe 226 to 239 is the range I need.

Posted via http://www.ruby-forum.com/.

If you have access to a Macintosh, the character palette is pretty
helpful for exploring CJK character ranges as subgroupings within the
range.
 

Nanyang Zhan

Eden said:
[...] for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12

In the RoR console, I can see the string I put in:
s = "大智若愚" => "大智若愚"
s[0] => 229
s.length
=> 12
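Worth noting: the transcripts above are from Ruby 1.8, where `String#[]` returns a byte and `length` counts bytes. In Ruby 1.9 and later, strings are encoding-aware, so the same session looks like this (a sketch, not part of the original thread):

```ruby
s = "大智若愚"
s.length        # => 4, characters rather than bytes in Ruby 1.9+
s.bytesize      # => 12, the UTF-8 byte count that 1.8's length reported
s[0]            # => "大", a one-character string rather than the byte 229
s.unpack("U*")  # => [22823, 26234, 33509, 24858], the Unicode code points
```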

Zev said:
If the goal is to separate the western languages from the Japanese
Kanji and Kana, then it appears to not be too bad when using a lib
like this:

http://raa.ruby-lang.org/project/moji/

http://gimite.net/gimite/rubymess/moji.html

Thanks, Zev, but my current problem is about Chinese.
I am trying to figure out a way to separate a Chinese string from a
string mixed with other characters.
What I mean by other characters are letters from English and/or other
languages, like Ô, é, á... (may I call them western words?)

A string may contain no Chinese:
"String without Chinese" - I don't need to do anything about it, other
than identify such strings.
"中文 Western Words" #Chinese characters + space + western words.
"中文・另一些中文 western words" #Chinese characters may be separated by
punctuation and/or spaces, like:
"中文 前有空格 western words"
Almost all Chinese phrases are at the beginning of the strings.
But some may contain numbers, like:
"2007年的日记 diary of 2007"
or sometimes English letters are used as part of the Chinese
phrase, like:
"BB日记 diary of my baby"

Eden said:
Nooo! Those are the first BYTES of the UTF-8 encoding of the
punctuation that you listed. So if you remove them from a given string,
you're going to get back a poorly encoded UTF-8 string.

Finally, I know what those numbers are. Thanks.

In fact, I wanted to use those numbers to test whether a character is
Chinese or not (if 'character[0]' fits the range [226, 228, 229, 230,
231, 233, 239], then it is likely to be Chinese). (Now I know this may
be wrong.)
Then, depending on this judgment, if a part of the string (the string
would be split by spaces, divided into parts at the beginning) contains
more than X%, say 60%, of this kind of character, I would mark this
part as a Chinese phrase and take it out of the string.

I still want to use this strategy. But, as you point out, [226, 228,
229, 230, 231, 233, 239] is not safe for identifying Chinese. Is there
any other easy way to identify Chinese characters?
If you want to split on those separators, then why not do so
explicitly?

# fill up c as you've done below
"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

I don't get it. What does this code do?
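A quick check of why the first-byte list is unsafe (my own sketch, not from the original posts): plenty of non-Chinese characters share those leading bytes.

```ruby
# First UTF-8 bytes collide across unrelated characters:
"…".unpack("C*").first  # => 226, an ellipsis, not Chinese at all
"€".unpack("C*").first  # => 226, the euro sign
"中".unpack("C*").first  # => 228, an actual Chinese character
# A test on byte 226 would misclassify "…" and "€" as Chinese.
```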
 

Mariusz Pękala


Nanyang Zhan wrote:
[...]
I still want to use this strategy. But, as you point out, [226, 228,
229, 230, 231, 233, 239] is not safe for identifying Chinese. Is there
any other easy way to identify Chinese characters?

Just a random idea - maybe, if there is a problem with finding Chinese
characters, you can define the range of non-Chinese (defined for this
purpose as western) characters instead?
Maybe just finding words composed of only Latin and Common scripts would
be enough?
Or do you plan to handle Chinese and Japanese together? You talked about
'western words' and your examples were in English..


 

Nanyang Zhan

Michal said:
I guess this should give you what you want:

irb(main):001:0> s = "大智若愚 asdfaf sdgs"
=> "\345\244\247\346\231\272\350\213\245\346\204\232 asdfaf sdgs"
irb(main):002:0> s.unpack "U*"
=> [22823, 26234, 33509, 24858, 32, 97, 115, 100, 102, 97, 102, 32,
115, 100, 103, 115]

Michal, thanks!
Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.

I just used my strategy combined with this new way (unpack "U*") to
identify Chinese. It picked out 100% of the Chinese phrases from the
strings. (1000 strings were tested.)
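The range test can be written directly against the code points that `unpack("U*")` returns; a minimal sketch (mine, using the two ranges just mentioned; the helper name `chinese_codepoint?` is made up):

```ruby
# true if the code point falls in CJK Unified Ideographs (U+4E00..U+9FA5)
# or CJK Symbols and Punctuation (U+3000..U+303F)
def chinese_codepoint?(cp)
  (0x4E00..0x9FA5).cover?(cp) || (0x3000..0x303F).cover?(cp)
end

"中文 abc。".unpack("U*").count { |cp| chinese_codepoint?(cp) }
# => 3  (中, 文, and the ideographic full stop 。)
```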

All of you that have replied and helped, thank you! Enjoy!
 

John Joyce


NZ, could you share your final combined code? It might be useful to
anyone working with CJK; since Ruby originates in Japan, a lot of
people might find it useful. You might consider making a little gem out
of it.
 

eden li

If you want to split on those separators, then why not do so
explicitly?
# fill up c as you've done below
"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

I don't get it. What does this code do?

This code just splits the string at any separator listed in c (no
matter how long each separator is, byte-wise). I was guessing at what
you were trying to do, but I understand now. It looks like you've got
all you need now :)
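A side note (mine, not from the original posts): `Regexp.union` builds the same alternation from the array and escapes any regex metacharacters, which is safer if a separator ever contains a special character.

```ruby
seps = ["；", "，", "…"]  # hypothetical contents of c
"asdf；asdfasdf".split(Regexp.union(seps))
# => ["asdf", "asdfasdf"]
```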
Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.

There are also a few other ranges, but I'm not sure how popular they
are (from http://www.fileformat.info/info/unicode/block/index.htm):
CJK Compatibility Forms: U+FE30..U+FE4F (32)
CJK Compatibility Ideographs: U+F900..U+FAFF (467)
CJK Compatibility: U+3300..U+33FF (256)
CJK Unified Ideographs Extension A: U+3400..U+4DBF (6582)
CJK Unified Ideographs Extension B: U+20000..U+2A6DF (42711)
CJK Compatibility Ideographs Supplement: U+2F800..U+2FA1F (542)
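Those blocks can be collected into one membership test; a minimal sketch using the ranges listed above (the names `CJK_RANGES` and `cjk?` are made up for illustration):

```ruby
# CJK-related blocks from the list above, as code-point ranges
CJK_RANGES = [
  0x3000..0x303F,   # CJK Symbols and Punctuation
  0x3300..0x33FF,   # CJK Compatibility
  0x3400..0x4DBF,   # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF,   # CJK Unified Ideographs
  0xF900..0xFAFF,   # CJK Compatibility Ideographs
  0xFE30..0xFE4F,   # CJK Compatibility Forms
  0x20000..0x2A6DF, # CJK Unified Ideographs Extension B
  0x2F800..0x2FA1F, # CJK Compatibility Ideographs Supplement
]

def cjk?(cp)
  CJK_RANGES.any? { |r| r.cover?(cp) }
end

"中a".unpack("U*").map { |cp| cjk?(cp) }  # => [true, false]
```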
 

Nanyang Zhan

John Joyce said:
NZ, could you share your final combined code? It might be useful to
anyone working with CJK; since Ruby originates in Japan, a lot of
people might find it useful. You might consider making a little gem out
of it.


I don't think it will be very helpful, because it only solves a very
specific problem.
But anyway, I'll paste it here. Maybe it could inspire somebody... who
knows...

#This !!!!RoR!!!! snippet is used to separate Chinese phrases from
#specially formatted strings.
#These strings may contain no Chinese:
#"a string without Chinese"
#or Chinese characters + space + western words: "中文 Western Words".
#"中文・另一些中文 western words" #Chinese characters may be separated by
#punctuation and/or spaces, like:
#"中文 前有空格 western words"
#Almost all Chinese phrases are at the beginning of the strings.
#But some may contain numbers, like:
#"2007年的日记 diary of 2007"
#or sometimes English letters are used as part of the Chinese
#phrase, like:
#"BB日记 diary of my baby"
#
#usage:
#separate_chinese("a string without Chinese")
#  => "|||a string without Chinese"
#separate_chinese("2007年的日记 diary of 2007")
#  => "2007年的日记|||diary of 2007"
#chinese_str, other_str =
#  separate_chinese("中文 前有空格 western words").split("|||")
#chinese_str => "中文 前有空格"
#other_str => "western words"

class Foo < ActiveRecord::Base
  def self.separate_chinese(n)
    ns = n.split(" ")
    i = ns.size
    ns.reverse.each do |p|
      i -= 1
      if is_chinese(p)
        return ns.values_at(0..i).join(" ") + "|||" +
               ns.values_at((i + 1)..(ns.size - 1)).join(" ")
      end
    end
    "|||" << n
  end

  def self.is_chinese(n)
    cs = n.unpack("U*")
    chinese_character_num = 0
    cs.each do |unicode|
      # Compare the character's code point to test whether it is Chinese:
      # 19968..40869 (U+4E00..U+9FA5): Unicode Chinese characters
      # 12288..12351 (U+3000..U+303F): Unicode CJK symbols and punctuation
      # Note: as Eden Li has mentioned, there are a few more ranges that
      # could be used in a Chinese document.
      chinese_character_num += 1 if (unicode >= 19968 and unicode <= 40869) or
                                    (unicode >= 12288 and unicode <= 12351)
    end
    # If more than 29% of the characters a phrase contains are Chinese, it
    # is a Chinese phrase. The value 29% serves well for my purpose, but
    # use whatever you like.
    return true if chinese_character_num.to_f / cs.size > 0.29
    nil
  end
end
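For readers outside Rails, the same algorithm works as a plain-Ruby sketch (mine, slightly restructured with `rindex`, not NZ's exact code; the `chinese?` helper name is made up):

```ruby
# Phrase-level test: more than 29% of the code points fall in the
# Chinese character or CJK punctuation ranges.
def chinese?(word)
  cps = word.unpack("U*")
  hits = cps.count { |u| (19968..40869).cover?(u) || (12288..12351).cover?(u) }
  hits.to_f / cps.size > 0.29
end

# Split after the last space-delimited word that looks Chinese,
# marking the boundary with "|||" as in the thread's snippet.
def separate_chinese(s)
  words = s.split(" ")
  idx = words.rindex { |w| chinese?(w) }  # last Chinese word, if any
  return "|||" + s unless idx
  words[0..idx].join(" ") + "|||" + words[(idx + 1)..-1].join(" ")
end

separate_chinese("2007年的日记 diary of 2007")
# => "2007年的日记|||diary of 2007"
```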
 
