Problem processing Chinese

Anthony Liu · Oct 14, 2005

I believe topics related to Chinese processing have been
discussed before, but I could not dig the info I want
out of the mailing list archive.

My Python script reads some Chinese text and then
splits each line on whitespace. I get lists
like

['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
'\xa1\xa2']

I had

#-*- coding: gbk -*-

on top of the script.

My Windows 2000 system's default language is Chinese
(GB2312) and displays Chinese perfectly.

I don't know how to configure Python, or what else I
need, to properly process such two-byte-character text.

Thanks.

Peter Otten · Oct 14, 2005

Suppose you have a file with the following contents:
'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

Then it's best to open it via codecs -- of course you have to know the
encoding:
u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'
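
A minimal sketch of the codecs approach, assuming the file really is GBK-encoded. The file name chinese.txt and the helper read_text are made up for illustration, and the sample file is written first only so the snippet runs on its own:

```python
import codecs
import os
import tempfile

# The raw GBK bytes from the example file contents.
RAW = b'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'


def read_text(path, encoding):
    """Return the file's contents decoded to a unicode string."""
    f = codecs.open(path, "r", encoding)
    try:
        return f.read()
    finally:
        f.close()


# Write a throwaway sample file (hypothetical name) so the sketch is runnable.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "chinese.txt")
with open(sample_path, "wb") as out:
    out.write(RAW)

# Decoding happens while reading; u holds characters, not bytes.
u = read_text(sample_path, "gbk")
```

Evaluating u at the interactive prompt of a Python 2 interpreter, as used in this thread, shows the escaped form above.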

This may still look strange to you but it's the unicode string's repr().
If sys.stdout.encoding is properly set on your system you can just print it:
记者 谢金虎 、

If that fails, provide the encoding explicitly, using the one your system expects:
记者 谢金虎 、
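
Encoding explicitly could look roughly like this. The helper encode_for_console is hypothetical, and falling back to UTF-8 when stdout reports no encoding is my assumption, not something from the thread:

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text explicitly; fall back to stdout's encoding, then UTF-8."""
    if encoding is None:
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes '?' for characters the target encoding lacks,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")


u = u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

as_gbk = encode_for_console(u, "gbk")      # the original bytes again
as_utf8 = encode_for_console(u, "utf-8")   # for a UTF-8 terminal
as_ascii = encode_for_console(u, "ascii")  # every character becomes '?'
```

On Python 2 the resulting byte string can be handed straight to print; on Python 3 it would have to go to sys.stdout.buffer.write() instead, since print expects text there.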

Because now you are in unicode all further operations are performed on
characters rather than bytes. Processing Chinese is no longer more
difficult than any language that confines itself to plain ASCII.
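
To make the bytes-versus-characters point concrete, here is a small sketch using the first word of the example. The variable names are only for illustration:

```python
# GBK bytes for the first word of the example, and the same word decoded.
raw_word = b'\xbc\xc7\xd5\xdf'
word = raw_word.decode("gbk")

byte_count = len(raw_word)   # counts bytes: two per character in GBK here
char_count = len(word)       # counts characters
first_char = word[0]         # a whole character, not half of one
reversed_word = word[::-1]   # reverses characters, never splits a byte pair
```

Indexing or slicing the undecoded byte string instead would cut through the middle of a character.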
But if you split your text into a list
[u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

you probably think you are back to square one. That is because Python prints
the repr() of the list items (otherwise a comma would give the impression
that the list contains more items than it actually does). To get the actual
characters, choose an item explicitly:

items = u.split()
print items[0]

记者

or convert the entire list to a string of your liking, e.g.:

print u"[%s]" % u", ".join(items)

[记者, 谢金虎, 、]
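
Putting the steps together, a rough end-to-end sketch from raw GBK bytes to a displayable string might look like this. The helper name show_list is invented here, and the input line is the one from the question:

```python
def show_list(items):
    """Format a list of unicode strings without falling back to repr()."""
    return u"[%s]" % u", ".join(items)


# One line of input as it arrives from a GBK-encoded file.
raw_line = b'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

# Decode first, then split: the items are unicode strings.
items = raw_line.decode("gbk").split()

# A single string that can be printed or encoded for output.
display = show_list(items)
```

Decoding before splitting matters: splitting the raw bytes gives byte strings whose repr is the \x.. form from the original question.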

Peter
