Newbie with an encoding question, please help

Mister Yu · Apr 1, 2010

Hi experts,

I'm new to Python. I'm writing crawlers to extract data from some
Chinese websites, and I've run into an encoding problem.

I have a unicode object that looks like this: u'\xd6\xd0\xce\xc4',
which is encoded in "gb2312", but I have no idea how to convert it
back to UTF-8.

To re-create this is easy:

this will work
============================
中文 -> (same as the original string)
============================
but this doesn't, why?
===========================
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
===========================

thank you

Chris Rebert · Apr 1, 2010

2010/4/1 Mister Yu said:
Hi experts,

I'm new to Python. I'm writing crawlers to extract data from some
Chinese websites, and I've run into an encoding problem.

I have a unicode object that looks like this: u'\xd6\xd0\xce\xc4',
which is encoded in "gb2312",

No! Instances of type 'unicode' (i.e. strings with a leading 'u')
***aren't encoded at all***.

but I have no idea how to convert it
back to UTF-8

To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')

to re-create this one is easy:

this will work
============================
中文 -> (same as the original string)

============================
but this doesn't, why?
===========================

You can't decode a unicode string, it's already been decoded!

One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.

So the last line of your example should be:
print su.encode('gb2312')

Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]
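
For readers on Python 3, the same rule can be sketched as follows. This is a minimal illustration, not code from the thread; it assumes Python 3, where text is `str` (Python 2's `unicode`) and bytestrings are `bytes`. The variable names are made up for the example.

```python
# Python 3: `str` is text (Python 2's `unicode`), `bytes` is the bytestring.
text = "\u4e2d\u6587"              # the two characters 中文, as text

gb_bytes = text.encode("gb2312")   # text -> bytes
utf8_bytes = text.encode("utf-8")  # same text, different byte representation

print(gb_bytes)    # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'

# bytes -> text: the codec must match the one that produced the bytes.
assert gb_bytes.decode("gb2312") == text
assert utf8_bytes.decode("utf-8") == text

# Python 3 enforces the rule: text has no .decode(), bytes has no .encode().
assert not hasattr(text, "decode")
assert not hasattr(gb_bytes, "encode")
```

Note that the GB2312 bytes are exactly the four values that appear in the question's u'\xd6\xd0\xce\xc4'.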

Cheers,
Chris

Mister Yu · Apr 1, 2010

Hi, thanks for the tips.

But I'm still not sure how to convert the unicode object
u'\xd6\xd0\xce\xc4' back to "中文", the string it's supposed to be.

Thanks.

Sorry, I'm really new to Python.

Mister Yu · Apr 1, 2010

===========================================
print u'\xd6\xd0\xce\xc4'.encode('utf-8')
ÖÐÎÄ (the result is supposed to be "中文", not something like this)
===========================================
'\xd6\xd0\xce\xc4'
===========================================

Chris Rebert · Apr 1, 2010

Hi, thanks for the tips.

But I'm still not sure how to convert the unicode object
u'\xd6\xd0\xce\xc4' back to "中文", the string it's supposed to be.

Ah, my apologies! I overlooked something (sorry, it's early in the
morning where I am).
What you have there is ***really*** screwy. It's the 2 Chinese
characters, encoded in gb2312, and then somehow cast *directly* into a
'unicode' string (which ought never to be done).

In answer to your original question (after some experimentation):
gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8') #as you wanted

If possible, I'd look at the code that's giving you that funky
"string" in the first place and see if it can be fixed to give you
either a proper bytestring or proper unicode string rather than the
bastardized mess you're currently having to deal with.
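
A rough Python 3 equivalent of the three lines above, for anyone reading this later. It is a sketch under the assumption that you are on Python 3; the name `broken` is made up for the example, and `bytes(...)` stands in for the `chr(ord(c))` join, which only produces a bytestring on Python 2.

```python
broken = "\xd6\xd0\xce\xc4"   # code points 0xD6 0xD0 0xCE 0xC4, not real text

# Turn each code point back into the byte with the same value.
# bytes() raises ValueError if any code point is above 255.
gb2312_bytes = bytes(ord(c) for c in broken)

unicode_string = gb2312_bytes.decode("gb2312")   # the text that was meant
utf8_bytes = unicode_string.encode("utf-8")      # UTF-8 bytes, as requested

print(gb2312_bytes)   # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)     # b'\xe4\xb8\xad\xe6\x96\x87'
```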

Apologies again and Cheers,
Chris

Stefan Behnel · Apr 1, 2010

Mister Yu, 01.04.2010 13:38:

I'm still not sure how to convert the unicode object
u'\xd6\xd0\xce\xc4' back to "中文", the string it's supposed to be?

You are confused. '\xd6\xd0\xce\xc4' is an encoded byte string, not a
unicode string. The fact that you have it stored in a unicode string
implies that something in your code (or in a library) has done an incorrect
conversion from bytes to unicode that did not take into account the real
character set in use. So you end up with a completely meaningless unicode
string.
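
To make the failure mode concrete, here is a small Python 3 sketch of how such a string can come about. The thread does not say which charset was actually guessed; ISO-8859-1 is an assumption here, chosen because it maps every byte to the code point with the same value, which matches the string in the question.

```python
raw = b"\xd6\xd0\xce\xc4"           # what the server sent: 中文 in GB2312

wrong = raw.decode("iso-8859-1")    # wrong charset: one character per byte
right = raw.decode("gb2312")        # right charset: one character per byte pair

assert wrong == "\xd6\xd0\xce\xc4"  # the meaningless string from the question
assert right == "\u4e2d\u6587"      # the two characters that were intended
assert len(wrong) == 4 and len(right) == 2
```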

Please show us the code that does the conversion to a unicode string.

Stefan

Mister Yu · Apr 1, 2010

Hi Chris,

Thanks for the great tips! It works like a charm.

I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. When it extracts data with XPath, it puts the
Chinese characters directly into the unicode object.

Thanks again, Chris, and have a good April Fools' Day.

Cheers,
Yu

Stefan Behnel · Apr 1, 2010

Mister Yu, 01.04.2010 14:26:

gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8') #as you wanted

Simplifying this hack a bit:

gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8')

Although I have to wonder why you want a UTF-8 encoded byte string as
output instead of Unicode.
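
In Python 3 terms the same round trip might look like the sketch below. `redecode` is a hypothetical helper name, not something from the thread or from any library, and the default of "gb2312" is only an assumption that fits this particular site.

```python
def redecode(broken, real_encoding="gb2312"):
    """Undo a Latin-1 mis-decoding (hypothetical helper for illustration).

    Raises UnicodeEncodeError if `broken` contains code points above 255,
    and UnicodeDecodeError if the bytes are not valid in `real_encoding`.
    """
    return broken.encode("iso-8859-1").decode(real_encoding)


fixed = redecode("\xd6\xd0\xce\xc4")
assert fixed == "\u4e2d\u6587"

# encode('iso-8859-1') yields the same bytes as the per-character loop.
assert "\xd6\xd0\xce\xc4".encode("iso-8859-1") == bytes(
    ord(c) for c in "\xd6\xd0\xce\xc4"
)
```

Returning text rather than UTF-8 bytes keeps the encoding step at the point where the data is actually written out.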

thanks for the great tips! it works like a charm.

I hope you're aware that it's a big ugly hack, though. You should really
try to fix your input instead.

I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. When it extracts data with XPath, it puts the
Chinese characters directly into the unicode object.

My guess is that the HTML page you are parsing is broken and doesn't
specify its encoding. In that case, all that scrapy can do is guess, and it
seems to have guessed incorrectly.

You should check if there is a way to tell scrapy about the expected page
encoding, so that it can return correctly decoded unicode strings directly,
instead of resorting to dirty hacks that may or may not work depending on
the page you are parsing.
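
Whether and how Scrapy lets you set the encoding is not covered in this thread, so the sketch below deliberately avoids Scrapy and uses only the standard library. It shows the general idea of decoding the raw page bytes with the declared charset. `decode_page`, the regular expression, and the sample page are all made up for illustration, and a real page may declare its charset in an HTTP header instead.

```python
import re

# Looks for charset=... inside a <meta> tag near the top of the page.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE
)


def decode_page(raw, fallback="utf-8"):
    """Decode raw page bytes using the <meta> charset, else `fallback`."""
    match = META_CHARSET.search(raw[:2048])
    encoding = match.group(1).decode("ascii") if match else fallback
    # An unknown charset name would raise LookupError here.
    return raw.decode(encoding, errors="replace")


# Hypothetical sample page declaring GB2312.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=gb2312"></head>'
    b"<body>\xd6\xd0\xce\xc4</body></html>"
)
text = decode_page(page)
assert "\u4e2d\u6587" in text
```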

Stefan

Mister Yu · Apr 1, 2010

Hi Stefan,

I don't think the page is broken. You can take a look at the page
http://www.7176.com/Sections/Genre/Comedy; it's kind of like a Chinese
IMDb rip-off.

From what I can see in the source of the page header, it contains the
encoding info:

<meta content="all" name="robots" />
<meta name="author" content="admin(at)7176.com" />
<meta name="Copyright" content="www.7176.com" />
<meta content="类别为 剧情 的电影列表 第1页" name="keywords" />
<TITLE>类别为 剧情 的电影列表 第1页</TITLE>
<LINK href="http://www.7176.com/images/pro.css" rel=stylesheet></HEAD>

Maybe I should take a look at the Scrapy source code, but I've been
using Python for no more than a week, so I'm not sure I can understand
it.

Chris's workaround was looking good until it met a string like this:

>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

The digits don't get encoded, so it messes up the code.

Any ideas?

Once again, thanks for everybody's help!
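
One observation on the traceback above: the string in that session is already correct text. Its code points (u'\u4e00' and so on) are above 255, which is why chr() refuses them, so the round trip is not needed for it at all. A guarded version of the workaround could look like the Python 3 sketch below. `repair_if_mis_decoded` is a hypothetical helper, and the check is only a heuristic: genuine Latin-1 text whose bytes happen to form valid GB2312 would be changed by mistake.

```python
def repair_if_mis_decoded(value, real_encoding="gb2312"):
    """Return repaired text, or `value` unchanged if the repair can't apply.

    Hypothetical helper: tries the Latin-1 round trip only when it can succeed.
    """
    try:
        raw = value.encode("iso-8859-1")   # fails if any code point > 255
    except UnicodeEncodeError:
        return value                       # already real text, leave it alone
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        return value                       # bytes were not valid GB2312 either


# The mis-decoded string from the start of the thread gets repaired.
assert repair_if_mis_decoded("\xd6\xd0\xce\xc4") == "\u4e2d\u6587"

# The already-correct string from the failing session passes through untouched.
good = "\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db"
assert repair_if_mis_decoded(good) == good

# Plain ASCII, such as the digits, is unaffected either way.
assert repair_if_mis_decoded("12345") == "12345"
```

This only papers over the problem; getting correctly decoded strings from the crawler in the first place remains the better fix.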
