Newbie with an encoding question, please help

Mister Yu

Hi experts,

I'm new to Python. I'm writing crawlers to extract data from some
Chinese websites, and I ran into an encoding problem.

I have a unicode object, which looks like this: u'\xd6\xd0\xce\xc4',
which is encoded in "gb2312", but I have no idea how to convert it
back to UTF-8.

to re-create this one is easy:

this will work
============================
中文    -> (same as the original string)

============================
but this doesn't,why
===========================
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
===========================

thank you
 
Chris Rebert

2010/4/1 Mister Yu said:
> Hi experts,
>
> I'm new to Python. I'm writing crawlers to extract data from some
> Chinese websites, and I ran into an encoding problem.
>
> I have a unicode object, which looks like this: u'\xd6\xd0\xce\xc4',
> which is encoded in "gb2312",

No! Instances of type 'unicode' (i.e. strings with a leading 'u')
***aren't encoded at all***.

> but I have no idea how to convert it
> back to UTF-8.

To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')

> to re-create this one is easy:
>
> this will work
> ============================
> 中文    -> (same as the original string)
>
> ============================
> but this doesn't,why
> ===========================

You can't decode a unicode string, it's already been decoded!

One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.

So the last line of your example should be:
print su.encode('gb2312')

Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]
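
As a rough sketch of how this rule carries over to Python 3 (an assumption on my part; the thread itself uses Python 2.x): text is `str`, raw data is `bytes`, and each type only offers the method that makes sense for it.

```python
# Python 3 sketch: str (text) can only be encoded, bytes can only be decoded.
text = "\u4e2d\u6587"               # the two characters, as a text string

gb_bytes = text.encode("gb2312")    # text -> bytes in GB2312
utf8_bytes = text.encode("utf-8")   # same text -> bytes in UTF-8

round_trip = gb_bytes.decode("gb2312")  # bytes -> text again

print(gb_bytes)            # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)          # b'\xe4\xb8\xad\xe6\x96\x87'
print(round_trip == text)  # True

# In Python 3 the wrong direction is not even available as a method:
print(hasattr(text, "decode"))      # False
print(hasattr(gb_bytes, "encode"))  # False
```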

Cheers,
Chris
 
Mister Yu

Hi, thanks for the tips.

But I'm still not very sure how to convert the unicode object
u'\xd6\xd0\xce\xc4' back to "中文", the string it is supposed to be.

Thanks.

Sorry, I'm really new to Python.
 
Mister Yu

===========================================
print u'\xd6\xd0\xce\xc4'.encode('utf-8')
ÖÃÎÄ (the result is supposed to be "中文" but not something like
this)
===========================================
'\xd6\xd0\xce\xc4'
===========================================
 
Chris Rebert

> Hi, thanks for the tips.
>
> But I'm still not very sure how to convert the unicode object
> u'\xd6\xd0\xce\xc4' back to "中文", the string it is supposed to be.

Ah, my apologies! I overlooked something (sorry, it's early in the
morning where I am).
What you have there is ***really*** screwy. It's the 2 Chinese
characters, encoded in gb2312, and then somehow cast *directly* into a
'unicode' string (which ought never to be done).

In answer to your original question (after some experimentation):
gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8') #as you wanted
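
For readers on Python 3, the same three steps might look roughly like this. This is only a sketch under the assumption that every code point in the mangled string is below 256, as in the example from the thread; the variable name `mangled` is mine.

```python
# Python 3 sketch of the same repair (the thread's code is Python 2).
mangled = "\xd6\xd0\xce\xc4"   # GB2312 byte values wrongly stored as code points

# Step 1: turn each code point (all below 256 here) back into a single byte.
gb2312_bytes = bytes(ord(c) for c in mangled)

# Step 2: decode those bytes with the encoding they were really in.
unicode_string = gb2312_bytes.decode("gb2312")

# Step 3: encode to UTF-8 only if a byte string is actually needed.
utf8_bytes = unicode_string.encode("utf-8")

print(ascii(unicode_string))  # '\u4e2d\u6587'
print(utf8_bytes)             # b'\xe4\xb8\xad\xe6\x96\x87'
```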

If possible, I'd look at the code that's giving you that funky
"string" in the first place and see if it can be fixed to give you
either a proper bytestring or proper unicode string rather than the
bastardized mess you're currently having to deal with.

Apologies again and Cheers,
Chris
 
Stefan Behnel

Mister Yu, 01.04.2010 13:38:
> I'm still not very sure how to convert the unicode object
> u'\xd6\xd0\xce\xc4' back to "中文", the string it is supposed to be.

You are confused. '\xd6\xd0\xce\xc4' is an encoded byte string, not a
unicode string. The fact that you have it stored in a unicode string
implies that something in your code (or in a library) has done an incorrect
conversion from bytes to unicode that did not take into account the real
character set in use. So you end up with a completely meaningless unicode
string.
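
To illustrate how such a string can come about, here is a small Python 3 sketch. It assumes, purely for the sake of the example, that the wrong conversion treated each byte as one Latin-1 character; the thread does not confirm which charset was actually guessed.

```python
# Python 3 sketch: the same four bytes decoded with the wrong and right charset.
raw = b"\xd6\xd0\xce\xc4"        # what the server sent: GB2312-encoded bytes

wrong = raw.decode("latin-1")    # wrong charset: one code point per byte
right = raw.decode("gb2312")     # real charset: two Chinese characters

print(ascii(wrong))              # '\xd6\xd0\xce\xc4'
print(ascii(right))              # '\u4e2d\u6587'
print(len(wrong), len(right))    # 4 2
```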

Please show us the code that does the conversion to a unicode string.

Stefan
 
Mister Yu

Hi Chris,

thanks for the great tips! it works like a charm.

I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. When it extracts data with XPath, it puts the
Chinese characters directly into the unicode object.

Thanks again Chris, and have a good April Fools' Day.

Cheers,
Yu
 
Stefan Behnel

Mister Yu, 01.04.2010 14:26:
> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8') #as you wanted

Simplifying this hack a bit:

gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8')

Although I have to wonder why you want a UTF-8 encoded byte string as
output instead of Unicode.

> thanks for the great tips! it works like a charm.

I hope you're aware that it's a big ugly hack, though. You should really
try to fix your input instead.

> I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
> to write my crawler. When it extracts data with XPath, it puts the
> Chinese characters directly into the unicode object.

My guess is that the HTML page you are parsing is broken and doesn't
specify its encoding. In that case, all that scrapy can do is guess, and it
seems to have guessed incorrectly.

You should check if there is a way to tell scrapy about the expected page
encoding, so that it can return correctly decoded unicode strings directly,
instead of resorting to dirty hacks that may or may not work depending on
the page you are parsing.
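
As a library-independent illustration of that idea, here is a Python 3 sketch that decodes the raw page bytes with an explicitly stated encoding before any parsing happens. The helper name `decode_page` and its fallback list are hypothetical; this is not Scrapy's API, and how to pass an encoding to Scrapy itself would need to be checked in its documentation.

```python
def decode_page(raw_body, declared_encoding=None,
                fallbacks=("utf-8", "gb18030")):
    """Decode raw page bytes, trying the declared encoding first.

    Hypothetical helper: the fallback order is an assumption that suits
    Chinese pages, not a general-purpose charset detector.
    """
    candidates = []
    if declared_encoding:
        candidates.append(declared_encoding)
    candidates.extend(fallbacks)

    for encoding in candidates:
        try:
            return raw_body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Last resort: Latin-1 never fails, but the result may be mojibake.
    return raw_body.decode("latin-1")


raw = b"\xd6\xd0\xce\xc4"
print(ascii(decode_page(raw, "gb2312")))  # '\u4e2d\u6587'
print(ascii(decode_page(raw)))            # '\u4e2d\u6587'
```

Decoding once, with the right charset, at the point where bytes enter the program means nothing downstream has to repair strings afterwards.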

Stefan
 
Mister Yu

Hi Stefan,

I don't think the page is broken. You can take a look at the page
http://www.7176.com/Sections/Genre/Comedy ; it's kind of like
a Chinese IMDb rip-off.

From what I can see in the source code of the page header, it
contains the encoding info:
<meta content="all" name="robots" /><meta name="author"
content="admin(at)7176.com" /><meta name="Copyright" content="www.7176.com" />
<meta content="类别为 剧情 的电影列表 第1页" name="keywords" /><TITLE>
类别为 剧情 的电影列表 第1页</TITLE><LINK href="http://www.7176.com/images/pro.css" rel=stylesheet></HEAD>

Maybe I should take a look at the source code of Scrapy, but I have
been using Python for barely a week, so I'm not sure I can understand
the source.

Chris's workaround from earlier looked good until it met a
string like this:
>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

The digits don't get encoded, so it messes up the code.

any ideas?

Once again, thanks for everybody's help!
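
One possible reading of the traceback above, offered as an assumption and not something confirmed in the thread: the repr shows code points such as \u4e00, so that string already holds correctly decoded text, and the byte-for-byte workaround only applies to strings whose code points all fall below 256. A defensive Python 3 sketch, with a hypothetical helper name, could leave such strings alone:

```python
def repair_gb2312_mojibake(text):
    """Undo a GB2312-read-as-Latin-1 mix-up; return other strings unchanged.

    Hypothetical helper and a heuristic only: genuine Latin-1 text that
    happens to form valid GB2312 byte pairs would be converted as well.
    """
    try:
        # Only possible when every code point is below 256.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Code points above U+00FF: the text was already decoded properly.
        return text
    try:
        return raw.decode("gb2312")
    except UnicodeDecodeError:
        # The bytes are not valid GB2312, so leave the string as it was.
        return text


already_fine = "\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db"
print(repair_gb2312_mojibake(already_fine) == already_fine)  # True
print(ascii(repair_gb2312_mojibake("\xd6\xd0\xce\xc4")))     # '\u4e2d\u6587'
```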
 
