Encode exception for Chinese text

Vinayakc

Hi all,

I am new to Python.

I have written a small application which reads data from an XML file and
tries to encode the data using the appropriate charset.
I am facing a problem while encoding one Chinese paragraph with the charset
"gb2312".

The code is:

encoded_str = str_data.encode("gb2312")

The type of str_data is <type 'unicode'>

The exception is:

"UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
position 0: illegal multibyte sequence"

Can anyone please give me some direction to solve this issue?

Regards,
Vinayakc
 
swordsp

Are you sure all the characters in the original text are in the "gb2312"
charset?

Encoding with "utf8" seems to work for this character (u'\xa0'), but I
don't know if the result is correct.

Could you give a subset of str_data in unicode?
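
One way to answer that question without posting the whole text is to list only the characters the codec rejects. The following is a minimal sketch; the helper name find_unencodable and the sample string are made up for illustration and are not from the original post.

```python
import unicodedata

def find_unencodable(text, charset="gb2312"):
    """Return (position, character, Unicode name) for every character
    in text that the given charset cannot encode."""
    problems = []
    for pos, ch in enumerate(text):
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            problems.append((pos, ch, unicodedata.name(ch, "UNKNOWN")))
    return problems

# Hypothetical sample: a no-break space followed by two Chinese characters.
sample = u"\u00a0\u4e2d\u6587"
for pos, ch, name in find_unencodable(sample):
    print("%d %r %s" % (pos, ch, name))
```

Running the same helper over the real str_data should show whether the no-break space is the only offender or one of several.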
 
Serge Orlov

Vinayakc said:
Hi all,

I am new to Python.

I have written a small application which reads data from an XML file and
tries to encode the data using the appropriate charset.
I am facing a problem while encoding one Chinese paragraph with the charset
"gb2312".

The code is:

encoded_str = str_data.encode("gb2312")

The type of str_data is <type 'unicode'>

The exception is:

"UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
position 0: illegal multibyte sequence"

Hmm, this is a 'no-break space' at the very beginning of the text. It
looks suspiciously like a plain-text utf-8 signature, which is a 'zero
width no-break space'. If you strip the first character, do you still
have encoding errors?
 
Vinayakc

Yes Serge, I have removed the first character but it still raises an
encoding exception.
 
John Machin

1. *By definition*, you can encode *any* Unicode string into utf-8.
Proves nothing.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk alias cp936. It does have an equivalent in the latest Chinese
encoding, gb18030.
3. gb2312 is outdated. It is not really an "appropriate" charset for
anything much these days. You need to check out what your requirements
really are. The unknowing will cheerfully use "gb" to mean one or more
of those, or to mean "anything that's not big5" :)
4. The slab of text you supplied is genuine unicode and encodes happily
into all those gb* charsets. It does *not* contain \u00a0.

I do hope some of this helps ....

Cheers,
John
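
Points 1 and 2 above are easy to check from the interpreter. This is only a sketch: the helper can_encode is a made-up name, and the results reflect the codecs shipped with CPython.

```python
def can_encode(text, charset):
    """Return True if text can be encoded with charset, else False."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

nbsp = u"\u00a0"  # NO-BREAK SPACE, the character from the traceback
charsets = ("gb2312", "gbk", "gb18030", "utf-8")
results = dict((name, can_encode(nbsp, name)) for name in charsets)

for name in charsets:
    print("%s: %s" % (name, results[name]))
```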
 
Serge Orlov

Vinayakc said:
Yes Serge, I have removed the first character but it still raises an
encoding exception.

Then I guess this character was used as a poor man's indentation tool, at
least at the beginning of your text. It's up to you to decide what to
do with that character. You have several choices:

* edit source xml file to get rid of it
* remove it while you process your data
* replace it with ordinary space
* consider utf-8

Note that there are legitimate use cases for the no-break space. For example,
one million can be written as 1 000 000, where the spaces are
non-breaking. This prevents the number from being broken at the right margin
like this: 1 000
000

Keep that in mind when you remove or replace no-break space.
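
The "remove" and "replace" choices listed above could look roughly like this. It is a sketch under assumptions: the function name encode_gb2312, its nbsp_policy parameter, and the sample string are all invented for illustration, and which policy is right depends on what the no-break spaces mean in your data.

```python
NBSP = u"\u00a0"

def encode_gb2312(text, nbsp_policy="space"):
    """Encode text as gb2312 after dealing with no-break spaces.

    nbsp_policy (hypothetical parameter):
      "space"  - replace each no-break space with an ordinary space
      "remove" - drop no-break spaces entirely
    """
    if nbsp_policy == "space":
        text = text.replace(NBSP, u" ")
    elif nbsp_policy == "remove":
        text = text.replace(NBSP, u"")
    else:
        raise ValueError("unknown nbsp_policy: %r" % (nbsp_policy,))
    return text.encode("gb2312")

# Hypothetical sample: a no-break space used as indentation before two
# Chinese characters.
sample = u"\u00a0\u4e2d\u6587"
as_space = encode_gb2312(sample, "space")
removed = encode_gb2312(sample, "remove")

# The "consider utf-8" choice needs no preprocessing at all.
as_utf8 = sample.encode("utf-8")
```

Any other character gb2312 cannot represent will still raise UnicodeEncodeError here, which is deliberate: silently passing errors="ignore" or errors="replace" to encode() would hide such problems.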
 
Vinayakc

Hey Serge, John,

Thank you very much. I was really not aware of these facts. Anyway,
this is happening only for one in millions, so I can ignore it for
now.

Thanks again,

Vinayakc
 
Guest

John said:
1. *By definition*, you can encode *any* Unicode string into utf-8.
Proves nothing.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk alias cp936. It does have an equivalent in the latest Chinese
encoding, gb18030.

Also, *by definition*, though :) For those who have not followed
encodings too closely: gb18030 is to gb2312 what UTF-8 is to ASCII.
Both encode all of Unicode in an algorithmic way, and provide
byte-for-byte identical encodings for their respective
subsets.

Regards,
Martin
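
The subset relationship described above can be spot-checked with a couple of strings. This is just an illustrative sketch with made-up sample text, not a proof for the whole character set.

```python
chinese = u"\u4e2d\u6587"   # two characters that are in gb2312
ascii_text = u"hello"       # plain ASCII

# Within the older charset's repertoire, both encodings give the same bytes.
same_gb = chinese.encode("gb2312") == chinese.encode("gb18030")
same_ascii = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Outside that repertoire, only the newer encoding works; here the
# no-break space survives a gb18030 round trip.
nbsp = u"\u00a0"
nbsp_roundtrip = nbsp.encode("gb18030").decode("gb18030") == nbsp

print("gb2312 == gb18030 for Chinese sample: %s" % same_gb)
print("ascii == utf-8 for ASCII sample: %s" % same_ascii)
print("no-break space round-trips via gb18030: %s" % nbsp_roundtrip)
```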
 
John Machin

MvL said:
Also, *by definition*, though :)

Ah yes, indeed; and thanks for reminding me. Aside: Similar definition,
but not similar design: IMHO utf-8 sits on top of ASCII like a rose on
a stalk, whereas gb18030 sits on top of gb2312 like a rhinoceros on a
unicycle :)
Cheers,
John
 
