split() can help to read a UTF-16 encoded file without codecs support. Why?

Zhongjian Lu · Mar 17, 2006

Hi Guys,

I was processing a UTF-16 encoded file with a BOM and was not aware of the
codecs module at first. I wrote the following code:
===== Code 1============================
for i in open(r"d:\python24\lzjtest.xml", 'r').readlines():
    i = i.decode("utf-16")
    print i
=======================================
Output was:
Traceback (most recent call last):
  File "D:\Python24\testutf-16.py", line 4, in -toplevel-
    i = i.decode("utf-16")
  File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 84: truncated data

I searched Google and found an article on a similar problem that said to use
split(). I did not quite understand the article, but I recoded it as:
==== Code 2==============================
for i in open(r"d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
    i = i.decode("utf-16")
    print i
=======================================
Then it worked (it echoed the file).

Later I learned about codecs and wrote the following code:

==== Code 3 =============================
import codecs
for i in codecs.open(r"d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
    print i
=======================================
It worked and echoed the file.

I am wondering what the problem is with the first code and why the second
one fixes it.

Thanks in advance.

-Zhongjian

Fuzzyman · Mar 17, 2006

Zhongjian said:
Hi Guys,

I was processing a UTF-16 encoded file with a BOM and was not aware of the
codecs module at first. I wrote the following code:
===== Code 1============================
for i in open(r"d:\python24\lzjtest.xml", 'r').readlines():
    i = i.decode("utf-16")
    print i
=======================================
Output was:
Traceback (most recent call last):
  File "D:\Python24\testutf-16.py", line 4, in -toplevel-
    i = i.decode("utf-16")
  File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 84: truncated data

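UTF-16 is a 'two-byte encoding' (at least for the characters in a file like
this one). Assuming little-endian byte order, which is usual on Windows, this
means that '\r\n' is represented as:

'\r\x00\n\x00'

readlines() knows nothing about the encoding, so it splits on byte boundaries
and ends each line right after the '\n' byte. The pieces look something like:

'...\r\x00\n', '\x00...'

You can see how the first piece is 'truncated': it ends with '\n', which is
only the first byte of a two-byte character, because the data has been split
on bytes instead of characters. That is the byte 0x0a the traceback complains
about.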
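
A minimal sketch of the same effect, written in Python 3 syntax rather than the Python 2.4 used in this thread. The sample data is made up for illustration (it is not the contents of lzjtest.xml), and io.BytesIO stands in for the file opened without a codec:

```python
import codecs
import io

# Hypothetical sample: a BOM followed by little-endian UTF-16 text.
data = codecs.BOM_UTF16_LE + u"<a>\r\n<b>\r\n".encode("utf-16-le")

# readlines() on a byte stream ends each line right after the 0x0a byte.
lines = io.BytesIO(data).readlines()
for line in lines:
    print(repr(line))

# The first piece has an odd number of bytes: its final 0x0a is only
# half of a two-byte character, so the decode is expected to fail.
try:
    lines[0].decode("utf-16")
    error = None
except UnicodeDecodeError as exc:
    error = exc
print(error)
```

The later pieces start with the stray '\x00' left over from the previous line break, so even where they happen to decode without an error they would not give the right text.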

I searched Google and found an article on a similar problem that said to use
split(). I did not quite understand the article, but I recoded it as:
==== Code 2==============================
for i in open(r"d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
    i = i.decode("utf-16")
    print i
=======================================
Then it worked (it echoed the file).

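You will probably find that the two bytes '\r\n' never occur next to each
other in the byte string (there is always a '\x00' between them), so split()
returns the *whole* file as a single piece. The decode succeeds because it
sees the complete data, not because of the split.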
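
A rough illustration of that point, again in Python 3 syntax and with made-up sample data rather than the real file. It also shows one way to get real lines: decode first, then split the resulting text.

```python
import codecs

# Hypothetical sample: a BOM followed by little-endian UTF-16 text.
data = codecs.BOM_UTF16_LE + u"<a>\r\n<b>\r\n".encode("utf-16-le")

# The bytes 0x0d 0x0a are never adjacent here (each is followed by 0x00),
# so split() finds nothing to split on and returns a single piece.
pieces = data.split(b"\r\n")
print(len(pieces))

# Decoding the complete byte string works; the codec consumes the BOM.
whole = pieces[0].decode("utf-16")

# Splitting after decoding operates on characters, not bytes.
text_lines = whole.splitlines()
print(text_lines)
```

This is essentially what codecs.open does for you in Code 3: the bytes are decoded before they are broken into lines.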

HTH

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
