John Machin
Any chance that whatever you used to "simply truncate the trailing
zero byte" also removed the BOM at the start of the file? Without it,
utf16 wouldn't be able to detect endianness and would, I believe, fall
back to native order.
When I read this, I thought "Oh no, surely not!" It seems that you are
correct:
[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'
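A follow-up check (my addition; same box, so little-endian native
order is assumed) confirms that naming the byte order explicitly
decodes correctly, and that the fallback above matched the machine's
native (LE) interpretation:
| >>> nobom.decode('utf_16_be')
| u'abcde'
| >>> nobom.decode('utf_16_le')
| u'\u6100\u6200\u6300\u6400\u6500'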
This may well explain one of the Python 3.0 problems that the OP's two
files exhibit: data appears to have been byte-swapped under some
conditions. One possibility: it is reading the file a chunk at a time
and applying the utf_16 codec independently to each chunk -- only the
first chunk will have a BOM.
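A rough sketch of that hypothesis (my own illustration, not the actual
io.py code; assumes Python 3 on a little-endian machine):

import codecs

# 'abcdefgh' as UTF-16BE with a BOM up front: 2 + 16 = 18 bytes
data = codecs.BOM_UTF16_BE + 'abcdefgh'.encode('utf_16_be')
# decode in fixed-size chunks, applying 'utf_16' to each chunk on its own
chunks = [data[i:i + 6] for i in range(0, len(data), 6)]
print(ascii(''.join(c.decode('utf_16') for c in chunks)))
# -> 'ab\u6300\u6400\u6500\u6600\u6700\u6800'
# only chunk 0 carried the BOM; the later chunks fell back to native
# (little-endian) order and came out byte-swapped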
Well, no; on further investigation, we're not byte-swapped -- we've
tricked ourselves into decoding on odd-byte boundaries.
Here's the scoop: it's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in
128-byte chunks. Converting CR LF to \n requires special-case handling
when '\r' is detected at the end of decoded chunk n, in case there's
an LF at the start of chunk n+1. Buggy solution: prepend b'\r' to the
chunk n+1 bytes and decode that -- suddenly, with a 2-bytes-per-char
encoding like UTF-16, we are 1 byte out of whack. Better (IMVH[1]O)
solution: prepend '\r' to the result of decoding the chunk n+1 bytes.
Each of the OP's files has a '\r' on a 64-character boundary.
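To make the off-by-one concrete, here is a minimal sketch (my own
simplification in Python 3, not the actual IncrementalNewlineDecoder
code) of the buggy path versus the better one:

# chunk n's decoded text ended in '\r'; these are chunk n+1's bytes
pending = b'\r'
chunk_n1 = 'next'.encode('utf_16_be')   # b'\x00n\x00e\x00x\x00t'
# buggy: glue the raw b'\r' onto the bytes, then decode -- every
# 2-byte code unit after it is split across the wrong byte pair
print(ascii((pending + chunk_n1).decode('utf_16_be', 'replace')))
# -> '\u0d00\u6e00\u6500\u7800\ufffd'
# better: decode the bytes first, then prepend '\r' to the *text*
print(ascii('\r' + chunk_n1.decode('utf_16_be')))
# -> '\rnext'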
Note: they would exhibit the same symptoms if encoded in UTF-16LE
instead of UTF-16BE. With the better solution applied, the first file
[the truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible-looking output.
[1] I thought it best to be Very Humble given what you see when you
do:
import io
print(io.__author__)
Hope my surge protector can cope with this
^%!//()
NO CARRIER