Python 3.0 automatic decoding of UTF16


John Machin

Any chance that whatever you used to "simply truncate the trailing
zero byte" also removed the BOM at the start of the file?  Without it,
utf16 wouldn't be able to detect endianness and would, I believe, fall
back to native order.

When I read this, I thought "O no, surely not!". Seems that you are
correct:
[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'
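For contrast, a quick check on the same box shows that with the BOM
present the codec does pick the right byte order (the BOM below is
little-endian because that's the native order on this machine):
| >>> bom = u'abcde'.encode('utf_16')
| >>> bom[:2]
| '\xff\xfe'
| >>> bom.decode('utf16')
| u'abcde'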

This may well explain one of the Python 3.0 problems that the OP's 2
files exhibit: data appears to have been byte-swapped under some
conditions. Possibility: it is reading the file a chunk at a time and
applying the utf_16 encoding independently to each chunk -- only the
first chunk will have a BOM.

Well, no, on further investigation, we're not byte-swapped, we've
tricked ourselves into decoding on odd-byte boundaries.

Here's the scoop: it's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in
128-byte chunks. Converting CR LF to \n requires special-case handling
when '\r' is detected at the end of decoded chunk n, in case there's
an LF at the start of chunk n+1. Buggy solution: prepend b'\r' to the
chunk n+1 bytes and decode that -- suddenly, with a 2-bytes-per-char
encoding like UTF-16, we are 1 byte out of whack. Better (IMVH[1]O)
solution: prepend '\r' to the result of decoding the chunk n+1 bytes.
Each of the OP's files has a '\r' on a 64-character boundary. Note:
they would exhibit the same symptoms if encoded in UTF-16LE instead of
UTF-16BE. With the better solution applied, the first file [the
truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible-looking output.
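To make the mechanism concrete, here's a minimal sketch of the two
approaches (my own illustration, not the actual io.py code; the chunk
contents are invented):

import codecs

# Pretend decoded chunk n ended with '\r' and chunk n+1 starts like this:
chunk_n1 = '\nnext line'.encode('utf_16_be')

dec = codecs.getincrementaldecoder('utf_16_be')()
buggy = dec.decode(b'\r' + chunk_n1)   # CR carried over as a raw byte
print(repr(buggy))    # gibberish -- every code unit is 1 byte out of whack

dec = codecs.getincrementaldecoder('utf_16_be')()
better = '\r' + dec.decode(chunk_n1)   # CR carried over as decoded text
print(repr(better))   # '\r\nnext line', as expected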

[1] I thought it best to be Very Humble given what you see when you
do:
import io
print(io.__author__)
Hope my surge protector can cope with this :)
^%!//()
NO CARRIER
 

Terry Reedy

John said:
Here's the scoop: It's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). [...]

Please post this on the tracker so it can get included with other io
work for 3.0.1.
 

Johannes Bauer

John said:
He did. Ugly stuff using readline() :) Should still work, though.

Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f))
kinda loops :)

But, seriously - I find that whole "while True:" and "if line == ''"
construct ugly as hell, too. How can reading a file line by line be
achieved in a more pythonic kind of way?

Regards,
Johannes
 

D'Arcy J.M. Cain

But, seriously - I find that whole "while True:" and "if line == ''"
construct ugly as hell, too. How can reading a file line by line be
achieved in a more pythonic kind of way?

for line in open(filename):
    <do stuff with line>
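A slightly fuller sketch, if you want the file closed promptly
(do_stuff_with and filename are placeholders):

with open(filename) as f:
    for line in f:
        do_stuff_with(line)   # placeholder for your per-line processing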
 

John Machin

Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f))
kinda loops :)

But, seriously - I find that whole "while True:" and "if line == ''"
construct ugly as hell, too. How can reading a file line by line be
achieved in a more pythonic kind of way?

By using

for line in open(.....)

as mentioned in (1) my message that you were replying to, and (2) the
tutorial:
http://docs.python.org/3.0/tutorial/inputoutput.html#reading-and-writing-files
.... skip the stuff on readline() and readlines() this time :)

While waiting for the bug to be fixed, you'll need something like the
following:

def utf16_getlines(fname, newline_terminated=True):
    # Read the whole file at once to sidestep the chunked-decoding bug.
    f = open(fname, 'rb')
    raw_bytes = f.read()
    f.close()
    decoded = raw_bytes.decode('utf16')
    if newline_terminated:
        # Normalise CR LF to \n, keeping the line terminators.
        normalised = decoded.replace('\r\n', '\n')
        lines = normalised.splitlines(True)
    else:
        lines = decoded.splitlines()
    return lines

That avoids the chunk-reading problem by reading the whole file in one
go. In fact, given the way I've written it, there can be up to 4
copies of the file contents in memory at once. Fortunately your files
are tiny.
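Hypothetical usage, just to show the shape of the result (the file
name is invented):

for line in utf16_getlines('blah.txt'):
    print(repr(line))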

HTH,
John
 
