John Machin
Any chance that whatever you used to "simply truncate the trailing
zero byte" also removed the BOM at the start of the file? Without it,
utf16 wouldn't be able to detect endianness and would, I believe, fall
back to native order.
When I read this, I thought "Oh no, surely not!" It seems that you are
correct:
[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'
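A follow-up check (my addition; same box, so little-endian native
order is assumed) confirms that naming the byte order explicitly
decodes correctly, and that the fallback above matched the machine's
native (LE) interpretation:
| >>> nobom.decode('utf_16_be')
| u'abcde'
| >>> nobom.decode('utf_16_le')
| u'\u6100\u6200\u6300\u6400\u6500'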
This may well explain one of the Python 3.0 problems that the OP's two
files exhibit: data appears to have been byte-swapped under some
conditions. One possibility: it is reading the file a chunk at a time
and applying the utf_16 codec independently to each chunk -- only the
first chunk will have a BOM.
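A rough sketch of that hypothesis (my own illustration, not the actual
io.py code; assumes Python 3 on a little-endian machine):

import codecs

# 'abcdefgh' as UTF-16BE with a BOM up front: 2 + 16 = 18 bytes
data = codecs.BOM_UTF16_BE + 'abcdefgh'.encode('utf_16_be')
# decode in fixed-size chunks, applying 'utf_16' to each chunk on its own
chunks = [data[i:i + 6] for i in range(0, len(data), 6)]
print(ascii(''.join(c.decode('utf_16') for c in chunks)))
# -> 'ab\u6300\u6400\u6500\u6600\u6700\u6800'
# only chunk 0 carried the BOM; the later chunks fell back to native
# (little-endian) order and came out byte-swapped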
Well, no; on further investigation, we're not byte-swapped -- we've
tricked ourselves into decoding on odd-byte boundaries.
Here's the scoop: it's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in
128-byte chunks. Converting CR LF to \n requires special-case handling
when '\r' is detected at the end of decoded chunk n, in case there's
an LF at the start of chunk n+1. Buggy solution: prepend b'\r' to the
chunk n+1 bytes and decode that -- suddenly, with a 2-bytes-per-char
encoding like UTF-16, we are 1 byte out of whack. Better (IMVH[1]O)
solution: prepend '\r' to the result of decoding the chunk n+1 bytes.
Each of the OP's files has a '\r' on a 64-character boundary.
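To make the off-by-one concrete, here is a minimal sketch (my own
simplification in Python 3, not the actual IncrementalNewlineDecoder
code) of the buggy path versus the better one:

# chunk n's decoded text ended in '\r'; these are chunk n+1's bytes
pending = b'\r'
chunk_n1 = 'next'.encode('utf_16_be')   # b'\x00n\x00e\x00x\x00t'
# buggy: glue the raw b'\r' onto the bytes, then decode -- every
# 2-byte code unit after it is split across the wrong byte pair
print(ascii((pending + chunk_n1).decode('utf_16_be', 'replace')))
# -> '\u0d00\u6e00\u6500\u7800\ufffd'
# better: decode the bytes first, then prepend '\r' to the *text*
print(ascii('\r' + chunk_n1.decode('utf_16_be')))
# -> '\rnext'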
Note: they would exhibit the same symptoms if encoded in UTF-16LE
instead of UTF-16BE. With the better solution applied, the first file
[the truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible-looking output.
[1] I thought it best to be Very Humble given what you see when you
do:
import io
print(io.__author__)
Hope my surge protector can cope with this
^%!//()
NO CARRIER