Bill Janssen
I've encountered an issue dealing with strings read from files. I
read a line from a file, then try to print it out as an ASCII string:
    line = fp.readline()
    print line.encode('US-ASCII', 'replace')
and of course I get an error like:
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)
because the file contained some binary character. You'll notice that
the problem is in *decoding* the string, not in re-encoding it,
because I'm using the default "C" locale, and "US-ASCII" is presumed
for strings. But these strings are *not* US-ASCII, they are raw
bytes. How do I format a string of raw bytes for conversion to a
recognized charset encoding for printing?
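To make the decode-versus-encode point concrete, here is a minimal sketch. It assumes Python 3, where bytes and text are separate types, so the decoding step that Python 2 performs implicitly inside `encode()` has to be written out. The sample bytes and the names `raw`, `failing_step` and `message` are made up for illustration; they are not from the original session.

```python
# Hypothetical sample: one non-ASCII byte (0xd5) at position 1.
raw = b"a\xd5b"

failing_step = None
message = None
try:
    # Step 1: bytes -> text. This is the step that fails.
    text = raw.decode("ascii")
    # Step 2: text -> ASCII bytes. Never reached for this input.
    text.encode("ascii", "replace")
except UnicodeDecodeError as exc:
    failing_step = "decode"
    message = str(exc)

print(failing_step)
print(message)
```

The 'replace' error handler passed to the encode step never gets a chance to run, because the exception is raised while decoding.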
There seems to be no 'raw' codec that would capture this. There's no
way of setting an attribute on a file to express this. It looks like
the best I can do is
    print string.join([(((ord(x) > 0 and ord(x) < 0x7F) and x) or (r"\x%02x" % ord(x))) for x in line], '')
which seems extremely inefficient.
Bill
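
One possible approach, sketched under the assumption of Python 3.5 or later: open the file in binary mode so that lines arrive as bytes, then decode with the 'backslashreplace' error handler, which renders each undecodable byte as a \xNN escape. The helper name `printable_ascii` and the sample line are hypothetical. On Python 2, `repr(line)` or `line.encode('string_escape')` may give a similar result, though they also escape backslashes and some other characters.

```python
def printable_ascii(raw: bytes) -> str:
    """Return raw as text, showing each non-ASCII byte as a \\xNN escape."""
    # 'backslashreplace' is accepted by decode() from Python 3.5 onward.
    return raw.decode("ascii", errors="backslashreplace")


# Hypothetical stand-in for a line read with open(path, "rb").readline()
line = b"a\xd5b\n"
print(printable_ascii(line), end="")
```

Unlike the list comprehension above, this leaves the bytes 0x00 and 0x7F unescaped, since both are valid ASCII; whether that matters depends on what the output is used for.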