A 'raw' codec for binary "strings" in Python?

Bill Janssen · Mar 1, 2004

I've encountered an issue dealing with strings read from files. I
read a line from a file, then try to print it out as an ASCII string:

line = fp.readline()
print line.encode('US-ASCII', 'replace')

and of course I get an error like:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)

because the file contained some binary character. You'll notice that
the problem is in *decoding* the string, not in re-encoding it,
because I'm using the default "C" locale, and "US-ASCII" is presumed
for strings. But these strings are *not* US-ASCII, they are raw
bytes. How do I format a string of raw bytes for conversion to a
recognized charset encoding for printing?

There seems to be no 'raw' codec that would capture this. There's no
way of setting an attribute on a file to express this. It looks like
the best I can do is

print string.join([(((ord(x) > 0 and ord(x) < 0x7F) and x) or (r"\x%02x" % ord(x))) for x in line], '')

which seems extremely inefficient.

Bill

Erik Max Francis · Mar 2, 2004

Bill said:
> You'll notice that the problem is in *decoding* the string, not in
> re-encoding it, because I'm using the default "C" locale, and
> "US-ASCII" is presumed for strings. But these strings are *not*
> US-ASCII, they are raw bytes. How do I format a string of raw bytes
> for conversion to a recognized charset encoding for printing?

Since the default encoding is ASCII, those 8-bit octets have no meaning
unless you do an explicit conversion. Trying to print them _should_
raise an error, because you're trying to do something that doesn't make
sense.

As Gerrit pointed out, it sounds like what you want is repr.
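
To illustrate the two points above, here is a minimal sketch in present-day Python 3, where raw bytes and text are separate types. The sample line is hypothetical. A strict ASCII decode of a byte outside the 7-bit range fails, while `repr` gives an escaped, printable view of the same bytes. The `backslashreplace` error handler is assumed to be available for decoding, which holds from Python 3.5 onward.

```python
line = b"caf\xd5 ok\n"   # hypothetical line read from a binary file

# A strict ASCII decode rejects the 0xd5 byte.
try:
    line.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# repr() shows every non-ASCII or control byte as an escape sequence.
escaped_repr = repr(line)

# Decoding with 'backslashreplace' escapes only the undecodable bytes
# and leaves valid ASCII, including the newline, untouched.
escaped_text = line.decode("ascii", "backslashreplace")

print(strict_ok)
print(escaped_repr)
print(escaped_text, end="")
```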

Michael Hudson · Mar 2, 2004

Bill Janssen said:
> I've encountered an issue dealing with strings read from files. I
> read a line from a file, then try to print it out as an ASCII string:
>
> line = fp.readline()
> print line.encode('US-ASCII', 'replace')
>
> and of course I get an error like:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)
>
> because the file contained some binary character. You'll notice that
> the problem is in *decoding* the string, not in re-encoding it,
> because I'm using the default "C" locale, and "US-ASCII" is presumed
> for strings.

Actually, the "C" locale has precisely nothing to do with it.

> But these strings are *not* US-ASCII, they are raw bytes. How do I
> format a string of raw bytes for conversion to a recognized charset
> encoding for printing?

You don't?

Wouldn't

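import string

def m(c):
    # Keep printable characters; map everything else to '?'.
    if c in string.printable:
        return c
    else:
        return '?'

# 256-character translation table, one entry per byte value.
t = ''.join([m(chr(o)) for o in range(256)])

line.translate(t)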

make more sense?

Cheers,
mwh
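
For readers on Python 3, the same translation-table idea can be sketched with `bytes.translate`, which takes a 256-byte table. This is an adaptation, not the code as posted: the names `PRINTABLE`, `TABLE`, and `cleaned` and the sample line are hypothetical.

```python
import string

# Byte values of the characters in string.printable
# (digits, letters, punctuation, and ASCII whitespace).
PRINTABLE = set(string.printable.encode("ascii"))

# 256-entry table: printable bytes map to themselves, the rest to '?'.
TABLE = bytes(b if b in PRINTABLE else ord("?") for b in range(256))

line = b"caf\xd5 ok\n"           # hypothetical sample line
cleaned = line.translate(TABLE)  # still a bytes object
print(cleaned.decode("ascii"), end="")
```

Because every byte in the result is printable ASCII, the final decode cannot fail. The trade-off against an escaping approach is that the original byte values are lost.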
