A 'raw' codec for binary "strings" in Python?

Bill Janssen

I've encountered an issue dealing with strings read from files. I
read a line from a file, then try to print it out as an ASCII string:

line = fp.readline()
print line.encode('US-ASCII', 'replace')

and of course I get an error like:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)

because the file contained some binary character. You'll notice that
the problem is in *decoding* the string, not in re-encoding it,
because I'm using the default "C" locale, and "US-ASCII" is presumed
for strings. But these strings are *not* US-ASCII, they are raw
bytes. How do I format a string of raw bytes for conversion to a
recognized charset encoding for printing?

There seems to be no 'raw' codec that would capture this. There's no
way of setting an attribute on a file to express this. It looks like
the best I can do is

print string.join([(((ord(x) > 0 and ord(x) < 0x7F) and x) or (r"\x%02x" % ord(x))) for x in line], '')

which seems extremely inefficient.
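
For illustration only, the escaping that one-liner performs can be sketched as a small helper. The name `escape_bytes` is hypothetical, and going through `bytearray` is an assumption made so the same code runs on both Python 2 and Python 3. Unlike the one-liner, this sketch also escapes control characters such as the newline.

```python
def escape_bytes(raw):
    """Return raw bytes as text, with bytes outside printable ASCII as \\xNN escapes."""
    parts = []
    # bytearray yields integers on both Python 2 and Python 3.
    for byte in bytearray(raw):
        if 0x20 <= byte < 0x7F:
            parts.append(chr(byte))         # printable ASCII passes through
        else:
            parts.append("\\x%02x" % byte)  # everything else is escaped
    return "".join(parts)


# Made-up sample standing in for a line read from a binary file.
escaped = escape_bytes(b"a\xd5b\n")
print(escaped)
```

The loop is still linear in the length of the line; it mainly avoids calling `ord` several times per character.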

Bill
 
Erik Max Francis

Bill said:
You'll notice that
the problem is in *decoding* the string, not in re-encoding it,
because I'm using the default "C" locale, and "US-ASCII" is presumed
for strings. But these strings are *not* US-ASCII, they are raw
bytes. How do I format a string of raw bytes for conversion to a
recognized charset encoding for printing?

Since the default encoding is ASCII, those 8-bit octets have no meaning
unless you do an explicit conversion. Trying to print them _should_
raise an error, because you're trying to do something that doesn't make
sense.

As Gerrit pointed out, it sounds like what you want is repr.
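
A minimal sketch of the `repr` suggestion, assuming `line` holds the raw bytes read from the file (the sample value below is made up). `repr` returns a printable-ASCII description of the string in which non-ASCII bytes appear as `\xNN` escapes, so printing the result does not trigger an implicit decode. On Python 3 the result carries a `b` prefix because the value is a `bytes` object.

```python
# Made-up sample standing in for a line read from a binary file.
line = b"abc\xd5def\n"

# repr() yields printable ASCII; bytes outside that range appear as \xNN escapes.
escaped = repr(line)
print(escaped)
```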
 
Michael Hudson

Bill Janssen said:
I've encountered an issue dealing with strings read from files. I
read a line from a file, then try to print it out as an ASCII string:

line = fp.readline()
print line.encode('US-ASCII', 'replace')

and of course I get an error like:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)

because the file contained some binary character. You'll notice that
the problem is in *decoding* the string, not in re-encoding it,
because I'm using the default "C" locale, and "US-ASCII" is presumed
for strings.

Actually, the "C" locale has precisely nothing to do with it.

Bill Janssen said:
But these strings are *not* US-ASCII, they are raw bytes. How do I
format a string of raw bytes for conversion to a recognized charset
encoding for printing?

You don't?

Wouldn't

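import string

def m(c):
    # Keep printable characters; map everything else to '?'.
    if c in string.printable:
        return c
    else:
        return '?'

# 256-entry translation table, one entry per possible byte value.
t = ''.join([m(chr(o)) for o in range(256)])

line.translate(t)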

make more sense?
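
A self-contained variant of the same table idea, sketched so that it also runs on Python 3, where `bytes.translate` takes a 256-byte table. The names `printable`, `table`, and `cleaned`, and the sample `line`, are assumptions for illustration and are not from the post above.

```python
import string

# Byte values of the characters in string.printable (includes tab and newline).
printable = set(bytearray(string.printable.encode("ascii")))

# One table entry per byte value: printable bytes map to themselves, the rest to '?'.
table = bytes(bytearray(b if b in printable else ord("?") for b in range(256)))

# Made-up sample standing in for a line read from a binary file.
line = b"a\xd5b\n"
cleaned = line.translate(table)
print(cleaned)
```

Building the table once and reusing it for every line keeps the per-line work inside `translate`, rather than in a Python-level loop over each character.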

Cheers,
mwh
 
