Trouble fixing a broken ASCII string - "replace" mode in codec not working.

Robert Kern · Feb 6, 2007

John said:
I'm trying to clean up a bad ASCII string, one read from a
web page that is supposedly in the ASCII character set but has some
characters above 127. And I get this:

File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
sitetext = sitetext.encode('ascii','replace') # force to clean ASCII

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151:
ordinal not in range(128)

Why is that exception being raised when the codec was told 'replace'?

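The .encode('ascii') method converts unicode strings to str strings. Since you
gave it a str string, it first tried to convert it to a unicode string using the
default codec ('ascii') with strict error handling, just as if you had written
unicode(sitetext).encode('ascii', 'replace'). The 'replace' argument applies only
to the encode step, so the implicit decode still raises UnicodeDecodeError.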

I think you want something like this:

sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
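
To make the two steps visible, here is a minimal sketch. It is written for Python 3, where byte strings have no .encode() method, so the implicit decode described above has to be spelled out. The thread itself is about Python 2.4, and the sample bytes in raw are made up for illustration.

```python
# Python 3 sketch; in Python 2.4 'raw' would be an ordinary str.
raw = b"It\x92s broken"  # hypothetical sample containing the byte 0x92

# Step 1: a strict ASCII decode (what Python 2 did implicitly before
# encoding) fails on 0x92 before any 'replace' handler is consulted.
try:
    raw.decode("ascii")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# Step 2: decoding with 'replace' turns the bad byte into U+FFFD, and
# encoding with 'replace' then turns U+FFFD into '?'.
text = raw.decode("ascii", "replace")
cleaned = text.encode("ascii", "replace")

print(strict_decode_failed, cleaned)
```

Each byte above 127 ends up as a single '?' in the result.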

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

John Nagle · Feb 6, 2007

I'm trying to clean up a bad ASCII string, one read from a
web page that is supposedly in the ASCII character set but has some
characters above 127. And I get this:

File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
sitetext = sitetext.encode('ascii','replace') # force to clean ASCII

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151:
ordinal not in range(128)

Why is that exception being raised when the codec was told 'replace'?

(And no, just converting it to Unicode with "sitetext = unicode(sitetext)"
won't work either; that correctly raises a Unicode conversion exception.)

[Python 2.4, Win32]

John Nagle

Neil Cerutti · Feb 7, 2007

Robert Kern said:
The .encode('ascii') takes unicode strings to str strings.
Since you gave it a str string, it first tried to convert it to
a unicode string using the default codec ('ascii'), just as if
you had written unicode(sitetext).encode('ascii', 'replace').

I think you want something like this:

sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')

This is the cue for the translate method, which will be much
faster and simpler for cases like this. You can build the
translation table yourself, or use maketrans.
.... '?'*127)

You'd only want to do that once. Then, to replace the non-ASCII bytes:

sitetext.translate(asciitable)
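
For reference, a minimal sketch of the translation-table approach. It assumes Python 3, where the table comes from bytes.maketrans (in Python 2 the counterpart is string.maketrans). The name asciitable follows the post, while high_bytes, force_ascii and the sample data are hypothetical names chosen for illustration, not the code from the original message.

```python
# Python 3 sketch of a byte translation table.
# Map every byte value 128..255 to '?'; bytes 0..127 pass through unchanged.
high_bytes = bytes(range(128, 256))
asciitable = bytes.maketrans(high_bytes, b"?" * len(high_bytes))

def force_ascii(data: bytes) -> bytes:
    """Return data with each non-ASCII byte replaced by '?'."""
    return data.translate(asciitable)

sitetext = b"It\x92s caf\xe9"  # hypothetical sample with two high bytes
print(force_ascii(sitetext))
```

The table is built once at import time and reused for every call, which is where the speed advantage over a decode/encode round trip would come from.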

I used a similar solution in an application I'm working on that
must use a Latin-1 byte encoding internally, but displays on
stdout in ASCII.
