Python 2.4 vs 2.5 - Unicode error

G

Gaurav Veda

Hi,

I am trying to put some webpages into a mysql database using python
(after some processing on the text). If I use Python 2.4.2, it works
without a fuss. However, on Python 2.5, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4357: ordinal not in range(128)

Before sending the (insert) query to the mysql server, I do the
following which I think should've taken care of this problem:
sqlStr = sqlStr.replace('\\', '\\\\')

(where sqlStr is the query).

Any suggestions?

Thanks!
Gaurav
 
J

John Machin

Hi,

I am trying to put some webpages into a mysql database using python
(after some processing on the text). If I use Python 2.4.2, it works
without a fuss. However, on Python 2.5, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4357: ordinal not in range(128)

Before sending the (insert) query to the mysql server, I do the
following which I think should've taken care of this problem:
 sqlStr = sqlStr.replace('\\', '\\\\')

(where sqlStr is the query).

Any suggestions?

The 0xc2 strongly suggests that you are feeding the beast data encoded
in UTF-8 while giving it no reason to believe that it is in fact not
encoded in ASCII. Curiously the first errant byte is a long way (4KB)
into your data. Consider doing
print repr(data)
to see what you've actually got there.

I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
different versions of mysql, perhaps?

Show at the very least the full traceback that you get. Try to write a
short script that demonstrates the problem with 2.5 and no problem
with 2.4, so that (a) it is apparent what you are doing (b) the
problem can be reproduced if necessary by someone with access to
mysql.

You might like to explain why you think that doubling backslashes in
your SQL is a good idea, and amplify "some processing on the text".

HTH,
John
 
G

Gaurav Veda

The 0xc2 strongly suggests that you are feeding the beast data encoded
in UTF-8 while giving it no reason to believe that it is in fact not
encoded in ASCII. Curiously the first errant byte is a long way (4KB)
into your data. Consider doing
print repr(data)
to see what you've actually got there.
' and 25\xc2\xb0F'

All I want to do is to just replace all the non-ascii characters by a
space.
I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
different versions of mysql, perhaps?

I am trying to put content into the mysql server running on machine A,
from machine B & machine C with different versions of python. So I
don't think this is a mysql issue.
Show at the very least the full traceback that you get. Try to write a
short script that demonstrates the problem with 2.5 and no problem
with 2.4, so that (a) it is apparent what you are doing (b) the
problem can be reproduced if necessary by someone with access to
mysql.

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "putDataIntoDB.py", line 164, in <module>
cursor.execute(sqlStr)
File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
146, in execute
query = query.encode(charset)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4359: ordinal not in range(128)
You might like to explain why you think that doubling backslashes in
your SQL is a good idea, and amplify "some processing on the text".

I thought this will achieve 2 things.
a) It will escape any unicode character (obviously, I was wrong. Got
carried away by the display. I thought \xc2 will get escaped to \\xc2,
which is completely preposterous).
b) It will make sure that the escape sequences in the string (e.g.
'\n') are received by mysql as an escape sequence.

Thanks for your reply!
Gaurav
 
J

John Machin

The 0xc2 strongly suggests that you are feeding the beast data encoded
in UTF-8 while giving it no reason to believe that it is in fact not
encoded in ASCII. Curiously the first errant byte is a long way (4KB)
into your data. Consider doing
    print repr(data)
to see what you've actually got there.
sqlStr[4352:4362]

' and 25\xc2\xb0F'

That's the UTF-8 version of ' and 25°F' where the character between
the 25 and the F is U+00B0 DEGREE SIGN ... interesting stuff to have
in an SQL query string.
All I want to do is to just replace all the non-ascii characters by a
space.

I can't imagine why you would want to do that to data, let alone to an
SQL query.

I can't see any evidence that you actually tried to do that, anyway.

To replace non-ASCII characters in a UTF-8-encoded string by spaces:
| >>> u8 = ' and 25\xc2\xb0F'
| >>> u = u8.decode('utf8')
| >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
| ' and 25 F'
I am trying to put content into the mysql server running on machine A,
from machine B & machine C with different versions of python. So I
don't think this is a mysql issue.

Terminology confusion. Consider the possibility of different versions
of MySQLdb (the client interface package) on the client machines B and
C.

Also consider the possibility that you didn't run exactly the same
code on B and C.

How about a very small script which includes the minimum necessary to
run these two lines (with appropriate substitutions for column_x and
table_y:
sql_str = "select column_x from table_y where column_x = '\xc2\xb0'"
cursor.execute(sql_str)

and run that on B and C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "putDataIntoDB.py", line 164, in <module>
    cursor.execute(sqlStr)
  File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
146, in execute
    query = query.encode(charset)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4359: ordinal not in range(128)


I thought this will achieve 2 things.
a) It will escape any unicode character (obviously, I was wrong. Got
carried away by the display. I thought \xc2 will get escaped to \\xc2,
which is completely preposterous).
b) It will make sure that the escape sequences in the string (e.g.
'\n') are received by mysql as an escape sequence.

Run-time programmatic fiddling with an SQL query string is dangerous
and tricky at the best of times, worse when you don't inspect the
result before you press the launch button.

Cheers,
John
 
W

Wolfgang Rohdewald

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4357: ordinal not in range(128)

Before sending the (insert) query to the mysql server, I do the
following which I think should've taken care of this problem:
sqlStr = sqlStr.replace('\\', '\\\\')

you might consider using what mysql offers about unicode: save
all strings encoded as unicode. Might be more work now but I think
it would be a good investment in the future.

have a look at the mysql documentation for

mysql_real_escape_string() takes care of quoted chars.

mysql_set_character_set() for setting the character set used
by the database connection

you can ensure that the web page is unicode by doing something
like

charsetregex = re.compile(r'charset=(.*?)[\"&]')
charsetmatch = charsetregex.search(page)
if charsetmatch:
charset=charsetmatch.group(1)
utf8Text = unicode(page,charset)
 
G

Gaurav Veda

To replace non-ASCII characters in a UTF-8-encoded string by spaces:
| >>> u8 = ' and 25\xc2\xb0F'
| >>> u = u8.decode('utf8')
| >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
| ' and 25 F'

Thanks John for your reply. This is what I needed.

Cheers,
Gaurav
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top