David Hughes
I used codecs.EncodedFile successfully with Python 2.4 to alter the encoding
of a set of database records from latin-1 to utf-8, but the same
program raises an exception using Python 2.5. This small example shows
the problem:
import codecs
fo = open('test.dat', 'w')
fo.write('G\xe2teaux')
fo.close()
fi = open("test.dat",'r')
fx = codecs.EncodedFile(fi, 'utf-8', 'latin-1')
astring = fx.readline()
print astring
ustring = unicode(astring, 'utf-8' )
print repr(ustring)
print ustring.encode('latin-1')
print ustring.encode('utf-8')
Python 2.4 gives:
Gâteaux
u'G\xe2teaux'
Gâteaux
Gâteaux
which I believe is correct, while 2.5 produces:
Traceback (most recent call last):
  File "test_codec.py", line 8, in <module>
    astring = fx.readline()
  File "C:\Python25\lib\codecs.py", line 709, in readline
    data = self.reader.readline()
  File "C:\Python25\lib\codecs.py", line 471, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python25\lib\codecs.py", line 418, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
Is there a genuine problem here, or have I been misusing this function?
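
For comparison, here is a minimal sketch of the same latin-1 to utf-8 conversion done with explicit decoding and encoding rather than codecs.EncodedFile. It assumes Python 2.6 or later (or Python 3), since it relies on io.open, so it is not a drop-in replacement for 2.4 or 2.5. The helper name transcode_latin1_to_utf8 and the output file name test_utf8.dat are hypothetical, chosen only for illustration.

```python
import io
import os
import tempfile


def transcode_latin1_to_utf8(src_path, dst_path):
    """Copy src_path to dst_path, decoding latin-1 and encoding utf-8.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    # newline='' leaves line endings untouched in both directions.
    with io.open(src_path, 'r', encoding='latin-1', newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:
                dst.write(line)


# Recreate the test file from the example above in a scratch directory.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'test.dat')
dst_path = os.path.join(tmpdir, 'test_utf8.dat')  # assumed output name

with open(src_path, 'wb') as fo:
    fo.write(b'G\xe2teaux')  # latin-1 bytes

transcode_latin1_to_utf8(src_path, dst_path)

# Read the result back as raw bytes to inspect the encoding.
with open(dst_path, 'rb') as fi:
    converted = fi.read()
```

Here converted should hold the utf-8 bytes for the same text, with the single latin-1 byte \xe2 replaced by the two-byte sequence \xc3\xa2. Being explicit about which side is decoded and which is encoded avoids any ambiguity over the argument order of EncodedFile.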