unicode to ascii converting

Peter Wilkinson · Aug 6, 2004

Hello list members,

I am using the encode function to convert Unicode to ASCII. At one point
this code was working just fine; however, it has now broken.

I am reading a text file that is in Unicode (I am unsure of which
flavour or bit depth). As I read the file in one line at a time
(readlines()), I convert each line to ASCII. Simple enough. At the same time I am
compressing to bz2 with the bz2 module, but that works just fine. The code
and the error reported appear below. I am unsure what to do.

Since it reports that the ordinal is not in range, I assume this has
something to do with the character width of what I am reading?

Peter W.

def encode_file(file_path, encode_type, compress='N'):
    """
    Changes encoding of file
    """
    new_encode = encode_type
    old_file_path = file_path + '.old'
    new_file_path = file_path
    os.rename(file_path,old_file_path)
    file_in = file(old_file_path,'r')

    if compress == 'Y' or compress == 'y':
        bz_file_path = file_path + '.bz2'
        bz_file_out = bz2.BZ2File(bz_file_path, 'w')
        for line in file_in.readlines():
            bz_file_out.write(line.encode(new_encode))
        bz_file_out.close()

    else:
        file_out = file(file_path,'w')
        for line in file_in.readlines():
            file_out.write(line.encode(new_encode))
        file_out.close()

    file_in.close()
    os.remove(old_file_path)

ERROR Reported:

Parsing X:\GenomeQuebec_repository\microarray\HIS\M15K\Step_1_repository\HISH0224.txt
Traceback (most recent call last):
  File "C:\Program Files\ActiveState Komodo 2.5\callkomodo\kdb.py", line 433, in _do_start
    self.kdb.run(code_ob, locals, locals)
  File "C:\Python23\lib\bdb.py", line 350, in run
    exec cmd in globals, locals
  File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 158, in ?
    main()
  File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 75, in main
    encode_file(fileToProcess, options.encode, 'Y')
  File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 144, in encode_file
    bz_file_out.write(line.encode(new_encode))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Tom B. · Aug 6, 2004

I've encountered this problem before, and I've come up with a fix that
works but is probably not the best:

def is_ord (strng):
    new_text = ''
    for i in strng:
        if ord(i) > 127:
            new_text = new_text + ''
        else:
            new_text = new_text + i
    return new_text

#Then just,

text_from_file = is_ord(text_from_file)

Tom

Peter Wilkinson · Aug 6, 2004

Thanks Tom B.,

I will try that for now ....

It would be good to find out _why_ this happens in the first place. I will
do a little searching on this for a few days.

Peter W.

Bernhard Herzog · Aug 6, 2004

Peter Wilkinson said:
It would be good to find out _why_ this happens in the first place. I
will do a little searching on this for a few days.

Most likely because you have characters in that file that are not in the
ASCII character set. ASCII is, after all, only a very small subset of
Unicode. E.g.

>>> u"\xe4".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

If it's OK to lose information, you could use the error argument to
.encode like

>>> u"\xe4".encode("ascii", "ignore")
''

or

>>> u"\xe4".encode("ascii", "replace")
'?'
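
A small sketch of the error handlers may make the difference clearer. It is written for Python 3, which is an assumption on my part (the thread itself uses Python 2.3); in Python 3 ordinary string literals are already Unicode and .encode returns a bytes object.

```python
# Python 3 sketch; in Python 2 the literal would be written u"\xe4".
text = "\xe4"  # the character ä, which has no ASCII equivalent

# The default handler, "strict", raises UnicodeEncodeError.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "ignore" silently drops the character; "replace" substitutes a question mark.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")

print(strict_failed, ignored, replaced)
```

Both handlers lose information, so they only make sense when the non-ASCII characters really are disposable.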

Bernhard

Peter Wilkinson · Aug 6, 2004

I tried the function; it does not seem to work as I expected.

What happens is that the character encoding seems to change in the
following way: something equivalent to a return character is placed
after each character ... or, when I view the file in Excel, there is a blank
row between each row.

It's very strange.

back to the drawing board
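
One possible explanation, assuming the file is UTF-16 encoded (an assumption at this point in the thread): dropping every byte above 127 removes only the byte order mark and leaves the zero byte that pads each ASCII character. The sample data below is hypothetical, and the sketch is Python 3, where the filter has to run over bytes rather than a str.

```python
import codecs

# Hypothetical sample: the text "ab" plus a Windows line ending,
# stored as UTF-16 LE with a leading byte order mark.
raw = codecs.BOM_UTF16_LE + "ab\r\n".encode("utf-16-le")

# Byte-wise equivalent of the is_ord filter: keep only values up to 127.
filtered = bytes(b for b in raw if b <= 127)

# Every character is still followed by a NUL byte.
print(filtered)
```

Many programs show those leftover NUL bytes as stray characters or extra line breaks, which may be what appears here as blank rows.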

Peter Wilkinson · Aug 6, 2004

Well, this is interestingly annoying:

u"ä".encode("ascii", "ignore") -> ''

works just fine, but as I have written it:

aa = "ä"
aa.encode("ascii","ignore") ->

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

So I am guessing that I don't understand something about the syntax.

Peter

vincent wehren · Aug 6, 2004

Peter said:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
ordinal not in range(128)

0xff in position 0? If there is a 0xfe in position 1, I would suspect
you're dealing with the byte order mark of a UTF-16 encoded file (UTF-16
LE, to be precise). What happens if you skip the first 2 bytes of the file?
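
The check described above could be sketched roughly as follows. This is Python 3, and sniff_utf16_bom is a hypothetical helper name, not something from the thread. Note that a UTF-32 LE file also starts with 0xff 0xfe, so this is a hint rather than proof.

```python
import codecs

def sniff_utf16_bom(first_bytes):
    """Return a codec name if the data starts with a UTF-16 BOM, else None."""
    if first_bytes.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None

# Hypothetical first bytes of a file: BOM followed by "h" in UTF-16 LE.
print(sniff_utf16_bom(b"\xff\xfeh\x00"))
```

Skipping the two bytes only removes the mark; the rest of the file is still two bytes per character and still has to be decoded as UTF-16.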

Martin v. Löwis · Aug 6, 2004

Peter said:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
ordinal not in range(128)

That error actually says what happened: You have the byte with the
numeric value 0xff in the input, and the ASCII (American Standard
Code for Information Interchange) converter cannot convert that
into a Unicode character. This is because ASCII is a 7-bit character
set, i.e. it goes from 0..127. 0xFF is 255, so it is out of range.

Now, the line triggering this is

bz_file_out.write(line.encode(new_encode))

and it invokes *encode*, not *decode*. Why would it give a decode error
then?

Because:

decode: take a byte string, return a Unicode string
encode: take a Unicode string, return a byte string

So line should be a Unicode string, for .encode to be a meaningful thing
to do. Unfortunately, Python supports .encode also for byte strings.
If new_encode defines a character encoding, this does

class str:
    def encode(self, encoding):
        unistr = unicode(self)
        return unistr.encode(encoding)

So it first tries to convert the current string into unicode, which
uses the system default encoding, which is us-ascii. Hence the error.
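
The same two steps can be made explicit. The sketch below is Python 3, where byte strings no longer have .encode at all, so the implicit ASCII decode described above has to be written out; the sample bytes are hypothetical.

```python
# Hypothetical UTF-16 LE data with a byte order mark, spelling "hi".
raw = b"\xff\xfeh\x00i\x00"

# Decoding with the ASCII codec fails on the very first byte,
# which is what the implicit conversion in the traceback ran into.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec that matches the data, then encoding, works.
text = raw.decode("utf-16")         # the "utf-16" codec consumes the BOM
ascii_bytes = text.encode("ascii")

print(error_message)
print(ascii_bytes)
```

The order matters: bytes are decoded into Unicode with the codec they were written in, and only then encoded into the target encoding.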

HTH,
Martin

Peter Wilkinson · Aug 6, 2004

Thanks for the clear explanation.

I modified my code and now this works.

Peter

Michel Claveau - abstraction méta-galactique non t · Aug 7, 2004

Hi !

Try :

aa = u"ä"
aa.encode("ascii","ignore")

Michel Claveau - abstraction méta-galactique non t · Aug 7, 2004

Sorry !

The COMPLETE script is :

# -*- coding: cp1252 -*-
aa = u"ä"
aa.encode("ascii","ignore")

Peter Wilkinson · Aug 9, 2004

Thanks for the help,

I have got it working. The problem was that I was not reading the file
into a Unicode string.
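
The modified code was not posted, so the following is only a guess at what such a fix might look like, written for Python 3. The source_encoding and errors parameters and the boolean compress flag are my additions, not part of the original function.

```python
import bz2
import os

def encode_file(file_path, source_encoding, target_encoding,
                compress=False, errors="strict"):
    """Re-encode a text file in place, or write a .bz2 copy instead."""
    old_file_path = file_path + ".old"
    os.rename(file_path, old_file_path)
    out_path = file_path + ".bz2" if compress else file_path
    opener = bz2.open if compress else open
    # Decode while reading, so each line is real text before it is encoded.
    # newline="" leaves the original line endings untouched.
    with open(old_file_path, "r", encoding=source_encoding, newline="") as file_in:
        with opener(out_path, "wb") as file_out:
            for line in file_in:
                file_out.write(line.encode(target_encoding, errors))
    os.remove(old_file_path)
    return out_path
```

A call such as encode_file(path, "utf-16", "ascii") would then raise on the first character that has no ASCII equivalent, unless errors is set to "ignore" or "replace".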

Peter

Skip Montanaro · Aug 10, 2004

Michel> # -*- coding: cp1252 -*-
Michel> aa = u"ä"
Michel> aa.encode("ascii","ignore")

A somewhat less destructive solution might be to try my latscii codec:

http://manatee.mojam.com/~skip/python/latscii.py

(assuming your input is encoded as latin-1).
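
The latscii codec itself is not reproduced here. As a rough illustration of the same goal with a different, standard-library technique, Unicode decomposition can strip accents before encoding. This is a Python 3 sketch, and to_ascii_approx is a hypothetical name.

```python
import unicodedata

def to_ascii_approx(text):
    """Approximate text in ASCII by removing combining accents."""
    # NFKD splits "ä" into "a" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so "ignore" discards only them.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii_approx("ä"))
```

Characters with no decomposition, such as "ß" or "ø", are still dropped by this approach.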

Skip
