ASCII to Unicode

Joe Goldthwaite

Thanks to all of you who responded. I guess I was working from the wrong
premise. I was thinking that a file could write any kind of data and that
once I had my Unicode string, I could just write it out with a standard
file.write() operation.

What was actually happening is that the file.write() operation was generating
the error until I re-encoded the string as UTF-8. This is what worked:

import unicodedata

input = file('ascii.csv', 'rb')
output = file('unicode.csv','wb')

for line in input.xreadlines():
    unicodestring = unicode(line, 'latin1')
    output.write(unicodestring.encode('utf-8'))  # This second encode is what I was missing.

input.close()
output.close()

A number of you pointed out what I was doing wrong, but I couldn't understand
it until I realized that the write operation didn't work until it was using
a properly encoded Unicode string. I thought I was getting the error on the
initial Latin-1 to Unicode conversion, not in the write operation.

This still seems odd to me. I would have thought that the unicode function
would return a properly encoded byte stream that could then simply be
written to disk. Instead it seems like you have to re-encode the byte stream
to some kind of escaped Ascii before it can be written back out.

Thanks to all of you who took the time to respond. I really do appreciate
it. I think with my mental block, I couldn't have figured it out without
your help.
 
Steven D'Aprano

Joe said:
This still seems odd to me. I would have thought that the unicode
function would return a properly encoded byte stream that could then
simply be written to disk. Instead it seems like you have to re-encode
the byte stream to some kind of escaped Ascii before it can be written
back out.

I'm afraid that's not even wrong. The unicode function returns a unicode
string object, not a byte-stream, just as the list function returns a
sequence of objects, not a byte-stream.

Perhaps this will help:

http://www.joelonsoftware.com/articles/Unicode.html


Summary:

ASCII is not a synonym for bytes, no matter what some English-speakers
think. ASCII is an encoding from bytes like \x41 to characters like "A".

Unicode strings are a sequence of code points. A code point is a number,
implemented in some complex fashion that you don't need to care about.
Each code point maps conceptually to a letter; for example, the English
letter A is represented by the code point U+0041 and the Arabic letter
Ain is represented by the code point U+0639.

You shouldn't make any assumptions about the size of each code-point, or
how they are put together. You shouldn't expect to write code points to a
disk and have the result make sense, any more than you could expect to
write a sequence of tuples or sets or dicts to disk in any sensible
fashion. You have to serialise it to bytes first, and that's what the
encode method does. Decode does the opposite, taking bytes and creating
unicode strings from them.
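
A minimal sketch of that round trip (Python 2; the byte values here are
illustrative, assuming Latin-1 input as in Joe's case):

raw = 'caf\xe9'                     # bytes as read from disk; \xe9 is e-acute in latin1
text = raw.decode('latin1')         # decode: bytes -> unicode string of code points
serialised = text.encode('utf-8')   # encode: unicode string -> bytes for disk
print repr(text)                    # u'caf\xe9'
print repr(serialised)              # 'caf\xc3\xa9' -- one code point became two bytes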

For historical reasons -- backwards compatibility with files already
created, back in the Bad Old Days before unicode -- there are a whole
slew of different encodings available. There is no 1:1 mapping between
bytes and strings. If all you have are the bytes, there is literally no
way of knowing what string they represent (although sometimes you can
guess). You need to know what the encoding used was, or take a guess, or
make repeated decodings until something doesn't fail and hope that's the
right one.
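
A naive version of that trial-and-error decoding might look like this; the
candidate list is an assumption, and latin1 belongs last because it never
fails, so it acts as a catch-all:

def guess_decode(raw, candidates=('utf-8', 'cp1252', 'latin1')):
    # Try each encoding in turn; return the first that decodes cleanly.
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding matched')

print guess_decode('Mart\xednez')   # (u'Mart\xednez', 'cp1252')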

As a general rule, Python will try encoding/decoding using the ASCII
encoding unless you tell it differently.
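
That ASCII default is exactly what bit Joe: writing a unicode string to an
ordinary file triggers an implicit .encode('ascii'). A reduced sketch (the
filename is arbitrary):

out = open('out.txt', 'wb')
try:
    out.write(u'caf\xe9')    # implicit ascii encode of a non-ascii code point
except UnicodeEncodeError, e:
    print e                  # 'ascii' codec can't encode character u'\xe9' ...
out.close()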

Any time you are writing to disk, you need to serialise the objects,
regardless of whether they are floats, or dicts, or unicode strings.
 
Ulrich Eckhardt

Joe said:
import unicodedata

input = file('ascii.csv', 'rb')
output = file('unicode.csv','wb')

for line in input.xreadlines():
    unicodestring = unicode(line, 'latin1')
    output.write(unicodestring.encode('utf-8'))  # This second encode is what I was missing.

Actually, I see two problems here:

1. "ascii.csv" is not an ASCII file but a Latin-1 encoded file, so there
starts the first confusion.
2. "unicode.csv" is not a "Unicode" file, because Unicode is not a file
format. Rather, it is a UTF-8 encoded file, which is one encoding of
Unicode. This is the second confusion.

Joe said:
A number of you pointed out what I was doing wrong but I couldn't
understand it until I realized that the write operation didn't work until
it was using a properly encoded Unicode string.

The write function wants bytes! Encoding a string in your favourite encoding
yields bytes.

Joe said:
This still seems odd to me. I would have thought that the unicode
function would return a properly encoded byte stream that could then
simply be written to disk.

No, unicode() takes a byte stream and decodes it according to the given
encoding. You then get an internal representation of the string, a unicode
object. This representation typically resembles UCS-2 or UCS-4, which are
more suitable for internal manipulation than UTF-8. This object is a string,
by the way, so typical operations like concatenation are supported. However,
the internal representation is a sequence of Unicode code points, not a
guaranteed sequence of bytes, which is what you want in a file.

Joe said:
Instead it seems like you have to re-encode the byte stream to some
kind of escaped Ascii before it can be written back out.

As mentioned above, you have a string. For writing, that string needs to be
transformed to bytes again.
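
To see that what you have at that point is a string object rather than
bytes, a quick illustration (Python 2, sample bytes assumed):

line = 'caf\xe9\n'            # a latin1 byte string, as read from the file
u = unicode(line, 'latin1')   # decode to the internal representation
print type(u)                 # <type 'unicode'>
print repr(u)                 # u'caf\xe9\n' -- code points, not file bytes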


Note: You can also configure a file to read one encoding or write another.
You then get unicode objects from the input which you can feed to the
output. The important difference is that you only specify the encoding in
one place, and it will probably even be more performant. I'd have to search
to find you the corresponding library calls, but a starting point is
http://docs.python.org.

Good luck!

Uli
 
Joe Goldthwaite

Hi Steven,

I read through the article you referenced. I understand Unicode better now.
I wasn't completely ignorant of the subject. My confusion is more about how
Python is handling Unicode than Unicode itself. I guess I'm fighting my own
misconceptions. I do that a lot. It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion. In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object;

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation. The problem character \xe1 would have been
translated into a correct Unicode representation for the accented "a"
character.

Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from the
unicodestring object and simply write that byte string to a file. I thought
that at this point, I should have had a valid Unicode latin1 encoded file.
Instead I get an error that the character \xe1 is invalid.

The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever python uses for its internal Unicode
representation. Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.

Instead of just writing the unicodestring object, I had to do this;

output.write(unicodestring.encode('utf-8'))

This is doing what I thought the other steps were doing. It's translating
the internal unicodestring byte representation to utf-8 and writing it out.
It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.
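
For what it's worth, repr makes the byte-level view of each step visible; a
small sketch with assumed sample bytes:

line = 'Mart\xednez\n'             # raw latin1 bytes; \xed is i-acute
u = unicode(line, 'latin1')        # step 1: bytes -> code points
print repr(u)                      # u'Mart\xednez\n'
print repr(u.encode('utf-8'))      # 'Mart\xc3\xadnez\n' -- U+00ED becomes two bytes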
 
Joe Goldthwaite

Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
few characters above the 128 range that are causing Postgresql Unicode
errors. Those characters work fine in the Windows world but they're not the
correct byte representation for Unicode. What I'm attempting to do is
translate those upper range characters into the correct Unicode
representations so that they look the same in the Postgresql database as
they did in the CSV file.

I wrote up the source of my confusion to Steven so I won't duplicate it
here. Your comment on defining the encoding of the file directly instead
of using functions to encode and decode the data led me to the codecs
module. Using it, I can define the encoding at file open time and then just
read and write the lines. I ended up with this;

import codecs

input = codecs.open('ascii.csv', encoding='cp1252')
output = codecs.open('unicode.csv', mode='wb', encoding='utf-8')

output.writelines(input.readlines())

input.close()
output.close()

This is doing exactly the same thing but it's much clearer to me. Readlines
translates the input using the cp1252 codec and writelines encodes it to
utf-8 and writes it out. And as you mentioned, it's probably higher
performance. I haven't tested that but since both programs do the job in
seconds, performance isn't an issue.

Thanks again to everyone who posted. I really do appreciate it.
 
Ethan Furman

Joe said:
Hi Steven,

I read through the article you referenced. I understand Unicode better now.
I wasn't completely ignorant of the subject. My confusion is more about how
Python is handling Unicode than Unicode itself. I guess I'm fighting my own
misconceptions. I do that a lot. It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion. In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object;

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation. The problem character \xe1 would have been
translated into a correct Unicode representation for the accented "a"
character.

Correct. At this point you have a unicode string.

Joe said:
Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from the
unicodestring object and simply write that byte string to a file. I thought
that at this point, I should have had a valid Unicode latin1 encoded file.
Instead I get an error that the character \xe1 is invalid.

Here's the problem -- there is no byte string representing the unicode
string, they are completely different. There are dozens of different
possible encodings to go from unicode to a byte-string (of which UTF-8
is one such possibility).

Joe said:
The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever python uses for its internal Unicode
representation. Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.

Wrong. It so happens that some of the unicode points are the same as
some (but not all) of the ascii and upper-ascii values. When you
attempt to write a unicode string without specifying which encoding you
want, python falls back to ascii (not upper-ascii) so any character
outside the 0-127 range is going to raise an error.
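
That fallback can be seen directly; a small aside, not from Ethan's post:

import sys
print sys.getdefaultencoding()    # 'ascii' on a stock Python 2 install
try:
    u'caf\xe9'.encode()           # no codec given, so the default is used
except UnicodeEncodeError:
    print 'fell back to ascii and failed'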

Joe said:
Instead of just writing the unicodestring object, I had to do this;

output.write(unicodestring.encode('utf-8'))

This is doing what I thought the other steps were doing. It's translating
the internal unicodestring byte representation to utf-8 and writing it out.
It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.


Don't think of unicode as a byte stream. It's a bunch of numbers that
map to a bunch of symbols. The byte stream only comes into play when
you want to send unicode somewhere (file, socket, etc) and you then have
to encode the unicode into bytes.

Hope this helps!

~Ethan~
 
Carey Tilden

Joe said:
Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
few characters above the 128 range that are causing Postgresql Unicode
errors.  Those characters work fine in the Windows world but they're not the
correct byte representation for Unicode. What I'm attempting to do is
translate those upper range characters into the correct Unicode
representations so that they look the same in the Postgresql database as
they did in the CSV file.

Having bytes outside of the ASCII range means, by definition, that the
file is not ASCII encoded. ASCII only defines bytes 0-127. Bytes
outside of that range mean either the file is corrupt, or it's in a
different encoding. In this case, you've been able to determine the
correct encoding (latin-1) for those errant bytes, so the file itself
is thus known to be in that encoding.
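
A quick way to locate such errant bytes is to scan for values above 127;
this helper is illustrative, not from the thread:

def non_ascii_positions(raw):
    # Return (offset, byte) pairs for every byte outside the ASCII range.
    return [(i, b) for i, b in enumerate(raw) if ord(b) > 127]

print non_ascii_positions('Mart\xednez')   # [(4, '\xed')]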

Carey
 
Ethan Furman

Joe said:
Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
few characters above the 128 range . . .

It took me a while to get this point too (if you already have "gotten
it", I apologize, but the above comment leads me to believe you haven't).

*Every* file is an encoded file... even your UTF-8 file is encoded using
the UTF-8 format. Someone correct me if I'm wrong, but I believe
lower-ascii (0-127) matches up to the first 128 Unicode code points, so
while those first 128 code-points translate easily to ascii, ascii is
still an encoding, and if you have characters higher than 127, you don't
really have an ascii file -- you have (for example) a cp1252 file (which
also, not coincidentally, shares the first 128 characters/code points
with ascii).
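
The difference shows up in the 0x80-0x9F range, where latin1 and cp1252
genuinely disagree; a quick check (Python 2):

print repr('\x93'.decode('latin1'))   # u'\x93'   -- a C1 control character
print repr('\x93'.decode('cp1252'))   # u'\u201c' -- LEFT DOUBLE QUOTATION MARK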

Hopefully I'm not adding to the confusion. ;)

~Ethan~
 
John Nagle

Joe said:
This still seems odd to me. I would have thought that the unicode function
would return a properly encoded byte stream that could then simply be
written to disk. Instead it seems like you have to re-encode the byte stream
to some kind of escaped Ascii before it can be written back out.

Here's what's really going on.

Unicode strings within Python have to be indexable. So the internal
representation of Unicode has (usually) two bytes for each character,
so they work like arrays.

UTF-8 is a stream format for Unicode. It's slightly compressed;
each character occupies 1 to 4 bytes, and the base ASCII characters
(0..127 only, not 128..255) occupy one byte each. The format is
described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins. So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.

That's why it's necessary to convert to UTF-8 before writing
to a file or socket.
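
The variable width is easy to observe; a small demonstration, not from
John's post:

for ch in (u'A', u'\xe9', u'\u20ac', u'\U0001d11e'):
    print repr(ch), len(ch.encode('utf-8'))   # 1, 2, 3 and 4 bytes respectively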

John Nagle
 
MRAB

John said:
Here's what's really going on.

Unicode strings within Python have to be indexable. So the internal
representation of Unicode has (usually) two bytes for each character,
so they work like arrays.

UTF-8 is a stream format for Unicode. It's slightly compressed;
each character occupies 1 to 4 bytes, and the base ASCII characters
(0..127 only, not 128..255) occupy one byte each. The format is
described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins. So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.

Not entirely correct. The advantage of UTF-8 is that although different
codepoints might be encoded into different numbers of bytes, it's easy to
tell whether a particular byte is the first in its sequence, so you
don't have to parse from the start of the file. It is true, however, that
it can't be easily indexed.
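
MRAB's point can be checked byte by byte: in UTF-8 every continuation byte
has the bit pattern 10xxxxxx, so any single byte tells you whether it starts
a character. An illustrative sketch (Python 2):

data = u'caf\xe9'.encode('utf-8')   # 'caf\xc3\xa9'
for b in data:
    print hex(ord(b)), (ord(b) & 0xC0) == 0x80   # True only for continuation bytes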
 
Steven D'Aprano

Ethan said:
Don't think of unicode as a byte stream. It's a bunch of numbers that
map to a bunch of symbols.

Not only are Unicode strings a bunch of numbers ("code points", in
Unicode terminology), but the numbers are not necessarily all the same
width.

The full Unicode system allows for 1,114,112 characters, far more than
will fit in a two-byte code point. The Basic Multilingual Plane (BMP)
includes the first 2**16 (65536) of those characters, or code points
U+0000 through U+FFFF; there are a further 16 supplementary planes of
2**16 characters each, or code points U+10000 through U+10FFFF.

As I understand it (and I welcome corrections), some implementations of
Unicode only support the BMP and use a fixed-width implementation of 16-
bit characters for efficiency reasons. Supporting the entire range of
code points would require either a fixed-width of 21-bits (which would
then probably be padded to four bytes), or a more complex variable-width
implementation.

It looks to me like Python uses a 16-bit implementation internally, which
leads to some rather unintuitive results for code points in the
supplementary planes...
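Presumably the sort of unintuitive result Steven means, on a narrow
(16-bit) build:

s = u'\U0001d11e'    # MUSICAL SYMBOL G CLEF, outside the BMP
print len(s)         # 2 on a narrow build (a surrogate pair); 1 on a wide build
print repr(s[0])     # u'\ud834' on a narrow build -- half a character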
 
Mark Tolonen

Joe Goldthwaite said:
Hi Steven,

I read through the article you referenced. I understand Unicode better now.
I wasn't completely ignorant of the subject. My confusion is more about how
Python is handling Unicode than Unicode itself. I guess I'm fighting my own
misconceptions. I do that a lot. It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion. In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object;

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation.

Correct.

Joe said:
The problem character \xe1 would have been
translated into a correct Unicode representation for the accented "a"
character.

Which just so happens to be u'\xe1', which probably adds to your confusion
later :^) The first 256 Unicode code points map to latin1.
Joe said:
Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from the
unicodestring object and simply write that byte string to a file. I thought
that at this point, I should have had a valid Unicode latin1 encoded file.
Instead I get an error that the character \xe1 is invalid.

Incorrect. The unicodestring object doesn't save the original byte string,
so there is nothing to "request".
Joe said:
The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever python uses for its internal Unicode
representation. Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.

Both incorrect. As I mentioned earlier, the first 256 Unicode code points map
to latin1. It *was* translated to a Unicode code point whose value (but not
internal representation!) is the same as latin1.
Joe said:
Instead of just writing the unicodestring object, I had to do this;

output.write(unicodestring.encode('utf-8'))

This is exactly what you need to do...explicitly encode the Unicode string
into a byte string.
Joe said:
This is doing what I thought the other steps were doing. It's translating
the internal unicodestring byte representation to utf-8 and writing it out.
It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.

I'm surprised that by now no one has mentioned the codecs module. You
originally stated you are using Python 2.4.4, which I looked up and does
support the codecs module.

import codecs

infile = codecs.open('ascii.csv', 'r', 'latin1')
outfile = codecs.open('unicode.csv', 'w', 'utf-8')
for line in infile:
    outfile.write(line)
infile.close()
outfile.close()

As you can see, codecs.open takes a parameter for the encoding of the file.
Lines read are automatically decoded into Unicode; Unicode lines written are
automatically encoded into a byte stream.

-Mark
 
Nobody

Steven said:
It looks to me like Python uses a 16-bit implementation internally,

It typically uses the platform's wchar_t, which is 16-bit on Windows and
(typically) 32-bit on Unix.

IIRC, it's possible to build Python with 32-bit Unicode on Windows, but
that will be inefficient (because it has to convert to/from 16-bit
when calling Windows API functions) and will break any C modules which
pass the pointer to the internal buffer directly to API functions.
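
A quick way to tell which build you have (an aside, not from the post):

import sys
print sys.maxunicode   # 65535 on a narrow build, 1114111 on a wide build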
 
Lawrence D'Oliveiro

Joe said:
Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
few characters above the 128 range that are causing Postgresql Unicode
errors. Those characters work fine in the Windows world but they're not
the correct byte representation for Unicode.

In other words, the encoding you want to decode from in this case is
windows-1252.
 
Lawrence D'Oliveiro

John said:
UTF-8 is a stream format for Unicode. It's slightly compressed ...

“Variable-length” is not the same as “compressed”.

Particularly if you’re mainly using non-Roman scripts...
 
Lawrence D'Oliveiro

Joe said:
Next I tried to write the unicodestring object to a file thusly;

output.write(unicodestring)

I would have expected the write function to request the byte string from
the unicodestring object and simply write that byte string to a file.

Encoded according to which encoding?
 
John Machin

Carey said:
In this case, you've been able to determine the
correct encoding (latin-1) for those errant bytes, so the file itself
is thus known to be in that encoding.

The most probable "correct" encoding is, as already stated and agreed by
the OP, cp1252.
 
