Joe Goldthwaite
Thanks to all of you who responded. I guess I was working from the wrong
premise. I was thinking that a file could write any kind of data and that
once I had my Unicode string, I could just write it out with a standard
file.write() operation.
What was actually happening is that the file.write() operation raised the
error until I re-encoded the string as utf-8. This is what worked:
import unicodedata
input = file('ascii.csv', 'rb')
output = file('unicode.csv','wb')
for line in input.xreadlines():
    unicodestring = unicode(line, 'latin1')
    # This second encode is what I was missing.
    output.write(unicodestring.encode('utf-8'))
input.close()
output.close()
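For readers on a newer interpreter, the same conversion can be sketched with io.open, which takes an encoding argument and does the decode and encode steps itself. This is only a minimal sketch, not the code from the post: the function name convert_latin1_to_utf8 is made up for illustration, and it assumes the input really is Latin-1.

```python
import io


def convert_latin1_to_utf8(src_path, dst_path):
    """Copy src_path to dst_path, re-encoding Latin-1 text as UTF-8.

    Hypothetical helper for illustration; assumes the input is Latin-1.
    """
    # newline='' leaves line endings untouched, like the binary-mode original.
    with io.open(src_path, 'r', encoding='latin-1', newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:      # each line arrives decoded to Unicode text
                dst.write(line)   # and is encoded to UTF-8 on the way out
```

With the file names from the post, the call would be convert_latin1_to_utf8('ascii.csv', 'unicode.csv'). The with blocks also close both files even if a line fails to decode or encode.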
A number of you pointed out what I was doing wrong, but I couldn't understand
it until I realized that the write operation didn't work until it was given
a properly encoded Unicode string. I thought I was getting the error on the
initial Latin-1 to Unicode conversion, not in the write operation.
This still seems odd to me. I would have thought that the unicode function
would return a properly encoded byte stream that could then simply be
written to disk. Instead, it seems like you have to re-encode the byte stream
to some kind of escaped ASCII before it can be written back out.
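One way to see why the extra step is needed: decoding produces abstract Unicode text (a sequence of code points), not bytes, and a file can only store bytes. A small sketch in Python 3 syntax, with sample bytes made up for illustration, shows the two directions and the ASCII failure that Python 2's file.write() runs into when it is handed a Unicode string and falls back to the default ASCII codec.

```python
raw = b'caf\xe9'                 # Latin-1 bytes as read from disk (made-up sample)

text = raw.decode('latin-1')     # decode: bytes -> Unicode text (code points)
utf8 = text.encode('utf-8')      # encode: Unicode text -> bytes for the file

# Encoding with ASCII fails for any non-ASCII character; this mirrors the
# implicit ASCII encode behind the original error.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(len(raw), len(text), len(utf8), ascii_failed)
```

The accented character takes one byte in Latin-1 and two in UTF-8, so the encoded result is ordinary UTF-8 bytes rather than escaped ASCII.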
Thanks to all of you who took the time to respond. I really do appreciate
it. I think with my mental block, I couldn't have figured it out without
your help.