encode/decode misunderstanding

T

Tim Arnold

Hi, I'm beginning to understand the encode/decode string methods, but I'd
like confirmation that I'm still thinking in the right direction:

I have a file of latin1 encoded text. Let's say I put one line of that file
into a string variable 'tocline', as follows:
tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'

import codecs
tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
tocline = tocline.decode('latin1','replace')
tocFile.write(tocline)
tocFile.close()

What I think is that tocFile is wrapped to insure that anything written to
it is in utf8
I decode the latin1 string into python's internal unicode encoding and that
gets written out as utf8.

Questions:
what exactly is the tocline when it's read in with that \xe9 and \xed in the
string? A latin1 encoded string?
Is my method the right way to write such a line out to a file with utf8
encoding?

If I read in the latin1 file using
codecs.open(filename,encoding='latin1') and write out the utf8 file by
opening with
codecs.open(othername,encoding='utf8'), would I no longer have a problem --
I could just read in latin1 and write out utf8 with no more worries about
encoding?

thanks,
--Tim
 
T

Tim Arnold

If I read in the latin1 file using
codecs.open(filename,encoding='latin1') and write out the utf8 file by
opening with
codecs.open(othername,encoding='utf8'), would I no longer have a
problem -- I could just read in latin1 and write out utf8 with no more
worries about encoding?

thanks,

Replying to my own post, I feel so lonely! I guess that silence means I *am*
thinking correctly about the encoding/decoding stuff; I'll keep heading in
this direction unless someone out there sees it differently.....

--Tim
 
D

Diez B. Roggisch

Tim said:
Hi, I'm beginning to understand the encode/decode string methods, but I'd
like confirmation that I'm still thinking in the right direction:

I have a file of latin1 encoded text. Let's say I put one line of that file
into a string variable 'tocline', as follows:
tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'

import codecs
tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
tocline = tocline.decode('latin1','replace')
tocFile.write(tocline)
tocFile.close()

What I think is that tocFile is wrapped to insure that anything written to
it is in utf8
I decode the latin1 string into python's internal unicode encoding and that
gets written out as utf8.

Questions:
what exactly is the tocline when it's read in with that \xe9 and \xed in the
string? A latin1 encoded string?

Yes. A simple, pure byte-string, that happens to contain bytes which
under the latin1-encoding are "correct".
Is my method the right way to write such a line out to a file with utf8
encoding?
Yes.

If I read in the latin1 file using
codecs.open(filename,encoding='latin1') and write out the utf8 file by
opening with
codecs.open(othername,encoding='utf8'), would I no longer have a problem --
I could just read in latin1 and write out utf8 with no more worries about
encoding?

As long as you don't mix bytestrings and only use unicode-objects, you
should be fine, yes.

Diez
 
T

Tim Arnold

Diez B. Roggisch said:
Yes. A simple, pure byte-string, that happens to contain bytes which under
the latin1-encoding are "correct".


As long as you don't mix bytestrings and only use unicode-objects, you
should be fine, yes.

Diez

wow, I was thinking correctly about encoding! time for a beer!
Diez, thanks very much for confirming my thoughts.

--Tim Arnold
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,239
Members
46,827
Latest member
DMUK_Beginner

Latest Threads

Top