encoding problem

digisatori

The code snippet below generates a UnicodeDecodeError.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = 'äöü'
u = unicode(s)


It seems that the system uses the default encoding (ASCII) to decode the
UTF-8 encoded string literal, and thus raises the error.

The question is why the Python interpreter uses the default encoding
instead of "utf-8", which I explicitly declared in the source.
 
Bruno Desthuilliers

(e-mail address removed) wrote:
The code snippet below generates a UnicodeDecodeError.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = 'äöü'
u = unicode(s)


It seems that the system uses the default encoding (ASCII) to decode the
UTF-8 encoded string literal, and thus raises the error.

Indeed. You want:

u = unicode(s, 'utf-8') # or : u = s.decode('utf-8')
The question is why the Python interpreter uses the default encoding
instead of "utf-8", which I explicitly declared in the source.

Because there's no reliable way for the interpreter to guess how whatever
is passed to unicode() has been encoded?

s = s.decode("utf-8").encode("latin1")
# should unicode() try to use utf-8 here?
try:
    u = unicode(s)
except UnicodeDecodeError:
    print "would have worked better with: u = unicode(s, 'latin1')"


NB: IIRC, the ASCII subset is safe whatever the encoding, so I'd say
it's a sensible default...
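
For what it's worth, here's a quick, untested sketch of what that default
means in practice (assuming the snippet file itself is saved as UTF-8):

# -*- coding: utf-8 -*-
import sys

print sys.getdefaultencoding()   # usually 'ascii' on Python 2
u1 = unicode('abc')              # fine: pure ASCII bytes decode the same under the usual encodings
u2 = unicode('äöü', 'utf-8')     # non-ASCII bytes need an explicit encoding,
                                 # or you get the UnicodeDecodeError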
 
Marc 'BlackJack' Rintsch

The code snippet below generates a UnicodeDecodeError.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = 'äöü'
u = unicode(s)


It seems that the system uses the default encoding (ASCII) to decode the
UTF-8 encoded string literal, and thus raises the error.

The question is why the Python interpreter uses the default encoding
instead of "utf-8", which I explicitly declared in the source.

Because the declaration is only for decoding unicode literals in that
very source file.
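
A minimal, untested sketch of that distinction (assuming the file is saved
as UTF-8):

# -*- coding: utf-8 -*-
u1 = u'äöü'        # unicode literal: the coding declaration tells the tokenizer how to decode it
s = 'äöü'          # byte string literal: just the raw UTF-8 bytes
u2 = unicode(s)    # uses sys.getdefaultencoding() (ASCII), not the declared
                   # encoding -> UnicodeDecodeError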

Ciao,
Marc 'BlackJack' Rintsch
 
Joe Strout

Marc said:
Because the declaration is only for decoding unicode literals in that
very source file.

And because strings in Python, unlike in (say) REALbasic, do not know
their encoding -- they're just a string of bytes. If they were a string
of bytes PLUS an encoding, then every string would know what it is, and
things like conversion to another encoding, or concatenation of two
strings that may differ in encoding, could be handled automatically.

I consider this one of the great shortcomings of Python, but it's mostly
just a temporary inconvenience -- the world is moving to Unicode, and
with Python 3, we won't have to worry about it so much.

Best,
- Joe
 
digisatori

Because the declaration is only for decoding unicode literals in that
very source file.

Ciao,
        Marc 'BlackJack' Rintsch

Thanks for the answer.
I believe the declaration is not only for unicode literals; it applies to
all literals in the source, even comments. We can try running a source
file without an encoding declaration that contains only one line of
comments with non-ASCII characters. That will raise a SyntaxError and
point me to the PEP 263 URL.
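
For example, something like this (a made-up file name; the exact error
text may vary by version):

# demo.py -- note: no coding declaration anywhere
# a comment containing a non-ASCII character: ü
print "hello"      # never reached: the tokenizer raises a SyntaxError
                   # whose message points to the PEP 263 URL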

I read PEP 263 and quote it below:

Python's tokenizer/compiler combo will need to be updated to work as
follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data
   and creating string objects from the Unicode literal data
   by first reencoding the UTF-8 data into 8-bit string data
   using the given file encoding

The internal process described above indicates that step 2 uses the
declared encoding to decode all literals in the source, while step 5
involves re-encoding the literal data with that same encoding.

That is why we have to explicitly declare an encoding as soon as we have
non-ASCII characters in the source.
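
Both halves of that round trip can be seen directly with repr() (untested
sketch, file saved as UTF-8):

# -*- coding: utf-8 -*-
u = u'äöü'
print repr(u)   # u'\xe4\xf6\xfc' -- decoded once in step 2 and kept as Unicode
s = 'äöü'
print repr(s)   # '\xc3\xa4\xc3\xb6\xc3\xbc' -- re-encoded back to UTF-8 bytes in step 5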

Bruno gave a perfect explanation of why we need to specify an encoding
when decoding a byte string. Thank you very much.
 
Marc 'BlackJack' Rintsch

And because strings in Python, unlike in (say) REALbasic, do not know
their encoding -- they're just a string of bytes. If they were a string
of bytes PLUS an encoding, then every string would know what it is, and
things like conversion to another encoding, or concatenation of two
strings that may differ in encoding, could be handled automatically.

I consider this one of the great shortcomings of Python, but it's mostly
just a temporary inconvenience -- the world is moving to Unicode, and
with Python 3, we won't have to worry about it so much.

I don't see the shortcoming in Python <3.0. If you want real strings
with characters instead of just a bunch of bytes, simply use `unicode`
objects instead of `str`.
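
A small sketch of the difference (file saved as UTF-8):

# -*- coding: utf-8 -*-
s = 'äöü'              # `str`: six UTF-8 bytes
u = u'äöü'             # `unicode`: three characters
print len(s), len(u)   # 6 3
print repr(u[0])       # u'\xe4' -- indexing gives characters, not bytes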

And does REALbasic really use byte strings plus an encoding!? Sounds
strange. When concatenating, which encoding "wins"?

Ciao,
Marc 'BlackJack' Rintsch
 
Joe Strout

Marc said:
I don't see the shortcoming in Python <3.0. If you want real strings
with characters instead of just a bunch of bytes simply use `unicode`
objects instead of `str`.

Fair enough -- that certainly is the best policy. But when working with
any other encoding (sometimes necessary when interfacing with other
software), it's still a bit of a PITA.
And does REALbasic really use byte strings plus an encoding!?

You betcha! Works like a dream.
Sounds strange. When concatenating which encoding "wins"?

The one that is a superset of the other, or if neither is, then both are
converted to UTF-8 (which is the "standard" encoding in RB, though it
works comfily with any other too).

Cheers,
- Joe
 
Marc 'BlackJack' Rintsch

Fair enough -- that certainly is the best policy. But when working with
any other encoding (sometimes necessary when interfacing with other
software), it's still a bit of a PITA.

But it has to be. There is no automagic guessing possible.
You betcha! Works like a dream.

IMHO a strange design decision. A lot more hassle compared to an opaque
unicode string type which uses some internal encoding that makes
operations like getting a character at a given index easy or
concatenating without the need to reencode.

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

But it has to be.  There is no automagic guessing possible.



IMHO a strange design decision.  A lot more hassle compared to an opaque
unicode string type which uses some internal encoding that makes
operations like getting a character at a given index easy or
concatenating without the need to reencode.

In general I quite agree with you ... however with Unicode "getting a
character at a given index" is fine unless and until you stray (or are
dragged!) outside the BMP and you have only a 16-bit Unicode
implementation.
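
For instance, on a narrow (16-bit) build of Python 2 (untested sketch):

u = u'\U0001D11E'    # MUSICAL SYMBOL G CLEF, outside the BMP
print len(u)         # 2 on a narrow build (a surrogate pair), 1 on a wide (UCS-4) build
print repr(u[0])     # narrow build: u'\ud834', the high surrogate -- not a real character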
 
Joe Strout

Marc said:
But it has to be. There is no automagic guessing possible.

Automagic guessing isn't necessary if strings keep track of what encoding
their data is. And why shouldn't they? We're a long way from the day
when a "string" was nothing more than an array of bytes. Adding a teeny
bit of metadata makes life much easier.
IMHO a strange design decision.

I get that you don't grok it, but I think that's because you haven't
worked with it. RB added encoding data to its strings years ago, and
changed the default string encoding to UTF-8 at about the same time, and
life has been delightful since then. The only time you ever have to
think about it is when you're importing a string from some unknown
source (e.g. a socket), at which point you need to tell RB what encoding
it is. From that point on, you can pass that string around, extract
substrings, split it into words, concatenate it with other strings,
etc., and it all Just Works (tm).

In comparison, Python requires a lot more thought on the part of the
programmer to keep track of what's what (unless, as you point out, you
convert everything into unicode strings as soon as you get them, but
that can be a very expensive operation to do on, say, a 500MB UTF-8 text
file).
A lot more hassle compared to an opaque
unicode string type which uses some internal encoding that makes
operations like getting a character at a given index easy or
concatenating without the need to reencode.

No. RB supports UCS-2 encoding, too, and is smart enough to take
advantage of the fixed character width of any encoding when that's what
a string happens to be. And no reencoding is used when it's not
necessary (e.g., concatenating two strings of the same encoding, or
adding an ASCII string to a string using any ASCII superset, such as
UTF-8). There's nothing stopping you from converting all your strings
to UCS-2 when you get them, if that's your preference.

But saying that having only one string type that knows it's Unicode, and
another string type that hasn't the foggiest clue how to interpret its
data as text, is somehow easier than every string knowing what it is and
doing the right thing -- well, that's just silly.

Best,
- Joe
 
Marc 'BlackJack' Rintsch

I get that you don't grok it, but I think that's because you haven't
worked with it. RB added encoding data to its strings years ago, and
changed the default string encoding to UTF-8 at about the same time, and
life has been delightful since then. The only time you ever have to
think about it is when you're importing a string from some unknown
source (e.g. a socket), at which point you need to tell RB what encoding
it is. From that point on, you can pass that string around, extract
substrings, split it into words, concatenate it with other strings,
etc., and it all Just Works (tm).

Except that you don't know for sure what the output encoding will be, as
it depends on the operations on the strings in the program flow. So to
be sure you have to en- or recode at output too. And then it is the same
as in Python -- decode when bytes enter the program and encode when
(unicode) strings leave the program.
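
A sketch of that pattern (file names made up):

import codecs

f = codecs.open('input.txt', 'r', 'utf-8')     # decode at the boundary: bytes in -> unicode
text = f.read()                                # a `unicode` object from here on
f.close()

# ... all processing happens on unicode objects ...

f = codecs.open('output.txt', 'w', 'latin-1')  # encode at the boundary: unicode -> bytes out
f.write(text)                                  # fails loudly if `text` doesn't fit in latin-1
f.close()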
In comparison, Python requires a lot more thought on the part of the
programmer to keep track of what's what (unless, as you point out, you
convert everything into unicode strings as soon as you get them, but
that can be a very expensive operation to do on, say, a 500MB UTF-8 text
file).

So it doesn't require more thought. Unless you complicate it yourself,
but that is language independent.

I would not do operations on 500 MiB text in any language if there is any
way to break that down into smaller chunks. Slurping in large files
doesn't scale very well. On my Eee-PC even a 500 MiB byte `str` is (too)
expensive.
But saying that having only one string type that knows it's Unicode, and
another string type that hasn't the foggiest clue how to interpret its
data as text, is somehow easier than every string knowing what it is and
doing the right thing -- well, that's just silly.

Sorry, I meant the implementation not the POV of the programmer, which
seems to be quite the same.

Ciao,
Marc 'BlackJack' Rintsch
 
Martin v. Löwis

That is why we have to explicitly declare an encoding as soon as we have
non-ASCII characters in the source.

True - but it doesn't have to be the "correct" encoding. If you
declared your source as latin-1, the effect is the same on byte string
literals, but not on Unicode literals.

In that sense, the encoding declaration only "matters" for Unicode
literals (of course, it also matters for source editors, and in a few
other places).
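
A small sketch of what that means, for a file that is actually saved as
UTF-8 but declared latin-1 (untested):

# -*- coding: latin-1 -*-
# (the bytes on disk are really UTF-8)
s = 'äöü'     # byte string literal: ends up holding the original UTF-8 bytes either way
u = u'äöü'    # unicode literal: wrongly decoded as latin-1 -> u'\xc3\xa4\xc3\xb6\xc3\xbc'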

Regards,
Martin
 
