eval and unicode

Laszlo Nagy

How can I specify encoding for the built-in eval function? Here is the
documentation:

http://docs.python.org/lib/built-in-funcs.html

It says that the "expression" parameter is a string, but it says nothing
about the encoding. The same is true for execfile, eval and compile.

The basic problem:

- expressions need to be evaluated by a program
- expressions are managed through a web-based interface. The browser
supports UTF-8, and the database also supports UTF-8. The user needs to be
able to enter string expressions in different languages and store them
in the database
- expressions are for filtering emails, and the emails can contain any
character in any encoding

I tried to use eval with/without unicode strings and it worked. Example:

 >>> eval( u'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' ) == eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
True

The above test was made on Ubuntu Linux with gnome-terminal.
gnome-terminal does support unicode. What would happen under Windows?

I'm also confused about how this relates to PEP 0263. I always get a warning
when I try to enter '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' in a source
file without "# -*- coding: " specified. Why is it not the same for
eval? Why does it not raise an exception (or why does the encoding not
need to be specified)?

Thanks,

Laszlo
 
Jonathan Gardner

How can I specify encoding for the built-in eval function? Here is the
documentation:

http://docs.python.org/lib/built-in-funcs.html

It says that the "expression" parameter is a string, but it says nothing
about the encoding. The same is true for execfile, eval and compile.

The basic problem:

- expressions need to be evaluated by a program
- expressions are managed through a web-based interface. The browser
supports UTF-8, and the database also supports UTF-8. The user needs to be
able to enter string expressions in different languages and store them
in the database
- expressions are for filtering emails, and the emails can contain any
character in any encoding

I tried to use eval with/without unicode strings and it worked. Example:

 >>> eval( u'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' ) == eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
True

The above test was made on Ubuntu Linux with gnome-terminal.
gnome-terminal does support unicode. What would happen under Windows?

I'm also confused about how this relates to PEP 0263. I always get a warning
when I try to enter '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' in a source
file without "# -*- coding: " specified. Why is it not the same for
eval? Why does it not raise an exception (or why does the encoding not
need to be specified)?

Encoding information is only useful when you are converting between
bytes and unicode data. If you already have unicode data, you don't
need to do any more work to get unicode data.

Since a file can be in any encoding, it isn't apparent how to decode
the bytes seen in that file and turn them into unicode data. That's
why you need the # -*- coding magic to tell the python interpreter
that the bytes it will see in the file are encoded in a specific way.
Until we have a universal way to accurately find the encoding of every
file in an OS, we will need that magic. Who knows? Maybe one day there
will be a common file attribute system and one of the universal
attributes will be the encoding of the file. But for now, we are stuck
with ancient Unix and DOS conventions.

When you feed your unicode data into eval(), it doesn't have any
encoding or decoding work to do.
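The distinction can be made concrete in modern Python 3 terms (a minimal sketch added for illustration, where Python 3's str plays the role of Python 2's unicode):

```python
# Minimal Python 3 sketch of the point above: text needs no further
# decoding, while bytes must be decoded first - and the codec matters.
expr_text = '"\u0170"'                 # already text; the literal contains Ű
assert eval(expr_text) == '\u0170'

raw = b'"\xdb"'                        # bytes; their meaning depends on the encoding
assert eval(raw.decode('latin-1')) == '\xdb'      # 0xDB in latin-1 is Û
assert eval(raw.decode('iso8859-2')) == '\u0170'  # 0xDB in latin-2 is Ű
```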
 
Laszlo Nagy

When you feed your unicode data into eval(), it doesn't have any
encoding or decoding work to do.

Yes, but what about

eval( 'u' + '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )

The passed expression is not unicode. It is a "normal" string: a
sequence of bytes. It will be evaluated by eval, and eval should know
how to decode the byte sequence - the same way the interpreter needs to
know the encoding of the file when it sees the u"徹底したコスト削減
ÁÍŰŐÜÖÚÓÉ трирова" byte sequence in a python source file. Before
creating the unicode instance, it needs to be decoded (or not, depending
on the encoding of the source).

String passed to eval IS python source, and it SHOULD have an encoding
specified (well, unless it is already a unicode string, in which case
this magic is not needed).

Consider this:

exec("""
import codecs
s = u'Ű'
codecs.open("test.txt","w+",encoding="UTF8").write(s)
""")

Facts:

- source passed to exec is a normal string, not unicode
- the variable "s", created inside the exec() call, will be a unicode
string. However, it may be Û or something else, depending on the
source encoding. E.g. with ASCII encoding it is invalid, and exec() should
raise a SyntaxError like:

SyntaxError: Non-ASCII character '\xc5' in file c:\temp\aaa\test.py on
line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
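In Python 3, where the bytes/text split is explicit, this behaviour can be checked directly: exec given *bytes* does apply PEP 263 (a sketch assuming a Python 3 interpreter, not part of the original thread):

```python
# Python 3 sketch: exec() given bytes applies PEP 263, so a coding
# declaration changes the result, and undeclared non-UTF-8 bytes fail.
ns = {}
exec(b'# -*- coding: iso8859-2 -*-\ns = "\xdb"', ns)
assert ns['s'] == '\u0170'             # 0xDB decoded as latin-2: Ű

try:
    exec(b's = "\xdb"', {})            # no declaration: UTF-8 assumed, 0xDB invalid
    raised = False
except SyntaxError:
    raised = True
assert raised
```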

Well at least this is what I think. If I'm not right then please explain
why.

Thanks

Laszlo
 
Jonathan Gardner

Yes, but what about

eval( 'u' + '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )

Let's take it apart, bit by bit:

'u' - A byte string with one byte, which is 117

'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' - A byte string starting with " (34),
but then continuing in an unspecified byte sequence. I don't know what
encoding your terminal/file/whatnot is written in. Assuming it is in
UTF-8 and not UTF-16, then it would be the UTF-8 representation of the
unicode code points that follow.

Before passing it to eval, you are concatenating them. So now
you have a byte string that starts with u, then ", then bytes
beyond 127.

Now, when you are calling eval, you are passing in that byte string.
This byte string, it is important to emphasize, is not text. It is
text encoded in some format. Here is what my interpreter does (in a
UTF-8 console):
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e\u0432\u0430'

The first item in the sequence is \u5fb9 -- a unicode code point. It
is NOT a byte.
'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b \xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb0'

The first item in the sequence is \xe5. This IS a byte. This is NOT a
unicode point. It doesn't represent anything except what you want it
to represent.
u'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b \xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb0'

The first item in the sequence is \xe5. This is NOT a byte. This is a
unicode code point -- LATIN SMALL LETTER A WITH RING ABOVE.
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e\u0432\u0430'

The first item in the sequence is \u5fb9, which is a unicode point.

In the Python program file proper, if you have your encoding setup
properly, the expression

u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"

is a perfectly valid expression. What happens is the Python
interpreter reads in that string of bytes between the quotes,
interprets them to unicode based on the encoding you already
specified, and creates a unicode object to represent that.

eval doesn't muck with encodings.

I'll try to address your points below in the context of what I just
wrote.
The passed expression is not unicode. It is a "normal" string. A
sequence of bytes.
Yes.

It will be evaluated by eval, and eval should know
how to decode the byte sequence.

You think eval is smarter than it is.
Same way as the interpreter need to
know the encoding of the file when it sees the u"徹底したコスト削減
ÁÍŰŐÜÖÚÓÉ трирова" byte sequence in a python source file - before
creating the unicode instance, it needs to be decoded (or not, depending
on the encoding of the source).

Precisely. And it is. Before it is passed to eval/exec/whatever.
String passed to eval IS python source, and it SHOULD have an encoding
specified (well, unless it is already a unicode string, in that case
this magic is not needed).

If it had an encoding specified, YOU should have decoded it and passed
in the unicode string.
Consider this:

exec("""
import codecs
s = u'Ű'
codecs.open("test.txt","w+",encoding="UTF8").write(s)
""")

Facts:

- source passed to exec is a normal string, not unicode
- the variable "s", created inside the exec() call, will be a unicode
string. However, it may be Û or something else, depending on the
source encoding. E.g. with ASCII encoding it is invalid, and exec() should
raise a SyntaxError like:

SyntaxError: Non-ASCII character '\xc5' in file c:\temp\aaa\test.py on
line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

Well at least this is what I think. If I'm not right then please explain
why.

If you want to know what happens, you have to try it. Here's what
happens (again, in my UTF-8 terminal):

 >>> exec("""
 ... import codecs
 ... s = u'Ű'
 ... codecs.open("test.txt","w+",encoding="UTF8").write(s)
 ... """)
 >>> print s
Å°

Note that s is a unicode string with 2 unicode code points. Note that
the file has 4 bytes--since it is that 2-code sequence encoded in
UTF-8, and both codes are not ASCII.

Your problem is, I think, that you think the magic of decoding source
code from the byte sequence into unicode happens in exec or eval. It
doesn't. It happens in between reading the file and passing the
contents of the file to exec or eval.
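The flow described here can be sketched in Python 3 (the file name and the latin-2 choice below are illustrative, not from the thread):

```python
# Sketch of the point above: the decode step sits between reading the
# bytes and handing source to exec - it does not happen inside exec.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'snippet.py')
with open(path, 'wb') as f:
    f.write('s = "\u0170"\n'.encode('iso8859-2'))   # latin-2 bytes on disk

with open(path, 'rb') as f:
    source = f.read().decode('iso8859-2')           # decode first...
ns = {}
exec(source, ns)                                    # ...then exec sees text
assert ns['s'] == '\u0170'
```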
 
Laszlo Nagy

Hi Jonathan,


I think I made it too complicated and I did not concentrate on the
question. I could write answers to your post, but I'm going to explain
it formally:
 >>> expr = 'u"' + '\xdb' + '"'
 >>> expr
'u"\xdb"' # expr is not a unicode string - it is a binary string and it has no encoding assigned
 >>> eval(expr)
u'\xdb' # What? Why was it decoded as 'latin1'? Why not 'latin2'? Why not 'ascii'?
 >>> eval( "# -*- coding: latin2 -*-\n" + expr)
u'\u0170' # You can specify the encoding for eval, that is cool.

I hope it is clear now. Inside eval, a unicode object was created from
a binary string. I just discovered that PEP 0263 can be used to specify
the source encoding for eval. But still there is a problem: eval should not
assume that the expression is in any particular encoding. When it sees
something like '\xdb' then it should raise a SyntaxError - the same error
that you get when running a .py file containing the same expression:

 >>> file('test.py','wb+').write(expr + "\n")
 >>> ^D
gandalf@saturnus:~$ python test.py
  File "test.py", line 1
SyntaxError: Non-ASCII character '\xdb' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Otherwise the interpretation of the expression will be ambiguous. Is
there any good reason why eval assumed a particular encoding in the
above example?

Sorry for my misunderstanding - my English is not perfect. I hope it is
clear now.

My problem is solved anyway. Anytime I need to eval an expression, I'm
going to specify the encoding manually with # -*- coding: XXX -*-. It is
good to know that it works for eval and its counterparts. And it is
unambiguous. :)
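In Python 3 the same workaround still applies whenever bytes really must be evaluated: prepend a PEP 263 coding line (a sketch; the latin-2 choice is illustrative):

```python
# Sketch of the workaround described above: a PEP 263 coding line makes
# the decoding of a byte-string expression explicit and unambiguous.
expr_bytes = b'"\xdb"'
value = eval(b'# -*- coding: iso8859-2 -*-\n' + expr_bytes)
assert value == '\u0170'               # decoded as latin-2: Ű, not Û
```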

Best,

Laszlo
 
Laszlo Nagy

Your problem is, I think, that you think the magic of decoding source
code from the byte sequence into unicode happens in exec or eval. It
doesn't. It happens in between reading the file and passing the
contents of the file to exec or eval.
I think you are wrong here. Decoding source happens inside eval. Here is
the proof:

s = 'u"' + '\xdb' + '"'
print eval(s) == eval( "# -*- coding: iso8859-2\n" + s) # prints False,
indicating that the decoding of the string expression happened inside eval!

It can also be proven that eval does not use the 'ascii' codec for
default decoding:

'\xdb'.decode('ascii') # This will raise a UnicodeDecodeError

eval() somehow decoded the passed expression. No question. It did not
use 'ascii', nor 'latin2', but something else. Why is that? Why is there
a particular encoding hard-coded into eval? Which encoding is it? (I
could not decide which one, since '\xdb' is the same in latin1,
latin3, latin4 and probably many others.)

I suspected that eval was going to use the same encoding that the python
source file/console had at the point of execution, but this is not true:
the following program prints u'\xdb' instead of u'\u0170':

<snip>
# -*- coding iso8859-2 -*-

s = '\xdb'
expr = 'u"' + s +'"'
print repr(eval(expr))
</snip>

Regards,

Laszlo
 
Jonathan Gardner

 >>> eval( "# -*- coding: latin2 -*-\n" + expr)
u'\u0170' # You can specify the encoding for eval, that is cool.

I didn't think of that. That's pretty cool.
I hope it is clear now. Inside eval, a unicode object was created from
a binary string. I just discovered that PEP 0263 can be used to specify
the source encoding for eval. But still there is a problem: eval should not
assume that the expression is in any particular encoding. When it sees
something like '\xdb' then it should raise a SyntaxError - the same error
that you get when running a .py file containing the same expression:

 >>> file('test.py','wb+').write(expr + "\n")
 >>> ^D
gandalf@saturnus:~$ python test.py
  File "test.py", line 1
SyntaxError: Non-ASCII character '\xdb' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Otherwise the interpretation of the expression will be ambiguous. Is
there any good reason why eval assumed a particular encoding in the
above example?

I'm not sure, but being in a terminal session means a lot can be
inferred about what encoding a stream of bytes is in. I don't know off
the top of my head where this would be stored or how Python tries to
figure it out.
My problem is solved anyway. Anytime I need to eval an expression, I'm
going to specify the encoding manually with # -*- coding: XXX -*-. It is
good to know that it works for eval and its counterparts. And it is
unambiguous.  :)

I would personally adopt the Py3k convention and work with text as
unicode and bytes as byte strings. That is, you should pass in a
unicode string every time to eval, and never a byte string.
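That convention can be sketched as follows (Python 3; the helper name eval_expr is made up for illustration): decode bytes exactly once, at the boundary, and hand eval only text.

```python
# Decode bytes exactly once, at the boundary, then eval text only.
def eval_expr(raw, encoding='utf-8'):
    """Decode an expression received as bytes, then evaluate the text."""
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    return eval(raw)

assert eval_expr('"\u0170"'.encode('utf-8')) == '\u0170'
assert eval_expr('"\u0170"'.encode('iso8859-2'), 'iso8859-2') == '\u0170'
```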
 
Martin v. Löwis

Well at least this is what I think. If I'm not right then please explain

I think your confusion comes from the use of the interactive mode.

PEP 263 doesn't really apply to the interactive mode, hence the
behavior in interactive mode is undefined, and may and will change
across Python versions.

Ideally, interactive mode should assume the terminal's encoding
for source code, but that has not been implemented.

Regards,
Martin
 
Martin v. Löwis

eval() somehow decoded the passed expression. No question. It did not
use 'ascii', nor 'latin2', but something else. Why is that? Why is there
a particular encoding hard-coded into eval? Which encoding is it? (I
could not decide which one, since '\xdb' is the same in latin1,
latin3, latin4 and probably many others.)

I think in all your examples, you pass a Unicode string to eval, not
a byte string. In that case, it will encode the string as UTF-8, and
then parse the resulting byte string.

Regards,
Martin
 
Laszlo Nagy

Martin said:
I think in all your examples, you pass a Unicode string to eval, not
a byte string. In that case, it will encode the string as UTF-8, and
then parse the resulting byte string.
You are definitely wrong:

 >>> s = 'u"' + '\xdb' + '"'
 >>> type(s)
<type 'str'>
 >>> eval(s)
u'\xdb'
 >>> s2 = '# -*- coding: latin2 -*-\n' + s
 >>> type(s2)
<type 'str'>
 >>> eval(s2)
u'\u0170'


Would you please read the original messages before sending answers? :-D


L
 
Laszlo Nagy

I think your confusion comes from the use of the interactive mode.
It does not. The examples provided in the original post also work when
you put them into a python source file.
PEP 263 doesn't really apply to the interactive mode, hence the
behavior in interactive mode is undefined, and may and will change
across Python versions.
The expression passed to eval() cannot be considered an interactive session.
Ideally, interactive mode should assume the terminal's encoding for source code, but that has not been implemented.
Again, please read the original messages - many of my examples also work
when you put them into a python source file. They have nothing to do
with terminals.

Laszlo
 
