unicode() vs. s.decode()

  • Thread starter Michael Ströder

Michael Ströder

Hi!

These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

Ciao, Michael.
 

Jason Tackaberry

These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:
s.decode('utf-8'):

  1           0 LOAD_FAST                0 (s)
              3 LOAD_ATTR                0 (decode)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            1
             12 RETURN_VALUE

unicode(s, 'utf-8'):

  1           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (s)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            2
             12 RETURN_VALUE
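
A sketch of how such dumps can be produced with the dis module (the small helper functions here are just for illustration):

import dis

def use_decode(s):
    return s.decode('utf-8')

def use_unicode(s):
    return unicode(s, 'utf-8')

# Dump the bytecode of each form so the two call paths can be compared.
dis.dis(use_decode)
dis.dis(use_unicode)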

The presence of LOAD_ATTR in the s.decode('utf-8') form hints that it is probably
going to be slower. Next, actually try it:
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.
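
A minimal sketch of that kind of timing run (the setup string is illustrative; absolute figures depend on the machine and Python build):

import timeit

# Time both spellings of the same UTF-8 decode on a short byte string.
setup = "s = 'abc' * 10"
print 'decode :', timeit.Timer("s.decode('utf-8')", setup).timeit()
print 'unicode:', timeit.Timer("unicode(s, 'utf-8')", setup).timeit()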

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because, as you pointed out, it's cleaner and more readable code.

Cheers,
Jason.
 

1x7y2z9

unicode() has LOAD_GLOBAL, which s.decode() does not. Is it generally
the case that LOAD_ATTR is slower than LOAD_GLOBAL, and is that what led to your
intuition that the decode form would probably be slower? Or was it some other
intuition?
Of course, the results from timeit are a different thing - I am asking about
the intuition drawn from the disassembler output.
Thanks.
 

John Machin

Jason Tackaberry said:
These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?
u = unicode(s,'utf-8')
u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each: [snip]
The presence of LOAD_ATTR in the s.decode('utf-8') form hints that it is probably
going to be slower. Next, actually try it:
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

HTH,
John
 

Jason Tackaberry

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Ok, fair point. I don't think the time difference fully registered when
I composed that message.

Testing a global access (LOAD_GLOBAL) versus an attribute access on a
global object (LOAD_GLOBAL + LOAD_ATTR) shows that the latter is about
40% slower than the former. So that certainly doesn't account for the
difference.
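
A sketch of the kind of micro-benchmark that comparison could be based on (the particular names timed here are illustrative; the 40% figure is machine-dependent):

import timeit

# A bare (builtin) name lookup compiles to LOAD_GLOBAL; adding an
# attribute access on that object adds a LOAD_ATTR on top of it.
t_global = timeit.Timer("str").timeit()
t_attr = timeit.Timer("str.join").timeit()
print 'LOAD_GLOBAL            :', t_global
print 'LOAD_GLOBAL + LOAD_ATTR:', t_attr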

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

Very pedagogical of you. :) Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.
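
Following up on suggestion (1), a quick sketch of how the different spellings could be compared (a sketch only; which spellings hit the C-level shortcut depends on the interpreter version):

import timeit

# Compare several spellings of the encoding name for both call forms.
setup = "s = 'abc' * 10"
for enc in ('utf-8', 'utf8', 'utf_8'):
    t_unicode = timeit.Timer("unicode(s, %r)" % enc, setup).timeit()
    t_decode = timeit.Timer("s.decode(%r)" % enc, setup).timeit()
    print '%-6s unicode(): %.3f  decode(): %.3f' % (enc, t_unicode, t_decode)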

Cheers,
Jason.
 

Thorsten Kampe

* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

"decode" was added in Python 2.2 for the sake of symmetry to encode().
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same. I don't think any measurable speed increase will be
noticeable between those two.
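
For illustration, the symmetry in question (a trivial Python 2 sketch):

b = 'abc'               # a byte string (str)
u = b.decode('utf-8')   # str -> unicode
b2 = u.encode('utf-8')  # unicode -> str, the mirror-image operation
assert u == unicode(b, 'utf-8') and b2 == b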

Thorsten
 

Michael Ströder

Thorsten said:
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)

"decode" was added in Python 2.2 for the sake of symmetry to encode().

Yes, and I like the style. But...
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same.

Did you try it?
I don't think any measurable speed increase will be noticeable between
those two.

Well, that seems not to be true. Try it yourself. I did (my console charset is UTF-8):

Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Comparing again the two best combinations:
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Ciao, Michael.
 

Thorsten Kampe

* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
Thorsten said:
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
I don't think any measurable speed increase will be noticeable
between those two.

Well, that seems not to be true. Try it yourself. I did (my console charset is UTF-8):

Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Comparing again the two best combinations:
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Thorsten
 

Steven D'Aprano

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?

setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
t1.timeit(number=1)
5.6751680374145508
t2.timeit(number=1)
2.6822888851165771


Seems like a pretty meaningful difference to me.
 

Michael Ströder

Thorsten said:
* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.
 

John Machin

Jason Tackaberry said:
Very pedagogical of you. :) Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.

So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:

Why does consulting the codec registry take so long,
and can this be improved?
 

Mark Lawrence

Michael said:
Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.
I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests. Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write
for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...
Yes?
 

Steven D'Aprano

I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests. Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write
for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...
Yes?

There are all sorts of potential use-cases. A day or two ago, somebody
posted a question involving tens of thousands of lines of tens of
thousands of characters each (don't quote me, I'm going by memory). On
the other hand, it doesn't require much imagination to think of a use-
case where there are millions of lines each of a dozen or so characters,
and you want to process it line by line:


noun: cat
noun: dog
verb: café
....
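
A minimal sketch of that line-by-line pattern (the file name here is purely illustrative, and either decode spelling would do):

# Decode each short line once as it is read (hypothetical file name).
for line in open('wordlist.txt'):
    u = unicode(line, 'utf-8')   # or: line.decode('utf-8')
    # ... work with the unicode object u ...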


As always, before optimizing, you should profile to be sure you are
actually optimizing and not wasting your time.
 

Thorsten Kampe

* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?
setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
t1.timeit(number=1)
5.6751680374145508
t2.timeit(number=1)
2.6822888851165771

Seems like a pretty meaningful difference to me.

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

Thorsten
 

Thorsten Kampe

* Michael Ströder (Fri, 07 Aug 2009 03:25:03 +0200)
Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Again: if you think decoding "äöüÄÖÜß" one million times is a real world
use case for your module then go for unicode(). Otherwise the time you
spent benchmarking artificial cases like this is just wasted time. In
real life people won't even notice whether an application takes one or
two minutes to complete.

Use whatever you prefer (decode() or unicode()). If you experience
performance bottlenecks when you're done, test whether changing decode()
to unicode() makes a difference. /That/ is relevant.

Thorsten
 

garabik-news-2005-05

Thorsten Kampe said:
* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
What if you're writing a loop which takes one million different lines of
text and decodes them once each?
setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
t1.timeit(number=1)
5.6751680374145508
t2.timeit(number=1)
2.6822888851165771

Seems like a pretty meaningful difference to me.

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

For a real-life example, I often have a file with one word per line, and
I run python scripts to apply some (sometimes fairly trivial)
transformation over it. REAL example: reading lines with word, lemma and
tag separated by tabs from stdin and writing the word to stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache, I hope
the comments are self-explanatory). A sketch of such a filter follows the
timings below.

no unicode
user 0m2.380s

decode('utf-8'), encode('utf-8')
user 0m3.560s

sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin)
user 0m6.180s

unicode(line, 'utf8'), encode('utf-8')
user 0m3.820s

unicode(line, 'utf-8'), encode('utf-8')
user 0m2.880s

python3.1
user 0m1.560s
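
For reference, a minimal sketch of such a filter, in its decode('utf-8')/encode('utf-8') variant (the column order and output details are assumptions, not the actual script):

import sys

# Read tab-separated "word<TAB>lemma<TAB>tag" lines from stdin and write the
# word column to stdout, skipping markup lines that start with '<'.
for line in sys.stdin:
    u = line.decode('utf-8').rstrip(u'\n')
    if u.startswith(u'<'):
        continue
    sys.stdout.write(u.split(u'\t')[0].encode('utf-8') + '\n')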

Since I have something like 18 million words in my current project (and
more than 600 million overall) and I often tweak some parameters and re-run the
transformations, the differences are pretty significant.

Personally, I have been surprised by:
1) the bad performance of the codecs wrapper (I expected it to be on par with
unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) the good performance of python3.1 (utf-8 locale)


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 

alex23

Thorsten Kampe said:
Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

But that's not what you first claimed:
I don't think any measurable speed increase will be
noticeable between those two.

But please, keep changing your argument so you don't have to admit you
were wrong.
 

Thorsten Kampe

* alex23 (Fri, 7 Aug 2009 06:53:22 -0700 (PDT))
But that's not what you first claimed:

I don't think any measurable speed increase will be
noticeable between those two.
But please, keep changing your argument so you don't have to admit you
were wrong.

Bollocks. Please note the word "noticeable" - "noticeable" as in
recognisable, as in reasonably experienceable, or as in whatever.

One guy claims he has times between 2.7 and 5.7 seconds when
benchmarking more or less randomly generated "one million different
lines". That *is* *exactly* nothing.

Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant". I think I don't have to comment on that.

If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will occur.
The same thing will happen if you just increase the corpus of words to a
million, trillion or whatever. The performance implications of that are
exactly none.

Thorsten
 

garabik-news-2005-05

Thorsten Kampe said:
Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant".

That was over a sample of 600,000 words (sorry for not being able to explain
myself clearly enough so that everyone understands). My current project is
18e6 words, so the overall running time will be 87 vs. 186 seconds, which is
fairly noticeable.

600 million is the size of the whole corpus, that translates to
48 minutes vs. 1h43min. That already is a huge difference (going to
lunch at noon or waiting another hour until the run is over - and
you can bet it is _very_ noticeable when I am hungry :)).

With 9 different versions of the corpus (that is, what we are really
using now) that goes to 7.2 hours (or even less with python3.1!) vs. 15
hours. Being able to re-run the whole corpus generation in one working
day (and then go on with the next issues) vs. working overtime or
delivering the corpus one day later is a huge difference. Like being
one day behind schedule.
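
For the record, a rough back-of-the-envelope check of that scale-up, assuming the roughly 2.9 s vs. 6.2 s per ~600,000-word sample quoted above:

# Scale the per-sample timings (seconds per ~600,000 words) up to the
# project and corpus sizes mentioned above.
fast, slow = 2.9, 6.2
per_word_fast, per_word_slow = fast / 6e5, slow / 6e5
project, corpus = 18e6, 600e6   # words

print 'project: %.0f s vs. %.0f s' % (project * per_word_fast, project * per_word_slow)
print 'corpus : %.0f min vs. %.0f min' % (corpus * per_word_fast / 60, corpus * per_word_slow / 60)
print '9 runs : %.1f h vs. %.1f h' % (9 * corpus * per_word_fast / 3600, 9 * corpus * per_word_slow / 3600)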
I think I don't have to comment on that.

Indeed, the numbers are self-explanatory.
If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will occur.
The same thing will happen if you just increase the corpus of words to a
million, trillion or whatever. The performance implications of that are
exactly none.

I am not sure I understood that. Must be my English :)

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
