unicode() vs. s.decode()

  • Thread starter Michael Ströder

Michael Ströder

Hi!

These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

Ciao, Michael.
 

Jason Tackaberry

These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:
s.decode('utf-8'):

  1           0 LOAD_FAST                0 (s)
              3 LOAD_ATTR                0 (decode)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            1
             12 RETURN_VALUE

unicode(s, 'utf-8'):

  1           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (s)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            2
             12 RETURN_VALUE
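
A sketch of how such dumps can be produced with the dis module (the small helper functions here are just for illustration):

import dis

def use_decode(s):
    return s.decode('utf-8')

def use_unicode(s):
    return unicode(s, 'utf-8')

# Dump the bytecode of each form so the two call paths can be compared.
dis.dis(use_decode)
dis.dis(use_unicode)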

The presence of LOAD_ATTR in the s.decode('utf-8') form hints that it is probably
going to be slower. Next, actually try it:
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.
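
A minimal sketch of that kind of timing run (the setup string is illustrative; absolute figures depend on the machine and Python build):

import timeit

# Time both spellings of the same UTF-8 decode on a short byte string.
setup = "s = 'abc' * 10"
print 'decode :', timeit.Timer("s.decode('utf-8')", setup).timeit()
print 'unicode:', timeit.Timer("unicode(s, 'utf-8')", setup).timeit()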

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because, as you pointed out, it's cleaner and more readable code.

Cheers,
Jason.
 

1x7y2z9

unicode() has LOAD_GLOBAL, which s.decode() does not. Is it generally
the case that LOAD_ATTR is slower than LOAD_GLOBAL, and is that what led to your
intuition that the decode form would probably be slower? Or was it some other
intuition?
Of course, the results from timeit are a different thing - I am asking about
the intuition drawn from the disassembler output.
Thanks.
 

John Machin

Jason Tackaberry said:
These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?
u = unicode(s,'utf-8')
u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each: [snip]
The presence of LOAD_ATTR in the s.decode('utf-8') form hints that it is probably
going to be slower. Next, actually try it:
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

HTH,
John
 

Jason Tackaberry

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Ok, fair point. I don't think the time difference fully registered when
I composed that message.

Testing a global access (LOAD_GLOBAL) versus an attribute access on a
global object (LOAD_GLOBAL + LOAD_ATTR) shows that the latter is about
40% slower than the former. So that certainly doesn't account for the
difference.
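
A sketch of the kind of micro-benchmark that comparison could be based on (the particular names timed here are illustrative; the 40% figure is machine-dependent):

import timeit

# A bare (builtin) name lookup compiles to LOAD_GLOBAL; adding an
# attribute access on that object adds a LOAD_ATTR on top of it.
t_global = timeit.Timer("str").timeit()
t_attr = timeit.Timer("str.join").timeit()
print 'LOAD_GLOBAL            :', t_global
print 'LOAD_GLOBAL + LOAD_ATTR:', t_attr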

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" <Python2.X_source_code>/Objects/unicodeobject.c

Very pedagogical of you. :) Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.
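
Following up on suggestion (1), a quick sketch of how the different spellings could be compared (a sketch only; which spellings hit the C-level shortcut depends on the interpreter version):

import timeit

# Compare several spellings of the encoding name for both call forms.
setup = "s = 'abc' * 10"
for enc in ('utf-8', 'utf8', 'utf_8'):
    t_unicode = timeit.Timer("unicode(s, %r)" % enc, setup).timeit()
    t_decode = timeit.Timer("s.decode(%r)" % enc, setup).timeit()
    print '%-6s unicode(): %.3f  decode(): %.3f' % (enc, t_unicode, t_decode)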

Cheers,
Jason.
 

Thorsten Kampe

* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
These two expressions are equivalent, but which one is faster, or should one of
them be preferred for some reason?

u = unicode(s,'utf-8')

u = s.decode('utf-8') # looks nicer

"decode" was added in Python 2.2 for the sake of symmetry to encode().
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same. I don't think any measurable speed increase will be
noticeable between those two.
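
For illustration, the symmetry in question (a trivial Python 2 sketch):

b = 'abc'               # a byte string (str)
u = b.decode('utf-8')   # str -> unicode
b2 = u.encode('utf-8')  # unicode -> str, the mirror-image operation
assert u == unicode(b, 'utf-8') and b2 == b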

Thorsten
 

Michael Ströder

Thorsten said:
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)

"decode" was added in Python 2.2 for the sake of symmetry to encode().

Yes, and I like the style. But...
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same.

Did you try it?
I don't think any measurable speed increase will be noticeable between
those two.

Well, that seems not to be true. Try it yourself. I did (my console charset is UTF-8):

Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Comparing again the two best combinations:
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Ciao, Michael.
 

Thorsten Kampe

* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
Thorsten said:
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
I don't think any measurable speed increase will be noticeable
between those two.

Well, that seems not to be true. Try it yourself. I did (my console charset is UTF-8):

Python 2.6 (r26:66714, Feb 3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Comparing again the two best combinations:
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Thorsten
 

Steven D'Aprano

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?

setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
t1.timeit(number=1)
5.6751680374145508
t2.timeit(number=1)
2.6822888851165771


Seems like a pretty meaningful difference to me.
 

Michael Ströder

Thorsten said:
* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.
 

John Machin

Jason Tackaberry said:
Very pedagogical of you. :) Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.

So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:

Why does consulting the codec registry take so long,
and can this be improved?
 

Mark Lawrence

Michael said:
Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.
I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests. Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write
for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...
Yes?
 

Steven D'Aprano

I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests. Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write
for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...
Yes?

There are all sorts of potential use-cases. A day or two ago, somebody
posted a question involving tens of thousands of lines of tens of
thousands of characters each (don't quote me, I'm going by memory). On
the other hand, it doesn't require much imagination to think of a use-
case where there are millions of lines each of a dozen or so characters,
and you want to process it line by line:


noun: cat
noun: dog
verb: café
....
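
A minimal sketch of that line-by-line pattern (the file name here is purely illustrative, and either decode spelling would do):

# Decode each short line once as it is read (hypothetical file name).
for line in open('wordlist.txt'):
    u = unicode(line, 'utf-8')   # or: line.decode('utf-8')
    # ... work with the unicode object u ...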


As always, before optimizing, you should profile to be sure you are
actually optimizing and not wasting your time.
 

Thorsten Kampe

* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?
setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
t1.timeit(number=1)
5.6751680374145508
t2.timeit(number=1)
2.6822888851165771

Seems like a pretty meaningful difference to me.

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

Thorsten
 

Thorsten Kampe

* Michael Ströder (Fri, 07 Aug 2009 03:25:03 +0200)
Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Again: if you think decoding "äöüÄÖÜß" one million times is a real world
use case for your module then go for unicode(). Otherwise the time you
spent benchmarking artificial cases like this is just wasted time. In
real life people won't even notice whether an application takes one or
two minutes to complete.

Use whatever you prefer (decode() or unicode()). If you experience
performance bottlenecks when you're done, test whether changing decode()
to unicode() makes a difference. /That/ is relevant.

Thorsten
 

garabik-news-2005-05

Thorsten Kampe said:
* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
What if you're writing a loop which takes one million different lines of
text and decodes them once each?
setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
t1.timeit(number=1)
5.6751680374145508
t2.timeit(number=1)
2.6822888851165771

Seems like a pretty meaningful difference to me.

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

For a real-life example, I often have a file with one word per line, and
I run python scripts to apply some (sometimes fairly trivial)
transformation over it. REAL example: reading lines with word, lemma and
tag separated by tabs from stdin and writing the word to stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache, I hope
the comments are self-explanatory). A sketch of such a filter follows the
timings below.

no unicode
user 0m2.380s

decode('utf-8'), encode('utf-8')
user 0m3.560s

sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin)
user 0m6.180s

unicode(line, 'utf8'), encode('utf-8')
user 0m3.820s

unicode(line, 'utf-8'), encode('utf-8')
user 0m2.880s

python3.1
user 0m1.560s
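
For reference, a minimal sketch of such a filter, in its decode('utf-8')/encode('utf-8') variant (the column order and output details are assumptions, not the actual script):

import sys

# Read tab-separated "word<TAB>lemma<TAB>tag" lines from stdin and write the
# word column to stdout, skipping markup lines that start with '<'.
for line in sys.stdin:
    u = line.decode('utf-8').rstrip(u'\n')
    if u.startswith(u'<'):
        continue
    sys.stdout.write(u.split(u'\t')[0].encode('utf-8') + '\n')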

Since I have something like 18 million words in my current project (and
more than 600 million overall) and I often tweak some parameters and re-run the
transformations, the differences are pretty significant.

Personally, I have been surprised by:
1) the bad performance of the codecs wrapper (I expected it to be on par with
unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) the good performance of python3.1 (utf-8 locale)


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 

alex23

Thorsten Kampe said:
Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

But that's not what you first claimed:
I don't think any measurable speed increase will be
noticeable between those two.

But please, keep changing your argument so you don't have to admit you
were wrong.
 

Thorsten Kampe

* alex23 (Fri, 7 Aug 2009 06:53:22 -0700 (PDT))
But that's not what you first claimed:

I don't think any measurable speed increase will be
noticeable between those two.
But please, keep changing your argument so you don't have to admit you
were wrong.

Bollocks. Please note the word "noticeable" - "noticeable" as in
recognisable, as in reasonably experienceable, or as in whatever.

One guy claims he has times between 2.7 and 5.7 seconds when
benchmarking more or less randomly generated "one million different
lines". That *is* *exactly* nothing.

Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant". I think I don't have to comment on that.

If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will occur.
The same thing will happen if you just increase the corpus of words to a
million, trillion or whatever. The performance implications of that are
exactly none.

Thorsten
 

garabik-news-2005-05

Thorsten Kampe said:
Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant".

That was over a sample of 600,000 words (sorry for not being able to explain
myself clearly enough so that everyone understands). My current project is
18e6 words, so the overall running time will be 87 vs. 186 seconds, which is
fairly noticeable.

600 million is the size of the whole corpus, that translates to
48 minutes vs. 1h43min. That already is a huge difference (going to
lunch at noon or waiting another hour until the run is over - and
you can bet it is _very_ noticeable when I am hungry :)).

With 9 different versions of the corpus (that is, what we are really
using now) that goes to 7.2 hours (or even less with python3.1!) vs. 15
hours. Being able to re-run the whole corpus generation in one working
day (and then go on with the next issues) vs. working overtime or
delivering the corpus one day later is a huge difference. Like being
one day behind schedule.
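
For the record, a rough back-of-the-envelope check of that scale-up, assuming the roughly 2.9 s vs. 6.2 s per ~600,000-word sample quoted above:

# Scale the per-sample timings (seconds per ~600,000 words) up to the
# project and corpus sizes mentioned above.
fast, slow = 2.9, 6.2
per_word_fast, per_word_slow = fast / 6e5, slow / 6e5
project, corpus = 18e6, 600e6   # words

print 'project: %.0f s vs. %.0f s' % (project * per_word_fast, project * per_word_slow)
print 'corpus : %.0f min vs. %.0f min' % (corpus * per_word_fast / 60, corpus * per_word_slow / 60)
print '9 runs : %.1f h vs. %.1f h' % (9 * corpus * per_word_fast / 3600, 9 * corpus * per_word_slow / 3600)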
I think I don't have to comment on that.

Indeed, the numbers are self-explanatory.
If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will occur.
The same thing will happen if you just increase the corpus of words to a
million, trillion or whatever. The performance implications of that are
exactly none.

I am not sure I understood that. Must be my English :)

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
