unicode() vs. s.decode()

  • Thread starter Michael Ströder

Steven D'Aprano

> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

You think users won't notice a doubling of execution time? Well, that
explains some of the apps I'm forced to use...

A two-second running time for (say) a command-line tool is already
noticeable. A five-second one is *very* noticeable -- long enough to be a
drag, short enough that you aren't tempted to go off and do something
else while you're waiting for it to finish.
 

Steven D'Aprano

> One guy claims he has times between 2.7 and 5.7 seconds when
> benchmarking more or less randomly generated "one million different
> lines". That *is* *exactly* nothing.


We agree that in the grand scheme of things, a difference of 2.7 seconds
versus 5.7 seconds is a trivial difference if your entire program takes
(say) 8 minutes to run. You won't even notice it.

But why assume that the program takes 8 minutes to run? Perhaps it takes
8 seconds to run, and 6 seconds of that is the decoding. Then halving
that reduces the total runtime from 8 seconds to 5, which is a noticeable
speed increase to the user, and significant if you then run that program
tens of thousands of times.

The Python dev team spend significant time and effort to get improvements
of the order of 10%, and you're pooh-poohing an improvement of the order
of 100%. By all means, remind people that premature optimization is a
waste of time, but it's possible to take that attitude too far, to Planet
Bizarro. At the point that you start insisting, and emphasising, that a
three-second time difference is "*exactly*" zero, it seems to me that
this is about you winning rather than you giving good advice.
 

Thorsten Kampe

* Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
> We agree that in the grand scheme of things, a difference of 2.7 seconds
> versus 5.7 seconds is a trivial difference if your entire program takes
> (say) 8 minutes to run. You won't even notice it.
Exactly.

> But why assume that the program takes 8 minutes to run? Perhaps it takes
> 8 seconds to run, and 6 seconds of that is the decoding. Then halving
> that reduces the total runtime from 8 seconds to 5, which is a noticeable
> speed increase to the user, and significant if you then run that program
> tens of thousands of times.

Exactly. That's why it doesn't make sense to benchmark decode()/unicode()
in isolation - meaning out of the context of your actual program.
> By all means, remind people that premature optimization is a
> waste of time, but it's possible to take that attitude too far, to Planet
> Bizarro. At the point that you start insisting, and emphasising, that a
> three-second time difference is "*exactly*" zero,

Exactly. Because it was not generated in a real-world use case but by
running a simple loop one million times. Why one million times? Because
by running it "only" one hundred thousand times the difference would
have seemed even less relevant.
> it seems to me that this is about you winning rather than you giving
> good advice.

I already gave good advice:
1. don't benchmark
2. don't benchmark until you have an actual performance issue
3. if you do benchmark, benchmark the whole application, not single commands

It's really easy: Michael has working code. With that he can easily
write two versions - one that uses decode() and one that uses unicode().
He can benchmark these with some real-world input he often uses by
running it a hundred or a thousand times (even a million if he likes),
along the lines of the sketch below. Then he can compare the results. I
doubt that there will be any noticeable difference.

Thorsten
 

Thorsten Kampe

* alex23 (Fri, 7 Aug 2009 10:45:29 -0700 (PDT))
> I just parsed it as "blah blah blah I won't admit I'm wrong" and
> didn't miss anything substantive.

Alex, there are still a number of performance optimizations that require
a thorough optimizer like you. Like using short identifiers instead of
long ones. I guess you could easily prove that by comparing "a = 0" to
"a_long_identifier = 0" and running it one hundred trillion times. The
performance gain could easily add up to *days*. Keep us updated.

Thorsten
 

Thorsten Kampe

* (e-mail address removed) (Fri, 7 Aug 2009
17:41:38 +0000 (UTC))
> I am not sure I understood that. Must be my English :)

I guess you understand me very well and I understand you very well. If
the performance gain you want to prove doesn't show with 600,000 words,
you test again with 18,000,000 words and if that is not impressive
enough with 600,000,000 words. Great.

Or if a million repetitions of your "improved" code don't show the
expected "performance advantage" you run it a billion times. Even
greater. Keep on optimizing.

Thorsten
 

Michael Ströder

Thorsten said:
> * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
>
> Exactly. That's why it doesn't make sense to benchmark decode()/unicode()
> in isolation - meaning out of the context of your actual program.

Thorsten, the point is you're too arrogant to admit that making such a general
statement as you did, without knowing *anything* about the context, is simply
false. So this is not a technical matter. It's mainly an issue with your attitude.
> Exactly. Because it was not generated in a real-world use case but by
> running a simple loop one million times. Why one million times? Because
> by running it "only" one hundred thousand times the difference would
> have seemed even less relevant.

I was running it one million times to mitigate influences on the timing by
other background processes, which is a common technique when benchmarking. I
was mainly interested in the percentage difference, which is indeed
significant. The absolute times also strongly depend on the hardware the
software is running on, so your comment about the absolute times is complete
nonsense. I want this software to run with acceptable response times on
hardware much slower than my development machine, too.
> I already gave good advice:
> 1. don't benchmark
> 2. don't benchmark until you have an actual performance issue
> 3. if you do benchmark, benchmark the whole application, not single commands

You don't know anything about what I'm doing and what my aim is. So your
general rules don't apply.
> It's really easy: Michael has working code. With that he can easily
> write two versions - one that uses decode() and one that uses unicode().

Yes, I have working code which was originally written before .decode() was
added in Python 2.2. Therefore I wondered whether it would be nice for
readability to replace unicode() with s.decode(), since the software does not
support Python versions prior to 2.3 anymore anyway. But one aspect is also
performance, hence my question and testing.

Ciao, Michael.
 

Michael Fötsch

Michael said:
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :)

unicode() becomes *slower* when you try "UTF-8" in uppercase, or an
entirely different codec, say "cp1252":
1.7812771797180176

The reason seems to be that unicode() bypasses codecs.lookup() if the
encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". OTOH,
str.decode() always calls codecs.lookup().

If speed is your primary concern, this will give you even better
performance than unicode():

import codecs

decoder = codecs.lookup("utf-8").decode   # look up the codec once, outside the loop
for i in xrange(1000000):
    decoder("äöüÄÖÜß")[0]   # decode() returns (unicode object, number of bytes consumed)


However, there's also a functional difference between unicode() and
str.decode():

unicode() always raises an exception when you try to decode a unicode
object. str.decode() will first try to encode a unicode object using the
default encoding (usually "ascii"), which might or might not work.
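A short Python 2 sketch of that difference (the literal strings are just
examples):

# unicode() refuses to "decode" something that is already unicode:
try:
    unicode(u'abc', 'utf-8')
except TypeError, e:
    print 'unicode():', e            # "decoding Unicode is not supported"

# str.decode() on a unicode object first encodes it with the default
# codec (usually ASCII) and then decodes the result ...
print u'abc'.decode('utf-8')         # works: the ASCII round-trip succeeds

# ... which fails as soon as the text is not pure ASCII:
try:
    u'\xe4\xf6\xfc'.decode('utf-8')  # u'äöü'
except UnicodeError, e:
    print 'str.decode():', e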

Kind Regards,
M.F.
 

garabik-news-2005-05

Thorsten Kampe said:
> * (e-mail address removed) (Fri, 7 Aug 2009
> 17:41:38 +0000 (UTC))
>
> I guess you understand me very well and I understand you very well. If

I did not. Really. But then it has been explained to me, so I think I do
now :)
> the performance gain you want to prove doesn't show with 600,000 words,
> you test again with 18,000,000 words and if that is not impressive
> enough with 600,000,000 words. Great.

Huh?
18e6 words is what I am working with _now_. Most of the data is already
collected, there are going to be a few more books, but that's all. And the
optimization I was talking about means going home from work one hour
later or earlier. Quite noticeable for me.
600e6 words is the main corpus. The data is already there and waits to be
processed in due time, once we finish our current project. That is
real life, not a thought experiment.

> Or if a million repetitions of your "improved" code don't show the
> expected "performance advantage" you run it a billion times. Even
> greater. Keep on optimizing.

No, we do not have one billion words (yet - I assume you are talking
about the American billion; if you are talking about the European billion,
we would be masters of the world with a billion-word corpus!).
However, that might change once we start collecting www data (which is a
separate project, to be started in a year or two).
Then we'll do some more optimization, because the time differences will
be more noticeable. Easy as that.


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 

Thorsten Kampe

* Michael Ströder (Sat, 08 Aug 2009 15:09:23 +0200)
> Thorsten, the point is you're too arrogant to admit that making such a general
> statement as you did, without knowing *anything* about the context, is simply
> false.

I made a general statement to a very general question ("These both
expressions are equivalent but which is faster or should be used for any
reason?"). If you have specific needs or reasons then you obviously
failed to provide that specific "context" in your question.
> I was running it one million times to mitigate influences on the timing by
> other background processes, which is a common technique when benchmarking.

Err, no. That is what "repeat" is for and it defaults to 3 ("This means
that other processes running on the same computer may interfere with the
timing. The best thing to do when accurate timing is necessary is to
repeat the timing a few times and use the best time. [...] the default
of 3 repetitions is probably enough in most cases.")

Three times - not one million times. You choose one million times (for
the loop) when the thing you're testing is very fast (like decoding) and
you don't want results in the 0.00000n range. Which is what you asked
for and what you got.
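For illustration, the two knobs look roughly like this (the statement and
setup strings are merely examples, not Michael's actual benchmark):

import timeit

# 'number' = executions per timing (one million, so the total is not down
# in the 0.00000n range); 'repeat' = independent timings (default 3),
# of which you keep the best.
t = timeit.Timer("s.decode('utf-8')",
                 setup="s = '\\xc3\\xa4\\xc3\\xb6\\xc3\\xbc'")
results = t.repeat(repeat=3, number=1000000)
print "best of 3: %.3f s per 1,000,000 calls" % min(results)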
> You don't know anything about what I'm doing and what my aim is. So your
> general rules don't apply.

See above. You asked a general question, you got a general answer.
> Yes, I have working code which was originally written before .decode() was
> added in Python 2.2. Therefore I wondered whether it would be nice for
> readability to replace unicode() with s.decode(), since the software does not
> support Python versions prior to 2.3 anymore anyway. But one aspect is also
> performance, hence my question and testing.

You haven't done any testing yet. Running decode/unicode one million
times in a loop is not testing. If you don't believe me, then at least read
Martelli's Optimization chapter in Python in a Nutshell (the chapter is
available via Google Books).

Thorsten
 

Michael Ströder

Michael said:
> If speed is your primary concern, this will give you even better
> performance than unicode():
>
> decoder = codecs.lookup("utf-8").decode
> for i in xrange(1000000):
>     decoder("äöüÄÖÜß")[0]

Hmm, that could be interesting. I will give it a try.
> However, there's also a functional difference between unicode() and
> str.decode():
>
> unicode() always raises an exception when you try to decode a unicode
> object. str.decode() will first try to encode a unicode object using the
> default encoding (usually "ascii"), which might or might not work.

Thanks for pointing that out. So in my case I'd consider that also a plus for
using unicode().

Ciao, Michael.
 

Steven D'Aprano

>> I was running it one million times to mitigate influences on the timing
>> by other background processes, which is a common technique when
>> benchmarking.
>
> Err, no. That is what "repeat" is for and it defaults to 3 ("This means
> that other processes running on the same computer may interfere with the
> timing. The best thing to do when accurate timing is necessary is to
> repeat the timing a few times and use the best time. [...] the default
> of 3 repetitions is probably enough in most cases.")


It's useful to look at the timeit module to see what the author(s) think.

Let's start with the repeat() method. In the Timer docstring:

"The repeat() method is a convenience to call timeit() multiple times and
return a list of results."

and the repeat() method's own docstring:

"This is a convenience function that calls the timeit() repeatedly,
returning a list of results. The first argument specifies how many times
to call timeit(), defaulting to 3; the second argument specifies the
timer argument, defaulting to one million."

So it's quite obvious that the module author(s), and possibly even Tim
Peters himself, consider repeat() to be a mere convenience method.
There's nothing you can do with repeat() that can't be done with the
timeit() method itself.

Notice that both repeat() and timeit() methods take an argument to
specify how many times to execute the code snippet. Why not just execute
it once? The module doesn't say, but the answer is a basic measurement
technique: if your clock is accurate to (say) a millisecond, and you
measure a single event as taking a millisecond, then your relative error
is roughly 100%. But if you time 1000 events, and measure the total time
as 1 second, the relative error is now 0.1%.

The authors of the timeit module obviously considered this an important
factor: not only did they allow you to specify the number of times to
execute the code snippet (defaulting to one million, not to one) but they
had this to say:

Command line usage:
python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [statement]

Options:
-n/--number N: how many times to execute 'statement'
[...]

If -n is not given, a suitable number of loops is calculated by trying
successive powers of 10 until the total time is at least 0.2 seconds.
[end quote]

In other words, when calling the timeit module from the command line, by
default it will choose a value for n that gives a sufficiently small
relative error.

It's not an accident that timeit gives you two "count" parameters: the
number of times to execute the code snippet per timing, and the number of
timings. They control (partly) for different sources of error.
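That auto-ranging behaviour can be sketched in a few lines (again, the
statement and setup strings are only examples):

import timeit

timer = timeit.Timer("s.decode('utf-8')",
                     setup="s = '\\xc3\\xa4\\xc3\\xb6\\xc3\\xbc'")
number = 1
while True:
    # Try successive powers of 10 until one timing takes at least 0.2 s,
    # which keeps the clock's relative error small.
    elapsed = timer.timeit(number)
    if elapsed >= 0.2:
        break
    number *= 10
print "number=%d loops: %.3f s total, %.2f usec per call" % (
    number, elapsed, elapsed / number * 1e6)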
 
