unicode speed


David Siroky

Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more
than 10x longer than creating a non-unicode string?

Another situation: speed problem with long strings

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real 0m17.021s
user 0m11.139s
sys 0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real 0m0.548s
user 0m0.502s
sys 0m0.004s

In both cases the total amount of data is equal but when I use shorter strings
it is much faster. Maybe it has nothing to do with Python unicode but I would
like to know the reason.

Thanks for notes!

David
 

Neil Hodgson

David Siroky:
output = ''

I suspect you really want "output = u''" here.
for c in line:
    if not unicodedata.combining(c):
        output += c

This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.

This is about 10 times faster for me:

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
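
The cStringIO route mentioned above is similar; a rough, untested sketch
might be (using the pure-Python StringIO here, since cStringIO rejects
unicode strings containing non-ASCII characters, and the function name is
just for illustration):

from StringIO import StringIO
import unicodedata

def no_diacritics_buf(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    buf = StringIO()
    for c in line:
        if not unicodedata.combining(c):
            buf.write(c)          # accumulate characters in the buffer
    return buf.getvalue()         # accumulated text (unicode when unicode was written)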

Neil
 

Tony Nelson

David Siroky said:
Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more
than 10x longer than creating a non-unicode string?

Your first example uses about 50 MB. Your second uses about 200 MB (or
100 MB if your Python is compiled oddly). Check the size of Unicode
chars by:

>>> import sys
>>> hex(sys.maxunicode)

If it says '0x10ffff', each unichar uses 4 bytes; if it says '0xffff',
each unichar uses 2 bytes.

Another situation: speed problem with long strings

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real 0m17.021s
user 0m11.139s
sys 0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real 0m0.548s
user 0m0.502s
sys 0m0.004s

In both cases the total amount of data is equal but when I use shorter strings
it is much faster. Maybe it has nothing to do with Python unicode but I would
like to know the reason.

It has to do with how strings (either kind) are implemented. Strings
are "immutable", so string concatenation is done by making a new string
that holds the concatenated value and assigning it to the left-hand side.
Often it is faster (but more memory intensive) to append to a list and
then at the end do a u''.join(mylist). See GvR's essay on optimization
at <http://www.python.org/doc/essays/list2str.html>.

Alternatively, you could use array.array from the Python Library (it's
easy) to get something "just as good as" mutable strings.
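
Something along these lines (a rough, untested sketch; the function name
is just for illustration):

import array
import unicodedata

def no_diacritics_array(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    # A 'u' array holds unicode characters and acts as a mutable buffer.
    output = array.array('u')
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return output.tounicode()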
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 

jepler

Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more
than 10x longer than creating a non-unicode string?

string objects have the optimization described in the log message below.
The same optimization hasn't been made to unicode_repeat, though it would
probably also benefit from it.

------------------------------------------------------------------------
r30616 | rhettinger | 2003-01-06 04:33:56 -0600 (Mon, 06 Jan 2003) | 11 lines

Optimize string_repeat.

Christian Tismer pointed out the high cost of the loop overhead and
function call overhead for 'c' * n where n is large. Accordingly,
the new code only makes lg2(n) loops.

Interestingly, 'c' * 1000 * 1000 ran a bit faster with old code. At some
point, the loop and function call overhead became cheaper than invalidating
the cache with lengthy memcpys. But for more typical sizes of n, the new
code runs much faster and for larger values of n it runs only a bit slower.
------------------------------------------------------------------------
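
Expressed in Python rather than C, the lg2(n) trick in that log message
amounts to something like this sketch (illustrative only, not the actual
implementation):

def repeat_by_doubling(s, n):
    # Build s * n by repeatedly doubling the already-filled portion,
    # so only about lg2(n) concatenations happen instead of n appends.
    if n <= 0:
        return s[:0]
    target = len(s) * n
    result = s
    while len(result) * 2 <= target:
        result += result                      # double what we have so far
    result += result[:target - len(result)]   # copy the remainder in one go
    return result

The C version does the same copying with memcpy into a preallocated
buffer.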

If you're a "C" coder too, consider creating and submitting a patch to do this
to the patch tracker on http://sf.net/projects/python . That's the best thing
you can do to ensure the optimization is considered for a future release of
Python.

Jeff

 

David Siroky

On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:
David Siroky:
output = ''

I suspect you really want "output = u''" here.

for c in line:
    if not unicodedata.combining(c):
        output += c

This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.

That is the answer I wanted; now I'm finally enlightened! :)
This is about 10 times faster for me:

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)

Neil


Thanx!

David
 
