unicode speed


David Siroky

Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more
than 10x longer than creating a non-unicode string?

Another situation: speed problem with long strings

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real 0m17.021s
user 0m11.139s
sys 0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real 0m0.548s
user 0m0.502s
sys 0m0.004s

In both cases the total amount of data is equal but when I use shorter strings
it is much faster. Maybe it has nothing to do with Python unicode but I would
like to know the reason.

Thanks for notes!

David
 

Neil Hodgson

David Siroky:
output = ''

I suspect you really want "output = u''" here.
for c in line:
    if not unicodedata.combining(c):
        output += c

This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.

This is about 10 times faster for me:

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
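
The cStringIO route mentioned above is similar; a rough, untested sketch
might be (using the pure-Python StringIO here, since cStringIO rejects
unicode strings containing non-ASCII characters, and the function name is
just for illustration):

from StringIO import StringIO
import unicodedata

def no_diacritics_buf(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    buf = StringIO()
    for c in line:
        if not unicodedata.combining(c):
            buf.write(c)          # accumulate characters in the buffer
    return buf.getvalue()         # accumulated text (unicode when unicode was written)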

Neil
 

Tony Nelson

David Siroky said:
Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more
than 10x longer than creating a non-unicode string?

Your first example uses about 50 MB. Your second uses about 200 MB (or
100 MB if your Python is compiled oddly). Check the size of Unicode
chars by:

>>> import sys
>>> hex(sys.maxunicode)

If it says '0x10ffff', each unichar uses 4 bytes; if it says '0xffff',
each unichar uses 2 bytes.

Another situation: speed problem with long strings

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real 0m17.021s
user 0m11.139s
sys 0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real 0m0.548s
user 0m0.502s
sys 0m0.004s

In both cases the total amount of data is equal but when I use shorter strings
it is much faster. Maybe it has nothing to do with Python unicode but I would
like to know the reason.

It has to do with how strings (either kind) are implemented. Strings
are "immutable", so string concatenation is done by making a new string
that holds the concatenated value and assigning it to the left-hand side.
Often it is faster (but more memory intensive) to append to a list and
then at the end do a u''.join(mylist). See GvR's essay on optimization
at <http://www.python.org/doc/essays/list2str.html>.

Alternatively, you could use array.array from the Python Library (it's
easy) to get something "just as good as" mutable strings.
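
Something along these lines (a rough, untested sketch; the function name
is just for illustration):

import array
import unicodedata

def no_diacritics_array(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    # A 'u' array holds unicode characters and acts as a mutable buffer.
    output = array.array('u')
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return output.tounicode()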
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 

jepler

Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more
than 10x longer than creating a non-unicode string?

string objects have the optimization described in the log message below.
The same optimization hasn't been made to unicode_repeat, though it would
probably also benefit from it.

------------------------------------------------------------------------
r30616 | rhettinger | 2003-01-06 04:33:56 -0600 (Mon, 06 Jan 2003) | 11 lines

Optimize string_repeat.

Christian Tismer pointed out the high cost of the loop overhead and
function call overhead for 'c' * n where n is large. Accordingly,
the new code only makes lg2(n) loops.

Interestingly, 'c' * 1000 * 1000 ran a bit faster with old code. At some
point, the loop and function call overhead became cheaper than invalidating
the cache with lengthy memcpys. But for more typical sizes of n, the new
code runs much faster and for larger values of n it runs only a bit slower.
------------------------------------------------------------------------
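
Expressed in Python rather than C, the lg2(n) trick in that log message
amounts to something like this sketch (illustrative only, not the actual
implementation):

def repeat_by_doubling(s, n):
    # Build s * n by repeatedly doubling the already-filled portion,
    # so only about lg2(n) concatenations happen instead of n appends.
    if n <= 0:
        return s[:0]
    target = len(s) * n
    result = s
    while len(result) * 2 <= target:
        result += result                      # double what we have so far
    result += result[:target - len(result)]   # copy the remainder in one go
    return result

The C version does the same copying with memcpy into a preallocated
buffer.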

If you're a "C" coder too, consider creating and submitting a patch to do this
to the patch tracker on http://sf.net/projects/python . That's the best thing
you can do to ensure the optimization is considered for a future release of
Python.

Jeff

 

David Siroky

On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:
David Siroky:
output = ''

I suspect you really want "output = u''" here.

for c in line:
    if not unicodedata.combining(c):
        output += c

This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.

That is the answer I wanted; now I'm finally enlightened! :)
This is about 10 times faster for me:

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)

Neil


Thanx!

David
 
