David Siroky
Hi!
I need to enlighten myself about Python unicode speed and its implementation.
My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.
First a simple example (and time results):
x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s
x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x
longer than creating a non-unicode string?
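For anyone who wants to repeat the measurement in-process instead of with the shell's `time`, here is a minimal sketch using `timeit` (Python 3 syntax, where `b'a'` plays the role of the 2.x byte string; absolute numbers and the size of the gap will differ from the 2.4 figures above):

```python
import timeit

# Hypothetical micro-benchmark: replicate a one-byte string vs. a
# one-character unicode string. Sizes are modest so it runs quickly.
byte_time = timeit.timeit("x = b'a' * 5000000", number=10)
uni_time = timeit.timeit("x = u'a' * 5000000", number=10)
print('bytes:   %.4fs' % byte_time)
print('unicode: %.4fs' % uni_time)
```

On narrow 2.x builds a unicode string stored at least 2 bytes per character, so some extra cost is expected, but the exact ratio depends on the interpreter.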
Another situation: a speed problem with long strings.
I have a simple function for removing diacritics from a string:
#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-
import unicodedata
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output
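One suspect in this function is the `output += c` loop: each concatenation can copy the whole string built so far, so the total work can grow quadratically with the input length. A rewrite for illustration (Python 3 syntax, where every `str` is already unicode, so an `isinstance(line, bytes)` check replaces the `unicode` check above) that builds the result with a single `join` instead:

```python
import unicodedata

def no_diacritics_fast(line):
    # Decode byte input first (mirrors the unicode() call above).
    if isinstance(line, bytes):
        line = line.decode('utf-8')
    line = unicodedata.normalize('NFKD', line)
    # One join over a generator instead of repeated += concatenation,
    # keeping the work linear in the length of the input.
    return ''.join(c for c in line if not unicodedata.combining(c))
```

For example, `no_diacritics_fast(u'\u010desk\u00fd')` yields `'cesky'`.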
Now the calling sequence (and time results):
for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)
real 0m17.021s
user 0m11.139s
sys 0m5.116s
for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)
real 0m0.548s
user 0m0.502s
sys 0m0.004s
In both cases the total amount of data is the same, but with shorter strings it
is much faster. Maybe it has nothing to do with Python unicode, but I would
like to know the reason.
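A likely reason, unrelated to unicode itself: repeated `output += c` is worst-case quadratic, since each concatenation may copy everything built so far. One 50000-character build then costs roughly 25x the work of a 10000-character build, while five short builds cost only 5x. A sketch that isolates this effect (Python 3 syntax; note that modern CPython often optimizes `+=` on strings in place, so the slowdown was far more dramatic on 2.4):

```python
import timeit

def build_by_concat(n):
    # Mirrors the += loop in no_diacritics: each iteration may copy
    # the whole string built so far (worst case O(n**2) total work).
    out = ''
    for c in 'a' * n:
        out += c
    return out

one_long = timeit.timeit(lambda: build_by_concat(50000), number=1)
five_short = timeit.timeit(lambda: [build_by_concat(10000) for _ in range(5)],
                           number=1)
print('one 50000-char build:   %.4fs' % one_long)
print('five 10000-char builds: %.4fs' % five_short)
```

On an interpreter without the in-place optimization, the single long build should take noticeably longer than the five short ones despite processing the same total data.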
Thanks for any notes!
David