python3 Unicode is slow

Dale Gerdemann · Oct 25, 2009

I've written simple code in 2.6 and 3.0 to read every character of a
set of files and print out some information for each of these
characters. I tested each program on a large Cyrillic/Latin text. The
result was that the 2.6 version was about 5x faster. Here are the two
programs:

#!/usr/bin/env python

import sys
import codecs
import unicodedata

for path in sys.argv[1:]:
    lines = codecs.open(path, encoding='UTF-8',
                        errors='replace').readlines()

    for line in lines:
        for c in line:
            name = unicodedata.name(c, 'unknown')
            prnt = prnt_rep = c.encode('utf8')
            if name == 'unknown':
                prnt = ' '
            if ord(c) > 127:
                print('%s %-14r U+%04x %s' % (prnt, prnt_rep, ord(c),
                                              name))
            else:
                if ord(c) == 9:
                    name = 'tab'
                    prnt = ' '
                elif ord(c) == 10:
                    name = 'LF'
                    prnt = ' '
                elif ord(c) == 13:
                    name = 'CR'
                    prnt = ' '
                print("{0:s} '\\x{1:02x}' U+{2:04x} {3:s}".format(
                    prnt, ord(c), ord(c), name))

#!/usr/bin/env python3

import sys
import unicodedata

for path in sys.argv[1:]:
    lines = open(path, errors='replace').readlines()

    for line in lines:
        for c in line:
            code_point = ord(c)
            # repr() keeps the b'...' form as a str, so the {1:15s}
            # format spec below also works on newer Python 3 releases,
            # where bytes objects reject a non-empty format spec.
            utf8 = repr(c.encode())
            if code_point <= 127:
                utf8 = "b'\\" + hex(code_point)[1:] + "'"
            name = unicodedata.name(c, 'unknown')
            if name == 'unknown':
                c = ' '
            if code_point == 9:
                c = ' '
                name = 'tab'
            elif code_point == 10:
                c = ' '
                name = 'LF'
            elif code_point == 13:
                c = ' '
                name = 'CR'
            print("{0:s} {1:15s} U+{2:04x} {3:s}".format(
                c, utf8, code_point, name))

John Machin · Oct 25, 2009

> I've written simple code in 2.6 and 3.0 to read every character of a
> set of files and print out some information for each of these
> characters. I tested each program on a large Cyrillic/Latin text. The
> result was that the 2.6 version was about 5x faster.

3.0? Nowadays nobody wants to know about benchmarks of 3.0. In 3.0,
much of the new file I/O stuff was written in pure Python. It was
rewritten in C for 3.1. In general AFAICT there is no good reason to
be using 3.0. Consider updating to 3.1.1.
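
One way to check whether the I/O layer really is the culprit is to time the read separately from the per-character work. What follows is only a minimal sketch, not the code that was benchmarked above: the function name time_phases and the in-memory sample text are made up for illustration, and time.perf_counter assumes Python 3.3 or later. The print calls are left out on purpose; since output also goes through the I/O layer, a third timed phase for printing could be added in the same way.

```python
import io
import time
import unicodedata


def time_phases(stream):
    """Return (read_seconds, loop_seconds, char_count) for a text stream.

    Hypothetical helper: times readlines() and the per-character loop
    separately so the two costs can be compared.
    """
    start = time.perf_counter()
    lines = stream.readlines()
    read_seconds = time.perf_counter() - start

    start = time.perf_counter()
    count = 0
    for line in lines:
        for c in line:
            # Same per-character calls as the programs above, minus printing.
            unicodedata.name(c, 'unknown')
            c.encode()
            count += 1
    loop_seconds = time.perf_counter() - start
    return read_seconds, loop_seconds, count


if __name__ == '__main__':
    # Made-up sample standing in for a large Cyrillic/Latin file.
    sample = io.StringIO('Привет, world!\n' * 10000)
    read_s, loop_s, n = time_phases(sample)
    print('read: {0:.4f}s  loop: {1:.4f}s  chars: {2}'.format(
        read_s, loop_s, n))
```

To run it against a real file, pass open(path, encoding='UTF-8', errors='replace') instead of the StringIO object. If the read phase dominates on 3.0 but not on 3.1, that would be consistent with the I/O rewrite being the main difference.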
