Walter Dörwald
Why doesn't this code work on Python 2.6? Or how else can I do this job?
[snip _MAP]
def downcode(name):
    """
    downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name
Though CPython is pretty optimized under the hood for this sort of
single-character replacement, this still seems pretty inefficient
since you're calling replace for every character you want to map. I
think that a better approach might be something like:
def downcode(name):
    return ''.join(_MAP.get(c, c) for c in name)
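To make the per-character lookup concrete, here is a minimal sketch, written for Python 3 and using a tiny hypothetical stand-in for the snipped _MAP (the real table is not shown in the thread). It assumes every key is a single character, which is what makes a one-pass lookup possible:

```python
# Hypothetical stand-in for the snipped _MAP; keys must be single characters.
_MAP = {
    "Ž": "Z",
    "á": "a",
    "š": "s",
    "ß": "ss",  # a value may be longer than one character
}


def downcode(name):
    # One pass over the input: look each character up, fall back to itself.
    return "".join(_MAP.get(c, c) for c in name)


print(downcode("Žabovitá zmiešaná kaša"))  # Zabovita zmiesana kasa
print(downcode("Straße"))                  # Strasse
```

Unlike the replace() loop, the cost here does not grow with the number of entries in the table, only with the length of the input.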
Or using string.translate:
import string
def downcode(name):
    # NB: string.maketrans() works on 8-bit str only, and both
    # arguments must have the same length.
    table = string.maketrans(
        'ÀÁÂÃÄÅ...',
        'AAAAAA...')
    return name.translate(table)
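For unicode input, translate() takes a mapping from code points to replacements instead of a 256-character table. A minimal sketch of that form, written for Python 3 (where str.maketrans() builds the mapping) and using a short illustrative character list rather than the elided one above; SOURCE, TARGET and TABLE are names made up for this example:

```python
# Illustrative character lists only; the post above elides the full table.
SOURCE = "ÀÁÂÃÄÅ"
TARGET = "AAAAAA"

# With two equal-length strings, str.maketrans() returns a dict that maps
# each source code point to the corresponding target code point.
TABLE = str.maketrans(SOURCE, TARGET)


def downcode(name):
    # Characters that are not in TABLE pass through unchanged.
    return name.translate(TABLE)


print(downcode("ÀÉÅ"))  # É is not in the table and is left alone
```

On Python 2 a similar table could presumably be built by hand from unicode strings, e.g. dict(zip(map(ord, SOURCE), TARGET)), and passed to unicode.translate().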
Or even simpler:
import unicodedata
def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")
Servus,
Walter
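A minimal sketch of what the normalize() variant does step by step, written for Python 3 (on 2.6 the literals would need a u prefix). It only uses the standard unicodedata module; the intermediate variable name is an addition for readability:

```python
import unicodedata


def downcode(name):
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", name)
    # "ignore" then drops every code point that ASCII cannot encode,
    # which for decomposed Latin text is mostly the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Inspect the decomposition of a single accented letter.
parts = unicodedata.normalize("NFD", "Ž")
print([unicodedata.name(c) for c in parts])
# ['LATIN CAPITAL LETTER Z', 'COMBINING CARON']

print(downcode("Žabovitá zmiešaná kaša"))  # Zabovita zmiesana kasa
```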
As I understand it, the "ignore" argument to str.encode *removes* the
unencodable characters, rather than replacing them with an ASCII
approximation. Is that correct? If so, wouldn't that rather defeat the
purpose?
Yes, but any accented characters have been split into the base character
and the combining accent via normalize() before, so only the accent gets
removed. Of course non-ASCII characters that have no decomposition will
be removed completely, but it would be possible to replace
    .encode("ascii", "ignore").decode("ascii")
with something like this (applied to the normalized string), which drops
only the combining marks and keeps everything else:
    u"".join(c for c in name if unicodedata.category(c) != "Mn")
Servus,
Walter
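To compare the two ways of discarding accents, here is a small sketch in Python 3. The function names strip_ascii_ignore and strip_marks are made up for this example; the filter keeps every character whose category is not "Mn" (nonspacing mark):

```python
import unicodedata


def strip_ascii_ignore(name):
    # Encode-based variant: drops every non-ASCII code point after NFD.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_marks(name):
    # Category-based variant: drops only nonspacing combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# "Ł" has no canonical decomposition, so the two variants differ on it.
print(strip_ascii_ignore("Łódź"))  # odz
print(strip_marks("Łódź"))         # Łodz
```

Which behaviour is preferable depends on whether the result must be pure ASCII or merely free of accents.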