unicode issue

Walter Dörwald · Oct 1, 2009

Why doesn't this code work on Python 2.6? Or how else can I do this job?

[snip _MAP]

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Though CPython is pretty well optimized under the hood for this sort of
single-character replacement, this still seems pretty inefficient
since you're calling replace for every character you want to map. I
think that a better approach might be something like:

def downcode(name):
    return ''.join(_MAP.get(c, c) for c in name)

Or using string.translate:

import string

def downcode(name):
    table = string.maketrans(
        'ÀÁÂÃÄÅ...',
        'AAAAAA...')
    return name.translate(table)
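
One caveat on the translate variant: in Python 2, string.maketrans builds a 256-entry table for 8-bit strings only, so it cannot handle unicode input. Below is a minimal sketch of the same idea for text strings, written in Python 3 syntax (where str.maketrans accepts a dict). SMALL_MAP and downcode_translate are hypothetical names, and the three entries only stand in for the snipped _MAP.

```python
# Hypothetical stand-in for the snipped _MAP; the real table is much longer.
SMALL_MAP = {"À": "A", "Á": "A", "Æ": "AE"}

# str.maketrans accepts a dict whose keys are single characters and whose
# values may be strings of any length, so "Æ" can expand to "AE".
TABLE = str.maketrans(SMALL_MAP)

def downcode_translate(name):
    # One pass over the string; unmapped characters are left unchanged.
    return name.translate(TABLE)

print(downcode_translate("ÀÆon"))
```

Building the table once at import time keeps the per-call cost to a single translate pass, which is the point of preferring it over repeated replace calls.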


Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")

Servus,
Walter


As I understand it, the "ignore" argument to str.encode *removes* the
unencodable characters, rather than replacing them with an ASCII
approximation. Is that correct? If so, wouldn't that rather defeat the
purpose?

Yes, but any accented characters have been split into the base character
and the combining accent via normalize() before, so only the accent gets
removed. Of course non-decomposable characters will be removed
completely, but it would be possible to replace

.encode("ascii", "ignore").decode("ascii")

with something like this:

u"".join(c for c in name if unicodedata.category(c) != "Mn")

Servus,
Walter
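
To make the trade-off concrete, here is a small sketch (Python 3 syntax, so no u prefix) comparing the two variants: dropping everything non-ASCII after decomposition versus dropping only the combining marks, i.e. keeping characters whose category is not "Mn". The function names and the sample word "Søren" are illustrative choices, not from the posts above; "ø" has no canonical decomposition, so it shows where the two differ.

```python
import unicodedata

def downcode_ascii(name):
    # Decompose, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def strip_marks(name):
    # Decompose, then drop only the combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(downcode_ascii("Žabovitá zmiešaná kaša"))
print(downcode_ascii("Søren"))  # "ø" does not decompose, so it is dropped
print(strip_marks("Søren"))     # "ø" survives; only combining marks are removed
```

Note that strip_marks can return non-ASCII text, so it only fits if the result does not strictly have to be ASCII.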

Rami Chowdhury · Oct 1, 2009

Thank you for the clarification!

Neil Hodgson · Oct 1, 2009

Dave Angel:

I know that the clipboard has type tags, but I haven't looked at them in
so long that I forget what they look like. For text, is it just ASCII
and Unicode? Or are there other possible encodings that the source and
sink negotiate?

The normal thing seen is that the clipboard differentiates between
Unicode text and locale-dependent 8-bit text. Depending on the platform,
Unicode text may be in UTF-8 (Linux) or UTF-16 (Windows). The encoding
of 8-bit text strings is not well defined and is normally assumed to be
compatible with whatever is currently in the document or the current
user interface encoding.

Neil

gentlestone · Oct 5, 2009

Thanks for the useful advice. It all seems very clever.

Thanks to the Django users community, I've got a nice solution for avoiding
unicode problems in doctests:

"""
[...] "Šafářová".decode('utf-8'))
<Osoba: Šafářová Ľudmila>
"""

The trick is: do not use unicode string literals at all. Instead, create a
unicode object by explicitly decoding a bytestring using the proper
codec.
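
A minimal sketch of why the codec matters, in Python 3 syntax (bytes literal plus decode) rather than the Python 2.6 of this thread. The byte escapes are the UTF-8 encoding of "Šafářová"; the variable names are illustrative, and cp1252 is picked only as an example of a wrong codec.

```python
# UTF-8 bytes for "Šafářová", written as escapes so the source file's
# own encoding cannot interfere.
raw = b"\xc5\xa0af\xc3\xa1\xc5\x99ov\xc3\xa1"

name = raw.decode("utf-8")      # the proper codec: 8 characters
garbled = raw.decode("cp1252")  # wrong codec: each accented letter becomes two

print(name)
print(len(name), len(garbled))
```

Decoding with the wrong codec does not raise here; it silently produces longer, garbled text, which is why the codec has to be named explicitly.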

Gabriel Genellina · Oct 6, 2009

En Thu said:
_MAP = {
    # LATIN
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
    u'Æ': 'AE', u'Ç': 'C', [...long table...]
}

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")

This article [1] shows a mixed technique, decomposing characters when such
info is available in the Unicode tables, and also allowing for a custom
mapping when not.

[1] http://effbot.org/zone/unicode-convert.htm
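
The mixed approach can be sketched roughly as follows. This is not the code from the linked article, only an illustration of the idea in Python 3 syntax: consult a custom table first, and otherwise fall back to decomposition. FALLBACK and downcode_mixed are hypothetical names, and the table holds just a few sample characters that have no canonical decomposition.

```python
import unicodedata

# Hypothetical custom mapping for characters that NFD cannot decompose.
FALLBACK = {
    "ø": "o", "Ø": "O", "ß": "ss", "æ": "ae", "Æ": "AE",
    "ł": "l", "Ł": "L", "đ": "d", "Đ": "D",
}

def downcode_mixed(name):
    out = []
    for ch in name:
        if ch in FALLBACK:
            # Custom mapping wins where one exists.
            out.append(FALLBACK[ch])
            continue
        # Otherwise decompose and keep everything except combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if unicodedata.category(c) != "Mn"))
    return "".join(out)

print(downcode_mixed("Žabovitá zmiešaná kaša"))
print(downcode_mixed("Łódź"))
```

Characters that are neither in the table nor decomposable pass through unchanged, so the result is not guaranteed to be pure ASCII unless the table covers every such character in the input.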
