Hi!
I am looking for an example of a Unicode to ASCII conversion
that will remove diacritics from characters (and leave the characters,
i.e., Klüft to Kluft) as well as handle the conversion of other
characters, like große to grosse.
Seems like a nasty thing to do, akin to stripping the vowels from English
text just because Hebrew didn't write them. But if you insist, there's
always this:
http://code.activestate.com/recipes/251871
although it is nowhere near complete, and it's pretty ugly code too.
Perhaps a cleaner method might be to use a combination of Unicode
normalisation forms and a custom translation table. Here's a basic
version to get you started, written for Python 3:
import unicodedata
# Do this once. It may take a while.
table = {}
for n in range(128, 0x110000):
    # Use unichr in Python 2.
    expanded = unicodedata.normalize('NFKD', chr(n))
    keep = [c for c in expanded if ord(c) < 128]
    if keep:
        table[n] = ''.join(keep)
    else:
        # None to delete, or use some other replacement string.
        table[n] = None
# Add extra transformations.
# In Python2, every string needs to be a Unicode string u'xyz'.
table[ord('ß')] = 'ss'
table[ord('\N{LATIN CAPITAL LETTER SHARP S}')] = 'SS'
table[ord('Æ')] = 'AE'
table[ord('æ')] = 'ae'
table[ord('Œ')] = 'OE'
table[ord('œ')] = 'oe'
table[ord('\N{LATIN SMALL LIGATURE FI}')] = 'fi'
table[ord('\N{LATIN SMALL LIGATURE FL}')] = 'fl'
table[ord('ø')] = 'oe'
table[ord('Ð')] = 'D'
table[ord('Þ')] = 'TH'
# etc.
# Say you don't want control characters in your string, you might
# escape them using caret ^C notation:
for i in range(32):
    table[i] = '^%c' % (ord('@') + i)
table[127] = '^?'
# But it's probably best if you leave newlines, tabs etc. alone...
for c in '\n\r\t\f\v':
    del table[ord(c)]
# Add any more transformations you like here. Perhaps you want to
# transliterate Russian and Greek characters to English?
table[whatever] = whatever
# In Python 2, skip this step: unicode.translate() accepts the dict as-is.
table = str.maketrans(table)
That's a fair chunk of work, but it only needs to be done once, at the start
of your application. Then you call it like this:
cleaned = 'some Unicode string'.translate(table)
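To see the table approach working end to end on the two words from the question, here is a cut-down sketch. The function name build_table and its limit parameter are my own, and the range is deliberately kept small (Latin letters plus the combining marks) so that it runs instantly; the full-range loop above is what you would use for real.

```python
import unicodedata

def build_table(limit=0x370):
    """Map each code point in [128, limit) to its ASCII residue after NFKD."""
    table = {}
    for n in range(128, limit):
        expanded = unicodedata.normalize('NFKD', chr(n))
        keep = ''.join(c for c in expanded if ord(c) < 128)
        table[n] = keep or None   # None deletes the character
    # Letters with no decomposition need explicit entries.
    table[ord('ß')] = 'ss'
    table[ord('Æ')] = 'AE'
    table[ord('æ')] = 'ae'
    return table

table = build_table()
print('Klüft'.translate(table))   # Kluft
print('große'.translate(table))   # grosse
```

Because the combining marks themselves are in the table and map to None, input that is already decomposed (a plain 'u' followed by U+0308) comes out the same way.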
If you really want to be fancy, you can extract the name of each Unicode
code point (if it has one!) and parse the name. Here's an example:
py> unicodedata.name('ħ')
'LATIN SMALL LETTER H WITH STROKE'
py> unicodedata.lookup('LATIN SMALL LETTER H')
'h'
but I'd only do that after the normalization step, if at all.
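A rough sketch of that name-parsing idea, assuming the only pattern worth handling is "<base name> WITH <something>". The helper name base_letter is mine, and anything it cannot resolve is returned unchanged:

```python
import unicodedata

def base_letter(ch):
    """Return the letter that ch's Unicode name is built on, or ch itself."""
    try:
        name = unicodedata.name(ch)
    except ValueError:          # the code point has no name
        return ch
    base_name, sep, _ = name.partition(' WITH ')
    if not sep:
        return ch               # the name carries no "WITH ..." modifier
    try:
        return unicodedata.lookup(base_name)
    except KeyError:            # no character has the shortened name
        return ch

print(base_letter('ħ'))   # h
print(base_letter('Ł'))   # L
```

This catches letters such as ħ, Ł and ø that NFKD leaves alone, but it knows nothing about what the modifier means, so treat the result as a guess rather than a transliteration.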
Too much work for your needs? Well, you can get about 80% of the way in
only a few lines of code:
cleaned = unicodedata.normalize('NFKD', unistr)
for before, after in (
        ('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'), ('œ', 'oe'),
        # put any more transformations here...
        ):
    cleaned = cleaned.replace(before, after)
# 'ignore' drops the combining marks left behind by NFKD;
# 'replace' would turn each one into a '?'.
cleaned = cleaned.encode('ascii', 'ignore').decode('ascii')
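One detail worth checking in the short version is the error handler passed to encode(), since it decides what happens to the combining marks that NFKD leaves behind. Here is the same idea wrapped in a function so the two handlers can be compared side by side; the name quick_ascii and its errors parameter are my own, not part of any library.

```python
import unicodedata

def quick_ascii(text, errors='ignore'):
    """Decompose, apply a few hand-written replacements, then force ASCII."""
    cleaned = unicodedata.normalize('NFKD', text)
    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'),
                          ('Œ', 'OE'), ('œ', 'oe')):
        cleaned = cleaned.replace(before, after)
    return cleaned.encode('ascii', errors).decode('ascii')

print(quick_ascii('Klüft'))              # Kluft
print(quick_ascii('große'))              # grosse
print(quick_ascii('Klüft', 'replace'))   # Klu?ft
```

With 'ignore' the stray combining diaeresis vanishes; with 'replace' it becomes a question mark in the middle of the word, which is rarely what you want here.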
Another method would be this:
http://effbot.org/zone/unicode-convert.htm
which is focused on European languages. But it might suit your purposes.
There used to be a program called any2ascii.py
(http://www.haypocalc.com/perso/prog/python/any2ascii.py) that worked
well, but the link is now broken and I can't seem to locate it.
I have seen the page "Unicode strings to ASCII ...nicely",
http://www.peterbe.com/plog/unicode-to-ascii, but am looking for a
working example.
He has a working example. How much hand-holding are you looking for?
Quoting from that page:
I'd much rather that a word like "Klüft" is converted to
"Kluft" which will be more human readable and still correct.
The author is wrong. That's like saying that changing the English word
"car" to "cer" is still correct -- it absolutely is not correct, and even
if it were, what is he implying with the quip about "more human
readable"? That Germans and other Europeans aren't human?
If an Italian said:
I'd much rather that a word like "jump" is converted to
"iump" which will be more human readable and still correct.
we'd all agree that he was talking rubbish.
Make no mistake, this sort of simple-minded stripping of accents and
diacritics is an extremely ham-fisted thing to do. Stripping out letters
without changing the meaning of the words is, at best, hard to do right
and requires good knowledge of the linguistic rules of the language
you're translating. And at worst, it's outright impossible. For instance,
in German I believe it is quite acceptable to translate 'ü' to 'ue',
except in names: Herr Müller will probably be quite annoyed if you call
him Herr Mueller, and Herr Mueller will probably be annoyed too, and both
of them will be peeved to be confused with Herr Muller.
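To make the name problem concrete, here is a small sketch that runs the three surnames through both conventions. The GERMAN table and the strip_marks helper are illustrative names of my own, and the table assumes the input is in composed (NFC) form.

```python
import unicodedata

# German convention: umlauts become vowel + 'e', sharp s becomes 'ss'.
GERMAN = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
                        'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue', 'ß': 'ss'})

def strip_marks(text):
    """Decompose the text and drop every combining mark."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

names = ['Müller', 'Mueller', 'Muller']
print([strip_marks(n) for n in names])        # Müller now collides with Muller
print([n.translate(GERMAN) for n in names])   # Müller now collides with Mueller
```

Either way, three distinct surnames end up as two, and nothing in the ASCII output tells you which one you started with.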