Problems With Accented Characters

Fuzzyman · Feb 22, 2004

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

To make sure it finds anagrams properly, I wanted to strip punctuation
and similar characters from the words in the dictionary and from the
words the user enters. I test each character against the 26 English
letters (string.ascii_lowercase).

I now have someone who wants to use a French dictionary, with words
containing accented characters! I have two choices: either map the
accented characters to their unaccented equivalents (slightly
inaccurate) or treat the accented characters as separate letters (very
few anagrams). However, at the moment I can't experiment with either,
because my default codec is 7-bit ASCII and the program crashes
(sometimes!) when it meets accented characters.

Has anyone any advice, or can anyone point me to any resources, for
handling these characters effectively? I guess it's a latin-1 encoding
I want to use... I can't even work out how to change the default
codec.
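
As a rough illustration of the latin-1 idea, the sketch below decodes the bytes explicitly instead of changing the interpreter's default codec. It is written for Python 3, which is an assumption (the post dates from the Python 2 era, where codecs.open offered a similar explicit-encoding option). The sample bytes and the load_words helper are hypothetical and not part of Nanagram.

```python
# Hypothetical raw bytes, as they might appear in a Latin-1 word list.
raw_line = b"degr\xe9\n"

# Decode explicitly rather than relying on the default codec.
word = raw_line.decode("latin-1").strip()


def load_words(path, encoding="latin-1"):
    """Read one word per line from a file, decoding with the given codec.

    The default of latin-1 is an assumption about the dictionary file.
    """
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Once every word is a text string rather than a byte string, the later membership tests no longer depend on the default codec at all.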

Thanks,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html

Fuzzyman · Feb 23, 2004

It's particularly difficult for me to understand what is happening -
because python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
  File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in prepare
    if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the user's input to remove invalid
characters (like "-" and "'")... It crashes when it comes to the
e-acute.
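
One plausible reading of the traceback (an assumption, since the surrounding code is not shown) is that a Unicode string coming from Tkinter is being tested against a byte string that contains non-ASCII bytes, which Python 2 then tries to decode as ASCII. A minimal sketch of the same filtering step with every value held as text is given below, in Python 3. The VALID_LETTERS set and the list of French letters in it are illustrative only, and prepare here is a stand-in for the method named in the traceback, not the original code.

```python
import string

# Illustrative set of permitted letters: ASCII plus some French accented
# letters. The exact list is an assumption and is probably incomplete.
VALID_LETTERS = set(string.ascii_lowercase) | set("àâçéèêëîïôùûü")


def prepare(text):
    """Lowercase the input and drop anything outside VALID_LETTERS."""
    return "".join(ch for ch in text.lower() if ch in VALID_LETTERS)


cleaned = prepare("Degré hello-ma'")
```

Because both sides of the `in` test are text strings, no implicit decoding takes place, so the behaviour should not differ between launching from IDLE and double-clicking the file.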

*However* - if I run it by double-clicking on the file then it appears
to work fine. For example, if I ask it to find anagrams of
'degré hello ma' then it strips out the e-acute (thinking it's an
invalid character) and finds anagrams of the rest:

gleam holder
hallo merged

What I'd like to do is switch by default to an 8-bit codec (latin-1, I
think?) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e, for
example) *or* treating them as separate characters.

Anyone able to help ??

Fuzzy
