Problems With Accented Characters

Fuzzyman

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

To make sure it finds anagrams properly, I strip punctuation and
similar characters from both the dictionary words and the words the
user enters. I do this by testing each character against the 26
English letters ( string.ascii_lowercase ).

I now have someone who wants to use a French dictionary - with words
containing accented characters !! I have two choices - either map the
accented characters to their unaccented equivalents (slightly
inaccurate) or treat the accented characters as separate letters (very
few anagrams). However - at the moment I can't experiment with either
because my default codec is 7-bit ascii and the program crashes
(sometimes !!) when given accented characters.

Does anyone have any advice - or can anyone point me to any resources -
for handling these characters effectively? I guess it's a latin-1
encoding I want to use... I can't even work out how to change the
default codec........
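
One way to sidestep the default codec altogether is to decode at the file boundary with an explicit encoding. The sketch below assumes the word list is a Latin-1 text file with one word per line; the helper name load_words and that file layout are hypothetical, not taken from Nanagram. io.open accepts an encoding argument in both Python 2.6+ and Python 3.

```python
import io


def load_words(path, encoding="latin-1"):
    """Read one word per line, decoding the file's bytes to text explicitly.

    Assumption: the dictionary is plain text, one word per line, and the
    caller knows (or lets the user choose) the file's encoding.
    """
    with io.open(path, encoding=encoding) as handle:
        # Skip blank lines and trim surrounding whitespace/newlines.
        return [line.strip() for line in handle if line.strip()]
```

Because the words come back as text rather than bytes, every later comparison works on characters and never falls back to the ascii codec.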

Thanks,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html
 
Fuzzyman

It's particularly difficult for me to understand what is happening -
because Python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
  File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in prepare
    if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the user's input to remove invalid
characters (like "-" and "'")... It crashes when it comes to the
e-acute.
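
One possible reading of that traceback, offered as a guess rather than a diagnosis: in Python 2, an `in` test between a unicode string and a byte string makes Python decode the byte string with the default ascii codec first. "Position 26" would then fit a valid_letters byte string holding the 26 English letters followed by a non-ASCII byte. The sketch below reproduces the same decode failure in Python 3 terms; the contents of valid_bytes and the choice of cp437 are assumptions, not taken from the program.

```python
# Assumption: valid_letters was a byte string with one non-ASCII byte
# appended after the 26 English letters.
valid_bytes = b"abcdefghijklmnopqrstuvwxyz\x83"

try:
    # This is the conversion Python 2 attempts implicitly when a unicode
    # string is compared against a byte string.
    valid_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding explicitly with the codec the bytes were really written in
# avoids the implicit ascii step. cp437 is only a guess here.
valid_letters = valid_bytes.decode("cp437")
```

Here message holds the same text as the last line of the traceback. If this reading is right, it would also explain why the crash looks intermittent: input made only of ASCII letters never triggers the decode error.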


*However* - if I run it by double clicking on the file then it appears
to work fine. For example, if I ask it to find anagrams of 'degré hello
ma' then it strips out the e-acute (thinking it's an invalid character)
and finds anagrams of the rest :

gleam holder
hallo merged

What I'd like to do is switch by default to an 8 bit codec (latin-1 I
think ?????) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e for
example) *or* treating them as separate characters.............
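
A minimal sketch of how that choice might look, assuming Python 3 (where str is already text) and assuming anagrams are matched by comparing sorted letters. The names fold_accents and anagram_key are hypothetical. Accent folding here uses Unicode decomposition from the standard unicodedata module, which handles characters such as e-acute but leaves ligatures such as œ untouched.

```python
import unicodedata


def fold_accents(text):
    """Map accented letters to their base letter (e-acute becomes e)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def anagram_key(phrase, fold=True):
    """Return the sorted letters of phrase, ignoring case and non-letters.

    fold=True  maps accented letters to their unaccented equivalent.
    fold=False keeps each accented letter as a distinct letter.
    """
    if fold:
        phrase = fold_accents(phrase)
    else:
        # NFC keeps each accented letter as a single code point.
        phrase = unicodedata.normalize("NFC", phrase)
    letters = [ch for ch in phrase.lower() if ch.isalpha()]
    return "".join(sorted(letters))
```

Two phrases are anagrams under the chosen mode when their keys are equal, so the user's choice comes down to a single flag. str.isalpha() accepts accented letters, which means the punctuation filter no longer has to be limited to string.ascii_lowercase.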


Anyone able to help ??



Fuzzy
 
