187 said:
I'm wondering if there's a nice clean way to convert accented characters
to the root character, like 'À' (A with a mark above it) to 'A', or 'Ð' (D
with a horizontal line in the middle) to 'D', or 'Ñ' (Spanish N with a
tilde above it) to 'N'.
Or is it better to make a conversion table (hash) to map them? (I would
like to avoid this if there's a better way.)
Thank you.
I wrote a small sub to do just this for a program that needed to
sort lists of words into a sort of faux alphabetical order. It is
obviously NOT going to work for Unicode text containing characters above
code point 255, but it is pretty effective as long as the characters fall
within the Latin-1 (ISO 8859-1) range.
It may be more efficient to remove the short-circuit line (the one that
returns early unless the phrase contains a character in the \xC0-\xFF
range), depending on your mix of words. If most of the words need the
transform, the check just adds overhead. If the words that need the
transform are sparse in your list, it can cut down significantly on
unnecessary transliterations and substitutions.
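One way to find out which variant suits your data is to time both with the core Benchmark module. The following is only a sketch: make_deaccent, the two word lists, and the iteration count are made up for illustration, and the mapping is written with hex escapes so the example does not depend on how the file is saved. Results will vary with your Perl version and your word mix.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical factory: builds the same transform with or without the
# short-circuit check, so both variants share one code path.
sub make_deaccent {
    my ($short_circuit) = @_;
    return sub {
        my $phrase = shift;
        return $phrase if $short_circuit && $phrase !~ /[\xC0-\xFF]/;
        $phrase =~ tr/\xC0-\xC5\xE0-\xE5\xC7\xE7\xC8-\xCB\xE8-\xEB\xCC-\xCF\xEC-\xEF\xD2-\xD6\xD8\xF2-\xF6\xF8\xD1\xF1\xD9-\xDC\xF9-\xFC\xDD\xFF\xFD/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
        $phrase =~ s/\xC6/AE/g;
        $phrase =~ s/\xE6/ae/g;
        return $phrase;
    };
}

my $with_check    = make_deaccent(1);
my $without_check = make_deaccent(0);

# Made-up sample lists; substitute a slice of your real data.
my @mostly_plain    = ('apple', 'banana', 'cherry', "caf\xE9");
my @mostly_accented = ("\xE9t\xE9", "ma\xF1ana", "\xC5ngstr\xF6m", 'plain');

for my $set ([ 'mostly plain', \@mostly_plain ],
             [ 'mostly accented', \@mostly_accented ]) {
    my ($label, $words) = @$set;
    print "$label:\n";
    cmpthese(200_000, {
        with_check    => sub { $with_check->($_)    for @$words },
        without_check => sub { $without_check->($_) for @$words },
    });
}
```

cmpthese prints a small comparison table for each list; whichever variant comes out faster on a representative sample of your own words is the one to keep.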
I used the sub for sorting, but it is easily adapted for other uses.
@sorted = sort { deaccent(lc($a)) cmp deaccent(lc($b)) } @list;
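sub deaccent {
    my $phrase = shift;
    # Short circuit: nothing to do unless an accented Latin-1 character is present.
    return $phrase unless $phrase =~ m/[\xC0-\xFF]/;
    # The literal characters below must reach Perl as single Latin-1 characters
    # (file saved as Latin-1, or as UTF-8 with "use utf8").
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # The AE ligatures expand to two letters, which tr/// cannot do.
    $phrase =~ s/\xC6/AE/g;
    $phrase =~ s/\xE6/ae/g;
    return $phrase;
}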
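For text that does go beyond Latin-1, one possible alternative (a different technique from the sub above, not a drop-in replacement) is to decompose the string with the core Unicode::Normalize module and then delete the combining marks. The name deaccent_unicode is made up for this sketch, and it assumes the input is already a decoded character string. Letters such as 'Ð' and 'Ø' have no canonical decomposition, so they still need an explicit mapping.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strips accents from a decoded character string.
sub deaccent_unicode {
    my $phrase = shift;
    $phrase = NFD($phrase);      # U+00C0 becomes "A" followed by U+0300, and so on
    $phrase =~ s/\p{Mn}//g;      # delete the nonspacing combining marks
    # Eth and O-with-stroke do not decompose, so map them by hand.
    $phrase =~ tr/\x{D0}\x{F0}\x{D8}\x{F8}/DdOo/;
    # The AE ligatures expand to two letters.
    $phrase =~ s/\x{C6}/AE/g;
    $phrase =~ s/\x{E6}/ae/g;
    return $phrase;
}

# U+00C0, U+00D1, U+00D0, space, U+017D (Z with caron), U+00E9
print deaccent_unicode("\x{C0}\x{D1}\x{D0} \x{17D}\x{E9}"), "\n";   # prints "AND Ze"
```

This covers any letter that decomposes into a base letter plus combining marks, at the cost of a normalization pass on every call, so the tr/// version is probably still the lighter choice when the data is known to be Latin-1.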