convert accented characters to root character?

187

I'm wondering if there's a nice clean way to convert accented characters
to the root character, like 'À' (A with a mark above it) to 'A', or 'Ð' (D
with a horizontal line in the middle) to 'D', or 'Ñ' (Spanish N with a
tilde above it) to 'N'.

Or is it better to make a conversion table (hash) to map them? (I would
like to avoid this if there's a better way.)

Thank you.
 
Paul Lalli

Unless you're asking something I'm not understanding (to which I will
always admit the possibility), we *just* had a thread on this. As in,
within the past two days. Common Usenet courtesy dictates that you at
least browse a newsgroup for a bit before posting. The original message
in the thread can be found (among other places) at
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&[email protected]
or search for a message with the title "convert unicode string" sent on
June 6, 2004.

If this thread does not answer your question, please explain how your
question is different.

Paul Lalli
 
Jeff 'japhy' Pinyan

I copied this out of the source code for sirc:

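# Operates on $_. Both lists need 128 entries; the space after \x1f
# is the replacement for \xA0 (no-break space).
tr{\x80-\xff}
{\x00-\x1f !cLxY|$_ca<\-\-R_o+23\'mp.,1o>123?AAAAAAACEEEEIIIIDNOOOOO*0UUUUYPBaaaaaaaceeeeiiiidnooooo:0uuuuypy};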
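The `tr` above acts on `$_`. To apply the same kind of table to a named variable, it can be bound with `=~`. A reduced sketch, where the variable names and the shortened table are illustrative and not taken from sirc:

```perl
use strict;
use warnings;

# Latin-1 byte string: "\xC0" is À, "\xC5" is Å, "\xF6" is ö.
my $text = "\xC0 la carte, \xC5ngstr\xF6m";

# Bind tr to $text instead of $_. The return value is the number
# of characters that were translated.
my $changed = ($text =~ tr{\xC0-\xC5\xF6}{AAAAAAo});

print "$text ($changed changed)\n";
```

Because `tr` maps position by position, the search list and the replacement list should have the same number of entries; here both have seven.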
 
thundergnat

I wrote a small sub to do just this for a program I wrote that needed to
sort lists of words into a sort of faux alphabetical order. It is
obviously NOT going to work for Unicode text containing characters with
code points above 255, but it is pretty effective as long as the characters
fall within the Latin-1 (ISO 8859-1) range.

It may be more efficient to remove the short circuit line that checks to
see if there IS a character that needs to be transliterated, depending
on your mix of words. If many (most) of the words need the transform, it
is just adding overhead. If the words that need transform are sparse in
your list, it can cut down significantly on unnecessary transforms and
substitutions.

I used the sub for sorting, but it is easily adapted for other uses.

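@sorted = sort { deaccent(lc($a)) cmp deaccent(lc($b)) } @list;


sub deaccent {
    my $phrase = shift;
    # Short circuit: nothing to do unless a byte in \xC0-\xFF is present.
    return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
    # One-to-one replacements. Assumes the script itself is saved as
    # Latin-1, so that each accented letter below is a single byte.
    $phrase =~
        tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # Ligatures expand to two letters, so they need s/// rather than tr///.
    $phrase =~ s/\xC6/AE/g;
    $phrase =~ s/\xE6/ae/g;
    return $phrase;
}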
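For text that goes beyond Latin-1, one possible alternative is to decompose each character and drop the combining marks. This is only a sketch, assuming the Unicode::Normalize module that ships with Perl 5.8 and later; the helper name `strip_marks` is made up for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then remove the combining marks.
sub strip_marks {
    my ($str) = @_;
    my $decomposed = NFD($str);    # "\x{C0}" becomes "A" followed by "\x{300}"
    $decomposed =~ s/\p{Mn}//g;    # drop nonspacing marks such as accents
    return $decomposed;
}

my $word = "\x{C0}\x{D1}\x{17D}";    # À, Ñ, Ž
print strip_marks($word), "\n";      # prints "ANZ"
```

Characters with no canonical decomposition, such as 'Ð', 'Ø' and 'Æ', pass through unchanged, so they would still need a table entry or a substitution of their own.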
 
Jürgen Exner

Didn't we just (like two days ago) have this discussion in the thread
"Re: Convert unicode string to "basic characters""?
Is there anything that you believe was not covered in that discussion? Then
you may want to point out specifically which issues you would like to discuss
in more detail.

As has been pointed out there, are you certain you want to convert e.g. "to
hear" ("höra") into "whore" ("hora") or "Austria" ("Österreich") into
"Easter Empire" ("Osterreich")?

jue
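Where simply dropping the accent would change the word, the conversion table from the original question allows replacements of more than one character. A minimal sketch, assuming Latin-1 byte strings and a German-style table; the table contents and the name `transliterate` are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical table: Latin-1 bytes mapped to German-style spellings.
my %translit = (
    "\xC4" => 'Ae', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xF6" => 'oe', "\xFC" => 'ue',
    "\xDF" => 'ss',
);

sub transliterate {
    my ($str) = @_;
    # Look up every high byte; bytes without a table entry stay as they are.
    $str =~ s/([\x80-\xFF])/exists $translit{$1} ? $translit{$1} : $1/ge;
    return $str;
}

print transliterate("\xD6sterreich"), "\n";    # prints "Oesterreich"
```

Which spelling is appropriate depends on the language of the text, so a table like this would have to be chosen per language.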
 
187

Sorry, my news feed doesn't show the other thread you and another person
mentioned. My apologies. I did, however, google about this, but maybe I
was using the wrong search pattern, as I could not find anything to help.
Thanks to all for the replies, though.
 
