convert accented characters to root character?

187 · Jun 9, 2004

I'm wondering if there's a nice clean way to convert accented characters
to the root character, like 'À' (A with mark above it) to 'A', or 'Ð' (D
with a horizontal line in the middle) to 'D' or 'Ñ' (Spanish N with
tilde above it) to 'N'.

Or is it better to make a conversion table (hash) to map them? (I would
like to avoid this if there's a better way.)

Thank you.

Paul Lalli · Jun 9, 2004

I'm wondering if there's a nice clean way to convert accented characters
to the root character, like 'À' (A with mark above it) to 'A', or 'Ð' (D
with a horizontal line in the middle) to 'D' or 'Ñ' (Spanish N with
tilde above it) to 'N'.

Or is it better to make a conversion table (hash) to map them? (I would
like to avoid this if there's a better way.)

Unless you're asking something I'm not understanding (to which I will
always admit the possibility), we *just* had a thread on this. As in,
within the past two days. Common usenet courtesy dictates that you at
least browse a newsgroup for a bit before posting. The original message
in the thread can be found (among other places) at
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&[email protected]
or search for a message with the title "convert unicode string" sent on
June 6, 2004.

If this thread does not answer your question, please explain how your
question is different.

Paul Lalli

Jeff 'japhy' Pinyan · Jun 9, 2004

I'm wondering if there's a nice clean way to convert accented characters
to the root character, like 'À' (A with mark above it) to 'A', or 'Ð' (D
with a horizontal line in the middle) to 'D' or 'Ñ' (Spanish N with
tilde above it) to 'N'.

Or is it better to make a conversion table (hash) to map them? (I would
like to avoid this if there's a better way.)

I copied this out of the source code for sirc:

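# Map Latin-1 bytes \x80-\xff to ASCII look-alikes (operates on $_).
# The replacement list needs 128 entries; the 33rd is a plain space for \xa0.
tr{\x80-\xff}
{\x00-\x1f !cLxY|$_ca<\-\-R_o+23\'mp.,1o>123?AAAAAAACEEEEIIIIDNOOOOO*0UUUUYPBaaaaaaaceeeeiiiidnooooo:0uuuuypy};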

thundergnat · Jun 9, 2004

187 said:
I'm wondering if there's a nice clean way to convert accented characters
to the root character, like 'À' (A with mark above it) to 'A', or 'Ð' (D
with a horizontal line in the middle) to 'D' or 'Ñ' (Spanish N with
tilde above it) to 'N'.

Or is it better to make a conversion table (hash) to map them? (I would
like to avoid this if there's a better way.)

Thank you.

I wrote a small sub to do just this for a program I wrote that needed to
sort lists of words into a sort of faux alphabetical order. It is
obviously NOT going to work for unicode text containing characters above
255 but it is pretty effective as long as the characters fall within
the Latin-1 (ISO 8859-1) range.

It may be more efficient to remove the short circuit line that checks to
see if there IS a character that needs to be transliterated, depending
on your mix of words. If many (most) of the words need the transform, it
is just adding overhead. If the words that need transform are sparse in
your list, it can cut down significantly on unnecessary transforms and
substitutions.

I used the sub for sorting, but it is easily adapted for other uses.

# Strip accents first, then lowercase: on plain byte strings lc() only folds ASCII letters.
@sorted = sort { lc(deaccent($a)) cmp lc(deaccent($b)) } @list;

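sub deaccent {
    my $phrase = shift;
    # Short circuit: nothing to do unless a character in \xC0-\xFF is present.
    return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
    $phrase =~
        tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # Ligatures expand to two letters, so they need s/// rather than tr///.
    $phrase =~ s/\xC6/AE/g;
    $phrase =~ s/\xE6/ae/g;
    return $phrase;
}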
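
For text that goes beyond Latin-1, one alternative to a hand-written table is to decompose each character and drop the combining marks. The sketch below assumes the Unicode::Normalize module and a decoded character string (not raw bytes); the sub name strip_marks is made up for illustration. Characters with no canonical decomposition, such as 'Ð', 'Ø' or 'Æ', pass through unchanged and would still need a table entry.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strip combining marks from a decoded character string.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "\x{D1}" (N with tilde) becomes "N" . "\x{303}"
    $decomposed =~ s/\p{Mn}//g;     # drop the nonspacing marks, keep the base letters
    return $decomposed;
}

# "À la carte, señor Žižek", written with escapes to avoid source-encoding issues
print strip_marks("\x{C0} la carte, se\x{F1}or \x{17D}i\x{17E}ek"), "\n";
# prints "A la carte, senor Zizek"
```

This does not depend on the source file's encoding, since no accented literal appears in the code itself.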

Jürgen Exner · Jun 9, 2004

187 said:
I'm wondering if there's a nice clean way to convert accented
characters to the root character, like 'À' (A with mark above it) to
'A', or 'Ð' (D with a horizontal line in the middle) to 'D' or 'Ñ'
(Spanish N with tilde above it) to 'N'.

Didn't we just (like two days ago) have this discussion in the thread
"Re: Convert unicode string to "basic characters""?
Is there anything that you believe was not covered in that discussion? Then
you may want to point out specifically which issues you would like to discuss
in more detail.

As has been pointed out there, are you certain you want to convert e.g. "to
hear" ("höra") into "whore" ("hora") or "Austria" ("Österreich") into
"Easter Empire" ("Osterreich")?

jue
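
One way to respect that caveat is a per-language table applied before any generic accent stripping. The sketch below assumes German conventions (ö to oe, ß to ss) and Latin-1 code points; the hash %german and the sub name transliterate_de are made up for illustration, and other languages would need their own tables.

```perl
use strict;
use warnings;

# Hypothetical table: German conventions, keyed by Latin-1 code points.
my %german = (
    "\xC4" => 'Ae', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xF6" => 'oe', "\xFC" => 'ue',
    "\xDF" => 'ss',
);

sub transliterate_de {
    my ($text) = @_;
    # Build one character class from the table keys, then substitute via hash lookup.
    my $class = join '', map { quotemeta } keys %german;
    $text =~ s/([$class])/$german{$1}/g;
    return $text;
}

print transliterate_de("\xD6sterreich"), "\n";   # prints "Oesterreich"
```

Characters not listed in the table are left alone, so a generic fallback can still run afterwards.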

187 · Jun 10, 2004

Jürgen Exner said:
Didn't we just (like two days ago) have this discussion in the thread
"Re: Convert unicode string to "basic characters""?
Is there anything that you believe was not covered in that
discussion? Then you may want to point out specifically which issues
you would like to discuss in more detail.

As has been pointed out there, are you certain you want to convert
e.g. "to hear" ("höra") into "whore" ("hora") or "Austria"
("Österreich") into "Easter Empire" ("Osterreich")?

jue

Sorry, my news feed doesn't show the other thread you and another person
mentioned. My apologies. I did, however, google about this, but maybe I
was using the wrong search pattern, as I could not find anything to help.
Thanks to all for the replies, though.
