character substitution using tr()

Max Williams · Apr 22, 2008

I'm using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

which is intended to strip accents out of strings, turning for example
"La Bohème" into "La Boheme". Here's the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

However, it's breaking for me: è is turned into "yy". I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:

"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110

"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.

thanks, max

Jan Dvorak · Apr 22, 2008

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8 and
its support for Unicode (or lack thereof), it might be quite a problem to
get it working.

Max said:
Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.

UTF-8 is a variable-length encoding: the first half of ASCII (which includes
a-zA-Z) is stored unchanged (= 1 byte each), and everything else is encoded
as 2-4 byte characters. Both of the strings are therefore valid UTF-8, but
Ruby 1.8's tr can't operate at the character level, only at the byte level.
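
To make the byte/character mismatch concrete, here is a rough sketch for a current Ruby (2.0 or later). It is not Ruby 1.8 itself: the byte-level behaviour is only approximated by reinterpreting the strings as raw bytes with String#b, and the variable names are made up for illustration.

```ruby
from = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
to   = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

char_count = from.length    # => 55, counted in characters
byte_count = from.bytesize  # => 110, the figure Ruby 1.8's String#size reports

# Character-level translation: the 35th character of `from` (è)
# maps to the 35th character of `to` (e).
by_chars = "è".tr(from, to)       # => "e"

# Byte-level translation (approximation of 1.8): `from` is now 110 single
# bytes but `to` only has 55, so tr pads `to` with its last character "y".
# Both bytes of è end up mapped to that padding.
by_bytes = "è".b.tr(from.b, to)   # => "yy"
```

The second result matches the "yy" from the original report, which suggests the diagnosis (byte positions not lining up with character positions) is the right one.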

Jan

Max Williams · Apr 22, 2008

Jan said:
With Ruby 1.9 your code works fine without modifications. With Ruby 1.8 and
its support for Unicode (or lack thereof), it might be quite a problem to
get it working.

Ah... I'm a bit scared to switch our project over to Ruby 1.9 (I didn't
know there was a 1.9) to solve this problem. I ended up just picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty, but it works.
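
For anyone curious, the individual-gsub workaround could look roughly like the sketch below. The mapping table and the method name are hypothetical (not the actual project code), and the list of accents is deliberately short.

```ruby
# Hypothetical table of the accents that matter most; extend as needed.
COMMON_ACCENTS = {
  "é" => "e", "è" => "e", "ê" => "e",
  "à" => "a", "â" => "a",
  "ç" => "c", "ô" => "o", "ü" => "u"
}

# Replace each accented character with a plain string gsub. A string
# pattern is matched as a whole sequence, so multibyte characters are
# not split the way they are by a byte-level tr.
def strip_common_accents(s)
  COMMON_ACCENTS.inject(s) do |result, (accented, plain)|
    result.gsub(accented, plain)
  end
end

strip_common_accents("La Bohème")  # => "La Boheme"
```

It only covers whatever is in the table, which is why it feels dirty, but it avoids tr entirely.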

Thanks a lot for the info!
max

Sebastian Hungerecker · Apr 22, 2008

Max said:
However, it's breaking for me: è is turned into "yy".

It works if you require 'jcode' first.
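
A minimal sketch of what that might look like. One assumption to flag: as far as I recall, jcode only treats strings as multibyte when $KCODE is set (Rails sets it to UTF-8 already, a plain script may have to do it itself). The version guard is my own addition so the same file also runs on Ruby 1.9 and later, where jcode no longer exists and tr is character-aware anyway.

```ruby
# Assumption: jcode needs $KCODE = "u" to handle UTF-8 on Ruby 1.8.
# Ruby 1.9+ has no jcode library, hence the guard.
if RUBY_VERSION < "1.9"
  $KCODE = "u"
  require "jcode"
end

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

strip_diacritics("La Bohème")  # => "La Boheme"
```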

HTH,
Sebastian

Max Williams · Apr 23, 2008

Sebastian said:
It works if you require 'jcode' first.

Perfect, thanks! That's much more palatable than upgrading ruby.

cheers
max
