character substitution using tr()

Max Williams

I'm using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

which is intended to strip accents out of strings, turning for example
"La Bohème" into "La Boheme". Here's the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

However, it's breaking for me: è is turned into "yy". I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:

"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110

"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.

thanks, max
 
Jan Dvorak

Max said:
I'm using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

which is intended to strip accents out of strings, turning for example
"La Bohème" into "La Boheme". Here's the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8
and its support for Unicode (or lack thereof), it might be quite a
problem to get it working.

Max said:
Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.

UTF-8 is a variable-length encoding: ASCII characters (which include
a-zA-Z) are stored unchanged as 1 byte each, and everything else is
encoded as 2-4 bytes per character. Both of the strings are therefore
valid UTF-8, but Ruby 1.8's tr can't operate at the character level,
only at the byte level.
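
The byte-level behaviour can be imitated on a current Ruby by forcing both translation strings to binary with String#b. This is only a rough sketch of what 1.8's tr effectively did, not 1.8 itself, and the constant names ACCENTED and PLAIN are made up for illustration:

# Hypothetical constant names; the strings are the ones from Max's method.
ACCENTED = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
PLAIN    = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

# Characters vs. bytes: each accented letter here takes two bytes in UTF-8.
puts ACCENTED.length    # 55
puts ACCENTED.bytesize  # 110 (what String#size reported on 1.8)
puts PLAIN.bytesize     # 55

# Character-level tr, as on Ruby 1.9 and later.
puts "La Bohème".tr(ACCENTED, PLAIN)  # La Boheme

# Byte-level tr: String#b returns a binary copy, so tr sees single bytes.
broken = "La Bohème".b.tr(ACCENTED.b, PLAIN.b)
puts broken  # La Bohyyme

As far as I can tell, the "yy" comes from padding: the 110-byte source list is longer than the 55-byte replacement list, so tr repeats the last replacement character, y, and both bytes of è end up mapped to it.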

Jan
 
Max Williams

Jan said:
With Ruby 1.9 your code works fine without modifications. With Ruby 1.8
and its support for Unicode (or lack thereof), it might be quite a
problem to get it working.

Ah... I'm a bit scared to change our project over to Ruby 1.9 (I didn't
know there was a 1.9) to solve this problem. I ended up just picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty but it works.
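
For anyone landing here later, the workaround looks roughly like this. It is a sketch only: the table below is a guess at "commonly used" accents, and the names COMMON_ACCENTS and strip_common_accents are made up for this example.

# Hypothetical table; extend it with whatever accents your data contains.
COMMON_ACCENTS = {
  "à" => "a", "â" => "a", "ç" => "c",
  "è" => "e", "é" => "e", "ê" => "e",
  "î" => "i", "ô" => "o", "ù" => "u", "ü" => "u"
}

def strip_common_accents(s)
  # Run one gsub per table entry, feeding each result into the next.
  COMMON_ACCENTS.inject(s) do |result, (accented, plain)|
    result.gsub(accented, plain)
  end
end

puts strip_common_accents("La Bohème")  # La Boheme

Unlike tr, gsub replaces whole substrings, so it should not matter that 1.8 sees each accented letter as two bytes: the two-byte sequence is matched and replaced as a unit. Anything missing from the table is left untouched.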

Thanks a lot for the info!
max
 
Sebastian Hungerecker

Max said:
However, it's breaking for me: è is turned into "yy".

It works if you require 'jcode' first.
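
A minimal sketch of how that could look, assuming $KCODE has to be set to UTF-8 for jcode's replacement String#tr to treat the strings as multibyte (Rails may already set it for you). The version guard is an addition for this example, so that the same file also loads on 1.9 and later, where jcode no longer exists:

if RUBY_VERSION < "1.9"
  # Ruby 1.8 only: jcode redefines String#tr and friends to honour $KCODE.
  $KCODE = "u"
  require "jcode"
end

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

puts strip_diacritics("La Bohème")  # La Boheme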

HTH,
Sebastian
 
