accents and String#tr

Xavier Noria

I wrote this method

def self.normalize_for_sorting(s)
  return nil unless s
  norm = s.downcase
  norm.tr!('ÁÉÍÓÚ', 'aeiou')
  norm.tr!('ÀÈÌÒÙ', 'aeiou')
  norm.tr!('ÄËÏÖÜ', 'aeiou')
  norm.tr!('ÂÊÎÔÛ', 'aeiou')
  norm.tr!('áéíóú', 'aeiou')
  norm.tr!('àèìòù', 'aeiou')
  norm.tr!('äëïöü', 'aeiou')
  norm.tr!('âêîôû', 'aeiou')
  norm
end

to normalize strings for sorting. This script is UTF-8, everything in my application is UTF-8, and $KCODE is 'u'.

But it does not work, examples:

Andrés -> andruos
López -> luupez
Pérez -> puorez

I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou'), to no avail. Any ideas?

-- fxn
 
Robin Stocker

Hi,

My guess is that the "tr" method treats its arguments as a string of
bytes. And because characters with accents need more than 1 byte in
UTF-8, #tr doesn't do what you would expect it to. (It's not even tr's
fault, how is it supposed to know that two bytes actually represent a
single character?)
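
To make the guess concrete, here is a small sketch that imitates a byte-wise tr by hand and reproduces the results Xavier reported. It rests on two assumptions: the shorter replacement string is padded with its last character (which is how tr is documented to behave), and when the same byte shows up more than once in the source set, the later mapping wins (an assumption that happens to match the reported output). bytewise_tr is a made-up helper name, not a Ruby method, and only one of the eight tr! calls is imitated.

```ruby
# Imitates a tr that works on bytes instead of characters.
# Assumptions: `to` is padded with its last byte, and a later
# mapping for the same byte overrides an earlier one.
def bytewise_tr(str, from, to)
  from_bytes = from.bytes.to_a
  to_bytes   = to.bytes.to_a
  if to_bytes.size < from_bytes.size
    to_bytes += [to_bytes.last] * (from_bytes.size - to_bytes.size)
  end

  # Build the byte-to-byte translation table.
  table = {}
  from_bytes.zip(to_bytes) { |f, t| table[f] = t }

  # Translate every byte, leaving unmapped bytes alone.
  str.bytes.map { |b| table.fetch(b, b) }.pack('C*').force_encoding('UTF-8')
end

accented = 'áéíóú'
puts accented.length    # 5 (characters)
puts accented.bytesize  # 10 (bytes in UTF-8)

puts bytewise_tr('andrés', accented, 'aeiou')  # andruos
puts bytewise_tr('lópez',  accented, 'aeiou')  # luupez
puts bytewise_tr('pérez',  accented, 'aeiou')  # puorez
```

Every accented vowel here starts with the same lead byte, so that byte ends up mapped to 'u', and the second byte of each character lands on whatever position it happens to occupy in the padded replacement.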

The solution is not to use #tr!, but #gsub!. It isn't as short, but at
least it's right ;)

norm.gsub!('ä', 'a')
norm.gsub!('ë', 'e')
# and so on...

And because that is against DRY (Don't Repeat Yourself), I would
recommend storing the mapping as a hash:

accents = { 'ä' => 'a', 'ë' => 'e', ... }
accents.each do |accent, replacement|
  norm.gsub!(accent, replacement)
end
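
Put together, the hash approach could look roughly like the sketch below. The accent_map helper and the exact set of characters in it are assumptions for illustration (the hash above is elided, so this is just one plausible way to fill in a table, covering the vowels from the original method). Uppercase accented letters are listed too, so the result does not depend on whether downcase handles non-ASCII characters.

```ruby
# Hypothetical helper: builds an accent => plain-vowel hash.
# Extend the groups with whatever characters your data needs.
def accent_map
  groups = {
    'a' => 'áàäâÁÀÄÂ',
    'e' => 'éèëêÉÈËÊ',
    'i' => 'íìïîÍÌÏÎ',
    'o' => 'óòöôÓÒÖÔ',
    'u' => 'úùüûÚÙÜÛ'
  }
  map = {}
  groups.each do |plain, accented|
    accented.each_char { |accent| map[accent] = plain }
  end
  map
end

def normalize_for_sorting(s)
  return nil unless s
  norm = s.downcase  # new string, so the caller's string is not modified
  accent_map.each do |accent, replacement|
    norm.gsub!(accent, replacement)
  end
  norm
end

names = ['Pérez', 'Andrés', 'López']
puts names.sort_by { |name| normalize_for_sorting(name) }.inspect
# ["Andrés", "López", "Pérez"]
```

The table is rebuilt on every call, which keeps the sketch short; if that matters, build it once and reuse it.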

Regards,
Robin Stocker
 
