Max Williams
I'm using a method that I found on the acts_as_ferret site:
http://projects.jkraemer.net/acts_as_ferret/#UTF-8support
which is intended to strip accents out of strings, turning for example
"La Bohème" into "La Boheme". Here's the method:
def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end
However, it's breaking for me: è is turned into "yy". I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:
"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110
"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55
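The byte-versus-character mismatch can be checked on a single character. This is only a minimal sketch, assuming a Ruby version where String#bytes and String#chars are available (1.8.7 or later) and where strings are treated as UTF-8; the variable names are just for illustration:

```ruby
# "\u00E8" is è, written as an escape so the example does not depend on
# how this file itself is encoded.
e_grave = "\u00E8"

# The UTF-8 encoding of è is two bytes, even though it is one character.
bytes      = e_grave.bytes.to_a
char_count = e_grave.chars.to_a.size

puts "bytes: #{bytes.inspect}, characters: #{char_count}"
```

If tr() walks the strings byte by byte, each half of a two-byte character would be looked up separately, which would fit with one accented letter coming out as two replacement letters.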
Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.
thanks, max
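For reference, one possible character-by-character variant is sketched below. It assumes a Ruby version where String#chars splits a UTF-8 string into whole characters (1.9 or later), and it is not the acts_as_ferret code: the method name strip_diacritics_by_char and the local variable names are hypothetical. It only covers the one-to-one part of the mapping, not the Æ/Ð/Þ/ß substitutions.

```ruby
# Hypothetical helper: maps each accented character to its plain
# counterpart by character position rather than by byte position.
def strip_diacritics_by_char(s)
  accented = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
  plain    = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

  # Pair the two strings up one character at a time.
  # (Rebuilt on every call to keep the sketch self-contained.)
  map = Hash[accented.chars.to_a.zip(plain.chars.to_a)]

  # Replace each character that has an entry; leave everything else alone.
  s.chars.map { |ch| map.fetch(ch, ch) }.join
end

puts strip_diacritics_by_char("La Bohème")
```

Because the lookup is keyed on whole characters, the differing byte lengths of the two translation strings should no longer matter.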