Localized character sort?

Andy Dingley · Oct 22, 2009

Any advice on how to internationalize a web app so that it supports a
sortable table, where clicking column headers sorts by that column?
The basic underlying tech for this is Java on the server and Ajax on
the web client.

The particular problem is in how to localize the sorting, as sorting
non-ASCII characters according to their locale is an important
requirement. Java's java.text.Collator can do this easily, as it
supports comparisons and sorts for a parameterized locale.
JavaScript's localeCompare() though picks this collation locale up
from the browser environment.

How robust is localeCompare() for this?

What's cross-platform support like for localeCompare()?

What's localized support like for localeCompare()? Will a browser
running in a call centre in India be able to correctly sort Arabic?

Another idea was to return the results of Java's
Collator.getCollationKey() along with the string data, and have the
local JavaScript sort on that instead. This just needs a simple byte
compare, not anything l10n-aware.
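
A minimal sketch of the client side of that idea, under stated assumptions: the server is assumed to hex-encode the bytes of each collation key (two lowercase hex digits per byte) and send them next to the text. The row shape, the names compareKeys and sortByCollationKey, and the key values themselves are hypothetical placeholders, not real Collator output.

```javascript
// Hypothetical payload: display text plus a server-computed sort key.
// The keys are made-up placeholders; the assumption is that the server
// hex-encodes the collation key bytes, two lowercase hex digits per byte.
var rows = [
  { text: 'Zebra',   key: '5a0100' },
  { text: 'Érica',   key: '2e1020' },
  { text: 'antonio', key: '2a0101' },
  { text: 'Eric',    key: '2e10' }
];

// With fixed-width lowercase hex, plain string comparison gives the same
// order as an unsigned byte-by-byte comparison of the underlying keys,
// so nothing locale-aware is needed in the browser.
function compareKeys(a, b) {
  if (a.key < b.key) { return -1; }
  if (a.key > b.key) { return 1; }
  return 0;
}

function sortByCollationKey(list) {
  // slice() copies the array so the caller's order is left untouched.
  return list.slice().sort(compareKeys);
}

var sortedTexts = sortByCollationKey(rows).map(function (row) {
  return row.text;
});
```

The trade-off is payload size: every sortable cell carries an extra key, and the keys are only valid for the locale the server used when generating them.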

Thanks for any suggestions

SAM · Oct 22, 2009

On 10/22/09 6:14 PM, Andy Dingley wrote:

> Any advice on how to internationalize a web app so that it supports a
> sortable table, where clicking column headers sorts by that column?
> The basic underlying tech for this is Java on the server and Ajax on
> the web client.

I know nothing about Arabic, Navajo, Indian or other alphabets,
but it seems that JS sorting follows the order of the page's specified
charset.

Here in France:

<script type="text/javascript" charset="iso-8859-1">
var a = 'abcdABCD()èéù*785ÚÝ';
a = a.split('');
alert(a); // a,b,c,d,A,B,C,D,(,),è,é,ù,*,7,8,5,Ú,Ý
a.sort();
alert(a); // (,),*,5,7,8,A,B,C,D,a,b,c,d,Ú,Ý,è,é,ù
</script>

And I don't know whether that can be seen as a real/correct sort.
Do conventions say that uppercase should come before lowercase?

Switching to utf-8, the results are the same.

I think that for other alphabets, using utf-8, all sorting will be
correct? (at least as above)

> The particular problem is in how to localize the sorting, as sorting
> non-ASCII characters according to their locale is an important
> requirement.

If they are on a page in English, that will sort in ASCII order (or
ISO-8859 or utf-8, depending on the charset used; each one is in the
same order).

If the user is on a page in Russian, the charset will perhaps be
Cyrillic or utf-8, and that ought to work (?).

If the user has a Cyrillic system and loads a page in English with
utf-8 charset headers, that ought to sort correctly too, no?

Andy Dingley · Oct 22, 2009

> but it seems that JS sorting does it following the page's specified
> charset's order.

I don't know what localeCompare() uses to indicate the sort collation
to use.

It can't be the charset, because for HTML that's always Unicode.

It shouldn't be the encoding, because that's not precise enough.
Scandinavian languages sv, da & no sort some of their vowels
differently, but they all use the same ISO-8859 encoding to represent
them.

I think it's probably the language setting, which could either be
indicated by markup within the page (the lang attribute) or else be
taken from the browser's preferences. This also raises the
question of whether the best way to sort something is according to
the collation for the content, or according to the expectations of
the user.
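
As a hedged aside on where the collation locale comes from: engines that implement the ECMAScript Internationalization API (ECMA-402, which postdates this thread) let the caller name the locale explicitly through Intl.Collator instead of inheriting it from the environment. The sketch below assumes such an engine with Swedish and German locale data available; the sample words are illustrative only.

```javascript
// Assumes an engine that provides the Intl object (ECMA-402).
var words = ['z', '\u00e4', 'a']; // \u00e4 is a-umlaut

// Swedish treats a-umlaut as a separate letter placed after z.
var swedish = words.slice().sort(new Intl.Collator('sv').compare);

// German treats a-umlaut as a variant of a, so it stays next to a.
var german = words.slice().sort(new Intl.Collator('de').compare);

// Reports the locale the engine actually used; it may fall back to a
// default when the requested locale data is not available.
var resolved = new Intl.Collator('sv').resolvedOptions().locale;
```

Passing the locale this way lets the application decide between the content's language and the user's preference, rather than leaving the choice to the browser.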

Although localeCompare() has "locale" in the name, I think it's more
likely that it uses the plain language and not the locale. I'm not
aware (but would be interested to know) of any situations where (for
example) fr_FR, fr_BE & fr_CA had different sort ordering. If any
language does do this, I suspect it's most likely to be Chinese and
variations between the mainland, Hong Kong & Taiwan.

We are incidentally using UTF-8 throughout. It's the only practical way
to support an internationalized app from a single codebase, and also the
best way to place multiple different languages on the same page.

> If they are on a page in english that will sort in ASCII (or ISO-8859 or
> utf-8, depends the used charset, each one is in same order)

The problem, just considering Europe, is that accented characters are
non-ASCII and so a crude sort on the codepoint order alone will sort
A,B,C...Z,Ć, placing all of the accents after the ASCII Z. A
better sort algorithm for imposing "English sort order" onto
pan-European content is to map Ć onto plain C, then sort.
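
The mapping described above can be sketched as follows. This is only one possible approach, not something stated in the thread: it assumes an engine with String.prototype.normalize (ES2015), and the helper names foldAccents and compareFolded and the sample names are illustrative.

```javascript
// Fold accents: decompose to NFD, then drop combining marks
// (U+0300..U+036F), so that C-acute becomes plain C.
function foldAccents(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function compareFolded(a, b) {
  var fa = foldAccents(a), fb = foldAccents(b);
  if (fa < fb) { return -1; }
  if (fa > fb) { return 1; }
  // Tie-break on the original strings so the order is deterministic.
  return a < b ? -1 : (a > b ? 1 : 0);
}

var names = ['Zofia', '\u0106wik', 'Cezary', 'Adam']; // \u0106 is C-acute

// Default sort compares UTF-16 code units: C-acute lands after Zofia.
var byCodeUnit = names.slice().sort();

// Folded sort: C-acute lands among the other C names.
var byFolded = names.slice().sort(compareFolded);
```

Letters with no canonical decomposition (for example ø or ł) pass through unchanged, and the approach says nothing about non-Latin scripts, so it is at best a fallback for Latin-script content.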

SAM · Oct 22, 2009

On 10/22/09 9:49 PM, Andy Dingley wrote:

> The problem, just considering Europe, is that accented characters are
> non-ASCII and so a crude sort on the codepoint order alone will sort
> A,B,C...Z,Ć, placing all of the accents after the ASCII Z. A
> better sort algorithm for imposing "English sort order" onto
> pan-European content is to map Ć onto plain C, then sort.

It seems that an accented word given to Google for searching
is first converted to ASCII before the search is made.
I suppose that some RegExp would have to be applied to European words
before trying to sort them (removing accents and uppercasing everything).
That doesn't give a way to sort Asian words (are they even words?).
What about Arabic words, whose letters (glyphs) change according to their
place in the word?

I think your tables will only be sortable on numeric columns ;-)

About "how to localize": ask the user for their preference?

JR · Oct 23, 2009

I usually sort arrays using the localeCompare() method. As far as I
can see, localeCompare() handles accented characters and case
regardless of the charset (either utf-8 or iso-8859-1).

E.g.

<script type="text/javascript">
// Sort a sample list of names with the browser's locale-aware comparison.
function sortLocalized() {
  var arr = ['Joao', 'Antonio', 'antonio', 'jansen', 'Johnson',
             'Antônio', 'João', 'Érica', 'Eric', 'Jose', 'josé', 'joão'],
      sortLC = function (a, b) {
        if (typeof a === 'string' && typeof b === 'string') {
          return a.localeCompare(b);
        }
        return 0; // treat non-string pairs as equal instead of returning undefined
      };
  arr.sort(sortLC);
  return arr.join(", ");
}
</script>

Cheers,
JR

Andy Dingley · Oct 23, 2009

> It seems that an accented word given to google for searching
> is converted in ASCII at first before making the search(research).

Is that just for Google (in English) or Google in other languages,
where accents are significant and may indicate conceptually different
words?

In English this is certainly a useful approach. It might even be
useful for "the web" in general, given the generally poor
use of accents on the English-dominated intawebs. It would be a bit
of a shame if Google's French support was this restricted.

We've decided to reject it for our app, mostly because we need to have
good support for non-Latin (e.g. Cyrillic, Arabic and Chinese)
languages and writing systems.

SAM · Oct 23, 2009

On 10/23/09 3:38 AM, JR wrote:

> I usually sort arrays using the localeCompare() method. As far as I
> can see, localeCompare() handles accented characters and case
> regardless of the charset (either utf-8 or iso-8859-1).
>
> (snip code)

I obtain:
Antonio, Antônio, antonio, Eric, Érica, Joao, João, Johnson, Jose,
jansen, joão, josé

where antonio isn't directly after Antonio
and joão isn't with the other joaos

not too bad but not yet perfect (and in Japanese?)

My test was in a page in utf-8 and a browser with 'fr' as preferred
language (then 'en' and nothing more specified)

SAM · Oct 23, 2009

On 10/23/09 11:07 AM, Andy Dingley wrote:

> Is that just for Google (in English) or Google in other languages,
> where accents are significant and may indicate conceptually different
> words?

In fact I do not know what they do; I just no longer put accents on
words when I ask google.fr something.

e.g.: <http://www.google.fr/search?q=le+ba+blesse>
which immediately finds 'le bât blesse'.
So why worry about accents?

> In English this is certainly a useful approach. It might even be
> useful for "the web" in general, given the generally poor
> use of accents on the English-dominated intawebs. It would be a bit
> of a shame if Google's French support was this restricted.

Well, Google's search engine is certainly a little stronger than a
simple regular expression, as it finds 'lycée de Versailles' when asked
for 'lice de versay'

(whereas it doesn't find the lycée if the query was 'lisse de ...',
'lisse' being a valid French word)

Andy Dingley · Oct 23, 2009

> Well the search engine of google is certainly a little stronger than a
> simple reg expression as it find 'lycée de Versailles" when it is asked
> 'lice de versay'

Lemmatisation and "stemming" are involved. It's worth a read of a
good text on Lucene (Manning's "Hibernate Search in Action" is a good
read) for discussion of techniques here.

JR · Oct 23, 2009

SAM wrote:

> I obtain:
> Antonio, Antônio, antonio, Eric, Érica, Joao, João, Johnson, Jose,
> jansen, joão, josé
>
> where antonio isn't directly after Antonio
> and joão isn't with the other joaos
>
> not too bad but not yet perfect (and in Japanese?)
>
> My test was in a page in utf-8 and a browser with 'fr' as preferred
> language (then 'en' and nothing more specified)

Dear SAM,
Thanks for testing. It was a weird result, maybe because you don't
have ' ã ' in French (?)

In FF3, Brazilian-Portuguese version, the result was:

"antonio, Antonio, Antônio, Eric, Érica, jansen, Joao, joão, João,
Johnson, Jose, josé"

Therefore 'antonio' comes before 'Antonio', and 'joão' is situated
between 'Joao' and 'João', which is correct in Portuguese.
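
Since the two results above differ only by browser environment, one way to take the environment out of the picture is to pin the collation locale. This is a hedged sketch, assuming an engine with the ECMAScript Internationalization API (Intl), which postdates this thread; it reuses the name list from the example, and the exact output still depends on the engine's collation data.

```javascript
var names = ['Joao', 'Antonio', 'antonio', 'jansen', 'Johnson',
             'Antônio', 'João', 'Érica', 'Eric', 'Jose', 'josé', 'joão'];

// Pin the collation locale instead of inheriting it from the browser.
var collator = new Intl.Collator('pt-BR');
var sortedNames = names.slice().sort(collator.compare);

// caseFirst is a standard Intl.Collator option; 'upper' asks for
// uppercase to sort before lowercase when strings otherwise tie.
var upperFirst = names.slice().sort(
  new Intl.Collator('pt-BR', { caseFirst: 'upper' }).compare);
```

With the locale fixed, every user gets the same order for the same data, and whether uppercase or lowercase comes first becomes an explicit choice rather than a browser default.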

Cheers,
João Rodrigues (JR)

SAM · Oct 25, 2009

On 10/23/09 10:24 PM, JR wrote:

> Dear SAM,
> Thanks for testing. It was a weird result, maybe because you don't
> have ' ã ' in French (?)

We'll have to suppose it.

(snip)

> Cheers,
> João Rodrigues (JR)

Ha! JR isn't for Junior ;-)
