Localized character sort?

Andy Dingley

Any advice on how to internationalize a web app so that it supports a
sortable table, where clicking column headers sorts by that column?
The basic underlying tech for this is Java on the server and Ajax on
the web client.

The particular problem is in how to localize the sorting, as sorting
non-ASCII characters according to their locale is an important
requirement. Java's java.text.Collator can do this easily, as it
supports comparisons and sorts for a parameterized locale.
JavaScript's localeCompare() though picks this collation locale up
from the browser environment.

How robust is localeCompare() for this?

What's cross-platform support like for localeCompare()?

What's localized support like for localeCompare()? Will a browser
running in a call centre in India be able to correctly sort Arabic?


Another idea was to return the results of Java's
Collator.getCollationKey() along with the string data, and have the
local JavaScript sort on that instead. This just needs a simple byte
compare, not anything l10n-aware.
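
A minimal sketch of the client side of that idea, under some loud assumptions: the server sends each row with a hypothetical `sortKey` field holding the bytes of the Java collation key encoded as fixed-width lowercase hex (two digits per byte), and those bytes are meant to be compared as unsigned values, most significant byte first. With that encoding a plain string comparison gives the same order as a byte-by-byte compare. The row shape, the field name and the key values below are invented for illustration only.

```javascript
// Hypothetical row shape: { text: string, sortKey: string }.
// sortKey is ASSUMED to be the server-side collation key bytes, encoded as
// fixed-width lowercase hex, so string order equals unsigned byte order.
function compareBySortKey(a, b) {
  if (a.sortKey < b.sortKey) return -1;
  if (a.sortKey > b.sortKey) return 1;
  return 0;
}

// Return a sorted copy; no locale-aware logic is needed on the client.
function sortRows(rows) {
  return rows.slice().sort(compareBySortKey);
}

// Placeholder keys, made up for this sketch; NOT real Collator output.
const rows = [
  { text: 'zebra', sortKey: '5a0100' },
  { text: 'Érica', sortKey: '2f3301' },
  { text: 'Eric',  sortKey: '2f33' }
];

const sorted = sortRows(rows).map(function (r) { return r.text; });
console.log(sorted.join(', '));
```

The trade-off is that every sortable cell carries an extra key over the wire, and the keys are only valid for the locale the server used to build them.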

Thanks for any suggestions
 
SAM

On 10/22/09 6:14 PM, Andy Dingley wrote:
> Any advice on how to internationalize a web app so that it supports a
> sortable table, where clicking column headers sorts by that column?
> The basic underlying tech for this is Java on the server and Ajax on
> the web client.

I know nothing about Arabic, Navajo, Indian or other alphabets,
but it seems that JS sorting follows the order of the page's specified
charset.

Here in France:

<script type="text/javascript" charset="iso-8859-1">
var a = 'abcdABCD()èéù*785ÚÝ';
a = a.split('');
alert(a); // a,b,c,d,A,B,C,D,(,),è,é,ù,*,7,8,5,Ú,Ý
a.sort();
alert(a); // (,),*,5,7,8,A,B,C,D,a,b,c,d,Ú,Ý,è,é,ù
</script>

And I don't know whether that can be seen as a real/correct sort.
Is there a convention that uppercase sorts before lowercase?

Switching the page to utf-8, the results are the same.

I think that in other alphabets, using utf-8, all sorting will be
correct? (at least as above)

> The particular problem is in how to localize the sorting, as sorting
> non-ASCII characters according to their locale is an important
> requirement.

If they are on a page in English, that will sort in ASCII order (or
ISO-8859 or utf-8, depending on the charset used; each one has the same
order).

If the user is on a page in Russian, the charset may be Cyrillic or
utf-8, and that ought to work (?).

If the user has a Cyrillic system and loads a page in English with
utf-8 charset headers, that ought to sort correctly too, no?
 
Andy Dingley

> but it seems that JS sorting does it following the page's specified charset's order.

I don't know what localeCompare() uses to indicate the sort collation
to use.

It can't be the charset, because for HTML that's always Unicode.

It shouldn't be the encoding, because that's not precise enough.
The Scandinavian languages sv, da & no sort some of their vowels
differently, but they all use the same ISO-8859-1 encoding to represent
them.
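
To illustrate the Scandinavian point: engines that implement the ECMAScript Internationalization API (`Intl.Collator`, which postdates this thread) let the script name the collation language explicitly instead of inheriting it from the browser. The sketch below assumes the engine ships collation data for `sv` and `da`; the helper name `sortForLocale` is made up for this example.

```javascript
// Sort a copy of `items` using the collation rules of an explicit
// language tag rather than whatever the browser environment provides.
function sortForLocale(items, locale) {
  const collator = new Intl.Collator(locale);
  return items.slice().sort(collator.compare);
}

// Same three vowels, same encoding, two different collations.
const vowels = ['ö', 'å', 'ä'];
const swedish = sortForLocale(vowels, 'sv');
const danish = sortForLocale(vowels, 'da');

console.log('sv: ' + swedish.join(' '));
console.log('da: ' + danish.join(' '));
```

If the engine lacks data for a requested language it silently falls back to a default, so it is worth checking `Intl.Collator.supportedLocalesOf()` before relying on the result.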

I think it's probably the language setting, which could either be
indicated by markup within the page (the lang attribute), or else
be taken from the browser's preferences. This also raises the
question of whether the best way to sort something is according to
the collation for the content, or according to the expectations of the
user.

Although localeCompare() has "locale" in the name, I think it's more
likely that it uses the plain language and not the locale. I'm not
aware (but would be interested to know) of any situations where (for
example) fr_FR, fr_BE & fr_CA had different sort ordering. If any
language does do this, I suspect it's most likely to be Chinese and
variations between the mainland, Hong Kong & Taiwan.


We are incidentally using UTF-8 throughout. It's the only practical way
to support an internationalized app from a single codebase, and also the
best way to place multiple different languages on the same page.

> If they are on a page in english that will sort in ASCII (or ISO-8859 or
> utf-8, depends the used charset, each one is in same order)

The problem, just considering Europe, is that accented characters are
non-ASCII, so a crude sort on code point order alone will sort
A, B, C ... Z, Ć, placing all of the accented letters after the ASCII Z.
A better sort algorithm for imposing "English sort order" onto
pan-European content is to map Ć onto plain C, then sort.
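
A rough sketch of that mapping in JavaScript, assuming an engine that provides `String.prototype.normalize`: decompose each string, drop the combining marks, and compare what is left. The helper names are invented for this example. Letters that have no canonical decomposition (for example ø or ł) pass through unchanged, so this is only an approximation of a real collation.

```javascript
// Decompose to NFD and strip combining diacritical marks: 'Ć' -> 'C'.
function foldAccents(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Compare two strings by their accent-folded forms.
function compareFolded(a, b) {
  const fa = foldAccents(a);
  const fb = foldAccents(b);
  if (fa < fb) return -1;
  if (fa > fb) return 1;
  return 0;
}

const names = ['Zoran', 'Ćiro', 'Carl'];
const crude = names.slice().sort();               // plain code unit order
const folded = names.slice().sort(compareFolded); // accents mapped away first

console.log('crude:  ' + crude.join(', '));
console.log('folded: ' + folded.join(', '));
```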
 
SAM

On 10/22/09 9:49 PM, Andy Dingley wrote:
> The problem, just considering Europe, is that accented characters are
> non-ASCII and so a crude sort on the codepoint order alone will sort
> A,B,C...Z,Ć, placing all of the accents after the ASCII Z. A
> better sort algorithm for imposing "English sort order" onto
> pan-European content is to map Ć onto plain C, then sort.

It seems that an accented word given to Google is first converted to
ASCII before the search is made.
I suppose that some RegExp would have to be applied to European words
before trying to sort them (strip the accents and uppercase everything).
That doesn't give a way to sort Asian words (are they even words?).
And what about Arabic words, whose letters (glyphs) change according to
their place in the word?

I think your tables will only be sortable on numeric columns ;-)

About "how to localize": ask the user for their preference?
 
JR

I usually sort arrays using the localeCompare() method. As far as I can
see, localeCompare() takes accents and case into account
regardless of the charset (either utf-8 or iso-8859-1).

E.g.:

<script type="text/javascript">
function sortLocalized() {
  var arr = ['Joao', 'Antonio', 'antonio', 'jansen', 'Johnson',
             'Antônio', 'João', 'Érica', 'Eric', 'Jose', 'josé', 'joão'];
  // Compare using the browser's locale-aware collation. Coerce both
  // sides to strings so the comparator always returns a number.
  var sortLC = function (a, b) {
    return String(a).localeCompare(String(b));
  };
  arr.sort(sortLC);
  return arr.join(", ");
}
</script>

Cheers,
JR
 
Andy Dingley

> It seems that an accented word given to google for searching
> is converted in ASCII at first before making the search(research).

Is that just for Google (in English) or Google in other languages,
where accents are significant and may indicate conceptually different
words?

In English this is certainly a useful approach. It might even be
useful for "the web" in general, given the generally poor
use of accents on the English-dominated intawebs. It would be a bit
of a shame if Google's French support was this restricted.

We've decided to reject it for our app, mostly because we need to have
good support for non-Latin (e.g. Cyrillic, Arabic and Chinese)
languages and writing systems.
 
SAM

On 10/23/09 3:38 AM, JR wrote:
> I'm used to sort arrays using the localeCompare() method. I can see
> that localeCompare() considers accented and case-sensitive characters
> regardless of the charset (either utf-8 or iso-8859-1).
>
> E.g
>
> <script type="text/javascript">
> function sortLocalized() {
> var arr = ['Joao', 'Antonio', 'antonio', 'jansen', 'Johnson',
> 'Antônio', 'João', 'Érica', 'Eric', 'Jose', 'josé', 'joão'],
> sortLC = function(a, b) {
> if (typeof a === 'string' && typeof b === 'string') {
> return a.localeCompare(b);
> }
> };
> arr.sort(sortLC);
> return arr.join(", ");;
> }
> </script>

I obtain:
Antonio, Antônio, antonio, Eric, Érica, Joao, João, Johnson, Jose,
jansen, joão, josé

where antonio doesn't come right after Antonio
and joão isn't with the other Joaos.

Not too bad, but not yet perfect (and in Japanese?)

My test was on a utf-8 page, in a browser with 'fr' as the preferred
language (then 'en', and nothing more specified).
 
SAM

On 10/23/09 11:07 AM, Andy Dingley wrote:
> Is that just for Google (in English) or Google in other languages,
> where accents are significant and may indicate conceptually different
> words?

In fact I don't know what they do; I simply no longer accent words
when I ask google.fr for something.

e.g. <http://www.google.fr/search?q=le+ba+blesse>
immediately finds 'le bât blesse'.
So why worry about accents?

> In English this is certainly a useful approach. It might even be
> useful for "the web" in general, given the generally poorly correct
> use of accents on the English-dominated intawebs. It would be a bit
> of a shame if Google's French support was this restricted.

Well, Google's search engine is certainly a bit stronger than a
simple regular expression, as it finds 'lycée de Versailles' when asked
for 'lice de versay'

(while it doesn't find the lycée if the query was 'lisse de ...',
'lisse' being a valid French word)
 
Andy Dingley

> Well the search engine of google is certainly a little stronger than a
> simple reg expression as it find 'lycée de Versailles' when it is asked
> 'lice de versay'

Lemmatisation and "stemming" are involved. It's worth a read of a
good text on Lucene (Manning's "Hibernate Search in Action" is a good
read) for discussion of techniques here.
 
JR

SAM wrote:
> I obtain :
> Antonio, Antônio, antonio, Eric, Érica, Joao, João, Johnson, Jose,
> jansen, joão, josé
>
> where antonio isn't after Antonio
> and joão isn't whith the other joaos
>
> not too bad but not yet perfect (and in Japanese ?)
>
> My test was in a page in utf-8 and a browser with 'fr' as preferred
> language (then 'en' and nothing more specified)

Dear SAM,
Thanks for testing. It was a weird result, maybe because you don't
have ' ã ' in French (?)

In FF3, Brazilian-Portuguese version, the result was:

"antonio, Antonio, Antônio, Eric, Érica, jansen, Joao, joão, João,
Johnson, Jose, josé"

Therefore 'antonio' comes before 'Antonio', and 'joão' is situated
between 'Joao' and 'João', which is correct in Portuguese.
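
For what it's worth, in engines that provide `Intl.Collator` the same data can be sorted with the locale pinned in the script, so the outcome no longer depends on which language the browser happens to prefer. This is only a sketch: `pt-BR` is picked because it matches the browser used above, and the `caseFirst` option is shown as one way to ask for uppercase before lowercase when two strings differ only by case.

```javascript
// Test data from the thread.
const names = ['Joao', 'Antonio', 'antonio', 'jansen', 'Johnson',
               'Antônio', 'João', 'Érica', 'Eric', 'Jose', 'josé', 'joão'];

// Pin the collation locale instead of inheriting it from the browser.
const ptCollator = new Intl.Collator('pt-BR');
const sortedPt = names.slice().sort(ptCollator.compare);

// Same locale, but request uppercase first for strings that are
// otherwise identical apart from case.
const upperFirst = new Intl.Collator('pt-BR', { caseFirst: 'upper' });
const sortedUpperFirst = names.slice().sort(upperFirst.compare);

console.log(sortedPt.join(', '));
console.log(sortedUpperFirst.join(', '));
```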

Cheers,
João Rodrigues (JR)
 
S

SAM

On 10/23/09 10:24 PM, JR wrote:
> Dear SAM,
> Thanks for testing. It was a weird result, maybe because you don't
> have ' ã ' in French (?)

We'll have to assume so.

(snip)
> Cheers,
> João Rodrigues (JR)

Ha! So JR doesn't stand for Junior ;-)
 
