In comp.lang.javascript message said:
I'd always been under the misimpression that JavaScript strings were
7-bit ASCII like C strings and the issue had never come up before. It
seems that I'm wrong (happily). Does anyone know if I can assume this
will work regardless of code page, HTML doctype, quirks mode etc? Is it
only because I've made this a super simple page running on Windows XP
US-en that this works? Would it fail in a lot of configurations (e.g.
Mac, Linux, Asian codebases) ?
<html>
<body>
<script>
var s="Hellö Würld";
document.write(s);
alert(s);
</script>
</body>
</html>
That s string contains two German characters (just in case usenet is
still restricted to 7-bit characters like it was 20 years ago).
You should not assume that something which looks like "Hellö Würld"
(umlauted) on your Windows, in an unspecified editor or viewer, will
necessarily have the right international representation for the umlaut-
bearers, although it probably will.
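
One way to take the editor and the code page out of the question is to write the non-ASCII characters as \uXXXX escapes, so that the source file is pure 7-bit ASCII. A minimal sketch follows; toAsciiLiteral is a hypothetical helper introduced here for illustration, not something from the original post.

```javascript
// Same text as the umlauted greeting, but written with \uXXXX escapes so the
// source file itself stays pure 7-bit ASCII.
var s = "Hell\u00F6 W\u00FCrld";   // \u00F6 = o-umlaut, \u00FC = u-umlaut

// Hypothetical helper: re-encode any string as the body of an ASCII-only
// string literal, escaping everything outside printable ASCII.
function toAsciiLiteral(str) {
  var out = [], i, c, hex;
  for (i = 0; i < str.length; i++) {
    c = str.charCodeAt(i);
    if (c < 32 || c > 126 || c === 34 || c === 92) { // also quote and backslash
      hex = c.toString(16).toUpperCase();
      out.push("\\u" + "0000".slice(hex.length) + hex);
    } else {
      out.push(str.charAt(i));
    }
  }
  return out.join("");
}
```

Pasting the result of toAsciiLiteral between double quotes should give a literal that means the same string however the file is later saved or served.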
However, if that source code generates the umlauted characters when
executed in a standards-compliant browser in the USA or Germany, then it
will do so in such a browser everywhere : character sets such as the
Asian ones are often built-in, and are otherwise AFAIK add-ons rather
than substitutes.
JavaScript uses two-byte (16-bit) Unicode code units internally, and
browsers consistently appear to do likewise; but Windows copy'n'paste
translates those to what the destination can handle.
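
The internal width can be probed directly. A minimal sketch, assuming a standard ECMAScript engine: each string position holds a 16-bit value, so a character beyond \uFFFF occupies two positions. The variable names are illustrative only.

```javascript
// Each element of a JavaScript string is one 16-bit code unit.
var umlaut = "\u00F6";            // o-umlaut, fits in one 16-bit unit
var clef = "\uD834\uDD1E";        // U+1D11E MUSICAL SYMBOL G CLEF, does not

var umlautUnits = umlaut.length;  // 1
var clefUnits = clef.length;      // 2: a surrogate pair

// String.fromCharCode keeps only the low 16 bits of its argument,
// so 0x10041 comes out as 0x0041, the letter A.
var wrapped = String.fromCharCode(0x10041) === "A";
```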
CAVEAT : I don't know how [languages like] Chinese work in Unicode,
needing a different character for every word.
The following code should show the characters available on your system;
I'd expect a US version to have fewest, unless it has add-on Native
American. Beware - Unicode has what might or might not be a design
fault, in that the last "now write forwards" character precedes the last
"now write backwards" one, so the rest of that output line comes out
reversed. Smarter code might suppress that effect.
However, as most people can read Urdu, Tamil, etc. just as ineffectively
in either direction, it matters little in the present context.
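// Write all 65536 16-bit character codes, 64 per line.
// Each line is labelled with 1e6+K, where K is the line index.
var B = ["<pre>"], A, J, K
for (K=0 ; K<1024 ; K++) {
  A = [(1e6+K)+" "]
  for (J=0 ; J<64 ; J++) A.push(String.fromCharCode(64*K + J))
  B.push(A.join(""))
}
B.push("<\/pre>")
document.write(B.join("\n"))
document.close()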
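
The "smarter code" mentioned above might be sketched as follows. This is an assumption about what would help, not tested against every browser: it replaces the explicit direction-control characters, and the ordinary control codes, with a dot before they reach the page. displayChar is a hypothetical name.

```javascript
// Hypothetical helper: returns the character for a code, or "." when the
// code is a control character that would disturb the layout of the table.
function displayChar(code) {
  var isControl = code < 32 || (code >= 0x7F && code <= 0x9F);
  var isBidi = code === 0x200E || code === 0x200F ||   // LRM, RLM
               (code >= 0x202A && code <= 0x202E);     // LRE, RLE, PDF, LRO, RLO
  return (isControl || isBidi) ? "." : String.fromCharCode(code);
}
```

To use it, write displayChar(64*K + J) in the inner loop in place of String.fromCharCode(64*K + J). Letters of right-to-left scripts such as Hebrew and Arabic will still be laid out right to left; only the explicit controls are suppressed.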
NOTE : in the reversed output line apparently numbered 041201 I see
.... IIIIIIIVVVIVIIVIIIIXXXIXIILCDMiiiiiiivvviviiviiiixxxixiilcdm
(de-reversed by copy'n'paste) where the roman numerals for I to XII,
i to xii are single characters (those are Number Forms, \u2160 - \u217F).
Interesting.
As I wrote before, some people like to write dates like "31 III 2009" or
"2009-III-31" ; that opens up a new class of Date Formatting &
Validation. The DATE2 Object can now read and write those, with single-
character months.
Alas, those characters are not constant width in monospace, and IE7 only
has I-XII i-x (FF has I-XII i-xii).
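
For illustration, reading a date with a single-character month might look like the sketch below. It is not the DATE2 Object's code; romanMonth and parseRomanDate are hypothetical names, and only the two layouts quoted above are handled.

```javascript
// Hypothetical helper, not part of the DATE2 Object mentioned above.
// Maps a single Number Forms character to a month number 1..12, else NaN.
function romanMonth(ch) {
  var c = ch.charCodeAt(0);
  if (c >= 0x2160 && c <= 0x216B) return c - 0x2160 + 1;  // upper case I..XII
  if (c >= 0x2170 && c <= 0x217B) return c - 0x2170 + 1;  // lower case i..xii
  return NaN;
}

// Parses "D M YYYY" or "YYYY-M-D" where M is one such character.
// Returns a Date, or null if the string does not fit or the date is invalid.
function parseRomanDate(str) {
  var m = /^(\d{1,2}) (.) (\d{4})$/.exec(str), y, mo, d, dt;
  if (m) {
    d = +m[1]; mo = romanMonth(m[2]); y = +m[3];
  } else {
    m = /^(\d{4})-(.)-(\d{1,2})$/.exec(str);
    if (!m) return null;
    y = +m[1]; mo = romanMonth(m[2]); d = +m[3];
  }
  if (isNaN(mo)) return null;
  dt = new Date(y, mo - 1, d);
  // Reject dates that rolled over, such as the 31st of a 30-day month.
  return (dt.getFullYear() === y && dt.getMonth() === mo - 1 &&
          dt.getDate() === d) ? dt : null;
}
```

A month written as three separate letters, as in "31 III 2009", is deliberately not matched here; that would need a different pattern.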
Refer to the Unicode site to find out more.