T
Thomas 'PointedEars' Lahn
Chris said:The ECMAScript spec says that strings are stored and represented in
UTF-16.
No, it does not (which Edition?). It says in Edition 5.1:
| 8.4 The String Type
|
| The String type is the set of all finite ordered sequences of zero or more
| 16-bit unsigned integer values (“elementsâ€). […] Each element is regarded
| as occupying a position within the sequence. These positions are indexed
| with nonnegative integers. The first element (if any) is at position 0,
| the next element (if any) at position 1, and so on. The length of a
| String is the number of elements (i.e., 16-bit values) within it.
|
| […]
| When a String contains actual textual data, each element is considered to
| be a single UTF-16 code unit. Whether or not this is the actual storage
| format of a String, the characters within a String are numbered by
| their initial code unit element position as though they were represented
| using UTF-16. All operations on Strings (except as otherwise stated) treat
| them as sequences of undifferentiated 16-bit unsigned integers; they do
| not ensure the resulting String is in normalised form, nor do they ensure
| language-sensitive results.
|
| NOTE
| The rationale behind this design was to keep the implementation of Strings
| as simple and high-performing as possible. The intent is that textual data
| coming into the execution environment from outside (e.g., user input, text
| read from a file or received over the network, etc.) be converted to
| Unicode Normalised Form C before the running program sees it. Usually this
| would occur at the same time incoming text is converted from its original
| character encoding to Unicode (and would impose no additional overhead).
| Since it is recommended that ECMAScript source code be in Normalised Form
| C, string literals are guaranteed to be normalised (if source text is
| guaranteed to be normalised), as long as they do not contain any Unicode
| escape sequences.
You can see the same thing in Javascript too. Here's a little demo I
just knocked together:
<script>
function foo()
{
var txt=document.getElementById("in").value;
var msg="";
for (var i=0;i<txt.length;++i) msg+="["+i+"]: "+txt.charCodeAt(i)+"
"+txt.charCodeAt(i).toString(16)+"\n";
document.getElementById("out").value=msg;
}
</script>
<input id=in><input type=button onclick="foo()"
value="Show"><br><textarea id=out rows=25 cols=80></textarea>
What an awful piece of code.
Give it an ASCII string
You mean a string of Unicode characters that can also be represented with
the US-ASCII encoding. There are no "ASCII strings" in conforming
ECMAScript implementations. And a string of Unicode characters with code
points within the BMP will suffice already.
and you'll see, as expected, one index (based on string indexing or
charCodeAt, same thing) for each character. Same if it's all BMP. But put
an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
notation doesn't work in Unicode) come up. I raised this issue on the
Google V8 list and on the ECMAScript list (e-mail address removed), and was
basically told that since JavaScript has been buggy for so long, there's
no chance of ever making it bug-free:
https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html
You misunderstand, and I am not buying Rick's answer. The problem is not
that String values are defined as units of 16 bits. The problem is that the
length of a primitive String value in ECMAScript, and the position of a
character, is defined in terms of 16-bit units instead of characters. There
is no bug, because ECMAScript specifies that Unicode characters beyond the
Basic Multilingual Plane (BMP) need not be supported:
| 2 Conformance
|
| A conforming implementation of this Standard shall interpret characters in
| conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
| 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
| implementation level 3. If the adopted ISO/IEC 10646-1 subset is not
| otherwise specified, it is presumed to be the BMP subset, collection 300.
| If the adopted encoding form is not otherwise specified, it presumed to
| be the UTF-16 encoding form.
But they can:
| A conforming implementation of ECMAScript is permitted to provide
| additional types, values, objects, properties, and functions beyond those
| described in this specification. In particular, a conforming
| implementation of ECMAScript is permitted to provide properties not
| described in this specification, and values for those properties, for
| objects that are described in this specification.
|
| A conforming implementation of ECMAScript is permitted to support program
| and regular expression syntax not described in this specification. In
| particular, a conforming implementation of ECMAScript is permitted to
| support program syntax that makes use of the “future reserved wordsâ€
| listed in 7.6.1.2 of this specification.
People have found ways to make this work in ECMAScript implementations. For
example, it is possible to scan a normalized string for lead surrogates:
String.fromCharCode = (function () {
var _fromCharCode = String.fromCharCode;
var span;
return function () {
var a = [];
for (var i = 0, len = arguments.length; i < len; ++i)
{
var arg = arguments;
var ch;
if (arg > 0xFFFF)
{
if (typeof span == "undefined")
{
span = document.createElement("span");
}
span.innerHTML = "&#" + arg + ";";
ch = span.firstChild.nodeValue;
}
else
{
ch = _fromCharCode(arg);
}
a.push(ch);
}
return a.join("");
};
}());
/* "ð„¢" (U+1D122 MUSICAL SYMBOL F CLEF) */
var sFClef = String.fromCharCode(0x1D122);
String.prototype.getLength = function () {
return (this.match(/[\uD800-\uDBFF][^\uD800-\uDBFF]|[\S\s]/g)
|| []).length;
};
/* 1 */
sFClef.getLength()
(String.prototype.charAt() etc. are left as an exercise to the reader.)
Tested in Chromium 25.0.1364.160 Debian 7.0 (186726), which according to
Wikipedia should feature V8 3.15.11.5.
But yes, there should be native support for Unicode characters with code
points beyond the BMP, and evidently that does _not_ require a second
language; just a few tweaks to the algorithms.
Fortunately for Python, there are version numbers, and policies that
permit bugs to actually get fixed. (Which is why, for instance, Debian
Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
some script is broken by that change.
Debian already ships Python 3.1 in Stable, disproving your argument.
Can't do that with web browsers.)
Yes, you could. It has been done before.
As of Python 3.3, all Pythons function the same way: it's
semantically a "wide build" (UTF-32), but with a memory usage
optimization. That's how it needs to be.
It is _not_ necessary to use the memory-expensive UTF-32 or a memory-cheaper
mixed encoding to represent characters beyond the BMP. UTF-32 would be more
runtime-efficient than any other encoding for such strings, though, because
you could divide by 32 for the length and would not have to find lead
surrogates to determine a character's position.