Are “extended characters” safe in identifiers?

Stanimir Stamenkov · Jun 5, 2011

Sun, 05 Jun 2011 20:12:09 +0300, /Stanimir Stamenkov/:

Sun, 05 Jun 2011 18:11:14 +0200, /Thomas 'PointedEars' Lahn/:

Seems like more syntactic sugar. It will still result in inserting
a pair (or more) of UTF-16 code units, and String.length will still
give the number of 16-bit units contained.

Reading it to the bottom, it really suggests using 32-bit units for
the elements of a String. I don't think that's going to happen any
time soon. It would use roughly twice as much memory without enough
demand for such a feature, and if implementations had to use a more
compact internal storage format, direct indexing of the elements of
a String would be nearly (if not entirely) impossible.
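
To make the point about 16-bit units concrete, here is a minimal sketch. It assumes the supplementary character U+10143 (decimal 65859) written as a pair of \uXXXX escapes; the variable names are only illustrative.

```javascript
// U+10143 lies outside the BMP, so it is stored as two UTF-16 code units.
var s = "\uD800\uDD43";

// length counts 16-bit code units, not characters.
var len = s.length;

// Indexing also works per code unit: each half of the pair is visible on its own.
var first = s.charCodeAt(0);  // 0xD800, the high (lead) surrogate
var second = s.charCodeAt(1); // 0xDD43, the low (trail) surrogate

console.log(len, first.toString(16), second.toString(16)); // 2 d800 dd43
```

Direct indexing stays constant-time precisely because each element is one fixed-size unit, which is the trade-off described above.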

Stanimir Stamenkov · Jun 5, 2011

Sun, 05 Jun 2011 20:34:56 +0300, /Stanimir Stamenkov/:

Sun, 05 Jun 2011 20:12:09 +0300, /Stanimir Stamenkov/:

Reading it to the bottom, it really suggests using 32-bit units for
the elements of a String. I don't think that's going to happen any
time soon. It would use roughly twice as much memory without enough
demand for such a feature, and if implementations had to use a more
compact internal storage format, direct indexing of the elements of
a String would be nearly (if not entirely) impossible.

I see it is also suggested that the Java approach (since Java 5)
could be taken instead
<http://www.w3.org/International/wiki/JavaScriptInternationalization>:

| Providing supplementary character support is an important
| requirement. Changes made to the Java programming language in
| this regard (adding additional methods for accessing code points
| instead of UTF-16 code units) might be an appropriate model.
| Norbert Lindenberg has an article on the choices Sun made that
| provides good reference:
|
| http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
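
As a rough illustration of what "methods for accessing code points instead of UTF-16 code units" could look like on top of the existing String API, here is a sketch. The helper names codePointAtIndex and codePointCount are hypothetical, loosely modeled on Java's String.codePointAt and String.codePointCount; they are not part of any specification discussed in this thread.

```javascript
// Hypothetical helper, loosely modeled on Java's String.codePointAt(int).
// Returns the code point that starts at the given code unit index.
function codePointAtIndex(str, index) {
  var hi = str.charCodeAt(index);
  // A high surrogate followed by a low surrogate encodes one code point.
  if (hi >= 0xD800 && hi <= 0xDBFF && index + 1 < str.length) {
    var lo = str.charCodeAt(index + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // BMP character or unpaired surrogate
}

// Hypothetical helper, loosely modeled on Java's String.codePointCount.
// Counts code points, treating each valid surrogate pair as one.
function codePointCount(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of the pair
      }
    }
    count++;
  }
  return count;
}

var sample = "A\uD800\uDD43"; // "A" followed by U+10143
console.log(sample.length);               // 3 code units
console.log(codePointCount(sample));      // 2 code points
console.log(codePointAtIndex(sample, 1)); // 65859
```

With this approach the String itself keeps its 16-bit elements and String.length is unchanged; code point access is layered on top, which is the design choice the quoted wiki page attributes to Java.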

Thomas 'PointedEars' Lahn · Jun 5, 2011

Stanimir said:
Sun, 05 Jun 2011 18:11:14 +0200, /Thomas 'PointedEars' Lahn/:

I did not write that.

| 2 Conformance
| […]
| A conforming implementation of this International standard shall
| interpret characters in conformance with the Unicode Standard, Version
| 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the
| adopted encoding form...

I did not write that either.

In addition, in a recent es-discuss message, Allen Wirfs-Brock (who is
more or less responsible for the wording) makes it clear that it was not
intended that implementations of ES5 handle surrogate pairs:

,-<https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html>
|
| […]
| The ES5 specification language clearly still has issues WRT Unicode
| encoding of programs and strings. These need to be fixed in the next
| edition. However, interpreting the current language as allow
| supplemental characters to occur in program text and particularly
| string literals doesn't match either reality or the intent of the ES5
| spec. […]

Reading through all of this, it really suggests surrogates are not
handled just for source encoding (and I suspect you've been arguing just
this from the beginning, which however is not what we're talking about):

You should really read all of it.

| In drafting the ES5 spec, TC39 had two goals WRT character
| encoding. We wanted to allow the occurrences of (BMP) characters
| defined in Unicode versions beyond 2.1 and we wanted to update
| the specification to reflect actual implementation reality that
| source was processed as if it was UCS-2.

Which is fine and dandy. It doesn't mean one can't have surrogate
code points inserted as \uXXXX in the source,

You cannot have surrogate *code* *points* inserted with `\uXXXX'; this does
not make sense.

JFTR, I have never said that you could not have Unicode escape sequences
evaluating to the ECMAScript character value of surrogate pair characters in
string literals. Instead I questioned whether it would be wise to do so.

nor that any such values are prohibited in strings during run-time.

I have never said that either. You should read what I write (and what
Specification writers write), not what you want me (and Specification
writers) to have written so that it fits your idea of how others argue
or the language should be.

Seems like more syntactic sugar.

No, if you had read carefully, it is required to properly deal with Unicode
characters beyond the BMP. Several string methods are currently not capable
of dealing with those characters, be it in verbatim or in escaped form. For
example, String.fromCharCode(65859).charCodeAt(0) should return 65859, but
it does not (it returns 323 instead).
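
The truncation can be reproduced directly, since String.fromCharCode converts its argument to a 16-bit unsigned integer (65859 modulo 65536 is 323). The sketch below also shows one way a code point beyond the BMP could be encoded by hand; the helper name fromCodePointSketch is hypothetical and used here only for illustration.

```javascript
var cp = 65859; // 0x10143, a code point beyond the BMP

// fromCharCode keeps only the low 16 bits of its argument.
var truncated = String.fromCharCode(cp).charCodeAt(0);
console.log(truncated); // 323

// Hypothetical helper: encode a code point as one or two UTF-16 code units.
function fromCodePointSketch(codePoint) {
  if (codePoint < 0x10000) {
    return String.fromCharCode(codePoint);
  }
  var offset = codePoint - 0x10000;
  return String.fromCharCode(
    0xD800 + (offset >> 10),  // high (lead) surrogate
    0xDC00 + (offset & 0x3FF) // low (trail) surrogate
  );
}

var pair = fromCodePointSketch(cp);
console.log(pair === "\uD800\uDD43"); // true
console.log(pair.length);             // 2
```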

It will still result in inserting a pair (or more) of UTF-16 code units, and
String.length will still give the number of 16-bit units contained.

So you have not read that (properly) either. Why am I not surprised?

PointedEars

Stanimir Stamenkov · Jun 5, 2011

Sun, 05 Jun 2011 19:48:32 +0200, /Thomas 'PointedEars' Lahn/:

JFTR, I have never said that you could not have Unicode escape sequences
evaluating to the ECMAScript character value of surrogate pair characters in
string literals. Instead I questioned whether it would be wise to do so.

Alright, I had enough crap from you for today and possibly for the
week.
