My own understanding is UCS-2 simply shouldn't be used any more.
Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and so does JavaScript.
http://mathiasbynens.be/notes/javascript-encoding
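To illustrate the distinction (my sketch, not part of the original post): under
PEP 393 a supplementary-plane character is one character, while any UTF-16 view
of the same string needs a surrogate pair, i.e. two code units:

```python
# U+1F600 GRINNING FACE lies outside the BMP
s = "\U0001F600"

# PEP 393 (Python 3.3+): one code point, one character
print(len(s))                                 # 1

# A UTF-16 encoding of the same string needs a surrogate pair
utf16_units = len(s.encode("utf-16-le")) // 2
print(utf16_units)                            # 2
```

The "lax UTF-16" behaviour described above is what you get when `len` reports
the second number rather than the first.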
The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support it, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.
This I don't see. What are the basic string operations?
The ones I'm specifically referring to are indexing and copying
substrings. There may be others.
* Examine the first character, or first few characters ("few" = "usually
bounded by a small constant") such as to parse a token from an input
stream. This is O(1) with either encoding.
That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).
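To make the indexing cost concrete, here is a sketch (mine, not from the
thread) of why indexing by character position is O(N) in UTF-8: you have to
scan past continuation bytes to find where the Nth character starts, whereas a
fixed-width representation just multiplies.

```python
def char_at_utf8(data: bytes, index: int) -> str:
    """Return the character at `index` by scanning UTF-8 lead bytes linearly."""
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:          # not a continuation byte: a new char starts
            count += 1
            if count == index:
                j = i + 1
                while j < len(data) and data[j] & 0xC0 == 0x80:
                    j += 1            # swallow this char's continuation bytes
                return data[i:j].decode("utf-8")
    raise IndexError(index)

print(char_at_utf8("naïve\U0001F600".encode("utf-8"), 5))  # the astral char
```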
* Slice off the first N characters. This is O(N) with either encoding
if it involves copying the chars. I guess you could share references
into the same string, but if the slice reference persists while the
big reference is released, you end up not freeing the memory until
later than you really should.
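Python's str slices copy rather than share, but the retention problem described
above is easy to demonstrate with memoryview, which does share the underlying
buffer (my illustration, not from the original post):

```python
big = bytearray(10_000_000)        # ~10 MB buffer
head = memoryview(big)[:16]        # O(1) slice: shares big's buffer, no copy

del big                            # the name is gone, but the buffer survives
print(len(head), head.obj is not None)  # the small view still pins ~10 MB

head.release()                     # only now can the buffer be freed
```

This is exactly the trade-off mentioned: the slice is cheap, but the big
allocation can't be freed until the last view is released.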
As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis is
looking at algorithmic complexity. It's not a substitute for actual
benchmarks.
Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
I assume that this wasn't one giant multi-terabyte string.
So
the chars were mostly ASCII, but there would be occasional non-ASCII
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision. That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.
Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)
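That per-page behaviour is directly observable: under PEP 393 each string
independently picks 1, 2 or 4 bytes per character, so only the page containing
the astral character pays the 4x cost (a sketch; exact byte counts vary by
build and version):

```python
import sys

ascii_page = "a" * 10_000                   # pure ASCII page: 1 byte/char
mixed_page = "a" * 9_999 + "\U0001F600"     # one astral char: 4 bytes/char

print(sys.getsizeof(ascii_page))   # roughly 10_000 plus a small header
print(sys.getsizeof(mixed_page))   # roughly 40_000 plus a small header
```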
Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.
[...]
I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?
There has to be a first time for everything.
Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.
Ropes have been considered and rejected because while they are
asymptotically fast, in common cases the added complexity actually makes
them slower. Especially for immutable strings where you aren't inserting
into the middle of a string.
http://mail.python.org/pipermail/python-dev/2000-February/002321.html
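For reference, a toy rope (my sketch, not PyPy's or anyone's real
implementation) shows the trade-off: concatenation is O(1) with no copying, but
every index has to walk the tree, which is the per-operation overhead the
rejection above alludes to.

```python
class Rope:
    """Toy immutable rope: leaves hold str; nodes cache the left-subtree length."""
    def __init__(self, data, right=None):
        if right is None:                  # leaf node
            self.leaf, self.left, self.right = data, None, None
            self.weight = len(data)
        else:                              # internal node: O(1) "concatenation"
            self.leaf, self.left, self.right = None, data, right
            self.weight = len(data)        # number of chars in the left subtree

    def __len__(self):
        if self.leaf is not None:
            return self.weight
        return self.weight + len(self.right)

    def __getitem__(self, i):              # O(depth) per index, vs O(1) for flat str
        if self.leaf is not None:
            return self.leaf[i]
        return self.left[i] if i < self.weight else self.right[i - self.weight]

r = Rope(Rope("hello "), Rope("world"))    # neither piece is copied
print(len(r), r[6])
```

For immutable strings that are mostly scanned and sliced, that extra pointer
chase on every access is why flat arrays tend to win in practice.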
PyPy has revisited ropes and uses, or at least used, ropes as their
native string data structure. But that's ropes of *bytes*, not UTF-8.
http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html