break unichr instead of fix ord?

Nobody

But getting a "ValueError" in some builds (and not in others)
is rather worse than getting unicode strings of different length....

Not necessarily. If the code assumes that unichr() always returns a
single-character string, it will silently produce bogus results when
unichr() returns a pair of surrogates. An exception is usually preferable
to silently producing bad data.

If unichr() returns a surrogate pair, what is e.g. unichr(i).isalpha()
supposed to do?
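To make the surrogate-pair case concrete, here is a minimal sketch of what a 16-bit build would have to store for a code point outside the BMP. It is written for Python 3 (where chr() replaces unichr()), and the helper name to_utf16_units is made up for illustration; it is not part of any Python API.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units that encode code_point in UTF-16."""
    if code_point < 0x10000:
        # BMP code points fit in a single 16-bit unit.
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # lead (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trail (low) surrogate
    return [high, low]


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
units = to_utf16_units(0x1D11E)
print(len(units))                 # 2
print([hex(u) for u in units])    # ['0xd834', '0xdd1e']
```

Any code that assumed "one call, one character" would now be holding two units, neither of which is a character on its own.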

Using surrogates is fine in an external representation (UTF-16), but it
doesn't make sense as an internal representation.

Think: why do people use wchar_t[] rather than a char[] encoded in UTF-8?
Because a wchar_t[] allows you to index *characters*, which you can't do
with a multi-byte encoding. You can't do it with a multi-*word* encoding
either.

UCS-2 and UTF-16 are superficially so similar that people forget that
they're completely different beasts. UCS-2 is fixed-length, UTF-16 is
variable-length. This makes UTF-16 semantically much closer to UTF-8 than
to UCS-2 or UCS-4.
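A quick way to see the fixed-length versus variable-length distinction is to count code units per character in each encoding. This is only a sketch using Python 3's standard codecs; the function name code_unit_counts is hypothetical, and UTF-32 stands in for UCS-4 here.

```python
def code_unit_counts(text):
    """Return (UTF-8 bytes, UTF-16 16-bit units, UTF-32 32-bit units) for text."""
    utf8_units = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2   # 2 bytes per unit
    utf32_units = len(text.encode("utf-32-le")) // 4   # 4 bytes per unit
    return utf8_units, utf16_units, utf32_units


print(code_unit_counts("a"))            # (1, 1, 1)
print(code_unit_counts("\u20ac"))       # (3, 1, 1)  EURO SIGN, inside the BMP
print(code_unit_counts("\U0001d11e"))   # (4, 2, 1)  outside the BMP
```

Only the last column stays at one unit per character; the UTF-16 column varies just as the UTF-8 column does, only less often.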

If your wchar_t is 16 bits, the only sane solution is to forgo support
for characters outside of the BMP.

The alternative is to process wide strings in exactly the same way that
you process narrow (mbcs) strings; e.g. extracting character N requires
iterating over the string from the beginning until you have counted N-1
characters. This provides no benefit over using narrow strings except for
a slight performance gain from halving the number of iterations. You still
end up with indexing being O(n) rather than O(1).
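The scan described above might look something like the following. It is a rough sketch, not how any particular implementation does it: the string is modelled as a plain list of 16-bit integers, and char_at is a hypothetical helper name.

```python
def char_at(units, n):
    """Return the code point of character n (0-based) in a list of UTF-16 units.

    Walks from the start of the list, so the cost is O(n) rather than O(1).
    """
    i = 0        # position in code units
    count = 0    # characters seen so far
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine lead and trail surrogates into one code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            code_point = unit
            width = 1
        if count == n:
            return code_point
        count += 1
        i += width
    raise IndexError("character index out of range")


# 'A', U+1D11E (as a surrogate pair), 'B'
units = [0x41, 0xD834, 0xDD1E, 0x42]
print(hex(char_at(units, 2)))   # 0x42, the third character
print(hex(units[2]))            # 0xdd1e, a lone trail surrogate from naive indexing
```

The last two lines show the gap: direct indexing is O(1) but lands in the middle of a pair, while the correct answer needs the linear walk.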