break unichr instead of fix ord?

Nobody · Aug 30, 2009

> But getting a "ValueError" in some builds (and not in others)
> is rather worse than getting unicode strings of different length....

Not necessarily. If the code assumes that unichr() always returns a
single-character string, it will silently produce bogus results when
unichr() returns a pair of surrogates. An exception is usually preferable
to silently producing bad data.

If unichr() returns a surrogate pair, what is e.g. unichr(i).isalpha()
supposed to do?
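
To make the question concrete, here is a rough sketch in Python 3, where chr() stands in for unichr(). The helper narrow_unichr() is hypothetical: it only imitates what a 16-bit build would hand back if unichr() returned surrogate pairs for code points above U+FFFF.

```python
def narrow_unichr(code_point):
    """Hypothetical helper: imitate a unichr() that returns a surrogate
    pair for code points above U+FFFF, as a 16-bit build might."""
    if code_point <= 0xFFFF:
        return chr(code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # trailing (low) surrogate
    return chr(high) + chr(low)

bmp = narrow_unichr(0x0041)       # LATIN CAPITAL LETTER A
astral = narrow_unichr(0x1D400)   # MATHEMATICAL BOLD CAPITAL A

print(len(bmp), bmp.isalpha())        # 1 True
print(len(astral), astral.isalpha())  # 2 False
print(chr(0x1D400).isalpha())         # True
```

The character itself is a letter, but the two-surrogate string is not reported as alphabetic, and any code that assumed a length of 1 is now wrong without ever seeing an error.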

Using surrogates is fine in an external representation (UTF-16), but it
doesn't make sense as an internal representation.

Think: why do people use wchar_t[] rather than a char[] encoded in UTF-8?
Because a wchar_t[] allows you to index *characters*, which you can't do
with a multi-byte encoding. You can't do it with a multi-*word* encoding
either.

UCS-2 and UTF-16 are superficially so similar that people forget that
they're completely different beasts. UCS-2 is fixed-length, UTF-16 is
variable-length. This makes UTF-16 semantically much closer to UTF-8 than
to UCS-2 or UCS-4.
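
A quick way to see the difference is to count code units per character in each encoding. This is just an illustration in Python 3; code_unit_counts() is a name made up for this post.

```python
def code_unit_counts(ch):
    """Return (UTF-8, UTF-16, UTF-32) code-unit counts for one character."""
    utf8 = len(ch.encode("utf-8"))             # 8-bit units
    utf16 = len(ch.encode("utf-16-le")) // 2   # 16-bit units
    utf32 = len(ch.encode("utf-32-le")) // 4   # 32-bit units
    return utf8, utf16, utf32

for code_point in (0x0041, 0x20AC, 0x1D400):
    print("U+%04X" % code_point, code_unit_counts(chr(code_point)))
# U+0041 (1, 1, 1)
# U+20AC (3, 1, 1)
# U+1D400 (4, 2, 1)
```

Only the 32-bit column is constant. UTF-16 varies just like UTF-8 does; it simply starts varying later.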

If your wchar_t is 16 bits, the only sane solution is to forgo support
for characters outside of the BMP.
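
In code, that policy amounts to something like the following sketch (Python 3, with chr() standing in for unichr(); bmp_only_chr() is a hypothetical name, not an existing function):

```python
def bmp_only_chr(code_point):
    """Return a one-character string, refusing anything outside the BMP."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point %#x is outside the BMP" % code_point)
    return chr(code_point)

print(len(bmp_only_chr(0x20AC)))   # 1
try:
    bmp_only_chr(0x1D400)
except ValueError as exc:
    print("rejected:", exc)
```

Every successful call returns exactly one code unit, so the caller's assumption holds or fails loudly.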

The alternative is to process wide strings in exactly the same way that
you process narrow (mbcs) strings; e.g. extracting character N requires
iterating over the string from the beginning until you have counted N-1
characters. This provides no benefit over using narrow strings except for
a slight performance gain from halving the number of iterations. You still
end up with indexing being O(n) rather than O(1).
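
For illustration, here is roughly what that scan looks like over a sequence of 16-bit code units. It is a sketch in Python 3, not how any particular implementation does it; utf16_units() and char_at() are names invented for the example, and indices are 0-based.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def char_at(units, n):
    """Return the code point of character n by scanning from the start."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # Combine high and low surrogate into one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if count == n:
            return cp
        count += 1
        i += width
    raise IndexError("character index out of range")

units = utf16_units("a" + chr(0x1D400) + "b")
print(hex(units[2]))            # 0xdc00: naive indexing lands mid-pair
print(hex(char_at(units, 2)))   # 0x62: the scan finds 'b'
```

The loop has to visit every preceding code unit, so the cost grows with the index, exactly as it would for a UTF-8 byte string.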

