Steven D'Aprano
Ian Foote:
Many common string operations do not require indexing by character,
which reduces the impact of this inefficiency.
Which common string operations do you have in mind?
Specifically in Python's case, the most obvious counter-example is the
length of a string. But that's only because Python strings are immutable
objects, and include a field that records the length. So once the string
is created, checking its length takes constant time.
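A toy sketch of that point, assuming nothing about CPython's actual object
layout (the Utf8String class below is purely illustrative): because the
object is immutable, the character count can be computed once at
construction and stored, so asking for the length never has to re-walk a
variable-width payload.

    class Utf8String:
        # Illustrative only -- not CPython's real representation.
        def __init__(self, text):
            self._utf8 = text.encode("utf-8")   # variable-width payload
            self._length = len(text)            # character count cached once

        def __len__(self):
            # O(1): just read the cached field, never rescan the bytes.
            return self._length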
Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding has a
cost. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the extra bit-masking and branching of a
variable-width encoding pushes up the constant factor.
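To make that concrete, here is a minimal sketch (plain Python, not
CPython's C implementation; error handling for malformed input is omitted)
of stepping through well-formed UTF-8 one code point at a time. The lead
byte has to be examined just to learn how many continuation bytes follow:

    def next_code_point(data, i):
        """Return (code_point, next_index) for the sequence at data[i]."""
        b = data[i]
        if b < 0x80:    # 0xxxxxxx: one byte (ASCII)
            return b, i + 1
        if b < 0xE0:    # 110xxxxx: lead byte of a two-byte sequence
            return ((b & 0x1F) << 6) | (data[i + 1] & 0x3F), i + 2
        if b < 0xF0:    # 1110xxxx: lead byte of a three-byte sequence
            return (((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
                    | (data[i + 2] & 0x3F)), i + 3
        # 11110xxx: lead byte of a four-byte sequence
        return (((b & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12)
                | ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F)), i + 4

    data = "aé€".encode("utf-8")
    i = 0
    while i < len(data):
        cp, i = next_code_point(data, i)
        print(hex(cp))          # 0x61, 0xe9, 0x20ac

With a fixed-width representation the equivalent step is a single indexed
load, with no branching on the lead byte.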
Any string method that takes a starting offset has to walk the string
byte-by-byte to find that position. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.
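As an illustration of why a start offset is awkward with UTF-8, here is a
hypothetical helper (the name char_index_to_byte_offset is made up for this
sketch, and it assumes well-formed UTF-8) that converts a character index
into a byte offset. There is no way to do it without inspecting every lead
byte before the target position, so it costs O(k) for index k:

    def char_index_to_byte_offset(data, char_index):
        offset = 0
        for _ in range(char_index):
            lead = data[offset]
            if lead < 0x80:      # one-byte character
                offset += 1
            elif lead < 0xE0:    # two-byte character
                offset += 2
            elif lead < 0xF0:    # three-byte character
                offset += 3
            else:                # four-byte character
                offset += 4
        return offset

    data = "café society".encode("utf-8")
    print(char_index_to_byte_offset(data, 5))   # 6, because 'é' took two bytes

With Python's fixed-width representations the same conversion is a single
multiplication.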
Ian Foote:
UTF-8 seems like a reasonable choice for an internal representation to me.
It's not. Haskell, for example, uses UTF-8 internally, and it warns that
this makes string operations O(N) instead of O(1) precisely because of
the need to walk the string inspecting every byte.
Remember, when your string primitives are O(N), it is very easy to write
code that becomes O(N**2). Using UTF-8 internally is just begging for
user-written code to be O(N**2).
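For example (in CPython today s[i] is O(1), so the loop below is fine as
written; the point is what it would cost if str were backed by UTF-8 and
each s[i] had to scan from the front of the buffer):

    def count_vowels(s):
        total = 0
        for i in range(len(s)):
            # Each s[i] *would* be O(i) with a UTF-8-backed str,
            # turning this innocent-looking O(N) loop into O(N**2).
            if s[i] in "aeiou":
                total += 1
        return total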
Ian Foote:
One benefit of UTF-8 over Python's flexible representation is that it is,
on average, more compact over a wide set of samples.
Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1 text, Python will use one byte per character, while
UTF-8 will use two bytes for each of the non-ASCII characters.
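A rough way to see both sides on a current CPython (exact byte counts vary
with the build and the string header, but the ratio is the point):

    import sys

    s = "déjà vu à gogo " * 1000       # lots of non-ASCII Latin-1 characters

    print(len(s))                      # 15000 characters
    print(sys.getsizeof(s))            # ~15000 bytes of payload plus a fixed
                                       # header: PEP 393 picks 1 byte/char here
    print(len(s.encode("utf-8")))      # 18000: UTF-8 spends 2 bytes on each
                                       # of the accented characters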