Things went wrong when UTF-8 was not adopted as the standard encoding,
thus requiring two string types. It would have been easier to have a len
function to count bytes as before and a glyphlen to count glyphs. Now,
as I understand it, we have a complicated mess under the hood for
unicode objects, so that they have a variable representation to
approximate an 8-bit representation when suitable, etc. etc. etc.
No no no! Glyphs are *pictures*: you know, the little blocks of pixels
that you see on your monitor or printed on a page. Before you can count
glyphs in a string, you need to know which typeface ("font") is being
used, since fonts generally lack glyphs for some code points.
[Aside: there's another complication. Some fonts define alternate glyphs
for the same code point, so that the design of (say) the letter "a" may
vary within the one string according to whatever typographical rules the
font supports and the application calls for. So the question is, when you
"count glyphs", should you count "a" and "alternate a" as a single glyph
or two?]
You don't actually mean counting glyphs; you mean counting code points
(think characters, only with some complications that aren't important for
the purposes of this discussion).
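To make the distinction concrete, here's a small illustration of my own
(not part of the argument above): Python's len() counts code points, and
one of those complications is that the same user-perceived character can
be one code point or two, depending on normalization:

py> import unicodedata
py> len(unicodedata.normalize('NFC', "å"))  # composed: one code point
1
py> len(unicodedata.normalize('NFD', "å"))  # decomposed: 'a' + combining ring
2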
UTF-8 is utterly unsuited for in-memory storage of text strings; I don't
care how many languages (Go, Haskell?) make that mistake. When you're
dealing with text strings, the fundamental unit is the character, not the
byte. Why do you care how many bytes a text string has? If you really
need to know how much memory an object is using, that's where you use
sys.getsizeof(), not len().
We don't say len({42: None}) to discover that the dict requires 136
bytes, so why would you use len("heåvy") to learn that it uses 23 bytes?
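To spell that out at the interactive prompt (the byte figures are the
ones quoted above; sys.getsizeof() results vary across Python versions
and platforms):

py> import sys
py> len({42: None})            # number of items
1
py> sys.getsizeof({42: None})  # memory footprint, in bytes
136
py> len("heåvy")               # number of characters
5
py> sys.getsizeof("heåvy")     # memory footprint, in bytes
23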
UTF-8 is a variable-width encoding, which means it's *rubbish* for the in-
memory representation of strings. Counting characters is slow. Slicing is
slow. If you have mutable strings, deleting or inserting characters is
slow. Every operation has to effectively start at the beginning of the
string and count forward, lest it split a multi-byte sequence down the
middle. Or worse, the language doesn't give you any protection from this at
all, so rather than slow string routines you have unsafe string routines,
and it's your responsibility to detect the UTF-8 boundaries yourself.
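Here's a rough sketch of the cost, as my own illustration rather than a
claim about any particular language's internals: merely counting the
code points in a UTF-8 buffer means touching every byte, skipping the
continuation bytes (those of the form 0b10xxxxxx):

def count_codepoints(buf):
    # O(n): there is no way to jump straight to "the i-th character"
    # of a variable-width encoding; every byte has to be inspected.
    return sum(1 for byte in buf if byte & 0xC0 != 0x80)

py> count_codepoints("heåvy".encode('utf-8'))
5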
In case you aren't familiar with what I'm talking about, here's an
example using Python 3.2, starting with a Unicode string and treating it
as UTF-8 bytes:
py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y
"Ã¥"? It didn't take long to get moji-bake in our output, and all I did
was print the (byte) string one "character" at a time. It gets worse: we
can easily end up with invalid UTF-8:
py> a, b = s[:len(s)//2], s[len(s)//2:] # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2:
unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0:
invalid start byte
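For contrast (my addition to the example): slice the decoded text
string instead and the halves land on character boundaries, so no
invalid data can result:

py> u[:len(u)//2], u[len(u)//2:]
('he', 'åvy')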
No: UTF-8 is okay for writing to files, but it's not suitable for
in-memory text strings. The in-memory representation of text strings
should be constant-width, based on characters not bytes, and should
prevent the caller from accidentally ending up with mojibake or invalid
strings.