I will not comment on the Unix-assumption part, but I think you go wrong
with this: "Unicode is a Headache". The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you would do better
to title this section 'Non-Unicode is a Headache'.
The first sentence is this misleading tautology: "With ASCII, data is
ASCII whether its file, core, terminal, or network; ie "ABC" is
65,66,67." Let me translate: "If all text is ASCII encoded, then text
data is ASCII, whether ..." But it was never the case that all text was
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC, and I believe
it still uses the latter. Other mainframe makers used other encodings of
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
universal. You could have just as well said "With EBCDIC, data is
EBCDIC, whether ..."
https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC
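To make the point concrete, a quick interpreter session; cp500 is one of
the EBCDIC codecs that ships with CPython:

    >>> list("ABC".encode('ascii'))   # "ABC" is 65,66,67 -- in ASCII
    [65, 66, 67]
    >>> list("ABC".encode('cp500'))   # the same text in EBCDIC (cp500)
    [193, 194, 195]

The bytes depend entirely on the encoding chosen; neither is more
'natural' than the other.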
A crucial step in the spread of ASCII was its use in microcomputers,
including the IBM PC. The latter was considered a toy by the mainframe
guys. If they had known that PCs would partly take over the computing
world, they might have suggested or insisted that it use EBCDIC.
"With unicode there are:
encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
If html 'always' used utf-8 (like xml), as has become common but not
universal, all of the problems with *non-unicode* character sets and
encodings would disappear, and the pre-unicode declarations could
disappear with them. More truthful: "without unicode there are hundreds
of encodings to worry about; with unicode, only three."
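For the curious, the three in action (utf-16 and utf-32 also come in BE
variants and BOM-prefixed forms, but the idea is the same):

    >>> "ABC".encode('utf-8')
    b'ABC'
    >>> "ABC".encode('utf-16-le')
    b'A\x00B\x00C\x00'
    >>> "ABC".encode('utf-32-le')
    b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00'

Note that UTF-8 is byte-for-byte identical to ASCII for ASCII text,
which is much of why it won on the web.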
"in-memory formats"
These are not the concern of the using programmer as long as they do not
introduce bugs or limitations (as do all the languages stuck on UCS-2,
and the many using UTF-16, including old Python narrow builds). Using
what should generally be the universal transmission format, UTF-8, as
the internal format means either losing indexing and slicing, slowing
those operations from O(1) to O(len(string)), or adding an index table
that is not part of the unicode standard. Using UTF-32 avoids all that
but usually wastes space -- up to 75%.
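Two concrete points, assuming nothing beyond the stdlib. First, the
space comparison:

    >>> s = 'a\u00e9\u4e2d\U0001f600'  # 1-, 2-, 3-, 4-byte chars in UTF-8
    >>> len(s), len(s.encode('utf-8')), len(s.encode('utf-32-le'))
    (4, 10, 16)

Second, a sketch (my own helper, not stdlib) of why indexing UTF-8 is
O(len): characters are 1 to 4 bytes wide, so finding the i-th one means
scanning from the start.

    def char_offset(buf, i):
        "Byte offset of code point i in UTF-8 bytes buf: an O(len) scan."
        count = 0
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:  # lead byte, not 0b10xxxxxx continuation
                if count == i:
                    return pos
                count += 1
        raise IndexError(i)

For pure ASCII text, UTF-32 wastes the full 75%; for the mixed string
above, it is 16 bytes versus 10.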
"strange beasties like python's FSR"
Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
is an *internal optimization* that benefits most unicode operations that
people actually perform. It uses UTF-32 by default but adapts to the
strings users create by compressing the internal format. The compression
is trivial -- simply dropping the leading null bytes common to all
characters in the string -- so each character is still readable as is.
The string header records how many bytes per character remain. Is the
idea of algorithms that adapt to their inputs really strange to you?
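You can watch the adaptation from Python itself (sizes from a 64-bit
CPython 3.x; the fixed header overhead varies a little by version):

    >>> import sys
    >>> [sys.getsizeof(c * 1000) for c in ('a', '\u20ac', '\U0001f600')]
    [1049, 2074, 4076]

One, two, and four bytes per character respectively, plus a small
header -- exactly the 'compression' described above, and invisible at
the level of indexing, slicing, and comparison.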
Like good adaptive algorithms, the FSR is invisible to the user except
for reducing space, time, or maybe both. Unicode operations are
otherwise the same as with previous wide builds. People who used narrow
builds also benefit from the elimination of narrow-build bugs. The only
'headaches' involved might have been those of the developers who
optimized previous wide builds.
CPython has many other functions with special-case optimizations and
'fast paths' for common, simple cases. For instance, (some? all?) number
operations are optimized for pairs of integers. Do you call these
'strange beasties'?
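A rough way to see such a fast path from pure Python (timings vary by
machine and version, so I give none; Fraction addition goes through a
generic Python-level __add__, while int addition is handled directly
in C):

    from timeit import timeit

    print(timeit('a + b', 'a, b = 1, 2'))        # specialized C path
    print(timeit('a + b',
                 'from fractions import Fraction\n'
                 'a, b = Fraction(1), Fraction(2)'))  # generic dispatch

The first is markedly faster, for the same abstract '+'.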
PyPy is faster than CPython, when it is, because it is even more
adaptable to particular computations, creating new fast paths as it
runs. The mechanism that creates these 'strange beasties' might have
been a headache for its writers, but when it works, which it now seems
to, it is not one for the users.