I do not care about this optimization. I'm not an ascii user. As a non
ascii user, this optimization is just irrelevant.
WRONG.
Every Python user is an ASCII user. Every Python program has hundreds or
thousands of ASCII strings.
py> import random
There's already one ASCII string in your code: the module name "random"
is ASCII. Let's look inside that module:
py> dir(random)
['BPF', 'LOG4', 'NV_MAGICCONST', 'RECIP_BPF', 'Random', 'SG_MAGICCONST',
'SystemRandom', 'TWOPI', '_BuiltinMethodType', '_MethodType',
'_Sequence', '_Set', '__all__', '__builtins__', '__cached__', '__doc__',
'__file__', '__initializing__', '__loader__', '__name__', '__package__',
'_acos', '_ceil', '_cos', '_e', '_exp', '_inst', '_log', '_pi',
'_random', '_sha512', '_sin', '_sqrt', '_test', '_test_generator',
'_urandom', '_warn', 'betavariate', 'choice', 'expovariate',
'gammavariate', 'gauss', 'getrandbits', 'getstate', 'lognormvariate',
'normalvariate', 'paretovariate', 'randint', 'random', 'randrange',
'sample', 'seed', 'setstate', 'shuffle', 'triangular', 'uniform',
'vonmisesvariate', 'weibullvariate']
That's another 58 ASCII strings. Let's pick one of those:
py> dir(random.Random)
['VERSION', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__',
'__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__',
'__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__',
'__subclasshook__', '__weakref__', '_randbelow', 'betavariate', 'choice',
'expovariate', 'gammavariate', 'gauss', 'getrandbits', 'getstate',
'lognormvariate', 'normalvariate', 'paretovariate', 'randint', 'random',
'randrange', 'sample', 'seed', 'setstate', 'shuffle', 'triangular',
'uniform', 'vonmisesvariate', 'weibullvariate']
That's another 51 ASCII strings. Let's pick one of them:
py> dir(random.Random.shuffle)
['__annotations__', '__call__', '__class__', '__closure__', '__code__',
'__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__eq__', '__format__', '__ge__', '__get__', '__getattribute__',
'__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__',
'__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__',
'__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
'__sizeof__', '__str__', '__subclasshook__']
And another 34 ASCII strings.
So to get access to just *one* method of *one* class of *one* module, we
have already seen up to 144 ASCII strings. (Some of them will be
duplicated.)
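If you would rather count than take my word for it, here is a rough sketch that tallies the names at the same three levels. The helper name ascii_names is my own invention, it relies on str.isascii() (Python 3.7 or later), and the exact counts differ between Python versions, so it prints whatever it finds rather than assuming the figures quoted above.

```python
import random

def ascii_names(obj):
    """Return the attribute names of obj that are pure ASCII (hypothetical helper)."""
    return [name for name in dir(obj) if name.isascii()]

# The same three levels walked through above: module, class, method.
targets = [random, random.Random, random.Random.shuffle]

total = 1  # start at 1 for the module name "random" itself
for obj in targets:
    names = ascii_names(obj)
    # Report how many of the names dir() returns are ASCII.
    print(len(names), "of", len(dir(obj)), "names are ASCII")
    total += len(names)

print("ASCII strings seen so far:", total)
```

As with the figure above, the total counts duplicates, so treat it as an upper bound.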
Even if every one of *your* classes, methods, functions, modules and
variables uses a non-ASCII name, you will still use ASCII strings for
built-in functions and standard library modules.
What should a Python user think, if he sees his strings are consuming
more memory just because he uses non ascii characters?
WRONG!
His strings are consuming just as much memory as they need to. You cannot
fit ten thousand different characters into a single byte. A single byte
can represent only 2**8 = 256 characters. Two bytes can represent only
2**16 = 65536 characters. Four bytes can represent the entire range of
every character ever represented in human history, and more, but it is
terribly wasteful: four bytes allow 2**32 (over four billion) values,
while Unicode defines only 1114112 code points, and most strings use a
tiny fraction of those. Using a four-byte encoding for every string
therefore uses up to four times as much memory as necessary.
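You can watch the per-character cost for yourself. The following is only a sketch, and it assumes CPython 3.3 or later, since sys.getsizeof reports implementation-specific numbers. The helper bytes_per_char is a name I made up; it compares two string lengths so that the fixed object overhead cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the storage cost per character of a string made only of ch.

    Comparing two lengths cancels the fixed per-object overhead.
    Assumes CPython 3.3+; other implementations may store strings differently.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# One sample character from each range of interest.
samples = {
    "ASCII 'a'": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "astral U+1F600": "\U0001F600",
}

for label, ch in samples.items():
    print(label, "->", bytes_per_char(ch), "byte(s) per character")

# How many distinct values each width can hold, against Unicode's code space.
for width in (1, 2, 4):
    print(width, "byte(s):", 2 ** (8 * width), "values")
print("Unicode code points:", sys.maxunicode + 1)
```

On a CPython with the flexible string representation I would expect 1, 1, 2 and 4 bytes respectively: each string pays only for the widest character it actually contains.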
You are imagining that non-ASCII users are being discriminated against,
with their strings being unfairly bloated. But that is not the case.
Their strings would be equally large in a Python wide build, give or take
whatever string object overhead changes from version to
version. If you are not comparing a wide build of Python to Python 3.3,
then your comparison is faulty. You are comparing "buggy Unicode, cannot
handle the supplementary planes" with "fixed Unicode, can handle the
supplementary planes". Python 3.2 narrow builds save memory by
introducing bugs into Unicode strings. Python 3.3 fixes those bugs and
still saves memory.
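To make the "buggy versus fixed" distinction concrete, here is a small sketch that runs on Python 3.3 or later. It cannot reproduce a narrow build, so it counts UTF-16 code units by explicit encoding to show what a narrow build would have treated as the string's items; the variable names are mine.

```python
import sys

s = "\U0001F600"  # one code point outside the Basic Multilingual Plane

# Python 3.3+ counts code points, so this is a single character.
print("len:", len(s))

# A narrow build stored UTF-16 code units, so it would have seen two items
# here (a surrogate pair). Count those units by encoding explicitly.
utf16_units = len(s.encode("utf-16-le")) // 2
print("UTF-16 code units:", utf16_units)

# Indexing and reversing work on whole characters, not halves of a pair.
text = "a" + s + "b"
print(text[1] == s, text[::-1] == "b" + s + "a")

# On 3.3+ sys.maxunicode is always 0x10FFFF; narrow builds reported 0xFFFF.
print(hex(sys.maxunicode))
```

On a narrow build the same string would have had length 2, and slicing or reversing it could split the surrogate pair, which is the class of bug I mean.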