I'm afraid I understand Python very well
(on this aspect).
Really?
Do you belong to this group of people who have been naively
writing wrong Python code (usually not working properly)
for more than a decade?
To me, the important question is whether this and previous similar posts
are intentional trolls designed to stir up the flurry of responses they
get or 'innocently' misleading or even erroneous. If your claim of
understanding Python and Unicode is true, then this must be a troll
post. Either way, please desist, or your access to python-list from
google-groups may be removed.
'ß' is the fourth character in that text "Straße"
(base index 0).
As others have said, in the *unicode* text "Straße", 'ß' is the fifth
character, at character index 4, ...
These assertions are correct (byte string and unicode).
whereas, when the text is encoded into bytes, the byte index depends on
the encoding and the assertion that it is always 4 is incorrect. Did you
know this or were you truly ignorant?
>>> sys.version
'2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>> assert 'Straße'[4] == 'ß'
Sometimes true, sometimes not.
>>> assert u'Straße'[4] == u'ß'
PS Nothing to do with Py2/Py3.
This issue has everything to do with Py2, where 'Straße' is encoded
bytes, versus Py3, where 'Straße' is unicode text where each character
of that word takes one code unit, whether each is 2 bytes or 4 bytes.
If you replace 'ß' with any astral (non-BMP) character, this issue
appears even for unicode text in 3.2 and earlier, where an astral
character requires 2, not 1, code units on narrow builds, thereby
screwing up indexing, just as can happen for encoded bytes. In 3.3+,
all characters use 1 code unit and indexing (and slicing) always works
properly. This is another unicode issue that you appear not to
understand, but you might just be trolling.
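
The astral-character point can be sketched as follows, assuming Python 3.3 or later. A narrow build is not available there, so the sketch approximates one by looking at UTF-16 code units directly. The sample string and the variable names are illustrative assumptions only.

```python
# Assumes Python 3.3+; the sample string is purely illustrative.
s = "a\U0001F600b"   # 'a', one astral (non-BMP) character, 'b'

# In 3.3+ every character is one code unit, so indexing is by character.
assert len(s) == 3
assert s[2] == "b"

# Approximate a narrow build by splitting the text into 16-bit code units.
raw = s.encode("utf-16-le")
units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# The astral character needs two code units (a surrogate pair),
# so there are 4 units and unit index 2 is a low surrogate, not 'b'.
narrow_length = len(units)
unit_at_2_is_low_surrogate = 0xDC00 <= units[2] <= 0xDFFF

print(narrow_length, hex(units[2]), unit_at_2_is_low_surrogate)
```

On a narrow build, indexing worked on code units like these, which is why every index after an astral character was shifted by one.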