'Straße' ('Strasse') and Python 2

wxjmfauth · Jan 12, 2014

sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]

s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'


jmf

Peter Otten · Jan 12, 2014

sys.version 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'


jmf

Signifying nothing. (Macbeth)

Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Straße"
>>> assert len(s) == 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert s[5] == "e"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Stefan Behnel · Jan 12, 2014

Peter Otten, 12.01.2014 09:31:

sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]

s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'

jmf

Signifying nothing. (Macbeth)

Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Straße"
>>> assert len(s) == 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert s[5] == "e"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

The point I think he was trying to make is that Linux is better than
Windows, because the latter fails to fail on these assertions for some reason.

Stefan


Ned Batchelder · Jan 12, 2014

sys.version 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'


jmf

Dumping random snippets of Python sessions here is useless. If you are
trying to make a point, you have to put some English around it. You
know what is in your head, but we do not.

Mark Lawrence · Jan 12, 2014

Peter Otten, 12.01.2014 09:31:

sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'

jmf

Signifying nothing. (Macbeth)

Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Straße"
>>> assert len(s) == 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert s[5] == "e"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

The point I think he was trying to make is that Linux is better than
Windows, because the latter fails to fail on these assertions for some reason.

Stefan

The point he's trying to make is that he also reads the python-dev
mailing list, where Steven D'Aprano posted this very example, stating it
is "Python 2 nonsense". Fixed in Python 3. Don't mention...

MRAB · Jan 12, 2014

sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]

s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'

jmf

Signifying nothing. (Macbeth)

Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Straße"
>>> assert len(s) == 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert s[5] == "e"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

The point is that in Python 2 'Straße' is a bytestring and its length
depends on the encoding of the source file. If the source file is UTF-8
then 'Straße' is a string literal with 7 bytes between the single
quotes.
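
A minimal sketch of that point, written for Python 3, where text and bytes are separate types. Encoding the text explicitly stands in for what a Python 2 byte-string literal would hold under a given source-file encoding; the variable names are illustrative only.

```python
# Python 3: "Straße" is text (6 code points). Encoding it models the bytes
# a Python 2 'Straße' literal would contain for a given source encoding.
text = "Straße"

utf8_bytes = text.encode("utf-8")      # as if the source file were UTF-8
latin1_bytes = text.encode("latin-1")  # as if the source file were Latin-1

print(len(text), len(utf8_bytes), len(latin1_bytes))  # 6 7 6
print(utf8_bytes)    # b'Stra\xc3\x9fe'
print(latin1_bytes)  # b'Stra\xdfe'
```

The two assertions from the original post hold for the Latin-1 bytes and fail for the UTF-8 bytes, which is why the results differ between the Windows and Linux sessions quoted above.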

Thomas Rachel · Jan 13, 2014

On 12.01.2014 08:50, (e-mail address removed) wrote:

sys.version 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'


Wow. You just found one of the major differences between Python 2 and 3.

Your assertions are just wrong, as s = 'Straße' leads - provided you use
UTF8 - to a representation of 'Stra\xc3\x9fe', obviously leading to a
length of 7.

Thomas

wxjmfauth · Jan 13, 2014

On Monday, 13 January 2014 09:27:46 UTC+1, Thomas Rachel wrote:

On 12.01.2014 08:50, (e-mail address removed) wrote:

2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]

s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'

Wow. You just found one of the major differences between Python 2 and 3.

Your assertions are just wrong, as s = 'Straße' leads - provided you use
UTF8 - to a representation of 'Stra\xc3\x9fe', obviously leading to a
length of 7.

Not at all. I'm afraid I'm understanding Python (on this
aspect very well).

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

'ß' is the fourth character in that text "Straße"
(base index 0).

These assertions are correct (byte string and unicode).

sys.version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'
assert u'Straße'[4] == u'ß'


jmf

PS Nothing to do with Py2/Py3.

Chris Angelico · Jan 13, 2014

These assertions are correct (byte string and unicode).

sys.version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'
assert u'Straße'[4] == u'ß'

jmf

PS Nothing to do with Py2/Py3.

This means that either your source encoding happens to include that
character, or you have assertions disabled. It does NOT mean that you
can rely on writing this string out to a file and having someone else
read it in and understand it the same way.

ChrisA

Steven D'Aprano · Jan 13, 2014

sys.version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'
assert u'Straße'[4] == u'ß'

I think you are using "from __future__ import unicode_literals".
Otherwise, that cannot happen in Python 2.x. Using a narrow build:

# on my machine "ando"
py> sys.version
'2.7.2 (default, May 18 2012, 18:25:10) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)]'
py> sys.maxunicode
65535
py> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
py> list('Straße')
['S', 't', 'r', 'a', '\xc3', '\x9f', 'e']

Using a wide build is the same:

# on my machine "orac"
py> sys.maxunicode
1114111
py> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

But once you run the "from __future__" line, the behaviour changes to
what you show:

py> from __future__ import unicode_literals
py> list('Straße')
[u'S', u't', u'r', u'a', u'\xdf', u'e']
py> assert 'Straße'[4] == 'ß'
py>

But I still don't understand the point you are trying to make.

Chris Angelico · Jan 13, 2014

I think you are using "from __future__ import unicode_literals".
Otherwise, that cannot happen in Python 2.x.

Alas, not true.

>>> sys.version
'2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)]'
>>> sys.maxunicode
65535
>>> assert 'Straße'[4] == 'ß'
>>> list('Straße')
['S', 't', 'r', 'a', '\xdf', 'e']

That's Windows XP. Presumably Latin-1 (or CP-1252, they both have that
char at 0xDF). He happens to be correct, *as long as the source code
encoding matches the output encoding and is one that uses 0xDF to mean
U+00DF*. Otherwise, he's not.
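
The condition can be illustrated with a short Python 3 sketch. It encodes the word and the single character under a few codecs and checks whether the byte at index 4 equals the encoded 'ß'. The codec list and the `results` name are assumptions chosen for illustration, not something from the posts above.

```python
text = "Straße"

# For each codec, compare the byte at index 4 of the encoded word with
# the encoded form of the single character 'ß'.
results = {}
for codec in ("latin-1", "cp1252", "utf-8"):
    raw = text.encode(codec)
    sharp_s = "ß".encode(codec)
    # Slice rather than index so that bytes are compared with bytes.
    results[codec] = (raw[4:5] == sharp_s)
    print(codec, raw, results[codec])
```

The check passes for the single-byte codecs, where 'ß' is the one byte 0xDF, and fails for UTF-8, where it is the two bytes 0xC3 0x9F.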

ChrisA

Michael Torrie · Jan 13, 2014

Not at all. I'm afraid I'm understanding Python (on this
aspect very well).

Are you sure about that? Seems to me you're still confused as to the
difference between unicode and encodings.

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

'ß' is the fourth character in that text "Straße"
(base index 0).

These assertions are correct (byte string and unicode).

How can they be? They only are true for the default encoding and
character set you are using, which happens to have 'ß' as a single byte.
Hence your little python 2.7 snippet is not using unicode at all, in
any form. It's using a non-unicode character set. There are methods
which can decode your character set to unicode and encode from unicode.
But let's be clear. Your byte streams are not unicode!

If the default byte encoding is UTF-8, which is a variable number of
bytes per character, your assertions are completely wrong. Maybe it's
time you stopped programming in Windows and use OS X or Linux which
throw out the random single-byte character sets and instead provide a
UTF-8 terminal environment to support non-latin characters.

sys.version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'
assert u'Straße'[4] == u'ß'


jmf

PS Nothing to do with Py2/Py3.

wxjmfauth · Jan 13, 2014

On Monday, 13 January 2014 11:57:28 UTC+1, Chris Angelico wrote:

I think you are using "from __future__ import unicode_literals".
Otherwise, that cannot happen in Python 2.x.

Alas, not true.

>>> sys.version
'2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)]'
>>> sys.maxunicode
65535
>>> assert 'Straße'[4] == 'ß'
>>> list('Straße')
['S', 't', 'r', 'a', '\xdf', 'e']

That's Windows XP. Presumably Latin-1 (or CP-1252, they both have that
char at 0xDF). He happens to be correct, *as long as the source code
encoding matches the output encoding and is one that uses 0xDF to mean
U+00DF*. Otherwise, he's not.

You are right. It's on Windows. It is only showing how
Python can be a holy mess.

The funny aspect is when I'm reading " *YOUR* assertions
are false" when I'm presenting *PYTHON* assertions!

jmf

Mark Lawrence · Jan 13, 2014

You are right. It's on Windows. It is only showing how
Python can be a holy mess.

Regarding unicode, Python 2 was a holy mess, fixed in Python 3.

Thomas Rachel · Jan 13, 2014

On 13.01.2014 10:54, (e-mail address removed) wrote:

Not at all. I'm afraid I'm understanding Python (on this
aspect very well).

IBTD.

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

Why should I be?

'ß' is the fourth character in that text "Straße"
(base index 0).

Character-wise, yes. But not byte-string-wise. In a byte string, this
depends on the character set used.

On CP 437, 850, 12xx (whatever Windows uses) or latin1, you are right,
but not on the widely used UTF8.

sys.version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'
assert u'Straße'[4] == u'ß'

Linux box at home:

Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert u'Straße'[4] == u'ß'

Python 3.3.0 (default, Oct 01 2012, 09:13:30) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> assert 'Straße'[4] == 'ß'
>>> assert u'Straße'[4] == u'ß'

Windows box at work:

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> assert 'Straße'[4] == 'ß'
>>> assert u'Straße'[4] == u'ß'

PS Nothing to do with Py2/Py3.

As the bytes, unicode and str handling changed heavily between them, of
course it has to do with Py2/Py3.

And I think you know that and try to confuse and FUD us all - to no avail.

Thomas

Terry Reedy · Jan 13, 2014

I'm afraid I'm understanding Python (on this
aspect very well).

Really?

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

To me, the important question is whether this and previous similar posts
are intentional trolls designed to stir up the flurry of responses they
get or 'innocently' misleading or even erroneous. If your claim of
understanding Python and Unicode is true, then this must be a troll
post. Either way, please desist, or your access to python-list from
google-groups may be removed.

'ß' is the fourth character in that text "Straße"
(base index 0).

As others have said, in the *unicode* text "Straße", 'ß' is the fifth
character, at character index 4, ...

These assertions are correct (byte string and unicode).

whereas, when the text is encoded into bytes, the byte index depends on
the encoding and the assertion that it is always 4 is incorrect. Did you
know this or were you truly ignorant?

sys.version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'

Sometimes true, sometimes not.

assert u'Straße'[4] == u'ß'

PS Nothing to do with Py2/Py3.

This issue has everything to do with Py2, where 'Straße' is encoded
bytes, versus Py3, where 'Straße' is unicode text where each character
of that word takes one code unit, whether each is 2 bytes or 4 bytes.

If you replace 'ÃŸ' with any astral (non-BMP) character, this issue
appears even for unicode text in 3.2-, where an astral character
requires 2, not 1, code units on narrow builds, thereby screwing up
indexing, just as can happen for encoded bytes. In 3.3+, all characters
use 1 code unit and indexing (and slicing) always works properly. This
is another unicode issue where you appear not to understand, but might
just be trolling.
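
A small sketch of the astral-character case, assuming Python 3.3 or later. The UTF-16 code-unit count is computed by hand to show what `len()` would have reported on an old narrow build; the choice of U+1D11E and the variable names are illustrative assumptions.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP (an "astral" character).
astral = "\U0001D11E"
text = "a" + astral + "b"

# Python 3.3+: one index per code point, so indexing stays aligned.
print(len(text))   # 3
print(text[2])     # b

# UTF-16 needs a surrogate pair for the astral character, so the same text
# occupies 4 code units, which is the length a narrow build would have seen.
utf16_units = len(text.encode("utf-16-le")) // 2
print(utf16_units)  # 4
```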

Robin Becker · Jan 15, 2014

sys.version 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'

jmf

On my utf8 based system

robin@everest ~:
$ cat ooo.py
if __name__=='__main__':
    import sys
    s='A̅B'
    print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
robin@everest ~:
$ python ooo.py
version_info=sys.version_info(major=3, minor=3, micro=3, releaselevel='final', serial=0)
len(A̅B)=3
robin@everest ~:
$

so two 'characters' are 3 (or 2 or more) codepoints. If I want to isolate
so-called graphemes, I need an algorithm even for python's unicode, i.e. when
it really matters, python3 str is just another encoding.
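
A rough sketch of such an algorithm, assuming that attaching combining marks to the preceding base character is good enough. This is not the full Unicode grapheme-cluster segmentation, and `rough_clusters` is a hypothetical helper name.

```python
import unicodedata


def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    An approximation only: real grapheme segmentation has more rules.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = "A\u0305B"  # 'A', COMBINING OVERLINE, 'B'
print(len(s))                  # 3 code points
print(len(rough_clusters(s)))  # 2 visible 'characters'
```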

Ned Batchelder · Jan 15, 2014

sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]

s = 'Straße'
assert len(s) == 6
assert s[5] == 'e'

jmf

On my utf8 based system

robin@everest ~:
$ cat ooo.py
if __name__=='__main__':
    import sys
    s='A̅B'
    print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
robin@everest ~:
$ python ooo.py
version_info=sys.version_info(major=3, minor=3, micro=3, releaselevel='final', serial=0)
len(A̅B)=3
robin@everest ~:
$

so two 'characters' are 3 (or 2 or more) codepoints. If I want to
isolate so-called graphemes, I need an algorithm even for python's
unicode, i.e. when it really matters, python3 str is just another encoding.

You are right that more than one codepoint makes up a grapheme, and that
you'll need code to deal with the correspondence between them. But let's
not muddy these already confusing waters by referring to that mapping as
an encoding.

In Unicode terms, an encoding is a mapping between codepoints and bytes.
Python 3's str is a sequence of codepoints.
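
To make the distinction concrete, here is a minimal Python 3 sketch using the same three-codepoint string as the script above. The variable names are illustrative; the point is that the codepoint sequence is fixed while the bytes depend on which encoding is chosen.

```python
s = "A\u0305B"  # 'A', COMBINING OVERLINE, 'B'

# The str itself is a sequence of code points.
code_points = [ord(ch) for ch in s]
print(code_points)  # [65, 773, 66]

# An encoding maps those code points to bytes; different encodings
# give different byte sequences for the same code points.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")
print(utf8, len(utf8))    # b'A\xcc\x85B' 4
print(len(utf16))         # 6

# Decoding with the matching codec recovers the same code points.
print(utf8.decode("utf-8") == s)  # True
```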

Robin Becker · Jan 15, 2014

On 15/01/2014 12:13, Ned Batchelder wrote:
.........

.........
You are right that more than one codepoint makes up a grapheme, and that you'll
need code to deal with the correspondence between them. But let's not muddy
these already confusing waters by referring to that mapping as an encoding.

In Unicode terms, an encoding is a mapping between codepoints and bytes. Python
3's str is a sequence of codepoints.

Semantics is everything. For me graphemes are the endpoint (or should be); to
get a proper rendering of a sequence of graphemes I can use either a sequence of
bytes or a sequence of codepoints. They are both encodings of the graphemes;
what unicode says is an encoding doesn't define what encodings are, i.e. mappings
from some source alphabet to a target alphabet.

wxjmfauth · Jan 15, 2014

On Wednesday, 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:

... more than one codepoint makes up a grapheme ...

No

In Unicode terms, an encoding is a mapping between codepoints and bytes.

No

jmf
