Unicode 7

Peter Otten · May 2, 2014

Rustom said:
Just noticed a small thing in which python does a bit better than haskell:
$ ghci
let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
Prelude>

In case it's not apparent, the fi in the first fine is a ligature.

Python just barfs:

Not Python 3:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)

No copy-and-paste errors involved:

>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2

Roy Smith · May 2, 2014

Ben Finney said:
The non-breaking space (“ ” U+00A0) is frequently used in text to keep
conceptually inseparable text such as “100 km” from automatic word
breaks <URL:https://en.wikipedia.org/wiki/Non-breaking_space>.

Which, by the way, argparse doesn't honor...

http://bugs.python.org/issue16623
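
A minimal sketch of how a non-breaking space can get lost during text wrapping in general. I have not checked that this is the exact mechanism behind the argparse issue; it only shows that Python 3's Unicode-aware whitespace handling counts U+00A0 as whitespace unless told otherwise. The sample string is just an illustration.

```python
import re

text = "100\u00a0km"  # "100", NO-BREAK SPACE, "km"

# str patterns are Unicode-aware by default in Python 3, so \s matches U+00A0
# and the non-breaking space is replaced by an ordinary, breakable one.
collapsed = re.sub(r"\s+", " ", text)

# With re.ASCII, \s only matches ASCII whitespace, so U+00A0 survives.
preserved = re.sub(r"\s+", " ", text, flags=re.ASCII)

print(ascii(collapsed))   # '100 km'
print(ascii(preserved))   # '100\xa0km'
print(text.split())       # ['100', 'km'] (str.split() also splits on U+00A0)
```

Any wrapper that normalizes whitespace this way before breaking lines will treat “100 km” as two separate words.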

Rustom Mody · May 3, 2014

Rustom Mody wrote:

Not Python 3:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)

No copy-and-paste errors involved:

>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2

Aah! Thanks Peter (and Ned and Michael): 2-3 confusion, my bad.

I am confused about the tone however:
You think this

(ﬁne, fine) = (1,2) # and no issue about it

is fine?

Chris Angelico · May 3, 2014

You think this

(ﬁne, fine) = (1,2) # and no issue about it

is fine?

Not sure which part you're objecting to. Are you saying that this
should be an error:

or that Python should take the exact sequence of codepoints, rather
than normalizing?

Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
{'__package__': None, '__spec__': None, '__doc__': None, 'fine': 1,
'__loader__': <class '_frozen_importlib.BuiltinImporter'>,
'__builtins__': <module 'builtins' (built-in)>, '__name__':
'__main__'}

As regards normalization, I would be happy with either "keep it
exactly as you provided" or "normalize according to <insert Unicode
standard normalization here>", as long as it's consistent. It's like
what happens with SQL identifiers: according to the standard, an
unquoted name should be uppercased, but some databases instead
lowercase them. It doesn't break code (modulo quoted names, not
applicable here), as long as it's consistent.

(My reading of PEP 3131 is that NFKC is used; is that what's
implemented, or was that a temporary measure and/or something for Py2
to consider?)

ChrisA

Ned Batchelder · May 3, 2014

Rustom Mody wrote:

Not Python 3:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)

No copy-and-paste errors involved:

>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2

Aah! Thanks Peter (and Ned and Michael): 2-3 confusion, my bad.

I am confused about the tone however:
You think this

(ﬁne, fine) = (1,2) # and no issue about it

is fine?

Can you be more explicit? It seems like you think it isn't fine. Why
not? What bothers you about it? Should there be an issue?

Rustom Mody · May 3, 2014

Rustom Mody wrote:
Just noticed a small thing in which python does a bit better than haskell:
$ ghci
let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
In case it's not apparent, the fi in the first fine is a ligature.
Python just barfs:
Not Python 3:
Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)
No copy-and-paste errors involved:
>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2

Aah! Thanks Peter (and Ned and Michael): 2-3 confusion, my bad.
I am confused about the tone however:
You think this

(ﬁne, fine) = (1,2) # and no issue about it

is fine?

Can you be more explicit? It seems like you think it isn't fine. Why
not? What bothers you about it? Should there be an issue?

Two identifiers that to some programmers
- can look the same
- and not to others
- and that the language treats as different

is not fine (or ﬁne) to me.

Putting them together as I did is summarizing the problem.

Think of them as widely separated in the text,
and the code (un)serendipitously 'working' (i.e. not giving NameErrors).

Chris Angelico · May 3, 2014

Two identifiers that to some programmers
- can look the same
- and not to others
- and that the language treats as different

is not fine (or ﬁne) to me.

The language treats them as the same, though.

ChrisA

Rustom Mody · May 3, 2014

The language treats them as the same, though.

Whoops! I seem to be goofing a lot today

Saw Peter's

>>> (ﬁne, fine) = (1,2)

Didn't notice his next line

>>> (ﬁne, fine)
(2, 2)

So then I am back to my original point:

Python is giving better behavior than Haskell in this regard!

[Earlier reached this conclusion via a wrong path]

Steven D'Aprano · May 3, 2014

I am confused about the tone however: You think this

(ﬁne, fine) = (1,2) # and no issue about it

is fine?

It's no worse than any other obfuscated variable name:

MOOSE, MO0SE, M0OSE = 1, 2, 3
xl, x1 = 1, 2

If you know your victim is reading source code in Ariel font, "rn" and
"m" are virtually indistinguishable except at very large sizes.

Steven D'Aprano · May 3, 2014

It's no worse than any other obfuscated variable name:

MOOSE, MO0SE, M0OSE = 1, 2, 3
xl, x1 = 1, 2

If you know your victim is reading source code in Ariel font, "rn" and
"m" are virtually indistinguishable except at very large sizes.

Ooops! I too missed that Python normalises the name ﬁne to fine, so in
fact this is not a case of obfuscation.
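
The difference between the two cases can be sketched with `unicodedata`. The helper `same_identifier` below is a hypothetical name for illustration, not something from the standard library, and it only compares NFKC forms; it does not check that the strings are valid identifiers.

```python
import unicodedata


def same_identifier(a, b):
    """Return True if both spellings have the same NFKC normal form."""
    nfkc_a = unicodedata.normalize("NFKC", a)
    nfkc_b = unicodedata.normalize("NFKC", b)
    return nfkc_a == nfkc_b


# The ligature is a compatibility character, so NFKC folds it into "fi".
print(same_identifier("\ufb01ne", "fine"))   # True

# Look-alike ASCII spellings are different code points and stay distinct.
print(same_identifier("MOOSE", "MO0SE"))     # False
print(same_identifier("xl", "x1"))           # False
print(same_identifier("rn", "m"))            # False
```

So normalization removes the ligature trap, while the MOOSE/MO0SE kind of obfuscation remains entirely possible.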

Chris Angelico · May 3, 2014

If you know your victim is reading source code in Ariel font, "rn" and
"m" are virtually indistinguishable except at very large sizes.

I kinda like the idea of naming it after a bratty teenager who rebels
against her father and runs away from home, but normally the font's
called Arial.

ChrisA

Terry Reedy · May 3, 2014

(My reading of PEP 3131 is that NFKC is used; is that what's
implemented, or was that a temporary measure and/or something for Py2
to consider?)

The 3.4 docs say "The syntax of identifiers in Python is based on the
Unicode standard annex UAX-31, with elaboration and changes as defined
below; see also PEP 3131 for further details."
....
"All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC."

Without reading UAX-31, I don't know how much was changed, but I suspect
not much. In any case, the current rules are intended and very unlikely
to change as that would break code going either forward or back for
little purpose.
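
A small sketch of what the quoted sentence means in practice, assuming Python 3. It runs two assignments through `exec` into a scratch dictionary (the name `namespace` is just for illustration) and looks at which key ends up stored.

```python
import unicodedata

ligature_name = "\ufb01ne"   # U+FB01 LATIN SMALL LIGATURE FI, then "ne"
plain_name = "fine"

# NFKC maps the ligature to the two ASCII letters "f" and "i".
normalized = unicodedata.normalize("NFKC", ligature_name)
print(normalized == plain_name)          # True

# Identifiers are normalized while parsing, so both assignments
# bind the same name and the second one wins.
namespace = {}
exec(ligature_name + " = 1", namespace)
exec(plain_name + " = 2", namespace)

has_plain = "fine" in namespace
has_ligature = ligature_name in namespace
value = namespace["fine"]
print(has_plain, has_ligature, value)    # True False 2
```

Note that only source-code identifiers are normalized. A string used as a dictionary key, as in `namespace[ligature_name]`, is looked up exactly as written and would raise KeyError here.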

Dennis Lee Bieber · May 3, 2014

And you've never been bitten by an invisible control character in ASCII
text? You've lived a sheltered life!

Xerox Sigma CP/V would even permit them in file names (though the
system was EBCDIC, not ASCII -- just feeding lots of ASCII terminals).

Think of the pain someone would have trying to figure out where in a 32
character file name the <BEL> was positioned. Even on a 1200bps serial
line, one couldn't really determine between which printable characters the
terminal beeped while listing the directory.
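
For comparison, a minimal sketch of how one might locate such a character today in a Python string. The function name `control_positions` and the sample file name are made up for illustration; the check relies on the Unicode general category "Cc" (control), which covers BEL (U+0007).

```python
import unicodedata


def control_positions(name):
    """List (index, code point) pairs for every control character in name."""
    return [
        (index, "U+%04X" % ord(char))
        for index, char in enumerate(name)
        if unicodedata.category(char) == "Cc"
    ]


filename = "REPORT\x07FINAL"          # hypothetical name with an embedded BEL
print(control_positions(filename))    # [(6, 'U+0007')]
print(ascii(filename))                # 'REPORT\x07FINAL'
```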
