Unicode 7


Chris Angelico

The Unicode consortium's going from the old BMP to the current (6.0) SMPs
to who-knows-what in the future is similar.

Unicode 1.0: "Let's make a single universal character set that can
represent all the world's scripts. We'll define 65536 codepoints to do
that with."

Unicode 2.0: "Oh. That's not enough. Okay, let's define some more."

It's not a fundamental change, nor is it unhelpful to Unicode's cause.
It's simply an acknowledgement that 64K codepoints aren't enough. Yes,
that gave us the mess of UTF-16 being called "Unicode" (if it hadn't
been for Unicode 1.0, I doubt we'd now have so many languages using
and exposing UTF-16 - it'd be a simple judgment call, pick
UTF-8/UTF-16/UTF-32 based on what you expect your users to want to
use), but it doesn't change Unicode's goal, and it also doesn't
indicate that there's likely to be any more such changes in the
future. (Just look at how little of the Unicode space is allocated so
far.)
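
Incidentally, the "how little is allocated" claim is easy to check from
Python itself. A rough sketch (it counts only code points that have
character names, so it slightly undercounts assigned characters such as
controls, and the total depends on the Unicode version your interpreter
ships with):

import unicodedata

# Count code points with a character name, out of the 0x110000 possible.
assigned = sum(1 for cp in range(0x110000)
               if unicodedata.name(chr(cp), None) is not None)
print("{0} of {1} code points are named ({2:.1%})".format(
    assigned, 0x110000, assigned / 0x110000))

Even on a recent Unicode version this comes out to a small fraction of
the available space.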

ChrisA
 

Terry Reedy

Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge it's nothing to do with Unicode or with jmf.

Right. Ned has an actual technical reason to complain, even though the
developers do not consider it strong enough to act.
Why, if optimizations are always desirable, do C compilers have
-O0, -O1, -O2, -O3, and zillions of more specific flags?

One reason is that many optimizations sometimes introduce bugs, or to
put it another way, they are based on assumptions that are not true for
all code. For instance, some people have suggested that CPython should
have an optional optimization based on the assumption that builtin names
are never rebound. That is true for perhaps many code files, but
definitely not all. Guido does not seem to like such conditional
optimizations.

I can think of three reasons for not adding to the numerous options
CPython already has.
1. We do not have the developer resources to handle the added
complications of multiple optimization options.
2. Zillions of options and flags confuse users. As it is, most options
are seldom used.
3. Optimization options are easily misused, possibly leading to silently
buggy results, or mysterious failures. For instance, people sometimes
rebind builtins without realizing what they have done, such as using
'id' as a parameter name. Being in the habit of routinely using the
'assume no rebinding option' would lead to problems.
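
For instance, a minimal illustration (the function name here is made up):

# Using 'id' as a parameter name shadows the builtin id() inside the
# function -- perfectly legal Python, but fatal to any optimization
# that assumes 'id' always means the builtin.
def describe(id, name):
    return "{0} has id {1}".format(name, id)

print(describe(42, "widget"))   # widget has id 42
print(id("widget"))             # at top level, id is still the builtin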

I am rather sure that the string (unicode) test suite was reviewed and
the performance of 3.2 wide builds recorded before the new
implementation was committed.

The tracker currently has 37 behavior (bug) issues marked for the
unicode component. In a quick review, I do not see that any have
anything to do with using standard UTF-32 versus adaptive UTF-32.
Indeed, I believe a majority of the 37 were filed before 3.3 or are 2.7
specific. Problems with FSR itself have been fixed as discovered.
JFTR I have no issue with FSR. What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them]

Somewhat ironically, I suppose you are right.
I don't even know whether jmf has a real technical (or, as he calls it,
'mathematical') issue or whether it's entirely political:

I would call his view personal or philosophical. I only object to
endless repetition and the deception of claiming that personal views are
mathematical facts.
 

Steven D'Aprano

What's the best cure for a headache?

Cut off the head

o_O

I don't think so.

What's the best cure for Unicode?

Ascii

Unicode is not a problem to be solved.

The inability to write standard human text in ASCII is a problem, e.g.
one cannot write

“ASCII For Dummies” © 2014 by Zöe Smith, now on sale 99¢

so even *Americans* cannot represent all their common characters in
ASCII, let alone specialised characters from mathematics, science, the
printing industry, and law. And even Americans sometimes need to write
text in Foreign. Where is your ASCII now?

The solution is to have at least one encoding which contains the
additional characters needed.

The plethora of such additional encodings is a problem. The solution is a
single encoding that covers all needed characters, like Unicode, so that
there is no need to handle multiple encodings.

The inability for plain text files to record metadata of what encoding
they use is a problem. The solution is to standardize on a single, world-
wide encoding, like Unicode.
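
A minimal demonstration of the metadata problem: the same bytes are valid
in more than one encoding, and nothing in a plain text file says which one
was meant.

# The reader has to guess the encoding; a wrong guess yields mojibake.
data = "Zoë".encode("utf-8")      # b'Zo\xc3\xab'
print(data.decode("utf-8"))       # Zoë   (right guess)
print(data.decode("latin-1"))     # ZoÃ«  (wrong guess: mojibake)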

Saying, however, that there is no headache in Unicode does not make the
headache go away:

http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

No, I am not saying that the contents/style/tone are right. However,
people are evidently suffering the transition. Denying it is not a help.

Transitions are always more painful than after the transition has settled
down. As I have said repeatedly, I look forward to the day when nobody
but document archivists and academics need care about legacy encodings.
But we're not there yet.

And the Unicode consortium's ways are not exactly helpful to its own cause:
imagine the C standard committee deciding that adding mandatory garbage
collection to C is a neat idea.

The Unicode consortium's going from the old BMP to the current (6.0) SMPs
to who-knows-what in the future is similar.

I don't see the connection.
 

Steven D'Aprano

I don't know how one causally connects the 'headaches', but I've seen:
- mojibake

Mojibake is certainly more common with multiple encodings, but the
solution to that is Unicode, not ASCII.

In fact, in your blog post you even link to a post of mine where I
explain that ASCII has gone through multiple backwards incompatible
changes over the decades, which means you can have a limited form of
mojibake even in pure ASCII. Between changes over various versions of
ASCII, and ambiguous characters allowed by the standard, you needed some
sort of out-of-band metadata to tell you whether they intended an @ or a
`, a | or a ¬, a £ or a #, to mention only a few.

It's only since the 1980s that ASCII, actual 7-bit US ASCII, has become
an unambiguous standard. But that's okay, because that merely allowed
people to create dozens of 7-bit and 8-bit variations on ASCII, all
incompatible with each other, and *call them ASCII* regardless of the
actual standard name.

Between ambiguities in actual ASCII, and common practice to label non-
ASCII as ASCII, I can categorically say that mojibake has always been
possible in so-called "plain text". If you haven't noticed it, it was
because you were only exchanging documents with people who happened to
use the same set of characters as you.

- Unicode 'number-boxes' (what are these called?)

They are missing character glyphs, and they have nothing to do with
Unicode. They are due to deficiencies in the text font you are using.

Admittedly with Unicode's 0x10FFFF possible characters (actually more,
since a single code point can have multiple glyphs) it isn't surprising
that most font designers have neither the time, skill, nor desire to create
a glyph for every single code point. But then the same applies even for
more restrictive 8-bit encodings -- sometimes font designers don't even
bother providing glyphs for *ASCII* characters.

(E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

- Worst of all, what we
*don't* see -- how many others don't see what we see?

Again, this is a deficiency of the font. There are very few code points in
Unicode which are intended to be invisible, e.g. space, newline, zero-
width joiner, control characters, etc., but they ought to be equally
invisible to everyone. No printable character should ever be invisible in
any decent font.

I never knew of any of this in the good ol' days of ASCII

You must have been happy with a very impoverished set of symbols, then.

¶ Passive voice is often the best choice in the interests of political
correctness

It would be a pleasant surprise if everyone sees a pilcrow at the start of
the line above.

I do.
 

Chris Angelico

... even *Americans* cannot represent all their common characters in
ASCII, let alone specialised characters from mathematics, science, the
printing industry, and law.

Aside: What additional characters does law use that aren't in ASCII?
Section § and paragraph ¶ are used frequently, but you already
mentioned the printing industry. Are there other symbols?

ChrisA
 

Chris Angelico

They are missing character glyphs, and they have nothing to do with
Unicode. They are due to deficiencies in the text font you are using.

Admittedly with Unicode's 0x10FFFF possible characters (actually more,
since a single code point can have multiple glyphs) it isn't surprising
that most font designers have neither the time, skill, nor desire to create
a glyph for every single code point. But then the same applies even for
more restrictive 8-bit encodings -- sometimes font designers don't even
bother providing glyphs for *ASCII* characters.

(E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

This is another area where Unicode has given us "a great improvement
over the old method of giving satisfaction". Back in the 1990s on
OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a
simple square with no information, or (c) copied from some other font
(common with dingbats fonts). With Unicode, the standard is to show a
little box *with the hex digits in it*. Granted, those boxes are a LOT
more readable for BMP characters than SMP (unless your text is huge,
six digits in the space of one character will make them pretty tiny),
and a "Unicode" font will generally include all (or at least most) of
the BMP, but it's still better than having no information at all.

ChrisA
 

Chris Angelico

ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE
REGISTERED SIGN), for instance.

Heh! I forgot about those. U+00A9 in particular has gone so mainstream
that it's easy to think of it not as "I'm going to switch to my
'British English + Legal' dictionary now" but simply as "this is a
critical part of the basic dictionary".

ChrisA
 

Jussi Piitulainen

Chris said:
(common with dingbats fonts). With Unicode, the standard is to show
a little box *with the hex digits in it*. Granted, those boxes are a
LOT more readable for BMP characters than SMP (unless your text is
huge, six digits in the space of one character will make them pretty
tiny), and a "Unicode" font will generally include all (or at least
most) of the BMP, but it's still better than having no information

I needed to see such tiny numbers just today, just the four of them in
the tiny box. So I pressed C-+ a few times to _make_ the text huge,
obtained my information, and returned to my normal text size with C--.

Perfect. Usually all I need to know is that I have a character for
which I don't have a glyph, but this time I wanted to record the
number because I was testing things rather than reading the text.
 

Marko Rauhamaa

Ben Finney said:
ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE
REGISTERED SIGN), for instance.

The em-dash is mapped on my keyboard — I use it quite often.


Marko
 

Rustom Mody

Again, this is a deficiency of the font. There are very few code points in
Unicode which are intended to be invisible, e.g. space, newline, zero-
width joiner, control characters, etc., but they ought to be equally
invisible to everyone. No printable character should ever be invisible in
any decent font.

That's not what I meant.

I wrote http://blog.languager.org/2014/04/unicoded-python.html
– mostly on a Debian box.
Later, seeing it on a less heavily set-up Ubuntu box, I see that
⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
have become 'missing-glyph' boxes.

It leads me to ask: how much else of what I am writing has some random
reader simply not seen?
Quite simply, we can never know – because most are going to go away saying
"mojibaked/garbled rubbish"

Speaking of what you understood of what I said:
Yes, invisible chars are another problem I was recently bitten by.
I pasted something from Google into Emacs' org mode.
Following that link again, I kept getting a broken link.

Until I found that the link had an invisible char.

The problem was that Emacs was faithfully rendering that char according
to the standard, i.e. invisibly!
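
A sketch of a quick way to expose such characters (the helper name and
the sample URL are invented for illustration):

import unicodedata

def show_hidden(s):
    # Report every character that has no visible glyph of its own.
    for ch in s:
        if not ch.isprintable():
            print("U+{0:04X} {1}".format(
                ord(ch), unicodedata.name(ch, "<no name>")))

show_hidden("http://example.com/\u200bindex")
# prints: U+200B ZERO WIDTH SPACE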
 

Steven D'Aprano

Aside: What additional characters does law use that aren't in ASCII?
Section § and paragraph ¶ are used frequently, but you already mentioned
the printing industry. Are there other symbols?

I was thinking of copyright, trademark, registered mark, and similar. I
think these are all the relevant characters:

py> import unicodedata
py> for c in '©®℗™':
....     unicodedata.name(c)
....
'COPYRIGHT SIGN'
'REGISTERED SIGN'
'SOUND RECORDING COPYRIGHT'
'TRADE MARK SIGN'
 

Steven D'Aprano

That's not what I meant.

I wrote http://blog.languager.org/2014/04/unicoded-python.html
– mostly on a Debian box.
Later, seeing it on a less heavily set-up Ubuntu box, I see that
⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
have become 'missing-glyph' boxes.

It leads me to ask: how much else of what I am writing has some random
reader simply not seen?
Quite simply, we can never know – because most are going to go away
saying "mojibaked/garbled rubbish"

Speaking of what you understood of what I said: Yes, invisible chars are
another problem I was recently bitten by. I pasted something from Google
into Emacs' org mode. Following that link again, I kept getting a broken
link.

Until I found that the link had an invisible char.

The problem was that Emacs was faithfully rendering that char according
to the standard, i.e. invisibly!

And you've never been bitten by an invisible control character in ASCII
text? You've lived a sheltered life!

Nothing you are describing is unique to Unicode.
 

Marko Rauhamaa

Steven D'Aprano said:
And you've never been bitten by an invisible control character in
ASCII text? You've lived a sheltered life!

That reminds me: " " (U+00A0 NO-BREAK SPACE) is often used between numbers
and units, for example.
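
In Python source, at least, it can be spelled out by name so that it
stays visible:

# U+00A0 written by name: a '5 kg' label that never wraps between
# the number and the unit.
label = "5\N{NO-BREAK SPACE}kg"
print(label)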


Marko
 

Tim Chase

This is another area where Unicode has given us "a great improvement
over the old method of giving satisfaction". Back in the 1990s on
OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a
simple square with no information, or (c) copied from some other
font (common with dingbats fonts). With Unicode, the standard is to
show a little box *with the hex digits in it*. Granted, those boxes
are a LOT more readable for BMP characters than SMP (unless your
text is huge, six digits in the space of one character will make
them pretty tiny), and a "Unicode" font will generally include all
(or at least most) of the BMP, but it's still better than having no
information at all.

I'm pleased when applications & fonts work properly, using both the
placeholder box for "this character is legitimate but I can't display
it with a font, so here, have a box with the codepoint numbers in it
until I'm directed to use a more appropriate font, at which point
you'll see it correctly" and "somebody crammed garbage in here, so
I'll display it with '�' (U+FFFD), which is designated for exactly
this purpose".
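
That second case is what Python gives you when decoding with
errors="replace" (a small sketch; the stray byte is just an arbitrary
example of invalid UTF-8):

# b'\xe9' is latin-1 for 'é' but is not valid UTF-8 on its own,
# so the decoder substitutes U+FFFD.
data = b"caf\xe9"
print(data.decode("utf-8", errors="replace"))   # caf�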

-tkc
 

Rustom Mody

And you've never been bitten by an invisible control character in ASCII
text? You've lived a sheltered life!

For control characters I've seen:
- garbage (the ASCII equivalent of mojibake)
- Straight ^A^B^C
- Maybe their names NUL, SOH, STX, ETX, EOT, ENQ, ACK…
- Or maybe just a little dot .
- More pathological behavior: a control sequence putting the
terminal into some other mode

But I don't ever remember seeing a control character become
invisible (except [ \t\n\f]).
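
One reliable way, in Python at least, to see exactly which control
characters are present is repr(), which escapes them all:

s = "A\x01B\x07C"
print(s)         # display is terminal-dependent: ^A, a beep, or nothing
print(repr(s))   # 'A\x01B\x07C' -- every control character made explicit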
 

Rustom Mody

And you've never been bitten by an invisible control character in ASCII
text? You've lived a sheltered life!
Nothing you are describing is unique to Unicode.

Just noticed a small thing in which Python does a bit better than Haskell:

$ ghci
Prelude> let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
Prelude>

In case it's not apparent, the 'fi' in the first 'ﬁne' is a ligature.

Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
    ^
SyntaxError: invalid syntax
The point of that example is to show that Unicode gives all kinds of
"Aaah! Gotcha!!" opportunities that just don't exist in the old world.
Python may have got this one right, but there are surely dozens of others.

On the other hand, I see more eagerness for Unicode source-text there,
e.g.

https://github.com/i-tu/Hasklig
http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#unicode-syntax
http://www.haskell.org/haskellwiki/Unicode-symbols
http://hackage.haskell.org/package/base-unicode-symbols

Some music 𝄞 𝄢 ♭ 𝄱 to appease the utf-8 gods
 

MRAB

Okay, so can you change your article to reflect the fact that the
headaches both pre-date Unicode, and are made much easier by Unicode?


Ah yes, the neo-Sumerian story “Enmerkar and the Lord of Aratta”


And other myths with fantastic reasons for the diversity of language


Yes, by ignoring all other writing systems except one's own – and
thereby excluding most of the world's people – the system can be made
simpler.
ASCII lacked even £. I can remember assembly listings in magazines
containing lines such as:

LDA £0

I even (vaguely) remember an advert with a character that looked like
Å, presumably because they didn't have £. In a UK magazine? Very
strange!
 

MRAB

o_O

I don't think so.



Unicode is not a problem to be solved.

The inability to write standard human text in ASCII is a problem, e.g.
one cannot write

“ASCII For Dummies” © 2014 by Zöe Smith, now on sale 99¢
[snip]

Shouldn't that be "Zoë"?
 

Michael Torrie

Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
    ^
SyntaxError: invalid syntax

The point of that example is to show that Unicode gives all kinds of
"Aaah! Gotcha!!" opportunities that just don't exist in the old world.
Python may have got this one right, but there are surely dozens of others.

Except that it doesn't. This has nothing to do with Unicode handling.
It has everything to do with what defines an identifier in Python. This
is no different than someone wondering why they can't start an
identifier in Python 1.x with a number or punctuation mark.
 

Ned Batchelder

Just noticed a small thing in which Python does a bit better than Haskell:

$ ghci
Prelude> let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
Prelude>

In case it's not apparent, the 'fi' in the first 'ﬁne' is a ligature.

Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
    ^
SyntaxError: invalid syntax

Surely by now we could at least be explicit about which version of
Python we are talking about?

$ python2.7
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
    ^
SyntaxError: invalid syntax
$ python3.4
Python 3.4.0b1 (default, Dec 16 2013, 21:05:22)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ﬁne = 1
>>> ﬁne
1

In Python 2, identifiers must be ASCII. Python 3 allows many Unicode
characters in identifiers (see PEP 3131 for details:
http://legacy.python.org/dev/peps/pep-3131/).
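
PEP 3131 also specifies that identifiers are normalized to NFKC while
parsing, which is why the Python 3 session above accepts the ligature:
'ﬁne' and 'fine' end up as the same identifier, unlike in the GHC
session where they were two distinct names. A quick check:

import unicodedata

# U+FB01 (LATIN SMALL LIGATURE FI) normalizes to plain 'fi' under NFKC.
print(unicodedata.normalize("NFKC", "\ufb01ne"))   # fine

ﬁne = 1          # parsed as the identifier 'fine'
print(fine)      # 1 -- both spellings name the same variable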
 
