Things went wrong when UTF-8 was not adopted as the standard encoding,
thus requiring two string types. It would have been easier to have a len
function to count bytes as before and a glyphlen to count glyphs. Now,
as I understand it, we have a complicated mess under the hood for
unicode objects, so that they have a variable representation to
approximate an 8-bit representation when suitable, etc. etc. etc.
No no no! Glyphs are *pictures*: you know, the little blocks of pixels
that you see on your monitor or printed on a page. Before you can count
glyphs in a string, you need to know which typeface ("font") is being
used, since fonts generally lack glyphs for some code points.
[Aside: there's another complication. Some fonts define alternate glyphs
for the same code point, so that the design of (say) the letter "a" may
vary within the one string according to whatever typographical rules the
font supports and the application calls for. So the question is, when you
"count glyphs", should you count "a" and "alternate a" as a single glyph
or two?]
You don't actually mean counting glyphs; you mean counting code points
(think characters, only with some complications that aren't important for
the purposes of this discussion).
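To make the distinction concrete, here's a small illustration of my own
(not part of the argument above): Python's len() counts code points, and
one of those complications is that the same user-perceived character can
be one code point or two, depending on normalization:

py> import unicodedata
py> len(unicodedata.normalize('NFC', "å"))  # composed: one code point
1
py> len(unicodedata.normalize('NFD', "å"))  # decomposed: 'a' + combining ring
2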
UTF-8 is utterly unsuited for in-memory storage of text strings; I don't
care how many languages (Go, Haskell?) make that mistake. When you're
dealing with text strings, the fundamental unit is the character, not the
byte. Why do you care how many bytes a text string has? If you really
need to know how much memory an object is using, that's where you use
sys.getsizeof(), not len().
We don't say len({42: None}) to discover that the dict requires 136
bytes, so why would you use len("heåvy") to learn that it uses 23 bytes?
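To spell that out at the interactive prompt (the byte figures are the
ones quoted above; sys.getsizeof() results vary across Python versions
and platforms):

py> import sys
py> len({42: None})            # number of items
1
py> sys.getsizeof({42: None})  # memory footprint, in bytes
136
py> len("heåvy")               # number of characters
5
py> sys.getsizeof("heåvy")     # memory footprint, in bytes
23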
UTF-8 is a variable-width encoding, which means it's *rubbish* for the in-
memory representation of strings. Counting characters is slow. Slicing is
slow. If you have mutable strings, deleting or inserting characters is
slow. Every operation has to effectively start at the beginning of the
string and count forward, lest it split a multi-byte sequence down the
middle. Or worse, the language doesn't give you any protection from this at
all, so rather than slow string routines you have unsafe string routines,
and it's your responsibility to detect the UTF-8 boundaries yourself.
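Here's a rough sketch of the cost, as my own illustration rather than a
claim about any particular language's internals: merely counting the
code points in a UTF-8 buffer means touching every byte, skipping the
continuation bytes (those of the form 0b10xxxxxx):

def count_codepoints(buf):
    # O(n): there is no way to jump straight to "the i-th character"
    # of a variable-width encoding; every byte has to be inspected.
    return sum(1 for byte in buf if byte & 0xC0 != 0x80)

py> count_codepoints("heåvy".encode('utf-8'))
5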
In case you aren't familiar with what I'm talking about, here's an
example using Python 3.2, starting with a Unicode string and treating it
as UTF-8 bytes:
py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y
"Ã¥"? It didn't take long to get moji-bake in our output, and all I did
was print the (byte) string one "character" at a time. It gets worse: we
can easily end up with invalid UTF-8:
py> a, b = s[:len(s)//2], s[len(s)//2:] # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2:
unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0:
invalid start byte
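For contrast (my addition to the example): slice the decoded text
string instead and the halves land on character boundaries, so no
invalid data can result:

py> u[:len(u)//2], u[len(u)//2:]
('he', 'åvy')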
No: UTF-8 is okay for writing to files, but it's not suitable for
in-memory text strings. The in-memory representation of text strings
should be constant-width, based on characters not bytes, and should
prevent the caller from accidentally ending up with mojibake or invalid
strings.