RE Module Performance

Steven D'Aprano · Jul 25, 2013

His most recent argument that Python should use UTF as a representation
is very strange to be honest.

He's not arguing for anything, he is just hating on anything that gives
even the tiniest benefit to ASCII users. This isn't about Python 3.3
hurting non-ASCII users, because that is demonstrably untrue: they are
*better off* in Python 3.3. This is about denying even a tiny benefit to
ASCII users.

In Python 3.3, non-ASCII users have these advantages compared to previous
versions:

- strings will usually take less memory, and aside from trivial changes
to the object header, they never take more memory than a wide build would
use;

- consequently nearly all objects will take less memory (especially
builtins and standard library objects, which are all ASCII), since
objects contain dozens of internal strings (attribute and method names in
__dict__, class name, etc.);

- consequently whole-application benchmarks show most applications will
use significantly less memory, which leads to faster speeds;

- you cannot break surrogate pairs apart by accident, which you can do in
narrow builds;

- in previous versions, code which works when run in a wide build may
fail in a narrow build, but that is no longer an issue since the
distinction between wide and narrow builds is gone;

- Latin1 users, which includes JMF himself, will likewise see memory
savings, since Latin1 strings will take half the size of narrow builds
and a quarter the size of wide builds.

The cost of all these benefits is a small overhead when creating a string
in the first place, and some purely internal added complication to the
string implementation.

I'm the first to argue against complication unless there is a
corresponding benefit. This is a case where the benefit has proven itself
doubly: Python 3.3's Unicode implementation is *more correct* than
before, and it uses less memory to do so.
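A minimal sketch of the correctness point, assuming CPython 3.3 or later. The sample string and the helper name utf16_units are made up for illustration. A character outside the BMP is a single item of the string, so indexing and reversing cannot split it, whereas the same operations on raw UTF-16 code units can:

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


s = "a\U00012345b"  # one SMP character between two ASCII letters

# On Python 3.3+ the string is a sequence of code points.
print(len(s))                      # 3
print(s[1] == "\U00012345")        # True
print(s[::-1] == "b\U00012345a")   # True

# Treating UTF-16 code units as "characters" behaves differently:
# there are four units, and a naive reversal puts the second half
# of the surrogate pair in front of the first half.
units = utf16_units(s)
reversed_units = units[::-1]
print(len(units))
print([hex(u) for u in reversed_units])
```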

The cons of UTF are apparent and widely
known. The main con is that UTF strings are O(n) for indexing a
position within the string.

Not so for UTF-32.

Chris Angelico · Jul 25, 2013

24.07.13 21:15, Chris Angelico wrote:

Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates
area) to represent undecodable bytes with surrogateescape error handler.

That's a deliberate and conscious use of the codepoints; that's not
what I'm talking about here. Suppose you read a UTF-8 stream of bytes
from a file, and decode them into your language's standard string
type. At this point, you should be working with a string of Unicode
codepoints:

"\22\341\210\264\360\222\215\205"

-->

"\x12\u1234\U00012345"

The incoming byte stream has a length of 8, the resulting character
stream has a length of 3. Now, if the language wants to use UTF-16
internally, it's free to do so:

0012 1234 d808 df45

When I referred to exposing surrogates to the application, this is
what I'm talking about. If decoding the above byte stream results in a
length 4 string where the last two are \ud808 and \udf45, then it's
exposing them. If it's a length 3 string where the last is \U00012345,
then it's hiding them. To be honest, I don't imagine I'll ever see a
language that stores strings in UTF-16 and then exposes them to the
application as UTF-32; there's very little point. But such *is*
possible, and if it's working closely with libraries that demand
UTF-16, it might well make sense to do things that way.
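The byte stream and the UTF-16 code units quoted above can be checked in Python 3. This is only a sketch: the variable names are illustrative, and the bytes are the same octal escapes written in hex.

```python
data = b"\x12\xe1\x88\xb4\xf0\x92\x8d\x85"  # \22\341\210\264\360\222\215\205
text = data.decode("utf-8")

print(len(data))   # 8 bytes in
print(len(text))   # 3 code points out
print(text == "\x12\u1234\U00012345")

# What a UTF-16 based implementation would hold internally.
utf16 = text.encode("utf-16-be")
units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
print(units)       # ['0012', '1234', 'd808', 'df45']
```

A language that hides surrogates reports a length of 3 here; one that exposes them reports 4, the number of entries in units.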

ChrisA

Steven D'Aprano · Jul 25, 2013

But mainly, I'm just wondering how many people here have any basis from
which to argue the point he's trying to make. I doubt most of us have
(a) implemented an editor widget, or (b) tested multiple different
internal representations to learn the true pros and cons of each. And
even if any of us had, that still wouldn't have any bearing on PEP 393,
which is about applications, not editor widgets. As stated above, Python
strings before AND after PEP 393 are poor choices for an editor, ergo
arguing from that standpoint is pretty useless.

That's a misleading way to put it. Using immutable strings as editor
buffers might be a bad way to implement all but the most trivial, low-
performance (i.e. slow) editor, but the basic concept of PEP 393, picking
an internal representation of the text based on its contents, is not.
That's just normal. The only difference with PEP 393 is that the choice
is made on the fly, at runtime, instead of decided in advance by the
programmer.

I expect that the PEP 393 concept of optimizing memory per string buffer
would work well in an editor. However the internal buffer is arranged,
you can safely assume that each chunk of text (word, sentence, paragraph,
buffer...) will very rarely shift from "all Latin 1" to "all BMP" to
"includes SMP chars". So, for example, entering a SMP character will need
to immediately up-cast the chunk from 1-byte per char to 4-bytes per
char, which is relatively pricey, but it's a one-off cost. Down-casting
when the SMP character is deleted doesn't need to be done immediately, it
can be performed when the application is idle.

If the chunks are relatively small (say, a paragraph rather than multiple
pages of text) then even that initial conversion will be invisible. A
fast touch typist hits a key about every 0.1 of a second; if it takes a
millisecond to convert the chunk, you wouldn't even notice the delay. You
can copy and up-cast a lot of bytes in a millisecond.
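A rough sketch of that per-chunk strategy, not taken from any real editor: the class name Chunk and the helper width_for are hypothetical, and the text is kept in an ordinary Python string purely to keep the example short. Only the bookkeeping is shown: up-cast at once when a wider character arrives, down-cast later when idle.

```python
def width_for(text):
    """Bytes per character a fixed-width buffer would need to hold `text`."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1  # ASCII / Latin-1
    if highest < 0x10000:
        return 2  # rest of the BMP
    return 4      # SMP and other supplementary planes


class Chunk:
    """One paragraph-sized piece of an editor buffer (illustrative only)."""

    def __init__(self, text=""):
        self.text = text
        self.width = width_for(text)

    def insert(self, pos, ch):
        self.text = self.text[:pos] + ch + self.text[pos:]
        # Up-cast immediately: the chunk must be able to hold the new character.
        self.width = max(self.width, width_for(ch))

    def delete(self, pos):
        self.text = self.text[:pos] + self.text[pos + 1:]
        # No down-cast here; that is deferred to compact().

    def compact(self):
        """Down-cast if possible; meant to be called when the application is idle."""
        self.width = width_for(self.text)


chunk = Chunk("hello")
print(chunk.width)             # 1
chunk.insert(5, "\U0001F60A")  # an SMP character forces 4 bytes per char
print(chunk.width)             # 4
chunk.delete(5)
print(chunk.width)             # still 4 until compact() runs
chunk.compact()
print(chunk.width)             # 1
```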

Steven D'Aprano · Jul 25, 2013

If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (eg ECMAScript) decides
to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
even if it broke code to do so.

Unfortunately, so long as most language designers are European-centric,
there is going to be a lot of push-back against any attempt to fix (say)
Javascript, or Java just for the sake of "a bunch of dead languages" in
the SMPs. Thank goodness for emoji. Wait til the young kids start
complaining that their emoticons and emoji are broken in Javascript, and
eventually it will get fixed. It may take a decade, for the young kids to
grow up and take over Javascript from the old-codgers, but it will happen.

To my mind, exposing UTF-16 surrogates
to the application is a bug to be fixed, not a feature to be maintained.

This, times a thousand.

It is *possible* to have non-buggy string routines using UTF-16, but the
implementation is a lot more complex than most language developers can be
bothered with. I'm not aware of any language that uses UTF-16 internally
that doesn't give wrong results for surrogate pairs.

Chris Angelico · Jul 25, 2013

That's a misleading way to put it. Using immutable strings as editor
buffers might be a bad way to implement all but the most trivial, low-
performance (i.e. slow) editor, but the basic concept of PEP 393, picking
an internal representation of the text based on its contents, is not.
That's just normal. The only difference with PEP 393 is that the choice
is made on the fly, at runtime, instead of decided in advance by the
programmer.

Maybe I worded it poorly, but my point was the same as you're saying
here: that a Python string is a poor buffer for editing, regardless of
PEP 393. It's not that PEP 393 makes Python strings worse for writing
a text editor, it's that immutability does that.

ChrisA

Chris Angelico · Jul 25, 2013

Unfortunately, so long as most language designers are European-centric,
there is going to be a lot of push-back against any attempt to fix (say)
Javascript, or Java just for the sake of "a bunch of dead languages" in
the SMPs. Thank goodness for emoji. Wait til the young kids start
complaining that their emoticons and emoji are broken in Javascript, and
eventually it will get fixed. It may take a decade, for the young kids to
grow up and take over Javascript from the old-codgers, but it will happen.

I don't know that that'll happen like that. Emoticons aren't broken in
Javascript - you can use them just fine. You only start seeing
problems when you index into that string. People will start to wonder
why, for instance, a "500 character maximum" field deducts two from
the limit when an emoticon goes in. Example:

Type here:<br><textarea id=content oninput="showlimit(this)"></textarea>
<br>You have <span id=limit1>500</span> characters left (self.value.length).
<br>You have <span id=limit2>500</span> characters left (self.textLength).
<script>
function showlimit(self)
{
    document.getElementById("limit1").innerHTML=500-self.value.length;
    document.getElementById("limit2").innerHTML=500-self.textLength;
}
</script>

I've included an attribute documented here[1] as the "codepoint length
of the control's value", but in Chrome on Windows, it still counts
UTF-16 code units. However, I very much doubt that this will result in
language changes. People will just live with it. Chinese and Japanese
users will complain, perhaps, and the developers will write it off as
whinging, and just say "That's what the internet does". Maybe, if
you're really lucky, they'll acknowledge that "that's what JavaScript
does", but even then I doubt it'd result in language changes.
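The same off-by-one can be reproduced outside the browser. Here is a small Python sketch; the function names utf16_length and remaining are made up for the example, and the message is an arbitrary string with one emoji:

```python
def utf16_length(text):
    """Number of UTF-16 code units, which is what a UTF-16 based length reports."""
    return len(text.encode("utf-16-le")) // 2


def remaining(limit, text, count=len):
    """Characters left under `limit`, using `count` to measure the text."""
    return limit - count(text)


message = "hi \U0001F60A"  # three BMP characters plus one emoji

print(remaining(500, message))                # counts code points
print(remaining(500, message, utf16_length))  # counts code units, one fewer left
```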

This, times a thousand.

It is *possible* to have non-buggy string routines using UTF-16, but the
implementation is a lot more complex than most language developers can be
bothered with. I'm not aware of any language that uses UTF-16 internally
that doesn't give wrong results for surrogate pairs.

The problem isn't the underlying representation, the problem is what
gets exposed to the application. Once you've decided to expose
codepoints to the app (abstracting over your UTF-16 underlying
representation), the change to using UTF-32, or mimicking PEP 393, or
some other structure, is purely internal and an optimization. So I
doubt any language will use UTF-16 internally and UTF-32 to the app.
It'd be needlessly complex.

ChrisA

[1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement

Steven D'Aprano · Jul 25, 2013

I don't know that that'll happen like that. Emoticons aren't broken in
Javascript - you can use them just fine. You only start seeing problems
when you index into that string. People will start to wonder why, for
instance, a "500 character maximum" field deducts two from the limit
when an emoticon goes in.

I get that. I meant *Javascript developers*, not end-users. The young
kids today who become Javascript developers tomorrow will grow up in a
world where they expect to be able to write band names like
"▼□■□■□■" (yes, really, I didn't make that one up) and have it just work.
Okay, all those characters are in the BMP, but emoji aren't, and I
guarantee that even as we speak some new hipster band is trying to decide
whether to name themselves "Smiling 😢" or "Crying 😊".

The problem isn't the underlying representation, the problem is what
gets exposed to the application. Once you've decided to expose
codepoints to the app (abstracting over your UTF-16 underlying
representation), the change to using UTF-32, or mimicking PEP 393, or
some other structure, is purely internal and an optimization. So I doubt
any language will use UTF-16 internally and UTF-32 to the app. It'd be
needlessly complex.

To be honest, I don't understand what you are trying to say.

What I'm trying to say is that it is possible to use UTF-16 internally,
but *not* assume that every code point (character) is represented by a
single 2-byte unit. For example, the len() of a UTF-16 string should not
be calculated by counting the number of bytes and dividing by two. You
actually need to walk the string, inspecting each double-byte:

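# Calculate length in code points, validating surrogate pairs as we go.
# `buffer` is assumed to yield 16-bit code units as integers.
def is_lower_surrogate(bb):  # first unit of a pair (Unicode: "high" surrogate)
    return 0xD800 <= bb <= 0xDBFF

def is_upper_surrogate(bb):  # second unit of a pair (Unicode: "low" surrogate)
    return 0xDC00 <= bb <= 0xDFFF

count = 0
inside_surrogate = False
for bb in buffer:  # get two bytes at a time
    if is_lower_surrogate(bb):
        if inside_surrogate:
            raise ValueError("missing upper surrogate")
        inside_surrogate = True
        continue
    if is_upper_surrogate(bb):
        if inside_surrogate:
            count += 1
            inside_surrogate = False
            continue
        raise ValueError("missing lower surrogate")
    if inside_surrogate:
        raise ValueError("missing upper surrogate")
    count += 1
if inside_surrogate:
    raise ValueError("missing upper surrogate")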

Given immutable strings, you could validate the string once, on creation,
and from then on assume they are well-formed:

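# Calculate length, assuming the string is well-formed.
# `buffer` is assumed to yield 16-bit code units as integers.
def is_surrogate(bb):
    return 0xD800 <= bb <= 0xDFFF

count = 0
skip = False
for bb in buffer:  # get two bytes at a time
    if skip:
        skip = False  # second half of a pair, already counted
        continue
    if is_surrogate(bb):
        skip = True
    count += 1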

String operations such as slicing become much more complex once you can
no longer assume a 1:1 relationship between code points and code units,
whether they are 1, 2 or 4 bytes. Most (all?) language developers don't
handle that complexity, and push responsibility for it back onto the
coder using the language.
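To give a feel for the extra work, here is a hedged sketch of indexing by code point over UTF-16 code units. The function name code_point_at is hypothetical, the input is assumed to be a well-formed list of 16-bit integers, and the walk from the start is what makes the operation O(n):

```python
def code_point_at(units, index):
    """Return the code point at code-point position `index`.

    `units` is assumed to be a well-formed sequence of UTF-16 code units.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    position = 0  # how many code points have been passed so far
    i = 0         # position in code units
    while i < len(units):
        unit = units[i]
        starts_pair = 0xD800 <= unit <= 0xDBFF
        if position == index:
            if starts_pair:
                # Combine the two halves of the surrogate pair.
                return 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            return unit
        i += 2 if starts_pair else 1
        position += 1
    raise IndexError("code point index out of range")


units = [0x0012, 0x1234, 0xD808, 0xDF45, 0x0041]
print(hex(code_point_at(units, 2)))  # the SMP character
print(hex(code_point_at(units, 3)))  # the unit after the pair
```

Slicing needs the same walk twice, once for each end of the slice, unless the implementation keeps some kind of index alongside the buffer.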

wxjmfauth · Jul 25, 2013

On Wednesday, 24 July 2013 16:47:36 UTC+2, Michael Torrie wrote:

Really? Enlighten me.

Personally, I would never use UTF as a representation *in memory* for a
unicode string if it were up to me. Why? Because UTF characters are
not uniform in byte width so accessing positions within the string is
terribly slow and has to always be done by starting at the beginning of
the string. That's at minimum O(n) compared to FSR's O(1). Surely you
understand this. Do you dispute this fact?

UTF is a great choice for interchange, though, and indeed that's what it
was designed for.

Are you calling for UTF to be adopted as the internal, in-memory
representation of unicode? Or would you simply settle for UCS-4?
Please be clear here. What are you saying?

How? FSR is just an implementation detail. It could be UCS-4 and it
would also work.

---------

A coding scheme works with a unique set of characters (the repertoire),
and the implementation (the programming) works with a unique set
of encoded code points. The critical step is the path
{unique set of characters} <--> {unique set of encoded code points}

Fact: there is no other way to do it properly. (This explains
why we have to live today with all these coding schemes, and also
why so many coding schemes had to be created.)

How to understand it? With a sheet of paper and a pencil.

In the byte string world, this step is a no-op.

In Unicode, it is exactly the purpose of a "utf" to achieve this
step. "utf": a confusing name covering at the same time the
process and the result of the process.
A "utf chunk", a series of bits (not bytes), intrinsically holds
the information about the character it is representing.

Other "exotic" coding schemes like iso6937 or "CID-fonts" work
in the same way.

"Unicode" with the help of "utf(s)" does not differ from the basic
rule.

-----

ucs-2: ucs-2 is a perfectly and correctly working coding scheme.
ucs-2 is not different from the other coding schemes and does
not behave differently (cp... or iso-... or ...). It only
covers a smaller repertoire.

-----

utf32: as I have pointed out many times, you are already using it (maybe
without knowing it). Where? In fonts (OpenType technology),
rendering engines, pdf files. Why? Because there is no better
way to do it.

------

The Unicode table (its construction) is a problem per se.
It is not a technical problem, but a very important "linguistic
aspect" of Unicode.
See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

------

If you do not understand my "editor" analogy, here is another
proposed exercise. Build/create a flexible iso-8859-X coding
scheme. You will quickly understand where the bottleneck
is.
Two working ways:
- stupidly with an editor and your fingers.
- lazily with a sheet of paper and your head.

----

About my benchmarks: No offense. You do not understand them,
because you do not understand what this FSR does or the coding
of characters. It is a bit of a vicious circle.

Conceptually, this FSR is spending its time solving the
problem it creates itself, with plenty of side effects.

-----

There is a clear difference between FSR and ucs-4/utf32.

-----

See also:
http://www.unicode.org/reports/tr17/

(In my mind, quite "dry" and not easy to understand on
a first reading).

jmf

Chris Angelico · Jul 25, 2013

What I'm trying to say is that it is possible to use UTF-16 internally,
but *not* assume that every code point (character) is represented by a
single 2-byte unit. For example, the len() of a UTF-16 string should not
be calculated by counting the number of bytes and dividing by two. You
actually need to walk the string, inspecting each double-byte

Anything's possible. But since underlying representations can be
changed fairly easily (relative term of course - it's a lot of work,
but it can be changed in a single release, no deprecation required or
anything), there's very little reason to continue using UTF-16
underneath. May as well switch to UTF-32 for convenience, or PEP 393
for convenience and efficiency, or maybe some other system that's
still mostly fixed-width.

ChrisA

Chris Angelico · Jul 25, 2013

A coding scheme works with a unique set of characters (the repertoire),
and the implementation (the programming) works with a unique set
of encoded code points. The critical step is the path
{unique set of characters} <--> {unique set of encoded code points}

That's called Unicode. It maps the character 'A' to the code point
U+0041 and so on. Code points are integers. In fact, they are very
well represented in Python that way (also in Pike, fwiw):

123456
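A minimal sketch of that mapping, using only the built-ins ord() and chr(). The sample values are chosen for illustration:

```python
# Character -> code point (an ordinary int)
print(ord("A"))        # 65
print(hex(ord("A")))   # 0x41, i.e. U+0041

# Code point -> character
print(chr(0x41))       # A
smp_char = chr(123456) # 123456 == 0x1E240, a code point outside the BMP
print(hex(ord(smp_char)))
print(len(smp_char))   # 1 on Python 3.3+: one code point, one item
```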

In the byte string world, this step is a no-op.

In Unicode, it is exactly the purpose of a "utf" to achieve this
step. "utf": a confusing name covering at the same time the
process and the result of the process.
A "utf chunk", a series of bits (not bytes), intrinsically holds
the information about the character it is representing.

No, now you're looking at another level: how to store codepoints in
memory. That demands that they be stored as bits and bytes, because PC
memory works that way.

utf32: as I have pointed out many times, you are already using it (maybe
without knowing it). Where? In fonts (OpenType technology),
rendering engines, pdf files. Why? Because there is no better
way to do it.

And UTF-32 is an excellent system... as long as you're okay with
spending four bytes for every character.

See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

I refuse to click this link. Give us a link to the
(e-mail address removed) archive, or gmane, or something else more
suited to the audience. I'm not going to Google Groups just to figure
out what you're saying.

If you do not understand my "editor" analogy, here is another
proposed exercise. Build/create a flexible iso-8859-X coding
scheme. You will quickly understand where the bottleneck
is.
Two working ways:
- stupidly with an editor and your fingers.
- lazily with a sheet of paper and your head.

What has this to do with the editor?

There is a clear difference between FSR and ucs-4/utf32.

Yes. Memory usage. PEP 393 strings might take up half or even a
quarter of what they'd take up in fixed UTF-32. Other than that,
there's no difference.

ChrisA

Jeremy Sanders · Jul 25, 2013

Short example. Writing an editor with something like the
FSR is simply impossible (properly).

http://www.gnu.org/software/emacs/m...ext-Representations.html#Text-Representations

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are
codepoints of text characters within buffers and strings. Rather, Emacs uses a
variable-length internal representation of characters, that stores each
character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of
its codepoint[1]. For example, any ASCII character takes up only 1 byte, a
Latin-1 character takes up 2 bytes, etc. We call this representation of text
multibyte.

....

[1] This internal representation is based on one of the encodings defined by
the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but
Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-
bit bytes and characters not unified with Unicode.

"

Jeremy

Devyn Collier Johnson · Jul 25, 2013

Short example. Writing an editor with something like the
FSR is simply impossible (properly).

http://www.gnu.org/software/emacs/m...ext-Representations.html#Text-Representations

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are
codepoints of text characters within buffers and strings. Rather, Emacs uses a
variable-length internal representation of characters, that stores each
character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of
its codepoint[1]. For example, any ASCII character takes up only 1 byte, a
Latin-1 character takes up 2 bytes, etc. We call this representation of text
multibyte.

...

[1] This internal representation is based on one of the encodings defined by
the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but
Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-
bit bytes and characters not unified with Unicode.

"

Jeremy

Wow! The thread that I started has changed a lot and lived a long time.
I look forward to its first birthday (^u^).

Devyn Collier Johnson

Steven D'Aprano · Jul 25, 2013

Short example. Writing an editor with something like the FSR is simply
impossible (properly).
http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example, any
ASCII character takes up only 1 byte, a Latin-1 character takes up 2
bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

[1] This internal representation is based on one of the encodings
defined by the Unicode Standard, called UTF-8, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
codepoints it uses for raw 8-bit bytes and characters not unified with
Unicode.
"

Do you know what those characters not unified with Unicode are? Is there
a list somewhere? I've read all of the pages from here to no avail:

http://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html

Chris Angelico · Jul 25, 2013

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example, any
ASCII character takes up only 1 byte, a Latin-1 character takes up 2
bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

.... lolwut?

ChrisA

Steven D'Aprano · Jul 25, 2013

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example,
any ASCII character takes up only 1 byte, a Latin-1 character takes up
2 bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

... lolwut?

JMF has explained that it is impossible, impossible I say!, to write an
editor using a flexible string representation. Since Emacs uses such a
flexible string representation, Emacs is impossible, and therefore Emacs
doesn't exist.

QED.

Chris Angelico · Jul 25, 2013

On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example,
any ASCII character takes up only 1 byte, a Latin-1 character takes up
2 bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

... lolwut?

JMF has explained that it is impossible, impossible I say!, to write an
editor using a flexible string representation. Since Emacs uses such a
flexible string representation, Emacs is impossible, and therefore Emacs
doesn't exist.

QED.

Quad Error Demonstrated.

I never got past the level of Canis Latinicus in debating class.

ChrisA

wxjmfauth · Jul 25, 2013

On Thursday, 25 July 2013 12:14:46 UTC+2, Chris Angelico wrote:

That's called Unicode. It maps the character 'A' to the code point
U+0041 and so on. Code points are integers. In fact, they are very
well represented in Python that way (also in Pike, fwiw):

No, now you're looking at another level: how to store codepoints in
memory. That demands that they be stored as bits and bytes, because PC
memory works that way.

And UTF-32 is an excellent system... as long as you're okay with
spending four bytes for every character.

I refuse to click this link. Give us a link to the
(e-mail address removed) archive, or gmane, or something else more
suited to the audience. I'm not going to Google Groups just to figure
out what you're saying.

What has this to do with the editor?

Yes. Memory usage. PEP 393 strings might take up half or even a
quarter of what they'd take up in fixed UTF-32. Other than that,
there's no difference.

ChrisA

--------

Let's start with a simple string \textemdash or \textendash
26

jmf

Chris Angelico · Jul 25, 2013

Let's start with a simple string \textemdash or \textendash

26

Most of the cost is in those two apostrophes, look:
8

Okay, that's slightly unfair (bonus points: figure out what I did to
make this work; there are at least two right answers) but still, look
at what an empty string costs:
25

Or look at the difference between one of these characters and two:
2

That's what the characters really cost. The overhead is fixed. It is,
in fact, almost completely insignificant. The storage requirement for
a non-ASCII, BMP-only string converges to two bytes per character.
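A hedged way to see the fixed overhead for yourself, assuming CPython 3.3 or later. The absolute numbers from sys.getsizeof differ between versions and platforms, so only differences and ratios are meaningful here, and the en dash is just a convenient non-ASCII BMP character:

```python
import sys

dash = "\u2013"  # EN DASH: non-ASCII, inside the BMP

one = sys.getsizeof(dash)
two = sys.getsizeof(dash * 2)
per_char = two - one  # what one more such character really costs
print("fixed overhead plus one char:", one)
print("cost of one extra char:", per_char)

# Average bytes per character as the string grows: the fixed header
# is amortised away and the figure approaches the per-character cost.
for n in (1, 10, 1000, 100000):
    average = sys.getsizeof(dash * n) / n
    print(n, round(average, 3))
```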

ChrisA

Prasad, Ramit · Jul 25, 2013

Chris said:
Most of the cost is in those two apostrophes, look:

8

Okay, that's slightly unfair (bonus points: figure out what I did to
make this work; there are at least two right answers) but still, look
at what an empty string costs:

I like bonus points.

8

Not sure what the other right answer is...booleans take 12 bytes (on 2.6)

25

Or look at the difference between one of these characters and two:

2

That's what the characters really cost. The overhead is fixed. It is,
in fact, almost completely insignificant. The storage requirement for
a non-ASCII, BMP-only string converges to two bytes per character.

ChrisA

Ramit

Ian Kelly · Jul 25, 2013

On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example,
any ASCII character takes up only 1 byte, a Latin-1 character takes up
2 bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

... lolwut?

JMF has explained that it is impossible, impossible I say!, to write an
editor using a flexible string representation. Since Emacs uses such a
flexible string representation, Emacs is impossible, and therefore Emacs
doesn't exist.

QED.

Except that the described representation used by Emacs is a variant of
UTF-8, not an FSR. It doesn't have three different possible encodings
for the letter 'a' depending on what other characters happen to be in
the string.

As I understand it, jmf would be perfectly happy if Python used UTF-8
(or presumably the Emacs variant) as its internal string
representation.
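The distinction can be made concrete with a small sketch, assuming CPython 3.3 or later. The helper name cost_of_appending_a and the sample prefixes are made up for the example. It measures how many bytes each appended letter 'a' costs in strings of different kinds, next to its constant one-byte cost in UTF-8:

```python
import sys


def cost_of_appending_a(prefix, n=1000):
    """Bytes per letter when n copies of 'a' are appended to `prefix` (CPython)."""
    grown = sys.getsizeof(prefix + "a" * n)
    return (grown - sys.getsizeof(prefix)) / n


samples = [
    ("ASCII neighbour", "x"),
    ("BMP neighbour", "\u20ac"),      # EURO SIGN
    ("SMP neighbour", "\U0001F60A"),  # an emoji
]

for label, prefix in samples:
    in_memory = cost_of_appending_a(prefix)
    in_utf8 = len((prefix + "a").encode("utf-8")) - len(prefix.encode("utf-8"))
    print(label, "- PEP 393 bytes per 'a':", in_memory, "- UTF-8 bytes per 'a':", in_utf8)
```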
