RE Module Performance

  • Thread starter Devyn Collier Johnson

Chris Angelico

if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.

Could go better! Since not every character has been assigned, and some
are specifically banned (eg U+FFFE and U+D800-U+DFFF), you could cut
them out of your representation system and save memory!
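
For scale, here is a rough check of how many bits that would actually buy. This is only a sketch; the constant names are made up for illustration, and it removes just the surrogate range mentioned above.

```python
# Unicode code points run from U+0000 to U+10FFFF.
MAX_CODE_POINT = 0x10FFFF
TOTAL_CODE_POINTS = MAX_CODE_POINT + 1            # 1,114,112

# Surrogates U+D800..U+DFFF can never stand for a character.
SURROGATES = 0xDFFF - 0xD800 + 1                  # 2,048
SCALAR_VALUES = TOTAL_CODE_POINTS - SURROGATES    # 1,112,064

# Bits needed to index each set (the largest index is count - 1).
bits_all = (TOTAL_CODE_POINTS - 1).bit_length()
bits_scalar = (SCALAR_VALUES - 1).bit_length()

print(bits_all, bits_scalar)
```

Both come out at 21 bits, so dropping the surrogates alone does not save a bit; the count would have to fall to 2**20 or below before it did.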

ChrisA
 

Antoon Pardon

On 31-07-13 05:30, Michael Torrie wrote:
I for one found it very interesting. In fact this thread caused me to
wonder how one actually does create an efficient editor. Off the
original topic true, but still very interesting.

Yes, it can be interesting. But I really think if that is what you want
to discuss, it deserves its own subject thread.
 

Antoon Pardon

On 30-07-13 21:09, (e-mail address removed) wrote:
Mutable, immutable, copying + xxx, buffering, O(n) ....
Yes, but conceptually the reencoding happens sometime, somewhere.

Which is a far cry from your previous claim that it happened
every time you enter a char.

This of course makes your case harder to argue, because the
impact of something that happens sometime, somewhere is
vastly less than that of something that happens every time
you enter a char.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (eg).

It will just start producing wrong results when someone starts
using characters that don't fit into ucs-2.
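
A minimal sketch of what "wrong results" looks like, assuming a buffer of 16-bit code units and using UTF-16 as a stand-in for it. The variable names are illustrative only.

```python
s = 'a\U0001d11e'   # 'a' plus MUSICAL SYMBOL G CLEF, which lies outside the BMP

# Number of code points, as a Python 3.3+ str reports it.
code_points = len(s)

# Number of 16-bit code units a UCS-2/UTF-16 style buffer would hold.
utf16 = s.encode('utf-16-be')
code_units = len(utf16) // 2

# Index 1 of the 16-bit buffer is only a high surrogate, not the character.
unit_1 = int.from_bytes(utf16[2:4], 'big')

print(code_points, code_units, hex(unit_1))
```

The string has two characters, but the 16-bit view has three units, and indexing it yields half of a surrogate pair.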

7.160483334521416


And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.


Nonsense.
18
 

wxjmfauth

FSR:
===

The 'a' in 'a€' and 'a\U0001d11e':
['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']

Has to be done.

sys.getsizeof('a€')
42
sys.getsizeof('a\U0001d11e')
48
sys.getsizeof('aa')
27


Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded
code points.
ii) ("secondary key") Depending on the wish,
memory/performance: utf-8/16/32

Two advantages in light of the above example:
iii) The "a" never has to be reencoded.
iv) An "a" size never exceeds 4 bytes.

Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

jmf
 

Antoon Pardon

On 31-07-13 10:32, (e-mail address removed) wrote:
Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded
code points.

FSR does this.
st1 = 'a€'
st2 = 'aa'
ord(st1[0])
97
ord(st2[0])
97

ii) ("secondary key") Depending on the wish,
memory/performance: utf-8/16/32

Whose wish? I don't know any language that allows the
programmer to choose the internal representation of its
strings. If it is the designer's choice, FSR does this;
if it is the programmer's choice, I don't see why
this is necessary for compliance.

Two advantages in light of the above example:
iii) The "a" never has to be reencoded.

FSR: check. Using a container with wider slots is not a reëncoding.
If such widening counts as reencoding, then your 'choice' between utf-8/16/32
implies that it will also have to reencode when it changes from
utf-8 to utf-16 or utf-32.
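
To make the "wider slots" point concrete, here is a toy model. It is not CPython's actual implementation; `pack` and `code_point_at` are hypothetical helpers written only for this illustration. It stores each code point as a plain integer in fixed-width slots, choosing the narrowest width that fits the largest code point in the string.

```python
def pack(text):
    """Toy model: store code points in the narrowest fixed-width slots that fit."""
    widest = max(map(ord, text), default=0)
    width = 1 if widest < 0x100 else 2 if widest < 0x10000 else 4
    data = b''.join(ord(ch).to_bytes(width, 'big') for ch in text)
    return width, data

def code_point_at(packed, index):
    """Read slot `index` back as an integer code point."""
    width, data = packed
    return int.from_bytes(data[index * width:(index + 1) * width], 'big')

narrow = pack('aa')             # 1-byte slots
wide = pack('a€')               # 2-byte slots
widest = pack('a\U0001d11e')    # 4-byte slots

# The slot width differs, but slot 0 holds the same number every time.
widths = [p[0] for p in (narrow, wide, widest)]
first = [code_point_at(p, 0) for p in (narrow, wide, widest)]
print(widths, first)
```

In this model the 'a' is always the integer 97; only the amount of zero padding around it changes, which is the sense in which widening is not a change of encoding.
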
iv) An "a" size never exceeds 4 bytes.

FSR: check.
Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

Maybe you should use bytes or bytearrays if that is really what you want.
 

Michael Torrie

On 31-07-13 05:30, Michael Torrie wrote:

Yes, it can be interesting. But I really think if that is what you want
to discuss, it deserves its own subject thread.

Subject lines can and should be changed to reflect the ebbs and flows of
the discussion.

In fact this thread's subject should have been changed a long time ago
since the original topic was RE module performance!
 

wxjmfauth

On Wednesday, 31 July 2013 07:45:18 UTC+2, Steven D'Aprano wrote:
Neither character above is larger than 4 bytes. You forgot to deduct the
size of the object header. Python is a high-level object-oriented
language, if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.
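
One way to deduct the header without knowing its size is to compare two strings of the same kind and different lengths, so that the fixed overhead cancels out. The sketch below assumes CPython 3.3 or later, where `sys.getsizeof` reports the flexible string layout; `bytes_per_char` is a made-up helper name.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by cancelling out the fixed object header."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n

for ch in ('a', '€', '\U0001d11e'):
    print(repr(ch), bytes_per_char(ch))
```

On such an interpreter I would expect 1, 2 and 4 bytes respectively; other implementations may report different figures.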

.... char never consumes or requires more than 4 bytes ...

jmf
 

Chris Angelico

... char never consumes or requires more than 4 bytes ...

The integer 5 should be able to be stored in 3 bits.
14

Clearly Python is doing something really horribly wrong here. In fact,
sys.getsizeof needs to be changed to return a float, to allow it to
more properly reflect these important facts.
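
For anyone who wants to see the gap being joked about, a small sketch. The exact byte count depends on the platform and Python build, so none is claimed here.

```python
import sys

n = 5
payload_bits = n.bit_length()      # bits needed for the value itself
object_bytes = sys.getsizeof(n)    # the whole int object, header included

print(payload_bits, object_bytes)
```

The first number is 3; the second is whatever the interpreter needs for a full integer object.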

ChrisA
 
