RE Module Performance

  • Thread starter Devyn Collier Johnson
  • Start date
W

wxjmfauth

Le lundi 29 juillet 2013 13:57:47 UTC+2, Chris Angelico a écrit :
45



Wonder if that might maybe have an impact on the timings.



ChrisA

Good point. I stupidely forgot this.

jmf
 
W

wxjmfauth

Le dimanche 28 juillet 2013 19:36:00 UTC+2, Terry Reedy a écrit :
Not necessarily so. See below.









Slicing is at least O(m) where m is the length of the slice.






I posted about a week ago, in response to Chris A., a method by which

lookup for UTF-16 can be made O(log2 k), or perhaps more accurately,

O(1+log2(k+1)), where k is the number of non-BMP chars in the string.



This uses an auxiliary array of k ints. An auxiliary array of n ints

would make UFT-16 lookup O(1), but then one is using more space than

with UFT-32. Similar comments apply to UTF-8.



The unicode standard says that a single strings should use exactly one

coding scheme. It does *not* say that all strings in an application must

use the same scheme. I just rechecked a few days ago. It also does not

say that an application cannot associate additional data with a string

to make processing of the string easier.

To my knowledge, the Unicode doc always speak about
the misc. utf* coding schemes in an "exclusive or" way.

Having multiple encoded strings is one thing. Manipulating
multiple encoded strings is something else.

Maybe the mistake was to not emphasize the fact that
one has to work with a unique set of encoded code points
(utf-8 or utf-16 or utf-32) because it was considered,
as to obvious one can not work properly with multiple
coding schemes.

You are also right in saying " ...application cannot associate
additional data...".
The doc does not specify it either. It is superfleous.


jmf
 
W

wxjmfauth

Le lundi 29 juillet 2013 13:57:47 UTC+2, Chris Angelico a écrit :
45



Wonder if that might maybe have an impact on the timings.



ChrisA

--------


class C:
a = 'abc'
b = 'def'
def aaa(self):
pass
def bbb(self):
pass
def ccc(self):
pass

if __name__ == '__main__':
import timeit
print(timeit.timeit("r = dir(C)", setup="from __main__ import C"))


c:\python32\pythonw -u "timitmod.py" 15.258061416225663
Exit code: 0
c:\Python33\pythonw -u "timitmod.py" 17.052203122286194
Exit code: 0


jmf
 
W

wxjmfauth

Le lundi 29 juillet 2013 16:49:34 UTC+2, Chris Angelico a écrit :
Did you even think to check that before you posted timings?



ChrisA

Boum, no! the diff is one.
I have however noticed, I can increase the number
of attributes (ascii), the timing differences
is very well marked.
I do not draw conclusions. Such a factor for one
unit....

jmf
 
W

wxjmfauth

Le dimanche 28 juillet 2013 05:53:22 UTC+2, Ian a écrit :
Yes, given a pointer location into a utf-8 or utf-16 string, it is

easy to determine the identity of the code point at that location.

But this is not often a useful operation, save for resynchronization

in the case that the string data is corrupted. The caret of an editor

does not conceptually correspond to a pointer location, but to a

character index. Given a particular character index (e.g. 127504), an

editor must be able to determine the identity and/or the memory

location of the character at that index, and for UTF-8 and UTF-16

without an auxiliary data structure that is a O(n) operation.
------

Same conceptual mistake as Steven's example with its buffers,
the buffer does not know it holds characters.
This is not the point to discuss.

-----

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy the buffer of your
editor does not waste time in reencoding the buffer as
soon as you enter an €, the 125505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies twice a reencoding.

Somebody wrote "FSR" is just an optimization. Yes, but in case
of an editor à la FSR, this optimization take place everytime you
enter a char. Your poor editor, in fact the FSR, is finally
spending its time in optimizing and finally it optimizes nothing.
(It is even worse).

If you type correctly a z instead of an €, it is not necessary
to reencode the buffer. Problem, you do you know that you do
not have to reencode? simple just check it, and by just checking
it wastes time to test it you have to optimized or not and hurt
a little bit more what is supposed to be an optimization.

Do not confuse the process of optimisation and the result of
optimization (funny, it's like the utf's).

There is a trick to make the editor to know if it has
to be "optimized". Just put some flag somewhere. Then
you fall on the "Houston" syndrome. Houston, we got a
problem, our buffer consumes much more bytes than expected.
26

Now the good news. In an editor à la FSR, the
"composition" is not so important. You know,
"practicality beats purity". The hard job
is the text rendering engine and the handling
of the font (even in a raw unicode editor).
And as these tools are luckily not woking à la FSR
(probably because they understand the coding
of the characters), your editor is still working
not so badly.

jmf
 
A

Antoon Pardon

Op 30-07-13 16:01, (e-mail address removed) schreef:
I am pretty sure that once you have typed your 127504
ascii characters, you are very happy the buffer of your
editor does not waste time in reencoding the buffer as
soon as you enter an €, the 125505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies twice a reencoding.

Using a single string as an editor buffer is a bad idea in python
for the simple reason that strings are immutable. So adding
characters would mean continuously copying the string buffer
into a new string with the next character added. Copying
127504 characters into a new string will not make that much
of a difference whether the octets are just copied to octets
or are unpacked into 32 bit words.
Somebody wrote "FSR" is just an optimization. Yes, but in case
of an editor à la FSR, this optimization take place everytime you
enter a char. Your poor editor, in fact the FSR, is finally
spending its time in optimizing and finally it optimizes nothing.
(It is even worse).

Even if you would do it this way, it would *not* take place
every time you enter a char. Once your buffer would contain
a wide character, it would just need to convert the single
character that is added after each keystroke. It would not
need to convert the whole buffer after each key stroke.
If you type correctly a z instead of an €, it is not necessary
to reencode the buffer. Problem, you do you know that you do
not have to reencode? simple just check it, and by just checking
it wastes time to test it you have to optimized or not and hurt
a little bit more what is supposed to be an optimization.

Your scenario is totally unrealistic. First of all because of
the immutable nature of python strings, second because you
suggest that real time usage would result in frequent conversions
which is highly unlikely.
 
C

Chris Angelico

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy the buffer of your
editor does not waste time in reencoding the buffer as
soon as you enter an €, the 125505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies twice a reencoding.

You're still thinking that the editor's buffer is a Python string. As
I've shown earlier, this is a really bad idea, and that has nothing to
do with FSR/PEP 393. An immutable string is *horribly* inefficient at
this; if you want to keep concatenating onto a string, the recommended
method is a list of strings that gets join()d at the end, and the same
technique works well here. Here's a little demo class that could make
the basis for such a system:

class EditorBuffer:
def __init__(self,fn):
self.fn=fn
self.buffer=[open(fn).read()]
def insert(self,pos,char):
if pos==0:
# Special case: insertion at beginning of buffer
if len(self.buffer[0])>1024: self.buffer.insert(0,char)
else: self.buffer[0]=char+self.buffer[0]
return
for idx,part in enumerate(self.buffer):
l=len(part)
if pos>l:
pos-=l
continue
if pos<l:
# Cursor is somewhere inside this string
splitme=self.buffer[idx]
self.buffer[idx:idx+1]=splitme[:pos],splitme[pos:]
l=pos
# Cursor is now at the end of this string
if l>1024: self.buffer[idx:idx+1]=self.buffer[idx],char
else: self.buffer[idx]+=char
return
raise ValueError("Cannot insert past end of buffer")
def __str__(self):
return ''.join(self.buffer)
def save(self):
open(fn,"w").write(str(self))

It guarantees that inserts will never need to resize more than 1KB of
text. As a real basis for an editor, it still sucks, but it's purely
to prove this one point.

ChrisA
 
M

MRAB

Op 30-07-13 16:01, (e-mail address removed) schreef:

Using a single string as an editor buffer is a bad idea in python for
the simple reason that strings are immutable.

Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.
So adding characters would mean continuously copying the string
buffer into a new string with the next character added. Copying
127504 characters into a new string will not make that much of a
difference whether the octets are just copied to octets or are
unpacked into 32 bit words.


Even if you would do it this way, it would *not* take place every
time you enter a char. Once your buffer would contain a wide
character, it would just need to convert the single character that is
added after each keystroke. It would not need to convert the whole
buffer after each key stroke.


Your scenario is totally unrealistic. First of all because of the
immutable nature of python strings, second because you suggest that
real time usage would result in frequent conversions which is highly
unlikely.
What you would have is a list of mutable chunks.

Inserting into a chunk would be fast, and a chunk would be split if
it's already full. Also, small adjacent chunks would be joined together.

Finally, a chunk could use FSR to reduce memory usage.
 
A

Antoon Pardon

Op 30-07-13 18:13, MRAB schreef:
Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.

Not if you use a gap buffer.
 
M

MRAB

Op 30-07-13 18:13, MRAB schreef:

Not if you use a gap buffer.
The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.
 
T

Tim Delaney

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy the buffer of your
editor does not waste time in reencoding the buffer as
soon as you enter an €, the 125505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies twice a reencoding.

And here we come to the root of your complete misunderstanding and
mischaracterisation of the FSR. You don't appear to understand that
strings in Python are immutable and that to add a character to an
existing string requires copying the entire string + new character. In
your hypothetical situation above, you have already performed 127504
copy + new character operations before you ever get to a single widening
operation. The overhead of the copy + new character repeated 127504
times dwarfs the overhead of a single widening operation.

Given your misunderstanding, it's no surprise that you are focused on
microbenchmarks that demonstrate that copying entire strings and adding
a character can be slower in some situations than others. When the only
use case you have is implementing the buffer of an editor using an
immutable string I can fully understand why you would be concerned about
the performance of adding and removing individual characters. However,
in that case *you're focused on the wrong problem*.

Until you can demonstrate an understanding that doing the above in any
language which has immutable strings is completely insane you will have
no credibility and the only interest anyone will pay to your posts is
refuting your FUD so that people new to the language are not driven off
by you.

Tim Delaney
 
J

Joshua Landau

Op 30-07-13 18:13, MRAB schreef:



Not if you use a gap buffer.


Additionally, who says a language couldn't use, say, B-Trees for all of its
list-like types, including strings?
 
A

Antoon Pardon

Op 30-07-13 19:14, MRAB schreef:
The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.

So? Why are you making this a point of discussion? I was not aware that
the pro and cons of various editor buffer implemantations was relevant
to the point I was trying to make.

If you prefer an other data structure in the editor you are working on,
I will not dissuade you.
 
W

wxjmfauth

Matable, immutable, copyint + xxx, bufferint, O(n) ....
Yes, but conceptualy the reencoding happen sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (eg).
7.160483334521416


And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.
48


jmf
 
C

Chris Angelico

Matable, immutable, copyint + xxx, bufferint, O(n) ....
Yes, but conceptualy the reencoding happen sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (eg).

But probably not on the entire document. With even a brainless scheme
like I posted code for, no more than 1024 bytes will need to be
recoded at a time (except in some odd edge cases, and even then, no
more than once for any given file).
And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.

48

Yeah, you have a few odd issues like, oh, I dunno, GC overhead,
reference count, object class, and string length, all stored somewhere
there. Honestly jmf, if you want raw assembly you know where to get
it.

ChrisA
 
T

Terry Reedy

Additionally, who says a language couldn't use, say, B-Trees for all of
its list-like types, including strings?

Tk apparently uses a B-tree in its text widget.
 
M

Michael Torrie

So? Why are you making this a point of discussion? I was not aware that
the pro and cons of various editor buffer implemantations was relevant
to the point I was trying to make.

I for one found it very interesting. In fact this thread caused me to
wonder how one actually does create an efficient editor. Off the
original topic true, but still very interesting.
 
M

Michael Torrie

Matable, immutable, copyint + xxx, bufferint, O(n) ....
Yes, but conceptualy the reencoding happen sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (eg).

So what major python project are you working on where you've found FSR
in general to be a problem? Maybe we can help you work out a more
appropriate data structure and algorithm to use.

But if you're not developing something, and not developing in Python,
perhaps you should withdraw and let us use our horrible FSR in peace,
because it doesn't seem to bother the vast majority of python
programmers, and does not bother some large python projects out there.
In fact I think most of us welcome integrated, correct, full unicode.
 
S

Steven D'Aprano

And do not forget, in a pure utf coding scheme, your char or a char will
*never* be larger than 4 bytes.

48

Neither character above is larger than 4 bytes. You forgot to deduct the
size of the object header. Python is a high-level object-oriented
language, if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

import syntax 0
Cross-Platform Python3 Equivalent to notify-send 1
Aloha! Check out the Betabots! 0
Critic my module 13
PEP8 79 char max 3
List as Contributor 0
Play Ogg Files 0
Share Code Tips 13

Members online

Forum statistics

Threads
474,123
Messages
2,570,741
Members
47,297
Latest member
NicoleS61

Latest Threads

Top