speedy Python strings?


Uwe Mayer

Hi,

thanks to previous help my wrapper for an abstract file type with variable
record length is working now.
I did some test runs and it's awfully slow:

I didn't want to read in the whole file at once, and I didn't want to read it
step by step either (it contains 32-bit length-prefixed strings (Delphi-style)
and 32-bit integers), so I read in, say, 2 MB, work on that buffer, and when
the buffer runs empty I load another 2 MB, and so on.
For this buffer I use ordinary strings:

class myFile(file):
    def read(self, *args):
        ...
        self.buffer += file.read(self, *args)
        ...

and after reading information from the buffer I remove the read part from
it:

text = struct.unpack("L", self.buffer[:4])[0]
self.buffer = self.buffer[4:]

During debugging I saw the program speed up as the buffer size
(len(self.buffer)) shrank, hence my conclusion: since strings are
immutable, the string operations are what is slow.

I tested different sizes of the buffer and it performed best when I didn't
use that buffering system at all (which is sad, since I spent so much time on it).

Is there anything else I can use instead of ordinary strings that will speed
this up?
How do you suggest I deal with this situation? Should I drop the buffering
of the data altogether?

Thanks for any comments
Yours
Uwe
 

Jp Calderone

Hi,

thanks to previous help my wrapper for an abstract file type with variable
record length is working now.
I did some test runs and it's awfully slow:

I didn't want to read in the whole file at once, and I didn't want to read it
step by step either (it contains 32-bit length-prefixed strings (Delphi-style)
and 32-bit integers), so I read in, say, 2 MB, work on that buffer, and when
the buffer runs empty I load another 2 MB, and so on.
For this buffer I use ordinary strings:

class myFile(file):
    def read(self, *args):
        ...
        self.buffer += file.read(self, *args)
        ...

Doing the above in a loop has quadratic complexity.
and after reading information from the buffer I remove the read part from
it:

text = struct.unpack("L", self.buffer[:4])[0]
self.buffer = self.buffer[4:]

The same is true for this line.

Your program will be much faster if you make fewer string copies.
"self.buffer[4:]" is a string copy, as is "self.buffer += anotherString".

The buffer() builtin is one way to avoid copying, but another way is to
simply keep track of how far into the string you are and use both low and
high indexes when slicing, instead of repeatedly chopping the front off the
string.
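A minimal sketch of that index-tracking approach (the data and names here are invented for illustration, shown in modern Python):

```python
import struct

# Sketch of the suggestion above: keep an offset into one buffer and
# slice with both a low and a high index, so the buffer itself is never
# rebuilt after each read.
buf = struct.pack("<3I", 10, 20, 30)   # stand-in for bytes from file.read()
pos = 0

values = []
while pos < len(buf):
    # buf[pos:pos+4] copies only 4 bytes; no new buffer is created
    values.append(struct.unpack("<I", buf[pos:pos + 4])[0])
    pos += 4

print(values)   # [10, 20, 30]
```

Each iteration copies a constant 4 bytes regardless of how much data remains, so the loop stays linear in the total amount read.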

Jp
 

Roy Smith

Uwe Mayer said:
since strings are
immutable, the string operations are what is slow.

More to the point, and I suspect what you're running into, string
addition is linear with respect to string length because it has to
create a new string each time and destroy the old one. Since each
addition is linear with string length, a loop which adds strings will
run in quadratic time.

Possibly you want to look at the cStringIO module.

Another possibility is to buffer things in a list, using append to add
to the end and pop(0) to pop off the front (a roll-your-own queue).
Lists do not suffer from the quadratic behavior that string addition does.
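For the cStringIO route, a minimal sketch (shown with io.BytesIO, the modern equivalent; the sample data is invented):

```python
import io
import struct

# Sketch of the cStringIO idea (io.BytesIO in Python 3): the stream keeps
# its own read position, so consuming bytes from the front needs no
# string copying at all.
stream = io.BytesIO(struct.pack("<2I", 7, 42))

first = struct.unpack("<I", stream.read(4))[0]    # read() advances the position
second = struct.unpack("<I", stream.read(4))[0]
print(first, second)   # 7 42
```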
 

Stuart D. Gathman

class myFile(file):
    def read(self, *args):
        ...
        # self.buffer += file.read(self, *args)      # not this
        self.buffer = self.buffer[self.pos:] + file.read(self, *args)
        self.pos = 0
        ...

and after reading information from the buffer I remove the read part from
it:
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])[0]
self.pos = pos + 4
 

Josiah Carlson

Another possibility is to buffer things in a list, using append to add
to the end and pop(0) to pop off the front (a roll-your-own queue).
Lists do not suffer from the quadratic behavior that string addition does.

Deleting from the front of a list results in every item being shifted up
a single entry. While this is insignificant for short lists, it is
significant for long lists:
>>> import time
>>> def pop_front(size):
...     a = range(size)
...     t = time.time()
...     while a: a.pop(0)
...     return time.time()-t
...
>>> def pop_end(size):
...     a = range(size)
...     t = time.time()
...     while a: a.pop()
...     return time.time()-t
...
>>> pop_front(100000)
5.4530000686645508



In terms of buffering as the parent post asks, file IO is already
buffered by the underlying libraries and operating system. Another
layer of Python buffering doesn't seem to make much sense. I would
suggest writing off the buffering exercise as a learning experience.

- Josiah
 

Francis Avila

Stuart D. Gathman wrote in message ...
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])[0]
self.pos = pos + 4


In this vein, I would also recommend looking at the array module. You
didn't describe the data structure, but if each record is simply a list of
4-byte integers, you could convert the whole record to ints at once like so:

record = array.array('L', stringorlist)

Then, perform your loops on record, which will be a list-like object
supporting item insertion and deletion, and conversion to bytestrings and
other Python base types.

Or, you could simply use array.array() to implement your buffer more
efficiently than continually calling struct. Simply buffer the converted
data a chunk at a time, using array.
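A rough sketch of that whole-record conversion (the sample data is invented; note that array's type codes are platform-sized, unlike struct's fixed-width formats):

```python
import array
import struct

# Sketch: convert a chunk of 32-bit unsigned integers in one call instead
# of one struct.unpack per value.  array gives no fixed-width guarantee,
# so check itemsize before trusting 'I' to mean 4 bytes.
data = struct.pack("=4I", 1, 2, 3, 4)   # stand-in for a record's raw bytes

record = array.array('I', data)
assert record.itemsize == 4             # guard against a wider 'I'
print(list(record))                     # [1, 2, 3, 4]
```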

However, I don't think buffering helps much in this case. File operations
are already buffered by Python, and most (all?) OSes buffer reads
themselves, too. Reading a file is almost always very fast. I would
concentrate more on presenting a useful abstraction for your data, rather
than worrying about what is essentially an optimization problem.
 

Jp Calderone

Stuart D. Gathman wrote in message ...
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])[0]
self.pos = pos + 4


In this vein, I would also recommend looking at the array module. You
didn't describe the data structure, but if each record is simply a list of
4-byte integers, you could convert the whole record to ints at once like so:

record = array.array('L', stringorlist)

Then, perform your loops on record, which will be a list-like object
supporting item insertion and deletion, and conversion to bytestrings and
other Python base types.

Also note the existence of fromfile() - the array module actually makes it
possible to do this incredibly efficiently:

a = array.array('L')
a.fromfile(openFileObj, count)
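A runnable sketch of the fromfile() route (the temporary file and count here are invented for illustration):

```python
import array
import struct
import tempfile

# Sketch of array.fromfile(): pull `count` 32-bit integers straight from
# an open file object into an array, with no intermediate string.
with tempfile.TemporaryFile() as f:
    f.write(struct.pack("=5I", *range(5)))
    f.seek(0)

    a = array.array('I')
    a.fromfile(f, 5)      # reads exactly 5 items, or raises EOFError

print(list(a))            # [0, 1, 2, 3, 4]
```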

Jp
 

Tim Delaney

Josiah Carlson said:
Deleting from the front of a list results in every item being
shifted up a single entry. While this is insignificant for
short lists, it is significant for long lists:

Note that there's been some very interesting discussions on python-dev
recently that may end up alleviating this issue somewhat in the future - but
no promises - there doesn't appear to be a consensus on the "right way" ...

Tim Delaney
 

Josiah Carlson

Note that there's been some very interesting discussions on python-dev
recently that may end up alleviating this issue somewhat in the future - but
no promises - there doesn't appear to be a consensus on the "right way" ...

I've been following it. My favorite solution is to not change the
structure of the list at all, and to offer a reasonably fast fifo queue
in a standard library module with a bunch of other useful data
structures. Then you don't need to worry about breaking C modules that
rely on the list having that behavior, and get some useful data
structures out of the deal.
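A queue along these lines did later land in the standard library as collections.deque (Python 2.4), whose popleft() runs in roughly constant time; a quick sketch:

```python
from collections import deque

# deque supports constant-time appends and pops at both ends, unlike
# list.pop(0), which shifts every remaining element forward.
q = deque()
for i in range(5):
    q.append(i)                            # enqueue at the right end

front = [q.popleft() for _ in range(5)]    # dequeue from the left end
print(front)   # [0, 1, 2, 3, 4]
```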

- Josiah
 
