speedy Python strings?


Uwe Mayer

Hi,

thanks to previous help my wrapper for an abstract file type with variable
record length is working now.
I did some test runs and it's awfully slow:

I didn't want to read in the whole file at once, and I didn't want to read it
step by step either (it contains 32-bit length-prefixed strings (Delphi-style)
and 32-bit integers), so I read in, say, 2 MB, work on that buffer, and when
the buffer runs empty I load another 2 MB, and so on.
For this buffer I use ordinary strings:

class myFile(file):
    def read(self, *args):
        ...
        self.buffer += file.read(self, *args)
        ...

and after reading information from the buffer I remove the read part from
it:

text = struct.unpack("L", self.buffer[:4])[0]
self.buffer = self.buffer[4:]

During debugging I saw the program speed up as the buffer size
(len(self.buffer)) shrank, hence my conclusion: since strings are
immutable, the string operations are what is slow.

I tested different sizes of the buffer and it performed best when I didn't
use that buffering system at all (which is sad, since I spent so much time on it).

Is there anything else I can use instead of ordinary strings that will speed
this up?
How do you suggest I deal with this situation? Should I drop the buffering
of the data altogether?

Thanks for any comments
Yours
Uwe
 

Jp Calderone

Hi,

thanks to previous help my wrapper for an abstract file type with variable
record length is working now.
I did some test runs and it's awfully slow:

I didn't want to read in the whole file at once, and I didn't want to read it
step by step either (it contains 32-bit length-prefixed strings (Delphi-style)
and 32-bit integers), so I read in, say, 2 MB, work on that buffer, and when
the buffer runs empty I load another 2 MB, and so on.
For this buffer I use ordinary strings:

class myFile(file):
    def read(self, *args):
        ...
        self.buffer += file.read(self, *args)
        ...

Doing the above in a loop has quadratic complexity.
and after reading information from the buffer I remove the read part from
it:

text = struct.unpack("L", self.buffer[:4])[0]
self.buffer = self.buffer[4:]

The same is true for this line.

Your program will be much faster if you make fewer string copies.
"self.buffer[4:]" is a string copy, as is "self.buffer += anotherString".

The buffer() builtin is one way to avoid copying, but another way is to
simply keep track of how far into the string you are and use both low and
high indexes when slicing, instead of repeatedly chopping the front off the
string.
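A minimal sketch of that index-tracking approach (the data and names here are invented for illustration, shown in modern Python):

```python
import struct

# Sketch of the suggestion above: keep an offset into one buffer and
# slice with both a low and a high index, so the buffer itself is never
# rebuilt after each read.
buf = struct.pack("<3I", 10, 20, 30)   # stand-in for bytes from file.read()
pos = 0

values = []
while pos < len(buf):
    # buf[pos:pos+4] copies only 4 bytes; no new buffer is created
    values.append(struct.unpack("<I", buf[pos:pos + 4])[0])
    pos += 4

print(values)   # [10, 20, 30]
```

Each iteration copies a constant 4 bytes regardless of how much data remains, so the loop stays linear in the total amount read.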

Jp
 

Roy Smith

Uwe Mayer said:
since strings are
immutable, the string operations are what is slow.

More to the point, and I suspect what you're running into, string
addition is linear with respect to string length because it has to
create a new string each time and destroy the old one. Since each
addition is linear with string length, a loop which adds strings will
run in quadratic time.

Possibly you want to look at the cStringIO module.

Another possibility is to buffer things in a list, using append to add
to the end and pop(0) to pop off the front (a roll-your-own queue).
Lists do not suffer from the quadratic behavior that string addition does.
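For the cStringIO route, a minimal sketch (shown with io.BytesIO, the modern equivalent; the sample data is invented):

```python
import io
import struct

# Sketch of the cStringIO idea (io.BytesIO in Python 3): the stream keeps
# its own read position, so consuming bytes from the front needs no
# string copying at all.
stream = io.BytesIO(struct.pack("<2I", 7, 42))

first = struct.unpack("<I", stream.read(4))[0]    # read() advances the position
second = struct.unpack("<I", stream.read(4))[0]
print(first, second)   # 7 42
```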
 

Stuart D. Gathman

class myFile(file):
    def read(self, *args):
        ...
        # self.buffer += file.read(self, *args)      # not this
        self.buffer = self.buffer[self.pos:] + file.read(self, *args)
        self.pos = 0
        ...

and after reading information from the buffer I remove the read part from
it:
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])[0]
self.pos = pos + 4
 

Josiah Carlson

Another possibility is to buffer things in a list, using append to add
to the end and pop(0) to pop off the front (a roll-your-own queue).
Lists do not suffer from the quadratic behavior that string addition does.

Deleting from the front of a list results in every item being shifted up
a single entry. While this is insignificant for short lists, it is
significant for long lists:
>>> import time
>>> def pop_front(size):
...     a = range(size)
...     t = time.time()
...     while a: a.pop(0)
...     return time.time()-t
...
>>> def pop_end(size):
...     a = range(size)
...     t = time.time()
...     while a: a.pop()
...     return time.time()-t
...
>>> pop_front(100000)
5.4530000686645508



In terms of buffering as the parent post asks, file IO is already
buffered by the underlying libraries and operating system. Another
layer of Python buffering doesn't seem to make much sense. I would
suggest writing off the buffering exercise as a learning experience.

- Josiah
 

Francis Avila

Stuart D. Gathman wrote in message ...
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])[0]
self.pos = pos + 4


In this vein, I would also recommend looking at the array module. You
didn't describe the data structure, but if each record is simply a list of
4-byte integers, you could convert the whole record to ints at once like so:

record = array.array('L', stringorlist)

Then, perform your loops on record, which will be a list-like object
supporting item insertion and deletion, and conversion to bytestrings and
other Python base types.

Or, you could simply use array.array() to implement your buffer more
efficiently than continually calling struct. Simply buffer the converted
data a chunk at a time, using array.
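A rough sketch of that whole-record conversion (the sample data is invented; note that array's type codes are platform-sized, unlike struct's fixed-width formats):

```python
import array
import struct

# Sketch: convert a chunk of 32-bit unsigned integers in one call instead
# of one struct.unpack per value.  array gives no fixed-width guarantee,
# so check itemsize before trusting 'I' to mean 4 bytes.
data = struct.pack("=4I", 1, 2, 3, 4)   # stand-in for a record's raw bytes

record = array.array('I', data)
assert record.itemsize == 4             # guard against a wider 'I'
print(list(record))                     # [1, 2, 3, 4]
```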

However, I don't think buffering helps much in this case. File operations
are already buffered by Python, and most (all?) OSes buffer reads
themselves, too. Reading a file is almost always very fast. I would
concentrate more on presenting a useful abstraction for your data, rather
than worrying about what is essentially an optimization problem.
 

Jp Calderone

Stuart D. Gathman wrote in message ...
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])[0]
self.pos = pos + 4


In this vein, I would also recommend looking at the array module. You
didn't describe the data structure, but if each record is simply a list of
4-byte integers, you could convert the whole record to ints at once like so:

record = array.array('L', stringorlist)

Then, perform your loops on record, which will be a list-like object
supporting item insertion and deletion, and conversion to bytestrings and
other Python base types.

Also note the existence of fromfile() - the array module actually makes it
possible to do this incredibly efficiently:

a = array.array('L')
a.fromfile(openFileObj, count)
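A runnable sketch of the fromfile() route (the temporary file and count here are invented for illustration):

```python
import array
import struct
import tempfile

# Sketch of array.fromfile(): pull `count` 32-bit integers straight from
# an open file object into an array, with no intermediate string.
with tempfile.TemporaryFile() as f:
    f.write(struct.pack("=5I", *range(5)))
    f.seek(0)

    a = array.array('I')
    a.fromfile(f, 5)      # reads exactly 5 items, or raises EOFError

print(list(a))            # [0, 1, 2, 3, 4]
```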

Jp
 

Tim Delaney

Josiah Carlson said:
Deleting from the front of a list results in every item being
shifted up a single entry. While this is insignificant for
short lists, it is significant for long lists:

Note that there's been some very interesting discussions on python-dev
recently that may end up alleviating this issue somewhat in the future - but
no promises - there doesn't appear to be a consensus on the "right way" ...

Tim Delaney
 

Josiah Carlson

Note that there's been some very interesting discussions on python-dev
recently that may end up alleviating this issue somewhat in the future - but
no promises - there doesn't appear to be a consensus on the "right way" ...

I've been following it. My favorite solution is to not change the
structure of the list at all, and to offer a reasonably fast fifo queue
in a standard library module with a bunch of other useful data
structures. Then you don't need to worry about breaking C modules that
rely on the list having that behavior, and get some useful data
structures out of the deal.
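A queue along these lines did later land in the standard library as collections.deque (Python 2.4), whose popleft() runs in roughly constant time; a quick sketch:

```python
from collections import deque

# deque supports constant-time appends and pops at both ends, unlike
# list.pop(0), which shifts every remaining element forward.
q = deque()
for i in range(5):
    q.append(i)                            # enqueue at the right end

front = [q.popleft() for _ in range(5)]    # dequeue from the left end
print(front)   # [0, 1, 2, 3, 4]
```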

- Josiah
 
