Fatest standard way to sum bytes (and their squares)?


Erik Max Francis

For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
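(For reference, the explicit loop being compared against would look something
along these lines; this baseline is only an illustrative sketch, not code from
the original post.)

ordinalSum = ordinalSumSquared = 0
for x in data:
    o = ord(x)                    # one ord() call per byte
    ordinalSum += o
    ordinalSumSquared += o * o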
 

Erik Max Francis

And of course I meant fastest, not "fatest." Fattest wouldn't be good,
either.
 

Alexander Schmolck

Erik Max Francis said:
For a file hashing system (finding similar files, rather than identical ones),
I need to be able to efficiently and quickly sum the ordinals of the bytes of
a file and their squares. Because of the nature of the application, it's a
requirement that I do it in Python, or only with standard library modules (if
such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to be
processing massive amounts of data, the faster the better. Are there any
tricks I'm not thinking of, or perhaps helper functions in other modules that
I'm not thinking of?

Is this any faster?

ordSum, ordSumSq = (lambda c: (c.real, c.imag))(
    sum(complex(ord(x), ord(x)**2) for x in data))

'as
 

Peter Otten

Erik said:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))

def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

Your actual code may be even faster because you can read the bytes directly
from the file.

Peter
 

Peter Otten

Peter said:
Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))

Oops, make that a list:

lookup = [i**2 for i in range(256)]
def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

Not a big gain, but still...
 

Peter Otten

Peter said:
lookup = [i**2 for i in range(256)]
def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

from itertools import imap

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(imap(lookup.__getitem__, a))

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 9.15 msec per loop

I think I'll stop here...

Peter
 

Duncan Booth

Erik Max Francis said:
So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
 

Hrvoje Niksic

Erik Max Francis said:
So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
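(For reference, the lambda form being described would look roughly like the
sketch below; it keeps the single imap pass but adds a Python-level function
call per byte, which is what makes it slower than the generator expression.)

from itertools import imap

ordinalSumSquared = sum(imap(lambda x: ord(x)**2, data))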
 

Erik Max Francis

Alexander said:
Is this any faster?

ordSum, ordSumSq = (lambda c: (c.real, c.imag))(
    sum(complex(ord(x), ord(x)**2) for x in data))

That's pretty clever, but I neglected to mention that I need to
accumulate the sums as ints/longs to avoid losing precision, so
converting to floating point isn't an option. (The sums are
normalized by the sizes of the files and expanded to 32 bits in order to
maximize precision.)
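(A minimal sketch of the kind of normalization described, assuming the sums
are scaled by the per-byte maxima of 255 and 255**2 and by the file size,
then expanded into a 32-bit range using integer arithmetic only; the exact
scheme the tool uses isn't given in the thread, so the normalize() helper
below is purely illustrative.)

def normalize(ordinalSum, ordinalSumSquared, size):
    # Hypothetical scaling: divide by the maximum possible sum for a file
    # of this size, then spread the result over the full 32-bit range.
    scale = (1 << 32) - 1
    return (ordinalSum * scale // (size * 255),
            ordinalSumSquared * scale // (size * 255**2))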
 

Erik Max Francis

Hrvoje said:
For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

You're using data which is a list of chars (strings), rather than a
string itself, which is the format the data actually comes in. The imap
optimization doesn't appear to work quite as dramatically well for me with
strings instead of lists, but it certainly is an improvement.
Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.

Right.
 

Erik Max Francis

Duncan said:
I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)

This approach is definitely faster than using a generator in my tests,
and was second fastest overall. I'm actually a bit surprised that it's
as fast as it is; it never would have occurred to me to try it this way.
 

Erik Max Francis

Peter said:
from itertools import imap

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(imap(lookup.__getitem__, a))

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 9.15 msec per loop

I think I'll stop here...

Yes, this was the fastest approach of all the alternatives I tried.
Plugging the algorithm into the tool makes it about twice as fast as
the generator-sum method I was using before, and about _five_ times as
fast as a naive for loop. The use of an array is particularly clever; I
never would have thought of that.

Thanks to everyone for their ideas. Amusingly I posted it late last
night (well, this morning I guess), not expecting much, at least until
the next day, but got a factor of two speedup within half an hour!
 

Hrvoje Niksic

Erik Max Francis said:
Hrvoje said:
For ordinalSum, using imap is almost twice as fast:
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

You're using data which is a list of chars (strings), rather than a
string itself, which is the format the data actually comes in. The imap
optimization doesn't appear to work quite as dramatically well for
me with strings instead of lists, but it certainly is an
improvement.

I wouldn't expect to see any difference between strings and lists. In this
simple test I get approximately the same ~1.7x speedup:

$ python -m timeit 'sum(ord(x) for x in "abcdefghijklmnopqrstuvwxyz")'
100000 loops, best of 3: 12.7 usec per loop
$ python -m timeit -s 'from itertools import imap' 'sum(imap(ord, "abcdefghijklmnopqrstuvwxyz"))'
100000 loops, best of 3: 7.42 usec per loop
 

Alexander Schmolck

Erik Max Francis said:
That's pretty clever, but I neglected to mention that I need to accumulate the
sums as ints/longs to avoid losing precision, so converting to floating point
isn't an option.

I think you can safely use this trick (provided it really makes things faster)
as long as you sum blocks no larger than 2**(53-8) bytes; if your files are
really that big you'd certainly want to split the summing into several blocks
anyway, because otherwise you'll be doing *lots* of extra bignum arithmetic
instead of int32/int64 addition (I'd assume this will slow things down
noticeably even in Python). Another trick you could try, again using a table
lookup: work on words (i.e. 2 bytes) instead of single bytes, using a table
mapping word -> (byte-sum, sq-byte-sum) tuples; this will halve the function
calls, and the table size is hopefully still small enough not to ruin your
cache-hit rate (you might want to try array).
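(A rough sketch of the word-at-a-time idea, assuming Python 2 strings and an
even-length block; the table maps every possible 16-bit word to its byte sum
and squared-byte sum, so each lookup handles two bytes at once. Names like
word_sum, word_sq and summit_words are illustrative, not from the thread.)

import array
from itertools import imap

# Precompute per-word sums; since hi+lo and hi**2+lo**2 are symmetric in the
# two bytes, the native byte order used by array("H") doesn't affect the sums.
word_sum = [0] * 65536
word_sq = [0] * 65536
for hi in range(256):
    for lo in range(256):
        w = (hi << 8) | lo
        word_sum[w] = hi + lo
        word_sq[w] = hi*hi + lo*lo

def summit_words(data):
    # Assumes len(data) is even; an odd trailing byte would need separate
    # handling with single-byte lookups.
    a = array.array("H")
    a.fromstring(data)
    return sum(imap(word_sum.__getitem__, a)), sum(imap(word_sq.__getitem__, a))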

'as
 

Alexander Schmolck

Alexander Schmolck said:
I think you can safely use this trick (provided it really makes things faster)
as long as you sum blocks no larger than 2**(53-8) bytes;

Forgot about the squares; make that 2**(53-16), since each squared byte is at
most 255**2 < 2**16 and a double represents integers exactly only up to 2**53.
Still pretty big.

'as
 

Dan Stromberg

For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

I see a lot of messages attacking the CPU optimization, but what about the
I/O optimization, which, admittedly, the question seems to sidestep?

You might experiment with using mmap() instead of read()... If it helps,
it may help big, because the I/O time is likely to dominate the CPU time.
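(A minimal sketch of the mmap() approach, assuming Python 2 and a regular,
non-empty file; summit stands in for one of the thread's two-value summing
routines such as summit_array, and sum_file/blocksize are illustrative names.)

import mmap

def sum_file(path, summit, blocksize=1 << 20):
    # Map the whole file read-only and feed it to the summing routine in
    # large slices; the OS pages data in lazily rather than via one big
    # read() up front, and the slicing keeps peak memory use bounded.
    f = open(path, "rb")
    try:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            total = totalSq = 0
            for pos in xrange(0, len(m), blocksize):
                s, sq = summit(m[pos:pos + blocksize])
                total += s
                totalSq += sq
            return total, totalSq
        finally:
            m.close()
    finally:
        f.close()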
 
