XORing long strings optimization?


Noen

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for c1 in s1:
        for c2 in s2:
            output += chr(ord(c1) ^ ord(c2))
    return output

This way is very slow for large strings.
Anyone know of a better and faster way to do it?
 

Peter Otten

Noen said:
def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for c1 in s1:
        for c2 in s2:
            output += chr(ord(c1) ^ ord(c2))
    return output

This way is very slow for large strings.
Anyone know of a better and faster way to do it?

Before we start optimizing:

len(XOR(s1, s2)) == len(s1) * len(s2)

Bug or feature?

Peter
 

Noen

Peter said:
Noen wrote:




Before we start optimizing:

len(XOR(s1, s2)) == len(s1) * len(s2)

Bug or feature?

Peter

Oh, didn't notice that as I wrote it. Thanks...
Anyway, this is how it should be.

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output
 

Peter Otten

Noen said:
Oh, didn't notice that as I wrote it. Thanks...
Anyway, this is how it should be.

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output


Oops, I missed one:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "xor2.py", line 7, in XOR
    if len(s1) != len(s2):
TypeError: len() of unsized object

Did you expect this?

Peter
 

Terry Reedy

Noen said:
def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output


You fell into the O(n**2) immutable sequence repeated addition trap.
For O(n) behavior, try output = [] and return ''.join(output).
Replacing output+=stuff with output.append(stuff) might be faster
still.
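A minimal sketch of Terry's list-plus-join recipe (the function name is mine; it runs unchanged on modern Python as well):

```python
def xor_strings(s1, s2):
    """XOR two equal-length strings via a list of chars: O(n) overall."""
    if len(s1) != len(s2):
        raise ValueError("Size differs in strings")
    output = []                     # list.append is amortized O(1);
    for c1, c2 in zip(s1, s2):      # str += rebuilds the whole string
        output.append(chr(ord(c1) ^ ord(c2)))
    return "".join(output)          # one final concatenation
```

Since XOR is its own inverse, `xor_strings(xor_strings(data, key), key)` gives back `data` -- a handy sanity check for any of the variants in this thread.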

Terry J. Reedy
 

Noen

Peter said:
Noen wrote:

Oh, didn't notice that as I wrote it. Thanks...
Anyway, this is how it should be.

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output



Oops, I missed one:


Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "xor2.py", line 7, in XOR
    if len(s1) != len(s2):
TypeError: len() of unsized object

Did you expect this?

Peter

Nope, I thought the argument check would do it, but I used the and
operator instead of or. Well, here is XOR v0.3b :p Any comments on how
to make it faster?

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) or type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output
 

Peter Otten

Noen said:
def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) or type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output


I keep nagging:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "xor3.py", line 7, in XOR
    if len(s1) != len(s2):
TypeError: len() of unsized object

I must be in a nasty mood this evening :)
Seriously, I think that "first make it right, then make it fast" is a
reasonable design principle.
You can check the type

if type(s1) != type("") or type(s2) != type(""):
    ....

or

if not isinstance(s1, str) or not isinstance(s2, str):
    ....

or even

if not type(s1) == type(s2) == type(""):
    ....

but personally I would omit the check altogether. The second check could be
relaxed to len(s1) <= len(s2), assuming that s1 is the data to be encrypted
and s2 the random character sequence that makes your encryption NSA-proof.
For (much) weaker encryption schemes, you could omit it completely and
cycle over the characters of s2 (not recommended).
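The weak key-cycling variant Peter alludes to might look like this (my sketch, not code from the thread; `itertools.cycle` repeats the key indefinitely, and `zip` stops at the end of the data):

```python
import itertools

def xor_cycle(data, key):
    """XOR data against a shorter, repeating key.

    This is the weak repeating-key scheme mentioned above -- trivially
    breakable, shown only to illustrate the cycling technique.
    """
    return "".join(chr(ord(c) ^ ord(k))
                   for c, k in zip(data, itertools.cycle(key)))
```

(The thread predates it, but note this generator-expression form needs the `join` to consume a finite iterable, which `zip` guarantees here.)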

As to speed, I would suppose the following to be somewhat faster:

def xor(s1, s2):
    return "".join([chr(ord(c1) ^ ord(c2))
                    for c1, c2 in itertools.izip(s1, s2)])

It favours "".join(character list) over consecutive s += tail operations,
which need to build "many" strings as intermediate results.

I think, further speedups could be achieved building a lookup table for all
character combinations.
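One way Peter's lookup-table idea might be realized (my sketch, not his code): precompute chr(a ^ b) for all 256*256 byte pairs once, trading roughly 64K dict entries for the per-character ord/chr calls:

```python
# Hypothetical lookup-table variant: build the table once, up front.
_XOR_TABLE = {}
for _a in range(256):
    for _b in range(256):
        _XOR_TABLE[(chr(_a), chr(_b))] = chr(_a ^ _b)

def xor_lookup(s1, s2):
    """XOR equal-length strings by table lookup instead of ord/chr."""
    table = _XOR_TABLE          # local alias avoids repeated global lookups
    return "".join([table[pair] for pair in zip(s1, s2)])
```

Whether this actually beats the izip version depends on the cost of a tuple-keyed dict lookup versus two ord calls and a chr call; worth timing rather than assuming.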

Peter
 

Ulrich Petri

Peter Otten said:
Noen said:
Oh, didn't notice that as I wrote it. Thanks...
Anyway, this is how it should be.

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output


Oops, I missed one:


There's even more ;)

I guess he meant:
Ciao Ulrich
 

Francis Avila

Noen said:
Nope, I thought the argument check would do it, but I used the and
operator instead of or. Well, here is XOR v0.3b :p Any comments on how
to make it faster?

How about using the struct module?
Caveat: I don't know if struct limits the repeat count.

import struct
def XOR(data, key):
    if len(data) != len(key):
        raise ValueError, "data and key not of equal length"
    ldata = struct.unpack('=%sb' % len(data), data)
    lkey = struct.unpack('=%sb' % len(key), key)
    lxored = [d^k for d,k in zip(ldata, lkey)]
    xored = struct.pack('=%sb' % len(lxored), *lxored)
    return xored

Just an idea.
 

Alex Martelli

Peter said:
Noen said:
Oh, didn't notice that as I wrote it. Thanks...
Anyway, this is how it should be.

def XOR(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    # Argument check
    if (type(s1) and type(s2)) != type(""):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output


Oops, I missed one:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "xor2.py", line 7, in XOR
    if len(s1) != len(s2):
TypeError: len() of unsized object


Checks apart -- on to optimization as per subject:

[alex@lancelot krb5module]$ timeit.py -c -s'from string import letters as S'
'"".join([chr(ord(S[i])^ord(S[i])) for i in xrange(len(S))])'
10000 loops, best of 3: 113 usec per loop

i'm not sure i can do much better with strings. arrays are cool, though:

[alex@lancelot krb5module]$ timeit.py -c -s'from string import letters as S'
-s'import array' 'x=array.array("b",S)' 'for i,v in enumerate(x): x[i]^=v'
'x.tostring()'
10000 loops, best of 3: 57 usec per loop

Morale: when you find yourself fiddling with lots of chr and ord, and care
about performance, think about "array('b', ... -- it could buy you an easy
speedup by a factor of 2, like here.
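Alex's timeit one-liner, unpacked into a plain function for readability (transcribed for modern Python, where `array("B", ...)` takes bytes and `tostring()` has become `tobytes()`; Alex used typecode "b" on 2.x str objects):

```python
import array

def xor_array(data, key):
    """XOR two equal-length byte strings in place via array."""
    x = array.array("B", key)                  # mutable copy of key's bytes
    for i, v in enumerate(array.array("B", data)):
        x[i] ^= v                              # per-slot XOR, no new strings
    return x.tobytes()
```

The win over the join-based versions is that the XOR happens on small integers in a preallocated buffer, with no chr/ord round-trips and no intermediate strings.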


Alex
 

Francis Avila

about performance, think about "array('b', ... -- it could buy you an easy
speedup by a factor of 2, like here.

Why we should read a module a day: just today I learned about the array
module, when I *had* been using struct for similar tasks....

Sigh; if only the standard library weren't so *big* ;).
 

Noen

Here is the result, if anyone wants it :)
It was quite quick indeed: it took me 10.3279999495 seconds to XOR an
"a"*1000000 string. I'm going to XOR 700000000-byte (700 MB) files, so
if anyone knows how to make it even faster, please tell me :)

import string
def xorstring(s1,s2):
    """ XOR string s1 with s2 """
    # Argument check
    if not (type(s1) == type("")) or not (type(s2) == type("")):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring

    # Create lists
    l1 = map(ord, s1)
    l2 = map(ord, s2)

    # Xor it all together
    xorlist = []
    xorlist = [chr(x ^ y) for (x, y) in zip(l1, l2)]
    return string.join(xorlist,"") # Backward compatible
 

Peter Otten

Francis said:
Noen said:
Nope, I thought the argument check would do it, but I used the and
operator instead of or. Well, here is XOR v0.3b :p Any comments on how
to make it faster?

How about using the struct module?
Caveat: I don't know if struct limits the repeat count.

import struct
def XOR(data, key):
    if len(data) != len(key):
        raise ValueError, "data and key not of equal length"
    ldata = struct.unpack('=%sb' % len(data), data)
    lkey = struct.unpack('=%sb' % len(key), key)
    lxored = [d^k for d,k in zip(ldata, lkey)]
    xored = struct.pack('=%sb' % len(lxored), *lxored)
    return xored

Just an idea.

Nope. Both your and my approach make it worse. Only Alex Martelli did it
right:

import itertools, array, struct

def xor_noen(s1,s2):
    """ XOR string s1 with s2 """
    output = ""
    for i in range(len(s1)):
        output += chr(ord(s1[i]) ^ ord(s2[i]))
    return output

def xor_po(s1, s2):
    return "".join([chr(ord(c1) ^ ord(c2))
                    for c1, c2 in itertools.izip(s1, s2)])

def xor_am(s1, s2):
    x = array.array("b", s2)
    for i, v in enumerate(array.array("b", s1)):
        x[i] ^= v
    return x.tostring()

def xor_fa(data, key):
    ldata = struct.unpack('=%sb' % len(data), data)
    lkey = struct.unpack('=%sb' % len(key), key)
    lxored = [d^k for d,k in zip(ldata, lkey)]
    xored = struct.pack('=%sb' % len(lxored), *lxored)
    return xored

# 35 usec
def pro_noen():
    return xor_noen("sgnirts ni sreffid eziS", "Size differs in strings")

# 40 usec
def pro_po():
    return xor_po("sgnirts ni sreffid eziS", "Size differs in strings")

# 23 usec
def pro_am():
    return xor_am("sgnirts ni sreffid eziS", "Size differs in strings")

# 46 usec
def pro_fa():
    return xor_fa("sgnirts ni sreffid eziS", "Size differs in strings")

assert pro_po() == pro_am() == pro_fa() == pro_noen()

I was surprised how well pro_noen() does, but that might change for large
strings. By the way, results varied significantly (by several usecs)
between test runs.

Peter
 

Casey Whitelaw

Noen said:
Here is the result, if anyone wants it :)
It was quite quick indeed: it took me 10.3279999495 seconds to XOR an
"a"*1000000 string. I'm going to XOR 700000000-byte (700 MB) files, so
if anyone knows how to make it even faster, please tell me :)

import string
def xorstring(s1,s2):
    """ XOR string s1 with s2 """
    # Argument check
    if not (type(s1) == type("")) or not (type(s2) == type("")):
        raise TypeError, "Arguments are not strings"
    if len(s1) != len(s2):
        raise ValueError, "Size differs in strings"
    # Xoring

    # Create lists
    l1 = map(ord, s1)
    l2 = map(ord, s2)

    # Xor it all together
    xorlist = []
    xorlist = [chr(x ^ y) for (x, y) in zip(l1, l2)]
    return string.join(xorlist,"") # Backward compatible

Just fiddling around with the (excellent) suggestions already posted,
and found a couple of things.

Alex's suggestion using array is the fastest, but was slightly faster
when using xrange() rather than enumerate():

def xor6(s1, s2):
    a1 = array.array("b", s1)
    a2 = array.array("b", s2)

    for i in xrange(len(a1)):
        a1[i] ^= a2[i]
    return a1.tostring()

On my system, this XORs 500,000 characters in 0.39s, with Noen's
original taking 0.67s.

Using psyco produced a further ~3x speedup, reducing this to 0.14s.
That's pretty fast!

Casey
 

Georgy Pruss

Well, nothing to add actually, but my results:

import string
import array

def xorstring0(s1,s2): # the original method
    # argument tests were here

    # Create lists
    l1 = map(ord, s1)
    l2 = map(ord, s2)

    # Xor it all together
    xorlist = []
    xorlist = [chr(x ^ y) for (x, y) in zip(l1, l2)]
    return string.join(xorlist,"") # Backward compatible

def xorstring1(s1,s2):
    xorlist = [chr(ord(x) ^ ord(y)) for (x, y) in zip(s1, s2)]
    return string.join(xorlist,"") # Backward compatible

def xorstring2(s1,s2):
    xorlist = [chr(ord(s1[i]) ^ ord(s2[i])) for i in range(len(s1))] # range
    return string.join(xorlist,"") # Backward compatible

def xorstring3(s1,s2):
    xorlist = [chr(ord(s1[i]) ^ ord(s2[i])) for i in xrange(len(s1))] # xrange
    return string.join(xorlist,"") # Backward compatible

def xorstring4(s1,s2):
    a1 = array.array("B", s1)
    a2 = array.array("B", s2)
    for i in xrange(len(a1)):
        a1[i] ^= a2[i]
    return a1.tostring()

s1 = 'abcew'*200000
s2 = 'xyzqa'*200000

fns = [ xorstring0, xorstring1, xorstring2, xorstring3, xorstring4 ]

# check if all works OK for some short data --
# protection from some rough error or typo
sx = fns[0]( s1[:100], s2[:100] )
for fn in fns:
    assert sx == fn( s1[:100], s2[:100] )

import time

tms = [60] * len(fns) # assume some unreal big times

for n in range(30): # loop 30 times

    for i in range(len(fns)):

        tb = time.time()
        sx = fns[i]( s1, s2 ) # do it!
        tm = time.time() - tb
        tms[i] = min( tms[i], tm )

for i in range(len(fns)):
    print "fn%d -- %.3f sec -- %.2f" % (i, tms[i], tms[i]/tms[-1])

# fn0 -- 5.578 sec -- 6.37
# fn1 -- 4.593 sec -- 5.25
# fn2 -- 2.609 sec -- 2.98
# fn3 -- 2.531 sec -- 2.89
# fn4 -- 0.875 sec -- 1.00


BTW, for the same million char strings, a C function runs 0.0035 sec, or 250 times faster!


G-:
 

Alex Martelli

Georgy said:
Well, nothing to add actually, but my results:
...but of course when in a hurry one should remember...
psyco
!!!

So I gave your script a little tweak...:
fns = [ xorstring0, xorstring1, xorstring2, xorstring3, xorstring4 ]

i.e. right here I added:

import psyco
psyfns = [ psyco.proxy(f) for f in fns ]
fns.extend(psyfns)

so each function is tried in both the plain and psycotic form --
the latter is at an index 5 above (fn5 is the psycoized xorstring0,
etc etc up to fn9 which is the psycoed xorstring4).
# fn0 -- 5.578 sec -- 6.37
# fn1 -- 4.593 sec -- 5.25
# fn2 -- 2.609 sec -- 2.98
# fn3 -- 2.531 sec -- 2.89
# fn4 -- 0.875 sec -- 1.00

BTW, for the same million char strings, a C function runs 0.0035 sec, or
250 times faster!

My measurements give:

fn0 -- 19.229 sec -- 347.98
fn1 -- 12.489 sec -- 226.01
fn2 -- 4.719 sec -- 85.39
fn3 -- 4.359 sec -- 78.88
fn4 -- 2.046 sec -- 37.02

fn5 -- 16.736 sec -- 302.86
fn6 -- 9.675 sec -- 175.08
fn7 -- 1.045 sec -- 18.90
fn8 -- 1.048 sec -- 18.96
fn9 -- 0.055 sec -- 1.00

so, my machine is quite a bit slower (it IS getting long in the tooth --
over 30 months -- and wasn't the absolute speediest even then;-) by a
factor that's quite variable, 1.72 to 3.45 times, median 2.34 times.

psyco helps marginally on some approaches, 4+ times on others, and
almost 40 times on the fn4 approach, which happens to be the fastest
even without it (luckily...:). Still not a match for your reported
C function, but just 6-7 times slower (guessing), not 250 times.

One thing worth checking is whether the end result IS needed as a
string, because some more time might be saved by e.g. using buffer(x)
rather than x.tostring() for the result array x, in many cases.

Another possibility to be always kept in mind for bulk array operations
is Numeric. xor-ing 10,000-long byte arrays takes about 10 msec with
array.array (needing a Python-level loop) versus 50 microseconds with
Numeric.array (no loop needed), so a factor of 200 (over the fastest
python-cum-standard-library only approach) is easily available and may
be getting even closer than psyco to bare-C speeds.
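Numeric has since been succeeded by NumPy; the whole-array XOR Alex describes translates to something like this (a sketch assuming NumPy is installed; byte strings in, byte string out):

```python
import numpy as np

def xor_numpy(data, key):
    """XOR two equal-length byte strings with one vectorized operation."""
    a = np.frombuffer(data, dtype=np.uint8)   # zero-copy views of the bytes
    b = np.frombuffer(key, dtype=np.uint8)
    return (a ^ b).tobytes()                  # elementwise XOR, no Python loop
```

The entire loop runs in compiled code, which is where the factor-of-200 figure over the pure array.array approach comes from.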


Alex
 
