Slow comparison between two lists

J

Jani Tiainen

I have rather simple 'Address' object that contains streetname,
number, my own status and x,y coordinates for it. I have two lists
both containing approximately 30000 addresses.

I've defined __eq__ method in my class like this:

def __eq__(self, other):
return self.xcoord == other.xcoord and \
self.ycoord == other.ycoord and \
self.streetname == other.streetname and \
self.streetno == other.streetno

But it turns out to be very, very slow.

Then I setup two lists:

list_external = getexternal()
list_internal = getinternal()

Now I need get all all addresses from 'list_external' that are not in
'list_internal', and mark them as "new".

I did it like this:

for addr in list_external:
if addr not in list_internal:
addr.status = 1 # New address

But in my case running that loop takes about 10 minutes. What I am
doing wrong?
 
P

Peter Otten

Jani said:
I have rather simple 'Address' object that contains streetname,
number, my own status and x,y coordinates for it. I have two lists
both containing approximately 30000 addresses.

I've defined __eq__ method in my class like this:

def __eq__(self, other):
return self.xcoord == other.xcoord and \
self.ycoord == other.ycoord and \
self.streetname == other.streetname and \
self.streetno == other.streetno

But it turns out to be very, very slow.

Then I setup two lists:

list_external = getexternal()
list_internal = getinternal()

Now I need get all all addresses from 'list_external' that are not in
'list_internal', and mark them as "new".

I did it like this:

for addr in list_external:
if addr not in list_internal:
addr.status = 1 # New address

But in my case running that loop takes about 10 minutes. What I am
doing wrong?

Even if list_external and list_internal contain the same items you need
about len(list_external)*(len(list_internal)/2), or 45 million comparisons.
You can bring that down to 30000*some_small_factor if you make your
addresses hashable and put the internal ones into a dict or set

def __hash__(self):
return hash((self.xcoord, self.yccord, self.streetname, self.streetno))
def __eq__(self, other):
# as above

Then

set_internal = set(list_internal)
for addr in list_external:
if addr not in set_internal:
addr.status = 1

Note that the attributes relevant for hash and equality must not change
during this process.

Peter
 
B

bearophileHUGS

Hrvoje Niksic:
internal = set(list_internal)
....

To do that the original poster may have to define a __hash__ and
__eq__ methods in his/her class.

Bye,
bearophile
 
B

Bruno Desthuilliers

Jani Tiainen a écrit :
I have rather simple 'Address' object that contains streetname,
number, my own status and x,y coordinates for it. I have two lists
both containing approximately 30000 addresses.

I've defined __eq__ method in my class like this:

def __eq__(self, other):
return self.xcoord == other.xcoord and \
self.ycoord == other.ycoord and \
self.streetname == other.streetname and \
self.streetno == other.streetno

But it turns out to be very, very slow.

Then I setup two lists:

list_external = getexternal()
list_internal = getinternal()

Now I need get all all addresses from 'list_external' that are not in
'list_internal', and mark them as "new".

I did it like this:

for addr in list_external:
if addr not in list_internal:
addr.status = 1 # New address

But in my case running that loop takes about 10 minutes. What I am
doing wrong?

mmm... not using sets ?
 
J

Jani Tiainen

Even if list_external and list_internal contain the same items you need
about len(list_external)*(len(list_internal)/2), or 45 million comparisons.
You can bring that down to 30000*some_small_factor if you make your
addresses hashable and put the internal ones into a dict or set

def __hash__(self):
   return hash((self.xcoord, self.yccord, self.streetname, self.streetno))
def __eq__(self, other):
   # as above

Then

set_internal = set(list_internal)
for addr in list_external:
    if addr not in set_internal:
        addr.status = 1

Note that the attributes relevant for hash and equality must not change
during this process.

Very complete answer, thank you very much.

I tried that hash approach and sets but it seemed to get wrong results
first time and it was all due my hash method.

Now it takes like 2-3 seconds to do all that stuff and result seem to
be correct. Apparently I have lot to learn about Python... :)
 
B

bearophileHUGS

Hrvoje Niksic:
You're right.  The OP states he implements __eq__, so he also needs a
matching __hash__, such as:

    def __hash__(self, other):
        return (hash(self.xcoord) ^ hash(self.ycoord) ^
                hash(self.streetname) ^ hash(self.streetno))

The hash function by Otten is better because it considers the order of
the items too (while I think the xor doesn't):

return hash((self.xcoord, self.yccord, self.streetname,
self.streetno))

Bye,
bearophile
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top