How to iterate through two dicts efficiently

J

J

Hello,

I am looking to improve the performance of the following piece of Python code:-

for cc in StatusContainer:
for srv in StatusContainer[cc]:
for id in StatusContainer[cc][srv]['RECV']:
if id in StageContainer['13']:
StatusContainer[cc][srv].setdefault('ACK', []).append(id)
elif id in StageContainer['14']:
StatusContainer[cc][srv].setdefault('NACK', []).append(id)
else:
if id in StageContainer['504'] or id in StageContainer['505']:
StatusContainer[cc][srv].setdefault('RETRY', []).append(id)

I am new to programming and I decided that I would use python as my first programming language. I have written couple of (very basic but useful) scripts and now I am looking to optimise the performance of some of these scripts so I can make use of them.

The code above is an extract from a bigger script which I am using to counthow many message IDs have been received, sent or rejected per service per country. The code is iterating through two dictionaries (a status dictionary which is a set of sub-dictionaries and a stages dictionary which again is a set of sub-dictionaries), if an entry (in this case an id) is found in the status dictionary, I am performing a search for that id in the stages dictionary. Depending where in the stages dictionary the id is found (whichstage), then I am creating a new sub-dictionary inside the status dictionary and I append the id.

At the end, all I do is count how many IDs each sub-dictionary is holding (e.g: len(StatusContainer[cc][srv]['RECV'])).

My two dictionaries look like this:-

StatusContainer:-

defaultdict(<type 'set'>, {'FR': {'VDMS': {'RECV': ['cad1c651-a61e-11e0-9f43-00238bbdc9eb', 'df25a1d1-a61e-11e0-9f43-00238bbdc9eb', '1a9452c0-a61f-11e0-9f43-00238bbdc9eb', '1a9b3090-a61f-11e0-9f43-00238bbdc9eb']}}, 'PT': {'VDMS': {'RECV': ['2410e980-a61f-11e0-cffc-00238bbdc9e7', '24146bf0-a61f-11e0-cffc-00238bbdc9e7', '29c5f0f0-a61f-11e0-cffc-00238bbdc9e7', '2ebe5f20-a61f-11e0-cffc-00238bbdc9e7']}}, 'CL': {'VDMS': {'RECV': ['379cfda0-a61e-11e0-9bd6-00238bbdc9e7', 'ff60f990-a61e-11e0-9bd6-00238bbdc9e7', '1a7e80d0-a61f-11e0-9bd6-00238bbdc9e7', '7ab64f00-a61f-11e0-9bd6-00238bbdc9e7']}}})

StageContainer:-

defaultdict(<type 'set'>, {'13': ['17897931-a61e-11e0-a764-00238bce423b', 'd0fcb681-a61d-11e0-e693-00238bbdc9eb', '178819a1-a61e-11e0-a764-00238bce423b', '17161df1-a61e-11e0-b558-00238bbdc9e7', '15eb5991-a61e-11e0-8aeb-00238bbdc9b7', '1872eed0-a61e-11e0-de66-00238bce424c', 'b3e470b1-a61d-11e0-b558-00238bbdc9e7'], '14': ['1a98dc11-a61e-11e0-a9e7-00238bce423b', 'da485231-a61d-11e0-ff8f-00238bce423b', '1c245e11-a61e-11e0-8aeb-00238bbdc9b7', '1eab5711-a61e-11e0-da5a-00238bce423b', '8db533c1-a61d-11e0-9393-00238bbdc9b7', 'e0e066a1-a61d-11e0-b558-00238bbdc9e7']})

I ran this piece of code through the python profiler and I see that the code is taking on average around 1300 CPU secs to execute, depending on the size of the dicts. The problem is that both dictionaries will always hold hundreds of thousands of entries, so I need to re-write this code to be more efficient or I need to find a different way to achieve my goal.

Someone in a different forum suggested that I use 'binary search' to iterate through the dictionaries but like I mentioned before, I am still very newto programming and 'binary search' sounds a bit to complicated to me. I was wondering if someone here can have a look at my code above and maybe re-write it using 'binary search' or any other way you may see fit and re-postthe code here with some comments on what each section is doing?

Your help would be very much appreciated.

Regards,

Junior
 
P

Peter Otten

J said:
I am looking to improve the performance of the following piece of Python code:-

for cc in StatusContainer:
for srv in StatusContainer[cc]:
for id in StatusContainer[cc][srv]['RECV']:
if id in StageContainer['13']:
StatusContainer[cc][srv].setdefault('ACK', []).append(id)
elif id in StageContainer['14']:
StatusContainer[cc][srv].setdefault('NACK', []).append(id)
else:
if id in StageContainer['504'] or id in StageContainer['505']:
StatusContainer[cc][srv].setdefault('RETRY', []).append(id)
Someone in a different forum suggested that I use 'binary search' to
iterate through the dictionaries but like I mentioned before, I am still
very new to programming and 'binary search' sounds a bit to complicated to
me. I was wondering if someone here can have a look at my code above and
maybe re-write it using 'binary search' or any other way you may see fit
and re-post the code here with some comments on what each section is
doing?

Python's hash-based dictionaries are usually very fast; I doubt that binary search will
help. I'd try replacing the inner loop with set arithmetic:

# untested
stage_ack = set(StageContainer["13"])
stage_nack = set(StageContainer["14"])
stage_retry = set(StageContainer["504"]) | set(StageContainer["505"])

for servers in StatusContainer.itervalues():
for server in servers.itervalues():
recv = set(server["RECV"])

ack = recv & stage_ack
nack = recv & stage_nack
retry = recv & stage_retry

if ack:
server.setdefault("ACK", []).extend(ack)
if nack:
server.setdefault("NACK", []).extend(nack)
if retry:
server.setdefault("RETRY", []).extend(retry)
 
T

Tim Chase

Someone in a different forum suggested that I use 'binary
search' to iterate through the dictionaries

I'm not sure what they were smoking...a binary search is useful
for finding a thing in a sorted list. It looks like your data is
not sorted (strike #1) and it looks like you're gathering
statistics instead of trying to find a particular value (strike #2).
I am looking to improve the performance of the following piece of Python code:-

for cc in StatusContainer:
for srv in StatusContainer[cc]:
for id in StatusContainer[cc][srv]['RECV']:

I'd rewrite this to use tuple unpacking and the .iteritems()
method of dicts (which might not have any performance
improvements, but improves readability:

for cc, services in StatusContainer.iteritems():
for srv, data in services.iteritems():
for id in data['RECV']: # I'd also avoid shadowing "id"
...
if id in StageContainer['13']:
StatusContainer[cc][srv].setdefault('ACK', []).append(id)
elif id in StageContainer['14']:
StatusContainer[cc][srv].setdefault('NACK', []).append(id)

Rather than look these up each time, I'd hoist them out of the
loops that don't change the values and give them semantic
meaning. You also seem to be using defaultdict() everywhere
else, so I'd set it up so that StatusContainer[cc][srv] is a
defaultdict(list) as well to clean up that logic:

acks = StageContainer['13']
nacks = StageContainer['14']
my_504s = StageContainer['504']
my_505s = StageContainer['505']

for cc, services in ...
for ...
container = StatusContainer[cc][srv]
for ...
if id in acks:
container["ACK"].append(id)
elif id in nacks:
container["NACK"].append(id)
else:
if id in my_504s or id in my_505s:
container["RETRY"].append(id)

If your codes can only be 13, 14, 504 or 505, I'd even skip the
test at the end:

else:
container["RETRY"].append(id)

but that would take knowing your data, something I can't confirm
is safe without the full data-set.

That said, all those are minor improvements that will likely only
minor speed-ups in your code.

My two dictionaries look like this:- >
StatusContainer:-

defaultdict(<type 'set'>, {'FR': {'VDMS': {'RECV': ...

StageContainer:-

defaultdict(<type 'set'>, {'13': ['17897931-a61e-11e0-a764-00238bce423b', 'd0fcb681-a61d-11e0-e693-00238bbdc9eb'

Additionally, I notice the things you're searching through in
your StageContainer are *lists*. Yet for memberships testing,
you're using "in". This is likely where your slowdown is coming.
With the above changes to add semantic meaning to your
13/14/504/505s, just convert the results to a set:

acks = set(StageContainer['13'])
nacks = set(StageContainer['14'])
retries = set(StageContent['504']) + set(StageContent['505'])

....

else:
if id in retries: # instead of the "id in my504 or id in my505"

By changing this data-type from a list to a set, you change from
O(N) to O(1) for lookup time. For the non-CS geeks, that means
that doing an "in" on a list is related to the length of the data
in the list, while doing an "in" on a set is a constant time
regardless of the list-length. Assuming your data-dumps are
accurate and that these are lists, this will make a world of
difference in the speed of your run.

-tkc
 
D

Dave Angel

Hello,

I am looking to improve the performance of the following piece of Python code:-

for cc in StatusContainer:
for srv in StatusContainer[cc]:
for id in StatusContainer[cc][srv]['RECV']:
if id in StageContainer['13']:
StatusContainer[cc][srv].setdefault('ACK', []).append(id)
elif id in StageContainer['14']:
StatusContainer[cc][srv].setdefault('NACK', []).append(id)
else:
if id in StageContainer['504'] or id in StageContainer['505']:
StatusContainer[cc][srv].setdefault('RETRY', []).append(id)

I am new to programming and I decided that I would use python as my first programming language. I have written couple of (very basic but useful) scripts and now I am looking to optimise the performance of some of these scripts so I can make use of them.

The code above is an extract from a bigger script which I am using to count how many message IDs have been received, sent or rejected per service per country. The code is iterating through two dictionaries (a status dictionary which is a set of sub-dictionaries and a stages dictionary which again is a set of sub-dictionaries), if an entry (in this case an id) is found in the status dictionary, I am performing a search for that id in the stages dictionary. Depending where in the stages dictionary the id is found (which stage), then I am creating a new sub-dictionary inside the status dictionary and I append the id.

At the end, all I do is count how many IDs each sub-dictionary is holding (e.g: len(StatusContainer[cc][srv]['RECV'])).

My two dictionaries look like this:-

StatusContainer:-

defaultdict(<type 'set'>, {'FR': {'VDMS': {'RECV': ['cad1c651-a61e-11e0-9f43-00238bbdc9eb', 'df25a1d1-a61e-11e0-9f43-00238bbdc9eb', '1a9452c0-a61f-11e0-9f43-00238bbdc9eb', '1a9b3090-a61f-11e0-9f43-00238bbdc9eb']}}, 'PT': {'VDMS': {'RECV': ['2410e980-a61f-11e0-cffc-00238bbdc9e7', '24146bf0-a61f-11e0-cffc-00238bbdc9e7', '29c5f0f0-a61f-11e0-cffc-00238bbdc9e7', '2ebe5f20-a61f-11e0-cffc-00238bbdc9e7']}}, 'CL': {'VDMS': {'RECV': ['379cfda0-a61e-11e0-9bd6-00238bbdc9e7', 'ff60f990-a61e-11e0-9bd6-00238bbdc9e7', '1a7e80d0-a61f-11e0-9bd6-00238bbdc9e7', '7ab64f00-a61f-11e0-9bd6-00238bbdc9e7']}}})

StageContainer:-

defaultdict(<type 'set'>, {'13': ['17897931-a61e-11e0-a764-00238bce423b', 'd0fcb681-a61d-11e0-e693-00238bbdc9eb', '178819a1-a61e-11e0-a764-00238bce423b', '17161df1-a61e-11e0-b558-00238bbdc9e7', '15eb5991-a61e-11e0-8aeb-00238bbdc9b7', '1872eed0-a61e-11e0-de66-00238bce424c', 'b3e470b1-a61d-11e0-b558-00238bbdc9e7'], '14': ['1a98dc11-a61e-11e0-a9e7-00238bce423b', 'da485231-a61d-11e0-ff8f-00238bce423b', '1c245e11-a61e-11e0-8aeb-00238bbdc9b7', '1eab5711-a61e-11e0-da5a-00238bce423b', '8db533c1-a61d-11e0-9393-00238bbdc9b7', 'e0e066a1-a61d-11e0-b558-00238bbdc9e7']})

I ran this piece of code through the python profiler and I see that the code is taking on average around 1300 CPU secs to execute, depending on the size of the dicts. The problem is that both dictionaries will always hold hundreds of thousands of entries, so I need to re-write this code to be more efficient or I need to find a different way to achieve my goal.

Someone in a different forum suggested that I use 'binary search' to iterate through the dictionaries but like I mentioned before, I am still very new to programming and 'binary search' sounds a bit to complicated to me. I was wondering if someone here can have a look at my code above and maybe re-write it using 'binary search' or any other way you may see fit and re-post the code here with some comments on what each section is doing?

Your help would be very much appreciated.

Regards,

Junior
It would have been better if the code you submit actually did anything.
None of the values in your StatusContainer show up in the
StageContainer, so the inner loop in your code never stores anything.

You mention that 1300 seconds is taken for this loop. But what fraction
of the total is that? Is this really the bottleneck?

Assuming it is, the most important single thing i'd be doing is to
reorganize StageContainer. You have the default type being set, but you
actually store lists as values instead. The value of item '13' is a
list of values, when you probably meant it to be a set.

You could fix that for testing with something like:


StageContainer["13"] = set(StageContainer["13"])
StageContainer["14"] = set(StageContainer["14"])
StageContainer["504"] = set(StageContainer["504"])
StageContainer["505"] = set(StageContainer["505"])

though of course it'd be better to change the logic which builds those sets.

The second thing I'd do is to change the loops to iterate over the items
of the list, instead of the keys. For example, change

for id in StatusContainer[cc][srv]['RECV']:



to something like:
for key, val in StatusContainer[cc][srv].items()
id = val['RECV']

then below you'd have lines like:

val.setdefault('ACK', []).append(id)

Obviously you'd pick a better name than key and val.

This last change should help some, but not nearly as much as the first
one. You could continue to the outer loops, but it probably won't make
much difference.

The code listed above is untested, so it probably has typos, or maybe
even gross errors.

DaveA
 
C

Chris Rebert

Hi guys,

Thank you for your suggestions.  I have managed to get my whole script to execute in under 10 seconds by changing the 'for loop' I posted aboveto the following:-

for opco in Cn:
       for service in Cn[opco]:
               ack = set(Cn[opco][service]['RECV']) & set(Pr['13'])
               for jarid in ack:
                       Cn[opco][service].setdefault('ACK', set()).add(jarid)
               nack = set(Cn[opco][service]['RECV']) & set(Pr['14'])
               for jarid in nack:
                       Cn[opco][service].setdefault('NACK', set()).add(jarid)
               retry = set(Cn[opco][service]['RECV']) & set(Pr['504'])
               for jarid in retry:
                       Cn[opco][service].setdefault('RETRY', set()).add(jarid)
               retry1 = set(Cn[opco][service]['RECV']) & set(Pr['505'])
               for jarid in retry1:
                       Cn[opco][service].setdefault('RETRY', set()).add(jarid)

Code duplication ahoy! Let's refactor that:

pairs = [('ACK', '13'), ('NACK', '14'), ('RETRY', '504'), ('RETRY', '505')]
for opco in Cn:
for service in Cn[opco]:
for msg, prkey in pairs:
ids = set(Cn[opco][service]['RECV']) & set(Pr[prkey])
for jarid in ids:
        Cn[opco][service].setdefault(msg, set()).add(jarid)

Cheers,
Chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,961
Messages
2,570,131
Members
46,689
Latest member
liammiller

Latest Threads

Top