weird dict problem, how can this even happen?


Joel Hedlund

I'm having a very hard time explaining why this snippet *sometimes*
raises KeyError:

snippet:
print type(self.pool)
for frag in self.pool.keys():
    if frag is fragment_info:
        print "the fragment_info *is* in the pool", hash(frag), hash(fragment_info), hash(frag) == hash(fragment_info), frag == fragment_info, frag in self.pool, frag in self.pool.keys()
try:
    renderer_index = self.pool.pop(fragment_info)
except KeyError:
    print "Glorious KeyError!"
    for frag in self.pool.keys():
        if frag is fragment_info:
            print "the fragment_info *is* in the pool", hash(frag), hash(fragment_info), hash(frag) == hash(fragment_info), frag == fragment_info, frag in self.pool, frag in self.pool.keys()
    raise

output:
<type 'dict'>
the fragment_info *is* in the pool 987212075 987212075 True True False True
Glorious KeyError!
the fragment_info *is* in the pool 987212075 987212075 True True False True
Traceback (most recent call last):
  File "/home/yohell/workspace/missy/core/gui.py", line 92, in process_job
    renderer_index = self.pool.pop(fragment_info)
KeyError: <core.gui.FragmentInfo object at 0x8fc906c>

This snippet is part of a much larger gtk program, and the problem only
appears from time to time, predominantly when the cpu is under heavy load and
this method gets called a lot. If I didn't know better I'd say it's a
bug in python's dict implementation, but I do know better, so I know
it's far more likely that I've made a mistake somewhere. I'll be damned
if I can figure out what and where though. I've reproduced this bug (?)
with python-2.5.2 on Ubuntu 8.10 and python-2.5.1 on WinXP.

I would very much like an explanation to this that does not involve
threads, because I haven't made any that I'm aware of. I can't even
understand how this could happen. How do I even debug this?

Please help, I feel like I've taken crazy pills here!
/Joel Hedlund
 

Joel Hedlund

Duncan said:
It could happen quite easily if the hash value of the object has changed
since it was put in the dictionary. what does the definition of your
core.gui.FragmentInfo object look like?

Dunno if it'll help much, but:

class FragmentInfo(object):
    def __init__(self, renderer, render_area):
        self.renderer = renderer
        self.render_area = render_area

    def __hash__(self):
        return hash((FragmentInfo, self.renderer, self.render_area))

    def __eq__(self, other):
        return (isinstance(other, self.__class__) and
                other.renderer == self.renderer and
                other.render_area == self.render_area)
Duncan said:
Is the hash definitely immutable?

No. It's a hash of a tuple of its key attributes, themselves similar
objects.
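
To make that concrete, here is a stripped-down sketch (in modern Python, with a hypothetical `Renderer` stand-in and made-up attribute values, not from the actual program) showing how the key's hash silently drifts when a nested attribute mutates:

```python
class FragmentInfo:
    def __init__(self, renderer, render_area):
        self.renderer = renderer
        self.render_area = render_area

    def __hash__(self):
        return hash((FragmentInfo, self.renderer, self.render_area))

    def __eq__(self, other):
        return (isinstance(other, self.__class__) and
                other.renderer == self.renderer and
                other.render_area == self.render_area)

class Renderer:
    """Stand-in for a mutable renderer whose hash follows its config."""
    def __init__(self, config):
        self.config = config

    def __hash__(self):
        return hash(self.config)

    def __eq__(self, other):
        return isinstance(other, Renderer) and other.config == self.config

r = Renderer("thin lines")
info = FragmentInfo(r, render_area=(0, 0, 256, 256))
before = hash(info)
r.config = "thick lines"     # mutating a nested attribute...
print(hash(info) == before)  # False: ...changes the key's hash underneath the dict
```

Any dict that stored `info` as a key before the mutation will now look in the wrong bucket for it.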

The program can be thought of as a map viewer. In the gui part, image
fragments are cached for speed, and fragments are only half rendered if
there's a lot of complex features to draw. The "pool" consists of semi
rendered fragments. The reason I did it this way is the cache. The cache
is a dict with a use tracker so when the hash changes, the older
fragments eventually drop from the cache. Hmm... I'm starting to realise
now why my implementation of this isn't so hot. I'm going to hazard a
guess here, and then you can correct me ok?

I create a dict and populate it with a key-value pair, and then the
key's hash changes. When the key is returned by k = d.keys()[0], then k is
not in d, even though k is in d.keys().

Simple example:
class moo(object):
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)

d = {moo(1): 1}

k = d.keys()[0]
k.a = 2

k = d.keys()[0]
print k in d, k in d.keys()
d.pop(k)
output:
False True
Traceback (most recent call last):
  File "/bioinfo/yohell/eclipse/Test/src/test.py", line 14, in <module>
    d.pop(k)
KeyError: <__main__.moo object at 0x7f1c64120590>

I'd say that's pretty similar to what I observed.

I guess the logical outcome is that the cache dict will fill up with old
junk that I can't access and can't selectively flush since the hashes
have changed, unless I actually somehow save copies of the keys, which
can get pretty complex and probably won't do wonders for execution
speed. Yeah, this was probably a bad solution.

I should probably do this with lists instead because I can't really
think of a way of salvaging this. Am I right?

Thanks for your help!
/Joel
 

Arnaud Delobelle

Joel Hedlund said:
I'm having a very hard time explaining why this snippet *sometimes*
raises KeyError:

[snip]

It could happen if two objects x and y are such that hash(x) != hash(y)
but x == y, breaking the requirement on hash values.
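
For instance, a minimal sketch (in modern Python, not from the thread) of a class that violates this contract by comparing equal by value while hashing by identity:

```python
class Bad:
    """Compares equal by value but hashes by identity, violating the
    rule that x == y must imply hash(x) == hash(y)."""
    def __init__(self, a):
        self.a = a

    def __eq__(self, other):
        return isinstance(other, Bad) and other.a == self.a

    __hash__ = object.__hash__  # identity-based hash breaks the contract

d = {Bad(1): "x"}
k = Bad(1)
print(k == list(d.keys())[0])  # True: the two objects compare equal
print(k in d)                  # False: their hashes differ, so the lookup misses
```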

HTH
 

Joel Hedlund

Duncan said:
I think you probably are correct. The only thing I can think that might
help is if you can catch all the situations where changes to the dependent
values might change the hash and wrap them up: before changing the hash pop
the item out of the dict, then reinsert it after the change.
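
That pop/reinsert pattern could be sketched roughly like this (a hypothetical `rekey` helper and a simplified `Frag` key, in modern Python; none of these names are from the actual program):

```python
from contextlib import contextmanager

@contextmanager
def rekey(d, key):
    """Pop the entry before the key mutates, reinsert it afterwards so it
    lands in the bucket matching its new hash."""
    value = d.pop(key)
    try:
        yield
    finally:
        d[key] = value

class Frag:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        return isinstance(other, Frag) and other.a == self.a

pool = {Frag(1): "bitmap"}
key = list(pool.keys())[0]
with rekey(pool, key):
    key.a = 2        # hash changes while the entry is safely out of the dict
print(key in pool)   # True: reinserted under the new hash, still reachable
```

The hard part, as noted above, is guaranteeing that every mutation path actually goes through such a wrapper.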

That would probably require a lot of uncomfortable signal handling,
especially for a piece of functionality I'd like to be as unobtrusive as
possible in the application.
Alternatively give up on defining hash and __eq__ for FragmentInfo and rely
on object identity instead.

Object identity wouldn't work so well for caching. Objects would always
be drawn as they appeared for the first time. No updates would be shown
until the objects were flushed from the cache.

I've been experimenting with a list cache now and I can't say I'm
noticing any change in performance for a cache of 100 items. I'm still
using the hash to "freeze" a sort of object tag in order to detect
changes, and I require both hash and object equality for cache hits,
like so:
def index(self, key):
    h = hash(key)
    for i, item in enumerate(self.items):
        if item.hash == h and item.key == key:
            return i
    raise KeyError(key)

This seems to do what I want and does OK performance wise.
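
Fleshed out into a complete class, a list cache built around that `index` method might look something like this (a sketch in modern Python; the `put`/`get` names, `_Item` holder, and eviction details are illustrative, not from the thread):

```python
class ListCache:
    """A small LRU cache backed by a list. Each entry freezes the key's
    hash at insertion time; a lookup needs both the frozen hash and full
    equality to match, so mutated keys simply miss and age out."""

    class _Item:
        def __init__(self, key, value):
            self.key = key
            self.value = value
            self.hash = hash(key)  # frozen tag, detects later mutation

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self.items = []  # most recently used first

    def index(self, key):
        h = hash(key)
        for i, item in enumerate(self.items):
            if item.hash == h and item.key == key:
                return i
        raise KeyError(key)

    def put(self, key, value):
        self.items.insert(0, self._Item(key, value))
        del self.items[self.maxsize:]  # evict least recently used from tail

    def get(self, key):
        item = self.items.pop(self.index(key))
        self.items.insert(0, item)  # promote to front on a hit
        return item.value
```

Unlike the dict version, a key whose hash changes after insertion does not strand an unreachable entry forever: the stale item just stops matching and is eventually evicted from the tail.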

Thanks again!

/Joel
 

Steven D'Aprano

Joel Hedlund said:
That would probably require a lot of uncomfortable signal handling,
especially for a piece of functionality I'd like to be as unobtrusive as
possible in the application.


Object identity wouldn't work so well for caching. Objects would always
be drawn as they appeared for the first time. No updates would be shown
until the objects were flushed from the cache.

Perhaps I don't understand your program structure, but I don't see how
that follows.


I've been experimenting with a list cache now and I can't say I'm
noticing any change in performance for a cache of 100 items.

100 items isn't very big though. If you have 50,000 items you may notice
significant slow down :)

If having many items in the cache is possible, you should consider using
a binary search instead of a linear search through the cache. See the
bisect module.
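
One way to sketch that suggestion (hypothetical code, in modern Python): keep two parallel lists sorted by the frozen hash, bisect to the first candidate, then scan the usually tiny run of equal hashes checking real equality:

```python
import bisect

class SortedCache:
    """Cache with O(log n) lookup: entries kept sorted by frozen hash.
    Equality is still verified, since distinct keys can share a hash."""

    def __init__(self):
        self.hashes = []   # sorted frozen hashes
        self.entries = []  # (key, value) pairs, parallel to self.hashes

    def put(self, key, value):
        h = hash(key)
        i = bisect.bisect_left(self.hashes, h)
        self.hashes.insert(i, h)
        self.entries.insert(i, (key, value))

    def get(self, key):
        h = hash(key)
        i = bisect.bisect_left(self.hashes, h)
        while i < len(self.hashes) and self.hashes[i] == h:
            k, v = self.entries[i]
            if k == key:
                return v
            i += 1
        raise KeyError(key)
```

The trade-off is that recency order is lost, so LRU eviction would need a separate use tracker; for a cache of ~100 items the linear scan is fine anyway.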
 

Steven D'Aprano

I'm not sure I understand what you're suggesting.


Neither am I, since you've removed the hash function which might have
given readers a clue.

*wink*

Snipping irrelevant text is a Good Thing, but in this case you were a
little heavy-handed.
 

Joel Hedlund

Steven said:
Perhaps I don't understand your program structure, but I don't see how
that follows.

First off, please note that I consider my problem to be solved, many
thanks to c.l.p and especially Duncan Booth. But of course continued
discussion on this topic can be both enlightening and entertaining as
long as people are interested. So here goes:

I'm making a scientific program that visualizes data. It can be thought
of as a freely zoomable map viewer with a configurable stack of data
feature renderers, themselves also configurable to enhance different
aspects of the individual features. As you zoom in, it becomes possible
to render more types of features in a visually recognizable manner, so
consequently more feature renderers become enabled. The determining
properties for the final image are then the coordinates and zoom level
of the view, and the current renderer stack configuration. Rendering may
be time consuming, so I cache the resulting bitmap fragments using a
"key object" called FragmentInfo that holds this information.

Some renderers may take a very long time to do their work, so in order
to keep the gui nice and responsive, I interrupt the rendering chain at
that point, put the remainder of the rendering chain in a job pool, and
postpone finishing up the rendering until there are cpu cycles to spare.
At that time, I go through the job pool and ask the cache which fragment
was most recently accessed. This fragment is the most likely to actually
be in the current view, and thus the most interesting for the user to
have finalized.

Now, when the user navigates the view to any given point, the gui asks
the cache if the bitmap fragments necessary to tile up the current view
have already been rendered, and if so, retrieves them straight from the
cache and paints them to the screen. And that's why object identity
wouldn't work. If the user changes something in the config for the
current renderer stack (=mutates the objects), the renderers still
retain the same object identities and therefore the old versions would
be retrieved from the cache, and no updates would be shown until the
fragments are flushed from the cache, and the fragment subsequently
redrawn.

I guess you could delve deep into the data members and pull out all
their object identities and hash wrt that if you'd really want to, but I
don't really see the point.

The stupid cache implementation that I originally employed used a dict
to store the FragmentInfo:BitmapFragment items plus a use tracker (list)
on the side. This obviously doesn't work, because as soon as the
renderer stack mutates, list and dict go out of sync and I can no longer
selectively flush old items from the dict, because it's not readily
apparent how to reconstruct the old keys. Purely using lists here is
vastly superior because I can just .pop() the least recently used items
from the tail and be done with them. Also for lists as small as this,
the cost in performance is at most negligible.

Steven said:
100 items isn't very big though. If you have 50,000 items you may notice
significant slow down :)

If having many items in the cache is possible, you should consider using
a binary search instead of a linear search through the cache. See the
bisect module.

Thanks for the tip, but I don't foresee this cache ever needing to be
that big. ~100 is quite enough for keeping the gui reasonably responsive
in this case.

/Joel
 

Joel Hedlund

Joel said:
First off, please note that I consider my problem to be solved, many
thanks to c.l.p and especially Duncan Booth. But of course continued
discussion on this topic can be both enlightening and entertaining as
long as people are interested. So here goes:

heh, nothing like a wall of text to kill off interest I guess. :)

But thank you all for your time and helpful advice! And happy holidays!

/Joel
 
