Jeremy said:
Shelve looks like an interesting option, but what might pose an issue
is that I'm reading the data from a disk instead of memory. I didn't
mention this in my original post, but I was hoping that by using a
database it would be more memory efficient in storing data in RAM so I
wouldn't have to read from (or swap to/from) disk. Would using the
shelve package make reading/writing data from disk faster since it is
in a binary format?
Read the docs:
"class shelve.BsdDbShelf(dict[, protocol=None[, writeback=False]])¶
A subclass of Shelf which exposes first(), next(), previous(),
last() and set_location() which are available in the bsddb module but
not in other database modules. The dict object passed to the constructor
must support those methods. This is generally accomplished by calling
one of bsddb.hashopen(), bsddb.btopen() or bsddb.rnopen(). The optional
protocol and writeback parameters have the same interpretation as for
the Shelf class."
Apparently using shelve internally gives you option of using bsddb,
which is good bsddb is B-tree DB, which is highly efficient for
finding keys. I would recommend bsddb.btopen(), as it creates B-tree DB
(perhaps other implementations, like anydb or hash db are good as well,
but I personally didn't test them out).
I can't say for Berkeley DB implementation, but in general B-tree
algorithm has O(log2 n) complexity for finding keys, which roughly means
that if you need to find particular key in a db of 1 million keys,
you'll probably need ~20 disk accesses (or even less if some keys looked
at in the process of search happen to be in the same disk sectors). So
yes, it's highly efficient.
Having said that, remember that disk is many orders of magnitude slower
than RAM, so it's no free lunch.. Nothing will beat memory-based data
structure when it comes to speed (well new flash or hybrid disks perhaps
could significantly improve in comparison to current mainstream
mechanical-platter disks? there are some hyper-fast storage hardware
companies out there, although they tend to charge arm and leg for their
stuff for now).
Caveat: Berkeley DB is dual-licensed -- if you're using it for
commercial work, it might be that you'd need to buy a license for it.
Although I have had no experience with this really, if someone here did
perhaps they will shed some light on it?
Regards,
mk