huge dictionary -> bsddb/pickle question

lazy

Hi,

I have a dictionary something like this,

key1=>{key11=>[1,2] , key12=>[6,7] , .... }
For lack of better wording, I will call the outer dictionary dict1 and
its value (the inner dictionary) dict2, which is a dictionary of small
fixed-size lists (2 items each).

The key of dict1 is a string and the value is another dictionary
(let's say dict2).
dict2 has string keys and values that are lists of 2 integers.

I'm processing a HUGE amount of data (~100M inserts into the dictionary).
I tried 2 options; both seem to be slow, and I'm seeking suggestions to
improve the speed. The code is in bits and pieces, so I'm just giving
the idea.

1) Use bsddb. When an insert is done, the db has key1 as key and the
value (i.e. db[key1]) is the pickled value of dict2. After 1000 inserts
I close and reopen the db in order to flush the contents to disk. Also,
when I try to insert a key that is already present, I unpickle the
value, change something in dict2, and then pickle it back into the
bsddb. (See the first sketch after option 2 below.)

2) Instead of pickling the value (dict2) and storing it in bsddb
immediately, I keep dict1 (the outer dictionary) in memory, and when it
reaches 1000 inserts I store it to bsddb as before, pickling each
individual value. The advantage of this is that when an insert is done
and the key is already present in memory, I just adjust the value and
don't need to unpickle and re-pickle it. If it's not present in memory,
I still need to look it up in bsddb. (See the second sketch below.)
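
Roughly, option 1 looks something like this (simplified; the file name
and the exact update inside dict2 are just placeholders):

import bsddb
import cPickle as pickle

db = bsddb.hashopen('dict1.db', 'c')    # file name is just an example

def insert(key1, key11, pair):
    # If key1 is already stored, unpickle dict2, update it, pickle it back.
    if db.has_key(key1):
        dict2 = pickle.loads(db[key1])
    else:
        dict2 = {}
    dict2[key11] = pair
    db[key1] = pickle.dumps(dict2, pickle.HIGHEST_PROTOCOL)

insert('key1', 'key11', [1, 2])
db.sync()    # explicit flush instead of closing/reopening every 1000 inserts
db.close()

And option 2, roughly (the flush() helper and the 1000-key threshold
handling are simplified):

import bsddb
import cPickle as pickle

db = bsddb.hashopen('dict1.db', 'c')
cache = {}                              # the in-memory dict1

def flush():
    # Write every cached dict2 back to bsddb, pickling each value.
    for key1, dict2 in cache.iteritems():
        db[key1] = pickle.dumps(dict2, pickle.HIGHEST_PROTOCOL)
    cache.clear()
    db.sync()

def insert(key1, key11, pair):
    if key1 not in cache and db.has_key(key1):
        # Not in memory but already on disk: load it once, then update in RAM.
        cache[key1] = pickle.loads(db[key1])
    cache.setdefault(key1, {})[key11] = pair
    if len(cache) >= 1000:
        flush()

insert('key1', 'key12', [6, 7])
flush()
db.close()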

This is not getting up to speed even with option 2. Before inserting I
do some processing on the line, so the bottleneck is not clear to me
(i.e. whether it's the processing or the inserting into the db). But I
guess it's mainly because of the pickling and unpickling.

Any suggestions will be appreciated :)
 

Marc 'BlackJack' Rintsch

I have a dictionary something like this,

key1=>{key11=>[1,2] , key12=>[6,7] , .... }
For lack of better wording, I will call the outer dictionary dict1 and
its value (the inner dictionary) dict2, which is a dictionary of small
fixed-size lists (2 items each).

The key of dict1 is a string and the value is another dictionary
(let's say dict2).
dict2 has string keys and values that are lists of 2 integers.

I'm processing a HUGE amount of data (~100M inserts into the dictionary).
I tried 2 options; both seem to be slow, and I'm seeking suggestions to
improve the speed. The code is in bits and pieces, so I'm just giving
the idea.

[…]

This is not getting up to speed even with option 2. Before inserting I
do some processing on the line, so the bottleneck is not clear to me
(i.e. whether it's the processing or the inserting into the db). But I
guess it's mainly because of the pickling and unpickling.

Any suggestions will be appreciated :)

I guess your guess about the pickling being the bottleneck is correct,
but measuring/profiling will give more confidence; see the quick sketch
below.
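
For instance, something like this shows where the time actually goes,
assuming the whole load loop is wrapped in a (hypothetical) function
process_file():

import cProfile
import pstats

cProfile.run('process_file("input.txt")', 'load.prof')
pstats.Stats('load.prof').sort_stats('cumulative').print_stats(20)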

Maybe a database other than bsddb would be useful here: an SQL one like
SQLite, or maybe an object DB like ZODB or Durus.

Ciao,
Marc 'BlackJack' Rintsch
 

Alex Martelli

lazy said:
key1=>{key11=>[1,2] , key12=>[6,7] , .... } ...
I'm processing a HUGE amount of data (~100M inserts into the dictionary).

What will you need from the "saved" version later? If you need lookups
by single keys or key pairs then a relational DB may be best; if you
need to recover the whole huge dictionary into memory (assuming it will
fit, e.g. you have a 64-bit machine with ample RAM) then it might be
more efficient to "dump" things into a custom-designed format. If I was
uncertain about future uses of the saved data, I'd go for the maximal
flexibility afforded by a relational DB -- if you save to a relational
DB you can later easily use the data in every which way, including
ad-hoc ones (such flexibility is a key architectural feature of
relational databases). E.g., a single table with 4 fields might be
enough: key, subkey, leftint, rightint -- each field of the appropriate
type (you do tell us that the two ints are smallish integers, but we
don't know directly about the types of the keys -- besides a hint that
the main keys are presumably strings of some kind since that's what
bsddb likes as its own keys, but still, choosing the right kind of
string may be important in order to save space and obtain good
performance). The (key,subkey) pair would be the primary key on that
table, and unless and until you need peculiar "navigation" in the future
you may not require other indices, constraints, or whatever.
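
For concreteness, here is a minimal sketch of that single table using
the sqlite3 module that ships with Python 2.5 (the file name and the
upsert() helper are only illustrative, and INSERT OR REPLACE is
SQLite-specific; PostgreSQL would need a different upsert strategy):

import sqlite3

conn = sqlite3.connect('counts.db')         # file name is just an example
conn.execute("""
    CREATE TABLE IF NOT EXISTS counts (
        key      TEXT NOT NULL,
        subkey   TEXT NOT NULL,
        leftint  INTEGER NOT NULL,
        rightint INTEGER NOT NULL,
        PRIMARY KEY (key, subkey)
    )
""")

def upsert(key, subkey, left, right):
    # Illustrative helper: store or overwrite the pair for (key, subkey).
    conn.execute(
        "INSERT OR REPLACE INTO counts (key, subkey, leftint, rightint) "
        "VALUES (?, ?, ?, ?)",
        (key, subkey, left, right))

upsert('key1', 'key11', 1, 2)
upsert('key1', 'key12', 6, 7)
conn.commit()
conn.close()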

At the scale you're talking about (multi-gigabyte, it appears) it's
probably worth using a powerful DB (such as PostgreSQL, interfacing it
to Python via psycopg2) rather than a lightweight one (such as sqlite,
which comes with Python 2.5); however, depending on how critical
performance is, such issues are often best resolved by benchmarking
(similarly, you might benchmark the performance for inserting into the
properly constrained table from the start, vs originally making the
table constraints-less and using an ALTER TABLE later to add the primary
key constraint/index that will be useful for later lookups; "bulk
insertions" into DBs can often benefit from the latter idea).


Alex
 
