very large dictionaries

robin

I need to do a search through about 50 million records, each of which
is less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all in
memory.

There is a single key field, so a dictionary is an obvious choice for
a structure, since Python optimises these nicely.
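(Just to make the idea concrete, here is a minimal sketch of the in-memory dict approach; the file name and record layout below are invented for illustration.)

    # Hypothetical sketch: load fixed-format records into a dict keyed on the
    # first field, then look records up by key.
    records = {}
    with open("records.dat") as f:
        for line in f:
            key, _, rest = line.rstrip("\n").partition("|")
            records[key] = rest          # one record per key

    def lookup(key):
        return records.get(key)          # None if the key is absent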

But is there a better choice? Is it worth building some sort of tree?

-- robin
 
Larry Bates

1) If the key adds another 10 bytes to each
100-byte record, that's 5.5GB of memory at a
minimum??? That is an unusually large amount of memory.

2) You can't really "search" through dictionaries
fast, but you can index into them via their key
very fast. Then again, you can also do a database
read with an index key on a table very quickly,
with none of the startup time needed to load 50
million records into memory.
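(To make that distinction concrete, a small sketch with invented contents: keyed access is a single hash lookup, while anything other than the key forces a scan of every entry.)

    table = {"key1": "ABC record one", "key2": "XYZ record two"}

    # Indexing into the dict by its key: one hash lookup, essentially constant time.
    record = table.get("key1")

    # "Searching" on anything other than the key means scanning every entry.
    matches = [rec for rec in table.values() if rec.startswith("ABC")]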

We would really need to know more about what
you mean by "search through" to make an intelligent
suggestion.

Larry Bates
Syscon, Inc.
 
William Park

robin said:
I need to do a search through about 50 million records, each of which
is less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all in
memory.

There is a single key field, so a dictionary is an obvious choice for
a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of tree?

50M x 100 bytes = 5000MB = 5GB. Have you got 5GB of memory?

Since you are talking about key/value records, you can choose from GDBM
(gdbm), Berkeley DB (dbhash), or the disk-based dictionary front-end
(shelve). You can now access a GDBM database from the Bash shell. :)
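(A minimal sketch of the shelve route, which behaves like a disk-backed dictionary; the file name and record below are made up.)

    import shelve

    # Build the disk-based key/value store once; the dbm backend is chosen automatically.
    db = shelve.open("match_table")
    db["ABC123"] = "some 48-byte payload"
    db.close()

    # Later: open read-only and index by key, without loading 50 million records into RAM.
    db = shelve.open("match_table", flag="r")
    print(db.get("ABC123"))
    db.close()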
 
Mathias Waack

robin said:
I need to do a search through about 50 million records, each of
which is less than 100 bytes wide. A database is actually too slow
for this, so I thought of optimising the data and putting it all in
memory.

You'll hardly find something faster than a real database.

robin said:
There is a single key field, so a dictionary is an obvious choice
for a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of
tree?

It depends on the structure of the data. What kind of data do you
have, what are you looking for, and on what machine (at least, how
much memory is available) will the whole thing run?

Mathias
 
Matteo Dell'Amico

William said:
Since you are talking about key/value record, you can choose from GDBM
(gdbm), Berkeley DB (dbhash), or disk-based dictionary front-end
(shelve). You can now access GDBM database from Bash shell. :)

If you just need fast lookup, maybe cdb
(http://pilcrow.madison.wi.us/#pycdb) can be of use: two disk accesses
per lookup.
 
robin

Let me specify further. Our actual data is enormous, and tests with
Postgres and MySQL indicate that the time taken just to build a single
index is on the order of hours, which is too long. After analysis we
believe that the important lookup info can be compressed to about 40
million records of 48 bytes each. Furthermore, we have the flexibility
to partition this into four somewhat equal parts. Each will hence be
about 460MB in size. Even with structural overhead I see no reason
this couldn't fit into memory. This is our match file.
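(For what it's worth, the arithmetic behind those figures:)

    records = 40_000_000
    bytes_per_record = 48
    partitions = 4

    per_partition = records * bytes_per_record // partitions   # 480,000,000 bytes
    print(per_partition / 2**20)                                # ~457.8 MiB, i.e. about 460MB per part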

Our process involves taking an input disk file, and traversing it one
record at a time. This we can sort by the same partition criteria used
to split the match file (so that each match file can be loaded but
once). For each input record we build a series of keys and compare
them to the appropriate match file; thus there are multiple lookups
per input record. An algorithm then determines which match record we
wish to pick, and we write its ID back to the input record.
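(A hedged sketch of that matching pass, assuming the current partition of the match table is held in memory as a dict mapping each key to a list of candidate records; build_keys and choose_match are hypothetical stand-ins for the real logic.)

    def process_partition(input_records, match_table, build_keys, choose_match):
        # match_table: dict mapping key -> list of candidate match records
        for rec in input_records:
            candidates = []
            for key in build_keys(rec):                      # several keys per input record
                candidates.extend(match_table.get(key, []))
            if candidates:
                rec["match_id"] = choose_match(rec, candidates)   # write the chosen ID back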

There is more to it than this, but these are the major elements. The
input table may be millions of records long, but likely will be much
smaller.

The total processing time will be a sum of:
- time to sort/index input file
- time to traverse input file
- time to load in all parts of the match table
- time to build keys on input records
- time to search match table for each key
- time to write match key ID
- overhead time of routine

A new wrinkle is that the match table may have duplicate keys, so
storing it as a dictionary is not an option.
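(One aside: a plain dict can still hold duplicate keys if each key maps to a list of records, which is essentially the dict(key: [record ids]) grouping suggested later in this thread; a toy sketch with invented data:)

    from collections import defaultdict

    match_table = defaultdict(list)
    for key, record in [("ABC", "rec1"), ("ABC", "rec2"), ("XYZ", "rec3")]:
        match_table[key].append(record)        # duplicate keys collect into a list

    print(match_table["ABC"])                  # ['rec1', 'rec2']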

The data is alphanumeric.

Assume an arbitrarily powerful computer, since adding a GB of RAM is
not an issue.

The question I had, for those with experience with large data sets, is
what structure would best handle this problem?

-- robin
 
Pierre-Frédéric Caillaud

From what I understand...

- The match table is static (or does not change often)
- Your input data changes every time
(so you don't want to load it into a postgres database every time)

Your problem:
- read the input data
- for each input row, generate several keys
- look up each key in the match table
- take appropriate action

My solution:
- read the input data
- for each input row, generate several keys
- save the keys to tempfiles in the format (key, record id),
  generating N tempfiles which will each contain one partition of the keys
  (choose the number of partitions so that each file fits into about
  1/4 of your RAM)
- for each tempfile:
  - load it
  - group by key (dict(key: [record ids])); again this fits into RAM,
    and the record IDs are already sorted because they were generated in order
  - sort on keys (again, fast because it fits in RAM)

Now you have a sorted keyset which you want to look up in a match table
which is ALSO sorted
(you sorted it beforehand because it doesn't change often).

Now start a key_pointer at the beginning of your key list, and a
match_pointer at the beginning of your match table.
You are basically comparing two sorted lists; this does not involve any
searching, just a very simple merge. Your match table is not partitioned.

if key_pointer.key < match_pointer.key:
    advance key_pointer
    (if at the end of your tempfile partition, load the next one and sort it)
elif key_pointer.key > match_pointer.key:
    advance match_pointer
elif key_pointer.key == match_pointer.key:
    you have a match: record the id

What do you think? I have already used this algorithm and it's very powerful.
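(A minimal runnable sketch of the two-pointer merge described above, assuming both sides are already in memory as sorted lists of (key, id) tuples; the tempfile-partition reloading is left out, and the toy data is invented.)

    def merge_match(sorted_keys, sorted_match):
        # Walk two sorted lists of (key, id) pairs and yield (input_id, match_id)
        # for every key present in both. Duplicate keys on the match side are
        # reported before the input pointer moves on.
        i = j = 0
        while i < len(sorted_keys) and j < len(sorted_match):
            k_key, k_id = sorted_keys[i]
            m_key, m_id = sorted_match[j]
            if k_key < m_key:
                i += 1                      # input key can't match anything further on
            elif k_key > m_key:
                j += 1                      # match entry is behind, skip it
            else:
                jj = j
                while jj < len(sorted_match) and sorted_match[jj][0] == k_key:
                    yield k_id, sorted_match[jj][1]
                    jj += 1
                i += 1

    # Invented toy data to show the call:
    keys  = [("aaa", 1), ("bbb", 2), ("ccc", 3)]
    match = [("aaa", 10), ("aaa", 11), ("ccc", 12)]
    print(list(merge_match(keys, match)))   # [(1, 10), (1, 11), (3, 12)]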
 
