Dictionary or Database—Please advise

Jeremy

I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem. I am considering
using a database of some sorts instead, but I have never used them
before. Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using. Is there
something built-in to Python that will allow me to do this?

Thanks,
Jeremy
 
Chris Rebert

I have lots of data that I currently store in dictionaries.  However,
the memory requirements are becoming a problem.  I am considering
using a database of some sorts instead, but I have never used them
before.  Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using.  Is there
something built-in to Python that will allow me to do this?

If you won't be using the SQL features of the database, `shelve` might
be another option; from what I can grok, it sounds like a dictionary
stored mostly on disk rather than entirely in RAM (not 100% sure
though):
http://docs.python.org/library/shelve.html

It's in the std lib and supports several native dbm libraries for its
backend; one of them should almost always be present.
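To illustrate the idea, here is a minimal sketch of shelve used as a disk-backed dictionary (the keys and file path are made up for the example, and the with-statement form assumes Python 3.4+):

```python
import os
import shelve
import tempfile

# Hypothetical store path; shelve's dbm backend adds its own file suffix.
path = os.path.join(tempfile.mkdtemp(), "datastore")

# Write: works like a dict, but values are pickled out to disk.
with shelve.open(path) as db:
    db["alpha"] = [1, 2, 3]
    db["beta"] = {"nested": True}

# Read back in a separate session; values are unpickled on access.
with shelve.open(path) as db:
    print(db["alpha"])  # [1, 2, 3]
```

Note that keys must be strings and values must be picklable; beyond that it behaves like an ordinary dict that mostly lives on disk.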

Cheers,
Chris
 
lbolla

I have lots of data that I currently store in dictionaries.  However,
the memory requirements are becoming a problem.  I am considering
using a database of some sorts instead, but I have never used them
before.  Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using.  Is there
something built-in to Python that will allow me to do this?

Thanks,
Jeremy

Maybe shelve would be enough for your needs?
http://docs.python.org/library/shelve.html
 
Roy Smith

Jeremy said:
I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem. I am considering
using a database of some sorts instead, but I have never used them
before. Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using. Is there
something built-in to Python that will allow me to do this?

Thanks,
Jeremy

This is a very vague question, so it'll get a vague answer :)

If you have so much data that you're running into memory problems, then
yes, storing the data externally in a disk-resident database seems like a
reasonable idea.

Once you get into databases, platform independence will be an issue. There
are many databases out there to pick from. If you want something which
will work on a lot of platforms, a reasonable place to start looking is
MySQL. It's free, runs on lots of platforms, has good Python support, and
there's lots of people on the net who know it and are willing to give help
and advice.

Databases have a bit of a learning curve. If you've never done any
database work, don't expect to download MySQL (or any other database) this
afternoon and be up and running by tomorrow.

Whatever database you pick, you're almost certainly going to end up having
to install it wherever you install your application. There's no such thing
as a universally available database that you can expect to be available
everywhere.

Have fun!
 
Jeremy

If you won't be using the SQL features of the database, `shelve` might
be another option; from what I can grok, it sounds like a dictionary
stored mostly on disk rather than entirely in RAM (not 100% sure
though): http://docs.python.org/library/shelve.html

It's in the std lib and supports several native dbm libraries for its
backend; one of them should almost always be present.

Cheers,
Chris
-- http://blog.rebertia.com

Shelve looks like an interesting option, but what might pose an issue
is that I'm reading the data from a disk instead of memory. I didn't
mention this in my original post, but I was hoping that by using a
database it would be more memory efficient in storing data in RAM so I
wouldn't have to read from (or swap to/from) disk. Would using the
shelve package make reading/writing data from disk faster since it is
in a binary format?

Jeremy
 
D'Arcy J.M. Cain

Once you get into databases, platform independence will be an issue. There
are many databases out there to pick from. If you want something which
will work on a lot of platforms, a reasonable place to start looking is
MySQL. It's free, runs on lots of platforms, has good Python support, and
there's lots of people on the net who know it and are willing to give help
and advice.

Or PostgreSQL. It's free, runs on lots of platforms, has good Python
support, and there's lots of people on the net who know it and are
willing to give help and advice. In addition, it is a truly enterprise
level, SQL standard, fully transactional database. Don't start with
MySQL and upgrade to PostgreSQL later when you get big. Start with the
best one now and be ready.

Databases have a bit of a learning curve. If you've never done any
database work, don't expect to download MySQL (or any other database) this
afternoon and be up and running by tomorrow.

Whatever database you get, there will probably be plenty of tutorials.
See http://www.postgresql.org/docs/current/interactive/tutorial.html
for the PostgreSQL one.
 
mk

Jeremy said:
I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem. I am considering
using a database of some sorts instead, but I have never used them
before. Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using. Is there
something built-in to Python that will allow me to do this?

Since you use dictionaries, I guess that a simple store saving key:value
pairs will do?

If so, bsddb support built into Python will do just nicely.

bsddb is multiplatform, although I have not personally tested if a
binary db created on one platform will be usable on another. You'd have
to check this.

Caveat: from what some people say I gather that binary format between
bsddb versions tends to change.
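For readers on modern Python, the same built-in key:value idea is covered by the stdlib `dbm` package (the old `bsddb` module is gone from Python 3); a hedged sketch, with a made-up file path:

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")

# "c" opens the database, creating it if it doesn't exist; keys and
# values are stored as bytes (str arguments are encoded automatically).
with dbm.open(path, "c") as db:
    db["alpha"] = "1"
    db["beta"] = "2"

# Reopen read-only; values come back as bytes.
with dbm.open(path, "r") as db:
    print(db["beta"].decode())  # 2
```

Which backend (gdbm, ndbm, or the pure-Python dumbdbm fallback) actually serves the file varies by platform, which is exactly the portability caveat mentioned above.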

There's also ultra-cool-and-modern Tokyo Cabinet key:value store with
Python bindings:

http://pypi.python.org/pypi/pytc/

I didn't test it, though.


Regards,
mk
 
mk

Jeremy said:
Shelve looks like an interesting option, but what might pose an issue
is that I'm reading the data from a disk instead of memory. I didn't
mention this in my original post, but I was hoping that by using a
database it would be more memory efficient in storing data in RAM so I
wouldn't have to read from (or swap to/from) disk. Would using the
shelve package make reading/writing data from disk faster since it is
in a binary format?

Read the docs:

"class shelve.BsdDbShelf(dict[, protocol=None[, writeback=False]])
A subclass of Shelf which exposes first(), next(), previous(),
last() and set_location() which are available in the bsddb module but
not in other database modules. The dict object passed to the constructor
must support those methods. This is generally accomplished by calling
one of bsddb.hashopen(), bsddb.btopen() or bsddb.rnopen(). The optional
protocol and writeback parameters have the same interpretation as for
the Shelf class."

Apparently using shelve internally gives you the option of using bsddb,
which is good: bsddb is a B-tree DB, which is highly efficient for
finding keys. I would recommend bsddb.btopen(), as it creates a B-tree DB
(perhaps other implementations, like anydbm or hash db, are good as well,
but I personally didn't test them out).

I can't speak for the Berkeley DB implementation, but in general the
B-tree algorithm has O(log2 n) complexity for finding keys, which roughly
means that if you need to find a particular key in a db of 1 million
keys, you'll probably need ~20 disk accesses (or even fewer if some keys
looked at during the search happen to be in the same disk sectors). So
yes, it's highly efficient.
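The ~20-access figure above follows directly from the complexity claim; a quick check (assuming one disk access per tree level, a worst case that real B-trees beat by fanning out far more than two ways per node):

```python
import math

n = 1_000_000
# Binary-tree worst case: one disk access per level of the tree.
accesses = math.ceil(math.log2(n))
print(accesses)  # 20
```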

Having said that, remember that disk is many orders of magnitude slower
than RAM, so it's no free lunch. Nothing will beat a memory-based data
structure when it comes to speed (well, new flash or hybrid disks perhaps
could improve significantly on current mainstream mechanical-platter
disks? there are some hyper-fast storage hardware companies out there,
although they tend to charge an arm and a leg for their stuff for now).

Caveat: Berkeley DB is dual-licensed -- if you're using it for
commercial work, it might be that you'd need to buy a license for it.
Although I have had no experience with this really, if someone here did
perhaps they will shed some light on it?

Regards,
mk
 
mk

D'Arcy J.M. Cain said:
Or PostgreSQL. It's free, runs on lots of platforms, has good Python
support, and there's lots of people on the net who know it and are
willing to give help and advice. In addition, it is a truly enterprise
level, SQL standard, fully transactional database. Don't start with
MySQL and upgrade to PostgreSQL later when you get big. Start with the
best one now and be ready.

I second that: I burned my fingers on MySQL quite a few times and don't
want to have anything to do with it anymore. Eventually you hit the wall
with MySQL (although I haven't tested latest and best, perhaps they
improved).

E.g. don't even get me started on replication that tends to randomly
fizzle out quietly without telling you anything about it. Or that
FOREIGN KEY is accepted but referential integrity is not enforced. Or
that if you want real transactions, you have to choose InnoDB table type
but then you lose much of the performance. Etc.

No, if you have a choice, avoid MySQL and go for PGSQL. It's fantastic,
if (necessarily) complex.

Regards,
mk
 
CM

I have lots of data

How much is "lots"?

that I currently store in dictionaries.  However,
the memory requirements are becoming a problem.  I am considering
using a database of some sorts instead, but I have never used them
before.  Would a database be more memory efficient than a dictionary?


What do you mean by more efficient?

I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using.  Is there
something built-in to Python that will allow me to do this?

The SQLite database engine is built into Python 2.5 and up. I have
heard on this list that there may be problems with it with Python 2.6
and up on Linux, but I've stayed with 2.5 and it works fine for me on
WinXP, Vista, and Linux.

You can use it as a disk-stored single database file, or an
in-memory-only database. The SQLite website (http://www.sqlite.org/)
claims it is the "most widely deployed SQL database engine in the
world", for what that's worth.

Have a look at this:
http://docs.python.org/library/sqlite3.html
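A minimal sketch of the bundled sqlite3 module (the table and column names here are invented for illustration; pass a filename instead of :memory: for a disk-stored database file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # or e.g. "mydata.db" for on-disk storage
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [("alpha", "1"), ("beta", "2")])
conn.commit()

# Parameterized lookup by key, much like a dict access.
row = conn.execute("SELECT value FROM kv WHERE key = ?", ("beta",)).fetchone()
print(row[0])  # 2
```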

Che
 
Patrick Sabin

Shelve looks like an interesting option, but what might pose an issue
is that I'm reading the data from a disk instead of memory. I didn't
mention this in my original post, but I was hoping that by using a
database it would be more memory efficient in storing data in RAM so I
wouldn't have to read from (or swap to/from) disk.

A database usually stores data on disk and not in RAM. However you could
use sqlite with :memory:, so that it runs in RAM.

Would using the shelve package make reading/writing data from disk
faster since it is in a binary format?

Faster than what? Shelve uses caching, so it is likely to be faster than
a self-made solution. However, accessing disk is much slower than
accessing RAM.
- Patrick
 
Aahz

A database usually stores data on disk and not in RAM. However you could
use sqlite with :memory:, so that it runs in RAM.

The OP wants transparent caching, so :memory: wouldn't work.
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer
 
Luis M. González

I have lots of data that I currently store in dictionaries.  However,
the memory requirements are becoming a problem.  I am considering
using a database of some sorts instead, but I have never used them
before.  Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using.  Is there
something built-in to Python that will allow me to do this?

Thanks,
Jeremy

For small quantities of data, dicts are ok, but if your app
continually handles (and add) large quantities of data, you should use
a database.
You can start with sqlite (which is bundled with python) and it's much
easier to use than any other relational database out there.
You don't need to install anything, since it's not a database server.
You simply save your databases as files.
If you don't know SQL (the standard language used to query databases),
I recommend this online tutorial: http://www.sqlcourse.com/

Good luck!
Luis
 
Aahz

Whatever database you pick, you're almost certainly going to end up having
to install it wherever you install your application. There's no such thing
as a universally available database that you can expect to be available
everywhere.

...unless you use SQLite.
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

 
Aahz

I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem. I am considering
using a database of some sorts instead, but I have never used them
before. Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using. Is there
something built-in to Python that will allow me to do this?

If you're serious about needing both a disk-based backing store *and*
getting maximum use/performance from your RAM, you probably will need to
combine memcached with one of the other solutions offered.

But given your other requirement of not installing a DB, your best option
will certainly be SQLite. You can use :memory: databases, but that will
require shuttling data to disk manually. I suggest that you start with
plain SQLite and only worry if you prove (repeat, PROVE) that the DB is your
bottleneck.
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

 
Phlip

I have lots of data that I currently store in dictionaries.  However,
the memory requirements are becoming a problem.  I am considering
using a database of some sorts instead, but I have never used them
before.  Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using.  Is there
something built-in to Python that will allow me to do this?

If you had wall-to-wall unit tests, you could swap the database in
incrementally ("Deprecation Refactor").

You would just add one database table, switch one client of one
dictionary to use that table, pass all the tests, and integrate.
Repeat until nobody uses the dicts, then trivially retire them.
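The incremental swap described above might look like this in miniature: two backends behind one tiny interface, exercised by the same test, so each dictionary can be retired only once the test still passes (the class and method names are invented for the sketch):

```python
import sqlite3

class DictStore:
    """The current dictionary-backed store."""
    def __init__(self):
        self._d = {}
    def put(self, key, value):
        self._d[key] = value
    def get(self, key):
        return self._d[key]

class SqliteStore:
    """The replacement, backed by a database table."""
    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
    def put(self, key, value):
        self._conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                           (key, value))
    def get(self, key):
        return self._conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()[0]

# One test, run against both backends: swap one client, pass, integrate.
for store in (DictStore(), SqliteStore()):
    store.put("x", "42")
    assert store.get("x") == "42"
print("both backends pass")
```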

If you don't have unit tests, then you have a bigger problem than
memory requirements. (You can throw $50 hardware at that!)
 
Paul Rubin

Jeremy said:
I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem. I am considering
using a database of some sorts instead, but I have never used them
before. Would a database be more memory efficient than a dictionary?

What are you trying to do? Yes, a database would let you use disk
instead of ram, but the slowdown might be intolerable depending on the
access pattern. You really have to consider what the whole application
is up to, and what the database is doing for you. It might be imposing
a high overhead to create benefits (e.g. transaction isolation) that
you're not actually using.

Somehow I've managed to do a lot of programming on large datasets
without ever using databases very much. I'm not sure that's a good
thing; but anyway, a lot of times you can do better with externally
sorted flat files, near-line search engines, etc. than you can with
databases.

If the size of your data is fixed or is not growing too fast, and it's
larger than your computer's memory by just a moderate amount (e.g. you
have a 2GB computer and 3GB of data), the simplest approach may be to
just buy a few more ram modules and put them in your computer.
 
Steven D'Aprano

A database usually stores data on disk and not in RAM. However you could
use sqlite with :memory:, so that it runs in RAM.

The OP started this thread with:

"I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem."

So I'm amused that he thinks the solution to running out of memory is to
run the heavy overhead of a database entirely in memory, instead of
lightweight dicts :)

My advice is, first try to optimize your use of dicts. Are you holding
onto large numbers of big dicts that you don't need? Are you making
unnecessary copies? If so, fix your algorithms.

If you can't optimize the dicts anymore, then move to a proper database.
Don't worry about whether it is on-disk or not, databases tend to be
pretty fast regardless, and it is better to run your application a little
bit more slowly than to not run it at all because it ran out of memory
halfway through processing the data.
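Before reaching for a database, it can help to measure what the dicts actually cost; a rough sketch (note that sys.getsizeof reports only each object's own overhead, not what it references, so keys and values have to be summed separately — the sample dict here is made up):

```python
import sys

d = {i: str(i) for i in range(1000)}

# Container footprint plus the (shallow) size of every key and value.
total = sys.getsizeof(d) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())

print(total > sys.getsizeof(d))  # True
```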
 
Dennis Lee Bieber

I have lots of data that I currently store in dictionaries. However,
the memory requirements are becoming a problem. I am considering
using a database of some sorts instead, but I have never used them
before. Would a database be more memory efficient than a dictionary?
I also need platform independence without having to install a database
and Python interface on all the platforms I'll be using. Is there
something built-in to Python that will allow me to do this?

Uhm... mind if I back up a bit first...

Where is all this data coming from in the first place? You aren't
reading/parsing some massive data file every time the program is run,
are you?

If you are, I definitely recommend you first consider turning that
into a persistent database -- and only update it for changes, not on
every run of the program.

As for run-time memory requirements... Well... they would go down as
long as you don't fetch the entire contents of the database... But run
time may go up, as the database would be a file (or more, depending on
engine) -- OTOH, if the parsing of the raw data is intensive, the run
time may go down as the database is already parsed.

Newer versions of Python include the SQLite3 adapter (and on
Windows, the SQLite3 engine -- Linux may rely upon the engine itself
being part of the OS install; but one then needs to ensure Python was
built with the SQLite3 option).
 
