MLDBM tie is very slow


Rob Z

Hi all,

I am working with MLDBM to access a static "database file". (Written
once, never altered, only read.) The file is ~75MB and is a 4-level
HoH, i.e. a hash of hashes of hashes of hashes. It is running on Linux
on a dual-CPU XServe with Perl 5.8.

The trouble is that the tie() command is taking ~10 seconds when first
connecting to the database file. I would like to shorten this as much
as possible. I don't need the file read into memory at the beginning; I
can read in each entry as it is needed later. I would actually like to
leave as much data out of memory as I can, until it is really needed.
As far as I can tell, the whole file isn't being read into memory
(memory use is ~50MB for the process after the tie()), but a good
portion is. My concern is that this file will grow by about 8x over
the next few months, to 500+MB.

Anyway, I am looking for alternatives or options for speeding up that
initial tie() and making as small a memory commitment up front as
possible. Any ideas?


Thanks,
Rob
 

xhoster

Rob Z said:
Hi all,

I am working with MLDBM to access a static "database file". (Written
once, never altered, only read.) The file is ~75MB and is a 4-level
HoH, i.e. a hash of hashes of hashes of hashes. It is running on Linux
on a dual-CPU XServe with Perl 5.8.

The trouble is that the tie() command is taking ~10 seconds when first
connecting to the database file.

Just saying you use MLDBM is not sufficient. Please provide two pieces of
runnable code, one that creates a structure similar to what you are working
with and writes it out, and one that times the opening of that structure.
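
For instance, something along these lines would do. This is only a sketch:
the DB_File backend, the Storable serializer, and the shape of the test
data are all assumptions standing in for whatever you actually use.

#!/usr/bin/perl
# build_db.pl - write a small 4-level HoH to an MLDBM file (made-up test data)
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use MLDBM qw(DB_File Storable);   # assumed backend and serializer

my %db;
tie %db, 'MLDBM', 'test.db', O_CREAT | O_RDWR, 0644
    or die "Cannot create test.db: $!";

for my $k1 (1 .. 200) {
    my %inner;
    for my $k2 (1 .. 10) {
        for my $k3 (1 .. 10) {
            for my $k4 (1 .. 10) {
                $inner{$k2}{$k3}{$k4} = "value $k1/$k2/$k3/$k4" x 5;
            }
        }
    }
    # MLDBM serializes one whole top-level value per assignment
    $db{"key$k1"} = \%inner;
}
untie %db;

#!/usr/bin/perl
# time_tie.pl - time tie() against the file written above
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use MLDBM qw(DB_File Storable);
use Time::HiRes qw(time);

my %db;
my $start = time();
tie %db, 'MLDBM', 'test.db', O_RDONLY, 0644
    or die "Cannot tie test.db: $!";
printf "tie() took %.3f seconds\n", time() - $start;
untie %db;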

I would like to shorten this as much
as possible. I don't need the file read into memory at the beginning; I
can read in each entry as it is needed later.

I could be wrong, but I don't think that this is the nature of MLDBM.
I would actually like to
leave as much data out of memory as I can, until it is really needed.
As far as I can tell, the whole file isn't being read into memory
(memory use is ~50MB for the process after the tie()),

This doesn't mean much. It could just mean that the on-disk format of
MLDBM data is 50% less space-efficient than the in-memory format.
but a good
portion is. My concern is that this file will grow by about 8x over
the next few months, to 500+MB.

I thought the file never changed?
Anyway, I am looking for alternatives or options for speeding up that
initial tie()

How about not doing a tie at all? Store the data in a file using Storable
directly, retrieve it into a hashref directly with Storable.
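
Roughly like this, as a sketch -- the file name is a placeholder and the
tiny literal hash just stands in for the real 4-level structure:

use strict;
use warnings;
use Storable qw(nstore retrieve);

# One-time conversion step: dump the whole structure with Storable.
# nstore() writes in network byte order, so the file is portable.
my $big_hoh = { a => { b => { c => { d => 'value' } } } };
nstore($big_hoh, 'database.stor') or die "nstore failed: $!";

# In the reading program: pull everything back as a single hashref.
my $data = retrieve('database.stor');
print $data->{a}{b}{c}{d}, "\n";

Note that retrieve() slurps the entire structure into memory in one go,
which is exactly the trade-off discussed below.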

and making as small a memory commitment up front as
possible. Any ideas?

Why? If you are ultimately going to end up having it all in memory anyway
(which I assume you are because you say "up front"), why not just load it
into memory and get it over with?

Xho
 

Brian Wakem

Rob said:
Hi all,

I am working with MLDBM to access a static "database file". (Written
once, never altered, only read.) The file is ~75MB and is a 4-level
HoH, i.e. a hash of hashes of hashes of hashes. It is running on Linux
on a dual-CPU XServe with Perl 5.8.

The trouble is that the tie() command is taking ~10 seconds when first
connecting to the database file. I would like to shorten this as much
as possible. I don't need the file read into memory at the beginning; I
can read in each entry as it is needed later. I would actually like to
leave as much data out of memory as I can, until it is really needed.
As far as I can tell, the whole file isn't being read into memory
(memory use is ~50MB for the process after the tie()), but a good
portion is. My concern is that this file will grow by about 8x over
the next few months, to 500+MB.


You said it will never be altered.

Anyway, I am looking for alternatives or options for speeding up that
initial tie() and making as small a memory commitment up front as
possible. Any ideas?


When dealing with large amounts of data you should be thinking RDBMS.
 

Rob Z

I apologize, I should have been more specific, since this is what
everyone is latching on to:

The file will never be altered once it is written. Over the coming
months, new files of the exact same name and hierarchical structure
will be written over the original. The size of those files will become
increasingly large up to 500+MB.



As far as why not read the whole thing into memory at the front, there
are a few reasons, but the easiest to explain is: If a user wants to
make a query for a single data element, having to wait (eventually up
to a minute maybe) for a response just because we are reading the
entire DB into memory is a bit frustrating.

Good point about memory vs. disk size efficiency though, Xho. I will
also look into using Storable directly.

As far as RDBMS, I am trying to avoid it, since it will require
installation and configuration on many computers I have no control over
(customer machines, etc.).
 

A. Sinan Unur

As far as RDBMS, I am trying to avoid it, since it will require
installation and configuration on many computers I have no control over
(customer machines, etc.).

SQLite?
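
DBD::SQLite bundles the SQLite engine inside the driver itself, so a
single CPAN install is the only setup the customer machines would need --
there is no server to configure. A rough sketch of flattening the 4-level
HoH into one table (the table and column names are made up):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=database.sqlite', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time load: flatten the four hash levels into key columns.
$dbh->do(q{
    CREATE TABLE data (
        k1 TEXT, k2 TEXT, k3 TEXT, k4 TEXT, value TEXT,
        PRIMARY KEY (k1, k2, k3, k4)
    )
});

# Looking up a single element reads only the matching row from disk.
my ($value) = $dbh->selectrow_array(
    'SELECT value FROM data WHERE k1 = ? AND k2 = ? AND k3 = ? AND k4 = ?',
    undef, 'a', 'b', 'c', 'd');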
 

xhoster

Rob Z said:
As far as why not read the whole thing into memory at the front, there
are a few reasons, but the easiest to explain is: If a user wants to
make a query for a single data element, having to wait (eventually up
to a minute maybe) for a response just because we are reading the
entire DB into memory is a bit frustrating.

You could use the program interactively and keep it running between
queries.

Anyway, I was pleasantly surprised to discover that I confused MLDBM with
some other DBM-like thing, and that MLDBM does not keep everything in
memory. In my tests, I've seen neither slowness nor large memory usage upon
tying a large pre-existing file. So without seeing the specifics of your
code/model system, there isn't much more I can say.

As far as RDBMS, I am trying to avoid it, since it will require
installation and configuration on many computers I have no control over
(customer machines, etc.).

Installing and configuring some of the DBM modules is no walk in the park,
either.

Xho
 

Stephan Titard

Rob said:
I apologize, I should have been more specific, since this is what
everyone is latching on to:

The file will never be altered once it is written. Over the coming
months, new files of the exact same name and hierarchical structure
will be written over the original. The size of those files will become
increasingly large up to 500+MB.



As far as why not read the whole thing into memory at the front, there
are a few reasons, but the easiest to explain is: If a user wants to
make a query for a single data element, having to wait (eventually up
to a minute maybe) for a response just because we are reading the
entire DB into memory is a bit frustrating.

Good point about memory vs. disk size efficiency though, Xho. I will
also look into using Storable directly.

Maybe you can read the original file and transform it into something
that can load quickly...
(maybe even various files, and some kind of index file)
DBM::Deep is pure Perl and performs well.
SQLite could be of interest also.
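
For what it's worth, DBM::Deep keeps the nested structure on disk and only
deserializes the pieces you actually touch, so opening the file is cheap.
Roughly like this (the file name is a placeholder, and it assumes the data
has already been copied into a DBM::Deep file):

use strict;
use warnings;
use DBM::Deep;

# Opening is cheap; nothing is deserialized until you look at it.
my $db = DBM::Deep->new('database.deep');

# Only the hashes along this path are read from disk.
my $value = $db->{level1}{level2}{level3}{level4};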
 

Paul Marquess

Brian Wakem said:
You said it will never be altered.




When dealing with large amounts of data you should be thinking RDBMS.

If the application needs the flexibility/infrastructure that an RDBMS
gives, then yes, go down that route, but the amount of data being processed
is not, on its own, a good enough reason to jump ship. I know that DB_File can
easily handle this amount of data, and I'd imagine that GDBM_File can as
well. None of the DBM implementations read the complete database into memory
(unless you have explicitly set it up to do so) - they all use a small
cache.
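
For what it's worth, the DB_File cache size can be raised explicitly through
the $DB_HASH (or $DB_BTREE) info object before the tie; whether that helps a
slow startup is another matter. A sketch, with an arbitrary 16MB figure, a
placeholder file name, and DB_File/Storable assumed under MLDBM:

use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use DB_File;                      # exports $DB_HASH
use MLDBM qw(DB_File Storable);

# Ask Berkeley DB for a 16MB cache instead of the small default.
$DB_HASH->{'cachesize'} = 16 * 1024 * 1024;

my %db;
tie %db, 'MLDBM', 'database.db', O_RDONLY, 0644, $DB_HASH
    or die "Cannot tie database.db: $!";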

Regarding the performance problem at hand, a 10-second startup time implies
there is something fundamentally wrong, either with the way the code has
been written or with the environment it is running under. To be able to help,
we need to see some code.

cheers
Paul
 

Bill Davidsen

Rob said:
I apologize, I should have been more specific, since this is what
everyone is latching on to:

The file will never be altered once it is written. Over the coming
months, new files of the exact same name and hierarchical structure
will be written over the original. The size of those files will become
increasingly large up to 500+MB.

It would seem that "changing the file" and "replacing the file with one
which has changed" is a distinction without a difference. It still
precludes any solution involving leaving a program connected, building a
fast and fancy index, etc.
 
