Explaining how memory works with tie()ed hashes

botfood

I would like to know more about how Perl uses memory when managing a
hash that is tie()ed to a file on disk, using DB_File and variable-length
records....

I have an application where the DB file has gotten quite big, not giant,
but around 20k records and a file about 11MB in size. My
understanding of DB_File is that while there is no programmatic limit on
the number of records, there may be memory-driven limits depending on what
one does with the hash. Looping through the keys is still fast, so the
number of records doesn't seem to be a problem.
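
(For context, a minimal sketch of the kind of setup I mean; the file and
key names here are made up:)

use strict;
use warnings;
use DB_File;
use Fcntl;

# Tie a hash to a Berkeley DB file; keys and values are arbitrary
# strings, so records are variable length.
tie my %records, 'DB_File', 'records.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie records.db: $!";

$records{'some_key'} = 'some variable-length value';

# Looping over the keys streams them from disk one pair at a time,
# so the record count by itself is not held in memory all at once.
while ( my ($key, $value) = each %records ) {
    # ... work with $key / $value ...
}

untie %records;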

In this particular case, each record is not that big, except for one
specific type of 'book-keeping' record that is used to keep track of
which records are considered 'complete' by this particular application.
For a couple of reports, I need to whip through all complete records
searching for various things.... And in another spot I know that the
code looks for matches within this big record, which contains around 20k
'words' of about 20 digits each.

What I am wondering is whether it is likely that the simple number of
records eats up large amounts of memory just by being tie()ed, or if it
is more likely that this one particular internal index record is
causing me problems when it gets pulled into memory to do things like
m// or s// on its contents to find or edit a 'word', which is simply a
list of the keys having a specific status.
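
(Concretely, the situation looks something like this sketch; the key name
'__complete__' is just an invented stand-in:)

use strict;
use warnings;

# Stand-in for the tied hash: one 'book-keeping' value holds ~20k
# space-separated 20-digit words, i.e. a single scalar of roughly 420KB.
my %db;
$db{'__complete__'} = join ' ', map { sprintf '%020d', $_ } 1 .. 20_000;

my $word = sprintf '%020d', 12_345;

# Checking whether a key is 'complete' means an m// over the whole string,
# which has to be pulled into memory as one big scalar.
print "complete\n" if $db{'__complete__'} =~ /(?:^| )\Q$word\E(?: |$)/;

# Removing a key from the list means an s/// that rewrites the whole string.
(my $updated = $db{'__complete__'}) =~ s/(?:^| )\Q$word\E(?= |$)//;
$db{'__complete__'} = $updated;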

The next part of the question is.... if it sounds like a large internal
indexing record is likely to be a problem, what would some recommended
techniques be to break it out? Should I create a separate DB file to
use as an index? I am really wondering how best to 'fake' large
database capabilities to manage keeping track of status without eating
tons of memory.

TIA,
d
 
anno4000

botfood said:
I would like to know more about how Perl uses memory when managing a
hash that is tie()ed to a file on disk, using DB_File and variable-length
records....

DB_File is Berkeley DB, so that would primarily be a question about
storage management in Berkeley DB.

I have an application where the DB file has gotten quite big, not giant,
but around 20k records and a file about 11MB in size.

That's a bit more than 500 bytes per record. I'm not a bit surprised.

[...]
In this particular case, each record is not that big, except for one
specific type of 'book-keeping' record that is used to keep track of
which records are considered 'complete' by this particular application.
For a couple of reports, I need to whip through all complete records
searching for various things.... And in another spot I know that the
code looks for matches within this big record, which contains around 20k
'words' of about 20 digits each.

For an experiment, take the extra long record(s) out of the DB and store
it/them some other way. See if it makes a difference. I wouldn't expect
it to, but who knows.

What I am wondering is whether it is likely that the simple number of
records eats up large amounts of memory just by being tie()ed, or if it

Tie has nothing to do with disk storage management. That's entirely
the DB's business.

is more likely that this one particular internal index record is
causing me problems

See above.

when it gets pulled into memory to do things like
m// or s// on its contents to find or edit a 'word', which is simply a
list of the keys having a specific status.

What has "pulling into memory" to do with disk space consumption?

The next part of the question is.... if it sounds like a large internal
indexing record is likely to be a problem, what would some recommended
techniques be to break it out? Should I create a separate DB file to
use as an index? I am really wondering how best to 'fake' large
database capabilities to manage keeping track of status without eating
tons of memory.

Databases are not primarily optimized to be as small as possible but
to be fast and flexible. Also, they often grow in relatively large
steps. It could well be that you can keep adding records for a long
while before your 11MB file grows again. That's what I'd do:
add random records and watch how the DB grows while you do. That
will give you a better idea of the overhead.
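
(Roughly like this, an untested sketch; the file name, record size, and
batch counts are just placeholders:)

use strict;
use warnings;
use DB_File;
use Fcntl;

# Add batches of random records and watch how the DB file grows on disk.
tie my %db, 'DB_File', 'growth_test.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie growth_test.db: $!";

for my $batch ( 1 .. 20 ) {
    for my $i ( 1 .. 1_000 ) {
        my $key   = sprintf 'rec%02d_%06d', $batch, $i;
        my $value = join '', map { int rand 10 } 1 .. 500;   # ~500-byte record
        $db{$key} = $value;
    }
    (tied %db)->sync;                        # flush to disk before measuring
    printf "%6d records: %10d bytes on disk\n",
        $batch * 1_000, -s 'growth_test.db';
}

untie %db;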

Anno
 
J. Gleixner

botfood wrote:
[...]
In this particular case, each record is not that big, except for one
specific type of 'book-keeping' record that is used to keep track of
which records are considered 'complete' by this particular application.
For a couple of reports, I need to whip through all complete records
searching for various things.... And in another spot I know that the
code looks for matches within this big record, which contains around 20k
'words' of about 20 digits each.
[...]

The next part of the question is.... if it sounds like a large internal
indexing record is likely to be a problem, what would some recommended
techniques be to break it out? Should I create a separate DB file to
use as an index? I am really wondering how best to 'fake' large
database capabilities to manage keeping track of status without eating
tons of memory.


If I understand your issue correctly, possibly using another key
for these completed entries, or a separate DBM, would be better.

For example, if the key were a user name, you could add an additional
key of ${username}_completed. That could be stored in the same DBM, or
you could create a separate "completed" DBM with the user name as
its key; the 20 digits, or the "various things" you're looking for,
could be stored wherever it makes sense.

This way you'd know which user names were completed and could
easily access the data for each user name in the other table.
I'd think that'd be much more efficient than a single 'completed'
key containing all of the user names.

Try to design it as if it were simply a hash, which is all
it is. Using the DBM will really optimize memory and
disk space; however, you're responsible for designing the
keys and records to work well as a hash.
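
(Here's a rough sketch of what I mean; all the file, key, and variable
names are invented:)

use strict;
use warnings;
use DB_File;
use Fcntl;

# Main data in one DBM, completion status in a second, small DBM,
# instead of one giant 'completed' value listing every key.
tie my %records, 'DB_File', 'records.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "records.db: $!";
tie my %completed, 'DB_File', 'completed.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "completed.db: $!";

my $username = 'jdoe';

# Marking a record complete touches one tiny key, not a huge string.
$completed{$username} = 1;

# A report over completed records walks the small DBM's keys ...
while ( my ($user, $flag) = each %completed ) {
    my $data = $records{$user};   # ... pulling each full record only as needed
    # ... search $data for whatever the report wants ...
}

untie %records;
untie %completed;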
 
botfood

botfood said:
I would like to know more about how Perl uses memory when managing a
hash that is tie()ed to a file on disk, using DB_File and variable-length
records.... ....snip
In this particular case, each record is not that big, except for one
specific type of 'book-keeping' record that is used to keep track of
which records are considered 'complete' by this particular application.
For a couple of reports, I need to whip through all complete records
searching for various things.... And in another spot I know that the
code looks for matches within this big record, which contains around 20k
'words' of about 20 digits each.
------------------------------

Thanks for the comments so far, people.... It's sounding like my
suspicion was right: the memory problems I am having are more likely to
come from using a 'record' inside the DB to manage a large list of
values, which I need to sift, sort, and edit, than from the memory
allocated to manage the access and paging of the DB itself.

Allow me to clarify: the nature of the failure SEEMED to be more
like a limit imposed by the Apache web server running the process
than a hard limit on the memory available on the machine.

While I think I am going to try some redesign to extract the hash of
'complete' records into a separate DB, I am also looking at a short-term
fix: increasing the Apache::SizeLimit parameter to accommodate the
memory use required by the current design.
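
(Something like this is what I have in mind for the short-term fix; it
assumes the old mod_perl 1 style Apache::SizeLimit interface, and the
number is only a placeholder, so check the module's docs and what the
host actually allows:)

# in startup.pl or a <Perl> section of httpd.conf
use Apache::SizeLimit;
$Apache::SizeLimit::MAX_PROCESS_SIZE = 32_000;   # size limit in KB (~32MB)

# Apache::SizeLimit then needs to be registered as a Perl handler in
# httpd.conf; the exact directive is in the module's documentation.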

So.... the question changes to:

How can I estimate the memory required by Perl for an m// or s//
operation on a string of about 20k 'words', each consisting of 20 digits
and separated by a single space?

thanks,
d
 
J. Gleixner

botfood said:
So.... the question changes to:

How can I estimate the memory required by Perl for an m// or s//
operation on a string of about 20k 'words', each consisting of 20 digits
and separated by a single space?

Why estimate it? Simply run it from the command line, or via some other
method, maybe adding a fairly long sleep after the point you want to
measure, and watch the memory usage using top, ps, etc.
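
(For example, something along these lines; the pattern and word layout
here are only illustrative:)

$ perl -e 'my $big = join " ", map { join "", map { int rand 10 } 1..20 } 1..20_000;
$big =~ s/1[23]45/foobar/g;
print "pid $$ paused; check it now with: ps -p $$ -o rss (or top)\n";
sleep 300;'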
 
xhoster

botfood said:
So.... the question changes to:

How can I estimate the memory required by Perl for an m// or s//
operation on a string of about 20k 'words', each consisting of 20 digits
and separated by a single space?

I just call system() to run "ps" (on Linux).

It seems to be trivial, as long as the regex doesn't degenerate badly.


$ perl -le 'my $x = join " ", map rand, 1..20_000;
$_ =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
RSS #(this is the size in kilobytes)
4340


$ perl -le 'my $x = join " ", map rand, 1..20_000;
$x =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
RSS
4476

So it takes about 136K more to do the substitution than it does just to
start perl and build the string (plus do a dummy substitution on an empty
variable).

Xho
 
botfood

xhoster said:
I just call system() to run "ps" (on Linux).

It seems to be trivial, as long as the regex doesn't degenerate badly.


$ perl -le 'my $x = join " ", map rand, 1..20_000;
$_ =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
RSS #(this is the size in kilobytes)
4340


$ perl -le 'my $x = join " ", map rand, 1..20_000;
$x =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
RSS
4476

So it takes about 136K more to do the substitution than it does just to
start perl and build the string (plus do a dummy substitution on an empty
variable).
-----------------------------------------

I'm not sure exactly what you did in these little tests to build the
test string, but I think the estimate is moving in the right direction.
The exact size value might be a bit low, since each of my 20k 'words' is
20 characters long rather than a random number.
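
(Something like this variant of the test, with each 'word' being exactly
20 random digits, is probably closer to my case; the pattern is still
just illustrative:)

$ perl -le 'my $x = join " ", map { join "", map { int rand 10 } 1..20 } 1..20_000;
$x =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'

If someone with a Linux box handy wants to try it, comparing its RSS
against the empty-$_ baseline from the earlier post should show the extra
cost of the longer words.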

Unfortunately I do not have access to the Linux machine myself, since it
is a remote web server.....

d
 
botfood

Jim said:
That makes it tough. However, you can do yourself a favor by setting up
a local Perl installation and running your tests on it. My guess is
that the regular expression engine doesn't vary very much from platform
to platform.
--------------------------------

However, Berkeley DB and memory management are probably different between
Win32 and Linux, so that wouldn't give me any more than a rough ballpark.
My test server at home is not Apache (I use Xitami), so I can't really
emulate the SizeLimit stuff that was a problem on the web server......
Kinda shooting in the dark.

My best guess at this point is that s/// on a string of 20k 'words' of
20 characters each *probably* eats up more memory than the host server
wants to allocate for any single process.

d
 
xhoster

botfood said:
However, Berkeley DB and memory management are probably different between
Win32 and Linux, so that wouldn't give me any more than a rough ballpark.

Often a rough ballpark is enough.

My test server at home is not Apache (I use Xitami), so I can't really
emulate the SizeLimit stuff that was a problem on the web server......
Kinda shooting in the dark.

You probably don't need to emulate SizeLimit. You just need to know what
it is.

My best guess at this point is that s/// on a string of 20k 'words' of
20 characters each *probably* eats up more memory than the host server
wants to allocate for any single process.

I doubt it. Or at least, if that is the case, then you are living so close
to the edge that any random thing is going to push you over it anyway.

It is easy enough to write a 5-line CGI that constructs a string of 20k
words and tries a realistic s/// on it; drop it on your provider's server
and see whether it runs afoul of SizeLimit or not. Then try it again with
40k, 60k, etc., just to see how much breathing room you have.
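
(A rough sketch of such a throwaway CGI; the script name, word count, and
pattern are all just placeholders:)

#!/usr/bin/perl
# sizetest.cgi: build N 20-digit words, run a realistic s///, and report
# this process's RSS, so you can see roughly where SizeLimit would kick in.
use strict;
use warnings;
use CGI;

my $q = CGI->new;
my $n = $q->param('n') || 20_000;      # e.g. sizetest.cgi?n=40000

my $big = join ' ', map { join '', map { int rand 10 } 1 .. 20 } 1 .. $n;
$big =~ s/1[23]45/foobar/g;

print $q->header('text/plain');
print "words: $n\n";
print "rss (KB): ", `ps -p $$ -o rss=`;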

Xho
 
