out of memory


friend.05

Hi,

I want to parse large log files (in GBs),

and I am reading 2-3 such files into a hash array.

But since it becomes a very big hash array, it is running out of memory.

What other approaches can I take?


Example code:

open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)
{
(undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
undef) = split('\|');
push @{$time_table{"$cli_ip|$id"}}, $time;
}
close $INFO;


In the above code $file is very big (in GBs), so I am running out
of memory!
 

Jürgen Exner

I want to parse large log files (in GBs),

and I am reading 2-3 such files into a hash array.

But since it becomes a very big hash array, it is running out of memory.

What other approaches can I take?

"Doctor, it hurts when I do this."
"Well, then don't do it."

Simple: don't read them into RAM but process them line by line.
Example code:

open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)

Oh, you are processing them line by line,
{
(undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
undef) = split('\|');
push @{$time_table{"$cli_ip|$id"}}, $time;
}
close $INFO;

If for whatever reason your requirement (sic!!!) is to create an array
with all this data, then you need better hardware and probably a 64bit
OS and Perl.

Of course a much better approach would probably be to trade time for
space and find a different algorithm to solve your original problem
(which you didn't tell us about) by using less RAM in the first place. I
personally don't see any need to store more than one data set in RAM for
"parsing log files", but of course I don't know what kind of log files
you are talking about and what information you want to compute from
those log files.

Another common solution is to use a database to handle large sets of
data.
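
For example, a rough sketch with DBD::SQLite (the table layout, file
names, and field positions below are only placeholders taken from your
example code; adjust them to your real format):

use strict;
use warnings;
use DBI;

my $file = 'big.log';    # placeholder input file
my $dbh  = DBI->connect('dbi:SQLite:dbname=log.db', '', '',
                        { RaiseError => 1, AutoCommit => 0 });

# one row per log record instead of one giant in-memory hash
$dbh->do('CREATE TABLE IF NOT EXISTS hits (cli_ip TEXT, id TEXT, time TEXT)');
my $ins = $dbh->prepare('INSERT INTO hits (cli_ip, id, time) VALUES (?, ?, ?)');

open my $INFO, '<', $file or die "Cannot open $file: $!\n";
while (<$INFO>) {
    my ($time, $cli_ip, $id) = (split /\|/)[3, 4, 7];
    $ins->execute($cli_ip, $id, $time);
}
close $INFO;
$dbh->commit;

# let the database do the grouping for you
my $rows = $dbh->selectall_arrayref(
    'SELECT cli_ip, id, COUNT(*) AS n FROM hits GROUP BY cli_ip, id');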

jue
 

Juha Laiho

I want to parse large log files (in GBs),

and I am reading 2-3 such files into a hash array.

But since it becomes a very big hash array, it is running out of memory.

Do you really need to have the whole file available in order to
extract the data you're interested in?
Example code:

open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)
{
(undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
undef) = split('\|');
push @{$time_table{"$cli_ip|$id"}}, $time;
}
close $INFO;

In the above code $file is very big (in GBs), so I am running out
of memory!

So, you're storing times based on client ip and id, if I read correctly.

How about not keeping that data in memory, but writing it out as you
gather it?
- to a text file, to be processed further in a next stage of the script
- to a database format file (via DB_File module, or one of its sister
modules), so that you can do fast indexed searches on the data (see
the sketch after this list)
- to a "real" database in a proper relational structure, to allow
you to do any kind of relational reporting rather easily
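
A rough, untested sketch of the DB_File route (DB_File values are flat
strings, so the list of times would be kept as one '|'-joined string
rather than an array reference; the file name is arbitrary):

use strict;
use warnings;
use Fcntl;
use DB_File;

# the hash lives on disk in times.db instead of in RAM
my %time_table;
tie %time_table, 'DB_File', 'times.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie times.db: $!\n";

# inside the existing read loop, append instead of pushing to an array:
#     $time_table{"$cli_ip|$id"} .= "$time|";

untie %time_table;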

Also, where $time above apparently is a string containing some kind of
a timestamp, you could convert that timestamp into something else
(number of seconds from epoch comes to mind) that takes a lot less
memory than a string representation such as "2008-10-31 18:33:24".
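
For instance (sketch, assuming the timestamps really look like the one
above; Time::Piece ships with recent Perls, otherwise Time::Local does
the same job):

use Time::Piece;

my $stamp = '2008-10-31 18:33:24';
my $epoch = Time::Piece->strptime($stamp, '%Y-%m-%d %H:%M:%S')->epoch;
# $epoch is now a plain integer (seconds since the epoch), which is
# much cheaper to store than the original string.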
 

friend.05

Do you really need to have the whole file available in order to
extract the data you're interested in?




So, you're storing times based on client ip and id, if I read correctly.

How about not keeping that data in memory, but writing it out as you
gather it?
- to a text file, to be processed further in a next stage of the script
- to a database format file (via DB_File module, or one of its sister
  modules), so that you can do fast indexed searches on the data
- to a "real" database in a proper relational structure, to allow
  you to do any kind of relational reporting rather easily

Also, where $time above apparently is a string containing some kind of
a timestamp, you could convert that timestamp into something else
(number of seconds from epoch comes to mind) that takes a lot less
memory than a string representation such as "2008-10-31 18:33:24".
--
Wolf  a.k.a.  Juha Laiho     Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
         PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)

Thanks.

If I output to a text file and read it again later, will I be able to
search based on a key? (I mean, when I read it again, will I be able to
use it as a hash or not?)
 

xhoster

Hi,

I want to parse large log files (in GBs),

and I am reading 2-3 such files into a hash array.

But since it becomes a very big hash array, it is running out of memory.

What other approaches can I take?

The other approaches you can take depend on what you are trying to
do.
Example code:

open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)
{
(undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
undef) = split('\|');
push @{$time_table{"$cli_ip|$id"}}, $time;
}
close $INFO;

You could get some improvement by having just a hash rather than a hash of
arrays. Replace the push with, for example:

$time_table{"$cli_ip|$id"} .= "$time|";

Then you would have to split the hash values into a list/array one at a
time as they are needed.
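
I.e., roughly:

# while reading: append to a plain string instead of pushing to an array
$time_table{"$cli_ip|$id"} .= "$time|";

# later, only when a particular entry is needed:
my @times = split /\|/, $time_table{"$cli_ip|$id"};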



Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Jürgen Exner

If I output to a text file and read it again later, will I be able to
search based on a key? (I mean, when I read it again, will I be able to
use it as a hash or not?)

That depends upon what you do with the data when reading it in again. Of
course you can construct a hash, but then you wouldn't have gained
anything. Why would this hash be any smaller than the one you were
trying to construct the first time?

Your current approach (put everything into a hash) and your current
hardware are incompatible.

Either get larger hardware (expensive) or rethink your basic approach,
e.g. use a database system or compute your desired results on the fly
while parsing through the file or write intermediate results to a file
in a format that later can be processed line by line or by any other of
the gazillion ways of preserving RAM. Don't you learn those techniques
in basic computer science classes any more?

jue
 

friend.05

That depends upon what you do with the data when reading it in again. Of
course you can construct a hash, but then you wouldn't have gained
anything. Why would this hash be any smaller than the one you were
trying to construct the first time?

Your current approach (put everything into a hash) and your current
hardware are incompatible.

Either get larger hardware (expensive) or rethink your basic approach,
e.g. use a database system or compute your desired results on the fly
while parsing through the file or write intermediate results to a file
in a format that later can be processed line by line or by any other of
the gazillion ways of preserving RAM. Don't you learn those techniques
in basic computer science classes any more?

jue

Outputting to a file and using it again will take a lot of time. It
will be very slow.

Will it help with speed if I use the DB_File module?
 

friend.05

Outputting to a file and using it again will take a lot of time. It
will be very slow.

Will it help with speed if I use the DB_File module?

Here is what I am trying to do.

I have two large files. I will read one file and see if that is also
present in the second file. I also need to count how many times it
appears in both files. And accordingly I do other processing.

So if I process both files line by line, it will be like this (e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it
will loop 10 times, so 100 loops in total). I am dealing with millions
of lines, so this approach will be very slow.


This is my current code. It runs fine with small files.



open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
while (<$INFO>)
{
(undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
undef) = split('\|');
push @{$time_table{"$cli_ip|$dns_id"}}, $time;
}


open ($INFO_PRI, '<', $pri_file) or die "Cannot open $pri_file :$!
\n";
while (<$INFO_PRI>)
{
(undef, undef, undef, $pri_time, $pri_cli_ip, undef, undef,
$pri_id, undef, $query, undef) = split('\|');
$pri_ip_id_table{"$pri_cli_ip|$pri_id"}++;
push @{$pri_time_table{"$pri_cli_ip|$pri_id"}}, $pri_time;
}

@pri_ip_id_table_ = keys(%pri_ip_id_table);

for($i = 0; $i < @pri_ip_id_table_; $i++) #file 2
{
if($time_table{"$pri_ip_dns_table_[$i]"}) #chk if it
is there in file 1
{
#do some processing.
}

}



So for the above example, which approach will be best?


Thanks for your help.
 

Charlton Wilbur

JE> Don't you learn those techniques in basic computer science
JE> classes any more?

The assumption that someone who is getting paid to program has had -- or
even has had any interest in -- computer science classes gets less
tenable with each passing day.

Charlton
 

J. Gleixner

Here is what I am trying to do.

I have two large files. I will read one file and see if that is also
present in the second file. I also need to count how many times it
appears in both files. And accordingly I do other processing.

So if I process both files line by line, it will be like this (e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it
will loop 10 times, so 100 loops in total). I am dealing with millions
of lines, so this approach will be very slow.

Maybe you shouldn't do your own math. It'd be 10 reads, for each file,
so 20.
This is my current code. It runs fine with small files.
use strict;
use warnings;
open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
open( my $INFO, ...
while (<$INFO>)
{
(undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
undef) = split('\|');

my( $time, $cli_ip, $ser_ip, $id ) = (split( /\|/ ))[3,4,5,7];
push @{$time_table{"$cli_ip|$dns_id"}}, $time;
}

close( $INFO );


open ($INFO_PRI, '<', $pri_file) or die "Cannot open $pri_file :$!
\n";

open( my $INFO_PRI, ...
while (<$INFO_PRI>)
{
(undef, undef, undef, $pri_time, $pri_cli_ip, undef, undef,
$pri_id, undef, $query, undef) = split('\|');

my( $pri_time, $pri_cli_ip, $pri_id, $query ) = (split( /\|/ ))[3,4,7,9];
$pri_ip_id_table{"$pri_cli_ip|$pri_id"}++;
push @{$pri_time_table{"$pri_cli_ip|$pri_id"}}, $pri_time;
}

Read one file into memory/hash, if possible. As you're processing
the second one, store/push some data to process later, or process
it at that time, if it matches your criteria. There's no need to
store both in memory.
@pri_ip_id_table_ = keys(%pri_ip_id_table);

for($i = 0; $i < @pri_ip_id_table_; $i++) #file 2

Ugh.. the keys for %pri_ip_id_table are 'something|somethingelse';
how that works with that for loop is probably not what one
would expect.
{
if($time_table{"$pri_ip_dns_table_[$i]"}) #chk if it
is there in file 1

Really? Where is pri_ip_dns_table_ defined?
 

smallpond

Here is what I am trying to do.

I have two large files. I will read one file and see if that is also
present in the second file. I also need to count how many times it
appears in both files. And accordingly I do other processing.

So if I process both files line by line, it will be like this (e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it
will loop 10 times, so 100 loops in total). I am dealing with millions
of lines, so this approach will be very slow.


This problem was solved 50 years ago. You sort the two files and then
take one pass through both comparing records. Why are you reinventing
the wheel?
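
Roughly like this, once both files have been sorted on the same key
(file names and field positions below are only placeholders; duplicate
keys within one file would need a small inner loop to collect the
whole run):

use strict;
use warnings;

open my $A, '<', 'file1.sorted' or die "file1.sorted: $!\n";
open my $B, '<', 'file2.sorted' or die "file2.sorted: $!\n";

sub key_of { my ($cli_ip, $id) = (split /\|/, $_[0])[4, 7]; "$cli_ip|$id" }

my $line_a = <$A>;
my $line_b = <$B>;
while (defined $line_a and defined $line_b) {
    my $cmp = key_of($line_a) cmp key_of($line_b);
    if    ($cmp < 0) { $line_a = <$A> }      # key only in file 1
    elsif ($cmp > 0) { $line_b = <$B> }      # key only in file 2
    else {                                   # key present in both
        # count / do the other processing here
        $line_a = <$A>;
        $line_b = <$B>;
    }
}
close $A;
close $B;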

--S
 

xhoster

Outputting to a file and using it again will take a lot of time. It
will be very slow.

That depends on how you do it.

Will it help with speed if I use the DB_File module?

That depends on what you are comparing it to. Compared to an in memory
hash, DB_File makes things slower, not faster. Except in the sense that
something which runs out of memory and dies before completing the job is
infinitely slow, so preventing that is, in a sense, faster. One exception
I know of would be if one of the files is constant, so it only needs to be
turned into a DB_File once, and if only a small fraction of the keys are
ever probed by the process driven by other file. Then it could be faster.

Also, DB_File doesn't take nested structures, so you would have to flatten
your HoA. Once you flatten it, it might fit in memory anyway.
Here is what I am trying to do.

I have two large files. I will read one file and see if that is also
present in the second file. I also need to count how many times it
appears in both files. And accordingly I do other processing.

If you *only* need to count, then you don't need the HoA in the first
place.
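
E.g. just a counting hash (sketch; %count is only an illustrative name):

# one integer per key instead of an array of timestamps
$count{"$cli_ip|$id"}++;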
So if I process both files line by line, it will be like this (e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it
will loop 10 times, so 100 loops in total). I am dealing with millions
of lines, so this approach will be very slow.

I don't think anyone was recommending that you do a Cartesian join on the
files. You could break the data up into files by hashing on IP address and
making a separate file for each hash value. For each hash bucket you would
have two files, one from each starting file, and they could be processed
together with your existing script. Or you could reformat the two files
and then sort them jointly, which would group all the like keys together
for you for later processing.
@pri_ip_id_table_ = keys(%pri_ip_id_table);

For very large hashes when you have memory issues, you should iterate
over it with "each" rather than building a list of keys.
for($i = 0; $i < @pri_ip_id_table_; $i++) #file 2
{
if($time_table{"$pri_ip_dns_table_[$i]"})
{
#do some processing.

Could you "do some processing" incrementally, as each line from file 2 is
encountered, rather than having to load all keys of file2 into memory
at once?

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Jürgen Exner

I have two large files. I will read one file and see if that is also
present in the second file.

The way you wrote this means you are checking if file A is a subset of
file B. However I have a strong feeling you are talking about the
records in each file, not the files themselves.

I also need to count how many times it appears in both files. And
accordingly I do other processing.

So if I process both files line by line, it will be like this (e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it
will loop 10 times, so 100 loops in total). I am dealing with millions
of lines, so this approach will be very slow.

So you need to pre-process your data.

One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.
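
A bare-bones sketch (file names are placeholders, field positions
copied from your code):

use strict;
use warnings;

my ($small_file, $big_file) = ('pri.log', 'big.log');   # placeholders

# pass 1: smaller file -> hash of counts
my %seen;
open my $SMALL, '<', $small_file or die "Cannot open $small_file: $!\n";
while (<$SMALL>) {
    my ($cli_ip, $id) = (split /\|/)[4, 7];
    $seen{"$cli_ip|$id"}++;
}
close $SMALL;

# pass 2: stream the larger file, one line in RAM at a time
open my $BIG, '<', $big_file or die "Cannot open $big_file: $!\n";
while (<$BIG>) {
    my ($time, $cli_ip, $id) = (split /\|/)[3, 4, 7];
    if ( exists $seen{"$cli_ip|$id"} ) {
        # the record occurs in both files;
        # $seen{"$cli_ip|$id"} is its count in the smaller file
    }
}
close $BIG;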

Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.

Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records that are in A and in B. Database
systems are optimized to handle large sets of data efficiently.
This is my current code. It runs fine with small files.

Well, that is great. But it seems you still don't believe me when I'm
saying that your problem cannot be fixed by a little tweak in your
existing code. Any gain you may get by storing a smaller data item or
similar will very soon be eaten up by larger data sets.
THIS IS NOT GOING TO WORK. YOU HAVE TO RETHINK YOUR APPROACH AND CHOOSE
A DIFFERENT STRATEGY/ALGORITHM!

jue
 

Jürgen Exner

Jürgen Exner said:
The way you wrote this means you are checking if file A is a subset of
file B. However I have a strong feeling you are talking about the
records in each file, not the files themselves.



So you need to pre-process your data.

One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.

Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.

Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records that are in A and in B. Database
systems are optimized to handle large sets of data efficiently.

Forgot one other common approach: bucketize your data.
Create buckets of IPs or IDs or whatever criteria works for your case.
Then sort the data into 20 or 50 or 100 individual buckets (aka files)
for each of your input files. And then compare bucket x from file A with
bucket x from file B.
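
A rough sketch of the splitting step for one input file (the bucket
count, file names, and the checksum-style hash are arbitrary choices
here; field positions are taken from the code posted earlier):

use strict;
use warnings;

my $buckets = 50;                        # whatever fits your RAM
my @out;
for my $n (0 .. $buckets - 1) {
    open $out[$n], '>', "fileA.bucket.$n" or die "bucket $n: $!\n";
}

open my $IN, '<', 'fileA.log' or die "fileA.log: $!\n";   # placeholder
while (my $line = <$IN>) {
    my ($cli_ip, $id) = (split /\|/, $line)[4, 7];
    my $n = unpack('%32C*', "$cli_ip|$id") % $buckets;    # crude but stable
    print { $out[$n] } $line;
}
close $IN;
close $_ for @out;

# Do the same for file B with the same hash, then compare
# fileA.bucket.$n against fileB.bucket.$n one pair at a time.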

jue
 

sln

JE> Don't you learn those techniques in basic computer science
JE> classes any more?

The assumption that someone who is getting paid to program has had -- or
even has had any interest in -- computer science classes gets less
tenable with each passing day.

Charlton

Well said.. that should be its own thread.

sln
 

sln

The way you wrote this means you are checking if file A is a subset of
file B. However I have a strong feeling you are talking about the
records in each file, not the files themselves.



So you need to pre-process your data.

One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.

Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.

Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records that are in A and in B. Database
systems are optimized to handle large sets of data efficiently.


Well, that is great. But it seems you still don't believe me when I'm
saying that your problem cannot be fixed by a little tweak in your
existing code. Any gain you may get by storing a smaller data item or
similar will very soon be eaten up by larger data sets.
THIS IS NOT GOING TO WORK. YOU HAVE TO RETHINK YOUR APPROACH AND CHOOSE
A DIFFERENT STRATEGY/ALGORITHM!

jue

He cannot get past the idea of 'millions' of lines in a file, even
though he states the items of interest. He won't think of items, just
the millions of lines.

In today's large data mining, there are billions of lines to consider.
Of course the least common denominator reduces that down to billions
of items.

Like a hash, it can be separated into alphabetical sequence files,
matched with available memory, usually 16 gigabytes, then reduced
exponentially until the desired form is achieved.

But his outlook is panicky and without resolve. The world is coming
to an end for him and he would like to share it with the world.

sln
 

David Combs

Jürgen Exner said:
Another approach: sort both input files. There are many sorting
algorithms around,

Question: why not simply use the standard unix (linux) "sort" program?

Does that not do all the right things? qsort, uses file-merge-etc
if it needs to, etc?

(And hopefully has that within-the-last-10-years *massive*
speedup on (a) already-sorted files and (b) sorting ASCII files
discovered by that algorithm-book-writing prof at Princeton.)

including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.


Thanks,

David
 

David Combs

Well said.. that should be its own thread.

sln

Like hiring surgeons who've never had biology.

"Look, I can cut, can't I? What ELSE could I possibly need to know?"


David
 
