i have roughly 400 logs to parse, with the size of each log ranging
from 200 bytes to 263,000,000 bytes.
i have 20 processes that i can allocate to log parsing.
i have a hash that has the log file name as the key, and its size in
bytes as the value. (i can sort the entries by size, or use an array
if that's better suited to the solution.)
$file{'/var/log/log1'} => 200
$file{'/var/log/log2'} => 210
....
$file{'/var/log/log400'} => 262000000
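for reference, getting the file names ordered by size is just a one-liner
(using the %file hash above):

# file names ordered largest-first
my @by_size = sort { $file{$b} <=> $file{$a} } keys %file;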
i would like to feed each process roughly the same amount of data (i.e.
a queue of log files to parse for each process, such that the sum of the
bytes of the log files for each process is roughly equivalent).
so let's just say there are 20 buckets, if that will make things more
comprehensible.
i calculated the sum of the sizes of all the logs, and it was about 7.6
GB. divided among 20 buckets, that's roughly 380 MB per bucket.
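in code that works out to something like (assuming the %file hash above
and 20 buckets):

# total bytes across all logs, and the per-bucket target
use List::Util qw(sum);
my $total  = sum(values %file);   # about 7.6 GB for my logs
my $target = $total / 20;         # roughly 380 MB per bucket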
one caveat is that i can't break up files. these logs are gzipped, and
i can't split them while zipped. i tried unzipping the 262 MB one (that
took about 2 min 40 sec) and then splitting it into files of 600,000
lines each, but the split alone took over 10 min: after running (in
Unix) split -l 600000 file small_file and waiting 10 min, i canceled,
and only 3 files had been written so far! it probably would have taken
20 minutes or more to finish, and nothing was even being parsed yet.
unacceptable.
basically this problem can be reduced to figuring out a way to fit 400
discrete numbers into 20 buckets, where the count of numbers per bucket
does not matter, but the sum of the numbers in each bucket should be
roughly equal.
one way of accomplishing this is to sort descending and loop through
the 400 files: put the largest one in the first bucket, then go down
the line, putting the largest remaining one that will fit into the
current bucket, until you reach the end. granted, this is not perfect:
perhaps if you had waited, 2 smaller ones would have brought you closer
to your bucket-size goal. then you go to the next bucket and do the
same thing. this is an O(n^2) solution though. i was hoping there was a
more elegant, less brute-force method. perhaps the truly optimal
solution would be to take combinations of all 400 logs and find which
grouping of the files yields the most equal 20 buckets, but that would
take way too much computation time.
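here is a rough, untested sketch of that sort-descending,
fill-one-bucket-at-a-time idea (it assumes the %file hash above; files
that never fit under the target in the first pass just get dumped into
whichever bucket is lightest at the end):

use List::Util qw(sum);

my $nbuckets = 20;
my $target   = sum(values %file) / $nbuckets;
my @names    = sort { $file{$b} <=> $file{$a} } keys %file;
my @buckets  = map { +{ files => [], bytes => 0 } } 1 .. $nbuckets;

# fill one bucket at a time with the largest remaining files that still fit
for my $bucket (@buckets) {
    for my $i ( 0 .. $#names ) {
        my $name = $names[$i];
        next unless defined $name;
        if ( $bucket->{bytes} + $file{$name} <= $target ) {
            push @{ $bucket->{files} }, $name;
            $bucket->{bytes} += $file{$name};
            undef $names[$i];    # mark as placed
        }
    }
}

# leftovers (anything that never fit under the target) go to the lightest bucket
for my $name ( grep { defined } @names ) {
    my ($lightest) = sort { $a->{bytes} <=> $b->{bytes} } @buckets;
    push @{ $lightest->{files} }, $name;
    $lightest->{bytes} += $file{$name};
}

a simpler greedy that skips the inner scan is to sort descending and
always hand the next file to the currently lightest bucket; with only
400 files and 20 buckets either version runs in a blink.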
using the method i outlined (not the combinations one, the one before
that), i would have to go through the list of 400 logs for the 1st
bucket, then maybe 395 logs for the 2nd bucket, etc., until the 20th
bucket. this really isn't that much work, and it's unlikely that the
number of logs and processes will grow by orders of magnitude, so is it
even worth the effort to find a better algorithm? any help would be
appreciated.