optimize log parsing



i have roughly 400 logs to parse, with the size of each log ranging
from 200 bytes to 263,000,000 bytes.

i have 20 processes that i can allocate to log parsing.

i have a hash that has the log file name as the key, and its size in
bytes as the value. ( i can sort the entries by size, or use an array
if that's better suited to the solution. )

$file{'/var/log/log1'} => 200
$file{'/var/log/log2'} => 210
....
$file{'/var/log/log400'} => 262000000

i would like to feed each process roughly the same amount of data (i.e.
a certain queue of log files to parse, such that the sum of the bytes
of the log files for each process are roughly equivalent).

so let's just say there are 20 buckets, if that will make things more
comprehensible.

i calculate the sum of the sizes of all the logs, and it was about 7.6
GB. divided among 20 buckets, that's roughly 380 MB per bucket.

one caveat is that i can't break up files. these logs are gzipped, and
i can't split them while zipped. i tried unzipping the 262 MB one (that
took about 2 min 40 sec), and then splitting into files that were
600,000 lines each. that took over 10 min. after running (in Unix)
split -l 600000 file small_file, and waiting 10 min, i canceled, and
only 3 files were written so far! it probably would have taken 20
minutes or more to finish, and nothing was even being parsed yet.
unacceptable.

basically this problem can be reduced to figuring out a way to fit 400
discrete numbers into 20 buckets where the count of the numbers does
not matter, but the sum of the numbers should be roughly equal.

one way of accomplishing this is to sort descending, and loop through
the 400 files, putting the largest one in the first bucket, and going
down the line, putting the largest one that will fit into the current
bucket until you get to the end. granted, this is not perfect--perhaps
if you had waited, 2 smaller ones would have brought you closer to your
bucket-size goal. then you go to the next bucket, and do the same
thing. this is a O = n^2 solution though. i was hoping that there was a
more elegant, less brute-force method. perhaps the most perfect
solution would be to take combinations and permutations of all the 400
logs and finding which combinations of groups would result in the most
equal 20 buckets, but that would take way too much computation time.

using the method i outlined (not the combinations one, the one before
that), i would have to go through the list of 400 logs for the 1st
bucket, then let's say 395 logs for the 2nd bucket, etc. until the 20th
bucket. this really isn't that much, and it's unlikely that the number
of logs and processes will expand by orders of magnitude, so is it even
worth it to expend the effort finding this algorithm? any help would be
appreciated :-)

.



Relevant Pages

  • Re: [fw-wiz] Handling large log files
    ... logcheck solution now spends hours trying to parse the logs, ... duplicate messages and just simply report on the number of times per ... the key is to categorize and summarize the data. ... Internally it does a lookup of the program name to a bucket name and then outputs the message to that bucket. ...
    (Firewall-Wizards)
  • Re: This is a brilliant SCAM, I wish Id thought of it.
    ... My uncle had one of these on town gas when I was a kid. ... Dunno how they'd go on butane/propane though. ... For wood burners I used to soak a couple of logs in a bucket of central heating oil ...
    (uk.rec.waterways)
  • Sharepoint and Logs
    ... So I am working with this sharepoint server and trying to create reports ... based on the www logs it produces. ... "bucket" without having to visit each website? ... to get rid of this behavior that is. ...
    (microsoft.public.sharepoint.portalserver)
  • Re: How to script this
    ... Set End time to Time2 ... Set Sum = Endtime - Start Time ... I have a access database(from ISA Server) ... > It logs the time when the user goes on the internet, ...
    (microsoft.public.access.gettingstarted)
  • Re: Help me please
    ... Set End time to Time2 ... Set Sum = Endtime - Start Time ... I have a access database(from ISA Server) ... > It logs the time when the user goes on the internet, ...
    (microsoft.public.access.queries)