Re: optimize log parsing



> I wouldn't bother with this 'bucket' stuff at all. Just do it on the fly.
> By addressing the files in the appropriate order (from most work to least
> work) you ensure nearly optimal processing. In fact, because there is no
> guarantee that the actual time for a file to be processed is exactly
> proportional to the file size,

you're right here, although i had the idea that i could weight certain
parse methods by multiplying each log size by a coefficient. the
coefficient would be derived by dividing the average speed of a parse
method by the average speed of the slowest parse method.

balancing on the fly is almost surely going
> to be better than some precomputed balancing based on the assumption that
> size = time.
>
> $pm = new Parallel::ForkManager(20);
>
> foreach $file (sort {$files{$b}<=>$files{$a}} keys %files) {
> my $pid = $pm->start and next;
> ##Process the $file
> $pm->finish; # Terminates the child process
> }
> $pm->wait_all_children;
>
> ...

i admit i'm not too familiar with Threads/Forks (the only fork i use is
the one called from system() ). also, i've read that Perl threading
isn't too stable. i've looked on the web a little, but have not found
anything that describes how to do all of the following:

1) instantiate N processes (or threads)
2) start each process parsing a log file
3) the first process that is done looks at a shared or global queue and
pulls the next log file from that and processes until the queue is
empty.


....the current architecture of my log processing is:
1) set a number of processes (e.g. 20)
2) in a loop for the number of processes:
my @rc;
for my $i (1..$num_processes) {
my $command = 'parseLog.pl $i';
$rc[$i] = system($command);
}
# there is a conf file that has an entry for each log, along with a
number in the next field--the number represents the process_id (can be
1 thru 20)
3) in a loop of all the logs, push logs into arrays if the process_id
== the $num_process that was passed along, so each process has an array
of files to process/parse. each process parses each file in its array
of files. problem with this is that maybe each process has a similar
number of logs to process (the process_id just increments for each
line, then wraps around once it reaches the max number of processes i
defined), but some could be huge while others are small, so not very
optimal. one process could have 20 files of 200 bytes each, while the
other could have 20 files of 230 MB each.

since using the system() approach is all i know, the only scenarios i
considered were those that dealt with providing each process with a
balanced amount of data.

if i can get the on-the-fly thing working, that would be preferable.
then sorting would not even be helpful, would it?

>
> > basically this problem can be reduced to figuring out a way to fit 400
> > discrete numbers into 20 buckets where the count of the numbers does
> > not matter, but the sum of the numbers should be roughly equal.
> >
> > one way of accomplishing this is to sort descending, and loop through
> > the 400 files, putting the largest one in the first bucket, and going
> > down the line, putting the largest one that will fit into the current
> > bucket until you get to the end. granted, this is not perfect--perhaps
> > if you had waited, 2 smaller ones would have brought you closer to your
> > bucket-size goal. then you go to the next bucket, and do the same
> > thing. this is a O = n^2 solution though.
>
> How so? N ln N to sort the files, only has to be done once. Finding the
> biggest file that fits the current bucket is ln N using a binary search
> into the sorted files. You have 1 N ln N operation, followed by N
> operations that are lnN. That comes out to N ln N overall.
>

you would be roughly doing m x n iterations, which i believe amounts to
the same thing as big O of n^2. (i AM a bit rusty at algorithms). i
didn't even count the sort, since that's only done once, and i'm not
familiar with the under-the-hood mechanics of Perl's hash value sort.


> But anyway, why try to preordain the order in which the buckets become
> empty for full?

that's the only way i know how for now.

Start processing the 20 biggest files. When one of them
> finishes (regardless of which one it is), start the next file.
>

if i can do the fork thing, why start with the biggest?

> Xho
>
> --
> -------------------- http://NewsReader.Com/ --------------------
> Usenet Newsgroup Service $9.95/Month 30GB

.



Relevant Pages

  • Re: Finding the max number in a list
    ... Come to think of it, a bucket sort could do it too, ... unless the "uniform distribution" precondition he mentioned was to allow most ... the elements of each histogram bucket end up consecutive in the array ...
    (comp.lang.java.programmer)
  • Re: Sorting algorithm for FPGA availlable?
    ... go towards 16N memory cycles which may use the fastest rate of memory ... Actually that would be bucket sort. ... In a V4 you can probably even hold 64k entries ...
    (comp.arch.fpga)
  • Re: What is the most fast sorting algorithm?
    ... Bucket sort is Oin the worst case. ... Because of various special characters such as ... So our bucket sort's worst case may well have running time of O ...
    (comp.programming)
  • Re: What is the name of this sort method?
    ... The segment from index positions LEFT to RIGHT ... Sounds very similar to a bucket sort (but not quite what most people ... Bucket sort sorts data with a small-ish set of distinct possible keys ...
    (comp.programming)