Re: optimize log parsing
- From: "Tassilo v. Parseval" <tassilo.von.parseval@xxxxxxxxxxxxxx>
- Date: Wed, 5 Oct 2005 07:57:56 +0200
Also sprach it_says_BALLS_on_your forehead:
>> I wouldn't bother with this 'bucket' stuff at all. Just do it on the fly.
>> By addressing the files in the appropriate order (from most work to least
>> work) you ensure nearly optimal processing. In fact, because there is no
>> guarantee that the actual time for a file to be processed is exactly
>> proportional to the file size,
>
> you're right here, although i had the idea that i could weight certain
> parse methods by multiplying each log size by a coefficient. the
> coefficient would be derived by dividing the average speed of a parse
> method by the average speed of the slowest parse method.
>
>> balancing on the fly is almost surely going
>> to be better than some precomputed balancing based on the assumption that
>> size = time.
>>
>> $pm = new Parallel::ForkManager(20);
>>
>> foreach $file (sort {$files{$b}<=>$files{$a}} keys %files) {
>> my $pid = $pm->start and next;
>> ##Process the $file
>> $pm->finish; # Terminates the child process
>> }
>> $pm->wait_all_children;
>>
>> ...
>
> i admit i'm not too familiar with Threads/Forks (the only fork i use is
> the one called from system() ). also, i've read that Perl threading
> isn't too stable.
Perl threads arguably still have their flaws, but for a task as simple as
yours they're perfectly usable.
> i've looked on the web a little, but have not found
> anything that describes how to do all of the following:
>
> 1) instantiate N processes (or threads)
> 2) start each process parsing a log file
> 3) the first process that is done looks at a shared or global queue and
> pulls the next log file from that and processes until the queue is
> empty.
Extremely easy with threads. Here's a complete example of a program that
spawns off a number of threads where each thread pulls data from a
global queue until it is empty:
#!/usr/bin/perl -w

use strict;
use threads;
use threads::shared;

use constant NUM_THREADS => 10;

# shared queue visible to every thread
my @queue : shared = 1 .. 30;

# create threads
my @threads;
push @threads, threads->new("run") for 1 .. NUM_THREADS;

# wait for all threads to finish
$_->join for @threads;

# code executed by each thread
sub run {
    while (defined(my $element = pop @queue)) {
        printf "thread %i: working with %i\n", threads->tid, $element;
        # make runtime vary a little for
        # demonstration purpose
        select undef, undef, undef, rand;
    }
}
> ...the current architecture of my log processing is:
> 1) set a number of processes (e.g. 20)
> 2) in a loop for the number of processes:
> my @rc;
> for my $i (1..$num_processes) {
> my $command = 'parseLog.pl $i';
> $rc[$i] = system($command);
> }
> # there is a conf file that has an entry for each log, along with a
> number in the next field--the number represents the process_id (can be
> 1 thru 20)
> 3) in a loop of all the logs, push logs into arrays if the process_id
> == the $num_process that was passed along, so each process has an array
> of files to process/parse. each process parses each file in its array
> of files. problem with this is that maybe each process has a similar
> number of logs to process (the process_id just increments for each
> line, then wraps around once it reaches the max number of processes i
> defined), but some could be huge while others are small, so not very
> optimal. one process could have 20 files of 200 bytes each, while the
> other could have 20 files of 230 MB each.
Don't think in terms of processes. If you're using processes for that
kind of thing you'll need to find a way for them to communicate
(possibly pipes, or maybe shared memory). Threads take this work off
your shoulders, since they can share data in a simple and safe manner.
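
To illustrate what sharing data between threads looks like, here is a
minimal sketch. The sub count_lines, the file names and the line counts are
made up for the demonstration; only the ": shared" attribute and lock() come
from threads::shared itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

# shared tally; every thread sees the same hash
my %lines_seen : shared;

# hypothetical worker: pretends a file has a known number of lines
sub count_lines {
    my ($name, $lines) = @_;
    lock(%lines_seen);              # released when this sub returns
    $lines_seen{$name} = $lines;
    $lines_seen{total} += $lines;   # read-modify-write, hence the lock
}

# four threads, each reporting 10, 20, 30 and 40 lines respectively
my @threads = map { threads->create(\&count_lines, "log$_", $_ * 10) } 1 .. 4;
$_->join for @threads;

print "total lines: $lines_seen{total}\n";
```

The lock matters for the "+=" line: without it, two threads could read the
same old total and one update would be lost.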
> since using the system() approach is all i know, the only scenarios i
> considered were those that dealt with providing each process with a
> balanced amount of data.
Bad idea. Each piece of data may take a different amount of time to
process. The better way is to store the work in one central pool and have
each thread fetch its next item from there. When a thread is done with an
item, it fetches the next one, until the pool is empty.
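
The central-pool idea could also be sketched with the core Thread::Queue
module instead of a plain shared array. This is only an illustration under
assumptions: the file names are invented, the actual parsing is left as a
comment, and %handled_by exists merely to show which thread took which file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

use constant NUM_THREADS => 4;

# hypothetical file names; a real run would take them from the conf file
my @logs = map { "log$_.txt" } 1 .. 12;

# the central pool, filled completely before any worker starts
my $queue = Thread::Queue->new(@logs);

# records which thread handled which file (for demonstration only)
my %handled_by : shared;

sub worker {
    # dequeue_nb returns undef immediately once the queue is empty
    while (defined(my $file = $queue->dequeue_nb)) {
        # the real parsing of $file would go here
        lock(%handled_by);
        $handled_by{$file} = threads->tid;
    }
}

my @threads = map { threads->create(\&worker) } 1 .. NUM_THREADS;
$_->join for @threads;

printf "%s -> thread %d\n", $_, $handled_by{$_} for sort keys %handled_by;
```

The non-blocking dequeue_nb is used on purpose: all work is queued up front,
so an empty queue means the thread can simply exit.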
> if i can get the on-the-fly thing working, that would be preferable.
> then sorting would not even be helpful, would it?
It is not helpful.
>> Start processing the 20 biggest files. When one of them
>> finishes (regardless of which one it is), start the next file.
>>
>
> if i can do the fork thing, why start with the biggest?
Don't even worry about sorting. Use threads and have each thread do the
parsing of the files in any order. It makes no difference since it's
truly parallel and asynchronous.
Tassilo
--
use bigint;
$n=71423350343770280161397026330337371139054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);