perl multithreading performance



Hello, oh almighty perl gurus!

I'm trying to implement multithreaded processing for the humongous
amount of logs that I'm currently processing in 1 process on a 4-CPU
server.

For each line, the script checks whether the line contains a GET
request and, if it does, goes through a list of pre-compiled regular
expressions, trying to find a matching one. Once a match is found, it
applies a second, somewhat more complex regexp associated with that
match to extract data from the line. I have split this into two
separate matches because only about 30% of all lines will match, and
I don't want to run the complex extraction regexp on all the lines I
know won't match. The goal is to count how many lines matched each
specific regexp. The end result is a hash whose keys are the data
extracted from the line by the second regexp and whose values are the
number of matches.
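
A minimal sketch of the two-stage matching described above may make it
easier to picture. Everything specific in it is an assumption made for
illustration and not taken from the real script: the @rules table, its
two patterns, the count_line helper, the sample log lines, and the
choice that the first matching rule wins.

```perl
use strict;
use warnings;

# Hypothetical rule table: a cheap pre-compiled test paired with a more
# expensive extraction regexp. These patterns are invented for this sketch.
my @rules = (
    { test => qr{/images/}, extract => qr{GET\s+/images/(\w+)\.} },
    { test => qr{/api/},    extract => qr{GET\s+/api/(\w+)} },
);

# Hypothetical helper: update the counters hash for a single log line.
sub count_line {
    my ( $line, $counters ) = @_;

    # Cheap pre-filter: skip lines that cannot contain a GET request.
    return unless index( $line, 'GET' ) >= 0;

    for my $rule (@rules) {
        next unless $line =~ $rule->{test};

        # Only now run the more complex regexp to pull out the hash key.
        if ( my ($key) = $line =~ $rule->{extract} ) {
            $counters->{$key}++;
        }
        last;    # assumption: the first matching rule wins
    }
    return;
}

my %counters;
count_line( $_, \%counters ) for (
    '1.2.3.4 "GET /images/logo.png HTTP/1.0" 200',
    '1.2.3.4 "GET /images/logo.png HTTP/1.0" 304',
    '1.2.3.4 "GET /api/users HTTP/1.0" 200',
    '1.2.3.4 "POST /api/users HTTP/1.0" 200',
);

# Print one "key => count" line per extracted key.
print "$_ => $counters{$_}\n" for sort keys %counters;
```

With these made-up inputs, the two image requests end up under one key
and the POST line is skipped by the pre-filter, so the expensive regexp
runs only on lines that already passed a cheap test.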

Anyway, currently all this is done in a single process, which parses
approx. 30000 lines per second. The CPU usage for this process is
100%, so the bottleneck is in the parsing part.

I have changed the script to use threads + threads::shared +
Thread::Queue. I read data from logs like this:

Code
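# Read the input in chunks of $buffer_size lines and queue each chunk.
until ( $no_more_data ) {
    my @buffer;
    foreach ( 1 .. $buffer_size ) {
        # defined() so that a last line of "0" is not mistaken for EOF
        if ( defined( my $line = <> ) ) {
            push( @buffer, $line );
        } else {
            $no_more_data = 1;
            # flush the last, possibly partial, chunk
            $q_in->enqueue( \@buffer );
            # one undef per worker thread signals "no more work"
            foreach ( 1 .. $cpu_count ) {
                $q_in->enqueue( undef );
            }
            last;
        }
    }
    $q_in->enqueue( \@buffer ) unless $no_more_data;
}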

Then I create $cpu_count threads, each of which does something like this:

Code
sub parser {
    my $counters = {};    # per-thread counts
    # an undef from the queue means there is no more work
    while ( defined( my $buffer = $q_in->dequeue() ) ) {
        foreach my $line ( @{ $buffer } ) {
            # do its thing
        }
    }
    return $counters;
}

Everything works fine, HOWEVER! It's all so damn slow! It's only 10%
faster than the single-process script, consumes about 2-3 times more
memory and about as many times more CPU.

I've also tried abandoning Thread::Queue and just using
threads::shared with a lock/cond_wait/cond_signal combination, without
much success.

I've tried playing with $cpu_count and $buffer_size, and found that
raising $buffer_size above 1000 doesn't make much difference, and
$cpu_count > 2 actually makes things a lot worse.

Any ideas why in the world it's so slow? I did some research and
couldn't find a lot of info, other than that the way I do it is pretty
much the way it should be done, unless I'm missing something...

Hope somebody can enlighten me...

THANKS!