Re: performance surprise -- why?
From: Anno Siegel (anno4000_at_lublin.zrz.tu-berlin.de)
Date: 08/25/04
- Next message: Anno Siegel: "Re: Information exchange between unrelated processes"
- Previous message: Graham Wood: "Re: Perl and DOS I/O"
- In reply to: Joe Davison: "performance surprise -- why?"
- Next in thread: Joe Davison: "Re: performance surprise -- why?"
- Reply: Joe Davison: "Re: performance surprise -- why?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 25 Aug 2004 10:40:56 GMT
Joe Davison <haltingNOSPAM@comcast.net> wrote in comp.lang.perl.misc:
> I'm searching the genome with a perl script I wrote and encountered a
> surprise when I tried to improve the performance -- it got worse, much
> worse, and I'm wondering if there's a better way to do my second effort.
>
> Here's the basic problem:
>
> Given a short sequence, say AGTACT, and a chromosome, say
> CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).
>
> I want to find all the places in the chromosome where the sequence
> occurs.
>
> Method 1: $genome =~ s/($sequence)/\n$1/g;
> I then wrote the chopped string to a file and counted the lengths of the
> lines using a simple awk program (don't worry about why).
>
> That runs in about 4 seconds on my G3 iBook. But I figured I didn't
> really need a second copy of a 30MB file that differed only in the
> placement and number of newlines, so why not just save the positions
> where it starts? I looked at somebody else's code and tried:
>
> Method 2: $_=$genome;
> while( m/$sequence/g) { push @indices,pos();}
> and then write out the indices.
>
> I only waited half an hour on my iBook before I killed that one.
>
> I tried again with a couple of smaller files on my G4 desktop.
>
>
> Method 5MB time 11MB time
> 1 2.6 sec 5.1 sec
> 2 48.1 sec 3:20.2 = 200.2 sec
That is unexpected. The substitution method must move parts of the
string for every match, so I'd expect it to be slower than global
matching.
I benchmarked both, and also a method based on the index() function.
The results show indexing and global matching in the same ballpark,
both almost twice as fast as substitution (code appended below):
substitute 13.6/s -- -41% -45%
indexing 23.1/s 69% -- -7%
globmatch 24.8/s 82% 7% --
This was on a slow matching with much smaller (200 K) samples, but
the dependence on size should be largely linear. Where your
quadratic (well, more-than-linear) behavior comes from is anybody's
guess.
Anno
=============================================================
#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;
use Benchmark ':all';
use constant SIZE => 400_000;
my $sequence = 'AGTACT';
my $str;
$str .= ( qw( A C G T))[ rand 4] for 1 .. SIZE;
goto bench;
print "globmatch: ", globmatch(), "\n";
print "substitute: ", substitute(), "\n";
print "indexing: ", indexing(), "\n";
exit;
bench:
cmpthese( -1, {
globmatch => 'globmatch',
substitute => 'substitute',
indexing => 'indexing',
});
######################################################################
sub globmatch {
my @indices;
$_ = $str;
push @indices, pos while /$sequence/g;
scalar @indices;
}
sub substitute {
$_ = $str;
s/$sequence/\n$sequence/g;
tr/\n//;
}
sub indexing {
$_ = $str;
my @indices;
my $pos = 0;
while ( 1 ) {
last if ( $pos = index( $_, $sequence, $pos)) < 0;
push @indices, $pos;
$pos += length $sequence;
}
scalar @indices;
}
- Next message: Anno Siegel: "Re: Information exchange between unrelated processes"
- Previous message: Graham Wood: "Re: Perl and DOS I/O"
- In reply to: Joe Davison: "performance surprise -- why?"
- Next in thread: Joe Davison: "Re: performance surprise -- why?"
- Reply: Joe Davison: "Re: performance surprise -- why?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|