Re: performance surprise -- why?

From: Anno Siegel (anno4000_at_lublin.zrz.tu-berlin.de)
Date: 08/25/04


Date: 25 Aug 2004 10:40:56 GMT

Joe Davison <haltingNOSPAM@comcast.net> wrote in comp.lang.perl.misc:
> I'm searching the genome with a perl script I wrote and encountered a
> surprise when I tried to improve the performance -- it got worse, much
> worse, and I'm wondering if there's a better way to do my second effort.
>
> Here's the basic problem:
>
> Given a short sequence, say AGTACT, and a chromosome, say
> CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).
>
> I want to find all the places in the chromosome where the sequence
> occurs.
>
> Method 1: $genome =~ s/($sequence)/\n$1/g;
> I then wrote the chopped string to a file and counted the lengths of the
> lines using a simple awk program (don't worry about why).
>
> That runs in about 4 seconds on my G3 iBook. But I figured I didn't
> really need a second copy of a 30MB file that differed only in the
> placement and number of newlines, so why not just save the positions
> where it starts? I looked at somebody else's code and tried:
>
> Method 2: $_=$genome;
> while( m/$sequence/g) { push @indices,pos();}
> and then write out the indices.
>
> I only waited half an hour on my iBook before I killed that one.
>
> I tried again with a couple of smaller files on my G4 desktop.
>
>
> Method 5MB time 11MB time
> 1 2.6 sec 5.1 sec
> 2 48.1 sec 3:20.2 = 200.2 sec

That is unexpected. The substitution method must move parts of the
string for every match, so I'd expect it to be slower than global
matching.

I benchmarked both, and also a method based on the index() function.
The results show indexing and global matching in the same ballpark,
both almost twice as fast as substitution (code appended below):

    substitute 13.6/s -- -41% -45%
    indexing 23.1/s 69% -- -7%
    globmatch 24.8/s 82% 7% --

This was on a slow matching with much smaller (200 K) samples, but
the dependence on size should be largely linear. Where your
quadratic (well, more-than-linear) behavior comes from is anybody's
guess.

Anno

=============================================================

#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;

use Benchmark ':all';

use constant SIZE => 400_000;
my $sequence = 'AGTACT';
my $str;
$str .= ( qw( A C G T))[ rand 4] for 1 .. SIZE;

goto bench;

print "globmatch: ", globmatch(), "\n";
print "substitute: ", substitute(), "\n";
print "indexing: ", indexing(), "\n";
exit;

bench:
cmpthese( -1, {
    globmatch => 'globmatch',
    substitute => 'substitute',
    indexing => 'indexing',
});

######################################################################

sub globmatch {
    my @indices;
    $_ = $str;
    push @indices, pos while /$sequence/g;
    scalar @indices;
}

sub substitute {
    $_ = $str;
    s/$sequence/\n$sequence/g;
    tr/\n//;
}

sub indexing {
    $_ = $str;
    my @indices;
    my $pos = 0;
    while ( 1 ) {
        last if ( $pos = index( $_, $sequence, $pos)) < 0;
        push @indices, $pos;
        $pos += length $sequence;
    }
    scalar @indices;
}



Relevant Pages

  • Re: performance surprise -- why?
    ... On 25 Aug 2004, Anno Siegel wrote: ... >> string to a file and counted the lengths of the lines using a simple ... The substitution method must move parts of the ... > The results show indexing and global matching in the same ballpark, ...
    (comp.lang.perl.misc)
  • Re: FAQ 4.32 How do I strip blank space from the beginning/end of a string?
    ... PV> sub normalize_string { ... by not having a replacement char map it will duplicate the matching ... return $string; ...
    (comp.lang.perl.misc)
  • help with regex
    ... regex is not my cup of tea, i've been using them few times, only matching, no substitution.. ... I need to check a string before saving it into a decimal column in DB and I actually don't know how to start substitution: ...
    (comp.lang.ruby)
  • Project Error
    ... Private Declare Sub Sleep Lib "Kernel32" ... Dim strDataSrc As String ...
    (microsoft.public.vb.bugs)
  • Re: FTP CD command
    ... My code connects to a ftp site and the enumerates all content on that place ... Public Property UriAs String ... End Sub ... Dim listRequest As FtpWebRequest = CType, ...
    (microsoft.public.dotnet.languages.vb)