Re: performance surprise -- why?

From: Joe Davison (haltingNOSPAM_at_comcast.net)
Date: 08/25/04


Date: Wed, 25 Aug 2004 17:47:02 GMT

On 25 Aug 2004, Anno Siegel wrote:
> Joe Davison <haltingNOSPAM@comcast.net> wrote in comp.lang.perl.misc:
> > I'm searching the genome with a perl script I wrote and encountered
> > a surprise when I tried to improve the performance -- it got worse,
> > much worse, and I'm wondering if there's a better way to do my
> > second effort.
> >
> > Here's the basic problem:
> >
> > Given a short sequence, say AGTACT, and a chromosome, say
> > CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB
> > string).
> >
> > I want to find all the places in the chromosome where the sequence
> > occurs.
> >
> > Method 1: $genome =~ s/($sequence)/\n$1/g; I then wrote the chopped
> > string to a file and counted the lengths of the lines using a simple
> > awk program (don't worry about why).
> >
> > That runs in about 4 seconds on my G3 iBook. But I figured I didn't
> > really need a second copy of a 30MB file that differed only in the
> > placement and number of newlines, so why not just save the positions
> > where it starts? I looked at somebody else's code and tried:
> >
> > Method 2: $_=$genome;
> > while( m/$sequence/g) { push @indices,pos();}
> > and then write out the indices.
> >
> > I only waited half an hour on my iBook before I killed that one.
> >
> > I tried again with a couple of smaller files on my G4 desktop.
> >
> >
> > Method 5MB time 11MB time
> > 1 2.6 sec 5.1 sec
> > 2 48.1 sec 3:20.2 = 200.2 sec
>
> That is unexpected. The substitution method must move parts of the
> string for every match, so I'd expect it to be slower than global
> matching.
>
> I benchmarked both, and also a method based on the index() function.
> The results show indexing and global matching in the same ballpark,
> both almost twice as fast as substitution (code appended below):
>
> substitute 13.6/s -- -41% -45%
> indexing 23.1/s 69% -- -7%
> globmatch 24.8/s 82% 7% --
>
> This was on a slow matching with much smaller (200 K) samples, but
> the dependence on size should be largely linear. Where your
> quadratic (well, more-than-linear) behavior comes from is anybody's
> guess.
>
> Anno
>

Thanks.
   I took your program and modified it so that instead of generating the
string you're matching, it read if from a file, and took the file name
from the command line -- so I could feed it the same files I'd used
before.

My files are k1, k10, k50, k100 and k200
(k1 being 1000 lines, k10 being 10,000 lines, etc)
the file sizes are:
k1 59K
k10 595K
k50 2M
k100 5M
k200 11M

The files have newlines every 60 char or so, but I strip them before
running the test. -- Oh yes, I moved the nl stripping out of substitute(),
and did it last.

Here's my results -- the time line is for the whole program, which
includes time to read the file and strip the newlines.
Interesting that on my system
867 MHz Apple G4 1.25GB ram, Mac OS X(10.3.5), perl 5.8.1
substitute is close to indexing, both significantly faster than
globmatch. Looks like that may be worth trying, at any rate.

I don't understand the percentages being displayed here --

k1:
            Rate globmatch substitute indexing
globmatch 355/s -- -56% -60%
substitute 815/s 129% -- -7%
indexing 878/s 147% 8% --

time: 3.08s user 1.68s system 90% cpu 5.261 total

k10:
             Rate globmatch substitute indexing
globmatch 3.60/s -- -95% -96%
substitute 67.6/s 1777% -- -20%
indexing 84.6/s 2248% 25% --

time: 2.91s user 1.54s system 90% cpu 4.922 total

k50:
            (warning: too few iterations for a reliable count)
                 Rate globmatch substitute indexing
globmatch 9.77e-02/s -- -99% -99%
substitute 11.3/s 11492% -- -20%
indexing 14.2/s 14399% 25% --

time: 10.38s user 13.20s system 89% cpu 26.479 total

k100:
            (warning: too few iterations for a reliable count)
                 Rate globmatch substitute indexing
globmatch 2.42e-02/s -- -100% -100%
substitute 5.71/s 23477% -- -16%
indexing 6.80/s 27941% 19% --

time: 34.66s user 50.17s system 88% cpu 1:35.80 total

k200:
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                 Rate globmatch substitute indexing
globmatch 5.65e-03/s -- -100% -100%
substitute 2.83/s 50017% -- -17%
indexing 3.42/s 60440% 21% --

time: 142.39s user 214.24s system 85% cpu 6:56.92 total

joe



Relevant Pages

  • Re: performance surprise -- why?
    ... > CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string). ... The substitution method must move parts of the ... The results show indexing and global matching in the same ballpark, ... sub substitute { ...
    (comp.lang.perl.misc)
  • help with regex
    ... regex is not my cup of tea, i've been using them few times, only matching, no substitution.. ... I need to check a string before saving it into a decimal column in DB and I actually don't know how to start substitution: ...
    (comp.lang.ruby)
  • Re: DCL question (of the day)- null byte in string symbol
    ... I have pointed out that this code was just to show the bug, ... Subject: Re: DCL question - null byte in string symbol ... Lexical substitution is a possibility. ...
    (comp.os.vms)
  • Re: DCL question (of the day)- null byte in string symbol
    ... The string isn't that long. ... Note that is 7 bits of the second character, 8 bits of ... Lexical substitution is a possibility. ... The fact that command verification reflects a truncated command line ...
    (comp.os.vms)
  • Re: Testing for names that sounds alike
    ... No SOUNDEX is designed for matching names and even there it is not highly accurate. ... Dim StrOut As String ... Dim vWords As Variant ...
    (microsoft.public.access.queries)