Re: performance surprise -- why?
From: Joe Davison (haltingNOSPAM_at_comcast.net)
Date: 08/25/04
- Next message: Clint Olsen: "Re: 5.8.5 make test fails on lib/ExtUtils/Constant with icc"
- Previous message: Paul Lalli: "Re: creating anonymous subroutines at runtime"
- In reply to: Anno Siegel: "Re: performance surprise -- why?"
- Next in thread: ctcgag_at_hotmail.com: "Re: performance surprise -- why?"
- Reply: ctcgag_at_hotmail.com: "Re: performance surprise -- why?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 25 Aug 2004 17:47:02 GMT
On 25 Aug 2004, Anno Siegel wrote:
> Joe Davison <haltingNOSPAM@comcast.net> wrote in comp.lang.perl.misc:
> > I'm searching the genome with a perl script I wrote and encountered
> > a surprise when I tried to improve the performance -- it got worse,
> > much worse, and I'm wondering if there's a better way to do my
> > second effort.
> >
> > Here's the basic problem:
> >
> > Given a short sequence, say AGTACT, and a chromosome, say
> > CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB
> > string).
> >
> > I want to find all the places in the chromosome where the sequence
> > occurs.
> >
> > Method 1: $genome =~ s/($sequence)/\n$1/g; I then wrote the chopped
> > string to a file and counted the lengths of the lines using a simple
> > awk program (don't worry about why).
> >
> > That runs in about 4 seconds on my G3 iBook. But I figured I didn't
> > really need a second copy of a 30MB file that differed only in the
> > placement and number of newlines, so why not just save the positions
> > where it starts? I looked at somebody else's code and tried:
> >
> > Method 2: $_=$genome;
> > while( m/$sequence/g) { push @indices,pos();}
> > and then write out the indices.
> >
> > I only waited half an hour on my iBook before I killed that one.
> >
> > I tried again with a couple of smaller files on my G4 desktop.
> >
> >
> > Method 5MB time 11MB time
> > 1 2.6 sec 5.1 sec
> > 2 48.1 sec 3:20.2 = 200.2 sec
>
> That is unexpected. The substitution method must move parts of the
> string for every match, so I'd expect it to be slower than global
> matching.
>
> I benchmarked both, and also a method based on the index() function.
> The results show indexing and global matching in the same ballpark,
> both almost twice as fast as substitution (code appended below):
>
> substitute 13.6/s -- -41% -45%
> indexing 23.1/s 69% -- -7%
> globmatch 24.8/s 82% 7% --
>
> This was on a slow matching with much smaller (200 K) samples, but
> the dependence on size should be largely linear. Where your
> quadratic (well, more-than-linear) behavior comes from is anybody's
> guess.
>
> Anno
>
Thanks.
I took your program and modified it so that instead of generating the
string you're matching, it read if from a file, and took the file name
from the command line -- so I could feed it the same files I'd used
before.
My files are k1, k10, k50, k100 and k200
(k1 being 1000 lines, k10 being 10,000 lines, etc)
the file sizes are:
k1 59K
k10 595K
k50 2M
k100 5M
k200 11M
The files have newlines every 60 char or so, but I strip them before
running the test. -- Oh yes, I moved the nl stripping out of substitute(),
and did it last.
Here's my results -- the time line is for the whole program, which
includes time to read the file and strip the newlines.
Interesting that on my system
867 MHz Apple G4 1.25GB ram, Mac OS X(10.3.5), perl 5.8.1
substitute is close to indexing, both significantly faster than
globmatch. Looks like that may be worth trying, at any rate.
I don't understand the percentages being displayed here --
k1:
Rate globmatch substitute indexing
globmatch 355/s -- -56% -60%
substitute 815/s 129% -- -7%
indexing 878/s 147% 8% --
time: 3.08s user 1.68s system 90% cpu 5.261 total
k10:
Rate globmatch substitute indexing
globmatch 3.60/s -- -95% -96%
substitute 67.6/s 1777% -- -20%
indexing 84.6/s 2248% 25% --
time: 2.91s user 1.54s system 90% cpu 4.922 total
k50:
(warning: too few iterations for a reliable count)
Rate globmatch substitute indexing
globmatch 9.77e-02/s -- -99% -99%
substitute 11.3/s 11492% -- -20%
indexing 14.2/s 14399% 25% --
time: 10.38s user 13.20s system 89% cpu 26.479 total
k100:
(warning: too few iterations for a reliable count)
Rate globmatch substitute indexing
globmatch 2.42e-02/s -- -100% -100%
substitute 5.71/s 23477% -- -16%
indexing 6.80/s 27941% 19% --
time: 34.66s user 50.17s system 88% cpu 1:35.80 total
k200:
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate globmatch substitute indexing
globmatch 5.65e-03/s -- -100% -100%
substitute 2.83/s 50017% -- -17%
indexing 3.42/s 60440% 21% --
time: 142.39s user 214.24s system 85% cpu 6:56.92 total
joe
- Next message: Clint Olsen: "Re: 5.8.5 make test fails on lib/ExtUtils/Constant with icc"
- Previous message: Paul Lalli: "Re: creating anonymous subroutines at runtime"
- In reply to: Anno Siegel: "Re: performance surprise -- why?"
- Next in thread: ctcgag_at_hotmail.com: "Re: performance surprise -- why?"
- Reply: ctcgag_at_hotmail.com: "Re: performance surprise -- why?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|