Re: Speed comparison of regex versus index, lc, and / /i



On Sat, 31 May 2008 01:49:14 +0000, A. Sinan Unur wrote:

There is one change I would make to both routines. Cache length $ss
before the loop. On my machine, that cut down the time by about 40%.

In the real situation almost every search for the search string ($ss)
in the text ($text) is going to fail, so I think it would make much
less difference than in the fake situation of repeating with the same
string over and over again.

I modified the program to use /o that A. Sinan Unur added, and also
added a routine to test Ben Morrow's idea and reran it using the
Benchmark module:

#!/usr/local/bin/perl
use warnings;
use strict;

my $text = <<EOF;
xhoster is the coolest perl programmer ever. xhoster is the
greatest. xhoster is the champion. xhoster is a babe magnet.
EOF
my $ss = "xhoster";

sub index_find
{
my @finds;
my $found = 0;
while (($found = index ($text, $ss, $found)) != -1) {
push @finds, $found;
$found += length ($ss);
}
return \@finds;
}

sub regex_find
{
my @finds;
while ($text =~ /\Q$ss\E/g) {
push @finds, pos ($text) - length($ss);
}
return \@finds;
}

sub regex_unur_find
{
my @finds;
while ($text =~ /\Q$ss/go) {
push @finds, pos ($text) - length($ss);
}
return \@finds;
}

my $re = qr{\Q$ss};

sub regex_morrow_find
{
my @finds;
while ($text =~ /$re/g) {
push @finds, pos ($text) - length($ss);
}
return \@finds;
}

# check it all works

for (\&index_find, \&regex_find, \&regex_unur_find, \&regex_morrow_find) {
print "String found at ", (join ", ",@{&{$_}($text, $ss)}),"\n";
}

use Benchmark qw( cmpthese );

my $count = 100000;

cmpthese $count, {
'index' => sub {index_find($text,$ss)},
'regex' => sub {regex_find($text,$ss)},
'regex_unur' => sub {regex_unur_find($text,$ss)},
'regex_morrow' => sub {regex_morrow_find($text,$ss)},
};

Here are the results of five runs:

Rate regex_morrow regex regex_unur index
regex_morrow 116279/s -- -31% -40% -41%
regex 169492/s 46% -- -12% -14%
regex_unur 192308/s 65% 13% -- -2%
index 196078/s 69% 16% 2% --

Rate regex_morrow regex regex_unur index
regex_morrow 119048/s -- -30% -40% -40%
regex 169492/s 42% -- -15% -15%
regex_unur 200000/s 68% 18% -- -0%
index 200000/s 68% 18% 0% --

Rate regex_morrow regex index regex_unur
regex_morrow 120482/s -- -29% -40% -40%
regex 169492/s 41% -- -15% -15%
index 200000/s 66% 18% -- 0%
regex_unur 200000/s 66% 18% 0% --

Rate regex_morrow regex regex_unur index
regex_morrow 119048/s -- -31% -42% -42%
regex 172414/s 45% -- -16% -16%
regex_unur 204082/s 71% 18% -- -0%
index 204082/s 71% 18% 0% --

Rate regex_morrow regex index regex_unur
regex_morrow 114943/s -- -33% -40% -44%
regex 172414/s 50% -- -10% -16%
index 192308/s 67% 12% -- -6%
regex_unur 204082/s 78% 18% 6% --


This supports my original statement that the regex version is almost
exactly the same speed as "index". But I was surprised that Ben Morrow's
version is the slowest of all. Does anyone see anything wrong with my
methodology?

.



Relevant Pages