Re: speed issues with pattern matching and substitution



Brian McCauley <nobull@xxxxxxxx> wrote in comp.lang.perl.misc:
> Arturi wrote:
>
> > I wrote some perl lines that scans a file looking for some words and
> > substitute them for their corresponding pairs, which are previously
> > saved in a HASH.

[...]

> If you are concerned about speed you should take the comoposition and
> compilatiion of the regex outside the loop.
>
> my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;
>
> local *_; # Wise not to stomp on someone else's $_
>
> while (<FILE>) {
> my @values = values %HASH;
> for my $pattern ( @patterns ) {
> my $value = shift @values;
> s/$pattern/$1$value$3/g;
> }
> # print or whatever
> }
>
> But my spider-sense is telling me that perhaps the keys of %HASH really
> arbirary regexes but are just words and all you want is word
> substitution. This is a classic bit of Perl.
>
> while (<FILE>) {
> s/(\w+)/$HASH{$1}||$1/eg;
> # print or whatever
> }

At first sight that looks rather inefficient because *every* word
is captured and replaced (often by itself). Out of interest I
benchmarked a few variants, and it turns out that it is just as
fast as more selective solutions.

I'm appending the benchmark script, in case anyone cares. The four
variants tested are (least efficient first, the difference between
the last two is insignificant):

direct -- every regex is compiled from the string at run time
regex -- all alternatives compiled into a big regex
brian -- replace every word, like the second solution above
precomp -- regular expressions precompiled via qr//, like the
first solution above

Anno

--

#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;

use Benchmark qw( cmpthese);

our @lines = <DATA>;

my %repla; # string replacements (master)
@repla{
qw( one two three four five six seven eight nine ten eleven twelve)
} = qw( ein zwei drei vier fuenf sechs sieben acht neun zehn elf zwoelf);

my @repla = map [ qr/\b$_\b/ => $repla{ $_}], keys %repla; # precompiled

my $repla = do { # single regex
my $re = join '|', map "\\b$_\\b", keys %repla;
qr/($re)/;
};

goto bench;

check:
print brian( @lines);
print "\n", @lines;
exit;

bench:
@lines = ( @lines) x 10; # vary to check linearity
cmpthese -1, {
direct => 'direct( @lines)',
precomp => 'precomp( @lines)',
brian => 'brian( @lines)',
regex => 'regex( @lines)',

};
#######################################################################

sub direct {
my @new = @_;
for ( @new ) {
my ( $str, $rpl);
s/\b$str\b/$rpl/g while ( $str, $rpl) = each %repla;
}
@new;
}

sub precomp {
my @new = @_;
for ( @new ) {
for my $r ( @repla ) {
s/$r->[ 0]/$r->[ 1]/g;
}
}
@new ;
}

sub brian {
my @new = @_;
s/(\w+)/$repla{ $1} || $1/eg for @new;
@new;
}

sub regex {
my @new = @_;
s/$repla/$repla{ $1}/g for @new;
@new;
}

__DATA__
one spot was detected on each of the two blades
three jolly coachmen, three jolly coachmen
twelve to a dozen
a display of oneupmanship
three o'clock four o'clock five o'clock ROCK
.



Relevant Pages

  • Re: qr and subroutines
    ... [SNIP] ... Your $tmp_regex is not a regex, ... I shouldn't have referred to the string as "regex", ... sub rework_uri { ...
    (comp.lang.perl.misc)
  • Re: Select specific text in cell
    ... c:filename.ext and the operating system will look in the ... following code (showing a 'backslashless' path reference) prints the first ... End Sub ... Well, as written, the regex would retain the C:. ...
    (microsoft.public.excel.misc)
  • Re: matching a pattern with a space or no space??
    ... Perl doesn't have that function. ... How can I regex a data flow that is always ... surrounded by optional white space. ... next name and assignment. ...
    (comp.lang.perl.misc)
  • qr and subroutines
    ... sub build_url { ... my $tmp_regex = shift; ... in the actual regex function itself, and I get what I want. ...
    (comp.lang.perl.misc)
  • Re: Regexp slowdown
    ... with the same weak logic that you use for the regex stuff. ... cr> What of that makes you think I'm worried about the syntax of my regex? ... cr> memory-intensive attempt at interpreter optimization by the perl folks. ... cr> had a mile-long stack frame I would see a performance hit. ...
    (comp.lang.perl.misc)