Why is my regex so slow?



I've got a script I'm using to search through a list of Wikipedia
article titles to find ones that match certain patterns.

As-written, if you run it and supply '.*target.*' on standard input,
it will process my test file in 125 seconds. Make any of the changes
mentioned in the comments, and the time needed will drop to 1.8
seconds. Why the difference? Particularly interesting is that it
seems to matter where the regex pattern came from: if it's from
standard input, testing is slow; if it's assigned in the script,
testing is fast.

If it matters, I'm using Perl 5.8.8.

To see the problem I'm having, download
http://download.wikimedia.org/eswiki/20081018/eswiki-20081018-all-titles-in-ns0.gz
(a 4.1-MB file), unzip it, and run the program supplying the name of
the unzipped file.

Thanks,
Mark Wagner

--------------
binmode STDIN, ":utf8"; # Comment this out to speed things up

while(<STDIN>)
{
my $lines = 0;
my $lines2 = 0;
my $regex;
$regex = $_;
chomp $regex;

#$regex = '.*target.*'; # Or uncomment this to speed things up
open INFILE, "<", $ARGV[0];
binmode INFILE, ":utf8"; # Or comment this out to speed things up

while(<INFILE>)
{
my $target = $_;
chomp $target;
$target =~ s/_/ /g;

print "Match\n" if($target =~ /^$regex$/); # Or make
this case-insensitive to speed things up, or remove the start and end
anchors to speed things up

$lines = $lines + 1;
if($lines >= 10000)
{
$lines = 0;
$lines2 += 10000;
print STDERR "$lines2\r";
}
}
}
.



Relevant Pages

  • Re: Isolate lines in a text file and perform replacement
    ... Perl and regular expressions to normalize the file names. ... applying these regex patterns to playlist and xml files. ... The script performs a recursive search and finds files with these ...
    (comp.unix.shell)
  • Inplace editing the elegant way
    ... I would like to develop a script that can inplace edit a large file ... replacing some lines according to another regex. ... What is the best way to scale this for n patterns? ... another array and somehow wedge them into the example above? ...
    (comp.lang.perl.misc)
  • Re: Print Section of file
    ... Specifically, FS *can* be a regex (if it's not a single character, ... then awk assumes it's a regex). ... regex (in fact it never is, unless the context requires a regex). ... a relational expression or a boolean combination of patterns... ...
    (comp.lang.awk)
  • Re: re module non-greedy matches broken
    ... Regex patterns are expressed positive by ... Modern regular expression engines (which are no longer regular ... there are some common patterns on how to write some specific ... AFAICS one needs non-greedy regexps very very rarely at all. ...
    (comp.lang.python)
  • Isolate lines in a text file and perform replacement
    ... I developed a shell script for renaming my mp3 files. ... Perl and regular expressions to normalize the file names. ... applying these regex patterns to playlist and xml files. ...
    (comp.unix.shell)

Loading