Re: Speed comparison of regex versus index, lc, and / /i



Ben Bullock wrote:
In a recent discussion on this newsgroup, it was mentioned that
"index" is better for matching fixed strings than using regular
expressions. Coincidentally I've recently been setting up a search
system for a fairly large volume (about 30 megabytes) of text files,
and as a first approximation for the search system I made a simple
routine to open each file and search for the string in the file using
"index".

As a test of the proposition that index is better than regexes, I also
tried using a regex to do the same job. My results were that the
version using regexes was almost identical (within a few percent) in
speed to "index", leading me to think that under the bonnet "index" is
probably just using a regex anyway.

Just the opposite. AFAIK if searching for a literal string (as opposed to a regular expression pattern) the regexp engine will use the same algorithm as index().

See perlreguts.pod for details: http://search.cpan.org/~rgarcia/perl-5.10.0/pod/perlreguts.pod


Furthermore, the biggest
bottleneck in the code wasn't the pattern matching, but the use of
"lc" to convert the text into lower case for case insensitive
search. I found that saving all the text files as lower case before
doing the matching, rather than converting the strings using lc, saved
more than half of the total execution time, so the difference between
"index" and regexes was not even significant compared to the time
spent converting to lower case. I also found that the "i" option to
the regex similarly meant that the regex ran drastically
slower. Similarly another big bottleneck I identified was conversion
of the files - opening the (utf8 encoded) files with

open my $file, "<:utf8", $filename;

saved about 30% of the total execution time compared to converting the
text after reading it in.

So my conclusion is that "index" isn't necessary and one can always
use regexes - unless anyone can prove otherwise - and Perl's regexes
are so fast that they may not be much of a bottleneck. But why "lc"
should be so slow I don't know.

I assume that most of the slowdown was caused by the introduction of the use of UTF, etc.



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
.



Relevant Pages

  • Speed comparison of regex versus index, lc, and / /i
    ... "index" is better for matching fixed strings than using regular ... As a test of the proposition that index is better than regexes, ... doing the matching, rather than converting the strings using lc, saved ... spent converting to lower case. ...
    (comp.lang.perl.misc)
  • Re: Math
    ... converting them to strings, then back, the precision will be lost. ... string is performed by the C compiler, whence Perl is at the mercy of ...
    (comp.lang.perl.misc)
  • Re: OT: novice regular expression question
    ... >> data record begins with either a string, or a list of strings. ... If the format is "Python strings and lists of ... Then use that pattern, parenthesized to turn it ... > Let's try matching a single string first: ...
    (comp.lang.python)
  • Re: Collection Capabilities
    ... I just iterate through the collections to find the matching sets. ... I've got a collection of strings... ... Dim CurIdx as Long ... 'now we Loop over the sorted group from the top-entry ...
    (microsoft.public.vb.general.discussion)
  • Re: Math
    ... converting them to strings, then back, the precision will be lost. ... on the Perl numeric types. ... blb8 at po dot cwru dot edu ...
    (comp.lang.perl.misc)