Speed comparison of regex versus index, lc, and / /i



In a recent discussion on this newsgroup, it was mentioned that
"index" is better for matching fixed strings than using regular
expressions. Coincidentally I've recently been setting up a search
system for a fairly large volume (about 30 megabytes) of text files,
and as a first approximation for the search system I made a simple
routine to open each file and search for the string in the file using
"index".

As a test of the proposition that index is better than regexes, I also
tried using a regex to do the same job. My results were that the
version using regexes was almost identical (within a few percent) in
speed to "index", leading me to think that under the bonnet "index" is
probably just using a regex anyway. Furthermore, the biggest
bottleneck in the code wasn't the pattern matching, but the use of
"lc" to convert the text into lower case for case insensitive
search. I found that saving all the text files as lower case before
doing the matching, rather than converting the strings using lc, saved
more than half of the total execution time, so the difference between
"index" and regexes was not even significant compared to the time
spent converting to lower case. I also found that the "i" option to
the regex similarly meant that the regex ran drastically
slower. Similarly another big bottleneck I identified was conversion
of the files - opening the (utf8 encoded) files with

open my $file, "<:utf8", $filename;

saved about 30% of the total execution time compared to converting the
text after reading it in.

So my conclusion is that "index" isn't necessary and one can always
use regexes - unless anyone can prove otherwise - and Perl's regexes
are so fast that they may not be much of a bottleneck. But why "lc"
should be so slow I don't know.
.



Relevant Pages

  • Re: Speed comparison of regex versus index, lc, and / /i
    ... "index" is better for matching fixed strings than using regular ... As a test of the proposition that index is better than regexes, ... doing the matching, rather than converting the strings using lc, saved ... spent converting to lower case. ...
    (comp.lang.perl.misc)
  • Re: OT: novice regular expression question
    ... >> data record begins with either a string, or a list of strings. ... If the format is "Python strings and lists of ... Then use that pattern, parenthesized to turn it ... > Let's try matching a single string first: ...
    (comp.lang.python)
  • Re: Collection Capabilities
    ... I just iterate through the collections to find the matching sets. ... I've got a collection of strings... ... Dim CurIdx as Long ... 'now we Loop over the sorted group from the top-entry ...
    (microsoft.public.vb.general.discussion)
  • Re: Seach and replace tools
    ... matching an array of strings to be found and ... replaing them. ...
    (uk.comp.sys.mac)
  • Re: OT: novice regular expression question
    ... > data record begins with either a string, or a list of strings. ... If the format is "Python strings and lists of ... "negative lookbehind assertion" - in other words, a pattern that doesn't ... Let's try matching a single string first: ...
    (comp.lang.python)