Re: Efficient String Lookup?

From: Andrew Dalke (adalke_at_mindspring.com)
Date: 10/17/04


Date: Sun, 17 Oct 2004 10:11:11 GMT

Chris S. wrote:
> Spoke too soon.

As did I. :)

> However, I noticed rex still doesn't return multiple matches. For
> instance, matching 'abc' to the given the patterns '#bc', 'a#c', and
> 'ab#', your code only returns a match to the first pattern '#bc'. Is
> this standard behavior or is it possible to change this?

This is standard behavior. You can't change it. The
only easy solution along these lines is to have a triangular
table of

   (pat1)|(pat2)| .... |(patN)
   (pat2)| .... |(patN)
   ...
   (patN)

and if group i matched at a point x then do another
search using the (i+1)th entry in that table at that
point. Repeat until no more matches at x.

I don't know of any off-the-shelf solution for Python
for what you want to do, other than the usual "try
each pattern individually." You'll need to make some
sort of state table (or trie in your case) and do it
that way.

You *can* use Perl's regexps for this sort of thing. That
regexp language allowed embedded Perl code, so this will
get you an answer

% perl -ne 'while (/((.bc)(?{print "Match 1 at ", length($`),
"\n"})^)|((a.c)(?{print "Match 2 at ", length($`), "\n"})^)|./g){}'
This is abc acbcb
Match 1 at 8
Match 2 at 8
Match 1 at 13

Breaking the pattern down I'm using "while (/ ... /g) {}" to
match everything in the input string ($_), which comes from
each line of stdin (because of the '-n' command-line flag).

The pattern is

  ((.bc)(?{print "Match 1 at ", length($`), "\n"})^)
|((a.c)(?{print "Match 2 at ", length($`), "\n"})^)
|.

That is, match ".bc" then execute the corresponding piece
of embedded Perl code. This prints "Match 1 at " followed
by the length of the text before the current match, which
corresponds to the position of the match.

(If you only need that there is a match, you don't need
then. Using $` in perl gives a performance hit.)

After it executes, the subgroup matches (embedded executable
code always passes, and it consumes no characters). But
then it gets to the '^' test which fails because this is
never at the beginning of the string.

So the regexp engine tries the next option, which is the
"Match 2 at .." test and print. After the print (if
indeed there is a match 2) it also fails.

This takes the engine to the last option which is the "."
character. And that always passes.

Hence this pattern always consumes one and only one character.
I could put it inside a (...)* to match all characters,
but decided instead to use the while(/.../g){} to do the looping.
Why? Old practices and not well-determined reason.

(The while loop works because of the 'g' flag on the pattern.)

You talk about needing to eek all the performance you can
out of the system. Have you tried the brute force approach
of just doing N regexp tests?

If you need the performance, it's rather easy to convert
a simple trie into C code, save the result on the fly
to a file, compile that into a Python shared library, and
import that library, to get a function that does the
tests given a string. Remember to give a new name to
each shared library as otherwise the importing gets confused.

                                Andrew
                                dalke@dalkescientific.com



Relevant Pages

  • Re: How would you design regexps in the integer domain?
    ... I.e. instead of describing a pattern of characters (as in normal ... arrays of characters rather than in arrays of integers. ... instance use an integer regexp to decide whether an array begins with 17 ...
    (comp.lang.ruby)
  • Re: WILDCARD: output all a* by searching a text file
    ... where * denotes a string of characters. ... When you want to write a program that search a pattern such as /a./ ... int main{ ... int main(int argc,const char** argv){ ...
    (comp.programming)
  • Re: Expert script (.bat) writers help needed (strip double-quote from string)
    ... Sets or returns the regular expression pattern being searched for. ... Always a RegExp object variable. ... May include any of the regular expression characters defined in the table in the Settings section. ...
    (microsoft.public.windowsxp.help_and_support)
  • Re: First round of descriptive/external exercises
    ... I don't see coming up with that sort of omni pattern by ... The Mason/Della lie thing has ALWAYS been about ... >stuff, because you're nearly always dealing with two or more characters, ... lie to fit her action. ...
    (rec.arts.sf.composition)
  • Re: RegEx: How to ignore the number of whitespaces?
    ... matching character sequences in strings. ... what are the limitations of the "arbitrary Unicode characters?" ... Exactly one of them must be found in the string, ... Certain subsequences of a pattern may be marked as optional. ...
    (microsoft.public.dotnet.framework)