"Commutative" regexps?

From: J Krugman (jkrugman345_at_yahbitoo.com)
Date: 05/28/04


Date: Fri, 28 May 2004 17:46:35 +0000 (UTC)


Suppose that I have three different regexps, for example:

  my $r1 = qr/foo/;
  my $r2 = qr/bar/;
  my $r3 = qr/baz/;

(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:

  for my $doc (@docs) {
    if (/$r1/s && /$r2/s && /$r3/s) {
      go_to_town($_);
    }
  }

But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions

  /$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).

Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?

Any suggestions on this would be much appreciated.

jill

-- 
To  s&e^n]d  me  m~a}i]l  r%e*m?o\v[e  bit from my a|d)d:r{e:s]s.


Relevant Pages

  • Re: regular expression too big
    ... > Yes, unless he is matching whole words, he is stuck with regexes. ... The problem is to match a whole bunch of words (later regexps) ... Then you can sort the array of words ...
    (comp.lang.ruby)
  • Re: Merging regular expressions
    ... This can be done with alternation ... think of the first requirement as wanting to know which sub-regexp ... in an array of regexps that have been unioned or alternated together ... triggers a match - the object is to save time iterating over an array ...
    (comp.lang.ruby)
  • Re: regexp working with mixed lines endings
    ... my regexps: ... the comparaison i do: ... array << line ...
    (comp.lang.ruby)
  • Marshal gives error when dumping and loading array with two regexps in latest ruby 1.9
    ... Marshal to dump and then load back an array containing two Regexps gives an ...
    (comp.lang.ruby)
  • Re: Merging regular expressions
    ... to compress the regexps into a large regexp while still preserving the ... Will I hack the Ruby source? ... m =~ s # returns an int representing which of the n original RegExps ... Why not just use the array of regexps like this? ...
    (comp.lang.ruby)