Re: re bug

From: Thomas Rast (foo.bar_at_freesurf.ch.invalid)
Date: 10/05/04


Date: 05 Oct 2004 13:46:19 +0200

klaus_neuner82@yahoo.de (Klaus Neuner) writes:

> The regex rx compiles. It can be used to search the strings str1 and
> str2. Yet, when used with str3, the program will not terminate for
> days.

Here's a more readable version of the regexp in question (I assume the
newlines were introduced by your mail program and not part of the
original string), which should be equivalent to your form if compiled
with the re.VERBOSE flag:

'''
(?:^|\ )
(?P<ALLES>
   (?:
      [^/ ]*/[^/ ]*/
      (?: # why a group?
         cn
         (?: :[^/ #]+ )*
      )
   )*
   [^/ ]*/[^/ ]*/
   (?: # why a group?
      cn
      (?: :[^: /#]+ )*
      :2
      (?: :[^: /#]+ )*
      :3
      (?: :[^: /#]+ )*
   )
)
(?=(\ |$))
'''

Judging from the number of '*' and '+' quantifiers, the long search
time may be due to excessive backtracking as the regexp engine tries
to find a match.

[Now that I'm looking this through for the N-th time over, maybe it's
not, maybe you really found a bug. But I've already written the part
below and don't want to waste the time I invested in analysing your
regexp by not posting it...]

> At the time being, I can neither avoid these regular expressions, nor
> can I afford to wait days for the result of my program.

Your example imposes a very strong structure on a possible match: It
consists of small "tokens" (for lack of a better word) delimited by
'/' and ':'. Try simplifying the regexp to something like (re.VERBOSE
again):

'''
(
 [^ /]*/[^ /]*/ # two "tokens" followed by a slash each
 cn
 (:[^ /:]+)* # "tokens" that start with a colon
)*
'''

then .split() the string you're trying to search (this is a far more
efficient way of expressing your "boundary conditions") and .match()
against each of the words. If you have a match, check if it fits the
rest of the requirements, i.e. look for the :2 and :3 tokens etc.

HTH
Thomas

-- 
If you want to reply by mail, substitute my first and last name for
'foo' and 'bar', respectively, and remove '.invalid'.


Relevant Pages

  • Re: regexp: parsing HTML tokens
    ... > I'm attempting to parse simple HTML/ASP tokens. ... > regexp to be more firm in my extraction of the token I'm looking for. ... Since your string contains literal double quote chars you should quote ... > sub-expression to be variable in sequence to the first subexpression. ...
    (comp.lang.tcl)
  • regexp: parsing HTML tokens
    ... I'm attempting to parse simple HTML/ASP tokens. ... My string is: ... What I'm currently using is a regexp with subexpressions to find this, ... sub-expression to be variable in sequence to the first subexpression. ...
    (comp.lang.tcl)
  • Re: Using regexp to check for length of string that accepts optional strings
    ... total length of x characters; what API would you expect in regexp for this purpose? ... tokens and that each token can be of varying length. ...
    (comp.lang.java.programmer)