Re: FAQ 6.11 Can I use Perl regular expressions to match balanced text?



In article <m7vl45-pe7.ln1@xxxxxxxxxxxxxxxxxxx>,
PerlFAQ Server <brian@xxxxxxxxxxxxxx> wrote:
6.11: Can I use Perl regular expressions to match balanced text?



Historically, Perl regular expressions were not capable of matching
balanced text. As of more recent versions of perl including 5.6.1
experimental features have been added that make it possible to do this.
Look at the documentation for the (??{ }) construct in recent perlre
manual pages to see an example of matching balanced parentheses. Be sure
to take special notice of the warnings present in the manual before
making use of this feature.

I've been using the following trick for many practical applications.


It helps to know why regexps cannot match balanced texts. Specifically,
normal regexps match rational languages, and rational languages cannot
count to an arbitrary number.

Balanced texts are algebraic languages, and they can be generated by
a grammar. The simplest example, with just the parentheses, would be:
D -> (D)D | <empty>

This is just one possible grammar, this one is non-ambiguous, which means
less combinatorial explosion later on.

However, one can use this grammar to build a good approximation of it as
a regular expression.

Specifically, set D0 = 0
then build Dn = (D_{n-1})D_{n-1} | empty

In perl terms:

$e = "";
for (1 .. n) {
$e = "(?:\($e\)$e)|";
}

then Dn is the rational language of all balanced expressions that do not
nest more than n open parentheses.

This can be used to build a regexp that matches all balanced expressions up
to an set level of nesting parentheses automatically. In many cases,
you don't have to go higher than n=15 or 20...
.



Relevant Pages

  • Serious Perl Regular Expression deficiency?
    ... After extensive read on the Perl docs on re's ... conclusion that regular expressions have a serious ... This is serious because the not string ... howdy folks --> ...
    (comp.lang.perl.misc)
  • Re: Serious Perl Regular Expression deficiency?
    ... I started doing Perl 2 years ago and have ... > conclusion that regular expressions have a serious ... This is serious because the not string ... If you want to pull out the contents of XML comments you could do this. ...
    (comp.lang.perl.misc)
  • Re: Regexp slowdown
    ... this only applies when addressing parts of someone's ... The word 'backtracking' has a specific meaning in the context of Perl ... decidedly do not know enough) about Perl regular expressions, ... If you had read the posting guidelines, you would have known about the ...
    (comp.lang.perl.misc)
  • Re: RegEx Help Needed
    ... I'm not programming in Perl. ... Regular expressions differ subtly but significantly between the languages ... would have a good chance of not working in your language. ... Regular expressions are a very poor tool for parsing HTML. ...
    (comp.lang.perl.misc)
  • Re: regexp in .net
    ... ago from documentation and examples of Unix sed and Perl, which I think are quite good - but many of their samples are for sed and Perl, obviously, which doesn't make them a very good resource for somebody wanting to use regular expressions in .NET. ... When you are becoming a little better with regexes, it's probably best if you set yourself some tasks and try to ask about the details - or compare your results to published resources like http://www.regexlib.com/ ... Finally, of course, for the exact detailed definition of regular expressions in .NET, you should use the MSDN pages: ...
    (microsoft.public.dotnet.languages.csharp)