Re: combining millions of different regular expressions



e2point@xxxxxxxxx writes:

i have large number of regular expressions. (millions). i want to
match a given string with all of them some how. but i cant iteratively
match the string with all of them because it will not give the
performance expected.
i know that any regular expression can be transformed to a state
machine. im wondering whether it is possible to transform all these
regex's to state machines and merge those state machines somehow(this
merged state machine will have an optimal structure to improve the
memory usage and performance). and then run the string once through
this state machine. (even though i only have one state machine, i need
to know exactly which regex's got matched with the input string.)
is this something that is done already somewhere?
or is this something that is still researched?
any advice/info is appreciated.

thanks

Depending upon your point of view this is either a solved problem or
theoretical research. "From the point of view of theory, there is no
difference between theory and practice, from the practical point of
view this is."

First, be careful with what you mean by matching regular expressions,
searching with regular expressions has slightly different semantics
that anchored matching of regular expressions. Many of the
theoretical papers gloss over that difference, because it can be
resolved (at one level) with adding a .* at the beginning of each
regular expression. However, if one does that naively, one loses the
starting position of where the regular expression actually matches.
(One of those cases where theory and practice mis-align.)

To keep the starting position information, you need something more
akin to "tagged regular expressions", look up Ville Laurikari.

Now, to your size limits. While in theory one can build a DFA from
any set of NFAs, and thus solve an arbitrarily large collection of
regular expressions, the memory requirements for doing so are beyond
practical limits. I would recommend looking up the papers of Michela
Becchi to get a handle on practical solutions to large regular
expression problems. Working with her, we have solved regular
expression search problems with tens of thousands of simultaneous
expressions (and Aho-Corasick fixed string problems with hundreds of
thousands of simultaneous fixed string patterns).

In addition, you might want to read the works of Bruce Watson. He
knows a fair amount (perhaps as much as any other person) about
regular expressions and search.

Hope this helps,
-Chris

******************************************************************************
Chris Clark Internet: christopher.f.clark@xxxxxxxxxxxxxxxxxxxxxx
Compiler Resources, Inc. or: compres@xxxxxxxxxxxxx
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

.



Relevant Pages

  • Re: Serious Perl Regular Expression deficiency?
    ... I started doing Perl 2 years ago and have ... > conclusion that regular expressions have a serious ... This is serious because the not string ... If you want to pull out the contents of XML comments you could do this. ...
    (comp.lang.perl.misc)
  • Re: Remove characters from string
    ... and your link took me to the templates page at microsoft office. ... there expaining regular expressions unless you meant I should search for it. ... | them to the same format for ease of processing. ... | the string I remove extraneous characters. ...
    (microsoft.public.excel.programming)
  • Re: dividing an replacing spaces in string
    ... I knew regular expressions would help in this. ... This newly delimited string will dump into separate rows like this ... Dim colMatches As Object ... Set objRe = CreateObject ...
    (microsoft.public.excel.programming)
  • combining millions of different regular expressions
    ... i have large number of regular expressions. ... match a given string with all of them some how. ... im wondering whether it is possible to transform all these ... merged state machine will have an optimal structure to improve the ...
    (comp.theory)
  • Re: Extract email addresses
    ... Because of a difference in the VBA flavor of Regular Expressions, ... Function REMid(str As String, Pattern As String, _ ... Dim objRegExp As RegExp ...
    (microsoft.public.excel.worksheet.functions)