Re: combining millions of different regular expressions
- From: Chris F Clark <cfc@xxxxxxxxxxxxxxxxxxxx>
- Date: Mon, 17 Mar 2008 17:06:09 -0400
e2point@xxxxxxxxx writes:
i have large number of regular expressions. (millions). i want to
match a given string with all of them some how. but i cant iteratively
match the string with all of them because it will not give the
performance expected.
i know that any regular expression can be transformed to a state
machine. im wondering whether it is possible to transform all these
regex's to state machines and merge those state machines somehow(this
merged state machine will have an optimal structure to improve the
memory usage and performance). and then run the string once through
this state machine. (even though i only have one state machine, i need
to know exactly which regex's got matched with the input string.)
is this something that is done already somewhere?
or is this something that is still researched?
any advice/info is appreciated.
thanks
Depending upon your point of view this is either a solved problem or
theoretical research. "From the point of view of theory, there is no
difference between theory and practice, from the practical point of
view this is."
First, be careful with what you mean by matching regular expressions,
searching with regular expressions has slightly different semantics
that anchored matching of regular expressions. Many of the
theoretical papers gloss over that difference, because it can be
resolved (at one level) with adding a .* at the beginning of each
regular expression. However, if one does that naively, one loses the
starting position of where the regular expression actually matches.
(One of those cases where theory and practice mis-align.)
To keep the starting position information, you need something more
akin to "tagged regular expressions", look up Ville Laurikari.
Now, to your size limits. While in theory one can build a DFA from
any set of NFAs, and thus solve an arbitrarily large collection of
regular expressions, the memory requirements for doing so are beyond
practical limits. I would recommend looking up the papers of Michela
Becchi to get a handle on practical solutions to large regular
expression problems. Working with her, we have solved regular
expression search problems with tens of thousands of simultaneous
expressions (and Aho-Corasick fixed string problems with hundreds of
thousands of simultaneous fixed string patterns).
In addition, you might want to read the works of Bruce Watson. He
knows a fair amount (perhaps as much as any other person) about
regular expressions and search.
Hope this helps,
-Chris
******************************************************************************
Chris Clark Internet: christopher.f.clark@xxxxxxxxxxxxxxxxxxxxxx
Compiler Resources, Inc. or: compres@xxxxxxxxxxxxx
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
.
- References:
- combining millions of different regular expressions
- From: e2point
- combining millions of different regular expressions
- Prev by Date: Category theory for Understanding the Internet
- Next by Date: Re: An intuitive reason why P=BPP
- Previous by thread: Re: combining millions of different regular expressions
- Next by thread: A special CVP (Circuit Value Problem), pls help
- Index(es):
Relevant Pages
|