Small regular expression parser

From: Jeff (jma_at_nospam.insightbb.com)
Date: 07/23/04


Date: Fri, 23 Jul 2004 01:47:26 GMT

While nothing special, I've been working on this for a project of mine
that needs regular expressions. And I didn't want something the size of
CL-PPLRE. So, I've developed my own. I will be putting it online for
those who are interested, but thought I would post here first and guage
interest before going through the hassle. It was developed in Corman
Lisp, but I'm sure it is completely Common Lisp and should compile it
other environments (I hope).

First, the goal was to develop a very simple regular expression parser.
It needed to be reasonably fast, small and handle 95% of all cases that
I would need it for. I decided to follow the example of Lua patterns to
make my life easier. So...

Using a token reader for #/ ... / generates a new regular expression.
Likewise, the function MAKE-REGEX can be used. Pre-defined character
sets are selected using the % character (like Lua) instead of \. There
are inclusive and exclusive sets. Capturing (and nested capturing) as
well as some simple functions to help with basic usage. The entire
package weighs in at ~400 lines (with ~80 of that being comments).
Definitely a testament to Lisp ;)

Some examples of what can be done with it:

(match-regex "Hello, World!" #/%U+/) ==> 1 6

MATCH-REGEX is the heart of all other functions. It returns the index
into the string of the start of the match and the length of the match.
In the above example, searching for 1 or more non-uppercase characters.

(capture-regex "my name is Jeff" #/%u%a+$/) ==> "Jeff"

CAPTURE-REGEX returns the first text string captured by the regular
expression. CAPTURE-ALL-REGEX is the same, but will return a list of
all subsequent matches in the search string.

(capture-all-regex "This is a test" #/%a+/) ==> ("This" "is" "a" "test")

SPLIT-REGEX will split a string into a list of strings that are
delimited by the regular expression:

(split-regex "1,2;3 4-5" #/[,;%s-]+/) ==> ("1" "2" "3" "4" "5")

REPLACE-REGEX and REPLACE-ALL-REGEX can be used to return a new string
with text from the original string that was matched replaced with new
text. There is currently no (built-in) support for accessing the
original text. So (for example), captitalizing every word would take
some extra code, but is possible.

(REPLACE-ALL-REGEX "This Is A Test" #/%u/ "*") ==> "*his *s * *est"

Last there are a couple macros to help with captures. Captures can be
nested, and are returned from MATCH-REGEX as values after the length of
the matched text. The captures are in the order in which they were
closed. The two macros are WITH-MATCH-REGEX and WITH-MATCH-ALL-REGEX.
So, for example:

(with-match-all-regex "Hello, world!" #/(%a+)/ (text)
   (print text)) ==> "Hello" "world"

(with-match-regex "Jeff Massung" #/(%a+)%s((%u)%a+)/
   (first initial last)
   (list initial last first)) ==> ("M" "Massung" "Jeff")

A quick breakdown of how the parser works (for those interested); the
expression string is parsed and broken down into elements (character,
set, eof, etc.). This list of elements is then compiled into a #<Regex>
class which is nothing more than a list of generated functions. Each
function takes an input stream (or match state) as a single argument
and returns either nil or the number of characters successfully
matched. That's it. I'm sure it isn't the fastest method, but was
definitely easier to write and is much easier to maintain ;)

I'd appreciate any comments. This is the beginning of a rather large
project using Lisp, and this was a bit of a journey of learning. Quite
valuable, too. Thanks go to those who have answered my questions here
in the past.

Jeff



Relevant Pages

  • Re: Brian Kernighan, maybe Im not worthy, maybe Im scum
    ... abandoned it because I don't think Beautiful Code can be written in C: ... Pike's code doesn't implement a regular expression interpreter ... it makes no provision for a character which must occur at ... changeable index into the string it points-at and it expects the user ...
    (comp.programming)
  • Re: Checking last character of string for punctuation
    ... I'm a newbie with a newbie question. ... the string should be kept as is. ... This matches your string against the regular expression that you need to put ... Given the description of your problem, you might be interested in character ...
    (perl.beginners)
  • Re: MT NewsWatcher filtering
    ... Jim Gibson wrote: ... The plus sign shouldn't be necessary for most regular expression ... character 1-6 exists in the string. ... exists in the string you are testing. ...
    (comp.sys.mac.system)
  • Re: RegularExpressions
    ... Start at the beginning of the string, or at the first line break character ... Basically, that is what a regular expression does, but in a more roundabout ... Notice that before the word BRE there are some other info that I dont ...
    (microsoft.public.dotnet.framework)
  • Re: Get regular expression
    ... own tree structure. ... Expression compares a string character-by character, ... regular expression solution, which was about as close as one could get to ... the structure of the hierarchy can be inferred by using ...
    (microsoft.public.dotnet.languages.csharp)