Re: the lowest number of comparisons in string matching



[laura]
I'm looking for the algorithm which does the lowest number of
comparison for the string matching problem (in the worst case).

I know that the complexity is liniar in the length of the strings, but
I'm interested in finding the actual number of comparisons.

The lowest number (theoretical) is m-n+1, but as far as I know no
algorithm has achieved this limit.

(where m is the lenght of the text and n is the length of the pattern).

Well, m-n+1 can't be right -- for example, when m=n the worst case is that
the pattern and the text are the same, in which case we obviously need n
comparisons to determine that they /are/ the same. For example, find "dog"
in "dog". Unless you make at least m=n=3 comparisons, you can't be sure it
matches.

If you allow for preprocessing of the pattern, then there are ways to build
searchers for general regular expressions (of which a fixed string is a
simple special case) that look at each character of the text through the
first match exactly once. How you count comparisons is a bit fuzzy then,
since a recognizing automaton generally /branches/ on the current text
character via some form of indexed lookup table. For the "dog" example:

state := S0
for i := 1 through m:
switch on state:
case S0: /* looking for initial 'd' */
if text[i] = 'd':
state := S1
continue looping

case S1: /* have 'd', looking for 'o' */
if text[i] = 'o':
state := S2
else:
state := S0
continue looping

case S2: /* have 'do', looking for 'g' */
if text[i] := 'g':
return success /* leftmost 'dog' ends at text[i] */
else:
state := S0
continue looping

return failure /* 'dog' doesn't appear in the text */

That's so simple I didn't need a lookup table for the text characters, and
it obviously does no more than m comparisons (exactly one comparison is done
per trip through the loop, and the loop executes no more than m times). You
could complicate it a bit more to return failure at the start of state S0 if
i > m-2, and at the start of S1 if i > m-1 (in which cases, not enough
characters remain in the text for it to be possible to reach state S2). You
can also replace the loop and the switch on the state with a pile of gotos
coupled with a bunch of tedious code duplication.

While it may not be immediately obvious, it's possible to construct an
essentially similar searcher for any pattern. Such constructions can be
found in any text covering finite automata and regular languages.


.



Relevant Pages

  • [TOMOYO #15 3/8] Common functions for TOMOYO Linux.
    ... This file contains common functions (e.g. policy I/O, pattern matching). ... Since TOMOYO Linux is a name based access control, ... TOMOYO Linux's string manipulation functions make reviewers feel crazy, ... the Linux kernel accepts all characters but NUL character ...
    (Linux-Kernel)
  • Re: "String" manipulation for a Case clause
    ... ' Dim objRegExp As RegExp ... I would use Double or String. ... 'non-word character. ... 'construct a pattern like: ...
    (microsoft.public.excel.programming)
  • Re: Optimization question
    ... PROCEDURE ForOptimize*(s: ARRAY OF CHAR); ... Calling this with a 75-character string yielded "31" for both loops. ... the sentinel null character. ... With C's "for," you define how you continue the loop, which can be (and ...
    (comp.lang.oberon)
  • Re: Can not find older posting: Reading files (fast)
    ... |> "Sucks up an entire file from PATH into a freshly-allocated string, ... (loop for bytes-read = (read-sequence buffer file-stream) ... CHARACTER; or a type specifier for a finite recognizable subtype of INTEGER; or one of the symbols SIGNED-BYTE, UNSIGNED-BYTE, or ... (loop for line = (read-line s nil nil) ...
    (comp.lang.lisp)
  • Re: Cant Figure Out How to Access a List Item
    ... With the alpha character in front you first have to strip it off to get at ... Personally, if the message ID was an alphanumeric string, I'd first try to ... loop through each subject item and send a new email with each subject ... retrieving the object you want with the MessageID, ...
    (comp.lang.basic.visual.misc)