An O(NP) Sequence Comparison Algorithm

From: Michael B Allen (mba2000_at_ioplex.com)
Date: 04/14/04


Date: Wed, 14 Apr 2004 15:49:05 -0400

The code below implements the algorithm described in the following paper:

S. Wu, E. Myers, U. Manber, and W. Miller,
``An O(NP) Sequence Comparison Algorithm,''
Information Processing Letters 35, 6 (1990), 317-323.
http://www.cs.arizona.edu/people/gene/vita.html

This is more commonly known as the shortest edit script, edit distance,
and is a dual of the longest common subsequence problem and perhaps most
commonly known as the algorithm used to implement the diff(1) program [1].

Q: How can the below code be modified to permit scoring such as described
in 11.5.2 of Gusfield's Algorithms on Strings, Trees, and Sequences? I am
concerned that because this algorithm uses the compressed edit distance
that the calculation may not be as simple as modifying the fp function.

Thanks,
Mike

[1] Actually I think GNU diff implements the O(ND) algorithm in which
case this would actually be faster.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define MAX(a,b) ((a) > (b) ? (a) : (b))

struct ses {
    int fp[4096];
    const char *s1;
    const char *s2;
    int n;
    int m;
    int delta;
};

int
snake(struct ses *ses, int k, int y)
{
    int x = y - k;

    while (x < ses->m && y < ses->n && ses->s1[y + 1] == ses->s2[x + 1]) {
        x++;
        y++;
    }

    return y;
}
int
fp(struct ses *ses, int k, int p)
{
    if (p < 0 || k < -p || k > (ses->delta + p)) {
        return -1;
    }

    return ses->fp[ses->m + k];
}
int
compare(const char *s1, int sn1, const char *s2, int sn2)
{
    struct ses ses;
    int p, k, y;

    ses.s1 = s1;
    ses.s2 = s2;
    ses.n = sn1;
    ses.m = sn2;
    ses.delta = sn1 - sn2;

    memset(ses.fp, -1, 4096);
    p = -1;
    do {
        p++;
        for (k = -p; k < ses.delta; k++) {
            ses.fp[ses.m + k] = snake(&ses, k, MAX(fp(&ses, k - 1, p) + 1, fp(&ses, k + 1, p - 1)));
        }
        for (k = ses.delta + p; k > ses.delta; k--) {
            ses.fp[ses.m + k] = snake(&ses, k, MAX(fp(&ses, k - 1, p - 1) + 1, fp(&ses, k + 1, p)));
        }
        k = ses.delta;
        y = ses.fp[ses.m + k] = snake(&ses, k, MAX(fp(&ses, k - 1, p) + 1, fp(&ses, k + 1, p)));
    } while (y != ses.n);

    return ses.delta + 2 * p;
}

int
main(void)
{
    const char *s1 = " acebdabbabed";
    const char *s2 = " acbdeacbed";
    int d;

    d = compare(s1, strlen(s1) - 1, s2, strlen(s2) - 1);
    printf("%d\n", d);

    return EXIT_SUCCESS;
}

Permission is granted to all to use this code. This work in whatever
form will ultimately be freely available in my libmba library.



Relevant Pages

  • Re: C++ more efficient than C?
    ... If you want to do language comparisons, ... complicated algorithm, though). ... int compare ... not been in C's standard library. ...
    (comp.lang.c)
  • Re: The Lisp Curse
    ... If one compares "apples to apples", ... just be a C procedure for par(), and the additional C code would be ignored. ... The simple algorithm can be implemented in a number ... int par ...
    (comp.lang.forth)
  • Re: Mersenne Twister -- A Revised C++ Implementation
    ... "That algorithm is a slight elaboration on the basic linear congruential ... takes an int, and an int on the target you seem to be developing on -- ... unless you qualify it by stressing that you ... compilation to pick the appropriate type in a more general manner. ...
    (comp.lang.cpp)
  • Re: Baeza-Yates intersection Java implementation
    ... The algorithm they give which handily beats it is ... It's more that the algorithm they describe is not very intellectually interesting - it boils down to iterating over a list of numbers stored in a two-level array using the obvious pair of nested loops. ... int highBits{ ...
    (comp.lang.java.programmer)
  • Re: Non-restoring binary square root and convergence
    ... I took the algorithm from "Algorithms for Extracting Square Roots and Cube ... % sqrt 9 16 ... int main ...
    (sci.math)