Re: almost equal strings



On Mon, 22 Dec 2008, rossum wrote:

On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green
<see_website@xxxxxxxxxxxxxxxxxxxx> wrote:

Have you seen any work on comparing strings for almost equality, or
measuring difference?

Soundex has already been mentioned. See also
http://en.wikipedia.org/wiki/Levenshtein_distance

Those are good for comparing words, or arbitrary strings, as Roedy asked, but he reveals that he actually wants to compare phrases. Obviously, you can still compare phrases as strings, but i wonder if they're actually not so appropriate here. Edit distance (aka Levenshtein distance) is defined for a string over any set of symbols; if you symbols were words, you could apply it at the phrase level. You might need a more refined scoring method, though - not all insertion, deletions or substitutions are of equal importance.

So that for example, if you had two proverbs, they would compare equal or close even if they were worded slightly differently, or punctuated slightly differently.

You might assign low importance to matching word like "the" and high importance to words like "democracy".

That is a lot more complex, unless you go for simple run-length you are going to need a database of relative importance for each possible word.

Such a think would not be impossible to build - you could get hold of some sort of corpus (say, a few gigabytes of modern text from Project Gutenberg) and process it, recording frequencies for each word, and perhaps second-order probabilities (eg 'raging' is more often followed by 'bull' than 'mouse', even though 'mouse' might be the more common word overall). You could then apply Bayesian statistics or something to figure out what the most important words in the processed text were. The database might not need to be that big - apparently 90% of text is made up of ~7000 words [1], so if you knew those, you could assign any unknown word some arbitrary 'very rare' frequency.

Anyway, i don't think cljp is the place to ask about this - there is a huge body of knowlege about text processing which Roedy could tap into, and there must be a place to ask questions on it.

tom

[1] http://www.askoxford.com/oec/mainpage/oec02/

--
Programming is a skill best acquired by practice and example rather than
from books -- Alan Turing
.



Relevant Pages

  • Re: A note on computing thugs and coding bums
    ... Here's my response including a bug fix. ... to make "modern strings" possible was designed and first implemented ... contents to strings, to compare them, and to duplicate them. ... while the Pike code will NEVER work...for international strings. ...
    (comp.programming)
  • Re: Ascan with a Substr
    ... the above suggestion: ... operator to compare the strings. ... code block you originally used was just right, ...
    (comp.lang.clipper)
  • Re: Ascan with a Substr
    ... The following solution work well ... operator to compare the strings. ... code block you originally used was just right, ...
    (comp.lang.clipper)
  • Re: Fast and Safe C Strings: User friendly C macros to Declare and use C Strings.
    ... Explain how such strings will improve the speed of strchr. ... If comparison fails, compare value against "needle" ... And these of course are just buffer overflows waiting to happen anyways. ... comparison of _one_ character: 'a' is not equivalent to 'f', ...
    (comp.lang.c)
  • Re: Fastcode CompareStr B&V 1.1
    ... I noticed the special case that Pierre has optimized for too. ... Then you can compare first chars without reading the ... Many of the strings in the benchmark are different in the ... > missed this possible optimization when I wrote my functions;-) ...
    (borland.public.delphi.language.basm)