Re: almost equal strings
- From: Tom Anderson <twic@xxxxxxxxxxxxxxx>
- Date: Mon, 22 Dec 2008 17:19:42 +0000
On Mon, 22 Dec 2008, rossum wrote:
On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green
<see_website@xxxxxxxxxxxxxxxxxxxx> wrote:
Have you seen any work on comparing strings for almost equality, or
measuring difference?
Soundex has already been mentioned. See also
http://en.wikipedia.org/wiki/Levenshtein_distance
Those are good for comparing words, or arbitrary strings, as Roedy asked, but he reveals that he actually wants to compare phrases. Obviously, you can still compare phrases as strings, but i wonder if they're actually not so appropriate here. Edit distance (aka Levenshtein distance) is defined for a string over any set of symbols; if you symbols were words, you could apply it at the phrase level. You might need a more refined scoring method, though - not all insertion, deletions or substitutions are of equal importance.
So that for example, if you had two proverbs, they would compare equal or close even if they were worded slightly differently, or punctuated slightly differently.
You might assign low importance to matching word like "the" and high importance to words like "democracy".
That is a lot more complex, unless you go for simple run-length you are going to need a database of relative importance for each possible word.
Such a think would not be impossible to build - you could get hold of some sort of corpus (say, a few gigabytes of modern text from Project Gutenberg) and process it, recording frequencies for each word, and perhaps second-order probabilities (eg 'raging' is more often followed by 'bull' than 'mouse', even though 'mouse' might be the more common word overall). You could then apply Bayesian statistics or something to figure out what the most important words in the processed text were. The database might not need to be that big - apparently 90% of text is made up of ~7000 words [1], so if you knew those, you could assign any unknown word some arbitrary 'very rare' frequency.
Anyway, i don't think cljp is the place to ask about this - there is a huge body of knowlege about text processing which Roedy could tap into, and there must be a place to ask questions on it.
tom
[1] http://www.askoxford.com/oec/mainpage/oec02/
--
Programming is a skill best acquired by practice and example rather than
from books -- Alan Turing
.
- References:
- almost equal strings
- From: Roedy Green
- Re: almost equal strings
- From: rossum
- almost equal strings
- Prev by Date: Re: Compiler Bug?? (javac 1.6.0_0-internal)
- Next by Date: Re: almost equal strings
- Previous by thread: Re: almost equal strings
- Next by thread: Re: almost equal strings
- Index(es):
Relevant Pages
|