Re: word aware distance algorythm



From: rouadec <roua...@xxxxxxxxx>
I'm ... trying to compute the similarity between two sentences ...
right now I'm using a levenshtein distance ...

You're talking at cross purposes. Computing similarity, and
computing distance, are nearly opposite things. Computing distance
usually implies a metric space, and is easy to define in a
consistent precise way, usually Cartesian distance. Computing
similarity is very different and rather difficult to define in any
reasonable consistent manner. For example, consider:
aaaaaaabbbbbbb
aaaaaaa
Are those two strings 50% similar, or some other fraction similar,
or do you get 100% in one direcition but only 50% the other
direction?
aaaaaaabbbbbbb
aaaaaaaccccccc
Are those two strings 50% similar, or what? These two strings are
less similar than the first pair, so if the first pair were 50%
simlar then these two must be less similar, so how do you compute
that???

I guess it's apparent by now that I prefer vector distance methods.

but the levensthein distance isn't word aware and this is causing
issues when matching data, ie something like 'xxxxxx' is widly
different in levenshtein terms to 'xxxxxx - yyyyyy' but not so for
humans ;)

Any idea about a metrics which will give a greater weight to
sentences with similar words (even better if it also weight the
position of similar words) ?

Suppose you computed the trigram-frequency vector, and the measured
the Cartesian distance between these vectors. Would that be close
to what you need, or would an additional tuning of it be needed?

For a large corpus of sentences to be compared, for example a
million different sentences, are you comparing only a few of the
half-trillion possible pairs, or do you need in effect to compare
nearly *all* of those possible pairs? If you need to compare nearly
all pairs in order to perform clustering of sentences, it may be
worth the extra work of pre-processing each sentence into an
explicit vector, so that Cartesian distance between those
pre-computed vectors can be performed very efficiently on vector
processors. Something like my ProxHash may be useful, where each
vector of trigrams is transformed to a much smaller vector
(typically only 15 ordinates) whose pairwise distances can then be
VERY quickly computed. Also, precomputed ProxHashes allow
successive clustering on smaller numbers of ordinates, which
greatly reduces the number of comparisons that need to be done.
.



Relevant Pages

  • Re: Lineage selection
    ... > Computing the distance between parents and pruning them as needed is more ... > parents and searching through that). ... computing using greps, huh, I don't like idea at all. ... > plausible than any of the other methods and as the number of genomes grows ...
    (comp.ai.genetic)
  • Re: Ping John Larkin
    ... the scoring is computing the distance between two points on the earth. ...
    (sci.electronics.design)
  • K-L distance
    ... In my work the current situation demands computing the Kullback-Leibler ... distance between two pdf's obtained by using kernel density estimation. ...
    (sci.stat.math)
  • K-L Distance
    ... In my work the current situation demands computing the Kullback-Leibler ... distance between two pdf's obtained by using kernel density estimation. ...
    (sci.stat.math)
  • how to compare the similarity of two groups?
    ... could anyone please tell me how to compare the similarity of ... I am considering using euclidean distance as the metric, ...
    (sci.stat.edu)