Re: Similarity searching
- From: Gordon Sande <g.sande@xxxxxxxxxxxxxxxx>
- Date: Mon, 04 Apr 2005 13:12:55 GMT
Gary L. Scott wrote:
glen herrmannsfeldt wrote:
justabeginner wrote:
If the number of arrays is N, array cardinality is C (would be 10 for normal decimal arrays) and the array length is L, the number of evaluations is proportional to (N)(N-1)(C**2)(L**2).
Where do you get this formula from? Sorting the set introduces a factor of N*log(N), if the arrays are big you won't in general have to compare to the last element so I don't see the factor of L**2, and I don't see where that C**2 factor is coming from at all.
(snip)
Not quite sure where the log might come from, wouldn't that introduce an irrational number into an integer set, therefore it is not an operation which is likely to be applicable here? Seems like introducing pi into integer arithmetic.
The log comes when you analyze the behavior for large N, and ignore smaller terms, and also ignore constant factors. There are sorting algorithms that compare each element to each other element, called O(n**2) (The leading term of N*(N-1)/2, ignoring the 1/2).
There are also sorting algorithms which don't compare each item to every other item, yet still generate a sorted result. Many of them are O(N*logN) for large N, though the low terms will be significant for very small N.
When you're comparing 2 arrays A & B of length L, A(1) needs to be compared with B(1) only. So it should be proportional to L.
Array cardinality probably isn't a factor because you're just comparing if an entry is the same.
Similarity searching is still an active science, but there are a lot of useful algorithms out there now. Approximate matching, or similarity searching as the title says, is one. Dynamic programming algorithms are popular if you need the best approximate match. Many other algorithms are faster but don't guarantee the best, but just a high probability of finding a good match.
-- glen
Anything to do with "fuzzy" logic (similarity searching)?
The original question was on how to find duplicates. The OP had some formulae which was a mix of bit level complexity (as it had he size of the operands) with naive methods (as it implied an N^2 sort). The usual duplicate finder is a sort followed by comparison of the adjacent items in the sorted list. (The discussion went downhill from there!) This is Knuth Vol. 1&2 sort of stuff.
If sorting does not make sense as would be the case of approximate comparisons then you are into the territory of nearest neighbor algorithms. Search terms include nearest neighbor, range searching, partial matching and k-d trees. The grand daddy algorithm is k-d trees. They take N log ( N ) to set up and then each search for the closest neighbor takes log ( N ). Over time there have been improvements in the shape of the neighborhoods but no major revolution in techniques. An application which as attracted attention is VLSI layout.
.
- Follow-Ups:
- Re: Similarity searching
- From: glen herrmannsfeldt
- Re: Similarity searching
- References:
- Similarity searching
- From: justabeginner
- Re: Similarity searching
- From: James Van Buskirk
- Re: Similarity searching
- From: justabeginner
- Re: Similarity searching
- From: glen herrmannsfeldt
- Re: Similarity searching
- From: Gary L. Scott
- Similarity searching
- Prev by Date: Re: Automatic testing of numerical output
- Next by Date: Re: IDE for Linux
- Previous by thread: Re: Similarity searching
- Next by thread: Re: Similarity searching
- Index(es):
Relevant Pages
|