Re: Musings on uniqueness
- From: Logan Shaw <lshaw-usenet@xxxxxxxxxxxxx>
- Date: Sat, 03 Mar 2007 12:01:36 -0600
[ snip problem of finding unique items, and hashes not being good enough
to guarantee uniqueness, thus requiring full comparisons ]
Richard Heathfield wrote:
a) look along the shelf for a box with the right item size and hash value. If one doesn't exist, create it.
b) for each bag in the box, compare this item with the bag's comparison item (and remember, they may be many megabytes or even several gigabytes in size, so this is the really costly bit). If they're exactly the same, put this item into the bag and stop. Otherwise, go on to the next bag. If this item doesn't belong in any existing bag, make a new bag for it, and put it in. It becomes the comparison item for that bag.
Once that process is complete, I visit each box in turn, and each bag within each box in turn. For each bag, the comparison item is retained, and everything else in that bag is removed from the collection.
Once /that/ process is complete, I have my set.
As far as I can see, this will work just fine. But does anyone have any better ideas?
The way I see it, once you have used heuristics (hashes) to prove what
you can about uniqueness, you are left with groups of things that
are probably equal (with very high probability if you run some
cryptographic hash one them). Since by definition at this point you
have used up all your heuristic ideas, you must now do full comparisons.
And, here are the three insights at that point that determine the
running time:
(1) it's obvious every item must be compared to at least one other
item; otherwise, you have no information about whether it's
equivalent to anything.
(2) every subset of items must somehow, at some point, be compared
to every other (disjoint) subset of items; otherwise, you could
end up with cliques. (hmm, this is a more general form of #1.)
(3) equality is transitive and symmetric (and reflexive); therefore,
comparing A to B and B to C gives you information about the
relationship of A and C.
Because of #1, if you have N items, you need a minimum of N-1
comparisons, absolutely. Because of #3, you can do a chain of
comparisons (where you group the items on a line and compare
each to its neighbor), and that chain can be size N-1 and still
succeed.
Therefore, N-1 is sufficient to determine whether they are all
equivalent, and there is an easy way of doing N-1 comparisons.
As for the question of what happens when all N items are not equal,
is it really that important? If you've already done, say, even MD5
on them, the chances of all N items not being equal are really quite
small. The worst case is that all N items are distinct even though
they have the same MD5 checksum, and in that case, I think your
algorithm is O(N^2), but the probability of having N MD5 collisions
out of N items is so tiny it's not even funny.
Nevertheless, if you do reach this point, this problem can be
reduced to sorting. If you sort the items using some O(N log N)
sort, then you can compare each one to its neighbor using a
total of N-1 comparisons. So, you can solve the problem by doing
a sort. That is, you can improve the worst case by doing a sort.
Come to think of it, another way to solve this problem is to forget
the bags and boxes and shelves. Once you have a bunch of things
you think are probably identical, try to put them into some kind
of balanced tree (like a B-tree), doing the full comparison of
the new item with every node you visit. If you have proved based
on full comparison that your new item is a duplicate of an existing
node, discard it at that point. In the best (and common) case,
this will do only N-1 comparisons, because the tree will have only
one node. But if you have the bad luck to have N distinct items,
it's still only O(N log N).
Of course, with all this sorting business, I'm assuming there is a
full ordering defined for the items. But it sounds like they are
ordered sequences of bytes, so there probably is a full ordering
defined. :-)
- Logan
.
- References:
- Musings on uniqueness
- From: Richard Heathfield
- Musings on uniqueness
- Prev by Date: Re: Musings on uniqueness
- Next by Date: Re: Musings on uniqueness
- Previous by thread: Re: Musings on uniqueness
- Next by thread: Re: Musings on uniqueness
- Index(es):
Relevant Pages
|