Re: Musings on uniqueness



[ snip problem of finding unique items, and hashes not being good enough
to guarantee uniqueness, thus requiring full comparisons ]

Richard Heathfield wrote:
a) look along the shelf for a box with the right item size and hash value. If one doesn't exist, create it.

b) for each bag in the box, compare this item with the bag's comparison item (and remember, they may be many megabytes or even several gigabytes in size, so this is the really costly bit). If they're exactly the same, put this item into the bag and stop. Otherwise, go on to the next bag. If this item doesn't belong in any existing bag, make a new bag for it, and put it in. It becomes the comparison item for that bag.

Once that process is complete, I visit each box in turn, and each bag within each box in turn. For each bag, the comparison item is retained, and everything else in that bag is removed from the collection.

Once /that/ process is complete, I have my set.

As far as I can see, this will work just fine. But does anyone have any better ideas?

The way I see it, once you have used heuristics (hashes) to prove what
you can about uniqueness, you are left with groups of things that
are probably equal (with very high probability if you run some
cryptographic hash one them). Since by definition at this point you
have used up all your heuristic ideas, you must now do full comparisons.

And, here are the three insights at that point that determine the
running time:
(1) it's obvious every item must be compared to at least one other
item; otherwise, you have no information about whether it's
equivalent to anything.
(2) every subset of items must somehow, at some point, be compared
to every other (disjoint) subset of items; otherwise, you could
end up with cliques. (hmm, this is a more general form of #1.)
(3) equality is transitive and symmetric (and reflexive); therefore,
comparing A to B and B to C gives you information about the
relationship of A and C.

Because of #1, if you have N items, you need a minimum of N-1
comparisons, absolutely. Because of #3, you can do a chain of
comparisons (where you group the items on a line and compare
each to its neighbor), and that chain can be size N-1 and still
succeed.

Therefore, N-1 is sufficient to determine whether they are all
equivalent, and there is an easy way of doing N-1 comparisons.

As for the question of what happens when all N items are not equal,
is it really that important? If you've already done, say, even MD5
on them, the chances of all N items not being equal are really quite
small. The worst case is that all N items are distinct even though
they have the same MD5 checksum, and in that case, I think your
algorithm is O(N^2), but the probability of having N MD5 collisions
out of N items is so tiny it's not even funny.

Nevertheless, if you do reach this point, this problem can be
reduced to sorting. If you sort the items using some O(N log N)
sort, then you can compare each one to its neighbor using a
total of N-1 comparisons. So, you can solve the problem by doing
a sort. That is, you can improve the worst case by doing a sort.

Come to think of it, another way to solve this problem is to forget
the bags and boxes and shelves. Once you have a bunch of things
you think are probably identical, try to put them into some kind
of balanced tree (like a B-tree), doing the full comparison of
the new item with every node you visit. If you have proved based
on full comparison that your new item is a duplicate of an existing
node, discard it at that point. In the best (and common) case,
this will do only N-1 comparisons, because the tree will have only
one node. But if you have the bad luck to have N distinct items,
it's still only O(N log N).

Of course, with all this sorting business, I'm assuming there is a
full ordering defined for the items. But it sounds like they are
ordered sequences of bytes, so there probably is a full ordering
defined. :-)

- Logan
.



Relevant Pages

  • Re: First round of descriptive/external exercises
    ... >>Heavy filtering, of the sort Catja has been doing, can (and frequently ... > The leftover objects included a knitting bag, a briefcase, and a diaper ... In the diaper bag were six ... But since, for any given exercise that happens *not* to be about style, the ...
    (rec.arts.sf.composition)
  • Re: Gordon and food waste
    ... that empty from the bottom, with a sort of tap device, or one of those ... aisle by the time Mike and I had managed to get another bag over the leak! ... so you need to have little containers into which you put ...
    (uk.media.radio.archers)
  • Re: Rank features for creative writers editor
    ... had a few bugbears with it that I was too lazy to sort out and recent ... it's turned into a bag of bones WEEKS ago? ...
    (rec.arts.sf.composition)
  • Re: Woman finds $65,000, turns it in
    ... keep it, for sure I'd be in an airport, a few bills in my pocket, or ... I wouldn't be finding $ in a bag on the street IAC. ... dead baby or a body part in a gym bag, that sort of thing. ...
    (alt.true-crime)
  • Re: Bone Trail Found Through Woods Near Anthony Home, Body Wrapped In "Covering" Inside of Bag
    ... that the child's body was wrapped in some sort ... of "covering" inside the bag. ... the Anthony home. ...
    (alt.true-crime)