Re: Efficient algorithm for finding duplicates in 56-bit number

From: Heiner Steven (heiner.steven_at_nexgo.de)
Date: 01/31/05


Date: Mon, 31 Jan 2005 21:24:10 +0100

spinoza1111@yahoo.com wrote:
> The practical problem with the braindead approach is that the guy has
> millions of records, and a non-braindead sorting algorithm such as
> quick or shell sort has to be used to realistically load the table.

Indeed. I implemented a prototype that converted the 32 byte ascii
format into the corresponding 16 byte binary format. Afterwards
1,7 Million numbers (the data of one day) was read into RAM,
and sorted using QuickSort. Afterwards I searced the list
using binary search, which is quite fast.

> Quick and shell sorting are more psychologically complicated, perhaps,
> than open hash.
>
> But if this is only my prejudice, the use of a sort AFTER all entries
> are loaded creates a pragmatic issue.
>
> This is that even using a good algorithm there will be a strange pause
> after load during sort whereas open hash allows each entry to be loaded
> with a time footprint that only degrades slowly.

The pause for sorting is indeed noticable, lasting about
5 seconds. But this is only the data of *one day*, and
certainly will get larger.

> In the "braindead" approach, you have to wait constant (but large) time
> proportional to the size of the data set order 1 while the data set
> loads, and a roughly equivalent time during sort until you are
> presented with the first useful results.
>
> Whereas an object approach using open hash has the multithread
> capability to access items before any drama, before the load is
> complete. During the load you can start work because some keys will
> already exist, and the purpose of computers is to work us all to death,
> isn't it, in paralell.

I agree with you in general, but in this particular case I
will have to read all data before starting to work with it.
"work" in my case means: find duplicates.

[...excursion about the advantages of parallel programming omitted...]

> Open hashing creates the final and useful product, lacking only closure
> in the form of all entries, from the beginning and that is why it MAY
> be useful here. It stores the minimal amount of crap (eg., zero crap)
> per entry.
>
> One negative remains the case that there is no requirement for deletion
> apart from not storing duplicate keys. This means that you do have a
> strong case for brain death here.
>
> But my prejudice against "batch" and the step by step production of
> Useless Things remains precisely because I recall, with less than
> fondness, eternal vigils in mainframe computer rooms, glazedly watching
> der blinkenlights.
>
> Of course, real users can switch over to porn sites today or even do
> useful work while a "brain dead" approach pipes garbage, only at the
> end accomplishing the alchemical step.
>
> But the argument for brain death as simplicity itself has to give way
> to the realization that processes also need to be "as simple as
> possible, but no simpler".

I agree with you here. As a computer scientist I appreciate
an elegant (even complex) solution to a problem. But
the other, pragmatic half of me knows that a simple solution
can be more (cost or time) efficient.

I'm trying to solve a real-world problem here, and therefore
considerations like the time needed to implement it, or the ease
of maintenance come into play. I consider a solution to be
"good" for our purpose when it enables us to process the large
amounts of data in time.

The longer the discussion goes, the more I get the feeling that
taking the redundancy out of our input data is too expensive
in terms of development time.

But just for the fun of it, did anybody consider solving this
problem by interpreting the number as a 128-bit unsigned
integer, and representing the set of numbers seen using
a sparse bit array?

Heiner



Relevant Pages