Re: Efficient algorithm for finding duplicates in 56-bit number

spinoza1111_at_yahoo.com
Date: 02/04/05


Date: 3 Feb 2005 19:05:54 -0800


Heiner Steven wrote:
> spinoza1111@yahoo.com wrote:
> > The practical problem with the braindead approach is that the guy
has
> > millions of records, and a non-braindead sorting algorithm such as
> > quick or shell sort has to be used to realistically load the table.
>
> Indeed. I implemented a prototype that converted the 32 byte ascii
> format into the corresponding 16 byte binary format. Afterwards
> 1,7 Million numbers (the data of one day) was read into RAM,
> and sorted using QuickSort. Afterwards I searced the list
> using binary search, which is quite fast.
>
> > Quick and shell sorting are more psychologically complicated,
perhaps,
> > than open hash.
> >
> > But if this is only my prejudice, the use of a sort AFTER all
entries
> > are loaded creates a pragmatic issue.
> >
> > This is that even using a good algorithm there will be a strange
pause
> > after load during sort whereas open hash allows each entry to be
loaded
> > with a time footprint that only degrades slowly.
>
> The pause for sorting is indeed noticable, lasting about
> 5 seconds. But this is only the data of *one day*, and
> certainly will get larger.

Hmm. I suppose you could in future scroll ads or something. Seriously,
this was my concern, but it is your call if the time is acceptable. It
does seem unbounded by anything in particular.
>
> > In the "braindead" approach, you have to wait constant (but large)
time
> > proportional to the size of the data set order 1 while the data set
> > loads, and a roughly equivalent time during sort until you are
> > presented with the first useful results.
> >
> > Whereas an object approach using open hash has the multithread
> > capability to access items before any drama, before the load is
> > complete. During the load you can start work because some keys will
> > already exist, and the purpose of computers is to work us all to
death,
> > isn't it, in paralell.
>
> I agree with you in general, but in this particular case I
> will have to read all data before starting to work with it.
> "work" in my case means: find duplicates.

I think this will embed a drawback, seriously. Basically, you seem to
be giving duplicates free room and board. With a hash based load you do
not.

>
> [...excursion about the advantages of parallel programming
omitted...]

I do run on, don't I? But this was a good faith excursus in real world
terms.
>
> > Open hashing creates the final and useful product, lacking only
closure
> > in the form of all entries, from the beginning and that is why it
MAY
> > be useful here. It stores the minimal amount of crap (eg., zero
crap)
> > per entry.
> >
> > One negative remains the case that there is no requirement for
deletion
> > apart from not storing duplicate keys. This means that you do have
a
> > strong case for brain death here.
> >
> > But my prejudice against "batch" and the step by step production of
> > Useless Things remains precisely because I recall, with less than
> > fondness, eternal vigils in mainframe computer rooms, glazedly
watching
> > der blinkenlights.
> >
> > Of course, real users can switch over to porn sites today or even
do
> > useful work while a "brain dead" approach pipes garbage, only at
the
> > end accomplishing the alchemical step.
> >
> > But the argument for brain death as simplicity itself has to give
way
> > to the realization that processes also need to be "as simple as
> > possible, but no simpler".
>
> I agree with you here. As a computer scientist I appreciate
> an elegant (even complex) solution to a problem. But
> the other, pragmatic half of me knows that a simple solution
> can be more (cost or time) efficient.
>
> I'm trying to solve a real-world problem here, and therefore
> considerations like the time needed to implement it, or the ease
> of maintenance come into play. I consider a solution to be
> "good" for our purpose when it enables us to process the large
> amounts of data in time.
>
> The longer the discussion goes, the more I get the feeling that
> taking the redundancy out of our input data is too expensive
> in terms of development time.
>
The open hash solution is simplicity itself. But it APPEARS to be new
terrain for you. One thing we must do, whether as technicians or
scientists, is perform without passion the self-reflexive task of
deciding to invest our time and through a combination of vanity, and
humility, we often aren't honest with ourselves. The "complexity" is
the distance between what we know and what we don't, and thought itself
is commodified.

I am myself trying to bail our of .Net in favor of Open Source which is
a "simpler" terrain but makes demands upon me.

> But just for the fun of it, did anybody consider solving this
> problem by interpreting the number as a 128-bit unsigned
> integer, and representing the set of numbers seen using
> a sparse bit array?
>
I sure did not. But such an array is based, isn't it, either on hash or
else a list of occupied zones searched in O(n/2) time.
> Heiner



Relevant Pages