Re: Efficient algorithm for finding duplicates in 56-bit number

From: gerard46 (gerard46_at_rtt.net)
Date: 01/29/05


Date: Sat, 29 Jan 2005 17:32:50 GMT


| Willem wrote:
|) gerard46 wrote:
|) As an aside, I'd be interested to see how long (wall clock and
|) processor time) it would take to sort millions of those values
|) (and I assume that 2 million is quite less than what Heiner had
|) in mind when he said millions).
|)
|) If the OP is looking for duplicates, all one has to do is just
|) sort the file and see if any of the records match the previous
|) one. No need to do a binary search. This boils the problem
|) down to having some utility do the work sorting the file, and
|) creating a new file if the old one is to be kept intact.
|) Presumably, let the utility run overnight or at some other
|) "slack" time. __________________________________________Gerard S.

| I think you're overestimating the amount of time needed to sort
| mollions of numbers. Basically, if it fits in memory, sorting time
| is in the order of minutes, or maybe even seconds.

Yes, and that'a a heck of a big IF (if it fits in memory). Millions
(let's pick ten as a "good" number) of numbers, each requiring 56 bytes
(as I understood the OP's numbers being 56-byte hex numbers) would
require 1/2 Gigabyte or so of memory. Even if you could read all those
numbers into real memory, I don't think that it can be done in an order
of minutes, and certainly not seconds. But, I could be wrong, of
course. __________________________________________________Gerard S.

| If it doesn't, you're best off with a sort that has very good locality
| of reference, or even tailor-make one that sorts in chunks (where each
| chunk fits in memory) and then merges the chunks.
| You can merge the chunks in one go, if you read those in smaller chunks
| so it fits in memory again, and use maybe a heap structure of pointers
| to chunks to always output the smallest value.
|
| I seem to remember there's a sorting challenge where you have to sort
| like billions or trillions of numbers on big hardware.

What size numbers, 32 bits ? _____________________________Gerard S.



Relevant Pages

  • Partially sorting a linked list
    ... I'd list to implement the best algorithm to "sort" the list so that, ... I'm posting this in the context of a heap memory allocator that works ... collates two fragmented chunks of memory into a bigger chunk each time ...
    (comp.programming)
  • Re: Sorting in huge files
    ... for info on how to do external sorts. ... files that are too large to fit in memory. ... chunks of the file that will fit in memory, ... If you only want to do this sort once, ...
    (comp.lang.python)
  • Re: Efficient algorithm for finding duplicates in 56-bit number
    ... processor time) it would take to sort millions of those values ... chunk fits in memory) and then merges the chunks. ...
    (comp.programming)
  • Re: fastest way to parse a file; Most efficient way to store the data?
    ... tradeoff between memory and speed. ... Since you have to sort anyway, you might as well do that and your ... a string, an offset, and a length. ... Create a Field object for each of them. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: fastest way to parse a file; Most efficient way to store the data?
    ... tradeoff between memory and speed. ... Since you have to sort anyway, you might as well do that and your ... a string, an offset, and a length. ... Create a Field object for each of them. ...
    (microsoft.public.dotnet.languages.csharp)