Re: fast stable sort



From: Moi <r...@xxxxxxxxxxxxxxxxxxx>
This is the point where the real men survive and the boys fail miserably.

If so, that would prove me to be a "real man" when I implemented an
external merge-sort (for variable-length s-expressions) on my Mac
Plus, which would not have all fit in RAM at the same time.

IMHO, sorting is the most commonly known "guinea pig" for
scalability. Offline-sorting is not put into oblivion, just because
"memory is cheap, these days" . Memory-locality and disk-locality
are basically the same problem.

Indeed, that's why merge-sort is the preferred method of doing
external sort, because you're reading/writing each file
sequentially, so you can take maximal advantage of asynchronous
buffering of I/O which is optimized in regard to size of disk
blocks that are in each bufferful.

how about generating the perfect shuffle, using O(1) memory

With fixed-length records, trivial: Swap randomly-selected record
with last record (no-op if you happen to select the last record
itself), and iterate on smaller sub-arrays with that last element
excluded, until only one record remains in the sub-array. Time is
o(n*k) where n is number of records and k is size of record.

With variable-length records it's not quite so trivial. To swap
two non-adjacent records of differing size, requires bubbling
all the intermediate data to change the relative size of the
two holes. A variation of Knuth's algorithm is probably best:
>1111>|>xxxxxxxxxxxx>|>22222>
Reverse middle records and last record together as unit.
>1111>|<22222<|<xxxxxxxxxxxx<
Reverse first and last-record-now-after-first together as unit:
>22222>|<1111<|<xxxxxxxxxxxx<
Reverse middle records (now at end) and first record (middle) together as unit.
>22222>|>xxxxxxxxxxxx>|>1111>
Notice how two reversals of super-segments of that middle xxxx
segment from different pivots has accomplished sliding the xxxx
segment itself, just like how two reflections equals a
transposition or rotation in 3-space?

Unfortunately with variable-length records, you can't just "pick" a
random record from a packed array. To find the start+end location
of the nth record you need to scan all the way from the start of
the array parsing one record at a time until you have counted n
records. So Knuth's algorithm only slightly speeds up the process,
the total time is dominated by repeatedly parsing records while
counting to the nth record, so time is o(n*n*k), i.e. exactly n-1
random selections, each of which on the average requires o(n*k)
bytes of data to be moved twice each, where k is average record
size instead of fixed record size.

Note you can swap as you count, using Knuth's original algorithm to
swap two adjacent segments at each step until you reach the desired
record. That might lead to more re-usable code, and it's only
slightly slower, same o(n*n*k) as above.

But with variable-length records, you should be able to get
o(n*k*log(n)) by adapting one of the sorting algorithms to sort "at
random" instead of per a deterministic rule. I'm sure you can
simply make up a pseudo-random deterministic hashing function and
sort by that value instead of the original data, using a
*different* hashing function each time you shuffle again. That will
guarantee a fair shuffle. I'm not sure that simply picking a random
bit each time you make a less/greater decision in a sorting
algorithm will assure a fair shuffle, especially with merge-sort
where a shorter run will on average be collated near the beginning
of a longer run, making the overall shuffle biassed. And algorithms
such as QuickSort that use a pivot might be totally broken by an
inconsistent result from the "comparison" function, I'm not sure,
maybe QuickSort would work just fine with random "comparison"
function. For any sorting algorithm where it's possible to make it
fair by using a random "comparison" function, it's a constant
factor faster than using the sort-per-hash method. (But reminder:
QuickSort is only *expected/average* o(log(n)*n*k) but worst-case
o(n*n*k).)
.



Relevant Pages

  • Re: Merging mixed data sets
    ... > which will swap the data members, usually 3 for begin and size ... algorithm is not strongly dependant on random element access ... the running time of the whole algorithm is O). ...
    (comp.lang.cpp)
  • Re: Limiting Disck Cache
    ... > Operating on huge files will always cause the kind of problems ... the algorithm used most commonly now is Least Recently ... > to be taken back out of swap, and put back in RAM. ...
    (Debian-User)
  • Re: One algorithm problem
    ... I have a problem and I am not able to get any algorithm for it. ... have to arrange the elements in array A as they are in B (by swapping ... In the second phase you permute the integer ... content at the address is the address itself) we swap it with the ...
    (comp.programming)
  • Re: /var or /usr for data?
    ... and forget all problems about how to partition your drive right... ... somewhere in my 4.x days that FreeBSD used a different algorithm to write ... up to 6.2 today, I put /var on its own filesystem, after / and swap. ... I think you may be confusing var with swap. ...
    (freebsd-questions)