Re: fast stable sort



From: CBFalconer <cbfalco...@xxxxxxxxx>
if you have an existing array, you can simply arrange an array of
pointers to the items, and sort that.

That is no good on large datasets that won't fit in actual RAM and
hence require swapping pages of virtual memory, because the
comparison function will be random-accessing the actual records in
each merge run instead of accessing them in sequential RAM
addresses, so swapping will thrash.

On small datasets, it doesn't matter, but on small datasets even
bubble sort is faster than you can blink, so there's no point in
doing anything more complicated, as you suggest.

On mid-sized datasets, where bubble sort is too slow, but the
entire dataset fits in actual RAM, your idea may be workable.

Another consideration is the CPU fast cache, which is much smaller
than the totality of RAM you have. Virtually any decent sized
dataset will thrash CPU fast cache with your algorithm, reducing
effective speed by an order of magnitude. Bubble sort might
actually beat out your idea on datasets larger than the CPU fast
cache but small enough for n*n behaviour of bubble-sort not to be
too bad, because bubble sort marches up and down consecutive RAM
addresses *all* the time, at worst sweeping an average of half the
entire array 2*n times, n times each direction, swapping the
left-most and right-most parts of the array in and out with each
sweep but keeping the middle part in the fast cache if the array
isn't *too* large.

IMO you really need to consider locality of reference when choosing
among algorthms that process large datasets.

By the way, the first sorting program I wrote, circa 1970-71,
implemented virtual memory at the user level to hold portions of
long variable-length data records, using a type of radix/key sort
(first 5 7-bit characters of each record were loaded into array,
with disk pointer kept also showing where to get next 5 characters,
packing those 5 characters into one 36-bit PDP-10 word in the usual
way). Heap-sort was used to sort just those 5-character words. Then
the array was traversed to find any duplicate words whereupon that
sub-array got the next 5 characters loaded from virtual memory and
that sub-array recursively sorted. If the data was mostly already
sorted, or if I was sorting the appendation of several
already-sorted files, then fetching those next five characters was
efficient, but if I was sorting a randomly shuffled dataset, such
as if the data was sorted on one key and I wanted to re-sort on
another key, then it thrashed horribly trying to load all those
next-5-characters from a subset of records scattered all over the
datafile.

I learned my lesson. My *second* file-merge program used
Huffman-tree (*) external merge sort instead of any type of
keysort, after making merge runs that each nearly filled RAM.

* (At each merge pass, merge the N smallest files into a single new
file, where N is the order of the merge. I think I used appx.
20-way merge, because my Macintosh allowed at most appx. 25 open
files at one time, so I had 20 input and one output and others
spare for other programs I might want to run in the middle of a
merge run. Thus dataflow is a fork-order-N Huffman coding tree.
The very last merge run was usually of smaller order because
fewer total merge-run files existed at that point.)

Note: My algorithm for knowing when to stop building a merge run
and write it out to disk and start building the next merge run was
really crufty: Macintosh Allegro Common Lisp 1.2.2 provided *no*
way to check how much free storage was still available, and no way
to throw an exception if I try to allocate more data when no free
memory is available. The only way to find out when you have filled
memory was *after* it ran out of memory causing a GC to be
triggered. So what I did was just before starting to build a merge
run I invoke GC myself to make sure all RAM is marked free and
available for loading data before next GC would be triggered, then
after parsing each record of input (and pushing it into a list of
not-yet-sorted records) I checked if another GC had occured.
Meanwhile, with each new record I also allocated and immediately
discarded some slack data, so when the GC occurred it would recover
all that slack data instead of crashing due to memory full. When
that second GC occured, I stopped reading records, called the
built-in SORT on the already-loaded data, and wrote it out to a new
merge-run file.
.



Relevant Pages

  • Re: Hackish file -> string trick
    ... Unless you've got oodles of RAM, a lot of the data will just go ... straight bak to the disk via the swapfile, possibly running out of room on ... trying to load a 1 gig file into memory for random ... Coding for this sort of thing is easy, ...
    (comp.lang.pascal.delphi.misc)
  • Re: Memory Q: 1gb dual channel 320mhz or 2gb single channel 266mhz?
    ... > enjoy this sort of thing.. ... > uses about 75% of the memory while it is working. ... I have 1GB RAM installed. ... If it is the same for you, the 1GB DC setup will be ...
    (alt.comp.periphs.mainboard.asus)
  • Re: "Sorting" assignment
    ... And many others prefer to call partition exchange because "quicksort" ... bin B depending on whether it is greater than, ... If the array is already sorted, this means that you end up ... attempt to sort them. ...
    (comp.programming)
  • Re: A Fast sorting algorithm for almost sorted data
    ... far my compressor has potential but is nowhere near ready. ... It does however make heavy use of sorting. ... which I am currently calling Run sort. ... entire selected run can be added to the sorted output array. ...
    (comp.compression)
  • Re: Save & Sort
    ... You can copy your array to a scratch ... "Heap" sort. ... Dim lst As Long ... Dim tmp As String ...
    (microsoft.public.excel.programming)