Demo: Sorts (revised)

From: Robert Wagner (robert.deletethis_at_wagner.net)
Date: 06/23/04


Date: Wed, 23 Jun 2004 08:34:52 GMT

This is a re=posting of the sorts demo with one added test: inserting records
into an indexed file and reading them back in sequence. Jerry Mouse said this
technique is "efficient". In FACT, it ran 15 times slower than a Cobol sort and
more than 100 times slower than a fast sort.

* I'll take Sorts And Trees for a thousand, Alex
*
* These two programs -- Sorts and Trees -- demonstrate Cobol Features
* overlooked by most programmers, while running timing tests on
* sort algorithms.
*
* Glossary.
* We will see three data structures:
*
* Table An array defined with Occurs. Same as a sequential file.
*
* List Each entry has a pointer or index containing the address
* of the next entry, and optionally the previous. Same as
* a 'random' file.
*
* Tree Like a family tree, where each node contains one data
* value and pointers to up to two children, named Left,
* containing entries less than the node's data, and Right.
* Nodes with fewer than two children are called Leaves.
* Same as an indexed file.
*
* Sorts program
*
* Feature: Nested Programs
*
* Cobol is often criticized for having 'an ocean of data' in
* working-storage that is globally accessible to all parts of the
* program. It is also criticized for having overly large monolithic
* programs. Nested programs, seen below, are one answer to that
* objection; we will see another in Trees. They are written like
* external called programs, but they are in the same source file
* with the caller. If they were separately compiled, there would
* be additional administrative work to list them on program
* inventories, promote them through Change Management, perhaps
* list them as dependents in make files.
*
* Nested programs have their own private working-storage and lack
* access to their parent(s)' working-storage unless passed as
* a parameter or declared Global. They have been in the ANSI
* Cobol Standard since 1985.
*
* Test 0. Indexed file sort
*
* Inserts records into an indexed file, then reads them back in sequence.
* This is included as a bad example, the wrong way to sort data.
* It runs 15 times slower than a Cobol sort.
*
* Test 1. Cobol file sort
*
* This is a familiar Cobol sort with input and output procedures,
* release and return records. No Features here, it is shown to provide a
* timing baseline.
*
* Test 2. Micro Focus table sort
*
* Feature: Table sort
*
* Micro Focus provides an extension allowing you sort a table in memory.
* When we time it, we find its speed is almost identical to the file
* sort in test 1, and half the speed it 'should' be as shown in test 3.
* Apparently, under the covers it is releasing records to and returning
* records from the file sort. This was an extension to the 1985 Standard,
* is included in the 2002 Standard.
*
* Test 3. Quick sort
*
* Feature: Calling C language functions from Cobol
*
* This test calls qsort() in the Standard C Library, and directs it to
* call memcmp().
*
* How do you tell Cobol to use the C calling convention? Not necessary,
* Micro Focus uses the C convention by default. How do you tell the
* linker (loader) to look in C libraries? Not necessary. We stopped
* doing static binding (link edit) many years ago. Binding is now done
* at execution time by a program named "ld", which follows
* LD_LIBRARY_PATH. The C Standard Library is in that path by default. In
* short, you don't have to do anything special. You can see libraries
* that will be loaded at execution time by typing "ldd sorts".
*
* C libraries contain a plethora of useful functions. See
* http://www.tcnj.edu/~cs/doc/CStdLib.html
* also
* http://www.bebits.com/app/2917
* also
* a hundred others in /usr/lib, many defined in Standards
*
* The speed of quick sort is said to be proportional to N log2(N), where
* N is the number of entries and log2 is the base 2 logarithm. Most
* high-speed algorithms run at this speed, for example Shell, Heap and
* Comb sorts. Low-speed sorts such as Bubble run at N*N, are suitable
* only for small values of N, say less than 100.
*
* Tests 4 & 5. Radix sort followed by Insertion sort
*
* Features: The radix algorithm
* Linked lists using indexes
*
* When we had to sort a large number of punched cards, we put them into
* approximate sequence by sorting on the first few columns with a card
* sorter. Next we put them through a machine called a collator, which
* pulled cards out of order, typically 5-10%. We sorted those cards on
* the full key, then used a collator to merge them into the big deck.
*
* Would you believe the card sorting algorithm, written in Cobol, is
* five times faster than the best high-speed algorithms written in C,
* and about ten times faster than the sorts delivered by Micro Focus?
*
* The algorithm inspects one 'column' at a time, right to left, tossing
* 'cards' into 'pockets' by linking the 'card' to what had previously
* been the last in the 'pocket'. At the end of each pass, it 'gathers'
* the pocket strings into a single string ('deck') by linking the last
* 'card' in a 'pocket' to the first 'card' in the next 'pocket'.
*
* A 'column' can be any convenient size -- one bit, one byte or one
* word. The demo program defines a column as two bytes, 16 bits. This
* requires 65,536 'pockets', of which only a few thousand are actually
* used.
*
* When cards are approximately in order, the program 'merges' the ones
* out of order into place with an Insertion Sort. As a primary
* algorithm, Insertion is slow, close to N*N. But when the entries are
* almost in sequence, it is quite fast. It almost always puts them in
* order with one pass, then makes a second pass to verify that the job
* is complete. It takes a third pass only when there are two consecutive
* steps down. In the 50,000 item test, that doesn't happen, it runs in
* two passes. If radix had sorted on seven bytes rather than six, they
* would have been in perfect order. Insertion would have taken only one
* pass.
*
* The Radix-Insertion algorithm speed is proportional to kN + 2N+, where
* k is the number of columns sorted by Radix (in this case 3) and the +
* following 2N is close-range rearrangements by Insertion. In THEORY,
* the relative speeds of test 3 vs. test 5 should be 50,000 *
* log2(50,000) = 785,000 vs. 3*50,000 + 2*50,000+ =~ 300,000. The ratio
* is 2.6. In PRACTICE, it is 5.0 times as fast. As usual, the devil is
* in the details. If the test program had used subscripts rather than
* indexes, its speed would have been cut in half. Not because indexes
* are that much faster, because every subscript operation generates a
* call to the runtime whereas index operations produce inline code (seen
* via the now commented ASM options).
*
* [If your calculator doesn't have log2, take ln (base e) and divide by .69,
* which is ln(2).]
*
* To compile this "cob -x sorts.cbl". -x produces a Unix executable.
* To execute "sorts".
*
* Continued in trees.cbl
*
* ---------------------------------------------------------------------

      $SET SOURCEFORMAT"FREE"
      $SET NOBOUND OPT"2" NOTRUNC IBMCOMP NOCHECK
* $SET ASM ASMLIST SOURCEASM
 identification division.
 program-id. Sorts.
 author. Robert Wagner.
* Compare speed of four sorts.

* Results: Sun SPARC, 1.8 GHz, 50,000 30-byte entries, time in
* seconds for 10 iterations.
* --- ratio ---
* 2. 3. 4. 5.
* 0. Indexed file sort 55.4 14.6 22.2 110 139
* 1. Cobol file sort 3.7 1.0 1.5 7.4 9.3
* 2. Micro Focus table sort 3.8 1.5 7.6 9.5
* 3. qsort() 2.5 5.0 6.3
* 4. radix-insertion .5 1.3
* 5. radix to linked list .4
*
* Trees.cbl contains an expanded version of these test results.
 working-storage section.
 01 test-data GLOBAL.
     05 radix-columns comp pic s9(02).
     05 test-number comp pic s9(02).
     05 card occurs 50001 indexed *> the data
            x-n first-n last-n previous-n-0 previous-n-1 previous-n-2 x-last.
         10 sort-key.
             15 key-column occurs 15 indexed x-c comp-x pic xx.
         10 next-n index.
         10 next-n-1 index.

 01 init-data.
     05 init-columns value zero comp pic s9(02).
     05 init-number value zero comp pic s9(02).
     05 init-card occurs 50001 indexed i-n i-next i-last.
         10 init-key-1 pic v9(18).
         10 init-key-2 pic v9(12).
         10 init-next index.
         10 init-next-1 index.

 01 timer-variables.
     05 repeat-factor value 10 comp pic s9(09).
     05 previous-key pic x(30).
     05 card-count comp pic s9(09).
     05 start-time.
         10 start-hours pic 9(02).
         10 start-minutes pic 9(02).
         10 start-seconds pic 9(02).
         10 start-hundredths pic 9(02).
     05 end-time.
         10 end-hours pic 9(02).
         10 end-minutes pic 9(02).
         10 end-seconds pic 9(02).
         10 end-hundredths pic 9(02).
     05 elapsed-time pic z(04).9.

 procedure division.
   perform construct-data

   display '0. Indexed file sort' with no advancing
   move 0 to init-number
   move 2 to repeat-factor
   perform timer-on
   perform repeat-factor times
       perform initialize-test-data
       call 'indexed-file-sort'
   end-perform
   perform timer-off
   perform verify-sort

   display '1. Cobol file sort' with no advancing
   move 1 to init-number
   move 10 to repeat-factor
   perform timer-on
   perform repeat-factor times
       perform initialize-test-data
       call 'cobol-sort'
   end-perform
   perform timer-off
   perform verify-sort

   display '2. Micro Focus sort' with no advancing
   move 2 to init-number
   perform timer-on
   perform repeat-factor times
       perform initialize-test-data
       call 'mf-sort'
   end-perform
   perform timer-off
   perform verify-sort

   display '3. quick sort N log2(N)' with no advancing
   move 3 to init-number
   perform timer-on
   perform repeat-factor times
       perform initialize-test-data
       call 'quick-sort'
   end-perform
   perform timer-off
   perform verify-sort

   display '4. radix-insertion kN + 2N+' with no advancing
   move 4 to init-number
   move 3 to init-columns
   perform timer-on
   perform repeat-factor times
       perform initialize-test-data
       call 'radix-sort'
   end-perform
   perform timer-off
   move zero to init-columns
   perform verify-sort

   display '5. radix-insertion to linked list' with no advancing
   move 5 to init-number
   move 3 to init-columns
   perform timer-on
   perform repeat-factor times
       perform initialize-test-data
       call 'radix-sort'
   end-perform
   perform timer-off

   goback

 . construct-data.
   set x-last, i-last to 50001
   perform varying i-n from 1 by 1 until i-n = i-last
       compute init-key-1 (i-n) = function random
       compute init-key-2 (i-n) = function random
       set i-next to i-n
       set i-next up by 1
       set init-next (i-n) to i-next
       set init-next-1 (i-n) to i-last
   end-perform
   move high-values to init-card (i-last)

 . initialize-test-data.
   move init-data to test-data
   set first-n to 1

 . verify-sort.
   move zero to previous-key, card-count
   set x-n to first-n
   if x-n equal to x-last
       display 'test failed, no first'
   end-if
   perform until x-n equal to x-last
       if sort-key (x-n) less than previous-key
           display 'test failed '
               previous-key space sort-key (x-n)
           set x-n to x-last
       else
           move sort-key (x-n) to previous-key
           if init-columns not equal to zero
               set x-n to next-n (x-n)
           else
               set x-n up by 1
           end-if
           add 1 to card-count
       end-if
   end-perform
   if card-count not equal to 50000
       display 'test failed, count is ' card-count
   end-if

 . timer-on.
     accept start-time from time
 . timer-off.
     accept end-time from time
     compute elapsed-time rounded =
       ((((((((end-hours * 60) +
           end-minutes) * 60) +
           end-seconds) * 100) +
           end-hundredths) -
         ((((((start-hours * 60) +
           start-minutes) * 60) +
           start-seconds) * 100) +
           start-hundredths)) / 100)
       * 10 / repeat-factor
     display elapsed-time ' seconds'
 .

 identification division.
 program-id. indexed-file-sort.
* 0. Performance of using an indexed file to sort data.
 environment division.
 input-output section.
 file-control.
    select indexed-file assign to 'indexed-file'
       organization is indexed
       access mode is dynamic
       record key is index-key.

 data division.
 file section.
 fd indexed-file.
 01 indexed-record.
     05 index-key pic x(30).

 procedure division.
   open output indexed-file

   perform varying x-n from 1 by 1 until x-n = x-last
       write indexed-record from sort-key (x-n)
   end-perform

   close indexed-file
   open input indexed-file

   perform varying x-n from 1 by 1 until x-n = x-last
       read indexed-file next into sort-key (x-n) at end
           display 'end of file'
   end-perform

   close indexed-file
   goback
 . end program indexed-file-sort.

 identification division.
 program-id. cobol-sort.
* 1. Performance of Micro Focus file sort.
 environment division.
 input-output section.
 file-control.
     select sort-file assign to 'sortwork'.
 data division.
 file section.
 sd sort-file.
 01 file-record.
     05 file-key pic x(30).
 procedure division.
   sort sort-file on ascending file-key
       input procedure input-procedure
       output procedure output-procedure
   goback
 . input-procedure.
   perform varying x-n from 1 by 1 until x-n = x-last
       release file-record from sort-key (x-n)
   end-perform
 . output-procedure.
   perform varying x-n from 1 by 1 until x-n = x-last
       return sort-file into sort-key (x-n) at end continue
   end-perform
 . end program cobol-sort.

 identification division.
 program-id. mf-sort.
* 2. Performance of Micro Focus table sort.
* Appears to be using above file sort, because speed is nearly
* identical, and half as fast as it 'should' be.
 procedure division.
   sort card on ascending sort-key
   goback
 . end program mf-sort.

 identification division.
 program-id. quick-sort.
* 3. Performance of quick sort in the C library
 working-storage section.
 01 memcmp-pointer procedure-pointer.
 procedure division.
   set memcmp-pointer to entry 'memcmp'

   call 'qsort' using
       card (1)
       by value 50000
       by value length of card
       by value memcmp-pointer

   goback
 . end program quick-sort.

 identification division.
 program-id. radix-sort.
* 4 & 5. Radix sort followed by insertion sort.
* This treats cards as a linked list.
* It does not move cards; it changes forward 'pointers' next-n.
* Test 4 converts the list to a table (for searching); test 5 does not.
 data division.
 working-storage section.
 01 radix-variables.
     05 swap-count comp pic s9(09).
     05 value '00'.
         10 first-pocket comp-x pic xx.
     05 value '99'.
         10 last-pocket comp-x pic xx.
     05 radix-pockets.
         10 radix-pocket occurs 65536 indexed pocket, p-last.
             15 first-in-pocket index.
             15 last-in-pocket index.
 01 sorted-data.
     05 sorted-columns value zero comp pic s9(02).
     05 sorted-number value zero comp pic s9(02).
     05 sorted-card occurs 50001 indexed s-n.
         10 sorted-key pic x(30).
         10 index.
         10 index.

 procedure division.
   set p-last to last-pocket
   set p-last up by 1
* One iteration per column. Speed is kN, where k is the key length
* in columns. In this case a column is two bytes.
   perform varying x-c from radix-columns by -1 until x-c < 1
* RUN through the input string, tossing entries into pockets
* (uppercase words refer to card sorter operations)
       perform varying pocket from first-pocket by 1 until pocket > p-last
           set first-in-pocket (pocket), last-in-pocket (pocket) to x-last
       end-perform
       add 1 to last-pocket
       set x-n to first-n
       perform until x-n equal to x-last
* determine pocket
           set pocket to key-column (x-n, x-c)
           set pocket up by 1
           if first-in-pocket (pocket) equal to x-last
               set first-in-pocket (pocket) to x-n
           end-if
           set last-n to last-in-pocket (pocket)
* link to previous last in pocket
           if last-in-pocket (pocket) not equal to x-last
               set next-n-1 (last-n) to x-n
           end-if
* make this last in pocket
           set last-in-pocket (pocket) to x-n
           set x-n to next-n (x-n)
       end-perform
* GATHER - link tail of each pocket to head of next
       set first-n, last-n to x-last
       perform varying pocket from first-pocket by 1 until pocket > p-last
           if first-in-pocket (pocket) not = x-last
           set x-n to first-in-pocket (pocket)
               if first-n equal to x-last
                   set first-n to x-n *> first card
               end-if
               if last-n not equal to x-last
                   set next-n-1 (last-n) to x-n
               end-if
               set last-n to last-in-pocket (pocket)
           end-if
       end-perform
       set next-n-1 (last-n) to x-last *> last card
* MOUNT - input is now in the new order
       perform varying x-n from 1 by 1 until x-n = x-last
           set next-n (x-n) to next-n-1 (x-n)
       end-perform
   end-perform

* If radix-columns = 15, the above is all you need.
* By itself, it runs faster than the MF sort when radix-columns < 9.
* The optimized solution sets radix-columns so the radix sort puts
* data almost in order. It then merges out of order data into place
* using an insertion sort, which usually works in two iterations --
* one to sort and one to verify.
* The total time is 3N + 2N+ = 5N+

* MERGE out of order cards
* This is a modified insertion sort which does not
* require backward links. When finding a step down, it searches
* forward looking for a step up, then exchanges the big record before
* the step down with the little record before the step up, then takes
* a step backward. It cannot work in one pass because the step back
* cannot find (n - 2) when there are two consecutive steps back.
* I don't know whether this is a documented algorithm; I just made it up.

   move 1 to swap-count
   perform until swap-count = zero
       move zero to swap-count
       move card (first-n) to sorted-card (1)
       set s-n to 2
       set previous-n-0 to x-last
       set previous-n-1 to first-n
       set x-n to next-n (first-n)
       perform until x-n equal to x-last
           if sort-key (x-n) less than sort-key (previous-n-1)
               perform until x-n equal to x-last or
                   sort-key (x-n) not less than sort-key (previous-n-1)
                   set previous-n-2 to x-n
                   set x-n to next-n (x-n)
               end-perform
* exchange two cards (could use three moves)
   call 'CBL_XOR' using sort-key (previous-n-1) sort-key (previous-n-2)
       by value length of sort-key
   call 'CBL_XOR' using sort-key (previous-n-2) sort-key (previous-n-1)
       by value length of sort-key
   call 'CBL_XOR' using sort-key (previous-n-1) sort-key (previous-n-2)
       by value length of sort-key
* move sort-key (previous-n-1) to sort-key (x-last)
* move sort-key (previous-n-2) to sort-key (previous-n-1)
* move sort-key (x-last) to sort-key (previous-n-2)
* move high-values to sort-key (x-last)
               set x-n to previous-n-0
               if x-n equal to x-last
                   set x-n, first-n to previous-n-1
               end-if
               add 1 to swap-count
           end-if
* Straighten out linked list so program can Search All. Adds 25% to time.
           if test-number equal to 4 and
               swap-count equal to zero
               move sort-key (x-n) to sorted-key (s-n)
               set s-n up by 1
           end-if
           set previous-n-0 to previous-n-1
           set previous-n-1 to x-n
           set x-n to next-n (x-n)
       end-perform
   end-perform
   if test-number equal to 4
       move high-values to sorted-key (x-last)
       move sorted-data to test-data
   end-if
   set first-n to 1

   goback
 . end program radix-sort
 . end program Sorts.



Relevant Pages

  • Re: Demo: Sorts (revised)
    ... > more than 100 times slower than a fast sort. ... > * This is a familiar Cobol sort with input and output procedures, ... > perform timer-on ... > perform repeat-factor times ...
    (comp.lang.cobol)
  • Sorts
    ... These two programs -- Sorts and Lists -- demonstrate Cobol Features ... sort algorithms. ... When we had to sort a large number of punched cards, ... perform timer-on ...
    (comp.lang.cobol)