Demo: Sorts (revised)
From: Robert Wagner (robert.deletethis_at_wagner.net)
Date: 06/23/04
- Next message: Lueko Willms: "Re: Method to force keeping of source was Re: Is it possible to use th"
- Previous message: Peter E. C. Dashwood: "Re: Concerning the state of CLC..."
- Next in thread: Michael Mattias: "Re: Sorts (revised)"
- Reply: Michael Mattias: "Re: Sorts (revised)"
- Reply: JerryMouse: "Re: Sorts (revised)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 23 Jun 2004 08:34:52 GMT
This is a re=posting of the sorts demo with one added test: inserting records
into an indexed file and reading them back in sequence. Jerry Mouse said this
technique is "efficient". In FACT, it ran 15 times slower than a Cobol sort and
more than 100 times slower than a fast sort.
* I'll take Sorts And Trees for a thousand, Alex
*
* These two programs -- Sorts and Trees -- demonstrate Cobol Features
* overlooked by most programmers, while running timing tests on
* sort algorithms.
*
* Glossary.
* We will see three data structures:
*
* Table An array defined with Occurs. Same as a sequential file.
*
* List Each entry has a pointer or index containing the address
* of the next entry, and optionally the previous. Same as
* a 'random' file.
*
* Tree Like a family tree, where each node contains one data
* value and pointers to up to two children, named Left,
* containing entries less than the node's data, and Right.
* Nodes with fewer than two children are called Leaves.
* Same as an indexed file.
*
* Sorts program
*
* Feature: Nested Programs
*
* Cobol is often criticized for having 'an ocean of data' in
* working-storage that is globally accessible to all parts of the
* program. It is also criticized for having overly large monolithic
* programs. Nested programs, seen below, are one answer to that
* objection; we will see another in Trees. They are written like
* external called programs, but they are in the same source file
* with the caller. If they were separately compiled, there would
* be additional administrative work to list them on program
* inventories, promote them through Change Management, perhaps
* list them as dependents in make files.
*
* Nested programs have their own private working-storage and lack
* access to their parent(s)' working-storage unless passed as
* a parameter or declared Global. They have been in the ANSI
* Cobol Standard since 1985.
*
* Test 0. Indexed file sort
*
* Inserts records into an indexed file, then reads them back in sequence.
* This is included as a bad example, the wrong way to sort data.
* It runs 15 times slower than a Cobol sort.
*
* Test 1. Cobol file sort
*
* This is a familiar Cobol sort with input and output procedures,
* release and return records. No Features here, it is shown to provide a
* timing baseline.
*
* Test 2. Micro Focus table sort
*
* Feature: Table sort
*
* Micro Focus provides an extension allowing you sort a table in memory.
* When we time it, we find its speed is almost identical to the file
* sort in test 1, and half the speed it 'should' be as shown in test 3.
* Apparently, under the covers it is releasing records to and returning
* records from the file sort. This was an extension to the 1985 Standard,
* is included in the 2002 Standard.
*
* Test 3. Quick sort
*
* Feature: Calling C language functions from Cobol
*
* This test calls qsort() in the Standard C Library, and directs it to
* call memcmp().
*
* How do you tell Cobol to use the C calling convention? Not necessary,
* Micro Focus uses the C convention by default. How do you tell the
* linker (loader) to look in C libraries? Not necessary. We stopped
* doing static binding (link edit) many years ago. Binding is now done
* at execution time by a program named "ld", which follows
* LD_LIBRARY_PATH. The C Standard Library is in that path by default. In
* short, you don't have to do anything special. You can see libraries
* that will be loaded at execution time by typing "ldd sorts".
*
* C libraries contain a plethora of useful functions. See
* http://www.tcnj.edu/~cs/doc/CStdLib.html
* also
* http://www.bebits.com/app/2917
* also
* a hundred others in /usr/lib, many defined in Standards
*
* The speed of quick sort is said to be proportional to N log2(N), where
* N is the number of entries and log2 is the base 2 logarithm. Most
* high-speed algorithms run at this speed, for example Shell, Heap and
* Comb sorts. Low-speed sorts such as Bubble run at N*N, are suitable
* only for small values of N, say less than 100.
*
* Tests 4 & 5. Radix sort followed by Insertion sort
*
* Features: The radix algorithm
* Linked lists using indexes
*
* When we had to sort a large number of punched cards, we put them into
* approximate sequence by sorting on the first few columns with a card
* sorter. Next we put them through a machine called a collator, which
* pulled cards out of order, typically 5-10%. We sorted those cards on
* the full key, then used a collator to merge them into the big deck.
*
* Would you believe the card sorting algorithm, written in Cobol, is
* five times faster than the best high-speed algorithms written in C,
* and about ten times faster than the sorts delivered by Micro Focus?
*
* The algorithm inspects one 'column' at a time, right to left, tossing
* 'cards' into 'pockets' by linking the 'card' to what had previously
* been the last in the 'pocket'. At the end of each pass, it 'gathers'
* the pocket strings into a single string ('deck') by linking the last
* 'card' in a 'pocket' to the first 'card' in the next 'pocket'.
*
* A 'column' can be any convenient size -- one bit, one byte or one
* word. The demo program defines a column as two bytes, 16 bits. This
* requires 65,536 'pockets', of which only a few thousand are actually
* used.
*
* When cards are approximately in order, the program 'merges' the ones
* out of order into place with an Insertion Sort. As a primary
* algorithm, Insertion is slow, close to N*N. But when the entries are
* almost in sequence, it is quite fast. It almost always puts them in
* order with one pass, then makes a second pass to verify that the job
* is complete. It takes a third pass only when there are two consecutive
* steps down. In the 50,000 item test, that doesn't happen, it runs in
* two passes. If radix had sorted on seven bytes rather than six, they
* would have been in perfect order. Insertion would have taken only one
* pass.
*
* The Radix-Insertion algorithm speed is proportional to kN + 2N+, where
* k is the number of columns sorted by Radix (in this case 3) and the +
* following 2N is close-range rearrangements by Insertion. In THEORY,
* the relative speeds of test 3 vs. test 5 should be 50,000 *
* log2(50,000) = 785,000 vs. 3*50,000 + 2*50,000+ =~ 300,000. The ratio
* is 2.6. In PRACTICE, it is 5.0 times as fast. As usual, the devil is
* in the details. If the test program had used subscripts rather than
* indexes, its speed would have been cut in half. Not because indexes
* are that much faster, because every subscript operation generates a
* call to the runtime whereas index operations produce inline code (seen
* via the now commented ASM options).
*
* [If your calculator doesn't have log2, take ln (base e) and divide by .69,
* which is ln(2).]
*
* To compile this "cob -x sorts.cbl". -x produces a Unix executable.
* To execute "sorts".
*
* Continued in trees.cbl
*
* ---------------------------------------------------------------------
$SET SOURCEFORMAT"FREE"
$SET NOBOUND OPT"2" NOTRUNC IBMCOMP NOCHECK
* $SET ASM ASMLIST SOURCEASM
identification division.
program-id. Sorts.
author. Robert Wagner.
* Compare speed of four sorts.
* Results: Sun SPARC, 1.8 GHz, 50,000 30-byte entries, time in
* seconds for 10 iterations.
* --- ratio ---
* 2. 3. 4. 5.
* 0. Indexed file sort 55.4 14.6 22.2 110 139
* 1. Cobol file sort 3.7 1.0 1.5 7.4 9.3
* 2. Micro Focus table sort 3.8 1.5 7.6 9.5
* 3. qsort() 2.5 5.0 6.3
* 4. radix-insertion .5 1.3
* 5. radix to linked list .4
*
* Trees.cbl contains an expanded version of these test results.
working-storage section.
01 test-data GLOBAL.
05 radix-columns comp pic s9(02).
05 test-number comp pic s9(02).
05 card occurs 50001 indexed *> the data
x-n first-n last-n previous-n-0 previous-n-1 previous-n-2 x-last.
10 sort-key.
15 key-column occurs 15 indexed x-c comp-x pic xx.
10 next-n index.
10 next-n-1 index.
01 init-data.
05 init-columns value zero comp pic s9(02).
05 init-number value zero comp pic s9(02).
05 init-card occurs 50001 indexed i-n i-next i-last.
10 init-key-1 pic v9(18).
10 init-key-2 pic v9(12).
10 init-next index.
10 init-next-1 index.
01 timer-variables.
05 repeat-factor value 10 comp pic s9(09).
05 previous-key pic x(30).
05 card-count comp pic s9(09).
05 start-time.
10 start-hours pic 9(02).
10 start-minutes pic 9(02).
10 start-seconds pic 9(02).
10 start-hundredths pic 9(02).
05 end-time.
10 end-hours pic 9(02).
10 end-minutes pic 9(02).
10 end-seconds pic 9(02).
10 end-hundredths pic 9(02).
05 elapsed-time pic z(04).9.
procedure division.
perform construct-data
display '0. Indexed file sort' with no advancing
move 0 to init-number
move 2 to repeat-factor
perform timer-on
perform repeat-factor times
perform initialize-test-data
call 'indexed-file-sort'
end-perform
perform timer-off
perform verify-sort
display '1. Cobol file sort' with no advancing
move 1 to init-number
move 10 to repeat-factor
perform timer-on
perform repeat-factor times
perform initialize-test-data
call 'cobol-sort'
end-perform
perform timer-off
perform verify-sort
display '2. Micro Focus sort' with no advancing
move 2 to init-number
perform timer-on
perform repeat-factor times
perform initialize-test-data
call 'mf-sort'
end-perform
perform timer-off
perform verify-sort
display '3. quick sort N log2(N)' with no advancing
move 3 to init-number
perform timer-on
perform repeat-factor times
perform initialize-test-data
call 'quick-sort'
end-perform
perform timer-off
perform verify-sort
display '4. radix-insertion kN + 2N+' with no advancing
move 4 to init-number
move 3 to init-columns
perform timer-on
perform repeat-factor times
perform initialize-test-data
call 'radix-sort'
end-perform
perform timer-off
move zero to init-columns
perform verify-sort
display '5. radix-insertion to linked list' with no advancing
move 5 to init-number
move 3 to init-columns
perform timer-on
perform repeat-factor times
perform initialize-test-data
call 'radix-sort'
end-perform
perform timer-off
goback
. construct-data.
set x-last, i-last to 50001
perform varying i-n from 1 by 1 until i-n = i-last
compute init-key-1 (i-n) = function random
compute init-key-2 (i-n) = function random
set i-next to i-n
set i-next up by 1
set init-next (i-n) to i-next
set init-next-1 (i-n) to i-last
end-perform
move high-values to init-card (i-last)
. initialize-test-data.
move init-data to test-data
set first-n to 1
. verify-sort.
move zero to previous-key, card-count
set x-n to first-n
if x-n equal to x-last
display 'test failed, no first'
end-if
perform until x-n equal to x-last
if sort-key (x-n) less than previous-key
display 'test failed '
previous-key space sort-key (x-n)
set x-n to x-last
else
move sort-key (x-n) to previous-key
if init-columns not equal to zero
set x-n to next-n (x-n)
else
set x-n up by 1
end-if
add 1 to card-count
end-if
end-perform
if card-count not equal to 50000
display 'test failed, count is ' card-count
end-if
. timer-on.
accept start-time from time
. timer-off.
accept end-time from time
compute elapsed-time rounded =
((((((((end-hours * 60) +
end-minutes) * 60) +
end-seconds) * 100) +
end-hundredths) -
((((((start-hours * 60) +
start-minutes) * 60) +
start-seconds) * 100) +
start-hundredths)) / 100)
* 10 / repeat-factor
display elapsed-time ' seconds'
.
identification division.
program-id. indexed-file-sort.
* 0. Performance of using an indexed file to sort data.
environment division.
input-output section.
file-control.
select indexed-file assign to 'indexed-file'
organization is indexed
access mode is dynamic
record key is index-key.
data division.
file section.
fd indexed-file.
01 indexed-record.
05 index-key pic x(30).
procedure division.
open output indexed-file
perform varying x-n from 1 by 1 until x-n = x-last
write indexed-record from sort-key (x-n)
end-perform
close indexed-file
open input indexed-file
perform varying x-n from 1 by 1 until x-n = x-last
read indexed-file next into sort-key (x-n) at end
display 'end of file'
end-perform
close indexed-file
goback
. end program indexed-file-sort.
identification division.
program-id. cobol-sort.
* 1. Performance of Micro Focus file sort.
environment division.
input-output section.
file-control.
select sort-file assign to 'sortwork'.
data division.
file section.
sd sort-file.
01 file-record.
05 file-key pic x(30).
procedure division.
sort sort-file on ascending file-key
input procedure input-procedure
output procedure output-procedure
goback
. input-procedure.
perform varying x-n from 1 by 1 until x-n = x-last
release file-record from sort-key (x-n)
end-perform
. output-procedure.
perform varying x-n from 1 by 1 until x-n = x-last
return sort-file into sort-key (x-n) at end continue
end-perform
. end program cobol-sort.
identification division.
program-id. mf-sort.
* 2. Performance of Micro Focus table sort.
* Appears to be using above file sort, because speed is nearly
* identical, and half as fast as it 'should' be.
procedure division.
sort card on ascending sort-key
goback
. end program mf-sort.
identification division.
program-id. quick-sort.
* 3. Performance of quick sort in the C library
working-storage section.
01 memcmp-pointer procedure-pointer.
procedure division.
set memcmp-pointer to entry 'memcmp'
call 'qsort' using
card (1)
by value 50000
by value length of card
by value memcmp-pointer
goback
. end program quick-sort.
identification division.
program-id. radix-sort.
* 4 & 5. Radix sort followed by insertion sort.
* This treats cards as a linked list.
* It does not move cards; it changes forward 'pointers' next-n.
* Test 4 converts the list to a table (for searching); test 5 does not.
data division.
working-storage section.
01 radix-variables.
05 swap-count comp pic s9(09).
05 value '00'.
10 first-pocket comp-x pic xx.
05 value '99'.
10 last-pocket comp-x pic xx.
05 radix-pockets.
10 radix-pocket occurs 65536 indexed pocket, p-last.
15 first-in-pocket index.
15 last-in-pocket index.
01 sorted-data.
05 sorted-columns value zero comp pic s9(02).
05 sorted-number value zero comp pic s9(02).
05 sorted-card occurs 50001 indexed s-n.
10 sorted-key pic x(30).
10 index.
10 index.
procedure division.
set p-last to last-pocket
set p-last up by 1
* One iteration per column. Speed is kN, where k is the key length
* in columns. In this case a column is two bytes.
perform varying x-c from radix-columns by -1 until x-c < 1
* RUN through the input string, tossing entries into pockets
* (uppercase words refer to card sorter operations)
perform varying pocket from first-pocket by 1 until pocket > p-last
set first-in-pocket (pocket), last-in-pocket (pocket) to x-last
end-perform
add 1 to last-pocket
set x-n to first-n
perform until x-n equal to x-last
* determine pocket
set pocket to key-column (x-n, x-c)
set pocket up by 1
if first-in-pocket (pocket) equal to x-last
set first-in-pocket (pocket) to x-n
end-if
set last-n to last-in-pocket (pocket)
* link to previous last in pocket
if last-in-pocket (pocket) not equal to x-last
set next-n-1 (last-n) to x-n
end-if
* make this last in pocket
set last-in-pocket (pocket) to x-n
set x-n to next-n (x-n)
end-perform
* GATHER - link tail of each pocket to head of next
set first-n, last-n to x-last
perform varying pocket from first-pocket by 1 until pocket > p-last
if first-in-pocket (pocket) not = x-last
set x-n to first-in-pocket (pocket)
if first-n equal to x-last
set first-n to x-n *> first card
end-if
if last-n not equal to x-last
set next-n-1 (last-n) to x-n
end-if
set last-n to last-in-pocket (pocket)
end-if
end-perform
set next-n-1 (last-n) to x-last *> last card
* MOUNT - input is now in the new order
perform varying x-n from 1 by 1 until x-n = x-last
set next-n (x-n) to next-n-1 (x-n)
end-perform
end-perform
* If radix-columns = 15, the above is all you need.
* By itself, it runs faster than the MF sort when radix-columns < 9.
* The optimized solution sets radix-columns so the radix sort puts
* data almost in order. It then merges out of order data into place
* using an insertion sort, which usually works in two iterations --
* one to sort and one to verify.
* The total time is 3N + 2N+ = 5N+
* MERGE out of order cards
* This is a modified insertion sort which does not
* require backward links. When finding a step down, it searches
* forward looking for a step up, then exchanges the big record before
* the step down with the little record before the step up, then takes
* a step backward. It cannot work in one pass because the step back
* cannot find (n - 2) when there are two consecutive steps back.
* I don't know whether this is a documented algorithm; I just made it up.
move 1 to swap-count
perform until swap-count = zero
move zero to swap-count
move card (first-n) to sorted-card (1)
set s-n to 2
set previous-n-0 to x-last
set previous-n-1 to first-n
set x-n to next-n (first-n)
perform until x-n equal to x-last
if sort-key (x-n) less than sort-key (previous-n-1)
perform until x-n equal to x-last or
sort-key (x-n) not less than sort-key (previous-n-1)
set previous-n-2 to x-n
set x-n to next-n (x-n)
end-perform
* exchange two cards (could use three moves)
call 'CBL_XOR' using sort-key (previous-n-1) sort-key (previous-n-2)
by value length of sort-key
call 'CBL_XOR' using sort-key (previous-n-2) sort-key (previous-n-1)
by value length of sort-key
call 'CBL_XOR' using sort-key (previous-n-1) sort-key (previous-n-2)
by value length of sort-key
* move sort-key (previous-n-1) to sort-key (x-last)
* move sort-key (previous-n-2) to sort-key (previous-n-1)
* move sort-key (x-last) to sort-key (previous-n-2)
* move high-values to sort-key (x-last)
set x-n to previous-n-0
if x-n equal to x-last
set x-n, first-n to previous-n-1
end-if
add 1 to swap-count
end-if
* Straighten out linked list so program can Search All. Adds 25% to time.
if test-number equal to 4 and
swap-count equal to zero
move sort-key (x-n) to sorted-key (s-n)
set s-n up by 1
end-if
set previous-n-0 to previous-n-1
set previous-n-1 to x-n
set x-n to next-n (x-n)
end-perform
end-perform
if test-number equal to 4
move high-values to sorted-key (x-last)
move sorted-data to test-data
end-if
set first-n to 1
goback
. end program radix-sort
. end program Sorts.
- Next message: Lueko Willms: "Re: Method to force keeping of source was Re: Is it possible to use th"
- Previous message: Peter E. C. Dashwood: "Re: Concerning the state of CLC..."
- Next in thread: Michael Mattias: "Re: Sorts (revised)"
- Reply: Michael Mattias: "Re: Sorts (revised)"
- Reply: JerryMouse: "Re: Sorts (revised)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|
|