In article <l_zsg.146932$iF6.68300@pd7tw2no>, "James J. Gavan" <jgavandeletethis@xxxxxxx> writes:

That 'thinggy' you did in the M/F Forum. Would love to see an
explanation and sample of a program using hashing with M/F Collections
:-). That's assuming of course there is an advantage for using hashing
over and above existing non-hash methods.

Wrong kind of hashing, I think.

MD5, which is what my sample program from the MF Forum computes, is a
cryptographic hash. Cryptographic hashes are intended to distinguish
documents: they're supposed to be virtually collision-free and resist
creating collisions.

General-purpose hashes, for hash tables and similar data structures,
have different requirements. They should have a low probability of
collision for anticipated input (which in some cases means for
random input), but more importantly they should be very fast, and
their output should be convenient to work with.
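To make that concrete, here's a minimal sketch in C (the names toy_hash
and bucket_for are mine, not from any library): a general-purpose hash
just folds the key's bytes into a machine word, and the table reduces
that word modulo the bucket count.

```c
#include <stddef.h>
#include <stdint.h>

/* A toy general-purpose string hash: fold each byte into a 32-bit
 * accumulator. It only needs to spread typical keys around evenly
 * and be cheap; it makes no attempt to resist deliberate collisions. */
uint32_t toy_hash(const char *key)
{
    uint32_t h = 0;
    for (size_t i = 0; key[i] != '\0'; i++)
        h = h * 31u + (unsigned char)key[i];
    return h;
}

/* The hash table then reduces the hash to a bucket index. */
size_t bucket_for(const char *key, size_t nbuckets)
{
    return toy_hash(key) % nbuckets;
}
```

Note the output is a single machine word, cheap to store and reduce,
which is exactly what you want from a table hash and exactly what a
16-byte MD5 digest is not.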

Cryptographic hashes try to be relatively fast, but they're not
nearly as fast as general-purpose hashes. And their output has to
be relatively large. MD5's output is 128 bits, and that's now
considered too short; current security applications should use
cryptographic hashes with at least 256 bits of output.

If you're implementing a hash table, you don't want your hash value
to be 16 or 32 bytes; at that point you might as well just use the
original key.

In short, cryptographic hashes like MD5 are used to verify data,
while general-purpose hashes are used to index it.

Now, I do know of at least one widely used application that uses
cryptographic hashes as an index: BitTorrent. (It actually uses SHA-1
rather than MD5 for this.) A torrent's metadata publishes the hash of
each chunk ("piece") of the data being distributed, and clients check
every piece they receive against its published hash. Since the
possibility of a collision is very, very low, a piece whose hash
matches is very, very likely to be the data the client was looking
for.

But that's a special case. BitTorrent is essentially trying to index
every "interesting" chunk of data of a certain size in the entire
world. Most hash tables are a bit less ambitious, so they do better
with smaller, faster hashes.

All that said, hash tables are a classic type of collection in OO
programming. I see there's discussion of creating custom hash
methods for collections in the MF docs, though I've never looked into
this. It'd be interesting to try to implement one of the current
well-regarded hash functions (such as Bob Jenkins's, or Paul Hsieh's,
or Chuck Falconer's) in COBOL as a collection hash method. The
popular hashes tend to use bit-manipulation and other techniques that
don't work well in COBOL (they're difficult to implement efficiently
and portably), and they tend to be at least somewhat CPU-specific,
which means they may work slowly, if at all, on other CPUs.
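For a sense of what those functions look like, here is Bob Jenkins's
"one-at-a-time" hash, a small member of that family, in C. The shifts
and exclusive-ors in every step are exactly the kind of
bit-manipulation that's awkward to express efficiently in COBOL:

```c
#include <stddef.h>
#include <stdint.h>

/* Bob Jenkins's one-at-a-time hash: mix each byte into the
 * accumulator with adds, shifts, and xors, then run a final
 * avalanche so every input bit affects every output bit. */
uint32_t oat_hash(const unsigned char *key, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}
```

Each of those mixing steps is a cheap single instruction on most CPUs;
in COBOL you'd be simulating the shifts with multiplies, divides, and
remainders, which is where the efficiency goes.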

For a COBOL hash of alphanumeric data, I'd probably just go with a
linear-congruential algorithm. In untested, MF-specific code,
something like:

identification division.
program-id. string-hash.

data division.
working-storage section.

77 hash-value pic x(4) comp-5.
77 char-value pic x comp-5.
77 string-idx pic x(4) comp-5.

linkage section.

*> in-string is declared as a single character; the reference
*> modification below indexes into the caller's actual string.
77 in-string pic x.
77 in-len pic x(4) comp-5.

procedure division using in-string in-len.

    move 0 to hash-value
    perform varying string-idx from 1 by 1 until string-idx > in-len
        move in-string(string-idx:1) to char-value
        multiply 65599 by hash-value
        add char-value to hash-value
    end-perform
    goback returning hash-value
end program string-hash.

The constant multiplier should probably be prime; since the increment
(the value of the next character) is unpredictable, we don't have any
other way of reducing the probability of short cycles. One reference
(Aho, Sethi, and Ullman) recommends 65599 for a 32-bit hash, so that's
what I used.
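For comparison, the same multiply-and-add scheme in C (it's essentially
the classic 65599 string hash); unsigned 32-bit arithmetic wraps on
overflow, which matches the truncating behaviour of the COMP-5 item in
the COBOL version:

```c
#include <stddef.h>
#include <stdint.h>

/* hash = hash * 65599 + next byte, over the whole key.
 * uint32_t arithmetic is defined to wrap modulo 2^32, so overflow
 * is harmless and the low 32 bits are the hash value. */
uint32_t lc_hash(const unsigned char *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 65599u + s[i];
    return h;
}
```

A one-character key "A" (value 65) hashes to 0 * 65599 + 65 = 65, and
each further character scrambles the accumulated value by the prime
multiplier before being added in.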

Michael Wojcik michael.wojcik@xxxxxxxxxxxxxx

O sometimes, nevertheless,
The labourer at his instrument or tractor,
Bending into a state of merge with objects,
Finds the same love that, from a machine of sex,
Steps down as Venus to her invoker. -- George Barker