CRC checksum on a tree



I just thought of something nice that can be done with CRCs. It's
probably already commonly done. Could someone point me at where it's
already being done?

Large values (CLOBs, documents, files) are often represented as tree of
pieces.

If you want to test equality of large values, it's nice to have a
checksum for each value. That lets you compare the checksums first,
then you only have to compare the full values if the checksums match.

CRCs (cyclic redundancy checksums) have the nice property that you can
compute the CRC for concat(A,B) given the CRC for A, the CRC for B, and
the length of A.

The nice thing I just thought of was that if a large value is stored in
a tree, you can compute the CRC for each leaf, then use those to
compute the CRC for each branch node on up to the root. Modifications
to leaves (including inserts and deletes that change their size)
require the new CRC to be propogated up the tree. So a CRC hash can be
maintained for modifications of large values in O(logn) time. The way
the large value is split up into pieces wouldn't affect the final CRC
value.

I'd seen that done before for positions, and maybe simple sums, but I
hadn't seen it for CRCs. CRCs aren't a strong hash, but they're a lot
better than simple sums. For the purpose of being an equality
prefilter CRCs are good enough. If this were done for a really widely
distributed system (like pages on the internet), it would be nice if
everyone standardized on the same CRC definition for doing this.

If large values are represented as a tree, exposing the CRCs for the
first level or two of the tree would allow replication of large values
to copy just changed pieces rather than the whole thing when the values
are changed. Timestamps instead of CRCs could work the same way, and
would be more reliable if the goal is to detect what has changed.

.