Re: Really fast checksum?
- From: Tom Anderson <twic@xxxxxxxxxxxxxxx>
- Date: Sun, 1 Mar 2009 13:53:53 +0000
On Sat, 28 Feb 2009, Spud wrote:
I need to determine very quickly whether a database record has changed or not. Using JDBC, I select records from a table (possibly large), roll through them, calculate a checksum on the text of all of the fields, and then check it against a stored checksum to see if the record has changed.
What's the fastest checksum algorithm which still yields fairly reliable results?
In the past I've used CRC32, but that only operates on bytes and I'd rather not convert all the strings I'll get from JDBC into bytes (for performance reasons).
So convert the chars into bytes yourself, in a way which you know will be fast.
CRC32 crc;
String s;
for (int i = 0; i < s.length(); ++i) {
char ch = s.charAt(i);
crc.update((ch >> 8) & 0xff);
crc.update(ch & 0xff);
}
Can I just call String.hashCode() for each field and add them all up?
Simply adding the checksums won't detect field being swapped: {"foo", "bar"} and {"bar", "foo") will have identical hashes. You can fix this by instead computing a sort of meta-hash over the hashes, eg using a multiply-and-add scheme like the implementation of String.hashCode to combine them. But ...
Is that reliable enough that I can be almost certain that if the contents of the record changes, the sum of the hash codes will change as well?
No hash function, of any kind, is capable of giving you that guarantee. Look up the pigeonhole principle if you don't believe me.
The best you can get is a probabilistic assurance: for an ideal n-bit hash, there's a 1 in 2^n chance that two different records will have the same hash. CRC32 is 32-bit, so that's a one in four billion chance. If you put that probability into the mathematics for the birthday paradox [1], you see that if you have as few as 2^16 = 65536 records, you can expect to have two with the same hash. CRC32 is adequate for verifying message integrity in the face of small random errors (it's specifically engineered to catch the kind of errors you get in communications - single flipped bits, runs of stuck bits, etc), but it's not really up to the job you want to put it to.
Or do I have to do something fancy like MD5?
Something fancy would be better. SHA1 is the new MD5, so that's your first choice. It's a 160-bit hash, so you'd have to get up to 2^80 records before you have to worry about birthday collisions. It's also engineered to be more general than CRC32, so it should deal with all kinds of variations in the records, including ones designed by evil hackers to fool you. Note that it still can't guarantee that two records are the same, though, so if the equality of records is a mission-critical matter, then you're going to have to do full comparisons.
tom
[1] http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem
--
Orange paint menace
.
- Follow-Ups:
- Re: Really fast checksum?
- From: blue indigo
- Re: Really fast checksum?
- References:
- Really fast checksum?
- From: Spud
- Really fast checksum?
- Prev by Date: Re: A Logging data structure
- Next by Date: Re: BigInteger enhancing
- Previous by thread: Re: Really fast checksum?
- Next by thread: Re: Really fast checksum?
- Index(es):
Relevant Pages
|