Re: How good are checksums?

From: Roedy Green (see_at_mindprod.com.invalid)
Date: 04/29/04


Date: Thu, 29 Apr 2004 21:08:08 GMT

On Thu, 29 Apr 2004 18:58:24 GMT, Roedy Green
<see@mindprod.com.invalid> wrote or quoted :

>the logic I use in The Replicator and Untouch for this is to compute a
>checksum and compare file lengths. This makes the odds of a false
>duplicate very low. I use a fast Adlerian checksum.
>
>Even without the length check, the odds are only 1/2^32 of getting a
>false duplicate.

in the replicator, it does not matter if two unrelated documents hash
to the same key, only if the immediately previous version of a
document hashes to the same key and length.

Your problem is similar to what to do with hash table collisions.

See http://mindprod.com/jgloss/hashtable.html

Two different documents can hash to the same key. But two identical
documents cannot hash to different keys. If two documents hash to the
same key, you can do some finer check for duplicates, even a byte by
byte compare. This is not too painful if you do the i/o in whacking
big chunks as raw bytes.

Most of the time you won't have collisions, so you won't have to do
the compare.

If the two files have the same name and same timestamp, you should be
able to trust the OS that they are the same file without even
examining contents. :-)

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming. 
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.


Relevant Pages

  • Re: Byte to byte compare, duplicate file finder/killer
    ... cryptographically strong hash, simply because the CRC ... By using a secure hash instead of CRC, the actual byte-to-byte compare ... checksum (files with identical CRCs or checksums are trivial to ...
    (comp.programming)
  • Re: Byte to byte compare, duplicate file finder/killer
    ... Make a checksum that varies the parameters of its computation based ... Maybe pick 1% of each of 1000 files to compare with one "skip ... to generate a filtering hash value. ... And then you'd still have to go back and hash the ...
    (comp.programming)
  • Re: Suggestions for double-hashing scheme
    ... >> secondary hash function to 1/4 of the table size, ... > Hsieh is considered a laughing stock in many circles, ... Thats more typedefs than in all of bstrlib already. ... (Compare this to Bstrlib, whose influence ...
    (comp.programming)
  • Re: Detecting Brute-Force and Dictionary attacks
    ... usually modern systems doesn't compare the password you ... password attempt with the saved hash of your current password. ... It is my personal opinion that evaluating the passwords so closely, such as mentioned in the previous email to Greg's, would open yourself up to a far more complicated world than you might be thinking. ... In the past, when I've had people attempting to attack my systems, the easiest way to tell is the number of login attempts against the frequency of the attempts. ...
    (Focus-Linux)
  • RE: Access Code Required For New Validation database
    ... > duplicate contacts. ... > is at the same address, I only have name to compare. ... > Jon Simmons and another account Jahn Simmons. ... > running score for the user in a summary score table i.e. ...
    (microsoft.public.access.tablesdbdesign)

Loading