Re: Byte to byte compare, duplicate file finder/killer



On Tue, 18 Sep 2007 14:37:03 -0700, "robertwessel2@xxxxxxxxx"
<spamtrap@xxxxxxxxxx> wrote:

On Sep 18, 2:50 pm, Terje Mathisen <spamt...@xxxxxxxxxx> wrote:
I would use a reasonably fast checksum algorithm (CRC-32) instead of a
cryptographically strong hash (MD5 or SHA-160), simply because the CRC
calculation is much faster to calculate.

Stage 4) For any pair of files that survive the previous stages, we do a
full byte-to-byte compare, this will confirm that they are indeed
identical, with odds of about 1:4e9 against a false positive.

By using a secure hash instead of CRC, the actual byte-to-byte compare
can be skipped, but since both files will probably be in the system file
cache at this point, so the compare will run at memory bandwidth speed.

The final runtime of this process will be totally limited by the disk IO
rate, so you can write it in asm, C, Java, Basic or perl: The timing
will be more or less the same.


Well, a decent implementation of a secure hash function (say SHA-1) is
still going to be fast enough that you're basically going to be I/O
limited. SHA-1 should process well in excess of 100MB/s on any semi-
modern PC. An advantage of a cryptographically secure hash is that
you're not at risk for a severe performance problem, potentially to
the level of a DoS attack, from encountering a maliciously constructed
set of files, namely (different) files which generate the same
checksum (files with identical CRCs or checksums are trivial to
generate). Consider what happens if you have a thousand megabyte
files generating the same checksum - you're now stuck doing half a
million file comparisons.
Ideas to consider:
1. compute a few different checksums at the same time.
2. Make a checksum that varies the parameters of its computation based
upon a starting random number. (For example, the weight of each
of the 8 32-bit fields used in making a 256-bit checksum would
depend on the starting seed.) (Determine if a second pass with
a different seed should be done before proceeding to the next
major step. Compare with using a 512-bit checksum to begin with.)
3. read and compare random parts of a bunch of files at the same time
Maybe pick 1% of each of 1000 files to compare with one "skip
pass" over each file. (The exact order of things here could
greatly effect performance, so your task is to determine the best
way.)
Pick the amount you compare at each spot
in each file to be a nice size compared to the disk organization.
Pick the samples randomly.

Determine what a good performance goal is and what worst case is.

If you're not worried about malicious attackers, a simple 32-bit
checksum is probably just as good as CRC-32 for screening purposes,
and certainly easier to implement.
.



Relevant Pages

  • Re: How good are checksums?
    ... >checksum and compare file lengths. ... >duplicate very low. ... I use a fast Adlerian checksum. ... Your problem is similar to what to do with hash table collisions. ...
    (comp.lang.java.programmer)
  • Re: Any known fast hash algorithms ?
    ... What I meant with that the principle is not that different from CRC, ... different from hash and checksum has allways been a little bit abstract. ... Looking at your reply I would say that you call a function a hash when it ... still end up with the best possible distribution of those bits. ...
    (borland.public.delphi.language.basm)
  • Re: Byte to byte compare, duplicate file finder/killer
    ... Make a checksum that varies the parameters of its computation based ... Maybe pick 1% of each of 1000 files to compare with one "skip ... to generate a filtering hash value. ... And then you'd still have to go back and hash the ...
    (comp.programming)
  • Re: Is it possible to forge both CRC and checksum of a file?
    ... simple linear equation to find out how to fix the CRC so it matches the ... the CRC or checksum, if allowed, would make _any_ unkeyed hash function ... and replace the old hash with the new hash. ...
    (sci.crypt)
  • Re: Shorter checksum than MD5
    ... You could use binascii.crc32which generates a 4 byte checksum. ... Compute CRC-32, the 32-bit checksum of data, starting with an initial crc. ... Since the algorithm is ... general hash algorithm. ...
    (comp.lang.python)