Re: Byte to byte compare, duplicate file finder/killer
- From: Mark F <mark49607@xxxxxxxxx>
- Date: Thu, 20 Sep 2007 17:16:04 -0400
On Tue, 18 Sep 2007 14:37:03 -0700, "robertwessel2@xxxxxxxxx"
<spamtrap@xxxxxxxxxx> wrote:
On Sep 18, 2:50 pm, Terje Mathisen <spamt...@xxxxxxxxxx> wrote:Ideas to consider:
I would use a reasonably fast checksum algorithm (CRC-32) instead of a
cryptographically strong hash (MD5 or SHA-160), simply because the CRC
calculation is much faster to calculate.
Stage 4) For any pair of files that survive the previous stages, we do a
full byte-to-byte compare, this will confirm that they are indeed
identical, with odds of about 1:4e9 against a false positive.
By using a secure hash instead of CRC, the actual byte-to-byte compare
can be skipped, but since both files will probably be in the system file
cache at this point, so the compare will run at memory bandwidth speed.
The final runtime of this process will be totally limited by the disk IO
rate, so you can write it in asm, C, Java, Basic or perl: The timing
will be more or less the same.
Well, a decent implementation of a secure hash function (say SHA-1) is
still going to be fast enough that you're basically going to be I/O
limited. SHA-1 should process well in excess of 100MB/s on any semi-
modern PC. An advantage of a cryptographically secure hash is that
you're not at risk for a severe performance problem, potentially to
the level of a DoS attack, from encountering a maliciously constructed
set of files, namely (different) files which generate the same
checksum (files with identical CRCs or checksums are trivial to
generate). Consider what happens if you have a thousand megabyte
files generating the same checksum - you're now stuck doing half a
million file comparisons.
1. compute a few different checksums at the same time.
2. Make a checksum that varies the parameters of its computation based
upon a starting random number. (For example, the weight of each
of the 8 32-bit fields used in making a 256-bit checksum would
depend on the starting seed.) (Determine if a second pass with
a different seed should be done before proceeding to the next
major step. Compare with using a 512-bit checksum to begin with.)
3. read and compare random parts of a bunch of files at the same time
Maybe pick 1% of each of 1000 files to compare with one "skip
pass" over each file. (The exact order of things here could
greatly effect performance, so your task is to determine the best
way.)
Pick the amount you compare at each spot
in each file to be a nice size compared to the disk organization.
Pick the samples randomly.
Determine what a good performance goal is and what worst case is.
If you're not worried about malicious attackers, a simple 32-bit.
checksum is probably just as good as CRC-32 for screening purposes,
and certainly easier to implement.
- Follow-Ups:
- Re: Byte to byte compare, duplicate file finder/killer
- From: robertwessel2@xxxxxxxxx
- Re: Byte to byte compare, duplicate file finder/killer
- Prev by Date: Re: Byte to byte compare, duplicate file finder/killer
- Next by Date: Collection of various algorithms
- Previous by thread: Re: Byte to byte compare, duplicate file finder/killer
- Next by thread: Re: Byte to byte compare, duplicate file finder/killer
- Index(es):
Relevant Pages
|