String similarity

From: Luca Montecchiani (cbm64_at_inwind.it)
Date: 10/10/03


Date: Fri, 10 Oct 2003 00:44:27 GMT

Introduction
------------
The need to find files that "resembled" in the name has pushed me to write
an utility that unlike the other it was not based on the content of the files
but on its name. Initially I start adding this functionality to
one "C" program for Unix called "fdupes" witch give me good performance
and good precision.
The algorithm that I have chosen for the comparison between string was
"Levenshtein Distance".
For reasons that I leave you to imagine I have rewrite that utility using
Python reducing the code from 1394 lines of "C" to 152 lines of Python
enclosed comments :)
Initially I have rearranged part of "C" code in one module "simil"
to which I have added an helper and others two algorithms that in reality
are two versions of the same algorithm "Ratcliff-Obershelp".
The small script works great scanning my download directory or my audio files
collection finding duplicates that I'll never be able to find manually.

Considerations
--------------
During the test I have been able to verify that the performances of the
Python module "difflib" are light years far from the algorithms of my
module "simil", initially I thought that the bottle neck was the
fact of having to compare all the possible combinations (without repetitions)
of files. This is bad IMHO.
This implementation is still in an embryonic phase however I have decided
of posting in this group for being able to discuss it and to carry ahead,
remaining in my HD would not have some future ;)
I'm sorry for the quality of Python code, I'm a beginner on that language..

Installation
------------
Once decompressed the file: http://spazioinwind.libero.it/montecchiani/pyfdupes-0.7.tgz
run the Build.sh command in order to compile and to install the "simil" module.
To the test the modules you need to modify the Test.sh shell before the execution.
( the "int" mode is intentionally commented because on slower machine you could
never see the end, 515 files will generate 132355 iterations! )
Comments, credits, TODO and FIXME can be found inside the source code.

Ideas and optimization
----------------------
In the comparison between the content of files where the key is the size, the
iterations can be limited using a binary tree structure.
Comparing string this is not possible (IMHO) and the only way that I have found
was a "preliminary Check" that avoids to make comparison between string of a great
different length.

Critics, ideas, comments, fix and help are welcomes
luca

PS: sorry for my english