String similarity

From: Luca Montecchiani (
Date: 10/10/03

Date: Fri, 10 Oct 2003 00:44:27 GMT

The need to find files that "resembled" in the name has pushed me to write
an utility that unlike the other it was not based on the content of the files
but on its name. Initially I start adding this functionality to
one "C" program for Unix called "fdupes" witch give me good performance
and good precision.
The algorithm that I have chosen for the comparison between string was
"Levenshtein Distance".
For reasons that I leave you to imagine I have rewrite that utility using
Python reducing the code from 1394 lines of "C" to 152 lines of Python
enclosed comments :)
Initially I have rearranged part of "C" code in one module "simil"
to which I have added an helper and others two algorithms that in reality
are two versions of the same algorithm "Ratcliff-Obershelp".
The small script works great scanning my download directory or my audio files
collection finding duplicates that I'll never be able to find manually.

During the test I have been able to verify that the performances of the
Python module "difflib" are light years far from the algorithms of my
module "simil", initially I thought that the bottle neck was the
fact of having to compare all the possible combinations (without repetitions)
of files. This is bad IMHO.
This implementation is still in an embryonic phase however I have decided
of posting in this group for being able to discuss it and to carry ahead,
remaining in my HD would not have some future ;)
I'm sorry for the quality of Python code, I'm a beginner on that language..

Once decompressed the file:
run the command in order to compile and to install the "simil" module.
To the test the modules you need to modify the shell before the execution.
( the "int" mode is intentionally commented because on slower machine you could
never see the end, 515 files will generate 132355 iterations! )
Comments, credits, TODO and FIXME can be found inside the source code.

Ideas and optimization
In the comparison between the content of files where the key is the size, the
iterations can be limited using a binary tree structure.
Comparing string this is not possible (IMHO) and the only way that I have found
was a "preliminary Check" that avoids to make comparison between string of a great
different length.

Critics, ideas, comments, fix and help are welcomes

PS: sorry for my english

Relevant Pages

  • Re: Combinatorics question
    ... I understood the general method, but I don't quite get the Python ... algorithm. ... is it a string, integer...? ...
  • TOC of Python Cookbook now online (was Re: author index for Python Cookbook 2?)
    ... Processing a String One Character at a Time ... Finding a File on the Python Search Path ... Constructing Lists with List Comprehensions ... Looping over Items and Their Indices in a Sequence ...
  • Re: =?ISO-8859-15?Q?Objektrepr=E4sentation_in_einem_Python?= =?ISO-8859-15?Q?-In
    ... Python 1.4 hat zwar ein uniformes Typsystem, Exemplare sind aber nur nur eine Art von Python-Objekt, will sagen, nicht jedes Ding hat eine Klasse. ... Called when an attribute lookup has not found the attribute in the usual places (i.e. it is not an instance attribute nor is it found in the class tree for self). ... Das einzige Element von None könnte null sein, Integer und String sind klar. ...
  • Python Unicode handling wins again -- mostly
    ... string type in a language works. ... characters", ... Python 3.3 passes this test: ... I would expect a solid Unicode implementation ...
  • Re: =?ISO-8859-15?Q?Objektrepr=E4sentation_in_einem_Python?= =?ISO-8859-15?Q?-In
    ... Exemplare von Klassen. ... Einzig vom Anwender definierte Klassen können Exemplare haben. ... wirklichen Grund sich an die Hierarchie von Python halten zu müssen. ... Wenn ich es richtig verstanden habe, dann ist es in Python möglich einem String eine neue Funktion zu geben, vielleicht sogar eine alte zu ersetzen. ...