String similarity

Luca Montecchiani · Oct 10, 2003

Introduction
------------
The need to find files that "resembled" in the name has pushed me to write
an utility that unlike the other it was not based on the content of the files
but on its name. Initially I start adding this functionality to
one "C" program for Unix called "fdupes" witch give me good performance
and good precision.
The algorithm that I have chosen for the comparison between string was
"Levenshtein Distance".
For reasons that I leave you to imagine I have rewrite that utility using
Python reducing the code from 1394 lines of "C" to 152 lines of Python
enclosed comments

Initially I have rearranged part of "C" code in one module "simil"
to which I have added an helper and others two algorithms that in reality
are two versions of the same algorithm "Ratcliff-Obershelp".
The small script works great scanning my download directory or my audio files
collection finding duplicates that I'll never be able to find manually.

Considerations
--------------
During the test I have been able to verify that the performances of the
Python module "difflib" are light years far from the algorithms of my
module "simil", initially I thought that the bottle neck was the
fact of having to compare all the possible combinations (without repetitions)
of files. This is bad IMHO.
This implementation is still in an embryonic phase however I have decided
of posting in this group for being able to discuss it and to carry ahead,
remaining in my HD would not have some future

I'm sorry for the quality of Python code, I'm a beginner on that language..

Installation
------------
Once decompressed the file: http://spazioinwind.libero.it/montecchiani/pyfdupes-0.7.tgz
run the Build.sh command in order to compile and to install the "simil" module.
To the test the modules you need to modify the Test.sh shell before the execution.
( the "int" mode is intentionally commented because on slower machine you could
never see the end, 515 files will generate 132355 iterations! )
Comments, credits, TODO and FIXME can be found inside the source code.

Ideas and optimization
----------------------
In the comparison between the content of files where the key is the size, the
iterations can be limited using a binary tree structure.
Comparing string this is not possible (IMHO) and the only way that I have found
was a "preliminary Check" that avoids to make comparison between string of a great
different length.

Critics, ideas, comments, fix and help are welcomes
luca

PS: sorry for my english

Cosine Similarity	2	Dec 6, 2012
Problem Splitting Text String	2	Dec 29, 2022
JavaScript String Syntax Checking	5	Jun 29, 2022
SQL Connection string regex pattern to parse sections	1	May 9, 2024
Converting an Array to a String in JavaScript	7	Sep 22, 2023
string similarity in python	5	Nov 24, 2003
wordnet semantic similarity: how to refer to elements of a pair in alist? can we sort dictionary acc	13	Oct 7, 2012
C code String And Comparison	2	Dec 27, 2022

String similarity

Luca Montecchiani

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads