
This article is obsolete, a better version of the algorithm has been published:
Sift3While researching different ways of measuring the distance between two strings, or how different they are, I've found of course the Levenstein algorithm. The problem with it is that it is slow. Searching more, I've seen some algorithms that seemed fast, but I didn't have the time or brain power to understand them. So I've devised my own algorithm, called SIFT. You might think the name comes from Siderite's Intelligent and Fast Technique, but it comes from the English word 'sift'. :)
How does it work? Well, the common scenario in comparing strings is that someone made a mistake, a typo. So in principle, the two strings should be very similar in order to be worth comparing them. So what I do is this:
foreach phase
remove identical overlapping characters
shake strings
return number of shakes + the length of the longest string between the two.
There is an optimisation towards the safe side: if the sift similarity is big enough, perform the constly Levenstein distance.Ok, it might not be so clear, let's take an example:
INTERESTING
INFORMATIVE
Step 1: remove all identical overlapping characters (sift)
TEESNG
FOMAVE
Now we have smaller words to check, let's suppose there was a typo, that means that part of the one word is offset with one or maybe two characters from the other. So we move them a bit, that's a 'shake'.
Step 2: shake
TEESNG
[]FOMAVE
Oops, no overlapping characters. We do this one or two times more and there is no result, so...
Step 3: return result
MaxLength(TEESNG, FOMAVE)=6
There you have it. The sifting algorithm, because it resembles sifting grain.
Not satisfied with such a simple example? Let's take another:
Click hereTHIS EXTREMELY LONG STRING WILL BE COMPARED
WITH ANOTHER VERY LONG STRING WHICH IS QUITE DIFFERENT
- sift (distance is 0)
THISEXTREMELY LONG STRING WILL BECOMPARED
WITHANOTHER VERY LONG STRING WHICHIS QUITE DIFFERENT
- shake and sift (distance becomes 1)
THSEXTREMLY LONG STRING WILL BEOMPARED
[]WTHANOTHR VERY LONG STRING WHIHIS QUITE DIFFERENT
- shake and sift (distance becomes 2)
[]SEXREMLY LONG STRING WILL BEOMPARED
WANOHR VERY LONG STRING WHIHIS QUITE DIFFERENT
- shake and sift (distance becomes 3)
[][][]SEXREMLILL BEOMPARED
WANOHR VERHIHIS QUITE DIFFERENT
- return distance+MaxLength(string1,string2) = 3+31=34
Now, this is only with 3 shakes, chosen randomly. In my .NET library I am using at least 10 shakes with offset values like -1, 1, 2, -2, -3, 3, 4, -4, -5, 5.
Tests have shown it to be pretty close to Levenstein, at least in the cases that matter, while being substantially faster.