Fast String Distance (SIFT) Algorithm

Published Nov 28, 2006

Posted in
programming
essay

This article is obsolete, a better version of the algorithm has been published: Sift3

While researching different ways of measuring the distance between two strings, or how different they are, I've found of course the Levenstein algorithm. The problem with it is that it is slow. Searching more, I've seen some algorithms that seemed fast, but I didn't have the time or brain power to understand them. So I've devised my own algorithm, called SIFT. You might think the name comes from Siderite's Intelligent and Fast Technique, but it comes from the English word 'sift'. :)
How does it work? Well, the common scenario in comparing strings is that someone made a mistake, a typo. So in principle, the two strings should be very similar in order to be worth comparing them. So what I do is this:

foreach phase
    remove identical overlapping characters
    shake strings
return number of shakes + the length of the longest string between the two.

There is an optimisation towards the safe side: if the sift similarity is big enough, perform the constly Levenstein distance.

Ok, it might not be so clear, let's take an example:
INTERESTING
INFORMATIVE

Step 1: remove all identical overlapping characters (sift)
TEESNG
FOMAVE

Now we have smaller words to check, let's suppose there was a typo, that means that part of the one word is offset with one or maybe two characters from the other. So we move them a bit, that's a 'shake'.

Step 2: shake
TEESNG
[]FOMAVE

Oops, no overlapping characters. We do this one or two times more and there is no result, so...

Step 3: return result
MaxLength(TEESNG, FOMAVE)=6

There you have it. The sifting algorithm, because it resembles sifting grain.

Not satisfied with such a simple example? Let's take another:
Click here

Tests have shown it to be pretty close to Levenstein, at least in the cases that matter, while being substantially faster.

Comments

March 11, 2009

You are correct, but the way it was designed has a parallel feel to it. One could optimize it with bit operations and so on. It doesn't much matters now. Sift3 is a lot better and simpler to implement.

Siderite

March 9, 2009

Hello, this is just a cut off from the Levensthein algorithm. You only check permutations, and not all of them but only the number of shakes. This means that you don't count deletions, insertions and even permutations sometimes (if shake number is low). Br, Andras

Anonymous

Comments

March 11, 2009

March 9, 2009

Post a comment