Super Fast and Accurate string distance algorithm: Sift4

Published Nov 10, 2014

The algorithm works perfectly well and is better than Sift3, however it's slightly more complex. You might want to start with Sift3 in order to understand where it came from.

Update November 8 2022: I found a bug in the algorithm, relating to maxDistance. I've updated the code. If you didn't use maxDistance, you are unaffected. Basically the fix is to compare temporaryDistance>minDistance (before it was >= ) and to move the calculation of the temporary distance after c1 and c2 are updated to their minimum value when a token was not found (otherwise the temporary distance might become larger than the final distance)

Try the Javascript implementation here:

Algorithm:

Levenstein
Sift3
Sift4 MaxOffset:

String 1:

String 2:

Result:

Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can be now changed via the new options function transpositionCostEvaluator. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3, however when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.

Update 27 Mar 2015: Thanks to Emanuele Bastianelli who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop there is an extra check to prevent the algorithm exiting prematurely, before computing remaining tokens.

Intro

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave more more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:

33% more accurate
three different variants: simple, common and general
new concepts added
support for own value and matching functions, different tokenizer functions, etc.
actually tested with a (slightly more) serious test
more robust, working better for large values of maxOffset

Before I get into the details, I am publishing the algorithm here for the moment, no Codeplex or PasteBin or GitHub or whatever. Also, it is written in Javascript now, the C# and T-SQL version pending. Of course, it would be great if, as before, the community of people using the algorithm would go into implementing it into various programming languages, however I am a bit apprehensive because more often than not people came with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

New concepts in Sift4

I created a test that used random strings, but also a huge list of commonly used English phrases as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenstein and the new algorithm and computed the error distance between the Levenstein distance and the two Sift variants. This permitted me to see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases when Sift3(a,b) is different from Sift3(b,a). There are edge cases, but this one in particular grated me.

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, but it gets better as the value increases. The common version is a bit more complex, but the error decreases with 33% and maintains a low error for large maxOffset values. The extended or general version receives an options object that can change almost everything, but most important is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare them by the words in the text, maybe even their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what is the value of their match, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way that it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', one considers them transpositions and removes some of their value from the distance. Actually, if I look at the implementation, it might be that I have independently discovered the Jaro distance. I will research this further. I don't know if the transposition calculation is the most optimal. At the moment it uses an array of all matches found until a point, clearing it of values as the cursors move along the string. The difference between the simple and the common versions of Sift4 is that the simple version is not computing the transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed up Sift3.

Another new concept added is the one of local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non matching characters. Each of these substrings can be used to improve the distance function. For example one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the largest common subsequence is 5, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

The code

Simplest Sift4:

// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); //using max to bypass the need for computer transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}

Common Sift4:

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (maxDistance) {
                var temporaryDistance = Math.max(c1, c2) - lcss + trans;
                if (temporaryDistance > maxDistance)
                    return temporaryDistance;
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
}

Extended/General Sift4:

// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
//         maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//         tokenizer: a function to transform strings into vectors of tokens
//        tokenMatcher: a function to determine if two tokens are matching (equal)
//        matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
//        localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
//        transpositionCostEvaluator: a function to determine the value of an individual transposition. For example longer transpositions should have a higher cost.
//        transpositionsEvaluator: a function to determine the way the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options = extend(options, {
            maxDistance: null,
            tokenizer: function (s) {
                return s ? s.split('') : [];
            },
            tokenMatcher: function (t1, t2) {
                return t1 == t2;
            },
            matchingEvaluator: function (t1, t2) {
                return 1;
            },
            localLengthEvaluator: function (local_cs) {
                return local_cs;
            },
            transpositionCostEvaluator: function (c1, c2) {
                return 1;
            },
            transpositionsEvaluator: function (lcss, trans) {
                return lcss - trans;
            }
        });

    var t1 = options.tokenizer(s1);
    var t2 = options.tokenizer(s2);

    var l1 = t1.length;
    var l2 = t2.length;

    if (l1 == 0)
        return l2;
    if (l2 == 0)
        return l1;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1], t2[c2])) {
            local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans += options.transpositionCostEvaluator(c1, c2);
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (options.maxDistance) {
                var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
                if (temporaryDistance > options.maxDistance)
                    return Math.round(temporaryDistance);
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}

function extend(obj, def) {
    var result = {};
    for (var prop in def) {
        if (!obj || !obj.hasOwnProperty(prop)) {
            result[prop] = def[prop];
        } else {
            result[prop] = obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result = [];
    if (!s)
        return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s)
        return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i = 0; i <= 25; i++) {
        var val = 0;
        if (s) {
            for (j = 0; j < s.length; j++) {
                var code = s.charCodeAt(j);
                if (code == i + 65 || code == i + 97)
                    val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity > 0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l < 1)
        return l; //0 -> 0
    return l - 1 / (l + 1); //1 -> 0.5, 2-> 0.66, 9 -> 0.9
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}

As always, I will be most happy to know if you used my algorithm and how it performed, as well as receive any suggestion that you might have.

Options explained

Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters so that the algorithm would work on an array of one character long tokens. Other options are possible, like splitting the strings by empty spaces so that the comparisons are done on words or transforming a string into an array of strings N characters long, the so called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word in an array of 25 values representing the number of letters in the word for each letter a-z.

The tokenMatcher function returns true if two tokens are matching. They can be fuzzy matched, for example the sift4tokenMatcher example function uses Sift inside Sift to determine the character distance between two tokens and returns true if they match more than 70%.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length value when two tokens match. The default is 1, but one can use some other metric, like the similarity, for example. Of course, the common substring length has lost its meaning when these functions change, but the variable local_cs is still used.

The lengthEvaluator is taking the length value of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value as the one provided, but some functions could reward longer substrings.

FAQ

Q: Can you make Sift4 to work case insensitive?
A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply. Or implement a case insensitive tokenMatcher.

Q: Can you make Sift4 to compare strings based on their meaning, like using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces them with the most used of their synonyms. A more complex solution would require to analyze the strings beforehand and turn them into some ordered synonym or equivalent expressions equivalents, then use Sift4 with a word tokenizer (one is provided in the Extended algorithm source)

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course, they are different algorithms. This is a fuzzy distance calculator, it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. You need more accuracy, try to combine Sift with Levenshtein for example, computing Levenshtein only where Sift says the strings are above a certain similarity.

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is so far Javascript only, just compute the maxOffset that is convenient to you before you compare.

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know it and I will try and fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything, it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.

Q: What are the edge cases for Sift?
A: Probably there are several, but I didn't really spot them. One of them is that one might find both letters at a position matching letters at other positions, but only one will count. Example 'abxx' and 'bayy'. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string it will find 'a' matched in position 1 in the second. It will increase both counters and lcss will be increase as well. Next check will be 'b', the character at position 1 in the first string matched with position 2 in the second string. No match, therefore both counters will be reset to 1, and starting search again. The 'b' match is lost and distance is 3 instead of 2. Also I think there might be some situations where the counters are not equal and the biggest of them reaches the end of its string, thus terminating the algorithm, but there could have been more matches. Incidentally I tried to fix both these issues and the error from Levenshtein was not really affected, but I am not 100% sure of the implementation.

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. There is a function that is symmetric and that is Math.min(Sift4(a,b),Sift4(b,a)), however it is twice as slow, obviously.

Implementations in other languages

C# implementation [expand/collapse]

public class Sift4
{
    private Options _options;

    public Sift4(Options options)
    {
        if (options == null) options = new Options();
        if (options.Tokenizer == null)
        {
            options.Tokenizer = (s) => s == null
            ? new string[0]
            : s.ToCharArray().Select(c => c.ToString()).ToArray();
        }
        if (options.TokenMatcher == null)
        {
            options.TokenMatcher = (t1, t2) => t1 == t2;
        }
        if (options.MatchingEvaluator == null)
        {
            options.MatchingEvaluator = (t1, t2) => 1;
        }
        if (options.LocalLengthEvaluator == null)
        {
            options.LocalLengthEvaluator = (l) => l;
        }
        if (options.TranspositionCostEvaluator == null)
        {
            options.TranspositionCostEvaluator = (c1, c2) => 1;
        }
        if (options.TranspositionsEvaluator == null)
        {
            options.TranspositionsEvaluator = (l, t) => l - t;
        }
        _options = options;
    }

    /// <summary>
    /// General distance algorithm uses all the parameters in the options object and works on tokens
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public double GeneralDistance(string s1, string s2, int maxOffset)
    {
        var t1 = _options.Tokenizer(s1);
        var t2 = _options.Tokenizer(s2);

        var l1 = t1.Length;
        var l2 = t2.Length;

        if (l1 == 0) return _options.LocalLengthEvaluator(l2);
        if (l2 == 0) return _options.LocalLengthEvaluator(l1);

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0.0;  //largest common subsequence
        var local_cs = 0.0; //local common substring
        var trans = 0.0;  //cost of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (_options.TokenMatcher(t1[c1], t2[c2]))
            {
                local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans += _options.TranspositionCostEvaluator(c1, c2);
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans += _options.TranspositionCostEvaluator(ofs.C1, ofs.C2);
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (_options.MaxDistance != null)
                {
                    var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
                    if (temporaryDistance > _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += _options.LocalLengthEvaluator(local_cs);
        return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //apply transposition cost to the final result
    }

    /// <summary>
    /// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <param name="maxDistance"></param>
    /// <returns></returns>
    public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance = 0)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring
        var trans = 0;  //number of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans++;
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans++;
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (maxDistance > 0)
                {
                    var temporaryDistance = Math.Max(c1, c2) - (lcss - trans);
                    if (temporaryDistance > maxDistance) return temporaryDistance;
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += local_cs;
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - (lcss - trans); //apply transposition cost to the final result
    }

    /// <summary>
    /// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public static int SimplestDistance(string s1, string s2, int maxOffset)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Max(c1, c2);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - lcss;
    }

    private class OffsetPair
    {
        public int C1 { get; set; }
        public int C2 { get; set; }
        public bool IsTransposition { get; set; }

        public OffsetPair(int c1, int c2)
        {
            this.C1 = c1;
            this.C2 = c2;
            this.IsTransposition = false;
        }
    }

    public class Options
    {
        /// <summary>
        /// If set, the algorithm will stop if the distance reaches this value
        /// </summary>
        public int? MaxDistance { get; set; }

        /// <summary>
        /// The function that turns strings into a list of tokens (also strings)
        /// </summary>
        public Func<string, string[]> Tokenizer { get; set; }

        /// <summary>
        /// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
        /// </summary>
        public Func<string, string, bool> TokenMatcher { get; set; }

        /// <summary>
        /// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
        /// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
        /// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
        /// </summary>
        public Func<string, string, double> MatchingEvaluator { get; set; }

        /// <summary>
        /// Determines if the local value (computed on subsequent matched tokens) must be modified.
        /// In case one wants to reward longer matched substrings, for example, this is what you need to change.
        /// </summary>
        public Func<double, double> LocalLengthEvaluator { get; set; }

        /// <summary>
        /// The function determining the cost of an individual transposition, based on its counter positions.
        /// </summary>
        public Func<int, int, double> TranspositionCostEvaluator { get; set; }

        /// <summary>
        /// The function determining how the cost of transpositions affects the final result
        /// Change it if it doesn't suit you.
        /// </summary>
        public Func<double, double, double> TranspositionsEvaluator { get; set; }
    }
}

PHP implementation [expand/collapse]

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance = 0) {
    if (!$s1 || !strlen($s1)) {
        if (!$s2) {
            return 0;
        }
        return strlen($s2);
    }
    if (!$s2 || !strlen($s2)) {
        return strlen($s1);
    }
    $l1 = strlen($s1);
    $l2 = strlen($s2);
    $c1 = 0; //cursor for string 1
    $c2 = 0; //cursor for string 2
    $lcss = 0; //largest common subsequence
    $local_cs = 0; //local common substring
    $trans = 0; //number of transpositions ('ab' vs 'ba')
    $offset_arr = array(); //offset pair array, for computing the transpositions
    while (($c1 < $l1) && ($c2 < $l2)) {
        if (substr($s1, $c1, 1) == substr($s2, $c2, 1)) {
            $local_cs++;
            $isTrans = false;
            $i = 0;
            while ($i < sizeof($offset_arr)) { //see if current match is a transposition
                $ofs = $offset_arr[$i];
                if ($c1 <= $ofs['c1'] || $c2 <= $ofs['c2']) {
                    $isTrans = abs($c2 - $c1) >= abs($ofs['c2'] - $ofs['c1']);
                    if ($isTrans) {
                        $trans++;
                    } else {
                        if (!$ofs['trans']) {
                            $ofs['trans'] = true;
                            $trans++;
                        }
                    }
                    break;
                } else {
                    if ($c1 > $ofs['c2'] && $c2 > $ofs['c1']) {
                        array_splice($offset_arr, $i, 1);
                    } else {
                        $i++;
                    }
                }
            }
            array_push($offset_arr, array('c1' = > $c1, 'c2' = > $c2, 'trans' = > $isTrans));
        } else {
            $lcss+= $local_cs;
            $local_cs = 0;
            if ($c1 != $c2) {
                $c1 = $c2 = min($c1, $c2); //using min allows the computation of transpositions
            }
            if ($maxDistance) {
                $temporaryDistance = max($c1, $c2) - $lcss + $trans;
                if ($temporaryDistance > $maxDistance) return $temporaryDistance;
            }

            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for ($i = 0;$i < $maxOffset && ($c1 + $i < $l1 || $c2 + $i < $l2);$i++) {
                if (($c1 + $i < $l1) && (substr($s1, $c1 + $i, 1) == substr($s2, $c2, 1))) {
                    $c1+= $i - 1;
                    $c2--;
                    break;
                }
                if (($c2 + $i < $l2) && (substr($s1, $c1, 1) == substr($s2, $c2 + $i, 1))) {
                    $c1--;
                    $c2+= $i - 1;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if (($c1 >= $l1) || ($c2 >= $l2)) {
            $lcss+= $local_cs;
            $local_cs = 0;
            $c1 = $c2 = min($c1, $c2);
        }
    }
    $lcss+= $local_cs;
    return max($l1, $l2) - $lcss + $trans; //apply transposition cost to final result
    
}

Thanks to Ferenc Szatmári for corrections in the PHP code

The Simplest and General versions of the algorithm remain as an exercise to you, since I haven't been working in PHP for more than a decade.

T-SQL implementation [expand/collapse]

---<summary>
---Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
---</summary>
---<param name="s1"></param>
---<param name="s2"></param>
---<param name="maxOffset"></param>
---<param name="maxDistance"></param>
---<returns></returns>
CREATE FUNCTION Sift4common(@s1          NVARCHAR(max), 
                            @s2          NVARCHAR(max), 
                            @maxOffset   INT, 
                            @maxDistance INT) 
returns INT 
AS 
  BEGIN 
      DECLARE @l1 INT = Len(Isnull(@s1, '')); 
      DECLARE @l2 INT = Len(Isnull(@s2, '')); 

      IF ( @l1 = 0 ) 
        RETURN @l2; 

      IF ( @l2 = 0 ) 
        RETURN @l1; 

      DECLARE @c1 INT = 0; 
      DECLARE @c2 INT = 0; 
      DECLARE @lcss INT = 0; 
      DECLARE @local_cs INT = 0; 
      DECLARE @trans INT = 0; 
      DECLARE @offset_arr TABLE 
        ( 
           C1    INT, 
           C2    INT, 
           Trans BIT 
        ) 
      DECLARE @i INT 
      DECLARE @temporaryDistance FLOAT 
      DECLARE @result INT 
      DECLARE @oc1 INT, 
              @oc2 INT, 
              @otr BIT 
      DECLARE @isTrans BIT 

      WHILE ( ( @c1 < @l1 ) 
              AND ( @c2 < @l2 ) ) 
        BEGIN 
            IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
              BEGIN 
                  SET @local_cs=@local_cs + 1; 
                  SET @isTrans=0 
                  SET @oc1=NULL; 

                  SELECT TOP 1 @oc1 = o.C1, 
                               @oc2 = o.C2, 
                               @otr = o.Trans 
                  FROM   @offset_arr o 
                  WHERE  @c1 <= o.C1 
                          OR @c2 <= o.C2 

                  IF ( @oc1 IS NOT NULL ) 
                    BEGIN 
                        SET @isTrans=CASE 
                                       WHEN Abs(@c2 - @c1) >= Abs(@oc2 - @oc1) 
                                     THEN 1 
                                       ELSE 0 
                                     END 

                        IF ( @isTrans = 1 ) 
                          BEGIN 
                              SET @trans=@trans + 1 
                          END 
                        ELSE IF ( @otr = 0 ) 
                          BEGIN 
                              SET @trans=@trans + 1 

                              UPDATE @offset_arr 
                              SET    Trans = 1 
                              WHERE  C1 = @oc1 
                                     AND C2 = @oc2 
                          END 
                    END 

                  DELETE FROM @offset_arr 
                  WHERE  @c1 > C1 
                         AND @c1 > C2 
                         AND @c2 > C1 
                         AND @c2 > C2; 

                  INSERT INTO @offset_arr 
                  VALUES      (@c1, 
                               @c2, 
                               @isTrans); 
              END 
            ELSE 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 != @c2 ) 
                    -- using min allows the computation of transpositions  
                    BEGIN 
                        IF ( @c1 < @c2 ) 
                          BEGIN 
                              SET @c2=@c1; 
                          END 
                        ELSE 
                          BEGIN 
                              SET @c1=@c2; 
                          END 
                    END 
                IF ( @maxDistance > 0 ) 
                  BEGIN 
                    IF ( @c1 > @c2 ) 
                    BEGIN 
                      SET @temporaryDistance = @c1 - ( @lcss - @trans / 2.0 ); 
                    END 
                    ELSE 
                    BEGIN 
                      SET @temporaryDistance = @c2 - ( @lcss - @trans / 2.0 ); 
                    END 

                  IF ( @temporaryDistance > @maxDistance ) 
                    RETURN Round(@temporaryDistance, 0); 
              END 


                  --if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                  --so that we can have only one code block handling matches   
                  SET @i=0; 

                  WHILE ( @i < @maxOffset 
                          AND ( @c1 + @i < @l1 
                                 OR @c2 + @i < @l2 ) ) 
                    BEGIN 
                        IF ( ( @c1 + @i < @l1 ) 
                             AND Substring(@s1, @c1 + @i + 1, 1) = 
                                 Substring(@s2, @c2 + 1, 1) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 + @i - 1; 
                              SET @c2=@c2 - 1; 

                              BREAK; 
                          END 

                        IF ( ( @c2 + @i < @l2 ) 
                             AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
                                                              @c2 + @i + 1, 1 
                                                              ) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 - 1; 
                              SET @c2=@c2 + @i - 1; 

                              BREAK; 
                          END 

                        SET @i=@i + 1; 
                    END 
              END 

            SET @c1=@c1 + 1; 
            SET @c2=@c2 + 1; 

            IF ( ( @c1 >= @l1 ) 
                  OR ( @c2 >= @l2 ) ) 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 < @c2 ) 
                    BEGIN 
                        SET @c2=@c1; 
                    END 
                  ELSE 
                    BEGIN 
                        SET @c1=@c2; 
                    END 
              END 
        END 

      SET @lcss = @lcss + @local_cs; 

      --apply the transposition cost to the final result 
      IF ( @l1 > @l2 ) 
        BEGIN 
            SET @result = @l1 - ( @lcss - @trans ); 
        END 
      ELSE 
        BEGIN 
            SET @result = @l2 - ( @lcss - @trans ); 
        END 

      RETURN @result; 
  END

Clearly a general version of the algorithm is not possible in Transact SQL.

Here is a MySQL version, gracefully provided by Ferenc Szatmári:

BEGIN 
DECLARE l1 INT DEFAULT Length(IFNULL(s1, '')); 
DECLARE l2 INT DEFAULT Length(IFNULL(s2, '')); 


DECLARE c1 INT Default 0; 
DECLARE c2 INT default 0; 
DECLARE lcss INT default 0; 
DECLARE local_cs INT default 0; 
DECLARE trans INT default 0; 
DECLARE i INT;
DECLARE temporaryDistance FLOAT;
DECLARE result INT;
DECLARE oc1 INT;
DECLARE oc2 INT;
DECLARE otr smallint;
DECLARE isTrans smallint;

drop temporary table if exists offset_arr;
CREATE TEMPORARY TABLE IF not EXISTS offset_arr
( 
C1 INT, 
C2 INT,
Trans BIT
)engine=memory;


/*      delete from offset_arr;*/


IF l1 = 0 THEN
RETURN l2; 
END IF;
IF l2 = 0 THEN 
RETURN l1; 
END IF;  






WHILE ( ( c1 < l1 ) AND ( c2 < l2 ) ) 
DO 
IF ( Substring(s1, c1 + 1, 1) = Substring(s2, c2 + 1, 1) ) THEN

SET local_cs=local_cs + 1; 
SET isTrans=0;
SET oc1=NULL;
SELECT  o.C1, o.C2,o.Trans  into oc1, oc2, otr
FROM offset_arr o 
WHERE c1 <= o.C1 OR c2 <= o.C2
LIMIT 1;
IF oc1 IS NOT NULL THEN

SET isTrans=CASE WHEN ABS(c2-c1)>=ABS(oc2-oc1) THEN 1 ELSE 0 END;
IF (isTrans=1) THEN
SET trans=trans+1;
ELSE
IF (otr=0) THEN


SET trans=trans+1;
UPDATE offset_arr SET Trans=1 WHERE C1=oc1 AND C2=oc2;
END IF;
END IF;  
END IF;


DELETE FROM offset_arr 
WHERE  c1 > C1 
AND c1 > C2
AND c2 > C1 
AND c2 > C2; 


INSERT INTO offset_arr 
VALUES (c1, c2, isTrans); 

ELSE 

SET lcss = lcss + local_cs; 
SET local_cs = 0; 


IF ( c1 != c2 ) THEN
-- using min allows the computation of transpositions 


IF ( c1 < c2 ) THEN
SET c2=c1; 
ELSE 
SET c1=c2; 
END IF;

IF ( maxDistance > 0 ) THEN


IF ( c1 > c2 ) THEN
SET temporaryDistance = c1 - ( lcss - trans / 2.0 ); 
ELSE 
SET temporaryDistance = c2 - ( lcss - trans / 2.0 ); 
END IF;


IF ( temporaryDistance > maxDistance ) THEN
RETURN Round(temporaryDistance, 0); 
END IF;  
END IF;

END IF;


/*if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
--so that we can have only one code block handling matches  */
SET i=0; 


label : WHILE ( i < maxOffset AND (c1 + i < l1 OR c2 + i < l2) ) do


IF ( ( c1 + i < l1 ) 
AND Substring(s1, c1 + i + 1, 1) = Substring(s2, c2 + 1, 1) 
) THEN


SET c1 = c1 + i - 1; 
SET c2=c2 - 1; 


leave label; 
END IF; 


IF ( ( c2 + i < l2 ) 
AND Substring(s1, c1 + 1, 1) = Substring(s2, c2 + i + 1, 1) 
)THEN 


SET c1 = c1 - 1; 
SET c2=c2 + i - 1; 


leave label;  
END IF;


SET i=i + 1; 
END WHILE;
END IF; 


SET c1=c1 + 1; 
SET c2=c2 + 1; 


IF ( ( c1 >= l1 ) OR ( c2 >= l2 ) ) THEN
SET lcss = lcss + local_cs; 
SET local_cs = 0; 


IF ( c1 < c2 ) THEN
SET c2=c1; 
ELSE 
SET c1=c2; 
END IF;
END IF;
END while;


SET lcss = lcss + local_cs; 


/*apply the transposition cost to the final result*/
IF ( l1 > l2 ) THEN
SET result = l1 - (lcss - trans);
ELSE 
SET result = l2 - (lcss - trans);
END IF;
RETURN result; 
END

Java implementation [expand/collapse]

Here is a Java version, gracefully provided by Nathanæl Fischer:

/**
 * Sift4 - common version
 * online algorithm to compute the distance between two strings in O(n)
 * Algorithm by siderite, java port by Nathan Fischer 2016
 * /blog/super-fast-and-accurate-string-distance.html
 * @param s1
 * @param s2
 * @param maxOffset the number of characters to search for matching letters
 * @return
 */
public static double sift4(String s1, String s2, int maxOffset) {
    class Offset {
        int c1;
        int c2;
        boolean trans;

        Offset(int c1, int c2, boolean trans) {
            this.c1 = c1;
            this.c2 = c2;
            this.trans = trans;
        }
    }

    if (s1 == null || s1.isEmpty())
        return s2 == null ? 0 : s2.length();

    if (s2 == null || s2.isEmpty())
        return s1.length();

    int l1 = s1.length();
    int l2 = s2.length();

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    LinkedList < Offset > offset_arr = new LinkedList < > (); //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            boolean isTrans = false;
            //see if current match is a transposition
            int i = 0;
            while (i < offset_arr.size()) {
                Offset ofs = offset_arr.get(i);
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.remove(i);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.add(new Offset(c1, c2, isTrans));
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}

Powershell implementation [expand/collapse]

Powershell implementation of simple Sift4, by Kirk Sayre:

function Calc-Sift4Distance 
{
  <# 
      .SYNOPSIS 

      Compute the edit distance between 2 strings using the sift4 string
      edit distance algorithm.

      .PARAMETER s1

      The 1st string

      .PARAMETER s2

      The 2nd string

      .PARAMETER maxOffset

      The maximum common substring length for which to search. The default
      is 10.

      .RETURN

      The # of edits needed to make the given 2 strings equal.
  #>

  Param(
    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s1,

    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s2,

    [Parameter(ValueFromPipelineByPropertyName = $True)]
    [Int]
    $maxOffset = 10
  )

  # Handle null or empty strings.
  if ((-not $s1) -or ($s1.length -eq 0)) 
  {
    if (-not $s2) 
    {
      return 0
    }
    return $s2.length
  }

  if ((-not $s2) -or ($s2.length -eq 0)) 
  {
    return $s1.length
  }

  # Initialization.
  $l1 = $s1.length
  $l2 = $s2.length
  $c1 = 0 # Cursor for string 1.
  $c2 = 0 # Cursor for string 2.
  $lcss = 0 # Largest common subsequence.
  $local_cs = 0 # Local common substring.

  # Scan strings.
  while (($c1 -lt $l1) -and ($c2 -lt $l2)) 
  {
    if ($s1[$c1] -eq $s2[$c2]) 
    {
      $local_cs++
    }
    else 
    {
      $lcss += $local_cs
      $local_cs = 0
      if ($c1 -ne $c2) 
      {
        # Using max to bypass the need for computer transpositions ('ab' vs 'ba').
        $c1 = $c2 = (@($c1, $c2) | Measure-Object -Maximum).Maximum
      }

      for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) 
      {
        if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) 
        {
          $c1 += $i
          $local_cs++
          break
        }

        if ((($c1 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i])) 
        {
          $c2 += $i
          $local_cs++
          break
        }
      }
    }
    $c1++
    $c2++
  }
  $lcss += $local_cs
  return [math]::Round((@($l1, $l2) | Measure-Object -Maximum).Maximum - $lcss)
}

C++ implementation [expand/collapse]

Thanks to Hugo Amaro, a C++ implementation:

struct sift_offset {
  int c1;
  int c2;
  bool trans;
};

template < typename T >
  int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) {
    if (!s1 || !s1_size) {
      if (!s2) {
        return 0;
      }
      return s2_size;
    }

    if (!s2 || !s2_size) {
      return s1_size;
    }

    int l1 = s1_size;
    int l2 = s2_size;

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    std::vector < sift_offset > offset_arr; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
      if (s1[c1] == s2[c2]) {
        local_cs++;
        bool isTrans = false;
        //see if current match is a transposition
        int i = 0;
        while (i < offset_arr.size()) {
          sift_offset ofs = offset_arr[i];
          if (c1 <= ofs.c1 || c2 <= ofs.c2) {
            // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
            isTrans = std::abs(c2 - c1) >= std::abs(ofs.c2 - ofs.c1);

            if (isTrans) {
              trans++;
            } else {
              if (!ofs.trans) {
                ofs.trans = true;
                trans++;
              }
            }
            break;
          } else {
            if (c1 > ofs.c2 && c2 > ofs.c1) {
              offset_arr.erase(offset_arr.begin() + i);

            } else {
              i++;
            }
          }
        }
        offset_arr.push_back({
          c1,
          c2,
          isTrans
        });
      } else {
        lcss += local_cs;
        local_cs = 0;
        if (c1 != c2) {
          c1 = c2 = (int) min(c1, c2); //using min allows the computation of transpositions
        }
        //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
        //so that we can have only one code block handling matches 
        for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
          if ((c1 + i < l1) && (s1[c1 + i] == s2[c2])) {
            c1 += i - 1;
            c2--;
            break;
          }
          if ((c2 + i < l2) && (s1[c1] == s2[c2 + i])) {
            c1--;
            c2 += i - 1;
            break;
          }
        }
      }
      c1++;
      c2++;
      if (maxDistance) {
        int temporaryDistance = (int) max(c1, c2) - lcss + trans;
        if (temporaryDistance >= maxDistance) return std: round(temporaryDistance);
      }

    }
    // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
    if ((c1 >= l1) || (c2 >= l2)) {
      lcss += local_cs;
      local_cs = 0;
      c1 = c2 = (int) min(c1, c2);
    }
  }
lcss += local_cs;
return std::round((int) max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}

You can find a Go implementation here, written by Jason W. Hutchinson.
There is also a Swift implementation here.
A Perl 6 (now called Raku) implementation can be found here.

Comments

October 17, 2025

I fed the c# code into chatGPT5, and it had this to say (I don't know enough about this to form an opinion): chatGPT5: Here’s a tight code audit of that “newest” Sift4 you pasted, with concrete fixes. High-impact issues Stale-offset removal still wrong (crossed indices) Both GeneralDistance and CommonDistance remove old offset pairs with the wrong comparisons: - if (c1 > ofs.C2 && c2 > ofs.C1) + if (c1 > ofs.C1 && c2 > ofs.C2) { offset_arr.Remove(op); } Without this, offset_arr grows and transposition detection can miscount. SimplestDistance look-ahead window too strict Your simplest variant still uses &&, which blocks scanning when only one side has room: - for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++) + for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) You verified this with the abXc/abc example. Early-exit logic (good now) You’re using > (strict) for the early stop in both GeneralDistance and CommonDistance. ✅ You compute the temporary distance after realigning c1 = c2 = Math.Min(c1, c2) on mismatch. ✅ That matches the blog’s bug fix intent and avoids overshooting. Minor correctness / clarity Docs vs behavior for MaxDistance XML says “stop if the distance reaches this value”; code uses > (exceeds). Adjust the comment or switch to >= (I recommend keeping > and updating the comment). Return types & rounding GeneralDistance returns rounded double; CommonDistance returns unrounded double. Pick one policy. Suggestion: return int (rounded) for both, or return raw double for both and let callers round. Defaults / nullability nits (optional) Use Array.Empty<string>() instead of new string[0]. Option delegates accept string but your code calls Tokenizer(null)—either allow nullable inputs (change to string?) or stop calling with null. If you’re on C# 8+, consider #nullable enable and add ! where you’ve already guarded nulls. Perf trivia (optional) LinkedList<OffsetPair> is cache-unfriendly; a List<OffsetPair> plus a sliding start index is often faster. Drop-in patches (exact lines) A) Offset removal (both methods) - if (c1 > ofs.C2 && c2 > ofs.C1) + if (c1 > ofs.C1 && c2 > ofs.C2) { offset_arr.Remove(op); } B) SimplestDistance window - for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++) + for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) C) Cosmetic default - ? new string[0] + ? Array.Empty<string>() Quick regression checks Equal: ("abc","abc") → 0 (all variants) Insert: ("","abc") → 3 Transposition: ("xab","xba") → Common/General add 1 for transposition; Simplest ignores (by design) Window edge: ("abXc","abc", maxOffset:2) → 1 with fixed simplest loop

ziggy90127

May 19, 2023

Which implementation crashes? I just tried it and it works.

Siderite

May 19, 2023

Hello, Simplest Sift4 code crashes (Index out of bounds error) when there is a string that is similar to another string but only in certain places. For example, input1: cgrm, input2: doogmm, will trigger a crash. I was able to stop this by changing Math.max to Math.min

Programmer

October 26, 2022

Hi, Sobral! Thanks! What you want can be done by implementing your own matchingEvaluator and tokenMatcher functions.

Siderite

October 26, 2022

Super useful and interesting! Have you ever considered taking the distance on keyboard into account?

Sobral

May 10, 2018

For some reason Disqus thought your comment was spam, that's why it didn't appear on the site immediately. I've updated the post, please check if it is what you wanted to share. Thanks again!

Siderite

May 10, 2018

Thank you for the code. I will certainly try and update the post.

Siderite

May 10, 2018

Thanks for the algorithm. c++ version. struct sift_offset { int c1; int c2; bool trans; }; template int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) { if (!s1 || !s1_size) { if (!s2) { return 0; } return s2_size; } if (!s2 || !s2_size) { return s1_size; } int l1 = s1_size; int l2 = s2_size; int c1 = 0; //cursor for string 1 int c2 = 0; //cursor for string 2 int lcss = 0; //largest common subsequence int local_cs = 0; //local common substring int trans = 0; //number of transpositions ('ab' vs 'ba') std::vector offset_arr; //offset pair array, for computing the transpositions while ((c1 < l1) && (c2 < l2)) { if (s1[c1] == s2[c2]) { local_cs++; bool isTrans = false; //see if current match is a transposition int i = 0; while (i= std::abs(ofs.c2 - ofs.c1); if (isTrans) { trans++; } else { if (!ofs.trans) { ofs.trans = true; trans++; } } break; } else { if (c1>ofs.c2 && c2>ofs.c1) { offset_arr.erase(offset_arr.begin()+i); } else { i++; } } } offset_arr.push_back({ c1, c2, isTrans }); } else { lcss += local_cs; local_cs = 0; if (c1 != c2) { c1 = c2 = (int)min(c1, c2); //using min allows the computation of transpositions } //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop) //so that we can have only one code block handling matches for (int i = 0; i = maxDistance) return std::round(temporaryDistance); } // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly if ((c1 >= l1) || (c2 >= l2)) { lcss += local_cs; local_cs = 0; c1 = c2 = (int)min(c1, c2); } } lcss += local_cs; return std::round((int)max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result } Not sure why but can't fix the broken lines in the code formatter.

Hugo Amaro

July 31, 2017

Thanks, man! I've updated the post.

Siderite

July 31, 2017

Here is a PowerShell implementation of the simple sift4 algorithm: function Calc-Sift4Distance { <# .SYNOPSIS Compute the edit distance between 2 strings using the sift4 string edit distance algorithm. .PARAMETER s1 The 1st string .PARAMETER s2 The 2nd string .PARAMETER maxOffset The maximum common substring length for which to search. The default is 10. .RETURN The # of edits needed to make the given 2 strings equal. #> Param( [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)] [String] $s1, [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)] [String] $s2, [Parameter(ValueFromPipelineByPropertyName = $True)] [Int] $maxOffset = 10 ) # Handle null or empty strings. if ((-not $s1) -or ($s1.length -eq 0)) { if (-not $s2) { return 0 } return $s2.length } if ((-not $s2) -or ($s2.length -eq 0)) { return $s1.length } # Initialization. $l1 = $s1.length $l2 = $s2.length $c1 = 0 # Cursor for string 1. $c2 = 0 # Cursor for string 2. $lcss = 0 # Largest common subsequence. $local_cs = 0 # Local common substring. # Scan strings. while (($c1 -lt $l1) -and ($c2 -lt $l2)) { if ($s1[$c1] -eq $s2[$c2]) { $local_cs++ } else { $lcss += $local_cs $local_cs = 0 if ($c1 -ne $c2) { # Using max to bypass the need for computer transpositions ('ab' vs 'ba'). $c1 = $c2 = (@($c1, $c2) | Measure -Max).Maximum } for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) { if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) { $c1 += $i $local_cs++ break } if ((($c1 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i])) { $c2 += $i $local_cs++ break } } } $c1++ $c2++ } $lcss += $local_cs return [math]::Round((@($l1, $l2) | Measure -Max).Maximum - $lcss) }

Kirk Sayre

April 18, 2017

I updated the `simple` version of the python port and now the results are identical to the JavaScript implementation. The problem was that `'foo'.charAt(4) == ''` in JS and `'foo[4]' raises an exception` in python. Checking string length fixed this. https://gist.github.com/lbenedix/8275d01c2289a7a20d2c6c27ee8ae68e These are still not equal to Levenshtein (neither in python nor in JS): sift4_simple("abcdefg", "xabxcdxxefxgx") sift4_simple("xabxcdxxefxgx", "abcdefg") sift4_simple("example", "samples") sift4_simple("sturgeon", "urgently") sift4_simple("levenshtein", "frankenstein") sift4_simple("distance", "difference")

Lukas Benedix

April 12, 2017

Thank you for the test. It seems that indeed the Python implementation is not that correct. The results on this site with the Javascript version are quite different. Alas, I despise Python as a language and I only copy pasted what was sent to me. If you can fix it, I will be grateful to you.

Siderite

April 12, 2017

Here are the two python-files: Simple: https://gist.github.com/lbenedix/8275d01c2289a7a20d2c6c27ee8ae68e Common: https://gist.github.com/lbenedix/41de53f1afebd45126e990c5d2e9ac42

Lukas Benedix

April 12, 2017

I ported the simple and the common version to python and did some pretty basic tests. Doesn't look very good: test_sift4_on_empty_strings =========================== sift4("", "") = 0 expected: 0 sift4("a", "") = 1 expected: 1 sift4("", "a") = 1 expected: 1 sift4("abc", "") = 3 expected: 3 sift4("", "abc") = 3 expected: 3 test_sift4_on_equal_strings =========================== sift4("", "") = 0 expected: 0 sift4("a", "a") = 0 expected: 0 sift4("abc", "abc") = 0 expected: 0 test_sift4_on_inserts ===================== sift4("", "a") = 1 expected: 1 sift4("a", "ab") = 1 expected: 1 ! sift4("b", "ab") = 2 expected: 1 ! sift4("ac", "abc") = 2 expected: 1 ! sift4("abcdefg", "xabxcdxxefxgx") = 10 expected: 6 test_sift4_on_delete ==================== sift4("a", "") = 1 expected: 1 sift4("ab", "a") = 1 expected: 1 ! sift4("ab", "b") = 2 expected: 1 ! sift4("abc", "ac") = 2 expected: 1 ! sift4("xabxcdxxefxgx", "abcdefg") = 10 expected: 6 test_sift4_on_substitutions =========================== sift4("a", "b") = 1 expected: 1 sift4("ab", "ac") = 1 expected: 1 sift4("ac", "bc") = 1 expected: 1 sift4("abc", "axc") = 1 expected: 1 sift4("xabxcdxxefxgx", "1ab2cd34ef5g6") = 6 expected: 6 test_sift4_on_mixed_operations ============================== ! sift4("example", "samples") = 2 expected: 3 ! sift4("sturgeon", "urgently") = 4 expected: 6 ! sift4("levenshtein", "frankenstein") = 10 expected: 6 ! sift4("distance", "difference") = 7 expected: 5

Lukas Benedix

February 8, 2017

It's an approximation of Levenstein, not a replacement. It's fast and less accurate. I believe the gain of speed offsets the loss of precision.

Siderite

February 8, 2017

Nice algorithm, but: s1: xxxz s2: yxxx Levenstein = 2 Sift3 = 2 Sift4 = 1

Anton Makovsky

December 20, 2016

Thanks for sharing the document. Seems full of interesting info.

Siderite

December 20, 2016

If you want to perform super fast string similarity on large lists of strings then a pass-join implementation is advisable. http://dbgroup.cs.tsinghua.edu.cn/ligl/papers/fcs15-string-survey.pdf

Thad Guidry

November 17, 2015

The Transact SQL version should be easy to modify, since it only declares variables and loops around. Probably a well done search and replace will yield what you want. I haven't been working with mysql since quite some time.

Siderite

November 17, 2015

do you have a mysql version?

esco

October 28, 2015

A way I see to go around that limitation is to backtrack, so move into logarithmic time, rather than linear. Something like "I found 'x' after 20 characters, but I am going to look for the next character 'a' and find it in one, so I will delete the x find". Another is to go to Sift's roots and copy the characters I skip, then do a second run on the resulting strings. Certainly food for thought. The transpositions do help close the gap between Sift4 and Levenshtein, but to be fair, I don't really like them that much. I build a really heavy data structure and waste a lot of time and the result is not always perfect, as evidenced by your example. I am considering a variant of Sift that builds remainder strings and runs the algorithm on them after it is done. Get rid of transpositions altogether and maybe increase the run time only a fraction. The question is why doesn't Sift3 have the same problem? The answer is that the Sift3 on the page was not taking the maxOffset parameter, it was using 5 all the time :) I fixed it and the result is 21, so worse. Thanks for the comment and the work! I am glad people are inspired by my makeshift piece of code.

Siderite

October 28, 2015

This seems pretty solid. You can definitely construct strings that it differs from the Levenshtein distance pretty significantly (see below), so it would be nice to have some kind of theoretical/intuitive explanation of what kinds of text produce such large divergences and how large they might be. One example: s1: *x*aaaaaaaaaaaaaaaaaaaxaaaaaaaaaaaaaaaaaaa s2: *y*aaaaaaaaaaaaaaaaaaaxaaaaaaaaaaaaaaaaaaa maxoffset: 100 Levenstein: 1 Sift3: 1 Sift4: 20 Given that Sift3 seems to do well, maybe this issue with insertions in self-similar text is an oversight rather than a fundamental limitation? Otherwise, the speedup over standard Levenstein calculators is pretty impressive.

Matt B.

October 22, 2015

As an addendum, I was able to squeeze just a little bit of performance optimizing the Javascript code, both in Simplest Sift4 and General Sift4, but so little that it was not worth it from a code readability perspective. I found some avenues for improvement of the algorithm, though, but I can't promise anything. If you want Simplest Sift4 to behave like Sift3 (and be faster), try replacing (c1+i<l1 || c2+i<l2) with (c1+i<l1 && c2+i<l2), but you will lose accuracy. In my tests with 300000 pairs of frequently used English phrases, the accuracy of Sift4 was twice as good as for Sift3. (meaning the sum of squared errors from the true Levenshtein distance for all 300000 pairs was double for Sift3) For the General version, the Tokenizer function takes its toll quite a bit, since each time it compares, it splits both strings into string arrays, and of course, the transposition computations. Also, various browser Javascript engine optimizations might throw performance tests out of whack. For all 300000 comparisons I mentioned it took between 3.5 and 4 seconds, with no modifications and nothing running in the background, so each time it could have a time difference of around 6%... The advantage of Sift is that it is a linear time algorithm, depending on the maximum length of the strings compares and the maxOffset parameter. For small strings, like one word, the difference from Levenshtein will be less significant than for large strings, like addresses. If Sift4 is 6 times slower than Sift3 (as in my tests with General Sift4 and no maxDistance parameter) it is still working in linear time.

Siderite

October 20, 2015

I've looked at the javascript tests and also tried some things locally - annoyed by the multiple captchas on that site. I can tell you that the answer is a little bit counter-intuitive: it's in the data you used. It becomes clear when you look at the version edited by me where there are also the simple algorithms. They perform almost as bad as the complex one. You see, the idea is that in order to get the best result (most similar to Levenshtein), thee algorithm must either compute transpositions (using more complex code, perhaps that part can be optimized, since it stores data in an array and keeps comparing and removing items) or just skip them altogether by aligning the two cursors for the two strings to the largest one. Example: "asdfasd","asdgsg" a,s,d are the same, it goes over them and increments the longest common substring each time. lcs=3. Then f and g are different, so it starts looking for the next match: f with s: no g with a: no f with g: no (reached the end of string 2 - here Sift3 exits) g with s: no g with d: no Having found no match, it increments cursors and tries the next pair and since a and s are different it looks for the next match: a with g: no s with s: yes! lcs=4 and now the cursors are at positions 5 and 4. Are d and g the same? no. So we must align the cursors. Sift4s aligns to the largest one, in order to skip the transposition calculations. So now cursors are both 5 Are s and g equal? no. Next matches: d with g: no (reached the end of both strings, exit) This means a total of 15 comparisons. Sift3 (any of them) does 7 comparisons. However, look at the results of the function, as well. For the three pairs in the test (I added another one with long strings just to make clear the difference from Levenshtein), Levenshtein reports distances of 3, 0 and 44. Sift4 reports 3, 0 and 46, while Sift4 Simplests reports 3, 0 and 48. What does Sift3B report? 6,5 and 12! Now I realize that Sift3B is buggy :) BTW, the standard version (used by MailCheck in a previous incarnation) is further down and it reports 3.5, 0 and 46. That doesn't mean that Sift4 cannot be optimized. In this test Sift3 standard is both fast and rather accurate, but it is a lucky shot. If you look at the comments for Sift3 there are several people that found specific strings that break the algorithm. Again, thanks for your interest and I am curious how you are going to proceed. Perhaps the code, as it is, can be improved in terms of performance. It started like that, then I tried to make it more accurate, but obviously I lost focus from performance. I will attempt to work on it in the following months.

Siderite

October 19, 2015

Which flavour of Sift4 did you use? The mailcheck version is sift3, I think, and sift4 should be slightly slower, but give better results. The code for sift4 is more flexibile; I guess you can simplify it to cover your specific scenario, after you play with all the options. I will take a look at the page and see what I can do, of course. Thanks a lot for the contribution!

Siderite

October 19, 2015

Hello Siderite, I just did a jsperf on this and found that the mailcheck implementation of sift3 was fastest, please take a look in case I missed somehing: http://jsperf.com/word-distance I should have loaded the algorithms as external functions but just wanted to make a quick comparison.

Alejandro U.

May 7, 2015

Thanks a lot, David! I am certainly not adverse to money :) However I did release Sift with a "do whatever you want with it" licence, so it wouldn't be fair. I would like to share the experience, though. Tell me how you used the algorithm, what were the problems or the things you liked, stuff like that, and I will be happy.

Siderite

May 7, 2015

I'm planning to use your algorithm (not sure which version yet) in the PHP code for internationalizing a new website I'm creating for an international nonprofit organization. Sift 3 or 4 will be used when strings in the base language files have changed, to help the human translator by showing the most probable old translation for each string that changed in the base language. Unfortunately, no money involved in this project (I'm retired). If I ever commercialize my translation tool, I'll certainly discuss royalties with you. Why not share the income, if any?

David Spector

April 27, 2015

Well, after I learned that my algorithm was in the LCSS class, I borrowed the metric a little bit and in Sift4 I have the transpositions, a concept borrowed from Jaro. I also tested my algorithm by computing the average error from Levenstein, but I don't think I am going to do this anymore. Sift is not a pretty mature algorithm and its metrics are its own. But you can say that it approximates (or attempts to) Levenstein.

Siderite

April 27, 2015

Ok. So there's no direct scientific publication about it. May I ask to which edit distance it is ispired by? As Jaro-Winkler, Levenshtein or what? The thing is that I'm going to use it in a comparison between different string similarity methods, and this comparison will appear in a scientific publication. I'd like to put some references, and the url of the blog will be there for sure. The fact is that if I may use also a reference to some relate scientific publication, it would be better. Maybe I can say that is available at this url, but it is inspired by other that are present in literature.

Emanuele Bastianelli

April 27, 2015

Well, the entire story is written in the blog posts. Even if at the end it seems I created something that is very similar to other algorithms, it is basically original work.

Siderite

April 27, 2015

Hi again guys, I was wondering: is there any scientific publication or source I may cite in my works about this measure? For example, if there is an edit distance it is inspired to or an explanation in any scientific journal or conference related to it? Thank you

Emanuele Bastianelli

March 28, 2015

Emanuele (and others) I've updated the algorithm again. Now it shows the correct distance of 1 for the strings provided by you. I've written more details about the changes in implementation at the beginning of the post. Once I determine a more rational integration of the concepts I've implemented in Sift4 I will probably remove it out of Beta and place the implementations somewhere on Github or CodePlex.

Siderite

March 27, 2015

Thanks, Emanuele! You have indeed found an edge case that exposes a bug in the algorithm. I have updated the algorithm on the page, but even so, the result is not what you would expect. Before, the first match would have been first letter of the first string with the last letter of the second string. Then the loop would end, because it had reached the end of one of the strings. I've changed the algorithm to not allow that and instead just reset the counters to their minimum value and continue. However, right now the situation is this: after the first match, the counters are reset and all the other characters are identical. The algorithm ends with an lcss of 13, but since the first match spanned 13 characters, all the other matches are counted as transpositions, and the algorithm returns the value of 6, which is 13 (length of strings) - 13 (lcss) + 12/2 (half of the number of transpositions. I have to think about how to solve this situation in an intelligent way. Meanwhile I've updated the page so you can try Levenstein, Sift3 and Sift4 directly, in Javascript. As for the Java code, the first thing I noticed is that getGeneralDistance and getCommonDistance don't use a maxOffset parameter, instead the class field is used, which as far as I can see is always 0. The simplest distance does use the method parameter. I think there are some online C# to Java converters around. Anyway, thanks again and I look forward to hearing from you.

Siderite

March 27, 2015

Hi guys, I moved from Sift3 to Sift4 for an issue I found on Sift3 (see last post of the Sift3 page). I quickly implemented a version of the Sift4 in java starting from the C# version, see the code at this link http://www.dis.uniroma1.it/~labrococo/S4R/Sift4.java (Sift4.java) Maybe there are some errors, I don't know, but it seems that it's working strangely. For example, whenever I compare two strings with Simplest and Common distances, it always return 1.0. When I use the General, it returns me strange values: for example, for "T EY1 B AH0 L" and "L EY1 B AH0 L", it returns me 15.0 or 21.0 according to the value of maxDistance and maxOffset.

Emanuele Bastianelli

February 25, 2015

Thanks this is fantastic stuff!

Anonymous

February 24, 2015

Is there such a big difference between the C# and a C version?

Siderite

February 24, 2015

I really like sift3, which I translated to C. I'd love to have a version of sift4 in C as well! I need the common version that takes a max distance, but have no need for the complete version with a tokenizer. Thanks for your hard work! Philip

Anonymous

November 20, 2014

... and a Microsoft Transact SQL version.

Siderite

November 20, 2014

I've updated the post with C# and PHP versions.

Siderite

November 18, 2014

Probably common, but I can see myself using it for more complex things later so maybe general. I'm using it for loose matching on OCR content gained through Tesseract 3.04

Drew Llewellyn

November 18, 2014

Thank you! Which version would you be most interested in? The common, the simplest or the general? What would you plan to use in your project?

Siderite

November 18, 2014

Hi, Great looking stuff! Would love to see a PHP implementation. I'm using Sift3Plus for PHP at the moment. Great work, can't wait to try it in PHP.

Drew Llewellyn

November 13, 2014

I intend to post a C# version, yes. What algorithms do you want me to compare it with? What would be the test data? These are important questions. In my tests I compared it with Levenshtein, but then again, I chose what I thought was good test data: common English phrases, randomly generated strings and variations thereof.

Siderite

November 12, 2014

Thanks for keeping on working on this! Two requests: 1. Could you put up a C# version? 2. Compare speed and accuracy to the other string comparison algorithms?

Anonymous

Intro

New concepts in Sift4

The code

Simplest Sift4:

Common Sift4:

Extended/General Sift4:

Options explained

FAQ

Implementations in other languages

Comments

Post a comment