Just a note to remember this for future work. I wanted to replace GETDATE() default column values with SYSUTCDATETIME(). This is the script I used:
-- declare a string that will hold the actual SQL executed
DECLARE @SQL NVARCHAR(Max) = ''
SELECT @SQL=@SQL+
N'ALTER TABLE ['+t.name+'] DROP CONSTRAINT ['+o.name+'];
ALTER TABLE ['+t.name+'] ADD DEFAULT SYSUTCDATETIME() FOR ['+c.name+'];
'
-- drop the default value constraint, then add another with SYSUTCDATETIME() as default value
FROM sys.all_columns c -- get the name of the columns
INNER JOIN sys.tables t -- get the name of the tables containing the columns
ON c.object_id=t.object_id
INNER JOIN sys.default_constraints o -- we are only interested in default value constraints
ON c.default_object_id=o.object_id
WHERE o.definition='(getdate())' -- only interested in the columns with getdate() as default value

-- execute generated SQL
EXEC sp_executesql @SQL

Recently I created a framework for translating JSON requests from a REST API to entities sent to the database. For simple objects it was very easy: just create an SQL parameter for each property. However, for complex objects - those having other objects as properties - this was not a solution. So I used a DataContractSerializer to transform the object to XML, send it as an XML SQL parameter and extract the values from it in the stored procedures. Then I noticed datetime inconsistencies between the two approaches. What was going on?

Let's start with the code. The DateTime object created from the JSON is a date and time value with a timezone, like 16:00 UTC+1. That is 15:00 in universal time. Once you send it as a parameter to a stored procedure, the value received by the stored procedure is 16:00 (the server has the same timezone). In SQL Server, the DATETIME and DATETIME2 types don't store timezone information. However, when sent through XML, the value looks like this: 2015-03-09T16:00:00.0000000+01:00. Using SELECT [Time] = T.Item.value('@Time','DATETIME2') FROM @Xml.nodes('//Location/SDO') as T(Item), the value returned is 15:00! You only get 16:00+01 if you extract it as DATETIMEOFFSET.

So let's recap: when you send a DateTime with a timezone offset as an SQL parameter, the value reaching the SQL Server is the local time. When you extract a textual value with a timezone offset from XML into a DATETIME, using the .value method, the value you get back is basically the universal time.

Solutions? Well, if you are going to use DateTime, you might as well accept that servers and devices are not always in the same timezone; always translating values to universal time might be a good idea. Another solution is to extract from XML into a DATETIMEOFFSET type, which holds both the datetime and the timezone offset. Converting that value to DATETIME or DATETIME2 just removes the offset (warning: it does NOT give the local time, unless you are in the same zone as the timezone in the DATETIMEOFFSET value).
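Here is a minimal T-SQL sketch of the behaviour described above (the XML value is made up for illustration):

DECLARE @Xml XML = N'<Location><SDO Time="2015-03-09T16:00:00.0000000+01:00"/></Location>';

-- extracting to DATETIME2 converts to universal time: 2015-03-09 15:00:00
SELECT [Time] = T.Item.value('@Time','DATETIME2')
FROM @Xml.nodes('//Location/SDO') AS T(Item);

-- extracting to DATETIMEOFFSET keeps the offset: 2015-03-09 16:00:00 +01:00
SELECT [Time] = T.Item.value('@Time','DATETIMEOFFSET')
FROM @Xml.nodes('//Location/SDO') AS T(Item);

-- converting a DATETIMEOFFSET to DATETIME2 merely drops the offset: 2015-03-09 16:00:00
SELECT CONVERT(DATETIME2, CONVERT(DATETIMEOFFSET, '2015-03-09T16:00:00.0000000+01:00'));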

I was trying to solve another problem: operations that create, alter or drop databases cannot be executed inside a .NET transaction. Therefore I created a new TransactionScope inside the main one, using the TransactionScopeOption.Suppress option, but then I started getting this weird exception: "The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements." But I did Complete and Dispose all my transaction scopes, so what was going on? Long story short, this really confusing message hides a simple problem: your transaction probably timed out. Create the transaction scope with a large TimeSpan as the Timeout and it will get through. If you can, use a try/catch/finally block in which you Dispose the transaction (just remember that a using construct is basically a try/finally block). In my case, I was conditionally creating this new TransactionScope, so I wasn't using using. The transaction would fail and bleed into the outer transaction, where the confusing exception was then thrown.
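For illustration, a minimal sketch of creating the scope with an explicit, generous timeout (the timeout value here is made up; pick one that fits your workload):

using System;
using System.Transactions;

var options = new TransactionOptions
{
    IsolationLevel = IsolationLevel.ReadCommitted,
    Timeout = TimeSpan.FromMinutes(30) // illustrative; the default timeout is only one minute
};
// a using construct guarantees Dispose even when an exception is thrown inside the block
using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
{
    // ... execute the SQL statements here ...
    scope.Complete();
}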

I have another SQL quirk for you today, particularly in Microsoft's SQL Server. It concerns the AVG function when the SUM of the averaged values would result in an overflow for that type: AVG fails with the same error. Let's imagine for a moment that we didn't have an AVG function. In that case, our code would use SUM and COUNT to average values. For the integer column x, the naive SUM(x)/COUNT(x) would fail when the sum goes over the INT maximum value (I am not sure that SUM should fail, either, but that is another discussion). So our solution would be something like CONVERT(INT,SUM(CONVERT(BIGINT,x)))/COUNT(x). Obviously, this is also the solution for the AVG issue: just average the values converted to the biggest type available. So it seems that in their flagship SQL Server, Microsoft implemented the naive version, hmm... To be fair, this behaviour is documented in the T-SQL function page: "If the sum exceeds the maximum value for the data type of the return value an error will be returned."
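A minimal repro sketch (the table variable is a throwaway example):

DECLARE @t TABLE (x INT);
INSERT INTO @t VALUES (2000000000), (2000000000);

-- SELECT AVG(x) FROM @t; -- fails: Arithmetic overflow error converting expression to data type int.
SELECT AVG(CONVERT(BIGINT, x)) FROM @t; -- works: averages in BIGINT, returns 2000000000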

As I see it, this is a bug, as it should have been handled internally, and one is indeed logged on Microsoft Connect: AVG function causes Arithmetic overflow error. Read the funny comment from a Microsoft employee who says a solution has been proposed, but they don't have time to implement it in SQL Server 2008, so probably they will do that in a future version. I'm trying this on SQL Server 2014; it's 7 years later, Microsoft!!

Just for kicks, I tried the same in MySQL and it works there. More than that, it doesn't fail for SUM, either, as it automatically returns a larger type (DECIMAL for integer arguments), pushing the issue to an eventual insert or update in an INT column.

I was trying to write a simple query in MySQL that only inserted a row if there wasn't one already there, else updated the existing one. MySQL has a mechanism to take care of that for you, provided you have a primary key or a unique index on the table you want to insert into or update: you can choose between INSERT ... ON DUPLICATE KEY UPDATE and REPLACE. However, what if you don't want to add an index on your table just for that?
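For reference, the keyed variant is a one-liner; a sketch assuming y is the primary key:

INSERT INTO MyTable (x,y) VALUES (1,2)
ON DUPLICATE KEY UPDATE x=x+1;

Without such a key, though, MySQL offers no direct equivalent.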

T-SQL (Microsoft SQL Server) has a syntax that works like this:
IF EXISTS(SELECT * FROM MyTable WHERE y=2)
    UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
    INSERT INTO MyTable (x,y) VALUES(1,2);

MySQL also has an IF...THEN syntax as well as an EXISTS statement. The problem is that they work in a completely different way: IF needs an extra THEN keyword, uses ELSEIF instead of chained ELSE IFs and must end with an END IF, while EXISTS can only be used inside a query, not as a standalone condition. So, let's translate the query above to MySQL and hope for the best:
INSERT INTO MyTable (x,y) 
SELECT 0,2 FROM DUAL
WHERE NOT EXISTS(SELECT * FROM MyTable WHERE y=2);
UPDATE MyTable SET x=x+1 WHERE y=2;

Notice several things: I am not using any IF and I am updating rows every time after the conditional insert. I am selecting values from DUAL because in MySQL any SELECT with a WHERE clause needs a table, and DUAL is a dummy one built into the engine (SELECT 1 WHERE 2=2; is not valid in MySQL). Also, I am inserting a value and then updating it, which is not the most efficient.

There is another way, closer to the original T-SQL query, but it doesn't use EXISTS. It looks like this:
IF (SELECT 1=1 FROM MyTable WHERE y=2 LIMIT 1) THEN
    UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
    INSERT INTO MyTable (x,y) VALUES(1,2);
END IF;

Things to notice here: I am using 1=1 to return a TRUE value that will make the IF work (with LIMIT 1, so that the subquery cannot return more than one row). Also, PAY ATTENTION!, this doesn't work as a simple query. I spent the better part of an hour trying to understand where the syntax error was while trying to execute this directly. Any flow operations like IF or WHILE are only valid in "programs", the MySQL term for stored procedures and functions.
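For example, a minimal sketch of such a "program" (the procedure name is made up):

DELIMITER //
CREATE PROCEDURE UpsertMyTable()
BEGIN
    IF (SELECT 1=1 FROM MyTable WHERE y=2 LIMIT 1) THEN
        UPDATE MyTable SET x=x+1 WHERE y=2;
    ELSE
        INSERT INTO MyTable (x,y) VALUES(1,2);
    END IF;
END//
DELIMITER ;

CALL UpsertMyTable();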

I hope I clarified things for you on this.

The algorithm works perfectly well and is better than Sift3, however it is slightly more complex. You might want to start with Sift3 in order to understand where Sift4 came from.

Update November 8 2022: I found a bug in the algorithm, relating to maxDistance. I've updated the code. If you didn't use maxDistance, you are unaffected. Basically, the fix is to compare temporaryDistance>maxDistance (before it was >=) and to move the calculation of the temporary distance after c1 and c2 are updated to their minimum value when a token was not found (otherwise the temporary distance might become larger than the final distance).


Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can now be changed via the new transpositionCostEvaluator option function. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3; however, when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.

Update 27 Mar 2015: Thanks to Emanuele Bastianelli, who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop there is an extra check to prevent the algorithm from exiting prematurely, before computing the remaining tokens.

Intro

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave me more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:

  • 33% more accurate
  • three different variants: simple, common and general
  • new concepts added
  • support for your own value and matching functions, different tokenizer functions, etc.
  • actually tested with a (slightly more) serious test
  • more robust, working better for large values of maxOffset


Before I get into the details, I am publishing the algorithm here for the moment, no Codeplex or PasteBin or GitHub or whatever. Also, it is written in Javascript for now, the C# and T-SQL versions pending. Of course, it would be great if, as before, the community of people using the algorithm implemented it in various programming languages, however I am a bit apprehensive because more often than not people came up with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

New concepts in Sift4

I created a test that used random strings, but also a huge list of commonly used English phrases, as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenshtein and the new algorithm and computed the error distance between the Levenshtein distance and the two Sift variants. This permitted me to see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases when Sift3(a,b) is different from Sift3(b,a). There are other edge cases, but this one in particular grated on me.

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, and it gets even better as the value increases. The common version is a bit more complex, but the error decreases by 33% and stays low for large maxOffset values. The extended or general version receives an options object that can change almost everything, but most important is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare strings by the words in the text, maybe even their synonyms. All this can be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what the value of their match is, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way that it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', it considers them transpositions and removes some of their value from the distance. Actually, looking at the implementation, it might be that I have independently discovered the Jaro distance; I will research this further. I don't know if the transposition calculation is optimal. At the moment it uses an array of all matches found up to a point, clearing it of values as the cursors move along the strings. The difference between the simple and the common versions of Sift4 is that the simple version does not compute transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed-up Sift3.

Another new concept added is that of the local substring. Imagine that the longest common subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non-matching characters. Each of these substrings can be used to improve the distance function. For example, one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the longest common subsequence is 5 in both cases, the longest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

The code

Simplest Sift4:

// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); //using max to bypass the need for computing transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}

Common Sift4:

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (maxDistance) {
                var temporaryDistance = Math.max(c1, c2) - lcss + trans;
                if (temporaryDistance > maxDistance)
                    return temporaryDistance;
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
}


Extended/General Sift4:

// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
//         maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//         tokenizer: a function to transform strings into vectors of tokens
//        tokenMatcher: a function to determine if two tokens are matching (equal)
//        matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
//        localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
//        transpositionCostEvaluator: a function to determine the value of an individual transposition. For example longer transpositions should have a higher cost.
//        transpositionsEvaluator: a function to determine the way the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options = extend(options, {
            maxDistance: null,
            tokenizer: function (s) {
                return s ? s.split('') : [];
            },
            tokenMatcher: function (t1, t2) {
                return t1 == t2;
            },
            matchingEvaluator: function (t1, t2) {
                return 1;
            },
            localLengthEvaluator: function (local_cs) {
                return local_cs;
            },
            transpositionCostEvaluator: function (c1, c2) {
                return 1;
            },
            transpositionsEvaluator: function (lcss, trans) {
                return lcss - trans;
            }
        });

    var t1 = options.tokenizer(s1);
    var t2 = options.tokenizer(s2);

    var l1 = t1.length;
    var l2 = t2.length;

    if (l1 == 0)
        return l2;
    if (l2 == 0)
        return l1;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1], t2[c2])) {
            local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans += options.transpositionCostEvaluator(c1, c2);
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (options.maxDistance) {
                var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
                if (temporaryDistance > options.maxDistance)
                    return Math.round(temporaryDistance);
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}

function extend(obj, def) {
    var result = {};
    for (var prop in def) {
        if (!obj || !obj.hasOwnProperty(prop)) {
            result[prop] = def[prop];
        } else {
            result[prop] = obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result = [];
    if (!s)
        return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s)
        return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i = 0; i <= 25; i++) {
        var val = 0;
        if (s) {
            for (var j = 0; j < s.length; j++) {
                var code = s.charCodeAt(j);
                if (code == i + 65 || code == i + 97)
                    val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity > 0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l < 1)
        return l; //0 -> 0
    return l - 1 / (l + 1); //1 -> 0.5, 2 -> 1.67, 9 -> 8.9
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}

As always, I will be most happy to know if you used my algorithm and how it performed, as well as receive any suggestion that you might have.

Options explained

Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters, so that the algorithm works on an array of one-character-long tokens. Other options are possible, like splitting the strings by whitespace so that the comparisons are done on words, or transforming a string into an array of strings N characters long, the so-called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word into an array of 26 values representing the number of occurrences of each letter a-z in the word.

The tokenMatcher function returns true if two tokens are matching. Tokens can be fuzzy matched: for example, the sift4TokenMatcher example function uses Sift inside Sift to determine the distance between two tokens and returns true if they match more than 70%.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length value when two tokens match. The default is 1, but one can use some other metric, like the similarity, for example. Of course, the common substring length has lost its meaning when these functions change, but the variable local_cs is still used.

The localLengthEvaluator takes the length value of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value as the one provided, but some functions could reward longer substrings.
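For example, a call of the extended version using the helper functions from the code section might look like this (the strings are illustrative):

var distance = sift4('around the house', 'around da house', 5, {
    tokenizer: wordSplitTokenizer, //compare the strings word by word
    tokenMatcher: sift4TokenMatcher, //two words match if they are more than 70% similar
    matchingEvaluator: sift4MatchingEvaluator //a match adds its similarity to local_cs instead of 1
});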

FAQ

Q: Can you make Sift4 work case insensitively?
A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply. Or implement a case insensitive tokenMatcher.

Q: Can you make Sift4 compare strings based on their meaning, like using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces them with the most used of their synonyms. A more complex solution would require analyzing the strings beforehand and turning them into some ordered synonym or equivalent expression form, then using Sift4 with a word tokenizer (one is provided in the extended algorithm source).

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course, they are different algorithms. This is a fuzzy distance calculator; it doesn't give you the exact value, and there are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. If you need more accuracy, try combining Sift with Levenshtein, for example computing Levenshtein only where Sift says the strings are above a certain similarity.

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is so far Javascript only, just compute the maxOffset that is convenient to you before you compare.

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know and I will try to fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything, it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.

Q: What are the edge cases for Sift?
A: Probably there are several, but I didn't really spot them. One of them is that one might find both letters at a position matching letters at other positions, but only one will count. Example: 'abxx' and 'bayy'. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string, it will find 'a' matched at position 1 in the second. It will increase both cursors and lcss will be increased as well. The next check will be 'b', the character at position 1 in the first string, matched against position 2 in the second string. No match, therefore both cursors are reset to 1, and the search starts again. The 'b' match is lost and the distance is 3 instead of 2. Also, I think there might be some situations where the cursors are not equal and the bigger one reaches the end of its string, thus terminating the algorithm, although there could have been more matches. Incidentally, I tried to fix both these issues and the error from Levenshtein was not really affected, but I am not 100% sure of the implementation.

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. There is a function that is symmetric and that is Math.min(Sift4(a,b),Sift4(b,a)), however it is twice as slow, obviously.

Implementations in other languages

C# implementation:
public class Sift4
{
    private Options _options;

    public Sift4(Options options)
    {
        if (options == null) options = new Options();
        if (options.Tokenizer == null)
        {
            options.Tokenizer = (s) => s == null
            ? new string[0]
            : s.ToCharArray().Select(c => c.ToString()).ToArray();
        }
        if (options.TokenMatcher == null)
        {
            options.TokenMatcher = (t1, t2) => t1 == t2;
        }
        if (options.MatchingEvaluator == null)
        {
            options.MatchingEvaluator = (t1, t2) => 1;
        }
        if (options.LocalLengthEvaluator == null)
        {
            options.LocalLengthEvaluator = (l) => l;
        }
        if (options.TranspositionCostEvaluator == null)
        {
            options.TranspositionCostEvaluator = (c1, c2) => 1;
        }
        if (options.TranspositionsEvaluator == null)
        {
            options.TranspositionsEvaluator = (l, t) => l - t;
        }
        _options = options;
    }

    /// <summary>
    /// General distance algorithm uses all the parameters in the options object and works on tokens
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public double GeneralDistance(string s1, string s2, int maxOffset)
    {
        var t1 = _options.Tokenizer(s1);
        var t2 = _options.Tokenizer(s2);

        var l1 = t1.Length;
        var l2 = t2.Length;

        if (l1 == 0) return _options.LocalLengthEvaluator(l2);
        if (l2 == 0) return _options.LocalLengthEvaluator(l1);

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0.0;  //largest common subsequence
        var local_cs = 0.0; //local common substring
        var trans = 0.0;  //cost of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (_options.TokenMatcher(t1[c1], t2[c2]))
            {
                local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans += _options.TranspositionCostEvaluator(c1, c2);
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans += _options.TranspositionCostEvaluator(ofs.C1, ofs.C2);
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (_options.MaxDistance != null)
                {
                    var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
                    if (temporaryDistance > _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += _options.LocalLengthEvaluator(local_cs);
        return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //apply transposition cost to the final result
    }

    /// <summary>
    /// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <param name="maxDistance"></param>
    /// <returns></returns>
    public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance = 0)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring
        var trans = 0;  //number of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans++;
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans++;
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (maxDistance > 0)
                {
                    var temporaryDistance = Math.Max(c1, c2) - (lcss - trans);
                    if (temporaryDistance > maxDistance) return temporaryDistance;
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += local_cs;
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - (lcss - trans); //apply transposition cost to the final result
    }

    /// <summary>
    /// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public static int SimplestDistance(string s1, string s2, int maxOffset)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Max(c1, c2);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - lcss;
    }

    private class OffsetPair
    {
        public int C1 { get; set; }
        public int C2 { get; set; }
        public bool IsTransposition { get; set; }

        public OffsetPair(int c1, int c2)
        {
            this.C1 = c1;
            this.C2 = c2;
            this.IsTransposition = false;
        }
    }

    public class Options
    {
        /// <summary>
        /// If set, the algorithm will stop if the distance reaches this value
        /// </summary>
        public int? MaxDistance { get; set; }

        /// <summary>
        /// The function that turns strings into a list of tokens (also strings)
        /// </summary>
        public Func<string, string[]> Tokenizer { get; set; }

        /// <summary>
        /// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
        /// </summary>
        public Func<string, string, bool> TokenMatcher { get; set; }

        /// <summary>
        /// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
        /// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
        /// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
        /// </summary>
        public Func<string, string, double> MatchingEvaluator { get; set; }

        /// <summary>
        /// Determines if the local value (computed on subsequent matched tokens) must be modified.
        /// In case one wants to reward longer matched substrings, for example, this is what you need to change.
        /// </summary>
        public Func<double, double> LocalLengthEvaluator { get; set; }

        /// <summary>
        /// The function determining the cost of an individual transposition, based on its counter positions.
        /// </summary>
        public Func<int, int, double> TranspositionCostEvaluator { get; set; }

        /// <summary>
        /// The function determining how the cost of transpositions affects the final result
        /// Change it if it doesn't suit you.
        /// </summary>
        public Func<double, double, double> TranspositionsEvaluator { get; set; }
    }
}

PHP implementation:
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance = 0) {
    if (!$s1 || !strlen($s1)) {
        if (!$s2) {
            return 0;
        }
        return strlen($s2);
    }
    if (!$s2 || !strlen($s2)) {
        return strlen($s1);
    }
    $l1 = strlen($s1);
    $l2 = strlen($s2);
    $c1 = 0; //cursor for string 1
    $c2 = 0; //cursor for string 2
    $lcss = 0; //largest common subsequence
    $local_cs = 0; //local common substring
    $trans = 0; //number of transpositions ('ab' vs 'ba')
    $offset_arr = array(); //offset pair array, for computing the transpositions
    while (($c1 < $l1) && ($c2 < $l2)) {
        if (substr($s1, $c1, 1) == substr($s2, $c2, 1)) {
            $local_cs++;
            $isTrans = false;
            $i = 0;
            while ($i < sizeof($offset_arr)) { //see if current match is a transposition
                $ofs = $offset_arr[$i];
                if ($c1 <= $ofs['c1'] || $c2 <= $ofs['c2']) {
                    $isTrans = abs($c2 - $c1) >= abs($ofs['c2'] - $ofs['c1']);
                    if ($isTrans) {
                        $trans++;
                    } else {
                        if (!$ofs['trans']) {
                            $offset_arr[$i]['trans'] = true; //update the stored pair; $ofs is only a copy in PHP
                            $trans++;
                        }
                    }
                    break;
                } else {
                    if ($c1 > $ofs['c2'] && $c2 > $ofs['c1']) {
                        array_splice($offset_arr, $i, 1);
                    } else {
                        $i++;
                    }
                }
            }
            array_push($offset_arr, array('c1' => $c1, 'c2' => $c2, 'trans' => $isTrans));
        } else {
            $lcss+= $local_cs;
            $local_cs = 0;
            if ($c1 != $c2) {
                $c1 = $c2 = min($c1, $c2); //using min allows the computation of transpositions
            }
            if ($maxDistance) {
                $temporaryDistance = max($c1, $c2) - $lcss + $trans;
                if ($temporaryDistance > $maxDistance) return $temporaryDistance;
            }

            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for ($i = 0;$i < $maxOffset && ($c1 + $i < $l1 || $c2 + $i < $l2);$i++) {
                if (($c1 + $i < $l1) && (substr($s1, $c1 + $i, 1) == substr($s2, $c2, 1))) {
                    $c1+= $i - 1;
                    $c2--;
                    break;
                }
                if (($c2 + $i < $l2) && (substr($s1, $c1, 1) == substr($s2, $c2 + $i, 1))) {
                    $c1--;
                    $c2+= $i - 1;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if (($c1 >= $l1) || ($c2 >= $l2)) {
            $lcss+= $local_cs;
            $local_cs = 0;
            $c1 = $c2 = min($c1, $c2);
        }
    }
    $lcss+= $local_cs;
    return max($l1, $l2) - $lcss + $trans; //apply transposition cost to final result
    
}
Thanks to Ferenc Szatmári for corrections in the PHP code.

The Simplest and General versions of the algorithm remain as an exercise to you, since I haven't been working in PHP for more than a decade.

T-SQL implementation:
---<summary>
---Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
---</summary>
---<param name="s1"></param>
---<param name="s2"></param>
---<param name="maxOffset"></param>
---<param name="maxDistance"></param>
---<returns></returns>
CREATE FUNCTION Sift4common(@s1          NVARCHAR(max), 
                            @s2          NVARCHAR(max), 
                            @maxOffset   INT, 
                            @maxDistance INT) 
returns INT 
AS 
  BEGIN 
      DECLARE @l1 INT = Len(Isnull(@s1, '')); 
      DECLARE @l2 INT = Len(Isnull(@s2, '')); 

      IF ( @l1 = 0 ) 
        RETURN @l2; 

      IF ( @l2 = 0 ) 
        RETURN @l1; 

      DECLARE @c1 INT = 0; 
      DECLARE @c2 INT = 0; 
      DECLARE @lcss INT = 0; 
      DECLARE @local_cs INT = 0; 
      DECLARE @trans INT = 0; 
      DECLARE @offset_arr TABLE 
        ( 
           C1    INT, 
           C2    INT, 
           Trans BIT 
        ) 
      DECLARE @i INT 
      DECLARE @temporaryDistance FLOAT 
      DECLARE @result INT 
      DECLARE @oc1 INT, 
              @oc2 INT, 
              @otr BIT 
      DECLARE @isTrans BIT 

      WHILE ( ( @c1 < @l1 ) 
              AND ( @c2 < @l2 ) ) 
        BEGIN 
            IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
              BEGIN 
                  SET @local_cs=@local_cs + 1; 
                  SET @isTrans=0 
                  SET @oc1=NULL; 

                  SELECT TOP 1 @oc1 = o.C1, 
                               @oc2 = o.C2, 
                               @otr = o.Trans 
                  FROM   @offset_arr o 
                  WHERE  @c1 <= o.C1 
                          OR @c2 <= o.C2 

                  IF ( @oc1 IS NOT NULL ) 
                    BEGIN 
                        SET @isTrans=CASE 
                                       WHEN Abs(@c2 - @c1) >= Abs(@oc2 - @oc1) 
                                     THEN 1 
                                       ELSE 0 
                                     END 

                        IF ( @isTrans = 1 ) 
                          BEGIN 
                              SET @trans=@trans + 1 
                          END 
                        ELSE IF ( @otr = 0 ) 
                          BEGIN 
                              SET @trans=@trans + 1 

                              UPDATE @offset_arr 
                              SET    Trans = 1 
                              WHERE  C1 = @oc1 
                                     AND C2 = @oc2 
                          END 
                    END 

                  DELETE FROM @offset_arr 
                  WHERE  @c1 > C1 
                         AND @c1 > C2 
                         AND @c2 > C1 
                         AND @c2 > C2; 

                  INSERT INTO @offset_arr 
                  VALUES      (@c1, 
                               @c2, 
                               @isTrans); 
              END 
            ELSE 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 != @c2 ) 
                    -- using min allows the computation of transpositions  
                    BEGIN 
                        IF ( @c1 < @c2 ) 
                          BEGIN 
                              SET @c2=@c1; 
                          END 
                        ELSE 
                          BEGIN 
                              SET @c1=@c2; 
                          END 
                    END 
                  IF ( @maxDistance > 0 ) 
                    BEGIN 
                        IF ( @c1 > @c2 ) 
                          BEGIN 
                              SET @temporaryDistance = @c1 - ( @lcss - @trans ); 
                          END 
                        ELSE 
                          BEGIN 
                              SET @temporaryDistance = @c2 - ( @lcss - @trans ); 
                          END 

                        IF ( @temporaryDistance > @maxDistance ) 
                          RETURN Round(@temporaryDistance, 0); 
                    END 


                  --if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                  --so that we can have only one code block handling matches   
                  SET @i=0; 

                  WHILE ( @i < @maxOffset 
                          AND ( @c1 + @i < @l1 
                                 OR @c2 + @i < @l2 ) ) 
                    BEGIN 
                        IF ( ( @c1 + @i < @l1 ) 
                             AND Substring(@s1, @c1 + @i + 1, 1) = 
                                 Substring(@s2, @c2 + 1, 1) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 + @i - 1; 
                              SET @c2=@c2 - 1; 

                              BREAK; 
                          END 

                        IF ( ( @c2 + @i < @l2 ) 
                             AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
                                                              @c2 + @i + 1, 1 
                                                              ) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 - 1; 
                              SET @c2=@c2 + @i - 1; 

                              BREAK; 
                          END 

                        SET @i=@i + 1; 
                    END 
              END 

            SET @c1=@c1 + 1; 
            SET @c2=@c2 + 1; 

            IF ( ( @c1 >= @l1 ) 
                  OR ( @c2 >= @l2 ) ) 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 < @c2 ) 
                    BEGIN 
                        SET @c2=@c1; 
                    END 
                  ELSE 
                    BEGIN 
                        SET @c1=@c2; 
                    END 
              END 
        END 

      SET @lcss = @lcss + @local_cs; 

      --apply the transposition cost to the final result 
      IF ( @l1 > @l2 ) 
        BEGIN 
            SET @result = @l1 - ( @lcss - @trans ); 
        END 
      ELSE 
        BEGIN 
            SET @result = @l2 - ( @lcss - @trans ); 
        END 

      RETURN @result; 
  END 

Clearly, a general version of the algorithm (one parameterized with arbitrary tokenizers and matching functions) is not possible in Transact-SQL.

Here is a MySQL version, graciously provided by Ferenc Szatmári:
BEGIN
    DECLARE l1 INT DEFAULT Length(IFNULL(s1, ''));
    DECLARE l2 INT DEFAULT Length(IFNULL(s2, ''));
    DECLARE c1 INT DEFAULT 0;
    DECLARE c2 INT DEFAULT 0;
    DECLARE lcss INT DEFAULT 0;
    DECLARE local_cs INT DEFAULT 0;
    DECLARE trans INT DEFAULT 0;
    DECLARE i INT;
    DECLARE temporaryDistance FLOAT;
    DECLARE result INT;
    DECLARE oc1 INT;
    DECLARE oc2 INT;
    DECLARE otr SMALLINT;
    DECLARE isTrans SMALLINT;

    DROP TEMPORARY TABLE IF EXISTS offset_arr;
    CREATE TEMPORARY TABLE IF NOT EXISTS offset_arr
    (
        C1 INT,
        C2 INT,
        Trans BIT
    ) ENGINE=memory;

    /* delete from offset_arr; */

    IF l1 = 0 THEN
        RETURN l2;
    END IF;
    IF l2 = 0 THEN
        RETURN l1;
    END IF;

    WHILE ( ( c1 < l1 ) AND ( c2 < l2 ) ) DO
        IF ( Substring(s1, c1 + 1, 1) = Substring(s2, c2 + 1, 1) ) THEN
            SET local_cs = local_cs + 1;
            SET isTrans = 0;
            SET oc1 = NULL;

            SELECT o.C1, o.C2, o.Trans INTO oc1, oc2, otr
            FROM offset_arr o
            WHERE c1 <= o.C1 OR c2 <= o.C2
            LIMIT 1;

            IF oc1 IS NOT NULL THEN
                SET isTrans = CASE WHEN ABS(c2 - c1) >= ABS(oc2 - oc1) THEN 1 ELSE 0 END;
                IF ( isTrans = 1 ) THEN
                    SET trans = trans + 1;
                ELSE
                    IF ( otr = 0 ) THEN
                        SET trans = trans + 1;
                        UPDATE offset_arr SET Trans = 1 WHERE C1 = oc1 AND C2 = oc2;
                    END IF;
                END IF;
            END IF;

            DELETE FROM offset_arr
            WHERE c1 > C1
              AND c1 > C2
              AND c2 > C1
              AND c2 > C2;

            INSERT INTO offset_arr
            VALUES (c1, c2, isTrans);
        ELSE
            SET lcss = lcss + local_cs;
            SET local_cs = 0;

            IF ( c1 != c2 ) THEN
                -- using min allows the computation of transpositions
                IF ( c1 < c2 ) THEN
                    SET c2 = c1;
                ELSE
                    SET c1 = c2;
                END IF;

                IF ( maxDistance > 0 ) THEN
                    IF ( c1 > c2 ) THEN
                        SET temporaryDistance = c1 - ( lcss - trans / 2.0 );
                    ELSE
                        SET temporaryDistance = c2 - ( lcss - trans / 2.0 );
                    END IF;

                    IF ( temporaryDistance > maxDistance ) THEN
                        RETURN Round(temporaryDistance, 0);
                    END IF;
                END IF;
            END IF;

            /* if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
               so that we can have only one code block handling matches */
            SET i = 0;

            label: WHILE ( i < maxOffset AND (c1 + i < l1 OR c2 + i < l2) ) DO
                IF ( ( c1 + i < l1 )
                     AND Substring(s1, c1 + i + 1, 1) = Substring(s2, c2 + 1, 1)
                   ) THEN
                    SET c1 = c1 + i - 1;
                    SET c2 = c2 - 1;
                    LEAVE label;
                END IF;

                IF ( ( c2 + i < l2 )
                     AND Substring(s1, c1 + 1, 1) = Substring(s2, c2 + i + 1, 1)
                   ) THEN
                    SET c1 = c1 - 1;
                    SET c2 = c2 + i - 1;
                    LEAVE label;
                END IF;

                SET i = i + 1;
            END WHILE;
        END IF;

        SET c1 = c1 + 1;
        SET c2 = c2 + 1;

        IF ( ( c1 >= l1 ) OR ( c2 >= l2 ) ) THEN
            SET lcss = lcss + local_cs;
            SET local_cs = 0;

            IF ( c1 < c2 ) THEN
                SET c2 = c1;
            ELSE
                SET c1 = c2;
            END IF;
        END IF;
    END WHILE;

    SET lcss = lcss + local_cs;

    /* apply the transposition cost to the final result */
    IF ( l1 > l2 ) THEN
        SET result = l1 - (lcss - trans);
    ELSE
        SET result = l2 - (lcss - trans);
    END IF;
    RETURN result;
END



Here is a Java version, gracefully provided by Nathanæl Fischer:
/**
 * Sift4 - common version
 * online algorithm to compute the distance between two strings in O(n)
 * Algorithm by siderite, java port by Nathan Fischer 2016
 * /blog/super-fast-and-accurate-string-distance.html
 * @param s1
 * @param s2
 * @param maxOffset the number of characters to search for matching letters
 * @return
 */
public static double sift4(String s1, String s2, int maxOffset) {
    class Offset {
        int c1;
        int c2;
        boolean trans;

        Offset(int c1, int c2, boolean trans) {
            this.c1 = c1;
            this.c2 = c2;
            this.trans = trans;
        }
    }

    if (s1 == null || s1.isEmpty())
        return s2 == null ? 0 : s2.length();

    if (s2 == null || s2.isEmpty())
        return s1.length();

    int l1 = s1.length();
    int l2 = s2.length();

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    LinkedList<Offset> offset_arr = new LinkedList<>(); //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            boolean isTrans = false;
            //see if current match is a transposition
            int i = 0;
            while (i < offset_arr.size()) {
                Offset ofs = offset_arr.get(i);
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.remove(i);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.add(new Offset(c1, c2, isTrans));
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}


Powershell implementation of simple Sift4, by Kirk Sayre:

function Calc-Sift4Distance 
{
  <# 
      .SYNOPSIS 

      Compute the edit distance between 2 strings using the sift4 string
      edit distance algorithm.

      .PARAMETER s1

      The 1st string

      .PARAMETER s2

      The 2nd string

      .PARAMETER maxOffset

      The maximum offset to search for matching characters. The default
      is 10.

      .RETURN

      The # of edits needed to make the given 2 strings equal.
  #>

  Param(
    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s1,

    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s2,

    [Parameter(ValueFromPipelineByPropertyName = $True)]
    [Int]
    $maxOffset = 10
  )

  # Handle null or empty strings.
  if ((-not $s1) -or ($s1.length -eq 0)) 
  {
    if (-not $s2) 
    {
      return 0
    }
    return $s2.length
  }

  if ((-not $s2) -or ($s2.length -eq 0)) 
  {
    return $s1.length
  }

  # Initialization.
  $l1 = $s1.length
  $l2 = $s2.length
  $c1 = 0 # Cursor for string 1.
  $c2 = 0 # Cursor for string 2.
  $lcss = 0 # Largest common subsequence.
  $local_cs = 0 # Local common substring.

  # Scan strings.
  while (($c1 -lt $l1) -and ($c2 -lt $l2)) 
  {
    if ($s1[$c1] -eq $s2[$c2]) 
    {
      $local_cs++
    }
    else 
    {
      $lcss += $local_cs
      $local_cs = 0
      if ($c1 -ne $c2) 
      {
        # Using max to bypass the need to compute transpositions ('ab' vs 'ba').
        $c1 = $c2 = (@($c1, $c2) | Measure-Object -Maximum).Maximum
      }

      for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) 
      {
        if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) 
        {
          $c1 += $i
          $local_cs++
          break
        }

        if ((($c2 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i])) 
        {
          $c2 += $i
          $local_cs++
          break
        }
      }
    }
    $c1++
    $c2++
  }
  $lcss += $local_cs
  return [math]::Round((@($l1, $l2) | Measure-Object -Maximum).Maximum - $lcss)
}


Thanks to Hugo Amaro, a C++ implementation:

#include <vector>    // std::vector for the offset array
#include <cmath>     // std::abs, std::round
#include <algorithm> // std::min, std::max

struct sift_offset {
  int c1;
  int c2;
  bool trans;
};

template <typename T>
  int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) {
    if (!s1 || !s1_size) {
      if (!s2) {
        return 0;
      }
      return s2_size;
    }

    if (!s2 || !s2_size) {
      return s1_size;
    }

    int l1 = s1_size;
    int l2 = s2_size;

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    std::vector<sift_offset> offset_arr; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
      if (s1[c1] == s2[c2]) {
        local_cs++;
        bool isTrans = false;
        //see if current match is a transposition
        int i = 0;
        while (i < offset_arr.size()) {
          sift_offset ofs = offset_arr[i];
          if (c1 <= ofs.c1 || c2 <= ofs.c2) {
            // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
            isTrans = std::abs(c2 - c1) >= std::abs(ofs.c2 - ofs.c1);

            if (isTrans) {
              trans++;
            } else {
              if (!ofs.trans) {
                ofs.trans = true;
                trans++;
              }
            }
            break;
          } else {
            if (c1 > ofs.c2 && c2 > ofs.c1) {
              offset_arr.erase(offset_arr.begin() + i);

            } else {
              i++;
            }
          }
        }
        offset_arr.push_back({
          c1,
          c2,
          isTrans
        });
      } else {
        lcss += local_cs;
        local_cs = 0;
        if (c1 != c2) {
          c1 = c2 = std::min(c1, c2); //using min allows the computation of transpositions
        }
        //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
        //so that we can have only one code block handling matches 
        for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
          if ((c1 + i < l1) && (s1[c1 + i] == s2[c2])) {
            c1 += i - 1;
            c2--;
            break;
          }
          if ((c2 + i < l2) && (s1[c1] == s2[c2 + i])) {
            c1--;
            c2 += i - 1;
            break;
          }
        }
      }
      c1++;
      c2++;
      if (maxDistance) {
        int temporaryDistance = std::max(c1, c2) - lcss + trans;
        if (temporaryDistance >= maxDistance) return (int) std::round((double) temporaryDistance);
      }

      // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
      if ((c1 >= l1) || (c2 >= l2)) {
        lcss += local_cs;
        local_cs = 0;
        c1 = c2 = std::min(c1, c2);
      }
    }
    lcss += local_cs;
    return (int) std::round((double) (std::max(l1, l2) - lcss + trans)); //add the cost of transpositions to the final result
  }

You can find a Go implementation here, written by Jason W. Hutchinson.
There is also a Swift implementation here.
A Perl 6 (now called Raku) implementation can be found here.

You know how many times you need to get the results of a query as a list of strings, say a CSV, and there is nothing in Microsoft SQL Server to help you? What you would like is an aggregate function, similar to SUM, that returns the concatenated value of all strings in the SELECT query.

From SQL Server 2017 on, there is such a function called STRING_AGG and you use it just like one might expect.
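
For example, a minimal sketch using a table variable:

DECLARE @T TABLE(S NVARCHAR(Max))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel')

SELECT STRING_AGG(S,',') FROM @T
-- Result: Filet,T-bone,Sausage,Würstel - no trailing comma to clean up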

If you have an older version of SQL Server... upgrade :) But if you can't, here is a solution:

Before SQL Server 2008 that was indeed the case: you needed to use variables, as in SELECT @S=@S+[Value] FROM .... But SQL Server 2008 added more XML support and with it the data() XML PATH method. Take notice that this method adds a space between atomic values. So, without further ado, here is the code to concatenate several values into a comma separated value string:

DECLARE @T TABLE(S NVARCHAR(Max))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel') -- enough with the fruity examples!

SELECT CONVERT(NVARCHAR(Max),(SELECT S+',' AS 'data()' FROM @T t
FOR XML PATH('')))


Result: Filet, T-bone, Sausage, Würstel, - of course, remove the last comma.

Update: What if I have a table with multiple columns and I just want to aggregate one?

The solution is to use the select with WHERE clauses on the GROUP BY columns. Here is an example:

DECLARE @t TABLE(col1 NVARCHAR(Max), col2 INT, value NVARCHAR(Max) )
INSERT INTO @T VALUES ('a',1,'1'),
('a',1,'2'),
('a',2,'3'),
('b',1,'1'),
('b',1,'2'),
('c',3,'67')
 
--SELECT * FROM @t t
 
SELECT col1, col2, CONVERT(NVARCHAR(Max),(
SELECT value+',' AS 'data()' FROM @T t2
WHERE t1.col1 = t2.col1
AND t1.col2 = t2.col2
FOR XML PATH('')))
FROM @t t1
GROUP BY col1, col2

I have been working on a REST API lately and, while using Entity Framework or some other similar framework to abstract the database is certainly possible, I wanted to control every aspect of the implementation. I know, reinventing wheels, but this is how one learns. One of the most annoying bits was trying to translate some complex object from JSON to the (two dimensional) database relational tables. This post will explore my attempts and the solutions I have found.

My first attempt was straightforward: just send all types as DataTables, with some extra property to define identity and parent entity. This relies on the Microsoft SQL Server mechanism that allows sending table variables to stored procedures. But this approach has several downsides. One of them is that in order to send a DataTable to SQL you need... DataTables. As I have pointed out in several blog posts, the DataTable object is slow, and sometimes downright buggy. Even if I didn't care about performance that much, in order for SQL to receive the content of the DataTable one must create corresponding User Defined Types on the database side. Working with UDTs is very difficult for several reasons: you cannot alter a UDT (unless employing some SQL voodoo that changes system tables); you can only drop it and recreate it. This does not work if you use the UDT anywhere, so a lot of renaming needs to be done. Even if you automate the process, it's still very annoying. Then the UDT definition has to be an exact duplicate of the DataTable definition. Move some columns around and it fails. Debugging is also made difficult by the fact that the SQL profiler does not see the content of table variables when they are sent to the server.

Long story short, I was looking for alternatives and I found XML. Now, you might think that this leads to a simple, and maybe even obvious, solution. But it's not that easy. Imagine that you send a list of objects in an XML. Each object is represented by an XML element and each property by a child element. In order to get the value of a property you need to iterate through all the nodes, for each node find the properties, for each property find the one element that defines it, then get the attribute value or content of the property, all while making sure you select everything into a table.

The solution I found, which simplifies the SQL code (and hopefully brings some well needed performance to the table) is to serialize the objects in a way that makes the selection simple enough. Here is an example: I have a Configuration object with an Id and a Name that also has a property called Servers, containing Server objects having an Id and a Url. Here is an example of XML serialization from the DataContractSerializer:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Id>1</Id>
<Name>Test Config</Name>
<Servers>
<Server>
<Id>1</Id>
<Url>http://some.url</Url>
</Server>
<Server>
<Id>2</Id>
<Url>http://some.other.url</Url>
</Server>
</Servers>
</Configuration>

The SQL code to get the information from an XML variable with this content would look like this:

DECLARE @Xml XML='<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Id>1</Id>
<Name>Test Config</Name>
<Servers>
<Server>
<Id>1</Id>
<Url>http://some.url</Url>
</Server>
<Server>
<Id>2</Id>
<Url>http://some.other.url</Url>
</Server>
</Servers>
</Configuration>'


;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('(Id/text())[1]','INT') as Id,
T.Item.value('(Name/text())[1]','NVARCHAR(100)') as Name
FROM @Xml.nodes('//Configuration') as T(Item)

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('(Id/text())[1]','INT') as Id,
T.Item.value('(Url/text())[1]','NVARCHAR(100)') as Url
FROM @Xml.nodes('//Configuration/Servers/Server') as T(Item)


This works, but look at that code. In my case, the situation was worse, the object I was using was a wrapper which implemented IDictionary<string,object> and, even if it did implement ISerializable, both XmlSerializer and DataContractSerializer use the dictionary as their data and in the end I get ugly key elements and value elements that are even harder to get to and, I suppose, more inefficient to parse. Therefore I found the solution in IXmlSerializable, (yet) another serialization interface used exclusively by XML serializer classes. If every simple value would be saved as an attribute and every complex object in an element, then this could be the SQL code:

DECLARE @Xml XML='<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Config Id="1" Name="Test Config">
<Servers>
<List>
<Server Id="1" Url="http://some.url" />
<Server Id="2" Url="http://some.other.url" />
</List>
</Servers>
</Config>
</Configuration>'


;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('@Id','INT') as Id,
T.Item.value('@Name','NVARCHAR(100)') as Name
FROM @Xml.nodes('//Configuration/Config') as T(Item)

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('@Id','INT') as Id,
T.Item.value('@Url','NVARCHAR(100)') as Url
FROM @Xml.nodes('//Configuration/Config/Servers/List/Server') as T(Item)


Much easier to read and hopefully to parse.

I am not going to write here about the actual implementation of IXmlSerializable. There are plenty of tutorials on the Internet about that. It's not pretty, :) but not too difficult, either.

What was the purpose of this exercise? Now I can send a complex object to SQL in a single query, making inserts and updates simple and not requiring a call for each instance of each type of complex object. Now, is it fast? I have no idea. If performance is paramount, perhaps the UDT/DataTable approach is faster. However, you will have to define a type for each type that you send as a DataTable to a stored procedure. An alternative can be a binary serializer and a CLR SQL function that translates it into tables. However, in my project I need to easily implement very custom API methods and to control every aspect, including tracing and profiling the various SQL calls. I believe the customized IXmlSerializable/XML in SQL approach is a reasonable one.

As always, I hope this helps someone.

I seem to remember that I blogged about this before, but I can't find it anymore; probably it was just an intention I never followed through on. This is simply a warning about how T-SQL converts FLOAT values to string. Here are some Transact-SQL queries and their results:
DECLARE @aFloat FLOAT = 1234.123456789
DECLARE @aDecimal DECIMAL(18,9) = 1234.123456789
DECLARE @aNumeric NUMERIC(18,9) = 1234.123456789
DECLARE @aString NVARCHAR(20) = 1234.123456789

SELECT @aFloat,@aDecimal,@aNumeric,@aString -- result: 1234.123456789 1234.123456789 1234.123456789 1234.123456789
SELECT CAST(@aFloat as NVARCHAR(20)),CAST(@aDecimal as NVARCHAR(20)),CAST(@aNumeric as NVARCHAR(20)),CAST(@aString as NVARCHAR(20)) -- result: 1234.12 1234.123456789 1234.123456789 1234.123456789
Wait! What happened there? The FLOAT was the only numeric format that lost precision, keeping only 2 decimals (it is actually a limit on significant digits, not decimals: 12345.123456789 would be converted to 12345.1). The solution is to either convert to DECIMAL or NUMERIC values before converting to NVARCHAR, or to use the STR function, which receives length and number-of-decimals parameters. Like this:
SELECT CAST(@aFloat as NVARCHAR(20)), CAST(CAST(@aFloat as DECIMAL(18,9)) as NVARCHAR(20)), STR(@aFloat,18,9) -- result: 1234.12    1234.123456789        1234.123456789

The first conversion to DECIMAL and the STR function later on are equivalent.

I have looked into SQL options to somehow set the default precision used when converting a float to string, but I could not find anything useful. Neither did settings like
SET ARITHIGNORE OFF
SET NUMERIC_ROUNDABORT ON
SET ARITHABORT ON
have any effect on the queries above. No error and the same result.

You don't want to know what happens with 123456789.123456789!
SELECT @aFloat, CAST(@aFloat as NVARCHAR(20)), CAST(CAST(@aFloat as DECIMAL(30,10)) as NVARCHAR(30)), STR(@aFloat,30,10) -- result: 123456789.123457    1.23457e+008    123456789.1234567900    123456789.1234567900

Not only are the digits cut even when selecting the actual value, but the scientific notation rears its ugly head. And look at the beautiful STR function returning ugly extra zeroes! The same issue appears when trying to use XML functions: the resulting XML has really ugly strings of the float values.

Bottom line: as much as I hate it, you probably should not use FLOAT when trying to display values. Ever.

Update 2015 August 28: I've replaced the function master.sys.fn_varbintohexstr with CONVERT, with the extra parameter 2, which translates a binary field into a hexadecimal string with no leading 0x. In addition to being ugly to use, fn_varbintohexstr is very slow.

Sometimes you need to create a unique identifier for a bunch of values so that you can use it as an ID in the database. The immediately obvious choice is the CHECKSUM and BINARY_CHECKSUM functions. But beware: the purpose of these functions is to detect changes in a string, not to uniquely identify it. It might seem strange, but the two concepts are very different. Change detection only needs to generate very different values on small changes. Uniqueness requires a value as distinctive as possible for any string. That is why when you use a checksum you will get a lot of colliding values for (very) different strings.
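
A quick illustration (the exact checksum values depend on the server, but under the default case-insensitive collation the equality holds):

SELECT CHECKSUM('MyString') AS cs1, CHECKSUM('mystring') AS cs2 -- same value for different strings
SELECT BINARY_CHECKSUM('MyString') AS bcs1, BINARY_CHECKSUM('mystring') AS bcs2 -- different values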

Enter HASHBYTES, another function, whose purpose is to create a cryptographic hash for a string. It is mainly used for password hashing, but it will fit nicely for our purpose. There are some caveats, though. First, CHECKSUM gets a variable number of parameters, while HASHBYTES only accepts one, so we must take care of the cumbersome concatenation of multiple values ourselves. Unfortunately SQL functions do not have the option of variable parameters, which is truly a shame, so we can't hack around it. Also, the value that HASHBYTES returns is a VARBINARY. We could cast it to NVARCHAR, but it turns into a weird string of Chinese-looking characters. In order to turn it into a proper string, we need to use the CONVERT function with a style parameter of 2 (a hexadecimal string without the leading 0x), the same way SQL Server displays VARBINARY when selecting it.
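
For example, hashing the string 'test' and turning the result into a hexadecimal string:

SELECT CONVERT(NVARCHAR(40), HASHBYTES('SHA1','test'), 2)
-- Result: A94A8FE5CCB19BA61C4C0873D391E987982FBBD3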

So let's compare the two usages. Suppose we have this nice table that contains company data: company name, contact first name, contact last name, phone, email, yearly value. We need to create a unique ID based on these values.
First CHECKSUM:
SELECT CHECKSUM(companyName, firstName, lastName, phone, email, yearlyValue) FROM OurTable
So easy! Just add the columns, no matter how many or what type they have, and get a value as a result. You can even use * to select all columns in a row. You also get the same checksum for differently capitalized strings. If you don't want this behaviour, use BINARY_CHECKSUM, which works even better.

Second HASHBYTES:
SELECT CONVERT(VARCHAR(Max),HASHBYTES('SHA1',companyName+'|'+firstName+'|'+lastName+'|'+phone+'|'+email+'|'+CAST(yearlyValue as NVARCHAR(100))),2) as id,*
FROM OurTable
Ugly! You need to create a string from different types, using ugly casts. Also, this works more like BINARY_CHECKSUM. If you want the same case-insensitive behavior as CHECKSUM, you need to normalize the values with LOWER(LTRIM(RTRIM(value))). Horrid!
However, it works.
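
Here is a minimal sketch of that normalization:

-- trimming and lowercasing before hashing makes '  MyString  ' and 'mystring' produce the same hash
SELECT CONVERT(NVARCHAR(40), HASHBYTES('SHA1', LOWER(LTRIM(RTRIM('  MyString  ')))), 2)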

WARNING: using CAST to NVARCHAR from a FLOAT loses precision. You should use STR instead!

A middle solution is to use XCHECKSUM. What is that, you ask? A placeholder that can be replaced with some regular expression search and replace, of course :)

Update: I've created a query that generates the script to update the value of a column called 'ValuesHash', for tables that have it, with the hash of all columns that are not in a list of names, are not primary or foreign keys, and are not computed, rowguidcol or filestream columns.
Imagine the scenario where you have something like this:
  • Table A:
    1. Id: primary identity key
    2. Data1: some data
    3. Data2: some data
    4. CreateTime: the creation time
    5. ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
  • Table B:
    1. Id: primary identity key
    2. AId: foreign key to A
    3. Data1: some data
    4. Data2: some data
    5. ModifyTime: the modification time
    6. ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
  • Table C:
    1. Id: primary identity key
    2. AId: foreign key to A
    3. Data1: some data
    4. Data2: some data
The query below will update ValuesHash for A and B (because C doesn't have the ValuesHash column) with a hash constructed from the Data columns. The Id columns will be ignored for being primary keys (and for being in the list of columns to ignore), the AId columns will be ignored for being foreign keys, and ValuesHash, CreateTime and ModifyTime will be ignored for being in the list of custom columns.

WARNING: each column's data is truncated to 4000 characters, then the concatenated string is truncated again to 4000 characters before running HASHBYTES (which only accepts a maximum of 8000 bytes). This hash will help in determining unique records, but it is not a 100% guarantee.

SELECT * 
FROM (
SELECT t.name,
'UPDATE [' + t.name+ '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING('
+ Stuff(
(SELECT '+ ''|''+ ISNULL('+CASE
WHEN tp.name IN ('float', 'real') THEN 'STR('+c.name+',30,30)'
WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),'+c.name+',2)'
ELSE 'CONVERT(NVARCHAR(4000),'+c.name+')' END+','''')'
FROM sys.all_columns c
INNER JOIN sys.types tp
ON c.system_type_id=tp.system_type_id
AND c.user_type_id=tp.user_type_id
LEFT JOIN sys.index_columns ic
ON ic.object_id=t.object_id
AND ic.column_id=c.column_id
LEFT JOIN sys.indexes i
ON ic.object_id=i.object_id
AND ic.index_id=i.index_id
LEFT JOIN sys.foreign_key_columns fc
ON fc.parent_object_id=t.object_id
AND c.column_id=fc.parent_column_id
WHERE t.object_id=c.object_id
AND ISNULL(c.is_identity, 0)=0
AND ISNULL(c.is_computed, 0)=0
AND ISNULL(c.is_filestream, 0)=0
AND ISNULL(c.is_rowguidcol, 0)=0
AND ISNULL(i.is_primary_key, 0)=0
AND fc.parent_column_id IS NULL
AND c.name NOT IN ('Id', 'CreateTime' , 'AcquireTime' , 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
ORDER BY Sign(c.max_length) DESC, c.max_length, Lower(c.name)
FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
, 1, 7, '')
+ ',0,4000)) WHERE ValuesHash IS NULL' AS computed
FROM sys.tables t
INNER JOIN sys.all_columns c
ON t.object_id = c.object_id
WHERE c.name = 'ValuesHash') x
WHERE computed IS NOT NULL
ORDER BY name

Change it to suit your needs. It is by no means perfect, but it's a start for whatever you need.

Update:

A new FORMAT function was introduced in SQL Server 2012, working somewhat like the .NET ToString method. Using that function is slightly more precise:

SELECT * 
FROM (
SELECT t.name,
'UPDATE [' + t.name+ '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING('
+ Stuff(
(SELECT '+ ''|''+ ISNULL('+CASE
WHEN tp.name IN ('float', 'real') THEN 'FORMAT('+c.name+',''R'')'
WHEN tp.name IN ('decimal') THEN 'FORMAT('+c.name+',''G'')'
WHEN tp.name IN ('datetime','datetime2') THEN 'FORMAT('+c.name+',''O'')'
WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),'+c.name+',2)'
ELSE 'CONVERT(NVARCHAR(4000),'+c.name+')' END+','''')'
FROM sys.all_columns c
INNER JOIN sys.types tp
ON c.system_type_id=tp.system_type_id
AND c.user_type_id=tp.user_type_id
LEFT JOIN sys.index_columns ic
ON ic.object_id=t.object_id
AND ic.column_id=c.column_id
LEFT JOIN sys.indexes i
ON ic.object_id=i.object_id
AND ic.index_id=i.index_id
LEFT JOIN sys.foreign_key_columns fc
ON fc.parent_object_id=t.object_id
AND c.column_id=fc.parent_column_id
WHERE t.object_id=c.object_id
AND ISNULL(c.is_identity, 0)=0
AND ISNULL(c.is_computed, 0)=0
AND ISNULL(c.is_filestream, 0)=0
AND ISNULL(c.is_rowguidcol, 0)=0
AND ISNULL(i.is_primary_key, 0)=0
AND fc.parent_column_id IS NULL
AND c.name NOT IN ('Id', 'CreateTime' , 'AcquireTime' , 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
ORDER BY Sign(c.max_length) DESC, c.max_length, Lower(c.name)
FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
, 1, 7, '')
+ ',0,4000)) WHERE ValuesHash IS NULL' AS computed
FROM sys.tables t
INNER JOIN sys.all_columns c
ON t.object_id = c.object_id
WHERE c.name = 'ValuesHash'
) x
WHERE computed IS NOT NULL
ORDER BY name
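
For reference, here is roughly what FORMAT does with the problematic float from before (the exact last digit depends on the double precision representation):

DECLARE @f FLOAT = 123456789.123456789
SELECT CAST(@f AS NVARCHAR(20))  -- 1.23457e+008, scientific notation strikes again
SELECT FORMAT(@f,'R')            -- 123456789.12345679, round-trip precision
SELECT FORMAT(SYSDATETIME(),'O') -- ISO 8601 round-trip format for dates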

This is one of those WTF moments. After more than a decade of working in software development I learn something this basic about T-SQL (or rather, any flavour based on SQL-92). What would you think happens when running this script?
IF ''='                 ' SELECT 'WTF?!' -- empty string compared to a bunch of spaces
IF ' '=' ' SELECT 'WTF?!' -- bunch of spaces compared to another bunch of spaces of different length
IF 'x'='x ' SELECT 'WTF?!' -- 'x' compared to 'x' followed by a bunch of spaces
IF 'x'=' x' SELECT 'WTF?!' -- 'x' compared to 'x' preceded by a bunch of spaces

There will be three WTF rows returned, for the first three queries. You don't believe me? Try it yourself. The reason is explained here: INF: How SQL Server Compares Strings with Trailing Spaces. Long story short: in order to compare two strings of different lengths, SQL Server first right-pads the shorter one with spaces.

So what can you do to fix it? Easy enough, use LEN, right? Nope. Read the definition carefully: Returns the number of characters of the specified string expression, excluding trailing blanks. A possible but weird solution is to use DATALENGTH. A string is empty only if it has a DATALENGTH of 0. In the case of NVARCHAR you can even divide the resulting number by 2 in order to get the true length of the string. WTF, right?
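
To illustrate (results as returned by a default SQL Server setup):

SELECT LEN('x  ')         -- 1: trailing blanks excluded
SELECT DATALENGTH('x  ')  -- 3: the number of bytes, trailing blanks included
SELECT DATALENGTH(N'x  ') -- 6: NVARCHAR uses 2 bytes per character
IF DATALENGTH('') = 0 SELECT 'really empty' -- unlike ''='   ', this is a reliable emptiness test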

Well, it's pretty obvious, but I wanted to post it here as well. You see, we have this query that was supposed to filter some values based on a parameter. The query was done like this: SELECT * FROM table WHERE value=(CASE WHEN @filter IS NULL THEN value ELSE @filter END). Can you spot the problem? Indeed, if value is NULL, then value is NOT equal to value. Not only is this incorrect, it is also bad from the standpoint of the SQL execution plan. It is much faster to do SELECT * FROM table WHERE (@filter IS NULL OR value=@filter). If, for whatever reason, you need the first syntax, you need to do it like this: SELECT * FROM table WHERE ISNULL(value,@impossibleValueThatIsNotNull)=COALESCE(@filter, value, @impossibleValueThatIsNotNull). Yeah, a bit of a show off, but when the "no filter" value is null, it's better to use ISNULL and COALESCE wherever possible.
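
Here is the problem in miniature, using a table variable:

DECLARE @filter INT = NULL -- meaning "no filter"
DECLARE @t TABLE(value INT)
INSERT INTO @t VALUES (1),(NULL)

SELECT * FROM @t WHERE value=(CASE WHEN @filter IS NULL THEN value ELSE @filter END) -- returns only the row with 1, silently dropping the NULL row
SELECT * FROM @t WHERE (@filter IS NULL OR value=@filter) -- returns both rows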

I made a function in T-SQL that parsed some string and returned a decimal. It all worked fine until one of my colleagues wanted to use it on the test server. And there it was, a beautiful error message: 'decimal' is not a recognized built-in function name. I isolated the line and executed it: same error. It was like the server did not understand decimals anymore. The line in question was a simple SELECT TRY_CONVERT(DECIMAL(18,6),'10.33'). If I used CONVERT, though, it all worked fine. Another hint was that the function worked perfectly fine on the same server, in another database. The problem was that for that particular database the defined SQL Server version was 2008, not 2012, and TRY_CONVERT was only introduced in 2012. We changed it and it all worked fine after that. The setting in question is found in the Properties of the database, Options, Compatibility level.
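
The same check and fix can be done from T-SQL (MyDb is a hypothetical database name; 110 is the SQL Server 2012 compatibility level):

SELECT compatibility_level FROM sys.databases WHERE name='MyDb'
ALTER DATABASE MyDb SET COMPATIBILITY_LEVEL = 110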

It is a bit embarrassing not to know this at my level of software development, but I was stunned to see that other people, even more experienced than I, had the same gap in their knowledge. Apparently Microsoft SQL Server converts empty or whitespace strings to default values when using CONVERT or CAST. So CONVERT(INT,''), equivalent to CAST('' as INT), equals 0. DATETIME conversion leads to a value of 1900-01-01. And so on. That means that a good practice for data conversion, when you don't know what data you may be getting, is to always turn whitespace into NULL before using CONVERT or CAST. Also, in related news, newline is NOT whitespace in T-SQL, so LTRIM(CHAR(10)) and LTRIM(CHAR(13)) are not empty strings!
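
A quick demonstration of those defaults:

SELECT CONVERT(INT,'')      -- 0, not NULL
SELECT CONVERT(INT,'   ')   -- 0 as well
SELECT CAST('' AS DATETIME) -- 1900-01-01 00:00:00.000
SELECT LEN(LTRIM(CHAR(10))) -- 1: LTRIM does not remove the newline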

Bottom line: instead of CONVERT(<type>,<unknown string value>) use the cumbersome CONVERT(<type>,CASE WHEN LTRIM(RTRIM(<unknown string value>))!='' THEN <unknown string value> END). Same with CAST.

Here is a table of conversions for some values converted to FLOAT:

Value                          | Normal CONVERT   | Cumbersome CONVERT | TRY_CONVERT
-------------------------------|------------------|--------------------|------------
NULL                           | NULL             | NULL               | NULL
'' (empty string)              | 0                | NULL               | 0
' ' (whitespace)               | 0                | NULL               | 0
'\n' (whitespace and newlines) | Conversion error | Conversion error   | NULL
'123'                          | 123              | 123                | 123

You might think this is not such a big deal, but in Microsoft SQL Server 2012 they introduced TRY_CONVERT and the similar TRY_CAST, which return NULL if there is a conversion error. This means that for an incorrect string value the function returns NULL for most inputs, except empty or whitespace strings, for which it returns the default value of the chosen type, thus resulting in inconsistent behavior.
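
For example:

SELECT TRY_CONVERT(INT,'abc') -- NULL: a genuine conversion error
SELECT TRY_CONVERT(INT,'')    -- 0: the default value, not NULL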

This is something I have been hitting my head on from the beginning of my programming career: just find the best match in a table for each row in another table through a single query.

There are solutions, but they are all very inefficient. To demonstrate the issue I will start with a simple structure: tables A and B, having the same columns id, x and y. I want to get, for each point in table A defined by the (x,y) coordinates, the closest point in table B. I only need one and it doesn't need to be exclusive (other points in A might be closest to the same point). It doesn't even have to be one row in B for each row in A, in case there are two points at the exact same distance to a point in A. The creation of the structure is done here:
CREATE TABLE A(id INT PRIMARY KEY IDENTITY(1,1), x FLOAT, y FLOAT)
INSERT INTO A (x,y) VALUES(10,20),(20,30),(20,10),(30,20),(30,20),(10,30)

CREATE TABLE B(id INT PRIMARY KEY IDENTITY(1,1), x FLOAT, y FLOAT)
INSERT INTO B (x,y) VALUES(11,20),(20,31),(21,10),(31,21),(30,20),(11,30)

To find the distance from A to the closest point in B is trivial:
SELECT a.id, 
       a.x, 
       a.y, 
       Min(( a.x - b.x ) * ( a.x - b.x ) + ( a.y - b.y ) * ( a.y - b.y )) AS dist 
FROM   a 
       CROSS JOIN b 
GROUP  BY a.id, 
          a.x, 
          a.y 
To get the id of the closest B point, not so easy.

The first naive solution would be to just find the row in B that corresponds to each row in A using nested selects, like this:
SELECT * 
FROM a
JOIN b
ON b.id = (SELECT TOP 1 b.id
FROM b
ORDER BY ( a.x - b.x ) * ( a.x - b.x ) + ( a.y - b.y ) * ( a.y - b.y ) ASC)

Looking at the execution plan we see what is going on: 86% of the query is spent on "Top N Sort".
Let's get some other solutions so we can compare them in the end in the same execution plan.

Another solution is to reuse the result of the query that computes the distance and join again on the distance. That means we would compare each row in A with each row in B twice, once for the computation of the MIN function and once for the join:
SELECT j.*, 
b2.*
FROM (SELECT
a.id,
a.x,
a.y,
Min(( a.x - b.x ) * ( a.x - b.x ) + ( a.y - b.y ) * ( a.y - b.y )) AS m
FROM a
CROSS JOIN b
GROUP BY a.id,
a.x,
a.y) j
INNER JOIN b b2
ON j.m = ( j.x - b2.x ) * ( j.x - b2.x ) + ( j.y - b2.y ) * ( j.y - b2.y )

Something that does the same thing, essentially, but looks a little better is joining table A with B and then again with B, on the condition that the point from B2 is closer than the one from B1, then adding the condition that no such B2 exists (in other words, B1 is closest):
SELECT a.*, 
b1.*
FROM a
CROSS JOIN b b1
LEFT JOIN b b2
ON ( a.x - b1.x ) * ( a.x - b1.x ) + ( a.y - b1.y ) *
( a.y - b1.y ) >
( a.x - b2.x ) * ( a.x - b2.x ) + ( a.y - b2.y ) *
( a.y - b2.y )
WHERE b2.id IS NULL

None of these solutions scan B only once for each row in A. Their relative complexity is this: 75%, 11% and 14%, respectively. In other words, finding the minimum distance and then joining with the B table again on the points that are in exactly that distance is the best solution. However, given some assumptions and a weird structure, we can get to something that runs in half that time:
SELECT id      AS Aid, 
x,
y,
m % 100 AS bId
FROM (SELECT a.id,
a.x,
a.y,
Min(Cast( ( ( a.x - b.x ) * ( a.x - b.x ) + ( a.y - b.y ) * ( a.y - b.y ) ) AS BIGINT) * 100 + b.id) AS m
FROM a
CROSS JOIN b
GROUP BY a.id,
a.x,
a.y) j

These are the assumptions that must be true in order for this to work:
  • The function value can be converted to a BIGINT without problems (if the distance between points were subunitary, this would lose precision)
  • The maximum ID in table B is under a certain value (in this case 100)
  • The converted function multiplied by this maximum number doesn't cause an overflow
Basically I am mathematically creating a container for the value of the function and the id of the point in B, computing the minimum, then extracting the id back from the value. Neat.
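
The arithmetic of the container, in isolation:

DECLARE @dist BIGINT = 17, @id BIGINT = 42 -- a squared distance and a B point id
DECLARE @m BIGINT = @dist * 100 + @id      -- pack both into one value: 1742
SELECT @m % 100 AS bId                     -- 42: unpack the id
SELECT @m / 100 AS dist                    -- 17: unpack the distance, if needed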

Another solution, one that seems to make the most sense, is using a feature that was introduced in SQL Server 2005: RANK. We rank the points in B relative to each point in A, based on our function, then we only keep the first by selecting on the rank being 1. Unfortunately, this doesn't work as expected. First of all, you cannot use RANK in the WHERE clause, so you must select the rank first, then select from that selection to add the condition. This might mean horrid temporary data tables if tables A and B are huge. Also, after running the query, it appears it is slower than the one that joins on the minimum distance. Here it is:
SELECT aid, 
bid
FROM (SELECT a.id AS aId,
a.x,
a.y,
b.id AS bId,
Rank()
OVER(
partition BY a.id
ORDER BY (a.x-b.x)*(a.x-b.x)+(a.y-b.y)*(a.y-b.y) ASC) AS rnk
FROM a
CROSS JOIN b) x
WHERE rnk = 1

Comparing all the solutions so far, without the first naive one, with the nested selects, we get these values:
  1. Mathematical container of function value and id: 14%
  2. Selection of the minimum distance to each point and then joining with table B for the second time to look for the point that is at that distance: 21%
  3. Joining twice on the same table with the condition that one is better than the other and that the better one doesn't actually exist: 29%
  4. Using RANK: 36%, most surprisingly the worst solution

The final solution, adding some more computation in order to get rid of constants and some assumptions thus becomes:
DECLARE @MaxId BIGINT 

SELECT @MaxId = Isnull(Max(id) + 1, 1)
FROM B;

WITH q AS (SELECT A.id,
A.x,
A.y,
Min(Cast(Power(A.x-B.x, 2) + Power(A.y-B.y, 2) AS BIGINT) * @MaxId + B.id) AS m
FROM A
CROSS JOIN B
GROUP BY A.id,
A.x,
A.y)
SELECT id AS aId,
x,
y,
m % @MaxId AS bId
FROM q;


I am still looking and there is now a question on StackOverflow that attempts to get the answer from the community, so far with limited results.