Javascript "arguments" weirdness

Published Feb 5, 2015

I was using this JavaScript function that I wanted to accept an array of arguments or just a bunch of arguments that would be interpreted as an array. As you probably know, for each function in JS you get a variable called arguments which contains the arguments to the function. It is a pseudo array, not a real array, and has some extra properties. My code looked like this:

function x(arg) {
  if (!arg) {
    arg=[];
  } else if (!arg.length) {
    arg=[];
    for(var i=0; i<arguments.length; i++) arg.push(arguments[i]);
  }
  // do something with arg
}

The logic is simple, if there is no first parameter, arg becomes an empty array, else, if there is a first argument, but it doesn't have a length property (not an array) set arg to an array and push all arguments of the function as items of the array. But it doesn't work! The point is this: you set arg to an empty array and at that moment arguments[0] is no longer the original argument, but the empty array. Even worse, the code then adds the array as an item of itself, which makes the object be infinitely recursive.

Let's make this simpler:

function x(arg) {
  arg=[];
  console.log(arguments[0]);
}

After you execute x() with any arguments, the console will show an empty array, not the original argument. Weird, huh?

Run Chrome with file access AND in a separate instance

Published Jan 26, 2015

Posted in
browser
misc
programming

and has 0 comments

Sometimes you want to run your browser without some protections that are enabled by default in it. One of these is the cross-origins protection when running a local filesystem script. For Chrome, the solution is to run the browser with the command line switch --allow-file-access-from-files. Seems straight forward enough, but if your browser is already open (usually I have at least 10 active tabs, ranging from documentation pages, email to the music I listen to), the command line switches will be ignored and your script will be run as just another window in the same instance of Chrome. In order to fix this, you need to use another switch called --user-data-dir. Just make sure this folder exists and it can be deleted (because it will be filled with a zillion useless files).

So, how do I run a test.html file that can access files and is located in C:\SomePath ? Like this:
start "Chrome with access to files" chrome --allow-file-access-from-files "file:///C:\SomePath\test.html" --user-data-dir="C:\SomePath\UserDir"

In your path a UserDir folder will be created which you can delete after you finish your work.

Of course, this issue applies to any other flags that you want to use ad-hoc while Chrome is already open.

Inserting or updating in MySQL without a primary key or unique index OR how to use EXISTS in MySQL

Published Jan 21, 2015

Posted in
database
programming

and has 3 comments

I was trying to write a simple query in MySQL that only inserted a row if there wasn't one already there, else it would update the existing one. MySQL has a system to take care of that for you if you have a primary key or a unique index on the table you want to insert or update. You can choose between INSERT ... ON DUPLICATE KEY or REPLACE. However, what if you don't want to add an index on your table just for that?

T-SQL (Microsoft SQL Server) has a syntax that works like this:

IF EXISTS(SELECT * FROM MyTable WHERE y=2)
  UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
  INSERT INTO MyTable (x,y) VALUES(1,2);
END

MySQL also has an IF...THEN syntax as well as an EXISTS statement. The problem is that they work in a completely different way. IF has an extra THEN keyword, uses ELSEIF instead of ELSE and needs to end with an END IF. EXISTS works only in WHERE clauses. So, let's translate the query above in MySQL and hope for the best:

INSERT INTO MyTable (x,y) 
  SELECT 0,2 FROM DUAL
  WHERE NOT EXISTS(SELECT * FROM MyTable WHERE y=2);
UPDATE MyTable SET x=x+1 WHERE y=2;

Notice several things: I am not using any IF and I am updating rows all the time after the conditional insert. I am selecting values from DUAL because any select with a where clause needs a table in MySQL and DUAL is a dummy one built into the engine. (SELECT 1 WHERE 2=2; is not valid in MySQL). Also, I am inserting a value, then updating it, which is not the most efficient.

There is another way, closer to the original T-SQL query, but it doesn't use EXISTS. It looks like this:

IF (SELECT 1=1 FROM MyTable WHERE y=2) THEN
  UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
  INSERT INTO MyTable (x,y) VALUES(1,2);
END IF;

Thing to notice here: I am using 1=1 to return a TRUE value that will make the IF work. Also, PAY ATTENTION!, this doesn't work as a simple query. I spent the better half of an hour trying to understand where the syntax error was while trying to execute this directly. Any flow operations like IF or WHILE, etc, are only valid in "programs", the MySQL term for stored procedures and functions.

I hope I clarified things for you on this.

Super Fast and Accurate string distance algorithm: Sift4

Published Nov 10, 2014

and has 44 comments

The algorithm works perfectly well and is better than Sift3, however it's slightly more complex. You might want to start with Sift3 in order to understand where it came from.

Update November 8 2022: I found a bug in the algorithm, relating to maxDistance. I've updated the code. If you didn't use maxDistance, you are unaffected. Basically the fix is to compare temporaryDistance>minDistance (before it was >= ) and to move the calculation of the temporary distance after c1 and c2 are updated to their minimum value when a token was not found (otherwise the temporary distance might become larger than the final distance)

Try the Javascript implementation here:

Algorithm:

Levenstein
Sift3
Sift4 MaxOffset:

String 1:

String 2:

Result:

Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can be now changed via the new options function transpositionCostEvaluator. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3, however when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.

Update 27 Mar 2015: Thanks to Emanuele Bastianelli who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop there is an extra check to prevent the algorithm exiting prematurely, before computing remaining tokens.

Intro

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave more more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:

33% more accurate
three different variants: simple, common and general
new concepts added
support for own value and matching functions, different tokenizer functions, etc.
actually tested with a (slightly more) serious test
more robust, working better for large values of maxOffset

Before I get into the details, I am publishing the algorithm here for the moment, no Codeplex or PasteBin or GitHub or whatever. Also, it is written in Javascript now, the C# and T-SQL version pending. Of course, it would be great if, as before, the community of people using the algorithm would go into implementing it into various programming languages, however I am a bit apprehensive because more often than not people came with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

New concepts in Sift4

I created a test that used random strings, but also a huge list of commonly used English phrases as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenstein and the new algorithm and computed the error distance between the Levenstein distance and the two Sift variants. This permitted me to see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases when Sift3(a,b) is different from Sift3(b,a). There are edge cases, but this one in particular grated me.

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, but it gets better as the value increases. The common version is a bit more complex, but the error decreases with 33% and maintains a low error for large maxOffset values. The extended or general version receives an options object that can change almost everything, but most important is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare them by the words in the text, maybe even their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what is the value of their match, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way that it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', one considers them transpositions and removes some of their value from the distance. Actually, if I look at the implementation, it might be that I have independently discovered the Jaro distance. I will research this further. I don't know if the transposition calculation is the most optimal. At the moment it uses an array of all matches found until a point, clearing it of values as the cursors move along the string. The difference between the simple and the common versions of Sift4 is that the simple version is not computing the transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed up Sift3.

Another new concept added is the one of local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non matching characters. Each of these substrings can be used to improve the distance function. For example one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the largest common subsequence is 5, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

The code

Simplest Sift4:

// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); //using max to bypass the need for computer transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}

Common Sift4:

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (maxDistance) {
                var temporaryDistance = Math.max(c1, c2) - lcss + trans;
                if (temporaryDistance > maxDistance)
                    return temporaryDistance;
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
}

Extended/General Sift4:

// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
//         maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//         tokenizer: a function to transform strings into vectors of tokens
//        tokenMatcher: a function to determine if two tokens are matching (equal)
//        matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
//        localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
//        transpositionCostEvaluator: a function to determine the value of an individual transposition. For example longer transpositions should have a higher cost.
//        transpositionsEvaluator: a function to determine the way the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options = extend(options, {
            maxDistance: null,
            tokenizer: function (s) {
                return s ? s.split('') : [];
            },
            tokenMatcher: function (t1, t2) {
                return t1 == t2;
            },
            matchingEvaluator: function (t1, t2) {
                return 1;
            },
            localLengthEvaluator: function (local_cs) {
                return local_cs;
            },
            transpositionCostEvaluator: function (c1, c2) {
                return 1;
            },
            transpositionsEvaluator: function (lcss, trans) {
                return lcss - trans;
            }
        });

    var t1 = options.tokenizer(s1);
    var t2 = options.tokenizer(s2);

    var l1 = t1.length;
    var l2 = t2.length;

    if (l1 == 0)
        return l2;
    if (l2 == 0)
        return l1;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1], t2[c2])) {
            local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans += options.transpositionCostEvaluator(c1, c2);
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (options.maxDistance) {
                var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
                if (temporaryDistance > options.maxDistance)
                    return Math.round(temporaryDistance);
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}

function extend(obj, def) {
    var result = {};
    for (var prop in def) {
        if (!obj || !obj.hasOwnProperty(prop)) {
            result[prop] = def[prop];
        } else {
            result[prop] = obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result = [];
    if (!s)
        return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s)
        return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i = 0; i <= 25; i++) {
        var val = 0;
        if (s) {
            for (j = 0; j < s.length; j++) {
                var code = s.charCodeAt(j);
                if (code == i + 65 || code == i + 97)
                    val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity > 0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l < 1)
        return l; //0 -> 0
    return l - 1 / (l + 1); //1 -> 0.5, 2-> 0.66, 9 -> 0.9
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}

As always, I will be most happy to know if you used my algorithm and how it performed, as well as receive any suggestion that you might have.

Options explained

Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters so that the algorithm would work on an array of one character long tokens. Other options are possible, like splitting the strings by empty spaces so that the comparisons are done on words or transforming a string into an array of strings N characters long, the so called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word in an array of 25 values representing the number of letters in the word for each letter a-z.

The tokenMatcher function returns true if two tokens are matching. They can be fuzzy matched, for example the sift4tokenMatcher example function uses Sift inside Sift to determine the character distance between two tokens and returns true if they match more than 70%.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length value when two tokens match. The default is 1, but one can use some other metric, like the similarity, for example. Of course, the common substring length has lost its meaning when these functions change, but the variable local_cs is still used.

The lengthEvaluator is taking the length value of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value as the one provided, but some functions could reward longer substrings.

FAQ

Q: Can you make Sift4 to work case insensitive?
A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply. Or implement a case insensitive tokenMatcher.

Q: Can you make Sift4 to compare strings based on their meaning, like using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces them with the most used of their synonyms. A more complex solution would require to analyze the strings beforehand and turn them into some ordered synonym or equivalent expressions equivalents, then use Sift4 with a word tokenizer (one is provided in the Extended algorithm source)

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course, they are different algorithms. This is a fuzzy distance calculator, it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. You need more accuracy, try to combine Sift with Levenshtein for example, computing Levenshtein only where Sift says the strings are above a certain similarity.

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is so far Javascript only, just compute the maxOffset that is convenient to you before you compare.

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know it and I will try and fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything, it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.

Q: What are the edge cases for Sift?
A: Probably there are several, but I didn't really spot them. One of them is that one might find both letters at a position matching letters at other positions, but only one will count. Example 'abxx' and 'bayy'. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string it will find 'a' matched in position 1 in the second. It will increase both counters and lcss will be increase as well. Next check will be 'b', the character at position 1 in the first string matched with position 2 in the second string. No match, therefore both counters will be reset to 1, and starting search again. The 'b' match is lost and distance is 3 instead of 2. Also I think there might be some situations where the counters are not equal and the biggest of them reaches the end of its string, thus terminating the algorithm, but there could have been more matches. Incidentally I tried to fix both these issues and the error from Levenshtein was not really affected, but I am not 100% sure of the implementation.

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. There is a function that is symmetric and that is Math.min(Sift4(a,b),Sift4(b,a)), however it is twice as slow, obviously.

Implementations in other languages

C# implementation [expand/collapse]

public class Sift4
{
    private Options _options;

    public Sift4(Options options)
    {
        if (options == null) options = new Options();
        if (options.Tokenizer == null)
        {
            options.Tokenizer = (s) => s == null
            ? new string[0]
            : s.ToCharArray().Select(c => c.ToString()).ToArray();
        }
        if (options.TokenMatcher == null)
        {
            options.TokenMatcher = (t1, t2) => t1 == t2;
        }
        if (options.MatchingEvaluator == null)
        {
            options.MatchingEvaluator = (t1, t2) => 1;
        }
        if (options.LocalLengthEvaluator == null)
        {
            options.LocalLengthEvaluator = (l) => l;
        }
        if (options.TranspositionCostEvaluator == null)
        {
            options.TranspositionCostEvaluator = (c1, c2) => 1;
        }
        if (options.TranspositionsEvaluator == null)
        {
            options.TranspositionsEvaluator = (l, t) => l - t;
        }
        _options = options;
    }

    /// <summary>
    /// General distance algorithm uses all the parameters in the options object and works on tokens
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public double GeneralDistance(string s1, string s2, int maxOffset)
    {
        var t1 = _options.Tokenizer(s1);
        var t2 = _options.Tokenizer(s2);

        var l1 = t1.Length;
        var l2 = t2.Length;

        if (l1 == 0) return _options.LocalLengthEvaluator(l2);
        if (l2 == 0) return _options.LocalLengthEvaluator(l1);

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0.0;  //largest common subsequence
        var local_cs = 0.0; //local common substring
        var trans = 0.0;  //cost of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (_options.TokenMatcher(t1[c1], t2[c2]))
            {
                local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans += _options.TranspositionCostEvaluator(c1, c2);
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans += _options.TranspositionCostEvaluator(ofs.C1, ofs.C2);
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (_options.MaxDistance != null)
                {
                    var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
                    if (temporaryDistance > _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += _options.LocalLengthEvaluator(local_cs);
        return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //apply transposition cost to the final result
    }

    /// <summary>
    /// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <param name="maxDistance"></param>
    /// <returns></returns>
    public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance = 0)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring
        var trans = 0;  //number of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans++;
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans++;
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (maxDistance > 0)
                {
                    var temporaryDistance = Math.Max(c1, c2) - (lcss - trans);
                    if (temporaryDistance > maxDistance) return temporaryDistance;
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += local_cs;
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - (lcss - trans); //apply transposition cost to the final result
    }

    /// <summary>
    /// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public static int SimplestDistance(string s1, string s2, int maxOffset)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Max(c1, c2);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - lcss;
    }

    private class OffsetPair
    {
        public int C1 { get; set; }
        public int C2 { get; set; }
        public bool IsTransposition { get; set; }

        public OffsetPair(int c1, int c2)
        {
            this.C1 = c1;
            this.C2 = c2;
            this.IsTransposition = false;
        }
    }

    public class Options
    {
        /// <summary>
        /// If set, the algorithm will stop if the distance reaches this value
        /// </summary>
        public int? MaxDistance { get; set; }

        /// <summary>
        /// The function that turns strings into a list of tokens (also strings)
        /// </summary>
        public Func<string, string[]> Tokenizer { get; set; }

        /// <summary>
        /// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
        /// </summary>
        public Func<string, string, bool> TokenMatcher { get; set; }

        /// <summary>
        /// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
        /// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
        /// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
        /// </summary>
        public Func<string, string, double> MatchingEvaluator { get; set; }

        /// <summary>
        /// Determines if the local value (computed on subsequent matched tokens) must be modified.
        /// In case one wants to reward longer matched substrings, for example, this is what you need to change.
        /// </summary>
        public Func<double, double> LocalLengthEvaluator { get; set; }

        /// <summary>
        /// The function determining the cost of an individual transposition, based on its counter positions.
        /// </summary>
        public Func<int, int, double> TranspositionCostEvaluator { get; set; }

        /// <summary>
        /// The function determining how the cost of transpositions affects the final result
        /// Change it if it doesn't suit you.
        /// </summary>
        public Func<double, double, double> TranspositionsEvaluator { get; set; }
    }
}

PHP implementation [expand/collapse]

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance = 0) {
    if (!$s1 || !strlen($s1)) {
        if (!$s2) {
            return 0;
        }
        return strlen($s2);
    }
    if (!$s2 || !strlen($s2)) {
        return strlen($s1);
    }
    $l1 = strlen($s1);
    $l2 = strlen($s2);
    $c1 = 0; //cursor for string 1
    $c2 = 0; //cursor for string 2
    $lcss = 0; //largest common subsequence
    $local_cs = 0; //local common substring
    $trans = 0; //number of transpositions ('ab' vs 'ba')
    $offset_arr = array(); //offset pair array, for computing the transpositions
    while (($c1 < $l1) && ($c2 < $l2)) {
        if (substr($s1, $c1, 1) == substr($s2, $c2, 1)) {
            $local_cs++;
            $isTrans = false;
            $i = 0;
            while ($i < sizeof($offset_arr)) { //see if current match is a transposition
                $ofs = $offset_arr[$i];
                if ($c1 <= $ofs['c1'] || $c2 <= $ofs['c2']) {
                    $isTrans = abs($c2 - $c1) >= abs($ofs['c2'] - $ofs['c1']);
                    if ($isTrans) {
                        $trans++;
                    } else {
                        if (!$ofs['trans']) {
                            $ofs['trans'] = true;
                            $trans++;
                        }
                    }
                    break;
                } else {
                    if ($c1 > $ofs['c2'] && $c2 > $ofs['c1']) {
                        array_splice($offset_arr, $i, 1);
                    } else {
                        $i++;
                    }
                }
            }
            array_push($offset_arr, array('c1' = > $c1, 'c2' = > $c2, 'trans' = > $isTrans));
        } else {
            $lcss+= $local_cs;
            $local_cs = 0;
            if ($c1 != $c2) {
                $c1 = $c2 = min($c1, $c2); //using min allows the computation of transpositions
            }
            if ($maxDistance) {
                $temporaryDistance = max($c1, $c2) - $lcss + $trans;
                if ($temporaryDistance > $maxDistance) return $temporaryDistance;
            }

            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for ($i = 0;$i < $maxOffset && ($c1 + $i < $l1 || $c2 + $i < $l2);$i++) {
                if (($c1 + $i < $l1) && (substr($s1, $c1 + $i, 1) == substr($s2, $c2, 1))) {
                    $c1+= $i - 1;
                    $c2--;
                    break;
                }
                if (($c2 + $i < $l2) && (substr($s1, $c1, 1) == substr($s2, $c2 + $i, 1))) {
                    $c1--;
                    $c2+= $i - 1;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if (($c1 >= $l1) || ($c2 >= $l2)) {
            $lcss+= $local_cs;
            $local_cs = 0;
            $c1 = $c2 = min($c1, $c2);
        }
    }
    $lcss+= $local_cs;
    return max($l1, $l2) - $lcss + $trans; //apply transposition cost to final result
    
}

Thanks to Ferenc Szatmári for corrections in the PHP code

The Simplest and General versions of the algorithm remain as an exercise to you, since I haven't been working in PHP for more than a decade.

T-SQL implementation [expand/collapse]

---<summary>
---Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
---</summary>
---<param name="s1"></param>
---<param name="s2"></param>
---<param name="maxOffset"></param>
---<param name="maxDistance"></param>
---<returns></returns>
CREATE FUNCTION Sift4common(@s1          NVARCHAR(max), 
                            @s2          NVARCHAR(max), 
                            @maxOffset   INT, 
                            @maxDistance INT) 
returns INT 
AS 
  BEGIN 
      DECLARE @l1 INT = Len(Isnull(@s1, '')); 
      DECLARE @l2 INT = Len(Isnull(@s2, '')); 

      IF ( @l1 = 0 ) 
        RETURN @l2; 

      IF ( @l2 = 0 ) 
        RETURN @l1; 

      DECLARE @c1 INT = 0; 
      DECLARE @c2 INT = 0; 
      DECLARE @lcss INT = 0; 
      DECLARE @local_cs INT = 0; 
      DECLARE @trans INT = 0; 
      DECLARE @offset_arr TABLE 
        ( 
           C1    INT, 
           C2    INT, 
           Trans BIT 
        ) 
      DECLARE @i INT 
      DECLARE @temporaryDistance FLOAT 
      DECLARE @result INT 
      DECLARE @oc1 INT, 
              @oc2 INT, 
              @otr BIT 
      DECLARE @isTrans BIT 

      WHILE ( ( @c1 < @l1 ) 
              AND ( @c2 < @l2 ) ) 
        BEGIN 
            IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
              BEGIN 
                  SET @local_cs=@local_cs + 1; 
                  SET @isTrans=0 
                  SET @oc1=NULL; 

                  SELECT TOP 1 @oc1 = o.C1, 
                               @oc2 = o.C2, 
                               @otr = o.Trans 
                  FROM   @offset_arr o 
                  WHERE  @c1 <= o.C1 
                          OR @c2 <= o.C2 

                  IF ( @oc1 IS NOT NULL ) 
                    BEGIN 
                        SET @isTrans=CASE 
                                       WHEN Abs(@c2 - @c1) >= Abs(@oc2 - @oc1) 
                                     THEN 1 
                                       ELSE 0 
                                     END 

                        IF ( @isTrans = 1 ) 
                          BEGIN 
                              SET @trans=@trans + 1 
                          END 
                        ELSE IF ( @otr = 0 ) 
                          BEGIN 
                              SET @trans=@trans + 1 

                              UPDATE @offset_arr 
                              SET    Trans = 1 
                              WHERE  C1 = @oc1 
                                     AND C2 = @oc2 
                          END 
                    END 

                  DELETE FROM @offset_arr 
                  WHERE  @c1 > C1 
                         AND @c1 > C2 
                         AND @c2 > C1 
                         AND @c2 > C2; 

                  INSERT INTO @offset_arr 
                  VALUES      (@c1, 
                               @c2, 
                               @isTrans); 
              END 
            ELSE 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 != @c2 ) 
                    -- using min allows the computation of transpositions  
                    BEGIN 
                        IF ( @c1 < @c2 ) 
                          BEGIN 
                              SET @c2=@c1; 
                          END 
                        ELSE 
                          BEGIN 
                              SET @c1=@c2; 
                          END 
                    END 
                IF ( @maxDistance > 0 ) 
                  BEGIN 
                    IF ( @c1 > @c2 ) 
                    BEGIN 
                      SET @temporaryDistance = @c1 - ( @lcss - @trans / 2.0 ); 
                    END 
                    ELSE 
                    BEGIN 
                      SET @temporaryDistance = @c2 - ( @lcss - @trans / 2.0 ); 
                    END 

                  IF ( @temporaryDistance > @maxDistance ) 
                    RETURN Round(@temporaryDistance, 0); 
              END 


                  --if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                  --so that we can have only one code block handling matches   
                  SET @i=0; 

                  WHILE ( @i < @maxOffset 
                          AND ( @c1 + @i < @l1 
                                 OR @c2 + @i < @l2 ) ) 
                    BEGIN 
                        IF ( ( @c1 + @i < @l1 ) 
                             AND Substring(@s1, @c1 + @i + 1, 1) = 
                                 Substring(@s2, @c2 + 1, 1) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 + @i - 1; 
                              SET @c2=@c2 - 1; 

                              BREAK; 
                          END 

                        IF ( ( @c2 + @i < @l2 ) 
                             AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
                                                              @c2 + @i + 1, 1 
                                                              ) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 - 1; 
                              SET @c2=@c2 + @i - 1; 

                              BREAK; 
                          END 

                        SET @i=@i + 1; 
                    END 
              END 

            SET @c1=@c1 + 1; 
            SET @c2=@c2 + 1; 

            IF ( ( @c1 >= @l1 ) 
                  OR ( @c2 >= @l2 ) ) 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 < @c2 ) 
                    BEGIN 
                        SET @c2=@c1; 
                    END 
                  ELSE 
                    BEGIN 
                        SET @c1=@c2; 
                    END 
              END 
        END 

      SET @lcss = @lcss + @local_cs; 

      --apply the transposition cost to the final result 
      IF ( @l1 > @l2 ) 
        BEGIN 
            SET @result = @l1 - ( @lcss - @trans ); 
        END 
      ELSE 
        BEGIN 
            SET @result = @l2 - ( @lcss - @trans ); 
        END 

      RETURN @result; 
  END

Clearly a general version of the algorithm is not possible in Transact SQL.

Here is a MySQL version, gracefully provided by Ferenc Szatmári:

BEGIN 
DECLARE l1 INT DEFAULT Length(IFNULL(s1, '')); 
DECLARE l2 INT DEFAULT Length(IFNULL(s2, '')); 


DECLARE c1 INT Default 0; 
DECLARE c2 INT default 0; 
DECLARE lcss INT default 0; 
DECLARE local_cs INT default 0; 
DECLARE trans INT default 0; 
DECLARE i INT;
DECLARE temporaryDistance FLOAT;
DECLARE result INT;
DECLARE oc1 INT;
DECLARE oc2 INT;
DECLARE otr smallint;
DECLARE isTrans smallint;

drop temporary table if exists offset_arr;
CREATE TEMPORARY TABLE IF not EXISTS offset_arr
( 
C1 INT, 
C2 INT,
Trans BIT
)engine=memory;


/*      delete from offset_arr;*/


IF l1 = 0 THEN
RETURN l2; 
END IF;
IF l2 = 0 THEN 
RETURN l1; 
END IF;  






WHILE ( ( c1 < l1 ) AND ( c2 < l2 ) ) 
DO 
IF ( Substring(s1, c1 + 1, 1) = Substring(s2, c2 + 1, 1) ) THEN

SET local_cs=local_cs + 1; 
SET isTrans=0;
SET oc1=NULL;
SELECT  o.C1, o.C2,o.Trans  into oc1, oc2, otr
FROM offset_arr o 
WHERE c1 <= o.C1 OR c2 <= o.C2
LIMIT 1;
IF oc1 IS NOT NULL THEN

SET isTrans=CASE WHEN ABS(c2-c1)>=ABS(oc2-oc1) THEN 1 ELSE 0 END;
IF (isTrans=1) THEN
SET trans=trans+1;
ELSE
IF (otr=0) THEN


SET trans=trans+1;
UPDATE offset_arr SET Trans=1 WHERE C1=oc1 AND C2=oc2;
END IF;
END IF;  
END IF;


DELETE FROM offset_arr 
WHERE  c1 > C1 
AND c1 > C2
AND c2 > C1 
AND c2 > C2; 


INSERT INTO offset_arr 
VALUES (c1, c2, isTrans); 

ELSE 

SET lcss = lcss + local_cs; 
SET local_cs = 0; 


IF ( c1 != c2 ) THEN
-- using min allows the computation of transpositions 


IF ( c1 < c2 ) THEN
SET c2=c1; 
ELSE 
SET c1=c2; 
END IF;

IF ( maxDistance > 0 ) THEN


IF ( c1 > c2 ) THEN
SET temporaryDistance = c1 - ( lcss - trans / 2.0 ); 
ELSE 
SET temporaryDistance = c2 - ( lcss - trans / 2.0 ); 
END IF;


IF ( temporaryDistance > maxDistance ) THEN
RETURN Round(temporaryDistance, 0); 
END IF;  
END IF;

END IF;


/*if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
--so that we can have only one code block handling matches  */
SET i=0; 


label : WHILE ( i < maxOffset AND (c1 + i < l1 OR c2 + i < l2) ) do


IF ( ( c1 + i < l1 ) 
AND Substring(s1, c1 + i + 1, 1) = Substring(s2, c2 + 1, 1) 
) THEN


SET c1 = c1 + i - 1; 
SET c2=c2 - 1; 


leave label; 
END IF; 


IF ( ( c2 + i < l2 ) 
AND Substring(s1, c1 + 1, 1) = Substring(s2, c2 + i + 1, 1) 
)THEN 


SET c1 = c1 - 1; 
SET c2=c2 + i - 1; 


leave label;  
END IF;


SET i=i + 1; 
END WHILE;
END IF; 


SET c1=c1 + 1; 
SET c2=c2 + 1; 


IF ( ( c1 >= l1 ) OR ( c2 >= l2 ) ) THEN
SET lcss = lcss + local_cs; 
SET local_cs = 0; 


IF ( c1 < c2 ) THEN
SET c2=c1; 
ELSE 
SET c1=c2; 
END IF;
END IF;
END while;


SET lcss = lcss + local_cs; 


/*apply the transposition cost to the final result*/
IF ( l1 > l2 ) THEN
SET result = l1 - (lcss - trans);
ELSE 
SET result = l2 - (lcss - trans);
END IF;
RETURN result; 
END

Java implementation [expand/collapse]

Here is a Java version, gracefully provided by Nathanæl Fischer:

/**
 * Sift4 - common version
 * online algorithm to compute the distance between two strings in O(n)
 * Algorithm by siderite, java port by Nathan Fischer 2016
 * /blog/super-fast-and-accurate-string-distance.html
 * @param s1
 * @param s2
 * @param maxOffset the number of characters to search for matching letters
 * @return
 */
public static double sift4(String s1, String s2, int maxOffset) {
    class Offset {
        int c1;
        int c2;
        boolean trans;

        Offset(int c1, int c2, boolean trans) {
            this.c1 = c1;
            this.c2 = c2;
            this.trans = trans;
        }
    }

    if (s1 == null || s1.isEmpty())
        return s2 == null ? 0 : s2.length();

    if (s2 == null || s2.isEmpty())
        return s1.length();

    int l1 = s1.length();
    int l2 = s2.length();

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    LinkedList < Offset > offset_arr = new LinkedList < > (); //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            boolean isTrans = false;
            //see if current match is a transposition
            int i = 0;
            while (i < offset_arr.size()) {
                Offset ofs = offset_arr.get(i);
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.remove(i);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.add(new Offset(c1, c2, isTrans));
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}

Powershell implementation [expand/collapse]

Powershell implementation of simple Sift4, by Kirk Sayre:

function Calc-Sift4Distance 
{
  <# 
      .SYNOPSIS 

      Compute the edit distance between 2 strings using the sift4 string
      edit distance algorithm.

      .PARAMETER s1

      The 1st string

      .PARAMETER s2

      The 2nd string

      .PARAMETER maxOffset

      The maximum common substring length for which to search. The default
      is 10.

      .RETURN

      The # of edits needed to make the given 2 strings equal.
  #>

  Param(
    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s1,

    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s2,

    [Parameter(ValueFromPipelineByPropertyName = $True)]
    [Int]
    $maxOffset = 10
  )

  # Handle null or empty strings.
  if ((-not $s1) -or ($s1.length -eq 0)) 
  {
    if (-not $s2) 
    {
      return 0
    }
    return $s2.length
  }

  if ((-not $s2) -or ($s2.length -eq 0)) 
  {
    return $s1.length
  }

  # Initialization.
  $l1 = $s1.length
  $l2 = $s2.length
  $c1 = 0 # Cursor for string 1.
  $c2 = 0 # Cursor for string 2.
  $lcss = 0 # Largest common subsequence.
  $local_cs = 0 # Local common substring.

  # Scan strings.
  while (($c1 -lt $l1) -and ($c2 -lt $l2)) 
  {
    if ($s1[$c1] -eq $s2[$c2]) 
    {
      $local_cs++
    }
    else 
    {
      $lcss += $local_cs
      $local_cs = 0
      if ($c1 -ne $c2) 
      {
        # Using max to bypass the need for computer transpositions ('ab' vs 'ba').
        $c1 = $c2 = (@($c1, $c2) | Measure-Object -Maximum).Maximum
      }

      for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) 
      {
        if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) 
        {
          $c1 += $i
          $local_cs++
          break
        }

        if ((($c1 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i])) 
        {
          $c2 += $i
          $local_cs++
          break
        }
      }
    }
    $c1++
    $c2++
  }
  $lcss += $local_cs
  return [math]::Round((@($l1, $l2) | Measure-Object -Maximum).Maximum - $lcss)
}

C++ implementation [expand/collapse]

Thanks to Hugo Amaro, a C++ implementation:

struct sift_offset {
  int c1;
  int c2;
  bool trans;
};

template < typename T >
  int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) {
    if (!s1 || !s1_size) {
      if (!s2) {
        return 0;
      }
      return s2_size;
    }

    if (!s2 || !s2_size) {
      return s1_size;
    }

    int l1 = s1_size;
    int l2 = s2_size;

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    std::vector < sift_offset > offset_arr; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
      if (s1[c1] == s2[c2]) {
        local_cs++;
        bool isTrans = false;
        //see if current match is a transposition
        int i = 0;
        while (i < offset_arr.size()) {
          sift_offset ofs = offset_arr[i];
          if (c1 <= ofs.c1 || c2 <= ofs.c2) {
            // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
            isTrans = std::abs(c2 - c1) >= std::abs(ofs.c2 - ofs.c1);

            if (isTrans) {
              trans++;
            } else {
              if (!ofs.trans) {
                ofs.trans = true;
                trans++;
              }
            }
            break;
          } else {
            if (c1 > ofs.c2 && c2 > ofs.c1) {
              offset_arr.erase(offset_arr.begin() + i);

            } else {
              i++;
            }
          }
        }
        offset_arr.push_back({
          c1,
          c2,
          isTrans
        });
      } else {
        lcss += local_cs;
        local_cs = 0;
        if (c1 != c2) {
          c1 = c2 = (int) min(c1, c2); //using min allows the computation of transpositions
        }
        //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
        //so that we can have only one code block handling matches 
        for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
          if ((c1 + i < l1) && (s1[c1 + i] == s2[c2])) {
            c1 += i - 1;
            c2--;
            break;
          }
          if ((c2 + i < l2) && (s1[c1] == s2[c2 + i])) {
            c1--;
            c2 += i - 1;
            break;
          }
        }
      }
      c1++;
      c2++;
      if (maxDistance) {
        int temporaryDistance = (int) max(c1, c2) - lcss + trans;
        if (temporaryDistance >= maxDistance) return std: round(temporaryDistance);
      }

    }
    // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
    if ((c1 >= l1) || (c2 >= l2)) {
      lcss += local_cs;
      local_cs = 0;
      c1 = c2 = (int) min(c1, c2);
    }
  }
lcss += local_cs;
return std::round((int) max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}

You can find a Go implementation here, written by Jason W. Hutchinson.
There is also a Swift implementation here.
A Perl 6 (now called Raku) implementation can be found here.

By default, IISExpress is working on 32bit even in 64bit Windows.

Published Nov 10, 2014

and has 0 comments

I had this memory problem that I could not understand. OK, the design of the application was not the best in the world, but why would it always give me a damn OutOfMemoryException when I have a super duper computer with a lot of memory and I have solved issues like heap memory allocation? And the reason is that IISExpress, the default Windows7 ASP.Net web server is running in 32bit mode, meaning it has a maximum of 4GB of memory no matter what you do. Well, you can make IISExpress run in 64bit mode by simply switching it on in the Windows registry. I don't want to copy the content of the article from someone else, so here is a link towards how to do it: Debugging VS2013 Websites Using 64-bit IIS Express.

Just in case you want the ultra fast version, copy this into a file with the .reg extension and execute it:

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\12.0\WebProjects]
"Use64BitIISExpress"=dword:00000001

Cheers!

Aggregate function to concatenate strings in T-SQL

Published Nov 10, 2014

Posted in
database
programming

and has 0 comments

You know how many a time you need to get the results of a query as a strings list, let's say a CSV and there is nothing in Microsoft SQL Server to help you? What you would like is something like an aggregate function similar to SUM that returns the concatenated value of all strings in the SELECT query.

From SQL Server 2017 on, there is such a function called STRING_AGG and you use it just like one might expect.

If you have an older version of SQL Server... upgrade :) But if you can't, here is a solution:

And before SQL Server 2008 that was the case, you needed to use variables and SELECT @S=@S+[Value] FROM .... But in SQL 2008 they added more XML support and thus the data() XML PATH method. Take notice that this method adds a space between atomic values. So, without further ado, here is the code to concatenate several values into a comma separated value string:

DECLARE @T TABLE(S NVARCHAR(Max))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel') -- enough with the fruity examples!

SELECT CONVERT(NVARCHAR(Max),(SELECT S+',' AS 'data()' FROM @T t
FOR XML PATH('')))

Result: Filet, T-bone, Sausage, Würstel, - of course, remove the last comma.

Update: What if I have a table with multiple columns and I just want to aggregate one?

The solution is to use the select with WHERE clauses on the GROUP BY columns. Here is an example:

DECLARE @t TABLE(col1 NVARCHAR(Max), col2 INT, value NVARCHAR(Max) )
INSERT INTO @T VALUES ('a',1,'1'),
('a',1,'2'),
('a',2,'3'),
('b',1,'1'),
('b',1,'2'),
('c',3,'67')
 
--SELECT * FROM @t t
 
SELECT col1, col2, CONVERT(NVARCHAR(Max),(
    SELECT value+',' AS 'data()' FROM @T t2  
        WHERE t1.col1 = t2.col1 
            AND t1.col2 = t2.col2 
        FOR XML PATH('')))
FROM @t t1
GROUP BY col1, col2

The BucketList, an generic IList implementation that doesn't use a contiguous array of items

Published Oct 15, 2014

Posted in
.NET
programming
C#

and has 0 comments

The wonder of .Net is that most of the time we don't really have to care about stuff like memory allocation, the framework does everything for you. One especially annoying thing, though, is when you are using a basic data structure that is supposed to be efficient and you get stuff like OutOfMemoryException. One of the cases is List<T> which in the background uses one big array. This means two things: one is that certain modifying operations are slow and the other is that it requires contiguous memory space for the items. If your memory space gets too fragmented, then there is not enough to allocate for hundred of thousands of items, even if in that memory space you only need a pointer for each item. That is why you end up with out of memory exceptions. It's not like you don't have enough memory, it's that you don't have a big enough contiguous block of it.

As a solution I give you... the BucketList<T> class. It has a bucket size that defaults to 10000 and it implements a list of lists that each will always have at most that amount of items as specified in the bucket size. This way operations that remove and add items will only operate on 10000 item big arrays and there is no need for only one big memory block. I implemented the IList interface explicitly, so that you will never find it comfortable to use an instance as a BucketList, but as an IList. This way you can replace the implementation of the interface with a normal List or whatever other form you like. Enjoy!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Siderite.DataStructures
{
    /// <summary>
    /// IList implementation that doesn't require a contiguous memory space (an array)
    /// </summary>
    /// <typeparam name="T"></typeparam>
    public class BucketList<T>:IList<T>
    {
        private int _bucketSize;
        private List<List<T>> _list;
        private int _count;

        /// <summary>
        /// Default constructor
        /// </summary>
        public BucketList():this(10000)
        {
            _list = new List<List<T>>();
        }

        /// <summary>
        /// Specify the bucket size (default 10000)
        /// </summary>
        /// <param name="bucketSize"></param>
        public BucketList(int bucketSize)
        {
            _bucketSize = bucketSize;
        }

        /// <summary>
        /// Create a bucket list from an IEnumerable
        /// </summary>
        /// <param name="enm"></param>
        public BucketList(IEnumerable<T> enm):this()
        {
            var list = (IList<T>)this;
            foreach (var itm in enm)
            {
                list.Add(itm);
            }
        }

        /// <summary>
        /// The item count
        /// </summary>
        public int Count
        {
            get
            {
                return ((IList<T>)this).Count;
            }
        }

        #region IList<T> implementation

        int IList<T>.IndexOf(T item)
        {
            var index = 0;
            for (var i = 0; i < _list.Count; i++)
            {
                var idx = _list[i].IndexOf(item);
                if (idx < 0)
                {
                    index += _list[i].Count;
                }
                else
                {
                    index += idx;
                    return index;
                }
            }
            return -1;
        }

        void IList<T>.Insert(int index, T item)
        {
            var idx = 0;
            for (var i = 0; i < _list.Count; i++)
            {
                var lst = _list[i];
                if (index < idx + lst.Count)
                {
                    lst.Insert(index - idx, item);
                    splitListIfTooLarge(i);
                    _count++;
                    return;
                }
                else
                {
                    idx += lst.Count;
                }
            }
            throw new IndexOutOfRangeException("index");
        }

        void IList<T>.RemoveAt(int index)
        {
            var idx = 0;
            for (var i = 0; i < _list.Count; i++)
            {
                var lst = _list[i];
                if (index < idx + lst.Count)
                {
                    lst.RemoveAt(index - idx);
                    removeListIfEmpty(i);
                    _count--;
                    return;
                }
                else
                {
                    idx += lst.Count;
                }
            }
            throw new IndexOutOfRangeException("index");
        }

        T IList<T>.this[int index]
        {
            get
            {
                var idx = 0;
                for (var i = 0; i < _list.Count; i++)
                {
                    var lst = _list[i];
                    if (index < idx + lst.Count)
                    {
                        return lst[index - idx];
                    }
                    else
                    {
                        idx += lst.Count;
                    }
                }
                throw new IndexOutOfRangeException("index");
            }
            set
            {
                var idx = 0;
                for (var i = 0; i < _list.Count; i++)
                {
                    var lst = _list[i];
                    if (index < idx + lst.Count)
                    {
                        lst[index - idx]=value;
                    }
                    else
                    {
                        idx += lst.Count;
                    }
                }
                throw new IndexOutOfRangeException("index");
            }
        }

        void ICollection<T>.Add(T item)
        {
            List<T> lst;
            int index;
            if (_list.Count == 0)
            {
                lst = new List<T>();
                _list.Add(lst);
                index=0;
            }
            else
            {
                index=_list.Count - 1;
                lst = _list[index];
            }
            lst.Add(item);
            splitListIfTooLarge(index);
            _count++;
        }

        void ICollection<T>.Clear()
        {
            _list.Clear();
            _count = 0;
        }

        bool ICollection<T>.Contains(T item)
        {
            return ((IList<T>)this).IndexOf(item) > -1;
        }

        void ICollection<T>.CopyTo(T[] array, int arrayIndex)
        {
            var index = arrayIndex;
            foreach (var lst in _list)
            {
                lst.CopyTo(array, index);
                index += lst.Count;
            }
        }

        int ICollection<T>.Count
        {
            get
            {
                return _count;
            }
        }

        bool ICollection<T>.IsReadOnly
        {
            get { return false; }
        }

        bool ICollection<T>.Remove(T item)
        {
            for (var i = 0; i < _list.Count; i++)
            {
                var lst = _list[i];
                if (lst.Remove(item))
                {
                    _count--;
                    removeListIfEmpty(i);
                    return true;
                }
            }
            return false;
        }

        IEnumerator<T> IEnumerable<T>.GetEnumerator()
        {
            foreach (var lst in _list)
            {
                foreach (var item in lst)
                {
                    yield return item;
                }
            }
        }

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return ((IList<T>)this).GetEnumerator();
        }

        #endregion

        private void splitListIfTooLarge(int listIndex)
        {
            var lst = _list[listIndex];
            if (lst.Count > _bucketSize)
            {
                var newList = lst.GetRange(0, _bucketSize);
                lst.RemoveRange(0, _bucketSize);
                _list.Insert(listIndex, newList);
            }
        }

        private void removeListIfEmpty(int listIndex)
        {
            var lst = _list[listIndex];
            if (lst.Count ==0)
            {
                _list.RemoveAt(listIndex);
            }
        }
    }
}

@font-face not working! ... if you don't write the CSS rule correctly

Published Sep 1, 2014

and has 0 comments

Just a quick solution for a problem that seems to be @font-face not functioning at all. You define the @font-face CSS rule, you then add a font-family CSS rule for your element and nothing seems to be happening. The element has the correct family when you look in the browser inspection tool, but the font is never loaded. Nor is it displayed. What could be wrong?

The simple answer in my case is that the element I wanted to style had a rule like this: font-family: 'My Special Font, Verdana, Arial';. The correct rule should have been font-family: 'My Special Font', Verdana, Arial;. The quotes are just for escaping spaces and the like for the individual family names, not for encapsulating the "value" of the css rule. I know, stupid, but I wasted half an hour on it!

Command line Windows Update

Published Aug 28, 2014

and has 0 comments

Update: If you are behind a proxy, here is some additional code to add right after creating the update session:

'updateSession.WebProxy.AutoDetect = true 'try this first. It doesn't work so well in some environments if no authentication windows appears (*cough* Windows 8 *cough*)

strProxy = "proxy name or address:proxy port"  'ex: 1234:999
strProxyUser = "your username"
strProxyPass = "your password" 

updateSession.WebProxy.Address=strProxy
updateSession.WebProxy.UserName=strProxyUser
updateSession.WebProxy.SetPassword(strProxyPass)

I am working behind a "secured" web proxy that sometimes skips a beat. As a result there are days in which I cannot install Window Updates, the normal Windows update application just fails (with Error Code: 0x80246002) and I am left angry and powerless. Well, there are options. First of all, none of the "solutions" offered by Microsoft seem to work. The most promising one (which may apply to you, but it did not apply to me) was that you may have corrupted files in the Download folder for Windows updates. As a result you need to:

Stop the Windows Update service issuing the command line command: net stop wuauserv or by going to Control Panel, Services and manually stopping it.
Go to the download folder parent found at %systemroot%\SoftwareDistribution (cd %systemroot%\SoftwareDistribution) and rename the Download folder (ren Download Download.old)
Start the Windows Update service issuing the command line command: net start wuauserv or by going to Control Panel, Services and manually starting it.

So my solution was to use a script that downloads and installs the Windows updates from the command line and I found this link: Searching, Downloading, and Installing Updates that pretty much provided the solution I was looking for. There are two issues with the script. The first is that it prompts you to accept any EULA that the updates may present. The second is that it downloads all updates, regardless of severity. So I am publishing here the script that I am using who fixes these two problems: EULA is automatically accepted and only Important and Critical updates are downloaded and installed:

Set updateSession = CreateObject("Microsoft.Update.Session")
updateSession.ClientApplicationID = "Siderite :) Sample Script"

Set updateSearcher = updateSession.CreateUpdateSearcher()

WScript.Echo "Searching for updates..." & vbCRLF

Set searchResult = _
updateSearcher.Search("IsInstalled=0 and Type='Software' and IsHidden=0")

WScript.Echo "List of applicable items on the machine:"

For I = 0 To searchResult.Updates.Count-1
    Set update = searchResult.Updates.Item(I)
    WScript.Echo I + 1 & "> " & update.Title
Next

If searchResult.Updates.Count = 0 Then
    WScript.Echo "There are no applicable updates."
    WScript.Quit
End If

WScript.Echo vbCRLF & "Creating collection of updates to download:"

Set updatesToDownload = CreateObject("Microsoft.Update.UpdateColl")

For I = 0 to searchResult.Updates.Count-1
    Set update = searchResult.Updates.Item(I)

    addThisUpdate = false
    If update.InstallationBehavior.CanRequestUserInput = true Then
        WScript.Echo I + 1 & "> skipping: " & update.Title & _
        " because it requires user input"
    Else
        If update.EulaAccepted = false Then
            update.AcceptEula()
            WScript.Echo I + 1 & ">     Accept EULA " & update.Title 
            addThisUpdate = true
            'WScript.Echo I + 1 & "> note: " & update.Title & " has a license agreement that must be accepted:"
            'WScript.Echo update.EulaText
            'WScript.Echo "Do you accept this license agreement? (Y/N)"
            'strInput = WScript.StdIn.Readline
            'WScript.Echo 
            'If (strInput = "Y" or strInput = "y") Then
            '    update.AcceptEula()
            '    addThisUpdate = true
            'Else
            '    WScript.Echo I + 1 & "> skipping: " & update.Title & _
            '    " because the license agreement was declined"
            'End If
        Else
            addThisUpdate = true
        End If
    End If

    If addThisUpdate AND (update.MsrcSeverity = "Important" OR update.MsrcSeverity = "Critical") Then
        'wscript.echo ("This item is " & update.MsrcSeverity & " and will be processed!")
    Else
        'comment these lines to make it download everything
        wscript.echo (update.Title & " has severity [" & update.MsrcSeverity & "] and will NOT be processed!")
        addThisUpdate=false
    End If

    If addThisUpdate = true Then
        wscript.echo(I + 1 & "> adding: (" & update.MsrcSeverity & ") " & update.Title)
        updatesToDownload.Add(update)
    End If
Next

If updatesToDownload.Count = 0 Then
    WScript.Echo "All applicable updates were skipped."
    WScript.Quit
End If
    
WScript.Echo vbCRLF & "Downloading updates..."

Set downloader = updateSession.CreateUpdateDownloader() 
downloader.Updates = updatesToDownload
downloader.Download()

Set updatesToInstall = CreateObject("Microsoft.Update.UpdateColl")

rebootMayBeRequired = false

WScript.Echo vbCRLF & "Successfully downloaded updates:"

For I = 0 To searchResult.Updates.Count-1
    set update = searchResult.Updates.Item(I)
    If update.IsDownloaded = true Then
        WScript.Echo I + 1 & "> " & update.Title 
        updatesToInstall.Add(update) 
        If update.InstallationBehavior.RebootBehavior > 0 Then
            rebootMayBeRequired = true
        End If
    End If
Next

If updatesToInstall.Count = 0 Then
    WScript.Echo "No updates were successfully downloaded."
    WScript.Quit
End If

If rebootMayBeRequired = true Then
    WScript.Echo vbCRLF & "These updates may require a reboot."
End If

WScript.Echo  vbCRLF & "Would you like to install updates now? (Y/N)"
strInput = WScript.StdIn.Readline
WScript.Echo 

If (strInput = "Y" or strInput = "y") Then
    WScript.Echo "Installing updates..."
    Set installer = updateSession.CreateUpdateInstaller()
    installer.Updates = updatesToInstall
    Set installationResult = installer.Install()
 
    'Output results of install
    WScript.Echo "Installation Result: " & _
    installationResult.ResultCode 
    WScript.Echo "Reboot Required: " & _ 
    installationResult.RebootRequired & vbCRLF 
    WScript.Echo "Listing of updates installed " & _
    "and individual installation results:" 
 
    For I = 0 to updatesToInstall.Count - 1
        WScript.Echo I + 1 & "> " & _
        updatesToInstall.Item(i).Title & _
        ": " & installationResult.GetUpdateResult(i).ResultCode   
    Next
End If
WScript.StdIn.Readline()

Save the code above in a file called Update.vbs and then creating a batch file that looks like this:

@ECHO OFF
start "Command line Windows update" cscript Update.vbs

Run the script and you will get the .vbs executed in a command line window that will also wait for pressing Enter at the end of execution so you can see the result.

For other solutions that are more system admin oriented, follow this link which provides you with a lot of possibilities, some in PowerShell, for example.

Also, I didn't find a way to install the updates without the Windows annoyance that asks me to reboot the computer popping up. If you know how to do that, I would be grateful.

Sending complex objects to T-SQL using the XML type

Published Aug 8, 2014

Posted in
database
.net
programming
c#

and has 1 comment

I have been working on a REST API lately and, while using Entity Framework or some other similar framework to abstract the database is certainly possible, I wanted to control every aspect of the implementation. I know, reinventing wheels, but this is how one learns. One of the most annoying bits was trying to translate some complex object from JSON to the (two dimensional) database relational tables. This post will explore my attempts and the solutions I have found.

My first attempt was straightforward: just send all types as DataTables, with some extra property to define identity and parent entity. This relies on the Microsoft Server SQL mechanism that allows sending of table variables to stored procedures. But this approach has several downsides. One of them is that in order to send a datatable to SQL you need... DataTables. As I have pointed out in several blog posts, the DataTable object is slow, and sometimes downright buggy. Even if I didn't care about performance that much, in order for SQL to receive the content of the DataTable one must create corresponding User Defined Types on the database side. Working with UDTs is very difficult for several reasons: you cannot alter a UDT (unless employing some SQL voodoo that changes system tables), you can only drop it and recreate it. This does not work if you use the UDT anywhere, so a lot of renaming needs to be done. Even if you automate the process, it's still very annoying. Then the UDT definition has to be an exact duplicate of the DataTable definition. Move some columns around and it fails. Debugging is also made difficult by the fact that the SQL profiler does not see the content of table variables when sending them to the server.

Long story short, I was looking for alternatives and I found XML. Now, you might think that this leads to a simple, and maybe even obvious, solution. But it's not that easy. Imagine that you send a list of objects in an XML. Each object is represented by an XML element and each property by a child element. In order to get the value of a property you need to do iterate through all the nodes, for each node find the properties, for each property find the one element that defines it, then get the attribute value or content of the property, all while making sure you select everything in a table. It's not that easy.

The solution I found, which simplifies the SQL code (and hopefully brings some well needed performance to the table) is to serialize the objects in a way that makes the selection simple enough. Here is an example: I have a Configuration object with an Id and a Name that also has a property called Servers, containing Server objects having an Id and a Url. Here is an example of XML serialization from the DataContractSerializer:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
   <Id>1</Id>
   <Name>Test Config</Name>
   <Servers>
      <Server>
         <Id>1</Id>
         <Url>http://some.url</Url>
      </Server>
      <Server>
         <Id>2</Id>
         <Url>http://some.other.url</Url>
      </Server>
   </Servers>
</Configuration>

The SQL code to get the information from an XML variable with this content would look like this:

DECLARE @Xml XML='<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
   <Id>1</Id>
   <Name>Test Config</Name>
   <Servers>
      <Server>
         <Id>1</Id>
         <Url>http://some.url</Url>
      </Server>
      <Server>
         <Id>2</Id>
         <Url>http://some.other.url</Url>
      </Server>
   </Servers>
</Configuration>'

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('(Id/text())[1]','INT')  as Id,
       T.Item.value('(Name/text())[1]','NVARCHAR(100)')  as Name
FROM @Xml.nodes('//Configuration') as T(Item)

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('(Id/text())[1]','INT')  as Id,
       T.Item.value('(Url/text())[1]','NVARCHAR(100)')  as Url
FROM @Xml.nodes('//Configuration/Servers/Server') as T(Item)

This works, but look at that code. In my case, the situation was worse, the object I was using was a wrapper which implemented IDictionary<string,object> and, even if it did implement ISerializable, both XmlSerializer and DataContractSerializer use the dictionary as their data and in the end I get ugly key elements and value elements that are even harder to get to and, I suppose, more inefficient to parse. Therefore I found the solution in IXmlSerializable, (yet) another serialization interface used exclusively by XML serializer classes. If every simple value would be saved as an attribute and every complex object in an element, then this could be the SQL code:

DECLARE @Xml XML='<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
   <Config Id="1" Name="Test Config">
   <Servers>
      <List>
        <Server Id="1" Url="http://some.url" />
        <Server Id="2" Url="http://some.other.url" />
      </List>
   </Servers>
   </Config>
</Configuration>'

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('@Id','INT')  as Id,
       T.Item.value('@Name','NVARCHAR(100)')  as Name
FROM @Xml.nodes('//Configuration/Config') as T(Item)

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('@Id','INT')  as Id,
       T.Item.value('@Url','NVARCHAR(100)')  as Url
FROM @Xml.nodes('//Configuration/Config/Servers/List/Server') as T(Item)

Much easier to read and hopefully to parse.

I am not going to write here about the actual implementation of IXmlSerializable. There are plenty of tutorials on the Internet about that. It's not pretty, :) but not too difficult, either.

What was the purpose of this exercise? Now I can send a complex object to SQL in a single query, making inserts and updates simple and not requiring at a call for each instance of each type of complex object. Now, is it fast? I have no idea. Certainly if performance is needed, perhaps the UDT/DataTable approach is faster. However you will have to define a type for each type that you send as a DataTable to a stored procedure. An alternative can be a binary serializer and a CLR SQL function that translates it into tables. However, in my project I need to easily implement very custom API methods and to control every aspect, including tracing and profiling the various SQL calls. I believe the customized IXmlSerializable/XML in SQL approach is a reasonable one.

As always, I hope this helps someone.

Efficiently implementing generic getters and setters

Published Aug 4, 2014

Posted in
.NET
programming
C#

and has 0 comments

I am going to tell you about how I worked on an object that can wrap any object, serialize it, deserialize it, send it to a database, all efficiently and generically. But first, let me start with the beginning: you need a way to efficiently set/get values of properties in objects of different types. That's where TypeCache comes in. I am sure not everything is perfect with the class, but hopefully it will put you on the right track.

I will start with some of the code, the class that does everything. Something will be missing, though.

/// <summary>
/// It caches the property setters and getters for the public properties of a type 
/// </summary>
public class TypeCache
{
    private static ConcurrentDictionary<Type, TypeCache> _typeCacheDict = new ConcurrentDictionary<Type, TypeCache>();

    public static TypeCache Get(Type type)
    {
        TypeCache cache;
        if (!_typeCacheDict.TryGetValue(type, out cache))
        {
            cache = new TypeCache(type);
            _typeCacheDict[type] = cache;
        }
        return cache;
    }

    private TypeCache(Type type)
    {
        Type = type;
        Setters = new ConcurrentDictionary<string, Action<object, object>>();
        Getters = new ConcurrentDictionary<string, Func<object, object>>();
        var props = type.GetProperties(BindingFlags.Public | BindingFlags.Instance);
        Properties = new ConcurrentDictionary<string, PropertyInfo>();
        foreach (var prop in props)
        {
            if (prop.CanRead)
            {
                var objGetter = prop.GetValueGetter();
                Getters[prop.Name] = objGetter.Compile();
            }
            if (prop.CanWrite)
            {
                var objSetter = prop.GetValueSetter();
                Setters[prop.Name] = objSetter.Compile();
            }
            Properties[prop.Name] = prop;
        }
    }

    public Type Type { get; private set; }
    public ConcurrentDictionary<string, Action<object, object>> Setters { get; private set; }
    public ConcurrentDictionary<string, Func<object, object>> Getters { get; private set; }
    public ConcurrentDictionary<string, PropertyInfo> Properties { get; private set; }
}

public static class TypeCacheExtensions
{
    /// <summary>
    /// Set the value of a property by name
    /// </summary>
    /// <param name="cache"></param>
    /// <param name="item"></param>
    /// <param name="key"></param>
    /// <param name="value"></param>
    public static void Set<T>(this TypeCache cache, object item, string key, T value)
    {
        if (cache == null || item == null) return;
        Action<object, object> setter;
        if (!cache.Setters.TryGetValue(key, out setter)) return;
        setter(item, (object)value);
    }

    /// <summary>
    /// Get the value of a property by name
    /// </summary>
    /// <param name="cache"></param>
    /// <param name="item"></param>
    /// <param name="key"></param>
    /// <returns></returns>
    public static T Get<T>(this TypeCache cache, object item, string key)
    {
        if (cache == null || item == null) return default(T);
        Func<object, object> getter;
        if (!cache.Getters.TryGetValue(key, out getter)) return default(T);
        return (T)getter(item);
    }

    /// <summary>
    /// Set the value for a property to default by name
    /// </summary>
    /// <param name="cache"></param>
    /// <param name="item"></param>
    /// <param name="key"></param>
    public static void Delete(this TypeCache cache, object item, string key)
    {
        if (cache == null || item == null) return;
        Action<object, object> setter;
        if (!cache.Setters.TryGetValue(key, out setter)) return;
        var value = cache.Properties[key].PropertyType.GetDefaultValue();
        setter(item, value);
    }

    /// <summary>
    /// Set the values for all the public properties of a class to their default
    /// </summary>
    /// <param name="cache"></param>
    /// <param name="item"></param>
    public static void Clear(this TypeCache cache, object item)
    {
        if (cache == null || item == null) return;
        Action<object, object> setter;
        foreach (var pair in cache.Properties)
        {
            if (!cache.Setters.TryGetValue(pair.Key, out setter)) continue;
            var value = pair.Value.PropertyType.GetDefaultValue();
            setter(item, value);
        }
    }
}

(I used extension methods so that there would be no problems using a null value as the TypeCache) This class would be used something like this:

class Program
    {
        static void Main(string[] args)
        {
            var obj = new TestObject();
            var cache = TypeCache.Get(obj.GetType());
            Stopwatch sw = new Stopwatch();
            sw.Start();
            for (var i = 0; i < 100000; i++)
            {
                cache.Get<int>(obj, "TestProperty");
                cache.Set<int>(obj, "TestProperty",i);
            }
            sw.Stop();
            Console.WriteLine("Time: " + sw.Elapsed.TotalMilliseconds);
            Console.ReadKey();
        }
    }

If you try to compile this, after adding all the namespaces required (System.Collections.Concurrent and System.Reflection), you will get an error, because you are missing the extension methods PropertyInfo.GetValueGetter and PropertyInfo.GetValueSetter. These should be returning Expressions that compiled get or set the value of an object. Here is the first attempt, using the normal PropertyInfo methods:

    public static class ReflectionExtensions
    {
        public static Expression<Func<object, object>> GetValueGetter(this PropertyInfo propertyInfo)
        {
            var getter=propertyInfo.GetGetMethod();
            return (obj) => getter.Invoke(obj, new object[] { });
        }

        public static Expression<Action<object, object>> GetValueSetter(this PropertyInfo propertyInfo)
        {
            var setter = propertyInfo.GetSetMethod();
            return (obj,val) => setter.Invoke(obj, new[] { val });
        }

        public static object GetDefaultValue(this Type type)
        {
            return type.IsValueType ? Activator.CreateInstance(type) : null;
        }
    }

Wonderful thing! GetGet and GetSet :) Basically I am returning expressions that use reflection to get the getter/setter and then execute them. How long would the program with one million tries of get and set take? 1000 milliseconds. Could we improve on that?

Before .Net 4.0 the only solution to do that would have been to emit IL and compile it inside the code. Actually, even in 4.0 it is the most efficient option. But given my incompetence in that direction, I will give you the (slightly) easier to understand solution. Here we sacrifice a little speed for the readability of code:

    public static class ReflectionExtensions
    {
        /// <summary>
        /// Get the expression of a value getter for a property. Compile the expression to execute or combine it with other expressions.
        /// </summary>
        /// <param name="propertyInfo"></param>
        /// <returns></returns>
        public static Expression<Func<object, object>> GetValueGetter(this PropertyInfo propertyInfo)
        {
            var instance = Expression.Parameter(typeof(object), "i");
            var convertObj = Expression.TypeAs(instance, propertyInfo.DeclaringType);
            var property = Expression.Property(convertObj, propertyInfo);
            var convert = Expression.Convert(property, typeof(object));
            return (Expression<Func<object, object>>)Expression.Lambda(convert, instance);
        }

        /// <summary>
        /// Get the expression of a value setter for a property. Compile the expression to execute or combine it with other expressions.
        /// </summary>
        /// <param name="propertyInfo"></param>
        /// <returns></returns>
        public static Expression<Action<object, object>> GetValueSetter(this PropertyInfo propertyInfo)
        {
            var instance = Expression.Parameter(typeof(object), "i");
            var argument = Expression.Parameter(typeof(object), "a");
            var convertObj = Expression.TypeAs(instance, propertyInfo.DeclaringType);
            var convert = Expression.Convert(argument, propertyInfo.PropertyType);
            var setterCall = Expression.Call(convertObj,propertyInfo.GetSetMethod(),convert);
            return (Expression<Action<object, object>>)Expression.Lambda(setterCall, instance, argument);
        }

        /// <summary>
        /// Get the default value of a type
        /// </summary>
        /// <param name="type"></param>
        /// <returns></returns>
        public static object GetDefaultValue(this Type type)
        {
            return type.IsValueType ? New.Instance(type) : null;
        }
    }

    /// <summary>
    /// Class used to get instances of a type
    /// </summary>
    public static class New
    {
        private static ConcurrentDictionary<Type, Func<object>> _dict = new ConcurrentDictionary<Type, Func<object>>();

        public static object Instance(Type type)
        {
            Func<object> func;
            if (!_dict.TryGetValue(type, out func))
            {
                func = Expression.Lambda<Func<object>>(Expression.New(type)).Compile();
                _dict[type] = func;
            }
            return func();
        }
    }

The rest of the article will be about explaining what these extension methods are doing. How fast were they? 300ms! 3.3 times faster. How did we do that?

Well, first of all let's get the New class out of the way. In this simple version it just creates instances of a type that has a parameterless constructor. In the case that I had, using simple objects to transfer information, I only needed that. What is does is create a lambda expression that represents the parameterless constructor of a type, compiles it and caches it. Even a simple thing like this is faster than Activator.GetInstance(type). Not by much, about 10%, but still. Anyway, we are only using this to delete the value of a property (by setting it to its default value), which is not really in the scope of our test. However, being a simple expression build, it shows you the things to come.

Now for the GetValueGetter and GetValueSetter methods. I will humanly read you the GetValueGetter method. You can correlate it with the code quite easily. Basically it says this:

the expression has one parameter, of type object
we will attempt to safely convert it (as) into the declaring type (the object the property belongs to)
we will access a property of an object of the declaring type
which we then convert to object so that the resulting expression returns object not the specific type of the property
transform it all into a lambda expression and return it

The major difficulty, for me at least, was to grasp that there is no object (there is no spoon), but only a lambda expression that receives a parameter of type object and returns another object. The expression would be compiled and then applied on an actual instance.

And that's that. The object that does the serialization and transformation to SqlParameters and gets/sets the values of the wrapped object is more complicated and I will probably not write a blog entry about it, but think about the concept a little: an object that receives another object as the constructor and works like a dictionary. The keys and values of the dictionary are filled by the property names and values of the original object, while any change in the values of the dictionary will be set to the properties of the object. It makes it very easy to access objects generically, controlling what accessors do, but without the need for PostSharp or T4 templating, real time, programmatic. The task of serialization is taken by the ISerializable interface, which also uses a type of dictionary, meaning the object can be passed around from web services to C# code and then to SQL via SqlParameters.

Changing the grouping template of the AngularJS grid

Published Jul 1, 2014

and has 1 comment

I had this AngularJS grid that I wanted to show totals for certain categories and types. To be more precise, I had a list of items with Category and Type and I want to know the total for all categories and, for each category, the total for each type. This works perfectly if I load all items as individual, but I had hundred of thousands of items so it was clearly impractical. The solution? Send totals for every category and type, then just add them in the grid. In order to do that, though, I had to change the template of the "grouping row", the one that in ngGrid has the ngAggregate class.

It seems that all that is required for that is to change the aggregate template in the grid options. If you are not interested in the details, jump directly to the solution.

There is already an aggregate template in ngGrid.js, one that (at this time) looks like this:

<div ng-click="row.toggleExpand()" ng-style="rowStyle(row)" class="ngAggregate">
    <span class="ngAggregateText">{{row.label CUSTOM_FILTERS}} ({{row.totalChildren()}} {{AggItemsLabel}})</span>
    <div class="{{row.aggClass()}}"></div>
</div>

So we see that that the number displayed in an aggregate row is coming from a function of the row object called totalChildren, which is defined in ngAggregate.prototype and looks like this:

ngAggregate.prototype.totalChildren = function () {
    if (this.aggChildren.length > 0) {
        var i = 0;
        var recurse = function (cur) {
            if (cur.aggChildren.length > 0) {
                angular.forEach(cur.aggChildren, function (a) {
                    recurse(a);
                });
            } else {
                i += cur.children.length;
            }
        };
        recurse(this);
        return i;
    } else {
        return this.children.length;
    }
};

Maybe one could change the function to cover specific types of objects and return a sum instead of a count, but that is not the scope of the current post.

The solution described here will involve a custom function and a custom template. Here is how you do it:

Define the options for the grid. I am sure you already have it defined somewhere, if not, it is advisable you would. Sooner or later you will want to customize the output and functionality.
Add a new property to the options called aggregateTemplate. This will look probably like the default template, but with another function instead of totalChildren.
Define the function that will aggregate the items.

Solution 1:

Define template:

$scope.gridOptions = {
        data: 'Data.Units',
        enableColumnResize: true,
        showColumnMenu: false,
        showFilter: true,
        multiSelect: false,
        showGroupPanel: true,
        enablePaging: false,
        showFooter: true,
        columnDefs: [
            { field: 'Country', displayName: 'Country', width: 190 },
            { field: 'Type', displayName: 'Unit Type' },
            { field: 'Count', displayName: 'Count', width: 180}
        ],
        aggregateTemplate: "<div ng-click=\"row.toggleExpand()\" ng-style=\"rowStyle(row)\" class=\"ngAggregate\">" +
    "    <span class=\"ngAggregateText\">{{row.label CUSTOM_FILTERS}} ({{aggFunc(row)}} {{AggItemsLabel}})</span>" +
    "    <div class=\"{{row.aggClass()}}\"></div>" +
    "</div>"
    };

Define function:

$scope.aggFunc = function (row) {
        var sumColumn='Count';
        var total = 0;
        angular.forEach(row.children, function(entry) {
            total+=entry.entity[sumColumn];
        });
        angular.forEach(row.aggChildren, function(entry) {
            total+=$scope.aggFunc(entry);
        });
        return total;
    };

What we did here is we replaced row.totalChildren() with aggFunc(row) which we defined in the scope. What it does is add to the total the value of 'Count' rather than just count the items. It goes through row.children, which contains normal row items, then through aggChildren, which contains aggregate rows, which we pass through the same function in order to get their total.

Well, this works perfectly, but doesn't that mean we need to use this for each grid? There is a lot of code duplication. Let's first put the template in the cache so we can reuse it:

module.run(["$templateCache", function($templateCache) {

  $templateCache.put("aggregateCountTemplate.html",
    "<div ng-click=\"row.toggleExpand()\" ng-style=\"rowStyle(row)\" class=\"ngAggregate\">" +
    "    <span class=\"ngAggregateText\">{{row.label CUSTOM_FILTERS}} ({{aggFunc(row)}} {{AggItemsLabel}})</span>" +
    "    <div class=\"{{row.aggClass()}}\"></div>" +
    "</div>"
  );

}]);

Now the gridOptions change to

$scope.gridOptions = {
        [...]
        aggregateTemplate: "aggregateCountTemplate.html"
    };

and we can reuse the template anytime we want. Alternatively we can create a file and reference it, without using the cache:

$scope.gridOptions = {
        [...]
        aggregateTemplate: "/templates/aggregateCountTemplate.html"
    };

Now, if we could replace the aggFunc function with a row function, adding it to ngAggregate.prototype. Unfortunately we cannot do that, since ngAggregate is a 'private' object. The only thing we can do is to add some sort of static function. The solution is to add it in the root scope, so that is available everywhere.

Solution 2:

Here is the content of the file aggregateCountTemplateCache.js, that I created and load every time in the site. It does two things: inject the function in the root scope of the application and add the template to the cache. The only other thing to do is to use the aggregateTemplate: "aggregateCountTemplate.html" grid options.

var module = angular.module('app', ['ngResource', 'ui.bootstrap', 'ui', 'ngGrid', 'importTreeFilter', 'ui.date', 'SharedServices', 'ui.autocomplete']);

module.run(["$templateCache","$rootScope", function($templateCache,$rootScope) {

  $rootScope.aggregateCountFunc = function (row) {
        var total = 0;
        angular.forEach(row.children, function(entry) {
            total+=entry.entity.Count;
        });
        angular.forEach(row.aggChildren, function(entry) {
            total+=$rootScope.aggregateCountFunc(entry);
        });
        return total;
    };

  $templateCache.put("aggregateCountTemplate.html",
    "<div ng-click=\"row.toggleExpand()\" ng-style=\"rowStyle(row)\" class=\"ngAggregate\">" +
    "    <span class=\"ngAggregateText\">{{row.label CUSTOM_FILTERS}} ({{aggregateCountFunc(row)}}{{AggItemsLabel}})</span>" +
    "    <div class=\"{{row.aggClass()}}\"></div>" +
    "</div>"
  );

}]);

Enjoy!

Detecting the language of a text in C#

Published May 26, 2014

Posted in
.NET
programming
C#

and has 0 comments

Short post about how to use language detection capabilities in your application. I will be demonstrating NTextCat, which is a .NET port of text_cat, a Perl script, itself an implementation of a whitepaper published in 1994 called N-Gram-Based Text Categorization.

Reference NTextCat - the library is now available as a NuGet package as well
Instantiate an identifier factory (usually RankedLanguageIdentifierFactory)
Get an instance of a RankedLanguageIdentifier from the factory by loading a language XML file
Call the Identify method on your text and get a list of languages in the order of the probability that the text in that language

Here is a piece of code that does that using the core XML file published with the library. Remember to add the XML to your project and set its property of Copy to Output Directory.

public class LanguageProcessor
{
    private RankedLanguageIdentifier _identifier;

    public string IdentifyLanguage(string text)
    {
        if (_identifier == null)
        {
            var file = new FileInfo(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "LanguageModels/Core14.profile.xml"));
            if (!file.Exists)
            {
                throw new FileNotFoundException("Could not find LanguageModels/Core14.profile.xml to detect the language");
            }
            using (var readStream = File.OpenRead(file.FullName))
            {
                var factory = new RankedLanguageIdentifierFactory();
                _identifier = factory.Load(readStream);
            }
        }
        var languages = _identifier.Identify(text);
        var mostCertainLanguage = languages.FirstOrDefault();
        if (mostCertainLanguage != null)
        {
            return mostCertainLanguage.Item1.Iso639_3;
        }
        return null;
    }

}

There are a lot of XML files, some taken from Wikipedia, for example, and handling 280+ languages, but for the casual purpose of finding non-English text in a list, the core one will suffice.

A log4net custom appender that creates JIRA issues and notifies users

Published May 20, 2014

Posted in
.NET
programming
C#

and has 0 comments

It has been a while since I've blogged something technical. It's just that nothing I did seemed to be worthy of this majestic blog... Well, jokes aside, here is a post detailing my log4net JIRA appender.

log4net is a popular (if not the most popular) logging framework for .NET out there. Its strength lies in its configurability, the possibility to create custom loggers, custom appenders, custom filters, etc. I will be talking about a custom appender, a class that can be loaded by log4net to consume the logged lines and put them somewhere. For example to make an application that uses log4net to write the log to the console all you do is configure it to use the console appender. The JIRA appender takes the log output and creates issues in JIRA, notifying users afterwards. JIRA is a tracker for team planning. It is also very popular.

In order to create an appender, one references the log4net assembly (or NuGet package) and then creates a class that inherits from AppenderSkeleton. We could implement IAppender, but the skeleton class has most of what people want from an appender. The next step is to override the Append method and we are done. We don't want to create an issue with each logged line, though, so we will make it so that it creates the issue after a period of inactivity or when the logger closes. For that we use the CancellationTokenSource class to create delayed actions that we can cancel and recreate. We also need to override OnClose().

For the JIRA Api I used a project called AnotherJiraRestClient, but I guess one can used anything out there. You will see that the notify functionality is not implemented so we have to add it.

Here is the appender source code:

using AnotherJiraRestClient;
using log4net.Appender;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace JiraLog4netAppender
{
    public class JiraAppender : AppenderSkeleton // most appender functionality is found in this abstract class
    {
        private List<string> _notifyUsers;

        // properties of the appender, configurable in the log4net.config file as children elements of the appender
        public string url { get; set; } // the url of the JIRA site
        public string user { get; set; } // the JIRA user
        public string password { get; set; } // the JIRA password
        public string project { get; set; } // the JIRA project
        public string notify // a comma separated list of JIRA users who will be notified of the issue's creation
        {
            get
            {
                return string.Join(", ", _notifyUsers);
            }
            set
            {
                _notifyUsers.Clear();
                if (!string.IsNullOrWhiteSpace(value))
                {
                    _notifyUsers.AddRange(value.Split(',').Select(s => s.Trim()).Where(s => !string.IsNullOrWhiteSpace(s)));
                }
            }
        }

        CancellationTokenSource _ts;
        StringWriter _sw;
        Task _task;
        private JiraClient _jc;
        private string _priorityId;
        private string _issueTypeId;

        private object _writerLock=new object();

        public JiraAppender()
        {
            _notifyUsers = new List<string>();
            _ts = new CancellationTokenSource(); // use this to cancel the Delay task
            _sw = new StringWriter();
        }

        protected override void Append(log4net.Core.LoggingEvent loggingEvent)
        {
            this.Layout.Format(_sw, loggingEvent); // use the appender layout to format the log lines (warning: for this you need to override RequiresLayout and return true)
            _ts.Cancel(); // cancel the task and create a new one. This means as long as the logger writes, the 5 second delay will be reset
            _ts = new CancellationTokenSource();
            _task = Task.Delay(5000, _ts.Token).ContinueWith(writeToJira, _ts.Token); // after 5 seconds (I guess you could make this configurable as well) create a JIRA issue
        }

        protected override bool RequiresLayout
        {
            get
            {
                return true;
            }
        }

        /// <summary>
        /// write to jira, either when 5 seconds of inactivity passed or the appender is closed.
        /// </summary>
        /// <param name="task"></param>
        private void writeToJira(Task task)
        {
            string s;
            lock (_writerLock) // maybe the method was already in progress when another one was called. We need to clear the StringWriter before we allow access to it again
            {
                s = _sw.ToString();
                var sb = _sw.GetStringBuilder();
                sb.Clear();
            }
            if (!string.IsNullOrWhiteSpace(s))
            {
                writeTextToJira(s);
            }
        }

        private void writeTextToJira(string text)
        {
            ensureClientAndValues();
            var summary = "Log: " + this.Name; // the issue summary
            var labels = new List<string> // some labels
            {
                this.Name, this.GetType().Name
            };
            var issue = new AnotherJiraRestClient.JiraModel.CreateIssue(project, summary, text, _issueTypeId, _priorityId, labels); // create the issue with type Issue and priority Trivial
            var basicIssue = _jc.CreateIssue(issue);
            _jc.Notify(basicIssue, _notifyUsers, "JiraAppender created an issue", null, null); // notify users of the issue's creation
        }

        /// <summary>
        /// Make sure we have a JiraClient and that we know the ids of the Issue type and the Trivial priority
        /// </summary>
        private void ensureClientAndValues()
        {
            if (_jc == null)
            {
                _jc = new JiraClient(new JiraAccount
                {
                    ServerUrl = url,
                    User = user,
                    Password = password
                });
            }
            if (_priorityId==null) {
                var priority = _jc.GetPriorities().FirstOrDefault(p => p.name == "Trivial");
                if (priority == null)
                {
                    throw new Exception("A priority with the name 'Trivial' was not found");
                }
                _priorityId = priority.id;
            }
            if (_issueTypeId == null)
            {
                var meta = _jc.GetProjectMeta(project);
                var issue = meta.issuetypes.FirstOrDefault(i => i.name == "Issue");
                if (issue == null)
                {
                    throw new Exception("An issue type with the name 'Issue' was not found");
                }
                _issueTypeId = issue.id;
            }
        }

        protected override void OnClose() //clean what you can and write to jira if there is anything to write
        {
            _ts.Cancel();
            writeToJira(null);
            _sw.Dispose();
            _ts.Dispose();
            _task = null;
            base.OnClose();
        }

    }
}

As I said, AnotherJiraRestClient does not implement the notify API call needed to inform the users of the creation of an issue, so we need to change the project a little bit. Perhaps when you are implementing this, you will find notify already there, with a different format, but just in case you don't:

add to JiraClient the following method:

public void Notify(BasicIssue issue, IEnumerable<string> userList, string subject, string textBody, string htmlBody)
{
    var request = new RestRequest()
    {
        Method = Method.POST,
        Resource = ResourceUrls.Notify(issue.id),
        RequestFormat = DataFormat.Json
    };
    request.AddBody(new NotifyData
    {
        subject = subject,
        textBody = textBody,
        htmlBody = htmlBody,
        to = new NotifyTo
        {
            users = userList.Select(u => new User
            {
                active = false,
                name = u
            }).ToList()
        }
    });
    var response = client.Execute(request);
    if (response.StatusCode != HttpStatusCode.NoContent)
    {
        throw new JiraApiException("Failed to notify users from issue with id=" + issue.id+"("+response.StatusCode+")");
    }
}

add to ResourceUrls the following method:

public static string Notify(string issueId)
{
    return Url(string.Format("issue/{0}/notify", issueId));
}

create the following classes in the JiraModel folder and namespace:

public class NotifyData
{
    public string subject { get; set; }
    public string textBody { get; set; }
    public string htmlBody { get; set; }
    public NotifyTo to { get; set; }
}

public class NotifyTo
{
    public List<User> users { get; set; }
}

public class User
{
    public string name { get; set; }
    public bool active { get; set; }
}

Here is an example configuration. Have fun!

<log4net>
   <appender name="JiraAppender" type="JiraLog4netAppender.JiraAppender, JiraLog4netAppender">
      <url value="https://my.server.url/jira"/>
      <user value="jirauser"/>
      <password value="jirapassword"/>
      <project value="jiraproject"/>
      <notify value="siderite,someotheruser" />
      <layout type="log4net.Layout.PatternLayout">
          <conversionPattern value="%date [%thread] %-5level %logger [%property{NDC}] - %message%newline" />
      </layout>
        <filter type="log4net.Filter.LevelRangeFilter">
          <levelMin value="WARN" />
          <levelMax value="FATAL" />
        </filter>
   </appender>
     <root>
     <level value="DEBUG" />
         <appender-ref ref="JiraAppender" />
   </root>
</log4net>

Adventures in the .Net file system library System.IO

Published May 2, 2014

Posted in
.NET
programming
C#

and has 1 comment

Intro (click to hide)

Sometimes, after making some software or another and feeling all proud that everything works, I get an annoying exception that the path of the files my program uses is too long, the horrible PathTooLongException. At first I thought it was a filesystem thing or an operating system thing, or even a .Net version thing, but it is not. The exception would appear in .Net 4.5 apps running on Windows 7 systems that used NTFS on modern hardware.

Lazy as any software dev, I googled for it and found a quick and dirty replacement for System.IO called Delimon that mitigated the issue. This library is trying to do most of what System.IO does, but using the so called extended-length paths, stuff that looks like \\?\C:\Windows, with external functions accessed directly from kernel32.dll. All good and nice, but the library is hardly perfect. It has bugs, it doesn't implement all of the System.IO functionality and feels wrong. Why would Microsoft, the great and mighty, allow such a ridiculous problem to persist in their core filesystem library?

And the answer is: because it is a core library. Probably they would have to make an enormous testing effort to change anything there. It is something from a managed code developer nightmare: unsafe access to system libraries, code that spans decades of work and a maze of internal fields, attributes, methods, properties that can be used only from inside the library itself. Not to mention all those people who decided to solve problems in core classes using reflection and stuff. My guess is that they probably want to replace file system usage with another API, like Windows.Storage, which, alas, are only used for Windows Phone.

In this blog post I will discuss the System.IO problems that relate to the total length of a path, what causes them and possible solutions (if any).

Let's start with the exception itself: PathTooLongException. Looking for usages of the exception in the mscorlib.dll assembly of .Net 4.0 we see some interesting things. First of all, there is a direct translation from Windows IO error code 206 to this exception, so that means that, in principle, there should be no managed code throwing this exception at all. The operating system should complain if there is a path length issue. But that is not the case in System.IO.

Most of the other usages of the exception come from the class PathHelper, a helper class used by the System.IO.Path class in a single method: NormalizePath. Wonderful method, that: internal static unsafe. PathHelper is like a multiple personality class, the active one being determined by the field useStackAlloc. If set to true, then it uses memory and speed optimized code, but assumes that the longest path will always be 260. That's a constant, it is not something read from the operating system. If set to false, the max path length is also provided as a parameter. Obviously, useStackAlloc is set to true in most situations. We will talk about NormalizePath a bit later.

The other usages of the PathTooLongException class come from two Directory classes: Directory and LongPathDirectory. If you instantly thought "Oh, God, I didn't know there was a LongPathDirectory class! We can use that and all problems disappear!", I have bad news for you. LongPathDirectory is an internal class. Other than that it seems to be a copy paste clone of Directory that uses Path.LongMaxPath instead of hardcoded constants (248 maximum directory name length, for example) or... Path.MaxPath. If you thought MaxPath and LongMaxPath are properties that can be set to fix long path problems, I have bad news for you: they are internal constants set to 260 and 32000, respectively. Who uses this LongPathDirectory class, though? The System.IO.IsolatedStorage namespace. We'll get back to this in a moment.

Back to Path.NormalizePath. It is a nightmare method that uses a lot of internal constants, system calls, convoluted code; it seems like someone deliberately tried to obfuscate its code. It's an internal method, of course, which makes no sense, as the functionality of path normalization would be useful in a lot of scenarios. Its first parameter is path, then fullCheck, which when true leads to extra character validation. The fourth parameter is expandShortPaths which calls the GetLongPathName function of kernel32.dll. The third parameter is more interesting, it specifies the maximum path length which is sent to PathHelper or makes local checks on the path length. But who uses this parameter?

And now we find a familiar pattern: there is a class (internal of course) called LongPath, which seems to be a clone of Path, only designed to work with long paths. Who uses LongPath? LongPathDirectory, LongPathFile and classes in the System.IO.IsolatedStorage namespace!

So, another idea becomes apparent. Can we use System.IO.IsolatedStorage to have a nice access to the file system? No we can't. For at least two reasons. First of all, the isolated storage paradigm is different from what we want to achieve, it doesn't access the raw file system, instead it isolates files in containers that are accessible to that machine, domain, application or user only. Second, even if we could get an "isolated" store that would represent the file system - which we can't, we would still have to contend with the string based way in which IsolatedStorage works. It is interesting to note that IsolatedStorage is pretty much deprecated by the Windows 8 Windows.Storage API, so forget about it. Yeah, so we have LongPathDirectory, LongPathFile and LongPath classes, but we can't really use them. Besides, what we want is something more akin to DirectoryInfo and FileInfo, which have no LongPath versions.

What can we do about it, then? One solution is to use Delimon. It has some bugs, but they can be avoided or fixed, either by the developers or by getting the source/decompiling the library and fixing the bugs yourself. A limited, but functional solution.
An interesting alternative is to use libraries the BCL team published for long path access: LongPath which seems to contain classes similar to the ones we find in mscorlib, but it's latest release is from 2010 or Zeta long paths which has a more recent release, 2013, but is completely different, using the FileInfo and DirectoryInfo paradigm, too.
Of course, you can always make your own API.

Another solution is to be aware of the places where the length limitation appears and avoid them via other type of development, in other words, a file system best practices compilation that eventually turns into a new file system API.

Both solutions coalesce into using some other library instead of System.IO. That's horrible and I think a stain on .Net's honor!

So let's see where the exception is actually thrown.

I've made some tests. First of all, I used FAR Manager, a file manager, to create folders of various lengths. The longest one was 255, before I got an exception. To my surprise, Windows Explorer could see it, but it could not open or copy/rename/delete it. I reduced the size of its name until the total size of the path was 260, then I could manipulate it in Windows Explorer. So there are external reasons for not creating paths as long, but we see that there are tools that can access files like that. Let's attempt to create some programatically.

System.IO.Directory.CreateDirectory immediately fires the exception. DirectoryInfo has no problem instantiating with the long path as the parameter, but the Create method throws the same exception. Any attempt to create a folder of more than 248 characters, even if the total path was less than 260 characters, failed as well.

However, with reflection to the rescue, I could create paths as long as 32000 characters and folders with names as long as 255 characters using our friend LongPathDirectory:

var longPathDirectoryType = typeof(System.IO.Directory).Assembly.GetTypes().First(t=>t.Name=="LongPathDirectory");
var createDirectoryMethodInfo = longPathDirectoryType.GetMethod("CreateDirectory", System.Reflection.BindingFlags.Static | System.Reflection.BindingFlags.NonPublic);
createDirectoryMethodInfo.Invoke(null, new object[] { path });

What about files? FAR Manager threw the same errors if I tried to create a filename larger than 255 characters. Let's try to create the same programmatically.

File.Create threw the exception, as well as FileInfo.Create and the FileStream constructors.

So can we use the same method and use LongPathFile? No! Because LongPathFile doesn't have the creating and opening functionalities of File. Instead, FileStream has a constructor that specifies useLongPath. It is internal, of course, and used only by IsolatedStorageFileStream!

Code to create a file:

var fileStreamConstructorInfo = typeof(System.IO.FileStream).GetConstructor(System.Reflection.BindingFlags.NonPublic|System.Reflection.BindingFlags.Instance,null,
        new Type[] { 
        typeof(string) /*path*/, typeof(System.IO.FileMode) /*mode*/, typeof(System.IO.FileAccess) /*access*/, 
        typeof(System.IO.FileShare) /*share*/, typeof(int) /*bufferSize*/, typeof(System.IO.FileOptions) /*options*/, 
        typeof(string) /*msgPath*/, typeof(bool) /*bFromProxy*/, typeof(bool) /*useLongPath*/, typeof(bool) /*checkHost*/},null);
    var stream = (System.IO.Stream)fileStreamConstructorInfo.Invoke(new object[] {
        path, System.IO.FileMode.Create, System.IO.FileAccess.Write,
        System.IO.FileShare.None, 4096, System.IO.FileOptions.None,
        System.IO.Path.GetFileName(path), false, true, false
    });

Horrible, but it works. Again, no filenames bigger than 255 and the exception coming from the file system, as it should. Some info about the parameters: msgPath is the name of the file opened by the stream, if bFromProxy is true the stream doesn't try to demand some security permissions, checkHost... does nothing :) Probably someone wanted to add a piece of code there, but forgot about it.

Why did I use 4096 as the buffer size? Because that is the default value used by .Net when not specifying the value. Kind of low, right?

Now, this is some sort of midway alternative to using a third party file system library: you invoke through reflection code that is done by Microsoft and hidden for no good reason. I don't condone using it, unless you really need it. What I expect from the .Net framework is that it takes care of all this for me. It seems (as detailed in this blog post), that efforts are indeed made. A little late, I'd say, but still.