The Sift string distance algorithm is now on Github

Published Mar 28, 2009

Posted in: .NET, programming, C#, github

I've gathered the strength to defeat my laziness and put another of my projects on Github. I am talking about the Sift3 algorithm, described here.

The URL for the project is https://github.com/Siderite/Sift3/ where you can download the library and sources in .NET 3.5 C#. A class that also implements the Levenshtein and Length string distance algorithms is provided.

Please let me know if you are interested in the project, have any suggestions or are even using the algorithm in your projects.

UpdatePanel Shrinker on Github

Published Feb 23, 2009

This is a message I got from the UpdatePanel Shrinker in a site we built:
Shrinkage: from 70000 to 10 = 0.014%.

Almost a year ago I thought of a way of compressing the UpdatePanel asynchronous output based on the previously sent information, and created the UpdatePanel Shrinker. I held it back all this time in order to test it, and I have also used it in some small projects.

As of today, the project is on Github under an MIT licence. That means you can do whatever you want with it, but I would appreciate a few words from you.

As for the details: it uses a sort of quick-and-dirty homemade diff algorithm to compare the output previously sent for an UpdatePanel with the current one. The drawback is that the effect can only be seen from the second async postback on, since the first one has nothing to be compared against, but from then on it yields fantastic compression rates.

You can use it for sites that are accessed from Internet-challenged locations, for sites that have complex Ajax interactions and for sites that you "Ajaxify". And before I get angry comments from purists: yes, I know it is more efficient to use Ajax in a smart way and solve each problem in the best possible way. But if you just want the quick-and-dirty solution, like a MasterPage with a ScriptManager, an UpdatePanel and the page content in it, the Shrinker is the thing for you.

Take care to look in the Debug Output window. The shrinker will output the compression rate and any warnings it might have.

Very slow UpdatePanel refresh when containing big ListBoxes or DropDownLists

Published Jun 14, 2008

Update: this fix is now on Github. Get the latest version from there.

The scenario is pretty straightforward: a ListBox, a DropDownList or any other control that renders as a select HTML element with a few thousand entries or more causes an asynchronous UpdatePanel update to become incredibly slow on Internet Explorer and reasonably slow on Firefox, keeping the CPU at 100% during this time. Why is that?

Delving into the UpdatePanel inner workings, one can see that the actual update is done through an _updatePanel Javascript function. It has three major parts: it runs all the dispose scripts for the update panel, then it executes _destroyTree(element), and then it sets element.innerHTML to the newly rendered content. Amazingly enough, the slow part comes from the _destroyTree function. It recursively takes all the HTML elements in an UpdatePanel div and tries to dispose them, their associated controls and their associated behaviours. I don't know why it takes so long with select elements; all I can tell you is that childNodes contains all the options of a select, so the script tries to dispose every one of them, but it is mostly an IE DOM issue.
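To get a feel for how much work a recursive cleanup of this kind has to do, here is a minimal sketch. It is not the framework's _destroyTree code: countVisitedNodes and makeFakeSelect are hypothetical helpers, and plain objects stand in for DOM nodes so the snippet runs outside a browser. It only counts how many nodes a walk over childNodes would touch.

```javascript
// Hypothetical stand-in for a recursive cleanup walk; NOT the framework's code.
// It counts every node a depth-first walk over childNodes would visit.
function countVisitedNodes(node) {
  var visited = 1; // the node itself
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    visited += countVisitedNodes(children[i]);
  }
  return visited;
}

// Builds a fake <select> (a plain object) with the given number of options.
function makeFakeSelect(optionCount) {
  var options = [];
  for (var i = 0; i < optionCount; i++) {
    options.push({ tagName: 'OPTION', childNodes: [] });
  }
  return { tagName: 'SELECT', childNodes: options };
}

// A panel div that contains a single select with 10000 options.
var panel = { tagName: 'DIV', childNodes: [makeFakeSelect(10000)] };
console.log(countVisitedNodes(panel)); // 10002: the div, the select and every option
```

In a real page every option usually carries a text node as well, so the number of visited nodes would be even larger than in this simplified model.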

What is the solution? Enter the ScriptManager.RegisterDispose method. It registers a dispose Javascript script for a control inside an UpdatePanel, to be run when the panel is refreshed or deleted. Remember the first part of _updatePanel? So if you add a script that clears all the useless options of the select on dispose, you get an instantaneous update!

First attempt: I used select.options.length=0;. I realized that on Internet Explorer it took just as long to clear the options as it took to dispose them in the _destroyTree function. The only way I could make it work instantly was with select.parentNode.removeChild(select). Of course, that means the actual selection would be lost, so something more complicated was needed if I wanted to preserve the selection in the ListBox.

Second attempt: I would dynamically create another select, with the same id and name as the target select element, but then I would populate it only with the selected options from the target, then use replaceChild to make the switch. This worked fine, but I wanted something a little better, because I would have the same issue trying to dynamically create a select with a few thousand items.

Third attempt: I would dynamically create a hidden input with the same id and name as the target select, then set its value to the comma-separated list of the values of the selected options in the target select element. That should have solved all problems, but somehow it didn't. When selecting 10000 items and updating the UpdatePanel, it took about 5 seconds to replace the select with the hidden field, but then it took minutes again to recreate the UpdatePanel!

Here is the piece of code that fixes most of the issues so far:

/// <summary>
/// Use it in Page_Load.
/// lbTest is a ListBox with 10000 items
/// updMain is the UpdatePanel in which it resides
/// </summary>
private void RegisterScript()
{
    string script =
        string.Format(@"
var select=document.getElementById('{0}');
if (select) {{
    // first attempt
    //select.parentNode.removeChild(select);

    // second attempt
    //    var stub=document.createElement('select');
    //    stub.id=select.id;
    //    for (var i=0; i<select.options.length; i++)
    //        if (select.options[i].selected) {{
    //            var op=new Option(select.options[i].text,select.options[i].value);
    //            op.selected=true;
    //            stub.options[stub.options.length]=op;
    //        }}
    //    select.parentNode.replaceChild(stub,select);

    // third attempt
    var stub=document.createElement('input');
    stub.type='hidden';
    stub.id=select.id;
    stub.name=select.name;
    stub._behaviors=select._behaviors;
    var val=new Array();
    for (var i=0; i<select.options.length; i++)
        if (select.options[i].selected) {{
            val[val.length]=select.options[i].value;
        }}
    stub.value=val.join(',');
    select.parentNode.replaceChild(stub,select);
}};",
            lbTest.ClientID);
    ScriptManager sm = ScriptManager.GetCurrent(this);
    if (sm != null) sm.RegisterDispose(lbTest, script);
}

What still made the whole thing slow was the initialization of the page after the UpdatePanel updated. The cause goes all the way down to the WebForms.js file embedded in System.Web.dll (NOT System.Web.Extensions.dll), so it is part of the .NET framework itself. Within the WebForm_InitCallback Javascript function, it takes all the elements of the HTML form (for selects, all the selected options) and adds them to the list of posted-back controls.

The code looks like this:

if (tagName == "select") {
    var selectCount = element.options.length;
    for (var j = 0; j < selectCount; j++) {
        var selectChild = element.options[j];
        if (selectChild.selected == true) {
            WebForm_InitCallbackAddField(element.name, element.value);
        }
    }
}

function WebForm_InitCallbackAddField(name, value) {
    var nameValue = new Object();
    nameValue.name = name;
    nameValue.value = value;
    __theFormPostCollection[__theFormPostCollection.length] = nameValue;
    __theFormPostData += name + "=" + WebForm_EncodeCallback(value) + "&";
}

That is funny enough, because __theFormPostCollection is only used in the WebForm_DoCallback function, to simulate a postback by adding a hidden input for each of the collection's items to an xmlRequestFrame (just like my code above). That function, in turn, is only called through the GetCallbackEventReference(string target, string argument, string clientCallback, string context, string clientErrorCallback, bool useAsync) method of the ClientScriptManager, which in turn is only used in rare scenarios, such as the built-in Javascript callback mechanism of GridViews, DetailsViews and TreeViews. And that is it!! The incredible delay in this Javascript code comes from a useless piece of code! The whole WebForm_InitCallback function is useless most of the time! So I added this little bit of code to the RegisterScript method and it all went marvelously fast: 10 seconds for 10000 selected items.

string script = @"WebForm_InitCallback=function() {};";
ScriptManager.RegisterStartupScript(this, GetType(), "removeWebForm_InitCallback", script, true);

And that is it! Problem solved.
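If some pages might still rely on the built-in callback mechanism, a more cautious variant of the same idea is to keep a reference to the original function so it can be put back. The sketch below only illustrates that pattern: disableInitCallback, restoreInitCallback and the _originalInitCallback property are hypothetical names, not part of WebForms.js, and scope stands in for the browser's window object so the code can be tried outside a page.

```javascript
// Replaces scope.WebForm_InitCallback with a no-op, remembering the original.
// `scope` stands in for `window`; all helper names here are hypothetical.
function disableInitCallback(scope) {
  if (typeof scope.WebForm_InitCallback !== 'function') {
    return false; // nothing to replace on this page
  }
  if (!scope._originalInitCallback) {
    scope._originalInitCallback = scope.WebForm_InitCallback;
  }
  scope.WebForm_InitCallback = function () {};
  return true;
}

// Puts the original function back, if one was saved earlier.
function restoreInitCallback(scope) {
  if (typeof scope._originalInitCallback !== 'function') {
    return false;
  }
  scope.WebForm_InitCallback = scope._originalInitCallback;
  scope._originalInitCallback = null;
  return true;
}

// Usage with a fake scope object instead of window:
var calls = 0;
var fakeWindow = { WebForm_InitCallback: function () { calls++; } };
disableInitCallback(fakeWindow);
fakeWindow.WebForm_InitCallback(); // no-op, calls stays 0
restoreInitCallback(fakeWindow);
fakeWindow.WebForm_InitCallback(); // original runs again, calls becomes 1
```

In a page this would be called as disableInitCallback(window) from a startup script, under the same assumption as above: that nothing on the page needs the callback data.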

Compressing UpdatePanel output

Published Apr 22, 2008

Update: I've posted the source and binaries for this control on Github. Free to use and change. Please comment on it.

This is the story of a control that shrinks the content sent from an UpdatePanel down to as little as 2% of its size without using compression algorithms. I am willing to share the code with whoever wants it; all I ask in return is that you tell me if and where it went wrong, so I can find a solution. Even if you are not interested in the control, the article describes a little of the inner workings of the ASP.NET Ajax mechanism.

You know the UpdatePanel, the control that updates its content asynchronously in ASP.NET, allowing you to easily transform a normal postback-based application into a fully fledged Ajax app. The only problem with that control is that you have two choices. You can put a lot of panels on the page in order to update only what changes, which makes the site as fast as you would expect from an Ajax application, but hard to maintain afterwards. Or you can put one big UpdatePanel around the entire page (maybe in the MasterPage) and accept large chunks of data being passed back and forth, along with other clear disadvantages, some detailed in this blog entry.

Not anymore! I have made a small class, in the form of a Response.Filter, that caches the previously rendered content and instructs the browser to do the same, then sends only a small fraction of the data from the server to the browser, mainly what has changed. There is still the issue of the time it takes the browser to render the content, which stays the same no matter what I do, like when rendering a huge table: even if I send only the changes in one cell, the browser must still render the huge grid. Also, if the update fails for some reason, I catch it and tell the server that the UpdatePanel must be updated again, the old way.

Enough; let's talk code. I first had to tap into the content that was sent to the browser. That can only be done at Page render level (or in a PageAdapter, a Response.Filter or anything else that can access the rendered content). So I caught the rendered content in a filter, recognized it as Ajax by its distinctive format, and processed only the updatePanel type of token.

Here I had a few problems. First I replaced the updatePanel token with a scriptBlock token that changed the innerHTML of the update panel div element. It all seemed to work until I tested it a little. I discovered that the _updatePanel Javascript method of the PageRequestManager object, used by the normal Ajax rendering in the browser, was doing a few extra things, so I called that one instead of just replacing the innerHTML, at the cost of some speed. But that didn't help either, because it failed when using validators. Even if I did replace the updatePanel token with a correct Javascript block, it still got executed a bit later than it should have.

The only solution I had was to replace the _updatePanel method with my own. The original consists of a small block of code that disposes some script blocks and some other stuff, followed by a plain innerHTML replace. I could not simply 'override' it and call the original, since it would set the innerHTML to some meaningless stuff (the thing I would send from the server), then I would have to parse that and change the innerHTML again, resulting in bad performance, flickering, nonsense on screen, etc. So I just copy-pasted the first part and put my own ApplyPatch function in place of the line that replaces the HTML.

Now, here I met another issue. The innerHTML property of an HTML element is not a simple string. It gets parsed immediately when set and it is regenerated when read, as explained in this previous article of mine. The only solution for that was to create my own property on the update panel div element, one that remembers the string that was set. This solved a lot more problems, because it meant I could identify the content to be replaced by simple position markers rather than through regular expressions (as was my initial idea). That property would not get changed by custom local Javascript either, so I was safe to use it.
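As an illustration of what patching by position markers can look like, here is a minimal sketch. It is not the control's actual ApplyPatch code: the fragment format { start, end, text } and the applyPatch name are assumptions made for the example. Each fragment replaces the characters from start up to (but not including) end in the remembered string, and fragments are assumed not to overlap.

```javascript
// Applies patch fragments to the remembered string. Offsets refer to the
// ORIGINAL string, so fragments are applied from the last position to the
// first; that way earlier offsets stay valid while the string changes.
// The fragment shape { start, end, text } is an assumption of this sketch.
function applyPatch(cached, fragments) {
  var sorted = fragments.slice().sort(function (a, b) {
    return b.start - a.start;
  });
  var result = cached;
  for (var i = 0; i < sorted.length; i++) {
    var fragment = sorted[i];
    result = result.slice(0, fragment.start) + fragment.text + result.slice(fragment.end);
  }
  return result;
}

var cached = '<td>1</td><td>2</td><td>3</td>';
var updated = applyPatch(cached, [
  { start: 4, end: 5, text: 'one' },
  { start: 24, end: 25, text: 'three' }
]);
console.log(updated); // <td>one</td><td>2</td><td>three</td>
```

In the browser, the resulting string would then be assigned to both innerHTML and the custom property, so that the next patch is applied to exactly the text the server remembers.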

About the regular expression engine in Javascript: it has no Singleline option. That means you can only change the content line by line. I could have used Steve Levithan's beautiful work, but with the solution found above, I would not need regular expressions at all.

The only other issue was with UpdatePanels inside UpdatePanels. I found out that in this case only the parent UpdatePanels are rendered. That meant that the custom property I added to the child panel would disappear and break my code. Therefore I had to keep a tree of the UpdatePanels in the page and clear the cached content of all the children when their parents were updated. So I did that, too.
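The bookkeeping for nested panels can be sketched as follows. This is only an illustration under stated assumptions: the panel ids, the parentOf map (panel id to parent panel id, or null) and the clearDescendantCaches helper are all hypothetical, and the parent map is assumed to contain no cycles.

```javascript
// Removes the cached content of every panel nested (at any depth) inside
// the panel that has just been updated. Returns the ids that were cleared.
function clearDescendantCaches(panelId, parentOf, cache) {
  var cleared = [];
  for (var id in parentOf) {
    if (!Object.prototype.hasOwnProperty.call(parentOf, id)) {
      continue;
    }
    // Walk up the ancestors of `id`; if we meet panelId, `id` is nested in it.
    var ancestor = parentOf[id];
    while (ancestor) {
      if (ancestor === panelId) {
        delete cache[id];
        cleared.push(id);
        break;
      }
      ancestor = parentOf[ancestor];
    }
  }
  return cleared.sort();
}

// Hypothetical page: "details" sits inside "grid", which sits inside "main".
var parentOf = { main: null, grid: 'main', details: 'grid', sidebar: null };
var cache = { main: 'main html', grid: 'grid html', details: 'details html', sidebar: 'sidebar html' };
var cleared = clearDescendantCaches('main', parentOf, cache);
console.log(cleared);            // [ 'details', 'grid' ]
console.log(Object.keys(cache)); // [ 'main', 'sidebar' ]
```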

What else? What if somehow the property got deleted or changed, or something else happened, like someone deciding to recreate the update panel div object? For that I made a little HttpHandler that receives an UpdatePanel id and clears its cached content. Then, on return from the asynchronous call, the Javascript just pushes another update panel refresh using __doPostBack(updatePanelId,""). I don't particularly like this approach, since it could backfire with multiple UpdatePanels (as you know, only one postback at a time is supported), but I haven't found a better solution yet. Besides, this event should normally not happen.

So, the mechanism was all in place; all I had to do was build the actual patching mechanism, the one that would find the difference between the previously rendered content and the current content, then send only the changed part. The first thing I did was remove the start and end of the strings that were identical. As you can imagine, that's the most common scenario: a change in the UpdatePanel means all the content up to the change remains unchanged, and the same goes for the content after the change. But I was testing the filter with a large grid that would randomly change one cell to a random string. That meant two changes per update: the cell that had been changed previously and the newly changed one. If the first change was in one of the starting cells and the last was in one of the cells at the end, the compression would be meaningless. So I googled for an algorithm that would give me the difference between two files/strings and I found Diff! Well, I knew about it, so I actually googled for Diff :) It was in the Longest Common Substring algorithm category.
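The first step, trimming the identical start and end, can be sketched like this. It is a minimal illustration rather than the filter's real code: trimCommonEnds and the returned { start, end, text } shape are names chosen for the example, where start and end are positions in the previous string and text is what replaces that range.

```javascript
// Strips the common prefix and suffix of two strings and describes what is
// left as a single replacement: characters [start, end) of `previous` are
// to be replaced with `text` in order to obtain `current`.
function trimCommonEnds(previous, current) {
  var max = Math.min(previous.length, current.length);
  var prefix = 0;
  while (prefix < max && previous.charAt(prefix) === current.charAt(prefix)) {
    prefix++;
  }
  var suffix = 0;
  // The suffix must not overlap the prefix, hence the max - prefix limit.
  while (suffix < max - prefix &&
         previous.charAt(previous.length - 1 - suffix) ===
         current.charAt(current.length - 1 - suffix)) {
    suffix++;
  }
  return {
    start: prefix,
    end: previous.length - suffix,
    text: current.slice(prefix, current.length - suffix)
  };
}

var change = trimCommonEnds('<td>old</td>', '<td>new value</td>');
console.log(change); // { start: 4, end: 7, text: 'new value' }
```

With a single changed cell this already sends almost nothing. With two changes far apart, everything between them ends up in text, which is exactly the weakness described above.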

Anyway, the algorithm was nice, clear, explained, with code, perfect for what I wanted and completely useless, since it needed an array of m*n elements to get what I needed. It was slow, a complete memory hog, and I couldn't possibly use an array of 500000x500000. I bet there were optimizations that covered this problem, but I was miserable, so I just patched up my own messy algorithm. What does it do? It randomly selects a 100-character string from the current content and searches for it in the previous content. If it finds it, it expands the selection, considers it a Reasonably Long Common Substring and works recursively from there on. If it doesn't find it, it tries a few more randomly chosen strings, then gives up. Well, it actually is the same algorithm, made messy and with no extra memory requirements.
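The probing idea can be sketched in a few lines. This is a reconstruction from the description above, not the filter's actual code: findCommonBlock, diffByProbing, the { start, end, text } fragment shape and the injectable random parameter are all assumptions made for the example, and probeLength is assumed to be at least 1.

```javascript
// Probes `previous` with random pieces of `current`. On a hit, the match is
// expanded in both directions and returned; after `attempts` misses, null.
function findCommonBlock(previous, current, probeLength, attempts, random) {
  if (previous.length < probeLength || current.length < probeLength) {
    return null;
  }
  for (var t = 0; t < attempts; t++) {
    var curStart = Math.floor(random() * (current.length - probeLength + 1));
    var probe = current.slice(curStart, curStart + probeLength);
    var prevStart = previous.indexOf(probe);
    if (prevStart < 0) {
      continue; // probe not found, try another random position
    }
    var length = probeLength;
    // Grow the match to the left...
    while (curStart > 0 && prevStart > 0 &&
           current.charAt(curStart - 1) === previous.charAt(prevStart - 1)) {
      curStart--;
      prevStart--;
      length++;
    }
    // ...and to the right.
    while (curStart + length < current.length &&
           prevStart + length < previous.length &&
           current.charAt(curStart + length) === previous.charAt(prevStart + length)) {
      length++;
    }
    return { prevStart: prevStart, curStart: curStart, length: length };
  }
  return null;
}

// Recursively splits both strings around common blocks. Whatever cannot be
// matched becomes a fragment that replaces [start, end) of the previous
// content with `text`. `offset` tracks the position inside the full string.
function diffByProbing(previous, current, probeLength, attempts, random, offset, out) {
  offset = offset || 0;
  out = out || [];
  var block = findCommonBlock(previous, current, probeLength, attempts, random);
  if (!block) {
    if (previous !== current) {
      out.push({ start: offset, end: offset + previous.length, text: current });
    }
    return out;
  }
  var prevEnd = block.prevStart + block.length;
  var curEnd = block.curStart + block.length;
  diffByProbing(previous.slice(0, block.prevStart), current.slice(0, block.curStart),
                probeLength, attempts, random, offset, out);
  diffByProbing(previous.slice(prevEnd), current.slice(curEnd),
                probeLength, attempts, random, offset + prevEnd, out);
  return out;
}

// Usage: 100-character probes, 5 attempts per level, standard random source.
var fragments = diffByProbing('previous content', 'current content', 100, 5, Math.random);
```

Because the probes are random, two runs may produce different fragment lists for the same input. Whichever list comes out, replacing the marked ranges of the previous content with the fragment texts yields the current content; the only thing luck affects is how small the fragments are.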

It worked far better than I had expected, even if it clearly could have used some optimizations. The code favoured clarity at the expense of speed, but it still worked acceptably fast, with no noticeable overhead. After a few tweaks and fixes, the class was ready. I tested it on a project we are working on, a big project with some complex pages, and it worked like a charm.

One particular page used a control I have made that allows grid rows and columns to have children that can be shown or hidden at will. When collapsing a column (which means that every row gets some cells removed), the compressed size was still above 50%, in up to 100 patch fragments. When collapsing a row, meaning some content from the middle of the grid would just vanish, the size went down to 2%! Of course, putting the ViewState into the session also helped. Gzip compression on the server would complement this nicely, shrinking the output even more.

So, I have demonstrated incredible compression of the UpdatePanel content sent through the network with something as small and reusable as a response filter that can be added once in the master page. You could use it for customers that have network bandwidth issues or for sites that pay for outgoing traffic. It would work with sites made with one big UpdatePanel placed in the MasterPage as well :).

If you want to use it in your sites, please let me know how it performs and what problems you've encountered.

Super Fast and Accurate string distance algorithm: Sift3

Published Apr 4, 2007

Update November 2014: Sift4 is here!! Check out the new improved version of Sift here: Super Fast and Accurate string distance algorithm: Sift4

Update June 25th 2013: I've decided to play a little with the suggestions in the comments and check them for validity. This was spurred by the realization that a lot of people use my algorithm. So, in order to celebrate this, here is the "3B" version of the Sift3 algorithm.
It is written in Javascript this time, as it was easier to test, and it has the following extra features:

a maxDistance value that tells the algorithm to stop if the strings are already too different
two pointers c1 and c2, rather than a single pointer c and two offsets
instead of dividing the total length of the compared strings by 2, I now divide it by 1.5. Why? Because this way the value is closer to the Levenshtein distance computed on random strings

~~Happy usage!~~ The variant I posted was totally buggy. I removed it. Just use sift3Distance.

Update: the Sift algorithm is now on Github.

A while ago I wrote an entry here about Sift2, an improvement of Sift, the original and silly string distance algorithm. Now I am publishing Sift3, which is way more accurate and even simpler as an algorithm.

I found out that my algorithm is part of a class of algorithms that solve the Longest Common Substring problem, therefore I calculated the LCS, not the distance, then the distance from the LCS. The result is way more robust, easy to understand and closer to the Levenshtein algorithm both on random strings and user databases. Not to mention that there is no goto in this one.

BTW, if you are looking for an algorithm that detects switched words, this is not it :) This just looks for typos and small regional differences between the strings. I mean, you could normalize the strings, so that words are ordered by some mechanism, then it would work because the words wouldn't be switched :)
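A minimal sketch of such a normalization, assuming that words are separated by whitespace and that case does not matter (both are assumptions of this example, as is the normalizeWordOrder name):

```javascript
// Hypothetical helper: lowercases the text, splits it into words on
// whitespace, sorts the words and joins them back with single spaces.
function normalizeWordOrder(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(function (word) { return word.length > 0; })
    .sort()
    .join(' ');
}

console.log(normalizeWordOrder('Smith  John')); // john smith
console.log(normalizeWordOrder('john smith'));  // john smith
```

Both strings would be normalized this way before being passed to the distance function, so that a switched word order no longer counts as a difference.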

I promise to work on a word switching algorithm, but not in the near future.
Without further ado, here is the code:

The C# code is a method that takes the two strings and a maxOffset parameter. As in Sift2, maxOffset should be around 5; it represents the range in which to try to find a missing character.

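public float Distance(string s1, string s2, int maxOffset)
{
    if (String.IsNullOrEmpty(s1))
    {
        return String.IsNullOrEmpty(s2) ? 0 : s2.Length;
    }
    if (String.IsNullOrEmpty(s2))
    {
        return s1.Length;
    }
    int c = 0;        // current position in both strings
    int offset1 = 0;  // how far ahead of c we read in s1
    int offset2 = 0;  // how far ahead of c we read in s2
    int lcs = 0;      // number of matching characters found so far
    while ((c + offset1 < s1.Length) && (c + offset2 < s2.Length))
    {
        if (s1[c + offset1] == s2[c + offset2])
        {
            lcs++;
        }
        else
        {
            // mismatch: reset the offsets, then look up to maxOffset
            // characters ahead in either string for a matching character
            offset1 = 0;
            offset2 = 0;
            for (int i = 0; i < maxOffset; i++)
            {
                if ((c + i < s1.Length) && (s1[c + i] == s2[c]))
                {
                    offset1 = i;
                    break;
                }
                if ((c + i < s2.Length) && (s1[c] == s2[c + i]))
                {
                    offset2 = i;
                    break;
                }
            }
        }
        c++;
    }
    // divide by 2f: integer division would truncate odd total lengths
    return (s1.Length + s2.Length) / 2f - lcs;
}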

And here is the T-SQL code. This version is actually an improvement of my original source, graciously provided by Todd Wolf:

CREATE FUNCTION [DBO].[Sift3distance2]
(
    @s1 NVARCHAR(3999),@s2 NVARCHAR(3999),@maxOffset INT
)
RETURNS FLOAT
AS
BEGIN
    DECLARE @s1LEN INT,@s2LEN INT

    SELECT @s1LEN=Len(Isnull(@s1,'')),@s2LEN=Len(Isnull(@s2,''))

    IF @s1LEN=0 RETURN @s2LEN
    ELSE
    IF @s2LEN=0 RETURN @s1LEN

    IF Isnull(@maxOffset,0)=0 SET @maxOffset=5

    DECLARE @currPos INT,@matchCnt INT,@wrkPos INT,@s1Offset INT,@s1Char VARCHAR,@s1Pos INT,@s1Dist INT,@s2Offset INT,@s2Char VARCHAR,@s2Pos INT,@s2Dist INT

    SELECT @s1Offset=0,@s2Offset=0,@matchCnt=0,@currPos=0

    WHILE(@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
    BEGIN
        SET @wrkPos=@currPos+1

        IF(Substring(@s1,@wrkPos+@s1Offset,1)=Substring(@s2,@wrkPos+@s2Offset,1)) SET @matchCnt=@matchCnt+1
        ELSE
        BEGIN
            SET @s1Offset=0

            SET @s2Offset=0

            SELECT @s1Char=Substring(@s1,@wrkPos,1),@s2Char=Substring(@s2,@wrkPos,1)

            SELECT @s1Pos=Charindex(@s2Char,@s1,@wrkPos)-1,@s2Pos=Charindex(@s1Char,@s2,@wrkPos)-1

            SELECT @s1Dist=@s1Pos-@currPos,@s2Dist=@s2Pos-@currPos

            IF(@s1Pos>0 AND (@s1Dist<=@s2Dist OR @s2Pos<1) AND @s1Dist<@maxOffset) SET @s1Offset=(@s1Pos-@wrkPos)+1
            ELSE
            IF(@s2Pos>0 AND (@s2Dist<@s1Dist OR @s1Pos<1) AND @s2Dist<@maxOffset) SET @s2Offset=(@s2Pos-@wrkPos)+1
        END

        SET @currPos=@currPos+1
    END

    RETURN(@s1LEN+@s2LEN)/2.0-@matchCnt
END

It doesn't give exactly the same results as my own code, yet the result is close enough and the speed is about 20% higher.

And thanks to Diogo Nechtan, the version in PHP:

function sift3Plus($s1, $s2, $maxOffset) {
    $s1 = (string) $s1;
    $s2 = (string) $s2;
    $s1Length = strlen($s1);
    $s2Length = strlen($s2);
    // compare lengths instead of using empty(), which also treats "0" as empty
    if ($s1Length === 0) {
        return $s2Length;
    }
    if ($s2Length === 0) {
        return $s1Length;
    }
    $c1 = $c2 = $lcs = 0;

    while (($c1 < $s1Length) && ($c2 < $s2Length)) {
        // square brackets: the $s{$i} offset syntax was removed in PHP 8
        if ($s1[$c1] === $s2[$c2]) {
            $lcs++;
        } else {
            for ($i = 1; $i < $maxOffset; $i++) {
                if (($c1 + $i < $s1Length) && ($s1[$c1 + $i] === $s2[$c2])) {
                    $c1 += $i;
                    break;
                }
                if (($c2 + $i < $s2Length) && ($s1[$c1] === $s2[$c2 + $i])) {
                    $c2 += $i;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
    }
    return (($s1Length + $s2Length) / 2 - $lcs);
}

And thanks to Fernando Jorge Mota, the version in Python:

Also, here is the Javascript version, used in Mailcheck, by Derrick Ko and Wei Lu.

function sift3Distance(s1, s2) {
    if (s1 == null || s1.length === 0) {
        if (s2 == null || s2.length === 0) {
            return 0;
        } else {
            return s2.length;
        }
    }

    if (s2 == null || s2.length === 0) {
        return s1.length;
    }

    var c = 0;
    var offset1 = 0;
    var offset2 = 0;
    var lcs = 0;
    var maxOffset = 5;

    while ((c + offset1 < s1.length) && (c + offset2 < s2.length)) {
        if (s1.charAt(c + offset1) == s2.charAt(c + offset2)) {
            lcs++;
        } else {
            offset1 = 0;
            offset2 = 0;
            for (var i = 0; i < maxOffset; i++) {
                if ((c + i < s1.length) && (s1.charAt(c + i) == s2.charAt(c))) {
                    offset1 = i;
                    break;
                }
                if ((c + i < s2.length) && (s1.charAt(c) == s2.charAt(c + i))) {
                    offset2 = i;
                    break;
                }
            }
        }
        c++;
    }
    return (s1.length + s2.length) / 2 - lcs;
}

~~Another implementation, this time in Java, by Eclesia:~~

You might also be interested in a customised version in AutoKey, by Toralf:

Thanks all for your comments and I look forward to more. Just tell me whether it worked or not and, most important, why. Good luck!