Update 17 June 2016: I've changed the focus of the extension to simply alter the appearance of stories based on their type, so that stories with original content are highlighted over simple shares. I am currently working on another extension that is more adaptive, but it will be branded differently.

Update 27 May 2016: I've published a very early draft of the extension because it already does one cool thing: it puts original content in the foreground and shrinks the reposts, photo uploads, "feeling" shares and all that. You may find and install the extension here.

Have you ever wanted to decrease the spam in your Facebook feed, but couldn't do it in any way that would not make you miss important posts? I mean, even if you categorize all your contacts into good friends, close friends, relatives and acquaintances, then unfollow the ones that really spam too much and hide all the posts that you don't like, you still have no control over how Facebook decides to order what you see on the page. Worse than that, refresh your Facebook page repeatedly and watch the wildly oscillating results: posts appear, disappear and reorder themselves. It's a mess.

Well, true to my word, I have started work on a Chrome extension to help me with this. My plan is pretty complicated, so before I publish the extension on the Chrome Web Store, like I did with my previous two efforts, I will publish it on GitHub while I am still working on it. So, depending on where I am, this might be alpha, beta or stable. At the moment of this writing - the first commit - alpha is a pretty big word.

Here is the plan for the extension:
  1. Detect the user has opened the Facebook page
  2. Inject jQuery and extension code into the page
  3. Detect any post as it appears on the page
  4. Extract as many features as possible
  5. Allow the user to create categories for posts
  6. Allow the user to drag posts into categories or out of them
  7. Use AI to determine the category a post most likely belongs to
  8. Alternatively, let the user create their own filters, a la Outlook
  9. Show a list of categories (as tabs, perhaps) and hide all posts under the respective categories
This way one might skip the annoying posts, based on personal preferences, while still enjoying the interesting ones. At the time of this writing - the first draft - the extension only works on https://www.facebook.com, not on any subpages; it extracts the type of each post and sets a CSS class on it. It also injects a CSS file which makes posts dimmer and smaller based on category. Mouse over a post to restore the normal size and opacity.
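As a rough illustration of the approach (the selectors and class names here are made up for the sketch, not the extension's actual code), the injected script might categorize posts and tag them like this:

```javascript
// Guess a post's category from its text content (heuristics are illustrative)
function categorizePost(post) {
    var text = post.textContent || '';
    if (/shared a (link|photo|video)/i.test(text)) return 'share';
    if (/is feeling/i.test(text)) return 'feeling';
    return 'original';
}

// Tag every post element with a class the injected CSS can target
function processPosts(container) {
    var posts = container.querySelectorAll('[role="article"]');
    for (var i = 0; i < posts.length; i++) {
        posts[i].className += ' fbpp-' + categorizePost(posts[i]);
    }
}
```

The CSS then only has to style `.fbpp-share`, `.fbpp-feeling` and so on, dimming and shrinking whatever categories you care less about.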

How to make it work for you:
  1. In Chrome, go to Manage Extensions (chrome://extensions/)
  2. Click on the Developer Mode checkbox
  3. Click on the Load unpacked extension... button
  4. Select a folder where you have downloaded the source of this extension
  5. Open a new tab and load Facebook there
  6. You should see the posts getting smaller and dimmer based on category.
Change statusProcessor.css to suit your own preferences (you may hide posts altogether, change the background color, etc.).
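For instance, assuming the extension sets per-category classes along these lines (the names are illustrative, check statusProcessor.css for the real ones), overrides could look like:

```css
/* dim and shrink simple shares, restore them on hover */
.fbpp-share { opacity: 0.3; transform: scale(0.6); }
.fbpp-share:hover { opacity: 1; transform: none; }

/* or hide a category altogether */
.fbpp-feeling { display: none; }
```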

As usual, please let me know what you think and contribute with code and ideas.

I've written another Chrome extension that I consider to be in beta, but so far it works. Really ugly makeshift code, but I am now gathering data about the way I will use it, then I am going to refactor it, just as I did with Bookmark Explorer. You may find the code on GitHub and the extension in the Chrome Web Store.

This is how it works: every time you access anything with the browser, the extension remembers the IPs for any given host. It holds a list of these IPs, in reverse order (the last one first), that you can just copy and paste into your hosts file. The hosts file is found at C:\Windows\System32\drivers\etc\hosts on Windows and at /etc/hosts on Linux. Once you add a line in the format "IP host" to it, the computer will resolve that host with the provided IP. Every time there is a problem with DNS resolution, the extension adds the latest known IP to the hosts text. Since the extension doesn't have access to your hard drive, you need to edit the file yourself. The icon of DNS Resolver shows the number of hosts that it wants to resolve locally, or nothing if everything is OK.
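For example, if the extension reports 93.184.216.34 as the last known IP for example.com (values purely illustrative), the lines to paste into the hosts file would look like:

```
# lines in the "IP host" format
93.184.216.34    example.com
93.184.216.34    www.example.com
```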

The extension allows manual selection of an IP for a host and forced inclusion in or exclusion from the list of IP/host lines. Data can be erased as well (all at once, for now). The extension does not communicate with the outside, but it does store a list of all the domains you visit, so it is a slight privacy risk - although if someone has access to the local storage of a browser extension, it's already too late. The extension could also replace the host with the IP directly in browser requests, but this only works inside the browser and fails whenever the host name is important, as in the case of multiple servers using the same IP, so I don't recommend using it.

There are two scenarios for which this extension is very useful:
  • The DNS server fails for some reason or gives you a wrong IP
  • Someone removed the IP address from DNS servers or replaced it with one of their own, as in the case of government censorship

I have some ideas for the future:
  • Sharing of working IP/host pairs - have to think of privacy before that, though
  • Installing a local DNS server that can communicate locally with the extension, so no more hosts editing - have to research and create one
  • Upvoting/downvoting/flagging shared pairs - with all the horrible headache this comes with

As usual, let me know what you think here, or open issues on GitHub.

I have started writing Chrome extensions, mainly to address issues that my browser is not solving, like opening dozens of tabs and, lately, DNS errors/blocking and ad blocking. My code writing process is chaotic at first: just writing stuff and changing it until things work, until I get to something I feel is stable. Then I feel the need to refactor the code, organizing and cleaning it and, why not, unit testing it. This raises the question of how to do that in Javascript and, even though I knew it once, I needed to refresh my understanding with new work. Without further ado: QUnit, a Javascript testing framework. Note that all code here will be in ES5 or earlier, mainly because I have not studied ES6 and I want this to work with most Javascript engines.

QUnit


QUnit is something that has withstood the test of time. It was first launched in 2008, but even now it is easy to use with simple design and clear documentation. Don't worry, you can use it even without jQuery. In order to use it, create an HTML page that links to the Javascript and CSS files from QUnit, then create your own Javascript file containing the tests and add it to the page together with whatever you are testing.

Already this raises the issue of having Javascript code that can be safely embedded in a random web page, so consider how you may encapsulate the code. Other testing frameworks can run the code in a headless Javascript engine, so if you want to be as generic as possible, also remove all dependencies on an existing web page. The oldest and simplest way of doing this is to use the fact that an orphan function in Javascript has its own scope and, in non-strict mode, has this pointing to the global object - in the case of a web page, window. So instead of something like:
i = 0;
while (i < +(document.getElementById('inpNumber').value)) {
    i++;
    // do something
}
do something like this:
(function () {

    var global = this;

    var i = 0;
    while (i < +(global.document.getElementById('inpNumber').value)) {
        i++;
        // do something
    }

})();

It's a silly example, but it does several things:
  • It keeps variable i in the scope of the anonymous function, thus keeping it from interfering with other code on the page
  • It clearly defines a global object, which in case of a web page is window, but may be something else
  • It uses global to access any out of scope values

In this particular case, there is still a dependency on the default global object, but if instead one would pass the object somehow, it could be abstracted and the only change to the code would be the part where global is defined and acquired.
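For example, here is a minimal sketch of that abstraction (the names are mine, not from any framework): the environment is received as a parameter, so a test can substitute a mock without any web page at all:

```javascript
// Instead of relying on `this`, receive the environment as a parameter
function countInputs(env) {
    // env stands in for window; in production you would pass window itself
    return env.document.getElementsByTagName('input').length;
}

// In a unit test, a hand-made mock is enough:
var mockEnv = {
    document: {
        getElementsByTagName: function () { return [{}, {}]; }
    }
};
```

Calling `countInputs(mockEnv)` then exercises the logic with no dependency on a real page.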

Let's start with QUnit. Here is a Hello World kind of thing:
QUnit.test("Hello World", function (assert) {
    assert.equal(1 + 1, 2, "One plus one is two");
});
We put it in 'tests.js' and include it into a web page that looks like this:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width">
    <title>Unit Tests</title>
    <link rel="stylesheet" href="https://code.jquery.com/qunit/qunit-1.23.1.css">
</head>
<body>
    <script src="https://code.jquery.com/qunit/qunit-1.23.1.js"></script>
    <div id="qunit"></div>
    <div id="qunit-fixture"></div>

    <script src="tests.js"></script>
</body>
</html>

The result:


As you can see, we declare a test with the static QUnit.test function, which receives a name and a function as parameters. Within the function, the assert object will do everything we need, mainly checking whether a result conforms to an expected value or a block throws an exception. I will not go through a detailed explanation of simple uses like that. If you are interested, peruse the QUnit site for tutorials.

Modules


What I want to talk about are slightly more advanced scenarios. The first thing I want to address is the concept of modules. If we just declare all the tests, regardless of how many scripts they are spread across, the test page will list them one after another, in a huge blob. In order to somehow separate them into regions, we need modules. Here is another example:
QUnit.module("Addition");
QUnit.test("One plus one", function (assert) {
    assert.equal(1 + 1, 2, "One plus one is two");
});

QUnit.module("Multiplication");
QUnit.test("Two by two", function (assert) {
    assert.equal(2 * 2, 4, "Two by two is four");
});
resulting in:


It may look the same, but a Module: dropdown has appeared, allowing one to choose which module to test or visualize. The names of the tests also include the module name. Unfortunately, the resulting HTML doesn't have containers for modules, something one could collapse or expand at will. That is too bad, but it can easily be fixed - that is beyond the scope of this post, though. A good strategy is simply to put all related tests in the same Javascript file and use QUnit.module as the first line.

Asynchronicity


Another interesting issue is asynchronous testing. If we want to test functions that return asynchronously, like setTimeout or ajax calls or Promises, then we need to use assert.async. Here is an example:
QUnit.config.testTimeout = 1000;
QUnit.module("Asynchronous tests");
QUnit.test("Called after 100 milliseconds", function (assert) {
    var a = assert.async();
    setTimeout(function () {
        assert.ok(true, "Assertion was called from setTimeout");
        a();
    }, 100);
});

First of all, we needed to declare that we expect a result asynchronously, therefore we call assert.async() and hold a reference to the result. The result is actually a function. After we make all the assertions on the result, we call that function in order to finish the test. I've added a line before the test, though, which sets the testTimeout configuration value. Without it, an async test that fails would freeze the test suite indefinitely. You can easily test this by setting testTimeout to less than the setTimeout duration.

Asynchronous tests raise several questions, though. The example above is all nice and easy, but what about cases when the test is more complex, with multiple asynchronous code blocks that follow each other, like a Promise chain? What if the assertions themselves need to be called asynchronously, like when checking for the outcome of a click handler? If you run jQuery(selector).click() an immediately following assertion would fail, since the click handler is executed in another context, for example. One can imagine code like this, but look how ugly it is:
QUnit.test("Called after 500 milliseconds", function (assert) {
    var a = assert.async();
    setTimeout(function () {
        assert.ok(true, "First setTimeout");
        setTimeout(function () {
            assert.ok(true, "Second setTimeout");
            setTimeout(function () {
                assert.ok(true, "Third setTimeout");
                setTimeout(function () {
                    assert.ok(true, "Fourth setTimeout");
                    a();
                }, 100);
            }, 100);
        }, 100);
    }, 100);
    setTimeout(function () {
        assert.notOk(true, "Test timed out");
    }, 500);
});

In order to solve at least this arrow antipattern I've created a stringFunctions function that looks like this:
function stringFunctions() {
    if (!arguments.length)
        throw 'needs functions as parameters';
    var f = function () {};
    var args = arguments;
    for (var i = args.length - 1; i >= 0; i--) {
        (function () {
            var x = i;
            var func = args[x];
            if (typeof(func) != 'function')
                throw 'parameter ' + x + ' is not a function';
            var prev = f;
            f = function () {
                setTimeout(function () {
                    func();
                    prev();
                }, 100);
            };
        })();
    }
    f();
}
which makes the previous code look like this:
QUnit.test("Called after 500 milliseconds", function (assert) {
    var a = assert.async();
    stringFunctions(function () {
        assert.ok(true, "First setTimeout");
    }, function () {
        assert.ok(true, "Second setTimeout");
    }, function () {
        assert.ok(true, "Third setTimeout");
    }, function () {
        assert.ok(true, "Fourth setTimeout");
    }, a);
    setTimeout(function () {
        assert.notOk(true, "Test timed out");
    }, 500);
});

Of course, this is a specific case, but at least in a very common scenario - the one when the results of event handlers are checked - stringFunctions with 1ms instead of 100ms is very useful. Click on a button, see if a checkbox is available, check the checkbox, see if the value in a span has changed, stuff like that.

Testing average jQuery web code


Another thing I want to address is how to test Javascript that is intended as a web page companion script, with jQuery manipulations of the DOM, event listeners and all that. Ideally, all of this would be stored in some sort of object that is instantiated with parameters specifying the test context, the various mocks and so on. Since it is not an ideal world, I want to show you a way to test a typical such script: one that executes a function at DOM-ready and does everything in it. Here is an example:
$(function () {

    $('#btnSomething').click(function () {
        $('#divSomethingElse').empty();
    });

});
The code assumes $ is jQuery, then adds a click handler to a button that empties another element. Think about how this should be tested:
  1. Declare a QUnit test
  2. In it, execute the script
  3. Then make some assertions

I was a bit lazy and changed the scripts themselves to check if a testContext exists and use that one. Something like this:
(function ($) {

    var global = this;
    var jQueryContext = global.testContext && global.testContext.document ? global.testContext.document : global.document;
    var chrome = global.testContext && global.testContext.chrome ? global.testContext.chrome : global.chrome;
    // etc.

    $(function () {

        $('#btnSomething', jQueryContext).click(function () {
            $('#divSomethingElse', jQueryContext).empty();
        });

    });

})(jQuery);
which has certain advantages. First, it makes you aware of all the uses of jQuery in the code, yet it doesn't force you to declare everything in an object and refactor everything. Funny how you need to refactor the code in order to write unit tests in order to be able to refactor the code; automated testing gets like that. It also solves some problems with testing Javascript offline - directly from the file system - because all you need to do now is define the testContext, then load the script by creating a script tag in the testing page and setting its src attribute:
var script = document.createElement('script');
script.onload = function () {
    // your assertions here
};
script.src = "http://whatever.com/the/script.js";
document.getElementsByTagName('head')[0].appendChild(script);
In this case, even if you are running the page from the filesystem, the script will be loaded and executed correctly. Another, more elegant solution would load the script as a string and execute it inside a closure where jQuery was replaced with something that uses a mock document by default. This means you don't have to change your code at all, but you need to be able to read the script as text, which is impossible from the filesystem; some really messy script tag creation would be needed:
QUnit.test("jQuery script Tests", function (assert) {

    var global = (function () {
        return this;
    })();

    function setIsolatedJquery() {
        global.originalJquery = jQuery.noConflict(true);
        var tc = global.testContext.document;
        global.jQuery = global.$ = function (selectorOrHtmlOrFunction, context) {
            if (typeof(selectorOrHtmlOrFunction) == 'function')
                return global.originalJquery.apply(this, arguments);
            var newContext;
            if (!context) {
                newContext = tc; // if not specified, use the testContext
            } else {
                if (typeof(context) == 'string') {
                    newContext = global.originalJquery(context, tc); // if context is a selector, use it inside the testContext
                } else {
                    newContext = context; // use the one provided
                }
            }
            return global.originalJquery(selectorOrHtmlOrFunction, newContext);
        };
    }

    function restoreJquery() {
        global.jQuery = global.$ = global.originalJquery;
        delete global.originalJquery;
    }

    var a = assert.async();

    global.testContext = {
        document: jQuery('<div><button id="btnSomething">Something</button><div id="divSomethingElse"><span>Content</span></div></div>')
    };

    setIsolatedJquery();

    var script = document.createElement('script');
    script.onload = function () {

        assert.notEqual($('#divSomethingElse').children().length, 0, "SomethingElse has children");
        $('#btnSomething').click();
        setTimeout(function () {
            assert.equal($('#divSomethingElse').children().length, 0, "clicking Something clears SomethingElse");
            restoreJquery();
            a();
        }, 1);
    };
    script.src = "sample.js";
    document.getElementsByTagName('head')[0].appendChild(script);

});

There you have it: an asynchronous test that replaces jQuery with something with an isolated context, loads a script dynamically, performs a click in the isolated context, checks the results. Notice the generic way in which to get the value of the global object in Javascript.

Bottom-Up or Top-Bottom approach


A last point I want to make is more theoretical. After some consultation with a colleague, I've finally cleared up some confusion I had about the direction of automated tests. You see, once you have the code - or even before, in TDD - you know what every small piece of code does and also the final requirements of the product. Where should you start in order to create automated tests?

One solution is to start from the bottom and check that your methods call everything they need to call in the mocked dependencies. If your method calls chrome.tabs.create and you have mocked chrome, your tabs.create mock should count how many times it is called and your assertion should check that the count is 1. This has the advantage of being straightforward, but it also tests details that might be irrelevant. One might refactor the method to call some other API and then the test would fail, as it tested the actual implementation details, not a result. Of course, methods that return the same result for the same input values - usually called pure functions - are perfect for this type of testing.
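As a sketch of this style of test (makeMockChrome and openHelpTab are made-up names for the illustration, not real extension code), a hand-rolled mock simply counts calls:

```javascript
// A minimal mock of the chrome API surface the method under test touches
function makeMockChrome() {
    var calls = { create: 0, lastOptions: null };
    return {
        calls: calls,
        tabs: {
            create: function (options) {
                calls.create++;
                calls.lastOptions = options;
            }
        }
    };
}

// The method under test receives its chrome dependency explicitly
function openHelpTab(chromeApi) {
    chromeApi.tabs.create({ url: 'help.html' });
}

// Bottom-up assertions: the dependency was called exactly once, with the right URL
var mock = makeMockChrome();
openHelpTab(mock);
```

After the call, `mock.calls.create` is 1 and `mock.calls.lastOptions.url` is 'help.html' - exactly the implementation-detail assertions described above.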

Another solution is to start from the requirements and test that the entire codebase does what it is supposed to do. This makes more sense, but the number of possible test cases increases exponentially and it is difficult to spot where the problem lies when a test fails. This would be called acceptance testing.

Well, the answer is: both! It all depends on your budget, of course, as you need to take into consideration not only the writing of the tests, but their maintenance as well. Automated acceptance tests would not need to change a lot, only when requirements change, while unit tests would need to be changed whenever the implementation is altered or new code is added.

Conclusion


I am not an expert on unit testing, so what I have written here describes my own experiments. Please let me know if you have anything to add or to comment. My personal opinion on the matter is that testing provides a measure of confidence that minimizes the stress of introducing changes or refactoring code. It also forces people to think in terms of "how will I test this?" while writing code, which I think is great from the viewpoint of separation of concerns and code modularity. On the other hand it adds a relatively large resource drain, both in writing and (especially) in maintaining the tests. There is also a circular kind of issue where someone needs to test the tests. Psychologically, I also believe automated testing only works for certain people. Chaotic asses like myself like to experiment a lot, which makes testing a drag. I don't even know what I want to achieve and someone tries to push testing down my throat. Later on, though, tests would be welcome, if only my manager allows the time for it. So it is, as always, a matter of logistics.

More info about unit testing with QUnit on their page.

During one revamp of the blog I realized that I didn't have images for some of my posts. I had counted pretty much on the Blogger system, which provides a post.thumbnailUrl metadata value that I can use when displaying the post, but the URL is not always there. Of course, if you have a nice image prominently displayed somewhere in the post, the thumbnail URL will be populated, but what if you have a video? Surprisingly, Blogger has a pretty shitty video-to-thumbnail mechanism, which prompted me to build my own.

So the requirements would be: get me the image representing a video embedded in my page, using Javascript only.

Well, first of all, videos can be actual video tags, but most of the time they are iframe elements coming from a reliable global provider like YouTube, Dailymotion, Vimeo, etc., and all the information available is the URL of the display frame. Here is how to get the thumbnail in these scenarios:

YouTube


Given the iframe src value:

// find youtube.com/embed/[videohash] or youtube.com/embed/video/[videohash]
var m = /youtube\.com\/embed(?:\/video)?\/([^\/\?]+)/.exec(src);
if (m) {
    // the thumbnail url is https://img.youtube.com/vi/[videohash]/0.jpg
    imgSrc = 'https://img.youtube.com/vi/' + m[1] + '/0.jpg';
}

If you have embeds in the old object format, it is best to replace them with the iframe one. If you can't change the content, it remains your job to create the code to give you the thumbnail image.

Dailymotion


Given the iframe src value:

// find dailymotion.com/embed/video/[videohash]
var m = /dailymotion\.com\/embed\/video\/([^\/\?]+)/.exec(src);
if (m) {
    // the thumbnail url is the same URL with `thumbnail` replacing `embed`
    imgSrc = src.replace('embed', 'thumbnail');
}

Vimeo


Vimeo doesn't have a one-URL thumbnail format that I am aware of, but it has a Javascript-accessible API.

// find vimeo.com/video/[videohash]
m = /vimeo\.com\/video\/([^\/\?]+)/.exec(src);
if (m) {
    // set the value to the videohash initially
    imgSrc = m[1];
    $.ajax({
        // call the API video/[videohash].json
        url: 'https://vimeo.com/api/v2/video/' + m[1] + '.json',
        method: 'GET',
        success: function (data) {
            if (data && data.length) {
                // and with the data replace the initial value with the thumbnail_medium value
                replaceUrl(data[0].thumbnail_medium, m[1]);
            }
        }
    });
}

In this example, the replaceUrl function would look for img elements to which the videohash value is attached and replace the url with the correct one, asynchronously.
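The post doesn't show replaceUrl itself; a possible implementation, assuming the videohash was stored on the img elements in a data attribute (the attribute name and the injectable doc parameter are my own choices for the sketch), might be:

```javascript
// Swap the placeholder src of every img tagged with this video hash
// for the real thumbnail URL, once the API call has returned
function replaceUrl(url, videoHash, doc) {
    doc = doc || document; // allow a document to be injected for testing
    var imgs = doc.querySelectorAll('img[data-video-hash="' + videoHash + '"]');
    for (var i = 0; i < imgs.length; i++) {
        imgs[i].setAttribute('src', url);
    }
}
```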

TED


I am proud to announce that I was the one pestering them to make their API available over Javascript.

// find ted.com/talks/[video title].html
m = /ted\.com\/talks\/(.*)\.html/.exec(src);
if (m) {
    // associate the video title with the image element
    imgSrc = m[1];
    $.ajax({
        // call oembed.json?url=frame_url
        url: 'https://www.ted.com/services/v1/oembed.json?url=' + encodeURIComponent(src),
        method: 'GET',
        success: function (data) {
            // set the value of the image element asynchronously
            replaceUrl(removeSchemeForHttpsEnabledSites(data.thumbnail_url), m[1]);
        }
    });
    return false;
}

video tags


Of course there is no API to get the image from an arbitrary video URL, but the standard for the video tag specifies a poster attribute that can describe the static image associated with a video.

// if the element is a video with a poster value
if ($(this).is('video[poster]')) {
    // use it
    imgSrc = $(this).attr('poster');
}

I had this crazy idea that I could make each word on my page come alive. The word "robot" would walk around the page, "explosion" would explode, "rotate" would rotate, color words would appear in the color they represent, no matter how weirdly named, like OliveDrab, Chocolate, Crimson, DeepPink, DodgerBlue and so on; "radioactive" would pulse green, "Siderite" would appear in all its rocking glory, and so on. And so I did!

The library is on GitHub and I urge you to come and bring your own ideas. Every effect that you see there is an addon that can be included or not in your setup.



Also see it directly on GitHub Pages.

Update 29 August 2017 - Version 3.0.4: The extension has been rewritten in EcmaScript6 and tested on Chrome, Firefox and Opera.

Update 03 March 2017 - Version 2.9.3: added a function to remove marketing URLs from all created bookmarks. Enable it in the Advanced settings section. Please let me know of any particular parameters you need purged. So far it removes utm_*, wkey, wemail, _hsenc, _hsmi and hsCtaTracking.

Update 26 February 2017 - Version 2.9.1: added customizing of the URL comparison function. People can choose what makes pages different in general or for specific URL patterns.
Update 13 June 2016: Stable version (2.5.0): added Settings page, Read Later functionality, undelete bookmarks page and much more.
Update 8 May 2016: Rewritten the extension from scratch, with unit testing.
Update 28 March 2016: The entire source code of the extension is now open sourced at GitHub.

Whenever I read my news, I open a bookmark folder containing my favorite news sites, Twitter, Facebook, etc. I then proceed to open new tabs for each link I find interesting, closing the originating tabs when I am done. Usually I end up with 30-60 open tabs. This wreaks havoc on my memory and computer responsiveness. And it's really stupid, because I only need to read them one by one. In the end I decided to fight my laziness and create my first browser extension to help me out.

The extension is published here: Siderite's Bookmark Explorer. What it does is check whether the current page is found in any bookmark folder, then allow you to go forward or backward inside that folder.

So this is my scenario on using it:
  1. Open the sites that you want to get the links from.
  2. Open new tabs for the articles you want to read, YouTube videos you want to watch, etc.
  3. Bookmark all tabs into a folder.
  4. Close all the tabs.
  5. Navigate to the bookmark folder and open the first link.
  6. Read the link, then press the Bookmark Navigator button and then the right arrow. (now added support for context menu and keyboard shortcuts)
  7. If you went too far by mistake, press the left arrow to go back.

OK, let's talk about how I did it. In order to create your own Chrome browser extension you need to follow these steps:

1. Create the folder


Create a folder and put inside it a file called manifest.json. Its possible structure is pretty complex, but let's start with what I used:
{
    "manifest_version": 2,

    "name": "Siderite's Bookmark Explorer",
    "description": "Gives you a nice Next button to go to the next bookmark in the folder",
    "version": "1.0.2",

    "permissions": [
        "tabs",
        "activeTab",
        "bookmarks",
        "contextMenus"
    ],
    "browser_action": {
        "default_icon": "icon.png",
        "default_popup": "popup.html"
    },
    "background": {
        "scripts": ["background.js"],
        "persistent": false
    },
    "commands": {
        "prevBookmark": {
            "suggested_key": {
                "default": "Ctrl+Shift+K"
            },
            "description": "Navigate to previous bookmark in the folder"
        },
        "nextBookmark": {
            "suggested_key": {
                "default": "Ctrl+Shift+L"
            },
            "description": "Navigate to next bookmark in the folder"
        }
    }
}

The manifest version must be 2. You need a name, a description and a version number. Start with something small, like 0.0.1, as you will want to increase the value as you make changes. Also mandatory is the permissions object, which tells the browser what Chrome APIs you intend to use. I've set activeTab there, because I want to know what the active tab is and what its URL is; tabs, because I might want to get a tab by id, and without this permission I wouldn't get info like the URL; bookmarks, because I want to access the bookmarks; and contextMenus, because I want to add items to the page context menu. More on permissions here.

Now, we need to know what the extension should behave like.

If you want to click on it and get a popup that does stuff, you need to specify the browser_action object, where you specify the icon that you want to have in the Chrome extensions bar and/or the popup page that you want to open. If you don't specify this, you get a default button that does nothing on click and presents the standard context menu on right click. You may only specify the icon, though. More on browserAction here.

If you want an extension that reacts to background events, monitors URL changes on the current page or responds to commands, then you need a background page. Here I specify that the page is a Javascript file, but you can add HTML, CSS and other stuff as well. More on background here.

Obviously, the files mentioned in the manifest must be created in the same folder.

The last item in the manifest is the commands object. For each command you define an id, a keyboard shortcut (unfortunately, only 0..9 and A..Z are usable) and a description. In order to respond to commands you need a background page, as shown above.

2. Test the extension


Next you open a Chrome tab and go to chrome://extensions, click on the 'Developer mode' checkbox if it is not checked already and you get a Load unpacked extension button. Click it and point the following dialog to your folder and test that everything works OK.

3. Publish your extension


In order to publish your extension you need a Chrome Web Store account. Go to the Chrome Web Store Developer Dashboard and create one. You will need to pay a one-time $5 fee to open it. I know, it kind of sucks, but I paid it and was done with it.

Next, you need to Add New Item, where you will be asked for a packed extension, which is nothing but the ZIP archive of all the files in your folder.

That's it.

Let's now discuss actual implementation details.

Adding functionality to popup elements


Getting the popup page elements is easy with vanilla Javascript, because we know we are building for only one browser: Chrome! Getting an element is done via document.getElementById(id), for example, and adding functionality via elem.addEventListener(event, handler, false);

One can use the elements as objects directly to set values that are related to those elements. For example my prev/next button functionality takes the URL from the button itself and changes the location of the current tab to that value. Code executed when the popup opens sets the 'url' property on the button object.

Just remember to do it when the popup has finished loading (with document.addEventListener('DOMContentLoaded', function () { /*here*/ }); )
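Putting the above together, the popup wiring might look like this sketch (the element id and URL are illustrative, and the dependencies are passed in as parameters, which also keeps the function testable with mocks):

```javascript
// Wire the popup's Next button: the popup-open code stores the target URL
// directly on the element, the click handler navigates the current tab to it
function wirePopup(doc, chromeApi, tab) {
    var btn = doc.getElementById('btnNext');
    btn.url = 'https://example.com/next-bookmark'; // set when the popup opens
    btn.addEventListener('click', function () {
        chromeApi.tabs.update(tab.id, { url: btn.url });
    }, false);
}
```

In the real popup you would call `wirePopup(document, chrome, currentTab)` inside the DOMContentLoaded handler.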

Getting the currently active tab


All the Chrome APIs are asynchronous, so the code is:
chrome.tabs.query({
    'active': true,
    'lastFocusedWindow': true
}, function (tabs) {
    var tab = tabs[0];
    if (!tab) return;
    // do something with tab
});

More on chrome.tabs here.

Changing the URL of a tab


chrome.tabs.update(tab.id, {
    url: url
});

Changing the icon in the Chrome extensions bar


if (chrome.browserAction) chrome.browserAction.setIcon({
    path : {
        '19' : 'anotherIcon.png'
    },
    tabId : tab.id
});

The icons are 19x19 PNG files. browserAction may not be available if it is not declared in the manifest.

Get bookmarks


Remember you need the bookmarks permission in order for this to work.
chrome.bookmarks.getTree(function (tree) {
    //do something with bookmarks
});

The tree is an array of items that have title and url or children. The first tree array item is the Bookmarks Bar, for example. More about bookmarks here.
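To make that shape concrete, here is a small sketch that flattens such a tree into a plain list of title/url pairs. The function name and the sample data are mine; only the title/url/children shape comes from the API:

```javascript
// Sketch: flatten a bookmarks tree (nodes with 'title' plus either 'url' or
// 'children') into a simple list of {title, url} pairs.
function flattenBookmarks(nodes) {
    var result = [];
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        if (node.url) {
            result.push({ title: node.title, url: node.url });
        }
        if (node.children) {
            result = result.concat(flattenBookmarks(node.children));
        }
    }
    return result;
}

// In the extension you would call it from the asynchronous API:
// chrome.bookmarks.getTree(function (tree) { var all = flattenBookmarks(tree); });

// Sample tree, mimicking the structure Chrome returns:
var sampleTree = [{
    title: '',
    children: [{
        title: 'Bookmarks Bar',
        children: [
            { title: 'Example', url: 'http://example.com/' }
        ]
    }]
}];
console.log(flattenBookmarks(sampleTree)); // [ { title: 'Example', url: 'http://example.com/' } ]
```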

Hooking to Chrome events


chrome.tabs.onUpdated.addListener(refresh);
chrome.tabs.onCreated.addListener(refresh);
chrome.tabs.onActivated.addListener(refresh);
chrome.tabs.onActiveChanged.addListener(refresh);
chrome.contextMenus.onClicked.addListener(function (info, tab) {
    navigate(info.menuItemId, tab);
});
chrome.commands.onCommand.addListener(function (command) {
    navigate(command, null);
});

In order to get extended info on the tab object received by tabs events, you need the tabs permission. For access to the contextMenus object you need the contextMenus permission.

Warning: if you install your extension from the store and you disable it so you can test your unpacked extension, you will notice that keyboard commands do not work. Seems to be a bug in Chrome. The solution is to remove your extension completely so that the other version can hook into the keyboard shortcuts.

Creating, detecting and removing menu items


To create a menu item is very simple:
chrome.contextMenus.create({
    "id" : "menuItemId",
    "title" : "Menu item description",
    "contexts" : ["page"] //where the menu item will be available
});
However, there is no way to 'get' a menu item, and if you blindly remove a menu item with .remove(id) it will throw an exception. My solution was to use an object that tracks which menu items I have created and destroyed, so I can safely call .remove().
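A hedged sketch of that bookkeeping (MenuTracker and the fake chrome.contextMenus stand-in are made-up names for illustration):

```javascript
// Sketch: track which menu items we created, so we never call remove() on an
// id that does not exist (which throws). In the extension you would pass in
// chrome.contextMenus instead of the fake API below.
function MenuTracker(menusApi) {
    this.menus = menusApi;
    this.created = {};
}
MenuTracker.prototype.create = function (options) {
    if (this.created[options.id]) return; // already there, don't create twice
    this.menus.create(options);
    this.created[options.id] = true;
};
MenuTracker.prototype.remove = function (id) {
    if (!this.created[id]) return; // safe: never created (or already removed)
    this.menus.remove(id);
    delete this.created[id];
};

// A minimal stand-in for chrome.contextMenus, so the sketch runs anywhere:
var fakeMenus = {
    items: {},
    create: function (opts) { this.items[opts.id] = opts; },
    remove: function (id) {
        if (!(id in this.items)) throw new Error('No menu item with id ' + id);
        delete this.items[id];
    }
};

var tracker = new MenuTracker(fakeMenus);
tracker.create({ id: 'menuItemId', title: 'Menu item description', contexts: ['page'] });
tracker.remove('menuItemId'); // removed once...
tracker.remove('menuItemId'); // ...and silently ignored the second time
```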

To hook to the context menu events, use chrome.contextMenus.onClicked.addListener(function (info, tab) { }); where info contains the menuItemId property that is the same as the id used when creating the item.

Again, to access the context menu API, you need the contextMenus permission. More about context menus here.

Commands


You use commands basically to define keyboard shortcuts. You define them in your manifest and then you hook to the event with chrome.commands.onCommand.addListener(function (command) { });, where command is a string containing the key of the command.

Only modifiers, letters and digits can be used. Amazingly, you don't need any permission for this API, but since commands are already defined in the manifest, I guess a permission would be superfluous.

That's it for what I wanted to discuss here. Any questions, bug reports, feature requests... use the comments in the post.

In the previous post I discussed Firebase, used from Javascript, and that covered initialization, basic security, reading everything and inserting. In this post I want to discuss complex queries: filtering, ordering, limiting, indexing, etc. For that I will get inspiration (read: copy with impunity) from the Firebase documentation on the subject, Retrieving Data, but make it quick and dirty... you know, like sex! Thank you, ma'am!

OK, the fluid interface for getting the data looks a lot like C# LInQ and I plan to work on a Linq2Firebase thing, but not yet. Since LInQ itself got its inspiration from SQL, I was planning to structure the post in a similar manner: how to do order by, top/limit, select conditions, indexing and so on, so we can really use Firebase like a database. An interesting concept to explore is joining, since this is an object database, but we still need it, because we want to filter by the results of the join before we return the result, like getting all the transactions of users that have the name 'Adam'. Aggregation is another thing that I feel Firebase needs to support. I don't want to download a billion records just to compute the sum of a property.

However, the Firebase API is rather limited at the moment. You get .orderByChild, then stuff like .equalTo, .startAt and .endAt and then .limitToFirst and .limitToLast. No aggregation, no complex filters, no optimized indexing, no joining. As far as I can see, this is by design, so that the server is as dumb as possible, but think about that 1GB for the free plan. It is a lot.

So, let's try a complex query and see where it gets us.
ref.child('user')
    .once('value', function (snapshot) {
        var users = [];
        snapshot.forEach(function (childSnapshot) {
            var item = childSnapshot.val();
            if (/adam/i.test(item.name)) {
                users.push(item.userId);
            }
        });
        getInvoiceTotalForUsers(users, DoSomethingWithSum);
    });

function getInvoiceTotalForUsers(users, callback) {
    var sum = 0;
    var count = 0;
    for (var i = 0; i < users.length; i++) {
        var id = users[i];
        ref.child('invoice')
            .equalTo(id, 'userId')
            .orderByChild('price')
            .startAt(10)
            .endAt(100)
            .once('value', function (snapshot) {
                snapshot.forEach(function (childSnapshot) {
                    var item = childSnapshot.val();
                    sum += item.price;
                });
                count++; //count completed user queries, not individual invoices
                if (count == users.length) callback(sum);
            });
    }
}

First, I selected the users that have 'adam' in the name. I used .once instead of .on because I don't want to wait for new data to arrive, I want the data so far. I used .forEach to enumerate the data from the value event. With the array of userIds I call getInvoiceTotalForUsers, which gets all the invoices for each user with a price greater than or equal to 10 and less than or equal to 100, and finally calls a callback with the resulting sum of invoice prices.

For me this feels very cumbersome. I can think of several methods to simplify this, but the vanilla code would probably look like this.
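One of those simplifying methods could be a small aggregator that takes over the counting bookkeeping and fires its callback exactly once, when every per-user sum has arrived. This is only a sketch: the names are mine, and fetchSumForUser merely stands in for the per-user Firebase invoice query above.

```javascript
// Collects one numeric result per id and calls 'done' exactly once, when all
// results are in. 'fetchSumForUser' stands in for the per-user Firebase query.
function sumForUsers(userIds, fetchSumForUser, done) {
    var total = 0;
    var pending = userIds.length;
    if (pending === 0) return done(0);
    userIds.forEach(function (id) {
        fetchSumForUser(id, function (sum) {
            total += sum;
            if (--pending === 0) done(total); // fires once, after the last result
        });
    });
}

// Example with a fake asynchronous fetcher:
var fakeSums = { u1: 30, u2: 12 };
sumForUsers(['u1', 'u2'], function (id, cb) {
    setTimeout(function () { cb(fakeSums[id]); }, 0);
}, function (total) {
    console.log(total); // 42
});
```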

I have been looking for a long time for this kind of service, mainly because I wanted to monitor and persist stuff for my blog. Firebase is all of that and more and, with a free plan of 1GB, it's pretty awesome. However, as it is a NoSQL database accessed via Javascript, it may be a bit difficult to grasp at first. In this post I will be talking about how to use Firebase as a traditional database using their Javascript library.

So, first off go to the main website and sign up with Google. Once you do, you get a page with a 5 minute tutorial, quickstarts, examples, API docs... but you want the ultra-quick start! Copy-pasted working code! So click on the Manage App button.

Take note of the URL where you are redirected; it is the one you will use for all data access as well. OK, quick test code:
var testRef = new Firebase('https://*******.firebaseio.com/test');
testRef.push({
    val1: "any object you like",
    val2: 1,
    val3: "as long as it is not undefined or some complex type like a Date object",
    val4: "think of it as JSON"
});
What this does is take that object and save it in your database, in the "test" container. Let's say it's like a table. You can also save objects directly in the root, but I don't recommend it, as the path of the object is the only thing telling you what type of object it is.

Now, in order to read inserted objects, you use events. It's a sort of reactive way of doing things that might be a little unfamiliar. For example, when you run the following piece of code, you will receive, once you connect, all the objects you ever inserted into "test".
var testRef = new Firebase('https://*******.firebaseio.com/test');
testRef.on('child_added', function (snapshot) {
    var obj = snapshot.val();
    handle(obj); //do what you want with the object
});

Note that you can use either child_added or value as the retrieve event. While 'child_added' is fired on each retrieved object, 'value' returns one snapshot containing all data items, then proceeds to fire on each added item with full snapshots. Beware: that means if you have a million items and you do a value query, you get all of them (or at least an attempt is made, I think there are limits), then on the next added item you get a million and one. If you use .limitToLast(50), for example, you will get the last 50 items, then when a new one is added, you get another 50-item snapshot. In my mind, 'value' is to be used with .once(), while 'child_added' with .on(). More details in my Queries post.

Just by using that, you have created a way to insert and read values from the database. Of course, you don't want to leave your database unprotected. Anyone could read or change your data this way. You need some sort of authentication. For that, go to the left and click on Login & Auth, then go to Email & Password and configure which users can log in to your application. Notice that every user has a UID defined. Here is the code to use to authenticate:
var testRef = new Firebase('https://*******.firebaseio.com/test');
testRef.authWithPassword({
    email : "some@email.com",
    password : "password"
}, function (error, authData) {
    if (error) {
        console.log("Login Failed!", error);
    } else {
        console.log("Authenticated successfully with payload:", authData);
    }
});
There is an extra step you want to take: securing your database so that it can only be accessed by logged-in users. For that you have to go to Security & Rules. A very simple structure to use is this:
{
    "rules": {
        "test": {
            ".read": false,
            ".write": false,
            "$uid": {
                // grants write access to the owner of this user account whose uid must exactly match the key ($uid)
                ".write": "auth !== null && auth.uid === $uid",
                // grants read access to any user who is logged in with an email and password
                ".read": "auth !== null && auth.provider === 'password'"
            }
        }
    }
}
This means that:
  1. It is forbidden to write to test directly, or to read from it
  2. It is allowed to write to test/uid (remember the user UID when you created the email/password pair) only by the user with the same uid
  3. It is allowed to read from test/uid, as long as you are authenticated in any way

Gotcha! This rule list allows you to read and write whatever you want on the root itself. Anyone could just waltz on your URL and fill your database with crap, just not in the "test" path. More than that, they can just listen to the root and get EVERYTHING that you write in. So the correct rule set is this:
{
    "rules": {
        ".read": false,
        ".write": false,
        "test": {
            ".read": false,
            ".write": false,
            "$uid": {
                // grants write access to the owner of this user account whose uid must exactly match the key ($uid)
                ".write": "auth !== null && auth.uid === $uid",
                // grants read access to any user who is logged in with an email and password
                ".read": "auth !== null && auth.provider === 'password'"
            }
        }
    }
}

In this particular case, in order to get to the path /test/$uid you can use the .child() function, like this: testRef.child(authData.uid).push(...), where authData is the object you retrieve from the authentication method and that contains your logged user's UID.

The rule system is easy to understand: use ".read"/".write" and a Javascript expression to allow or deny that operation, then add children paths and do the same. There are a lot more things you could learn about the way to authenticate: one can authenticate with Google, Twitter, Facebook, or even with custom tokens. Read more at Email & Password Authentication, User Authentication and User Based Security.

But because you want to do a dirty little hack and just make it work, here is one way:
{
    "rules": {
        ".read": false,
        ".write": false,
        "test": {
            ".read": "auth.uid == 'MyReadUser'",
            ".write": "auth.uid == 'MyWriteUser'"
        }
    }
}
This tells Firebase that no one is allowed to read/write except in /test, and only if their UID is MyReadUser or MyWriteUser, respectively. In order to authenticate for this, we use this piece of code:
testRef.authWithCustomToken(token,success,error);
The handlers for success and error do the rest. In order to create the token you need to do some cryptography, but never mind that, there is an online JsFiddle where you can do just that without any thought. First you need a secret, for which you go into your Firebase console and click on Secrets. Click on "Show" and copy-paste that secret into the JsFiddle "secret" textbox. Then enter MyReadUser/MyWriteUser in the "uid" textbox and create the token. You can then authenticate into Firebase using that ugly string that it spews out at you.

Done, now you only need to use the code. Here is an example:
var testRef = new Firebase('https://*****.firebaseio.com/test');
testRef.authWithCustomToken(token, function (err, authData) {
    if (err) alert(err);
    testRef.on('child_added', function (snapshot) {
        var message = snapshot.val();
        handle(message);
    });
});
where token is the generated token and handle is a function that will run with each of the objects in the database.

In my case, I needed a way to write messages on the blog for users to read. I left read access on for everyone (true) and used the token idea from above to restrict writing. My html page that I run locally uses the authentication to write the messages.

There you have it. In the next post I will examine how you can query the database for specific objects.

I had this Javascript code that I was trying to write as tight as possible and I realized that I don't know if using "!" on an object to check if it is set to a value is slow or fast. I mean, in a strongly typed language I would compare the object with null, not use the NOT operator. So I randomly filled an array with one of five items: null, undefined, the empty string, 0 and new Date(), then compared the performance of a loop checking the array items for having a value with the NOT operator versus other methods. I used Chrome 48.0.2564.109 m, Internet Explorer 11.0.9600.18163 and Firefox 44.0.2 for the tests.

Fast tally (Chrome / Internet Explorer / Firefox):
  • NOT operator: 1600/51480/1200ms
  • === 0 (strong type equality): 1360/47510/2180ms
  • === null : 550/45590/510ms
  • == with 0: 38700/63030/131940ms
  • == with null: 1100/48230/900ms
  • === used twice(with 0 and null): 1760/69460/3500ms
  • typeof == 'object' (which besides the Dates also catches null): 1360/382980/1380ms
  • typeof === 'object' (which besides the Dates also catches null): 1370/407000/1400ms
  • instanceof Date: 1060/69200/600ms

Thoughts: the !/NOT operator is reasonably fast. Using normal equality can really mess up your day when it tries to transform 0 into a Date or vice versa (no, using 0 == arr[i] instead of arr[i] == 0 wasn't faster). Fastest, as expected, was the strong type equality to null. Surprising was the loose null equality, which catches both null and undefined and takes second place. typeof was also surprising, since it not only gets the type of the object, but also compares the result with a string. Funny thing: the === comparison in the case of typeof was slower than the normal == comparison for all browsers, so probably it gets treated as a special construct.

It is obvious that both Chrome and Firefox have really optimized their Javascript engines. Internet Explorer has an 18-second overhead for the loops alone (with no comparison of any kind), while the other browsers optimize it down to 300ms. Sure, behind the scenes they realize that nothing happens in those loops and drop them, but still, it was a drag to wait for the result from Internet Explorer. Compared with the other huge values, the ===null comparison is only insignificantly smaller than the others on Internet Explorer, but it still takes first place, while typeof took forever! Take these results with a grain of salt, though. When I was at FOSDEM I watched a presentation from Firefox developers in which they actually advised against this type of profiling, recommending instead special browser tools that do it properly. You can watch it yourselves here.

Final conclusion: if you are checking whether an object exists or not, especially if you can ensure that the value of a non-existent object is always the same (like null), === kicks ass. The NOT operator can be used to check a user-provided value, since it catches all of null, undefined, the empty string and 0, and it's reasonably fast.
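To make that concrete, here is a quick snippet (separate from the timing code below) showing exactly which of the five test values each check catches:

```javascript
// The five kinds of values used in the test array:
var values = [null, undefined, '', 0, new Date(0)];

// The NOT operator catches every falsy value: null, undefined, '' and 0
var notOp = values.filter(function (v) { return !v; });
// Strict equality catches only null itself
var strictNull = values.filter(function (v) { return v === null; });
// Loose equality with null catches null and undefined, nothing else
var looseNull = values.filter(function (v) { return v == null; });

console.log(notOp.length, strictNull.length, looseNull.length); // 4 1 2
```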

Here is the code:
var arr = [];
for (var i = 0; i < 100000; i++) {
    var r = parseInt(Math.random() * 5);
    switch (r) {
        case 0: arr.push(null); break;
        case 1: arr.push(undefined); break;
        case 2: arr.push(''); break;
        case 3: arr.push(0); break;
        case 4: arr.push(new Date()); break;
    }
}

var n = 0;
var start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (!arr[i]) n++;
    }
}
var end = performance.now();
console.log('!value ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (arr[i] === 0) n++;
    }
}
end = performance.now();
console.log('value===0 ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (arr[i] === null) n++;
    }
}
end = performance.now();
console.log('value===null ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (arr[i] == 0) n++;
    }
}
end = performance.now();
console.log('value==0 ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (arr[i] == null) n++;
    }
}
end = performance.now();
console.log('value==null ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (arr[i] === 0 || arr[i] === null) n++;
    }
}
end = performance.now();
console.log('value===0 || value===null ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (typeof(arr[i]) == 'object') n++;
    }
}
end = performance.now();
console.log('typeof(value)==\'object\' ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (typeof(arr[i]) === 'object') n++;
    }
}
end = performance.now();
console.log('typeof(value)===\'object\' ' + n + ': ' + (end - start));

n = 0;
start = performance.now();
for (var j = 0; j < 1000; j++) {
    for (var i = 0; i < 100000; i++) {
        if (arr[i] instanceof Date) n++;
    }
}
end = performance.now();
console.log('value instanceof Date ' + n + ': ' + (end - start));

Today we tested a web application in the new Microsoft Edge browser. To our surprise, the site failed where Internet Explorer, Chrome, Firefox and even Safari worked perfectly well. I narrowed the problem down to navigator.geolocation.getCurrentPosition, which wasn't working. The site would see navigator.geolocation, ask for the current location, the user would be prompted to allow the site to access the location, and after that it would silently fail. What I mean by that is that neither the success nor the error callback was called, even though the options object specified one second for the timeout. I don't have access to a lot of Windows 10 machines, and I assume that if a lot of people had met this problem they would have invaded the Internet with angry messages, but so far I've found no one having the same issue.

Bottom line: forced to take into consideration the possibility that the geolocation API would silently fail, I changed the code like this:
if (navigator.geolocation) {
    var timeoutInSeconds = 1;
    var geotimeout = setTimeout(function () {
        handleNoGeolocation();
    }, timeoutInSeconds * 1000 + 500); //plus 500 ms to allow the API to timeout normally
    navigator.geolocation.getCurrentPosition(function (position) {
        clearTimeout(geotimeout);
        var pos = doSomethingWith(position.coords.latitude, position.coords.longitude);
    }, function () {
        clearTimeout(geotimeout);
        handleNoGeolocation();
    }, {
        enableHighAccuracy: true,
        timeout: timeoutInSeconds * 1000
    });
} else {
    handleNoGeolocation();
}

In the handleNoGeolocation function I've accessed the great service FreeGeoIp, which returns approximate coordinates based on your IP, and fell back to a static latitude/longitude pair if even that call failed.

Note: the first time the function is called for your site, a browser dialog will appear requesting permission to share the location. While the dialog is displayed the timeout will fire; then, based on the user's choice (and browser), a success/error handler will be called, or nothing at all (as in this case), so make sure your code can handle handleNoGeolocation running first, followed by doSomethingWith.

A blog reader asked me to help him get rid of the ugly effect of a large background image loading slowly. I thought of several solutions, each more complicated than the last, but in the end settled on one that seems to work well and doesn't require complicated libraries or difficult implementation: using the img onload event.

Let's assume that the background image is on the body element of the page. The solution involves setting a style on the body to hide it (style="display:none") then adding as child of the body an image that also is hidden and that, when completing loading, shows the body element. Here is the initial code:
<style>
    body {
        background: url(bg.jpg) no-repeat center center fixed;
    }
</style>
<body>

And after:

<style>
    body {
        background: url(bg.jpg) no-repeat center center fixed;
    }
</style>
<body style="display:none">
<img src="bg.jpg" onload="document.body.style.display=''" style="display:none;" />

This loads the image in a hidden img element and shows the body element when the image finished loading.

The solution might have some problems with Internet Explorer 9, as it seems the load event is not fired for images retrieved from the cache. In that case, a slightly more complex Javascript solution is needed as detailed in this blog post: How to Fix the IE9 Image Onload Bug. Also, in Internet Explorer 5-7 the load event fires for animated GIFs at every loop. I am sure you know it's a bad idea to have an animated GIF as a page background, though :)

Warning: While this hides the effect of slow loading background images, it also hides the page until the image is loaded. This makes the page appear blank until then. More complex solutions would show some simple html content while the page is loading rather than hiding the entire page, but this post is about the simplest solution for the question asked.

A more comprehensive analysis of image preloading, complete with a very nice Javascript code that covers a lot of cases, can be found at Preloading images using javascript, the right way and without frameworks

I was using this JavaScript function that I wanted to accept an array of arguments or just a bunch of arguments that would be interpreted as an array. As you probably know, for each function in JS you get a variable called arguments which contains the arguments to the function. It is a pseudo array, not a real array, and has some extra properties. My code looked like this:
function x(arg) {
    if (!arg) {
        arg = [];
    } else if (!arg.length) {
        arg = [];
        for (var i = 0; i < arguments.length; i++) arg.push(arguments[i]);
    }
    // do something with arg
}

The logic is simple: if there is no first parameter, arg becomes an empty array; else, if there is a first argument but it doesn't have a length property (so it's not an array), set arg to an array and push all the function's arguments as items of that array. But it doesn't work! The point is this: you set arg to an empty array and at that moment arguments[0] is no longer the original argument, but the empty array. Even worse, the code then adds the array as an item of itself, which makes the object infinitely recursive.

Let's make this simpler:
function x(arg) {
    arg = [];
    console.log(arguments[0]);
}
After you execute x() with any arguments, the console will show an empty array, not the original argument. Weird, huh?
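The fix I'd suggest, as a sketch keeping the original intent: snapshot the arguments before touching the named parameter, because in non-strict mode arg and arguments[0] are aliases of each other.

```javascript
// Copy the arguments BEFORE reassigning the named parameter; once 'arg' is
// reassigned, arguments[0] changes with it (non-strict mode aliasing).
function x(arg) {
    var args = Array.prototype.slice.call(arguments); // snapshot first
    if (!arg) {
        arg = [];
    } else if (!arg.length) {
        arg = args; // safe: 'args' was captured before the aliasing kicked in
    }
    return arg;
}
console.log(x());        // []
console.log(x(1, 2, 3)); // [1, 2, 3]
console.log(x([4, 5]));  // [4, 5]
```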

Warning: the algorithm works perfectly well and is better than Sift3, however you might want to consider it in beta, as I am still looking for better implementation solutions that might change the structure of the code.

Try the Javascript implementation here:





Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can be now changed via the new options function transpositionCostEvaluator. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3, however when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.

Update 27 Mar 2015: Thanks to Emanuele Bastianelli, who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop there is an extra check to prevent the algorithm exiting prematurely, before computing the remaining tokens.

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to make a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave me more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:

  • 33% more accurate
  • three different variants: simple, common and general
  • new concepts added
  • support for own value and matching functions, different tokenizer functions, etc.
  • actually tested with a (slightly more) serious test
  • more robust, working better for large values of maxOffset


Before I get into the details, I am publishing the algorithm here for the moment, no Codeplex or PasteBin or GitHub or whatever. Also, it is written in Javascript now, the C# and T-SQL version pending. Of course, it would be great if, as before, the community of people using the algorithm would go into implementing it into various programming languages, however I am a bit apprehensive because more often than not people came with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

I created a test that used random strings, but also a huge list of commonly used English phrases, as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenshtein and the new algorithm and computed the error distance between the Levenshtein distance and the two Sift variants. This permitted me to see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases when Sift3(a,b) is different from Sift3(b,a). These are edge cases, but this one in particular grated on me.

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, and it gets better as the value increases. The common version is a bit more complex, but the error decreases by 33% and stays low for large maxOffset values. The extended or general version receives an options object that can change almost everything, but most important is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare texts by their words, maybe even their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what the value of their match is, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way that it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', one considers them transpositions and removes some of their value from the distance. Actually, if I look at the implementation, it might be that I have independently discovered the Jaro distance. I will research this further. I don't know if the transposition calculation is the most optimal. At the moment it uses an array of all matches found until a point, clearing it of values as the cursors move along the string. The difference between the simple and the common versions of Sift4 is that the simple version is not computing the transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed up Sift3.

Another new concept added is the one of local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non matching characters. Each of these substrings can be used to improve the distance function. For example one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the largest common subsequence is 5, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

Simplest Sift4:

// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); //using max to bypass the need for computing transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}
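For a quick sanity check, here is the simplest version again, repeated so the snippet runs standalone, with a few example calls (the strings are my own; the distances were verified by hand against the algorithm):

```javascript
// Sift4 - simplest version, repeated verbatim so this snippet is self-contained
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }
    if (!s2 || !s2.length) {
        return s1.length;
    }
    var l1 = s1.length;
    var l2 = s2.length;
    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2);
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}

console.log(sift4('exact', 'exact', 5));   // 0 - identical strings
console.log(sift4('London', 'Londen', 5)); // 1 - one substitution
console.log(sift4('abcdex', 'abcde', 5));  // 1 - one extra character
```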
Common Sift4:
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (maxDistance) {
            var temporaryDistance = Math.max(c1, c2) - lcss + trans;
            if (temporaryDistance >= maxDistance)
                return Math.round(temporaryDistance);
        }
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}


Extended/General Sift4:

// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing customization of the scope and algorithm:
//     maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//     tokenizer: a function to transform strings into vectors of tokens
//     tokenMatcher: a function to determine if two tokens are matching (equal)
//     matchingEvaluator: a function to determine how a token match should be added to the local_cs. For example, a fuzzy match could be implemented.
//     localLengthEvaluator: a function to determine how the local_cs value is added to the lcss. For example, longer continuous substrings could be rewarded.
//     transpositionCostEvaluator: a function to determine the cost of an individual transposition. For example, longer transpositions could have a higher cost.
//     transpositionsEvaluator: a function to determine how the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options = extend(options, {
            maxDistance: null,
            tokenizer: function (s) {
                return s ? s.split('') : [];
            },
            tokenMatcher: function (t1, t2) {
                return t1 == t2;
            },
            matchingEvaluator: function (t1, t2) {
                return 1;
            },
            localLengthEvaluator: function (local_cs) {
                return local_cs;
            },
            transpositionCostEvaluator: function (c1, c2) {
                return 1;
            },
            transpositionsEvaluator: function (lcss, trans) {
                return lcss - trans;
            }
        });

    var t1 = options.tokenizer(s1);
    var t2 = options.tokenizer(s2);

    var l1 = t1.length;
    var l2 = t2.length;

    if (l1 == 0)
        return l2;
    if (l2 == 0)
        return l1;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1], t2[c2])) {
            local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans += options.transpositionCostEvaluator(c1, c2);
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (options.maxDistance) {
            var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
            if (temporaryDistance >= options.maxDistance)
                return Math.round(temporaryDistance);
        }
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}

function extend(obj, def) {
    var result = {};
    for (var prop in def) {
        if (!obj || !obj.hasOwnProperty(prop)) {
            result[prop] = def[prop];
        } else {
            result[prop] = obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result = [];
    if (!s)
        return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s)
        return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i = 0; i <= 25; i++) {
        var val = 0;
        if (s) {
            for (var j = 0; j < s.length; j++) {
                var code = s.charCodeAt(j);
                if (code == i + 65 || code == i + 97)
                    val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity > 0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l < 1)
        return l; //0 -> 0
    return l - 1 / (l + 1); //1 -> 0.5, 2 -> 1.66, 9 -> 8.9
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}
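
To get a feel for how these example evaluators shape the score, here is a quick numeric check (the evaluator definitions are repeated from above so the snippet runs standalone):

```javascript
// evaluators repeated from above so this snippet runs standalone
function rewardLengthEvaluator(l) {
    if (l < 1)
        return l;
    return l - 1 / (l + 1);
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5);
}

function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}

console.log(rewardLengthEvaluator(1));                // 0.5 - very short substrings are devalued
console.log(rewardLengthEvaluator2(4));               // 8 - longer substrings rewarded superlinearly
console.log(longerTranspositionsAreMoreCostly(0, 9)); // 2 - a transposition spanning 9 positions costs double
```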

As always, I will be most happy to know if you used my algorithm and how it performed, and to receive any suggestions you might have.


Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters, so that the algorithm works on arrays of one-character tokens. Other options are possible, like splitting the strings by whitespace so that the comparisons are done on words, or transforming a string into an array of strings N characters long, the so-called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word into an array of 26 values representing the number of occurrences in the word of each letter a-z.
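
For example, here is what the two simpler tokenizers produce (definitions repeated from the listing above so the snippet runs standalone):

```javascript
// tokenizers repeated from the listing above so this snippet runs standalone
function nGramTokenizer(s, n) {
    var result = [];
    if (!s)
        return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) {
    if (!s)
        return [];
    return s.split(/\s+/);
}

console.log(nGramTokenizer('sift', 2));             // ['si', 'if', 'ft']
console.log(wordSplitTokenizer('quick brown fox')); // ['quick', 'brown', 'fox']
```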

The tokenMatcher function returns true if two tokens match. The match can be fuzzy; for example, the sift4TokenMatcher uses Sift inside Sift to determine the distance between two tokens and returns true if they are more than 70% similar.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length when two tokens match. The default is 1, but one can use some other metric, like the similarity of the tokens, for example. Of course, the common substring length loses its literal meaning when these functions change, but the variable local_cs is still used.

The localLengthEvaluator takes the length of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value as the one provided, but some functions could reward longer substrings.

FAQ:

Q: Can you make Sift4 work case-insensitively?
A: Just convert the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply to every token type.
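
A minimal sketch of that advice, using the Simplest variant (repeated here in compact form so the snippet runs standalone):

```javascript
// Sift4 Simplest, repeated in compact form so this snippet runs standalone
function sift4simplest(s1, s2, maxOffset) {
    if (!s1 || !s1.length) return s2 ? s2.length : 0;
    if (!s2 || !s2.length) return s1.length;
    var l1 = s1.length, l2 = s2.length;
    var c1 = 0, c2 = 0, lcss = 0, local_cs = 0;
    while (c1 < l1 && c2 < l2) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) c1 = c2 = Math.max(c1, c2);
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if (c1 + i < l1 && s1.charAt(c1 + i) == s2.charAt(c2)) { c1 += i; local_cs++; break; }
                if (c2 + i < l2 && s1.charAt(c1) == s2.charAt(c2 + i)) { c2 += i; local_cs++; break; }
            }
        }
        c1++; c2++;
    }
    return Math.max(l1, l2) - (lcss + local_cs);
}

// case-insensitive wrapper: just lowercase both inputs first
function sift4CaseInsensitive(s1, s2, maxOffset) {
    return sift4simplest((s1 || '').toLowerCase(), (s2 || '').toLowerCase(), maxOffset);
}

console.log(sift4simplest('Hello', 'hello', 5));        // 1 - 'H' and 'h' differ
console.log(sift4CaseInsensitive('Hello', 'hello', 5)); // 0
```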

Q: Can you make Sift4 compare strings based on their meaning, for example using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces each word with the first of its synonyms. A more complex solution would require analysing the strings beforehand and turning them into ordered lists of synonyms or equivalent expressions, then using Sift4 with a word tokenizer (one is provided in the Extended algorithm source).
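
A hedged sketch of such a tokenizer; the synonym map here is a made-up example, and a real one would come from a thesaurus:

```javascript
// hypothetical synonym map - illustration only
var synonyms = { big: 'large', huge: 'large', enormous: 'large' };

// word tokenizer that canonicalizes each word through the synonym map,
// suitable for the tokenizer option of the Extended version
function synonymTokenizer(s) {
    if (!s) return [];
    return s.toLowerCase().split(/\s+/).map(function (w) {
        return synonyms[w] || w;
    });
}

console.log(synonymTokenizer('a BIG house')); // ['a', 'large', 'house']
console.log(synonymTokenizer('a huge house')); // ['a', 'large', 'house'] - identical token streams
```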

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course; they are different algorithms. This is a fuzzy distance calculator, so it doesn't give you the exact value, and there are still edge cases. The idea of Sift is to be fast and reasonably accurate, rather than exact. If you need more accuracy, try combining Sift with Levenshtein, for example computing Levenshtein only where Sift says the strings are above a certain similarity.
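
A sketch of that combination, using the Simplest variant as the cheap prefilter and a textbook two-row Levenshtein for the exact value; both are repeated or implemented here so the snippet runs standalone, and the threshold is an arbitrary example:

```javascript
// Sift4 Simplest, repeated in compact form so this snippet runs standalone
function sift4simplest(s1, s2, maxOffset) {
    if (!s1 || !s1.length) return s2 ? s2.length : 0;
    if (!s2 || !s2.length) return s1.length;
    var l1 = s1.length, l2 = s2.length;
    var c1 = 0, c2 = 0, lcss = 0, local_cs = 0;
    while (c1 < l1 && c2 < l2) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) c1 = c2 = Math.max(c1, c2);
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if (c1 + i < l1 && s1.charAt(c1 + i) == s2.charAt(c2)) { c1 += i; local_cs++; break; }
                if (c2 + i < l2 && s1.charAt(c1) == s2.charAt(c2 + i)) { c2 += i; local_cs++; break; }
            }
        }
        c1++; c2++;
    }
    return Math.max(l1, l2) - (lcss + local_cs);
}

// textbook two-row Levenshtein, O(n*m) - the expensive exact distance
function levenshtein(s1, s2) {
    var prev = [], cur = [], tmp;
    for (var j = 0; j <= s2.length; j++) prev[j] = j;
    for (var i = 1; i <= s1.length; i++) {
        cur[0] = i;
        for (j = 1; j <= s2.length; j++) {
            var cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
            cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
        }
        tmp = prev; prev = cur; cur = tmp;
    }
    return prev[s2.length];
}

// cheap prefilter: only pay for Levenshtein when Sift4 says the strings are close enough
function refinedDistance(s1, s2, maxOffset, threshold) {
    var estimate = sift4simplest(s1, s2, maxOffset);
    return estimate > threshold ? estimate : levenshtein(s1, s2);
}

console.log(refinedDistance('kitten', 'sitting', 5, 4)); // 3 - close enough, exact Levenshtein used
```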

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example of why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is so far JavaScript only, just compute whatever maxOffset is convenient for you before you compare.
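
For example, a hypothetical heuristic deriving maxOffset from the inputs before the call; the "10% of the longer string, at least 5" rule is an arbitrary choice, not part of the algorithm:

```javascript
// hypothetical heuristic: maxOffset as ~10% of the longer string, but at least 5
function adaptiveMaxOffset(s1, s2) {
    var l = Math.max(s1 ? s1.length : 0, s2 ? s2.length : 0);
    return Math.max(5, Math.round(l / 10));
}

// then call: sift4(s1, s2, adaptiveMaxOffset(s1, s2));
console.log(adaptiveMaxOffset('abc', 'abcdef'));              // 5
console.log(adaptiveMaxOffset(new Array(101).join('a'), '')); // 10 (a 100-character string)
```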

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know and I will try to fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything; it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not in version 4.

Q: What are the edge cases for Sift?
A: There are probably several, but I haven't really spotted them all. One is that letters at a given position in both strings might match letters at other positions, yet only one of the matches will count. Take 'abxx' and 'bayy': the algorithm looks at position 0, finds no match, then tries to find the closest match for each letter. Starting with position 0 in the first string, it finds the 'a' matched at position 1 in the second. It increases both cursors and lcss is increased as well. The next check is 'b', the character at position 1 in the first string, against position 2 in the second string. No match, so both cursors are reset to 1 and the search starts again. The 'b' match is lost and the distance is 3 instead of 2. I also think there might be situations where the cursors are unequal and the bigger of them reaches the end of its string, terminating the algorithm even though more matches were possible. Incidentally, I tried to fix both of these issues and the error relative to Levenshtein was not really affected, but I am not 100% sure of the implementation.

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. A symmetric function does exist, Math.min(Sift4(a,b), Sift4(b,a)), but it is obviously twice as slow.
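
That min-of-both-directions trick works as a generic wrapper over any directional distance function. The sketch below demonstrates it with a deliberately asymmetric toy function; with Sift4 you would pass sift4 itself:

```javascript
// wrap any directional distance function into a symmetric one
function symmetrize(distFn) {
    return function (s1, s2, maxOffset) {
        return Math.min(distFn(s1, s2, maxOffset), distFn(s2, s1, maxOffset));
    };
}

// toy asymmetric "distance" just to demonstrate; with Sift4: var symSift4 = symmetrize(sift4);
function toyDistance(s1, s2) {
    return s1.length * 2 - s2.length + 10;
}

var symToy = symmetrize(toyDistance);
console.log(toyDistance('ab', 'abcd'), toyDistance('abcd', 'ab')); // 10 16 - asymmetric
console.log(symToy('ab', 'abcd'), symToy('abcd', 'ab'));           // 10 10 - same both ways
```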

Implementations in other languages:

C# implementation:
public class Sift4
{
    private Options _options;

    public Sift4(Options options)
    {
        if (options == null) options = new Options();
        if (options.Tokenizer == null)
        {
            options.Tokenizer = (s) => s == null
                ? new string[0]
                : s.ToCharArray().Select(c => c.ToString()).ToArray();
        }
        if (options.TokenMatcher == null)
        {
            options.TokenMatcher = (t1, t2) => t1 == t2;
        }
        if (options.MatchingEvaluator == null)
        {
            options.MatchingEvaluator = (t1, t2) => 1;
        }
        if (options.LocalLengthEvaluator == null)
        {
            options.LocalLengthEvaluator = (l) => l;
        }
        if (options.TranspositionCostEvaluator == null)
        {
            options.TranspositionCostEvaluator = (c1, c2) => 1;
        }
        if (options.TranspositionsEvaluator == null)
        {
            options.TranspositionsEvaluator = (l, t) => l - t;
        }
        _options = options;
    }

    /// <summary>
    /// General distance algorithm uses all the parameters in the options object and works on tokens
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public double GeneralDistance(string s1, string s2, int maxOffset)
    {
        var t1 = _options.Tokenizer(s1);
        var t2 = _options.Tokenizer(s2);

        var l1 = t1.Length;
        var l2 = t2.Length;

        if (l1 == 0) return _options.LocalLengthEvaluator(l2);
        if (l2 == 0) return _options.LocalLengthEvaluator(l1);

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0.0;  //largest common subsequence
        var local_cs = 0.0; //local common substring
        var trans = 0.0;  //cost of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (_options.TokenMatcher(t1[c1], t2[c2]))
            {
                local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans += _options.TranspositionCostEvaluator(c1, c2);
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans += _options.TranspositionCostEvaluator(ofs.C1, ofs.C2);
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            if (_options.MaxDistance != null)
            {
                var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
                if (temporaryDistance >= _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
            }
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += _options.LocalLengthEvaluator(local_cs);
        return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //apply transposition cost to the final result
    }

    /// <summary>
    /// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <param name="maxDistance"></param>
    /// <returns></returns>
    public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance = 0)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring
        var trans = 0;  //number of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans++;
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans++;
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            if (maxDistance > 0)
            {
                var temporaryDistance = Math.Max(c1, c2) - (lcss - trans);
                if (temporaryDistance >= maxDistance) return temporaryDistance;
            }
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += local_cs;
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - (lcss - trans); //apply transposition cost to the final result
    }

    /// <summary>
    /// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public static int SimplestDistance(string s1, string s2, int maxOffset)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Max(c1, c2);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - lcss;
    }

    private class OffsetPair
    {
        public int C1 { get; set; }
        public int C2 { get; set; }
        public bool IsTransposition { get; set; }

        public OffsetPair(int c1, int c2)
        {
            this.C1 = c1;
            this.C2 = c2;
            this.IsTransposition = false;
        }
    }

    public class Options
    {
        /// <summary>
        /// If set, the algorithm will stop if the distance reaches this value
        /// </summary>
        public int? MaxDistance { get; set; }

        /// <summary>
        /// The function that turns strings into a list of tokens (also strings)
        /// </summary>
        public Func<string, string[]> Tokenizer { get; set; }

        /// <summary>
        /// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
        /// </summary>
        public Func<string, string, bool> TokenMatcher { get; set; }

        /// <summary>
        /// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
        /// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
        /// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
        /// </summary>
        public Func<string, string, double> MatchingEvaluator { get; set; }

        /// <summary>
        /// Determines if the local value (computed on subsequent matched tokens) must be modified.
        /// In case one wants to reward longer matched substrings, for example, this is what you need to change.
        /// </summary>
        public Func<double, double> LocalLengthEvaluator { get; set; }

        /// <summary>
        /// The function determining the cost of an individual transposition, based on its counter positions.
        /// </summary>
        public Func<int, int, double> TranspositionCostEvaluator { get; set; }

        /// <summary>
        /// The function determining how the cost of transpositions affects the final result
        /// Change it if it doesn't suit you.
        /// </summary>
        public Func<double, double, double> TranspositionsEvaluator { get; set; }
    }
}

PHP implementation:
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance = 0) {
    if (!$s1 || !strlen($s1)) {
        if (!$s2) {
            return 0;
        }
        return strlen($s2);
    }
    if (!$s2 || !strlen($s2)) {
        return strlen($s1);
    }
    $l1 = strlen($s1);
    $l2 = strlen($s2);
    $c1 = 0; //cursor for string 1
    $c2 = 0; //cursor for string 2
    $lcss = 0; //largest common subsequence
    $local_cs = 0; //local common substring
    $trans = 0; //number of transpositions ('ab' vs 'ba')
    $offset_arr = array(); //offset pair array, for computing the transpositions
    while (($c1 < $l1) && ($c2 < $l2)) {
        if (substr($s1, $c1, 1) == substr($s2, $c2, 1)) {
            $local_cs++;
            $isTrans = false;
            $i = 0;
            while ($i < sizeof($offset_arr)) { //see if current match is a transposition
                $ofs = $offset_arr[$i];
                if ($c1 <= $ofs['c1'] || $c2 <= $ofs['c2']) {
                    $isTrans = abs($c2 - $c1) >= abs($ofs['c2'] - $ofs['c1']);
                    if ($isTrans) {
                        $trans++;
                    } else {
                        if (!$ofs['trans']) {
                            $offset_arr[$i]['trans'] = true; //$ofs is a copy, so write back into the array
                            $trans++;
                        }
                    }
                    break;
                } else {
                    if ($c1 > $ofs['c2'] && $c2 > $ofs['c1']) {
                        array_splice($offset_arr, $i, 1);
                    } else {
                        $i++;
                    }
                }
            }
            array_push($offset_arr, array('c1' => $c1, 'c2' => $c2, 'trans' => $isTrans));
        } else {
            $lcss+= $local_cs;
            $local_cs = 0;
            if ($c1 != $c2) {
                $c1 = $c2 = min($c1, $c2); //using min allows the computation of transpositions
                
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for ($i = 0;$i < $maxOffset && ($c1 + $i < $l1 || $c2 + $i < $l2);$i++) {
                if (($c1 + $i < $l1) && (substr($s1, $c1 + $i, 1) == substr($s2, $c2, 1))) {
                    $c1+= $i - 1;
                    $c2--;
                    break;
                }
                if (($c2 + $i < $l2) && (substr($s1, $c1, 1) == substr($s2, $c2 + $i, 1))) {
                    $c1--;
                    $c2+= $i - 1;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
        if ($maxDistance) {
            $temporaryDistance = max($c1, $c2) - $lcss + $trans;
            if ($temporaryDistance >= $maxDistance) return $temporaryDistance;
        }
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if (($c1 >= $l1) || ($c2 >= $l2)) {
            $lcss+= $local_cs;
            $local_cs = 0;
            $c1 = $c2 = min($c1, $c2);
        }
    }
    $lcss+= $local_cs;
    return max($l1, $l2) - $lcss + $trans; //apply transposition cost to final result
    
}
Thanks to Ferenc Szatmári for corrections in the PHP code

The Simplest and General versions of the algorithm remain an exercise for the reader, since I haven't been working in PHP for more than a decade.
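For reference, here is a minimal JavaScript sketch of what the Simplest version looks like: no offset pair array (so no transposition counting) and no maxDistance early exit; on a mismatch, max() simply moves both cursors forward. This follows the same logic as the PowerShell simple version further below and is a sketch, not a drop-in library.

```javascript
// Minimal sketch of the Simplest version of Sift4: only cursors and the
// local common substring counter; no transpositions, no early exit.
function sift4simplest(s1, s2, maxOffset) {
    if (!s1 || !s1.length) return s2 ? s2.length : 0;
    if (!s2 || !s2.length) return s1.length;
    var l1 = s1.length, l2 = s2.length;
    var c1 = 0, c2 = 0;   // cursors for the two strings
    var lcss = 0;         // largest common subsequence
    var local_cs = 0;     // local common substring
    while (c1 < l1 && c2 < l2) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); // using max bypasses the need for transpositions
            }
            // look ahead up to maxOffset characters for a match on either side
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if (c1 + i < l1 && s1.charAt(c1 + i) == s2.charAt(c2)) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if (c2 + i < l2 && s1.charAt(c1) == s2.charAt(c2 + i)) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.max(l1, l2) - lcss;
}
```

For example, sift4simplest('London', 'Londen', 5) returns 1, the same as the Levenshtein distance for that pair.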

T-SQL implementation
---<summary>
---Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
---</summary>
---<param name="s1"></param>
---<param name="s2"></param>
---<param name="maxOffset"></param>
---<param name="maxDistance"></param>
---<returns></returns>
CREATE FUNCTION Sift4common(@s1          NVARCHAR(max), 
                            @s2          NVARCHAR(max), 
                            @maxOffset   INT, 
                            @maxDistance INT) 
returns INT 
AS 
  BEGIN 
      DECLARE @l1 INT = Len(Isnull(@s1, '')); 
      DECLARE @l2 INT = Len(Isnull(@s2, '')); 

      IF ( @l1 = 0 ) 
        RETURN @l2; 

      IF ( @l2 = 0 ) 
        RETURN @l1; 

      DECLARE @c1 INT = 0; 
      DECLARE @c2 INT = 0; 
      DECLARE @lcss INT = 0; 
      DECLARE @local_cs INT = 0; 
      DECLARE @trans INT = 0; 
      DECLARE @offset_arr TABLE 
        ( 
           C1    INT, 
           C2    INT, 
           Trans BIT 
        ) 
      DECLARE @i INT 
      DECLARE @temporaryDistance FLOAT 
      DECLARE @result INT 
      DECLARE @oc1 INT, 
              @oc2 INT, 
              @otr BIT 
      DECLARE @isTrans BIT 

      WHILE ( ( @c1 < @l1 ) 
              AND ( @c2 < @l2 ) ) 
        BEGIN 
            IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
              BEGIN 
                  SET @local_cs=@local_cs + 1; 
                  SET @isTrans=0 
                  SET @oc1=NULL; 

                  SELECT TOP 1 @oc1 = o.C1, 
                               @oc2 = o.C2, 
                               @otr = o.Trans 
                  FROM   @offset_arr o 
                  WHERE  @c1 <= o.C1 
                          OR @c2 <= o.C2 

                  IF ( @oc1 IS NOT NULL ) 
                    BEGIN 
                        SET @isTrans=CASE 
                                       WHEN Abs(@c2 - @c1) >= Abs(@oc2 - @oc1) 
                                     THEN 1 
                                       ELSE 0 
                                     END 

                        IF ( @isTrans = 1 ) 
                          BEGIN 
                              SET @trans=@trans + 1 
                          END 
                        ELSE IF ( @otr = 0 ) 
                          BEGIN 
                              SET @trans=@trans + 1 

                              UPDATE @offset_arr 
                              SET    Trans = 1 
                              WHERE  C1 = @oc1 
                                     AND C2 = @oc2 
                          END 
                    END 

                  DELETE FROM @offset_arr 
                  WHERE  @c1 > C1 
                         AND @c1 > C2 
                         AND @c2 > C1 
                         AND @c2 > C2; 

                  INSERT INTO @offset_arr 
                  VALUES      (@c1, 
                               @c2, 
                               @isTrans); 
              END 
            ELSE 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 != @c2 ) 
                    -- using min allows the computation of transpositions  
                    BEGIN 
                        IF ( @c1 < @c2 ) 
                          BEGIN 
                              SET @c2=@c1; 
                          END 
                        ELSE 
                          BEGIN 
                              SET @c1=@c2; 
                          END 
                    END 

                  --if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                  --so that we can have only one code block handling matches   
                  SET @i=0; 

                  WHILE ( @i < @maxOffset 
                          AND ( @c1 + @i < @l1 
                                 OR @c2 + @i < @l2 ) ) 
                    BEGIN 
                        IF ( ( @c1 + @i < @l1 ) 
                             AND Substring(@s1, @c1 + @i + 1, 1) = 
                                 Substring(@s2, @c2 + 1, 1) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 + @i - 1; 
                              SET @c2=@c2 - 1; 

                              BREAK; 
                          END 

                        IF ( ( @c2 + @i < @l2 ) 
                             AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
                                                              @c2 + @i + 1, 1 
                                                              ) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 - 1; 
                              SET @c2=@c2 + @i - 1; 

                              BREAK; 
                          END 

                        SET @i=@i + 1; 
                    END 
              END 

            SET @c1=@c1 + 1; 
            SET @c2=@c2 + 1; 

            IF ( @maxDistance > 0 ) 
              BEGIN 
                  IF ( @c1 > @c2 ) 
                    BEGIN 
                        SET @temporaryDistance = @c1 - ( @lcss - @trans / 2.0 ); 
                    END 
                  ELSE 
                    BEGIN 
                        SET @temporaryDistance = @c2 - ( @lcss - @trans / 2.0 ); 
                    END 

                  IF ( @temporaryDistance >= @maxDistance ) 
                    RETURN Round(@temporaryDistance, 0); 
              END 

            IF ( ( @c1 >= @l1 ) 
                  OR ( @c2 >= @l2 ) ) 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 < @c2 ) 
                    BEGIN 
                        SET @c2=@c1; 
                    END 
                  ELSE 
                    BEGIN 
                        SET @c1=@c2; 
                    END 
              END 
        END 

      SET @lcss = @lcss + @local_cs; 

      --apply the transposition cost to the final result 
      IF ( @l1 > @l2 ) 
        BEGIN 
            SET @result = @l1 - ( @lcss - @trans ); 
        END 
      ELSE 
        BEGIN 
            SET @result = @l2 - ( @lcss - @trans ); 
        END 

      RETURN @result; 
  END 

Clearly a general version of the algorithm is not possible in Transact SQL.

Here is a MySQL version, gracefully provided by Ferenc Szatmári:
BEGIN
  DECLARE l1 INT DEFAULT Length(IFNULL(s1, ''));
  DECLARE l2 INT DEFAULT Length(IFNULL(s2, ''));
  DECLARE c1 INT DEFAULT 0;
  DECLARE c2 INT DEFAULT 0;
  DECLARE lcss INT DEFAULT 0;
  DECLARE local_cs INT DEFAULT 0;
  DECLARE trans INT DEFAULT 0;
  DECLARE i INT;
  DECLARE temporaryDistance FLOAT;
  DECLARE result INT;
  DECLARE oc1 INT;
  DECLARE oc2 INT;
  DECLARE otr SMALLINT;
  DECLARE isTrans SMALLINT;

  DROP TEMPORARY TABLE IF EXISTS offset_arr;
  CREATE TEMPORARY TABLE IF NOT EXISTS offset_arr
  (
    C1 INT,
    C2 INT,
    Trans BIT
  ) ENGINE=memory;

  IF l1 = 0 THEN
    RETURN l2;
  END IF;
  IF l2 = 0 THEN
    RETURN l1;
  END IF;

  WHILE ((c1 < l1) AND (c2 < l2)) DO
    IF (Substring(s1, c1 + 1, 1) = Substring(s2, c2 + 1, 1)) THEN
      SET local_cs = local_cs + 1;
      SET isTrans = 0;
      SET oc1 = NULL;
      SELECT o.C1, o.C2, o.Trans INTO oc1, oc2, otr
      FROM offset_arr o
      WHERE c1 <= o.C1 OR c2 <= o.C2
      LIMIT 1;
      IF oc1 IS NOT NULL THEN
        SET isTrans = CASE WHEN ABS(c2 - c1) >= ABS(oc2 - oc1) THEN 1 ELSE 0 END;
        IF (isTrans = 1) THEN
          SET trans = trans + 1;
        ELSE
          IF (otr = 0) THEN
            SET trans = trans + 1;
            UPDATE offset_arr SET Trans = 1 WHERE C1 = oc1 AND C2 = oc2;
          END IF;
        END IF;
      END IF;

      DELETE FROM offset_arr
      WHERE  c1 > C1
        AND c1 > C2
        AND c2 > C1
        AND c2 > C2;

      INSERT INTO offset_arr
      VALUES (c1, c2, isTrans);
    ELSE
      SET lcss = lcss + local_cs;
      SET local_cs = 0;

      IF (c1 != c2) THEN
        -- using min allows the computation of transpositions
        IF (c1 < c2) THEN
          SET c2 = c1;
        ELSE
          SET c1 = c2;
        END IF;
      END IF;

      /* if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
         so that we can have only one code block handling matches */
      SET i = 0;
      label: WHILE (i < maxOffset AND (c1 + i < l1 OR c2 + i < l2)) DO
        IF ((c1 + i < l1)
            AND Substring(s1, c1 + i + 1, 1) = Substring(s2, c2 + 1, 1)) THEN
          SET c1 = c1 + i - 1;
          SET c2 = c2 - 1;
          LEAVE label;
        END IF;

        IF ((c2 + i < l2)
            AND Substring(s1, c1 + 1, 1) = Substring(s2, c2 + i + 1, 1)) THEN
          SET c1 = c1 - 1;
          SET c2 = c2 + i - 1;
          LEAVE label;
        END IF;

        SET i = i + 1;
      END WHILE;
    END IF;

    SET c1 = c1 + 1;
    SET c2 = c2 + 1;

    IF (maxDistance > 0) THEN
      IF (c1 > c2) THEN
        SET temporaryDistance = c1 - (lcss - trans / 2.0);
      ELSE
        SET temporaryDistance = c2 - (lcss - trans / 2.0);
      END IF;

      IF (temporaryDistance >= maxDistance) THEN
        RETURN Round(temporaryDistance, 0);
      END IF;
    END IF;

    IF ((c1 >= l1) OR (c2 >= l2)) THEN
      SET lcss = lcss + local_cs;
      SET local_cs = 0;

      IF (c1 < c2) THEN
        SET c2 = c1;
      ELSE
        SET c1 = c2;
      END IF;
    END IF;
  END WHILE;

  SET lcss = lcss + local_cs;

  /* apply the transposition cost to the final result */
  IF (l1 > l2) THEN
    SET result = l1 - (lcss - trans);
  ELSE
    SET result = l2 - (lcss - trans);
  END IF;
  RETURN result;
END



Java implementation
Here is a Java version, gracefully provided by Nathanæl Fischer:
/**
 * Sift4 - common version
 * online algorithm to compute the distance between two strings in O(n)
 * Algorithm by siderite, java port by Nathan Fischer 2016
 * /blog/super-fast-and-accurate-string-distance.html
 * @param s1
 * @param s2
 * @param maxOffset the number of characters to search for matching letters
 * @return
 */
public static double sift4(String s1, String s2, int maxOffset) {
    class Offset {
        int c1;
        int c2;
        boolean trans;

        Offset(int c1, int c2, boolean trans) {
            this.c1 = c1;
            this.c2 = c2;
            this.trans = trans;
        }
    }

    if (s1 == null || s1.isEmpty())
        return s2 == null ? 0 : s2.length();

    if (s2 == null || s2.isEmpty())
        return s1.length();

    int l1 = s1.length();
    int l2 = s2.length();

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    LinkedList<Offset> offset_arr = new LinkedList<>(); //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            boolean isTrans = false;
            //see if current match is a transposition
            int i = 0;
            while (i < offset_arr.size()) {
                Offset ofs = offset_arr.get(i);
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.remove(i);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.add(new Offset(c1, c2, isTrans));
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}
Powershell implementation


Powershell implementation of simple Sift4, by Kirk Sayre:

function Calc-Sift4Distance 
{
  <# 
      .SYNOPSIS 

      Compute the edit distance between 2 strings using the sift4 string
      edit distance algorithm.

      .PARAMETER s1

      The 1st string

      .PARAMETER s2

      The 2nd string

      .PARAMETER maxOffset

      The maximum common substring length for which to search. The default
      is 10.

      .RETURN

      The # of edits needed to make the given 2 strings equal.
  #>

  Param(
    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s1,

    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s2,

    [Parameter(ValueFromPipelineByPropertyName = $True)]
    [Int]
    $maxOffset = 10
  )

  # Handle null or empty strings.
  if ((-not $s1) -or ($s1.length -eq 0)) 
  {
    if (-not $s2) 
    {
      return 0
    }
    return $s2.length
  }

  if ((-not $s2) -or ($s2.length -eq 0)) 
  {
    return $s1.length
  }

  # Initialization.
  $l1 = $s1.length
  $l2 = $s2.length
  $c1 = 0 # Cursor for string 1.
  $c2 = 0 # Cursor for string 2.
  $lcss = 0 # Largest common subsequence.
  $local_cs = 0 # Local common substring.

  # Scan strings.
  while (($c1 -lt $l1) -and ($c2 -lt $l2)) 
  {
    if ($s1[$c1] -eq $s2[$c2]) 
    {
      $local_cs++
    }
    else 
    {
      $lcss += $local_cs
      $local_cs = 0
      if ($c1 -ne $c2) 
      {
        # Using max bypasses the need for computing transpositions ('ab' vs 'ba').
        $c1 = $c2 = (@($c1, $c2) | Measure-Object -Maximum).Maximum
      }

      for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) 
      {
        if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) 
        {
          $c1 += $i
          $local_cs++
          break
        }

        if ((($c2 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i]))
        {
          $c2 += $i
          $local_cs++
          break
        }
      }
    }
    $c1++
    $c2++
  }
  $lcss += $local_cs
  return [math]::Round((@($l1, $l2) | Measure-Object -Maximum).Maximum - $lcss)
}
C++ implementation


Thanks to Hugo Amaro, a C++ implementation:

struct sift_offset {
  int c1;
  int c2;
  bool trans;
};

template < typename T >
  int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) {
    if (!s1 || !s1_size) {
      if (!s2) {
        return 0;
      }
      return s2_size;
    }

    if (!s2 || !s2_size) {
      return s1_size;
    }

    int l1 = s1_size;
    int l2 = s2_size;

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    std::vector < sift_offset > offset_arr; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
      if (s1[c1] == s2[c2]) {
        local_cs++;
        bool isTrans = false;
        //see if current match is a transposition
        int i = 0;
        while (i < offset_arr.size()) {
          sift_offset ofs = offset_arr[i];
          if (c1 <= ofs.c1 || c2 <= ofs.c2) {
            // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
            isTrans = std::abs(c2 - c1) >= std::abs(ofs.c2 - ofs.c1);

            if (isTrans) {
              trans++;
            } else {
              if (!ofs.trans) {
                ofs.trans = true;
                trans++;
              }
            }
            break;
          } else {
            if (c1 > ofs.c2 && c2 > ofs.c1) {
              offset_arr.erase(offset_arr.begin() + i);

            } else {
              i++;
            }
          }
        }
        offset_arr.push_back({
          c1,
          c2,
          isTrans
        });
      } else {
        lcss += local_cs;
        local_cs = 0;
        if (c1 != c2) {
          c1 = c2 = std::min(c1, c2); //using min allows the computation of transpositions
        }
        //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
        //so that we can have only one code block handling matches 
        for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
          if ((c1 + i < l1) && (s1[c1 + i] == s2[c2])) {
            c1 += i - 1;
            c2--;
            break;
          }
          if ((c2 + i < l2) && (s1[c1] == s2[c2 + i])) {
            c1--;
            c2 += i - 1;
            break;
          }
        }
      }
      c1++;
      c2++;
      if (maxDistance) {
        int temporaryDistance = std::max(c1, c2) - lcss + trans;
        if (temporaryDistance >= maxDistance) return temporaryDistance;
      }
      // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
      if ((c1 >= l1) || (c2 >= l2)) {
        lcss += local_cs;
        local_cs = 0;
        c1 = c2 = std::min(c1, c2);
      }
    }
    lcss += local_cs;
    return std::max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
  }

You can find a Go implementation here, written by Jason W. Hutchinson.
There is also a Swift implementation here.
A Perl 6 implementation can be found here.



I had this AngularJS grid in which I wanted to show totals for certain categories and types. To be more precise, I had a list of items with Category and Type and I wanted to know the total for all categories and, for each category, the total for each type. This works perfectly if I load all items individually, but I had hundreds of thousands of items, so it was clearly impractical. The solution? Send totals for every category and type, then just add them up in the grid. In order to do that, though, I had to change the template of the "grouping row", the one that in ngGrid has the ngAggregate class.

It seems that all that is required for that is to change the aggregate template in the grid options. If you are not interested in the details, jump directly to the solution.

There is already an aggregate template in ngGrid.js, one that (at this time) looks like this:
<div ng-click="row.toggleExpand()" ng-style="rowStyle(row)" class="ngAggregate">
<span class="ngAggregateText">{{row.label CUSTOM_FILTERS}} ({{row.totalChildren()}} {{AggItemsLabel}})</span>
<div class="{{row.aggClass()}}"></div>
</div>

So we see that the number displayed in an aggregate row comes from a function of the row object called totalChildren, which is defined in ngAggregate.prototype and looks like this:
ngAggregate.prototype.totalChildren = function () {
    if (this.aggChildren.length > 0) {
        var i = 0;
        var recurse = function (cur) {
            if (cur.aggChildren.length > 0) {
                angular.forEach(cur.aggChildren, function (a) {
                    recurse(a);
                });
            } else {
                i += cur.children.length;
            }
        };
        recurse(this);
        return i;
    } else {
        return this.children.length;
    }
};
Maybe one could change the function to cover specific types of objects and return a sum instead of a count, but that is not the scope of the current post.

The solution described here will involve a custom function and a custom template. Here is how you do it:
  • Define the options for the grid. I am sure you already have it defined somewhere, if not, it is advisable you would. Sooner or later you will want to customize the output and functionality.
  • Add a new property to the options called aggregateTemplate. This will look probably like the default template, but with another function instead of totalChildren.
  • Define the function that will aggregate the items.

Solution 1:

Define template:
$scope.gridOptions = {
    data: 'Data.Units',
    enableColumnResize: true,
    showColumnMenu: false,
    showFilter: true,
    multiSelect: false,
    showGroupPanel: true,
    enablePaging: false,
    showFooter: true,
    columnDefs: [
        { field: 'Country', displayName: 'Country', width: 190 },
        { field: 'Type', displayName: 'Unit Type' },
        { field: 'Count', displayName: 'Count', width: 180 }
    ],
    aggregateTemplate: "<div ng-click=\"row.toggleExpand()\" ng-style=\"rowStyle(row)\" class=\"ngAggregate\">" +
        "    <span class=\"ngAggregateText\">{{row.label CUSTOM_FILTERS}} ({{aggFunc(row)}} {{AggItemsLabel}})</span>" +
        "    <div class=\"{{row.aggClass()}}\"></div>" +
        "</div>"
};
Define function:
$scope.aggFunc = function (row) {
    var sumColumn = 'Count';
    var total = 0;
    angular.forEach(row.children, function (entry) {
        total += entry.entity[sumColumn];
    });
    angular.forEach(row.aggChildren, function (entry) {
        total += $scope.aggFunc(entry);
    });
    return total;
};

What we did here is we replaced row.totalChildren() with aggFunc(row) which we defined in the scope. What it does is add to the total the value of 'Count' rather than just count the items. It goes through row.children, which contains normal row items, then through aggChildren, which contains aggregate rows, which we pass through the same function in order to get their total.
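To see the recursion in isolation, the same aggregation logic can be run over plain objects shaped like ngGrid rows. The structure below (children entries wrapping their data in an entity property, aggChildren holding nested aggregate rows) mirrors how the rows are accessed above and is illustrative, not the real ngGrid internals:

```javascript
// The same recursive sum as $scope.aggFunc, over plain row-shaped objects.
// Leaf rows live in children (each wrapping its data in entity); nested
// aggregate rows live in aggChildren.
function aggFunc(row, sumColumn) {
    var total = 0;
    (row.children || []).forEach(function (entry) {
        total += entry.entity[sumColumn];
    });
    (row.aggChildren || []).forEach(function (entry) {
        total += aggFunc(entry, sumColumn);
    });
    return total;
}

// A category row with two type sub-groups: counts 2 + 3 and 5.
var categoryRow = {
    children: [],
    aggChildren: [
        { children: [{ entity: { Count: 2 } }, { entity: { Count: 3 } }], aggChildren: [] },
        { children: [{ entity: { Count: 5 } }], aggChildren: [] }
    ]
};
```

Here aggFunc(categoryRow, 'Count') yields 10, whereas a totalChildren-style count would only have reported the three leaf rows.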

Well, this works perfectly, but doesn't that mean we need to use this for each grid? There is a lot of code duplication. Let's first put the template in the cache so we can reuse it:
module.run(["$templateCache", function ($templateCache) {

    $templateCache.put("aggregateCountTemplate.html",
        "<div ng-click=\"row.toggleExpand()\" ng-style=\"rowStyle(row)\" class=\"ngAggregate\">" +
        "    <span class=\"ngAggregateText\">{{row.label CUSTOM_FILTERS}} ({{aggFunc(row)}} {{AggItemsLabel}})</span>" +
        "    <div class=\"{{row.aggClass()}}\"></div>" +
        "</div>"
    );

}]);
Now the gridOptions change to
$scope.gridOptions = {
    [...]
    aggregateTemplate: "aggregateCountTemplate.html"
};
and we can reuse the template anytime we want. Alternatively we can create a file and reference it, without using the cache:
$scope.gridOptions = {
    [...]
    aggregateTemplate: "/templates/aggregateCountTemplate.html"
};

Now, it would be nicer if we could replace the aggFunc function with a row function, adding it to ngAggregate.prototype. Unfortunately, we cannot do that, since ngAggregate is a 'private' object. The only thing we can do is add some sort of static function. The solution is to add it to the root scope, so that it is available everywhere.

Solution 2:

Here is the content of the file aggregateCountTemplateCache.js, that I created and load every time in the site. It does two things: inject the function in the root scope of the application and add the template to the cache. The only other thing to do is to use the aggregateTemplate: "aggregateCountTemplate.html" grid options.
var module = angular.module('app', ['ngResource', 'ui.bootstrap', 'ui', 'ngGrid', 'importTreeFilter', 'ui.date', 'SharedServices', 'ui.autocomplete']);

module.run(["$templateCache", "$rootScope", function ($templateCache, $rootScope) {

    $rootScope.aggregateCountFunc = function (row) {
        var total = 0;
        angular.forEach(row.children, function (entry) {
            total += entry.entity.Count;
        });
        angular.forEach(row.aggChildren, function (entry) {
            total += $rootScope.aggregateCountFunc(entry);
        });
        return total;
    };

    $templateCache.put("aggregateCountTemplate.html",
        "<div ng-click=\"row.toggleExpand()\" ng-style=\"rowStyle(row)\" class=\"ngAggregate\">" +
        "    <span class=\"ngAggregateText\">{{row.label CUSTOM_FILTERS}} ({{aggregateCountFunc(row)}} {{AggItemsLabel}})</span>" +
        "    <div class=\"{{row.aggClass()}}\"></div>" +
        "</div>"
    );

}]);

Enjoy!

I was working on a pretty nice task that involved translating the text in a page in real time. For this I created a function that would work its magic on elements that were added to or changed in the page. On certain pages it ran abysmally slowly and I had no idea why. So I went to profile the thing and was shocked to see that the problem did not come from my long piece of code, but from a simple encapsulation of an element in a jQuery object. I was using it only to have a nicer interface for getting the name of the element and changing an attribute. Here is the code:
var j = jQuery(elem);
if (j.is('img[alt]')) {
    j.attr('alt', translate(j.attr('alt')));
}

Replaced it with:
if (/^img$/i.test(elem.tagName)) {
    var alt = elem.getAttribute('alt');
    if (alt) {
        elem.setAttribute('alt', translate(alt));
    }
}

And it worked very fast indeed. The element might have been body, so maybe the encapsulation also tries to parse the children, or perhaps the problem was fixed in later versions of the library. Still, think about how many times we have used this kind of code without thinking twice about it. Think twice about it! :)
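The lesson generalizes: in hot paths, touch the native DOM API directly instead of wrapping every element. Here is the fast path above factored into a small function; translateImgAlt is my name for illustration, and translate stands for whatever translation function the page uses:

```javascript
// The fast variant factored out: only native DOM calls (tagName,
// getAttribute, setAttribute), no per-element jQuery allocation.
// Returns true when the alt attribute was actually translated.
function translateImgAlt(elem, translate) {
    if (!/^img$/i.test(elem.tagName)) return false;
    var alt = elem.getAttribute('alt');
    if (!alt) return false;
    elem.setAttribute('alt', translate(alt));
    return true;
}
```

Because it only relies on three standard DOM members, it can even be exercised outside a browser with a stub element object.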