This is a great film. It shows what it would look like to travel at the speed of light from the Sun to Jupiter. It takes 45 minutes (the length of the film) and it is one of the few things that really show how space is: how large, empty and unforgiving it is, and how small and almost insignificant the little islands of dirt we are fixated on are. Don't worry, you have a little indicator of when something is about to appear, so you can fast forward. Really loved this. I only wish it had continued to Pluto and beyond (just so you understand how awesome the New Horizons probe really is).

I was using a JavaScript function that I wanted to accept either an array of arguments or just a bunch of arguments that would be interpreted as an array. As you probably know, inside every function in JS you get a variable called arguments which contains the arguments of the call. It is a pseudo array, not a real array, and has some extra properties. My code looked like this:
function x(arg) {
    if (!arg) {
        arg = [];
    } else if (!arg.length) {
        arg = [];
        for (var i = 0; i < arguments.length; i++) arg.push(arguments[i]);
    }
    // do something with arg
}

The logic is simple: if there is no first parameter, arg becomes an empty array; else, if there is a first argument but it doesn't have a length property (so it's not an array), set arg to an array and push all the arguments of the function as items of that array. But it doesn't work! The catch is this: in non-strict mode the arguments object is live-linked to the named parameters, so the moment you set arg to an empty array, arguments[0] is no longer the original argument, but the empty array itself. Even worse, the code then pushes the array as an item of itself, which makes the object infinitely recursive.

Let's make this simpler:
function x(arg) {
    arg = [];
    console.log(arguments[0]);
}
After you execute x() with any arguments, the console will show an empty array, not the original argument. Weird, huh?
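
One way around it (a minimal sketch, not the only fix) is to copy arguments into a real array before assigning anything to the named parameter; alternatively, adding 'use strict' to the function severs the live link between arguments and the parameters:

function x(arg) {
    // copy the pseudo array into a real array first,
    // before any assignment can clobber arguments[0]
    var args = Array.prototype.slice.call(arguments);
    if (args.length == 1 && args[0] && args[0].length) {
        args = args[0]; // a single array argument was passed
    }
    // do something with args
}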

Childhood's End is a really interesting book. It is actually a series of medium-length stories set in the same world. It starts with a sort of Independence Day kind of invasion - a peaceful one, though - continues with the effects on the human world after decades have passed, and then ends with a human evolutionary leap that explains the entire book so far. The reason why I loved it so much is that a vast majority of the ideas in the book have withstood the test of time. Written in 1953, Childhood's End is remarkably modern and rational. Well, maybe today's world is not particularly rational, so maybe it is more modern than the present, which is remarkable :)

Arthur C. Clarke seems to have had an obsession with alien encounters - and by this I mean advanced species that have a good reason to come to Earth, other than wanting to steal our water or mine our gold or some other stupid thing like that. He wrote 2001: A Space Odyssey and Rendezvous with Rama, both about humans suffering culture shock after a meeting with an alien species. Personally I think Rendezvous with Rama should not have had sequels, perhaps not even Space Odyssey; it seems to me that Clarke continued stories that had great success, rather than stories that needed continuing.

Anyway, Childhood's End is like that: alien creatures just oversee the evolution of our species on Earth, intervening only on rare occasions. I loved the idea because it is a quick and dirty sci-fi solution for historical and all too present issues like borders, religion, corruption, politics and all those ugly things that appear like magic when enough people get together. I also loved the kind of Christian metaphor of daemons being directed to oversee and guide the human race, without being privy to "God's grace", so to speak.

It is not an easy book to finish, as it isn't really an emotional story. There are no heroes that one can identify with; it is just a descriptive, rational, logical narration. It is a good book, though, one that I am glad to have listened to as an audiobook.


Sometimes you want to run your browser without some of the protections that are enabled in it by default. One of these is the cross-origin protection when running a script from the local filesystem. For Chrome, the solution is to run the browser with the command line switch --allow-file-access-from-files. Seems straightforward enough, but if your browser is already open (usually I have at least 10 active tabs, ranging from documentation pages and email to the music I listen to), the command line switches will be ignored and your script will run as just another window in the same instance of Chrome. In order to fix this, you need to use another switch called --user-data-dir. Just make sure the folder it points to exists and can be deleted afterwards (because it will be filled with a zillion useless files).

So, how do I run a test.html file that can access files and is located in C:\SomePath? Like this:
start "Chrome with access to files" chrome --allow-file-access-from-files "file:///C:\SomePath\test.html" --user-data-dir="C:\SomePath\UserDir"


A UserDir folder will be created in your path, which you can delete after you finish your work.

Of course, this issue applies to any other flags that you want to use ad-hoc while Chrome is already open.
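
For example (an ad-hoc session of my own; --disable-web-security is another real Chrome switch), you can disable web security in a separate instance the same way:
start "Chrome without web security" chrome --disable-web-security --user-data-dir="C:\SomePath\UserDir2"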

I was trying to write a simple query in MySQL that inserted a row only if there wasn't one already there, else updated the existing one. MySQL has a system to take care of that for you, provided you have a primary key or a unique index on the table you want to insert into or update: you can choose between INSERT ... ON DUPLICATE KEY UPDATE and REPLACE. However, what if you don't want to add an index on your table just for that?
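
For reference, this is what the indexed route I'm giving up would look like (a sketch, assuming a unique index on column y):
INSERT INTO MyTable (x,y) VALUES(1,2)
ON DUPLICATE KEY UPDATE x=x+1;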

T-SQL (Microsoft SQL Server) has a syntax that works like this:
IF EXISTS(SELECT * FROM MyTable WHERE y=2)
    UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
    INSERT INTO MyTable (x,y) VALUES(1,2);

MySQL also has an IF...THEN syntax as well as an EXISTS statement. The problem is that they work in a completely different way: IF has an extra THEN keyword, uses ELSEIF instead of ELSE IF and needs to end with END IF, while EXISTS works only in WHERE clauses. So, let's translate the query above into MySQL and hope for the best:
INSERT INTO MyTable (x,y) 
SELECT 0,2 FROM DUAL
WHERE NOT EXISTS(SELECT * FROM MyTable WHERE y=2);
UPDATE MyTable SET x=x+1 WHERE y=2;

Notice several things: I am not using any IF and I am updating rows every time after the conditional insert. I am selecting values from DUAL because in MySQL any select with a WHERE clause needs a table, and DUAL is a dummy one built into the engine (SELECT 1 WHERE 2=2; is not valid in MySQL). Also, I am inserting a value and then immediately updating it, which is not the most efficient.

There is another way, closer to the original T-SQL query, but it doesn't use EXISTS. It looks like this:
IF (SELECT 1=1 FROM MyTable WHERE y=2) THEN
    UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
    INSERT INTO MyTable (x,y) VALUES(1,2);
END IF;

Things to notice here: I am using 1=1 to return a TRUE value that makes the IF work. Also, PAY ATTENTION!, this doesn't work as a simple query. I spent the better part of an hour trying to understand where the syntax error was while trying to execute it directly. Any flow control operations like IF or WHILE, etc., are only valid in "programs", the MySQL term for stored procedures and functions.
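
So, to actually run it, wrap it in a stored procedure; a minimal sketch (the procedure name is my own):

DELIMITER //
CREATE PROCEDURE UpsertMyTable()
BEGIN
    IF (SELECT 1=1 FROM MyTable WHERE y=2) THEN
        UPDATE MyTable SET x=x+1 WHERE y=2;
    ELSE
        INSERT INTO MyTable (x,y) VALUES(1,2);
    END IF;
END //
DELIMITER ;

CALL UpsertMyTable();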

I hope I clarified things for you on this.

As you know, I have been watching a lot of TV series, some of them good, some of them bad, most of them a complete waste of time. As a New Year's resolution (yeah, I know, lame) I have decided to create a slot system for TV series. Thus, from now on I will switch to a 7/3 method, meaning that I will watch only seven TV shows regularly and reserve three slots for new series, just in order to determine if they are more worth watching than the current ones. If I find a series that needs to go into the magical seven, I have to bump another one out. Thus, this is the last post about TV series in this format.



So here is what I have been watching lately:
  • The Legend of Korra - The fourth and final season ended recently. It was a nice ending to an otherwise boring or annoying series. I maintain my opinion that The Last Airbender series was orders of magnitude better.
  • The Good Wife - The story of the legal troubles of Cary Agos is in the foreground, while Alicia's juggling of family, politics and job reaches ridiculous levels. Some interesting moral conundrums are created, but for me this season is kind of unfulfilling.
  • Homeland - The fourth season focuses on an Afghan terrorist who, with the help of Pakistani intelligence, is wreaking havoc on the Americans. Some interesting dynamics, but far from the edge-of-the-seat feeling you got from Sergeant Brody's story arc.
  • Gotham - I really expected this show to suck. Instead we get solid performances and more or less believable plots. It is acceptably dark for events happening in Gotham, as well. The girlfriend part is really annoying me, but the rest is good.
  • The Honourable Woman - The show is really dark and depressing. So much, in fact, that I couldn't watch it as much as I probably should have. I always have issues with stories that show indomitable and incorruptible heroes, but the alternative extreme, where everything is gloomy and hopeless, is not much better. Just like too much fantasy: if it can't connect to reality, it is just as unengaging.
  • The Witches of East End - The series was cancelled before a third season, leaving everything in limbo, with no closure. I think this kind of behavior should be made illegal. No matter how small, the investment of the viewer in a story needs to be repaid. What would you do if you went to the cinema, watched a movie, and it suddenly stopped before the ending, saying that the producers didn't have the money to finish it? Wouldn't you ask for your money back?
  • Tyrant - Weird story, kind of hard to swallow, but interesting in many respects. The show has been renewed for a second season.
  • Extant - The series will get a second season, but I will not watch it. The show is a clumsy cocktail of sci-fi cliches, all thrown together while expecting Halle Berry to hold them together.
  • The Bridge - The series has been cancelled after the second season, while I am still waiting to watch it. I love Diane Kruger, so I probably will watch it someday.
  • Ghost in the Shell - Arise - Solid reboot. Too bad it has only four episodes which pretty much retell the same story. Meanwhile, a GiTS movie is in the works, starring Scarlett Johansson.
  • The Strain - the show is pretty good. I kind of dislike the main character, while I totally love Kevin Durand. The series has been renewed for a second season.
  • Longmire - Loved the first seasons, the last was kind of over the top and it showed. The show was cancelled.
  • The Lottery - I've decided, more or less by not feeling like watching it at any time, to stop watching it. I haven't really watched enough of it to make a rational impression, but the pilot totally threw me off. Besides, this show was also cancelled.
  • Manhattan - The show has been renewed for a second season. I have mixed feelings about it. On one hand I like the story, but at the same time I am a bit put off by the fact that is complete fabrication.
  • Legends - The show's concept is a lot cooler than its implementation. I like Sean Bean's interpretation, but the story is very similar to a zillion cop/government agency shows. They need to focus more on his character, rather than on pointless villains. The show has been renewed for a second season.
  • Outlander - They cancelled it! Good actors, wonderful scenery, an interesting story. A great shame.
  • The Divide - I've stopped watching it after a while, but I can't tell you why, exactly. I will remove it from my watch list. It had been cancelled, anyway.
  • The Knick - Renewed for a second season, this show is great. Good acting, nice depiction of the times, unapologetic critique of pharmaceutical companies and human nature in general. I love it!
  • Doctor Who - What are you doing, Capaldi?! I think this season of Doctor Who was the most boring and pointless of them all. Even the Christmas Special was bad. Something has to be done; if it goes on like this, Doctor Who will take another decade-long hiatus.
  • Forever - It is difficult to reject this series outright, because I really like the actors. However the story itself is completely boring. Other than this quirk of the main character that he cannot die, the show is a standard cop thing. You know who else cannot die? Deadpool! Isn't that slightly more interesting? Also the show is likely to get cancelled.
  • Hysteria and Hand of God - I love Ron Perlman, but that doesn't mean I liked the Hand of God pilot more than the one from Hysteria. Unfortunately Hand of God will probably get picked up for a series, while Hysteria does not.
  • Haven - It becomes increasingly difficult to give up series as you watch them for more and more time. Unfortunately for Haven, which was never good to begin with, the time has come for the "mother of Audrey/Mara" to appear. When family members appear in a story, it usually means they've run out of ideas. The fifth season will get some extra episodes in order to give viewers closure, afterwards it will probably get cancelled. See? This is how you do it when you know you are cancelling a show!
  • Z Nation - This SyFy clone of The Walking Dead should have sucked ass. Instead, it is a low budget yet fun show, with a lot of tongue in cheek and also original ideas. Who would have thought SyFy could do something entertaining? If you divide entertainment value by show budget, Z Nation clearly wins over The Walking Dead.
  • Madam Secretary - Oh, the level of obnoxious American propaganda and overall stupidity of this show is off the charts. I refuse to watch this filth. And it appears it will get renewed as well. Ugh!
  • Sleepy Hollow - The second season has just started. This show will never be good, just admit it. Its value is purely guilty pleasure.
  • The Driver - A BBC One miniseries about a cab driver recruited by the mob, starring David Morrissey and Colm Meaney. The story is not new, the acting is good, but the show... just doesn't do it for me. Sorry.
  • Arrow - Oh, the comic book studios are doing something interesting. Marvel's "phase 3" operation involves spamming the big and small screens with series and films about their superheroes, and DC is following the same playbook. Now, Arrow is not a great show, and everybody knows it, however they started doing crossover episodes with another new DC show, The Flash, and probably some of the story ideas will be found or hinted at in the films. Already some things that happened in the Captain America movies are found and expanded upon in Marvel's Agents of S.H.I.E.L.D. and Marvel's Agent Carter series. Standalone, Arrow went in the "friends and family" direction, which I despise and personally think means they lack any original ideas. I know that there are some comics that need to be taken into consideration when creating the scripts, but still.
  • South Park - South Park sputtered lately, going from meh! to really funny from episode to episode. The Freemium episode, for example, was really cool. I remain a fan.
  • Stalker - Another police procedural with a twist. I kind of like it, but I wouldn't recommend it, if you know what I mean. The show focuses on stalkers, with the small twist that the main character (and member of the task force that fights them) seems to be one himself. Too bad that they broke the tension on that one. I think I like it because of the lead actors.
  • Marvel's Agents of Shield - Yes, I know it's an acronym, but I am tired of typing S.H.I.E.L.D. all the time. The show is pretty good! It does go into the whole "dad and mom" territory, but not too much. I will continue watching it.
  • Marvel's Agent Carter - Another Marvel TV show, this time about agent Carter, the girlfriend of Captain America. Set in the 1940's, it also has to deal with the sexism of the period. Sexy actress, some interesting characters, this shows promise.
  • Our Girl - This is an interesting series about a young woman who joins the British army. She has to deal with a stupid and petty but well meaning family, fighting in the Middle East and also a love triangle (what else where a woman is concerned?). No, seriously, it's better than Twilight.
  • Tim and Eric's Bedtime Stories - In an era where big budget horror TV series abound, here is this minimalist standalone episode show. And it is fucking scary! The episodes are very short and involve usually similar worlds to ours. No special effects, monsters or whatever, but the subjects are really unsettling and powerful. A must see!
  • Black Dynamite - The second season doesn't seem as funny as the first, but it is still a lot of fun and teaches me many things about 1970s America.
  • The 100 - I couldn't make myself watch the second season. That probably means that I feel it sucks. However I still might watch it...
  • Constantine - A series based on the Constantine comics and film. Keanu Reeves has been replaced by a skinny British wanker with delusions of grandeur. Actually, the character is a bit more layered than that, but not by much. I had high hopes for the show, but I have to admit that it is subpar and I don't like it.
  • Star Wars Rebels - At first I really hated this animated series. I thought it was too childish. Then I realized that the entire Star Wars franchise is made for little children. Really, this settles the Star Wars/Trek debate for me: the intended audience ages are slightly different, in the sense that Star Trek contains elements that have meaning for adults, as well >:-D.
  • Ripper Street - Ripper Street got cancelled, then got revived by Amazon Prime, and it is now at season 3. I have to admit that the direction of the show feels strained right now. I hope it picks up pace, because I really loved the first two seasons
  • State of Affairs - Another series that tries to be apologetic for US foreign policy, it portrays romantic comedy star Katherine Heigl as a CIA executive, engaged to the son of the US president, who is a black woman, and who has dark secrets threatening to be revealed. I couldn't watch more than two episodes. It is just as surreal as her romantic films: all nice and pink, unless it's about non-Americans. They are bad!
  • The Shadow Line - Dark British police/political thriller, it is a miniseries, meaning it ends without having "seasons" and it is pretty amazing. Good acting, interesting story. It came highly recommended and it didn't disappoint.
  • Babylon - The series had an interesting pilot a while ago. I liked it, now it got picked up as a series. I want to watch it, but I haven't started yet.
  • Marco Polo - a series about the explorer Marco Polo, left by his father at the court of Kublai Khan. I like the actors and the show. The story is also interesting, showing the Mongols as more than invading horse-riding barbarians; instead they are a nation ruling half of China, with expansion plans covering the entire known world.
  • The Newsroom - The third season was short and was meant to give some sort of closure to the ending of the show. However it was so horribly weak, especially the last episode, which was made out of fragments of episodes from the first two seasons, that I ended up hating it more than I already did. Good riddance!
  • The Librarians - Horrible show, trying to serialize an obscure film that wasn't that good to begin with. Don't watch it!
  • Scorpion - Leaving aside the alleged basis in reality that the show has, it is another cop show, with "geniuses" helping the FBI. Unlike Numb3rs, which had the same idea, this show is not very good at all. Characters are difficult to empathize with and the whole "normal mom helping the helpers" thing is just... offensive.
  • The Musketeers - You will close your browser window in anger for wasting your time reading this far, but... I like this show. It is silly and has little to do with Dumas' creation, but I enjoy watching it. It probably has to do with the generously endowed beauties who all seem to like D'Artagnan.
  • Elementary - I really like both Lucy Liu and Jonny Lee Miller, but this show has gone to shit. Having nothing to do with deductive skills anymore, it has turned into yet another police procedural, with brilliant people helping the police.
  • Broadchurch - Haven't yet started watching the second season - yes! there is a second season. One that is not the Gracepoint American redo starring David Tennant. I believe Tennant has to feel a bit Doctor Whoish when he is starring in both the American remake and the second season of the original... BTW, Gracepoint got cancelled!
  • Ascension - A SyFy miniseries, in the sense that it has only three episodes per season, it involves a ship that was clandestinely sent to another solar system by Kennedy! At first I was all "what?! How can anyone even think of it?", but then the reason for it all got revealed. It also features a *really annoying* little girl that has mental superpowers and, what else?, a government conspiracy. Once you get the hang of it, it is pretty nice and the human branching from a 1950's United States is a nice twist. I am awaiting the second season. Tricia Helfer is almost as cool as in BSG
  • Banshee - The third season started and it is just as satisfying as the first two. I don't know what sort of human button this show pushes, but I love it and have absolutely no idea why.

Now all I have to do is choose 7 series out of this list and implement my resolution. This list alone contains 50 series and I haven't been including new series that I haven't started watching and shows that continue in the summer season. It will be difficult, but necessary. If we consider an average of 10 episodes per season, one hour each, that means I use up around 10 hours of my time each season per series. If I remove 40 from my list, that means a staggering 400 hours of time freed per year, more than an hour per day. Of course, I will use this time to watch movies, which don't have an upper bound :) Let's see how it goes.


Dan Ariely is a professor of behavioral economics, the field that tries to analyse economics via studies of human behavior. In his book, Predictably Irrational - The Hidden Forces That Shape Our Decisions, he argues that the simple model of market forces constraining people to behave rationally to maximize gain is false, as people are not rational and never will be. He then goes on to explain various mental fallacies that we are subject to, complete with experiments testing and proving them.

The book is rather short and easy to read, split into 15 chapters and some annexes. Here is a summary:

  • Chapter 1: The Truth About Relativity - People assess the value of something relative to something else that is known. Thus people can be "primed" by being exposed to items that are priced in a certain way, influencing the value they give to something else. Think of supermarkets and the three categories of items: cheap stuff, expensive stuff and one item that is insanely expensive. Relativity will make people buy the expensive stuff.
  • Chapter 2: The Fallacy of Supply and Demand - Again, value is not really an objective thing, coming from supply and demand, but is set by comparison to other things. The example given is that of black pearls: at first no one would buy them, but the importer chose the most beautiful ones, created a line of insanely expensive jewels and advertised them everywhere. Soon black pearls were in demand and at a much higher price than normal pearls.
  • Chapter 3: The Cost of Zero Cost - Ariely argues that zero is a price in a category of its own. He describes an experiment where people have to choose between average and good candy, asked to pay 2 cents and 14 cents respectively. People overwhelmingly choose the good candy, since the price is not that high. However, when the price of the average candy drops to nothing and that of the good candy to 12 cents (so the financial gain is the same for the same quantity), people switch sides and take the average candy.
  • Chapter 4: The Cost of Social Norms - One of the most interesting chapters for me, this discusses how people function on social norms until someone introduces market norms (tit for tat), in which case the social norms go out the window and the situation may even become really embarrassing socially. Imagine Thanksgiving dinner at your mother-in-law's house: nice roasted turkey, good wine, your wife's recipe was used for the delicious dessert, everybody is happy. What happens if the man thanks the hosts and attempts to give them money to cover the expense? A lot of interesting experiments expand on the concept.
  • Chapter 5: The Power of a Free Cookie - A reverse of Chapter 4, it considers what happens when you get something for free as opposed to having to pay for it. When a colleague comes to the office and brings cookies, people take one or two, taking into consideration that other people in the office need to get a cookie. However, if people are asked to pay a cent on the cookies, they usually take more, again market rules trumping social norms when money is involved.
  • Chapter 6: The Influence of Arousal - Rather funny chapter, but really interesting. It shows that people, when sexually aroused, change their behavior significantly. That is not a surprise, but that change is so large that those people themselves cannot predict what they will actually do. Consider this when you rationally "plan" on how you are going to behave when exposed to temptation or strong emotions.
  • Chapter 7: The Problem of Procrastination and Self Control - People tend to value the present much more than the future. People plan to save money or lose weight, but are deflected by present moment temptations. Can something be done about it?
  • Chapter 8: The High Price of Ownership - People tend to overvalue the things they already possess. In an experiment, Ariely proves that people would not buy the things that they are trying to sell at the price that they would themselves ask. This is used in economics when people are offered the option of "trying out" something and only when they actually "have" the item, decide if they want to give it back.
  • Chapter 9: Keeping Doors Open - One AI researcher came up with the idea that intelligent behavior arises spontaneously when trying to maximize the number of available options. Ariely argues that this kind of behavior is not intelligent at all. He does clever experiments with doors that disappear if not opened within an interval of time and observes that people periodically open them just in order to keep them there, even if they gain less by not visiting more lucrative rooms.
  • Chapter 10: The Effect of Expectations - This chapter seems to be incomplete. It is argued that if you expect to enjoy or not enjoy something, the enjoyment will be proportional to the expectation. Personally I feel that this only happens in cases where people can't really tell the difference between good and bad. I often face the opposite effect when watching a movie that I expect to be good and end up hating it when it is merely average.
  • Chapter 11: The Power of Price - Similar to Chapter 10, this shows how we feel we get more from something that is priced higher. The placebo effect is also discussed here. Interesting, indeed.
  • Chapter 12: The Cycle of Distrust - Economics says that there can be no hundred dollar bills on the ground because someone would have picked them up already. Making fun of this view on things, Ariely discusses dishonest offers that look like great deals, but instead are taking advantage of your gullibility. He argues that originally trustful people quickly lose that trust when cheated and it is hard to get it back. He gives an interesting example where they installed a stall offering free money. Only about one in five people even approached it.
  • Chapter 13: The Context of Our Character part 1 - Both chapters discuss the level of human dishonesty, and show that circumstances change the amount considerably. In an experiment he gives people the chance to cheat after doing some word memory tests, but people hardly cheat at all if the words were related to honesty or moral codes.
  • Chapter 14: The Context of Our Character part 2 - In this it is shown that people are more likely to cheat if they can rationalize the value of the thing they steal. A concrete example is that they are less likely to steal money than something one step apart from money, like a worthless token. The difference is quite significant.
  • Chapter 15: Beer and Free Lunches - A kind of goodbye chapter, this shows how people are influenced in their choices by what other people in their group choose. He describes a clever experiment where people order one of several types of beer, either publicly or on a piece of paper. Depending on the culture, they choose either differently from or similarly to what the people before them chose.


Overall I found the book informative. If one could integrate the teachings of the book, the benefit to one's life would be great. Unfortunately, Ariely shows that these kinds of rational illusions are predictable and that people need to make great efforts to dispel them. I leave you with a video presentation by Dan Ariely at TED, just to give you a taste of what he is like and what he does.

[youtube:9X68dm92HVI]

The Martian is a short and easy to read book about a guy abandoned on Mars by mistake. Andy Weir writes most of the book as the astronaut's log entries, but in a colloquial and funny way. I started reading the book because a lot of people have praised it and there is also a Ridley Scott movie being made from it. I hope it won't suck (*cough*Interstellar*cough*).

What I found really appealing is not only the subject matter and all the science and engineering involved, but also the positive attitude of the main character in the face of adversity. All that science and engineering was cool too, though :) and presented in an easy to digest way (there are no equations anywhere :-P). After a while it becomes difficult to suspend disbelief, since there is accident after accident and Mars is really painted as the bad guy, trying to kill the protagonist, yet he always finds a way out of the problem at hand. I mean, if they make the movie follow the book, they need Matt Damon in it just because he has all that Bourne training and he'll need it to survive the set. Yet one cannot help rooting for Mark Watney anyway. There are some politics involved as well, but not that much; basically NASA is presented as an organization of scientists who want to get the job done, even if some are more cautious than others.

In summary, I think this is a book that any space geek should read. I finished it in two days, so it's not really a waste of anyone's time.

Merry Christmas and a Happy New Year, all!

I thought I would write this post as a Christmas present and let you know about this very cool book. I've read the Harry Potter books, so I could appreciate this book more, as it is fan fiction; however, it can be read separately. My recommendation, though, is to know at least something about the Harry Potter universe (like having watched the movies).

That being said, imagine that Harry Potter had been a slightly different person, a combination of Sherlock Holmes and Richard Feynman, maybe. Highly intelligent, having read a lot of really serious books and having understood and integrated them into his own philosophy, this Harry Potter goes to the Hogwarts school of magic for two reasons: to apply the scientific method to magic and, having discovered its mysteries, to take over the world and stave off death.

Imagine what such a character would do with the stern and moralizing lectures of Dark Ages tutors and you can see why this book is really, really funny. But don't take it all as satire. The references, both serious and in jest, are real. The teachings and methods applied are real. All in all, of every book about children and young adult heroes that I've read so far, this one presents the best role model yet! And I include Ender's Game here (which is also referenced in the book when Harry's adoptive father compares the two - hilarious).

I would like to say that I've finished the book, but I can't actually do that. That is because the book is not something put on paper and sold by a publisher; instead it is an ongoing story offered freely on a blog-like site. Yes, you can read it online. And I have read all that has been written so far and, if you consider the parallel universe of the original Harry Potter books, we are about a book and a half in.
Update: I have actually finished the book. :( The writer actually finished writing it after 122 chapters. The ending was pretty cool, too, but I really wanted more. He writes "This is the end of Harry Potter and the Methods of Rationality. I will write no sequel myself; I have said what I set out to say, and it is done. You have my enthusiastic consent to write within this universe yourself, if you wish.".

So, the bottom line is that I love this book, even if it is a little inconsistent, in the sense that the style and the ideas do not keep the same sort of rhythm throughout. You can read it at its completely free site: HPMOR. Its author, Eliezer S. Yudkowsky, is a scholar of human rationality and artificial intelligence. I don't know much about him, but let me tell you that, after reading Harry Potter and the Methods of Rationality, I am eager to familiarize myself with his work. I highly recommend this epic undertaking, which probably started as a satire, until its characters gained enough consistency to define their own solid and inspiring story.

Right now there is a battle raging on that few of us are aware of. It is for one thing only, and that is control over the Internet, control over communications between people, whether it is a discussion about a two tiered Internet, one free and one paid, or a ban instituted by a government or another on sites that are considered bad for you. It started as it usually does, with governments and corporations trying to get as much of the pie as possible. Only something was different: the Internet is so basic, so flexible, that the companies regulating its use and owning the hardware it runs on cannot control its flow or its direction. And as great strides have been made by intelligence and commercial entities alike to control the content and to track the use, equally great strides have been made by individuals to conceal the use and escape monitoring and censorship. The biggest and most touted mechanism that allows anonymity on the Internet is called TOR, The Onion Router, and its concept is simple: encrypt all communications and randomly route requests through the TOR nodes so that the origin of the access is next to impossible to find. There are other, less known methods of doing this, but TOR is the most used and the most known. It is mostly used as a proxy to anonymize normal Internet access, though, and very few people are actually using TOR to access TOR services only.

I am here to tell you that, first, TOR is not enough and, second, that no other software will ever be enough for this kind of use. You see, the TOR nodes I was talking about are people using TOR on their computers and allowing other people to access the "normal" Internet through them. A lot of the TOR exit nodes that form the border between the anonymous TOR world and the transparent Internet are heavily monitored by everyone interested, if not actually run by them from the beginning. Like in that old example where the FBI was running an IP anonymizing proxy, those exit points are the weak spot of the TOR network. Another flaw is the fact that it works as a proxy for normal IP protocols. Some software (Bittorrent, for example) openly sends the originating IP in its data, so it doesn't matter if you go through TOR to download stuff, your IP is still there for the world to see. Since you cannot trust all software that runs on your computer, you cannot completely trust using TOR as a proxy for anonymous Internet access.

The solution, I believe, is to implement the anonymizing and encryption features in the Internet itself. Make it so that there is no address for any of its users, or if there is one, make it something temporary that you assign for one connection or another and can easily recreate and change. Do it in such a manner that no one can control the DNS servers and the naming schemes, so that you can call your web site whatever you want, not have to pay for it, and be able to host it without broadcasting to the world where you are. The problems in implementing this are major, but not insurmountable. One of them is that encryption and complicated routing significantly decrease access speed. However, given the speed of the Internet today, that is not really a big problem anymore.

My thesis is that if freedom of speech, true freedom of speech, were implemented in a technical way, unbiased by any other rule than that you are free to communicate without fear, then no amount of intimidation would be able to break it. As always when human politics have encroached on the territory of personal freedom, the only solution is usually technical, at least since Gutenberg made his printing press and probably way before that.

I am myself not skilled enough to think of all the aspects of such a new protocol for the Internet. I am also pretty sure that opposition to any attempt to build it will be huge. But what if we, technical people, got together and made this work? Borrowing parts from the enormously successful TOR, Bittorrent and Bitcoin, we could architect freedom rather than just talk about it in the context of some war or another. Think about it the next time when, in your free country, you get arrested for saying what you believe in or sharing what you know, or you try to access a site and find that it is not there anymore, not for you at least.


Last year there were three very good US political shows: Homeland, of course, then The Americans, which has two KGB agents pretending to be US citizens as its main protagonists, as well as The Assets, which was not that good, but was about Aldrich Ames, the infamous CIA agent who sold secrets to the Russians. All of these shows presented various intelligence services doing their best, in good conscience, to further the interests of their countries. Motivated people, some you might love, some you might hate, but all doing things for the right reasons. Unfortunately, this year seems to be plagued by disgustingly propagandistic shows like Madam Secretary and State of Affairs, bent on showing the US as spotless white knights and their enemies as faceless evil villains.

Both seemingly want to put forward strong female characters with real power and responsibility, but they do it in such a sledgehammer way that it makes me cringe. Madam Secretary is about an ex-CIA, now university intellectual, woman who becomes US Secretary of State after an unforeseen accident befalls the real secretary. The role is played by the beautiful Téa Leoni, and the show presents the entire US administration as a bunch of do-gooders, plagued by the moral consequences of using their power and often having to sidestep their conscience in order to save the world from the boogie man. Not only a powerful woman at work, she is also a mother, having to take care of her daughter and solve family problems together with her teacher husband. The entire image portrayed by the series is so artificial that you can actually see the pink dripping from where it was forcefully painted over all the black bits.

State of Affairs has just started. The lead is Grey's Anatomy star Katherine Heigl, also playing a CIA woman, one with a direct professional and personal connection to the US president, who is a black woman. Her fiance, none other than the president's son, was killed right in front of her by terrorists. In the first three episodes she has to make decisions to thwart the efforts of: Arab terrorists abducting American doctors and threatening them with decapitation, Russian submarines stealing American secrets by tapping undersea fiber optics, and Boko Haram terrorists kidnapping school girls. Meanwhile she is being torn by the fact that the guy who killed her fiance was a former asset of hers. She doesn't tell the president because... she would break her oaths to the CIA. The show has some darkness in it, but it is also artificial, as some hidden entity holds proof that would incriminate her and a shady character who might be good or bad or both circles like a vulture. Soon enough we'll discover her father is somehow involved and her fiance is not dead and is actually her brother or something.

To its credit, after exhausting the original material, Homeland is not necessarily worse. There is also another show which I cannot yet say if I like or not, called The Honourable Woman. Great cast: Maggie Gyllenhaal, Stephen Rea (who is not a werewolf in this one... yet :D ) and others. It is a US/UK coproduction, really atmospheric, really dark, but also a bit slow and obtuse. I may have had my brain affected by the previous shows, so I can't yet appreciate the value of this one. It seems to mix real politics with some really heavy stuff like assassinations, the economic battlefront between Israel and Palestine, arms smuggling, etc.

The reason why I wrote this post is that sometimes I feel like TV shows follow the politics of the moment too closely, so closely in fact that many a time they seem to be slightly ahead of it. The bad guys are entities that are not yet enemies of the US or that behave in worse ways than their real life counterparts; the people in charge often have to bend the rules in order to remain the good guys, even if officially, in reality, those people have not (yet) bent those rules, and so on. Unlike war movies that try to erase or at least erode the moral debt of nations and people involved in past wars, I feel there are now films and series that inject the apology before the act has even been committed. In a period when the US intelligence apparatus is being attacked from all sides by news reports of its activities, here they are, all these stories about the good ole CIA, run by intellectual and beautiful women of power who maintain their morality like the good mothers of the nation that they are. Am I paranoid here?

The algorithm works perfectly well and is better than Sift3, however it's slightly more complex. You might want to start with Sift3 in order to understand where it came from.

Update November 8 2022: I found a bug in the algorithm, relating to maxDistance. I've updated the code. If you didn't use maxDistance, you are unaffected. Basically the fix is to compare temporaryDistance>maxDistance (before it was >=) and to move the calculation of the temporary distance after c1 and c2 are updated to their minimum value when a token was not found (otherwise the temporary distance might become larger than the final distance).


Try the Javascript implementation here:

[Interactive demo: choose the algorithm and MaxOffset, enter String 1 and String 2, and see the resulting distance.]

Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can now be changed via the new transpositionCostEvaluator option. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3; however, when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.
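
For example, with the extended version published below (s1 and s2 being whatever strings you compare), the option can be passed like this:

// a flat cost of 2 per transposition:
var d1 = sift4(s1, s2, 15, {
    transpositionCostEvaluator: function () { return 2; }
});
// or a cost that grows with the difference between the two offsets:
var d2 = sift4(s1, s2, 15, {
    transpositionCostEvaluator: function (c1, c2) {
        return 1 + Math.abs(c2 - c1) / 10;
    }
});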

Update 27 Mar 2015: Thanks to Emanuele Bastianelli, who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop there is an extra check to prevent the algorithm from exiting prematurely, before computing the remaining tokens.

Intro

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave me more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:

  • 33% more accurate
  • three different variants: simple, common and general
  • new concepts added
  • support for custom value and matching functions, different tokenizer functions, etc.
  • actually tested with a (slightly more) serious test
  • more robust, working better for large values of maxOffset


Before I get into the details, I am publishing the algorithm here for the moment, no Codeplex or PasteBin or GitHub or whatever. Also, it is written in Javascript for now, the C# and T-SQL versions pending. Of course, it would be great if, as before, the community of people using the algorithm went on to implement it in various programming languages; however, I am a bit apprehensive because more often than not people came up with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

New concepts in Sift4

I created a test that used random strings, but also a huge list of commonly used English phrases, as well as mutations of these strings, adding or removing small bits and so on. I then implemented Sift3, Levenshtein and the new algorithm and computed the error distance between the Levenshtein distance and the two Sift variants. This let me see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases where Sift3(a,b) is different from Sift3(b,a). These are edge cases, but this one in particular grated on me.

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, and it gets better as the value increases. The common version is a bit more complex, but the error decreases by 33% and stays low for large maxOffset values. The extended or general version receives an options object that can change almost everything, but the most important piece is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare the strings by the words in the text, maybe even their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what the value of their match is, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', it considers them transpositions and removes some of their value from the distance. Actually, looking at the implementation, it might be that I have independently discovered the Jaro distance. I will research this further. I don't know if the transposition calculation is optimal. At the moment it uses an array of all matches found up to a point, clearing it of values as the cursors move along the strings. The difference between the simple and the common versions of Sift4 is that the simple version does not compute transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed up Sift3.

Another new concept added is that of the local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non-matching characters. Each of these substrings can be used to improve the distance function. For example one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the largest common subsequence is 5 in both cases, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.
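
As an illustration (my own sketch, not part of the published options below), a localLengthEvaluator that rewards longer contiguous substrings could look like this:

// reward longer common substrings superlinearly, so that one substring
// of length 5 counts for more than one of length 3 plus one of length 2
var d = sift4("abcdex", "abcxde", 5, {
    localLengthEvaluator: function (local_cs) {
        return Math.pow(local_cs, 1.1);
    }
});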

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

The code

Simplest Sift4:

// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); //using max to bypass the need for computing transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}
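
A quick usage sketch of this simplest version:

console.log(sift4("abcdef", "abcdff", 5)); // 1 - a single different letter
console.log(sift4("ab", "ba", 5)); // 1 - the simple version adds no transposition cost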

Common Sift4:

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (maxDistance) {
                var temporaryDistance = Math.max(c1, c2) - lcss + trans;
                if (temporaryDistance > maxDistance)
                    return temporaryDistance;
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
}
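
A usage sketch with the early exit (note that the value returned on an early exit is the running distance at that point, not the final one):

// stop computing once the running distance exceeds 10; useful when
// you only care whether two strings are "close enough"
var d = sift4("lorem ipsum dolor sit amet", "an entirely different text here", 5, 10);
if (d > 10) {
    // too different; d is only the partial distance at the moment of the exit
}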


Extended/General Sift4:

// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
//         maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//         tokenizer: a function to transform strings into vectors of tokens
//        tokenMatcher: a function to determine if two tokens are matching (equal)
//        matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
//        localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
//        transpositionCostEvaluator: a function to determine the value of an individual transposition. For example longer transpositions should have a higher cost.
//        transpositionsEvaluator: a function to determine the way the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options = extend(options, {
            maxDistance: null,
            tokenizer: function (s) {
                return s ? s.split('') : [];
            },
            tokenMatcher: function (t1, t2) {
                return t1 == t2;
            },
            matchingEvaluator: function (t1, t2) {
                return 1;
            },
            localLengthEvaluator: function (local_cs) {
                return local_cs;
            },
            transpositionCostEvaluator: function (c1, c2) {
                return 1;
            },
            transpositionsEvaluator: function (lcss, trans) {
                return lcss - trans;
            }
        });

    var t1 = options.tokenizer(s1);
    var t2 = options.tokenizer(s2);

    var l1 = t1.length;
    var l2 = t2.length;

    if (l1 == 0)
        return l2;
    if (l2 == 0)
        return l1;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions
    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1], t2[c2])) {
            local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans += options.transpositionCostEvaluator(c1, c2);
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            if (options.maxDistance) {
                var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
                if (temporaryDistance > options.maxDistance)
                    return Math.round(temporaryDistance);
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}

function extend(obj, def) {
    var result = {};
    for (var prop in def) {
        if (!obj || !obj.hasOwnProperty(prop)) {
            result[prop] = def[prop];
        } else {
            result[prop] = obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result = [];
    if (!s)
        return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s)
        return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i = 0; i <= 25; i++) {
        var val = 0;
        if (s) {
            for (var j = 0; j < s.length; j++) {
                var code = s.charCodeAt(j);
                if (code == i + 65 || code == i + 97)
                    val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity > 0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l < 1)
        return l; //0 -> 0
    return l - 1 / (l + 1); //1 -> 0.5, 2-> 0.66, 9 -> 0.9
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}
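For example, to compare phrases word by word, with the words themselves matched fuzzily, the helpers above can be combined like this (a minimal sketch; the strings are just examples):

// compare phrases as sequences of words, allowing fuzzy word matches
var distance = sift4('the quick brown fox', 'the quikc brown fox', 5, {
    tokenizer: wordSplitTokenizer,
    tokenMatcher: sift4TokenMatcher,
    matchingEvaluator: sift4MatchingEvaluator
});
// 'quick' and 'quikc' match fuzzily, so the resulting distance stays small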

As always, I will be most happy to know if you used my algorithm and how it performed, as well as to receive any suggestions you might have.

Options explained

Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters, so that the algorithm works on an array of one-character tokens. Other options are possible, like splitting the strings by whitespace so that the comparisons are done on words, or transforming a string into an array of strings N characters long, the so-called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word into an array of 26 values counting the occurrences of each letter a-z in the word.
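For instance, the 2-gram tokenizer above turns a string into overlapping pairs of characters:

nGramTokenizer('siderite', 2); // ['si','id','de','er','ri','it','te']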

The tokenMatcher function returns true if two tokens are matching. They can be fuzzy matched; for example, the sift4TokenMatcher example function uses Sift inside Sift to determine the distance between two tokens and returns true if they are more than 70% similar.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length when two tokens match. The default is 1, but one can use some other metric, like the similarity of the two tokens. Of course, the common substring length loses its classical meaning when these functions change, but the variable local_cs is still used.

The localLengthEvaluator takes the length of the local common substring and returns the value that will be added to the longest common subsequence value. Usually it returns the same value it receives, but some functions could reward longer substrings.

FAQ

Q: Can you make Sift4 work case insensitively?
A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply to every token type. Or implement a case insensitive tokenMatcher, as in the sketch below.
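A minimal sketch of such a matcher:

function caseInsensitiveTokenMatcher(t1, t2) { //tokenMatcher:caseInsensitiveTokenMatcher
    return String(t1).toLowerCase() == String(t2).toLowerCase();
}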

Q: Can you make Sift4 compare strings based on their meaning, like using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces each word with the most common of its synonyms. A more complex solution would require analyzing the strings beforehand and turning them into some ordered form of synonyms or equivalent expressions, then using Sift4 with a word tokenizer (one is provided in the Extended algorithm source).
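A minimal sketch of the first approach, assuming a synonym dictionary that maps words to a canonical form (the dictionary below is made up for illustration):

var synonyms = { 'auto': 'car', 'automobile': 'car' }; //hypothetical dictionary
function synonymTokenizer(s) { //tokenizer:synonymTokenizer
    return wordSplitTokenizer(s).map(function (w) {
        w = w.toLowerCase();
        return synonyms[w] || w;
    });
}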

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course, they are different algorithms. This is a fuzzy distance calculator; it doesn't give you the exact value. There are still edge cases. The idea of Sift is to be fast and relatively accurate, rather than very accurate. If you need more accuracy, try combining Sift with Levenshtein, for example computing Levenshtein only where Sift says the strings are above a certain similarity, as in the sketch below.
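A sketch of that combination, assuming a levenshtein(s1, s2) function is available from elsewhere and picking an arbitrary 70% similarity threshold:

function hybridDistance(s1, s2) {
    var approx = sift4(s1, s2, 5);
    var maxLen = Math.max(s1.length, s2.length);
    if (maxLen == 0 || 1 - approx / maxLen < 0.7)
        return approx; //strings are too different, the approximation is good enough
    return levenshtein(s1, s2); //compute the exact distance only for similar strings
}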

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example of why maxOffset should be a parameter of the function rather than a member of the class. Since the reference implementation is JavaScript, just compute the maxOffset that is convenient to you before you compare, as in the example below.
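One possible heuristic (the numbers are arbitrary):

var maxOffset = Math.max(5, Math.round(Math.max(s1.length, s2.length) / 10));
var distance = sift4(s1, s2, maxOffset);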

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know and I will try to fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything, it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.

Q: What are the edge cases for Sift?
A: Probably there are several, but I didn't really spot them. One of them is that one might find both letters at a position matching letters at other positions, but only one will count. Take 'abxx' and 'bayy' as an example. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string, it will find 'a' matched at position 1 in the second. Both cursors are increased and lcss is increased as well. The next check compares 'b', the character at position 1 in the first string, with the character at position 2 in the second string. No match, therefore both cursors are reset to 1 and the search starts again. The 'b' match is lost and the distance is 3 instead of 2. Also, I think there might be some situations where the cursors are not equal and the biggest of them reaches the end of its string, thus terminating the algorithm even though there could have been more matches. Incidentally, I tried to fix both of these issues and the error relative to Levenshtein was not really affected, but I am not 100% sure of the implementation.
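The 'abxx' vs 'bayy' case can be checked directly with the common version:

console.log(sift4('abxx', 'bayy', 5)); // 3 - the 'b' match is lost; had both 'a' and 'b' counted, it would have been 2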

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. A symmetric function does exist, Math.min(sift4(a,b), sift4(b,a)), but it is obviously twice as slow.
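Wrapped as a helper, it might look like this:

function sift4Symmetric(s1, s2, maxOffset) {
    return Math.min(sift4(s1, s2, maxOffset), sift4(s2, s1, maxOffset));
}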

Implementations in other languages

C# implementation
public class Sift4
{
    private Options _options;

    public Sift4(Options options)
    {
        if (options == null) options = new Options();
        if (options.Tokenizer == null)
        {
            options.Tokenizer = (s) => s == null
            ? new string[0]
            : s.ToCharArray().Select(c => c.ToString()).ToArray();
        }
        if (options.TokenMatcher == null)
        {
            options.TokenMatcher = (t1, t2) => t1 == t2;
        }
        if (options.MatchingEvaluator == null)
        {
            options.MatchingEvaluator = (t1, t2) => 1;
        }
        if (options.LocalLengthEvaluator == null)
        {
            options.LocalLengthEvaluator = (l) => l;
        }
        if (options.TranspositionCostEvaluator == null)
        {
            options.TranspositionCostEvaluator = (c1, c2) => 1;
        }
        if (options.TranspositionsEvaluator == null)
        {
            options.TranspositionsEvaluator = (l, t) => l - t;
        }
        _options = options;
    }

    /// <summary>
    /// General distance algorithm uses all the parameters in the options object and works on tokens
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public double GeneralDistance(string s1, string s2, int maxOffset)
    {
        var t1 = _options.Tokenizer(s1);
        var t2 = _options.Tokenizer(s2);

        var l1 = t1.Length;
        var l2 = t2.Length;

        if (l1 == 0) return _options.LocalLengthEvaluator(l2);
        if (l2 == 0) return _options.LocalLengthEvaluator(l1);

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0.0;  //largest common subsequence
        var local_cs = 0.0; //local common substring
        var trans = 0.0;  //cost of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (_options.TokenMatcher(t1[c1], t2[c2]))
            {
                local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans += _options.TranspositionCostEvaluator(c1, c2);
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans += _options.TranspositionCostEvaluator(ofs.C1, ofs.C2);
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (_options.MaxDistance != null)
                {
                    var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
                    if (temporaryDistance > _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += _options.LocalLengthEvaluator(local_cs);
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += _options.LocalLengthEvaluator(local_cs);
        return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //apply transposition cost to the final result
    }

    /// <summary>
    /// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <param name="maxDistance"></param>
    /// <returns></returns>
    public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance = 0)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring
        var trans = 0;  //number of transpositions ('axb' vs 'xba')
        var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
                var isTransposition = false;
                var op = offset_arr.First;
                while (op != null)
                {  //see if current match is a transposition
                    var ofs = op.Value;
                    if (c1 <= ofs.C1 || c2 <= ofs.C2)
                    {
                        // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                        isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
                        if (isTransposition)
                        {
                            trans++;
                        }
                        else
                        {
                            if (!ofs.IsTransposition)
                            {
                                ofs.IsTransposition = true;
                                trans++;
                            }
                        }
                        break;
                    }
                    else
                    {
                        var next_op = op.Next;
                        if (c1 > ofs.C2 && c2 > ofs.C1)
                        {
                            offset_arr.Remove(op);
                        }
                        op = next_op;
                    }
                }
                offset_arr.AddLast(new OffsetPair(c1, c2)
                {
                    IsTransposition = isTransposition
                });
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                }
                if (maxDistance > 0)
                {
                    var temporaryDistance = Math.Max(c1, c2) - (lcss - trans);
                    if (temporaryDistance > maxDistance) return temporaryDistance;
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
            // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
            if ((c1 >= l1) || (c2 >= l2))
            {
                lcss += local_cs;
                local_cs = 0;
                c1 = c2 = Math.Min(c1, c2);
            }
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - (lcss - trans); //apply transposition cost to the final result
    }

    /// <summary>
    /// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
    /// </summary>
    /// <param name="s1"></param>
    /// <param name="s2"></param>
    /// <param name="maxOffset"></param>
    /// <returns></returns>
    public static int SimplestDistance(string s1, string s2, int maxOffset)
    {
        var l1 = s1 == null ? 0 : s1.Length;
        var l2 = s2 == null ? 0 : s2.Length;

        if (l1 == 0) return l2;
        if (l2 == 0) return l1;

        var c1 = 0;  //cursor for string 1
        var c2 = 0;  //cursor for string 2
        var lcss = 0;  //largest common subsequence
        var local_cs = 0; //local common substring

        while ((c1 < l1) && (c2 < l2))
        {
            if (s1[c1] == s2[c2])
            {
                local_cs++;
            }
            else
            {
                lcss += local_cs;
                local_cs = 0;
                if (c1 != c2)
                {
                    c1 = c2 = Math.Max(c1, c2);
                }
                //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                //so that we can have only one code block handling matches 
                for (var i = 0; i < maxOffset && (c1 + i < l1 && c2 + i < l2); i++)
                {
                    if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                    {
                        c1 += i - 1;
                        c2--;
                        break;
                    }
                    if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                    {
                        c1--;
                        c2 += i - 1;
                        break;
                    }
                }
            }
            c1++;
            c2++;
        }
        lcss += local_cs;
        return Math.Max(l1, l2) - lcss;
    }

    private class OffsetPair
    {
        public int C1 { get; set; }
        public int C2 { get; set; }
        public bool IsTransposition { get; set; }

        public OffsetPair(int c1, int c2)
        {
            this.C1 = c1;
            this.C2 = c2;
            this.IsTransposition = false;
        }
    }

    public class Options
    {
        /// <summary>
        /// If set, the algorithm will stop if the distance reaches this value
        /// </summary>
        public int? MaxDistance { get; set; }

        /// <summary>
        /// The function that turns strings into a list of tokens (also strings)
        /// </summary>
        public Func<string, string[]> Tokenizer { get; set; }

        /// <summary>
        /// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
        /// </summary>
        public Func<string, string, bool> TokenMatcher { get; set; }

        /// <summary>
        /// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
        /// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
        /// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
        /// </summary>
        public Func<string, string, double> MatchingEvaluator { get; set; }

        /// <summary>
        /// Determines if the local value (computed on subsequent matched tokens) must be modified.
        /// In case one wants to reward longer matched substrings, for example, this is what you need to change.
        /// </summary>
        public Func<double, double> LocalLengthEvaluator { get; set; }

        /// <summary>
        /// The function determining the cost of an individual transposition, based on its counter positions.
        /// </summary>
        public Func<int, int, double> TranspositionCostEvaluator { get; set; }

        /// <summary>
        /// The function determining how the cost of transpositions affects the final result
        /// Change it if it doesn't suit you.
        /// </summary>
        public Func<double, double, double> TranspositionsEvaluator { get; set; }
    }
}

PHP implementation
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance = 0) {
    if (!$s1 || !strlen($s1)) {
        if (!$s2) {
            return 0;
        }
        return strlen($s2);
    }
    if (!$s2 || !strlen($s2)) {
        return strlen($s1);
    }
    $l1 = strlen($s1);
    $l2 = strlen($s2);
    $c1 = 0; //cursor for string 1
    $c2 = 0; //cursor for string 2
    $lcss = 0; //largest common subsequence
    $local_cs = 0; //local common substring
    $trans = 0; //number of transpositions ('ab' vs 'ba')
    $offset_arr = array(); //offset pair array, for computing the transpositions
    while (($c1 < $l1) && ($c2 < $l2)) {
        if (substr($s1, $c1, 1) == substr($s2, $c2, 1)) {
            $local_cs++;
            $isTrans = false;
            $i = 0;
            while ($i < sizeof($offset_arr)) { //see if current match is a transposition
                $ofs = $offset_arr[$i];
                if ($c1 <= $ofs['c1'] || $c2 <= $ofs['c2']) {
                    $isTrans = abs($c2 - $c1) >= abs($ofs['c2'] - $ofs['c1']);
                    if ($isTrans) {
                        $trans++;
                    } else {
                        if (!$ofs['trans']) {
                            $offset_arr[$i]['trans'] = true; //$ofs is a copy of the array element, so update the stored pair directly
                            $trans++;
                        }
                    }
                    break;
                } else {
                    if ($c1 > $ofs['c2'] && $c2 > $ofs['c1']) {
                        array_splice($offset_arr, $i, 1);
                    } else {
                        $i++;
                    }
                }
            }
            array_push($offset_arr, array('c1' => $c1, 'c2' => $c2, 'trans' => $isTrans));
        } else {
            $lcss+= $local_cs;
            $local_cs = 0;
            if ($c1 != $c2) {
                $c1 = $c2 = min($c1, $c2); //using min allows the computation of transpositions
            }
            if ($maxDistance) {
                $temporaryDistance = max($c1, $c2) - $lcss + $trans;
                if ($temporaryDistance > $maxDistance) return $temporaryDistance;
            }

            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for ($i = 0;$i < $maxOffset && ($c1 + $i < $l1 || $c2 + $i < $l2);$i++) {
                if (($c1 + $i < $l1) && (substr($s1, $c1 + $i, 1) == substr($s2, $c2, 1))) {
                    $c1+= $i - 1;
                    $c2--;
                    break;
                }
                if (($c2 + $i < $l2) && (substr($s1, $c1, 1) == substr($s2, $c2 + $i, 1))) {
                    $c1--;
                    $c2+= $i - 1;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if (($c1 >= $l1) || ($c2 >= $l2)) {
            $lcss+= $local_cs;
            $local_cs = 0;
            $c1 = $c2 = min($c1, $c2);
        }
    }
    $lcss+= $local_cs;
    return max($l1, $l2) - $lcss + $trans; //apply transposition cost to final result
    
}
Thanks to Ferenc Szatmári for corrections in the PHP code

The Simplest and General versions of the algorithm remain as an exercise for you, since I haven't been working in PHP for more than a decade.

T-SQL implementation
---<summary>
---Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
---</summary>
---<param name="s1"></param>
---<param name="s2"></param>
---<param name="maxOffset"></param>
---<param name="maxDistance"></param>
---<returns></returns>
CREATE FUNCTION Sift4common(@s1          NVARCHAR(max), 
                            @s2          NVARCHAR(max), 
                            @maxOffset   INT, 
                            @maxDistance INT) 
returns INT 
AS 
  BEGIN 
      DECLARE @l1 INT = Len(Isnull(@s1, '')); 
      DECLARE @l2 INT = Len(Isnull(@s2, '')); 

      IF ( @l1 = 0 ) 
        RETURN @l2; 

      IF ( @l2 = 0 ) 
        RETURN @l1; 

      DECLARE @c1 INT = 0; 
      DECLARE @c2 INT = 0; 
      DECLARE @lcss INT = 0; 
      DECLARE @local_cs INT = 0; 
      DECLARE @trans INT = 0; 
      DECLARE @offset_arr TABLE 
        ( 
           C1    INT, 
           C2    INT, 
           Trans BIT 
        ) 
      DECLARE @i INT 
      DECLARE @temporaryDistance FLOAT 
      DECLARE @result INT 
      DECLARE @oc1 INT, 
              @oc2 INT, 
              @otr BIT 
      DECLARE @isTrans BIT 

      WHILE ( ( @c1 < @l1 ) 
              AND ( @c2 < @l2 ) ) 
        BEGIN 
            IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
              BEGIN 
                  SET @local_cs=@local_cs + 1; 
                  SET @isTrans=0 
                  SET @oc1=NULL; 

                  SELECT TOP 1 @oc1 = o.C1, 
                               @oc2 = o.C2, 
                               @otr = o.Trans 
                  FROM   @offset_arr o 
                  WHERE  @c1 <= o.C1 
                          OR @c2 <= o.C2 

                  IF ( @oc1 IS NOT NULL ) 
                    BEGIN 
                        SET @isTrans=CASE 
                                       WHEN Abs(@c2 - @c1) >= Abs(@oc2 - @oc1) 
                                     THEN 1 
                                       ELSE 0 
                                     END 

                        IF ( @isTrans = 1 ) 
                          BEGIN 
                              SET @trans=@trans + 1 
                          END 
                        ELSE IF ( @otr = 0 ) 
                          BEGIN 
                              SET @trans=@trans + 1 

                              UPDATE @offset_arr 
                              SET    Trans = 1 
                              WHERE  C1 = @oc1 
                                     AND C2 = @oc2 
                          END 
                    END 

                  DELETE FROM @offset_arr 
                  WHERE  @c1 > C1 
                         AND @c1 > C2 
                         AND @c2 > C1 
                         AND @c2 > C2; 

                  INSERT INTO @offset_arr 
                  VALUES      (@c1, 
                               @c2, 
                               @isTrans); 
              END 
            ELSE 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 != @c2 ) 
                    -- using min allows the computation of transpositions  
                    BEGIN 
                        IF ( @c1 < @c2 ) 
                          BEGIN 
                              SET @c2=@c1; 
                          END 
                        ELSE 
                          BEGIN 
                              SET @c1=@c2; 
                          END 
                    END 
                  IF ( @maxDistance > 0 ) 
                    BEGIN 
                        IF ( @c1 > @c2 ) 
                          BEGIN 
                              SET @temporaryDistance = @c1 - ( @lcss - @trans / 2.0 ); 
                          END 
                        ELSE 
                          BEGIN 
                              SET @temporaryDistance = @c2 - ( @lcss - @trans / 2.0 ); 
                          END 

                        IF ( @temporaryDistance > @maxDistance ) 
                          RETURN Round(@temporaryDistance, 0); 
                    END 


                  --if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                  --so that we can have only one code block handling matches   
                  SET @i=0; 

                  WHILE ( @i < @maxOffset 
                          AND ( @c1 + @i < @l1 
                                 OR @c2 + @i < @l2 ) ) 
                    BEGIN 
                        IF ( ( @c1 + @i < @l1 ) 
                             AND Substring(@s1, @c1 + @i + 1, 1) = 
                                 Substring(@s2, @c2 + 1, 1) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 + @i - 1; 
                              SET @c2=@c2 - 1; 

                              BREAK; 
                          END 

                        IF ( ( @c2 + @i < @l2 ) 
                             AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
                                                              @c2 + @i + 1, 1 
                                                              ) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 - 1; 
                              SET @c2=@c2 + @i - 1; 

                              BREAK; 
                          END 

                        SET @i=@i + 1; 
                    END 
              END 

            SET @c1=@c1 + 1; 
            SET @c2=@c2 + 1; 

            IF ( ( @c1 >= @l1 ) 
                  OR ( @c2 >= @l2 ) ) 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 < @c2 ) 
                    BEGIN 
                        SET @c2=@c1; 
                    END 
                  ELSE 
                    BEGIN 
                        SET @c1=@c2; 
                    END 
              END 
        END 

      SET @lcss = @lcss + @local_cs; 

      --apply the transposition cost to the final result 
      IF ( @l1 > @l2 ) 
        BEGIN 
            SET @result = @l1 - ( @lcss - @trans ); 
        END 
      ELSE 
        BEGIN 
            SET @result = @l2 - ( @lcss - @trans ); 
        END 

      RETURN @result; 
  END 

Clearly a general version of the algorithm is not possible in Transact SQL.

Here is a MySQL version, graciously provided by Ferenc Szatmári:
BEGIN
    DECLARE l1 INT DEFAULT Length(IFNULL(s1, ''));
    DECLARE l2 INT DEFAULT Length(IFNULL(s2, ''));
    DECLARE c1 INT DEFAULT 0;
    DECLARE c2 INT DEFAULT 0;
    DECLARE lcss INT DEFAULT 0;
    DECLARE local_cs INT DEFAULT 0;
    DECLARE trans INT DEFAULT 0;
    DECLARE i INT;
    DECLARE temporaryDistance FLOAT;
    DECLARE result INT;
    DECLARE oc1 INT;
    DECLARE oc2 INT;
    DECLARE otr SMALLINT;
    DECLARE isTrans SMALLINT;

    DROP TEMPORARY TABLE IF EXISTS offset_arr;
    CREATE TEMPORARY TABLE IF NOT EXISTS offset_arr
    (
        C1 INT,
        C2 INT,
        Trans BIT
    ) ENGINE=memory;

    IF l1 = 0 THEN
        RETURN l2;
    END IF;
    IF l2 = 0 THEN
        RETURN l1;
    END IF;

    WHILE ((c1 < l1) AND (c2 < l2)) DO
        IF (Substring(s1, c1 + 1, 1) = Substring(s2, c2 + 1, 1)) THEN
            SET local_cs = local_cs + 1;
            SET isTrans = 0;
            SET oc1 = NULL;
            SELECT o.C1, o.C2, o.Trans INTO oc1, oc2, otr
            FROM offset_arr o
            WHERE c1 <= o.C1 OR c2 <= o.C2
            LIMIT 1;
            IF oc1 IS NOT NULL THEN
                SET isTrans = CASE WHEN ABS(c2 - c1) >= ABS(oc2 - oc1) THEN 1 ELSE 0 END;
                IF (isTrans = 1) THEN
                    SET trans = trans + 1;
                ELSE
                    IF (otr = 0) THEN
                        SET trans = trans + 1;
                        UPDATE offset_arr SET Trans = 1 WHERE C1 = oc1 AND C2 = oc2;
                    END IF;
                END IF;
            END IF;

            DELETE FROM offset_arr
            WHERE c1 > C1
                AND c1 > C2
                AND c2 > C1
                AND c2 > C2;

            INSERT INTO offset_arr
            VALUES (c1, c2, isTrans);
        ELSE
            SET lcss = lcss + local_cs;
            SET local_cs = 0;

            IF (c1 != c2) THEN
                -- using min allows the computation of transpositions
                IF (c1 < c2) THEN
                    SET c2 = c1;
                ELSE
                    SET c1 = c2;
                END IF;

                IF (maxDistance > 0) THEN
                    IF (c1 > c2) THEN
                        SET temporaryDistance = c1 - (lcss - trans / 2.0);
                    ELSE
                        SET temporaryDistance = c2 - (lcss - trans / 2.0);
                    END IF;
                    IF (temporaryDistance > maxDistance) THEN
                        RETURN Round(temporaryDistance, 0);
                    END IF;
                END IF;
            END IF;

            /* if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
               so that we can have only one code block handling matches */
            SET i = 0;
            label: WHILE (i < maxOffset AND (c1 + i < l1 OR c2 + i < l2)) DO
                IF ((c1 + i < l1)
                        AND Substring(s1, c1 + i + 1, 1) = Substring(s2, c2 + 1, 1)) THEN
                    SET c1 = c1 + i - 1;
                    SET c2 = c2 - 1;
                    LEAVE label;
                END IF;
                IF ((c2 + i < l2)
                        AND Substring(s1, c1 + 1, 1) = Substring(s2, c2 + i + 1, 1)) THEN
                    SET c1 = c1 - 1;
                    SET c2 = c2 + i - 1;
                    LEAVE label;
                END IF;
                SET i = i + 1;
            END WHILE;
        END IF;

        SET c1 = c1 + 1;
        SET c2 = c2 + 1;

        IF ((c1 >= l1) OR (c2 >= l2)) THEN
            SET lcss = lcss + local_cs;
            SET local_cs = 0;
            IF (c1 < c2) THEN
                SET c2 = c1;
            ELSE
                SET c1 = c2;
            END IF;
        END IF;
    END WHILE;

    SET lcss = lcss + local_cs;

    /* apply the transposition cost to the final result */
    IF (l1 > l2) THEN
        SET result = l1 - (lcss - trans);
    ELSE
        SET result = l2 - (lcss - trans);
    END IF;
    RETURN result;
END



Java implementation
Here is a Java version, graciously provided by Nathanæl Fischer:
/**
 * Sift4 - common version
 * online algorithm to compute the distance between two strings in O(n)
 * Algorithm by siderite, java port by Nathan Fischer 2016
 * /blog/super-fast-and-accurate-string-distance.html
 * @param s1
 * @param s2
 * @param maxOffset the number of characters to search for matching letters
 * @return
 */
public static double sift4(String s1, String s2, int maxOffset) {
    class Offset {
        int c1;
        int c2;
        boolean trans;

        Offset(int c1, int c2, boolean trans) {
            this.c1 = c1;
            this.c2 = c2;
            this.trans = trans;
        }
    }

    if (s1 == null || s1.isEmpty())
        return s2 == null ? 0 : s2.length();

    if (s2 == null || s2.isEmpty())
        return s1.length();

    int l1 = s1.length();
    int l2 = s2.length();

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    LinkedList<Offset> offset_arr = new LinkedList<>(); //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            boolean isTrans = false;
            //see if current match is a transposition
            int i = 0;
            while (i < offset_arr.size()) {
                Offset ofs = offset_arr.get(i);
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.remove(i);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.add(new Offset(c1, c2, isTrans));
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}
Powershell implementation
Powershell implementation of simple Sift4, by Kirk Sayre:

function Calc-Sift4Distance 
{
  <# 
      .SYNOPSIS 

      Compute the edit distance between 2 strings using the sift4 string
      edit distance algorithm.

      .PARAMETER s1

      The 1st string

      .PARAMETER s2

      The 2nd string

      .PARAMETER maxOffset

      The maximum offset at which to search for matching characters. The
      default is 10.

      .RETURN

      The # of edits needed to make the given 2 strings equal.
  #>

  Param(
    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s1,

    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [String]
    $s2,

    [Parameter(ValueFromPipelineByPropertyName = $True)]
    [Int]
    $maxOffset = 10
  )

  # Handle null or empty strings.
  if ((-not $s1) -or ($s1.length -eq 0)) 
  {
    if (-not $s2) 
    {
      return 0
    }
    return $s2.length
  }

  if ((-not $s2) -or ($s2.length -eq 0)) 
  {
    return $s1.length
  }

  # Initialization.
  $l1 = $s1.length
  $l2 = $s2.length
  $c1 = 0 # Cursor for string 1.
  $c2 = 0 # Cursor for string 2.
  $lcss = 0 # Largest common subsequence.
  $local_cs = 0 # Local common substring.

  # Scan strings.
  while (($c1 -lt $l1) -and ($c2 -lt $l2)) 
  {
    if ($s1[$c1] -eq $s2[$c2]) 
    {
      $local_cs++
    }
    else 
    {
      $lcss += $local_cs
      $local_cs = 0
      if ($c1 -ne $c2) 
      {
        # Using max to bypass the need for computing transpositions ('ab' vs 'ba').
        $c1 = $c2 = (@($c1, $c2) | Measure-Object -Maximum).Maximum
      }

      for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) 
      {
        if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) 
        {
          $c1 += $i
          $local_cs++
          break
        }

        if ((($c2 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i])) 
        {
          $c2 += $i
          $local_cs++
          break
        }
      }
    }
    $c1++
    $c2++
  }
  $lcss += $local_cs
  return [math]::Round((@($l1, $l2) | Measure-Object -Maximum).Maximum - $lcss)
}
C++ implementation
Thanks to Hugo Amaro, a C++ implementation:

#include <vector>
#include <cstdlib>
#include <algorithm>

struct sift_offset {
  int c1;
  int c2;
  bool trans;
};

template < typename T >
  int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) {
    if (!s1 || !s1_size) {
      if (!s2) {
        return 0;
      }
      return s2_size;
    }

    if (!s2 || !s2_size) {
      return s1_size;
    }

    int l1 = s1_size;
    int l2 = s2_size;

    int c1 = 0; //cursor for string 1
    int c2 = 0; //cursor for string 2
    int lcss = 0; //largest common subsequence
    int local_cs = 0; //local common substring
    int trans = 0; //number of transpositions ('ab' vs 'ba')
    std::vector < sift_offset > offset_arr; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
      if (s1[c1] == s2[c2]) {
        local_cs++;
        bool isTrans = false;
        //see if current match is a transposition
        int i = 0;
        while (i < offset_arr.size()) {
          sift_offset ofs = offset_arr[i];
          if (c1 <= ofs.c1 || c2 <= ofs.c2) {
            // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
            isTrans = std::abs(c2 - c1) >= std::abs(ofs.c2 - ofs.c1);

            if (isTrans) {
              trans++;
            } else {
              if (!ofs.trans) {
                ofs.trans = true;
                trans++;
              }
            }
            break;
          } else {
            if (c1 > ofs.c2 && c2 > ofs.c1) {
              offset_arr.erase(offset_arr.begin() + i);

            } else {
              i++;
            }
          }
        }
        offset_arr.push_back({
          c1,
          c2,
          isTrans
        });
      } else {
        lcss += local_cs;
        local_cs = 0;
        if (c1 != c2) {
          c1 = c2 = std::min(c1, c2); //using min allows the computation of transpositions
        }
        //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
        //so that we can have only one code block handling matches 
        for (int i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
          if ((c1 + i < l1) && (s1[c1 + i] == s2[c2])) {
            c1 += i - 1;
            c2--;
            break;
          }
          if ((c2 + i < l2) && (s1[c1] == s2[c2 + i])) {
            c1--;
            c2 += i - 1;
            break;
          }
        }
      }
      c1++;
      c2++;
      if (maxDistance) {
        int temporaryDistance = std::max(c1, c2) - lcss + trans;
        if (temporaryDistance >= maxDistance) return temporaryDistance;
      }
      // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
      if ((c1 >= l1) || (c2 >= l2)) {
        lcss += local_cs;
        local_cs = 0;
        c1 = c2 = std::min(c1, c2);
      }
    }
    lcss += local_cs;
    return std::max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
  }

You can find a Go implementation here, written by Jason W. Hutchinson.
There is also a Swift implementation here.
A Perl 6 (now called Raku) implementation can be found here.

I had this memory problem that I could not understand. OK, the design of the application was not the best in the world, but why would it always give me a damn OutOfMemoryException when I have a super duper computer with a lot of memory and I have solved issues like heap memory allocation? And the reason is that IIS Express, the default ASP.NET web server on Windows 7, runs in 32-bit mode, meaning it has a maximum of 4GB of memory no matter what you do. Well, you can make IIS Express run in 64-bit mode by simply switching it on in the Windows registry. I don't want to copy the content of the article from someone else, so here is a link towards how to do it: Debugging VS2013 Websites Using 64-bit IIS Express.

Just in case you want the ultra fast version, copy this into a file with the .reg extension and execute it:
Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\12.0\WebProjects]
"Use64BitIISExpress"=dword:00000001

Cheers!

You know how many times you need to get the results of a query as a string list, let's say a CSV, and there is nothing in Microsoft SQL Server to help you? What you would like is an aggregate function, similar to SUM, that returns the concatenated value of all strings in the SELECT query.

From SQL Server 2017 on, there is such a function called STRING_AGG and you use it just like one might expect.

If you have an older version of SQL Server... upgrade :) But if you can't, here is a solution:

Before SQL Server 2008 that was indeed the case: you needed to use variables, as in SELECT @S=@S+[Value] FROM .... But in SQL 2008 they added more XML support and thus the data() XML PATH method. Take notice that this method adds a space between atomic values. So, without further ado, here is the code to concatenate several values into a comma separated value string:

DECLARE @T TABLE(S NVARCHAR(Max))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel') -- enough with the fruity examples!

SELECT CONVERT(NVARCHAR(Max),(SELECT S+',' AS 'data()' FROM @T t
FOR XML PATH('')))


Result: Filet, T-bone, Sausage, Würstel, - of course, remove the last comma.

Update: What if I have a table with multiple columns and I just want to aggregate one?

The solution is to use the select with WHERE clauses on the GROUP BY columns. Here is an example:

DECLARE @t TABLE(col1 NVARCHAR(Max), col2 INT, value NVARCHAR(Max) )
INSERT INTO @T VALUES ('a',1,'1'),
('a',1,'2'),
('a',2,'3'),
('b',1,'1'),
('b',1,'2'),
('c',3,'67')
 
--SELECT * FROM @t t
 
SELECT col1, col2, CONVERT(NVARCHAR(Max),(
SELECT value+',' AS 'data()' FROM @T t2
WHERE t1.col1 = t2.col1
AND t1.col2 = t2.col2
FOR XML PATH('')))
FROM @t t1
GROUP BY col1, col2