Sparse arrays in JavaScript

Published Jun 15, 2025

Intro

Welcome to sparse arrays, a JavaScript concept that completely escaped me until now. Let's dig into it.

I wanted to create a game board, an array of 8 rows, themselves arrays of 8 elements so I unthinkingly did a const row=Array(8). I used a console.table on it and it was displayed in a stupid way, but I blamed it on Google devs and moved on. I didn't understand something was amiss until I did a .forEach on a row array that had only three pieces on it and it was executed just three times. What was going on?

Explanation

Well, when you use the Array(integer) constructor you get a sparse array. It has a length, it has items, but it is not a contiguous memory space with N slots for values. The same thing happens when you omit an element from an array declaration, like [1,,2] which is a sparse array different from [1, undefined, 2]. A normal array, in this context, is called a dense array.

In other words, sparse arrays work differently in for and forEach loops! It's almost like an empty object with a length property that has array methods working for it. Think about this code for example:

const arr = { length: 8 }
arr[1]='something';

// displays 8 lines, where only the second is 'something' and the rest are undefined
for (let i=0; i<arr.length; i++) console.log(arr[i]);

// simulated forEach - displays only 2 lines: something and 8
Object.values(arr).forEach(v=>console.log(v));

The simplest way to transform a sparse array into a dense array is to iterate it, like this: [...sparseArray]. You can create a dense array with Array.from( { length:8 } ), however new Array(8) or Array(8) will return a sparse array.

If you look in the docs, sparse arrays are often used with .fill, like this: Array(8).fill(undefined), which will return an array of 8 undefined elements.

Going deeper

So, wait... now there are two types of arrays, they function differently, they must be different classes, right? Actually, no! These are just concepts that apply to Array. Let's see some code:

const x = [1]; // dense array
x.length=10; // sparse array, all slots except the first are 'empty'.
x.fill(2); // dense array, all slots are filled with the value 2
x[1]=undefined; // dense array, second slot contains undefined
delete x[3]; // sparse array, fourth slot is 'empty'
// 1 in x == true;
// 3 in x == false;

What is this madness? This is something new introduced by nasty people who just want to make everything complicated, right? Actually, no. Arrays were always sparse, from the very beginning of the language in 1995. The concept itself was formalized in 2009 for ES5, though, where it was explicitly said that for..in and for..of and forEach, map, filter, etc, will skip empty slots.

Conclusion

JavaScript arrays are misnomers and different from the previous concept of array, used in languages like C or .NET, a fixed and contiguous memory space to store elements based on a positional index. JavaScript arrays are complex objects that allow insertion and removal of items, resizing and, yes, sparsity.

You can work in JavaScript for years and years and still get blindsighted by something like this...

Array.from - more powerful than you think

Published May 9, 2025

and has 0 comments

Array.from is a JavaScript function that until now, I've used exclusively to transform an iterable into an array. I had never considered the powerful factory functionality that it provides, because its declared signature is deceptively simple: Array.from(arrayLike, mapFn, thisArg). So you get something that's like an array (an interable, right?) then you map it (phaw, like I don't know how to map an array!) and a thisArg probably sets the this context for the function (like I don't know how to bind a function!)

And yet...

The first "array-like" parameter doesn't need to be an iterable, it can be an object with a length. Meaning that you can just do something a number of times.

Example for creating a range:

// Sequence generator function (commonly referred to as "range", cf. Python, Clojure, etc.)
const range = (start, stop, step) =>
  Array.from(
    { length: Math.ceil((stop - start) / step) },
    (_, i) => start + i * step,
  );

Getting a sample from an array of string:

const sample = (arr, count) =>
  Array.from(
    { length: count },
    ()=>arr[Math.floor(Math.random()*arr.length)]
  );

How about generating a random string?

const chars = 'abcdefghijklmnopqrstuvwxyz';
const randomString = (length) =>
  Array.from({ length }, () => chars[Math.floor(Math.random() * chars.length)]).join('');

// OR using the sample function above

const randomString = (length) =>
  sample('abcdefghijklmnopqrstuvwxyz',length).join('');

And if you don't want an array, but something "array-like", like a NodeList?

class MyClass {}

Array.from.call(MyClass, ["foo", "bar", "baz"]);
// MyClass {0: 'foo', 1: 'bar', 2: 'baz', length: 3}

I admit Array.from(a,b,c) is functionally equivalent to Array.from(a).map(b.bind(c)), but it's more efficient, as the array is constructed only once. The power of this function comes from its interpretation of the "array-like" first argument. It takes iterable objects and/or containing a length property, thus allowing a more concise, functional way of writing code and avoiding unnecessary loops.

Hope it helps!

In class members which are functions, this is not defined unless you ask for it

Published Dec 19, 2023

and has 0 comments

You are writing some Javascript code in the browser and you create classes, then you create some methods, which you see have some issue with the "this" keyword by default if you use the standard method declaration so you end up doing something like this:

// instead of myMethod() { ... }
myMethod=()=>{
  const myClass=this;
}

And this (pardon the pun) works here. Imagine my surprise when I did something that seemed identical:

// instead of myMethod() { ... }
myMethod=()=>{
  debugger;
}

And then I tried to see what "this" was in the Chrome Developer Tools. It was "undefined". Why?!

Long story short, I then tried this (heh!) and it worked:

// instead of myMethod() { ... }
myMethod=()=>{
  const myClass=this;
  debugger;
}

The moral of the story is that "this" is not declared unless used in code. Probably a browser optimization. Hope it saves you the ten minutes it took me to understand what was going on, after I've started doubting my entire JavaScript expertise and the entire design of my app.

Fast fuzzy searching of a string in a list

Published Nov 8, 2022

and has 0 comments

I haven't been working on the Sift string distance algorithm for a while, but then I was reminded of it because someone wanted it to use it to suggest corrections to user input. Something like Google's: "Did you mean...?" or like an autocomplete application. And it got me thinking of ways to use Sift for bulk searching. I am still thinking about it, but in the meanwhile, this can be achieved using the Sift4 algorithm, with up to 40% improvement in speed to the naïve comparison with each item in the list.

Testing this solution, I've realized that the maxDistance parameter did not work correctly. I apologize. The code is now fixed on the algorithm's blog post, so go and get it.

So what is this solution for mass search? We can use two pieces of knowledge about the problem space:

the minimum possible distance between two string of length l1 and l2 will always abs(l1-l2)
- it's very easy to understand the intuition behind it: one cannot generate a string of size 5 from a string of size 3 without at least adding two new letters, so the minimum distance would be 2
as we advance through the list of strings, we have a best distance value that we keep updating
- this molds very well on the maxDistance option of Sift4

Thus armed, we can find the best matches for our string from a list using the following steps:

set a bestDistance variable to a very large value
set a matches variable to an empty list
for each of the strings in the list:
1. compare the minimum distance between the search string and the string in the list (abs(l1-l2)) to bestDistance
  1. if the minimum distance is larger than bestDistance, ignore the string and move to the next
2. use Sift4 to get the distance between the search string and the string in the list, using bestDistance as the maxDistance parameter
  1. if the algorithm reaches a temporary distance that is larger than bestDistance, it will break early and report the temporary distance, which we will ignore
3. if distance<bestDistance, then clear the matches list and add the string to it, updating bestDistance to distance
4. if distance=bestDistance, then add the string to the list of matches

When using the common Sift4 version, which doesn't compute transpositions, the list of matches is retrieved 40% faster on average than simply searching through the list of strings and updating the distance. (about 15% faster with transpositions) Considering that Sift4 is already a lot faster than Levenshtein, this method will allow searching through hundreds of thousands of strings really fast. The gained time can be used to further refine the matches list using a slower, but more precise algorithm, like Levenshtein, only on a lot smaller set of possible matches.

Here is a sample written in JavaScript, where we search a random string in the list of English words:

search = getRandomString(); // this is the search string
let matches=[];             // the list of found matches
let bestDistance=1000000;   // the smaller distance to our search found so far
const maxOffset=5;          // a common value for searching similar strings
const l = search.length;    // the length of the search string
for (let word of english) {
    const minDist=Math.abs(l-word.length); // minimum possible distance
    if (minDist>bestDistance) continue;    // if too large, just exit
    const dist=sift4(search,word,maxOffset,bestDistance);
    if (dist<bestDistance) {
        matches = [word];                  // new array with a single item
        bestDistance=dist;
        if (bestDistance==0) break;        // if an exact match, we can exit (optional)
    } else if (dist==bestDistance) {
        matches.push(word);                // add the match to the list
    }
}

There are further optimizations that can be added, beyond the scope of this post:

words can be grouped by length and the minimum distance check can be done on entire buckets of strings of the same lengths
words can be sorted, and when a string is rejected as a match, reject all string with the same prefix
- this requires an update of the Sift algorithm to return the offset at which it stopped (to which the maxOffset must be added)

I am still thinking of performance improvements. The transposition table gives more control over the precision of the search, but it's rather inefficient and resource consuming, not to mention adding code complexity, making the algorithm harder to read. If I can't find a way to simplify and improve the speed of using transpositions I might give up entirely on the concept. Also, some sort of data structure could be created - regardless of how much time and space is required, assuming that the list of strings to search is large and constant and the number of searches will be very big.

Let me know what you think in the comments!

Explaining the DOM: a script helper for CSS and ad blockers

Published Feb 12, 2022

and has 0 comments

The need

I don't know about you, but I've been living with ad blockers from the moment they arrived. Occasionally I get access to a new machine and experience the Internet without an ad blocker and I can't believe how bad it is. A long time ago I had a job where I was going by bike. After two years of not using public transport, I got in a bus and had to get out immediately. It smelled so bad! How had I ever used that before? It's the same thing.

However, most of the companies we take for granted as pillars of the web need to bombard us with ads and generally push or skew our perceptions in order to make money, so they find ways to obfuscate their web sites, lock them in "apps" that you have no control of or plain manipulate the design of the things we use to write code so that it makes this more difficult.

Continuous war

Let me give you an example of this arms race. You open your favorite web site and there it is, a garish, blinking, offending, annoying ad that is completely useless. Luckily, you have an ad blocker so you select the ad, see that it's enclosed in a div with class "annoyingAd" and you make a blocking rule for it. Your web site is clean again. But the site owner realizes people are not clicking on the ad anymore, so he dynamically changes the class name to something different every time you open the page. Now, you could try to decipher the JavaScript code that populates the div and write a script to get the correct class, but it would only work for this web site and you would have to know how to code in JavaScript and it would take a lot of effort you don't want to spend. But then you realize that above the horrid thing there is a title "Annoying ad you can't get rid of", so you write a simple thing to get rid of the div that contains a span with that content. Yay!

At this point you already have some issues. The normal way people block ads is to create a quasi CSS rule for an element. Yet CSS doesn't easily let's you select elements based on the inner text or to select parents of elements with certain characteristics. In part it's a question of performance, but at the same time there is a matter of people who want to obfuscate your web site taking part in the decision process of what should go in CSS. So here, to get the element with a certain content we had to use something that expands normal CSS, like the jQuery syntax or some extra JavaScript. This is, needless to say, already applicable to a low number of people. But suspend your disbelief for a moment.

Maybe your ad blocker is providing you with custom rules that you can make based on content, or you write a little script or even the ad blocker people write the code for you. So the site owner catches up and he does something: instead of having a span with the title, he puts many little spans, containing just a few letters, some of them hidden visually and filled with garbage, others visible. The title is now something like "Ann"+"xxx"+"oying"+"xxx"+" ad", where all "xxx" texts appear as part of the domain object model (the page's DOM) but they are somehow not visible to the naked eye. Now the inner text of the container is "Annxxxoyingxxx ad", with random letters instead of xxx. Beat that!

And so it goes. You need to spend knowledge and effort to escalate this war that you might not even win. Facebook is the king of obfuscation, where even the items shared by people are mixed and remixed so that you cannot select them. So what's the solution?

Solution

At first I wanted to go in the same direction, fight the same war. Let's create a tool that deobfuscates the DOM! Maybe using AI! Something that would, at the end, give me the simplest DOM possible that would create the visual output of the current page and, when I change one element in this simple DOM, it would apply the changes to the corresponding obfuscated DOM. And that IS a solution, if not THE solution, but it is incredibly hard to implement.

There is another option, though, something that would progressively enhance the existing DOM with information that one could use in a CSS rule. Imagine a small script that, added to any page, would add attributes to elements like this: visibleText="Annoying ad" containingText="Annxxxoingxxx ad" innerText="" positionInPage="78%,30%-middle-right" positionInViewport="78%,5%-top-right". Now you can use a CSS rule for it, because CSS has syntax for attributes equal to, containing, starting or ending with something. This would somewhat slow the page, but not terribly so. One can use it as a one shot (no matter how long it takes, it only runs once) or continuous (where every time an element changes, it would recreate the attributes in it and its parents).

Feedback

Now, I have not begun development on this yet, I've just had this idea of a domExplainer library that I could make available for everybody. I have to test how it works on difficult web sites like Facebook and try it as a general option in my browser. But I would really appreciate feedback first. What do you think? What else would you add to (or remove from) it? What else would you use it for?

Why all articles about demystifying JS array methods are rubbish

Published Aug 5, 2021

and has 0 comments

Every month or so I see another article posted by some dev, usually with a catchy title using words like "demystifying" or "understanding" or "N array methods you should be using" or "simplify your Javascript" or something similar. It has become so mundane and boring that it makes me mad someone is still trying to cache on these tired ideas to try to appear smart. So stop doing it! There is no need to explain methods that were introduced in 2009!

But it gets worse. These articles are partially misleading because Javascript has evolved past the need to receive or return data as arrays. Let me demystify the hell out of you.

First of all, the methods we are discussing here are .filter and .map. There is of course .reduce, but that one doesn't necessarily return an array. Ironically, one can write both .filter and .map as a reduce function, so fix that one and you can get far. There is also .sort, which for performance reasons works a bit differently and returns nothing, so it cannot be chained as the others can. All of these methods from the Array object have something in common: they receive functions as parameters that are then applied to all of the items in the array. Read that again: all of the items.

Having functions as first class citizens of the language has always been the case for Javascript, so that's not a great new thing to teach developers. And now, with arrow functions, these methods are even easier to use because there are no scope issues that caused so many hidden errors in the past.

Let's take a common use example for these methods for data display. You have many data records that need to be displayed. You have to first filter them using some search parameters, then you have to order them so you can take just a maximum of n records to display on a page. Because what you display is not necessarily what you have as a data source, you also apply a transformation function before returning something. The code would look like this:

var colors = [
  {    name: 'red',    R: 255,    G: 0,    B: 0  },
  {    name: 'blue',   R: 0,      G: 0,    B: 255  },
  {    name: 'green',  R: 0,      G: 255,  B: 0  },
  {    name: 'pink',   R: 255,    G: 128,  B: 128  }
];

// it would be more efficient to get the reddish colors in an array
// and sort only those, but we want to discuss chaining array methods
colors.sort((c1, c2) => c1.name > c2.name ? 1 : (c1.name < c2.name ? -1 : 0));

const result = colors
  .filter(c => c.R > c.G && c.R > c.B)
  .slice(page * pageSize, (page + 1) * pageSize)
  .map(c => ({
      name: c.name,
      color: `#${hex(c.R)}${hex(c.G)}${hex(c.B)}`
  }));

This code takes a bunch of colors that have RGB values and a name and returns a page (defined by page and pageSize) of the colors that are "reddish" (more red than blue and green) order by name. The resulting objects have a name and an HTML color string.

This works for an array of four elements, it works fine for arrays of thousands of elements, too, but let's look at what it is doing:

we pushed the sort up, thus sorting all colors in order to get the nice syntax at the end, rather than sorting just the reddish colors
we filtered all colors, even if we needed just pageSize elements
we created an array at every step (three times), even if we only needed one with a max size of pageSize

Let's write this in a classical way, with loops, to see how it works:

const result = [];
let i=0;
for (const c of colors) {
	if (c.R<c.G || c.R<c.B) continue;
	i++;
	if (i<page*pageSize) continue;
	result.push({
      name: c.name,
      color: `#${hex(c.R)}${hex(c.G)}${hex(c.B)}`
    });
	if (result.length>=pageSize) break;
}

And it does this:

it iterates through the colors array, but it has an exit condition
it ignores not reddish colors
it ignores the colors of previous pages, but without storing them anywhere
it stores the reddish colors in the result as their transformed version directly
it exits the loop if the result is the size of a page, thus only going through (page+1)*pageSize loops

No extra arrays, no extra iterations, only some ugly ass code. But what if we could write this as nicely as in the first example and make it work as efficiently as the second? Because of ECMAScript 6 we actually can!

Take a look at this:

const result = Enumerable.from(colors)
  .where(c => c.R > c.G && c.R > c.B)
  //.orderBy(c => c.name)
  .skip(page * pageSize)
  .take(pageSize)
  .select(c => ({
      name: c.name,
      color: `#${hex(c.R)}${hex(c.G)}${hex(c.B)}`
  }))
  .toArray();

What is this Enumerable thing? It's a class I made to encapsulate the methods .where, .skip, .take and .select and will examine it later. Why these names? Because they mirror similar method names in LINQ (Language Integrated Queries from .NET) and because I wanted to clearly separate them from the array methods.

How does it all work? If you look at the "classical" version of the code you see the new for..of loop introduced in ES6. It uses the concept of "iterable" to go through all of the elements it contains. An array is an iterable, but so is a generator function, also an ES6 construct. A generator function is a function that generates values as it is iterated, the advantage being that it doesn't need to hold all of the items in memory (like an array) and any operation that needs doing on the values is done only on the ones requested by code.

Here is what the code above does:

it creates an Enumerable wrapper over array (performs no operation, just assignments)
it filters by defining a generator function that only returns reddish colors (but performs no operation) and returns an Enumerable wrapper over the function
it ignores the items from previous pages by defining a generator function that counts items and only returns items after the specified number (again, no operation) and returns an Enumerable wrapper over the function
it then takes a page full of items, stopping immediately after, by defining a generator function that does that (no operation) and returns an Enumerable wrapper over the function
it transforms the colors in output items by defining a generator function that iterates existing items and returns the transformed values (no operation) and returns an Enumerable wrapper over the function
it iterates the generator function in the current Enumerable and fills an array with the values (all the operations are performed here)

And here is the flow for each item:

.toArray enumerates the generator function of .select
.select enumerates the generator function of .take
.take enumerates the generator function of .skip
.skip enumerates the generator function of .where
.where enumerates the generator function that iterates over the colors array
the first color is red, which is reddish, so .where "yields" it, it passes as the next item in the iteration
the page is 0, let's say, so .skip has nothing to skip, it yields the color
.take still has pageSize items to take, let's assume 20, so it yields the color
.select yields the color transformed for output
.toArray pushes the color in the result
go to 1.

If for some reason you would only need the first item, not the entire page (imagine using a .first method instead of .toArray) only the steps from 1. to 10. would be executed. No extra arrays, no extra filtering, mapping or assigning.

Am I trying too hard to seem smart? Well, imagine that there are three million colors, a third of them are reddish. The first code would create an array of a million items, by iterating and checking all three million colors, then take a page slice from that (another array, however small), then create another array of mapped objects. This code? It is the equivalent of the classical one, but with extreme readability and ease of use.

OK, what is that .orderBy thing that I commented out? It's a possible method that orders items online, as they come, at the moment of execution (so when .toArray is executed). It is too complex for this blog post, but there is a full implementation of Enumerable that I wrote containing everything you will ever need. In that case .orderBy would only order the minimal number of items required to extract the page ((page+1) * pageSize). The implementation can use custom sorting algorithms that take into account .take and .skip operators, just like in LiNQer.

The purpose of this post was to raise awareness on how Javascript evolved and on how we can write code that is both readable AND efficient.

One actually doesn't need an Enumerable wrapper, and can add the methods to the prototype of all generator functions, as well (see LINQ-like functions in JavaScript with deferred execution). As you can see, this was written 5 years ago, and still people "teach" others that .filter and .map are the Javascript equivalents of .Where and .Select from .NET. NO, they are NOT!

The immense advantage for using a dedicated object is that you can store information for each operator and use it in other operators to optimize things even further (like for orderBy). All code is in one place, it can be unit tested and refined to perfection, while the code using it remains the same.

Here is the code for the simplified Enumerable object used for this post:

class Enumerable {
  constructor(generator) {
	this.generator = generator || function* () { };
  }

  static from(arr) {
	return new Enumerable(arr[Symbol.iterator].bind(arr));
  }

  where(condition) {
    const generator = this.generator();
    const gen = function* () {
      let index = 0;
      for (const item of generator) {
        if (condition(item, index)) {
          yield item;
        }
        index++;
      }
    };
    return new Enumerable(gen);
  }

  take(nr) {
    const generator = this.generator();
    const gen = function* () {
      let nrLeft = nr;
      for (const item of generator) {
        if (nrLeft > 0) {
          yield item;
          nrLeft--;
        }
        if (nrLeft <= 0) {
          break;
        }
      }
    };
    return new Enumerable(gen);
  }

  skip(nr) {
    const generator = this.generator();
    const gen = function* () {
      let nrLeft = nr;
      for (const item of generator) {
        if (nrLeft > 0) {
          nrLeft--;
        } else {
          yield item;
        }
      }
    };
    return new Enumerable(gen);
  }

  select(transform) {
    const generator = this.generator();
    const gen = function* () {
      for (const item of generator) {
		yield transform(item);
      }
    };
    return new Enumerable(gen);
  }

  toArray() {
	return Array.from(this.generator());
  }
}

The post is filled with links and for whatever you don't understand from the post, I urge you to search and learn.

Using Kendo UI for jQuery MultiSelect with dynamic values

Published Apr 18, 2021

and has 6 comments

MultiSelect is a Kendo UI control that transforms a select element into a nice dropdown with text filtering which allows the selection of multiple items. This is how you use the same control to write values directly in the list, something akin to the Outlook address bar functionality.

Long story short: the control exposes some events like: 'filtering','open','close' and 'change'. In the filtering event, which is fired by someone writing or pasting text in order to filter the list of items, we dynamically create a list item that holds that value, so that the user can just press Enter and enter the value in the list. The code also allows for a custom transformation function, so for example someone could enter "1,2,3" and it would be translated into three values 1, 2 and 3 instead of an item with the value "1,2,3". On the close and change events we clear the items in the list that have not been selected. This means you cannot use this code as is to show an autocomplete list and also add dynamic values, but it is easy to tweak for that purpose.

In order to use it, instead of doing $(selector).kendoMultiSelect(options), just use $(selector).kendoDynamicMultiSelect(options). Here is the code:

$.fn.kendoDynamicMultiSelect = function (options) {
  var multiSelect = $(this).kendoMultiSelect(options).getKendoMultiSelect();

  multiSelect.bind('filtering', function (ev) {
    var val = ev.filter && ev.filter.value;
    if (!val) return;
    
    var dataSource = ev.sender.dataSource;
    var items = dataSource.data();
    
    // if there is an existing item in the list, don't create a new one
    var existingItem = items.filter(function (i) {
      return i.value == val;
    })[0];
    if (existingItem) return;

    // find or create the item that will hold the current filter value
    var inputItem = items.filter(function (i) {
      return i.isInput;
    })[0];
    if (!inputItem) {
      inputItem = dataSource.insert(0, { isInput: true });
      // when inserting a value the input gets cleared in some situations
      // so set it back 
      ev.sender.input.val(ev.filter.value);
    }
    inputItem.value = val;
  });

  // cleans input items and also applies an optional value transformation function
  var updateValues = function (ev) {
    var values = ev.sender.value();
    if (typeof options.valueTransformationFunction === 'function') {
      // for example split comma separated values
      values = options.valueTransformationFunction(values);
    }

    var dataSource = ev.sender.dataSource;
    var items = dataSource.data();
    for (var i = 0; i < items.length; i++) {
      var item = items[i];
      item.shouldBeKept = false;
    }

    // add items for existing values
    for (var i = 0; i < values.length; i++) {
      var value = values[i];
	    
      var item = items.filter(function (i) { return i.value == value; })[0];
      if (!item) {
        item = dataSource.add({ value: value });
      }
      item.isInput = false;
      item.shouldBeKept = true;
    }

    ev.sender.value(values);

    // delete all others
    for (var i = 0; i < items.length; i++) {
      var item = items[i];
      if (!item.shouldBeKept) {
        dataSource.remove(item);
      }
    }
  };

  multiSelect.bind('change', updateValues);
  multiSelect.bind('close', updateValues);
};

I kind of copied this code by hand and tried it on another computer. If you find any bugs, let me know. Also, I know this is old time tech, but they use it in my company and I couldn't find this functionality by googling it, so here it is.

I hope it helps.

Careful with dates in Javascript

Published Nov 26, 2020

and has 0 comments

I was working on a grid display and I had to properly sort date columns. The value provided was not a datetime, but instead a string like "20 Jan 2017" or "01 Feb 2020". Obviously sorting them alphabetically would not be very useful. So what I did was implement a custom sorting function that first parsed the strings as dates, then compared them. Easy enough, particularly since the Date object in Javascript has a Parse function that understands this format.

The problem came with a string with the value "01 Jan 0001" which appeared randomly among the existing values. I first thought it was an error being thrown somewhere, or that it would not parse this string or even that it would be an overflow. It was none of that. Instead, it was about handling the year part.

A little context first:

Date.parse('01 Jan 0001') //978300000000
new Date(0) //Thu Jan 01 1970 00:00:00

Date.parse('01 Jan 1950') //-631159200000
new Date(Date.parse('01 Jan 1950')) //Sun Jan 01 1950 00:00:00

Date.parse('31 Dec 49 23:59:59.999') //2524600799999
Date.parse('1 Jan 50 00:00:00.000') //-631159200000

new Date(Date.parse('01 Jan 0001')) //Mon Jan 01 2001 00:00:00

The first two lines almost had me convinced Javascript does not handle dates lower than 1970. The next two lines disproved that and made me think it was a case of numerical overflow. The next two demonstrated it was not so. Now look closely at the last line. What? 2001?

The problem was with the handling of years that are numerically smaller than 50. The parser assumes we used a two digit year and translates it into Date.parse('01 Jan 01') which would be 2001. We get a glimpse into how it works, too, because everything between 50 and 99 would be translated into 19xx and everything between 00 and 49 is considered 20xx.

Note that .NET does not have this problem, correctly making the difference between a 2 digit and 4 digit year.

Hope it helps people.

I finally understood what a reducer is!

Published Jul 31, 2020

and has 1 comment

When I was looking at Javascript frameworks like Angular and ReactJS I kept running into these weird reducers that were used in state management mostly. It all felt so unnecessarily complicated, so I didn't look too closely into it. Today, reading some random post on dev.to, I found this simple and concise piece of code that explains it:

// simple to unit test this reducer
function maximum(max, num) { return Math.max(max, num); }

// read as: 'reduce to a maximum' 
let numbers = [5, 10, 7, -1, 2, -8, -12];
let max = numbers.reduce(maximum);

Kudos to David for the code sample.

The reducer, in this case, is a function that can be fed to the reduce function, which is known to developers in Javascript and a few other languages, but which for .NET developers it's foreign. In LINQ, we have Aggregate!

// simple to unit test this Aggregator ( :) )
Func<int, int, int> maximum = (max, num) => Math.Max(max, num);

// read as: 'reduce to a maximum' 
var numbers = new[] { 5, 10, 7, -1, 2, -8, -12 };
var max = numbers.Aggregate(maximum);

Of course, in C# Math.Max is already a reducer/Aggregator and can be used directly as a parameter to Aggregate.

I found a lot of situations where people used .reduce instead of a normal loop, which is why I almost never use Aggregate, but there are situations where this kind of syntax is very useful. One would be in functional programming or LINQ expressions that then get translated or optimized to something else before execution, like SQL code. (I don't know if Entity Framework translates Aggregate, though). Another would be where you have a bunch of reducers that can be used interchangeably.

Towards generic high performance sorting algorithms

Published Jul 27, 2020

and has 0 comments

Intro

I want to examine together with you various types of sort algorithms and the tricks they use to lower the magic O number. I reach the conclusion that high performance algorithms that are labeled as specific to a certain type of data can be made generic or that the generic algorithms aren't really that generic either. I end up proposing a new form of function that can be fed to a sorting function in order to reach better performance than the classic O(n*log(n)). Extra bonus: finding distinct values in a list.

Sorting

But first, what is sorting? Given a list of items that can be compared to one another as lower or higher, return the list in the order from lowest to highest. Since an item can be any type of data record, to define a generic sorting algorithm we need to feed it the rules that make an item lower than another and that is called the comparison function. Let's try an example in JavaScript:

  // random function from start to end inclusive
  function rand(start, end) {
    return parseInt(start + Math.random() * (end - start + 1));
  }
  
  // measure time taken by an action and output it in console
  let perfKey = 0;
  function calcPerf(action) {
    const key = perfKey++;
    performance.mark('start_' + key);
    action();
    performance.mark('end_' + key);
    const measure = performance.measure('measure_' + key, 'start_' + key, 'end_' + key);
    console.log('Action took ' + measure.duration);
  }
  
  // change this based on how powerful the computer is
  const size = 10000000;
  // the input is a list of size 'size' containing random values from 1 to 50000
  const input = [];
  for (let i = 0; i < size; i++)
    input.push(rand(1, 50000));
  
  // a comparison function between two items a and b
  function comparisonFunction(a, b) {
    if (a > b)
      return 1;
    if (a < b)
      return -1;
    return 0;
  }
  
  const output = [];
  // copy input into output, then sort it using the comparison function
  // same copying method will be used for future code
  calcPerf(() => {
    for (let i = 0; i < size; i++)
      output.push(input[i]);
    output.sort(comparisonFunction);
  });

It's not the crispest code in the world, but it's simple to understand:

calcPerf is computing the time it takes for an action to take and logs it to the console
start by creating a big array of random numbers as input
copy the array in a result array and sort that with the default browser sort function, to which we give the comparison function as an argument
display the time it took for the operation.

This takes about 4500 milliseconds on my computer.

Focus on the comparison function. It takes two items and returns a number that is -1, 0 or 1 depending on whether the first item is smaller, equal or larger than the second. Now let's consider the sorting algorithm itself. How does it work?

A naïve way to do it would be to find the smallest item in the list, move it to the first position in the array, then continue the process with the rest of the array. This would have a complexity of O(n²). If you don't know what the O complexity is, don't worry, it just provides an easy to spell approximation of how the amount of work would increase with the number of items in the input. In this case, 10 million records, squared, would lead to 100 trillion operations! That's not good.

Other algorithms are much better, bringing the complexity to O(n*log(n)), so assuming base 10, around 70 million operations. But how do they improve on this? Surely in order to sort all items you must compare them to each other. The explanation is that if a<b and b<c you do not need to compare a to c. And each algorithm tries to get to this in a different way.

However, the basic logic of sorting remains the same: compare all items with a subset of the other items.

Partitioning

A very common and recommended sorting algorithm is QuickSort. I am not going to go through the entire history of sorting algorithms and what they do, you can check that out yourself, but I can focus on the important innovation that QuickSort added: partitioning. The first step in the algorithm is to choose a value out of the list of items, which the algorithm hopes it's as close as possible to the median value and is called a pivot, then arrange the items in two groups called partitions: the ones smaller than the pivot and the ones larger than the pivot. Then it proceeds on doing the same to each partition until the partitions are small enough to be sorted by some other sort algorithm, like insertion sort (used by Chrome by default).

Let's try to do this manually in our code, just the very first run of the step, to see if it improves the execution time. Lucky for us, we know that the median is around 25000, as the input we generated contains random numbers from 1 to 50000. So let's copy the values from input into two output arrays, then sort each of them. The sorted result would be reading from the first array, then from the second!

  // two output arrays, one for numbers below 25000, the other for the rest
  const output1 = [];
  const output2 = [];
  const pivot = 25000;
  
  calcPerf(() => {
    for (let i = 0; i < size; i++) {
      const val = input[i];
      if (comparisonFunction(val, pivot) < 0)
        output1.push(val);
      else
        output2.push(val);
    }
    // sorting smaller arrays is cheaper
    output1.sort(comparisonFunction);
    output2.sort(comparisonFunction);
  });

Now, the performance is slightly better. If we do this several times, the time taken would get even lower. The partitioning of the array by an operation that is essentially O(n) (we just go once through the entire input array) reduces the comparisons that will be made in each partition. If we would use the naïve sorting, partitioning would reduce n²to n+(n/2)²+(n/2)² (once for each partitioned half), thus n+n²/2. Each partitioning almost halves the number of operations!

So, how many times can we half the number of operations for? Imagine that we do this with an array of distinct values, from 1 to 10 million. In the end, we would get to partitions of just one element and that means we did a log₂(n) number of operations and for each we added one n (the partitioning operation). That means that the total number of operations is... n*log(n). Each algorithm gets to this in a different way, but at the core of it there is some sort of partitioning, that b value that makes comparing a and c unnecessary.

Note that we treated the sort algorithm as "generic", meaning we fed it a comparison function between any two items, as if we didn't know how to compare numbers. That means we could have used any type of data as long as we knew the rule for comparison between items.

There are other types of sorting algorithms that only work on specific types of data, though. Some of them claim a complexity of O(n)! But before we get to them, let's make a short detour.

Distinct values intermezzo

Another useful operation with lists of items is finding the list of distinct items. From [1,2,2,3] we want to get [1,2,3]. To do this, we often use something called a trie, a tree-like data structure that is used for quickly finding if a value exists or not in a list. It's the thing used for autocorrect or finding a word in a dictionary. It has an O(log n) complexity in checking if an item exists. So in a list of 10 million items, it would take maybe 20 operations to find the item exists or not. That's amazing! You can see that what it does is partition the list down to the item level.

Unfortunately, this only works for numbers and strings and such primitive values. If we want to make it generic, we need to use a function that determines when two items are equal and then we use it to compare to all the other items we found as distinct so far. That makes using a trie impossible.

Let me give you an example: we take [1,1,2,3,3,4,5] and we use an externally provided equality function:

create an empty output of distinct items
take first item (1) and compare with existing distinct items (none)
item is not found, so we add it to output
take next item (1) and compare with existing distinct items (1)
item is found, so we do nothing
...
we take the last item (5) and compare with existing items (1,2,3,4)
item is not found, so we add it to the output

The number of operations that must be taken is the number of total items multiplied by the average number of distinct items. That means that for a list of already distinct values, the complexity if O(n²). Not good: It increases exponentially with the number of items! And we cannot use a trie unless we have some function that would provide us with a distinctive primitive value for an item. So instead of an equality function, a hashing function that would return a number or maybe a string.

However, given the knowledge we have so far, we can reduce the complexity of finding distinct items to O(n*log(n))! It's as simple as sorting the items, then going through the list and sending to output an item when different from the one before. One little problem here: we need a comparison function for sorting, not an equality one.

So far

We looked into the basic operations of sorting and finding distinct values. To be generic, one has to be provided with a comparison function, the other with an equality function. However, if we would have a comparison function available, finding distinct generic items would become significantly less complex by using sorting. Sorting is better than exponential comparison because it uses partitioning as an optimization trick.

Breaking the n*log(n) barrier

As I said above, there are algorithms that claim a much better performance than n*log(n). One of them is called RadixSort. BurstSort is a version of it, optimized for strings. CountSort is a similar algorithm, as well. The only problem with Radix type algorithms is that they only work on numbers or recursively on series of numbers. How do they do that? Well, since we know we have numbers to sort, we can use math to partition the lot of them, thus reducing the cost of the partitioning phase.

Let's look at our starting code. We know that we have numbers from 1 to 50000. We can also find that out easily by going once through all of them and computing the minimum and maximum value. O(n). We can then partition the numbers by their value. BurstSort starts with a number of "buckets" or lists, then assigns numbers to the buckets based on their value (dividing the value to the number of buckets). If a bucket becomes too large, it is "burst" into another number of smaller buckets. In our case, we can use CountSort, which simply counts each occurrence of a value in an ordered array. Let's see some code:

  const output = [];
  const buckets = [];
  calcPerf(() => {
    // for each possible value add a counter
    for (let i = 1; i <= 50000; i++)
      buckets.push(0);
    // count all values
    for (let i = 1; i <= size; i++) {
      const val = input[i];
      buckets[val - 1]++;
    }
    // create the output array of sorted values
    for (let i = 1; i <= 50000; i++) {
      const counter = buckets[i - 1];
      for (let j = 0; j < counter; j++)
        output.push(i);
    }
  });

This does the following:

create an array from 1 to 50000 containing zeros (these are the count buckets)
for each value in the input, increment the bucket for that value (at that index)
at the end just go through all of the buckets and output the index as many times as the value in the bucket shows

This algorithm generated a sorted output array in 160 milliseconds!

And of course, it is too good to be true. We used a lot of a priori knowledge:

min/max values were already known
the values were conveniently close together integers so we can use them as array indexes (an array of size 50000 is not too big)

I can already hear you sigh "Awwh, so I can't use it!". Do not despair yet!

The Radix algorithm, that is used only for numbers, is also used on strings. How? Well, a string is reducible to a list of numbers (characters) so one can recursively assign each string into a bucket based on the character value at a certain index. Note that we don't have to go through the entire string, the first few letters are enough to partition the list in small enough lists that can be cheaply sorted.

Do you see it yet?

A generic partition function

What if we would not use an equality function or a comparison function or a hashing function as a parameter for our generic sort/distinct algorithm? What if we would use a partition function? This partition function would act like a multilevel hashing function returning values that can also be compared to each other. In other words, the generic partition function could look like this:

function partitionFunction(item, level) returning a byte

For strings it returns the numeric value of the character at position level or 0. For numbers it returns the high to low byte in the binary representation of the number. For object instances with multiple properties, it would return a byte for each level in each of the properties that we want to order by. Radix style buckets would use the known values from 0 to 255 (the partition table returns a byte). The fact that the multilevel partitioning function is provided by the user means we can pack in it all the a priori knowledge we have, while keeping the sorting/distinct algorithm unchanged and thus, generic! The sorting will be called by providing two parameters: the partitioning function and the maximum level to which it should be called:

sort(input, partitioningFunction, maxLevel)

A final example

Here is an implementation of a radix sorting algorithm that receives a multilevel partitioning function using our original input. Note that it is written so that it is easily read and not for performance:

  // will return a sorted array from the input array
  // using the partitioning function up to maxLevel
  function radixSort(input, partitioningFunction, maxLevel) {
    // one bucket for each possible value of the partition function
    let buckets = Array.from({length: 256}, () => []); 
    buckets[0] = input;
    // reverse order, because level 0 should be the most significant
    for (let level = maxLevel-1; level >=0; level--) {
      // for each level we re-sort everything in new buckets
      // but taken in the same order as the previous step buckets
      let tempBuckets = Array.from({length: 256}, () => []);
      for (let bucketIndex = 0; bucketIndex < buckets.length; bucketIndex++) {
        const bucket = buckets[bucketIndex];
        const bucketLength = bucket.length;
        for (let bucketOffset = 0; bucketOffset < bucketLength; bucketOffset++) {
          const val = bucket[bucketOffset];
          const partByte = partitioningFunction(val, level);
          tempBuckets[partByte].push(val);
        }
      }
      // we replace the source buckets with the new ones, then repeat
      buckets = tempBuckets;
    }
    // combine all buckets in an output array
    const output = [].concat(...buckets);
    return output;
  }

  // return value bytes, from the most significant to the least
  // being <50000 the values are always represented as at most 2 bytes
  // 0xFFFF is 65535 in hexadecimal
  function partitioningFunction(item, level) {
    if (level === 0) return item >> 8;
    if (level === 1) return item & 255;
    return 0;
  }
  
  let output3 = [];
  calcPerf(() => {
    output3 = radixSort(input, partitioningFunction, 2);
  });

Want to know how long it took? 1300 milliseconds!

You can see how the same kind of logic can be used to find distinct values, without actually sorting, just by going through each byte from the partitioning function and using them as values in a trie, right?

Conclusion

Here is how a generic multilevel partitioning function replaces comparison, equality and hashing functions with a single concept that is then used to get high performance from common data operations such as sorting and finding distinct values.

I will want to work on formalizing this and publishing it as a library or something like that, but until then, what do you think?

Wait, there is more

There is a framework in which something similar is being used: SQL. It's the most common place where ORDER BY and DISTINCT are used. In SQL's case, we use an optimization method that uses indexes, which are also trie data structures storing the keys that we want to order or filter by. Gathering the data to fill a database index also has its complexity. In this case, we pre-partition once and we sort many. It's another way of reducing the cost of the partitioning!

However, this is just a sub-type of the partition function I am talking about, one that uses a precomputed data structure to reach its goal. The multilevel partition function concept I am describing here may be pure code or some other encoding of information we know out of hand before doing the operation.

Finally, the complexity. What is it? Well instead of O(n*log(n)) we get O(n*k), where k is the maximum level used in the partition function. This depends on the data, so it's not a constant, but it's the closest theoretical limit for sorting, closer to O(n) than the classic log version. In our example, k was a log, but its base was 256, not 10 as usually assumed.

I am not the best algorithm and data structure person, so if you have ideas about it and want to help me out, I would be grateful.

Quirks in Javascript regular expressions

Published Apr 21, 2020

and has 0 comments

I am subscribed to the StackOverflow newsletter and most of the times the "top" questions there are really simple things that gain attention from a lot of people. Today I got one question that I would have thought has an obvious answer, but it did not.

The question was what does "asdf".replace(/.*/g,"x") return?

And the answer to the question "What does a regular expression replace of everything with x return?" is.... [Ba da bum!] "xx".

The technical answer is there in the StackOverflow question, but I am gonna walk you through some steps to get to understand this the... dumb way.

So, let's try variations on the same theme. What does "asdf".matchAll(/.*/g) return? Well, first of all, in Chrome, it returns a RegExpStringIterator, which is pretty cool, because it's already using the latest Javascript features and it is returning an iterator rather than an array. But we can just use Array.from on it to get an array of all matches: for "asdf" and for "".

That's a pretty clear giveaway. Since the regular expression is a global one, it will get a match, then the next one until there is nothing left. First match is "asdf" as expected, the next one is "", which is the rest of the string and which also matches .* Why is it, then, that it doesn't go into a stack overflow (no pun intended) and keep turning up empty strings? Again, it's an algorithm described in an RFC and you need a doctorate in computer science to read it. Well, it's not that complicated, but I did promise a dumb explanation.

And that is that after you get a match on an index, the index is incremented. First match is found at index 0, the next one at 4. There are no matches from index 5 on.

Other variations on this theme are "asdf".matchAll(/.?/g), which will return "a","s","d","f","". You can't do "asdf".matchAll(/.*/) , you get a TypeError: undefineds called with a non-global RegExp argument error that really doesn't say much, but you can do "asdf".match(/.*/g) which returns just an array of strings, rather than more complex objects. You can also do

var reg = /.*/g;
console.log(reg.exec("asdf"),reg.exec("asdf"),reg.exec("asdf"),reg.exec("asdf"))

This more classic approach will return "asdf", "", "", "" and it would continue to return empty strings ad infinitum!

But how should one write a regular expression to get what you wanted to get, a replacement of everything with x? /.+/g would work, but it would not match an empty string. On the other hand, when was the last time you wanted to replace empty strings with anything?

Question: background jobs from a for loop using the index variable

Published Apr 18, 2020

and has 1 comment

I am going to do this in Javascript, because it's easier to write and easier for you to test (just press F12 in this page and write it in the console), but it applies to any programming language. The issue arises when you want to execute a background job (a setTimeout, an async method that you do not wait, a Task.Run, anything that runs on another execution path than the current one) inside a for loop. You have the index variable (from 0 to 10, for example) and you want to use it as a parameter for the background job. The result is not as expected, as all the background jobs use the same value for some reason.

Let's see a bit of code:

// just write in the console numbers from 0 to 9
// but delayed for a second
for (var i=0; i<10; i++)
{
  setTimeout(function() { console.log(i); },1000);
}

// the result: 10 values of 10 after a second!

But why? The reason is the "scope" of the i variable. In this case, classic (EcmaScript 5) code that uses var generates a value that exists everywhere in the current scope, which for ES5 is defined as the function this code is running from or the global scope if executed directly. If after this loop we write console.log(i) we get 10, because the loop has incremented i, got it to 10 - which is not less than 10, and exited the loop. The variable is still available. That explains why, a second later, all the functions executed in setTimeout will display 10: that is the current value of the same variable.

Now, we can solve this by introducing a local scope inside the for. In ES5 it looked really cumbersome:

for (var i=0; i<10; i++)
{
  (function() {
    var li = i;
    setTimeout(function() { console.log(li); },1000);
  })();
}

The result is the expected one, values from 0 to 9.

What happened here? We added an anonymous function and executed it. This generates a function scope. Inside it, we added a li variable (local i) and then set executing the timeout using that variable. For each value from 0 to 9, another scope is created with another li variable. If after this code we write console.log(li) we get an error because li is undefined in this scope. It is a bit confusing, but there are 10 li variables in 10 different scopes.

Now, EcmaScript 6 wanted to align Javascript with other common use modern languages so they introduced local scope for variables by defining them differently. We can now use let and const to define variables that are either going to get modified or remain constant. They also exist only in the scope of an execution block (between curly brackets).

We can write the same code from above like this:

for (let i=0; i<10; i++) {
  const li = i;
  setTimeout(()=>console.log(li),1000);
}

In fact, this is more complex than it has to be, but that's because another Javascript quirk. We can simplify this for the same result as:

for (let i=0; i<10; i++) {
  setTimeout(()=>console.log(i),1000);
}

Why? Because we "let" the index variable, so it only exists in the context of the loop execution block, but apparently it creates one version of the variable for each loop run. Strangely enough, though, it doesn't work if we define it as "const".

Just as an aside, this is way less confusing with for...of loops because you can declare the item as const. Don't use "var", though, or you get the same problem we started with!

const arr=[1,2,3,4,5];
for (const item of arr) setTimeout(()=>console.log(item),1000);

In other languages, like C#, variables exist in the scope of their execution block by default, but using a for loop will not generate multiple versions of the same variable, so you need to define a local variable inside the loop to avoid this issue. Here is an example in C#:

for (var i=0; i<10; i++)
{
    var li = i;
    Task.Run(() => Console.WriteLine(li));
}
Thread.Sleep(1000);

Note that in the case above we added a Thread.Sleep to make sure the app doesn't close while the tasks are running and that the values of the loop will not necessarily be written in order, but that's beside the point here. Also, var is the way variables are defined in C# when the type can be inferred by the compiler, it is not the same as the one in Javascript.

I hope you now have a better understanding of variable scope.

Stream of consciousness: iterables instead of numbers

Published Apr 6, 2020

and has 0 comments

Intro

As I was working on LInQer I was hooked on the multiple optimizations that I was finding. Do you want to compute the average of an iterable? You would need the total count and the sum of the items, which you can get in a single function that you can reuse to get the sum or the count. But what if the iterable is an integer range between 1 and 10? Then you can compute the sum and you already know the count. Inspired by that work and by other concepts like interval types or Maybe/Nullable types, I've decided to write this post, which I do not know if it will lead to any usable code.

What is an iterable/enumerable?

In Javascript they call it an Iterable, in .NET you have IEnumerable. They mean the same thing: sources of values. With new concepts like async/await you can use Observables as Enumerables as well, although they are theoretically diametrically opposing patterns. In both languages they are implemented as having a method that returns an iterator/enumerator, an object that can move to the next value, give you the next value and maybe reset itself. You can define any stream of values like that, having one or more values or, indeed, none. From now own I will discuss in terms of .NET nomenclature, but I see no reason why it wouldn't apply to any other language that implements this feature.

For example an array is defined as an IEnumerable<T> in .NET. Its enumerator will return false if trying to move to the next value and the array is empty, or true if there is at least a value and the current value will return the first value in the array. Move next again and it will return true or false depending on whether there is a next value. But there is no need for the values to exist to have an Enumerable. One could write an object that would return all the positive integer numbers. It's length would be infinite and the values would only be generated when requested. There is no differentiation between an Enumerable and a generator in .NET, but there is in Javascript. For this reason whenever I will use the term generator, I will mean an object that generates values rather than produce them from a source of existing ones.

The NULL controversy

A very popular InfoQ post describes the introduction of the NULL concept in programming languages a the billion dollar mistake. I am not so sure about that, but I can concede they make good points. The alternative to using a special value to describe the absence of a value is use an "option" object that either has Some value or it has None. You would check the existence of a value by calling a method to tell you if it has a value and you would get the value from the current value property. Doesn't it sound familiar? It's a more specific case of an Enumerator! Another popular solution to remove NULLs from code is to never return values from your methods, but arrays. An empty array would represent no value. An array is an Enumerable!

And that last idea opens up an interesting possibility: instead of one or none, you can have multiple values. What then? What would a multiplication mean? What about a decision block?

The LInQer experience

If you know me, you are probably fed up with me plugging LInQer as the greatest thing since fire was invented. But that's because it is! And while implementing .NET LInQ as a Javascript library I've played with some very interesting concepts.

For example, while implementing the Last operator on enumerables, I had two different implementations depending on whether one could know the length in advance and one could use indexed access to the values. An array of one billion values has no problem giving you the last item in it because of two things: you know where the array ends and you can access any item at any position without having to go through other values. You just take the value at index one billion minus one. If you would have a generator, though, the only way to get the last value would be to enumerate again and again and again and only when moving to the next value would fail you would have the last value as the last one. And of course, this would be bad if there are no bounds to the generator.

But there is more. What about very common statistical values like the sum? This, of course, applies to numbers. The Enumerable need not produce numbers, so in other contexts it means nothing. Then there are concepts like statistical distribution. One can make some assumptions if they know the distribution of values. A constant yet infinite generator of numbers will always have the same average value. It would return the same value, regardless of index.

I spent a lot of time doing sorting that only needs a part of the enumerable, or partial sorting. I've implemented a Quicksort algorithm that works faster than the default sort when there are enough values and that can ignore the parts of the array that I don't need. Also, there are specific algorithm to return the last or first N items. All of this depends on functions that determine the order of items. Randomness is also interesting, as it needs to take into consideration the change of probabilities as the list of items increases with each request. Sampling was fun, too!

Then there were operators like Distinct or Group which needed to use functions to determine sameness.

With all this work, I've never intended to make this more than what LInQ is in .NET: a way to dynamically filter and map and enumerate sequences of items without having to go through them all or to create intermediate but unnecessary arrays. What I am talking about now is taking things further and deeper.

Continuous intervals

What if the Enumerable abstraction is not enough? For example one could define the interval of real numbers between 0 and 1. You can never enumerate the next value, but there are definite boundaries, a clear distribution of values, a very obvious average. What about series and limits? If a generator generates values that depend on previous values, like a geometric progression or a Fibonacci series, you can sometimes compute the minimum or maximum value of the items in it, or of their sums.

Tools

So we have more concepts in our bag now:

move next (function)
current value
item length (could be infinite or just unknown)
indexed access (or not)
boundaries (min, max, limits)
distribution (probabilities)
order
discreteness

How could we use these?

Concrete cases

There is one famous probabilities problem: what are the chances you will get a particular value by throwing a number of dice. And it is interesting because there is a marked difference between using one die or more. Using at least two dice and plotting the values you get after multiple throws you get what is called a Normal distribution, a Gauss curve, and that's because there are more combinations of values that sum up to 6 than there are for 2.

How can we declare a value that belongs to an interval? One solution is to add all kinds of metadata or validations. But what if we just declare an iterable with one value that has a random value between 1 and 6? And what if we add it with another one? What would that mean?

Here is a demo example. It's silly and it looks too much like the Calculator demos you see for unit testing and I really hate those, but I do want to just demo this. What else can we do with this idea? I will continue to think about it.

class Program
    {
        static void Main(string[] args)
        {
            var die1 = new RandomGenerator(1, 6);
            var die2 = new RandomGenerator(1, 6);
            // just get the value
            var value1 = die1.First() + die2.First();
            // compose the two dice using Linq, then get value
            var value2 = die1.Zip(die2).Select(z => z.First + z.Second).First();
            // compose the two dice using operator overload, then get value
            var value3 = (die1 + die2).First();
            var min = (die1 + die2).Min();
        }

        /// <summary>
        /// Implemented Min alone for demo purposes
        /// </summary>
        /// <typeparam name="T"></typeparam>
        public interface IGenerator<T> : IEnumerable<T>
        {
            int Min();
        }

        /// <summary>
        /// Generates integer values from minValue to maxValue inclusively
        /// </summary>
        public class RandomGenerator : IGenerator<int>
        {
            private readonly Random _rnd;
            private readonly int _minValue;
            private readonly int _maxValue;

            public RandomGenerator(int minValue, int maxValue)
            {
                _rnd = new Random();
                this._minValue = minValue;
                this._maxValue = maxValue;
            }

            public static IGenerator<int> operator +(RandomGenerator gen1, IGenerator<int> gen2)
            {
                return new AdditionGenerator(gen1, gen2);
            }

            public IEnumerator<int> GetEnumerator()
            {
                while (true)
                {
                    yield return _rnd.Next(_minValue, _maxValue + 1);
                }
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return ((IEnumerable<int>)this).GetEnumerator();
            }

            public int Min()
            {
                return _minValue;
            }
        }
        
        /// <summary>
        /// Combines two generators through addition
        /// </summary>
        internal class AdditionGenerator : IGenerator<int>
        {
            private IGenerator<int> _gen1;
            private IGenerator<int> _gen2;

            public AdditionGenerator(Program.RandomGenerator gen1, Program.IGenerator<int> gen2)
            {
                this._gen1 = gen1;
                this._gen2 = gen2;
            }

            public IEnumerator<int> GetEnumerator()
            {
                var en1 = _gen1.GetEnumerator();
                var en2 = _gen2.GetEnumerator();
                while (true)
                {
                    var hasValue = en1.MoveNext();
                    if (hasValue != en2.MoveNext())
                    {
                        throw new InvalidOperationException("One generator stopped providing values before the other");
                    }
                    if (!hasValue)
                    {
                        yield break;
                    }
                    yield return en1.Current + en2.Current;
                }

            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return ((IEnumerable<int>)this).GetEnumerator();
            }

            public int Min()
            {
                return _gen1.Min() + _gen2.Min();
            }
        }
    }

Conclusion (so far)

I am going to think about this some more. It has a lot of potential as type abstraction, but to be honest, I deal very little in numerical values and math and statistics, so I don't see what I personally could do with this. I suspect, though, that other people might find it very useful or at least interesting. And yes, I am aware of mathematical concepts like interval arithmetic and I am sure there are a ton of existing libraries that already do something like that and much more, but I am looking at this more from the standpoint of computer science and quasi-primitive types than from a mathematical or numerical perspective. If you have any suggestions or ideas or requests, let me know!

QBasic INKEY in Javascript - an exploration of old and new

Published Mar 29, 2020

and has 0 comments

Intro

When I was a kid, computers didn't have multithreading, multitasking or even multiple processes. You executed a program and it was the only program that was running. Therefore the way to do, let's say, user key input was to check again and again if there is a key in a buffer. To give you a clearer view on how bonkers that was, if you try something similar in Javascript the page dies. Why? Because the processing power to look for a value in an array is minuscule and you will basically have a loop that executes hundreds of thousand or even millions of times a second. The CPU will try to accommodate that and run at full power. You will have a do nothing loop that will take the entire capacity of the CPU for the current process. The browser would have problems handling legitimate page events, like you trying to close it! Ridiculous!

Bad solution

Here is what this would look like:

class QBasic {

    constructor() {
        this._keyBuffer=[];
        // add a global handler on key press and place events in a buffer
        window.addEventListener('keypress', function (e) {
            this._keyBuffer.push(e);
        }.bind(this));
    }

    INKEY() {
        // remove the first key in the buffer and return it
        const ev = this._keyBuffer.shift();
        // return either the key or an empty string
        if (ev) {
            return ev.key;
        } else {
            return '';
        }
    }
}

// this code will kill your CPU and freeze your page
const qb = new QBasic();
while (qb.INKEY()=='') {
 // do absolutely nothing
}

How then, should we port the original QBasic code into Javascript?

WHILE INKEY$ = ""

    ' DO ABSOLUTELY NOTHING

WEND

Best solution (not accepted)

Of course, the best solution is to redesign the code and rewrite everything. After all, this is thirty year old code. But let's imagine that, in the best practices of porting something, you want to find the first principles of translating QBasic into Javascript, then automate it. Or that, even if you do it manually, you want to preserve the code as much as possible before you start refactoring it. I do want to write a post about the steps of refactoring legacy code (and as you can see, sometimes I actually mean legacy, as in "bestowed upon by our forefathers"), but I wanted to write something tangible first. Enough theory!

Interpretative solution (not accepted, yet)

Another solution is to reinterpret the function into a waiting function, one that does nothing until a key is pressed. That would be easier to solve, but again, I want to translate the code as faithfully as possible, so this is a no-no. However, I will discuss how to implement this at the end of this post.

Working solution (slightly less bad solution)

Final solution: do the same thing, but add a delay, so that the loop doesn't use the entire pool of CPU instructions. Something akin to Thread.Sleep in C#, maybe. But, oops! in Javascript there is no function that would freeze execution for a period of time.

The only thing related to delays in Javascript is setTimeout, a function that indeed waits for the specified interval of time, but then executes the function that was passed as a parameter. It does not pause execution. Whatever you write after setTimeout will execute immediately. Enter async/await, new in Javascript ES8 (or EcmaScript 2017), and we can use the delay function as we did when exploring QBasic PLAY:

function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

Now we can wait inside the code with await delay(milliseconds);. However, this means turning the functions that use it into async functions. As far as I am concerned, the pollution of the entire function tree with async keywords is really annoying, but it's the future, folks!

Isn't this amazing? In order to port to Javascript code that was written in 1990, you need features that were added to the language only in 2017! If you wanted to do this in Javascript ES5 you couldn't do it! The concept of software development has changed so much that it would have been impossible to port even the simplest piece of code from something like QBasic to Javascript.

Anyway, now the code looks like this:

function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

class QBasic {

    constructor() {
        this._keyBuffer=[];
        // add a handler on every key press and place events in a buffer
        window.addEventListener('keypress', function (e) {
            this._keyBuffer.push(e);
        }.bind(this));
    }

    async INKEY() {
        // remove the first key in the buffer and return it
        const ev = this._keyBuffer.shift();
        // return either the key or an empty string
        if (ev) {
            return ev.key;
        } else {
            await delay(100);
            return '';
        }
    }
}

const qb = new QBasic();
while (qb.INKEY()=='') {
 // do absolutely nothing
}

Now, this will work by delaying for 100 milliseconds when there is nothing in the buffer. It's clearly not ideal. If one wanted to fix a problem with a loop running too fast, then the delay function should have at least been added to the loop, not the INKEY function. Using it like this you will get some inexplicable delays in code that would want to use fast key inputs. It's, however, the only way we can implement an INKEY function that will behave as close to the original as possible, which is hiring a 90 year old guy to go to a letter box and check if there is any character in the mail and then come back and bring it to you. True story, it's the original implementation of the function!

Interpretative solution (implementation)

It would have been much simpler to implement the function in a blocking manner. In other words, when called, INKEY would wait for a key to be pressed, then exit and return that key when the user presses it. We again would have to use Promises:

class QBasic {

    constructor() {
        this._keyHandler = null;
        // instead of using a buffer for keys, keep a reference
        // to a resolve function and execute it if it exists
        window.addEventListener('keypress', function (e) {
            if (this._keyHandler) {
                const handler = this._keyHandler;
                this._keyHandler = null;
                handler(e.key);
            }
        }.bind(this));
    }

    INKEY() {
        const self = this;
        return new Promise(resolve => self._keyHandler = resolve);
    }
}


const qb = new QBasic();
while ((await qb.INKEY())=='') { // or just await qb.INKEY(); instead of the loop
 // do absolutely nothing
}

Amazing again, isn't it? The loops (pun not intended) through which one has to go in order to force a procedural mindset on an event based programming language.

Disclaimer

Just to make sure, I do not recommend this style of software development; this is only related to porting old school code and is more or less designed to show you how software development has changed in time, from a period before most of you were even born.

QBasic Play in Javascript - an exploration of old and new

Published Feb 29, 2020

and has 0 comments

Intro

This post will take you on an adventure through time and sound. It will touch the following software development concepts:

await/async in Javascript
named groups in regular expressions in Javascript
the AudioContext API in Javascript
musical note theory
Gorillas!

In times immemorial, computers were running something called the DOS operating system and almost the entire interface was text based. There was a way to draw things on the screen, by setting the values of pixels directly in the video memory. The sound was something generated on a "PC speaker" which was a little more than a small speaker connected to a power port and which you had to make work by handling "interrupts". And yet, since this is when I had my childhood, I remember so many weird little games and programs from that time with a lot of nostalgic glee.

One of these games was Gorillas, where two angry gorillas would attempt to murder each other by throwing explosive bananas. The player would have to enter the angle and speed and also take into account a wind speed that was displayed as an arrow on the bottom of the screen. That's all. The sounds were ridiculous, the graphics really abstract and yet it was fun. So, as I was remembering the game, I thought: what would it take to make that game available in a modern setting? I mean, the programming languages, the way people thought about development, the hardware platform, everything has changed.

In this post I will detail the PLAY command from the ancient programming language QBASIC. This command was being used to generate sound by instructing the computer to play musical notes on the PC speaker. Here is an example of usage:

PLAY "MBT160O1L8CDEDCDL4ECC"

This would play the short song at the beginning of the Gorillas game. The string tells the computer to play the sound in the background, at a tempo of 160 in the first octave, with notes of an eighth of a measure: CDEDCD then end with quarter measure notes: ECC. I want to replicate this with Javascript, one because it's simpler to prototype and second because I can make the result work in this very post.

Sound and Music

But first, let's see how musical notes are being generated in Javascript, using the audio API. First you have to create an AudioContext instance, with which you create an Oscillator. On the oscillator you set the frequency and then... after a while you stop the sound. The reason why the API seems so simplistic is because it works by creating an audio graph of nodes that connect to each other and build on each other. There are multiple ways in which to generate sound, including filling a buffer with data and playing that, but I am not going to go that way.

Therefore, in order to PLAY in Javascript I need to translate concepts like tempo, octaves, notes and measures into values like duration and frequency. That's why we need a little bit of musical theory.

In music, sounds are split into domains called octaves, each holding seven notes that, depending on your country, are either Do, Re, Mi, Fa, So, La, Si or A, B,C, D, E, F and G or something else. Then you have half notes, so called sharp or flat notes: A# is half a note above A and A♭ is a half note below A. A# is the same as B♭. For reasons that I don't want to even know, the octaves start with C. Also the notes themselves are not equally spaced. The octaves are not of the same size, in terms of frequency. Octave 0 starts at 16.35Hz and ends at 30.87, octave 1 ranges between 32.70 and 61.74. In fact, each octave spreads on twice as much frequency space as the one before. Each note has twice the frequency of the same note on the lower octave.

In a more numerical way, octaves are split into 12: C, C#, D, E♭, E, F, F#, G, G#, A, B♭, B. Note (heh heh) that there are no half notes between B and C and E and F. The frequency of one of these notes is 2^1/12 times the one before. Therefore one can compute the frequency of a note as:

Frequency = Key note * 2^n/12, where the key note is a note that you use as a base and n is the note-distance between the key note and the note you want to play.

The default key note is A₄, or note A from octave 4, at 440Hz. That means B♭ has a frequency of 440*1.059463 = 466.2.

Having computed the frequency, we now need the duration. The input parameters for this are: tempo, note length, mode and the occasional "dot":

tempo is the number of quarter measures in a minute
- this means if the tempo is 120, a measure is 60000 milliseconds divided by 120, then divided by 4, so 125 milliseconds
note length - the length of a note relative to a measure
- these are usually fractions of a measure: 1, 1/2, 1/4, 1/8, 1/16, etc
mode - this determines a general speed of playing the melody
- as defined by the PLAY command, you have:
  - normal: a measure is 7/8 of a default measure
  - legato: a measure is a measure
  - staccato: a measure is 3/4 of a default measure
dotted note - this means a specific note will be played for 3/2 of the defined duration for that note

This gives us the formula:

Duration = note length * mode * 60000 / 4 / tempo * dotDuration

Code

With this knowledge, we can start writing code that will interpret musical values and play a sound. Now, the code will be self explanatory, hopefully. The only thing I want to discuss outside of the audio related topic is the use of async/await in Javascript, which I will do below the code. So here it is:

class QBasicSound {

    constructor() {
        this.octave = 4;
        this.noteLength = 4;
        this.tempo = 120;
        this.mode = 7 / 8;
        this.foreground = true;
        this.type = 'square';
    }

    setType(type) {
        this.type = type;
    }

    async playSound(frequency, duration) {
        if (!this._audioContext) {
            this._audioContext = new AudioContext();
        }
        // a 0 frequency means a pause
        if (frequency == 0) {
            await delay(duration);
        } else {
            const o = this._audioContext.createOscillator();
            const g = this._audioContext.createGain();
            o.connect(g);
            g.connect(this._audioContext.destination);
            o.frequency.value = frequency;
            o.type = this.type;
            o.start();
            await delay(duration);
            // slowly decrease the volume of the note instead of just stopping so that it doesn't click in an annoying way
            g.gain.exponentialRampToValueAtTime(0.00001, this._audioContext.currentTime + 0.1);
        }
    }

    getNoteValue(octave, note) {
        const octaveNotes = 'C D EF G A B';
        const index = octaveNotes.indexOf(note.toUpperCase());
        if (index < 0) {
            throw new Error(note + ' is not a valid note');
        }
        return octave * 12 + index;
    }

    async playNote(octave, note, duration) {
        const A4 = 440;
        const noteValue = this.getNoteValue(octave, note);
        const freq = A4 * Math.pow(2, (noteValue - 48) / 12);
        await this.playSound(freq, duration);
    }

    async play(commandString) {
        const reg = /(?<octave>O\d+)|(?<octaveUp>>)|(?<octaveDown><)|(?<note>[A-G][#+-]?\d*\.?)|(?<noteN>N\d+\.?)|(?<length>L\d+)|(?<legato>ML)|(?<normal>MN)|(?<staccato>MS)|(?<pause>P\d+\.?)|(?<tempo>T\d+)|(?<foreground>MF)|(?<background>MB)/gi;
        let match = reg.exec(commandString);
        let promise = Promise.resolve();
        while (match) {
            let noteValue = null;
            let longerNote = false;
            let temporaryLength = 0;
            if (match.groups.octave) {
                this.octave = parseInt(match[0].substr(1));
            }
            if (match.groups.octaveUp) {
                this.octave++;
            }
            if (match.groups.octaveDown) {
                this.octave--;
            }
            if (match.groups.note) {
                const noteMatch = /(?<note>[A-G])(?<suffix>[#+-]?)(?<shorthand>\d*)(?<longerNote>\.?)/i.exec(match[0]);
                if (noteMatch.groups.longerNote) {
                    longerNote = true;
                }
                if (noteMatch.groups.shorthand) {
                    temporaryLength = parseInt(noteMatch.groups.shorthand);
                }
                noteValue = this.getNoteValue(this.octave, noteMatch.groups.note);
                switch (noteMatch.groups.suffix) {
                    case '#':
                    case '+':
                        noteValue++;
                        break;
                    case '-':
                        noteValue--;
                        break;
                }
            }
            if (match.groups.noteN) {
                const noteNMatch = /N(?<noteValue>\d+)(?<longerNote>\.?)/i.exec(match[0]);
                if (noteNMatch.groups.longerNote) {
                    longerNote = true;
                }
                noteValue = parseInt(noteNMatch.groups.noteValue);
            }
            if (match.groups.length) {
                this.noteLength = parseInt(match[0].substr(1));
            }
            if (match.groups.legato) {
                this.mode = 1;
            }
            if (match.groups.normal) {
                this.mode = 7 / 8;
            }
            if (match.groups.staccato) {
                this.mode = 3 / 4;
            }
            if (match.groups.pause) {
                const pauseMatch = /P(?<length>\d+)(?<longerNote>\.?)/i.exec(match[0]);
                if (pauseMatch.groups.longerNote) {
                    longerNote = true;
                }
                noteValue = 0;
                temporaryLength = parseInt(pauseMatch.groups.length);
            }
            if (match.groups.tempo) {
                this.tempo = parseInt(match[0].substr(1));
            }
            if (match.groups.foreground) {
                this.foreground = true;
            }
            if (match.groups.background) {
                this.foreground = false;
            }

            if (noteValue !== null) {
                const noteDuration = this.mode * (60000 * 4 / this.tempo) * (longerNote ? 1 : 3 / 2);
                const duration = temporaryLength
                    ? noteDuration / temporaryLength
                    : noteDuration / this.noteLength;
                const A4 = 440;
                const freq = noteValue == 0
                    ? 0
                    : A4 * Math.pow(2, (noteValue - 48) / 12);
                const playPromise = () => this.playSound(freq, duration);
                promise = promise.then(playPromise)
            }
            match = reg.exec(commandString);
        }
        if (this.foreground) {
            await promise;
        } else {
            promise;
        }
    }
}

function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

One uses the code like this:

var player = new QBasicSound();
await player.play('T160O1L8CDEDCDL4ECC');

Note that you cannot start playing the sound directly, you need to wait for a user interaction first. An annoying rule to suppress annoying websites which would start playing the sound on load. And here is the result (press multiple times on Play for different melodies):

Javascript in modern times

There are two concepts that were used in this code that I want to discuss: named regular expression groups and async/await. Coincidentally, both are C# concepts that have crept up in the modern Javascript specifications when .NET developers from Microsoft started contributing to the language.

Named groups are something that appeared in ES2018 and it is something I've been using with joy in .NET and hated when I didn't have it in some other language. Look at the difference between the original design and the current one:

// original design
var match = /(a)bc/.exec('abcd');
if (match && match[1]) { /*do something with match[1]*/ }

// new feature
const match = /(?<theA>a)bc/.exec('abcd');
if (match && match.groups.theA) { /*do something with match.groups.theA*/ }

There are multiple advantages to this:

readability for people revisiting the code
robustness in the face of changes to the regular expression
- the index might change if new groups are added to it
the code aligns with the C# code (I like that :) )

My advice is to always use named groups when using regular expressions.

Another concept is await/async. In .NET it is used to hide complex asynchronous interactions in the code and with the help of the compiler helps with all the tasks that are running at the same time. Unfortunately, in C#, that means polluting code with async keywords on all levels as async methods can only be used inside other async methods. No such qualms in Javascript.

While in .NET the await/async system runs over Task<T> methods, in Javascript it runs over Promises. Both are abstractions over work that is being done asynchronously.

A most basic example is this:

// original design
getSomethingAsync(url,function(data) {
  getSomethingElseAsync(data.url,function(data2) {
    // do something with data2
  }, errorHandler2);
},errorHandler1);

// Promises
getSomethingAsync(url)
  .then(function(data) {
    getSomethingElseAsync(data.url);
  })
  .then(function(data2) {
    // so something with data2
  })
  .catch(errorHandler);

// async/await
try {
  var data = await getSomethingAsync(url);
  var data2 = await getSomethingElseAsync(data.url);
  // do something with data2
} catch(ex) {
  errorHandler(ex);
}

You see that the await/async way looks like synchronous code, you can even catch errors. await can be used on any function that returns a Promise instance and the result of it is a non-blocking wait until the Promise resolves and returns the value that was passed to the resolve function.

If you go back to the QBasicSound class, at the end, depending on if the sound is in the foreground or background, the function is either awaiting a promise or ... just letting it run. You might also notice that I've added a delay function at the end of the code which is using setTimeout to resolve a Promise. Here is what is actually going on:

// using await
console.log(1);
await delay(1000).then(()=>console.log(2));
console.log(3);
// this logs 1,2,3


// NOT using await
console.log(1);
delay(1000).then(()=>console.log(2));
console.log(3);
// this logs 1,3,2

In the first case, the Promise that was constructed by a one second delay and then logging 2 is awaited, meaning the code waits for the result. After it is executed, 3 gets logged. In the second case, the logging of 2 is executed after one second delay, but the code does not wait for the result, therefore 3 is logged immediately and 2 comes after.

What sorcery is this?! Isn't Javascript supposed to be single threaded? How does it work? Well, consider that in the delay function, the resolve function will only be called after a timeout of one second. When executed, it starts the timeout, then reaches the end of the function. It has not been resolved yet, so it passes control back to the engine, which uses it to execute other things. When the timeout is fired, the engine takes back control, executes the resolve function, then passes control back. All of this is invisible to the user, who gets the illusion of multithreaded behavior.

Already some standard out of the box APIs are async, like fetch. In order to get an object from a REST API that is called via HTTP the code would look like this:

// fetch API
let response = await fetch('/article/promise-chaining/user.json');
let user = await response.json();

Conclusion

I spent an entire day learning about sounds and writing code that would emulate QBASIC code from a billion years ago. Who knows, maybe my next project will be to port the entire Gorillas game in Javascript. Now one can lovingly recreate the sounds of one's childhood.

Other references:

Gorillas.BAS

QBasic/Appendix

Generate Sounds Programmatically With Javascript

Musical Notes

Gorrilas game online