I was working on a grid display and I had to properly sort date columns. The value provided was not a datetime, but a string like "20 Jan 2017" or "01 Feb 2020". Sorting them alphabetically would obviously not be very useful, so I implemented a custom sorting function that first parsed the strings as dates, then compared them. Easy enough, particularly since the Date object in Javascript has a parse function (Date.parse) that understands this format.

  The problem came with a string with the value "01 Jan 0001", which appeared at seemingly random positions among the sorted values. At first I thought an error was being thrown somewhere, that the string could not be parsed, or even that a value was overflowing. It was none of that. Instead, it was about how the year part is handled.

  A little context first:

Date.parse('01 Jan 0001') //978300000000
new Date(0) //Thu Jan 01 1970 00:00:00

Date.parse('01 Jan 1950') //-631159200000
new Date(Date.parse('01 Jan 1950')) //Sun Jan 01 1950 00:00:00

Date.parse('31 Dec 49 23:59:59.999') //2524600799999
Date.parse('1 Jan 50 00:00:00.000') //-631159200000

new Date(Date.parse('01 Jan 0001')) //Mon Jan 01 2001 00:00:00

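  The first two lines almost had me convinced Javascript does not handle dates lower than 1970. The next two lines disproved that and made me think it was a case of numerical overflow. The next two demonstrated it was not so. Now look closely at the last line. What? 2001? (The numeric values above come from a machine in a UTC+2 time zone; since these strings are parsed as local time, the numbers will differ in other time zones.)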

  The problem was with the handling of years that are numerically small. The parser assumes we used a two digit year and treats the string like Date.parse('01 Jan 01'), which is 2001. We get a glimpse into how it works, too: everything between 50 and 99 is translated into 19xx and everything between 00 and 49 is considered 20xx. Keep in mind that this string format is not the ISO format that Date.parse is required to support, so the result is left to the heuristics of each Javascript engine.
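The rule observed above can be written down as a small function. This is only a sketch of the behaviour seen in the examples, not the actual code of any Javascript engine, and expandTwoDigitYear is a made-up name:

```javascript
// Hypothetical helper mirroring the pivot seen in the examples:
// 00-49 becomes 2000-2049, 50-99 becomes 1950-1999.
function expandTwoDigitYear(yy) {
  if (!Number.isInteger(yy) || yy < 0 || yy > 99) {
    throw new RangeError('expected an integer between 0 and 99');
  }
  return yy < 50 ? 2000 + yy : 1900 + yy;
}

console.log(expandTwoDigitYear(1));  // 2001
console.log(expandTwoDigitYear(49)); // 2049
console.log(expandTwoDigitYear(50)); // 1950
```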

  Note that .NET does not have this problem, correctly distinguishing between a 2 digit and a 4 digit year.
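One way around the issue is to not rely on Date.parse at all for this format. Below is a minimal sketch, assuming the values always look like "dd Mon yyyy" with English three letter month names; parseDayMonthYear and compareDateStrings are names invented for this example. It works in UTC and does not validate the day (31 Feb would roll over into March).

```javascript
const MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                'jul', 'aug', 'sep', 'oct', 'nov', 'dec'];

// Parses strings like "20 Jan 2017" or "01 Jan 0001".
// Returns a UTC timestamp in milliseconds, or NaN if the format does not match.
function parseDayMonthYear(text) {
  const match = /^\s*(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s*$/.exec(text);
  if (!match) return NaN;
  const month = MONTHS.indexOf(match[2].toLowerCase());
  if (month < 0) return NaN;
  const date = new Date(0);
  // setUTCFullYear takes the year literally, so 0001 stays year 1
  date.setUTCFullYear(Number(match[3]), month, Number(match[1]));
  return date.getTime();
}

// Comparison function for Array.prototype.sort (assumes both values parse)
function compareDateStrings(a, b) {
  return parseDayMonthYear(a) - parseDayMonthYear(b);
}

const values = ['01 Feb 2020', '01 Jan 0001', '20 Jan 2017'];
values.sort(compareDateStrings);
console.log(values); // [ '01 Jan 0001', '20 Jan 2017', '01 Feb 2020' ]
```

Values that fail to parse produce NaN, which makes the comparison result meaningless, so a real grid would have to decide where to place them.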

  Hope it helps people.

Intro

  Discord is something I had only vaguely heard about and, when a friend told me he used it to chat with friends, I installed it, too. I was pleasantly surprised to see it is a very usable and free chat application, which combines features of IRC, other messenger applications and a bit of Slack. You can create servers and add channels to them, for example, where you can determine the rights of people and so on. What sets Discord apart from anything, perhaps other than Slack, is the level of "integration", the ability to programmatically interact with it. So I created a "bot", a program which stays active, responds to user chat messages and can take action. This post is about how to do that.

  Before you implement a bot you obviously need:

  All of this has been done to death and you can follow the links above to learn how to do it. Before we continue, a little something that might not be obvious: you can edit a Discord chat invite so that it never expires, as is the case with the one on this blog now.

Writing code

One can write a bot in a multitude of programming languages, but I am a .NET dev, so Discord.NET it is. Note that this is an "unofficial" library, so it may not be (and it is not) completely in sync with all the options that the Discord API provides. One such feature, for example, is multiple attachments to a message. But I digress.

Since my blog is also written in ASP.NET Core, it made sense to add the bot code to that. Also, in order to make it all clean code, I will use dependency injection as much as possible and use the built-in system for commands, even if it is quite rudimentary.

Step 1 - making dependencies available

We are going to need these dependencies:

  • DiscordSocketClient - the client to connect to Discord
  • CommandService - the service managing commands
  • BotSettings - a class used to hold settings and configuration
  • BotService - the bot itself, which we are going to make implement IHostedService so we can add it as a hosted service in ASP.Net

In order to keep things separated, I will not add all of this in Startup, instead encapsulating them into a Bootstrap class:

public static class Bootstrap
{
    public static IWebHostBuilder UseDiscordBot(this IWebHostBuilder builder)
    {
        return builder.ConfigureServices(services =>
        {
            services
                .AddSingleton<BotSettings>()
                .AddSingleton<DiscordSocketClient>()
                .AddSingleton<CommandService>()
                .AddHostedService<BotService>();
        });
    }
}

This allows me to add the bot simply in CreateWebHostBuilder as: 

WebHost.CreateDefaultBuilder(args)
   .UseStartup<Startup>()
   .UseKestrel(a => a.AddServerHeader = false)
   .UseDiscordBot();

Step 2 - the settings

The BotSettings class will be used not only to hold information, but also to communicate it between classes. Each Discord chat bot needs an access token to connect and we can add that as a configuration value in appsettings.json:

{
  ...
  "DiscordBot": {
    "Token": "<the token value>"
  },
  ...
}

public class BotSettings
{
    public BotSettings(IConfiguration config, IHostingEnvironment hostingEnvironment)
    {
        Token = config.GetValue<string>("DiscordBot:Token");
        RootPath = hostingEnvironment.WebRootPath;
        BotEnabled = true;
    }

    public string Token { get; }
    public string RootPath { get; }
    public bool BotEnabled { get; set; }
}

As you can see, there is no fancy class for getting the config, nor do we use IOptions or anything like that. We only need to get the token value once, so let's keep it simple. I've added the RootPath because you might want to use it to access files on the local file system. The other property, BotEnabled, is a setting for enabling or disabling the functionality of the bot.

Step 3 - the bot skeleton

Here is the skeleton for a bot. It doesn't change much outside the MessageReceived and CommandExecuted code.

public class BotService : IHostedService, IDisposable
{
    private readonly DiscordSocketClient _client;
    private readonly CommandService _commandService;
    private readonly IServiceProvider _services;
    private readonly BotSettings _settings;

    public BotService(DiscordSocketClient client,
        CommandService commandService,
        IServiceProvider services,
        BotSettings settings)
    {
        _client = client;
        _commandService = commandService;
        _services = services;
        _settings = settings;
    }

    // The hosted service has started
    public async Task StartAsync(CancellationToken cancellationToken)
    {
        _client.Ready += Ready;
        _client.MessageReceived += MessageReceived;
        _commandService.CommandExecuted += CommandExecuted;
        _client.Log += Log;
        _commandService.Log += Log;
        // look for classes implementing ModuleBase to load commands from
        await _commandService.AddModulesAsync(Assembly.GetEntryAssembly(), _services);
        // log in to Discord, using the provided token
        await _client.LoginAsync(TokenType.Bot, _settings.Token);
        // start bot
        await _client.StartAsync();
    }

    // logging
    private async Task Log(LogMessage arg)
    {
        // do some logging
    }

    // bot has connected and it's ready to work
    private async Task Ready()
    {
        // some random stuff you can do once the bot is online: 

        // set status to online
        await _client.SetStatusAsync(UserStatus.Online);
        // Discord started as a game chat service, so it has the option to show what games you are playing
        // Here the bot will display "Listening to dead", because the activity type is Listening
        await _client.SetGameAsync("dead", "https://siderite.dev", ActivityType.Listening);
    }
    private async Task MessageReceived(SocketMessage msg)
    {
        // message retrieved
    }
    private async Task CommandExecuted(Optional<CommandInfo> command, ICommandContext context, IResult result)
    {
        // a command execution was attempted
    }

    // the hosted service is stopping
    public async Task StopAsync(CancellationToken cancellationToken)
    {
        await _client.SetGameAsync(null);
        await _client.SetStatusAsync(UserStatus.Offline);
        await _client.StopAsync();
        _client.Log -= Log;
        _client.Ready -= Ready;
        _client.MessageReceived -= MessageReceived;
        _commandService.Log -= Log;
        _commandService.CommandExecuted -= CommandExecuted;
    }


    public void Dispose()
    {
        _client?.Dispose();
    }
}

Step 4 - adding commands

In order to add commands to the bot, you must do the following:

  • create a class to inherit from ModuleBase
  • add public methods that are decorated with the CommandAttribute
  • don't forget to call commandService.AddModulesAsync like above

Here is an example of an enable/disable command class:

public class BotCommands:ModuleBase
{
    private readonly BotSettings _settings;

    public BotCommands(BotSettings settings)
    {
        _settings = settings;
    }

    [Command("bot")]
    // the parameter is optional, so a plain "bot" command only reports the state
    public async Task Bot([Remainder]string rest = null)
    {
        if (string.Equals(rest, "enable",StringComparison.OrdinalIgnoreCase))
        {
            _settings.BotEnabled = true;
        }
        if (string.Equals(rest, "disable", StringComparison.OrdinalIgnoreCase))
        {
            _settings.BotEnabled = false;
        }
        await this.Context.Channel.SendMessageAsync("Bot is "
            + (_settings.BotEnabled ? "enabled" : "disabled"));
    }
}

When the bot command is issued, the state of the bot is sent as a message to the chat. If the parameter of the command is enable or disable, the state is changed accordingly first.

Yet, in order for this command to work, we need to add code to the bot MessageReceived method: 

private async Task MessageReceived(SocketMessage msg)
{
    // do not process bot messages or system messages
    if (msg.Author.IsBot || msg.Source != MessageSource.User) return;
    // only process this type of message
    var message = msg as SocketUserMessage;
    if (message == null) return;
    // match the message if it starts with R2D2
    var match = Regex.Match(message.Content, @"^\s*R2D2\s+", RegexOptions.IgnoreCase);
    int? pos = null;
    if (match.Success)
    {
        // this is an R2D2 command, everything after the match is the command text
        pos = match.Length;
    }
    else if (message.Channel is IPrivateChannel)
    {
        // this is a command sent directly to the private channel of the bot, 
        // don't expect to start with R2D2 at all, just execute it
        pos = 0;
    }
    if (pos.HasValue)
    {
        // this is a command, execute it
        var context = new SocketCommandContext(_client, message);
        await _commandService.ExecuteAsync(context, message.Content.Substring(pos.Value), _services);
    }
    else
    {
        // processing of messages that are not commands
        if (_settings.BotEnabled)
        {
            // if the bot is enabled and people are talking about it, show an image and say "beep beep"
            if (message.Content.Contains("R2D2",StringComparison.OrdinalIgnoreCase))
            {
                await message.Channel.SendFileAsync(_settings.RootPath + "/img/R2D2.gif", "Beep beep!", true);
            }
        }
    }
}

This code forwards commands to the command service if the message starts with R2D2 (or was sent in a private channel). Otherwise, if the bot is enabled, it replies to any message that contains R2D2 with the R2D2 picture and the text "Beep beep!".
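The routing decision itself does not depend on Discord.NET, so it can be isolated and tested on its own. Here is a minimal sketch of the same logic in Javascript; routeMessage and the shape of the object it returns are invented for this illustration and are not part of any Discord library:

```javascript
// Hypothetical helper: decides what to do with the text of an incoming message.
// Returns { kind: 'command', text }, { kind: 'mention' } or { kind: 'ignore' }.
function routeMessage(content, isPrivateChannel, botEnabled) {
  const match = /^\s*R2D2\s+/i.exec(content);
  if (match) {
    // everything after the prefix is the command text
    return { kind: 'command', text: content.slice(match[0].length) };
  }
  if (isPrivateChannel) {
    // private messages to the bot need no prefix
    return { kind: 'command', text: content };
  }
  if (botEnabled && /R2D2/i.test(content)) {
    return { kind: 'mention' };
  }
  return { kind: 'ignore' };
}

console.log(routeMessage('R2D2 bot enable', false, true));
// { kind: 'command', text: 'bot enable' }
console.log(routeMessage('I like r2d2', false, true));
// { kind: 'mention' }
```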

Step 5 - handling command results

Command execution may end in one of three states:

  • command is not recognized
  • command has failed
  • command has succeeded

Here is a CommandExecuted event handler that takes these into account:

private async Task CommandExecuted(Optional<CommandInfo> command, ICommandContext context, IResult result)
{
    // if a command isn't found
    if (!command.IsSpecified)
    {
        await context.Message.AddReactionAsync(new Emoji("🤨")); // eyebrow raised emoji
        return;
    }

    // log failure to the console 
    if (!result.IsSuccess)
    {
        await Log(new LogMessage(LogSeverity.Error, nameof(CommandExecuted), $"Error: {result.ErrorReason}"));
        return;
    }
    // react to message
    await context.Message.AddReactionAsync(new Emoji("🤖")); // robot emoji
}

Note that the command info object does not expose a result value, other than success and failure.

Conclusion

This post has shown you how to create a Discord chat bot in .NET and add it to an ASP.Net Core web site as a hosted service. You may see the result by joining this blog's chat and giving commands to Tyr, the chat's bot:

  • play
  • fart
  • use metric or imperial units in messages
  • use Yahoo Messenger emoticons in messages
  • whatever else I will add in it when I get in the mood :)

  For a more in depth exploration of the concept, read Towards generic high performance sorting algorithms

Sorting

  Consider QuickSort, an algorithm that uses a divide and conquer strategy to sort efficiently and a favourite in computer implementations.

  It consists of three steps, applied recursively:

  1. find a pivot value
  2. reorder the input array so that all values smaller than the pivot are followed by values larger or equal to it (this is called Partitioning)
  3. apply the algorithm to each part of the array, before and after the pivot

  QuickSort is considered generic, meaning it can sort any type of item, assuming the user provides a comparison function between any two items. A comparison function always has the same format: compare(item1,item2) returning -1, 0 or 1 depending on whether item1 is smaller than, equal to or larger than item2, respectively. This formalization of the function lends more credence to the idea that QuickSort is a generic sorting algorithm.
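To make the three steps and the role of the comparison function concrete, here is a minimal sketch of an in-place QuickSort in Javascript. It picks the last element of each segment as the pivot (the Lomuto partitioning scheme), which is just one of many possible choices; quickSort and compareNumbers are names chosen for this example.

```javascript
// Comparison function in the format described above: returns -1, 0 or 1
function compareNumbers(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// In-place QuickSort using the last element of the segment as pivot
function quickSort(items, compare, low = 0, high = items.length - 1) {
  if (low >= high) return items;
  const pivot = items[high];
  let boundary = low; // everything before this index is smaller than the pivot
  for (let i = low; i < high; i++) {
    if (compare(items[i], pivot) < 0) {
      [items[i], items[boundary]] = [items[boundary], items[i]];
      boundary++;
    }
  }
  // move the pivot between the two partitions
  [items[boundary], items[high]] = [items[high], items[boundary]];
  quickSort(items, compare, low, boundary - 1);
  quickSort(items, compare, boundary + 1, high);
  return items;
}

console.log(quickSort([23, 1, 31, 0, 5, 26, 15], compareNumbers));
// [ 0, 1, 5, 15, 23, 26, 31 ]
```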

  Multiple optimizations have been proposed for this algorithm, including using insertion sort for small enough array segments, different ways of choosing the pivot, etc., yet the biggest problem was always the optimal way in which to partition the data. The original algorithm chose the pivot as the last value in the input array and the average complexity was O(n log n), but the worst case scenario was O(n^2), when the array was already sorted and the pivot was the largest value. Without extra information you can never find the optimal partitioning schema (which would be to choose the median value of all items in the array segment you are sorting).

  But what if we turn QuickSort on its head? Instead of providing a formalized comparison function and fumbling to get the best partition, why not provide a partitioning function (from which a comparison function is trivial to obtain)? This would allow us to use the so called distribution based sorting algorithms (as opposed to comparison based ones) like Radix, BurstSort, etc, which have a complexity of O(n) in a generic way!

  My proposal for a formal signature of a partitioning function is partitionKey(item,level) returning a byte (0-255) and the sorting algorithm would receive this function and a maximum level value as parameters.

  Let's see a trivial example: an array of values [23,1,31,0,5,26,15] using a partition function that returns the decimal digits of the numbers (the tens digit for level 0, the units digit for level 1). You would use it like sort(arr,partFunc,2) because the values are two-digit numbers. Let's explore a naive Radix sorting:

  • assign 256 buckets, one for each possible value of the partition function result, and start at the maximum (least significant) level
  • put each item in its bucket for the current level
  • concatenate the buckets
  • decrease the level and repeat the process

Concretely:

  • level 1: 23 -> bucket 3, 1 -> 1, 31 -> 1, 0 -> 0, 5 -> 5, 26 -> 6, 15 -> 5 results in [0,1,31,23,5,15,26]
  • level 0: 0 -> 0, 1 -> 0, 31 -> 3, 23 -> 2, 5 -> 0, 15 -> 1, 26 -> 2 results in [0,1,5,15,23,26,31]

Array sorted. Complexity is O(n * k) where k is 2 in this case and depends on the type of values we have, not on the number of items to be sorted!
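The naive Radix sort described above can be sketched in a few lines of Javascript. This is an illustration of the idea rather than an optimized implementation; radixSort and decimalDigit are names chosen for this example, and the levels are numbered from 0 (most significant) to maxLevel - 1 (least significant).

```javascript
// Least significant digit first Radix sort driven by a partitioning function.
// partitionKey(item, level) must return an integer between 0 and 255.
function radixSort(items, partitionKey, maxLevel) {
  let current = items.slice();
  for (let level = maxLevel - 1; level >= 0; level--) {
    const buckets = Array.from({ length: 256 }, () => []);
    for (const item of current) {
      buckets[partitionKey(item, level)].push(item);
    }
    // concatenating the buckets in order keeps the pass stable
    current = buckets.flat();
  }
  return current;
}

// Partition function for two-digit numbers: level 0 = tens, level 1 = units
function decimalDigit(value, level) {
  return level === 0 ? Math.floor(value / 10) % 10 : value % 10;
}

console.log(radixSort([23, 1, 31, 0, 5, 26, 15], decimalDigit, 2));
// [ 0, 1, 5, 15, 23, 26, 31 ]
```

Only 10 of the 256 buckets are ever used with decimal digits; a partition function returning whole bytes would use them all.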

  More complex distribution sorting algorithms, like BurstSort, optimize their function by using a normal QuickSort in small enough buckets. But QuickSort still requires an item comparison function. Well, it is easy to infer: if partFunc(item1,0) is smaller or larger than partFunc(item2,0) then item1 is smaller or larger than item2. If the partition function values are equal, then increase the level and compare partFunc(item1,1) to partFunc(item2,1).
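Deriving the comparison function from the partition function, as described above, could look like this. It is a sketch under the same assumptions as before (level 0 is the most significant one); compareFromPartition is a name invented for this example.

```javascript
// Builds a compare(a, b) function out of a partitioning function
function compareFromPartition(partitionKey, maxLevel) {
  return function compare(a, b) {
    for (let level = 0; level < maxLevel; level++) {
      const diff = partitionKey(a, level) - partitionKey(b, level);
      if (diff !== 0) return diff < 0 ? -1 : 1;
    }
    return 0; // all levels are equal
  };
}

// Partition function for two-digit numbers: level 0 = tens, level 1 = units
const twoDigit = (value, level) =>
  level === 0 ? Math.floor(value / 10) % 10 : value % 10;

const compareTwoDigit = compareFromPartition(twoDigit, 2);
console.log(compareTwoDigit(23, 26)); // -1
console.log(compareTwoDigit(31, 5));  // 1
console.log([23, 1, 31, 0, 5, 26, 15].sort(compareTwoDigit));
// [ 0, 1, 5, 15, 23, 26, 31 ]
```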

  In short, any distribution sorting algorithm can be used in a generic way provided the user gives it a partitioning function with a formalized signature and a maximum level for its application.

  Let's see some example partitioning functions for various data types:

  • integers from 0 to N - maximum level is log256(N) and the partition function will return the bytes in the integer from the most significant to the least
    • ex: 65534 (0xFFFE) would return 255 (0xFF) for level 0 and 254 (0xFE) for level 1. 26 would return 0 and 26 for the same levels.
  • integers from -N to N - similarly, one could return 0 or 1 for level 0 if the number is negative or positive or return the bytes of the equivalent positive numbers from 0 to 2N 
  • strings that have a maximum length of N - maximum level would be N and the partition function would return the value of the character at the same position as the level
    • ex: 'ABC' would return 65, 66 and 67 for levels 0,1,2.
  • decimal or floating point or real values - more math intensive functions can be found, but a naive one would be to use a string partitioning function on the values turned to text with a fixed number of digits before and after the decimal separator.
  • date and time - easy to turn these into integers, but also one could just return year, month, day, hour, minute, second, etc based on the level
  • tuples of any of the types above - return the partition values for the first item, then the second and so on and add their maximum levels
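A few of the partitioning functions listed above, sketched in Javascript. The names uintKey, signedKey and stringKey are invented for this example. The string version assumes characters with codes below 256 and clamps anything larger to 255, and it returns 0 for positions past the end of the string.

```javascript
// Unsigned integers that fit in `levels` bytes; level 0 is the most significant byte
function uintKey(levels) {
  return (value, level) =>
    Math.floor(value / Math.pow(256, levels - 1 - level)) % 256;
}

// Integers from -maxAbs to maxAbs, shifted into the range 0 to 2 * maxAbs
function signedKey(maxAbs, levels) {
  const unsigned = uintKey(levels);
  return (value, level) => unsigned(value + maxAbs, level);
}

// Strings: the character code at the position given by the level
function stringKey(text, level) {
  return level < text.length ? Math.min(text.charCodeAt(level), 255) : 0;
}

const twoBytes = uintKey(2);
console.log(twoBytes(65534, 0), twoBytes(65534, 1)); // 255 254
console.log(twoBytes(26, 0), twoBytes(26, 1));       // 0 26
console.log(stringKey('ABC', 0), stringKey('ABC', 1), stringKey('ABC', 2)); // 65 66 67
```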

  One does not have to invent these functions, they would be provided to the user based on standard types in code factories. Yet even these code factories will be able to encode more information about the data to be sorted than mere comparison functions. Stuff like the minimum and maximum value can be computed by going through all the values in the array to be sorted, but why do it if the user already has this information, for example.

  Assuming one cannot find a fixed length to the values to be sorted on, like real values or strings of any length, consider this type of sorting as a first step to order the array as much as possible, then using something like insertion or bubble sort on the result.

Finding a value or computing distinct values

  As an additional positive side effect, there are other processes on lists of items that are considered generic because they take a function with a formalized signature as a parameter. Common cases include finding the index of an item in a list equal to a given value (thus determining if the value exists in the list) and getting the distinct values from an array. They use an equality function as a parameter, which is formalized as returning true or false. Of course, a comparison function could be used, depending on whether its result is 0 or not, but a partitioning function can also be used to determine equality: two items are equal if all of the bytes returned on all of the levels are equal.
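As a sketch of that last point, here is an equality function built from a partitioning function and used for a linear search. The names equalsByPartition and indexOfBy are invented for this example.

```javascript
// Two items are equal if their partition keys match on every level
function equalsByPartition(partitionKey, maxLevel) {
  return (a, b) => {
    for (let level = 0; level < maxLevel; level++) {
      if (partitionKey(a, level) !== partitionKey(b, level)) return false;
    }
    return true;
  };
}

// Generic linear search that takes the equality function as a parameter
function indexOfBy(items, value, equals) {
  for (let i = 0; i < items.length; i++) {
    if (equals(items[i], value)) return i;
  }
  return -1;
}

// Partition function for two-digit numbers: level 0 = tens, level 1 = units
const digitKey = (value, level) =>
  level === 0 ? Math.floor(value / 10) % 10 : value % 10;

const sameNumber = equalsByPartition(digitKey, 2);
console.log(indexOfBy([23, 1, 31, 0], 31, sameNumber)); // 2
console.log(indexOfBy([23, 1, 31, 0], 13, sameNumber)); // -1
```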

  But there is more. The format of the partition function can be used to create a hash set of the values, thus reducing the complexity of the search for a value from O(n) to O(log n) and that of getting distinct values from O(n^2) to O(n log n)!
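One possible reading of the hash set idea is to use the sequence of partition keys of an item as its identity in a set. The sketch below does that with the built-in Javascript Set; distinctBy is a name invented for this example and no claim is made here about its performance compared to other approaches.

```javascript
// Returns the distinct items, using the partition keys as the identity of an item
function distinctBy(items, partitionKey, maxLevel) {
  const seen = new Set();
  const result = [];
  for (const item of items) {
    const keys = [];
    for (let level = 0; level < maxLevel; level++) {
      keys.push(partitionKey(item, level));
    }
    const signature = keys.join(','); // for example "2,3" for the number 23
    if (!seen.has(signature)) {
      seen.add(signature);
      result.push(item);
    }
  }
  return result;
}

// Partition function for two-digit numbers: level 0 = tens, level 1 = units
const digitOf = (value, level) =>
  level === 0 ? Math.floor(value / 10) % 10 : value % 10;

console.log(distinctBy([23, 1, 23, 5, 1], digitOf, 2)); // [ 23, 1, 5 ]
```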

  In short, all operations on lists of items can be brought together and optimized by using the same format for the function that makes them "generic": that of a partitioning function.

Conclusion

  As you can see, I am rather proud of the concepts I've explained here. Preliminary tests in Javascript show a 20 fold improvement in performance for ten million items when using RadixSort over the default sort. I would really love feedback from someone who researches algorithms and can even test these assumptions under benchmark settings. Complex as they are, I will probably write multiple posts on the subject, trying to split it (partition it?) into easily digestible bits.

 The concept of using a generic partitioning function format for operations on collections is a theoretical one at the moment. I would love to collaborate with people to get this to production level code, perhaps taking into account advanced concepts like minimizing cache misses and parallelism, not only the theoretical complexity.

 More info and details at Towards generic high performance sorting algorithms

Intro

  There is a saying that the novice will write code that works, without thinking of anything else, the expert will come and rewrite that code according to good practices and the master will rewrite it so that it works again, thinking of everything. It applies particularly well to SQL. Sometimes good and well tried best practices fail in specific cases and one must guide themselves either by precise measurements or by narrow rules that take decades to learn.

  If you ever wondered why some SQL queries are very slow or how to write complex SQL stored procedures without them reaching sentience and behaving unpredictably, this post might help. I am not a master myself, but I will share some quick and dirty ways of writing, then checking your SQL code.

Some master rules

  First of all, some debunking of best practices that make unreasonable assumptions at scale:

  1. If you have to extract data based on many parameters, then add them as WHERE or ON clauses and the SQL engine will know how to handle it.

    For small queries and for well designed databases, that is correct. The SQL Server engine attempts to create execution plans for these parameter combinations and reuse them in the future on other executions. However, when the number of parameters increases, the number of possible parameter combinations increases exponentially. The execution optimization should not take more than the execution itself, so the engine just chooses one of the existing plans, the one that appears most similar to the parameters given. Sometimes this results in abysmal performance.

    There are two solutions:

    The quick and dirty one is to add OPTION (RECOMPILE) to the parameterized SELECT query. This will tell the engine to always ignore existing execution plans. With SQL 2016 there is a new feature called Query Store plus a graphical interface that explores execution plans, so one can choose which ones are good and which ones are bad. If you have the option, you might manually force an execution plan on specific queries, as well. But I don't recommend this because it is a brittle and nonintuitive solution. You need a DBA to make sure the associations are correct and maintained properly.

    The better one, to my own surprise, is to use dynamic SQL. In other words, if you have 20 parameters to your stored procedure, with only some getting used at any time (think an Advanced Search page), create an SQL string only with the parameters that are set, then execute it.

    My assumption has always been that the SQL engine will do this for me if I use queries like WHERE (@param IS NULL OR <some condition with @param>). I was disappointed to learn that it does not always do that. Be warned, though, that most of the time multiple query parameters are optimized by running several operations in parallel, which is best!

  2. If you query on a column or another column, an OR clause will be optimal. 

    Think of something like this: You have a table with two account columns AccId and AccId2. You want to query a lot on an account parameter @accountId and you have added an index on each column.

    At this time the more readable option, and for small queries readability is always preferable to performance improvement, is WHERE AccId=@accountId OR AccId2=@accountId. But how would the indexes be used here, in this OR clause? First the engine will have to find all entries with the correct AccId, then again find entries with the correct AccId2, but only the entries that have not been found in the first search.

    First of all, SQL will not do this very well when the WHERE clause is very complex. Second of all, even if it did it perfectly, if you know there is no overlap, or you don't care or you can use a DISTINCT further on to eliminate duplicates, then it is more effective to have two SELECT queries, one for AccId and the other for AccId2 that you UNION ALL afterwards.

    My assumption has always been that the SQL engine will do this automatically. I was quite astounded to hear it was not true. Also, I may be wrong, because different SQL engines and their multitude of versions, compounded with the vast array of configuration options for both engine and any database, behave quite differently. Remember the parallelism optimization, as well.

  3. Temporary tables are slow, use table variables instead.

    Now that is just simple logic, right? A temporary table uses disk while a table variable uses memory. The second has to be faster, right? In the vast majority of cases this will be true. It all depends (a verb used a lot in SQL circles) on what you do with it.

    Using a temporary table might first of all be optimized by the engine to not use the disk at all. Second, temporary tables have statistics, while table variables do not. If you want the SQL engine to do its magic without your input, you might just have to use a temporary table.

  4. A large query that does everything is better than small queries that I combine later on.

    This is a more common misconception than the others. The optimizations the SQL engine does work best on smaller queries, as I've already discussed above, so if a large query can be split into two simpler ones, the engine will be more likely able to find the best way of executing each. However, this only applies if the two queries are completely independent. If they are related, the engine might find the perfect way of getting the data in a query that combines them all.

    Again, it depends. One other scenario is when you try to DELETE or UPDATE a lot of rows. SQL is always "logging" the changes that it does on the off chance that the user cancels the query and whatever incomplete work has been done has to be undone. With large amounts of data, this results in large log files and slow performance. One common solution is to do it in batches, using UPDATE (TOP 10000) or something similar inside a WHILE loop. Note that while this solves the log performance issue, it adds a little bit of overhead for each executed UPDATE.

  5. If I have an index on a DATETIME column and I want to check the records in a certain day, I can use CAST or CONVERT.

    That is just a bonus rule, but I've met the problem recently. The general rule is that you should never perform calculations on columns inside WHERE clauses. So instead of WHERE CAST(DateColumn as DATE)=@date use WHERE DateColumn>=@date AND DateColumn<DATEADD(DAY,1,@date). The calculation is done (once) on the parameters given to the query, not on every value of DateColumn. Also, indexes are now used.
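The same principle (compute the boundaries once, then compare the raw values against them) applies outside SQL as well. Here is a small Javascript sketch of it; dayRange is a name invented for this example and it assumes UTC dates with four digit years:

```javascript
// Returns the half-open range [start, end) covering one UTC calendar day
function dayRange(year, month, day) {
  const start = new Date(Date.UTC(year, month - 1, day));
  // a day value past the end of the month rolls over into the next month
  const end = new Date(Date.UTC(year, month - 1, day + 1));
  return { start, end };
}

const { start, end } = dayRange(2020, 2, 29);
const rows = [
  new Date('2020-02-28T23:59:59Z'),
  new Date('2020-02-29T00:00:00Z'),
  new Date('2020-02-29T17:30:00Z'),
  new Date('2020-03-01T00:00:00Z'),
];

// No per-row conversion: every value is compared directly with the two boundaries
const sameDay = rows.filter(value => value >= start && value < end);
console.log(sameDay.length); // 2
```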

Optimizing queries for dummies

So how does one determine if one of these rules applies to their case? "Complex query" might mean anything. Executing a query multiple times may give very different results based on how the engine is caching the data or computing execution plans.

A lot of what I am going to say can be performed using SQL commands, as well. Someone might want to use direct commands inside their own tool to monitor and manage performance of SQL queries. But what I am going to show you uses the SQL Management Studio and, better still, not that horrid Execution Plan chart that often crashes SSMS and is hard to visualize for anything but the simplest queries. Downside? You will need SQL Management Studio 2014 or higher.

There are two buttons in the SSMS menu. One is "Include Actual Execution Plan" which generates an ugly and sometimes broken chart of the execution. The other one is "Include Live Query Statistics" which seems to be doing the same, only in real time. However, the magic happens when both are enabled. In the Results tab you will get not only the query results, but also tabular data about the execution performance. It is amazingly useful, as you get a table per each intermediary query, for example if you have a stored procedure that executes several queries in a row, you get a table for each.

Even more importantly, it seems that using these options will start the execution without any cached data or execution plans. Running it several times gives consistent execution times.

In the Live Query tables, the values we are interested in are, in order of importance, EstimateIO, EstimateCPU and Rows.

EstimateIO is telling us how much of the disk was used. The disk is the slowest part of a computer, especially when multiple processes are running queries at the same time. Your objective is to minimize that value. Luckily, on the same row, we get data about the substatement that generated that row, which parameters were used, which index was used etc. This blog is not about how to fix every single scenario, but only on how to determine where the biggest problems lie.

EstimateCPU is saying how much processing power was used. Most of the time this is very small, as complex calculations should not be performed in queries anyway, but sometimes a large value here shows a fault in the design of the query.

Finally, Rows. It is best to minimize the value here, too, but it is not always possible. For example a COUNT(*) will show a Clustered Index Scan with Rows equal to the row count in the table. That doesn't cause any performance problems. However, if your query is supposed to get 100 rows and somewhere in the Live Query table there is a value of several millions, you might have used a join without the correct ON clause parameters or something like that.

Demo

Let's see some examples of this. I have a Main table, with columns ID BIGINT, Random1 INT, Random2 NVARCHAR(100) and Random3 CHAR(10), with one million rows. Then an Ind table, with columns ID BIGINT, Qfr CHAR(4) and ValInd BIGINT, with 10000 rows. The ID column is shared with the ID column of the Main table and the Qfr column has only three possible values: AMT, QTY, Sum.

Here is a demo on how this would work:

DECLARE @r1 INT = 1300000
DECLARE @r2 NVARCHAR(100) = 'a'
DECLARE @r3 CHAR(10) = 'A'
DECLARE @qfr CHAR(4) = 'AMT'
DECLARE @val BIGINT = 500000

DECLARE @r1e INT = 1600000
DECLARE @r2e NVARCHAR(100) = 'z'
DECLARE @r3e CHAR(10)='Z'
DECLARE @vale BIGINT = 600000

SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE (@r1 IS NULL OR m.Random1>=@r1)
  AND (@r2 IS NULL OR m.Random2>=@r2)
  AND (@r3 IS NULL OR m.Random3>=@r3)
  AND (@val IS NULL OR i.ValInd>=@val)
  AND (@r1e IS NULL OR m.Random1<=@r1e)
  AND (@r2e IS NULL OR m.Random2<=@r2e)
  AND (@r3e IS NULL OR m.Random3<=@r3e)
  AND (@vale IS NULL OR i.ValInd<=@vale)
  AND (@qfr IS NULL OR i.Qfr=@qfr)

I have used 9 parameters, each with their own values, to limit the number of rows I get. The Live Query result is:

You can see that the EstimateIO values are non-zero only on the Clustered Index Scans, one for each table. Here is what the StmtText looks like: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]),  WHERE:(([@val] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]>=[@val]) AND ([@vale] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]<=[@vale]) AND ([@qfr] IS NULL OR [Test].[dbo].[Ind].[Qfr] as [i].[Qfr]=[@qfr])) ORDERED FORWARD)".

This is a silly case, but you can see that the @parameter IS NULL type of query condition has not been removed, even when the parameter is clearly not null.

Let's change the values of the parameters:

DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL

DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL

Now the Live Query result is:

Same thing! 5.0 and 7.2

Now, let's do the same thing with dynamic SQL. It's a little more annoying, mostly because of the parameter syntax, but check it out:

DECLARE @sql NVARCHAR(Max)

DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL

DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL


SET @sql=N'
SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE 1=1 '
IF @r1 IS NOT NULL SET @sql+=' AND m.Random1>=@r1'
IF @r2 IS NOT NULL SET @sql+=' AND m.Random2>=@r2'
IF @r3 IS NOT NULL SET @sql+=' AND m.Random3>=@r3'
IF @val IS NOT NULL SET @sql+=' AND i.ValInd>=@val'
IF @r1e IS NOT NULL SET @sql+=' AND m.Random1<=@r1e'
IF @r2e IS NOT NULL SET @sql+=' AND m.Random2<=@r2e'
IF @r3e IS NOT NULL SET @sql+=' AND m.Random3<=@r3e'
IF @qfr IS NOT NULL SET @sql+=' AND i.Qfr=@qfr'
IF @vale IS NOT NULL SET @sql+=' AND i.ValInd<=@vale'

PRINT @sql

EXEC sp_executesql @sql,
  N'@r1 INT, @r2 NVARCHAR(100), @r3 CHAR(10), @qfr CHAR(4),@val BIGINT,@r1e INT, @r2e NVARCHAR(100), @r3e CHAR(10),@vale BIGINT',
  @r1,@r2,@r3,@qfr,@val,@r1e,@r2e,@r3e,@vale

Now the Live Query results are:

At first glance we have not changed much. IO is still 5.0 and 7.2. Yet there are 3 fewer execution steps. There is no parallelism and the query has been executed in 5 seconds, not 6. The StmtText for the same thing is now: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]), ORDERED FORWARD)". The printed SQL command is:

SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE 1=1  AND m.Random1>=@r1 AND m.Random1<=@r1e
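The same trick works when the query text is assembled in application code rather than in T-SQL. Here is a minimal Javascript sketch, not the code I used at work: buildQuery and the filters object are hypothetical names, and the values are still expected to be sent separately as parameters, just like sp_executesql does above.

```javascript
// Hypothetical helper: adds a condition only for the filters that have a value.
// The SQL fragments are fixed strings; values stay as @parameters and are
// never concatenated into the text.
function buildQuery(filters) {
  const conditions = [
    ['r1', 'm.Random1>=@r1'],
    ['r2', 'm.Random2>=@r2'],
    ['r3', 'm.Random3>=@r3'],
    ['val', 'i.ValInd>=@val'],
    ['r1e', 'm.Random1<=@r1e'],
    ['r2e', 'm.Random2<=@r2e'],
    ['r3e', 'm.Random3<=@r3e'],
    ['qfr', 'i.Qfr=@qfr'],
    ['vale', 'i.ValInd<=@vale']
  ];
  let sql = 'SELECT * FROM Main m INNER JOIN Ind i ON m.ID=i.ID WHERE 1=1';
  const used = [];
  for (const [name, condition] of conditions) {
    // null or undefined means "do not filter by this"
    if (filters[name] !== null && filters[name] !== undefined) {
      sql += ' AND ' + condition;
      used.push(name);
    }
  }
  // 'used' lists the parameters that must be supplied when executing
  return { sql, used };
}

const query = buildQuery({ r1: 300000, r1e: 600000, r2: null });
console.log(query.sql);
console.log(query.used);
```

Only the conditions that matter reach the server, so the optimizer never sees the @parameter IS NULL checks.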

Conclusion

Again, this is a silly example. But with some results anyway! In my work I have used this to get a stored procedure to work three to four times faster!

One can optimize usage of IO, CPU and Rows by adding indexes, by narrowing join conditions, by reducing the complexity of executed queries, eliminating temporary tables, partitioning existing tables, adding or removing hints, removing computation from queried columns and so many other possible methods, but they amount to nothing if you cannot measure the results of your changes.

By using Actual Execution Plan together with Live Query Statistics you get:

  • consistent execution times and disk usage
  • a clear measure of what went on with each subquery

BTW, you get the same effect if you use SET STATISTICS PROFILE ON before the query. Yet I wrote this post with in mind someone who doesn't want to add extra SQL code.

I wish I had some more interesting examples for you, guys, but screenshots from the workplace are not something I want to do and I don't do any complex SQL work at home. I hope this helps. 

  When I was looking at Javascript frameworks like Angular and ReactJS I kept running into these weird reducers that were used in state management mostly. It all felt so unnecessarily complicated, so I didn't look too closely into it. Today, reading some random post on dev.to, I found this simple and concise piece of code that explains it:

// simple to unit test this reducer
function maximum(max, num) { return Math.max(max, num); }

// read as: 'reduce to a maximum' 
let numbers = [5, 10, 7, -1, 2, -8, -12];
let max = numbers.reduce(maximum);

Kudos to David for the code sample.

The reducer, in this case, is a function that can be fed to the reduce function, which is known to developers in Javascript and a few other languages, but which is foreign to .NET developers. In LINQ, we have Aggregate!

// simple to unit test this Aggregator ( :) )
Func<int, int, int> maximum = (max, num) => Math.Max(max, num);

// read as: 'reduce to a maximum' 
var numbers = new[] { 5, 10, 7, -1, 2, -8, -12 };
var max = numbers.Aggregate(maximum);

Of course, in C# Math.Max is already a reducer/Aggregator and can be used directly as a parameter to Aggregate.

I found a lot of situations where people used .reduce instead of a normal loop, which is why I almost never use Aggregate, but there are situations where this kind of syntax is very useful. One would be in functional programming or LINQ expressions that then get translated or optimized to something else before execution, like SQL code. (I don't know if Entity Framework translates Aggregate, though). Another would be where you have a bunch of reducers that can be used interchangeably.
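To make that last case concrete, here is a small Javascript sketch of interchangeable reducers. The reducers object and its keys are names I made up for illustration; the only real API used is Array.prototype.reduce.

```javascript
// a bunch of reducers with the same (accumulator, item) shape
const reducers = {
  maximum: (acc, num) => Math.max(acc, num),
  minimum: (acc, num) => Math.min(acc, num),
  sum: (acc, num) => acc + num
};

const numbers = [5, 10, 7, -1, 2, -8, -12];

// apply every reducer to the same list, chosen by name at run time
const results = {};
for (const [name, reducer] of Object.entries(reducers)) {
  results[name] = numbers.reduce(reducer);
}
console.log(results);
```

Note that, unlike in C#, passing Math.max directly to reduce does not work in Javascript, because reduce also sends the index and the whole array as extra arguments and Math.max tries to compare those too, which results in NaN. The two-parameter wrapper avoids that.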

Intro

  I want to examine together with you various types of sort algorithms and the tricks they use to lower the magic O number. I reach the conclusion that high performance algorithms that are labeled as specific to a certain type of data can be made generic or that the generic algorithms aren't really that generic either. I end up proposing a new form of function that can be fed to a sorting function in order to reach better performance than the classic O(n*log(n)). Extra bonus: finding distinct values in a list.

Sorting

  But first, what is sorting? Given a list of items that can be compared to one another as lower or higher, return the list in the order from lowest to highest. Since an item can be any type of data record, to define a generic sorting algorithm we need to feed it the rules that make an item lower than another and that is called the comparison function. Let's try an example in Javascript:

  // random function from start to end inclusive
  function rand(start, end) {
    return Math.floor(start + Math.random() * (end - start + 1));
  }
  
  // measure time taken by an action and output it in console
  let perfKey = 0;
  function calcPerf(action) {
    const key = perfKey++;
    performance.mark('start_' + key);
    action();
    performance.mark('end_' + key);
    const measure = performance.measure('measure_' + key, 'start_' + key, 'end_' + key);
    console.log('Action took ' + measure.duration);
  }
  
  // change this based on how powerful the computer is
  const size = 10000000;
  // the input is a list of size 'size' containing random values from 1 to 50000
  const input = [];
  for (let i = 0; i < size; i++)
    input.push(rand(1, 50000));
  
  // a comparison function between two items a and b
  function comparisonFunction(a, b) {
    if (a > b)
      return 1;
    if (a < b)
      return -1;
    return 0;
  }
  
  const output = [];
  // copy input into output, then sort it using the comparison function
  // same copying method will be used for future code
  calcPerf(() => {
    for (let i = 0; i < size; i++)
      output.push(input[i]);
    output.sort(comparisonFunction);
  });

  It's not the crispest code in the world, but it's simple to understand:

  • calcPerf computes the time an action takes and logs it to the console
  • start by creating a big array of random numbers as input
  • copy the array into a result array and sort it with the default sort function, to which we give the comparison function
  • display the time it took for the operation.

  This takes about 4500 milliseconds on my computer.

  Focus on the comparison function. It takes two items and returns a number that is -1, 0 or 1 depending on whether the first item is smaller, equal or larger than the second. Now let's consider the sorting algorithm itself. How does it work?

  A naive way to do it would be to find the smallest item in the list, move it to the first position in the array, then continue the process with the rest of the array. This would have a complexity of O(n²). If you don't know what the O complexity is, don't worry, it just provides an easy to spell approximation of how the amount of work would increase with the number of items in the input. In this case, 10 million records, squared, would lead to 100 trillion operations! That's not good.
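  The naive approach is usually called selection sort. A minimal sketch, with names of my own choosing (selectionSort, compareNumbers), that also counts the comparisons so you can see the n*(n-1)/2 growth:

```javascript
// naive sort: repeatedly find the smallest remaining item and move it to the front
function selectionSort(items, compare) {
  const result = items.slice(); // work on a copy
  for (let i = 0; i < result.length - 1; i++) {
    let minIndex = i;
    for (let j = i + 1; j < result.length; j++) {
      if (compare(result[j], result[minIndex]) < 0)
        minIndex = j;
    }
    if (minIndex !== i) {
      const temp = result[i];
      result[i] = result[minIndex];
      result[minIndex] = temp;
    }
  }
  return result;
}

// comparison function that counts how many times it was called
let selectionComparisons = 0;
function compareNumbers(a, b) {
  selectionComparisons++;
  if (a > b) return 1;
  if (a < b) return -1;
  return 0;
}

const selectionSorted = selectionSort([5, 10, 7, -1, 2, -8, -12], compareNumbers);
console.log(selectionSorted, selectionComparisons);
```

  For 7 items that is 21 comparisons, exactly 7*6/2. Don't try it on the 10 million item input.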

  Other algorithms are much better, bringing the complexity to O(n*log(n)), so assuming base 10, around 70 million operations. But how do they improve on this? Surely in order to sort all items you must compare them to each other. The explanation is that if a<b and b<c you do not need to compare a to c. And each algorithm tries to get to this in a different way.

  However, the basic logic of sorting remains the same: compare all items with a subset of the other items.
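  To see the difference in numbers, here is a sketch of one of those better algorithms, merge sort, instrumented to count comparisons. It is not what the browser's sort does internally, just an illustration; mergeSort and countingCompare are hypothetical names.

```javascript
// merge sort: sort each half, then merge the two sorted halves
function mergeSort(items, compare) {
  if (items.length <= 1) return items;
  const middle = items.length >> 1;
  const left = mergeSort(items.slice(0, middle), compare);
  const right = mergeSort(items.slice(middle), compare);
  const merged = [];
  let l = 0, r = 0;
  // once left[l] loses to right[r], it is never compared to the items after right[r]
  while (l < left.length && r < right.length) {
    if (compare(left[l], right[r]) <= 0) merged.push(left[l++]);
    else merged.push(right[r++]);
  }
  while (l < left.length) merged.push(left[l++]);
  while (r < right.length) merged.push(right[r++]);
  return merged;
}

// deterministic shuffled input: a permutation of 0..1023 (7919 is coprime with 1024)
const n = 1024;
const shuffled = Array.from({ length: n }, (_, i) => (i * 7919) % n);

let comparisonCount = 0;
const countingCompare = (a, b) => { comparisonCount++; return a - b; };

const sortedByMerge = mergeSort(shuffled, countingCompare);
console.log(comparisonCount + ' comparisons, versus ' + (n * (n - 1) / 2) + ' for the naive approach');
```

  The count stays under n*log2(n), which is 10240 for 1024 items, while comparing every item with every other would need 523776 comparisons.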

Partitioning

  A very common and recommended sorting algorithm is QuickSort. I am not going to go through the entire history of sorting algorithms and what they do, you can check that out yourself, but I can focus on the important innovation that QuickSort added: partitioning. The first step in the algorithm is to choose a value out of the list of items, called a pivot, which the algorithm hopes is as close as possible to the median value, then arrange the items in two partitions: the ones smaller than the pivot and the ones larger than the pivot. Then it proceeds to do the same to each partition until the partitions are small enough to be sorted by some other sort algorithm, like insertion sort (used by Chrome by default).

  Let's try to do this manually in our code, just the very first run of the step, to see if it improves the execution time. Lucky for us, we know that the median is around 25000, as the input we generated contains random numbers from 1 to 50000. So let's copy the values from input into two output arrays, then sort each of them. The sorted result would be reading from the first array, then from the second!

  // two output arrays, one for numbers below 25000, the other for the rest
  const output1 = [];
  const output2 = [];
  const pivot = 25000;
  
  calcPerf(() => {
    for (let i = 0; i < size; i++) {
      const val = input[i];
      if (comparisonFunction(val, pivot) < 0)
        output1.push(val);
      else
        output2.push(val);
    }
    // sorting smaller arrays is cheaper
    output1.sort(comparisonFunction);
    output2.sort(comparisonFunction);
  });

  Now, the performance is slightly better. If we do this several times, the time taken would get even lower. The partitioning of the array by an operation that is essentially O(n) (we just go once through the entire input array) reduces the comparisons that will be made in each partition. If we would use the naive sorting, partitioning would reduce n² to n+(n/2)²+(n/2)² (once for each partitioned half), thus n+n²/2. Each partitioning almost halves the number of operations!

  So, how many times can we halve the number of operations? Imagine that we do this with an array of distinct values, from 1 to 10 million. In the end, we would get to partitions of just one element and that means we did a log₂(n) number of operations and for each we added one n (the partitioning operation). That means that the total number of operations is... n*log(n). Each algorithm gets to this in a different way, but at the core of it there is some sort of partitioning, that b value that makes comparing a and c unnecessary.
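  Repeating the partitioning step all the way down can be sketched like this. It is a simplified, out-of-place version that picks the middle item as pivot and keeps the items equal to the pivot apart, not the in-place QuickSort from the textbooks; quickSortSketch is a name I made up.

```javascript
// partition around a pivot, then do the same to each partition
function quickSortSketch(items, compare) {
  if (items.length <= 1) return items;
  const pivot = items[items.length >> 1]; // assumption: the middle item is a decent pivot
  const smaller = [];
  const equal = [];
  const larger = [];
  // one pass over the items: this is the O(n) partitioning step
  for (const item of items) {
    const result = compare(item, pivot);
    if (result < 0) smaller.push(item);
    else if (result > 0) larger.push(item);
    else equal.push(item);
  }
  // the pivot always lands in 'equal', so both recursive calls get shorter lists
  return quickSortSketch(smaller, compare)
    .concat(equal, quickSortSketch(larger, compare));
}

const quickSorted = quickSortSketch([5, 10, 7, -1, 2, -8, -12, 7], (a, b) => a - b);
console.log(quickSorted);
```

  With a good pivot each level halves the partitions, so there are about log₂(n) levels and each level goes through all n items once.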

  Note that we treated the sort algorithm as "generic", meaning we fed it a comparison function between any two items, as if we didn't know how to compare numbers. That means we could have used any type of data as long as we knew the rule for comparison between items.

  There are other types of sorting algorithms that only work on specific types of data, though. Some of them claim a complexity of O(n)! But before we get to them, let's make a short detour.

Distinct values

  Another useful operation with lists of items is finding the list of distinct items. From [1,2,2,3] we want to get [1,2,3]. To do this, we often use something called a trie, a tree-like data structure that is used for quickly finding if a value exists or not in a list. It's the thing used for autocorrect or finding a word in a dictionary. It has an O(log n) complexity in checking if an item exists. So in a list of 10 million items, it would take maybe 20 operations to find the item exists or not. That's amazing! You can see that what it does is partition the list down to the item level.

  Unfortunately, this only works for numbers and strings and such primitive values. If we want to make it generic, we need to use a function that determines when two items are equal and then we use it to compare to all the other items we found as distinct so far. That makes using a trie impossible.

  Let me give you an example: we take [1,1,2,3,3,4,5] and we use an externally provided equality function:

  • create an empty output of distinct items
  • take first item (1) and compare with existing distinct items (none)
  • item is not found, so we add it to output
  • take next item (1) and compare with existing distinct items (1)
  • item is found, so we do nothing
  • ...
  • we take the last item (5) and compare with existing items (1,2,3,4)
  • item is not found, so we add it to the output

  The number of operations that must be taken is the number of total items multiplied by the average number of distinct items. That means that for a list of already distinct values, the complexity is O(n²). Not good! It increases quadratically with the number of items. And we cannot use a trie unless we have some function that would provide us with a distinctive primitive value for an item. So instead of an equality function, a hashing function that would return a number or maybe a string.
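  The equality-based walkthrough can be sketched as follows. The function name distinctByEquality is mine, and the counter is there only to show how the work adds up:

```javascript
// distinct using only an externally provided equality function
function distinctByEquality(items, equals) {
  const output = [];
  let comparisons = 0;
  for (const item of items) {
    let found = false;
    // compare with every distinct item found so far
    for (const existing of output) {
      comparisons++;
      if (equals(item, existing)) {
        found = true;
        break;
      }
    }
    if (!found) output.push(item);
  }
  return { output, comparisons };
}

const distinctResult = distinctByEquality([1, 1, 2, 3, 3, 4, 5], (a, b) => a === b);
console.log(distinctResult.output, distinctResult.comparisons);
```

  Even this tiny list of 7 items needs 14 equality checks to produce [1, 2, 3, 4, 5].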

  However, given the knowledge we have so far, we can reduce the complexity of finding distinct items to O(n*log(n))! It's as simple as sorting the items, then going through the list and sending to output an item when different from the one before. One little problem here: we need a comparison function for sorting, not an equality one.
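  A minimal sketch of the sort-then-scan idea, assuming a comparison function that returns 0 for equal items (distinctBySorting is a hypothetical name):

```javascript
// distinct by sorting first: equal items end up next to each other
function distinctBySorting(items, compare) {
  const sorted = items.slice().sort(compare); // O(n*log(n)), on a copy
  const output = [];
  for (let i = 0; i < sorted.length; i++) {
    // output an item only when it differs from the one before it
    if (i === 0 || compare(sorted[i], sorted[i - 1]) !== 0)
      output.push(sorted[i]);
  }
  return output;
}

const distinctSorted = distinctBySorting([3, 1, 2, 3, 1, 5, 4], (a, b) => a - b);
console.log(distinctSorted);
```

  One side effect to be aware of: the distinct items come out in sorted order, not in the order in which they first appeared in the input.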

So far

  We looked into the basic operations of sorting and finding distinct values. To be generic, one has to be provided with a comparison function, the other with an equality function. However, if we would have a comparison function available, finding distinct generic items would become significantly less complex by using sorting. Sorting is better than comparing every item with every other item because it uses partitioning as an optimization trick.

Breaking the n*log(n) barrier

  As I said above, there are algorithms that claim a much better performance than n*log(n). One of them is called RadixSort. BurstSort is a version of it, optimized for strings. CountSort is a similar algorithm, as well. The only problem with Radix type algorithms is that they only work on numbers or recursively on series of numbers. How do they do that? Well, since we know we have numbers to sort, we can use math to partition the lot of them, thus reducing the cost of the partitioning phase.

  Let's look at our starting code. We know that we have numbers from 1 to 50000. We can find that out easily by going once through all of them and computing the minimum and maximum value. O(n). We can then partition the numbers by their value. BurstSort starts with a number of "buckets" or lists, then assigns numbers to the buckets based on their value (dividing the value by the number of buckets). If a bucket becomes too large, it is "burst" into another number of smaller buckets. In our case, we can use CountSort, which simply counts each occurrence of a value in an ordered array. Let's see some code:

  const output = [];
  const buckets = [];
  calcPerf(() => {
    // for each possible value add a counter
    for (let i = 1; i <= 50000; i++)
      buckets.push(0);
    // count all values
    for (let i = 0; i < size; i++) {
      const val = input[i];
      buckets[val - 1]++;
    }
    // create the output array of sorted values
    for (let i = 1; i <= 50000; i++) {
      const counter = buckets[i - 1];
      for (let j = 0; j < counter; j++)
        output.push(i);
    }
  });

  This does the following:

  • create an array of 50000 counters set to zero, one for each possible value from 1 to 50000
  • for each value in the input, increment the bucket for that value
  • at the end just go through all of the buckets and output the value as many times as the value in the bucket shows

  This algorithm generated a sorted output array in 160 milliseconds!

  And of course, it is too good to be true. We used a lot of a priori knowledge:

  • min/max values were already known
  • the values were conveniently close together integers so we can use them as array indexes

  I can already hear you sigh "Awwh, so I can't use it!". Do not despair yet!
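  The first assumption, at least, is cheap to drop. Here is a sketch of the same counting idea that finds the minimum and maximum itself in one extra pass. It still assumes integer values in a reasonably small range; countSort is a name I chose for it.

```javascript
// counting sort that discovers the value range instead of knowing it in advance
function countSort(items) {
  if (items.length === 0) return [];
  // one O(n) pass to find min and max
  let min = items[0];
  let max = items[0];
  for (const val of items) {
    if (val < min) min = val;
    if (val > max) max = val;
  }
  // one counter for each possible value between min and max
  const counters = new Array(max - min + 1).fill(0);
  for (const val of items)
    counters[val - min]++;
  // output each value as many times as it was counted
  const output = [];
  for (let i = 0; i < counters.length; i++) {
    for (let j = 0; j < counters[i]; j++)
      output.push(i + min);
  }
  return output;
}

const counted = countSort([5, -2, 9, 5, 0, -2]);
console.log(counted);
```

  The second assumption is the real problem: with values that are far apart, or not integers at all, the array of counters becomes huge or impossible.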

  The Radix algorithm, that is used only for numbers, is also used on strings. How? Well, a string is reducible to a list of numbers (characters) so one can recursively assign each string into a bucket based on the character value at a certain index. Note that we don't have to go through the entire string, the first few letters are enough to partition the list in small enough lists that can be cheaply sorted.

  Do you see it yet?

A generic partition function

  What if we would not use an equality function or a comparison function or a hashing function as a parameter for our generic sort/distinct algorithm? What if we would use a partition function? This partition function would act like a multilevel hashing function returning values that can also be compared to each other. In other words, the generic partition function could look like this:

  function partitionFunction(item, level) returning a byte

  For strings it returns the numeric value of the character at position level or 0. For numbers it returns the high to low byte in the number. For object instances with multiple properties, it would return a byte for each level in each of the properties that we want to order by. Radix style buckets would use the known values from 0 to 255. The fact that the multilevel partitioning function is provided by the user means we can pack in it all the a priori knowledge we have, while keeping the sorting/distinct algorithm unchanged and thus, generic! The sorting will be called by providing, besides the input, two parameters: the partitioning function and the maximum level to which it should be called:

  sort(input, partitioningFunction, maxLevel)
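  To make the idea less abstract, here are sketches of what such partition functions might look like for the three cases mentioned. All the names are hypothetical, and I am assuming single-byte characters, unsigned 32-bit numbers and an object with an age below 256 and a name:

```javascript
// strings: the character code at position 'level', or 0 past the end
// (assumes single-byte characters; anything larger is truncated to a byte)
function stringPartition(item, level) {
  return level < item.length ? item.charCodeAt(level) & 255 : 0;
}

// unsigned 32-bit numbers: bytes from high to low, levels 0 to 3
function uint32Partition(item, level) {
  if (level > 3) return 0;
  return (item >>> (8 * (3 - level))) & 255;
}

// objects: order by age first (one byte), then by the characters of the name
function personPartition(item, level) {
  if (level === 0) return item.age & 255;
  return stringPartition(item.name, level - 1);
}

console.log(stringPartition('abc', 0), stringPartition('abc', 3));
console.log(uint32Partition(0x01020304, 0), uint32Partition(0x01020304, 3));
console.log(personPartition({ age: 30, name: 'Ann' }, 0), personPartition({ age: 30, name: 'Ann' }, 1));
```

  Everything we know about the data goes into these small functions, while the algorithm that consumes them never needs to know what an item actually is.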

A final example

  Here is an implementation of a radix sorting algorithm that receives a multilevel partitioning function using our original input. Note that it is written so that it is easily read and not for performance:

  // will return a sorted array from the input array
  // using the partitioning function up to maxLevel
  function radixSort(input, partitioningFunction, maxLevel) {
    let buckets = Array.from({length: 256}, () => []);
    buckets[0] = input;
    // reverse order, because level 0 should be the most significant
    for (let level = maxLevel-1; level >=0; level--) {
      let tempBuckets = Array.from({length: 256}, () => []);
      for (let bucketIndex = 0; bucketIndex < buckets.length; bucketIndex++) {
        const bucket = buckets[bucketIndex];
        const bucketLength = bucket.length;
        for (let bucketOffset = 0; bucketOffset < bucketLength; bucketOffset++) {
          const val = bucket[bucketOffset];
          const partByte = partitioningFunction(val, level);
          tempBuckets[partByte].push(val);
        }
      }
      buckets = tempBuckets;
    }
    const output = [].concat(...buckets);
    return output;
  }

  // return value bytes, from the most significant to the least
  // being at most 50000, the values always fit in 2 bytes
  function partitioningFunction(item, level) {
    if (level === 0) return item >> 8;
    if (level === 1) return item & 255;
    return 0;
  }
  
  let output3 = [];
  calcPerf(() => {
    output3 = radixSort(input, partitioningFunction, 2);
  });

Want to know how long it took? 1300 milliseconds!

You can see how the same kind of logic can be used to find distinct values, without actually sorting, just by going through each byte from the partitioning function and using them as values in a trie, right?
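A rough sketch of that, assuming that two items which return the same bytes for all levels are considered equal. The names distinctByPartition and twoBytePartition are mine and the trie is built from plain Map objects:

```javascript
// distinct without sorting: walk a trie using the bytes of the partition function
function distinctByPartition(input, partitioningFunction, maxLevel) {
  const root = new Map();
  const output = [];
  for (const item of input) {
    let node = root;
    let isNew = false;
    for (let level = 0; level < maxLevel; level++) {
      const partByte = partitioningFunction(item, level);
      let next = node.get(partByte);
      if (next === undefined) {
        // this path was never taken, so the item was not seen before
        next = new Map();
        node.set(partByte, next);
        isNew = true;
      }
      node = next;
    }
    if (isNew) output.push(item);
  }
  return output;
}

// same idea as the two-byte partitioning function above
function twoBytePartition(item, level) {
  if (level === 0) return item >> 8;
  if (level === 1) return item & 255;
  return 0;
}

const distinctValues = distinctByPartition([300, 5, 300, 261, 5], twoBytePartition, 2);
console.log(distinctValues);
```

Each item costs at most maxLevel lookups, regardless of how many distinct items were found before it, and the output keeps the order of first appearance.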

Conclusion

Here is how a generic multilevel partitioning function replaces comparison, equality and hashing functions with a single concept that is then used to get high performance from common data operations such as sorting and finding distinct values.

I will want to work on formalizing this and publishing it as a library or something like that, but until then, what do you think?

Wait, there is more

There is a framework in which something similar is being used: SQL. It's the most common place where ORDER BY and DISTINCT are used. In SQL's case, we use an optimization method that uses indexes, which are also tree data structures (typically B-trees) storing the keys that we want to order or filter by. Gathering the data to fill a database index also has its complexity. In this case, we pre-partition once and we sort many times. It's another way of reducing the cost of the partitioning.

However, this is just a sub-type of the partition function that I am talking about, one that uses a precomputed data structure to reach its goal. The multilevel partition function concept I am describing here may be pure code or some other encoding of information we know out of hand before doing the operation.

Finally, the complexity. What is it? Well instead of O(n*log(n)) we get O(n*k), where k is the maximum level used in the partition function. This depends on the data, so it's not a constant, but it's the closest theoretical limit for sorting, closer to O(n) than the classic log version. I am not the best algorithm and data structure person, so if you have ideas about it and want to help me out, I would be grateful.  

  Four years ago this day I was blogging about a list of technologies that I wanted to learn or at least explore. It was a very ambitious list, with the idea that I might shame myself into studying at least part of it. Apparently, I am shameless. Anyway, this is a follow-up post on how it all went down. 

.NET Core: MVC, Entity Framework, etc.

  I actually had the opportunity to use ASP.Net Core at my job. I didn't have a lot of it, though, so I just did the usual: start working on something, google the hell out of everything, make it work. I used .NET Core 1 and 2, moved to 3 and now they are just getting out 5. Was it useful? Well, actually no.

  The state of the software industry is in constant flux and while new technologies pop up all the time, there is also a strong force moving the other way, mainly coming from the people that pay for software. Why invest in a new technology when the old one has been working fine for years? Yeah, don't answer that. My point is that while staying current with .NET Core was a boon, I didn't really see a lot of paying customers and employers begging me to do anything with it. In fact, I left my job working on a .NET Framework 4.7 automation project to work on an ASP.Net MVC 4 application written in Visual Basic. So I am learning things that are new to me, but really are very old.

  All I am saying is that there is a paradox of job opportunities for technologies that you are supposed to know and be an expert in, but for which few have had the courage to actually pay before.

  Bottom line: having been familiar with older technology that made .Net Core exist, it was kind of simple for me to get into it, but I am not an expert in it. The fact that software generally moved to the web and that server code is now as slim as it can be also undermines the value of being proficient in it. That is, until you really need to be.

OData and OAuth

  Simply said, nothing on this front. These are mostly related to frontend stuff, to new web facing applications that can be found at a public URL. I have not worked in this field basically since forever.

Typescript, Angular

  Oh, the promise of Typescript: a strongly typed language that compiles to Javascript and can be used to code in using static analysis tools, structured code, clear project boundaries! I had the opportunity to work with it. I did that to a much lesser degree than I should have. Yet the tiny contact with the new language left me disappointed. Because it is a superset of Javascript, it inherits most of its problems, can be easily hacked and the tooling chain is basically the Javascript one.

  I commend Microsoft for even attempting to develop this language and I see that it has become popular indeed. I like C#. I like the vanilla Javascript of old. I dislike the weak hybrid that Typescript seems to be. I may be wrong. Things might evolve in very interesting directions with Typescript. I will wait until then.

  And Angular has also grown into a completely different beast. I thought Angular 1 was great, then they came in with version 2 which was completely different. Now it's 9 or 10 or whatever. I liked the structured projects and how easily data change could be controlled from non-UI code. Yet did it all have to be this complex?

Node.JS, npm, Javascript ES6+, Bower, etc.

  I thought that since I like Javascript, I might enjoy Node.JS. Working with Typescript I had to have contact with npm and at least install Node on the computer. Yet I got more and more disillusioned. A complicated chain of tools that seem to be targeted to a concept of "app" rather than a website, for which you need a completely different set of tools, the numerous changes in the paradigm of Javascript and the systems using it and, frankly, a lack of a real reason for using Javascript when I can use C# made me stop trying to get in on the action on this front. And before I started to understand the Node ecosystem, Deno appeared.

  Personally it feels that Javascript is growing too fast and uncontrolled, like a cancer. The language and tooling need to stabilize and by that I mean cut some of the fat away, not add new things. I loved some of the additions to the standard, like arrow functions, let/const and block scope, nullable and spread operators, iterators and generator functions, but then it just started to get more and more bloated, like trying to absorb every new concept in all of the other languages.

  Bottom line: it is ironic that I am now working in an app that must work on Internet Explorer 11, so I cannot even use the new features of the Javascript standard. And yes, I hate that, but at the same time I can do the job just fine. Food for thought.

Docker

  What a great idea this Docker: just tell the system what kind of setup you need and it creates it for you. It can even create it for you from scratch every single time, in isolation, so you can run your software without having to install all that heavy crap I was just telling you about above. And yet I hated it with a vengeance from the moment I tried to use it. With a clear Linux vibe, it wouldn't even work right on Windows. The images for these pseudo-virtual machines were huge. The use was cumbersome. The tooling reminded me of the good old days when I installed a Slackware distro on any old computer, without X system, of course, and then compiled beta versions of every single software I wanted to use and repeated the process whenever I had an issue.

  For a tool that is supposed to bring consistency and ease of use, it worked really inconsistently and was hard to manage. I understand that maybe all that would have gone away if I invested the effort of understanding the whole concept, but it really felt like going backward instead of forward. One day, perhaps, I will change my mind. As of now, I am just waiting for the person who reinvents Docker and makes it do what it has promised to do: ease my life.

Parallel programming

  Hey, this is actually a plus! I had a lot of opportunities to work with parallel programming, understanding the various pitfalls and going deep in the several types of parallel programming concepts in .NET. Am I a parallel programming guru now? No. But I went through the await/async phase and solved problems at all levels of parallelism. I don't like async/await. It is a good idea, it is great when you start a new project, but to add async/await in existing legacy code is a pain in the butt. Inevitably you will want to execute an async method from a sync context or control the order or amount of parallel processing, you will forget to call an async method with await, or you will try to run one in parallel by not awaiting it on purpose, only to find that your scope has been disposed.

  Bottom line: I don't have to like it to use async/await, but I feel (without actually knowing a perfect solution) the design of the feature could have been better. And now they added it to Javascript, too! Anyway, I am comfortable in the area now.

HTML5, Responsive Design, LESS/SASS, ReactJS, new CSS, frontend

  A lot of new stuff to learn for a frontend developer. And how glad I am that I am not it! Again I hit the reality of business requirements vs technology requirements. To learn front end development right now is a full job that you have to do alone, for free and in your spare time. Meanwhile, the projects you are working on are not on the cutting edge, no one wants to change them just to help you learn something and, again, these concepts are mostly for public facing web applications. Maybe even mobile development.

  It does help to know all of this, but it is not a job that I want to do. Frontend is for the young. Can I say that? ReactJS, the few times I looked at it, appeared to me to be a sort of ASP for the browser, with that ugly JSX format that combines code and markup and that complicated state flow mechanism. Angular felt too stuffy and so on. Every time I look into frontend stuff it seems like software for the server side from twenty years before.

  Bottom line: If any web framework appealed to me, that would be VueJS! And no one used it at any of my places of work. It seems a framework dedicated to staying simple and efficient. And while I want to know the concepts in all of this stuff, why would I go deep into this when I need a UI designer for anything with a shape? I will be waiting for the resurgence of simple frameworks that use the new features of HTML99 and do not require learning an entire new ecosystem to make anything work.

WPF, UWP

  I remember I absolutely loved developing in WPF. The MVVM concept of separating the UI completely from the code controlling it, which is then completely separated from the data models used, was very nice. The complete customizability of the UI without changing the code at all was amazing. The performance, not so much. But did it matter? I thought not. Yet the web obliterated the tech from the development world. I believe it deserves a rebirth.

  That being said, I have done nothing in this direction either. Shame on me.

Deep learning, AI, machine learning

  I really wanted to understand and use artificial intelligence, especially since I did my licence paper on neural networks. I love the new concepts, I've read some of the stuff they came up with and I've even been to some conferences on machine learning. Yet I have not yet found the project that could make it worthwhile. Machine learning and especially deep learning projects require huge amounts of data, while working on the underlying technology requires advanced knowledge of statistics and math, which I do not have.

  It seems to me that I will always be the end user of AI rather than working on building it.

Conclusion

  So there you have it: I worked on .NET Core and parallelism and even microservices, dabbled in Typescript and Angular, created a full on LINQ implementation on Javascript and later Typescript. Meanwhile the world went more and more towards frontend and mobile development and Node and stuff like machine learning. And presently I am working for a bank, coding in VB and using jQuery.

  Strangely, I don't really feel guilty about it. Instead I feel like I am still going on the path of discovery, only more of myself rather than of a particular technology. The more I wait, the more new stuff appears and makes whatever I was dreaming of learning obsolete. I even considered not being a developer anymore, maybe becoming an architect or application designer. When something new appears, the immediate instinct is to check it out, to see what novelties they come up with and to find the problems and then discover solutions, then blog about them. It's a wonderfully addictive game, but after playing it for a while, it just blurs into the same thing over and over again. It's not the technology that matters, but the problems that need fixing. As long as I have problems to fix and I can find the solution, I am happy.

  I ask you, if you got to this point, what is the point of learning any technology, when you don't have the cause, the purpose, the problem that you have to solve? An HR person asked me recently why I keep working where I do. I answered: because they need me.

Intro

  So the requirement is "Write a class that would export data into a CSV format". This would be different from "Write a CSV parser", which I think could be interesting, but not as wildly complex as this. The difference comes from the fact that a CSV parser brings a number of problems for the interviewed person to think about right away, but then it quickly dries up as a source for intelligent debate. A CSV exporter seems much simpler, because the developer controls the output, but it increases in complexity as the interview progresses.

  This post is written from the viewpoint of the interviewer.

Setup

  First of all, you start with the most basic question: Do you know what CSV is? I was going to try this question out on a guy who came interviewing for senior developer and I was excited to see how it would go. He answered he didn't know what CSV was. Bummer! I was incredulous, but then I quickly found out he didn't know much else either. CSV is a text format for exporting a number of unidimensional records. The name comes from Comma Separated Values and might at first glance appear to be a tabular data format, an idea made even more credible by Excel being able to open and export .csv files. But it is not. As the name says, it has values separated by a comma. It might even be just one record. It might contain multiple records of different types. In some cases, the separators for values and records are not even commas or newlines.

  It is important to see how the interviewee explains what CSV is, because it is a concept that looks deceptively simple. Someone who first considers the complexity of the format before starting to write the code works very differently in a team than someone who throws themselves into the code, confident (or just unrealistically optimistic) that they would solve any problem down the line.

  Some ideas to explore, although it pays off to not bring them up yourself:

  • What data do you need to export: arrays, data tables, list of records?
  • Are the records of the same type?
  • Are there restrictions on the type of record?
  • What separators will be used? How do we escape values that contain the chosen separators?
  • Do values have restrictions, like not containing separators?
  • CSV header: do we support that? What does it mean in the context of different types of input? 
  • Text encoding, unicode, non-ASCII characters
  • How to handle null values?
  • Number and date formatting
  • Is there an RFC or a specification document for the CSV export format?

Implementation

  In this particular interview I have chosen that the CSV exporter class will only support an input of IEnumerable<T> (this is .NET speak for a bunch of objects of the same type).

  Give ample opportunities for questions from the person interviewed. This is not a speed test. It is important whether the candidate considers, on their own, issues like:

  • are the object properties simple types? Like string, long, integer, decimal, double, float, datetime?
  • since the requirement is any T, what about objects that are arrays, or self referencing, or having complex objects as properties?
  • how to extract the values of any object (discussions about .NET reflection or Javascript object property discovery show a lot about the interviewee, especially if they start said discussions)
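
  To make the property discovery point concrete, here is a deliberately naive sketch in Javascript. The name exportCsvNaive is made up for illustration, and the assumption is that all records have the same shape as the first one. It does no escaping, no null handling and no formatting, which is exactly what the later questions are meant to break.

```javascript
// Naive CSV export: the columns are the keys of the first record.
// Hypothetical helper: no escaping, no null handling, no formatting rules.
function exportCsvNaive(records) {
  if (records.length === 0) return '';
  const columns = Object.keys(records[0]);
  const lines = [columns.join(',')]; // header line
  for (const record of records) {
    lines.push(columns.map(name => String(record[name])).join(','));
  }
  return lines.join('\n');
}

const csv = exportCsvNaive([
  { id: 1, name: 'Ann' },
  { id: 2, name: 'Bob' }
]);
console.log(csv);
// id,name
// 1,Ann
// 2,Bob
```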

  Go through the code with the candidate. This shows their ability to develop software. How will they name the class, what signature will they use for the export method, how do they structure the code and how readable is it?

  At this stage you should have a pretty good idea if the candidate is intelligent, competent and how they handle a complex problem from requirement to implementation.

Dig deeper

  This is the time to ask the questions yourself and see how they react to new information, the knowledge that they should have asked themselves the same questions and the stress of changing their design:

  • are comma and newline the only supported separators?
  • are separators characters or strings?
  • what if an exported value is a string containing a comma?
  • do you support values containing newline?
  • if you use quotes to hold a value containing commas and newlines, what happens if values contain quotes?
  • empty or null values. Any difference? How to export them? What if the object itself is null?
  • how to handle the header of the CSV, where do you get the name of the properties?
  • what if the record type is an array or IEnumerable?
  • what will be the numeric and date formatting used for export?
  • does the candidate know what text encoding is? Which one will they use and why?

  How have the answers to these questions changed the design? Did the candidate redesign the work or hold tight to the original idea and try to fix everything as it comes?
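
  For reference, one common answer to the comma, newline and quote questions is the convention described in RFC 4180: wrap the value in double quotes when needed and double any quotes inside it. The sketch below assumes that convention. The function name escapeCsvValue and the choice to export null as an empty value are my own assumptions, and the latter is debatable because it erases the difference between null and empty.

```javascript
// Quote a value only when it needs it and double the quotes inside it.
// Assumption: null and undefined are exported as an empty value.
function escapeCsvValue(value, separator = ',') {
  if (value === null || value === undefined) return '';
  const text = String(value);
  const needsQuotes = text.includes(separator) || text.includes('"') ||
    text.includes('\n') || text.includes('\r');
  return needsQuotes ? '"' + text.replace(/"/g, '""') + '"' : text;
}

const row = ['plain', 'a,b', 'say "hi"', 'two\nlines', null]
  .map(v => escapeCsvValue(v)) // the arrow keeps map's index out of the separator parameter
  .join(',');
console.log(row);
```

  A candidate who changes the separator will notice that the quoting rule has to follow it, which is why the separator is a parameter here.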

  At this point you should know how the person being interviewed responds to new information, even scope creep and, maybe most importantly, to stress. But we're not done, are we?

Bring the pain

  Bring up the concept of unit testing. If you are lucky, the candidate already brought it up. Either way, now it is time to:

  • split the code into components: the reflection code, the export code, the file system code (if any).
  • abstract components into interfaces in order to mock them in unit tests
  • collect all the specifications gathered so far in order to cover all the cases
  • ask the candidate to write one unit test
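
  As a rough idea of what a good answer looks like, here is a minimal sketch in Javascript with no test framework. All the names (CsvExporter, FakeWriter, testExportsOneLinePerRecord) are hypothetical. The point is only that the exporter receives its writer from outside, so the test can hand it a fake one instead of touching the file system.

```javascript
// Hypothetical exporter that depends on an injected writer, not on the file system.
class CsvExporter {
  constructor(writer) { this.writer = writer; }
  export(records) {
    for (const record of records) {
      this.writer.write(Object.values(record).join(',') + '\n');
    }
  }
}

// A fake writer that remembers what was written, standing in for a file.
class FakeWriter {
  constructor() { this.chunks = []; }
  write(text) { this.chunks.push(text); }
}

// One unit test: two records produce two lines, in order.
function testExportsOneLinePerRecord() {
  const writer = new FakeWriter();
  new CsvExporter(writer).export([{ a: 1, b: 2 }, { a: 3, b: 4 }]);
  const actual = writer.chunks.join('');
  if (actual !== '1,2\n3,4\n') throw new Error('unexpected output: ' + actual);
  return actual;
}

const testOutput = testExportsOneLinePerRecord();
```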

Conclusion

  A seemingly simple question will take you and the interview candidate through:

  • finding out how the other person thinks
  • specification gathering
  • code design
  • implementation
  • technical knowledge in a multitude of directions
  • development process
  • separation of concerns, unit testing
  • human interaction in a variety of circumstances
  • determining how the candidate would fit in a team

Not bad for a one line question, right?

  I want to write this post to talk about the most common mistake I make as a software developer, even after almost 20 years of experience. And it's not code related. It's more human. But I would also like to hear what you think your biggest mistakes are that are not related to lack of experience. Don't be shy!

  My mistake: assumptions.

  I was assigned this bug recently and, wanting to do a fast job and impress people, I investigated the code, noticed a bug, fixed it, then immediately gave it in for review. I had reasons for doing that, because I was new and did not know the application well. The leads would tell me if they thought I did not find the problem. But, out of the goodness of my heart, you see, I've decided to test the fix as well. And I discovered that the feature was barely implemented. It was not a bug, it was a full fuck up.

  What happened here? I assumed a certain quality of the code and expected, without any reasonable evidence, to find a small typo or a logic bug that would be solved by writing a few lines of code. Instead, I had to reimplement the whole thing as a fix, I pissed off the lead devs because they had enough on their plate and made the entire team wonder what I was doing there. I mean, I hadn't even tested the fix!

  Doing things fast means working on valid assumptions that allow you to cut corners. In a new environment, with a team you are not familiar with and a code base you don't understand, you do not have the luxury to make assumptions. Instead, do a good job first: investigate thoroughly, see what the reported problem is, find the actual problem (which may be different), come with an attack plan, implement it, then test that it had the desired result. Yes, it takes more time than to quickly compile the logic flow in your head and hope for the best, but in some places you don't get second chances at fixing things, teams are more formal, processes need to be followed. Optimism is also based on assumptions. Be a realist instead.

  In order to win a game you need to know its rules. That applies both to development process and to local politics, which sometimes are more important than the work. Once you are a good player, you can start experimenting. The title "senior developer" is not given in a vacuum, but is relevant (or not) depending on the environment. Yes, you have experience, but you don't have *this* experience yet. In my desire to be efficient and fast I didn't do anything right and I couldn't believe I had been that stupid.

  Now, how about you? What are your silly mistakes that you can't believe you are still making?

  I am subscribed to the StackOverflow newsletter and most of the time the "top" questions there are really simple things that gain attention from a lot of people. Today I got one question that I would have thought has an obvious answer, but it did not.

  The question was what does "asdf".replace(/.*/g,"x") return? 

  And the answer to the question "What does a regular expression replace of everything with x return?" is.... [Ba da bum!] "xx".

  The technical answer is there in the StackOverflow question, but I am gonna walk you through some steps to get to understand this the... dumb way.

  So, let's try variations on the same theme. What does "asdf".matchAll(/.*/g) return? Well, first of all, in Chrome, it returns a RegExpStringIterator, which is pretty cool, because it's already using the latest Javascript features and it is returning an iterator rather than an array. But we can just use Array.from on it to get an array of all matches: for "asdf" and for "".
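
  If you want to see it for yourself, this is a small sketch you can paste in the console. It keeps only the matched text and the index of each match (the variable name matches is mine).

```javascript
// Turn the iterator returned by matchAll into a plain array of { text, index }.
const matches = Array.from(
  'asdf'.matchAll(/.*/g),
  m => ({ text: m[0], index: m.index })
);
console.log(matches);
// two matches: "asdf" at index 0 and "" at index 4
```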

  That's a pretty clear giveaway. Since the regular expression is a global one, it will get a match, then the next one until there is nothing left. First match is "asdf" as expected, the next one is "", which is the rest of the string and which also matches .* Why is it, then, that it doesn't go into a stack overflow (no pun intended) and keep turning up empty strings? Again, it's an algorithm described in the ECMAScript specification and you need a doctorate in computer science to read it. Well, it's not that complicated, but I did promise a dumb explanation.

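   And that is that after each match the search continues from where the match ended, and after an empty match the index is incremented by one. First match is found at index 0 and ends at 4, the next one is the empty match at index 4. There are no matches from index 5 on.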

   Other variations on this theme are "asdf".matchAll(/.?/g), which will return "a","s","d","f","". You can't do "asdf".matchAll(/.*/) , you get a TypeError: undefineds called with a non-global RegExp argument error that really doesn't say much, but you can do "asdf".match(/.*/g) which returns just an array of strings, rather than more complex objects. You can also do

var reg = /.*/g;
console.log(reg.exec("asdf"),reg.exec("asdf"),reg.exec("asdf"),reg.exec("asdf"))

This more classic approach will return "asdf", "", "", "" and it would continue to return empty strings ad infinitum!
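
The difference is that replace and matchAll step past an empty match, while a bare exec does not. Here is a simplified sketch of that step, written by hand on top of exec. The function name allMatches is mine and the real algorithm also has special handling for Unicode regular expressions, which I am ignoring here.

```javascript
// Collect [index, text] for every match, stepping over empty matches by hand.
function allMatches(regex, text) {
  const found = [];
  regex.lastIndex = 0;
  let match;
  while ((match = regex.exec(text)) !== null) {
    found.push([match.index, match[0]]);
    // exec leaves lastIndex in place after an empty match, so move it forward
    if (match[0] === '') regex.lastIndex++;
  }
  return found;
}

const steps = allMatches(/.*/g, 'asdf');
console.log(steps);
// [[0, "asdf"], [4, ""]]
```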

But how should one write a regular expression to get what you wanted to get, a replacement of everything with x? /.+/g would work, but it would not match an empty string. On the other hand, when was the last time you wanted to replace empty strings with anything?
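
A quick comparison of the options, as a sketch. The anchored, non-global variant at the end is my own suggestion for the case when you do want the empty string replaced too, and it assumes single-line strings, since the dot does not match newlines.

```javascript
const everything = 'asdf'.replace(/.*/g, 'x');   // "xx": the full match plus the empty one
const nonEmpty = 'asdf'.replace(/.+/g, 'x');     // "x"
const emptyInput = ''.replace(/.+/g, 'x');       // "": nothing to match
const anchored = 'asdf'.replace(/^.*$/, 'x');    // "x"
const anchoredEmpty = ''.replace(/^.*$/, 'x');   // "x": the empty string is replaced as well
console.log(everything, nonEmpty, emptyInput, anchored, anchoredEmpty);
```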

  I am going to do this in Javascript, because it's easier to write and easier for you to test (just press F12 in this page and write it in the console), but it applies to any programming language. The issue arises when you want to execute a background job (a setTimeout, an async method that you do not await, a Task.Run, anything that runs on another execution path than the current one) inside a for loop. You have the index variable (from 0 to 9, for example) and you want to use it as a parameter for the background job. The result is not as expected, as all the background jobs use the same value for some reason.

  Let's see a bit of code:

// just write in the console numbers from 0 to 9
// but delayed for a second
for (var i=0; i<10; i++)
{
  setTimeout(function() { console.log(i); },1000);
}

// the result: 10 values of 10 after a second!

But why? The reason is the "scope" of the i variable. In this case, classic (EcmaScript 5) code that uses var declares a variable that exists everywhere in the current scope, which for ES5 is defined as the function this code is running from or the global scope if executed directly. If after this loop we write console.log(i) we get 10, because the loop has incremented i, got it to 10 - which is not less than 10, and exited the loop. The variable is still available. That explains why, a second later, all the functions executed in setTimeout will display 10: that is the current value of the same variable.

Now, we can solve this by introducing a local scope inside the for. In ES5 it looked really cumbersome:

for (var i=0; i<10; i++)
{
  (function() {
    var li = i;
    setTimeout(function() { console.log(li); },1000);
  })();
}

The result is the expected one, values from 0 to 9.

What happened here? We added an anonymous function and executed it. This generates a function scope. Inside it, we added a li variable (local i) and then scheduled the timeout using that variable. For each value from 0 to 9, another scope is created with another li variable. If after this code we write console.log(li) we get an error because li is not defined in this scope. It is a bit confusing, but there are 10 li variables in 10 different scopes.

Now, EcmaScript 6 wanted to align Javascript with other commonly used modern languages so they introduced local scope for variables by defining them differently. We can now use let and const to define variables that are either going to get modified or remain constant. They also exist only in the scope of an execution block (between curly brackets).

We can write the same code from above like this:

for (let i=0; i<10; i++) {
  const li = i;
  setTimeout(()=>console.log(li),1000);
}

In fact, this is more complex than it has to be, but that's because of another Javascript quirk. We can simplify this for the same result as:

for (let i=0; i<10; i++) {
  setTimeout(()=>console.log(i),1000);
}

Why? Because we "let" the index variable, so it only exists in the context of the loop execution block, but apparently it creates one version of the variable for each loop run. Strangely enough, though, it doesn't work if we define it as "const": the i++ step tries to reassign a constant and throws a TypeError.
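
Here is a sketch that shows the three behaviors side by side without waiting for timers: the functions are collected in arrays and called after the loops end. The variable names are mine.

```javascript
// var: one shared variable, so every function sees its final value
const withVar = [];
for (var v = 0; v < 3; v++) withVar.push(() => v);
const varResults = withVar.map(f => f()); // [3, 3, 3]

// let: a fresh variable for each loop run
const withLet = [];
for (let j = 0; j < 3; j++) withLet.push(() => j);
const letResults = withLet.map(f => f()); // [0, 1, 2]

// const: the first increment fails
let constError = null;
try {
  for (const c = 0; c < 3; c++) { /* runs once, then c++ throws */ }
} catch (e) {
  constError = e; // a TypeError about assigning to a constant
}
console.log(varResults, letResults, constError && constError.name);
```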

Just as an aside, this is way less confusing with for...of loops because you can declare the item as const. Don't use "var", though, or you get the same problem we started with!

const arr=[1,2,3,4,5];
for (const item of arr) setTimeout(()=>console.log(item),1000);

In other languages, like C#, variables exist in the scope of their execution block by default, but using a for loop will not generate multiple versions of the same variable, so you need to define a local variable inside the loop to avoid this issue. Here is an example in C#:

for (var i=0; i<10; i++)
{
    var li = i;
    Task.Run(() => Console.WriteLine(li));
}
Thread.Sleep(1000);

Note that in the case above we added a Thread.Sleep to make sure the app doesn't close while the tasks are running and that the values of the loop will not necessarily be written in order, but that's beside the point here. Also, var is the way variables are defined in C# when the type can be inferred by the compiler, it is not the same as the one in Javascript.

I hope you now have a better understanding of variable scope.

  If you are like me, you want to first establish a nice skeleton app that has everything just right before you start writing your actual code. However, as weird as it may sound, I couldn't find a way to use command line parameters with dependency injection, in the same simple way that one would use a configuration file with IOptions<T> for example. This post shows you how to use CommandLineParser, a nice library that handles everything regarding command line parsing, but in a dependency injection friendly way.

  In order to use command line arguments, we need to obtain them. For any .NET Core application or .NET Framework console application you get them from the parameters of the static Main method from Program. Alternately, you can use Environment.CommandLine, which is actually a string, not an array of strings. But all of these are kind of nudging you towards some ugly code that either has a dependency on the static Environment, or has code early in the application to handle command line arguments, or stores the arguments somehow. What we want is complete separation of modules in our application.

  How can we get the arguments by injection? By creating a new type that encapsulates the simple string array.

// encapsulates the arguments
public class CommandLineArguments
{
    public CommandLineArguments(string[] args)
    {
        this.Args = args;
    }

    public string[] Args { get; }
}

// adds the type to dependency injection
services.AddSingleton<CommandLineArguments>(new CommandLineArguments(args));
// the generic type declaration is superfluous, but the code is easy to read

  With this, we can access the command line arguments anywhere by injecting a CommandLineArguments object and accessing the Args property. But this still implies writing command line parsing code wherever we need that data. We could add some parsing logic in the CommandLineArguments class so that instead of the command line arguments array it would provide us with a strongly typed value of the type we want. But then we would put business logic in a command line encapsulation class. Why would it know what type of options we need and why would we need only one type of options?

  What we would like is something like

public SomeClass(IOptions<MyCommandLineOptions> clOptions) {...}

  Now, we could use this system by writing more complicated code that adds a ConfigurationSource and then declaring that certain types are command line options. But I don't want that either for several reasons:

  • writing configuration providers is complex code and at some moment in time one has to ask how much are they willing to write in order to get some damn arguments from the command line
  • declaring the types at the beginning does provide some measure of centralized validation, but on the other hand it's declaring types that we need in business logic somewhere in service configuration, which personally I do not like

  What I propose is adding a new type of IOptions, one that is specific to command line arguments:

// declare the interface for generic command line options
public interface ICommandLineOptions<T> : IOptions<T>
    where T : class, new() { }

// add it to service configuration
services.AddSingleton(typeof(ICommandLineOptions<>), typeof(CommandLineOptions<>));

// put the parsing logic inside the implementation of the interface
public class CommandLineOptions<T> : ICommandLineOptions<T>
    where T : class, new()
{
    private T _value;
    private string[] _args;

    // get the arguments via injection
    public CommandLineOptions(CommandLineArguments arguments)
    {
        _args = arguments.Args;
    }

    public T Value
    {
        get
        {
            if (_value==null)
            {
                // set the value by parsing command line arguments
            }
            return _value;
        }
    }

}

  Now, in order to make it work, we will use CommandLineParser which functions in a very simple way:

  • declare a Parser
  • create a POCO class that has properties decorated with attributes that define what kind of command line parameter they are
  • parse the command line arguments string array into the type of class declared above
  • get the value or handle errors

  Also, to follow the now familiar Microsoft pattern, we will write an extension method to register both arguments and the mechanism for ICommandLineOptions. The end result is:

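// namespaces needed by the code below (CommandLine comes from the CommandLineParser package)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using CommandLine;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;

// extension class to add the system to services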
public static class CommandLineExtensions
{
    public static IServiceCollection AddCommandLineOptions(this IServiceCollection services, string[] args)
    {
        return services
            .AddSingleton<CommandLineArguments>(new CommandLineArguments(args))
            .AddSingleton(typeof(ICommandLineOptions<>), typeof(CommandLineOptions<>));
    }
}

// public class CommandLineArguments: defined above

// public interface ICommandLineOptions<T>: defined above

// full class implementation for command line options
public class CommandLineOptions<T> : ICommandLineOptions<T>
    where T : class, new()
{
    private T _value;
    private string[] _args;

    public CommandLineOptions(CommandLineArguments arguments)
    {
        _args = arguments.Args;
    }

    public T Value
    {
        get
        {
            if (_value==null)
            {
                using (var writer = new StringWriter())
                {
                    var parser = new Parser(configuration =>
                    {
                        configuration.AutoHelp = true;
                        configuration.AutoVersion = false;
                        configuration.CaseSensitive = false;
                        configuration.IgnoreUnknownArguments = true;
                        configuration.HelpWriter = writer;
                    });
                    var result = parser.ParseArguments<T>(_args);
                    result.WithNotParsed(errors => HandleErrors(errors, writer));
                    result.WithParsed(value => _value = value);
                }
            }
            return _value;
        }
    }

    private static void HandleErrors(IEnumerable<Error> errors, TextWriter writer)
    {
        if (errors.Any(e => e.Tag != ErrorType.HelpRequestedError && e.Tag != ErrorType.VersionRequestedError))
        {
            string message = writer.ToString();
            // CommandLineParseException is a custom exception type, not shown in this post
            throw new CommandLineParseException(message, errors, typeof(T));
        }
    }
}

// usage when configuring dependency injection
services.AddCommandLineOptions(args);

Enjoy!

Now there are some quirks in the implementation above. One of them is that the parser class generates the usage help by writing it to a TextWriter (default being Console.Error), but since we want this to be encapsulated, we declare our own StringWriter and then store the generated help if there are any errors. In the case above, I am storing the help text as the exception message, but it's the principle that matters.

Also, with this system one can ask for multiple types of command line options classes, depending on the module, without the need to declare said types at the configuration of dependency injection. The downside is that if you want validation of the command line options at the very beginning, you have to write extra code. In the way implemented above, the application will fail when first asking for a command line option that cannot be mapped on the command line arguments.

As a bonus, here is the way to define an option class that CommandLineParser will parse from the arguments:

// the way we want to use the app is
// FileUtil <command> [--loglevel loglevel] [--quiet] --output <outputFile> file1 file2 .. file10
public class FileUtilOptions
{
    // use Value for parameters with no name
    [Value(0, Required = true, HelpText = "You have to enter a command")]
    public string Command { get; set; }

    // use Option for named parameters
    [Option('l',"loglevel",Required = false, HelpText ="Log level can be None, Normal, Verbose")]
    public string LogLevel { get; set; }

    // use bool for named parameters with no value
    [Option('q', "quiet", Default = false, Required = false, HelpText = "Quiet mode produces no console output")]
    public bool Quiet { get; set; }

    // Required for required values
    [Option('o', "output", Required = true, HelpText = "Output file is required")]
    public string OutputFile { get; set; }

    // use Min/Max for enumerables
    [Value(1, Min = 1, Max = 10, HelpText = "At least one file name and at most 10")]
    public IEnumerable<string> Files { get; set; }
}

Note that the short style of a parameter needs to be used with a dash, the long one with two dashes:

  • -o outputFile.txt - correct (value outputFile.txt)
  • --output outputFile.txt - correct (value outputFile.txt)
  • -output outputFile.txt - incorrect (value utput and outputFile.txt is considered an unnamed argument)

Intro

  As I was working on LInQer I was hooked on the multiple optimizations that I was finding. Do you want to compute the average of an iterable? You would need the total count and the sum of the items, which you can get in a single function that you can reuse to get the sum or the count. But what if the iterable is an integer range between 1 and 10? Then you can compute the sum and you already know the count. Inspired by that work and by other concepts like interval types or Maybe/Nullable types, I've decided to write this post, which I do not know if it will lead to any usable code.
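
  To make the optimization concrete, here is a minimal sketch in Javascript. It is not LInQer's actual code and the names sumAndCount, average and IntegerRange are made up for this post. The first part computes sum and count in a single pass over any iterable, the second is a range that knows its bounds and can answer without iterating at all.

```javascript
// One pass over any iterable of numbers gives both the sum and the count.
function sumAndCount(iterable) {
  let sum = 0, count = 0;
  for (const value of iterable) { sum += value; count++; }
  return { sum, count };
}

function average(iterable) {
  const { sum, count } = sumAndCount(iterable);
  return count === 0 ? undefined : sum / count;
}

// Hypothetical range type: iterable, but count and sum come from the bounds alone.
class IntegerRange {
  constructor(from, to) { this.from = from; this.to = to; }
  *[Symbol.iterator]() { for (let i = this.from; i <= this.to; i++) yield i; }
  get count() { return Math.max(0, this.to - this.from + 1); }
  get sum() { return this.count * (this.from + this.to) / 2; }
  get average() { return this.count === 0 ? undefined : this.sum / this.count; }
}

const range = new IntegerRange(1, 10);
console.log(average(range), range.average); // 5.5 both ways, the second without iterating
```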

What is an iterable/enumerable?

  In Javascript they call it an Iterable, in .NET you have IEnumerable. They mean the same thing: sources of values. With new concepts like async/await you can use Observables as Enumerables as well, although they are theoretically diametrically opposing patterns. In both languages they are implemented as having a method that returns an iterator/enumerator, an object that can move to the next value, give you the next value and maybe reset itself. You can define any stream of values like that, having one or more values or, indeed, none. From now on I will discuss in terms of .NET nomenclature, but I see no reason why it wouldn't apply to any other language that implements this feature.

  For example an array implements IEnumerable<T> in .NET. Its enumerator will return false if trying to move to the next value and the array is empty, or true if there is at least a value and the current value will return the first value in the array. Move next again and it will return true or false depending on whether there is a next value. But there is no need for the values to exist to have an Enumerable. One could write an object that would return all the positive integer numbers. Its length would be infinite and the values would only be generated when requested. There is no differentiation between an Enumerable and a generator in .NET, but there is in Javascript. For this reason whenever I use the term generator, I will mean an object that generates values rather than produce them from a source of existing ones.
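
  In Javascript, the "all the positive integers" object is a few lines long. This is only a sketch, with a small take helper (my own name) so that we can look at the first values without running forever.

```javascript
// An infinite source: values are produced only when requested.
function* positiveIntegers() {
  let n = 1;
  while (true) yield n++;
}

// Stop after a given number of items, no matter how long the source is.
function* take(iterable, count) {
  if (count <= 0) return;
  let taken = 0;
  for (const item of iterable) {
    yield item;
    if (++taken >= count) return;
  }
}

const firstFive = Array.from(take(positiveIntegers(), 5));
console.log(firstFive); // [1, 2, 3, 4, 5]
```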

The NULL controversy

  A very popular InfoQ post describes the introduction of the NULL concept in programming languages as the billion dollar mistake. I am not so sure about that, but I can concede they make good points. The alternative to using a special value to describe the absence of a value is to use an "option" object that either has Some value or it has None. You would check the existence of a value by calling a method to tell you if it has a value and you would get the value from the current value property. Doesn't it sound familiar? It's a more specific case of an Enumerator! Another popular solution to remove NULLs from code is to never return values from your methods, but arrays. An empty array would represent no value. An array is an Enumerable!

And that last idea opens up an interesting possibility: instead of one or none, you can have multiple values. What then? What would a multiplication mean? What about a decision block?
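
As a small sketch of the "return arrays, not nulls" idea in Javascript (findUser and the sample data are hypothetical): the calling code is the same whether there is a value or not and there is no null check anywhere.

```javascript
// Hypothetical lookup that returns an array with zero or one items instead of null.
function findUser(users, id) {
  const user = users.find(u => u.id === id);
  return user === undefined ? [] : [user];
}

const users = [{ id: 1, name: 'Ann' }];

// The same pipeline works for "some" and for "none".
const found = findUser(users, 1).map(u => u.name.toUpperCase());   // ["ANN"]
const missing = findUser(users, 2).map(u => u.name.toUpperCase()); // []
console.log(found, missing);
```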

The LInQer experience

  If you know me, you are probably fed up with me plugging LInQer as the greatest thing since fire was invented. But that's because it is! And while implementing .NET LInQ as a Javascript library I've played with some very interesting concepts.

  For example, while implementing the Last operator on enumerables, I had two different implementations depending on whether one could know the length in advance and one could use indexed access to the values. An array of one billion values has no problem giving you the last item in it because of two things: you know where the array ends and you can access any item at any position without having to go through other values. You just take the value at index one billion minus one. If you had a generator, though, the only way to get the last value would be to enumerate again and again and again, and only when moving to the next value failed would you know that the current value was the last one. And of course, this would be bad if there are no bounds to the generator.
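
  The two paths look roughly like this. It is a simplified sketch, not the code from LInQer, and it only treats arrays as indexable.

```javascript
// Last item of a sequence: direct access when possible, full enumeration otherwise.
function last(source) {
  if (Array.isArray(source)) {
    // known length and indexed access: no enumeration needed
    return source.length > 0 ? source[source.length - 1] : undefined;
  }
  // anything else: walk through all the items and keep the most recent one
  // (this never returns for an unbounded generator)
  let result;
  for (const item of source) result = item;
  return result;
}

function* threeValues() { yield 10; yield 20; yield 30; }

console.log(last([1, 2, 3]), last(threeValues())); // 3 30
```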

  But there is more. What about very common statistical values like the sum? This, of course, applies to numbers. The Enumerable need not produce numbers, so in other contexts it means nothing. Then there are concepts like statistical distribution. One can make some assumptions if they know the distribution of values. A constant yet infinite generator of numbers will always have the same average value. It would return the same value, regardless of index.

  I spent a lot of time doing sorting that only needs a part of the enumerable, or partial sorting. I've implemented a Quicksort algorithm that works faster than the default sort when there are enough values and that can ignore the parts of the array that I don't need. Also, there are specific algorithms to return the last or first N items. All of this depends on functions that determine the order of items. Randomness is also interesting, as it needs to take into consideration the change of probabilities as the list of items increases with each request. Sampling was fun, too!

  Then there were operators like Distinct or Group which needed to use functions to determine sameness.

  With all this work, I've never intended to make this more than what LInQ is in .NET: a way to dynamically filter and map and enumerate sequences of items without having to go through them all or to create intermediate but unnecessary arrays. What I am talking about now is taking things further and deeper.

Continuous intervals

  What if the Enumerable abstraction is not enough? For example one could define the interval of real numbers between 0 and 1. You can never enumerate the next value, but there are definite boundaries, a clear distribution of values, a very obvious average. What about series and limits? If a generator generates values that depend on previous values, like a geometric progression or a Fibonacci series, you can sometimes compute the minimum or maximum value of the items in it, or of their sums.  

Tools

  So we have more concepts in our bag now:

  • move next (function)
  • current value
  • item length (could be infinite or just unknown)
  • indexed access (or not)
  • boundaries (min, max, limits)
  • distribution (probabilities)
  • order
  • discreteness

  How could we use these?

Concrete cases

  There is one famous probabilities problem: what are the chances you will get a particular value by throwing a number of dice. And it is interesting because there is a marked difference between using one die or more. With one die every value is equally likely. Using at least two dice and plotting the sums you get after multiple throws you get a curve with a peak in the middle, which gets closer and closer to what is called a Normal distribution, a Gauss curve, as you add more dice, and that's because there are more combinations of values that sum up to 6 than there are for 2.
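
  You don't need to throw anything to see it, you can just count the combinations. Here is a small sketch in Javascript (the function name sumCombinations is mine) that counts in how many ways each sum can be obtained with a given number of dice.

```javascript
// Returns a Map from each possible sum to the number of ways to roll it.
function sumCombinations(diceCount, sides = 6) {
  let counts = new Map([[0, 1]]); // zero dice: one way to get a sum of 0
  for (let d = 0; d < diceCount; d++) {
    const next = new Map();
    for (const [sum, ways] of counts) {
      for (let face = 1; face <= sides; face++) {
        next.set(sum + face, (next.get(sum + face) || 0) + ways);
      }
    }
    counts = next;
  }
  return counts;
}

const oneDie = sumCombinations(1);
const twoDice = sumCombinations(2);
console.log(oneDie.get(3), twoDice.get(2), twoDice.get(6), twoDice.get(7));
// 1 1 5 6
```

  Dividing each count by the total number of combinations (36 for two dice) gives the probability of each sum.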

  How can we declare a value that belongs to an interval? One solution is to add all kinds of metadata or validations. But what if we just declare an iterable with one value that has a random value between 1 and 6? And what if we add it with another one? What would that mean?

  Here is a demo example. It's silly and it looks too much like the Calculator demos you see for unit testing and I really hate those, but I do want to just demo this. What else can we do with this idea? I will continue to think about it.

class Program
    {
        static void Main(string[] args)
        {
            var die1 = new RandomGenerator(1, 6);
            var die2 = new RandomGenerator(1, 6);
            // just get the value
            var value1 = die1.First() + die2.First();
            // compose the two dice using Linq, then get value
            var value2 = die1.Zip(die2).Select(z => z.First + z.Second).First();
            // compose the two dice using operator overload, then get value
            var value3 = (die1 + die2).First();
            var min = (die1 + die2).Min();
        }

        /// <summary>
        /// Implemented Min alone for demo purposes
        /// </summary>
        /// <typeparam name="T"></typeparam>
        public interface IGenerator<T> : IEnumerable<T>
        {
            T Min();
        }

        /// <summary>
        /// Generates integer values from minValue to maxValue inclusively
        /// </summary>
        public class RandomGenerator : IGenerator<int>
        {
            private readonly Random _rnd;
            private readonly int _minValue;
            private readonly int _maxValue;

            public RandomGenerator(int minValue, int maxValue)
            {
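                // note: on the old .NET Framework two Random instances created
                // in quick succession share a time-based seed and would roll
                // identical values; .NET Core and later seed them differently
                _rnd = new Random();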
                this._minValue = minValue;
                this._maxValue = maxValue;
            }

            public static IGenerator<int> operator +(RandomGenerator gen1, IGenerator<int> gen2)
            {
                return new AdditionGenerator(gen1, gen2);
            }

            public IEnumerator<int> GetEnumerator()
            {
                while (true)
                {
                    yield return _rnd.Next(_minValue, _maxValue + 1);
                }
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return ((IEnumerable<int>)this).GetEnumerator();
            }

            public int Min()
            {
                return _minValue;
            }
        }
        
        /// <summary>
        /// Combines two generators through addition
        /// </summary>
        internal class AdditionGenerator : IGenerator<int>
        {
            private IGenerator<int> _gen1;
            private IGenerator<int> _gen2;

            public AdditionGenerator(IGenerator<int> gen1, IGenerator<int> gen2)
            {
                this._gen1 = gen1;
                this._gen2 = gen2;
            }

            public IEnumerator<int> GetEnumerator()
            {
                var en1 = _gen1.GetEnumerator();
                var en2 = _gen2.GetEnumerator();
                while (true)
                {
                    var hasValue = en1.MoveNext();
                    if (hasValue != en2.MoveNext())
                    {
                        throw new InvalidOperationException("One generator stopped providing values before the other");
                    }
                    if (!hasValue)
                    {
                        yield break;
                    }
                    yield return en1.Current + en2.Current;
                }

            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return ((IEnumerable<int>)this).GetEnumerator();
            }

            public int Min()
            {
                return _gen1.Min() + _gen2.Min();
            }
        }
    }

Conclusion (so far)

I am going to think about this some more. It has a lot of potential as a type abstraction, but to be honest, I deal very little in numerical values and math and statistics, so I don't see what I personally could do with this. I suspect, though, that other people might find it very useful or at least interesting. And yes, I am aware of mathematical concepts like interval arithmetic and I am sure there are a ton of existing libraries that already do something like that and much more, but I am looking at this more from the standpoint of computer science and quasi-primitive types than from a mathematical or numerical perspective. If you have any suggestions or ideas or requests, let me know!

  You can consider this an interview question, although to be fair if someone did ask me this for an interview I would say they are assholes. What is the difference between the pre-increment operator and the post-increment operator in C#?

  They look the same in C and C# and Javascript and Java and all the languages that share the curly bracket syntax with C, but in fact they are slightly different. Slight enough to make someone an asshole for asking the question as if it were relevant, but important enough for you to read about it. One of the most common interpretations of the syntax is that x++ is incrementing the value after the operation, while ++x is incrementing it before the operation. That is wrong.

  In fact, for C++ the return values are different between pre and post operators. I am not a C++ dev, so I will just quote a reference: "Pre operators increment or decrement the value of the object and return a reference to the result. Post operators create a copy of the object, increment or decrement the value of the object and return the copy from before the increment or decrement." So the pre operator returns a reference to the object itself, while the post operator returns a copy of it. It is also possible in C or C++ for the assignment to be done after the value was produced. In C# the assignment must be done before any value is returned.

  In C#, to paraphrase Eric Lippert, "Both pre and post operators determine the value of the variable, what value will be assigned back to storage and assign the new value to storage. The postfix operator produces the original value, and the prefix operator produces the assigned value." So it's (kinda) like this piece of code:

int Increment(ref int x, bool post) {
  var originalX = x;
  var newX = x+1;
  x = newX;
  return post ? originalX : newX;
}

  So why the hell does it matter? I mean, it's a rather meaningless difference between the programming languages and the before/after mnemonic makes the code pretty clear, doesn't it? OK. Let's try some code and let me see how fast you come up with the answer. Remember, this is supposed to be simple, so if you are thinking too much about it, it doesn't matter if you get the correct answer. Ready?

  1. Any difference between x++ and ++x if the resulting value is not used?
  2. var a=1; var b=++a; What's the value of b?
  3. var a=1; var b=a++; var c=++a; What's the value of c?
  4. var i = 0; for (i=0; i<5; ++i) Console.Write(i+" "); Console.WriteLine(i); What is printed at the console?
  5. var i = 0; for (i=0; i<5; i++) Console.Write(i+" "); Console.WriteLine(i); What is printed at the console?
  6. var a=1; a=a++; What's the value of a?

And all of this was about the increment operator as normally used for integer values. There is a big part about operator overloading in there, but I believe it is less relevant in the context of differences between pre and post increment/decrement operators.

There is one important part to discuss, though, and that is best coding practices: when to use post and when to use pre. And they are really easy: separate statements from expressions! Statements execute code with side effects, they should return nothing. Expressions return values without side effects. If you never use the value of an increment or decrement and instead use it as a statement with side effects, there is no difference between ++a and a++. In fact one doesn't need the preincrement/predecrement operators at all! In this context, the answers to the questions above are: 1. No. 2, 3, 6: You are using it wrong! 4, 5: the same thing, since without using the value we are in scenario 1.

Just for reference, though, here are the answers:

  1. No
  2. 2
  3. 3 (b is 1)
  4. 0 1 2 3 4 5
  5. 0 1 2 3 4 5
  6. 1
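
If you want to check these without a C# compiler, the sketch below replays questions 2 to 6 in Javascript, where, at least for these particular cases, the operators produce the same results as in C#. The function name runQuiz and the result property names are made up for this example.

```javascript
function runQuiz() {
    const results = {};
    {   // question 2
        let a = 1;
        const b = ++a;
        results.q2 = b;
    }
    {   // question 3
        let a = 1;
        const b = a++;
        const c = ++a;
        results.q3 = c;
        results.q3b = b;
    }
    {   // question 4: pre-increment in the for loop
        let i = 0, out = '';
        for (i = 0; i < 5; ++i) out += i + ' ';
        results.q4 = out + i;
    }
    {   // question 5: post-increment in the for loop
        let i = 0, out = '';
        for (i = 0; i < 5; i++) out += i + ' ';
        results.q5 = out + i;
    }
    {   // question 6: the old value is assigned back over the incremented one
        let a = 1;
        a = a++;
        results.q6 = a;
    }
    return results;
}

console.log(runQuiz());
```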

Hope that makes you think.

Intro

  When I was a kid, computers didn't have multithreading, multitasking or even multiple processes. You executed a program and it was the only program that was running. Therefore the way to do, let's say, user key input was to check again and again if there was a key in a buffer. To give you a clearer view on how bonkers that was, if you try something similar in Javascript the page dies. Why? Because the processing power needed to look for a value in an array is minuscule, so you will basically have a loop that executes hundreds of thousands or even millions of times a second. The CPU will try to accommodate that and run at full power. You will have a do-nothing loop that takes the entire capacity of the CPU for the current process. Worse, Javascript runs the page's event handlers on the same thread as that loop, so while it spins nothing else gets to run. The browser would have problems handling legitimate page events, like you trying to close it! Ridiculous!

Bad solution

Here is what this would look like:

class QBasic {

    constructor() {
        this._keyBuffer=[];
        // add a global handler on key press and place events in a buffer
        window.addEventListener('keypress', function (e) {
            this._keyBuffer.push(e);
        }.bind(this));
    }

    INKEY() {
        // remove the first key in the buffer and return it
        const ev = this._keyBuffer.shift();
        // return either the key or an empty string
        if (ev) {
            return ev.key;
        } else {
            return '';
        }
    }
}

// this code will kill your CPU and freeze your page
const qb = new QBasic();
while (qb.INKEY()=='') {
 // do absolutely nothing
}
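
The freeze is not only about CPU usage: the handler that fills the buffer can only run when the loop gives up control, which it never does. Here is a minimal sketch of that effect, using a timer instead of a key press so it can be tried outside a browser. The function name spinFor is made up for the example.

```javascript
// busy-waits for the given number of milliseconds and reports
// whether a timer scheduled just before the loop managed to fire
function spinFor(milliseconds) {
    let fired = false;
    setTimeout(() => { fired = true; }, 0);
    const start = Date.now();
    while (Date.now() - start < milliseconds) {
        // do absolutely nothing
    }
    return fired;
}

console.log(spinFor(50)); // false: the callback can only run after the loop ends
```

A key press handler is in the same situation as that timer callback, so in the code above the buffer stays empty forever.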

How then, should we port the original QBasic code into Javascript?

WHILE INKEY$ = ""
    ' DO ABSOLUTELY NOTHING
WEND

Best solution (not accepted)

Of course, the best solution is to redesign the code and rewrite everything. After all, this is thirty year old code. But let's imagine that, in the best practices of porting something, you want to find the first principles of translating QBasic into Javascript, then automate it. Or that, even if you do it manually, you want to preserve the code as much as possible before you start refactoring it. I do want to write a post about the steps of refactoring legacy code (and as you can see, sometimes I actually mean legacy, as in "bestowed upon by our forefathers"), but I wanted to write something tangible first. Enough theory!

Interpretative solution (not accepted, yet)

Another solution is to reinterpret the function into a waiting function, one that does nothing until a key is pressed. That would be easier to solve, but again, I want to translate the code as faithfully as possible, so this is a no-no. However, I will discuss how to implement this at the end of this post.

Working solution (slightly less bad solution)

Final solution: do the same thing, but add a delay, so that the loop doesn't use the entire pool of CPU instructions. Something akin to Thread.Sleep in C#, maybe. But, oops! in Javascript there is no function that would freeze execution for a period of time.

The only thing related to delays in Javascript is setTimeout, a function that indeed waits for the specified interval of time, but then executes the function that was passed as a parameter. It does not pause execution. Whatever you write after setTimeout will execute immediately. Enter async/await, new in Javascript ES8 (or EcmaScript 2017), and we can use the delay function as we did when exploring QBasic PLAY:

function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

Now we can wait inside the code with await delay(milliseconds);. However, this means turning the functions that use it into async functions. As far as I am concerned, the pollution of the entire function tree with async keywords is really annoying, but it's the future, folks!
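
To see the difference in practice, here is a small sketch that records the order in which things happen with plain setTimeout versus await delay. The function names withSetTimeout, withDelay and main, and the durations, are arbitrary choices for this example.

```javascript
function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

// setTimeout only schedules the callback; the line after it runs right away
function withSetTimeout(log) {
    setTimeout(() => log.push('timeout fired'), 10);
    log.push('after setTimeout');
}

// await really suspends this function until the delay is over,
// but the function has to be async, and so does any caller that waits for it
async function withDelay(log) {
    await delay(50);
    log.push('after delay');
}

async function main() {
    const log = [];
    withSetTimeout(log);
    await withDelay(log);
    return log;
}

main().then(log => console.log(log.join(', ')));
// after setTimeout, timeout fired, after delay
```

Note how main had to become async just because something two levels down needed to wait.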

Isn't this amazing? In order to port to Javascript code that was written in 1990, you need features that were added to the language only in 2017! If you wanted to do this in Javascript ES5 you couldn't do it! The concept of software development has changed so much that it would have been impossible to port even the simplest piece of code from something like QBasic to Javascript.

Anyway, now the code looks like this:

function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

class QBasic {

    constructor() {
        this._keyBuffer=[];
        // add a handler on every key press and place events in a buffer
        window.addEventListener('keypress', function (e) {
            this._keyBuffer.push(e);
        }.bind(this));
    }

    async INKEY() {
        // remove the first key in the buffer and return it
        const ev = this._keyBuffer.shift();
        // return either the key or an empty string
        if (ev) {
            return ev.key;
        } else {
            await delay(100);
            return '';
        }
    }
}

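// INKEY is now async and returns a Promise, so its result must be awaited
// (inside an async function or at the top level of a module)
const qb = new QBasic();
while ((await qb.INKEY())=='') {
 // do absolutely nothing
}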

Now, this will work by delaying for 100 milliseconds when there is nothing in the buffer. It's clearly not ideal. If one wanted to fix a problem with a loop running too fast, then the delay function should have at least been added to the loop, not the INKEY function. Using it like this you will get some inexplicable delays in code that would want to use fast key inputs. It's, however, the only way we can implement an INKEY function that will behave as close to the original as possible, which is hiring a 90 year old guy to go to a letter box and check if there is any character in the mail and then come back and bring it to you. True story, it's the original implementation of the function!
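
For comparison, here is a sketch of the variant with the delay in the loop rather than in INKEY. It is not a faithful port, only an illustration of the trade-off. The target constructor parameter and the waitForKey function are assumptions added so the sketch can be tried outside a browser; they are not part of the original code.

```javascript
function delay(duration) {
    return new Promise(resolve => setTimeout(resolve, duration));
}

class QBasic {
    // `target` is hypothetical: it defaults to window, but any
    // object with addEventListener can be passed in instead
    constructor(target = window) {
        this._keyBuffer = [];
        target.addEventListener('keypress', e => this._keyBuffer.push(e));
    }

    // synchronous again: returns a key or an empty string immediately
    INKEY() {
        const ev = this._keyBuffer.shift();
        return ev ? ev.key : '';
    }
}

// the pause lives in the loop, so other callers of INKEY are not slowed down
async function waitForKey(qb, pollMs = 100) {
    let key;
    while ((key = qb.INKEY()) == '') {
        await delay(pollMs);
    }
    return key;
}

// usage in a browser: const key = await waitForKey(new QBasic());
```

The price is that the loop itself no longer looks like the QBasic original, which is exactly what the faithful translation was trying to avoid.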

Interpretative solution (implementation)

It would have been much simpler to implement the function in a blocking manner. In other words, when called, INKEY would wait for a key to be pressed, then exit and return that key when the user presses it. We again would have to use Promises:

class QBasic {

    constructor() {
        this._keyHandler = null;
        // instead of using a buffer for keys, keep a reference
        // to a resolve function and execute it if it exists
        window.addEventListener('keypress', function (e) {
            if (this._keyHandler) {
                const handler = this._keyHandler;
                this._keyHandler = null;
                handler(e.key);
            }
        }.bind(this));
    }

    INKEY() {
        const self = this;
        return new Promise(resolve => self._keyHandler = resolve);
    }
}


const qb = new QBasic();
while ((await qb.INKEY())=='') { // or just await qb.INKEY(); instead of the loop
 // do absolutely nothing
}

Amazing again, isn't it? The loops (pun not intended) through which one has to go in order to force a procedural mindset on an event based programming language.

Disclaimer

Just to make sure, I do not recommend this style of software development; this is only related to porting old school code and is more or less designed to show you how software development has changed in time, from a period before most of you were even born.