
  I have to admit this is a strange time for me, in which I struggle with finishing any book. My mind may drift away at a moment's notice, thoughts captured by random things that inflame my interest. And with limited time resources, some things fall through the cracks, like the ability to dedicate myself to books that don't immediately grab my attention. Such a book is The Ten Thousand Doors of January.

  And you know the type. It's one of those where the way the words sound as you read them is more important than what they say, a sort of magical blank verse that attempts to evoke rather than tell, to feel rather than reason, while also maintaining a rigorous intellectual style. Alix E. Harrow is a good writer and it shows; however, she is too caught up in her own writing. This story features a girl with an ability that manifests when she writes words of power. She is an avid reader and, in order to learn about her capabilities, she receives a book that tells the story of another girl who was similar to her. And the style of that book is, you guessed it, words crafted to evoke rather than tell.

  So at about 40% of the book nothing had happened other than a girl of color living in a house of plenty, but imprisoned by rules and starved of knowledge or power. Her captor and adoptive father is a white and powerful aristocrat, cold as ice and authoritative in every action or word, while she is a good girl caught between her desires and her upbringing. I've read books like this before and I liked some of them a lot. And this may yet evolve into a beautiful story, but as I was saying above, I am not in the mood for paying that much attention before something happens.

  In conclusion, while I feel defeated and still have a desire to continue reading the book, I also have to accept that I don't have the resources to struggle with it. I would rather find a story that is more comfortable for me at this time.

Intro

  There is a saying that the novice writes code that works, without thinking of anything else, the expert comes and rewrites that code according to good practices, and the master rewrites it so that it works again, thinking of everything. It applies particularly well to SQL. Sometimes good and well tried best practices fail in specific cases, and one must be guided either by precise measurements or by narrow rules that take decades to learn.

  If you ever wondered why some SQL queries are very slow or how to write complex SQL stored procedures without them reaching sentience and behaving unpredictably, this post might help. I am not a master myself, but I will share some quick and dirty ways of writing, then checking your SQL code.

Some master rules

  First of all, some debunking of best practices that make unreasonable assumptions at scale:

  1. If you have to extract data based on many parameters, then add them as WHERE or ON clauses and the SQL engine will know how to handle it.

    For small queries and for well designed databases, that is correct. The SQL Server engine attempts to create execution plans for these parameter combinations and reuse them on future executions. However, when the number of parameters increases, the number of possible parameter combinations increases exponentially. The execution optimization should not take more than the execution itself, so the engine just chooses one of the existing plans which appears most similar to the parameters given. Sometimes this results in abysmal performance.

    There are two solutions:

    The quick and dirty one is to add OPTION (RECOMPILE) to the parameterized SELECT query. This tells the engine to always ignore existing execution plans. With SQL Server 2016 there is a new feature called Query Store, plus a graphical interface that explores execution plans, so one can choose which ones are good and which ones are bad. If you have the option, you might manually force an execution plan on specific queries as well, but I don't recommend this because it is a brittle and nonintuitive solution. You need a DBA to make sure the associations are correct and maintained properly.
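
    To illustrate, here is a minimal sketch of the quick and dirty fix; the Orders table, its columns and the parameters are made up for the example:

    DECLARE @status INT = 1
    DECLARE @minDate DATETIME = '2020-01-01'

    -- hypothetical parameterized search; the hint forces a fresh plan for the current parameter values
    SELECT o.ID, o.Total
    FROM Orders o
    WHERE (@status IS NULL OR o.Status = @status)
      AND (@minDate IS NULL OR o.CreatedAt >= @minDate)
    OPTION (RECOMPILE)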

    The better one, to my own surprise, is to use dynamic SQL. In other words, if you have 20 parameters to your stored procedure, with only some getting used at any time (think an Advanced Search page), create an SQL string only with the parameters that are set, then execute it.

    My assumption has always been that the SQL engine will do this for me if I use queries like WHERE (@param IS NULL OR <some condition with @param>). I was disappointed to learn that it does not always do that. Be warned, though, that most of the time multiple query parameters are optimized by running several operations in parallel, which is best!

  2. If you query on one column or another, an OR clause will be optimal.

    Think of something like this: You have a table with two account columns AccId and AccId2. You want to query a lot on an account parameter @accountId and you have added an index on each column.

    At this time the more readable option, and for small queries readability is always preferable to performance improvement, is WHERE AccId=@accountId OR AccId2=@accountId. But how would the indexes be used here, in this OR clause? First the engine will have to find all entries with the correct AccId, then again find entries with the correct AccId2, but only the entries that have not been found in the first search.

    First of all, SQL will not do this very well when the WHERE clause is very complex. Second of all, even if it did it perfectly, if you know there is no overlap, or you don't care or you can use a DISTINCT further on to eliminate duplicates, then it is more effective to have two SELECT queries, one for AccId and the other for AccId2 that you UNION ALL afterwards.

    My assumption has always been that the SQL engine will do this automatically. I was quite astounded to hear it was not true. Also, I may be wrong, because different SQL engines and their multitude of versions, compounded with the vast array of configuration options for both engine and any database, behave quite differently. Remember the parallelism optimization, as well.
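
    To make the OR versus UNION ALL rewrite concrete, here is a minimal sketch using the AccId/AccId2 example; the Transactions table name is made up:

    DECLARE @accountId BIGINT = 123

    -- the readable OR version
    SELECT * FROM Transactions t
    WHERE t.AccId = @accountId OR t.AccId2 = @accountId

    -- the UNION ALL version: each SELECT can seek its own index
    SELECT * FROM Transactions t
    WHERE t.AccId = @accountId
    UNION ALL
    SELECT * FROM Transactions t
    -- the second condition skips rows already returned above (assumes AccId is not nullable)
    WHERE t.AccId2 = @accountId AND t.AccId <> @accountId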

  3. Temporary tables are slow, use table variables instead.

    Now that is just simple logic, right? A temporary table uses the disk, while a table variable uses memory. The second has to be faster, right? In the vast majority of cases this will be true. It all depends (a word used a lot in SQL circles) on what you do with it.

    Using a temporary table might first of all be optimized by the engine to not use the disk at all. Second, temporary tables have statistics, while table variables do not. If you want the SQL engine to do its magic without your input, you might just have to use a temporary table.
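
    As a rough sketch, the two options look like this; SomeLargeTable and its columns are made up for the example:

    -- table variable: no statistics, so the optimizer often assumes it holds very few rows
    DECLARE @tab TABLE (ID BIGINT PRIMARY KEY, Val INT)
    INSERT INTO @tab (ID, Val)
    SELECT ID, Val FROM SomeLargeTable WHERE Val < 1000

    -- temporary table: has statistics and can be indexed after creation
    CREATE TABLE #tmp (ID BIGINT PRIMARY KEY, Val INT)
    INSERT INTO #tmp (ID, Val)
    SELECT ID, Val FROM SomeLargeTable WHERE Val < 1000
    -- the optimizer can now estimate row counts when #tmp is joined to other tables
    DROP TABLE #tmp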

  4. A large query that does everything is better than small queries that I combine later on.

    This is a more common misconception than the others. The optimizations the SQL engine does work best on smaller queries, as I've already discussed above, so if a large query can be split into two simpler ones, the engine will more likely be able to find the best way of executing each. However, this only applies if the two queries are completely independent. If they are related, the engine might find the perfect way of getting the data in a query that combines them all.

    Again, it depends. One other scenario is when you try to DELETE or UPDATE a lot of rows. SQL is always "logging" the changes that it does on the off chance that the user cancels the query and whatever incomplete work has been done has to be undone. With large amounts of data, this results in large log files and slow performance. One common solution is to do it in batches, using UPDATE TOP (10000) or something similar inside a WHILE loop. Note that while this solves the log performance issue, it adds a little bit of overhead for each executed UPDATE.
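
    A sketch of the batching pattern; the BigTable name and the date condition are made up for the example:

    DECLARE @cutoff DATETIME = '2019-01-01'
    DECLARE @batchSize INT = 10000

    -- delete in small batches so the transaction log stays small
    WHILE 1 = 1
    BEGIN
      DELETE TOP (@batchSize) FROM BigTable
      WHERE CreatedAt < @cutoff

      -- fewer rows than a full batch means there is nothing left to delete
      IF @@ROWCOUNT < @batchSize BREAK
    END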

  5. If I have an index on a DATETIME column and I want to check the records in a certain day, I can use CAST or CONVERT.

    That is just a bonus rule, but I've run into this problem recently. The general rule is that you should never perform calculations on columns inside WHERE clauses. So instead of WHERE CAST(DateColumn as DATE)=@date use WHERE DateColumn>=@date AND DateColumn<DATEADD(DAY,1,@date). The calculation is done (once) on the parameters given to the query, not on every value of DateColumn. Also, indexes are now used.
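
    Spelled out, the two versions would look something like this; the Events table and its DateColumn are made up for the example:

    DECLARE @date DATE = '2020-06-01'

    -- not sargable: the CAST on the column prevents an index seek
    SELECT * FROM Events WHERE CAST(DateColumn AS DATE) = @date

    -- sargable: the calculation is done once, on the parameter, and the index can be used
    SELECT * FROM Events
    WHERE DateColumn >= @date AND DateColumn < DATEADD(DAY, 1, @date)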

Optimizing queries for dummies

So how does one determine if one of these rules applies to their case? "Complex query" might mean anything. Executing a query multiple times can give very different results based on how the engine is caching the data or computing execution plans.

A lot of what I am going to say can be performed using SQL commands, as well. Someone might want to use direct commands inside their own tool to monitor and manage the performance of SQL queries. But what I am going to show you uses SQL Management Studio and, better still, not that horrid Execution Plan chart that often crashes SSMS and is hard to read for anything but the simplest queries. Downside? You will need SQL Management Studio 2014 or higher.

There are two buttons in the SSMS menu. One is "Include Actual Execution Plan", which generates an ugly and sometimes broken chart of the execution. The other one is "Include Live Query Statistics", which seems to be doing the same, only in real time. However, the magic happens when both are enabled. In the Results tab you will get not only the query results, but also tabular data about the execution performance. It is amazingly useful, as you get a table for each intermediate query; for example, if you have a stored procedure that executes several queries in a row, you get a table for each.

Even more importantly, it seems that using these options will start the execution without any cached data or execution plans. Running it several times gives consistent execution times.

In the LiveQuery tables, the values we are interested in are, in order of importance, EstimateIO, EstimateCPU and Rows.

EstimateIO tells us how much of the disk was used. The disk is the slowest part of a computer, especially when multiple processes are running queries at the same time. Your objective is to minimize that value. Luckily, on the same row, we get data about the substatement that generated that row, which parameters were used, which index was used and so on. This post is not about how to fix every single scenario, but only about how to determine where the biggest problems lie.

EstimateCPU is saying how much processing power was used. Most of the time this is very small, as complex calculations should not be performed in queries anyway, but sometimes a large value here shows a fault in the design of the query.

Finally, Rows. It is best to minimize the value here, too, but it is not always possible. For example a COUNT(*) will show a Clustered Index Scan with Rows equal to the row count in the table. That doesn't cause any performance problems. However, if your query is supposed to get 100 rows and somewhere in the Live Query table there is a value of several millions, you might have used a join without the correct ON clause parameters or something like that.

Demo

Let's see some examples of this. I have a Main table, with columns ID BIGINT, Random1 INT, Random2 NVARCHAR(100) and Random3 CHAR(10), with one million rows. Then an Ind table, with columns ID BIGINT, Qfr CHAR(4) and ValInd BIGINT, with 10000 rows. The Ind table's ID column matches the Main table's ID column, and the Qfr column has only three possible values: AMT, QTY, Sum.

Here is a demo on how this would work:

DECLARE @r1 INT = 1300000
DECLARE @r2 NVARCHAR(100) = 'a'
DECLARE @r3 CHAR(10) = 'A'
DECLARE @qfr CHAR(4) = 'AMT'
DECLARE @val BIGINT = 500000

DECLARE @r1e INT = 1600000
DECLARE @r2e NVARCHAR(100) = 'z'
DECLARE @r3e CHAR(10)='Z'
DECLARE @vale BIGINT = 600000

SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE (@r1 IS NULL OR m.Random1>=@r1)
  AND (@r2 IS NULL OR m.Random2>=@r2)
  AND (@r3 IS NULL OR m.Random3>=@r3)
  AND (@val IS NULL OR i.ValInd>=@val)
  AND (@r1e IS NULL OR m.Random1<=@r1e)
  AND (@r2e IS NULL OR m.Random2<=@r2e)
  AND (@r3e IS NULL OR m.Random3<=@r3e)
  AND (@vale IS NULL OR i.ValInd<=@vale)
  AND (@qfr IS NULL OR i.Qfr=@qfr)

I have used 9 parameters, each with their own values, to limit the number of rows I get. The Live Query result is:

You can see that the EstimateIO values are non-zero only on the Clustered Index Scans, one for each table. Here is how the StmtText looks: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]),  WHERE:(([@val] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]>=[@val]) AND ([@vale] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]<=[@vale]) AND ([@qfr] IS NULL OR [Test].[dbo].[Ind].[Qfr] as [i].[Qfr]=[@qfr])) ORDERED FORWARD)".

This is a silly case, but you can see that the @parameter IS NULL type of query condition has not been removed, even when the parameter is clearly not null.

Let's change the values of the parameters:

DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL

DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL

Now the Live Query result is:

Same thing! 5.0 and 7.2 for EstimateIO.

Now, let's do the same thing with dynamic SQL. It's a little more annoying, mostly because of the parameter syntax, but check it out:

DECLARE @sql NVARCHAR(Max)

DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL

DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL


SET @sql=N'
SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE 1=1 '
IF @r1 IS NOT NULL SET @sql+=' AND m.Random1>=@r1'
IF @r2 IS NOT NULL SET @sql+=' AND m.Random2>=@r2'
IF @r3 IS NOT NULL SET @sql+=' AND m.Random3>=@r3'
IF @val IS NOT NULL SET @sql+=' AND i.ValInd>=@val'
IF @r1e IS NOT NULL SET @sql+=' AND m.Random1<=@r1e'
IF @r2e IS NOT NULL SET @sql+=' AND m.Random2<=@r2e'
IF @r3e IS NOT NULL SET @sql+=' AND m.Random3<=@r3e'
IF @qfr IS NOT NULL SET @sql+=' AND i.Qfr=@qfr'
IF @vale IS NOT NULL SET @sql+=' AND i.ValInd<=@vale'

PRINT @sql

EXEC sp_executesql @sql,
  N'@r1 INT, @r2 NVARCHAR(100), @r3 CHAR(10), @qfr CHAR(4),@val BIGINT,@r1e INT, @r2e NVARCHAR(100), @r3e CHAR(10),@vale BIGINT',
  @r1,@r2,@r3,@qfr,@val,@r1e,@r2e,@r3e,@vale

Now the Live Query results are:

At first glance we have not changed much. IO is still 5.0 and 7.2. Yet there are three fewer execution steps. There is no parallelism and the query has been executed in 5 seconds, not 6. The StmtText for the same step is now: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]), ORDERED FORWARD)". The printed SQL command is:

SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE 1=1  AND m.Random1>=@r1 AND m.Random1<=@r1e

Conclusion

Again, this is a silly example, but it shows results anyway! In my work I have used this approach to make a stored procedure run three to four times faster!

One can optimize usage of IO, CPU and Rows by adding indexes, by narrowing join conditions, by reducing the complexity of executed queries, eliminating temporary tables, partitioning existing tables, adding or removing hints, removing computation from queried columns and so many other possible methods, but they amount to nothing if you cannot measure the results of your changes.

By using Actual Execution Plan together with Live Query Statistics you get:

  • consistent execution times and disk usage
  • a clear measure of what went on with each subquery

BTW, you get the same effect if you use SET STATISTICS PROFILE ON before the query. Yet I wrote this post for someone who doesn't want to go into extra SQL code. Also, when measuring performance, it is recommended to add a DBCC FREEPROCCACHE line before execution OR add the OPTION (RECOMPILE) hint to your query (this doesn't work on a stored procedure execution; you would have to change the SP queries to include RECOMPILE).
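
If you do prefer the SQL-only route, a minimal sketch would be something like this (DBCC FREEPROCCACHE clears plans for the whole instance, so keep it away from production servers):

DBCC FREEPROCCACHE         -- clear cached execution plans, so timings are consistent
SET STATISTICS PROFILE ON  -- adds the tabular execution information after each result set

SELECT COUNT(*) FROM Main  -- the query you actually want to profile

SET STATISTICS PROFILE OFF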

I wish I had some more interesting examples for you, guys, but sharing screenshots from the workplace is not something I want to do and I don't do any complex SQL work at home. I hope this helps.

  When I was looking at Javascript frameworks like Angular and ReactJS I kept running into these weird reducers that were used in state management mostly. It all felt so unnecessarily complicated, so I didn't look too closely into it. Today, reading some random post on dev.to, I found this simple and concise piece of code that explains it:

// simple to unit test this reducer
function maximum(max, num) { return Math.max(max, num); }

// read as: 'reduce to a maximum' 
let numbers = [5, 10, 7, -1, 2, -8, -12];
let max = numbers.reduce(maximum);

Kudos to David for the code sample.

The reducer, in this case, is a function that can be fed to the reduce function, which is well known to developers in Javascript and a few other languages, but which is foreign to .NET developers. In LINQ, we have Aggregate!

// simple to unit test this Aggregator ( :) )
Func<int, int, int> maximum = (max, num) => Math.Max(max, num);

// read as: 'reduce to a maximum' 
var numbers = new[] { 5, 10, 7, -1, 2, -8, -12 };
var max = numbers.Aggregate(maximum);

Of course, in C# Math.Max is already a reducer/Aggregator and can be used directly as a parameter to Aggregate.

I have found a lot of situations where people used .reduce where a normal loop would have been clearer, which is why I almost never use Aggregate myself, but there are situations where this kind of syntax is very useful. One would be in functional programming or LINQ expressions that then get translated or optimized to something else before execution, like SQL code. (I don't know if Entity Framework translates Aggregate, though). Another would be where you have a bunch of reducers that can be used interchangeably.

Intro

  I want to examine together with you various types of sort algorithms and the tricks they use to lower the magic O number. I reach the conclusion that high performance algorithms that are labeled as specific to a certain type of data can be made generic or that the generic algorithms aren't really that generic either. I end up proposing a new form of function that can be fed to a sorting function in order to reach better performance than the classic O(n*log(n)). Extra bonus: finding distinct values in a list.

Sorting

  But first, what is sorting? Given a list of items that can be compared to one another as lower or higher, return the list in the order from lowest to highest. Since an item can be any type of data record, to define a generic sorting algorithm we need to feed it the rules that make an item lower than another and that is called the comparison function. Let's try an example in JavaScript:

  // random function from start to end inclusive
  function rand(start, end) {
    return parseInt(start + Math.random() * (end - start + 1));
  }
  
  // measure time taken by an action and output it in console
  let perfKey = 0;
  function calcPerf(action) {
    const key = perfKey++;
    performance.mark('start_' + key);
    action();
    performance.mark('end_' + key);
    const measure = performance.measure('measure_' + key, 'start_' + key, 'end_' + key);
    console.log('Action took ' + measure.duration);
  }
  
  // change this based on how powerful the computer is
  const size = 10000000;
  // the input is a list of size 'size' containing random values from 1 to 50000
  const input = [];
  for (let i = 0; i < size; i++)
    input.push(rand(1, 50000));
  
  // a comparison function between two items a and b
  function comparisonFunction(a, b) {
    if (a > b)
      return 1;
    if (a < b)
      return -1;
    return 0;
  }
  
  const output = [];
  // copy input into output, then sort it using the comparison function
  // same copying method will be used for future code
  calcPerf(() => {
    for (let i = 0; i < size; i++)
      output.push(input[i]);
    output.sort(comparisonFunction);
  });

  It's not the crispest code in the world, but it's simple to understand:

  • calcPerf computes the time it takes for an action to run and logs it to the console
  • start by creating a big array of random numbers as input
  • copy the array in a result array and sort that with the default browser sort function, to which we give the comparison function as an argument
  • display the time it took for the operation.

  This takes about 4500 milliseconds on my computer.

  Focus on the comparison function. It takes two items and returns a number that is -1, 0 or 1 depending on whether the first item is smaller, equal or larger than the second. Now let's consider the sorting algorithm itself. How does it work?

  A naïve way to do it would be to find the smallest item in the list, move it to the first position in the array, then continue the process with the rest of the array. This would have a complexity of O(n²). If you don't know what the O complexity is, don't worry, it just provides an easy to spell approximation of how the amount of work would increase with the number of items in the input. In this case, 10 million records, squared, would lead to 100 trillion operations! That's not good.

  Other algorithms are much better, bringing the complexity to O(n*log(n)), so assuming base 10, around 70 million operations. But how do they improve on this? Surely in order to sort all items you must compare them to each other. The explanation is that if a<b and b<c you do not need to compare a to c. And each algorithm tries to get to this in a different way.

  However, the basic logic of sorting remains the same: compare all items with a subset of the other items.

Partitioning

  A very common and recommended sorting algorithm is QuickSort. I am not going to go through the entire history of sorting algorithms and what they do, you can check that out yourself, but I can focus on the important innovation that QuickSort added: partitioning. The first step in the algorithm is to choose a value out of the list of items, which the algorithm hopes it's as close as possible to the median value and is called a pivot, then arrange the items in two groups called partitions: the ones smaller than the pivot and the ones larger than the pivot. Then it proceeds on doing the same to each partition until the partitions are small enough to be sorted by some other sort algorithm, like insertion sort (used by Chrome by default).

  Let's try to do this manually in our code, just the very first run of the step, to see if it improves the execution time. Lucky for us, we know that the median is around 25000, as the input we generated contains random numbers from 1 to 50000. So let's copy the values from input into two output arrays, then sort each of them. The sorted result would be reading from the first array, then from the second!

  // two output arrays, one for numbers below 25000, the other for the rest
  const output1 = [];
  const output2 = [];
  const pivot = 25000;
  
  calcPerf(() => {
    for (let i = 0; i < size; i++) {
      const val = input[i];
      if (comparisonFunction(val, pivot) < 0)
        output1.push(val);
      else
        output2.push(val);
    }
    // sorting smaller arrays is cheaper
    output1.sort(comparisonFunction);
    output2.sort(comparisonFunction);
  });

  Now, the performance is slightly better. If we do this several times, the time taken would get even lower. The partitioning of the array by an operation that is essentially O(n) (we just go once through the entire input array) reduces the comparisons that will be made in each partition. If we would use the naïve sorting, partitioning would reduce n² to n+(n/2)²+(n/2)² (once for each partitioned half), thus n+n²/2. Each partitioning almost halves the number of operations!

  So, how many times can we halve the number of operations? Imagine that we do this with an array of distinct values, from 1 to 10 million. In the end, we would get to partitions of just one element and that means we did a log2(n) number of operations and for each we added one n (the partitioning operation). That means that the total number of operations is... n*log(n). Each algorithm gets to this in a different way, but at the core of it there is some sort of partitioning, that b value that makes comparing a and c unnecessary.

  Note that we treated the sort algorithm as "generic", meaning we fed it a comparison function between any two items, as if we didn't know how to compare numbers. That means we could have used any type of data as long as we knew the rule for comparison between items.

  There are other types of sorting algorithms that only work on specific types of data, though. Some of them claim a complexity of O(n)! But before we get to them, let's make a short detour.

Distinct values intermezzo

  Another useful operation with lists of items is finding the list of distinct items. From [1,2,2,3] we want to get [1,2,3]. To do this, we often use something called a trie, a tree-like data structure that is used for quickly finding if a value exists or not in a list. It's the thing used for autocorrect or finding a word in a dictionary. It has an O(log n) complexity in checking if an item exists. So in a list of 10 million items, it would take maybe 20 operations to find the item exists or not. That's amazing! You can see that what it does is partition the list down to the item level.

  Unfortunately, this only works for numbers and strings and such primitive values. If we want to make it generic, we need to use a function that determines when two items are equal and then we use it to compare to all the other items we found as distinct so far. That makes using a trie impossible.

  Let me give you an example: we take [1,1,2,3,3,4,5] and we use an externally provided equality function:

  • create an empty output of distinct items
  • take first item (1) and compare with existing distinct items (none)
  • item is not found, so we add it to output
  • take next item (1) and compare with existing distinct items (1)
  • item is found, so we do nothing
  • ...
  • we take the last item (5) and compare with existing items (1,2,3,4)
  • item is not found, so we add it to the output

  The number of operations that must be taken is the number of total items multiplied by the average number of distinct items. That means that for a list of already distinct values, the complexity is O(n²). Not good: it increases quadratically with the number of items! And we cannot use a trie unless we have some function that would provide us with a distinctive primitive value for an item. So instead of an equality function, we would need a hashing function that returns a number or maybe a string.

  However, given the knowledge we have so far, we can reduce the complexity of finding distinct items to O(n*log(n))! It's as simple as sorting the items, then going through the list and sending to output an item when different from the one before. One little problem here: we need a comparison function for sorting, not an equality one.
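
  Just as a sketch, using the comparison function defined earlier in the post, distinct-via-sort would look something like this:

  // distinct items using only a comparison function: sort, then skip consecutive equal items
  function distinctBySort(items, comparisonFunction) {
    const sorted = [...items].sort(comparisonFunction);
    const result = [];
    for (let i = 0; i < sorted.length; i++) {
      // an item is kept only if it differs from the previous one in the sorted list
      if (i === 0 || comparisonFunction(sorted[i - 1], sorted[i]) !== 0)
        result.push(sorted[i]);
    }
    return result;
  }

  // example: [1,1,2,3,3,4,5] becomes [1,2,3,4,5]
  console.log(distinctBySort([1, 1, 2, 3, 3, 4, 5], comparisonFunction));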

So far

  We looked into the basic operations of sorting and finding distinct values. To be generic, one has to be provided with a comparison function, the other with an equality function. However, if we had a comparison function available, finding distinct generic items would become significantly less complex by using sorting. Sorting is better than quadratic comparison because it uses partitioning as an optimization trick.

Breaking the n*log(n) barrier

  As I said above, there are algorithms that claim a much better performance than n*log(n). One of them is called RadixSort. BurstSort is a version of it, optimized for strings. CountSort is a similar algorithm, as well. The only problem with Radix type algorithms is that they only work on numbers or recursively on series of numbers. How do they do that? Well, since we know we have numbers to sort, we can use math to partition the lot of them, thus reducing the cost of the partitioning phase.

  Let's look at our starting code. We know that we have numbers from 1 to 50000. We can also find that out easily by going once through all of them and computing the minimum and maximum value. O(n). We can then partition the numbers by their value. BurstSort starts with a number of "buckets" or lists, then assigns numbers to the buckets based on their value (dividing the value by the number of buckets). If a bucket becomes too large, it is "burst" into another number of smaller buckets. In our case, we can use CountSort, which simply counts each occurrence of a value in an ordered array. Let's see some code:

  const output = [];
  const buckets = [];
  calcPerf(() => {
    // for each possible value add a counter
    for (let i = 1; i <= 50000; i++)
      buckets.push(0);
    // count all values
    for (let i = 0; i < size; i++) {
      const val = input[i];
      buckets[val - 1]++;
    }
    // create the output array of sorted values
    for (let i = 1; i <= 50000; i++) {
      const counter = buckets[i - 1];
      for (let j = 0; j < counter; j++)
        output.push(i);
    }
  });

  This does the following:

  • create an array from 1 to 50000 containing zeros (these are the count buckets)
  • for each value in the input, increment the bucket for that value (at that index)
  • at the end just go through all of the buckets and output the index as many times as the value in the bucket shows

  This algorithm generated a sorted output array in 160 milliseconds!

  And of course, it is too good to be true. We used a lot of a priori knowledge:

  • min/max values were already known
  • the values were conveniently close together integers so we can use them as array indexes (an array of size 50000 is not too big)

  I can already hear you sigh "Awwh, so I can't use it!". Do not despair yet!

  The Radix algorithm, which is used only for numbers, can also be used on strings. How? Well, a string is reducible to a list of numbers (characters), so one can recursively assign each string to a bucket based on the character value at a certain index. Note that we don't have to go through the entire string; the first few letters are enough to partition the list into lists small enough to be cheaply sorted.

  Do you see it yet?

A generic partition function

  What if we used neither an equality function, nor a comparison function, nor a hashing function as a parameter for our generic sort/distinct algorithm? What if we used a partition function? This partition function would act like a multilevel hashing function returning values that can also be compared to each other. In other words, the generic partition function could look like this:

  function partitionFunction(item, level) returning a byte

  For strings it returns the numeric value of the character at position level, or 0. For numbers it returns the bytes of the binary representation of the number, from the highest to the lowest. For object instances with multiple properties, it would return a byte for each level in each of the properties that we want to order by. Radix style buckets would use the known values from 0 to 255 (the partition function returns a byte). The fact that the multilevel partitioning function is provided by the user means we can pack into it all the a priori knowledge we have, while keeping the sorting/distinct algorithm unchanged and thus generic! The sorting will be called by providing two parameters: the partitioning function and the maximum level to which it should be called:

  sort(input, partitioningFunction, maxLevel)

A final example

  Here is an implementation of a radix sorting algorithm that receives a multilevel partitioning function using our original input. Note that it is written so that it is easily read and not for performance:

  // will return a sorted array from the input array
  // using the partitioning function up to maxLevel
  function radixSort(input, partitioningFunction, maxLevel) {
    // one bucket for each possible value of the partition function
    let buckets = Array.from({length: 256}, () => []); 
    buckets[0] = input;
    // reverse order, because level 0 should be the most significant
    for (let level = maxLevel-1; level >=0; level--) {
      // for each level we re-sort everything in new buckets
      // but taken in the same order as the previous step buckets
      let tempBuckets = Array.from({length: 256}, () => []);
      for (let bucketIndex = 0; bucketIndex < buckets.length; bucketIndex++) {
        const bucket = buckets[bucketIndex];
        const bucketLength = bucket.length;
        for (let bucketOffset = 0; bucketOffset < bucketLength; bucketOffset++) {
          const val = bucket[bucketOffset];
          const partByte = partitioningFunction(val, level);
          tempBuckets[partByte].push(val);
        }
      }
      // we replace the source buckets with the new ones, then repeat
      buckets = tempBuckets;
    }
    // combine all buckets in an output array
    const output = [].concat(...buckets);
    return output;
  }

  // return value bytes, from the most significant to the least
  // because the values are below 50000, they always fit in at most 2 bytes
  // 0xFFFF is 65535 in hexadecimal
  function partitioningFunction(item, level) {
    if (level === 0) return item >> 8;
    if (level === 1) return item & 255;
    return 0;
  }
  
  let output3 = [];
  calcPerf(() => {
    output3 = radixSort(input, partitioningFunction, 2);
  });

Want to know how long it took? 1300 milliseconds!

You can see how the same kind of logic can be used to find distinct values, without actually sorting, just by going through each byte from the partitioning function and using them as values in a trie, right?
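
For example, here is a rough sketch of finding distinct values with the same partitioning function, using nested plain objects as a makeshift trie keyed on each returned byte:

  // distinct values using the multilevel partitioning function, no sorting involved
  function radixDistinct(input, partitioningFunction, maxLevel) {
    const root = {};
    const output = [];
    for (let i = 0; i < input.length; i++) {
      const val = input[i];
      let node = root;
      // walk (and build) the trie using the byte returned at each level
      for (let level = 0; level < maxLevel; level++) {
        const partByte = partitioningFunction(val, level);
        if (!node[partByte]) node[partByte] = {};
        node = node[partByte];
      }
      // the leaf remembers whether this combination of bytes was already seen
      if (!node.seen) {
        node.seen = true;
        output.push(val);
      }
    }
    return output;
  }

  // using the same partitioningFunction as above
  console.log(radixDistinct([1, 1, 2, 3, 3, 4, 5], partitioningFunction, 2));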

Conclusion

Here is how a generic multilevel partitioning function replaces comparison, equality and hashing functions with a single concept that is then used to get high performance from common data operations such as sorting and finding distinct values.

I will want to work on formalizing this and publishing it as a library or something like that, but until then, what do you think?

Wait, there is more

There is a framework in which something similar is being used: SQL. It's the most common place where ORDER BY and DISTINCT are used. In SQL's case, we use an optimization method that relies on indexes, which are tree data structures storing the keys that we want to order or filter by. Gathering the data to fill a database index also has its complexity. In this case, we pre-partition once and we sort many times. It's another way of reducing the cost of the partitioning!

However, this is just a sub-type of the partition function I am talking about, one that uses a precomputed data structure to reach its goal. The multilevel partition function concept I am describing here may be pure code or some other encoding of information we already know before doing the operation.

Finally, the complexity. What is it? Well, instead of O(n*log(n)) we get O(n*k), where k is the maximum level used in the partition function. This depends on the data, so it's not a constant, but it's closer to the theoretical limit for sorting, closer to O(n) than the classic log version. In our example, k was a log, but its base was 256, not 10 as usually assumed.

I am not the best algorithm and data structure person, so if you have ideas about it and want to help me out, I would be grateful.  


  Four years ago today I was blogging about a list of technologies that I wanted to learn or at least explore. It was a very ambitious list, with the idea that I might shame myself into studying at least part of it. Apparently, I am shameless. Anyway, this is a follow-up post on how it all went down.

.NET Core: MVC, Entity Framework, etc.

  I actually had the opportunity to use ASP.Net Core at my job. I didn't get to do a lot of it, though, so I just did the usual: start working on something, google the hell out of everything, make it work. I used .NET Core 1 and 2, moved to 3 and now they are just releasing 5. Was it useful? Well, actually no.

  The state of the software industry is in constant flux and while new technologies pop up all the time, there is also a strong force moving the other way, mainly coming from the people that pay for software. Why invest in a new technology when the old one has been working fine for years? Yeah, don't answer that. My point is that while staying current with .NET Core was a boon, I didn't really see a lot of paying customers and employers begging me to do anything with it. In fact, I left my job working on a .NET Framework 4.7 automation project to work on an ASP.Net MVC 4 application written in Visual Basic. So I am learning things that are new to me, but really are very old.

  All I am saying is that there is a paradox of job opportunities for technologies that you are supposed to know and be an expert in, but for which few have had the courage to actually pay before.

  Bottom line: having been familiar with the older technologies that .NET Core grew out of, it was kind of simple for me to get into it, but I am not an expert in it. The fact that software generally moved to the web and that server code is now as slim as it can be also undermines the value of being proficient in it. That is, until you really need to be.

OData and OAuth

  Simply said, nothing on this front. These are mostly related to frontend stuff, to new web facing applications that can be found at a public URL. I have not worked in this field basically since forever.

Typescript, Angular

  Oh, the promise of Typescript: a strongly typed language that compiles to Javascript and can be used to write code with static analysis tools, structured code, clear project boundaries! I had the opportunity to work with it. I did that to a much lesser degree than I should have. Yet the tiny contact with the new language left me disappointed. Because it is a superset of Javascript, it inherits most of its problems, can be easily hacked and the tooling chain is basically the Javascript one.

  I commend Microsoft for even attempting to develop this language and I see that it has become popular indeed. I like C#. I like the vanilla Javascript of old. I dislike the weak hybrid that Typescript seems to be. I may be wrong. Things might evolve in very interesting directions with Typescript. I will wait until then.

  And Angular has also grown into a completely different beast. I thought Angular 1 was great, then they came out with version 2, which was completely different. Now it's 9 or 10 or whatever. I liked the structured projects and how easily data changes could be controlled from non-UI code. Yet did it all have to be this complex?

Node.JS, npm, Javascript ES6+, Bower, etc.

  I thought that since I like Javascript, I might enjoy Node.JS. Working with Typescript I had to have contact with npm and at least install Node on the computer. Yet I got more and more disillusioned. A complicated chain of tools that seem to be targeted to a concept of "app" rather than a website, for which you need a completely different set of tools, the numerous changes in the paradigm of Javascript and the systems using it and, frankly, a lack of a real reason for using Javascript when I can use C# made me stop trying to get in on the action on this front. And before I started to understand the Node ecosystem, Deno appeared.

  Personally it feels that Javascript is growing too fast and uncontrolled, like a cancer. The language and tooling need to stabilize and by that I mean cut some of the fat away, not add new things. I loved some of the additions to the standard, like arrow functions, let/const and block scope, nullable and spread operators, iterators and generator functions, but then it just started to get more and more bloated, like trying to absorb every new concept in all of the other languages.

  Bottom line: it is ironic that I am now working in an app that must work on Internet Explorer 11, so I cannot even use the new features of the Javascript standard. And yes, I hate that, but at the same time I can do the job just fine. Food for thought.

Docker

  What a great idea this Docker: just tell the system what kind of setup you need and it creates it for you. It can even create it for you from scratch every single time, in isolation, so you can run your software without having to install all that heavy crap I was just telling you about above. And yet I hated it with a vengeance from the moment I tried to use it. With a clear Linux vibe, it wouldn't even work right on Windows. The images for these pseudo-virtual machines were huge. The use was cumbersome. The tooling reminded me of the good old days when I installed a Slackware distro on any old computer, without X system, of course, and then compiled beta versions of every single software I wanted to use and repeated the process whenever I had an issue.

  For a tool that is supposed to bring consistency and ease of use, it worked really inconsistently and was hard to manage. I understand that maybe all that would have gone away if I invested the effort of understanding the whole concept, but it really felt like going backward instead of forward. One day, perhaps, I will change my mind. As of now, I am just waiting for the person who reinvents Docker and makes it do what it has promised to do: ease my life.

Parallel programming

  Hey, this is actually a plus! I had a lot of opportunities to work with parallel programming, understanding the various pitfalls and going deep into the several types of parallel programming concepts in .NET. Am I a parallel programming guru now? No. But I went through the await/async phase and solved problems at all levels of parallelism. I don't like async/await. It is a good idea and it is great when you start a new project, but adding async/await to existing legacy code is a pain in the butt. Inevitably you will want to execute an async method from a sync context, control the order or amount of parallel processing, forget to call an async method with await, or try to run one in parallel by purposely not awaiting it, only to find that your scope has been disposed.

  Bottom line: I don't have to like it to use async/await, but I feel (without actually knowing a perfect solution) the design of the feature could have been better. And now they added it to Javascript, too! Anyway, I am comfortable in the area now.

HTML5, Responsive Design, LESS/SASS, ReactJS, new CSS, frontend

  A lot of new stuff to learn for a frontend developer. And how glad I am that I am not one! Again I hit the reality of business requirements vs technology requirements. To learn front end development right now is a full job that you have to do alone, for free and in your spare time. Meanwhile, the projects you are working on are not on the cutting edge, no one wants to change them just to help you learn something and, again, these concepts are mostly for public facing web applications. Maybe even mobile development.

  It does help to know all of this, but it is not a job that I want to do. Frontend is for the young. Can I say that? ReactJS, the few times I looked at it, appeared to me to be a sort of ASP for the browser, with that ugly JSX format that combines code and markup and that complicated state flow mechanism. Angular felt too stuffy and so on. Every time I look into frontend stuff it seems like software for the server side from twenty years before.

  Bottom line: if any web framework appealed to me, that would be VueJS! And no one used it at any of my places of work. It seems a framework dedicated to staying simple and efficient. And while I want to know the concepts in all of this stuff, why would I go deep into it when I need a UI designer for anything with a shape? I will be waiting for the resurgence of simple frameworks that use the new features of HTML99 and that do not require learning an entire new ecosystem to make anything work.

WPF, UWP

  I remember I absolutely loved developing in WPF. The MVVM concept of separating the UI completely from the code controlling it, which is then completely separated from the data models used, was very nice. The complete customizability of the UI without changing the code at all was amazing. The performance, not so much. But did it matter? I thought not. Yet the web obliterated the tech from the development world. I believe it deserves a rebirth.

  That being said, I have done nothing in this direction either. Shame on me.

Deep learning, AI, machine learning

  I really wanted to understand and use artificial intelligence, especially since I did my licence paper on neural networks. I love the new concepts, I've read some of the stuff they came up with and I've even been to some conferences on machine learning. Yet I have not yet found the project that could make it worthwhile. Machine learning and especially deep learning projects require huge amounts of data, while working on the underlying technology requires advanced knowledge of statistics and math, which I do not have.

  It seems to me that I will always be the end user of AI rather than working on building it.

Conclusion

  So there you have it: I worked on .NET Core and parallelism and even microservices, dabbled in Typescript and Angular, created a full-on LINQ implementation in Javascript and later Typescript. Meanwhile the world went more and more towards frontend and mobile development and Node and stuff like machine learning. And presently I am working for a bank, coding in VB and using jQuery.

  Strangely, I don't really feel guilty about it. Instead I feel like I am still going on the path of discovery, only more of myself rather than of a particular technology. The more I wait, the more new stuff appears and makes whatever I was dreaming of learning obsolete. I even considered not being a developer anymore, maybe becoming an architect or application designer. When something new appears, the immediate instinct is to check it out, to see what novelties they come up with and to find the problems and then discover solutions, then blog about them. It's a wonderfully addictive game, but after playing it for a while, it just blurs into the same thing over and over again. It's not the technology that matters, but the problems that need fixing. As long as I have problems to fix and I can find the solution, I am happy.

  I ask you, if you got to this point, what is the point of learning any technology, when you don't have the cause, the purpose, the problem that you have to solve? An HR person asked me recently why I keep working where I do. I answered: because they need me.


  The Authenticity Project is one of those books that got great marketing and so I got to read it, which leaves a slight feeling of having been tricked into reading it, but it's not a bad book. It is, however, terribly naive. It almost begs for the Brit-com makeover transition to the big screen, with its physically perfect characters who feel their lives have lost meaning but have all the resources to change them, the courage of telling the truth leading to strong friendships rather than people taking advantage of them, and the serendipity of all of them meeting each other and fitting together. But it is a feel-good book, so why not enjoy it?

  Clare Pooley graduated to fiction with this book, coming from a blog turned book about her own struggle with posh alcohol addiction, after feeling inspired by the power of being truthful. In the book, someone decides to write their most personal truth in a notebook and leave it around so other people can read it and maybe also write in it. This brings together people who have been living financially rewarding lives, but spiritually empty existences. The writing is decent, the story is obvious and lacking much subtlety, so if you want to read an uplifting fantasy about people getting everything right in their lives, this is the one for you.

  However, despite the book's premise that underneath the facade people are really different, the characters are quite cardboard. Instead of them having layer over layer of complexity, which would have made the story worth reading, it's like they hold party masks over their faces and when they drop them, you get to see all they are, vulnerable and normal while being amazing. There is a twist at the end that was kind of unexpected and a good opportunity to add more dimensions to the whole thing, only it fizzled immediately after the initial shock value.

  Bottom line: it feels as real as a fairy tale. The princesses get the princes, the dragons live happily ever after and everybody gets to keep the gold. It was not unpleasant to read, but I wouldn't recommend it, either.


  A few days ago I stumbled upon a global pandemic book that I couldn't finish because it was avoiding the exact parts that would interest me in the scenario: the disease, the cause, the immediate aftermath. Instead it used the disease as a prop to describe a world in which only children survived. Disappointed, I randomly picked another book, this time one from Liu Cixin, the author of Three Body Problem, which I liked. Lo and behold, it was about a global catastrophe that kills all the adults, too! And while it started great, I have to say that it ultimately was also a disappointment.

  In Liu's defense, it is a story he wrote in 1989, only published in China in 2003 and translated to English in 2019 because of his success with other works.

  Supernova Era has a very interesting premise: a nearby star, occluded from us by a dust cloud, goes supernova, bathing the Earth in deadly radiation. People quickly realize that the genetic damage has affected everybody and only children under the age of 13 will be able to shrug it off, while all the adults will die. Children will have finally inherited the planet. What will the outcome be?

  Unfortunately, this is where the nice part ends. The genetic damage on animals and plants is swept under the rug, logistic issues such as how children would be fed and countries run are only touched upon, the actual effects of radiation damage, its long time effects, the way it would have affected people are completely ignored. And this, also, because the supernova was only a prop to describe a world in which only children survived.

  Liu had a really weird outlook on children back then. In his mind, children lack empathy, are only interested in games and, even after a ten month training period from the adults, can only superficially grasp the nature of the world. And even the ones as old as 13 have no sexual drives. To be fair, I doubt that part would have passed the Chinese censors, but not even mentioning it and portraying prepubescent teens as asexual feels even creepier than mentioning it too much.

  And I remember myself at 12: I was reading five books at the same time, was interested in natural sciences and was avid for knowledge. If people would have said "Now we will teach you how to do what we do", I would have been immensely happy, at least for a little while. But one thing is certain, I loved my friends without reservation and I was always thinking of how I would change the world when I would grow up.

  Not these children. They are written more as psychopathic caricatures of their nationalities. American kids start shooting each other for fun, Brazilians play games of soccer with a hundred thousand players and one ball, and the Chinese succumb to fear when they find themselves under no authority and have to resort to a quantum computer to tell them what to do. They play war games that kill hundreds of thousands, but they are emotionally unaffected. They nuke their opponents for laughs. The ending is even more confusing, as it involves switching populations between the countries of the U.S. and China, for no apparent reason and ignoring transport issues and the immediate famine that it would lead to.

  Bottom line: an interesting premise that fails miserably at the end, even though the author did make the effort to finish the book. But that's exactly the feeling one gets: someone struggled to finish this, changing direction, bringing in random ideas that are never explored and ignoring the obvious.


  Station Eleven started really well. It had the fresh scene setup, the internal thoughts of a complex character, a dissection of actual motivations and emotions, rather than cardboard cliches. Because I have a bunch of books to read and when I start reading I just pick one at random, I didn't know what it was about, and so I had that feeling of "Oh well, it's not sci-fi or fantasy, but I like the writing!". I was convinced I was going to like the book.

  A chapter later and there is a killer epidemic starting. One chapter later twenty years have passed and the survivors, mostly people who were young when it happened, live in a post-apocalyptic, non-technological world. I just couldn't go on. The complex character at the beginning and the interesting setup had been completely obliterated and replaced with tired formulaic ideas. I couldn't care less about any of the new characters or what was going to happen. I don't know what Emily St. John Mandel was thinking when she started writing the book, but it is clearly not for me.

  One of the main reasons to put the book down and not continue reading was the lazy and unscientific treatment of the killer pandemic. We are talking about a flu virus that infects just by breathing for a few seconds next to someone, then disables its victims within a day. Viruses like this do not spread! Moreover, there is no way that a flu virus kills everybody. There are always exceptions, whether due to immunity, isolation or other factors. I love pandemic stories, I read them all with glee, and I did that way before the current situation, and when I see one that teaches nothing, because the epidemic is just a prop in an otherwise unrelated story, I get frustrated.

  A few years ago I was watching a movie with my wife. We both didn't quite enjoy it, but were curious on how it ended, so I told her to skip scenes to get to the end. She was skipping all the scenes I was interested in and watching the ones I couldn't care less about. This book is the same. I understand why people would like it, as I said, the writing was good, but the focus of the story and the characters were in complete opposition to my own interests.


  Almost a year ago I was reading Vaccinated, by Paul A. Offit, an ode to vaccines and their de facto promoter in modern medicine, Maurice Hilleman. Offit was angry in the book when talking about the anti-vaccine craze started by Andrew Wakefield, a self-interested psychopath who gained money and fame while feeding on the fear and desperation of parents. Yet he is not even close to how angry Seth Mnookin is in The Panic Virus, a book dedicated solely to exposing the roots and mechanisms of the industry of fear towards vaccines.

  The author systematically dissects, with proof and careful analysis, the entire argument of harmful vaccines causing autism, mercury poisoning or damaging immunity. Let me be as blunt as he is: the theory that vaccines cause autism has been thoroughly debunked and whoever continues to spout such nonsense has not read anything about the subject other than Facebook. Mnookin talks about Wakefield, David Kirby, Jenny McCarthy, Oprah Winfrey, exposing them for the profiteering promoters of deadly lies that they are. He talks about law trials and medical trials and research papers as they destroy any leg these theories stand on, but which never reach the news. He talks about devastated families, tricked into wasting their lives and money championing harmful ideas just for the tiny hope their child might get better.  

  However, it is ironic that this book suffers from the same problems that made the vaccine argument lean so much towards the emotional, dramatic, sound-bite sized bullshit: it is technical, precise, verbose, intellectual. It is difficult to read the book because it engages with your brain, assaulting it with educated language and loads of information. Meanwhile, the opponents of Mnookin's views use animated gifs with large colorful font texts and the occasional kitten. But it is a book that needs reading.

  Consider The Panic Virus as a form of vaccine itself. You need to read it so you don't fall prey to soulless predators that would use targeted well crafted yet completely misleading arguments to sway you to their side for their own profit. I am warning you, though, this is not a happy book. It made me question the worth of the human race as a whole. If such cheap techniques can be so effective in brainwashing so many people into believing absurd lies, then don't we deserve it, all the death and suffering? Aren't we failing at... I don't know, evolution? And the sad part is that most of the affected are fairly educated people, who start to rebel against "the establishment" and branch out into alternative theories without actually employing any technique of differentiating between fact and fallacy.

  Bottom line: I will rate this book the maximum value, because it is important that it be read, but it is not a perfectly written piece of literature, nor is it easy to finish. But give it a try.


  You may have heard of Richard Feynman from several sources: he was a Nobel-winning physicist, he worked on the team creating the first atomic bomb, he said many a smart thing that turned into memes at one time or another and he is generally considered a genius. This book is a collection of short anecdotal stories put on paper from recorded interviews with the man, in which you will be surprised to see that he didn't really consider himself very smart. Instead, he was looking at situations where the solution seemed perfectly obvious and did not understand why others couldn't see it.

  I found the short tales pretty amusing, but also incredibly inspiring. Here is a physicist who makes a bet with an artist that each will teach the other, so that he learns to draw - something he felt was impossible - and the artist understands more about science. In the end, Feynman sells paintings for money and the artist is none the wiser. Here is a person who at one point started fiddling with the Los Alamos safes holding the secrets of the atomic bomb and found out how easy it was to crack them. No one else thought it was easy. And above everything, he is always pranking people, making them believe he was smarter or more capable than he really was. But the joke was on him, because every time he did something, he really became good at it.

  The title says it all: "Surely You're Joking, Mr. Feynman!": Adventures of a Curious Character. If anything, he was very curious and kept his mind open to any experience. It's people like this that I admire and, of course, envy with all my being. Feynman seems not only to be a complete man, in work, fun and personal life, but also to get more from the same experience than anyone around him. I found that fascinating and therefore I invite you to read the book yourselves.


Why this article should never have been written

  It's a bit too early to do this. I am sure no one in their right mind would want to display any non-positive words about George Floyd at this moment for fear of reprisals. But I feel like that is exactly what must be done. If Floyd was an innocent victim, a hero that overcame his background only to be brought down, literally, by the heavy boot of law enforcement, then what chance do normal people have?

  I didn't want to be writing about this. I think that the entire thing has become grotesque and the only thing that could now bring these Whites and Blacks together, corrupt police and gangstas, racists and anti-racists... is me! I am sure that if I were to enter the argument, all these people angrily hating each other would come together in trying to murder me. Because while I understand both sides, I can't fully endorse either of them. Racists are just dumb. What the hell does race matter in anything? But I do understand anti-anti-racists, the ones who hate being lumped together with assholes because of the color of their skin. Anti-racism protesters are dumb. Well, maybe. I am sure all of this protesting is finally going to have an impact, and this is good, but have you seen some of these people? Manically jumping on toppled statues and roaring into megaphones about how great they are because they oppose evil. In this whole discussion, again, normal people are left out. They are boring, they don't clump together, they don't stand out, each one has their own slightly different opinion. Well, this is mine.

The gentle giant saint versus the black monster

  Something happened today that pushed me to write this post. I saw a Facebook post that detailed the criminal record of George Floyd. Cocaine dealing, two armed robberies, one of which put him away for four years, addiction and, when he was arrested, methamphetamine and fentanyl in his blood and the incriminating fake twenty dollar bill. Was it true? That is a very important question to ask, because many of these things are complete bullshit. So I googled it. And the result was actually worse: almost nothing!

  There are just two websites that actually advertise Floyd's criminal record: Great Game India - self-titled "Journal on Geopolitics and International Relations" and sporting articles like "Coronavirus Bioweapon - How China Stole Coronavirus From Canada and Weaponized It" and "How A Pornstar & Sci-Fi Writer Influenced WHO Policies On Hydroxychloroquine With Fake Data" - and The Courier Daily, which seems legit. Funny though, when you search for "George Floyd criminal record" you get Game India first and not The Daily Mail, which is linked in their article and which actually did the research and published the court documents attesting to it. They are fifth on the search page. What's more, while I was writing this blog post, the Courier Daily link disappeared from my Google search and Game India was demoted to second place, with a "gentle giant" story on top instead.

  Either way, other than Game India, no other news outlet even phrases its title so as to indicate George had been a criminal. The few that tackle the subject (The Star, The Daily Mail itself and even The Courier Daily) just portray the man as a flawed individual who nevertheless was set to change, found religion and even moved to Minneapolis to escape his past. And I agree with this viewpoint, because as far as I can see, the armed robbery had been many years before and the man had changed, in both behavior and intent. But hiding this doesn't really help. The Daily Mail article was published on the 26th of May, one day after Floyd's death, and the information therein is either not discussed or spun into a "gentle giant" narrative. He was a bouncer before the Coronavirus hit and he lost his job. George the gentle bouncer?

  One thing is certain, when you search for George's criminal record, it's more likely you get to articles about the criminal records of the arresting officers or Mark Wahlberg's hate crimes than what you actually searched for.

How did George die and why it doesn't matter

  But there is more. How did George die? You would say that having a knee on your neck while you gasp for air saying "I can't breathe" would be enough. But it's not. Several different reports say different things. The first one preliminarily portrays Floyd as a very sick man: coronary artery disease, hypertensive heart disease, even Covid-19. There were "no physical findings that support a diagnosis of traumatic asphyxia or strangulation"; instead, the diagnosis was heart failure under stress and multiple intoxicants. Finally, two days later, the report admits "a cardiopulmonary arrest while being restrained" by officers who had subjected Floyd to "neck compression". But Floyd's family would not have that, so they commissioned their own autopsy. The result? Floyd died from "asphyxia due to compression of the neck", affecting "blood flow and oxygen going into the brain", and also from "compression of the back, which interferes with breathing". The medical examiner said Floyd had no underlying medical problem that caused or contributed to his death.

  So which was it? It doesn't really matter. He died with a knee on his neck, which should never happen to anyone, and both reports admit it was a homicide. But ignoring all of these other data points doesn't help. People just vilify the policeman and raise George to saintly status. You want to solve something? Start with the truth. All of it. Now both sides have the ammunition required to never change their minds.

  I have not found any article that makes a definitive claim on which report is the correct one, if any. They all lean toward believing the second, because it fits, but if the first one was a complete fabrication, why wasn't anyone charged over it?

Wikipedia v. Facebook

  So of course I would find out about Floyd's criminal past from Facebook. It makes so much sense. It is a pool of hateful bile and rank outrage that brings the ugly right up to the surface. But this time it pointed me towards an interesting (albeit short) investigation. Without it, I would have swallowed whole the completely innocent victim narrative that is pushed on every media outlet. So, once in a blue moon, even Facebook is good for something.

  As you may have noticed above, I took some information from Wikipedia, which has an entire article dedicated to George Floyd's death. It is there that the information about his two medical autopsies is also published. On George Floyd's page, his early life consists of: basketball, football, his family calling him a gentle giant. Then he customized cars, did some rap and was an informal community leader. Only then did he get arrested a few times and then put in jail for five years. He was charged in 2007, went to jail in 2009 and was released in 2013. It's just six years and it does not define a man, but try saying that to a police officer who has just read the fact sheet on his cruiser's terminal and has to arrest a 1.93m tall intoxicated man.

  And you may want to read the entire chain of events. The police didn't just put him on the ground, they talked to him, they put him in their car, they fought, they pulled him out, all while being filmed and surrounded by a crowd.

You're never gonna get it

  How much of this is truth and how much of it is spin? You will never know. There are so many people that have to justify their own shit using carefully chosen bits and pieces from this that there will never be a truthful image of who George Floyd was and what happened to him. He is now more than a man and also much less: he is a symbol, rallying people to cry out against injustice, as they see it. The greatest thing George Floyd ever did was die and after that he stopped being human. How sad is that?

  In truth, he was a flawed man. He was not perfect. He was your everyman. A policeman casually killing him while getting filmed doing it hurts on so many levels because that could be you. That was you or your friend or your relative somewhere. But no, they had to make it about being black and being the gentle giant and being killed by the bad psycho cop and his cronies. It sounds like a Hollywood movie because it is scripted as one. You can be certain that at this point several documentaries and movies are in the works about this. And when you see it, a big-time celebrity will be playing Floyd and the cop will be played by that actor who plays psychos in every other movie because he has that face. Once you go black, you can never go back.

  I am not defending anyone here. As I said in the beginning, I am on nobody's side in this. I just hope no one will knee me or my friends to death while everybody films it.

The world has spoken

  I find it amazing that the protests in Minneapolis have spread to the entire world. It makes me hope that they will slowly turn into protests about things that matter even more than the color of one's skin, like our responsibility as a community to carefully choose our guardians, like having to think for ourselves if something is right or wrong and maybe doing something about it. George Floyd was killed slowly, over nine minutes, while people stood around and filmed it. Not just the other officers, but civilian bystanders, too.

  There were people who did something. At one point a witness said: "You got him down. Let him breathe." Another pointed out that Floyd was bleeding from the nose. Another told the officers that Floyd was "not even resisting arrest right now". Yet another said "Get him off the ground ... You could have put him in the car by now. He's not resisting arrest or nothing. You're enjoying it. Look at you. Your body language explains it." But that's all they did. Wikipedia calls them "witnesses", but you have to wonder: what skin color were they? Were they afraid they would be next and that's why all they could do was beg for George's life as he slowly died? Or did they believe the story that American TV has fed them for decades, that cops are ultimately good people who break the rules in order to protect the innocent? Or maybe a more recent belief had taken hold: that filming injustice makes you a hero and it's more than enough.

  The world has spoken. Racism must go, police brutality must go. Let's not replace them with carefully crafted fantasies, though. Let's see the truth as it is so we can make it better.

2020 is great so far

  I am not being sarcastic. After a virus that punched presidents of the free world and dictators alike in the nose, that made people question their fake feelings of safety and forced them to act, here comes this age of protesting how things are. We have been shaken awake. Will we fall asleep again? I am sure we will, but some things will have changed by then. And the year is not yet over.


  A Cavern of Black Ice is a huge, 769-page book, but only the beginning of a story that happens in a fictional realm of feudalism and magic. You just have to have the classic hero's journey, starting with a young man torn from the world he knew and was comfortable in, partially mentored by a wise and hitherto unknown relative, given a reason to set out on a perilous journey and beset by powerful, yet strangely ineffectual enemies. Of course, the Deus ex Machina abilities that help him and his quarry escape tight situations are also there.

  But there is more: various clans living in a cold inhospitable North, the ambitious ruler of a city coveting the resources of said clans, a mysterious and powerful entity chained by the ruler, a strange and magical race of people even further north, a secret sorcerous society, female assassins that you can't quite remember what they look like, a dark realm where dangerous creatures await release and so on and so on.

  The thing to understand here is that J. V. Jones set out to create a vast universe in which multiple interests clash to create a captivating story. The writing is good, the characters are decent, but there is something missing and, while I can't quite put my finger on it, I suspect it involves editing. There is too much text for what the story wants to say and, where characterisation is concerned, some actions or even complete characters are just pulled out of a hat. And remember, this is just one of at least four books in the Sword of Shadows series and it barely scratches the surface of it all.

  Bottom line: I liked the book, but not so much as to be absolutely certain I will continue to read the rest of the series. When I finished reading it I felt actual relief. If you want to spend some time immersed in a fantastic fantasy universe, this might be a good fit for you.

Intro

  So the requirement is "Write a class that would export data into a CSV format". This would be different from "Write a CSV parser", which I think could be interesting, but not as wildly complex as this. The difference comes from the fact that a CSV parser brings a number of problems for the interviewed person to think about right away, but then it quickly dries up as a source for intelligent debate. A CSV exporter seems much simpler, because the developer controls the output, but it increases in complexity as the interview progresses.

  This post is written from the viewpoint of the interviewer.

Setup

  First of all, you start with the most basic question: do you know what CSV is? I was going to try this question out on a guy who came interviewing for a senior developer position and I was excited to see how it would go. He answered that he didn't know what CSV was. Bummer! I was incredulous, but then I quickly found out he didn't know much else either. CSV is a text format for exporting a number of unidimensional records. The name comes from Comma Separated Values and might at first glance suggest a tabular data format, an idea made even more credible by Excel being able to open and export .csv files. But it is not. As the name says, it has values separated by a comma. It might even be just one record. It might contain multiple records of different types. In some cases, the separators for values and records are not even commas or newlines.
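
  To keep the discussion concrete, here is a small, made-up example of what such a text might look like, with a header row, a quoted value containing a comma and an empty value; none of this is mandated by a standard, it is just one common shape:

Name,Age,City
"Doe, John",38,London
Jane,,Paris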

  It is important to see how the interviewee explains what CSV is, because it is a concept that looks deceptively simple. Someone who first considers the complexity of the format before starting to write the code works very differently in a team than someone who throws themselves into the code, confident (or just unrealistically optimistic) that they will solve any problem down the line.

  Some ideas to explore, although it pays off to not bring them up yourself:

  • What data do you need to export: arrays, data tables, list of records?
  • Are the records of the same type?
  • Are there restrictions on the type of record?
  • What separators will be used? How do you escape values that contain the chosen separators?
  • Do values have restrictions, like not containing separators?
  • CSV header: do we support that? What does it mean in the context of different types of input? 
  • Text encoding, unicode, non-ASCII characters
  • How to handle null values?
  • Number and date formatting
  • Is there an RFC or a specification document for the CSV export format?

Implementation

  In this particular interview I have chosen that the CSV exporter class will only support an input of IEnumerable<T> (this is .NET speak for a bunch of objects of the same type).

  Give ample opportunities for questions from the person interviewed. This is not a speed test. It matters whether the candidate considers, on their own, issues like:

  • are the object properties simple types? Like string, long, integer, decimal, double, float, datetime?
  • since the requirement is any T, what about objects that are arrays, or self referencing, or having complex objects as properties?
  • how to extract the values of any object (discussions about .NET reflection or Javascript object property discovery show a lot about the interviewee, especially if they start said discussions)

  Go through the code with the candidate. This shows their ability to develop software: how they name the class, what signature they use for the export method, how they structure the code and how readable it is.
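
  For reference during such a walkthrough, here is a rough sketch of what a naive first pass might look like in C#. The class and method names are my own invention, not a required solution, and it deliberately ignores escaping, formatting and encoding, which come up in the next section:

using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text;

// Hypothetical first pass: one record per line, values separated by commas,
// a header row built from the property names. No escaping or formatting yet.
public class CsvExporter
{
    public string Export<T>(IEnumerable<T> records)
    {
        var properties = typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance);
        var builder = new StringBuilder();

        // header row from the property names
        builder.AppendLine(string.Join(",", properties.Select(p => p.Name)));

        foreach (var record in records)
        {
            // naive value extraction via reflection; nulls become empty strings
            var values = properties.Select(p => p.GetValue(record)?.ToString() ?? string.Empty);
            builder.AppendLine(string.Join(",", values));
        }

        return builder.ToString();
    }
}

  It works for flat objects with simple property types and it immediately invites every question that follows.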

  At this stage you should have a pretty good idea if the candidate is intelligent, competent and how they handle a complex problem from requirement to implementation.

Dig deeper

  This is the time to ask the questions yourself and see how they react to new information, the knowledge that they should have asked themselves the same questions and the stress of changing their design:

  • are comma and newline the only supported separators?
  • are separators characters or strings?
  • what if an exported value is a string containing a comma?
  • do you support values containing newline?
  • if you use quotes to hold a value containing commas and newlines, what happens if values contain quotes? (a small escaping sketch follows this list)
  • empty or null values. Any difference? How to export them? What if the object itself is null?
  • how to handle the header of the CSV, where do you get the name of the properties?
  • what if the record type is an array or IEnumerable?
  • what will be the numeric and date formatting used for export?
  • does the candidate know what text encoding is? Which one will they use and why?

  How have the answers to these questions changed the design? Did the candidate redesign the work or did they hold tight to the original idea and try to fix everything as it came up?

  At this point you should know how the person being interviewed responds to new information, even scope creep and, maybe most importantly, to stress. But we're not done, are we?

Bring the pain

  Bring up the concept of unit testing. If you are lucky, the candidate already brought it up. Either way, now it is time to:

  • split the code into components: the reflection code, the export code, the file system code (if any).
  • abstract components into interfaces in order to mock them in unit tests
  • collect all the specifications gathered so far in order to cover all the cases
  • ask the candidate to write one unit test (a sketch of what that might look like follows this list)
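
  To make that last point concrete, here is a sketch of what one such test might look like, using xUnit and the hypothetical CsvExporter sketched earlier; the names and the expected output are assumptions that would follow from the specifications gathered during the interview:

using System;
using Xunit;

public class CsvExporterTests
{
    private class Person
    {
        public string Name { get; set; }
        public int Age { get; set; }
    }

    [Fact]
    public void Export_WritesHeaderAndOneRecord()
    {
        var exporter = new CsvExporter();

        var result = exporter.Export(new[] { new Person { Name = "Ann", Age = 30 } });

        // assuming a header row built from the property names and one line per record
        var lines = result.Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
        Assert.Equal("Name,Age", lines[0]);
        Assert.Equal("Ann,30", lines[1]);
    }
}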

Conclusion

  A seemingly simple question will take you and the interview candidate through:

  • finding out how the other person thinks
  • specification gathering
  • code design
  • implementation
  • technical knowledge in a multitude of directions
  • development process
  • separation of concerns, unit testing
  • human interaction in a variety of circumstances
  • determining how the candidate would fit in a team

Not bad for a one line question, right?


  It's very rare for me to have such a strong reaction to a book as I had to The Shallows. A combination of high expectations from the people who recommended it and the ironically poor quality of the book almost forced me to stop reading it. It gives me a great and perverse pleasure to summarize this book into a single paragraph: the Internet is bombarding us with information and stimuli, training our brains to skim the surface of things and depriving us of the ability to "deep read", that is, to slowly digest large blocks of text and fully process what we read, which is now difficult to impossible for most people. That is it! There is nothing else in this book. And the reason why this book was bad is that it brings nothing more to the original idea explored by the author in an Atlantic Monthly cover story than quotes from other people who agree.

  Nicholas Carr decries (and cries and cries) the way the medium of the information we digest is changing the way we process that information. He uses page-long paragraphs filled with big words, meant only to make him look well-read, to repeat the same things over and over again, all the while complaining about people skipping to the juicy parts. I mean, I've been known to use a few pompous words myself, but I don't think I've ever gone out of my way to use complicated expressions when simpler ones would do.

  The multitude of citations from people ranging from ancient Greek philosophers to Artificial Intelligence scientists are cherry-picked to make his case of the demise of the "deep read" in favor of meaningless web skimming. Carr makes the correct case that too much information trains us to not completely absorb the content of the things we read, but he completely misses the mark on why that happens, ironically made evident by his style of writing: boring, pompous, long, verbose. In a classic (by now) bubble effect, he writes a book about his fears that no one but people who share those fears would actually be able to read.

  Also ironic is that he makes some predictions (in 2010) about artificial intelligence and how people will use the various (and nefarious) web services like Google Wave that now make one laugh out loud.

  The point, Carr, is that people who are bombarded with lots of information learn to quickly categorize that information, then send it to the correct bin. You skim an article, see that it is mostly word filler around a central idea, you extract that idea, then move on. There is no deep reading because there is no deep writing. It happens with books, too. One is quick to determine when a book is captivating, engaging and well researched rather than repetitive, one-sided and written for the pleasure of reading oneself and looking smug rather than for sharing knowledge or the pleasure of others. The point (made clearer by research into how AI systems designed after brains function) is that this is how brains have always worked: they filtered out as much of the meaningless as possible and tried to absorb the meaningful as quickly as possible. It is literally a search for meaning, you buffoon!

  So, yes, no one finds the time to laboriously study a book, annotate it, keeping well-received paragraphs and quips in notebooks they carry with them. But that is because there is more information out there that brings more value through its diversity. In a very sad way, The Shallows reminds me of those religious people who complained about how secular books made people not study the Bible and absorb its teachings.

  Now, the book is not completely without merit. It's just very annoying. The way we use our brains does change the abilities we later have. It's what brains are meant to do: adapt.

  Would it hurt to regularly take a break from distraction, reading something that we have decided is important and high quality, then taking the time to think and absorb and maybe reread what we thought was valuable? No, of course not. I am intimately familiar with the restlessness that comes when trying to spend more than an hour doing the same thing or keeping my attention focused on one thing only. In this, Carr is not wrong. But in assuming that slowly and carefully navigating an avalanche of information is possible, he is definitely going too far.

  Instead of complaining about how we don't absorb meaning because we are too busy filtering out noise, one could be optimistic about the ability of people, helped by technology and not despite it, to improve the way they separate chaff from wheat. Instead of decrying the size and complexity of the information that one must use, making it impossible to hold it all in one brain, why not enjoy the ability to collaborate, network and share that makes it possible for that information to be well processed by groups of people?

  Bottom line: the ideas explored in this book are conservative in nature, fearful of change, focused on what drives that change yet blind to where it takes us. It is the intellectual, pompous version of the old man wagging his cane in the air, shouting in anger at young people. It is a book that examines one phenomenon without bringing one any closer to an understanding of it. Woe betide, things will change! Well, duh!


  I want to write this post to talk about the most common mistake I make as a software developer, even after almost 20 years of experience. And it's not code related. It's more human. But I would also like to hear what you think your biggest mistakes are that are not related to lack of experience. Don't be shy!

  My mistake: assumptions.

  I was assigned this bug recently and, wanting to do a fast job and impress people, I investigated the code, noticed a bug, fixed it, then immediately submitted it for review. I had reasons for doing that, because I was new and did not know the application well. The leads would tell me if they thought I did not find the problem. But, out of the goodness of my heart, you see, I decided to test the fix as well. And I discovered that the feature was barely implemented. It was not a bug, it was a complete fuck up.

  What happened here? I assumed a certain quality of the code and expected, without any reasonable evidence, to find a small typo or a logic bug that would be solved by writing a few lines of code. Instead, I had to reimplement the whole thing as a fix, I pissed off the lead devs because they had enough on their plate and made the entire team wonder what I was doing there. I mean, I hadn't even tested the fix!

  Doing things fast means working on valid assumptions that allow you to cut corners. In a new environment, with a team you are not familiar with and a code base you don't understand, you do not have the luxury of making assumptions. Instead, do a good job first: investigate thoroughly, see what the reported problem is, find the actual problem (which may be different), come up with an attack plan, implement it, then test that it had the desired result. Yes, it takes more time than quickly compiling the logic flow in your head and hoping for the best, but in some places you don't get second chances at fixing things, teams are more formal, processes need to be followed. Optimism is also based on assumptions. Be a realist instead.

  In order to win a game you need to know its rules. That applies both to the development process and to local politics, which are sometimes more important than the work itself. Once you are a good player, you can start experimenting. The title "senior developer" is not given in a vacuum, but is relevant (or not) depending on the environment. Yes, you have experience, but you don't have *this* experience yet. In my desire to be efficient and fast I didn't do anything right and I couldn't believe I had been that stupid.

  Now, how about you? What are your silly mistakes that you can't believe you are still making?