While working on my pet project Linqer (LINQ for Javascript and Typescript) I've spent quite a lot of time improving the performance of the Quicksort algorithm I am using for .orderBy. Therefore I am publishing it here, even though you could extract it just the same from the Linqer sources, together with a short discussion of what is going on.

  Why

  First, why use it at all? Doesn't Javascript have the .sort method in the Array class? What's wrong with that?

  The answer is that the implementation for sort is different from browser to browser, or better said, from Javascript engine to Javascript engine. In Chrome, for example, the algorithm used is insertion sort, which is simple, in place, stable and reasonably fast. It is optimized for the most common usage: small arrays that need to be sorted for UI purposes and such. However, when using large arrays, the algorithm doesn't perform as well as one might expect.

  For Linqer I had an additional reason: I would use ordering followed by skip and take methods, which limit how much of the array actually needs sorting. Imagine a million item array that I wanted ordered and then needed the first ten items of. Sorting the entire thing for just ten items would have been overkill. The default .sort function doesn't have parameters for such scenarios.

  And there is another reason: the default comparison used by .sort converts items to strings, so [1, 2, 10] would get ordered as [1, 10, 2].

  Second, why Quicksort? There are a bunch of sorting algorithms out there: Mergesort, Heapsort, Radixsort, Timsort, Selectionsort. What's so special about Quicksort?

  I have to admit that I went for it by googling fast sorting algorithm. It does have "quick" in the name, doesn't it? I also found it elegant and easy to comprehend. And for my particular scenario, I liked that it used a divide et impera strategy which allowed me to ignore parts of the array if I didn't need the items there. In other words, it's very well suited both as a general sorting algorithm and a partial sorting algorithm.

  What

  I would like to tell you that it's simple to explain what Quicksort does, but it requires some amount of attention and time. In general terms, it chooses an arbitrary item (called a pivot) then orders the remaining items relative to the pivot, into two so-called partitions: the smaller items to the left, the larger to the right. Then it repeats the process for each of the two sides. How the pivot is chosen and how the partitions are handled is what differentiates Quicksort implementations and determines their performance.

  It is an in place algorithm, meaning it doesn't copy the array into some other type of structure and instead moves items around inside it. It is not a stable algorithm, meaning the order of "equal" items is not preserved. The average computational complexity is O(n log n), with a worst case of O(n^2). The space complexity is harder to determine. Most people say it is O(1) because it uses no additional data structures, but that is not really correct. Being a recursive algorithm, the call stack gets used quite a lot, an invisible storage that should be counted in the space complexity (typically O(log n) levels deep on average).

  Unfortunately, the worst case scenarios are also very common: already sorted arrays and arrays filled with the same value. There are various optimizations to be used in order to handle this sort of thing. Also, Quicksort is efficient with large quantities of data, but less so with small numbers of items.

  How

  Finally, we get to the code. The _quicksort function receives:

  • an array
  • left and right index values determining the inclusive area that will be sorted (usually 0 and array.length-1)
  • a comparer function of the form (item1, item2) => 1, 0 or -1, defaulting to _defaultComparer, which compares items using the > and < operators
  • min and max index values determining the window of the array that we need to have sorted

The left and right indexes determine which section (before the sort) of the array will be sorted, the min and max indexes determine which items I am interested in (after the sort). This allows me to skip ordering partitions that are outside my area of interest.

As I said, the pivot choice is important. Some strategies are very popular:

  • the last item in the array as the pivot
    • this is the strategy used in the original incarnation of Quicksort
    • leads to very poor performance when the array is already sorted
  • the median item
    • this requires scanning the array in order to find the value, implying extra computation
    • it only makes sense when the values in the array are numbers
  • the average between the first, the last and the middle item
    • it only makes sense when the values in the array are numbers
  • the item that is in the middle of the array
    • this is the one that I am using
  • a random item in the array
    • this makes the algorithm escape scenarios where the performance would be bad
    • the outcome of the sorting is unpredictable in terms of time used and stability of items
  • multiple pivots
    • an interesting concept, but one that complicated the algorithm too much for comfort

Then there is the matter of the partitioning. I've used an optimization that involves two indexes, one at the start, the other at the end of a partition, coming toward each other and swapping items that are on the wrong side of the pivot. In some implementations, if the pivot is the last item, the partitioning is done from one side only. In others, multiple indexes are used to handle multiple pivots.

In most implementations, the algorithm recurses on _quicksort, but I refactored it to only recurse on the partitioning. Then, because I didn't want to get stack overflows when bad data was used, I've eliminated the recursion and instead used a stack of my own, where the partitions waiting to be sorted are stored. This is where the space complexity comes in. In my case I was using a little more memory than I actually needed, because I was adding partitions to the stack while also incrementing the index of the current partition, meaning the stack array kept growing with already handled partitions.

Even if there is no computational performance benefit, I've optimized this as well by adding a queueIndex, which is used to recycle the slots in the partition array that are behind the partitionIndex. New partitions are added in those slots behind the partitionIndex and queueIndex is incremented. When the loop reaches the last partition in the stack, a new loop is started with the partitions from 0 to queueIndex. Thus, for a ten million item array the partition stack rarely grows above 40000 in length.
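
A minimal sketch of that recycling idea, with illustrative names (items is the array being sorted and sortAndSplit is a hypothetical stand-in for the partitioning step; the real implementation is in the Linqer sources):

    const partitions = [{ left: 0, right: items.length - 1 }];
    let partitionIndex = 0; // the next partition to process
    let queueIndex = 0;     // the next already-processed slot that can be reused
    while (partitionIndex < partitions.length) {
        const { left, right } = partitions[partitionIndex++];
        // sortAndSplit (hypothetical) partitions items[left..right] and
        // returns the sub-partitions that still need sorting
        for (const newPartition of sortAndSplit(items, left, right)) {
            if (queueIndex < partitionIndex) {
                partitions[queueIndex++] = newPartition; // recycle a consumed slot
            } else {
                partitions.push(newPartition);
            }
        }
        if (partitionIndex === partitions.length && queueIndex > 0) {
            // everything stored so far was consumed; restart over the recycled slots
            partitions.length = queueIndex;
            partitionIndex = 0;
            queueIndex = 0;
        }
    }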

  A further optimization is to use insertion sort on partitions that have become too small (under 64 items). It irks me to have had to do this; I would have liked to use a "pure" algorithm, but this improved the performance and minimized the size of the partition stack.

The Code

That's about it. Here is the code:

    function _insertionsort(arr, leftIndex, rightIndex, comparer) {
        for (let j = leftIndex; j <= rightIndex; j++) {
            const key = arr[j];
            let i = j - 1;
            while (i >= leftIndex && comparer(arr[i], key) > 0) {
                arr[i + 1] = arr[i];
                i--;
            }
            arr[i + 1] = key;
        }
    }
    function _swapArrayItems(array, leftIndex, rightIndex) {
        const temp = array[leftIndex];
        array[leftIndex] = array[rightIndex];
        array[rightIndex] = temp;
    }
    function _partition(items, left, right, comparer) {
        const pivot = items[(right + left) >> 1];
        while (left <= right) {
            while (comparer(items[left], pivot) < 0) {
                left++;
            }
            while (comparer(items[right], pivot) > 0) {
                right--;
            }
            if (left < right) {
                _swapArrayItems(items, left, right);
                left++;
                right--;
            }
            else {
                if (left === right)
                    return left + 1;
            }
        }
        return left;
    }
    const _insertionSortThreshold = 64;
    function _quicksort(items, 
                        left, right, comparer = _defaultComparer,
                        minIndex = 0, maxIndex = Number.MAX_SAFE_INTEGER) {
        if (!items.length)
            return items;
        const partitions = [];
        partitions.push({ left, right });
        while (partitions.length) {
            ({ left, right } = partitions.pop());
            if (right - left < _insertionSortThreshold) {
                _insertionsort(items, left, right, comparer);
                continue;
            }
            const index = _partition(items, left, right, comparer);
            if (left < index - 1 && index - 1 >= minIndex) {
                partitions.push({ left, right: index - 1 });
            }
            if (index < right && index < maxIndex) {
                partitions.push({ left: index, right });
            }
        }

        return items;
    }
    const _defaultComparer = (item1, item2) => {
        if (item1 > item2)
            return 1;
        if (item1 < item2)
            return -1;
        return 0;
    };
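
To illustrate the minIndex and maxIndex parameters, here is a call (my own example, not taken from the Linqer sources) that orders a million item array while only caring about the first ten positions:

    const bigArray = Array.from({ length: 1000000 }, () => Math.random() * 1000);
    // sort the whole range, but restrict the window of interest to indexes 0 to 10
    _quicksort(bigArray, 0, bigArray.length - 1, _defaultComparer, 0, 10);
    // the first ten positions now hold the ten smallest values, in order;
    // the rest of the array is only partially ordered
    console.log(bigArray.slice(0, 10));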

  I've accepted the old man should teach me as the only solution to becoming a champion, but it is hard to swallow it. He is very old, but mischievous, so whenever I try to learn something from him, he kicks me to the ground. He tricks me again and again and again. I am frustrated, but I am trying to keep my cool. I am strong. If I were to really fight him, he might be smart, but every attack would break bone and then what good would he be? Just a bag of meat and broken shards. I close my eyes, I breathe, I tell myself it is worth it.

  The old man apologizes and offers me a hand, I take it, only to be kicked in the ass and thrown into a jumble of debris. I lose my temper and stomp away. He doesn't understand. Getting angry at him is pointless, hurting him futile. I have nothing to learn from him. I walk through the old grounds of my conquests, now just the walled in and decrepit underground of the large arena above. I feel a presence behind me and I see the old man is following me, eyes to the ground. Contrition? Surely another of his tricks. "Begone!" I roar at him, but he goes to his knees and kowtows in front of me, his hands touching my feet. I feel tears swelling up in my eyes. He might as well be a little boy asking for forgiveness. Just who is the teacher and who is the student? Who is the adult here?

  "How did you get to a hundred years or whatever behaving like a little kid?! You are a child!" I shout at him in admonishment. I look around and ghosts of my past awaken my anguish. I feel my face contort into a painful grin as my tears flow freely. "Every week I was coming here to murder people!", I rage, my voice barely my own, a booming, low, animal growl, my expression that of an enraptured madman, for sure. "I would stake my life every time and I would leave, alive, every time!". The images of old fights flash before my wet blurred vision and I imagine that some of the painted white walls might contain some of the scrolls of the ancient arts, built over by a world that doesn't get it anymore. "I loved it!", I say, walking in the dead halls, every step a pulse of power overlaying glorious past over grey reality. My body is shaking with now uncontrollable weeping. "I killed so many people and I miss it... so.... very... MUCH!".

  Does he get it now, I ask myself? Has he even an inkling of the power he needs to teach me to control? I burst through the door to the surface and climb the stairs that get me to the arena above. The seats are packed with oblivious spectators, all watching some performance I don't even care to notice. I breathe in the fresh air and feel better. Ready to come to a final understanding with the old man, if he is capable of it, I turn around. There is little time and we should not fight each other. But the old man is gone.

   I strain my eyes into the darkness of the stairs and I feel it, The Beast, the adversary I need to fight is there. He's got the old man and, even if I cannot see it, I know it is there, all cunning, fury and power. My body roars by itself, a predator sound, strong and fearless, no sound a man should ever be able to make. The arena spectators panic in surprised horror, but I ignore them. I jump into the darkness with animal strength. I will fight this beast, I will meet it head on, I will be the most savage, alone I will remain alive.


  The Grace of Kings feels long from the very start. Ken Liu is starting off from a fictional empire of seven islands, but it might as well have been a historical book. Everything is mostly realistic, with very human characters that do what human characters do: harm and kill other people and find rationalizations for it. Some of them are heroic and occasionally think of other people, too.

  Halfway through the book (which is part of a trilogy, of course) I couldn't keep up with all the characters that kind of did the same thing, the long expositions of why people did stupid or horrible things to others and the various anecdotes that made some of the characters heroes or villains. And I call them anecdotes because that's what they feel like: short moments that disrupt rather than reinforce the long and unfortunately boring history of the realm.

  Bottom line, it feels like a Chinese Game of Thrones, with less interesting characters and no magic as of yet. It's not badly written, quite the contrary, but its subject is long-winded and doesn't interest me. I will therefore abandon reading it.

So I was trying to optimize a sorting algorithm, only the metrics did not make any sense. On one test page I had amazing performance, on the other it was slow as hell. What could possibly be the problem?

The difference between the two tests was that one was sorting inline (normal array reading and writing) and the other was using a more complex function and iterated a data source. So I went to test the performance of iterating itself.

The code tests the speed of adding all the items of a large array in three cases:

  • a classic for loop (labeled "for in" in the test below) that increments an index and reads the array at that index
  • a for...of loop that iterates the items of the array directly
  • a for...of loop that iterates over a generator function that yields the values of the array
time(()=>{ let sum=0; for (let i=0; i<arr.length; i++) sum+=arr[i]; },'for in');
time(()=>{ let sum=0; for (const v of arr) sum+=v; },'iterator for of');
time(()=>{ let sum=0; for (const v of (function*(){ for (let i=0; i<arr.length; i++) yield arr[i]; })()) sum+=v; },'generator for of');
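
The tests rely on a time helper that is not listed here; a minimal sketch of such a function, assuming the standard performance.now API, might look like this:

function time(action, label) {
    const start = performance.now();
    action();
    console.log(`${label}: ${performance.now() - start}`);
}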

time is a function I used to compute the speed of the execution. The array is 100 million integers. And here are the results:

for in: 155.12999997008592
iterator for of: 1105.7250000303611
generator for of: 2309.88499999512

I haven't yet processed what it means, but I really thought using an iterator was going to be at least as fast as a for loop that uses index accessing to read values. Instead there is a whopping 7 to 14 times decrease in speed.

So from now on I will avoid for...of in high performance scenarios.

  I guess I don't have to tell you about ad blockers and browser extensions that improve YouTube. They are a dime a dozen and bring many features to the habitual YouTube watcher. However there is one particular new YouTube annoyance that you don't really need an extension to get rid of: the dreaded Video paused dialog.

  To get rid of it is easy: on an interval, check if there is a visible element of a certain type containing a certain text and click its button. While this can be done in plain Javascript, I am lazy, so the script that I am using will first load jQuery, then run a one line function periodically. This code is to be copied and pasted in the Console tab of the browser's developer tools and run by pressing Enter.

const scr = document.createElement('script');
scr.setAttribute('src','https://code.jquery.com/jquery-3.4.1.min.js');
document.querySelector('head').appendChild(scr);
setInterval(()=> { 
  $('#button:contains(Yes)','yt-confirm-dialog-renderer:visible:contains(Video paused)').click();
},500);

It's easy to understand:

  • create a script element
  • set its source to jQuery
  • append it to the page
  • execute every 500 milliseconds a code that:
    • finds the element with id button containing the text "Yes"
    • inside an element of type yt-confirm-dialog-renderer which is visible and contains the text "Video paused"
    • clicks the element
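
If you would rather not load jQuery at all, here is a plain Javascript sketch of the same idea (the element type, the button id and the texts are assumptions carried over from the jQuery selectors above):

setInterval(() => {
  for (const dialog of document.querySelectorAll('yt-confirm-dialog-renderer')) {
    // a rough visibility check, then the text check
    if (dialog.offsetParent !== null && dialog.textContent.includes('Video paused')) {
      const button = [...dialog.querySelectorAll('#button')]
        .find(b => b.textContent.includes('Yes'));
      if (button) button.click();
    }
  }
}, 500);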

There is an even more comfortable solution, though, one that I recommend. You will need a Chrome extension called cjs that loads whatever script you tell it to in whatever page you want. It gives you the option to inject jQuery, so all you have to do is write

setInterval(()=> { $('#button:contains(Yes)','yt-confirm-dialog-renderer:visible:contains(Video paused)').click(); },500);

 as the script to be executed on YouTube.

That's it. You're all done.


  Wakenhyrst is very well written, but where it excels is the dissection of the hypocrisy of people. Michelle Paver is telling the story from the viewpoint of a young girl who must navigate the world and her own adolescence in the house of a father that has no love for her or for her mother, finds every reason to blame others for his shortcomings and deeds, and yet is untouchable because he is a man and the lord of the manor. What legions of screeching feminists could not do, Paver manages with her subdued, yet defiant description of how women are used and ignored and pretty much treated as glorified pets. It is impossible to not hate the father figure in the book, even as the main character is torn between wanting to forgive him and dealing with the creepy and sometimes evil shit he pulls. The ending is powerful, as the daughter finds the strength to sublimate her hate into an even more appropriate emotion: pity.

  But the story's power is not limited to the detailed analysis of the human psyche. It binds together Anglican folklore, medieval beliefs about devils and angels and art, whitewashed (in the actual sense of the term) by Puritans and systematically destroyed by Victorians, the power of untamed nature and the horror of human complacency. How refreshing to have a very young girl be the rational and intelligent agent that fends for herself in a world of mystical belief and societal poppycock, so that we can identify with her and see it as it was. How wonderful to have Paver describe it all without any trace of anachronism, as if she has lived in that world herself.

  The story starts slow and the pace almost never picks up, yet the tension and the level of details are constantly increasing, managing to somehow convey at the same time two distinct and contrary feelings: one of slow burn and the other of untamed power rising to a crescendo. It brilliantly mingles the oppressive hot wet feel of subconscious fear and superstition with cold analytical reason as its adversary. In the beginning I wanted to rate it above average only, but now, the more I think about it the more I admire the writing and the way the book tells the story. Good job, Michelle Paver!

  Bottom line: move past the slower start, it is certainly worth reading. A gothic tale of subliminal supernatural horror and a very human and real one at the same time.


  Salvation Lost is the second book in the Salvation Sequence trilogy from Peter F. Hamilton. I was commenting on the previous book saying that it is mostly filler and action that is irrelevant to the larger story. This book is a lot more action packed and a bit more interesting, but ultimately just as pointless. A lot of characters that will only become relevant in the third book, if at all, a lot of stories that happen in the past while we already know what is going on 10,000 years into the future, and no closure on anything. The horror of the alien invasion is powerful, but not as much as it could have been. The story invests so much in some people only to kill them later with no apparent effect on the timeline of events.

  Bottom line: I will probably read the last book, scheduled sometime in Sep 2020, but I feel this series is one of Hamilton's weakest.

Just a thing I learned today: using the bitwise not operator (~) on a number in Javascript ignores its fractional part (it first converts the number to a 32-bit integer), therefore using it twice gives you the integer part of the original number. Thanks to fetishlace for clarifications.

Notes:

  • this is equivalent to (int)number in languages that support the int type
  • this is equivalent to Math.trunc for numbers in the 32-bit integer range
  • this is equivalent to Math.floor only for positive numbers in the 32-bit integer range

Examples:
~~1.3 = 1
~~-6.5432 = -6
~~(2 ** 32 + 0.5) = 0
~~10000000000 = 1410065408
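
To illustrate the notes above, here is ~~ next to Math.trunc and Math.floor (my own examples):
~~6.7 = 6, Math.trunc(6.7) = 6, Math.floor(6.7) = 6
~~-6.7 = -6, Math.trunc(-6.7) = -6, Math.floor(-6.7) = -7
~~(2 ** 31) = -2147483648, while Math.trunc(2 ** 31) = 2147483648, so outside the 32-bit range the equivalence breaks down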

P.S. If you don't know what ** is: it's called the exponentiation operator, it arrived in Javascript with ES2016 (ES7) and it is the equivalent of Math.pow. 2 ** 3 = Math.pow(2,3) = 8


  I consider Peter F. Hamilton to be one of the great science fiction writers. Yes, he has a formula, yes he messes up the endings, but the ideas and worlds that he puts on paper have rarely disappointed me. I can't say Salvation disappointed me either, but I didn't especially like it, as it gives away too much too soon, then proceeds to bore or enrage the reader with police procedural vignettes that we already know will have no impact on future events.

  Hamilton has this method of combining at least two threads, usually one is hard science fiction and the other is fantasy or police procedural or something different, only to bind them together at some point in time. In this book, we see a group of people running away from an enemy species bent on exterminating humanity and also a peaceful future in which humanity has discovered how to create instantaneous transport portals to other places and was contacted by two different alien species. Somehow, how we get from one point to the other is the topic of the book, but the vast majority of it is about corporate security people that abuse their power to "get things done" or toxic cleanup people or other kinds of short stories that only bring some new information to light while eating up reading time. Then the book ends!

  And it's a bit annoying that even the technical aspects don't add up, like how you can know how to make portals, but still rely on nuclear weapons or rockets to do war, or that people fear sabotage and terrorism, but don't see the possible threat posed by personal portals to other worlds. Gravity alterations, atmosphere loss or pollution, a portal from a star to an inhabited area would have been much more dangerous. Also, some of the ways characters act are completely unnatural and it feels jarring to see them do things in a sequence (heh!) only to further the story, not consistent with their character and expertise.

  Anyway, I am reading Salvation Lost now, but in my view the first book in this series could have been a lot more, especially from a brilliant writer such as Hamilton.


  Sometimes a good regular expression can save a lot of time and lead to a robust, yet flexible system that works very efficiently in terms of performance. It may feel like having superpowers. I don't remember exactly when I decided they were useful, but it happened during my PHP period, when the Linux command line system and its multitude of tools made using regular expressions a pretty obvious decision. Fast forward (waaaay forward) and now I am a .NET developer, spoiled by the likes of Visual Studio and the next-next approach to solving everything, yet I still use regular expressions, well... regularly, sometimes even when I shouldn't. The syntax is concise, flexible and easy to use most of the time.

  And yet I see many senior developers avoiding regular expressions like the plague. Why is that? In this post I will argue that the syntax makes a regular expression pattern everything developers hate: unreadable and hard to maintain. Having many of the characters common in both XML and JSON carry special meaning doesn't help either. And even when you are bent on learning them, having multiple flavors depending on the operating system, language and even the particular tool you use makes it difficult. However, using small incremental steps to get the job done and occasionally looking up the less used features is usually super easy, barely an inconvenience. As for using them in your code, new language features and a few tricks can solve most problems one has with regular expressions.

 The Syntax Issue

The most common scenario for the use of regular expressions is wanting to search, search and replace or validate strings and you almost always start with a bunch of strings that you want to match. For example, you want to match both dates in the format 2020-01-21 and 21/01/2020. Immediately there are problems:

  • do you need to escape the slashes?
  • if you match the digits, are you going for two digit month and day segments or do you also accept something like 21/1/2020?
  • is there the possibility of having strings in your input that look like 321/01/20201, in which case you will still match 21/01/2020, but it's clearly not a date?
  • do you need to validate stuff like months being between 1-12 and days between 1-31? Or worse, validate stuff like 30 February?

But all of these questions, as valid as they are, can be boiled down to one: given my input, is my regular expression matching what I want it to match? And with that mindset, all you have to do is get a representative input and test your regular expression in a tester tool. There are many out there, free, online, all you have to do is open a web site and you are done. My favourite is RegexStorm, because it tests .NET style regex, but a simple Google search will find many and varied tools for reading and writing and testing regular expressions.
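
For example, a first draft pattern for the two date formats above, which ignores for now the questions about valid months and days, could be something like \b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b, which you would then refine against a representative input.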

The syntax does present several problems that you will hit every time:

  • you cannot reuse parts of the pattern
    • in the example above, even if you have clearly three items that you look for - year, month, day - you will need to copy/paste the pattern for each variation you are looking for
  • checking the same part of the input string multiple times is not what regular expressions were created for, and even the flavors that support various methods to do that do it in a hackish way
    • example: find any date in several formats, but not the ones that are in March or in 2017
    • look behind and look ahead expressions are usually used for scenarios like this, but they are not easy to use and reduce a lot of the efficiency of the algorithm
  • classic regular expression syntax doesn't support named groups, meaning you often need to find and maintain the correct index for the match
    • what index does one of the capturing groups have?
    • if you change something, how do other indexes change?
    • how do you count groups inside groups? Do they even count if they match nothing?
  • the "flags" indicating how the regular engine should interpret the pattern are different in every implementation
    • /x/g will look for all x characters in the string in Javascript
    • C# doesn't even need a global flag and the flags themselves, like CaseInsensitive, are .NET enums in code, not part of the regular pattern string
  • from the same class of issues as the two above (inconsistent implementation), many a tool uses a regular expression syntax that is particular to it
    • a glaring example is Visual Studio, which does not use a fully compatible .NET syntax

  The Escape Issue

The starting point for writing any regular expression has to be a regex escaped sample string that you want to match. Escaping means telling the regular expression engine that characters from your sample string that have meaning to it are just simple characters. For example, 12.23.34.56, which is an IP address, used exactly like that as a regular pattern will also match 12a23b34c56, because the dot is a catch-all special character for regular expressions. The pattern working as expected would be 12\.23\.34\.56. Escaping brings several severe problems:

  • it makes the pattern less humanly readable
    • think of a phrase in which all white space has been replaced with \s+ to make it more robust (it\s+makes\s+the\s+pattern\s+less\s+humanly\s+readable)
  • escaping is always done with a backslash in every flavor of regular expressions, but the characters that are considered special differ depending on the specific implementation
  • many characters found in very widely used data formats like XML and JSON are special characters in regular expressions, while the regex escape character (the backslash) is itself special in JSON and in the string literals of many popular programming languages, forcing you to "double escape", which magnifies the problem
    • this is often an annoyance when you try to store regular patterns in config files
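
If you do not want to escape sample strings by hand, most regex libraries provide a helper (Regex.Escape in .NET, for instance); in Javascript a small sketch of such a helper could be:

    // escape every regex special character in a sample string (a sketch)
    const regexEscape = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    regexEscape('12.23.34.56'); // "12\.23\.34\.56"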

  Readability and Maintainability

Perhaps the biggest issue with regular expressions is that they are hard to maintain. Have you ever tried to understand what a complex regular expression does? The author of the code started with some input data and clear objectives of what they wanted to match, went to regular expression testers, found what worked, then dumped a big string in the code or configuration file. Now you have neither the input data nor the things that they wanted matched. You have to decompile the regular pattern in your head and try to divine what it was trying to do. Even when you manage to do that, how often do developers redo the testing step to verify that changes in a regular expression do what was intended?

Combine this with the escape issue and the duplication of subpatterns issue and you get every developer's nightmare: a piece of code they can't understand and they are afraid to touch, one that is clearly breaking every tenet of their religion, like Don't Repeat Yourself or Keep It Simple Silly, but they can't change. It's like an itch they can't scratch. The usual solution for code like that is to unit test it, but regular expression unit tests are really really ugly:

  • they contain a lot of text variables, on many lines
  • they seem to test the functionality of the regular expression engine, not that of the regular expression pattern itself
  • usually regular expressions are used internally, they are not exposed outside a class, making it difficult to test by themselves

  Risk

Last, but not least, regular expressions can work poorly in some specific situations and people don't want to learn the very abstract computer science concepts behind regular expression engines in order to determine how to solve them.

  • typical example is lazy modifiers (.*? instead of .*) which tell the engine to not be greedy (get the least, not the most)
    • ex: for input "ineffective" the regular expression .*n will work a lot worse than .*?n, because it will first match the entire word, then see it doesn't end with n, then backtrack until it gets to "in" which it finally matches. The other syntax just stops immediately as it finds the n.
  • another typical example is people trying to find the HTML tag that has a certain attribute and they do something like \<.*href=\"something\".*\/\> and what it matches is the entire HTML document up to a href attribute and the end of the last atomic tag in the document.
  • the golden hammer strikes again
    • people start with a simple regular expression in the proof of concept, they put it in a config file, then for the real life application they continuously tweak the configured pattern instead of trying to redesign the code, until they get to some horrible monstrosity
    • a regex in an online library that uses look aheads and look behinds solves the immediate problem you have, so you copy paste it in the code and forget about it. Then the production app has serious performance problems. 

  Solutions

  There are two major contexts in which to look for solutions. One is the search/replace situation in a tool, like a text editor. In this case you cannot play with code. The most you can hope for is that you will find a tester online that supports the exact syntax for regular expressions of the tool you are in. A social solution would be to throw shade on lazy developers that think only certain bits of regular expressions should be supported and implemented and then only in one particular flavor that they liked when they were children.

  The second provides more flexibility: you are writing the code and you want to use the power of regular expressions without sacrificing readability, testability and maintainability. Here are some possible solutions:

  • start with simple stuff and learn as you go
    • the overwhelming majority of the time you need only the very basic features of regular expressions, the rest you can look up when you need them
    • if the regular expression becomes too complex it is an indication that maybe it's not the best approach
  • store subpatterns in constants that you can then reuse using templated strings
    • ex: var yearPattern = @"(?<year>\d{4})"; var datePattern = $@"\b(?:{yearPattern}-(?<month>\d{{2}})-(?<day>\d{{2}})|(?<month>\d{{2}})\/(?<day>\d{{2}})\/{yearPattern})\b"; (note the doubled curly braces, required inside a C# interpolated string)
    • the example above only stores the year in another variable, but you can store the two different formats, the day, the month, etc
    • in the end your code might look more like Backus-Naur form (BNF), in which every separate component is described separately
  • use verbatim strings to make the patterns more readable by avoiding double escaping
    • in C# use @"\d+" not "\\d+"
    • in Javascript they are called template literals and use backticks instead of quotes, but they have the same mechanism for escaping characters as normal strings, so by themselves they are not a solution (the String.raw tag, though, does leave backslashes as they are)
  • use simple regular expressions or, if not, abstract their use
    • a good solution is using a fluent interface (check this discussion out) that allows the expressivity of human language for something that ends up being a regular expression
    • no, I am not advocating creating your own regular expression syntax... I just think someone probably already did it and you just have to find the right one for you :)
  • look for people who have solved the same problem with regular expressions so you don't have to reinvent the wheel
  • always test your regular expressions on valid data and pay attention to the time it took to get the matches
  • double check any use of the string methods like IndexOf, StartsWith, Contains, even Substring. Could you use regular expressions?
    • note that you cannot really chain these methods. Replacing a regular expression like ^http[s]?:// with methods always involves several blocks of code and the introduction of cumbersome constants:
      if (text.StartsWith("http")) {
        // the magic number 4 works now, but when you change the string above, it stops working
        // if you save "http" to a constant and use its .Length, that is yet another line of code
        text=text.Substring(4);
      } else {
        return false;
      }
      if (text.StartsWith("s")) {
        text=text.Substring(1);
      }
      return text.StartsWith("://");
      
      // this can be easily refactored to
      return text
               .IfStartsWith("http")
               .IfStartsWithOrIgnore("s")
               .IfStartsWith("://");
      // but you have to write your own helper methods
      // and you can't save the result in a config file
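
For comparison, the regular expression version of the same check is a one-liner (a C# sketch, assuming using System.Text.RegularExpressions is in place), and the pattern itself could live in a config file:

    // the regex equivalent of the StartsWith/Substring chain above
    return Regex.IsMatch(text, @"^https?://");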

  Conclusion

Regular expressions look daunting. Anyone not familiar with the subject will get scared by trying to read a regular expression. Yet most regular expression patterns in use are very simple. No one actually knows by heart the entire feature set of regular expressions: they don't need to. Language features and online tools can help tremendously to make regex readable, maintainable and even testable. Regular expressions shine for partial input validation and efficient string search and replace operations and can be easily stored in configuration files or data stores. When regular expressions become too complex and hard to write, it might be a sign you need to redesign your feature. Often you do not need to rewrite a regular expression, as many libraries with patterns to solve most common problems already exist.


  This is How You Lose the Time War came highly recommended by a Goodreads buddy of mine as the way to write sci-fi. I don't know what to say about that. He was so ebullient about how great the book was that there was bound to be some disappointment. It is nicely written and touches, under the guise of science fiction, the intricacies of human relationships and feelings. But other than that it was just one idea, stretched over 200 pages, in something that was both short and felt unreasonably long. Perhaps some time issues need arise when time travel is involved.

  What I found interesting is that it is a double author book. Amal El-Mohtar and Max Gladstone worked on it together, although I suspect it was mostly El-Mohtar with the story and Gladstone with the tech stuff, though I could be just stereotyping. To me, it felt like the story has a distinctly female vibe.

  Bottom line: it's a very humanist type of story: the time war, the tech, they are all props. One could have written the same kind of stuff about African tribes or corporate lawyers. After a while everything started to feel repetitive and stale only to reach the all too predictable ending. Nice book, but not great.

  I was looking into the concept of partial sorting, something that would help in a scenario where you want the k smallest or biggest items from an array of n items and k is significantly smaller than n. Since I am tinkering with LInQer, my implementation of LINQ methods in Javascript, I wanted to tackle the OrderBy(...).Take(k) situation. Anyway, doing that I found out some interesting things.

  First, the default Javascript array .sort function has different implementations depending on the browser, and by that I mean different sort algorithms. Chrome uses insertion sort and Firefox uses merge sort. Neither of them is Quicksort, which would function best when the number of items is large, but which has a worst case complexity of O(n^2) when the array is already sorted (or filled with the same value).

I've implemented a custom function to do Quicksort and after about 30000 items it becomes faster than the default one.  For a million items the default sort was almost three times slower than the Quicksort implementation. To be fair, I only tested this on Chrome. I have suspicions that the merge sort implementation might be better.

  I was reporting in a previous version of this post that QuickSort, implemented in Javascript, was faster than the default .sort function, but it seems this was only an artifact of the unit tests I was using. Afterwards, I've found an optimized version of quicksort and it performed three times faster on ten million integers. I therefore conclude that the default sort implementation leaves a lot to be desired.

  Second, the .sort function compares items as strings by default. Try it: [1,2,10].sort() will return [1,10,2]. In order to do it numerically you need to hack away at it with [1,2,10].sort((i1,i2)=>i1-i2). In order to sort the array based on the type of item, you need to do something like: [1,2,10].sort((i1,i2)=>i1>i2?1:(i1<i2?-1:0)).

  Javascript has two useful functions defined on the prototype of Object: toString and valueOf. They can also be overwritten, resulting in interesting hacks. For some reason, the default .sort function is calling toString on the objects in the array before sorting and not valueOf. Using the custom comparer functions above we force using valueOf. I can't test this on primitive types though (number, string, boolean) because for them valueOf cannot be overridden.
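
A quick way to see this is an object that reports different values from valueOf and toString (a sketch of my own, to be run in a browser console):

    const item = v => ({ valueOf: () => v, toString: () => String(100 - v) });
    const data = [item(1), item(2), item(10)];
    // the default sort compares the toString results ("99", "98", "90")
    data.sort();
    console.log(data.map(o => +o)); // [10, 2, 1]
    // a comparer built on < and > goes through valueOf instead
    data.sort((a, b) => a > b ? 1 : (a < b ? -1 : 0));
    console.log(data.map(o => +o)); // [1, 2, 10]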

  And coming back to the partial sorting, you can't do that with the default implementation, but you can with Quicksort. Simply do not sort any partition that falls outside the range of indexes you need to get the items from. The time savings are amazing!

  There is a difference between the browser implemented sort algorithms and QuickSort: they are stable. QuickSort does not guarantee the order of items with equal sort keys. Here is an example: [1,3,2,4,5,0].sort((i1,i2)=>i1%2-i2%2) results in [2,4,0,1,3,5] (first the even numbers, then the odd ones, each in the same order as in the original array). The QuickSort order depends on the choice of the pivot in the partition function, but assuming a clean down-the-middle index, the result is [4,2,0,5,1,3]. Note that in both cases the requirement has been met and the even numbers come first.


  A Short History of Nearly Everything is very well researched, subtly written and does pretty much what it says: explain the history of science up to the present as humanity is trying to figure out where it came from, how long ago it happened and how things actually work. It's a dense work, which makes it a long read. You either go through it and not retain much or you have to read bit by bit and think on each for a while. I read it bit by bit and retained little.

  Anyway, as I was reading The Invention of Nature, another great popular science book, I stumbled upon this quote: "There are three stages of scientific discovery: first people deny it is true; then they deny it is important; finally they credit the wrong person." and never have I thought about it so much as when reading A Short History of Nearly Everything. In fact, the same quote appears in the book near the end. As Bill Bryson describes it, most of science is accidental and has to fight a plethora of egos that believe they are better than you just in order to surface. Many times the work is lost, misattributed, stolen or sabotaged into oblivion by personal opponents. As such, the book is a wonderful breath of fresh air compared to the tired history of science that we are so often presented, where very smart people think of something and then everybody applauds and accepts yet another idea that furthers human knowledge. The book is also about how little we know about many things that are usually presented as completely clear, fully researched and completely understood. All in all, it's a book that needs reading.

  "So", you will say, "isn't this another Sapiens?". No. I liked Sapiens and its funny and accessible style made it an instant hit worldwide. A Short History of Nearly Everything is way better. It focuses more on sciences like geology and anthropology and abstract physics and on the personal histories of the people involved in the discoveries rather than on humanity as a whole, so it's a bit harder. When it does look at humanity it sees it as small, petty and destructive. Sapiens makes you feel good, this makes you feel ashamed and happy you are still alive.

  I have to say that I almost abandoned reading it; it is that dense and full of information. While I would normally read a novel in three days, spending weeks trudging through knowledge made me feel both stupid and smarter at the same time. Surely I could find a better way to entertain myself, I thought. This book is entertaining, but it requires focus to read and understand. In the end, I am very glad I've read it.

  A little gem I stumbled upon today that I didn't know about: https://placeholder.com/. Using it is really simple, like https://via.placeholder.com/300x150?text=Placeholder, but there is also a shorter URL that looks like this: https://placehold.it/1280x800/8020A0/FFFFFF?text=Siderite's blog.

  The simplest syntax is https://placehold.it/300 (a 300x300 image that displays "Placeholder" and "Powered by HTML.COM").

So the placeholder.com service has disappeared, but I found another that seems to do the same thing. It's called placehold.co.

Here is an example of one that specifies size, color, image type, text and font: https://placehold.co/640x400/lightblue/darkviolet/jpg?text=Siderite's Blog&font=Playfair Display


  I see a lot of pages about how to write blog posts. I read them, because I am both curious and sincere in my desire to make my blog popular and spread the knowledge I've amassed here. They are always crap. Take one that says the best tool to get a blog popular is to use Google Trends or Google autocomplete to see what people are searching for. And the results are always incredibly stupid. Like "how to add one to one to get two". I am paraphrasing a bit here, but you get the gist. Go "worldwide" and the first trend is always some Chinese spam. Another post is saying that a blog post should be written as four drafts: one for what you want to say, one for how you want to say it, one for peer reviewed content and the final one that actually is what you want to publish. It sounds great, but it implies a level of work that sometimes is prohibitive related to the subject of your post. Sometimes you just want to share something as a stream of consciousness and be done with it. Is that better? No. But it sure beats NOT writing anything. There is always time to improve your work and get peer review AFTER publishing it.

  There are two big types of people blogging. The first category is akin to reporters and media people. They want to get their message across for reasons that are rather independent of the message itself. They want to earn money or influence or some other kind of benefit. I don't have any advice for people like that. Not because I disregard their goals, but because I have never blogged for an ulterior reason. The second category of bloggers is akin to writers: they want to get their message across because they feel there is some value in the message itself. I consider myself such a person, although I probably suck as a writer. This post is for people like that.

  The most important part of writing a post is motivation. And I don't mean just the reason for writing it, but the reason for wanting to share it. For me, most of the posts I write are either content that I consume, such as books, or stuff that I think is worth considering or technical stuff that I've stumbled upon and I believe people would want to find if googling for it instead of wasting the time I wasted to solve it. Now, the books and the personal idea posts I totally agree are ego boosting shit: I feel like it's important enough to "share", but I don't really expect people to read it or that there is any inherent value in them other than getting to know me better. And everyone wants to understand other people better on the Internet, right? In the end they are just a personal log of random thoughts I have. My blog is something that represents me and so I feel that I need to share things that are personal to me, including thoughts that are not politically correct or even correct in any possible way. One can use Facebook for this, so I won't write about those posts. They still reach some people and inform their choices, which is something I love.

  What is left is the posts that I work for. You have no idea how much I work on some of these posts. It may take me hours or even days for content that you read in a few minutes. That is because I am testing my ideas in code and creating experiments to validate my beliefs and doing research on how other people did it. A lot of the times I learn a lot from writing these posts. I start with the expectation that I know what I am talking about only to find out that I was wrong. The important part is that I do correct myself and some of the blog posts here are exclusively about discovering how wrong I was. There is nothing more rewarding than writing something that you feel others might benefit from. Perhaps other than getting feedback about how your post benefited others. Publishing your failures is just as important as publishing your successes.

  Yes, I know, if I learn something new by doing what I need to be doing, then sharing the results is like writing for myself, too. It's ego boosting, for sure. However, it would be even more obnoxious to believe no one is like me and so no one would benefit from the same work. There was a time when people came to my blog and asked me about all kinds of technical problems and I worked with them to solve them. They were usually incredibly simple problems that posed difficulties only to the laziest people, but it felt good! Then StackOverflow came along and no one actually interacts with me. But I have solved stupid problems that I still keep getting thanks for, even (or maybe especially) when the technology is really old and obsolete. Many other blogs published cool things about subjects that are not fashionable anymore and then just disappeared. The value of your content is that it may help people in your situation, even if they don't share your sense of now and even if all they take away is how NOT to do things.

  Sometimes you are looking for the solution for a problem and after hours of work you realize the problem was invalid or the solution was deceptively simple. It's the "Oh, I am so stupid!" moment that makes a lot of people shy away from writing about it. I find that these moments are especially important, because other people will surely make the same mistake and be hungry about finding the answer. OK, you admit to the world you were stupid, but you also help so many other people that would waste time and effort and feel as stupid as you if not for writing the post.

  My take on writing a blog post is that you just have to care about what you are writing. You may not be the best writer out there, you might not even be explaining the thing right, but if you care about what you are writing, then you will make the effort of getting it right eventually. Write what you think and, if you are lucky, people will give you reasons to doubt your results or improve them. Most likely people will NOT care as much about the subject as you, but you are not writing because of them, you are writing for them. Some of your thoughts and toils will reach and help someone and that is what blogging is all about.

  The last thing I want to mention is maintenance. Your work is valid when you write it, but may become obsolete later on. You need to make the effort to update the content, not by removing the posts or deleting their content, but by making clear things have changed, how they did and what can be done about it. It is amazing how many recent posts are reached only because I mentioned them in an "obsolete" post. People search for obsolete content, find out it's too old, then follow the link to your latest solution for that problem. It makes for good reading and even better understanding of how things got to the current point.

  So, bottom line: publish only what you care about and care about your readers, keep the posts up to date, publish both successes and failures.