For a more in depth exploration of the concept, read Towards generic high performance sorting algorithms

Sorting

  Consider QuickSort, an algorithm that uses a divide and conquer strategy to sort efficiently and the favourite in computer implementations.

  It consists of three steps, applied recursively:

  1. find a pivot value
  2. reorder the input array so that all values smaller than the pivot are followed by values larger or equal to it (this is called Partitioning)
  3. apply the algorithm to each part of the array, before and after the pivot

  QuickSort is considered generic, meaning it can sort any type of item, assuming the user provides a comparison function between any two items. A comparison function has the same specific format: compare(item1,item2) returning -1, 0 or 1 depending on whether item1 is smaller, equal or larger than item2, respectively. This formalization of the function lends more credence to the idea that QuickSort is a generic sorting algorithm.

  Multiple optimizations have been proposed for this algorithm, including using insertion sort for small enough array segments, different ways of choosing the pivot, etc., yet the biggest problem was always the optimal way in which to partition the data. The original algorithm chose the pivot as the last value in the input array and the average complexity was O(n log n), but worse case scenario was O(n^2), when the array was already sorted and the pivot was the largest value. Without extra information you can never find the optimal partitioning schema (which would be to choose the median value of all items in the array segment you are sorting).

  But what if we turn QuickSort on its head? Instead of providing a formalized comparison function and fumbling to get the best partition, why not provide a partitioning function (from which a comparison function is trivial to obtain)? This would allow us to use the so called distribution based sorting algorithms (as opposed to comparison based ones) like Radix, BurstSort, etc, which have a complexity of O(n) in a generic way!

  My proposal for a formal signature of a partitioning function is partitionKey(item,level) returning a byte (0-255) and the sorting algorithm would receive this function and a maximum level value as parameters.

  Let's see a trivial example: an array of values [23,1,31,0,5,26,15] using a partition function that would return digits of the numbers. You would use it like sort(arr,partFunc,2) because the values are two digits numbers. Let's explore a naive Radix sorting:

  • assign 256 buckets for each possible value of the partition function result and start at the maximum (least significant) level
  • put each item in its bucket for the current level
  • concatenate the buckets
  • decrease the level and repeat the process

Concretely:

  • level 1: 23 -> bucket 3, 1 -> 1, 31 -> 1, 0 -> 0, 5 -> 5, 26 -> 6, 15 -> 5 results in [0,1,31,5,15,6]
  • level 0: 0 -> 0, 1 -> 0, 31 -> 3, 5 -> 0, 15 -> 1, 6 -> 0 results in [0,1,5,6,15,31]

Array sorted. Complexity is O(n * k) where k is 2 in this case and depends on the type of values we have, not on the number of items to be sorted!

  More complex distribution sorting algorithms, like BurstSort, optimize their function by using a normal QuickSort in small enough buckets. But QuickSort still requires an item comparison function. Well, it is easy to infer: if partFunc(item1,0) is smaller or larger than partFunc(item2,0) then item1 is smaller or larger than item2. If the partition function values are equal, then increase the level and compare partFunc(item1,1) to partFunc(item2,1).

  In short, any distribution sorting algorithm can be used in a generic way provided the user gives it a partitioning function with a formalized signature and a maximum level for its application.

  Let's see some example partitioning functions for various data types:

  • integers from 0 to N - maximum level is log256(N) and the partition function will return the bytes in the integer from the most significant to the least
    • ex: 65534 (0xFFFE) would return 255 (0xFF) for level 0 and 254 (0xFE) for level 1. 26 would return 0 and 26 for the same levels.
  • integers from -N to N - similarly, one could return 0 or 1 for level 0 if the number is negative or positive or return the bytes of the equivalent positive numbers from 0 to 2N 
  • strings that have a maximum length of N - maximum level would be N and the partition function would return the value of the character at the same position as the level
    • ex: 'ABC' would return 65, 66 and 67 for levels 0,1,2.
  • decimal or floating point or real values - more math intensive functions can be found, but a naive one would be to use a string partitioning function on the values turned to text with a fixed number of digits before and after the decimal separator.
  • date and time - easy to turn these into integers, but also one could just return year, month, day, hour, minute, second, etc based on the level
  • tuples of any of the types above - return the partition values for the first item, then the second and so on and add their maximum levels

  One does not have to invent these functions, they would be provided to the user based on standard types in code factories. Yet even these code factories will be able to encode more information about the data to be sorted than mere comparison functions. Stuff like the minimum and maximum value can be computed by going through all the values in the array to be sorted, but why do it if the user already has this information, for example.

  Assuming one cannot find a fixed length to the values to be sorted on, like real values or strings of any length, consider this type of sorting as a first step to order the array as much as possible, then using something like insertion or bubble sort on the result.

Finding a value or computing distinct values

  As an additional positive side effect, there are other processes on lists of items that are considered generic because they use a formalized form function as a parameter. Often found cases include finding the index of an item in a list equal to a given value (thus determining if the value exists in a list) and getting the distinct values from an array. They use an equality function as a parameter which is formalized as returning true or false. Of course, a comparison function could be used, depending on if its result is 0 or not, but a partitioning function can also be used to determine equality, if all of the bytes returned on all of the levels are equal.

  But there is more. The format of the partition function can be used to create a hash set of the values, thus reducing the complexity of the search for a value from O(n) to O(log n) and that of getting distinct values from O(n^2) to O(n log n)!

  In short, all operations on lists of items can be brought together and optimized by using the same format for the function that makes them "generic": that of a partitioning function.

Conclusion

  As you can see, I am rather proud of the concepts I've explained here. Preliminary tests in Javascript show a 20 fold improvement in performance for ten million items when using RadixSort over the default sort. I would really love feedback from someone who researches algorithms and can even test these assumptions under benchmark settings. Them being complex as they are, I will probably write multiple posts on the subject, trying to split it (partition it?) into easily digestible bits

 The concept of using a generic partitioning function format for operations on collections is a theoretical one at the moment. I would love to collaborate with people to get this to production level code, perhaps taking into account advanced concepts like minimizing cache misses and parallelism, not only the theoretical complexity.

 More info and details at Towards generic high performance sorting algorithms

Comments

Be the first to post a comment

Post a comment