I remember the first time it happened. We were just beginning to watch your struggle, the rise of such wonderful life, then the strike. It had happened before, but we weren't really paying attention. This time it was incredible drama. Galaxies cried out in anguish and desperation, but it wasn't the end. No, you survived, rebuilt, went on.
It's funny, the first time apes started acting smart there was a lot of disappointment. Ugly, cumbersome, a bit mad, everybody thought. But we continued to watch, mesmerized, as you got through so much that almost destroyed you. You changed, you evolved.
Then you bloomed. Started talking, singing, thinking and painting. Further on you started writing. What a wonderful flower you were. Worlds would die and form on the sounds of your wails or yells of joy. We loved it all, the wars, the art, the science, the suffering. We gulped it up, knowing that it was not going to last. They never do, we thought.
We cheered for you when you were dying in the great pandemics, we cried for you when you killed each other, and you had our full attention when you almost destroyed everything with nuclear devices. Some of us rooted for that ending; some of us love violent, brusque ends. But it didn't happen.
Oh how you danced, how you sang, the beautiful things you thought and put in writing, in films, did with your computers. We knew you were close to the end and we knew that it was all good stuff, because we didn't want it to end, but they all end, don't they?
We marvelled at your ships, at your resilience in hoping to contact others... us... and we cried. The space battles were magnificent. You took inspiration from your history and created fiction, then you used fiction to create your future. You bypassed your present altogether. A whole universe laughed and cheered. Those were good times.
Your enormous space habitats, floating around your sun, always changing, always growing, they gave us hope. Hope that some of you might make it, leave your system. Sometimes it happens. We love splinters, that we do, but it didn't happen. You stayed put.
And then your flower wilted away. You had altered yourselves, you had become faster, smarter; you had already merged with the machines you had built and defeated most of your biological problems. You were invincible and beautiful, immortal. But then you found it, as they all do: the universal link, the thing that finally allowed you to fulfil your dreams. You found us.
And now you listen to our songs, watch our histories, run our software, use our technology. But we remember you, how beautiful you were when you were young. We all loved you, humans. Now that you are old, like us, you have control. You are us. Join in watching this wonderful new life that emerges from the chaos. It's a great show. And maybe some will make it, some will manage to be different, somehow, sometime.
As always, this post reflects my personal opinion. I know that The Listeners is a classic book, one that has been cited by SETI as a major factor in the project becoming known and supported by others. I know that at the time, writing a reasonable sci-fi book was a feat. I know that the writer was a believer in contact with aliens, in human nature and so on, and thus he must have been a nice guy, with desires similar to mine and those of other space-minded people. However, the book annoyed me no end.
The first and biggest of all the problems is the writer's insistence on peppering the book with all kinds of quotes from various works, many of them in a foreign language - that is, other than English. It was the reason why publishers originally refused his manuscript. Now, even if I understand the language, I don't know the quote. There is an appendix at the end of the book that translates everything, but really, when a character randomly interrupts a perfectly good conversation to spout something unintelligible in another language, that guy is an asshole!
Then there was the construction of the book, with the Project presented as something that held sway over the human heart. All you had to do to convince anyone of anything was turn on the speakers so that they could hear the static, while the main character did the PR work, knowing exactly what to say to manipulate the other person. I would not have a problem with that if the manipulation weren't completely obvious and, most of the time, completely ridiculous. It felt like a Naruto episode where the other ninja, filled with power, suddenly decides to switch sides because Naruto is such a nice guy. I know I don't inspire confidence when I compare a classic sci-fi book with a Japanese manga, but for me it was the same quality of work: entertaining, perhaps, but not great.
All the people and events change conveniently to support the plot. It feels fake, and it is a lousy writing technique, better suited to pulp. I did not enjoy that.
As for the plot itself, it is about this Project, which is pretty much SETI, that suddenly receives an alien signal piggybacked on ninety-year-old radio transmissions. What the people do and say is so underwhelming that I felt I was wasting my time while reading the book; that is why it took me so long to finish it. My conclusion: the book may be a classic of the science fiction genre, but I did not enjoy it or empathise with its characters. The plot is difficult to swallow and the story is very dated. I would not recommend it.
It just happens that I have two different projects that need cluster analysis, applied in two different ways: one involves maps, where a large number of items needs to be displayed quickly, while the other involves finding clusters of news items, where the distance between items is determined by their content. The most used clustering algorithm, and the first you find when searching the web, is k-means clustering. Its purpose is to partition a list of items into k clusters, hence the name. Setting aside the algorithm's usefulness for my purposes, the biggest problem I see is complexity: in its general form it is at least O(n²), and most of the time a lot higher. The net abounds with scientific papers investigating the complexity of k-means and suggesting improvements, but they are highly mathematical and I didn't have the time to investigate further. So I just built my own algorithm. It is clearly fuzzy, imperfect, may even be wrong in some situations, but at least it is fast. I will certainly investigate this area more, maybe even try to understand the math behind it and analyse my results based on that research. When I do, I will update this post or write others. But until then, let me present my work so far.
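For reference, this is roughly what the standard approach does - a minimal sketch of the classic k-means (Lloyd's) iteration, reusing the Item class from the code below; note the nested n-by-k loop in the assignment step, which has to run once per iteration, for an unbounded number of iterations:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class KMeansSketch
    {
        public static List<Item> Cluster(List<Item> points, int k, int maxIterations = 100)
        {
            var rnd = new Random();
            // start from k random points as the initial centroids
            var centroids = points.OrderBy(p => rnd.Next()).Take(k)
                .Select(p => new Item { X = p.X, Y = p.Y }).ToList();
            var assignment = new int[points.Count];
            for (var iter = 0; iter < maxIterations; iter++)
            {
                var changed = false;
                // assignment step: compare each of the n points to each of the k centroids
                for (var i = 0; i < points.Count; i++)
                {
                    var best = 0;
                    var bestDist = double.MaxValue;
                    for (var c = 0; c < k; c++)
                    {
                        var dx = points[i].X - centroids[c].X;
                        var dy = points[i].Y - centroids[c].Y;
                        var dist = dx * dx + dy * dy;
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    if (assignment[i] != best) { assignment[i] = best; changed = true; }
                }
                if (!changed) break; // converged
                // update step: move each centroid to the mean of its assigned points
                var sumX = new double[k];
                var sumY = new double[k];
                var counts = new int[k];
                for (var i = 0; i < points.Count; i++)
                {
                    sumX[assignment[i]] += points[i].X;
                    sumY[assignment[i]] += points[i].Y;
                    counts[assignment[i]]++;
                }
                for (var c = 0; c < k; c++)
                {
                    if (counts[c] > 0)
                    {
                        centroids[c].X = sumX[c] / counts[c];
                        centroids[c].Y = sumY[c] / counts[c];
                    }
                }
            }
            return centroids;
        }
    }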
The first problem I had was, as I said, complexity. For one million points on the map, any algorithm that takes into account the distance between any two items will have to make at least one trillion comparisons. So my solution was to limit the number of items by grouping them in a grid:

Step 1: find the min and max on each dimension (that means going through the entire item collection once, or knowing the map bounds beforehand).
Step 2: determine a number of cells that would be a bit more than what I need in the end (that's a decision I have to make, with no algorithmic complexity attached). Example: for my map case I have only two dimensions, X and Y, and I want to display an upper bound of 1000 clusters. Therefore I find the minimum and maximum X and Y and split each dimension into 100 slots, which clusters my items into 10000 cells.
Step 3: for each item, find its cell based on X and Y and add the item to the cell. The column index is the integer part of (X-minX)/qX, where qX=(maxX-minX)/100 is the width of one cell, and similarly for rows (again, one pass through the collection; see the short sketch after this list).
Step 4: find the smallest cell (the complexity is now reduced to working with cells).
Step 5: find its smallest neighbour (the complexity of this step depends on the implementation).
Step 6: merge the two cells.
Repeat from Step 4 while the number of cells is larger than the desired number of clusters. In the end, the algorithm is O(n+p*log(p)), I guess, where p is the number of cells chosen at Step 2.
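To make Step 3 concrete, here is the cell computation as it appears in the full code below; with minX = 0, maxX = 200 and 100 slots, qX is 2, so an item at X = 57.3 lands in column (int)(57.3 / 2) = 28:

    // qX is the width of one cell; the Math.Min guard keeps an item sitting
    // exactly on maxX inside the last cell instead of spilling into cell 100
    var qX = (maxX - minX) / initialClustersSquareSide;
    var cx = Math.Min((int)((item.X - minX) / qX), initialClustersSquareSide - 1);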
Optimizations are the next issue.
How does one find the neighbours of a cell? In Step 3 we also create a list of neighbours for each new cluster, by looking up the clusters at the coordinates immediately above, below, left and right. When we merge two clusters, the result is a neighbour to all the neighbours of the merged clusters.
How does one quickly get the cluster at a certain position? We create a dictionary that uses the coordinates as the key. What about when we merge two clusters? Then the new cluster becomes accessible through any of the original clusters' keys (which implies that each cluster also holds a list of its keys).
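In code, the remapping after a merge is a few lines (merged, cluster1 and cluster2 stand for the new cluster and its two parents, mirroring the full listing below):

    // re-point every cell key of both parents at the merged cluster,
    // so a lookup through any old cell position still finds it
    merged.Keys = cluster1.Keys.Union(cluster2.Keys).ToList();
    foreach (var key in merged.Keys)
    {
        clusterDict[key] = merged;
    }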
How does one find the smallest cell in the cluster list? After Step 3 we sort the cluster list by the number of items each cluster contains, and every time we perform a merge we find the first cluster larger than the merged result and insert the result at that position, so that the list always remains sorted.
How do we easily find the first cluster larger than a given one? By employing a divide et impera method: split the list in two at each step and choose which half to look into based on the item count of the cluster at the middle position.
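Here is a minimal sketch of that halving search as it applies to a List<Cluster> kept sorted by Count (this matches the older List-based version mentioned in the update below; the LinkedList version walks nodes linearly instead):

    // returns the index of the first cluster with Count >= count,
    // or sorted.Count when every cluster is smaller
    static int FindInsertionIndex(List<Cluster> sorted, int count)
    {
        int lo = 0, hi = sorted.Count; // search window [lo, hi)
        while (lo < hi)
        {
            var mid = (lo + hi) / 2;
            if (sorted[mid].Count < count)
                lo = mid + 1; // mid is too small, search the right half
            else
                hi = mid;     // mid is large enough, search the left half
        }
        return lo;
    }
    // usage: sorted.Insert(FindInsertionIndex(sorted, merged.Count), merged);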
Before you use the code, note that there are specific scenarios where this type of clustering will look a bit off, such as items arranged in a long line or along an empty polygon (the cluster will appear to be in its center). But I needed speed, and I got it.
Enjoy!
Update: The performance of removing or adding items in a List is very poor, so I created a LinkedList version that seems to be even faster. Here it is; the old List version is at the end.
using System;
using System.Collections.Generic;
using System.Linq;

/// <summary>
/// Generic x,y positioned item class
/// </summary>
public class Item
{
    public double X { get; set; }
    public double Y { get; set; }
}

public class ClusteringDataProcessor
{
    /// <summary>
    /// the square root of the initial cell number (100*100 = 10000)
    /// </summary>
    private const int initialClustersSquareSide = 100;
    /// <summary>
    /// the desired number of resulting clusters
    /// </summary>
    private const int numberOfFinalClusters = 1000;
    private static Random _rnd = new Random();
    /// <summary>
    /// In this implementation, the Cluster inherits from Item, so the result is a list of Item.
    /// In the case of one-Item clusters, we actually return the original Item
    /// </summary>
    /// <param name="list"></param>
    /// <returns></returns>
    public List<Item> Process(List<Item> list)
    {
        if (list.Count <= numberOfFinalClusters) return list;
        // find bounds. If already known, these can be provided as parameters
        double minY = double.MaxValue;
        double minX = double.MaxValue;
        double maxY = double.MinValue;
        double maxX = double.MinValue;
        foreach (var item in list)
        {
            var y = item.Y;
            var x = item.X;
            minY = Math.Min(minY, y);
            maxY = Math.Max(maxY, y);
            minX = Math.Min(minX, x);
            maxX = Math.Max(maxX, x);
        }
        // the original list of clusters
        var clusterArr = new List<Cluster>();
        // the dictionary used to index clusters on their position
        var clusterDict = new Dictionary<string, Cluster>();
        // the unit for finding the cell position for the initial clusters
        var qX = (maxX - minX) / initialClustersSquareSide;
        var qY = (maxY - minY) / initialClustersSquareSide;

        foreach (var item in list)
        {
            // compute the integer cell coordinates, used to build the dictionary key
            var cx = Math.Min((int)((item.X - minX) / qX), initialClustersSquareSide - 1);
            var cy = Math.Min((int)((item.Y - minY) / qY), initialClustersSquareSide - 1);
            var key = getKey(cx, cy);
            Cluster cluster;
            // if the cluster for this position does not exist, create it
            if (!clusterDict.TryGetValue(key, out cluster))
            {
                cluster = new Cluster
                {
                    Keys = new List<string> { key },
                    X = minX + cx * qX + qX / 2,
                    Y = minY + cy * qY + qY / 2,
                    //Items = new List<Item>(),
                    Count = 0,
                    Neighbors = new List<string>()
                };
                // the neighbours of this cluster are the existing clusters immediately
                // left, right, below and above; if they exist, this cluster is also
                // added to their neighbour list
                var nkeys = new[]
                {
                    getKey(cx - 1, cy),
                    getKey(cx + 1, cy),
                    getKey(cx, cy - 1),
                    getKey(cx, cy + 1)
                };
                for (var j = 0; j < 4; j++)
                {
                    Cluster nc;
                    if (clusterDict.TryGetValue(nkeys[j], out nc))
                    {
                        cluster.Neighbors.Add(nkeys[j]);
                        nc.Neighbors.Add(key);
                    }
                }
                clusterDict[key] = cluster;
                clusterArr.Add(cluster);
            }
            // add the item to the cluster (the commented lines held the items in a list,
            // while the current implementation only remembers the last item and the count)
            //cluster.Items.Add(item);
            cluster.Item = item;
            cluster.Count++;
            // add the item position to the sums, so that we can compute the final position
            // of the cluster at the Finalize stage without enumerating the items
            // (or having to hold them in an Items list)
            cluster.SumX += item.X;
            cluster.SumY += item.Y;
        }
        // if the number of cells is already smaller than the desired number of clusters,
        // just return the clusters
        if (clusterArr.Count <= numberOfFinalClusters)
        {
            return clusterArr.Select(c => c.Finalize()).ToList();
        }
        // sort the cluster list so we can efficiently find the smallest cluster
        //clusterArr.Sort(new Comparison<Cluster>((c1, c2) => c1.Items.Count.CompareTo(c2.Items.Count)));
        LinkedList<Cluster> clusterLinkedList = new LinkedList<Cluster>(clusterArr.OrderBy(c => c.Count));

        // remember the last merged cluster, as the next merged clusters might have similar sizes
        var lastCount = int.MaxValue;
        LinkedListNode<Cluster> lastLinkedNode = null;

        // do this until we get to the desired number of clusters
        while (clusterLinkedList.Count > numberOfFinalClusters)
        {
            // we need to get the smallest (so first) cluster that has any neighbours
            var cluster1 = clusterLinkedList.First(c => c.Neighbors.Any());
            Cluster cluster2 = null;
            // then find its smallest neighbour
            var min = int.MaxValue;
            foreach (var nkey in cluster1.Neighbors)
            {
                var n = clusterDict[nkey];
                //var l = n.Items.Count;
                var l = n.Count;
                if (l < min)
                {
                    min = l;
                    cluster2 = n;
                }
            }
            // join the clusters
            var keys = cluster1.Keys.Union(cluster2.Keys).ToList();
            var cluster = new Cluster
            {
                Keys = keys,
                // approximate cluster position, not needed
                //X = (cluster1.X + cluster2.X) / 2,
                //Y = (cluster1.Y + cluster2.Y) / 2,

                // the result holds the count of both clusters
                //Items = cluster1.Items.Union(cluster2.Items).ToList(),
                Count = cluster1.Count + cluster2.Count,
                // the neighbours are the union of both neighbour lists, minus the merged clusters themselves
                Neighbors = cluster1.Neighbors.Union(cluster2.Neighbors)
                    .Distinct()
                    .Except(keys)
                    .ToList(),
                // compute the sums for the final position
                SumX = cluster1.SumX + cluster2.SumX,
                SumY = cluster1.SumY + cluster2.SumY
            };
            foreach (var key in keys)
            {
                clusterDict[key] = cluster;
            }
            // remove the merged clusters (removing a LinkedList node is cheap,
            // although Remove(value) still has to find the node first)
            clusterLinkedList.Remove(cluster1);
            clusterLinkedList.Remove(cluster2);

            // a little bit of magic to make finding the insertion point faster
            // (LinkedLists go through the entire list to find an item):
            // if the last merged cluster is smaller than or equal to the new merged cluster,
            // start searching from it. This halves the insert operation time; there surely
            // are better implementations, it just didn't seem that important.
            LinkedListNode<Cluster> start;
            if (lastCount <= cluster.Count && lastLinkedNode.Value != cluster1 && lastLinkedNode.Value != cluster2)
            {
                start = lastLinkedNode;
            }
            else
            {
                start = clusterLinkedList.First;
            }
            var insertionPoint = nextOrDefault(clusterLinkedList, start, c => c.Count >= cluster.Count);
            // insert the merged cluster so the list stays sorted, and remember it
            LinkedListNode<Cluster> link;
            if (insertionPoint == null)
            {
                link = clusterLinkedList.AddLast(cluster);
            }
            else
            {
                link = clusterLinkedList.AddBefore(insertionPoint, cluster);
            }
            lastLinkedNode = link;
            lastCount = cluster.Count;
        }
        return clusterLinkedList.Select(c => c.Finalize()).ToList();
    }
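    /// <summary>
    /// The post calls getKey but does not show it; this is an assumed minimal
    /// implementation that encodes a cell position into a single dictionary key.
    /// </summary>
    private static string getKey(int cx, int cy)
    {
        return cx + ":" + cy;
    }

    /// <summary>
    /// nextOrDefault is not shown in the post either; assumed sketch: walk the
    /// linked list from the given node and return the first node whose value
    /// satisfies the condition, or null when no such node exists. The list
    /// parameter is kept only to match the call site above.
    /// </summary>
    private LinkedListNode<Cluster> nextOrDefault(LinkedList<Cluster> list,
        LinkedListNode<Cluster> start, Func<Cluster, bool> condition)
    {
        for (var node = start; node != null; node = node.Next)
        {
            if (condition(node.Value)) return node;
        }
        return null;
    }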
}

/// <summary>
/// The cluster class inherits from Item, so single-item clusters can be returned
/// as the original Item. The post shows only the Finalize method; the properties
/// below are inferred from how Process uses them.
/// </summary>
public class Cluster : Item
{
    public List<string> Keys { get; set; }
    public List<string> Neighbors { get; set; }
    public Item Item { get; set; }
    public int Count { get; set; }
    public double SumX { get; set; }
    public double SumY { get; set; }

    /// <summary>
    /// the function that finalizes the computation of the values or even decides
    /// to return the only item in the cluster
    /// </summary>
    /// <returns></returns>
    public Item Finalize()
    {
        //if (Items.Count == 1) return Items[0];
        if (Count == 1) return Item;
        /*Y = SumY / Items.Count;
        X = SumX / Items.Count;
        Count = Items.Count;*/
        Y = SumY / Count;
        X = SumX / Count;
        return this;
    }
}
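To wrap things up, here is how one might call the processor; the random data below is purely illustrative, assuming longitude/latitude-style coordinates:

    var rnd = new Random();
    var items = Enumerable.Range(0, 1000000)
        .Select(i => new Item { X = rnd.NextDouble() * 360 - 180, Y = rnd.NextDouble() * 180 - 90 })
        .ToList();
    // returns at most numberOfFinalClusters (1000) items: single-item cells yield
    // the original Item, merged cells yield a Cluster positioned at the average
    // of its items
    var clusters = new ClusteringDataProcessor().Process(items);
    Console.WriteLine(clusters.Count);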