The OUTPUT clause is a very useful tool in Microsoft SQL, allowing for getting automatically inserted columns in the same command as the INSERT. Imagine you have a table with an identity column and you need the generated ids as you insert new records. It would look like this:
CREATE TABLE MyTable 
(
Id INT PRIMARY KEY IDENTITY(1, 1),
Value NVARCHAR(100)
)

CREATE TABLE AnotherTable
(
Value NVARCHAR(100),
AnotherValue NVARCHAR(100),
SomeConditionIsTrue BIT
)

go

CREATE TABLE #ids
(
Id INT ,
AnotherValue NVARCHAR(100)
)

INSERT INTO MyTable (Value)
OUTPUT inserted.Id INTO #ids (id)
SELECT Value
FROM AnotherTable
WHERE SomeConditionIsTrue = 1

-- Do something with the inserted Ids

However, what do you do if you want to also insert the column AnotherValue to the #ids table? Something like this does not work:
INSERT INTO MyTable (Value) 
OUTPUT inserted.Id,AnotherTable.AnotherValue INTO #ids (id,AnotherValue)
SELECT Value
FROM AnotherTable
WHERE SomeConditionIsTrue = 1

Enter the often ignored MERGE, which can help us translate the query above into:
MERGE INTO MyTable USING (
SELECT Value , AnotherValue
FROM AnotherTable
WHERE SomeConditionIsTrue = 1
) t ON 1=0 --FALSE
WHEN NOT MATCHED THEN
INSERT (Value) VALUES (t.Value)
OUTPUT Inserted.Id, t.AnotherValue INTO #ids (Id, AnotherValue);

Note the 1=0 condition so that the merge never "matches" and how the select from the first query now contains all the columns needed to be output, even if only some of them are inserted in the insert table.

This post was prompted by a StackOverflow answer that, as great as it was, didn't make it clear what to do when you get your values from a more complicated select. The answer is simple: put it all in the 'using' table.

and has 0 comments
I read this short novel from start to end in under a day. Osamu Dazai writes from the point of view of a sociopathic young man who cannot seem to understand the human condition and fears all people around him, mostly because he expects to be found out at every moment. The title of the book can be translated in several ways, the English one relates to the protagonist's feelings of losing one's humanity, while the literal translation reads as "disqualified from being human", implying a societal judgement. Imagine a Japanese version of The Stranger, by Albert Camus, and you get a good picture of the plot and feel of the book. Both books were written in the same period, more or less, but while Camus probably imagined the character, many believe Dazai was talking about himself - he committed suicide soon after.

No Longer Human is the second best rated Japanese book and was adapted in movie and manga. It is difficult to imagine those being better than the dry accounting of the inner turmoil of the character, starting as a little boy who devises "clowning" as a method of passing the test of humanity, outwardly fun and good natured and inwardly terrified of being discovered as a fraud and punished by the society of strange human beings that he cannot understand or empathize with. I highly recommend it.

and has 0 comments
I have been watching horror movies since I was six and read books of all sort through the years, but rarely have I seen something so truly scary as Daemon. Daniel Suarez manages to convey terror not by upgrading the villain, but by making it mundane. The daemon is not an all knowing Artificial Intelligence that takes over the world, but a stupid game engine run by a logic tree. The ease with which something like this could be created makes the book truly terrifying, particularly for me, who has actually thought of the weakness of humans when faced with decisions and pondered a world where machines make the decisions not because they want to rule us, but because we don't want to choose.

But there is more to this book than its subject. It is actually very well written and that is remarkable considering it is Suarez' first book. I will read the sequel to Daemon, Freedom™ as soon as I can. I loved the attention to detail, not a descriptive boring series of useless trivia, but a close focus on what makes people tick and how technology falls into place to fill the gaps that our failings leave. On the cover of the new book that Daniel Suarez wrote there is a quote that I feel is totally true: he is a true heir to Michael Crichton.

and has 0 comments
Bastard!! is an adaptation of the manga with the same name. The manga itself is ongoing, but very slowly. At the moment of the writing it had 138 chapters. The genre of it is magical fights in an action comedy kind of style. Bob Samurai has a video review of it.

For myself I have to say that I had fun watching it, in a mindless "I come from work and I don't feel like doing anything" kind of way, but it wasn't that special in plot, animation or feeling. The "anti-hero" is actually the typical hero that does incredible good deeds for the love of women and the biggest source of humor are the few lines peppered throughout the episodes that break the fourth wall. Stuff like "What would have been the purpose of defeating that guy when we were off screen" or "a handsome hero like myself couldn't possible lose to one as ugly as you". The manga is a little bit more about the scoundrel nature of the main character - as it should be, there are 70 chapters (the Host of Shadows) covered by mere 6 episodes of the OVA - but it is also rather different from the anime: more story detail, more types of magic, etc. Probably the OVA, as quick dirty fun as it was, is not a very good one, since it relays only bits and pieces of the manga.

One can watch the anime at AnimeDreaming, read the manga at MangaHere and watch BobSamurai's video review on YouTube.

and has 0 comments

Another book that can easily be found in audio format on Librivox and YouTube, Creatures of the Abyss (also known as The Listeners), by Murray Leinster, is a slow mid 20th century sci-fi that reads as a cross between Jules Verne, H. P. Lovecraft and one of those books about people drinking and falling in love on boats in South America. More Verne, though.

The thing that made me continue listening to it was its way of depicting the mentality from back then. Written in 1961, it tells a story of people who, faced with extraordinary circumstances, first evade formulating a theory in their own head, for fear of contravening their own set view of the world, then - forced by events - they do allow themselves to formulate a theory, but keep it to themselves for fear of ridicule, even when they see other people considering the same things, then they proceed to test those theories by themselves and only then share them with others. Compared with the modern culture of sharing half formed thoughts before they can constitute complete phrases, it is quite different. It is also fun to read about people that think Venus is a large ocean planet, as is Jupiter, with a gravity four times that of Earth.

However, while it was interesting in a sociological way and good as a background for other activities, its slow pace might feel excruciating for the casual reader. More than half of it is more about boats and sailing and catching fish. The science fiction part is slowly creeping into the story and the climax is in the last chapter alone. Maybe my association of the book with Lovecraft is strained, as the only commonality is touching on tentacled abyssal creatures that might appear disturbing to human sensibilities and certainly the elements of horror are very rare in Creatures of the Abyss. The book does feel more real, though, as it goes through this slow process of examination of evidence and formulating hypotheses and testing them before jumping to conclusions. It depicts the beginning of the modern era of scientific thought, back when it was respectable and desirable to be thinking like that.

Bottom line: Slow paced, but very well written, you should at least try it, since it is so readily available. You can even listen to it right here, on this post.

[youtube:IlKJbS4NU1A]

Today we tested a web application in the new Microsoft Edge browser. To our surprize, the site failed where Internet Explorer, Chrome, Firefox and even Safari worked perfectly well. I narrowed the problem to the navigator.geolocation.getCurrentLocation which wasn't working. The site would see navigator.geolocation, ask for the current location, the user would be prompted to allow the site to access location and after that would silently fail. What I mean by that is that neither the success or the error callbacks were called, even if the options object specified one second for the timeout. I don't have access to a lot of Windows 10 machines and I assume that if a lot of people met with this problem they would invade the Internet with angry messages, but so far I've found no one having the same issue.

Bottom line: forced to take into consideration the possibility that the geolocation API would silently fail, I changed the code like this:
if (navigator.geolocation) {
var timeoutInSeconds=1;
var geotimeout=setTimeout(function() {
handleNoGeolocation();
},timeoutInSeconds*1000+500); //plus 500 ms to allow the API to timeout normally
navigator.geolocation.getCurrentPosition(function (position) {
clearTimeout(geotimeout);
var pos = doSomethingWith(position.coords.latitude, position.coords.longitude);
}, function () {
clearTimeout(geotimeout);
handleNoGeolocation();
},{
enableHighAccuracy:true,
timeout: timeoutInSeconds*1000
});
} else {
handleNoGeolocation();
}

In the handleNoGeolocation function I've accessed the great service FreeGeoIp, that returns vague coordinates based on your IP and fell back to a static latitude, longitude pair if even this call failed.

Note: for the first time the function is called for your site, a browser dialog will appear, requesting permission to share the location. During the display of the dialog the timeout will fire, then, based on the user choice (and browser) a success/error handler will be called or nothing (like in this case), so make sure your code can handle running handleNoGeolocation followed by doSomethingWith.

and has 0 comments
A day ago there was a "leak" of three TV series pilots. I know, it sounds like someone out there is pissing TV series, but a look at most of them and you start seeing the truth of it. I don't really believe they were stolen or anything, either. I think they were deliberately distributed to gauge viewer reaction. The three shows in question are Blindspot, Lucifer and Minority Report. What do they all have in common? Law enforcement. It gets ridiculous from here on, you've been warned.

Blindspot is about a young woman (lovely Jaimie Alexander - the actress playing the Asgardian warrior Lady Sif in the Thor Marvel universe) found naked, without memories and tattooed all over her body. The tattoos are clues about future crimes and our Jane Doe helps the FBI solve them. The series has the obvious hallmarks of the post Lost era, with just enough artificial mystery to keep one guessing, but not really caring. Anyway, all I can say is that if you make a show about Jaimie Alexander found naked you should bloody show her naked! Stupid Americans! The French should start remaking these shows and demonstrate how it is done!

Lucifer is about... the devil. He comes to Earth because he got tired of ruling Hell - which was his divine punishment from his father, God. And by Earth I mean Los Angeles. Yes, very subtle. He teams up with another lovely (Lauren German) who is a police detective. Why, you might ask? Devil may care, he just loves solving crimes and has daddy issues. The show is so ridiculously pompous that it raises hackles. It reminds me of the well deservedly cancelled The Transporter series.

Minority Report is based on a movie about "precogs" used to stop crime by predicting it, leading to the paradox of arresting and incarcerating people because of crimes they did not commit. Yet. I haven't watched it. Yet. But since the movie was based itself on the works of famous paranoid sci-fi writer Phillip K. Dick, it is the only one that I have hopes for. Of course the detective will be a young attractive person, teamed with another young attractive person with some special power that helps solving some type of crisis, probably crimes and possibly related to terror attacks. I can see it... in the future...

Update: I was right. One of the precogs in the movie helps a young black female police detective to prevent crimes. This is a horrible perversion of the film, which ended with showing the precog system not working and putting innocent people in jail. In the series, the police is frustrated that the precog era has ended and is convinced that every released arrestee from the program would have become a killer. Yuck!

and has 0 comments
Another great Star Trek novel placed in the Kirk era, Star Trek Prime Directive keeps the reader/listener on the edge of their seat. It starts with a disgraced Kirk, a scattered crew and a scrapped Enterprise. It shows the dark, bureaucratic side of the Federation, cruel and merciless when you are not the lucky wearer of the golden captain uniform or, even better, an admiral. How did it come to this? The answer is both captivating, original and with deep roots in the Star Trek basic tenant: the Prime Directive.

I actually listened to the audiobook, also on YouTube (see embedded video), which was very well narrated. If I had any problems with the story was that it was clearly very biased. Kirk is always thinking of the poor alien species that are like humans, but seems to have no qualms to experiment with phaser fire and even slightly torture other alien beings if they are bug like. Also Spock seems very little a Vulcan in this.

Bottom line is that the idea was intriguing and original and the style of the writing was very good. One of the best ST novels so far.

I've come upon this strange Not enough storage is available to complete this operation ArgumentException when creating an instance of EventSource derived classes. This class is responsible for creating entries in Windows logs. Strangely enough, there are very few articles on the Internet connecting the class with this particular exception, so I started to investigate. One important thing to notice is that the exception is intermittent. Basically you can cycle a few times with a try/catch block and get a valid instance. That seems to indicate some sort of race condition. So far, this is the easy solution I could find. However, I really wanted to know why does it happen.

If I remove the EventSource class from the searches I get more pages reporting the same exception and one of the reasons that people say it happens is related to the size of the registry. Retrospectively it makes sense, but it never occurred to me that the system registry has a maximum size. But is that the problem? Looking with the EventViewer summary I see something like this:

Of course, the most obvious thing there is the exact size of each log category: 20.00 MB. If one right-clicks on any of the log groups and goes to Properties, the size limit for each is clearly shown and configurable. So is that the problem?

The exception is thrown almost exclusively when the logging is heavy: multiple threads trying to log stuff at the same time. Since retrying usually solves the problem, my guess is that the exception is thrown somewhere between the request for a new log entry and the process that eliminates old entries to allow for new ones. Unfortunately I don't see any configuration option for how many entries to eliminate. I would have liked to clear, let's say, 20% of the log when it is full to make this problem less relevant. Perhaps hidden in the bowels of the system registry there is a way to set this, but at this time I don't know it. Nor is it clear if I have the option to remove more of the less important events rather than just the oldest. Clearly the EventLog class in .NET supports deleting individual log entries, so this is feasible if it ever becomes a real problem.

So far, my solution is to just try again when the error is thrown:
LoggerEventSource eventSource = null; //EventSource derived class (see documentation)
for (var i = 0; i < 5 && eventSource == null; i++)
{
try
{
eventSource = new LoggerEventSource();
}
catch (ArgumentException)
{
Thread.Sleep(100);
}
}

and has 0 comments
Star Trek: Strangers from the Sky is an audiobook read by George Takei and Leonard Nimoy. While it is a typical ST The Original Series plot, with god like aliens, travel back in time to significant moments of Earth's history and a focus on high moral values that, in the end, save the day, I felt that it was a little bit more subtle, deeper than a typical episode of any of the series. Was it because of the introspection of the characters, or the wonderful narration of Nimoy and Takei, I do not know. What I can say is that I enjoyed listening to the story quite a lot and I recommend it highly for any Star Trek fan.

I also don't know if it is in the public domain or not, all I can say is that I listened to it on YouTube and so can you:
Of course you cannot listen to it on YouTube anymore. Some lawyers saw to that.

and has 0 comments
This is the third writing of Esther Friesner that I've read, after The Shunned Trailer and Druid's Blood, both excellent and funny, combining fantasy elements with the present or other realistic historical settings. Gnome Man's Land does the same thing, but I have to say I didn't find it as funny or as good as the others I mentioned. I also attempted to start Here Be Demons, another of her books, but couldn't really enjoy it enough to go past the first chapter. Probably she is one of those authors who, when they are good are really good and when they are not, well...

The book is the first of a trilogy starring Tim Desmond, a young boy of Irish descent who finds himself in a strange situation when the veil between our world and the land of the fey is punctured and more and more fantastic creatures go through. They come and attach themselves to mortals, as many of them are creatures who's very reason for existing is serving their masters. Stuff like banshees, Mongolian ancestors, goblins, kobolds, Greek demigodesses, Russian bath spirits, sprites, elves and so on and so on just sprout from the rupture, bringing annoyance and confusion more than anything. Tim somehow gets tricked into becoming the champion of the Fey on Earth and he does the job mainly because he feels all of these supernatural creatures need his help (plus the girl he secretly loves supports this and his banshee is a hot redhead to boot).

Some hilarity ensues, but often feeling a bit artificial, while the actions of the characters involved are simplistic, inconsistent and dragging on, like the author wanted to tell a joke and she ended up writing an entire book. The crises are not that good either, oscillating between childishly funny and dead bloody serious. The ending was disappointing as well, leaving a very traumatic event just in the wind, like an afterthought, pending Tim's recovery of some of his memory. I really wanted to like the book, too, but in the end I just forced myself to reach the end and I am confident I will not read the other two books in the series. I have some hopes for the Princesses series, which I understand is one of Friesner's better works.

Change Data Capture is a useful mechanism for tracking data changes in Microsoft SQL Server. Once you enable it for a database and some of its tables, it will create a record of any change to the data of those tables. In order to use the captured data, Microsoft recommends using the table functions provided by the mechanism, mainly [cdc].[fn_cdc_get_all_changes_dbo_<table name>] and [cdc].[fn_cdc_get_net_changes_<table name>]. However, when you use them, your very first experience might be an error that looks like this: Msg 313, Level 16, State 3, Line 20 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_ ... . Since this is an error message everyone associates with missing a parameter for a function, one might assume that the documentation is wrong and one must add another parameter. You try adding a bogus one and you get Msg 8144, Level 16, State 3, Line 9 Procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name> has too many arguments specified. which brings confusion. One hint is that if we use one less parameter than in the documentation, the error is slightly different Msg 313, Level 16, State 3, Line 9 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name>. In this error message, the tracked table name is specified in the function name, as opposed to the other where ... is used instead. What is going on?

The documentation for the function (which, as usual, nobody reads past the usage syntax and the code examples - just read the Remarks section) says this: If the specified LSN range does not fall within the change tracking timeline for the capture instance, the function returns error 208 ("An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes.")., which of course is the explanation for this weird behaviour, but why and when does it happen?

The why comes from a Microsoft Connect page where an overly honest developer explains that the reason for the obscure error message is the antiquated error and function system used in T-SQL: The issue here is the inability to do raiseerror from within a function that prevents us from bubbling up meaningful error message. If one looks at the source of cdc.fn_cdc_get_all_changes_dbo_<table name>, one sees that the error is thrown from another function, a system one, called [sys].[fn_cdc_check_parameters]. Doing a select on it we get the same original error which is now slightly humourous, since it comes from a completely different function than the one in the message. Since it is a system function this time, there is no source code for it.

The when is more tricky and it shows that they didn't think this through much. First of all, whenever you send a NULL or an empty (0x0000...) value to the function as the begin or end LSN you get this error message. The code examples from Microsoft always show these mysterious LSN values being received from functions like sys.fn_cdc_get_min_lsn('<schema name>_<table name>'), sys.fn_cdc_get_max_lsn(), sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE()) and so on, but they are hardly easy to understand, as they return an empty value for wrong parameters. For example, a common reason why your code fails is from getting the LSN like this: sys.fn_cdc_get_min_lsn('MyWonderfulTable') when in fact you need to use the schema in the name as well: sys.fn_cdc_get_min_lsn('dbo_MyWonderfulTable'). You have to use this syntax everywhere. Funny enough, if the tracked table is empty, you get the lowest LSN for the entire database, but if you use a wrong database name (or without the schema, or NULL, etc) you get an empty LSN. How an empty LSN is not the minimum LSN is beyond me.

My solution? Just select from the tables directly. Yeah, I know, it's bad, unrecommended by Microsoft, reinventing the wheel. But it works and I don't get weird functions messing up my flow with obscure error messages. Just remember to take a look at the cdc.* functions and see how they are written.

So, to summarize: The error message is misleading and it's all they could do within the confines of the T-SQL function error system. Remember to use the schema in the string defining the table in the cdc functions (ex: dbo_MyTable). In case you really want to be flexible, interrogate the cdc tables directly.

I had this situation where I had to execute large SQL script files and the default sqlcmd tool was throwing exceptions rather than execute them. So I created my own tool to read the scripts and execute them transactionally. Everything went smoothly, except at the end. You see, I didn't use TransactionScope.Complete from the beginning in order to see how the program would cope with a massive rollback. Not well, apparently.

The exception was thrown at rollback: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. I had set the TransactionOptions.Timeout to TransactionManager.MaximumTimeout and the SqlCommand.CommandTimeout to 0 (meaning never end) and I still got the exception. Apparently, the problem was the SqlConnection.ConnectTimeout which is a readonly property with a default value of 15 seconds. The value can be changed via the connection string, by adding something like Connect Timeout=36000 (10 hours) and many articles on the Internet suggest doing that. However, that is just really ugly. A better solution is to set the value of the timeout programmatically and this is how to do it:
var timeout = TimeSpan.FromHours(10);
SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder(connectionString);
csb.ConnectTimeout = (int)timeout.TotalSeconds;
connectionString = csb.ConnectionString;

As you can see, the nice SqlConnectionStringBuilder helps us validate, parse and change the values in the connection string. One can imagine other interesting uses, like adding MARS to the connection string automatically or restricting the use to a list of databases or disallowing weak passwords, etc.

and has 0 comments
If you were born before the 80s, Ready Player One is going to fill you with melancholy. Ernest Cline combines several classical young adult themes - like a battle versus an oppressive corporate evil, true and pure love, villainy and lack of honor defeated through friendship and good feelings - with (often obscure) geek memes of the 70s and 80s. If you are the kind of person who likes to impress by quoting lines from movies or telling of your adventures in games that are older than your children, this is the book for you. OK, I won't be mean, the book is going to be fun no matter when you were born, but the level of enjoyment may vary.

However, the book is still a young adult book at its core and, besides the overall message that you actually have to make an effort to reach your goals - an often neglected tidbit in young adult books and movies - it reads like one. Young heroes, with enough skill to pass through the challenges of the story, but awkward enough to also be endearing, manage to save the world through the power of their dedication and ideals. Also Chekhov's Gun has been used so much that it left gaping holes in the story. Amazing how random things in the story come perfectly together at the end. Let a bitter Harry Potterish aftertaste.

But let's start with the plot, which is pretty fun. It all happens in a dystopian world where energy reserves dwindled, gasoline became way too scarce and expensive, Elon Musk never happened and most people live their lives in a virtual world called OASIS, created by a brilliant yet reclusive visionary who made sure the system will remain secure and anonymous. Kind of like the Internet, but without the MPAA or the NSA. Yeah, it already sounds like an impossible dream, doesn't it? Well, the maker of the game world dies and leaves his entire estate (hundreds of billions of dollars and complete control over OASIS) to whoever finds the Easter Egg he his inside the game. The heroes of the story are young "egg hunters", while the villains are corporate drones who have been hired to find the egg for their company.

The writing style irked me a little. I know it is Cline's first book, and it certainly was a decent effort, but it had that way of explaining things that I call "fake past" in which the narrator explains things as if he is telling a story from the past. "The OASIS was...". Since this is supposed to happen in the future, it took a while until I could stop feeling irritated by it. However this has the advantage of being very easy to read.

I think it matters a lot if the reader is into cultural references. I could understand some, I could remember some, most of the references in the book, though, were ancient or obscure enough that even I didn't recognize them, and I am a pretty geeky person. I felt rewarded when I could "get it" and frustrated when I didn't, so probably it will be the same for most readers. If you don't care about these things, I think it is better to wait for the movie.

...which is in the works, with Steven Spielberg attached to the project. It might be difficult to put the story on the screen, though, since the book made an effort to describe a future world where everybody is obsessed with this specific period of the late 20th century, kind of like the Star Trek episodes that happened in the past or that required Kirk or Picard to know some specific book from the school curriculum. There is even a Ready Player One web site, that might have some Easter eggs in it (they would be dumb not to program some) but to me it seems both way too geeky and way to social at the same time.

Bottom line: a fun book for geeks. I hope it inspires the younger generations to look at the world a little bit differently, but I don't have my hopes up. My guess is that they will go all 'Meh' on a story that references anything that happened last century.

Update: after another very useful comment from NULLable, I tried several new ideas:

  • range queries - trying to specify that the child StartIp, for example, is not only greater or equal to the parent StartIp, but also less or equal to the parent EndIp. In my case the query didn't go faster and adding new indexes as recommended in the comment made is slower. I believe it is because the range values are not static or just because clustering the start/end IP index is really way faster than any logical implementation of the search algorithm
  • cursor hints - obviously a very important hint that I should add to almost any cursor is LOCAL. A GLOBAL cursor can be accessed from outside the stored procedure and weird things can happen when running the stored procedure twice at the same time. NULLable recommended also STATIC READ_ONLY and FORWARD_ONLY. In truth the performance of the query doesn't really depend on the speed of the cursor, anyway, but I found an article that discusses the various cursor hints and ends up recommending LOCAL FAST_FORWARD. Check it out, it is very informative. My tests showed no real difference in this particular scenario.
  • RI-Tree implementation in SQL - the article that NULLable linked to is amazing! I just don't get it :) I will update this more when I gain more IQ points.


Update 2: I kind of understood the Relational Interval Tree implementation, but I couldn't find a way for it to help me. The code there creates a computed column of the same type as the IP columns then makes a BETWEEN comparison and/or a join or an apply with two table functions. I can't imagine how it could help me since the original query is basically just two BETWEEN conditions. But still a very interesting article.

I wanted to have a database of all Ripe records, in order to quickly determine the Internet Service Provider for an IP. We are discussing IPv4 only, so the structure of the table in the database looked like this:

CREATE TABLE [dbo].[RipeDb](
[Id] [int] IDENTITY(1,1) NOT NULL,
[StartIp] [bigint] NULL,
[EndIp] [bigint] NULL,
[NetName] [nvarchar](450) NULL,
[StartTime] [datetime2](7) NULL,
[EndTime] [datetime2](7) NULL,
[ParentId] [int] NULL)


As you can see, I translate IPs into BIGINT so that I can quickly sort and select stuff. I also added a ParentId column that represents the parent ISP, as you have some huge chunk of IPs, split and sold to other ISPs, which in turn are selling bits of the IP range they own to others and so on. The data I receive, though, is a simple text file with no hierarchical relations.

The task, therefore, is to take a table like described above, with more than four million records, and for each of them find their parent, if any.

The simplest idea is to join the table with itself like this:

SELECT rp.Id as ParentId, 
r.Id
FROM RipeDb r
INNER JOIN RipeDb rp
ON rp.StartIp <= r.StartIp
AND rp.EndIp >= r.EndIp
AND rp.EndIp - rp.StartIp > r.EndIp - r.StartIp

This gets all ancestors for each record, so we need to use a RANK() OVER() in an inner select in order to select only the parent, but that's beyond the scope of the article.

Since we have conditions on the StartIp and EndIp columns, we need an index on them. But which?

Through trial and error, more than anything else, I realised that the best solution is a clustered index on StartIp,EndIp. That is why the first column (Id) is not marked as PRIMARY KEY in the definition of the table, because it has to look like this:

[Id] [int] PRIMARY KEY NONCLUSTERED IDENTITY(1,1) NOT NULL

. Yes, primary keys don't have to be clustered.

But now you hit the snag. The process is EXTREMELY slow. Basically on my computer this query would end in a few days (as opposed to twice as much with a nonclustered index). What the hell is going on?

I tried several things:

  • JOIN hints (Merge, Loop and Hash joins) - the query optimizer seems to choose the best solution anyway
  • Various index combinations - nothing beats a clustered index
  • Taking a bunch of records and joining only them in a WHILE loop - it doesn't fill up the temp db, but it is just as slow, if not worse


At this point I kind of gave up. Days of work trying to figure out why this is going so slow reached a simple solution: 4 million records squared means 16 thousand billion comparisons. No matter how ingenious SQL would be, this will be slow. "But, Siderite, I have tables large like this and joining them is really fast!" you will say. True, with equality the joins are orders of magnitude faster. Probably there is either place for improvement in the way I used the indexes or in the way they are implemented. If you have any ideas, please let me know.

So did I solve the problem? Yes, of course, by not relying on an SQL join. Think about how the ranges are arranged. If we order the IP ranges on their start and end values, you get something like this:



For each range, the following is either a direct child or a sibling. I created a stored procedure that called itself recursively, which should have worked, but then it reached the maximum level of recursion in SQL (32 - a value that one cannot change!) and so I had to do everything myself. How? With a cursor. Here is the final code:

DECLARE @ParentIds TABLE (Id INT,StartIp BIGINT, EndIp BIGINT)
DECLARE @ParentId INT
DECLARE @Id INT
DECLARE @StartIp BIGINT
DECLARE @EndIp BIGINT
DECLARE @OldParentId INT

DECLARE @i INT=0
DECLARE @c INT

DECLARE curs CURSOR LOCAL FAST_FORWARD FOR
SELECT r.Id, r.StartIp, r.EndIp, r.ParentId
FROM RipeDb r
WHERE r.EndTime IS NULL
ORDER BY StartIp ASC, EndIp DESC

OPEN curs

FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp, @OldParentId

WHILE @@FETCH_STATUS=0
BEGIN

DELETE FROM @ParentIds WHERE EndIp<@StartIp

SET @ParentId=NULL
SELECT TOP 1 @ParentId=Id FROM @ParentIds
ORDER BY Id DESC

SELECT @c=COUNT(1) FROM @ParentIds

IF (@i % 1000=0)
BEGIN

PRINT CONVERT(NVARCHAR(100),SysUtcDatetime())+' Updated parent id for ' + CONVERT(NVARCHAR(100),@i) +' rows. ' + CONVERT(NVARCHAR(100),@c) +' parents in temp table.'
RAISERROR ('', 0, 1) WITH NOWAIT

END
SET @i=@i+1

IF (ISNULL(@OldParentId,-1) != ISNULL(@ParentId,-1))
BEGIN
UPDATE RipeDb SET ParentId=@ParentId WHERE Id=@Id
END

INSERT INTO @ParentIds VALUES(@Id,@StartIp,@EndIp)

FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp
END

CLOSE curs
DEALLOCATE curs


I will follow the explanation of the algorithm, for people hitting the exact issue that I had, but let me write the conclusion of this blog post: even if SQL is awesome in sorting and indexing, it doesn't mean that is the only solution. In my case, the SQL indexes proved to be a golden hammer that wasted days of my work.

So, the logic here is really simple, which makes this entire endeavour educational, but really frustrating to me:

  1. Sort the table by start IP ascending, then end IP descending - this makes the parents come before the children in the list
  2. Create a table variable to store the previous parents - so when you finished with a range you will automatically find yourself in its parent
  3. Use a cursor to move through all the items and for each one:
  4. Remove all parents that ended before the current item starts - removes siblings for the list
  5. Get the last parent in the list - that is the current parent range
  6. Set the parent id to be the one of the last parent


It's that deceptively simple and the query now ends in 15 minutes instead of days.

Another issue that might be interesting is that after the original import is created, the new records added to the table should be just a few. In that case, the first join and update might work faster! The next thing that I will do is count how many items I need to update and use one method or another based on that.

Hope that helps someone.