I have been asking this of people at the interviews I am conducting and I thought I should document the correct answer and the expected behavior. And yes, we've filled the position in for this interview question, so you can't cheat :)

The question is quite banal: given two tables (TableA and TableB) both having a column ID, select the rows in TableA that don't have any corresponding row in TableB with the same ID.

Whenever you are answering an interview question, remember that your thinking process is just as important as the answer. So saying nothing, while better than "so I am adding 1 and 1 and getting 2", may not be your best option. Assuming you don't know the answer, a reasonable way of tackling any problem is to take it apart and try to solve every part separately. Let's do this here.

As the question requires the rows in A, select them:

SELECT * FROM TableA

Now, a filter should be applied, but which one? Here are some ideas:

  1. WHERE ID NOT IN (SELECT ID FROM TableB)
  2. WHERE NOT EXISTS (SELECT * FROM TableB WHERE TableA.ID=TableB.ID)
  3. EXCEPT SELECT ID FROM TableB -- this requires to select only ID from TableA, as well (EXCEPT and INTERSECT are new additions to SQL 2019)

Think about it. Any issues with any of them? Any other options?

To test performance, I've used two tables with approximately 35 million rows. Here are the results:

  1. After 17 minutes I had to stop the query. Also, NOT IN has issues with NULL as a value is nether equal or unequal to NULL. SELECT * FROM Table WHERE Value NOT IN (NULL) for example, will always return no rows.
  2. It finished within 4 seconds. There are still issues with NULL, though, as a simple equality would not work with NULL. Assuming we wanted the non-null values of TableA, we're good.
  3. It finished within 5 seconds. This doesn't have any issues with NULL. SELECT NULL EXCEPT SELECT NULL will return no rows, while SELECT 1 EXCEPT SELECT NULL will return a row with the value 1. The syntax is pretty ugly though and works badly if the tables have other columns

What about another solution? We've exhausted simple filtering, how about another avenue? Whenever we want to combine information from two tables we use JOIN, but is that the case here?

SELECT * FROM TableA a
JOIN TableB b
ON a.ID = b.ID -- again, while I would ask people in the interview about null values, we will assume for this post that the values are not nullable

I've used a JOIN keyword, which translates to an INNER JOIN. The query above will select rows from A, but only those that have a correspondence in B. A funny solution to a slightly different question: count the items in A that do not have corresponding items in B:

SELECT (SELECT COUNT(*) FROM TableA) - (SELECT COUNT(*) FROM TableA a JOIN TableB b ON a.ID = b.ID)

However, we want the inverse of the INNER JOIN. What other types of JOINs are there? Bonus interview question! And the answers are:

  • INNER JOIN - returns only rows that have been successfully joined
  • OUTER JOIN (LEFT AND RIGHT) - returns all rows of one table joined to the corresponding values of the other table or NULLs if none
  • CROSS JOIN - returns a 

INNER would not work, as demonstrated, so what about a CROSS JOIN? Clearly not, as it will generate 100 trillion rows before filtering anything. SQL Server would optimize a lot of the query, but it would look really weird anyway.

Is there a solution with OUTER JOIN? RIGHT OUTER JOIN will get the rows in B, not in A, so LEFT OUTER JOIN, by elimination, is the only remaining possible solution.

SELECT a.* FROM TableA a 
LEFT OUTER JOIN TableB b
ON a.ID=b.ID

This returns ALL the rows in table A and for each of them, rows in table B that have the same id. In case of a mismatch, though, for a row in table A with no correspondence in table B, we get a row of NULL values. So all we have to do is filter for those. We know that there are no NULLs in the tables, so here is another working solution, solution 4:

SELECT a.* FROM TableA a 
LEFT OUTER JOIN TableB b
ON a.ID=b.ID
WHERE b.ID IS NULL

This solves the problem, as well, in about 4 seconds. However, the other working solution within the same time (solution 2 above) only works as well because newer versions of SQL server are optimizing the execution. Maybe it's a personal preference from the times solution 4 was clearly the best in terms of performance, but I would chose that as the winner.

Summary

  • You can either use NOT EXISTS (and not NOT IN!) or a LEFT OUTER JOIN with a filter on NULL b values.
  • It's important to know if you have NULL values in the joining columns and it's extra points for asking that from your interviewer
  • If not asking, I would penalize solutions that do not take NULL values in consideration. Extra complexity of code, as one cannot simply check for NULL for solution 4. Also a decision has to be made on the expected behavior when working with NULL values
  • When trying to find the solution to a problem in an interview:
    • think of concrete examples of the problem so you can test your solutions
    • break the problems into manageable bits if possible
    • think aloud, but to the point
  • Also, there is nothing more annoying than doing that thing pupils in school do: looking puppy eyed at the teacher while listing solutions to elicit a response. You're not in school anymore. Examples are dirty, time is important, no one cares about your grades.
  • Good luck out there!

A funny feature that I've encountered recently. It's not something most people would find useful, but it helps tremendously with tracing and debugging what is going on. It's easy, just add .TagWith(someString) to your LINQ query and it will generate comments in SQL. More details here: Query tags.

Just a heads up on a really useful blog post: Script to Delete Data from SQL Server Tables with Foreign Key Constraints

Basically it creates a stored procedure to display the dependencies based on a table name and a condition, then some other code to generate hierarchical delete code.

Tonight I went to an ADCES presentation about SQL table partitioning, a concept that allows for a lot of flexibility while preserving the same basic interface for a table one would use for a simpler and less scalable application. The talk was very professionally held by Bogdan Sahlean and you should have been there to see it :)

He talked about how one can create filegroups on which a table can be split into as many partitions as needed. He then demonstrated the concept of partition switching, which means swapping two tables without overhead, just via metadata, and, used in the context of partitions, the possibility to create a staging table, do stuff on it, then just swap it with a partition with no downtime. The SQL scripts used in the demo can be found on Sahlean's blog. This technology exists since SQL Server 2005, it's not something terribly new, and features with similar but limited functionality existed since SQL Server 2000. Basically the data in a table can be organized in separate buckets and one can even put each partition on a different drive for extra speed.

Things I've found interesting, in no particular order:
  • Best practice: create custom filegroups for databases and put objects in them, rather than in the primary (default) filegroup. Reason: each filegroup is restored separately,
    with the primary being the first and the one the database restore waits for to call a database as online. That means one can quickly restore the important data and see the db online, while the less accessed or less important data, like archive info, loaded afterwards.
  • Using constraints with CHECK on tables is useful in so many ways. For example, even since SQL Server 2000, one could create tables on different databases, even different servers, and if they are marked with not overlapping checks, one can not only create a view that combines all data with UNION ALL, but also insert into the view. The server will know which tables, databases and servers to connect to. Also, useful in the partition presentation.
  • CREATE INDEX with a DROP_EXISTING hint to quickly recreate or alter clustered indexes. With DROP_EXISTING, you save one complete cycle of dropping and recreating nonclustered indexes. Also, if specifying a different filegroup, you are effectively moving the data in a table from a filegroup to another.
  • Finally, the SWITCH TO partition switching can be used to quickly swap two tables, since from Sql Server 2005 all tables are considered partitioned, with regular ones just having one partition. So one creates a table identical in structure with another, does whatever with it, then just uses something like this: ALTER TABLE Orders SWITCH PARTITION 1 TO OrdersHistory PARTITION 1; to swap them out, with minimal overhang.

This clause is so obscure that I couldn't even find the Microsoft reference page for it for a few minutes, so no wonder I didn't know about it. Introduced in SQL Server 2005, the TABLESAMPLE clause limits the number of rows returned from a table in the FROM clause to a sample number or PERCENT of rows.

Usage:
TABLESAMPLE (sample_number [ PERCENT | ROWS ] ) [ REPEATABLE (repeat_seed) ]

REPEATABLE is used to set the seed of the random number generator so one can get the same result if running the query again.

It sounds great at the beginning, until you start seeing the limitations:
  • it cannot be applied to derived tables, tables from linked servers, and tables derived from table-valued functions, rowset functions, or OPENXML
  • the number of rows returned is approximate. 10 ROWS doesn't necessarily return 10 records. In fact, the functionality underneath transforms 10 into a percentage, first
  • a join of two tables is likely to return a match for each row in both tables; however, if TABLESAMPLE is specified for either of the two tables, some rows returned from the unsampled table are unlikely to have a matching row in the sampled table.
  • it isn't even that random!

Funny enough, even the reference page recommends a different way of getting a random sample of rows from a table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

Even if probably not really usable, at least I've learned something new about SQL.

Update:
More about getting random samples from a table here, where it explains why ORDER BY NEWID() is not the way to do it and gives hints of what really happens in the background when we invoke TABLESAMPLE.
Another interesting article on the subject, focused more on the statistical probability, can be found here, where it also shows how TABLESAMPLE's cluster sampling may fail in spectacular ways.

I've met an interesting case today when we needed to manipulate data from tens of thousands of people daily. Assuming we would use table rows for the information, then we get a table in which rows are constantly added, updated and deleted. The issue is with the space allocated in table pages.

SQL works like this: If it needs space it allocates some as a "page" which can contain more records. When you delete records the space is not reclaimed, it remains as is (this is called ghosting). The exception is when all records in a page are deleted, in which case the page is reused as an empty page. When you update a record with more data then it held before (like when you have a variable length column), the page is split, with the rest of the records on the page moved to a new page.

In a heap table (no clustered index) the space inside pages is reused for new records or for updated records that don't fit in their allocated space, however if you use a clustered index, like a primary key, the space is not reused, since there needs to be a correlation between the value of the column and its position in the page. And here lies the problem. You may end up with a lot of pages with very few records in them. A typical page is 8 kilobytes, so a table with a few integers in a record can hold hundreds of records on a single page.

Fragmentation can be within a page, as described above, also called internal, but also external, between pages, when the recycled pages are used for data that is out of order. To get a large swathe of records the disk might be worked hard in order to jump from page to page to get what is logically a continuous blob of data. It is disk input/output that kills a database.

OK, back to our case. A possible solution was to store all the data for a user in a "blob", a VARBINARY column. For reads or changes only the disk space occupied by the blob would be changed, with C# code handling everything. It's what is called trading CPU for IO, which is generally good. However this NoSql-like idea itself smelled badly to me. We are supposed to trust our databases, not work against them. The solution I chose is monitoring index fragmentation and occasionally issuing clustered index rebuilding or reorganizing. I am willing to bet that reading/writing the data equivalent to several pages of table is going to be more expensive than selecting the changes I want to make. Also, rebuilding the index will end up storing all the data per user in the same space anyway.

However, this case made me think. Here is a situation in which the solution might have been (and it was in a similar case implemented by someone else) to micromanage the way the database works. It made me question using a clustered index/primary key on a table.

These articles helped me understand more:

I was glad to attend the 565th SQLSaturday in Bucharest yesterday and, while all presentations were cool, I wanted to share with you some specific points that I found very revealing. Without further ado, here is the list:
  • SQL execution plans are read from right to left - such a simple thing, but I remember when I was trying to read them from left to right and I didn't get anything. In SQL Server Management 2016 you also get a "live" version, which shows you an execution plan while it's executing. Really useful to see where the blocking operations are.
  • Manually control your statistics update - execution plans are calculated based on statistics, but the condition for updating the statistics is to have changes in a number of 20% of the rows plus 500 of any table. This default setting is completely arbitrary and may cause a lot of pain. Not only updating the statistics blocks your table (which means more chances that the table will be locked when it is most used), but sometimes the statistics are not useful. One example are reports which may receive a startdate/enddate range or a count or something like that which makes the number of rows affected vary immensely with different parameters. Use OPTION(RECOMPILE) for that.
  • Look for a difference between estimated and actual rows in a query plan, which leads to tempdb spills, which leads to unwanted IO operations - before a query, an execution plan is created or reused, based on statistics, as I was saying above. Once a plan has been chosen, though, it doesn't change during its execution. Basically what this means is that the structure of the plan remains unchanged between the estimated and actual plan. Also based on the plan, memory is requested and never changed. So if the plan asks for 10KB of memory and you need 1000KB, the rest of 990 will be stored and used from tempdb even if there is enough memory to put them in, since the memory requirements don't change from estimated to actual. The reverse is not much better, since a plan may ask for a lot of memory when it only needs little, thus making everything else on that machine have less available resources.
  • SQL default settings suck - there was an entire presentation about that, it is useful to think a little about it. So many settings are legacy things that make no sense, like the initial database size, the autogrow size, index fill factors, maxdop (degree of parallelism), parallelism threshold, used memory (ironically, using all of it may be hurtful as it takes it away from other processes which leads to using the swap file), etc.
  • Look for hard page faults - this counter is much useful than the soft page faults, which are fixable faults. A hard page fault is indicative of unnecessary IO operations, which are orders of magnitude slower than memory use.

There are a lot more things that I want to explore now, since I participated to the event. You may find the files for the presentations in the same place as the full list of talks at SQLSaturday.

The new Datetime2 data type introduced in Microsoft SQL Server 2008 has several advantages over the old Datetime type. One of them is precision in 100 nanoseconds rather than coarse milliseconds, another is that is has a larger date range. It has disadvantages, too, like a difficulty in translating it into numerical values. It was a classic hack to CONVERT/CAST a Datetime into a Float in order to get a numerical value that you could manipulate (like convert it to an integer to get the date without time, which is now accomplished by converting it to Date, another type introduced in SQL Server 2008). There are many reasons why one needs to translate a datetime into a numerical value, I don't get into that here. So here is how to convert a Datetime2 value into a Float.

First solution:
DATEDIFF(SECOND,{d '1970-01-01'}, @Time)+DATEPART(nanosecond,@Time)/1.0E+9
- returns a value in seconds with nanosecond precision from the beginning of the year 1970. Advantage: simple to use and understand. Disadvantage: not similar to the conversion from Datetime to Float.

Second solution:
DATEDIFF(DAY,{d '1900-01-01'}, @Time)+DATEPART(HOUR,@Time)/24.0+DATEPART(MINUTE,@Time)/(24.0*60)+DATEPART(SECOND,@Time)/(24.0*60*60)+DATEPART(nanosecond,@Time)/(24.0*60*60*1.0E+9)
- returns a value that is similar to the float conversion of a datetime. Advantage: doesn't lose precision like converting to a Datetime and then to Float. Disadvantage: look at the length of that!

Final solution:
25567+(DATEDIFF(SECOND,{d '1970-01-01'}, @Time)+DATEPART(nanosecond,@Time)/1.0E+9)/86400.0
- combines the two solutions above. It easy to read and understand. It computes the number of seconds with nanoseconds from 1970, divides by 86400 to get the number of days and adds 25567, which is the number of days between 1900 and 1970.

As a software developer - and by that I mean someone writing programs in C#, Javascript and so on, and occasionally using databases when something needs to be stored somewhere - I have an instinctual fear of the Arrow Anti-pattern. Therefore I really dislike stuff like NOT EXISTS(SELECT 1 FROM Something). However, recent tests have convinced me that this is the best solution for testing for existing records. I am not going to reinvent the wheel; here are some wonderful links regarding this, after which I will summarize:

Let's say you want to insert in a table all records from another source that do not already exist in the table. You have several options, but the most commonly used are:
SELECT *
FROM SourceTable
LEFT JOIN DestinationTable
ON SomeCondition
WHERE DestinationTable.Id IS NULL
and
SELECT *
FROM SourceTable
WHERE NOT EXIST(SELECT 1 FROM DestinationTable WHERE SomeCondition)

Personally I prefer the first version, for readability reasons as well as having listened to the mantra "Never do selects in selects" for all my life. However, it becomes apparent that the second version is a lot more efficient. The simple reason is that for the first example Microsoft SQL Server will first join the two tables in memory, then filter. If you have multiple combinations of records that satisfy the condition this will result in some huge memory and CPU usage, especially if you have no indexes defined and, sometimes, because you have some indexes defined. The second option uses one of the few methods guaranteed to exit, NOT EXISTS, which immediately invalidates a record at the first match.

Other options involve using the EXCEPT or INTERSECT operations in SQL, but they are not really helping. Intersecting ids, for example, then inner joining with SourceTable works, but it is somewhere in between the two solutions above and it looks like crap as well. Join hints don't help either.

The OUTPUT clause is a very useful tool in Microsoft SQL, allowing for getting automatically inserted columns in the same command as the INSERT. Imagine you have a table with an identity column and you need the generated ids as you insert new records. It would look like this:
CREATE TABLE MyTable 
(
Id INT PRIMARY KEY IDENTITY(1, 1),
Value NVARCHAR(100)
)

CREATE TABLE AnotherTable
(
Value NVARCHAR(100),
AnotherValue NVARCHAR(100),
SomeConditionIsTrue BIT
)

go

CREATE TABLE #ids
(
Id INT ,
AnotherValue NVARCHAR(100)
)

INSERT INTO MyTable (Value)
OUTPUT inserted.Id INTO #ids (id)
SELECT Value
FROM AnotherTable
WHERE SomeConditionIsTrue = 1

-- Do something with the inserted Ids

However, what do you do if you want to also insert the column AnotherValue to the #ids table? Something like this does not work:
INSERT INTO MyTable (Value) 
OUTPUT inserted.Id,AnotherTable.AnotherValue INTO #ids (id,AnotherValue)
SELECT Value
FROM AnotherTable
WHERE SomeConditionIsTrue = 1

Enter the often ignored MERGE, which can help us translate the query above into:
MERGE INTO MyTable USING (
SELECT Value , AnotherValue
FROM AnotherTable
WHERE SomeConditionIsTrue = 1
) t ON 1=0 --FALSE
WHEN NOT MATCHED THEN
INSERT (Value) VALUES (t.Value)
OUTPUT Inserted.Id, t.AnotherValue INTO #ids (Id, AnotherValue);

Note the 1=0 condition so that the merge never "matches" and how the select from the first query now contains all the columns needed to be output, even if only some of them are inserted in the insert table.

This post was prompted by a StackOverflow answer that, as great as it was, didn't make it clear what to do when you get your values from a more complicated select. The answer is simple: put it all in the 'using' table.

Change Data Capture is a useful mechanism for tracking data changes in Microsoft SQL Server. Once you enable it for a database and some of its tables, it will create a record of any change to the data of those tables. In order to use the captured data, Microsoft recommends using the table functions provided by the mechanism, mainly [cdc].[fn_cdc_get_all_changes_dbo_<table name>] and [cdc].[fn_cdc_get_net_changes_<table name>]. However, when you use them, your very first experience might be an error that looks like this: Msg 313, Level 16, State 3, Line 20 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_ ... . Since this is an error message everyone associates with missing a parameter for a function, one might assume that the documentation is wrong and one must add another parameter. You try adding a bogus one and you get Msg 8144, Level 16, State 3, Line 9 Procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name> has too many arguments specified. which brings confusion. One hint is that if we use one less parameter than in the documentation, the error is slightly different Msg 313, Level 16, State 3, Line 9 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name>. In this error message, the tracked table name is specified in the function name, as opposed to the other where ... is used instead. What is going on?

The documentation for the function (which, as usual, nobody reads past the usage syntax and the code examples - just read the Remarks section) says this: If the specified LSN range does not fall within the change tracking timeline for the capture instance, the function returns error 208 ("An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes.")., which of course is the explanation for this weird behaviour, but why and when does it happen?

The why comes from a Microsoft Connect page where an overly honest developer explains that the reason for the obscure error message is the antiquated error and function system used in T-SQL: The issue here is the inability to do raiseerror from within a function that prevents us from bubbling up meaningful error message. If one looks at the source of cdc.fn_cdc_get_all_changes_dbo_<table name>, one sees that the error is thrown from another function, a system one, called [sys].[fn_cdc_check_parameters]. Doing a select on it we get the same original error which is now slightly humourous, since it comes from a completely different function than the one in the message. Since it is a system function this time, there is no source code for it.

The when is more tricky and it shows that they didn't think this through much. First of all, whenever you send a NULL or an empty (0x0000...) value to the function as the begin or end LSN you get this error message. The code examples from Microsoft always show these mysterious LSN values being received from functions like sys.fn_cdc_get_min_lsn('<schema name>_<table name>'), sys.fn_cdc_get_max_lsn(), sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE()) and so on, but they are hardly easy to understand, as they return an empty value for wrong parameters. For example, a common reason why your code fails is from getting the LSN like this: sys.fn_cdc_get_min_lsn('MyWonderfulTable') when in fact you need to use the schema in the name as well: sys.fn_cdc_get_min_lsn('dbo_MyWonderfulTable'). You have to use this syntax everywhere. Funny enough, if the tracked table is empty, you get the lowest LSN for the entire database, but if you use a wrong database name (or without the schema, or NULL, etc) you get an empty LSN. How an empty LSN is not the minimum LSN is beyond me.

My solution? Just select from the tables directly. Yeah, I know, it's bad, unrecommended by Microsoft, reinventing the wheel. But it works and I don't get weird functions messing up my flow with obscure error messages. Just remember to take a look at the cdc.* functions and see how they are written.

So, to summarize: The error message is misleading and it's all they could do within the confines of the T-SQL function error system. Remember to use the schema in the string defining the table in the cdc functions (ex: dbo_MyTable). In case you really want to be flexible, interrogate the cdc tables directly.

I had this situation where I had to execute large SQL script files and the default sqlcmd tool was throwing exceptions rather than execute them. So I created my own tool to read the scripts and execute them transactionally. Everything went smoothly, except at the end. You see, I didn't use TransactionScope.Complete from the beginning in order to see how the program would cope with a massive rollback. Not well, apparently.

The exception was thrown at rollback: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. I had set the TransactionOptions.Timeout to TransactionManager.MaximumTimeout and the SqlCommand.CommandTimeout to 0 (meaning never end) and I still got the exception. Apparently, the problem was the SqlConnection.ConnectTimeout which is a readonly property with a default value of 15 seconds. The value can be changed via the connection string, by adding something like Connect Timeout=36000 (10 hours) and many articles on the Internet suggest doing that. However, that is just really ugly. A better solution is to set the value of the timeout programmatically and this is how to do it:
var timeout = TimeSpan.FromHours(10);
SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder(connectionString);
csb.ConnectTimeout = (int)timeout.TotalSeconds;
connectionString = csb.ConnectionString;

As you can see, the nice SqlConnectionStringBuilder helps us validate, parse and change the values in the connection string. One can imagine other interesting uses, like adding MARS to the connection string automatically or restricting the use to a list of databases or disallowing weak passwords, etc.

Update: after another very useful comment from NULLable, I tried several new ideas:
  • range queries - trying to specify that the child StartIp, for example, is not only greater or equal to the parent StartIp, but also less or equal to the parent EndIp. In my case the query didn't go faster and adding new indexes as recommended in the comment made is slower. I believe it is because the range values are not static or just because clustering the start/end IP index is really way faster than any logical implementation of the search algorithm
  • cursor hints - obviously a very important hint that I should add to almost any cursor is LOCAL. A GLOBAL cursor can be accessed from outside the stored procedure and weird things can happen when running the stored procedure twice at the same time. NULLable recommended also STATIC READ_ONLY and FORWARD_ONLY. In truth the performance of the query doesn't really depend on the speed of the cursor, anyway, but I found an article that discusses the various cursor hints and ends up recommending LOCAL FAST_FORWARD. Check it out, it is very informative. My tests showed no real difference in this particular scenario.
  • RI-Tree implementation in SQL - the article that NULLable linked to is amazing! I just don't get it :) I will update this more when I gain more IQ points.

Update 2: I kind of understood the Relational Interval Tree implementation, but I couldn't find a way for it to help me. The code there creates a computed column of the same type as the IP columns then makes a BETWEEN comparison and/or a join or an apply with two table functions. I can't imagine how it could help me since the original query is basically just two BETWEEN conditions. But still a very interesting article.

I wanted to have a database of all Ripe records, in order to quickly determine the Internet Service Provider for an IP. We are discussing IPv4 only, so the structure of the table in the database looked like this:
CREATE TABLE [dbo].[RipeDb](
[Id] [int] IDENTITY(1,1) NOT NULL,
[StartIp] [bigint] NULL,
[EndIp] [bigint] NULL,
[NetName] [nvarchar](450) NULL,
[StartTime] [datetime2](7) NULL,
[EndTime] [datetime2](7) NULL,
[ParentId] [int] NULL)

As you can see, I translate IPs into BIGINT so that I can quickly sort and select stuff. I also added a ParentId column that represents the parent ISP, as you have some huge chunk of IPs, split and sold to other ISPs, which in turn are selling bits of the IP range they own to others and so on. The data I receive, though, is a simple text file with no hierarchical relations.

The task, therefore, is to take a table like described above, with more than four million records, and for each of them find their parent, if any.

The simplest idea is to join the table with itself like this:
SELECT rp.Id as ParentId, 
r.Id
FROM RipeDb r
INNER JOIN RipeDb rp
ON rp.StartIp <= r.StartIp
AND rp.EndIp >= r.EndIp
AND rp.EndIp - rp.StartIp > r.EndIp - r.StartIp
This gets all ancestors for each record, so we need to use a RANK() OVER() in an inner select in order to select only the parent, but that's beyond the scope of the article.

Since we have conditions on the StartIp and EndIp columns, we need an index on them. But which?

Through trial and error, more than anything else, I realised that the best solution is a clustered index on StartIp,EndIp. That is why the first column (Id) is not marked as PRIMARY KEY in the definition of the table, because it has to look like this:
[Id] [int] PRIMARY KEY NONCLUSTERED IDENTITY(1,1) NOT NULL
. Yes, primary keys don't have to be clustered.

But now you hit the snag. The process is EXTREMELY slow. Basically on my computer this query would end in a few days (as opposed to twice as much with a nonclustered index). What the hell is going on?

I tried several things:
  • JOIN hints (Merge, Loop and Hash joins) - the query optimizer seems to choose the best solution anyway
  • Various index combinations - nothing beats a clustered index
  • Taking a bunch of records and joining only them in a WHILE loop - it doesn't fill up the temp db, but it is just as slow, if not worse

At this point I kind of gave up. Days of work trying to figure out why this is going so slow reached a simple solution: 4 million records squared means 16 thousand billion comparisons. No matter how ingenious SQL would be, this will be slow. "But, Siderite, I have tables large like this and joining them is really fast!" you will say. True, with equality the joins are orders of magnitude faster. Probably there is either place for improvement in the way I used the indexes or in the way they are implemented. If you have any ideas, please let me know.

So did I solve the problem? Yes, of course, by not relying on an SQL join. Think about how the ranges are arranged. If we order the IP ranges on their start and end values, you get something like this:



For each range, the following is either a direct child or a sibling. I created a stored procedure that called itself recursively, which should have worked, but then it reached the maximum level of recursion in SQL (32 - a value that one cannot change!) and so I had to do everything myself. How? With a cursor. Here is the final code:
DECLARE @ParentIds TABLE (Id INT,StartIp BIGINT, EndIp BIGINT)
DECLARE @ParentId INT
DECLARE @Id INT
DECLARE @StartIp BIGINT
DECLARE @EndIp BIGINT
DECLARE @OldParentId INT

DECLARE @i INT=0
DECLARE @c INT

DECLARE curs CURSOR LOCAL FAST_FORWARD FOR
SELECT r.Id, r.StartIp, r.EndIp, r.ParentId
FROM RipeDb r
WHERE r.EndTime IS NULL
ORDER BY StartIp ASC, EndIp DESC

OPEN curs

FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp, @OldParentId

WHILE @@FETCH_STATUS=0
BEGIN

DELETE FROM @ParentIds WHERE EndIp<@StartIp

SET @ParentId=NULL
SELECT TOP 1 @ParentId=Id FROM @ParentIds
ORDER BY Id DESC

SELECT @c=COUNT(1) FROM @ParentIds

IF (@i % 1000=0)
BEGIN

PRINT CONVERT(NVARCHAR(100),SysUtcDatetime())+' Updated parent id for ' + CONVERT(NVARCHAR(100),@i) +' rows. ' + CONVERT(NVARCHAR(100),@c) +' parents in temp table.'
RAISERROR ('', 0, 1) WITH NOWAIT

END
SET @i=@i+1

IF (ISNULL(@OldParentId,-1) != ISNULL(@ParentId,-1))
BEGIN
UPDATE RipeDb SET ParentId=@ParentId WHERE Id=@Id
END

INSERT INTO @ParentIds VALUES(@Id,@StartIp,@EndIp)

FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp
END

CLOSE curs
DEALLOCATE curs

I will follow the explanation of the algorithm, for people hitting the exact issue that I had, but let me write the conclusion of this blog post: even if SQL is awesome in sorting and indexing, it doesn't mean that is the only solution. In my case, the SQL indexes proved to be a golden hammer that wasted days of my work.

So, the logic here is really simple, which makes this entire endeavour educational, but really frustrating to me:
  1. Sort the table by start IP ascending, then end IP descending - this makes the parents come before the children in the list
  2. Create a table variable to store the previous parents - so when you finished with a range you will automatically find yourself in its parent
  3. Use a cursor to move through all the items and for each one:
  4. Remove all parents that ended before the current item starts - removes siblings for the list
  5. Get the last parent in the list - that is the current parent range
  6. Set the parent id to be the one of the last parent

It's that deceptively simple and the query now ends in 15 minutes instead of days.

Another issue that might be interesting is that after the original import is created, the new records added to the table should be just a few. In that case, the first join and update might work faster! The next thing that I will do is count how many items I need to update and use one method or another based on that.

Hope that helps someone.

This post will discuss the possibility of creating an SQL injection using the name of a parameter.

We have known about SQL injection since forever: a programmer is constructing an SQL command using user provided information and some malicious hacker inserts a specially crafted string that transforms the intended query into something that returns the content of a user table or deletes some data on the server. The solution for this, in the .NET world, is to use parameterized queries. Even if someone dynamically creates an SQL query, they should use parameter names and then provide the parameters to the SQL command. The idea behind this is that any value provided by the users will be escaped correctly using parameters. Today I found that this solution works perfectly for parameter values, but less perfectly for parameter names. Try to use a parameter with a name containing a single quote and you will get an error. Let's analyse this.

Here is a piece of code that just executes some random SQL text command, but it also adds a parameter containing a single quote in the name:
using (var conn = new SqlConnection("Server=localhost;Database=Test;UID=sa;Trusted_Connection=True;"))
{
conn.Open();
using (var comm = new SqlCommand())
{
var paramName = "a'";

comm.Connection = conn;

comm.CommandText = "SELECT 1";
comm.Parameters.Add(new SqlParameter(paramName, SqlDbType.NVarChar, 100)
{
Value="text"
});

comm.ExecuteNonQuery();
}
conn.Close();
}

As you can see, the text content of the SQL command is irrelevant. the name of the parameter is important and for this I just used a single quote in the name. Running SQL Profiler, we get this query string that is actually executed:
exec sp_executesql N'SELECT 1',N'@a'' nvarchar(100)',@a'=N'text'
In this example the name of the parameter is properly handled in the string defining the name and type of the parameters, but it is NOT escaped during the parameter value declaration. A small change of the code with paramName="a='';DELETE * FROM SomeTable --" results in an interesting query string in the SQL Profiler:
exec sp_executesql N'SELECT 1',N'@a='''';DELETE FROM SomeTable -- nvarchar(100)',@a='';DELETE FROM SomeTable --=N'text'
Strangely enough, when inspecting the SomeTable table, the values are still there, even if copying the text into SQL Management Studio actually deletes the values. A similar construction using stored procedures leads to a completely legal SQL that is recorded by SQL Profiler, but it doesn't really do anything:
using (var conn = new SqlConnection("Server=localhost;Database=Test;UID=sa;Trusted_Connection=True;"))
{
conn.Open();
using (var comm = new SqlCommand())
{
var paramName = "a='';DELETE FROM SomeTable --";

comm.Connection = conn;
comm.CommandText = "DoTest";
comm.CommandType = CommandType.StoredProcedure;
comm.Parameters.Add(new SqlParameter(paramName, SqlDbType.NVarChar, 100)
{
Value="text"
});

int k = 0;
using (var reader = comm.ExecuteReader())
{
while (reader.Read()) k++;
}
Console.WriteLine(k);
}
conn.Close();
}
... with the resulting SQL:
exec DoTest @a='';DELETE FROM SomeTable --=N'text'

I have demonstrated a method of sending a maliciously crafted parameter name to the SqlCommand class and managing to send to the SQL Server a query that should achieve a destructive result. So why doesn't it actually do anything?

The explanation is in the EventClass column of the SQL Profiler. While a normally executed SQL command (let's say from SQL Management Studio) has event classes of SQL:BatchStarting and SQL:BatchCompleted, the query resulting from my attempts have an EventClass of RPC:Completed. It appears that the RPC method of sending queries to the SQL Server doesn't allow for several commands separated by a semicolon. The result is that the first command is executed and the rest are apparently being ignored. The TDS protocol documentation shows that the RPC method is a system of executing stored procedures by sending a binary structure that contains stuff like the name of the procedure, the parameters and so on. Since an SQL text is actually translated into a call to the sp_executesql stored procedure, RPC is used for both types of SqlCommand: Text and StoredProcedure.

I don't have the time to explore this further, but I wonder if this can be used for any type of SQL injection. To make sure, try to check the names of the parameters, if they come from user input.

Update: This post discusses shrinking the data file of a Microsoft SQL database, caused in this case by misconfiguring the initial size of the database. For shrinking the log file one must at least use type 1, not 0, in the query. Also, a very pertinent comment from NULLable warns of the performance issues related to shrinking database files resulting from the fragmentation of the file.

I had this situation when the available space on the SQL database disk was less than the size of the database, in this case the temp database. Someone had wrongly configured the database to have an initial size of 64GB. Changing the size of the file in Microsoft SQL Management Studio doesn't work because it tries to create a different file, fill it with the data and then replace the file. No space for that. Also, it is damn slow, even if you have the space (I have no idea why). Shrink doesn't work either, because the database will not go smaller than the configured initial size. Time to do it command line style. Well, with sql queries, but you know what I mean.

The code for it goes like this:
USE [master];
GO

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE ('ALL');
DBCC FREESESSIONCACHE;
GO

USE [tempdb]
GO

DBCC SHRINKFILE (tempdev, 3000); --- New file size in MB
As you can see, you need to know not only the name of the database, but also the logical name of the database file that you want to shrink. It is not even a string, it is like a keyword in the DBCC SHRINKFILE command. Even if it does work, one would benefit from encapsulating it into a stored procedure. Here is the final code:
CREATE PROC ShrinkDatabase(@DbName NVARCHAR(100),@SizeMB INT)
AS
BEGIN


DECLARE @filename NVARCHAR(255)

DECLARE @sql NVARCHAR(Max) = 'SELECT @filename = dbf.name FROM ['+REPLACE(@DbName,'''','''''')+'].sys.database_files dbf WHERE dbf.[type]=0'
EXEC sp_executesql @sql,N'@filename NVARCHAR(255) OUTPUT',@filename OUTPUT

SET @sql='USE [master];
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE ('
'ALL'');
DBCC FREESESSIONCACHE;'

EXEC sp_executesql @sql

SET @sql='USE ['+REPLACE(@DbName,'''','''''')+'];
DBCC SHRINKFILE ('
+REPLACE(@filename,'''','''''')+', '+CONVERT(NVARCHAR(100),@SizeMb)+');'
EXEC sp_executesql @sql

END

Create it in the master database and use it like this:
EXEC master.dbo.ShrinkDatabase 'tempdb',3000
Take note that you cannot use this to "shrink up" the database. If the value you set is larger than the current size, the file will remain the same size as well as the setting for the initial size. Also take note of the fact that this stored procedure only shrinks the data file, not the log file (dbf.[type]=0).