Update: more analysis shows that the change was not because of the OR clauses, but because of some debug OPTIONs that I had used. This post is thus wrong.

Original post:

  So I had this stored procedure that would calculate counts from a table, based on a specific column which was also indexed. Something like this:

SELECT Code, COUNT(*) as Nr 
FROM MyTable

  The code would take the result of this query and only use the counts for some of the codes, let's say 'A', 'B' and 'C'. There was also a large number of instructions that had a NULL Code. So the obvious optimizations was:

SELECT Code, COUNT(*) as Nr 
FROM MyTable
WHERE Code IN ('A','B','C')

  And it worked, however I was getting this annoying warning in the execution plan: "Operator used tempdb to spill data during execution". What the hell was that?

  Long story short, I found a very nice SO answer that explains it: SQL Server cardinality estimation can use two types of statistics to guess how many rows will get through a predicate filter:

  • about the column on average using the density vector
  • about particular values for that column using the histogram

When a literal is used, the cardinality estimator can search for that literal in the histogram. When a parameter is used, its value is not evaluated until after cardinality estimation, so the CE has to use column averages in the density vector.

  Probably, behind the scenes, ('A','B','C') is treated as a variable, so it only uses the density vector. Also, because how the IN operator is implemented, what happens to the query next is very different than replacing it with a bunch of ORs:

SELECT Code, COUNT(*) as Nr 
FROM MyTable
WHERE (Code='A' OR Code='B' OR Code='C')

  Not only the warning disappeared, but the execution time was greatly reduced!  

  You see, the query here is simplified a lot, but in real life it was part of a larger one, including complicated joins and multiple conditions. With an IN clause, the execution plan would only show me one query, containing multiple joins and covering all of the rows returned. By using OR clauses, the execution plan would show me three different queries, one for each code.

  This means that in certain situations, this strategy might not work, especially if the conditions are not disjunct and have rows that meet multiple ones. I am also astonished that for such a simple IN clause, the engine did not know to translate it automatically into a series of ORs! My intent as a developer is clear and the implementation should just take that and turn it into the most effective query possible.

  I usually tell people to avoid using OR clauses (and instead try to use ANDs) or query on values that are different (try for equality instead) or using NOT IN. And the reason is again the execution plan and how you cannot prove a general negative claim. Even so, I've always assumed that IN will work as a series or ORs. My reasoning was that, in case of an OR, the engine would have to do something like an expensive DISTINCT operation, something like this: 

FROM MyTable
WHERE (Code='A' OR Code='B')

-- the above should equate to
FROM MyTable
WHERE Code='A'
UNION -- not a disjunct union so engine would have to eliminate duplicates
FROM MyTable
WHERE Code='B'

-- therefore an optimal query is
FROM MyTable
WHERE Code='A'
UNION ALL - disjunct union
FROM MyTable
WHERE Code='B'
-- assuming that there are no rows that meet both conditions (code A and code B)

  In this case, however, SQL did manage to understand that the conditions were disjunct so it split the work into three, correctly using the index and each of them being quite efficient.

  I learned something today!


  There is a saying that the novice will write code that works, without thinking of anything else, the expert will come and rewrite that code according to good practices and the master will rewrite it so that it works again, thinking of everything. It applies particularly well to SQL. Sometimes good and well tried best practices fail in specific cases and one must guide themselves either by precise measurements of by narrow rules that take decades to learn.

  If you ever wondered why some SQL queries are very slow or how to write complex SQL stored procedures without them reaching sentience and behaving unpredictably, this post might help. I am not a master myself, but I will share some quick and dirty ways of writing, then checking your SQL code.

Some master rules

  First of all, some debunking of best practices that make unreasonable assumptions at scale:

  1. If you have to extract data based on many parameters, then add them as WHERE or ON clauses and the SQL engine will know how to handle it.

    For small queries and for well designed databases, that is correct. The SQL server engine is attempting to create execution plans for these parameter combinations and reuse them in the future on other executions. However, when the number of parameters increases, the number of possible parameter combinations increases exponentially. The execution optimization should not take more than the execution itself, so the engine if just choosing one of the existing plans which appears more similar to the parameters given. Sometimes this results in an abysmal performance.

    There are two solutions:

    The quick and dirty one is to add OPTION (RECOMPILE) to the parameterized SELECT query. This will tell the engine to always ignore existing execution plans. With SQL 2016 there is a new feature called Query Store plus a graphical interface that explores execution plans, so one can choose which ones are good and which ones are bad. If you have the option, you might manually force an execution plan on specific queries, as well. But I don't recommend this because it is a brittle and nonintuitive solution. You need a DBA to make sure the associations are correct and maintained properly.

    The better one, to my own surprise, is to use dynamic SQL. In other words, if you have 20 parameters to your stored procedure, with only some getting used at any time (think an Advanced Search page), create an SQL string only with the parameters that are set, then execute it.

    My assumption has always been that the SQL engine will do this for me if I use queries like WHERE (@param IS NULL OR <some condition with @param>). I was disappointed to learn that it does not always do that. Be warned, though, that most of the time multiple query parameters are optimized by running several operations in parallel, which is best!

  2. If you query on a column or another column, an OR clause will be optimal. 

    Think of something like this: You have a table with two account columns AccId and AccId2. You want to query a lot on an account parameter @accountId and you have added an index on each column.

    At this time the more readable option, and for small queries readability is always preferable to performance improvement, is WHERE AccId=@accountId OR AccId2=@accountId. But how would the indexes be used here, in this OR clause? First the engine will have to find all entries with the correct AccId, then again find entries with the correct AccId2, but only the entries that have not been found in the first search.

    First of all, SQL will not do this very well when the WHERE clause is very complex. Second of all, even if it did it perfectly, if you know there is no overlap, or you don't care or you can use a DISTINCT further on to eliminate duplicates, then it is more effective to have two SELECT queries, one for AccId and the other for AccId2 that you UNION ALL afterwards.

    My assumption has always been that the SQL engine will do this automatically. I was quite astounded to hear it was not true. Also, I may be wrong, because different SQL engines and their multitude of versions, compounded with the vast array of configuration options for both engine and any database, behave quite differently. Remember the parallelism optimization, as well.

  3. Temporary tables as slow, use table variables instead.

    Now that is just simple logic, right? A temporary table uses disk while a table variable uses memory. The second has to be faster, right? In the vast majority of cases this will be true. It all depends (a verb used a lot in SQL circles) on what you do with it.

    Using a temporary table might first of all be optimized by the engine to not use the disk at all. Second, temporary tables have statistics, while table variables do not. If you want the SQL engine to do its magic without your input, you might just have to use a temporary table.

  4. A large query that does everything is better than small queries that I combine later on.

    This is a more common misconception than the others. The optimizations the SQL engine does work best on smaller queries, as I've already discussed above, so if a large query can be split into two simpler ones, the engine will be more likely able to find the best way of executing each. However, this only applies if the two queries are completely independent. If they are related, the engine might find the perfect way of getting the data in a query that combines them all.

    Again, it depends. One other scenario is when you try to DELETE or UPDATE a lot of rows. SQL is always "logging" the changes that it does on the off chance that the user cancels the query and whatever incomplete work has been done has to be undone. With large amounts of data, this results into large log files and slow performance. One common solution is to do it in batches, using UPDATE (TOP 10000) or something similar inside a WHILE loop. Note that while this solves the log performance issue, it adds a little bit of overhead for each executed UPDATE

  5. If I have an index on a DATETIME column and I want to check the records in a certain day, I can use CAST or CONVERT.

    That is just a bonus rule, but I've met the problem recently. The general rule is that you should never perform calculations on columns inside WHERE clauses. So instead of WHERE CAST(DateColumn as DATE)=@date use WHERE DateColumn>=@date AND DateColumn<DATEADD(DAY,1,@date). The calculation is done (once) on the parameters given to the query, not on every value of DateColumn. Also, indexes are now used.

Optimizing queries for dummies

So how does one determine if one of these rules apply to their case? "Complex query" might mean anything. Executing a query multiple times results in very different results based on how the engine is caching the data or computing execution plans.

A lot of what I am going to say can be performed using SQL commands, as well. Someone might want to use direct commands inside their own tool to monitor and manage performance of SQL queries. But what I am going to show you uses the SQL Management Studio and, better still, not that horrid Execution Plan chart that often crashes SSMS and it is hard to visualize for anything that the most simple queries. Downside? You will need SQL Management Studio 2014 or higher.

There are two buttons in the SSMS menu. One is "Include Actual Execution Plan" which generates an ugly and sometimes broken chart of the execution. The other one is "Include Live Query Statistics" which seems to be doing the same, only in real time. However, the magic happens when both are enabled. In the Results tab you will get not only the query results, but also tabular data about the execution performance. It is amazingly useful, as you get a table per each intermediary query, for example if you have a stored procedure that executes several queries in a row, you get a table for each.

Even more importantly, it seems that using these options will start the execution without any cached data or execution plans. Running it several times gives consistent execution times.

In the LiveQuery tables, the values we are interested about are, in order of importance, EstimateIO, EstimateCPU and Rows.

EstimateIO is telling us how much of the disk was used. The disk is the slowest part of a computer, especially when multiple processes are running queries at the same time. Your objective is to minimize that value. Luckily, on the same row, we get data about the substatement that generated that row, which parameters were used, which index was used etc. This blog is not about how to fix every single scenario, but only on how to determine where the biggest problems lie.

EstimateCPU is saying how much processing power was used. Most of the time this is very small, as complex calculations should not be performed in queries anyway, but sometimes a large value here shows a fault in the design of the query.

Finally, Rows. It is best to minimize the value here, too, but it is not always possible. For example a COUNT(*) will show a Clustered Index Scan with Rows equal to the row count in the table. That doesn't cause any performance problems. However, if your query is supposed to get 100 rows and somewhere in the Live Query table there is a value of several millions, you might have used a join without the correct ON clause parameters or something like that.


Let's see some examples of this. I have a Main table, with columns ID BIGINT, Random1 INT, Random2 NVARCHAR(100) and Random3 CHAR(10) with one million rows. Then an Ind table, with columns ID BIGINT, Qfr CHAR(4) and ValInd BIGINT with 10000 rows. The ID table is common with the Main table ID column and the Qfr column has only three possible values: AMT, QTY, Sum.

Here is a demo on how this would work:

DECLARE @r1 INT = 1300000
DECLARE @r2 NVARCHAR(100) = 'a'
DECLARE @r3 CHAR(10) = 'A'
DECLARE @qfr CHAR(4) = 'AMT'
DECLARE @val BIGINT = 500000

DECLARE @r1e INT = 1600000
DECLARE @r2e NVARCHAR(100) = 'z'
DECLARE @r3e CHAR(10)='Z'
DECLARE @vale BIGINT = 600000

FROM Main m
ON m.ID=i.ID
WHERE (@r1 IS NULL OR m.Random1>=@r1)
  AND (@r2 IS NULL OR m.Random2>=@r2)
  AND (@r3 IS NULL OR m.Random3>=@r3)
  AND (@val IS NULL OR i.ValInd>=@val)
  AND (@r1e IS NULL OR m.Random1<=@r1e)
  AND (@r2e IS NULL OR m.Random2<=@r2e)
  AND (@r3e IS NULL OR m.Random3<=@r3e)
  AND (@vale IS NULL OR i.ValInd<=@vale)
  AND (@qfr IS NULL OR i.Qfr=@qfr)

I have used 9 parameters, each with their own values, to limit the number of rows I get. The Live Query result is:

You can see that the EstimateIO values are non-zero only on the Clustered Index Scans, one for each table. Where is how the StmtText looks like: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]),  WHERE:(([@val] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]>=[@val]) AND ([@vale] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]<=[@vale]) AND ([@qfr] IS NULL OR [Test].[dbo].[Ind].[Qfr] as [i].[Qfr]=[@qfr])) ORDERED FORWARD)".

This is a silly case, but you can see that the @parameter IS NULL type of query condition has not been removed, even when parameter is clearly not null.

Let's change the values of the parameters:

DECLARE @r1 INT = 300000

DECLARE @r1e INT = 600000

Now the Live Query result is:

Same thing! 5.0 and 7.2

Now, let's do the same thing with dynamic SQL. It's a little more annoying, mostly because of the parameter syntax, but check it out:


DECLARE @r1 INT = 300000

DECLARE @r1e INT = 600000

SET @sql=N'
FROM Main m
ON m.ID=i.ID
WHERE 1=1 '
IF @r1 IS NOT NULL SET @sql+=' AND m.Random1>=@r1'
IF @r2 IS NOT NULL SET @sql+=' AND m.Random2>=@r2'
IF @r3 IS NOT NULL SET @sql+=' AND m.Random3>=@r3'
IF @val IS NOT NULL SET @sql+=' AND i.ValInd>=@val'
IF @r1e IS NOT NULL SET @sql+=' AND m.Random1<=@r1e'
IF @r2e IS NOT NULL SET @sql+=' AND m.Random2<=@r2e'
IF @r3e IS NOT NULL SET @sql+=' AND m.Random3<=@r3e'
IF @qfr IS NOT NULL SET @sql+=' AND i.Qfr=@qfr'
IF @vale IS NOT NULL SET @sql+=' AND i.ValInd<=@vale'

PRINT @sql

EXEC sp_executesql @sql,
  N'@r1 INT, @r2 NVARCHAR(100), @r3 CHAR(10), @qfr CHAR(4),@val BIGINT,@r1e INT, @r2e NVARCHAR(100), @r3e CHAR(10),@vale BIGINT',

Now the Live Query results are:

At first glance we have not changed much. IO is still 5.0 and 7.2. Yet there are 3 less execution steps. There is no parallelism and the query has been executed in 5 seconds, not 6. The StmtText for the same thing is now: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]), ORDERED FORWARD)". The printed SQL command is:

FROM Main m
ON m.ID=i.ID
WHERE 1=1  AND m.Random1>=@r1 AND m.Random1<=@r1e


Again, this is a silly example. But with some results anyway! In my work I have used this to get a stored procedure to work three to four times faster!

One can optimize usage of IO, CPU and Rows by adding indexes, by narrowing join conditions, by reducing the complexity of executed queries, eliminating temporary tables, partitioning existing tables, adding or removing hints, removing computation from queried columns and so many other possible methods, but they amount to nothing if you cannot measure the results of your changes.

By using Actual Execution Plan together with Live Query Statistics you get:

  • consistent execution times and disk usage
  • a clear measure of what went on with each subquery

BTW, you get the same effect if you use SET STATISTICS PROFILE ON before the query. Yet, I wrote this post with someone that doesn't want to go into extra SQL code in mind. Also, when calculating performance, it is recommended to add a DBCC FREEPROCCACHE line before execution OR add the option RECOMPILE to your query (this doesn't work on a stored procedure execution, you would have to change the SP queries to include RECOMPILE).

I wish I had some more interesting examples for you, guys, but screenshots from the workplace are not something I want to do and I don't do any complex SQL work at home. I hope this helps. 

  On the SQLite reference page for the WITH clause there is a little example of solving a Sudoku puzzle. Using SQL. I wanted to see it in action and therefore I've translated it into T-SQL.

  You might think that there is a great algorithm at play, something that will blow your mind. I mean, people have blogged about Sudoku solvers to hone their programming skills for ages and they have worked quite a lot, writing lines and lines of how clever they were. And this is SQL, it works, but how do you do something complex in it? But no, it's very simple, very straightforward and also performant. Kind of a let down, I know, but it pretty much takes all possible solutions and only selects for the valid ones using CTEs (Common Table Expressions).

  Here is the translation, followed by some explanation of the code:

DECLARE @Board VARCHAR(81) = '86....3...2...1..7....74...27.9..1...8.....7...1..7.95...56....4..1...5...3....81';
WITH x(s,ind) AS
  SELECT sud,CHARINDEX('.',sud) as ind FROM (VALUES(@Board)) as input(sud)
	CONVERT(VARCHAR(81),CONCAT(SUBSTRING(s,1,ind-1),z,SUBSTRING(s,ind+1,81))) as s,
	CHARINDEX('.',CONCAT(SUBSTRING(s,1,ind-1),z,SUBSTRING(s,ind+1,81))) as ind
  FROM x
  INNER JOIN (VALUES('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9')) as digits(z)
            SELECT 1
              FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9)) as positions(lp)
             WHERE z = SUBSTRING(s, ((ind-1)/9)*9 + lp, 1)
                OR z = SUBSTRING(s, ((ind-1)%9) + (lp-1)*9 + 1, 1)
                OR z = SUBSTRING(s, (((ind-1)/3) % 3) * 3
                        + ((ind-1)/27) * 27 + lp
                        + ((lp-1) / 3) * 6, 1)
	WHERE ind>0

  The only changes from the original code I've done is to extract the unsolved puzzle into its own variable and to change the puzzle values. Also, added a more clear INNER JOIN syntax to replace the obnoxious, but still valid, comma (aka CROSS JOIN) notation. Here is the breakdown of the algorithm, as it were:

  • start with an initial state of the unsolved puzzle as a VARCHAR(81) string and the first index of a dot in that string, representing an empty slot - this is the anchor part
  • for the recursive member, join the current state with all the possible digit values (1 through 9) and return the strings with the first empty slot replaced by all valid possibilities and the position of the next empty slot
  • stop when there are no more empty slots
  • select the solutions (no empty slots)

  It's that simple. And before you imagine it will generate a huge table in memory or that it will take a huge time, worry not. It takes less than a second (a lot less) to find the solution. Obviously, resource use increases exponentially when the puzzle doesn't have just one solution. If you empty the first slot (. instead of 8) the number of rows is 10 and it takes a second to compute them all. Empty the next slot, too (6) and you get 228 solutions in 26 seconds and so on.

 The magical parts are the recursive Common Table Expression itself and the little piece of code that checks for validity, but the validity check is quite obvious as it is the exact translation of the Sudoku rules: no same digits on lines, rows or square sections.

  A recursive CTE has three parts:

  • an initial query that represents the starting state, often called the anchor member
  • a recursive query that references the CTE itself, called the recursive member, which is UNIONed with the anchor
  • a termination condition, to tell SQL when to end the recursion

  For us, we started with one unsolved solution, we recursed on all possible valid solutions for replacing the first empty slot and we stopped when there were no more empty slots.

  CTEs are often confusing because the notation seems to indicate something else to a procedural programmer. You imagine doing this without CTEs, maybe in an object oriented programming language, and you think of this huge buffer that just keeps increasing and you have to remember where you left off so you don't process the same partial solution multiple times and you have to clean the data structure so it doesn't get too large, etc. SQL, though, is at heart a declarative programming language, very close to functional programming. It will take care not only of the recursion, but also filter the rows by the final condition of no empty slots while (and sometimes before) it makes the computations.

  Once you consider the set of possible solutions for a problem as a working set, SQL can do wonders to find the solution, provided you can encode it in a way the SQL engine will understand. This is just another example of the right tool for the right job. Hope you learned something.

I have been asking this of people at the interviews I am conducting and I thought I should document the correct answer and the expected behavior. And yes, we've filled the position in for this interview question, so you can't cheat :)

The question is quite banal: given two tables (TableA and TableB) both having a column ID, select the rows in TableA that don't have any corresponding row in TableB with the same ID.

Whenever you are answering an interview question, remember that your thinking process is just as important as the answer. So saying nothing, while better than "so I am adding 1 and 1 and getting 2", may not be your best option. Assuming you don't know the answer, a reasonable way of tackling any problem is to take it apart and try to solve every part separately. Let's do this here.

As the question requires the rows in A, select them:


Now, a filter should be applied, but which one? Here are some ideas:

  3. EXCEPT SELECT ID FROM TableB -- this requires to select only ID from TableA, as well (EXCEPT and INTERSECT are new additions to SQL 2019)

Think about it. Any issues with any of them? Any other options?

To test performance, I've used two tables with approximately 35 million rows. Here are the results:

  1. After 17 minutes I had to stop the query. Also, NOT IN has issues with NULL as a value is nether equal or unequal to NULL. SELECT * FROM Table WHERE Value NOT IN (NULL) for example, will always return no rows.
  2. It finished within 4 seconds. There are still issues with NULL, though, as a simple equality would not work with NULL. Assuming we wanted the non-null values of TableA, we're good.
  3. It finished within 5 seconds. This doesn't have any issues with NULL. SELECT NULL EXCEPT SELECT NULL will return no rows, while SELECT 1 EXCEPT SELECT NULL will return a row with the value 1. The syntax is pretty ugly though and works badly if the tables have other columns

What about another solution? We've exhausted simple filtering, how about another avenue? Whenever we want to combine information from two tables we use JOIN, but is that the case here?

JOIN TableB b
ON a.ID = b.ID -- again, while I would ask people in the interview about null values, we will assume for this post that the values are not nullable

I've used a JOIN keyword, which translates to an INNER JOIN. The query above will select rows from A, but only those that have a correspondence in B. A funny solution to a slightly different question: count the items in A that do not have corresponding items in B:


However, we want the inverse of the INNER JOIN. What other types of JOINs are there? Bonus interview question! And the answers are:

  • INNER JOIN - returns only rows that have been successfully joined
  • OUTER JOIN (LEFT AND RIGHT) - returns all rows of one table joined to the corresponding values of the other table or NULLs if none
  • CROSS JOIN - returns all the rows in A joined with all the rows in B and all the rows in B that have no match in A

INNER would not work, as demonstrated, so what about a CROSS JOIN? Clearly not, as it will generate 100 trillion rows before filtering anything. SQL Server would optimize a lot of the query, but it would look really weird anyway.

Is there a solution with OUTER JOIN? RIGHT OUTER JOIN will get the rows in B, not in A, so LEFT OUTER JOIN, by elimination, is the only remaining possible solution.

SELECT a.* FROM TableA a 
ON a.ID=b.ID

This returns ALL the rows in table A and for each of them, rows in table B that have the same id. In case of a mismatch, though, for a row in table A with no correspondence in table B, we get a row of NULL values. So all we have to do is filter for those. We know that there are no NULLs in the tables, so here is another working solution, solution 4:

SELECT a.* FROM TableA a 
ON a.ID=b.ID

This solves the problem, as well, in about 4 seconds. However, the other working solution within the same time (solution 2 above) only works as well because newer versions of SQL server are optimizing the execution. Maybe it's a personal preference from the times solution 4 was clearly the best in terms of performance, but I would chose that as the winner.


  • You can either use NOT EXISTS (and not NOT IN!) or a LEFT OUTER JOIN with a filter on NULL b values.
  • It's important to know if you have NULL values in the joining columns and it's extra points for asking that from your interviewer
  • If not asking, I would penalize solutions that do not take NULL values in consideration. Extra complexity of code, as one cannot simply check for NULL for solution 4. Also a decision has to be made on the expected behavior when working with NULL values
  • When trying to find the solution to a problem in an interview:
    • think of concrete examples of the problem so you can test your solutions
    • break the problems into manageable bits if possible
    • think aloud, but to the point
  • Also, there is nothing more annoying than doing that thing pupils in school do: looking puppy eyed at the teacher while listing solutions to elicit a response. You're not in school anymore. Examples are dirty, time is important, no one cares about your grades.
  • Good luck out there!

A funny feature that I've encountered recently. It's not something most people would find useful, but it helps tremendously with tracing and debugging what is going on. It's easy, just add .TagWith(someString) to your LINQ query and it will generate comments in SQL. More details here: Query tags.

Just a heads up on a really useful blog post: Script to Delete Data from SQL Server Tables with Foreign Key Constraints

Basically it creates a stored procedure to display the dependencies based on a table name and a condition, then some other code to generate hierarchical delete code.

Tonight I went to an ADCES presentation about SQL table partitioning, a concept that allows for a lot of flexibility while preserving the same basic interface for a table one would use for a simpler and less scalable application. The talk was very professionally held by Bogdan Sahlean and you should have been there to see it :)

He talked about how one can create filegroups on which a table can be split into as many partitions as needed. He then demonstrated the concept of partition switching, which means swapping two tables without overhead, just via metadata, and, used in the context of partitions, the possibility to create a staging table, do stuff on it, then just swap it with a partition with no downtime. The SQL scripts used in the demo can be found on Sahlean's blog. This technology exists since SQL Server 2005, it's not something terribly new, and features with similar but limited functionality existed since SQL Server 2000. Basically the data in a table can be organized in separate buckets and one can even put each partition on a different drive for extra speed.

Things I've found interesting, in no particular order:

  • Best practice: create custom filegroups for databases and put objects in them, rather than in the primary (default) filegroup. Reason: each filegroup is restored separately,
    with the primary being the first and the one the database restore waits for to call a database as online. That means one can quickly restore the important data and see the db online, while the less accessed or less important data, like archive info, loaded afterwards.
  • Using constraints with CHECK on tables is useful in so many ways. For example, even since SQL Server 2000, one could create tables on different databases, even different servers, and if they are marked with not overlapping checks, one can not only create a view that combines all data with UNION ALL, but also insert into the view. The server will know which tables, databases and servers to connect to. Also, useful in the partition presentation.
  • CREATE INDEX with a DROP_EXISTING hint to quickly recreate or alter clustered indexes. With DROP_EXISTING, you save one complete cycle of dropping and recreating nonclustered indexes. Also, if specifying a different filegroup, you are effectively moving the data in a table from a filegroup to another.
  • Finally, the SWITCH TO partition switching can be used to quickly swap two tables, since from Sql Server 2005 all tables are considered partitioned, with regular ones just having one partition. So one creates a table identical in structure with another, does whatever with it, then just uses something like this: ALTER TABLE Orders SWITCH PARTITION 1 TO OrdersHistory PARTITION 1; to swap them out, with minimal overhang.

This clause is so obscure that I couldn't even find the Microsoft reference page for it for a few minutes, so no wonder I didn't know about it. Introduced in SQL Server 2005, the TABLESAMPLE clause limits the number of rows returned from a table in the FROM clause to a sample number or PERCENT of rows.

TABLESAMPLE (sample_number [ PERCENT | ROWS ] ) [ REPEATABLE (repeat_seed) ]

REPEATABLE is used to set the seed of the random number generator so one can get the same result if running the query again.

It sounds great at the beginning, until you start seeing the limitations:
  • it cannot be applied to derived tables, tables from linked servers, and tables derived from table-valued functions, rowset functions, or OPENXML
  • the number of rows returned is approximate. 10 ROWS doesn't necessarily return 10 records. In fact, the functionality underneath transforms 10 into a percentage, first
  • a join of two tables is likely to return a match for each row in both tables; however, if TABLESAMPLE is specified for either of the two tables, some rows returned from the unsampled table are unlikely to have a matching row in the sampled table.
  • it isn't even that random!

Funny enough, even the reference page recommends a different way of getting a random sample of rows from a table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

Even if probably not really usable, at least I've learned something new about SQL.

More about getting random samples from a table here, where it explains why ORDER BY NEWID() is not the way to do it and gives hints of what really happens in the background when we invoke TABLESAMPLE.
Another interesting article on the subject, focused more on the statistical probability, can be found here, where it also shows how TABLESAMPLE's cluster sampling may fail in spectacular ways.

I've met an interesting case today when we needed to manipulate data from tens of thousands of people daily. Assuming we would use table rows for the information, then we get a table in which rows are constantly added, updated and deleted. The issue is with the space allocated in table pages.

SQL works like this: If it needs space it allocates some as a "page" which can contain more records. When you delete records the space is not reclaimed, it remains as is (this is called ghosting). The exception is when all records in a page are deleted, in which case the page is reused as an empty page. When you update a record with more data then it held before (like when you have a variable length column), the page is split, with the rest of the records on the page moved to a new page.

In a heap table (no clustered index) the space inside pages is reused for new records or for updated records that don't fit in their allocated space, however if you use a clustered index, like a primary key, the space is not reused, since there needs to be a correlation between the value of the column and its position in the page. And here lies the problem. You may end up with a lot of pages with very few records in them. A typical page is 8 kilobytes, so a table with a few integers in a record can hold hundreds of records on a single page.

Fragmentation can be within a page, as described above, also called internal, but also external, between pages, when the recycled pages are used for data that is out of order. To get a large swathe of records the disk might be worked hard in order to jump from page to page to get what is logically a continuous blob of data. It is disk input/output that kills a database.

OK, back to our case. A possible solution was to store all the data for a user in a "blob", a VARBINARY column. For reads or changes only the disk space occupied by the blob would be changed, with C# code handling everything. It's what is called trading CPU for IO, which is generally good. However this NoSql-like idea itself smelled badly to me. We are supposed to trust our databases, not work against them. The solution I chose is monitoring index fragmentation and occasionally issuing clustered index rebuilding or reorganizing. I am willing to bet that reading/writing the data equivalent to several pages of table is going to be more expensive than selecting the changes I want to make. Also, rebuilding the index will end up storing all the data per user in the same space anyway.

However, this case made me think. Here is a situation in which the solution might have been (and it was in a similar case implemented by someone else) to micromanage the way the database works. It made me question using a clustered index/primary key on a table.

These articles helped me understand more:

I was glad to attend the 565th SQLSaturday in Bucharest yesterday and, while all presentations were cool, I wanted to share with you some specific points that I found very revealing. Without further ado, here is the list:
  • SQL execution plans are read from right to left - such a simple thing, but I remember when I was trying to read them from left to right and I didn't get anything. In SQL Server Management 2016 you also get a "live" version, which shows you an execution plan while it's executing. Really useful to see where the blocking operations are.
  • Manually control your statistics update - execution plans are calculated based on statistics, but the condition for updating the statistics is to have changes in a number of 20% of the rows plus 500 of any table. This default setting is completely arbitrary and may cause a lot of pain. Not only updating the statistics blocks your table (which means more chances that the table will be locked when it is most used), but sometimes the statistics are not useful. One example are reports which may receive a startdate/enddate range or a count or something like that which makes the number of rows affected vary immensely with different parameters. Use OPTION(RECOMPILE) for that.
  • Look for a difference between estimated and actual rows in a query plan, which leads to tempdb spills, which leads to unwanted IO operations - before a query, an execution plan is created or reused, based on statistics, as I was saying above. Once a plan has been chosen, though, it doesn't change during its execution. Basically what this means is that the structure of the plan remains unchanged between the estimated and actual plan. Also based on the plan, memory is requested and never changed. So if the plan asks for 10KB of memory and you need 1000KB, the rest of 990 will be stored and used from tempdb even if there is enough memory to put them in, since the memory requirements don't change from estimated to actual. The reverse is not much better, since a plan may ask for a lot of memory when it only needs little, thus making everything else on that machine have less available resources.
  • SQL default settings suck - there was an entire presentation about that, it is useful to think a little about it. So many settings are legacy things that make no sense, like the initial database size, the autogrow size, index fill factors, maxdop (degree of parallelism), parallelism threshold, used memory (ironically, using all of it may be hurtful as it takes it away from other processes which leads to using the swap file), etc.
  • Look for hard page faults - this counter is much useful than the soft page faults, which are fixable faults. A hard page fault is indicative of unnecessary IO operations, which are orders of magnitude slower than memory use.

There are a lot more things that I want to explore now, since I participated to the event. You may find the files for the presentations in the same place as the full list of talks at SQLSaturday.

The new Datetime2 data type introduced in Microsoft SQL Server 2008 has several advantages over the old Datetime type. One of them is precision in 100 nanoseconds rather than coarse milliseconds, another is that is has a larger date range. It has disadvantages, too, like a difficulty in translating it into numerical values. It was a classic hack to CONVERT/CAST a Datetime into a Float in order to get a numerical value that you could manipulate (like convert it to an integer to get the date without time, which is now accomplished by converting it to Date, another type introduced in SQL Server 2008). There are many reasons why one needs to translate a datetime into a numerical value, I don't get into that here. So here is how to convert a Datetime2 value into a Float.

First solution:
DATEDIFF(SECOND,{d '1970-01-01'}, @Time)+DATEPART(nanosecond,@Time)/1.0E+9
- returns a value in seconds with nanosecond precision from the beginning of the year 1970. Advantage: simple to use and understand. Disadvantage: not similar to the conversion from Datetime to Float.

Second solution:
DATEDIFF(DAY,{d '1900-01-01'}, @Time)+DATEPART(HOUR,@Time)/24.0+DATEPART(MINUTE,@Time)/(24.0*60)+DATEPART(SECOND,@Time)/(24.0*60*60)+DATEPART(nanosecond,@Time)/(24.0*60*60*1.0E+9)
- returns a value that is similar to the float conversion of a datetime. Advantage: doesn't lose precision like converting to a Datetime and then to Float. Disadvantage: look at the length of that!

Final solution:
25567+(DATEDIFF(SECOND,{d '1970-01-01'}, @Time)+DATEPART(nanosecond,@Time)/1.0E+9)/86400.0
- combines the two solutions above. It easy to read and understand. It computes the number of seconds with nanoseconds from 1970, divides by 86400 to get the number of days and adds 25567, which is the number of days between 1900 and 1970.

As a software developer - and by that I mean someone writing programs in C#, Javascript and so on, and occasionally using databases when something needs to be stored somewhere - I have an instinctual fear of the Arrow Anti-pattern. Therefore I really dislike stuff like NOT EXISTS(SELECT 1 FROM Something). However, recent tests have convinced me that this is the best solution for testing for existing records. I am not going to reinvent the wheel; here are some wonderful links regarding this, after which I will summarize:

Let's say you want to insert in a table all records from another source that do not already exist in the table. You have several options, but the most commonly used are:
FROM SourceTable
LEFT JOIN DestinationTable
ON SomeCondition
WHERE DestinationTable.Id IS NULL
FROM SourceTable
WHERE NOT EXIST(SELECT 1 FROM DestinationTable WHERE SomeCondition)

Personally I prefer the first version, for readability reasons as well as having listened to the mantra "Never do selects in selects" for all my life. However, it becomes apparent that the second version is a lot more efficient. The simple reason is that for the first example Microsoft SQL Server will first join the two tables in memory, then filter. If you have multiple combinations of records that satisfy the condition this will result in some huge memory and CPU usage, especially if you have no indexes defined and, sometimes, because you have some indexes defined. The second option uses one of the few methods guaranteed to exit, NOT EXISTS, which immediately invalidates a record at the first match.

Other options involve using the EXCEPT or INTERSECT operations in SQL, but they are not really helping. Intersecting ids, for example, then inner joining with SourceTable works, but it is somewhere in between the two solutions above and it looks like crap as well. Join hints don't help either.

The OUTPUT clause is a very useful tool in Microsoft SQL, allowing for getting automatically inserted columns in the same command as the INSERT. Imagine you have a table with an identity column and you need the generated ids as you insert new records. It would look like this:
Value NVARCHAR(100)

Value NVARCHAR(100),
AnotherValue NVARCHAR(100),
SomeConditionIsTrue BIT


Id INT ,
AnotherValue NVARCHAR(100)

INSERT INTO MyTable (Value)
OUTPUT inserted.Id INTO #ids (id)
FROM AnotherTable
WHERE SomeConditionIsTrue = 1

-- Do something with the inserted Ids

However, what do you do if you want to also insert the column AnotherValue to the #ids table? Something like this does not work:
INSERT INTO MyTable (Value) 
OUTPUT inserted.Id,AnotherTable.AnotherValue INTO #ids (id,AnotherValue)
FROM AnotherTable
WHERE SomeConditionIsTrue = 1

Enter the often ignored MERGE, which can help us translate the query above into:
SELECT Value , AnotherValue
FROM AnotherTable
WHERE SomeConditionIsTrue = 1
) t ON 1=0 --FALSE
INSERT (Value) VALUES (t.Value)
OUTPUT Inserted.Id, t.AnotherValue INTO #ids (Id, AnotherValue);

Note the 1=0 condition so that the merge never "matches" and how the select from the first query now contains all the columns needed to be output, even if only some of them are inserted in the insert table.

This post was prompted by a StackOverflow answer that, as great as it was, didn't make it clear what to do when you get your values from a more complicated select. The answer is simple: put it all in the 'using' table.

Change Data Capture is a useful mechanism for tracking data changes in Microsoft SQL Server. Once you enable it for a database and some of its tables, it will create a record of any change to the data of those tables. In order to use the captured data, Microsoft recommends using the table functions provided by the mechanism, mainly [cdc].[fn_cdc_get_all_changes_dbo_<table name>] and [cdc].[fn_cdc_get_net_changes_<table name>]. However, when you use them, your very first experience might be an error that looks like this: Msg 313, Level 16, State 3, Line 20 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_ ... . Since this is an error message everyone associates with missing a parameter for a function, one might assume that the documentation is wrong and one must add another parameter. You try adding a bogus one and you get Msg 8144, Level 16, State 3, Line 9 Procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name> has too many arguments specified. which brings confusion. One hint is that if we use one less parameter than in the documentation, the error is slightly different Msg 313, Level 16, State 3, Line 9 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name>. In this error message, the tracked table name is specified in the function name, as opposed to the other where ... is used instead. What is going on?

The documentation for the function (which, as usual, nobody reads past the usage syntax and the code examples - just read the Remarks section) says this: If the specified LSN range does not fall within the change tracking timeline for the capture instance, the function returns error 208 ("An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes.")., which of course is the explanation for this weird behaviour, but why and when does it happen?

The why comes from a Microsoft Connect page where an overly honest developer explains that the reason for the obscure error message is the antiquated error and function system used in T-SQL: The issue here is the inability to do raiseerror from within a function that prevents us from bubbling up meaningful error message. If one looks at the source of cdc.fn_cdc_get_all_changes_dbo_<table name>, one sees that the error is thrown from another function, a system one, called [sys].[fn_cdc_check_parameters]. Doing a select on it we get the same original error which is now slightly humourous, since it comes from a completely different function than the one in the message. Since it is a system function this time, there is no source code for it.

The when is more tricky and it shows that they didn't think this through much. First of all, whenever you send a NULL or an empty (0x0000...) value to the function as the begin or end LSN you get this error message. The code examples from Microsoft always show these mysterious LSN values being received from functions like sys.fn_cdc_get_min_lsn('<schema name>_<table name>'), sys.fn_cdc_get_max_lsn(), sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE()) and so on, but they are hardly easy to understand, as they return an empty value for wrong parameters. For example, a common reason why your code fails is from getting the LSN like this: sys.fn_cdc_get_min_lsn('MyWonderfulTable') when in fact you need to use the schema in the name as well: sys.fn_cdc_get_min_lsn('dbo_MyWonderfulTable'). You have to use this syntax everywhere. Funny enough, if the tracked table is empty, you get the lowest LSN for the entire database, but if you use a wrong database name (or without the schema, or NULL, etc) you get an empty LSN. How an empty LSN is not the minimum LSN is beyond me.

My solution? Just select from the tables directly. Yeah, I know, it's bad, unrecommended by Microsoft, reinventing the wheel. But it works and I don't get weird functions messing up my flow with obscure error messages. Just remember to take a look at the cdc.* functions and see how they are written.

So, to summarize: The error message is misleading and it's all they could do within the confines of the T-SQL function error system. Remember to use the schema in the string defining the table in the cdc functions (ex: dbo_MyTable). In case you really want to be flexible, interrogate the cdc tables directly.

I had this situation where I had to execute large SQL script files and the default sqlcmd tool was throwing exceptions rather than execute them. So I created my own tool to read the scripts and execute them transactionally. Everything went smoothly, except at the end. You see, I didn't use TransactionScope.Complete from the beginning in order to see how the program would cope with a massive rollback. Not well, apparently.

The exception was thrown at rollback: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. I had set the TransactionOptions.Timeout to TransactionManager.MaximumTimeout and the SqlCommand.CommandTimeout to 0 (meaning never end) and I still got the exception. Apparently, the problem was the SqlConnection.ConnectTimeout which is a readonly property with a default value of 15 seconds. The value can be changed via the connection string, by adding something like Connect Timeout=36000 (10 hours) and many articles on the Internet suggest doing that. However, that is just really ugly. A better solution is to set the value of the timeout programmatically and this is how to do it:
var timeout = TimeSpan.FromHours(10);
SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder(connectionString);
csb.ConnectTimeout = (int)timeout.TotalSeconds;
connectionString = csb.ConnectionString;

As you can see, the nice SqlConnectionStringBuilder helps us validate, parse and change the values in the connection string. One can imagine other interesting uses, like adding MARS to the connection string automatically or restricting the use to a list of databases or disallowing weak passwords, etc.