Some time ago I wrote a post about the difference between temporary tables (#something) and table variables (@something) which concluded that before SQL Server 2019 table variables sucked. But I didn't give any real example. I made it very technical and clinical, perhaps creating the wrong impression that it's a matter of minute optimizations and not something somebody should know.

So gather 'round, children, and let me tell you the tale of the dreaded table variable! How a query that timed out after an hour was run in just 3 seconds with a tiny little fix. The server version was... 2017 (fade to dream sequence)

First, the setup: I needed to delete rows from a large table (30+ million rows) which did not have an associated account in a calculated list of about 20 thousand. Simple right?

DECLARE @Accounts TABLE(ACC_ID CHAR(5))

INSERT INTO @Accounts
SELECT ... whatever
      
SELECT TOP 1 *
FROM MyTable t
LEFT OUTER JOIN @Accounts a
ON t.ACC_ID=a.ACC_ID
WHERE a.ACC_ID IS NULL

Just get me that first orphaned record please... ... ... 40 minutes later... WTF?

As I was saying in the previous post, table variables do not have statistics and the engine assumes they have only one row. So what does the execution plan want? To go through all of the 30+ million rows and then search them into the accounts table. Only the accounts table has 20 thousand rows, too. And it takes forever!

Mind that I've simplified the case here. Table variables do allow indexes and primary keys. I tried that. No effect at all!

Let's change the query then:

CREATE TABLE #Accounts(ACC_ID CHAR(5))

INSERT INTO #Accounts
SELECT ... whatever
      
SELECT TOP 1 *
FROM MyTable t
LEFT OUTER JOIN #Accounts a
ON t.ACC_ID=a.ACC_ID
WHERE a.ACC_ID IS NULL

DROP TABLE #Accounts

Just get me... wait, what? The query ended in 3 seconds. Suddenly, knowing the number and structure of the rows in the table led to the correct execution plan.

But can't I use table variables somehow? You can, but you have to force the engine to do it your way. First of all, you use OPTION (FORCE ORDER) which will keep the tables in the order you declared them. And then you have to reverse the JOIN so that @Accounts is the first table, but the effect is the same.

DECLARE @Accounts TABLE(ACC_ID CHAR(5))

INSERT INTO @Accounts
SELECT ... whatever
      
SELECT TOP 1 *
FROM @Accounts a
RIGHT OUTER JOIN MyTable t
ON t.ACC_ID=a.ACC_ID
WHERE a.ACC_ID IS NULL
OPTION (FORCE ORDER)

Back to three seconds. Ugh!

Now, I will probably use a temporary table, because forcing the way the SQL engine interprets your queries is almost always a bad idea and it makes the code harder to read, too.

Hope it helps!

  Today I had a very interesting discussion with a colleague who optimized my work in Microsoft's SQL Server by replacing a table variable with a temporary table. Which is annoying, since I've done the opposite plenty of time, thinking that I am choosing the best solution. After all, temporary tables have the overhead of being stored into tempdb, on the disk. What could possibly be wrong with using a table variables? I believe this table explains it all:

First of all, the storage is the same. How? Well, table variables start off in memory, but if they go above a limit they get saved to tempdb! Another interesting bit is the indexes. While you can create primary keys on table variables, you can't use other indexes - that's OK, though, because you would hardly need very complex variable tables. But then there is the parallelism: none for table variables! As you will see, that's rather important. At least table variables don't cause recompilations. And last, but certainly not least, perhaps the most important difference: statistics! You don't have statistics on table variables.

Let's consider my scenario: I was executing a stored procedure and storing the selected values in a table variable. This SP had the single reason to filter the ids of records that I would then have to extract - joining them with a lot of other tables - and could return 200, 800 or several hundred thousand rows.

With a table variable this means :

  1. when inserting potentially hundreds of thousands of rows I would have no parallelism (slow!) and it would probably save it to tempdb anyway (slow!)
  2. when joining other tables with it, not having statistics, it would just treat it like a short list of values, which it potentially wasn't, and looping through it : Table Spool (slow!)
  3. various profiling tools would show the same or even less physical reads and the same SQL server execution time, but the CPU time would be larger than execution time (hidden slow!)

This situation has been improved considerably in SQL Server 2019, to the point that in most cases table variables and temporary tables show the same performance, but versions previous to that would show this to a larger degree.

And then there are hacks. For my example, there is reason why parallelism DOES occur:

So are temporary tables always better? No. There are several advantages of table variables:

  1. they get cleared automatically at the end of their scope
  2. result in fewer recompilations of stored procedures
  3. less locking and resources, since they don't have transaction logs

For many simple situations, like where you want to generate some small quantity of data and then work with that, table variables are best. However, as soon as the data size or scenario complexity increases, temporary tables become better.

As always, don't believe me, test! In SQL everything "depends", you can't rely on fixed rules like "X is always better" so profile your particular scenarios and see which solution is better.

Hope it helps!

  I had this situation where I was trying to optimize a query. And after some investigation I've stumbled upon something strange: querying on the primary key was generating a lot of reads. I was joining my table with a temporary table of 10 ids and there were 630 reads! How come?

  At first I thought it was because the way indexes work. The primary key was comprised of RowId and RowDate and, even if I knew theoretically searching by RowId should use the primary key, the evidence was against me: when querying by RowId and RowDate I would get the expected 10 reads.

  I created two queries, one with and one without RowDate. I then compared their execution plans. They were identical! Only one took a lot longer, specifically in the Index Seek (which used correctly the primary key). When I looked at the properties for that plan element, I saw something strange:

Actual Partitions Accessed 1..63!

I then realized that the table was partitioned on the RowDate column. In this case, RowDate takes precedence to any indexed column! You might think of partitioning a table like forcefully adding the partition columns to every index in the table, including the primary key. In fact, a partitioned table acts like a number of separate tables with the same definition (columns, indexes, etc.), just different data. The indexes work on each separate partition. When you partition a table, you also partition its indexes.

In truth, I would have expected the query execution plan to show the partition split as a separate step. I understand it's hard to conceptualize it without creating as many execution paths as there are partitions, but still, there should be an indication in the shape of the plan that makes it clear you are querying on multiple partitions.

Once RowDate was used, the SQL engine would choose the one partition of my row, then use the primary key index to seek it. Instead of 63*10 reads, just 10 reads, the number of the rows in the id table.

So be careful when you use table partitioning to ALWAYS use the partition columns in the queries for the table, else you will get as many parallel searches as there are partitions, regardless of the indexes you created, as they are also partitioned.

Hope that helps!

  This is a very basic tutorial on how to access Microsoft SQL Server data via SQL queries. Since these are generic concepts, they will be applicable in most other SQL variants out there. My hope is that it will provide the necessary tools to quickly "get into it" without having to read (or understand) too much. Where you go from there is on you.

  There are a lot of basic concepts about SQL, this post will be pretty long.

Table of contents

Connecting to a database

  Let's start with tooling. To access a database you will need SQL Server Management Studio, in my case version 2022, but I will not do anything complicated with it here, therefore any version will do just fine. I will assume you have it installed already as installation is beyond the scope of the blog post. Starting it will prompt for a connection:

  To connect to the local computer, the server will be either . or (local) or the computer name. You can of course connect to any server and you can specify the "instance" and the port number as well. An instance is a specific named installation of SQL server which allows one to have multiple installations (and even versions) of SQL Server. In fact, each instance has its own port, so specifying the port number will ignore the name of the instance. The default port is usually 1433.

  Example of connection server strings: Computer1\SQLEXPRESS, sql.corporate.com,1433, (local), .

  The image here is from a connection to the local machine using Windows Authentication (your windows user). You can connect using SQL Server Authentication, which means providing a username and a password, or using one of the more modern Azure Active Directory methods.

  I will also assume that the connection parameters are known to you, so let's go to the next step.

  Once connected, the Object Explorer window will display the connection you've opened.

  Expanding the Databases node will show the available databases.

  Expanding a database node we get the objects that are part of the database, the most important being:

  • Tables - where the actual data resides
  • Views - abstractions over more complex queries that behave like tables as much as possible, but with some restrictions
  • Stored Procedures - SQL code that can be executed with parameters and may return data results
  • Functions - SQL code that can be executed and returns a value (which can be scalar, like a number of string, or a table type, etc.) 

  In essence they are the equivalent of data stores and code that is executed to use those stores. Views, SPs and functions will not be explained in this post, but feel free to read about them afterwards.

  If one expands a table node, the child nodes will contains various things, the most important of which are:

  • Columns - the names and types of each column in the table
  • Indexes - data structures designed to increase performance to various ways of accessing the data in the table
  • Constraints and Keys - logical restrictions and relationships between tables

  Tables are kind of like Excel sheets, they have rows (data records) and columns (record properties). The power of SQL is a way to declare what you want from tabular representations of data and get the results quickly and efficiently.

  Last thing I want to show from the graphical interface is right clicking on a table node, which shows multiple options, including generating simple operations on the table, the CRUD (Create, Read, Update, Delete) operations mostly, which in SQL are called INSERT, SELECT, UPDATE and DELETE respectively.

  The keywords are traditionally written in all caps, I am not shouting at you. Depending on your preferences and of course the coding standards that apply to your project you can capitalize SQL code however you like. SQL is case insensitive.

Anyway, whatever you are going to choose to "script" it's going to open a so called query window and show you a text with the query. You then have the option of executing it. Normally no one uses the UI to generate scripts except for getting the column names in order for SELECT or INSERT operations. Most of the time you will just right click on a database and choose New Query or select a database and press Ctrl-N, with the same result.

Getting data from tables

Finally we get to doing something. The operation to read data from SQL is called SELECT. One can specify the columns to be returned or just use * to get them all. It is good practice to always specify the column names in production code, even if you intend to select all columns, as the output of the query will not change if we add more columns in the future. However, we will not be discussing software projects, just how to get or change the data using SQL server, so let's get to it.

The simplest select query is: SELECT * FROM MyTable, which will return all columns of all records of the table. Note that MyTable is the name of a table and the least specific way of accessing that table. The same query can be written as: SELECT * FROM [MyDatabase].[dbo].[MyTable], specifying the database name, the schema name (default one is dbo, but your database can use multiple ones) and only then the table name.

The square bracket syntax is usually not required, but might be needed in special cases, like when a column has the same name as a keyword or if an object has spaces or commas in it (never a good idea, but a distinct possibility), for example: SELECT [Stupid,column] FROM [Stupid table name with spaces]. Here we are selecting a badly named column from a badly named table. Removing the square brackets would result in a syntax error.

In the example above we selected stuff from table CasesSince100 and we got tabular results for every record and the columns defined in the table. But that is not really useful. What we want to do when getting data is:

  • getting data from specific columns
  • formatting the data for our purposes
  • filtering the data on conditions
  • grouping the data
  • ordering the results

So here is a more complex query:

-- everything after two dashes in a line is a comment, ignored by the engine
/* there is also
   a multiline comment syntax */
SELECT TOP 10                            -- just the first 10 records
    c.Entity as Country,                 -- Entity will be returned with the name Country
    CAST(c.[Date] as Date) as [Date],    -- Unfortunate naming, as Date is also a type
    c.cases as Cases                     -- capitalized alias
FROM CasesSince100 c                     -- source for the data, aliased as 'c'
WHERE c.Code='ROU'                       -- conditions to filter by
    AND c.[Date]>'2020-03-01'
ORDER BY c.[Date] DESC                   -- ordering in descending order

  The query above will return at most 10 rows, only for Romania, for dates larger than March 2020, but ordered from the newest to oldest. Data returned will be the country name, the date (which was originally a DATETIME and now is cast to a timeless DATE type) and the number of cases.

  Note that I have aliased all columns, so the resulting table has columns named as the aliases. I've also aliased the table name as 'c', which helps in several ways. First of all, Intellisense works better and faster when specifying the table name. All you have to do is type c. and the list of columns will pop up and be filtered as you type. The second reason will become apparent when I am talking about updating and deleting. For the moment just remember that it's a good idea to alias your tables.

  You can alias a table by specifying a name to call it by next to its own name and optionally using 'as', like SELECT ltn.* FROM Schema.LongTableName as ltn. It helps differentiating between ambiguous names (like if two joined tables have columns with the same name), simplifying the code for long named tables and helping with code completion. Even when aliased, the table name can be used and one can specify or ignore the name of the table if the column names are unambiguous.

Of course these are trivial examples. The power of SQL is that you can get information from multiple sources, aggregate them and structure your database for quick access. More advanced concepts are JOINs and indexes, and I hope you will read until I get there, but for now let's just go through the very basics.

Here is another query that groups and aggregates data:

SELECT TOP 10                            -- top 10 results
    c.Entity as Country,                 -- country name
    SUM(CAST(c.cases as INT)) as Cases   -- cases is text, so we transform it to int
FROM CasesSince100 c
WHERE YEAR([Date])=2020                  -- condition applies a function to the date
GROUP BY c.Entity                        -- groups by country
HAVING SUM(CAST(c.cases as INT))<1000000 -- this is filtering on grouped values
ORDER BY SUM(CAST(c.cases as INT)) DESC  -- order on sum of cases

This query will show us the top 10 countries and the total sum of cases in year 2020, but only for countries where that total is less than a million. There is a lot to unpack here:

  • cases column is declared as NVARCHAR(150) meaning Unicode strings of varied length, but at most 150 characters, so we need to cast it to INT (integer) to be able to apply summing to it
  • there are two different ways of filtering: WHERE, which applies to the data before grouping, then HAVING, which applies to data after grouping
  • filtering, grouping, ordering all work with unaliased columns, so even if Entity is returned as Country, I cannot do WHERE Country='Romania'
  • grouping allows to get a row for each combination of the columns the grouping is done and compute some sort of aggregation (in the case above, a sum of cases per country)

Here are the results:

Let me rewrite this in a way that is more readable using what is called a subquery, in other words a query from which I will query once again:

SELECT TOP 10
    Country,
	SUM(Cases) as Cases
FROM (
    SELECT
        c.Entity as Country,
        CAST(c.cases as INT) as Cases,
	    YEAR([Date]) as [Year]
FROM CasesSince100 c
) x
WHERE [Year]=2020
GROUP BY Country
HAVING SUM(Cases)<1000000
ORDER BY Cases DESC

Note that I still have to use SUM(Cases) in the HAVING clause. I could have grouped it in another subquery and selected again and so on. In order to select from a subquery, you need to name it (in our case, we named it x). Also I selected Country from x, which I could have also written as x.Country. As I said before, table names (aliased or not) are optional if the column name if unambiguous. Also you may notice that I've given a name to the summed column. I could have skipped that, but that would mean the resulting columns would have had no name and the query itself would have been difficult to use in code (extracted column values would have had to be retrieved by index and not by name, which is never recommended).

If you think about it, the order of the clauses in a SELECT operation has a major flaw: you are supposed to write SELECT, then specify what columns you want and only then specify where you want the columns to be read from. This makes code completion problematic, which is why the in code query language for .NET (LInQ) puts the selection at the end. But even so there is a trick:

  • SELECT * and then complete the query
  • go back and replace the * with the column names you want to extract (you will now have Intellisense code completion)
  • the alias of the tables will now come in handy, but even without aliases one can press Ctrl-Space and get a list of possible values to select

Defining tables and inserting data

Before we start inserting information, let's create a table:

CREATE TABLE Food(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    FoodName NVARCHAR(100),
    Quantity INT
)

One important concept in SQL is the primary key. It is a good idea in most cases that your tables have a primary key which identifies each record uniquely and also makes them easy to reference. Let me give you an example. Let's assume that we would put no Id column in our Food table and then we would accidentally add cheese twice. How would you reference the first record as opposed to the second? How would you delete the second one?

A primary key is actually just a special case of a unique index, clustered by default. We will get to indexes later, so don't worry about that yet. Enough to remember that it is fastest (most efficient) to find records by the primary key than any other column combination and the way records are uniquely identified. 

The IDENTITY(1,1) notation tells SQL Server that we will not insert values in that column and instead let it put values starting with 1, then increasing with 1 each time. That functionality will become clear when we INSERT data in the table:

INSERT INTO Food(FoodName,Quantity)
VALUES('Bread',1),('Cheese',1),('Pork',2),('Chilly',10)

Selecting from our Food table now gets us these results:

As you can see, we've inserted four records, by only specifying two out of three columns - we skipped Id. Yet SQL has filled the column with values from 1 to 4, starting with 1 and incrementing each time with 1.

The VALUES syntax is specifying inline data, but we could, in fact, insert into a table the results of a query, something like this:

INSERT INTO Food(FoodName,Quantity)
SELECT [Name],Quantity
FROM Store
WHERE [Type]='Food'

There is another syntax for insert that is useful with what are called temporary tables, tables created for the purpose of your session (lifetime of the query window) and that will automatically disappear once the session is over. It looks like this:

SELECT FoodName,Quantity
INTO #temp
FROM Food

This will create a table (temporary because of the # sign in front of it) that will have just FoodName and Quantity as columns, then proceed on saving the data there. This table will not have a primary key nor any types of indexes and it will work as a simple dump of the data selected. You can add indexes later or alter the table in any way you want, it works just like a regular table. While a convenient syntax (you don't have to write a CREATE TABLE query or think of the type of columns) it has a limited usefulness and I recommend not using it in application code.

Just as one creates a table, there are DROP TABLE and ALTER TABLE statements that delete or change the structure of the table, but we won't go into that.

Changing existing data

So now we have some data in a table that we have defined. We will see how the alias syntax I discussed in the SELECT section will come in handy. In short, I propose you use just two basic syntax forms for all CRUD operations: one for INSERT and one for SELECT, UPDATE and DELETE.

But how can you use the same syntax for statements that are so different, I hear you ask? Let me give you some example of similar code doing just that before I dive in what each operation does.

SELECT *
FROM Food f
WHERE f.Id=4

UPDATE f
SET f.Quantity=9
FROM Food f
WHERE f.Id=4

DELETE FROM f
FROM Food f
WHERE f.Id=4

The last two lines of all operations are exactly the same. These are simple queries, but imagine you have a complex one to craft. The first thing you want to see is that you are updating or deleting the right thing, therefore it makes sense to start with a SELECT query instead, then change it to a DELETE or UPDATE when satisfied. You see I UPDATE and DELETE using the alias I gave the table.

When first learning UPDATE and DELETE statements, one usually gets to this syntax:

UPDATE Food     -- using the table name is cumbersome if in a complex query
SET Quantity=9  -- unless using Food.Quantity and Food.Id
WHERE Id=4      -- you don't get easy Intellisense

DELETE          -- this seems a lot easier to remember
FROM Food       -- but it only works with one table in a simple query
WHERE Id=4

I've outlined some of the reasons I don't use this syntax in the comments, but the most important reason why one shouldn't use them except for very simplistic cases is that you are trying to create a query to destructively change the data in the database and there is no fool proof way to duplicate the same logic in a SELECT query to verify what you are going to change. I've seen people (read that as: I was dumb enough to do it myself) who created an entire different SELECT statement to verify what they would do, then realize to their horror the statements were not equivalent and they had updated or deleted the wrong thing!

OK, let's look at UPDATE and DELETE a little closer.

One of the useful clauses for these statements is, just like with SELECT, the TOP clause, which instructs SQL to affect just a finite number of rows. However, because TOP has been added later for write operations, you need to encase the value (or variable) in parentheses. For SELECT you can skip the parentheses for constant values (you still need them for variables)

DELETE TOP (10) FROM MyTable

Another interesting clause, that frankly I have not used a lot, but is essential in some specific cases, is OUTPUT. One can delete or update some rows and at the same time get the rows they have changed. The reason being that first of all in a DELETE statement the rows will be gone, so you won't be able to SELECT them again. But even in an UPDATE operation, the rows chosen to be updated by a query may not be the same if you execute them again. 

SQL does not guarantee the order of rows unless specifically using ORDER BY. So if you execute SELECT TOP 10 * FROM MyTable twice, you may get two different results. Moreover, between the time you UPDATE some rows and you SELECT them in another query, things may change because of other processes running at the same time on the same data.

So let's say we have some for of Invoices and Items tables that reference each other. You want to delete one invoice and all the items associated with it. There is no way of telling SQL to DELETE from multiple tables at the same time, so you DELETE the invoice, OUTPUT its Id, then delete the items for that Id.

CREATE TABLE #deleted(Id INT) -- temporary table, but explicitly created

DELETE FROM Invoice 
OUTPUT Deleted.Id    -- here Deleted is a keyword
INTO #deleted        -- the Id from the deleted rows will be stored here
WHERE Id=2           -- and can be even be restored from there

DELETE 
FROM Item
WHERE Id IN (
  SELECT Id FROM #deleted
)  -- a subquery used in a DELETE statement

-- same thing can be written as:
DELETE FROM i
FROM Item i
INNER JOIN #deleted d  -- I will get to JOINs soon
ON i.Id=d.Id

I have been informed that the INTO syntax is confusing and indeed it is:

  • SELECTing INTO will create a new table with results and throw an exception if the table already exists. The table will have the names and types of the selected values, which may be what one wants for a quick data dump, but it may also cause issues. For example the following query would throw an exception:
    SELECT 'Blog' as [Name]
    INTO #temp
    
    INSERT INTO #temp([Name]) -- String or binary data would be truncated error
    VALUES('Siderite')
    ​

    because the Name column of the new temporary table would be VARCHAR(4), just like 'Blog' and 'Siderite' would be too long

  • UPDATEing or DELETEing with OUTPUT INTO will require an existing table with the same number and types of columns as the columns specified in the OUTPUT clause and will throw an exception if it doesn't exist

One can use derived values in UPDATE statements, not just constants. One can reference the columns already existing or use any type of function that would be allowed in a similar SELECT statement. For example, here is a query to get the tax value of each row and the equivalent update to store it into a separate column:

SELECT
    i.Price, 
    i.TaxPercent, 
    i.Price*(i.TaxPercent/100) as Tax  -- best practice: SELECT first
FROM Item i

UPDATE i
SET Tax = i.Price*(i.TaxPercent/100)   -- UPDATE next
FROM Item i

So here we first do a SELECT, to see if the values we have and calculate are correct and, if satisfied, we UPDATE using the same logic. Always SELECT before you change data, so you know you are changing the right thing.

There is another trick to help you work safely, one that works on small volumes of data, which involves transactions. Transactions are atomic operations (all or nothing) which are defined by starting them with BEGIN TRANSACTION and are finalized with either COMMIT TRANSACTION (save the changes to the database) or ROLLBACK TRANSACTION (revert changes to the database). Transactions are an advanced concept also, so read about it yourself, but remember one can do the following:

  • open a new query window
  • execute BEGIN TRANSACTION
  • do almost anything in the query window
  • if satisfied with the result execute COMMIT TRANSACTION
  • if any issue with what you've done execute ROLLBACK TRANSACTION to undo the changes

Note that this only applies for stuff you do in that query window. Also, all of these operations are being saved in the log of the database, so this works only with small amounts of data. Attempting to do this with large amounts of data will practically duplicate it on disk and take a long time to execute and revert.

The NULL value

We need a quick primer on what NULL is. NULL is a placeholder for a value that was not set or is considered unknown. It's a non-value. It is similar to null in C# or JavaScript, but with some significant differences applicable to SQL only. For example, a NULL value (an oxymoron for sure) will never be equal to (or not equal to) or less than or greater than anything. One might expect to get all the values in a table in these two queries: SELECT * FROM MyTable WHERE Value>5 and SELECT * FROM MyTable WHERE Value<=5. But if any rows will have NULL for a Value, then they will not appear in any of the query results. That applies to the negation operator NOT as well: SELECT * FROM MyTable WHERE NOT (Value>5).

This behavior can be changed by using SET ANSI_NULLS OFF, but I am yet to see a database that has ever been set up like this.

To check if a value is or is not NULL, one uses the IS and IS NOT syntax :)

SELECT *
FROM MyTable
WHERE MyValue IS NOT NULL

The NULL concept will be used a lot in the next chapter.

Combining data from multiple sources

We finally go to JOIN operations. In most scenarios, you have a database containing multiple table, with intricate connections between them. Invoices that have items, customers, the employee that processed it, dates, departments, store quantities, etc., all referencing something. Integrating data from multiple tables is a complex subject, but I will touch just the most common and important parts:

  • INNER JOIN
  • OUTER JOIN
  • EXISTS
  • UNION / UNION ALL

Let's write a query that displays the name of employees and their department. I will show the CREATE TABLE statements, too, in order to see where we get the data from:

CREATE TABLE Employee (
  EmployeeId INT,          -- Best practice: descriptive column names
  FirstName NVARCHAR(100),
  LastName NVARCHAR(100),
  DepartmentId INT)        -- Best practice: use same name for the same thing

CREATE TABLE Department (
  DepartmentId INT,        -- same thing here
  DepartmentName NVARCHAR(100)
)

SELECT
    CONCAT(FirstName,' ',LastName) as Employee,
    DepartmentName
FROM Employee e
INNER JOIN Department d
ON e.DepartmentId=d.DepartmentId

Here it is: INNER JOIN, a clause that combines the data from two tables based ON a condition or series of conditions. For each row of Employee we are looking for the corresponding row of Department. In this example, one employee belongs to only one department, but a department can hold multiple employees. It's what we call a "one to many relationship". One can have "one to one" or "many to many" relationships as well. That is very important when trying to gauge performance (and number of returned rows).

Our query will only find at most one department for each employee, so for 10 employees we will get at most 10 rows of data. Why do I say "at most"? Because the DepartmentId for some employees might not have a corresponding department row in the Department table. INNER JOIN will not generate records if there is no match. But what if I want to see all employees, regardless if their department exists or not? Then we use an OUTER JOIN:

SELECT
    CONCAT(FirstName,' ',LastName) as Employee,
    DepartmentName
FROM Employee e
LEFT OUTER JOIN Department d
ON e.DepartmentId=d.DepartmentId

This will generate results for each Employee and their Department, but show a NULL (without value) result if the department does not exist. In this case LEFT is used to define that there will be rows for each record in the left table (Employee). We could have used RIGHT, in which case we would have rows for each department and NULL values for departments that have no employees. There is also the FULL OUTER JOIN option, in which case we will get both departments with NULL employees if none are attached and employees with NULL departments in case the department does not exist (or the employee is not assigned - DepartmentId is NULL)

Note that the keywords INNER and OUTER are completely optional. JOIN is the same thing as INNER JOIN and LEFT JOIN is the same as LEFT OUTER JOIN. I find that specifying them makes the code more readable, but that's a personal choice.

The OUTER JOINs are sometimes used in a non intuitive way to find records that have no match in another table. Here is a query that shows employees that are not assigned to a department:

SELECT
    CONCAT(FirstName,' ',LastName) as Employee
FROM Employee e
LEFT OUTER JOIN Department d
ON e.DepartmentId=d.DepartmentId
WHERE d.DepartmentId IS NULL

Until now, we talked about the WHERE clause as a filter that is applied first (before grouping) so one might intuitively have assumed that the WHERE clauses are applied immediately on the tables we get the data from. If that were the case, then this query would never return anything, because every Department will have a DepartmentId. Instead, what happens here is the tables are LEFT JOINed, then the WHERE clause applies next. In the case of unassigned employees, the department id or name will be NULL, so that is what we are filtering on.

So what happens above is:

  • the Employee table is LEFT JOINed with the Department table
  • for each employee (left) there will be rows that contain the values of the Employee table rows and the values of any matched Department table rows
  • in the case there is no match, NULL values will be returned for the Department table for all columns
  • when we filter by Department.DepartmentId being NULL we don't mean any Department that doesn't have an Id (which is impossible) but any Employee row with no matching Department row, which will have a NULL value where the Department.DepartmentId value would have been in case of a match.
  • not matching can happen for two reasons: Employee.DepartmentId is NULL (meaning the employee has not been assigned to a department) or the value stored there has no associated Department (the department may have been removed for some reason)

Also, note that if we are joining tables on some condition we have to be extra careful with NULL values. Here is how one would join two tables on VARCHAR columns being equal even when NULL:

SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON (t1.Value IS NULL AND t2.Value IS NULL) OR t1.Value=t2.Value

SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON ISNULL(t1.Value,'')=ISNULL(t2.Value,'')

The second syntax seems promising, doesn't it? It is more readable for sure. Unfortunately, it introduces some assumptions and also decreases the performance of the query (we will talk about performance later on). The assumption is that if Value is an empty string, then it's the same as having no value (being NULL). One could use something like ISNULL(Value,'--NULL--') but now it starts looking worse.

There are other ways of joining two tables (or queries, or table variables, or table functions, etc.), for example by using the IN or the EXISTS/NOT EXISTS clauses or subqueries. Here are some examples:

SELECT *
FROM Table1
WHERE MyValue IN (SELECT MyValue FROM Table2)

SELECT *
FROM Table1
WHERE MyValue = (SELECT TOP 1 MyValue FROM Table2 WHERE Table1.MyValue=Table2.MyValue)

SELECT *
FROM Table1
WHERE NOT EXISTS(SELECT * FROM Table2 WHERE Table1.MyValue=Table2.MyValue)

These are less readable, usually have terrible performance and may not return what you expect them to return.

When I was learning SQL, I thought using a JOIN would be optimal on all cases and subqueries in the WHERE clause were all bad, no exception. That is, in fact, false. There is a specific case where it is better to use a subquery in WHERE instead of JOIN, and that is when trying to find records that have at least one match. It is better to use EXISTS because it is short-circuiting logic which leads to better performance.

Here is an example with different syntax for achieving the same goal:

SELECT DISTINCT d.DepartmentId
FROM Department d
INNER JOIN Employee e
ON e.DepartmentId=d.DepartmentId

SELECT d.DepartmentId
FROM Department d
WHERE EXISTS(SELECT * FROM Employee e WHERE e.DepartmentId=d.DepartmentId)

Here, the search for departments with employees will return the same thing, but in the first situation it will get all employees for all departments, then list the department ids that had employees, while in the second query the department will be returned the moment just one employee that matches is found.

There is another way of combining data from two sources and that is to UNION two or multiple result sets. It is the equivalent of taking rows from multiple sources of the same type and showing them together in the same result set.

Here is a dummy example:

SELECT 1 as Id
UNION
SELECT 2
UNION
SELECT 2

And we execute it and...

What happened? Shouldn't there have been three values? Somehow, when copy pasting the silly example, you added two identical values. UNION will add only distinct values to the result set. using UNION ALL will show all three values.

SELECT 1 as Id
UNION ALL
SELECT 2
UNION ALL
SELECT 2

SELECT DISTINCT Id FROM (
  SELECT 1 as Id
  UNION ALL
  SELECT 2
  UNION ALL
  SELECT 2
) x

The first query will return 1,2,2 and the second will be the equivalent of the UNION one, returning 1 and 2. Note the DISTINCT keyword.

My recommendation is to never use UNION and instead use UNION ALL everywhere, unless it makes some kind of sense for a very specific scenario, because the operation to DISTINCT values is expensive, especially for many and/or large columns. When results are supposed to be different anyway, UNION and UNION ALL will return the same output, but UNION is going to perform one more pointless distinct operation.

After learning about JOIN, my request to start with SELECT queries and only them modify them to be UPDATE or DELETE begins to make more sense. Take a look at this query:

UPDATE d
SET ToFindManager=1
--SELECT *
FROM Department d
LEFT OUTER JOIN Employee e
ON d.DepartmentId=e.DepartmentId
AND e.[Role]='Manager'
WHERE e.EmployeeId IS NULL

This will set ToFindManager in departments that have no corresponding manager. But if you select the text from SELECT * on and then execute, you will get the results that you are going to update. Same query, executing by selecting different sections of it will either verify or perform the operation.

Indexes and relationships. Performance.

We have seen how to define tables, how to insert, select, update and delete records from them. We've also seen how to integrate data from multiple sources to get what we want. The SQL engine will take our queries, try to understand what we meant, optimize the execution, then give us the results. However, with large enough data, no amount of query optimization will help if the relationships between tables are not properly defined and tables are not prepared for the kind of queries we will execute.

This requires an introduction to indexes, which is a rather advanced idea, both in terms of how to create, use, debug and profile, but also as a computer science concept. I will try to stick to the basics here, and you go and get more in depth from here.

What is an index? It's a separate data structure that will allow quick access to specific parts of the original data. A table of contents in a blog post is an index. It allows you to quickly jump to the section of the post without having to read it all. There are many types of indexes and they are used in different ways.

We've talked about the primary key: (unless specified differently) it's a CLUSTERED, UNIQUE index. It can be on a single column or a combination of columns. Normally, the primary key will be the preferred way to find or join records on, as it physically rearranges the table records in order and insures only one record has a particular primary key.

The difference between CLUSTERED and NONCLUSTERED indexes is that a table can have only one clustered index, which will determine the physical order of record data on the disk. As an example, let's consider a simple table with a single integer column called X. If there is a clustered index on X, then when inserting new values, data will be moved around on the disk to account for this:

CREATE TABLE Test(X INT PRIMARY KEY)

INSERT INTO Test VALUES (10),(1),(20)

INSERT INTO Test VALUES (2),(3)

DELETE FROM Test WHERE X=1

After inserting 10,1 and 20, data on the disk will be in the order of X: a 1, followed by a 10, then a 20. When we insert values 2 and 3, 10 and 20 will have to be moved so that 2 and 3 are inserted. Then, after deleting 1, all data will be moved so that the final physical order of the data (the actual file on the disk holding the database data) will be 2,3,10,20. This will help optimize not only finding the rows, but also efficiently reading them from disk (disk access is the most expensive operation for a database). 

Note: deletion is working a little differently in reality, but in theory this is how it would work.

Nonclustered indexes, on the other hand, keep their own order and reference the records from the original data. For such a simple example as above, the result would be almost identical, but imagine you have the Employee table and you create a nonclustered index on LastName. This means that behind the scenes, a data structure that looks like a table is created, which is ordered by LastName and contains another column for EmployeeId (which is the primary key, the identifier of an employee). When you do SELECT * FROM Employee ORDER BY LastName, the index will be used to first get a list of ids, then select the values from them.

A UNIQUE index also insures that no two records will have the same combination of values as defined therein. In the case of the primary key, there cannot be two records with the same id. But one can imagine something like:

CREATE UNIQUE INDEX IX_Employee_Name ON Employee(FirstName,LastName)

INSERT INTO Employee (FirstName,LastName)
VALUES('Siderite','Blog')

IX_Employee_Name is a nonclustered unique index on FirstName and LastName. If you execute the insert, it will work the first time, but fail the second time:

There is another type of index-like structure called a foreign key. It should be used to define logical relationships between tables. For the Department table, DepartmentId should be a primary key, but in the Employee table, DepartmentId should be defined as a foreign key connecting to the column in the Department table.

Important note: a foreign key defines the relationship, but doesn't index the column. A separate index should be added on the Employee.DepartmentId column for performance reasons.

I don't want to get into foreign keys here. Suffice to say that once this relationship is defined, some things can be achieved automatically, like deleting corresponding Item records by the engine when deleting Invoices. Also the performance of JOIN queries increases.

Indexes can be used not only on equality, but also other more complex cases: numerical ranges, prefixes, etc. It is important to understand how they are structured, so you know when to use them.

Let's consider the IX_Employee_Name index. The index is practically creating a tree structure on the concatenation of the first and last name of the employee and stores the primary key columns for the table for reference. It will work great for increasing performance of a query like SELECT * FROM Employee ORDER BY FirstName or SELECT * FROM Employee WHERE FirstName LIKE 'Sid%'. However it will not work for LastName queries or contains queries like SELECT * FROM Employee ORDER BY LastName or SELECT * FROM Employee WHERE FirstName LIKE '%derit%'.

That's important because sometimes simpler queries will take more resources than more complicated ones. Here is a dumb example:

CREATE INDEX IX_Employee_Dumb ON Employee(
    FirstName,
    DepartmentId,
    LastName
)

SELECT *
FROM Employee e
WHERE e.FirstName='Siderite'
  AND e.LastName='Blog'

SELECT *
FROM Employee e
WHERE e.FirstName='Siderite'
  AND e.LastName='Blog'
  AND e.DepartmentId=1

The index we create is called IX_Employee_Dumb and it creates a data structure to help find rows by FirstName, DepartmentId and LastName in that order. 

For some reason, in our employee table there are a lot of people called Siderite, but with different departments and last names. The first query will use the index to find all Siderite employees (fast), then look into each and check if LastName is 'Blog' (slow). The second query will directly find the Siderite Blog employee from department with id 1 (fast), because it uses all columns in the index. As you can see, the order of columns in the index is important, because without the DepartmentId in the WHERE clause, only the first part of the index, for FirstName, can be used. In the last query, because we specify all columns, the entire index can be used to efficiently locate the matching rows. 

Note 2022-09-06: Partitioning a table (advanced concept) takes precedence to indexes. I had a situation where a table was partitioned on column RowDate into 63 partitions. The primary key was RowId, but when you SELECTed on RowId, there were 63 index seeks performed. If queried on RowId AND RowDate, it went to the containing partition and did only one index seek inside it. So careful with partitioning. It only provides a benefit if you query on the columns you use to partition on.

One more way of optimizing queries is using the INCLUDE clause. Imagine that Employee is a table with a lot of columns. On the disk, each record is taking a lot of space. Now, we want to optimize the way we get just FirstName and LastName when searching in a department:

SELECT FirstName,LastName
FROM Employee
WHERE DepartmentId=@departmentId

That @ syntax is used for variables and parameters. As a general rule, any values you send to an SQL query should be parameterized. So don't do in C# var sql = "SELECT * FROM MyTable WHERE Id="+id, instead do var sql="SELECT * FROM MyTable WHERE Id=@id" and add an @id parameter when running the query.

So, in the query above SQL will do the following:

  • use an index for DepartmentId if any (fast)
  • find the EmployeeId
  • read the (large) records of each employee from the table (slow)
  • extract and return the first and last name for each

But add this index and there is no need to even go to the table:

CREATE INDEX IX_Employee_DepWithNames
  ON Employee(DepartmentId)
  INCLUDE(FirstName,LastName)

What this will do is add the values of FirstName and LastName to the data inside the index and, if only selecting values from the include list, return them from the index directly, without having to read records from the initial table.

Note that DepartmentId is used to locate rows (in WHERE and JOIN ON clauses) while FirstName and LastName are the columns one SELECTs.

Indexes are a very complex concept and I invite you to examine it at length. It might even be fun.

When indexes are bad

Before I close, let me tell you where indexes are NOT recommended.

One might think that adding an index for each type of query would be a good thing and in some scenarios it might, but as usual in database work, it depends. What performance you gain for finding records in SELECT, UPDATE and DELETE statements, you lose with INSERT, UPDATE and DELETE data changes.

As I explained before, indexes are basically hidden tables themselves. Slight differences, but the data they contain is similar, organized in columns. Whenever you change or add data, these indexes will have to be updated, too. It's like writing in multiple tables at the same time and it affects not only the execution time, but also the disk space.

In my opinion, the index and table structure of a database depends the most on if you intend to read a lot from it or write a lot to it. And of course, everybody will scowl and say: "I want both! High performance read and write". My recommendation is to separate the two cases as much as possible.

  • You want to insert a lot of data and often? Use large tables with many columns and no indexes, not even primary keys sometimes.
  • You want to update a lot of data and often? Use the same tables to insert the modifications you want to perform.
  • You want to read a lot of data and often? Use small read only tables, well defined, normalized data, clear relationships between tables, a lot of indexes
  • Have a background process to get inserts and updates and translate them into read only records

Writing data and reading data, from the SQL engine perspective, are very very different things. They might as well be different software and indeed some companies use one technology to insert data (like NoSQL databases) and another to read it.

Conclusion

I hope the post hasn't been too long and that it will help you when beginning with SQL. Please leave any feedback that you might have, the purpose of this blog is to help people and every perspective helps.

SQL is a very interesting idea and has changed the way people think of data access. However, it has become so complex that most people are still confused even after years of working with it. Every year new features are being added and new ideas are put forward. Yet there are a few concepts, a foundation if you will, that will get you most of the way there. This is what I have tried to distil here. Hope I succeeded.

  I was attempting to optimize an SQL process that was cleaning records from a big table. There are a multitude of ways of doing this, but the pattern that I had adopted for the last similar tasks were to delete rows in batches using the TOP (@rowCount) syntax. And it had all worked fine until then, but now my "optimization" increased the run time from 6 minutes to 2 hours! Humbled (or more like humiliated) I started to analyze what was going on.

  First thing I did was to SET STATISTICS IO ON. Then I ran the cleaning task again. And lo and behold, there was a row reporting accessing an object that was not part of the query itself. What was going on? At first I thought that I was using a VIEW somewhere, one that I had thought was a table, but no, there was no reference to that object anywhere. But when I looked for that object is was a view!

  The VIEW in question was a view with SCHEMABINDING, to which several indexes were then created. That explained it all. If you ever attempted to create an index on a view you probably got the error "Cannot create index on view, because the view is not schema bound" and then you investigated what that entailed (and probably gave up because of all the restrictions) but in that first moment when you thought "all I have to do is add WITH SCHEMABINDING and I can index my views!" it seemed like a good idea. It might even be a good idea for several scenarios, but what it also does is create a reverse dependency on the object you are using. Moreover, if you look more carefully at the Microsoft documentation it says: "The query optimizer may use indexed views to speed up the query execution. The view does not have to be referenced in the query for the optimizer to consider that view for a substitution." So you may find yourself querying a table and instead the engine queries a view instead!

  You see, what happens is that every time when you delete 4900 rows from a table that is used by a view that has indexes on it is those indexes are being recreated, so not only your table is affected, but potentially everything that is being called in the view as well. If it's a complicated view that integrates data from multiple sources, it will be run after every batch delete and indexed. Again. And again. And again again. It also prohibits you from some operations, like TRUNCATE TABLE, where you get a funny message saying it's referenced by a view and that is why you can't truncate it. What?!

  Now, I deleted the VIEW and ran the same code. It was faster, but it still took ages because finding the records to delete was a much longer operation than the deletion itself. This post is about this reverse dependency that an indexed view introduces.

  So what is the solution? What if you have the view, you need the view and you also need it indexed? You can disable the indexes before your operation, then enable them again. I believe this will solve most issues, even if it's not a trivial operation. Just remember that in cleaning operations, you need some indexes to find the records to delete as well.

  That's it. I hope it helps. Get out of here!

  When we connect to SQL we usually copy/paste some connection string and change the values we need and rarely consider what we could change in it. That is mostly because of the arcane looking syntax and the rarely read documentation for it. You want to connect to a database, give it the server, instance and credentials and be done with it. However, there are some parameters that, when set, can save us a lot of grief later on.

  Application Name is something that identifies the code executing SQL commands to SQL Server, which can then be seen in profilers and DMVs or used in SQL queries. It has a maximum length of 128 characters. Let's consider the often met situation when your application is large enough to be segregated into different domains, each having their own data access layer, business rules and user interface. In this case, each domain can have its own connection string and it makes sense to specify a different Application Name for each. Later on, one can use SQL Profiler, for example, and filter on the specific area of interest.

 

  The application name can also be seen in some queries to SQL Server's Dynamic Management Views (quite normal, considering DMVs are used by SQL Profiler) like sys.dm_exec_sessions. Inside your own queries you can also get the value of the application name by simply calling APP_NAME(). For example, running SELECT APP_NAME(); in SQL Management Studio returns a nice "Microsoft SQL Server Management Studio - Query" value. In SQL Server Profiler the column is ApplicationName while in DMVs like sys.dm_exec_sessions the column is program_name.

  Example connection string: Server=localhost;Database=MyDatabase;User Id=Siderite;Password=P4ssword; Application Name=Greatest App Ever

  Hope it helps!

  Interesting SQL table hint I found today: READPAST. It instructs SQL queries to ignore locked rows. This comes with advantages and disadvantages. For one it avoids deadlocks when trying to read or write an already locked row, but it also provides the wrong results. Just as NOLOCK, it works around the transaction mechanism, and while NOLOCK will allow dirty reads of information partially changed in transactions that have not been committed, READPAST ignores its existence completely.

  There is one scenario where I think this works best: batched DELETE operations. You want to delete a lot of rows from a table, but without locking it. If you just do a delete for the entire table with some condition you will get these issues:

  • the operation will be slow, especially if you are deleting on a clustered index which moves data around in the table
  • if the number of deleted rows is too large (usually 5000 or more) then the operation will lock the entire table, not just the deleted rows
  • if there are many rows to be deleted, the operation will take a long while, increasing the possibility of deadlocks

  While there are several solutions for this, like partitioning the table and then truncating the partitions or soft deletes or designing your database to separate read and write operations, one type of implementation change that is small in scope and large is result is batched deletes. Basically, you run a flow like this:

  1. SELECT a small number of rows to be deleted (again, mind the 5000 limit that causes table locks, perhaps even use ROWLOCK hint)
  2. DELETE rows selected and their dependencies (DELETE TOP x should work as well for steps 1 and 2, but I understand in some cases this syntax automatically causes a table lock and maybe also use ROWLOCK hint)
  3. if the number of selected rows is larger than 0, go back to step 1

  This allows SQL to lock individual rows and, if your business logic is sound, no rows should be deleted while something is trying to read or write them. However, this is not always the case, especially in high stress cases with many concurrent reads and writes. But here, if you use READPAST, then locked rows will be ignored and the next loops will have the chance to delete them.

  But there is a catch. Let's take an example:

  1. Table has 2 rows: A and B
  2. Transaction 1 locks row A
  3. In a batched delete scenario, Transaction 2 gets the rows with READPAST and so only gets B
  4. Transaction 2 deletes row B and commits, and continues the loop
  5. Transaction 3 gets the rows with READPAST and gets no rows (A is still locked)
  6. Transaction 3 deletes nothing and exists the loop
  7. Transaction 1 unlocks row A
  8. Table now has 1 row: A, which should have been deleted, but it's not

  There is a way to solve this: SELECT with NOLOCK and DELETE with READPAST

  • this will allow to always select even locked and uncommitted rows
  • this will only delete rows that are not locked
  • this will never deadlock, but will loop forever as long as some rows remain locked

  One more gotcha is that READPAST allows for a NOWAIT syntax, which says to immediately ignore locked rows, without waiting for a number of seconds (specified by LOCK_TIMEOUT) to see if it unlocks. Since you are doing a loop, it would be wise to wait, so that it doesn't go into a rapid loop while some rows are locked. Barring that, you might want to use READPAST NOWAIT and then add a WAITFOR DELAY '00:00:00.010' at the end of the loop to add 10 millisecond delay, but if you have a lot of rows to delete, it might make this too slow.

  Enough of this, lets see some code example:

DECLARE @batchSize INT = 1000
DECLARE @nrRows INT = 1

CREATE TABLE #temp (Id INT PRIMARY KEY)

WHILE (@nrRows>0)
BEGIN

  BEGIN TRAN

	INSERT INTO #temp
    SELECT TOP (@batchSize) Id
    FROM MyTable WITH (NOLOCK)
    WHERE Condition=1

    SET @nrRows = @@ROWCOUNT

    DELETE FROM mt 
    FROM MyTable mt WITH (READPAST NOWAIT)
    INNER JOIN #temp t
    ON mt.Id=t.Id

	WAITFOR DELAY '00:00:00.010'

  COMMIT TRAN

END

DROP TABLE #temp

Now the scenario goes like this:

  1. Table has 2 rows: A and B
  2. Transaction 1 locks row A
  3. Transaction 2 gets the rows with NOLOCK and so only gets A and B
  4. Transaction 2 deletes rows A and B with READPAST, but only B is deleted
  5. loop continues (2 rows selected)
  6. Transaction 3 gets the rows with NOLOCK and gets one row 
  7. Transaction 3 deletes with READPAST with no effect (A is still locked)
  8. loop continues (1 rows selected)
  9. Transaction 1 unlocks row A
  10. Transaction 4 gets the rows with NOLOCK and gets row A (not locked)
  11. Transaction 4 deleted with READPAST and deletes row A
  12. loop continues (1 rows selected), but next transaction selects nothing, so loop ends (0 rows selected)
  13. Table now has no rows and no deadlock occurred

Hope this helps.

 So I got assigned this bug where date 1900-01-01 was displayed on the screen so, as I am lazy, I started to look into the code without reproducing the issue. The SQL stored procedure looked fine, it was returning:

SELECT
  CASE SpecialCase=1 THEN ''
  ELSE SomeDate
  END as DateFilteredBySpecialCase

Then the value was being passed around through various application layers, but it wasn't transformed into anything, then it was displayed. So where did this magical value come from? I was expecting some kind of ISNULL(SomeDate,'1900-01-01') or some change in the mapping code or maybe SomeDate was 1900-01-01 in some records, but I couldn't find anything like that.

Well, at second glance, the selected column has to have a returning type, so what is it? The Microsoft documentation explains:

Returns the highest precedence type from the set of types in result_expressions and the optional else_result_expression. For more information, see Data Type Precedence.

If you follow that link you will see that strings are at the very bottom, while dates are close to the top. In other words, a CASE statement that returns strings and dates will always have the return type a date!

SELECT CAST('' as DATETIME) -- selects 1900-01-01

Just a quickie. Hope it helps.

 T-SQL Querying is a very good overview of SQL Server queries, indexing, best practices, optimization and troubleshooting. I can't imagine someone can just read it and be done with it, as it is full of useful references, so it's good to keep it on the table. Also, it's relatively short, so one can peruse it in a day and then keep using it while doing SQL work.

What I didn't like so much was the inconsistent level of knowledge needed for the various chapters. It starts with a tedious explanations of types of queries and what JOINs are and what ORDER BY is and so on, then moves on to the actual interesting stuff. Also, what the hell is that title and cover? :) You'd think it's a gardening book.

Another great thing about it is that it is available free online, from its publishers: Packt.

  This blog post is about Table Value Constructors or Row Constructors. While they make intuitive sense and this is how I will present them, they were introduced in Microsoft Sql Server 2008 and because they look like another very old feature, most database developers are not aware of them.

  So let's begin with a simple INSERT statement:

CREATE TABLE Position(X INT, Y INT)

INSERT INTO Position
VALUES (1,1),(1,2),(NULL,NULL),(2,1)

So we create a table and we insert some data using the VALUES expression. This is equivalent to

CREATE TABLE Position(X INT, Y INT)

INSERT INTO Position
SELECT 1,1
UNION ALL
SELECT 1,2
UNION ALL
SELECT NULL,NULL
UNION ALL
SELECT 2,1

I've certainly used this SELECT + UNION ALL construct to generate static data in my SQL statements. Sometimes, because it's such an annoying syntax, I've created a table or table variable and then inserted values into it in order to use data in a structured way. But could we use the VALUES expression in other contexts, not just for INSERT statements? And the answer is: Yes! (in Sql Server 2008 or newer)

Here is an example:

SELECT *
FROM (VALUES(1,1),(1,2),(NULL,NULL),(2,1)) as Position(X,Y)

This is not a disk table, nor is it a table variable, but an expression that will be treated as a table with columns X and Y, of type INT. As in a SELECT/UNION ALL construct, the type of the columns will be determined by the first set of values.

You can see a "real life" example in my previous post on how to solve Sudoku using an SQL statement.

Now, while I've explained how to remember the syntax and purpose of Table Value Constructors, there are differences between the VALUES expression used as a TVC and when used in an INSERT statement.

In an INSERT statement, VALUES is just a way to specify data to add and has been there since the beginning of SQL and therefore is subject to constraints from that era. For example, you cannot add more than 1000 rows in an INSERT/VALUES construct. But you can using an INSERT/SELECT/VALUES construct:

INSERT INTO Positions
VALUES (1,2),
       (1,1),
       -- ... more than 1000 records
       (0,1)

-- Error 10738 is returned

INSERT INTO Positions
SELECT x,y FROM (
VALUES (1,2),
       (1,1),
       -- ... more than 1000 records
       (0,1)
) as P(x,y)

-- Works like a charm

Hope it helps!

OK, so I played a little with SQL and I found an interesting flow for analysing queries. It uses the SET STATISTICS PROFILE functionality, but the results of this are usually hard to read and handle in any meaningful way. There are applications that help out with this, but this blog post is trying to show you a method that doesn't need any extra software (for when you are working for a paranoid company that doesn't allow you to install what you need to do your work, for example).

This works in the query itself, so no need of any extra tool except SQL Server Management Studio and Excel:

  1. Add SET STATISTICS PROFILE OFF at the start of the query (because you don’t need to profile the setup)
  2. Add SET STATISTICS PROFILE ON just before the SELECT that you want to optimize
  3. Clear cache and stats - this is optional, but good practice. There are multiple ways of doing this and it depends on your environment and preferences, so I am not covering this here.
  4. Execute the query -> In the query results you will get the results of the query, but also the profiling statistics of the query execution, also in table form
  5. Copy the entire statistics table with headers and insert it into a new Excel sheet
  6. Add a new column right after Parent, call it IsLeaf
  7. Fill the IsLeaf column with a formula to see if the value in NodeId exists in the Parent column
    1. Write "=COUNTIF($F$2:$F$10000,E2)=0" as the first value of the column
    2. Keep pressing Shift, then press End and Down arrow (and release Shift) – you should have the entire column selected
    3. Press Ctrl-D
  8. Select the header row of the table
  9. Click on "Sort and Filter"
  10. Select "Filter"
  11. Click on a random cell, click on "Sort and Filter" again
  12. Click on "Custom sort"
  13. Select TotalTreeSubcost and "From largest to smallest"
  14. Now click on the filter on the IsLeaf column and filter on value TRUE (only the leaves)

You should now have the rows of the final tree branch nodes, ordered descending by the cost to the query.

Here you can look at the IO cost, CPU cost and Rows columns to find the places you need to work on. These values need to be as small as possible.

I hope this helps.

  Update: more analysis shows that the change was not because of the OR clauses, but because of some debug OPTIONs that I had used. This post is thus wrong.

Original post:

  So I had this stored procedure that would calculate counts from a table, based on a specific column which was also indexed. Something like this:

SELECT Code, COUNT(*) as Nr 
FROM MyTable

  The code would take the result of this query and only use the counts for some of the codes, let's say 'A', 'B' and 'C'. There was also a large number of instructions that had a NULL Code. So the obvious optimizations was:

SELECT Code, COUNT(*) as Nr 
FROM MyTable
WHERE Code IN ('A','B','C')

  And it worked, however I was getting this annoying warning in the execution plan: "Operator used tempdb to spill data during execution". What the hell was that?

  Long story short, I found a very nice SO answer that explains it: SQL Server cardinality estimation can use two types of statistics to guess how many rows will get through a predicate filter:

  • about the column on average using the density vector
  • about particular values for that column using the histogram

When a literal is used, the cardinality estimator can search for that literal in the histogram. When a parameter is used, its value is not evaluated until after cardinality estimation, so the CE has to use column averages in the density vector.

  Probably, behind the scenes, ('A','B','C') is treated as a variable, so it only uses the density vector. Also, because how the IN operator is implemented, what happens to the query next is very different than replacing it with a bunch of ORs:

SELECT Code, COUNT(*) as Nr 
FROM MyTable
WHERE (Code='A' OR Code='B' OR Code='C')

  Not only the warning disappeared, but the execution time was greatly reduced!  

  You see, the query here is simplified a lot, but in real life it was part of a larger one, including complicated joins and multiple conditions. With an IN clause, the execution plan would only show me one query, containing multiple joins and covering all of the rows returned. By using OR clauses, the execution plan would show me three different queries, one for each code.

  This means that in certain situations, this strategy might not work, especially if the conditions are not disjunct and have rows that meet multiple ones. I am also astonished that for such a simple IN clause, the engine did not know to translate it automatically into a series of ORs! My intent as a developer is clear and the implementation should just take that and turn it into the most effective query possible.

  I usually tell people to avoid using OR clauses (and instead try to use ANDs) or query on values that are different (try for equality instead) or using NOT IN. And the reason is again the execution plan and how you cannot prove a general negative claim. Even so, I've always assumed that IN will work as a series or ORs. My reasoning was that, in case of an OR, the engine would have to do something like an expensive DISTINCT operation, something like this: 

SELECT *
FROM MyTable
WHERE (Code='A' OR Code='B')

-- the above should equate to
SELECT *
FROM MyTable
WHERE Code='A'
UNION -- not a disjunct union so engine would have to eliminate duplicates
SELECT *
FROM MyTable
WHERE Code='B'

-- therefore an optimal query is
SELECT *
FROM MyTable
WHERE Code='A'
UNION ALL - disjunct union
SELECT *
FROM MyTable
WHERE Code='B'
-- assuming that there are no rows that meet both conditions (code A and code B)

  In this case, however, SQL did manage to understand that the conditions were disjunct so it split the work into three, correctly using the index and each of them being quite efficient.

  I learned something today!

Intro

  There is a saying that the novice will write code that works, without thinking of anything else, the expert will come and rewrite that code according to good practices and the master will rewrite it so that it works again, thinking of everything. It applies particularly well to SQL. Sometimes good and well tried best practices fail in specific cases and one must guide themselves either by precise measurements of by narrow rules that take decades to learn.

  If you ever wondered why some SQL queries are very slow or how to write complex SQL stored procedures without them reaching sentience and behaving unpredictably, this post might help. I am not a master myself, but I will share some quick and dirty ways of writing, then checking your SQL code.

Some master rules

  First of all, some debunking of best practices that make unreasonable assumptions at scale:

  1. If you have to extract data based on many parameters, then add them as WHERE or ON clauses and the SQL engine will know how to handle it.

    For small queries and for well designed databases, that is correct. The SQL server engine is attempting to create execution plans for these parameter combinations and reuse them in the future on other executions. However, when the number of parameters increases, the number of possible parameter combinations increases exponentially. The execution optimization should not take more than the execution itself, so the engine if just choosing one of the existing plans which appears more similar to the parameters given. Sometimes this results in an abysmal performance.

    There are two solutions:

    The quick and dirty one is to add OPTION (RECOMPILE) to the parameterized SELECT query. This will tell the engine to always ignore existing execution plans. With SQL 2016 there is a new feature called Query Store plus a graphical interface that explores execution plans, so one can choose which ones are good and which ones are bad. If you have the option, you might manually force an execution plan on specific queries, as well. But I don't recommend this because it is a brittle and nonintuitive solution. You need a DBA to make sure the associations are correct and maintained properly.

    The better one, to my own surprise, is to use dynamic SQL. In other words, if you have 20 parameters to your stored procedure, with only some getting used at any time (think an Advanced Search page), create an SQL string only with the parameters that are set, then execute it.

    My assumption has always been that the SQL engine will do this for me if I use queries like WHERE (@param IS NULL OR <some condition with @param>). I was disappointed to learn that it does not always do that. Be warned, though, that most of the time multiple query parameters are optimized by running several operations in parallel, which is best!

  2. If you query on a column or another column, an OR clause will be optimal. 

    Think of something like this: You have a table with two account columns AccId and AccId2. You want to query a lot on an account parameter @accountId and you have added an index on each column.

    At this time the more readable option, and for small queries readability is always preferable to performance improvement, is WHERE AccId=@accountId OR AccId2=@accountId. But how would the indexes be used here, in this OR clause? First the engine will have to find all entries with the correct AccId, then again find entries with the correct AccId2, but only the entries that have not been found in the first search.

    First of all, SQL will not do this very well when the WHERE clause is very complex. Second of all, even if it did it perfectly, if you know there is no overlap, or you don't care or you can use a DISTINCT further on to eliminate duplicates, then it is more effective to have two SELECT queries, one for AccId and the other for AccId2 that you UNION ALL afterwards.

    My assumption has always been that the SQL engine will do this automatically. I was quite astounded to hear it was not true. Also, I may be wrong, because different SQL engines and their multitude of versions, compounded with the vast array of configuration options for both engine and any database, behave quite differently. Remember the parallelism optimization, as well.

  3. Temporary tables as slow, use table variables instead.

    Now that is just simple logic, right? A temporary table uses disk while a table variable uses memory. The second has to be faster, right? In the vast majority of cases this will be true. It all depends (a verb used a lot in SQL circles) on what you do with it.

    Using a temporary table might first of all be optimized by the engine to not use the disk at all. Second, temporary tables have statistics, while table variables do not. If you want the SQL engine to do its magic without your input, you might just have to use a temporary table.

  4. A large query that does everything is better than small queries that I combine later on.

    This is a more common misconception than the others. The optimizations the SQL engine does work best on smaller queries, as I've already discussed above, so if a large query can be split into two simpler ones, the engine will be more likely able to find the best way of executing each. However, this only applies if the two queries are completely independent. If they are related, the engine might find the perfect way of getting the data in a query that combines them all.

    Again, it depends. One other scenario is when you try to DELETE or UPDATE a lot of rows. SQL is always "logging" the changes that it does on the off chance that the user cancels the query and whatever incomplete work has been done has to be undone. With large amounts of data, this results into large log files and slow performance. One common solution is to do it in batches, using UPDATE (TOP 10000) or something similar inside a WHILE loop. Note that while this solves the log performance issue, it adds a little bit of overhead for each executed UPDATE

  5. If I have an index on a DATETIME column and I want to check the records in a certain day, I can use CAST or CONVERT.

    That is just a bonus rule, but I've met the problem recently. The general rule is that you should never perform calculations on columns inside WHERE clauses. So instead of WHERE CAST(DateColumn as DATE)=@date use WHERE DateColumn>=@date AND DateColumn<DATEADD(DAY,1,@date). The calculation is done (once) on the parameters given to the query, not on every value of DateColumn. Also, indexes are now used.

Optimizing queries for dummies

So how does one determine if one of these rules apply to their case? "Complex query" might mean anything. Executing a query multiple times results in very different results based on how the engine is caching the data or computing execution plans.

A lot of what I am going to say can be performed using SQL commands, as well. Someone might want to use direct commands inside their own tool to monitor and manage performance of SQL queries. But what I am going to show you uses the SQL Management Studio and, better still, not that horrid Execution Plan chart that often crashes SSMS and it is hard to visualize for anything that the most simple queries. Downside? You will need SQL Management Studio 2014 or higher.

There are two buttons in the SSMS menu. One is "Include Actual Execution Plan" which generates an ugly and sometimes broken chart of the execution. The other one is "Include Live Query Statistics" which seems to be doing the same, only in real time. However, the magic happens when both are enabled. In the Results tab you will get not only the query results, but also tabular data about the execution performance. It is amazingly useful, as you get a table per each intermediary query, for example if you have a stored procedure that executes several queries in a row, you get a table for each.

Even more importantly, it seems that using these options will start the execution without any cached data or execution plans. Running it several times gives consistent execution times.

In the LiveQuery tables, the values we are interested about are, in order of importance, EstimateIO, EstimateCPU and Rows.

EstimateIO is telling us how much of the disk was used. The disk is the slowest part of a computer, especially when multiple processes are running queries at the same time. Your objective is to minimize that value. Luckily, on the same row, we get data about the substatement that generated that row, which parameters were used, which index was used etc. This blog is not about how to fix every single scenario, but only on how to determine where the biggest problems lie.

EstimateCPU is saying how much processing power was used. Most of the time this is very small, as complex calculations should not be performed in queries anyway, but sometimes a large value here shows a fault in the design of the query.

Finally, Rows. It is best to minimize the value here, too, but it is not always possible. For example a COUNT(*) will show a Clustered Index Scan with Rows equal to the row count in the table. That doesn't cause any performance problems. However, if your query is supposed to get 100 rows and somewhere in the Live Query table there is a value of several millions, you might have used a join without the correct ON clause parameters or something like that.

Demo

Let's see some examples of this. I have a Main table, with columns ID BIGINT, Random1 INT, Random2 NVARCHAR(100) and Random3 CHAR(10) with one million rows. Then an Ind table, with columns ID BIGINT, Qfr CHAR(4) and ValInd BIGINT with 10000 rows. The ID table is common with the Main table ID column and the Qfr column has only three possible values: AMT, QTY, Sum.

Here is a demo on how this would work:

DECLARE @r1 INT = 1300000
DECLARE @r2 NVARCHAR(100) = 'a'
DECLARE @r3 CHAR(10) = 'A'
DECLARE @qfr CHAR(4) = 'AMT'
DECLARE @val BIGINT = 500000

DECLARE @r1e INT = 1600000
DECLARE @r2e NVARCHAR(100) = 'z'
DECLARE @r3e CHAR(10)='Z'
DECLARE @vale BIGINT = 600000

SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE (@r1 IS NULL OR m.Random1>=@r1)
  AND (@r2 IS NULL OR m.Random2>=@r2)
  AND (@r3 IS NULL OR m.Random3>=@r3)
  AND (@val IS NULL OR i.ValInd>=@val)
  AND (@r1e IS NULL OR m.Random1<=@r1e)
  AND (@r2e IS NULL OR m.Random2<=@r2e)
  AND (@r3e IS NULL OR m.Random3<=@r3e)
  AND (@vale IS NULL OR i.ValInd<=@vale)
  AND (@qfr IS NULL OR i.Qfr=@qfr)

I have used 9 parameters, each with their own values, to limit the number of rows I get. The Live Query result is:

You can see that the EstimateIO values are non-zero only on the Clustered Index Scans, one for each table. Where is how the StmtText looks like: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]),  WHERE:(([@val] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]>=[@val]) AND ([@vale] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]<=[@vale]) AND ([@qfr] IS NULL OR [Test].[dbo].[Ind].[Qfr] as [i].[Qfr]=[@qfr])) ORDERED FORWARD)".

This is a silly case, but you can see that the @parameter IS NULL type of query condition has not been removed, even when parameter is clearly not null.

Let's change the values of the parameters:

DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL

DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL

Now the Live Query result is:

Same thing! 5.0 and 7.2

Now, let's do the same thing with dynamic SQL. It's a little more annoying, mostly because of the parameter syntax, but check it out:

DECLARE @sql NVARCHAR(Max)

DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL

DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL


SET @sql=N'
SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE 1=1 '
IF @r1 IS NOT NULL SET @sql+=' AND m.Random1>=@r1'
IF @r2 IS NOT NULL SET @sql+=' AND m.Random2>=@r2'
IF @r3 IS NOT NULL SET @sql+=' AND m.Random3>=@r3'
IF @val IS NOT NULL SET @sql+=' AND i.ValInd>=@val'
IF @r1e IS NOT NULL SET @sql+=' AND m.Random1<=@r1e'
IF @r2e IS NOT NULL SET @sql+=' AND m.Random2<=@r2e'
IF @r3e IS NOT NULL SET @sql+=' AND m.Random3<=@r3e'
IF @qfr IS NOT NULL SET @sql+=' AND i.Qfr=@qfr'
IF @vale IS NOT NULL SET @sql+=' AND i.ValInd<=@vale'

PRINT @sql

EXEC sp_executesql @sql,
  N'@r1 INT, @r2 NVARCHAR(100), @r3 CHAR(10), @qfr CHAR(4),@val BIGINT,@r1e INT, @r2e NVARCHAR(100), @r3e CHAR(10),@vale BIGINT',
  @r1,@r2,@r3,@qfr,@val,@r1e,@r2e,@r3e,@vale

Now the Live Query results are:

At first glance we have not changed much. IO is still 5.0 and 7.2. Yet there are 3 less execution steps. There is no parallelism and the query has been executed in 5 seconds, not 6. The StmtText for the same thing is now: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]), ORDERED FORWARD)". The printed SQL command is:

SELECT *
FROM Main m
INNER JOIN Ind i
ON m.ID=i.ID
WHERE 1=1  AND m.Random1>=@r1 AND m.Random1<=@r1e

Conclusion

Again, this is a silly example. But with some results anyway! In my work I have used this to get a stored procedure to work three to four times faster!

One can optimize usage of IO, CPU and Rows by adding indexes, by narrowing join conditions, by reducing the complexity of executed queries, eliminating temporary tables, partitioning existing tables, adding or removing hints, removing computation from queried columns and so many other possible methods, but they amount to nothing if you cannot measure the results of your changes.

By using Actual Execution Plan together with Live Query Statistics you get:

  • consistent execution times and disk usage
  • a clear measure of what went on with each subquery

BTW, you get the same effect if you use SET STATISTICS PROFILE ON before the query. Yet, I wrote this post with someone that doesn't want to go into extra SQL code in mind. Also, when calculating performance, it is recommended to add a DBCC FREEPROCCACHE line before execution OR add the option RECOMPILE to your query (this doesn't work on a stored procedure execution, you would have to change the SP queries to include RECOMPILE).

I wish I had some more interesting examples for you, guys, but screenshots from the workplace are not something I want to do and I don't do any complex SQL work at home. I hope this helps. 

  On the SQLite reference page for the WITH clause there is a little example of solving a Sudoku puzzle. Using SQL. I wanted to see it in action and therefore I've translated it into T-SQL.

  You might think that there is a great algorithm at play, something that will blow your mind. I mean, people have blogged about Sudoku solvers to hone their programming skills for ages and they have worked quite a lot, writing lines and lines of how clever they were. And this is SQL, it works, but how do you do something complex in it? But no, it's very simple, very straightforward and also performant. Kind of a let down, I know, but it pretty much takes all possible solutions and only selects for the valid ones using CTEs (Common Table Expressions).

  Here is the translation, followed by some explanation of the code:

DECLARE @Board VARCHAR(81) = '86....3...2...1..7....74...27.9..1...8.....7...1..7.95...56....4..1...5...3....81';
WITH x(s,ind) AS
(
  SELECT sud,CHARINDEX('.',sud) as ind FROM (VALUES(@Board)) as input(sud)
  UNION ALL
  SELECT
	CONVERT(VARCHAR(81),CONCAT(SUBSTRING(s,1,ind-1),z,SUBSTRING(s,ind+1,81))) as s,
	CHARINDEX('.',CONCAT(SUBSTRING(s,1,ind-1),z,SUBSTRING(s,ind+1,81))) as ind
  FROM x
  INNER JOIN (VALUES('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9')) as digits(z)
  ON NOT EXISTS (
            SELECT 1
              FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9)) as positions(lp)
             WHERE z = SUBSTRING(s, ((ind-1)/9)*9 + lp, 1)
                OR z = SUBSTRING(s, ((ind-1)%9) + (lp-1)*9 + 1, 1)
                OR z = SUBSTRING(s, (((ind-1)/3) % 3) * 3
                        + ((ind-1)/27) * 27 + lp
                        + ((lp-1) / 3) * 6, 1)
	)
	WHERE ind>0
)
SELECT s FROM x WHERE ind = 0

  The only changes from the original code I've done is to extract the unsolved puzzle into its own variable and to change the puzzle values. Also, added a more clear INNER JOIN syntax to replace the obnoxious, but still valid, comma (aka CROSS JOIN) notation. Here is the breakdown of the algorithm, as it were:

  • start with an initial state of the unsolved puzzle as a VARCHAR(81) string and the first index of a dot in that string, representing an empty slot - this is the anchor part
  • for the recursive member, join the current state with all the possible digit values (1 through 9) and return the strings with the first empty slot replaced by all valid possibilities and the position of the next empty slot
  • stop when there are no more empty slots
  • select the solutions (no empty slots)

  It's that simple. And before you imagine it will generate a huge table in memory or that it will take a huge time, worry not. It takes less than a second (a lot less) to find the solution. Obviously, resource use increases exponentially when the puzzle doesn't have just one solution. If you empty the first slot (. instead of 8) the number of rows is 10 and it takes a second to compute them all. Empty the next slot, too (6) and you get 228 solutions in 26 seconds and so on.

 The magical parts are the recursive Common Table Expression itself and the little piece of code that checks for validity, but the validity check is quite obvious as it is the exact translation of the Sudoku rules: no same digits on lines, rows or square sections.

  A recursive CTE has three parts:

  • an initial query that represents the starting state, often called the anchor member
  • a recursive query that references the CTE itself, called the recursive member, which is UNIONed with the anchor
  • a termination condition, to tell SQL when to end the recursion

  For us, we started with one unsolved solution, we recursed on all possible valid solutions for replacing the first empty slot and we stopped when there were no more empty slots.

  CTEs are often confusing because the notation seems to indicate something else to a procedural programmer. You imagine doing this without CTEs, maybe in an object oriented programming language, and you think of this huge buffer that just keeps increasing and you have to remember where you left off so you don't process the same partial solution multiple times and you have to clean the data structure so it doesn't get too large, etc. SQL, though, is at heart a declarative programming language, very close to functional programming. It will take care not only of the recursion, but also filter the rows by the final condition of no empty slots while (and sometimes before) it makes the computations.

  Once you consider the set of possible solutions for a problem as a working set, SQL can do wonders to find the solution, provided you can encode it in a way the SQL engine will understand. This is just another example of the right tool for the right job. Hope you learned something.

I have been asking this of people at the interviews I am conducting and I thought I should document the correct answer and the expected behavior. And yes, we've filled the position in for this interview question, so you can't cheat :)

The question is quite banal: given two tables (TableA and TableB) both having a column ID, select the rows in TableA that don't have any corresponding row in TableB with the same ID.

Whenever you are answering an interview question, remember that your thinking process is just as important as the answer. So saying nothing, while better than "so I am adding 1 and 1 and getting 2", may not be your best option. Assuming you don't know the answer, a reasonable way of tackling any problem is to take it apart and try to solve every part separately. Let's do this here.

As the question requires the rows in A, select them:

SELECT * FROM TableA

Now, a filter should be applied, but which one? Here are some ideas:

  1. WHERE ID NOT IN (SELECT ID FROM TableB)
  2. WHERE NOT EXISTS (SELECT * FROM TableB WHERE TableA.ID=TableB.ID)
  3. EXCEPT SELECT ID FROM TableB -- this requires to select only ID from TableA, as well (EXCEPT and INTERSECT are new additions to SQL 2019)

Think about it. Any issues with any of them? Any other options?

To test performance, I've used two tables with approximately 35 million rows. Here are the results:

  1. After 17 minutes I had to stop the query. Also, NOT IN has issues with NULL as a value is nether equal or unequal to NULL. SELECT * FROM Table WHERE Value NOT IN (NULL) for example, will always return no rows.
  2. It finished within 4 seconds. There are still issues with NULL, though, as a simple equality would not work with NULL. Assuming we wanted the non-null values of TableA, we're good.
  3. It finished within 5 seconds. This doesn't have any issues with NULL. SELECT NULL EXCEPT SELECT NULL will return no rows, while SELECT 1 EXCEPT SELECT NULL will return a row with the value 1. The syntax is pretty ugly though and works badly if the tables have other columns

What about another solution? We've exhausted simple filtering, how about another avenue? Whenever we want to combine information from two tables we use JOIN, but is that the case here?

SELECT * FROM TableA a
JOIN TableB b
ON a.ID = b.ID -- again, while I would ask people in the interview about null values, we will assume for this post that the values are not nullable

I've used a JOIN keyword, which translates to an INNER JOIN. The query above will select rows from A, but only those that have a correspondence in B. A funny solution to a slightly different question: count the items in A that do not have corresponding items in B:

SELECT (SELECT COUNT(*) FROM TableA) - (SELECT COUNT(*) FROM TableA a JOIN TableB b ON a.ID = b.ID)

However, we want the inverse of the INNER JOIN. What other types of JOINs are there? Bonus interview question! And the answers are:

  • INNER JOIN - returns only rows that have been successfully joined
  • OUTER JOIN (LEFT AND RIGHT) - returns all rows of one table joined to the corresponding values of the other table or NULLs if none
  • CROSS JOIN - returns all the rows in A joined with all the rows in B and all the rows in B that have no match in A

INNER would not work, as demonstrated, so what about a CROSS JOIN? Clearly not, as it will generate 100 trillion rows before filtering anything. SQL Server would optimize a lot of the query, but it would look really weird anyway.

Is there a solution with OUTER JOIN? RIGHT OUTER JOIN will get the rows in B, not in A, so LEFT OUTER JOIN, by elimination, is the only remaining possible solution.

SELECT a.* FROM TableA a 
LEFT OUTER JOIN TableB b
ON a.ID=b.ID

This returns ALL the rows in table A and for each of them, rows in table B that have the same id. In case of a mismatch, though, for a row in table A with no correspondence in table B, we get a row of NULL values. So all we have to do is filter for those. We know that there are no NULLs in the tables, so here is another working solution, solution 4:

SELECT a.* FROM TableA a 
LEFT OUTER JOIN TableB b
ON a.ID=b.ID
WHERE b.ID IS NULL

This solves the problem, as well, in about 4 seconds. However, the other working solution within the same time (solution 2 above) only works as well because newer versions of SQL server are optimizing the execution. Maybe it's a personal preference from the times solution 4 was clearly the best in terms of performance, but I would chose that as the winner.

Summary

  • You can either use NOT EXISTS (and not NOT IN!) or a LEFT OUTER JOIN with a filter on NULL b values.
  • It's important to know if you have NULL values in the joining columns and it's extra points for asking that from your interviewer
  • If not asking, I would penalize solutions that do not take NULL values in consideration. Extra complexity of code, as one cannot simply check for NULL for solution 4. Also a decision has to be made on the expected behavior when working with NULL values
  • When trying to find the solution to a problem in an interview:
    • think of concrete examples of the problem so you can test your solutions
    • break the problems into manageable bits if possible
    • think aloud, but to the point
  • Also, there is nothing more annoying than doing that thing pupils in school do: looking puppy eyed at the teacher while listing solutions to elicit a response. You're not in school anymore. Examples are dirty, time is important, no one cares about your grades.
  • Good luck out there!