There is a saying that the novice will write code that works, without thinking of anything else; the expert will come and rewrite that code according to good practices; and the master will rewrite it so that it works again, thinking of everything. It applies particularly well to SQL. Sometimes well-tried best practices fail in specific cases, and one must be guided either by precise measurements or by narrow rules that take decades to learn.
If you ever wondered why some SQL queries are very slow or how to write complex SQL stored procedures without them reaching sentience and behaving unpredictably, this post might help. I am not a master myself, but I will share some quick and dirty ways of writing, then checking your SQL code.
Some master rules
First of all, some debunking of best practices that make unreasonable assumptions at scale:
- If you have to extract data based on many parameters, then add them as WHERE or ON clauses and the SQL engine will know how to handle it.
For small queries and for well designed databases, that is correct. The SQL Server engine attempts to create execution plans for these parameter combinations and reuse them on future executions. However, as the number of parameters increases, the number of possible parameter combinations increases exponentially. Optimizing the execution should not take longer than the execution itself, so the engine just chooses one of the existing plans, the one that appears closest to the parameters given. Sometimes this results in abysmal performance.
There are two solutions:
The quick and dirty one is to add OPTION (RECOMPILE) to the parameterized SELECT query. This tells the engine to always ignore existing execution plans and compile a new one. SQL Server 2016 also introduced a feature called Query Store, plus a graphical interface that explores execution plans, so one can see which ones are good and which ones are bad. If you have the option, you might manually force an execution plan on specific queries as well. But I don't recommend this, because it is a brittle and nonintuitive solution. You need a DBA to make sure the associations are correct and maintained properly.
The better one, to my own surprise, is to use dynamic SQL. In other words, if you have 20 parameters to your stored procedure, with only some getting used at any time (think an Advanced Search page), create an SQL string only with the parameters that are set, then execute it.
My assumption has always been that the SQL engine will do this for me if I use queries like WHERE (@param IS NULL OR <some condition with @param>). I was disappointed to learn that it does not always do that. Be warned, though, that most of the time multiple query parameters are optimized by running several operations in parallel, which is best!
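For the quick and dirty option, a minimal sketch could look like the following. The procedure name dbo.SearchMain and its two parameters are made up for illustration; the Main table and its Random1 column are the ones used in the demo later in this post.

```sql
-- Hypothetical search procedure (the name and parameter list are assumptions).
CREATE PROCEDURE dbo.SearchMain
    @r1  INT = NULL,   -- optional lower bound
    @r1e INT = NULL    -- optional upper bound
AS
BEGIN
    SELECT m.ID, m.Random1
    FROM Main m
    WHERE (@r1  IS NULL OR m.Random1 >= @r1)
      AND (@r1e IS NULL OR m.Random1 <= @r1e)
    OPTION (RECOMPILE); -- compile a fresh plan on every call instead of reusing a cached one
END
```

The trade-off is that every call pays the compilation cost, which is why this is usually acceptable for heavy search queries and less so for small queries that run thousands of times per second.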
- If you query on a column or another column, an OR clause will be optimal.
Think of something like this: You have a table with two account columns AccId and AccId2. You want to query a lot on an account parameter @accountId and you have added an index on each column.
At this time the more readable option, and for small queries readability is always preferable to performance improvement, is WHERE AccId=@accountId OR AccId2=@accountId. But how would the indexes be used here, in this OR clause? First the engine will have to find all entries with the correct AccId, then again find entries with the correct AccId2, but only the entries that have not been found in the first search.
First of all, SQL will not do this very well when the WHERE clause is very complex. Second of all, even if it did it perfectly, if you know there is no overlap, or you don't care or you can use a DISTINCT further on to eliminate duplicates, then it is more effective to have two SELECT queries, one for AccId and the other for AccId2 that you UNION ALL afterwards.
My assumption has always been that the SQL engine will do this automatically. I was quite astounded to hear it was not true. Also, I may be wrong, because different SQL engines and their multitude of versions, compounded with the vast array of configuration options for both engine and any database, behave quite differently. Remember the parallelism optimization, as well.
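As a sketch of the UNION ALL rewrite, assuming a hypothetical table dbo.Transfers that holds the AccId and AccId2 columns (the table name and the sample value are not from any real schema):

```sql
DECLARE @accountId INT = 42; -- sample value, chosen arbitrarily

-- Readable version: one query, the engine decides how to combine the two indexes.
SELECT t.*
FROM dbo.Transfers t
WHERE t.AccId = @accountId OR t.AccId2 = @accountId;

-- Rewritten version: each branch can seek on its own index.
SELECT t.*
FROM dbo.Transfers t
WHERE t.AccId = @accountId
UNION ALL
SELECT t.*
FROM dbo.Transfers t
WHERE t.AccId2 = @accountId
  -- Skip rows the first branch already returned; drop this filter
  -- if you know there is no overlap or duplicates do not matter.
  AND (t.AccId <> @accountId OR t.AccId IS NULL);
```

Whether the second form is actually faster depends on the data and the engine version, so it is worth measuring both.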
- Temporary tables are slow, use table variables instead.
Now that is just simple logic, right? A temporary table uses disk while a table variable uses memory. The second has to be faster, right? In the vast majority of cases this will be true. It all depends (a verb used a lot in SQL circles) on what you do with it.
Using a temporary table might first of all be optimized by the engine to not use the disk at all. Second, temporary tables have statistics, while table variables do not. If you want the SQL engine to do its magic without your input, you might just have to use a temporary table.
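To compare the two on your own data, something like this can be run with the measurement technique described later in this post. It uses the Main and Ind demo tables from below; the names @ids and #ids are arbitrary.

```sql
-- Table variable: no column statistics, so depending on the SQL Server version
-- the optimizer may assume it holds very few rows.
DECLARE @ids TABLE (ID BIGINT PRIMARY KEY);
INSERT INTO @ids (ID)
SELECT ID FROM Ind WHERE Qfr = 'AMT';

SELECT m.*
FROM Main m
INNER JOIN @ids x ON x.ID = m.ID;

-- Temporary table: statistics are available, so the row estimates
-- for the join can reflect what was actually inserted.
CREATE TABLE #ids (ID BIGINT PRIMARY KEY);
INSERT INTO #ids (ID)
SELECT ID FROM Ind WHERE Qfr = 'AMT';

SELECT m.*
FROM Main m
INNER JOIN #ids x ON x.ID = m.ID;

DROP TABLE #ids;
```

The difference usually only shows once the intermediate set is large enough for a wrong row estimate to change the join strategy.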
- A large query that does everything is better than small queries that I combine later on.
This is a more common misconception than the others. The optimizations the SQL engine does work best on smaller queries, as I've already discussed above, so if a large query can be split into two simpler ones, the engine will be more likely able to find the best way of executing each. However, this only applies if the two queries are completely independent. If they are related, the engine might find the perfect way of getting the data in a query that combines them all.
Again, it depends. One other scenario is when you try to DELETE or UPDATE a lot of rows. SQL is always "logging" the changes that it does on the off chance that the user cancels the query and whatever incomplete work has been done has to be undone. With large amounts of data, this results in large log files and slow performance. One common solution is to do it in batches, using UPDATE TOP (10000) or something similar inside a WHILE loop. Note that while this solves the log performance issue, it adds a little bit of overhead for each executed UPDATE.
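A minimal sketch of such a batched loop, using the Main demo table from later in this post. The update itself (resetting Random1 to 0) is only a placeholder for whatever change you need.

```sql
DECLARE @batch INT = 10000; -- rows per batch, tune by measuring
DECLARE @rows  INT = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (@batch) Main
    SET Random1 = 0
    WHERE Random1 <> 0;     -- must exclude rows already updated, or the loop never ends

    SET @rows = @@ROWCOUNT; -- 0 when nothing is left to update
END
```

Outside an explicit transaction each UPDATE commits on its own, which is what keeps the amount of work to undo small. It also means that a cancelled run leaves the table partially updated.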
- If I have an index on a DATETIME column and I want to check the records in a certain day, I can use CAST or CONVERT.
That is just a bonus rule, but I've met the problem recently. The general rule is that you should never perform calculations on columns inside WHERE clauses. So instead of WHERE CAST(DateColumn as DATE)=@date use WHERE DateColumn>=@date AND DateColumn<DATEADD(DAY,1,@date). The calculation is done (once) on the parameters given to the query, not on every value of DateColumn. Also, indexes are now used.
Optimizing queries for dummies
So how does one determine if one of these rules applies to their case? "Complex query" might mean anything. Executing the same query multiple times gives very different timings, depending on how the engine is caching the data or computing execution plans.
A lot of what I am going to say can be performed using SQL commands, as well. Someone might want to use direct commands inside their own tool to monitor and manage performance of SQL queries. But what I am going to show you uses SQL Server Management Studio and, better still, not that horrid Execution Plan chart that often crashes SSMS and is hard to read for anything but the simplest queries. Downside? You will need SQL Management Studio 2014 or higher.
There are two buttons in the SSMS menu. One is "Include Actual Execution Plan" which generates an ugly and sometimes broken chart of the execution. The other one is "Include Live Query Statistics" which seems to be doing the same, only in real time. However, the magic happens when both are enabled. In the Results tab you will get not only the query results, but also tabular data about the execution performance. It is amazingly useful, as you get a table for each intermediary query. For example, if you have a stored procedure that executes several queries in a row, you get a table for each.
Even more importantly, it seems that using these options will start the execution without any cached data or execution plans. Running it several times gives consistent execution times.
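If you would rather not rely on that observation, a cold start can be forced explicitly before each run. This is only a sketch for a development or test server, since it flushes the caches for everybody using the instance and needs elevated permissions.

```sql
-- Development/test servers only: this slows down every other query on the instance.
CHECKPOINT;             -- write modified pages to disk so they can be evicted
DBCC DROPCLEANBUFFERS;  -- empty the data cache, so the next run reads from disk
DBCC FREEPROCCACHE;     -- discard all cached execution plans
```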
In the Live Query tables, the values we are interested in are, in order of importance, EstimateIO, EstimateCPU and Rows.
EstimateIO tells us how much disk access the engine estimated for that step. The disk is the slowest part of a computer, especially when multiple processes are running queries at the same time. Your objective is to minimize that value. Luckily, on the same row, we get data about the substatement that generated that row, which parameters were used, which index was used etc. This post is not about how to fix every single scenario, only about how to determine where the biggest problems lie.
EstimateCPU is saying how much processing power was used. Most of the time this is very small, as complex calculations should not be performed in queries anyway, but sometimes a large value here shows a fault in the design of the query.
Finally, Rows. It is best to minimize the value here, too, but it is not always possible. For example a COUNT(*) will show a Clustered Index Scan with Rows equal to the row count in the table. That doesn't cause any performance problems. However, if your query is supposed to get 100 rows and somewhere in the Live Query table there is a value of several millions, you might have used a join without the correct ON clause parameters or something like that.
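For those who prefer commands to buttons, a similar per-operator table can be requested with SET STATISTICS PROFILE, which returns an extra result set containing, among others, the Rows, StmtText, EstimateIO and EstimateCPU columns. I have not checked that it is exactly what SSMS runs behind the scenes, so treat it as an approximation of the technique above. The COUNT(*) on the Main demo table is just a stand-in for the query being measured.

```sql
SET STATISTICS PROFILE ON;  -- extra result set: one row per plan operator
SET STATISTICS IO ON;       -- actual reads per table, shown in the Messages tab
SET STATISTICS TIME ON;     -- compile and execution times, shown in the Messages tab

SELECT COUNT(*) FROM Main;  -- the query under test

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
SET STATISTICS PROFILE OFF;
```

Since EstimateIO is an estimate, the reads reported by SET STATISTICS IO are a useful cross-check against what was actually touched.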
Let's see some examples of this. I have a Main table, with columns ID BIGINT, Random1 INT, Random2 NVARCHAR(100) and Random3 CHAR(10), with one million rows. Then an Ind table, with columns ID BIGINT, Qfr CHAR(4) and ValInd BIGINT, with 10000 rows. The ID column is common with the Main table ID column and the Qfr column has only three possible values: AMT, QTY, Sum.
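To reproduce a similar setup, something like the script below could be used. The column types and row counts are the ones described above; the primary keys, the value ranges and the way the random data is generated are assumptions made only to get comparable tables.

```sql
CREATE TABLE Main (
    ID      BIGINT NOT NULL PRIMARY KEY,  -- assumed primary key (clustered index)
    Random1 INT,
    Random2 NVARCHAR(100),
    Random3 CHAR(10)
);

CREATE TABLE Ind (
    ID     BIGINT NOT NULL PRIMARY KEY,   -- shares its values with Main.ID
    Qfr    CHAR(4),
    ValInd BIGINT
);

-- One million rows: numbers 1..1,000,000 built from a cross join of a system view.
WITH n AS (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
INSERT INTO Main (ID, Random1, Random2, Random3)
SELECT i,
       ABS(CHECKSUM(NEWID()) % 2000000),         -- assumed range: 0..1,999,999
       CONVERT(NVARCHAR(100), NEWID()),          -- random text
       LEFT(CONVERT(CHAR(36), NEWID()), 10)      -- random 10 characters
FROM n;

-- 10,000 rows reusing the first 10,000 IDs of Main.
INSERT INTO Ind (ID, Qfr, ValInd)
SELECT TOP (10000)
       ID,
       CASE ID % 3 WHEN 0 THEN 'AMT' WHEN 1 THEN 'QTY' ELSE 'Sum' END,
       ABS(CHECKSUM(NEWID()) % 1000000)          -- assumed range: 0..999,999
FROM Main
ORDER BY ID;
```

The numbers you get will differ from the ones quoted below, since the data is random and the original tables are not available.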
Here is a demo on how this would work:
DECLARE @r1 INT = 1300000
DECLARE @r2 NVARCHAR(100) = 'a'
DECLARE @r3 CHAR(10) = 'A'
DECLARE @qfr CHAR(4) = 'AMT'
DECLARE @val BIGINT = 500000
DECLARE @r1e INT = 1600000
DECLARE @r2e NVARCHAR(100) = 'z'
DECLARE @r3e CHAR(10)='Z'
DECLARE @vale BIGINT = 600000

SELECT *
FROM Main m
INNER JOIN Ind i
    ON m.ID=i.ID
WHERE (@r1 IS NULL OR m.Random1>=@r1)
  AND (@r2 IS NULL OR m.Random2>=@r2)
  AND (@r3 IS NULL OR m.Random3>=@r3)
  AND (@val IS NULL OR i.ValInd>=@val)
  AND (@r1e IS NULL OR m.Random1<=@r1e)
  AND (@r2e IS NULL OR m.Random2<=@r2e)
  AND (@r3e IS NULL OR m.Random3<=@r3e)
  AND (@vale IS NULL OR i.ValInd<=@vale)
  AND (@qfr IS NULL OR i.Qfr=@qfr)
I have used 9 parameters, each with their own values, to limit the number of rows I get. The Live Query result is:
You can see that the EstimateIO values are non-zero only on the Clustered Index Scans, one for each table. Here is what the StmtText looks like: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]), WHERE:(([@val] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]>=[@val]) AND ([@vale] IS NULL OR [Test].[dbo].[Ind].[ValInd] as [i].[ValInd]<=[@vale]) AND ([@qfr] IS NULL OR [Test].[dbo].[Ind].[Qfr] as [i].[Qfr]=[@qfr])) ORDERED FORWARD)".
This is a silly case, but you can see that the @parameter IS NULL type of query condition has not been removed, even when the parameter is clearly not null.
Let's change the values of the parameters:
DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL
DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL
Now the Live Query result is:
Same thing! The EstimateIO values are 5.0 and 7.2.
Now, let's do the same thing with dynamic SQL. It's a little more annoying, mostly because of the parameter syntax, but check it out:
DECLARE @sql NVARCHAR(Max)
DECLARE @r1 INT = 300000
DECLARE @r2 NVARCHAR(100) = NULL
DECLARE @r3 CHAR(10) = NULL
DECLARE @qfr CHAR(4) = NULL
DECLARE @val BIGINT = NULL
DECLARE @r1e INT = 600000
DECLARE @r2e NVARCHAR(100) = NULL
DECLARE @r3e CHAR(10)=NULL
DECLARE @vale BIGINT = NULL

SET @sql=N' SELECT * FROM Main m INNER JOIN Ind i ON m.ID=i.ID WHERE 1=1 '
IF @r1 IS NOT NULL SET @sql+=' AND m.Random1>=@r1'
IF @r2 IS NOT NULL SET @sql+=' AND m.Random2>=@r2'
IF @r3 IS NOT NULL SET @sql+=' AND m.Random3>=@r3'
IF @val IS NOT NULL SET @sql+=' AND i.ValInd>=@val'
IF @r1e IS NOT NULL SET @sql+=' AND m.Random1<=@r1e'
IF @r2e IS NOT NULL SET @sql+=' AND m.Random2<=@r2e'
IF @r3e IS NOT NULL SET @sql+=' AND m.Random3<=@r3e'
IF @qfr IS NOT NULL SET @sql+=' AND i.Qfr=@qfr'
IF @vale IS NOT NULL SET @sql+=' AND i.ValInd<=@vale'

PRINT @sql

EXEC sp_executesql @sql,
    N'@r1 INT, @r2 NVARCHAR(100), @r3 CHAR(10), @qfr CHAR(4),@val BIGINT,@r1e INT, @r2e NVARCHAR(100), @r3e CHAR(10),@vale BIGINT',
    @r1,@r2,@r3,@qfr,@val,@r1e,@r2e,@r3e,@vale
Now the Live Query results are:
At first glance we have not changed much. IO is still 5.0 and 7.2. Yet there are 3 fewer execution steps. There is no parallelism and the query has been executed in 5 seconds, not 6. The StmtText for the same thing is now: "|--Clustered Index Scan(OBJECT:([Test].[dbo].[Ind].[PK__Ind__DEBF89006F996CA8] AS [i]), ORDERED FORWARD)". The printed SQL command is:
SELECT * FROM Main m INNER JOIN Ind i ON m.ID=i.ID WHERE 1=1 AND m.Random1>=@r1 AND m.Random1<=@r1e
Again, this is a silly example. But with some results anyway! In my work I have used this to get a stored procedure to work three to four times faster!
One can optimize usage of IO, CPU and Rows by adding indexes, by narrowing join conditions, by reducing the complexity of executed queries, eliminating temporary tables, partitioning existing tables, adding or removing hints, removing computation from queried columns and so many other possible methods, but they amount to nothing if you cannot measure the results of your changes.
By using Actual Execution Plan together with Live Query Statistics you get:
- consistent execution times and disk usage
- a clear measure of what went on with each subquery
I wish I had some more interesting examples for you, guys, but screenshots from the workplace are not something I want to do and I don't do any complex SQL work at home. I hope this helps.