Update: more analysis shows that the change was not because of the OR clauses, but because of some debug OPTIONs that I had used. This post is thus wrong.

Original post:

  So I had this stored procedure that would calculate counts from a table, based on a specific column which was also indexed. Something like this:

SELECT Code, COUNT(*) as Nr 
FROM MyTable

  The code would take the result of this query and only use the counts for some of the codes, let's say 'A', 'B' and 'C'. There was also a large number of instructions that had a NULL Code. So the obvious optimizations was:

SELECT Code, COUNT(*) as Nr 
FROM MyTable
WHERE Code IN ('A','B','C')

  And it worked, however I was getting this annoying warning in the execution plan: "Operator used tempdb to spill data during execution". What the hell was that?

  Long story short, I found a very nice SO answer that explains it: SQL Server cardinality estimation can use two types of statistics to guess how many rows will get through a predicate filter:

  • about the column on average using the density vector
  • about particular values for that column using the histogram

When a literal is used, the cardinality estimator can search for that literal in the histogram. When a parameter is used, its value is not evaluated until after cardinality estimation, so the CE has to use column averages in the density vector.

  Probably, behind the scenes, ('A','B','C') is treated as a variable, so it only uses the density vector. Also, because how the IN operator is implemented, what happens to the query next is very different than replacing it with a bunch of ORs:

SELECT Code, COUNT(*) as Nr 
FROM MyTable
WHERE (Code='A' OR Code='B' OR Code='C')

  Not only the warning disappeared, but the execution time was greatly reduced!  

  You see, the query here is simplified a lot, but in real life it was part of a larger one, including complicated joins and multiple conditions. With an IN clause, the execution plan would only show me one query, containing multiple joins and covering all of the rows returned. By using OR clauses, the execution plan would show me three different queries, one for each code.

  This means that in certain situations, this strategy might not work, especially if the conditions are not disjunct and have rows that meet multiple ones. I am also astonished that for such a simple IN clause, the engine did not know to translate it automatically into a series of ORs! My intent as a developer is clear and the implementation should just take that and turn it into the most effective query possible.

  I usually tell people to avoid using OR clauses (and instead try to use ANDs) or query on values that are different (try for equality instead) or using NOT IN. And the reason is again the execution plan and how you cannot prove a general negative claim. Even so, I've always assumed that IN will work as a series or ORs. My reasoning was that, in case of an OR, the engine would have to do something like an expensive DISTINCT operation, something like this: 

SELECT *
FROM MyTable
WHERE (Code='A' OR Code='B')

-- the above should equate to
SELECT *
FROM MyTable
WHERE Code='A'
UNION -- not a disjunct union so engine would have to eliminate duplicates
SELECT *
FROM MyTable
WHERE Code='B'

-- therefore an optimal query is
SELECT *
FROM MyTable
WHERE Code='A'
UNION ALL - disjunct union
SELECT *
FROM MyTable
WHERE Code='B'
-- assuming that there are no rows that meet both conditions (code A and code B)

  In this case, however, SQL did manage to understand that the conditions were disjunct so it split the work into three, correctly using the index and each of them being quite efficient.

  I learned something today!

Comments

Be the first to post a comment

Post a comment