I have been asking this of people at the interviews I am conducting and I thought I should document the correct answer and the expected behavior. And yes, we've filled the position in for this interview question, so you can't cheat :)

The question is quite banal: given two tables (TableA and TableB) both having a column ID, select the rows in TableA that don't have any corresponding row in TableB with the same ID.

Whenever you are answering an interview question, remember that your thinking process is just as important as the answer. So saying nothing, while better than "so I am adding 1 and 1 and getting 2", may not be your best option. Assuming you don't know the answer, a reasonable way of tackling any problem is to take it apart and try to solve every part separately. Let's do this here.

As the question requires the rows in A, select them:

SELECT * FROM TableA

Now, a filter should be applied, but which one? Here are some ideas:

  1. WHERE ID NOT IN (SELECT ID FROM TableB)
  2. WHERE NOT EXISTS (SELECT * FROM TableB WHERE TableA.ID=TableB.ID)
  3. EXCEPT SELECT ID FROM TableB -- this requires to select only ID from TableA, as well (EXCEPT and INTERSECT are new additions to SQL 2019)

Think about it. Any issues with any of them? Any other options?

To test performance, I've used two tables with approximately 35 million rows. Here are the results:

  1. After 17 minutes I had to stop the query. Also, NOT IN has issues with NULL as a value is nether equal or unequal to NULL. SELECT * FROM Table WHERE Value NOT IN (NULL) for example, will always return no rows.
  2. It finished within 4 seconds. There are still issues with NULL, though, as a simple equality would not work with NULL. Assuming we wanted the non-null values of TableA, we're good.
  3. It finished within 5 seconds. This doesn't have any issues with NULL. SELECT NULL EXCEPT SELECT NULL will return no rows, while SELECT 1 EXCEPT SELECT NULL will return a row with the value 1. The syntax is pretty ugly though and works badly if the tables have other columns

What about another solution? We've exhausted simple filtering, how about another avenue? Whenever we want to combine information from two tables we use JOIN, but is that the case here?

SELECT * FROM TableA a
JOIN TableB b
ON a.ID = b.ID -- again, while I would ask people in the interview about null values, we will assume for this post that the values are not nullable

I've used a JOIN keyword, which translates to an INNER JOIN. The query above will select rows from A, but only those that have a correspondence in B. A funny solution to a slightly different question: count the items in A that do not have corresponding items in B:

SELECT (SELECT COUNT(*) FROM TableA) - (SELECT COUNT(*) FROM TableA a JOIN TableB b ON a.ID = b.ID)

However, we want the inverse of the INNER JOIN. What other types of JOINs are there? Bonus interview question! And the answers are:

  • INNER JOIN - returns only rows that have been successfully joined
  • OUTER JOIN (LEFT AND RIGHT) - returns all rows of one table joined to the corresponding values of the other table or NULLs if none
  • CROSS JOIN - returns a 

INNER would not work, as demonstrated, so what about a CROSS JOIN? Clearly not, as it will generate 100 trillion rows before filtering anything. SQL Server would optimize a lot of the query, but it would look really weird anyway.

Is there a solution with OUTER JOIN? RIGHT OUTER JOIN will get the rows in B, not in A, so LEFT OUTER JOIN, by elimination, is the only remaining possible solution.

SELECT a.* FROM TableA a 
LEFT OUTER JOIN TableB b
ON a.ID=b.ID

This returns ALL the rows in table A and for each of them, rows in table B that have the same id. In case of a mismatch, though, for a row in table A with no correspondence in table B, we get a row of NULL values. So all we have to do is filter for those. We know that there are no NULLs in the tables, so here is another working solution, solution 4:

SELECT a.* FROM TableA a 
LEFT OUTER JOIN TableB b
ON a.ID=b.ID
WHERE b.ID IS NULL

This solves the problem, as well, in about 4 seconds. However, the other working solution within the same time (solution 2 above) only works as well because newer versions of SQL server are optimizing the execution. Maybe it's a personal preference from the times solution 4 was clearly the best in terms of performance, but I would chose that as the winner.

Summary

  • You can either use NOT EXISTS (and not NOT IN!) or a LEFT OUTER JOIN with a filter on NULL b values.
  • It's important to know if you have NULL values in the joining columns and it's extra points for asking that from your interviewer
  • If not asking, I would penalize solutions that do not take NULL values in consideration. Extra complexity of code, as one cannot simply check for NULL for solution 4. Also a decision has to be made on the expected behavior when working with NULL values
  • When trying to find the solution to a problem in an interview:
    • think of concrete examples of the problem so you can test your solutions
    • break the problems into manageable bits if possible
    • think aloud, but to the point
  • Also, there is nothing more annoying than doing that thing pupils in school do: looking puppy eyed at the teacher while listing solutions to elicit a response. You're not in school anymore. Examples are dirty, time is important, no one cares about your grades.
  • Good luck out there!

Comments

Be the first to post a comment

Post a comment