Last time we discussed some general ways to write SQL that made it easier to use it interactively. Today I want to talk about an error I often see with general SQL, especially the use of commenting the NOT EXISTS
clause.
As I am sure you already know, the use of the NOT EXISTS filter in SQL logically will only return a row if there are no rows associated with a particular sub-query. Here’s a basic NOT EXISTS clause:
SELECT *
FROM dbo.Customer AS c
JOIN dbo.[Order] AS o ON c.CustomerId = o.CustomerId
WHERE NOT EXISTS (
SELECT OrderId
FROM dbo.[Order] AS o2
WHERE o2.CustomerId = c.CustomerId
AND o2.OrderId > o.OrderId
)
Now, I know this structure of command is not immediately obvious to some, but take a moment to see if you can work it out — it logically says the following (bear in mind, the optimiser may change order of processing dramatically, so this might not be the actual order of processing):
- For each row that might be returned by the outer query (that joins Customer to Order data) – that is to say, after the query execution has joined the two tables …
- …only return the row if there are no results when running the correlated query…
- …the inner query is notionally run for each customer that the outer query returns…
- …and examining the inner query, it is looking for Order records that are for the current Customer, but have a greater OrderId than the current Order record (with alias ‘o’).
What does this mean – to look for Order records for the same CustomerId and greater OrderId? In English, that essentially means ‘a later Order’¹. But, because this is being referred to in a NOT EXISTS clause, the overall impact of the clause on the outer query is to only return rows that do not have a later order for the customer, which actually means ‘only show the latest order for all customers‘.
Now, the first thing to notice is that it does not matter what is being selected in the inner query, only that there are – or are not – result rows. So, while you can say SELECT *, or SELECT OrderId, you can also write SELECT ‘some text which could be useful later’ – an idea we will return to in a moment. I should also add that the optimiser seems to apply some logic and does not necessarily insist on returning ‘all columns in the result set’ if SELECT * is used.
Now, last time I recommended that you comment your SQL, but the problem here is that the negating effect of the NOT EXISTS clause can be confusing. So, I have sometimes seen SQL like this:
...
WHERE NOT EXISTS (
-- not the latest order
SELECT OrderId
FROM dbo.[Order] AS o2
WHERE o2.CustomerId = c.CustomerId
AND o2.OrderId > o.OrderId
)
where the comment -- not the latest order
refers to the overall effect of the NOT EXISTS
, but is most closely placed to the SELECT
statement. So if you only examine the SQL in the subquery, the comment says something which appears to contradict the SQL.
So, although it is not particularly necessary, I often use the following pattern, where instead of using a comment I use a text string to show what the subquery is doing:
SELECT *
FROM dbo.Customer AS c
JOIN dbo.[Order] AS o ON c.CustomerId = o.CustomerId
WHERE NOT EXISTS (
SELECT 'Later orders for the customer'
FROM dbo.[Order] AS o2
WHERE o2.CustomerId = c.CustomerId
AND o2.OrderId > o.OrderId
)
In SSMS the highlighting of text can make this stand-out more than a comment does; here’s how it looks:
Now, the text ‘comment’ says correctly what the subquery is looking for; Later orders for the same customer… and when you are reading the SQL as-a-whole, you can mentally convert NOT EXISTS
into the phrase ‘DO NOT HAVE’; and thus you can say for the whole SQL:
Select all columns from Customers joined to Orders which do not have a later order for the customer.
Alternatives
There are many, many ways of writing a query to return only the latest of some set of records. I have found that the NOT EXISTS
approach with commenting shown here is something that you can quickly get used to, and in addition it often seems to perform quite well (and better than the alternatives) – but the SQL Server optimiser is pretty smart, so I can’t say with certainty when things go wrong. In addition, some of the strategies; such as using IN
(a list of all the latest OrderIds for all customers) can not work where there is logically more than one column to be tested.
Consider the following two approaches:
-- Use a correlated subquery to get the latest order for each customer
SELECT *
FROM dbo.Customer AS c
JOIN dbo.[Order] AS o ON c.CustomerId = o.CustomerId
WHERE o.OrderId = (
SELECT MAX(OrderId) AS MaxOrderId
FROM dbo.[Order] AS o2
WHERE o2.CustomerId = c.CustomerId
GROUP BY o2.CustomerId
)
-- Use a Derived Table of all customer's maximum OrderIds
SELECT *
FROM dbo.Customer AS c
JOIN dbo.[Order] AS o ON c.CustomerId = o.CustomerId
JOIN (
SELECT CustomerId
, MAX(OrderId) AS MaxOrderId
FROM dbo.[Order] AS o2
GROUP BY o2.CustomerId
) AS MaxOrders ON c.CustomerId = MaxOrders.CustomerId
WHERE o.OrderId = MaxOrders.MaxOrderId
On my tiny sample database, both of these queries have the same execution plan, which have a greater cost than the earlier examples… that’s not a huge point and I want to acknowledge that in certain circumstances the opimiser might create equally performant plans… but the issue I have with this is one of believing what will be done behind-the-scenes; in the latest example using a Derived Table (that I called MaxOrders), it is perfectly logical to build a list of all customers with their latest OrderId; but bear in mind if we had a filter on the main query to only show orders for the last few days, then logically we would still have to do a full scan of the Order table (or a suitable alternate index) to find out the latest OrderIds for all customers however old they are. The other query strategies would not seem to suffer from this logical overhead. You could duplicate the filter on the outer query on the derived table – but this is more work and again is not something that we need to do with the other approaches.
Notes
1. OrderId might not be the most reliable way to indicate the sequence of an Order, even if it is a Primary Key with the IDENTITY
attribute set (CreatedOn might be better). It depends on precise implementation details, but it serves a useful purpose here.