stereochrome

SQL gotchas

I’ve been reviewing some SQL queries recently, and I’ve noticed some things that people do that lead to far from optimal performance. Here’s a few, and what people should be doing.

COUNT(*) vs. COUNT(col)

At first blush, these may seem the same, but they actually mean very different things. COUNT(*) will aggregate the number of rows in the result set or group, whereas COUNT(col) will give you the number of non-null values that column has in the current result set or group. You almost always want COUNT(*). If you don’t know which you want, use COUNT(*). Treat that like it’s the law.

If you’re thinking that this isn’t a big deal, keep in mind that COUNT(*) requires much work on the part of the DBMS than COUNT(col); the value of COUNT(*) falls naturally out of what the DBMS is doing whereas COUNT(col) requires the DBMS to keep track of an extra counter for non-null values for that column.

Applying functions to fields you’re searching against

Say you have a big table, and you want to get all the rows whose created fields is between two dates. created is indexed. If you were to try this:

SELECT id, title, created
FROM   entries
WHERE  (TO_DAYS(NOW()) - TO_DAYS(created)) > ? AND (TO_DAYS(NOW()) - TO_DAYS(created)) <= ?;

You’re going to subject the DBMS to doing a full table scan just to get those entries. The reason for this is TO_DAYS(created). By doing this, you prevent the DBMS from using the index on created. It’s easy enough to rearrange this to work a lot better:

SELECT id, title, created
FROM   entries
WHERE  created BETWEEN CURRENT_DATE + INTERVAL ? DAY AND CURRENT_DATE + INTERVAL ? DAY;

It’s a simple change, but now the DBMS is able to make effective use of the index.

Created at 14:58 UTC on September 8th, 2010 and last modified at 17:40 UTC on September 8th, 2010