April 19, 2014

A common problem when optimizing COUNT()

When optimizing queries for customers, the first thing I do with a slow query is figure out what it’s trying to do. You can’t fully optimize a query unless you know how to consider alternative ways to write it, and you can’t do that unless you know what the query “means.” I frequently run into a situation where I’m forced to stop and ask the developers what they were trying to do with COUNT(). This is database-agnostic, not related to MySQL.

The problem is when the COUNT() contains a column name, like this:

If you know your SQL well, you know COUNT() has two meanings. 1) count the number of rows 2) count the number of values. Sometimes, but not always, these are the same thing. COUNT(*) always counts the number of rows in the result. If you write COUNT(col1) it counts the number of times col1 is not null. If it’s never null, the result is the same as the number of rows.

The problem with that is that you don’t know by looking at the query whether the developer wanted to count rows or values — or, quite possibly, the number of distinct values in that column. You see, there’s another form for COUNT():

So when I see a query that just does COUNT(col1) I am left with these guesses:

  1. You meant to count the number of rows. You should have written COUNT(*) to convey that meaning accurately. If the column is nullable, there’s a bug in your query — it won’t do what you think it does.
  2. You meant to count the number of distinct values, but you left out the DISTINCT keyword. That’s also a bug in your query.
  3. You meant to count the number of times the column’s value is not null. This in my experience is pretty unlikely — people rarely do this.

It is impossible to read the developer’s mind in these cases, so I always end up getting stalled waiting for them to reply and tell me what the query means so I can optimize it.

A good coding standard can help here — you can quickly write a tool to grep your source code and search for instances of such constructs, and flag them as errors unless some magic comment is embedded in the code next to them:

Alas, there is a lot of misinformation on the web about COUNT(), so it is not a good place to learn about what it does. 90% of what you can find online is just wrong. So I would advise a good book on the topic, except I can’t think of one — maybe Joe Celko’s books address this topic clearly? What’s your recommendation?

About Baron Schwartz

Baron is the lead author of High Performance MySQL. He maintains a personal blog at Xaprb. Follow him at @xaprb or connect with him on LinkedIn.

Comments

  1. Hi!

    good points ;)

    Just a FYI comment:

    “You see, there’s another form for COUNT()
    SELECT count(DISTINCT col1) FROM TABLE;”

    Interestingly, MySQL also supports

    SELECT count(DISTINCT col1[,...,colN]) FROM TABLE;

    (count of a combination of values, a rarely used feature which is AFAIK unique to MySQL)

  2. Pat says:

    This may be old news to folks on this board, but a common “gotcha” I see is folks who write code that does the (intended):

    select count(*) from some_big_table

    Lightight fast on myisam

    slow on innodb (and Oracle and most everything else)

  3. j says:

    I actually do #3 quite a bit — it is often useful when you’re outer-join-ing tables.

  4. Rob Wultsch says:

    Roland:
    While the feature is useful, I don’t think it is a major failing:

    postgres=# CREATE TABLE t(c1 varchar(20),c2 varchar(20));
    CREATE TABLE

    postgres=# INSERT INTO t VALUES(‘foo’,'fooy’),(‘foo’,'bar’),(‘foo’,'fooy’);
    INSERT 0 3

    postgres=# SELECT COUNT(*) FROM ( SELECT DISTINCT c1,c2 FROM t )t2 ;
    count
    ——-
    2
    (1 row)

    I tend to think MySQL would not add this functionality now, if it had not been added previously. Given that MySQL did not support subqueries until 4.1 I think it is easy to understand why this functionality exists (the same I think also explains the allowable less than ideal ONLY_FULL_GROUP_BY stuff…)

    - Rob

  5. @Rob: ah interesting, so pg supports this too. Thanks for that info.

    “I tend to think MySQL would not add this functionality now, if it had not been added previously.”

    Yeah, you could be right. But I must admit I don’t see the direct link to subqueries or the faulty only_full_group_by mode.

  6. Rob Wultsch says:

    Roland:
    Postgres does not support the syntax you described:

    postgres=# SELECT COUNT( DISTINCT c1,c2) FROM t;
    ERROR: function count(character varying, character varying) does not exist

    I would imagine that the syntax I showed ( using a derived table) would work on most databases. I’m not sure why PG does not support the count’ing on multiple columns.

    I was smoking crack with the comment about ONLY_FULL_GROUP_BY . I need to stay off the rock. What I was thinking and completely botched describing was that previous to MySQL having subqueries/derived tables being able to use columns that do not appear in the group by clause in the select clause in non-aggregating ways made life easier in some scenarios.

    I do think the faulty only_full_group_by mode can be attributed to many MySQL users not understanding what is going on when they type GROUP BY. I think that ONLY_FULL_GROUP_BY was a checkbox that was checked so that MySQL might allow some to better argue that it is a ‘real database’. As many MySQL users do not want to write correct/sane queries that have GROUP BY means that many will keep ONLY_FULL_GROUP_BY off…

  7. @Rob,

    “I think that ONLY_FULL_GROUP_BY was a checkbox that was checked so that MySQL might allow some to better argue that it is a ‘real database’.”

    Yeah, at least I think it was part of the “use sql_mode to be more standard” plan. However, I don’t think http://bugs.mysql.com/bug.php?id=8510 was part of that plan.

  8. Rob Wultsch says:

    That bug is only 1,314 days old. I’m sure they will get on that “real soon now”â„¢.

    I really like the way the bug was closed:
    “Boy, this looks tough. Closing bug.”

    MySQL got their checkbox, and when their checkbox broke something a user had a easy work around. Sounds like a win all the way around ;)

  9. idont says:

    What I usually do:

    SELECT COUNT(1) FROM mylargetable WHERE ….

    I do not know if it is a good idea but that way there is no doubt about the result.

  10. Ok. Why would anyone want to use ‘SELECT COUNT(col1)’ instead of ‘SELECT COUNT (*)’ for finding the number of rows? Probably because the person believes that performing count on a single column is faster than performing it on the entire column set (the spooky star sign has horrible reputation for being slow for normal SELECT queries although it might or might not apply in this particular case). Now, if I do use ‘SELECT COUNT (col1) for finding number of row in the (possibly false) hope that it’s faster than COUNT(*), I would at least make sure that col1 is a primary key. No NULL values for col1 (PK) means COUNT(col1) would work just as good as COUNT(*), if not better.

    The real question should be: is COUNT(col1) faster than COUNT(*)?

  11. Ok. Why would anyone want to use ‘SELECT COUNT(col1)’ instead of ‘SELECT COUNT (*)’ for finding the number of rows? Probably because the person believes that performing count on a single column is faster than performing it on the entire column set (the spooky star sign has horrible reputation for being slow for normal SELECT queries although it might or might not apply in this particular case). Now, if I do use ‘SELECT COUNT (col1) for finding number of row in the (possibly false) hope that it’s faster than COUNT(*), I would at least make sure that col1 is a primary key. No NULL values for col1 (PK) means COUNT(col1) would work just as good as COUNT(*), if not better.

    The real question should be: is COUNT(col1) faster than COUNT(*

  12. Kalaish, the real question is not what is faster. The real question is what did the user intend when writing the query?

    To address the off-topic ;-) The hope that COUNT(col1) is faster than COUNT(*) is false indeed. Take a look at this:

    create table test(
    pk char(100) not null,
    col1 tinyint not null,
    primary key(pk),
    key(col1)
    ) engine = innodb;

    Now what do you think MySQL will do with various types of COUNT() queries? Try putting a few rows into the table, then using EXPLAIN with COUNT(*), COUNT(pk) and COUNT(col1). You might be surprised. Convert to MyISAM and try again. Make col1 nullable and try again.

    Now imagine the table has a primary key on (col1, col2, … colN). Is COUNT(col1) a good idea? Did the user mean COUNT(DISTINCT col1) or was it someone trying to outsmart the optimizer? Are you still convinced that it’s just as good if not better? Assuming the user wanted to count rows and not values, in the best case it may perform as well as COUNT(*) but never better. What happens when you alter the table’s schema?

    The optimizer needs choices, and the consultant who is trying to find out why the choices are forbidden needs information. So the real question still remains, what does the SQL mean? To return to your statement “just as good as COUNT(*), if not better,” what metric of goodness are you using? In performance it can never be better; in optimizability/understandability/maintainability, COUNT(*) always wins when you’re trying to count rows.

    The person who commented about LEFT OUTER JOIN is right on the money, that is the most common use I’ve seen for actually counting values instead of rows.

  13. Eric says:

    I just found a case of #3 in my code which works as intended, but it was easily rewritten to use count(*). The count(*) version seems to be slightly faster, particularly since the application didn’t care that the zero rows disappeared and was happy enough to not loop over them.

  14. If COUNT(col1) is not better (in terms of performance) than COUNT(*), I bet it’s not worse either. I also don’t see why COUNT(Col1) is less readable than COUNT(*) when I can easily make out that the developer intends to find the number of rows because col1 is PK. Ok, I have no grudges against COUNT(*). It’s fine in itself but there is no point in preferring it over COUNT(PK). Even so more in making the massive, system-wide alteration from COUNT(PK) to COUNT(*).

    The argument that COUNT(*) is better because it’s immune to changes in table schema doesn’t hold water because if a column name can change, so can a table name. Thus invalidating SELECT COUNT(*) FROM table_name;

  15. amit says:

    he real question is what did the user intend when writing the query?

    comments can be effectively used for that, no? :)

  16. Yes! Comments are great for that. And they can absolutely clarify whether it’s a mistake or intentional. But most of the customers I work with don’t have comments in their SQL.

  17. Rob Wultsch says:

    Heck knowing where a sql query is coming from is half the battle. With the popularity of people doing stuff like:
    SELECT $fields
    FROM $table $join
    $where
    $order_by

    figuring out context is a real pain in the backside. grep is much less useful with such queries. I wish people would add magic constants like __FILE__ and __LINE__ if they are going to dynamically build the queries…

  18. victori says:

    @2. Pat

    count(*) can be fast on postgresql by counting tuples instead of actually rows. Yes the number returned isn’t accurate but works great for pagination. When was the last time a user went to page 10,000? only the first few pages are relevant.

    select reltuples from pg_class where relname=”;

    Replace …

    Just one of the few tricks I am using to getting the most out of our database.

  19. Oma says:

    This paragraph will assist the internet people for creating new webpage or even a weblog from start to end.

Speak Your Mind

*