In writing a recommendation for our Web development team on how to use MySQL, I came up with the following list, which I want to share: what kinds of queries are bad for MySQL?

  • Any query is bad. Send a query only if you must. (Hint: use a cache such as memcached or Redis.)
  • Queries that examine many rows are bad. Try instead to use a lookup by primary key (for example, WHERE primary_key = some_value), or at least a lookup through a selective index that touches only a few rows. If that is still not possible, try to make the query examine the smallest number of rows you can (zero is ideal, as that brings us back to the first case). A sketch follows this list.
  • Queries with JOINs are bad. Try to denormalize the tables to avoid JOINs. In the original example, the query joins t2 to t1 only to filter on t1.orderdate; this can be denormalized by copying the column orderdate from table t1 to table t2, so the query reads t2 alone (both versions are sketched after this list). I know that fans of 3NF will hit me hard for this, but it is needed for performance.
  • Aggregation queries are bad. If you have SUM, AVG, MIN, or MAX over many rows, try to pre-calculate these values in roll-up tables, or at least minimize the number of rows to handle. You can have daily, weekly, monthly, you name it, summary tables. INSERT .. ON DUPLICATE KEY UPDATE ... is very useful here (see the sketch after this list).
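
For the "examine few rows" point, the recommended form is a lookup by primary key, e.g. WHERE primary_key = some_value. A minimal sketch; the table and column names are illustrative, not from any real schema:

-- Table name is illustrative; the point is how many rows must be examined.
-- Bad: no usable index, so the whole table is scanned and filtered
SELECT * FROM orders WHERE customer_name LIKE '%smith%';

-- Good: a single row fetched through the primary key
SELECT * FROM orders WHERE order_id = 12345;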
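
The JOIN example is easier to see with concrete queries. This is a sketch under the assumption that t1 holds the order header (including orderdate) and t2 holds rows keyed by orderid; the table layout is illustrative:

-- Before: the JOIN exists only to filter t2 rows by a column stored in t1
SELECT t2.*
FROM t2
JOIN t1 USING (orderid)
WHERE t1.orderdate = '2013-07-01';

-- After copying orderdate from t1 into t2, the same question needs no JOIN
SELECT t2.*
FROM t2
WHERE t2.orderdate = '2013-07-01';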
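
For the aggregation point, a daily roll-up kept current with INSERT .. ON DUPLICATE KEY UPDATE might look like this (table and column names are hypothetical):

-- One row per day; the primary key makes the upsert idempotent per day
CREATE TABLE sales_daily (
  sales_day   DATE NOT NULL PRIMARY KEY,
  order_count INT NOT NULL,
  revenue     DECIMAL(12,2) NOT NULL
);

-- Fold each new order into its day's row instead of aggregating at read time
INSERT INTO sales_daily (sales_day, order_count, revenue)
VALUES (CURRENT_DATE, 1, 19.99)
ON DUPLICATE KEY UPDATE
  order_count = order_count + VALUES(order_count),
  revenue     = revenue + VALUES(revenue);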

Do you have more examples?

30 Comments
Tim Soderstrom

Oooh yeah I’m not a fan of denormalizing, at least if the first suggestion is used. Proper caching means that the database can trade some performance for understandability. I’m a huge fan of having an easy to follow and understand database, even if it trades some reasonable performance to get it.

Plus, cases come to mind where a text string is being supplied. If the schema can be normalized such that the string only needs to be compared against a smaller lookup table, which is then JOINed, my gut feeling is that it would be faster to do so than to denormalize? Trying to think of a good example, the only one that comes to mind is a song database.

If you just had a table of ‘Songs’ that denormalized albums and artists, that means you would have to search the whole thing if you just, say, want all the albums from an artist. Say we avoided that and then used an Albums table, well, you still have to search by artist and do a textual index scan. But if you normalized all that out, you can do a smaller search on an Artists table and join it with the rest.
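A sketch of that comparison, with illustrative table and column names (not from any real schema):

-- Denormalized: artist repeated per song, so the search touches the big table
SELECT DISTINCT album
FROM songs
WHERE artist_name = 'Some Artist';

-- Normalized: find the artist in a small table first, then join outward
SELECT al.title
FROM artists ar
JOIN albums al ON al.artist_id = ar.artist_id
WHERE ar.name = 'Some Artist';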

I would be happy to be proven wrong there 🙂 I haven’t tried the above test myself. I have, however, moved to normalizing out textual columns when it made sense and the results were generally positive for performance in those cases.

Justin Swanhart

If you want to maintain summary tables I strongly suggest looking at http://flexvie.ws

Flexviews has two refresh methods and works very well for many people who have tried it. It is definitely worth investigating before you try to roll your own materialized view solution.

Column stores are also a viable alternative if you want aggregation. Column stores are typically faster than aggregate tables, and you don't have to change your queries to use the column store. Column stores are geared for OLAP and row stores are geared for OLTP.

Justin Swanhart

Tim,

De-normalize the table and use a combination of text indexing and bitmap indexing to index the attributes effectively. This gives excellent multi-dimensional search performance.

Fastbit can be used for this:
https://sdm.lbl.gov/fastbit/

Justin Swanhart

I disagree with the blanket statement that joins are bad. MySQL only has nested loops (in general), and thus joins can be inefficient when large amounts of data must be examined. For simple OLTP queries, such as looking up an order and its associated lines, joins are not bad.
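
For that OLTP case, the join touches only the rows behind one key, so nested-loop joins stay cheap. A sketch with illustrative table and column names:

-- OLTP lookup: one order plus its lines; each table is reached via an indexed key
SELECT o.order_id, o.orderdate, l.product_id, l.quantity
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
WHERE o.order_id = 42;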

Then again, when you need to report on all the orders in a month or a year, the join kills you. So you probably need to ETL to a reporting system where the data is de-normalized.

If you are going to go to all the effort to ETL, then try out a column store. The column store will be able to do hash joins too if you do want to have some lookup tables.

Tim Soderstrom

Hmm, I'll have to look into Fastbit, thanks Justin! I'm still not convinced about using conventional textual indexes, though. It depends on the data distribution, but if the column being searched is one that would normally effectively be an ID column, it seems like a simple join could still win out?

That said, if there is no meta-data needed for a column, and the set is smallish, I’m a fan of ENUMs. I know some don’t care for them, but if used properly, they give you quite a few nice benefits.
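
A small example of that ENUM usage (the column and its values are hypothetical); the allowed strings live in the table definition and each row stores only a small integer:

CREATE TABLE tickets (
  ticket_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  status    ENUM('open', 'in_progress', 'closed') NOT NULL DEFAULT 'open'
);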

Justin Swanhart

An ENUM is a form of dictionary compression. Any compression when used properly can be beneficial.

The point is that you have one table with all the details, and that table is effectively indexed on all columns. The integer (id) columns are bitmap indexed and text fields like ‘album_title’ and ‘record_label’ are full text indexed because they must be tokenized for search. All of the indexes can effectively be combined by the engine because each search produces a bitmap of rowids and the bitmaps can be efficiently combined.

Jonathan

Is Percona planning on having some summary table implementation in Percona Server?
Perhaps integrate Justin’s flexviews into it?

I realise that summary tables may not be for everyone, but the number of people not interested in some reporting on their data is dwindling very fast.

Robert Thomas

I'm just a beginner with MySQL; thanks for this info, it will help me a lot with my new project.

bluebyte

I do not understand why a query with "WHERE primary_key = some_value" is bad.
Isn't this the most efficient way to retrieve a single row from a table, since the primary key is indexed and unique?

Justin Swanhart

Bluebyte:
That query isn't called bad; it's the example that follows "Queries that examine many rows are bad. Try instead to use…" in the post.

Basically this says only use MySQL for sysbench.

Van

There is nothing wrong with normalized databases and JOINs – provided you know what you are doing. Most of the time I find that programmers (and I am one too) design the worst databases. But it depends on what type of data you are storing. Normalization helps enforce data integrity at the database level. Relying on the application/programmers to ensure data integrity is poor practice because it cannot be done well and programmers will cheat.

Evgeny

Vadim, are you serious?

Alex

> Do you have more examples?

Sure we do.

INSERT queries are even worse than SELECT queries. Each INSERT might scan the same number of indexes as a SELECT, but without producing any sensible output: a lot of effort for no profit. Worse than that, the INSERTs then need to write the data and update the indexes, rather than just read. Hint: whenever possible, avoid INSERTs at all costs and use redis/memcache/whatever else, as mentioned above.

But the root cause of all evil, of all the index tree imbalances, of all the excessive storage media wear, is obviously the UPDATE query. It has the same problems as INSERTs, only amplified: no output, and even more tree-walking, since the new record is added to the tree and the old one is removed! So much bit-crunching, just to satisfy a programmer's nonsensical desire to update some stored data! Hint: if your MySQL usage involves UPDATE queries, the architecture of your system is likely wrong and should be completely reconsidered.

Pavel

tl;dr: Don't use MySQL.
Seriously, are there any queries where MySQL is good?

Pavel

(Except DROP DATABASE of course)

Alex

> Seriously are there any queries where MySQL is good?

There is wide agreement that the performance of MySQL's implementation of the TRUNCATE command is acceptable for most usage scenarios. When possible, it can be used as a good substitute for the subpar implementations of INSERT and UPDATE.

Evgeny

Alex,

to be fair, the Web Scale Database MONGODB is superior to MySQL even in the TRUNCATE benchmark.

Pavel

Evgeny,
If your application relies heavily on TRUNCATE queries, you can cut out the middleman and directly call one of the system functions – unlink, remove or rmdir. They are written in assembly language, so they work very fast.

jk2l

Alex,

I strongly suggest you consider the Blackhole storage engine; it is heavily optimized for INSERT and UPDATE performance. It is totally scalable for any amount of data input.

Boyd Hemphill

@jk2l … beautiful twist on just writing to /dev/null 🙂

Here is mine:
At the point where you are denormalizing for performance from day one of app design, ask yourself if a relational database is the right tool. If you are going to go through the trouble of managing all data integrity issues in your application, you might as well take the time to explore the features of NoSQL solutions such as Mongo, Riak, Redis, etc. I mean, everyone knows Mongo is webscale and MySQL is not (http://www.youtube.com/watch?v=b2F-DItXtZs). (Note: the video is _not_ work safe, due to the language and the likelihood of snorting a drink through your nose while watching it.)

OK a bit of seriousness:

Vadim says: Of course JOINs and ranges over thousands of rows can be fine if run once per minute. However, when it comes to hundreds of queries per second, that is when things become problematic and further optimizations are needed.

We regularly run a dedicated DB server with 2GB of memory in the RS cloud. No local block storage, bells, or whistles. We have multiple queries with joins of 5 or 6 tables, some with 500k rows. These servers have, in the wild, run at > 2000 queries per second.

Maybe we could talk about the optimizer's difficulty with correlated subqueries, or how DELETE FROM x WHERE x_id IN (SELECT x_id FROM y) takes a very long time and must be avoided (especially when using an ORM). Those seem like topics worthy of the MySQL Performance Blog.
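
That DELETE pattern can usually be rewritten as a multi-table DELETE driven by a JOIN, which older MySQL optimizers handle much better than an IN subquery. A sketch using the same placeholder table names:

-- Slow on older MySQL: the IN subquery may be re-evaluated per row of x
DELETE FROM x WHERE x_id IN (SELECT x_id FROM y);

-- Usually much faster: a multi-table DELETE driven by a join
DELETE x
FROM x
JOIN y ON y.x_id = x.x_id;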

Justin Swanhart

Wow guys, that really went to sarcasm level 9000 quickly.

As with any RDBMS, YMMV when it comes to MySQL. I personally focus on creating tools to help MySQL perform significantly better in the areas where it traditionally struggles, and I pick the right tool for the right job: Sphinx for full text, Fastbit for bitmap indexes, Flexviews for materialized views, and Shard-Query for OLAP scale-out. Both TokuDB and ICE are newer storage engines with strong compression features, one focused on OLTP and the other on OLAP.

If you really want to avoid DELETE and UPDATE, then you can consider using “log structured tables”:
http://shardquery.com/2013/06/20/tips-for-working-with-append-only-databases-using-sharding-and-log-structured-tables/

These can be very useful for column stores which don’t have those pesky indexes you were complaining about 🙂
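
Very roughly, a log-structured (append-only) table avoids UPDATE and DELETE by inserting a new version of a row for every change and letting reads pick the latest one. A minimal sketch with an invented schema (Flexviews and Shard-Query have their own conventions):

-- Every change is an INSERT with a monotonically increasing version
CREATE TABLE customer_log (
  customer_id INT NOT NULL,
  version     BIGINT NOT NULL,
  email       VARCHAR(255),
  deleted     TINYINT NOT NULL DEFAULT 0,
  PRIMARY KEY (customer_id, version)
);

-- Current state: the highest version per customer that is not a delete marker
SELECT c.*
FROM customer_log c
JOIN (SELECT customer_id, MAX(version) AS version
      FROM customer_log
      GROUP BY customer_id) latest USING (customer_id, version)
WHERE c.deleted = 0;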

This is of course in addition to Shard-Query which can pretty much stand toe-to-toe with Amazon Redshift:
http://shardquery.com/2013/07/11/comparing-amazon-redshift-and-shard-query-features-and-performance/

Rick James

Back to the “bad query” question…
* In spite of the wonderfulness of LIMIT, doing pagination via OFFSET and LIMIT is bad because it (1) gets slower with every page you flip through, since the skipped rows are still read and discarded, and (2) can duplicate or skip items if other connections INSERT/DELETE rows in the meantime. Solution: http://mysql.rjweb.org/doc.php/pagination
* Big DELETEs are slow, lock things, and can delay Replication. Solutions: http://mysql.rjweb.org/doc.php/deletebig
* Find the nearest 10 pizza parlors via any obvious SELECT. Solution: http://mysql.rjweb.org/doc.php/latlng
* Key-Value stores (Entity-Attribute-Value triples). Solution: http://mysql.rjweb.org/doc.php/eav
etc.
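
A sketch of the "remember where you left off" alternative to OFFSET pagination (table and column names are illustrative):

-- OFFSET pagination: the server still reads and discards the first 4990 rows
SELECT id, title FROM items ORDER BY id LIMIT 10 OFFSET 4990;

-- Keyset pagination: start from the last id shown on the previous page
SELECT id, title FROM items WHERE id > 12345 ORDER BY id LIMIT 10;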

Paul Ingram

“Any query is bad. Send a query only if you must.”

Seriously? Surely that’s like saying “Food is bad. Eat it only if you must.”

MySQL Server is there to be used. You cannot retrieve data from the server without writing a query.

While I agree that many ways of accessing data are not optimal (I'll refrain from using "bad" – it makes me sound like I'm telling off a child), I struggle to see how any of this article is constructive.

In addition, what does INSERT .. ON DUPLICATE KEY UPDATE … have to do with aggregation queries? I use aggregation queries where I need to obtain data on the fly.

Sorry to say it, but if I were a member of your web development team I would argue against almost every single point on your list.

Justin Swanhart

Paul,

Saying that sending queries is bad is true in a sense: the fastest query you ever run is the query you don't send, because you get the data from an external cache instead. The idea is to reduce the number of queries hitting the database and, when they do hit the database, to reduce the work they must do.

INSERT .. ON DUPLICATE KEY UPDATE is for maintaining aggregate tables (aka summary tables or materialized views). Summary tables can improve database performance by orders of magnitude when used properly. This is why I maintain Flexviews.

Bayu Angkasa

I don't know what you gentlemen are all talking about!!! Every database management system has its pluses and minuses depending on what problem we are really facing. So we should all stop this discussion!

Gavin Towey

1. Any query with OR in the WHERE clause is generally bad

2. Any query that does a filesort on more than a trivial set of rows, or in general has a GROUP BY/ORDER BY that isn't covered by an index

3. Multiple inequality conditions: WHERE a > 1 AND b < 2

4. SQL_CALC_FOUND_ROWS

5. Wildcard matches: WHERE col LIKE '%something%'

6. Using functions on columns: WHERE DATE(datetime_col) = CURDATE()

7. subqueries
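
Several of these come down to whether the optimizer can still use an index; item 6 is the easiest to illustrate (the table name is illustrative):

-- Bad: the function call hides datetime_col from any index on it
SELECT * FROM events WHERE DATE(datetime_col) = CURDATE();

-- Better: an explicit range on the bare column can use the index
SELECT * FROM events
WHERE datetime_col >= CURDATE()
  AND datetime_col <  CURDATE() + INTERVAL 1 DAY;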

Pep Pla

Sorry but I think this post doesn’t make a lot of sense.

There are no rules of thumb for what is "bad" or "good". For example, the theory says that denormalizing can be great for SELECT performance, but you must take into consideration things like the increase in disk space, that is, the number of blocks you have to read and write, both in the logs and in the data files, which can turn the theory into a wrong design decision and make not only inserts and updates a lot slower but also selects.

I've often seen that counter-intuitive things must be done to improve performance, from dropping/not using indexes to disabling caches and buffers.

Also, summary tables can be the wrong decision when the aggregated data changes too often (you have to recalculate your aggregates frequently) or when the information is queried infrequently.

You must have a deep understanding of your database and your application's behaviour to be able to make these design decisions.

There are no bad queries, there are only bad programmers/designers/dba’s.

Pep

Fedge

I've been building apps that do joins on tables with millions of rows. They do alright in the wild with no front-end caching, as long as I keep the number of rows scanned down to a reasonable number (whatever that might happen to be). I try to always hit a primary key for internal stuff and either a unique key or a fulltext index for user input.

Joins across a large number of tables seem to be more troublesome, even when using primary keys. If I find myself doing a join with a lot of tables, I make an intermediate table that suits the use case of the application. It holds just the primary keys of the data the app is actually looking up, rather than hitting the big tables directly and using them as intermediate steps to get to the data I'm really after. Much faster. The data you're actually after is usually all in one or two tables anyway.

Say you want to find all the hashtags people have used while tweeting a certain link:

SELECT hashtag
FROM tweet_links
LEFT JOIN tweet_hashtags USING (tweetId)
LEFT JOIN hashtags USING (hashtagId)
WHERE linkId = 34

becomes:

SELECT hashtag
FROM link_hashtags
LEFT JOIN hashtags USING (hashtagId)
WHERE linkId = 34

Even though the first case contains all the data and is normalized, creating the link_hashtags table and keeping a duplicate of your data there is much faster for lookups. With many-to-many relationships like this one, you can't just store the hashtagId in the link table. If you knew the list of hashtags per link wouldn't grow beyond a set amount, you could store a comma-separated list or something, but that's just horrible and leaves data integrity to the application rather than the database.
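
A sketch of how a link_hashtags table like the one above could be built and refreshed; the column names follow the queries above, while the INSERT IGNORE refresh approach is an assumption for illustration:

-- Materialize the two-hop relationship into the lookup table once...
CREATE TABLE link_hashtags (
  linkId    INT NOT NULL,
  hashtagId INT NOT NULL,
  PRIMARY KEY (linkId, hashtagId)
);

INSERT IGNORE INTO link_hashtags (linkId, hashtagId)
SELECT tl.linkId, th.hashtagId
FROM tweet_links tl
JOIN tweet_hashtags th USING (tweetId);

-- ...then add the same (linkId, hashtagId) pair whenever a new tweet is stored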

Jags

It's an over-generalization to say "Queries with JOINs are bad. Try to denormalize the table to avoid JOINs." Instead, I would say overdoing either normalization or denormalization is bad. In the real world, normalization is mandatory to get a good performance balance between reads and writes. Your example is a case of overdoing normalization: orderdate is in a 1-to-1 relation with orderid and hence has no need to be in another table.