It is very common for me to look at a customer’s database and notice a lot of overhead from checking whether a database connection is active before sending a query to it. This comes from the following design pattern, written in Python-flavored pseudo-code (is_active() stands in for whatever liveness check the library exposes):
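def query_database(connection, sql):
    # Check first, then query: ping the server before every statement.
    if not connection.is_active():
        connection.reconnect()
    return connection.execute(sql)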

Many of the popular development platforms do something similar to this. Two things are wrong with this code: 1) it doesn’t actually work and 2) it has a large performance overhead.

It Does Not Work

This code doesn’t work because of a race condition. If the connection is alive when checked, there’s no guarantee that it is still alive by the time connection.execute(sql) is called. And if it’s not alive and gets reconnected, there’s still no guarantee that it will stay alive until the query is sent.

It’s useless to check and then execute. Instead, the code should be rewritten to something like this (again a sketch; InactiveConnectionException stands for whatever error the driver raises when the connection has gone away):
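def query_database(connection, sql, retries=1):
    while True:
        try:
            return connection.execute(sql)
        except InactiveConnectionException:
            if retries > 0:
                retries = retries - 1
                connection.reconnect()
                # loop around and run the query again on the fresh connection
            else:
                raise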

Notice that I completely removed the call to is_active(). If the connection is active, the query will succeed; if not, it’ll fail, reconnect, and retry.

This code also has the nice property of allowing you to retry things like lock wait timeouts and deadlocks, if you so desire. These are areas where many applications can benefit a lot, in my experience. Most applications should simply retry these conditions, but instead they don’t handle them at all.
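For example, here is a variation that retries those errors (a sketch only; it assumes the driver raises a DatabaseError that exposes the MySQL error number, where 1213 is a deadlock and 1205 is a lock wait timeout):

def query_database(connection, sql, retries=1):
    retryable = (1205, 1213)  # lock wait timeout, deadlock
    while True:
        try:
            return connection.execute(sql)
        except DatabaseError as e:
            if retries > 0 and e.errno in retryable:
                retries = retries - 1
                # A deadlock rolls back the transaction, so this simple retry
                # is only appropriate for autocommit, single-statement work.
            else:
                raise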

Performance Overhead

Checking whether the connection is active typically involves a call to the ‘ping’ or ‘statistics’ command at the MySQL protocol level, which will increment Com_admin_commands in SHOW GLOBAL STATUS, or a trivial query such as ‘SELECT 1’, which is usually harder to diagnose. This has a very high cost in many applications. There are two costs: the cost to the application of the network round-trip plus the query execution time, and the added load on the database server. The added load on the database server can be very high. A few days ago I saw a Ruby on Rails app where the ‘Administrator command: statistics’ command was taking over 40% of the server’s total query time. Eliminating the useless connection checks could have nearly halved the load on this database server. That’s not unusual!

When the application’s queries are long, the extra query is lost in the noise. But high-traffic apps put tremendous effort into getting query times down, and some highly tuned apps are worried when their queries take longer than a millisecond or so. When you’re running 20k queries per second against your database server, an extra 20k queries per second to check whether the connection is alive matters a lot. Those ‘statistics’ or ‘ping’ queries are nearly as expensive as the actual queries the application wants to run!

And that’s just the load on the database server. On the application side, you essentially see doubled query latency. Every time you want to run a query, your app framework is doing a network round-trip to the database to check the connection, then another network round-trip to run the query. Again, this matters a lot.

The problem is that the bad pseudo-code I showed above penalizes the common case in favor of the uncommon case. Connections are usually alive and don’t need to be pinged and reconnected. A better approach is to use the same code that fixes the race condition. Again, if the connection is down, you’ll find out when you try to run the query. Until then, assume everything is OK and just run the queries.

I hope that the upstream maintainers of the offending libraries can find and solve these issues, because it’d be a big help to apps when they grow. It’s one of those things that works fine in the lab, and even in the field, until performance really starts to matter — and then it sticks out like a sore thumb.

(Edit)

Here’s another example of the impact of these silly queries:

That’s right, 73% of the server’s load is consumed by checking to see if the connection is still alive 🙂

36 Comments

Roland Bouman

Hi Baron!

“If the connection is active, the query will succeed; if not, it’ll fail, reconnect, and retry.”

uhm, looks like you forgot to actually retry – I’d expect the catch block to contain a recursive call to query_database, like so:

catch InactiveConnectionException e
    if retries > 0 then
        retries = retries - 1
        connection.reconnect()
        query_database(connection, sql, retries)
    else
        throw e
    end
end

Ingo

Hi,

I second that! We are using Apache commons-dbcp for pooling connections to a MySQL database. After setting the config “testOnBorrow” from “true” (the default) to “false”, the load on the database server dropped dramatically. This config specifies whether a connection test should be done before a connection is used for the actual query.

So at least for commons-dbcp, although it’s switched on by default, it’s quite easy to get rid of this behaviour by just setting “testOnBorrow” to “false”.

Ingo

jmartin

I can understand that applications really shouldn’t do that – what about monitoring apps like Nagios?

Perrin Harkins

I agree that checking every time you run a query is wasteful. However, checking once at the beginning of each request seems pretty reasonable. On websites that don’t have continuous high traffic, there will often be lulls, allowing some of the connections to time out. Checking once at the beginning of each request fixes this problem.

Rob Wultsch


Is the title ripped off from the current PG Hackers thread or just a coincidence?

Roland Bouman

– darn – missed that 🙁

Thanks.

Shlomi Noach

@Ingo,
Another method is to do a periodic eviction of non-working connections.
It’s not entirely safe, but I find it does the job very well. It is optimistic, of course.
Yet another, higher-level test would be: if some server fails (e.g. it’s a slave and it’s too far behind, or perhaps it does not respond at all), you can collectively remove from the pool all connections to that server.
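A minimal sketch of such an eviction loop (the pool API used here, idle_connections() and remove(), is hypothetical, as is the driver’s ping()):

import threading
import time

def start_evictor(pool, interval=30):
    # Periodically ping idle connections and drop the dead ones,
    # instead of testing on every borrow.
    def evict():
        while True:
            time.sleep(interval)
            for conn in pool.idle_connections():
                try:
                    conn.ping()
                except Exception:  # the driver's "gone away" error in practice
                    pool.remove(conn)
    threading.Thread(target=evict, daemon=True).start()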

waterguo

retry doesn’t work for 3 reasons: (1) it breaks transactions, (2) a large number of retries can kill the app server, and (3) it’s platform dependent; in general, developers hate to write apps that only work on one type of database.

and the example used in this blog actually hints at a case of bad programming. After a tested connection is retrieved from the connection pool, a good programmer will use it for 10 or more things before returning it to the pool, so the overhead of testing becomes marginal. Getting a connection from the pool for every single SQL statement is bad practice.

Testing the connection is not fail-proof, but it helps a lot in reducing the number of failures.

Yes, you can ask the app server not to test the connection. The trade-off is that you will very likely see an increase in failed operations. Imagine the admin restarts your database or resets the router/firewall: all of a sudden you see hundreds of failures. This is not acceptable in most organizations.

Andy

Baron,

Which libraries have you seen this problem in?

Perrin Harkins

Good point about Ima::DBI. I can fix that pretty simply for mod_perl-based apps and make a new release. If anyone knows how to fix it generically for other runtime environments like FastCGI, let me know.

axl

Yes please: which code in Rails is responsible, and, assuming it’s not going to get patched any time soon (if ever), how have people worked around it?

Kim Carlsen

Deadlocks and dead connections are difficult to handle in the DB layer, as they will roll back the entire transaction and not just the last statement, so the DB layer would have to re-execute all statements in the whole transaction and assume that no result would change the execution path of the code.

If you are only handling ‘lock wait timeout’ by retrying the transaction, wouldn’t that be the same as increasing the timeout?
E.g. trying the statement 3 times vs. trying the statement once but waiting 3 times longer for locks to be released?

Brian Cavanagh

If you fail on a transaction for a connection-related issue, you DO NOT retry. You must investigate. A transaction that goes bad on your end may have succeeded, and if you just try again automatically, it is very possible you may be damaging your relational model silently. I can remember many cases where this very same philosophy has destroyed customer relationships when their CC gets hit 10k times. I can say that it has happened to me at popular stores like Target at the checkout line. Every connection failure must be investigated (or worked through, case by case) by the app (or admin) to ensure relational integrity. You ignore errors at your own peril.

Brian Cavanagh

Hey Kim,

Figure out exactly what your data is supposed to look like and then execute all the SQL you need at once, in one block. Otherwise, re-executing your transactions becomes problematic.

Ashley Martens

This problem shows itself in Ruby on Rails if you are using a connection pool. When a connection is checked out of the pool, it is checked to see whether it is active. Under heavy load this starts to cause problems on the DB, and the “admin statistics” command starts appearing in your slow query log.

Ioannis Cherouvim

Great post. Thanks

How did you generate the final query ranking table?

corsair

What was the query or program you used to get the ranking at the bottom of the post?

waterguo

Baron, I was trying to verify that MySQL is able to resume a transaction on reconnect. The JDBC standard makes no mention of reconnecting. In the MySQL Connector/J documentation (http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html), I found a property called autoReconnect. Its purpose seems identical to what you’re promoting, though I am not sure. But it also notes that after a reconnect, the old transaction is gone.

Here is what I think: if a reconnect is able to resume the session, wouldn’t it make more sense to build this capability into the driver? I can’t think of any case where I wouldn’t want to reconnect at the app level.

Secondly, I am trying to imagine how this can be implemented on the server side. To achieve it, the server would need to remember the whole session, including locks, dirty pages, cursors, etc. And for how long? This feature won’t come without a price tag, will it?

J

Well written article.

I think it is important to point out that the retry logic, as written, is incomplete and dangerous. The standard “best practices” for retry logic should include the following:

* Back-off: Reduce the rate of access to the downstream resource as failures recur.
* Bounded: At some point the retries should stop. 🙂
* Jitter: Introduce randomness to ensure multiple callers aren’t getting into lock-step and pounding the downstream resource.

Ignoring any one of these will inevitably turn some bad situations into even worse ones.
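
A minimal sketch combining all three (illustrative only; the exception type, delays, and attempt limit are assumptions):

import random
import time

def run_with_retries(connection, sql, max_attempts=5, base_delay=0.1):
    # Bounded, backed-off, jittered retry around a single statement.
    for attempt in range(1, max_attempts + 1):
        try:
            return connection.execute(sql)
        except InactiveConnectionException:
            if attempt == max_attempts:
                raise  # bounded: give up eventually
            # back-off: wait longer after each failure; jitter: randomize the wait
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
            connection.reconnect()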

Scott S. McCoy

I would consider this type of continual query attempt to be harmful in many instances. I prefer that applications simply error if the connection becomes stale or breaks — I would like not to have partial and corrupt data in most instances.

Jay

Roland,

He does retry as many times as specified in the parameter.
Because of the while loop, it will try to reconnect and re-query.

Brigham

Thanks for the interesting article, especially the info about what goes on in the server when you make a check. I think the check at any point is really useless for the most part. I would not perform a retry either, because if the connection is down, something is really wrong. I would report the error back to the consumer of the query and let it decide how to handle it.

Haile

Yes, this is a general principle that applies outside databases. Checking whether a file exists before trying to open it is the classic example of LBYL (look before you leap) and has the same problems as the database example.

Syd

Depends on the actual implementation & overhead. It may or may not be faster to manually check rather than relying on an exception, which can be heavy, depending on your platform.

sys

The Oracle database adapter shipped with the Sun (now Oracle) Java CAPS (Composite Application Platform Suite) actually used to fetch a sorted list of the users in the database to check whether a connection was alive.

After much frantic yelling at Support, we got it down to _merely_ doing something similar to select 1 from dual.

Bottomless

Shouldn’t there be some sort of sleeping introduced in the code? If so, for how long? (Or better, a back-off algorithm, as suggested by “J”.)

Otherwise it will just result in a storm of queries in a very short period of time, not allowing the system to recover from what was probably a temporary overload (and probably adding to that overload) that caused the first query to fail in the first place.

Thanks