I wrote a couple of weeks ago about the dangers of bad cache design. Today I've been troubleshooting a production-down case which had a fair amount of issues related to how the cache was used.

The deal was as follows: an update to the codebase was performed and it caused performance issues, so it was rolled back, yet the problem remained. This is a very common case: you see the customer telling you everything is the same as it was yesterday… but it does not work today.

When I hear these words I like to tell people that computers are state machines and they work in a predictable way. If something does not work the same today as it worked yesterday, something was changed… it is just that you may not recognize WHAT was changed. It may be something as subtle as a change in a query plan or an increase in search engine bot activity. It may be the RAID writeback cache being disabled due to battery learning, but there must be something. This is where trending often comes in handy – graphs will often expose which metrics became different, they just need to be detailed enough.

So back to this case… MySQL was getting overloaded with thousands of copies of the same queries, which corresponded to a cache miss storm – but why was this not a problem before? The answer lies in caching as well. When software is deployed, memcache is cleared to avoid potential issues with stale cache content, so the system has to start with a cold cache, which overloads it, and it never recovers. With an expiration-based cache you increase the chance of a condition in which the system will not gradually recover by populating the cache: if cache misses make performance so bad that the speed of populating the cache with new items is lower than the speed with which items expire, you may never get the system warmed up.
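To see why such a system can stay cold forever, here is a toy model in Python (all numbers are made up, not from this incident): the cache only gains items while the fill rate from misses exceeds the expiry rate, and the overloaded database caps how fast misses can be served.

    # Toy model of warming an expiration-based cache behind an overloaded
    # database. All constants are hypothetical.
    CACHE_SIZE = 100_000      # distinct items in the working set
    TTL = 300                 # seconds before a cached item expires
    COLD_QPS = 500            # queries/sec the struggling database can serve

    cached = 0.0
    for second in range(3600):
        hit_ratio = cached / CACHE_SIZE
        fill_rate = COLD_QPS * (1 - hit_ratio)   # each miss caches one item
        expiry_rate = cached / TTL               # items timing out per second
        cached = max(0.0, min(CACHE_SIZE, cached + fill_rate - expiry_rate))

    print(f"hit ratio after an hour: {cached / CACHE_SIZE:.0%}")

With these particular numbers the hit ratio stalls around 60% rather than climbing to 100% – the fill and expiry rates balance out, which is exactly the "never warms up" condition described above.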

But wait again… was this the first change? Was the code never updated before? Of course it was. As is often the case with serious failures, there is more than one cause pushing the system over the top. During a normal deployment the code change is done at night when the traffic is low, so even if the system has higher load and worse response times for several minutes after the code is updated, the traffic is not high enough to push it into a condition it is unable to recover from. This time the code update was not successful, and by the time the rollback was completed the traffic was already high enough to cause problems.

So the immediate solution to bring the system up was surprisingly simple. We just had to let traffic onto the system in stages, allowing memcache to warm up. There was no code which would allow us to do this on the application side, so we did it on the MySQL side instead: "SET GLOBAL max_connections=20" to limit the number of connections to MySQL, letting the application error out when it tries to put too much load on MySQL. As the MySQL load stabilizes, you raise the number of connections step by step until you can finally serve all traffic without problems.
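A rough sketch of how such a staged ramp-up could be scripted, assuming MySQL Connector/Python and an account with the SUPER privilege – the stages and the "stable" test below are made up, and in a real incident you may prefer to run each step by hand while watching the graphs:

    # Raise max_connections in stages, holding each stage until the server
    # looks healthy. Stage values and thresholds are hypothetical.
    import time
    import mysql.connector

    conn = mysql.connector.connect(host="db1", user="admin", password="...")
    cur = conn.cursor()

    for limit in (20, 50, 100, 200, 500):
        cur.execute(f"SET GLOBAL max_connections = {limit}")
        print(f"max_connections raised to {limit}")
        while True:
            time.sleep(30)
            cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
            threads_running = int(cur.fetchone()[1])
            if threads_running < limit // 2:    # crude "load has stabilized" test
                break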

So what can we learn from this, besides the cache design related issues I mentioned in the previous post?

Include Rollback in the Maintenance Window Ensure you plan the maintenance window long enough that you can do a rollback inside this window, and do not hesitate to do this rollback if you're running out of time. Know how long the rollback takes and have it well prepared. Way too often I see people trying to make things work until the time allocated for the operation is up, and then the rollback has to be done outside of the allowed time window.

Know your Cold Cache Performance and Behavior Know how your application behaves with a cold cache. Does it recover, or does it just die under high traffic? How high is the response time penalty and how long does it take to reach normal performance?
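A minimal sketch of such a test, assuming a memcached-backed application, pymemcache and requests, and a hypothetical list of representative URLs – flush the cache, then time the same pass cold and warm:

    # The first pass runs against a freshly flushed (cold) cache, the second
    # against a warm one; the difference is your cold cache penalty.
    import time
    import requests
    from pymemcache.client.base import Client

    memcache = Client(("127.0.0.1", 11211))
    memcache.flush_all()                      # force a fully cold cache

    urls = ["http://app.local/home", "http://app.local/search?q=test"]
    for label in ("cold", "warm"):
        start = time.time()
        for url in urls:
            requests.get(url)
        print(f"{label} pass took {time.time() - start:.2f}s")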

Have a way to increase traffic gradually There are many reasons beyond caching why you may want to slowly ramp up the traffic on a system. Make sure you have some means to do that. I'd recommend doing it per user session, so some users are let in and can use the system completely while others have to wait for their turn to get in. That is a lot better than doing it on a per page basis, where users randomly get some pages giving error messages. In some cases you can also do the ramp up feature by feature.
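For example, a per-session gate can be as simple as hashing the session ID into a stable bucket and only admitting buckets below the current ramp percentage – a minimal sketch (names and numbers hypothetical):

    # Sessions hash to a stable bucket 0-99; a session admitted at 25% is
    # still admitted at 50%, so users are never bounced back out.
    import zlib

    RAMP_PERCENT = 25    # raise in stages: 5, 25, 50, 100

    def is_admitted(session_id: str) -> bool:
        bucket = zlib.crc32(session_id.encode()) % 100
        return bucket < RAMP_PERCENT

    # In the request handler: send non-admitted users to a "please wait"
    # page instead of letting random page loads fail.
    if not is_admitted("session-4711"):
        print("We are letting users in gradually – please try again shortly.")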

Consider Pre-Priming Caches In some cases, when cold cache performance gives too bad a response time, you may want to prime the caches by running/replaying some production workload on the system before it is put online. This way all the ramp up and suffering from bad response times is taken by a script… which does not care.
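One way to do this is to replay request paths from a recent access log against the not-yet-live system – a sketch, assuming requests and a combined-format log (the host and log location are hypothetical):

    # Replay logged GET requests to warm the caches before going live.
    import requests

    BASE = "http://standby.app.local"

    with open("/var/log/app/access.log") as log:
        for line in log:
            path = line.split(" ")[6]   # request path field of a combined log line
            try:
                requests.get(BASE + path, timeout=10)
            except requests.RequestException:
                pass                    # the script, unlike a user, does not care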

5 Comments
Mark R

Several times I have seen poorly implemented application caches. These usually have the effects of:

* Introducing functional bugs – users see stale data from the cache
* Increasing code complexity
* Not improving performance; almost all users get a cache miss, and those who do hit get stale data.

People should really be very wary of using application-side caching, because it usually causes more trouble (i.e. any) than it’s worth (i.e. nothing).

I realise this isn’t what your article is about, but I thought I’d comment anyway.

Mark

Andy

@Mark – web applications are able to run with much lower latency on far less hardware when application caching is used correctly, and I’d suggest that this type of caching is vital to all of the web’s largest sites. Obviously cache invalidation adds complexity, but consider an application designed with write-through caching in mind; in this scenario stale data is impossible, and the cache acts only to alleviate load from the database servers.
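A minimal write-through sketch (pymemcache assumed; the dict stands in for the real database layer) to illustrate the pattern:

    # Every write updates the database and the cache together, so reads can
    # never observe stale data; the cache only absorbs read load.
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))
    db = {}   # stand-in for the real database

    def save_user(user_id, data):
        db[user_id] = data                      # write the source of truth first
        cache.set(f"user:{user_id}", data)      # then refresh the cached copy

    def load_user(user_id):
        value = cache.get(f"user:{user_id}")
        if value is None:                       # miss: read through and repopulate
            value = db[user_id]
            cache.set(f"user:{user_id}", value)
        return value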

Dave

I found MySQL will slow down after several runs. For example, if you run super-smack select-key.smack 10 20000, the first run will give you the highest score; on the second and third runs the performance drops and the score is lower. On FreeBSD, my machine gave me 51000 queries/second, but after the first run, the second run only gives me 43000, and the third run is the same as the second. I was rather frustrated.

Patrick Casey

I have to agree with Andy on this one. Clearly caching adds complexity, and clearly there are cases where caching is counterproductive, but caching, like hashing or sorting, is a tool in a developer’s toolkit that you can’t be afraid to use. There’s a time and a place, and where it makes sense, using caching can have dramatic performance benefits for your application.

The trick from a development standpoint is to correctly identify those places that will benefit from caching and use it there, rather than willy-nilly caching everything your app ever fetched from the database.

Pedro Mata-Mouros

A proper and probably smarter application cache design would also warrant a throttling mechanism, by which only the first cache miss is allowed to go to the database and refresh the cache. All the others behind it should wait for as long as necessary.
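A minimal sketch of that throttling using memcached's atomic add() as a lock (pymemcache assumed; names and timings hypothetical):

    # Only the first thread to miss wins the lock (add fails if the key
    # already exists); everyone else polls until the cache is refreshed.
    import time
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))

    def get_with_throttle(key, rebuild, lock_ttl=30):
        value = cache.get(key)
        while value is None:
            if cache.add(f"{key}:lock", "1", expire=lock_ttl, noreply=False):
                value = rebuild()               # the one query that hits the database
                cache.set(key, value)
                cache.delete(f"{key}:lock")
            else:
                time.sleep(0.05)                # someone else is rebuilding: wait
                value = cache.get(key)
        return value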