I recently worked on a problem which showed itself as rather low MySQL load (roughly 5% CPU usage and close to zero IO) spiking to hundreds of threads running at the same time, causing an intense utilization spike and leaving the server very unresponsive for anywhere from half a minute to ten minutes until everything went back to normal. What was interesting is that the same query was taking up a large portion of the slots in PROCESSLIST. I do not just mean queries with the same fingerprint, but literally the same query with the same constants.

What we observed was a cache miss storm: a situation which can happen with memcached (as in this case) as well as with the query cache. If you have an item which is expensive to generate but gets a lot of hits in the cache, you can get into a situation where many clients miss in the cache at once and all attempt to re-create the item, pushing the server into overload. And because many requests are now being processed in parallel, the response time for the initial request may be much longer than if it ran all by itself, which increases the time it takes the server to recover.

What do I mean by an expensive query in this case? It is a query whose request rate is too high for the period during which 100% of requests miss. For example, if a given cache object is accessed 100 times per second and takes 500ms to populate, it is still too expensive: during the 500ms it takes to populate the item, 50 more requests will have started (that is the average case; because of random arrivals the worst case is worse), which takes 25 seconds to work through, assuming there is just one execution unit. Because we normally have multiple cores and multiple drives it can be less than that, but it is enough to cause a hiccup of a few seconds, which is unacceptable for many modern applications.
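To make that arithmetic concrete, here is a quick back-of-envelope sketch; the numbers are the ones from the example above, not measurements from the actual incident:

```python
# Back-of-envelope model of the miss storm arithmetic (illustrative numbers).
arrival_rate = 100   # requests per second for the cached item
populate_time = 0.5  # seconds to regenerate the item on a miss

# Requests that arrive while the first miss is still being repopulated
# (average case; random arrivals make the worst case worse).
piled_up = arrival_rate * populate_time        # 50 requests

# If each piled-up request re-runs the expensive work serially
# on a single execution unit:
serial_backlog = piled_up * populate_time      # 25 seconds of work

print(f"{piled_up:.0f} duplicate requests, ~{serial_backlog:.0f}s of backlog")
```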

How can you deal with this problem? You should carefully watch frequently accessed cache items as well as cache items which take a long time to generate on a cache miss. To find the first kind for memcached you can use mk-query-digest, which can decode memcached wire traffic, to analyze which items are requested most frequently. For the second kind you can add instrumentation to your application, or look at the MySQL slow query log, which is good enough if you populate each cache item with a single query.
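If you go the application instrumentation route, a minimal sketch could look like the following; the `populate_item` callback and the 200ms threshold are illustrative assumptions, not part of the setup described above:

```python
import logging
import time

logger = logging.getLogger("cache")
SLOW_POPULATE = 0.2  # seconds; assumed threshold, tune per application

def timed_populate(key, populate_item):
    """Regenerate a cache item and log the ones that are slow to build."""
    start = time.time()
    value = populate_item(key)
    elapsed = time.time() - start
    if elapsed > SLOW_POPULATE:
        logger.warning("slow cache populate: key=%s took %.3fs", key, elapsed)
    return value
```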

Optimize the query if you can. This is a good thing to do in any case, but it may be only one part of the best solution. Some query patterns become slow over time as the data size grows or the execution plan changes, and some items can become hot unexpectedly due to shifts in content popularity or the launch of new features.

Use a Smarter Cache. Especially with memcached, it is you who decides how the cache is populated. There are a number of techniques you can use to avoid this problem, such as probabilistic invalidation; you can also put a special value in the cache to mark that the item is being updated right now, so other clients wait rather than starting to populate it themselves (see the sketch below). For the MySQL Query Cache the solution would have been to make identical queries wait for the first one to complete. Unfortunately this has not been implemented so far.
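Here is a minimal sketch of the "mark it as being updated" idea, using memcached's atomic add() as a lock. The python-memcached client, the key naming, the timeouts, and the `regenerate` callback are all assumptions for illustration:

```python
import time
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def get_with_lock(key, regenerate, ttl=900, lock_ttl=30):
    value = mc.get(key)
    if value is not None:
        return value
    # add() is atomic: on a miss, only one client acquires the lock.
    if mc.add(key + ":lock", 1, time=lock_ttl):
        try:
            value = regenerate()
            mc.set(key, value, time=ttl)
            return value
        finally:
            mc.delete(key + ":lock")
    # Someone else is regenerating: wait briefly and re-check rather
    # than piling a duplicate expensive query onto the database.
    for _ in range(50):
        time.sleep(0.1)
        value = mc.get(key)
        if value is not None:
            return value
    return regenerate()  # lock holder died; fall back to doing the work
```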

Pre-Populate the Cache. In some cases you cannot easily change how caching works, especially if it is built into the application, but it may be easy enough to identify the hot items and pre-populate them before they expire. So if an item expires in 15 minutes, you can refresh it every 10 minutes and get essentially no misses; a sketch of such a refresher follows. This works best when the problem is caused by a few hot cache entries.
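A minimal sketch of such a refresher job, assuming you already know which keys are hot and how to rebuild them (all names and intervals here are illustrative):

```python
import time
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def build_front_page_stats():
    """Placeholder for the expensive SQL that rebuilds the hot item."""
    return "...expensive result..."

# Hot items worth refreshing proactively: key -> function that rebuilds it.
HOT_ITEMS = {"front_page_stats": build_front_page_stats}

CACHE_TTL = 15 * 60      # the item expires in 15 minutes
REFRESH_EVERY = 10 * 60  # refresh every 10 minutes so it never actually expires

while True:
    for key, rebuild in HOT_ITEMS.items():
        mc.set(key, rebuild(), time=CACHE_TTL)
    time.sleep(REFRESH_EVERY)
```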

So if your server has seemingly random spikes of activity, check this out: a cache miss storm could be one of the possible causes.

Comments
Baron Schwartz

I can’t resist adding some Google keywords for alternative ways to say “cache miss storm”: dogpile effect, thundering herd, cache stampede.

mike

You could also build a denormalized table every N minutes so falling back on SQL isn’t as big of a deal, and the stampede effect should be minimized.

hlopetz

peter,

another solution is to not rely on a single memcached instance for the same data; use multiple cache sources for each application, so the different memcached instances will “forget” the data at different moments.

domas

Baron, please don’t confuse “Thundering herd” (which is a miss storm) with “Dogpile” (which is, of course, a query pileup that could be caused by many reasons!)

Rob Wultsch

I think we have all just discovered Baron’s true calling is SEO 😉

Baron Schwartz

It’s true, so true!

Michael Peters

In Perl, the standard caching interface module CHI (http://search.cpan.org/perldoc?CHI) has built-in probabilistic expiration.

James Bromberger

Cache Miss Storm. Nice term. But not news:

Bug #15044 Query Cache synchronisation for multiple identical threads
Date: 18 November 2005
http://bugs.mysql.com/bug.php?id=15044

Corresponding forums discussion, from 15 November 2005:
http://forums.mysql.com/read.php?24,54799,54799

Not long off half a decade for this bug to be lying around!

Edmar

I’m not sure where I read this, but IMHO the best approach is something like:

On cache miss, verify if this is the first miss or if the database query has already been requested.

If the database query has been requested (i.e., this is not the first cache miss), serve stale/expired data from the cache, and be done with it.

If the database query has not been requested yet, query it, cache it, and then return the new data. Of course you could opt to return stale/expired data even here, for the sake of 1st-miss response time at the cost of async query complexity.

I suppose you could embed this second, shorter time-to-live inside the cached data itself. Chosen appropriately, in general it would only actually be used for high-traffic cached items.
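A minimal sketch of this serve-stale approach, with the soft expiry timestamp embedded in the cached value (python-memcached client assumed; names and TTLs are illustrative):

```python
import time
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

HARD_TTL = 900   # memcached evicts the item after this
SOFT_TTL = 600   # after this we consider the data stale but still usable

def get_serve_stale(key, regenerate):
    wrapped = mc.get(key)
    if wrapped is not None:
        value, stored_at = wrapped
        if time.time() - stored_at < SOFT_TTL:
            return value                       # still fresh
        # Stale: only the client that wins the atomic add() refreshes;
        # everyone else keeps serving the stale value meanwhile.
        if not mc.add(key + ":refreshing", 1, time=30):
            return value
    # First miss (nothing cached at all) or we won the refresh lock.
    value = regenerate()
    mc.set(key, (value, time.time()), time=HARD_TTL)
    mc.delete(key + ":refreshing")
    return value
```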

Edmar

P.S. Since the problem is caused by/at the cache layer, it seems sensible to solve it at the cache layer itself, instead of hacking the database to minimize the side-effects. But that’s my opinion of course.

Henrik Nordström

The Squid HTTP proxy cache’s first approach to this problem is to cache early: when it sees two identical requests it forwards/processes just one, and internally attaches the second as a hit on the first, waiting for its response.

There are also other models, such as background cache update, where the cache gets updated in the background if the requested cached item is about to expire.

And finally error override: serving the cached response, even if already expired, when the cache cannot be updated (data source unavailable, etc.).
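An in-process sketch of that request-collapsing idea (not Squid’s actual code; all names are illustrative):

```python
import threading

_inflight = {}               # key -> Event for a fetch currently in progress
_inflight_lock = threading.Lock()
_results = {}

def coalesced_fetch(key, fetch):
    """Collapse concurrent identical requests onto a single backend fetch."""
    with _inflight_lock:
        event = _inflight.get(key)
        leader = event is None
        if leader:
            event = threading.Event()
            _inflight[key] = event
    if leader:
        try:
            _results[key] = fetch(key)
        finally:
            event.set()
            with _inflight_lock:
                del _inflight[key]
    else:
        event.wait()         # attach ourselves to the in-flight fetch
    return _results.get(key)
```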

Marki

I have seen the same thing with my simple implementation of caching formatted MySQL results in a file. The solution was to first update the timestamp of the file (so that other threads serve the data from the file as if it were fresh) and only after that run the query and save the new data to the file. Of course, if something goes wrong, we are stuck with the old data for another cache period.

Ted

I recently wrote a recursive spin lock for exactly this issue. Code is here (in Ruby, for Rails) http://gist.github.com/578303

Morgan Tocker

It’s interesting to point out that this is in the memcached FAQ:
http://code.google.com/p/memcached/wiki/FAQ#Race_conditions_and_stale_data

Moshe

I used Microsoft’s .NET Internal locking implementation to bypass this:

http://blogs.microsoft.co.il/blogs/moshel/archive/2009/07/11/cache.aspx

This function is used to execute other functions that do heavy work, such as heavy SQL queries.

Michael Mior

I think what you really want is something like MintCache for Django (http://djangosnippets.org/snippets/155/). It basically stores the time of invalidation along with the cached object, so when the object becomes invalidated, the first request to hit that item can repopulate the cache while other requests continue to use the current cached value until it is replaced.

I think it’s rare that cache expiry times require a strict deadline, so this provides a clean solution 🙂

Baron Schwartz

I am not a Python coder. But that code really looks like it has a race condition and will still cause a rush to the database at high traffic levels.

Michael Mior

Unfortunately, it is. But at the very least, it’s likely to generate far less traffic. The only time you’ll get multiple hits to the database is when a second request for the memcached key comes in before the first request has had a chance to record that it is performing the fetch. So it is possible that at high traffic levels you’ll get several hits to the database, but it should still be significantly fewer than you would get otherwise, depending on how long the database query takes.

Baron Schwartz

You’re making some assumptions that will not hold in lots of cases. This kind of code could be very expensive or even disastrous on a frequently accessed item that is expensive to generate. An atomic operation such as incr() is an easy way to make this work properly. Or, just run a routine job to pre-refresh the item instead of letting the application code refresh it.

Michael Mior

I agree that it might not solve the problem, but I fail to see how it makes the situation worse. Normally, the fetch would occur for everyone accessing the cached item during the refresh interval. A technique such as this can significantly reduce the number of requests.

Baron Schwartz

It won’t make the situation worse… sorry I was not very clear in my meaning. The biggest danger here is that over time as your load grows, you are kind of shielded from seeing how expensive it would be to have a bunch of cache misses, and then you get surprised. See http://www.mysqlperformanceblog.com/2010/09/23/more-on-dangers-of-the-caches/ for another look at this.

That aside, I believe that when possible (see my previous comment), it’s better to make code such as this always correct, rather than “incorrect but it doesn’t matter for some people.”

Jared Williams

@Michael Mior

That code does have a problem, however it could be corrected.

I posted to the memcached group.

http://groups.google.com/group/memcached/browse_thread/thread/d123857968b76087/4985b0b4fb91bdbc?lnk=gst&q=CAS+#4985b0b4fb91bdbc

Jared Williams

Pseudo code using PHP & Memcached: http://gist.github.com/613772

sunli1223

You can use phplock (http://code.google.com/p/phplock/) to avoid this problem.

Luc

Wow, this was useful. Take a look at our issue as reported on Server Fault: http://serverfault.com/questions/429013/mysql-server-hitting-100-unexpectedly-amazon-aws-rds

It may have a proper name but the technique we used was very effective. Benchmark results are much better!

Baron Schwartz

This blog post was referenced from a paper on optimal pre-expiration of cache items. http://www.vldb.org/pvldb/vol8/p886-vattani.pdf