April 17, 2014

Caching Techniques

Recently Jay Pipes published a great article about lazy connecting and caching, which reminded me that my post on this matter is well overdue.

Let me start with a couple of comments about Jay's article. First – caching in files should be used with caution. It can be very efficient, especially if the number of cached objects is small, but if you end up with too many small objects to cache, files can become inefficient – especially if you do not take care to spread them across different directories, on file systems which cannot handle large numbers of files in a single directory efficiently. The other problem with files is, of course, that they are local to a single node, which may be inefficient with many web servers. Putting the cache on shared storage could work, but that is an extra complication. There are cases when a file cache works pretty well – for example, caching full page results in files can be extremely efficient, especially if you can offload sending the cached result to the client to the web server (instead of doing it from PHP). The other case where caching in files can be very helpful is caching data long term – e.g., if you got it from some web service. For smaller, short-lived objects, things like memcached (or local shared memory for small installations) can be more efficient.
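
To illustrate the directory point, here is a minimal sketch of hashing cache files into a two-level directory fanout; the cache root, fanout depth, and key names are arbitrary assumptions of mine:

    <?php
    // Sketch: spread cache files across subdirectories so that no
    // single directory accumulates too many files. The cache root
    // and the two-level fanout are arbitrary assumptions.
    function cache_file_path($key) {
        $hash = md5($key);
        $dir  = "/var/cache/app/" . substr($hash, 0, 2) . "/" . substr($hash, 2, 2);
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true); // create fanout directories on demand
        }
        return "$dir/$hash.cache";
    }

    file_put_contents(cache_file_path("user:42:profile"),
                      serialize(array("name" => "Alice")));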

Let's now get to the main point of this article – I want to talk in a bit more detail about caching algorithms and which kinds of applications each of them works for.

TTL Based Caching

This is probably the most common approach to cache organization on the Web – you cache the object and say it is going to expire in 60 seconds. This type of caching is even supported by the HTTP protocol with its Expires header. This approach to caching is very simple and pretty efficient; however, it means serving stale data, which is problematic for some applications, and objects may also expire before they are actually modified, so the cache can be less efficient than it could be. TTL caching is great if you have no way to know when an object has changed. For example, if you retrieve product information from Amazon Web Services, you have no way to get a notification about it being changed, so the only thing you can do is check whether it has changed every so often.
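
A minimal sketch of this approach using the PHP Memcached extension; the key naming, the 60-second TTL, and the fetch_product_from_aws() helper are hypothetical:

    <?php
    // Sketch of TTL-based caching: the object simply expires after 60
    // seconds, whether or not the underlying data actually changed.
    // fetch_product_from_aws() stands in for a slow external call.
    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);

    function get_product($cache, $product_id) {
        $key     = "product:$product_id";
        $product = $cache->get($key);
        if ($product === false) {                  // miss or expired
            $product = fetch_product_from_aws($product_id);
            $cache->set($key, $product, 60);       // expire in 60 seconds
        }
        return $product;
    }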

Cache Invalidate

With this approach you invalidate or remove objects from the cache once the underlying information is updated. This is how the MySQL Query Cache works – it removes all queries derived from a table when that table is updated. This technique requires you to know which objects are based on which data, so you can invalidate wisely, and of course to have update notifications. This cache can be more efficient with rare updates, but you had better have fine-grained control over what has to be invalidated – the approach used by the MySQL Query Cache is too coarse, invalidating a lot of queries which really do not need to be. The other problem with this method is that invalidation can be expensive, especially if you have many objects cached and they are stored on disk or over the network rather than in local memory. This approach works very well if your cache is structured so that each update removes only one object from the cache.
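
A minimal sketch of such fine-grained invalidation, where each update removes exactly one cached object (the schema and key names are made up; $pdo is an open PDO handle):

    <?php
    // Sketch of cache-invalidate: when a post is updated, delete only
    // the single cache entry derived from it, instead of flushing
    // broadly the way the MySQL Query Cache does.
    function update_post($pdo, $cache, $post_id, $body) {
        $stmt = $pdo->prepare("UPDATE posts SET body = ? WHERE id = ?");
        $stmt->execute(array($body, $post_id));
        $cache->delete("post:$post_id");  // invalidate just this object
    }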

Cache Update

This one is quite similar to the cache-invalidate approach, with the difference that the cache content is actually renewed rather than invalidated. This approach is more efficient, especially if renewal is done in the background, as it greatly reduces the number of cache misses; on the other hand, it can waste resources by updating content which might never be needed again. Another complication for this approach is that the cache has to become active and know how to update its content, rather than being a simple storage system. You can also mix the cache-update and cache-invalidate approaches efficiently. For example, I worked with one company having trouble with some full-text search requests – they were just way too slow, while some of them were very frequent. Until the full-text search itself could be optimized, we added a cache-update algorithm which would populate the cache with result sets for the most frequent and complicated queries as new data was loaded and the index was rebuilt.
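
The same write path as above, sketched as cache-update: instead of deleting the entry, the writer renews it so readers never see a miss (render_post() is a hypothetical, possibly expensive helper):

    <?php
    // Sketch of cache-update: the writer regenerates the cached object
    // rather than invalidating it, so readers keep hitting the cache.
    function update_post_and_cache($pdo, $cache, $post_id, $body) {
        $stmt = $pdo->prepare("UPDATE posts SET body = ? WHERE id = ?");
        $stmt->execute(array($body, $post_id));
        $cache->set("post:$post_id", render_post($pdo, $post_id)); // renew, don't delete
    }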

Version Based Caching

This could be thought of as a modification of the previous approaches, but I will list it as a separate item because it is quite an efficient way to organize things. In this case a version is associated with each of the data sources used to compose the object, and for each object the data source versions are stored together with the cached data. When you take an object out of the cache, you validate whether all versions are still current, and either use it or throw it away. The good thing about this approach is that your cache operations have constant complexity – you do not have to do a big invalidation cleanup on updates. Let me give an example of how it can work in practice. Let's say you're running a big blogging site with many blogs published. You may store a blog version which is updated on each blog modification, and have all cache lookups for that blog check whether the blog version has changed. You can get more fine-grained and add versions to blog entries, so that if a comment is added to post B you do not invalidate post A. Of course this method also has its downsides – you have to perform an extra version check on each lookup, it can get complicated, and caching bugs are hard to debug – but it can give very good cache efficiency for some applications.
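
One common way to implement this with memcached is to keep a version counter per blog and build cache keys from it, so bumping the counter makes every old entry unreachable in constant time (the key names here are assumptions of mine):

    <?php
    // Sketch of version-based caching: cache keys embed the blog's
    // current version, so incrementing the version on any modification
    // implicitly discards all entries derived from the old state.
    function blog_version($cache, $blog_id) {
        $v = $cache->get("blog_version:$blog_id");
        if ($v === false) {
            $v = 1;
            $cache->set("blog_version:$blog_id", $v);
        }
        return $v;
    }

    function cached_page_key($cache, $blog_id, $page) {
        return "blog:$blog_id:v" . blog_version($cache, $blog_id) . ":page:$page";
    }

    function on_blog_modified($cache, $blog_id) {
        $cache->increment("blog_version:$blog_id"); // all old keys are now stale
    }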

These are of course only some of the approaches you can use, and in many cases mixing them makes sense. You may use TTL for some objects and invalidation for others.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Markus says:

    I have implemented a caching method similar to the one described in Jay’s blog. Our site is growing, so I tried to do some things before having to buy more hardware. I just noticed (in Jay’s cache class) that reads of cached files aren’t locked, so more than one concurrent process could try to execute the query and rebuild the cached file at the same time.

    I was thinking of locking cache reads using the MySQL function GET_LOCK. I’m wondering if such a thing is really worth it. Any advice?

    Thanks a lot, this blog has been helping me a lot!

  2. peter says:

    Markus, if you’re working with files you should probably do the locking with file locks (assuming they work in your environment), or you can use shared memory semaphores if you’re running on a single host. Using GET_LOCK() will work, but it is expensive – it is a query to the MySQL server, which needs parsing, etc.
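
    A minimal sketch of the file-lock idea in PHP (assuming flock() behaves on your filesystem; the paths and the rebuild_cache() helper are made up):

        <?php
        // Sketch: serialize cache rebuilds with an exclusive file lock so
        // only one process runs the expensive query; the others block,
        // then find the freshly rebuilt cache file.
        $fp = fopen('/var/cache/app/report.lock', 'c');
        if (flock($fp, LOCK_EX)) {                      // blocks until owned
            if (!file_exists('/var/cache/app/report.cache')) {
                rebuild_cache();                        // hypothetical rebuild
            }
            flock($fp, LOCK_UN);
        }
        fclose($fp);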

  3. Markus says:

    Peter, thanks for your feedback.

    I made some tests and noticed just 1 millisecond or less of overhead when using GET_LOCK/RELEASE_LOCK to queue accesses to cached files. The thing I liked about this approach is that the implementation was as easy as adding the requests to get and release the locks in the proper places, and, in case of problems, the locks are released as soon as each connection ends.

    I would like to point to this thread, where I’ve been posting my thoughts in more depth:
    http://area51.phpbb.com/phpBB/viewtopic.php?f=4&t=25059

    Cheers

  4. peter says:

    Markus,

    If you have a single cache request per page you probably do not care; in that sense all the listed techniques would work just fine. If you have many cache lookups, however, it is another story – e.g., 100 cache lookups would cause 100ms of pure overhead, not counting the cache access itself.

    So I do not think this approach is optimal, but it will be good enough in many cases.

    If it delivers good enough performance in your case – use it :)

  5. Markus says:

    We’re using this locking method for a web application. To give some numbers: at this moment the server has been running for about 21 hours. There have been 414,322 connections that have executed 6,654,212 questions, of which 1,952,321 were selects. That’s an average of about 16 questions (around 5 selects) per connection, but not all of them are cached. Maybe our experience will serve someone else out there. Cheers

  6. peter says:

    Markus,

    That works out to roughly 5 connections/sec and 26 selects/sec on average – not much at all. In your case the load is very light in terms of query numbers, so this method should not cause large overhead.

  7. Real-time transparent caching is another way to scale your application. In this approach, an in-memory DBMS is used to cache frequently accessed tables from the actual DBMS. If the application requests data from a cached table, it is retrieved from the in-memory DB; otherwise it is retrieved from the actual DBMS.

    Check out the CSQL Cache product at http://www.csqldb.com for more information on this.
