All too often people force themselves into using a database like MySQL with no thought as to whether it is the best solution to their problem. Why?  Because their other applications use it, so why not the new application?  Over the past couple of months I have been doing a ton of work for clients who use their database like most people use memcached: look up a row based on a key, update the data in the row, stuff the row back in the database.  Rinse and repeat.  Sure, these setups vary sometimes, throwing in a “lookup” via username, or even the rare count.  But for the most part they are designed to be simple.

A classic example is a simple online game.  An online game may only require that the application retrieve a single record from the database.  The record may contain all the vital stats for the game; it is read, updated, and stuffed back into the database.  You would be surprised how many people use this type of system, as I run into this type of application frequently.  Keeping it simple ensures the application is generally lean and mean and performs well.  The issue is that even this simple design can start to have problems as the data size increases and you blow through your available memory.  Is there a better architecture?  Is there a way to get more scalability out of your database?  Is the database even the best place for this data?

I decided to walk through setting up a very simple application that does what I have seen many clients do.  Using this application I can then compare using MySQL alone to using MySQL + memcached, and then to other solutions like Tokyo Tyrant or Cassandra.   My application does the following:

A.)  Read a row from a database based on an integer primary key
B.)  Update data from that row and replace the stored contents on disk
C.)  Use the data from that row to look up a row in another table based on a text field (called email address)
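The three steps above can be sketched in a few lines of code.  This is a toy illustration only: SQLite stands in for MySQL/InnoDB, the table and column names are made up for the example, and the tables hold one row instead of 5M — the point is just the access pattern.

```python
import sqlite3

# Toy stand-in for the benchmark schema: a "player" table keyed by an
# integer primary key, and an "account" table keyed by email address.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player  (id INTEGER PRIMARY KEY, email TEXT, score INTEGER);
    CREATE TABLE account (email TEXT PRIMARY KEY, plan TEXT);
    INSERT INTO player  VALUES (1, 'a@example.com', 10);
    INSERT INTO account VALUES ('a@example.com', 'free');
""")

def run_transaction(conn, pk):
    # A.) read a row based on an integer primary key
    email, score = conn.execute(
        "SELECT email, score FROM player WHERE id = ?", (pk,)).fetchone()
    # B.) update data from that row and write it back
    conn.execute("UPDATE player SET score = ? WHERE id = ?", (score + 1, pk))
    # C.) use the row's text field to look up a row in another table
    return conn.execute(
        "SELECT plan FROM account WHERE email = ?", (email,)).fetchone()[0]

print(run_transaction(conn, 1))  # -> free
```

The benchmark simply runs this read/update/lookup cycle in a loop against random keys.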

Seems simple enough, right?  My two tables each contain 5M rows of data.  Let’s see what happens:

DB Fits into Memory

Chart of TPS for benchmark application

You can see a dramatic drop-off in performance as my data falls out of memory; that’s not cool, is it?  After all, database sizes tend to grow and very rarely shrink, which leads to a challenge faced by almost everyone: how do you maintain performance in the face of increasing data size?

Here is where people start to scratch their heads.  They naturally assume they need to scale more: we need more memory!   If performance sucks, we must need more.  So here come the bigger boxes, the read-only slaves, the complex sharding systems, the discussions on clustering, more memcached.  We need to cover up the database’s inefficiencies to ensure that our application scales.

The problem is that for some applications we are fixing symptoms, not the problem itself.  No matter how much you want it to fit, some things may not work (like The Godfather Part III).    The issue is that people assume data storage has to be in the database.  “It’s data, it needs to go into the database” is often the battle cry.   But hold on to your hat, I am going to shock you: for some applications, putting your data in the database is silly.  Yes, the guy who blogs on bigdbahead.com and is writing this on the mysqlperformanceblog is saying you may not want to use a database.  Heresy, I know!  But many of us already accept storing data (at least temporarily) outside the DB.  Think memcached.

Almost everyone loves memcached: it’s simple, fast, and just works.  When your dataset exceeds your memory limitations, or the database simply cannot keep up any more, this solution can really boost performance.  I know you’re thinking my simple key lookup should really benefit from memcached, so let’s try it!  I took the simple app I created that reads two rows and updates one of them, and changed it to read from memcached if available, remove the cached copy on update, and read from the DB only when required.  I tested with a memcached size of 1GB, 2GB, and 4GB.  For these tests I left InnoDB with a 256M buffer pool, or roughly 9% of the total data in memory.
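The cache-aside logic described above is roughly the following.  A plain dict stands in for the memcached client in this sketch (real clients expose `get`/`set`/`delete` calls with the same shape), and the one-row `db` dict is a made-up stand-in for the MySQL tables:

```python
# Cache-aside sketch: read from the cache if present, fall back to the
# database on a miss, and invalidate (delete) the cached copy on update.
cache = {}                                       # stand-in for memcached
db = {1: {"email": "a@example.com", "score": 10}}  # stand-in for MySQL

def cached_read(pk):
    row = cache.get(pk)          # try the cache first
    if row is None:
        row = db[pk]             # miss: read from the database
        cache[pk] = row          # populate the cache for next time
    return row

def update(pk, **changes):
    db[pk].update(changes)       # the write goes to the database...
    cache.pop(pk, None)          # ...and the stale cached copy is removed

cached_read(1)                   # miss: reads the db, caches the row
cached_read(1)                   # hit: served from the cache
update(1, score=11)              # invalidates the cached copy
print(cached_read(1)["score"])   # -> 11 (re-read from the db, re-cached)
```

Removing the cached copy on update (rather than rewriting it) keeps the cache and database from drifting apart when writes are frequent.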

Let’s look at the 1GB setting:

Ensure you have enough memory for memcached

What, a performance regression?  But we threw more memory at it!  How can that be?

Memcached is not a cure-all.  I have talked to many clients who say “we will just throw memcached at it”.   Sometimes an app will scream; other times it won’t… and yet others require lots and lots of memory allocated to memcached to be successful.    This application selects a random number between 1 and 2 million and looks up the result via that key.  It then uses data from that random row to look up a second piece of information via email address.  Because the entire dataset is about 4GB and only 1GB is in memcached, I keep pushing data out of memcached to make room for new records I am reading from the database. Remember, memcached needs repeatability to be helpful.   I am still getting a really solid number of hits in memcached, but the number of writes in MySQL, coupled with the still-large number of reads, takes its toll.  Another place where I have seen this kill clients is in apps that do some sequential scanning and do not have enough memory for memcached.  For instance, if you have 1,000,000 rows of data but enough memory to store only 500,000 rows, sequentially accessing the data will destroy the usefulness of the cache:

get record 1, miss, read from disk, cache record 1
….
get record 500,001, miss, read from disk, expunge record 1, cache record 500,001
….
get record 1, miss, read from disk, expunge record 500,001, cache record 1

You keep overwriting the cache before you can use it.  So here the complexity of adding memcached hurts us, because the cache is not actually buying us anything.

Bumping this up to 2GB actually makes the TPS jump around a lot, sometimes hitting 400 or 500 TPS and other times reaching as high as 1800 TPS.  My guess is the movement is caused by the random numbers being generated and simply the luck of the draw.

Finally, let’s look at what happens with 4GB of memory allocated to memcached (the full dataset fits):

Transactions with and without Memcached

Here you can see that our “transactions” per second for this app increased almost 10X by using memcached.  The TPS I get here vary from 1100 to just under 2000, with the average around 1400.   I think we would all be very happy if we could get a 10X performance boost for our application.

But wouldn’t it be great if we could get more?  I mean, our reads are going pretty fast, but our writes leave a lot to be desired:

Read -vs- write times with memcached + mysql mixed

Over 17 ms to do an update.  Wouldn’t it be great to just eliminate all the updates as well?  What sort of throughput would we get?   I will show you in part 2.  Part 2 of this post will cover performance in a 100% pure memcached environment. Part 3 will focus on these same benchmarks in Tokyo Tyrant.


Comments
Darius Jahandarie

Great post. It really highlights how memcached isn’t always the solution, because it ain’t.

Though, I find that memcached can definitely shine when you have hot spots in your data. Those spots get cached more frequently, so having enough memory for the whole dataset isn’t required.

I look forward to the rest of the post.

Tom Hanlon

This is going to be a good series of blog posts.

The dominance of SQL and the ease of use of MySQL in particular have made a relational database the standard solution. In addition, the speed and simplicity of memcached have made it a standard solution for dealing with real database limitations, or for covering up bad design decisions.

They are not always the best choice. I look forward to more on this topic.

Trouble is… the array of alternatives is daunting, and leaving a standards-based solution for other territories may involve trade-offs that require the understanding and agreement of all stakeholders; that is a challenge.

We need more discussions along these lines: relational database alternatives and assisting tools (memcached being an assisting tool), and alternatives… well, there are so many.


tobi

This is a very interesting post. If I had a social media presence I would repost it immediately.

Andy

Are you using the new memcached binary protocol?

I wonder what kind of performance improvement the binary protocol would have vs. the text protocol.

Dormando

What sort of application has 2 million rows and selects from them *completely randomly* ?

I can imagine a few, but it’s probably not a website or an online game?

The move to unscalable key/value stores is likely to cause people extreme headaches in the future. I hope you’ll cover what to do if you ever need to grow off of a replicated pair of tokyo tyrant servers? Or does it have a data repartitioning feature I don’t know about? 🙂

Shlomi Noach

@Dormando,

I have seen a few such cases. One, for example, is a p2p client, which, every time someone goes online/offline, needs to read/write to the DB. There’s also a list of buddies to be managed. There are users in many continents, so different timezones; a lot more users are connected than can be cached; the result is an almost random behavior.
Another one is a plugin for websites, which calls upon the DB each time it’s displayed. It presented the same characteristics.

Great post!

JD

Great article Matt,

I’d like to know more about the testing environment.
We use memcached on webheads with our database on separate machines connected over ethernet. Our web servers run nginx and have a tiny memory footprint, we dedicate the rest of the webhead’s RAM to a local memcached instance so all queries are done locally; we don’t take advantage of the distributed features memcached is known for.

Only if an object isn’t in the cache or has expired do we query the database (and populate the cache again).

From this article it appears that you are looking into using memcached as an alternative to a database server where the clients are connected remotely, not locally, and that you are writing to memcached.

Most people write to their database and read from memcached, I assume quite similarly to what we do. Your idea appears to be to use memcached for both reads and writes; at what point is the data stored persistently in the database, if at all?

I look forward to your next posts Matt, keep up the good work.

Peter Zaitsev

Matt,

Indeed uniform distribution barely exists in the practice but I think for demonstration purposes it is a good choice.

I do not quite understand why MySQL + memcache has so large a regression in the first case. Considering memcache can do 17K gets per second (even if these are misses), the overhead of constant misses should be small. Was MySQL somewhat affected by having memcache on the same box (which I assume was the case)?

It is also very interesting to see that if you fit everything in memory, memcache is about 2x-2.5x better than MySQL for primary key lookups. Am I reading it correctly?

Also, are you planning to make the sources available? I think it would be really helpful; right now I’m not sure I completely understand what is involved in every graph you’re showing.

Finally, for updates… why would you see 17ms for an update? Is this because you have to go to the database, which does not have the row in the cache, so a read has to happen from disk? The modification itself is done in the buffer pool and has very low added latency.

As for the tools you’re looking at, it would be interesting if you could add Redis to the list. It is not as powerful as some other technologies in terms of automatic clustering etc., but it is very simple, which is cool.

Finally… I should note the direct implementation in MySQL is not the only choice. An approach similar to SSTables can be used in MySQL as well: write your data to a small “daily” table which fits in RAM, while selecting from both the small and large tables. Do the merge in the background.

And there is yet another question: how does all this relate to SSDs? Could they be a simple solution instead of using “optimized” systems?

Santiago

I think it’s true. We do not always need databases. And I’d like to say something more: when we do need databases, we do not always need a relational DBMS (RDBMS). Not every problem is addressed by the relational paradigm, and I’m not talking about ORDBMS.