All too often people force themselves into using a database like MySQL without considering whether it is the best solution to their problem. Why? Because their other applications use it, so why not the new application? Over the past couple of months I have been doing a ton of work for clients who use their database like most people use memcached. Look up a row based on a key, update the data in the row, stuff the row back in the database. Rinse and repeat. Sure these setups vary sometimes, throwing in a “lookup” via username, or even the rare count. But for the most part they are designed to be simple.
A classic example is a simple online game. An online game may only require that the application retrieve a single record from the database. The record may contain all the vital stats for the game, be updated, and be stuffed back into the database. You would be surprised how many people use this type of system; I run into this type of application frequently. Keeping it simple ensures the application stays lean and mean and performs well. The issue is that even this simple design can start to have problems as the data size increases and you blow through your available memory. Is there a better architecture? Is there a way to get more scalability out of your database? Is the database even the best place for this data?
I decided to walk through setting up a very simple application that does what I have seen many clients do. Using this application I can then compare using MySQL to using MySQL + memcached, and then to other solutions like Tokyo Tyrant or Cassandra. My application does the following:
A.) Read a row from a database based on an integer primary key
B.) Update data in that row and replace the stored contents on disk
C.) Use the data from that row to look up a row in another table based on a text field (called email address)
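In code, the three steps might look something like this minimal sketch. The table and column names here are my own invention, and sqlite3 stands in for MySQL so the example is self-contained:

```python
import sqlite3

# Stand-in schema for the two benchmark tables (names are assumptions,
# not from the original post); sqlite3 is used so the sketch runs
# anywhere, though the real benchmark ran against MySQL/InnoDB.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE players  (id INTEGER PRIMARY KEY, email TEXT, stats TEXT);
CREATE TABLE profiles (email TEXT PRIMARY KEY, data TEXT);
INSERT INTO players  VALUES (1, 'a@example.com', 'level=1');
INSERT INTO profiles VALUES ('a@example.com', 'joined=2009');
""")

def one_transaction(pk):
    # A.) read a row by integer primary key
    email, stats = db.execute(
        "SELECT email, stats FROM players WHERE id=?", (pk,)).fetchone()
    # B.) update that row and write it back
    db.execute("UPDATE players SET stats=? WHERE id=?",
               (stats + ";touched=1", pk))
    # C.) use the email from that row to look up a second table
    return db.execute(
        "SELECT data FROM profiles WHERE email=?", (email,)).fetchone()[0]

print(one_transaction(1))  # joined=2009
```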
Seems simple enough, right? My two tables each contain 5M rows of data. Let’s see what happens:
You can see a dramatic drop-off in performance as my data falls out of memory; that’s not cool, is it? After all, database sizes tend to grow and very rarely shrink, which leads to a challenge faced by almost everyone: how do you maintain your performance in the face of increasing data size?
Here is where people start to scratch their heads. They naturally assume they need to scale more: we need more memory! If performance sucks, we must need more. So here come the bigger boxes, the read-only slaves, the complex sharding systems, the discussions on clustering, more memcached. We need to cover up the database’s inefficiencies to ensure that our application scales.
The problem is that for some applications we are fixing symptoms, not the problem itself. No matter how much you want it to fit, some things may not work (like the Godfather 3). The issue is that people assume data storage has to be in the database. “It’s data, it needs to go into the database” is often the battle cry. But hold on to your hat, I am going to shock you. For some applications, putting your data in the database is silly. Yes, the guy who blogs on bigdbahead.com and is writing this on the mysqlperformanceblog is saying you may not want to use a database. Heresy, I know! But many of us already accept storing data (at least temporarily) outside the DB. Think memcached.
Almost everyone loves memcached; it’s simple, fast, and just works. When your dataset exceeds your memory limitations or the database simply cannot keep up any more, this solution can really boost performance. I know you’re thinking my simple key lookup should really benefit from memcached. So let’s try it! I took the simple app I created that reads two rows and updates one of them, and changed it to read from memcached if available, remove the cached copy on update, and read from the DB only when required. I tested with a memcached size of 1GB, 2GB, and 4GB. For these tests I left InnoDB with a 256M buffer pool, or roughly 9% of the total data in memory.
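The caching pattern just described (read from memcached when possible, fall back to the database on a miss, remove the cached copy on update) is a standard cache-aside design. Here is a minimal sketch, using plain dicts as stand-ins for the memcached and MySQL clients; the get/set/delete operations map onto the memcached protocol verbs:

```python
# Cache-aside sketch: dicts stand in for the memcached client and the
# database, so the logic of the pattern is visible without any servers.
cache = {}
db = {1: "stats-v1"}

def read_row(pk):
    value = cache.get(pk)        # try memcached first
    if value is None:            # miss: go to the database...
        value = db[pk]
        cache[pk] = value        # ...and populate the cache
    return value

def update_row(pk, value):
    db[pk] = value               # the write goes to the database
    cache.pop(pk, None)          # remove the stale entry; next read re-caches

read_row(1)                      # miss -> DB -> cached
assert 1 in cache
update_row(1, "stats-v2")        # invalidates the cached copy
assert 1 not in cache
assert read_row(1) == "stats-v2" # miss again, re-cached with the new value
```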
Let’s look at the 1GB setting:
What, a performance regression? But we threw more memory at it! How can that be?
Memcached is not a cure-all. I have talked to many clients who say “we will just throw memcached at it”. Sometimes an app will scream, other times it won’t… and yet others require lots and lots of memory allocated to memcached to be successful. This application selects a random number between 1 and 2 million and looks up the result via that key. It then uses data from that random row to look up a second piece of information via email address. Because the entire dataset is about 4GB and only 1GB is in memcached, I keep pushing data out of memcached to make room for new records I am reading from the database. Remember, memcached needs repeatability to be helpful. I am still getting a really solid number of hits in memcached, but the number of writes in MySQL coupled with the still-large number of reads takes its toll. Another place where I have seen this kill clients is in apps that do some sequential scanning and do not have enough memory for memcached. For instance, if you have 1,000,000 rows of data but enough memory to store only 500,000 rows, sequentially accessing this data will destroy the usefulness of the cache:
get record 1, miss, read from disk, cache record 1
….
get record 500,001, miss, read from disk, expunge record 1, cache record 500,001
….
get record 1, miss, read from disk, expunge record 500,001, cache record 1
You keep overwriting the cache before you can use it. So here the complexity of adding memcached hurts us, because the cache is not actually buying us anything.
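A toy simulation makes the thrashing concrete. Here an LRU cache holds 500 of 1,000 records (scaled down from the 500K-of-1M example above) while the records are scanned sequentially, pass after pass; every record is evicted before it is requested again, so the hit count never rises above zero:

```python
from collections import OrderedDict

# Toy LRU cache: capacity 500, scanned sequentially over 1,000 records.
class LRU:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)          # mark as most recently used
        else:
            self.misses += 1
            if len(self.data) >= self.capacity:
                self.data.popitem(last=False)   # expunge the oldest entry
            self.data[key] = True               # "read from disk", then cache

cache = LRU(capacity=500)
for _ in range(3):                 # three full sequential passes
    for record in range(1000):
        cache.get(record)

print(cache.hits)   # 0 -- the cache never helps
```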
Now, bumping this up to 2GB actually makes the TPS jump around a lot, sometimes hitting 400 or 500 TPS and other times hitting as high as 1800 TPS. My guess is the movement is caused by the random numbers being generated and simply the luck of the draw.
Finally, let’s look at what happens with 4GB of memory allocated to memcached (the full dataset fits):
Here you can see that our “transactions” per second for this app increased almost 10X by using memcached. The TPS I get here varies from 1100 to just under 2000, with the average around 1400. I think we would all be very happy if we could get a 10X performance boost for our application.
But wouldn’t it be great if we could get more? I mean our reads are going pretty fast, but our writes leave a lot to be desired:
Over 17 ms to do an update. Wouldn’t it be great to just eliminate all the updates as well? What sort of throughput would we get? I will show you in part 2. Part 2 of this post will cover performance in a 100% pure memcached environment. Part 3 will focus on these same benchmarks in Tokyo Tyrant.
Great post. It really highlights how memcached isn’t always the solution, because it ain’t.
Though, I find that memcached can definitely shine when you have hot spots in your data. Those spots get cached more frequently, so having enough memory for the whole dataset isn’t required.
I look forward to the rest of the post.
This is going to be a good series of blog posts.
The dominance of SQL and the ease of use of MySQL in particular have made a relational database the standard solution. In addition, the speed and simplicity of memcached have made it a standard solution for dealing with real database limitations, or for covering up bad design decisions.
They are not always the best choice. I look forward to more on this topic.
Trouble is, the array of alternatives is daunting, and leaving a standards-based solution for other territory may involve trade-offs that require the understanding and agreement of all stakeholders, which is a challenge.
We need more discussions along these lines: relational database alternatives and assisting tools (memcached being an assisting tool), and alternatives… well, there are so many.
—
Tom Hanlon
This is a very interesting post. If I had a social media presence I would repost it immediately.
Are you using the new memcached binary protocol?
I wondered what kind of performance improvement the binary protocol would have vs. the text protocol.
What sort of application has 2 million rows and selects from them *completely randomly* ?
I can imagine a few, but it’s probably not a website or an online game?
The move to unscalable key/value stores is likely to cause people extreme headaches in the future. I hope you’ll cover what to do if you ever need to grow off of a replicated pair of tokyo tyrant servers? Or does it have a data repartitioning feature I don’t know about? 🙂
@Dormando,
I have seen a few such cases. One, for example, is a p2p client, which, every time someone goes online/offline, needs to read/write to the DB. There’s also a list of buddies to be managed. There are users in many continents, so different timezones; a lot more users are connected than can be cached; the result is an almost random behavior.
Another one is a plugin for websites, which calls upon the DB each time it’s displayed. It presented the same characteristics.
@Matt Yonkovit,
Great post!
Andy,
No, these tests were with the text protocol, not the binary protocol. Although that would be an interesting separate set of benchmarks.
Dormando,
From the memcached perspective, would it matter if there were 4 or 8 million rows and I selected from and tried to cache only 2 million of them? The result would be the same in this case: too little memory, too much data. Now from the Tyrant perspective, since it’s disk-based, increasing the dataset could/should decrease performance; however, in my tests I went up to 6 million rows and saw almost identical numbers to the 2 million set. The point about memcached not always fixing issues is that you need to understand your application and your access patterns. I run into many applications in which 100% of the stored MySQL data is hot, and many times adding a caching layer does not help nearly as much as it does in other environments. Now from a gaming standpoint, the specific example I was thinking of was finding an opponent from a list at random, then doing something with them before stuffing them back into the list. But I have worked on others, a game match-making service, a game lobby, etc., that all do very similar PK lookups.
In terms of scalability… there are many options here (this article was not intended to be a primer on Tyrant, rather to get people to think outside the database), and for the most part it is as scalable, or not, as MySQL. Replication, sure… then? Sharding… if you plan for this, or with some tweaks in Tyrant’s API you could build sharding into the system. Also, you can use the memcached API to access and write to Tyrant, which allows you to distribute your data across multiple locations. Tyrant, however, may not be the best solution for every setup; maybe it’s Redis, or Voldemort, or Cassandra, or something else… I think these all have features and specifics which make them compelling for certain applications, and hopefully I will find the time to write about these as well.
Matt
Great article Matt,
I’d like to know more about the testing environment.
We use memcached on webheads with our database on separate machines connected over ethernet. Our web servers run nginx and have a tiny memory footprint, we dedicate the rest of the webhead’s RAM to a local memcached instance so all queries are done locally; we don’t take advantage of the distributed features memcached is known for.
Only if an object isn’t in the cache or has expired do we query the database (and populate the cache again).
From this article it appears that you are looking into using memcached as an alternative to a database server where the clients are connected remotely, not locally, and that you are writing to memcached.
Most people write to their database and read from memcached, I assume quite similarly to what we do; your idea appears to be to use memcached for both reads and writes. At what point is the data stored persistently in the database, if at all?
I look forward to your next posts Matt, keep up the good work.
Matt,
Indeed, uniform distribution barely exists in practice, but I think for demonstration purposes it is a good choice.
I do not quite understand why MySQL + memcached has such a large regression in the first case. Considering memcached can do 17K gets per second (even if these are misses), the overhead of constant misses should be small. Was MySQL somewhat affected by having memcached on the same box (which I assume was the case)?
It is also very interesting to see that if you fit everything in memory, memcached is about 2x-2.5x better than MySQL for primary key lookups. Am I reading it correctly?
Also, are you planning to make the sources available? I think it would be really helpful; right now I’m not sure I completely understand what is involved in every graph you’re showing.
Finally, for the update… why would you see 17ms for an update? Is this because you have to go to the database, which does not have the page in the cache, so a read has to happen from disk? The modification itself is done in the buffer pool and has very low added latency.
On the tools you’re looking at, it would be interesting if you could add Redis to the list. It is not as powerful as some other technologies in terms of automatic clustering etc., but it is very simple, which is cool.
Finally… I should note that the direct implementation in MySQL is not the only choice. An approach similar to SSTables can be used in MySQL as well: place your data in a small “daily” table which fits in RAM, while selecting from both the small and the large table. Do the merge in the background.
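A rough sketch of that idea (table names are invented, and sqlite3 is used so the example is self-contained): recent writes land in a small hot table, reads consult both tables with the small one taking precedence, and a background job merges the small table into the big one and empties it.

```python
import sqlite3

# "Daily small table" sketch: the hot table absorbs recent writes,
# reads check it first, and merge() folds it into the big table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE big   (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE daily (id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO big   VALUES (1, 'old'), (2, 'old');
INSERT INTO daily VALUES (2, 'new');   -- today's update to row 2
""")

def read(pk):
    # The daily table wins when a row exists in both places.
    row = db.execute("SELECT val FROM daily WHERE id=?", (pk,)).fetchone()
    if row is None:
        row = db.execute("SELECT val FROM big WHERE id=?", (pk,)).fetchone()
    return row[0]

def merge():
    # Background job: fold daily rows into the big table, then empty it.
    db.execute("INSERT OR REPLACE INTO big SELECT id, val FROM daily")
    db.execute("DELETE FROM daily")

print(read(2))   # new
merge()
print(read(2))   # new  (now served from the big table)
```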
And there is yet another one: how does all this relate to SSDs? Could they be a simple solution instead of using “optimized” systems?
@JD,
For these tests I was on the same local machine as the DB, running a Perl client. The test in part 1 read from memcached when the row it was looking for was there, otherwise it would go to disk. In the event of a write, I would write the row to the database and replace the row in memcached with the new value.
@Peter,
I have done two or three gigs lately with a very random distribution of data, so they do exist. But realistically there is going to be some repeatability. I think the idea is to show the extreme.
There is limited overhead for memcached on the same server. The drop was not that significant in the grand scheme of things, what, about 13%? Plus there is always the chance that things slow down for one test or another (even with an average of 3 test runs, the sample set is small).
You’re saying that with a 4GB buffer pool, memcached PK lookups + MySQL is 2x-ish faster. The 1385 number is for 2 memcached lookups, 1 MySQL write, and 1 memcached set. I have pure memcached lookup numbers vs. PK lookups without the 3 combined, and could make another post just on that if there is interest.
Yes I can make the source available.
The database is set up with only a 256M buffer pool, so for many of the pages it needs it has to read from disk and then replace a cached page.
Redis is on a very long list of tools I want to try out.
There are always choices and other options. Here I am trying to get people to think outside the box and outside the database. If you don’t need sql, why use it?
That’s an interesting question, Testing with ssd and seeing the performance impact for both NOSQL -vs- MySQL would be very interesting.
I think it’s true: we do not always need databases. And I’d like to say something more: when we do need databases, we do not always need a relational DBMS (RDBMS). Not every problem is addressed by the relational paradigm, and I’m not talking about ORDBMS.