We’ve published our first case study. The customer, ideeli, had a database that was struggling on standard MySQL and InnoDB. The big win was the upgrade to XtraDB. The business continued to grow quickly, and months later under much more traffic, the database is still outperforming their previous version.
I thought I’d write a few notes that didn’t seem appropriate to include in the case study, because this was a fun project that might be interesting to readers.
As usual, it was all about diagnosing the problem correctly. I used a variety of tools to help with this, foremost among them “stalk” and “collect” from Aspersa. There were several problems, not just one, and they required different techniques to diagnose. This can be hard when the problems are happening sporadically and/or mixed together. You really need to be disciplined and collect data, data, data. If you are not sure about the cause of something, you don’t have the right data. Maybe you have too little, or too much, or you have the signal mixed in with the noise. Knowing when and how to get and interpret good diagnostic data is easily 95% or 98% of the work in a case like this. All I had to do was wait until the problem happened, look at the diagnostics, and a couple minutes later I had my answer.
What were the problems? The query cache was causing both mutex contention and excessive CPU usage, for different reasons, and I found different problems in different samples. InnoDB was also dying under mutex contention. Each spike of slow queries I found was caused by different things. Sometimes GDB stack traces showed InnoDB mutex contention, sometimes oprofile showed the query cache hogging the CPU, and so on. So we had to solve all the problems, not just some of them.
The graphs of query traffic and response times were from data I gathered with tcprstat. I also used the data from tcprstat to analyze the variation in query response time. One-second intervals is a relatively fine granularity, but at that level you can see better when micro-freezes are occurring. I used ad-hoc slow-query-log analysis with awk and other tools to discover and investigate unusual patterns, and figure out whether queries were causes or victims of performance problems. The problems here were not caused by queries, but query behavior was the symptom that we could observe, so all of the above analysis was useful for detecting the problem as it happened, and verifying that it was not still happening after we implemented fixes.
New Relic was a very helpful tool in this case, too. If you don’t use New Relic, you might try it. (We don’t get paid to say that.) Their tools are really nice.
I also want to mention that this database’s problems were entirely inside the database software itself. The ideeli team had already done a great job with indexing, query optimization, and so forth. Nothing more could be done without fixing these problems inside MySQL and InnoDB at the source code level, or changing the application architecture.
All things considered, the database server’s performance is not as high as many I’ve worked on, so the absolute numbers of queries per second may not look impressive. However, remember that this database is running on an EC2 server. EC2 has relatively slow virtual CPUs, and given that and the workload this server is under, it does very well. Of course you could far exceed that performance on a “real” server.
This case illustrates why average-case performance isn’t a good metric. As Peter says, “the average temperature of patients in the hospital isn’t important.” Good performance means avoiding outliers and variations. The query performance needs to be a) fast, and b) fast all the time, and c) the same kind of fast all the time. Variations in performance at one layer introduce cascading variations at each higher layer. Rarely is a stuttering component’s misbehavior absorbed by other layers. Instead, every layer above it gets into trouble.
That, among other things, is why the database has a much harder job than people sometimes realize, and why it’s so hard to write a good database server.
The technical content of the white paper is excellent. Thanks.
The white paper mentions that thundering herds are a serious problem for their memcache deployment and that they use something like multi-version keys to reduce the problem. Can you elaborate on that? Are any of the key-value in-memory caches working on something better? This is a common problem for memcache deployments.
I’m surprised they run their DB servers on EC2. EC2 is very expensive yet the hardware you get for the price they charge is truly anemic.
What are the reasons that compel people to run DB servers on the cloud? I’m interested to know.
A comment with confidential information about ideeli was deleted from this thread.
Mark, I will ask them if they would like to write something about their memcached techniques. I thought that they had blogged about it but I can’t find it now.
Andy, one of the factors in this case is 40x swings between peak and off-peak, and the peaks are really short.
Baron – I really like the tools and processes that Percona has implemented and shared for debugging intermittent performance problems. They make it possible to run even more critical workloads on InnoDB and XtraDB. Do you cover usage of these tools in Percona training classes?
Hi Baron,
the article sounds great, but it’s missing a key reason why the move to XtraDB was justified (and I mean here the technical reason, because for the rest – if they involved Percona for consulting it’s logical you move them to XtraDB ;-)) – well, 5.1.37 is quite old, but I’m looking for the main reason to move to XtraDB and not to use the latest InnoDB plugin in 5.1 (BTW, from the white paper it’s not clear also if they used InnoDB plugin before, and if yes – which version?)..
While query cache performance is a common issue, I did not find any details about “Internal contention inside InnoDB” you’ve observed.. – may you, please, go more in depth here?.. – as it’s pure InnoDB internals, I don’t think it’ll touch any confidential information regarding the customer 😉
Also, was glad to know Peter is using the same example as me about an “average temperature in the hospital” 🙂 – I’m usually adding “if you ignore individual spikes – you have several dead patient every morning..” 🙂
Rgds,
-Dimitri
Mark,
Thanks. You know that writing software is a small fraction of the effort of testing, releasing, and documenting it. Often I hastily write a small Awk or Perl script, or even just some grep|sort|uniq|sort, figure out what I need, and save the snippet to a text file for “someday,” with reference to the internal case number so I can figure out what I was using it for. That is a big file now. But some of the things eventually get implemented, like ioprofile.
It’s similar with our training classes, and even with the major open-source projects we do. I just spent yesterday working on xtrabackup documentation. Alas, there is so much richness there that isn’t communicated to the world, just because that takes so much work and time.
I have submitted some session proposals to cover the most common tools and techniques I use at the MySQL conference in April.
Dimitri,
Actually the case study is more promotional of XtraDB than I am in consulting. I forget now exactly which mutexes were the problem, but I knew that at least some of them were solved in the InnoDB plugin. (They were not using the plugin.) I recommended *either* the plugin *or* XtraDB, and they independently decided to use Percona Server with XtraDB, without asking my opinion further. I do not recommend Percona’s products just because I work with Percona; I only recommend what will actually solve the problem. Beyond diagnosing the problem and presenting some solutions, I didn’t try to influence them.
I am certain that MySQL 5.5 would also have fixed the problem, but ideeli is very disciplined and prudent. I don’t think they would have been comfortable with so many changes at once.
Baron,
thanks for clarification – and I’d say your honest answer is making a honor for Percona!
May only wish you to keep your great work!
Rgds,
-Dimitri
Hi Baron, others. First of all, congratulations for publishing your first case study. You will find that reference customers and case studies are one of the best marketing and sales tools. I hope with the new approaches you’ve adopted this Spring, Percona will grow even stronger and more prosperous.
I also liked the distinct “Percona style” preserved in the paper. Yes, it’s a case study, pdf, even has a picture, but still very technical, I’m almost missing the fluff 🙂 It fits very well with the Percona brand.
Look forward to reading more stories, and a background blog is always a nice complement.
Thanks Henrik, that’s a very nice compliment!