We’ve published our first case study. The customer, ideeli, had a database that was struggling on standard MySQL and InnoDB. The big win was the upgrade to XtraDB. The business has continued to grow quickly, and months later, under much more traffic, the database still outperforms the old setup.

I thought I’d write a few notes that didn’t seem appropriate to include in the case study, because this was a fun project that might be interesting to readers.

As usual, it was all about diagnosing the problem correctly. I used a variety of tools to help with this, foremost among them “stalk” and “collect” from Aspersa. There were several problems, not just one, and they required different techniques to diagnose. This can be hard when the problems are happening sporadically and/or mixed together. You really need to be disciplined and collect data, data, data. If you are not sure about the cause of something, you don’t have the right data. Maybe you have too little, or too much, or you have the signal mixed in with the noise. Knowing when and how to get and interpret good diagnostic data is easily 95% or 98% of the work in a case like this. All I had to do was wait until the problem happened, look at the diagnostics, and a couple minutes later I had my answer.
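
To make the “wait for the problem, then capture everything” loop concrete, here is a minimal sketch of the stalk-and-collect idea in Python. This is not Aspersa’s actual code: the Threads_running trigger, the threshold, and the ./collect invocation are assumptions standing in for whatever trigger and collector you configure.

```python
import datetime
import subprocess
import time

THRESHOLD = 50  # assumed trigger level for Threads_running; tune for your server
CYCLES = 3      # require several consecutive bad samples to avoid false triggers

def threads_running():
    # Poll the server the simple way, via the mysql CLI. The real stalk
    # script watches SHOW GLOBAL STATUS in the same general fashion.
    out = subprocess.run(
        ["mysql", "-N", "-B", "-e",
         "SHOW GLOBAL STATUS LIKE 'Threads_running'"],
        capture_output=True, text=True).stdout
    return int(out.split()[1])

over = 0
while True:
    over = over + 1 if threads_running() > THRESHOLD else 0
    if over >= CYCLES:
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        # This is where "collect" earns its keep: grabbing vmstat, iostat,
        # SHOW ENGINE INNODB STATUS, stack traces, and so on, all at once.
        subprocess.Popen(["./collect", stamp])  # assumed collector script
        over = 0
        time.sleep(60)  # cool-down so one incident doesn't fire repeatedly
    time.sleep(1)
```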

What were the problems? The query cache was causing both mutex contention and excessive CPU usage, for different reasons, and I found different problems in different samples. InnoDB was also dying under mutex contention. Different spikes of slow queries had different causes: sometimes GDB stack traces showed InnoDB mutex contention, sometimes oprofile showed the query cache hogging the CPU, and so on. So we had to solve all of the problems, not just some of them.
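
For the GDB part, the technique is the classic “poor man’s profiler”: snapshot a backtrace of every thread and tally where the threads are parked. If most of them are sitting in a mutex-wait function, you are looking at contention. Here is a rough Python sketch of that aggregation (not our actual scripts; the frame-parsing regex and the top-frames summary are simplified assumptions):

```python
import collections
import re
import subprocess
import sys

def top_frames(pid):
    # One batch of backtraces from every thread, via gdb's batch mode.
    out = subprocess.run(
        ["gdb", "-batch", "-ex", "set pagination 0",
         "-ex", "thread apply all bt", "-p", str(pid)],
        capture_output=True, text=True).stdout
    counts = collections.Counter()
    for line in out.splitlines():
        # Keep only the innermost frame (#0) of each thread's stack.
        m = re.match(r"#0\s+(?:0x[0-9a-f]+ in )?(\S+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    # Usage: python stacks.py $(pidof mysqld)
    for func, n in top_frames(int(sys.argv[1])).most_common(10):
        print(n, func)
```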

The graphs of query traffic and response times were from data I gathered with tcprstat. I also used the tcprstat data to analyze the variation in query response time. One-second intervals are a relatively fine granularity, but at that level you can see much more clearly when micro-freezes occur. I used ad-hoc slow-query-log analysis with awk and other tools to discover and investigate unusual patterns, and to figure out whether queries were causes or victims of performance problems. The problems here were not caused by queries, but query behavior was the symptom we could observe, so all of the above analysis was useful for detecting the problem as it happened, and for verifying that it was not still happening after we implemented fixes.
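
As an illustration of the variation analysis, here is a small sketch that flags candidate micro-freezes in tcprstat output. The header row and column names are assumptions about the default output (check your build and tcprstat’s --format option), and the 5x-median threshold is likewise just a starting point, not a rule we used.

```python
import statistics
import sys

# Read tcprstat output on stdin. Assumed format: a header row naming
# whitespace-separated columns including "timestamp", "count", and "avg"
# (response times in microseconds). Hard-code the indexes if your
# output has no header.
header = None
rows = []
for line in sys.stdin:
    parts = line.split()
    if not parts:
        continue
    if header is None:
        header = {name: i for i, name in enumerate(parts)}
        continue
    rows.append((int(parts[header["timestamp"]]),
                 int(parts[header["count"]]),
                 float(parts[header["avg"]])))

baseline = statistics.median(r[2] for r in rows)
for ts, count, avg in rows:
    # A one-second interval whose average response time is far above the
    # overall median is a candidate micro-freeze worth lining up against
    # stack traces and status samples from the same moment.
    if avg > 5 * baseline:
        print(f"{ts}: avg {avg:.0f}us vs median {baseline:.0f}us over {count} queries")
```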

New Relic was a very helpful tool in this case, too. If you don’t use New Relic, you might try it. (We don’t get paid to say that.) Their tools are really nice.

I also want to mention that this database’s problems were entirely inside the database software itself. The ideeli team had already done a great job with indexing, query optimization, and so forth. Nothing more could be done without fixing these problems inside MySQL and InnoDB at the source code level, or changing the application architecture.

All things considered, this database server’s performance is not as high as that of many servers I’ve worked on, so the absolute numbers of queries per second may not look impressive. However, remember that this database is running on an EC2 server. EC2 has relatively slow virtual CPUs, and given that and the workload this server is under, it does very well. Of course, you could far exceed this performance on a “real” server.

This case illustrates why average-case performance isn’t a good metric. As Peter says, “the average temperature of patients in the hospital isn’t important.” Good performance means avoiding outliers and variations. Query performance needs to be a) fast, b) fast all the time, and c) the same kind of fast all the time. Variations in performance at one layer introduce cascading variations at each higher layer. Rarely is a stuttering component’s misbehavior absorbed by other layers; instead, every layer above it gets into trouble.
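
A toy example, with numbers I made up, shows why the hospital metaphor bites: two workloads can have identical average response times while one of them regularly stalls.

```python
import statistics

# Two synthetic response-time samples (in ms) with the same average.
steady = [10] * 100           # every query takes 10ms
spiky = [7] * 98 + [157] * 2  # mostly fast, but occasional long stalls

for name, sample in (("steady", steady), ("spiky", spiky)):
    ordered = sorted(sample)
    p99 = ordered[int(0.99 * len(ordered)) - 1]  # crude percentile, fine here
    print(f"{name}: avg={statistics.mean(sample):.1f}ms "
          f"p99={p99}ms max={ordered[-1]}ms")
```

Both samples average 10ms, but the spiky one has a 157ms tail. A monitoring dashboard that shows only the average says the two servers are identical; the users who hit the stalls would disagree.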

That, among other things, is why the database has a much harder job than people sometimes realize, and why it’s so hard to write a good database server.

Comments
Mark Callaghan

The technical content of the white paper is excellent. Thanks.

Mark Callaghan

The white paper mentions that thundering herds are a serious problem for their memcache deployment and that they use something like multi-version keys to reduce the problem. Can you elaborate on that? Are any of the key-value in-memory caches working on something better? This is a common problem for memcache deployments.

Andy

I’m surprised they run their DB servers on EC2. EC2 is very expensive, yet the hardware you get for the price is truly anemic.

What are the reasons that compel people to run DB servers on the cloud? I’m interested to know.

Mark Callaghan

Baron – I really like the tools and processes that Percona has implemented and shared for debugging intermittent performance problems. They make it possible to run even more critical workloads on InnoDB and XtraDB. Do you cover usage of these tools in Percona training classes?

Dimitri

Hi Baron,

the article sounds great, but it’s missing the key reason why the move to XtraDB was justified (I mean the technical reason; for the rest, if they engaged Percona for consulting, it’s logical you move them to XtraDB ;-)). Sure, 5.1.37 is quite old, but I’m looking for the main reason to move to XtraDB rather than to the latest InnoDB plugin in 5.1. (BTW, from the white paper it’s also not clear whether they used the InnoDB plugin before, and if yes, which version?)

While query cache performance is a common issue, I did not find any details about the “internal contention inside InnoDB” you observed. Could you please go into more depth here? As it’s pure InnoDB internals, I don’t think it would touch any confidential information regarding the customer 😉

Also, I was glad to learn that Peter uses the same example as me about the “average temperature in the hospital” 🙂 I usually add: “if you ignore individual spikes, you have several dead patients every morning” 🙂

Rgds,
-Dimitri

Dimitri

Baron,

thanks for the clarification, and I’d say your honest answer does honor to Percona!
I can only wish you to keep up your great work!

Rgds,
-Dimitri

Henrik Ingo

Hi Baron, others. First of all, congratulations on publishing your first case study. You will find that reference customers and case studies are among the best marketing and sales tools. I hope that with the new approaches you’ve adopted this spring, Percona will grow even stronger and more prosperous.

I also liked the distinct “Percona style” preserved in the paper. Yes, it’s a case study, a PDF, it even has a picture, but it’s still very technical. I’m almost missing the fluff 🙂 It fits the Percona brand very well.

I look forward to reading more stories, and a background blog post like this one is always a nice complement.