How Percona does a MySQL Performance Audit

Our customers or prospective customers often ask us how we do a performance audit (it’s our most popular service). I thought I should write a blog post that will both answer their question, so I can just reply “read all about it at this URL” and share our methodology with readers a little bit. This fits well with our philosophy of openness. It also shocks people sometimes — “you’re giving away the golden goose!” Not really. What you hire us for is our experience, not a recipe that anyone can follow.

A full performance audit is far more extensive than this article can cover, and might wander into Apache, networking config, caching layers, etc. Wherever the problem is, we’ll track it down. I won’t talk about that. That’s not because I want to keep secrets from you. To the contrary, I’d love to share it all with you. But that’s a huge job; it will take many pages, and I’m not going to write that much.

The kickoff call

There’s actually a step before the performance audit begins. We call this a kickoff call. We get on the phone with the key technical staff on the client’s side and discuss the application in general. If possible, we take a brief look at the server beforehand, so we can ask more intelligent questions and skip obvious things. This call is sometimes up to an hour long, if we’re discussing a lot of things like how to build for massive scalability, how to do read-write splitting without breaking the user experience, how to take online non-blocking backups, or things like that.

But in general it can be a lot shorter than clients expect, because

time is money
your application is not as unique as you think, and we already know a lot about it by browsing your site or hearing you say “we are a SaaS hosted application for data mining stock market transactions,” or whatever the case may be
we’ve generally seen just about everything people throw at databases

Many people assume their application is difficult or somehow different, and that we need all kinds of schema diagrams and code listings, but the truth is people often have the same insights into particular problems, and therefore they try similar solutions. And besides, most applications have a lot of the same components. Tagging, friends, queues, click tracking, search, paginated displays — we’ve seen these and dozens of other common patterns done a hundred different ways.

The kickoff call has some other important goals too. We want to know what your concerns are at this stage. Is it the current performance, future performance, scalability, ability to recover from disasters? We also want to know what the operating parameters are. Can we suggest upgrades, rewriting queries, or whatnot? Sometimes there are rigid constraints on what types of solutions we can propose, and we need to know ahead of time so we don’t spend time on things you have no way to change. An example of this is a third-party product whose code you are unwilling or unable to change.

We also want to know the operating mode you expect us to work in. Maybe you want us to analyze and present recommendations for your consideration, and take a look-don’t-touch approach. Or, maybe you just want us to fix things and tell you later what we did. Either is a completely valid approach. You tell us what you want, and that’s what we’ll do (of course, we will also tell you what we think is best for you, which is our job).

After the kickoff call, we write some notes, and then get down to the audit itself. We usually have two people on the kickoff call so that we don’t have a single person knowing everything about you (only one is billable, though). The second person will be from the same team within Percona. However, the audit itself is generally a single person, with another person reading all the notes and generally keeping in touch with what the lead consultant does. We want two sets of eyes on things whenever possible. This also has value in case you decide to continue on to a longer relationship with Percona, which is reasonably common; in that case you will establish a solid relationship with the team, and group involvement early on is a good start to this.

What to look for during the audit

During the audit I look for outliers and anomalies in every aspect of the server’s configuration and performance and contents. I’m looking for characteristics that are much bigger or smaller than normal, or things I don’t usually see. Anything out of the ordinary or out of proportion. These things are often relatively easy to catch if you have enough experience. It’s a little hard to describe without making it sound like it’s some secret handshake, but it’s really not magical. You’re either familiar with what a server normally looks like, or you’re not. It’s just like a mechanic who listens to the car and says “I hear the valves knocking.”

In other cases there may be things that are NOT easy to catch, and may require a lot of experience and investigation. The work we do is often not simple at all. It’s impossible to predict. A customer may call and ask “how long to debug server lockups?” It could be ten minutes, or it could be a really major effort chasing down something like a bug in a hardware component, and you never know ahead of time.

In either case, the point to know here is that we can work with systems that are at any level of tuning, from completely untuned to a system that’s already had a lot of expert attention. I use “tuning” in a generic way here — we focus on far more than my.cnf files. In fact my.cnf files are usually one of the smallest levers we can pull for server optimization. Schema and query optimization can give much greater improvements, for example.

If you’re familiar with Malcolm Gladwell’s excellent book Blink, you know the term “thin-slicing.” That describes an experienced consultant’s approach to the problem very well.

The audit itself

OK, so here’s a very basic audit, whose scope I’ll keep within just a MySQL server. The first step is to log into the machine via SSH and open up a text editor (gedit, notepad, what have you.) I want to keep this open and paste everything I see into it. If there’s no record later that someone else can follow, I am not doing my job.

I would also note that the commands I’ll show here are the lowest common denominator. We often use tools to gather a lot of the data for us and speed the process, when possible, but the core commands I’ll show are what we can always fall back to if that’s not possible.

Gathering information about the server

I start off with these commands:

date
uname -a
cat /etc/*release

The first command is invaluable for later reference. The others will show exactly what kind of server I’m looking at, assuming I’m on a GNU/Linux box (I have to adapt to whatever environment I’m in — obviously a FreeBSD box is different). For example,

[percona@db1 ~]$ uname -a
Linux hostname.domain 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[percona@db1 ~]$ cat /etc/*release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)

[percona@db1 ~]$ uname -a

Linux hostname.domain 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

[percona@db1 ~]$ cat /etc/*release

Red Hat Enterprise Linux Server release 5.2 (Tikanga)

From this, I can see a couple of important things. One thing is that I’m looking at a 64-bit OS, which is important to know. I also know I’m on RHEL and I know the kernel — knowing this may tell us a lot of useful things about the system.

Next I continue looking around the server. Holding with the assumption of GNU/Linux, I keep working to get a picture of the OS and hardware.

[percona@db1 ~]$ tail -n 25 /proc/cpuinfo
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           L5335  @ 2.00GHz
stepping        : 11
cpu MHz         : 1995.022
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 cx16 xtpr lahf_lm
bogomips        : 3990.03
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

[percona@db1 ~]$ tail -n 25 /proc/cpuinfo

power management:

processor : 7

vendor_id : GenuineIntel

cpu family : 6

model : 15

model name : Intel(R) Xeon(R) CPU L5335 @ 2.00GHz

stepping : 11

cpu MHz : 1995.022

cache size : 4096 KB

physical id : 1

siblings : 4

core id : 3

cpu cores : 4

fpu : yes

fpu_exception : yes

cpuid level : 10

wp : yes

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 cx16 xtpr lahf_lm

bogomips : 3990.03

clflush size : 64

cache_alignment : 64

address sizes : 38 bits physical, 48 bits virtual

power management:

What I see here is that this is an 8-core machine, two quad-core processors at 2GHz. I also know some things about the CPU, such as the model number and the cache size, etc. What else is important is the ‘lm’ flag, which tells me this is a 64-bit CPU. From this and what I saw earlier I can see whether someone installed a 32-bit OS on 64-bit hardware, which is important if you want to allocate a lot of memory to mysqld.

Next I look at a couple of all-in-one commands. For example, ‘top -n 1’ and ‘uptime’. That shows me what processes are running, but it also shows me load average, memory sizes, and a bunch of other stuff. In this case I’m on a machine with 32GB of memory and there is very little load. I see that mysqld is using 10g of memory with 17g virtual size, and there’s practically nothing else running (this is a dedicated master server). The machine has a lot of swap, but basically none is used. This isn’t all that relevant, as you’ll see later, but in some cases it might be.

I continue my investigation into the machine itself with these commands:

mount
df -h
dmesg
pvs; vgs; lvs

These commands show me what filesystems are mounted, how much space they have on them, and what kind of I/O system is under the box in general. dmesg is useful for several reasons. For one thing, it has a lot of information on the box’s hardware, including RAID controllers, installed memory, and so on in the boot output. In addition, at the tail it has information on core dumps and so on. It’s possible that it has filled up with stuff and the boot output has been lost, but in many cases there is a lot of information here that’s hard to get elsewhere.

If I see an LSI MegaRAID card, which is pretty common, I can check whether MegaCli is installed, and use it to get information about the physical drives installed; then I can find out exactly what they are, both in terms of size and spindle speed. I need to know how many physical spindles are under the box, and how fast they are. I can also get information about the all-important battery-backed write cache, how it’s configured and what its status is. I can see the drive status, too. Sometimes a failed drive is present but no one has noticed it yet! (By the way, we can also help you install and configure monitoring and graphing systems, if you don’t have any. These will help catch such problems.) If MegaCLI isn’t installed there may be other ways to get this information, too.

The last three commands are for giving me information about LVM devices, volume groups, and logical volumes. At this point, unless I’ve seen something that I want to investigate further, I have a pretty good idea what hardware and operating system I’m working with.

System performance

Next I investigate what the system is doing, in terms of performance. There are two key commands for this on most operating systems (there are others, such as mpstat or dstat, that I may also elect to use in some cases):

vmstat 5 5
iostat -dx 5 5

If you don’t know how to read these, there’s a full explanation in our book, High Performance MySQL 2nd Edition (it’s not in the first edition). From this I get an idea where any possible problems might be: is the disk saturated? Is the machine swapping actively? (This is more important than whether it’s using swap.) There are a lot of combinations of possible things you can see here, so I will not try to explain it all. Basically you need to know everything there is to know about these two commands and their output.

Based on what I’ve seen so far, I may look at other things, such as ifconfig (which shows good stuff like dropped packets).

Starting to look at MySQL

Assuming I don’t see anything noteworthy here, I’ll move on to the MySQL server. I mean, let’s just assume the physical hardware and the machine setup is fine, and the client has asked me to figure out why the website is slow; assume also I’ve checked out Apache and network config, etc, and I don’t see anything wrong there. So at this point, it looks like MySQL is probably a good place to look for performance problems. Everything I’ve done up till now is just due diligence; I’ve gotten my bearings on the server as a whole, and now I’m ready to see what’s up with the database.

The first step is to find out what MySQL instances are on the machine. It doesn’t do to just assume there’s a single instance in /var/lib/mysql and it’s reading from /etc/my.cnf. I’ve seen a lot of cases where there’s a /var/lib/mysql and an /etc/my.cnf, and the server is installed in /customsoftware/mysql with a different my.cnf file. The way to find this out is to look at ps:

[percona@db1 ~]$ ps -eaf | grep mysqld
root      3137     1  0 Nov19 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysqld/mysqld.pid --log-error=/var/log/mysqld.log
mysql     3172  3137 78 Nov19 ?        1-11:17:30 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/db/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-external-locking --port=3306 --socket=/var/db/mysql/mysql.sock
consult  28852 26909  0 19:08 pts/2    00:00:00 grep mysqld

[percona@db1 ~]$ ps -eaf | grep mysqld

root 3137 1 0 Nov19 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysqld/mysqld.pid --log-error=/var/log/mysqld.log

mysql 3172 3137 78 Nov19 ? 1-11:17:30 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/db/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-external-locking --port=3306 --socket=/var/db/mysql/mysql.sock

consult 28852 26909 0 19:08 pts/2 00:00:00 grep mysqld

In this case, yes it’s all default. It’s listening on the usual port, etc etc. Nothing unusual. Next I check what the MySQL version is, and I check (with the ‘file’ command) whether /usr/libexec/mysqld is a 64-bit build, if I’m on a 64-bit OS.

Examining MySQL configuration

Then I’ll look at my.cnf quickly for anything weird:

[percona@db1 ~]$ grep -v '^#' /etc/my.cnf | grep .

1	[percona@db1 ~]$ grep -v '^#' /etc/my.cnf \| grep .

My goal here is to look at the my.cnf without seeing all the comments, which a) are often a small novel, b) often say what the sysadmin thinks s/he’s doing, which might not be what is really being done. I want to look at it without polluting my brain with that. I’ll just scan this for anything odd. In contrast to what you might think, spending a lot of time here isn’t usually a good idea, because what’s left out is often more important than what’s included. For example, the absence of skip-name-resolve is easy to miss, as is the absence of innodb_log_file_size.

I won’t give you a full rundown over every option in the file. But again, you learn quickly if something strange is here.

This whole time I’ve been copying the output of the commands and saving them in a text file. I’ll attach this text file to the issue in our CRM system so I (or someone else) can look at it later, see what I saw, and follow my line of reasoning about any suggestions I make.

The next thing to do is grab ‘mysqladmin variables’ and put it in the text file. I don’t spend a lot of time looking at this. For one thing, all the values are in big units, so it’s hard to read. If the InnoDB buffer pool is set to 22GB, it’s much easier to see 22GB than 23622320128, especially since that number is buried next to a lot of other long numbers. They make my eyes hurt, and there are better ways to do this (I will run mysqlreport to get friendly numbers I can read easily). But what I will do is scan the output for something strange that might be caused by a syntax error in the my.cnf file. For an example, take a look at this bug report (not a bug).

Looking at MySQL status

Now we’re starting to get into the really interesting bits. Alas, these are the bits where there is the most variability, so I’ll have to be even less detailed and a little vague. The next command is this:

mysqladmin ext -ri10

1	mysqladmin ext -ri10

I let that run for at least two iterations. The first iteration is the current values since the server was booted; the second and subsequent are incremental differences. I’ll usually capture two of these. It looks like this:

[percona@db1 ~]$ mysqladmin ext -ri10
+-----------------------------------+----------------------+
| Variable_name                     | Value                |
+-----------------------------------+----------------------+
| Aborted_clients                   | 205174               |
| Aborted_connects                  | 29                   |
| Binlog_cache_disk_use             | 0                    |
| Binlog_cache_use                  | 9630066              |
| Bytes_received                    | 38563413074          |
| Bytes_sent                        | 216162991863         |
| Com_admin_commands                | 255868807            |

... snip ...

| Uptime                            | 162626               |
| Uptime_since_flush_status         | 162626               |
+-----------------------------------+----------------------+

+-----------------------------------+----------------------+
| Variable_name                     | Value                |
+-----------------------------------+----------------------+
| Aborted_clients                   | 6                    |
| Aborted_connects                  | 0                    |
| Binlog_cache_disk_use             | 0                    |
| Binlog_cache_use                  | 468                  |
| Bytes_received                    | 1708212              |
| Bytes_sent                        | 7796961              |
| Com_admin_commands                | 11893                |
... snip ...

[percona@db1 ~]$ mysqladmin ext -ri10

+-----------------------------------+----------------------+

| Variable_name | Value |

+-----------------------------------+----------------------+

| Aborted_clients | 205174 |

| Aborted_connects | 29 |

| Binlog_cache_disk_use | 0 |

| Binlog_cache_use | 9630066 |

| Bytes_received | 38563413074 |

| Bytes_sent | 216162991863 |

| Com_admin_commands | 255868807 |

... snip ...

| Uptime | 162626 |

| Uptime_since_flush_status | 162626 |

+-----------------------------------+----------------------+

| Variable_name | Value |

+-----------------------------------+----------------------+

| Aborted_clients | 6 |

| Aborted_connects | 0 |

| Binlog_cache_disk_use | 0 |

| Binlog_cache_use | 468 |

| Bytes_received | 1708212 |

| Bytes_sent | 7796961 |

| Com_admin_commands | 11893 |

... snip ...

Then I format the second set of values up beside the first set for ease of reference (an easy task with an automated too, or a quick Vim command if I’m doing this the manual way.) The result looks like this:

+-----------------------------------+----------------------+----------------------+
| Variable_name                     | Value                | Value                |
+-----------------------------------+----------------------+----------------------+
| Aborted_clients                   | 205174               | 6                    |
| Aborted_connects                  | 29                   | 0                    |
| Binlog_cache_disk_use             | 0                    | 0                    |
| Binlog_cache_use                  | 9630066              | 468                  |
| Bytes_received                    | 38563413074          | 1708212              |
| Bytes_sent                        | 216162991863         | 7796961              |
| Com_admin_commands                | 255868807            | 11893                |
| Com_alter_db                      | 0                    | 0                    |
...snip

+-----------------------------------+----------------------+----------------------+

| Variable_name | Value | Value |

+-----------------------------------+----------------------+----------------------+

| Aborted_clients | 205174 | 6 |

| Aborted_connects | 29 | 0 |

| Binlog_cache_disk_use | 0 | 0 |

| Binlog_cache_use | 9630066 | 468 |

| Bytes_received | 38563413074 | 1708212 |

| Bytes_sent | 216162991863 | 7796961 |

| Com_admin_commands | 255868807 | 11893 |

| Com_alter_db | 0 | 0 |

...snip

This output is very useful when writing the report to the client. Now I go through it line by line looking for things that look wrong. When I find something, I compare it against the server’s variables or whatever other data I need, and write something instructional to the client about it.

Preparing to report findings

I should mention here a bit about how I write the report to the client. I copy and paste the interesting snippets from what I’ve been gathering, save it in a text file, and explain what I’m seeing to the client. Anything I see that looks wrong should be explained in detail, because my goal is to teach the client, not just to awe them with my knowledge and get “I could never understand, O Almighty Percona, here’s your money” in reply. If the client doesn’t understand what I’m doing, I’ve done something wrong. Clients need to understand so that they know they’re getting good service. A Percona consultant never says “just trust me” unless the client really insists “I don’t want to know.”

That’s also why I’ll comment on things that seem good. Clients need to know that I’m being thorough. Once upon a time, a Percona consultant was called back in to do some further work about 6 months after Peter helped a client. Now, if Peter worked with the client 6 months ago, you have big shoes to fill, and getting further gains will probably not be easy. So this consultant looked through things, spent about 2 hours, and came back to the client with “there’s not a lot of performance left to gain here. You can do X, Y and Z, and that’s about all I see. Beyond that, we need to talk about making your overall architecture more scalable.” And the client said “that’s all? Didn’t you do any work? Peter’s report 6 months ago was a small encyclopedia!” The mistake here was that the consultant didn’t show all the things he’d checked, which was extensive indeed. After a quick explanation the customer understood that the consultant had really done a lot of work, but it’s better to show the client the full extent of the investigations in the first place, and not just show “here’s the 3 bad things I found in 2 hours.” We also make sure that we refer back to the previous issue’s findings, so that we don’t duplicate efforts from the earlier optimization.

Back to the output of mysqladmin. When I send the report to the client, it typically looks something like this. First, I explain the meaning of the output, so it’s clear what I’m showing them. Then, I go through it line by line, like this:

+-----------------------------------+----------------------+----------------------+ | Variable_name | Value | Value | +-----------------------------------+----------------------+----------------------+ | Aborted_clients | 205174 | 6 | Here we can see that something is disconnecting ungracefully (without closing the connection properly). This is happening 6 times every ten seconds. We need to figure out what is happening because it can indicate something else is wrong. This is not a big deal for performance, it's just something to figure out. ... snip ... | Com_admin_commands | 255868807 | 11893 | Why is something running an admin command constantly? This is generally a Ping command, which is useless. Something is pinging the server about 1200 times per second. Often this is from some ORM system, such as Class::DBI, which does $dbh->ping before each and every query. Why is Ping useless? Well, it is intended to tell you whether the server is alive. The trouble is, if it succeeds, it tells you nothing. An instant later, when the ORM system actually runs the query, the server could have died. The ORM should just be running the query, and if the query fails, then you know from the error code that the server is dead. So a successful Ping is meaningless. On the other hand, pinging can be a serious performance problem. Not really for the server, since it's a cheap query. It adds hardly any load. But for the application, it's another network round trip. It is a "garbage" query that just adds latency and makes the overall time to complete an action longer. If your application runs very short queries, this can be a non-trivial part of the app's response time.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
+-----------------------------------+----------------------+----------------------+
| Variable_name                     | Value                | Value                |
+-----------------------------------+----------------------+----------------------+
| Aborted_clients                   | 205174               | 6                    |

Here we can see that something is disconnecting ungracefully
(without closing the connection properly).  This is happening 6
times every ten seconds.  We need to figure out what is happening
because it can indicate something else is wrong.  This is not a
big deal for performance, it's just something to figure out.

... snip ...

| Com_admin_commands                | 255868807            | 11893                |

Why is something running an admin command constantly?  This
is generally a Ping command, which is useless.  Something is
pinging the server about 1200 times per second.  Often this is
from some ORM system, such as Class::DBI, which does $dbh->ping
before each and every query.

Why is Ping useless?  Well, it is intended to tell you whether the
server is alive.  The trouble is, if it succeeds, it tells you nothing.
An instant later, when the ORM system actually runs the query,
the server could have died.  The ORM should just be running the
query, and if the query fails, then you know from the error code
that the server is dead.  So a successful Ping is meaningless.

On the other hand, pinging can be a serious performance problem.
Not really for the server, since it's a cheap query.  It adds hardly
any load.  But for the application, it's another network round trip.
It is a "garbage" query that just adds latency and makes the overall
time to complete an action longer.  If your application runs very short
queries, this can be a non-trivial part of the app's response time.

And so on, and so on. I’ll often refer to other things; for example, when I’m examining the status values for Created_tmp_tables and Created_disk_tmp_tables, I’ll paste in

mysql> show global variables like '%table_size%';
+---------------------+-----------+
| Variable_name       | Value     |
+---------------------+-----------+
| max_heap_table_size | 268435456 | 
| tmp_table_size      | 268435456 | 
+---------------------+-----------+

mysql> show global variables like '%table_size%';

+---------------------+-----------+

| Variable_name | Value |

+---------------------+-----------+

| max_heap_table_size | 268435456 |

| tmp_table_size | 268435456 |

+---------------------+-----------+

If these are not the same size, as often happens, I’ll explain that the minimum of the two is used to determine when an in-memory temporary table will be converted to an on-disk table, which is an expensive operation.

In this way, I work through the status and the variables, and explain to the client what I can deduce about the server from them.

I would point out that I don’t spend a lot of time on settings and status. It is not where the greatest gains are to be found. The real gold mine is yet to come. After you’ve done it a few times, you can go through the settings and status pretty quickly.

Analyzing MySQL’s workload and data

After I’m done with that, I’ll grab a few snapshots of SHOW FULL PROCESSLIST, and look for odd values in this. For example, if I see a lot of queries in odd statuses, like “statistics”, I know something weird is going on, and I look deeper. I also look at the queries themselves. If the client has long-running queries, it can be pretty easy to catch. Maybe they’re long-running because they’re in Locked status, which might be an indication that it’s time to convert MyISAM tables to InnoDB. (But then again, it might not be.) Or maybe the client is doing queries like “… WHERE client IN (SELECT id FROM clients)” which is a really bad query plan.

After this I’ll look at SHOW INNODB STATUS\G and see if there’s anything worth commenting on there. Depending on the workload, there might be a lot of substance here. There’s also a very thorough section on this in our book.

Next I look at the data in the server and see what I think about it. If the server is not heavily loaded, I may even do some INFORMATION_SCHEMA queries to help me find the biggest tables and so forth. If the server is heavily loaded or has a lot of data, touching the INFORMATION_SCHEMA can take a barely-running site and bring it down, so I do this with great care! If it’s not appropriate I may run ‘du’ on the data files, or simply skip this step. This information can help me determine what kinds of things I ought to look for later during query analysis, and also might show me things like huge tables that should be archived, tables that I should be aware of if I see them in queries, or what have you.

I will also look at the error log. The error log shows all sorts of interesting things. Basically, anything but silence in the error log is interesting and needs to be investigated. You might find that there are InnoDB tables whose .frm files exist but have been dropped from InnoDB, for example. (The INFORMATION_SCHEMA query will surely trigger errors about this.)

There are a bunch of other things I’ll look at, including checking for default users in the mysql.* tables, running mk-duplicate-key-checker to find redundant indexes, and so on. As with settings and status, these set the stage for bigger gains to come later, so I don’t spend all that much time here, although there is occasionally something really bad that can be fixed and bring big gains.

Schema, query, and index optimization

Analyzing and optimizing a server’s table and index structures, and the queries that run against them, is by far the most important thing to concentrate on. All the other work is a prerequisite to understanding the context within which these things operate. For example, if the workload is a star schema data warehouse, it is important to know a lot about the I/O subsystem. And by now I know that I should concentrate in certain areas — there may be hundreds of tables, but at this point I should have identified a handful of tables that really matter. That’s why I save this part of the analysis for last.

You cannot consider any one of these factors (schema, queries, indexing) in isolation, because they are tightly bound together, and tweaking one will often have effects on the others. Each decision is a consideration of the costs and benefits on the system as a whole.

The queries the system runs, and hence the slow query log, is one of the more fruitful ways to analyze the system. What I usually do is ask the client to enable it a day before the audit if possible, and set it to one second. That way there’s at least one “cycle” of queries in it. Then I run the following:

wget http://hackmysql.com/scripts/mysqlsla
perl mysqlsla -lt slow /path/to/slow.log

1 2	wget http://hackmysql.com/scripts/mysqlsla perl mysqlsla -lt slow /path/to/slow.log

By default, this outputs the top 10 most expensive queries, in total execution time. By the way, the slow query logs in the stock MySQL server are extremely limited, and for serious analysis it’s basically mandatory to use a server that has the Percona patches for microsecond logging and additional information in the slow query log output. The stock MySQL server’s limitation of one-second granularity makes it hide problem queries that are faster than one second (which in a high-performance system is virtually every query.) If it’s not possible to use a patched binary, we can use MySQL Proxy, packet sniffing, or other techniques to get more information than the slow log gives us; but by far the richest and most efficient source of analysis information is our custom server, which we have designed exactly for the purpose of giving us as much insight as possible into the server’s true workload. At the moment there is no other technique that approaches the amount of information a Percona build can give you.

Having found the desired information about the queries by any means necessary, I find out which are the most likely to give the greatest gains. I use a dual approach for this: I look for both the queries which cause the greatest load on the server (in aggregate) and which cause the greatest latency for the application. Again, it’s a process of finding outliers by some criteria. Both types of queries matter. Some may be extremely fast, but run very often; others may be run seldom but take a long time to finish. And it’s important to focus on ones that can give the greatest improvement for the customer. I can often tell at a glance whether a query is going to be possible to improve a lot, and a simple mental calculation can then tell me how much total gain I can get from it. Queries are not just “bad” or “good” — it’s a question of where they fall on the scale.

Now I analyze these one at a time. For each query, I run EXPLAIN if possible (rewriting non-SELECT queries if needed, and using careful judgment for queries with subqueries in the FROM clause, which will actually execute the inner subquery!), and examine the query plan. This is where you need to really know how to write queries and how EXPLAIN works. At Percona, we have peer training constantly on our internal IRC channel and mailing list. We share all sorts of dirty tricks and neat ideas with each other. Nothing is off the table. Queries can be rewritten. Indexes and data types and table structures can be changed. Queries can be broken into pieces, combined, or even eliminated entirely (we may suggest caching, or tell the customer to evaluate Sphinx, or something like that.)

In the common case, though, a query simply needs a rewrite or a new index or something. In this case we show the query and the information mysqlsla outputs about it (execution times, etc), the EXPLAIN plan, and the desired modifications to the query or the table. We explain how to interpret what we’re showing, and why the proposed modifications are better. If the client approves it, we may make a copy of the table, make the proposed modifications to it, and show the difference afterwards. Or we might have a test server to run on. It varies widely; some clients have a test server, some don’t; some have a formal QA process; some don’t. It is very customer-specific, and we work with what we have.

Sending the report to the client

After the slow query log analysis is done, I format all my analysis for sending, add in things I may have noticed along the way (comments on backups, for example) read over it once again and make sure I didn’t miss anything important or write something in a confusing way, and then send it to the client. Then I usually call to discuss, or just confirm that it was received and wait for a reply. In most cases the client has a lot of work to do — sometimes weeks of application changes. Afterwards there’s usually more to be gained by doing another pass through.

We may also want to have another call with the customer and talk about changes that can or should be made at a much higher level. If we see that the application’s overall architecture needs to be changed, that’s something to discuss. Customers usually want us to validate their application’s overall scalability and whether it’ll grow to meet their demands for some period of time. We also frequently bring the discussion to topics such as monitoring, alerting, backups, caching, reverse proxies, and high availability solutions, all of which we can frequently help customers set up much more efficiently and with better results, due to our knowledge of what works in the real world and where there’s quicksand to be avoided.

The changes Percona’s performance audits typically suggest can be really significant, and can completely change the performance profile and workload of a server, so it’s often worth another iteration, and sometimes even more after that. A really thorough audit can take up to ten hours! The initial round usually takes less than two, however. The ten-hour cases are usually for really complex applications, or apps that have a lot of stored procedures, which are more difficult to analyze because you have to pull out the statements and analyze them individually. The goal is always to develop X-Ray Vision and see through the symptoms to the underlying causes in every aspect of application performance.

Customer interaction

In all cases, the customer’s guidance is our first rule. The customer determines the depth and complexity of the analysis we do. A lot of times, customers will tell us up front to limit the work to some number of hours, such as 3 hours. Knowing this ahead of time gives us a framework within which we have to deliver the maximum value to the customer. If we have less time, we fix the “sore thumbs” and make notes of places where more effort is likely to pay off, and mention those to the customer. Or we economize by giving the customer easy choices so they can decide which things to have us fix, which to fix themselves, and which to just ignore.

The above process may look like a static recipe that you can just mechanically execute, and in some cases that’s truly what we do. However, we don’t just go through the motions. The “recipe” above is only a framework within which we work. Knowing the rules of a performance audit, and why they are the way they are, means you can know when and how to break them for the client’s benefit. It’s like knowing openings in chess — the great chess players study openings exhaustively, but they don’t robotically play them in tournaments.

For example, although it may sound as though I go work for a while and then emerge with a report, it’s not that way a lot of times. Flexibility and exercising judgment about when it’s appropriate to interact with the customer are a hallmark of our approach. If we need to be interactive and keep in touch with the client early and often, we do that. A lot of times I’ll start out with a quick look and have a second brief call with the customer, or send a quick email and ask for feedback on the direction I’m going; or just get on an instant messaging chat and tell the customer “here’s the top thing you can start working on while I keep investigating.” We make our process as interactive and flexible as the situation calls for. Early, frequent feedback lets us make small adjustments to the direction we’re going, communicate to the client how much time we’re spending, and shortcut effort if it’s not useful, e.g. skipping analysis for a query the client’s developers are in the process of removing from the application completely.

That’s all! I hope this has been an informative and fun trip through the world of a typical Percona performance audit, although as you’ve seen there really isn’t any such thing. Post your questions in the comments, and I’ll try to answer them as time permits.

17 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Shlomi Noach

15 years ago

Baron,

Thank you for this great post!
If I may add a suggestion: I keep a sample my.cnf file on my website, which contains the parameters I would like to see in any installation, and commented parameters which I may wish to ignore, or those I may wish to play with. This way I rarely miss an missing parameter or a misconfigured one.
Another review I do even if the customer does not bring this up is the backup policy the company has (if any) and see if the backups they do (if at all) match that policy. Quite a few times it turns out the company didn’t have the notion of a backup plan, or the difficulties in following it.

Again, thanks.
Shlomi

Rob Wultsch

15 years ago

Baron,
Holy crap. Nice.

– Rob

Wouter

15 years ago

Thanks for sharing!

fenway

15 years ago

A goldmine!

Michael

15 years ago

So….what do you do when this happens:

You’ve completed your analysis, handed it to the client, and they were expecting that you’d provide a panacea at the server-level, solving all their “code problems”? We regularly have clients that feel that they shouldn’t need to change their application. What do you do at that point? Do you actually re-write the queries for them in your service? It was difficult to convince them to implement caching (which saved massive load on the database servers, even if some changes are still needed). Some areas need to be re-worked completely.

Michael

Peter Zaitsev

Admin

15 years ago

Michael,

Sure we can re-write queries, re-structure the schema and all this kind of stuff. Our goal is to give the client a “menu” so they can pick what they want because business reasoning can be outside of our immediate reach. For example we may say something like we tuned your MySQL settings this should get your capacity up another 20%. You can double capacity by by adding the proper indexes and get another 50% by upgrading to faster system. But if you need more gains you need to restructuring and schema or architecture changes.

It is more complicated than that really but it gives some good insight.

Michael

15 years ago

Peter:

Our problem seems to be that we’ve done all the tuning (perhaps even provided a bandaid as such), and so no application changes are made, the only “breathing room” came from the tuning. Then, the wall of scability is hit again, loads go up, queries lag, and the question comes up “Why can’t you do what you did before?”. Obviously it’s because it’s been done before.

When the client has the fastest (excepting ssd and dedicated appliances) hardware you can buy, an already tuned infrastructure (my.cnf, hardware, kernel patches, the works), and still doesn’t “want” to change the application or doesn’t believe it’s the problem, that’s where I start to pull out my hair.

Michael

Vadim

15 years ago

Michael,

There is example (quite simplified):

if application is running SELECT COUNT(*) from InnoDB_table to scan 100.000.000 rows, there is no good way to optimize it if you do not want to change application.

Gil

15 years ago

Having just completed a Percona audit, I can definitely attest to the thoroughness and volume of documentation provided. It is similar to this except much more detailed and customized to the way your system works.

Knowing how Percona identified the problems in my specific system was absolutely invaluable, and saved me more time than I can imagine. Thanks guys.

Sheeri

15 years ago

Baron,

“The stock MySQL server’s limitation of one-second granularity makes it hide problem queries that are faster than one second (which in a high-performance system is virtually every query.)”

Actually, it’s worse than that. The slow query log logs queries that take >long_query_time. Not >=long_query_time. So actually you don’t get any logging for queries that are faster than 2 seconds. (If you have data to counteract this, let me know, but I haven’t seen any…)

Michael,

The point is when *tuning*, you let the client know exactly what’s happening. If the client thinks you’re fixing something when what you’re doing is buying time, you need to explain to them that you’re buying time — not fixing a problem.

It’s certainly a common problem, as often there are tons of change management procedures regarding the code (and tests and qa to pass) and not so many regarding changing a my.cnf parameter.

For the toughest clients, use a test database, fill with sample data, make your changes (code and perhaps schema as well) and show them the difference.

You can always make an analogy, as Peter did — though you can explain that if you have lots of memory, fast cpu and lots of fast disk, and have to process 500 records, it’s faster to only have to process 50 records than 500, no matter how good your stuff is.

Shlomi Noach

15 years ago

Michael,
The situation you’ve described is so true. I think it is important to clear this up right from the start: “The analysis may find that changes should be made to the application. There’s only so much you can gain by tuning the server or upgrading the hardware.”
I happened on such a case where I was hired to “help out the DBA fix performance issues”, when eventually I went up to the developers and have shown them how their queries were making for low performance (one of the things they did was, on an InnoDB table, never to update row data, but instead DELETE+INSERT. The performance boost once converted to UPDATE could be seen by the naked eye – I was lucky – and this allowed me to further pursue their application code)

Baron Schwartz

Author

15 years ago

Someone asked why grep my.cnf instead of using my_print_defaults, which will do things like pull values from multiple files (if any) and eliminate duplicate entries. There’s a two part answer.

1) we want to see mistakes in my.cnf, rather than have them hidden by my_print_defaults. A duplicate entry is pretty common and should be fixed.

2) it’s true that there may be several .cnf files that mysqld reads from, so a single one isn’t the whole truth. But it’s also the case that variables can be set online, so the only source of truth is SHOW VARIABLES.

The other thing I didn’t mention is that some of the tools we use will do .cnf file analysis to look for duplicate/conflicting values, and values that are different inside mysql than they are in the config files.

sefa

15 years ago

kÄ±mÄ± sevÄ±yorum

hironics

14 years ago

I think you have no way to get know all of the configuration variables by just looking into config file.

You may probably look the whole config variables with ‘show global variables’.

Pera, Thailand

12 years ago

I would like to get the book for mysql performance, community version 5.x.

Can anyone direct me to the point?

I’m from Thailand.

Baron Schwartz

Author

12 years ago

You can get it from http://www.highperfmysql.com/.

Pera, Thailand

12 years ago

Thanks Baron,

I was wondering is this book specific for any version of mysql? or it is an overall performance tuning of mysql.

I prefer to get the community version 5.x if you could get me the right one, pls direct me?

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

How Percona does a MySQL Performance Audit

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis Replication and Auto-Failover With Sentinel Service

Valkey/Redis: Sets and Sorted Sets

Hello World… Hello Valkey! Let’s Get Started!

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

How Percona does a MySQL Performance Audit

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis Replication and Auto-Failover With Sentinel Service

Valkey/Redis: Sets and Sorted Sets

Hello World… Hello Valkey! Let’s Get Started!

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation