I wrote one week ago about how to find duplicate indexes. This time we’ll learn how to find unused indexes to continue improving our schema and the overall performance. There are different possibilites and we’ll explore the two most common here. User Statistics from Percona Server and pt-index-usage.
User Statistics is an improvement on Percona Server that adds some tables to Information Schema with useful information to understand the server activity and identify the source of the load. Those tables store statistics from our users, tables, indexes and so on, information very helpful to have a clear picture of our users are doing in the database.
To accomplish our task of finding unused indexes we’re going to use INFORMATION_SCHEMA.INDEX_STATISTICS but first we should enable User Statistics:
mysql> SET GLOBAL userstat=on;
After all this hard work we have the statistics plugin gathering information. I usually enable it during load peaks to get realistic information about what is happening in the server. To generate some load on the database I’m going to use the benchmark software tpcc-mysql.
After some runs we can start to analyze the information on INDEX_STATISTICS table.
<code>mysql> SELECT * FROM INDEX_STATISTICS;
| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | ROWS_READ |
| tpcc | orders | PRIMARY | 59472 |
| tpcc | customer | PRIMARY | 128652 |
| tpcc | stock | PRIMARY | 1128357 |
| tpcc | district | PRIMARY | 116107 |
| tpcc | item | PRIMARY | 282870 |
| tpcc | warehouse | PRIMARY | 84959 |
| tpcc | orders | idx_orders | 2832 |
| tpcc | customer | idx_customer | 175240 |
| tpcc | new_orders | PRIMARY | 56202 |
| tpcc | order_line | PRIMARY | 1164682 |
These are the index used during the benchmark. Let’s compare it with the data on INNODB_INDEX_STATS to get the list of unused indexes:
<code>mysql> SELECT INNODB_INDEX_STATS.TABLE_NAME, INNODB_INDEX_STATS.INDEX_NAME from INNODB_INDEX_STATS WHERE CONCAT(INNODB_INDEX_STATS.index_name, INNODB_INDEX_STATS.table_name)NOT IN(SELECT CONCAT(index_statistics.index_name, index_statistics.table_name) FROM index_statistics) AND INNODB_INDEX_STATS.TABLE_SCHEMA='tpcc' AND INNODB_INDEX_STATS.INDEX_NAME<>'GEN_CLUST_INDEX';
| TABLE_NAME | INDEX_NAME |
| stock | fkey_stock_2 |
| order_line | fkey_order_line_2 |
| history | fkey_history_1 |
| history | fkey_history_2 |
Here we see one of the limitations on User Statistics, it can only track direct lookups on index keys but not lookups done during constraint checks. That’s the reason FK are marked as unused so that means all the other indexes has been used during the benchmark.
The second limitation is with partitioned tables, you can’t get index statistics usage from those tables.
The previous example only works with Percona Server so with stock MySQL we should try another approach.
Percona Toolkit has a tool to check the usage of our index from the slow query log. The concept is easy to understand. pt-index-usage reads the slow query log and execute every query with EXPLAIN to ask MySQL which indexes would it use. At the end of the process you’ll have an output with a list of not used indexes. This is an example:
<code>root@debian:/var/log# pt-index-usage slow.log
slow.log: 11% 03:58 remain
slow.log: 21% 03:43 remain
slow.log: 32% 03:09 remain
ALTER TABLE `tpcc`.`order_line` DROP KEY `fkey_order_line_2`; -- type:non-unique
ALTER TABLE `tpcc`.`orders` DROP KEY `idx_orders`; -- type:non-unique
ALTER TABLE `tpcc`.`stock` DROP KEY `fkey_stock_2`; -- type:non-unique</code>
If you compare these results with the previous example, there are some differences and these are some of the reasons:
- pt-index-usage uses EXPLAIN and User Statistics number of rows read for every index. So we’re comparing an estimation with real data.
- EXPLAIN, as I said before, is an estimation from the optimizer and sometimes the real query can use a different execution plan.
- pt-index-usage tries to convert non-SELECT queries to SELECT queries and is not always possible to get a perfect conversion so there can be some differences.
Be cautious with Unique indexes, because maybe they are not used for index lookups but your application may need them to avoid duplicates on the column.
Take in account that depending on the size of the log the server would need lot of process time and cpu power so I recommend you to not use in the production server. Run it on a slave or a testing environment with an similar data estructure and size.
We’ve learnt how to find unused indexes using different approaches and we’ve also learnt that we should’t follow tool’s suggestion literally. These tools are here to help us and after that is our job to check and benchmark the changes on our testing environment to measure the impact on performance. You won’t need to drop all such indexes but it gives you an idea of which of them you should evaluate. Good tools to help us on that evaluation are pt-log-player and pt-upgrade.
It’s worth to mention the importance of taking a long enough period of logs or statistics to analyze all relevant workload. In some cases there are processes that run once a day or a week and we should have that information in our analysis report.