July 25, 2014

Quickly finding unused indexes (and estimating their size)

I had a customer recently who needed to reduce their database size on disk quickly without a lot of messy schema redesign and application recoding.  They didn’t want to drop any actual data, and their index usage was fairly high, so we decided to look for unused indexes that could be removed.

Collecting data

It’s quite easy to collect statistics about index usage in Percona Server (and others) using the User Statistics patch.  By enabling ‘userstat_running’, we start to get information in the INFORMATION_SCHEMA.INDEX_STATISTICS table.  This data collection adds some overhead to your running server, but it’s important to leave it running long enough to gather a dataset that represents as much of your workload as possible.
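For example, turning collection on and spot-checking it might look like this (depending on your Percona Server version the variable is spelled ‘userstat’ or ‘userstat_running’; check SHOW VARIABLES first):

```sql
-- Enable User Statistics collection (Percona Server);
-- the variable name varies by server version.
SET GLOBAL userstat = ON;

-- Spot-check that index usage rows are accumulating:
SELECT * FROM INFORMATION_SCHEMA.INDEX_STATISTICS LIMIT 5;
```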

If you miss collecting index stats while some occasional queries run, you risk dropping indexes that are seldom used but still important to the health of your system.  This may or may not affect you, but I’d highly recommend you manually review the list of unused indexes this process generates before you simply drop them.

Depending on your sensitivity to production load, you may therefore want to run this for several days, or just sample different short windows during your normal production peak.  In either case, you may want to compare or repeat this index analysis, so let’s set up a separate schema for it.  It’s important that this analysis runs against a server with your full production dataset loaded, but it could be a master, or just a slave somewhere (just be careful not to break replication!).

If the index statistics are being collected on the same server, then we can simply snapshot them into our schema with one command:
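Something along these lines, for example (the snapshot table name is illustrative):

```sql
-- A dedicated working schema keeps the analysis repeatable.
CREATE DATABASE IF NOT EXISTS index_analysis;

-- One-shot snapshot of the current usage statistics:
CREATE TABLE index_analysis.index_statistics AS
  SELECT * FROM INFORMATION_SCHEMA.INDEX_STATISTICS;
```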

If the stats come from some other server, then you may need to dump and load a copy of that table into your working index_analysis schema.

Merging stats from several servers

In the case of this client, they had a master and several slaves taking read traffic.  The index workload on these two sets of servers was different, and I wanted to be sure I considered the index statistics from both sources.  Be sure you include relevant index stats from all aspects of your application; otherwise it’s garbage in, garbage out, and you risk dropping necessary indexes.

To accomplish merging multiple result sets, I gathered statistics from both their master and slave and loaded them into my schema as separate tables.  Then I simply created a view of a UNION DISTINCT of those two tables:
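A sketch of that view, assuming the two snapshots were loaded as index_statistics_master and index_statistics_slave (illustrative table names):

```sql
CREATE OR REPLACE VIEW index_analysis.all_known_index_usage AS
  SELECT table_schema, table_name, index_name
    FROM index_analysis.index_statistics_master
  UNION DISTINCT
  SELECT table_schema, table_name, index_name
    FROM index_analysis.index_statistics_slave;
```

Because we select only the identity columns, UNION DISTINCT collapses the two result sets into one row per used index.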

Now I can query the ‘all_known_index_usage’ and see the union of both of those datasets.  This, of course, can be extended to all the datasets you want.

Interpreting the data

So, this is all well and good, but how do we then easily determine a list of indexes that are not being used?  Well, for this we need to go back to the INFORMATION_SCHEMA to get a list of ALL the indexes on my system (or at least the schemas I want to consider dropping indexes in).  Let’s keep using views so this dynamically updates as our schema changes over time:
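A sketch of such a view, built on INFORMATION_SCHEMA.STATISTICS (which has one row per indexed column, hence the DISTINCT; NON_UNIQUE is kept for filtering later):

```sql
CREATE OR REPLACE VIEW index_analysis.all_indexes AS
  SELECT DISTINCT table_schema, table_name, index_name, non_unique
    FROM INFORMATION_SCHEMA.STATISTICS
   WHERE table_schema NOT IN
         ('mysql', 'information_schema', 'performance_schema');
```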

Now I can query this view to see my indexes:
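For instance:

```sql
SELECT table_schema, table_name, index_name
  FROM index_analysis.all_indexes
 LIMIT 10;
```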

Now I need a way to find the set of indexes that appear in all_indexes but not in our used-index data.  These indexes (if our original index statistics are good) are candidates to be dropped:
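A sketch using the classic LEFT JOIN … IS NULL anti-join, against the views named above:

```sql
CREATE OR REPLACE VIEW index_analysis.droppable_indexes AS
  SELECT a.table_schema, a.table_name, a.index_name
    FROM index_analysis.all_indexes a
    LEFT JOIN index_analysis.all_known_index_usage u
      ON a.table_schema = u.table_schema
     AND a.table_name   = u.table_name
     AND a.index_name   = u.index_name
   WHERE u.index_name IS NULL        -- never seen in the usage stats
     AND a.index_name != 'PRIMARY'   -- keep primary keys
     AND a.non_unique = 1;           -- keep UNIQUE constraints
```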

Note that we also want to avoid dropping PRIMARY and UNIQUE indexes since those tend to enforce important application data constraints, so we added some additional criteria to the end of our SELECT.

I can now select my droppable (unused) indexes from this view:
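For example:

```sql
SELECT * FROM index_analysis.droppable_indexes;
```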

From here I can use some clever SQL to generate the precise ALTER TABLE statements to drop these indexes, an exercise left to the reader. :)

Estimating the size of these indexes

But, what if we want to see if it’s worth doing first?  Do these indexes actually represent a significant enough amount of disk space for it to be worth our while?

We need some more information to answer this question, but fortunately in Percona Server we have it in the INFORMATION_SCHEMA.INNODB_INDEX_STATS table and its ‘index_total_pages’ column.  An InnoDB page is (usually) 16KB, so some simple math tells us how much disk space an index occupies.

Let’s go update our all_indexes view to include this information:
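A sketch of the updated view; the size_mb column name is illustrative, and the INNODB_INDEX_STATS column names follow Percona Server 5.5/5.6 (verify against your release):

```sql
CREATE OR REPLACE VIEW index_analysis.all_indexes AS
  SELECT DISTINCT s.table_schema, s.table_name, s.index_name,
         s.non_unique,
         -- index_total_pages * 16KB per page, reported in MB
         ROUND(i.index_total_pages * 16 / 1024, 1) AS size_mb
    FROM INFORMATION_SCHEMA.STATISTICS s
    LEFT JOIN INFORMATION_SCHEMA.INNODB_INDEX_STATS i
      ON s.table_schema = i.table_schema
     AND s.table_name   = i.table_name
     AND s.index_name   = i.index_name
   WHERE s.table_schema NOT IN
         ('mysql', 'information_schema', 'performance_schema');
```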

Now we can see index sizing information in the all_indexes view:
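For instance:

```sql
SELECT table_schema, table_name, index_name, size_mb
  FROM index_analysis.all_indexes
 ORDER BY size_mb DESC
 LIMIT 10;
```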

Now we just need to update our droppable_indexes view to use that information:
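A sketch; the anti-join is unchanged, we just carry the sizing column through:

```sql
CREATE OR REPLACE VIEW index_analysis.droppable_indexes AS
  SELECT a.table_schema, a.table_name, a.index_name, a.size_mb
    FROM index_analysis.all_indexes a
    LEFT JOIN index_analysis.all_known_index_usage u
      ON a.table_schema = u.table_schema
     AND a.table_name   = u.table_name
     AND a.index_name   = u.index_name
   WHERE u.index_name IS NULL
     AND a.index_name != 'PRIMARY'
     AND a.non_unique = 1;
```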

Now we can easily see how big each index is if we dropped it (not big in this case with test data):
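For example, per-index sizes plus the total we’d reclaim:

```sql
SELECT table_schema, table_name, index_name, size_mb
  FROM index_analysis.droppable_indexes
 ORDER BY size_mb DESC;

SELECT ROUND(SUM(size_mb), 1) AS total_reclaimable_mb
  FROM index_analysis.droppable_indexes;
```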


Recovering filesystem space

Now, astute InnoDB experts will realize that this isn’t the end of the story when it comes to reclaiming disk space.  You may have dropped the indexes, but the tablespace files on disk are still their old size.  If you use innodb_file_per_table, then you can rebuild the tablespace for a table by simply doing:
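For example (my_table is a placeholder):

```sql
-- A no-op rebuild: copies the table into a fresh tablespace,
-- releasing the space the dropped indexes occupied.
ALTER TABLE my_table ENGINE=InnoDB;
```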

However, this blocks writes, and on a large table it can take quite some time.  All the normal tricks and tips about performing a long, blocking schema change without affecting your production environment apply here, and they are out of scope for this blog post.

Happy hunting for those unused indexes!

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. of Computer Science from Rochester Institute of Technology.

Comments

  1. You should have used my patent-pending Intensive Table Optimization technique from my PLUK tutorial:
    http://www.jonathanlevin.co.uk/2012/12/tutorial-slides.html

  2. Keep in mind with MySQL you almost always need to have space in order to free more space. Even with innodb_file_per_table, you may run into trouble if you have gotten that low on disk space. If you have a 5GB table and you are attempting to remove 500MB in indexes, you’ll still need at least an additional 4.5GB, since MySQL will retain the old copy of the table while building the new.

    On a side note: I wish tutorials would make more use of NOT EXISTS vs LEFT JOIN NULL. I believe it is slightly more readable in that one immediately understands we are looking for a record in one table that does not exist in the other. Also, MySQL is likely to optimize that “left join null” using a “not exists” behind the scenes anyway.
