How to find wrong indexing with glance view
Quite common beginners mistake is not to understand how indexing works and so index all columns used in the queries…. separately. So you end up with table which has say 20 indexes but all single column ones. This can be spotted with a glance view. If you have queries with multiple column restrictions in WHERE clause you most likely will need to have multiple column indexes for optimal performance. But wait. Do not go ahead and index all combinations. This would likely be poor choice too
24 Comments











del.icio.us
digg
I’d rather see 20 separate indexes than an index which covers 15 columns….
(I’ve seen both)
Comment :: August 21, 2008 @ 10:04 pm
Hehe. Indeed. Putting all columns you want to lookup on in the index (in any order) is another mistake you can see. Though I think it is less frequent one.
Comment :: August 21, 2008 @ 10:09 pm
Yeah i want to see 20 separate indexes too than 15 columns.
Comment :: August 22, 2008 @ 12:21 am
how do you think it would be better approach with a query with so many restrictions in WHERE clause like this filter ? http://browseusers.myspace.com/Browse/browse.aspx?MyToken=633549779643841239 (in case that all data is on the same table)
i guess a multiple index in the main columns would be the best choice
Comment :: August 22, 2008 @ 5:06 am
There are few particular cases when multi-column indexes are used by mysql. Most of the time mysql will scan the first column on the index (check EXPLAIN to verify). At least with 2 separate indexes there is a chance that mysql will use both indexes together.
Comment :: August 22, 2008 @ 5:06 am
I suspect no one “tests” scenarios based on the comments left here. There are situations which can warrant any number of scenarios. I can tell you that I’ve seen a use for creating one index for 5 columns and it made a huge difference. In my case, queries went from taking over an hour to 5 minutes by adding 3 columns to my original one index.
The last column I was searching for in my WHERE clause was a string. Here’s how I took that 5 minutes and brought it down to 2 seconds: I made a new column that was an unsigned INT where I stored the string’s crc32 value. Now instead of doing WHERE string = ‘value’ I do WHERE string_crc32 = ‘4958494034′ for example. Since I now added string_crc32 to my index (now with 5 columns) my queries execute in seconds rather than minutes.
You can index a VARCHAR but from my tests it just doesn’t match the performance of indexing an INT, unsigned or not.
You can take the advice or not but even advice is just that. Real way to know if a change makes a difference in your situation is to create a development environment with the same database on a slower machine/server, setup log-queries and use EXPLAIN on queries that are slow. Once you get your creative juices flowing with knowledge like above you can take just about any slow query and make it super fast. This is how I did it, I tried every combination testing each with EXPLAIN. Eventually you’ll find the right combination. For me, the primary key had to be first, the other three columns in my index had to be in the middle and the string_crc was last.
I was somewhat cleaver with my queries till I attended a presentation by Jay Pipes on Join Fu. (http://www.jpipes.com/) You think this discussion is enlightening, then you definitely want to check out his Join Fu slides on his blog.
Comment :: August 22, 2008 @ 6:19 am
As I said: few particular cases. Of course everybody do (or should) test the queries. Of course integer indexes with high cardinality will beat string indexes – unless you need to search by partial matches like ‘abc%’. But we’re talking here about mysql behaviour. I suspect that in your case the cardinality of the index was pretty bad and that improved by adding extra 3 columns. I bet that you can drop a few columns from that index without losing any performance. And there’s the problem of mysql choosing single column indexes where multi-column indexes are present, and wasting memory and disk space.
I still think that single column indexes make more sense in mysql most of the time. Index intersection should be more efficient than multi-column index for searching. Ordering is a diffent issue.
Comment :: August 22, 2008 @ 7:28 am
Like I said in my case I tried multiple combinations, including removing some of the columns that you recommend that I remove and that had performance consequences. I join 2 tables to together plus my WHERE is a bit more complicated referencing account and product id’s, which I didn’t include in the example above. Again, try every combination in a test environment. Another tip, turn off caching on your testing environment MySQL that way the queries are re-executed. Cached results can make you think you got an optimized query when you really don’t.
Thinking in terms of the process of finding the records, if your first indexed column cuts the results from billions to millions, then your second, third and fourth column cut that down to about a million records, you definitely want to keep those 3 additional columns in your index. When that 4th extra column moves from millions to 20 records, it makes all the difference. Then your database doesn’t have to walk over a million records to find those 20 records.
One other thought, if your not working with millions of records, you may have better results just by tweaking the MySQL config settings rather than spending time optimizing your queries, especially if your trying to improve queries that are less than 5-10 seconds. For readability purposes, you can just index one column without putting much thought into how the data will be retrieved by the database. This allows you to focus on the end goal, the web site or application’s results. Again, depends on your situation.
I will agree with tgabi, most of the time multi-column indexes are not necessary. If you’re not having issues then you most likely have no need to read this post either. I’m just trying to help others by making aware reasons for particular solutions (multi-column index in this case) in MySQL.
Comment :: August 22, 2008 @ 9:16 am
tgabi,
MySQL Will only use only first column of multiple column index in case you have Range on the first column like col1 BETWEEN 5 and 10 – it is exactly skill of creating multi column indexes to ensure they are created smart way, so they can be used, as well as queries are written appropriate way (for example IN vs BETWEEN makes hell a lot of difference)
Saying MySQL can’t use multiple column indexes is simular to saying MySQL always does full table scan – if you write write you queries suboptimal way or if you have poor choice of indexes (like having index on gender for general population) that is exactly what will happen.
Comment :: August 22, 2008 @ 11:47 am
Angelo,
You pick one particular query and for that query you indeed do not need the index because the extra column does not really improve selectivity. In this case you’re you exactly should not extend the index unless you’re planning to use index covered queries for this query.
Comment :: August 22, 2008 @ 11:51 am
tgabi,
Multiple column indexes will be a lot faster than multiple single column indexes in case there is good cardinality on both columns and full index can be used. Relaying on index-merge is actually very typical design mistake.
Comment :: August 22, 2008 @ 11:59 am
The point:
My point is not what single column indexes are bad and you always need multiple column indexes but the fact it is quite typical there are some queries which will benefit from multiple column indexes. If I see all tables have only single column indexes this typically means someone does not know about multiple column indexes and how MySQL can use them
Comment :: August 22, 2008 @ 12:01 pm
Yeah, all the queries in testing should have SQL_NO_CACHE.
The table with the index you’re talking about is the pivot of the join ? I agree that indexes are good to cut trough useless data, still, so many values for each key of the first index makes me think the data is not normalized. Not that’s a bad thing, sometimes de-normalizing the data improves performance, just not common.
I’m glad we agree on the per-case usage of multi-column indexes – contradicts Peter’s view BTW.
Billions of records ? Wow. And 5 columns indexes ? Double wow.
Comment :: August 22, 2008 @ 12:04 pm
Peter,
You said:”in case there is good cardinality on both columns and full index can be used”. I’d call that a “best case scenario”. In my line of work I’m searching trough many low cardinality columns in the same query. The query is very specific, yet each column has only few values.Sometimes I wish I can find a multi-column index that works. I have few multi-column that do work, particular when used for ordering (ORDER BY a,b LIMIT 5).
Index merge is the only thing that works – when it works. I do agree that nobody should rely on index merge for now.
Comment :: August 22, 2008 @ 2:50 pm
I think adding to the column to the index (for selection not sorting) has the same question as adding the index for selection all together – if cardinality is poor it is better not to. Though increasing index length usually has less impact than going from full table scan to index ref scan (causing a lot of IOs)
In case of low selectivity indexes you too can often build multiple column indexes, assuming you can use index prefix which is selective enough. Though there are times when bitmap indexes would be much better.
Comment :: August 22, 2008 @ 3:03 pm
tgabi,
I do not see what you really disagree with. It is the question of multiple column indexing speeding up your queries. If they do not. There is surely no point using them. Billions of records ? So what. If you your main queries can use all key parts in 5 column index it will be faster than 5 single column indexes. And also much easier to maintain.
Really it may look as columns are scary but 5 ints for example is just 20 bytes so the index would be same as index on char(20) in terms of side.
Comment :: August 22, 2008 @ 3:19 pm
I think there are many cases where multiple column indexes don’t work. That’s all. By looking at the index structure alone you cannot draw any conclusion about the design quality. You need to look at the data too.
It seems that we disagree on index merge versus multiple column index. I’m saying that I see index merge working when multiple column indexes don’t. You’re saying it cannot be used – you don’t say why, but I don’t disagree. One thing is clear: single column indexes are not enough, no doubt about that.
Regarding billions of records: it’s just awesome. It’s not every day you hear something like that. Well maybe you do, I don’t. I’m few months away from “billion records club” BTW.
Comment :: August 22, 2008 @ 4:11 pm
tgabi,
I have cases when billions records are loaded daily so it no more excites me.
Speaking about index merge – indeed there are cases when index merge works when multiple column index does not however if they both work multiple column index are most likely to be the winner.
There are always exceptions. Even running with MySQL defaults can be good fit for particular case. I’m just saying this is very typical red flag thing. Sure you need to look in details to check if it is really the case but it is called red flag exactly because it shows there are the problem in say 90% of all cases. If you made deliberate choice going with multiple column indexes good for you – if you can make this deliberate choice you already stand away from the crowd
Comment :: August 22, 2008 @ 4:20 pm
Nacho,
For cases like you mentioned – Dating site search I would rather go with Sphinx all together. It handles cases like this beautifully (even if you do not have any full text search) though if you want to stick to MySQL you’ve got to look at the cardinality for different columns and query pattern and handle it appropriately. For example gender may not be selective but specified in 99% of the cases which makes it good first column in the index.
Comment :: August 22, 2008 @ 4:24 pm
Peter,
you said “assuming you can use index prefix which is selective enough”. What do you mean by that ?
In specific case – derived from Nacho’s example: say you have millions of users with gender, race and hair color. Many males, many blacks, many blondes, very few all three. How do you see this search in Mysql ?
Comment :: August 22, 2008 @ 5:29 pm
tgabi,
I mean if you have Index (A,B,C,D,E) and MySQL can use (A,B,C) index only for this query. Considering it is selective enough (enough meaning query response time is acceptable) you’re good. Though as your data size growths the acceptable selectivity tends to grow as well. If you have 100000 profiles selectivity of 1/10 is often good enough as scanning through 10000 rows in memory is often acceptable. With 50.000.000 profiles this would not be the case.
With Nachos case if you have “WHERE GENDER=’M’ AND RACE=’B’ AND HAIR=’Blond’ and it is very selective index on (GENDER,RACE,HAIR) would work quite well to resolve it. Though you really need to be looking at your best queries while picking solutions and so your worst cardinaliy.
Comment :: August 22, 2008 @ 7:17 pm
More about Nachos’ example
1) Looks like all fields are linked one-to-one to users so can be all stored in users table(may be except looking_for, but let’s keep it simple), some thing like (gender, age, available, looking_for, location, has_foto)
2) booleans – gender, has_foto; enums – available, looking_for, location; integer – age
3) All queries fall into X IN (…) AND Y IN (…) AND …
Some conditions fall into X = const – ref access that may be preferable over range. Condition on Age may have up to 60 values so it can be converted to IN instead of using BETWEEN [1].
As all boolean operators are ‘AND’, mysql can use either indexes built on multiple columns or intersection of multiple accesses by indexes. When we talk about index_merge_intersection, we should understand that mysql finds one set using an index, another set and then finds records which exist in both sets [2]. Consider this “age = 18 and available = ‘divorced’”. I’ve used Russia as target for investigation and here is results: both conditions – 58, divorced – 623, 18th – 3000 (looks like it’s limit and most probably it’s much more than that). It’s pretty obvious that using index (age, available) is much better all the time as it’s always more selective than any of its columns alone. If you don’t have such compound index then mysql most probably use index on “available” column only using ref access and after that filter by age. This is optimal choice. So in any case index intersection sucks. When it doesn’t suck? In something like “age = 68 AND available = ‘engaged’”, but even in this case compound index quickly returns a few rows.
Enough about intersections. About booleans. has_foto? How many of your users have a foto? 100%? 10%? Most probably it’s more than 70% or even close to 90%. So it’s very not selective column. Should we avoid using it in indexes? Yes, if it’s an index based on this column only. No, if it’s part of compound index. Why? Because more than 70% percents of people will mark “show people with fotos only” and mysql can use index to filter records before fetching any rows. How much it will give? Not many, worth investigation. May be impact will be negative especially when almost all people have fotos.
Gender? you can use it everywhere as peter suggested as 90% of searches will use one either ‘male’ or ‘female’ and it will give you twice smaller result set. However, you most probably should cheat a little and give mysql a hint by using “gender IN (’male’,'female’)” when people looking for any gender.
Other. Location is mandatory with quite good selectivity comparing with other enumns. Age as well is mandatory, but it can select few rows or many that depends.
Final set?
1) (City, Age, Gender) or (City, Gender, Age) – you should benchmark and check EXPLAINS which one is better for all combinations of (male, female, both) and different ranges on Age with high selectivity and low. At the end you can end up with one index, both or may be consider using (City, Age). Whatever you select to use can be used as prefix for other indexes cuz conditions on these fields are mandatory.
2) You’ve selected mandatory prefix, for example (City, Gender, Age) is the most effective. Drop it
and instead create (City, Gender, Age, LookingFor) and (City, Gender, Age, Available). So mysql can choose between two which is more selective.
3) You can even use (City, Gender, Age, LookingFor, Available) – needs investigation.
Aaaa… so long indexes. Let’s consider you’re superb and have 1 billion of users. If you’re smart enough and read mysql performance related articles then you’ll store all those fields as enums or tiny int with dictionary (in your up or table). 4(id)+1+1+1+1+1+1 = 10 bytes per record => >=10GB for data. Each index will be between 5GB and size of data (for innodb). May be I’m wrong here, but I do think that size of index(City, Gender, Age) will be smaller than size of three (City), (Gender) and (Age). Considering quite low standalone selectivity of those columns, it’s up to you decide what to do.
1. http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/#comment-1695
2. http://dev.mysql.com/doc/refman/5.1/en/index-merge-intersection.html
Comment :: August 23, 2008 @ 2:21 am
Ruslan,
Thanks for good explanations. The City in the start would work only in case you always specify it… As you can’t use IN trick for cities as there are too many of them.
The problem with IN in such cases is generally related to sorting – results typically need to be sorted some way and as soon as you use IN on any key parts sorting can’t be done using index and filesort may be too slow for many matches. Though in reality problem can get even more complicated because sorting can be done by the function, for example by distance or using some sort of scoring system.
Comment :: August 23, 2008 @ 10:07 am
I was describing myspace’s search referenced by Nachos where country (sorry, it’s not city) is mandatory. Anyway, condition on age is mandatory and it’s a range, so sorting sure will be problem. Especially when result set is big. At least it will be filesort and not necessary temporary table as we select from one table.
To solve sorting issue on big sets, people can use pre-explain. Then decide either use hint to force ordered access or access based on conditions. To avoid explains local in memory statistics can be used to calculate conditions selectivity.
Anyway, question was about set of indexes with compound vs. single context. Sorting falls out of scope
Comment :: August 23, 2008 @ 11:12 am