InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

This is part 2 in a 3 part series on InnoDB MySQL full-text searches. In part 1, we took a quick look at some initial configuration of InnoDB full-text search and discovered a little bit of quirky behavior; here, we are going to run some queries and compare the result sets. Our hope is that the one of two things will happen; either the results returned from a MyISAM FTS query will be exactly identical to the same query when performed against InnoDB data, OR that the results returned by InnoDB FTS will somehow be “better” (as much as it’s actually possible to do this in a single blog post) than what MyISAM gives us.

Recall that we have two different sets of data, one which is the text of roughly 8000 SEO-stuffed webpage bodies (we call that one SEO) and the other, which we call DIR, that is roughly 800,000 directory records with name, address, and the like. We are using MySQL 5.5.30 and MySQL 5.6.10 with no configuration tuning other than to set innodb_ft_min_token_size to 4 (rather than the default of 3) so that it matches MyISAM’s default ft_min_word_length.

First, MyISAM, with MySQL 5.5, on the SEO data set:

The same query, run against InnoDB on 5.6.10:

Wow. I’m not sure if I should be concerned so much that the *scores* are different, but the *matches* are COMPLETELY DIFFERENT between 5.5/MyISAM and 5.6/InnoDB. Now, we know that MyISAM FTS does have the caveat with natural language searches whereby a word that’s present in 50% or more of the rows is treated as a stopword, so does that account for our problem? It might, because the word ‘arizona’ appears in over 6900 of the 7150 rows, and the word ‘records’ appears in 7082 of them. So let’s try something else that’s less likely to have that issue. The word “corporation” appears in 143 of the documents; the word “forms” appears in 439 of them, and the word “commission” appears in 130. There might be some overlap here, but even if there isn’t, 143+130+439 < 0.5 * 7150, so none of these should be treated as stopwords in MyISAM.

With 5.5:

With 5.6:

OK, now I’m starting to get a little worried. The docs do tell us that the default stopword list is substantially different between InnoDB and MyISAM, and as it turns out, there are only 36 stopwords in the default InnoDB list, but there are 543 stopwords in the default MyISAM list. What happens if we take the MyISAM stopwords, insert them into a table, and configure that table to be our stopword list for InnoDB?

This is the table that we’re trying to emulate:

The docs tell us that we need to create an *InnoDB* table with a single VARCHAR column named “value”. OK, sounds easy enough:

But, when we try to use this table, here’s what comes back:

And here’s what appeared in the server’s error log:

Uh… Does this mean that my next blog post should be entitled, “When is a VARCHAR Not Really a VARCHAR?” Thinking that maybe this was a case of GEN_CLUST_INDEX causing me issues, I tried adding a second column to the table which was an integer PK, and in another attempt, I tried just making the “value” column the PK, but neither of those worked. Also, trying to set innodb_ft_user_stopword_table produced the same error. I submitted a bug report (68450), and as you can see from the bug discussion, it turns out that this table is character-set-sensitive. If you’re going to use your own stopword table for InnoDB FTS, at least for the moment, this table must use the latin1 character set.

As far as I can tell, this little gotcha doesn’t appear to be mentioned anywhere in the MySQL 5.6 documentation; every place where it talks about creating one of these stopword tables, it simply mentions the table engine and the column name/type, so I’m not sure if this is an intentional restriction that just needs to be better documented or if it’s a limitation with the InnoDB FTS feature that will be removed in a later version.

Now that we’ve sorted this out, let’s drop and rebuild our FT index on the InnoDB table and try the above queries one more time. We already know what the MyISAM results are going to be; do our InnoDB results change? No, they are exactly the same, although the scores did change slightly.

What about a Boolean mode query? The docs tell us that if we use Boolean mode, and we put a “+” in front of our search term, then that term *must* appear in the search results. But does it?

With 5.5:

And with 5.6:

There’s only one row in the table that actually matches all three search terms, and in this case, both MyISAM and InnoDB FTS performed identically and found it. I’m not really concerned about the fact that the next four rows are completely different; the scores are zero, which means “no match.” This looks promising, so let’s explore further. Again, from the docs, if we run a boolean mode query where some of the search terms are prefixed with “+” and others have no prefix, results that have the unprefixed term will be ranked higher than those with out it. So, for example, if we change the above query to be “+james +peterson arizona” then we might expect to get back multiple matches containing the words “James” and “Peterson”, and we should expect the record from Arizona to be towards the top of the list.

With 5.5, this is exactly what happens:

With 5.6, we’re not so fortunate.

These results aren’t even close to identical. As it turns out, the full record for “Alphonso Lee Peterson Sr” does also contain the name “James”, and the word “Peterson” is listed in there several times, but “Arizona” is not present at all, whereas the record for “James R Peterson” had all three search terms and no significant repetition of any of them. Using this particular query, “James R Peterson” is #15 on the list.

At this point, it’s pretty obvious that the way MyISAM is calculating the scores is much different from the way that InnoDB is doing it, and given what I said earlier about the repetition of words in the “Alphonso Lee Peterson Sr” record versus the “James R Peterson” one, we might argue that InnoDB is actually behaving more correctly than MyISAM. Imagine if we were searching through newspaper articles or something of that sort, and we were looking for queries containing the word “MySQL” – odds are that an article which has 10 instances of “MySQL” might be more desirable to us than an article which only has it mentioned once. So if I look at these results from that persepctive, I can understand the how and the why of it. My concern is that there are likely going to be people who believe that switching to InnoDB FTS is simply a matter of upgrading to 5.6 and running ALTER TABLE foo ENGINE=InnoDB. In theory, yes. In practice, not even close.

I tried one more Boolean search, this time looking for someone’s full name, which I knew to be present only once in the database, and I used double quotes to group the search terms as a single phrase:

With 5.5:

Looks good, there he is. Now what happens under 5.6?

In the immortal words of Homer J. Simpson, “D’OH!!” Why is MyISAM able to locate this record but InnoDB cannot find it at all? I suspect that the “B” is causing problems for InnoDB, because it’s only a single character and we’ve set innodb_ft_min_token_size to 4. Thus, when InnoDB is parsing the data and building the word list, it’s completely ignoring Mr. Smith’s middle initial. To test this hypothesis, I reset innodb_ft_min_token_size to 1, dropped/rebuilt the InnoDB index, and tried again.

Aha, there he is! Based on that result, I would caution anyone designing an application that’s going to use InnoDB FTS to be quite mindful of the types of queries that you’re expecting your users to run. In particular, if you expect or are going to allow users to enter search phrases that include initials, numbers, or any other string of length less than 3 (the default), I think you’re going to be forced to set innodb_ft_min_token_size to 1. Otherwise you’ll run into the same problem as our Mr. Smith here. [This does raise the question of why it works with MyISAM when ft_min_word_length defaults to 4, but that is a topic for another day.]

Note that there may or may not be some performance implications to cranking this value all the way down; that is something I have not yet tested but will be reporting on in part 3 of this series. I can, however, confirm that the on-disk size of my DIR dataset is exactly the same with a setting of 1 versus a setting of 4. This may or may not be the case with multi-byte character sets or with ideographic languages such as Japanese, although Japanese poses its own unique problems for FTS of any kind due to its lack of traditional word boundaries.

In any event, it appears that we’ve solved the Boolean-mode search issue, but we still have vastly different results with the natural-language-mode search. For those of you who are expecting and need to have the MyISAM-style search results, there is at least one potential escape hatch from this rabbit hole. When defining a FULLTEXT KEY, you can use the “WITH PARSER” modifier to specify the name of a UDF which references your own custom-written fulltext parser plugin. Thus I am thinking that it may be possible to take the MyISAM full-text parser code, convert it to a plugin, and use it for InnoDB FT indexes where you’re expecting MyISAM-style results. Verifying or refuting this conjecture is left as an exercise for the reader. 🙂

A quick recap of what we’ve learned so far:

  • There are parts of InnoDB FTS configuration which are both letter-case and character-set sensitive. Watch out!
  • When you add your first FULLTEXT KEY to an InnoDB table, be prepared for a table rebuild.
  • Calculation of match score is completely different between the two engines; sometimes this leads to wildly different results.
  • If you were hoping to use InnoDB FTS as a simple drop-in replacement for your current MyISAM FTS, the results may surprise you.

That last point bears particular emphasis, as it also illustrates an important best practice even if FTS isn’t involved. Always test how your application behaves as a result of a major MySQL version upgrade before rolling it into production! Percona has tools (pt-upgrade and Percona Playback) that can help you with this. These tools are free and open source, please use them. You, and your users, will be happy that you did.

In the third and final installment of this series, we will take a look at performance. How does the speed of InnoDB FTS compare to its MyISAM counterpart, and what kinds of tricks can we use to make it go faster? Stay tuned to find out in the next blog on MySQL full-text searches!

13 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Ronald Bradford

I was rather shocked when reading your post. It was most informative to highlight configuration, documentation and especially feature differences. I am surprised these types of feature comparisons were not clearly performed by the InnoDB team, and any explanations not provided to give cause for differences.

My initial impression of InnoDB FTS feature was a good chance to evaluate and recommend to clients to move to an InnoDB environment, however now I am not really prepared to do so.

Christian Riesen

In your boolean test, you match against the details field as well, but don’t select it. So the hits you get for words not present in full_name probably come from there. Still a surprise the results are so different though.

Ernie Souhrada

@Christian–

No, sorry, that’s not correct. The only reason that I’m not including the details field is for display purposes, to keep the formatting of this post a little bit more readable. When my boolean-mode searches came back with different results, the first thing I did was go and manually examine the details field to see if there was some hidden text that I wasn’t seeing. Besides, it doesn’t matter if I don’t include a field in the SELECT that was also in the MATCH…AGAINST. For instance, if we use the same example that I have above, I can leave out any of the text fields entirely and still get the same results:

mysql 5.5: select id, MATCH(full_name, details) AGAINST (‘+james +peterson arizona’ IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 2 DESC LIMIT 5;
+——–+——————–+
| id | score |
+——–+——————–+
| 74717 | 1.3333333730697632 |
| 14159 | 1 |
| 44427 | 1 |
| 53501 | 1 |
| 126373 | 1 |
+——–+——————–+

mysql5.6: select id, MATCH(full_name, details) AGAINST (‘+james +peterson arizona’ IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 2 DESC LIMIT 5;
+——–+——————–+
| id | score |
+——–+——————–+
| 741223 | 59.972904205322266 |
| 925811 | 59.37348937988281 |
| 269589 | 44.82982635498047 |
| 471273 | 39.58232879638672 |
| 925781 | 38.03296661376953 |
+——–+——————–+

Jaimie Sirovich

I looked into this just now based on your post. My understanding is that InnoDB doesn’t use document length in its calculations for natural mode (in other words it’s not BM25). It still uses some sort of tf*idf magic, but some of the data points BM25 requires are difficult to maintain and it doesn’t look like they did it.

Boolean-mode appears to bake in tf*idf (unlike MyISAM), which for sure will make a difference — mostly a good thing and more like what Lucene does. After all, why should boolean operators preclude ranking? Easy — they shouldn’t. And in Lucene/Solr they do not. That IS an improvement IMO.

Regarding your position that the results should be the same or better … well, I agree. They ARE for boolean mode. The only way they could be better still given the handicap of not using document length and some other normalization stuff that BM25 does is if it somehow incorporates in proximity.

Since it doesn’t do that automatically, you’d have to incorporate that in the query by using the NEAR operator. If you did that, you might get some interesting results, but with _identical_ natural-mode queries I would only expect it to be worse.

My 2c.

Jaimie Sirovich

@Ronald @Ernie,

I took a quick look at the code and it appears that I am correct. This is interesting news, but not _all_ bad. I would expect the boolean ranking to be better (i.e. not random) and the natural ranking to be worse.

Daniël van Eeden

“When defining a FULLTEXT KEY, you can use the “WITH PARSER” modifier to specify the name of a UDF which references your own custom-written fulltext parser plugin. Thus I am thinking that it may be possible to take the MyISAM full-text parser code, convert it to a plugin, and use it for InnoDB FT indexes where you’re expecting MyISAM-style results. Verifying or refuting this conjecture is left as an exercise for the reader.”

This won’t work as WITH PARSER currently doesn’t work with InnoDB.
http://bugs.mysql.com/bug.php?id=62004

Jaimie Sirovich

The parser has little to do with the problem here. It’s the ranker.

Jimmy

A smaller default stopword size is really an advantage rather than disavantage. That shows InnoDB is more scalable as it can index large number of tokens. We also remove the restriction of not allowing words to be tokenzed when they appear more than 50% of the time.

Also the “B” issue you mentioned should be fixed in 5.6.12.

Jaimie Sirovich

Jimmy,

Right now InnoDB is also not factoring in document length, though I do see the beginnings of that in the source. That’s a minor speed hit if you’re looking at document level factors, so the comparison is as yet a bit unfair. MyISAM fulltext uses some sort of pivoted normalization. What I also really don’t understand is why the proximity operator is working on byte distance. The only thing I can think of is Asian languages, but even then I believe there could be another operator that works on word distance, such as “@@.”

I’m actually not sure InnoDB FTS is faster right now. One of the more curious architectural decisions I see is that whereas usually a full text index keeps the hitlists separate from the inverted index, InnoDB’s decision is that you must scan the hit list. If I’m not mistaken, that means a lot of work when you’re not searching or ranking based on proximity.

Am I missing something? I saw the research paper that the code is based on. I can’t say I agree with everything in that research paper 🙂 I might have some suggestions.

Nick Williams

@Jimmy,

You said, “Also the ‘B’ issue you mentioned should be fixed in 5.6.12.” I looked over the release notes for 5.6.11, 5.6.12, and 5.6.13, but I didn’t see anything mentioned about it. Can you confirm that it was fixed? Do you have a bug number associated with it?

Nick Williams

Nevermind, Jimmy. I found part 3 where Ernie confirms that the bug is fixed.

Indeprensus

All databases should start searching using queries now …. MongoDB is on the way to do it possibly soon.
We should also consider NLP and Machine Learning Concepts integrated in DB operations
This one is cool … may databases should search like this … 😀
https://sourceforge.net/projects/falcontextsearch/

Hanxue

Thanks, this is insightful and surprising.

Is there a properly written up Part 3, besides the webinar and slides?