August 30, 2014

InnoDB Full-text Search in MySQL 5.6 (part 1)

I’ve never been a very big fan of MyISAM; I would argue that in most situations, any possible advantages to using MyISAM are far outweighed by the potential disadvantages and the strengths of InnoDB. However, up until MySQL 5.6, MyISAM was the only storage engine with support for full-text search (FTS). And I’ve encountered many customers for whom the prudent move would be a migration to InnoDB, but due to their use of MyISAM FTS, the idea of a complete or partial migration was, for one reason or another, an impractical solution. So, when FTS for InnoDB was first announced, I thought this might end up being the magic bullet that would help these sorts of customers realize all of the benefits that have been engineered into InnoDB over the past few years and still keep their FTS capability without having to make any significant code changes.

Unfortunately, I think that hope may be premature. While it is true that InnoDB full-text search in MySQL 5.6 (part 1) in MySQL 5.6 is syntactically identical to MyISAM full-text search, in the sense that the SQL required to run a MATCH .. AGAINST is the same (modulo any new features introduced with InnoDB full-text search), that’s largely where the similarities end.

NOTE 1: I was originally planning to cover everything I wanted to discuss with respect to InnoDB full-text search in this one post, but I think there’s a lot of interesting stuff here, so I will break it into three pieces instead. The first part (this one) will be a very quick overview of FTS in InnoDB and some observations that I’ve made while getting it configured. The second part will compare query results between MyISAM FTS and InnoDB FTS over the same data sets, and then finally in the third installment, we’ll look at query performance. In the event that a new release of 5.6 appears between now and part 3, I’ll also revisit some of the “quirks” from parts 1 and 2 to see if the behavior has changed.

NOTE 2: For purposes of this discussion, I used two separate data sets. The first one is a set of about 8K very SEO-stuffed web pages, wherein the document title is the page title, and the document body is the HTML-tag-stripped body of the page. We’ll call this data set “SEO” – it’s about 20MB of actual data. The other one is a set of about 790K directory records, each one containing the name, address, and some other public-records-type information about each person. We’ll call this data set “DIR”, and it’s about 155MB of actual data.

NOTE 3: Also, keep in mind that I used the community editions of MySQL 5.5.30 and MySQL 5.6.10 with no tuning whatsoever (with one exception that I’ll explain below) – the idea behind this investigation wasn’t to find out how to make InnoDB FTS blazingly-fast, but simply to get a sense of how it works compared to traditional MyISAM FTS. We’ll get to performance in the third installment. For now, the important number here is to note that the InnoDB buffer pool for my 5.6 instance is 128MB – smaller than the size of my DIR data.

So, with all of that out of the way, let’s get to it.

Here is our basic table definition for the DIR dataset. The table for the SEO dataset looks identical, except that we replace “full_name” with a VARCHAR(255) “title” and “details” with a TEXT “body”.

We also have identical tables created in 5.5.30 where, of course, the only difference is that the engine is MyISAM rather than InnoDB. Loading the data was done via a simple Perl script, inserting one row at a time with AutoCommit enabled – remember, the focus here isn’t on performance just yet.

Having loaded the data, the first thing we notice is that there are a lot of “new” InnoDB tablespace files in our database directory:

By comparison, this is what we see on the MyISAM side:

I also observed that if I simply load the data into an InnoDB table that has never had a full-text index on it, and then I create one, the following warning is generated:

This doesn’t make a lot of sense to me. Why does InnoDB need to add a hidden column (similar to GEN_CLUST_INDEX when you don’t define a PRIMARY KEY, I assume) when I already have an INT UNSIGNED PK that should suffice as any sort of document ID ? As it turns out, if you create a column called FTS_DOC_ID which is a BIGINT UNSIGNED NOT NULL with a unique index on it, your table doesn’t need to be rebuilt. The most important item to note here – FTS_DOC_ID must be spelled and specified exactly that way – IN ALL CAPS. If you try “fts_doc_id” or any other mixture of lettercase, you’ll get an error:

Various points in the documentation mention the notion of a “document ID” that “might reflect the value of an ID column that you defined for the underlying table, or it can be a sequence value generated by InnoDB when the table does not contain a suitable column,” but there are only a handful of references to FTS_DOC_ID found when searching the MySQL 5.6 manual, and the only page which appears to suggest how using an explicitly-defined column is done is this one, which discusses improving bulk insert performance. At the very bottom, the page claims that you can speed up bulk loading into an InnoDB FT index by declaring a column called FTS_DOC_ID at table creation time of type BIGINT UNSIGNED NOT NULL with a unique index on it, loading your data, and then creating the FT index after the data is loaded.

One obvious problem wih those instructions is that if you define a column and a unique key as they suggest, your data won’t load due to unique key constraint violations unless you also do something to provide some sort of sequence value for that column, whether as an auto_increment value or via some other means, but the bit that troubles me further is the introduction of a column-level case-sensitivity requirement that only seems to actually matter at table creation time. Once I’ve got a table with an explicit FTS_DOC_ID column, however, MySQL apparently has no problem with either of the following statements:

Philosophically, I find that kind of behavior unsettling. I don’t like case-sensitivity in my table or column names to begin with (I may be one of the few people that likes lower_case_table_names = 1 in /etc/my.cnf), but I think it’s even worse that the case-sensitivity only matters some of the time. That strikes me as a good recipe for DBA frustration.

Now, let’s return to all of those FTS_*.ibd files. What, exactly, are they? In short, the _CONFIG.ibd file contains configuration info about the FT index (the same sort of configuration data that can be queried out of the I_S.INNODB_FT_CONFIG table, as I’ll discuss momentarily), and the others contain document IDs of new rows that are added to or removed from the table and which need to be merged back into or removed from the main index. I’m not entirely sure about the _STOPWORDS.ibd file just yet; I thought it might be somehow related to a custom stopwords table, but that doesn’t seem to be the case (or at least not in the way that I had thought), so I will need to look through the source to figure out what’s going on there.

In any case, for each new FULLTEXT KEY that you create, you’ll get a corresponding FTS_*_DOC_ID.ibd file (but none of the others), and if you drop a FT index, the its corresponding FTS_*_DOC_ID.ibd file will also be removed. HOWEVER, even if you drop all of the FT indexes for a given table, you’re still left with all of the other FTS_*.ibd files, and it appears that the only way to get rid of them is to actually rebuild the table.

Also, while we’re on the subject of adding and dropping FT indexes, it’s entirely possible to DROP multiple FT indexes with InnoDB in the same ALTER TABLE statement, but it’s not possible to CREATE more than one at a time. If you try it, this is what happens:

That’s an odd limitation. Do it as two separate ALTER statements, and it appears to work fine.

But here’s where things start to get even weirder. According to the documentation, if we specify the name of a table that has a FT index for the global variable innodb_ft_aux_table, we should be able to get some statistics about the FT indexes on that table by querying the various I_S.INNODB_FT_* tables. In particular, the INNODB_FT_CONFIG table is supposed to “display metadata about the FULLTEXT index and associated processing for an InnoDB table.” The documentation also tells us that we can keep our FT indexes up to date by setting innodb_optimize_fulltext_only = 1 and then running OPTIMIZE TABLE, and that we might have to run OPTIMIZE TABLE multiple times if we’ve had a lot of changes to the table.

This all sounds pretty good, in theory, but at least some part of it doesn’t seem to work. First, let’s check the stats immediately after setting these variables, and then let’s push some additional data into the table, run an OPTIMIZE or two, delete some data, and see what happens:

I ran OPTIMIZE TABLE several more times, and each execution took between 6 and 8 seconds, but the output of the query against I_S.innodb_ft_config never changed, so it seems like at least some of the diagnostic output isn’t working quite right. Intuitively, I would expect some changes in total_word_count, or optimize_(start|end)_time, and the like. However, if I check some of the other I_S tables, I do find that the number of rows in I_S.innodb_ft_index_table is changing, so it’s pretty clear that I do actually have a FT index available.

At the start of this post, I mentioned that I did make one configuration change to the default InnoDB settings for MySQL 5.6, and that was to change innodb_ft_min_token_size from the default of 3 to a value of 4 so that it would be identical to the MyISAM default. After all, the (naive?) hope here is that when I run an FTS query against both MyISAM and InnoDB I will get back the same results; if this equivalence doesn’t hold, then as a consultant, I might have a hard time recommending this feature to my customers, and as an end user, I might have a hard time using the feature at all, because it could completely alter the behavior of my application unless I also make a nontrivial number of code changes.

In part 2 of this series, we’ll reload and reset our SEO and DIR data sets back to their initial states, run some queries, and compare the output. Stay tuned, it gets rather curious.

About Ernie Souhrada

Ernie joined Percona in April 2012 as a Senior Consultant. In his previous lives, he has been everything from a Perl/Java developer to a Linux sysadmin, a MySQL DBA to a Cisco network engineer, and a security auditor to an IT engineering manager, many of these things all at the same time. When not working on MySQL, he might be found on the ski slope, at a psytrance festival, or at the nearest sushi bar.

Comments

  1. Jimmy Yang says:

    Ernie, I think there could be more documentation on the Doc ID concept. The reason that this is vital important is from the design. When deleting a document, InnoDB Fulltext index does not delete its tokenized words from corresponding auxiliary tables (to avoid too many tables updates), instead, it records the deleted Doc ID. In such way, it can’t use the primary key, because the same primary key can be deleted and re-inserted, but with different documents

    And regarding why the config.ibd etc. are still there when you drop the indexes is due to the same reason, that is, to avoid a table reorg to remove the hidden column. And we have the basic auxilliary tables remain to maintain this Doc ID, as well as avoid another table rebuild if user creates fulltext index again.

    In terms of information_schema.innodb_ft_config table, it is really an internal config table, not for the purpose of statistics (that needs to be corrected in documentation).

Speak Your Mind

*