As we know, the built-in full text search is currently limited to the MyISAM storage engine and has a few other limitations.

Today the Sphinx Search plugin for MySQL was released, which provides a fast and easy to use full text search solution for all storage engines. This version also adds a lot of other new features, including boolean search and distributed searching.

A while ago I wrote about the Sphinx Search Engine, comparing it to the built-in FullText search and Mnogosearch. I guess I should repeat those tests soon, adding Lucene to the list for a complete picture.

And if you do not feel like patching MySQL or using the MySQL 5.1 beta to run Sphinx as a MySQL storage engine, you can still use it the old fashioned way, as a separate server.

36 Comments
Apachez

I have spent the past few months experimenting with various fulltext engines, since MySQL fulltext has its limitations, especially when used in boolean mode on large result sets (for example a boolean search on common words that then orders the hits by date and finally picks out, say, the top 200 best matches).

Sphinx is really promising, especially now that it will be available as a direct plugin for MySQL (I assume this means that the Sphinx index can be created directly from within MySQL, just as with MySQL's own fulltext engine?).

However, I found some issues with the standalone test versions, which ended up with me creating my own fulltext search engine. It is written in Perl and uses MySQL as a backend, mostly to see if I could do any better for my case (a fulltext search engine for a forum).

What I have done is mix a reversed index with the vector space model. Instead of having a vector per message and a relevance (vector angle) attached to each vector (which means that each time a new unique word turns up you need to recalculate all vector angles), I have one vector per unique word found among the messages. This means that both indexing and searching are fast (compared to vector space, where indexing would be slow), and the storage needed is smaller than with a reversed index.

Each vector is also split into 100 kbit chunks in order to keep the sizes down to a minimum, as well as to allow this solution to scale out to more servers if needed in the future (Google style :P). Since the vectors are sparse, I use LZF as a fast and efficient way to compress them.
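To make the storage side a bit more concrete, here is a rough sketch of the idea (illustrative Python with zlib standing in for LZF; the real engine is Perl on top of MySQL, and the class and method names here are made up):

# Per-word chunked bitvector storage; bit N of a chunk set = that doc contains the word.
import zlib
from collections import defaultdict

CHUNK_BITS = 100_000                      # 100 kbit chunks, 12.5 KB uncompressed each

class WordVectorIndex:
    def __init__(self):
        # word -> {chunk_id -> compressed bitvector}; all-zero chunks are simply absent
        self.chunks = defaultdict(dict)

    def add(self, word, doc_id):
        chunk_id, bit = divmod(doc_id, CHUNK_BITS)
        blob = self.chunks[word].get(chunk_id)
        raw = bytearray(zlib.decompress(blob)) if blob else bytearray(CHUNK_BITS // 8)
        raw[bit // 8] |= 1 << (bit % 8)   # mark that this document contains the word
        self.chunks[word][chunk_id] = zlib.compress(bytes(raw))

    def vector(self, word, chunk_id):
        """Uncompressed bitvector of one word for one chunk (all zeros if absent)."""
        blob = self.chunks[word].get(chunk_id)
        return bytearray(zlib.decompress(blob)) if blob else bytearray(CHUNK_BITS // 8)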

Some real-world stats for the above engine:

1 million indexed posts (raw size about 750 MB) from http://www.tbg.nu ended up in:
* 601,288 unique search words.
* The word table is 24.05 MB in size, along with a 5.88 MB index.
* 1.39 million search vectors.
* The vector table is 251.17 MB, along with an 18.44 MB index.

The average size of the LZF compressed vectors is 169 bytes (compared to 12.5 kbyte (100 kbit) if not compressed); the largest compressed vector is 12,501 bytes and the smallest is 2 bytes.

More is explained at http://www.tbg.nu/tbgsearch (unfortunately it is currently only available in Swedish, but I will create an English translation shortly (ehm, at least within months :P)).

Does anyone know if the above model has an official name other than "mix between reversed index and vector space model"? 😛

Also, does the implementation of Sphinx as a plugin for MySQL 5.1 mean that MySQL's own fulltext engine will/might be replaced by Sphinx shortly?

Andrew Aksyonoff

Your wording is pretty confusing. It's unclear from your explanation whether your "vectors" store word positions (within the documents) or not. Also, if your "vectors" store (sorted) document ID lists for each word, that's basically just what inverted files ("reversed indexes"?) are, so you are not going to get any significant size or performance benefit over inverted files.

As for the in-document word positions: if you're not storing those, you are not going to do much better than simple boolean or classic BM25 in terms of relevance. Now, if that is acceptable, both indexing and searching could be made much, much faster. And the index size should actually be some 7-10% of the initial text size, not 30%; so it seems that LZF is not doing a great job compressing document ID lists.

Andrew Aksyonoff

D'oh! I just looked at your link and it's now a little clearer, even though it's in Swedish. 🙂

"Reversed indexes" are in fact usually called "inverted files" (IF); however, I was mistaken about your "vectors": those are not document lists, but rather a kind of "signature file" (SF).

So what you're actually doing is adding some additional hashing to "plain" signature files. I can't point you to the exact papers, but I'm pretty sure this has already been done. There's this paper comparing SF and IF: http://citeseer.ist.psu.edu/88449.html; I think you could follow its bibliography and find out the academic name of your hashing scheme. 🙂

The conclusion however is the same. You’re not storing word positions; you’re not storing any per-document word frequencies either; so you’re limiting yourself to boolean, not even BM25. And if you only need booleans, IF should yield somewhat better index size and search speed.

Apachez

Ahh, I forgot to say what the vectors store… The position within each vector corresponds to the docid of the indexed post (or actually position * chunkid).

A reversed index is often described as using unsigned ints, where each unique word gets its own wordid and you then have a lookup table that stores the relation between wordid and docid. The problem with this method, compared to storing it as vectors, is that each combination needs at least 8 bytes (4 + 4 if using unsigned ints) of storage (to keep the arithmetic simple I leave out the MySQL overhead needed to store these bytes). If you have an average of 100 words per docid, this means that a reversed index will use at least 800 bytes per indexed post.

By using bitvectors this is reduced to 100 bits (12.5 bytes), which can also be compressed with LZF since the vectors are sparse (actually you can use any compression algorithm you like, but I have found that LZF gives me compression without affecting speed as much as gzip or bzip2 would).

That means that even without compression the ratio is approximately 64:1 between a regular reversed index and the vector-based storage in terms of disk space needed.

Not only is the space needed different between a regular reversed index and vector-based storage, but also how the search itself is performed. With a regular lookup table you need to focus on getting your SQL queries right, and even when they are correct it will still take some time for huge search results (which means you need to cheat one way or another to reduce the data processed when the search is performed). With vectors you can use simple bit arithmetic: if you want an AND search you just do resultvector = vector1 & vector2, if you want an OR search you just do resultvector = vector1 | vector2, and so on.
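As an illustration (simplified, self-contained Python rather than the actual Perl code; the bitvectors here are plain integers and the names are invented):

# One bitvector per word for a single chunk; bit N set = docid N contains the word.
CHUNK_BITS = 100_000

def boolean_search(vectors, mode="and"):
    """Combine per-word bitvectors with plain bit arithmetic."""
    result = vectors[0]
    for v in vectors[1:]:
        result = result & v if mode == "and" else result | v
    return result

def matching_docids(result, chunk_id=0):
    """Turn the set bits of the result vector back into document IDs."""
    bit = 0
    while result:
        if result & 1:
            yield chunk_id * CHUNK_BITS + bit
        result >>= 1
        bit += 1

# Example: word A appears in docs 3 and 7, word B in docs 3, 5 and 7.
word_a = (1 << 3) | (1 << 7)
word_b = (1 << 3) | (1 << 5) | (1 << 7)
print(list(matching_docids(boolean_search([word_a, word_b], "and"))))  # [3, 7]
print(list(matching_docids(boolean_search([word_a, word_b], "or"))))   # [3, 5, 7]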

So the speed of the search is also significantly improved: a search containing the 10 most common words means (if you get all your hits in the first chunk) that only 10 vectors need to be processed, compared to a regular reversed index which might end up with 500,000 items per word to process, group, order and so forth before the engine can spit out an answer to the client (unless you "cheat" one way or another; one way is to limit the items per word to, say, 10 * the end result, so if you return the top 200 matches to the client the internal processing might use 2000 items per search word to find a match; there are other optimizations, but they are all slower than doing the processing with vectors).

Some benchmark results which I have collected on my dual P2 333 MHz Linux box (the search query contained the 10 most common words found in the database; Linux 2.6 and MySQL 4.1.18 were used in all cases, and SQL_NO_CACHE was used to avoid hitting the query cache during the tests):

Reversed index without cheating: 24.0 seconds
Reversed index with cheating (limit each word internally to 2000 hits): 1.66 seconds
Sphinx: 16.31 seconds
Vectorbased search: 0.06 seconds

On an XP3200+ Windows box:

Vectorbased search: 0.01 seconds

I don't know how Sphinx stores data internally, but hopefully something of what I have said above might be valuable/used to make Sphinx even better?

Apachez

Ohh you posted another message while I was writing my above text…

Thanks for the link, I will take a look at it.

No, I am not currently storing word positions but I think I will since the client might want to search for an exact phrase and not just words.

Regarding word frequencies: they are stored in the word table, since the index engine I have built removes duplicated words from a post while indexing. The word frequency is not actually needed, but I collect it as extra information which might be useful in the future (at least in order to see which words are common in the indexed data).

Also take a look at my benchmarks regarding IF vs SF (if we can assume that "my" vector-based storage is equivalent to SF).

Andrew Aksyonoff

Your size estimations are, well, flawed.

First, your IF estimations don't account for compression at all, and it saves a LOT. One wouldn't ever save a raw (wordid, docid) pair (a so-called "hit"). It would be sorted, delta-encoded and compressed, and one would end up with only a few bits per hit. So the final index size is 7-10% of the initial text size, as I mentioned.
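To illustrate the delta-encoding point (a rough Python sketch; a byte-aligned varint code is used here for simplicity, while real engines often use bit-level codes such as Golomb or Elias and get closer to the few-bits-per-hit figure):

def encode_postings(doc_ids):
    """Delta-encode a sorted docid list and pack each delta as a varint."""
    out, prev = bytearray(), 0
    for doc_id in sorted(doc_ids):
        delta, prev = doc_id - prev, doc_id
        while delta >= 0x80:              # 7 payload bits per byte, high bit = "more follows"
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_postings(blob):
    doc_ids, doc_id, delta, shift = [], 0, 0, 0
    for byte in blob:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:                   # continuation byte, keep accumulating
            shift += 7
        else:                             # last byte of this delta
            doc_id += delta
            doc_ids.append(doc_id)
            delta, shift = 0, 0
    return doc_ids

# A common word has small gaps between docids, so most deltas fit in a single byte.
hits = list(range(0, 1_000_000, 7))
blob = encode_postings(hits)
assert decode_postings(blob) == hits
print(len(blob), "bytes for", len(hits), "hits")   # roughly 1 byte per hit here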

Second, your SF estimations don't account for the fact that SF index size grows non-linearly with hit count, unlike IF. So if you index 10 times more data with IF, your index will grow 10 times, but with SF, make that anything from 20 to 100.

It's true that searching for common words would be better with SF; there would be an order of magnitude less data to process with SF in this case. However, with rare words, that would no longer be the case. And there's not much point in quickly finding all 1M documents containing "a the i what" and having no other ranking. 🙂

Apachez

7. peter: All results have been performed in boolean search mode. That is because in my case the client is not interested in the "relevance" of a message, but rather that the messages which pop up in the result contain all the words that were searched for and match the date sort criterion used (descending or ascending sort by date).

8. peter: The relevance technique is interesting but not of any interest in this case. The problem with relevance is also that the user doesn't know whether any other matches exist which still fulfil the search query, because the fulltext engine has on its own made up what the client should see based on "relevance" instead of exact matches, for example. It's like searching for something on Google which you know exists, but Google decides not to give you the hit until page 14.

9. Andrew: Your IF example doesn't seem correct. Even if you use some delta-encoding or similar, a pair will still take more than 1 bit in storage (where the vector-based search takes at most 1 bit (or 0.something bits when the whole chunked bitvector is compressed)).

Regarding your claim about SF growth: as I have seen it, the growth decreases, especially in cases where a particular word exists only in, say, postid 1, 2 and 3 and then again in 900001, 900002 and 900003. The chunks in between do not need to be saved in storage (there is no point in storing a vector that only contains 0's). That is assuming that the vector storage I use can be interpreted as "SF".

With rare words there will be less data to process when using vector storage. And there is huge interest in finding documents that contain "a the i what", especially when these words can occur in other contexts as well as in other languages. I would be glad if I could use a stopword list in my case to decrease the storage needed (which some other fulltext engines "cheat" with), but in my case that cannot be done, and that is probably why MySQL fulltext gives such a bad result: there is too much data for MySQL to process, whereas the vector storage solution solves that problem.

The reason why I use the "10 most common words" benchmark is to easily see how different solutions deal with large internal data volumes. Most solutions are somewhat quick for a single word or a rare word, but for a "10 most common words" search MySQL fulltext in boolean mode takes more than 800 seconds on this hardware, Sphinx takes approx 16 seconds, and the bitvector storage takes 0.06 seconds.

A profile performed on Sphinx showed that it spent most of its time working in the CPU and not digging around on the disk (it was CPU bound and not disk bound):

— PROFILE —
root: 16.30, self: 0.00
query_load_dir: 0.01, self: 0.00
read_hits: 0.01
query_load_words: 0.13, self: 0.00
read_hits: 0.13
query_match: 16.17, self: 11.96
read_hits: 4.20
—————

Andrew Aksyonoff

> 7. peter: All results have been performed in boolean search mode.

Does that mean using the -b command line option of the Sphinx search utility?

> fulltext engine has on its own made up what the client should see based on "relevance" instead of exact matches

Well, I don't know what your experience with other engines is, but Sphinx never does that. In its default mode it will rank by relevance, but it WON'T return any document which isn't an exact match.

> 9. Andrew: Your IF example doesn't seem correct. Even if you use some delta-encoding or similar, a pair will still take more than 1 bit in storage

Yes. Hits typically take 4 to 8 bits each in the inverted file.

> (where the vector-based search takes at most 1 bit

No! This is where you are mistaken. Assume there are N documents and W words. There would also be H hits; note that hits are not ALL the (word_id, doc_id) pairs possible, but ONLY the ones which are actually present; the number of these hits is exactly equal to the number of 1's in all your signatures (i.e. "vectors").

Now, IFs would just store those H hits, while SFs store N*W bits, so SFs need N*W/H bits per hit. These bits ought to be compressed, otherwise the data size would be overwhelming even for tiny collections, and you are right that most of them are zero, but you still need to spend at least some bits from time to time to encode the blocks of zeroes. And as N grows, that means more and more bits per hit, because N*W grows somewhat faster than H.

Actually, you're already using much more than 1 bit per hit. 1M documents should produce about 100M hits. Your index size is 250 MB; that's 2.5 bytes or 20 bits per hit, and it would be even more if you indexed 10M documents.

> With rare words there will be less data to process when using vector storage.

Sure.

> And there is huge interest in finding documents that contain "a the i what"

The point is that such a query would return some 150K matches over your 1M dataset, and no one is ever going to browse that, so you'd need to sort either by date, by relevance, or by something else. Date can sometimes be fine; sometimes it isn't. How would you find the only post which contains "to be or not to be" if it's a year old? 🙂

> The reason why I use the "10 most common words" benchmark is to easily see how different solutions deal with large internal data volumes.

Yes. It’s good to estimate worst case running time. However an average query won’t typically be that hard to evaluate.

> Sphinx takes approx 16 seconds,

My thinking is that most of that time is spent computing phrase proximity and it’s definitely CPU bound.

Apachez

I'll try to keep my posts shorter, mainly because I had a fever of 39.5C last night 😛

11. Andrew: I have been using SPH_MATCH_ALL along with SPH_SORT_DATE_DESC/SPH_SORT_DATE_ASC.

Regarding relevance, the problem is that the result should (in my case) be sorted by date and not by relevance. If you sort by relevance and pick, say, the top 200 hits, you will most likely miss newer posts which still match, unless you sort by date instead of by relevance (and sort by relevance is often the default mode in most fulltext engines, which is most likely why they perform somewhat badly when you need a sort by date).

The reported index size on disk includes MySQL's own column definitions etc. What would be interesting is whether there is some method to efficiently store sparse vectors. A quick fix in my case was to split the bitvectors into chunks and compress each chunk with LZF.

Regarding the 150k example, it was more about how the fulltext engine internally needs to process all those hits, especially when I want the top 200 results sorted by date. For MySQL fulltext this means that if each of the 10 common words produces 150k hits internally, the MySQL server needs to create some sort of temporary table (otherwise someone should explain to me why MySQL's fulltext takes more than 800 seconds for this 10 most common words case :P), fill it with the internal hits, group them, sort them by date and finally pick the 200 best matches (descending or ascending by date). As you can see from the profile for the 10 most common words query, Sphinx spends 16 seconds in the CPU, while the bitvector engine spends no more than 0.06 seconds (and 0.06 seconds was the total time; say 0.05 seconds of that was CPU, including CPU used by MySQL).

The reason I have started to look at solutions other than MySQL's own fulltext is that more and more people ask queries which take far longer than a couple of seconds to answer. That is where solutions such as Sphinx work very well. However, I found some issues (even though Sphinx was way faster than MySQL's own fulltext, it still took 16 seconds for the "heavy" search), and that is why I started to look into whether I could create a better solution that fits my needs (not that 16 seconds is bad in this case, or that these heavy searches are common, but they do exist). As a side note, the Sphinx I tested was a beta version, and wildcard searches etc. will be implemented in the near future according to email conversations I have had with the developer, so hopefully performance will increase as well. In my case it would be much easier to use a ready-made solution such as Sphinx instead of having to build my own engine and maintain it.

However, it will take some more months before 5.1 with Sphinx is GA, so I will continue to dig into my bitvector solution and see how I can improve it 😛

Vadim Tkachenko

Apachez,

I doubt MySQL will include the Sphinx storage engine in 5.1 GA or even in 5.2.
Unfortunately, a Sphinx index cannot be created directly from within MySQL;
the MySQL pluggable architecture does not allow that yet.

Apachez

14. Vadim: Darn, what are the odds that MySQL will use Sphinx as its native fulltext index instead of today's version/edition?

15. Peter: Yeah, it seems that the version supporting SPH_MATCH_BOOLEAN (though it is currently in beta according to the release info on the site) was released just 2-3 days ago. My previous observations were based on 0.9.5-dev back in March this year.

I have now reindexed using the latest 0.9.6-rc1 and here are the results:

10 most common words along with -b -l 200 --sort=date (boolean mode, sorting by date descending, returning the top 200 matches):

— PROFILE —
root: 8.81, self: 0.00
query_load_dir: 0.01, self: 0.00
read_hits: 0.01
query_load_words: 0.04, self: 0.00
read_hits: 0.04
query_match: 8.76, self: 6.60
read_hits: 2.16
—————

10 most common words along with only -l 200 --sort=date (sorting by date descending, equal to my previous benchmark when no boolean search was available in Sphinx and SPH_MATCH_ALL was the default, returning the top 200 matches):

— PROFILE —
root: 12.86, self: 0.00
query_load_dir: 0.01, self: 0.00
read_hits: 0.01
query_load_words: 0.04, self: 0.00
read_hits: 0.04
query_match: 12.82, self: 11.51
read_hits: 1.31
—————

Even though the new SPH_MATCH_BOOLEAN is an improvement of approx 30% compared to the previously available SPH_MATCH_ALL, the search still takes several seconds while TBGsearch takes a few milliseconds.

The Sphinx index takes a 648 MB spd-file and a 3.09 MB spi-file, 651.09 MB in total.

The TBGsearch index takes 275.22 MB of MySQL data and 24.32 MB of MySQL index, 299.54 MB in total.

In both cases 1 million indexed messages (forum posts).

Which storage method was Sphinx using?

Apachez

Well, in this case, with more than 1 million posts and more than 800 MB on disk, the MySQL "LIKE '% keyword %'" approach will take plenty of time as well.

I have extracted the searches that people performed during this month (June 2006) and ran them through Sphinx as well as TBGsearch.

Half way through I had to stop the first run since I noticed that Sphinx doesn't support wildcard searches!?

Can someone confirm whether that is true and, if so, when Sphinx will support wildcards? There is no word about this in the todo.txt.

After cleaning out the wildcard searches and running the benchmark a second time, the result is now the opposite: Sphinx is about twice as fast as TBGsearch. Looking for the reason, it is obviously because most search queries use single words rather than multiple words. There seems to be a threshold around 3-4 words in the search where TBGsearch starts to outperform Sphinx, but as long as the search query has fewer than roughly 4 words, Sphinx seems to be faster than TBGsearch.

In both cases the same columns were indexed (postid, postdate, postmessage), so in terms of pure size TBGsearch is smaller even with the MySQL overhead included.

Maybe I was using bad syntax for the wildcard searches, but once wildcard searches and dynamic index updates (adding new docs to an existing index) have been fixed (oh, and a Perl API; I already emailed Andrew a functional Perl API for 0.9.5-dev back in March, but he has still not included it in the Sphinx releases), I will start using Sphinx instead of TBGsearch 🙂

Vadim Tkachenko

Apachez,

> 14. Vadim: Darn, what are the odds that MySQL will use Sphinx as its native fulltext index instead of today's version/edition?

Hehe, the Sphinx plugin requires the Sphinx daemon, and the author of the daemon is Andrew. Andrew will ask big money if someone wants to buy Sphinx 🙂

Apachez

Well, it's GPL, as is MySQL, so perhaps a joint venture? 🙂

http://www.sphinxsearch.com/doc.html#license

Vadim Tkachenko

Apachez,

There is a trick… MySQL has a dual license, and
for companies that do not want to use the GPL, MySQL is distributed under a commercial license.
Sphinx cannot be included under the commercial license that way.

Apachez

Sure, but since Sphinx exists as a plugin, these companies can still use Sphinx, just as a plugin.

Since Andrew himself has been writing here: any comment from you, Andrew, regarding the possibility of having MySQL use Sphinx as its native fulltext engine?

Or would you only accept this if MySQL AB at the same time put a gazillion dollars into your bank account? 😉

As a side note, how did the cooperation with Innobase work in a similar case?

Innobase's own engine is used natively by MySQL as InnoDB in order to support transactions, row-level locking and so on… why couldn't Sphinx be incorporated in a similar fashion?

In this case I'm talking about how much easier it would be to just run "ALTER TABLE t1 ADD SPHINXFULLTEXT Ft_News_Search_Message (NewsMessage);" from within MySQL, instead of having to fiddle around with the plugin context and external applications etc. That was the main reason why at least I used MySQL fulltext from the beginning instead of starting to dig into Mnogosearch or similar.

Andrew Aksyonoff

> the possibility of having MySQL use Sphinx as its native fulltext engine?

First, I think there's some work to do in Sphinx before it could be considered a "suit-all" engine which could serve as a replacement for MySQL's own FT.

Second, even when that's done, including Sphinx into MySQL would depend on its owners – that is, MySQL AB, not me.

Apachez

23. Peter: OK, I noticed during my benchmark tests that TBGsearch supports wildcards while Sphinx currently does not, and I wanted to double check that I hadn't missed anything obvious. During the tests I tried using 10k chunks instead of 100k chunks in TBGsearch, and that boosted the 10 common words query on the P2 333 MHz box from 0.06s down to 0.027s; on the XP3200 box the search takes about 0.007s 🙂

I asked a bunch of questions on the Sphinx forum 4-5 days ago but still have no replies :/

Apachez

Just for the record 😛

I received an email from Andrew today saying that he will answer the questions on the Sphinx forum during the day 🙂

gunyarakun

How about Senna ?
Senna is a fulltext search engine embedded in MySQL 4.0, 4.1, 5.0.
http://qwik.jp/senna/

Ashish

I just ran into a problem with a MySQL query.

I can't understand why, whenever I run a query for the first time, it takes a lot of time, while after that first run it doesn't take much time at all. I want all my queries to run as fast as they do after being run once. I hope you get what I mean; I'm hopeful for your favourable response.

Gunnar

Hi Peter,

In your main post you mention that Sphinx can be used by MySQL 5.1 as a storage engine?
Can you please explain this in more detail?

I'm currently using MyISAM for two forums because of the fulltext index.
Size is 2 Million records and 100,000 records.
Average new rows per day 400.
Average page hits per day 250,000

I’m very unhappy about using MyISAM for obvious table locking reasons and issues.
The combination of complicated (long) search + insert + many parallel small selects gives terrible performance with MyISAM.

I would like to switch to something "better".
I wonder if switching to a mix of InnoDB and MyISAM could help:
running the table under InnoDB and having an extra fulltext MyISAM table.
This way, normal reads that don't use FT search could run against the normal table without slowdown…

An FT search for InnoDB would probably be the easiest way.
Or do you think that Sphinx would be a better solution?

Regards to you and Vadim,
Gunnar

Chris Lu

Please take a look at this

http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

You can create a full-text database search service and return results as HTML/XML/JSON. It uses Lucene directly in Java, but can easily be used with Ruby, PHP, or any existing database web applications.

You can easily index, re-index, incremental-index. It’s also highly scalable and easily customizable.

The best thing is, it’s super easy. You can create a production-level search in 3 minutes.



Brian

Is there any way I can use the fulltext search feature on InnoDB yet? If somebody knows, please let me know.

@Brian

AFAIK there is still no way, but I'm not sure why people regard this as a problem. It wouldn't be too hard to use a MyISAM table just for search, and keep it in sync via rules (functions, triggers, whatever). In a sense, if it's not critical to keep the index 100% up to date, it's better to cron it and rebuild at an interval. In fact, the MySQL documentation indicates it is sometimes much faster to rebuild as well. There's some underlying algorithmic reason, I'm sure.
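For example, a cron job along these lines could rebuild the search copy (a minimal sketch in Python using mysql-connector; the database, table and column names are made up, and it assumes a posts_search table already exists to swap out):

# Rebuild a MyISAM "shadow" table that exists only for the FULLTEXT index,
# while the real data stays in InnoDB. All names here are hypothetical.
import mysql.connector

def rebuild_search_table():
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="forum")
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS posts_search_new")
    cur.execute("""
        CREATE TABLE posts_search_new (
            post_id INT UNSIGNED NOT NULL PRIMARY KEY,
            body    TEXT NOT NULL,
            FULLTEXT KEY ft_body (body)
        ) ENGINE=MyISAM
    """)
    cur.execute("INSERT INTO posts_search_new SELECT post_id, body FROM posts")
    # Swap the freshly built copy in (RENAME TABLE is atomic), then drop the old one.
    cur.execute("RENAME TABLE posts_search TO posts_search_old, "
                "posts_search_new TO posts_search")
    cur.execute("DROP TABLE posts_search_old")
    conn.commit()
    conn.close()

if __name__ == "__main__":
    rebuild_search_table()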

Also, I've noticed that it's extremely expensive (others might benefit from this knowledge) to hit 2 FT indices at a time. Again, there is probably some underlying reason why this is not efficient. WHERE FT_QUERY_1 AND FT_QUERY_2 seems to be OK, but "OR" seems to be a huge problem on large datasets.

Therefore, it would seem that the solution to searching multiple FT indices is NOT to do so, but to create a table or column that puts all the information together. Unfortunately, MySQL has no expression-based indexes, and I'm not 100% sure those would work for FT anyway.

Hope the above helps someone. Also, reasonable results can be achieved by using BOOLEAN queries to restrict, and expressions to sort. AFAIK, natural FTS can’t sort until all results are returned anyway.
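To illustrate the restrict-then-sort idea, something along these lines (a hedged sketch in Python with mysql-connector; the products table and the popularity and recent_sales columns are made up):

# Boolean MATCH restricts the candidate set; an expression over ordinary
# (non-FT) columns does the sorting.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="shop")
cur = conn.cursor()
cur.execute("""
    SELECT product_id, name
    FROM products
    WHERE MATCH(name, description) AGAINST (%s IN BOOLEAN MODE)
    ORDER BY popularity * 0.7 + recent_sales * 0.3 DESC
    LIMIT 20
""", ("+red +shoes",))
for row in cur.fetchall():
    print(row)
conn.close()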

I have had reasonable success with MySQL 5.1 and 100k+ product rows. Sphinx is better in some ways, but FT is fairly mature now.

One thing I cannot figure out is how to efficiently understand/normalize the numbers returned by natural FTS, i.e. in order to add a boost from another non-FTS column. The numbers are effectively random from one query to the next. It would be nice if MySQL had a way to rank from 1-100, but I'm not sure that's a free operation; I'm not a DBMS expert. You can't just add +5, though, and selecting MAX is not cheap.

If anyone knows how to do that, I’d love to know 🙂