April 21, 2014

Full text search for all MySQL Storage Engines

As we know build in full text search is currently limited only to MyISAM search engine as well as has few other limits.

Today Sphinx Search plugin for MySQL was released which now provides fast and easy to use full text search solution for all storage engines. This version also adds a lot of other new features, including boolean search and distributed searching.

A while ago I already wrote about Sphinx Search Egine, comparing it to built in FullText search and Mnogosearch. I guess I should soon repeat tests, adding Lucene to the list for complete picture.

And if you do not feel like patching MySQL or use MySQL 5.1 beta to use sphinx as MySQL Storage Engine you can still use it old fashion way as separate server.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Apachez says:

    I have been for the past months experimenting with various fulltext engines since the mysql fulltext has it limitations specially when used in boolean mode on large results (for example a boolean search that searches on common words and then order the hits by date and finally picks out say top 200 best matches).

    Sphinx is really promising specially now when it will be available as a direct plugin for mysql (I assume this means that the sphinx index can be created directly from within mysql just as with mysql own fulltext engine ?).

    However I found some issues with the standalone testversions which ended up with that I created my own fulltext search engine which is written in perl and use mysql as backend mostly to see if I could do it any better for my case (fulltext search engine for a forum).

    What I have done is to mix reversed index with vectorspace model which means that instead of having a vector per message and a relevance (vectorangle) connected to this vector (which means that each time a new unique word turns up you need to recalculate all vectorangles) I will instead have one vector per unique word that has been found among the messages. This means that both indexing and searching is fast (compared to vectorspace where the indexing will be slow) and the datastorage needed is smaller than with reversed index.

    Each vector is also splitted in 100kbit chunks in order to keep the sizes down to a minimum aswell as having the possibility to scale this solution to use more servers if needed in future (google style :P), since the vectors are sparse I use LZF to have a fast and efficient way to compress the vectors.

    Some realworld stats for above engine:

    1 million indexed posts (raw size about 750 MB) from http://www.tbg.nu ended up in:
    * 601288 unique searchwords.
    * The wordtable is 24.05 MB in size along with 5.88 MB index.
    * 1.39 million searchvectors.
    * The vectortable is 251.17 MB along with 18.44 MB index.

    The average size of the LZF compressed vectors are 169 bytes (compared to 12.5 kbyte (100kbit) if not compressed), where the largest compressed vector is 12501 bytes and the smallest compressed vector is 2 bytes.

    More is explained at http://www.tbg.nu/tbgsearch (unfortunately currently only available in swedish but I will create an english translation shortly (ehm, at least within months :P)).

    Does anyone know if the above model has an official name more than “mix between reversed index and vectorspace model” ? :P

    Also, does the implementation of sphinx as plugin for mysql 5.1 mean that mysqls own fulltext engine will/might be replaced shortly by sphinx ?

  2. Your wording is pretty confusing. It’s unclear from your explanations where your “vectors” store word positions (within the documents) or not. Also, if your “vectors” store (sorted) document ID lists for each word, that’s just what inverted files (“reverse indexes”?) basically are, so you are not going to get any significant size or performance benefit against inverted files.

    As for the in-document word positions, if you’re not storing those, you are not going to do much better than simple boolean or classic BM25 in terms of relevance. Now if that is acceptable, both indexing and searching could be made much, much faster. And index size should actually be some 7-10% of initial text size, not 30%; so it seems that LZF is not doing a great job compressed document ID lists.

  3. D’oh! I just looked at your link at it’s now a little more clear; even despite it’s in Swedish. :)

    “Reverse indexes” are in fact usually called “inverted files” (IF); however I was mistaken about your “vectors”: those are not document lists, but rather a kind of “signature files” (SF).

    So what you’re actually doing is adding some additional hashing to “plain” signature files. I can’t point you to exact papers but I’m pretty sure that this had already been done. There’s this http://citeseer.ist.psu.edu/88449.html paper comparing SF and IF; I think you could follow it’s bibliography and find out what’s the academic name of your hashing scheme. :)

    The conclusion however is the same. You’re not storing word positions; you’re not storing any per-document word frequencies either; so you’re limiting yourself to boolean, not even BM25. And if you only need booleans, IF should yield somewhat better index size and search speed.

  4. Apachez says:

    Ahh I forgot to tell what the vectors store… The position within each vector is equal to the docid which the indexed post has (or actually position * chunkid).

    Reversed index is often described as using unsigned int where each unique word gets its own wordid and then you have a lookup table which stores the relation between wordid and docid. The problem with this method compared to store it as vectors is that for each combination you need at least 8 bytes (4 + 4 if using unsigned int) of storage (to make it easier to count I leave out the mysql codes needed to store these bytes). If you have an average of 100 words per docid this means that reversed index will use at least 800 bytes per indexed post.

    By using bitvectors this has been limited down to 100bit (12.5 bytes) which also can be compressed with LZF since its sparse vectors (actually you can use any compression algorithm you like but I have found that LZF will bring me compression while the speed will not be as much affected as if using gzip or bzip).

    That means that even without compression the ratio it is approx 64:1 between using a regular reversed index and the vectorbased storage regarding size needed on disk.

    Not only space needed is changed between using regular reversed index vs vectorbased storage but also how the search itself will be performed is changed. With a regular lookup table you need to focus to get your sql queries correct. Even when they are correct it will still take some time for huge search results (which means that you need to cheat one way or another to shorten the data needed when the search is performed). By using vectors you can use simple bit arithmetics like if you want to perform AND search you just do resultvector = vector1 & vector2, or if you want OR search just do resultvector = vector1 | vector2, and so on.

    So this means that also the speed of the search is significally improved since having a search that contains 10 most common words means (if you get all your hits in first chunk) that only 10 vectors are needed, compared to regular reversed index which might end up with 500.000 items of each word to process, group by, order by and so forth until the engine can spit out an answer to the client (unless you “cheat” one way or another, one way is to limit the items per word down to say 10 * endresult, like if you will return top 200 back to the client the internal processing might use 2000 items per searchword to find a match and there are other optimizations but they all are slower than using vectors for the processing).

    Some benchmark results which I have collected on my dual p2 333MHz linuxbox (the search query contained 10 most common words found in the database, linux 2.6 and mysql 4.1.18 was used in all cases – SQL_NO_CACHE was used in order to avoid hitting the query cache during tests):

    Reversed index without cheating: 24.0 seconds
    Reversed index with cheating (limit each word internally to 2000 hits): 1.66 seconds
    Sphinx: 16.31 seconds
    Vectorbased search: 0.06 seconds

    On xp3200+ windowsbox:

    Vectorbased search: 0.01 seconds

    I dont know how sphinx stores data internally but hopefully something of what I have said above might be valuable/used to make sphinx even better ?

  5. Apachez says:

    Ohh you posted another message while I was writing my above text…

    Thanks for the link, I will take a look at it.

    No, I am not currently storing word positions but I think I will since the client might want to search for an exact phrase and not just words.

    Regarding word frequencies they are stored in the wordtable since the index engine I have built is removing duplicated words from a post while indexing and the word frequency is not actually needed but I collect that as extra information which might be useful in future (at least in order to see which words are common in the indexed data).

    Also take a look at my benchmarks regarding IF vs SF (if we can assume that “my” vectorbased stored is equal to SF).

  6. peter says:

    Apachez,

    You’re probably doing typical mistake done comparing systems providing full text search systems. Unlike database management systems which normally should provide same result to same SQL query, for full text search systems results are often different, depending on relevance computing method.

    If you’ve been using sphinx in default search mode it will not only retrieve documents which match all your words but also sort them appropriately – taking into account how close words are to each other which is much more resource intensive.

    If you want to compare search to vecror based search you should use boolean match mode – this bypasses all word position check logic. Note Sphinx at this moment builds index which has such data anyway so you can’t compare index sizes.

    Would be pretty interesting to see what numbers do you get for boolean search in sphinx.

  7. peter says:

    One more note. I guess andrew meant different word frequency. If you do not have word positions you usually base relevance on counting on relative frequences, which means you need both frequency of the word in whole collection and frequency of the word in given document. As I can see from description you only have bit set if word is met in the document so it is not that easy to add it.

    For purely simple boolean search bit vectors might be faster (I would still like to see full benchmark code to be honest) but it is not really that important on small collections, on large collections you would normally want good relevance algorithm as there often would be too many results to check all of them.

  8. Your size estimations are, well, flawed.

    First, your IF estimations don’t account for compression at all, and it saves a LOT. One wouldn’t ever save (wordid,docid) pair (so called “hit”). It would be sorted, delta-encoded and compressed; and one would end up with only a few bits per hit. So the final index size is be 7-10% of initial text size, as I mentioned.

    Second, your SF estimations don’t account for that SF index size growth depends non-linearly on hit count, unlike IF. So if you’re indexing 10 times more data with IF, your index will grow 10 times, but with SF, make that anything from 20 to 100.

    It’s true that searching for common words would be better with SF; there would be an order of magnitude less data to process with SF in this case. However, with rare words, that wouldn’t be the case already. And there’s not much point in quickly finding all 1M documents containing “a the i what” and having no other ranking. :)

  9. Apachez says:

    7. peter: All results have been performed in boolean search mode. That is because in my case the client is not interrested in “relevance” of a message but rather that the message which popups in the result contains all words that has been searched for and applies for the datesort criteria which has been used (desc or asc sorting based on date).

    8. peter: The relevance technic is interresting but not of any interrest in this case. The problem with relevance is also that the user doesnt know if there exist any other matches which still fulfill the search query because the fulltext engine has on its own made up what the client should see based on “relevance” instead of exact matches for example. Its like searching for something on google which you know exists but google decides to not give you the hit until page 14.

    9. Andrew: Your IF example doesnt seem correct. Even if you use some delta-encoding or similar a pair will still take more than 1 bit in storage (where the vectorbased search takes only 1 bit at the most (or 0.something bits when using compression of the whole chunked bitvector).

    Regarding your SF claim of growth as I have seen it the growth decreases specially for cases when a particular word exists only in say postid 1, 2 and 3 and then later in 900001, 900002 and 900003. This means that the chunks in between will not need to be saved in the storage (no idea to store a vector that only contains 0′s). That is assuming that the vectorstorage I use can be interpretted as “SF”.

    With rare words there will be less data to process when using vectorstorage. And there is huge interrest of finding document that contains “a the i what” specially when these words can exist in other context aswell as languages. I would be glad if I could use a stopwordlist for my case to decrease the storage needed (which some other fulltext engines “cheats” with) but in my case that cannot be done and thats probably why there is such bad result when using mysql fulltext because there will be too much data to process for mysql where the vectorstorage solution solves that problem.

    The reason for why I use the “10 most common words” benchmark is to easily see how different solutions deals with large internal datamasses. That is because most solutions are somewhat quick for a single word or a rare word, but when it comes to “10 most common words” search the mysql fulltext in boolean mode takes more than 800 seconds on this hardware, sphinx takes approx 16 seconds, bitvector storage takes 0.06 seconds.

    A profile performed on sphinx showed that sphinx spent most of its time working in the cpu and not digging around on the disk (it was cpu bound and not disk bound):

    — PROFILE —
    root: 16.30, self: 0.00
    query_load_dir: 0.01, self: 0.00
    read_hits: 0.01
    query_load_words: 0.13, self: 0.00
    read_hits: 0.13
    query_match: 16.17, self: 11.96
    read_hits: 4.20
    —————

  10. > 7. peter: All results have been performed in boolean search mode.

    Does that mean using -b command line option to Sphinx search utility?

    > fulltext engine has on its own made up what the client should see based on “relevance� instead of exact matches

    Well, I don’t know what’s your experience with other engines, but Sphinx never does that. In it’s default mode, it will rank depending on relevance, but it WON’T return you any document which isn’t an exact match.

    > 9. Andrew: Your IF example doesnt seem correct. Even if you use some delta-encoding or similar a pair will still take more than 1 bit in storage

    Yes. The hits are typically taking 4 to 8 bits in the inverted file.

    > (where the vectorbased search takes only 1 bit

    No! This is where you are mistaken. Assume there are N documents and W words. There also would be H hits; whether hits are not ALL the (word_id,doc_id) pairs possible, but ONLY the ones which are actually present – the amount of these hits is exactly equal to the amount of 1′s in all your signatures (ie. “vectors”).

    Now, IFs would just store those H hits, while SFs store N*W bits, so SFs needs store N*W/H bits per hit. These bits are ought to be compressed, otherwise the data size would be overwhelming even for tiny collections, and you are right that most of them are zero, but you’ll need to spend at least some bits from time to time to encode blocks of zeroes anyway. And with the growth of N, that’d be more and more bits per hit, because N*W grows somewhat faster H.

    Actually, you’re already using much more that 1 bit per hit. 1M documents should produce about 100M hits. Your index size is 250 MB; that’s 2.5 bytes or 20 bits per hit, and that’d be even more even if you index 10M documents.

    > With rare words there will be less data to process when using vectorstorage.

    Sure.

    > And there is huge interrest of finding document that contains “a the i what�

    The point is that such a query would return some 150K matchs over your 1M dataset, and no one is ever going to browse that, so you’d need to sort either by date, or by relevance, or something else. Date could sometimes be fine; sometimes, it couldn’t. How would you find the only post which contains “to be or not to be” if it’s 1 year old? :)

    > The reason for why I use the “10 most common words� benchmark is to easily see how different solutions deals with large internal datamasses.

    Yes. It’s good to estimate worst case running time. However an average query won’t typically be that hard to evaluate.

    > sphinx takes approx 16 seconds,

    My thinking is that most of that time is spent computing phrase proximity and it’s definitely CPU bound.

  11. peter says:

    Apachez,

    I guess you’ve been experimenting with search engines in default settings which is quite frequently does not require all keywords to present in found message. Sphinx does require that in “match_all” mode. I think relevance ranking and cutting of result set should be performed in search engine – you can’t do it in client well. Of course if there are just 1000 matches you can sort them fast enough but what if there are millions or tens of millions of matches (google has billions for many) for common query ?

    Different application have different sorting needs – right, this is why sphinx has number of sort mode (and one can add his own if he needs) – sorting by object timestamp is one of them.

    Now about relevance – there is an important difference between google and opensoure search engines. Google uses some very complex relevance ranking, which they would not even ever disclose to the public, as this would help SEO guys to spam top result pages. Sphinx relevance algorithm is however well visible in the source code and you can use your own if you do not like it. So there are no surprises here. Also good you’ve mentioned google – they actually typically require all words to be present in the article (and links to the article) plus word positions are surely considered – you would normally see close to full phrases in the top of results. Plus google has an option to do phrase search which only finds phrases. This is to mention for large scale search you typically need word positions.

    Speaking about benchmarks It would be interesting to see your data set if you do not mind sharing it. As for becnchmark I would rather see real world scenario – for example 100.000 real user queries logged and executed. For your particular case sphinx is quite likely to be faster but it is curious how much faster.

    Also I’m not sure why there would be less handling with less frequent words ? As I understand you still store vector for the word which has bits for every document to show if this document have this word. In such case even though compressed size might be smaller you would still need to operate with fill bit vectors. Do I miss something ?

  12. Apachez says:

    I’ll try to shorten down my posts mainly because I have had 39.5C in fever last night :P

    11. Andrew: I have been using SPH_MATCH_ALL along with SPH_SORT_DATE_DESC/SPH_SORT_DATE_ASC.

    Regarding relevance the problem is that the result should be (in my case) sorted by date and not by relevance. If you sort by relevance and pick up say top 200 hits you will most likely miss newer posts which still matches unless you use the sort by date instead of sort by relevance (where sort by relevance is often the default mode in most fulltext engines and most likely because of this they will perform somewhat bad when you need a sort by date).

    The reported index size on disk is including mysqls own columndefinitions etc. What would be interresting is if there exist some method to efficiently store sparse vectors. A quick fix in my case was to split up the bitvectors into chunks and compress each chunk using LZF.

    Regarding the 150k example it was more of how the fulltext engine internally needs to process all those hits specially for the case when I want the result to be top 200 sorted by date. For mysql fulltext that means that if each of the 10 common words produces 150k hits internally the mysql server needs to create some sort of temporary table (otherwise someone should explain to me why mysqls fulltext takes more than 800seconds for this 10 most common word case :P), fill it with the internal hits, group them, sort them by date and finally pick up the 200 best matches (desc by date or asc by date). As you can see the profile for sphinx to process the 10 most common word query, sphinx spends 16 seconds in the cpu while the bitvector engine spends not more than 0.06 seconds (and 0.06 seconds was the total time, say 0.05 seconds was cpu bound or something like that including cpu used by mysql).

    In my case why I have started to look at solutions other than mysqls own fulltext is due to the fact that more and more people asks queries which takes way longer than a couple of seconds to answer. That is where solutions such as sphinx works very well. However I found some issues (even if sphinx was way faster than mysqls own fulltext it still took 16 seconds for the “heavy” search) and that is why I started to look for if I could create some better solution which fits my needs (not that 16seconds is bad in this case or that these heavy searches are common but they do exist). A sidenote was that the sphinx I tested was a beta version and that wildcard searches etc will be implemented in near future according to email conversations I have had with the developer and hopefully the performance will increase aswell. In my case it would be much easier to use a ready solution such as sphinx instead of having to build my own engine and maintain it.

    However it will take some more months before 5.1 with sphinx will be GA so I will continue to dig into my bitvector solution to see how I can improve it :P

  13. Vadim says:

    Apachez,

    I doubt MySQL will include Sphinx storage engine in 5.1 GA or even in 5.2.
    Unfortunately sphinx index cannot be created directly from within mysql,
    MySQL pluginable architecture does not allow that yet.

  14. peter says:

    Apachez,

    Sorry to hear you do not feel well. Get well soon, discussion can wait :)

    SPH_MATCH_ALL is not boolean search. This is search mode which uses phrase matching for relevance this is why it is slower. You should use SPH_MATCH_BOOLEAN if you do not care about relevance end just want to check if there is the match. Sphinx would count relevance even if sorting is done by SPH_SORT_DATE_DESC because it sorts by relevance for objects having the same date.

    Sphinx is not getting slower whatever sorting mode you’re using as data is not presorted in the index by relevance criteria, this allows to build very dynamic sort criterias.

    Speaking about MySQL performance you’re right. There are two pieces which slow things down. First finding the rows which contain all keywords. Intersection of pretty large sets needs to be performed. Next problem is sorting – it is done by MySQL rather than full text search engine itself, which means it may be very slow. If you would get million of row in result set and will want to sort them by date and get top 20 matches that would be really slow.

    About sphinx – it would be interesting to see which performance do you get if you’re using boolean match mode as it looks like what you need. Also benchmark using real queries from your site might be very interesting :)

  15. Apachez says:

    14. Vadim: Darn, what are the odds that mysql will use sphinx as its native fulltext index instead of todays version/edition ?

    15. Peter: Yeah it seems that the version supporting SPH_MATCH_BOOLEAN (thou it is currently in beta according to release info on the site) was released just 2-3 days ago. My previous observations are based on 0.9.5-dev back in mars this year.

    I have now reindexed using the latest 0.9.6-rc1 and here are the results:

    10 most common words along with -b -l 200 –sort=date (in boolean mode sorting by date descending returning top 200 matches):

    — PROFILE —
    root: 8.81, self: 0.00
    query_load_dir: 0.01, self: 0.00
    read_hits: 0.01
    query_load_words: 0.04, self: 0.00
    read_hits: 0.04
    query_match: 8.76, self: 6.60
    read_hits: 2.16
    —————

    10 most common words along with only -l 200 –sort=date (sorting by date descending (equal to my previous benchmark when there were no boolean search available in sphinx, defaulting to SPH_MATCH_ALL) returning top 200 matches):

    — PROFILE —
    root: 12.86, self: 0.00
    query_load_dir: 0.01, self: 0.00
    read_hits: 0.01
    query_load_words: 0.04, self: 0.00
    read_hits: 0.04
    query_match: 12.82, self: 11.51
    read_hits: 1.31
    —————

    Even if the new SPH_MATCH_BOOLEAN is an improvement by approx 30% compared to previous available method through SPH_MATCH_ALL the search still takes several seconds while TBGsearch takes a few milliseconds.

    The sphinx index takes 648 MB spd-file and 3.09 MB spi-file , 651.09 MB in total.

    TBGsearch index takes 275.22 MB mysqldata and 24.32 MB mysqlindex, 299.54 MB in total.

    In both cases 1 million indexed messages (forum posts).

    Which storage method were sphinx using ?

  16. peter says:

    Apachez,

    Thank you for your benchmarks again. I thought there would be larger improvement, not just 30% you’re observing. Might be Andrew would optimize further :)

    I still however would be interested to see some more real benchmark with actual queries your users do. You’re testing very narrow case and it is hard to base any judgement on it. For example it is easy to come up with query when full text scan will be faster than index access but would you claim you do not need any indexes ? Or in your case I guess running txt like “% keyword1 %” and txt like “% keyword2 %” instead of running MySQL Full text search query would be faster but does it mean it is always the case ?

    bit vector is similar to full table scan in this case – you have to have bit for each of the records in the index and scan it fully if word is frequent or not. This is surely going to be faster if word is frequent (selectivity is low) but will be slower if seletivity is high.

    It would be also curious to see what are the numbers with 10mil of records or something similar. If you have the code for your Full Text Search system you share it.

    Now speaking about index sizes – Sphinx always stores word positions which would take over half of the sphinx index. Plus MySQL stores timestamp and groupid for each document which is not needed for full text search itself, only needed if they are used in sorting or matching clause. I guess removing them would half index size again.

    Storage sphinx is using is doclist+hitlist with crc32 used as dictionary method. Both lists are compressed using byte based compresion.

  17. Apachez says:

    Well in this case with more than 1 million posts and more than 800 meg on disk the mysql “LIKE ‘% keyword %’” will take plenty of time aswell to perform.

    I have extracted the searches that people has performed during this month (june 2006) and ran them through sphinx aswell as TBGsearch.

    Half way through I had to stop the first execution since I noticed that sphinx doesnt support wildcard searches !?

    Can someone confirm if that is true and if so, when will sphinx support wildcard because there is no words about this in the todo.txt ?

    After cleaning out the wildcard searches and running the benchmark a second time the result now is the opposite, sphinx is about twice as fast as TBGsearch. Looking for the reason its obvious that it is because most search queries are based on single words and not multiple words. There seem to be a boarder near 3-4 words in the search when TBGsearch outperforms sphinx but as long as the search query has fewer than roughly 4 words it seems to be faster than TBGsearch.

    In both cases same columns were indexed (postid, postdate, postmessage) so that means that the TBGsearch is looking at pure size smaller due to mysql overhead.

    Maybe I was using bad syntax for wildcard searches but once wildcard searches and dynamic update of the index (adding new docs to existing index) has been fixed (ohh and a perl api, I have already emailed Andrew a functional perl api for 0.9.5-dev back in mars but he has still not included that into the sphinx releases) I will start use sphinx instead of TBGsearch :-)

  18. Vadim says:

    Apachez,

    > 14. Vadim: Darn, what are the odds that mysql will use sphinx as its native fulltext index instead of todays version/edition ?

    Hehe, sphinx plugin requires sphinx daemon and authour of daemon is Andrew. Andrew will ask big money if someone wants to buy sphinx :)

  19. Apachez says:

    Well its GPL aswell as MySQL is so perhaps a joint venture ? :)

    http://www.sphinxsearch.com/doc.html#license

  20. Vadim says:

    Apachez,

    There is a trick… MySQL has Dual License, and
    for companies that does not want to use GPL, MySQL is distributed by Commercial License.
    Sphinx can not be included in Commercial Licence by that way.

  21. Apachez says:

    Sure but since sphinx exists as a plugin these companies can still use sphinx but as a plugin.

    Since Andrew himself has been writing here, any comment from you Andrew regarding the possibility of having mysql use sphinx as its native fulltext engine ?

    Or would you only accept this if you at the same time gets gazillion dollars put to your bank account from MySQL AB ? ;-)

    As a sidenote, how was the cooperation with InnoBase regarding a similar case ?

    InnoBase own engine was being used natively my MySQL as InnoDB in order to support transactions, row-level locking and so on… why couldnt sphinx be incooperated in a similar fashion ?

    In this case Im speaking of how much easier it would be to just from within mysql run “ALTER TABLE t1 ADD SPHINXFULLTEXT Ft_News_Search_Message (`NewsMessage`);” instead of having to fiddle around with the plugin context and external applications etc. That was the main reason for why at least I was using mysql fulltext from beginning instead of start to dig into mnogosearch or similar.

  22. peter says:

    Apachez,

    Yes. 800MB would take a time to scan but I expect it will be still much faster than 800 seconds. I’ve seen in many cases LIKE being faster than MySQL full text search in my tests.

    Speaking about Wildcards – as I know it is on the way. Wildcard search requires “proper” dictionary to be implemented in sphinx. Currently (0.9.6) sphinx just uses crc32() to get word_id from the keyword. If word->id mapping is stored separately one can find all word_id matching to the mask and perform the search.

    I would suggest to write request on sphinx Forum. Andrew is very considerable to user requests in defining the road map.

    Dynamic updates are also hopefully to come in the few months.

    Now speaking about using Sphinx or other full text search system with MySQL as plugin (not hack with storage engine) – for this to happen fulltext search interface would need to be rewritten to support multiple full text search engines behind it. Currently it is coded so it is MyISAM specific and implementation specific. I hope this would happen sooner or later especially as there is Innodb full text search in works by Innodb team.

  23. > the possibility of having mysql use sphinx as its native fulltext engine ?

    First, I think there’s some work do in Sphinx before it could considered a “suit-all” engine which could serve as a replacement to MySQL’s own FT.

    Second, even when that’s done, including Sphinx into MySQL would depend on its owners – that it, MySQL AB, not me.

  24. Apachez says:

    23. Peter: Ok since I noticed this during my benchmark tests that TBGsearch supports wildcard while sphinx currently does not and I wanted to double check that I didnt miss anything obvious. During the tests I tried to use 10k chunks instead of 100k chunks in TBGsearch and that boosted the 10 common word query on the p2 333MHz box from 0.06s down to 0.027s, on the xp3200 box the search takes about 0.007s :-)

    I have asked a bunch of questions on sphinx forum 4-5 days ago but still no replies :/

  25. Apachez says:

    Just for the record :P

    I received an email from Andrew today that he will answer the questions in the sphinxforum during the day :-)

  26. gunyarakun says:

    How about Senna ?
    Senna is a fulltext search engine embedded in MySQL 4.0, 4.1, 5.0.
    http://qwik.jp/senna/

  27. peter says:

    gunyarakun,

    I have not used Senna. It is however not full text engine which is embedded in MySQL, but I see there are some patches so it can be bound to MySQL.

    Only fractions of web site are available in English so I could not find all info I was looking for. If you know author email let me know.

  28. Ashish says:

    just got a problem in mysql query,

    it’s just I can’t understand why it happens that whenever i try to run queries at the first time it takes lots of time while after running it at first time they don’t take much time to run…..i want to run my all queries run asmuch faster as they get run after running them first time….i hope you’ll get what i mean…….i’m hopeful for your favourable response.

  29. peter says:

    Ashish,

    Please post your question at forum, forum.mysql.com as it is not relevant to the blog post.

  30. Gunnar says:

    Hi Peter,

    In your main post you outline that sphinx can be used by MySQL 5.1 as storage engine?
    Can you please explain this in detail?

    I’m currently using MyISAM becasue of the fulltext index for two forums.
    Size is 2 Million records and 100,000 records.
    Average new rows per day 400.
    Average page hits per day 250,000

    I’m very unhappy about using MyISAM for obvious table locking reasons and issues.
    The combination of complicated (long) search + insert + many parallel small selects gives terrible performance with MyISAM.

    I would liek to switch to something “better”.
    I wonder if switching to a mix of innodb and myisam could help.
    Running the table under Innodb and having an extar fulltext ISAM table.
    This way normal reads that don’t use FT search could run without slowdown from the normal table …

    A FT search for Inno would propably the easiest way.
    Or do you think that sphinx would a better solution?

    Regards to you and vadim
    Gunnar

  31. peter says:

    Gunnar,

    I’ve asked Andrew to add some kind of tutorial and he will do that once he finds a time. Honestly we do not use sphinx as storage engine but instead as stand alone application most of the time simply retrieving rows from MySQL based on IDs.

    It makes it much easier in terms of profiling etc.

    I think Sphinx is great solution for MySQL at this point especially if you’re looking for search performance and large data sizes.

  32. Chris Lu says:

    Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service, return results as HTML/XML/JSON. It uses the Lucene directly in java, but can be easily used with Ruby, PHP, or any existing database web applicatoins.

    You can easily index, re-index, incremental-index. It’s also highly scalable and easily customizable.

    The best thing is, it’s super easy. You can create a production-level search in 3 minutes.

  33. AMlogix says:

    Thats great article undergone, realy informative,
    For more Mysql examples please check my Blog

    Mysql Examples
    http://mysqlexamples.blogspot.com

  34. CSS says:

    Got a nice blog site. Please visit.

  35. Brian says:

    Is there any way I can use fulltext search feature on innodb yet? If somebody knows, please let me know.

  36. @Brian

    AFAIK there is still no way, but I’m not sure why people regard this as a problem. It wouldn’t be too hard to use a MyISAM table just for search, and keep the information via rules (functions, triggers, whatever). In a sense, if it’s not critical to keep the index 100% up-to-date, it’s better to cron it and rebuild at interval. In fact, MySQL documentation indicates it is much faster to rebuild sometimes as well. There’s some underlying algorithmic reason, I’m sure.

    Also, I’ve noticed that it’s extremely expensive (others might benefit from this knowledge) to hit 2 FT indices at a time. Again, probably some underlying reason why this is not efficient. WHERE FT_QUERY_1 AND FT_QUERY_2 seems to be OK, but “OR” seems to be a huge problem on large datasets.

    Therefore, it would seem that the solution to searching multiple FT indices is NOT to do so, and create a table or column that puts all the information together. Unfortunately, in MySQL, there are no expression-based indexes, but I’m not 100% sure that would work for FT.

    Hope the above helps someone. Also, reasonable results can be achieved by using BOOLEAN queries to restrict, and expressions to sort. AFAIK, natural FTS can’t sort until all results are returned anyway.

    I have reasonable success with MySQL 5.1 and 100k+ product rows. Sphinx is better in some ways, but FT is fairly mature now.

    One thing I cannot figure out is how to efficiently understand/normalize the numbers returned from natural FTS. i.e. to add a boost from another non FTS column. The numbers are effectively random from one query to the next. It would be nice if MySQL had a way to rank from 1-100, but I’m not sure that’s a free operation. I’m not a DBMS expert. You can’t just add +5, though, and selecting MAX is not cheap.

    If anyone knows how to do that, I’d love to know :)

Speak Your Mind

*