May 21, 2013

Sphinx search performance optimization: multi-threaded search

Queries in MySQL, Sphinx and many other database or search engines are typically single-threaded. That is when you issue a single query on your brand new r910 with 32 CPU cores and 16 disks, the maximum that is going to be used to process this query at any given point is 1 CPU core and 1 disk. In fact, only one or the other.

Seriously, if query is CPU intensive, it is only going to be using 3% of the available CPU capacity (for the same 32-core machine). If disk IO intensive – 6% of the available IO capacity (for the 16-disk RAID10 or RAID0 for that matter).

Let me put it another way. If your MySQL or Sphinx query takes 10s to run on a machine with a single CPU core and single disk, putting it on a machine with 32 such cores and 16 such disks will not make it any better.

But you knew this already. Question is – can you do something about it?

In case of Sphinx – indeed you can! And with very little effort. In fact, it does not require any changes to your application or database whatsoever, it is only a matter of small changes to the sphinx configuration.

The Plan

First of all, let me explain what we are trying to achieve here.

Sphinx has the ability to do distributed search out of the box – it was designed to scale out that way very early on. If your sphinx index does not fit to one machine, you would index different parts of it from different machines and then you would have an aggregator node that receives the request from application, issues search requests to all data nodes in parallel, merges results from all of the data nodes and returns results back to the application as if it was just one server serving the request in the first place.

Well, guess what – you can actually utilize this feature to your advantage even if your data can easily fit into one machine and all you want is your queries to be many times faster. Even more so, Sphinx now supports this out of the box, so you don’t need to pretend you are querying remote nodes.

Also, you get a bonus: once you configure server for distributed search, you can do indexing in parallel too!

Word of caution: while this technique will improve most types of search queries, there are some that aren’t going to benefit greatly from parallel execution. The reason is that while search is done on data nodes (even if local ones) and in parallel, merging of results is done by the aggregator and therefore it is single-threaded. Merging includes some CPU-intensive operations such as ranking, ordering or even COUNT with GROUP BY and if data nodes return large amounts of data to post-process, aggregator may well become a bottle-neck due to its single-threaded nature.

This is however easy to check – look at your Sphinx query log and look at the number of results matched per query – this should give you a clue.

Execution

Let us assume you have this one index on one server with the following basic configuration (many irrelevant details omitted):

And now we want it to utilize 3 CPU cores and/or disks on a local machine for this index of ours – idx1. Here’s how we would change the configuration:

And you’re done. Of course, you need to reindex all of the indexes, but you can now do it in parallel – just run a separate indexer for every index idx1p0 through idx1p2.

BTW, using div operator is not necessarily the best way to split data, you may want these to be ranges by using a helper table in MySQL to define those ranges, mixing it with sql_query_range or something else, depending on how your data looks like.

Finishing line

I always loved how Sphinx scales out easily with as many machines you need and have been running it this way for many years now, however I think I don’t utilize this feature to make queries even faster on a one-machine show nearly as often as I should. Well, it’s not like it is slow or anything, but queries are never too fast, are they? :)

About Aurimas Mikalauskas

Aurimas joined Percona in 2006, a few months after Peter and Vadim founded the company. His primary focus is on high performance, but he also specializes in full text search, high availability, content caching techniques and MySQL data recovery.

Comments

  1. Stefan says:

    Thank for this great post!

    What do you think? Does it make sense to distribute my index?
    I have an 4 Core CPU (Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz) with 32GB of RAM
    but only with an RAID-1 system (containing 2 disks).

    My sphinx index contains about 12 millions records with a disk size of about 2,7 GB.

    Greetings,
    Stefan

  2. Stefan, -

    If this machine is running sphinx only, you are definitely going to benefit. Even if it’s a shared box – if you don’t have many disk reads hence sphinx serves data from OS cache, it should help just as well – making 3-4 partitions should improve performance of sphinx queries 2-3 times. Note this also assumes that your queries are rather selective and do not return millions of records.

    Cheers,
    Aurimas

  3. Brian says:

    Is the above config example correct? I’m trying to replicate it on my end and I see some issues.

    ERROR: index ‘idx1p0′: source ‘src0′ not found.

    I don’t see src0 defined anywhere in the above config.

    indx1p1 as configured is going to contain the entire dataset, not just 1/3rd of it as expected because it is using src1 which is defined without a % in the where query.

    Same error for the 3rd index:
    indexing index ‘idx1p2′…
    ERROR: index ‘idx1p2′: source ‘src2′ not found.
    ERROR: index ‘idx1p2′: no valid sources configured; skipping.

  4. Brian, -

    thanks – good catch. That was an error on my side while making the pseudo-config. You should replace “source = src0″ to “source = src1p0″, “source = src1″ to “source = src1p1″ and “source = src2″ to “source = src1p2″. I have fixed that in the post already.

    Let me know if that fixes the issue.

    Aurimas

  5. Brian says:

    Thanks, the config works perfectly now.

    I’m troubleshooting an issue where we get zero results back after sharding the index into 4 chunks. We’ve been running a single index for a few years now, and we have 16 24 core servers handling the search side of things. For whatever reason, when we shard the index, we no longer get any results back in production. Nothing changes software side, just the sphinx config, index, and a searchd restart.

  6. Brian, -

    I would try to query the individual shards and see if they return results. If not, chances are something went wrong with indexing and indexes are just empty. If they do, then it could be a bug on the sphinx side.

    If you’re unable to figure it out, please create a forum entry and just post a link here:

    http://www.perconaforum.com/index.php?t=thread&frm_id=4&

    Aurimas

Speak Your Mind

*