April 17, 2014

Sphinx search performance optimization: attribute-based filters

One of the most common causes of poor Sphinx search performance I see our customers face is the misuse of search filters. In this article I will cover how Sphinx attributes (which are normally used for filtering) work, when they are a good idea to use, and what to do when they are not but you still want to take advantage of otherwise superb Sphinx performance.

The Problem

While Sphinx is great for full text search, you can certainly go beyond that – but before you do, it is a good idea to make sure you’re doing it the right way.

In Sphinx, columns are basically one of two kinds:

a) full text
b) attributes

Speaking in MySQL terms, full text columns are always indexed, and using the very powerful extended query syntax you can search against certain columns only, or against all of them. The fewer results your full text query matches, the faster the query will be – that much is self-evident, I guess. Every time a keyword matches a document, Sphinx needs to resolve the appropriate document and evaluate the result. If your keywords match all of your 100M records, that is going to be a lot of work; with just a few hundred thousand matching records it will be much, much faster.
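To make that concrete, a field-limited search versus an all-field search looks like this in SphinxQL (the index name books and the field name title here are purely illustrative):

SELECT * FROM books WHERE MATCH('@title solar system');
SELECT * FROM books WHERE MATCH('solar system');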

Attributes, on the other hand, are sort of like unindexed MySQL columns. They are the extra details you may want to filter by – usually things like gender, status, age, group_id and so on. The effect of them being unindexed is that whenever you use attributes, Sphinx does a full scan of that attribute across all the records that matched the full text query. For a few hundred thousand matched records, checking attributes is not going to slow queries down significantly. But – and this is the misuse that I see a lot – when you are NOT doing full text search, using only attribute-based filters means a full scan of that attribute across all records.

Because attributes are not B-tree structured (and are therefore slow to scan from disk), Sphinx by default stores them in memory. In fact, it requires that all attributes fit in memory, or server performance will simply be unbearable. Even so, that does not mean you can use attributes to find all the records that match group_id = 10 – that query will still have to check all 100M records.
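In other words, a query like this (sketched with an illustrative index name) has no full text part at all, so Sphinx has nothing to narrow the candidate set with and must check group_id on every single record:

SELECT * FROM books WHERE group_id = 10;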

BTW, internally there are some differences between numeric attributes, character-based attributes and multi-value attributes (MVAs), but for the purpose of our discussion it is enough to know that attributes are not indexed.

For example..

Now let me give you a few examples so this is not just empty talk. For the examples below I will be using the SphinxQL protocol, which looks like talking to a MySQL server, but it is not – it is me talking to the Sphinx server.
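If you want to follow along, you can connect to searchd with the regular mysql client on whatever port you configured for the MySQL wire protocol in the listen directive – 9306 is just the conventional choice:

mysql -h 127.0.0.1 -P 9306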

First of all, let us see how many records we have in this index and how long it takes to do a full scan:

Note this is a real index used in production – a catalog of books – so if the same query happens to give slightly different results between iterations, it could be because indexing occurred in the meantime.
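The session looks roughly like this (the index name books is illustrative, and the counts are simply a sketch of the figures I discuss below):

SELECT * FROM books;
...
20 rows in set (0.70 sec)

SHOW META;
+---------------+----------+
| Variable_name | Value    |
+---------------+----------+
| total         | 1000     |
| total_found   | 10000000 |
| time          | 0.700    |
+---------------+----------+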

If you are seeing SphinxQL output for the first time, it may be a little confusing, so let me explain. The query returned 20 rows because, unless you specify an explicit LIMIT, it defaults to 20. Total says 1000 because by default a query is limited to processing the 1000 best results (it still searches the entire index, though).

Either way, the takeaway is that this index has 10M records and a full scan takes about 700ms.

Now, let us find all records that match user_id = 50:
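That is a plain attribute filter with no full text clause whatsoever (same illustrative index, counts again sketched from the numbers below):

SELECT * FROM books WHERE user_id = 50;
...
20 rows in set (0.16 sec)

SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 287   |
| total_found   | 287   |
| time          | 0.155 |
+---------------+-------+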

Pretty bad, isn’t it? 287 records returned in 155ms. Doing the same thing in MySQL, assuming user_id is indexed and in cache, would take less than a millisecond, so it is definitely not the best use case for Sphinx.

When your full text search keywords match many documents (and the query is therefore slow), adding attribute filters may significantly reduce the number of results matched – but not the amount of time it takes:
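For example, a frequent keyword combined with the user_id filter returns far fewer rows, yet the attribute still has to be checked for every document the keyword matched (the keyword here is hypothetical):

SELECT * FROM books WHERE MATCH('history') AND user_id = 50;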

Solution

The solution may not be obvious at first, but you will see that it makes sense. The strength of Sphinx is full text search, so I suggest we exploit that to get good performance on attributes that are highly selective, such as user_id in the example above. In fact, I have been using this technique with great success for many years now.

First, I would add the following extra item to fetch when indexing the catalog:

CONCAT('userkey_', user_id) userkey
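In sphinx.conf terms that simply means extending the source's sql_query; a minimal sketch, with table and column names chosen purely for illustration:

source books
{
    type          = mysql
    # connection settings omitted
    sql_query     = SELECT id, title, author, user_id, \
                    CONCAT('userkey_', user_id) userkey \
                    FROM book_catalog
    sql_attr_uint = user_id
}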

And now I have an extra column in a full text index that I can use for filtering:
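The filter is now expressed as an ordinary full text match against the pseudo-keyword:

SELECT * FROM books WHERE MATCH('@userkey userkey_50');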

That looks much better and I can mix it with other search keywords:
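For instance (the extra keyword is hypothetical – it is the pattern that matters):

SELECT * FROM books WHERE MATCH('solar @userkey userkey_50');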

Highly selective columns only

I thought I would emphasize this: while it is a neat performance optimization for highly selective attributes, it is certainly not something you would want to use for every attribute. There are a few reasons for that:

  • it does use more disk space for the index (although it’s not as bad as you might think)
  • attributes are still a good way to filter out data when your search queries don’t match many records
  • in fact, it could reduce performance of queries that otherwise match few records

Let me illustrate that. I have created another full text indexed column for a skewed boolean attribute, "ancient", which flags whether a book was published before or after the year 1500, and now I will run a few different queries against the two:
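The queries are essentially these four – the first pair uses only the low-selectivity check (attribute versus full text key), the second pair puts a selective keyword in front of it (the index name remains illustrative):

SELECT * FROM books WHERE ancient = 0;
SELECT * FROM books WHERE MATCH('@ancientkey ancientkey_0');
SELECT * FROM books WHERE MATCH('solar') AND ancient = 0;
SELECT * FROM books WHERE MATCH('solar @ancientkey ancientkey_0');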

What you can see here is that there is very little difference when using only the check against "ancient" – it is very slow in both cases – but when doing a selective search and then filtering, using the attribute is orders of magnitude better than using a full text key.

That being said, for highly selective columns – even more so if they will be used in queries without full text search keywords – having them full text indexed is a very good way to improve Sphinx query performance.

About Aurimas Mikalauskas

Aurimas joined Percona in 2006, a few months after Peter and Vadim founded the company. His primary focus is on high performance, but he also specializes in full text search, high availability, content caching techniques and MySQL data recovery.

Comments

  1. Edmar says:

    Thanks for the explanations. This didn’t seem obvious from the Sphinx docs. It seems this is a side-effect that demands careful consideration of stop words (something I was hoping to be able to simply disable).

    Too bad the query match('solar @ancientkey ancientkey_0') can’t optimize on the high selectivity of the 'solar' keyword (checking for '@ancientkey ancientkey_0' only for documents that already matched 'solar').

    By default, match('solar @ancientkey ancientkey_0') uses boolean AND, right? That’s why I thought the optimization would be possible.

  2. A full text index maps lexical tokens to documents. So when you do a boolean AND against an FTS index of just about any type, the FTS must intersect the document ids that match each term in order to return the correct results. Efficient intersection would probably be O(a + b + … + z) time complexity, where each operand is the number of results for a search term. A search over two terms should be O(a + b), where a is the number of results matching the first term and b the number matching the second.

  3. Edmar,

    Consider looking at Fastbit if you have low-cardinality columns that you need to intersect with high-cardinality ones. Fastbit uses WAH-compressed bitmaps, which can be compared without decompressing them. This makes multidimensional lookups very efficient.

    https://sdm.lbl.gov/fastbit/

  4. Edmar says:

    Thanks Justin. Fastbit does seem cool,
    http://crd-legacy.lbl.gov/~kewu/ps/LBNL-61768.pdf . I wonder if something like Sphinx could be made to automagically use Fastbit to index low-selectivity words…

    For my relatively-low-volume insert-only dataset (private trouble ticket messages), I’ll probably go for regular full-text indexing (probably Sphinx or MySQL), sharded by year. And let users know about low-selectivity/”common” search terms significantly increasing query times, regardless of the presence of high-selectivity terms (argh).

  5. Aurimas,

    You mention attributes in Sphinx are like unindexed columns (hence the full table scan), yet I remember Sphinx did have some optimization like storing attributes in blocks and having the ability to skip a whole block if there are for sure no matching rows in it?

    The details of this implementation do not seem to be available in the docs though
    http://sphinxsearch.com/docs/2.0.6/attributes.html

  6. Aurimas,

    Thanks for this article. We have been facing the issue of attribute-based filtering and the full scan it generates. You speak of misuse by many of your clients but, in our case, it is something we cannot avoid if we want full functionality. We have been using the fake-keyword technique you describe.

    It works, but it does not cover the case where the value(s) of the attribute change over time – unless you re-index your documents accordingly, which is costly and somewhat inefficient.

    It would be great if Sphinx tackled this general issue by offering to index attributes, on a case-by-case basis, with a B-tree structure. We are discussing this evolution with the Sphinx dev team and, if you know of companies that would be interested, it would be great if they could join the discussion.

  7. Peter,

    You remember right ;) we do indeed have a small block index. We save min/max column values for every 128 rows. When doing a full scan, we scan the blocks first, then maybe scan the rows within a block. That isn’t a generic solution, but it lets us speed up particular scans where the target column is more or less correlated with the ID.

  8. Peter, Andrew – thanks, that’s good to know. I’m sure in some cases this can be a big help.

    Laurent – I’m not sure I understand why it would not work properly with a delta index? I assume you would reindex all of the documents that have changed since the last full indexing (even if only the attribute data changed), in which case any such changes should be available after delta reindexing. I agree a B-tree on attributes would be great, though I’m not sure it isn’t too much to ask of a full text search engine.

    Anyway, if I get to work with any customers who would benefit from that, I’ll definitely mention it.

  9. Aurimas,

    It could work properly with a delta index, but only to a certain extent. Said differently, it does not scale if the value(s) of the (multi-value) attribute change often (potentially several times a day) and/or the volume of documents impacted is high (x00 thousand docs impacted, over a corpus of x00 million docs, for every change). In such a situation, the time and resources taken to produce (and eventually merge) the delta index are a show-stopper.

    In our case, we provide a kind of ‘dynamic grouping’. Let’s say each document has an author. We’ve got one (big) index for all the docs, but our users are free to group authors arbitrarily and create new groups on demand, whenever they want. They want to search inside each group independently, immediately after they have created it. While we’re clearly very satisfied with Sphinx functionality and performance, we have not been able to fulfill this requirement with the current feature set. We have thought of B-tree indexed attributes as the solution to this need.

  10. We use Sphinx on our website, where we have about 1200 tables, each with hundreds of thousands of records, and we filter with Sphinx based on attributes. But memory consumption was excessive, so we applied the technique described here – and search times have gone from milliseconds to more than 43 seconds.

  11. Fabian,

    Please note that, as stated in the article, this technique only works well for highly selective attributes. If attributes only have a few distinct values (e.g. status, gender, or group_id with only 10 groups per 10M records), then you don’t want to use this. Please see the “Highly selective columns only” section for an example where it does not work well for us either.
