I’ve already wrote a few times about various projects using Sphinx with MySQL for scalable Full Text Search applications. For example on BoardReader we’re using this combination to build search against over 1 billion of forum posts totaling over 1.5TB of data handling hundreds of thousands of search queries per day.
The count of forum posts being large, is however not the largest we’ve got to deal in the project – number of links originating from forum posts is a bit larger number.
The task we had for links is to be able to search for links pointing to given web site as well. The challenge in this case is we do not only want to match links directed to “mysql.com” but links to “www.mysql.com” or “dev.mysql.com/download/” as well as they are all considered to belong to mysql.com domain, while searching for “dev.mysql.com/download”" will only match dev.mysql.com domain and files within /download/ directory.
Initially we implemented it in MySQL using partitioning by domain which link was pointing to. So “mysql.com” links were stored in one table group and “google.co.uk” on another. We still had serious challenges however – as each applies to many search URLS,
such as “dev.mysql.com/download/mysql-5.1.html” would match “mysql.com”, “dev.mysql.com”, “dev.mysql.com/download/” and
“dev.mysql.com/download/mysql-5.1.html” we could not use link=const where clause but had to use link like “prefix%” which means index could not be used to get 20 last links and filesort over millions of links we had to youtube.com wikipedia.org and other top domains was extremely slow. Not to mention counting number of links (and number number of distinct forum sites) pointing to the given URL or graphs showing number of links per day. To fight this problem we had to restrict number of days we allow to cover based on the amount of links to the domain… but for some top domains it was slow even with just 3 days worth of data.
You might point out if we had link_date between X and Y and link like “prefix%” kind of where clause we would not be able to use index past link_date part, it is true so we had to use link_date in (
) and link like “prefix%” which allows to use both keyparts which is much better but not good enough.
Caching is not good enough in such case as we do not want a single user to wait for minutes. large variety of problematic search urls does not allow to use pre-caching not to mention general load on server such batch processing would put.
The first alternative to this approach was to store duplicate data storing link to “dev.mysql.com/download/mysql-5.1.html” as links to 4 url prefixes I mentioned above. Unfortunately this would blow up data stored quite significantly, requiring in average of 6 rows for each link and it does not solve all the problems – result counting and number of distinct sites were still pretty slow and we did not want to go into creating all this data as summary tables.
Instead we decided to use Sphinx for this kind of task which proved to be extremely good idea. We converted all URLs to search keywords and now these 6 rows become simply one row in sphinx index with 6 “keywords” – specially crafter strings which corresponded to the URLs. Of course we did not store these in the table but instead used UDF to convert URL to list of “keywords” on the fly.
As results we now can pull up results even for youtube.com for fractions of the second and we could show 3 months worth of data for any URLs. (We could use longer time span but we did not have enough memory for Sphinx attribute storage). It is especially great as there is still room for optimization – Sphinx stores word positions in the index, while we do not need them in this case as we’re doing kind of “boolean full text search”. Plus we can make index built sorted by timestamp which would allow to same on sorting which is now still happening.
Using Sphinx such non-traditional way required implementing some features more traditional for SQL databases rather than full text search applications. Group By was added to Sphinx so we could search number of matches per day, or number of matches per language.
For Domain Profile we’ve got to use even more of those features such as counting number of distinct sites pointing to the given url or domain etc. Honestly this is where we cheated a bit and distinct number is bit approximate for large numbers but it still works really well for our needs.
Sure we could use summary tables for domains but it would be a lot of work and raver inflexible if we would like to add some more features and take a lot of resources to update or periodically build for millions of domains.
As this part worked pretty well we also decided to use Sphinx for other part of the web site – Forum Site Profile. This uses some pre-generated data such as number of posts in total for forum or in thread but most of other stuff is built with Sphinx. This also uses fair amount of tricks using fake full text search to retrieve all posts from given web site or forum from the global sphinx index.
So in general we find parallel processing using sphinx pretty good solution for many data crunching needs especially when lack of parallel abilities in MySQL makes its use rather inconvenient and pre-generating data is inconvenient or impossible.
If you’re attending OSCON 2007 and would like to learn more about Sphinx we have a BOF on Thursday to talk on the topic.