Preparing to move I’m selling stuff on GumTree which is UK based clone of Craigslist offering similar functionality but with Ads 🙂

Similarly to Craigslist GumTree is powered by MySQL but unlike craigslist it looks like they are shy in tuning their MySQL Full Text Search index configuration and setting proper ft_min_word_len value. Searching for Audi A6 will give you a lot of Audi with different model numbers. Searching just for A6 gives no results at all. At the same time we can find V40 which means ft_min_word_len was adjusted from default value of 4 which probably had even worse search quality to value 3.

CraigsList seems to be using value 2 because it is able to find A6 but still fails to find data with 1 character keyword, say Mazda 3

Gumtree also seems to have another issue – as said on their web site when you post an item it takes up to 3-4 hours for items to become visible in search results. I’m wondering if that is replication lag or issues with full text search index updates as I can’t imagine this being user experience you would desire.

In general I think both of this sites have the same mistake (or shortcut) in MySQL Full Text Search configuration – relaying on minimum keyword length while their application has use cases when search for short keywords make sense.

It is much more reasonable to look at frequencies at keywords not their length and if you happen to have some very frequent keywords you can’t afford to index for performance or search quality reasons you better analyze word frequencies and build stop word list which matches your application. For example “A6” unlikely needs to be stop word even if it is short while “IN” surely needs to be. Besides general terms as in you may have your application specific stop words such as “com” or “html”.

In general performance should be the only thing which forces you to have stop word, as you can see Google for example has none, even searching for “a” gives some results (Though of course google takes other kinds of shortcuts) This however is frequently not the case with MySQL Full Text Search which both will skip to frequent keywords in natural language search and because it is using frequency ranking indexing too frequent keywords can add noise and actually get relevance worse.

Looking at alternative solution, which we’re often using with MySQL – Sphinx Search we still tend to use stop word list for performance reasons, but it is normally good enough to keep it very short. Plus as in new versions you can set the time limit on query execution it is rather safe to search a very complex query and just get partial results if it can’t complete in time. From Relevance standpoint “stop words” rarely cause problems with sphinx because it uses word position based ranking not just frequency based ranking by default which means if you’re searching for “a the” the documents which contain this strange phrase will be ranked on top rather than those which just contain a lot of “a” and a lot of “the” articles.

6 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Julian

I don’t quite understand. Are you sure they all have their data in Myisam tables with fulltext index?

Samuel

I’m guessing that the 3-4 hour lag before posts become visible in search results at Gumtree is because they (semi)manually check the ads before publishing. I once had anad for some furniture and linked to pictures hosted elsewhere. Got an email notification saying that the ad will not be published, because it didn’t pass their what ever criteria. But right you are: “I can’t imagine this being user experience you would desire.”

Dave

You can do two letter searches on Gumtree now – http://www.gumtree.com/cgi-bin/list_postings.pl?search_terms=A3&posting_cat=2477

Yes it is Mysql full text.

The three to four hour lag is worse case scenario, the reality it that ads are live in a few minutes.

laki

here is one simple question.

main_text|first_complement_text|alternative_complement_text

i need to search against each individually, or against main paired with one or the other complement, should i also have individual indexes, or only multi column?

1) fulltext(main),fulltext(first),fulltext(alternative),fulltext(main,first),fulltext(main,alternative)

or

2) —————fulltext(first),fulltext(alternative),fulltext(main,first),fulltext(main,alternative)

i know that main will be covered by either one of last two, but will it be faster for search to have also index on only main column?