A distributed index setup for Sphinx

Sphinx search is a powerful search engine. Recently we released it (version 0.9.9-rc2) as the backend for most of the searches on one of our high-volume websites. This site has about 360.000 visitors a day that generate about 4.500 search queries for the Sphinx backend per minute on average, peaking to nearly 9.000 per minute when it gets busy on the site. To be able to handle that many requests we currently run Sphinx on four dedicated servers.

A problem with having more than one sphinx server is that you need to make sure the results from the different server are close to the same. Since it is possible to switch between servers for two consecutive searches (which on the site in question could also be a browsing action, for example moving from one page of results to the next) it could be very confusing if the search result were different.

With Sphinx there are a number of ways to solve this problem. The most commonly used solutions are:

  • run the indexer on one server and make those indexing results available to all the other servers (through scp, rsync, or hosting on a shared filesystem)
  • using a distributed index setup

The first should work, but is actually not recommended by the makers of Sphinx. We went for the second solution: a distributed index setup.

Full and delta indices

Our four servers do a full reindex every night (kicked off by cron). These run sequentially, so each server starts out with a different base index. With cron we run the delta index on each server every three minutes. These delta indices only reindex the last added records since the server’s last full index the night before. This delta index grows throughout the day and starts off taking less than 15 minutes and grows to take about 45 minutes just before the full index is re-run.

The setup for these indices, let’s call them base and delta, looks something like this (for the server with server_id = 2):

source base {
  type            = mysql
  sql_query_range = SELECT MIN(nr),MAX(nr) FROM object_table
  sql_range_step  = 10000
  sql_query_pre   = REPLACE INTO sph_counter SELECT 2, MAX(nr) FROM object_table
  sql_query       = SELECT nr AS id, uid, \
                    rootnr, \
                    is_active, \
                    title, \
                    [ .. and all other fields we want in index .. ], \
                    FROM object_table \
                    WHERE is_active = '1' \
                      AND nr >= $start \
                      AND nr <= $end \
                      AND nr <= ( SELECT max_doc_id  \
                                    FROM sph_counter \
                                   WHERE server_id=2 )
  sql_attr_uint   = rootnr_attr
  sql_attr_uint   = is_active_attr
  sql_attr_uint   = uid_attr
  [ .. ]
}
source delta : base {
  sql_query       = SELECT nr AS id, uid, \
                    rootnr, \
                    is_active, \
                    title, \
                    [ .. and all other fields we want in index .. ], \
                    FROM object_table \
                    WHERE is_active = '1' \
                      AND nr >= $start \
                      AND nr <= $end \                      
                      AND nr > ( SELECT max_doc_id  \
                                   FROM sph_counter \
                                  WHERE server_id=2 )
}

The table sph_counter is just a table that tracks the max_doc_id at the time of the last full index run for each server (set by the ‘REPLACE ..’ sql_query_pre in base).

Each server has a similar setup as above, so that they all have a full index from the previous night, and a delta index that covers all the changes from that time up to the last delta index run.

The distributed index

Now we create a distributed index on all the servers that queries the local base and delta indices, and also all the delta indices on the other three servers. That looks something like this:

index main {
    type                    = distributed
    local                   = base
    local                   = delta
    agent                   = 192.168.33.57:3312:delta
    agent                   = 192.168.33.56:3312:delta
    agent                   = 192.168.33.13:3312:delta
    agent_connect_timeout   = 200
    agent_query_timeout     = 1000
}

In our code we exclusively query the index named main. This index combines the base index from the particular random server the code happened to choose with the delta indices of all four servers. A distributed index combines the results from all of these, and discards the duplicates automatically. This seems to work pretty well.

Things to look at

One downside to this setup is that each search kicks off a search on the machine it connects to, but also sends a search on the delta index to all the other servers. At the moment it seems to work, but at busy moments we see a bit of an increase in our response times for our searches. It might be better to concentrate the delta index on one server dedicated to the delta index and leave the other three to concentrate on the base index. This is something we plan on looking into soon. So, possibly an update soon… ?

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Please copy the string L6WXiR to the field below: