Sphinx search is a powerful search engine. Recently we released it (version 0.9.9-rc2) as the backend for most of the searches on one of our high-volume websites. This site has about 360.000 visitors a day that generate about 4.500 search queries for the Sphinx backend per minute on average, peaking to nearly 9.000 per minute when it gets busy on the site. To be able to handle that many requests we currently run Sphinx on four dedicated servers.
A problem with having more than one sphinx server is that you need to make sure the results from the different server are close to the same. Since it is possible to switch between servers for two consecutive searches (which on the site in question could also be a browsing action, for example moving from one page of results to the next) it could be very confusing if the search result were different.
With Sphinx there are a number of ways to solve this problem. The most commonly used solutions are:
- run the indexer on one server and make those indexing results available to all the other servers (through scp, rsync, or hosting on a shared filesystem)
- using a distributed index setup
The first should work, but is actually not recommended by the makers of Sphinx. We went for the second solution: a distributed index setup.
Full and delta indices
Our four servers do a full reindex every night (kicked off by cron). These run sequentially, so each server starts out with a different base index. With cron we run the delta index on each server every three minutes. These delta indices only reindex the last added records since the server’s last full index the night before. This delta index grows throughout the day and starts off taking less than 15 minutes and grows to take about 45 minutes just before the full index is re-run.
The setup for these indices, let’s call them base and delta, looks something like this (for the server with server_id = 2):
type = mysql
sql_query_range = SELECT MIN(nr),MAX(nr) FROM object_table
sql_range_step = 10000
sql_query_pre = REPLACE INTO sph_counter SELECT 2, MAX(nr) FROM object_table
sql_query = SELECT nr AS id, uid, \
rootnr, \
is_active, \
title, \
[ .. and all other fields we want in index .. ], \
FROM object_table \
WHERE is_active = '1' \
AND nr >= $start \
AND nr <= $end \
AND nr <= ( SELECT max_doc_id \
FROM sph_counter \
WHERE server_id=2 )
sql_attr_uint = rootnr_attr
sql_attr_uint = is_active_attr
sql_attr_uint = uid_attr
[ .. ]
}
sql_query = SELECT nr AS id, uid, \
rootnr, \
is_active, \
title, \
[ .. and all other fields we want in index .. ], \
FROM object_table \
WHERE is_active = '1' \
AND nr >= $start \
AND nr <= $end \
AND nr > ( SELECT max_doc_id \
FROM sph_counter \
WHERE server_id=2 )
}
The table sph_counter is just a table that tracks the max_doc_id at the time of the last full index run for each server (set by the ‘REPLACE ..’ sql_query_pre in base).
Each server has a similar setup as above, so that they all have a full index from the previous night, and a delta index that covers all the changes from that time up to the last delta index run.
The distributed index
Now we create a distributed index on all the servers that queries the local base and delta indices, and also all the delta indices on the other three servers. That looks something like this:
type = distributed
local = base
local = delta
agent = 192.168.33.57:3312:delta
agent = 192.168.33.56:3312:delta
agent = 192.168.33.13:3312:delta
agent_connect_timeout = 200
agent_query_timeout = 1000
}
In our code we exclusively query the index named main. This index combines the base index from the particular random server the code happened to choose with the delta indices of all four servers. A distributed index combines the results from all of these, and discards the duplicates automatically. This seems to work pretty well.
Things to look at
One downside to this setup is that each search kicks off a search on the machine it connects to, but also sends a search on the delta index to all the other servers. At the moment it seems to work, but at busy moments we see a bit of an increase in our response times for our searches. It might be better to concentrate the delta index on one server dedicated to the delta index and leave the other three to concentrate on the base index. This is something we plan on looking into soon. So, possibly an update soon… ?