Posted: February 28th, 2010 | Author: Michel Rijnders | Filed under: Books, Mac, NLTK, Python | No Comments »
While reading “Natural Language Processing with Python” I ran into problems on my Mac with examples that were using the dispersion_plot function: calls to the function returned immediately without displaying anything.
Turns out matplotlib’s back-end wasn’t configured properly. To fix this I had to add an rc file (matplotlibrc) to my ~/.matplotlib directory. The rc file contains the following:
And, hey presto:

(disclaimer: “Works on my machine!”)
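The contents of the rc file are not reproduced above. As a rough sketch of the mechanism only, the following Python writes a one-line matplotlibrc that selects a back-end. The helper name `write_matplotlibrc` is hypothetical, and the back-end name is left to the caller because the value used in this post is not shown.

```python
from pathlib import Path


def write_matplotlibrc(config_dir, backend):
    """Write a minimal matplotlibrc that only sets the back-end.

    Hypothetical helper: both arguments come from the caller, and nothing
    here checks that the named back-end is actually installed.
    """
    config_dir = Path(config_dir).expanduser()
    config_dir.mkdir(parents=True, exist_ok=True)
    rc_path = config_dir / "matplotlibrc"
    # matplotlibrc files consist of simple "key: value" lines.
    rc_path.write_text(f"backend: {backend}\n")
    return rc_path
```

A call such as `write_matplotlibrc("~/.matplotlib", "TkAgg")` would create the file in the directory mentioned above; `TkAgg` is only an example value and may not be the right choice for every machine.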
Posted: February 9th, 2010 | Author: Thijs Oppermann | Filed under: Sphinx search | Tags: search, setup, sphinx | No Comments »
Sphinx search is a powerful search engine. Recently we released it (version 0.9.9-rc2) as the backend for most of the searches on one of our high-volume websites. This site has about 360,000 visitors a day who generate about 4,500 search queries for the Sphinx backend per minute on average, peaking at nearly 9,000 per minute when the site gets busy. To be able to handle that many requests we currently run Sphinx on four dedicated servers.
A problem with having more than one Sphinx server is that you need to make sure the results from the different servers are nearly identical. Two consecutive searches can be handled by different servers (and on the site in question a "search" can also be a browsing action, for example moving from one page of results to the next), so it would be very confusing if the search results were different.
With Sphinx there are a number of ways to solve this problem. The most commonly used solutions are:
- run the indexer on one server and make those indexing results available to all the other servers (through scp, rsync, or hosting on a shared filesystem)
- use a distributed index setup
The first should work, but is actually not recommended by the makers of Sphinx. We went for the second solution: a distributed index setup.
Posted: February 8th, 2010 | Author: Ward Bekker | Filed under: Open Source Projects, Ruby, Software Engineering | No Comments »
In this code snippet you can see how to do a basic ranked text search for MongoDB. The code relies on two simple mapreduce operations: one to create an inverted index from some demo text, and a second one to score the matching documents based on query term hits.
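The snippet itself is not included here. As a minimal sketch of the two passes described above, the following plain Python emulates them in memory. It does not talk to MongoDB; `map_reduce`, `build_inverted_index`, `search`, and the demo documents are all hypothetical names and data chosen for illustration, and scoring is assumed to be a simple count of matching query terms.

```python
from collections import defaultdict


def map_reduce(items, mapper, reducer):
    """Tiny in-memory stand-in for a map/reduce pass (not MongoDB's API)."""
    groups = defaultdict(list)
    for item in items:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def build_inverted_index(docs):
    """Pass 1: map each term to the sorted ids of documents containing it."""
    def mapper(item):
        doc_id, text = item
        for term in set(text.lower().split()):
            yield term, doc_id

    def reducer(term, doc_ids):
        return sorted(doc_ids)

    return map_reduce(docs.items(), mapper, reducer)


def search(index, query):
    """Pass 2: score documents by the number of query terms they match."""
    terms = set(query.lower().split())

    def mapper(item):
        term, doc_ids = item
        if term in terms:
            for doc_id in doc_ids:
                yield doc_id, 1

    def reducer(doc_id, hits):
        return sum(hits)

    scores = map_reduce(index.items(), mapper, reducer)
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical demo text.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs are lazy",
}
index = build_inverted_index(docs)
results = search(index, "quick lazy")
```

Here `results` is a list of `(document id, score)` pairs, with document 3 ranked first because it contains both query terms.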
Posted: February 8th, 2010 | Author: Ward Bekker | Filed under: Uncategorized | 4 Comments »
For a customer we have developed log analytics software. It currently uses MySQL as the database backend. The system reads in an hourly log file and calculates all kinds of fancy statistics. I wanted to see how the system would work if I used MongoDB, a schema-less document DB, instead of MySQL. My impressions in no particular order:
- Importing log data is much easier than on MySQL because MongoDB is schema-less. Just create a collection (=bucket) and insert every log line into it as a hash. For log files that don’t have a fixed number of fields, it’s a great fit.
- Like MySQL, you do need to create indexes to make searching fast(er).
- MongoDB supports map reduce operations. It made some of the calculations much more elegant and more readable than the code that was written for MySQL.
- Chaining of map reduce operations is supported, and works as you would expect.
- Queries are written in JavaScript. I’m happy that they didn’t invent yet another ‘scripting’ language. JavaScript looks capable enough.
- Map reduce operations are not particularly fast. They are upgrading their JavaScript engine to V8 to improve the execution speed.
- The MongoDB community is nowhere near the size of MySQL’s. Don’t expect a lot of Google results for a specific MongoDB issue. The moderated Google group is a better place to go currently.
- I liked the API. Calls are not verbose and their intended use is easy to understand.
- Although quite capable, MongoDB is still a young project. I need to have more time with it before using it on a customer project.
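To illustrate the point about schema-less imports, here is a small sketch that turns log lines with a varying set of fields into hashes (Python dicts). The `key=value` log format, the field names, and the helper `parse_log_line` are assumptions made for this example, not the customer's actual format. The database insert itself is left out; each dict would simply be handed to the driver's insert call for a collection.

```python
def parse_log_line(line):
    """Turn 'key=value key=value ...' into a dict; fields may vary per line."""
    doc = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            doc[key] = value
    return doc


# Hypothetical log lines; the second one carries an extra field.
lines = [
    "ts=2010-02-08T10:00 status=200 path=/index",
    "ts=2010-02-08T10:01 status=404 path=/missing referrer=example.org",
]
documents = [parse_log_line(line) for line in lines]
```

Because no table definition is involved, the extra `referrer` field on the second line needs no schema change; it just becomes one more key in that document.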