Simple ranked text search for MongoDB

Posted: February 8th, 2010 | Author: Ward Bekker | Filed under: Open Source Projects, Ruby, Software Engineering | No Comments »

In this code snippet you can see how to do a basic ranked text search for MongoDB. The code relies on two simple map reduce operations: one to create an inverted index from some demo text, and a second one to score the matching documents based on query term hits.
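To make the two steps concrete, here is a minimal in-memory sketch in plain Ruby. It does not talk to MongoDB at all; the demo documents, the `build_index` and `search` names, and the scoring rule (summing how often each query term occurs in a document) are assumptions for illustration, not the original snippet.

```ruby
# Hypothetical demo documents; ids and texts are made up for illustration.
DOCS = {
  1 => "the quick brown fox",
  2 => "the lazy dog",
  3 => "quick quick dog"
}

# Step 1: build an inverted index, term => { doc_id => occurrences }.
# This plays the role of the first map reduce operation.
def build_index(docs)
  index = Hash.new { |hash, term| hash[term] = Hash.new(0) }
  docs.each do |id, text|
    text.downcase.scan(/\w+/).each { |term| index[term][id] += 1 }
  end
  index
end

# Step 2: look up every query term and sum the hits per document.
# This plays the role of the second map reduce operation.
def search(index, query)
  scores = Hash.new(0)
  query.downcase.scan(/\w+/).each do |term|
    next unless index.key?(term)
    index[term].each { |id, hits| scores[id] += hits }
  end
  # Highest score first; ties are broken by document id.
  scores.sort_by { |id, score| [-score, id] }
end

search(build_index(DOCS), "quick dog").each do |id, score|
  puts "doc #{id}: #{score}"
end
```

Document 3 ranks first here because it matches both query terms, one of them twice.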


MongoDB first impressions

Posted: February 8th, 2010 | Author: Ward Bekker | Filed under: Uncategorized | 4 Comments »

For a customer we have developed log analytics software. It currently uses MySQL as the database backend. The system reads in an hourly log file and calculates all kinds of fancy statistics. I wanted to see how the system would work if I used MongoDB, a schema-less document DB, instead of MySQL. My impressions, in no particular order:

  • Importing log data is much easier than with MySQL because MongoDB is schema-less. Just create a collection (=bucket) and insert every log line into it as a hash. For log files that don’t have a fixed number of fields, it’s a great fit.
  • As with MySQL, you do need to create indexes to make searching fast(er).
  • MongoDB supports map reduce operations. They made some of the calculations much more elegant and more readable than the code that was written for MySQL.
  • Chaining of map reduce operations is supported, and works as you would expect.
  • Queries are written in JavaScript. I’m happy that they didn’t invent yet another ‘scripting’ language. JavaScript looks capable enough.
  • Map reduce operations are not particularly fast. They are upgrading their JavaScript engine to V8 to improve the execution speed.
  • The MongoDB community is nowhere near the size of MySQL’s. Don’t expect a lot of Google results for a specific MongoDB issue. The moderated Google group is a better place to go currently.
  • I liked the API. Calls are not verbose and their intended use is easy to understand.
  • Although quite capable, MongoDB is still a young project. I need to spend more time with it before using it on a customer project.
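As a rough illustration of the first and third points, here is a sketch in plain Ruby, without the MongoDB driver. The key=value log format and the `parse_log_line` and `hits_per_status` helpers are hypothetical; the point is only that each line becomes its own hash, whatever fields it has, and that a count per key reads naturally as a map step followed by a reduce step.

```ruby
# Hypothetical log format: space-separated key=value pairs, with a
# different set of keys from line to line.
LINES = [
  "ts=2010-02-08T10:00 status=200 path=/index",
  "ts=2010-02-08T10:01 status=404 path=/missing referer=/index",
  "ts=2010-02-08T10:02 status=200"
]

# Turn one log line into a hash; there is no schema, so every line
# simply keeps whatever fields it happens to have.
def parse_log_line(line)
  line.split.each_with_object({}) do |pair, doc|
    key, value = pair.split("=", 2)
    doc[key] = value
  end
end

# "Map" every document to its status, then "reduce" by counting per status.
def hits_per_status(docs)
  docs.each_with_object(Hash.new(0)) { |doc, counts| counts[doc["status"]] += 1 }
end

docs = LINES.map { |line| parse_log_line(line) }
hits_per_status(docs).each { |status, count| puts "#{status}: #{count}" }
```

With the three sample lines this counts two 200 responses and one 404.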

My Reading List for 2010

Posted: January 9th, 2010 | Author: Michel Rijnders | Filed under: Books, Programming Language Theory | No Comments »

One of the suggestions of “The Pragmatic Programmer” is that you should learn at least one new programming language every year. This is a great suggestion, but after a couple of years its usefulness diminishes, e.g. if one already knows Perl and Python, then the payback on learning Ruby is rather small. Therefore I’m going to concentrate on the foundations of programming languages this year. Here’s my tentative reading list:

Suggestions welcome.


Ruby Quiz, Haskell Solution: LCD Numbers

Posted: December 17th, 2009 | Author: Michel Rijnders | Filed under: Haskell, Ruby Quiz, Uncategorized | 2 Comments »

A solution to Ruby Quiz #14 in literate Haskell:

LCD Numbers
===========

Problem
-------

[original source](http://rubyquiz.com/quiz14.html)

This week's quiz is to write a program that displays LCD style numbers
at adjustable sizes.

The digits to be displayed will be passed as an argument to the
program. Size should be controlled with the command-line option -s
followed by a positive integer. The default value for -s is 2.

For example, if your program is called with:

    $ lcd.rb 012345

The correct display is:

     --        --   --        --
    |  |    |    |    | |  | |
    |  |    |    |    | |  | |
               --   --   --   --
    |  |    | |       |    |    |
    |  |    | |       |    |    |
     --        --   --        -- 

And for:

    $ lcd.rb -s 1 6789

Your program should print:

     -   -   -   -
    |     | | | | |
     -       -   -
    | |   | | |   |
     -       -   - 

Note the single column of space between digits in both examples. For
other values of -s, simply lengthen the - and | bars.

Solution
--------

Module declaration and imports:

> module Main where
>
> import Data.Char (digitToInt)
> import Data.List (intersperse)
> import System.Console.GetOpt
> import System.Environment (getArgs)

First we define the numbers at size 1:

> n0 = [ " - "
>      , "| |"
>      , "   "
>      , "| |"
>      , " - "
>      ]
>
> n1 = [ "   "
>      , "  |"
>      , "   "
>      , "  |"
>      , "   "
>      ]
>
> n2 = [ " - "
>      , "  |"
>      , " - "
>      , "|  "
>      , " - "
>      ]
>
> n3 = [ " - "
>      , "  |"
>      , " - "
>      , "  |"
>      , " - "
>      ]
>
> n4 = [ "   "
>      , "| |"
>      , " - "
>      , "  |"
>      , "   "
>      ]
>
> n5 = [ " - "
>      , "|  "
>      , " - "
>      , "  |"
>      , " - "
>      ]
>
> n6 = [ " - "
>      , "|  "
>      , " - "
>      , "| |"
>      , " - "
>      ]
>
> n7 = [ " - "
>      , "  |"
>      , "   "
>      , "  |"
>      , "   "
>      ]
>
> n8 = [ " - "
>      , "| |"
>      , " - "
>      , "| |"
>      , " - "
>      ]
>
> n9 = [ " - "
>      , "| |"
>      , " - "
>      , "  |"
>      , " - "
>      ]
>

Put the numbers in a list:

> numbers = [n0,n1,n2,n3,n4,n5,n6,n7,n8,n9]

Horizontal scaling function, given a string replicate the second
character n times:

> hscale n cs = head cs : replicate n (cs!!1) ++ [last cs]

Vertical scaling function, repeat the second and fourth row n times:

> vscale n css = head css : replicate n cs1 ++ [cs2] ++ replicate n cs3 ++ [cs4]
>   where cs1 = css !! 1
>         cs2 = css !! 2
>         cs3 = css !! 3
>         cs4 = last css

Scale function; note this function scales a single number:

> scale n = vscale n . map (hscale n)

Function that converts a list of numbers to a string of LCD numbers:

> lcd n = concat .
>         intersperse "\n" .
>         foldr1 (zipWith (++)) .
>         intersperse (replicate (3 + 2*n) " ") .
>         map (scale n . (numbers !!))

`main` function:

> main = do
>   args <- getArgs
>   let (n, digits) = parseArgs args
>   putStrLn $ lcd n $ map digitToInt digits

Command-line argument parsing:

> data Flag = Scale Int
>             deriving Eq
>
> options = [Option "s" [] (ReqArg (Scale . read) "") ""]
>
> parseArgs args =
>   case parse args of
>    (_, [], _)              -> error "Usage: lcd [-s n] digits"
>    ([], digits, [])        -> (2, head digits)
>    ([Scale n], digits, []) -> (n, head digits)
>    (_, _, _)               -> error "Usage: lcd [-s n] digits"
>   where
>     parse = getOpt RequireOrder options

Compiling Apache from source on Ubuntu 9.10 (Karmic Koala)

Posted: November 11th, 2009 | Author: Henry Snoek | Filed under: Open Source Projects | 1 Comment »

Last week my OS got upgraded to Ubuntu 9.10. After that I wanted to compile Apache from source. Unfortunately I got this build error:

htpasswd.c:101: error: conflicting types for ‘getline’
/usr/include/stdio.h:651: note: previous declaration of ‘getline’ was here
make[2]: *** [htpasswd.o] Error 1

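The name clashes with the getline that glibc now declares on line 651 of /usr/include/stdio.h. This is fixed by renaming Apache’s own getline function to parseline in htpasswd.c (the definition on line 101 and every call to it), leaving the system header untouched.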

Kudos to HowtoForge for pointing this out.


The Myth of the Page Fold

Posted: November 9th, 2009 | Author: Michel Rijnders | Filed under: Web Development | No Comments »

Nice article dispelling the myth of the page fold being an impenetrable barrier for users.

Update: page fold: myth or reality?



Slides Haskell Workshop

Posted: November 8th, 2009 | Author: Michel Rijnders | Filed under: Haskell | 1 Comment »

Haskell Workshop

The slides for the workshop on Haskell and functional programming I gave yesterday at Devnology’s Community Day.


Ruby Quiz, Haskell Solution: Sampling

Posted: September 27th, 2009 | Author: Michel Rijnders | Filed under: Haskell, Ruby Quiz | No Comments »

The Quiz

A classic sampling problem: write a program sample which takes two integers n and m as input. n is the size of the sample. m is the size of the population. The program should print out n random unique indices. Two example runs:

$ ./sample 3 10
0
2
8
$ ./sample 3 10
1
2
9

The output must be sorted. The complete, original quiz is here.

A Haskell Solution

Take One

My first (naïve) attempt uses a list of integers to represent the pool still available (i.e. the population not sampled yet). When it has to draw a sample it takes a random index i into the list and removes the element at index i from the list, thus guaranteeing the uniqueness of the generated indices. It works correctly but it runs out of memory for the "big sample" (n = 5,000,000 and m = 1,000,000,000) mentioned in the original quiz, which is not very surprising since it keeps both the current samples and the pool still available in memory. It is also quite slow because of the use of a plain list.

module Main where

import Control.Monad.State
import Data.List (delete, sort)
import System.Environment (getArgs)
import System.Random

main :: IO ()
main = do
  args <- getArgs
  let n = read (args !! 0) :: Int
      m = read (args !! 1) :: Int
  gen <- getStdGen
  -- The population has m members, so the valid indices are 0 .. m - 1.
  let init = RandomPool [0 .. m - 1] gen
      result = evalState (sample n) init
  mapM_ print (sort result)

data RandomPool = RandomPool { pool :: [Int], gen :: StdGen }

type StateRP = State RandomPool

-- Draw n indices, removing each one from the pool so it cannot repeat.
sample :: Int -> StateRP [Int]
sample 0 = return []
sample n = do
  st <- get
  let hi = length (pool st) - 1
      (i, gen') = randomR (0, hi) (gen st)
      x = pool st !! i
      pool' = delete x (pool st)
  put RandomPool { pool = pool', gen = gen' }
  xs <- sample (n - 1)
  return (x:xs)

Take Two

My second attempt solves the memory problem by keeping only the current samples in memory. When it has to draw a sample it takes a random number x between 0 and m - 1 and checks if that number has already been used. If the number has been used it tries again. This solution also uses the Data.Set module for increased performance.

module Main where

import Control.Monad.State
import Data.List (sort)
import qualified Data.Set as S
import System.Environment (getArgs)
import System.Random

main :: IO ()
main = do
  args <- getArgs
  let n = read (args !! 0) :: Int
      m = read (args !! 1) :: Int
  gen <- getStdGen
  let init = RandomSet S.empty gen
      result = evalState (sample m n) init
  mapM_ print (sort result)

data RandomSet = RandomSet { set :: S.Set Int, gen :: StdGen }

type StateRS = State RandomSet

sample :: Int -> Int -> StateRS [Int]
sample hi n =
  if n == 0
    then do st <- get
            return (S.toList (set st))
    else do draw hi
            sample hi (n - 1)

-- Keep drawing until we hit an index that is not in the set yet.
draw :: Int -> StateRS ()
draw hi = do
  st <- get
  let (x, gen') = randomR (0, hi - 1) (gen st)
  if x `S.member` set st
     then do
       put st { gen = gen' }
       draw hi
     else
       -- Store the advanced generator together with the new element;
       -- writing back the old one would replay the same draw next time.
       put st { set = S.insert x (set st), gen = gen' }

Here's an example run for the big sample. Note that I have to increase the maximum stack size for individual threads (+RTS -K250m) to prevent a stack space overflow:

$ time ./sample 5000000 1000000000 +RTS -K250m > big_sample.txt 

real    23m24.355s
user    23m1.658s
sys     0m9.548s
$ ls -l big_sample.txt
-rw-r--r--  1 mies  staff  49483467 Sep 27 17:13 big_sample.txt
$ head big_sample.txt
243
280
416
494
556
602
804
909
970
1126
$ tail big_sample.txt
999998483
999998863
999999002
999999028
999999052
999999053
999999115
999999291
999999853
999999870

The code plus solutions to other quizzes is available on GitHub.
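Since this started out as a Ruby Quiz, here is the idea behind Take Two as a short Ruby sketch: remember only the indices drawn so far and redraw on a collision. The `sample` signature and the optional `rng` argument are my own choices for illustration, and I have not timed it against the big sample.

```ruby
require "set"

# Draw n unique indices from 0...m by rejection: a number that is
# already in the set leaves the set size unchanged, so we draw again.
def sample(n, m, rng = Random.new)
  raise ArgumentError, "n must not exceed m" if n > m

  seen = Set.new
  seen << rng.rand(m) while seen.size < n
  seen.to_a.sort
end

puts sample(3, 10)
```

Like the Haskell version, this only stays fast while n is small compared to m; as the set fills up, collisions become more and more likely.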


100% completeness-fu

Posted: September 23rd, 2009 | Author: Josh Kalderimis | Filed under: Open Source Projects, Rails, Ruby, Web Development | 4 Comments »

The age of completeness-fu is upon us!

Sometimes validations just don’t cut the mustard and all you want to do is to grade an instance based on how complete its information is. For example, a Location has a title and a description but no address, thus it’s only 60% complete. Or maybe title is worth more than description and address, so it’s 80% complete. Whatever the case, this is not a new problem and reinventing the wheel is a bit unnecessary, so welcome to completeness-fu.
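The arithmetic behind such a score can be sketched in a few lines of plain Ruby. This is not how completeness-fu is implemented; the `completeness` helper and the weights are hypothetical, picked so that the two percentages above come out.

```ruby
# Percentage of the total weight earned by the fields that are filled in.
# A field counts as filled in when its value is neither nil nor blank.
def completeness(record, weights)
  total  = weights.values.sum
  earned = weights.sum do |field, weight|
    record[field].to_s.strip.empty? ? 0 : weight
  end
  (100.0 * earned / total).round
end

# A location with a title and a description but no address.
location = { title: "Office", description: "Our HQ", address: nil }

# Address matters most: the missing address costs 40 points.
puts completeness(location, { title: 30, description: 30, address: 40 })  # => 60

# Title matters most: the missing address only costs 20 points.
puts completeness(location, { title: 60, description: 20, address: 20 })  # => 80
```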

The DSL is based on the thinking-sphinx configuration, which is nice, clean and simple, but very effective.

Here is a sample of the config code used to define a set of checks for a completeness score:

define_completeness_scoring do
  check :title,       lambda { |per| per.title.present? },  :high
  check :description, lambda { |per| per.description.present? }, :medium
  check :main_image,  lambda { |per| per.main_image? },     :low
end

It still needs some more TLC, but it’s a nice start and a simple solution for a common problem.

So please, have a play around with it, fork the code, make some improvements/enhancements and let me know what you think.


Using Nginx + Passenger as your development environment

Posted: September 17th, 2009 | Author: Josh Kalderimis | Filed under: Mac, Rails, Ruby, Web Development | 3 Comments »

As a rails developer you are blessed early on with the fantastic script/server for starting a local development server. Rails is smart enough that it will even suggest you install the Mongrel gem as it is a faster alternative to the basic stock standard WEBrick. But as time passes, your skills improve and the number of projects you are working on increases, you may find yourself looking for a simpler solution than having to start up an individual script/server on a different port for each project. Or you may just want to have an app run in the background, waiting for you to access one of the sites and start it up automatically. Whatever the case, there are some very nice solutions available.

99 percent of people who have deployed a rails app have undoubtedly come across Passenger (mod_rails) from the fantastic guys at Phusion. Simply put, it allows you to configure and run your rails app with (initially) Apache or (as of lately) Nginx while also taking advantage of their superior static asset serving capabilities.

So why would you choose Nginx over Apache in your development environment? For me the reasons for using Nginx were simple and quick configuration, very very very low memory usage, and that it mimics my deployment server setup.

So waffle aside, how do we install and set up Nginx for your development environment?

installing nginx and passenger

Passenger is nice enough to offer to install Nginx for you automagically, including downloading Nginx 0.7.61, but this is already an old version, so the plan is to:

  1. download the latest stable version of nginx
  2. extract it under /usr/local/src
  3. install the latest version of passenger via gem
  4. have passenger configure, compile and install nginx for us
  5. tweak the nginx configs
  6. put it all together

so let’s get started…

1. and 2. download the latest stable version of nginx and extract it

cd /usr/local
sudo mkdir -p src
cd src
# 0.7.62 was the latest version at the time of writing
sudo wget http://sysoev.ru/nginx/nginx-0.7.62.tar.gz
sudo tar -zxvf nginx-0.7.62.tar.gz

3. and 4. install passenger via gem and configure, compile and install nginx

# use "sudo gem update passenger" instead if passenger is already installed
sudo gem install passenger
sudo passenger-install-nginx-module

during the installer program enter the following information

When asked: ‘Where is your Nginx source code located?’
answer /usr/local/src/nginx-0.7.62
When asked: ‘Where do you want to install Nginx to?’
answer /usr/local/nginx
When asked about: ‘Extra arguments to pass to configure script:’
answer --with-http_ssl_module
When asked to ‘Confirm configure flags’
answer yes

Ok, now nginx and passenger are installed with ssl support baked in, what to do from here…

5. tweak the nginx configs

The default nginx config file is pretty basic, which is excellent, because it’s all you really need, but a few tweaks here and there can make a great thing even better. Slicehost has an excellent write up on some recommended changes here, but as this is your development environment and not production, I suggest not changing worker_processes or keepalive_timeout.

In the end, this is what my nginx.conf looks like:

worker_processes  1;

events {
    worker_connections  1024;
}

http {
    passenger_root /opt/local/lib/ruby/gems/1.8/gems/passenger-2.2.5;
    passenger_ruby /opt/local/bin/ruby;

    include       mime.types;
    default_type  application/octet-stream;

    sendfile     on;
    tcp_nopush   on;
    tcp_nodelay  off;

    keepalive_timeout  65;

    gzip              on;
    gzip_comp_level   2;
    gzip_proxied      any;
    gzip_types        text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;

    # All the virtual hosts exist here
    include /usr/local/nginx/sites-enabled/*;
}

(I have taken out all the commented out lines)

You will notice the passenger_root and passenger_ruby properties/directives in the conf file. These are required for passenger to start, but you don’t need to worry about inserting them as the passenger nginx installer does it for you.

This is what my virtual server file looks like:

server {
    listen       80;
    server_name  lotsoffunstuff.local;
   
    root /Users/me/Development/ruby-workspace/pet-projects/lotsoffunstuff.com/public;
   
    passenger_enabled on;
    rails_env development;
}

And that’s it, an nginx server + one virtual host all ready to run via:

sudo /usr/local/nginx/sbin/nginx

I also added two aliases to my ~/.profile file:

alias nginx='sudo /usr/local/nginx/sbin/nginx'
alias stopnginx='sudo /usr/local/nginx/sbin/nginx -s stop'

now you can just use nginx and stopnginx.

6. putting it all together

ok, the title is a little deceptive as it all seems to be together already, but there is one very important change yet to be made: making sure passenger can reach and read your source code.

I ran into this problem when I was setting up my environment, and it all has to do with how passenger and nginx work. As with any good webserver, you need to start it as root so it can access the right ports (80) and directories (pid files), but its worker processes should run as nobody or www-data to restrict unneeded access to other resources. For Nginx to know if the server is a rails app or not, its worker processes need to be able to access the document root, in this case public, and every single one of its parent directories. As I keep my development files within my home directory, nginx would throw a 403 error and add a non-descriptive error message to the error.log file.

Two options are available to fix this:

  1. give everyone access to all the parent directories (chmod o+rx on each of them, I think; it is the execute bit that lets a process pass through a directory)
  2. have the nginx worker processes run as a privileged user which can access all the directories in the path

I chose option two and had the nginx worker processes run as myself (the user directive in nginx.conf controls this). You could argue this is insecure, but as I only have the server running when I need it, and nginx and passenger have a great track record, I think this is better than opening up my home directory to everyone.
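Whichever option you pick, it helps to know which directory is actually in the way. The following Ruby sketch walks up from a document root and lists every directory that "others" cannot pass through; the `blocked_directories` helper is hypothetical, and it only looks at the classic permission bits (no ACLs or group membership), so treat it as a rough check.

```ruby
require "pathname"

# List every directory on the way to doc_root (root first) that lacks the
# "others" execute bit, which a worker running as an unrelated user such
# as nobody needs in order to pass through it.
def blocked_directories(doc_root)
  blocked = []
  Pathname.new(doc_root).expand_path.ascend do |dir|
    next unless dir.directory?
    blocked << dir.to_s if (dir.stat.mode & 0o001).zero?
  end
  blocked.reverse
end

# Check the directory this script is started from.
blocked_directories(Dir.pwd).each { |dir| puts "not accessible to others: #{dir}" }
```

An empty result means the permission bits on the path are not what is causing the 403.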

And there we are, all set up and ready to develop! And all in under 30 mins!

some good links and tips

important for deployment : rails maintenance pages done right
init script for ubuntu : nginx-init-ubuntu
1.9 + nginx + passenger : ruby-rails-nginx-passenger
excellent config details : ubuntu-intrepid-nginx-configuration
nginx + vhosts : ubuntu-intrepid-nginx-virtual-hosts
docs galore : http://nginx.net/ and http://wiki.nginx.net/

special mention to slicehost for all the fantastic server and service related articles known to man