Ruby Quiz, Haskell Solution: Sampling
Posted: September 27th, 2009 | Author: Michel Rijnders | Filed under: Haskell, Ruby Quiz | No Comments »The Quiz
A classic sampling problem: write a program sample which takes two integers n and m as input. n is the size of the sample. m is the size of the population. The program should print out n random unique indices. Two example runs:
$ ./sample 3 10 0 2 8 $ ./sample 3 10 1 2 9
The output must be sorted. The complete, original quiz is here.
A Haskell Solution
Take One
My first (naïve) attempt uses a list of integers to represent the pool still available (i.e. the population not sampled yet). When it has to draw a sample it takes a random number i between 0 and the length of the list and removes the element at index i from the list, thus guaranteeing the uniqueness of the generated indices. It works correctly but it runs out of memory for the "big sample" (n= 5,000,000 and m = 1,000,000,000) mentioned in the original quiz, not very suprising since it keeps both the current samples as well as the pool still availabe in memory. It is also quite slow because of the use of a plain list.
import Control.Monad.State
import Data.List (delete, sort)
import System (getArgs)
import System.Random
main :: IO ()
main = do
args <- getArgs
let n = read (args !! 0) ::Int
m = read (args !! 1) :: Int
gen <- getStdGen
let init = RandomPool [0..m] gen
result = evalState (sample n) init
mapM_ print (sort result)
data RandomPool = RandomPool { pool :: [Int], gen :: StdGen }
type StateRP = State RandomPool
sample :: Int -> StateRP [Int]
sample 0 = return []
sample n = do
st <- get
let hi = length (pool st) - 1
(i, gen') = randomR (0, hi) (gen st)
x = pool st !!i
pool' = delete x (pool st)
put RandomPool { pool = pool', gen = gen' }
xs <- sample (n - 1)
return (x:xs)
Take Two
My second attempt solves the memory problem by keeping only the current samples in memory. When it has to draw a sample it takes a random number x between 0 and m and checks if that number has already been used. If the number has been used it tries agian. This solution also uses the Data.Set module for increased performance.
import Control.Monad.State
import Data.List (sort)
import Data.Set as S
import System (getArgs)
import System.Random
main :: IO ()
main = do
args <- getArgs
let n = read (args !! 0) ::Int
m = read (args !! 1) :: Int
gen <- getStdGen
let init = RandomSet S.empty gen
result = evalState (sample m n) init
mapM_ print (sort result)
data RandomSet = RandomSet { set :: S.Set Int , gen :: StdGen }
type StateRS = State RandomSet
sample :: Int -> Int -> StateRS [Int]
sample hi n =
if n == 0
then do st <- get
return (toList (set st))
else do draw hi
sample hi (n - 1)
draw :: Int -> StateRS ()
draw hi = do
st <- get
let (x, gen') = randomR (0, hi - 1) (gen st)
put st { gen = gen' }
if x `S.member` set st
then draw hi
else do
put st { set = insert x (set st) }
return ()
Here's an example run for the big sample. Note that I have to increase the maximum stack size for individual threads (+RTS -K250m) to prevent a stack space overflow:
$ time ./sample 5000000 1000000000 +RTS -K250m > big_sample.txt real 23m24.355s user 23m1.658s sys 0m9.548s $ ls -l big_sample.txt -rw-r--r-- 1 mies staff 49483467 Sep 27 17:13 big_sample.txt $ head big_sample.txt 243 280 416 494 556 602 804 909 970 1126 $ tail big_sample.txt 999998483 999998863 999999002 999999028 999999052 999999053 999999115 999999291 999999853 999999870
The code plus solutions to other quizes is available on GitHub.
Leave a Reply