-- Haskell98! -- Contrasting Lazy and Iteratee IO on the example of Unix wc: -- counting lines and words in a file and in a sequence of files -- -- See the message posted on Haskell-Cafe on Sep 19, 2008: -- http://www.haskell.org/pipermail/haskell-cafe/2008-September/047738.html -- for more discussion. -- The message above refers to an older version of the code. -- The new numbers are given below. All timing results are -- medians of 5 consecutive runs. -- To compile this code -- ghc --make -O2 -main-is Main.main_nl_iter Wc.hs -- To run this code -- GHCRTS="-tstderr" /usr/bin/time ./Wc arg module Main where import System.Environment import IterateeM -- Counting lines in a file {- time wc -l /usr/share/dict/words 235882 /usr/share/dict/words real 0m0.026s user 0m0.026s sys 0m0.001s -} -- Counting lines with Lazy IO: the baseline -- http://www.haskell.org/pipermail/haskell-cafe/2008-September/047729.html main_nl_lazy = do name:_ <- getArgs file <- readFile name print $ length $ lines file {- ./Wc /usr/share/dict/words +RTS -tstderr 235882 <> 0.28 real 0.27 user 0.01 sys -} -- Count NL in a stream of characters. count_nl = ie_cont $ step 0 where step acc (Chunk str) = ie_contM (step $! acc + count str) step acc stream = ie_doneM acc stream count [] = 0 count ('\n':str) = succ $! count str count (_:str) = count str -- Iteratee-based solution. It seems faster than lazy IO main_nl_iter = do name:_ <- getArgs counter <- run =<< (enum_file name count_nl) print counter {- ./Wc /usr/share/dict/words +RTS -tstderr opened file /usr/share/dict/words closed file /usr/share/dict/words 235882 <> 0.13 real 0.11 user 0.01 sys -} -- Count the stream. This could have been in the IterateeM library stream_count :: Monad m => Iteratee el m Int stream_count = ie_cont $ step 0 where step acc (Chunk []) = ie_contM (step acc) step acc (Chunk [_]) = ie_contM (step $! succ acc) step acc (Chunk ls) = ie_contM (step $! acc + length ls) step acc stream = ie_doneM acc stream -- Count the lines, words and characters in parallel -- That is difficult to do with Lazy IO in constant space! main_nl_words_char_iter = do name:_ <- getArgs counter <- run =<< enum_file name (count_nl `enumPair` (runI =<< enum_words stream_count) `enumPair` stream_count) print counter -- Conclusion from the benchmark data below: -- running time is linear in the size of the file -- the peak memory (2MB) and max residency (110 KB) are constant, -- regardless of the size of the file. -- Thus all processing is done in constant memory. -- In fact, the results are virtually the same to those for -- for counting words and characters only. Adding line counting hardly -- matters. {- ./Wc /usr/share/dict/words +RTS -tstderr opened file /usr/share/dict/words closed file /usr/share/dict/words ((235882,235882),2493082) <> 1.23 real 1.17 user 0.04 sys wc: 113739 415101 5137616 /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html ./Wc /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html +RTS -tstderr opened file /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html closed file /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html ((113739,415101),5137616) <> 2.49 real 2.38 user 0.06 sys ./Wc /usr/local/share/dict/edict/enamdict +RTS -tstderr opened file /usr/local/share/dict/edict/enamdict closed file /usr/local/share/dict/edict/enamdict ((727904,2979864),26010496) <> 13.61 real 13.13 user 0.40 sys -} main_words_char_iter = do name:_ <- getArgs counter <- run =<< enum_file name ((runI =<< enum_words stream_count) `enumPair` stream_count) {- -- The following variant works almost the same counter <- run . runI2 =<< (enum_file name ((runI =<< enum_words stream_count) `enum2` stream_count)) -} print counter -- Conclusion from the benchmark data below: -- running time is linear in the size of the file -- the peak memory (2MB) and max residency (104 KB) are constant, -- regardless of the size of the file. -- Thus all processing is done in constant memory. {- ./Wc /usr/share/dict/words +RTS -tstderr opened file /usr/share/dict/words closed file /usr/share/dict/words (235882,2493082) <> 1.20 real 1.18 user 0.02 sys ./Wc /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html +RTS -tstderr opened file /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html closed file /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html (415101,5137616) <> 2.41 real 2.33 user 0.06 sys ./Wc /usr/local/share/dict/edict/enamdict +RTS -tstderr opened file /usr/local/share/dict/edict/enamdict closed file /usr/local/share/dict/edict/enamdict (2979864,26010496) <> 13.30 real 12.81 user 0.42 sys -} -- For comparison, using lazy IO to count only words and characters main_words_char_lazy = do name:_ <- getArgs file <- readFile name print $ (length $ words file, length file) -- The processing is certainly not in constant memory: -- as the file size increases, from 2.4MB, 4.9MB, 24.8MB, -- the peak memory use increases: 43MB, 78MB and 337MB -- and so max residency: 20MB, 37MB, 163MB. -- Clearly the program has to read the whole file in memory. {- ./Wc /usr/share/dict/words +RTS -tstderr (235882,2493082) <> 1.10 real 0.99 user 0.10 sys ./Wc /usr/share/doc/en_US.ISO8859-1/books/handbook/book.html +RTS -tstderr (415101,5137616) <> 2.32 real 2.11 user 0.19 sys ./Wc /usr/local/share/dict/edict/enamdict +RTS -tstderr (2979864,26010496) <> 10.95 real 10.13 user 0.76 sys -} -- Counting words in a sequence of files, whose names are given -- on the command line -- For warm-up, we count words in one file, and in two files count_words_1file = do name:_ <- getArgs -- counter <- run =<< run =<< enum_file name (enum_words stream_count) counter <- run =<< enum_file name (runI =<< enum_words stream_count) print counter {- counter <- run =<< runI =<< enum_file name (enum_words stream_count) ./Wc /usr/local/share/dict/edict/enamdict +RTS -tstderr opened file /usr/local/share/dict/edict/enamdict closed file /usr/local/share/dict/edict/enamdict 2979864 <> 9.78 real 9.36 user 0.37 sys counter <- run =<< enum_file name (runI =<< enum_words stream_count) ./Wc /usr/local/share/dict/edict/enamdict +RTS -tstderr opened file /usr/local/share/dict/edict/enamdict closed file /usr/local/share/dict/edict/enamdict 2979864 <> 9.92 real 9.47 user 0.38 sys -} count_words_2files = do let names = ["/etc/motd", "/etc/resolv.conf"] let r = enum_file (names !! 1) =<< enum_file (names !! 0) (enum_words stream_count) counter <- run =<< run =<< r print counter -- Counting words in all files given on the command-line -- Example usage: -- find /usr/local/share/doc/ghc-6.10.4 -name \*.html -print | time xargs ./Wc -- That counts words in 2136 files. -- Iteratee-based solution main_word_iter = do names <- getArgs let enumerators = foldr (\name -> (enum_file name >>>)) enum_eof names counter <- run =<< run =<< enumerators (enum_words stream_count) print counter {- The composition of enumerators corresponds to the `concatenation' of their sources. Declaratively, the meaning of the above code is: -- all given files are concatenated -- the resulting stream of characters is converted to a stream of words -- the stream of words is counted. Operationally, the code does not open more than one file at a time. More importantly, the code *never* reads more than 4096 characters at a time. A block of the file is read, split into words, counted, and only then another chunk is read. After one file is done, it is closed, and another file is processed. One can see that only one file is being opened at a time by enabling traces. The processing is fully incremental. -- Sample output: opened file /usr/local/share/doc/ghc-6.10.4/users_guide/assertions.html closed file /usr/local/share/doc/ghc-6.10.4/users_guide/assertions.html opened file /usr/local/share/doc/ghc-6.10.4/users_guide/arrow-notation.html closed file /usr/local/share/doc/ghc-6.10.4/users_guide/arrow-notation.html opened file /usr/local/share/doc/ghc-6.10.4/users_guide/wrong.html closed file /usr/local/share/doc/ghc-6.10.4/users_guide/wrong.html 3747046 19.54 real 15.57 user 0.74 sys We emphasize that we do not open more than one file at a time: we open the next file only after we are done with the previous one. -} -- The lazy IO solution, for contrast -- -- On the above example, it aborts with an error: -- openFile: resource exhausted (Too many open files) -- Indeed, one of the main drawbacks of Lazy IO is resource mismanagement. -- On GHC 6.8.3, we get a worse error: -- Wc: internal error: awaitEvent: descriptor out of range -- (GHC version 6.8.3 for i386_unknown_freebsd) -- Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug -- 18.69 real 16.62 user 0.73 sys -- The problem finally was fixed in GHC 6.10.4, at the expense of the -- performance: -- 3747046 -- 22.41 real 21.62 user 0.62 sys main_word_lazy = do names <- getArgs files <- mapM readFile names print $ length $ words (concat files) main = main_word_iter