From enumerators to cursors: turning the left fold inside out $Id: fold-stream.lhs,v 1.5 2004/01/01 00:48:38 oleg Exp oleg $ There are two basic approaches to accessing a collection, be it a file, a database, or a generating function. One approach relies on enumerators (aka folds); the other is based on lazy lists (aka cursors, streams). Given a stream, we can always construct the corresponding enumerator. It's perhaps less appreciated that given a left fold enumerator, we can always construct the corresponding stream: we can mechanically invert an enumerator inside out. It has been argued [*] that the enumerator approach is superior, especially from the point of view of managing scarce resources such as database connections. When we design the interface to a collection, it seems a good idea to choose enumerator as a primitive. We can always obtain a stream if we really need it. [*] http://pobox.com/~oleg/ftp/papers/LL3-collections-enumerators.txt The mechanical inversion procedure presented in [*] had a catch: it relies on shift/reset (or call/cc plus a mutable cell, which is the same thing). How can we do such an inversion in Haskell? We can introduce a right fold enumerator, which is more amenable to such transformations. Or we can use a continuation monad and emulate shift/reset. The present article demonstrates the third approach: a non-recursive left-fold. We argue that such a left fold is the best interface for a collection. Indeed, given the non-recursive left-fold we can: - instantiate it into the ordinary left fold - instantiate in into a stream If we turn two enumerators into streams, we can *safely* interleave these streams. We should point out that the relation between the left fold, the non-recursive left fold and the stream is deep. The ordinary, recursive left fold is the fix point of the non-recursive one. On the other hand, the instantiation of the non-recursive left fold as a stream, as we shall see, effectively captures a continuation in a monadic action. We see once again that call/cc and Y are indeed two sides of the same coin [**]. [**] http://pobox.com/~oleg/ftp/Computation/Computation.html#fix-callcc The rest of the article demonstrates the inversion procedure. The procedure is generic, as evidenced by its polymorphic type. We illustrate the technique on an example of a file considered a collection of characters. Haskell provides a stream interface to that collection: hGetChar. We implement a left fold enumerator. We then turn that enumerator back to a stream: we implement a function 'myhgetchar' _only_ in terms of the left fold enumerator. The approach is general and uses no monadic heavy lifting. The following code is Haskell98. No unsafe operations are used. > import IO > import Data.IORef -- to fully emulate hGetChar > import Control.Monad (unless) The desired left-fold collection enumerator has the following type: > type CFoldLeft coll val m seed = coll -> CollEnumerator val m seed > type CollEnumerator val m seed = > Iteratee val seed > -> seed -- the initial seed > -> m seed > type Iteratee val seed = seed -> val -> Either seed seed The enumerator is parameterized over a collection (coll), the type of elements of the collection (val), monad m and the type of the seed (accumulating parameter of the enumerator). The function of a type "Iteratee val seed" is an iteratee: it takes the current seed and the value of the current element and returns a new seed. To be more precise, if it returns "Right seed'", the iteration continues with seed' as the new seed. If the iteratee returns "Left seed''", the enumerator immediately stops further iterations, frees all the resources, and returns seed'' as the final result. So our left fold enumerator supports premature termination of iterations, a quite useful feature. In our running example, the collection is a file. Therefore, the enumerator over a file considered a collection of characters has the following type: > hfold_left:: CFoldLeft FileName Char IO seed > type FileName = String We will see its implementation shortly. But first, let us introduce a non-recursive version of CFoldLeft: > type CFoldLeft' val m seed = > Self (Iteratee val seed) m seed > -> CollEnumerator val m seed > type Self iter m seed = iter -> seed -> m seed > type CFoldLeft1Maker coll val m seed = coll -> m (CFoldLeft' val m seed) The non-recursive enumerator CFoldLeft' is created by a maker procedure CFoldLeft1Maker. In our running example of a file, the maker procedure has the following specialized type and the implementation: > make_hfold_left':: CFoldLeft1Maker FileName Char IO seed > make_hfold_left' filename = do > h <- openFile filename ReadMode > let hfold_left' self iteratee seed = do > mc <- try $ hGetChar h > case mc of > Left exc -> hClose h >> return seed > Right c -> do > putStrLn $ " reading from " ++". Got " ++ (show c) > case iteratee seed c of > Left seed -> hClose h >> return seed > Right seed -> self iteratee seed > return hfold_left' We should note that the Handle h is an internal resource and is never leaked from make_hfold_left'. The statement putStrLn $ " reading from " ++ title ... is for debugging. We want to be sure that we don't read from a stream too much. Reading just enough is important for proper interleaving. Given a non-recursive CFoldLeft' enumerator, we can instantiate it into a desired left fold enumerator CFoldLeft. That conversion is performed by the following generic procedure: > lfold_nonrec_to_rec:: (Monad m) => > coll -> (CFoldLeft1Maker coll val m seed) > -> m (CollEnumerator val m seed) > lfold_nonrec_to_rec coll lfold1_maker = do > lfold_left' <- lfold1_maker coll > return $ fix lfold_left' > fix f = f g where g = f g As its polymorphic type indicates, it handles _any_ CFoldLeft' enumerator. For our running example, we can apply the procedure lfold_nonrec_to_rec to a particular CFoldLeft1Maker, make_hfold_left', to obtain the desired file enumerator hfold_left (whose type we have shown earlier): > hfold_left filename iteratee seed = do > enumerator <- lfold_nonrec_to_rec filename make_hfold_left' > enumerator iteratee seed Let us create a test file > tf = "/tmp/a" > make_test_file fname = do > h <- openFile fname WriteMode > mapM_ ((hPutStr h) . show) [0..9] > putStrLn $ "Wrote " ++ fname > hClose h and test our left fold enumerator > test_read fname n = do > (count,acc) <- hfold_left fname reader (0,[]) > putStrLn $ "Have read: " ++ (show acc) > where reader (count,acc) c = > let count' = count + 1 > acc' = c:acc > in (if count' >= n then Left else Right) (count',acc') The first test checks the premature termination. We read only five characters, and then terminate the iteration and dispose of its resources, including the (invisible) handle. *Main> test_read tf 5 reading from handle. Got '0' reading from handle. Got '1' reading from handle. Got '2' reading from handle. Got '3' reading from handle. Got '4' Have read: "43210" The debug print statement we inserted into make_hfold_left' earlier made it clear that we indeed have read only five characters from the file. The following tests the exhaustive iteration. *Main> test_read tf 20 reading from handle. Got '0' reading from handle. Got '1' reading from handle. Got '2' reading from handle. Got '3' reading from handle. Got '4' reading from handle. Got '5' reading from handle. Got '6' reading from handle. Got '7' reading from handle. Got '8' reading from handle. Got '9' Have read: "9876543210" The iteration stopped because the file contains only ten characters. Let us suppose that make_hfold_left' is the only way to access files. Can we recover the ordinary i/o stream interface from it? The answer is yes. We can get the familiar open, getchar and eof functions from the non-recursive enumerator. Furthermore, the derivation of these functions is *independent* of the precise nature of the enumerator. We will never need (nor be given) access to the file handle, for example. > data MyStream m a = MyNil (Maybe a) | MyCons a (m (MyStream m a)) > lfold_nonrec_to_stream:: > (Monad m) => CFoldLeft' val m (MyStream m val) > -> m (MyStream m val) > lfold_nonrec_to_stream lfold' = do > let k fn (MyNil Nothing) = return $ MyNil Nothing > k fn (MyNil (Just c)) > = return $ MyCons c (lfold' k fn (MyNil Nothing)) > lfold' k (\_ c -> Right $ MyNil $ Just c) (MyNil Nothing) Again the polymorphic type of lfold_nonrec_to_stream demonstrates the genericity of the transformation procedure. In our running example of files, we obtain: > myopen:: FileName -> IO (MyStream IO Char) > > myopen filename = make_hfold_left' filename >>= lfold_nonrec_to_stream > > mygetchar:: MyStream IO Char -> IO (Char,MyStream IO Char) > mygetchar (MyCons c k) = k >>= (return . ((,) c)) > > myiseof:: MyStream IO a -> Bool > myiseof (MyNil Nothing) = True > myiseof _ = False We can now test reading from two streams with interleaving: > test_interleave fname = do > stream1 <- myopen fname > stream2 <- myopen fname > > putStrLn "\nReading 2 chars from one stream" > (c1,stream1) <- mygetchar stream1 > (c2,stream1) <- mygetchar stream1 > putStrLn $ "Read: " ++ (show [c1,c2]) > > putStrLn "\nReading 3 chars from the second stream" > (c1,stream2) <- mygetchar stream2 > (c2,stream2) <- mygetchar stream2 > (c3,stream2) <- mygetchar stream2 > putStrLn $ "Read: " ++ (show [c1,c2,c3]) > > putStrLn "\nReading again 2 chars from the first stream" > (c1,stream1) <- mygetchar stream1 > (c2,stream1) <- mygetchar stream1 > putStrLn $ "Read: " ++ (show [c1,c2]) *Main> test_interleave tf reading from stream1. Got '0' reading from stream2. Got '0' Reading 2 chars from one stream reading from stream1. Got '1' reading from stream1. Got '2' Read: "01" Reading 3 chars from the second stream reading from stream2. Got '1' reading from stream2. Got '2' reading from stream2. Got '3' Read: "012" Reading again 2 chars from the first stream reading from stream1. Got '3' reading from stream1. Got '4' Read: "23" Note that we read ahead by exactly one character. If we want to detect EOF, reading ahead is inevitable: to find out that a file has no more characters we must attempt to read past the EOF. The following code tests that we can use mygetchar to read the whole stream, till the very end. > test_read_all fname = do > stream <- myopen fname > let readall stream = if myiseof stream then return () > else do > (c,stream') <- mygetchar stream > putStrLn $ show c > readall stream' > readall stream Ken Shan (Ken at digitas-Harvard) has pointed out the linearity of MyStream. The function mygetchar consumes its argument MyStream and produces an updated MyStream in return. Furthermore, we should not use the value MyStream after it has been passed to mygetchar. Otherwise, strange things will happen, such as an attempt to read from a stream after EOF is reached. The latter will trigger a run-time error. If a type system provided linear (or at least, unique) types as in Clean, we could have enforced the linearity restriction that way. In Haskell, we can use IORefs to ensure the linearity of MyStream. We store the current value of MyStream in an IORef. We define a new procedure, myhgetchar, which extracts the MyStream value from the IORef, invokes mygetchar on it, and updates the IORef with the new MyStream value. The latter operation makes sure that the old value of MyStream is consumed and no longer accessible. Indeed, a mutable variable implicitly single-threads throughout the computation. We also define procedures myhiseof and myhopen. The latter creates the IORef with the initial value of MyStream. To prevent the user from accessing MyStream directly (and breaking the linearity condition), we export myhopen, myhgetchar, and myhiseof but hide MyStream: > type MyHandle = IORef (MyStream IO Char) > > myhopen:: FileName -> IO MyHandle > > myhopen filename = myopen filename >>= newIORef > > myhgetchar:: MyHandle -> IO Char > myhgetchar h = > do > stream <- readIORef h > (c,stream') <- mygetchar stream > writeIORef h stream' > return c > > myhiseof:: MyHandle -> IO Bool > myhiseof h = readIORef h >>= return . myiseof > test_read_all' fname = do > h <- myhopen fname > let readall h = myhiseof h >>= > (\ e -> unless e (do > c <- myhgetchar h > putStrLn $ show c > readall h)) > readall h We obtained the same interface as hGetChar! Ken Shan has pointed out that "It's particularly telling that the monadic way to enforce state linearity for IO streams ends up giving us the same programming interface as the monadic way to perform IO in the first place!" All tests: > main = sequence_ > [make_test_file tf, > putStrLn "Reading a few", test_read tf 5, > putStrLn "Reading too many", test_read tf 20, > putStrLn "Test-interleave", test_interleave tf, > putStrLn "Read-all", test_read_all tf, > putStrLn "Read-all, with our handle", test_read_all' tf]