From enumerators to cursors: turning the left fold inside out

$Id: fold-stream.lhs,v 1.5 2004/01/01 00:48:38 oleg Exp oleg $

There are two basic approaches to accessing a collection, be it a
file, a database, or a generating function. One approach relies on
enumerators (aka folds); the other is based on lazy lists (aka
cursors, streams). Given a stream, we can always construct the
corresponding enumerator. It's perhaps less appreciated that given a
left fold enumerator, we can always construct the corresponding
stream: we can mechanically invert an enumerator inside out. It has
been argued [*] that the enumerator approach is superior, especially
from the point of view of managing scarce resources such as database
connections. When we design the interface to a collection, it seems a
good idea to choose enumerator as a primitive. We can always obtain a
stream if we really need it.
 [*] http://pobox.com/~oleg/ftp/papers/LL3-collections-enumerators.txt
 
The mechanical inversion procedure presented in [*] had a catch: it
relies on shift/reset (or call/cc plus a mutable cell, which is the
same thing). How can we do such an inversion in Haskell? We can
introduce a right fold enumerator, which is more amenable to such
transformations. Or we can use a continuation monad and emulate
shift/reset. The present article demonstrates the third approach: a
non-recursive left-fold. We argue that such a left fold is the best
interface for a collection. Indeed, given the non-recursive left-fold
we can:
	- instantiate it into the ordinary left fold
	- instantiate in into a stream

If we turn two enumerators into streams, we can *safely* interleave
these streams.

We should point out that the relation between the left fold, the
non-recursive left fold and the stream is deep. The ordinary,
recursive left fold is the fix point of the non-recursive one. On the
other hand, the instantiation of the non-recursive left fold as a
stream, as we shall see, effectively captures a continuation in a
monadic action. We see once again that call/cc and Y are indeed two
sides of the same coin [**].
 [**] http://pobox.com/~oleg/ftp/Computation/Computation.html#fix-callcc

The rest of the article demonstrates the inversion procedure. The
procedure is generic, as evidenced by its polymorphic type. We
illustrate the technique on an example of a file considered a
collection of characters. Haskell provides a stream interface to that
collection: hGetChar. We implement a left fold enumerator. We then
turn that enumerator back to a stream: we implement a function
'myhgetchar' _only_ in terms of the left fold enumerator. The approach
is general and uses no monadic heavy lifting.

The following code is Haskell98. No unsafe operations are used.


> import IO
> import Data.IORef	-- to fully emulate hGetChar
> import Control.Monad (unless)


The desired left-fold collection enumerator has the following 
type:

> type CFoldLeft coll val m seed = coll -> CollEnumerator val m seed
> type CollEnumerator val m seed =
>        Iteratee val seed
>        -> seed      -- the initial seed
>        -> m seed
> type Iteratee val seed = seed -> val -> Either seed seed

The enumerator is parameterized over a collection (coll), the type of
elements of the collection (val), monad m and the type of the seed
(accumulating parameter of the enumerator). The function of a type
"Iteratee val seed" is an iteratee: it takes the current seed and the
value of the current element and returns a new seed. To be more
precise, if it returns "Right seed'", the iteration continues with
seed' as the new seed. If the iteratee returns "Left seed''", the
enumerator immediately stops further iterations, frees all the
resources, and returns seed'' as the final result.  So our left fold
enumerator supports premature termination of iterations, a quite
useful feature.

In our running example, the collection is a file. Therefore, the
enumerator over a file considered a collection of characters has the
following type:

> hfold_left:: CFoldLeft FileName Char IO seed
> type FileName  = String

We will see its implementation shortly. But first, let us introduce a
non-recursive version of CFoldLeft:
	

> type CFoldLeft' val m seed = 
>        Self (Iteratee val seed) m seed
>        -> CollEnumerator val m seed
> type Self iter m seed = iter -> seed -> m seed
> type CFoldLeft1Maker coll val m seed = coll -> m (CFoldLeft' val m seed)

The non-recursive enumerator CFoldLeft' is created by a maker procedure
CFoldLeft1Maker.

In our running example of a file, the maker procedure has the
following specialized type and the implementation:


> make_hfold_left':: CFoldLeft1Maker FileName Char IO seed
> make_hfold_left' filename = do
>   h <- openFile filename ReadMode
>   let hfold_left' self iteratee seed = do
>       mc <- try $ hGetChar h
>       case mc of
>         Left exc -> hClose h >> return seed
> 	  Right c  -> do
>            putStrLn $ " reading from " ++". Got " ++ (show c)
>            case iteratee seed c of
> 	       Left  seed -> hClose h >> return seed
> 	       Right seed -> self iteratee seed
>   return hfold_left'

We should note that the Handle h is an internal resource and is never
leaked from make_hfold_left'. The statement
	putStrLn $ " reading from " ++ title ...
is for debugging. We want to be sure that we don't read from a
stream too much. Reading just enough is important for proper
interleaving.


Given a non-recursive CFoldLeft' enumerator, we can instantiate it
into a desired left fold enumerator CFoldLeft. That conversion is
performed by the following generic procedure:

> lfold_nonrec_to_rec:: (Monad m) => 
>        coll -> (CFoldLeft1Maker coll val m seed)
>             -> m (CollEnumerator val m seed)
> lfold_nonrec_to_rec coll lfold1_maker = do
>    lfold_left' <- lfold1_maker coll
>    return $ fix lfold_left'
> fix f = f g where g = f g

As its polymorphic type indicates, it handles _any_ CFoldLeft'
enumerator.

For our running example, we can apply the procedure
lfold_nonrec_to_rec to a particular CFoldLeft1Maker, make_hfold_left',
to obtain the desired file enumerator hfold_left (whose type we have
shown earlier):

> hfold_left filename iteratee seed = do
>    enumerator <- lfold_nonrec_to_rec filename make_hfold_left'
>    enumerator iteratee seed


Let us create a test file

> tf = "/tmp/a"
> make_test_file fname = do
>   h <- openFile fname WriteMode
>   mapM_ ((hPutStr h) . show) [0..9]
>   putStrLn $ "Wrote " ++ fname
>   hClose h

and test our left fold enumerator

> test_read fname n = do
>    (count,acc) <- hfold_left fname reader (0,[])
>    putStrLn $ "Have read: " ++ (show acc)
>   where reader (count,acc) c = 
> 	    let count' = count + 1
> 	        acc'   = c:acc
> 	    in (if count' >= n then Left else Right) (count',acc')


The first test checks the premature termination. We read only five
characters, and then terminate the iteration and dispose of its
resources, including the (invisible) handle.

*Main> test_read tf 5
 reading from handle. Got '0'
 reading from handle. Got '1'
 reading from handle. Got '2'
 reading from handle. Got '3'
 reading from handle. Got '4'
Have read: "43210"

The debug print statement we inserted into make_hfold_left' earlier
made it clear that we indeed have read only five characters from the
file.

The following tests the exhaustive iteration.
*Main> test_read tf 20
 reading from handle. Got '0'
 reading from handle. Got '1'
 reading from handle. Got '2'
 reading from handle. Got '3'
 reading from handle. Got '4'
 reading from handle. Got '5'
 reading from handle. Got '6'
 reading from handle. Got '7'
 reading from handle. Got '8'
 reading from handle. Got '9'
Have read: "9876543210"

The iteration stopped because the file contains only ten characters.


Let us suppose that make_hfold_left' is the only way to access
files. Can we recover the ordinary i/o stream interface from it? The
answer is yes. We can get the familiar open, getchar and eof functions
from the non-recursive enumerator. Furthermore, the derivation of
these functions is *independent* of the precise nature of the
enumerator. We will never need (nor be given) access to the file
handle, for example.

> data MyStream m a = MyNil (Maybe a) | MyCons a (m (MyStream m a))

> lfold_nonrec_to_stream:: 
>    (Monad m) => CFoldLeft' val m (MyStream m val) 
>                 -> m (MyStream m val)
> lfold_nonrec_to_stream lfold' = do
>   let k fn (MyNil Nothing)  = return $ MyNil Nothing
>       k fn (MyNil (Just c)) 
>         = return $ MyCons c (lfold' k fn (MyNil Nothing))
>   lfold' k (\_ c -> Right $ MyNil $ Just c) (MyNil Nothing)

Again the polymorphic type of lfold_nonrec_to_stream demonstrates the
genericity of the transformation procedure.


In our running example of files, we obtain:

> myopen:: FileName -> IO (MyStream IO Char)
>
> myopen filename = make_hfold_left' filename >>= lfold_nonrec_to_stream
>
> mygetchar:: MyStream IO Char -> IO (Char,MyStream IO Char)
> mygetchar (MyCons c k) = k >>= (return . ((,) c))
>
> myiseof::   MyStream IO a -> Bool
> myiseof (MyNil Nothing) = True
> myiseof _ = False

We can now test reading from two streams with interleaving:

> test_interleave fname = do
>   stream1 <- myopen fname
>   stream2 <- myopen fname
>
>   putStrLn "\nReading 2 chars from one stream"
>   (c1,stream1) <- mygetchar stream1
>   (c2,stream1) <- mygetchar stream1
>   putStrLn $ "Read: " ++ (show [c1,c2])
>
>   putStrLn "\nReading 3 chars from the second stream"
>   (c1,stream2) <- mygetchar stream2
>   (c2,stream2) <- mygetchar stream2
>   (c3,stream2) <- mygetchar stream2
>   putStrLn $ "Read: " ++ (show [c1,c2,c3])
>  
>   putStrLn "\nReading again 2 chars from the first stream"
>   (c1,stream1) <- mygetchar stream1
>   (c2,stream1) <- mygetchar stream1
>   putStrLn $ "Read: " ++ (show [c1,c2])

*Main> test_interleave tf
 reading from stream1. Got '0'
 reading from stream2. Got '0'

Reading 2 chars from one stream
 reading from stream1. Got '1'
 reading from stream1. Got '2'
Read: "01"

Reading 3 chars from the second stream
 reading from stream2. Got '1'
 reading from stream2. Got '2'
 reading from stream2. Got '3'
Read: "012"

Reading again 2 chars from the first stream
 reading from stream1. Got '3'
 reading from stream1. Got '4'
Read: "23"

Note that we read ahead by exactly one character. If we want to detect
EOF, reading ahead is inevitable: to find out that a file has no more
characters we must attempt to read past the EOF.

The following code tests that we can use mygetchar to read the whole
stream, till the very end.

> test_read_all fname = do
>   stream <- myopen fname
>   let readall stream = if myiseof stream then return ()
>                           else do 
>                                  (c,stream') <- mygetchar stream
>                                  putStrLn $ show c
>                                  readall stream'
>   readall stream

Ken Shan (Ken at digitas-Harvard) has pointed out the linearity of
MyStream. The function mygetchar consumes its argument MyStream and
produces an updated MyStream in return. Furthermore, we should not use
the value MyStream after it has been passed to mygetchar. Otherwise,
strange things will happen, such as an attempt to read from a stream
after EOF is reached. The latter will trigger a run-time error.

If a type system provided linear (or at least, unique) types as in
Clean, we could have enforced the linearity restriction that way. In
Haskell, we can use IORefs to ensure the linearity of MyStream. We
store the current value of MyStream in an IORef. We define a new
procedure, myhgetchar, which extracts the MyStream value from the
IORef, invokes mygetchar on it, and updates the IORef with the new
MyStream value. The latter operation makes sure that the old value of
MyStream is consumed and no longer accessible. Indeed, a mutable
variable implicitly single-threads throughout the computation. We also
define procedures myhiseof and myhopen. The latter creates the IORef
with the initial value of MyStream. To prevent the user from accessing
MyStream directly (and breaking the linearity condition), we export
myhopen, myhgetchar, and myhiseof but hide MyStream:


> type MyHandle = IORef (MyStream IO Char)
>
> myhopen:: FileName -> IO MyHandle
>
> myhopen filename = myopen filename >>= newIORef 
>
> myhgetchar:: MyHandle -> IO Char
> myhgetchar h =
>   do
>     stream <- readIORef h
>     (c,stream') <- mygetchar stream
>     writeIORef h stream'
>     return c          
>
> myhiseof::  MyHandle -> IO Bool
> myhiseof h = readIORef h >>= return . myiseof 

> test_read_all' fname = do
>   h <- myhopen fname
>   let readall h = myhiseof h >>= 
>                   (\ e -> unless e (do
>                                      c <- myhgetchar h
>                                      putStrLn $ show c
>                                      readall h))
>   readall h

We obtained the same interface as hGetChar! Ken Shan has pointed out
that "It's particularly telling that the monadic way to enforce state
linearity for IO streams ends up giving us the same programming
interface as the monadic way to perform IO in the first place!"

All tests:

> main = sequence_
>          [make_test_file tf,
>           putStrLn "Reading a few", test_read tf 5,
>           putStrLn "Reading too many", test_read tf 20,
>           putStrLn "Test-interleave", test_interleave tf,
>           putStrLn "Read-all", test_read_all tf,
>           putStrLn "Read-all, with our handle", test_read_all' tf]