Iteratee/Enumeratee is a way to process data --
file bytes, network datagrams or matrix elements -- incrementally,
declaratively and with tight resource control. The approach has been
inspired by fold
, parser/printer combinators and circuit diagrams.
There are several variants of the Iteratee-based I/O, available as
independent libraries on Hackage or as part of bigger applications
such as web servers. Here we give motivation, justification and
explanation, pointing out earlier and experimental versions as well as
related projects.
We exposit programming with iteratees, contrasting them with Lazy IO
and the Handle-based, stdio
-like IO. We relate them to online
parser combinators. We introduce a simple implementation as free
monads, which lets us formally reason with iteratees. As an example,
we validate several equational laws and use them to optimize iteratee
programs. The simple implementation helps understand existing
implementations of iteratees and derive new ones.
IterDemo.hs [14K]
IterDemo1.hs [28K]
Demonstrations of Iteratee IO on a sequence of progressively larger examples. The second demo, which is described in the paper, illustrates Iteratee IO in contrast with Lazy and Handle IO.
IterDeriv.hs [11K]
IterDerivM.hs [17K]
Derivations of a simple Iteratee IO. Both derivations use a free monad and treat iteratees as a final co-algebra. The second file derives effectful iteratees. The second file also contains complete proofs of the equational laws of iteratees described in the paper. The files also illustrate iteratees as parser combinators.
talk-FLOPS.pdf [175K]
The annotated slides of the talk presented at FLOPS 2012 in Kobe, Japan on May 24, 2012.
IterDem.hs [16K]
The talk used a veneer over IterateeM to better explain Iteratees. This file is the complete code used in the talk.
IterateeIO-talk.pdf [159K]
IterateeIO-talk-notes.pdf [194K]
Old talks about Iteratee IO
The first version of the talk was presented at DEFUN08, ACM SIGPLAN 2008 Developer Tracks on Functional Programming. September 27, 2008. Victoria, Canada.
The present version of the talk (given at the research colloquium of the Center for Software Technology, Department of Information and Computing Sciences, Utrecht University on December 17, 2009) describes the December 2009 version of the Iteratee IO.
The slides have been updated for the November 2010 version of IterateeM.
The talk is aimed at practitioners not afraid of Haskell, in particular, server programmers. That is, programmers who program long-running distributed applications and are painfully aware of the issues of reading from sockets, latency, buffering, many layers of decoding, proper resource disposal and sustaining high load.
README.dr [4K]
Description of the Iteratee IO library and the code accompanying the talk.
Lazy-vs-correct.txt [5K]
Lazy vs correct IO: A message on composability and performance of Iteratee IO, posted on the Haskell-Cafe mailing list on Thu, 18 Sep 2008 23:51:59 -0700 (PDT).
We show that although the Iteratee IO library is not optimized at all, it already outperforms Lazy IO. The final benchmark, counting words in all GHC documentation, demonstrates the differences between Lazy and Iteratee IO: Iteratee IO finishes the test whereas Lazy IO runs out of file descriptors.
data Stream el = EOF (Maybe ErrMsg) | Chunk [el]

Stream is a (continuing) sequence of elements bundled in Chunks. The
EOF variant tells that no more data will be coming: the stream
is exhausted, either due to EOF or some error.
Chunk [a] gives the currently available part of the stream.
In particular, Chunk [] signifies a stream with no immediately available
data but which is still continuing. When a stream processor
receives such an empty chunk, it should ``suspend itself'' --
indicate its desire for more data.

data Iteratee el m a = IE_done a
                     | IE_cont (Maybe ErrMsg)
                               (Stream el -> m (Iteratee el m a, Stream el))

Iteratee is a generic stream processor -- what is being folded over a
stream. An iteratee exists in one of three states: it is done, carrying the
final result (IE_done); it is waiting for more stream data (IE_cont Nothing);
or it is reporting an error or another condition to its enumerator
(IE_cont (Just err)).
We assume that all iteratees are `good' -- given bounded input, they do a bounded amount of work and take a bounded amount of resources. Given a terminated stream, an iteratee should move to the done state, returning the results computed so far.
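For instance, here is a minimal sketch of an iteratee written directly with these constructors (the name iterHead is ours); it returns the first element of the stream, if any:

iterHead :: Monad m => Iteratee el m (Maybe el)
iterHead = IE_cont Nothing step
 where
 step (Chunk [])     = return (iterHead, Chunk [])          -- no data yet: suspend, ask for more
 step (Chunk (x:xs)) = return (IE_done (Just x), Chunk xs)  -- done; return the unconsumed rest
 step s@(EOF _)      = return (IE_done Nothing, s)          -- stream exhausted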
It turns out that Iteratee forms a monad and a monad transformer. After all,
Stream is a gradually consumed state; to read another chunk, the
iteratee has to make an `operating system call' to the stream
producer. Iteratee is a State and Cont monad, and hence an Error
monad. We can write iteratees in the familiar do
notation.
The shown type of iteratees is not the only one possible;
the comments in the source code debate two other choices.
type Enumerator el m a = Iteratee el m a -> m (Iteratee el m a)

Enumerator is an encapsulation of a data source, a stream producer -- what folds an iteratee over the stream. An enumerator takes an iteratee and applies it to the stream data as they are being produced, until the source is depleted or the iteratee says it has had enough. After disposing of buffers and other source-draining resources, the enumerator returns the final value of the iteratee. An enumerator is thus an iteratee transformer.
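As an illustration, a minimal sketch of ours of the simplest enumerator for these types: it feeds an entire list to the iteratee as a single chunk and returns the resulting iteratee.

enumList :: Monad m => [el] -> Enumerator el m a
enumList xs (IE_cont Nothing k) = do
  (i', _leftover) <- k (Chunk xs)   -- feed one chunk; ignore what the iteratee left over
  return i'
enumList _ i = return i             -- already done, or reporting an error: nothing to feed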
Iteratees and enumerators operate over a base monad m
, allowing
producers and consumers to perform side-effecting actions
to obtain and digest data (e.g., reading or writing files).
That base monad is often IO
. That is not the only choice:
Iteratee el IO
is a monad too; it too may act as the
base monad. If we instantiate the type of Enumerator el m a
by
choosing m
to be the iteratee monad, we obtain Enumeratee:
type Enumeratee elo eli m a = Iteratee eli m a -> Iteratee elo m (Iteratee eli m a)

Enumeratee, as any other enumerator, produces a stream, of the elements of type
eli. Since an enumeratee acts in the monad Iteratee elo m,
it may grab chunks of elo elements and use them as the
source of its own stream. Thus an enumeratee is a generic stream
decoder. Stream decoding, or nested parsing, is rather common:
buffering, character encoding, compression, encryption, SSL are just
a few examples. The TIFF library relies on enumeratees
to decode a stream of bytes from a TIFF file, to the stream
of 2- or 4-byte signed integers and then, for example, to the stream
of rational numbers (the image resolution).

Concerns indeed separate: enumerators know how to get to the next element; iteratees know what to do with the next element. The character of Iteratees is the encapsulation of input processing layers that can be freely composed.
The November 2010 version of the code adds guidelines for writing
simple iteratees and enumerators. The library uses standard
combinators such as monadic bind to build iteratee chains;
there are no more $$
and >>==
.
IterateeM.hs [54K]
Monadic and general iteratees: the main library. Examples of chunk-decoding and IO multiplexing.
Wc.hs [12K]
Many examples of nested and parallel processing of input stream.
IterateeMCPS.hs [39K]
The CPS variant of the January 2010 version of IterateeM.hs
Iteratee IO: safe, practical, declarative input processing
Parallel composition of iteratees: one source to several sinks
Parallel composition of streams: several sources to one sink
Using Iteratee as a base monad to process several streams, at different speeds and inter-dependently.
Iteratees -- stream consumers -- are monads and compose in the familiar
do-notation. Composing enumerators --
stream producers -- gives the producer for the concatenated
stream. Stream transformers, enumeratees, are consumers and
producers at the same time and so are often regarded as complex. A
more general view on Iteratee IO gives a simpler picture: stream
consumers are monads, and stream producers are monoids. Stream
combinators become fewer in number but clearer in semantics --
demonstrated by the examples of three different ways of composing
(lists of) stream producers below.
Iteratees are akin to getChar
and getLine
from the standard
System.IO
library, parsing the implicit stdin
-like stream. Like
getChar
, iteratees transparently take care of buffering and error
handling. The stream `handle' is always implicit, and hence cannot be
prematurely closed. Iteratees are more general than System.IO
readers since iteratees' stdin
is not restricted to a character
stream. The type of iteratee stream elements e
is part of the
Iteratee type. Compare the signatures of the corresponding functions
that extract the current element of the stream:
System.IO.getChar :: IO Char
IterateeM.head :: Monad m => Iteratee e m e

Iteratees are not restricted to the IO monad either: the iteratee
head above works in any monad and is
therefore pure. We give iteratees a short alias, whose name should
become clear shortly:

type R e m = Iteratee e m

Writing iteratee programs should be just as familiar as writing
System.IO
programs, only without the restrictions on the type of
stream elements and the monad.
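For instance, here is a two-element reader in the do notation, written only in terms of head, whose signature is shown above (the name pair2 is ours):

pair2 :: Monad m => Iteratee e m (e, e)
pair2 = do
  x <- head
  y <- head
  return (x, y)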
The stream consumed by getChar
is produced by the Haskell
run-time system, which is part of the Haskell environment rather than
of a Haskell program. Iteratee IO brings the producers inside, letting
the program itself produce streams and connect producers with
consumers. A stream producer takes a consumer (a parser of
the stream of e
s into a value of some type a
), feeds it stream
data and returns the resulting consumer.
type Enumerator e m a = R e m a -> m (R e m a)

A producer encapsulates stream generation; a consumer encapsulates stream parsing. A consumer should work with any producer that generates the stream of the right type; correspondingly, a producer should work uniformly with any consumer so long as their stream types match. We enforce the abstraction boundary by requiring the enumerator to be parametric in
a
, the type of the parsing result. A general
stream producer hence has the following type.

newtype L e mi mo = L{unL :: forall a. R e mi a -> mo (R e mi a)}

Enumerator is a particular case of
L e mi mo
when mi
and mo
are the same.
The general stream producer L e mi mo
is a monoid:
instance (Monad mi, Monad mo) => Monoid (L e mi mo) where
    mempty = L return
    mappend (L e1) (L e2) = L $ \i -> e1 i >>= e2

It becomes intuitive that composing (or,
mappend
ing) enumerators
effectively concatenates their streams. The parameters mi
and mo
of
L e mi mo
are monads; moreover, mo
should be the same or `bigger'
than mi
: the producer should do all the effects of the consumer,
and perhaps a few more. The following type class establishes
the partial order on monads: what it means to be `bigger':

class (Monad m1, Monad m2) => m1 :>=: m2 where
    promote :: m2 a -> m1 a

with at least the following obvious instances.
instance Monad m => m :>=: m where
    promote = id

instance (Monad m, MonadTrans t, Monad (t m)) => t m :>=: m where
    promote = lift
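For example (a sketch of ours, relying on the fact, noted later on this page, that Iteratee el is a monad transformer), promote lets an iteratee over IO run an action of its base monad:

traceMsg :: String -> Iteratee el IO ()
traceMsg msg = promote (putStrLn msg)   -- Iteratee el IO :>=: IO, via the second instance above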
A notable particular case of L e mi mo
is the outer monad mo
being R e' m
. The type L e m (R e' m)
describes a producer that gets its
data by reading another stream, of e'
, transforming one or more of e'
into one or more of e
. The type L e m (R e' m)
thus
describes a stream transformer -- an enumeratee! Enumerators and
enumeratees hence unify in the general stream producer.
A producer connects with a consumer by a regular functional
application (after the L
unwrapping). For convenience, we introduce
a right-associative infix operator
L f ||| x = f x
Since an enumeratee is a consumer of the outer, e'
stream, it can
be connected with the producer of that stream. The composition then
acts as an inner stream producer. We obtain a combinator to compose
enumeratees with enumerators or with themselves:
compL :: (mo :>=: m) => L e m mo -> L e' m (R e m) -> L e' m mo
compL (L e12) (L e23) = L (\ie' -> e12 (e23 ie') >>= promote . run)
We now demonstrate various compositions on a simple running example
of composing the list of take
enumeratees:
takes = [takeL 2, takeL 3, takeL 4]

(where
takeL
is IterateeM.take
wrapped in L
).
Granted, the example is a bit contrived: normally we want to compose a
variety of enumeratees rather than the single take. On the other
hand, take n
makes a neater example.
First, takeL
, as any stream producer, is a monoid,
and can be mappend
ed:
c1 :: Monad m => L e m (R e m)
c1 = foldl1 mappend takes

We could have just as well used
foldr1
:
mappend
is associative. Mappend'ing producers `concatenates' their
sources. Therefore, c1 == take (2+3+4)
. To demonstrate this equality,
we run c1
on a sample input stream [1..15]
, with
stream2list
as the consumer that reads the whole stream
and returns it as a list.

c1r = runIdentity $ run =<< (enum1chunkL [1..15] `compL` c1) ||| stream2list
-- [1,2,3,4,5,6,7,8,9]

We have composed
c1
with the enumerator enum1chunkL
to obtain the enumerator of the prefix stream, which we then
hook up with the stream2list
iteratee.
The result, in the comments, shows that c1
indeed behaves like
take 9
. The code to run c1
can be written differently but equivalently
as

c1r' = runIdentity $ run =<< enum1chunkL [1..15] ||| (runI =<< c1 ||| stream2list)
-- [1,2,3,4,5,6,7,8,9]

Instead of composing
c1
with the producer enum1chunkL
on the right we
composed it with the consumer stream2list
on the left, which leads
to another way of composing takes
.
The second method of composing enumeratees is to apply each to
a consumer and compose the results.
If e_j
is a producer and i
is an iteratee (for the inner stream), (e_i ||| i)
is an iteratee for the outer stream. A list of such iteratees
can be composed in several ways, sequentially:
c2s :: Monad m => R e m a -> R e m [a]
c2s = \i -> sequence $ map (\e -> runI =<< e ||| i) takes

c2sr = runIdentity $ run =<< enum1chunkL [1..15] ||| c2s stream2list
-- [[1,2],[3,4,5],[6,7,8,9]]

or in parallel:
c2p :: Monad m => R e m a -> R e m [a]
c2p = \i -> parallel $ map (\e -> runI =<< e ||| i) takes
 where
 parallel [i]   = liftM (: []) i
 parallel (i:t) = do (iv,tv) <- enumPair i (parallel t)
                     return (iv:tv)

c2pr = runIdentity $ run =<< enum1chunkL [1..15] ||| c2p stream2list
-- [[1,2],[1,2,3],[1,2,3,4]]
Finally, enumeratees compose `telescopically':
enumeratees are stream transformers and can be arranged, by compL
, into a
pipeline -- very much like the Unix shell pipeline -- to transform stream
elements further and further. We have seen compL
in action already, in
the term c1r
, which composed an enumerator for the stream e'
with a e'
-to-e
enumeratee giving the enumerator
for the stream e
.
-- compare with c1: compL vs mappend
c4 = foldl1 compL takes
(again, foldr1
works just as well).
Piping take n_j
into each other is not very interesting. It is
easy to see that the result is equivalent to the single
take (minimum [n_1,...n_j])
. Indeed, c4
behaves like
take (minimum [2,3,4]) == take 2
:
c4r = runIdentity $ run =<< enum1chunkL [1..15] `compL` c4' ||| stream2list
-- [1,2]
We have shown a general view on Iteratee IO that unifies enumerators and enumeratees as general stream producers. Whereas iteratees, stream consumers, are monads and monad transformers, stream producers are monoids -- which gives numerous ways of composing them.
Compose.hs [5K]
The older code using the Iteratee library without any embellishments. It was originally posted as Re: Composing a list of Enumeratees on the Haskell-Cafe mailing list on Fri, 7 Oct 2011 23:27:40 -0700 (PDT).
We use random and binary IO to write a general-purpose TIFF library. The library emphasizes incremental processing, relying on iteratees and enumerators for on-demand reading of tag values. The library extensively uses nested streams, tacitly converting the stream of raw bytes from the file into streams of integers, rationals and other user-friendly items. The pixel matrix is presented as a contiguous stream, regardless of its segmentation into strips and physical arrangement.
We show a representative application of the library: reading a sample TIFF file, printing selected values from the TIFF dictionary, verifying the values of selected pixels and computing the histogram of pixel values. The pixel verification procedure stops reading the pixel matrix as soon as all specified pixel values are verified. The histogram accumulation does read the entire matrix, but incrementally. Neither pixel matrix processing procedure loads the whole matrix in memory. In fact, we never read and retain more than the IO-buffer-full of raw data.
Version 2.0 of the library demonstrates restartable exceptions to seek (that is, control the position) in a stream.
Tiff.hs [19K]
TIFF library
TiffTest.hs [8K]
A sample application of the TIFF library
Feeding several parallel consumers off the same stream is quite
common. Examples include Unix tee
, tcpdump
and similar
wire-tapping commands; reading a stream of numbers and computing the
average and the variance, which requires keeping the count of the
numbers, their sum, and the sum of their squares. Our running example
is even more familiar: Unix wc
command, which reports the number of
lines, words and characters in its input stream or files.
The input data may come from the network or a pipe, which we cannot rewind and re-read. We should handle large input, fast; storing received data in memory or files for re-processing is not an option either. We really have to process data incrementally and in parallel.
The combinator enumPair
composes two iteratees in parallel, turning a pair
of iteratees into an iteratee for a pair of their results:
enumPair :: Monad m => Iteratee el m a -> Iteratee el m b -> Iteratee el m (a,b)
enumPair i1 i2 = enum2 i1 i2 >>= runI2

One should not confuse
enumPair
with the lifted pair:
enumPair i1 i2
composes the iteratees i1
and i2
in parallel,
feeding the same chunk of input to both of them. In contrast,
liftM2 (,) i1 i2
composes the iteratees sequentially, feeding
the stream only to i1
, with i2
getting the left-over
after i1
is finished.
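A small illustration of the contrast, with enum_pure_1chunk as used later on this page, stream_count as in the wc example below (we assume it returns the element count), and liftM2 from Control.Monad; the results in the comments are what the description above leads us to expect:

wcPar = run =<< enum_pure_1chunk "abcde" (stream_count `enumPair` stream_count)
-- expected (5,5): both counters see the same five characters

wcSeq = run =<< enum_pure_1chunk "abcde" (liftM2 (,) stream_count stream_count)
-- expected (5,0): the first counter consumes the chunk; the second sees only the EOF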
The combinator enumPair
is expressed in terms of enum2
,
a signal-splitter--like enumeratee with one source and two sinks.
enum2 :: Monad m => Iteratee el m a -> Iteratee el m b -> Iteratee el m (Iteratee el m a, Iteratee el m b)

The enumeratee splits the stream without any buffering, passing the received chunk to both iteratees, before asking for the next chunk. The enumeratee continues for as long as there are stream data and at least one of the iteratees wants them.
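The average-and-variance computation mentioned at the start of this section fits the same mold. A sketch of ours, where stream_count is the library iteratee used below, and stream_sum and stream_sumsq are assumed helper iteratees summing the elements and their squares:

stats :: Monad m => Iteratee Double m (Double, Double)
stats = do
  ((n, s), sq) <- (stream_count `enumPair` stream_sum) `enumPair` stream_sumsq
  let mean = s / fromIntegral n
  return (mean, sq / fromIntegral n - mean * mean)   -- (mean, variance)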
Our running example is like wc
, reporting the number of words and characters
in a given file. It can be elegantly written with Lazy IO:
main_words_char_lazy = do
  name:_ <- getArgs
  file   <- readFile name
  print $ (length $ words file, length file)

Here is the same code using iteratees:
main_words_char_iter = do
  name:_  <- getArgs
  counter <- run =<< enum_file name
                       ((runI =<< enum_words stream_count) `enumPair` stream_count)
  print counter

The iteratee
stream_count
counts the elements in its stream;
we apply it to the stream of characters produced by enum_file
and
to the stream of words produced by decoding the character stream.

The ugly side of Lazy IO turns up when we run the code. As the size of the input file increases, from 2.4MB, to 4.9MB, to 24.8MB, so does the peak memory use of the lazy IO code: from 43MB to 78MB and finally, 337MB. Maximal residency grows too: from 20MB to 37MB and 163MB. The program must be reading the whole file in memory. That is not acceptable: we should not need 1/3-Gigabyte of memory to count words and characters in a 25-Megabyte file.
In contrast, the iteratee-based code counts in constant memory: the peak memory use is 2MB and the maximal residency is 108KB -- regardless of the size of the input file.
I am grateful to John Lato for introducing parallel composition and describing its advantages.
Recall that Iteratee el m
is a monad and so can
be used as a base monad of another iteratee. That is how
we derived enumeratees. We now observe that
Iteratee el
is also a monad transformer. We can transform
any monad into an Iteratee monad -- including the iteratee
monad. Laying iteratee monad transformers upon each other gives
us access to several streams at once.
A simple example of returning the first elements from two streams should make it concrete. Neither stream should be read beyond the first element. The most naive code works:
i1 :: Monad m => Iteratee el1 (Iteratee el2 m) (el1, el2)
i1 = do
  e1 <- head
  e2 <- lift head
  return (e1,e2)

All signatures hereafter can be and have been inferred; we write them for clarity. The signature indeed clarifies that the iteratee
i1
gets the element of the type el1
from the inner stream,
and the element of the type el2
from the base monad, which
happens to be iteratee as well and so can read from (its own)
stream.
Since i1
is an iteratee, we can pass it to an enumerator. The nesting
of streams becomes apparent in the (inferred) signature:
the inner stream contains Int
s and the outer stream has elements
of the type el2
.
t11 :: (Monad m) => Iteratee el2 m (Iteratee Int (Iteratee el2 m) (Int, el2))
t11 = enum_pure_1chunk [1::Int,2,3,4] i1

We complete the inner enumeration by running it. The type of
t12
shows the result to be an iteratee, of an outer stream.
We may pass it to a suitable enumerator (see t13
) and run it:

t12 :: (Monad m) => Iteratee el2 m (Int, el2)
t12 = run =<< enum_pure_1chunk [1::Int,2,3,4] i1

t13 :: (Monad m) => m (Iteratee Char m (Int, Char))
t13 = enum_pure_1chunk "5678" $ run =<< enum_pure_1chunk [1::Int,2,3,4] i1

t14 :: (Monad m) => m (Int, Char)
t14 = run =<< (enum_pure_1chunk "5678" $ run =<< enum_pure_1chunk [1::Int,2,3,4] i1)

Running the complete example,
runIdentity t14
, gives us the pair
(1,'5')
, the heads of the two streams. With different enumerators
we can verify that neither stream has been read beyond the first
element.
We may even process two streams interdependently,
reading incrementally and by various amounts. The amount to read
from one stream may depend on what we have read from the other.
For example, we will read the n
-th word from the file test2.txt
and find the first line in the file test1.txt
with that word. We return the line paired with the word.
zwl :: (MonadIO m) => Int -> Iteratee String (Iteratee Line m) (Line, String)
zwl n = do
  drop n      -- Drop the words from the word stream before we
  w <- head   -- get the desired n-th word
  liftIO $ putStrLn $ "Got the word: " ++ w
  -- From the line stream, drop the lines that
  -- do not contain the word w
  lift $ dropWhile (not . isInfixOf w)
  -- Read the first line that does contain w
  l <- lift $ head
  return (l,w)

We pass the two-stream iteratee
zwl
first to the enumerator
of the inner stream of words, tzwlI
, and then to the enumerator
of the outer stream of lines, tzwlO
. We run the whole computation
and print the result:

tzwlI :: (MonadIO m) => Int -> Iteratee Line m (Line, String)
tzwlI n = run =<< enum_file_gen "test2.txt" (runI =<< enum_words (zwl n))

tzwlO :: (MonadIO m) => Int -> m (Line, String)
tzwlO n = run =<< enum_file_gen "test1.txt" (runI =<< enum_lines (tzwlI n))

tzwlO5 = tzwlO 5 >>= print

The traces of read operations show we read both files incrementally, with interleaving, and only as far as needed.
Iterating iteratee transformers is reminiscent of lightweight monadic
regions, which were obtained by iterating ST
-monad--like
transformers. Many-stream iteratees have many similarities with the
regions, including the region-like discipline of allocating and
disposing of stream resources.
Merge.hs [8K]
A two-headed enumeratee: Merging two sorted streams into one sorted stream
The problem well illustrates the reading of two streams at various paces.
How much to read from a stream can be determined only dynamically,
based on the data read from both streams.
We design not merely an iteratee with two input streams;
we design an enumeratee. Thus we truly merge two input streams into a one,
to be processed by inner iteratees as any other stream.
We also process the two input streams by chunks rather than
element-by-element. If a stream can yield several pieces of
data at once, we take advantage of that, improving the performance.
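As a much simplified illustration (an iteratee rather than the enumeratee of Merge.hs, and element-by-element rather than chunk-by-chunk), the merge can be sketched with the layered-stream technique above. We assume peek, which returns the current stream element without consuming it, or Nothing at EOF:

merge2 :: (Ord e, Monad m) => Iteratee e (Iteratee e m) [e]
merge2 = do
  x <- peek           -- the current element of the inner stream
  y <- lift peek      -- the current element of the outer stream
  case (x, y) of
    (Nothing, Nothing) -> return []
    (Just a,  Nothing) -> head      >> ((a :) `liftM` merge2)
    (Nothing, Just b)  -> lift head >> ((b :) `liftM` merge2)
    (Just a,  Just b)
      | a <= b         -> head      >> ((a :) `liftM` merge2)
      | otherwise      -> lift head >> ((b :) `liftM` merge2)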
How to zip folds: A library of fold transformers (streams)
The question of parallel processing of several streams is quite
like the question of expressing list zipWith
via list fold.
(Recall the origins of iteratees and enumerators in the left fold.)
If the only way to obtain elements of a list is to fold over it,
could we express the zip function?
At first blush, it seems impossible: after all, zip requires
simultaneous processing of two lists whereas list fold can at best give us
only nested processing. Yet we can write zip using only folds to deconstruct
lists, as the referenced article explains.
Iteratee IO also aims, among other things, to encapsulate resources such as file handles and network connections, ensuring their safe use and prompt disposal. One may wonder how much Iteratees and Regions have in common and if that commonality can be factored out.
There seem to be several ways to combine regions and iteratees. We
describe the most straightforward attempt, combining a
monadic region library (mostly as it is) with an Iteratee IO library,
also mostly as it is. We use monadic regions to manage file handles or
file descriptors, ensuring that file handles are always closed, even in
case of IOError
and other asynchronous exceptions. An enumerator like enumFile
provided similar guarantees for its handles. (Since an
enumerator keeps its handles to itself, there is no danger of
iteratees' misusing them.) With the monadic region library, the
enumerator code becomes simpler: we no longer have to worry about
exceptions. The main benefit of monadic region library is to manage
files opened by iteratees. The latter are being passed around, and so
their resources are harder to keep track of.
We stress that monadic regions do not impose the stack-based (LIFO) order on opening and closing of files. The original memory regions were invented to get around the stack allocation, LIFO strategy, which does not suffice because the memory allocation patterns are complex: a function may allocate memory that outlives it. Regions let the programmer create arbitrarily many nested regions; everything in a parent region is available to a child. Crucially, a child may request any of its ancestors to allocate memory in their regions. Therefore, although regions are nested, memory does not have to be allocated and freed in LIFO order. The same principle holds for monadic regions and file handles.
We thus demonstrate enumFile
and in particular iterFile
for
incremental file reading
and writing, with the same safety guarantees. All opened files are always closed, regardless of any (asynchronous) exceptions that may
arise during opening, reading, writing or transforming. The code has
many examples of throwing errors from various stages of the pipeline and
at various times. All files are closed.
Before looking at iterFile
code, let us first see what it can do.
The first example simply copies one file to another, block-by-block.
tIF1 = runSIO $ run =<< unL (enumFile "/etc/motd") (iterFile "/tmp/x")
According to the trace
opened file /etc/motd
iterFile: opening /tmp/x
Closing {handle: /etc/motd}
Closing {handle: /tmp/x}

the files are indeed closed, but not in the LIFO order. That is important: it lets an iteratee write data coming from several sources. For example:
tIF3 = runSIO $ run =<< unL (enumFile "/etc/motd" `mappend` enumFile "/usr/share/dict/words")
                            (iterFile "/tmp/x")

opened file /etc/motd
iterFile: opening /tmp/x
Closing {handle: /etc/motd}
opened file /usr/share/dict/words
Closing {handle: /usr/share/dict/words}
Closing {handle: /tmp/x}

The complex patterns of opening and closing are worth stressing again. For example, the file
/etc/motd
is opened first and is closed first.
This is not a stack-allocation strategy.

The files will be closed even in case of exceptions:
tIF4 = runSIO $ run =<< unL (enumFile "/etc/motd" `mappend` enumFile "/nonexistent")
                            (iterFile "/tmp/x")

opened file /etc/motd
iterFile: opening /tmp/x
Closing {handle: /etc/motd}
opened file /nonexistent
Closing {handle: /tmp/x}
*** Exception: /nonexistent: openFile: does not exist
The source code demonstrates handling of other exceptions. All monadic
region monads support shCatch
, so we can write our own
exception-handling code.
Monadic regions simplify the code for enumFile
, which reads data from a
file and feeds to an iteratee. The code relies on the regions library
to ensure the opened file shall be closed, on normal or any exceptional
exit.
enumFile :: (SMonadIO m) => FilePath -> L ByteString m m
enumFile filepath = L $ \iterv -> do
  newRgn $ do
    h <- newSHandle filepath ReadMode
    unL (enumHandle h) iterv
More interesting is an exception-safe iterFile
, which has been claimed
to be impossible to write. This iteratee saves all received data into a
given file, incrementally. The code is written in the ordinary
monadic style. The file is opened in the region one up: therefore,
the file will remain open after an enumerator is finished with iterFile
.
Another enumerator may pick up. The file will be closed when iterFile
's
region is exited.
iterFile :: (SMonad1IO m, m ~ (IORT s' m')) => FilePath -> R ByteString m ()
iterFile fname = lift (newSHandle fname WriteMode) >>= loop
 where
 loop h = getChunk >>= check h
 check h (Chunk s) = lift (mapM (shPut h) s) >> loop h
 check h e         = return ()
Monadic regions can indeed be combined with an Iteratee library, relieving enumerators and mainly iteratees from worrying about closing files, especially upon exceptions. Although regions constrain patterns of opening and closing, they do not impose the LIFO order. The whole pattern of opening and closing does not even have to be known statically.
SafeHandles.hs [12K]
SafeHandlesTest.hs [10K]
The implementation of the monadic region library, described in Lightweight monadic regions
Composing stream producers
More economical interface to the IterateeM library
The main component of the XML parsing framework is
the Enumeratee xml_enum
that converts from
a Char
stream to an XMLStream
, the stream of parsed XML data.
That is, xml_enum
is a pull, SAX XML parser. Hooking up xml_enum
with a simple iteratee (see below) gives a DOM parser. The
framework also exports a number of low-to-high level lexing and
parsing iteratees for tokenizing or parsing marked-up documents,
in various XML or HTML formats. One may use these "Lego blocks"
to construct a parser following any discipline and performing validation
to any degree.
The interface to the XML parser is through XMLStream
rather than
call-backs:
data XMLStream a = XMLString String String
                 | XMLStart QNameR (AttList QNameR) NameSpaces
                 | XMLEnd   QNameR (AttList QNameR) NameSpaces
                 | XMLOther a    -- extension, for PI handlers

where
QNameR
is a resolved QName
, the name of an element or an
attribute with the resolved (as a URI
) namespace qualification.
The XML DOM parser, which converts XML text to a Document Object Model
(DOM) tree, illustrates xml_enum
and the composability -- stringing up --
of Iteratee-based parsers. It is short enough
to fully describe it here. DOM is a tree representation of the XML document:

data DOM = DNode QNameR (AttList QNameR) [DOM]
         | DC String    -- Character data
The iteratee xml_dom_iter
, whose complete code shown below, builds the tree
from the XMLStream
. The xml_dom_iter
signature says so directly.
xml_dom_iter :: Monad m => Iteratee (XMLStream ()) m DOM
xml_dom_iter = head >>= \x -> case x of
    XMLString s "" -> return $ DC s
    XMLStart{}     -> forest []
    -- XMLEnd is not expected; neither is EOF
 where
 forest acc = head >>= \x -> case x of
    XMLString s ""      -> forest $ DC s : acc
    XMLStart{}          -> forest [] >>= forest . (:acc)
    XMLEnd name attrs _ -> return $ DNode name attrs $ reverse acc

The full DOM parser is obtained by connecting
xml_enum
to xml_dom_iter
:

xml_dom :: Monad m => Iteratee Char m DOM
xml_dom = id .| xml_enum default_handlers .| xml_normalize xml_dom_iter

The argument
default_handlers
provides no PI handlers and no pre-defined
parsed entities or namespaces. The pipeline includes xml_normalize
to coalesce adjacent character content and to remove white space between
elements. Although such white space is almost always insignificant,
its removal is not standard. Therefore, it is separated into its own
enumeratee.

The Iteratee XML parsing framework is based on the SSAX XML parser implemented in Scheme. The main difference is the stream-based interface, making incremental parsing much easier to use. The SSAX parser relied on call-backs, which had to manually manage their internal state, repeatedly packing and unpacking it. Call-backs were therefore unwieldy, error-prone, and difficult to compose.
XMLDom.hs [14K]
A sample application and a demonstration of the SAX XML parser: DOM XML parser
The bulk of the file are the unit tests of the parser, including the tests of namespaces and parsed entities.
SearchString.hs [4K]
A general-purpose iteratee to search the stream for a given string
This iteratee does no buffering and never reads (ahead) more than really necessary to locate the string.
Functional XML parsing framework in Scheme: SAX/DOM and SXML parsers with support for XML Namespaces and validation
The Scheme parser was the prototype and the inspiration for the present XML Enumerator.
The problem is common: suppose we receive an XML document over a network (e.g., in an HTTP stream). Network connections are inherently unreliable and can be dropped at any time. The XML document therefore can come truncated, for example, missing the end tag for the root element. According to the XML Recommendations, such a document is ill-formed, and hence does not have an Infoset. In contrast, invalid but well-formed documents -- with all tags balanced, but where, say, expected integral data does not parse as such -- do have the Infoset. Strictly speaking, we should not be using data from an XML document until we have verified that it is well-formed, that is, until we have parsed it in full and checked that all end tags are in place. Incremental XML processing seems impossible.
Sometimes people avoid such a strict interpretation and process a document as far as they can. For example, the XML-encoded telemetry may be regarded meaningful even if the root end tag is missing.
Even in the strict interpretation, it is still possible to handle a document incrementally so long as the processing is functional or its side effects can be backed out (e.g., in a transaction). We demonstrate such an incremental processing that always detects ill-formed XML, and, optionally, invalid XML. It is possible after all to detect ill-formedness errors and process the document without loading it all in memory first -- using as little memory as needed to hold the state of the processor (just a short string in our example).
Our running example is an XML document representing a finite map: a collection of key-value pairs with an integer key:
<map>
  <kv><key>1</key><value>v1</value></kv>
  <kv><key>2</key><value>v2</value></kv>
  <kv><key>bad</key><value>v3</value></kv>

This document is both ill-formed (missing the end tag) and invalid (one key is bad: non-
read
-able). The goal is to write a
lookup function for a key in such an encoded map reporting all
ill-formedness errors, always. We may choose not to report an
invalid key if the lookup succeeds before the bad
key is
encountered.

The following pipeline is the answer:
type Key   = Int
type Value = String

xml_lookup :: Monad m => Key -> Iteratee Char m (Maybe Value)
xml_lookup key = id .| xml_enum default_handlers .| xml_normalize .| kv_enum (lkup key)

The expression
xml_lookup key
is the iteratee that consumes characters
and produces a Value
associated with the given Key
, or Nothing
if the lookup fails. Alternatively, the expression may throw an iteratee
error. The function xml_lookup
is built
by composition from smaller, separately developed and tested
pipeline components; xml_enum
, the XML enumerator, and xml_normalize
are part of the incremental XML parsing framework:

xml_normalize :: Monad m => Enumeratee (XMLStream x) (XMLStream x) m a
xml_enum :: Monad m => Handlers x m a -> Enumeratee Char (XMLStream x) m a

They parse an XML document incrementally and produce the stream of parsing results -- detecting ill-formed documents and raising corresponding errors. The ill-formedness check is hence taken care of, by the framework.
The enumeratee kv_enum
just as incrementally transforms the stream of
parsing results to the stream of key-value pairs:
kv_enum :: Monad m => Enumeratee (XMLStream ()) (Key,Value) m a
kv_enum = parse_element (name "map") map_parser
 where
 map_parser n iter = parse_many (name "kv") kv1 iter >>= finish_element n
 kv1 n iter = do
   key <- parse_element (name "key")   (\n _ -> as_readable n) 0
   val <- parse_element (name "value") (\n _ -> as_str n) ""
   yield_to iter (key,val) >>= finish_element n

The enumeratee relies on a simple DTD parser combinator library. Here,
map_parser
parses the stream (<kv>...</kv>)*</map>
and kv1
handles <key>
int-key</key><value>
val</value>
. This enumeratee parses
the stream till the end and hence always reports validation errors,
of a key unreadable as an integer.
Finally the iteratee lkup
looks up the given key in the stream of
key-value pairs:
lkup :: Monad m => Key -> Iteratee (Key,Value) m (Maybe Value)
lkup key = headM >>= check
 where
 check (Just (k,v)) | k == key = return (Just v)   -- premature termination
 check Nothing = return Nothing
 check _       = lkup key
Applying xml_lookup 2
to the sample XML document returns the error
``XML [43] (EOF) broken for element map. Stream after the error: ""''. If we
add the closing tag </map>
we still get an error,
``Reading failure. Stream after the error:
[XMLStart value (fromList []) (fromList []),XMLString "v3" "",....]
''
because the key bad
cannot be parsed as integer. If we change kv_enum
slightly so that it stops parsing as soon as its client iteratee
is done, then xml_lookup 2
on the sample document, with the closing </map>
added, returns Just "v2"
. The lookup succeeded
before the bad key is encountered.
We have thus demonstrated constant-space processing -- lookup in an XML document -- while detecting all XML parsing errors. The error reporting comes `for free', courtesy of the XML parsing framework. The whole processing is a pipeline of several stages, each incremental, each with its own error-reporting.
XMLIter
. The code contains a simple
library of parsing combinators, which represent the element
structure (`DTD').

The iteratee log parser is simple indeed, which is the point. Its code reads just like the specification for the log file format. There are however two interesting twists: pointing out a parsing error precisely without complicating the parsers, and handling a surprising non-determinism. (The latter is described in a separate article.) Adding location tracking as an orthogonal aspect to an already written parser, along with the handling of non-determinism, shows off parallel iteratee composition: parsing the same input by several iteratees side-by-side.
Our parser is written for the popular, so-called combined logs. They are made of lines whose format is specified in the Apache documentation as
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

Here
%h
is the host name or the IP address of the client, %l
is usually dash
and ignored, %u
is the user name, etc. The formats of %r
(the request string)
and %t
(the time stamp) are described separately. Here is a sample log line:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
Our log parser extracts the information from one log line and returns it as
data Hit = Hit{ hit_remote_addr :: String,
                hit_remote_user :: Maybe String,
                hit_tstamp      :: UTCTime,
                hit_method      :: HTTP_method,
                hit_url         :: String,
                hit_version     :: (Int,Int),
                ...

The parser hence is an iteratee that reads the stream of characters and produces a
Hit
:

parse_log_line_proper :: Monad m => Iteratee Char m Hit
parse_log_line_proper = do
  hit_remote_addr <- parse_symbol
  space
  parse_symbol        -- the net identity, commonly disregarded
  space
  hit_remote_user <- if_not_dash =<< parse_symbol
  space
  hit_tstamp <- parse_within_delim '[' ']' parse_tstamp
  space
  (hit_method,hit_url,hit_version) <- parse_within_delim '"' '"' $ do
      method <- parse_method
      space
      url <- parse_symbol
      space
      assert_string "HTTP/"
      major <- read_num_base 10
      assert_string "."
      minor <- read_num_base 10
      return (method,url,(fromInteger major, fromInteger minor))
  space
  ...
  return $ Hit{..}
 where
 parse_symbol = break (== ' ')
 if_not_dash "-" = return Nothing
 if_not_dash x   = return $ Just x
 parse_method = ...

The parsers for the time stamp and the request string are separate iteratees, befitting the separate specification of these formats in the Apache documentation. The request string is double-quoted and the time stamp is enclosed in square brackets. Correspondingly, they are parsed by
parse_within_delim
-- a higher-order
parser that applies the argument iteratee to the part of the stream
within the delimiters. This parse_within_delim
is not part of
the iteratee library: it is trivially user-defined as:

parse_within_delim :: (Monad m, Eq e, Show e) => e -> e -> Iteratee e m a -> Iteratee e m a
parse_within_delim open_delim close_delim iter = do
  assert_string [open_delim]
  res <- id .| take_while (/= close_delim) iter
  assert_string [close_delim]
  return res
Our parsers are indeed built by (monadically) stacking up simpler parsers -- which is what parser combinators are all about. Using Haskell abstractions to define parser `macros' is to be expected as well. Iteratees worked as expected, but the parsing task so far has been trivial.
We now come to the first complication of the HTTP log parser. The
server logs are made of many lines, and not all of them may be well-formed.
When parsing fails, we should report the error, skip the line but continue
with the next one. The error report should not only say
that the time stamp date is invalid or an expected
quote is missing. The report should also show the bad timestamp string
with enough context, so we can see what is wrong with it and where in
the file it occurred. Error recovery and error reporting are notorious
for making simple parsers messy: we have to add location tracking
and `error' tokens to the grammar -- generally,
modify the existing parsers at many places. As we shall see, the layered
structure of iteratee parsers gives the error recovery for free. Furthermore,
adding context to error reports of parse_log_line_proper
requires
no modification to the parser at all.
Without any error reporting and recovery, the parser of the whole HTTP log is as follows
httplog_enum_no_err_report :: Monad m => Enumeratee Char Hit m a
httplog_enum_no_err_report = sequence_stream (id .| whole_line (parse_log_line_proper))
It applies the line parser to each line in the input stream,
passing the Hit
to a user-supplied iteratee (e.g.,
to accumulate hit statistics). The higher-order combinator whole_line
is a degenerate version of parse_within_delim
for only one delimiter, the newline character. The log parser
is written as if the log file is blameless. A parsing error
can still be generated, for example, in
assert_string :: (Show e, Eq e, Monad m) => [e] -> Iteratee e m ()
assert_string str = do
  n <- heads str
  if n /= length str
    then parser_error $ "Expected " ++ show str ++ " was not found"
    else return ()

but the error is not handled in any way; if it occurs, the entire log parsing fails and all accumulated hit statistics are lost. This is not acceptable. The fix is easy: trap the parsing exception with
en_handle
, the analogue
of Control.Exception.handle
for iteratees:

httplog_enum_poor_err_report :: Monad m => Enumeratee Char (Either String Hit) m a
httplog_enum_poor_err_report = sequence_stream (id .| whole_line (en_handle show parse_log_line_proper))

The user iteratee now receives not just
Hit
but Either String Hit
,
the result of a successful or an unsuccessful parse of a log line. Parsing
problems no longer abort the processing of the entire file; the
user iteratee can keep accumulating hits, perhaps printing diagnostics
for ill-formed lines. The error recovery -- skipping bad lines through
the newline -- is automatic. Although the argument of whole_line
no longer
raises any iteratee exceptions, it may consume only a part of
the line, until an exception, caught by en_handle
, terminated the
line parsing. No worries: whole_line
, like parse_within_delim
and
like the underlying take_while,
always reads the input stream through
the closing delimiter, no matter whether the inner iteratee wants the data
or not. Error recovery came for free, from the layered design:
parsing the file as lines which are parsed as log data (and
further parsed as quoted data).
Our log parser now reports and recovers from errors, but the reports
are poor. A report may say that the expected double-quote was not
found -- but not where the error has occurred. Adding context to
the error message helps. Curiously, this can be done without changing
log line parsers at all. We merely replace en_handle
with a more sophisticated version parse_error_handler
, which runs parse_log_line_proper
in parallel with stream_suffix :: Monad m => Int -> Iteratee e m [e]
.
The iteratee stream_suffix n
, when terminated, returns the last n
elements of its stream, if the stream was that long. Hence, parse_error_handler
gets a chunk of the stream and passes
it to parse_log_line_proper
and also to stream_suffix n
. If
the former finishes normally, its result is returned. If the line
parser raised an iteratee error, we ask the stream_suffix
for the
last n
elements of the stream so far -- the portion of the stream
that has led to the error. (Since iteratees generally get stream
data in chunks, we have to deal with incompletely consumed
chunks. This can be messy, but luckily is abstracted away by the
iteratee library. The user of the library may well think that
iteratees receive stream data one element at a time.)
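A rough sketch of ours of how such a handler might be assembled from the pieces already described (en_handle, enumPair and stream_suffix); the actual parse_error_handler in the accompanying code may differ, for instance in how much context it reports:

parse_error_handler :: Monad m
    => Iteratee Char m Hit -> Iteratee Char m (Either String Hit)
parse_error_handler p = do
  -- run the parser and the context collector over the same input
  (r, context) <- en_handle show p `enumPair` stream_suffix 30
  return $ case r of
    Left err -> Left (err ++ "; stream before the error: " ++ context)
    ok       -> ok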
Here is an example: given the following ill-formed log line
virus.com - - [10/Jul/2009:02:15:58 -0400] "GET /foo.txt/xxx/yyy.reflect.php? reflect_base=xxx? HTTP/1.1" 404 450 "-" "Mozilla/5.0"
parse_error_handler parse_log_line_proper
reports Expected "HTTP/" was not found
with the stream before error foo.txt/xxx/yyy.reflect.php?
and after reflect_bas
.

In summary, we have seen that iteratees do behave like parser combinators, letting us combine primitive parsers and define new combinators: higher-order parsers. Iteratees can be combined in less traditional ways, to parse the stream of the results of other iteratees. The layering not only makes parsers lucid and intuitive but also makes error recovery simple. The parallel composition, handling the same input in different ways, lets us add error context to error messages truly as an orthogonal aspect, without any modification to the main parsers.
A simple example that calls for the unlimited look-ahead is parsing a string
according to the regular expression (abc)*aby | a(bca)*x
.
Mere search for that pattern is easy, after
converting the regular expression into the equivalent one that can be matched
with a fixed, small look-ahead. However we are interested in parsing rather
than recognizing, associating semantic actions with abc
and bca
--
the task that is beyond deterministic fixed-look-ahead parsers.
A realistic example of the unbounded look-ahead is parsing the do
-notation
in Haskell, for instance:
do [x1,x2,...,xn] <- m
   [x1,x2,...,xn]

Here [x1,x2,...,xn] is a pattern in the first occurrence and an expression in the second. The parser can only tell which is which when scanning
<-
. Another, surprising example is parsing HTTP web server
logs. The popular combined log format
puts the contents of the HTTP User-Agent
header in double-quotes.
The HTTP Recommendation however allows double-quoted strings in that header.
We may end up with unescaped double-quotes within a double-quoted string,
which cannot be parsed deterministically with a fixed look-ahead.

Non-deterministic choice is a good way to specify such complex grammars. The regular expression in our first example was written that way. The vertical bar, the regular-expression notation for choice, corresponds to the choice combinator in parsing combinator libraries. One way to implement it is backtracking, which requires backing up the input stream, to backtrack to. An alternative is to run the parsers to choose from side-by-side. Once a chunk of the input stream is fed to all the competing parsers, the chunk won't be needed again. No backing storage is necessary then. The iteratee choice combinator follows this approach.
Running iteratees in parallel feeding off the same input has been already
described elsewhere on this page. The remaining question is
how to `split' an iteratee parser at a choice point, to obtain the
alternatives to run in parallel. Recall that an iteratee may throw an
`exception', to notify its enumerator of a parsing error or other similar
condition. Iteratee exceptions are resumable: the enumerator
may attempt to `fix' whatever upset the iteratee
and resume it, to let it try again. Resumable exceptions helped
implement re-positioning in the input stream: seek
and tell
. Resumable
exceptions help again: to split an iteratee computation at a choice
point, we throw a resumable exception, which the
exception handler resumes twice.
The following iteratee for non-deterministic choice implements the key idea:
iter_fork :: Monad m => Iteratee e m Bool

Like Unix
fork(2)
, it `returns twice': with False
to the parent
and with True
to the child `process'. It sends a resumable exception
which a fork_server
receives and resumes twice. The fork_server
runs the two iteratees, the parent and the child, side-by-side,
finishing whenever one iteratee terminates. The choice combinator with a
more familiar interface is just one definition away:

choice :: Monad m => Iteratee e m a -> Iteratee e m a -> Iteratee e m a
choice p1 p2 = iter_fork >>= \case {False -> p1; True -> p2}

which in turn lets us easily write popular combinators such as
many
:

many :: Monad m => (z -> a -> z) -> z -> Iteratee e m a -> Iteratee e m z
many f z i = return z `choice` i >>= \a -> many f (f z a) i

The
many
combinator shows that after forking once, the parent or the child
process may fork again. The fork_server
must be prepared to handle
nested fork requests. The many
and the choice
let us implement our examples,
to go beyond deterministic fixed-lookahead parsing.
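For instance, here is a rough sketch of ours of the first example, (abc)*aby | a(bca)*x, with the repetition count as the `semantic action' (assert_string is the primitive shown earlier; the actual code may be structured differently):

regexp :: Monad m => Iteratee Char m (String, Int)
regexp = left `choice` right
 where
 left = do
   n <- many (\n _ -> n+1) 0 (assert_string "abc")   -- count the abc repetitions
   assert_string "aby"
   return ("abc-repetitions", n)
 right = do
   assert_string "a"
   n <- many (\n _ -> n+1) 0 (assert_string "bca")   -- count the bca repetitions
   assert_string "x"
   return ("bca-repetitions", n)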
We have seen how iteratees support non-deterministic choice.
With choice
implemented, iteratees become the complete parsing combinator
library. Our choice combinator iter_fork
is a `coin-flipping'
iteratee, closely related to McCarthy's amb
. Its implementation
uses no backtracking and no backing storage for the stream. In fact, it
is essentially Unix fork
.
There are two basic approaches to traversing a collection. The first is
enumeration of its elements with
for-each
, map
, filter
higher-order procedures -- of which the most general is fold
. The second approach relies on streams, a.k.a. cursors, lazy lists. Generators such as the ones in Icon, Ruby and Python are a hybrid approach.

It is well-known that given a cursor interface to a collection, we can implement an enumerator. It is less appreciated that given an enumerator interface, we can always derive a cursor -- in an automatic way. We demonstrate that generic procedure for languages with and without first-class continuations.
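In the pure setting the derivation is immediate, as in the following sketch (ours): if folding is the only access to a collection, the lazy list of its elements -- a cursor -- is still obtainable, produced on demand. The effectful case discussed below needs continuations or lazy evaluation of IO.

-- RankNTypes: the collection is accessible only through its fold
newtype FoldColl e = FoldColl{ foldColl :: forall r. (e -> r -> r) -> r -> r }

toCursor :: FoldColl e -> [e]    -- the lazy list is the cursor
toCursor c = foldColl c (:) []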
Now that cursors and enumerators are inter-convertible, an implementor of a collection has a choice: which of the two interfaces to implement natively? We argue that he should offer the enumerator interface as the native one. The paper elaborates that enumerators are superior: in efficiency; in ease of programming; in more predictable resource utilization and avoidance of resource leaks. We present a design of the overall optimal collection traversal interface, which is based on a left-fold-like combinator with premature termination. The design has been implemented and tested in practice.
Generic Zipper: the context of a traversal
The article arose during the discussion of a good database
interface for Oracle. For portability, the article illustrates the
enumerator inversion technique on an example of a file, regarded as a
collection of characters. Haskell already provides a stream interface to that
collection: hGetChar
. We implement a left fold
enumerator, and turn it back into a stream, implementing a function myhgetchar
exclusively in terms of the left fold enumerator.
Towards the best collection traversal interface
This extended abstract of a LL3 presentation lists the advantages of enumerators for traversing collections and discusses the enumerator inversion procedure in the broad context.
Parallel composition of streams: several sources to one sink
A different, and better way to interleave two folds
io.txt [26K]
An early draft of an ambitious layered IO proposal, cast in the context of Scheme. More polished drafts exist, and even a prototype Scheme implementation. Unfortunately, once it became clear that the ideas are working out, the motivation fizzled.
oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!
Converted from HSXML by HSXML->HTML