
Incremental multi-level input processing and collection enumeration

 

Iteratee/Enumeratee is a way to process data -- file bytes, network datagrams or matrix elements -- incrementally, declaratively and with tight resource control. The approach has been inspired by fold, parser/printer combinators and circuit diagrams. There are several variants of Iteratee-based I/O, available as independent libraries on Hackage or as parts of bigger applications such as web servers. Here we give motivation, justification and explanation, pointing out earlier and experimental versions as well as related projects.


 

Iteratee IO: safe, practical, declarative input processing

Iteratee IO is a style of incremental input processing with precise resource control. The style encourages building input processors from a user-extensible set of primitives by chaining, layering, pairing and other modes of composition. The programmer is still able, where needed, to precisely control look-ahead, the allocation of buffers, file descriptors and other resources. The style is especially suitable for processing communication streams, large amounts of data, and data that have undergone several levels of encoding such as pickling, compression, chunking and framing. It has been used for programming high-performance (HTTP) servers and web frameworks, in computational linguistics and in financial trading.

We exposit programming with iteratees, contrasting them with Lazy IO and the Handle-based, stdio-like IO. We relate them to online parser combinators. We introduce a simple implementation as free monads, which lets us formally reason with iteratees. As an example, we validate several equational laws and use them to optimize iteratee programs. The simple implementation helps understand existing implementations of iteratees and derive new ones.

References
describe.pdf [334K]
The full version of the paper presented at FLOPS 2012. A shorter version is published in FLOPS Proceedings, Springer's LNCS 7294, pp. 166-181

IterDemo.hs [14K]
IterDemo1.hs [28K]
Demonstrations of Iteratee IO on a sequence of progressively larger examples. The second demo, which is described in the paper, illustrates Iteratee IO in contrast with Lazy and Handle IO.

IterDeriv.hs [11K]
IterDerivM.hs [17K]
Derivations of a simple Iteratee IO. Both derivations use a free monad and treat iteratees as a final co-algebra. The second file derives effectful iteratees. The second file also contains complete proofs of the equational laws of iteratees described in the paper. The files also illustrate iteratees as parser combinators.

talk-FLOPS.pdf [175K]
The annotated slides of the talk presented at FLOPS 2012 in Kobe, Japan on May 24, 2012.

IterDem.hs [16K]
The talk used a veneer over IterateeM to better explain Iteratees. This file is the complete code used in the talk.

IterateeIO-talk.pdf [159K]
IterateeIO-talk-notes.pdf [194K]
Old talks about Iteratee IO
The first version of the talk was presented at DEFUN08, ACM SIGPLAN 2008 Developer Tracks on Functional Programming. September 27, 2008. Victoria, Canada.
The present version of the talk (given at the research colloquium of the Center for Software Technology, Department of Information and Computing Sciences, Utrecht University on December 17, 2009) describes the December 2009 version of the Iteratee IO.
The slides have been updated for the November 2010 version of IterateeM.
The talk is aimed at practitioners not afraid of Haskell, in particular server programmers: programmers who write long-running distributed applications and are painfully aware of the issues of reading from sockets, latency, buffering, many layers of decoding, proper resource disposal, and sustaining high load.

README.dr [4K]
Description of the Iteratee IO library and the code accompanying the talk.

Lazy-vs-correct.txt [5K]
Lazy vs correct IO: A message on composability and performance of Iteratee IO, posted on the Haskell-Cafe mailing list on Thu, 18 Sep 2008 23:51:59 -0700 (PDT).
We show that although the Iteratee IO library is not optimized at all, it already outperforms Lazy IO. The final benchmark, counting words in all GHC documentation, demonstrates the differences between Lazy and Iteratee IO: Iteratee IO finishes the test whereas Lazy IO runs out of file descriptors.

Lazy IO breaks equational reasoning

 

The design of iteratees

The protagonists are Stream, Iteratee, Enumerator, and the offspring Enumeratee. Stream carries the data, Iteratee consumes them, Enumerator produces them. Enumeratee is both a consumer and a producer, incrementally decoding the outer stream and producing the nested stream of decoded data. Enumerators compose to give a bigger producer, effectively concatenating the sources. Iteratees compose serially and in parallel. Iteratees can also nest, so as to process nested streams, or several streams incrementally.
     
     data Stream el = EOF (Maybe ErrMsg) | Chunk [el]
Stream is a (continuing) sequence of elements bundled in Chunks. The EOF variant tells that no more data will be coming: the stream is exhausted, either due to EOF or some error. Chunk [el] gives the currently available part of the stream. In particular, Chunk [] signifies a stream with no immediately available data but which is still continuing. When a stream processor receives such an empty chunk, it should ``suspend itself'' -- indicate its desire for more data.
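For concreteness, here are a few sample stream values (a sketch; ErrMsg is assumed to be a String-like message type):
     
     s1 = Chunk "abc"           -- three characters immediately available
     s2 = Chunk ([] :: [Int])   -- no data yet, but the stream continues
     s3 = EOF Nothing           -- the stream is exhausted normally
     s4 = EOF (Just "timeout")  -- the stream was cut short by an error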
     
     data Iteratee el m a = IE_done a
                          | IE_cont (Maybe ErrMsg)
                                    (Stream el -> m (Iteratee el m a, Stream el))
Iteratee is a generic stream processor -- what is being folded over a stream. An iteratee exists in one of three states: it is done, having computed its result (IE_done); it awaits more data (IE_cont Nothing k); or it has sent an error or another control message to the stream producer (IE_cont (Just msg) k). The control/error message includes the continuation, to handle the possible reply by the stream producer. As in Common Lisp, all iteratee errors are hence potentially restartable.

We assume that all iteratees are `good' -- given bounded input, they do a bounded amount of work and take a bounded amount of resources. Given a terminated stream, an iteratee should move to the done state, returning the results computed so far.

It turns out that Iteratee forms a monad and a monad transformer. After all, Stream is a gradually consumed state; to read another chunk, an iteratee has to make an `operating system call' to the stream producer. Iteratee is a State and a Cont monad, and hence an Error monad. We can write iteratees in the familiar do notation. The shown type of iteratees is not the only one possible; the comments in the source code debate two other choices.
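For instance, here is a small iteratee in the do notation (a sketch, assuming the library's stream-reading iteratee head, shown later, which shadows the Prelude's head):
     
     -- Read the first two stream elements and return their sum
     sum2 :: Monad m => Iteratee Int m Int
     sum2 = do
       x <- head     -- suspends if no data are available yet
       y <- head
       return (x + y)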

     
     type Enumerator el m a = Iteratee el m a -> m (Iteratee el m a)
Enumerator is an encapsulation of a data source, a stream producer -- what folds an iteratee over the stream. An enumerator takes an iteratee and applies it to the stream data as they are being produced, until the source is depleted or the iteratee says it has had enough. After disposing of buffers and other source-draining resources, the enumerator returns the final value of the iteratee. An enumerator is thus an iteratee transformer.
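As an illustration, a pure enumerator that feeds an entire list as a single chunk can be written directly against the datatype above (a simplified sketch of the library's enum_pure_1chunk, which appears in later examples; liftM is from Control.Monad):
     
     -- Feed the whole list as one chunk; leave done or failed iteratees
     -- as they are. Sending EOF is left to run, as in the examples below.
     enum_pure_1chunk :: Monad m => [el] -> Enumerator el m a
     enum_pure_1chunk lst (IE_cont Nothing k) = liftM fst $ k (Chunk lst)
     enum_pure_1chunk _   i                   = return i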

Iteratees and enumerators operate over a base monad m, allowing producers and consumers to perform side-effecting actions to obtain and digest data (e.g., reading or writing files). That base monad is often IO. That is not the only choice: Iteratee el IO is a monad too; it too may act as the base monad. If we instantiate the type of Enumerator el m a by choosing m to be the iteratee monad, we obtain Enumeratee:

     
      type Enumeratee elo eli m a =
        Iteratee eli m a -> Iteratee elo m (Iteratee eli m a)
Enumeratee, as any other enumerator, produces a stream of elements of type eli. Since an enumeratee acts in the monad Iteratee elo m, it may grab chunks of elo elements and use them as the source of its own stream. An enumeratee is thus a generic stream decoder. Stream decoding, or nested parsing, is rather common: buffering, character encoding, compression, encryption, SSL are just a few examples. The TIFF library relies on enumeratees to decode a stream of bytes from a TIFF file to the stream of 2- or 4-byte signed integers and then, for example, to the stream of rational numbers (the image resolution).

Concerns indeed separate: enumerators know how to get to the next element; iteratees know what to do with the next element. The hallmark of iteratees is the encapsulation of input-processing layers that can be freely composed.

The November 2010 version of the code adds guidelines for writing simple iteratees and enumerators. The library uses standard combinators such as monadic bind to build iteratee chains; there are no more $$ and >>==.
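For example, hooking a file enumerator to a counting iteratee takes only plain application and monadic bind (a sketch, using enum_file, stream_count and run as they appear in the word-counting example later on this page):
     
     -- Count the characters in a file: apply the enumerator to the
     -- iteratee, then run the resulting iteratee to extract its value
     count_file :: FilePath -> IO Int
     count_file name = run =<< enum_file name stream_count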

Version
The current version is November 2010.
References
Iteratee.hs [25K]
Pure iteratees and parsing combinators. Justification for design decisions. Parsing of a stream chunk-encoded in another stream.

IterateeM.hs [54K]
Monadic and general iteratees: the main library. Examples of chunk-decoding and IO multiplexing.

Wc.hs [12K]
Many examples of nested and parallel processing of input stream.

IterateeMCPS.hs [39K]
The CPS variant of the January 2010 version of IterateeM.hs

Iteratee IO: safe, practical, declarative input processing

Parallel composition of iteratees: one source to several sinks

Parallel composition of streams: several sources to one sink
Using Iteratee as a base monad to process several streams, at different speeds and inter-dependently.

 

Composing stream producers

A salient feature of Iteratee IO is compositionality: building complex stream operations by combining simpler ones. Iteratees -- stream consumers -- are monads and can be built by binding primitive iteratees, often using the do-notation. Composing enumerators -- stream producers -- gives the producer for the concatenated stream. Stream transformers, enumeratees, are consumers and producers at the same time and so are often regarded as complex. A more general view on Iteratee IO gives a simpler picture: stream consumers are monads, and stream producers are monoids. Stream combinators become fewer in number but clearer in semantics -- demonstrated by the examples of three different ways of composing (lists of) stream producers below.

Iteratees are akin to getChar and getLine from the standard System.IO library, parsing the implicit stdin-like stream. Like getChar, iteratees transparently take care of buffering and error handling. The stream `handle' is always implicit, and hence cannot be prematurely closed. Iteratees are more general than System.IO readers since iteratees' stdin is not restricted to a character stream. The type of iteratee stream elements e is part of the Iteratee type. Compare the signatures of the corresponding functions that extract the current element of the stream:

     System.IO.getChar :: IO Char
     IterateeM.head    :: Monad m => Iteratee e m e
Iteratees are not restricted to the IO monad either: the iteratee head above works in any monad and is therefore pure. We give iteratees a short alias, whose name should become clear shortly:
     type R e m = Iteratee e m
Writing iteratee programs should be just as familiar as writing System.IO programs, only without the restrictions on the type of stream elements and the monad.

The stream consumed by getChar is produced by the Haskell run-time system, which is part of the Haskell environment rather than of a Haskell program. Iteratee IO brings the producers inside, letting the program itself produce streams and connect producers with consumers. A stream producer takes a consumer (a parser of the stream of es into a value of some type a), feeds it stream data and returns the resulting consumer.

     type Enumerator e m a = R e m a -> m (R e m a)
A producer encapsulates stream generation; a consumer encapsulates stream parsing. A consumer should work with any producer that generates the stream of the right type; correspondingly, a producer should work uniformly with any consumer so long as their stream types match. We enforce the abstraction boundary by requiring the enumerator to be parametric in a, the type of the parsing result. A general stream producer hence has the following type.
     newtype L e mi mo = L{unL :: forall a. R e mi a -> mo (R e mi a)}
Enumerator is a particular case of L e mi mo when mi and mo are the same.

The general stream producer L e mi mo is a monoid:

     instance (Monad mi, Monad mo) => Monoid (L e mi mo) where
        mempty = L return
        mappend (L e1) (L e2) = L $ \i -> e1 i >>= e2
It becomes intuitive that composing (or, mappending) enumerators effectively concatenates their streams. The parameters mi and mo of L e mi mo are monads; moreover, mo should be the same or `bigger' than mi: the producer should do all the effects of the consumer, and perhaps a few more. The following type class establishes the partial order on monads: what it means to be `bigger':
     class (Monad m1, Monad m2) => m1 :>=: m2 where
        promote :: m2 a -> m1 a
with at least the following obvious instances.
     instance Monad m => m :>=: m where 
        promote = id
     instance (Monad m, MonadTrans t, Monad (t m)) => t m :>=: m where 
        promote = lift

A notable particular case of L e mi mo is the outer monad mo being R e' m. The type L e m (R e' m) describes a producer that gets its data by reading another stream, of e', transforming one or more of e' into one or more of e. The type L e m (R e' m) thus describes a stream transformer -- an enumeratee! Enumerators and enumeratees hence unify in the general stream producer.

A producer connects with a consumer by regular functional application (after the L unwrapping). For convenience, we introduce a right-associative infix operator

     L f ||| x = f x

Since an enumeratee is a consumer of the outer, e' stream, it can be connected with the producer of that stream. The composition then acts as an inner stream producer. We obtain a combinator to compose enumeratees with enumerators or with themselves:

     compL :: (mo :>=: m) => L e m mo -> L e' m (R e m)  -> L e' m mo
     compL (L e12) (L e23) = L (\ie' -> e12 (e23 ie') >>= promote . run)

We now demonstrate various compositions on a simple running example of composing the list of take enumeratees:

     takes = [takeL 2, takeL 3, takeL 4]
(where takeL is IterateeM.take wrapped in L). Granted, the example is a bit contrived: normally we want to compose a variety of enumeratees rather than a single take. On the other hand, take n makes for a neater example.

First, takeL, as any stream producer, is a monoid, and can be mappended:

     c1 :: Monad m => L e m (R e m)
     c1 = foldl1 mappend takes
We could have just as well used foldr1: mappend is associative. Mappend'ing producers `concatenates' their sources. Therefore, c1 == take (2+3+4). To demonstrate this equality, we run c1 on a sample input stream [1..15], with stream2list as the consumer that reads the whole stream and returns it as a list.
     c1r = runIdentity $ run =<< (enum1chunkL [1..15] `compL` c1) ||| stream2list
     -- [1,2,3,4,5,6,7,8,9]
We have composed c1 with the enumerator enum1chunkL to obtain the enumerator of the prefix stream, which we then hook up with the stream2list iteratee. The result, in the comments, shows that c1 indeed behaves like take 9. The code to run c1 can be written differently but equivalently as
     c1r' = runIdentity $ run =<< 
           enum1chunkL [1..15] ||| (runI =<< c1 ||| stream2list)
     -- [1,2,3,4,5,6,7,8,9]
Instead of composing c1 with the producer enum1chunkL on the right we composed it with the consumer stream2list on the left, which leads to another way of composing takes.

The second method of composing enumeratees is to apply each to a consumer and compose the results. If e_j is a producer and i is an iteratee (for the inner stream), (e_j ||| i) is an iteratee for the outer stream. A list of such iteratees can be composed in several ways, sequentially:

     c2s :: Monad m => R e m a -> R e m [a]
     c2s = \i -> sequence $ map (\e -> runI =<< e ||| i) takes
     
     c2sr = runIdentity $ run =<< enum1chunkL [1..15] ||| c2s stream2list
     -- [[1,2],[3,4,5],[6,7,8,9]]
or in parallel:
     c2p :: Monad m => R e m a -> R e m [a]
     c2p = \i -> parallel $ map (\e -> runI =<< e ||| i) takes
      where
      parallel [i] = liftM (: []) i
      parallel (i:t) = do
        (iv,tv) <- enumPair i (parallel t)
        return (iv:tv)
     
     c2pr = runIdentity $ run =<< enum1chunkL [1..15] ||| c2p stream2list
     -- [[1,2],[1,2,3],[1,2,3,4]]

Finally, enumeratees compose `telescopically': enumeratees are stream transformers and can be arranged, by compL, into a pipeline -- very much like the Unix shell pipeline -- to transform stream elements further and further. We have seen compL in action already, in the term c1r, which composed an enumerator for the stream e' with an e'-to-e enumeratee, giving the enumerator for the stream e.

     -- compare with c1: compL vs mappend
     c4  = foldl1 compL takes

(again, foldr1 works just as well). Piping take n_j into each other is not very interesting. It is easy to see that the result is equivalent to the single take (minimum [n_1,...n_j]). Indeed, c4 behaves like take (minimum [2,3,4]) == take 2:

     c4r = runIdentity $ run =<< enum1chunkL [1..15] `compL` c4 ||| stream2list
     -- [1,2]

We have shown a general view on Iteratee IO that unifies enumerators and enumeratees as general stream producers. Whereas iteratees, stream consumers, are monads and monad transformers, stream producers are monoids -- which gives numerous ways of composing them.

Version
The current version is January 2012.
References
ComposeAdv.hs [6K]
The commented code illustrating different compositions

Compose.hs [5K]
The older code using the Iteratee library without any embellishments. It was originally posted as Re: Composing a list of Enumeratees on the Haskell-Cafe mailing list on Fri, 7 Oct 2011 23:27:40 -0700 (PDT).

 

Random and binary IO: Reading TIFF

Iteratees presuppose sequential processing. A general-purpose input method must also support random IO: processing a seek-able input stream from an arbitrary position, jumping back and forth through the stream. We demonstrate random IO with iteratees, as well as reading non-textual files and converting raw bytes into multi-byte quantities such as integers, rationals, and TIFF dictionaries. Positioning of the input stream is evocative of delimited continuations.
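For instance, a little-endian reader of 2-byte quantities can be layered over the element-reading iteratee head (a hypothetical, simplified sketch in the spirit of the multi-byte readers in RandomIO.hs; Word8 and Word16 are from Data.Word, the bit operations from Data.Bits):
     
     -- Read two bytes, low byte first, and assemble a 16-bit quantity
     read2_le :: Monad m => Iteratee Word8 m Word16
     read2_le = do
       b1 <- head                -- the low byte comes first
       b2 <- head
       return $ fromIntegral b1 .|. (fromIntegral b2 `shiftL` 8)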

We use random and binary IO to write a general-purpose TIFF library. The library emphasizes incremental processing, relying on iteratees and enumerators for on-demand reading of tag values. The library extensively uses nested streams, tacitly converting the stream of raw bytes from the file into streams of integers, rationals and other user-friendly items. The pixel matrix is presented as a contiguous stream, regardless of its segmentation into strips and physical arrangement.

We show a representative application of the library: reading a sample TIFF file, printing selected values from the TIFF dictionary, verifying the values of selected pixels and computing the histogram of pixel values. The pixel verification procedure stops reading the pixel matrix as soon as all specified pixel values are verified. The histogram accumulation does read the entire matrix, but incrementally. Neither pixel matrix processing procedure loads the whole matrix in memory. In fact, we never read and retain more than the IO-buffer-full of raw data.

Version 2.0 of the library demonstrates restartable exceptions to seek (that is, control the position) in a stream.

Version
The current version is November 2010.
References
RandomIO.hs [10K]
Random and Binary IO with Iteratees
The code implements the enumerator for a seekable stream and iteratees for reading multi-byte integral quantities in big- and little-endian formats.

Tiff.hs [19K]
TIFF library

TiffTest.hs [8K]
A sample application of the TIFF library

Binary IO and the TIFF library in Scheme

 

Parallel composition of iteratees: one source to several sinks

A compelling application of iteratees is incremental processing of a single stream by several consumers in parallel, in constant memory.

Feeding several parallel consumers off the same stream is quite common. Examples include Unix tee, tcpdump and similar wire-tapping commands; or reading a stream of numbers and computing the average and the variance, which requires keeping the count of the numbers, their sum, and the sum of their squares. Our running example is even more familiar: the Unix wc command, which reports the number of lines, words and characters in its input stream or files.

The input data may come from the network or a pipe, which we cannot rewind and re-read. We should handle large input, fast; storing received data in memory or files for re-processing is not an option either. We really have to process data incrementally and in parallel.

The combinator enumPair composes two iteratees in parallel, turning a pair of iteratees into an iteratee for a pair of their results:

     enumPair :: Monad m => Iteratee el m a -> Iteratee el m b ->
                 Iteratee el m (a,b)
     enumPair i1 i2 = enum2 i1 i2 >>= runI2
One should not confuse enumPair with the lifted pair: enumPair i1 i2 composes the iteratees i1 and i2 in parallel, feeding the same chunk of input to both of them. In contrast, liftM2 (,) i1 i2 composes the iteratees sequentially, feeding the stream only to i1, with i2 getting the left-over after i1 is finished.
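For instance, one pass over a stream of numbers can both count them and sum them (a sketch: stream_count is the library's counting iteratee used below; stream_sum is a hypothetical summing iteratee, written here via the library's headM, which returns Nothing at the end of the stream):
     
     stream_sum :: Monad m => Iteratee Int m Int
     stream_sum = go 0
      where
      go acc = headM >>= maybe (return acc) (\x -> go $! acc + x)
     
     count_and_sum :: Monad m => Iteratee Int m (Int, Int)
     count_and_sum = stream_count `enumPair` stream_sum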

The combinator enumPair is expressed in terms of enum2, a signal-splitter--like enumeratee with one source and two sinks.

     enum2 :: Monad m => Iteratee el m a -> Iteratee el m b ->
              Iteratee el m (Iteratee el m a, Iteratee el m b)
The enumeratee splits the stream without any buffering, passing the received chunk to both iteratees, before asking for the next chunk. The enumeratee continues for as long as there are stream data and at least one of the iteratees wants them.

Our running example is like wc, reporting the number of words and characters in a given file. It can be elegantly written with Lazy IO:

     main_words_char_lazy = do
        name:_ <- getArgs
        file   <- readFile name
        print $ (length $ words file, length file)
Here is the same code using iteratees:
     main_words_char_iter = do
        name:_  <- getArgs
        counter <- run =<< enum_file name 
          ((runI =<< enum_words stream_count) `enumPair` stream_count)
        print counter
The iteratee stream_count counts the elements in its stream; we apply it to the stream of characters produced by enum_file and to the stream of words produced by decoding the character stream.

The ugly side of Lazy IO turns up when we run the code. As the size of the input file increases, from 2.4MB, to 4.9MB, to 24.8MB, so does the peak memory use of the lazy IO code: from 43MB to 78MB and finally, 337MB. Maximal residency grows too: from 20MB to 37MB and 163MB. The program must be reading the whole file into memory. That is not acceptable: we should not need a third of a gigabyte of memory to count words and characters in a 25-megabyte file.

In contrast, the iteratee-based code counts in constant memory: the peak memory use is 2MB and the maximal residency is 108KB -- regardless of the size of the input file.

I am grateful to John Lato for introducing parallel composition and describing its advantages.

Version
The current version is November 2010.
References
Wc.hs [12K]
The benchmark code and the results. The code also tests counting words in very many files, demonstrating that the iteratee-based code is also frugal about file descriptors, needing only one.

 

Parallel composition of streams: several sources to one sink

Iteratee is a generic stream processor, what is being folded over a stream -- one stream. The standard Haskell Handle-based IO seems more general, then, letting us read from several streams at once. Not only can we zip two files together; we can also read two files at different paces. Furthermore, how much we read from one file may depend on what we have read from the other. Can we do all that with iteratees, while maintaining their advantages of incremental processing, precise resource control and the prevention of resource leaks? It turns out that we can. More surprising is how trivial it is. We neither have to change the Iteratee library, nor do we have to know how iteratees are implemented. The monadic parsing combinators of the Iteratee library suffice.

Recall that Iteratee el m is a monad and so can be used as a base monad of another iteratee. That is how we derived enumeratees. We now observe that Iteratee el is also a monad transformer. We can transform any monad into an Iteratee monad -- including the iteratee monad. Laying iteratee monad transformers upon each other gives us access to several streams at once.

A simple example of returning the first elements from two streams should make it concrete. Neither stream should be read beyond the first element. The most naive code works:

     i1 :: Monad m => Iteratee el1 (Iteratee el2 m) (el1, el2)
     i1 = do
          e1 <- head 
          e2 <- lift head
          return (e1,e2)
All signatures hereafter can be and have been inferred; we write them for clarity. The signature indeed clarifies that the iteratee i1 gets the element of the type el1 from the inner stream, and the element of the type el2 from the base monad, which happens to be iteratee as well and so can read from (its own) stream.

Since i1 is an iteratee, we can pass it to an enumerator. The nesting of streams becomes apparent in the (inferred) signature: the inner stream contains Ints and the outer stream has elements of the type el2.

     t11 :: (Monad m) => Iteratee el2 m (Iteratee Int (Iteratee el2 m) (Int, el2))
     t11 = enum_pure_1chunk [1::Int,2,3,4] i1
We complete the inner enumeration by running it. The type of t12 shows the result to be an iteratee, of an outer stream. We may pass it to a suitable enumerator (see t13) and run it:
     t12 :: (Monad m) => Iteratee el2 m (Int, el2)
     t12 = run =<< enum_pure_1chunk [1::Int,2,3,4] i1
     
     t13 :: (Monad m) => m (Iteratee Char m (Int, Char))
     t13 = enum_pure_1chunk "5678" $ run =<< enum_pure_1chunk [1::Int,2,3,4] i1
     
     t14 :: (Monad m) => m (Int, Char)
     t14 = run =<< (enum_pure_1chunk "5678" $
                      run =<< enum_pure_1chunk [1::Int,2,3,4] i1)
Running the complete example, runIdentity t14, gives us the pair (1,'5'), the heads of the two streams. With different enumerators we can verify that neither stream has been read beyond the first element.

We may even process two streams interdependently, reading incrementally and by various amounts. The amount to read from one stream may depend on what we have read from the other. For example, we will read the n-th word from the file test2.txt and find the first line in the file test1.txt with that word. We return the line paired with the word.

     zwl :: (MonadIO m) => Int -> Iteratee String (Iteratee Line m) (Line, String)
     zwl n = do
             drop n     -- Drop the words from the word stream before we
             w <- head  -- get the desired n-th word
             liftIO $ putStrLn $ "Got the word: " ++ w
                        -- From the line stream, drop the lines that
                        -- do not contain the word w
             lift $ dropWhile (not . isInfixOf w)
                        -- Read the first line that does contain w
             l <- lift $ head
             return (l,w)
We pass the two-stream iteratee zwl first to the enumerator of the inner stream of words, tzwlI, and then to the enumerator of the outer stream of lines, tzwlO. We run the whole computation and print the result:
     tzwlI :: (MonadIO m) => Int -> Iteratee Line m (Line, String)
     tzwlI n = run =<< enum_file_gen "test2.txt" (runI =<< enum_words (zwl n))
     
     tzwlO :: (MonadIO m) => Int -> m (Line, String)
     tzwlO n = run =<< enum_file_gen "test1.txt" (runI =<< enum_lines (tzwlI n))
     
     tzwlO5 = tzwlO 5 >>= print
The traces of read operations show we read both files incrementally, with interleaving, and only as far as needed.

Iterating iteratee transformers is reminiscent of lightweight monadic regions, which were obtained by iterating ST-monad--like transformers. Many-stream iteratees have many similarities with the regions, including the region-like discipline of allocating and disposing of stream resources.

Version
The current version is November 2010.
References
IterateeN.hs [7K]
The shown and other examples of multiple-stream processing

Merge.hs [8K]
A two-headed enumeratee: Merging two sorted streams into one sorted stream
The problem well illustrates the reading of two streams at various paces. How much to read from a stream can be determined only dynamically, based on the data read from both streams. We design not merely an iteratee with two input streams; we design an enumeratee. Thus we truly merge two input streams into one, to be processed by inner iteratees as any other stream. We also process the two input streams by chunks rather than element-by-element. If a stream can yield several pieces of data at once, we take advantage of that, improving performance.

How to zip folds: A library of fold transformers (streams)
The question of parallel processing of several streams is quite like the question of expressing list zipWith via list fold. (Recall the origins of iteratees and enumerators in the left fold.) If the only way to obtain elements of a list is to fold over it, could we express the zip function? At first blush, it seems impossible: after all, zip requires simultaneous processing of two lists whereas list fold can at best give us only nested processing. Yet we can write zip using only folds to deconstruct lists, as the referenced article explains.

Lightweight monadic regions

 

Combining Monadic Regions and Iteratees

Regions are an automatic resource-management technique that statically ensures that all allocated resources are freed and a freed resource cannot be used. Regions also promote efficiency by helping to structure the computation so that resources are freed soon, and en masse. Regions are exception-safe. Therefore, regions are particularly suitable for scarce resources such as file handles, database or network connections, etc. A lightweight monadic region library has been described in the Haskell Symposium 2008 paper; the code is available on Hackage.

Iteratee IO also aims, among other things, to encapsulate resources such as file handles and network connections, ensuring their safe use and prompt disposal. One may wonder how much Iteratees and Regions have in common and if that commonality can be factored out.

There seem to be several ways to combine regions and iteratees. We describe the most straightforward attempt, combining a monadic region library (mostly as it is) with an Iteratee IO library, also mostly as it is. We use monadic regions to manage file handles or file descriptors, ensuring that file handles are always closed, even in case of IOError and other asynchronous exceptions. An enumerator like enumFile provided similar guarantees for its handles. (Since an enumerator keeps its handles to itself, there is no danger of iteratees' misusing them.) With the monadic region library, the enumerator code becomes simpler: we no longer have to worry about exceptions. The main benefit of the monadic region library is to manage files opened by iteratees. The latter are passed around, and so their resources are harder to keep track of.

We stress that monadic regions do not impose a stack-based (LIFO) order on the opening and closing of files. The original memory regions were invented to get around the stack-allocation (LIFO) strategy, which does not suffice because memory allocation patterns are complex: a function may allocate memory that outlives it. Regions let the programmer create arbitrarily many nested regions; everything in a parent region is available to a child. Crucially, a child may request any of its ancestors to allocate memory in their regions. Therefore, although regions are nested, memory does not have to be allocated and freed in LIFO order. The same principle holds for monadic regions and file handles.

We thus demonstrate enumFile and in particular iterFile for incremental file reading and writing, with the same safety guarantees. All opened files are always closed, regardless of any (asynchronous) exceptions that may arise during opening, reading, writing or transforming. The code has many examples of throwing errors from various stages of the pipeline and at various times. All files are closed.

Before looking at the iterFile code, let us first see what it can do. The first example simply copies one file to another, block-by-block.

     tIF1 = runSIO $ 
             run =<< unL (enumFile "/etc/motd") (iterFile "/tmp/x")

According to the trace

     	opened file /etc/motd
     	iterFile: opening /tmp/x
     	Closing {handle: /etc/motd}
     	Closing {handle: /tmp/x}
the files are indeed closed, but not in the LIFO order. That is important: it lets an iteratee write data coming from several sources. For example:
     tIF3 = runSIO $ 
             run =<< unL (enumFile "/etc/motd" `mappend`
     		      enumFile "/usr/share/dict/words")
     	     (iterFile "/tmp/x")
     
        opened file /etc/motd
        iterFile: opening /tmp/x
        Closing {handle: /etc/motd}
        opened file /usr/share/dict/words
        Closing {handle: /usr/share/dict/words}
        Closing {handle: /tmp/x}
The complex patterns of opening and closing are worth stressing again. For example, the file /etc/motd is opened first and is closed first. This is not a stack-allocation strategy.

The files will be closed even in case of exceptions:

     tIF4 = runSIO $ 
             run =<< unL (enumFile "/etc/motd" `mappend`
     		      enumFile "/nonexistent")
     	     (iterFile "/tmp/x")
     
        opened file /etc/motd
        iterFile: opening /tmp/x
        Closing {handle: /etc/motd}
        opened file /nonexistent
        Closing {handle: /tmp/x}
        *** Exception: /nonexistent: openFile: does not exist

The source code demonstrates the handling of other exceptions. The monadic region monads all support shCatch, so we can write our own exception-handling code.

Monadic regions simplify the code for enumFile, which reads data from a file and feeds them to an iteratee. The code relies on the regions library to ensure that the opened file will be closed, on normal or any exceptional exit.

     enumFile :: (SMonadIO m) => FilePath -> L ByteString m m
     enumFile filepath = L $ \iterv -> do
       newRgn $ do
           h <- newSHandle filepath ReadMode
           unL (enumHandle h) iterv

More interesting is an exception-safe iterFile, which has been claimed to be impossible to write. This iteratee saves all received data into a given file, incrementally. The code is written in the ordinary monadic style. The file is opened in the region one up: therefore, the file will remain open after an enumerator is finished with iterFile. Another enumerator may pick up. The file will be closed when iterFile's region is exited.

     iterFile :: (SMonad1IO m, m ~ (IORT s' m')) 
     	     => FilePath -> R ByteString m ()
     iterFile fname = lift (newSHandle fname WriteMode) >>= loop
      where
      loop h = getChunk >>= check h
      check h (Chunk s) = lift (mapM (shPut h) s) >> loop h
      check h e  = return ()

Monadic regions can indeed be combined with an Iteratee library, relieving enumerators and especially iteratees from worrying about closing files, particularly upon exceptions. Although regions constrain the patterns of opening and closing, they do not impose a LIFO order. The whole pattern of opening and closing does not even have to be known statically.

Version
The current version is January 2012.
References
IterReg.hs [11K]
The complete Haskell code
It was originally posted in an article Combining Regions and Iteratees on the Haskell-Cafe mailing list on Sat Jan 21 10:10:23 2012

SafeHandles.hs [12K]
SafeHandlesTest.hs [10K]
The implementation of the monadic region library, described in Lightweight monadic regions

Composing stream producers
More economical interface to the IterateeM library

 

Incremental XML parser and the XML parsing framework

Faithful XML parsing -- accounting for namespaces, character, internal and external parsed entities, attribute value normalization, processing instructions and CDATA sections -- is among the larger applications of the Iteratee library. Online parsing combinators are one way to understand Iteratees. Although this project was undertaken for practical reasons, there was a motivation to see how good Iteratees are, on their own, as a parsing combinator library. Can they scale to the complex problem of parsing the full XML, as described in the XML 1.0 Recommendation and the XML Namespace Recommendation, doing the required normalizations and reporting the required well-formedness errors? The experience of implementing the XML parsing framework and using it in real life to incrementally parse large XML files shows that Iteratees are fit for the task.

The main component of the XML parsing framework is the Enumeratee xml_enum that converts from a Char stream to an XMLStream, the stream of parsed XML data. That is, xml_enum is a pull, SAX XML parser. Hooking up xml_enum with a simple iteratee (see below) gives a DOM parser. The framework also exports a number of low-to-high level lexing and parsing iteratees for tokenizing or parsing marked-up documents, in various XML or HTML formats. One may use these "Lego blocks" to construct a parser following any discipline and performing validation to any degree.

The interface to the XML parser is through XMLStream rather than call-backs:

     data XMLStream a = 
         XMLString String String
       | XMLStart  QNameR (AttList QNameR) NameSpaces
       | XMLEnd    QNameR (AttList QNameR) NameSpaces
       | XMLOther a				-- extension, for PI handlers
where QNameR is a resolved QName, the name of an element or an attribute with the resolved (as a URI) namespace qualification. The XML DOM parser, which converts XML text to a Document Object Model (DOM) tree, illustrates xml_enum and the composability -- stringing up -- of Iteratee-based parsers. It is short enough to fully describe it here. DOM is a tree representation of the XML document:
     data DOM = 
         DNode QNameR (AttList QNameR) [DOM]
      |  DC String				-- Character data

The iteratee xml_dom_iter, whose complete code is shown below, builds the tree from the XMLStream. The xml_dom_iter signature says so directly.

     xml_dom_iter :: Monad m => Iteratee (XMLStream ()) m DOM
     xml_dom_iter = head >>= \x -> case x of
        XMLString s "" -> return $ DC s
        XMLStart{}     -> forest []
        -- XMLEnd is not expected; neither is EOF
      where
      forest acc = head >>= \x -> case x of
        XMLString s "" -> forest $ DC s : acc
        XMLStart{}     -> forest [] >>= forest . (:acc)
        XMLEnd name attrs _ -> return $ DNode name attrs $ reverse acc
The full DOM parser is obtained by connecting xml_enum to xml_dom_iter:
     xml_dom :: Monad m => Iteratee Char m DOM
     xml_dom = id .| xml_enum default_handlers .| xml_normalize xml_dom_iter
The argument default_handlers provides no PI handlers and no pre-defined parsed entities or namespaces. The pipeline includes xml_normalize to coalesce adjacent character content and to remove white space between elements. Although such white space is almost always insignificant, its removal is not standard. Therefore, it is separated into its own enumeratee.
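To see the parser in action, one may feed it a document from any enumerator, for example (a sketch, assuming enum_pure_1chunk and run from IterateeM; the document string is made up for illustration):
     
     -- Parse an XML document held in a string, in the pure Identity monad
     test_dom :: DOM
     test_dom = runIdentity $ run =<< enum_pure_1chunk
          "<greeting>Hello, <name>world</name></greeting>" xml_dom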

The Iteratee XML parsing framework is based on the SSAX XML parser implemented in Scheme. The main difference is the stream-based interface, making incremental parsing much easier to use. The SSAX parser relied on call-backs, which had to manually manage their internal state, repeatedly packing and unpacking it. Call-backs were therefore unwieldy, error-prone, and difficult to compose.

Version
The current version is August 2012.
References
XMLIter.hs [58K]
The main parser code: the enumerator for XML documents, auxiliary normalization functions and lower-level parsers
About a quarter of the file is unit tests.

XMLDom.hs [14K]
A sample application and a demonstration of the SAX XML parser: DOM XML parser
The bulk of the file is unit tests of the parser, including tests of namespaces and parsed entities.

SearchString.hs [4K]
A general-purpose iteratee to search the stream for a given string
This iteratee does no buffering and never reads (ahead) more than really necessary to locate the string.

Functional XML parsing framework in Scheme: SAX/DOM and SXML parsers with support for XML Namespaces and validation
The Scheme parser was the prototype and the inspiration for the present XML Enumerator.

 

Constant-space XML processing with well-formedness check

XML, JSON and S-expressions are often used as storage or serialization formats. When processing data received in these formats, it may be important to check for corruption first and consume the data only after verifying that the XML etc. container is well-formed. A typical ill-formedness error is a missing end-tag, brace or parenthesis. One can hence tell the document is well-formed only after reading it through to the end. Constant-space processing seems therefore impossible in principle. And yet we can handle serialized data in constant space and still report well-formedness and, optionally, validation errors. This article uses XML and the incremental XML parsing framework; the same principles apply to JSON, S-expressions, and other such formats.

The problem is common: suppose we receive an XML document over a network (e.g., in an HTTP stream). Network connections are inherently unreliable and can be dropped at any time. The XML document therefore can come truncated, for example, missing the end tag of the root element. According to the XML Recommendations, such a document is ill-formed, and hence does not have an Infoset. In contrast, invalid but well-formed documents -- with all tags balanced but with, say, expected integral data that do not parse as such -- do have the Infoset. Strictly speaking, we should not be using data from an XML document until we have verified that it is well-formed, that is, until we have parsed it to the end and checked that all end tags are in place. Incremental XML processing seems impossible.

Sometimes people avoid such a strict interpretation and process a document as far as they can. For example, XML-encoded telemetry may be regarded as meaningful even if the root end tag is missing.

Even in the strict interpretation, it is still possible to handle a document incrementally so long as the processing is functional or its side effects can be backed out (e.g., in a transaction). We demonstrate such incremental processing that always detects ill-formed XML, and, optionally, invalid XML. It is possible after all to detect ill-formedness errors and process the document without loading it all in memory first -- using as little memory as needed to hold the state of the processor (just a short string in our example).

Our running example is an XML document representing a finite map: a collection of key-value pairs with an integer key:

     <map>
      <kv><key>1</key><value>v1</value></kv>
      <kv><key>2</key><value>v2</value></kv>
      <kv><key>bad</key><value>v3</value></kv>
This document is both ill-formed (missing the end tag) and invalid (one key is bad: non-read-able). The goal is to write a lookup function for a key in such an encoded map that reports all ill-formedness errors, always. We may choose not to report an invalid key if the lookup succeeds before the bad key is encountered.

The following pipeline is the answer:

     type Key   = Int
     type Value = String
     xml_lookup :: Monad m => Key -> Iteratee Char m (Maybe Value)
     xml_lookup key = id .| xml_enum default_handlers .| xml_normalize .| kv_enum (lkup key)
The expression xml_lookup key is the iteratee that consumes characters and produces a Value associated with the given Key, or Nothing if the lookup fails. Alternatively, the expression may throw an iteratee error. The function xml_lookup is built by composition from smaller, separately developed and tested pipeline components; xml_enum, the XML enumerator, and xml_normalize are part of the incremental XML parsing framework:
     xml_normalize :: Monad m => Enumeratee (XMLStream x) (XMLStream x) m a
     xml_enum   :: Monad m => Handlers x m a -> Enumeratee Char (XMLStream x) m a
They parse an XML document incrementally and produce the stream of parsing results -- detecting ill-formed documents and raising the corresponding errors. The ill-formedness check is hence taken care of by the framework.

The enumeratee kv_enum just as incrementally transforms the stream of parsing results to the stream of key-value pairs:

     kv_enum :: Monad m => Enumeratee (XMLStream ()) (Key,Value) m a
     kv_enum = parse_element (name "map") map_parser
      where
      map_parser n iter = 
          parse_many (name "kv") kv1 iter >>=
          finish_element n
      kv1 n iter = do
          key <- parse_element (name "key")   (\n _ -> as_readable n) 0
          val <- parse_element (name "value") (\n _ -> as_str n) ""
          yield_to iter (key,val) >>= finish_element n
The enumeratee relies on a simple DTD parser combinator library. Here, map_parser parses the stream (<kv>...</kv>)*</map> and kv1 handles <key>int-key</key><value>val</value>. This enumeratee parses the stream till the end and hence always reports validation errors, such as a key unreadable as an integer.

Finally the iteratee lkup looks up the given key in the stream of key-value pairs:

     lkup :: Monad m => Key -> Iteratee (Key,Value) m (Maybe Value)
     lkup key = headM >>= check
      where
      check (Just (k,v)) | k == key = return (Just v) -- premature termination
      check Nothing                 = return Nothing
      check _                       = lkup key

Applying xml_lookup 2 to the sample XML document returns the error ``XML [43] (EOF) broken for element map. Stream after the error: ""''. If we add the closing tag </map> we still get an error, ``Reading failure. Stream after the error: [XMLStart value (fromList []) (fromList []),XMLString "v3" "",....]'', because the key bad cannot be parsed as an integer. If we change kv_enum slightly so that it stops parsing as soon as its client iteratee is done, then xml_lookup 2 on the sample document, with the closing </map> added, returns Just "v2". The lookup succeeded before the bad key was encountered.
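The runs above can be reproduced along the following lines (a sketch, assuming the sample document in a string doc and enum_pure_1chunk from IterateeM; ill-formedness and validation errors surface when the resulting iteratee is run):
     
     run_lookup :: String -> Maybe Value
     run_lookup doc = runIdentity $ run =<< enum_pure_1chunk doc (xml_lookup 2)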

We have thus demonstrated constant-space processing -- lookup in an XML document -- while detecting all XML parsing errors. The error reporting comes `for free', courtesy of the XML parsing framework. The whole processing is a pipeline of several stages, each incremental, each with its own error-reporting.

Version
The current version is December 2012.
References
XMLookup.hs [10K]
The commented code incrementally developing the running example and illustrating the reporting of malformed documents and invalid data in detail. It relies on the incremental XML parsing framework XMLIter. The code contains a simple library of parsing combinators, which represent the element structure (`DTD').

 

Parsing combinators with good error reporting

Iteratees may be viewed from several angles: left folds, reducers, coroutines, actors, or free monads. Iteratees can also be regarded as parsing combinators. Although various parsing frameworks such as attoparsec have been combined with iteratees, iteratees are parsing combinators on their own. They are easily composable and nestable; they give incremental, online, constant-space parsers; they can produce error messages with precise location information; and they allow non-determinism without compromising incrementality. This article illustrates these features on an example of parsing HTTP server logs.

The iteratee log parser is simple indeed, which is the point. Its code reads just like the specification for the log file format. There are however two interesting twists: pointing out a parsing error precisely without complicating the parsers, and handling a surprising non-determinism. (The latter is described in a separate article.) Adding location tracking as an orthogonal aspect to an already written parser along with the handling of non-determinism show off parallel iteratee composition: parsing the same input by several iteratees side-by-side.

Our parser is written for the popular, so-called combined logs. They are made of lines whose format is specified in the Apache documentation as

     %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
Here %h is the host name or the IP address of the client, %l is usually a dash and is ignored, %u is the user name, etc. The formats of %r (the request string) and %t (the time stamp) are described separately. Here is a sample log line:
     127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Our log parser extracts the information from one log line and returns it as

     data Hit = Hit{
       hit_remote_addr :: String,
       hit_remote_user :: Maybe String,
       hit_tstamp      :: UTCTime,
       hit_method      :: HTTP_method,
       hit_url         :: String,
       hit_version     :: (Int,Int),
       ...
The parser hence is an iteratee that reads the stream of characters and produces a Hit:
     parse_log_line_proper :: Monad m => Iteratee Char m Hit
     parse_log_line_proper = do
       hit_remote_addr <- parse_symbol
       space
       parse_symbol  -- the net identity, commonly disregarded
       space
       hit_remote_user <- if_not_dash =<< parse_symbol
       space
       hit_tstamp      <- parse_within_delim '[' ']' parse_tstamp
       space
       (hit_method,hit_url,hit_version) <-
         parse_within_delim '"' '"' $ do
           method <- parse_method
           space
           url    <- parse_symbol
           space
           assert_string "HTTP/"
           major  <- read_num_base 10
           assert_string "."
           minor  <- read_num_base 10
           return (method,url,(fromInteger major, fromInteger minor))
       space
       ...
       return $ Hit{..}
      where
        parse_symbol = break (== ' ')
        if_not_dash "-" = return Nothing
        if_not_dash x   = return $ Just x
        parse_method = ...
The parsers for the time stamp and the request string are separate iteratees, befitting the separate specification of these formats in the Apache documentation. The request string is double-quoted and the time stamp is enclosed in square brackets. Correspondingly, they are parsed by parse_within_delim -- a higher-order parser that applies the argument iteratee to the part of the stream within the delimiters. This parse_within_delim is not part of the iteratee library: it is trivially user-defined as:
     parse_within_delim :: (Monad m, Eq e, Show e) => e -> e -> Iteratee e m a -> Iteratee e m a 
     parse_within_delim open_delim close_delim iter = do
       assert_string [open_delim]
       res <- id .| take_while (/= close_delim) iter  
       assert_string [close_delim]
       return res

Our parsers are indeed built by (monadically) stacking up simpler parsers -- which is what parser combinators are all about. Using Haskell abstractions to define parser `macros' is to be expected as well. Iteratees worked as expected, but the parsing task so far has been trivial.

We now come to the first complication of the HTTP log parser. The server logs are made of many lines, and not all of them may be well-formed. When parsing fails, we should report the error, skip the line, but continue with the next one. The error report should not only say that the time stamp date is invalid or an expected quote is missing. The report should also show the bad timestamp string with enough context, so we can see what is wrong with it and where in the file it occurred. Error recovery and error reporting are notorious for making simple parsers messy: we have to add location tracking and `error' tokens to the grammar -- generally, modify the existing parsers in many places. As we shall see, the layered structure of iteratee parsers gives the error recovery for free. Furthermore, adding context to error reports of parse_log_line_proper requires no modification to the parser at all.

Without any error reporting and recovery, the parser of the whole HTTP log is as follows

     httplog_enum_no_err_report :: Monad m => Enumeratee Char Hit m a
     httplog_enum_no_err_report = sequence_stream (id .| whole_line 
                    (parse_log_line_proper))

It applies the line parser to each line in the input stream, passing the Hit to a user-supplied iteratee (e.g., to accumulate hit statistics). The higher-order combinator whole_line is a degenerate version of parse_within_delim for only one delimiter, the newline character. The log parser is written as if the log file is blameless. A parsing error can still be generated, for example, in

     assert_string :: (Show e, Eq e, Monad m) => [e] -> Iteratee e m ()
     assert_string str = do
          n <- heads str
          if n /= length str
             then parser_error $ "Expected " ++ show str ++ " was not found"
             else return ()
but the error is not handled in any way; if it occurs, the entire log parsing fails and all the accumulated hit statistics are lost. This is not acceptable. The fix is easy: trap the parsing exception with en_handle, the analogue of Control.Exception.handle for iteratees:
     httplog_enum_poor_err_report :: Monad m =>
        Enumeratee Char (Either String Hit) m a
     httplog_enum_poor_err_report = sequence_stream (id .| whole_line 
                     (en_handle show parse_log_line_proper))
The user iteratee now receives not just Hit but Either String Hit, the result of a successful or an unsuccessful parse of a log line. Parsing problems no longer abort the processing of the entire file; the user iteratee can keep accumulating hits, perhaps printing diagnostics for ill-formed lines. The error recovery -- skipping bad lines through the newline -- is automatic. Although the argument of whole_line no longer raises any iteratee exceptions, it may consume only a part of the line, until an exception, caught by en_handle, terminates the line parsing. No worries: whole_line, like parse_within_delim and the underlying take_while, always reads the input stream through the closing delimiter, no matter whether the inner iteratee wants the data or not. Error recovery came for free, from the layered design: parsing the file as lines, which are parsed as log data (and further parsed as quoted data).
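For reference, whole_line can be sketched in the same style as parse_within_delim (a hypothetical definition; the library's actual whole_line may differ in details):
     
     -- Apply the argument iteratee to the current line; always consume
     -- the stream through the terminating newline
     whole_line :: Monad m => Iteratee Char m a -> Iteratee Char m a
     whole_line iter = do
       res <- id .| take_while (/= '\n') iter
       assert_string "\n"
       return res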

Our log parser now reports and recovers from errors, but the reports are poor. A report may say that the expected double-quote was not found -- but not where the error has occurred. Adding context to the error message helps. Curiously, this can be done without changing the log line parsers at all. We merely replace en_handle with a more sophisticated version parse_error_handler, which runs parse_log_line_proper in parallel with stream_suffix :: Monad m => Int -> Iteratee e m [e]. The iteratee stream_suffix n, when terminated, returns the last n elements of its stream, if the stream was that long. Hence, parse_error_handler gets a chunk of the stream and passes it to parse_log_line_proper and also to stream_suffix n. If the former finishes normally, its result is returned. If the line parser raised an iteratee error, we ask stream_suffix for the last n elements of the stream so far -- the portion of the stream that has led to the error. (Since iteratees generally get stream data in chunks, we have to deal with incompletely consumed chunks. This can be messy, but luckily it is abstracted away by the iteratee library. The user of the library may well think that iteratees receive stream data one element at a time.)
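The idea can be conveyed by a crude sketch (the library's actual parse_error_handler differs in details: among others, it also reports the portion of the stream after the error and stops stream_suffix at the point of failure):
     
     -- Run the line parser in parallel with stream_suffix; on a parsing
     -- error, attach the last n stream elements as context
     parse_error_handler :: Monad m =>
         Int -> Iteratee Char m a -> Iteratee Char m (Either String a)
     parse_error_handler n iter = do
       (r, context) <- en_handle show iter `enumPair` stream_suffix n
       return $ case r of
         Left err -> Left (err ++ "; stream before the error: " ++ context)
         Right x  -> Right x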

Here is an example: given the following ill-formed log line

     virus.com - - [10/Jul/2009:02:15:58 -0400] "GET /foo.txt/xxx/yyy.reflect.php? reflect_base=xxx? HTTP/1.1" 404 450 "-" "Mozilla/5.0"
parse_error_handler parse_log_line_proper reports Expected "HTTP/" was not found with the stream before error foo.txt/xxx/yyy.reflect.php? and after reflect_bas.

In summary, we have seen that iteratees do behave like parser combinators, letting us combine primitive parsers and define new combinators: higher-order parsers. Iteratees can be combined in less traditional ways, to parse the stream of the results of other iteratees. The layering not only makes parsers lucid and intuitive but also makes error recovery simple. The parallel composition, handling the same input in different ways, lets us add error context to error messages truly as an orthogonal aspect, without any modification to the main parsers.

Version
The current version is April 2014.
References
LogParser.hs [24K]
The complete code for the HTTP log parser with many tests and examples

 

Iteratee fork: backtracking-free non-determinism

The iteratee parser combinator library is geared towards deterministic parsers with small, limited look-ahead. Such parsers are fast, incremental, and work in constant, small memory. Attoparsec has similar design goals. Many everyday parsing chores -- from log files to network packets -- can be handled by these parsers. Yet sometimes large or unbounded look-ahead is inescapable; LL(1) parsers then won't work, in practice or in principle. Even parsing HTTP server logs turns out, in an edge case, to be an unlimited-lookahead problem. Non-deterministic choice is an elegant way to deal with such complex problems (after all, choice, or alternation, has been the staple of the BNF notation). It turns out that iteratees by design easily support non-determinism, without compromising the incrementality of parsing, without backtracking, and without the need to store an unbounded amount of input data.

A simple example that calls for unlimited look-ahead is parsing a string according to the regular expression (abc)*aby | a(bca)*x. Mere searching for that pattern is easy, after converting the regular expression into an equivalent one that can be matched with a fixed, small look-ahead. However, we are interested in parsing rather than recognizing, associating semantic actions with abc and bca -- a task that is beyond deterministic fixed-look-ahead parsers.

A realistic example of unbounded look-ahead is parsing the do-notation in Haskell, for instance:

     do
      [x1,x2,...,xn] <- m
      [x1,x2,...,xn]
Here [x1,x2,...,xn] is a pattern in its first occurrence and an expression in the second. The parser can tell which is which only upon scanning <-. Another, surprising example is parsing HTTP web server logs. The popular combined log format puts the contents of the HTTP User-Agent header in double-quotes. The HTTP specification, however, allows double-quoted strings within that header. We may thus end up with unescaped double-quotes inside a double-quoted string, which cannot be parsed deterministically with a fixed look-ahead.

Non-deterministic choice is a good way to specify such complex grammars; the regular expression in our first example was written that way. The vertical bar, the regular-expression notation for choice, corresponds to the choice combinator of parser combinator libraries. One way to implement it is backtracking, which requires backing up the input stream so there is something to backtrack to. An alternative is to run the parsers to choose from side by side. Once a chunk of the input stream has been fed to all the competing parsers, the chunk won't be needed again; no backing storage is necessary. The iteratee choice combinator follows this approach.

Running iteratees in parallel, feeding off the same input, has already been described elsewhere on this page. The remaining question is how to `split' an iteratee parser at a choice point, to obtain the alternatives to run in parallel. Recall that an iteratee may throw an `exception', to notify its enumerator of a parsing error or other similar condition. Iteratee exceptions are resumable: the enumerator may attempt to `fix' whatever upset the iteratee and resume it, to let it try again. Resumable exceptions helped implement re-positioning in the input stream: seek and tell. Resumable exceptions help again: to split an iteratee computation at a choice point, we throw a resumable exception, which the exception handler resumes twice.

The following iteratee for non-deterministic choice implements the key idea:

     iter_fork :: Monad m => Iteratee e m Bool
Like Unix fork(2), it `returns twice': with False to the parent and with True to the child `process'. It sends a resumable exception which a fork_server receives and resumes twice. The fork_server runs the two iteratees, the parent and the child, side-by-side, finishing whenever one iteratee terminates. The choice combinator with a more familiar interface is just one definition away:
     choice :: Monad m => Iteratee e m a -> Iteratee e m a -> Iteratee e m a
     choice p1 p2 = iter_fork >>= \case {False -> p1; True -> p2}  -- needs LambdaCase
which in turn lets us easily write popular combinators such as many:
     many :: Monad m => (z -> a -> z) -> z -> Iteratee e m a -> Iteratee e m z
     many f z i = return z `choice` (i >>= \a -> many f (f z a) i)
The many combinator shows that, after forking once, the parent or the child process may fork again; the fork_server must be prepared to handle nested fork requests. The many and choice combinators let us implement our examples and go beyond deterministic fixed-lookahead parsing, as sketched below.
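For instance, here is a sketch of a parser for the earlier (abc)*aby | a(bca)*x example. It assumes a hypothetical iteratee string :: Monad m => String -> Iteratee Char m (), which matches exactly the given characters and raises an iteratee error on a mismatch; the semantic action merely counts the repeated groups.
     pattern_parser :: Monad m => Iteratee Char m Int
     pattern_parser = left `choice` right
       where
       left  = do n <- many (\n _ -> n + 1) 0 (string "abc")
                  string "aby"
                  return n             -- the number of abc groups
       right = do string "a"
                  n <- many (\n _ -> n + 1) 0 (string "bca")
                  string "x"
                  return n             -- the number of bca groups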

We have seen how iteratees support non-deterministic choice. With choice implemented, iteratees become a complete parsing combinator library. Our choice combinator is built on iter_fork, a `coin-flipping' iteratee closely related to McCarthy's amb. Its implementation uses no backtracking and no backing storage for the stream; in fact, it is essentially Unix fork.

Version
The current version is April 2014.
References
IterFork.hs [10K]
The complete code for the choice combinator and many examples of non-deterministic parsing.

 

Towards the best collection traversal interface

Most programming languages support collections, represented by an in-memory data structure, a file, a database, or a generating function. A programming language system typically gives us one of two interfaces to systematically access the elements of a collection. One traversal API is based on enumerators -- e.g., the higher-order procedures for-each, map and filter -- of which the most general is fold. The second approach relies on streams, a.k.a. cursors or lazy lists. Generators such as those in Icon, Ruby and Python are a hybrid of the two.

It is well known that given a cursor interface to a collection, we can implement an enumerator. It is less appreciated that given an enumerator interface, we can always derive a cursor -- in an automatic way. We demonstrate this generic procedure for languages with and without first-class continuations.
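The well-known direction is indeed a simple loop. Here is a minimal Haskell sketch (the names are ours, not the paper's): given a cursor, an action yielding the next element if there is one, we obtain the fold by iterating the cursor.
     cursor_to_fold :: Monad m => m (Maybe a) -> (z -> a -> z) -> z -> m z
     cursor_to_fold next f = go
       where go z = next >>= maybe (return z) (go . f z)
For a file regarded as a collection of characters, the cursor is hGetChar guarded by hIsEOF, and cursor_to_fold gives the familiar left fold over the file's contents.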

Now that cursors and enumerators are inter-convertible, the implementor of a collection has a choice: which of the two interfaces to offer natively? We argue for the enumerator interface. The paper elaborates why enumerators are superior: in efficiency; in ease of programming; in more predictable resource utilization and the avoidance of resource leaks. We present a design of the overall optimal collection traversal interface, based on a left-fold-like combinator with premature termination. The design has been implemented and tested in practice.

Version
The current version is 1.5, Nov 6, 2003.
References
LL3-collections-enumerators.txt [23K]
Towards the best collection API: an extended abstract
LL3-collections-talk.pdf [55K]
An extended abstract and a poster presentation at the Lightweight Languages 2003 (LL3) workshop. November 8, 2003, MIT, Cambridge MA.

Generic Zipper: the context of a traversal

Enumerating collections: searching for the best iterator

Relationship between iterations and continuations

 

From enumerators to cursors: turning the left fold inside out

This article is part of the early series that argued for enumerators, or the left fold, as the better traversal interface for any collection, including files and communication channels. The article introduced a non-recursive left-fold and argued that it is the optimal traversal interface in a language without first-class delimited continuations. The non-recursive left-fold lets us instantiate both the ordinary, recursive left fold and a cursor. Both instantiation procedures are generic, as evidenced by their polymorphic types.

The article arose during a discussion of a good database interface for Oracle. For portability, the article illustrates the enumerator inversion technique on the example of a file, regarded as a collection of characters. Haskell already provides a stream interface to that collection: hGetChar. We implement a left-fold enumerator and turn it back into a stream, implementing a function myhgetchar exclusively in terms of the left-fold enumerator.
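A simplified sketch of the inversion, not the article's code: the fold below supports premature termination, and since a Handle keeps its own position, re-entering the fold resumes where the previous call left off. The article's non-recursive left fold achieves such resumption generically, for collections without internal state.
     import System.IO

     -- a left-fold enumerator over a file with premature termination:
     -- the iteratee returns Left to stop early, Right to continue
     fold_file :: Handle -> (z -> Char -> Either z z) -> z -> IO z
     fold_file h f z = do
       eof <- hIsEOF h
       if eof then return z
         else do c <- hGetChar h
                 case f z c of
                   Left  z' -> return z'       -- premature termination
                   Right z' -> fold_file h f z'

     -- inverting the enumerator into a cursor: instantiate the seed
     -- to carry the latest element and stop after each one
     myhgetchar :: Handle -> IO (Maybe Char)
     myhgetchar h = fold_file h (\_ c -> Left (Just c)) Nothing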

Version
The current version is 1.5, Dec 31, 2003.
References
fold-stream.lhs [14K]
The article, as well-commented literate Haskell98 code.
An earlier version was posted on the Haskell mailing list on Tue, 23 Sep 2003 23:59:45 -0700 (PDT).

Towards the best collection traversal interface
This extended abstract of an LL3 presentation lists the advantages of enumerators for traversing collections and discusses the enumerator inversion procedure in a broader context.

Parallel composition of streams: several sources to one sink
A different, and better way to interleave two folds

 

Justifying layered streams for input processing

Handling the many internet protocols and file formats -- which describe structured, multiply encoded data -- calls for layered stream processing. For example, the lowest-level stream reads data in blocks or ethernet frames. An overlay stream provides abstractions of characters or bytes. Higher streams take care of Base64, UTF and other decoding, assembling data into words, data structures, or Unicode characters. It is often useful to layer a stream that reads at most n units (octets, bits, etc.) from a lower-level stream. A layered stream may transparently accumulate the SHA-1 digest of the read data, so that the signature can be checked afterwards. It is important to be able to push and peel off the layers at run time; the latter is indispensable when dealing with HTTP streams carrying multi-part messages.
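To make the layering concrete, here is a toy sketch of the `at most n units' layer. The iteratee type below is a bare-bones stand-in for a real library's, defined just for this illustration.
     -- a toy iteratee: either done, or asking for the next element
     -- (Nothing signals end of stream)
     data It e a = Done a | Get (Maybe e -> It e a)

     run_eof :: It e a -> a            -- finish an iteratee by sending EOF
     run_eof (Done a) = a
     run_eof (Get k)  = run_eof (k Nothing)

     -- the layer: give the inner iteratee at most n elements, then EOF;
     -- the layer itself consumes exactly n elements (or up to the outer
     -- EOF), keeping the enclosing stream aligned
     take_n :: Int -> It e a -> It e a
     take_n 0 i = Done (run_eof i)
     take_n n i = Get step
       where
       step Nothing  = Done (run_eof i)        -- outer stream ended early
       step (Just e) = take_n (n - 1) (feed e i)
       feed e (Get k)  = k (Just e)
       feed _ (Done a) = Done a                -- inner done: skip the rest
For example, with an inner iteratee count = go 0 where go n = Get (maybe (Done n) (const (go (n + 1)))) that counts its elements, the layered take_n 2 count delivers exactly two elements to count and leaves the rest of the outer stream for the next consumer.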
Version
The current version is 1.1, 2002.
References
layered-io.txt [6K]
Advocating layered IO on the example of HTTP multi-part processing. A message Layered I/O posted on the Haskell-Cafe mailing list on Tue, 14 Sep 2004 23:55:25 -0700 (PDT).

io.txt [26K]
An early draft of an ambitious layered IO proposal, cast in the context of Scheme. More polished drafts exist, and even a prototype Scheme implementation. Unfortunately, once it became clear that the ideas were working out, the motivation fizzled.



Last updated December 30, 2016

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!
