We propose a programming style for incremental stream processing based on typed simple generators. It promotes modularity and decoupling of producers and consumers just like lazy evaluation. Simple generators, however, expose the implicit suspension and resumption inherent in lazy evaluation as computational effects, and hence are robust in the presence of other effects. Simple generators let us accurately reason about memory consumption and latency. The remarkable implementation simplicity and efficiency of simple generators strongly motivates investigating and pushing the limits of their expressiveness.
To substantiate our claims we give a new solution to the notorious pretty-printing problem. Like earlier solutions, it is linear, backtracking-free and with bounded latency. It is also modular, structured as a cascade of separately developed stream transducers, which makes it simpler to write, test and to precisely analyze latency, time and space consumption. It is compatible with effects including IO, letting us read the source document from a file, and format it as we read.
Joint work with Simon Peyton-Jones and Amr Sabry.
The annotated slides of the talk presented at APLAS 2012 on December 12, 2012 in Kyoto, Japan.
To illustrate, we take the main point example from Lennart Augustsson's
well-discussed May 2011 blog post. The example is the Haskell function
which tells if an element of a list satisfies a given predicate.
any :: (a -> Bool) -> [a] -> BoolWhen the predicate returns
Truefor some element of the list,
Trueand stop scanning the list. The function can be defined by the composition of two already written, standard Haskell functions
any f = or . map f or :: [Bool] -> Bool map :: (a -> b) -> [a] -> [b]Whenever the list contains an element satisfying the predicate, thus defined
True-- even if the list is infinite:
t1 = any (>10) [1..] -- TrueWe will not get any result in a strict language, where arguments must be fully evaluated before the application and hence
orgets to work only after
map ffinishes -- which, if it is applied to an infinite list, does not. The infinite list is not a frivolity: Although all practical programs ever examine a finite prefix of an infinite list, the size of that prefix is often difficult to tell beforehand. Lazy evaluation spares us from guessing the limits and imposing arbitrary restrictions. Rather, lazy evaluation generates as much of the list as needed for a given task, on demand.
The on-demand evaluation helps even if the list is finite. In a strict
map f must first build the intermediate list of Booleans
or can search through it. In Haskell,
or consumes a new
element of the intermediate list as soon as
map f produces it. The
intermediate list is never constructed in full.
any processing chain can be grown farther. For example, the
input list of numbers could be the result of parsing a column of
t2 = any (>10) . map read . lines $ "2\n3\n5\n7\n11\n13\nINVALID" -- TrueSplitting into lines, parsing, comparing -- all happens incrementally and on demand. Indeed,
"INVALID"is unparseable as a number. After the satisfying element is found, there is no further demand on the list of integers, and rest of it is not constructed.
But what if the string to parse and search is to be read from a file?
read_string to be an IO action that reads a string from a file,
we may re-write the code as
(any (>10) . map read . lines) `fmap` (read_string fname :: IO String)How much of the file
read_stringhas to read however? For as long as
anyneeds data. However,
anyis a pure function consuming the result of an
read_string, the action that returns the result after it fully finishes. An action cannot yield the result half-way. Therefore,
anycannot give the feedback to
read_stringwhile it is still reading. Without the feedback,
read_stringhas no choice but to read the whole file. We have lost the incrementality and the early termination, and we have to completely re-write the code to regain them.
But an action can yield the result half-way, suspending and resuming
as needed! It is called Lazy IO: the standard Haskell function
arranges for the on-demand reading of a file:
t2_lazy = do h <- openFile test_file ReadMode str <- hGetContents h let result = any (>10) . map read . lines $ str return resultLazy IO is not a solution however: it is a problem. Its many pitfalls have been described in detail elsewhere (see the FLOPS 2012 paper). For example: the above code has the action to open a file but no action to close it. Since the point of the example is to return
Trueas soon as the satisfying line is found, the file in general will not be read completely. Therefore, we have to rely on the garbage collector to close the file, when it gets around to it, if ever. The Lazy IO code thus leaks a scarce resource, the file handle. Lazy IO is especially problematic when the input comes from a pipe or a network channel. It is not specified how much
hGetContentsreally reads. It usually reads by buffer-full and it may read-ahead arbitrarily, which for communication channels can (and does) lead to a deadlock.
Let's re-write the reading-parsing-checking pipeline
in a strict language, embedded in Haskell via the (strict) IO monad.
t2_strict = bracket (openFile test_file ReadMode) hClose (loop (>10)) where loop f h = do r <- liftM f $ liftM read $ hGetLine h if r then return True else loop f hThe code reads as little of the file as needed to decide if the file has a line with the number greater than 10. We attain the incremental processing and early termination. The code now includes
hClose: the file will definitely be closed as soon as the result is determined. A strict language lets us reason about resources such as memory, file handles, etc. and tightly control them. However, the compositionality is lost. The program
t2_strictis not written by combining standard list-processing functions. We had to write the recursion explicitly and inline
or, hence forgoing code reuse. Lennart Augustsson expounds on this principal drawback of strict languages in his blog article.
There is one more option: incremental IO via a stream, or monadic list:
data ListV m a = Nil | Cons a (List m a) type List m a = m (ListV m a)Like Lazy IO, stream processing lets us reuse code and build programs by pipe-lining previously written fragments. Also like Lazy IO, stream processing leaks resources such as file handles; see the accompanying code for details.
To summarize: Non-strict evaluation promotes modularity and compositionality by decoupling producers and consumers. It is incompatible with effects and makes it very difficult to reason about memory and other resources. Strict languages, in contrast, have the excellent resource cost model and are compatible with effects, but couple data consumers and producers and hence break modularity and inhibit reuse. However, there is a way to untangle producers and consumers even in strict languages: generators.
The complete code for the example
The APLAS paper briefly talks about tying the knot -- a fascinating application on one hand, which also lets us tie up ourselves with our own rope.
Preventing memoization in (AI) search problems
Lazy evaluation is a vexing obstacle for non-deterministic, probabilistic programming and general AI search problems, where memoization is exactly the wrong trade-off.
The full version of the paper presented at FLOPS 2012. The first half of the paper describes many problems of Lazy IO.
type GenT e m instance MonadTrans (GenT e) type Producer m e = GenT e m () type Consumer m e = e -> m () type Transducer m1 m2 e1 e2 = Producer m1 e1 -> Producer m2 e2 yield :: Monad m => e -> Producer m e runGenT :: Monad m => Producer m e -> Consumer m e -> m () foldG :: Monad m => (s -> e -> m s) -> s -> Producer (StateT s m) e -> m sThe interface is simple: one abstract type
GenT e m aand three functions.
GenT eis a monad transformer for generators that yield the value of type
ein some monad
m. Intuitively, yielding is like tracing, from
printfor similar debug tracing statements embedded in the code. As the program runs it sends traced intermediate results, usually to a log file. A debugger can intercept the traced data to analyze them.
Producer m e is a computation executed only for its tracing,
of values of type
e; the overall result is irrelevant. The generator
interface provides one primitive producer,
yield. Other producers are
built by monadic composition: for example, a producer that reads a file
and yields its characters one-by-one.
fileG :: (GBracket m, MonadIO m) => FilePath -> Producer m Char fileG fname = gbracket (liftIO $ openFile fname ReadMode) (liftIO . hClose) loop where loop h = do b <- liftIO $ hIsEOF h if b then return () else liftIO (hGetChar h) >>= yield >> loop hThe overall result is indeed irrelevant: one runs
fileGfor the data it yields. The code looks like the strict IO code
t2_strictof the running example described earlier. Emphatically, the file handle is closed as soon as the result (or an exception) is obtained. As in strict IO, the generator lets us account for and tightly control resources. Unlike
fileGdoes no processing of file data -- it merely yields them. The function
fileGis a only producer. The consumption of file data is clearly untangled; the same
fileGproducer can be used with many consumers.
Consumer m e is an action to be executed on `intercepted trace
data'. The standard function
putChar is one, simple consumer. Its
type is indeed
Consumer IO Char. The primitive
runGenT connects a
producer and a consumer, acting like debugger's setting an action to
do at a trace point, and running a program. Hooking up
putChar gives us the
cat program that copies a file to the
catG :: FilePath -> IO () catG fname = fileG fname `runGenT` putChar
Simple generators have very simple semantics, embodied in the equation
runGenT C[yield v] consumer === C[consumer v]where
Cis a (monadic) evaluation context containing no embedded
runGenT. The consumer action is executed right at the trace point. In the case of
catG, the read character is immediately written out; no intermediate data structures are built.
A more interesting consumer is the generator analogue
of the function
or :: [Bool] -> Bool on lists.
orG :: Monad m => Producer (ErrT Bool m) Bool -> m Bool orG gen = either id (const False) `liftM` runErrT (runGenT gen orC) where orC :: MonadError Bool m => Consumer m Bool orC True = throwError True orC False = return ()The semantics of Haskell's
oris purposely undetermined:
orshould stop and return
Trueelement is found. However,
ordoes not have to stop right away. It may speculatively examine a few more elements of the list, so long as this speculation is bounded. With effects we cannot afford such ambiguity: Any speculation becomes observable. By throwing the exception,
orGmakes it clear that the evaluation really stops, as soon as the
Trueelement is found.
What makes simple generators interesting is that
yield is not only
a producer but is also a consumer. Indeed,
yield also has the type
Consumer (GenT e m) e. In other words, a consumer of data may itself
yield, to produce other data. The equational semantics of simple
generators derives the following law:
runGenT gen yield === genwhere
genis any generator. Thus
yieldis the left and the right unit of
yield is both a consumer and a producer, we can write
Transducer m1 m2 e1 e2 consumes data of the type
e2. If one regards
Producer m e as an effectful version
of a list
Transducer m1 m2 e1 e2 is a list
transformer. A simple list transformer is
map. Here is the effectful
version of it:
mapG :: Monad m => (e1 -> e2) -> Transducer (GenT e2 m) m e1 e2 mapG f gen = runGenT gen (yield . f)Once
e1, it is passed to the function
fwhose result is immediately yielded. Therefore,
mapGtransduces in lock-step. It keeps no transformation state. If the running time of
fis bounded, so is the latency of
mapG f(the time from obtaining the input to producing the output). The function
foldG-- which is the analogue of the (left) list fold, as its signature suggests -- lets us write stateful transducers. The transducer
linesGbelow transforms a producer of
Charto the producer of lines. It uses the internal state -- the look-ahead buffer -- to accumulate characters until it sees a newline, at which point the accumulated string is yielded. The transducer is the analogue of
type TrState m = StateT String (GenT String m) linesG :: Monad m => Transducer (TrState m) m Char String linesG gen = foldG tr  gen >>= at_end where tr s '\n' = yield (reverse s) >> return  tr s c = return (c:s) at_end  = return () at_end s = yield (reverse s)
We can now re-write the
any example from the previous section:
anyG f = orG . mapG f t2_gen :: IO Bool t2_gen = anyG (>10) . mapG read . linesG $ fileG test_fileThe code looks remarkably like that for lazy evaluation,
t2. Complex producers and consumers are built by literally composing previously written components. There is clearly code reuse. Unlike the lazy evaluation code
t2, we read from a file, and we definitely close the file as soon as the result is obtained.
Our generators are intentionally simple. They are
asymmetric, one-shot, implicit suspensions. They are asymmetric because
yield only passes data into the context without accepting anything from it.
Simple generators are one-shot since the computation suspended by
can be resumed only once. Unlike delimited control operators, simple
generators do not explicitly capture their context as a first-class object.
For these reasons, simple generators are limited in expressiveness. For
example, two simple generators cannot run side-by-side; we cannot
solve the same fringe problem with simple generators. But they are
very easy to implement: simple generators are essentially restartable
exceptions. Their implementation in Haskell is trivial, directly following
from the equational specification.
type GenT e m = ReaderT (e -> m ()) m yield :: Monad m => e -> Producer m e yield e = ask >>= \foreach -> lift $ foreach e runGenT :: Monad m => Producer m e -> Consumer m e -> m () runGenT m f = runReaderT m f
Simple generators were first introduced in 1975 in the programming
language CLU (where they are called iterators). In CLU,
amounts to a regular function call: ``a yield effectively calls the
loop body, which returns to the iterator when it is finished. Imposing
the limitations on iterators was done to get the efficient, single
stack implementation, albeit at the expense of some expressive
power.'' (Barbara Liskov: A History of CLU). The CLU implementation was
the inspiration for our Haskell library of simple generators, which
are likewise lightweight. The insight of the CLU implementation
yield calling the consumer as a regular function, which it must
have obtained from the environment.
We have shown how simple generators decouple data producers and consumers in a (embedded) strict language, and how simple they are and easy to implement. Aren't they too simple? Quite surprisingly, they turn out expressive enough for the notorious pretty-printing problem.
The interface and implementation for simple generators
Running example from the APLAS paper (tab expansion) and simple illustrative examples of generators and transducers
A more detailed and slow derivation of simple generators and an example of repmin-like problem (tree traversals) implemented with simple generators
The pretty-printing problem was posed by Oppen in 1980 as a `nice' formatting of a tree-structured document within a fixed page width. Oppen has abstracted the problem to the bare minimum; his formulation is used in almost all other subsequent works on the topic. In Haskell, the source document is represented by the data type:
data Doc = Text String | Line | Doc :+: Doc | Group DocThe source document is composed of text with advisory line breaks
Line, which are to be interpreted consistently within a
Lines in a group (excluding embedded groups) are to be interpreted either as all spaces or as all line breaks.
Lines should be formatted as spaces if the entire group fits within the remainder of the current line. For example, a sample document
Group (Text "A" :+: Line :+: Group (Text "B" :+: Line :+: Text "C"))should we formatted on one line if the page width
wis 5 or more, on two lines if the page width is 3 or 4, and on three lines otherwise:
A B C A A B C B C
The desired optimal algorithm should have:
Ais seen by the formatter till the moment it is placed in the output document;
Oppen has demonstrated the first optimal algorithm, with
co-routines. Oppen's work has inspired a large number of attempts to derive the optimal algorithm equationally (unsuccessfully so far) or
to find a simpler algorithm. The problem is indeed complex: the naive
implementation has the running time exponential in document size and
the unbounded latency -- and so has the backtracking. Memoization, or
pre-computing the widths of all groups, reduces the running time to
n but requires likewise linear in
n extra space; the
latency remains unbounded. Satisfying all the requirements has proved
to be challenging. A particular challenge is finding a modular way to add
the second optimization, pruning (computing the widths of groups
only up to
The full version of the paper (Sections 4 and 5 and Appendix A) explain the solution in detail, demonstrating reasoning about time, space, and latency.
The complete code for the solution, and the benchmarks
oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!
Converted from HSXML by HSXML->HTML