A series of list appends or monadic binds for many monads performs algorithmically worse when left-associated. Continuation-passing style (CPS) is well-known to cure this severe dependence of performance on the association pattern. The advantage of CPS dwindles or disappears if we have to examine or modify the intermediate result of that series of appends or binds, before continuing the series. Such examination is frequently needed, for example, to control search in non-determinism monads.
We present an alternative approach that is just as general as CPS but more robust: it makes series of binds and other such operations efficient regardless of the association pattern -- and also provides efficient access to intermediate results. The key is to represent such a conceptual sequence as an efficient sequence data structure. Efficient sequence data structures from the literature are homogeneous and cannot be applied as they are in a type-safe way to a series of monadic binds. We generalize them to type aligned sequences and show how to construct their (assuredly order-preserving) implementations. We demonstrate that our solution solves previously undocumented, severe performance problems in iteratees, LogicT transformers, free monads and extensible effects.
Joint work with Atze van der Ploeg.
The performance advantage of efficient sequences is tangible, e.g., when building a sequence by concatenating many smaller ones, as typically happens in serialization. With lists, whose append runs in time proportional to the length of the left argument, even simple examples set off the performance trap, as we shall see below. Turning to difference lists lets us out: the underlying continuation-passing style (CPS) magically re-associates appends in the most efficient pattern, to the right. Alas, CPS saves us only so long as we do not examine the list built so far. Efficient sequences once and for all solve the performance problems of building sequences from smaller ones and examining intermediate results. This work applies what is well-known about the efficient concrete sequences to abstract ones.
The pattern of building a complex structure, transformation or monadic action by adjoining, composing or binding smaller parts is pervasive. Also frequent is the need to examine intermediate results to decide on further steps. A transformation or a monadic action is not a sequence. But the series of steps taken to construct it is a conceptual sequence. Representing that abstract series of steps as a concrete efficient sequence helps here as well.
For illustration, we show a simple but slow implementation of iteratees as a Free Monad. The monadic bind here, like list append, is associative and performs much better when right-associated. Representing the abstract series of binds as an efficient type-aligned sequence data structure algorithmically improves the performance of iteratees. The Haskell Symposium paper shows more examples of the speed-up, taken from the implementations of delimited control and non-determinism.
Our observations about the performance problems of appends and monadic binds, and the offered solutions hold in call-by-value as they do in call-by-need and can be implemented in, say, OCaml or Scala. For concreteness, on this page we will be using Haskell.
Video of Atze van der Ploeg's presentation at the Haskell Symposium 2014 (September 5, 2014):
< https://www.youtube.com/watch?v=_XoI65Rxmss&list=UUP9g4dLR7xt6KzCYntNqYcw >

Accompanying source code:
< https://github.com/atzeus/reflectionwithoutremorse >
Although our examples are in Haskell, everything we describe here applies to any other functional language.

Recall, the Haskell list type [a] of elements of type a is a built-in data structure, which could have been defined as

    data [a] = [] | a : [a]

Furthermore, the type String is an alias of [Char]. Henceforth we will be using strings as concrete lists. The list append operation is the infix (++), defined as

    (++) :: [a] -> [a] -> [a]
    []    ++ y = y
    (h:t) ++ y = h : (t ++ y)
It traverses the left operand but merely passes around the right one. The operation is clearly associative: (l1 ++ l2) ++ l3 and l1 ++ (l2 ++ l3) give the same result for any lists l1, l2 and l3. Yet the asymmetry in handling the operands shows up as a difference in performance: if the lists l1, l2 and l3 each have length n, the right-associated version l1 ++ (l2 ++ l3) needs 2n cons operations (:) whereas the left-associated version needs 3n. For appending many small lists, the right-associated pattern does algorithmically better than the left-associated one: linearly rather than quadratically. The reader is encouraged to verify this as an exercise. As a slightly more advanced exercise, confirm that the conclusion holds regardless of the call-by-value or call-by-need evaluation strategy.
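To compare the two patterns directly, here is a minimal sketch (the helper names leftAssoc and rightAssoc are ours) that fixes each association explicitly; timing the length of either result in GHCi shows the difference:

    leftAssoc, rightAssoc :: [[a]] -> [a]
    leftAssoc  = foldl (++) []  -- (([] ++ l1) ++ l2) ++ ... : each append
                                -- re-traverses the growing prefix; quadratic
    rightAssoc = foldr (++) []  -- l1 ++ (l2 ++ (... ++ [])) : each element
                                -- is consed exactly once; linear

    -- e.g. length $ leftAssoc  (replicate 10000 "abc")  -- noticeably slow
    --      length $ rightAssoc (replicate 10000 "abc")  -- fast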
The problem is clear, and so is the solution: just associate append operations to the right. That is easier said than done, as we are about to see, already on a simple example.
Our running example is printing out (serializing) a JSON-like document, represented in memory as the data structure

    data JSON = A Int | Node String JSON | Arr [JSON]

    sample = Node "Root" (Arr [Node "Ch1" (A 1), Node "Ch2" (A 2)])
Integers are atomic documents; documents may be grouped into an array and given a name, as we see in the sample document. Another, deeply nested, sample is produced by makej n, with n nested nodes and no arrays:

    makej :: Int -> JSON
    makej 0 = A 1
    makej d = Node "node-name" (makej (d-1))
The most straightforward function to show this document as a string

    showj :: JSON -> String
    showj (A x)      = show x
    showj (Node t j) = "{" ++ t ++ " " ++ showj j ++ "}"
    -- Don't handle Arr yet

has surprisingly poor performance: showj (makej 4096) produces a scant 49KB of output, and yet running length $ showj (makej 4096) takes 4.8 seconds in GHCi on a modern Core i3 laptop. As the size of the output increases, the running time grows super-linearly. The performance trap of list append is truly easy to fall into.
Fortunately, there is a way out: a more efficient representation for lists, namely, difference lists. They are used for the intermediate results and converted to the real list, by fromDiff, at the very end:

    type DiffList a = [a] -> [a]

    fromDiff :: DiffList a -> [a]
    fromDiff x = x []

    toDiff :: [a] -> DiffList a
    toDiff l = \t -> l ++ t

    showjCPS :: JSON -> String
    showjCPS j = fromDiff (go j)
     where
     go :: JSON -> DiffList Char
     go (A x)      = toDiff $ show x
     go (Node t j) = toDiff "{" . toDiff t . toDiff " " .
                     go j . toDiff "}"
     -- Don't handle arrays yet
A difference list is a function that represents a list prefix, to be added to the suffix received as an argument. Appending difference lists is composing the corresponding functions, which is constant-time. Re-writing showj to use difference lists internally is merely replacing the regular list append (++) with the difference-list append (.). It is easy to recognize the re-writing as the conversion of showj into continuation-passing style: the continuation here is the tail of the list being constructed. Now, length $ showjCPS (makej 4096) runs in 0.03 seconds -- a more than 100-fold improvement! One can really feel such speed-ups, however puzzling they are. After all, eventually we do rely on the same list append (++) to concatenate strings, see toDiff. However, CPS always associates these appends to the right, which is most efficient. It is another good exercise for the reader to reflect on how CPS manages such a profitable re-association.
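To see the profitable re-association concretely, one can unfold a left-associated composition of difference lists by hand (an equational sketch, with l1, l2, l3 standing for arbitrary strings):

    -- fromDiff ((toDiff l1 . toDiff l2) . toDiff l3)
    --   = ((toDiff l1 . toDiff l2) . toDiff l3) []
    --   = toDiff l1 (toDiff l2 (toDiff l3 []))
    --   = l1 ++ (l2 ++ (l3 ++ []))
    -- However the compositions were associated, the appends
    -- come out associated to the right.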
We have not handled the array grouping yet. We'd like to show arrays the way Haskell shows lists, e.g. "[1,2,3,4]" -- with the elements separated by commas and the whole output enclosed in brackets. The format is slightly irregular: there is a comma after each element, except for the last one. Everyone who has dealt with separators knows how irritating this exception is. Yet there is an elegant way of writing this formatting by taking its definition literally:

    showLst :: Show a => [a] -> String
    showLst []    = "[]"
    showLst (h:t) = let (p:s) = showLst t
                    in p : show h ++ (if s == "]" then s else "," ++ s)

One can read the code as: to show a non-empty list (h:t), show its tail and then insert the result of show h right after the opening bracket, together with the comma; do not put the comma before the closing bracket. The program demonstrates examining and modifying an intermediate result (the result of showing the tail of the list), to wedge in show h and the comma. This inspection and modification of the intermediate result is called reflection.
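For instance (a GHCi session sketch):

    ghci> showLst [1,2,3,4]
    "[1,2,3,4]"
    ghci> showLst ([] :: [Int])
    "[]"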
Let's add this algorithm to our JSON serialization function:

    showjCPS1 :: JSON -> String
    showjCPS1 j = fromDiff (go j)
     where
     -- this part is as before
     go (A x)      = toDiff $ show x
     go (Node t j) = toDiff "{" . toDiff t . toDiff " " .
                     go j . toDiff "}"
     -- the new part: showing arrays of nodes
     go (Arr [])    = toDiff "[]"
     go (Arr (h:t)) =
       let (p:s) = fromDiff (go (Arr t))
       in (p:) . go h . toDiff (if s == "]" then s else "," ++ s)

Recall, the elegant list serialization needs to examine the intermediate result. The internally used difference list is a function and cannot be examined. It has to be converted to a regular list first, by fromDiff, and then converted back to a difference list, by toDiff, to continue the formatting. The fromDiff conversion is expensive: it forces all the list appends that have been put off. That is not a good sign. Indeed it is not: showing an array of a single deeply nested document, showjCPS1 (Arr $ replicate 1 (makej 80000)), produces 96KB of output and takes 0.50 seconds. On the other hand, showing an array of many shallow documents, showjCPS1 (Arr $ replicate 2000 (makej 10)), produces one-quarter of that output but takes 16 times longer! Thus difference lists, or the CPS, only partly solve the performance problem of badly associated appends. In other words, reflection destroys the performance advantage of CPS.
We can reflect without remorse if we choose a better-suited data structure. Instead of difference lists, with their efficient concatenation but inefficient deconstruction, we turn to a data structure in which appending and construction/deconstruction at the left edge are all efficient. Okasaki's book lists several candidates. The Data.Sequence module in the Haskell standard library provides a data structure with exactly the desired properties. With these sequences (imported with the prefix Seq), the serialization function becomes:
    -- imports needed for this fragment
    import qualified Data.Sequence as Seq
    import Data.Sequence ((><), ViewL(..))
    import Data.Foldable (toList)

    showjSeq :: JSON -> String
    showjSeq j = toList (go j)
     where
     char :: Char -> Seq.Seq Char
     char = Seq.singleton

     str :: [Char] -> Seq.Seq Char
     str = Seq.fromList

     go :: JSON -> Seq.Seq Char
     go (A x)       = str (show x)
     go (Node t j)  = char '{' >< str t >< char ' ' >< go j >< char '}'
     go (Arr [])    = str "[]"
     go (Arr (h:t)) =
       let (p :< s) = Seq.viewl (go (Arr t))
       in char p >< go h ><
          (case Seq.viewl s of {']' :< _ -> s; _ -> char ',' >< s})

The code is quite like the original version, but with a different ``append operator'': (><) in place of (++). Sequences can be deconstructed, with viewl, obtaining the left-most element p and the rest s -- which remains a sequence, rather than a list, and so does not have to be converted back to a sequence after prepending the comma. The results speak for themselves: showjSeq (Arr $ replicate 1 (makej 80000)) takes 1 second (slower than with showjCPS1) but showjSeq (Arr $ replicate 2000 (makej 10)) takes only 0.18 seconds, which is 60 times faster than before. Again, a small change in the algorithm gives a remarkable speed-up.

In the upshot, the efficient sequences fully solve the performance problems of building a structure from small components and examining the intermediate results. The solution applies to any associative operation that recursively examines only one of its arguments.
The efficient data structure for monadic binds is not one of the Okasaki sequences: those are all homogeneous, whereas the monadic computations bound in a series in general have different result types. Yet the Okasaki structures generalize easily to heterogeneous, type-aligned sequences; please see the Haskell Symposium paper for details. Below we use FastTCQueue from the type-aligned package on Hackage.
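For reference, here is the gist of the interface we rely on below, paraphrased from the TASequence class of the type-aligned package (the module name Data.TASequence.FastCatQueue is assumed):

    import Data.TASequence.FastCatQueue

    -- A type-aligned sequence s c x y holds elements c a b whose types
    -- line up, forming an x-to-y ``pipeline''. Paraphrased signatures:
    --   tempty :: s c x x                        -- empty sequence
    --   (|>)   :: s c x y -> c y z -> s c x z    -- snoc on the right
    --   (><)   :: s c x y -> s c y z -> s c x z  -- append
    --   tviewl :: s c x y -> TAViewL s c x y     -- left view: TEmptyL,
    --            or h :| t with h :: c x b and t :: s c b y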
The running example is incremental parsing, or the iteratee. For the sake of explanation, it is not written in full generality. Abstracting over the effect signature gives the Free Monad and, further, Extensible Effects. The speed-ups we obtain here apply to all these general constructions. An iteratee It is defined by its effects, which in our case are:

    data It i a = Pure a | Get (i -> It i a)

That is, the computation It i a may be pure, producing the value of type a without any effects. Or it may request to input a value of type i and, upon receiving it, eventually produce the value of type a, perhaps after asking for more i values. Here is the simple iteratee that inputs a value and immediately returns it:

    get :: It i i
    get = Get return

It is analogous to getchar in C, only the type of its input is not limited to characters. Like getchar, the iteratee get is the simplest parser.
To build bigger parsers, we have to specify how to compose iteratees:

    instance Monad (It i) where
      return = Pure
      Pure x >>= k = k x
      Get f  >>= k = Get (f >>> k)

where the operation (>>>), technically called Kleisli composition, is just the composition of two effectful functions:

    (>>>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
    f >>> g = (>>= g) . f
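A caveat for modern GHC: since the Functor-Applicative-Monad proposal (GHC 7.10), the Monad instance above must be accompanied by Functor and Applicative instances. The routine boilerplate, as a sketch:

    import Control.Monad (ap, liftM)

    instance Functor (It i) where
      fmap = liftM

    instance Applicative (It i) where
      pure  = Pure
      (<*>) = ap

The same boilerplate, mutatis mutandis, serves the ItZ monad defined below.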
Here are a couple of bigger parsers, built from get by monadic bind. The second, addN, inputs and sums up n numbers from the input stream.

    addGet :: Int -> It Int Int
    addGet x = get >>= \i -> return (i+x)
    -- glibly: addGet x = (+ x) <$> get

    addN :: Int -> It Int Int
    addN n = foldl (>>>) return (replicate n addGet) 0
Finally, feedAll below runs our parsers on the input stream [a]. If an iteratee asks for input, feedAll gives one element off the stream, unless the stream is exhausted. In iteratee papers, this function is called an enumerator.

    feedAll :: It a b -> [a] -> Maybe b
    feedAll (Pure a) _       = Just a
    feedAll _        []      = Nothing
    feedAll (Get f)  (h : t) = feedAll (f h) t
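A quick check (a GHCi session sketch):

    ghci> feedAll (addN 3) [1..]
    Just 6
    ghci> feedAll (addN 3) [1,2]   -- the stream runs dry
    Nothing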
The last clause in the monadic bind, Get f >>= k = Get (f >>> k), has the telltale pattern of recursively examining the left argument and merely passing around the right one. One suspects there could be a performance problem similar to the one with list append. It is there indeed: the running time of feedAll (addN n) [1..] increases quadratically with n. For n=4000, it takes 9.30 seconds. There has to be a faster way to sum up 4000 integers on a modern laptop, even with all the GHCi interpretation overhead.
Just like with list appends, one can try CPS. Alas, feedAll has another telltale pattern -- examining the intermediate result, the iteratee, to find out if it is finished or wants more data. Thus feedAll is built around reflection, which is bad news for CPS.
We now write an efficient iteratee, to be called ItZ. Looking at the monadic bind again shows that iteratees are built by (>>>)-composing ``effectful functions'' a -> It i b, for which we define the abbreviation:

    newtype Arr i a b = Arr{unArr :: a -> ItZ i b}

The conceptual series of composition steps can be represented concretely, as a type-aligned sequence:

    type Arrs i a b = FastTCQueue (Arr i) a b

We can treat Arrs i a b also as an effectful function a -> ItZ i b: the operation appZ below does exactly this conversion. Arrs i a b is built by composing a series of Arr i functions. Unlike functional composition, we can ``uncompose'' Arrs i a b:

    appZ :: Arrs i a b -> a -> ItZ i b
    appZ q x = case tviewl q of
      TEmptyL    -> return x
      Arr h :| t -> case h x of
        ZPure x -> appZ t x
        ZGet f  -> ZGet (f >< t)
The efficient iteratee hence becomes:

    data ItZ i a = ZPure a | ZGet (Arrs i i a)

The continuation -- what to do after acquiring the input value i -- is now represented as a concrete sequence of small compositions rather than one opaque i -> It i a function as before. The monadic bind appends a new continuation to that sequence:

    instance Monad (ItZ i) where
      return = ZPure
      ZPure x >>= k = k x
      ZGet f  >>= k = ZGet (f |> Arr k)    -- NEW
The simple parser is now

    getZ :: ItZ i i
    getZ = ZGet tempty

where tempty is the empty sequence. The other parsers, like addGetZ and addNZ, are exactly as before, as sketched below; see the accompanying code for full details.
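As a sketch, following the earlier definitions verbatim (the generic (>>>) works for any monad, hence for ItZ as well):

    addGetZ :: Int -> ItZ Int Int
    addGetZ x = getZ >>= \i -> return (i+x)

    addNZ :: Int -> ItZ Int Int
    addNZ n = foldl (>>>) return (replicate n addGetZ) 0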
The enumerator is almost the same as before, only using appZ to apply the ``data-structure continuation'' Arrs i a b to the argument a:
    feedAllZ :: ItZ a b -> [a] -> Maybe b
    feedAllZ (ZPure a) _       = Just a
    feedAllZ _         []      = Nothing
    feedAllZ (ZGet f)  (h : t) = feedAllZ (appZ f h) t

Now, feedAllZ (addNZ n) [1..] runs in time linear in n; for n=4000, it takes 0.13 seconds, which is 70 times faster than the original version. The improved version also allocates 147 times less memory during the GHCi interpretation. The key to the improvement is appZ: when applying the series of continuations Arrs i a b to an argument, it avoids (>>>)-composing out all the continuations in the sequence. Rather, it peels off the first one and applies it to the argument. If the resulting iteratee waits for input, we append the tail of the original series of continuations, as it was, to that iteratee's continuation. It is hence important that we do the compositions incrementally, and that the peeling-off and appending all take constant time.
oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!