fold) as the composition of a map and a monoid reduction. As a result, some seemingly sequential algorithms can run not just in parallel but embarrassingly in parallel: the input sequence can be arbitrarily partitioned (and recursively sub-partitioned, if desired) among workers, which then run in parallel with no races, dependencies or even memory bank conflicts. Such embarrassing parallelism is ideal for multi-core, GPU or distributed processing.
The general principle is well-known; also well-known is that its specific applications require ingenuity.
Recall, folding over a sequence is an inherently sequential stateful accumulation; using lists for concreteness, it is defined as

    fold_left  : ('z -> 'a -> 'z) -> 'z -> 'a list -> 'z
    fold_right : ('a -> 'z -> 'z) -> 'a list -> 'z -> 'z

Actually, there are two operations: the left and the right fold. Their meaning should be clear from the following example:

    fold_left  (+) 0 [1;2;3;4] ≡ (((0 + 1) + 2) + 3) + 4
    fold_right (+) [1;2;3;4] 0 ≡ 1 + (2 + (3 + (4 + 0)))

In this case, with addition as the folding function, the results are identical: both expressions sum the list. Generally, left and right folds produce different results: try, for example, subtraction as the folding function. The types of the list elements
'a and of the accumulator 'z need not be the same: for example,

    fold_left (fun z _ -> z + 1) 0 l

computes the length of any list,

    fold_left (fun z x -> x :: z) [] l

reverses the list l, and

    fold_right (fun x z -> if p x then x::z else z) l []

filters the list: omits the elements for which the predicate p returns false. Many other operations on lists (in fact, all of them) can be expressed as folds. Fold is indeed the general pattern of sequential stateful processing of a sequence.
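For instance, map itself is expressible as a right fold. A small illustration:

```ocaml
(* map as fold_right: rebuild the list, transforming each element on the way *)
let map f l = List.fold_right (fun x z -> f x :: z) l []

let () = assert (map succ [1;2;3] = [2;3;4])
```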
There are alternative definitions of fold: sometimes the last two arguments of the right fold are swapped. If the arguments of the right folding function are likewise swapped, then the left and the right fold have the same signature. Their behavior, the association pattern, is still different. Olivier Danvy, see below, traces the history of list folds and the argument order.
For concreteness we showed folds over lists, but similar operations exist over arrays, streams, files, trees, dictionaries and any other collections.
In this article, by reduce we always mean the reduce over a monoid. A monoid is a set with an associative binary operation which has a unit (also called zero) element. Concretely, in OCaml

    type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}

where 'a is the type of monoid elements, op must be associative, and

    op zero x = op x zero = x

must hold for every element x of the monoid. In Google MapReduce, the operation op is also taken to be commutative. We do not impose such a requirement.
Reduce over a sequence (here, a list) is the operation

    reduce : 'a monoid -> 'a list -> 'a

with the behavior that can be illustrated as

    reduce monoid []                 ≡ monoid.zero
    reduce monoid [x]                ≡ x   (* for any x and monoid *)
    reduce {zero=0;op=(+)} [1;2;3;4] ≡ 1 + 2 + 3 + 4

One may say that reduce `wedges in' the monoid operation between the consecutive elements of the sequence. Since op is associative, the parentheses are not necessary. Therefore, unlike the left and the right fold, there is only one reduce.
We shall see that reduce is often pre-composed with map. The two operations may always be fused, into

    map_reduce : ('a -> 'z) -> 'z monoid -> 'a list -> 'z

Nevertheless, we will write map and reduce separately, for clarity -- assuming that actual implementations use the efficient, fused operation.
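One possible sequential implementation of the fused operation, as a sketch: map each element into the monoid and combine on the fly, never building the intermediate list. (A parallel implementation would partition the sequence instead of folding over it.)

```ocaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}

(* fused map_reduce: combine f x into the accumulator as we go *)
let map_reduce (f: 'a -> 'z) (m: 'z monoid) (l: 'a list) : 'z =
  List.fold_left (fun z x -> m.op z (f x)) m.zero l

let () =
  (* summing squares: map (fun x -> x*x) fused with the additive monoid *)
  assert (map_reduce (fun x -> x*x) {zero=0; op=(+)} [1;2;3] = 14)
```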
Fold has to be evaluated sequentially, because of the data dependency on the accumulator. In fact, the left fold is just another notation for a for-loop:

    type 'a array_slice = {arr:'a array; from:int; upto:int}

    let fold_left_arr : ('z -> 'a -> 'z) -> 'z -> 'a array_slice -> 'z =
      fun f z {arr;from;upto} ->
        let acc = ref z in
        for i=from to upto do acc := f !acc arr.(i) done;
        !acc

Here, for variety, we use an array slice as a sequence.
On the other hand, reduce has a variety of implementations. It may be performed sequentially:

    let seqreduce_arr (m: 'a monoid) (arrsl: 'a array_slice) : 'a =
      fold_left_arr m.op m.zero arrsl

Or it can be done in parallel:

    let rec parreduce_arr (m: 'a monoid) {arr;from;upto} : 'a =
      match upto+1-from with
      | 0 -> m.zero
      | 1 -> arr.(from)
      | 2 -> m.op arr.(from) arr.(from+1)
      | 3 -> m.op (m.op arr.(from) arr.(from+1)) arr.(from+2)
      | n -> let n' = n / 2 in
             (* Here, the two parreduce_arr invocations can be done in parallel! *)
             m.op (parreduce_arr m {arr;from;upto=from+n'-1})
                  (parreduce_arr m {arr;from=from+n';upto})

The two
parreduce_arr
invocations in the recursive case can run in
parallel -- embarrassingly in parallel, with no races or even read
dependencies. Whether they should be done in parallel is another
question -- something that we can decide case-by-case. For example, if
the array slice is short, the two parreduce_arr
invocations are
better done sequentially (since the ever-present overhead of parallel
evaluation would dominate.) If the two halves of the slice
are reduced also in parallel, we obtain a hierarchical decomposition:
binary-tree--like processing.
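Since the tree decomposition relies only on associativity, not commutativity, the sequential and the tree-shaped reductions agree on any monoid. A self-contained check, restating the article's definitions and exercising a non-commutative monoid (string concatenation):

```ocaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}
type 'a array_slice = {arr:'a array; from:int; upto:int}

(* sequential reduction of a slice *)
let seqreduce_arr (m: 'a monoid) {arr;from;upto} : 'a =
  let acc = ref m.zero in
  for i = from to upto do acc := m.op !acc arr.(i) done;
  !acc

(* tree-shaped reduction, as in the article *)
let rec parreduce_arr (m: 'a monoid) {arr;from;upto} : 'a =
  match upto+1-from with
  | 0 -> m.zero
  | 1 -> arr.(from)
  | 2 -> m.op arr.(from) arr.(from+1)
  | 3 -> m.op (m.op arr.(from) arr.(from+1)) arr.(from+2)
  | n -> let n' = n / 2 in
         m.op (parreduce_arr m {arr;from;upto=from+n'-1})
              (parreduce_arr m {arr;from=from+n';upto})

let () =
  let sl = {arr=[|"a";"b";"c";"d";"e"|]; from=0; upto=4} in
  let m = {zero=""; op=(^)} in
  (* concatenation is associative but not commutative: order is preserved *)
  assert (parreduce_arr m sl = "abcde");
  assert (parreduce_arr m sl = seqreduce_arr m sl)
```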
We do not have to recursively decompose the slices. We may cut the input array into a sequence of non-overlapping slices, arbitrarily, and assign them to available cores. The cores may do their assigned work as they wish, without any synchronization with the others. At the end, we combine their results using the monoid operation.
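Such a flat, per-core partitioning can be sketched as follows, assuming OCaml 5's Domain module for parallelism; the helper names chunked_reduce and reduce_slice, and the choice of chunk count, are ours for illustration:

```ocaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}

(* sequential reduction of one contiguous slice *)
let reduce_slice (m: 'a monoid) arr from upto =
  let acc = ref m.zero in
  for i = from to upto do acc := m.op !acc arr.(i) done;
  !acc

(* cut arr into nchunks non-overlapping slices, reduce each in its own
   domain, then combine the per-slice results with m.op; correctness
   relies only on associativity of m.op *)
let chunked_reduce ?(nchunks=4) (m: 'a monoid) (arr: 'a array) : 'a =
  let n = Array.length arr in
  let chunk = (n + nchunks - 1) / nchunks in
  let workers =
    List.init nchunks (fun k ->
      let from = k * chunk and upto = min ((k+1)*chunk - 1) (n-1) in
      Domain.spawn (fun () -> reduce_slice m arr from upto))
  in
  List.fold_left (fun z d -> m.op z (Domain.join d)) m.zero workers

let () =
  let arr = Array.init 100 (fun i -> i + 1) in
  assert (chunked_reduce {zero=0; op=(+)} arr = 5050)
```

The workers never write to shared state: each reads only its own slice, and the only synchronization is the final join.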
Since the (left or right) fold commits us to sequential evaluation, Guy Steele called it `slightly harmful'. He urged using reduce, as far as possible, because it is flexible and decouples the algorithm from the execution strategy. The strategy (sequential, parallel, distributed) and data partitioning can be chosen later, depending on circumstances and available resources.
Thus the main question is: can we convert fold into reduce? The rest of the article answers it.
Olivier Danvy:
``Folding left and right matters: direct style, accumulators,
and continuations''
Journal of Functional Programming, vol 33, e2.
Functional Pearl, February 2023
Appendix A: A brief history of folding left and right over lists
According to Danvy, the first instances of fold_left and fold_right were investigated by Christopher Strachey in 1961(!). Reduce is part of APL (Iverson, 1962), where it is called `/': hence +/x sums the array x. Gödel's recursor R in System T is a more general version of fold (now called para-fold). Church numerals are folds.
monoid_reduce.ml [8K]
Complete code for the article
How to zip folds
A complete library of fold-represented lists, demonstrating
that all list processing operations can be expressed as folds
Accumulating tree traversals, a better tree fold with applications to XML parsing
fold_left and fold_right in terms of reduce. In this section we see two trivial answers. In fact, we see that fold can always be expressed in terms of reduce -- but in a way that is not useful or interesting.
If the folding function, the first argument of fold, is associative (which implies that the type of sequence elements is the same as the accumulator/result type) and has a zero element, then fold is trivially an instance of reduce. We have seen the example already:

    fold_left (+) 0 l ≡ fold_right (+) l 0 ≡ reduce {op=(+);zero=0} l

for any list l.
The second trivial answer is that a fold can always be expressed in terms of a monoid reduce, unconditionally:

    let compose_monoid = {op=(fun g h -> fun x -> g (h x)); zero=Fun.id}

    fold_right f l z ≡ List.map f l |> reduce compose_monoid |> (|>) z

for any f, z, l of the appropriate types. A very similar expression can be written for the left fold (left as an exercise to the reader). The key idea is that function composition is associative.
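The identity can be checked on a small example; the sketch below supplies the article's reduce as a sequential fold:

```ocaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}
let reduce m l = List.fold_left m.op m.zero l

(* function composition is associative, with the identity as its unit *)
let compose_monoid = {op = (fun g h x -> g (h x)); zero = Fun.id}

let () =
  let f, l, z = (-), [1;2;3;4], 0 in
  (* mapping f yields partially-applied closures; reduce composes them *)
  assert ((List.map f l |> reduce compose_monoid |> (|>) z)
          = List.fold_right f l z)   (* both are 1-(2-(3-(4-0))) = -2 *)
```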
Unfortunately, this trivial answer is of little practical use. If f : 'a -> 'z -> 'z, then map f l creates a list of 'z -> 'z closures, which reduce composes together into a (big) 'z -> 'z closure, which is finally applied to z. If the list l is of size N, the result of the reduction is the composition of N closures, and hence has size proportional to N (with a fairly noticeable proportionality constant: a closure takes several words of heap space). In effect we have built an intermediate data structure of unbounded size. Although composing the closures can be parallelized, the useful work is done only when the big closure composition is finally applied to the initial accumulator z -- at which point the folding function f is applied step-by-step, sequentially, just like in the original fold_right f l z. This trivial reduction of fold only wastes time and space.
The problem hence is not merely to express fold as reduce, but to do so efficiently, without undue and unbounded overhead. Only then can we profitably fold in parallel.
Sometimes the problem can be solved simply. We have already seen one such case: the folding function is associative and has a unit element. The left and the right fold are then instances of reduce. A little more interesting is the case of a folding function f that can be factored in terms of two other functions op and g as

    f z x = op z (g x)

for any sequence element x and accumulator z. Here op is an associative operation that has a zero element. As an example, recall finding the length of a list, expressed as a fold:

    List.fold_left (fun z _ -> z + 1) 0 l

The folding function indeed factors:

    z + 1 = z + (Fun.const 1 x)

and so length can be re-written as

    map (Fun.const 1) l |> reduce {op=(+);zero=0}

or, as map_reduce. Length, therefore, may be computed in parallel.
Such a factorization is an instance of the general principle, called ``Conjugate Transform'' in Guy Steele's talk. The principle calls for representing fold as

    fold_left (f:'z->'a->'z) (z:'z) l = map g l |> reduce m |> h

and similarly for the right fold. Here m : 'u monoid is a monoid for some type 'u, and g : 'a->'u and h : 'u->'z are some functions. They generally depend on f and z. Informally, the principle recommends we look for a ``bigger'' type 'u that can embed both 'a and 'z and admits a suitable associative operation with a unit element.

One may always represent fold in such a way, as we saw in the previous section: we chose compose_monoid as m, with 'z->'z as the type 'u.
The conjugate transform is a principle, or a schema. It does not tell how to actually find the efficient monoid, or whether it even exists. In fact, finding the efficient monoid is often non-trivial and requires ingenuity. As an example, consider a subtractive folding, mentioned in passing earlier. Subtraction is not associative, and therefore fold_left (-) is not an instance of reduce. A moment of thought however shows that

    fold_left (-) z l = z - fold_left (+) 0 l = z - reduce {op=(+);zero=0} l

In terms of the conjugate transform schema, the function g is negation and h is adding z. A second moment of thought shows that fold_right (-) cannot be reduced just as easily (or at all?). Actually, the right-subtractive folding can also be carried out efficiently in parallel, as a monoid reduction. The reader is encouraged to find this monoid -- to appreciate the difficulty and ingenuity.
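The left-fold identity is easy to verify on a sample list; a quick sketch using the standard List module:

```ocaml
(* fold_left (-) z l subtracts every element from z, i.e. computes z - sum l *)
let () =
  let l = [1;2;3;4] and z = 100 in
  assert (List.fold_left (-) z l = z - List.fold_left (+) 0 l)  (* both 90 *)
```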
As an illustrating example, we take the conversion of a sequence of digits (most-significant first) to the corresponding number -- a simple parsing operation. It is an accumulating sequential operation and can be written as a fold:

    let digits_to_num =
      List.fold_left (fun z x -> 10*z + (Char.code x - Char.code '0')) 0

For example, digits_to_num ['1'; '7'; '5'] gives 175. One can easily factor it as a composition of a map and

    List.fold_left (fun z x -> 10*z + x) 0

which is essentially the Horner rule: evaluating the polynomial Σ dᵢbⁱ at b=10. The folding function fun z x -> 10*z + x is not associative, however.
Still, there is a way to build a monoid. It is the monoid with the operation

    let op (x,b1) (y,b2) = (x*b2+y, b1*b2)

which is clearly associative (but not commutative):

    ((x,b1) `op` (y,b2)) `op` (z,b3)
    = (x*b2+y, b1*b2) `op` (z,b3)
    = (x*b2*b3+y*b3+z, b1*b2*b3)
    = (x,b1) `op` (y*b3+z, b2*b3)
    = (x,b1) `op` ((y,b2) `op` (z,b3))

with the unit (0,1):

    (x,b) `op` (0,1) = (0,1) `op` (x,b) = (x,b)

Therefore, digits_to_num can be written as

    let digits_to_num =
      List.map (fun x -> (Char.code x - Char.code '0')) >>
      List.map (fun x -> (x,10)) >>
      reduce {op;zero=(0,1)} >>
      fst

where >> is the left-to-right functional composition. Our construction is also an instance of the conjugate transform, with g being fun x -> (x,10) and h being fst.
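The whole pipeline can be run as-is; the sketch below supplies reduce as a sequential fold and >> as left-to-right composition:

```ocaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}
let reduce m l = List.fold_left m.op m.zero l
let ( >> ) f g x = g (f x)

(* the Horner monoid: pairs of (value, base-weight) *)
let op (x,b1) (y,b2) = (x*b2 + y, b1*b2)

let digits_to_num =
  List.map (fun x -> Char.code x - Char.code '0') >>
  List.map (fun x -> (x,10)) >>
  reduce {op; zero=(0,1)} >>
  fst

let () = assert (digits_to_num ['1';'7';'5'] = 175)
```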
Thus, the Horner rule can be expressed as map-reduce and can hence be evaluated embarrassingly in parallel.
Our construction is general and applies to any folding function f : 'z -> 'a -> 'z of the form

    f z x = m.op (h z) x

where m.op is associative and has a zero element (that is, is an operation of some monoid m). This factorization looks very similar to the one in the previous section; the difference, however slight, makes finding the monoid quite a bit more difficult. It does exist:

    let hmonoid (m:'a monoid) : ('a * ('a->'a)) monoid =
      {op = (fun (x,h1) (y,h2) -> (m.op (h2 x) y, h1 >> h2));
       zero = (m.zero, Fun.id)}

Its carrier is the set of pairs (x,h) of monoid m elements and functions on them. We assume the composition h1 >> h2 is represented efficiently. For example, when h is a multiplication by a constant, it can be represented by that constant; the composition of such functions is then representable by the product of the corresponding constants.
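To see the construction at work, here is the Horner example redone with hmonoid; a sketch in which reduce is again a sequential fold and, for simplicity, h is represented as an actual closure rather than the constant suggested above:

```ocaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}
let reduce m l = List.fold_left m.op m.zero l
let ( >> ) f g x = g (f x)

let hmonoid (m: 'a monoid) : ('a * ('a -> 'a)) monoid =
  {op = (fun (x,h1) (y,h2) -> (m.op (h2 x) y, h1 >> h2));
   zero = (m.zero, Fun.id)}

(* Horner again: f z x = m.op (h z) x with m = addition and h z = 10*z *)
let () =
  let m = {zero=0; op=(+)} in
  let (x, h) = reduce (hmonoid m)
      (List.map (fun d -> (d, fun z -> 10*z)) [1;7;5]) in
  (* the reduction yields a pair such that fold_left f z l = m.op (h z) x *)
  assert (m.op (h 0) x = List.fold_left (fun z d -> 10*z + d) 0 [1;7;5]);
  assert (m.op (h 2) x = List.fold_left (fun z d -> 10*z + d) 2 [1;7;5])
```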
The Horner rule has the feel of the parallel prefix problem and the famous parallel scan of Guy Blelloch (PPoPP 2009). Horner rule does not require reporting of intermediate results, however. Our monoid reduction differs from Blelloch's even-odd interleaving. It also differs from monoid-cached trees from Guy Steele's ICFP09 keynote. We do not rely on any special data structures: We can work with plain arrays, partitioning them across available cores.
Since the Horner method is pervasive in polynomial evaluation, there are approaches to parallelize it. The most common is even/odd splitting of polynomial coefficients. Although it may be well suited for SIMD processing, the even/odd splitting is bad for multicore since it creates read contention (bank conflicts) and wastes cache space. Our monoid reduce lets us assign different memory banks to different cores, for their exclusive, conflict-free access.
On the other hand, Estrin's scheme may be seen as a very particular instance of our monoid construction: binomial grouping. Our monoid reduction allows for arbitrary grouping, hierarchically if desired, and not necessarily of the same size.
map_reduce) rather than fold, as far as possible. Unlike fold, reduce does not commit us to a particular evaluation strategy. It can be performed sequentially, embarrassingly in parallel, or in a tree-like fashion. Like the Σ notation in math, it specifies what should be summed up, but not how or in which sequence.

Guy Steele's conclusions apply to the present article as well. Especially his final thoughts: