We show why monoids are a big deal -- and also big fun, even more than expected. The Horner rule and Boyer-Moore majority voting -- commonly regarded as prototypically, inherently sequential -- turn out to be monoids, and hence embarrassingly parallelizable. As we generalize monoids to observable monoids, we discover a new, vertical way to combine them, enabling efficient iterated grouping and aggregation directly on raw, not yet deserialized (big) data, on multicore or multi-processor systems.
Recall, folding over a sequence is an inherently sequential stateful accumulation; using lists for concreteness, it is defined as
fold_left : ('z -> 'a -> 'z) -> 'z -> 'a list -> 'z
fold_right : ('a -> 'z -> 'z) -> 'a list -> 'z -> 'z
Actually, there are two operations: the left and the right fold. Their
meaning should be clear from the following example:
fold_left (+) 0 [1;2;3;4] ≡ (((0 + 1) + 2) + 3) + 4
fold_right (+) [1;2;3;4] 0 ≡ 1 + (2 + (3 + (4 + 0)))
In this case, with the folding function being addition,
the results are identical: both expressions sum the
list. Generally, however, left and right folds produce different results: try,
for example, subtraction as the folding function.
The types of the list elements 'a and
of the accumulator 'z need not be the same: for example,
fold_left (fun z _ -> z + 1) 0 l
computes the length of any list,
fold_left (fun z x -> x :: z) [] l
reverses the list l, and
fold_right (fun x z -> if p x then x::z else z) l []
filters the list: it omits the elements for which the predicate p
returns false. Many other operations on lists (in fact, all of them)
can be expressed as folds. Fold is indeed the general pattern of sequential
stateful processing of a sequence.
There are alternative definitions of fold: sometimes the last two arguments of the right fold are swapped. If the arguments of the right folding function are likewise swapped, then the left and the right fold have the same signature. Their behavior -- the association pattern -- is still different. Olivier Danvy (see below) traces the history of list folds and of the argument order.
For concreteness we showed folds over lists, but similar operations exist over arrays, streams, files, trees, dictionaries and any other collections.
In this article, by reduce we always mean the reduce over a monoid. A monoid is a set (called `carrier set') with an associative binary operation which has a neutral (also called unit, or zero) element. Concretely, in OCaml
type 'a monoid = {zero: 'a; op: 'a -> 'a -> 'a}
where 'a is the type of monoid elements,
op must be associative and
op zero x = op x zero = x
must hold for every element x of the monoid. In Google MapReduce, the
operation op is also taken to be commutative. We do not impose such a
requirement.
Reduce over a sequence (here, a list) is the operation
reduce : 'a monoid -> 'a list -> 'a
with the behavior that can be illustrated as
reduce monoid [] ≡ monoid.zero
reduce monoid [x] ≡ x (* for any x and monoid *)
reduce {zero=0;op=(+)} [1;2;3;4] ≡ 1 + 2 + 3 + 4
One may say that reduce `wedges in' the monoid operation between the
consecutive elements of the sequence. Since op
is associative, the parentheses are not necessary. Therefore, unlike
the left and the right fold, there is only one reduce.
We shall see that reduce is often pre-composed with map. The two operations may always be fused, into
map_reduce : ('a -> 'z) -> 'z monoid -> 'a list -> 'z
Nevertheless, we will write map and reduce separately, for clarity --
assuming that actual implementations use the efficient, fused operation.
(That map and reduce often occur together is no accident:
as Fegaras and Maier explained, map_reduce is a monoid homomorphism.)
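For concreteness, here is a minimal sequential sketch of the fused operation (any implementation, parallel ones included, must agree with it on the result):

let map_reduce : ('a -> 'z) -> 'z monoid -> 'a list -> 'z =
  fun f m -> List.fold_left (fun z x -> m.op z (f x)) m.zero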
Fold has to be evaluated sequentially, because of the data dependency
on the accumulator. In fact, the left fold is just
another notation for a for-loop:
type 'a array_slice = {arr:'a array; from:int; upto:int}
let fold_left_arr : ('z -> 'a -> 'z) -> 'z -> 'a array_slice -> 'z =
fun f z {arr;from;upto} ->
let acc = ref z in
for i=from to upto do
acc := f !acc arr.(i)
done;
!acc
Here, for variety, we use an array slice as a sequence.
On the other hand, reduce has a variety of implementations. It may be
performed sequentially:
let seqreduce_arr (m: 'a monoid) (arrsl: 'a array_slice) : 'a =
fold_left_arr m.op m.zero arrsl
Or it can be done in parallel (see the end of the section
``Nested grouping-aggregation: Vertical monoid composition'' for a real example):
let rec parreduce_arr (m: 'a monoid) {arr;from;upto} : 'a =
match upto+1-from with
| 0 -> m.zero
| 1 -> arr.(from)
| 2 -> m.op arr.(from) arr.(from+1)
| 3 -> m.op (m.op arr.(from) arr.(from+1)) arr.(from+2)
| n -> let n' = n / 2 in
(* Here, the two parreduce_arr invocations can be done in parallel! *)
m.op
(parreduce_arr m {arr;from;upto=from+n'-1})
(parreduce_arr m {arr;from=from+n';upto})
The two parreduce_arr invocations in the recursive case can run in
parallel -- embarrassingly in parallel, with no races or even read
dependencies. Whether they should be done in parallel is another
question -- something that we can decide case-by-case. For example, if
the array slice is short, the two parreduce_arr invocations are
better done sequentially (since the ever-present overhead of parallel
evaluation would dominate). If the two halves of the slice
are reduced also in parallel, we obtain a hierarchical decomposition:
binary-tree--like processing.
We do not have to recursively decompose the slices. We may cut the input array into a sequence of non-overlapping slices, arbitrarily, and assign them to available cores. The cores may do their assigned work as they wish, without any synchronization with the others. At the end, we combine their results using the monoid operation.
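A sketch of this strategy, written sequentially for clarity (each partial result below is independent and could be computed on its own core):

(* Cut arr into nslices contiguous slices, reduce each independently,
   and combine the partial results with the monoid operation. *)
let slicewise_reduce (m: 'a monoid) (nslices: int) (arr: 'a array) : 'a =
  let n = Array.length arr in
  let size = (n + nslices - 1) / nslices in    (* ceiling division *)
  let partial i =                              (* the work of one core *)
    let from = i * size in
    seqreduce_arr m {arr; from; upto = min (from + size) n - 1}
  in
  List.init nslices partial |> List.fold_left m.op m.zero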
Since the (left or right) fold commits us to sequential evaluation, Guy Steele called it `slightly harmful'. He urged using reduce, as far as possible, because it is flexible and decouples the algorithm from the execution strategy. The strategy (sequential, parallel, distributed) and data partitioning can be chosen later, depending on circumstances and available resources.
Thus the main question is: can we convert fold into reduce? The rest of the article answers it.
Guy Steele: Organizing Functional Code for Parallel execution
or, foldl and foldr considered slightly harmful
August 2009 (ICFP 2009, Keynote)
Olivier Danvy:
``Folding left and right matters: direct style, accumulators,
and continuations''
Journal of Functional Programming, vol 33, e2.
Functional Pearl, February 2023
Appendix A: A brief history of folding left and right over lists
According to Danvy,
the first instances of fold_left and fold_right were investigated by
Christopher Strachey in 1961(!). Reduce is part of APL (Iverson,
1962), where it is
called `/': hence +/x sums the array x. Gödel's recursor R in
System T is a more general version of fold (now called para-fold).
Church numerals are folds.
monoid_reduce.ml [26K]
Complete code for the article
How to zip folds
A complete library of fold-represented lists, demonstrating
that all list processing operations can be expressed as folds
Accumulating tree traversals, a better tree fold with applications to XML parsing
Monoid reduce is sometimes called the `big operator' (by analogy with the capital Σ and Π for summation and product, respectively). Another name is `Eindhoven quantifiers'.
One may think that foldMap from the Haskell Foldable class is also
related: it also talks about monoid reductions. However, foldMap is
specified as an explicitly right-associative operator, and hence lacks
the main property of map-reduce emphasized by Guy Steele: decoupling
of specification from implementation. One may argue the Foldable
type class itself is to blame: according to it, a collection has a
single foldMap, whereas in fact there may be many reduction
implementations for the same collection.
The question, then, is how to express fold_left and
fold_right in terms of reduce. In this section we see two trivial
answers. In fact, we see that fold can always be expressed in
terms of reduce -- but in a way that is neither useful nor interesting.
If the folding function, the first argument of fold, is associative (which implies that the type of sequence elements is the same as the accumulator/result type) and has a zero element, then fold is trivially an instance of reduce. We have seen the example already:
fold_left (+) 0 l ≡ fold_right (+) l 0 ≡ reduce {op=(+);zero=0} l
for any list l.
The second trivial answer is that a fold can always be expressed in terms of a monoid reduce, unconditionally:
fold_right f [1;2;3;4] z
= f 1 (f 2 (f 3 (f 4 z)))
= (f 1 · f 2 · f 3 · f 4) z
= map f [1;2;3;4] |> reduce fun_monoid |> (fun h -> h z)
where
let fun_monoid = {op=(fun g h -> fun x -> g (h x)); zero=Fun.id}
for any f, z and collection of the appropriate types. A very similar
expression can be written for the left fold (left as an exercise for
the reader). The key idea is that function composition is associative.
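As a concrete sketch of this encoding (with a naive sequential reduce over lists standing in for any implementation; subtraction shows the right-fold association):

let reduce (m: 'a monoid) (l: 'a list) : 'a = List.fold_left m.op m.zero l
let fun_monoid = {op = (fun g h -> fun x -> g (h x)); zero = Fun.id}

let fold_right_via_reduce f l z =
  List.map f l |> reduce fun_monoid |> (fun h -> h z)
(* fold_right_via_reduce (-) [1;2;3;4] 0  ⇝  1 - (2 - (3 - (4 - 0))) = -2 *)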
Unfortunately, this trivial answer is of little practical use. As the
example shows, map (f:'a -> 'z -> 'z) [1;2;3;4] creates a list of
closures [f 1; f 2; f 3; f 4], which reduce composes together into
a big closure (f 1 · f 2 · f 3 · f 4), which is finally applied to
z. The composed closure (f 1 · f 2 · f 3 · f 4) has the same
structure as the original list -- and hence the same size (actually, a
few times larger: not only do we have to store the list elements
themselves, we also have to store the function pointers f, plus the
overhead). In effect we have built an intermediate data structure of
unbounded size. Although composing the closures can be parallelized,
the useful work is done only when the big closure composition is
finally applied to the initial accumulator z -- at which point the
folding function f is applied step-by-step, sequentially, just like
in the original fold_right f l z. This trivial reduction of fold
only wastes time and space.
The problem, hence, is not merely to express fold as reduce, but to do so efficiently, without undue and unbounded overhead. Only then can we profitably fold in parallel.
Jeremy Gibbons: Origami Programming for Fun and Profit
<https://www.cs.ox.ac.uk/publications/publication16637-abstract.html>
Exposition of a fold as fun_monoid
Sometimes the problem can be solved simply. We have already seen one
such case: the folding function is associative and has the neutral
element. The left and the right folds are then instances
of reduce. A little more interesting is the case of a folding
function f that can be factored in terms of two other functions
op and g as
f z x = op z (g x)
for any sequence element x and accumulator z.
Here op is an associative operation that has a zero element. As an example,
recall finding the length of a list, expressed as the fold
List.fold_left (fun z _ -> z + 1) 0 l
The folding function indeed factors:
z + 1 = z + (Fun.const 1 x)
and so length can be re-written as
map (Fun.const 1) l |> reduce {op=(+);zero=0}
or, as map_reduce. Length, therefore, may be computed in parallel.
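In code, with the fused map_reduce sketched earlier (a sequential stand-in for any implementation):

let length l = map_reduce (Fun.const 1) {op = (+); zero = 0} l
(* length ['a'; 'b'; 'c']  ⇝  3 *)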
Such a factorization is an instance of the general principle, called ``Conjugate Transform'' in Guy Steele's talk. The principle calls for representing fold as
fold_left (f:'z->'a->'z) (z:'z) l = map g l |> reduce m |> h
and similarly for the right fold. Here
m:'u monoid is a monoid
for some type 'u, and g:'a->'u and h:'u->'z are some functions.
They generally depend on f and z. Informally, the principle
recommends we look for a ``bigger'' type 'u that can embed both 'a
and 'z and admits a suitable associative operation with a neutral
element.
One may always represent fold in such a way, as we saw in the previous
section: we chose fun_monoid as m, with 'z->'z as the type 'u.
The conjugate transform is a principle, or a schema. It does not tell
how to actually find the efficient monoid, or if it even
exists. In fact, finding the efficient monoid is often
non-trivial and requires ingenuity. As an example,
consider a subtractive folding, mentioned
in passing earlier. Subtraction is not associative and therefore
fold_left (-) is not an instance of reduce. A moment of thought
however shows that
fold_left (-) z l = z - fold_left (+) 0 l = z - reduce {op=(+);zero=0} l
In terms of the conjugate transform schema, the function g is
negation and h is adding z. A second moment of thought shows
that fold_right (-) cannot be reduced just as easily (or at
all?). Actually, the right-subtractive folding can also be carried
out efficiently in parallel, as a monoid reduction. The reader is encouraged to
find this monoid -- to appreciate the difficulty and ingenuity.
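In code, the left-subtractive factorization just described reads as follows (a sketch, re-using the sequential reduce from the earlier sketch):

let fold_left_sub (z: int) (l: int list) : int =
  List.map (fun x -> -x) l              (* g : negation *)
  |> reduce {op = (+); zero = 0}
  |> (fun s -> z + s)                   (* h : adding z *)
(* fold_left_sub 10 [1;2;3]  ⇝  ((10-1)-2)-3 = 4 *)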
As an illustrative example, we take the conversion of a sequence of digits (most-significant first) to the corresponding number -- a simple parsing operation. It is an accumulating sequential operation and can be written as a fold:
let digits_to_num =
List.fold_left (fun z x -> 10*z + (Char.code x - Char.code '0')) 0
For example, digits_to_num ['1'; '7'; '5'] gives 175. One can
easily factor it as a composition of a map and
List.fold_left (fun z x -> 10*z + x) 0
which is essentially the Horner rule: evaluating the polynomial Σ d_i b^i
at b=10.
The folding function fun z x -> 10*z + x is not associative,
however.
Still, there is a way to build a monoid. It is the monoid with the operation
let op (x,b1) (y,b2) = (x*b2+y, b1*b2)
which is clearly associative (but not commutative):
((x,b1) `op` (y,b2)) `op` (z,b3)
= (x*b2+y, b1*b2) `op` (z,b3)
= (x*b2*b3+y*b3+z, b1*b2*b3)
= (x,b1) `op` (y*b3+z,b2*b3)
= (x,b1) `op` ((y,b2) `op` (z,b3))
with the neutral (0,1):
(x,b) `op` (0,1) = (0,1) `op` (x,b) = (x,b)
One may think of an element (x,b) of this monoid as a `digit' x
in base b, which represents a (sub)sequence of decimal digits. The monoid
operation is then the conversion of the `two-digit' mixed-base number
(digit x1 in base b1 followed by digit x2 in base b2) to the
single-digit base-b1*b2 number (which corresponds to the concatenation
of the corresponding digit sequences).
Therefore, digits_to_num can be written as
let digits_to_num =
List.map (fun x -> (Char.code x - Char.code '0')) >>
List.map (fun x -> (x,10)) >>
reduce {op;zero=(0,1)} >> fst
where >> is the left-to-right functional composition. Our
construction is also an instance of the conjugate transform, with g
being fun x -> (x,10) and h being fst.
Thus, the Horner rule can be expressed as map-reduce and can hence be evaluated embarrassingly in parallel.
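A quick sanity check of the associativity and of the result (assuming the op above), splitting the digits of 175 at different points:

let ( <+> ) = op   (* the mixed-base `digit' concatenation defined above *)
let () =
  assert (((1,10) <+> (7,10)) <+> (5,10) = (175, 1000));
  assert ( (1,10) <+> ((7,10) <+> (5,10)) = (175, 1000))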
Our construction is general and applies to any folding function
f : 'a -> 'a -> 'a of the form
f z x = m.op (h z) x
where m.op is associative and has a zero element (that is, it is an
operation of some monoid m). This factorization
looks very similar to the one in the previous section; the
difference, however slight, makes finding the monoid rather more
difficult. It does exist:
let hmonoid (m:'a monoid) : ('a * ('a->'a)) monoid =
{op = (fun (x,h1) (y,h2) -> (m.op (h2 x) y, h1 >> h2));
zero = (m.zero, Fun.id)}
Its carrier is the set of pairs (x,h) of monoid m elements and
functions on them. We assume the composition h1>>h2 is represented
efficiently.
An interesting case of this generalization is left folding
fold_left (f:'z -> 'a -> 'z) (z:'z) (l:'a coll)
where 'a is a monoid (call it m), 'z is also a monoid (call it n)
and the folding function (with flipped arguments)
is a monoid action of the monoid m on n preserving the n structure:
i.e., each action is an endomorphism of n.
That is, action = flip f satisfies the properties of a monoid action
action m.zero z = z
action m2 (action m1 z) = action (m.op m2 m1) z
as well as the properties of an n morphism:
action m n.zero = n.zero
action m (n.op n1 n2) = n.op (action m n1) (action m n2)
In that case, the left folding is the monoid reduction with the
('z,'a) monoid
{zero = (n.zero,m.zero);
op = fun (z1,m1) (z2,m2) -> (n.op (action m2 z1) z2, m.op m2 m1)}
which is essentially the outer semidirect product of monoids n and m.
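The construction may be sketched in code; as a check, the Horner monoid of the earlier section is recovered by taking n to be the additive monoid, m the multiplicative one, and the action to be multiplication by the base:

let semidirect (n: 'z monoid) (m: 'a monoid) (action: 'a -> 'z -> 'z)
    : ('z * 'a) monoid =
  {zero = (n.zero, m.zero);
   op = fun (z1,m1) (z2,m2) -> (n.op (action m2 z1) z2, m.op m2 m1)}

(* The Horner monoid as a semidirect product *)
let horner : (int * int) monoid =
  semidirect {op=(+); zero=0} {op=( * ); zero=1} (fun b z -> z * b)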
I thank James McKinna for bringing monoid products to my attention.
The Horner rule has the feel of the parallel prefix problem and the famous parallel scan of Guy Blelloch (PPoPP 2009). Horner rule does not require reporting of intermediate results, however. Our monoid reduction differs from Blelloch's even-odd interleaving. It also differs from monoid-cached trees from Guy Steele's ICFP09 keynote. We do not rely on any special data structures: We can work with plain arrays, partitioning them across available cores.
Since the Horner method is pervasive in polynomial evaluation, there are approaches to parallelizing it. The most common is the even/odd splitting of polynomial coefficients. Although it may be well-suited to SIMD processing, the even/odd splitting is bad for multicore since it creates read contention (bank conflicts) and wastes cache space. Our monoid reduce lets us assign different memory banks to different cores, for their exclusive, conflict-free access.
On the other hand, Estrin's scheme may be seen as a very particular instance of our monoid construction: binomial grouping. Our monoid reduction allows arbitrary grouping, hierarchical if desired, into groups not necessarily of the same size.
monoid_reduce.ml [26K]
Complete code for the article
The Boyer-Moore majority-vote algorithm is called a prototypical streaming algorithm and is regarded as inherently sequential. It may be surprising, therefore, that it can be presented as a monoid reduction and hence performed just as efficiently in parallel -- in fact, embarrassingly in parallel. The monoid reduction implementation is actually simple. What is complex is convincing ourselves that it indeed works and does the right thing, in all cases.
The Boyer-Moore algorithm is a state-machine processing, ingesting the input element-by-element. It can hence be implemented as a fold (assuming the input is given as a non-empty list):
let bm : 'a list -> 'a = function h::t ->
  let fsm (c,m) x =
    if c = 0 then (1,x)
    else if m = x then (c+1,m)
    else (c-1,m)
  in List.fold_left fsm (1,h) t |> snd
The state is the counter c and the candidate majority m. At the
end of the algorithm, when the entire sequence is scanned, m is the
majority element (if the sequence has one). If the counter c at the
end is more than half of the sequence length, the sequence has a
majority and m is the majority element. Otherwise, we have to make
another pass through the sequence to verify that m is indeed
the majority. The verification pass may be unnecessary if we have prior
knowledge that a majority exists -- or if we are satisfied with any
element if it does not.
Although the algorithm seems inherently sequential, it can in fact be represented as a monoid reduction. The monoid is simple and efficient:
type 'a bm = int * 'a
let bm_monoid (z:'a) : 'a bm monoid =
{zero = (0,z);
op = fun (c1,m1) (c2,m2) ->
if c1 = 0 then (c2,m2) else
if c2 = 0 then (c1,m1) else
if m1 = m2 then (c1+c2,m1) else
if c1 > c2 then (c1-c2, m1) else
(c2-c1, m2)}
Monoid elements (the carrier set) are pairs: of a natural number c and of a
sequence element m. When c is zero, m may be arbitrary; in the
above code we use z for such an arbitrary element. We could have
avoided specifying it by defining a more sophisticated data type for
monoid carrier, or using 'a option. The reasoning below would get
messier; therefore, we keep the above definition for the sake of
clarity.
One can see that op is commutative and that zero is indeed
the zero of op. However, is op associative? That is, is
bm_monoid really a monoid? It is easy to check:
let m1 = (2,10) and m2 = (3,9) and m3 = (2,8)
op (op m1 m2) m3 ⇝ (1, 8)
op m1 (op m2 m3) ⇝ (1, 10)
Unfortunately, bm_monoid is not a monoid.
To cut the suspense, it turns out that bm_monoid is `morally' a
monoid, and reduction with it indeed gives the right result, in all
cases. Proving it however is not that straightforward.
For the sake of the proof, we extend
bm_monoid to:
type 'a ebm = int * 'a * ('a*'a) list
let bm_monoid_ext (z:'a) : 'a ebm monoid =
{zero = (0,z,[]);
op = fun (c1,m1,l1) (c2,m2,l2) ->
if c1 = 0 then (c2,m2,l1@l2) else
if c2 = 0 then (c1,m1,l1@l2) else
if m1 = m2 then (c1+c2,m1,l1@l2) else
if c1 > c2 then (c1-c2, m1, repeat c2 (m1,m2) @ l1 @ l2) else
(c2-c1, m2, repeat c1 (m1,m2) @ l1 @ l2)}
Here, the infix operation @ is list concatenation and
repeat (n:int) (x:'a) : 'a list produces the list of length n
with elements all equal to x.
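(For completeness, repeat may be defined as:

let repeat (n: int) (x: 'a) : 'a list = List.init n (Fun.const x)
)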
The carrier set of bm_monoid_ext is 'a ebm:
triples of a natural number c, a sequence
element m, and a list of pairs with distinct components (that is, the
fst of any pair is not equal to its snd). We note that op preserves
the invariant that all pairs in l have distinct components.
The operation op of the thus-extended bm_monoid_ext is still not associative.
Not only is bm_monoid_ext not a monoid, it is also not efficient:
the list concatenation operations in op
are neither constant space nor constant time.
Let us introduce the relation ≈ on that set, defined as follows:
x ≈ y just in case flatten x is equal to flatten y, where
let flatten : 'a ebm -> 'a multiset = fun (c,m,l) ->
repeat c m @ List.map fst l @ List.map snd l
That is, flatten x collects all elements mentioned within x in a
multiset. Therefore, x ≈ y iff the multiset of elements of x
is equal to the multiset of elements of y. One easily sees that
the relation ≈ is reflexive, symmetric and transitive. That is,
it is an equivalence relation. It also has properties useful for us,
as follows.
Proposition: bm_monoid_ext is a monoid modulo ≈. That is,
op (0,z,[]) (c,m,l) ≈ op (c,m,l) (0,z,[]) ≈ (c,m,l)
op (c1,m1,l1) (op (c2,m2,l2) (c3,m3,l3)) ≈ op (op (c1,m1,l1) (c2,m2,l2)) (c3,m3,l3)
The associativity property follows from the fact that op preserves the
elements occurring in (c,m,l), including their multiplicities (and
multiset union is associative).
Moreover, ≈ is adequate for the task of finding the majority.
Proposition (Adequacy): if (c1,m1,l1) ≈ (c2,m2,l2) and
the multiset flatten (c1,m1,l1) has a majority, then c1>0, c2>0,
and m1=m2. Informally, ≈ preserves the found majority (if the majority
exists). The proposition is a consequence of the fact that majority
is invariant under sequence permutation, and of the following representation
lemma.
Lemma (Representation): if a sequence (multiset) flatten (c,m,l)
has a majority, then c>0 and m is the majority element.
The key idea is that (c,m,l) is a representation of
a partition of the multiset flatten (c,m,l): specifically, a
partition of the multiset into pairs of distinct elements and the
remainder (which, if it exists, is comprised of one or more copies of a
particular element m). Such a partition is generally not unique:
consider the sequence [1; 2; 3; 3; 3] and its two partitions
(1,3,[(1,3);(2,3)]) and (3,3,[(1,2)]). Any such partition, however,
has the property specified in the lemma.
For the proof we note that if the list of distinct pairs l has n
elements, it contains 2*n values. Not all of them are
distinct; however, no value occurs more than n times:
otherwise, by the pigeonhole principle, some pair in l would have
had equal components. Therefore, no value occurring in (c,m,l) that is
distinct from m (when c>0) can be the majority element. If the
majority exists, it has to be m, and c>0.
Our proofs are inspired by Boyer and Moore's original proofs. They however used induction (which is appropriate for the sequential algorithm). We, in contrast, used equational reasoning (and the pigeonhole principle) -- but no induction.
As an aside, it may seem that induction is the only proof method used in computer science -- many widely used textbooks present nothing but induction. In my opinion, induction is overused. Mathematics has many other proof methods.
The final step is to observe that the part of bm_monoid_ext that we
actually care about -- the element m and
the count c -- is computed with no regard to the list l.
The list is a `ghost' list, so to speak, needed only for proving
correctness but not for computing the majority. Therefore, we may omit it
and recover the bm_monoid.
In the upshot, majority voting may be performed as
map-reduce: mapping with (fun x -> (1,x)) and then reducing over the
monoid bm_monoid. It is just as efficient as the original
Boyer-Moore algorithm: one pass over the sequence using constant and
small working space. However, the input may be split arbitrarily among
available workers, who would do the search in parallel, without any
interference or dependencies. Majority voting can be done
embarrassingly in parallel.
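Concretely, a sketch (with a sequential list reduce standing in for any implementation):

let reduce (m: 'a monoid) (l: 'a list) : 'a = List.fold_left m.op m.zero l

let majority : 'a list -> 'a = function
  | []          -> failwith "empty sequence"
  | (h::_) as l ->
      List.map (fun x -> (1,x)) l |> reduce (bm_monoid h) |> snd
(* majority [1;2;1;3;1;1]  ⇝  1 *)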
<https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_majority_vote_algorithm>
monoid_reduce.ml [26K]
Complete code for the article
Lambert Meertens: Reducing hopeful majority. The Squiggolist, 1(1):5, 1989
<https://www.kestrel.edu/people/meertens/publications/papers/Reducing_hopeful_majority.pdf>
The paper also noted the lack of associativity and informally argued
(invoking indeterminism) that it is not required for consistency.
We have shown how associativity can be formally regained without
invoking indeterminism, using the `ghost' list of distinct pairs.
This `ghost' list is the crucial ingredient to be able to define
the equivalence relation `≈'.
The running example, as in the earlier article, is an Advent of Code 2022 problem, summarized as:
The input is a sequence of numbers separated by pipes into chunks, where each chunk contains multiple comma-separated numbers. The task is to compute the sum of each chunk and find the maximum across all chunks.
A sample input is as follows:
100,200,300|400|500,600|700,800,900|1000
The earlier article showed an efficient -- single-pass with no buffering -- sequential solution. We now present an embarrassingly parallel one. Emphatically, the input can be split arbitrarily, for example
100,20 0,300|40 0|500,600 |700,800 ,900|1000
haphazardly cutting through the numbers and delimiters. We may
mmap
the input file and assign a range of pages to a worker (a CPU core),
not worrying about how or if the data are aligned with page
boundaries. The workers may then all run in parallel, in constant memory,
with no
dependencies or races -- with linear
speed-up. Such processing is also a very good fit for
GPUs.
A more realistic variation of the problem is, say, finding the 10 most frequent words in a UTF-8--encoded text file.
The input is a sequence of char -- or digit,
to be precise:
let adv22_input =
['1'; '0'; '0'; ','; '2'; '0'; '0'; ','; '3'; '0'; '0'; '|'; '4'; '0'; '0';...]
We take the input here to be a list for simplicity; in the full version,
the input is a byte array or a file.
The task is to group the items between the delimiters (commas and bars).
Formally, the task is to convert (α+δ)* to (α*↑)* where α*↑ is a maximal subsequence not containing the delimiter. Here, δ is the type (set) of delimiters. We use `+' for a union (of two types or sequences) and multiplication for concatenation; the superscripted star is Kleene star. Observe:
(α+δ)* = (α*δ*)* = α* + α*δ(α*δ)*α* = α* + α*δ(α*↑δ)*α*
using the almost standard algebra (only the multiplication, being sequence
concatenation, is not commutative). In words: a sequence (α+δ)* either
contains no delimiters at all (α*), or it consists of a possibly non-maximal
leading chunk, the first delimiter, a sequence of maximal chunks each
followed by a delimiter, and a possibly non-maximal trailing chunk. This
representation is closed under concatenation:
(α+δ)* = (α+δ)* (α+δ)*
       = (α* + α*δ(α*↑δ)*α*) (α* + α*δ(α*↑δ)*α*)
       = α* + α*δ(α*↑δ)*α* + α*δ(α*↑δ)*α* α*δ(α*↑δ)*α*
       = α* + α*δ(α*↑δ)*α*
(using α*α* = α* and absorbing the mixed products into the second summand).
The original (α+δ)* is a monoid: the operation is concatenation, zero is the empty sequence. As we have just shown, the representation α* + α*δ(α*↑δ)*α* is also a monoid; the monoid operation can be read directly from the above equations.
Concretely, in OCaml, α* + α*δ(α*↑δ)*α* is the sum data type:
type delim = char
type 'a chunk = A of 'a list
| C of 'a list * delim * (('a list) * delim) list * 'a list
The monoid is then
let wap_monoid =
{zero = A [];
op = fun x y -> match (x,y) with
| (A lx,A ly) -> A (lx @ ly)
| (A lx,C (l,d,c,r)) -> C (lx @ l, d,c,r)
| (C (l,d,c,r),A ly) -> C (l,d,c, r @ ly)
| (C (lx,dx,cx,rx),C (ly,dy,cy,ry)) ->
C (lx,dx, (cx @ [(rx@ly),dy] @ cy), ry)}
It is a complicated version of sequence concatenation, taking into
account that α* subsequences flanked by delimiters must be
maximal.
A monoid element represents an
(α+δ)* subsequence: the empty subsequence is represented by
A [], a singleton subsequence by
A [c] or C ([],d,[],[]) (depending on whether its element is a digit c
or a delimiter d). The monoid element representing the whole input is found
by map-reduce:
adv22_input |>
map_reduce (function '0'..'9' as c -> A [c] | d -> C ([],d,[],[]))
wap_monoid
resulting in
C (['1'; '0'; '0'], ',',
[(['2'; '0'; '0'], ','); (['3'; '0'; '0'], '|'); ...],
['1'; '0'; '0'; '0'])
We may then extract the maximal delimiter-free subsequences:
let of_chunk : 'a chunk -> ('a list) list = function
| A x -> [x]
| C (l,d,c,r) -> [l] @ (List.map fst c) @ [r]
The warm-up task is hence solved by map-reduce, followed by of_chunk
to extract the result.
The warm-up result was of the type (char list) list, with two occurrences of list. We now
replace them with arbitrary monoids.
To make the monoids and their compositions more apparent, we take a more general, algebraic look. Recall, a monoid is a set with two operations (satisfying particular properties) -- in other words, an Algebra of the following signature.
module type monoid = sig
type t
val zero : t
val op : t -> t -> t
end
Different monoids -- monoid algebras -- are implementations of the signature.
The monoid signature delineates the minimum required set of operations;
to build useful monoids we also need some ways to create their elements,
other than zero. In other words, we need generators, probably
more than one. The following extension of the monoid interface with
generators
module type coll_monoid = sig
include monoid
type ix
val gen : ix -> t
end
fits an arbitrary collection of generators, indexed by a set
ix. The set may be empty, unitary (e.g., unit), finite (e.g.,
bool) or infinite. Fegaras and Maier call (a simpler version of)
this interface a collection monoid. An example that immediately
comes to mind is the sum monoid:
module Sum = struct
type t = int
let zero = 0
let op = (+)
type ix = int
let gen = Fun.id
end
and the similar Max monoid. Their generators are all integers. (Collection
monoids whose gen is the identity function, Fegaras and
Maier call primitive.) A non-primitive collection monoid
is a list of elements of type e:
module ListM(S:sig type e end) = struct
type t = S.e list
let zero = []
let op = (@)
type ix = S.e
let gen x = [x]
end
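The Max monoid mentioned above, which reappears in the final composition below, may be sketched as (taking min_int as the neutral element):

module Max = struct
  type t = int
  let zero = min_int      (* the neutral element for max over int *)
  let op = max
  type ix = int
  let gen = Fun.id
end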
Out of a sense of symmetry, and with the ``Conjugate Transform'' in mind, we add one more operation:
module type obs_monoid = sig
include coll_monoid
type obs
val obs : t -> obs
end
to observe the monoid, as a value of some type obs, specific to the monoid.
A collection
monoid may be observed as is:
module AsIs(M:coll_monoid) :
(obs_monoid with type ix = M.ix and type obs = M.t) = struct
include M
type obs = M.t
let obs = Fun.id
end
A non-trivial observable monoid is the Horner rule monoid we saw earlier:
module Horner : (obs_monoid with type ix = int and type obs = int) = struct
type t = int * int
let zero = (0,1)
let op (x,b1) (y,b2) = (x*b2+y, b1*b2)
type ix = int
let gen x = (x,10)
type obs = int
let obs = fst
end
The interface obs_monoid lets us succinctly express the conjugate transform:
let map_reduce_conj (type i) (type o)
(module M:obs_monoid with type ix = i and type obs = o) (l:i list) : o =
map_reduce M.gen {zero=M.zero;op=M.op} l |> M.obs
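For example, with the Horner monoid, digits_to_num becomes (a sketch, assuming the fused map_reduce from earlier):

let digits_to_num (l: char list) : int =
  List.map (fun c -> Char.code c - Char.code '0') l
  |> map_reduce_conj (module Horner)
(* digits_to_num ['1';'7';'5']  ⇝  175 *)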
We now generalize from sequences to
arbitrary monoids. First, suppose there is an observable monoid M1 which
reduces a sequence α*↑ to some value ρ (e.g., a sequence of
digits to an integer). The first generalization is the reduction from
(α+δ)* to (ρ+δ')*. Now, suppose there is a monoid M2
that reduces (ρ+δ')*. This monoid should accept not just ρ items
but also delimiters. It is then worth introducing delimiter-accepting
monoids:
module type obs_del_monoid = sig
include obs_monoid
type del (* type of delimiters *)
val del : del -> t
end
Given such M1 and M2 we can build an obs_del_monoid that reduces
(α+δ)* in one pass:
module ChunkMonoid(S:sig
module M1: obs_monoid
module M2: obs_del_monoid with type ix = M1.obs
val del_ign : M2.del -> bool
end) :
(obs_del_monoid with
type del = S.M2.del and type ix = S.M1.ix and type obs = S.M2.obs) = struct
open S
type t = A of M1.t | C of M1.t * M2.t * M1.t (* chunk *)
let zero = A M1.zero
let m1_to_m2 : M1.t -> M2.t = fun x -> M2.gen (M1.obs x)
let op x y = match (x,y) with
| (A lx,A ly) -> A (M1.op lx ly)
| (A lx,C (l,c,r)) -> C (M1.op lx l, c, r)
| (C (l,c,r),A ly) -> C (l,c, M1.op r ly)
| (C (lx,cx,rx),C (ly,cy,ry)) ->
C (lx, M2.op cx (M2.op (m1_to_m2 (M1.op rx ly)) cy), ry)
type ix = M1.ix
let gen x = A (M1.gen x)
type del = M2.del
let del c =
if del_ign c then A M1.zero
else C (M1.zero, M2.del c, M1.zero)
type obs = M2.obs
let obs = function
| A x -> m1_to_m2 x |> M2.obs
| C (l,c,r) -> M2.(op (m1_to_m2 l) (op c (m1_to_m2 r)) |> obs)
end
Here, del_ign is the predicate for the ignored
delimiters. This is the vertical monoid composition.
When the monoids M1 and M2 are efficient --
their operations take constant time and space -- so is the
ChunkMonoid.
ChunkMonoid produces an obs_del_monoid, which can be composed again,
and again. The full Advent of Code 2022 problem is a reduction with
the monoid
struct
include
ChunkMonoid (struct
module M1 = struct include Horner
type ix = char
let gen x = (Char.code x - Char.code '0') |> gen end
let del_ign = Fun.const false
module M2 = ChunkMonoid (struct
module M1 = AsIs(Sum)
let del_ign = ((=) ',')
module M2 = struct
include AsIs(Max)
type del = char
let del c = assert (c='|'); zero
end
end)
end)
let gen = function '0'..'9' as c -> gen c | c -> del c
end
The problem presented here is reminiscent of the one described by Steele in his ICFP keynote. He also used the idea of `chunks' and segments. New here, however, are the triple nesting of the chunked processing -- three times more complex -- and the derivation of the general vertical monoid composition.
Nested Grouping-Aggregation
Introducing the problem and its sequential solution.
Uncannily, it also used monoids and monoid composition, albeit
not explicitly realized.
Leonidas Fegaras, David Maier: Towards an Effective Calculus for Object Query Languages. SIGMOD 1995, pp. 47-58.
Algebra
Introduction to the mathematical concept of Algebras for
computer scientists, specifically for tagless-final programmers
talk_FP_Madrid.ml [16K]
Complete OCaml code with various monoids and monoid compositions
(including nested-aggregation composition)
par_nested_agg.ml [11K]
Complete multicore OCaml code for parallel nested-aggregation,
using OCaml domains. For the benchmark we take a 4.6GB file (of the same
structure as the sample Advent of Code file). Each core gets a slice of the file
to reduce; the main thread then collects the results. The parallel code
needs no locking and is assuredly race-free.
For illustration, we generate OCaml code, using MetaOCaml. The generated OCaml code, however, is deliberately low level: no algebraic data types, no pattern-matching, and certainly no first-class functions. It may then be offshored to C.
As repeatedly emphasized, monoid reductions may be done in several ways. One is sequential: left-fold, or the accumulating for-loop. It needs only a subset of the observable monoid interface:
module type state_machine = sig
type t
val zero : t
type ix
val opgen : t -> ix -> t
type obs
val obs : t -> obs
end
This is the interface for a state machine: with the state t;
initial state zero; input alphabet ix; state transition opgen
computing a new state from the current state and current input; and
the observation obs.
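A sketch of running such a state machine sequentially over a list (using a first-class module, as in map_reduce_conj above):

let run_sm (type i) (type o)
    (module M: state_machine with type ix = i and type obs = o)
    (l: i list) : o =
  List.fold_left M.opgen M.zero l |> M.obs

This interface can be cast into an imperative form: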
module type upd_state_machine = sig
type u
type unt
val newupd : (u -> unt) -> unt
type ix
val updgen : u -> ix -> unt
type obs
val obs : u -> (obs -> unt) -> unt
end
The type u is the type of the updateable state (e.g., reference cell);
newupd allocates/initializes the state, updgen updates it based on
the current input, and obs observes it. The type unt is the type
of `actions' so to speak; it is also a monoid, letting us compose actions.
For example, unt can be unit (the type of imperative statements) or
unit code (the code of imperative statements).
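As a small illustration (not from the article's code), here is a hypothetical running-sum instance with unt being unit and a reference cell for the updateable state:

module SumUpd : upd_state_machine
  with type ix = int and type obs = int and type unt = unit = struct
  type u = int ref
  type unt = unit
  let newupd k = k (ref 0)         (* allocate and initialize the state *)
  type ix = int
  let updgen u x = u := !u + x     (* state transition: accumulate the input *)
  type obs = int
  let obs u k = k !u               (* pass the current sum to the continuation *)
end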
A monoid is more than just a state machine: monoid reductions may also
be done slice-wise, so to speak. Therefore, a monoid
not only provides opgen : t -> ix -> t to update the state t
based on the current input ix, but also op : t -> t -> t to
concatenate, or merge the (partially accumulated) states. The
imperative version of the observable monoid therefore provides
operations to extract the current state from the updateable state
u and update u with the imported state t:
module type upd_monoid = sig
type u
type unt
val newupd : (u -> unt) -> unt
type ix
val updgen : u -> ix -> unt
type obs
val obs : u -> (obs -> unt) -> unt
type t (* exportable state *)
val from_upd : u -> t (* export u *)
val upd : u -> t -> unt (* update u *)
val set : u -> t -> unt
val reset : u -> unt
type us (* plural of u *)
type usix
val newupds : usix -> (us -> unt) -> unt
val us_set : us -> usix -> t -> unt
val us_get : us -> usix -> t
end
The slice-wise processing can be done independently, in parallel if
possible. Therefore, the interface provides the type us for creating
an independent updateable state for each of the workers (cores). The
type us and its operations hence support parallel-for (which we
emulate in multicore OCaml using domains).
See below for the source code that implements
several staged upd_monoids, including the vertical monoid composition.
See also below for the generated OCaml code -- which is indeed
first-order, low-level and imperative -- essentially, C in OCaml
notation.
Overall, the code generation and the update semantics consistently deliver a 4x speed-up on a single core, scaling further to multiple cores.
spar_nested_agg.ml [20K]
The staged version of par_nested_agg.ml, also introducing the
update-oriented monoids.
pa_na_red.ml [25K]
The generated code for parallel nested aggregation. It takes the file
descriptor and the number of cores to use.
par_na_main.ml [2K]
The main function to run the generated code, also showing the
benchmark results.
The moral: use reduce (or
map_reduce) rather than fold, as far as possible. Unlike fold,
reduce does not commit us to a particular evaluation strategy. It can
be performed sequentially, embarrassingly in parallel, or in a tree-like
fashion. Like the Σ notation in math, it specifies what should be summed
up, but not how or in which sequence.
Guy Steele's conclusions apply to the present article as well -- especially his final thoughts: