The promise of program generation is producing efficient code from high-level, obviously correct specifications by making it possible and convenient for domain experts to write and apply ``optimization passes'' at a high-level of abstraction.
On the example of the first Shonan Challenge by high-performance computing to staging we demonstrate that the promise is fulfilled. We obtain assuredly well-typed and optimal code. Although generators are not allowed to inspect, let alone rearrange, the generated code (for the sake of static assurances), optimizations are possible. The optimizations have to be at the level much higher than search-and-replace in the abstract syntax tree or a source code string. The optimizations are developed in small separate modules and mixed in one-by-one, checking their effect after each small step. The end result is a small domain-specific language embedded in MetaOCaml that lets us write high-level Basic Linear Algebra expressions and apply several passes of domain-specific optimizations such as selective loop unrolling, sparse matrix representations, and algebraic simplifications.
This page describes the poster presented at the 10th Asian Symposium on Programming Languages and Systems (APLAS). Kyoto, Japan, December 11-13, 2012.
Solving the challenge in small easy steps
mvmult
:
RING.t => int, ind => int
mvmult
code:
RING.t => int code, ind => int code
RING.t => int code, ind => int
RING.t => int pv, ind => int pv
add
and mul
reduce
Pure generative programming makes it easier to reason about the correctness of the generator, in particular, to statically ensure that the generated code is well-typed. A language that supports code inspection like Pure FreshML also can provide similar guarantees, at the cost of a significantly more complex type system. As a good analogy, parametricity, which prevents some functions from inspecting their arguments, gives in return static assurances of those functions' behavior: ``theorems for free''. As the opposite example, Scheme syntax-rule macros, which may examine their arguments, turn out to have surprising and disturbing power, calling into question static guarantees of their hygiene.
If we cannot rearrange the generated code, however, how can we optimize it? Several examples in the literature prove that optimization is possible. It has not been clear however if optimizations can be programmed modularly, like compiler passes. This page demonstrates pure-generative modular and convenient domain-specific optimizations on the example of the first Shonan Challenge by high-performance computing to staging.
We suppose the reader is familiar with MetaOCaml, at least to the extent of the simple tutorial referenced below.
1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 ...In the Challenge, the matrix,
a
, is the adjacency matrix, whose elements are
either 0 or 1. (In HMM, non-zero elements
are transition probabilities.) Some, but not all, rows of the matrix
are sparse. In bioinformatics HMM, the
number of rows and columns could be as high as 30 000.
The problem is to generate the function that computes v' <- a * v
. The generated code should be algebraically simplified, that is,
contain no expressions of the form 0 * e
, 1 * e
, 0 + e
, etc.
The loop over the matrix rows must be fully unrolled; the inner loop,
computing the inner product of v
with a matrix row, should be unrolled
only if the row is sparse enough (has fewer non-zero elements than
a fixed threshold). For the sample matrix above the desired code, assuming
the sparsity threshold 3, should look like
let imvmult_ideal : int array -> int array -> unit = fun v' v -> v'.(0) <- v.(0) + v.(3) (* a.(0) is truly sparse *) v'.(1) <- v.(2) (* a.(1) is very sparse *) let sum = ref 0 in (* a.(2) is dense enough *) for j=0 to n-1 do sum <- sum + a.(2).(j) * v.(j) done; v'.(2) <- sum ...
The main challenge is to add the optimizations -- outer loop unrolling, selective inner loop unrolling, algebraic simplifications -- modularly. The strong point of Scala's Lightweight Modular Staging (LMS) solution is that algebraic specifications or loop unrolling can be added to the existing generator, post hoc. We aspire to do the same in MetaOCaml, without inspecting the generated code.
Tiark Rompf: Solving Shonan Challenge 1 with Scala LMS
< https://github.com/TiarkRompf/virtualization-lms-core/blob/delite-develop2/test-src/epfl/test11-shonan/TestHMM.scala >
This LMS solution to the challenge is described in Sec 5 of the PEPM2013 paper below.
Baris Aktemur, Yukiyoshi Kameyama, Oleg Kiselyov, and Chung-chieh Shan:
Shonan Challenge for Generative Programming PEPM 2013
The paper outlines the challenges and describes other solutions to Challenge 1, in MetaOCaml and Scala.
mvmult
let mvmult vout ma vin = vout := ma ** vinwhere
vout
and vin
are intended to be vectors, ma
is the matrix.
The operations (:=)
for vector assignment, (**)
for
matrix-vector product and (*.)
for the element-wise
multiplication of two vectors are generically defined aslet ( := ) vout vin = iter (arr_assign vout vin) let ( *. ) v1 v2 = zip_with mul v1 v2 let ( ** ) a v = arr_map (fun arow -> reduce add zero (arow *. v)) ain terms of the primitive operations on vectors
arr_map
and reduce
(and also iter
, zip_with
, arr_assign
), and the operations on
vector elements add
, mul
, zero
. At this stage we do not say
what the vector elements are, other than that they belong to a ring:module type RING = sig type t (* abstract *) val zero : t val one : t val add : t -> t -> t val mul : t -> t -> t val eq : t -> t -> bool option (* best-effort comparison *) end(We do not require the equality on ring elements. Some instances of
RING
support element comparisons. In other instances,
only some elements could be decided as equal.)Likewise we do not commit to any particular representation of vector and matrices. We take a general view of an input vector as a map from an (abstract) index to an element, paired with the length (itself specified as an index):
type ('i,'t) arr = Arr of 'i * (* length *) ('i -> 't)A generic write-only vector
oarr
is similar. Although we do not yet
know anything about the index, we can already implement
some vector operations:let zip_with : ('t -> 't ->'t) -> ('i,'t) arr -> ('i,'t) arr -> ('i,'t) arr = fun f (Arr (n1,v1)) (Arr (n2,v2)) -> Arr (n1, fun i -> f (v1 i) (v2 i)) (* A variation of zip_with, when the first array is output *) let arr_assign : ('i,'t,'r) oarr -> ('i,'t) arr -> ('i,'r) arr = .... let arr_map : ('t1 -> 't2) -> ('i,'t1) arr -> ('i,'t2) arr = fun f (Arr (n, v)) -> Arr (n, fun i -> f (v i))
The other vector operations depend on the index representation. Whatever
that representation is, it should support the operations iter
and
reduce
. Thus a vector must implement the signature:
module type ARR = sig type ind (* index *) type t (* type of values *) type unt (* unit type *) val iter : (ind, unt) arr -> unt val reduce : (t -> t -> t) -> t -> (ind, t) arr -> t end
High-level specifications are written as OCaml modules parameterized
by a RING
and by a vector representation. For example,
the Linear Algebra module:
module LA(R: RING)(A: ARR with type t = R.t) = struct open R open A let ( := ) vout vin = iter (arr_assign vout vin) let ( *. ) v1 v2 = zip_with mul v1 v2 (* element-wise mul *) let ( ** ) a v = arr_map (fun arow -> (* matrix-vector mul *) reduce add zero (arow *. v)) a endand the specification for the matrix-vector multiplication:
module MVMULT(R: RING)(A: ARR with type t = R.t) = struct module L = LA(R)(A) open L let mvmult vout ma vin = vout := ma ** vin end
By instantiating MVMULT
with different implementations of RING
and ARR
--
by applying the functor MVMULT
to realizations of RING
and ARR
--
we get different implementations of the matrix-vector multiplication.
We program mvmult
once, and obtain the whole family of matrix-vector
multiplication codes. The answer to the Shonan Challenge is
one member of the family.
We have thus specified a simple Domain-Specific Language for Linear Algebra and used it to write matrix-vector multiplication.
MVMULT
with just the right
implementations of RING
and ARR
. These `right' implementations aren't
conjured up. Rather, we develop them step-by-step, changing
only few things at a time and checking their effect.mvmult
:
RING.t => int, ind => int
MVMULT
with the `obvious' first
choice: RingInt
and int
-indexed vectors. The latter structure has the
form:module ArrSta(S: sig type t end) = struct include S type ind = int type unt = unit let iter (Arr (n,body)) = ... let reduce plus zero (Arr (n,v)) = let sum = ref zero in for i = 0 to n-1 do sum := plus !sum (v i) done; !sum end
This is how the instantiation of MVMULT
looks in all detail:
let imvmult : int array -> int array array -> int array -> unit = fun v' a v -> let n = Array.length v' in (* vector and matrix representations*) let vout = OArr (n, fun i v -> v'.(i) <- v) in let vin = Arr (n,fun i -> v.(i)) in let ma = Arr (n,fun i -> Arr (n, fun j -> a.(i).(j))) in let module MV = MVMULT(RingInt)(ArrSta(RingInt)) in MV.mvmult vout ma vin
The result is the ordinary matrix-vector multiplication function where
vectors are ordinary OCaml arrays. We can immediately test and benchmark imvmult
by applying it to sample vectors and matrices.
mvmult
code:
RING.t => int code, ind => int code
MVMULT
with the ring whose
elements are not integers but integer code expressions. The add
operation,
for example, does no addition now. Rather, it builds the code that will
add integers later, when the generated code is compiled and run.module RingIntCode = struct type t = int code let zero = .<0>. let one = .<1>. let add x y = .<.~x + .~y>. let mul x y = .<.~x * .~y>. let eq x y = None (* can't compare code *) end
Likewise we introduce (future-stage) vectors whose index is an integer code expression. The length of a vector, also an index, is not known at the code generation time.
module ArrDyn(S: sig type t end) = struct include S type ind = int code type unt = unit code let iter (Arr (n,body)) = ... let reduce plus zero (Arr (n,v)) = .<let sum = ref .~zero in for i = 0 to (.~n)-1 do sum := .~(plus .<!sum>. (v .<i>.)) done; !sum>. endWe do not change
zip_with
or arr_map
at all: they are
defined generically for all types of indices, irrespective of vector
implementation.
Finally we instantiate MVMULT
. Vectors are realized as int array code
of statically, at generation time, unknown length.
let imvmult_c : (int array -> int array array -> int array -> unit) code = .<fun v' a v -> let n = Array.length v' in (* vector representations *) .~(let vout = OArr (.<n>., fun i v -> .<v'.(.~i) <- .~v>.) in let vin = Arr (.<n>.,fun i -> .<v.(.~i)>.) in let ma = Arr (.<n>.,fun i -> Arr (.<n>., fun j -> .<a.(.~i).(.~j)>.)) in let module MV = MVMULT(RingIntCode)(ArrDyn(RingIntCode)) in MV.mvmult vout ma vin) (* This line is the same *) >.
The difference in type of imvmult
and imvmult_c
is telling:
whereas the former was a function, imvmult_c
is the code of a function.
The code can be printed by MetaOCaml top-level as the value of imvmult_c
:
val imvmult_c : (int array -> int array array -> int array -> unit) code = .<fun v'_2 a_3 v_4 -> let n_5 = (Array.length v'_2) in for i_6 = 0 to n_5 - 1 do v'_2.(i_6) <- let sum_7 = ref 0 in for i_8 = 0 to n_5 - 1 do sum_7 := ! sum_7 + (a_3.(i_6).(i_8) * v_4.(i_8)) done; ! sum_7 done>.
The result does look like the code for matrix-vector multiplication --
like the code for imvmult
after all module abstractions are eliminated.
We can check: compile imvmult_c
as !. imvmult_c
and apply the resulting
function to sample vectors and matrices. The behavior must match imvmult
.
RING.t => int code, ind => int
MVMULT
with the int
ring and
the int
-indexed array -- obtaining the matrix-vector multiplication function
to use now. We also instantiated MVMULT
with the RING
that merely
promised to add or multiply integers, and with the ARR
that
only promised to index a vector. The result was the matrix-vector multiplication
that can only be used later, after compiling it -- although we can
see the code now. It had no optimizations. A mixed choice is a
`hybrid' vector with int
indices but int code
elements: we can obtain its elements now but can only use them for
later computations. The length of a hybrid vector, an int
, is also known
now, that is, when the generator runs. Thus we obtain the code
for multiplying a matrix by a vector of statically known dimensions.
The hybrid vectors are represented by the following structure, which is
quite like ArrSta
since in both cases the indices and dimensions
are usable at generation time:
module ArrStaDim(S: sig type t end) = struct include S module M = ArrSta(S) type ind = int type unt = unit code let reduce = M.reduce let seq e1 e2 = .<begin .~e1; .~e2 end>. let iter arr = reduce seq .<()>. arr endThe code generator
let imvmult_nc : int -> (int array -> int array array -> int array -> unit) code = fun n -> .<fun v' a v -> assert (n = Array.length v'); .~(let vout = OArr (n, fun i v -> .<v'.(i) <- .~v>.) in let vin = Arr (n,fun i -> .<v.(i)>.) in let ma = Arr (n,fun i -> Arr (n, fun j -> .<a.(i).(j)>.)) in let module MV = MVMULT(RingIntCode)(ArrStaDim(RingIntCode)) in MV.mvmult vout ma vin) >.albeit quite like
imvmult_c
, is a function that takes the vector
length as the argument. Applying imvmult_nc
to a particular n
gives the
the matrix-vector multiplication code specialized to
that vector length:let imvmult_5c = imvmult_nc 5;; val imvmult_5c : (int array -> int array array -> int array -> unit) code = .<fun v'_10 a_11 v_12 -> assert (5 = (Array.length v'_10)); v'_10.(0) <- 0 + (a_11.(0).(0) * v_12.(0)) + (a_11.(0).(1) * v_12.(1)) + (a_11.(0).(2) * v_12.(2)) + (a_11.(0).(3) * v_12.(3)) + (a_11.(0).(4) * v_12.(4)) v'_10.(1) <- ... ...
All loops are fully unrolled. This is what it means to take advantage of statically known dimensions, isn't it?
RING.t => int pv, ind => int pv
mvmult
to the known matrix, we breed
another mongrel vector, whose elements and indices are the hybrid of int
and int code
.
All values of the type int
are usable at generation time; all values
of the type int code
can only be used later. Among partially known values, some are known now, and some only later:
type 't pv = Sta of 't | Dyn of 't codeAny partially-known value can be converted to the purely future-stage one, forgetting what we know now:
let dyn : 't pv -> 't code = function | Sta x -> .<x>. | Dyn x -> xJust like
int
and int code
can be put in a ring, so can their
hybrid:module RingIntPCode = struct type t = int pv let zero = Sta 0 let one = Sta 1 let add x y = match (x,y) with | (Sta x, Sta y) -> Sta (x+y) | (x, y) -> Dyn (RingIntCode.add (dyn x) (dyn y)) let mul x y = ... let eq x y = match (x,y) with | (Sta x, Sta y) -> Some (x=y) | _ -> None (* can't compare the code *) end
RingIntPCode
indeed looks like a cross-breed of RingInt
and RingIntCode
. For example, if both both summands are known, we add
them now -- otherwise, we build the code to add them later. Crucially, zero
and one
are known now. Partially-known values are the staple
of code generation and should be a library feature. We implemented
them here for completeness.
Mongrel vectors have partially-known values both as indices and
elements. If the length is statically known, we dispatch to the
previously written ArrStaDim
implementation. Otherwise, we treat the
vector as completely statically unknown and use ArrDyn
, also written
earlier, to handle that case. From now on, our development would be to
mix-and-slightly-extend the previously written code.
module ArrStaDyn(S: sig type t end) = struct include S type ind = int pv type unt = unit code module ASta = ArrStaDim(S) module ADyn = ArrDyn(S) let iter = ... let reduce plus zero = function | Arr (Sta n,v) -> ASta.reduce plus zero (Arr (n, fun i -> v (Sta i))) | Arr (Dyn n,v) -> Dyn (ADyn.reduce (fun x y -> dyn (plus (Dyn x) (Dyn y))) (dyn zero) (Arr (n,fun i -> dyn (v (Dyn i))))) end
The new generator instantiates the same old MVMULT
with mongrel
vectors. Since the matrix is regarded as statically known, indexing it
with indices known now gives the usable now value. The input vector vin
, albeit of the statically known length, comprises elements unknown at
the generator time.
let imvmult_ac : int array array -> (int array -> int array -> unit) code = fun a -> let n = Array.length a in let ma = Arr (Sta n, fun i -> Arr (Sta n, (fun j -> match (i,j) with | (Sta i, Sta j) -> Sta a.(i).(j) | (i, j) -> Dyn .<a.(.~(dyn i)).(.~(dyn j))>.))) in .<fun v' v -> assert (n = Array.length v'); .~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyn i)) <- .~(dyn v)>.) in let vin = Arr (Sta n, fun i -> Dyn .<v.(.~(dyn i))>.) in let module MV = MVMULT(RingIntPCode)(ArrStaDyn(RingIntPCode)) in MV.mvmult vout ma vin) >.Compared to the earlier
imvmult_nc
, the generator imvmult_ac
receives not just the matrix dimensions but the matrix itself to
specialize to. When applied to the sample matrix of the challenge, it
produces:.<fun v'_27 v_28 -> assert (5 = (Array.length v'_27)); v'_27.(0) <- 0 + (1 * v_28.(0)) + (0 * v_28.(1)) + (0 * v_28.(2)) + (1 * v_28.(3)) + (0 * v_28.(4)) ...>.Not only loops are unrolled, the matrix elements are inlined. As the result, we see lots of trivial multiplications and additions. Algebraic simplifications are direly needed.
add
and mul
RingIntPCode
, overriding, or improving add
and mul
:module RingIntOPCode = struct include RingIntPCode (* inherit and override *) let add x y = match (x,y) with | (Sta 0,x) | (x,Sta 0) -> x | (x,y) -> RingIntPCode.add x y let mul x y = match (x,y) with | (Sta 0,_) | (_,Sta 0) -> Sta 0 | (Sta 1,x) | (x,Sta 1) -> x | (x,y) -> RingIntPCode.mul x y endThe improvement adds cases to
add
and mul
: for example, when one of the
multiplicands is statically zero, the product is statically zero regardless
of the other multiplicand. Just replacing RingIntPCode
with RingIntOPCode
in the earlier generator imvmult_ac
-- keeping everything
else the same -- gives the following optimized code specialized to the
sample matrix. It became short enough to show completely:.<fun v'_29 v_30 -> assert (5 = (Array.length v'_29)); v'_29.(0) <- v_30.(0) + v_30.(3) v'_29.(1) <- v_30.(2) v'_29.(2) <- v_30.(1) v'_29.(3) <- v_30.(2) + v_30.(3) + v_30.(4) v'_29.(4) <- v_30.(2) + v_30.(4)>.We have almost reached the goal. However, the Challenge specified that the inner-product loop is to be unrolled only if the corresponding matrix row is sparse enough. Unrolling should be selective.
reduce
reduce
, required by the Challenge,
will estimate that number first, and unroll the loop only if the
number is under the set threshold.
As with the algebraic simplification, we improve the previously written
code reusing as much as possible. Here we smarten-up ArrStaDyn
by adding a few checks to its reduce
:
module ArrStaOptDynInt = struct module R = RingIntOPCode module M = ArrStaDyn(R) include M let threshold = 3 (* density threshold *) let count_non_zeros n v = let rec loop acc i = if i >= n then acc else let acc = if R.eq R.zero (v (Sta i)) = Some true then acc else acc + 1 in loop acc (i+1) in loop 0 0 let reduce plus zero = function | (Arr (Sta n,v)) as arr -> if count_non_zeros n v < threshold then M.reduce plus zero arr else M.reduce plus zero (Arr (Dyn .<n>.,v)) | arr -> M.reduce plus zero arr endIf unrolling is at all possible -- the length of the vector is known -- we check how many elements of the vector are statically known to be
RING.zero
. (When the reduction is actually run, there may be more
zero elements, depending on the input vector. We cannot know about
these at the code generation time.) If there are not enough zero
elements, we turn the length of the vector into the future-stage
value, effectively switching off the unrolling.
We thus solve the challenge, attaining the desired code imvmult_ideal
.
n
-vector of
vectors (rows) of size n
each, taking overall n*n
space. Since the matrix
is supposed to be sparse, we ought to use a more compact representation.
For example, we store the matrix as an array of adjacency lists,
with each row packed as the list of non-zero elements and their indices.
Changing the matrix representation is trivial in our framework,
merely by adjusting the mapping from the concrete matrix to the abstract Arr
(in the code below, the changed definition is marked as `new').let imvmult_acs : (int*int) list array -> (int array -> int array -> unit) code = fun a -> let n = Array.length a in let ma = Arr (Sta n, fun i -> Arr (Sta n, (* new: changed ma mapping *) (fun j -> match (i,j) with | (Sta i, Sta j) -> Sta (sparse_deref a.(i) j) | (Sta i, Dyn j) -> let ai = a.(i) in (* matrix row *) Dyn .<sparse_deref ai .~j>. | (i, j) -> Dyn .<sparse_deref a.(.~(dyni i)) (.~(dyni j))>.))) in .<fun v' v -> assert (n = Array.length v'); .~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyni i)) <- .~(dyn v)>.) in let vin = Arr (Sta n,fun i -> Dyn .<v.(.~(dyni i))>.) in let module MV = MVMULT(RingIntOPCode)(ArrStaOptDynInt) in MV.mvmult vout ma vin) >.
oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!
Converted from HSXML by HSXML->HTML