The promise of program generation is producing efficient code from high-level, obviously correct specifications by making it possible and convenient for domain experts to write and apply ``optimization passes'' at a high-level of abstraction.
On the example of the first Shonan Challenge by high-performance computing to staging we demonstrate that the promise is fulfilled. We obtain assuredly well-typed and optimal code. Although generators are not allowed to inspect, let alone rearrange, the generated code (for the sake of static assurances), optimizations are possible. The optimizations have to be at the level much higher than search-and-replace in the abstract syntax tree or a source code string. The optimizations are developed in small separate modules and mixed in one-by-one, checking their effect after each small step. The end result is a small domain-specific language embedded in MetaOCaml that lets us write high-level Basic Linear Algebra expressions and apply several passes of domain-specific optimizations such as selective loop unrolling, sparse matrix representations, and algebraic simplifications.
This page describes the poster presented at the 10th Asian Symposium on Programming Languages and Systems (APLAS). Kyoto, Japan, December 11-13, 2012.
Solving the challenge in small easy steps
RING.t => int, ind => int
RING.t => int code, ind => int code
RING.t => int code, ind => int
RING.t => int pv, ind => int pv
Pure generative programming makes it easier to reason about the correctness of the generator, in particular, to statically ensure that the generated code is well-typed. A language that supports code inspection like Pure FreshML also can provide similar guarantees, at the cost of a significantly more complex type system. As a good analogy, parametricity, which prevents some functions from inspecting their arguments, gives in return static assurances of those functions' behavior: ``theorems for free''. As the opposite example, Scheme syntax-rule macros, which may examine their arguments, turn out to have surprising and disturbing power, calling into question static guarantees of their hygiene.
If we cannot rearrange the generated code, however, how can we optimize it? Several examples in the literature prove that optimization is possible. It has not been clear however if optimizations can be programmed modularly, like compiler passes. This page demonstrates pure-generative modular and convenient domain-specific optimizations on the example of the first Shonan Challenge by high-performance computing to staging.
We suppose the reader is familiar with MetaOCaml, at least to the extent of the simple tutorial referenced below.
Introduction to staging and MetaOCaml
1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 ...In the Challenge, the matrix,
a, is the adjacency matrix, whose elements are either 0 or 1. (In HMM, non-zero elements are transition probabilities.) Some, but not all, rows of the matrix are sparse. In bioinformatics HMM, the number of rows and columns could be as high as 30 000.
The problem is to generate the function that computes
v' <- a * v. The generated code should be algebraically simplified, that is,
contain no expressions of the form
0 * e,
1 * e,
0 + e, etc.
The loop over the matrix rows must be fully unrolled; the inner loop,
computing the inner product of
v with a matrix row, should be unrolled
only if the row is sparse enough (has fewer non-zero elements than
a fixed threshold). For the sample matrix above the desired code, assuming
the sparsity threshold 3, should look like
let imvmult_ideal : int array -> int array -> unit = fun v' v -> v'.(0) <- v.(0) + v.(3) (* a.(0) is truly sparse *) v'.(1) <- v.(2) (* a.(1) is very sparse *) let sum = ref 0 in (* a.(2) is dense enough *) for j=0 to n-1 do sum <- sum + a.(2).(j) * v.(j) done; v'.(2) <- sum ...
The main challenge is to add the optimizations -- outer loop unrolling, selective inner loop unrolling, algebraic simplifications -- modularly. The strong point of Scala's Lightweight Modular Staging (LMS) solution is that algebraic specifications or loop unrolling can be added to the existing generator, post hoc. We aspire to do the same in MetaOCaml, without inspecting the generated code.
Tiark Rompf: Solving Shonan Challenge 1 with Scala LMS
< https://github.com/TiarkRompf/virtualization-lms-core/blob/delite-develop2/test-src/epfl/test11-shonan/TestHMM.scala >
This LMS solution to the challenge is described in Sec 5 of the PEPM2013 paper below.
Baris Aktemur, Yukiyoshi Kameyama, Oleg Kiselyov, and Chung-chieh Shan:
Shonan Challenge for Generative Programming PEPM 2013
The paper outlines the challenges and describes other solutions to Challenge 1, in MetaOCaml and Scala.
let mvmult vout ma vin = vout := ma ** vinwhere
vinare intended to be vectors,
mais the matrix. The operations
(:=)for vector assignment,
(**)for matrix-vector product and
(*.)for the element-wise multiplication of two vectors are generically defined as
let ( := ) vout vin = iter (arr_assign vout vin) let ( *. ) v1 v2 = zip_with mul v1 v2 let ( ** ) a v = arr_map (fun arow -> reduce add zero (arow *. v)) ain terms of the primitive operations on vectors
arr_assign), and the operations on vector elements
zero. At this stage we do not say what the vector elements are, other than that they belong to a ring:
module type RING = sig type t (* abstract *) val zero : t val one : t val add : t -> t -> t val mul : t -> t -> t val eq : t -> t -> bool option (* best-effort comparison *) end(We do not require the equality on ring elements. Some instances of
RINGsupport element comparisons. In other instances, only some elements could be decided as equal.)
Likewise we do not commit to any particular representation of vector and matrices. We take a general view of an input vector as a map from an (abstract) index to an element, paired with the length (itself specified as an index):
type ('i,'t) arr = Arr of 'i * (* length *) ('i -> 't)A generic write-only vector
oarris similar. Although we do not yet know anything about the index, we can already implement some vector operations:
let zip_with : ('t -> 't ->'t) -> ('i,'t) arr -> ('i,'t) arr -> ('i,'t) arr = fun f (Arr (n1,v1)) (Arr (n2,v2)) -> Arr (n1, fun i -> f (v1 i) (v2 i)) (* A variation of zip_with, when the first array is output *) let arr_assign : ('i,'t,'r) oarr -> ('i,'t) arr -> ('i,'r) arr = .... let arr_map : ('t1 -> 't2) -> ('i,'t1) arr -> ('i,'t2) arr = fun f (Arr (n, v)) -> Arr (n, fun i -> f (v i))
The other vector operations depend on the index representation. Whatever
that representation is, it should support the operations
reduce. Thus a vector must implement the signature:
module type ARR = sig type ind (* index *) type t (* type of values *) type unt (* unit type *) val iter : (ind, unt) arr -> unt val reduce : (t -> t -> t) -> t -> (ind, t) arr -> t end
High-level specifications are written as OCaml modules parameterized
RING and by a vector representation. For example,
the Linear Algebra module:
module LA(R: RING)(A: ARR with type t = R.t) = struct open R open A let ( := ) vout vin = iter (arr_assign vout vin) let ( *. ) v1 v2 = zip_with mul v1 v2 (* element-wise mul *) let ( ** ) a v = arr_map (fun arow -> (* matrix-vector mul *) reduce add zero (arow *. v)) a endand the specification for the matrix-vector multiplication:
module MVMULT(R: RING)(A: ARR with type t = R.t) = struct module L = LA(R)(A) open L let mvmult vout ma vin = vout := ma ** vin end
MVMULT with different implementations of
by applying the functor
MVMULT to realizations of
we get different implementations of the matrix-vector multiplication.
mvmult once, and obtain the whole family of matrix-vector
multiplication codes. The answer to the Shonan Challenge is
one member of the family.
We have thus specified a simple Domain-Specific Language for Linear Algebra and used it to write matrix-vector multiplication.
MVMULTwith just the right implementations of
ARR. These `right' implementations aren't conjured up. Rather, we develop them step-by-step, changing only few things at a time and checking their effect.
RING.t => int, ind => int
MVMULTwith the `obvious' first choice:
int-indexed vectors. The latter structure has the form:
module ArrSta(S: sig type t end) = struct include S type ind = int type unt = unit let iter (Arr (n,body)) = ... let reduce plus zero (Arr (n,v)) = let sum = ref zero in for i = 0 to n-1 do sum := plus !sum (v i) done; !sum end
This is how the instantiation of
MVMULT looks in all detail:
let imvmult : int array -> int array array -> int array -> unit = fun v' a v -> let n = Array.length v' in (* vector and matrix representations*) let vout = OArr (n, fun i v -> v'.(i) <- v) in let vin = Arr (n,fun i -> v.(i)) in let ma = Arr (n,fun i -> Arr (n, fun j -> a.(i).(j))) in let module MV = MVMULT(RingInt)(ArrSta(RingInt)) in MV.mvmult vout ma vin
The result is the ordinary matrix-vector multiplication function where
vectors are ordinary OCaml arrays. We can immediately test and benchmark
imvmult by applying it to sample vectors and matrices.
RING.t => int code, ind => int code
MVMULTwith the ring whose elements are not integers but integer code expressions. The
addoperation, for example, does no addition now. Rather, it builds the code that will add integers later, when the generated code is compiled and run.
module RingIntCode = struct type t = int code let zero = .<0>. let one = .<1>. let add x y = .<.~x + .~y>. let mul x y = .<.~x * .~y>. let eq x y = None (* can't compare code *) end
Likewise we introduce (future-stage) vectors whose index is an integer code expression. The length of a vector, also an index, is not known at the code generation time.
module ArrDyn(S: sig type t end) = struct include S type ind = int code type unt = unit code let iter (Arr (n,body)) = ... let reduce plus zero (Arr (n,v)) = .<let sum = ref .~zero in for i = 0 to (.~n)-1 do sum := .~(plus .<!sum>. (v .<i>.)) done; !sum>. endWe do not change
arr_mapat all: they are defined generically for all types of indices, irrespective of vector implementation.
Finally we instantiate
MVMULT. Vectors are realized as
int array code
of statically, at generation time, unknown length.
let imvmult_c : (int array -> int array array -> int array -> unit) code = .<fun v' a v -> let n = Array.length v' in (* vector representations *) .~(let vout = OArr (.<n>., fun i v -> .<v'.(.~i) <- .~v>.) in let vin = Arr (.<n>.,fun i -> .<v.(.~i)>.) in let ma = Arr (.<n>.,fun i -> Arr (.<n>., fun j -> .<a.(.~i).(.~j)>.)) in let module MV = MVMULT(RingIntCode)(ArrDyn(RingIntCode)) in MV.mvmult vout ma vin) (* This line is the same *) >.
The difference in type of
imvmult_c is telling:
whereas the former was a function,
imvmult_c is the code of a function.
The code can be printed by MetaOCaml top-level as the value of
val imvmult_c : (int array -> int array array -> int array -> unit) code = .<fun v'_2 a_3 v_4 -> let n_5 = (Array.length v'_2) in for i_6 = 0 to n_5 - 1 do v'_2.(i_6) <- let sum_7 = ref 0 in for i_8 = 0 to n_5 - 1 do sum_7 := ! sum_7 + (a_3.(i_6).(i_8) * v_4.(i_8)) done; ! sum_7 done>.
The result does look like the code for matrix-vector multiplication --
like the code for
imvmult after all module abstractions are eliminated.
We can check: compile
!. imvmult_c and apply the resulting
function to sample vectors and matrices. The behavior must match
RING.t => int code, ind => int
intring and the
int-indexed array -- obtaining the matrix-vector multiplication function to use now. We also instantiated
RINGthat merely promised to add or multiply integers, and with the
ARRthat only promised to index a vector. The result was the matrix-vector multiplication that can only be used later, after compiling it -- although we can see the code now. It had no optimizations. A mixed choice is a `hybrid' vector with
int codeelements: we can obtain its elements now but can only use them for later computations. The length of a hybrid vector, an
int, is also known now, that is, when the generator runs. Thus we obtain the code for multiplying a matrix by a vector of statically known dimensions.
The hybrid vectors are represented by the following structure, which is
ArrSta since in both cases the indices and dimensions
are usable at generation time:
module ArrStaDim(S: sig type t end) = struct include S module M = ArrSta(S) type ind = int type unt = unit code let reduce = M.reduce let seq e1 e2 = .<begin .~e1; .~e2 end>. let iter arr = reduce seq .<()>. arr endThe code generator
let imvmult_nc : int -> (int array -> int array array -> int array -> unit) code = fun n -> .<fun v' a v -> assert (n = Array.length v'); .~(let vout = OArr (n, fun i v -> .<v'.(i) <- .~v>.) in let vin = Arr (n,fun i -> .<v.(i)>.) in let ma = Arr (n,fun i -> Arr (n, fun j -> .<a.(i).(j)>.)) in let module MV = MVMULT(RingIntCode)(ArrStaDim(RingIntCode)) in MV.mvmult vout ma vin) >.albeit quite like
imvmult_c, is a function that takes the vector length as the argument. Applying
imvmult_ncto a particular
ngives the the matrix-vector multiplication code specialized to that vector length:
let imvmult_5c = imvmult_nc 5;; val imvmult_5c : (int array -> int array array -> int array -> unit) code = .<fun v'_10 a_11 v_12 -> assert (5 = (Array.length v'_10)); v'_10.(0) <- 0 + (a_11.(0).(0) * v_12.(0)) + (a_11.(0).(1) * v_12.(1)) + (a_11.(0).(2) * v_12.(2)) + (a_11.(0).(3) * v_12.(3)) + (a_11.(0).(4) * v_12.(4)) v'_10.(1) <- ... ...
All loops are fully unrolled. This is what it means to take advantage of statically known dimensions, isn't it?
RING.t => int pv, ind => int pv
mvmultto the known matrix, we breed another mongrel vector, whose elements and indices are the hybrid of
All values of the type
int are usable at generation time; all values
of the type
int code can only be used later. Among partially known values, some are known now, and some only later:
type 't pv = Sta of 't | Dyn of 't codeAny partially-known value can be converted to the purely future-stage one, forgetting what we know now:
let dyn : 't pv -> 't code = function | Sta x -> .<x>. | Dyn x -> xJust like
int codecan be put in a ring, so can their hybrid:
module RingIntPCode = struct type t = int pv let zero = Sta 0 let one = Sta 1 let add x y = match (x,y) with | (Sta x, Sta y) -> Sta (x+y) | (x, y) -> Dyn (RingIntCode.add (dyn x) (dyn y)) let mul x y = ... let eq x y = match (x,y) with | (Sta x, Sta y) -> Some (x=y) | _ -> None (* can't compare the code *) end
RingIntPCodeindeed looks like a cross-breed of
RingIntCode. For example, if both both summands are known, we add them now -- otherwise, we build the code to add them later. Crucially,
oneare known now. Partially-known values are the staple of code generation and should be a library feature. We implemented them here for completeness.
Mongrel vectors have partially-known values both as indices and
elements. If the length is statically known, we dispatch to the
ArrStaDim implementation. Otherwise, we treat the
vector as completely statically unknown and use
ArrDyn, also written
earlier, to handle that case. From now on, our development would be to
mix-and-slightly-extend the previously written code.
module ArrStaDyn(S: sig type t end) = struct include S type ind = int pv type unt = unit code module ASta = ArrStaDim(S) module ADyn = ArrDyn(S) let iter = ... let reduce plus zero = function | Arr (Sta n,v) -> ASta.reduce plus zero (Arr (n, fun i -> v (Sta i))) | Arr (Dyn n,v) -> Dyn (ADyn.reduce (fun x y -> dyn (plus (Dyn x) (Dyn y))) (dyn zero) (Arr (n,fun i -> dyn (v (Dyn i))))) end
The new generator instantiates the same old
MVMULT with mongrel
vectors. Since the matrix is regarded as statically known, indexing it
with indices known now gives the usable now value. The input vector
vin, albeit of the statically known length, comprises elements unknown at
the generator time.
let imvmult_ac : int array array -> (int array -> int array -> unit) code = fun a -> let n = Array.length a in let ma = Arr (Sta n, fun i -> Arr (Sta n, (fun j -> match (i,j) with | (Sta i, Sta j) -> Sta a.(i).(j) | (i, j) -> Dyn .<a.(.~(dyn i)).(.~(dyn j))>.))) in .<fun v' v -> assert (n = Array.length v'); .~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyn i)) <- .~(dyn v)>.) in let vin = Arr (Sta n, fun i -> Dyn .<v.(.~(dyn i))>.) in let module MV = MVMULT(RingIntPCode)(ArrStaDyn(RingIntPCode)) in MV.mvmult vout ma vin) >.Compared to the earlier
imvmult_nc, the generator
imvmult_acreceives not just the matrix dimensions but the matrix itself to specialize to. When applied to the sample matrix of the challenge, it produces:
.<fun v'_27 v_28 -> assert (5 = (Array.length v'_27)); v'_27.(0) <- 0 + (1 * v_28.(0)) + (0 * v_28.(1)) + (0 * v_28.(2)) + (1 * v_28.(3)) + (0 * v_28.(4)) ...>.Not only loops are unrolled, the matrix elements are inlined. As the result, we see lots of trivial multiplications and additions. Algebraic simplifications are direly needed.
RingIntPCode, overriding, or improving
module RingIntOPCode = struct include RingIntPCode (* inherit and override *) let add x y = match (x,y) with | (Sta 0,x) | (x,Sta 0) -> x | (x,y) -> RingIntPCode.add x y let mul x y = match (x,y) with | (Sta 0,_) | (_,Sta 0) -> Sta 0 | (Sta 1,x) | (x,Sta 1) -> x | (x,y) -> RingIntPCode.mul x y endThe improvement adds cases to
mul: for example, when one of the multiplicands is statically zero, the product is statically zero regardless of the other multiplicand. Just replacing
RingIntOPCodein the earlier generator
imvmult_ac-- keeping everything else the same -- gives the following optimized code specialized to the sample matrix. It became short enough to show completely:
.<fun v'_29 v_30 -> assert (5 = (Array.length v'_29)); v'_29.(0) <- v_30.(0) + v_30.(3) v'_29.(1) <- v_30.(2) v'_29.(2) <- v_30.(1) v'_29.(3) <- v_30.(2) + v_30.(3) + v_30.(4) v'_29.(4) <- v_30.(2) + v_30.(4)>.We have almost reached the goal. However, the Challenge specified that the inner-product loop is to be unrolled only if the corresponding matrix row is sparse enough. Unrolling should be selective.
reduce, required by the Challenge, will estimate that number first, and unroll the loop only if the number is under the set threshold.
As with the algebraic simplification, we improve the previously written
code reusing as much as possible. Here we smarten-up
by adding a few checks to its
module ArrStaOptDynInt = struct module R = RingIntOPCode module M = ArrStaDyn(R) include M let threshold = 3 (* density threshold *) let count_non_zeros n v = let rec loop acc i = if i >= n then acc else let acc = if R.eq R.zero (v (Sta i)) = Some true then acc else acc + 1 in loop acc (i+1) in loop 0 0 let reduce plus zero = function | (Arr (Sta n,v)) as arr -> if count_non_zeros n v < threshold then M.reduce plus zero arr else M.reduce plus zero (Arr (Dyn .<n>.,v)) | arr -> M.reduce plus zero arr endIf unrolling is at all possible -- the length of the vector is known -- we check how many elements of the vector are statically known to be
RING.zero. (When the reduction is actually run, there may be more zero elements, depending on the input vector. We cannot know about these at the code generation time.) If there are not enough zero elements, we turn the length of the vector into the future-stage value, effectively switching off the unrolling.
We thus solve the challenge, attaining the desired code
n-vector of vectors (rows) of size
neach, taking overall
n*nspace. Since the matrix is supposed to be sparse, we ought to use a more compact representation. For example, we store the matrix as an array of adjacency lists, with each row packed as the list of non-zero elements and their indices. Changing the matrix representation is trivial in our framework, merely by adjusting the mapping from the concrete matrix to the abstract
Arr(in the code below, the changed definition is marked as `new').
let imvmult_acs : (int*int) list array -> (int array -> int array -> unit) code = fun a -> let n = Array.length a in let ma = Arr (Sta n, fun i -> Arr (Sta n, (* new: changed ma mapping *) (fun j -> match (i,j) with | (Sta i, Sta j) -> Sta (sparse_deref a.(i) j) | (Sta i, Dyn j) -> let ai = a.(i) in (* matrix row *) Dyn .<sparse_deref ai .~j>. | (i, j) -> Dyn .<sparse_deref a.(.~(dyni i)) (.~(dyni j))>.))) in .<fun v' v -> assert (n = Array.length v'); .~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyni i)) <- .~(dyn v)>.) in let vin = Arr (Sta n,fun i -> Dyn .<v.(.~(dyni i))>.) in let module MV = MVMULT(RingIntOPCode)(ArrStaOptDynInt) in MV.mvmult vout ma vin) >.
Your comments, problem reports, questions are very welcome!
Converted from HSXML by HSXML->HTML