The promise of program generation is producing efficient code from high-level, obviously correct specifications by making it possible and convenient for domain experts to write and apply ``optimization passes'' at a high-level of abstraction.
On the example of the first Shonan Challenge by high-performance computing to staging we demonstrate that the promise is fulfilled. We obtain assuredly well-typed and optimal code. Although generators are not allowed to inspect, let alone rearrange, the generated code (for the sake of static assurances), optimizations are possible. The optimizations have to be at the level much higher than search-and-replace in the abstract syntax tree or a source code string. The optimizations are developed in small separate modules and mixed in one-by-one, checking their effect after each small step. The end result is a small domain-specific language embedded in MetaOCaml that lets us write high-level Basic Linear Algebra expressions and apply several passes of domain-specific optimizations such as selective loop unrolling, sparse matrix representations, and algebraic simplifications.
This page describes the poster presented at the 10th Asian Symposium on Programming Languages and Systems (APLAS). Kyoto, Japan, December 11-13, 2012.
Solving the challenge in small easy steps
mvmult:
RING.t => int, ind => intmvmult code:
RING.t => int code, ind => int codeRING.t => int code, ind => intRING.t => int pv, ind => int pvadd and mulreducePure generative programming makes it easier to reason about the correctness of the generator, in particular, to statically ensure that the generated code is well-typed. A language that supports code inspection like Pure FreshML also can provide similar guarantees, at the cost of a significantly more complex type system. As a good analogy, parametricity, which prevents some functions from inspecting their arguments, gives in return static assurances of those functions' behavior: ``theorems for free''. As the opposite example, Scheme syntax-rule macros, which may examine their arguments, turn out to have surprising and disturbing power, calling into question static guarantees of their hygiene.
If we cannot rearrange the generated code, however, how can we optimize it? Several examples in the literature prove that optimization is possible. It has not been clear however if optimizations can be programmed modularly, like compiler passes. This page demonstrates pure-generative modular and convenient domain-specific optimizations on the example of the first Shonan Challenge by high-performance computing to staging.
We suppose the reader is familiar with MetaOCaml, at least to the extent of the simple tutorial referenced below.
1 0 0 1 0
0 0 1 0 0
0 0 1 1 1
...
In the Challenge, the matrix, a, is the adjacency matrix, whose elements are
either 0 or 1. (In HMM, non-zero elements
are transition probabilities.) Some, but not all, rows of the matrix
are sparse. In bioinformatics HMM, the
number of rows and columns could be as high as 30 000.
The problem is to generate the function that computes v' <- a * v. The generated code should be algebraically simplified, that is,
contain no expressions of the form 0 * e, 1 * e, 0 + e, etc.
The loop over the matrix rows must be fully unrolled; the inner loop,
computing the inner product of v with a matrix row, should be unrolled
only if the row is sparse enough (has fewer non-zero elements than
a fixed threshold). For the sample matrix above the desired code, assuming
the sparsity threshold 3, should look like
let imvmult_ideal : int array -> int array -> unit =
fun v' v ->
v'.(0) <- v.(0) + v.(3) (* a.(0) is truly sparse *)
v'.(1) <- v.(2) (* a.(1) is very sparse *)
let sum = ref 0 in (* a.(2) is dense enough *)
for j=0 to n-1 do sum <- sum + a.(2).(j) * v.(j) done;
v'.(2) <- sum
...
The main challenge is to add the optimizations -- outer loop unrolling, selective inner loop unrolling, algebraic simplifications -- modularly. The strong point of Scala's Lightweight Modular Staging (LMS) solution is that algebraic specifications or loop unrolling can be added to the existing generator, post hoc. We aspire to do the same in MetaOCaml, without inspecting the generated code.
Tiark Rompf: Solving Shonan Challenge 1 with Scala LMS
< https://github.com/TiarkRompf/virtualization-lms-core/blob/delite-develop2/test-src/epfl/test11-shonan/TestHMM.scala >
This LMS solution to the challenge is described in Sec 5 of the PEPM2013 paper below.
Baris Aktemur, Yukiyoshi Kameyama, Oleg Kiselyov, and Chung-chieh Shan:
Shonan Challenge for Generative Programming PEPM 2013
The paper outlines the challenges and describes other solutions to Challenge 1, in MetaOCaml and Scala.
mvmultlet mvmult vout ma vin = vout := ma ** vinwhere
vout and vin are intended to be vectors, ma is the matrix.
The operations (:=) for vector assignment, (**) for
matrix-vector product and (*.) for the element-wise
multiplication of two vectors are generically defined as let ( := ) vout vin = iter (arr_assign vout vin)
let ( *. ) v1 v2 = zip_with mul v1 v2
let ( ** ) a v = arr_map (fun arow -> reduce add zero (arow *. v)) a
in terms of the primitive operations on vectors arr_map and reduce
(and also iter, zip_with, arr_assign), and the operations on
vector elements add, mul, zero. At this stage we do not say
what the vector elements are, other than that they belong to a ring: module type RING = sig
type t (* abstract *)
val zero : t
val one : t
val add : t -> t -> t
val mul : t -> t -> t
val eq : t -> t -> bool option (* best-effort comparison *)
end
(We do not require the equality on ring elements. Some instances
of RING support element comparisons. In other instances,
only some elements could be decided as equal.)Likewise we do not commit to any particular representation of vector and matrices. We take a general view of an input vector as a map from an (abstract) index to an element, paired with the length (itself specified as an index):
type ('i,'t) arr = Arr of 'i * (* length *)
('i -> 't)
A generic write-only vector oarr is similar. Although we do not yet
know anything about the index, we can already implement
some vector operations: let zip_with : ('t -> 't ->'t) -> ('i,'t) arr -> ('i,'t) arr -> ('i,'t) arr =
fun f (Arr (n1,v1)) (Arr (n2,v2)) ->
Arr (n1, fun i -> f (v1 i) (v2 i))
(* A variation of zip_with, when the first array is output *)
let arr_assign : ('i,'t,'r) oarr -> ('i,'t) arr -> ('i,'r) arr = ....
let arr_map : ('t1 -> 't2) -> ('i,'t1) arr -> ('i,'t2) arr =
fun f (Arr (n, v)) -> Arr (n, fun i -> f (v i))
The other vector operations depend on the index representation. Whatever
that representation is, it should support the operations iter and
reduce. Thus a vector must implement the signature:
module type ARR = sig
type ind (* index *)
type t (* type of values *)
type unt (* unit type *)
val iter : (ind, unt) arr -> unt
val reduce : (t -> t -> t) -> t -> (ind, t) arr -> t
end
High-level specifications are written as OCaml modules parameterized
by a RING and by a vector representation. For example,
the Linear Algebra module:
module LA(R: RING)(A: ARR with type t = R.t) = struct
open R
open A
let ( := ) vout vin = iter (arr_assign vout vin)
let ( *. ) v1 v2 = zip_with mul v1 v2 (* element-wise mul *)
let ( ** ) a v = arr_map (fun arow -> (* matrix-vector mul *)
reduce add zero (arow *. v)) a
end
and the specification for the matrix-vector multiplication: module MVMULT(R: RING)(A: ARR with type t = R.t) = struct
module L = LA(R)(A)
open L
let mvmult vout ma vin = vout := ma ** vin
end
By instantiating MVMULT with different implementations of RING and ARR --
by applying the functor MVMULT to realizations of RING and ARR --
we get different implementations of the matrix-vector multiplication.
We program mvmult once, and obtain the whole family of matrix-vector
multiplication codes. The answer to the Shonan Challenge is
one member of the family.
We have thus specified a simple Domain-Specific Language for Linear Algebra and used it to write matrix-vector multiplication.
MVMULT with just the right
implementations of RING and ARR. These `right' implementations aren't
conjured up. Rather, we develop them step-by-step, changing
only few things at a time and checking their effect.mvmult:
RING.t => int, ind => intMVMULT with the `obvious' first
choice: RingInt and int-indexed vectors. The latter structure has the
form: module ArrSta(S: sig type t end) = struct
include S
type ind = int
type unt = unit
let iter (Arr (n,body)) = ...
let reduce plus zero (Arr (n,v)) =
let sum = ref zero in
for i = 0 to n-1 do sum := plus !sum (v i) done;
!sum
end
This is how the instantiation of MVMULT looks in all detail:
let imvmult : int array -> int array array -> int array -> unit =
fun v' a v ->
let n = Array.length v' in (* vector and matrix representations*)
let vout = OArr (n, fun i v -> v'.(i) <- v) in
let vin = Arr (n,fun i -> v.(i)) in
let ma = Arr (n,fun i -> Arr (n, fun j -> a.(i).(j))) in
let module MV = MVMULT(RingInt)(ArrSta(RingInt)) in
MV.mvmult vout ma vin
The result is the ordinary matrix-vector multiplication function where
vectors are ordinary OCaml arrays. We can immediately test and benchmark imvmult by applying it to sample vectors and matrices.
mvmult code:
RING.t => int code, ind => int codeMVMULT with the ring whose
elements are not integers but integer code expressions. The add operation,
for example, does no addition now. Rather, it builds the code that will
add integers later, when the generated code is compiled and run. module RingIntCode = struct
type t = int code
let zero = .<0>. let one = .<1>.
let add x y = .<.~x + .~y>. let mul x y = .<.~x * .~y>.
let eq x y = None (* can't compare code *)
end
Likewise we introduce (future-stage) vectors whose index is an integer code expression. The length of a vector, also an index, is not known at the code generation time.
module ArrDyn(S: sig type t end) = struct
include S
type ind = int code
type unt = unit code
let iter (Arr (n,body)) = ...
let reduce plus zero (Arr (n,v)) =
.<let sum = ref .~zero in
for i = 0 to (.~n)-1 do
sum := .~(plus .<!sum>. (v .<i>.))
done;
!sum>.
end
We do not change zip_with or arr_map at all: they are
defined generically for all types of indices, irrespective of vector
implementation.
Finally we instantiate MVMULT. Vectors are realized as int array code
of statically, at generation time, unknown length.
let imvmult_c : (int array -> int array array -> int array -> unit) code =
.<fun v' a v ->
let n = Array.length v' in (* vector representations *)
.~(let vout = OArr (.<n>., fun i v -> .<v'.(.~i) <- .~v>.) in
let vin = Arr (.<n>.,fun i -> .<v.(.~i)>.) in
let ma = Arr (.<n>.,fun i ->
Arr (.<n>., fun j -> .<a.(.~i).(.~j)>.)) in
let module MV = MVMULT(RingIntCode)(ArrDyn(RingIntCode)) in
MV.mvmult vout ma vin) (* This line is the same *)
>.
The difference in type of imvmult and imvmult_c is telling:
whereas the former was a function, imvmult_c is the code of a function.
The code can be printed by MetaOCaml top-level as the value of imvmult_c:
val imvmult_c : (int array -> int array array -> int array -> unit) code =
.<fun v'_2 a_3 v_4 ->
let n_5 = (Array.length v'_2) in
for i_6 = 0 to n_5 - 1 do
v'_2.(i_6) <-
let sum_7 = ref 0 in
for i_8 = 0 to n_5 - 1 do
sum_7 := ! sum_7 + (a_3.(i_6).(i_8) * v_4.(i_8))
done;
! sum_7
done>.
The result does look like the code for matrix-vector multiplication --
like the code for imvmult after all module abstractions are eliminated.
We can check: compile imvmult_c as !. imvmult_c and apply the resulting
function to sample vectors and matrices. The behavior must match imvmult.
RING.t => int code, ind => intMVMULT with the int ring and
the int-indexed array -- obtaining the matrix-vector multiplication function
to use now. We also instantiated MVMULT with the RING that merely
promised to add or multiply integers, and with the ARR that
only promised to index a vector. The result was the matrix-vector multiplication
that can only be used later, after compiling it -- although we can
see the code now. It had no optimizations. A mixed choice is a
`hybrid' vector with int indices but int code elements: we can obtain its elements now but can only use them for
later computations. The length of a hybrid vector, an int, is also known
now, that is, when the generator runs. Thus we obtain the code
for multiplying a matrix by a vector of statically known dimensions.
The hybrid vectors are represented by the following structure, which is
quite like ArrSta since in both cases the indices and dimensions
are usable at generation time:
module ArrStaDim(S: sig type t end) = struct
include S
module M = ArrSta(S)
type ind = int
type unt = unit code
let reduce = M.reduce
let seq e1 e2 = .<begin .~e1; .~e2 end>.
let iter arr = reduce seq .<()>. arr
end
The code generator let imvmult_nc : int -> (int array -> int array array -> int array -> unit) code =
fun n -> .<fun v' a v ->
assert (n = Array.length v');
.~(let vout = OArr (n, fun i v -> .<v'.(i) <- .~v>.) in
let vin = Arr (n,fun i -> .<v.(i)>.) in
let ma = Arr (n,fun i -> Arr (n, fun j -> .<a.(i).(j)>.)) in
let module MV = MVMULT(RingIntCode)(ArrStaDim(RingIntCode)) in
MV.mvmult vout ma vin)
>.
albeit quite like imvmult_c, is a function that takes the vector
length as the argument. Applying imvmult_nc to a particular n gives the
the matrix-vector multiplication code specialized to
that vector length: let imvmult_5c = imvmult_nc 5;;
val imvmult_5c : (int array -> int array array -> int array -> unit) code =
.<fun v'_10 a_11 v_12 ->
assert (5 = (Array.length v'_10));
v'_10.(0) <-
0 + (a_11.(0).(0) * v_12.(0)) + (a_11.(0).(1) * v_12.(1)) +
(a_11.(0).(2) * v_12.(2)) + (a_11.(0).(3) * v_12.(3)) +
(a_11.(0).(4) * v_12.(4))
v'_10.(1) <- ...
...
All loops are fully unrolled. This is what it means to take advantage of statically known dimensions, isn't it?
RING.t => int pv, ind => int pvmvmult to the known matrix, we breed
another mongrel vector, whose elements and indices are the hybrid of int and int code.
All values of the type int are usable at generation time; all values
of the type int code can only be used later. Among partially known values, some are known now, and some only later:
type 't pv = Sta of 't | Dyn of 't codeAny partially-known value can be converted to the purely future-stage one, forgetting what we know now:
let dyn : 't pv -> 't code = function
| Sta x -> .<x>.
| Dyn x -> x
Just like int and int code can be put in a ring, so can their
hybrid: module RingIntPCode = struct
type t = int pv
let zero = Sta 0
let one = Sta 1
let add x y =
match (x,y) with
| (Sta x, Sta y) -> Sta (x+y)
| (x, y) -> Dyn (RingIntCode.add (dyn x) (dyn y))
let mul x y = ...
let eq x y =
match (x,y) with
| (Sta x, Sta y) -> Some (x=y)
| _ -> None (* can't compare the code *)
end
RingIntPCode indeed looks like a cross-breed of RingInt and RingIntCode. For example, if both both summands are known, we add
them now -- otherwise, we build the code to add them later. Crucially, zero and one are known now. Partially-known values are the staple
of code generation and should be a library feature. We implemented
them here for completeness.
Mongrel vectors have partially-known values both as indices and
elements. If the length is statically known, we dispatch to the
previously written ArrStaDim implementation. Otherwise, we treat the
vector as completely statically unknown and use ArrDyn, also written
earlier, to handle that case. From now on, our development would be to
mix-and-slightly-extend the previously written code.
module ArrStaDyn(S: sig type t end) = struct
include S
type ind = int pv
type unt = unit code
module ASta = ArrStaDim(S)
module ADyn = ArrDyn(S)
let iter = ...
let reduce plus zero = function
| Arr (Sta n,v) ->
ASta.reduce plus zero (Arr (n, fun i -> v (Sta i)))
| Arr (Dyn n,v) ->
Dyn (ADyn.reduce (fun x y -> dyn (plus (Dyn x) (Dyn y))) (dyn zero)
(Arr (n,fun i -> dyn (v (Dyn i)))))
end
The new generator instantiates the same old MVMULT with mongrel
vectors. Since the matrix is regarded as statically known, indexing it
with indices known now gives the usable now value. The input vector vin, albeit of the statically known length, comprises elements unknown at
the generator time.
let imvmult_ac : int array array -> (int array -> int array -> unit) code =
fun a ->
let n = Array.length a in
let ma = Arr (Sta n, fun i -> Arr (Sta n,
(fun j ->
match (i,j) with
| (Sta i, Sta j) -> Sta a.(i).(j)
| (i, j) -> Dyn .<a.(.~(dyn i)).(.~(dyn j))>.))) in
.<fun v' v ->
assert (n = Array.length v');
.~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyn i)) <- .~(dyn v)>.) in
let vin = Arr (Sta n, fun i -> Dyn .<v.(.~(dyn i))>.) in
let module MV = MVMULT(RingIntPCode)(ArrStaDyn(RingIntPCode)) in
MV.mvmult vout ma vin)
>.
Compared to the earlier imvmult_nc, the generator imvmult_ac
receives not just the matrix dimensions but the matrix itself to
specialize to. When applied to the sample matrix of the challenge, it
produces: .<fun v'_27 v_28 ->
assert (5 = (Array.length v'_27));
v'_27.(0) <-
0 + (1 * v_28.(0)) + (0 * v_28.(1)) + (0 * v_28.(2)) +
(1 * v_28.(3)) + (0 * v_28.(4))
...>.
Not only loops are unrolled, the matrix elements are inlined. As the result,
we see lots of trivial multiplications and additions.
Algebraic simplifications are direly needed.add and mulRingIntPCode, overriding, or improving add and mul: module RingIntOPCode = struct
include RingIntPCode (* inherit and override *)
let add x y = match (x,y) with
| (Sta 0,x) | (x,Sta 0) -> x
| (x,y) -> RingIntPCode.add x y
let mul x y = match (x,y) with
| (Sta 0,_) | (_,Sta 0) -> Sta 0
| (Sta 1,x) | (x,Sta 1) -> x
| (x,y) -> RingIntPCode.mul x y
end
The improvement adds cases to add and mul: for example, when one of the
multiplicands is statically zero, the product is statically zero regardless
of the other multiplicand. Just replacing RingIntPCode with RingIntOPCode in the earlier generator imvmult_ac -- keeping everything
else the same -- gives the following optimized code specialized to the
sample matrix. It became short enough to show completely: .<fun v'_29 v_30 ->
assert (5 = (Array.length v'_29));
v'_29.(0) <- v_30.(0) + v_30.(3)
v'_29.(1) <- v_30.(2)
v'_29.(2) <- v_30.(1)
v'_29.(3) <- v_30.(2) + v_30.(3) + v_30.(4)
v'_29.(4) <- v_30.(2) + v_30.(4)>.
We have almost reached the goal. However, the Challenge specified that
the inner-product loop is to be unrolled only if the corresponding
matrix row is sparse enough. Unrolling should be selective.reducereduce, required by the Challenge,
will estimate that number first, and unroll the loop only if the
number is under the set threshold.
As with the algebraic simplification, we improve the previously written
code reusing as much as possible. Here we smarten-up ArrStaDyn
by adding a few checks to its reduce:
module ArrStaOptDynInt = struct
module R = RingIntOPCode
module M = ArrStaDyn(R)
include M
let threshold = 3 (* density threshold *)
let count_non_zeros n v =
let rec loop acc i =
if i >= n then acc
else
let acc = if R.eq R.zero (v (Sta i)) = Some true
then acc else acc + 1 in
loop acc (i+1)
in loop 0 0
let reduce plus zero = function
| (Arr (Sta n,v)) as arr ->
if count_non_zeros n v < threshold then
M.reduce plus zero arr
else
M.reduce plus zero (Arr (Dyn .<n>.,v))
| arr -> M.reduce plus zero arr
end
If unrolling is at all possible -- the length of the vector is known
-- we check how many elements of the vector are statically known to
be RING.zero. (When the reduction is actually run, there may be more
zero elements, depending on the input vector. We cannot know about
these at the code generation time.) If there are not enough zero
elements, we turn the length of the vector into the future-stage
value, effectively switching off the unrolling.
We thus solve the challenge, attaining the desired code imvmult_ideal.
n-vector of
vectors (rows) of size n each, taking overall n*n space. Since the matrix
is supposed to be sparse, we ought to use a more compact representation.
For example, we store the matrix as an array of adjacency lists,
with each row packed as the list of non-zero elements and their indices.
Changing the matrix representation is trivial in our framework,
merely by adjusting the mapping from the concrete matrix to the abstract Arr
(in the code below, the changed definition is marked as `new'). let imvmult_acs : (int*int) list array -> (int array -> int array -> unit) code =
fun a ->
let n = Array.length a in
let ma = Arr (Sta n, fun i -> Arr (Sta n, (* new: changed ma mapping *)
(fun j ->
match (i,j) with
| (Sta i, Sta j) -> Sta (sparse_deref a.(i) j)
| (Sta i, Dyn j) -> let ai = a.(i) in (* matrix row *)
Dyn .<sparse_deref ai .~j>.
| (i, j) -> Dyn .<sparse_deref a.(.~(dyni i)) (.~(dyni j))>.))) in
.<fun v' v ->
assert (n = Array.length v');
.~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyni i)) <- .~(dyn v)>.) in
let vin = Arr (Sta n,fun i -> Dyn .<v.(.~(dyni i))>.) in
let module MV = MVMULT(RingIntOPCode)(ArrStaOptDynInt) in
MV.mvmult vout ma vin)
>.
oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!
Converted from HSXML by HSXML->HTML