
Modular, convenient, assured domain-specific optimizations


The promise of program generation is producing efficient code from high-level, obviously correct specifications by making it possible and convenient for domain experts to write and apply ``optimization passes'' at a high level of abstraction.

On the example of the first Shonan Challenge, posed by the high-performance computing community to the staging community, we demonstrate that the promise is fulfilled. We obtain assuredly well-typed and optimal code. Although generators are not allowed to inspect, let alone rearrange, the generated code (for the sake of static assurances), optimizations are possible. The optimizations have to be at a level much higher than search-and-replace in the abstract syntax tree or a source-code string. The optimizations are developed in small separate modules and mixed in one by one, checking their effect after each small step. The end result is a small domain-specific language embedded in MetaOCaml that lets us write high-level Basic Linear Algebra expressions and apply several passes of domain-specific optimizations such as selective loop unrolling, sparse matrix representations, and algebraic simplifications.

This page describes the poster presented at the 10th Asian Symposium on Programming Languages and Systems (APLAS). Kyoto, Japan, December 11-13, 2012.



Pure generative programming

In pure generative programming, the generated code is a black box: we combine code fragments into bigger fragments but cannot look inside what we have generated and cannot take it apart. MetaOCaml is pure generative. In Template Haskell, on the other hand, the generated code is represented as the value of an exposed algebraic data type, which may be traversed, deconstructed and manipulated at will.

Pure generative programming makes it easier to reason about the correctness of the generator, in particular, to statically ensure that the generated code is well-typed. A language that supports code inspection, like Pure FreshML, can also provide similar guarantees, at the cost of a significantly more complex type system. As a good analogy, parametricity, which prevents some functions from inspecting their arguments, gives in return static assurances of those functions' behavior: ``theorems for free''. As a contrasting example, Scheme syntax-rule macros, which may examine their arguments, turn out to have surprising and disturbing power, calling into question static guarantees of their hygiene.

If we cannot rearrange the generated code, however, how can we optimize it? Several examples in the literature prove that optimization is possible. It has not been clear, however, if optimizations can be programmed modularly, like compiler passes. This page demonstrates pure-generative, modular, and convenient domain-specific optimizations on the example of the first Shonan Challenge, posed by the high-performance computing community to the staging community.

We suppose the reader is familiar with MetaOCaml, at least to the extent of the simple tutorial referenced below.

Walid Taha: A Sound Reduction Semantics for Untyped CBN Multi-Stage Computation
Proc. PEPM 2000, pp. 34-43.
The paper shows that inspection of the generated code makes the equational theory trivial: that is, only identical terms are equal and there are no interesting equalities.

Dirty macros

Introduction to staging and MetaOCaml


Shonan Challenge 1

The challenge, abstracted from Hidden-Markov Model (HMM) computations in bioinformatics, is to obtain efficient code for multiplying a vector by a statically known, sparse matrix such as:
     1 0 0 1 0
     0 0 1 0 0
     0 0 1 1 1
In the Challenge, the matrix, a, is the adjacency matrix, whose elements are either 0 or 1. (In HMM, non-zero elements are transition probabilities.) Some, but not all, rows of the matrix are sparse. In bioinformatics HMM, the number of rows and columns could be as high as 30 000.

The problem is to generate a function that computes v' <- a * v. The generated code should be algebraically simplified, that is, contain no expressions of the form 0 * e, 1 * e, 0 + e, etc. The loop over the matrix rows must be fully unrolled; the inner loop, computing the inner product of v with a matrix row, should be unrolled only if the row is sparse enough (has fewer non-zero elements than a fixed threshold). For the sample matrix above, the desired code, assuming a sparsity threshold of 3, should look like

     let imvmult_ideal : int array -> int array -> unit =
       fun v' v ->
        v'.(0) <- v.(0) + v.(3);          (* a.(0) is truly sparse *)
        v'.(1) <- v.(2);                  (* a.(1) is very sparse  *)
        let sum = ref 0 in                (* a.(2) is dense enough *)
        for j = 0 to n-1 do sum := !sum + a.(2).(j) * v.(j) done;
        v'.(2) <- !sum

The main challenge is to add the optimizations -- outer loop unrolling, selective inner loop unrolling, algebraic simplifications -- modularly. The strong point of Scala's Lightweight Modular Staging (LMS) solution is that algebraic simplifications or loop unrolling can be added to the existing generator, post hoc. We aspire to do the same in MetaOCaml, without inspecting the generated code.

Kenichi Asai: Shonan Challenge 1. May 23, 2012

Tiark Rompf: Solving Shonan Challenge 1 with Scala LMS
This LMS solution to the challenge is described in Sec 5 of the PEPM2013 paper below.

Baris Aktemur, Yukiyoshi Kameyama, Oleg Kiselyov, and Chung-chieh Shan:
Shonan Challenge for Generative Programming. PEPM 2013.
The paper outlines the challenges and describes other solutions to Challenge 1, in MetaOCaml and Scala.


High-level specification for mvmult

Our goal is to obtain the optimal code as a specialization -- refinement -- of a high-level, obviously correct specification. The textbook definition of matrix-vector multiplication is: for each matrix row, compute its inner product with the vector. To find the inner product of two vectors, multiply them element-wise and reduce. In OCaml, these specifications look as follows:
     let mvmult vout ma vin = vout := ma ** vin
where vout and vin are intended to be vectors, ma is the matrix. The operations (:=) for vector assignment, (**) for matrix-vector product and (*.) for the element-wise multiplication of two vectors are generically defined as
     let ( := ) vout vin = iter (arr_assign vout vin)
     let ( *. ) v1 v2    = zip_with mul v1 v2
     let ( ** ) a v      = arr_map (fun arow -> reduce add zero (arow *. v)) a
in terms of the primitive operations on vectors arr_map and reduce (and also iter, zip_with, arr_assign), and the operations on vector elements add, mul, zero. At this stage we do not say what the vector elements are, other than that they belong to a ring:
     module type RING = sig
       type t				(* abstract *)
       val zero : t
       val one  : t
       val add  : t -> t -> t
       val mul  : t -> t -> t
       val eq   : t -> t -> bool option 	(* best-effort comparison *)
     end
(We do not require the equality on ring elements. Some instances of RING support element comparisons. In other instances, only some elements could be decided as equal.)
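For concreteness, here is the evident integer instance of this signature, a plain-OCaml sketch (RingInt is the name used for it in Step 1 below):

```ocaml
(* The signature from above, closed, plus its evident integer instance *)
module type RING = sig
  type t
  val zero : t
  val one  : t
  val add  : t -> t -> t
  val mul  : t -> t -> t
  val eq   : t -> t -> bool option  (* best-effort comparison *)
end

module RingInt : RING with type t = int = struct
  type t = int
  let zero = 0
  let one  = 1
  let add  = ( + )
  let mul  = ( * )
  let eq x y = Some (x = y)  (* machine integers always compare *)
end
```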

Likewise we do not commit to any particular representation of vector and matrices. We take a general view of an input vector as a map from an (abstract) index to an element, paired with the length (itself specified as an index):

     type ('i,'t) arr = Arr of 'i *			(* length *)
                               ('i -> 't)
A generic write-only vector oarr is similar. Although we do not yet know anything about the index, we can already implement some vector operations:
     let zip_with : ('t -> 't ->'t) -> ('i,'t) arr -> ('i,'t) arr -> ('i,'t) arr = 
       fun f (Arr (n1,v1)) (Arr (n2,v2)) -> 
         Arr (n1, fun i -> f (v1 i) (v2 i))
     (* A variation of zip_with, when the first array is output *)
     let arr_assign : ('i,'t,'r) oarr -> ('i,'t) arr -> ('i,'r) arr = ....
     let arr_map : ('t1 -> 't2) -> ('i,'t1) arr -> ('i,'t2) arr =
      fun f (Arr (n, v)) -> Arr (n, fun i -> f (v i))
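These generic operations can be exercised on ordinary OCaml arrays right away. In the sketch below, the helpers arr_of_array and to_array are ours, added only for the demonstration; they are not part of the DSL:

```ocaml
type ('i,'t) arr = Arr of 'i * ('i -> 't)     (* length, indexing function *)

let zip_with f (Arr (n1,v1)) (Arr (_n2,v2)) =
  Arr (n1, fun i -> f (v1 i) (v2 i))

let arr_map f (Arr (n,v)) = Arr (n, fun i -> f (v i))

(* Demonstration helpers (not part of the DSL) *)
let arr_of_array a = Arr (Array.length a, fun i -> a.(i))
let to_array (Arr (n,v)) = Array.init n v

let () =
  let v1 = arr_of_array [|1;2;3|] and v2 = arr_of_array [|10;20;30|] in
  let prod = to_array (zip_with ( * ) v1 v2) in
  assert (prod = [|10;40;90|])
```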

The other vector operations depend on the index representation. Whatever that representation is, it should support the operations iter and reduce. Thus a vector must implement the signature:

     module type ARR = sig
       type ind                         (* index *)
       type t                           (* type of values *)
       type unt                         (* unit type *)
       val iter   : (ind, unt) arr -> unt
       val reduce : (t -> t -> t) -> t -> (ind, t) arr -> t
     end

High-level specifications are written as OCaml modules parameterized by a RING and by a vector representation. For example, the Linear Algebra module:

     module LA(R: RING)(A: ARR with type t = R.t) = struct
       open R
       open A
       let ( := ) vout vin = iter (arr_assign vout vin)
       let ( *. ) v1 v2 = zip_with mul v1 v2      (* element-wise mul *)
       let ( ** ) a v =  arr_map (fun arow ->     (* matrix-vector mul *)
         reduce add zero (arow *. v)) a
     end
and the specification for the matrix-vector multiplication:
     module MVMULT(R: RING)(A: ARR with type t = R.t) = struct
       module L = LA(R)(A)
       open L
       let mvmult vout ma vin = vout := ma ** vin
     end

By instantiating MVMULT with different implementations of RING and ARR -- by applying the functor MVMULT to realizations of RING and ARR -- we get different implementations of the matrix-vector multiplication. We program mvmult once, and obtain the whole family of matrix-vector multiplication codes. The answer to the Shonan Challenge is one member of the family.

We have thus specified a simple Domain-Specific Language for Linear Algebra and used it to write matrix-vector multiplication.


Solving the challenge in small easy steps

We will now solve the challenge by instantiating MVMULT with just the right implementations of RING and ARR. These `right' implementations aren't conjured up. Rather, we develop them step by step, changing only a few things at a time and checking their effect.


Step 1. ordinary mvmult: RING.t => int, ind => int

We start by instantiating MVMULT with the `obvious' first choice: RingInt and int-indexed vectors. The latter structure has the form:
     module ArrSta(S: sig type t end) = struct
       include S
       type ind = int
       type unt = unit
       let iter (Arr (n,body)) = ...
       let reduce plus zero (Arr (n,v))  = 
         let sum = ref zero in
           for i = 0 to n-1 do sum := plus !sum (v i) done;
           !sum
     end

This is how the instantiation of MVMULT looks in all detail:

     let imvmult : int array -> int array array -> int array -> unit =
       fun v' a v ->
         let n = Array.length v' in      (* vector and matrix representations*)
         let vout = OArr (n, fun i v -> v'.(i) <- v) in
         let vin  = Arr (n,fun i -> v.(i)) in
         let ma   = Arr (n,fun i -> Arr (n, fun j -> a.(i).(j))) in
         let module MV = MVMULT(RingInt)(ArrSta(RingInt)) in
         MV.mvmult vout ma vin

The result is the ordinary matrix-vector multiplication function where vectors are ordinary OCaml arrays. We can immediately test and benchmark imvmult by applying it to sample vectors and matrices.
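The whole of Step 1 can be assembled into one self-contained plain-OCaml program. The definitions of the oarr type, arr_assign and iter below are our reconstructions of the elided pieces, filled in as we understand them; everything else follows the page verbatim:

```ocaml
type ('i,'t) arr = Arr of 'i * ('i -> 't)
(* Write-only vector: length and an assignment function (our reconstruction) *)
type ('i,'t,'r) oarr = OArr of 'i * ('i -> 't -> 'r)

let zip_with f (Arr (n1,v1)) (Arr (_,v2)) = Arr (n1, fun i -> f (v1 i) (v2 i))
let arr_map f (Arr (n,v)) = Arr (n, fun i -> f (v i))
let arr_assign (OArr (n,out)) (Arr (_,v)) = Arr (n, fun i -> out i (v i))

module RingInt = struct
  let zero = 0 and add = ( + ) and mul = ( * )
end

module ArrSta = struct
  let iter (Arr (n,body)) = for i = 0 to n-1 do body i done
  let reduce plus zero (Arr (n,v)) =
    let sum = ref zero in
    for i = 0 to n-1 do sum := plus !sum (v i) done;
    !sum
end

(* The high-level specification, as in the LA and MVMULT modules *)
let ( := ) vout vin = ArrSta.iter (arr_assign vout vin)
let ( *. ) v1 v2   = zip_with RingInt.mul v1 v2
let ( ** ) a v     =
  arr_map (fun arow -> ArrSta.reduce RingInt.add RingInt.zero (arow *. v)) a
let mvmult vout ma vin = vout := ma ** vin

let imvmult v' a v =
  let n = Array.length v' in
  let vout = OArr (n, fun i x -> v'.(i) <- x) in
  let vin  = Arr (n, fun i -> v.(i)) in
  let ma   = Arr (n, fun i -> Arr (n, fun j -> a.(i).(j))) in
  mvmult vout ma vin
```

Applied to a 3x3 sample, imvmult behaves exactly like the textbook multiplication.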


Step 2. ordinary mvmult code: RING.t => int code, ind => int code

The next step is instantiating the same MVMULT with the ring whose elements are not integers but integer code expressions. The add operation, for example, does no addition now. Rather, it builds the code that will add integers later, when the generated code is compiled and run.
     module RingIntCode = struct
       type t = int code
       let zero = .<0>.               let one  = .<1>.
       let add x y = .<.~x + .~y>.    let mul x y = .<.~x * .~y>.
       let eq x y = None			(* can't compare code *)
     end

Likewise we introduce (future-stage) vectors whose index is an integer code expression. The length of a vector, also an index, is not known at the code generation time.

     module ArrDyn(S: sig type t end) = struct
       include S
       type ind = int code
       type unt = unit code
       let iter (Arr (n,body)) = ...
       let reduce plus zero (Arr (n,v))  = 
         .<let sum = ref .~zero in
           for i = 0 to (.~n)-1 do
             sum := .~(plus .<!sum>. (v .<i>.))
           done;
           !sum>.
     end
We do not change zip_with or arr_map at all: they are defined generically for all types of indices, irrespective of vector implementation.

Finally we instantiate MVMULT. Vectors are realized as int array code of statically, at generation time, unknown length.

     let imvmult_c : (int array -> int array array -> int array -> unit) code =
      .<fun v' a v ->
         let n = Array.length v' in  (* vector representations *)
         .~(let vout = OArr (.<n>., fun i v -> .<v'.(.~i) <- .~v>.) in
            let vin  = Arr (.<n>.,fun i -> .<v.(.~i)>.) in
            let ma   = Arr (.<n>.,fun i -> 
                                   Arr (.<n>., fun j -> .<a.(.~i).(.~j)>.)) in
            let module MV = MVMULT(RingIntCode)(ArrDyn(RingIntCode)) in
            MV.mvmult vout ma vin)>.	(* This line is the same *)

The difference in type of imvmult and imvmult_c is telling: whereas the former was a function, imvmult_c is the code of a function. The code can be printed by MetaOCaml top-level as the value of imvmult_c:

     val imvmult_c : (int array -> int array array -> int array -> unit) code =
      .<fun v'_2 a_3 v_4 ->
         let n_5 = (Array.length v'_2) in
          for i_6 = 0 to n_5 - 1 do
           v'_2.(i_6) <-
            (let sum_7 = ref 0 in
             for i_8 = 0 to n_5 - 1 do
              sum_7 := ! sum_7 + (a_3.(i_6).(i_8) * v_4.(i_8))
             done;
             ! sum_7)
          done>.

The result does look like the code for matrix-vector multiplication -- like the code for imvmult after all module abstractions are eliminated. We can check: compile imvmult_c as !. imvmult_c and apply the resulting function to sample vectors and matrices. The behavior must match imvmult.


Step 3. full loop unrolling: RING.t => int code, ind => int

Previously we instantiated MVMULT with the int ring and the int-indexed array -- obtaining the matrix-vector multiplication function to use now. We also instantiated MVMULT with the RING that merely promised to add or multiply integers, and with the ARR that only promised to index a vector. The result was the matrix-vector multiplication that can only be used later, after compiling it -- although we can see the code now. It had no optimizations. A mixed choice is a `hybrid' vector with int indices but int code elements: we can obtain its elements now but can only use them for later computations. The length of a hybrid vector, an int, is also known now, that is, when the generator runs. Thus we obtain the code for multiplying a matrix by a vector of statically known dimensions.

The hybrid vectors are represented by the following structure, which is quite like ArrSta since in both cases the indices and dimensions are usable at generation time:

     module ArrStaDim(S: sig type t end) = struct
       include S
       module M = ArrSta(S)
       type ind = int
       type unt = unit code
       let reduce = M.reduce
       let seq e1 e2 = .<begin .~e1; .~e2 end>.
       let iter arr = reduce seq .<()>. arr
     end
The code generator
     let imvmult_nc : int -> (int array -> int array array -> int array -> unit) code =
      fun n -> .<fun v' a v ->
         assert (n = Array.length v');
         .~(let vout = OArr (n, fun i v -> .<v'.(i) <- .~v>.) in
            let vin  = Arr (n,fun i -> .<v.(i)>.) in
            let ma   = Arr (n,fun i -> Arr (n, fun j -> .<a.(i).(j)>.)) in
            let module MV = MVMULT(RingIntCode)(ArrStaDim(RingIntCode)) in
            MV.mvmult vout ma vin)>.
albeit quite like imvmult_c, is a function that takes the vector length as an argument. Applying imvmult_nc to a particular n gives the matrix-vector multiplication code specialized to that vector length:
     let imvmult_5c = imvmult_nc 5;;
     val imvmult_5c : (int array -> int array array -> int array -> unit) code =
       .<fun v'_10 a_11 v_12 ->
          assert (5 = (Array.length v'_10));
          v'_10.(0) <-
           0 + (a_11.(0).(0) * v_12.(0)) + (a_11.(0).(1) * v_12.(1)) + 
               (a_11.(0).(2) * v_12.(2)) + (a_11.(0).(3) * v_12.(3)) + 
               (a_11.(0).(4) * v_12.(4))
          v'_10.(1) <- ...

All loops are fully unrolled. This is what it means to take advantage of statically known dimensions, isn't it?


Step 4. specializing to the known matrix: RING.t => int pv, ind => int pv

We have just generated the code that took the full advantage of statically known dimensions -- unrolling all the loops and statically indexing all vectors. In the Challenge, the matrix itself is statically known. To specialize mvmult to the known matrix, we breed another mongrel vector, whose elements and indices are the hybrid of int and int code.

All values of the type int are usable at generation time; all values of the type int code can only be used later. Among partially known values, some are known now, and some only later:

     type 't pv = Sta of 't | Dyn of 't code
Any partially-known value can be converted to the purely future-stage one, forgetting what we know now:
     let dyn : 't pv -> 't code = function
       | Sta x -> .<x>.
       | Dyn x -> x
Just like int and int code can be put in a ring, so can their hybrid:
     module RingIntPCode = struct
       type t = int pv
       let zero = Sta 0
       let one  = Sta 1
       let add x y = 
         match (x,y) with
         | (Sta x, Sta y) -> Sta (x+y)
         | (x, y) -> Dyn (RingIntCode.add (dyn x) (dyn y))
       let mul x y =  ...
       let eq x y = 
         match (x,y) with
         | (Sta x, Sta y) -> Some (x=y)
         | _              -> None	  	(* can't compare the code *)
     end
RingIntPCode indeed looks like a cross-breed of RingInt and RingIntCode. For example, if both summands are known, we add them now -- otherwise, we build the code to add them later. Crucially, zero and one are known now. Partially-known values are the staple of code generation and should be a library feature. We implemented them here for completeness.
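The behavior of the partially-known ring can be checked outside MetaOCaml by mocking future-stage code as strings. The string payload of Dyn below is purely an assumption made for this test; in MetaOCaml, Dyn carries a real int code value:

```ocaml
(* Mock of partially-known values: Dyn holds code-as-a-string *)
type 't pv = Sta of 't | Dyn of string

let dyn : int pv -> string = function
  | Sta x -> string_of_int x
  | Dyn x -> x

module RingIntPCode = struct
  type t = int pv
  let zero = Sta 0
  let one  = Sta 1
  let add x y = match (x,y) with
    | (Sta x, Sta y) -> Sta (x+y)                        (* add now *)
    | (x, y) -> Dyn ("(" ^ dyn x ^ " + " ^ dyn y ^ ")")  (* add later *)
  let mul x y = match (x,y) with
    | (Sta x, Sta y) -> Sta (x*y)
    | (x, y) -> Dyn ("(" ^ dyn x ^ " * " ^ dyn y ^ ")")
  let eq x y = match (x,y) with
    | (Sta x, Sta y) -> Some (x=y)
    | _              -> None      (* can't compare the code *)
end
```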

Mongrel vectors have partially-known values both as indices and as elements. If the length is statically known, we dispatch to the previously written ArrStaDim implementation. Otherwise, we treat the vector as completely statically unknown and use ArrDyn, also written earlier, to handle that case. From now on, our development proceeds by mixing and slightly extending previously written code.

     module ArrStaDyn(S: sig type t end) = struct
       include S
       type ind = int pv
       type unt = unit code
       module ASta = ArrStaDim(S)
       module ADyn = ArrDyn(S)
       let iter = ...
       let reduce plus zero = function
         | Arr (Sta n,v) ->
             ASta.reduce plus zero (Arr (n, fun i -> v (Sta i)))
         | Arr (Dyn n,v) ->
           Dyn (ADyn.reduce (fun x y -> dyn (plus (Dyn x) (Dyn y))) (dyn zero) 
                   (Arr (n,fun i -> dyn (v (Dyn i)))))
     end

The new generator instantiates the same old MVMULT with mongrel vectors. Since the matrix is regarded as statically known, indexing it with indices known now gives a value usable now. The input vector vin, albeit of statically known length, comprises elements unknown at generation time.

     let imvmult_ac : int array array -> (int array -> int array -> unit) code =
      fun a -> 
        let n  = Array.length a in
        let ma = Arr (Sta n, fun i -> Arr (Sta n, 
         (fun j -> 
           match (i,j) with
           | (Sta i, Sta j) -> Sta a.(i).(j)
           | (i, j) -> Dyn .<a.(.~(dyn i)).(.~(dyn j))>.))) in
        .<fun v' v ->
         assert (n = Array.length v');
         .~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyn i)) <- .~(dyn v)>.) in
            let vin  = Arr  (Sta n, fun i -> Dyn .<v.(.~(dyn i))>.) in
            let module MV = MVMULT(RingIntPCode)(ArrStaDyn(RingIntPCode)) in
            MV.mvmult vout ma vin)>.
Compared to the earlier imvmult_nc, the generator imvmult_ac receives not just the matrix dimensions but the matrix itself to specialize to. When applied to the sample matrix of the challenge, it produces:
     .<fun v'_27 v_28 -> 
       assert (5 = (Array.length v'_27));
       v'_27.(0) <-
        0 + (1 * v_28.(0)) + (0 * v_28.(1)) + (0 * v_28.(2)) +
            (1 * v_28.(3)) + (0 * v_28.(4))
Not only are the loops unrolled; the matrix elements are inlined as well. As a result, we see lots of trivial multiplications and additions. Algebraic simplifications are direly needed.


Step 5. algebraic simplifications: overriding add and mul

Adding algebraic simplifications as an optimization of the generated code turns out to be simple and modular, reusing the earlier code. All we need is to adjust the ring RingIntPCode, overriding, that is, improving, add and mul:
     module RingIntOPCode = struct
       include RingIntPCode          (* inherit and override *)
       let add x y = match (x,y) with
         | (Sta 0,x) | (x,Sta 0) -> x
         | (x,y)     -> RingIntPCode.add x y
       let mul x y = match (x,y) with
         | (Sta 0,_) | (_,Sta 0) -> Sta 0
         | (Sta 1,x) | (x,Sta 1) -> x
         | (x,y)     -> RingIntPCode.mul x y
     end
The improvement adds cases to add and mul: for example, when one of the multiplicands is statically zero, the product is statically zero regardless of the other multiplicand. Just replacing RingIntPCode with RingIntOPCode in the earlier generator imvmult_ac -- keeping everything else the same -- gives the following optimized code specialized to the sample matrix. It became short enough to show completely:
     .<fun v'_29 v_30 -> 
        assert (5 = (Array.length v'_29));
        v'_29.(0) <- v_30.(0) + v_30.(3)
        v'_29.(1) <- v_30.(2)
        v'_29.(2) <- v_30.(1)
        v'_29.(3) <- v_30.(2) + v_30.(3) + v_30.(4)
        v'_29.(4) <- v_30.(2) + v_30.(4)>.
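The overridden operations can likewise be exercised in isolation, again mocking future-stage code as strings (an assumption made only for testing; in MetaOCaml, Dyn carries real code):

```ocaml
type 't pv = Sta of 't | Dyn of string  (* mock: code as strings *)

module RingIntPCode = struct
  let dyn = function Sta x -> string_of_int x | Dyn x -> x
  let add x y = match (x,y) with
    | (Sta x, Sta y) -> Sta (x+y)
    | (x, y) -> Dyn ("(" ^ dyn x ^ " + " ^ dyn y ^ ")")
  let mul x y = match (x,y) with
    | (Sta x, Sta y) -> Sta (x*y)
    | (x, y) -> Dyn ("(" ^ dyn x ^ " * " ^ dyn y ^ ")")
end

module RingIntOPCode = struct
  include RingIntPCode          (* inherit and override *)
  let add x y = match (x,y) with
    | (Sta 0,x) | (x,Sta 0) -> x
    | (x,y)     -> RingIntPCode.add x y
  let mul x y = match (x,y) with
    | (Sta 0,_) | (_,Sta 0) -> Sta 0
    | (Sta 1,x) | (x,Sta 1) -> x
    | (x,y)     -> RingIntPCode.mul x y
end
```

Multiplication by a statically known 0 or 1 never generates residual code, no matter how unknown the other operand is.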
We have almost reached the goal. However, the Challenge specified that the inner-product loop is to be unrolled only if the corresponding matrix row is sparse enough. Unrolling should be selective.


Step 6. selective unrolling: overriding reduce

We have been unrolling all loops with statically known bounds. Unrolling a large loop however may lead to code explosion. A smarter strategy is to estimate, prior to unrolling, how big the code will become. In matrix-vector multiplication, the size of the unrolled inner-product loop depends on the number of non-zero elements in the vector to reduce. The smart reduce, required by the Challenge, will estimate that number first, and unroll the loop only if the number is under the set threshold.

As with the algebraic simplification, we improve the previously written code, reusing as much as possible. Here we smarten up ArrStaDyn by adding a few checks to its reduce:

     module ArrStaOptDynInt = struct
       module R = RingIntOPCode
       module M = ArrStaDyn(R)
       include M
       let threshold = 3			(* density threshold *)
       let count_non_zeros n v =
         let rec loop acc i =
           if i >= n then acc
           else
             let acc = if R.eq (v (Sta i)) R.zero = Some true
                       then acc else acc + 1 in
             loop acc (i+1)
         in loop 0 0
       let reduce plus zero = function
         | (Arr (Sta n,v)) as arr ->
             if count_non_zeros n v < threshold then
               M.reduce plus zero arr
             else
               M.reduce plus zero (Arr (Dyn .<n>.,v))
         | arr -> M.reduce plus zero arr
     end
If unrolling is at all possible -- the length of the vector is known -- we check how many elements of the vector are statically known to be zero. (When the reduction is actually run, there may be more zero elements, depending on the input vector. We cannot know about these at code generation time.) If there are too many elements not known to be zero, we turn the length of the vector into a future-stage value, effectively switching off the unrolling.
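The sparsity estimate can be tested on its own. Below is a plain-OCaml sketch of count_non_zeros, with the partially-known values mocked up (the string payload of Dyn is an assumption for testing only):

```ocaml
type 't pv = Sta of 't | Dyn of string  (* mock of partially-known values *)

(* Best-effort comparison, as in RING *)
let eq x y = match (x,y) with
  | (Sta x, Sta y) -> Some (x = y)
  | _              -> None

let zero = Sta 0

(* Count the elements NOT statically known to be zero *)
let count_non_zeros n v =
  let rec loop acc i =
    if i >= n then acc
    else
      let acc = if eq (v (Sta i)) zero = Some true
                then acc else acc + 1 in
      loop acc (i+1)
  in loop 0 0
```

An element whose value is only known later compares as None and is conservatively counted as possibly non-zero.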

We thus solve the challenge, attaining the desired code imvmult_ideal.


Step 7. packed matrix: re-mapping the abstract matrix

The sample matrix in the Challenge was given as an n-vector of vectors (rows) of size n each, taking overall n*n space. Since the matrix is supposed to be sparse, we ought to use a more compact representation. For example, we store the matrix as an array of adjacency lists, with each row packed as the list of non-zero elements and their indices. Changing the matrix representation is trivial in our framework, merely by adjusting the mapping from the concrete matrix to the abstract Arr (in the code below, the changed definition is marked as `new').
     let imvmult_acs : (int*int) list array -> (int array -> int array -> unit) code =
      fun a -> 
        let n  = Array.length a in
        let ma = Arr (Sta n, fun i -> Arr (Sta n,   (* new: changed ma mapping *)
         (fun j -> 
           match (i,j) with
           | (Sta i, Sta j) -> Sta (sparse_deref a.(i) j)
           | (Sta i, Dyn j) -> let ai = a.(i) in     (* matrix row *)
                               Dyn .<sparse_deref ai .~j>.
           | (i, j) -> Dyn .<sparse_deref a.(.~(dyni i)) (.~(dyni j))>.))) in
        .<fun v' v ->
         assert (n = Array.length v');
         .~(let vout = OArr (Sta n, fun i v -> .<v'.(.~(dyni i)) <- .~(dyn v)>.) in
            let vin  = Arr (Sta n,fun i -> Dyn .<v.(.~(dyni i))>.) in
            let module MV = MVMULT(RingIntOPCode)(ArrStaOptDynInt) in
            MV.mvmult vout ma vin)>.
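The generator above relies on a helper sparse_deref, whose definition is not shown on this page. A plausible plain-OCaml definition, our assumption, looks up a column index in the packed row and returns 0 for absent entries:

```ocaml
(* A row packed as an association list of (column index, non-zero value).
   sparse_deref is our reconstruction; the actual helper may differ. *)
let sparse_deref (row : (int * int) list) (j : int) : int =
  try List.assoc j row with Not_found -> 0

let () =
  (* Row 0 of the sample matrix, 1 0 0 1 0, packed *)
  let row0 = [(0,1); (3,1)] in
  assert (sparse_deref row0 0 = 1);
  assert (sparse_deref row0 1 = 0);
  assert (sparse_deref row0 3 = 1)
```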



We have presented a small DSL embedded in MetaOCaml to write Basic Linear Algebra (BLAS) routines and apply several passes of domain-specific optimizations such as selective loop unrolling and algebraic simplifications. BLAS functions to use now, and variously optimized BLAS code to compile and run later, are different interpretations of the same general algorithm. Optimizations are programmed in layers, as reusable modules. Since the representation of the generated code is not exposed and its rearrangement is prohibited, the programmer is forced to develop optimizations in terms of appropriate high-level properties of the code. The high level and compositionality of the optimizations lend themselves to the writing of optimization libraries by domain experts.

Last updated December 5, 2013
