On example of Basic Linear Algebra (BLAS) we demonstrate modular and correct to some degree code generation and code generators. We start from the obviously correct, algebraic specification of an operation on vectors over a ring and apply a sequence transformations. The transformations are particular instantiations of rings and vectors, often written by reusing an earlier instance and extending some of its operations. Therefore, the transformations are modular and composable. The type system of the host language enforces abstraction. The generated code is therefore by very construction well-formed and well-typed, and hence compiles without errors or warnings.
Magnus Myreen. Quoted by permission.``This make me think that the authors are under the impression that compilers can magically transform bad code into efficient high-quality code by clever enough optimizations. As a compiler writer, I consider this a complete myth that many students and non-compiler-developers hope is a truth. I agree that compilers should aim to improve the efficiency of the code given as input, but one cannot expect the compiler to recognize algorithms and swap them for ones with better asymptotic complexity or make other major code transformations that really should be the job of the programmer. [emphasis mine]''
The present article is an example of making such major transformations: conveniently, modularly, and ensuring at least that the generated code always compiles.
α X + YA more precise, executable specification is the following C code:
void daxpy(const int N, const double da, const double x[], const int inc_x, double y[], const int inc_y) { for(int i=0; i<N; i++) y[i*inc_y] += da * x[i*inc_x]; }It is very easy to understand and to write (and re-write in any other language). Alas, its performance is subpar.
Here is the very high-performance code, from OpenBLAS. This is just a fragment; the full code, which is quite large, includes several special versions, set-up and epilogue.
BLASLONG n1 = n & -4; while(i < n1) { FLOAT m1 = da * x[ix] ; FLOAT m2 = da * x[ix+inc_x] ; FLOAT m3 = da * x[ix+2*inc_x] ; FLOAT m4 = da * x[ix+3*inc_x] ; y[iy] += m1 ; y[iy+inc_y] += m2 ; y[iy+2*inc_y] += m3 ; y[iy+3*inc_y] += m4 ; ix += inc_x*4 ; iy += inc_y*4 ; i+=4 ; }The code shows the common HPC pattern: hiding memory latency. Since reading from main memory takes a lot of time, modern CPUs can do something else in the meanwhile (e.g., decoding and executing further instructions). We take advantage of this facility and arrange four reads from memory (followed by multiplication upon completion) to run effectively in parallel. Why four, rather than 5 or 8? The optimal factor is determined by the CPU architecture, bus and memory bandwidth and latency, cache organization, etc. The many parameters (most of which are not disclosed) make computing the optimal unrolling factor impossible. What one may do instead is to try various factors and see which leads to a faster code. Such empirical search, or tuning, is very common in HPC. The premise is the ability to produce candidates, with various unrolling factors. Does it sound like fun to write and repeatedly re-write such code by hand (while worrying about typos)?
Besides DAXPY, there is also SAXPY (single precision: 32-bit floats) and CAXPY (complex numbers). The optimal unrolling factors will be different in each case. Thus we need further and further re-writing of the above code. Some automation is desired.
DAXPY in OpenBLAS
<https://raw.githubusercontent.com/xianyi/OpenBLAS/develop/kernel/x86_64/daxpy.c>
In the tagless-final style, as explained in the referred article, the vector addition is specified as
let addv_dsl = mk3arr @@ fun n vout v1 v2 -> for_ (int 0) (n - int 1) @@ fun i -> array_set vout i @@ array_get v1 i +. array_get v2 iThe article showed how to obtain from this specification code in various languages (OCaml and C). The goal of the present article is to generate C code, but of the OpenBLAS shape and performance -- clearly, modularly, and with guaranteed well-formedness and well-typedness.
Looking a bit ahead, the key is a more abstract and general (and shorter) specification:
let addv_abs vout v1 v2 = zip_with add v1 v2 |> iter_assign voutIt is a clear, obviously correct specification, with just the right amount of detail. It is also an implementation given
iter_assign
, zip_with
and add
. Different choices of these
combinators give different programs: for 32- or 64-bit floating-point
numbers or complex numbers, different loop unrollings. A particular
choice, resulting from composition of strip mining and scalar
promotion optimizations, leads to even an OpenBLAS-like
implementation. Let us illustrate.
module type RING = sig type t (* abstract *) val zero : t val one : t val add : t -> t -> t val sub : t -> t -> t val mul : t -> t -> t endwhich can be turned into a type
type α ring = (module RING with type t = α)The combinator
add
mentioned earlier is hence a RING
operation.
Vectors may be generically defined as so-called `pull vectors': a mapping from
a range of indices [lwb..upb)
of type ι to values (of type α):
type (ι,α) vec = Vec of ι lmad * (* index space description *) (ι -> α)where
ι lmad
(short for linear memory address description)
describes the index range:
type ι lmad = {lwb: ι option; (* None means 0 *) upb: ι; (* that is, max index + 1 *) step: int}(Generally, LMADs are unions of such ranges; the simple one-range LMAD should be enough for us here.) The type
t
of vector elements is
typically the type of ring values -- for a vector over a ring.
It may also be a function type t -> w
, which signifies an
`update', or `consumption' of a value of type t
.
The type w
represents an update (action), which has to be a monoid:
updates may be composed. Thus a generic output vector is defined as
type (ι,α,ω) ovec = (ι,α->ω) vec
The generic vectors lets us define generic operations on them, such as
vec_map : (α -> β) -> (ι,α) vec -> (ι,β) vec
and
zip_with f v1 v2
-- applying a binary operation f
to
vectors v1
and v2
elementwise:
let zip_with : (α -> β -> γ) -> (ι,α) vec -> (ι,β) vec -> (ι,γ) vec = fun tf (Vec (n1,f1)) (Vec (n2,f2)) -> assert( n1 == n2 ); Vec (n1, fun i -> tf (f1 i) (f2 i))The LMADs of the two vector operands must be physically equal. (If they were not, one should have reconciled them first: convert the two vectors to the common index space.)
On the other hand, the operation iter_assign
to assign an input vector to an output vector
is not generic. It is ι,ω-specific.
type (ι,α,ω) iter_assign = (ι, α, ω) ovec -> (ι, α) vec -> ωIt is in turn expressed, one way or another, via
iter
, which embodies a particular way of executing
vector updates: element assignments.
When ω is the same as ω', one may think of iter
as reduction
over the monoid ω.
type (ι,ω,ω') iter = (ι, ω) vec -> ω'The particular instances of
iter_assign
or iter
(for specific
ι,ω) are described later below.
The addition of the generic vectors can then be written as
let addv (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) = fun vout v1 v2 -> zip_with R.add v1 v2 |> iter_assign voutas we said in the previous section. This is the general, textbook code, which we write once and for all. After that, we instantiate it in various ways, as below.
RING
: For example,
floating-point numbers, to a good approximation:
module RingFloat = struct type t = float let zero = 0. let one = 1. let add = Stdlib.( +. ) let sub = Stdlib.( -. ) let mul = Stdlib.( *. ) endThe ring element type is OCaml
float
, with
add
being the OCaml float addition, etc.
Code of floats is also a RING
:
module RingFloatCode = struct open C type t = float exp let zero = float 0. let one = float 1. let add = ( +. ) let sub = ( -. ) let mul = ( *. ) endHere module
C
is one of the implementations of the signature
representing the target language (such as the signature cde
in
Tagless-final embedding of (a subset of) C in
OCaml). Hereafter we assume the implementation c_cde.ml
referenced below. (We could just as well used the implementation
in terms of MetaOCaml code, followed by offshoring.) The
elements of RingFloatCode
have the type float C.exp
: expressions
of the target language of float
type. The ring addition
generates the summation code.
It is easy to write RING
implementation for single-precision
(32-bit) floating-point numbers, and for variously represented complex
numbers. One may also think of other implementations, for example
for possibly statically
known ring elements. It then lets us do online partial evaluation.
module RingPossiblyKnown = struct type t = Known of float | Unknown of float exp let zero = Known 0. let one = Known 1. let add = ... ... end
When it comes to index-specific combinators, the first comes to mind
is materialization of lmad
: converting an index space description
to a sequence of its indices. If an index is an int
, this is simply
let iota : int lmad -> int list = fun {lwb; upb; step} -> let rec loop i = if i >= upb then [] else i :: loop (i+step) in loop (Option.value lwb ~default:0)For the
int C.exp
index, the sequence of indices cannot be a `static'
list (i.e., list known at the generator time). Rather, we take an
index consumer and generate code to feed it indices.
let iota_dyn : int C.exp lmad -> (int C.exp -> unit C.stm) -> unit C.stm = fun {lwb; upb; step} body -> let open C in for_ (Option.value lwb ~default:(int 0)) ?step:(if Stdlib.(step=1) then None else Some (int step)) (upb-int 1) body
As for iter
, one can immediately come up with two implementations: one
for the index being an int
-- meaning the index range is statically
known. The generator hence knows which and how many assignments to
execute. The combinator seq
builds a sequence of statements (in our
case, assignment statements). This iter_sta
hence
corresponds to full loop unrolling.
let iter_sta : (int,unit C.stm,unit C.stm) iter = fun (Vec (n,body)) -> iota n |> List.map body |> seq
When the index is an int C.exp
(a piece of target code), the index
range is not statically known. Executing vector updates hence
requires a run-time iteration.
let iter_dyn : (int C.exp,unit C.stm,unit C.stm) iter = fun (Vec (n,body)) -> iota_dyn n body
As mentioned earlier, iter_assign
is expressed in terms of iter
.
One common expression is
let iter_assign : (ι,ω,ω) iter -> (ι,α,ω) iter_assign = fun iter -> fun vout v -> zip_with (@@) vout v |> iter
We can already generate code from the specification of vector
addition, applying to addv
the appropriate ring and the assigner:
let addv (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) = fun vout v1 v2 -> zip_with R.add v1 v2 |> iter_assign vout
RingFloatCode
and iter_assign iter_dyn
void tf_addv_c(int64_t n_1,double * vout_2,double * v1_3,double * v2_4){ for (int i_5 = 0; i_5 < n_1; i_5 += 1) (vout_2[i_5]) = (v1_3[i_5]) + (v2_4[i_5]); }
cde
signature)
tf_addv.ml [17K]
The complete code of the example
Optimizations often involve code movements and insertions (e.g., let-insertions), which require working with CPS, for which we define convenient abbreviations:
type (α,ω) m = (α -> ω) -> ω type α munit = (α -> unit C.stm) -> unit C.stm
The main job of striping a vector is performed by
val strip_vector (strip:int) : (int C.exp,α) vec -> ((int C.exp,α) vec option * (int C.exp, (int,α) vec) vec * (int C.exp,α) vec) munitwhich takes a
(int C.exp,α) vec
and turns it into a matrix
(vector of vectors). The inner vector has indices of type int
:
the index range of the inner vector is
statically known. In fact, it is [0..strip)
. The lower bound (if
non-trivial) and the upper bound of the input vector cannot be
expected to be an even multiple of strip
. Therefore, we have
remainders: on the left (possibly, if the lower bound is specified)
and on the right.
The strip mining itself is expressed as a particular iter
: an
implementation of (int C.exp,ω,unit C.stm) iter
that uses
a given inner (int,ω,unit C.stm) iter
for iterating within a
stripe. The update type ω
is not in general unit C.stm
, so
we need a mapping function exec
for turning updates into
the statements of the target language.
let iter_strip (strip:int) (exec: ω -> unit C.stm) (inner_iter : (int,ω,unit C.stm) iter) : (int C.exp,ω,unit C.stm) iter = fun v -> strip_vector strip v @@ fun (v1,m,v3) -> seq C.[ (* prologue *) (match v1 with None -> unit | Some v -> vec_map exec v |> iter_dyn); (* main loop *) vec_map inner_iter m |> iter_dyn; (* epilogue *) vec_map exec v3 |> iter_dyn]
Using the iter_assign (iter_strip 4 Fun.id iter_sta)
as
iter_assign
in the same generic addv
specification produces
void tf_addv_c(int64_t n_6,double * vout_7,double * v1_8,double * v2_9){ const int64_t t_10 = n_6 & -4; for (int i_12 = 0; i_12 < t_10; i_12 += 4){ (vout_7[i_12 + 0]) = (v1_8[i_12 + 0]) + (v2_9[i_12 + 0]); (vout_7[i_12 + 1]) = (v1_8[i_12 + 1]) + (v2_9[i_12 + 1]); (vout_7[i_12 + 2]) = (v1_8[i_12 + 2]) + (v2_9[i_12 + 2]); (vout_7[i_12 + 3]) = (v1_8[i_12 + 3]) + (v2_9[i_12 + 3]); } for (int i_11 = t_10; i_11 < n_6; i_11 += 1) (vout_7[i_11]) = (v1_8[i_11]) + (v2_9[i_11]); }The effect of strip mining is clearly visible.
e1
, e2
, e3
used in some context as in
... (* generate code using [e1; e2; e3] *) ...we want to bind its elements to variables, obtaining
const double t1 = e1; const double t2 = e2; const double t3 = e3; ... (* generate code using [t1; t2; t3] *) ...The DSL embedding provides the operation
letl
to generate
binding statements:
val letl : α C.exp -> α C.exp munitTherefore, given a vector
v : (ι,α) vec
,
the operation vec_map letl v
produces (ι, α C.exp munit) vec
:
a vector of actions of generating bindings. Performing
these actions in sequence, and obtaining the vector of their results
is accomplished by
val materialize : (int, (α,ω) m) vec -> ((int,α) vec,ω) mAlthough its implementation is a bit messy, the intuition should be clear. The whole optimization is done by a particular
iter_scalar_promotion
, which is an instance of iter_sta
and can
be used as an inner iterator for strip mining. The result is
void tf_addv_c(int64_t n_18,double * vout_19,double * v1_20,double * v2_21){ const int64_t t_22 = n_18 & -4; for (int i_24 = 0; i_24 < t_22; i_24 += 4){ const double t_25 = (v1_20[i_24 + 0]) + (v2_21[i_24 + 0]); const double t_26 = (v1_20[i_24 + 1]) + (v2_21[i_24 + 1]); const double t_27 = (v1_20[i_24 + 2]) + (v2_21[i_24 + 2]); const double t_28 = (v1_20[i_24 + 3]) + (v2_21[i_24 + 3]); (vout_19[i_24 + 0]) = t_25; (vout_19[i_24 + 1]) = t_26; (vout_19[i_24 + 2]) = t_27; (vout_19[i_24 + 3]) = t_28; } for (int i_23 = t_22; i_23 < n_18; i_23 += 1) (vout_19[i_23]) = (v1_20[i_23]) + (v2_21[i_23]); }which looks a lot like the OpenBLAS daxpy code.
Y += α X
. Or, more precisely,
let daxpy (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) = fun da vout v1 -> vec_map (R.mul da) v1 |> iter_assign voutwhich is a small variation of
addv
. Here, vout
is
understood to be a vector with a modifying
assignment. (The precise assignment action is hidden behind ovec
abstraction). General stride in array access can also
incorporated within the vec
abstraction. By merely using the above daxpy
in place of addv
produces
void daxpy(int64_t n_29,double da_30,double * x_31,int64_t inc_x_32, double * y_33,int64_t inc_y_34){ const int64_t t_35 = n_29 & -4; for (int i_37 = 0; i_37 < t_35; i_37 += 4){ const double t_38 = da_30 * (x_31[(i_37 + 0) * inc_x_32]); const double t_39 = da_30 * (x_31[(i_37 + 1) * inc_x_32]); const double t_40 = da_30 * (x_31[(i_37 + 2) * inc_x_32]); const double t_41 = da_30 * (x_31[(i_37 + 3) * inc_x_32]); array1_incr(y_33,(i_37 + 0) * inc_y_34,t_38); array1_incr(y_33,(i_37 + 1) * inc_y_34,t_39); array1_incr(y_33,(i_37 + 2) * inc_y_34,t_40); array1_incr(y_33,(i_37 + 3) * inc_y_34,t_41); } for (int i_36 = t_35; i_36 < n_29; i_36 += 1) array1_incr(y_33,i_36 * inc_y_34,da_30 * (x_31[i_36 * inc_x_32])); }assuming the macro:
#define array1_incr(a,i,v) a[i] += vThis is the general OpenBLAS DAXPY code.