Realistic, modular, composable, algebraic optimizations with some correctness guarantees: DAXPY

At first glance, the title is contradiction in terms. Modularity and abstraction, however desirable, typically impose an unacceptable overhead and make it hard to arrange the instruction stream for the best performance. General-purpose optimizing compilers are of not much help when it comes to the highest performance, which relies on narrow-purpose transformations. Therefore, the highest-performance code is typically written by hand (for example, OpenBLAS, MKL). Code generation does help, and is increasingly used. However, writing code generators is quite more complex than writing code.

On example of Basic Linear Algebra (BLAS) we demonstrate modular and correct to some degree code generation and code generators. We start from the obviously correct, algebraic specification of an operation on vectors over a ring and apply a sequence transformations. The transformations are particular instantiations of rings and vectors, often written by reusing an earlier instance and extending some of its operations. Therefore, the transformations are modular and composable. The type system of the host language enforces abstraction. The generated code is therefore by very construction well-formed and well-typed, and hence compiles without errors or warnings.



``This make me think that the authors are under the impression that compilers can magically transform bad code into efficient high-quality code by clever enough optimizations. As a compiler writer, I consider this a complete myth that many students and non-compiler-developers hope is a truth. I agree that compilers should aim to improve the efficiency of the code given as input, but one cannot expect the compiler to recognize algorithms and swap them for ones with better asymptotic complexity or make other major code transformations that really should be the job of the programmer. [emphasis mine]''
Magnus Myreen. Quoted by permission.

The present article is an example of making such major transformations: conveniently, modularly, and ensuring at least that the generated code always compiles.

High-performance C code -- modularly, algebraically, and with some correctness guarantees
Talk given at the February 28, 2023 meeting of IFIP WG 2.1.


Typical HPC Code (BLAS)

Our running example is adding two vectors, one possibly scaled. It is a common Basic Linear Algebra (BLAS) operation, called DAXPY ( (double precision) Alpha X plus Y):
    α X + Y
A more precise, executable specification is the following C code:
    void daxpy(const int N, const double da, 
               const double x[], const int inc_x, double y[], const int inc_y) {
    for(int i=0; i<N; i++)
      y[i*inc_y] += da * x[i*inc_x];
It is very easy to understand and to write (and re-write in any other language). Alas, its performance is subpar.

Here is the very high-performance code, from OpenBLAS. This is just a fragment; the full code, which is quite large, includes several special versions, set-up and epilogue.

    BLASLONG n1 = n & -4;
    while(i < n1)
    {       FLOAT m1      = da * x[ix] ;
            FLOAT m2      = da * x[ix+inc_x] ;
            FLOAT m3      = da * x[ix+2*inc_x] ;
            FLOAT m4      = da * x[ix+3*inc_x] ;
            y[iy]         += m1 ;
            y[iy+inc_y]   += m2 ;
            y[iy+2*inc_y] += m3 ;
            y[iy+3*inc_y] += m4 ;
            ix  += inc_x*4 ;
            iy  += inc_y*4 ;
            i+=4 ;
The code shows the common HPC pattern: hiding memory latency. Since reading from main memory takes a lot of time, modern CPUs can do something else in the meanwhile (e.g., decoding and executing further instructions). We take advantage of this facility and arrange four reads from memory (followed by multiplication upon completion) to run effectively in parallel. Why four, rather than 5 or 8? The optimal factor is determined by the CPU architecture, bus and memory bandwidth and latency, cache organization, etc. The many parameters (most of which are not disclosed) make computing the optimal unrolling factor impossible. What one may do instead is to try various factors and see which leads to a faster code. Such empirical search, or tuning, is very common in HPC. The premise is the ability to produce candidates, with various unrolling factors. Does it sound like fun to write and repeatedly re-write such code by hand (while worrying about typos)?

Besides DAXPY, there is also SAXPY (single precision: 32-bit floats) and CAXPY (complex numbers). The optimal unrolling factors will be different in each case. Thus we need further and further re-writing of the above code. Some automation is desired.




Run-up Example

We start with the simplified version of DAXPY: vector addition, very similar to the one explained in Tagless-final embedding of (a subset of) C in OCaml. The extension to the full DAXPY follows later below.

In the tagless-final style, as explained in the referred article, the vector addition is specified as

    let addv_dsl = mk3arr @@ fun n vout v1 v2 ->
      for_ (int 0) (n - int 1) @@ fun i ->
        array_set vout i @@ array_get v1 i +. array_get v2 i
The article showed how to obtain from this specification code in various languages (OCaml and C). The goal of the present article is to generate C code, but of the OpenBLAS shape and performance -- clearly, modularly, and with guaranteed well-formedness and well-typedness.

Looking a bit ahead, the key is a more abstract and general (and shorter) specification:

    let addv_abs vout v1 v2 = zip_with add v1 v2 |> iter_assign vout
It is a clear, obviously correct specification, with just the right amount of detail. It is also an implementation given iter_assign, zip_with and add. Different choices of these combinators give different programs: for 32- or 64-bit floating-point numbers or complex numbers, different loop unrollings. A particular choice, resulting from composition of strip mining and scalar promotion optimizations, leads to even an OpenBLAS-like implementation. Let us illustrate.


Background: Vectors over Rings

Algebraic rings are straightforwardly represented as a signature
    module type RING = sig
      type t				(* abstract *)
      val zero : t
      val one  : t
      val add  : t -> t -> t
      val sub  : t -> t -> t
      val mul  : t -> t -> t
which can be turned into a type
    type α ring = (module RING with type t = α)
The combinator add mentioned earlier is hence a RING operation.

Vectors may be generically defined as so-called `pull vectors': a mapping from a range of indices [lwb..upb) of type ι to values (of type α):

    type (ι,α) vec = Vec of ι lmad *     (* index space description *)
                              (ι -> α)
where ι lmad (short for linear memory address description) describes the index range:
    type ι lmad = {lwb: ι option;         (* None means 0 *)
                    upb: ι;               (* that is, max index + 1 *)
                    step: int}
(Generally, LMADs are unions of such ranges; the simple one-range LMAD should be enough for us here.) The type t of vector elements is typically the type of ring values -- for a vector over a ring. It may also be a function type t -> w, which signifies an `update', or `consumption' of a value of type t. The type w represents an update (action), which has to be a monoid: updates may be composed. Thus a generic output vector is defined as
    type (ι,α,ω) ovec = (ι,α->ω) vec

The generic vectors lets us define generic operations on them, such as vec_map : (α -> β) -> (ι,α) vec -> (ι,β) vec and zip_with f v1 v2 -- applying a binary operation f to vectors v1 and v2 elementwise:

    let zip_with : (α -> β -> γ) -> (ι,α) vec -> (ι,β) vec -> (ι,γ) vec = 
      fun tf (Vec (n1,f1)) (Vec (n2,f2)) -> 
        assert( n1 == n2 );
        Vec (n1, fun i -> tf (f1 i) (f2 i))
The LMADs of the two vector operands must be physically equal. (If they were not, one should have reconciled them first: convert the two vectors to the common index space.)

On the other hand, the operation iter_assign to assign an input vector to an output vector is not generic. It is ι,ω-specific.

    type (ι,α,ω) iter_assign = (ι, α, ω) ovec -> (ι, α) vec -> ω
It is in turn expressed, one way or another, via iter, which embodies a particular way of executing vector updates: element assignments. When ω is the same as ω', one may think of iter as reduction over the monoid ω.
    type (ι,ω,ω') iter = (ι, ω) vec -> ω'
The particular instances of iter_assign or iter (for specific ι,ω) are described later below.

The addition of the generic vectors can then be written as

    let addv (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) =
      fun vout v1 v2 -> 
      zip_with R.add v1 v2 |> iter_assign vout
as we said in the previous section. This is the general, textbook code, which we write once and for all. After that, we instantiate it in various ways, as below.


Concrete rings and vectors

There are several realizations of the RING: For example, floating-point numbers, to a good approximation:
    module RingFloat = struct
      type t = float
      let zero = 0.
      let one  = 1.
      let add = Stdlib.( +. )
      let sub = Stdlib.( -. )
      let mul = Stdlib.( *. )
The ring element type is OCaml float, with add being the OCaml float addition, etc. Code of floats is also a RING:
    module RingFloatCode = struct
      open C  
      type t = float exp
      let zero = float 0.
      let one  = float 1.
      let add = ( +. )
      let sub = ( -. )
      let mul = ( *. )
Here module C is one of the implementations of the signature representing the target language (such as the signature cde in Tagless-final embedding of (a subset of) C in OCaml). Hereafter we assume the implementation referenced below. (We could just as well used the implementation in terms of MetaOCaml code, followed by offshoring.) The elements of RingFloatCode have the type float C.exp: expressions of the target language of float type. The ring addition generates the summation code.

It is easy to write RING implementation for single-precision (32-bit) floating-point numbers, and for variously represented complex numbers. One may also think of other implementations, for example for possibly statically known ring elements. It then lets us do online partial evaluation.

    module RingPossiblyKnown = struct
      type t = Known of float | Unknown of float exp
      let zero = Known 0.
      let one  = Known 1.
      let add = ...

When it comes to index-specific combinators, the first comes to mind is materialization of lmad: converting an index space description to a sequence of its indices. If an index is an int, this is simply

    let iota : int lmad -> int list = fun {lwb; upb; step} ->
      let rec loop i = if i >= upb then [] else i :: loop (i+step) in
      loop (Option.value lwb ~default:0)
For the int C.exp index, the sequence of indices cannot be a `static' list (i.e., list known at the generator time). Rather, we take an index consumer and generate code to feed it indices.
    let iota_dyn : int C.exp lmad -> (int C.exp -> unit C.stm) -> unit C.stm = 
      fun {lwb; upb; step} body ->
        let open C in
        for_ (Option.value lwb ~default:(int 0)) 
             ?step:(if Stdlib.(step=1) then None else Some (int step))
             (upb-int 1) 

As for iter, one can immediately come up with two implementations: one for the index being an int -- meaning the index range is statically known. The generator hence knows which and how many assignments to execute. The combinator seq builds a sequence of statements (in our case, assignment statements). This iter_sta hence corresponds to full loop unrolling.

    let iter_sta : (int,unit C.stm,unit C.stm) iter =
      fun (Vec (n,body)) -> iota n |> body |> seq

When the index is an int C.exp (a piece of target code), the index range is not statically known. Executing vector updates hence requires a run-time iteration.

    let iter_dyn : (int C.exp,unit C.stm,unit C.stm) iter =
      fun (Vec (n,body))  -> iota_dyn n body

As mentioned earlier, iter_assign is expressed in terms of iter. One common expression is

    let iter_assign : (ι,ω,ω) iter -> (ι,α,ω) iter_assign = fun iter ->
      fun vout v -> zip_with (@@) vout v |> iter

We can already generate code from the specification of vector addition, applying to addv the appropriate ring and the assigner:

Specification (repeated from above)
    let addv (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) =
      fun vout v1 v2 -> 
      zip_with R.add v1 v2 |> iter_assign vout
Generated Code, with RingFloatCode and iter_assign iter_dyn
    void tf_addv_c(int64_t n_1,double * vout_2,double * v1_3,double * v2_4){
      for (int i_5 = 0; i_5 < n_1; i_5 += 1)
        (vout_2[i_5]) = (v1_3[i_5]) + (v2_4[i_5]);
We obtain the explicit for-loop with no unrolling. We have managed to generate some code (however simple), which we can run and test. Next are optimizations.
References [8K]
Embedded target language (the implementation of the cde signature) [17K]
The complete code of the example


Optimizations: Strip mining

The first optimization is strip mining: striping a loop into a number of consecutive blocks of statically known size. This effectively turns the loop into two nested ones: the outer over blocks, and the inner over elements within a block. Algebraically, it turns a Nx1 column vector into a matrix of N/strip rows and strip columns. (N does not have to be evenly divisible by strip, resulting in a `remainder' of the input vector)

Optimizations often involve code movements and insertions (e.g., let-insertions), which require working with CPS, for which we define convenient abbreviations:

    type (α,ω) m = (α -> ω) -> ω
    type α munit = (α -> unit C.stm) -> unit C.stm

The main job of striping a vector is performed by

    val strip_vector (strip:int) : (int C.exp,α) vec ->
      ((int C.exp,α) vec option * 
       (int C.exp, (int,α) vec) vec *
       (int C.exp,α) vec) 
which takes a (int C.exp,α) vec and turns it into a matrix (vector of vectors). The inner vector has indices of type int: the index range of the inner vector is statically known. In fact, it is [0..strip). The lower bound (if non-trivial) and the upper bound of the input vector cannot be expected to be an even multiple of strip. Therefore, we have remainders: on the left (possibly, if the lower bound is specified) and on the right.

The strip mining itself is expressed as a particular iter: an implementation of (int C.exp,ω,unit C.stm) iter that uses a given inner (int,ω,unit C.stm) iter for iterating within a stripe. The update type ω is not in general unit C.stm, so we need a mapping function exec for turning updates into the statements of the target language.

    let iter_strip (strip:int) (exec: ω -> unit C.stm)
        (inner_iter : (int,ω,unit C.stm) iter) :
        (int C.exp,ω,unit C.stm) iter =
      fun v -> 
      strip_vector strip v @@ fun (v1,m,v3) ->
        seq C.[
          (* prologue *)
          (match v1 with None -> unit | Some v -> vec_map exec v |> iter_dyn);
          (* main loop *)
          vec_map inner_iter m |> iter_dyn;
          (* epilogue *)
          vec_map exec v3 |> iter_dyn]

Using the iter_assign (iter_strip 4 iter_sta) as iter_assign in the same generic addv specification produces

    void tf_addv_c(int64_t n_6,double * vout_7,double * v1_8,double * v2_9){
      const int64_t t_10 = n_6 & -4;
      for (int i_12 = 0; i_12 < t_10; i_12 += 4){
        (vout_7[i_12 + 0]) = (v1_8[i_12 + 0]) + (v2_9[i_12 + 0]);
        (vout_7[i_12 + 1]) = (v1_8[i_12 + 1]) + (v2_9[i_12 + 1]);
        (vout_7[i_12 + 2]) = (v1_8[i_12 + 2]) + (v2_9[i_12 + 2]);
        (vout_7[i_12 + 3]) = (v1_8[i_12 + 3]) + (v2_9[i_12 + 3]);
      for (int i_11 = t_10; i_11 < n_6; i_11 += 1)
        (vout_7[i_11]) = (v1_8[i_11]) + (v2_9[i_11]);
The effect of strip mining is clearly visible.


Scalar promotion

As the name says, scalar promotion promotes array accesses to scalars. Roughly speaking, if we have a vector of expressions e1, e2, e3 used in some context as in
    ...  (* generate code using [e1; e2; e3] *) ...
we want to bind its elements to variables, obtaining
    const double t1 = e1;
    const double t2 = e2;
    const double t3 = e3;
    ... (* generate code using [t1; t2; t3] *) ...
The DSL embedding provides the operation letl to generate binding statements:
    val letl : α C.exp -> α C.exp munit
Therefore, given a vector v : (ι,α) vec, the operation vec_map letl v produces (ι, α C.exp munit) vec: a vector of actions of generating bindings. Performing these actions in sequence, and obtaining the vector of their results is accomplished by
    val materialize : (int, (α,ω) m) vec -> ((int,α) vec,ω) m 
Although its implementation is a bit messy, the intuition should be clear. The whole optimization is done by a particular iter_scalar_promotion, which is an instance of iter_sta and can be used as an inner iterator for strip mining. The result is
    void tf_addv_c(int64_t n_18,double * vout_19,double * v1_20,double * v2_21){
      const int64_t t_22 = n_18 & -4;
      for (int i_24 = 0; i_24 < t_22; i_24 += 4){
        const double t_25 = (v1_20[i_24 + 0]) + (v2_21[i_24 + 0]);
        const double t_26 = (v1_20[i_24 + 1]) + (v2_21[i_24 + 1]);
        const double t_27 = (v1_20[i_24 + 2]) + (v2_21[i_24 + 2]);
        const double t_28 = (v1_20[i_24 + 3]) + (v2_21[i_24 + 3]);
        (vout_19[i_24 + 0]) = t_25;
        (vout_19[i_24 + 1]) = t_26;
        (vout_19[i_24 + 2]) = t_27;
        (vout_19[i_24 + 3]) = t_28;
      for (int i_23 = t_22; i_23 < n_18; i_23 += 1)
        (vout_19[i_23]) = (v1_20[i_23]) + (v2_21[i_23]);
which looks a lot like the OpenBLAS daxpy code.
References [17K]
The complete code of the example



We may now perform the similar optimizations for the real DAXPY, which is, roughly, Y += α X. Or, more precisely,
    let daxpy (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) =
      fun da vout v1 -> 
      vec_map (R.mul da) v1 |> iter_assign vout
which is a small variation of addv. Here, vout is understood to be a vector with a modifying assignment. (The precise assignment action is hidden behind ovec abstraction). General stride in array access can also incorporated within the vec abstraction. By merely using the above daxpy in place of addv produces
    void daxpy(int64_t n_29,double da_30,double * x_31,int64_t inc_x_32,
              double * y_33,int64_t inc_y_34){
      const int64_t t_35 = n_29 & -4;
      for (int i_37 = 0; i_37 < t_35; i_37 += 4){
        const double t_38 = da_30 * (x_31[(i_37 + 0) * inc_x_32]);
        const double t_39 = da_30 * (x_31[(i_37 + 1) * inc_x_32]);
        const double t_40 = da_30 * (x_31[(i_37 + 2) * inc_x_32]);
        const double t_41 = da_30 * (x_31[(i_37 + 3) * inc_x_32]);
        array1_incr(y_33,(i_37 + 0) * inc_y_34,t_38);
        array1_incr(y_33,(i_37 + 1) * inc_y_34,t_39);
        array1_incr(y_33,(i_37 + 2) * inc_y_34,t_40);
        array1_incr(y_33,(i_37 + 3) * inc_y_34,t_41);
      for (int i_36 = t_35; i_36 < n_29; i_36 += 1)
        array1_incr(y_33,i_36 * inc_y_34,da_30 * (x_31[i_36 * inc_x_32]));
assuming the macro:
    #define array1_incr(a,i,v) a[i] += v
This is the general OpenBLAS DAXPY code.
References [17K]
The complete code of the example