Optimizing AXPY with Twists and Turns

 
AXPY is one of the Basic Linear Algebra Subprograms (BLAS) vector operations: the vector addition αX+Y. It is a perfect target for classical optimizations such as partial loop unrolling and scalar promotion -- as well as for clever assembly hacking. (AXPY is also embarrassingly parallel; here, however, we focus on single-thread performance.) These optimizations are indeed carried out -- by hand -- in OpenBLAS, regarded as one of the fastest BLAS implementations. One can make a case for automatic code generation, to reduce the tedium of applying such optimizations -- given that there are many platforms and several AXPY varieties to optimize.

This is the traditional elevator talk about metaprogramming in high-performance computing. It is not wrong. There is truth to it -- but not in the place one would have expected.


 

Motivation

``This make me think that the authors are under the impression that compilers can magically transform bad code into efficient high-quality code by clever enough optimizations. As a compiler writer, I consider this a complete myth that many students and non-compiler-developers hope is a truth. I agree that compilers should aim to improve the efficiency of the code given as input, but one cannot expect the compiler to recognize algorithms and swap them for ones with better asymptotic complexity or make other major code transformations that really should be the job of the programmer. [emphasis mine]''
Magnus Myreen. Quoted by permission.

The present article is an example of making such major transformations: conveniently, modularly, and ensuring at least that the generated code always compiles.

References
High-performance C code -- modularly, algebraically, and with some correctness guarantees
Talk given at the February 28, 2023 meeting of IFIP WG 2.1

 

Realistic, modular, composable, algebraic optimizations with some correctness guarantees

At first glance, the section title is a contradiction in terms. Modularity and abstraction, however desirable, typically impose an unacceptable overhead and make it hard to arrange the instruction stream for the best performance. General-purpose optimizing compilers are often not of much help when it comes to the highest performance, which relies on narrow-purpose transformations. Therefore, the highest-performance code is typically written by hand (for example, OpenBLAS, MKL). Code generation does help, and is increasingly used. However, writing code generators is considerably more complex than writing code.

Using the Basic Linear Algebra Subprograms (BLAS) as the example, we demonstrate modular and, to some degree, correct code generation and code generators. We start from an obviously correct, algebraic specification of an operation on vectors and apply a sequence of transformations. The transformations are particular enumerations of vector index spaces, often written by combining earlier enumerations. An abstraction -- the Linear Memory Access Descriptor (LMAD) -- turns out to be helpful. The type system of the host language enforces the abstractions. The generated code is therefore by its very construction well-formed and well-typed, and hence compiles without errors or warnings -- which is particularly important as we explore various optimization opportunities and generate many candidate kernels.

This dry summary hides many surprises -- and a complete failure -- along the way.

 

AXPY

Our running example is adding two vectors, one possibly scaled -- a common high-performance computing (HPC) operation. It is a part of the Basic Linear Algebra Subprograms (BLAS), where it is called AXPY (Alpha X plus Y):
    α X + Y
BLAS itself is a collection of vector/matrix operations distilled from a large body of numeric code and targeted for aggressive optimization. The motivation: rather than spending time optimizing each particular piece of numeric code, it is better to focus the optimization effort on BLAS.

The fact that the AXPY name is short and in uppercase hints at its origin in the FORTRAN-II mainframe days.

The precise specification -- or, the reference implementation -- is the following short and straightforward C code:

    void AXPY(const int N, const FLOAT da, 
              const FLOAT x[], const int inc_x, FLOAT y[], const int inc_y){
      for(int i=0; i<N; i++)
        y[i*inc_y] += da * x[i*inc_x];
    }
It includes the stride, or increment, inc to access each vector. Therefore AXPY can be used to add matrix columns or matrix rows, or to add a vector to the matrix diagonal, etc. The reference implementation is taken from OpenBLAS, which is in turn taken from NetLib, which is (was) the reference for all things numeric.

The reason to distill BLAS is to pour effort into optimizing it. The most optimized BLAS implementations are Intel's MKL, and OpenBLAS (as well as vendor CUDA libraries). Both are optimized by hand. As the name suggests, OpenBLAS is open-source, so we can look at its code (see below). BLAS contains four variations of AXPY: SAXPY (32-bit float vectors), DAXPY (64-bit float), CAXPY (64-bit complex) and ZAXPY (128-bit complex). In the following, we concentrate on DAXPY.

The OpenBLAS DAXPY code for the x86_64 platform is 123 lines long (not counting the kernels) -- considerably longer than the 5-line naive (reference) C code above. It has two distinct parts: one for the case when both inc_x and inc_y are one; the other for the general case. For unit strides, OpenBLAS DAXPY iterates 8-element kernels, which are selected as follows:

    #if defined(NEHALEM)
    #include "daxpy_microk_nehalem-2.c"
    #elif defined(BULLDOZER)
    #include "daxpy_microk_bulldozer-2.c"
    #elif defined(STEAMROLLER) || defined(EXCAVATOR)
    #include "daxpy_microk_steamroller-2.c"
    #elif defined(PILEDRIVER)
    #include "daxpy_microk_piledriver-2.c"
    #elif defined(HASWELL) || defined(ZEN)
    #include "daxpy_microk_haswell-2.c"
    #elif defined (SKYLAKEX) || defined (COOPERLAKE) || defined (SAPPHIRERAPIDS)
    #include "daxpy_microk_skylakex-2.c"
    #elif defined(SANDYBRIDGE)
    #include "daxpy_microk_sandy-2.c"
    #endif
That is, for each particular x86_64 (micro) architecture OpenBLAS includes a hand-written kernel -- written either using intrinsics, or inline assembly.

The general case of OpenBLAS DAXPY is the following. (Only the main loop is shown.)

    BLASLONG n1 = n & -4;
    
    while(i < n1)
    {
    	FLOAT m1      = da * x[ix] ;
    	FLOAT m2      = da * x[ix+inc_x] ;
    	FLOAT m3      = da * x[ix+2*inc_x] ;
    	FLOAT m4      = da * x[ix+3*inc_x] ;
    
    	y[iy]         += m1 ;
    	y[iy+inc_y]   += m2 ;
    	y[iy+2*inc_y] += m3 ;
    	y[iy+3*inc_y] += m4 ;
    
    	ix  += inc_x*4 ;
    	iy  += inc_y*4 ;
    	i+=4 ;
    }

The code shows the common HPC pattern of partial loop unrolling and scalar promotion, aimed at hiding memory latency. Since reading from main memory takes a long time, modern CPUs can do something else in the meantime (e.g., decode and execute further instructions). We take advantage of this facility and arrange for four reads from memory (each followed by a multiplication upon completion) to run effectively in parallel. Why four, rather than five or eight? The optimal factor is determined by the CPU microarchitecture, bus and memory bandwidth and latency, cache organization, memory controllers, access predictors, etc. The many parameters (most of which are not disclosed) make computing the optimal unrolling factor impossible. What one may do instead is try various unrolling factors and see which leads to faster code. Such empirical search, or tuning, is very common in HPC. Its premise is the ability to produce candidates with various unrolling factors. Does it sound like fun to write and repeatedly rewrite such code by hand (while worrying about typos)?

Besides DAXPY, there are also SAXPY, CAXPY and ZAXPY -- each generally requiring its own tuning, and definitely its own kernels, for each supported microarchitecture. When Intel or AMD introduces a new microarchitecture, which happens every 2-3 years, someone has to write new kernels, for all the AXPYs.

References
OpenBLAS
<https://www.openblas.net/>

daxpy_openblas.c [4K]
DAXPY code from OpenBLAS (version 0.3.25.dev)

daxpy_microk_haswell-2.c [4K]
Hand-written OpenBLAS DAXPY kernel for the Haswell microarchitecture, using the vector fused-multiply-accumulate (fmadd) instruction

 

A mystery

Metaprogramming can help generate code with many different loop unrollings and other optimizations. This is the traditional story of metaprogramming in HPC.

How does it correspond to real life, in this day and age?

Here is the reality check, comparing the most naive 5-line DAXPY implementation (the reference implementation, shown above) with the most sophisticated: OpenBLAS. The benchmark (whose full code is available below) uses vectors X and Y of 200,000,000 elements each: that is, it allocates 3.2 GB of memory. The size of the vectors is fixed for each benchmark run; what varies are the strides inc_x and inc_y.

The platform is an Intel Core i5-1135G7 @ 2.40GHz (the x86_64 SKYLAKEX target, in OpenBLAS terms), which provides the SSE4, AVX2, AVX512VL and FMA3 instruction sets. The main memory is 8GB over 2 channels at 8 bytes/cycle/channel, for a total memory bandwidth of 38.4 * 10^9 bytes/sec. Each core has 128 KiB of 8-way set-associative L1 data cache and 5 MiB of 4-way set-associative L2 cache. There is also 8 MiB of 16-way set-associative L3 cache shared by all cores. The cache line size is 64 bytes (that is, a line fits eight 64-bit floats).

The compiler is GCC 11.2.0. OpenBLAS was compiled for single-thread execution (USE_THREAD=0) with the following flags (chosen by the OpenBLAS configuration script)

    -O2 -m64 -fPIC -DVERSION="0.3.25.dev" -msse3 -mssse3 -msse4.1 -mavx -mavx2 
     -march=skylake-avx512 -mavx2 
The reference and other code is compiled with the flags
    -O3 -m64 -march=native -fno-trapping-math -fomit-frame-pointer
The benchmark is to run only on the target platform; hence I can afford to specify -march=native: optimize for the target. On the other hand, OpenBLAS is meant to be portable to some extent, and cannot afford this flag (which is a bit unfair to it).

When running benchmarks one always has to be aware that the CPU frequency may change dynamically, from half the nominal frequency (due to power management) up to twice the nominal frequency (thanks to Intel's `turbo boost'). Therefore, the benchmark was pinned (to CPU 4) with turbo boost disabled. The benchmark output confirmed that the process remained on that single CPU, which kept running at 2390.5 ± 0.5 MHz. The reported execution time is the median of 5 consecutive runs. Thanks to the pinning and the disabling of turbo boost, consecutive runs have nearly identical timings.

ref_openblas.png [14K]
The benchmark results

The results are quite surprising.

References
The Mysteries of AXPY
Talk given at the March 26, 2024 meeting of IFIP WG 2.11 (Drexel University, Philadelphia, PA, USA).

axpy.tar.gz [7K]
The complete code for the benchmark. You can run it on your own computer and see the results. To run the OpenBLAS benchmark, install OpenBLAS and adjust openblas_Makefile.

<https://www.kernel.org/doc/Documentation/cpu-freq/boost.txt>
Explanation of the turbo boost and the ways to disable it on Linux.

 

The Results

It seems that a modern C compiler can do the optimizations that in OpenBLAS are done by hand, including the hand-written kernels. Should we be using BLAS at all, at least for AXPY? Has all the talk about metaprogramming for optimizing AXPY become moot?

Let's look at the following graph, which adds one more benchmark specb2 to the earlier reference and OpenBLAS code.

ref_openblas_specb2.png [15K]
One more benchmark result

Especially notable is the case of the unit increment, where specb2 slightly outperforms the hand-optimized OpenBLAS kernels. Recall that the platform's memory bandwidth is 38.4 * 10^9 bytes/sec. Each iteration of DAXPY reads 2*8 bytes and writes 8 bytes, and the X and Y arrays have 2*10^8 elements each. The bandwidth attained by specb2 is hence 24 * 2*10^8 bytes / 0.157 sec = 30.6 * 10^9 bytes/sec, or 80% of the memory bandwidth.

The results are not a fluke. Here is another graph:

ref_openblas_specb23_amd.png [19K]

The target this time is an AMD EPYC 7742 64-core (Zen 2) server. Its nominal CPU frequency is a bit lower (2.2 GHz), but everything else is much bigger: 4 MiB of L1 data cache, 64 MiB of L2 and 512 MiB of L3 cache, with 128-byte cache lines. The anomalies become much more pronounced. Our specb2 (and the slightly improved specb3) code manages to avoid them, however.

It turns out that one can improve both on the optimizing C compiler and on the hand-written kernels and optimizations of OpenBLAS. One look at the specb2 AXPY code linked below tells that it was not written by hand. The rest of this page shows how it was obtained.

For C generation, we use the tagless-final embedding of C in OCaml, as explained elsewhere. It gives assuredly well-formed and well-typed C code. (With a different back-end, we can also obtain FORTRAN or Wasm code.) But using C, even as an embedded DSL, is too low-level.

Looking a bit ahead, the key is an abstract, general and shorter specification of AXPY:

    fun a x y -> vec_map (mul a) x |> iter_assign y
It is a clear, obviously correct specification, with just the right amount of detail. It is also an implementation, given particular realizations of iter_assign, vec_map and mul. Different choices of these combinators give different programs: for 32- or 64-bit floating-point or complex numbers, different loop unrollings. The code specb2 was obtained via a particular choice of iter_assign -- a particular sequence of executing array reads and writes. Let us illustrate.
References
axpy_specb2.c [9K]
The specb2 code

Tagless-final embedding of (a subset of) C in OCaml
The code-generation framework

 

Background: Abstract Vectors over Rings

Algebraic rings are straightforwardly represented as a signature
    module type RING = sig
      type t				(* abstract *)
      val zero : t
      val one  : t
      val add  : t -> t -> t
      val sub  : t -> t -> t
      val mul  : t -> t -> t
    end
which can be turned into the type
    type α ring = (module RING with type t = α)
The combinator mul mentioned earlier is hence a RING operation.
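As a small illustration (not part of axpy.ml), here is a generic operation over a ring packed as a first-class module -- the same pattern the daxpy generator below uses to abstract over the ring:

    (* Illustration only: squaring an element of an arbitrary ring,
       whatever the ring happens to be *)
    let square (type a) ((module R) : a ring) (x : a) : a = R.mul x x
    (* e.g., square (module RingFloat) 3. with the float ring defined below *)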

Vectors may be generically defined as so-called `pull vectors': a mapping from a range of indices of type ι to values (of type α):

    type (ι,α) vec = Vec of ι lmads *     (* index space description *)
                              (ι -> α)
where ι lmads, the list of ι lmad (short for linear memory access descriptor -- the description of the index space), is presented in the next section. The type α of vector elements is typically the type of ring values -- for a vector over a ring. It may also be a function type τ -> ω, which signifies an `update', or `consumption', of a value of type τ. The type ω represents an update (action) and has to be a monoid: updates may be composed. Thus a generic output vector is defined as
    type (ι,α,ω) ovec = (ι,α->ω) vec
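As a small illustration (not taken from axpy.ml), an ordinary OCaml float array can be viewed as a pull vector, and as an output vector whose element update adds the supplied value to the array element in place (so ω is unit), as AXPY's y[i] += ... requires:
    (* Illustration only: plain OCaml arrays as vectors. The index space is
       the contiguous range 0..n-1, described by a single primitive LMAD
       (the lmad record is defined in the LMAD section below). *)
    let vec_of_array (a : float array) : (int, float) vec =
      Vec ([{offset = 0; extent = [{upb = Array.length a; step = 1}]}],
           fun i -> a.(i))

    let ovec_of_array (a : float array) : (int, float, unit) ovec =
      Vec ([{offset = 0; extent = [{upb = Array.length a; step = 1}]}],
           fun i v -> a.(i) <- a.(i) +. v)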

The generic vectors let us define generic operations on them, such as vec_map : (α -> β) -> (ι,α) vec -> (ι,β) vec and zip_with f v1 v2 -- applying a binary operation f to vectors v1 and v2 elementwise:

    let zip_with : (α -> β -> γ) -> (ι,α) vec -> (ι,β) vec -> (ι,γ) vec = 
      fun tf (Vec (n1,f1)) (Vec (n2,f2)) -> 
        assert( n1 == n2 );
        Vec (n1, fun i -> tf (f1 i) (f2 i))
The LMADs of the two vector operands must be physically equal. (If they were not, one should have reconciled them first: convert the two vectors to the common index space.)

On the other hand, the operation iter_assign to assign an input vector to an output vector is not generic. It is ι,ω-specific. One may think of it as the reduction over the monoid ω.

    type (ι,α,ω) iter_assign = (ι, α, ω) ovec -> (ι, α) vec -> ω
It is in turn expressed, one way or another, via iter, which embodies a particular way of enumerating the index space (and hence of executing vector updates -- element assignments).
    type (ι,ω) iter = ι lmad -> (ι -> ω) -> ω
The particular instances of iter_assign or iter (for specific ι, ω) are described later below.

One such common expression for iter_assign in terms of iter is

    let iter_assign : (ι,ω) iter -> (ω list -> ω) -> (ι,α,ω) iter_assign = 
      fun iter seq ->
      fun vout v -> zip_with (@@) vout v |> function Vec (lmads,body) ->
        List.map (fun l -> iter l body) lmads |> seq
where the second argument is the reduction over the ω monoid.

AXPY on the generic vectors can then be written as

    let daxpy (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) =
      fun a x y -> vec_map (R.mul a) x |> iter_assign y
as we said in the previous section. This is the general, textbook code, which we write once and for all. After that, we instantiate it in various ways, as below.
References
axpy.ml [43K]
The code accompanying the article. Abstract vectors are defined at the beginning.

 

LMAD

The Linear Memory Access Descriptor (LMAD; also called the Linear Address Space Descriptor) was introduced in the research group of David Padua at UIUC, in the context of FORTRAN. The LMAD used here has a few minor differences, to suit C better.

LMAD generally is a set of primitive LMADs, each of which is written in the following notation
     A_{s₁,s₂,…}^{u₁,u₂,…} + b
Here b is the base offset, s_d is the stride, or step, along dimension d, and u_d is the exclusive upper bound along dimension d.

Informally, a primitive LMAD as above represents a for-loop nest:

    for(i=0; i < u₁; i += s₁)
      for(j=0; j < u₂; j += s₂)
        ...
        point in index space b+i+j+...
A set of primitive LMADs then represents a sequence of nested for-loops.
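For example, the primitive LMAD A_{1,4}^{3,12} + 100 describes strips of three consecutive indices starting every four indices from the offset 100: the index set {100,101,102, 104,105,106, 108,109,110}.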

In OCaml, LMAD is realized as the following data type:

    type ι lmad = {offset: ι;
                    extent: ι ext list}
     and ι ext = {
                    upb: ι;                (* that is, greater than max index *)
                    step: ι                (* also stride, or scale *)
                  }
     and ι lmads = ι lmad list
The meaning of LMAD -- as an enumeration of the index space, or the list of indices  -- is then given by the following OCaml code.
    let iota : int lmad -> int list = function {offset;extent} ->
       let rec extents = function
       | [] -> []
       | [{upb;step}] -> 
          let rec loop i = if i >= upb then [] else i :: loop (i+step) in
          loop 0
       | ext::exts ->              (* Cartesian product *)
          let ext_space = extents [ext] in
          extents exts |> List.concat_map (fun i -> List.map ((+) i) ext_space)
       in
       extents extent |> List.map ((+) offset)
This denotation lets us distill LMAD laws, which are used later in the generation of the fast AXPY code. (I have not yet checked whether these laws are mentioned in the LMAD literature.) More constructively, the meaning function iota can be used to spot-check the laws: compute the denotations of the left- and right-hand sides of a concrete instance of a law, and see if they match. The accompanying OCaml code does exactly that to verify the laws; a small sketch is shown after the laws below.

 

The order of primitive LMADs is irrelevant (up to the traversal order)
That is, a sequence of primitive LMADs represents the same index space regardless of the order of the primitive LMADs. Different orders represent different traversals, but of the same index space.
Horizontal split
A_s^u + b   =   (A_s^{ks} + b) ; (A_s^{u-ks} + b + ks)
where 0 < ks <= u and k is a natural number. (The case ks = u is trivial.)
Loop interchange
A_{…s₁…s₂…}^{…u₁…u₂…} + b   ≅   A_{…s₂…s₁…}^{…u₂…u₁…} + b
where ≅ means equivalence up to the traversal order. The two LMADs represent different traversals, but of the same index space. The law follows from the commutativity of addition (or, in other words, from the absence of dependencies among the loops).
Vertical split
A_s^u + b   =   A_{s,k}^{k,u} + b
where u is divisible by k and k is divisible by s. This law represents the optimizations known as strip-mining: stripping a loop into a number of consecutive blocks (of statically known size k/s). This effectively turns the loop into two nested ones: the outer over blocks, and the inner over elements within a block. Algebraically, it turns a (u/s) column vector into a matrix of u/k rows and k/s columns.
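For instance, here is how one such spot-check might look (a sketch; the checks in the accompanying code may be organized differently). It verifies a concrete instance of the vertical split law, A_1^16 + 3 = A_{1,4}^{4,16} + 3, by comparing the denotations of both sides:
    (* A sketch of spot-checking the vertical-split law with iota.
       Left-hand side: A_1^16 + 3. Right-hand side: the same index space
       strip-mined with strips of 4, i.e. A_{1,4}^{4,16} + 3.
       Both must enumerate 3,4,...,18, in the same order. *)
    let lhs = {offset = 3; extent = [{upb = 16; step = 1}]}
    let rhs = {offset = 3; extent = [{upb = 4;  step = 1}; {upb = 16; step = 4}]}
    let () = assert (iota lhs = iota rhs)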
References
Y. Paek. Automatic Parallelization for Distributed Memory Machines Based on Access Region Analysis. PhD thesis, University of Illinois at Urbana-Champaign, April 1997.

J. Hoeflinger. Interprocedural Parallelization Using Memory Classification Analysis. PhD thesis, University of Illinois at Urbana-Champaign, July 1998.

Jiajing Zhu, Jay P. Hoeflinger, David A. Padua. Compiling for a Hybrid Programming Model Using the LMAD Representation. Languages and Compilers for Parallel Computing, 2001 (LNCS 2624)
The paper is not about LMAD per se: it is about compiling for a cluster. LMAD is merely a tool -- very cleverly used. It is introduced there rather concisely.

axpy.ml [43K]
The code accompanying the article. Search for LMAD.

 

Code generation

To generate C we embed it in OCaml in tagless-final style, as described elsewhere. The embedded language is represented by the signature C_cde, abbreviated to just C. Integer expressions of the C language are represented in OCaml as values of the type int C.exp, and similarly for bool and other expressions. Statements of C are represented by values of the type τ C.stm, where τ is often unit (which corresponds to void in C); unit C.stm is a monoid, with the reduction seq : unit C.stm list -> unit C.stm -- meaning that C statements can be combined into one composite statement.
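For orientation, the fragment of the embedded-C signature used in the rest of this article looks roughly as follows. This is a sketch inferred from the usage below; the actual C_cde signature in c_cde.ml is larger (it also provides the for-loop constructor for_, array access and assignment, etc.) and may differ in details.
    (* A rough sketch of the part of the embedded-C signature used below *)
    module type C_fragment = sig
      type 'a exp                                (* C expressions *)
      type 'a stm                                (* C statements  *)
      val int    : int -> int exp                (* integer literal *)
      val float  : float -> float exp            (* double literal  *)
      val ( + )  : int exp -> int exp -> int exp
      val ( +. ) : float exp -> float exp -> float exp
      val ( *. ) : float exp -> float exp -> float exp
      val seq    : unit stm list -> unit stm     (* statement sequencing *)
    end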

There are several realizations of the RING signature. For example, floating-point numbers (a ring only to a good approximation):

    module RingFloat = struct
      type t = float
      let zero = 0.
      let one  = 1.
      let add = Stdlib.( +. )
      let sub = Stdlib.( -. )
      let mul = Stdlib.( *. )
    end
The ring element type is OCaml float, with add being the OCaml float addition, etc. Code of floats is also a RING:
    module RingFloatCode = struct
      open C  
      type t = float exp
      let zero = float 0.
      let one  = float 1.
      let add = ( +. )
      let sub = ( -. )
      let mul = ( *. )
    end
The elements of RingFloatCode have the type float C.exp: expressions of the target language of float type (which corresponds to double in C.) The ring product generates the multiplication code.

It is easy to write a RING implementation for single-precision (32-bit) floating-point numbers, and for variously represented complex numbers.
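For instance, a complex-number RING could be layered over any RING of reals -- the same functor then serves both plain evaluation (over RingFloat) and code generation (over RingFloatCode). The following is only a sketch; axpy.ml may represent complex numbers differently:
    (* A sketch: complex numbers as pairs (real, imaginary) of elements
       of an underlying ring *)
    module RingComplexPair (R : RING) = struct
      type t = R.t * R.t
      let zero = (R.zero, R.zero)
      let one  = (R.one,  R.zero)
      let add (re1,im1) (re2,im2) = (R.add re1 re2, R.add im1 im2)
      let sub (re1,im1) (re2,im2) = (R.sub re1 re2, R.sub im1 im2)
      let mul (re1,im1) (re2,im2) =
        (R.sub (R.mul re1 re2) (R.mul im1 im2),
         R.add (R.mul re1 im2) (R.mul im1 re2))
    end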

To instantiate the daxpy code generator we also need iter, the enumerator of the index space. The earlier iota immediately gives one iterator:

    let iter_sta : (int, unit C.stm) iter = fun lmad body ->
      iota lmad |> List.map body |> seq
which corresponds to full loop unrolling. If the loop bounds are not statically known (i.e., are of the type int C.exp rather than int), we generate for-loops. (Here >> is the left-to-right functional composition.)
    let iter_dyn : (int C.exp,unit C.stm) iter = fun {offset;extent} body ->
      let rec extents body = function
        | [] -> seq []
        | [{upb;step}] -> C.(for_ (int 0) ?step:(Some step) (upb-int 1) body)
        | ext::exts ->
            extents (fun i -> extents (C.(+) i >> body) exts) [ext]
      in extents (C.(+) offset >> body) extent
The full code also defines the iterator for the hybrid case, when the loop bounds may be statically known. If they are and the number of iterations is small, the loop is unrolled.

Given the Ring and iter, we can instantiate the daxpy generator (repeated from above)

    let daxpy (type a) ((module R):a ring) (iter_assign : (ι,a,ω) iter_assign) =
      fun a x y -> vec_map (R.mul a) x |> iter_assign y
and produce the AXPY code. For example, daxpy (module RingFloatCode) (iter_assign iter_dyn seq) results in
    void daxpy(int const n_8,double const a_9,double * const x_10,
              int const inc_x_11,double * const y_12,int const inc_y_13){
      for (int i_14 = 0; i_14 < n_8; i_14 += 1)
        array1_incr(y_12,i_14 * inc_y_13,a_9 * (x_10[i_14 * inc_x_11]));
    }
where
    #define array1_incr(a,i,v) a[i] += v
This is the reference AXPY implementation.

Applying the algebraic laws to the LMAD describing the input arrays results in more complicated (but correct by construction!) C code. The specb2 code is the result of applying the horizontal split, two vertical splits and the interchange laws, with a particular choice of parameters (strip sizes).
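To give an idea of what such a law application may look like as code, here is a sketch of strip-mining a one-dimensional LMAD per the vertical split law. It is a simplified, statically-indexed version; the transformations actually used for specb2 operate on LMADs with dynamic indices and are organized differently.
    (* A hypothetical sketch: the vertical-split law as a transformation on
       a statically known one-dimensional LMAD. Strips have k/step elements. *)
    let vertical_split (k : int) ({offset; extent} : int lmad) : int lmad =
      match extent with
      | [{upb; step}] when upb mod k = 0 && k mod step = 0 ->
          {offset; extent = [{upb = k; step}; {upb; step = k}]}
      | _ -> invalid_arg "vertical_split: expected a single, evenly divisible extent"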

References
axpy.ml [43K]
The AXPY code generator

c_cde.ml [8K]
Embedded target language (the implementation of the cde signature)

Tagless-final embedding of (a subset of) C in OCaml

 

Conclusions. Can we do better?

The loop unrolling (as seen in OpenBLAS) is not what it used to be: it does not necessarily lead to better performance on modern architectures. Modern optimizing compilers can produce code, at least in simple cases like AXPY, with the performance of hand-optimized platform-specific kernels. There does not seem to be any reason to use BLAS at all (at least for vector operations).

And yet, one may still obtain AXPY that performs better than what an optimizing C compiler or a BLAS library may deliver -- using appropriate abstraction and code generation. The code specb2 described here is just one example. I cannot claim it is the best: perhaps one can do even better. The code generation has many tunable parameters, whose space is yet to be explored. There is much work ahead.