Reconciling Abstraction with High Performance: A MetaOCaml approach

 
A common application of generative programming is building high-performance computational kernels highly tuned to the problem at hand. A typical linear algebra kernel is specialized to the numerical domain (rational, float, double, etc.), loop unrolling factors, array layout and a priori knowledge (e.g., the matrix being positive definite). It is tedious and error prone to specialize by hand, writing numerous variations of the same algorithm.

The widely used generators such as ATLAS and SPIRAL reliably produce highly tuned specialized code but are difficult to extend. In ATLAS, which generates code using printf, even balancing parentheses is a challenge. According to the ATLAS creator, debugging is nightmare.

A typed staged programming language such as MetaOCaml lets us state a general, obviously correct algorithm and add layers of specializations in a modular way. By ensuring that the generated code always compiles and letting us quickly test it, MetaOCaml makes writing generators less daunting and more productive.

The readers will see it for themselves in this hands-on tutorial. Assuming no prior knowledge of MetaOCaml and only a basic familiarity with functional programming, we will eventually implement a simple domain-specific language (DSL) for linear algebra, with layers of optimizations for sparsity and memory layout of matrices and vectors, and their algebraic properties. We will generate optimal BLAS kernels. We shall get the taste of the ``Abstraction without guilt''.

As any other monograph in now's Foundations and Trends (TM) series, ``Reconciling Abstraction with High Performance: A MetaOCaml approach'' is published in three formats: journal, e-book and print book.
doi:10.1561/2500000038

The tutorial on systematic generation of optimal numeric kernels with MetaOCaml was first presented at the tutorial session of the conference of Commercial Users of Functional Programming (CUFP 13) on September 23, 2013 in Boston, USA. It was reprised at IFL 2017 (Bristol, UK).

This page describes the structure of the book and refers to the accompanying code. It also points out further developments (specifically, generating low-level code).


 

Overview

The goal of the tutorial is to teach how to write typed code generators, how to make them modular, and how to introduce local domain-specific optimizations with MetaOCaml. The tutorial is based on the progression of problems, which, except the introductory one, are all slightly simplified versions of real-life problems:
  1. First steps in staging and MetaOCaml
  2. Digital filters
  3. Complex vector multiplication: varying data representation (structure of arrays vs. array of structures)
  4. Systematic optimization of simple linear algebra: building extensively specialized general BLAS
  5. From an interpreter to a compiler: DSL for image manipulation
  6. Further challenges (Homework)
In fact, problems 3, 4 and 6 were suggested by HPC researchers as challenges to program generation community (Shonan challenges). The common theme is building high-performance computational kernels highly tuned to the problem at hand. Hence most problems revolve around simple linear algebra -- a typical and most frequently executed part in HPC.

The stress on high-performance applications and on modular optimizations and generators sets this tutorial apart from Taha's very accessible, gentle introductions to the `classical' partial evaluation and staging, focused on turning an interpreter of a generally higher-order language into a compiler. We also get to see this classical area in Chap. 6; however, we pay less attention to lambda-calculus and more to image processing. Furthermore, this tutorial mentions recent additions to MetaOCaml such as offshoring and let-insertion, which are described in detail in the linked documents.

The source code for the tutorial is available as a Accompanying Code.

 

Why MetaOCaml?

We will be using BER MetaOCaml, which is a complete re-implementation of the no longer available original MetaOCaml by Walid Taha, Cristiano Calcagno and collaborators.

BER MetaOCaml is a conservative extension of OCaml for ``writing programs that generate programs''. BER MetaOCaml adds to OCaml the type of code values (denoting ``program code'', or future-stage computations), and two basic constructs to build them: quoting and splicing. The generated code can be printed, stored in a file -- or compiled and linked-back to the running program, thus implementing run-time code optimization. MetaOCaml code without staging annotations, or with the annotations erased, is regular OCaml.

MetaOCaml has been successfully used for the most optimal stream fusion, specializing numeric and dynamic programming algorithms, building FFT kernels, compilers for an image processing and database query DSLs, OCaml server pages, generating families of specialized basic linear algebra and Gaussian Elimination routines, and high-performance stencil computations.

Writing code generators in a typed staged language like MetaOCaml benefits in several ways. First, the generated code will be well-formed, with all parentheses matching. Such a guarantee is a dear wish when writing C with printf (as done in ATLAS) or C++ with Matlab. MetaOCaml makes sure that the generated code is well-typed and shall compile without errors. There is no longer puzzling out a compilation error in the generated code, which is typically large, obfuscated and with unhelpful variable names. Mainly, code generation errors are reported in terms of the generator rather than the generated code. The tutorial will give many chances to see the importance of good error reporting.

MetaOCaml generators are hygienic, producing well-scoped code, with no unbound variables. Otherwise, hygiene violations are hard to detect in practice and may lead to the devious error of unintentionally bound variables. Although the unbound variables in the generated code stand out (when compiling it), determining what has caused them proved to be highly non-trivial in practice, as has been reported (Ofenbeck, Rompf, PĆ¼schel, Scala2016). The authors of that work wrote a new compiler testing framework, to specifically detect unbound variable and other such problems introduced during refactoring of generators. MetaOCaml is designed to prevent the generation of the problematic code in the first place.

Most importantly, MetaOCaml is typed. Types, staged types in particular, really do help write the code. All throughout the tutorial we will be writing code in live interaction with the type checker -- accepting type errors not as a punishment but as a valuable hint. We shall see on many occasions that once we fix the type signature, the generator practically writes itself. The type checker will tell us where to put a staging annotation.

MetaOCaml is purely generative: the generated code is treated as a black box and cannot be examined. One can put code together but cannot take it apart. Pure generativity significantly simplifies the type system and strengthens the static assurances. It may also seem that pure generativity precludes code optimizations. Fortunately, that is not the case, as shall soon see.

The staging annotations of MetaOCaml are like the ``assembler'' instructions of metaprogramming. We need higher-level abstractions. The final benefit of MetaOCaml -- compared to the preprocessors like camlp4 or ppx -- is that it is part of OCaml itself, and hence can take the full advantage of OCaml's abstraction and combination facilities, from higher-order functions to modules. Building optimization libraries and composing generators is the stress of the tutorial.

 

Conclusions

The trade-off between clarity, maintainability, ease of developing, on one hand -- and performance is real. Throughout the tutorial we have been encountering it time and again. We have also seen the trade-off resolved. Well-chosen abstractions (DSLs embedded in the tagless-final style, for one) let domain experts easily write code, easily see it correct and easily express domain-specific knowledge and optimizations. Code generation removes the overhead of abstraction from the resulting code (shifting the overhead to the generator, where it is bearable).

The recent ``Stream Fusion, to Completeness'' enforces the lesson, on the `industrial strength' stream processing. The strymonas library designed in the paper lets us build pipelines by freely nesting and plugging in the components such as maps, filters, joins, etc. The result is the highly imperative code, whose performance not just approaches but matches the hand-written code (in the cases where the hand-written code was feasible to write).

Building even complicated generators is simple if we take advantage of abstraction and types. OCaml's excellent abstraction facilities -- from higher-order functions to module system -- let us write and debug generators in small pieces, and compose optimizations from separate layers. As we keep saying, with code generation, the abstraction comes with no cost. We may abstract with abandon.

Staged types are of particularly great help. They help ensure that compiling the generated code produces no errors. Problematic generators are reported with helpful error messages that refer to the generator (rather than generated) code. Furthermore, on many occasions we have seen that to stage code, we merely need to give it the desired signature, submit the code to the type checker and fix the reported errors. The type checker actively helps us write the code.

This tutorial has covered the part of MetaOCaml that has been stable for a decade and is expected to remain so. MetaOCaml is an actively developed project, with more, experimental features such as offshoring and genlet, which have been mentioned only in passing. They all help write generators easily and produce faster code while maintaining confidence in the result.

 

Accompanying Code

Introduction to MetaOCaml: the power example (A.P.Ershov, 1977)
  • power.ml [6K]
    Code and explanations
  • power_rts.ml [2K]
    The example of run-time specialization
  • powerf_rts.ml [2K]
    The example of run-time specialization (FP)
  • square.ml [<1K]
    The externally defined `square' function
  • Lifting values to code: now comes with BER MetaOCaml N107
Generating optimal FIR filters
  • filter.ml [14K]
    Introductory, ad hoc approach
Systematic approach: Complex Vector Arithmetic (Shonan challenge)
  • ring.ml [2K]
    Abstracting arithmetic: Rings
  • vector.ml [3K]
    Abstract vectors and BLAS 1 operations
  • complex.ml [13K]
    Complex vector arithmetic
  • cmplx.mli [<1K]
    Defining data type cmplx
Shonan Challenge 1: Matrix-vector multiplication with a known, mostly sparse matrix
  • mvmult_full.ml [22K]
    Stepwise development of the optimal specialized matrix-vector multiplication
A DSL for image manipulation
How to build the tutorial code
Makefile [2K]