This work explores the promises of generative programming languages and techniques for the high-performance computing expert. We show that complex, architecture-specific optimizations can be implemented in a type-safe, purely generative framework. Peak performance is achievable through the careful combination of a high-level, multi-stage evaluation language -- MetaOCaml -- with low-level code generation techniques. Nevertheless, our results also show that generative approaches for high-performance computing do not come without technical caveats and implementation barriers concerning productivity and reuse. We describe these difficulties and identify ways to hide or overcome them, from abstract syntaxes to heterogeneous generators of code generators, combining high-level and type-safe multi-stage programming with a back-end generator of imperative code.
Joint work with Albert Cohen, Se'bastien Donadio, Mari'a Jesu's Garzara'n, Christoph Armin Herrmann, and David A. Padua.
Joint work with Jacques Carette.
<http://www.cas.mcmaster.ca/~carette/metamonads/index.html>
Besides the paper, the page points to the generator code, and the
generated Gaussian Elimination codes to handle real and integer
matrices using various pivoting strategies.
Shifting the Stage: Staging with Delimited Control
Section 5.2 of the JFP paper uses the same example of generating
specialized Gaussian Elimination codes to illustrate an
alternative, notationally more convenient approach: relying on
delimited control rather than monads.
Unlike FFTW, we know precisely where savings are coming from, and which particular equivalences contribute to which improvements in the code.
The paper makes the case that ``there are benefits to focusing on writing better generators rather than on fixing the results of simple generators.''
Generating optimal FFT code and relating FFTW and Split-Radix
Matching FFTW in operation counts
A particular area of current interest shared by PL and HPC researchers is how to use domain-specific languages (DSL) to capture and automate patterns and techniques of code generation, transformation, and optimization that recur in an application domain. For example, HPC has created and benefited from expressive DSLs such as OpenMP directives, SPIRAL's signal processing language, and specialized languages for stencil computations and domain decomposition. Moreover, staging helps to build efficient and expressive DSLs because it assures that the generated code is correct in the form of precise static guarantees.
Alas, the communication between PL researchers working on staging and HPC practitioners could be better. On one hand, HPC practitioners often do not know what PL research offers. For example, current staging technology not only makes sure that the generated code compiles without problems, but further eliminates frequent problems such as matrix dimension mismatch. Staged programming can detect and prevent these problems early during development, thus relieving users from scouring reams of obscure generated code to debug compiler errors, or waiting for expensive trial computations to produce dubious results. On the other hand, PL researchers often do not know how much HPC practitioners who write code generators value this or that theoretical advance or pragmatic benefit -- in other words, how the HPC wish list is ranked by importance.
This workshop aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming. The workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). The workshop benefited PL and staging theorists by informing them what HPC practice actually needs. HPC practitioners may have also benefited, for example by learning new ways to understand or organize the code generators they are already writing and using. To promote this mutual understanding, the workshop emphasized presentations that elicited questions rather than merely provide answers.
The workshop is organized together with Chung-chieh Shan and Yukiyoshi Kameyama. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).
shonan1-report.pdf [73K]
The final report
One of the most interesting parts of the report is the section
`Tangible Outcomes'. The foremost is the Shonan Challenge, a set of
representative HPC examples where staging could be of help, in
producing more maintainable code and letting domain experts perform
modifications at a higher-level of abstraction.
Generative programming, in particular, in the form of staging, is widely recognized in HPC as the leading approach to resolve the conflict between high-performance on one hand, and portability and maintainability on the other hand. In its general form, staging is an implementation technique for efficient domain-specific languages (DSL), letting HPC experts conveniently express their domain-specific knowledge and optimization heuristics. However, the results and tools of the current staging research are little used in the HPC community. Partly this is because HPC practitioners are not aware of the progress in staging; staging researchers are likewise unaware of what HPC practitioners really need. The first Shonan meeting brought together HPC practitioners and programming language (PL) researchers to break this awareness barrier. The meeting aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming.
The first Shonan meeting succeeded in its aim. It developed a set of benchmarks, representative HPC examples, where staging could be of help in producing more maintainable code and letting domain experts perform modifications at a higher-level of abstraction. This set was dubbed `Shonan Challenge'.
Shonan Challenge has greatly stimulated research and development of staging, resulting in extensible compilers based on staging (Rompf et al., POPL 2013) and the revival of MetaOCaml (ML 2013, FLOPS 2014). The answers to Shonan Challenges have been presented in the overview paper (Aktemur et el., PEPM 2013), in (Rompf et al., POPL 2013) and in the poster at APLAS 2012. Shonan Challenge problems (specifically, the Hidden-Markov Model benchmark) were discussed at the staging tutorials given in 2013 at the premier PL conferences PLDI, ECOOP, and ICFP/CUFP.
The present follow-up meeting was intended as the place to report the progress in staging back to the HPC practitioners who posed the challenges, to evaluate how well the developed tools meet the needs of the HPC practice already, and what is yet to be done.
As before, the workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). To promote the mutual understanding, we have planned for the workshop to have lots of time for discussion. We emphasized tutorial, brainstorming and working-group sessions rather than mere conference-like presentations.
The workshop is organized together with Yukiyoshi Kameyama and Jeremy Siek. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).
shonan2-report.pdf [126K]
The final report
Bridging the theory of staged programming languages and the practice of high-performance computing
The first Shonan meeting
Shonan Challenge is a collection of crisp problems posed by HPC and domain experts, for which efficient implementations are known but were tedious to write and modify. The challenge is to generate a similar efficient implementation from the high-level specification of a problem, performing the same optimizations, but automatically. It should be easy to adjust optimizations and the specification, maintaining confidence in the generated code.
We describe our initial set of benchmarks and provide three solutions to two of the problems. We hope that the Shonan Challenge will clarify the state of the art and stimulate the theory and technology of staging just as the POPLmark challenge did for meta-theory mechanization. Since each Shonan Challenge problem is a kernel of a significant HPC application, each solution has an immediate practical application.
The paper is written with Baris Aktemur, Yukiyoshi Kameyama, and Chung-chieh Shan.
<http://groups.google.com/group/stagedhpc>
<https://github.com/StagedHPC/shonan-challenge>
The mailing list (Google Group) and the GitHub repository for the
Shonan Challenge. The list is open: subscription requests are welcome.
The MetaOCaml Tutorial book describes solutions to two Shonan challenges
By the time of the 2006 presentation, the subject has been already dead for a year. It is described here just for the record. One can recognize in that old code the beginnings of the tagless-final approach. In fact, the first tagless-final interpreter was written in the e-mail message ``Tagless staged interpreter in 5 definitions'' sent on November 3, 2006, after I came back from the WG 2.11 meeting. Strangely, I did not recognize the connection back then. One can also see in the WG 2.11 presentation the anticipation of Scala-Virtualized and the Lightweight Modular Staging (LMS).
The running example of the WG 2.11 presentation was the convolution of two sequences. Here is the textbook algorithm, expressed in what looks like OCaml.
let dot1 = funM ENVID n a1 a2 -> let sum = ref 0.0 in for i = 0 to (n-1) do sum := !sum +. a1.(i) *. a2.(n-i-1) done; !sum
Replacing the odd-looking funM ENVID
with just fun
gives
the truly OCaml code, which can be evaluated as such. As it stands though,
dot1
generates code, in the form of the MetaOCaml quotation (brackets):
val dot1 : ('a, int -> float array -> float array -> float) code = .<fun n_3 -> fun a1_4 -> fun a2_5 -> let sum_6 = (ref 0.0) in for i_7 = 0 to (n_3 - 1) do let i_8 = i_7 in (sum_6 := ((! sum_6) +. (a1_4.(i_8) *. a2_5.((n_3 - i_8) - 1)))) done; (! sum_6)>.The generated code reproduces what we started with, but with more parentheses and less readable identifiers.
The curious ENVID
appearing in dot1
is quite boring:
module ENVID = struct type 'v m = 'v let retS x = x let retL x = x let ret x = x let bind m f = f m let run x = x let app f x = .<.~f .~x>. let ( + ) x y = .<Pervasives.( + ) .~x .~y>. let ( +.) x y = .<Pervasives.( +.) .~x .~y>. ... let fif c t e = .<if .~c then .~t else .~e>. let fseq e1 e2 = .<begin .~e1; .~e2 end>. let ffor f e1 e2 body = if f then .<for i = .~e1 to .~e2 do .~(body .<i>.) done>. else .<for i = .~e1 downto .~e2 do .~(body .<i>.) done>. let ref e = .<ref .~e>. let fass e1 e2 = .<(.~e1) := (.~e2)>. let ( ! ) e = .<! (.~e) >. let aref e1 e2 = .<(.~e1).(.~e2)>. let ym fc = .<let rec ym f x = f (ym f) x in ym .~fc>. endIt defines code generating combinators, such as
app
for building a
function application or ffor
for generating a for-loop. ENVID
is
the starting point for optimizations, which replace some of the
ENVID
generators with optimizing ones.
We now demonstrate the partial loop unrolling optimization. First we introduce the combinator for full loop unrolling:
let rec full_unroll lb ub body = if lb = ub then body lb else if lb < ub then .<begin .~(body lb); .~(full_unroll (lb+1) ub body) end>. else .< () >.For example,
.<let r = ref 0 in .~(full_unroll 1 3 (fun i -> .<r := !r + i>.))>.produces the following code
.<let r_1 = ref 0 in r_1 := ((! r_1) + 1); r_1 := ((! r_1) + 2); r_1 := ((! r_1) + 3)>.
Partial loop unrolling by the given factor is written in terms of the full unrolling:
let partial_unroll factor e1 e2 body = .<let lb = .~e1 and ub = .~e2 in let ntimes = (ub-lb+1)/factor in begin for ii = 0 to ntimes-1 do .~(full_unroll 0 (factor-1) (fun i -> body .< ii*factor+lb + i >.)) done; for i = ntimes*factor+lb to ub do .~(body .<i>.) done end>.
To apply the tactic we define ENVPU
, which is like ENVID
but with
the ffor
generating a partially-unrolled loop.
module ENVPU(N : sig val unroll_factor : int end) = struct include ENVID let ffor true e1 e2 body = partial_unroll N.unroll_factor e1 e2 body end module ENVPU2 = ENVPU(struct let unroll_factor = 2 end)Using the just defined
ENVPU2
instead of ENVID
in dot1
:
let dotu = funM ENVPU2 n a1 a2 -> let sum = ref 0.0 in for i = 0 to (n-1) do sum := !sum +. a1.(i) *. a2.(n-i-1) done; !sumproduces the code with the for-loop unrolled twice:
.<fun n_3 -> fun a1_4 -> fun a2_5 -> let sum_6 = (ref 0.0) in let lb_7 = 0 and ub_8 = (n_3 - 1) in let ntimes_9 = (((ub_8 - lb_7) + 1) / 2) in for ii_12 = 0 to (ntimes_9 - 1) do let i_14 = (((ii_12 * 2) + lb_7) + 0) in (sum_6 := ((! sum_6) +. (a1_4.(i_14) *. a2_5.((n_3 - i_14) - 1)))); let i_13 = (((ii_12 * 2) + lb_7) + 1) in (sum_6 := ((! sum_6) +. (a1_4.(i_13) *. a2_5.((n_3 - i_13) - 1)))) done; for i_10 = ((ntimes_9 * 2) + lb_7) to ub_8 do let i_11 = i_10 in (sum_6 := ((! sum_6) +. (a1_4.(i_11) *. a2_5.((n_3 - i_11) - 1)))) done; (! sum_6)>.The code is messy, as typical of optimized code. It would have been tedious to write it by hand.
What made it possible to interpret the for-loop in dot1
as the
optimized partially-unrolled loop is the syntactic extension funM
.
It did a source-to-source transformation of a block of the OCaml code
to its typed AST in, modern terms, the tagless-final style. For example,
funM ENVID n a -> let sum = ref 0 in for i = 0 to n do sum := !sum + a.(i) done; !sumbecomes
.<fun n a -> .~(ENV.run ( ENV.bind (ENV.ref (ENV.retL 0)) (fun sum' -> .<let sum = .~sum' in .~( ENV.fseq (ENV.ffor true (ENV.retL 0) (ENV.retS .<n>.) (fun i' -> .<let i = .~i' in ENV.fass (ENV.retS .<sum>.) (ENV.(+) (ENV.(!) (ENV.retS .<sum>.)) (ENV.aref (ENV.retS .<a>.) (ENV.retS .<i>.))))) (ENV.(!) (ENV.retS .<sum>.)))))) >.which is the code passed to the OCaml compiler. It is up to
ENV
then to interpret for-loops or let-statements. Thus the point of
funM
is to let the users define their own interpretations of OCaml code
-- the interpretations performing code analyzes or incorporating
domain-specific optimizations.
There is a lot of room for improvement; the most obvious is
to turn funM
into a functor -- or better yet, a first-class functor.
That is, given the signature of a typed AST
module type SYM = sig type 'a repr val int : int -> int repr val unit : unit -> unit repr val app : ('a->'b) repr -> 'a repr -> 'b repr val lam : ('a repr -> 'b repr) -> ('a->'b) repr val (+) : int repr -> int repr -> int repr val (!) : 'a ref repr -> 'a repr val ref : 'a repr -> 'a ref repr val let_ : 'a repr -> ('a repr -> 'b repr) -> 'b repr val fseq : unit repr -> 'a repr -> 'a repr (* sequencing *) val ffor : bool -> int repr -> int repr -> (int repr -> unit repr) -> unit repr val aref : 'a array repr -> int repr -> 'a repr val fass : 'a ref repr -> 'a repr -> unit repr (* assignment *) val run : 'a repr -> 'a code endwe re-write the following source code fragment
funM ENVID n a -> let sum = ref 0 in for i = 0 to n do sum := !sum + a.(i) done; !sumas follows:
fun (module S:SYM) -> let open S in run ( lam (fun n -> lam (fun a -> let_ (ref (int 0)) (fun sum -> fseq (ffor true (int 0) n (fun i -> fass sum ((! sum) + aref a i))) (! sum)))))The result is truly in the tagless-final style; it also looks quite like ``Scala-virtualized''. Another obvious improvement is to re-write the syntactic extension
pa_lift
as a ppx
source-to-source transformer.
At the 2006 meeting, a couple of people said to me after the presentation that they liked it. One person added he might like to collaborate on this topic. ``Alas, my student has just graduated,'' he said, ``and I currently don't have others to work on this.'' This was the end of the story.
convolution.ml [3K]
The code for the convolution optimization example described in
in the article
gib.ml [9K]
The code for optimizing the Gibonacci example: tactics for the state-
and continuation-passing transformations
tagless_final.ml [5K]
The modern reconstruction of the 2006 WG2.11 talk,
in the tagless-final terms
B/F
ratio -- of the
hand-optimized code. The generated code correctly handles the edge cases,
which were problematic for the hand-written code. The generated
code assuredly has no array bound errors, which also troubled the
hand-written code.
The stencil EDSL is the straightforward application of the previous,
seemingly purely academic, `useless' work on generating optimal
Gibonacci codes and the let
-insertion with delimited control. The key
insight is regarding a stencil as a sliding-window cache, for real or virtual
arrays. The implementation demonstrated that scope extrusion is a
problem in practice after all, justifying the research into its static
prevention.
The main challenge in high-performance computing (HPC) is overcoming the
memory bottleneck, reducing the amount of time CPU (local node) spends
waiting for data. The speed disparity between the CPU and the memory subsystem
is often characterized by the so-called B/F
ratio: the ratio of
the number of bytes moved from or to the main memory during a computation
to the number of executed floating-point operations. The challenge
of the HPC code optimization is to reduce the B/F
ratio and
the amount of needed local memory.
Signal processing or finite-element algorithms can often be stated as element-wise computations on shifted arrays. The following is the running example in the challenge.
w = a * S[1] a b = a - w + S[-1] wHere
a
is a global input array, b
is a global output array,
and w
is a working array. The operation S[n]
shifts the argument vector by n
(to the left, if n
is
positive). All arithmetic operations on vectors (addition,
multiplication, subtraction) are elementwise. Global arrays
are shared or distributed throughout a supercomputer; reading or writing
them requires accessing off-the-chip memory or the inter-node
communication.
The naive implementation, neglecting for now edge elements, can be written in OCaml as
for i=0 to (n-2) do w.(i) <- a.(i) *. a.(i+1) done; for i=1 to (n-1) do b.(i) <- a.(i) -. w.(i) +. w.(i-1) doneAssuming
w
is a local array, the first loop reads 2*(n-1)
elements from the global array a
and does (n-1)
floating-point
additions. The second loop reads n-1
elements from a
, writes n-1
elements to b
and does 2*(n-1)
floating-point computations.
With 8-byte floating-point values, the B/F
ratio is
32/3
. Current supercomputers can sustain B/F
of about 1 to 2. Therefore,
the naive implementation will run the order of magnitude worse
the peak performance, because global memory cannot keep
up with the CPU. The slow-down will likely to be larger since for large
n
, the array w
won't fit into the local memory (cache).
After the array computation finishes, the output array b
becomes the
input array for the next `time step', and the computation continues.
The B/F
ratio can be reduced by doing two time steps at once:
w1 = a * S[1] a wm = a - w1 + S[-1] w1 w2 = wm * S[1] wm b = wm - w2 + S[-1] w2The naive implementation has
B/F
of 32/6
-- with twice as many
floating-point operations per global memory access. However,
the implementation requires three times more local memory.
We generate the code that reads and writes only 1 floating-point value
per loop iteration while executing 6 floating-point operations, for
B/F = 16/6
-- without using any local arrays. The code
is expected to run nearly at the peak speed of the supercomputer.
The enclosed code develops the EDSL for stencil computation in three steps, starting with the naive implementation generating many duplicate array accesses. The array accesses are then cached in local variables. Finally, a stencil is introduced as a sliding-window cache. For illustration, here is the naive generator of the double-time-step computation:
let t21 max_bounded a b = forlooph 2 min_bounded max_bounded ( let a = gref0 a in let w1 = a *@ ashift 1 a in let wm = a -@ w1 +@ ashift (-1) w1 in let w2 = wm *@ ashift 1 wm in b <-@ wm -@ w2 +@ ashift (-1) w2)where
*@
, +@
etc. are element-wise operations on arrays, ashift n
is the array shift S[n]
and forlooph 2
is a loop combinator
that unrolls the first and the last two iterations. The generator will
complain if the halo value is less than 2 because it cannot see
(correctly) that array indices are in-bounds. The EDSL code looks
quite like the mathematical notation.
To generate the optimal code, we merely add stencil_memo
combinators:
let t25 max_bounded a b = let p = new_prompt () and q = new_prompt () in push_prompt q (fun () -> forlooph 2 min_bounded max_bounded (with_body_prompt p ( let a = stencil_memo q p (gref0 a) in let w1 = stencil_memo q p (a *@ ashift 1 a) in let wm = stencil_memo q p (a -@ w1 +@ ashift (-1) w1) in let w2 = stencil_memo q p (wm *@ ashift 1 wm) in b <-@ wm -@ w2 +@ ashift (-1) w2)))The combinators need to know the loop scope (the scope of loop-local variables) and the outer scope (the scope of the variables to store data cross-iteration). So-called prompts
q
and p
denote those scopes.
The generated code, shown in the comments in stencil.ml
, reads and
writes a global array once per iteration and does six floating-point
operations, with the B/F
of 16/6
.
The hand-optimized code has essentially the same loop body and the
same B/F
. However, the first versions of the hand-written code left
the elements 0
, 1
, and n-1
of the output array
uninitialized. Furthermore, the code read past the end of the input
array. Getting edge cases right is really difficult. The generated
code has very long and boring pieces of code before and after the main
loop, to properly compute the edge elements. Writing such tedious code
by hand is excruciatingly boring. Code generation truly helps.
The stencil challenge proved to be a surprisingly good motivation for the research on staging. It required the code generation with effects, in particular, effects crossing binders. Implementing the challenge showed that scope extrusion does happen in practice. Although the fact of scope extrusion is easy to see when compiling the generated code, finding the culprit within the generator turns out difficult.
problems/tiling/main1.cpp
committed to the Shonan-challenge GitHub repository on 2012-05-21stencil_hand.ml [4K]
Reference code: specification and the hand-written optimized code
The code correctly handles the edges cases and has unit tests.
stencil.ml [28K]
The MetaOCaml code of the solution to the challenge
The file gradually develops the EDSL for stencil computations
and has lots of generated sample code and numerical tests.
"The code requires MetaOCaml and the
delimcc library.
bounded.mli [3K]
bounded.ml [5K]
Partially known dynamic values for safe array indexing
A library for statically tracking bounds on unknown dynamic
values.