- Reconciling Abstraction with High Performance: A MetaOCaml approach
- Bridging the theory of staged programming languages and the practice of high-performance computing
- Staging and High-Performance Computing: Theory and Practice
- Shonan Challenge for Generative Programming
- Generating optimal code with confidence
- Multi-stage programming with functors and monads: eliminating abstraction overhead from generic code
- In search of a program generator to implement generic transformations for high-performance computing
- Generating optimal stencil code
- Do-it-yourself domain-specific optimizing compiler

- The quality of compiler-optimized code for high-performance
applications is far behind what optimization and domain experts can
achieve by hand. Although it may seem surprising at first glance,
the performance gap has been widening over time, due to the tremendous
complexity increase in microprocessor and memory architectures,
and to the rising level of abstraction of popular programming languages
and styles. This paper explores in-between solutions, neither fully
automatic nor fully manual ways to adapt a computationally intensive
application to the target architecture. By mimicking complex sequences
of transformations useful to optimize real codes, we show that generative
programming is a practical means to implement architecture-aware
optimizations for high-performance applications.
This work explores the promises of generative programming languages and techniques for the high-performance computing expert. We show that complex, architecture-specific optimizations can be implemented in a type-safe, purely generative framework. Peak performance is achievable through the careful combination of a high-level, multi-stage evaluation language -- MetaOCaml -- with low-level code generation techniques. Nevertheless, our results also show that generative approaches for high-performance computing do not come without technical caveats and implementation barriers concerning productivity and reuse. We describe these difficulties and identify ways to hide or overcome them, from abstract syntaxes to heterogeneous generators of code generators, combining high-level and type-safe multi-stage programming with a back-end generator of imperative code.

Joint work with Albert Cohen, Se'bastien Donadio, Mari'a Jesu's Garzara'n, Christoph Armin Herrmann, and David A. Padua.

**Version**- The current version is September, 2006
**References**- SCP-search.pdf [511K]

Science of Computer Programming, Volume 62, Issue 1, September 2006, pp. 25-46 doi:10.1016/j.scico.2005.10.013

- We use multi-stage programming, monads and OCaml's advanced module system
to demonstrate how to eliminate all abstraction overhead from generic
programs while avoiding any inspection of the resulting code.
We demonstrate this clearly with Gaussian Elimination as a
representative family of symbolic and numeric algorithms. We parameterize
our code to a great extent -- over domain, input and permutation matrix
representations, determinant and rank tracking, pivoting policies,
result types, etc. -- at no run-time cost. Because the resulting code
is generated just right and not changed afterward, MetaOCaml guarantees
that the generated code is well-typed. We further demonstrate that
various abstraction parameters (aspects) can be made orthogonal and
compositional, even in the presence of name-generation for temporaries,
and ``interleaving'' of aspects. We also show how to encode some
domain-specific knowledge so that ``clearly wrong'' compositions can
be rejected at or before generation time, rather than during the
compilation or running of the generated code.
Joint work with Jacques Carette.

**Version**- The current version is October 2008
**References**- SCP-metamonads.pdf [311K]

The paper published in Science of Computer Programming, v76, N5, May 2011, pp. 349-375 doi:10.1016/j.scico.2008.09.008<http://www.cas.mcmaster.ca/~carette/metamonads/index.html>

Besides the paper, the page points to the generator code, and the generated Gaussian Elimination codes to handle real and integer matrices using various pivoting strategies.Shifting the Stage: Staging with Delimited Control

Section 5.2 of the JFP paper uses the same example of generating specialized Gaussian Elimination codes to illustrate an alternative, notationally more convenient approach: relying on delimited control rather than monads.

- Can we generate
*optimal*code in one pass, without further transformations such as common-subexpression-elimination, without looking into the code at all after it has been generated? The papers below answer affirmatively. The papers discuss in detail the generation of a straight-line C, Fortran, etc. code for the power-of-two FFT. The papers show that with the help of Abstract Interpretation we can exactly match the quality of code generated by the well-known system FFTW. The second paper demonstrates that our generated power-of-two FFT code has exactly the same number of floating-point operations as that in codelets of FFTW. The significant difference from FFTW is that we do not do any intensional code analysis -- the generated code is a black box and can not be processed nor manipulated any further. Moreover, the generated code cannot even be compared in equality. The reason for these restrictions is to maintain strong invariants about the generated code, e.g., that the generated code is automatically strongly typed and does not need to be typechecked. These invariants and the absence of ad hoc manipulation on the generated low-level code significantly increase our confidence in the result.Unlike FFTW, we know precisely where savings are coming from, and which particular equivalences contribute to which improvements in the code.

The paper makes the case that ``there are benefits to focusing on writing better generators rather than on fixing the results of simple generators.''

**References**- A Methodology for Generating Verified Combinatorial Circuits

Joint work with Kedar N. Swadi and Walid Taha.

Proc. of EMSOFT'04, the Fourth ACM International Conference on Embedded Software, September 27-29, 2004, Pisa, Italy. ACM Press, pp. 249 - 258. doi:10.1145/1017753.1017794Generating optimal FFT code and relating FFTW and Split-Radix

Matching FFTW in operation counts

- This discussion-heavy workshop took place at Shonan Village Center,
Kanagawa, Japan on May 19-22 2012. It aimed to bridge the theory of
programming languages (PL) with the practice of high-performance
computing (HPC). The topic of the discussion was code generation,
or
*staging*, a form of meta-programming. Both the PL and HPC communities have come to realize the importance of code generation: whereas PL theorists widely regard staging as the leading approach to making modular software and expressive languages run fast, HPC practitioners widely regard staging as the leading approach to making high-performance software reusable and maintainable. Thanks to this confluence of theory and practice, we have had the rare chance to bring together PL researchers and the potential consumers of their work.A particular area of current interest shared by PL and HPC researchers is how to use

*domain-specific languages*(DSL) to capture and automate patterns and techniques of code generation, transformation, and optimization that recur in an application domain. For example, HPC has created and benefited from expressive DSLs such as OpenMP directives, SPIRAL's signal processing language, and specialized languages for stencil computations and domain decomposition. Moreover, staging helps to build efficient and expressive DSLs because it assures that the generated code is correct in the form of precise static guarantees.Alas, the communication between PL researchers working on staging and HPC practitioners could be better. On one hand, HPC practitioners often do not know what PL research offers. For example, current staging technology not only makes sure that the generated code compiles without problems, but further eliminates frequent problems such as matrix dimension mismatch. Staged programming can detect and prevent these problems early during development, thus relieving users from scouring reams of obscure generated code to debug compiler errors, or waiting for expensive trial computations to produce dubious results. On the other hand, PL researchers often do not know how much HPC practitioners who write code generators value this or that theoretical advance or pragmatic benefit -- in other words, how the HPC wish list is ranked by importance.

This workshop aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming. The workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). The workshop benefited PL and staging theorists by informing them what HPC practice actually needs. HPC practitioners may have also benefited, for example by learning new ways to understand or organize the code generators they are already writing and using. To promote this mutual understanding, the workshop emphasized presentations that elicited questions rather than merely provide answers.

The workshop is organized together with Chung-chieh Shan and Yukiyoshi Kameyama. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).

**References**- <https://shonan.nii.ac.jp/seminar/019/>

<https://shonan.nii.ac.jp/seminars/019/>

The web page of the workshopshonan1-report.pdf [73K]

The final report

One of the most interesting parts of the report is the section `Tangible Outcomes'. The foremost is the*Shonan Challenge*, a set of representative HPC examples where staging could be of help, in producing more maintainable code and letting domain experts perform modifications at a higher-level of abstraction.

- The second meeting on high-performance computing and staging took
place at the Shonan Village Center, Kanagawa, Japan on May 27--30,
2014. It was the follow-up to the May 2012 meeting (Shonan Seminar 19)
``Bridging the theory of staged programming languages and the practice
of high-performance computing''. At the first meeting, researchers
in staged programming languages were learning of the problems in
high-performance computing (HPC) from HPC experts. Since then, thanks
in part to Shonan Challenges put forward at that meeting, the theory
and tools of staged programming languages have progressed to the point
where they could already be useful in HPC practice. The present
follow-up meeting has gauged this readiness and set further
milestones. It served as a forum for staged programming researchers to
present their progress and for HPC practitioners to evaluate it,
fostering the collaboration on real-life applications.
Generative programming, in particular, in the form of

*staging*, is widely recognized in HPC as the leading approach to resolve the conflict between high-performance on one hand, and portability and maintainability on the other hand. In its general form, staging is an implementation technique for efficient domain-specific languages (DSL), letting HPC experts conveniently express their domain-specific knowledge and optimization heuristics. However, the results and tools of the current staging research are little used in the HPC community. Partly this is because HPC practitioners are not aware of the progress in staging; staging researchers are likewise unaware of what HPC practitioners really need. The first Shonan meeting brought together HPC practitioners and programming language (PL) researchers to break this awareness barrier. The meeting aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming.The first Shonan meeting succeeded in its aim. It developed a set of benchmarks, representative HPC examples, where staging could be of help in producing more maintainable code and letting domain experts perform modifications at a higher-level of abstraction. This set was dubbed `Shonan Challenge'.

Shonan Challenge has greatly stimulated research and development of staging, resulting in extensible compilers based on staging (Rompf et al., POPL 2013) and the revival of MetaOCaml (ML 2013, FLOPS 2014). The answers to Shonan Challenges have been presented in the overview paper (Aktemur et el., PEPM 2013), in (Rompf et al., POPL 2013) and in the poster at APLAS 2012. Shonan Challenge problems (specifically, the Hidden-Markov Model benchmark) were discussed at the staging tutorials given in 2013 at the premier PL conferences PLDI, ECOOP, and ICFP/CUFP.

The present follow-up meeting was intended as the place to report the progress in staging back to the HPC practitioners who posed the challenges, to evaluate how well the developed tools meet the needs of the HPC practice already, and what is yet to be done.

As before, the workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). To promote the mutual understanding, we have planned for the workshop to have lots of time for discussion. We emphasized tutorial, brainstorming and working-group sessions rather than mere conference-like presentations.

The workshop is organized together with Yukiyoshi Kameyama and Jeremy Siek. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).

**References**- <http://shonan.nii.ac.jp/seminar/056/>

The web page of the workshopshonan2-report.pdf [126K]

The final reportBridging the theory of staged programming languages and the practice of high-performance computing

The first Shonan meeting

- The appeal of generative programming is ``abstraction
without guilt'' (coined by Ken Kennedy):
eliminating the vexing trade-off between writing
high-level code and highly-performant code. Generative programming
also promises to formally capture the domain-specific
knowledge and heuristics used by high-performance computing (HPC)
experts. How far along are we in fulfilling these promises? To gauge
our progress, a recent Shonan Meeting on ``bridging the theory of staged
programming languages and the practice of high-performance computing''
proposed to use a set of benchmarks, dubbed ``Shonan Challenge''.
Shonan Challenge is a collection of crisp problems posed by HPC and domain experts, for which efficient implementations are known but were tedious to write and modify. The challenge is to generate a similar efficient implementation from the high-level specification of a problem, performing the same optimizations, but automatically. It should be easy to adjust optimizations and the specification, maintaining confidence in the generated code.

We describe our initial set of benchmarks and provide three solutions to two of the problems. We hope that the Shonan Challenge will clarify the state of the art and stimulate the theory and technology of staging just as the POPLmark challenge did for meta-theory mechanization. Since each Shonan Challenge problem is a kernel of a significant HPC application, each solution has an immediate practical application.

The paper is written with Baris Aktemur, Yukiyoshi Kameyama, and Chung-chieh Shan.

**References**- Shonan-challenge.pdf [230K]

The paper published in the Proceedings of the ACM SIGPLAN 2013 Workshop on Partial Evaluation and Program Manipulation, PEPM 2013 (Rome, Italy, January 21-22, 2013), pp. 147-154 doi:10.1145/2426890.2426917<http://groups.google.com/group/stagedhpc>

<https://github.com/StagedHPC/shonan-challenge>

The mailing list (Google Group) and the GitHub repository for the Shonan Challenge. The list is open: subscription requests are welcome.The MetaOCaml Tutorial book describes solutions to two Shonan challenges

- This article is the record of a short presentation at the October
2006 meeting of the IFIP working group on program generation (WG
2.11). The presentation reported on a small, seemingly successful
experiment in optimized DSL compilation. The goal was to apply
domain-specific optimizations (such as loop unrolling, state-passing
transformation, memoization) to a textbook algorithm written in
OCaml. The tactics are written in MetaOCaml. Hence the code and the
tactics are typed. The type system prevents obvious nonsense,
statically ensuring that the optimization result is well-typed. The
output is either the optimized OCaml or (with offshoring) C or Fortran.
The key idea was combining MetaOCaml with camlp4. MetaOCaml is
responsible for code generation and writing tactics. The camlp4
syntactic extension turns a block of OCaml code into a typed AST,
which can then be
*easily*(re-)interpreted by ordinary users through tactics.By the time of the 2006 presentation, the subject has been already dead for a year. It is described here just for the record. One can recognize in that old code the beginnings of the tagless-final approach. In fact, the first tagless-final interpreter was written in the e-mail message ``Tagless staged interpreter in 5 definitions'' sent on November 3, 2006, after I came back from the WG 2.11 meeting. Strangely, I did not recognize the connection back then. One can also see in the WG 2.11 presentation the anticipation of Scala-Virtualized and the Lightweight Modular Staging (LMS).

The running example of the WG 2.11 presentation was the convolution of two sequences. Here is the textbook algorithm, expressed in what looks like OCaml.

let dot1 = funM ENVID n a1 a2 -> let sum = ref 0.0 in for i = 0 to (n-1) do sum := !sum +. a1.(i) *. a2.(n-i-1) done; !sum

Replacing the odd-looking

`funM ENVID`

with just`fun`

gives the truly OCaml code, which can be evaluated as such. As it stands though,`dot1`

generates code, in the form of the MetaOCaml quotation (brackets):val dot1 : ('a, int -> float array -> float array -> float) code = .<fun n_3 -> fun a1_4 -> fun a2_5 -> let sum_6 = (ref 0.0) in for i_7 = 0 to (n_3 - 1) do let i_8 = i_7 in (sum_6 := ((! sum_6) +. (a1_4.(i_8) *. a2_5.((n_3 - i_8) - 1)))) done; (! sum_6)>.

The generated code reproduces what we started with, but with more parentheses and less readable identifiers.The curious

`ENVID`

appearing in`dot1`

is quite boring:module ENVID = struct type 'v m = 'v let retS x = x let retL x = x let ret x = x let bind m f = f m let run x = x let app f x = .<.~f .~x>. let ( + ) x y = .<Pervasives.( + ) .~x .~y>. let ( +.) x y = .<Pervasives.( +.) .~x .~y>. ... let fif c t e = .<if .~c then .~t else .~e>. let fseq e1 e2 = .<begin .~e1; .~e2 end>. let ffor f e1 e2 body = if f then .<for i = .~e1 to .~e2 do .~(body .<i>.) done>. else .<for i = .~e1 downto .~e2 do .~(body .<i>.) done>. let ref e = .<ref .~e>. let fass e1 e2 = .<(.~e1) := (.~e2)>. let ( ! ) e = .<! (.~e) >. let aref e1 e2 = .<(.~e1).(.~e2)>. let ym fc = .<let rec ym f x = f (ym f) x in ym .~fc>. end

It defines code generating combinators, such as`app`

for building a function application or`ffor`

for generating a for-loop.`ENVID`

is the starting point for optimizations, which replace some of the`ENVID`

generators with optimizing ones.We now demonstrate the partial loop unrolling optimization. First we introduce the combinator for full loop unrolling:

let rec full_unroll lb ub body = if lb = ub then body lb else if lb < ub then .<begin .~(body lb); .~(full_unroll (lb+1) ub body) end>. else .< () >.

For example,.<let r = ref 0 in .~(full_unroll 1 3 (fun i -> .<r := !r + i>.))>.

produces the following code.<let r_1 = ref 0 in r_1 := ((! r_1) + 1); r_1 := ((! r_1) + 2); r_1 := ((! r_1) + 3)>.

Partial loop unrolling by the given factor is written in terms of the full unrolling:

let partial_unroll factor e1 e2 body = .<let lb = .~e1 and ub = .~e2 in let ntimes = (ub-lb+1)/factor in begin for ii = 0 to ntimes-1 do .~(full_unroll 0 (factor-1) (fun i -> body .< ii*factor+lb + i >.)) done; for i = ntimes*factor+lb to ub do .~(body .<i>.) done end>.

To apply the tactic we define

`ENVPU`

, which is like`ENVID`

but with the`ffor`

generating a partially-unrolled loop.module ENVPU(N : sig val unroll_factor : int end) = struct include ENVID let ffor true e1 e2 body = partial_unroll N.unroll_factor e1 e2 body end module ENVPU2 = ENVPU(struct let unroll_factor = 2 end)

Using the just defined`ENVPU2`

instead of`ENVID`

in`dot1`

:let dotu = funM ENVPU2 n a1 a2 -> let sum = ref 0.0 in for i = 0 to (n-1) do sum := !sum +. a1.(i) *. a2.(n-i-1) done; !sum

produces the code with the for-loop unrolled twice:.<fun n_3 -> fun a1_4 -> fun a2_5 -> let sum_6 = (ref 0.0) in let lb_7 = 0 and ub_8 = (n_3 - 1) in let ntimes_9 = (((ub_8 - lb_7) + 1) / 2) in for ii_12 = 0 to (ntimes_9 - 1) do let i_14 = (((ii_12 * 2) + lb_7) + 0) in (sum_6 := ((! sum_6) +. (a1_4.(i_14) *. a2_5.((n_3 - i_14) - 1)))); let i_13 = (((ii_12 * 2) + lb_7) + 1) in (sum_6 := ((! sum_6) +. (a1_4.(i_13) *. a2_5.((n_3 - i_13) - 1)))) done; for i_10 = ((ntimes_9 * 2) + lb_7) to ub_8 do let i_11 = i_10 in (sum_6 := ((! sum_6) +. (a1_4.(i_11) *. a2_5.((n_3 - i_11) - 1)))) done; (! sum_6)>.

The code is messy, as typical of optimized code. It would have been tedious to write it by hand.What made it possible to interpret the for-loop in

`dot1`

as the optimized partially-unrolled loop is the syntactic extension`funM`

. It did a source-to-source transformation of a block of the OCaml code to its typed AST in, modern terms, the tagless-final style. For example,funM ENVID n a -> let sum = ref 0 in for i = 0 to n do sum := !sum + a.(i) done; !sum

becomes.<fun n a -> .~(ENV.run ( ENV.bind (ENV.ref (ENV.retL 0)) (fun sum' -> .<let sum = .~sum' in .~( ENV.fseq (ENV.ffor true (ENV.retL 0) (ENV.retS .<n>.) (fun i' -> .<let i = .~i' in ENV.fass (ENV.retS .<sum>.) (ENV.(+) (ENV.(!) (ENV.retS .<sum>.)) (ENV.aref (ENV.retS .<a>.) (ENV.retS .<i>.))))) (ENV.(!) (ENV.retS .<sum>.)))))) >.

which is the code passed to the OCaml compiler. It is up to`ENV`

then to interpret for-loops or let-statements. Thus the point of`funM`

is to let the users define their own interpretations of OCaml code -- the interpretations performing code analyzes or incorporating domain-specific optimizations.There is a lot of room for improvement; the most obvious is to turn

`funM`

into a functor -- or better yet, a first-class functor. That is, given the signature of a typed ASTmodule type SYM = sig type 'a repr val int : int -> int repr val unit : unit -> unit repr val app : ('a->'b) repr -> 'a repr -> 'b repr val lam : ('a repr -> 'b repr) -> ('a->'b) repr val (+) : int repr -> int repr -> int repr val (!) : 'a ref repr -> 'a repr val ref : 'a repr -> 'a ref repr val let_ : 'a repr -> ('a repr -> 'b repr) -> 'b repr val fseq : unit repr -> 'a repr -> 'a repr (* sequencing *) val ffor : bool -> int repr -> int repr -> (int repr -> unit repr) -> unit repr val aref : 'a array repr -> int repr -> 'a repr val fass : 'a ref repr -> 'a repr -> unit repr (* assignment *) val run : 'a repr -> 'a code end

we re-write the following source code fragmentfunM ENVID n a -> let sum = ref 0 in for i = 0 to n do sum := !sum + a.(i) done; !sum

as follows:fun (module S:SYM) -> let open S in run ( lam (fun n -> lam (fun a -> let_ (ref (int 0)) (fun sum -> fseq (ffor true (int 0) n (fun i -> fass sum ((! sum) + aref a i))) (! sum)))))

The result is truly in the tagless-final style; it also looks quite like ``Scala-virtualized''. Another obvious improvement is to re-write the syntactic extension`pa_lift`

as a`ppx`

source-to-source transformer.At the 2006 meeting, a couple of people said to me after the presentation that they liked it. One person added he might like to collaborate on this topic. ``Alas, my student has just graduated,'' he said, ``and I currently don't have others to work on this.'' This was the end of the story.

**Version**- The current version is October, 2006
**References**- pa_lift.ml [6K]

The old camlp4 syntactic extension that interprets the OCaml code as a tagless-final DSLconvolution.ml [3K]

The code for the convolution optimization example described in in the articlegib.ml [9K]

The code for optimizing the Gibonacci example: tactics for the state- and continuation-passing transformationstagless_final.ml [5K]

The modern reconstruction of the 2006 WG2.11 talk, in the tagless-final terms

- Generating memory-bandwidth--optimal code for array computations --
typical of PDE solvers and signal processing kernels -- was one of the
Shonan Challenge benchmarks. The challenge was submitted by Takayuki
Muranushi, an astrophysicist whose work depends on solving PDEs on
truly huge meshes. This article answers the challenge. It develops
a domain-specific language (DSL) for stencil computations,
embedded in (Meta)OCaml. The generated
code attains the bandwidth optimality -- the
`B/F`

ratio -- of the hand-optimized code. The generated code correctly handles the edge cases, which were problematic for the hand-written code. The generated code assuredly has no array bound errors, which also troubled the hand-written code.The stencil EDSL is the straightforward application of the previous, seemingly purely academic, `useless' work on generating optimal Gibonacci codes and the

`let`

-insertion with delimited control. The key insight is regarding a stencil as a sliding-window cache, for real or virtual arrays. The implementation demonstrated that scope extrusion is a problem in practice after all, justifying the research into its static prevention.The main challenge in high-performance computing (HPC) is overcoming the memory bottleneck, reducing the amount of time CPU (local node) spends waiting for data. The speed disparity between the CPU and the memory subsystem is often characterized by the so-called

`B/F`

ratio: the ratio of the number of bytes moved from or to the main memory during a computation to the number of executed floating-point operations. The challenge of the HPC code optimization is to reduce the`B/F`

ratio*and*the amount of needed local memory.Signal processing or finite-element algorithms can often be stated as element-wise computations on shifted arrays. The following is the running example in the challenge.

w = a * S[1] a b = a - w + S[-1] w

Here`a`

is a global input array,`b`

is a global output array, and`w`

is a working array. The operation`S[n]`

shifts the argument vector by`n`

(to the left, if`n`

is positive). All arithmetic operations on vectors (addition, multiplication, subtraction) are elementwise. Global arrays are shared or distributed throughout a supercomputer; reading or writing them requires accessing off-the-chip memory or the inter-node communication.The naive implementation, neglecting for now edge elements, can be written in OCaml as

for i=0 to (n-2) do w.(i) <- a.(i) *. a.(i+1) done; for i=1 to (n-1) do b.(i) <- a.(i) -. w.(i) +. w.(i-1) done

Assuming`w`

is a local array, the first loop reads`2*(n-1)`

elements from the global array`a`

and does`(n-1)`

floating-point additions. The second loop reads`n-1`

elements from`a`

, writes`n-1`

elements to`b`

and does`2*(n-1)`

floating-point computations. With 8-byte floating-point values, the`B/F`

ratio is`32/3`

. Current supercomputers can sustain`B/F`

of about 1 to 2. Therefore, the naive implementation will run the order of magnitude worse the peak performance, because global memory cannot keep up with the CPU. The slow-down will likely to be larger since for large`n`

, the array`w`

won't fit into the local memory (cache).After the array computation finishes, the output array

`b`

becomes the input array for the next `time step', and the computation continues. The`B/F`

ratio can be reduced by doing two time steps at once:w1 = a * S[1] a wm = a - w1 + S[-1] w1 w2 = wm * S[1] wm b = wm - w2 + S[-1] w2

The naive implementation has`B/F`

of`32/6`

-- with twice as many floating-point operations per global memory access. However, the implementation requires three times more local memory.We generate the code that reads and writes only 1 floating-point value per loop iteration while executing 6 floating-point operations, for

`B/F = 16/6`

-- without using any local arrays. The code is expected to run nearly at the peak speed of the supercomputer.The enclosed code develops the EDSL for stencil computation in three steps, starting with the naive implementation generating many duplicate array accesses. The array accesses are then cached in local variables. Finally, a stencil is introduced as a sliding-window cache. For illustration, here is the naive generator of the double-time-step computation:

let t21 max_bounded a b = forlooph 2 min_bounded max_bounded ( let a = gref0 a in let w1 = a *@ ashift 1 a in let wm = a -@ w1 +@ ashift (-1) w1 in let w2 = wm *@ ashift 1 wm in b <-@ wm -@ w2 +@ ashift (-1) w2)

where`*@`

,`+@`

etc. are element-wise operations on arrays,`ashift n`

is the array shift`S[n]`

and`forlooph 2`

is a loop combinator that unrolls the first and the last two iterations. The generator will complain if the halo value is less than 2 because it cannot see (correctly) that array indices are in-bounds. The EDSL code looks quite like the mathematical notation.To generate the optimal code, we merely add

`stencil_memo`

combinators:let t25 max_bounded a b = let p = new_prompt () and q = new_prompt () in push_prompt q (fun () -> forlooph 2 min_bounded max_bounded (with_body_prompt p ( let a = stencil_memo q p (gref0 a) in let w1 = stencil_memo q p (a *@ ashift 1 a) in let wm = stencil_memo q p (a -@ w1 +@ ashift (-1) w1) in let w2 = stencil_memo q p (wm *@ ashift 1 wm) in b <-@ wm -@ w2 +@ ashift (-1) w2)))

The combinators need to know the loop scope (the scope of loop-local variables) and the outer scope (the scope of the variables to store data cross-iteration). So-called prompts`q`

and`p`

denote those scopes. The generated code, shown in the comments in`stencil.ml`

, reads and writes a global array once per iteration and does six floating-point operations, with the`B/F`

of`16/6`

.The hand-optimized code has essentially the same loop body and the same

`B/F`

. However, the first versions of the hand-written code left the elements`0`

,`1`

, and`n-1`

of the output array uninitialized. Furthermore, the code read past the end of the input array. Getting edge cases right is really difficult. The generated code has very long and boring pieces of code before and after the main loop, to properly compute the edge elements. Writing such tedious code by hand is excruciatingly boring. Code generation truly helps.The stencil challenge proved to be a surprisingly good motivation for the research on staging. It required the code generation with effects, in particular, effects crossing binders. Implementing the challenge showed that scope extrusion does happen in practice. Although the fact of scope extrusion is easy to see when compiling the generated code, finding the culprit within the generator turns out difficult.

**Version**- The current version is October, 2012
**References**- stencil-challenge.cpp [6K]

Takayuki Muranushi: PDE challenge

A local copy of the file`problems/tiling/main1.cpp`

committed to the Shonan-challenge GitHub repository on 2012-05-21

The file describes the challenge showing the naive and the hand-optimized code.stencil_hand.ml [4K]

Reference code: specification and the hand-written optimized code

The code correctly handles the edges cases and has unit tests.stencil.ml [28K]

The MetaOCaml code of the solution to the challenge

The file gradually develops the EDSL for stencil computations and has lots of generated sample code and numerical tests.

"The code requires MetaOCaml and the delimcc library.bounded.mli [3K]

bounded.ml [5K]

Partially known dynamic values for safe array indexing

A library for statically tracking bounds on unknown dynamic values.