- Heterogeneous metaprogramming is using a generally higher-level host
language to generate code in a lower-lever object
language. Its appeal is taking advantage of the module system,
higher-order functions, data types, type system and verification
tools of the host language to quicker produce high-performant
lower-level code with some correctness guarantees.
We present two heterogeneous metaprogramming systems whose host language is OCaml and object language is C. The first relies on

*offshoring*: treating a subset of (MetaOCaml-generated) OCaml as a different notation for (a subset of) C. The second embeds C in OCaml in tagless-final style. The systems have been used in several projects, including the generation of C supersets OpenCL and OpenMP.Generating C with some correctness guarantees is far less trivial than it may appear, with pitfalls abound. Not coincidentally, the most subtle ones accompany the introduction of variables into the code. Maintaining the offshoring system has traps of its own. We expound the pitfalls we have came across in our experience, and describe counter-measures.

**References**- The paper published in
``Functional and Logic Programming. FLOPS 2022''

Lecture Notes in Computer Science, vol 13215, 2022, pp. 75-93. Springer, Cham. May 10-12, 2022 (online)

doi:10.1007/978-3-030-99461-7_5

- Offshoring is a facility to translate an imperative, first-order
*subset*of OCaml to C or other low-level language. It is based on an observation that a first-order subset of OCaml may be regarded as a non-standard notation for C. Indeed, consider the vector addition codelet addv = fun n vout v1 v2 -> for i=0 to n-1 do vout.(i) <- v1.(i) + v2.(i) done

It is rather easy to imagine the C code it corresponds to. If we could implement this correspondence, then by generating OCaml we may, in effect, generate C. Offshoring is the library that realizes such a correspondence. Thus,- If we can generate code in a host language with some correctness guarantees, and
- If we can translate the generated host language code into C preserving the guarantees

Offshoring is very different from an OCaml-to-C compiler. The latter has to handle the full OCaml. Offshoring needs to deal only with a small, imperative subset of OCaml, without closures, higher-order functions, algebraic data types, etc. Therefore, it is much simpler than a compiler to implement and maintain, and much easier to ensure correctness. Although the supported subset of OCaml is small, it has proved adequate for numerical, high-performance computing (including OpenCL and OpenMP programming) and embedded system programming.

Offshoring was first proposed by Eckhardt, Kaiabachev, Pasalic, Swadi and Taha (New Generation Computing, 2007). It is completely re-thought and re-implemented in BER MetaOCaml since version N107. The paper Generating C describes this system, also mentioning why the original offshoring became unavailable and what has been done to make the re-implementation to last. Here we elaborate the running example of offshoring, and point to further examples.

The running example is the vector addition code

`addv`

above. The input to offshoring is MetaOCaml generated code: so-called closed code value. In our example, the simplest way to obtain such a value is by enclosing the`addv`

code in MetaOCaml brackets:let addv_staged = .<fun n vout v1 v2 -> for i=0 to n-1 do vout.(i) <- v1.(i) + v2.(i) done>.

Code values can be printed out:val addv_staged : (int -> int array -> int array -> int array -> unit) code = .< fun n_1 vout_2 v1_3 v2_4 -> for i_5 = 0 to n_1 - 1 do Stdlib.Array.set vout_2 i_5 ((Stdlib.Array.get v1_3 i_5) + (Stdlib.Array.get v2_4 i_5)) done>.

We see the original`addv`

, but with the desugared array indexing operations and with renamed identifiers, to ensure uniqueness.To easily support a variety of low-level languages, offshoring actually maps a subset of OCaml to a simple imperative `intermediate language' (IR). One may see the IR code for the running example by

let _ = Offshoring.(offshore (module DefaultConv) addv_staged)

whose result is- : proc_t = ([("n_1", TInt); ("vout_2", TArray1 TInt); ("v1_3", TArray1 TInt); ("v2_4", TArray1 TInt)], TUnit, Block (<abstr>, For {id = "i_5"; ty = TInt; lwb = Const (Const_int 0); upb = LocalVar "n_1"; step = Const (Const_int 1); body = Block (<abstr>, Exp (FunCall ("array1_set", [LocalVar "vout_2"; LocalVar "i_5"; FunCall ("+", [FunCall ("array1_get", [LocalVar "v1_3"; LocalVar "i_5"]); FunCall ("array1_get", [LocalVar "v2_4"; LocalVar "i_5"])])])))}))

Such an IR expression may then be `pretty-printed', say, to C. The function`offshore_to_c`

combines the IR translation with pretty-printing. For example,let _ = Offshoring.offshore_to_c ~name:"addv" addv_staged

produces (on the standard output; the optional`~out`

argument may specify a different output stream)void addv(int n_1,int * vout_2,int * v1_3,int * v2_4){ for (int i_5 = 0; i_5 < n_1; i_5 += 1) (vout_2[i_5]) = (v1_3[i_5]) + (v2_4[i_5]); }

which is how one might have imagined the C code corresponding to the original OCaml`addv`

.Suppose the size of the arrays to add is statically known, and is small enough so that we wish to fully unroll the loop. To generate the unrolled code, we first have to generalize

`addv`

:let iota n = List.init n Fun.id let addvg n vout v1 v2 = iota n |> List.iter (fun i -> vout.(i) <- v1.(i) + v2.(i))

and test on sample arrays that`addvg`

has the same behavior as`addv`

.Trying to offshore

`addvg`

naivelylet addvg_staged_full = .<fun n vout v1 v2 -> iota n |> List.iter (fun i -> vout.(i) <- v1.(i) + v2.(i))>.

gives an error however: the code value (the bracketed expression) contains a first-class function and is hence outside the offshorable subset.If the size

`n`

of the arrays is indeed known at the generation time,`addvg`

may be staged differently:let addvg_staged n = .<fun vout v1 v2 -> .~(iota n |> List.map (fun i -> .<vout.(i) <- v1.(i) + v2.(i)>.) |> seq)>. (* val addvg_staged : int -> (int array -> int array -> int array -> unit) code *)

where`seq : unit code list -> unit code`

builds a code sequence. As the inferred type indicates,`addvg_staged`

takes an integer and produces the code of a function of three array arguments. For example,`let addvg_staged4 = addvg_staged 4`

generates:val addvg_staged4 : (int array -> int array -> int array -> unit) code = .< fun vout_11 v1_12 v2_13 -> Stdlib.Array.set vout_11 0 ((Stdlib.Array.get v1_12 0) + (Stdlib.Array.get v2_13 0)); Stdlib.Array.set vout_11 1 ((Stdlib.Array.get v1_12 1) + (Stdlib.Array.get v2_13 1)); Stdlib.Array.set vout_11 2 ((Stdlib.Array.get v1_12 2) + (Stdlib.Array.get v2_13 2)); Stdlib.Array.set vout_11 3 ((Stdlib.Array.get v1_12 3) + (Stdlib.Array.get v2_13 3))>.

We may again test that the generated code has the behavior of`addv 4`

. Finally, we map it to C:let _ = Offshoring.offshore_to_c ~name:"addv4" addvg_staged4

obtainingvoid addv4(int * vout_11,int * v1_12,int * v2_13){ (vout_11[0]) = (v1_12[0]) + (v2_13[0]); (vout_11[1]) = (v1_12[1]) + (v2_13[1]); (vout_11[2]) = (v1_12[2]) + (v2_13[2]); (vout_11[3]) = (v1_12[3]) + (v2_13[3]); }

Thus,

`addv`

is easy to stage for no unrolling, but`addvg`

for full unrolling. One may wish to write vector addition in such a way so that no unrolling, full unrolling, partial unrolling, scalar promotion etc. can be done without constantly re-writing it. The example`addv.ml`

demonstrates how to write the vector addition once, in a clear, textbook form -- and apply a variety of optimizations in a modular way, reusing or extending earlier optimizations. Eventually, after strip mining and scalar promotion we obtain C code with the OpenBLAS look-and-feel:void addv(int n_16,double * ar_17,double * ar_18,double * ar_19){ const int t_21 = n_16 & -4; const int t_22 = (0 + 3) & -4; for (int i_33 = 0; i_33 < t_22; i_33 += 1) (ar_17[i_33]) = (ar_18[i_33]) + (ar_19[i_33]); for (int i_24 = t_22; i_24 < t_21; i_24 += 4){ const double t_26 = (ar_18[i_24 + 0]) + (ar_19[i_24 + 0]); const double t_28 = (ar_18[i_24 + 1]) + (ar_19[i_24 + 1]); const double t_30 = (ar_18[i_24 + 2]) + (ar_19[i_24 + 2]); const double t_32 = (ar_18[i_24 + 3]) + (ar_19[i_24 + 3]); (ar_17[i_24 + 0]) = t_26; (ar_17[i_24 + 1]) = t_28; (ar_17[i_24 + 2]) = t_30; (ar_17[i_24 + 3]) = t_32; } for (int i_23 = t_21; i_23 < n_16; i_23 += 1) (ar_17[i_23]) = (ar_18[i_23]) + (ar_19[i_23]); }

**References**- offshore_simple.ml [9K]

The code for the running example, among othersAn illustration of the extensibility of offshoring: generating C code that includes calls to library functions (C file stdio, in our case) using library-specific data types (

`FILE`

)

offshore_ext.ml [3K]

The code for the example

file_stub.ml [2K]

The stub for a simple subset of C FILE i/o

offshore_ext_gen.c [<1K]

Testing the generated C codeaddv.ml [15K]

The complete development for the advanced example. It generates`addv_offshored.c`

when made

addv_gen.c [2K]

Testing the generated C codeGenerating C

The paper that presents the new offshoring, describing the attractiveness, and also pitfalls and challenges of treating a subset of OCaml as a notation for C

- Embedding a subset of C into OCaml as a DSL in tagless-final style is
another way to generate low-level code with some correctness
guarantees. At the very least, the generated code should be well-formed,
well-typed and well-scoped -- in other words, it shall compile without
errors or (serious) warnings. Unlike offshoring, which is implicitly
heterogeneous metaprogramming, in the tagless-final embedding
heterogeneous programming is explicit. The paper Generating C
presents it in detail. Here we show the running example, and point out
others.
The running example is vector addition: the same example used for Offshoring. To remind, it is written in OCaml as

let addv = fun n vout v1 v2 -> for i=0 to n-1 do vout.(i) <- v1.(i) + v2.(i) done

We now try to write it as an embedded DSL program; specifically, a program in a statement-oriented DSL. Many low-level languages draw a sharp distinction between an expression and a statement. Consider C:

`x < 10`

is an expression, but`while(x < 10) {x = foo();}`

is a statement. Only expressions are allowed as the terminating condition of while loop. An expression always yields a value; a statement often does not. Still, there are value-yielding statements:`return`

.By examining the vector addition code, we determine the DSL operations needed to write it:

module type cde = sig type 'a exp (* Abstract type of an expression *) val int : int -> int exp val ( + ) : int exp -> int exp -> int exp val ( - ) : int exp -> int exp -> int exp type 'a stm (* A statement *) val for_ : int exp -> int exp -> (int exp -> unit stm) -> unit stm type 'a arr (* An array (variable) *) val array_get : 'a arr -> int exp -> 'a exp val array_set : 'a arr -> int exp -> 'a exp -> unit stm type 'a proc_t (* The complete function *) val mk3arr : (int exp -> int arr -> int arr -> int arr -> unit stm) -> (int -> int array -> int array -> int array -> unit) proc_t end

In the tagless-final style, an embedded DSL is specified as an OCaml signature. It defines types for DSL categories, and combinators that correspond to DSL expression forms. The type`'a exp`

represents DSL expressions of type`'a`

;`'a stm`

statements, and`'a proc_t`

complete procedures. The constructor`mk3arr`

, the constructor of procedures taking three array arguments, looks too particular. We can generalize later. The strength of the tagless-final approach is extensibility: therefore, we may start with a particular and add more general when and if needed.The

`cde`

signature indeed lets us express vector addition:module ADDV(C:cde) = struct open C let addv_dsl = mk3arr @@ fun n vout v1 v2 -> for_ (int 0) (n - int 1) @@ fun i -> array_set vout i @@ array_get v1 i + array_get v2 i end

It looks quite like the OCaml`addv`

code, after desugaring (of array indexing operators). With custom indexing operators one can make the DSL even more resembling`addv`

.One obvious implementation of the signature is metacircular:

`'a exp`

,`'a stm`

and`'a proc_t`

are all just`'a`

. That is, the DSL is OCaml itself. Such an implementation is useful for testing. We leave it as an exercise to the reader.Another implementation is in terms of the MetaOCaml code:

module MOCde : (cde with type 'a proc_t = 'a code) = struct type 'a exp = 'a code let int = fun x -> .<x>. let ( + ) = fun x y -> .<.~x + .~y>. let ( - ) = fun x y -> .<.~x - .~y>. type 'a stm = 'a code let for_ lwb upb body = .<for i= .~lwb to .~upb do .~(body .<i>.) done>. type 'a arr = 'a array code let array_get a i = .< (.~a).(.~i) >. let array_set a i v = .< (.~a).(.~i) <- .~v >. type 'a proc_t = 'a code (* The complete function *) let mk3arr body = .<fun n vout v1 v2 -> .~(body .<n>. .<vout>. .<v1>. .<v2>.)>. end

It is in effect a compiler, from the DSL to OCaml. For example,`let module M = ADDV(MOCde) in M.addv_dsl`

produces- : (int -> int array -> int array -> int array -> unit) code = .< fun n_1 vout_2 v1_3 v2_4 -> for i_5 = 0 to n_1 - 1 do Stdlib.Array.set vout_2 i_5 ((Stdlib.Array.get v1_3 i_5) + (Stdlib.Array.get v2_4 i_5)) done>.

which is*identical*to the result`addv_staged`

in the Offshoring example earlier.But we are interested in C. Let's write the implementation of

`cde`

that generates C (repetitive definitions are elided; see the complete code for detail):module CCde = struct type 'a exp = string let int = string_of_int let ( + ) = Printf.sprintf "(%s + %s)" type 'a stm = string let for_ lwb upb body = let i = gensym "i" in "for(int " ^ i ^ "=" ^ lwb ^ "; " ^ i ^ "<=" ^ upb ^ "; " ^ i ^ "++){\n" ^ " " ^ body i ^ "}" type 'a arr = string let array_get = Printf.sprintf "%s[%s]" ... end

Now,let _ = let module M = ADDV(CCde) in print_endline M.addv_dsl

outputsvoid addv(int n_1, int* vout_2, int* v1_3, int* v2_4){ for(int i_5=0; i_5<=(n_1 - 1); i_5++){ vout_2[i_5]=(v1_3[i_5] + v2_4[i_5]);} }

This is exactly the`addv`

C code produced by Offshoring. We have generated both OCaml and C code from the same DSL expression (module`ADDV`

).There is a lot of room for improvement. For example,

`printf`

and string concatenations are an eye sore. Not for the DSL user however: They are all hidden behind the abstraction layer. The abstraction layer also guarantees the type safety, and the absence of unbound variable names.To help a DSL implementator, however, it makes sense to generate a C AST, to pretty-print at the end -- which is what we do in the real project.

Let's consider the first extension: mutable variables. They are not as simple as they appear (see the separate web page on this topic). The culprit, as usual, is aliasing. Mutable variables presented a big problem for offshoring, which was solved just recently. The signature below incorporates the learned lessons: the main is distinguishing mutable variables, represented by the dedicated type

`'a mut`

. From the DSL point of view, this is the*syntactic*distinction: Mutable variables are not expressions. Therefore, mutable variables in our DSL (as in C, Pascal, etc.) are not first class: one cannot pass them as an argument or return as a result. One may only dereference of assign to them.module type cdemut = sig include cde val float : float -> float exp (* also add floats for the sake of example *) val ( +. ) : float exp -> float exp -> float exp type 'a mut (* mutable variables *) val (let*) : 'a exp -> ('a mut -> 'w stm) -> 'w stm val dref : 'a mut -> 'a exp val (:=) : 'a mut -> 'a exp -> unit stm val (@.) : unit stm -> 'a stm -> 'a stm (* sequencing *) val ret : 'a exp -> 'a stm (* return *) val mk1arr : (int exp -> float arr -> 'a stm) -> (int -> float array -> 'a) proc_t end

The types make it clear that mutable variables may only be introduced in the statement context. Aliasing is impossible: an expression like

let* x = 0 in let* y = x in ...

gives a type error pointing at the second x:Error: This expression has type int mut but an expression was expected of type 'a exp

We may only writelet* x = 0 in let* y = dref x in ...

where it is clear that`y`

is not an alias of`x`

; it is a new mutable DSL variable initialized to the current value of`x`

.The extended

`cdemut`

signature lets us write the next example: vector summation.module VSUM(C:cdemut) = struct open C let vsum = mk1arr @@ fun n v -> let* sum = float 0. in begin for_ (int 0) (n - int 1) @@ fun i -> sum := dref sum +. array_get v i end @. ret (dref sum) end

Since defining new variables in C requires writing their type, our compiler has to perform type inference, which is straightforward. (Again, some definitions are elided for clarity.)

module CMutCde : (cdemut with type 'a proc_t = string) = struct type 'a typ = | TInt : int typ | TFloat : float typ | TUnit : unit typ let string_of_typ : type a. a typ -> string = function | TInt -> "int" | TFloat -> "double" | TUnit -> "void" type 'a exp = 'a typ * 'a CCde.exp let int x = (TInt, string_of_int x) (* typing rule *) let trule2 : type a b c. a typ -> b typ -> c typ -> (a CCde.exp -> b CCde.exp -> c CCde.exp) -> (a exp -> b exp -> c exp) = fun ty1 ty2 tyr f -> fun (_,x) (_,y) -> (tyr, f x y) let ( + ) = trule2 TInt TInt TInt CCde.( + ) let ( +.) = trule2 TFloat TFloat TFloat (Printf.sprintf "(%s + %s)") type 'a stm = 'a typ * 'a CCde.stm let for_ (TInt,lwb) (TInt,upb) body = (TUnit, CCde.for_ lwb upb (fun i -> let (TUnit,b) = body (TInt,i) in b)) type 'a arr = 'a typ * 'a CCde.arr let array_get (ty, a) (TInt, i) = (ty, CCde.array_get a i) let ret (t,x) = (t, Printf.sprintf "return %s;" x) let (@.) (TUnit,s1) (t,s2) = (t,Printf.sprintf "%s\n%s" s1 s2) type 'a mut = 'a typ * string let (let*) : 'a exp -> ('a mut -> 'w stm) -> 'w stm = fun (t,x) body -> let v = gensym "x" in let (tb,b) = body (t,v) in (tb,Printf.sprintf "%s %s = %s;\n%s" (string_of_typ t) v x b) let dref (t,v) = (t,v) let (:=) (_,v) (_,x) = (TUnit, Printf.sprintf "%s = %s;" v x) end

Since

`CMutCde`

is an extended interpreter, it can interpret the earlier code like`ADDV`

, with the same result. For vector summation,let _ = let module M = VSUM(CMutCde) in print_endline M.vsum

we obtaindouble sumv(int n_11, double* v_12){ double x_13 = 0.; for(int i_14=0; i_14<=(n_11 - 1); i_14++){ x_13 = (x_13 + v_12[i_14]);} return x_13; }

We can also implement the signature

`cdemut`

in terms of the Offshoring IR (see the paper), and hence take the full advantage of its pretty-printers, to C or other low-level languages. **References**- tf_simple.ml [13K]

The complete code, corresponding to the simple examples of offshoring, offshore_simple.ml [9K]