Module Extensibility and Separate Compilation

 

The two benefits of modules -- safe re-linking with a different implementation of an interface, and non-destructive library extension free of copy-paste -- become problematic when interfaces, their implementations and the client code are all separately compiled. On an example abstracted from a real application, we explain the problems and the workarounds we found, which lead to two simple proposals for adjusting separate compilation in OCaml.

 

Introduction

The ML module system, as a tool for abstraction, is particularly useful for large-scale development. Separate compilation, a characteristic feature of OCaml from the very beginning, notably adds to scalability and to enjoyment. The interaction of the module system with separate compilation is not all smooth, however. For example, although a signature or a structure can be a compilation unit of its own (as an .mli or .ml file, respectively), a functor cannot. This article describes two problems of separate compilation hindering extensibility and incremental development: re-linking with another implementation of a library, and extending a library with new operations. We consider it unacceptable to edit the existing code in order to use a different/extended library implementation; when making an extension we want to keep the old source as it is, and only refer to it rather than copy it. Ideally, any recompilation of the existing code should be avoided.

To be concrete, suppose we have an interface LA.mli and its two implementations, EvalA.ml and PpA.ml. There is no LA.ml. We want to write the user code ExA.ml treating LA as an ordinary library:

    open LA
    ... code using LA operations ...
compile that code to ExA.cmo, and link this compiled ExA.cmo to either EvalA.cmo or PpA.cmo implementations of LA. Although straightforward in many languages, e.g., C, it is next to impossible in OCaml. (In all fairness, it is so simple in C because C does not have a module system.) For the next problem, suppose EvalA.ml has a public interface EvalA.mli abstracting its implementation details. Adding a new operation to both the interface and the implementation -- not touching or copying the original files -- is again next to impossible, when implementing the new operations requires the details abstracted away by the EvalA.mli interface.

The problematic interactions of the module system with separate compilation came to the fore in the course ``Compilers: Incrementally and Extensibly''. Although workarounds were eventually found, described below, they leave a sense of dissatisfaction, even if purely aesthetic. We make two proposals for what might be a satisfactory solution.

I hasten to say that the problems are not of the module system per se but of its interaction with separate compilation. In short, a module (structure) can be ascribed several signatures (with different levels of abstraction); likewise, a single signature can be ascribed to several structures (i.e., different implementations of the signature). Alas, separate compilation imposes a one-to-one correspondence between .ml (module) and .mli (signature) files, which hinders extensibility.

References

Compilers: Incrementally and Extensibly
The compiler course

 

Running example

This section illustrates how the ideal of extensibility, of incremental development and improvement, is attained. The many-to-many relation between signatures and structures (modules) naturally comes up. The running example is a glimpse of the Compilers course: the extensibility we are about to see is exactly the extensibility pursued in the course. All development in this section is kept to one file, however.

The running example deals with DSL embedding: it is the prototypical example of the tagless-final style. The DSL here has only integers and subtraction: just enough to make the point. Formally, its syntax is defined by the following signature:

    module type LA = sig
      type repr
      val int : int -> repr
      val sub : repr -> repr -> repr
      val observe : repr -> unit
    end
To wit, DSL expressions are represented as OCaml values of the abstract type repr, produced by the operations int and sub. The (completed) expressions may also be observed, that is, printed out. (The observation type could be more interesting; for our exposition, printing suffices.)

Having defined the DSL, we may already write its expressions, as follows. They are parameterized by the implementation of the LA signature -- the DSL interpreter.

    module ExA(L:LA) = struct
      open L
      let term = sub (sub (int 4) (int 0)) 
                     (sub (int 0) (int (-1)))
    end

One may imagine many implementations of LA. The first that probably comes to mind is a meta-circular evaluator, which maps DSL operations to OCaml's.

    module EvalA = struct
      type repr = int
      let int x = x
      let sub   = (-)
      let observe = Printf.printf "The result: %d\n"
    end
Interpreting the sample expression ExA using EvalA, as
    let module M = ExA(EvalA) in EvalA.observe M.term
prints the expected result 3.

It does not take long to think of other implementations of LA: e.g., to pretty-print its expressions.

    module PpA = struct
      open Seq
      type repr = string t
      let int x = string_of_int x |> return
      let paren e = ...
      let sub x y = append x (cons " - " y) |> paren
      let observe x = ...
    end
Interpreting the same ExA using PpA, as
    let module M = ExA(PpA) in PpA.observe M.term
prints ((4 - 0) - (0 - -1)) as the result.
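The bodies of paren and observe are elided in the listing above. For readers who want something to run, here is a minimal self-contained sketch of a pretty-printer for the same signature. It is a stand-in, not the PpA of the accompanying code: the module name PpStr and the choice of a plain string for repr (instead of string Seq.t) are assumptions made only for this illustration.

```ocaml
module type LA = sig
  type repr
  val int : int -> repr
  val sub : repr -> repr -> repr
  val observe : repr -> unit
end

module ExA (L : LA) = struct
  open L
  let term = sub (sub (int 4) (int 0))
                 (sub (int 0) (int (-1)))
end

(* Hypothetical stand-in for PpA: repr is a plain string here,
   not a string Seq.t as in the listing above *)
module PpStr = struct
  type repr = string
  let int x = string_of_int x
  let paren e = "(" ^ e ^ ")"
  let sub x y = paren (x ^ " - " ^ y)
  let observe x = print_endline x
end

(* Interpret the sample expression with the stand-in printer *)
module M = ExA (PpStr)
let () = PpStr.observe M.term
```

Running it prints ((4 - 0) - (0 - -1)), the same text reported for PpA.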

We have thus seen two modules (structures) -- EvalA and PpA -- implementing the same signature, LA. In the Compiler class, the type checker, code generator, etc. are all interpreters of the signature that defines the source language.

Let us extend our DSL with another operation: multiplication. First we extend the language definition:

    module type LB = sig
      include LA
      val mul: repr -> repr -> repr
    end
The module system lets us literally write what we mean: take an existing collection of definitions and add to it. The older version remains as it was: not modified and not copied. We may now write DSL expressions with multiplication (and reuse earlier expressions as they are):
    module ExB(L:LB) = struct
      open L
      module EA = ExA(L)
      let term = mul EA.term (int 2)
    end

Extending the evaluator is just as simple as extending the language definition: merely adding the interpretation of the new operation:

    module EvalB = struct
      include EvalA
      let mul = ( * )
    end
Since EvalB is an extension of EvalA, it can interpret the old example ExA, with the same result:
    let module M = ExA(EvalB) in EvalB.observe M.term
In other words, we have `linked' the existing user code ExA -- as is, without any modifications -- with an enhanced/improved implementation of the LA library. Needless to say, EvalB also interprets the extended example ExB:
    let module M = ExB(EvalB) in EvalB.observe M.term
printing 6 as the result. We have just witnessed ascribing the same module, EvalB, two different signatures: LA (so that ExA, which requires as its argument a module of the LA signature, can be applied to it) and LB.
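The two ascriptions happen implicitly, at the functor applications. The sketch below repeats the definitions from this section in self-contained form and also writes the ascriptions out explicitly. The module names AsLA, AsLB, MA and MB are introduced here only for illustration.

```ocaml
module type LA = sig
  type repr
  val int : int -> repr
  val sub : repr -> repr -> repr
  val observe : repr -> unit
end

module type LB = sig
  include LA
  val mul : repr -> repr -> repr
end

module EvalA = struct
  type repr = int
  let int x = x
  let sub = ( - )
  let observe = Printf.printf "The result: %d\n"
end

module EvalB = struct
  include EvalA
  let mul = ( * )
end

module ExA (L : LA) = struct
  open L
  let term = sub (sub (int 4) (int 0))
                 (sub (int 0) (int (-1)))
end

module ExB (L : LB) = struct
  open L
  module EA = ExA (L)
  let term = mul EA.term (int 2)
end

(* Implicit ascription at functor application *)
module MA = ExA (EvalB)   (* EvalB viewed at signature LA *)
module MB = ExB (EvalB)   (* EvalB viewed at signature LB *)

(* Explicit ascription of the same structure (hypothetical names) *)
module AsLA : LA = EvalB  (* mul is hidden, repr is abstract *)
module AsLB : LB = EvalB  (* mul is visible, repr is abstract *)

let () = EvalB.observe MA.term; EvalB.observe MB.term
```

EvalB itself stays unsealed, so MA.term and MB.term are still known to be integers (3 and 6); through AsLA and AsLB the same values could only be observed.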

Extending PpA to obtain PpB is just as straightforward:

    module PpB = struct
      include PpA
      open Seq
      let mul x y = append x (cons " * " y) |> paren
    end
The extended PpB can pretty-print the old ExA (with the same result) and now ExB.

We have thus seen what extensibility means, concretely, and how the many-to-many correspondence between modules and signatures plays into it. All the code was in a single file, however -- compiled together rather than separately.

References

warmup.ml [3K]
The complete code for the example: all in one file.

 

One signature, several implementations

As DSLs grow large, keeping their definitions and implementations all in one file becomes cumbersome. Separating them into several files helps with focus, collaboration, version control, and, ideally, faster (re)builds. This section deals with the separate compilation of the first part of the running example: the two implementations EvalA and PpA of the signature LA, and the linking with the user code ExA. The user code is also compiled separately, unaware of the implementations, and should be linkable with either without touching or even recompiling it. Alas, it is not actually linkable -- not without touching the code base by changing or adding source files. Inevitably, re-linking requires re-compilation. The section describes why, how to work around it -- and how OCaml could be changed to obviate the workarounds.

Separating interfaces and implementations into files

We start by separating the running example code into files -- and observing that the result is far from satisfactory.

The DSL definition goes into the file LA.mli:

    type repr
    val int: int -> repr
    val sub: repr -> repr -> repr
    val observe: repr -> unit
Each implementation also gets its own file. The evaluator is in the file EvalA.ml with the content
    type repr = int
    let int x = x
    let sub   = (-)
    let observe = Printf.printf "The result: %d\n"
and the pretty-printer in PpA.ml. The file LA.mli is compiled as if its content were wrapped into module type LA = sig ... end. Likewise, EvalA.ml is compiled assuming the wrapper module EvalA = struct ... end around it. This assumption is the convenient syntax sugar provided by OCaml.

Alas, this syntax sugar does not extend to functors, such as ExA -- the users of our DSL. A functor may of course be placed in a file, say, ExAFunc.ml:

    module ExA(L: module type of LA) = struct
      open L
      let term = sub (sub (int 4) (int 0)) 
                     (sub (int 0) (int (-1)))
    end
As just explained, it is compiled as if it were wrapped into module ExAFunc = struct ... end. That is, the functor is compiled not as top-level, so to speak, but as a part of another module. The distinction shows up in linking.

To build the complete program one has to explicitly apply the functor ExA to a suitable implementation of LA, say, EvalA. We need a linking file, so to speak: ExAEval.ml, as follows.

    let module M = ExAFunc.ExA(EvalA) in EvalA.observe M.term
and a similar file ExAPp.ml for applying the pretty-printer. Assuming all .mli and .ml files are already compiled, the following command line builds the whole program, for the EvalA implementation of LA.
    ocamlc EvalA.cmo ExAFunc.cmo ExAEval.cmo
To use the PpA implementation, we build as
    ocamlc PpA.cmo ExAFunc.cmo ExAPp.cmo
Pro: Linking is done using OCaml, as a functor application.
Pro: This approach permits several applications of ExA in a single program. For example, concatenating ExAEval.ml and ExAPp.ml into ExABoth.ml and building with it gives the program that shows the results of both evaluating and pretty-printing ExA's expression.
Con: Building the program requires a linking file such as ExAEval for the functor application. To use ExA with the PpA implementation, we have to introduce a new linking file ExAPp.ml containing a copy of ExAEval.ml with PpA substituted for EvalA -- or modify ExAEval.ml in place. Both choices -- cut-and-paste with substitution and especially destructive modification -- are unappealing.
Con: ExAFunc.ml is compiled as a functor: therefore, calls to LA operations are indirect. The compiled ExAFunc.cmo cannot benefit from link-time optimizations.
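The file layout of this subsection can be emulated in a single file by writing out the implicit wrappers by hand, which makes the nesting of the functor visible. This is only a sketch under two assumptions: the signature is written as module type LA rather than referred to as module type of LA, and PpA is a hypothetical string-based stand-in for the pretty-printer, whose real code is not reproduced here.

```ocaml
(* LA.mli, compiled as if wrapped in a module type *)
module type LA = sig
  type repr
  val int : int -> repr
  val sub : repr -> repr -> repr
  val observe : repr -> unit
end

(* EvalA.ml, compiled as if wrapped in module EvalA = struct ... end *)
module EvalA = struct
  type repr = int
  let int x = x
  let sub = ( - )
  let observe = Printf.printf "The result: %d\n"
end

(* PpA.ml: hypothetical string-based stand-in for the pretty-printer *)
module PpA = struct
  type repr = string
  let int x = string_of_int x
  let sub x y = "(" ^ x ^ " - " ^ y ^ ")"
  let observe x = print_endline x
end

(* ExAFunc.ml: the functor ExA ends up nested inside the module ExAFunc *)
module ExAFunc = struct
  module ExA (L : LA) = struct
    open L
    let term = sub (sub (int 4) (int 0))
                   (sub (int 0) (int (-1)))
  end
end

(* ExABoth.ml: both linking files in one program. Each application is
   wrapped in let () = so that the two phrases can follow each other. *)
let () = let module M = ExAFunc.ExA (EvalA) in EvalA.observe M.term
let () = let module M = ExAFunc.ExA (PpA) in PpA.observe M.term
```

The program prints the evaluation result followed by the pretty-printed expression, showing two applications of the same functor side by side.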

Avoiding Functors

Wouldn't it be great to compile the user code ExA as a top-level module rather than a functor? For example, as the file ExA.ml:
    open LA
    
    let term = sub (sub (int 4) (int 0)) 
                   (sub (int 0) (int (-1)))
    
    let () = observe term
Not only is having the definitions at top level aesthetically pleasing: LA operations are now compiled as direct calls.

ExA.ml actually compiles, even though LA.ml does not exist: to compile the user code of a library we only need the library interface, LA.cmi. Regretfully, the straightforward linking of ExA.cmo with an LA implementation such as EvalA.cmo fails:

    ocamlc EvalA.cmo ExA.cmo
    
    Error: Module `LA' is unavailable (required by `ExA')
If we examine ExA.cmo using ocamlobjinfo, we see
    Unit name: ExA
    Interfaces imported:
    	79b0e9d3b6f7fed07eb3cc2abb961b91	Stdlib
    	d9378d8b5a64375e0a4765907a7028ed	LA
    	bf853957655a3a1eb3caac1964887180	ExA
    	8f8f634558798ee408df3c50a5539b15	CamlinternalFormatBasics
    Required globals:
    	LA
ExA.ml imports LA and uses its operations; predictably ExA.cmo contains the reference to this interface: to its name and the hash. (The hash is computed when compiling LA.mli and stored, along with the interface name, in LA.cmi). An implementation of LA would likewise tell the name/hash of the interface it provides. Name/hash matching is enough to ensure coherence, that is, a linked implementation indeed providing the required interface. ExA.cmo, however, not only refers to imported interfaces (i.e., interfaces whose implementations are required) -- but also to a specific implementation of the LA interface, also named LA. Therefore, ExA.cmo can be linked only with LA.cmo -- and not with any other module that may implement the LA interface.

Such a rigidity -- insisting on linking with a particular named module rather than any provider of the required interface -- is strange. It is a consequence of an old design decision that the correspondence between separately-compiled modules and the provided interfaces be one-to-one.

Below we show how to work around this design decision and allow different modules to serve as implementations of the same signature: in effect, to link the user code with different library implementations.

Linking with an arbitrary implementation

OCaml has made the design decision that a separately-compiled implementation of an interface, such as LA, must be in the .cmo file specifically named LA.cmo. Therefore, to link with the EvalA implementation of the LA interface, we have no other choice but to produce the file named LA.cmo. Hence the work-around:
    ocamlc -c -o LA.cmo EvalA.ml  # compiling an implementation of LA
    ocamlc LA.cmo ExA.ml          # linking
These two commands indeed produce an executable with ExA using the EvalA implementation. To use the PpA implementation instead, one has to build the executable as
    ocamlc -c -o LA.cmo PpA.ml    # compiling an implementation of LA
    ocamlc LA.cmo ExA.ml          # linking

The reader has probably noticed ExA.ml rather than the expected ExA.cmo in the linking step. That is, every time we link with a new implementation of LA, we have to re-compile the user code. Re-linking inevitably requires re-compilation. The user code has to be re-compiled because the interface LA.cmi it depends on changes when compiling the implementation:

    ocamlc -c -o LA.cmo EvalA.ml
Given the existing LA.cmi, one would expect the compiler to check here whether EvalA satisfies it: that is, whether the LA signature can be ascribed to EvalA. The compiler (OCaml 4.14), however, does something quite strange: it deletes the existing LA.cmi, without warning, and makes a new one, based on the module type of EvalA. Instead of ascribing a signature to an implementation, the compiler changes the signature to match the implementation. This is a strange behavior of the current OCaml system, which we propose to eliminate.
Con: The approach applies so long as we do not need two different implementations of a signature in a single program.
Pro: Re-linking with a new implementation does not change the code base: no extant source code files are modified and no new source code files are created.
Con: Although the source code is not modified, it has to be re-compiled when linking with a different library implementation.

Our work-around here is only partial: re-compilation is still needed for re-linking.
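The difference between ascribing a signature and regenerating it from the implementation can be seen within the language itself. The following single-file sketch is only an analogy for the file-level behavior described above; the module names Sealed and Unsealed and the value leaked are hypothetical.

```ocaml
module type LA = sig
  type repr
  val int : int -> repr
  val sub : repr -> repr -> repr
  val observe : repr -> unit
end

module EvalA = struct
  type repr = int
  let int x = x
  let sub = ( - )
  let observe = Printf.printf "The result: %d\n"
end

(* What one would expect from compiling EvalA.ml against an existing
   LA.cmi: the signature is ascribed, and repr stays abstract *)
module Sealed : LA = EvalA

(* Analog of what the text reports instead: the interface is derived
   from the implementation, so nothing is hidden *)
module Unsealed = EvalA

(* Accepted: user code can now rely on repr being int *)
let leaked : int = Unsealed.sub (Unsealed.int 4) (Unsealed.int 1) + 1

(* The same line written with Sealed is rejected by the type checker,
   since Sealed.repr is abstract:
   let _ = Sealed.sub (Sealed.int 4) (Sealed.int 1) + 1 *)
let () = Sealed.observe (Sealed.sub (Sealed.int 4) (Sealed.int 1))
```

In this analogy, user code written against Unsealed may come to depend on the representation, which is one reason why it has to be re-compiled whenever the implementation is swapped.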

Effectively ascribing an interface to implementations

Given an interface such as LA.mli and its (purported) implementations EvalA.ml and PpA.ml, one would think the following would try to ascribe the interface to an implementation
    ocamlc -c LA.mli              # producing LA.cmi
    ocamlc -c -o LA.cmo EvalA.ml
If the compilation succeeded, the resulting LA.cmo may then be linked to any user of LA interface. As we have just seen, that does not work.

It is possible nevertheless to ascribe a separately compiled interface to separately compiled and arbitrarily named implementations -- that is, to work around the one-to-one correspondence between a separately-compiled implementation and its interface. In fact, it is possible in two different ways (although neither is fully satisfactory).

The first method uses symbolic links. Assume that LA.cmi already exists and the user code ExA.cmo is compiled against it.

    ln -s EvalA.ml LA.ml   # create the file LA.ml with the same contents as EvalA.ml
    ocamlc -c LA.ml        # check that LA.ml satisfies LA.cmi, and produce LA.cmo
    ocamlc LA.cmo ExA.cmo  # linking

The second method relies on explicit interface files for each implementation (explained in more detail in the next section). Again, assume that LA.cmi already exists and the user code ExA.cmo is compiled against it. Also assume the file LA-incl.mli with the single line:

    include module type of LA
The build is performed as follows:
    ln -s LA-incl.mli EvalA.mli  # make EvalA.mli, effectively equal to LA.mli
    ocamlc -c EvalA.mli          # make EvalA.cmi
    ocamlc -c -o LA.cmo EvalA.ml
    ocamlc LA.cmo ExA.cmo        # linking
The last-but-one compilation command ascribes the signature EvalA (which is effectively LA) to the module EvalA, and compiles it under the name LA.cmo.
Con: The approach applies so long as we do not need two different implementations of a signature in a single program.
Pro: No need to re-compile the user code when re-linking with a different implementation.
Con: Although we do not touch the source code, we do touch the code base by making a symbolic link. Building the project becomes more complicated.

The work-around leads to an actionable proposal.

Immediately actionable proposal

When compiling A.ml under a different name B.cmo, as in
    ocamlc -c -o B.cmo A.ml
check whether B.mli exists (and, if so, that it has been compiled to B.cmi) and ascribe this interface to B.cmo. In other words: when compiling A.ml under a different name B.cmo, behave exactly as if the source A.ml were named B.ml.

Larger proposal

External references in a compiled module (.cmo) should refer to required interfaces rather than to required globals (module names). In other words, since the one-to-one correspondence between compiled modules and their interfaces can be worked around, there is no sense in clinging to it. Separate compilation should not restrict ascribing a signature to an implementation.
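The difference between the two ways of resolving external references can be made concrete with a toy model. The record types and function names below are invented for this sketch and do not reflect the actual layout of .cmo files or the code of the OCaml linker; the two hashes are copied from the ocamlobjinfo listing shown earlier. The model assumes, as suggested earlier in the text, that an implementation records the name and hash of the interface it provides.

```ocaml
(* Toy model only: these records are NOT the real .cmo layout *)
type iface = { iname : string; hash : string }

type compiled_unit = {
  unit_name : string;             (* name of the unit, e.g. "EvalA" *)
  provides : iface;               (* interface the unit implements *)
  needs : iface list;             (* interfaces whose implementations are required *)
  required_globals : string list; (* unit names required, as in ocamlobjinfo *)
}

(* The LA interface with its hash from the ocamlobjinfo listing *)
let la = { iname = "LA"; hash = "d9378d8b5a64375e0a4765907a7028ed" }

(* Assumption: EvalA was checked against LA and records that fact *)
let eval_a =
  { unit_name = "EvalA"; provides = la; needs = []; required_globals = [] }

let ex_a =
  { unit_name = "ExA";
    provides = { iname = "ExA"; hash = "bf853957655a3a1eb3caac1964887180" };
    needs = [ la ];
    required_globals = [ "LA" ] }

(* Resolution by unit name: roughly the behavior described in the text *)
let links_by_name units u =
  List.for_all
    (fun g -> List.exists (fun v -> v.unit_name = g) units)
    u.required_globals

(* Resolution by interface name and hash: the proposed behavior *)
let links_by_interface units u =
  List.for_all
    (fun i -> List.exists (fun v -> v.provides = i) units)
    u.needs

let () =
  Printf.printf "by name: %b, by interface: %b\n"
    (links_by_name [ eval_a ] ex_a)
    (links_by_interface [ eval_a ] ex_a)
```

In the model, linking ExA against EvalA fails when references are resolved by unit name and succeeds when they are resolved by the provided interface, which is the change the proposal asks for.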

References

0README.dr [<1K]
The complete source code (in the same directory as the index file)

 

Adding to the signature and the implementation

This section tackles the separate compilation of the second part of the running example: extending the language (LA to LB) and its implementations (EvalA to EvalB, and similarly for Pp). The one-to-one correspondence of a separately compiled implementation to its signature is the problem here as well -- which can be worked around. The work-around is used extensively in the Compiler course -- so extensively that a custom build system has been written around it. The workaround is rather simple; something similar could be incorporated in OCaml.

As a preliminary step, let us fix, if not a problem, then a blemish in the earlier examples. Modules EvalA and PpA are meant to be implementations of LA. That intention, however, was not made explicit to the compiler, and hence cannot be checked at the time of separately compiling EvalA.ml and PpA.ml. If an implementation does not really match the interface, the error is reported only when linking with the user code. To report such errors earlier, when compiling the implementation -- and to make clear to ourselves the interface the module EvalA is meant to fulfill -- we should have created EvalA.mli. Since EvalA.ml is to be an implementation of the LA signature, EvalA.mli should be a copy of LA.mli, or better, a reference to it. That is, EvalA.mli contains the single line:

    include module type of LA
Had EvalA.ml omitted, say, the int operation, it would no longer compile: the operation int is required by EvalA.mli (that is, LA.mli).

At first, the extension of the interface and implementation seems straightforward, just as in the non-separate compilation. We introduce LB.mli containing

    include module type of LA
    val mul: repr -> repr -> repr
(optionally, a file EvalB.mli with a copy of it), and the file EvalB.ml adding mul to EvalA:
    include EvalA
    let mul : repr -> repr -> repr  = ( * )
Alas, EvalB.ml does not compile:
    3 | let mul : repr -> repr -> repr  = ( * )
                                          ^^^^^
    Error: This expression has type int -> int -> int
           but an expression was expected of type repr -> repr -> repr
           Type int is not compatible with type repr = EvalA.repr 
The signature EvalA (which is equal to LA) ascribed to the included EvalA made the type repr abstract. To implement mul on the type repr, however, we need to know the concrete type of repr, and be sure it is int.

To use EvalA, it behooves us to ascribe it a signature that hides the implementation. But to extend EvalA, we need a signature that exposes full detail. We really need to ascribe different signatures to the same module. Although the OCaml module system has this ability, the separate compilation does not. For each .ml file there may be only one .mli file (specified by the user, or made implicitly by the compiler), with the signature to ascribe to the corresponding .ml module. Once ascribed, the signature cannot be removed, and a more transparent signature cannot be ascribed.

The work-around is straightforward, as before: if EvalA.ml may take only one ascribed signature, EvalA.mli, then to ascribe another we have no choice but to give the file EvalA.ml another name, say, EvalA_impl.ml. Many file systems allow aliasing. The extended implementation, EvalB.ml, should then include EvalA_impl. Overall:

    ln -s EvalA.ml EvalA_impl.ml   # Alias EvalA.ml as EvalA_impl.ml
    ocamlc -c EvalA_impl.ml        # make EvalA_impl.cmo
    ocamlc -c EvalB.ml             # Compile the extension to EvalA_impl.cmo
The last-but-one command compiles EvalA_impl.ml ascribing the (default) signature, fully exposing the implementation details. Compiling EvalB.ml in the last command then succeeds since the concrete type of repr is exposed as int.
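The effect of the alias can be reproduced within a single file, where one structure may be given both a transparent and an opaque view. The sketch below mirrors the file-level work-around under stated assumptions: EvalB_impl is a hypothetical name for the unsealed extension, and the comments map each module to the file it stands for.

```ocaml
module type LA = sig
  type repr
  val int : int -> repr
  val sub : repr -> repr -> repr
  val observe : repr -> unit
end

module type LB = sig
  include LA
  val mul : repr -> repr -> repr
end

(* EvalA_impl.ml: compiled without an .mli, so repr = int is visible *)
module EvalA_impl = struct
  type repr = int
  let int x = x
  let sub = ( - )
  let observe = Printf.printf "The result: %d\n"
end

(* EvalA.ml with EvalA.mli: the same code, sealed for ordinary users *)
module EvalA : LA = EvalA_impl

(* EvalB.ml: extends the transparent version. Writing include EvalA
   here instead reproduces the type error shown earlier, because
   EvalA.repr is abstract. *)
module EvalB_impl = struct
  include EvalA_impl
  let mul : repr -> repr -> repr = ( * )
end

(* EvalB.mli: users of the extension see only LB *)
module EvalB : LB = EvalB_impl

let () =
  EvalB.observe (EvalB.mul (EvalB.sub (EvalB.int 4) (EvalB.int 1)) (EvalB.int 2))
```

Within one file the module system allows both views without any renaming; the symbolic link is needed only because separate compilation ties each .ml file to a single .mli.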

The work-around is ungainly, relying on symbolic links that have to be made and cleaned-up. The earlier concrete proposal, if implemented, would make it unnecessary. Recall, the proposal was: When compiling A.ml under a different name B.cmo, as in

    ocamlc -c -o B.cmo A.ml
behave exactly as if the source A.ml were named B.ml.

References

0README.dr [<1K]
The complete source code (in the same directory as the index file)

The build system for the Compiler course, designed to support the incremental, step-wise development.

 

Conclusions

In designing separate compilation, OCaml has decided to restrict the correspondence between separately compiled modules and their interfaces to one-to-one. We have shown the restriction is limiting: it makes linking with alternative/improved implementations, and implementation extension/evolution, ungainly (in fact, seemingly impossible until workarounds were found). We have also seen odd behavior when compiling a module under a different name. Although the work-arounds suffice in the short term, one may hope that eventually the restriction is lifted (and the odd behavior fixed).

We have emphasized a non-destructive extension/evolution of libraries: to link with a different library or to extend an existing one, no old code should be modified, or copied/cut-and-pasted -- or even re-compiled. The old code base is always available as a fall-back.

Keeping old versions as is, untouched, both in source and compiled form, may remind one of version control. Our approach `version-controls' not just the source but also the compiled artifacts. Mainly, instead of looking at diffs in a repo, through a repo-specific interface, in our development approach the diffs are the source code. We write an extension as a diff of sorts, by referring to the old code and adding new definitions. Our `diffs' are hence intentional and semantically meaningful -- and can be viewed as the source OCaml code with all convenience: syntax highlighting, jump to the definition, type tooltips, Merlin, etc. The Compiler class has demonstrated that such incremental development scales.