Upon systematic investigation we discover a way to formally relate the two traditions. Practically, we can write (or generate) first-order OCaml code using reference-type variables without restrictions, and straightforwardly map it to the efficient and idiomatic C and be certain of its meaning, in C. We may hence write in (a subset of) OCaml as if it were C, using to the full extent the modules, higher-order functions and other abstraction facilities of OCaml -- and obtaining all the performance guarantees of C code, benefiting from vectorization, etc. of modern C compilers.
In the Fortran/Algol tradition, the semantics of assignment
such as x = x + 1
is non-compositional,
and to understand it, L-values were introduced decades
ago. L-values simplify assignment by taking all the complexity upon
themselves. Our relation of two traditions showed L-values are not
needed. What is complicated about the
mutable-variable assignment is not its
semantics
but an unusual syntax: in C, it is a mix-fix operator
* _ = _
, further obfuscated by syntax sugar.
The following table contrasts the two traditions on the simplest example (for concreteness, using C and OCaml).
int x = 1; x = x + 1;
let x : int ref = ref 1 x := !x + 1
int
:
the type of the value that is retrieved and assigned. ML tradition
defines an immutable, lambda-calculus--like variable, of a reference
type: int ref
. One also has to explicitly allocate the box: ref 1
. Mutable variables can be accessed directly. In the ML tradition,
they have to be explicitly de-referenced: see the !
operator. In ML
tradition, dereference !
and assignment :=
are ordinary
functions. In Algol tradition, assignment is nothing but. We will
talk about it in more detail later.
There are two: practical and theoretical. Practically, we want to map code in the reference-type style to code with direct mutation: e.g., map imperative first-order OCaml code to C. The next section tells why it is useful. The theoretical problem is the meaning of mutation.
The theoretical problem is sure tough, but it was already solved, long time ago by Strachey. He introduced so-called L-values -- the notion that has become widely used, especially in C-like languages. Elegant as they are, L-values are not simple. One may say, L-values simplify the problem of assignment by shifting all the complexity upon themselves.
This work started from the need to solve the practical problem. Unexpectedly, the solution gave a new perspective on assignment and mutation, without any need for L-values.
The present page summarizes the presentations at the ML Family workshop and the IFIP WG2.1 meeting, extending them with better examples and clearer explanations. I'm grateful to the participants of these meetings for questions and comments, which indeed helped clarify the presentation.
Mutable Variables and Reference Types - or, L-values
Demystified
IFIP WG2.1 meeting, December 14, 2022
<http://arxiv.org/abs/2211.04107>
Preprint describing the formalization
Generating C
The found solution has been implemented, in MetaOCaml and in
the tagless-final embedding of languages with mutable variables
The key idea of offshoring is the close correspondence between imperative first-order OCaml and C code -- so by generating OCaml we in effect generate C. Let's look at the two pieces of code below: vector addition in idiomatic OCaml and C. The differences in syntax cannot hide how strikingly similar the two pieces of code are.
let addv = fun n vout v1 v2 -> for i=0 to n-1 do vout.(i) <- v1.(i) + v2.(i) done
void addv(int n, int* vout, int* v1, int* v2) { for(int i=0; i<=n-1; i++) vout[i] = v1[i] + v2[i]; }They are so similar that one may say that it is the same code, but in different surface syntaxes: different `dialects'. In other words, an imperative first-order subset of OCaml can be thought of as a different notation for C.
We can generate OCaml code using all conveniences and abstractions of OCaml as a host language, and with some correctness guarantees -- using MetaOCaml. If we can map the generated OCaml code to idiomatic C in a way that clearly preserves its meaning, we in effect accomplish generating C conveniently, and with correctness guarantees.
OCaml and C, however, follow different traditions when it comes to mutable variables. Hence comes the challenge of mapping code written in the reference-type style to direct-mutation, idiomatic and efficient C code. Alas, it is not simple. In fact, it is simply impossible.
Let's consider the following OCaml code
!x + 1where
x
is a variable of type int ref
. This expression is clearly a
part of some larger expression -- but a compositional translation
should not care about it. That is what being compositional means,
after all. The easiest mapping that comes to
mind (at least, the one that described in Eckhardt, Taha et al, 2007
paper and was implemented, until very recently, in BER MetaOCaml) is:
t ref
types correspond to
(or: are a model of) mutable C variables of
type t
. Therefore, !x
in OCaml maps to just x
in C.
SimpleMapping indeed works, in many cases, for example:
let x = ref 0 in foo (!x + 1)This OCaml code clearly corresponds to the following C code
int x = 0; foo (x + 1);The sub-expression
!x + 1
indeed maps to x + 1
in C.
Consider, however, the same !x + 1
sub-expression in a different
context:
fun x -> foo (!x + 1)The corresponding C code (that is, C code of the same meaning) should be
void myfun(int * x) { foo (*x + 1); }The very same
!x + 1
now maps to *x + 1
in C.
Here is the second OCaml example:
!x + foo xwhere, again,
x
has the type int ref
, and foo
is obviously of type
int ref -> int
. SimpleMapping would give the C code
x + foo(x)which is quite wrong: it erases the difference between a pointer and what it points to. One may, and should, object: SimpleMapping is just undefined for a naked ref-type variable. That is not a good answer either since it makes our example code untranslatable. Peeking ahead, one might say that a naked ref-type OCaml variable
x
should map to &x
in C. The whole example should hence be translated as:
x + foo(&x);This is a good translation. Consider, however, our expression in the following context:
let rec foo x = !x + foo x(a bit silly code, but should do for the example.) The corresponding C code should be
int foo(int * x) { return *x + foo(x); }with
!x + foo x
mapping to an unexpected expression.
As the final example, consider the OCaml expression
!!xwhere
x
has the type int ref ref
. How should its C mapping look
like? Should it be x
, or *x
, or **x
?
The upshot: C has mutable variables and, separately, variables
of pointer types. OCaml, on the other hand, uses ref-type variables to
model both. The mapping from C to OCaml is, hence,
non-injective. Therefore, the inverse (functional) mapping from
ref
-type OCaml variables to mutable C variables cannot exist.
One may attempt to build a closer model of mutable variables in OCaml, defined by the following signature:
type 'a mut (* abstract *) val newmut : 'a -> ('a mut -> 'w) -> 'w val demut : 'a mut -> 'a val setmut : 'a mut -> 'a -> unitAlthough better, it is still not quite right since it permits the OCaml code like:
newmut 1 (fun x -> foo x)That is, this model lets us pass a variable of a
mut
type as an
argument and return it as a result. In C, one cannot do that! (One can
in C++). Mutable variables in C are not first-class. That is another
problem with modeling them in OCaml, with OCaml variables, which are
first-class.
t ref
type, t
must be an int
or
float
x : t ref
, it may only appear in the following
three contexts:let x = ref e in ...
, !x
, x := e
ref e
may only appear in the context
let x = ref e in ...
These restrictions implicitly appeared in the Eckhardt, Taha. et al. 2007 paper -- although the paper itself never mentioned any restrictions. One infers them by carefully examining the typing rules in the Appendix of that paper. In effect, the restrictions completely outlaw aliasing -- which is the source of much complexity with mutation.
SimpleMapping with the accompanying restrictions has been the state of the art, and implemented in BER MetaOCaml, until very recently. To illustrate, the OCaml code
let x = ref 0 in x := !x+1is mapped to the following C code.
int x = 0; x = x + 1;However, the following, also simple imperative OCaml code
let x = ref 0 in let y = x in y := 41; x := !x+1violates the restrictions and cannot be offshored.
In my experience, the restrictions are more or less adequate for
numeric code. They do force workarounds or small cheating from time to
time (e.g., reference-type arguments -- result
, or out
arguments,
in Pascal-speak -- are needed occasionally).
x = x + 1;We see
x
on both sides. Clearly, the two things named x
do not
mean the same (i.e., do not both mean an integer). Assignment, hence,
is not an ordinary function: it is a special form, and cannot be
handled compositionally. Whereas its right-hand--side is an expression
(where x
means the current integer value of the variable), the
left-hand side must be something else. Strachey called it `L-value'.
L-values are not restricted to variable occurrences. The samples below exemplify the L-value rules in the C standard:
x = x + 1; a[i] = a[i] + 1; s.f = s.f + 1; sp->f = sp->f + 1; *p = *p + 1; ( z > 0 ? a : b )[i] = a[i] + 1; // ( z > 0 ? x : y ) = x + 1; // not L-value // ( z > 0 ? *p : *q ) = *p + 1; // not L-value *( z > 0 ? p : q ) = ( z > 0 ? *p : *q ) + 1;where we have assumed the declarations:
int x = 1; int a[2] = {1,1}; int b[2] = {10,10}; int i = 1; struct sf {int f;} s = {.f = 1}; struct sf * sp = &s; int * p = &x;
Let's look again at the original assignment expression.
x = x + 1;The main insight is that the expression is a syntax sugar: an abbreviated form of the following full expression (which is valid C):
*&x = *&x + 1;It, in turn, is also syntax sugar: a mix-fix
* _ = _
form of
assign(&x,*&x + 1);where
assign
is an ordinary C function (which could be implemented as
a built-in):
void assign(int *p, int v) { /* movl %esi, 0(%rdi) */ }The two sides of assignment are hence ordinary expressions, but of different types. The mystery of assignment is solved: assignment can be given a compositional meaning after all. What is unusual is not semantics, but a mix-fix form of the assignment operator. Had it been the simple infix operator like
*=
, there would have been no
mystery to start with.
The desugaring helps us understand that a mutable variable per se is
not an expression. It is a syntactic category all by
itself. That is why mutable variables are not first-class. Mutable
variables can be turned into an expression by prefixing them with
&
, which gives a pointer-type expression.
The dereferencing operator *
is an ordinary function, of a
pointer-type argument. Again, *
that appears on the left-hand--side in the
desugared assignment operation *p = e
is not a separate
operator. It is part of the mix-fix syntax * _ = _
of the
assignment operation.
To re-iterate, the assignment operation *p = e
is a mix-fix
expression with *
being part of the syntax. It can be re-written
as an ordinary procedure invocation: assign(p,e)
, where both p
and
e
are expressions, but of different types. In particular, p
is a
pointer-type expression. There is no longer any need for L-values.
Let's look again an the L-value examples earlier. They all can be
desugared into the * _ = _
mix-fix form, and then converted to the
assign
procedure invocation, if desired.
*&x = *&x + 1; *(a+i) = a[i] + 1; *&s.f = s.f + 1; *&(sp->f) = sp->f + 1; *p = *p + 1; *(( z > 0 ? a : b )+i) = a[i] + 1; *( z > 0 ? p : q ) = ( z > 0 ? *p : *q ) + 1;If one insists on L-values, they can now be defined much easily: an L-value is anything that starts with a star.
We now present a different offshoring translation, which maps idiomatic imperative first-order OCaml to idiomatic C preserving meaning while imposing no restrictions whatsoever on reference/pointer types. References of references, pointers of pointers, mutable variables of pointer types are all allowed.
The translation does have to disambiguate OCaml reference-type variables representing C mutable variables from those representing pointer-type variables. To this end, the translation does mutability inference, whose starting point (axiom) is: the variable introduced by the let-expression of the following form
let x : t ref = ref e in ...corresponds to a mutable C variable of type
t
of the same name,
initialized by e
. All other OCaml variables correspond to
constant C variables.
Here is the translation in full. It is parameterized by the set of free variables considered mutable.
let x : t ref = ref e in eb
is translated to
t x = e'; eb'
in C (where e'
and eb'
are the translations). In
addition, x
is to considered mutable during the translation of eb
to
eb'
;
const
C bindings;
x
is translated to
&x
in C if x
is considered mutable, or to x
otherwise;
!
is uniformly translated to
C dereference operator *
;
:=
is translated to mix-fix
assignment operator * _ = _
of C
For example, OCaml's
let x = ref 0 x := !x+1is translated to
int x = 0; *&x = *&x+1;which, if desired, can be prettified by dropping
*&
to
int x = 0; x = x+1;
The extant offshoring cannot translate
let x = ref 0 let y = x in y := 41; x := !x+1But the new translation can, producing (after prettifying)
int x = 0; int * const y = &x; *y = 41; x = x+1;Indeed, idiomatic, easy to understand OCaml is translated to idiomatic, easy to understand (by C standards) C of the same meaning.
Adding increment/decrement is easy: OCaml's incr
is translated to
++*
in C, in any context.
&
' of C can be applied to arbitrarily
expressions, not just variables. Its semantics in that case is a
puzzle: it is not fully strict (like `*
' or `!
' operators, which
evaluate their arguments) and it is not fully non-strict (like sizeof
)
either. Our demystified L-values easily solve the puzzle.
The following C code demonstrates the unusual semantics of &
,
contrasting it with sizeof
:
#include <stdio.h> int sz(int ** p) { return sizeof p[1][2]; } int * ad(int ** p) { return &(p[1][2]); } int main(void){ int* p1 [] = {NULL,NULL,NULL}; printf("sz p1: %d\n",sz(p1)); // prints 4 printf("sz NULL: %d\n",sz(NULL)); // prints 4 printf("ad p1: %p\n",ad(p1)); // prints 0x8 printf("ad NULL: %p\n",ad(NULL)); // segmentation fault }As expected,
sizeof
does not evaluate its argument; therefore
p[1][2]
is always defined, even if p
is NULL
. The operator &
is different, and puzzling: it seems to evaluate its argument,
but only to some extent.
The C standard deals with the puzzle by defining four different cases
of &
expressions: applying &
to functions; to L-value
expressions; to expressions of the form * exp
; and expressions of the
form exp[exp]
. The latter two are special cases. We now give the
semantics to &
without needing special cases.
First, we introduce an `internal' address-of, which applies only to
variables (including register-storage--class variables). Our analysis of
assignment earlier actually used this internal address-of. Its
semantics is the location of the variable: an address in memory or the
register. One may think of the internal address-of as producing an
operand for an assembly instruction (such as
%rdi
, (%rdi)
, 10(%rsp)
, etc.). The `surface' operator &
of C
is then a `macro' of sort:
&
must be applied to an expression that
starts with *
(perhaps after desugaring). The &
operator removes
the *
.
Recall, on our analysis every L-value expression starts with *
(or: an expression that starts with *
, perhaps after desugaring,
is defined to be L-value.)
Let us see how it works on our example. The expression p[1][2]
is
a sugared version of *(p[1]+2)
. Therefore,
&p[1][2] ≡ &*(p[1]+2) ≡ p[1]+2and the
ad
function above is equivalent (desugars) to
int * ad(int ** p) { return p[1]+2; }which explains the program behavior, with no puzzles.
&
operator
The paper introduces calculi ICaml (a model of OCaml with reference types) and CoreCE (a simple first-order language with mutable variables) and formally defines the translation of the former to the latter. It states the meaning preservation theorem.
refcalculi.ml [12K]
The code gives (executable) denotational semantics of ICaml and CoreCE
(as tagless-final interpreters) and defines the translation in full
detail, as an ICaml interpreter in terms of CoreCE.
The mere fact the latter interpreter type-checks proves the translation
type-preserving. The meaning preservation needs (an informal, at
present) argument examining the mapping of the denotations of
expression forms. (Quite a
few mappings are syntactically the identity.)
&
, and dereferenced by further prefixing
with *
* _ = _
, which, aside from the unusual
syntax, is an ordinary procedure of two arguments. Assignment
operation is hence an ordinary procedure call, whose two arguments are
ordinary expressions (the first is of a pointer type). There is no
need for L-values. Or, to put it another way, what is
called L-value in C is an ordinary expression of a pointer type with
a star attached as a mere decoration.
*&
makes code prettier but semantics mysterious. The mystery of
assignment is syntactic, not semantic.