Compilers: Incrementally and Extensibly

 

 

Introduction

Compilers is a practical course. Its goal is to build a real compiler, which compiles a high-level language down to the actual x86-64 machine code and produces an executable that runs on student's laptops. The source language is Tiger': a procedural language in the spirit of Pascal -- or C with arbitrarily nested functions. The compiler itself is to be developed in OCaml.

The characteristic of the course is an iterative, incremental development: we start with the most trivial source language, develop the full compiler for it, and then keep extending the source language and the compiler in small steps, reusing the earlier work as much as possible. At each iteration, we build the complete, end-to-end compiler producing runnable and testable executables, for a (progressively larger) subset of the source language.

Another characteristic is the extensive use of tagless-final style, taking the full advantage of extensibility afforded by it. Extensibility here means reuse -- of type-checked and compiled artifacts from the previous increment -- rather than copy-paste. The compiler is hence structured as a stack of domain-specific languages, with parsing at the bottom and assembly at the top. The languages are extended by adding new operations here and there (and only occasionally by redirection).

We cover all standard material for the compiler course, from parsing and type-checking to analyses, optimizations, calling conventions and assembly generation -- but in a quite non-traditional fashion.

This course was taught as an elective 15-lecture upper-level undergraduate course in the Fall/Winter of 2022 and 2023.

Version
The current version is January 2024
References
course-notes.pdf [1033K]
Course notes (the first few chapters)

<tfintro/0README.dr>
Code for the section: Introduction to tagless-final style

Free Variable as Effect, in Practice
A greatly simplified version of the course, which still, hopefully, gives the taste of the incremental, step-wise compiler development. The article also explains extensible effects in compilation -- the technique currently used in the course.

 

Course Content

After the preliminaries and the introduction to tagless-final style, we start on the developing compiler, from the simplest language onward:
  1. The simplest source language: only integer literals
  2. Unary expressions, booleans, types
  3. LL(1) parsing, lexing
  4. Simple binary expressions
  5. Migrating to ocamllex/ocamlyacc
  6. Local variables, scope
  7. The meaning of free variables: environment or effect
  8. Variable usage analysis and memory allocation
  9. Full expression language
  10. Dummy expression and sequencing
  11. Mutable variables, assignment, function calls
  12. Strings and string operations
  13. Comparisons and conditionals
  14. Loops
  15. Function declarations, local functions and procedures
  16. Type-checking, more formally and practically
  17. Mutually-recursive functions
  18. Nested functions
  19. Final exercises

 

Build tool

The course uses a dedicated build system -- in a sense, a parred down, specialized and tuned version of make. It is designed to support the incremental, step-wise development and has only what is needed. It is simple, fast, and portable across MacOS, Linux and WSL.

The build tool is actually an OCaml script called build.ml. Like make, the script accepts the list of targets to make, and makes them in the given order, stopping in case of error. The script is a regular OCaml file, and can be edited as such: see the sample below. One should particularly note compiler_manifest, which lists all the source files needed to build the compiler, and how to build them.

In 2022 the course used make. The experience showed why a custom build tool would be more more pleasant and convenient to use. First, since our compiler is built by extending the earlier code, the source code is spread through many steps (each in its own directory). The code on step23 may use the source code files in step22/, step21/, step20/. The VPATH facility of make supports exactly this pattern, making make look up the needed files. It is not clear, however, what exactly make finds and where exactly the code is. It is much preferable to have the complete list of code files.

Some files from the previous steps have to be renamed before compiling. For example, step21/lang.mli should be renamed to lang_21.mli: it would be included under that name in step22/lang.mli. Previously I resorted to symbolic links, which clutter the directory and are harder to maintain.

Since files are so spread out, automatic dependencies are difficult to maintain. Ocamldep does not help here. On the other hand, to build the final executable, we have to explicitly enumerate all needed .cmo files in the correct order. The order has to be manually maintained in any case. One may observe that this order is exactly the order in which to compile the source code files. Therefore, we do not need any tool to discover dependencies, since we have to list .cmo files anyway. (We need to also add .cmi files to the build manifest). We hence assume consecutive (single-threaded) build and lose on parallelism. In our experience, the compilation is very fast (including the complete rebuild), so parallelism is not necessary.

References
step31-build.ml [2K]
A sample build script in its entirety

course-notes.pdf [1033K]
Section 4.1.1 of the Course notes explains how to use the build tool

build.ml [9K]
The implementation of the rules (including time-stamp checking, etc.) and the main function

 

Utilities

The directory util contains commonly used code, in particular, the build tool and the testing framework.
References
common.ml [<1K]
util.ml [<1K]

build.ml [9K]
The Build tool

expect.ml [5K]
Testing framework: confirm the parsing, the generated code, and the result of executing the compiled code

Makefile [<1K]
Makefile.common [<1K]