Architectural overview

This document explains the internal design of the Mars compiler and interpreter.

High-level architecture

The Mars suite consists of two programs, mars (for running programs directly, and presenting an interactive read-eval-print loop) and marsc (for compiling programs). These two tools share most of the codebase. The only difference is that mars uses the mars and interactive modules whereas marsc uses the marsc module.

Both tools read in a Mars program (from a single source file, though that file may include other files using the import statement), and compile it into an in-memory representation, following the compilation steps detailed below under Compiler passes. The resulting data structure is identical regardless of which tool is invoking the compilation and which back-end will subsequently be used.

After compilation, what happens depends upon the tool and the back-end. Both tools feature a selection of back-ends which controls how the program is executed (for mars) or the format of the output (for marsc).

The mars command may be run in either interactive or non-interactive mode. If running non-interactively, it will feed the program into a back-end-specific interpreter, beginning at the program entrypoint (by default, main). If running interactively, it begins a prompt session. Whenever the user types a Mars statement, it compiles the statement into a sequence of instructions using the full compilation toolchain, then feeds those instructions into the back-end-specific interpreter. The mars back-ends are implemented by creating an instance of the executor class. The executor type class contains methods for executing individual instruction sequences and adding new functions to the existing program. The prompt also displays the result of any expression statement by calling the built-in show function on the expression. It does this by actually compiling a call to show into the internal representation, and then extracting the resulting string from the executor. For more information on the prompt, see Interactive prompt.

Errors

We use exceptions to handle all errors. There are three main types of errors that can occur in Mars: compiler errors (caused by attempting to compile an illegal program), runtime errors (caused by an error condition in a running Mars program) and internal errors (caused by bugs in the Mars suite).

Compiler errors are always thrown as context.error objects. These always include data about what file and line number the error occurred on, and a helpful error message. Compiler errors range from syntax errors to type errors to static semantic errors such as using a variable without it being definitely assigned. Such errors are caught very high up, printed neatly, and the program then exits. Errors that occur in the interactive prompt are printed but the program does not exit. We pride ourselves on the readability of Mars compiler error messages, and believe they are consistently of high quality and relevance.

Runtime errors are always thrown as executor.mars_error objects. These simply contain strings with the error message. Such errors are thrown by built-in functions when called inappropriately, or by the built-in error function. It is the responsibility of the individual executor to throw these errors. Runtime errors are caught and displayed neatly. Runtime errors that occur in the interactive prompt are printed but the program does not exit.

The rest of the exceptions are internal errors, which are bugs in Mars. These should never be thrown, but they often are when hacking on the compiler. If any such error occurs, it should be reported as a bug. There is a top-level error handler which catches all other errors and prints them out, making clear that this is an internal error and providing a link to file a bug report.

Internal representation

Mars programs are always compiled into the internal representation, Mars internal representation (Mars IR) before being interpreted or compiled. The module ir is therefore the centrepoint of the whole compiler suite, as it contains the definition of the entire data structure.

Note that there are two internal representations: the AST (abstract syntax tree) and the CFG (control flow graph) or Mars IR form (Mars IR is often referred to internally as CFG). Both of these are defined in the ir module. The AST is a data structure representation of Mars source code. The typedef, ctor, expr, stmt, basic_stmt, compound_stmt, function and program types represent the AST.

There is no separate data type for “program in AST form” and “program in Mars IR form” – the function type can represent a function either in AST or Mars IR form, and so a program can theoretically (but shouldn’t) hold a mixture of AST and Mars IR functions.

The typedef and ctor types are also used in Mars IR form (Mars IR represents types the same way as the AST, so there is no need for a separate data type). The atom, instr, terminator_instr, pattern and case_value types are used to represent the Mars IR form of a program.

The Mars IR data structure and instruction set is described in Instruction set architecture.

An important helper data structure is the tables module, which provides the progtable (program table) and localtable (local variable table) structures. These allow for efficient lookup of global functions, types and constructors, as well as the types of local variables. The tables are separate from the main program type, but often threaded around with it.

Compiler passes

The Mars suite splits up the job of compiling from a collection of Mars source code files into the Mars IR internal representation into several phases. Each phase is responsible for transforming the program in some way, and also generating errors.

Mars never reports more than one error – it always stops at the first error. There are also no warnings. All errors are produced by throwing errors as described above.

Reading

(Module: reader.) The “reading” phase actually consists of applying the lexer and parser phases recursively on the imports. Those two sub-phases are detailed later.

Mars always starts with a single source code file. This file is tokenised and parsed into AST (abstract syntax tree) form, and then the first thing that happens is the reader module searches the AST for import statements. Each import statement triggers the search for Mars files of that name in the appropriate paths (see The import statement) – each of these files is loaded, and then the reading process is applied recursively on those files. It is an error if any file cannot be read.

The result is a single AST for the whole program, which does not contain any import statements.

Lexer

(Module: token.) This very simple phase simply reads a text file containing Mars source code, and converts it into a sequence of token objects, as defined at the top of the token module. In particular, this module is responsible for handling the Mars indentation rule, and parsing string literals. This implements everything described in Lexical analysis.

Errors: Syntax errors for unexpected characters, malformed tokens (such as string literals) or indentation.

Parser

(Module: parsem, short for “parse Mars”; the module parse is a generic parser library written for Mars.) This phase reads the stream of tokens and applies grammar productions to produce a complete program in AST form.

Errors: Syntax errors for failing to parse against the grammar.

Type checker

(Module: typecheck.) This phase runs over the entire program in AST form and performs type unification. Note that expressions in the AST data structure may or may not have a type. The expressions produced by the parser do not have type information. This phase adds type information. It also infers types of local variables if they have not been given, and throws type errors if anything fails to unify.

Errors: Type errors, if types fail to unify or an explicit type does not match the inferred type.

AST to CFG conversion

(Module: ast_cfg.) This is the most complicated phase. It transforms the program from the type-annotated AST form into the Mars IR control-flow graph form. This involves many responsibilities:

  • Converting basic statements into instruction sequences
  • Converting complex statements into basic blocks and branches
  • Simplifying complex switch statement patterns into simple cases (see Switch factoring algorithm)
  • Checking that switch statements cover all possibilities.
  • Converting variables into SSA form, so that each variable is assigned in exactly one place. Variables assigned in more than one place are given subscripts, and phi nodes are added to join variables together.
  • Tracking which variables have been definitely assigned, and checking that variables are not used unless definitely assigned.
  • Checking that all code paths of a function return a value.
  • Generating special functions for reading and updating fields of each user-defined type
  • Generating closure templates

Errors: Use of a variable without being definitely assigned. Possible to hit the end of a function without a return statement. A switch statement that does not cover all possibilities.

Type dictionary augmentation

(Module: typedict.) This phase takes a complete Mars IR program, in control-flow graph form, and augments its functions with type dictionaries, as described in Type dictionaries. This generates the type dictionary functions and modifies existing functions to use those.

The result is a program that does not call the built-in functions show or eq, instead implementing them non-primitively using type dictionaries.

Back-ends

Mars is built with an extensible architecture. Both the mars and marsc tools have the choice of several back-ends which carry out interpretation of the program in a different way, or write out a different output format.

Interpret

mars only

(Module: interpret.) A highly inefficient reference implementation of Mars. This interpreter is written almost all in pure Mercury (with a few additional modules written in C for efficient array handling). It uses a Mercury data structure to represent values, and contains an interpreter that executes Mars assembly code without further compilation.

First-class function objects are represented as values containing the Mars assembly objects for those functions (they are not converted into Mercury functions). Built-ins are implemented as Mercury functions in this module.

Note that this isn’t quite straightforward since this module has seen the Mars assembly representation evolve considerably and hasn’t quite caught up. Notably, it still represents functions as objects that can be curried arbitrarily, a throwback to the days when Mars functions were always unary, like Haskell, and before we invented closure templates.

It is not possible to pre-compile code to be interpreted later. The interpreter only works directly on raw Mars code.

Pretty

marsc only

This compiler-only back-end simply spits out the original Mars source code from the AST data structure, formatting it nicely. The marsc module features a special case to call this backend without first compiling the code to MAS, so that it may have access to the uncompiled Mars source code.

ASM

marsc only

This compiler-only back-end emits raw compiled Mars assembly (MAS) code. Note that MAS is not an executable format; there is no tool that can read it in. This is solely for human readability of the compiled code.

The MAS printer attempts to “pretty up” the code by making it look more like Mars source code than assembly code. For example, rather than writing mov x, y, it writes x = y. This may change in the future.

Interactive prompt

(Module: interactive.) The interactive prompt is an important part of the Mars suite and a significant amount of programming effort went into making it work and maintaining it. (It is remarkably different to a normal compiler in the way it works.)

The type istate carries around the state of the interpreter and is threaded through most predicates. Note that it needs to carry around a lot of state that is normally hidden as internal data structures in the compiler phases (such as a varset, for type checking). Many modules need to expose their otherwise-internal functionality just so that the interactive prompt can splice itself in the middle of the compiler phases.

For example, the ast_cfg module has internal counters to track the internal subscript on each variable. The interactive prompt has access to this data, which is needed in order to give variables increasing subscripts as the session goes on.

Note that there is an intimate relationship between the interactive prompt and the executor back-ends. Most of the executor methods are provided so that interactive can perform its job. Each executor needs to know how to store interpreter variables in a way that can persist over a long time. This is a difficult problem, for example, in the LLVM executor, which normally stores local variables in temporary registers – it has a special case to store interpreter variables in global state so that they can be retrieved by subsequent interactive statements. It is also necessary to be able to generate new functions on-the-fly. While the interactive prompt does not permit def statements, it may be necessary to generate new closure templates if a function is curried in a way that it has not been before. These new functions are generated by the ast_cfg module, and the interactive prompt can request that the executor insert those functions into its internal state before executing the new instructions.