.. Mars Documentation
Copyright (C) 2011 Matt Giuca
.. This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
.. This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.. You should have received a copy of the GNU General Public License
along with this program. If not, see .
.. _ref-dev-architecture:
**********************
Architectural overview
**********************
.. sectionauthor:: Matt Giuca
This document explains the internal design of the Mars compiler and
interpreter.
High-level architecture
=======================
The Mars suite consists of two programs, :program:`mars` (for running programs
directly, and presenting an interactive read-eval-print loop) and
:program:`marsc` (for compiling programs). These two tools share most of the
codebase. The only difference is that :program:`mars` uses the :mod:`mars` and
:mod:`interactive` modules whereas :program:`marsc` uses the :mod:`marsc`
module.
Both tools read in a Mars program (from a single source file, though that file
may include other files using the :keyword:`import` statement), and compile it
into an in-memory representation, following the compilation steps detailed
below under :ref:`ref-compiler-passes`. The resulting data structure is
identical regardless of which tool is invoking the compilation and which
back-end will subsequently be used.
After compilation, what happens depends upon the tool and the back-end. Both
tools feature a selection of back-ends which controls how the program is
executed (for :program:`mars`) or the format of the output (for
:program:`marsc`).
The :program:`mars` command may be run in either interactive or
non-interactive mode. If running non-interactively, it will feed the program
into a back-end-specific interpreter, beginning at the program entrypoint (by
default, :func:`main`). If running interactively, it begins a prompt session.
Whenever the user types a Mars statement, it compiles the statement into a
sequence of instructions using the full compilation toolchain, then feeds
those instructions into the back-end-specific interpreter. The :program:`mars`
back-ends are implemented by creating an instance of the :type:`executor`
class. The :type:`executor` type class contains methods for executing
individual instruction sequences and adding new functions to the existing
program. The prompt also displays the result of any :ref:`expression statement
` by calling the built-in :func:`show` function on
the expression. It does this by actually compiling a call to :func:`show` into
the internal representation, and then extracting the resulting string from the
executor. For more information on the prompt, see
:ref:`ref-interactive-prompt`.
Errors
======
We use exceptions to handle all errors. There are three main types of errors
that can occur in Mars: compiler errors (caused by attempting to compile an
illegal program), runtime errors (caused by an error condition in a running
Mars program) and internal errors (caused by bugs in the Mars suite).
Compiler errors are always thrown as :type:`context.error` objects. These
always include data about what file and line number the error occurred on, and
a helpful error message. Compiler errors range from syntax errors to type
errors to static semantic errors such as using a variable without it being
definitely assigned. Such errors are caught very high up, printed neatly, and
the program then exits. Errors that occur in the interactive prompt are
printed but the program does not exit. We pride ourselves on the readability
of Mars compiler error messages, and believe they are consistently of high
quality and relevance.
Runtime errors are always thrown as :type:`executor.mars_error` objects. These
simply contain strings with the error message. Such errors are thrown by
built-in functions when called inappropriately, or by the built-in
:func:`error` function. It is the responsibility of the individual executor to
throw these errors. Runtime errors are caught and displayed neatly. Runtime
errors that occur in the interactive prompt are printed but the program does
not exit.
The rest of the exceptions are internal errors, which are bugs in Mars. These
should never be thrown, but they often are when hacking on the compiler. If
any such error occurs, it should be reported as a bug. There is a top-level
error handler which catches all other errors and prints them out, making clear
that this is an internal error and providing a link to file a bug report.
Internal representation
=======================
Mars programs are always compiled into the internal representation, Mars
internal representation (Mars IR) before being interpreted or compiled. The
module :mod:`ir` is therefore the centrepoint of the whole compiler suite, as
it contains the definition of the entire data structure.
Note that there are two internal representations: the AST (abstract syntax
tree) and the CFG (control flow graph) or Mars IR form (Mars IR is often
referred to internally as CFG). Both of these are defined in the :mod:`ir`
module. The AST is a data structure representation of Mars source code. The
:type:`typedef`, :type:`ctor`, :type:`expr`, :type:`stmt`, :type:`basic_stmt`,
:type:`compound_stmt`, :type:`function` and :type:`program` types represent
the AST.
There is no separate data type for "program in AST form" and "program in Mars
IR form" -- the :type:`function` type can represent a function either in AST
or Mars IR form, and so a program can theoretically (but shouldn't) hold a
mixture of AST and Mars IR functions.
The :type:`typedef` and :type:`ctor` types are also used in Mars IR form (Mars
IR represents types the same way as the AST, so there is no need for a
separate data type). The :type:`atom`, :type:`instr`,
:type:`terminator_instr`, :type:`pattern` and :type:`case_value` types are
used to represent the Mars IR form of a program.
The Mars IR data structure and instruction set is described in :ref:`ref-isa`.
An important helper data structure is the :mod:`tables` module, which provides
the :type:`progtable` (program table) and :type:`localtable` (local variable
table) structures. These allow for efficient lookup of global functions, types
and constructors, as well as the types of local variables. The tables are
separate from the main :type:`program` type, but often threaded around with
it.
.. _ref-compiler-passes:
Compiler passes
===============
The Mars suite splits up the job of compiling from a collection of Mars source
code files into the Mars IR internal representation into several phases. Each
phase is responsible for transforming the program in some way, and also
generating errors.
Mars never reports more than one error -- it always stops at the first error.
There are also no warnings. All errors are produced by throwing errors as
described above.
Reading
-------
(Module: :mod:`reader`.)
The "reading" phase actually consists of applying the lexer and parser phases
recursively on the imports. Those two sub-phases are detailed later.
Mars always starts with a single source code file. This file is tokenised and
parsed into AST (abstract syntax tree) form, and then the first thing that
happens is the reader module searches the AST for :keyword:`import`
statements. Each :keyword:`import` statement triggers the search for Mars
files of that name in the appropriate paths (see :ref:`import`) -- each of
these files is loaded, and then the reading process is applied recursively on
those files. It is an error if any file cannot be read.
The result is a single AST for the whole program, which does not contain any
:keyword:`import` statements.
Lexer
-----
(Module: :mod:`token`.)
This very simple phase simply reads a text file containing Mars source code,
and converts it into a sequence of token objects, as defined at the top of the
:mod:`token` module. In particular, this module is responsible for handling
the Mars indentation rule, and parsing string literals. This implements
everything described in :ref:`ref-lexical`.
**Errors**: Syntax errors for unexpected characters, malformed tokens (such as
string literals) or indentation.
Parser
------
(Module: :mod:`parsem`, short for "parse Mars"; the module :mod:`parse` is a
generic parser library written for Mars.)
This phase reads the stream of tokens and applies grammar productions to
produce a complete program in AST form.
**Errors**: Syntax errors for failing to parse against the grammar.
Type checker
------------
(Module: :mod:`typecheck`.)
This phase runs over the entire program in AST form and performs type
unification. Note that expressions in the AST data structure may or may not
have a type. The expressions produced by the parser do not have type
information. This phase adds type information. It also infers types of local
variables if they have not been given, and throws type errors if anything
fails to unify.
**Errors**: Type errors, if types fail to unify or an explicit type does not
match the inferred type.
AST to CFG conversion
---------------------
(Module: :mod:`ast_cfg`.)
This is the most complicated phase. It transforms the program from the
type-annotated AST form into the Mars IR control-flow graph form. This
involves many responsibilities:
* Converting basic statements into instruction sequences
* Converting complex statements into basic blocks and branches
* Simplifying complex :keyword:`switch` statement patterns into simple cases
(see :ref:`ref-switchfactor`)
* Checking that :keyword:`switch` statements cover all possibilities.
* Converting variables into SSA form, so that each variable is assigned in
exactly one place. Variables assigned in more than one place are given
subscripts, and phi nodes are added to join variables together.
* Tracking which variables have been definitely assigned, and checking that
variables are not used unless definitely assigned.
* Checking that all code paths of a function return a value.
* Generating special functions for reading and updating fields of each
user-defined type
* Generating closure templates
**Errors**: Use of a variable without being definitely assigned.
Possible to hit the end of a function without a :keyword:`return` statement.
A :keyword:`switch` statement that does not cover all possibilities.
Type dictionary augmentation
----------------------------
(Module: :mod:`typedict`.)
This phase takes a complete Mars IR program, in control-flow graph form, and
augments its functions with type dictionaries, as described in
:ref:`ref-typedicts`. This generates the type dictionary functions and
modifies existing functions to use those.
The result is a program that does not call the built-in functions :func:`show`
or :func:`eq`, instead implementing them non-primitively using type
dictionaries.
Back-ends
=========
Mars is built with an extensible architecture. Both the :program:`mars` and
:program:`marsc` tools have the choice of several back-ends which carry out
interpretation of the program in a different way, or write out a different
output format.
Interpret
---------
:program:`mars` only
(Module: :mod:`interpret`.)
A highly inefficient reference implementation of Mars. This interpreter is
written almost all in pure Mercury (with a few additional modules written in C
for efficient array handling). It uses a Mercury data structure to represent
values, and contains an interpreter that executes Mars assembly code without
further compilation.
First-class function objects are represented as values containing the Mars
assembly objects for those functions (they are not converted into Mercury
functions). Built-ins are implemented as Mercury functions in this module.
Note that this isn't quite straightforward since this module has seen the Mars
assembly representation evolve considerably and hasn't quite caught up.
Notably, it still represents functions as objects that can be curried
arbitrarily, a throwback to the days when Mars functions were always unary,
like Haskell, and before we invented closure templates.
It is not possible to pre-compile code to be interpreted later. The
interpreter only works directly on raw Mars code.
Pretty
------
:program:`marsc` only
This compiler-only back-end simply spits out the original Mars source code
from the AST data structure, formatting it nicely. The :mod:`marsc` module
features a special case to call this backend *without* first compiling the
code to MAS, so that it may have access to the uncompiled Mars source code.
ASM
---
:program:`marsc` only
This compiler-only back-end emits raw compiled Mars assembly (MAS) code. Note
that MAS is not an executable format; there is no tool that can read it in.
This is solely for human readability of the compiled code.
The MAS printer attempts to "pretty up" the code by making it look more like
Mars source code than assembly code. For example, rather than writing ``mov x,
y``, it writes ``x = y``. This may change in the future.
.. _ref-interactive-prompt:
Interactive prompt
==================
(Module: :mod:`interactive`.)
The interactive prompt is an important part of the Mars suite and a
significant amount of programming effort went into making it work and
maintaining it. (It is remarkably different to a normal compiler in the way
it works.)
The type :mod:`istate` carries around the state of the interpreter and is
threaded through most predicates. Note that it needs to carry around a lot of
state that is normally hidden as internal data structures in the compiler
phases (such as a varset, for type checking). Many modules need to expose
their otherwise-internal functionality just so that the interactive prompt can
splice itself in the middle of the compiler phases.
For example, the :mod:`ast_cfg` module has internal counters to track the
internal subscript on each variable. The interactive prompt has access to this
data, which is needed in order to give variables increasing subscripts as the
session goes on.
Note that there is an intimate relationship between the interactive prompt and
the executor back-ends. Most of the :type:`executor` methods are provided so
that interactive can perform its job. Each executor needs to know how to store
interpreter variables in a way that can persist over a long time. This is a
difficult problem, for example, in the LLVM executor, which normally stores
local variables in temporary registers -- it has a special case to store
interpreter variables in global state so that they can be retrieved by
subsequent interactive statements. It is also necessary to be able to generate
new functions on-the-fly. While the interactive prompt does not permit
:keyword:`def` statements, it may be necessary to generate new closure
templates if a function is curried in a way that it has not been before. These
new functions are generated by the :mod:`ast_cfg` module, and the interactive
prompt can request that the executor insert those functions into its internal
state before executing the new instructions.