.. Mars Documentation Copyright (C) 2011 Matt Giuca .. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. .. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. .. You should have received a copy of the GNU General Public License along with this program. If not, see . .. _ref-dev-architecture: ********************** Architectural overview ********************** .. sectionauthor:: Matt Giuca This document explains the internal design of the Mars compiler and interpreter. High-level architecture ======================= The Mars suite consists of two programs, :program:`mars` (for running programs directly, and presenting an interactive read-eval-print loop) and :program:`marsc` (for compiling programs). These two tools share most of the codebase. The only difference is that :program:`mars` uses the :mod:`mars` and :mod:`interactive` modules whereas :program:`marsc` uses the :mod:`marsc` module. Both tools read in a Mars program (from a single source file, though that file may include other files using the :keyword:`import` statement), and compile it into an in-memory representation, following the compilation steps detailed below under :ref:`ref-compiler-passes`. The resulting data structure is identical regardless of which tool is invoking the compilation and which back-end will subsequently be used. After compilation, what happens depends upon the tool and the back-end. Both tools feature a selection of back-ends which controls how the program is executed (for :program:`mars`) or the format of the output (for :program:`marsc`). The :program:`mars` command may be run in either interactive or non-interactive mode. If running non-interactively, it will feed the program into a back-end-specific interpreter, beginning at the program entrypoint (by default, :func:`main`). If running interactively, it begins a prompt session. Whenever the user types a Mars statement, it compiles the statement into a sequence of instructions using the full compilation toolchain, then feeds those instructions into the back-end-specific interpreter. The :program:`mars` back-ends are implemented by creating an instance of the :type:`executor` class. The :type:`executor` type class contains methods for executing individual instruction sequences and adding new functions to the existing program. The prompt also displays the result of any :ref:`expression statement ` by calling the built-in :func:`show` function on the expression. It does this by actually compiling a call to :func:`show` into the internal representation, and then extracting the resulting string from the executor. For more information on the prompt, see :ref:`ref-interactive-prompt`. Errors ====== We use exceptions to handle all errors. There are three main types of errors that can occur in Mars: compiler errors (caused by attempting to compile an illegal program), runtime errors (caused by an error condition in a running Mars program) and internal errors (caused by bugs in the Mars suite). Compiler errors are always thrown as :type:`context.error` objects. These always include data about what file and line number the error occurred on, and a helpful error message. Compiler errors range from syntax errors to type errors to static semantic errors such as using a variable without it being definitely assigned. Such errors are caught very high up, printed neatly, and the program then exits. Errors that occur in the interactive prompt are printed but the program does not exit. We pride ourselves on the readability of Mars compiler error messages, and believe they are consistently of high quality and relevance. Runtime errors are always thrown as :type:`executor.mars_error` objects. These simply contain strings with the error message. Such errors are thrown by built-in functions when called inappropriately, or by the built-in :func:`error` function. It is the responsibility of the individual executor to throw these errors. Runtime errors are caught and displayed neatly. Runtime errors that occur in the interactive prompt are printed but the program does not exit. The rest of the exceptions are internal errors, which are bugs in Mars. These should never be thrown, but they often are when hacking on the compiler. If any such error occurs, it should be reported as a bug. There is a top-level error handler which catches all other errors and prints them out, making clear that this is an internal error and providing a link to file a bug report. Internal representation ======================= Mars programs are always compiled into the internal representation, Mars internal representation (Mars IR) before being interpreted or compiled. The module :mod:`ir` is therefore the centrepoint of the whole compiler suite, as it contains the definition of the entire data structure. Note that there are two internal representations: the AST (abstract syntax tree) and the CFG (control flow graph) or Mars IR form (Mars IR is often referred to internally as CFG). Both of these are defined in the :mod:`ir` module. The AST is a data structure representation of Mars source code. The :type:`typedef`, :type:`ctor`, :type:`expr`, :type:`stmt`, :type:`basic_stmt`, :type:`compound_stmt`, :type:`function` and :type:`program` types represent the AST. There is no separate data type for "program in AST form" and "program in Mars IR form" -- the :type:`function` type can represent a function either in AST or Mars IR form, and so a program can theoretically (but shouldn't) hold a mixture of AST and Mars IR functions. The :type:`typedef` and :type:`ctor` types are also used in Mars IR form (Mars IR represents types the same way as the AST, so there is no need for a separate data type). The :type:`atom`, :type:`instr`, :type:`terminator_instr`, :type:`pattern` and :type:`case_value` types are used to represent the Mars IR form of a program. The Mars IR data structure and instruction set is described in :ref:`ref-isa`. An important helper data structure is the :mod:`tables` module, which provides the :type:`progtable` (program table) and :type:`localtable` (local variable table) structures. These allow for efficient lookup of global functions, types and constructors, as well as the types of local variables. The tables are separate from the main :type:`program` type, but often threaded around with it. .. _ref-compiler-passes: Compiler passes =============== The Mars suite splits up the job of compiling from a collection of Mars source code files into the Mars IR internal representation into several phases. Each phase is responsible for transforming the program in some way, and also generating errors. Mars never reports more than one error -- it always stops at the first error. There are also no warnings. All errors are produced by throwing errors as described above. Reading ------- (Module: :mod:`reader`.) The "reading" phase actually consists of applying the lexer and parser phases recursively on the imports. Those two sub-phases are detailed later. Mars always starts with a single source code file. This file is tokenised and parsed into AST (abstract syntax tree) form, and then the first thing that happens is the reader module searches the AST for :keyword:`import` statements. Each :keyword:`import` statement triggers the search for Mars files of that name in the appropriate paths (see :ref:`import`) -- each of these files is loaded, and then the reading process is applied recursively on those files. It is an error if any file cannot be read. The result is a single AST for the whole program, which does not contain any :keyword:`import` statements. Lexer ----- (Module: :mod:`token`.) This very simple phase simply reads a text file containing Mars source code, and converts it into a sequence of token objects, as defined at the top of the :mod:`token` module. In particular, this module is responsible for handling the Mars indentation rule, and parsing string literals. This implements everything described in :ref:`ref-lexical`. **Errors**: Syntax errors for unexpected characters, malformed tokens (such as string literals) or indentation. Parser ------ (Module: :mod:`parsem`, short for "parse Mars"; the module :mod:`parse` is a generic parser library written for Mars.) This phase reads the stream of tokens and applies grammar productions to produce a complete program in AST form. **Errors**: Syntax errors for failing to parse against the grammar. Type checker ------------ (Module: :mod:`typecheck`.) This phase runs over the entire program in AST form and performs type unification. Note that expressions in the AST data structure may or may not have a type. The expressions produced by the parser do not have type information. This phase adds type information. It also infers types of local variables if they have not been given, and throws type errors if anything fails to unify. **Errors**: Type errors, if types fail to unify or an explicit type does not match the inferred type. AST to CFG conversion --------------------- (Module: :mod:`ast_cfg`.) This is the most complicated phase. It transforms the program from the type-annotated AST form into the Mars IR control-flow graph form. This involves many responsibilities: * Converting basic statements into instruction sequences * Converting complex statements into basic blocks and branches * Simplifying complex :keyword:`switch` statement patterns into simple cases (see :ref:`ref-switchfactor`) * Checking that :keyword:`switch` statements cover all possibilities. * Converting variables into SSA form, so that each variable is assigned in exactly one place. Variables assigned in more than one place are given subscripts, and phi nodes are added to join variables together. * Tracking which variables have been definitely assigned, and checking that variables are not used unless definitely assigned. * Checking that all code paths of a function return a value. * Generating special functions for reading and updating fields of each user-defined type * Generating closure templates **Errors**: Use of a variable without being definitely assigned. Possible to hit the end of a function without a :keyword:`return` statement. A :keyword:`switch` statement that does not cover all possibilities. Type dictionary augmentation ---------------------------- (Module: :mod:`typedict`.) This phase takes a complete Mars IR program, in control-flow graph form, and augments its functions with type dictionaries, as described in :ref:`ref-typedicts`. This generates the type dictionary functions and modifies existing functions to use those. The result is a program that does not call the built-in functions :func:`show` or :func:`eq`, instead implementing them non-primitively using type dictionaries. Back-ends ========= Mars is built with an extensible architecture. Both the :program:`mars` and :program:`marsc` tools have the choice of several back-ends which carry out interpretation of the program in a different way, or write out a different output format. Interpret --------- :program:`mars` only (Module: :mod:`interpret`.) A highly inefficient reference implementation of Mars. This interpreter is written almost all in pure Mercury (with a few additional modules written in C for efficient array handling). It uses a Mercury data structure to represent values, and contains an interpreter that executes Mars assembly code without further compilation. First-class function objects are represented as values containing the Mars assembly objects for those functions (they are not converted into Mercury functions). Built-ins are implemented as Mercury functions in this module. Note that this isn't quite straightforward since this module has seen the Mars assembly representation evolve considerably and hasn't quite caught up. Notably, it still represents functions as objects that can be curried arbitrarily, a throwback to the days when Mars functions were always unary, like Haskell, and before we invented closure templates. It is not possible to pre-compile code to be interpreted later. The interpreter only works directly on raw Mars code. Pretty ------ :program:`marsc` only This compiler-only back-end simply spits out the original Mars source code from the AST data structure, formatting it nicely. The :mod:`marsc` module features a special case to call this backend *without* first compiling the code to MAS, so that it may have access to the uncompiled Mars source code. ASM --- :program:`marsc` only This compiler-only back-end emits raw compiled Mars assembly (MAS) code. Note that MAS is not an executable format; there is no tool that can read it in. This is solely for human readability of the compiled code. The MAS printer attempts to "pretty up" the code by making it look more like Mars source code than assembly code. For example, rather than writing ``mov x, y``, it writes ``x = y``. This may change in the future. .. _ref-interactive-prompt: Interactive prompt ================== (Module: :mod:`interactive`.) The interactive prompt is an important part of the Mars suite and a significant amount of programming effort went into making it work and maintaining it. (It is remarkably different to a normal compiler in the way it works.) The type :mod:`istate` carries around the state of the interpreter and is threaded through most predicates. Note that it needs to carry around a lot of state that is normally hidden as internal data structures in the compiler phases (such as a varset, for type checking). Many modules need to expose their otherwise-internal functionality just so that the interactive prompt can splice itself in the middle of the compiler phases. For example, the :mod:`ast_cfg` module has internal counters to track the internal subscript on each variable. The interactive prompt has access to this data, which is needed in order to give variables increasing subscripts as the session goes on. Note that there is an intimate relationship between the interactive prompt and the executor back-ends. Most of the :type:`executor` methods are provided so that interactive can perform its job. Each executor needs to know how to store interpreter variables in a way that can persist over a long time. This is a difficult problem, for example, in the LLVM executor, which normally stores local variables in temporary registers -- it has a special case to store interpreter variables in global state so that they can be retrieved by subsequent interactive statements. It is also necessary to be able to generate new functions on-the-fly. While the interactive prompt does not permit :keyword:`def` statements, it may be necessary to generate new closure templates if a function is curried in a way that it has not been before. These new functions are generated by the :mod:`ast_cfg` module, and the interactive prompt can request that the executor insert those functions into its internal state before executing the new instructions.