Instruction set architecture

Mars internally compiles source code down to “Mars internal representation” (Mars IR) for processing and code generation. Mars IR is a very high-level instruction set (more abstract than, say, the Java virtual machine), but still is broken down much further than the Mars source code.

Very limited processing takes place before Mars source code is converted into Mars IR. Syntax checking and type checking is done on the Mars source text or abstract syntax tree formats, and then it is converted into Mars IR. All of the subsequent analysis and code generation is done on Mars IR. All backends use the Mars IR format as input, except the pretty printer. Mars IR has the following properties:

  • Preserves Mars types: Functions in Mars IR have the same headers as in Mars itself. Therefore, all types are the same, and Mars IR is fully aware of type polymorphism. Similarly, it is aware of CGCs, algebraic data types and constructors.
  • Flattened execution: There are no statements or expressions, only instructions. Instructions are similar to statements, but they do not nest; they typically perform the role of a single expression, and assign the result to a new variable. Arguments to instructions are usually local variable names (for example, most instructions cannot refer to global variables; only special instructions designed for this purpose).
  • Control flow graph: Functions are stored internally as a set of basic blocks. Each basic block is a sequence of instructions without any branches, and the last instruction in the block (the “terminator”) is a special instruction that can return from the function or branch to another block. This is similar to labels-and-jumps assembly code, except that jumps between blocks are explicit; it is not possible to “fall” from one block to the next (and as a corollary, it is not possible to jump into the middle of a block).
  • Explicit “return” variable: Each function has a specific local variable designated the “return” variable. The result of the function is the value of this variable when the function ends.
  • Explicit entry and exit block: Each function has a specific block designated the “entry” block, which is the block that is executed initially. Perhaps less conventionally, there is a specific block designated the “exit” block. There is no “return” instruction; the function execution ends when control falls out of the “exit” block.
  • Static single assignment: All local variables may be assigned at exactly one place (statically) in the code. If a variable needs to be “re-assigned,” instead a new variable must be created with a qualified name (for example, if x is already assigned, assign x:1). A special “phi” instruction is provided to allow a variable to take on the value of one of many source variables, to deal with branches and loops which assign variables.

An informal textual representation of Mars IR is “Mars assembly” (MAS). The marsc program can output MAS representation using --backend=asm. This is “informal” as it doesn’t use the official instruction names (it tries to look more like Mars syntax), and there is no parser for MAS; it’s just a debugging format.

Internally, Mars IR is referred to as “CFG representation” (control-flow graph), due to the main data structure used. The ast_cfg module is responsible for converting parsed Mars source code (in AST or abstract syntax tree form) into Mars IR, as well as doing some semantic checks such as initialised variable checks. There is no specific binary representation for Mars IR.

What follows is a description of each of the Mars IR instructions, which appear within basic blocks within functions and CGCs. They are separated into three categories, which determine the context in which the instructions are allowed.

phi instruction

The phi instruction is a special instruction that may only appear at the beginning of a basic block, before any other instructions. Each block may have zero or more phi instructions.

phi dest sources

Copies the value of a variable in sources to dest, depending on which block the execution just came from. sources is a list of (block - variable) pairs, which MUST contain an entry for each predecessor block to the current block. At runtime, the block that was executing previously before jumping to the current block will be selected from the list, and the corresponding variable will be chosen, and its value copied into dest.

This instruction is necessary due to the static single assignment form of Mars code. Since it is not possible to assign the same variable in two places (e.g., a), the code must assign two different variables (e.g., a:1 and a:2). In order to use the value of this variable no matter where it was assigned, a phi should create another variable (e.g., a:3) from the original two variables.

Basic instructions

The basic instructions form the body of a basic block. Each block may have zero or more basic instructions.

Many instructions accept an atom, which refers to one of the following:

  • A variable name,
  • A number (double-precision floating point), or
  • A constant symbol (parameterless constructor). This MUST be the name of a constant symbol, not a constructor function.
nop
Does nothing.
mov dest source
Copies the value of atom source to variable dest.
ld_arraylit dest sources
Creates a new array, with the values of the variables listed in sources as its elements, and assigns it to dest. sources is a list of atoms.
ld_cgc dest cgc
Loads the value of a computable global constant cgc to dest. This may execute the body of the CGC, which may cause side-effects or fail to return.
ld_field dest source ctor index

Loads a field of an object, by index of a given constructor. Fields are indexed by parameters of the constructor used to create the object (0 is the first parameter, etc). Copies the value of field number index from variable source to dest. source MUST be a variable of an algebraic data type. It MUST be statically known to have been constructed with constructor ctor. This constructor MUST have more than index parameters.

This is a very low-level operation compared to Mars’ field reference expression, because it assumes that the constructor is known. It is illegal to use this on an arbitrary variable of an algebraic data type with multiple constructors, as the constructor will not be known. To implement field reference, first, perform a switch on the constructor, then use ld_field within each branch of the switch. ld_field is also used to implement variable bindings in pattern matches, where the constructor is already known. (The Mars compiler creates special Mars IR functions for high-level field access, which internally perform the switch and ld_field described above.)

fieldset source ctor index value

Destructively modifies a field of an object, by index of a given constructor. Fields are indexed by parameters of the constructor used to create the object (0 is the first parameter, etc). Copies the value of atom value to field number index of variable source, mutating the object referenced by source. source MUST be a variable of an algebraic data type. It MUST be statically known to have been constructed with constructor ctor. This constructor MUST have more than index parameters.

See the notes for ld_field.

fieldreplace dest source ctor index value

Copies an object, changing a single field in the resulting object, by index of a given constructor. Fields are indexed by parameters of the constructor used to create the object (0 is the first parameter, etc). Creates a new object with constructor ctor. Initialises all fields of the new object with the values of the fields in source, except for field number index, which is initialised to the value of atom value. Assigns the new object to dest. source MUST be a variable of an algebraic data type. It MUST be statically known to have been constructed with constructor ctor. This constructor MUST have more than index parameters.

See the notes for ld_field.

new_closure dest func args

Creates a new closure object from a closure template function func. Curries arguments args into the template’s closure vars, assigning the result to dest. Constructs a new function object with the parameters and result of the closure template. func MUST be the name of a closure template. args is a list of atoms containing the arguments. It MUST contain the exact number of closure variables expected by func, and any supplied arguments must be of correct types.

Ordinary functions cannot be converted into first-class function objects; to accomplish this, a closure template with no closure variables must be created, and loaded with new_closure and no args. Therefore, backends SHOULD optimise for the special case where args is empty, to avoid constructing a new closure object where possible.

call dest func args
Calls a function from local variable func, supplying arguments args and assigning the result to dest. func MUST be a variable containing a function value. args is a list of atoms containing the arguments. It MUST contain the exact number of arguments expected by func, of correct types. The execution may cause side-effects or fail to return.
call_ctor dest ctor args
Calls constructor function ctor, supplying arguments args and assigning the result to dest. ctor MUST not be the name of a constant symbol. args is a list of atoms containing the arguments. It MUST contain the exact number of arguments expected by ctor, of correct types.
call_global dest func args
Calls global function func, supplying arguments args and assigning the result to dest. func MUST not be the name of a computable global constant or constructor. args is a list of atoms containing the arguments. It MUST contain the exact number of arguments expected by func, of correct types. The execution may cause side-effects or fail to return.

Terminator instructions

The terminator instructions may only appear at the end of a basic block, after any other instructions. These instructions are responsible for control flow decisions, about where to jump to after the current block. Each block must have precisely one terminator instruction at the end.

switch control cases [default]

Conditional branch based on the value of the control variable. Matches either numbers (if control has type Num) or algebraic data type constructors. cases is a list of (num-or-constructor-name - block) pairs. If control matches one of the cases, branches to that basic block. Otherwise, branches to default (which is a block ID). default is optional, but MUST NOT be omitted unless the control is an algebraic type, and the cases cover all of the type’s constructors (in particular, it MUST NOT be omitted for switches over numbers). Two cases MUST NOT have the same case value.

This is a very low-level operation compared to Mars’ switch statement, because it cannot bind variables or match recursively. The Mars compiler handles variable bindings using the ld_field instruction. See switch factoring for the rather complicated algorithm that performs this translation.

cond_branch if then else
Conditional branch based on the value of the if atom, which MUST be an Num. If if is not equal to 0, branches to the then block. Otherwise, branches to the else block.
branch target
Unconditionally branches to the target block.
null_terminator
Does nothing. Note that this does not imply a fall-through to the subsequent block. This instruction MUST NOT appear in any block other than a) the function’s designated “exit block”, or b) a block which is statically unreachable. In the case of the exit block, this instruction signals the end of the function’s execution.

Table Of Contents

Previous topic

Architectural overview

Next topic

Switch factoring algorithm

This Page