Mars internally compiles source code down to “Mars internal representation” (Mars IR) for processing and code generation. Mars IR is a very high-level instruction set (more abstract than, say, the Java virtual machine), but still is broken down much further than the Mars source code.
Very limited processing takes place before Mars source code is converted into Mars IR. Syntax checking and type checking is done on the Mars source text or abstract syntax tree formats, and then it is converted into Mars IR. All of the subsequent analysis and code generation is done on Mars IR. All backends use the Mars IR format as input, except the pretty printer. Mars IR has the following properties:
An informal textual representation of Mars IR is “Mars assembly” (MAS). The marsc program can output MAS representation using --backend=asm. This is “informal” as it doesn’t use the official instruction names (it tries to look more like Mars syntax), and there is no parser for MAS; it’s just a debugging format.
Internally, Mars IR is referred to as “CFG representation” (control-flow graph), due to the main data structure used. The ast_cfg module is responsible for converting parsed Mars source code (in AST or abstract syntax tree form) into Mars IR, as well as doing some semantic checks such as initialised variable checks. There is no specific binary representation for Mars IR.
What follows is a description of each of the Mars IR instructions, which appear within basic blocks within functions and CGCs. They are separated into three categories, which determine the context in which the instructions are allowed.
The phi instruction is a special instruction that may only appear at the beginning of a basic block, before any other instructions. Each block may have zero or more phi instructions.
Copies the value of a variable in sources to dest, depending on which block the execution just came from. sources is a list of (block - variable) pairs, which MUST contain an entry for each predecessor block to the current block. At runtime, the block that was executing previously before jumping to the current block will be selected from the list, and the corresponding variable will be chosen, and its value copied into dest.
This instruction is necessary due to the static single assignment form of Mars code. Since it is not possible to assign the same variable in two places (e.g., a), the code must assign two different variables (e.g., a:1 and a:2). In order to use the value of this variable no matter where it was assigned, a phi should create another variable (e.g., a:3) from the original two variables.
The basic instructions form the body of a basic block. Each block may have zero or more basic instructions.
Many instructions accept an atom, which refers to one of the following:
Loads a field of an object, by index of a given constructor. Fields are indexed by parameters of the constructor used to create the object (0 is the first parameter, etc). Copies the value of field number index from variable source to dest. source MUST be a variable of an algebraic data type. It MUST be statically known to have been constructed with constructor ctor. This constructor MUST have more than index parameters.
This is a very low-level operation compared to Mars’ field reference expression, because it assumes that the constructor is known. It is illegal to use this on an arbitrary variable of an algebraic data type with multiple constructors, as the constructor will not be known. To implement field reference, first, perform a switch on the constructor, then use ld_field within each branch of the switch. ld_field is also used to implement variable bindings in pattern matches, where the constructor is already known. (The Mars compiler creates special Mars IR functions for high-level field access, which internally perform the switch and ld_field described above.)
Destructively modifies a field of an object, by index of a given constructor. Fields are indexed by parameters of the constructor used to create the object (0 is the first parameter, etc). Copies the value of atom value to field number index of variable source, mutating the object referenced by source. source MUST be a variable of an algebraic data type. It MUST be statically known to have been constructed with constructor ctor. This constructor MUST have more than index parameters.
See the notes for ld_field.
Copies an object, changing a single field in the resulting object, by index of a given constructor. Fields are indexed by parameters of the constructor used to create the object (0 is the first parameter, etc). Creates a new object with constructor ctor. Initialises all fields of the new object with the values of the fields in source, except for field number index, which is initialised to the value of atom value. Assigns the new object to dest. source MUST be a variable of an algebraic data type. It MUST be statically known to have been constructed with constructor ctor. This constructor MUST have more than index parameters.
See the notes for ld_field.
Creates a new closure object from a closure template function func. Curries arguments args into the template’s closure vars, assigning the result to dest. Constructs a new function object with the parameters and result of the closure template. func MUST be the name of a closure template. args is a list of atoms containing the arguments. It MUST contain the exact number of closure variables expected by func, and any supplied arguments must be of correct types.
Ordinary functions cannot be converted into first-class function objects; to accomplish this, a closure template with no closure variables must be created, and loaded with new_closure and no args. Therefore, backends SHOULD optimise for the special case where args is empty, to avoid constructing a new closure object where possible.
The terminator instructions may only appear at the end of a basic block, after any other instructions. These instructions are responsible for control flow decisions, about where to jump to after the current block. Each block must have precisely one terminator instruction at the end.
Conditional branch based on the value of the control variable. Matches either numbers (if control has type Num) or algebraic data type constructors. cases is a list of (num-or-constructor-name - block) pairs. If control matches one of the cases, branches to that basic block. Otherwise, branches to default (which is a block ID). default is optional, but MUST NOT be omitted unless the control is an algebraic type, and the cases cover all of the type’s constructors (in particular, it MUST NOT be omitted for switches over numbers). Two cases MUST NOT have the same case value.
This is a very low-level operation compared to Mars’ switch statement, because it cannot bind variables or match recursively. The Mars compiler handles variable bindings using the ld_field instruction. See switch factoring for the rather complicated algorithm that performs this translation.