Lexical analysis

Mars programs are, at the lowest level, composed of a sequence of “tokens”, in the usual sense of the word when applied to parsing. This section describes the grammar of the tokens. The higher-level syntactic constructs are described in the later sections.

Character encoding

Mars source files are plain ASCII text files. All bytes must be in the range 0...127, and are interpreted as ASCII characters (one character per byte). As an exception, comments may contain non-ASCII bytes.

Line structure

Mars programs, like Python, are structured by the use of whitespace. Newlines are used to end lines, and indentation is used to specify where blocks begin and end. Mars uses a very similar layout rule to Python [1].

Newline characters ('\n' or '\r\n') generate a NEWLINE token. Only space (' ') and tab ('\t') characters are recognised as whitespace.

Comments

Mars has line-based comments. A # signals the start of a comment, which continues until the end of the line.

Blank lines

Lines with only whitespace and comments are blank lines, and are ignored for the purpose of generating INDENT and DEDENT tokens. The implication of this is that a block may contain blank lines without indentation, which do not terminate the block.

Indentation

Blocks are started by a line which is indented further than the previous line, and terminated by a line which is indented the same as the line preceding the block.

The grammar (as defined later in this document) specifies how indentation delimits blocks. This section describes how INDENT and DEDENT tokens are generated from the whitespace.

The tokenizer maintains a stack of integers, which holds the indentation level (column number) of each block in the current stack of blocks. The stack is never empty; its bottom element is always 0 (the leftmost column, representing the top-level block of the program).

At the beginning of tokenization, the stack is initialised with a single element, the value 0.

At the start of each line, all contiguous whitespace (space and tab characters) is consumed, and then the cursor position is read. Each space character advances the cursor by 1 unit. Each tab character advances the cursor by up to 8 units, to the next multiple of 8 (i.e. tabstops are at columns 8, 16, 24, etc.).

Note

The choice of 8 as a tab width is arbitrary, but a common convention in Unix tools and other whitespace-dependent programming languages. This choice should have no effect on the code structure unless tabs and spaces are mixed, which is not recommended.
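The cursor computation described above can be sketched as follows. This is a hypothetical illustration, not part of the Mars implementation; the function name `indent_width` is invented for this example.

```python
# Hypothetical sketch: compute the cursor position after the leading
# whitespace of a line, using the 8-column tabstop rule described above.
TAB_WIDTH = 8

def indent_width(line: str) -> int:
    col = 0
    for ch in line:
        if ch == ' ':
            col += 1                             # each space advances by 1 unit
        elif ch == '\t':
            col += TAB_WIDTH - col % TAB_WIDTH   # advance to the next multiple of 8
        else:
            break                                # stop at first non-whitespace character
    return col
```

For example, a line beginning with two spaces and a tab yields a cursor position of 8 (the tab advances from column 2 to the next tabstop), the same as a line beginning with a single tab.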

Once the cursor position is known, it is compared to the integers on the stack:

  • If the cursor position is equal to the integer on top of the stack, this line has the same indentation as the previous line. No indentation tokens are produced, and the stack is unchanged.
  • If the cursor position is greater than the integer on top of the stack, this line begins a new block. One INDENT token is produced, and the cursor position is pushed onto the stack.
  • If the cursor position is less than the integer on top of the stack, the previous line was the last of its block, and this line must continue some block in the hierarchy.
    • If the cursor position is equal to one of the integers on the stack, all integers above it on the stack (those greater than the cursor position) are popped, leaving the cursor position on top. One DEDENT token is produced for each integer popped. (This handles the case of multiple nested blocks being terminated on the same line.)
    • If the cursor position is not equal to one of the integers on the stack, it is an indentation error. The parser MAY stop reading the source file, at its discretion, and the compiler MUST reject the program.
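The stack manipulation described by these rules can be sketched as follows. This is a hypothetical illustration (the specification does not prescribe an implementation); `indent_tokens` is an invented name, and tokens are represented as plain strings.

```python
# Hypothetical sketch of the INDENT/DEDENT rules above. Given the cursor
# position at the start of a (non-blank) line, emit the indentation tokens
# and update the stack. The stack always has 0 at the bottom.
def indent_tokens(stack: list, cursor: int) -> list:
    if cursor == stack[-1]:
        return []                      # same indentation: no tokens
    if cursor > stack[-1]:
        stack.append(cursor)           # new block: one INDENT
        return ["INDENT"]
    if cursor not in stack:
        raise SyntaxError("indentation error")
    tokens = []
    while stack[-1] != cursor:         # pop back to the enclosing block,
        stack.pop()                    # one DEDENT per block terminated
        tokens.append("DEDENT")
    return tokens
```

Note that a single line may close several nested blocks at once, producing one DEDENT per block, and that a cursor position which matches no stack entry is rejected as an indentation error.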

Other tokens

Aside from the indentation rules, tokenization is fairly straightforward. Whitespace characters (space and tab) delimit tokens. Tokens do not require whitespace to separate them if it is unambiguous where one token ends and the next begins.

If any ambiguity arises, the lexeme which produces a token is the longest possible sequence of characters which produces a valid token, when read from left to right.
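This longest-match rule can be illustrated with a small sketch. The symbol set below is a representative subset of the table in the Symbols section; `next_symbol` is a hypothetical helper, not part of any Mars implementation.

```python
# Hypothetical illustration of the longest-match rule: try the longest
# candidate lexeme first, so "<=" is one token rather than "<" then "=".
SYMBOLS = {"<", "<=", ">", ">=", "=", "==", "!=",
           ".", "...", ":", "::", ":=", "/", "//", "-", "->"}
MAX_LEN = max(len(s) for s in SYMBOLS)

def next_symbol(src: str, pos: int) -> str:
    for n in range(MAX_LEN, 0, -1):        # longest candidates first
        if src[pos:pos + n] in SYMBOLS:
            return src[pos:pos + n]
    raise SyntaxError("no valid token at position %d" % pos)
```

For example, the input `...` lexes as a single `...` token rather than three `.` tokens, and `//` lexes as one `//` token rather than two `/` tokens.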

Keywords

The following lexemes are keywords:

builtin     for             pass        while
case        if              return
def         import          switch
elif        io              type
else        native_import   var

Each keyword generates a token of the same name, and may be used in special places in the language’s grammar. Keywords must appear exactly as written, and may not be used as identifiers.

Identifiers

Identifiers are used for naming components of a program, such as types and variables. There are two distinct types of identifier – those which begin with an uppercase letter and those which begin with a lowercase letter. These are used in different contexts.

Note that keywords are never considered identifiers.

identifier       ::=  upper_identifier | lower_identifier
upper_identifier ::=  uppercase (alpha | digit | "_")*
lower_identifier ::=  (lowercase|"_") (alpha | digit | "_")*
alpha            ::=  lowercase | uppercase
lowercase        ::=  "a"..."z"
uppercase        ::=  "A"..."Z"
digit            ::=  "0"..."9"

The identifier "_" is special, and generates its own token, rather than the token identifier.
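The identifier grammar above can be expressed as a classification sketch. This is a hypothetical illustration; the `classify` function and its return values are invented for this example, and the keyword set is taken from the Keywords section.

```python
import re

# Hypothetical sketch: classify a lexeme according to the identifier
# grammar above. Keywords are never identifiers, and "_" has its own token.
KEYWORDS = {"builtin", "case", "def", "elif", "else", "for", "if", "import",
            "io", "native_import", "pass", "return", "switch", "type",
            "var", "while"}
UPPER_IDENT = re.compile(r"[A-Z][A-Za-z0-9_]*\Z")
LOWER_IDENT = re.compile(r"[a-z_][A-Za-z0-9_]*\Z")

def classify(lexeme: str) -> str:
    if lexeme in KEYWORDS:
        return "keyword"               # keywords are never identifiers
    if lexeme == "_":
        return "_"                     # "_" generates its own token
    if UPPER_IDENT.match(lexeme):
        return "upper_identifier"
    if LOWER_IDENT.match(lexeme):
        return "lower_identifier"
    return "invalid"
```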

Symbols

The following lexemes represent symbols. Each symbol has its own token (of the same name).

(           .           ->          =           <           +           //
)           ...         ::          =!          <=          -           %
[           ,           :           ==          >           *
]           _           :=          !=          >=          /

Literals

Number literals

Number literals are used to write non-negative real number constants in source code.

num_literal ::=  digit+ [ "." digit+ ]

This is interpreted as a decimal number in the usual way, and converted into a floating point number. This may involve rounding to the nearest representable floating point value.

Note

While number literals must not be negative, negative numbers may be written in the usual way, e.g., -4. Technically this is not a number literal, but a unary negation applied to a positive number literal.
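The num_literal production can be sketched as a recogniser and converter. This is a hypothetical illustration; `parse_num` is an invented name, and Python's `float` stands in for whatever conversion the implementation uses.

```python
import re

# Hypothetical sketch: recognise a num_literal (digit+ [ "." digit+ ])
# and convert it to a floating point value, which may round to the
# nearest representable value.
NUM_LITERAL = re.compile(r"[0-9]+(\.[0-9]+)?\Z")

def parse_num(lexeme: str) -> float:
    if not NUM_LITERAL.match(lexeme):
        raise SyntaxError("not a number literal: " + lexeme)
    return float(lexeme)
```

Note that the grammar requires digits on both sides of the decimal point, so `4.` and `.5` are not valid number literals.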

String and character literals

String literals are used to write constant sequences of ASCII characters in source code. Character literals are used to write single ASCII character constants.

string_literal  ::=  '"' string_lit_item* '"'
char_literal    ::=  "'" char_lit_item "'"
string_lit_item ::=  string_lit_char | escape_sequence
char_lit_item   ::=  char_lit_char | escape_sequence
string_lit_char ::=  <any ASCII character in range 32...127, except "\" or '"'>
char_lit_char   ::=  <any ASCII character in range 32...127, except "\">

String and character literals may contain any raw ASCII character, except for control characters and the backslash (which is used for escape sequences). String literals may also not contain the " character, which terminates the string. Note that bytes above 127 are not standard ASCII and are illegal in string and character literals.

Escape sequences may be used to encode any byte value into a string or character literal:

escape_sequence  ::=  "\" (escape_character | "x" hexdigit hexdigit)
escape_character ::=  "0" | "a" | "b" | "t" | "n" | "v" | "f" | "r" | "e"
                      | '"' | "'" | "\"
hexdigit         ::=  "0"..."9" | "a"..."f" | "A"..."F"

Each escape character has a particular associated ASCII character value, according to this table:

Escape Sequence ASCII character Byte value (hex)
\0 Null character (NUL) 00
\a Bell (BEL) 07
\b Backspace (BS) 08
\t Horizontal Tab (HT) 09
\n Linefeed (LF) 0a
\v Vertical Tab (VT) 0b
\f Formfeed (FF) 0c
\r Carriage Return (CR) 0d
\e Escape (ESC) 1b
\" Double quote (") 22
\' Single quote (') 27
\\ Backslash (\) 5c

Furthermore, the escape sequence \xhh (a “\x” followed by exactly two hexdigit characters) specifies an arbitrary byte value for a character in hexadecimal, in the range 0...255 (or 00...ff in hexadecimal).

Note

There is a deliberate discrepancy between the allowed raw character range (32...127) and \x escape value range (0...255). This is because strings in Mars are currently viewed as byte strings, not character strings. Therefore, we allow proper ASCII characters (less than 128) to be converted into byte values. Source characters above 128 have no clearly-defined translation to a byte value, nor do Mars source files have any notion of character encoding. Therefore, string literals may use escape syntax to specify arbitrary byte values, but non-ASCII values may not appear directly in the source code.
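The escape table and the \xhh rule can be sketched as a decoder. This is a hypothetical illustration; `decode_escape` is an invented helper which takes the characters after the backslash and returns the byte value from the table above.

```python
# Hypothetical sketch: decode one escape sequence (the characters after
# the backslash) into a byte value, per the table above.
ESCAPES = {"0": 0x00, "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0a,
           "v": 0x0b, "f": 0x0c, "r": 0x0d, "e": 0x1b,
           '"': 0x22, "'": 0x27, "\\": 0x5c}

def decode_escape(seq: str) -> int:
    if seq in ESCAPES:
        return ESCAPES[seq]
    if len(seq) == 3 and seq[0] == "x":
        return int(seq[1:], 16)        # \xhh: arbitrary byte value 0...255
    raise SyntaxError("invalid escape sequence: \\" + seq)
```

For example, `\n` decodes to byte value 0x0a, while `\xff` uses the hexadecimal form to produce a byte value (255) that could not appear as a raw character in the source.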

[1] See Python Language Reference - Lexical Analysis.