Mars programs are, at the lowest level, composed of a sequence of “tokens”, in the usual sense of the word when applied to parsing. This section describes the grammar of the tokens. The higher-level syntactic constructs are described in the later sections.
Mars source files are plain ASCII text files. All bytes must be in the range 0...127, and are interpreted as ASCII characters (one character per byte). As an exception, comments may contain non-ASCII bytes.
Mars programs, like Python, are structured by the use of whitespace. Newlines are used to end lines, and indentation is used to specify where blocks begin and end. Mars uses a very similar layout rule to Python [1].
Newline characters ('\n' or '\r\n') generate a NEWLINE token. Only space (' ') and tab ('\t') characters are recognised as whitespace.
Mars has line-based comments. A # signals the start of a comment, which continues until the end of the line.
Lines with only whitespace and comments are blank lines, and are ignored for the purpose of generating INDENT and DEDENT tokens. The implication of this is that a block may contain blank lines without indentation, which do not terminate the block.
Blocks are started by a line which is indented further than the previous line, and terminated by a line which is indented the same as the line preceding the block.
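For example, the following fragment (using placeholder statement names, since actual statement syntax is described in later sections) shows where blocks begin and end, and that an unindented blank line does not terminate a block:

```
stmt_a              # column 0: top-level block
    stmt_b          # column 4: a new, more deeply indented block begins
    stmt_c          # column 4: same block

    stmt_d          # the blank line above is ignored; still the same block
stmt_e              # column 0: the indented block has ended
```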
The grammar (as defined later in this document) specifies how indentation delimits blocks. This section describes how INDENT and DEDENT tokens are generated from the whitespace.
The tokenizer maintains a stack of integers, which holds the indentation level (column number) of each block in the current stack of blocks. The stack is never empty; its bottom element is always 0 (the leftmost column, representing the top-level block of the program).
At the beginning of tokenization, the stack is initialised with a single element, the value 0.
At the start of each line, all contiguous whitespace (space and tab characters) is consumed, and then the cursor position is read. Each space character advances the cursor by 1. Each tab character advances the cursor to the next multiple of 8 (i.e. tab stops are at columns 8, 16, 24, etc.).
Note
The choice of 8 as a tab width is arbitrary, but a common convention in Unix tools and other whitespace-dependent programming languages. This choice should have no effect on the code structure unless tabs and spaces are mixed, which is not recommended.
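The cursor computation can be sketched as follows. This is an illustrative Python sketch, not part of the Mars implementation:

```python
def indentation_column(line):
    """Compute the cursor position after a line's leading whitespace.

    A space advances the cursor by 1; a tab advances it to the next
    multiple of 8 (tab stops at columns 8, 16, 24, ...).
    """
    col = 0
    for ch in line:
        if ch == ' ':
            col += 1
        elif ch == '\t':
            col = (col // 8 + 1) * 8   # jump to the next tab stop
        else:
            break
    return col

assert indentation_column("    x") == 4
assert indentation_column("\tx") == 8
assert indentation_column("  \tx") == 8   # a tab from column 2 still lands on 8
```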
Once the cursor position is known, it is compared to the integers on the stack:

- If the cursor position equals the top of the stack, the line continues the current block, and no INDENT or DEDENT tokens are generated.
- If the cursor position is greater than the top of the stack, it is pushed onto the stack and a single INDENT token is generated (a new block begins).
- If the cursor position is less than the top of the stack, values are popped from the stack until the top equals the cursor position, and one DEDENT token is generated for each value popped (each pop ends one block). If no value on the stack equals the cursor position, the indentation is inconsistent and an error is reported.
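The comparison step can be sketched in Python as follows (illustrative only; not the Mars implementation):

```python
def indent_tokens(col, stack):
    """Return the INDENT/DEDENT tokens for a non-blank line whose leading
    whitespace ends at cursor position `col`, updating `stack` in place."""
    tokens = []
    if col > stack[-1]:
        stack.append(col)            # a new, more deeply indented block
        tokens.append('INDENT')
    else:
        while col < stack[-1]:
            stack.pop()              # one DEDENT per block closed
            tokens.append('DEDENT')
        if col != stack[-1]:
            raise SyntaxError("inconsistent indentation")
    return tokens

stack = [0]                                     # the stack always starts as [0]
assert indent_tokens(4, stack) == ['INDENT']    # a block begins
assert indent_tokens(4, stack) == []            # the block continues
assert indent_tokens(0, stack) == ['DEDENT']    # the block ends
```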
Aside from the indentation rules, tokenization is fairly straightforward. Whitespace characters (space and tab) delimit tokens. Tokens do not require whitespace to separate them if it is unambiguous where one token ends and the next begins.
If any ambiguity arises, the lexeme chosen is the longest possible sequence of characters which forms a valid token, reading from left to right (the "longest match" rule).
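The longest-match rule can be illustrated with the following Python sketch (the symbol list is drawn from the Symbols section below, and is not the Mars implementation):

```python
# Symbols tried longest first, so that e.g. ":=" is preferred over ":" then "=".
SYMBOLS = ['...', '->', '::', ':=', '==', '!=', '<=', '>=', '//',
           '(', ')', '[', ']', '.', ',', ':', '=', '<', '>',
           '+', '-', '*', '/', '%']

def match_symbol(text, pos):
    """Return the longest symbol lexeme starting at `pos`, or None."""
    for sym in sorted(SYMBOLS, key=len, reverse=True):
        if text.startswith(sym, pos):
            return sym
    return None

assert match_symbol("x:=1", 1) == ':='
assert match_symbol("a<=b", 1) == '<='
assert match_symbol("xs...", 2) == '...'
```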
The following lexemes are keywords:
    builtin    for              pass      while
    case       if               return
    def        import           switch
    elif       io               type
    else       native_import    var
Each keyword generates a token of the same name, which may appear only in the specific places allowed by the language's grammar. Keywords must be written exactly as shown, and may never be used as identifiers.
Identifiers are used for naming components of a program, such as types and variables. There are two distinct types of identifier – those which begin with an uppercase letter and those which begin with a lowercase letter. These are used in different contexts.
Note that keywords are never considered identifiers.
    identifier       ::= upper_identifier | lower_identifier
    upper_identifier ::= uppercase (alpha | digit | "_")*
    lower_identifier ::= (lowercase | "_") (alpha | digit | "_")*
    alpha            ::= lowercase | uppercase
    lowercase        ::= "a"..."z"
    uppercase        ::= "A"..."Z"
    digit            ::= "0"..."9"
The identifier "_" is special, and generates its own token, rather than the token identifier.
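A lexeme with identifier syntax might be classified as follows. This is an illustrative Python sketch; the token names follow the grammar above, and the keyword set is the one listed earlier in this section:

```python
import re

KEYWORDS = {'builtin', 'case', 'def', 'elif', 'else', 'for', 'if',
            'import', 'io', 'native_import', 'pass', 'return',
            'switch', 'type', 'var', 'while'}

IDENTIFIER_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def classify(lexeme):
    """Classify a lexeme that has identifier syntax into its token kind."""
    assert IDENTIFIER_RE.fullmatch(lexeme)
    if lexeme in KEYWORDS:
        return lexeme                 # keywords are never identifiers
    if lexeme == '_':
        return '_'                    # "_" generates its own token
    if lexeme[0].isupper():
        return 'upper_identifier'
    return 'lower_identifier'

assert classify('while') == 'while'
assert classify('List') == 'upper_identifier'
assert classify('_tmp') == 'lower_identifier'
assert classify('_') == '_'
```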
The following lexemes represent symbols. Each symbol has its own token (of the same name).
    (    .      ->    =     <     +    //
    )    ...    ::    =!    <=    -    %
    [    ,      :     ==    >     *
    ]    _      :=    !=    >=    /
Number literals are used to write non-negative real number constants in source code.
    num_literal ::= digit+ [ "." digit+ ]
This is interpreted as a decimal number in the usual way, and converted into a floating point number. This may involve rounding to the nearest representable floating point value.
Note
While number literals must not be negative, negative numbers may be written in the usual way, e.g., -4. Technically this is not a number literal, but a unary negation applied to a positive number literal.
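Recognising and converting a number literal might look like the following Python sketch (illustrative only; it assumes the same round-to-nearest conversion behaviour as Python's float, which is an assumption about the Mars implementation's floating point format):

```python
import re

NUM_RE = re.compile(r'[0-9]+(\.[0-9]+)?')   # num_literal ::= digit+ [ "." digit+ ]

def read_num_literal(text, pos):
    """Match a num_literal at `pos`; return (value, end position) or None."""
    m = NUM_RE.match(text, pos)
    if m is None:
        return None
    # Decimal interpretation, rounded to the nearest representable float.
    return float(m.group(0)), m.end()

assert read_num_literal("3.25 + x", 0) == (3.25, 4)
assert read_num_literal("-4", 0) is None    # "-" is a separate token (unary negation)
```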
String literals are used to write constant sequences of ASCII characters in source code. Character literals are used to write single ASCII character constants.
    string_literal  ::= '"' string_lit_item* '"'
    char_literal    ::= "'" char_lit_item "'"
    string_lit_item ::= string_lit_char | escape_sequence
    char_lit_item   ::= char_lit_char | escape_sequence
    string_lit_char ::= <any ASCII character in range 32...127, except "\" or '"'>
    char_lit_char   ::= <any ASCII character in range 32...127, except "\">
String and character literals may contain any raw ASCII character, except for control characters and the backslash (which is used for escape sequences). String literals may also not contain the " character, which terminates the string. Note that bytes above 127 are not standard ASCII and are illegal in string and character literals.
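These raw-character rules can be summarised by the following Python sketch (illustrative only; it follows the grammar above):

```python
def may_appear_unescaped(ch, in_string):
    """May `ch` appear directly (without an escape) inside a literal?

    `in_string` is True for string literals, False for character literals.
    """
    code = ord(ch)
    if not (32 <= code <= 127):     # control characters and non-ASCII bytes
        return False
    if ch == '\\':                  # backslash introduces an escape sequence
        return False
    if in_string and ch == '"':     # '"' would terminate the string literal
        return False
    return True

assert may_appear_unescaped('a', in_string=True)
assert not may_appear_unescaped('"', in_string=True)
assert may_appear_unescaped('"', in_string=False)   # allowed in a char literal
```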
Escape sequences may be used to encode any byte value into a string or character literal:
    escape_sequence  ::= "\" (escape_character | "x" hexdigit hexdigit)
    escape_character ::= "0" | "a" | "b" | "t" | "n" | "v" | "f" | "r" | "e" | '"' | "'" | "\"
    hexdigit         ::= "0"..."9" | "a"..."f" | "A"..."F"
Each escape character has a particular associated ASCII character value, according to this table:
| Escape Sequence | ASCII character | Byte value (hex) |
|-----------------|-----------------|------------------|
| \0 | Null character (NUL) | 00 |
| \a | Bell (BEL) | 07 |
| \b | Backspace (BS) | 08 |
| \t | Horizontal Tab (HT) | 09 |
| \n | Linefeed (LF) | 0a |
| \v | Vertical Tab (VT) | 0b |
| \f | Formfeed (FF) | 0c |
| \r | Carriage Return (CR) | 0d |
| \e | Escape (ESC) | 1b |
| \" | Double quote (") | 22 |
| \' | Single quote (') | 27 |
| \\ | Backslash (\) | 5c |
Furthermore, the escape sequence \xhh (a "\x" followed by exactly two hexdigit characters) specifies an arbitrary byte value in hexadecimal, in the range 0...255 (00...ff in hexadecimal).
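Decoding an escape sequence might look like the following Python sketch (illustrative only; the mapping follows the table above and the \xhh form):

```python
ESCAPES = {
    '0': 0x00, 'a': 0x07, 'b': 0x08, 't': 0x09, 'n': 0x0a, 'v': 0x0b,
    'f': 0x0c, 'r': 0x0d, 'e': 0x1b, '"': 0x22, "'": 0x27, '\\': 0x5c,
}

def decode_escape(text, pos):
    """Decode the escape sequence whose backslash is at `pos`.

    Returns (byte value, position just past the escape sequence)."""
    assert text[pos] == '\\'
    ch = text[pos + 1]
    if ch == 'x':
        return int(text[pos + 2:pos + 4], 16), pos + 4   # \xhh: any byte 00...ff
    return ESCAPES[ch], pos + 2

assert decode_escape(r'\n', 0) == (0x0a, 2)
assert decode_escape(r'\x00', 0) == (0x00, 4)
assert decode_escape(r'\xff', 0) == (0xff, 4)
```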
Note
There is a deliberate discrepancy between the allowed raw character range (32...127) and the \x escape value range (0...255). This is because strings in Mars are currently viewed as byte strings, not character strings. Therefore, we allow proper ASCII characters (less than 128) to be converted into byte values. Source bytes above 127 have no clearly-defined translation to a byte value, nor do Mars source files have any notion of character encoding. Therefore, string literals may use escape syntax to specify arbitrary byte values, but non-ASCII values may not appear directly in the source code.
[1] See Python Language Reference, Lexical Analysis.