.. Mars Documentation
   Copyright (C) 2008 Matt Giuca
   11/12/2008

.. This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

.. This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

.. You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>.

.. highlight:: none
.. _ref-lexical:

Lexical analysis
================

.. sectionauthor:: Matt Giuca <matt.giuca@gmail.com>

Mars programs are, at the lowest level, composed of a sequence of "tokens", in
the usual sense of the word when applied to parsing. This section describes
the grammar of the tokens. The higher-level syntactic constructs are described
in the later sections.

Character encoding
------------------

Mars source files are plain ASCII text files. All bytes must be in the range
0...127, and are interpreted as ASCII characters (one character per byte). As
an exception, comments may contain non-ASCII bytes.

Line structure
--------------

Mars programs, like Python, are structured by the use of whitespace. Newlines
are used to end lines, and indentation is used to specify where blocks begin
and end. Mars uses a very similar layout rule to Python [#L1]_.

Newline characters (``'\n'`` or ``'\r\n'``) generate a :token:`NEWLINE` token.
Only space (``' '``) and tab (``'\t'``) characters are recognised as
whitespace.

Comments
++++++++

Mars has line-based comments. A ``#`` signals the start of a comment,
which continues until the end of the line.

Blank lines
+++++++++++

Lines with only whitespace and comments are blank lines, and are ignored for
the purpose of generating :token:`INDENT` and :token:`DEDENT` tokens. The
implication of this is that a block may contain blank lines without
indentation, which do not terminate the block.

Indentation
+++++++++++

Blocks are started by a line which is indented further than the previous line,
and terminated by a line which is indented the same as the line preceding the
block.

The grammar (as defined later in this document) specifies how indentation
delimits blocks. This section describes how :token:`INDENT` and
:token:`DEDENT` tokens are generated from the whitespace.

The tokenizer maintains a stack of integers, which holds the indentation level
(column number) of each block in the current stack of blocks. The stack is
never empty; its bottom element is always 0 (the leftmost column, representing
the top-level block of the program).

At the beginning of tokenization, the stack is initialised with a single
element, the value 0.

At the start of each line, all contiguous whitespace (space and tab
characters) is consumed, and then the cursor position is read. Each space
character advances the cursor by 1 unit. Each tab character advances the
cursor by up to 8, to the nearest multiple of 8 (ie. tabstops are at
8, 16, 24, etc).

.. note::

   The choice of 8 as a tab width is arbitrary, but a common convention
   in Unix tools and other whitespace-dependent programming languages. This
   choice should have no effect on the code structure unless tabs and spaces
   are mixed, which is not recommended.

Once the cursor position is known, it is compared to the integers on the
stack:

* If the cursor position is equal to the integer on top of the stack, this
  line has the same indentation as the previous line. No indentation tokens
  are produced, and the stack is unchanged.
* If the cursor position is greater than the integer on top of the stack, this
  line is is a new block. One :token:`INDENT` token is
  produced, and the cursor position is pushed onto the stack.
* If the cursor position is less than the integer on top of the stack, the
  previous line was the last of its block, and this line must continue some
  block in the hierarchy.

  * If the cursor position is equal to one of the integers on the stack, all
    integers on the stack are popped up until, but not including, the cursor
    position. One :token:`DEDENT` token is produced for each integer popped.
    (This handles the case of multiple nested blocks being terminated on the
    same line).
  * If the cursor position is not equal to one of the integers on the stack,
    it is an indentation error. The parser MAY stop reading the source file,
    at its discretion, and the compiler MUST reject the program.

Other tokens
------------

Aside from the indentation rules, tokenization is fairly straightforward.
Whitespace characters (space and tab) delimit tokens. Tokens do not require
whitespace to separate them, if it is unambiguous where one token ands and the
other begins.

If any ambiguity arises, a the lexeme which produces a token is the longest
possible sequence of characters which produces a valid token, when read from
left to right.

Keywords
++++++++

The following lexemes are keywords::

    builtin     for             pass        while
    case        if              return
    def         import          switch
    elif        io              type
    else        native_import   var

Each keyword generates a token of the same name, and may be used in special
places in the language's grammar. Keywords must appear exactly as written, and
may not be used exactly as an identifier.

Identifiers
+++++++++++

Identifiers are used for naming components of a program, such as types and
variables. There are two distinct types of identifier -- those which begin
with an uppercase letter and those which begin with a lowercase letter. These
are used in different contexts.

Note that keywords are never considered identifiers.

.. productionlist::
   identifier: `upper_identifier` | `lower_identifier`
   upper_identifier: `uppercase` (`alpha` | `digit` | "_")*
   lower_identifier: (`lowercase`|"_") (`alpha` | `digit` | "_")*
   alpha: `lowercase` | `uppercase`
   lowercase: "a"..."z"
   uppercase: "A"..."Z"
   digit: "0"..."9"

The identifier ``"_"`` is special, and generates its own token, rather than
the token :token:`identifier`.

Symbols
+++++++

The following lexemes represent symbols. Each symbol has its own token (of the
same name). ::

    (           .           ->          =           <           +           //
    )           ...         ::          =!          <=          -           %
    [           ,           :           ==          >           *
    ]           _           :=          !=          >=          /

Literals
++++++++

Number literals
~~~~~~~~~~~~~~~

Number literals are used to write non-negative real number constants in source
code.

.. productionlist::
   num_literal: `digit`+ [ "." `digit`+ ]

This is interpreted as a decimal number in the usual way, and converted into a
floating point number. This may involve rounding towards the nearest
representable floating point value.

.. note::

   While number literals must not be negative, negative numbers may be written
   in the usual way, e.g., ``-4``. Technically this is not a number literal,
   but a unary negation applied to a positive number literal.

String and character literals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

String literals are used to write constant sequences of ASCII characters in
source code. Character literals are used to write single ASCII character
constants.

.. productionlist::
   string_literal: '"' `string_lit_item`* '"'
   char_literal: "'" `char_lit_item` "'"
   string_lit_item: `string_lit_char` | `escape_sequence`
   char_lit_item: `char_lit_char` | `escape_sequence`
   string_lit_char: <any ASCII character in range 32...127, except "\" or '"'>
   char_lit_char: <any ASCII character in range 32...127, except "\">

String and character literals may contain any raw ASCII character, except for
control characters and the backslash (which is used for escape sequences).
String literals may also not contain the ``"`` character, which terminates the
string. Note that bytes above 127 are not standard ASCII and are illegal in
string and character literals.

Escape sequences may be used to encode any byte value into a string or
character literal:

.. productionlist::
   escape_sequence: "\" (`escape_character` | "x" `hexdigit` `hexdigit`)
   escape_character: "0" | "a" | "b" | "t" | "n" | "v" | "f" | "r" | "e"
                   : | '"' | "'" | "\" 
   hexdigit: "0"..."9" | "a"..."f" | "A"..."F"

Each escape character has a particular associated ASCII character value,
according to this table:

+-----------------+-------------------------+------------------+
| Escape Sequence | ASCII character         | Byte value (hex) |
+=================+=========================+==================+
| ``\0``          | Null character (NUL)    | 00               |
+-----------------+-------------------------+------------------+
| ``\a``          | Bell (BEL)              | 07               |
+-----------------+-------------------------+------------------+
| ``\b``          | Backspace (BS)          | 08               |
+-----------------+-------------------------+------------------+
| ``\t``          | Horizontal Tab (HT)     | 09               |
+-----------------+-------------------------+------------------+
| ``\n``          | Linefeed (LF)           | 0a               |
+-----------------+-------------------------+------------------+
| ``\v``          | Vertical Tab (VT)       | 0b               |
+-----------------+-------------------------+------------------+
| ``\f``          | Formfeed (FF)           | 0c               |
+-----------------+-------------------------+------------------+
| ``\r``          | Carriage Return (CR)    | 0d               |
+-----------------+-------------------------+------------------+
| ``\e``          | Escape (ESC)            | 1b               |
+-----------------+-------------------------+------------------+
| ``\"``          | Double quote (``"``)    | 22               |
+-----------------+-------------------------+------------------+
| ``\'``          | Single quote (``'``)    | 27               |
+-----------------+-------------------------+------------------+
| ``\\``          | Backslash (``\``)       | 5c               |
+-----------------+-------------------------+------------------+

Furthermore, the escape sequence ``\xhh`` (a "``\x``" followed by exactly two
:token:`hexdigit` characters) specifies an arbitrary byte value for a
character in hexadecimal, in the range 0...255 (or 00..ff in hexadecimal).

.. note::

   There is a deliberate discrepancy between the allowed raw character
   range (32...127) and ``\x`` escape value range (0...255). This is because
   strings in Mars are currently viewed as byte strings, not character
   strings. Therefore, we allow proper ASCII characters (less than 128) to be
   converted into byte values. Source characters above 128 have no
   clearly-defined translation to a byte value, nor do Mars source files have
   any notion of character encoding. Therefore, string literals may use
   escape syntax to specify arbitrary byte values, but non-ASCII values may
   not appear directly in the source code.

.. [#L1] See `Python Language Reference - Lexical Analysis
         <http://docs.python.org/reference/lexical_analysis.html>`_.