Shake Lexer Spec

§ 1 Definition

§ 1.1 Lexer

The lexer, also called a tokenizer, is the first step in the compilation process. It takes the source code as input and outputs a list of tokens. It is also responsible for removing comments and whitespace.
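
As a rough illustration of this contract (source text in, list of tokens out, comments and whitespace dropped), here is a minimal Python sketch. It is not the Shake implementation: the function name `lex_words` and the simplified pattern are assumptions made for this example only, and the lexemes are returned as plain strings rather than real token objects.

```python
import re

# Hypothetical, simplified pattern. Whitespace and comments are matched so
# that they can be dropped; everything else is kept as a lexeme.
_PATTERN = re.compile(r"""
    (?P<skip>[ \t\r]+|//[^\n]*|/\*.*?\*/)   # blanks, // comment, /* comment */
  | (?P<word>\w+)                           # identifiers and numbers (simplified)
  | (?P<sym>.)                              # any other single character, incl. \n
""", re.VERBOSE | re.DOTALL)


def lex_words(source: str) -> list[str]:
    """Return the lexemes of `source` with whitespace and comments removed."""
    return [m.group() for m in _PATTERN.finditer(source) if m.lastgroup != "skip"]


print(lex_words("a = 1 // set a\n/* done */ b"))
# ['a', '=', '1', '\n', 'b']
```

The line break is kept as its own lexeme here because the ranking in § 2.1 lists the line separator as a token rather than as skipped whitespace.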

§ 1.2 Tokens

A token is a sequence of characters that forms a meaningful unit in a program. It holds the following information: Learn more about tokens

§ 2 Lexer Implementation

The lexer is implemented in the lexer package. The main lexing logic is implemented in ShakeLexingBase.

§ 2.1 Token Ranking

The order in which tokens are checked matters for the lexer implementation: for example, *= must be checked before *, otherwise *= would be split into * and =. The following is a sample ranking. It is not the only possible ranking, but it is the one used in the lexer implementation.

Skip whitespace (if existing) (e.g. space, \t)
Line Separator (\n)
Semicolon (;)
Comma (,)
Colon (:)
Dot (.)
Numbers (floating point, integer, binary, hexadecimal) (e.g. 1, 1.0). Implemented by checking whether the first character is a digit (0-9). Signs (+, -) are not captured.
Identifier (e.g. a, b). Implemented by checking whether the first character is a letter (a-z, A-Z) or an underscore (_). The check may, but does not have to, also accept a digit (0-9) as the first character, because input starting with a digit is already captured by the number token. For example, JavaScript's "\w" regex would be fine for this token.
Identifier 2 (e.g. `a`, `b`). Implemented by checking whether the first character is a backtick (`).
String (e.g. "abc"). Implemented by checking whether the first character is a double quote (").
Character (e.g. 'a'). Implemented by checking whether the first character is a single quote (').
Skip single-line comment (e.g. // abc). Implemented by checking whether the first two characters are a double slash (//), then skipping until the next \n. This also works with Windows line endings, as the \r is skipped as well.
Skip multi-line comment (e.g. /* abc */). Implemented by checking whether the first two characters are a slash and a star (/*), then skipping until the next */.
Pow Assignment (**=)
Mod Assignment (%=)
Div Assignment (/=)
Mul Assignment (*=)
Sub Assignment (-=)
Add Assignment (+=)
Increment (++)
Decrement (--)
Power (**)
Modulo (%)
Division (/)
Multiplication (*)
Subtraction (-)
Addition (+)
Logical OR (||)
Logical AND (&&)
Logical XOR (^^)
Equals (==)
Greater Than Or Equal (>=)
Less Than Or Equal (<=)
Not Equal (!=)
Greater Than (>)
Less Than (<)
Not (!)
Bitwise NAND (~&)
Bitwise NOR (~|)
Bitwise XNOR (~^)
Bitwise NOT (~)
Bitwise AND (&)
Bitwise OR (|)
Bitwise XOR (^)
Assignment (=)
LParen (()
RParen ())
LCurl ({)
RCurl (})
LBracket ([)
RBracket (])
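
The effect of the ranking can be sketched with a small first-match-wins table. The Python example below is an illustration under assumptions, not the code in ShakeLexingBase: it covers only a subset of the operators listed above, and the token names (`POW_ASSIGN`, `MUL`, ...) and function names are hypothetical.

```python
# A subset of the operators, in the order given in the ranking above.
# The first entry whose text matches at the current position wins.
# Token names are hypothetical and chosen for this example only.
RANKED_OPERATORS = [
    ("POW_ASSIGN", "**="),
    ("MUL_ASSIGN", "*="),
    ("POW", "**"),
    ("MUL", "*"),
    ("EQ", "=="),
    ("ASSIGN", "="),
]


def match_operator(source, pos, ranking=RANKED_OPERATORS):
    """Return (name, text) of the first ranked operator found at `pos`, or None."""
    for name, text in ranking:
        if source.startswith(text, pos):
            return name, text
    return None


def lex_operators(source, ranking=RANKED_OPERATORS):
    """Split a string made only of operators, spaces and tabs into token names."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos] in " \t":  # skip whitespace
            pos += 1
            continue
        match = match_operator(source, pos, ranking)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        name, text = match
        tokens.append(name)
        pos += len(text)
    return tokens


print(lex_operators("**= *= ** * == ="))
# ['POW_ASSIGN', 'MUL_ASSIGN', 'POW', 'MUL', 'EQ', 'ASSIGN']

# With * ranked before *=, the input "*=" is split into two tokens.
wrong_ranking = [("MUL", "*"), ("MUL_ASSIGN", "*="), ("ASSIGN", "=")]
print(lex_operators("*=", wrong_ranking))
# ['MUL', 'ASSIGN']
```

Any ordering in which every operator comes before its own proper prefixes gives the same result, which is why the ranking above is only one of several valid ones.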
