Shake Lexer Spec
§ 1 Definition
§ 1.1 Lexer
The lexer, also called a tokenizer, is the first step in the compilation process. It takes the source code as input and outputs a list of tokens, removing comments and whitespace along the way.
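As a minimal sketch of this input/output contract (not the Shake implementation), the Python snippet below turns a source string into a list of tokens and drops whitespace and // comments along the way. The function name tokenize, the token kinds, and the (kind, text) tuple layout are assumptions made for illustration; only identifiers, integers, and single-character symbols are handled.

```python
import re

# Hypothetical token pattern for illustration only; the real Shake lexer
# recognizes many more token types than this sketch.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<skip>\s+|//[^\n]*)          # whitespace and single-line comments
  | (?P<number>[0-9]+)              # integer literal
  | (?P<identifier>[A-Za-z_]\w*)    # letter or underscore, then word characters
  | (?P<symbol>.)                   # any other single character
    """,
    re.VERBOSE,
)


def tokenize(source):
    """Return a list of (kind, text) pairs, omitting whitespace and comments."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(source):
        kind = match.lastgroup  # name of the alternative that matched
        if kind != "skip":
            tokens.append((kind, match.group()))
    return tokens


print(tokenize("a = 1 // set a\nb"))
```

The alternatives are tried left to right, so the skip rule sees // before the catch-all symbol rule can consume a single /.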
§ 1.2 Tokens
A token is a sequence of characters that forms a meaningful unit in a program. For the information each token holds, see "Learn more about tokens".
§ 2 Lexer Implementation
The lexer is implemented in the lexer package. The main lexing logic is implemented in ShakeLexingBase.
§ 2.1 Token Ranking
The order in which token types are checked matters in the lexer implementation: for example, *= must be checked before *, otherwise the input *= would be split into * followed by =. The ranking below is not the only possible one, but it is the one used in the lexer implementation.
- Skip whitespace, if present (e.g. \t)
- Line Separator (\n)
- Semicolon (;)
- Comma (,)
- Colon (:)
- Dot (.)
- Numbers: floating point, integer, binary, hexadecimal (e.g. 1, 1.0)
  Implemented by checking whether the first character is a digit (0-9). Does not capture signs (+, -).
- Identifier (e.g. a, b)
  Implemented by checking whether the first character is a letter (a-z, A-Z) or an underscore (_). The check may, but does not have to, also accept a digit (0-9) as the first character, because input starting with a digit is already captured by the number token. JavaScript's \w character class would therefore be fine for this token.
- Identifier 2 (e.g. `a`, `b`)
  Implemented by checking whether the first character is a backtick (`).
- String (e.g. "abc")
  Implemented by checking whether the first character is a double quote (").
- Character (e.g. 'a')
  Implemented by checking whether the first character is a single quote (').
- Skip single line comment (e.g. // abc)
  Implemented by checking whether the first two characters are a double slash (//). Everything up to the next \n is then skipped. This also works with Windows line endings, as \r is skipped as well.
- Skip multi line comment (e.g. /* abc */)
  Implemented by checking whether the first two characters are a slash and a star (/*). Everything through the next */ is then skipped.
- Pow Assignment (**=)
- Mod Assignment (%=)
- Div Assignment (/=)
- Mul Assignment (*=)
- Sub Assignment (-=)
- Add Assignment (+=)
- Increment (++)
- Decrement (--)
- Power (**)
- Modulo (%)
- Division (/)
- Multiplication (*)
- Subtraction (-)
- Addition (+)
- Logical OR (||)
- Logical AND (&&)
- Logical XOR (^^)
- Equals (==)
- Greater Than Or Equal (>=)
- Less Than Or Equal (<=)
- Not Equal (!=)
- Greater Than (>)
- Less Than (<)
- Not (!)
- Bitwise NAND (~&)
- Bitwise NOR (~|)
- Bitwise XNOR (~^)
- Bitwise NOT (~)
- Bitwise AND (&)
- Bitwise OR (|)
- Bitwise XOR (^)
- Assignment (=)
- LParen (()
- RParen ())
- LCurl ({)
- RCurl (})
- LBracket ([)
- RBracket (])
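The effect of the ranking on operators can be sketched as follows. This Python snippet is an illustration under stated assumptions, not the code in ShakeLexingBase: the names RANKED_OPERATORS, match_operator, and lex_operators, as well as the token names, are hypothetical, and only the arithmetic and assignment subset of the ranking is included. Operators are tried in the listed order, so **= wins over **, which in turn wins over *.

```python
# Subset of the ranking, in the same order as the list above.
# Token names are hypothetical placeholders.
RANKED_OPERATORS = [
    ("POW_ASSIGN", "**="),
    ("MOD_ASSIGN", "%="),
    ("DIV_ASSIGN", "/="),
    ("MUL_ASSIGN", "*="),
    ("SUB_ASSIGN", "-="),
    ("ADD_ASSIGN", "+="),
    ("INCR", "++"),
    ("DECR", "--"),
    ("POW", "**"),
    ("MOD", "%"),
    ("DIV", "/"),
    ("MUL", "*"),
    ("SUB", "-"),
    ("ADD", "+"),
    ("ASSIGN", "="),
]


def match_operator(source, position, ranking=RANKED_OPERATORS):
    """Return the first (name, text) in ranking order that matches at position."""
    for name, text in ranking:
        if source.startswith(text, position):
            return name, text
    return None


def lex_operators(source, ranking=RANKED_OPERATORS):
    """Lex a string that contains only operators, spaces, and tabs."""
    tokens = []
    position = 0
    while position < len(source):
        if source[position] in " \t":
            position += 1  # skip whitespace
            continue
        match = match_operator(source, position, ranking)
        if match is None:
            raise ValueError(f"unexpected character at {position}: {source[position]!r}")
        name, text = match
        tokens.append(name)
        position += len(text)
    return tokens


print(lex_operators("**= ** *"))  # ['POW_ASSIGN', 'POW', 'MUL']

# A deliberately wrong ranking that tries the shortest operators first.
shortest_first = sorted(RANKED_OPERATORS, key=lambda pair: len(pair[1]))
print(lex_operators("**=", shortest_first))  # ['MUL', 'MUL', 'ASSIGN']
```

With the shortest-first ranking, **= is split into three tokens, which is the failure the ranking is meant to prevent. The same reasoning explains why the comment checks (// and /*) are ranked above /= and /.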