Shake Lexer Spec

§ 1 Definition

§ 1.1 Lexer

The lexer, also called a tokenizer, is the first step in the compilation process. It takes the source code as input and outputs a list of tokens. It is also responsible for removing comments and whitespace.

§ 1.2 Tokens

A token is a sequence of characters that forms a meaningful unit in a program. The information each token holds is described on the tokens page.
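To make the idea of a token concrete, here is a minimal sketch in Python. This section does not list the fields of Shake's tokens, so every field name below (`kind`, `text`, `start`, `end`) is a hypothetical assumption chosen for illustration, not Shake's actual token layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    # All fields are illustrative assumptions, not taken from the Shake sources.
    kind: str   # category of the token, e.g. "IDENTIFIER" or "MUL_ASSIGN"
    text: str   # the exact characters matched in the source
    start: int  # index of the first matched character
    end: int    # index one past the last matched character


source = "x *= 2"
token = Token(kind="MUL_ASSIGN", text="*=", start=2, end=4)

# The position fields let later stages map a token back to the source text.
assert source[token.start:token.end] == token.text
```

Storing positions alongside the matched text is a common design choice because it lets error messages point at the exact place in the source.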

§ 2 Lexer Implementation

The lexer is implemented in the lexer package. The main lexing logic is implemented in ShakeLexingBase.

§ 2.1 Token Ranking

The order in which token types are checked matters for the lexer implementation: for example, *= must be checked before *, otherwise the input *= would be split into * and =. The following is a sample ranking. It is not the only possible ranking, but it is the one used in the lexer implementation.

  1. Skip whitespace, if present (e.g. space, \t)
  2. Line Separator (\n)
  3. Semicolon (;)
  4. Comma (,)
  5. Colon (:)
  6. Dot (.)
  7. Numbers (floating point, integer, binary, hexadecimal) (e.g. 1, 1.0). Implemented by checking whether the first character is a digit (0-9). Signs (+, -) are not captured as part of the number.

  8. Identifier (e.g. a, b). Implemented by checking whether the first character is a letter (a-z, A-Z) or an underscore (_). The identifier check may, but need not, also accept a digit (0-9) as the first character: anything starting with a digit has already been captured by the number token, so, for example, JavaScript's \w regex character class would be fine for this token.

  9. Identifier 2 (e.g. `a`, `b`). Implemented by checking whether the first character is a backtick (`).

  10. String (e.g. "abc"). Implemented by checking whether the first character is a double quote (").

  11. Character (e.g. 'a'). Implemented by checking whether the first character is a single quote (').

  12. Skip single-line comment (e.g. // abc). Implemented by checking whether the first two characters are a double slash (//), then skipping until the next \n. This also works with Windows line endings, as the \r is skipped as well.

  13. Skip multi-line comment (e.g. /* abc */). Implemented by checking whether the first two characters are a slash and a star (/*), then skipping until the next */.

  14. Pow Assignment (**=)
  15. Mod Assignment (%=)
  16. Div Assignment (/=)
  17. Mul Assignment (*=)
  18. Sub Assignment (-=)
  19. Add Assignment (+=)
  20. Increment (++)
  21. Decrement (--)
  22. Power (**)
  23. Modulo (%)
  24. Division (/)
  25. Multiplication (*)
  26. Subtraction (-)
  27. Addition (+)
  28. Logical OR (||)
  29. Logical AND (&&)
  30. Logical XOR (^^)
  31. Equals (==)
  32. Greater Than Or Equal (>=)
  33. Less Than Or Equal (<=)
  34. Not Equal (!=)
  35. Greater Than (>)
  36. Less Than (<)
  37. Not (!)
  38. Bitwise NAND (~&)
  39. Bitwise NOR (~|)
  40. Bitwise XNOR (~^)
  41. Bitwise NOT (~)
  42. Bitwise AND (&)
  43. Bitwise OR (|)
  44. Bitwise XOR (^)
  45. Assignment (=)
  46. LParen (()
  47. RParen ())
  48. LCurl ({)
  49. RCurl (})
  50. LBracket ([)
  51. RBracket (])
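
The ranking above can be sketched as a chain of checks tried in order, where the first rule that matches wins. The following Python sketch is an illustration under stated assumptions, not the code in ShakeLexingBase: all names (`lex`, `match_operator`, `OPERATORS`, the token kind strings) are hypothetical, tokens are plain `(kind, text)` pairs, numbers are decimal only, only a subset of the operators is included, and ranks 9 to 11 (backtick identifiers, strings, characters) are left out. It also assumes that a single-line comment stops before the \n, so the line separator is still emitted.

```python
import string

# Hypothetical names throughout; this is not Shake's ShakeLexingBase.
DIGITS = "0123456789"
IDENT_START = string.ascii_letters + "_"
IDENT_PART = IDENT_START + DIGITS
PUNCTUATION = {";": "SEMICOLON", ",": "COMMA", ":": "COLON", ".": "DOT"}

# A subset of the operators, in the same relative order as the ranking:
# every longer operator comes before the shorter operators it starts with.
OPERATORS = [
    ("POW_ASSIGN", "**="), ("MOD_ASSIGN", "%="), ("DIV_ASSIGN", "/="),
    ("MUL_ASSIGN", "*="), ("SUB_ASSIGN", "-="), ("ADD_ASSIGN", "+="),
    ("INCREMENT", "++"), ("DECREMENT", "--"), ("POW", "**"), ("MOD", "%"),
    ("DIV", "/"), ("MUL", "*"), ("SUB", "-"), ("ADD", "+"),
    ("EQUALS", "=="), ("ASSIGN", "="), ("LPAREN", "("), ("RPAREN", ")"),
]


def match_operator(source, i, table=OPERATORS):
    """Return the first (kind, text) in table that matches at index i, or None."""
    for kind, text in table:
        if source.startswith(text, i):
            return kind, text
    return None


def lex(source):
    """Turn source into (kind, text) pairs, checking rules in ranked order."""
    tokens, i, n = [], 0, len(source)
    while i < n:
        ch = source[i]
        if ch in " \t\r":                      # rank 1: skip whitespace
            i += 1
        elif ch == "\n":                       # rank 2: line separator
            tokens.append(("LINE_SEPARATOR", ch))
            i += 1
        elif ch in PUNCTUATION:                # ranks 3-6: ; , : .
            tokens.append((PUNCTUATION[ch], ch))
            i += 1
        elif ch in DIGITS:                     # rank 7: number (decimal only here)
            start = i
            while i < n and source[i] in DIGITS:
                i += 1
            is_float = i + 1 < n and source[i] == "." and source[i + 1] in DIGITS
            if is_float:
                i += 1
                while i < n and source[i] in DIGITS:
                    i += 1
            tokens.append(("FLOAT" if is_float else "INTEGER", source[start:i]))
        elif ch in IDENT_START:                # rank 8: identifier
            start = i
            while i < n and source[i] in IDENT_PART:
                i += 1
            tokens.append(("IDENTIFIER", source[start:i]))
        elif source.startswith("//", i):       # rank 12: single-line comment
            end = source.find("\n", i)
            i = n if end == -1 else end        # assumption: leave the \n to rank 2
        elif source.startswith("/*", i):       # rank 13: multi-line comment
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated multi-line comment")
            i = end + 2
        else:                                  # ranks 14 and up: operators
            match = match_operator(source, i)
            if match is None:
                raise ValueError(f"unexpected character {ch!r} at index {i}")
            tokens.append(match)
            i += len(match[1])
    return tokens


tokens = lex("a **= b * 2 // note\nc /* x */ == 1.5")
```

Because the comment checks run before the operator table, `//` and `/*` are never mistaken for division. The effect of the ordering can be seen by passing a deliberately misordered table: `match_operator("*= 1", 0, [("MUL", "*"), ("MUL_ASSIGN", "*=")])` returns `("MUL", "*")`, whereas the ranked table returns `("MUL_ASSIGN", "*=")`.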