oboard

oboard

https://oboard.eu.org/
github
email
tg_channel
medium
twitter
bilibili
steam_profiles
follow

Implementing a MoonBit Interpreter from Scratch (Part Two)

There are several concepts surrounding lexical analysis, syntax parsing, and interpretation of programming languages:

  1. Token: A Token is the basic unit in a programming language, representing a semantic element in the syntax. For example, in an arithmetic expression, the plus sign (+), minus sign (-), and number (123) can all be Tokens. Each Token has a type (such as Plus, Minus, NumberLiteral) and a pattern (defined using regular expressions) used to identify these Tokens in the source code.
  2. Lexer: A Lexer is a tool that converts a source code string into a series of Tokens. It is responsible for recognizing the basic building blocks (Tokens) of the language. The Lexer skips unnecessary characters (such as whitespace) and generates useful Tokens for subsequent syntax parsers to process.
  3. CST (Concrete Syntax Tree): A CST is a data structure used to represent the concrete syntax structure of the source code. It contains all syntax elements and their relationships. The CST allows the parser to understand the structure of the source code, providing a complete view of the syntax for further processing (such as semantic analysis and code generation).
  4. AST (Abstract Syntax Tree): An AST is a tree-like data structure that represents the structure of a program, widely used in compilers and interpreters. Unlike the CST, the AST focuses only on the logical structure of the program, ignoring specific syntax details (such as whitespace, parentheses, etc.).
  5. Parser: A Parser is a component that processes a sequence of Tokens and generates a Syntax Tree (ST). It determines whether the combination of Tokens conforms to the expected structure based on the language's syntax rules.
  6. Visitor Pattern: The Visitor Pattern is a design pattern that allows operations to be performed on an object structure without modifying the structure itself. In this context, it is used to traverse the CST and perform corresponding operations (such as calculating the value of an expression).

Comparison of AST and CST#

CST:

  • Represents the concrete structure of the code, including all syntax details.
  • Contains all syntax elements, such as parentheses, semicolons, whitespace, etc.

AST:

  • Provides a more concise view, focusing only on the semantic parts of the syntax.
  • Omits unnecessary details, such as extraneous whitespace and delimiters.

Assuming there is a simple expression like:

1 + 2 * 3;

Its AST might be represented as follows:

     +
    / \
   1   *
      / \
     2   3

In this tree:
The root node is the + operator. The left child node is the operand 1.
The right child node is the * operator, with its left child being operand 2 and its right child being operand 3.

The AST provides an efficient way to represent and manipulate the structure of a program, abstracting the syntax information from the source code. By using the AST, compilers and interpreters can more easily perform code analysis, optimization, and generation, thereby improving the efficiency and flexibility of code processing.

Code#

We start with a simple addition and subtraction Parser, gradually implementing a simple addition parser, and then we will add more syntax step by step.

Addition Parser

First, install Chevrotain
Most people use pnpm, which is:

pnpm install chevrotain

I personally prefer using bun

bun i chevrotain
bun add v1.1.20 (ae194892)
installed [email protected]
[271.00ms] done

After installation, create an index.ts file.

Import#

import {
  createToken,
  Lexer,
  CstParser,
  tokenMatcher,
  CstNode,
} from "chevrotain";

Import relevant methods and classes from Chevrotain to create a lexer and parser.

Create Tokens#

// AdditionOperator: Create a Token representing the addition operator, but use Lexer.NA to indicate that this Token will not be directly matched (it is a category).
const AdditionOperator = createToken({
  name: "AdditionOperator",
  pattern: Lexer.NA,
});
// Plus and Minus: Create Tokens representing the plus (+) and minus (-) operators, both of which are subclasses of AdditionOperator.
const Plus = createToken({
  name: "Plus",
  pattern: /\+/,
  categories: AdditionOperator,
});
const Minus = createToken({
  name: "Minus",
  pattern: /-/,
  categories: AdditionOperator,
});
// NumberLiteral: Create a Token representing numbers, with the regular expression /[1-9]\d*/ matching integers starting from 1 to 9 (i.e., not starting with 0).
const NumberLiteral = createToken({
  name: "NumberLiteral",
  pattern: /[1-9]\d*/,
});
// WhiteSpace: Create a Token representing whitespace, using Lexer.SKIPPED to indicate it should be skipped in lexical analysis (i.e., not produce a Token).
const WhiteSpace = createToken({
  name: "WhiteSpace",
  pattern: /\s+/,
  group: Lexer.SKIPPED,
});

Lexer Configuration#

const allTokens = [WhiteSpace, Plus, Minus, NumberLiteral, AdditionOperator];
// Place all defined Tokens into the array allTokens and create a new lexer, CalculatorLexer.
const CalculatorLexer = new Lexer(allTokens);

Create Parser#

class CalculatorPure extends CstParser {
  constructor() {
    super(allTokens);

    const $ = this;
    // expression rule: This rule corresponds to an expression, containing an addition expression, and we will add more expressions later, such as multiplication.
    $.RULE("expression", () => {
      $.SUBRULE($.additionExpression);
    });

    // additionExpression rule: This rule defines the structure of an addition expression, first consuming a NumberLiteral as the left-hand side (lhs), then consuming zero or more addition operators and right-hand side (rhs).
    $.RULE("additionExpression", () => {
      $.CONSUME(NumberLiteral, { LABEL: "lhs" });
      $.MANY(() => {
        $.CONSUME(AdditionOperator);
        $.CONSUME2(NumberLiteral, { LABEL: "rhs" });
      });
    });

    this.performSelfAnalysis();
  }
}

Create Parser Instance#

const parser = new CalculatorPure();
const BaseCstVisitor = parser.getBaseCstVisitorConstructor();

class CalculatorInterpreter extends BaseCstVisitor {
  constructor() {
    super();
    this.validateVisitor();
  }

  expression(ctx) {
    return this.visit(ctx.additionExpression);
  }
  // additionExpression method: Handles the logic of addition expressions. First, it retrieves the value of the left operand, then iterates over the right operands and performs addition or subtraction based on the operator.
  additionExpression(ctx) {
    let result = Number.parseInt(ctx.lhs[0].image, 10);

    if (ctx.rhs) {
      ctx.rhs.forEach((rhsOperand, idx) => {
        let rhsValue = Number.parseInt(rhsOperand.image, 10);
        let operator = ctx.AdditionOperator[idx];

        if (tokenMatcher(operator, Plus)) {
          result += rhsValue;
        } else {
          result -= rhsValue;
        }
      });
    }

    return result;
  }
}

This code defines a simple addition and subtraction calculator that can parse expressions and compute results. It uses Chevrotain's features to create Tokens, build syntax rules, and implement interpreter logic.

Execute#

function eval(input: string) {
  const lexResult = CalculatorLexer.tokenize(input);
  parser.input = lexResult.tokens;
  const cst = parser.row();
  return new CalculatorInterpreter().visit(cst);
}

console.log(eval("1 + 2 + 3")); // 6
console.log(eval("1 + 2 - 3")); // 0

This article implements a simple addition parser, and we will add more syntax rules later.

Project address: https://github.com/oboard/moonrepl

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.