ANTLR Architecture

This guide provides a deep dive into the architecture of the AST toolkit, focusing on how ANTLR v4 powers the parsing process.

Overview

The AST toolkit uses ANTLR (ANother Tool for Language Recognition) as its foundation. ANTLR is a powerful parser generator that reads grammar files and generates parsers capable of building and walking parse trees.

Design Principles

1. Separation of Concerns

┌─────────────────────────────────────────┐
│  Language Parsers                       │
│  (@sylphlab/ast-javascript, etc.)       │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Core Types                             │
│  (@sylphlab/ast-core)                   │
└─────────────────────────────────────────┘

Core types define generic AST node interfaces
Language parsers implement specific parsing logic
Each language parser is independent and self-contained

2. Extensibility

Adding a new language parser requires:

Creating a new package (e.g., @sylphlab/ast-python)
Adding ANTLR grammar files
Implementing a custom visitor
Mapping to @sylphlab/ast-core types

No changes to existing packages required!

3. Type Safety

Strict TypeScript throughout the codebase
Typed end to end, from generated parse tree contexts to AST nodes
Compile-time checks catch many mistakes before they become runtime errors

ANTLR Parsing Pipeline

Step 1: Grammar Definition

ANTLR uses grammar files (.g4) to define language syntax:

antlr

// JavaScript.g4 (simplified)
grammar JavaScript;

program
    : statement+ EOF
    ;

statement
    : variableDeclaration
    | expressionStatement
    ;

variableDeclaration
    : ('const' | 'let' | 'var') identifier '=' expression ';'
    ;

identifier
    : ID
    ;

// ... more rules

Key Concepts:

Rules - Define syntax patterns (e.g., variableDeclaration)
Terminals - Tokens, either literals (e.g., 'const', ';') or named lexer rules (e.g., ID)
Non-terminals - References to other rules (e.g., identifier)

Step 2: Parser Generation

ANTLR reads the grammar and generates:

Lexer - Tokenizes input into tokens
Parser - Builds parse tree from tokens
Visitor - Interface whose methods a custom visitor implements to transform the parse tree

bash

antlr4ts -visitor -listener -o src/generated grammar/*.g4

This generates TypeScript files:

JavaScriptLexer.ts - Tokenizer
JavaScriptParser.ts - Parser
JavaScriptVisitor.ts - Visitor interface
JavaScriptListener.ts - Listener interface

Step 3: Tokenization (Lexer)

The lexer converts source code into tokens:

typescript

// Input
const greeting = "Hello";

// Tokens (illustrative token names)
// CONST, ID("greeting"), EQUALS, STRING("Hello"), SEMICOLON

Lexer responsibilities:

Character-by-character scanning
Token recognition
Whitespace handling
Error reporting
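The responsibilities above can be sketched with a small hand-written tokenizer. This is not the generated JavaScriptLexer; it is a minimal stand-in that assumes only the five token kinds from the example, and the names `LexToken`, `KEYWORDS`, and `tokenize` are hypothetical.

```typescript
// Hand-written illustration only; the real lexer is generated by ANTLR.
type TokenType = 'CONST' | 'LET' | 'VAR' | 'ID' | 'EQUALS' | 'STRING' | 'SEMICOLON';

interface LexToken {
  type: TokenType;
  text: string;
  offset: number; // index of the token's first character
}

const KEYWORDS = new Map<string, TokenType>([
  ['const', 'CONST'],
  ['let', 'LET'],
  ['var', 'VAR'],
]);

// Sticky regex: matches an identifier starting exactly at lastIndex.
const IDENTIFIER = /[A-Za-z_$][A-Za-z0-9_$]*/y;

function tokenize(source: string): LexToken[] {
  const tokens: LexToken[] = [];
  let i = 0;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (/\s/.test(ch)) { i++; continue; } // whitespace handling: skip it
    if (ch === '=') { tokens.push({ type: 'EQUALS', text: ch, offset: i }); i++; continue; }
    if (ch === ';') { tokens.push({ type: 'SEMICOLON', text: ch, offset: i }); i++; continue; }
    if (ch === '"') {
      const end = source.indexOf('"', i + 1);
      // Error reporting: an unterminated string is a lexer error.
      if (end === -1) throw new SyntaxError(`Unterminated string at offset ${i}`);
      tokens.push({ type: 'STRING', text: source.slice(i, end + 1), offset: i });
      i = end + 1;
      continue;
    }
    IDENTIFIER.lastIndex = i;
    const word = IDENTIFIER.exec(source);
    if (word) {
      const text = word[0];
      tokens.push({ type: KEYWORDS.get(text) ?? 'ID', text, offset: i });
      i += text.length;
      continue;
    }
    throw new SyntaxError(`Invalid character '${ch}' at offset ${i}`);
  }
  return tokens;
}

const exampleTokens = tokenize('const greeting = "Hello";').map(t => t.type);
// exampleTokens: ['CONST', 'ID', 'EQUALS', 'STRING', 'SEMICOLON']
```

A generated lexer covers the full token set and escape sequences; the sketch only shows where scanning, whitespace skipping, and error reporting sit.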

Step 4: Parsing

The parser consumes tokens and builds a parse tree (the '=' and ';' terminals are omitted from the diagram for brevity):

         program
            |
       statement
            |
  variableDeclaration
   /      |      \
CONST identifier expression
        |         |
    "greeting"  "Hello"

Parser responsibilities:

Syntax validation
Parse tree construction
Error recovery
Context management
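To make syntax validation and parse tree construction concrete, here is a hand-written recursive-descent sketch for the single `variableDeclaration` rule. It is a stand-in for the generated parser, not its actual code: `Tok`, `ParseTree`, and `MiniParser` are hypothetical names, and `expression` is assumed to be a single literal token.

```typescript
// Covers one rule only:
//   variableDeclaration : ('const' | 'let' | 'var') identifier '=' expression ';'
interface Tok { type: string; text: string }

type ParseTree =
  | { kind: 'terminal'; token: Tok }
  | { kind: 'rule'; name: string; children: ParseTree[] };

class MiniParser {
  private pos = 0;
  private readonly tokens: Tok[];

  constructor(tokens: Tok[]) {
    this.tokens = tokens;
  }

  // Syntax validation: consume a token of an expected type or fail.
  private match(...types: string[]): ParseTree {
    const tok = this.tokens[this.pos];
    if (!tok || !types.includes(tok.type)) {
      const found = tok ? tok.type : 'EOF';
      throw new SyntaxError(
        `Expected ${types.join(' | ')} but found ${found} at token ${this.pos}`
      );
    }
    this.pos++;
    return { kind: 'terminal', token: tok };
  }

  private rule(name: string, children: ParseTree[]): ParseTree {
    return { kind: 'rule', name, children };
  }

  // Parse tree construction: every matched token becomes a child,
  // including punctuation such as '=' and ';'.
  variableDeclaration(): ParseTree {
    return this.rule('variableDeclaration', [
      this.match('CONST', 'LET', 'VAR'),
      this.rule('identifier', [this.match('ID')]),
      this.match('EQUALS'),
      this.rule('expression', [this.match('STRING', 'NUMBER')]),
      this.match('SEMICOLON'),
    ]);
  }
}
```

Unlike this sketch, which stops at the first error, an ANTLR-generated parser attempts recovery and keeps going.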

Step 5: Parse Tree vs AST

Parse Tree (ANTLR output):

Contains all syntactic details
Includes tokens like ;, {, }
Closely mirrors grammar structure
Large and verbose

AST (Our output):

Abstracts away syntactic details
Contains only semantic information
Easier to work with
More compact

Example comparison (parse tree first, then the AST):

typescript

// Parse tree for: const x = 42;
{
  type: 'VariableDeclarationContext',
  children: [
    { type: 'TerminalNode', text: 'const' },
    { type: 'IdentifierContext', text: 'x' },
    { type: 'TerminalNode', text: '=' },
    { type: 'ExpressionContext', value: 42 },
    { type: 'TerminalNode', text: ';' }
  ]
}

typescript

// AST for: const x = 42;
{
  type: 'VariableDeclaration',
  declarations: [{
    type: 'VariableDeclarator',
    id: { type: 'Identifier', name: 'x' },
    init: { type: 'Literal', value: 42 }
  }],
  kind: 'const'
}
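The step from the first literal to the second can be sketched as a plain function. The `TreeNode` shape below copies the illustrative parse tree object above rather than the real antlr4ts context classes, and `toVariableDeclaration` is a hypothetical name.

```typescript
// Illustrative parse tree shape, matching the object literal shown above.
interface TreeNode {
  type: string;
  text?: string;
  value?: unknown;
  children?: TreeNode[];
}

function toVariableDeclaration(tree: TreeNode) {
  const children = tree.children ?? [];
  // The first terminal is the declaration keyword; '=' and ';' are dropped.
  const keyword = children.find(c => c.type === 'TerminalNode');
  const id = children.find(c => c.type === 'IdentifierContext');
  const init = children.find(c => c.type === 'ExpressionContext');
  if (!keyword?.text || !id?.text) {
    throw new Error('Not a variable declaration parse tree');
  }
  return {
    type: 'VariableDeclaration',
    declarations: [{
      type: 'VariableDeclarator',
      id: { type: 'Identifier', name: id.text },
      init: init ? { type: 'Literal', value: init.value } : null,
    }],
    kind: keyword.text,
  };
}
```

Five parse tree children collapse into one declarator with an `id` and an `init`; the punctuation carries no meaning once the structure is known.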

Step 6: Custom Visitor

The visitor pattern transforms parse tree to AST:

typescript

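class AstBuilderVisitor extends JavaScriptVisitorBase {
  // Assumes the full grammar defines CONST and LET lexer tokens and a
  // variableDeclarator rule; the simplified grammar in Step 1 inlines both.
  visitVariableDeclaration(ctx: VariableDeclarationContext): VariableDeclaration {
    // Pick the keyword from whichever token accessor is present.
    const kind = ctx.CONST() ? 'const' :
                 ctx.LET() ? 'let' : 'var';

    return {
      type: 'VariableDeclaration',
      declarations: ctx.variableDeclarator().map(d =>
        this.visitVariableDeclarator(d)
      ),
      kind,
      loc: this.getLocation(ctx)
    };
  }

  visitVariableDeclarator(ctx: VariableDeclaratorContext): VariableDeclarator {
    // The initializer is optional (e.g. `let x;`), so fetch it once and null-check.
    const init = ctx.expression();
    return {
      type: 'VariableDeclarator',
      id: this.visit(ctx.identifier()),
      init: init ? this.visit(init) : null,
      loc: this.getLocation(ctx)
    };
  }
}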

Visitor responsibilities:

Transform parse tree nodes to AST nodes
Extract semantic information
Track source locations
Handle edge cases
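The dispatch behind calls like `this.visit(...)` is the classic visitor pattern: each tree node has an `accept` method that calls back into the matching `visitX` method. The sketch below shows that mechanism without any ANTLR dependency; every name in it (`Visitor`, `NumberNode`, `AddNode`, `AstBuilder`) is hypothetical.

```typescript
// Minimal visitor pattern, independent of ANTLR.
interface Visitor<R> {
  visitNumber(node: NumberNode): R;
  visitAdd(node: AddNode): R;
}

interface ParseNode {
  accept<R>(visitor: Visitor<R>): R;
}

class NumberNode implements ParseNode {
  readonly text: string;
  constructor(text: string) {
    this.text = text;
  }
  // Double dispatch: the node chooses which visitor method runs.
  accept<R>(visitor: Visitor<R>): R {
    return visitor.visitNumber(this);
  }
}

class AddNode implements ParseNode {
  readonly left: ParseNode;
  readonly right: ParseNode;
  constructor(left: ParseNode, right: ParseNode) {
    this.left = left;
    this.right = right;
  }
  accept<R>(visitor: Visitor<R>): R {
    return visitor.visitAdd(this);
  }
}

type Ast =
  | { type: 'Literal'; value: number }
  | { type: 'BinaryExpression'; operator: '+'; left: Ast; right: Ast };

class AstBuilder implements Visitor<Ast> {
  // Extract semantic information: the token text becomes a numeric value.
  visitNumber(node: NumberNode): Ast {
    return { type: 'Literal', value: Number(node.text) };
  }

  // Children are transformed recursively through accept().
  visitAdd(node: AddNode): Ast {
    return {
      type: 'BinaryExpression',
      operator: '+',
      left: node.left.accept(this),
      right: node.right.accept(this),
    };
  }
}

const sumAst = new AddNode(new NumberNode('1'), new NumberNode('2')).accept(new AstBuilder());
```

Because the visitor is generic over its result type, the same tree could feed a second visitor, for example one that evaluates the expression, without changing the node classes.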

Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                     Source Code                         │
│                  const x = "Hello";                     │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  ANTLR Lexer                            │
│         (JavaScriptLexer.ts - Generated)                │
│                                                          │
│  Input:  Source code string                            │
│  Output: Token stream                                   │
│          [CONST, ID, EQUALS, STRING, SEMICOLON]        │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  ANTLR Parser                           │
│        (JavaScriptParser.ts - Generated)                │
│                                                          │
│  Input:  Token stream                                  │
│  Output: Parse tree (Context objects)                  │
│          VariableDeclarationContext                    │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│               Custom Visitor                            │
│        (AstBuilderVisitor.ts - Custom)                  │
│                                                          │
│  Input:  Parse tree                                    │
│  Output: AST (@sylphlab/ast-core types)                │
│          { type: 'VariableDeclaration', ... }          │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                   Final AST                             │
│            (@sylphlab/ast-core types)                   │
└─────────────────────────────────────────────────────────┘

Key Components

@sylphlab/ast-core

Purpose: Generic AST node definitions

Contents:

typescript

// Core AST node interface
export interface AstNode {
  type: string;
  loc?: SourceLocation;
}

// Position tracking
export interface SourceLocation {
  start: Position;
  end: Position;
  source?: string;
}

export interface Position {
  line: number;
  column: number;
  offset?: number;
}

// Specific node types
export interface Program extends AstNode {
  type: 'Program';
  body: Statement[];
}

export interface VariableDeclaration extends AstNode {
  type: 'VariableDeclaration';
  declarations: VariableDeclarator[];
  kind: 'const' | 'let' | 'var';
}

// ... more node types
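Because every node carries a string `type`, generic tooling can be written against the base interface alone. The following is a sketch of a depth-first traversal under that assumption; `isAstNode` and `walk` are hypothetical helpers, not part of the published package API, and `AstNode` is redeclared locally so the example stands alone.

```typescript
// Local copy of the base interface so the example is self-contained.
interface AstNode {
  type: string;
}

// Runtime check: anything object-like with a string `type` counts as a node.
function isAstNode(value: unknown): value is AstNode {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { type?: unknown }).type === 'string'
  );
}

// Depth-first, pre-order walk over nodes reachable through properties or arrays.
function walk(node: AstNode, visit: (n: AstNode) => void): void {
  visit(node);
  for (const [key, value] of Object.entries(node)) {
    if (key === 'loc') continue; // location data is not a child node
    const candidates: unknown[] = Array.isArray(value) ? value : [value];
    for (const child of candidates) {
      if (isAstNode(child)) walk(child, visit);
    }
  }
}

const seenTypes: string[] = [];
walk(
  {
    type: 'VariableDeclaration',
    declarations: [{
      type: 'VariableDeclarator',
      id: { type: 'Identifier', name: 'x' },
      init: { type: 'Literal', value: 42 },
    }],
    kind: 'const',
  } as AstNode,
  n => seenTypes.push(n.type)
);
// seenTypes: ['VariableDeclaration', 'VariableDeclarator', 'Identifier', 'Literal']
```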

@sylphlab/ast-javascript

Purpose: JavaScript-specific parser

Contents:

grammar/ - ANTLR .g4 grammar files
src/generated/ - ANTLR-generated files (lexer, parser, visitor)
src/AstBuilderVisitor.ts - Custom visitor implementation
src/index.ts - Public API (parse function)

Grammar Files

Location: packages/javascript/grammar/

The JavaScript grammar is based on the official ECMAScript specification:

JavaScriptLexer.g4 - Token definitions
JavaScriptParser.g4 - Syntax rules

Visitor Implementation

Location: packages/javascript/src/AstBuilderVisitor.ts

The custom visitor extends JavaScriptVisitorBase and implements methods for each grammar rule:

typescript

export class AstBuilderVisitor extends JavaScriptVisitorBase {
  // Entry point
  visitProgram(ctx: ProgramContext): Program {
    return {
      type: 'Program',
      body: ctx.statement().map(stmt => this.visit(stmt) as Statement),
      loc: this.getLocation(ctx)
    };
  }

  // Statement visitors
  visitVariableDeclaration(ctx: VariableDeclarationContext): VariableDeclaration { /* ... */ }
  visitExpressionStatement(ctx: ExpressionStatementContext): ExpressionStatement { /* ... */ }

  // Expression visitors
  visitBinaryExpression(ctx: BinaryExpressionContext): BinaryExpression { /* ... */ }
  visitIdentifier(ctx: IdentifierContext): Identifier { /* ... */ }

  // Helper methods
  getLocation(ctx: ParserRuleContext): SourceLocation { /* ... */ }
}
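The body of `getLocation` is elided above. One plausible shape, shown here as a sketch and not as the project's actual implementation, derives the location from a context's first and last tokens. `TokenLike` and `ContextLike` are structural stand-ins for the antlr4ts `Token` and `ParserRuleContext` fields used; the 1-based `line` and 0-based `charPositionInLine` are assumed to match ANTLR's conventions, and the exclusive end position is a design choice of this sketch.

```typescript
// Structural stand-ins for the antlr4ts fields this helper reads.
interface TokenLike {
  line: number;               // assumed 1-based
  charPositionInLine: number; // assumed 0-based
  startIndex: number;         // offset of the first character
  stopIndex: number;          // offset of the last character (inclusive)
}

interface ContextLike {
  start: TokenLike;
  stop?: TokenLike;
}

// Local equivalents of the core position types.
interface SourcePosition { line: number; column: number; offset?: number }
interface SourceLocation { start: SourcePosition; end: SourcePosition }

function getLocation(ctx: ContextLike): SourceLocation {
  const first = ctx.start;
  // The stop token can be missing, so fall back to the start token.
  const last = ctx.stop ?? ctx.start;
  const lastLength = last.stopIndex - last.startIndex + 1;
  return {
    start: {
      line: first.line,
      column: first.charPositionInLine,
      offset: first.startIndex,
    },
    // Exclusive end: one past the last character of the last token.
    // Assumes the last token does not span multiple lines.
    end: {
      line: last.line,
      column: last.charPositionInLine + lastLength,
      offset: last.stopIndex + 1,
    },
  };
}
```

For `const x = 42;` on line 1, this yields a start of line 1, column 0 and an end of line 1, column 13.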

Performance Considerations

Lexer Performance

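Fast tokenization - Matching decisions are cached as the lexer runs
DFA-cached - A Deterministic Finite Automaton is built lazily from the grammar and reused for later tokens
Minimal backtracking - Most tokens are recognized in a single forward scan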

Parser Performance

ALL(*) parsing - ANTLR v4's adaptive LL(*) strategy for top-down parsing
Prediction caching - Lookahead decisions are cached and reused across the parse
Error recovery - Continues parsing after errors

Visitor Performance

Single pass - Transform in one traversal
No extra intermediate structures - AST nodes are built directly from parse tree contexts
Memory efficient - The parse tree can be garbage collected once the AST is built

Error Handling

Lexer Errors

Invalid characters
Unterminated strings
Malformed numbers

Parser Errors

Syntax errors
Missing tokens
Unexpected tokens

Error Recovery

ANTLR provides built-in error recovery:

Single token deletion - Skip unexpected token
Single token insertion - Assume missing token
Synchronization - Skip to known good state
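Recovery lets parsing continue, so callers usually want the reported errors collected, not printed. The sketch below uses the same parameter order as `syntaxError` on antlr4ts's `ANTLRErrorListener`, with the ANTLR-specific parameter types loosened to `unknown` so the example stands alone. `SyntaxIssue` and `CollectingErrorListener` are hypothetical names.

```typescript
interface SyntaxIssue {
  line: number;
  column: number;
  message: string;
}

class CollectingErrorListener {
  readonly issues: SyntaxIssue[] = [];

  // Parameter order follows ANTLRErrorListener.syntaxError; unused
  // parameters are prefixed with an underscore.
  syntaxError(
    _recognizer: unknown,
    _offendingSymbol: unknown,
    line: number,
    charPositionInLine: number,
    msg: string,
    _e: unknown
  ): void {
    this.issues.push({ line, column: charPositionInLine, message: msg });
  }

  get hasErrors(): boolean {
    return this.issues.length > 0;
  }
}

// Simulate two reports the way a recognizer would deliver them.
const collected = new CollectingErrorListener();
collected.syntaxError(undefined, undefined, 1, 10, "missing ';'", undefined);
collected.syntaxError(undefined, undefined, 3, 0, "extraneous input '}'", undefined);
```

With the real runtime, a listener like this would typically be registered on both lexer and parser via `removeErrorListeners()` followed by `addErrorListener(...)`, after adjusting the parameter types to the antlr4ts ones.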

Extensibility

Adding New Grammar Rules

Update grammar file:

antlr

// Add async/await support
asyncFunctionDeclaration
    : ASYNC FUNCTION identifier '(' parameters ')' block
    ;

Regenerate parser:

bash

pnpm antlr

Implement visitor:

typescript

visitAsyncFunctionDeclaration(ctx: AsyncFunctionDeclarationContext): FunctionDeclaration {
  return {
    type: 'FunctionDeclaration',
    async: true,
    id: this.visit(ctx.identifier()),
    params: this.visit(ctx.parameters()),
    body: this.visit(ctx.block()),
    loc: this.getLocation(ctx)
  };
}

Grammar Resources

Official ANTLR grammar repository:

ANTLR grammars-v4
Contains 200+ language grammars
Well-tested and maintained
Great starting point for new languages

Best Practices

Grammar Design

Start simple - Add complexity incrementally
Test thoroughly - Use grammar tests
Follow conventions - Match ANTLR idioms
Document rules - Add comments to grammar

Visitor Implementation

One visitor method per rule - Clear mapping
Extract helpers - Reuse common logic
Handle nulls - Grammar may have optional parts
Preserve locations - Track source positions

Type Safety

Strict types - Use TypeScript strict mode
Avoid any - Type all visitor returns
Exhaustive checks - Handle all cases
Runtime validation - Validate AST structure
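The last two points can be illustrated together. The sketch assumes a deliberately small two-member union in place of the full set of core node types; `Declaration`, `assertNever`, `summarize`, and `isDeclaration` are hypothetical names.

```typescript
// A small discriminated union standing in for the full node type set.
type Declaration =
  | { type: 'VariableDeclaration'; kind: 'const' | 'let' | 'var' }
  | { type: 'FunctionDeclaration'; async: boolean };

// Exhaustive check: if a new union member is added and not handled,
// the call below stops compiling because the argument is no longer `never`.
function assertNever(value: never): never {
  throw new Error(`Unhandled node: ${JSON.stringify(value)}`);
}

function summarize(node: Declaration): string {
  switch (node.type) {
    case 'VariableDeclaration':
      return `${node.kind} declaration`;
    case 'FunctionDeclaration':
      return node.async ? 'async function' : 'function';
    default:
      return assertNever(node);
  }
}

// Runtime validation: a type guard for data that did not come from the
// visitor, such as JSON loaded from disk.
function isDeclaration(value: unknown): value is Declaration {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v.type === 'VariableDeclaration') {
    return v.kind === 'const' || v.kind === 'let' || v.kind === 'var';
  }
  if (v.type === 'FunctionDeclaration') {
    return typeof v.async === 'boolean';
  }
  return false;
}
```

The compile-time check protects code inside the project, while the guard covers values that cross a trust boundary.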

References

ANTLR Documentation

TypeScript Integration

antlr4ts - ANTLR runtime for TypeScript
antlr4ts-cli - Code generator

AST Specifications

ESTree - JavaScript AST specification
TypeScript AST - TypeScript AST explorer

Next Steps

Now that you understand the architecture:

Questions? Open an issue or contact us.
