ANTLR Architecture
This guide provides a deep dive into the architecture of the AST toolkit, focusing on how ANTLR v4 powers the parsing process.
Overview
The AST toolkit uses ANTLR (ANother Tool for Language Recognition) as its foundation. ANTLR is a powerful parser generator that reads grammar files and generates parsers capable of building and walking parse trees.
Design Principles
1. Separation of Concerns
┌─────────────────────────────────────────┐
│ Language Parsers │
│ (@sylphlab/ast-javascript, etc.) │
└────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Core Types │
│ (@sylphlab/ast-core) │
└─────────────────────────────────────────┘
- Core types define generic AST node interfaces
- Language parsers implement specific parsing logic
- Each language parser is independent and self-contained
2. Extensibility
Adding a new language parser requires:
- Creating a new package (e.g., @sylphlab/ast-python)
- Adding ANTLR grammar files
- Implementing a custom visitor
- Mapping to @sylphlab/ast-core types
No changes to existing packages are required.
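As a rough sketch of why a new language does not touch existing packages, the snippet below models each language package as an object that satisfies a shared contract. The `LanguageParser` interface, `registerParser`, and `parseWith` are hypothetical names introduced for illustration; they are not confirmed parts of the toolkit's API, and the Python parser is a stub.

```typescript
// Minimal stand-in for the generic node interface in @sylphlab/ast-core.
interface AstNode {
  type: string;
}

// Hypothetical contract that each language package could expose.
interface LanguageParser {
  readonly language: string;
  parse(source: string): AstNode;
}

// Hypothetical registry: adding an entry never modifies existing entries.
const parsers = new Map<string, LanguageParser>();

function registerParser(parser: LanguageParser): void {
  if (parsers.has(parser.language)) {
    throw new Error(`Parser already registered for ${parser.language}`);
  }
  parsers.set(parser.language, parser);
}

function parseWith(language: string, source: string): AstNode {
  const parser = parsers.get(language);
  if (parser === undefined) {
    throw new Error(`No parser registered for ${language}`);
  }
  return parser.parse(source);
}

// Stub standing in for a new package such as @sylphlab/ast-python.
// A real package would run its ANTLR lexer, parser, and visitor here.
const pythonParser: LanguageParser = {
  language: 'python',
  parse: (_source) => ({ type: 'Program' }),
};

registerParser(pythonParser);
```

Callers depend only on the shared `AstNode` shape, so `parseWith('python', 'x = 1')` works without any other language package knowing that Python exists.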
3. Type Safety
- Strict TypeScript throughout the codebase
- Full type inference from grammar to AST
- Compile-time checks catch many errors before they reach runtime
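One way to see what compile-time checking buys is a discriminated union with an exhaustiveness check. The node shapes below are simplified stand-ins, not the actual @sylphlab/ast-core definitions, and `JsNode`, `assertNever`, and `summarize` are illustrative names.

```typescript
// Simplified stand-ins for AST node types.
interface Identifier {
  type: 'Identifier';
  name: string;
}

interface Literal {
  type: 'Literal';
  value: string | number | boolean | null;
}

interface VariableDeclaration {
  type: 'VariableDeclaration';
  kind: 'const' | 'let' | 'var';
  id: Identifier;
  init: Literal | null;
}

// The `type` field discriminates the union.
type JsNode = Identifier | Literal | VariableDeclaration;

// Reaching this function at runtime means a case was missed; at compile
// time, the `never` parameter makes a missed case a type error.
function assertNever(value: never): never {
  throw new Error(`Unhandled node: ${JSON.stringify(value)}`);
}

function summarize(node: JsNode): string {
  switch (node.type) {
    case 'Identifier':
      return `identifier ${node.name}`;
    case 'Literal':
      return `literal ${String(node.value)}`;
    case 'VariableDeclaration':
      return `${node.kind} declaration of ${node.id.name}`;
    default:
      return assertNever(node);
  }
}
```

If a new member is added to `JsNode` without a matching `case`, the call to `assertNever` stops compiling, which flags the gap before the code runs.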
ANTLR Parsing Pipeline
Step 1: Grammar Definition
ANTLR uses grammar files (.g4) to define language syntax:
// JavaScript.g4 (simplified)
grammar JavaScript;
program
: statement+ EOF
;
statement
: variableDeclaration
| expressionStatement
;
variableDeclaration
: ('const' | 'let' | 'var') identifier '=' expression ';'
;
identifier
: ID
;
// ... more rules
Key Concepts:
- Rules - Define syntax patterns (e.g., variableDeclaration)
- Terminals - Literal tokens (e.g., 'const', ';')
- Non-terminals - References to other rules (e.g., identifier)
Step 2: Parser Generation
ANTLR reads the grammar and generates:
- Lexer - Tokenizes input into tokens
- Parser - Builds parse tree from tokens
- Visitor - Base class for transforming parse tree
antlr4ts -visitor -listener -o src/generated grammar/*.g4
This generates TypeScript files:
- JavaScriptLexer.ts - Tokenizer
- JavaScriptParser.ts - Parser
- JavaScriptVisitor.ts - Visitor base class
- JavaScriptListener.ts - Listener base class
Step 3: Tokenization (Lexer)
The lexer converts source code into tokens:
// Input
const greeting = "Hello";
// Tokens
CONST, IDENTIFIER("greeting"), EQUALS, STRING("Hello"), SEMICOLON
Lexer responsibilities:
- Character-by-character scanning
- Token recognition
- Whitespace handling
- Error reporting
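To make these responsibilities concrete, here is a small hand-written tokenizer for the declaration shown above. It is only an illustration of what a lexer does; the ANTLR-generated lexer is DFA-driven and works differently internally. The token names follow the example output, and `ToyToken` and `tokenize` are hypothetical names.

```typescript
type TokenType =
  | 'CONST' | 'LET' | 'VAR'
  | 'IDENTIFIER' | 'EQUALS' | 'STRING' | 'NUMBER' | 'SEMICOLON';

interface ToyToken {
  type: TokenType;
  text: string;
  offset: number; // index of the token's first character
}

const KEYWORDS = new Map<string, TokenType>([
  ['const', 'CONST'],
  ['let', 'LET'],
  ['var', 'VAR'],
]);

function tokenize(source: string): ToyToken[] {
  const tokens: ToyToken[] = [];
  let i = 0;
  while (i < source.length) {
    const ch = source.charAt(i);
    // Whitespace handling: skipped, never emitted as a token.
    if (/\s/.test(ch)) {
      i++;
      continue;
    }
    const start = i;
    if (/[A-Za-z_$]/.test(ch)) {
      // Keyword or identifier: scan character by character.
      while (i < source.length && /[A-Za-z0-9_$]/.test(source.charAt(i))) i++;
      const text = source.slice(start, i);
      tokens.push({ type: KEYWORDS.get(text) ?? 'IDENTIFIER', text, offset: start });
    } else if (/[0-9]/.test(ch)) {
      while (i < source.length && /[0-9]/.test(source.charAt(i))) i++;
      tokens.push({ type: 'NUMBER', text: source.slice(start, i), offset: start });
    } else if (ch === '"') {
      i++; // opening quote
      while (i < source.length && source.charAt(i) !== '"') i++;
      // Error reporting: an unterminated string is a lexer error.
      if (i >= source.length) {
        throw new Error(`Unterminated string at offset ${start}`);
      }
      i++; // closing quote
      tokens.push({ type: 'STRING', text: source.slice(start, i), offset: start });
    } else if (ch === '=') {
      tokens.push({ type: 'EQUALS', text: ch, offset: start });
      i++;
    } else if (ch === ';') {
      tokens.push({ type: 'SEMICOLON', text: ch, offset: start });
      i++;
    } else {
      throw new Error(`Invalid character '${ch}' at offset ${start}`);
    }
  }
  return tokens;
}
```

Calling `tokenize('const greeting = "Hello";')` yields five tokens whose types match the sequence listed in the example.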
Step 4: Parsing
The parser consumes tokens and builds a parse tree:
program
|
statement
|
variableDeclaration
/ | \
CONST identifier expression
| |
"greeting" "Hello"
Parser responsibilities:
- Syntax validation
- Parse tree construction
- Error recovery
- Context management
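The following sketch hand-codes the simplified `variableDeclaration` rule as a tiny recursive-descent routine, to show how a parser validates token order and records every token in the parse tree. It is an illustration under simplifying assumptions (an expression is a single literal or identifier token, and there is no error recovery); ANTLR generates far more general code. `Tok`, `ParseTree`, and `parseVariableDeclaration` are hypothetical names.

```typescript
// Hypothetical token and parse-tree shapes, for illustration only.
interface Tok {
  type: string;
  text: string;
}

type ParseTree =
  | { kind: 'terminal'; token: Tok }
  | { kind: 'rule'; rule: string; children: ParseTree[] };

// variableDeclaration : ('const' | 'let' | 'var') identifier '=' expression ';'
function parseVariableDeclaration(tokens: Tok[]): ParseTree {
  let pos = 0;

  const rule = (name: string, children: ParseTree[]): ParseTree => ({
    kind: 'rule',
    rule: name,
    children,
  });

  // Syntax validation: consume the next token only if its type is expected.
  const match = (...expected: string[]): ParseTree => {
    const token = tokens[pos];
    if (token === undefined || !expected.includes(token.type)) {
      const found = token === undefined ? 'end of input' : token.type;
      throw new Error(
        `Syntax error at token ${pos}: expected ${expected.join(' | ')}, found ${found}`,
      );
    }
    pos++;
    return { kind: 'terminal', token };
  };

  // Parse tree construction: one child per grammar element, in order.
  const children = [
    match('CONST', 'LET', 'VAR'),
    rule('identifier', [match('IDENTIFIER')]),
    match('EQUALS'),
    rule('expression', [match('STRING', 'NUMBER', 'IDENTIFIER')]),
    match('SEMICOLON'),
  ];

  const extra = tokens[pos];
  if (extra !== undefined) {
    throw new Error(`Syntax error at token ${pos}: unexpected ${extra.type}`);
  }
  return rule('variableDeclaration', children);
}
```

Note that the resulting tree keeps the `=` and `;` terminals. That verbosity is what the next step strips away when building the AST.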
Step 5: Parse Tree vs AST
Parse Tree (ANTLR output):
- Contains all syntactic details
- Includes tokens like ;, {, }
- Closely mirrors grammar structure
- Large and verbose
AST (Our output):
- Abstracts away syntactic details
- Contains only semantic information
- Easier to work with
- More compact
Example comparison:
// Parse tree for: const x = 42;
{
type: 'VariableDeclarationContext',
children: [
{ type: 'TerminalNode', text: 'const' },
{ type: 'IdentifierContext', text: 'x' },
{ type: 'TerminalNode', text: '=' },
{ type: 'ExpressionContext', value: 42 },
{ type: 'TerminalNode', text: ';' }
]
}
// AST for: const x = 42;
{
type: 'VariableDeclaration',
declarations: [{
type: 'VariableDeclarator',
id: { type: 'Identifier', name: 'x' },
init: { type: 'Literal', value: 42 }
}],
kind: 'const'
}
Step 6: Custom Visitor
The visitor pattern transforms parse tree to AST:
class AstBuilderVisitor extends JavaScriptVisitorBase {
visitVariableDeclaration(ctx: VariableDeclarationContext): AstNode {
const kind = ctx.CONST() ? 'const' :
ctx.LET() ? 'let' : 'var';
return {
type: 'VariableDeclaration',
declarations: ctx.variableDeclarator().map(d =>
this.visitVariableDeclarator(d)
),
kind,
loc: this.getLocation(ctx)
};
}
visitVariableDeclarator(ctx: VariableDeclaratorContext): AstNode {
return {
type: 'VariableDeclarator',
id: this.visit(ctx.identifier()),
init: ctx.expression() ? this.visit(ctx.expression()) : null,
loc: this.getLocation(ctx)
};
}
}
Visitor responsibilities:
- Transform parse tree nodes to AST nodes
- Extract semantic information
- Track source locations
- Handle edge cases
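The transformation can be tried without ANTLR by running a visitor over a mock parse tree. The sketch below uses a simplified `ParseNode` shape in place of the generated context classes and assumes the initializer is a numeric literal; `ToyAstBuilder`, `ParseNode`, and `sampleTree` are hypothetical names. Source locations are omitted to keep it short.

```typescript
// Simplified stand-in for ANTLR context objects.
interface ParseNode {
  rule: string; // 'terminal' for tokens, otherwise the grammar rule name
  text: string;
  children: ParseNode[];
}

interface Identifier { type: 'Identifier'; name: string; }
interface Literal { type: 'Literal'; value: number; }
interface VariableDeclarator {
  type: 'VariableDeclarator';
  id: Identifier;
  init: Literal | null;
}
type DeclarationKind = 'const' | 'let' | 'var';
interface VariableDeclaration {
  type: 'VariableDeclaration';
  declarations: VariableDeclarator[];
  kind: DeclarationKind;
}

function isDeclarationKind(text: string): text is DeclarationKind {
  return text === 'const' || text === 'let' || text === 'var';
}

class ToyAstBuilder {
  visitVariableDeclaration(ctx: ParseNode): VariableDeclaration {
    const keywordText = ctx.children[0]?.text ?? '';
    if (!isDeclarationKind(keywordText)) {
      throw new Error(`Expected const, let, or var but found '${keywordText}'`);
    }
    const id = ctx.children.find((c) => c.rule === 'identifier');
    if (id === undefined) {
      throw new Error('Missing identifier');
    }
    // Edge case: the initializer is treated as optional.
    const init = ctx.children.find((c) => c.rule === 'expression');
    return {
      type: 'VariableDeclaration',
      declarations: [{
        type: 'VariableDeclarator',
        id: this.visitIdentifier(id),
        init: init ? this.visitExpression(init) : null,
      }],
      kind: keywordText,
    };
  }

  visitIdentifier(ctx: ParseNode): Identifier {
    return { type: 'Identifier', name: ctx.text };
  }

  // Assumption: only numeric literals are handled in this sketch.
  visitExpression(ctx: ParseNode): Literal {
    const value = Number(ctx.text);
    if (Number.isNaN(value)) {
      throw new Error(`Unsupported expression: ${ctx.text}`);
    }
    return { type: 'Literal', value };
  }
}

// Mock parse tree for: const x = 42;
const sampleTree: ParseNode = {
  rule: 'variableDeclaration',
  text: 'const x = 42;',
  children: [
    { rule: 'terminal', text: 'const', children: [] },
    { rule: 'identifier', text: 'x', children: [] },
    { rule: 'terminal', text: '=', children: [] },
    { rule: 'expression', text: '42', children: [] },
    { rule: 'terminal', text: ';', children: [] },
  ],
};

const sampleAst = new ToyAstBuilder().visitVariableDeclaration(sampleTree);
```

The `=` and `;` terminals are read past and dropped, so `sampleAst` carries only the declaration kind, the identifier, and the literal value.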
Architecture Diagram
┌─────────────────────────────────────────────────────────┐
│ Source Code │
│ const x = "Hello"; │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ ANTLR Lexer │
│ (JavaScriptLexer.ts - Generated) │
│ │
│ Input: Source code string │
│ Output: Token stream │
│ [CONST, ID, EQUALS, STRING, SEMICOLON] │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ ANTLR Parser │
│ (JavaScriptParser.ts - Generated) │
│ │
│ Input: Token stream │
│ Output: Parse tree (Context objects) │
│ VariableDeclarationContext │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Custom Visitor │
│ (AstBuilderVisitor.ts - Custom) │
│ │
│ Input: Parse tree │
│ Output: AST (@sylphlab/ast-core types) │
│ { type: 'VariableDeclaration', ... } │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Final AST │
│ (@sylphlab/ast-core types) │
└─────────────────────────────────────────────────────────┘
Key Components
@sylphlab/ast-core
Purpose: Generic AST node definitions
Contents:
// Core AST node interface
export interface AstNode {
type: string;
loc?: SourceLocation;
}
// Position tracking
export interface SourceLocation {
start: Position;
end: Position;
source?: string;
}
export interface Position {
line: number;
column: number;
offset?: number;
}
// Specific node types
export interface Program extends AstNode {
type: 'Program';
body: Statement[];
}
export interface VariableDeclaration extends AstNode {
type: 'VariableDeclaration';
declarations: VariableDeclarator[];
kind: 'const' | 'let' | 'var';
}
// ... more node types
@sylphlab/ast-javascript
Purpose: JavaScript-specific parser
Contents:
- grammar/ - ANTLR .g4 grammar files
- src/generated/ - ANTLR-generated files (lexer, parser, visitor)
- src/AstBuilderVisitor.ts - Custom visitor implementation
- src/index.ts - Public API (parse function)
Grammar Files
Location: packages/javascript/grammar/
The JavaScript grammar is based on the official ECMAScript specification:
- JavaScriptLexer.g4 - Token definitions
- JavaScriptParser.g4 - Syntax rules
Visitor Implementation
Location: packages/javascript/src/AstBuilderVisitor.ts
The custom visitor extends JavaScriptVisitorBase and implements methods for each grammar rule:
export class AstBuilderVisitor extends JavaScriptVisitorBase {
// Entry point
visitProgram(ctx: ProgramContext): Program {
return {
type: 'Program',
body: ctx.statement().map(stmt => this.visit(stmt)),
loc: this.getLocation(ctx)
};
}
// Statement visitors
visitVariableDeclaration(ctx: VariableDeclarationContext): VariableDeclaration { /* ... */ }
visitExpressionStatement(ctx: ExpressionStatementContext): ExpressionStatement { /* ... */ }
// Expression visitors
visitBinaryExpression(ctx: BinaryExpressionContext): BinaryExpression { /* ... */ }
visitIdentifier(ctx: IdentifierContext): Identifier { /* ... */ }
// Helper methods
getLocation(ctx: ParserRuleContext): SourceLocation { /* ... */ }
}
Performance Considerations
Lexer Performance
- Fast tokenization - ANTLR lexers are highly optimized
- DFA-based - Deterministic Finite Automaton for speed
- Minimal backtracking
Parser Performance
- ALL(*) parsing - ANTLR v4's adaptive LL(*) strategy for efficient top-down parsing
- Prediction caching - Lookahead decisions are cached in a DFA and reused
- Error recovery - Continues parsing after errors
Visitor Performance
- Single pass - Transform in one traversal
- No intermediate structures - Direct AST construction
- Memory efficient - Garbage collector friendly
Error Handling
Lexer Errors
- Invalid characters
- Unterminated strings
- Malformed numbers
Parser Errors
- Syntax errors
- Missing tokens
- Unexpected tokens
Error Recovery
ANTLR provides built-in error recovery:
- Single token deletion - Skip unexpected token
- Single token insertion - Assume missing token
- Synchronization - Skip to known good state
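Because recovery lets the parser continue, errors are usually collected rather than thrown. The sketch below shows a collecting listener whose `syntaxError` parameter order is modeled on the ANTLR error listener callback. The recognizer, symbol, and exception parameters are typed as `unknown` so the snippet stays self-contained; `CollectingErrorListener`, `SyntaxIssue`, and `formatIssues` are hypothetical names.

```typescript
interface SyntaxIssue {
  line: number;   // 1-based, as reported by ANTLR
  column: number; // 0-based character position within the line
  message: string;
}

class CollectingErrorListener {
  readonly issues: SyntaxIssue[] = [];

  // Records each reported error instead of printing it, so parsing can
  // continue and every problem can be reported at the end.
  syntaxError(
    _recognizer: unknown,
    _offendingSymbol: unknown,
    line: number,
    charPositionInLine: number,
    msg: string,
    _e: unknown,
  ): void {
    this.issues.push({ line, column: charPositionInLine, message: msg });
  }
}

function formatIssues(issues: SyntaxIssue[]): string {
  return issues
    .map((issue) => `${issue.line}:${issue.column} ${issue.message}`)
    .join('\n');
}
```

With a real generated parser, such a listener would typically be attached by calling `removeErrorListeners()` and then `addErrorListener(listener)` on the lexer and parser before parsing. Whether the toolkit's `parse` function does this is an assumption, not something this guide confirms.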
Extensibility
Adding New Grammar Rules
- Update grammar file:
// Add async/await support
asyncFunctionDeclaration
: ASYNC FUNCTION identifier '(' parameters ')' block
;
- Regenerate parser:
pnpm antlr
- Implement visitor:
visitAsyncFunctionDeclaration(ctx: AsyncFunctionDeclarationContext): FunctionDeclaration {
return {
type: 'FunctionDeclaration',
async: true,
id: this.visit(ctx.identifier()),
params: this.visit(ctx.parameters()),
body: this.visit(ctx.block())
};
}
Grammar Resources
Official ANTLR grammar repository:
- ANTLR grammars-v4
- Contains 200+ language grammars
- Well-tested and maintained
- Great starting point for new languages
Best Practices
Grammar Design
- Start simple - Add complexity incrementally
- Test thoroughly - Use grammar tests
- Follow conventions - Match ANTLR idioms
- Document rules - Add comments to grammar
Visitor Implementation
- One visitor method per rule - Clear mapping
- Extract helpers - Reuse common logic
- Handle nulls - Grammar may have optional parts
- Preserve locations - Track source positions
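A location helper is a typical example of reusable visitor logic. The sketch below is one possible shape for a `getLocation` helper, not the toolkit's actual implementation. It uses local `TokenLike` and `RuleContextLike` interfaces whose field names are modeled on the antlr4ts token and rule-context properties, and it assumes the stop token does not span multiple lines.

```typescript
interface Position {
  line: number;
  column: number;
  offset?: number;
}

interface SourceLocation {
  start: Position;
  end: Position;
}

// Local stand-ins for the runtime's token and context types.
interface TokenLike {
  line: number;               // 1-based line number
  charPositionInLine: number; // 0-based column of the token's first character
  startIndex: number;         // offset of the first character
  stopIndex: number;          // offset of the last character (inclusive)
  text?: string;
}

interface RuleContextLike {
  start: TokenLike;
  stop?: TokenLike; // may be missing, e.g. when a rule matched no tokens
}

function getLocation(ctx: RuleContextLike): SourceLocation {
  // Handle nulls: fall back to the start token when no stop token exists.
  const stop = ctx.stop ?? ctx.start;
  const stopLength = (stop.text ?? '').length;
  return {
    start: {
      line: ctx.start.line,
      column: ctx.start.charPositionInLine,
      offset: ctx.start.startIndex,
    },
    end: {
      // The end position is exclusive: one past the last character.
      line: stop.line,
      column: stop.charPositionInLine + stopLength,
      offset: stop.stopIndex + 1,
    },
  };
}
```

For `const x = 42;` on a single line, the start position is line 1, column 0 and the end position is line 1, column 13, one past the semicolon.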
Type Safety
- Strict types - Use TypeScript strict mode
- Avoid any - Type all visitor returns
- Exhaustive checks - Handle all cases
- Runtime validation - Validate AST structure
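Runtime validation complements static types when an AST arrives from an untyped source such as JSON. The sketch below walks a value and reports every nested object that lacks a string `type` field. It assumes that every nested object other than `loc` is meant to be a node, which may not hold for all node definitions; `validateTree` and `isRecord` are hypothetical names.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Returns a list of problems; an empty list means the structure looks valid.
function validateTree(value: unknown, path = 'root'): string[] {
  if (!isRecord(value) || typeof value['type'] !== 'string') {
    return [`${path}: expected an object with a string "type"`];
  }
  const problems: string[] = [];
  for (const key of Object.keys(value)) {
    // Source locations are plain data, not nodes.
    if (key === 'loc') continue;
    const child = value[key];
    if (Array.isArray(child)) {
      child.forEach((item, index) => {
        // Primitive array items are ignored; objects must be nodes.
        if (isRecord(item)) {
          problems.push(...validateTree(item, `${path}.${key}[${index}]`));
        }
      });
    } else if (isRecord(child)) {
      problems.push(...validateTree(child, `${path}.${key}`));
    }
  }
  return problems;
}
```

A check like this fits naturally in tests that run the visitor over sample inputs, where a non-empty result points at the exact path of the malformed node.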
References
ANTLR Documentation
TypeScript Integration
- antlr4ts - ANTLR runtime for TypeScript
- antlr4ts-cli - Code generator
AST Specifications
- ESTree - JavaScript AST specification
- TypeScript AST - TypeScript AST explorer