ECMAScript 1 (ES1) Parser
Although I took a compiler course at university over 10 years ago, I don’t write parsers or compilers on a day-to-day basis. So even though I understand that parsers are probably what’s behind some of the JavaScript tools I use, they sometimes feel a bit like magic.
Today we enjoy tools like Babel, ReactJS, ESLint or JSHint, and UglifyJS, but my knowledge of them mainly revolves around using them as a consumer. So I set out to close this gap and wrote an ECMAScript v1 parser. Source code is on GitHub.
Table of Contents
- Parsers Everywhere
- Step 1: Lexer
- Step 2: Parser
- Interesting Things About ES1
- Implement JSX Extension
- Implement a Minifier
- Implement a Linter
- Khan Academy’s Live Editor
- What’s Next
Parsers Everywhere
I started my learning by researching how some of the tools work. From this research I found out that Babel uses the Babylon parser, and that ReactJS uses a fork of the Esprima parser but appears to have plans to transition to the Acorn parser. Like ReactJS, ESLint also uses a fork of Esprima, named Espree. JSHint uses Esprima as well. Lastly, UglifyJS has its own parser.
Once I confirmed that a parser underlies all of these tools, an idea came to me: I should go write my own parser for v1 of JavaScript. I figured the syntax was simpler when JavaScript was just invented, which would make my job easier while still giving me a firm grasp of what the tools are doing. I liked that doing so would let me travel back to around 1997, a time when I was just starting to learn English. I also liked that I could try to implement JSX support, making it a 1997ish-meets-2013ish combination.
Step 1: Lexer
The first step to writing a parser is to write a lexer, also called a tokenizer or scanner. I started by reading the original ECMAScript standard, which was finalized in June 1997. You can find links to the ECMAScript standards on the Mozilla Developer Network site.
Section 7 is the part of interest here. It specifies the Lexical Conventions and identifies all the individual tokens of the language.
Based on my reading of the spec, I coded up the first version of my lexer, but it wasn’t very good. I then happened to watch Rob Pike’s talk on writing a lexer in Go and adopted ideas from it to form the foundation of my second iteration.
1.1: Lexer Foundation Functions
A lexer’s job is to read the input string character by character and group the characters into tokens for later processing. In Rob Pike’s lexer, a variable named start marks the beginning of the token currently being processed, and a variable named pos marks its end. The pos variable gets incremented each time a character is deemed to belong to the current token. When a token is complete, the substring from start to pos is taken and start is advanced.
For my lexer, the reading of a character is done by a function named peek. It reads the next character to be processed and returns it. However, peek doesn’t advance the pos position, because sometimes you need to see what’s next in order to decide how to process the input text.
// Returns the next character in the source code
l.peek = function() {
  if (this.pos >= this.source.length) {
    // null represents EOF
    return null;
  }
  const c = this.source[this.pos];
  return c;
};
Other functions make use of the peek function to decide what to do. For example, the acceptRun function uses peek to decide whether to keep including characters in the current token. I’ve modified Rob Pike’s acceptRun to instead take a validator function, since it’s more flexible that way and I use it to accumulate various types of tokens.
// Keeps consuming characters as long as the validator accepts them.
// Returns whether the position was advanced at all.
l.acceptRun = function(validator) {
  let c;
  const startedAt = this.pos;
  do {
    c = this.peek();
    if (c === null) {
      break;
    }
  } while (validator(c) && ++this.pos);
  return (this.pos > startedAt);
};
The last function needed before writing the first iteration of the lexer is the ignore function. It advances the start variable without doing anything with the characters accumulated between start and pos. This is used to discard parts of the source code such as whitespace or code comments.
// Skips over the pending input before this point
l.ignore = function() {
  this.start = this.pos;
};
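As a tiny example of these three functions working together, skipping over a run of whitespace could look like the sketch below (isWhitespace and skipWhitespace are illustrative names, not part of the lexer shown here):

function isWhitespace(c) {
  return c === ' ' || c === '\t';
}

l.skipWhitespace = function() {
  // Consume the whitespace run, then discard it without emitting a token
  this.acceptRun(isWhitespace);
  this.ignore();
};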
1.2: Code Comments
One of the easiest things to process is a single-line code comment. In the ECMAScript 1 standard it’s specified as follows:
SingleLineComment ::
  // SingleLineCommentChars_opt

SingleLineCommentChars ::
  SingleLineCommentChar SingleLineCommentChars_opt

SingleLineCommentChar ::
  SourceCharacter but not LineTerminator
The standard says that a single-line comment starts with two forward slash characters (//) followed by zero or more SingleLineCommentChars. A SingleLineCommentChar is just a source character that is not a LineTerminator (Carriage Return or Line Feed).
The lexer has a main function named lexText that keeps processing characters until the end of input. To know when we’ve encountered a single-line comment, lexText peeks at the next two characters to decide whether to hand off to the lexSingleLineComment function.
l.lexText = function() {
  do {
    // Examine the next 2 characters to see if we're encountering code comments
    const nextTwo = this.source.substr(this.pos, 2);
    if (nextTwo === '//') {
      this.pos += 2;
      return this.lexSingleLineComment;
    }
    // ... handling of all the other token types is elided here ...
  } while (true);
};
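Note that lexText returns this.lexSingleLineComment rather than calling it. That’s the state-function idea from Rob Pike’s talk: each lex function returns the next state, and a top-level loop keeps invoking whatever function was last returned until the input runs out.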
The lexSingleLineComment function uses the acceptRun and ignore functions introduced earlier, along with a couple of helpers: the not function inverts the boolean result of the function passed to it, and the isLineTerminator function does what its name says and returns a boolean.
The lexSingleLineComment implementation below says: keep accumulating characters into the current token as long as the character is not a line terminator. Once a line terminator or the end of input is reached, call ignore() to discard the token.
function not(fn) {
  return function(c) {
    const result = fn(c);
    return !result;
  };
}

function isLineTerminator(c) {
  if (c === '\n' || c === '\r') {
    return true;
  }
  return false;
}

l.lexSingleLineComment = function() {
  // Single line comment is only terminated by a line terminator
  // character and nothing else
  this.acceptRun(not(isLineTerminator));
  this.ignore();
  return this.lexText;
};
That’s how my lexer processes single-line comments. It accumulates and then discards them so that the later stages don’t need to deal with comments at all.
1.3: Other Tokens
Other types of tokens are processed similarly to single-line comments, using the foundation functions and additional helpers. The code for those is more verbose, so I won’t go through it here, but the concept is the same. Have a look at src/lexer.js for the complete source code of the lexer. You can build up the lexer incrementally, adding one type of token at a time and writing tests along the way.
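As a taste of what that looks like, here’s a hedged sketch of lexing an identifier. The isIdentifierChar and addToken helpers are assumptions for illustration, and it presumes lexText has already checked that the first character isn’t a digit; the real code in src/lexer.js also has to deal with keywords and other details.

function isIdentifierChar(c) {
  return /[A-Za-z0-9$_]/.test(c);
}

l.lexIdentifier = function() {
  // Accumulate identifier characters into the current token, emit it,
  // then hand control back to the main lexText state function
  this.acceptRun(isIdentifierChar);
  this.addToken('identifier'); // addToken is an assumed helper
  return this.lexText;
};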
Step 2: Parser
Once the lexer finishes processing, we’re left with a sequence of tokens, and the parser needs to do two things with them:
- Determine if the token sequence conforms to the JavaScript grammar
- Produce an Abstract Syntax Tree (AST) structure that can be used to build other tools
ECMAScript standard sections 11 to 14 specify the language grammar. I’ve compiled the grammar into a single page here.
When one begins to look at parsing, there are many different approaches. There are LL parsers, LR parsers and their variants (GLR, LALR, and SLR), and more! There is also a variety of parser generators, and lexer generators for that matter, to choose from.
To get a sense of how others have done it, I skimmed over the Esprima and Acorn source code. From what I can tell from a quick scan, they both use hand-coded recursive descent. So that’s what I’ll do too: hand-code a recursive descent parser. A study of the other parser types will have to wait for another time, since an exhaustive survey of parsing techniques isn’t my main objective.
2.1: Grammar with Left-Recursion Eliminated
A recursive descent parser can only parse a grammar that doesn’t contain left-recursion. However, the original ECMAScript grammar contains a bunch of left-recursive productions that are troublesome for a recursive descent parser. So step one for the parser is to convert the grammar by eliminating left-recursion, as I have done here.
It is interesting to note that the method for eliminating left-recursion also messes with operator associativity; I’ll deal with that later.
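To make this concrete, here is the spec’s left-recursive AdditiveExpression production, followed by one standard way to rewrite it. The primed helper production is my own notation, not something from the spec.

Original (left-recursive):
AdditiveExpression:
  MultiplicativeExpression
  AdditiveExpression + MultiplicativeExpression
  AdditiveExpression - MultiplicativeExpression

After eliminating left-recursion:
AdditiveExpression:
  MultiplicativeExpression AdditiveExpression'
AdditiveExpression':
  + MultiplicativeExpression AdditiveExpression'
  - MultiplicativeExpression AdditiveExpression'
  (empty)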
Once we have the grammar with left-recursion eliminated, writing the parser follows the grammar specification closely. For example, mine starts with parseProgram and parseSourceElement functions, mirroring how the grammar is specified. Code for the parser is in src/parser.js.
2.2: Stuck on 1.1 + 2e2 + 3^2
One of the first problems I ran into was with a test case parsing the statement var a = 1.1 + 2e2 + 3^2;. The parsing sequence in my initial implementation went as follows:
VariableStatement
  var VariableDeclarationList ;
    VariableDeclaration
      Identifier Initializer
        = AssignmentExpression
          ConditionalExpression
            LogicalORExpression
              LogicalANDExpression
                BitwiseORExpression
                  BitwiseXORExpression
                    BitwiseANDExpression
                      EqualityExpression
                        RelationalExpression
                          ShiftExpression
                            AdditiveExpression + MultiplicativeExpression
The problem is that when AdditiveExpression expands to MultiplicativeExpression, the parser is then unable to go back up the chain and parse ^2 as part of a BitwiseXORExpression.
I was stuck on this for a while, and I really wanted to understand it rather than rush past it. I knew from reading the header comment in the Acorn parser’s expression.js that there are ways to deal with it. However, I could not see how to do it in the recursive descent parser I was writing. One of my questions was whether that’s even a valid expression, since my parser was having trouble with it. A quick check with node.js confirms that it is indeed valid: it evaluates to 206, because ^ is bitwise XOR and binds more loosely than +, so the additions happen first, ToInt32 turns 204.1 into 204, and 204 ^ 2 gives 206.
So how is that possible?
It turns out that for the parser to be able to parse the expression, it somehow needs to know to expand AssignmentExpression into a BitwiseXORExpression first, separating the part containing 1.1 + 2e2 + 3 from the 2. It can then parse those two parts as BitwiseANDExpressions and continue from there.
I don’t think a simple recursive descent parser can deal with the JavaScript grammar. A simplistic parser would have to somehow know that a ^ occurs several tokens later and expand AssignmentExpression into BitwiseXORExpression instead. Once I understood how the expression conforms to the grammar, I was happy to move on and solve the problem using precedence climbing, as discussed next.
2.3: Operator Precedence & Precedence Climbing
From the way the JavaScript grammar is specified, one can deduce the operator precedence from the hierarchy. For example, AdditiveExpression expands to MultiplicativeExpression, which means multiplications have higher precedence than additions: an expression like 1 + 2 * 3 means 1 + (2 * 3). A full list of operator precedence levels in ES6 and beyond can be found here.
For ES1 it’s just a subset of what’s listed:
const operatorPrecedence = {
  '||': 0,
  '&&': 1,
  '|': 2,
  '^': 3,
  '&': 4,
  '==': 5,
  '!=': 5,
  '<': 6,
  '>': 6,
  '<=': 6,
  '>=': 6,
  '<<': 7,
  '>>': 7,
  '>>>': 7,
  '+': 8,
  '-': 8,
  '*': 9,
  '/': 9,
  '%': 9
};
Knowing the operator precedence levels, one can then use an algorithm called precedence climbing to deal with the 1.1 + 2e2 + 3^2 problem from the previous section. There are other algorithms, such as the shunting-yard algorithm, but I went with precedence climbing as it seems simpler.
Luckily, in ES1 all binary operators are left-associative, so the part of precedence climbing that deals with right associativity can be removed. In ES7, though, the exponentiation (**) operator introduces a right-associative operator into the picture.
My implementation of precedence climbing can be found in the parseBinaryExpression function. The parts dealing with ast and lhs can be ignored for now; those are discussed in the next section.
// Uses precedence climbing to deal with binary expressions, all of which have
// left-to-right associativity in this case.
p.parseBinaryExpression = function(minPrecedence) {
  const punctuators = [
    '||', '&&', '|', '^', '&', '==', '!=', '<', '>', '<=', '>=',
    '<<', '>>', '>>>', '+', '-', '*', '/', '%'
  ];
  const result = this.parseUnaryExpression();
  let ast = result.ast;
  let lhs = result.lhs;
  while (this.matchPunctuators(punctuators) &&
         operatorPrecedence[this.next().value] >= minPrecedence) {
    // If any operator is encountered, then the result cannot be
    // LeftHandSideExpression anymore
    lhs = false;
    const precedenceLevel = operatorPrecedence[this.next().value];
    const operatorToken = this.expectPunctuators(punctuators);
    const right = this.parseBinaryExpression(precedenceLevel + 1);
    if (operatorToken.value === '||' || operatorToken.value === '&&') {
      ast = new estree.LogicalExpression(operatorToken.value, ast, right.ast);
    }
    else {
      ast = new estree.BinaryExpression(operatorToken.value, ast, right.ast);
    }
  }
  return {
    ast: ast,
    lhs: lhs
  };
};
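To see the climbing in action on the problem expression from section 2.2, a call to parseBinaryExpression(0) first parses 1.1, sees + at precedence 8, and recurses with a minPrecedence of 9. That inner call consumes only 2e2, because the + that follows fails the >= 9 test. The same happens for + 3. Finally, ^ at precedence 3 still passes the outer call’s >= 0 test, so the whole accumulated sum becomes the left operand of the XOR, which is exactly the grouping the grammar demands.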
2.4: ConditionalExpression vs LeftHandSideExpression
Another challenge when parsing JavaScript comes when the parser needs to parse an AssignmentExpression. The parser needs to decide whether it’s parsing a ConditionalExpression or a LeftHandSideExpression plus other stuff. The grammar is shown below:
AssignmentExpression:
  ConditionalExpression
  LeftHandSideExpression AssignmentOperator AssignmentExpression
The problem is that ConditionalExpression can also expand to just LeftHandSideExpression, so you can’t just have an if-statement there and call it done.
I solved it by delaying the decision until later. When parsing AssignmentExpression, there is no if-statement up front; the code simply attempts to parse a ConditionalExpression immediately. Whether that ConditionalExpression expanded into just a LeftHandSideExpression is decided later and passed back up the chain. This way the parser doesn’t need to backtrack, that is, try to parse one alternative, discard the result if it fails, and then try the other.
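To make the delayed decision concrete, here’s a condensed sketch of how parsing an AssignmentExpression could look. It assumes a parseConditionalExpression that returns {ast, lhs} in the same style as parseBinaryExpression, and an estree.AssignmentExpression constructor; the real code is in src/parser.js.

p.parseAssignmentExpression = function() {
  const assignmentOperators = [
    '=', '*=', '/=', '%=', '+=', '-=', '<<=', '>>=', '>>>=', '&=', '^=', '|='
  ];
  // Always parse a ConditionalExpression first; whether it was really just a
  // LeftHandSideExpression comes back via the lhs flag
  const result = this.parseConditionalExpression();
  if (result.lhs && this.matchPunctuators(assignmentOperators)) {
    const operatorToken = this.expectPunctuators(assignmentOperators);
    const right = this.parseAssignmentExpression();
    return {
      ast: new estree.AssignmentExpression(operatorToken.value, result.ast, right.ast),
      lhs: false
    };
  }
  return result;
};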
The information about whether the ConditionalExpression is actually a LeftHandSideExpression is revealed when parsing PostfixExpression. If the parser sees a LeftHandSideExpression followed by ++ or --, then it’s not a lone LeftHandSideExpression.
PostfixExpression:
  LeftHandSideExpression
  LeftHandSideExpression [no LineTerminator here] ++
  LeftHandSideExpression [no LineTerminator here] --
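In code, a minimal sketch of where that flag gets decided might look like the following; the estree.UpdateExpression constructor taking (operator, argument, prefix) is my assumption for illustration, not code from the parser itself.

p.parsePostfixExpression = function() {
  let lhs = true;
  let ast = this.parseLeftHandSideExpression();
  if (this.matchPunctuators(['++', '--'])) {
    const operatorToken = this.expectPunctuators(['++', '--']);
    // Seeing ++ or -- means this is no longer a bare LeftHandSideExpression
    ast = new estree.UpdateExpression(operatorToken.value, ast, false);
    lhs = false;
  }
  return { ast: ast, lhs: lhs };
};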
2.5: NewExpression vs CallExpression vs MemberExpression
A very similar challenge comes up when parsing LeftHandSideExpression, which can expand to either NewExpression or CallExpression. I applied the same delay-the-decision solution, as you can see in the function declaration of parseNewOrCallOrMemberExpression. There is no decision making in parseLeftHandSideExpression at all.
p.parseNewOrCallOrMemberExpression = function(couldBeNewExpression, couldBeCallExpression) {
  // ... decision logic elided; see src/parser.js for the full body ...
};

p.parseLeftHandSideExpression = function() {
  return this.parseNewOrCallOrMemberExpression(true, true).object;
};
In this case, the distinction between a NewExpression and a MemberExpression, which comes later on, is made by encountering Arguments. If Arguments is found, then it is a MemberExpression and not a NewExpression; for example, new Foo() parses as a MemberExpression while a bare new Foo remains a NewExpression. Similarly, if after parsing a MemberExpression the parser encounters more Arguments, then it’s a CallExpression.
Interesting Things About ES1
I noticed some interesting things after I finished the parser, when I tried to parse real-world code and ran into issues.
Back in 1997, JavaScript didn’t support regular expressions. They’re so widely used in source code nowadays that their presence prevented my parser from processing modern code.
Another difference is that there is no such thing as a FunctionExpression. ES1 has FunctionDeclaration but not the expression equivalent. All those var a = function(){}; statements are not valid syntax! No lambdas for you in 1997!
This omission was rectified by ES3 in 1999, though. The funny thing is that in 1999 I think I had just advanced from ESL to the regular English class. I wasn’t even concerned about programming back then, let alone lambdas.
Implement JSX Extension
At my company Kash, we implemented our online checkout interface in ReactJS, so it’s of interest to me. ReactJS makes use of JSX files, which are this weird JavaScript syntax with HTML tags mixed in.
An example JSX file looks like this:
// Using JSX to express UI components.
var dropdown =
  <Dropdown>
    A dropdown list
    <Menu>
      <MenuItem>Do Something</MenuItem>
      <MenuItem>Do Something Fun!</MenuItem>
      <MenuItem>Do Something Else</MenuItem>
    </Menu>
  </Dropdown>;

render(dropdown);
Once I finished implementing my parser, I implemented an incomplete version of the JSX extension to get a feel for it. Parsers really are driving a ton of the things we use today. My implementation is in the jsx branch. var a = <div></div>; now produces an Abstract Syntax Tree! We’ve now time travelled to 2013!
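For reference, a JSX transform ultimately turns that tree back into plain function calls. The line below reflects what modern React tooling targets; the 2013-era transform emitted slightly different factory calls, so treat this as an illustration rather than exact output.

// Roughly what a JSX transform could emit for the example above
var a = React.createElement('div', null);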
Just having a parser that understands JSX files isn’t enough, though. ReactJS has to take the AST and convert it into proper ECMAScript to be executed in the browser. In the next section, when I examine minifiers, I cover this task of transforming the AST to output different JavaScript source code.
Implement a Minifier
I use UglifyJS to minify JavaScript code before deployment. A JavaScript minifier removes whitespace, removes new lines, and does lots of other things. But how does one go about implementing a minifier?
Well, once the parser finishes parsing the source code, it produces an AST. My parser emits the AST in the ESTree format, as do the Esprima and Acorn parsers. UglifyJS, on the other hand, has its own format from its own parser but can import the ESTree format for processing.
One of the things a minifier will do is shorten function names, because to a JavaScript engine our veryMaintainableFunctionName isn’t meaningful. In order to achieve this renaming, we first need to figure out what functions there are, and to do so we process the AST starting from the root Program node.
The ESTree format specifies that the Program node is identified by type = 'Program' and has a member property body that contains an array of statements. Our first task is to find the function declarations, and we can do this simply by looping over the body array, like the following code:
if (node.type === 'Program') {
  // First, go through all statements in the source code to gather function
  // names
  const newScope = {}; // mapping of old function name to new function name
  for (const stmt of node.body) {
    if (stmt.type === 'FunctionDeclaration') {
      // FunctionDeclaration is specified in ESTree to have an 'id'
      // property, which is an Identifier with a 'name' property.
      // Start with an identity mapping; the actual remapping happens later.
      newScope[stmt.id.name] = stmt.id.name;
    }
  }
}
Once we’ve obtained all the function names, we can remap them to something else, for example turning veryMaintainableFunctionName into just a. To do this, the minifier needs to find all instances of CallExpression and, instead of printing out the original function Identifier being referenced, swap in our substitution. Below is a quick example, though it doesn’t deal with more than two scopes:
if (node.type === 'Identifier') {
  let scope = scopes[scopes.length - 1];
  // When we can't find the variable in the scope, look in the parent scope!
  if (!scope[node.name]) {
    scope = scopes[scopes.length - 2];
  }
  minified += scope[node.name]; // Substitute function name being referenced
}
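The example above assumes the scope already maps names to their short replacements. Generating those short names isn’t shown in my sketch, but a toy generator could look like this; shortName is a hypothetical helper, not part of the minifier described above.

// A toy short-name generator: a, b, ..., z, aa, ab, ...
function shortName(index) {
  let name = '';
  do {
    name = String.fromCharCode(97 + (index % 26)) + name;
    index = Math.floor(index / 26) - 1;
  } while (index >= 0);
  return name;
}

// Remap every collected function name to a generated short name
Object.keys(newScope).forEach(function(originalName, i) {
  newScope[originalName] = shortName(i);
});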
There is of course a lot more to a minifier. A good one probably also takes advantage of JavaScript’s automatic semicolon insertion rules if the goal is to optimize for file size.
Implement a Linter
Implementing a linter similarly requires processing the AST to intelligently figure out what’s in a program’s source code. For example, I use ESLint at work and typically turn on the no-var rule for ES6. Doing so gives me a warning whenever I use var instead of the newer ES6 constructs const and let.
Implementing such a warning is fairly simple: the linter needs to disallow VariableDeclaration and only allow the use of LexicalDeclaration in ES6. Again, there’s lots more to a linter than this simplistic example, but it all makes sense once you know how the AST is being used.
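As a rough sketch of what such a rule could look like over an ESTree AST: the report callback here is hypothetical, while kind is the real ESTree property that distinguishes var, let, and const on a VariableDeclaration node.

// A minimal sketch of a no-var check over an ESTree AST
function checkNoVar(node, report) {
  if (!node || typeof node !== 'object') {
    return;
  }
  if (node.type === 'VariableDeclaration' && node.kind === 'var') {
    report('Unexpected var, use let or const instead.');
  }
  // Recurse into every child node and every array of child nodes
  for (const key of Object.keys(node)) {
    const child = node[key];
    if (Array.isArray(child)) {
      child.forEach(function(c) { checkNoVar(c, report); });
    } else if (child && typeof child === 'object') {
      checkNoVar(child, report);
    }
  }
}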
Khan Academy’s Live Editor
Khan Academy’s Computer Science and Computer Programming subjects feature an online coding environment that gives you helpful hints as you code. I was quite impressed when I first tried it. Curious about how it works and which parsers it uses, I did a bit of research.
The online text editing environment is called the Live Code Editor and uses JSHint as the linter to provide helpful messages for JavaScript code. It also uses a fork of Slowparse for parsing HTML.
Very cool! Parsers are all around us.
What’s Next
There is so much one can learn in this area. There are the different parsing techniques. There’s the TDOP algorithm used by JSLint. There are the automatic semicolon insertion rules that I didn’t implement. There are things in the spec that I have yet to handle, like restricting a line terminator from appearing inside a PostfixExpression. Also, since these parsers execute in the browser, speed is another factor; the Esprima website has a speed comparison of different parsers. Aside from speed, there’s something called Early Errors in ES6 that I don’t know much about. I also think it’d be interesting to morph my ES1 parser by adding the newer standards, to get a sense of how challenging each new language addition is to implement. I could also learn to use parser tools like Parsec, the parser combinator library mentioned in the Real World Haskell book.
Reading through the standards for all the different versions would be a nice thing to do too. For example, section 4.2.1 in the ES1 standard talks about prototype-based inheritance and its behaviour.
That said, I’m satisfied with my little learning project for now. There are lots of other things to learn, and I’ve got my eyes set on my next target.