ECMAScript 1 (ES1) Parser


Although I took a compiler course in university over 10 years ago, I don’t write parsers or compilers on a day-to-day basis. So even though I understand that parsers are probably what’s behind some of the JavaScript tools I use, they sometimes feel a bit like magic.

Today we enjoy tools like Babel, ReactJS, ESLint or JSHint, and UglifyJS, but my knowledge of them mainly comes from using them as a consumer. So I set out to close this gap and wrote an ECMAScript v1 parser. Source code on GitHub.



Parsers Everywhere

I started my learning by researching how some of these tools work. From this research I learned that Babel uses the Babylon parser, and that ReactJS uses a fork of the Esprima parser but appears to have plans to transition to the Acorn parser. Like ReactJS, ESLint also uses a fork of Esprima, named Espree. JSHint uses Esprima as well. Lastly, UglifyJS has its own parser.

Once I confirmed that parsers underlie these tools, an idea came to me: I should write my own parser for v1 of JavaScript. I figured the syntax was simpler when JavaScript was just invented, which would make my job easier while still giving me a firm grasp of what the tools are doing. I liked that doing so would let me travel back to around 1997, a time when I was just starting to learn English. I also liked that I could try to implement JSX support, making it a 1997ish-meets-2013ish combination.

Step 1: Lexer

The first step to writing a parser is to write a lexer, also called a tokenizer or scanner. I started by reading the original ECMAScript standard, which was finalized in June 1997. You can find links to the ECMAScript standards on the Mozilla Developer Network site.

Section 7 is the part you’re interested in here. It specifies the Lexical Conventions and identifies all the individual tokens of the language.

Based on my reading of the spec, I coded up the first version of my lexer, but it wasn’t very good. I then happened to watch Rob Pike’s talk on lexing in Go and adopted ideas from there to form the foundation of my second iteration.

1.1: Lexer Foundation Functions

A lexer’s job is to read the input string character by character and group the characters into tokens for later processing. In Rob Pike’s lexer, a variable named start marks the start of the token being processed, and a variable named pos marks its end. The pos variable gets incremented each time a character is deemed to belong to the current token. When a token is decided to be finished, one can take the substring from start to pos and advance start.

[Diagram: the start and pos variables marking the boundaries of the token being processed]
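
To make the following snippets concrete, here is a minimal sketch of the state these functions operate on. This is an assumption about the general shape; the actual src/lexer.js may organize things differently.

// Hypothetical sketch of the lexer state; the real src/lexer.js may differ
function Lexer(source) {
    this.source = source; // the full input text
    this.start = 0;       // index where the current token begins
    this.pos = 0;         // index one past the last accepted character
}

// The functions below (peek, acceptRun, ignore, ...) hang off this prototype
const l = Lexer.prototype;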

For my lexer, reading a character is done by a function named peek. It reads the next character to be processed and returns it. However, peek doesn’t advance the pos position, because sometimes you need to see what’s next in order to decide how to process the input text.

// Returns the next character in the source code
l.peek = function() {
    if (this.pos >= this.source.length) {
        // null represents EOF
        return null;
    }

    const c = this.source[this.pos];
    return c;
};

Other functions make use of peek to decide what to do. For example, the acceptRun function uses peek to decide whether to keep including characters in the current token. I’ve modified Rob Pike’s acceptRun to take a validator function instead, since that’s more flexible and lets me use it to accumulate various types of tokens.

// Keeps accepting characters for the current token as long as the
// validator approves them; returns whether anything was accepted
l.acceptRun = function(validator) {
    let c;
    let startedAt = this.pos;
    do {
        c = this.peek();
        if (c === null) {
            break;
        }
    } while(validator(c) && ++this.pos);

    return (this.pos > startedAt);
};

The last function needed before writing the first iteration of the lexer is the ignore function. It advances the start variable without doing anything with the characters accumulated between start and pos. This is used to discard things in the source code such as whitespace or comments.

// Skips over the pending input before this point
l.ignore = function() {
    this.start = this.pos;
};
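
Putting acceptRun and ignore together, skipping whitespace might look like the following sketch. The isWhitespace helper is my assumption here, not necessarily what the actual lexer does.

// Hypothetical helper covering the ES1 WhiteSpace characters from
// section 7.1 of the spec: tab, vertical tab, form feed, and space
function isWhitespace(c) {
    return c === ' ' || c === '\t' || c === '\v' || c === '\f';
}

// Accumulate a run of whitespace, then throw it away
l.skipWhitespace = function() {
    this.acceptRun(isWhitespace);
    this.ignore();
};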

1.2: Code Comments

One of the easiest things to process is a single-line comment. The ECMAScript 1 standard specifies it as follows:


SingleLineComment ::
    // SingleLineCommentChars_opt
SingleLineCommentChars ::
    SingleLineCommentChar SingleLineCommentChars_opt
SingleLineCommentChar ::
    SourceCharacter but not LineTerminator

The standard says that a single-line comment starts with two forward slash characters (//) followed by zero or more SingleLineCommentChars. A SingleLineCommentChar is just a source character that is not a LineTerminator (Carriage Return or Line Feed).

The lexer has a main function named lexText that keeps processing characters until the end of the input. To recognize a single-line comment, lexText peeks at the next two characters to decide whether to call the lexSingleLineComment function to deal with it.

l.lexText = function() {
    do {
        // Examine the next 2 characters to see if we're encountering
        // code comments
        const nextTwo = this.source.substr(this.pos, 2);
        if (nextTwo === '//') {
            this.pos += 2;
            return this.lexSingleLineComment;
        }
        // ... handling of all other token types is elided here ...
    } while (true);
};

The lexSingleLineComment function utilizes the acceptRun and ignore functions introduced earlier, along with a couple of helper functions. The not function inverts the boolean result of the function passed to it. The isLineTerminator function does what its name says and returns a boolean.

The lexSingleLineComment implementation below says: keep accumulating characters into the current token as long as the character is not a line terminator. Once a line terminator or the end of the input is reached, it calls ignore() to discard the token.

function not(fn) {
    return function(c) {
        const result = fn(c);
        return !result;
    };
}

function isLineTerminator(c) {
    if (c === '\n' || c === '\r') {
        return true;
    }
    return false;
}

l.lexSingleLineComment = function() {
    // Single line comment is only terminated by a line terminator
    // character and nothing else
    this.acceptRun(not(isLineTerminator));
    this.ignore();
    return this.lexText;
};

That’s how my lexer processes single-line comments. It accumulates and then discards them so that the later stages don’t need to deal with them.

1.3: Other Tokens

Other types of tokens are processed similarly to single-line comments, using the foundation functions plus additional helpers. The code for those is more verbose, so I won’t go through it here, but the concept is the same. Have a look at src/lexer.js for the complete source code of the lexer. You can build the lexer up incrementally, adding one type of token at a time and writing tests as you go.
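
To give a flavour of it, here’s a hypothetical sketch of lexing an identifier with the same foundation functions. The isIdentifierPart and emit helpers are assumptions for illustration, not the actual lexer.js code.

// Hypothetical helper: characters allowed in an identifier
// (glossing over the spec's finer points for brevity)
function isIdentifierPart(c) {
    return /[A-Za-z0-9$_]/.test(c);
}

l.lexIdentifier = function() {
    // Keep accepting identifier characters into the current token
    this.acceptRun(isIdentifierPart);
    // Emit the accumulated substring [start, pos) as an Identifier token;
    // 'emit' is an assumed helper that records the token and advances start
    this.emit('Identifier', this.source.substring(this.start, this.pos));
    return this.lexText;
};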

Step 2: Parser

Once the lexer finishes, we’re left with a sequence of tokens, and the parser’s job is twofold:

  1. Determine if the token sequence conforms to the JavaScript grammar
  2. Produce an Abstract Syntax Tree (AST) structure that can be used to build other tools

Sections 11 to 14 of the ECMAScript standard specify the language grammar. I’ve compiled the grammar into a single page here.

When one begins to look at parsing, there are many different approaches: LL parsers, LR parsers and their variants (GLR, LALR, and SLR), and more! There are also a variety of parser generators, and lexer generators for that matter, to choose from.

To get a sense of how others have done it, I skimmed the Esprima and Acorn source code. From what I can tell from a quick scan, both are hand-coded recursive descent parsers. So that’s what I did too: hand-code a recursive descent parser. Studying the other types of parsers will have to wait for another time, since an exhaustive survey of parsing techniques isn’t my main objective.

2.1: Grammar with Left-Recursion Eliminated

A recursive descent parser can only parse grammars that don’t contain left-recursion. However, the original ECMAScript grammar contains a bunch of left-recursions that are troublesome for a recursive descent parser. So step one for the parser was to transform the grammar by eliminating left-recursion, as I have done here.
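
To give one concrete example, the spec defines AdditiveExpression left-recursively:

AdditiveExpression :
    MultiplicativeExpression
    AdditiveExpression + MultiplicativeExpression
    AdditiveExpression - MultiplicativeExpression

The standard transformation rewrites this into a right-recursive pair, where the new AdditiveExpression' rule matches the operator-and-operand tails and may be empty:

AdditiveExpression :
    MultiplicativeExpression AdditiveExpression'
AdditiveExpression' :
    + MultiplicativeExpression AdditiveExpression'
    - MultiplicativeExpression AdditiveExpression'
    (empty)

A recursive descent function for the rewritten rule consumes a MultiplicativeExpression first and then loops over the tail, so it never calls itself without consuming input.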

It is interesting to note that the method for eliminating left-recursion also messes with operator associativity, but I’ll deal with it later.

Once we have the left-recursion-free grammar, writing the parser follows the grammar specification closely. For example, mine starts with parseProgram and parseSourceElement functions, mirroring how the grammar is specified. Code for the parser is in src/parser.js.

2.2: Stuck on 1.1 + 2e2 + 3^2

One of the first problems I ran into was when I wrote a test case to parse the statement var a = 1.1 + 2e2 + 3^2;. The parsing sequence that happens in my initial implementation goes as follows:

VariableStatement
  var VariableDeclarationList ;
    VariableDeclaration
      Identifier Initializer
        = AssignmentExpression
          ConditionalExpression
            LogicalORExpression
              LogicalANDExpression
                BitwiseORExpression
                  BitwiseXORExpression
                    BitwiseANDExpression
                      EqualityExpression
                        RelationalExpression
                          ShiftExpression
                            AdditiveExpression + MultiplicativeExpression

The problem is that once AdditiveExpression expands to MultiplicativeExpression, the parser is unable to go back up the chain and parse ^2 as part of a BitwiseXORExpression.

I was stuck on this for a while, and I really wanted to understand it rather than rush past it. I knew from reading the header comment in the Acorn parser’s expression.js that there are ways to deal with it. However, I could not see how to do it with the recursive descent parser I was writing. One of my questions was whether that’s even a valid expression, since my parser was having trouble. A quick check with Node.js confirmed that it is indeed valid.

So how is that possible?

It turns out that for the parser to parse the expression, it somehow needs to know to expand AssignmentExpression to a BitwiseXORExpression first, separating the part containing 1.1 + 2e2 + 3 from the 2. It can then parse each of those two parts and continue from there.

I don’t think a naive recursive descent parser can deal with the JavaScript grammar here. Such a parser would somehow have to know that ^ occurs several tokens later and expand AssignmentExpression to BitwiseXORExpression accordingly. Once I understood how the expression conforms to the grammar, I was happy to move on and solve the problem using precedence climbing, as discussed next.

2.3: Operator Precedence & Precedence Climbing

From the way the JavaScript grammar is specified, one can deduce operator precedence from the hierarchy. For example, AdditiveExpression expands to MultiplicativeExpression, which means multiplication has higher precedence than addition. An expression like 1 + 2 * 3 therefore means 1 + (2 * 3). A full list of operator precedence levels in ES6 and beyond can be found here.

For ES1 it’s just a subset of what’s listed:

const operatorPrecedence = {
    '||': 0,
    '&&': 1,
    '|': 2,
    '^': 3,
    '&': 4,
    '==': 5,
    '!=': 5,
    '<': 6,
    '>': 6,
    '<=': 6,
    '>=': 6,
    '<<': 7,
    '>>': 7,
    '>>>': 7,
    '+': 8,
    '-': 8,
    '*': 9,
    '/': 9,
    '%': 9
};

Knowing the operator precedence levels, one can use an algorithm called precedence climbing to deal with the 1.1 + 2e2 + 3^2 problem from the previous section. There are other algorithms, such as the shunting-yard algorithm, but I went with precedence climbing as it seems simpler.

Luckily, in ES1 all binary operators are left-associative, so the part of precedence climbing that deals with right associativity can be dropped. In ES7, though, the exponentiation operator (**) introduces a right-associative operator into the picture.

My implementation of precedence climbing can be found in the parseBinaryExpression function. The parts dealing with ast and lhs can be ignored for now; they’re discussed in the next section.
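
One detail worth noting in the code below: left associativity falls out of recursing with precedenceLevel + 1, which prevents the recursive call from consuming operators of equal precedence. Supporting a right-associative operator would roughly mean recursing at the same level instead, along these lines (a sketch, not part of my parser):

// Hypothetical: for a right-associative operator, recurse at the same
// precedence level so the right-hand side claims equal-precedence operators
const right = this.parseBinaryExpression(precedenceLevel);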

// Uses precedence climbing to deal with binary expressions, all of which have
// left-to-right associativity in this case.
p.parseBinaryExpression = function(minPrecedence) {
    const punctuators = [
        '||', '&&', '|', '^', '&', '==', '!=', '<', '>', '<=', '>=',
        '<<', '>>', '>>>', '+', '-', '*', '/', '%'
    ];

    const result = this.parseUnaryExpression();
    let ast = result.ast;
    let lhs = result.lhs;

    while (this.matchPunctuators(punctuators) &&
           operatorPrecedence[this.next().value] >= minPrecedence) {

        // If any operator is encountered, then the result cannot be
        // LeftHandSideExpression anymore
        lhs = false;

        const precedenceLevel = operatorPrecedence[this.next().value];
        const operatorToken = this.expectPunctuators(punctuators);

        const right = this.parseBinaryExpression(precedenceLevel + 1);
        if (operatorToken.value === '||' || operatorToken.value === '&&') {
            ast = new estree.LogicalExpression(operatorToken.value, ast, right.ast);
        }
        else {
            ast = new estree.BinaryExpression(operatorToken.value, ast, right.ast);
        }
    }

    return {
        ast: ast,
        lhs: lhs
    };
};
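
To see how this untangles the earlier example, here’s a rough trace of the calls for 1.1 + 2e2 + 3 ^ 2 (a sketch of the recursion, not actual parser output):

parseBinaryExpression(0)
  parses 1.1, sees + (precedence 8 >= 0), recurses with minPrecedence 9
    parseBinaryExpression(9) parses 2e2, sees + (8 < 9), returns
  ast = (1.1 + 2e2), sees + (8 >= 0), recurses with 9
    parseBinaryExpression(9) parses 3, sees ^ (3 < 9), returns
  ast = ((1.1 + 2e2) + 3), sees ^ (3 >= 0), recurses with 4
    parseBinaryExpression(4) parses 2, reaches the end, returns
  ast = (((1.1 + 2e2) + 3) ^ 2)

The low-precedence ^ keeps getting rejected by the inner calls, so it’s eventually consumed by the outermost call, which is exactly the split that stumped the purely grammar-directed approach.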

2.4: ConditionalExpression vs LeftHandSideExpression

Another challenge in parsing JavaScript comes up when the parser needs to parse an AssignmentExpression. The parser needs to decide whether it’s parsing a ConditionalExpression or a LeftHandSideExpression followed by other stuff. The grammar is shown below:


AssignmentExpression :
    ConditionalExpression
    LeftHandSideExpression AssignmentOperator AssignmentExpression

The problem is that ConditionalExpression can also expand to just a LeftHandSideExpression, so you can’t simply put an if-statement there and call it done.

I solved this by delaying the decision until later. When parsing AssignmentExpression, there is no if-statement up front; the code simply attempts to parse a ConditionalExpression immediately. Whether that ConditionalExpression expands into a LeftHandSideExpression is decided later and passed back up the chain. This way the parser doesn’t need to backtrack by trying to parse one production, discarding the result on failure, and trying another.
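
A hypothetical sketch of the idea, simplified to the plain = operator (the real parseAssignmentExpression handles all the AssignmentOperators, and the estree.AssignmentExpression constructor is assumed to mirror the other node constructors):

p.parseAssignmentExpression = function() {
    // No decision up front: just parse a ConditionalExpression
    const result = this.parseConditionalExpression();

    // The lhs flag passed back up tells us whether the result was really
    // just a LeftHandSideExpression, making an assignment legal
    if (result.lhs && this.matchPunctuators(['='])) {
        this.expectPunctuators(['=']);
        const right = this.parseAssignmentExpression();
        return {
            ast: new estree.AssignmentExpression('=', result.ast, right.ast),
            lhs: false
        };
    }
    return result;
};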

Whether the ConditionalExpression is actually a lone LeftHandSideExpression is revealed when parsing PostfixExpression: if the parser sees a LeftHandSideExpression followed by ++ or --, then it’s not a lone LeftHandSideExpression.


PostfixExpression :
    LeftHandSideExpression
    LeftHandSideExpression [no LineTerminator here] ++
    LeftHandSideExpression [no LineTerminator here] --
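
A hypothetical sketch of how that reveal could look (the helper names and the estree.UpdateExpression constructor are my assumptions, not necessarily what src/parser.js does):

p.parsePostfixExpression = function() {
    let lhs = true;
    let ast = this.parseLeftHandSideExpression();

    // A trailing ++ or -- means this is no longer a lone
    // LeftHandSideExpression
    if (this.matchPunctuators(['++', '--'])) {
        lhs = false;
        const operatorToken = this.expectPunctuators(['++', '--']);
        ast = new estree.UpdateExpression(operatorToken.value, ast, false);
    }

    return { ast: ast, lhs: lhs };
};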

2.5: NewExpression vs CallExpression vs MemberExpression

A very similar challenge comes up when parsing LeftHandSideExpression, which can expand to either NewExpression or CallExpression. I applied the same solution of delaying the decision, as you can see in the declaration of parseNewOrCallOrMemberExpression. There is no decision making in parseLeftHandSideExpression at all.

p.parseNewOrCallOrMemberExpression = function(couldBeNewExpression, couldBeCallExpression) {
    // ... body elided; see src/parser.js ...
};

p.parseLeftHandSideExpression = function() {
    return this.parseNewOrCallOrMemberExpression(true, true).object;
};

In this case, the distinction between NewExpression and MemberExpression, which is made later on, hinges on encountering Arguments. If Arguments is found, then it is a MemberExpression and not a NewExpression. Similarly, if after parsing the MemberExpression the parser encounters more Arguments, then it’s a CallExpression.
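
Here is a greatly simplified, hypothetical reconstruction of that logic; property access via . and [] is elided, and the helper names are assumptions, so see src/parser.js for what the function actually does:

p.parseNewOrCallOrMemberExpression = function(couldBeNewExpression, couldBeCallExpression) {
    let object;
    if (couldBeNewExpression && this.matchKeywords('new')) {
        this.expectKeywords('new');
        // The operand of 'new' may itself contain nested 'new', but not calls
        const result = this.parseNewOrCallOrMemberExpression(true, false);
        // Arguments right after decide MemberExpression vs NewExpression
        const args = this.matchPunctuators(['(']) ? this.parseArguments() : [];
        object = new estree.NewExpression(result.object, args);
    } else {
        object = this.parsePrimaryExpression();
    }
    // Any further Arguments turn the whole thing into a CallExpression
    while (couldBeCallExpression && this.matchPunctuators(['('])) {
        object = new estree.CallExpression(object, this.parseArguments());
    }
    return { object: object };
};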

Interesting Things About ES1

I noticed some interesting things after I finished the parser, when I tried to parse real-world code and ran into issues.

Back in 1997, JavaScript didn’t support regular expressions. They’re so widely used in source code nowadays that their absence prevented my parser from processing real-world code.

Another difference is that there is no such thing as FunctionExpression. ES1 has FunctionDeclaration but not the expression equivalent. All those var a = function(){}; statements are not valid syntax! No lambdas for you in 1997!

This omission was rectified by ES3 in 1999, though. The funny thing is that in 1999, I think I had just advanced from ESL to regular English class. I wasn’t even concerned about programming back then, let alone lambdas.

Implement a JSX Extension

At my company Kash, we implemented our online checkout interface in ReactJS, so it’s of interest to me. ReactJS makes use of JSX files, which contain this weird JavaScript syntax mixed in with HTML tags.

An example JSX file looks like this:

// Using JSX to express UI components.
var dropdown =
  <Dropdown>
    A dropdown list
    <Menu>
      <MenuItem>Do Something</MenuItem>
      <MenuItem>Do Something Fun!</MenuItem>
      <MenuItem>Do Something Else</MenuItem>
    </Menu>
  </Dropdown>;

render(dropdown);

Once I finished implementing my parser, I implemented an incomplete version of the JSX extension to get a feel for it. Parsers really are driving a ton of the things we use today. My JSX implementation is in the jsx branch. var a = <div></div>; now produces an Abstract Syntax Tree! We’ve now time-travelled to 2013!

Just having a parser that understands JSX files isn’t enough, though. ReactJS has to take the AST and convert it to proper ECMAScript to be executed in the browser. In the next section, where I examine minifiers, I cover the task of transforming the AST to output different JavaScript source code.

Implement a Minifier

I use UglifyJS to minify JavaScript code before deployment. A JavaScript minifier removes whitespace, removes newlines, and does lots of other things. But how does one go about implementing a minifier?

Well, once the parser finishes parsing the source code, it produces an AST. My parser spits out an AST in the ESTree format, as do the Esprima and Acorn parsers. UglifyJS, on the other hand, has its own format from its parser, but it can import the ESTree format for processing.

One of the things a minifier will do is shorten function names, because to a JavaScript engine our veryMaintainableFunctionName isn’t meaningful. To achieve this renaming, we first need to figure out what functions there are, which means processing the AST starting from the root Program node.

The ESTree format specifies that the Program node is identified by type = 'Program' and has a body property that contains an array of statements. Our first task is to find the function declarations, and we can do this simply by looping over the body array, as in the following code:

if (node.type === 'Program') {
    // First, go through all statements in the source code to gather function
    // names
    const newScope = {}; // mapping of old function name to new function name
    for (let stmt of node.body) {
        if (stmt.type === 'FunctionDeclaration') {
            // FunctionDeclaration is specified in ESTree to have an 'id'
            // property, which is an Identifier with a 'name' property.
            newScope[stmt.id.name] = stmt.id.name;
        }
    }
}
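
With the names gathered, we need short replacements to map them to. Below is a hypothetical sketch of generating them (a, b, ..., z, aa, ab, ...); UglifyJS’s actual naming scheme may well differ:

// Hypothetical generator for short names: a, b, ..., z, aa, ab, ...
function shortName(index) {
    const letters = 'abcdefghijklmnopqrstuvwxyz';
    let name = '';
    do {
        name = letters[index % 26] + name;
        index = Math.floor(index / 26) - 1;
    } while (index >= 0);
    return name;
}

// Remap each collected function name to a generated short one
let counter = 0;
for (let oldName of Object.keys(newScope)) {
    newScope[oldName] = shortName(counter++);
}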

With the mapping in place, turning veryMaintainableFunctionName into just a, the minifier then needs to find all instances of CallExpression and, instead of printing out the original function Identifier being referenced, swap in our substitution. Below is a quick example, though it doesn’t deal with more than two scopes:

if (node.type === 'Identifier') {
    let scope = scopes[scopes.length - 1];
    // When we can't find the variable in the scope, look in the parent scope!
    if (!scope[node.name]) {
        scope = scopes[scopes.length - 2];
    }
    minified += scope[node.name]; // Substitute function name being referenced
}

There is of course a lot more to a minifier. A good one probably also makes use of JavaScript’s automatic semicolon insertion rules if the goal is to optimize for file size.

Implement a Linter

Implementing a linter similarly requires processing the AST to intelligently figure out what’s in a program’s source code. For example, I use ESLint at work and typically turn on the no-var rule for ES6. Doing so gives me a warning whenever I use var instead of the new ES6 constructs const and let.

Implementing such a warning is fairly simple: the linter needs to disallow VariableDeclaration and only allow the use of LexicalDeclaration in ES6. Again, there’s lots more to a linter than this simplistic example, but it all makes sense once you know how the AST is being used.
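
As a rough sketch of the idea (this walks ESTree-shaped nodes directly, which is not how ESLint actually structures its rules): in the ESTree format, var and let/const declarations share the VariableDeclaration node type and are distinguished by the kind property, so the check boils down to:

// Hypothetical no-var check over an ESTree AST: recursively visit every
// node and flag declarations whose kind is 'var'
function checkNoVar(node, warnings) {
    if (node === null || typeof node !== 'object') {
        return;
    }
    if (node.type === 'VariableDeclaration' && node.kind === 'var') {
        warnings.push('Unexpected var, use let or const instead.');
    }
    for (let key of Object.keys(node)) {
        const child = node[key];
        if (Array.isArray(child)) {
            child.forEach(n => checkNoVar(n, warnings));
        } else {
            checkNoVar(child, warnings);
        }
    }
}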

Khan Academy’s Live Editor

Khan Academy’s Computer Science and Computer Programming subjects feature an online coding environment that gives you helpful hints as you code. I was quite impressed when I first tried it. Curious about how it works and what parsers it uses, I did a bit of research.

The online text editing environment is called Live Code Editor and uses JSHint as the linter for providing helpful messages in JavaScript code. It also uses a fork of Slowparse for parsing HTML.

Very cool! Parsers are all around us.

What’s Next

There is so much more one can learn in this area:

  - The different parsing techniques, such as the TDOP algorithm used by JSLint.
  - The automatic semicolon insertion rules, which I didn’t implement.
  - Things in the spec I have yet to handle, like restricting a line terminator from appearing inside a PostfixExpression.
  - Speed. Since these parsers execute in the browser, performance matters; the Esprima website has a speed comparison of different parsers.
  - Early Errors, something in ES6 that I don’t know much about yet.
  - Morphing my ES1 parser by adding the newer standards, to get a sense of how challenging each new language addition is to implement.
  - The various parser generators, like Parsec, which is mentioned in the Real World Haskell book.

Reading through the standard for all the different versions would be a nice thing to do too. For example, section 4.2.1 in the ES1 standard doc talks about prototypical inheritance and its behaviour.

That said, I’m satisfied with my little learning project for now. There are lots of other things to learn, and I’ve got my eyes set on my next target.