Writing a pseudocode compiler (4) – Generating a parser

Using a “parser generator” such as bison means that the framework for creating this key part of the front-end is more rigidly imposed than if it were written from scratch. On the other hand, it also means that some of the hard work is already done for us. If the compiler back-end has to output assembly language or object code, the temptation would be to rewrite the compiler in its own input language. (This process is known as “bootstrapping”, and many professionals would only take a programming language (and its compiler) seriously if such a step is feasible.) However, it may be desirable to keep the original “bootstrap” compiler up to date as the language is developed further.

The bison program “grammar.y”, written for this project, is quite large at just under 700 lines, so documenting every part in detail is beyond the scope of this mini-series. However, we’ll pick out a few key parts to describe; the remainder should be understandable from examining the file itself.

The configuration part of the bison input file is similar to that used previously when creating the calculator program. The options are all described in the bison manual, and several of those we have used are specific to creating C++ parsers:

%require "3.7"
%language "c++"
%defines "Parser.hpp"
%output "Parser.cpp"

%define api.value.type variant
%define api.token.raw
%define api.parser.class { Parser }
%define parse.error detailed
%define api.location.file "Location.hpp"
%parse-param { Lexer *lexer } { Symbol *table } { Tree *prev }

The use of variant was described in the previous article, while api.token.raw can be used because we never return literal characters from the scanner. The class name is set to “Parser”, and %define parse.error detailed enables more useful error messages than simply “syntax error”. Using %locations changes the signature of the scanner function called by the parser: it takes a pointer to the token’s semantic value as its first parameter, and a pointer to the token’s associated location object as its second. Three parameters are passed to the Parser object’s constructor: a pointer to a previously initialized scanner (which has a member function called lex() with the signature just described), a pointer to the (global) symbol table, and a pointer to the head node of what will become the complete parse tree (in the output code, this node is where the definitions of all JavaScript variables not local to any JavaScript function are placed).

Next come all of the token declarations, some with associated (printable) values which are used when reporting a syntax error. Some of these tokens have a semantic type associated with them, which the scanner assigns; this completes the contract between the parser and the scanner. All tokens (with or without printable forms or associated semantic types) are exported as a (plain) enum to the “Parser.hpp” file so that the scanner which uses them can return them as lex()’s (integer) return code:

%token  ENDOFFILE 0
        EOL "newline"
        ASSIGN  "<-"
        PLUS "+"
        MINUS "-"
// many more follow...

%token <std::string>    ID SUBID UNKNOWN;
%token <StringT>        STRING;
%token <RealT>          FLOAT;
%token <IntT>           INTEGER;
%token <BoolT>          BOOLEAN;

Tokens form the “terminal” symbols of an input grammar; together with so-called “non-terminal” symbols they are matched (or reduced) by rules to produce further non-terminals (called productions). For example, a SUBROUTINE may (eventually) be reduced to a single Tree* pointer, which is the head node of its complete (abstract syntax tree) description. The non-terminals which need a semantic type are declared with the %nterm keyword, followed by a known C++ type in angle brackets and then the names of one or more non-terminals with this type:

%nterm <Tree*>				Block Stmnts Stmnt If Else While Repeat For Sub
%nterm <Expression*>			Exp BoolExp
%nterm <std::string>			SubId SubCall ForId ForInId
%nterm <std::pair<std::string,ExpI>>	Field Param
%nterm <std::pair<Expression*,Tree*>>	ElseIf
%nterm <std::vector<std::pair<Expression*,Tree*>>>				ElseIfs
%nterm <std::vector<std::pair<std::string,ExpI>>>				Fields Params
%nterm <std::vector<Expression*>>						Array Args
%nterm <std::vector<std::variant<Expression *,std::vector<Expression*>>>>	Array2
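As an illustration of how these typed non-terminals are used, a hypothetical rule for building the Args list might look like the following (this fragment is not taken verbatim from grammar.y; the COMMA token and the exact action code are assumptions):

```
Args    : Exp              { $$ = std::vector<Expression*>{ $1 }; }
        | Args COMMA Exp   { $1.push_back($3); $$ = std::move($1); }
```

With the variant API, $1 here really is a std::vector<Expression*>, so appending to it and moving it into $$ is the natural (and cheap) way to grow the list as each reduction fires.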

Operator precedence follows the pattern described previously for the calculator example, with lower-precedence operators first and equal-precedence operators grouped on the same line:

%left OR
%left AND
%right NOT
%nonassoc GT LT EQ NE GE LE
%right UMINUS

The grammar rules follow the first %%, and the first rule encountered is taken to define the “start symbol”, the rule to which the complete input is ultimately reduced. Rules can be defined recursively, and can refer to each other in any order. We’ve said that a Program is a number of Stmnts (statements) or an end-of-file (reaching end-of-file without encountering an error causes the parser function to return zero, indicating success). Stmnts can be a single Stmnt, or some other Stmnts followed by a Stmnt, or an error state which causes the parser function to return non-zero, for failure. (Functional programmers will be right at home with this recursive syntax; everyone else can just treat it as boilerplate for “one or more statements”.) A Block is either empty or Stmnts, something which might be considered a superfluous definition (why not simply allow Stmnts itself to be empty?). In fact, defining it this way was found to be necessary to avoid conflicts, which arise where bison finds more than one way to interpret mutually dependent rules. (The grammar fragment below compiles with zero shift-reduce or reduce-reduce conflicts, which is ideal.)

Program : ENDOFFILE
        | Stmnts

Stmnts  : Stmnts Stmnt
        | Stmnt
        | error { YYABORT; } /* return on first error encountered */

Block   : Stmnts
        | %empty { $$ = nullptr; }

Stmnt   : EOL { $$ = new Empty; prev->link($$); prev = $$; }
        | UNKNOWN { error(@1, "unrecognized input: " + $1); YYERROR; }
        | // many more rules for Stmnt

The semantic value of a non-terminal production is referred to as $$, while the semantic values of the matched grammar elements are called $1, $2… The default action, applied if none is explicitly provided, is $$ = $1. The rule for EOL matches a blank line (possibly with preceding whitespace, or itself a comment) and sets the production’s value to a new Empty (our minimalist implementation of a type derived from the Tree abstract base class). The pointer prev is used to link() this node in, and is then itself updated, ready for the next Stmnt encountered. The rule for UNKNOWN simply prints an error message together with the location (line number and offset) and terminates the parsing process; its semantic type is std::string, and it is set (by the scanner) to whatever garbage input was encountered.

It is important to note that for a new Block, the Tree* of the first Stmnt will become its semantic value; this is both deliberate and necessary. (Blocks are found as parts of WHILE-loops, IF-expressions and SUBROUTINE definitions, and in many other places.) In fact, the way bison works is to build up productions starting at the leaf nodes and working up to the head node; this is called bottom-up parsing.
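Concretely, for an input of three statements the reductions fire in this order — a worked trace of the rules above, leaves first:

```
input:    Stmnt1 Stmnt2 Stmnt3
reduce:   Stmnt1          → Stmnts    (rule: Stmnts : Stmnt)
reduce:   Stmnts Stmnt2   → Stmnts    (rule: Stmnts : Stmnts Stmnt)
reduce:   Stmnts Stmnt3   → Stmnts    (rule: Stmnts : Stmnts Stmnt)
reduce:   Stmnts          → Block     (rule: Block : Stmnts)
```

Notice that the first reduction to fire involves Stmnt1, which is why the semantic value of the first Stmnt propagates up to become the Block’s value.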

Next time we’ll continue looking at “grammar.y” and how the parser rules can call C++ code in order to usefully convert productions into parts of the parse tree (another name for the abstract syntax tree).
