Writing a pseudocode compiler (3) – Generating a scanner

This article takes a look at the (only) part of the compiler which directly processes the source text (or “source code”): the “scanner”. Some amount of theory lies behind the pattern-matching actions of this part of a compiler; however, unless you need to implement one from scratch (as can be the case for commercial-grade compilers), you can safely follow best practice by employing a “scanner generator”.

The purpose of a (hand-written or semi-auto-generated) scanner is to convert (or reduce) textual patterns into a stream of numerical tokens. It really is as simple as that. (Well, almost!) A working knowledge of regular expressions (“regex”) is really a prerequisite, although patterns for common usages, such as floating-point representation of numbers, can be found and utilized without the need to produce them off the top of your head. The textual patterns which are matched by the regex(es) have a special name: lexemes.

We said “almost”. In the case of a lexeme that carries a context, such as being a floating-point number or the name of a variable, a way must be found of passing both the token and its semantic value to the parser. (It’s no good the scanner just saying: “Hey, I found a number”; it needs to say: “I just found a number with value n.”) Any token therefore has either zero or one associated semantic value. The flex-generated scanner function can pass the textual representation of the matched lexeme (as yytext, or YYText() in C++ scanners) back to the (in our case, bison) parser routine which invoked it. More usefully, as the generated scanner is highly configurable in its own right, this semantic value can be converted to a built-in type, double for a floating-point number, for example. The technicalities of how this is achieved (C and C++ being statically typed) are discussed later.

Consider the following program, and its decomposition into lexemes (some with associated semantic values):

CONSTANT PI <- 3.14
radius <- STRING_TO_REAL(USERINPUT)
OUTPUT REAL_TO_STRING(PI * radius * radius)

⟨constant⟩ ⟨id,PI⟩ ⟨assign⟩ ⟨float,3.14⟩ ⟨\n⟩ ⟨id,radius⟩ ⟨assign⟩ ⟨string_to_real⟩ ⟨userinput⟩ ⟨)⟩ ⟨\n⟩ ⟨output⟩ ⟨real_to_string⟩ ⟨id,PI⟩ ⟨*⟩ ⟨id,radius⟩ ⟨*⟩ ⟨id,radius⟩ ⟨)⟩ ⟨\n⟩

The name of the token shown in the token-stream representation is just the pseudocode keyword in emboldened lower case; this is purely to aid the (human) reader. In reality these tokens would just be numbers (above 255; any lower value would be a single ASCII character, with zero indicating end-of-file). Where a token has an associated semantic value, this is shown within the lexeme after the token name, separated by a comma. Notice there is no left-parenthesis token: the parenthesis is treated as indivisible from the keywords, as STRING_TO_REAL( and REAL_TO_STRING(.

Of course, the numerical token values need to be identical as seen by both the scanner and the parser, and this is accomplished by having the scanner #include a header file generated by bison (which is also intended for use with the parser). (This header file is required in order to compile the scanner output file, but not to generate it.) To allow the scanner to encode the semantic type, bison provides %define api.value.type variant as an option. (Note: this is not C++17’s std::variant by another name; it is backward-compatible with earlier C++ Standards.)
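For illustration, the corresponding declarations in the bison input file might look something like the following (a sketch only; the grammar file is covered properly in the next article, and the token and type names are the ones used by the scanner fragment later in this article):

```
%define api.value.type variant
%token <RealT>       FLOAT
%token <std::string> ID
%token ASSIGN CONSTANT OUTPUT USERINPUT STRING_TO_REAL REAL_TO_STRING
```

The angle-bracketed type after %token is what allows each named token to carry a typed semantic value.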

(Note: In the full version we’ve used %define api.token.raw and %locations to set up the contract for the way in which the scanner method is invoked and how tokens and semantic values are returned. This will be covered in more detail in a future article.)

The relevant fragment of the scanner definition file “scanner.l” to accomplish the above is as follows:

%{
#include "Parser.hpp"
#include "Lexer.hpp"
#define YY_DECL int yy::Lexer::lex(Parser::semantic_type *value)
using namespace yy;
%}

%option noyywrap c++ outfile="Lexer.cpp"

ws                      [ \t]+
letter                  [A-Za-z]
underscore              _
digit                   [0-9]
id                      {letter}({underscore}|{letter}|{digit})*
float                   {digit}+(\.{digit}+)?([eE][+-]?{digit}+)?
newline                 \r\n|\n\r|\n

%%

{ws}                    /* no action and no return */
{newline}               return '\n';
"*"|")"                 return YYText()[0];
"<-"                    return Parser::token::ASSIGN;
"CONSTANT"              return Parser::token::CONSTANT;
"OUTPUT"                return Parser::token::OUTPUT;
"USERINPUT"             return Parser::token::USERINPUT;
"STRING_TO_REAL("       return Parser::token::STRING_TO_REAL;
"REAL_TO_STRING("       return Parser::token::REAL_TO_STRING;

{float}                 value->emplace<RealT>(strtod(YYText(), 0)); return Parser::token::FLOAT;
{id}                    value->emplace<std::string>(std::string{ YYText(), YYText() + YYLeng() }); return Parser::token::ID;

It is assumed that the header file “Parser.hpp” provides definitions for yy::Parser::semantic_type. This type has a member function emplace<> which accepts a template type parameter and a value parameter of the same type. This value is then stored by the parser as that particular token’s semantic value. (Older C++ code and C use an unsafe C-style union field to accomplish the same task.) The token numbers themselves are listed as a C-style (plain) enum called Parser::token which is auto-generated from the bison input file. The last two lines of the fragment above are the only two which have to compute a semantic value from YYText(); this is done using library calls.

The actual “scanner.l” used for this project is quite a bit longer than the fragment shown above, but does not contain any different (or additional) syntax, other than support for location tracking. I would suggest studying it, and as a useful exercise possibly writing a program based upon the fragment shown, which produces a similar output of lexemes to that shown previously. (It is possible to add printf/cout statements to the rules, and raw output of YYText() is possible; storing the semantic value is not strictly necessary in order to simply output it.) Next time we’ll start looking at the bison parser generator, and its input file “grammar.y”, in detail.
