This article takes a look at the only part of the compiler which directly processes the source text (or “source code”): the “scanner”. A fair amount of theory lies behind the pattern-matching actions of this part of a compiler, but unless you need to implement one from scratch (which can be the case for commercial-grade compilers) you can safely follow best practice and employ a “scanner generator”.
The purpose of a (hand-written or semi-auto-generated) scanner is to convert (or reduce) textual patterns into a stream of numerical tokens. It really is as simple as that. (Well, almost!) A working knowledge of regular expressions (“regex”) is really a prerequisite, although patterns for common usages, such as the floating-point representation of numbers, can be found and reused without needing to produce them off the top of your head. The textual patterns which are matched by the regexes have a special name: lexemes.
We said “almost”. In the case of a lexeme that has a context, such as being a floating-point number or the name of a variable, a way must be found of passing both the token and its semantic value to the parser. (It’s no good the scanner just saying: “Hey, I found a number”; it needs to say: “I just found a number with value n.”) Any token can therefore have exactly zero or one associated semantic values. The flex-generated scanner function can pass the textual representation of the matched lexeme (as yytext, or YYText() in C++ scanners) back to the (in our case, bison) parser routine which invoked it. More usefully, as the generated scanner is highly configurable in its own right, this semantic value can be converted to a built-in type: double for a floating-point number, for example. The technicalities of how this is achieved (C and C++ being statically typed) are discussed later.
Consider the following program, and its decomposition into lexemes (some with associated semantic values):
CONSTANT PI <- 3.14
radius <- STRING_TO_REAL(USERINPUT)
OUTPUT REAL_TO_STRING(PI * radius * radius)
⟨constant⟩ ⟨id,PI⟩ ⟨assign⟩ ⟨float,3.14⟩ ⟨\n⟩ ⟨id,radius⟩ ⟨assign⟩ ⟨string_to_real⟩ ⟨userinput⟩ ⟨)⟩ ⟨\n⟩ ⟨output⟩ ⟨real_to_string⟩ ⟨id,PI⟩ ⟨*⟩ ⟨id,radius⟩ ⟨*⟩ ⟨id,radius⟩ ⟨)⟩ ⟨\n⟩
The name of the token shown in the token stream representation is just the pseudocode keyword in emboldened lower case; this is purely to aid the (human) reader. In reality these tokens would just be numbers (above 255; any lower value would be a single ASCII character, with zero indicating end-of-file). Where a token has an associated semantic value, this is grouped into the lexeme using a comma to separate it from the token name. Notice that there is no left-parenthesis token after STRING_TO_REAL and REAL_TO_STRING: the parenthesis is treated as an indivisible part of these keywords, which are matched as STRING_TO_REAL( and REAL_TO_STRING(.
Of course, the numerical token values need to be identical as seen by both the scanner and the parser, and this is accomplished by having the scanner #include a header file generated by bison (which is also intended for use with the parser). (This header file is required in order to compile the scanner output file, but not to generate it.) To allow the scanner to encode the semantic type, bison provides %define api.value.type variant as an option. (Note: this is not C++17’s std::variant by another name; it is backward compatible with earlier C++ Standards.)
(Note: In the full version we’ve used %define api.token.raw and %locations to set up the contract for the way in which the scanner method is invoked and how tokens and semantic values are returned. This will be covered in more detail in a future article.)
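For reference, the corresponding declarations in the bison input file would be along these lines (a sketch only; the full “grammar.y” is the subject of the next article):

```
%define api.value.type variant
%define api.token.raw
%locations
```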
The relevant fragment of the scanner definition file “scanner.l” to accomplish the above is as follows:
%{
#include <cstdlib> /* for strtod() */
#include "Parser.hpp"
#include "Lexer.hpp"
#define YY_DECL int yy::Lexer::lex(Parser::semantic_type *value)
using namespace yy;
%}
%option noyywrap c++ outfile="Lexer.cpp"
ws [ \t]+
letter [A-Za-z]
underscore _
digit [0-9]
id {letter}({underscore}|{letter}|{digit})*
float {digit}+(\.{digit}+)?([eE][+-]?{digit}+)?
newline \r\n|\n\r|\n
%%
{ws} /* no action and no return */
{newline} return '\n';
"*"|")" return YYText()[0];
"<-" return Parser::token::ASSIGN;
"CONSTANT" return Parser::token::CONSTANT;
"OUTPUT" return Parser::token::OUTPUT;
"USERINPUT" return Parser::token::USERINPUT;
"STRING_TO_REAL(" return Parser::token::STRING_TO_REAL;
"REAL_TO_STRING(" return Parser::token::REAL_TO_STRING;
{float} value->emplace<RealT>(strtod(YYText(), nullptr)); return Parser::token::FLOAT;
{id} value->emplace<std::string>(std::string{ YYText(), YYText() + YYLeng() }); return Parser::token::ID;
It is assumed that the header file “Parser.hpp” provides the definition of yy::Parser::semantic_type. This type has a member function emplace<>() which accepts a template type parameter and a value parameter of the same type; the value is then stored by the parser as that particular token’s semantic value. (Older C++ code and C use an unsafe C-style union field to accomplish the same task.) The token numbers themselves are listed in a C-style (plain) enum called Parser::token, which is auto-generated from the bison input file. The last two lines of the fragment above are the only ones which have to calculate a semantic value from YYText(); this is done using library calls.
The actual “scanner.l” used for this project is quite a bit longer than the fragment shown above, but does not contain any different (or additional) syntax, other than support for location tracking. I would suggest studying it, and, as a useful exercise, possibly writing a program based upon the fragment shown which produces a stream of lexemes similar to that shown previously. (It is possible to add printf/cout statements to the rules, and raw output of YYText() is possible; storing the semantic value is not strictly necessary in order to simply output it.) Next time we’ll start looking at the bison parser generator, and its input file “grammar.y”, in detail.