Generating C++ programs with flex and bison (1)

This mini-series of articles describes the use of GNU flex and bison when C++ output is desired, which has been possible for some time. The flex utility is a scanner generator while bison is a parser generator; each utility outputs compilable source code from a user-defined input file containing configuration options, syntax specific to the utility, and C++ code fragments. Simply put, the (generated) parser program is meant to call the (generated) scanner repeatedly; the scanner translates its input into a sequence of tokens, each with an optional semantic value. The parser checks this token stream for valid grammatically correct sequences, which causes it to perform actions using the semantic values. Invalid sequences trigger a “syntax error” within the parser.

Despite the fact that it is possible to create C++ scanner and parser classes with flex and bison, much online advice seems to suggest merely creating a C scanner and/or C parser, and then compiling this output with a C++ compiler in order to be able to use (unions with fields of) std::string* etc. for semantic types. However this first article describes the use of flex in its C generation mode with bison generating C++. (If you really want to know how to interface a C++ bison class with a C++ flex class straightaway, then skip to part 3, which also has links to downloads of the resources used. However the reason I cover this later is because it involves creation of a supplementary custom-written header file, and is a less “clean” method in my view.)

We’ll start with the scanner or lexer, code for which is shown here comprising the file lexer1.l (apologies for syntax highlighting not being available for this filetype):

%{
#include <cstdlib>
#include "Parser1.hpp"
using namespace calc;
%}

%option reentrant interactive noyywrap noyylineno nodefault outfile="Scanner1.cpp" header="Scanner1.hpp"

dseq            ([[:digit:]]+)
dseq_opt        ({dseq}?)
frac            (({dseq_opt}"."{dseq})|{dseq}".")
exp             ([eE][+-]?{dseq})
exp_opt         ({exp}?)
integer         ({dseq})
float           (({frac}{exp_opt})|({dseq}{exp}))
intvar          ([[:upper:]])
fltvar          ([[:lower:]])

%%

{integer}   yylval->emplace<long long>(strtoll(yytext, nullptr, 10)); return Parser::token::INT;
{float}     yylval->emplace<double>(strtod(yytext, nullptr)); return Parser::token::FLT;
{intvar}    yylval->emplace<char>(yytext[0]); return Parser::token::INTVAR;
{fltvar}    yylval->emplace<char>(yytext[0]); return Parser::token::FLTVAR;
"+"         return Parser::token::PLUS;
"-"         return Parser::token::MINUS;
"*"         return Parser::token::MULTIPLY;
"/"         return Parser::token::DIVIDE;
"%"         return Parser::token::MODULO;
"!"         return Parser::token::FACTORIAL;
"^"         return Parser::token::EXPONENT;
"("         return Parser::token::LPAREN;
")"         return Parser::token::RPAREN;
"="         return Parser::token::ASSIGN;
\n          return Parser::token::EOL;
<<EOF>>     return Parser::token::YYEOF;
.           /* no action on unmatched input */

%%

int main() {
    yyscan_t scanner;
    yylex_init(&scanner);
    calc::Parser parser{ scanner };
    std::cout.precision(10);
    parser.parse();
    yylex_destroy(scanner);
}

Lines 2-4 #include a system header needed by two of the rules and the header that bison will generate (Parser1.hpp), again needed by the rules. The parser class will live in namespace calc so to save a bit of typing we use a using-directive. This part is copied to the implementation file (Scanner1.cpp), not the header (Scanner1.hpp); note that we’ve used C++ file extensions even though the generated code is pure C.

Line 7 lists all our flex configuration options which are best specified here, instead of on the command line. Of particular interest is reentrant, which causes no global state to be used, with the caveat that the yylex() function (which is the only useful interface a flex scanner provides) cannot be called until initialization has been performed by yylex_init(). Multiple scanners can be used, possibly with multiple parser instances, and we’ve used it here because a parser instance relying on a global function with global state is not good C++ style.

All of the patterns we are matching against are defined in lines 9-18, and some knowledge of regular expressions (regex) comes in handy here. In summary, integer is defined as one or more decimal digits, intvar as a single upper-case letter and fltvar as a single lower-case letter. The definition of float is more complex as it has to deal with the optional decimal point and exponent.

The rules at lines 21-37 come after the mandatory %% separator and perform actions when the specified pattern is matched. The longest match wins, in the case of there being more than one, thus “123.456” matches float, not integer, while “123” always matches integer because that rule is first. The simple tokens do not have a semantic value so we just return an int for these rules, being the value in the public enum from Parser1.hpp named Parser::token. The syntax for registering a semantic value needed for rules integer and float is ugly, but is type-safe; yylval is the name of the pointer passed to yylex() by the parser, and has a member function which takes the form emplace<Type>(Value). The pointer yytext points to a NUL-terminated character string which is the matched input.

Lines 41-48 after the second %% separator are our main() program. This initializes the scanner in order for the parser to be able to call yylex(), which (being a “pull” parser) it will do repeatedly. We do in fact pass the yyscan_t-typed variable to the Parser class constructor, and the parser needs to be configured to use this correctly, as we shall see. The signature of yylex() is not configured here; by default it is int yylex(void), however in our case it is int yylex(calc::Parser::semantic_type *yylval, yyscan_t) as we are using a re-entrant scanner and need to return the variable yylval to the parser. Reading of input is begun by the call parser.parse() (declared in Parser1.hpp), and this member of the Parser class is only called once (being a “pull parser”).

This code is processed to produce Scanner1.hpp and Scanner1.cpp using the command:

$ flex lexer1.l

Or under Windows:

> win_flex --wincompat lexer1.l

However Scanner1.cpp cannot be compiled until Parser1.hpp has been generated. We’ll cover that in the next part in this mini-series.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s