This mini-series of articles describes the use of GNU flex
and bison
when C++ output is desired, which has been possible for some time. The flex
utility is a scanner generator while bison
is a parser generator; each utility outputs compilable source code from a user-defined input file containing configuration options, syntax specific to the utility, and C++ code fragments. Simply put, the (generated) parser program is meant to call the (generated) scanner repeatedly; the scanner translates its input into a sequence of tokens, each with an optional semantic value. The parser checks this token stream for valid grammatically correct sequences, which causes it to perform actions using the semantic values. Invalid sequences trigger a “syntax error” within the parser.
Despite the fact that it is possible to create C++ scanner and parser classes with flex
and bison
, much online advice seems to suggest merely creating a C scanner and/or C parser, and then compiling this output with a C++ compiler in order to be able to use (unions with fields of) std::string*
etc. for semantic types. However this first article describes the use of flex
in its C generation mode with bison
generating C++. (If you really want to know how to interface a C++ bison
class with a C++ flex
class straightaway, then skip to part 3, which also has links to downloads of the resources used. However the reason I cover this later is because it involves creation of a supplementary custom-written header file, and is a less “clean” method in my view.)
We’ll start with the scanner or lexer, code for which is shown here comprising the file lexer1.l
(apologies for syntax highlighting not being available for this filetype):
%{
#include <cstdlib>
#include "Parser1.hpp"
using namespace calc;
%}
%option reentrant interactive noyywrap noyylineno nodefault outfile="Scanner1.cpp" header="Scanner1.hpp"
dseq ([[:digit:]]+)
dseq_opt ({dseq}?)
frac (({dseq_opt}"."{dseq})|{dseq}".")
exp ([eE][+-]?{dseq})
exp_opt ({exp}?)
integer ({dseq})
float (({frac}{exp_opt})|({dseq}{exp}))
intvar ([[:upper:]])
fltvar ([[:lower:]])
%%
{integer} yylval->emplace<long long>(strtoll(yytext, nullptr, 10)); return Parser::token::INT;
{float} yylval->emplace<double>(strtod(yytext, nullptr)); return Parser::token::FLT;
{intvar} yylval->emplace<char>(yytext[0]); return Parser::token::INTVAR;
{fltvar} yylval->emplace<char>(yytext[0]); return Parser::token::FLTVAR;
"+" return Parser::token::PLUS;
"-" return Parser::token::MINUS;
"*" return Parser::token::MULTIPLY;
"/" return Parser::token::DIVIDE;
"%" return Parser::token::MODULO;
"!" return Parser::token::FACTORIAL;
"^" return Parser::token::EXPONENT;
"(" return Parser::token::LPAREN;
")" return Parser::token::RPAREN;
"=" return Parser::token::ASSIGN;
\n return Parser::token::EOL;
<<EOF>> return Parser::token::YYEOF;
. /* no action on unmatched input */
%%
int main() {
yyscan_t scanner;
yylex_init(&scanner);
calc::Parser parser{ scanner };
std::cout.precision(10);
parser.parse();
yylex_destroy(scanner);
}
Lines 2-4 #include
a system header needed by two of the rules and the header that bison
will generate (Parser1.hpp
), again needed by the rules. The parser class will live in namespace calc
so to save a bit of typing we use a using-directive. This part is copied to the implementation file (Scanner1.cpp
), not the header (Scanner1.hpp
); note that we’ve used C++ file extensions even though the generated code is pure C.
Line 7 lists all our flex
configuration options which are best specified here, instead of on the command line. Of particular interest is reentrant
, which causes no global state to be used, with the caveat that the yylex()
function (which is the only useful interface a flex
scanner provides) cannot be called until initialization has been performed by yylex_init()
. Multiple scanners can be used, possibly with multiple parser instances, and we’ve used it here because a parser instance relying on a global function with global state is not good C++ style.
All of the patterns we are matching against are defined in lines 9-18, and some knowledge of regular expressions (regex) comes in handy here. In summary, integer
is defined as one or more decimal digits, intvar
as a single upper-case letter and fltvar
as a single lower-case letter. The definition of float
is more complex as it has to deal with the optional decimal point and exponent.
The rules at lines 21-37 come after the mandatory %%
separator and perform actions when the specified pattern is matched. The longest match wins, in the case of there being more than one, thus “123.456” matches float
, not integer
, while “123” always matches integer
because that rule is first. The simple tokens do not have a semantic value so we just return an int
for these rules, being the value in the public enum
from Parser1.hpp
named Parser::token
. The syntax for registering a semantic value needed for rules integer
and float
is ugly, but is type-safe; yylval
is the name of the pointer passed to yylex()
by the parser, and has a member function which takes the form emplace<Type>(Value)
. The pointer yytext
points to a NUL-terminated character string which is the matched input.
Lines 41-48 after the second %%
separator are our main()
program. This initializes the scanner in order for the parser to be able to call yylex()
, which (being a “pull” parser) it will do repeatedly. We do in fact pass the yyscan_t
-typed variable to the Parser
class constructor, and the parser needs to be configured to use this correctly, as we shall see. The signature of yylex()
is not configured here; by default it is int yylex(void)
, however in our case it is int yylex(calc::Parser::semantic_type *yylval,
yyscan_t)
as we are using a re-entrant scanner and need to return the variable yylval
to the parser. Reading of input is begun by the call parser.parse()
(declared in Parser1.hpp
), and this member of the Parser
class is only called once (being a “pull parser”).
This code is processed to produce Scanner1.hpp
and Scanner1.cpp
using the command:
$ flex lexer1.l
Or under Windows:
> win_flex --wincompat lexer1.l
However Scanner1.cpp
cannot be compiled until Parser1.hpp
has been generated. We’ll cover that in the next part in this mini-series.