Generating C++ programs with flex and bison (2)

In this second article we’ll get straight on with generating the C++ parser using bison. The source code is about twice the length of the scanner and is more complex in implementation, and it would be impossible to explain every part from scratch in one article. The main parts of the configuration to instruct bison to use a C++ parser class are covered. Interestingly the overall format is similar to flex‘s input file, being three sections separated by %% on a line by itself. Here is the bison input file grammar1.y:

%{
#include <iostream>
#include <string>
#include <cmath>
#include "Scanner1.hpp"
%}

%require "3.7.4"
%language "C++"
%defines "Parser1.hpp"
%output "Parser1.cpp"

%define api.parser.class {Parser}
%define api.namespace {calc}
%define api.value.type variant
%param {yyscan_t scanner}

%code provides
{
    #define YY_DECL \
        int yylex(calc::Parser::semantic_type *yylval, yyscan_t yyscanner)
    YY_DECL;
}

%token              EOL LPAREN RPAREN
%token <long long>  INT
%token <double>     FLT
%token <char>       INTVAR FLTVAR

%nterm <long long>  iexp
%nterm <double>     fexp

%nonassoc           ASSIGN
%left               PLUS MINUS
%left               MULTIPLY DIVIDE MODULO
%precedence         UMINUS
%precedence         FACTORIAL
%right              EXPONENT

%code
{
    namespace calc {
        long long ivars['Z' - 'A' + 1];
        double fvars['z' - 'a' + 1];
    
        long long factorial(long long n) {
            if (n < 2) {
                return 1;
            }
            else {
                return n * factorial(n - 1);
            }
        }
    } // namespace calc
} // %code

%%

lines   : %empty
        | lines line
        ;

line    : EOL                       { std::cerr << "Read an empty line.\n"; }
        | iexp EOL                  { std::cout << $1 << '\n'; }
        | fexp EOL                  { std::cout << $1 << '\n'; }
        | INTVAR ASSIGN iexp EOL    { ivars[$1 - 'A'] = $3; }
        | FLTVAR ASSIGN fexp EOL    { fvars[$1 - 'a'] = $3; }
        | error EOL                 { yyerrok; }
        ;

iexp    : INT                       { $$ = $1; }
        | iexp PLUS iexp            { $$ = $1 + $3; }
        | iexp MINUS iexp           { $$ = $1 - $3; }
        | iexp MULTIPLY iexp        { $$ = $1 * $3; }
        | iexp DIVIDE iexp          { $$ = $1 / $3; }
        | iexp MODULO iexp          { $$ = $1 % $3; }
        | MINUS iexp %prec UMINUS   { $$ = -$2; }
        | iexp FACTORIAL            { $$ = factorial($1); }
        | LPAREN iexp RPAREN        { $$ = $2; }
        | INTVAR                    { $$ = ivars[$1 - 'A']; }
        ;

fexp    : FLT                       { $$ = $1; }
        | fexp PLUS fexp            { $$ = $1 + $3; }
        | fexp MINUS fexp           { $$ = $1 - $3; }
        | fexp MULTIPLY fexp        { $$ = $1 * $3; }
        | fexp DIVIDE fexp          { $$ = $1 / $3; }
        | fexp EXPONENT fexp        { $$ = pow($1, $3); }
        | MINUS fexp %prec UMINUS   { $$ = -$2; }
        | LPAREN fexp RPAREN        { $$ = $2; }
        | FLTVAR                    { $$ = fvars[$1 - 'a']; }
        ;

%%

void calc::Parser::error(const std::string& msg) {
    std::cerr << msg << '\n';
}

The #includes at lines 2-5 are copied to the implementation file Parser1.cpp while the contents of the %code requires block at lines 20-22 is copied to the end of the header file Parser1.hpp. As we have seen, the scanner references this header file, and it is here that we control the signature of the generated yylex() function for the scanner implementation (recall that its default prototype is int yylex(void)). The pointer variable yylval is used when setting the semantic value of a token in the lexer.

Lines 8-16 configure the bison program to our needs meaning we don’t have to specify any command-line arguments. Using variant (not to be confused with C++17’s std::variant) is what generates the emplace<Type>(Value) methods used by the lexer. The line %param {yyscan_t scanner} is the “glue” that holds the main() function together as this parameter becomes both a member variable of calc::Parser and the second parameter used to call yylex() from within calc::Parser::parse.

Lines 25-38 declare the terminals, non-terminals and operators, each of which is given a unique integer token value (starting at 256). (The operators are listed in order of precedence low to high.) These are the same names as in the scanner, and generate fully qualified names such as calc::Parser::token::INT. Lines 40-55 contain C++ code which is copied verbatim to Parser1.cpp and consists of two array definitions and a support function. Lines 59-92 comprise the rules which reduce non-terminals via C++ code productions into values which can be output. Don’t be put off by the syntax $$, $1 etc., they are real type-safe variables that can be operated on like any other C++ variable. Importantly, the variable $$ is always the result of the production, which is saved for further use, while $N refers to the N’th token in the matched sequence.

The definition of the error() function after the second %% is necessary, and we could have put the main() function here too, or have put both in external source code files. This file can be processed using:

$ bison grammar1.y

Or under Windows:

> win_bison grammar1.y

We can now compile our generated C++ files in order to create a working program:

$ g++ -o calc1 Scanner1.cpp Parser1.cpp
$ ./calc1

Or under a Windows Visual Studio command prompt:

> cl /EHsc /FeCalc1.exe Parser1.cpp Scanner1.cpp
> Calc1

Both source files and a build script for Windows in a Zip file are available to download from part 3. If you’re interested in playing around with this program, here are a few ideas for features (in no particular order):

  • Check for divide-by-zero and output an error message
  • Conversion between integer and floating-point types
  • Allowing mixed-mode arithmetic
  • Outputting a prompt for user input, such as “? “
  • Set the output stream of the Parser class using a named member variable
  • Use of strings rather than characters for variable names
  • Access to built-in math functions log, sin etc.
  • Use of arbitrary-precision types instead of long long and double
  • Location tracking using bison‘s location type
  • Improved syntax error messages

In the final article of this mini-series we’ll look at the necessary changes to both source files in order to interface a C++ scanner class, still generated by flex, with a C++ parser class.

5 thoughts on “Generating C++ programs with flex and bison (2)”

  1. Hello there, I’m developer of winflexbison package.

    You mentioned a bug in the win_bison
    <<(Note: the version of win_bison.exe I compiled from source appears to have a small bug and would not accept the %header line;

    I checked bison original source code (v3.7.4) and didn't find any evidence of %header option (see parse-gram.y source file).

    As a current solution I propose to use `%defines "Parser.hpp"` that should work.

    If you know that unix bison accepts %header option, please create an issue on github and provide bison version which support it.

    Like

    1. Hi Alex,

      Many thanks for the solution, confirmed as working here with my win_bison.exe. I will update the article to reflect this.

      (The version of bison which allows “–header=” instead of “–defines=” is referenced as 3.8 in the manual, which is not yet a release.)

      Like

  2. I wrote a scanner and parser based on this tutorial which extends the tokens so that variables names can be more than one letter but it’s having issues when the token length is longer than 15 characters. O get strings back that are all null chars, as if the memory was freed. Do you know why that might be?

    Like

    1. Hi Bob,

      Thanks for your comment. In answer to your question, it is my understanding that for a std::string object, a string 15 or fewer characters in length is stored within the object, while longer than that uses heap memory. My guess is that a shallow copy or dangling reference of a std::string object is made somewhere, but without looking at your code it’s difficult to say.

      I made the following minor changes to the code outlined in parts 1 and 2, and could not reproduce this bug, so I don’t believe it’s a problem with flex/bison or the code in the tutorial:

      --- lexer1.l    2021-06-22 10:21:18.598527600 +0100
      +++ lexer1a.l   2023-05-23 12:53:15.222224300 +0100
      @@ -16,2 +16,2 @@
      -intvar          ([[:upper:]])
      -fltvar          ([[:lower:]])
      +intvar          ([[:upper:]]+)
      +fltvar          ([[:lower:]]+)
      @@ -23,2 +23,2 @@
      -{intvar}    yylval->emplace<char>(yytext[0]); return Parser::token::INTVAR;
      -{fltvar}    yylval->emplace<char>(yytext[0]); return Parser::token::FLTVAR;
      +{intvar}    yylval->emplace<std::string>(yytext); return Parser::token::INTVAR;
      +{fltvar}    yylval->emplace<std::string>(yytext); return Parser::token::FLTVAR;
      
      --- grammar1.y  2021-06-22 10:07:16.275782500 +0100
      +++ grammar1a.y 2023-05-23 12:49:55.862715300 +0100
      @@ -4,0 +5 @@
      +#include <unordered_map>
      @@ -28 +29 @@
      -%token <char>       INTVAR FLTVAR
      +%token <std::string>       INTVAR FLTVAR
      @@ -43,2 +44,2 @@
      -        long long ivars['Z' - 'A' + 1];
      -        double fvars['z' - 'a' + 1];
      +        std::unordered_map<std::string,long long> ivars;
      +        std::unordered_map<std::string,double> fvars;
      @@ -66,2 +67,2 @@
      -        | INTVAR ASSIGN iexp EOL    { ivars[$1 - 'A'] = $3; }
      -        | FLTVAR ASSIGN fexp EOL    { fvars[$1 - 'a'] = $3; }
      +        | INTVAR ASSIGN iexp EOL    { ivars[$1] = $3; }
      +        | FLTVAR ASSIGN fexp EOL    { fvars[$1] = $3; }
      @@ -80 +81 @@
      -        | INTVAR                    { $$ = ivars[$1 - 'A']; }
      +        | INTVAR                    { $$ = ivars[$1]; }
      @@ -91 +92 @@
      -        | FLTVAR                    { $$ = fvars[$1 - 'a']; }
      +        | FLTVAR                    { $$ = fvars[$1]; }
      

      You could use this code as a starting point if you like, having checked it works for you. Note that integer variables are all uppercase and floating-point variables are all lowercase in this scheme.

      Like

Leave a reply to Bob Cancel reply