This article looks at the only part of the compiler that directly processes the source text (or “source code”): the “scanner”. There is a certain amount of theory behind the pattern matching this part of a compiler performs. However, unless you need to implement one from scratch (which can be the case for commercial-grade compilers), you can safely follow best practice and use a “scanner generator”.
The purpose of a scanner, whether hand-written or semi-automatically generated, is to convert (or reduce) textual patterns into a stream of numerical tokens. It really is as simple as that. (Well, almost!) A working knowledge of regular expressions (“regex”) is effectively a prerequisite, although patterns for common cases, such as the floating-point representation of numbers, can be looked up and reused rather than written off the top of your head. The pieces of text matched by the regexes have a special name: lexemes.
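To make the lexeme-to-token idea concrete, here is a minimal hand-written sketch in Python. It is an illustration only, not the scanner this series generates: the token names (`NUMBER`, `IDENT`, `OP`), their numeric values, and the small set of operators (including a `<-` assignment arrow) are all assumptions chosen for the example.

```python
import re
from enum import IntEnum

# Hypothetical token codes; a real scanner would define many more.
class Token(IntEnum):
    NUMBER = 1
    IDENT = 2
    OP = 3

# Each rule pairs a token code with the regex describing its lexemes.
# Only non-capturing groups are used inside the patterns so that the
# named group wrapped around each rule is the only capture that matches.
RULES = [
    (Token.NUMBER, r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),  # integer or float
    (Token.IDENT, r"[A-Za-z_]\w*"),
    (Token.OP, r"<-|[-+*/()=]"),
]

# Combine the rules into one alternation, plus a rule for whitespace,
# which is matched but never emitted as a token.
MASTER = re.compile(
    "|".join(f"(?P<{tok.name}>{pat})" for tok, pat in RULES) + r"|(?P<SKIP>\s+)"
)

def scan(text):
    """Return a list of (token code, lexeme) pairs for the given text."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((Token[m.lastgroup], m.group()))
        pos = m.end()
    return tokens

if __name__ == "__main__":
    for code, lexeme in scan("x <- 3.14 * (y + 2e10)"):
        print(int(code), code.name, repr(lexeme))
```

One design caveat: Python's `re` alternation takes the first alternative that matches, whereas scanner generators typically choose the longest match, so the order of the rules matters more in this sketch than it would in a generated scanner.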