Introduction to Compiler Design
Figure: A typical multi-stage compiler pipeline from source code to executable.
In computing, a compiler is a
program that automates translation of source code (written in one high-level language) into another
language, often low-level machine code 1 . Compilers are fundamental in software development: they
allow developers to write in human-readable languages while ultimately producing efficient machine-
executable programs 1 . The phases of a compiler are organized into a pipeline, where each phase
transforms the program representation and passes its output to the next phase 2 1 . Figure above
illustrates this sequence from source code through lexical analysis, parsing (syntax analysis), semantic
analysis, intermediate-code generation, optimization, and finally target code generation 2 1 . In
practice, compilers are designed as modular components (front-end and back-end) that isolate each phase,
a structure that promotes clarity, correctness and reusability 3 . In fact, compiler designers invest much
effort to ensure correctness of each phase, since errors (e.g. misparsed constructs or incorrect code) can
lead to faulty executables that are hard to debug 4 .
A compiler’s front end (lexical, syntax, semantic) checks and understands the source program according to
the language’s rules, while the back end (code generation, optimization) produces efficient code for the
target machine 5 . For example, compiling C source eventually produces machine code or an object file;
this output is typically much faster and more efficient than interpreting the source at run time 6 .
Throughout compilation, auxiliary components like the symbol table and error handler support all phases.
The symbol table stores information about every identifier (name, type, scope) 7 8 , and error handling
ensures informative diagnostics. A summary of the major phases is given in Table below, with each phase’s
input, output, and key concepts:
Phase                    | Input               | Output                   | Core Concepts/Tools
Lexical analysis         | character stream    | token stream             | regular expressions, finite automata, Lex/Flex
Syntax analysis          | token stream        | parse tree / AST         | context-free grammars, LL/LR parsing, Yacc/Bison
Semantic analysis        | parse tree / AST    | annotated AST            | symbol table, type checking, attribute grammars
Intermediate code gen.   | annotated AST       | IR (three-address code)  | three-address code, SSA
Code optimization        | IR                  | optimized IR             | control-flow graph, data-flow analysis
Code generation          | optimized IR        | assembly / machine code  | instruction selection, register allocation, stack frames
This pipeline shows how a high-level language program is progressively transformed into efficient machine
instructions 2 18 . Over the course, we will build an understanding of each phase and of how compiler tools
(like Lex and Yacc) automate parts of this process. By the end, you should appreciate the theory (formal
languages, automata, type systems) and practice (tool usage, coding strategies) behind compilers, making
you better at both constructing compilers and understanding how languages are implemented.
Lexical Analysis
The first phase of a compiler is lexical analysis (often called scanning). Its job is to read the raw source-
character stream and group characters into meaningful tokens. A token is a sequence of characters that
represents a basic syntactic unit (such as an identifier, keyword, literal constant, operator, or separator) 9 .
For example, in C the character sequence int is recognized as a keyword token, x1 as an identifier
token, + as an operator token, and 123 as a numeric-literal token. Lexical analysis simplifies later phases
by collapsing characters into token units; it also removes irrelevant material (like whitespace or comments)
and catches simple errors (like illegal characters) early.
Formally, tokens are defined by patterns (often given as regular expressions). Regular expressions describe
the set of strings belonging to each token type. For instance, one might specify that an integer literal token
matches the regex [0-9]+ , and an identifier matches [A-Za-z_][A-Za-z0-9_]* . In practice, a finite
automaton (deterministic or nondeterministic) is built from these regexes to recognize tokens. The lexical
analyzer runs the automaton on the input: when the automaton reaches an accepting (final) state, it has
recognized one token 20 . Deterministic finite automata (DFAs) are especially popular because they scan
input in one pass (left-to-right) and decide each token in linear time. In fact, the process of converting
regex-based token specifications into an efficient DFA can be automated.
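To make this concrete, here is a minimal hand-coded sketch in C of the identifier DFA described above; the
function name and the maximal-munch (longest-match) convention are illustrative assumptions rather than a
fixed interface.

#include <ctype.h>
#include <stdio.h>

/* Recognize one identifier matching [A-Za-z_][A-Za-z0-9_]* at the start of s.
   Returns the length of the lexeme, or 0 if the first character has no valid
   transition from the start state. */
int scan_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                                  /* start state: reject */
    int i = 1;                                     /* accepting state reached */
    while (isalnum((unsigned char)s[i]) || s[i] == '_')
        i++;                                       /* loop in the accepting state */
    return i;                                      /* longest match (maximal munch) */
}

int main(void) {
    const char *src = "x1 + 123";
    int len = scan_identifier(src);
    printf("matched %d characters: %.*s\n", len, len, src);   /* matched 2 characters: x1 */
    return 0;
}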
Because hand-writing scanners from regexes and DFAs is tedious and error-prone, compiler writers use scanner generators. Tools
like Lex or Flex let the developer write token patterns (regexes) and corresponding actions in a specification
file. The tool then automatically constructs a DFA and generates C code for the scanner. At compile time,
this scanner reads characters and outputs a stream of tokens (each token is typically represented by a token
type and possibly an attribute, like the text of an identifier) 10 21 . For example, the Flex manual notes that
Flex was explicitly designed to produce faster lexical analyzers than the original Lex 21 . Lex and Flex also
integrate easily with later phases: each recognized token can be handed off to a parser generated by a tool
like Yacc for syntax analysis.
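For illustration, a Flex specification might look like the sketch below. The token codes (KW_INT, IDENT,
NUMBER, PLUS) and the header tokens.h are hypothetical names assumed for the example; in a real project they
usually come from the parser side (e.g. a Yacc/Bison-generated header).

%{
/* Minimal Flex sketch; token codes are assumed to be defined in tokens.h. */
#include <stdio.h>
#include "tokens.h"
%}
%option noyywrap
%%
[ \t\n]+                  { /* skip whitespace */ }
"int"                     { return KW_INT; }
[A-Za-z_][A-Za-z0-9_]*    { return IDENT; }
[0-9]+                    { return NUMBER; }
"+"                       { return PLUS; }
.                         { fprintf(stderr, "illegal character: %s\n", yytext); }
%%

Running flex on such a file produces a C scanner whose yylex() function returns the next token code each
time it is called.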
Common token categories include identifiers (names defined by the programmer), keywords (reserved
words like if , for ), constants/literals (numeric, character, or string literals), operators (such as + ,
- , * , < , > ), and punctuation or delimiters (such as semicolons, parentheses, braces) 9 22 . The
scanner typically maintains a table of keyword tokens (to distinguish the word int as a keyword vs. an
identifier), and it enters each new identifier into the symbol table for later use. In summary, lexical analysis
simplifies the source text into tokens for the parser, using regular expressions and finite automata – often
implemented via tools like Lex/Flex 9 10 .
Syntax Analysis
After tokenization, syntax analysis (parsing) examines the sequence of tokens to determine its
grammatical structure. The language’s syntax is defined by a context-free grammar (CFG) – a formal set of
production rules that describe which token sequences form valid programs 23 11 . For example, a simple
rule might say that if ( <expr> ) <stmt> is a valid form for an if-statement. The parser reads tokens
left-to-right and tries to build a parse tree (or concrete syntax tree): a tree whose internal nodes
correspond to nonterminal symbols of the grammar and whose leaves are the actual tokens 11 24 . Each
branch in the parse tree shows how a grammar rule applies. If the parser finds that the token sequence
cannot be derived by the grammar, it reports a syntax error. Well-formedness in syntax is crucial: parsing
ensures the program structure matches the language’s rules.
• Top-down parsing (LL parsing): These parsers start from the start symbol of the grammar and build
the parse tree from the root downward. An LL(1) parser is a simple predictive parser that reads input
Left-to-right, producing a Leftmost derivation, with 1 token of lookahead. LL(1) grammars must be
unambiguous and free of certain patterns (no left-recursion, no common prefixes requiring
backtracking). Practical top-down parsers often use recursive-descent algorithms guided by FIRST/
FOLLOW sets.
• Bottom-up parsing (LR parsing): These parsers start from the input tokens and build the parse tree
up to the root. LR parsers (including variants SLR(1), LALR(1), and canonical LR(1)) perform a shift-
reduce analysis. For example, Yacc (Yet Another Compiler-Compiler) generates LALR(1) parsers, a
form of LR parser with one-symbol lookahead. LR parsers can handle a larger class of grammars
(including most programming language grammars) than LL parsers. Practically, most compiler
courses use LALR(1) because Yacc/Bison are readily available. (The term “LALR” stands for
Lookahead LR, a compact version of full LR.)
Whether using LL or LR, the parser verifies that token sequences conform to the grammar and constructs
the parse tree. The parse tree is then usually transformed into an abstract syntax tree (AST) by dropping
extraneous grammar nodes (e.g. parentheses or punctuation), producing a more compact representation of
program structure. The AST is typically used in semantic analysis and code generation instead of the raw
parse tree.
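To show the top-down idea in code, here is a minimal recursive-descent sketch in C for expressions over +,
*, parentheses, and single-character identifiers or digits; it only accepts or rejects its input, and AST
construction and error recovery are omitted so the parsing structure stays visible.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;                 /* current input position (one-character "tokens") */

static void expr(void);

static void factor(void) {            /* factor -> ( expr ) | id */
    if (*p == '(') {
        p++;
        expr();
        if (*p == ')') p++;
        else { printf("missing ')'\n"); exit(1); }
    } else if (isalnum((unsigned char)*p)) {
        p++;
    } else {
        printf("syntax error at '%c'\n", *p);
        exit(1);
    }
}

static void term(void) { factor(); while (*p == '*') { p++; factor(); } }   /* term -> factor (* factor)* */
static void expr(void) { term();   while (*p == '+') { p++; term();   } }   /* expr -> term (+ term)*     */

int main(void) {
    p = "a+b*(c+d)";
    expr();
    puts(*p == '\0' ? "accepted" : "rejected");      /* prints: accepted */
    return 0;
}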
Parser generators simplify this phase. Tools like Yacc and Bison allow the compiler writer to specify the
grammar rules and attach semantic actions to them. Yacc was originally developed at Bell Labs and
generates C code for an LALR(1) parser from a BNF grammar specification 12 . Its GNU successor, Bison, is
widely used today and can generate LALR(1) parsers (and even canonical LR or GLR parsers) from a similar
grammar file 25 . Both Yacc and Bison parse a .y file that contains grammar productions and C snippets;
they produce a parser that reads token streams (from Lex/Flex) and builds a parse tree or performs on-the-
fly processing. Yacc/Bison integrate well with Lex: typically, the lexer provides tokens to the parser, and
semantic actions (written in C) construct nodes of the AST or populate data structures 26 25 .
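A Bison grammar file for simple arithmetic might look like the sketch below; it evaluates expressions on the
fly in its semantic actions, and it assumes a companion Flex scanner that supplies yylex() and sets yylval
for the NUMBER token.

%{
/* Minimal Bison sketch; the precedence declarations resolve the
   ambiguity of the expression grammar. */
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg) { fprintf(stderr, "parse error: %s\n", msg); }
%}
%token NUMBER
%left '+'
%left '*'
%%
expr : expr '+' expr   { $$ = $1 + $3; }   /* semantic action written in C */
     | expr '*' expr   { $$ = $1 * $3; }
     | '(' expr ')'    { $$ = $2; }
     | NUMBER          { $$ = $1; }
     ;
%%

In a real compiler, the actions would build AST nodes instead of computing values directly.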
In summary, syntax analysis enforces the grammar of the language and builds the parse tree (AST). LL(1)
parsers (top-down) and LR(1)/LALR(1) parsers (bottom-up) are the common algorithms. Yacc/Bison are tools
that generate these parsers automatically 12 25 . By the end of this phase, we have a structured tree that
represents the full program syntax, ready for semantic checks.
Semantic Analysis
Once the parse tree (or AST) is obtained, semantic analysis verifies that the program is meaningful under
the language’s rules and gathers necessary type information. This phase uses both the syntax tree and the
symbol table built so far. Key tasks include type checking (ensuring operators are applied to compatible
types, function calls have correct arguments, etc.), enforcing scoping rules (e.g. variables are declared
before use), and evaluating constant expressions if possible. The semantic analyzer traverses the AST and
performs these checks, often using an attribute grammar approach where each node is annotated with
type and other semantic information. For example, it ensures you cannot add a string to an integer, or
cannot call a function with the wrong number of arguments. If a semantic violation is found (like an undeclared
variable or mismatched types), the compiler reports an error.
A crucial data structure in this phase is the symbol table. As identifiers are encountered (in declarations or
uses), entries are made in the symbol table recording their name, type, scope level, memory location, and
other attributes 8 . The symbol table enables quick lookup of any identifier and enforcement of scope
rules (e.g. different functions may have variables with the same name in separate scopes). According to
tutorials, “every compiler uses a symbol table to track all variables, functions, and identifiers in a
program” 8 . During semantic analysis, the symbol table is filled (at declarations) and consulted (at uses) to
confirm correctness: for instance, the analyzer checks that an identifier is already declared before use 13
8 . It also records type information so that, for example, an assignment x = y + 1 can be checked to
ensure that the types of x, y, and the constant 1 are compatible.
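A minimal symbol-table sketch in C is shown below, assuming a flat array with linear search and a simple
integer scope level; production compilers usually use hash tables with per-scope chaining, so the data
layout here is purely illustrative.

#include <stdio.h>
#include <string.h>

typedef enum { TY_INT, TY_FLOAT } Type;

typedef struct {
    char name[32];
    Type type;
    int  scope;                         /* nesting level of the declaring scope */
} Symbol;

static Symbol table[256];
static int    nsyms = 0;

/* Called at a declaration; fails if the name is already declared in this scope. */
int declare(const char *name, Type type, int scope) {
    for (int i = 0; i < nsyms; i++)
        if (table[i].scope == scope && strcmp(table[i].name, name) == 0)
            return 0;                   /* redeclaration error */
    snprintf(table[nsyms].name, sizeof table[nsyms].name, "%s", name);
    table[nsyms].type  = type;
    table[nsyms].scope = scope;
    nsyms++;
    return 1;
}

/* Called at a use; the most recently declared matching entry wins. */
Symbol *lookup(const char *name) {
    for (int i = nsyms - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;                        /* undeclared identifier */
}

int main(void) {
    declare("x", TY_INT, 0);
    Symbol *s = lookup("x");
    printf("x is %s\n", s && s->type == TY_INT ? "declared as int" : "undeclared");
    return 0;
}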
Once semantic analysis is done, the AST is typically annotated with type and other semantic information.
Some compilers also directly generate an intermediate representation (IR) in this phase. IR is a machine-
independent code form (such as three-address code or static single-assignment form) that abstracts away
high-level syntax while still being easier to optimize than raw machine code 15 . The Princeton notes define
IR as “an abstract machine language” that is independent of any particular machine or source language 15 .
In any case, semantic analysis outputs either an annotated AST or an initial intermediate code. This AST/IR
will be used by the next phases (code generation and optimization).
Finally, any semantic transformations (like implicit type promotions, array bounds checks, or short-circuit
evaluation of logical operators) are also handled here. The semantic checker may insert conversion nodes in
the AST or otherwise rewrite the tree to ensure type compatibility. After semantic analysis, the program is
guaranteed to be correct in meaning (according to the language), and all high-level structure is resolved.
In textbook terms, the front end (lexical + syntax + semantic) has now completely understood the program.
The remaining task is to produce efficient code from this representation.
(If any semantic errors were found, compilation typically stops here. Assuming none, the compiler proceeds.)
Code Generation
With a semantically sound, annotated AST or IR, the compiler proceeds to code generation. This phase
translates the high-level intermediate representation into low-level machine or assembly code for the target
architecture. The main challenge is to map each operation in the IR into one or more target instructions and
to manage limited hardware resources (like registers and memory) efficiently.
Typically, code generation is structured around traversing the AST or IR and emitting code. For simple
expressions, the compiler recursively generates code for subexpressions and then applies the
corresponding machine instruction (e.g. to add two values). For control structures, it emits jump/branch
instructions to implement loops and conditionals. Often this phase uses templates or patterns: for each IR
operation, there is a template sequence of machine instructions. Modern compilers use intermediate steps
such as instruction selection (choosing the best instruction sequences) and instruction scheduling
(ordering instructions to avoid hazards), but in our scope we focus on basics.
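The sketch below illustrates this traversal scheme in C for binary expressions, emitting three-address-style
pseudo-instructions; the node layout and the temporary-naming scheme are assumptions made for the example,
not a fixed IR format.

#include <stdio.h>
#include <string.h>

typedef struct Node {
    char op;                            /* '+' or '*' for operators, 0 for leaves */
    char name[16];                      /* variable name for leaves */
    struct Node *left, *right;
} Node;

static int ntemp = 0;

/* Emit code for the subtree rooted at n; 'out' receives the name holding its value. */
static void gen(Node *n, char out[16]) {
    if (n->op == 0) {                   /* leaf: the value is already in a named location */
        strcpy(out, n->name);
        return;
    }
    char l[16], r[16];
    gen(n->left, l);                    /* generate code for the operands first */
    gen(n->right, r);
    sprintf(out, "t%d", ntemp++);       /* fresh temporary for the result */
    printf("%s = %s %c %s\n", out, l, n->op, r);
}

int main(void) {
    Node a = {0, "a", NULL, NULL}, b = {0, "b", NULL, NULL}, c = {0, "c", NULL, NULL};
    Node mul = {'*', "", &a, &b};
    Node add = {'+', "", &mul, &c};     /* represents a*b + c */
    char result[16];
    gen(&add, result);                  /* prints: t0 = a * b   then   t1 = t0 + c */
    return 0;
}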
An important subproblem in code generation is register allocation. The target machine has a limited
number of registers, but the IR may use an unlimited number of temporaries. A register allocator assigns IR
variables or temporaries to physical registers, spilling some to memory if needed. Good allocators use
algorithms like graph-coloring to minimize spills. As GeeksforGeeks notes, “register allocation is an NP-
complete problem” but can be approximated by graph-coloring heuristics 19 . In practice, many compilers
perform a global analysis (building an interference graph of variables) and then try to color it with k colors,
where k is the number of physical registers. Alternatively, a simpler local strategy (within each basic block)
may be used in teaching
compilers. The goal is to keep frequently used variables in the fast registers rather than repeatedly loading/
storing them to memory 19 .
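The sketch below shows the greedy-coloring idea on a prebuilt interference graph; the graph, the register
count, and the visiting order are illustrative, and real allocators (Chaitin/Briggs style) add spill-cost
analysis and coalescing on top of this.

#include <stdbool.h>
#include <stdio.h>

#define NVARS 4                         /* IR temporaries v0..v3 */
#define NREGS 2                         /* physical registers r0..r1 */

int main(void) {
    /* interfere[i][j] is true when vi and vj are live at the same time */
    bool interfere[NVARS][NVARS] = {
        {0, 1, 1, 0},
        {1, 0, 1, 0},
        {1, 1, 0, 1},
        {0, 0, 1, 0},
    };
    int color[NVARS];                   /* assigned register number, or -1 = spill */

    for (int v = 0; v < NVARS; v++) {
        bool used[NREGS] = {false};
        for (int u = 0; u < v; u++)     /* registers taken by already-colored neighbours */
            if (interfere[v][u] && color[u] >= 0)
                used[color[u]] = true;
        color[v] = -1;
        for (int r = 0; r < NREGS; r++)
            if (!used[r]) { color[v] = r; break; }
        if (color[v] < 0) printf("v%d -> spill to memory\n", v);
        else              printf("v%d -> register r%d\n", v, color[v]);
    }
    return 0;
}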
Stack and calling conventions are also handled here. For each function, the compiler allocates a stack
frame upon entry: it reserves space on the stack for local variables, saved registers, function parameters,
and the return address. When a function call is made, arguments are pushed to the stack or placed in
specified registers (per the target’s calling convention), and a jump is made to the function. On return, the
caller’s frame is restored. The details (which registers must be saved by caller vs callee, stack direction, etc.)
are target-specific, but the compiler back end must implement them correctly. For example, on x86 the EBP/
RBP register often points to the start of the current stack frame; on return that register (and others) must
be restored so the caller can continue correctly. The code generator emits prologue and epilogue code at
each function to set up and tear down the stack frame.
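As a sketch, a code generator might emit an rbp-based x86-64 prologue and epilogue with helpers like the ones
below; the function name, frame size, and Intel-style syntax are illustrative, and real back ends must also
respect ABI details such as stack alignment and callee-saved registers.

#include <stdio.h>

void emit_prologue(const char *fname, int local_bytes) {
    printf("%s:\n", fname);
    printf("    push rbp\n");                    /* save the caller's frame pointer */
    printf("    mov  rbp, rsp\n");               /* establish the new frame pointer */
    printf("    sub  rsp, %d\n", local_bytes);   /* reserve space for locals        */
}

void emit_epilogue(void) {
    printf("    mov  rsp, rbp\n");               /* release the local area          */
    printf("    pop  rbp\n");                    /* restore the caller's frame      */
    printf("    ret\n");                         /* return to the caller            */
}

int main(void) {
    emit_prologue("my_function", 16);            /* hypothetical function, 16 bytes of locals */
    emit_epilogue();
    return 0;
}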
Finally, the code generator outputs either an assembly language file or direct machine code (object code).
The assembly output may still require an assembler/linker, whereas object code (relocatable machine code)
can be linked into an executable. Some compilers even produce absolute machine code (with fixed
addresses), but usually they produce relocatable output plus a symbol table for linking. GeeksforGeeks
notes that the target code can be absolute, relocatable, or assembly, each with trade-offs (e.g. absolute
code is hard to reuse, while assembly needs a separate assembler pass) 27 .
In summary, code generation converts IR into target-specific instructions, handling register allocation and
stack-frame layout. This phase ensures the final program performs the same logic as the source, but
directly on hardware. A well-designed code generator (together with the later optimizer) yields fast, efficient
executables from the high-level input 18 19 .
Code Optimization
Before or during code generation, a compiler typically performs optimization to improve performance or
reduce the size of the output code. Optimization encompasses many techniques, but its overall goal is to make
the compiled program run faster and/or use fewer resources, without changing its behavior 16 . This might
mean fewer instructions executed, less memory used, or better use of the CPU pipeline.
Many optimizations rely on the control-flow graph (CFG) of the program. A CFG breaks the program into
basic blocks (straight-line code sequences) and shows how control can flow between them. Using the CFG,
the compiler can detect unreachable blocks, merge identical code, and apply transformations across
branches. For example, after constructing the CFG, an optimizer might find that a particular block has no
incoming edges (unreachable) and safely remove it 17 . Other optimizations, like loop invariant code
motion, identify calculations inside loops that can be moved outside the loop for efficiency.
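A small before/after illustration of loop-invariant code motion in C is shown below; the variables are
arbitrary, and the point is simply that x * y does not change inside the loop, so it can be computed once
before it.

/* Before hoisting (conceptually):
     for (int i = 0; i < n; i++) a[i] = b[i] * (x * y);
*/
void scale(int n, int a[], const int b[], int x, int y) {
    int t = x * y;                      /* loop-invariant: computed once          */
    for (int i = 0; i < n; i++)
        a[i] = b[i] * t;                /* one multiplication saved per iteration */
}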
Another key method is data-flow analysis on the CFG. This means propagating information (like variable
definitions or usage) along edges until no more changes occur. Data-flow allows optimizers to find dead
variables and dead code: by analyzing what values are actually used, the compiler can eliminate
assignments that have no effect on the program’s outcome 30 . Data-flow frameworks also enable more
complex analyses like liveness (for register allocation) and reaching definitions.
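The sketch below applies this idea as liveness-based dead-store elimination within a single basic block of
three-address code; the single-character variable names and the assumption that only x is live at block exit
are simplifications for the example.

#include <stdbool.h>
#include <stdio.h>

typedef struct { char dst, src1, src2; } Instr;   /* dst = src1 op src2 */

int main(void) {
    Instr block[] = {
        {'t', 'a', 'b'},                /* t = a op b                 */
        {'u', 'a', 'c'},                /* u = a op c  (never used)   */
        {'x', 't', 'd'},                /* x = t op d                 */
    };
    int n = 3;
    bool live[128] = {false};
    live['x'] = true;                   /* assumed live at block exit */

    /* Scan backwards: a store to a dead variable (with no side effects) can be removed. */
    for (int i = n - 1; i >= 0; i--) {
        if (!live[(int)block[i].dst]) {
            printf("instruction %d (%c = ...) is dead and can be eliminated\n", i, block[i].dst);
            continue;                   /* its operands do not become live */
        }
        live[(int)block[i].dst]  = false;   /* the value is redefined here  */
        live[(int)block[i].src1] = true;    /* operands are used, so live   */
        live[(int)block[i].src2] = true;
    }
    return 0;
}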
The optimizer must always preserve correctness: it can only perform transformations that do not change
the program’s meaning. In practice, compilers strike a balance: heavy optimization can exponentially
increase compile time, so usually only the most beneficial passes are enabled by default. Nonetheless,
effective optimizations can greatly speed up code and are a hallmark of modern compilers 16 29 .
Software Engineering Practices
• Modular Design: Each compiler phase should be a separate module with clear interfaces. For
example, the lexer outputs a standardized token structure that the parser expects (a minimal sketch of
such an interface appears after this list). The parser
produces a parse tree or AST, which is the interface to the semantic analyzer. By keeping the code for
each phase separate, developers can test and debug them independently. As noted in academic
references, compilers generally implement phases as modular components to promote efficient and
correct design 3 . For instance, one team member might work on the scanner (using Flex), another
on grammar and parsing (using Yacc), and later integration relies on both agreeing on token and
node formats.
• Compiler Construction Tools: We already mentioned Flex (lexical analyzer generator) and Bison/
Yacc (parser generators). These tools greatly simplify development. For example, Flex takes a list of
regular expressions and emits C code for a DFA-based scanner 21 . Tutorial sources point out that
scanner generators like Lex are built on finite automata for regex input 31 . Likewise, Bison reads a
grammar specification (BNF-like syntax), checks for ambiguities, and generates a corresponding
LALR(1) parser 25 . Using these tools means you often write far less code by hand and have robust
solutions for lexing/parsing. Other modern tools (outside the classic course scope) include ANTLR (a
powerful parser generator in Java) and LLVM (a framework for building back ends), but the principles
are similar.
• Development Practices: Use version control (e.g. Git) from day one, as a compiler project will evolve
and you want to track changes. Maintain a good suite of test programs: for each phase write small
programs to exercise all features (e.g. lexing corner cases, grammar constructs, type errors).
Integrate often: after completing the lexer, immediately hook it to the parser and test. Avoid writing
the entire compiler in one go – instead, build incrementally, phase by phase. Peer review of grammar
rules and code generation templates can catch many errors early.
• Error Handling and Diagnostics: Implement meaningful error messages. At minimum, report line
and column of errors. For syntax errors, consider simple recovery (skipping tokens until a
synchronizing token). For semantic errors (like type mismatch), point clearly to the offending
construct. Good error handling makes debugging much easier, and many problems in student
compilers arise from silent failures.
• Performance and Profiling: If time permits, profile the compiler itself. The front end (lexing/
parsing) should be quite fast for typical student projects, but code generation and optimization can
be heavy. Use tools (e.g. gprof or perf ) to find bottlenecks if the compiler is slow.
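As referenced under Modular Design above, here is a minimal sketch of a token interface that the scanner and
parser might share; the type names, the fixed-size lexeme buffer, and the next_token function are
illustrative choices, not a required layout.

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_LPAREN, TOK_RPAREN, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char      lexeme[64];               /* the matched text, e.g. an identifier's spelling */
    int       line, column;             /* source position, used for diagnostics           */
} Token;

Token next_token(void);                 /* implemented by the scanner, called by the parser */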
By applying software engineering best practices – modular design, clear interfaces, iterative testing – you’ll
reduce development headaches. And remember the old adage: writing compilers teaches “practical
applications of theory” (formal languages, automata, data structures). Document your code well, write clear
grammar comments, and treat the compiler as any major software project.
Exam Preparation
To prepare for exams, focus on understanding each phase and key concepts, and practice by hand on small
examples. Likely examinable topics include:
- Phases of Compilation: Know the purpose and input/output
of each phase 2 1 . Be able to explain why each phase is needed.
- Lexical Analysis: Definitions of token, lexeme, pattern. Use of regular expressions and finite automata to
recognize tokens 20 10 . Perhaps design a regex for a given token or draw a small DFA.
- Syntax Analysis: Context-free grammars and parse trees 11 23 . Difference between LL(1) and LR(1)
parsing strategies. FIRST/FOLLOW sets, parsing tables (maybe for small grammar). Handling of ambiguities
(e.g. dangling else).
- Parser Generators: Role of Yacc/Bison: given a grammar, how Yacc translates it into parser code 12 25 .
Understand shift/reduce and reduce/reduce conflicts at a high level.
- Semantic Analysis: Type checking rules (e.g. given a code snippet, determine if type errors exist). Symbol
table contents: what information is stored and how it’s used 8 . Building and annotating the AST.
- Intermediate Representation: Forms like three-address code (quadruples), AST, DAG. Converting simple
statements into 3-address form.
- Code Generation: Manual translation of small IR to assembly. Register allocation basics: possibly simpler
methods (e.g. using stack for spills). Stack frame layout for a function (argument passing, local vars, return).
- Optimizations: Identify dead code, constant expressions, and optimize them (constant folding, dead-code
elimination). Draw a small CFG and explain data-flow facts. Understand loop optimization examples.
- Tools & Design: Roles of lex/yacc. Benefits of modular compiler design 3 .
- Theory vs Practice: Differences between compiler and interpreter 6 , or the meaning of “compiler
compiler” (parser generator) 32 .
Here are some sample questions with answers to guide your study:
Q1: Describe the phases of a compiler and the role of each phase.
Answer: The compiler works in stages. Lexical analysis scans characters and outputs tokens (identifiers,
literals, etc.) 9 . Syntax analysis parses tokens according to a grammar to build a parse tree 11 . Semantic
analysis checks types and scopes using the parse tree and symbol table 13 8 . After this, an intermediate
code may be generated. Code optimization then transforms the intermediate code for efficiency (e.g.
removing dead code) 16 . Finally, code generation produces target machine instructions, handling register
allocation and calling conventions 18 19 . Each phase’s output is the next phase’s input, enforcing
correctness before moving on.
Q2: Given the regular expression [A-Za-z_][A-Za-z0-9_]* , construct a DFA or describe how a
lexical analyzer recognizes identifiers matching this pattern.
Answer: The regex describes identifiers starting with a letter or underscore followed by any number of
letters, digits, or underscores. A DFA for this has: a start state S, transition from S on letter/underscore to
state A; state A loops on letter/digit/underscore; any other input leads to a rejecting (error) state. In
practice, a scanner built (e.g. by Flex) would recognize the longest sequence of such characters from the
input as one token (IDENT) and return it. This DFA is derived automatically from the regex and ensures each
valid identifier lexeme is correctly tokenized 20 10 .
Q3: The grammar
S -> S + S | S * S | ( S ) | id
is ambiguous and left-recursive. Transform it into an equivalent grammar suitable for LL(1) (top-down)
parsing.
Answer: Introducing precedence levels and eliminating the left recursion yields:
S -> T S'
S' -> + T S' | ε
T -> F T'
T' -> * F T' | ε
F -> ( S ) | id
Here S' and T' are new nonterminals, and the resulting grammar is LL(1) (its FIRST/FOLLOW sets permit
predictive parsing); it also encodes the precedence of * over +. This process of eliminating left recursion and
factoring choices is typical for
preparing grammars for top-down parsing 12 .
Q4: What is a symbol table and how is it used in semantic analysis? Give an example of an entry it
might contain.
Answer: A symbol table is a data structure used by the compiler to store information about identifiers
(variables, functions, etc.) encountered in the program 8 . Each entry typically includes the identifier’s
name, type, scope level, memory location, and other attributes. For example, for a variable declaration
int x; the symbol table might have an entry:
Name: x | Kind: variable | Type: int | Scope: global | Memory addr: 0x1004
Semantic analysis uses the symbol table to check correct usage. When seeing x = 5; later, the compiler
looks up x in the table, sees it is declared as an int , and confirms that assigning an integer literal to it is
valid. The symbol table ensures that undeclared identifiers are caught and that types are applied
consistently 13 8 .
Q5: Perform constant folding on the following code fragment and explain the transformation.
int a = 2*(22/7)*r;
int x = 12.4;
float y = x/2.3;
Answer: Constant folding computes any expressions with known constant operands at compile time. In the
fragment:
- 2*(22/7)*r → Because 22 and 7 are integer literals, 22/7 folds to the constant 3 (integer division)
and 2*3 folds to 6, so the statement becomes int a = 6*r; . (If floating-point arithmetic were intended,
22.0/7 would fold to ≈3.1428 and the whole constant factor to ≈6.2857.)
- int x = 12.4; → the initializer is a constant, but because x is declared int it is truncated to 12
at compile time.
- float y = x/2.3; → if x is not modified before this use, constant propagation replaces x with the
constant 12, and the division folds to 12/2.3 ≈ 5.217, so this simplifies to y = 5.217; (approximately).
Thus at compile time we replace expressions with their computed constants 28 . This reduces runtime work
and is a basic machine-independent optimization.
These examples should illustrate how to work through key concepts. In exams, always show your work: draw
diagrams (like DFAs or parse trees) neatly and explain each transformation step. Practice by hand is the best
preparation.
When using textbooks, read the relevant sections carefully and do the exercises. The examples in chapters
on lexical analysis and parsing are particularly helpful. Compare the book’s algorithms with your own notes
and code. If a concept (e.g. FIRST/FOLLOW sets, register allocation) is unclear, the book will usually have a
worked example. Also use lecture slides and reliable online sources (some are cited above) to complement
the texts.
For the term project of building a compiler for a simple language, apply solid project management: start
early, define milestones (e.g. finish lexer by week 3, parser by week 5, etc.), and test incrementally. Begin by
writing a few small test programs in your language (with all language features) and run them through each
phase as you implement. When debugging, work backwards: for example, if the final output is wrong,
inspect the intermediate code or AST to isolate the error. Use version control commits as you complete each
phase so you can revert if needed. Divide tasks among team members by compiler stage if you’re in a
group: one person can code the symbol table and semantic checks while another handles code generation,
but agree on data structures beforehand.
Remember to use the compiler tools: write Flex rules for tokens, and Yacc/Bison grammar rules for parsing.
For each Yacc grammar rule, include a semantic action in C that builds an AST node or generates
intermediate code (many examples exist online). Keep your symbol table and AST data structures well-
defined at the start – for example, use C structs or C++ classes with fields for type, value, and child pointers.
Modularize common tasks (like emitting three-address code) into helper functions.
Finally, manage your time: compilers can be tricky, and bugs in parsing or symbol handling can cascade.
Leave the optimizer for last; a straightforward (unoptimized) code generator is fine if you run out of time.
Focus first on getting correct output and clear error messages. In summary, leverage textbooks for theory,
follow good software practices for engineering, and test continuously. Good planning and understanding
the big picture will make the project (and the course) much smoother.
Sources: Authoritative references from compilers literature and educational resources were used
throughout 2 1 10 11 12 25 13 19 16 3 , to ensure accuracy and depth of explanation.
1 3 4 5 6 Compiler - Wikipedia
https://en.wikipedia.org/wiki/Compiler
2 7 14 Phases of Compiler - TutorialsPoint
https://www.tutorialspoint.com/compiler_design/compiler_design_phases_of_compiler.htm
15 Princeton CS 320 lecture notes on intermediate representation (IR-trans1.pdf)
https://www.cs.princeton.edu/courses/archive/spr03/cs320/notes/IR-trans1.pdf
19 Register Allocations in Code Generation | GeeksforGeeks
https://www.geeksforgeeks.org/register-allocations-in-code-generation/