SP Unit III-2024-25
Unit – III
COMPILERS
AY 2024-2025 SEM-II
Unit III - Syllabus
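[Figure: Source Program → Compiler → Target Program (the diagram also carries the label "Input")]
[Figure: language-processing system with Preprocessor, Source Program, Compiler, Assembler, Linker, Libraries and Relocatable Object Files]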
The Phases of a Compiler
Phase | Output | Sample
Programmer (source code producer) | Source string | A=B+C;
Scanner (performs lexical analysis) | Token string, and symbol table with names | ‘A’, ‘=’, ‘B’, ‘+’, ‘C’, ‘;’
Parser (performs syntax analysis based on the grammar of the programming language) | Parse tree or abstract syntax tree | (tree shown below)
Semantic analyzer (type checking, etc) | Annotated parse tree or abstract syntax tree |

      ;
      |
      =
     / \
    A   +
       / \
      B   C
The role of lexical analyzer
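[Figure: Source program → Lexical Analyzer → (token) → Parser → to semantic analysis. The Parser requests tokens with getNextToken; both the Lexical Analyzer and the Parser consult the Symbol table.]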
Input buffering
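• Sometimes the lexical analyzer needs to look ahead some symbols to
decide which token to return
• In C language: we need to look after -, = or < to decide what token to return
• In Fortran: DO 5 I = 1.25 (blanks are not significant, so this is an assignment to DO5I;
only the '.' in place of a ',' shows that it is not a DO loop)
• We need to introduce a two buffer scheme to handle large lookaheads safely
[Figure: input buffer holding E = M * C ** 2 eof]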
Tokens, Patterns and Lexemes
• A token is a pair: a token name and an optional token value
• A pattern is a description of the form that the lexemes of a token
may take
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token
Example
Token | Informal description | Sample lexemes
if | Characters i, f | if
else | Characters e, l, s, e | else
comparison | < or > or <= or >= or == or != | <=, !=
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Recognition of tokens
• Starting point is the language grammar to understand the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
• We also need to handle whitespaces:
ws -> (blank | tab | newline)+
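The patterns above can be tried out with a small regular-expression scanner. This is only a sketch under assumptions: the function name `tokenize`, the `TOKEN_SPEC` list and the choice to treat keywords as reserved identifiers are illustrative and are not part of the slides.

```python
import re

# Patterns mirror the regular definitions above; names are illustrative.
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                              # (blank | tab | newline)+
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),  # digits (. digits)? (E [+-]? digits)?
    ("id",     r"[A-Za-z_][A-Za-z_0-9]*"),                # letter (letter | digit)*
    ("relop",  r"<=|>=|<>|<|>|="),                        # longer operators listed first
]
KEYWORDS = {"if", "then", "else"}  # assumption: keywords are reserved identifiers
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "ws":
            continue                      # whitespace is recognized but not returned
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme                 # if / then / else get their own token names
        tokens.append((kind, lexeme))
    return tokens
```

Calling `tokenize("if x1 <= 3.5E+2 then y")` yields one pair per lexeme, with the whitespace lexemes dropped.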
[Figure: lex.yy.c → C compiler → a.out]
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
[Figure: sequence of parse trees built top-down from E by successive leftmost derivation steps (each step marked lm)]
Recursive descent parsing
• Consists of a set of procedures, one for each
nonterminal
• Execution begins with the procedure for start symbol
• A typical procedure for a non-terminal
void A() {
    choose an A-production, A -> X1 X2 ... Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else
            /* an error has occurred */;
    }
}
Recursive descent parsing (cont)
• General recursive descent may require backtracking
• The previous code needs to be modified to allow backtracking
• In its general form it cannot choose an A-production easily.
• So we need to try all alternatives
• If one fails, the input pointer needs to be reset and another
alternative should be tried
• Recursive descent parsers cannot be used for left-recursive grammars
Example
S->cAd
A->ab | a Input: cad
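[Figure: three parse trees for input cad. (1) S with children c, A, d. (2) A expanded with A -> ab: a matches, b does not match d, so the parser backtracks. (3) A expanded with A -> a: the input is matched.]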
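The backtracking on this example can be sketched with Python generators. This is an illustrative sketch, not the procedure from the slides: the names `GRAMMAR`, `parse`, `parse_sequence` and `accepts` are assumptions, and resetting the input pointer is done implicitly by retrying from the saved position `pos`.

```python
# Grammar from the example: S -> cAd, A -> ab | a
GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["a", "b"], ["a"]],
}

def parse(symbol, text, pos):
    """Yield every input position reachable after deriving `symbol` from text[pos:]."""
    if symbol not in GRAMMAR:                    # terminal: must match the input
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for alternative in GRAMMAR[symbol]:          # try all alternatives in order
        yield from parse_sequence(alternative, text, pos)

def parse_sequence(symbols, text, pos):
    """Yield end positions for matching the symbol sequence starting at pos."""
    if not symbols:
        yield pos
        return
    for middle in parse(symbols[0], text, pos):
        # If the rest fails, the loop resumes with the next way to match
        # symbols[0]; this is the backtracking step.
        yield from parse_sequence(symbols[1:], text, middle)

def accepts(text):
    """True if the whole input can be derived from the start symbol S."""
    return any(end == len(text) for end in parse("S", text, 0))
```

For the input cad, the alternative A -> ab is tried first and fails at d, then A -> a succeeds. As with any recursive descent parser, this sketch would recurse forever on a left-recursive grammar.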
First and Follow
• First(α) is the set of terminals that begin strings derived from α
• If α =>* ɛ then ɛ is also in First(α)
• In predictive parsing when we have A-> α|β, if First(α) and First(β) are
disjoint sets then we can select the appropriate A-production by looking at the
next input symbol
• Follow(A), for any nonterminal A, is the set of terminals a that can
appear immediately after A in some sentential form
• If we have S =>* αAaβ for some α and β then a is in Follow(A)
• If A can be the rightmost symbol in some sentential form, then $ is
in Follow(A)
Computing First
• To compute First(X) for all grammar symbols X, apply the following rules
until no more terminals or ɛ can be added to any First set:
1. If X is a terminal then First(X) = {X}.
2. If X is a nonterminal and X->Y1Y2…Yk is a production for some k>=1, then
place a in First(X) if for some i a is in First(Yi) and ɛ is in all of First(Y1),
…,First(Yi-1), that is, Y1…Yi-1 =>* ɛ. If ɛ is in First(Yj) for all j=1,…,k then add ɛ
to First(X).
3. If X-> ɛ is a production then add ɛ to First(X)
• Example!
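As one possible worked example, the rules can be run to a fixed point on the expression grammar E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id. The sketch below is an assumption about how to encode it: the dictionary layout, the ASCII names E' and T', and the function name `compute_first` are illustrative.

```python
EPS = "ε"

# Each nonterminal maps to a list of production bodies (lists of symbols).
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def compute_first(grammar):
    """Apply the First rules repeatedly until no set changes."""
    first = {nonterminal: set() for nonterminal in grammar}

    def first_of(symbol):
        # Rule 1: a terminal (or ε itself) is its own First set.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for symbol in body:
                    symbol_first = first_of(symbol)
                    first[head] |= symbol_first - {EPS}
                    if EPS not in symbol_first:
                        break              # Y1...Yi-1 must all derive ε to go on
                else:
                    first[head].add(EPS)   # every symbol of the body can vanish
                if len(first[head]) != before:
                    changed = True
    return first

FIRST = compute_first(GRAMMAR)
```

With this encoding, `FIRST["E"]`, `FIRST["T"]` and `FIRST["F"]` all come out as {(, id}, while `FIRST["E'"]` is {+, ε} and `FIRST["T'"]` is {*, ε}.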
Computing follow
• To compute Follow(A) for all nonterminals A, apply the following rules until
nothing can be added to any Follow set:
1. Place $ in Follow(S) where S is the start symbol
2. If there is a production A-> αBβ then everything in First(β) except ɛ is in
Follow(B).
3. If there is a production A->αB, or a production A->αBβ where First(β)
contains ɛ, then everything in Follow(A) is in Follow(B)
• Example!
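A possible worked example for the same expression grammar (E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id) is sketched below. To keep it short, the First sets are written in by hand rather than computed; that table, the dictionary encoding and the names `first_of_string` and `compute_follow` are assumptions made for illustration.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
# First sets of the nonterminals, entered by hand for this grammar.
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}

def first_of_string(symbols):
    """First set of a sequence of grammar symbols (ε if all can vanish)."""
    result = set()
    for symbol in symbols:
        symbol_first = FIRST.get(symbol, {symbol})   # terminal: itself
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result

def compute_follow(grammar, start):
    follow = {nonterminal: set() for nonterminal in grammar}
    follow[start].add(END)                           # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, symbol in enumerate(body):
                    if symbol not in grammar:
                        continue                     # only nonterminals get Follow sets
                    rest_first = first_of_string(body[i + 1:])
                    new = rest_first - {EPS}         # rule 2
                    if EPS in rest_first:
                        new |= follow[head]          # rule 3
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow

FOLLOW = compute_follow(GRAMMAR, "E")
```

Here `FOLLOW["E"]` and `FOLLOW["E'"]` are {), $}, `FOLLOW["T"]` and `FOLLOW["T'"]` are {+, ), $}, and `FOLLOW["F"]` is {+, *, ), $}.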
LL(1) Grammars
• Predictive parsers are those recursive descent parsers needing no
backtracking
• Grammars for which we can create predictive parsers are called LL(1)
• The first L means scanning input from left to right
• The second L means leftmost derivation
• And 1 stands for using one input symbol for lookahead
• A grammar G is LL(1) if and only if whenever A-> α|β are two distinct
productions of G, the following conditions hold:
• For no terminal a do α and β both derive strings beginning with a
• At most one of α or β can derive the empty string
• If α =>* ɛ then β does not derive any string beginning with a terminal in Follow(A).
Construction of Predictive Parsing table
• For each production A->α in grammar do the following:
1. For each terminal a in First(α) add A->α to M[A,a]
2. If ɛ is in First(α), then for each terminal b in Follow(A) add A->α to M[A,b].
If ɛ is in First(α) and $ is in Follow(A), add A->α to M[A,$] as well
• If after performing the above, there is no production in M[A,a] then
set M[A,a] to error
Example
Non-terminal | a | b | e | i | t | $
S | S -> a | | | S -> iEtSS’ | |
S’ | | | S’ -> Ɛ, S’ -> eS | | | S’ -> Ɛ
E | | E -> b | | | |
(M[S’, e] holds two productions, so this grammar is not LL(1).)
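The table-filling rules can be checked on this example with a short sketch. The First set of each production body and the Follow sets are entered by hand here (they are assumptions of the sketch, worked out from the productions S -> iEtSS’ | a, S’ -> eS | Ɛ, E -> b that appear in the table), and the names `PRODUCTIONS`, `FOLLOW` and `build_table` are illustrative.

```python
EPS = "ε"

# (head, body, First(body)) for each production; First sets entered by hand.
PRODUCTIONS = [
    ("S",  "iEtSS'", {"i"}),
    ("S",  "a",      {"a"}),
    ("S'", "eS",     {"e"}),
    ("S'", EPS,      {EPS}),
    ("E",  "b",      {"b"}),
]
# Follow sets, also entered by hand for this grammar.
FOLLOW = {"S": {"e", "$"}, "S'": {"e", "$"}, "E": {"t"}}

def build_table(productions, follow):
    """Return M as a dict from (nonterminal, terminal) to a list of productions."""
    table = {}
    for head, body, first in productions:
        lookaheads = first - {EPS}            # rule 1: terminals in First(α)
        if EPS in first:
            lookaheads |= follow[head]        # rule 2: terminals (and $) in Follow(A)
        for terminal in lookaheads:
            table.setdefault((head, terminal), []).append(f"{head} -> {body}")
    return table

TABLE = build_table(PRODUCTIONS, FOLLOW)
# Cells with more than one production are conflicts; missing cells are errors.
CONFLICTS = {cell: entries for cell, entries in TABLE.items() if len(entries) > 1}
```

The only multiply-defined cell is M[S', e], which receives both S' -> eS and S' -> ε; that is the dangling-else ambiguity showing up in the table.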
Bottom-Up Parsing
• A bottom-up parser creates the parse tree of the given input
starting from leaves towards the root.
• A bottom-up parser tries to find the right-most derivation of the
given input in the reverse order.
S ⇒ ... ⇒ ω (the right-most derivation of ω)
← (the bottom-up parser finds the right-most derivation in the reverse order)
• At each reduction step, a substring of the input matching to the right side of a production rule is replaced by
the non-terminal at the left side of that production rule.
• If the substring is chosen correctly, the right most derivation of that string is created in the reverse order.
Rightmost derivation: S =>* ω (each step rewrites the rightmost non-terminal, written rm)
• If the grammar is unambiguous, then every right-sentential form of the grammar has exactly
one handle.
• We will see that ω is a string of terminals.
1. Shift : The next input symbol is shifted onto the top of the stack.
2. Reduce: Replace the handle on the top of the stack by the non-terminal.
3. Accept: Successful completion of parsing.
4. Error: Parser discovers a syntax error, and calls an error recovery routine.
SLR
2. LR-Parsers
• covers wide range of grammars.
• SLR – simple LR parser
• LR – most general LR parser (canonical LR)
• LALR – intermediate LR parser (lookahead LR parser)
• SLR, LR and LALR work the same way; only their parsing tables are different.
CS416 Compiler Design 51
Actions of A LR-Parser
1. shift s -- shifts the next input symbol and the state s onto the stack
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ ) ⇒ ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )
4. Error -- Parser detected an error (an empty entry in the action table)
Reduce Action
• pop 2|β| (= 2r) items from the stack, where r = |β|; let us assume that β = Y1Y2...Yr
• then push A and s where s=goto[Sm-r,A]
A → .aBb
A → a.Bb
A → aB.b
A → aBb.
(four different possibilities)
• Sets of LR(0) items will be the states of action and goto table of the SLR parser.
• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis for
constructing SLR parsers.
• Augmented Grammar:
G’ is G with a new production rule S’→S where S’ is the new starting symbol.
The Closure Operation
• If I is a set of LR(0) items for a grammar G, then closure(I) is the
set of LR(0) items constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B→γ is a production rule of G, then
B → .γ will be in closure(I). We will apply
this rule until no more new LR(0) items can be added to closure(I).
Example:
I = { E’ → .E, E → .E+T, E → .T,
      T → .T*F, T → .F,
      F → .(E), F → .id }
goto(I,E) = { E’ → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = { T → F. }
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F,
              F → .(E), F → .id }
goto(I,id) = { F → id. }
• Algorithm:
C is { closure({S’ → .S}) }
repeat the following until no more sets of LR(0) items can be added to C.
for each I in C and each grammar symbol X
if goto(I,X) is not empty and not in C
add goto(I,X) to C
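The closure, goto and canonical-collection steps can be sketched as follows for the augmented expression grammar E’ → E, E → E+T | T, T → T*F | F, F → (E) | id. The item representation (head, body, dot position) and all function names are assumptions made for this sketch.

```python
# Augmented expression grammar; bodies are tuples of symbols.
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    """Closure of a set of LR(0) items; an item is (head, body, dot)."""
    result = set(items)
    work = list(items)
    while work:
        head, body, dot = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:   # dot is before a nonterminal B
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)      # add B -> .γ
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)

def canonical_collection():
    """All sets of LR(0) items reachable from closure({E' -> .E})."""
    collection = [closure({("E'", ("E",), 0)})]
    symbols = sorted({s for bodies in GRAMMAR.values() for body in bodies for s in body})
    for state in collection:              # the list grows while it is scanned
        for symbol in symbols:
            target = goto(state, symbol)
            if target and target not in collection:
                collection.append(target)
    return collection
```

For this grammar the initial set has seven items and the collection has twelve sets, which matches the twelve states (0 to 11) of the SLR table for the same grammar. The order in which the sets are discovered here need not match the slide numbering.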
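[Figure: transition diagram (DFA) of the sets of LR(0) items, including I5: F → id. ; edges are labelled with grammar symbols such as E, T, F, id, +, )]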
1. Construct the canonical collection of sets of LR(0) items for G’. C←{I0,...,In}
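Productions:
1) E → E+T
2) E → T
3) T → T*F
4) T → F
5) F → (E)
6) F → id

Action table (id + * ( ) $) and Goto table (E T F); - marks an empty entry:
state   id   +    *    (    )    $     E   T   F
0       s5   -    -    s4   -    -     1   2   3
1       -    s6   -    -    -    acc   -   -   -
2       -    r2   s7   -    r2   r2    -   -   -
3       -    r4   r4   -    r4   r4    -   -   -
4       s5   -    -    s4   -    -     8   2   3
5       -    r6   r6   -    r6   r6    -   -   -
6       s5   -    -    s4   -    -     -   9   3
7       s5   -    -    s4   -    -     -   -   10
8       -    s6   -    -    s11  -     -   -   -
9       -    r1   s7   -    r1   r1    -   -   -
10      -    r3   r3   -    r3   r3    -   -   -
11      -    r5   r5   -    r5   r5    -   -   -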
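A table-driven parser over this SLR table can be sketched as below. The dictionaries transcribe the action and goto entries of the table; the function name `lr_parse` is an assumption, and as a simplification the stack holds only states, so a reduction pops |β| entries instead of 2|β|.

```python
# Production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

ACTION = {
    0:  {"id": "s5", "(": "s4"},
    1:  {"+": "s6", "$": "acc"},
    2:  {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3:  {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4:  {"id": "s5", "(": "s4"},
    5:  {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6:  {"id": "s5", "(": "s4"},
    7:  {"id": "s5", "(": "s4"},
    8:  {"+": "s6", ")": "s11"},
    9:  {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}

def lr_parse(tokens):
    """Return the production numbers used, in the order the reductions happen."""
    stack = [0]                          # states only; grammar symbols are implicit
    tokens = list(tokens) + ["$"]
    pos = 0
    reductions = []
    while True:
        action = ACTION[stack[-1]].get(tokens[pos])
        if action is None:               # empty entry in the action table
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        if action == "acc":
            return reductions
        if action[0] == "s":             # shift: push the new state, consume input
            stack.append(int(action[1:]))
            pos += 1
        else:                            # reduce by production number n
            number = int(action[1:])
            head, length = PRODUCTIONS[number]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][head])
            reductions.append(number)
```

The returned list is a rightmost derivation in reverse: for id*id+id the reductions are by productions 6, 4, 6, 3, 2, 6, 4, 1.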
Problem
FOLLOW(A)={a,b}
FOLLOW(B)={a,b}
On input a: reduce by A → ε or reduce by B → ε (reduce/reduce conflict)
On input b: reduce by A → ε or reduce by B → ε (reduce/reduce conflict)
Constructing Canonical LR(1) Parsing Tables
• In the SLR method, state i makes a reduction by A→α when the current
token is a:
• if A→α. is in Ii and a is in FOLLOW(A)
LR(k) parsing.
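LR Parsing Algorithm
[Figure: LR parser model. Stack, from top to bottom: Sm, Xm, Sm-1, Xm-1, ..., S1, X1, S0; the parser produces output. Action Table: rows are states, columns are terminals and $, entries are one of four different actions. Goto Table: rows are states, columns are non-terminals, each item is a state number.]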
• Sm and ai decide the parser action by consulting the parsing action table. (The initial stack
contains just S0.)
• A state will contain A → α., a1
                       ...
                       A → α., an      where {a1,...,an} ⊆ FOLLOW(A)
• b in FIRST(βa) .
goto operation
• If I is a set of LR(1) items and X is a grammar symbol (terminal or
non-terminal), then goto(I,X) is defined as follows:
• If A → α.Xβ,a in I then every item
in closure({A → αX.β,a}) will be in goto(I,X).
• If A → α.Xβ,a in I then goto(I,X)= A → αX.β,a
• Shifting of dot one symbol ahead keeping look ahead symbol as it is
can be written as
A → α.β, a1/a2/.../an
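EXAMPLE
GRAMMAR:
1. S’ -> S
2. S -> CC
3. C -> aC
4. C -> d
SET OF ITEMS:
I0: S’ -> .S, $
    S -> .CC, $
    C -> .aC, a/d
    C -> .d, a/d
Goto(I0,S) = I1: S’ -> S., $
    (No goto and closure operations, because the . is at the end of the production rule)
Goto(I0,C) = I2: S -> C.C, $
    C -> .aC, $
    C -> .d, $
Goto(I0,a) = I3: C -> a.C, a/d
    C -> .aC, a/d
    C -> .d, a/d
Goto(I0,d) = I4: C -> d., a/d
Goto(I2,C) = I5: S -> CC., $
Goto(I2,a) = I6: C -> a.C, $
    C -> .aC, $
    C -> .d, $
Goto(I2,d) = I7: C -> d., $
Goto(I3,C) = I8: C -> aC., a/d
Goto(I3,a) = I3
Goto(I3,d) = I4
Goto(I6,C) = I9: C -> aC., $
Goto(I6,a) = I6
Goto(I6,d) = I7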
[Figure: DFA over the LR(1) item sets I0 to I9 for the grammar above; edges are labelled S, C, a and d]
Action table (a d $) and Goto table (S C); - marks an empty entry:
state   a    d    $        S   C
0       S3   S4   -        1   2
1       -    -    ACCEPT   -   -
2       S6   S7   -        -   5
3       S3   S4   -        -   8
4       R4   R4   -        -   -
5       -    -    R2       -   -
6       S6   S7   -        -   9
7       -    -    R4       -   -
8       R3   R3   -        -   -
9       -    -    R3       -   -
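Parsing the input string aadd
Stack       Input buffer   Action table      Goto table   Parsing action
$0          aadd$          action[0,a]=s3                 Shift
$0a3        add$           action[3,a]=s3                 Shift
$0a3a3      dd$            action[3,d]=s4                 Shift
$0a3a3d4    d$             action[4,d]=r4    [3,C]=8      Reduce by C -> d
$0a3a3C8    d$             action[8,d]=r3    [3,C]=8      Reduce by C -> aC
$0a3C8      d$             action[8,d]=r3    [0,C]=2      Reduce by C -> aC
$0C2        d$             action[2,d]=s7                 Shift
$0C2d7      $              action[7,$]=r4    [2,C]=5      Reduce by C -> d
$0C2C5      $              action[5,$]=r2    [0,S]=1      Reduce by S -> CC
$0S1        $              action[1,$]=acc                Accept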
I6: S → L=.R,$      (R to I9, L to I10, * to I11, id to I12)
    R → .L,$
    L → .*R,$
    L → .id,$
I7: L → *R.,$/=
I8: R → L.,$/=
I9: S → L=R.,$
I10: R → L.,$
I11: L → *.R,$      (R to I13, L to I10, * to I11, id to I12)
     R → .L,$
     L → .*R,$
     L → .id,$
I12: L → id.,$
I13: L → *R.,$
States with the same cores: I4 and I11, I5 and I12, I7 and I13, I8 and I10
Construction of LR(1) Parsing Tables
1. Construct the canonical collection of sets of LR(1) items for G’. C←{I0,...,In}
• If any conflicting actions generated by these rules, the grammar is not LR(1).
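Rows 6 to 13 of the LR(1) parsing table (columns id * = $ for actions, S L R for gotos; - marks an empty entry):
state   id    *     =    $     S   L    R
6       s12   s11   -    -     -   10   9
7       -     -     r3   r3    -   -    -
8       -     -     r5   r5    -   -    -
9       -     -     -    r1    -   -    -
10      -     -     -    r5    -   -    -
11      s12   s11   -    -     -   10   13
12      -     -     -    r4    -   -    -
13      -     -     -    r3    -   -    -
⇓
so, it is an LR(1) grammar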
• LALR parsers are often used in practice because LALR parsing tables
are smaller than LR(1) parsing tables.
• The number of states in SLR and LALR parsing tables for a grammar
G are equal.
• But LALR parsers recognize more grammars than SLR parsers.
• yacc creates a LALR parser for the given grammar.
• A state of LALR parser will be again a set of LR(1) items.
Creating LALR Parsing Tables
Ex: S → L.=R,$  ⇒  S → L.=R   (core)
    R → L.,$    ⇒  R → L.
• We will find the states (sets of LR(1) items) in a canonical LR(1) parser with same cores. Then we will merge them as a
single state.
I1: L → id., =
I2: L → id., $
I1 and I2 have the same core, so merge them ⇒ a new state I12: L → id., =
                                                               L → id., $
• We will do this for all states of a canonical LR(1) parser to get the states of the LALR parser.
• In fact, the number of the states of the LALR parser for a grammar will be equal to the number of states of the SLR
parser for that grammar.
Creation of LALR Parsing Tables
• Create the canonical LR(1) collection of the sets of LR(1) items for the given
grammar.
• Find each core; find all sets having that same core; replace those sets having same
cores with a single set which is their union.
C={I0,...,In} 🡺 C’={J1,...,Jm} where m ≤ n
• Create the parsing tables (action and goto tables) same as the construction of the
parsing tables of LR(1) parser.
• Note that: if J=I1 ∪ ... ∪ Ik, then since I1,...,Ik have the same cores,
the cores of goto(I1,X),...,goto(Ik,X) must be the same.
• So, goto(J,X)=K where K is the union of all sets of items having the same core as goto(I1,X).
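The merge step can be sketched as grouping LR(1) item sets by their cores. The item encoding (head, body, dot, lookahead) with one lookahead per item, and the names `core` and `merge_by_core`, are assumptions made for illustration; the sketch covers only the merging of states, not the rebuilding of the action and goto tables.

```python
from collections import defaultdict

def core(state):
    """Core of an LR(1) item set: its items with the lookaheads dropped."""
    return frozenset((head, body, dot) for head, body, dot, _lookahead in state)

def merge_by_core(states):
    """Union all LR(1) item sets that share a core; first-seen order is kept."""
    groups = defaultdict(set)
    for state in states:
        groups[core(state)] |= set(state)
    return [frozenset(items) for items in groups.values()]

# Two states with the same core (L -> id.) and one state with a different core.
STATE_A = frozenset({("L", ("id",), 1, "=")})
STATE_B = frozenset({("L", ("id",), 1, "$")})
STATE_C = frozenset({("R", ("L",), 1, "$")})

MERGED = merge_by_core([STATE_A, STATE_B, STATE_C])
```

`MERGED` contains two sets: the union of the two L → id. states, carrying lookaheads = and $, and the untouched R → L. state, so the number of merged sets m is at most the number of original sets n.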
I1: A → α., a        I2: A → α., b
    B → β., b            B → β., c
⇓
I12: A → α., a/b     ⇒ reduce/reduce conflict
     B → β., b/c
Grammar:
1) S → L=R
2) S → R
3) L → *R
4) L → id
5) R → L

Sets of items after merging (transitions in parentheses):
I0: S’ → .S,$       (S to I1, L to I2, R to I3, * to I411, id to I512)
    S → .L=R,$
    S → .R,$
    L → .*R,$/=
    L → .id,$/=
    R → .L,$
I1: S’ → S.,$
I2: S → L.=R,$      (= to I6)
    R → L.,$
I3: S → R.,$
I411: L → *.R,$/=   (R to I713, L to I810, * to I411, id to I512)
      R → .L,$/=
      L → .*R,$/=
      L → .id,$/=
I512: L → id.,$/=
I6: S → L=.R,$      (R to I9, L to I810, * to I411, id to I512)
    R → .L,$
    L → .*R,$
    L → .id,$
I713: L → *R.,$/=
I810: R → L.,$/=
I9: S → L=R.,$
Same cores: I4 and I11, I5 and I12, I7 and I13, I8 and I10
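LALR(1) Parsing Tables (for Example2)
Merged states I411, I512, I713 and I810 are numbered 4, 5, 7 and 8; - marks an empty entry:
state   id   *    =    $     S   L   R
0       s5   s4   -    -     1   2   3
1       -    -    -    acc   -   -   -
2       -    -    s6   r5    -   -   -
3       -    -    -    r2    -   -   -
4       s5   s4   -    -     -   8   7
5       -    -    r4   r4    -   -   -
6       s5   s4   -    -     -   8   9
7       -    -    r3   r3    -   -   -
8       -    -    r5   r5    -   -   -
9       -    -    -    r1    -   -   -
no shift/reduce or reduce/reduce conflict
⇓
so, it is an LALR(1) grammar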
• Ex.
E → E+E | E*E | (E) | id    ⇒    E → E+T | T
                                  T → T*F | F
                                  F → (E) | id
Sets of LR(0) Items for Ambiguous Grammar
I0: E’ → .E, E → .E+E, E → .E*E, E → .(E), E → .id
I1: E’ → E., E → E.+E, E → E.*E
I2: E → (.E), E → .E+E, E → .E*E, E → .(E), E → .id
I3: E → id.
I4: E → E+.E, E → .E+E, E → .E*E, E → .(E), E → .id
I5: E → E*.E, E → .E+E, E → .E*E, E → .(E), E → .id
I6: E → (E.), E → E.+E, E → E.*E
I7: E → E+E., E → E.+E, E → E.*E
I8: E → E*E., E → E.+E, E → E.*E
I9: E → (E).
[Figure: transition diagram over I0 to I9; edges are labelled E, +, *, (, ) and id]
I0 --E--> I1 --+--> I4 --E--> I7
I0 --E--> I1 --*--> I5 --E--> I8
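Productions: 1) E → E+E   2) E → E*E   3) E → (E)   4) E → id; - marks an empty entry:
state   id   +    *    (    )    $     E
0       s3   -    -    s2   -    -     1
1       -    s4   s5   -    -    acc   -
2       s3   -    -    s2   -    -     6
3       -    r4   r4   -    r4   r4    -
4       s3   -    -    s2   -    -     7
5       s3   -    -    s2   -    -     8
6       -    s4   s5   -    s9   -     -
7       -    r1   s5   -    r1   r1    -
8       -    r2   r2   -    r2   r2    -
9       -    r3   r3   -    r3   r3    -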
Error Recovery in LR Parsing
• An LR parser will detect an error when it consults the parsing action table and
finds an error entry. All empty entries in the action table are error entries.
• Errors are never detected by consulting the goto table.
• An LR parser will announce error as soon as there is no valid continuation for
the scanned portion of the input.
• A canonical LR parser (LR(1) parser) will never make even a single reduction
before announcing an error.
• The SLR and LALR parsers may make several reductions before announcing an
error.
• But, all LR parsers (LR(1), LALR and SLR parsers) will never shift an erroneous
input symbol onto the stack.
Panic Mode Error Recovery in LR Parsing
• Scan down the stack until a state s with a goto on a particular
nonterminal A is found. (Get rid of everything from the stack before
this state s).
• Discard zero or more input symbols until a symbol a is found that can
legitimately follow A.
• The symbol a is simply in FOLLOW(A), but this may not work for all situations.
• The parser stacks the nonterminal A and the state goto[s,A], and it
resumes the normal parsing.
• This nonterminal A is normally a basic programming block (there
can be more than one choice for A).
• stmt, expr, block, ...
Phrase-Level Error Recovery in LR Parsing
• Each empty entry in the action table is marked with a specific error
routine.
• An error routine reflects the error that the user most likely will
make in that case.
• An error routine inserts the symbols into the stack or the input (or it
deletes the symbols from the stack and the input, or it can do both
insertion and deletion).
• missing operand
• unbalanced right parenthesis
[Figure: YACC program → yacc → C compiler (cc or gcc)]
%%
Rules
%%
Supplementary Code
Definitions Section
Example
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token ID NUM   /* ID and NUM are declared as terminals (tokens) */
%start expr     /* expr is the start symbol */
• Example