Compiler Design Material
CE & IT Department
Hand Book
Compiler Design (2170701)
Year: 2020
Fig.1.1. A Translator
2. What is compiler?
A compiler is a program that reads a program written in one language and
translates it into an equivalent program in another language.
Fig.1.2. A Compiler
Major functions done by compiler:
● A compiler converts one form of a program into another.
● A compiler should convert the source program into target machine code in
such a way that the generated target code is easy to understand.
● A compiler should preserve the meaning of the source code.
● A compiler should report errors that occur during the compilation process.
● The compilation must be done efficiently.
Syntax Analysis:
● Syntax analysis is also called hierarchical analysis or parsing.
● The syntax analyzer checks the code against the grammar of the language and
detects every syntactic mistake the programmer has made while writing it.
● If the code is error free, the syntax analyzer generates a parse tree.
Semantic analysis:
● The semantic analyzer determines the meaning of a source string.
● Examples: matching of parentheses in an expression, matching of if..else
statements, performing arithmetic operations on operands that are type
compatible, or checking the scope of operations.
Code optimization:
● The code optimization phase attempts to improve the intermediate code.
● This is necessary to obtain faster executing code or code that consumes less memory.
● Thus, by optimizing the code, the overall running time of the target program can be
improved.
Code generation:
The intermediate instructions are translated into a sequence of machine instructions, for example:
MOV id3, R1
MUL #2.0, R1
MOV id2, R2
MUL R2, R1
MOV id1, R2
ADD R2, R1
MOV R1, id1
Symbol Table
○ A symbol table is a data structure used by a language translator such as a
compiler or interpreter.
○ It is used to store names encountered in the source program, along with
the relevant attributes for those names.
○ Information about following entities is stored in the symbol table.
■ Variable/Identifier
■ Procedure/function
■ Keyword
■ Constant
■ Class name
■ Label name
● The lexical analyzer is the first phase of the compiler. Its main task is to read the
input characters and produce as output a sequence of tokens that the parser
uses for syntax analysis.
● It is usually implemented as a subroutine of the parser.
● Upon receiving a "get next token" command from the parser, the lexical analyzer
reads input characters until it can identify the next token (a sketch of this
interface follows below).
● It may also perform secondary tasks at the user interface.
● One such task is stripping comments and white space, in the form of blanks,
tabs, and newline characters, out of the source program.
● Some lexical analyzers are divided into a cascade of two phases, the first called
scanning and the second lexical analysis.
● The scanner is responsible for the simple tasks, while the lexical analyzer proper
handles the more complex ones.
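As a rough illustration of this "get next token" interface, the sketch below returns one token per call, stripping blanks, tabs and newlines first. The token codes, the lexeme buffer, and the treatment of every remaining character as a single-character operator are assumptions for illustration; comment stripping would be handled in the same loop.

#include <stdio.h>
#include <ctype.h>

enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF };       /* assumed token codes */

int get_next_token(FILE *src, char *lexeme)
{
    int c = fgetc(src);
    while (c == ' ' || c == '\t' || c == '\n')   /* strip white space */
        c = fgetc(src);
    if (c == EOF)
        return TOK_EOF;
    int n = 0;
    if (isalpha(c)) {                            /* id: letter (letter|digit)* */
        while (isalnum(c)) { lexeme[n++] = (char)c; c = fgetc(src); }
        ungetc(c, src);                          /* give back the lookahead */
        lexeme[n] = '\0';
        return TOK_ID;
    }
    if (isdigit(c)) {                            /* number: digit+ */
        while (isdigit(c)) { lexeme[n++] = (char)c; c = fgetc(src); }
        ungetc(c, src);
        lexeme[n] = '\0';
        return TOK_NUM;
    }
    lexeme[0] = (char)c; lexeme[1] = '\0';       /* anything else: operator */
    return TOK_OP;
}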
Issues in Lexical Analysis:
There are several reasons for separating the analysis phase of compiling into lexical
analysis and parsing:
● Simpler design is perhaps the most important consideration. The separation
of lexical analysis from parsing often allows us to simplify one or the other of these phases.
● Compiler efficiency is improved.
● Compiler portability is enhanced.
Token | Sample Lexemes | Informal Description of Pattern
if | if | if
relation | <, <=, =, <>, >=, > | < or <= or = or <> or >= or >
id | pi, count, n, i | letter followed by letters and digits
number | 3.14159, 0, 6.02e23 | any numeric constant
literal | "core" | any characters between " and " except "
Example 1:
total = sum + 12.5
Tokens are: total (id),
= (operator),
sum (id),
+ (operator),
12.5 (num)
Lexemes are: total, =, sum, +, 12.5
Operation on languages
Definitions of operations on languages:
Operation | Definition
Union of L and M, written L ∪ M | L ∪ M = {s | s is in L or s is in M}
Concatenation of L and M, written LM | LM = {st | s is in L and t is in M}
Kleene closure of L, written L* | L* denotes "zero or more concatenations of" L
Positive closure of L, written L+ | L+ denotes "one or more concatenations of" L
5. Regular Expression
Regular expressions are used to denote regular languages. An expression is regular if:
● ɸ is a regular expression for the regular language ɸ.
● ɛ is a regular expression for the regular language {ɛ}.
● If a ∈ Σ (Σ represents the input alphabet), then a is a regular expression with language {a}.
● If a and b are regular expressions, then a + b is also a regular expression, with language
{a, b}.
● If a and b are regular expressions, then ab (the concatenation of a and b) is also regular.
● If a is a regular expression, then a* (0 or more repetitions of a) is also regular.
Regular Grammar: A grammar is regular if it has rules of the form A -> a, A -> aB, or A -> ɛ,
where ɛ is a special symbol denoting the empty string.
Regular Languages : A language is regular if it can be expressed in terms of regular expression.
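As a hedged aside, regular expressions in roughly this notation can be tried out with the POSIX regex library in C; POSIX writes union as | rather than +. The pattern and test strings below are illustrative only; a compiler's lexer would compile such expressions into automata instead.

#include <stdio.h>
#include <regex.h>

int main(void)
{
    regex_t re;
    const char *tests[] = { "abb", "aabb", "babb", "ab" };
    /* (a+b)*abb in the notation above, written in POSIX syntax: */
    regcomp(&re, "^(a|b)*abb$", REG_EXTENDED | REG_NOSUB);
    for (int i = 0; i < 4; i++)
        printf("%-5s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accepted" : "rejected");
    regfree(&re);
    return 0;
}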
Closure Properties of Regular Languages
Union: If L1 and L2 are two regular languages, their union L1 ∪ L2 will also be regular. For
example, L1 = {aⁿ | n ≥ 0} and L2 = {bⁿ | n ≥ 0}
L3 = L1 ∪ L2 = {aⁿ | n ≥ 0} ∪ {bⁿ | n ≥ 0} is also regular.
Intersection: If L1 and L2 are two regular languages, their intersection L1 ∩ L2 will also be
regular. For example,
L1 = {aᵐbⁿ | n ≥ 0 and m ≥ 0} and L2 = {aᵐbⁿ ∪ bⁿaᵐ | n ≥ 0 and m ≥ 0}
L3 = L1 ∩ L2 = {aᵐbⁿ | n ≥ 0 and m ≥ 0} is also regular.
Concatenation: If L1 and L2 are two regular languages, their concatenation L1.L2 will also be
regular. For example,
L1 = {aⁿ | n ≥ 0} and L2 = {bⁿ | n ≥ 0}
L3 = L1.L2 = {aᵐbⁿ | m ≥ 0 and n ≥ 0} is also regular.
Kleene Closure : If L1 is a regular language, its Kleene closure L1* will also be regular. For
example,
L1 = (a ∪ b)
L1* = (a ∪ b)*
Complement: If L(G) is a regular language, its complement L'(G) will also be regular. The
complement of a language can be found by subtracting the strings in L(G) from the set of all
possible strings. For example,
L(G) = {aⁿ | n > 3}
L'(G) = {aⁿ | n ≤ 3}
Example 1:
Write the regular expression for the language accepting all combinations of a's, over the set ∑
= {a}
Solution:
All combinations of a's means a may appear zero times, once, twice, and so on. If a appears
zero times, we get the null string. That is, we expect the set {ε, a, aa, aaa, ....}. So the
regular expression for this language is:
R = a*
Example 2:
Write the regular expression for the language accepting all combinations of a's except the null
string, over the set ∑ = {a}
Solution:
This set indicates that there is no null string. So we can denote regular expression as:
R = a+
Example 3:
Write the regular expression for the language accepting all the string containing any number of
a's and b's.
Solution:
r.e. = (a + b)*
This gives the set L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, i.e., any combination of
a and b, including the null string.
Example 4:
Write the regular expression for the language of strings starting and ending with a and having
any combination of b's in between.
Solution:
R = a b* a
Example 5:
Write the regular expression for the language starting with a but not having consecutive b's.
Solution:
R = (a + ab)*
Example 6:
Write the regular expression for the language accepting all the string in which any number of
a's is followed by any number of b's is followed by any number of c's.
Solution: As we know, any number of a's means a* any number of b's means b*, any number of
c's means c*. Since as given in problem statement, b's appear after a's and c's appear after b's.
So the regular expression could be:
R = a* b* c*
Example 7:
Write the regular expression for the language over ∑ = {0} having even length of the string.
Solution:
The regular expression has to be built for the language:
R = (00)*
Example 8:
Write the regular expression for the language of strings which contain at least one 0
and at least one 1.
Solution:
R = [(0 + 1)* 0 (0 + 1)* 1 (0 + 1)*] + [(0 + 1)* 1 (0 + 1)* 0 (0 + 1)*]
Example 9:
Describe the language denoted by the following regular expression:
r.e. = (b* (aaa)* b*)*
Solution:
The language can be predicted from the regular expression by finding its meaning. The
expression repeats blocks built from b's and triples of a's, so:
L = {strings in which a's appear in triples; there is no restriction on the
number of b's}
Example 10:
Write the regular expression for the language L over ∑ = {0, 1} such that no string
contains the substring 01.
Solution:
The required regular expression is:
R = 1* 0*
Example 11:
Write the regular expression for the language containing the strings over {0, 1} in which there
are at least two occurrences of 1's between any two occurrences of 0's.
Solution: At least two 1's between two occurrences of 0's can be denoted by (0111*0)*.
Similarly, if there is no occurrence of 0's, then any number of 1's are also allowed. Hence the
r.e. for required language is:
R = (1 + (0111*0))*
Example 12:
Write the regular expression for the language containing the string in which every 0 is
immediately followed by 11.
Solution:
R = (011 + 1)*
The following rules are depicted according to Aho et al. (2007), p. 122. In what follows, N(s)
and N(t) are the NFAs of the subexpressions s and t, respectively.
The empty expression ε is converted to a two-state NFA whose initial state reaches its final
state by a single ε-transition; a symbol a of the alphabet is converted in the same way, with
the edge labelled a instead of ε.
The union expression s|t is converted as follows: a new initial state q goes via ε either to
the initial state of N(s) or to that of N(t). Their final states become intermediate states of
the whole NFA and merge via two ε-transitions into the final state of the NFA.
The concatenation expression st is converted to
The initial state of N(s) is the initial state of the whole NFA. The final state of N(s) becomes the
initial state of N(t). The final state of N(t) is the final state of the whole NFA.
The Kleene star expression s* is converted to
An ε-transition connects initial and final state of the NFA with the sub-NFA N(s) in between.
Another ε-transition from the inner final to the inner initial state of N(s) allows for repetition
of expression s according to the star operator.
The parenthesized expression (s) is converted to N(s) itself.
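A minimal C sketch of these rules follows, under an assumed fragment representation in which each state carries at most two outgoing edges and the label 0 stands for ε (the function and type names are illustrative):

#include <stdlib.h>

typedef struct State {
    int c;                       /* label of the outgoing edge(s); 0 = epsilon */
    struct State *out1, *out2;   /* at most two outgoing edges per state       */
} State;

typedef struct { State *start, *accept; } Frag;   /* an NFA fragment N(s) */

static State *mkstate(int c, State *o1, State *o2)
{
    State *s = malloc(sizeof *s);
    s->c = c; s->out1 = o1; s->out2 = o2;
    return s;
}

/* N(a): initial state -> a -> final state */
Frag sym(int a)
{
    State *acc = mkstate(0, NULL, NULL);
    return (Frag){ mkstate(a, acc, NULL), acc };
}

/* N(st): the final state of N(s) becomes the initial state of N(t) */
Frag cat(Frag s, Frag t)
{
    *s.accept = *t.start;            /* merge the two states into one */
    return (Frag){ s.start, t.accept };
}

/* N(s|t): a new initial state with epsilon edges into N(s) and N(t);
   their final states merge via epsilon edges into a new final state */
Frag alt(Frag s, Frag t)
{
    State *acc = mkstate(0, NULL, NULL);
    s.accept->out1 = acc;
    t.accept->out1 = acc;
    return (Frag){ mkstate(0, s.start, t.start), acc };
}

/* N(s*): epsilon edges allow skipping N(s) entirely or repeating it */
Frag star(Frag s)
{
    State *acc = mkstate(0, NULL, NULL);
    s.accept->out1 = s.start;        /* inner final back to inner initial */
    s.accept->out2 = acc;            /* leave the loop                    */
    return (Frag){ mkstate(0, s.start, acc), acc };
}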
9. Conversion from NFA to DFA using subset construction.
Ex:1 (a+b)*abb
● Move(B, a) = {3,8}
ε-closure(Move(B, a)) = {1,2,3,4,6,7,8} = B
Move(B, b) = {5,9}
ε-closure(Move(B, b)) = {1,2,4,5,6,7,9} = D (a new state)
● Move(C, a) = {3,8}
ε-closure(Move(C, a)) = {1,2,3,4,6,7,8} = B
Move(C, b) = {5}
ε-closure(Move(C, b)) = {1,2,4,5,6,7} = C
● Move(D, a) = {3,8}
ε-closure(Move(D, a)) = {1,2,3,4,6,7,8} = B
Move(D, b) = {5,10}
ε-closure(Move(D, b)) = {1,2,4,5,6,7,10} = E (a new state)
● Move(E, a) = {3,8}
ε-closure(Move(E, a)) = {1,2,3,4,6,7,8} = B
Move(E, b) = {5}
ε-closure(Move(E, b)) = {1,2,4,5,6,7} = C
State | a | b
A     | B | C
B     | B | D
C     | B | C
D     | B | E
E     | B | C
Table: Transition table for (a+b)*abb
Fig. DFA for (a+b)*abb
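A minimal C sketch of the two operations used above, Move and ε-closure, assuming the NFA is stored as adjacency matrices and a state set is a bit array (the state count and representation are assumptions for illustration):

#include <string.h>

#define NSTATES 11                    /* assumed number of NFA states */

int eps[NSTATES][NSTATES];            /* eps[i][j] = 1: epsilon edge i -> j */
int delta[NSTATES][NSTATES][2];       /* delta[i][j][c] = 1: edge i -> j on symbol c */

/* Expand the state set T to its epsilon-closure, iterating to a fixed point. */
void eps_closure(int T[NSTATES])
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < NSTATES; i++)
            if (T[i])
                for (int j = 0; j < NSTATES; j++)
                    if (eps[i][j] && !T[j]) { T[j] = 1; changed = 1; }
    }
}

/* out = Move(T, c): the states reachable from T on input symbol c. */
void move(const int T[NSTATES], int c, int out[NSTATES])
{
    memset(out, 0, NSTATES * sizeof out[0]);
    for (int i = 0; i < NSTATES; i++)
        if (T[i])
            for (int j = 0; j < NSTATES; j++)
                if (delta[i][j][c]) out[j] = 1;
}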
10. DFA Optimization
To optimize (minimize) a DFA, follow these steps:
Step 1: Remove all the states that are unreachable from the initial state via any sequence of
transitions of the DFA.
Step 2: Draw the transition table for the remaining states.
Step 3: Split the transition table into two tables, T1 and T2. T1 contains the final states and
T2 contains the non-final states.
Step 4: Find similar rows in T1, i.e. states q and r such that
δ (q, a) = p
δ (r, a) = p
for every input symbol. That means: find two states which have the same transitions on a and b,
and remove one of them.
Step 5: Repeat step 4 until no similar rows remain in table T1.
Step 6: Repeat steps 4 and 5 for table T2.
Step 7: Now combine the reduced T1 and T2 tables. The combined transition table is the
transition table of the minimized DFA.
Example
Solution:
Step 1: In the given DFA, q2 and q4 are the unreachable states, so remove them.
Step 2: Draw the transition table for the remaining states.
Step 3:
1. One set contains those rows which start from non-final states.
2. The other set contains those rows which start from final states.
Step 4: Set 1 has no similar rows, so set 1 stays the same.
Step 5: In set 2, row 1 and row 2 are similar, since q3 and q5 transit to the same states on 0
and 1. So remove q5 and replace q5 by q3 in the remaining rows.
1. Role of Parser
● In our compiler model, the parser obtains a string of tokens from the lexical
analyzer, as shown in the figure.
● We expect the parser to report any syntax error and to recover from commonly
occurring errors so that it can continue processing the remainder of its input.
● The methods commonly used for parsing are classified as top-down or
bottom-up parsing.
● In top-down parsing, the parser builds the parse tree from the top (root) to the
bottom, while a bottom-up parser starts from the leaves and works up to the root.
● In both cases, the input to the parser is scanned from left to right,
one symbol at a time.
● We assume the output of parser is some representation of the parse
tree for the stream of tokens produced by the lexical analyzer.
2. Types of Parsing.
Parsing or syntactic analysis is the process of analyzing a string of symbols according to
the rules of a formal grammar.
a. Parsing is a technique that takes an input string and produces either a
parse tree, if the string is a valid sentence of the grammar, or an error message
indicating that the string is not a valid sentence of the given grammar. The types of parsing are:
Top-down parsing: the parser builds the parse tree from the top (root) down.
Bottom-up parsing: the parser starts from the leaves and works up to the
root.
3. Top Down Parser
The top-down parsing technique parses the input, and starts constructing a parse
tree from the root node gradually moving down to the leaf nodes. The types of
top-down parsing are depicted below:
4. Problems with Top-Down parsing.
4.1.Backtracking
Top- down parsers start from the root node (start symbol) and match the input string against
the production rules to replace them (if matched). To understand this, take the following
example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser will behave as follows:
It starts with S from the production rules and matches its yield to the left-most letter of
the input, i.e. 'r'. The first production of S (S → rXd) matches it, so the top-down parser
advances to the next input letter, 'e'. The parser tries to expand the non-terminal X and
checks its first production from the left (X → oa). This does not match the next input symbol,
so the top-down parser backtracks to obtain the next production of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
4.2 Left Recursion
S --> S a / b
A grammar of this form is called left recursive, where S is any non-terminal and a and b are
any strings of terminals.
If left recursion is present in a grammar, then during parsing in the syntax analysis part
of compilation there is a chance that the parser will loop forever. This is because at
every step of the derivation S produces another S without consuming any input.
Example: consider the grammar
S --> S a / S b / c / d
1. Check if the given grammar contains left recursion, if present then separate the
production and start working on it.
In our example,
S-->S a/ S b /c / d
2. Introduce a new nonterminal and place it at the end of every production that does
not begin with S. We produce a new nonterminal S' and write the new productions as
S --> cS' / dS'
3. Write the newly produced nonterminal on the LHS; on the RHS it can either produce ε
or produce the symbols that followed the previous LHS, with the new nonterminal again
at the end:
S' --> ε / aS' / bS'
So after conversion, the equivalent non-left-recursive grammar is
S --> cS' / dS'
S' --> ε / aS' / bS'
4.3 Left Factoring
A -> αβ1 | αβ2 | αβ3 | …… | αβn | γ
Here the productions start with the same prefix α. On seeing the input α we cannot
immediately tell which production to choose to expand A.
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive or top-down parsing. When the choice between two alternative A-productions is not
clear, we may be able to rewrite the productions to defer the decision until enough of the input
has been seen to make the right choice:
A -> αA’ | γ
A’ -> β1 | β2 | β3 | …… | βn
On the basis of the number of derivation trees, grammars are of two types:
● Ambiguous grammars
● Unambiguous grammars
Ambiguous grammar:
A CFG is said to be ambiguous if there exists more than one derivation tree for some input
string, i.e., more than one LeftMost Derivation Tree (LMDT) or RightMost Derivation Tree
(RMDT).
Definition: A CFG G = (V, T, P, S) is said to be ambiguous if and only if there exists a string
in T* that has more than one derivation tree, where V is a set of variables, T a set of
terminals, P a finite set of productions of the form A -> α (A a variable, α ∈ (V ∪ T)*),
and S the designated start symbol.
For example, consider the grammar E -> E + E | id. We can create two parse trees from this
grammar for the string id+id+id.
The following are the two parse trees generated by leftmost derivation. Both parse trees are
derived from the same grammar rules, yet they differ, so the grammar is ambiguous.
E -> I
E -> E + E
E -> E * E
E -> (E)
I -> ε | 0 | 1 | … | 9
Leftmost derivation 1:
E => E*E => I*E => 3*E => 3*E+E => 3*I+E => 3*2+E => 3*2+I => 3*2+5
Leftmost derivation 2:
E => E+E => E*E+E => I*E+E => 3*E+E => 3*I+E => 3*2+E => 3*2+I => 3*2+5
● S-> aS |Sa| Є
● E-> E +E | E*E| id
● A -> AA | (A) | a
● S -> SS|AB , A -> Aa|a , B -> Bb|b
Grammar with left recursion:      After removing left recursion:
E –> E + T | T                    E –> T E’
T –> T * F | F                    E’ –> + T E’ | ε
F –> ( E ) | id                   T –> F T’
                                  T’ –> * F T’ | ε
                                  F –> ( E ) | id
(Here ε denotes epsilon, the empty string.)
For a recursive descent parser, we write one procedure for every variable (non-terminal).
Example:
Grammar: E --> i E'
E' --> + i E' | e
#include <stdio.h>

char l;   /* lookahead symbol */

/* Match function */
void match(char t)
{
    if (l == t) l = getchar();
    else printf("Error\n");
}

void Edash()                        /* E' --> + i E' | e */
{
    if (l == '+') { match('+'); match('i'); Edash(); }
    /* else: E' --> e (epsilon), match nothing */
}

void E() { match('i'); Edash(); }   /* E --> i E' */

int main()
{
    l = getchar();                  /* first symbol of an input such as "i+i$" */
    E();                            /* E is the start symbol */
    if (l == '$') printf("Parsing successful\n");
    return 0;
}
● The data structures used by an LL(1) parser are an input buffer, a stack, and a parsing
table.
● The parser works as follows,
● The parsing program reads the top of the stack and the current input symbol.
With the help of these two symbols, the parsing action is determined.
● The parser consults the table M[A, a] each time while taking a parsing
action; hence this type of parsing method is also called the table-driven parsing
method.
● The input is successfully parsed if the parser reaches a halting
configuration: when the stack is empty and the next token is $, parsing
has succeeded.
Steps to construct LL(1) parser
1. Remove left recursion / Perform left factoring.
2. Compute FIRST and FOLLOW of nonterminals.
3. Construct predictive parsing table.
4. Parse the input string with the help of parsing table.
Example:
E → E+T | T
T → T*F | F
F → (E) | id
Step1: Remove left recursion
E → TE’
E’ → +TE’ | ϵ
T → FT’
T’ → *FT’ | ϵ
F → (E) | id
Step2: Compute FIRST & FOLLOW
     | FIRST   | FOLLOW
E    | {(, id} | {$, )}
E’   | {+, ϵ}  | {$, )}
T    | {(, id} | {+, $, )}
T’   | {*, ϵ}  | {+, $, )}
F    | {(, id} | {*, +, $, )}
     | id     | +        | *        | (      | )     | $
E    | E→TE’  |          |          | E→TE’  |       |
E’   |        | E’→+TE’  |          |        | E’→ϵ  | E’→ϵ
T    | T→FT’  |          |          | T→FT’  |       |
T’   |        | T’→ϵ     | T’→*FT’  |        | T’→ϵ  | T’→ϵ
F    | F→id   |          |          | F→(E)  |       |
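A minimal C sketch of the table-driven LL(1) loop for this grammar follows. The table above is encoded in M(); 'i' stands for id, and the lowercase letters e and t stand for E' and T' (an illustrative encoding, not a standard one).

#include <stdio.h>
#include <string.h>

/* M(A, a) returns the right-hand side to push, "" for epsilon, NULL for error. */
const char *M(char A, char a)
{
    switch (A) {
    case 'E': return (a == 'i' || a == '(') ? "Te" : NULL;
    case 'e': return a == '+' ? "+Te" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft" : NULL;
    case 't': return a == '*' ? "*Ft"
                   : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : a == '(' ? "(E)" : NULL;
    }
    return NULL;
}

int parse(const char *w)               /* w must end with '$' */
{
    char stack[100] = "$E";            /* $ at the bottom, start symbol on top */
    int top = 1;
    while (1) {
        char X = stack[top], a = *w;
        if (X == '$' && a == '$') return 1;        /* successful parse */
        if (X == a) { top--; w++; continue; }      /* match a terminal */
        const char *rhs = M(X, a);
        if (rhs == NULL) return 0;                 /* error entry */
        top--;                                     /* pop the nonterminal */
        for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
            stack[++top] = rhs[i];                 /* push the RHS reversed */
    }
}

int main(void)
{
    printf("%s\n", parse("i+i*i$") ? "accepted" : "rejected");
    return 0;
}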
7. Bottom Up Parsing
Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till
it reaches the root node. Here, we start from a sentence and then apply production
rules in reverse manner in order to reach the start symbol. The image given below
depicts the bottom-up parsers available.
8. Shift-Reduce Parsing
A shift-reduce parser attempts to construct the parse tree in the same manner as
bottom-up parsing, i.e. the parse tree is constructed from the leaves (bottom) to the root (up).
A more general form of the shift-reduce parser is the LR parser.
Basic Operations –
● Shift: This involves moving symbols from the input buffer onto the stack.
● Reduce: If the handle appears on top of the stack then, its reduction by using
appropriate production rule is done i.e. RHS of production rule is popped out of
stack and LHS of production rule is pushed onto the stack.
● Accept: If only the start symbol is present on the stack and the input buffer is empty,
then the parsing action is called accept. When the accept action is obtained, it
means parsing has completed successfully.
● Error: This is the situation in which the parser can neither perform shift action nor
reduce action and not even accept action.
Consider the grammar:
S –> S + S
S –> S * S
S –> id
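Using this grammar, one possible sequence of shift-reduce moves for the input id + id * id is shown below (where both a shift and a reduction are possible, this trace shifts):

Stack        | Input           | Action
$            | id + id * id $  | shift
$ id         | + id * id $     | reduce S –> id
$ S          | + id * id $     | shift
$ S +        | id * id $       | shift
$ S + id     | * id $          | reduce S –> id
$ S + S      | * id $          | shift
$ S + S *    | id $            | shift
$ S + S * id | $               | reduce S –> id
$ S + S * S  | $               | reduce S –> S * S
$ S + S      | $               | reduce S –> S + S
$ S          | $               | accept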
9. Operator Precedence Parsing
A grammar that is used to define mathematical operators in this way is called an operator
grammar or operator precedence grammar. Such grammars have the restriction that no production
has either an empty right-hand side (null production) or two adjacent non-terminals in its
right-hand side.
Example:
E -> E+E | E*E | id
However, the grammar given below is not an operator grammar, because two non-terminals appear
adjacent in a right-hand side:
S -> SAS | a
A -> bSb | b
It can, however, be converted into an equivalent operator grammar:
S -> SbSbS | SbS | a
An operator precedence parser is a bottom-up parser that interprets an operator grammar. This
parser is only used for operator grammars; in fact, ambiguous grammars are not allowed in any
parser except an operator precedence parser.
There are two methods for determining what precedence relations should hold between a pair
of terminals:
1. Use the conventional associativity and precedence of the operators.
2. Construct an unambiguous grammar for the language and derive the relations from it.
No relation is defined between id and id, because id is never compared with id and two
variables cannot appear side by side. There is also a disadvantage of this table: if we have
n operators, the size of the table will be n×n and the complexity will be O(n²). To decrease
the size of the table, operator precedence parsers usually do not store the table of precedence
relations; rather, they are implemented in a special way. Operator precedence parsers use
precedence functions that map terminal symbols to integers, and the precedence relations
between the symbols are implemented by numerical comparison. The parsing table can be encoded
by two precedence functions f and g that map terminal symbols to integers. We select f and g
such that:
● f(a) < g(b) whenever a yields precedence to b,
● f(a) = g(b) whenever a and b have the same precedence,
● f(a) > g(b) whenever a takes precedence over b.
Since there is no cycle in the graph, we can make this function table:
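For the usual operator grammar E -> E+E | E*E | id, the standard precedence functions (following Aho et al.) are:

   | +  | *  | id | $
f  | 2  | 4  | 4  | 0
g  | 1  | 3  | 5  | 0

For example, f(+) < g(*) means that + yields precedence to *, so * binds tighter, while f(+) > g(+) makes + left-associative.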
10. LR Parser
● LR parsing is the most efficient method of bottom-up parsing; it can be
used to parse a large class of context-free grammars.
● The technique is called LR(k) parsing; the "L" stands for left-to-right scanning
of the input symbols, the "R" for constructing a rightmost derivation in
reverse, and the k for the number of input symbols of lookahead that
are used in making parsing decisions.
● There are four types of LR parsing:
■ LR(0)
■ SLR (Simple LR)
■ CLR (Canonical LR)
■ LALR (Lookahead LR)
● The schematic form of an LR parser is given in figure 3.1.6.
● It consists of an input buffer for storing the input string, a stack for
storing the grammar symbols, an output stream, and a parsing table comprising
two parts, namely action and goto.
○ Properties of LR parser
● An LR parser can be constructed to recognize most programming-language
constructs for which a CFG can be written.
● The class of grammars that can be parsed by an LR parser is a proper superset of
the class of grammars that can be parsed using predictive parsers.
● An LR parser works using a non-backtracking shift-reduce technique.
● An LR parser can detect a syntactic error as soon as possible on a left-to-right
scan of the input.
11. LR(0) Parser with Example.
For the LR(0) parser we need two functions:
● Closure()
● Goto()
Augmented Grammar
If G is a grammar with start symbol S, then G', the augmented grammar for G, is G
with a new start symbol S' and a production S' -> S. The purpose of this new starting production
is to indicate to the parser when it should stop parsing and announce acceptance of the input.
Let the grammar be
S -> AA
A -> aA | b
The augmented grammar is then
S’ -> S
S -> AA
A -> aA | b
LR(0) Items
An LR(0) item of a grammar G is a production of G with a dot at some position in the
right-hand side. For example, the production S -> ABC yields the four items:
S -> .ABC
S -> A.BC
S -> AB.C
S -> ABC.
Closure Operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by the
two rules:
Eg:
Goto Operation:
Goto(I, X), where I is a set of items and X is a grammar symbol, is defined as the closure of
the set of all items A -> αX.β such that A -> α.Xβ is in I.
● The action function takes as arguments a state i and a terminal a (or $, the input
end marker). The value of ACTION[i, a] can have one of four forms:
1. Shift j, where j is a state.
2. Reduce A -> β.
3. Accept
4. Error
● We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii , A] =
Ij then GOTO also maps a state i and a nonterminal A to state j.
Eg: for the grammar
S -> AA
A -> aA | b
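For this grammar, the LR(0) construction yields a table that a generic driver can interpret. A minimal C sketch follows; the state numbering is one possible outcome of the construction and is an assumption for illustration.

#include <stdio.h>

typedef struct { char kind; int arg; } Act;   /* 's' shift, 'r' reduce, 'a' accept, 0 error */

int term(char c) { return c == 'a' ? 0 : c == 'b' ? 1 : 2; }   /* columns: a b $ */

Act action[7][3] = {
    /*  a         b         $       */
    { {'s',3}, {'s',4}, {0,0}   },  /* 0 */
    { {0,0},   {0,0},   {'a',0} },  /* 1: accept on $ */
    { {'s',3}, {'s',4}, {0,0}   },  /* 2 */
    { {'s',3}, {'s',4}, {0,0}   },  /* 3 */
    { {'r',3}, {'r',3}, {'r',3} },  /* 4: A -> b.  */
    { {'r',1}, {'r',1}, {'r',1} },  /* 5: S -> AA. */
    { {'r',2}, {'r',2}, {'r',2} },  /* 6: A -> aA. */
};
int go[7][2] = { {1,2}, {-1,-1}, {-1,5}, {-1,6}, {-1,-1}, {-1,-1}, {-1,-1} }; /* goto: S A */
int rhs_len[4] = { 1, 2, 2, 1 };   /* 0: S'->S  1: S->AA  2: A->aA  3: A->b */
int lhs_nt[4]  = { 0, 0, 1, 1 };   /* left-hand side: 0 = S, 1 = A */

int parse(const char *w)           /* w must end with '$' */
{
    int stack[100] = { 0 }, top = 0;          /* state stack, start state 0 */
    while (1) {
        Act act = action[stack[top]][term(*w)];
        if (act.kind == 's') { stack[++top] = act.arg; w++; }   /* shift */
        else if (act.kind == 'r') {                             /* reduce */
            top -= rhs_len[act.arg];                /* pop the RHS states */
            int g = go[stack[top]][lhs_nt[act.arg]];
            if (g < 0) return 0;
            stack[++top] = g;                       /* push GOTO state */
        }
        else return act.kind == 'a';                /* accept or error */
    }
}

int main(void)
{
    printf("%s\n", parse("aabb$") ? "accepted" : "rejected");
    return 0;
}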
The action part of the table contains all the terminals of the grammar, whereas the goto part
contains all the nonterminals. For every state of the goto graph we write all the goto
operations in the table. If goto is applied to a terminal, then it is written in the action
part; if goto is applied to a nonterminal, it is written in the goto part. If on applying goto
a production is reduced (i.e. the dot reaches the end of the production and no further closure
can be applied), then it is denoted as Ri, where i is the production number.
If a production is reduced, it is written under the terminals given by FOLLOW of the left-hand
side of the reduced production. For example, in I5 the production S -> AA is reduced, so R1 is
written under the terminals in FOLLOW(S).
If in a state the start symbol of the grammar is reduced, "Accept" is written under the $
symbol.
NOTE: If in any state both a shift and a reduction are possible, or two different reductions
are possible, it is called a conflict situation and the grammar is not an LR grammar.
NOTE:
1. Two reduce productions in one state – RR conflict.
2. One shift and one reduce production in one state – SR conflict.
If no SR or RR conflict is present in the parsing table, then the grammar is an LR(0) grammar.
State | +  | *  | a  | b  | $      | E | T | F
0     |    |    | S4 | S5 |        | 1 | 2 | 3
1     | S6 |    |    |    | Accept |   |   |
2     | R2 |    | S4 | S5 | R2     |   |   | 7
3     | R4 | S8 | R4 | R4 | R4     |   |   |
4     | R6 | R6 | R6 | R6 | R6     |   |   |
5     | R6 | R6 | R6 | R6 | R6     |   |   |
6     |    |    | S4 | S5 |        |   | 9 | 3
7     | R3 | S8 | R3 | R3 | R3     |   |   |
8     | R5 | R5 | R5 | R5 | R5     |   |   |
9     | R1 |    | S4 | S5 | R1     |   |   | 7
Table 3.1.16. SLR parsing table
State | a  | d  | $      | S | C
0     | S3 | S4 |        | 1 | 2
1     |    |    | Accept |   |
2     | S6 | S7 |        |   | 5
3     | S3 | S4 |        |   | 8
4     | R3 | R3 |        |   |
5     |    |    | R1     |   |
6     | S6 | S7 |        |   | 9
7     |    |    | R3     |   |
8     | R2 | R2 |        |   |
9     |    |    | R2     |   |
LALR Parsing.
● LALR is often used in practice because the tables obtained by it are considerably smaller
than those of canonical LR.
Example: S → CC
C → aC | d
Augmented grammar: S' → S. The initial item is S' → .S, $.
Closure(I):
I0: S' → .S, $
    S → .CC, $
    C → .aC, a|d
    C → .d, a|d
I1 = Goto(I0, S): S' → S., $
I2 = Goto(I0, C): S → C.C, $ ; C → .aC, $ ; C → .d, $
I3 = Goto(I0, a): C → a.C, a|d ; C → .aC, a|d ; C → .d, a|d
I4 = Goto(I0, d): C → d., a|d
I5 = Goto(I2, C): S → CC., $
I6 = Goto(I2, a): C → a.C, $ ; C → .aC, $ ; C → .d, $
I7 = Goto(I2, d): C → d., $
I8 = Goto(I3, C): C → aC., a|d
I9 = Goto(I6, C): C → aC., $
Table 3.1.19. Canonical LR(1) collection
Now we merge states 3 and 6, then 4 and 7, and 8 and 9:
I36: C → a.C, a|d|$
     C → .aC, a|d|$
     C → .d, a|d|$
I47: C → d., a|d|$
I89: C → aC., a|d|$
Parsing table:
State | a   | d   | $      | S | C
0     | S36 | S47 |        | 1 | 2
1     |     |     | Accept |   |
2     | S36 | S47 |        |   | 5
36    | S36 | S47 |        |   | 89
47    | R3  | R3  | R3     |   |
5     |     |     | R1     |   |
89    | R2  | R2  | R2     |   |
Table 3.1.20. LALR parsing table
a. An S-attributed definition is a grammar that contains all the syntactic rules along with
semantic rules using synthesized attributes only.
b. Such a grammar for converting infix notation to prefix notation is given below, using
'val' as the S-attribute.
Production | Semantic Rule
L → E      | print(E.val)
E → E+T    | E.val = '+' E.val T.val
E → E-T    | E.val = '-' E.val T.val
E → T      | E.val = T.val
T → T*F    | T.val = '*' T.val F.val
T → T/F    | T.val = '/' T.val F.val
T → F      | T.val = F.val
F → F^P    | F.val = '^' F.val P.val
F → P      | F.val = P.val
P → (E)    | P.val = E.val
P → digit  | P.val = digit.lexval
(Juxtaposition in the rules denotes string concatenation.)
Table 3.2.4 Syntax directed definition for infix to prefix notation
In syntax directed translation, every non-terminal can get one or more attributes, or
sometimes none, depending on the type of the attribute. The values of these attributes are
evaluated by the semantic rules associated with the production rules.
● In the semantic rules above, the attribute is val; an attribute may hold anything: a string,
a number, a memory location, or a complex record.
● A syntax directed translation scheme specifies the order in which the semantic rules are
evaluated.
● In a translation scheme, the semantic rules are embedded within the right-hand sides of the
productions.
Example:
S → E $   { print(E.VAL) }
Syntax directed translation is implemented by constructing a parse tree and performing the
actions in a left-to-right, depth-first order: the input is parsed and a parse tree is produced
as a result.
Example:
S → E $   { print(E.VAL) }
b. For the rule X → YZ, if the semantic action is given by X.x = f(Y.y, Z.z), then the
synthesized attribute X.x depends on the attributes Y.y and Z.z.
Algorithm to construct Dependency graph
for each node n in the parse tree do
    for each attribute a of the grammar symbol at n do
        construct a node in the dependency graph for a;
for each node n in the parse tree do
    for each semantic rule b = f(c1, c2, ..., ck) associated with the production used at n do
        for i = 1 to k do
            construct an edge from the node for ci to the node for b;
Example:
E → E1 + E2
E → E1 * E2
Production   | Semantic Rule
E → E1 + E2  | E.val = E1.val + E2.val
E → E1 * E2  | E.val = E1.val * E2.val
Table 3.2.5 semantic rules
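A minimal C sketch of evaluating the synthesized attribute val bottom-up over a syntax tree, matching the rules in Table 3.2.5 (the node layout is an assumption for illustration):

#include <stdio.h>

typedef struct Node {
    char op;                  /* '+' or '*', 0 for a digit leaf */
    int val;                  /* the synthesized attribute       */
    struct Node *left, *right;
} Node;

int eval(Node *n)
{
    if (n->op == 0) return n->val;               /* leaf: val = lexval */
    int l = eval(n->left), r = eval(n->right);   /* children evaluated first */
    n->val = (n->op == '+') ? l + r : l * r;     /* E.val = E1.val op E2.val */
    return n->val;
}

int main(void)
{
    Node three = {0, 3, NULL, NULL}, five = {0, 5, NULL, NULL}, four = {0, 4, NULL, NULL};
    Node mul = {'*', 0, &three, &five};
    Node add = {'+', 0, &mul, &four};            /* tree for 3*5+4 */
    printf("%d\n", eval(&add));                  /* prints 19 */
    return 0;
}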
These errors can be detected during the lexical analysis phase. Typical lexical phase errors are:
1. Spelling errors, which produce incorrect tokens.
2. Exceeding the length limit of identifiers or numeric constants.
3. Appearance of illegal characters.
Example:
fi ( )
{
}
● In the above code, 'fi' cannot be recognized as a misspelling of the keyword if; the
lexical analyzer will treat it as an identifier and return it as a valid identifier.
Thus the misspelling causes an error in token formation.
Syntax error
These types of errors appear during the syntax analysis phase of the compiler.
Typical errors are:
1. Errors in structure.
2. Missing operators.
3. Unbalanced parenthesis.
● The parser demands tokens from the lexical analyzer, and if the tokens do not satisfy
the grammatical rules of the programming language, syntactic errors are
raised.
Semantic error
This type of error is detected during the semantic analysis phase.
Typical errors are:
1. Incompatible types of operands.
2. Undeclared variable.
3. Mismatch of actual arguments with formal arguments.
1. Panic mode recovery
● In this method, on discovering an error, the parser discards input symbols one at a time
until one of a designated set of synchronizing tokens (such as a semicolon) is found.
● If there are few errors in the same statement, this strategy is a good choice.
2. Phrase level recovery
● In this method, on discovering an error parser performs local correction on remaining
input.
● It can replace a prefix of remaining input by some string. This actually helps parser to
continue its job.
● The local correction can be replacing comma by semicolon, deletion of semicolons or
inserting missing semicolon. This type of local correction is decided by compiler
designer.
● While doing the replacement, care should be taken not to go into an infinite loop.
● This method is used in many error-repairing compilers.
3. Error production
● If we have good knowledge of common errors that might be encountered, then we
can augment the grammar for the corresponding language with error productions that
generate the erroneous constructs.
● If error production is used during parsing, we can generate appropriate error message
to indicate the erroneous construct that has been recognized in the input.
● This method is extremely difficult to maintain, because if we change the grammar, then it
becomes necessary to change the corresponding error productions.
4. Global correction
● We often want a compiler that makes as few changes as possible in processing an
incorrect input string.
● Given an incorrect input string x and grammar G, the algorithm finds a parse tree
for a related string y such that the number of insertions, deletions and changes of tokens
required to transform x into y is as small as possible.
● Such methods increase the time and space requirements at parsing time.
● Global correction is thus mainly a theoretical concept.
Intermediate code is generated while translating the source code into machine code.
Intermediate code lies between the high-level language and the machine language.
● If the compiler directly translates source code into the machine code without
generating intermediate code then a full native compiler is required for each new
machine.
● The intermediate code keeps the analysis portion same for all the compilers that's
why it doesn't need a full compiler for every unique machine.
● The intermediate code generator receives its input from the predecessor phase, the
semantic analyzer, in the form of an annotated syntax tree.
● Using intermediate code, only the second (synthesis) phase of the compiler needs to be
changed according to the target machine.
Intermediate representation
Postfix Notation
● Postfix notation is a useful form of intermediate code when the given language consists of
expressions.
● The ordinary (infix) way of writing the product of x and y is with the operator in the
middle: x * y. In postfix notation, the operator is placed at the right end, as x y *.
Example:
Production    | Semantic Rule                       | Program fragment
E → E1 op E2  | E.code = E1.code || E2.code || op   | print op
E → (E1)      | E.code = E1.code                    | —
E → id        | E.code = id                         | print id
When you create a parse tree, it contains more details than are actually needed, so it
is difficult for the compiler to process the whole parse tree.
● In the parse tree, most of the leaf nodes are the single child of their parent nodes.
● Syntax tree is a variant of parse tree. In the syntax tree, interior nodes are operators
and leaves are operands.
Abstract syntax trees are more compact than a parse tree and can be easily used by a
compiler.
● In three-address code, the given expression is broken down into several separate
instructions. These instructions can easily translate into assembly language.
● Each Three address code instruction has at most three operands. It is a combination
of assignment and a binary operator.
Example
Given expression: a := b * -c + d * -c
It is broken into the following three-address code:
t1 := -c
t2 := b * t1
t3 := -c
t4 := d * t3
t5 := t2 + t4
a := t5
The three address code can be represented in two forms: quadruples and triples.
Quadruples
The quadruples have four fields to implement the three address code. The field of
quadruples contains the name of the operator, the first source operand, the second
source operand and the result respectively.
Example
a := -b * (c + d)
The quadruples are:
#    | Op     | Arg1 | Arg2 | Result
(0)  | uminus | b    |      | t1
(1)  | +      | c    | d    | t2
(2)  | *      | t1   | t2   | t3
(3)  | :=     | t3   |      | a
Triples
The triples have three fields to implement the three address code. The field of triples
contains the name of the operator, the first source operand and the second source
operand.
Example:
a := -b * (c + d)
The triples are:
#    | Op     | Arg1 | Arg2
(0)  | uminus | b    |
(1)  | +      | c    | d
(2)  | *      | (0)  | (1)
(3)  | :=     | (2)  |
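A minimal C sketch of record layouts for quadruples and triples (the field names and sizes are illustrative assumptions, not a fixed standard):

typedef struct {
    char op[8];       /* operator, e.g. "uminus", "+", "*", ":=" */
    char arg1[8];     /* first source operand                    */
    char arg2[8];     /* second source operand (may be empty)    */
    char result[8];   /* result name, e.g. the temporary t1      */
} Quad;

typedef struct {
    char op[8];
    char arg1[8];     /* an operand name, or "(i)" for the value of triple i */
    char arg2[8];
} Triple;             /* the result is referred to by the triple's own index */

/* e.g. the first quadruple above: Quad q0 = { "uminus", "b", "", "t1" }; */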
The productions with their semantic actions:
S → id := E   { p = look_up(id.name);
                if p ≠ nil then emit(p ':=' E.place) else error; }
E → E1 + E2   { E.place = newtemp();
                emit(E.place ':=' E1.place '+' E2.place); }
E → E1 * E2   { E.place = newtemp();
                emit(E.place ':=' E1.place '*' E2.place); }
E → (E1)      { E.place = E1.place; }
E → id        { p = look_up(id.name);
                if p ≠ nil then E.place = p else error; }
Procedures call
Calling sequence:
The translation for a call includes a sequence of actions taken on entry and exit from
each procedure. Following actions take place in a calling sequence:
● When a procedure call occurs then space is allocated for activation record.
● Establish the environment pointers to enable the called procedure to access data in
enclosing blocks.
● Save the state of the calling procedure so that it can resume execution after the call.
● Also save the return address. It is the address of the location to which the called
routine must transfer after it is finished.
● Finally generate a jump to the beginning of the code for the called procedure.
S → call id(Elist)   { for each item p on QUEUE do emit('param' p);
                       emit('call' id.place); }
Elist → Elist, E     { append E.place to the end of QUEUE }
Elist → E            { initialize QUEUE to contain only E.place }
Declarations
When we encounter declarations, we need to lay out storage for the declared variables.
For every local name in a procedure, we create a symbol-table (ST) entry containing the type
of the name and the relative address of its storage.
The productions with their semantic actions:
D → integer, id   { ENTER(id.place, integer); D.ATTR = integer }
D → real, id      { ENTER(id.place, real); D.ATTR = real }
D → D1, id        { ENTER(id.place, D1.ATTR); D.ATTR = D1.ATTR }
ENTER is used to make the entry into the symbol table, and ATTR is used to trace the data
type.
Storage Organization
● When the target program executes, it runs in its own logical address space, in
which each program value has a location.
● The logical address space is shared among the compiler, operating system and
target machine for management and organization. The operating system is used to
map the logical address into physical address which is usually spread throughout the
memory.
● Run-time storage comes in blocks of contiguous bytes, where a byte is the smallest unit
of addressable memory. Four bytes typically form a machine word. Multibyte objects are
stored in consecutive bytes and addressed by the address of their first byte.
Activation Record
● Control stack is a run time stack which is used to keep track of the live procedure
activations i.e. it is used to find out the procedures whose execution have not been
completed.
● When a procedure is called (activation begins), its name is pushed onto the
stack, and when it returns (activation ends), it is popped.
● Activation record is used to manage the information needed by a single execution
of a procedure.
● An activation record is pushed into the stack when a procedure is called and it is
popped when the control returns to the caller function.
Access Link: It is used to refer to non-local data held in other activation records.
Saved Machine Status: It holds the information about status of machine before the
procedure is called.
Local Data: It holds the data that is local to the execution of the procedure.
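A minimal C sketch of an activation-record layout covering the fields above (the types, sizes, and the extra fields such as parameters and temporaries are assumptions for illustration):

typedef struct ActivationRecord {
    int  *actual_params;                    /* parameters supplied by the caller */
    int   return_value;                     /* space for the value returned      */
    struct ActivationRecord *control_link;  /* points to the caller's record     */
    struct ActivationRecord *access_link;   /* non-local data in other records   */
    void *saved_machine_status;             /* registers and PC before the call  */
    int   local_data[16];                   /* data local to this execution      */
    int   temporaries[8];                   /* compiler-generated temporaries    */
} ActivationRecord;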
Lexical Error
During the lexical analysis phase this type of error can be detected.
A lexical error is a sequence of characters that does not match the pattern of any token.
Lexical phase errors are found during the lexical analysis phase of compilation.
● Spelling error.
● Exceeding length of identifier or numeric constants.
Example:
void main()
{
    char *a;
    a = &x;
    x = 1xab;
}
In this code, 1xab is neither a number nor an identifier. So this code will show the lexical
error.
Syntax Error
During the syntax analysis phase, this type of error appears. Syntax errors are detected
while parsing, at compile time.
● Error in structure
● Missing operators
● Unbalanced parenthesis
When an invalid calculation is entered into a calculator, a syntax error can also occur.
This can be caused by entering several decimal points in one number or by opening
brackets without closing them.
if (number = 200)
    ...
else
    ...
In this code, the if expression uses the equal sign, which is actually the assignment operator,
not the relational operator that tests for equality.
Due to the assignment operator, number is set to 200 and the expression number = 200
is always true, because the value of the expression is 200. For this example the
correct code would be:
if (number == 200)
A missing semicolon at the end of a statement also produces a syntax error reported by the
compiler, for example in:
int a = 5
Semantic Error
During the semantic analysis phase, this type of error appears. These types of error are
detected at compile time.
Most compile-time errors are scope and declaration errors, for example
undeclared or multiply declared identifiers. Type mismatch is another compile-time
error.
Semantic errors can arise from using the wrong variable, using the wrong operator, or
performing operations in the wrong order.
Example 1: Undeclared variable:
int i;
void f(int m)
{
    m = t;   /* t is undeclared */
}
Example 2: Incompatible types:
int a = "hello"; // the types String and int are not compatible
Example 3: Errors in expressions:
String s = "...";
A symbol table is used to store information about the occurrence of various entities such as
objects, classes, variable names, interfaces, function names etc. It is used by both the
analysis and synthesis phases.
● It is used to store the names of all entities in a structured form in one place.
For example, suppose a symbol table has to store information about the following variable
declaration:
static int interest;
Then it stores an entry of the form <symbol name, type, attribute>, i.e. <interest, int, static>.
Implementation
The symbol table can be implemented as an unordered list if the compiler handles only a small
amount of data. More commonly it is implemented as one of:
● Linear (sorted or unsorted) list
● Binary search tree
● Hash table
Operations
Insert ()
● Insert () operation is more frequently used in the analysis phase when the tokens are
identified and names are stored in the table.
● The insert() operation is used to insert the information in the symbol table like the unique
name occurring in the source code.
● In the source code, the attribute for a symbol is the information associated with that
symbol. The information contains the state, value, type and scope about the symbol.
● The insert() function takes the symbol and its attributes as arguments.
For example, for the declaration
int x;
the compiler calls insert(x, int).
lookup()
In the symbol table, the lookup() operation is used to search for a name. It is used to
determine:
● whether the symbol exists in the table,
● whether it is declared before it is used,
● whether the name is used in the current scope,
● whether the symbol is initialized, and
● whether the symbol is declared multiple times.
The basic format is lookup(symbol); it returns 0 if the symbol does not exist in the table,
otherwise it returns the symbol's attributes (the sketch below illustrates both operations).
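A minimal C sketch of insert() and lookup() over a chained hash table (the bucket count and the entry fields are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

typedef struct Entry {
    char name[32];        /* unique name from the source code */
    char type[16];        /* attribute: type of the symbol    */
    struct Entry *next;   /* chaining for hash collisions     */
} Entry;

static Entry *table[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

Entry *lookup(const char *name)                 /* search a name */
{
    for (Entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;                                /* not found */
}

Entry *insert(const char *name, const char *type)
{
    Entry *e = lookup(name);
    if (e) return e;                            /* already present */
    e = malloc(sizeof *e);
    strcpy(e->name, name);
    strcpy(e->type, type);
    unsigned h = hash(name);
    e->next = table[h];                         /* link into the bucket */
    table[h] = e;
    return e;
}

int main(void)
{
    insert("x", "int");                         /* e.g. for: int x; */
    printf("%s\n", lookup("x") ? "found" : "missing");
    return 0;
}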
● A compiler contains two types of symbol tables: a global symbol table and scope symbol
tables.
● The global symbol table can be accessed by all procedures, while each scope symbol table
is visible only within the scope for which it was created.
The scope of a name and the symbol tables are arranged in a hierarchical structure as shown below:
int value = 10;

void sum_num()
{
    int num_1;
    int num_2;
    int num_3;
    int num_4;
    int num_5;
    int num_6;
    int num_7;
}

void sum_id()
{
    int id_1;
    int id_2;
    int id_3;
    int id_4;
    int id_5;
}
The above program can be represented in a hierarchical data structure of symbol tables:
In the source program, every name possesses a region of validity, called the scope of that
name.
If B1 block is nested within B2 then the name that is valid for block B2 is also valid for B1 unless
the name's identifier is re-declared in B1.
● These scope rules need a more complicated organization of symbol table than a list of
associations between names and attributes.
● Tables are organized into stack and each table contains the list of names and their
associated attributes.
● Whenever a new block is entered then a new table is entered onto the stack. The new
table holds the name that is declared as local to this block.
● When a declaration is compiled, the topmost table on the stack is searched for the name.
● If the name is not found in that table, the new name is inserted.
● When a reference to a name is translated, each table is searched, starting from the
topmost table on the stack and proceeding downward (from the innermost scope outward).
For example:
int x;
void f(int m) {
    float x, y;
    {
        int i, j;
    }
    {
        int u, v;
    }
}
int g(int n) {
    bool t;
}
Fig: Symbol table organization that complies with static scope information rules
The global symbol table contains one global variable and two procedure names. A name
mentioned in the sum_num table is not available for sum_id and its child tables.
The hierarchy of symbol tables is stored in the semantic analyzer. To search for a name in
the symbol tables, use the following procedure: search the name in the table of the current
scope first; if it is found, the search is complete; otherwise search the enclosing (parent)
tables one by one until the name is found or the outermost (global) table has been searched.
Static allocation
● If memory is created at compile time, then it is created in the static area and only once.
● Static allocation does not support dynamic data structures: memory is laid out at compile
time and deallocated only after the program completes.
● The drawback of static storage allocation is that the size and position of data
objects must be known at compile time.
Stack allocation
● An activation record is pushed onto the stack when an activation begins and is
popped when the activation ends.
● An activation record contains the locals, so that they are bound to fresh storage in
each activation. The values of locals are discarded when the activation ends.
● It works on a last-in-first-out (LIFO) basis, and this allocation supports
recursion.
Heap allocation
● Allocation and deallocation of memory can be done at any time and at any place,
depending on the user's requirement.
● Heap allocation is used to allocate memory to variables dynamically and to reclaim
it when the variables are no longer used.
Example:
int fact(int n)
{
    if (n <= 1)
        return 1;
    else
        return n * fact(n - 1);
}
fact(6);