Compiler Unit II

The document discusses syntax analysis in compiler design, focusing on the role of parsers and context-free grammars (CFG). It explains how parsers verify the syntactic structure of programming languages, report errors, and recover from them, along with detailing types of grammars and error-handling strategies. Additionally, it covers the components of CFG and the process of derivation and parse trees in representing language syntax.

CS3501 Compiler Design R-21

UNIT II
SYNTAX ANALYSIS
Every programming language has rules that prescribe the syntactic structure of well-formed programs. The syntax of programming language constructs can be described by context-free grammars (CFG), also written in BNF (Backus-Naur Form).
Notation:
• A grammar gives a precise yet easy-to-understand syntactic specification of a programming language.
• From a grammar we can automatically construct an efficient parser that determines whether a source program is syntactically well formed.
• A properly designed grammar is useful for translating source programs into correct object code and for detecting errors.
• Languages evolve over time, acquiring new constructs and performing additional tasks; these new constructs can be added to a language more easily using a grammatical description of the language.
2.1 ROLE OF PARSER
• The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language.
• The parser must also report any syntax errors.
• The parser should also recover from commonly occurring errors, so that it can continue processing the remainder of its input.

Figure: Position of a parser in compiler model


Input to the parser:
Sequence of tokens from lexical analyzer


Output:
Parse tree
There are a number of tasks that might be conducted during parsing, such as collecting information about various tokens into the symbol table, performing type checking and other kinds of semantic analysis, and generating intermediate code.
Functions:
(i) It verifies the structure generated by the tokens based on the grammar.
(ii) The parser constructs the parse tree.
(iii) The parser reports errors.
(iv) It performs error recovery.
Lexical Analysis vs Parsing

Lexical analysis:
(i) A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc.
(ii) The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens.

Parsing:
(i) A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence); the "parser" proper turns whole tokens into sentences of the grammar.
(ii) A parser does not give the nodes any meaning beyond structural cohesion.
2.2 GRAMMARS
All the production rules together are called a grammar, and they define a language. A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the right side of a production for that non-terminal. The token strings that can be derived from the start symbol form the language defined by the grammar.
2.2.1 Types of Grammar
Type 0: Phrase-Structured Grammar. Productions have the form
α → β, where α and β are arbitrary strings of grammar symbols.
Type 1: Context-Sensitive Grammar
α → β with |α| ≤ |β|; terminals are allowed on the LHS.
(Eg): S → aSBc
S → abc
cB → Bc
bB → bb
Type 2: CFG (Context-Free Grammar)
A → α
The left side must be a single non-terminal; there is no restriction on α.
Type 3: Regular Grammar
A → a or A → aB
i.e., a terminal, optionally followed by a non-terminal, as in A → ab | a.
(Eg): S → aS
S → ab
B → bc
ERROR HANDLING
A program can contain errors at different levels.
Lexical errors include misspellings of identifiers, keywords, or operators, e.g., the use of an identifier elipseSize instead of ellipseSize, and missing quotes around text intended as a string.
Syntactic errors include misplaced semicolons or extra or missing braces, that is, "{" or "}". As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).
Semantic errors include type mismatches between operators and operands, e.g., the return of a value in a Java method with result type void.
Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==. The program containing = may be well formed; however, it may not reflect the programmer's intent.


The error handler in a parser has goals that are simple to state but challenging to realize:
 Report the presence of errors clearly and accurately.
 Recover from each error quickly enough to detect subsequent errors.
 Add minimal overhead to the processing of correct programs.
Error-Recovery Strategies
Several error-recovery strategies are available. They are:
Panic-Mode Recovery
With this method, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as a semicolon or }, whose role in the source program is clear and unambiguous.
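Panic-mode recovery fits in a few lines. The following Python sketch is illustrative only; the token list and the choice of synchronizing set are assumptions, not part of any particular compiler:

```python
def panic_mode_recover(tokens, pos, sync=frozenset({";", "}"})):
    """Skip tokens until a synchronizing token is found.

    tokens: list of token strings; pos: index where the error occurred.
    Returns the index just past the synchronizing token (or end of input),
    where normal parsing can resume.
    """
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1                          # discard one input symbol at a time
    return min(pos + 1, len(tokens))      # resume after the delimiter
```

For example, on the token stream `x = 1 2 ; y` with an error detected at the extra `1`, the parser would skip ahead and resume at `y`.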
Phrase-Level Recovery
On discovering an error, a parser may perform local corrections on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon.
Error Production
By anticipating common errors that might be encountered, we can augment the grammar for the
language at hand with productions that generate the erroneous constructs. A parser constructed
from a grammar augmented by these error productions detects the anticipated errors when an
error production is used during parsing. The parser can then generate appropriate error
diagnostics about the erroneous construct that has been recognized in the input.
Global Correction
Ideally, we would like a compiler to make as few changes as possible in processing an incorrect
input string. There are algorithms for choosing a minimal sequence of changes to obtain a
globally least-cost correction. Given an incorrect input string x and grammar G, these algorithms
will find a parse tree for a related string y, such that the number of insertions, deletions, and
changes of tokens required to transform x into y is as small as possible. Unfortunately, these
methods are in general too costly to implement in terms of time and space, so these techniques
are currently only of theoretical interest.


2.3 CONTEXT FREE GRAMMAR

A context-free grammar specifies the syntax of a language. It is also called BNF (Backus-Naur Form). A context-free grammar has 4 components:
1. A set of tokens, known as terminal symbols.
2. A set of non-terminals.
3. A set of productions, where each production consists of a non-terminal on the left side of the production, an arrow, and a sequence of tokens and/or non-terminals called the right side of the production.
4. One of the non-terminals is designated as the start symbol.
eg: Productions:
List → List + Digit
List → List - Digit
List → Digit
Digit → 0 | 1 | 2 | ... | 9
List and Digit are non-terminals; 0, 1, 2, ..., 9 are terminals.
The right sides of the three productions with non-terminal List on the left side can equivalently be grouped:
List → List + Digit | List - Digit | Digit.
Note: A single digit by itself is a list. If we take any list and follow it by a plus or minus sign and then another digit, we have a new list.
(E.g.) 9 is a list
9-5 is a list
since 9 is a list (being a digit) and 5 is a digit.
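The left-recursive List grammar above makes + and - group to the left. A small Python sketch (the helper name eval_list is illustrative, not from the text) that evaluates digit lists in exactly that left-associative order:

```python
def eval_list(s):
    """Evaluate a digit list like '9-5+2' left-associatively,
    mirroring List -> List + Digit | List - Digit | Digit."""
    result = int(s[0])                    # Digit base case
    i = 1
    while i < len(s):
        op, digit = s[i], int(s[i + 1])   # next operator and Digit
        result = result + digit if op == "+" else result - digit
        i += 2
    return result
```

For example, eval_list("9-5") evaluates to 4, matching the parse tree of 9-5 discussed next.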
Parse Tree of 9-5:
• Each node in the tree is labeled by a grammar symbol.
• An interior node and its children correspond to a production.


Parse Trees for the production A → XYZ:

A parse tree may have an interior node labeled A with 3 children labeled X, Y and Z.

Given a CFG, a parse tree is a tree with the following properties:

1. The root is labeled by the start symbol.
2. Each leaf is labeled by a token or by ε, the empty string.
3. Each interior node is labeled by a non-terminal.
4. If A is a non-terminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then A → X1 X2 ... Xn is a production. Here each Xi stands for a symbol, i.e., either a terminal or a non-terminal.
If A → ε, then a node labeled A may have a single child labeled ε.
NOTE: Any tree imparts a left-to-right order on its leaves, based on the idea that if a and b are children with the same parent and a is to the left of b, then all descendants of a are to the left of the descendants of b.
Many programming language constructs have an inherently recursive structure that can be
defined by context free grammar.
For Example: Statements such as:
"if E then S1 else S2" cannot be specified using the notation of regular expressions.
We can readily express the statement using the grammar production
stmt → if expr then stmt else stmt
A context-free grammar consists of terminals, non-terminals, a start symbol, and productions.
(1) Terminals are the basic symbols from which strings are formed. The word token is a synonym for "terminal" when we consider grammars for programming languages.
(2) Example: if, then, else.
(3) Non-terminals are syntactic variables that denote sets of strings. They define the sets of strings that can be generated by the grammar.
(4) Example: stmt, expr are non-terminals.
(5) In a grammar, one non-terminal is distinguished as the start symbol and the set of strings it


denotes is the language defined by the grammar.


(6) The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal followed by an arrow followed by a string of non-terminals and terminals.
Grammar for Arithmetic Expressions:
expr → expr op expr
expr → (expr)
expr → -expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
In this grammar the terminal symbols are id, +, -, *, /, ↑, (, ).
The non-terminals are expr and op.
expr is the start symbol.
2.3.1 Notational Conventions
(1) Terminals
(i) Lowercase letters early in the alphabet: a, b, c, ... .
(ii) Operator symbols: +, -, etc.
(iii) Punctuation symbols: parentheses, comma, etc.
(iv) The digits 0, 1, ..., 9.

(v) Boldface strings such as id or if.

(2) Non-Terminals
(i) Uppercase letters early in the alphabet: A, B, C, ... .
(ii) The letter S, which is usually the start symbol.
(iii) Lowercase italic names: expr, stmt, ... .
(3) Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either non-terminals or terminals.
(4) Lowercase letters late in the alphabet, such as u, v, ..., z, represent strings of terminals.
(5) Lowercase Greek letters α, β, γ, etc. represent strings of grammar symbols.
Thus a generic production can be written as A → α, indicating that there is a single non-terminal A on the left of the arrow and a string of grammar symbols α to the right of the arrow.
(6) If A → α1, A → α2, ..., A → αk are all productions with A on the left, they are called A-productions. We may also write A → α1 | α2 | ... | αk, where we call α1, α2, ..., αk the alternatives for A.
(7) Unless otherwise stated, the left side of the first production is the start symbol.
Using these shorthands we can write the grammar for expressions as
E → E A E | (E) | -E | id
A → + | - | * | / | ↑
2.3.2 Derivations
The central idea of derivation is that a production is treated as a rewriting rule in which the non-terminal on the left is replaced by the string on the right side of the production.
For Example: Consider the following grammar for arithmetic expressions:
E → E + E | E * E | (E) | -E | id.
The production E → -E signifies that an expression preceded by a minus sign is also an expression.
So we can replace E by -E. We can describe this action by writing E ⇒ -E, which is read as "E derives -E".
We can take a single E and repeatedly apply productions in any order to obtain a sequence of replacements.
Example: E ⇒ -E ⇒ -(E) ⇒ -(id).
We call such a sequence of replacements a derivation of -(id) from E. This derivation provides a proof that one particular instance of an expression is the string -(id).
If α1 ⇒ α2 ⇒ ... ⇒ αn, we say α1 derives αn, where each step applies a single production.
"⇒" means derives in one step.
"⇒*" means derives in zero or more steps.
"⇒+" means derives in one or more steps.
Given a grammar G with start symbol S, we can use the ⇒* relation to define L(G), the language generated by G: the set of terminal strings w such that S ⇒* w.

Rightmost derivations (RMD) are sometimes called canonical derivations.

Note: If S ⇒*lm α, then we say α is a left-sentential form of the grammar G.
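The rewriting view of a derivation can be demonstrated directly; the short Python sketch below hand-picks the production alternatives so that the leftmost E is rewritten at each step, reproducing the derivation E ⇒ -E ⇒ -(E) ⇒ -(id) (illustrative only):

```python
# Each step replaces the leftmost occurrence of the nonterminal 'E'
# with a chosen production right side, one rewrite per step.
productions = ["-E", "(E)", "id"]     # chosen alternatives, in order

sentential = "E"
steps = [sentential]
for rhs in productions:
    # str.replace with count=1 rewrites only the leftmost 'E'
    sentential = sentential.replace("E", rhs, 1)
    steps.append(sentential)

print(" => ".join(steps))             # E => -E => -(E) => -(id)
```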

2.3.3 Parse Tree and Derivations

A parse tree may be viewed as a graphical representation of a derivation that filters out the choice regarding replacement order (i.e., the leftmost or rightmost information).
• Each interior node is labeled by some non-terminal A.
• Its children are labeled from left to right by the symbols on the right side of the production.
• The leaves of the parse tree are labeled by non-terminals or terminals and are read from left to right.
Example: parse tree for -(id + id) using LMD.

        E
       / \
      -   E
         /|\
        ( E )
         /|\
        E + E
        |   |
        id  id

The sequence of parse trees constructed for the stated derivation is:


The trees grow step by step, following the derivation
E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id):
first the root E gets children - and E; then the inner E is expanded to ( E ); then that E is expanded to E + E; finally each remaining E is replaced by id.

Figure: Parse Tree


Example 1: Produce the string (a, (a, a)) with the following grammar, using LMD and RMD:
S → a | * | (T)
T → T, S | S.
The terminals in the preceding grammar are
VT = {a, *, (, ), ,}
The non-terminals are:
VN = {S, T}
Note: LMD = Leftmost Derivation
RMD = Rightmost Derivation.
RMD:
S ⇒ (T)
  ⇒ (T, S)
  ⇒ (T, (T))
  ⇒ (T, (T, S))
  ⇒ (T, (T, a))
  ⇒ (T, (S, a))
  ⇒ (T, (a, a))
  ⇒ (S, (a, a))
  ⇒ (a, (a, a)).
LMD:
S ⇒ (T)
  ⇒ (T, S)
  ⇒ (S, S)
  ⇒ (a, S)
  ⇒ (a, (T))
  ⇒ (a, (T, S))
  ⇒ (a, (S, S))
  ⇒ (a, (a, S))
  ⇒ (a, (a, a)).

Example 2: Produce the string (((a, a), *, (a)), a) with the following grammar, using LMD and RMD:
S → a | * | (T)
T → T, S | S
LMD:
S ⇒ (T)
  ⇒ (T, S)
  ⇒ (S, S)
  ⇒ ((T), S)
  ⇒ ((T, S), S)
  ⇒ ((T, S, S), S)
  ⇒ ((S, S, S), S)
  ⇒ (((T), S, S), S)
  ⇒ (((T, S), S, S), S)
  ⇒ (((S, S), S, S), S)
  ⇒ (((a, S), S, S), S)
  ⇒ (((a, a), S, S), S)
  ⇒ (((a, a), *, S), S)
  ⇒ (((a, a), *, (T)), S)
  ⇒ (((a, a), *, (S)), S)
  ⇒ (((a, a), *, (a)), S)
  ⇒ (((a, a), *, (a)), a).
RMD:
S ⇒ (T)
  ⇒ (T, S)
  ⇒ (T, a)
  ⇒ (S, a)
  ⇒ ((T), a)
  ⇒ ((T, S), a)
  ⇒ ((T, (T)), a)
  ⇒ ((T, (S)), a)
  ⇒ ((T, (a)), a)
  ⇒ ((T, S, (a)), a)
  ⇒ ((T, *, (a)), a)
  ⇒ ((S, *, (a)), a)
  ⇒ (((T), *, (a)), a)
  ⇒ (((T, S), *, (a)), a)
  ⇒ (((T, a), *, (a)), a)
  ⇒ (((S, a), *, (a)), a)
  ⇒ (((a, a), *, (a)), a).

Example 3: The sentence id + id * id has the two distinct leftmost derivations shown below:

E ⇒ E + E           E ⇒ E * E
  ⇒ id + E            ⇒ E + E * E
  ⇒ id + E * E        ⇒ id + E * E
  ⇒ id + id * E       ⇒ id + id * E
  ⇒ id + id * id      ⇒ id + id * id.

The corresponding parse trees are shown below:

       E                      E
     / | \                  / | \
    E  +  E                E  *  E
    |    /|\              /|\    |
    id  E * E            E + E   id
        |   |            |   |
        id  id           id  id

2.3.4 Ambiguity

A grammar that produces more than one parse tree for some string of tokens is said to be ambiguous. A string with more than one meaning has more than one parse tree.

(Eg): 9-5+2 has two parse trees, corresponding to the groupings (9-5)+2 and 9-(5+2).

Associativity of operators:
An operand with operators on both sides of it is taken by the operator to its left.
9+5+2 is equivalent to (9+5)+2.
In most programming languages the four arithmetic operators addition, subtraction, multiplication and division are left associative.
E.g.: a=b=c. Here = is right associative.

The parse tree for 9 - 5 - 2 grows down towards the left:

      -
     / \
    -   2
   / \
  9   5


The parse tree for a = b = c grows down towards the right:

      =
     / \
    a   =
       / \
      b   c

Note: +, -, *, / are left associative; = is right associative.


Precedence of Operators:
We say that * has higher precedence than + if * takes its operands before + does.
For the productions,
factor → digit | (expr)
term → term * factor
     | term / factor
     | factor
expr → expr + term
     | expr - term
     | term.
The resulting grammar is:
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | (expr)
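The unambiguous expr/term/factor grammar above builds precedence and left associativity into the grammar itself. A Python sketch in which each non-terminal becomes a function; the while-loops inside expr() and term() play the role of the left-recursive productions (single-digit operands are assumed for brevity; function names are illustrative):

```python
def evaluate(s):
    """Evaluate per expr -> expr +|- term, term -> term *|/ factor,
    factor -> digit | (expr). Iteration replaces the left recursion."""
    pos = 0

    def factor():
        nonlocal pos
        if s[pos] == "(":                 # factor -> ( expr )
            pos += 1
            v = expr()
            pos += 1                      # skip ')'
            return v
        v = int(s[pos])                   # factor -> digit
        pos += 1
        return v

    def term():
        nonlocal pos
        v = factor()
        while pos < len(s) and s[pos] in "*/":
            op = s[pos]; pos += 1
            v = v * factor() if op == "*" else v / factor()
        return v

    def expr():
        nonlocal pos
        v = term()
        while pos < len(s) and s[pos] in "+-":
            op = s[pos]; pos += 1
            v = v + term() if op == "+" else v - term()
        return v

    return expr()
```

Because term() is called before any + or - is consumed, 9-5*2 groups as 9-(5*2), exactly as the precedence in the grammar dictates.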
Solution for the Problem of Ambiguous Grammar:
We can use ambiguous grammar together with disambiguating rules that “throw away”
undesirable parse trees and leave only one tree for each sentence.
Regular Expression (vs) Context-Free Grammar:
Every construct that can be described by a regular expression can also be described by a grammar.
For example, the grammar for the regular expression (a|b)*abb is:
A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε


The NFA for (a|b)*abb is:

          a, b (self-loop on state 0)
start → 0 --a--> 1 --b--> 2 --b--> 3   (3 is the accepting state)

We can easily convert an NFA into a grammar that generates the same language as is recognized by the NFA.
The grammar above was constructed from the NFA using the following construction:
• For each state i of the NFA, create a non-terminal symbol Ai.
• If state i has a transition to state j on symbol a, introduce the production Ai → aAj.
• If i is an accepting state, introduce Ai → ε.
• If i is the start state, make Ai the start symbol of the grammar.
Why do we use regular expression to define the lexical syntax of a language?
(1) The lexical rules of a language are frequently quite simple and to describe them we do
not need a notation as powerful as grammars.
(2) Regular expressions generally provide a more concise and easier-to-understand notation
for tokens than grammars.
(3) More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars.
(4) Separating the syntactic structure of a language into lexical and non-lexical part provide
a convenient way of modularizing the front end of a compiler into two manageable sized
components.
 Regular expressions are most useful for describing the structure of lexical constructs
such as identifiers, constants, etc.
 Grammars are most useful in describing nested structures such as balanced
parenthesis, and statements like “if then else”, etc.
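The NFA-to-grammar construction described above (one non-terminal Ai per state, Ai → aAj per transition, Ai → ε per accepting state) can be sketched directly; the transition encoding and function name below are illustrative assumptions:

```python
def nfa_to_grammar(transitions, accepting):
    """Build right-linear productions from an NFA.

    transitions: list of (state_i, symbol, state_j) triples.
    Returns {nonterminal: [alternatives]} with one A<i> per state."""
    grammar = {}
    for i, a, j in transitions:
        grammar.setdefault(f"A{i}", []).append(a + f"A{j}")  # Ai -> a Aj
    for i in accepting:
        grammar.setdefault(f"A{i}", []).append("ε")          # Ai -> ε
    return grammar

# NFA for (a|b)*abb: states 0..3, accepting state 3
nfa = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
g = nfa_to_grammar(nfa, accepting=[3])
# g["A0"] is ["aA0", "bA0", "aA1"], matching the grammar given earlier
```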
2.3.5 Advantages of CFG
• A grammar gives an exact and easily understandable structural specification of a language.
• An efficient parser can be constructed from a CFG.
• It is useful for detecting errors.
• Without much modification, new constructs of the language can easily be added to the grammar.
2.4 WRITING GRAMMAR
2.4.1 Lexical Versus Syntactic Analysis
The reasons to use regular expressions to define the lexical syntax of a language are:
1. Separating the syntactic structure of a language into lexical and non-lexical parts provides
a convenient way of modularizing the front end of a compiler into two manageable-sized
components.
2. The lexical rules of a language are frequently quite simple, and to describe them we do
not need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-under-stand notation
for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars.
2.4.2 Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.
For example, we shall eliminate the ambiguity from the following dangling-else grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
The grammar is ambiguous, since the sentence
if E1 then if E2 then S1 else S2
has two parse trees, corresponding to the two groupings:
(1) if E1 then [ if E2 then S1 else S2 ]
(2) if E1 then [ if E2 then S1 ] else S2
In all the programming languages with conditional statements of this form, the first parse tree
is preferred.
Disambiguating Rule:
"Match each else with the closest previous unmatched then."
The idea is that a statement appearing between a then and an else must be "matched". A matched statement is an if-then-else statement containing no unmatched statement, or any other kind of unconditional statement:
stmt → matched_stmt
     | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
     | other
unmatched_stmt → if expr then stmt
     | if expr then matched_stmt else unmatched_stmt
2.4.3 Elimination of Left Recursion
A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.

Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed.
The left-recursive pair of productions A → Aα | β can be replaced by the non-left-recursive productions
A → βA'
A' → αA' | ε
without changing the set of strings derivable from A.


Example: Consider the following grammar for arithmetic expressions:

E → E + T | T
T → T * F | F
F → (E) | id.
Eliminating the left recursion of E and T, we obtain:

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id.
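The transformation A → Aα | β into A → βA', A' → αA' | ε is mechanical and can be sketched in Python (the grammar representation, with alternatives as symbol lists, and the naming convention A' are assumptions for illustration):

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """A -> A a1 | ... | A am | b1 | ... | bn  becomes
       A  -> b1 A' | ... | bn A'
       A' -> a1 A' | ... | am A' | ε
    Alternatives are lists of symbols; nt + \"'\" names the new nonterminal."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}        # nothing to do
    new = nt + "'"
    return {
        nt: [alt + [new] for alt in others],
        new: [alt + [new] for alt in recursive] + [["ε"]],
    }

# E -> E + T | T  becomes  E -> T E' ; E' -> + T E' | ε
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
```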
No matter how many A-productions there are, we can eliminate immediate left recursion from them by the following technique.
(1) Group the productions as:
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
where no βi begins with an A. Then we replace the A-productions by
A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε
The non-terminal A generates the same strings as before but is no longer left recursive.
This procedure eliminates all immediate left recursion but it does not eliminate left recursion
involving derivations of two or more steps.
Example: Consider the grammar,
S → Aa | b
A → Ac | Sd | ε.
The non-terminal S is left recursive because:
S ⇒ Aa ⇒ Sda.

NOTE: A grammar is said to have no cycles if it has no derivations of the form A ⇒+ A.
Algorithm to eliminate left recursion (including left recursion arising through multi-step derivations):
Input: Grammar G with no cycles or ε-productions.

Output: An equivalent grammar with no left recursion.

Method: Apply the algorithm to G. The resulting non-left-recursive grammar may have ε-productions.
1) Arrange the non-terminals in some order A1, A2, ..., An.
2) for i := 1 to n do begin
     for j := 1 to i - 1 do begin
       replace each production of the form Ai → Ajγ
       by the productions Ai → δ1γ | δ2γ | ... | δkγ,
       where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions;
     end for
     eliminate the immediate left recursion among the Ai-productions;
   end for
2.4.4 Left Factoring
Example: If we have the two productions,
stmt → if expr then stmt else stmt
     | if expr then stmt
on seeing the input token if we cannot immediately tell which production to choose to expand stmt.
In general, if A → αβ1 | αβ2 are two A-productions and the input begins with a non-empty string derived from α, we do not know whether to expand A to αβ1 or to αβ2. However, we may defer the decision by expanding A to αA'. Then, after seeing the input derived from α, we expand A' to β1 or to β2.
That is, left factored, the original productions become:
A → αA'
A' → β1 | β2.

Algorithm: Left Factoring a Grammar

Input: Grammar G
Output: An equivalent left-factored grammar.

Method:
• For each non-terminal A, find the longest prefix α common to two or more of its alternatives.
• If α ≠ ε, replace all the A-productions
A → αβ1 | αβ2 | ... | αβn | γ, where γ represents all alternatives that do not begin with α, by
A → αA' | γ
A' → β1 | β2 | ... | βn.
Here A' is a new non-terminal.
• Repeatedly apply the transformation until no two alternatives for a non-terminal have a common prefix.
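One round of the left-factoring algorithm above can be sketched in Python. The grammar representation, with alternatives as symbol lists and "ε" marking an empty tail, is an assumption for illustration:

```python
def left_factor(nt, alternatives):
    """One round of left factoring: if a longest common prefix α is shared
    by two or more alternatives, replace them by α A' and move the tails
    to the new nonterminal A'."""
    best, group = [], []
    for alt in alternatives:
        # gather alternatives starting with the same first symbol as alt
        shared = [a for a in alternatives if a[:1] == alt[:1]]
        if len(shared) < 2:
            continue
        prefix = list(alt)
        for a in shared:                  # shrink to the common prefix
            while prefix != a[:len(prefix)]:
                prefix.pop()
        if len(prefix) > len(best):
            best, group = prefix, shared
    if not best:
        return {nt: alternatives}         # nothing to factor
    new = nt + "'"
    rest = [a for a in alternatives if a not in group]
    tails = [a[len(best):] or ["ε"] for a in group]
    return {nt: [best + [new]] + rest, new: tails}

# S -> iEtS | iEtSeS | a  (dangling-else abstraction, factored below)
g = left_factor("S", [["i","E","t","S"], ["i","E","t","S","e","S"], ["a"]])
```

Applied to the dangling-else abstraction, this yields S → iEtSS' | a and S' → ε | eS, matching the left-factored grammar derived next.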
The following grammar abstracts the dangling-else problem:
S → iEtS | iEtSeS | a
E → b
S and E stand for statement and expression; i, t, e stand for if, then and else.
The left-factored grammar becomes:
S → iEtSS' | a
S' → eS | ε
E → b
Thus we may expand S to iEtSS' on input i, and wait until iEtS has been seen to decide whether to expand S' to eS or to ε.

2.5 TOP DOWN PARSING

Definition of parsing:
Parsing is the process of analyzing a continuous stream of input in order to determine its grammatical structure with respect to a given formal grammar.
The methods commonly used in compilers are classified as:
• top-down parsing
• bottom-up parsing
Top-Down Parsers: Build parse trees starting from the root and work down to the leaves.
Bottom-Up Parsers: Build the parse tree from the leaves and work up to the root.
• In both cases the input to the parser is scanned from left to right, one symbol at a time.
• The output of the parser is some representation of the parse tree for the stream of tokens produced by the lexical analyzer.
The tasks that might be conducted during parsing are:
• Collecting information about various tokens into the symbol table.
• Performing type checking.
• Semantic analysis.
• Generating intermediate code.
In practice, these activities are typically grouped into the rest of the front end.
2.6 GENERAL STRATEGIES - RECURSIVE DESCENT PARSING
This parser involves backtracking: it attempts to construct a parse tree for the input string W starting from the root, creating the nodes of the tree in preorder and trying the productions for each non-terminal one by one. If a choice fails, the parser goes back to the previous level and tries another production, until the input string W is derived. It may involve backtracking, that is, making repeated scans of the input.
Disadvantage of Backtracking:
• Not very efficient.
Note: We have to keep track of the input position when backtracking takes place.
Example: Let the grammar be:
S → cAd
A → ab | a
To construct a parse tree for the string w = cad, we initially create a tree consisting of a single node labeled S. We then consider the first production and obtain the tree below.

Figure: Steps in top-down parser

The leftmost leaf, labeled c, matches the first symbol of w, so we advance the input pointer to a (the second symbol of w) and consider the next leaf, labeled A.
• We then expand A, using the first alternative for A, to obtain the following tree.

Figure: Steps in top-down parser

• Since b does not match the third input symbol d, we check whether there is another alternative for A.
• In going back to A, we must reset the input pointer to position 2.
We now try the second alternative for A (the production A → a is applied now) to obtain the following tree.

Figure: Steps in top-down parser

Since we have produced a parse tree for w, we halt and announce successful completion of parsing. A recursive-descent parser cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed: a left-recursive grammar can cause a recursive-descent parser to go into an infinite loop.
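The backtracking behaviour just traced for S → cAd, A → ab | a can be sketched as follows; trying A → ab first and falling back to A → a is exactly the backtracking step described above (function names are illustrative):

```python
def parse_A(s, pos):
    """Yield the input position after each way A can match at pos."""
    if s[pos:pos + 2] == "ab":
        yield pos + 2                     # first alternative: A -> ab
    if s[pos:pos + 1] == "a":
        yield pos + 1                     # second alternative: A -> a

def parse_S(s):
    """Backtracking recursive-descent parse of S -> c A d.
    Returns True iff s matches S exactly."""
    if s[:1] == "c":
        for end in parse_A(s, 1):         # try each alternative for A
            if s[end:end + 1] == "d" and end + 1 == len(s):
                return True               # matched c A d, input consumed
            # otherwise: backtrack, reset to the next alternative for A
    return False
```

On input "cad", A → ab fails (the b does not match), the parser backtracks, A → a succeeds, and the trailing d completes the parse.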
2.7 PREDICTIVE PARSERS
A recursive-descent parser that needs no backtracking is called a predictive parser. This can be accomplished by carefully writing the grammar: eliminating left recursion from it and left factoring the resulting grammar, to obtain a grammar that can be parsed by a predictive parser.
In the productions A → α1 | α2 | ... | αn, the proper alternative must be detectable by looking at only the first symbol it derives.
For Example: If we have the productions
statement → if expr then stmt else stmt


| while expr do stmt

| begin stmt_list end
then the keywords if, while and begin tell us which alternative is the only one that could possibly succeed if we are to find a statement.
2.7.1 Transition Diagrams for Predictive Parser
We can create a transition diagram as a plan for a predictive parser.
To construct the transition diagrams of a predictive parser from a grammar, first eliminate left recursion from the grammar, then left factor the grammar. Then for each non-terminal A do the following:
(1) Create an initial and a final state.
(2) For each production A → X1X2 ... Xn, create a path from the initial to the final state with edges labeled X1, X2, ..., Xn.
Example: Consider the following grammar (left: the transformed grammar; right: the original grammar it was obtained from):
E → TE'          E → E + T | T
E' → +TE' | ε    T → T * F | F
T → FT'          F → id | (E)
T' → *FT' | ε
F → (E) | id.
The collection of transition diagrams for the preceding grammar is shown below.

On Simplification


We must eliminate non-determinism if we want to build a predictive parser.

Non-determinism means there is more than one transition from a state on the same input. If ambiguity occurs, we should resolve it.
If non-determinism cannot be eliminated, we cannot build a predictive parser, but we could build a recursive-descent parser that uses backtracking to systematically try all possibilities.
The complete set of diagrams after simplification is:

Figure: Transition diagrams for the non-terminals T, T' and F

A C implementation of the predictive parser based on the preceding simplified transition diagrams runs 20-25% faster than the C implementation based on the original transition diagrams.


2.7.2 Non-Recursive Predictive Parsing

It is possible to build a non-recursive predictive parser by maintaining a stack explicitly. This parser looks up the production to be applied in a parsing table.
A table-driven predictive parser has:
• an input buffer
• a stack
• a parsing table
• an output stream.
The input buffer contains the string to be parsed, followed by $, which indicates the end of the input string.
The stack contains a sequence of grammar symbols, with $ on the bottom indicating the bottom of the stack.
Initially the stack contains the start symbol of the grammar on top of $.
The parsing table is a two-dimensional array M[A, a], where A is a non-terminal and a is a terminal or the symbol $.
The parser is controlled by a program that behaves as follows:
The program compares X, the symbol on top of the stack, and a, the current input symbol.
1) If X = a = $, the parser halts and announces successful completion of parsing.
2) If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3) If X is a non-terminal, the program consults entry M[X, a] of the parsing table M.
If, for example, M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
As output, we shall assume that the parser just prints the production used.
If M[X, a] = error, the parser calls an error-recovery routine.
Algorithm:
Input: A string w and a parsing table M for grammar G.
Output: If  is in L(G) a leftmost derivation of „‟ otherwise an error indication.
Method:
Initially the parser is in a configuration in which it has $S on the stack with „S‟ the start
symbol of G on top and $ in the input buffer.

76
CS3501 Compiler Design R-21

This program utilizes the predictive parsing table M to produce a parse for the input.
Set input to point to the first symbol of $

Figure : Model of non-recursive predictive parser


repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 ... Yk then begin
            pop X from the stack;
            push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
            output the production X → Y1 Y2 ... Yk
        end
        else error()
until X = $ /* stack is empty */
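The loop above can be turned into a short table-driven routine. The sketch below is illustrative: the table M is hard-coded from the predictive parsing table of the expression grammar used in the following example, and a production is represented as a (head, body) pair with body () standing for ε.

```python
# A sketch of the predictive parsing loop above, assuming the table M was
# built for the grammar E -> TE', E' -> +TE' | eps, T -> FT', T' -> *FT' | eps,
# F -> (E) | id. A table entry maps (non-terminal, lookahead) to a body.
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
M = {
    ("E", "id"): ("T", "E'"),  ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"),
    ("E'", ")"): (),           ("E'", "$"): (),
    ("T", "id"): ("F", "T'"),  ("T", "("): ("F", "T'"),
    ("T'", "+"): (),           ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): (),           ("T'", "$"): (),
    ("F", "id"): ("id",),      ("F", "("): ("(", "E", ")"),
}

def parse(tokens, start="E"):
    """Return the productions of a leftmost derivation, or raise SyntaxError."""
    stack = ["$", start]                  # $ marks the bottom of the stack
    buf = list(tokens) + ["$"]            # $ marks the end of the input
    output = []
    while True:
        X, a = stack[-1], buf[0]
        if X == a == "$":                 # successful completion
            return output
        if X not in NONTERMINALS:         # X is a terminal or $
            if X == a:
                stack.pop()               # match: pop X and advance the input
                buf.pop(0)
            else:
                raise SyntaxError(f"expected {X!r}, saw {a!r}")
        elif (X, a) in M:                 # consult the parsing table
            body = M[(X, a)]
            stack.pop()
            stack.extend(reversed(body))  # push Yk ... Y1, so Y1 ends on top
            output.append((X, body))
        else:
            raise SyntaxError(f"no entry M[{X}, {a}]")
```

Calling `parse(["id", "+", "id", "*", "id"])` yields the eleven productions of the leftmost derivation, beginning with E → TE′.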
Example: Consider the grammar:


E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
With input id + id * id the predictive parser makes the sequence of moves as shown below.
This parser traces out a leftmost derivation for the input, the output productions are those of a
leftmost derivation.
Step 1: FIRST and FOLLOW
The functions FIRST and FOLLOW allow us to fill in the entries of a predictive parsing table.
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.
1) If X is a terminal, then FIRST(X) is {X};
(i.e.) FIRST(terminal) = terminal.
2) If X → ε is a production, then add ε to FIRST(X).
3) If X is a non-terminal with a production X → Y1 Y2 ... Yk, then FIRST(X) is built from the first terminal symbol that can be produced by the chain of leading symbols: FIRST(X) includes FIRST(Y1), and if Y1 derives ε, also FIRST(Y2), and so on.
To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be added to any FOLLOW set. The non-terminal for which FOLLOW is being computed should not appear on both sides of the production used.
1) If S is the starting symbol,
FOLLOW(S) ⊇ {$}.
2) If there is a production A → αBβ, then
FOLLOW(B) ⊇ FIRST(β) except ε.
3) If there is a production A → αB,
(or) if A → αBβ where β ⇒* ε, then
FOLLOW(B) ⊇ FOLLOW(A).

Considering the following grammar let us compute the FIRST and FOLLOW
E → TE′ ... (1)
E′ → +TE′ | ε ... (2)
T → FT′ ... (3)
T′ → *FT′ | ε ... (4)
F → (E) | id ... (5)

FIRST(E) = {(, id}
FIRST(E′) = {+, ε}
FIRST(T) = {(, id}
FIRST(T′) = {*, ε}
FIRST(F) = {(, id}

[Explanation: by (1), E begins with the non-terminal T, hence FIRST(E) = FIRST(T). By (3), T begins with the non-terminal F, hence FIRST(T) = FIRST(F). By (5), FIRST(F) = {(, id}.]

NOTE:
• In the right-hand side of a production, fix the element for which FOLLOW is to be calculated as B; the elements to the left of B form α and the elements to the right of B form β.
• Then try to apply all possible rules (1), (2) and (3) for the FOLLOW calculation.
1. FOLLOW (E)
Find a production in which E is on the right side; moreover, the left side should not have the same non-terminal.
Consider F → ( E )
matched against A → α B β, with B = E and β = ).
FOLLOW(E) ⊇ FIRST(β) by rule (2)
= { ) }
As E is the start symbol, add $ by rule (1).
NOTE: FOLLOW(E) = { ), $ }
Rule (3) is not applicable.
2. FOLLOW (E′)
Consider E → T E′
matched against A → α B, with B = E′.
FOLLOW(E′) ⊇ FOLLOW(E) by rule (3).
NOTE: Rules (1) and (2) are not applicable.
∴ FOLLOW(E′) = { ), $ }
3. FOLLOW (T)
Consider E′ → + T E′.
FOLLOW(T) ⊇ FIRST(E′) except ε by rule (2)
= { + }.
Consider E → T E′ and E′ → + T E′, where E′ ⇒ ε,
matched against A → α B β with B = T and β = E′.
FOLLOW(T) ⊇ FOLLOW(E) and FOLLOW(T) ⊇ FOLLOW(E′) by rule (3)
= { ), $ }
∴ FOLLOW(T) = { +, ), $ }
NOTE: Rule (1) is not applicable, as T is not the start symbol.
4. FOLLOW (T)
Consider T  F T 
  
A   B 
 FOLLOW (T)  FOLLOW (T) by rule (3) Rules (1) and (2) are not applicable
 FOLLOW (T)  { + , ) , $}
5. FOLLOW (F)
Consider T → F T′
matched against A → α B β, with B = F and β = T′.
FOLLOW(F) ⊇ FIRST(T′) except ε by rule (2)
= { * }
Consider T → F T′ and T′ → * F T′, where T′ ⇒ ε.
FOLLOW(F) ⊇ FOLLOW(T) and FOLLOW(F) ⊇ FOLLOW(T′) by rule (3).
∴ FOLLOW(F) = { *, +, ), $ }
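The rule-by-rule calculation above is naturally a fixed-point computation: keep applying the FIRST and FOLLOW rules until no set grows. A sketch in Python (the grammar encoding and names are illustrative; the ε body is written as an empty list):

```python
# A fixed-point computation of FIRST and FOLLOW (a sketch of the rules above).
EPS = "eps"                      # stands for the symbol epsilon
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],     # [] is the epsilon body
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of(seq, FIRST):
    """FIRST of a sequence of grammar symbols (a terminal is its own FIRST)."""
    out = set()
    for sym in seq:
        f = FIRST.get(sym, {sym})
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}           # every symbol of seq can derive epsilon

def compute_first_follow(grammar, start):
    FIRST = {nt: set() for nt in grammar}
    FOLLOW = {nt: set() for nt in grammar}
    FOLLOW[start].add("$")                       # FOLLOW rule (1)
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                f = first_of(body, FIRST)        # FIRST rules (1)-(3)
                if not f <= FIRST[head]:
                    FIRST[head] |= f
                    changed = True
                for i, sym in enumerate(body):   # FOLLOW rules (2)-(3)
                    if sym not in grammar:
                        continue                 # terminals have no FOLLOW
                    rest = first_of(body[i + 1:], FIRST)
                    add = rest - {EPS}
                    if EPS in rest:
                        add |= FOLLOW[head]
                    if not add <= FOLLOW[sym]:
                        FOLLOW[sym] |= add
                        changed = True
    return FIRST, FOLLOW
```

Running `compute_first_follow(GRAMMAR, "E")` reproduces the sets computed by hand above.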

The following algorithm can be used to construct the predictive parsing table.

Algorithm:
Input: Grammar G
Output: Parsing table M
Note: the grammar used here is:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
Method:
1) For each production A → α of the grammar, do steps 2 and 3.
2) For each terminal a in FIRST(α), add A → α to M[A, a].
3) If ε is in FIRST(α), add A → α to M[A, b]
for each terminal b in FOLLOW(A).
If ε is in FIRST(α) and $ is in FOLLOW(A),
add A → α to M[A, $].
NOTE: For this grammar, α is ε in these cases.
4) Make each undefined entry of M be error.
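Steps 1)-4) translate almost directly into code. In this sketch the FIRST and FOLLOW sets are hard-coded from the worked example above rather than computed, and the names are illustrative; an undefined (absent) cell plays the role of the error entry of step 4.

```python
# Build the predictive parsing table M (steps 1-3 above).
EPS = "eps"
GRAMMAR = [
    ("E",  ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", []),        # [] is the epsilon body
    ("T",  ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F",  ["(", "E", ")"]),  ("F",  ["id"]),
]
# Hand-computed sets from the worked example above.
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"*", "+", ")", "$"}}

def first_of_body(body):
    out = set()
    for sym in body:
        f = FIRST.get(sym, {sym})                # a terminal is its own FIRST
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}

def build_table(grammar):
    M = {}
    for head, body in grammar:                   # step 1
        f = first_of_body(body)
        for a in f - {EPS}:                      # step 2
            M.setdefault((head, a), []).append(body)
        if EPS in f:                             # step 3 ($ is in FOLLOW here)
            for b in FOLLOW[head]:
                M.setdefault((head, b), []).append(body)
    return M                                     # absent cell = error (step 4)

M = build_table(GRAMMAR)
```

For an LL(1) grammar every cell holds exactly one production; a cell that receives two or more signals a conflict.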
NOTE: The parsing table produced by the preceding algorithm is shown below.
Step 2: Parsing
Construction of the Parsing Table for:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
Rows → Non-terminals.
Columns → Terminals.

Non-terminal |   id    |     +     |     *      |    (    |    )   |   $
E            | E → TE′ |           |            | E → TE′ |        |
E′           |         | E′ → +TE′ |            |         | E′ → ε | E′ → ε
T            | T → FT′ |           |            | T → FT′ |        |
T′           |         | T′ → ε    | T′ → *FT′  |         | T′ → ε | T′ → ε
F            | F → id  |           |            | F → (E) |        |
Procedure to fill the Predictive Parse Table

Step 3: Do the parsing
Stack    | Input          | Output
$E       | id + id * id$  |
$E′T     | id + id * id$  | E → TE′ [pop E, push the RHS of the production in reverse]
$E′T′F   | id + id * id$  | T → FT′
$E′T′id  | id + id * id$  | F → id
$E′T′    | + id * id$     | match id and advance
$E′      | + id * id$     | T′ → ε
$E′T+    | + id * id$     | E′ → +TE′
$E′T     | id * id$       | match + and advance
$E′T′F   | id * id$       | T → FT′
$E′T′id  | id * id$       | F → id
$E′T′    | * id$          | match id and advance
$E′T′F*  | * id$          | T′ → *FT′
$E′T′F   | id$            | match * and advance
$E′T′id  | id$            | F → id
$E′T′    | $              | match id and advance
$E′      | $              | T′ → ε
$        | $              | E′ → ε

Hence the string is accepted.


Example 2: Consider the following grammar:
S → iEtSS′ | a
S′ → eS | ε
E → b
FIRST(S) = {i, a}
FIRST(S′) = {e, ε}
FIRST(E) = {b}
FOLLOW(S) ⊇ FIRST(S′) except ε, together with $ as S is the start symbol
= {e, $}
FOLLOW(S′) = FOLLOW(S) = {e, $}
Consider S → iEtSS′. Here, for B = E, β = tSS′, so
FOLLOW(E) = FIRST(tSS′) = {t}.
The parsing table for this grammar is shown below:

Non-terminal |   a   |   b   |        e        |     i      |  t  |   $
S            | S → a |       |                 | S → iEtSS′ |     |
S′           |       |       | S′ → ε, S′ → eS |            |     | S′ → ε
E            |       | E → b |                 |            |     |
The entry for M[S, e] contains both S  es and S  
Thus the grammar is ambiguous.
We can resolve the ambiguity if we choose S  eS among S   and S  eS. This choice
corresponds to associating else with the closest previous then
S  E is surely wrong as it will satisfies the production use of an unmatched statement.
(i.e) dr. S  iEtSS, if S  E the production best
S  iEtS which is III to:
unmatched statement  if expr then statement.
Disadvantages of Predictive Parsing:
1) It is difficult to write a grammar for the source language such that a predictive parser can be constructed from the grammar.
2) Left recursion elimination and left factoring make the resulting grammar hard to read and difficult to use for translation purposes.


Error Recovery in Predictive Parsing:
Two Possible Reasons for Error
1) An error is detected during predictive parsing when the terminal on top of the stack
does not match the next input symbol.
2) When non-terminal A is on top of the stack „a‟ is the next input symbol and the
parsing table entry M[A, a] is empty.
Panic mode error recovery is based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens (delimiters, e.g. ";" or "end") appears.
The parsing table seen earlier is modified by adding "sync" entries under the FOLLOW set of each non-terminal.
Non-terminal                 |   id    |    +     |     *     |    (    |   )    |   $
E,  FOLLOW(E) = {), $}       | E→TE′   |          |           | E→TE′   | sync   | sync
E′, FOLLOW(E′) = {), $}      |         | E′→+TE′  |           |         | E′→ε   | E′→ε
T,  FOLLOW(T) = {+, ), $}    | T→FT′   | sync     |           | T→FT′   | sync   | sync
T′, FOLLOW(T′) = {+, ), $}   |         | T′→ε     | T′→*FT′   |         | T′→ε   | T′→ε
F,  FOLLOW(F) = {+, *, ), $} | F→id    | sync     | sync      | F→(E)   | sync   | sync

NOTE: The same table discussed earlier is modified by adding sync in M[X, c], where c is a terminal in FOLLOW(X). If the respective cells are already filled, leave them undisturbed.
 If the parser looks up entry M[A, a] and finds that it is blank, then the input symbol „a‟
is skipped.
 If the entry is sync then the non-terminal on top of the stack is popped in an attempt to
resume parsing.
 If a token on top of the stack does not match the input symbol, then we POP the token
from the stack.
On the erroneous input + id * + id, the parser and the error recovery mechanism behave as follows:
Stack    | Input     | Remark
$E       | +id*+id$  | error, skip +; M[E, +] is empty
$E       | id*+id$   | id is in FIRST(E)
$E′T     | id*+id$   | E → TE′
$E′T′F   | id*+id$   | T → FT′
$E′T′id  | id*+id$   | F → id
$E′T′    | *+id$     |
$E′T′F*  | *+id$     | T′ → *FT′
$E′T′F   | +id$      | error, M[F, +] = sync; F has been popped
$E′T′    | +id$      | T′ → ε
$E′      | +id$      | E′ → +TE′
$E′T+    | +id$      |
$E′T     | id$       |
$E′T′F   | id$       | T → FT′
$E′T′id  | id$       | F → id
$E′T′    | $         | T′ → ε
$E′      | $         | E′ → ε
$        | $         |

The above discussion of panic mode recovery does not address the important issue of error
messages. In general informative error messages have to be supplied by the compiler designer.
Phrase Level Recovery:
It is implemented by filling in the blank entries in the predictive parsing table with pointers to
error routines. These routines may change insert or delete symbols on the input and issue
appropriate error messages.
2.8 LL(1) PARSER
A grammar whose parsing table has no multiply defined entries is said to be LL(1).
1st L  scanning the input from left to right
2nd L  for producing a left most derivation
1  for using one input symbol of look ahead at each step to make parsing
action decisions.
No ambiguous or left-recursive grammar can be LL(1).
It can also be shown that a grammar G is LL(1) if and only if, whenever A → α | β are two distinct productions of G, the following conditions hold:
1) For no terminal a do both α and β derive strings beginning with a.
2) At most one of α and β can derive the empty string.
3) If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
Example of an LL(1) grammar:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
Consider T′ → *FT′ | ε, with α = *FT′ and β = ε.
By rule (1), α begins with * and β derives no string beginning with *, so rule (1) is satisfied.
By rule (2), only β derives ε, so rule (2) is satisfied.
By rule (3), FOLLOW(T′) = {+, ), $}, and α does not derive any string beginning with +, ) or $, so rule (3) is satisfied.

Example of a grammar which is not LL(1):
S → iEtS | iEtSeS | a
E → b
Consider S → iEtS | iEtSeS.
Here both α and β begin with the terminal i, so rule (1) is not satisfied.
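Applying the table-filling idea to the dangling-else grammar of Example 2 makes the conflict visible mechanically: one cell receives two productions. A sketch (S′ is written `S1`; the FIRST/FOLLOW sets are hand-computed from that example; names are illustrative):

```python
# Detect an LL(1) conflict: a multiply defined parsing-table entry.
# Grammar: S -> iEtS S1 | a ; S1 -> eS | eps ; E -> b   (S1 stands for S')
EPS = "eps"
GRAMMAR = [
    ("S",  ["i", "E", "t", "S", "S1"]), ("S", ["a"]),
    ("S1", ["e", "S"]),                 ("S1", []),
    ("E",  ["b"]),
]
FIRST = {"S": {"i", "a"}, "S1": {"e", EPS}, "E": {"b"}}
FOLLOW = {"S": {"e", "$"}, "S1": {"e", "$"}, "E": {"t"}}

def first_of_body(body):
    out = set()
    for sym in body:
        f = FIRST.get(sym, {sym})       # a terminal is its own FIRST
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}

def table_conflicts(grammar):
    """Fill the table; return the cells that received more than one body."""
    M = {}
    for head, body in grammar:
        keys = first_of_body(body) - {EPS}
        if EPS in first_of_body(body):
            keys |= FOLLOW[head]
        for a in keys:
            M.setdefault((head, a), []).append(body)
    return {cell: bodies for cell, bodies in M.items() if len(bodies) > 1}

conflicts = table_conflicts(GRAMMAR)
# M[S1, e] holds both S1 -> eS and S1 -> eps, so the grammar is not LL(1).
```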
2.9 SHIFT REDUCE PARSER
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse
tree for an input string beginning at the leaves (the bottom) and working up towards the root
(the top)
Actions in shift reduce parser:
(i) Shift: The next symbol is shifted on the top of the stack
(ii) Reduce: The parser replaces the handle within a stack with a nonterminal.
(iii)Accept: The parser announces the successful completion of parsing.
(iv)Error: The parser discovers that a syntax error has occurred and calls a error recovery
routine.
Consider the grammar:
S → aABe
A → Abc | b
B→d
The sentence to be recognized is abbcde. The reductions are:
abbcde
aAbcde
aAde
aABe
S
Handle:
A handle of a right sentential form γ is a production A → β and a position of γ where the string β may be found and replaced by A to produce the previous right sentential form in a rightmost derivation of γ.
That is, if S ⇒*rm αAw ⇒rm αβw, then A → β in the position following α is a handle of αβw.
Consider the rightmost derivation of id1 + id2 * id3:
E ⇒rm E + E
  ⇒rm E + E * E
  ⇒rm E + E * id3
  ⇒rm E + id2 * id3
  ⇒rm id1 + id2 * id3
Here, id1 is the handle of the right sentential form id1 + id2 * id3, because id is the right side of the production E → id, and replacing id1 by E produces the previous right sentential form:
E + id2 * id3
Since the grammar is ambiguous, there is another rightmost derivation of the same string:
E ⇒rm E * E
  ⇒rm E * id3
  ⇒rm E + E * id3
  ⇒rm E + id2 * id3
  ⇒rm id1 + id2 * id3
Handle Pruning:
The rightmost derivation in reverse can be obtained by "handle pruning". The process is repeated until the right sentential form consists of only the start symbol. The reverse of the sequence of productions used in the reductions is a rightmost derivation of the input string.
Example: The sequence of steps of the reduction of the input string i + i * i is shown in the table below.
NOTE: This is just the reverse of the sequence in the rightmost derivation.
The grammar is:        The rightmost derivation is:
E → E + E              E ⇒ E + E
E → E * E                ⇒ E + E * E
E → (E)                  ⇒ E + E * i
E → i                    ⇒ E + i * i
                         ⇒ i + i * i

Problems that must be solved if we are to parse by handle pruning:
• Locate the substring to be reduced in a right sentential form.
• Determine what production to choose in case there is more than one production with that substring on the right-hand side.
Right Sentential Form | Handle | Reducing Production
i + i * i             | i      | E → i
E + i * i             | i      | E → i
E + E * i             | i      | E → i
E + E * E             | E * E  | E → E * E
E + E                 | E + E  | E → E + E
E                     |        |

2.9.1 Stack Implementation of Shift Reduce Parser


• A convenient way to implement a shift-reduce parser is to use a stack to hold grammar symbols and an input buffer to hold the string w to be parsed.
• $ may be used to mark the bottom of the stack and also the right end of the input.
Initially:  Stack: $   Input Buffer: abbcde$
• The parser shifts zero or more input symbols onto the stack until a handle β is on top of the stack; it then reduces β to the left side of the appropriate production.

Grammar:
S → aAcBe
A → Ab | b
B → d

Input string:
w = abbcde
Stack  | Input   | Action
$      | abbcde$ | shift
$a     | bbcde$  | shift
$ab    | bcde$   | reduce A → b
$aA    | bcde$   | shift
$aAb   | cde$    | reduce A → Ab
$aA    | cde$    | shift
$aAc   | de$     | shift
$aAcd  | e$      | reduce B → d
$aAcB  | e$      | shift
$aAcBe | $       | reduce S → aAcBe
$S     | $       | accept

2.9.2 Conflicts during Shift Reduce Parsing


There are context-free grammars for which shift-reduce parsing cannot be used. Such grammars can reach a configuration in which the parser, knowing the entire stack contents and the next input symbol, cannot decide whether to shift or to reduce (shift-reduce conflict) or cannot decide which of several reductions to make (reduce-reduce conflict).
These grammars are referred to as non-LR grammars.
An ambiguous grammar can never be LR.
Example: Shift reduce conflicts
stmt  if expr then stmt
|if expr then stmt else stmt
|other
With
Stack: ... if expr then stmt    Input: else ... $
we cannot tell whether "if expr then stmt" is the handle.
Here, there is a shift-reduce conflict as we cannot determine, whether it is correct to reduce


“if expr then statement” to “statement” or it might be correct to shift „else‟ and then to look
for another statement to complete the alternate “if expr then statement else statement”.
Thus, we cannot tell whether to shift or reduce in this case, so the grammar is not LR(1).
L  left to right scanning of input
R  constructing a rightmost derivative in reverse
1  number of input symbols of lookahead that are used in making parsing
decisions.
Reduce-reduce conflict: knowing the entire stack contents and the next input symbol, the parser cannot decide which of several reductions to make.
2.10 LR PARSERS
 The bottom up syntax analysis technique that can be used to parse a large class of
context free grammar is called LR (K) parsing.
„L‟ stands for left to right scanning of the input.
„R‟ for constructing a right most derivation in reverse.
„K‟ is the number of input symbols of look ahead.
2.10.1 Features of LR Parser
 It can be constructed to recognize all programming language constructs for which
CFG‟s can be written.
 It is the most general non-backtracking shift reduce parsing method known.
• The class of grammars that can be parsed using LR methods properly includes the class of grammars that can be parsed with predictive parsers.
The schematic form of LR parser is shown in the Figure.

Figure 2.16
The LR parser consists of:


1) Input buffer
2) Output
3) Stack
4) Driver program/parsing program
5) Parsing table that has two parts (action and goto).
There are three ways to construct the LR parsing tables, as listed below:
1) SLR (Simple LR), based on LR(0) items
2) CLR (Canonical LR), based on LR(1) items, and
3) LALR (Look-ahead LR), based on LR(1) items
The parsing program is common for all 3 cases, but the parsing table changes from one parser to another.
SLR Parser
In order to construct the SLR parsing table, the following two steps are needed:
1) Construction of the collection of sets of LR(0) items, using the CLOSURE and GOTO functions.
2) Construction of the parsing table from the LR(0) items.
2.11 LR(0) ITEMS
An LR(0) item of a grammar G is a production of G with a dot at some position on the right side of the production.
Example: if E → X is a production, the LR(0) items are E → .X and E → X.
The collection of sets of LR(0) items must be constructed in order to construct the SLR parsing table.
The collection of sets of LR(0) items can be constructed with the help of functions called closure and goto. This collection is called the canonical collection of LR(0) items.
The Closure Function:
Let us say I is a set of items for a grammar G; then the closure of I (CLOSURE(I)) can be computed by using the following steps:
1) Initially every item in I is added to closure(I).
2) Consider the following:
A → X.BY, an item in I
B → Z, a production.
Then add the item B → .Z also to closure(I), if it is not already there. This rule has to be applied till no new items can be added to closure(I). (That is, if there is a "." before a non-terminal, then also include the production rules of that non-terminal, each with a dot at the left end of its right side.)
GOTO Function:
This is computed for a set I on a grammar symbol X.
If A → α.Xβ is in I (an item), then
goto(I, X) contains A → αX.β, together with its closure.
Even before the closure function is computed, the augmented grammar has to be found, as a first step in the construction of the canonical collection of LR(0) items for the given grammar G.
Augmented Grammar:
Consider the grammar G, in which S is the start symbol. The augmented grammar G′ of G has a new start symbol S′, a new production S′ → S, and all the production rules of the given grammar G.
Example 1: Construct the canonical collection of LR(0) items for the given grammar G:
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
The augmented grammar G′:
E′ → E
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
The next step is to find closure(E′ → .E):
E′ → .E
E → .E + T
E → .T
T → .T * F
T → .F
F → .(E)
F → .id
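CLOSURE and GOTO can be written down directly, with an item represented as a (head, body, dot-position) triple; the sketch below recomputes the closure just shown (the encoding is illustrative).

```python
# LR(0) items as (head, body, dot) triples; closure and goto as defined above.
GRAMMAR = [
    ("E'", ("E",)),                        # augmented start production
    ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMINALS = {"E'", "E", "T", "F"}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in NONTERMINALS:
                B = body[dot]              # dot stands before non-terminal B:
                for h, b in GRAMMAR:       # add every B -> .gamma item
                    if h == B and (h, b, 0) not in result:
                        result.add((h, b, 0))
                        changed = True
    return result

def goto(items, X):
    """Move the dot over X in every item where X follows the dot."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == X}
    return closure(moved)

I0 = closure({("E'", ("E",), 0)})          # the seven items listed above
```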
2.12 CONSTRUCTION OF SLR PARSING TABLE


This is also a two-dimensional array in which the rows are states and the columns are terminals and non-terminals.
The table has 2 parts:
1) Action entries (which consist of terminals).
2) Goto entries (which consist of non-terminals).
The action may be any one of the following:
(1) shift, (2) reduce, (3) accept, (4) error.
The goto entries will have state numbers like 1, 2, 3.
State: Consider a set Ij in the collection of LR(0) items. Here j is the state number, which is used in parsing the input string. Example: if Ii is an item set, i is the state.
Goto Entries: The goto entries indicate the transition of a state i to a state j on a particular non-terminal.
Steps for the construction of the parsing table:
1) Fill up the goto part with the next state for each non-terminal.
Example: I0 on an E-transition reaches I1, so enter goto(0, E) = 1.
Repeat for all states.
2) M[1, $] = accept (since E′ → E. is in I1).
3) To make the shift entries, fill up the action part with si, where i is the next state, for each terminal.
Example: I1 on a +-transition reaches I6, so enter action(1, +) = s6.
Repeat for all states.
4) States whose items have the dot as the rightmost symbol give reduce entries. If A → XBY. is present in item Ii, fill the cells corresponding to state i and the elements of FOLLOW(A) with rj, where r refers to reduce and j refers to the number of the production A → XBY.
Example: I2 has E → T., and E → T is the 2nd production in the grammar, which is numbered as follows:
(1) E → E + T (2) E → T (3) T → T * F
(4) T → F (5) F → (E) (6) F → id
Note: FOLLOW(E) = {+, ), $}
Enter r2 under the elements of FOLLOW(E), where r is reduction and 2 is the number of the production E → T in the unaugmented list.
All the undefined entries are errors.
Parsing Table

State |  id  |  +   |  *   |  (   |  )   |   $    ||  E  |  T  |  F
0     |  s5  |      |      |  s4  |      |        ||  1  |  2  |  3
1     |      |  s6  |      |      |      | accept ||     |     |
2     |      |  r2  |  s7  |      |  r2  |  r2    ||     |     |
3     |      |  r4  |  r4  |      |  r4  |  r4    ||     |     |
4     |  s5  |      |      |  s4  |      |        ||  8  |  2  |  3
5     |      |  r6  |  r6  |      |  r6  |  r6    ||     |     |
6     |  s5  |      |      |  s4  |      |        ||     |  9  |  3
7     |  s5  |      |      |  s4  |      |        ||     |     |  10
8     |      |  s6  |      |      |  s11 |        ||     |     |
9     |      |  r1  |  s7  |      |  r1  |  r1    ||     |     |
10    |      |  r3  |  r3  |      |  r3  |  r3    ||     |     |
11    |      |  r5  |  r5  |      |  r5  |  r5    ||     |     |

The grammar:
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
After eliminating left recursion:
E → TE′          FIRST(E) = {(, id}
E′ → +TE′ | ε    FIRST(T) = {(, id}
T → FT′          FIRST(F) = {(, id}
T′ → *FT′ | ε    FIRST(E′) = {+, ε}
F → (E) | id     FIRST(T′) = {*, ε}
FOLLOW (E):
Consider F → (E).
FOLLOW(E) ⊇ FIRST( ) ) = {)}
E is the start symbol, so
FOLLOW(E) = {), $}
Consider E → TE′:
FOLLOW(T) = FIRST(E′) except ε, plus FOLLOW(E′)
= {+, $, )}
Consider T → FT′:
FOLLOW(F) = FIRST(T′) except ε, plus FOLLOW(T)
= {*, $, +, )}
2.12.1 SLR Parsing Algorithm
For parsing an input string, the possible parsing actions are as follows:
1. Shift
2. Reduce by a production
3. Accept and halt
4. Error
The input string is in input buffer followed by the right end marker $. The stack keeps the
states of the Parsing Table.
Steps involved in SLR Parsing:


Initially, action[stack top state, first symbol in the input buffer] is referred to in the parsing table.
1) If action[Sx, ay] = sj, the parser has to shift the current input symbol from the buffer to the stack, and then push j also onto the stack.
2) If action[Sx, ay] = rj, the parser has to reduce by production number j. If rule j is of the form A → β, the top 2 × |β| elements are popped from the stack and the reduction is applied to the popped elements; then the table is referred for goto[s, A], where s is the state now on top, and A together with the referred state is pushed onto the stack.
If action[Sx, ay] = accept, then announce that the parsing is completed successfully and halt.
3) If action[Sx, ay] = error, then the parser encounters an error and calls an error recovery routine or generates an error message.
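The steps above can be sketched as a driver over hard-coded ACTION and GOTO tables. The entries below follow the standard SLR construction for the expression grammar, with production numbering (1) E → E + T ... (6) F → id; treat the state numbers as illustrative, since renumbering the item sets renumbers the table.

```python
# SLR parsing driver. ACTION maps (state, terminal) to ("s", j), ("r", j)
# or ("acc",); GOTO maps (state, non-terminal) to a state number.
PROD = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),      # (head, body length)
        4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}
ACTION = {
    (0, "id"): ("s", 5), (0, "("): ("s", 4),
    (1, "+"): ("s", 6), (1, "$"): ("acc",),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "id"): ("s", 5), (4, "("): ("s", 4),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "id"): ("s", 5), (6, "("): ("s", 4),
    (7, "id"): ("s", 5), (7, "("): ("s", 4),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    """Return the production numbers used, in reduction order."""
    stack, buf, output = [0], list(tokens) + ["$"], []
    while True:
        act = ACTION.get((stack[-1], buf[0]))
        if act is None:
            raise SyntaxError(f"error at {buf[0]!r}")
        if act[0] == "s":                       # shift symbol, then state j
            stack += [buf.pop(0), act[1]]
        elif act[0] == "r":                     # reduce by production act[1]
            head, size = PROD[act[1]]
            del stack[-2 * size:]               # pop 2 * |body| entries
            stack += [head, GOTO[(stack[-1], head)]]
            output.append(act[1])
        else:
            return output                       # accept

moves = lr_parse(["id", "+", "id"])             # reductions 6,4,2,6,4,1
```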

Stack          | Input    | Action
0              | id + id$ | action(0, id) = s5: shift id, push 5
0 id 5         | + id$    | action(5, +) = r6: reduce F → id; |id| = 1, so pop 2 elements, push F; goto(0, F) = 3, push 3
0 F 3          | + id$    | action(3, +) = r4: reduce T → F; goto(0, T) = 2
0 T 2          | + id$    | action(2, +) = r2: reduce E → T; goto(0, E) = 1
0 E 1          | + id$    | action(1, +) = s6: shift +, push 6
0 E 1 + 6      | id$      | action(6, id) = s5: shift id, push 5
0 E 1 + 6 id 5 | $        | action(5, $) = r6: reduce F → id; goto(6, F) = 3
0 E 1 + 6 F 3  | $        | action(3, $) = r4: reduce T → F; goto(6, T) = 9
0 E 1 + 6 T 9  | $        | action(9, $) = r1: reduce E → E + T; goto(0, E) = 1
0 E 1          | $        | action(1, $) = accept
Disadvantages of SLR Parser:
• Too much work to construct the parser.
• Sometimes there may be both a shift and a reduce entry in action[X, Y]. This conflict arises from the fact that the SLR parser construction method is not powerful enough to remember enough left context to decide what action to take.
CLR PARSER
It is also an LR parser; CLR is the canonical LR parser. Many of the concepts in the CLR parser are similar to the SLR parser, but there are some differences in the construction of the parsing table.
• The construction of the parsing table is done with LR(1) items.
• The steps involved in the construction of the parsing table from the LR(1) items are also different from the SLR table construction.
LR(1) Items: here 1 refers to the length of the second component, the lookahead.
The general form of an LR(1) item is A → X.Y, a; where a is called the lookahead. It may be a terminal or the right end marker $.
Example: S′ → S., $, where $ is the lookahead.
The collection of sets of LR(1) items leads to the construction of the CLR parsing table.
The Closure Function:
Let us say I is a set of LR(1) items for the given grammar; then the closure of I, represented as closure(I), can be computed using the following steps:
1) Initially every item in I is added to closure(I).
2) Consider
A → X.BY, a
B → Z
where X, Y, Z are grammar symbols. Then add B → .Z, FIRST(Ya) as a new LR(1) item, if it is not already there.
This rule has to be applied till no new items can be added to closure(I). Thus the closure of (A → X.BY, a) will have:
A → X.BY, a
B → .Z, FIRST(Ya).
Note:
1) If Y (the part after B) is present, the lookahead will be FIRST(Ya), i.e. FIRST(Y) when Y does not derive ε.
2) If Y is not available, the lookahead will be FIRST(a) = a.
GOTO FUNCTION:
This is computed for a set I on a grammar symbol X; the set that I reaches on X is computed by the function:
goto(I, X)
Consider an LR(1) item in I of the form:
A → α.XBY, a
Then goto(I, X) will contain
A → αX.BY, a
including the closure of B with its lookahead, where X, Y are grammar symbols.
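The closure rule can be sketched by carrying the lookahead alongside each LR(0) item: for A → α.Bβ, a, every added item B → .γ gets lookahead FIRST(βa). The helper below is specialized to the grammar of the example that follows (S → CC, C → cC | d), where no symbol derives ε, so FIRST(βa) is just FIRST of the first symbol of β, or a itself when β is empty; names are illustrative.

```python
# LR(1) closure: an item is (head, body, dot, lookahead).
GRAMMAR = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMINALS = {"S'", "S", "C"}
FIRST = {"S": {"c", "d"}, "C": {"c", "d"}, "c": {"c"}, "d": {"d"}}

def first_of(beta, lookahead):
    """FIRST(beta a); no symbol of this grammar derives epsilon."""
    return FIRST[beta[0]] if beta else {lookahead}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot, la in list(result):
            if dot < len(body) and body[dot] in NONTERMINALS:
                B, beta = body[dot], body[dot + 1:]
                for h, b in GRAMMAR:
                    if h != B:
                        continue
                    for b_la in first_of(beta, la):   # lookahead FIRST(beta a)
                        if (h, b, 0, b_la) not in result:
                            result.add((h, b, 0, b_la))
                            changed = True
    return result

I0 = closure({("S'", ("S",), 0, "$")})
```

`I0` contains S′ → .S with lookahead $, S → .CC with lookahead $, and the C-productions with lookaheads c and d, matching the hand calculation in the example below.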
Example: Construct the LR(1) items for the following grammar:
S → CC
C → cC
C → d
Solution:
Step 1
The augmented grammar G′ is:
S′ → S
S → CC
C → cC
C → d
Step 2
Add the second component (the lookahead). The second component is what later avoids S/R conflicts.

I. S′ → .S, $
[Add $ as the second component to the first production of G′.]

II. S → .CC, $
[To find the second component, compare the closure step with the standard format A → α.Bβ, a. For S′ → .S, $ we have B = S, β empty and a = $, so the second component is FIRST($) = $.]

III. C → .cC, c | d
[From S → .CC, $, comparing with A → α.Bβ, a gives B = C (the first C), β = C and a = $. The second component is FIRST(C$) = FIRST(C) = c | d.]

IV. C → .d, c | d
[Since the left-hand side is again C, copy c | d to the second component. $ is not included, because FIRST(C) contains the terminals c, d and C does not derive the empty string.]

Note: Consider the state I2, which contains the item C → .cC, $. In state I0 the same production carried the second component c | d; within each state, the second component of the added items is derived from the kernel item of that state.
CLR Parsing Table Construction:
This is also a two-dimensional array, in which the rows are states and the columns are terminals and non-terminals. The table has two kinds of entries, namely:
1) Action entries.
2) Goto entries.
Steps for the Construction of the CLR Parsing Table:
1) Let C = {I0, I1, I2, ..., In} be the collection of sets of LR(1) items.
2) Consider Ij as a set in C.
a. If goto(Ij, a) = Ik, then set action[j, a] to sk; here a is always a terminal.
b. If A → X., a (X is a grammar string) is in Ij, then set action[j, a] to reduce by A → X; here a is the lookahead, and A should not be S′.
c. Set action[1, $] to accept
[since S′ → S. is in I1].
3) If goto(Ij, A) = Ik, then set goto[j, A] = k.
CLR Parsing Table

State |  c   |  d   |   $    ||  S  |  C
0     |  s3  |  s4  |        ||  1  |  2
1     |      |      | accept ||     |
2     |  s6  |  s7  |        ||     |  5
3     |  s3  |  s4  |        ||     |  8
4     |  r3  |  r3  |        ||     |
5     |      |      |  r1    ||     |
6     |  s6  |  s7  |        ||     |  9
7     |      |      |  r3    ||     |
8     |  r2  |  r2  |        ||     |
9     |      |      |  r2    ||     |

Stack     | Input | Action
0         | cdd$  | s3
0 c 3     | dd$   | s4
0 c 3 d 4 | d$    | r3: reduce by production rule 3, C → d; pop 2 elements, push C; goto(3, C) = 8, push 8
0 c 3 C 8 | d$    | r2: reduce by C → cC; pop 4 elements, push C; goto(0, C) = 2, push 2
0 C 2     | d$    | s7
0 C 2 d 7 | $     | r3: C → d; pop 2 elements, push C; goto(2, C) = 5
0 C 2 C 5 | $     | r1: S → CC; pop 4 elements, push S; goto(0, S) = 1
0 S 1     | $     | accept
4) All the undefined entries are errors. In the case of CLR parsing technique the reduce
entries are made for look ahead terminals.
S → CC
C → cC
C → d
2.13 INTRODUCTION TO LALR PARSER
The look-ahead LR parser is another parser in the LR parser category. This parser also constructs the parsing table from LR(1) items. There is a slight modification in the construction of the parsing table for LALR parsers, and the parsing algorithm is much the same as that of the other LR parsers.
LALR states come from the CLR states: in the CLR parser, if two states differ only in their lookaheads, then we combine those two states in the LALR parser. After minimization, if the parsing table has no conflict, then the grammar is LALR.
Steps for constructing the LALR Parsing Table:
1) Let C = {I0, I1, ..., In} be the collection of LR(1) items.
2) Find the sets having the same core items in the collection of LR(1) items and replace them by their union. (i.e.) If Ii and Ij have the same core items, they can be united as Iij.
3) All the remaining steps are similar to the construction of the CLR parsing table.
4) Let us consider the same problem discussed for the CLR parser:

I36: C → c.C, c|d|$   [Since I3 and I6 have the same core items, they are
     C → .cC, c|d|$    united as I36 (see the CLR item sets on the previous
     C → .d, c|d|$     pages).]

I47: C → d., c|d|$    [Since I4 and I7 have the same core items, they are
                       united as I47.]

I89: C → cC., c|d|$   [Since I8 and I9 have the same core items, they are
                       united as I89.]

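The core-merging step reduces to grouping the LR(1) item sets by their cores (items with the lookaheads stripped) and uniting each group. A minimal sketch, using abbreviated versions of the kernel items of I3/I6 and I4/I7 from the CLR construction (full states would also carry their closure items; names are illustrative):

```python
# Merge LR(1) states that share the same core (the LALR construction).
# An item is (head, body, dot, lookahead); the core drops the lookahead.
def core(item_set):
    return frozenset((h, b, d) for h, b, d, _la in item_set)

def merge_by_core(states):
    """states: dict state-number -> set of LR(1) items.
    Returns merged-name (e.g. '36') -> united item set."""
    groups = {}
    for num in sorted(states):
        groups.setdefault(core(states[num]), []).append(num)
    return {"".join(str(n) for n in nums):
                set().union(*(states[n] for n in nums))
            for nums in groups.values()}

# I3/I6 and I4/I7 differ only in lookaheads (kernel items abbreviated):
states = {
    3: {("C", ("c", "C"), 1, "c"), ("C", ("c", "C"), 1, "d")},
    4: {("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")},
    6: {("C", ("c", "C"), 1, "$")},
    7: {("C", ("d",), 1, "$")},
}
merged = merge_by_core(states)        # yields states "36" and "47"
```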
Table 2.2: LALR Tables
State |  c    |  d    |   $    ||  S  |  C
0     |  s36  |  s47  |        ||  1  |  2
1     |       |       | accept ||     |
2     |  s36  |  s47  |        ||     |  5
36    |  s36  |  s47  |        ||     |  89
47    |  r3   |  r3   |  r3    ||     |
5     |       |       |  r1    ||     |
89    |  r2   |  r2   |  r2    ||     |

The grammar is: S → CC
C → cC
C → d

Stack        | Input | Action
0            | cdd$  | s36
0 c 36       | dd$   | s47
0 c 36 d 47  | d$    | r3: reduce by production rule 3, C → d; goto(36, C) = 89
0 c 36 C 89  | d$    | r2: reduce by production rule 2, C → cC; goto(0, C) = 2
0 C 2        | d$    | s47
0 C 2 d 47   | $     | r3: reduce by production rule 3, C → d; goto(2, C) = 5
0 C 2 C 5    | $     | r1: reduce by production rule 1, S → CC; goto(0, S) = 1
0 S 1        | $     | accept

NOTE:
• Wherever s3 or s6 occurred in the CLR parsing table, the action entry s36 will now appear.
• Similarly, wherever s4 or s7 appeared in the CLR action entries, s47 will now take over.
• The reduce entries of states 8 and 9 are merged together.
• The reduce entries of states 4 and 7 are merged together in one row.
2.14 ERROR HANDLING AND RECOVERY IN SYNTAX ANALYZER
2.14.1 Syntax Error Handling
Good compiler should assist the programmer in identifying and locating errors. Programs can
contain errors at many different levels.
Example: Errors can be:
 Lexical, such as misspelling an identifier, keyword or operator.
 Syntactic such as an arithmetic expressions with unbalanced parenthesis.
 Semantic such as an operator applied to an incompatible operand.
 Logical such as an infinitely recursive call.
Much of the error detection and recovery in a compiler is centered around the syntax analysis
phase. Accurately detecting the presence of semantic and logical errors at compile time is a
much more difficult task.
The error handler in a parser has simple to state goals which are as follows:
 It should report the presence of errors clearly and accurately.
 It should recover from each error quickly enough to be able to detect each subsequent
errors.
 It should not significantly slow down the processing of correct programs.
 In difficult cases the error handler may have to guess what the programmer had in
mind when the program was written.
 Errors may be detected when the error detector see a prefix of the input that is not a
prefix of any string in the language.
Many of the errors could be classified simply
 60% were punctuation errors
 20% operator and operand errors
 15% keyword errors and remaining 5% other kinds.
Punctuation Errors
• Incorrect use of semicolons.
• Use of a comma in place of the semicolon, etc.
Operator Errors
• Leaving out the colon from ":=".
Misspellings of Keywords
• Leaving out "ln" from "writeln".
Example: Error that is much more difficult to repair correctly is
 Missing of “begin” or “end”.
How should an error handler report the presence of an error
• It should report the place in the source program where an error is detected.
• Commonly, this is done by printing the offending line with a pointer to the position at which the error is detected.
 Sometimes informative, understandable diagnostic message is also included with the
error message.
Example: Semicolon missing at the position.
Once an error is detected, how should the parser recover?
It is not good for a parser to quit after detecting the first error, because subsequent processing of the input may reveal additional errors.
So, there should be some form of error recovery by which the parser attempts to restore itself to a state where processing can continue.
 An inadequate recovery may introduce spurious errors.
Example: Syntactic error recovery may introduce spurious semantic errors that will later be
detected by the semantic analysis or code generation phases.
Example: Suppose the declaration of variable "i" is skipped during recovery. Later, when a
statement using "i" is encountered, there may not be any syntactic mistake in the statement, but
since there is no symbol table entry for "i", an "i undefined" error is generated.
An error recovery strategy has to be carefully designed to take into account the kinds of
errors that are likely to occur and are reasonable to process.
 Some compilers attempt error repair process by which the compiler attempts to guess
what the programmer intended to write.
Example: the PL/C compiler.
2.14.2 Error Recovery Strategies
There are many different general strategies that a parser can employ to recover from a
syntactic error, they are:
 Panic Mode Error Recovery
 Phrase Level Error Recovery
 Error Productions
 Global Correction.
Panic Mode Recovery
 It is the simplest method to implement and can be used by most parsing methods.
 On discovering an error, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens is found. The synchronizing tokens are usually
delimiters such as semicolon or end.
 Panic mode correction often skips a considerable amount of input without checking it
for additional errors.
Advantage:
 Simplicity
 Guaranteed not to go into an infinite loop.
Very useful when multiple errors in the same statement are rare.
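The token-skipping loop described above can be sketched in C. This is only an illustrative sketch: the token codes and the synchronizing set (semicolon and "end") are hypothetical choices, not part of any real parser's interface.

```c
/* Panic-mode recovery sketch. Token codes and the synchronizing set
   (';' and 'end') are hypothetical, for illustration only. */
enum { TOK_ID, TOK_NUM, TOK_PLUS, TOK_SEMI, TOK_END, TOK_EOF };

/* On detecting an error at position pos, discard input tokens until a
   synchronizing token is found; return the position where parsing resumes. */
int panic_mode_recover(const int *tokens, int n, int pos)
{
    while (pos < n && tokens[pos] != TOK_SEMI && tokens[pos] != TOK_END)
        pos++;                 /* skip input without checking it further */
    if (pos < n)
        pos++;                 /* consume the synchronizing token itself */
    return pos;
}
```

Because the loop only ever advances through the input, this recovery is guaranteed not to go into an infinite loop, which is the advantage noted above.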
Phrase Level Recovery
 On discovering an error a parser may perform local correction on the remaining input
that is it may replace a prefix of the remaining input by some strings that allows the
parser to continue.
Example: Local corrections include
o Replacing a comma by a semicolon,
o Inserting a missing semicolon,
o Deleting an extraneous semicolon.
This type of replacement has been used in several error repairing compilers.
Disadvantage:
The difficulty it has in coping with situations in which the actual error occurred before the
point of detection.
Error Productions Recovery
If we have a good idea of the common errors that might be encountered, we can augment the
grammar for the language at hand with productions that generate the erroneous constructs. We
then use the grammar with the error productions to construct a parser.
 If an error production is used by the parser, we can generate the appropriate error
diagnosis and recovery mechanisms.
Global Correction Recovery
There are algorithms for choosing a minimal sequence of changes to obtain a globally least-
cost correction.
Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a
related string y such that the number of insertions, deletions and changes of tokens required
to transform x into y is as small as possible.
Disadvantage:
 Too costly.
Note: The closest correct program may not be what the programmer had in mind, after these
error recovery strategies are applied.
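The "as small as possible" criterion used by global correction is the classic minimum edit distance between token strings. A minimal dynamic-programming sketch (characters stand in for tokens here, and the fixed buffer size is an assumption made for brevity):

```c
#include <string.h>

/* Minimum number of insertions, deletions and changes needed to turn
   string x into string y (Levenshtein distance). Characters stand in
   for tokens; inputs are assumed shorter than 64 symbols. */
int edit_distance(const char *x, const char *y)
{
    int m = (int)strlen(x), n = (int)strlen(y);
    int d[64][64];
    for (int i = 0; i <= m; i++) d[i][0] = i;     /* delete all of x */
    for (int j = 0; j <= n; j++) d[0][j] = j;     /* insert all of y */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int change = d[i-1][j-1] + (x[i-1] != y[j-1]);
            int del    = d[i-1][j] + 1;
            int ins    = d[i][j-1] + 1;
            int best   = change < del ? change : del;
            d[i][j]    = best < ins ? best : ins;
        }
    return d[m][n];
}
```

A global-correction algorithm must search for a parseable string y minimizing this distance over the whole program, which is why the approach is considered too costly in practice.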
2.15 YACC-DESIGN OF A SYNTAX ANALYZER FOR A SAMPLE LANGUAGE
A translator can be constructed using Yacc in the manner illustrated in the figure below.
First, a file, say translate.y, containing a Yacc specification of the translator is prepared. The
UNIX system command
yacc translate.y
transforms the file translate.y into a C program called y.tab.c using the LALR method. The program
y.tab.c is a representation of an LALR parser written in C, along with other C routines that the
user may have prepared. By compiling y.tab.c along with the ly library that contains the LR
parsing program using the command
cc y.tab.c -ly
we obtain the desired object program a.out that performs the translation specified by the original
Yacc program. If other procedures are needed, they can be compiled or loaded with y.tab.c, just
as with any C program.

A Yacc program consists of three parts:
declarations
%%
translation rules
%%
supporting C routines
translate.y --(Yacc compiler)--> y.tab.c
y.tab.c --(C compiler)--> a.out
input --(a.out)--> output

Figure: Creating an input/output translator with Yacc

(i) The Declarations Part
There are two sections in the declarations part of a Yacc program; both are optional. In the first
section, we put ordinary C declarations, delimited by %{ and %}. Here we place declarations
of any temporaries used by the translation rules or procedures of the second and third sections.
In the example below, this section contains only the include-statement
#include <ctype.h>
that causes the C preprocessor to include the standard header file <ctype.h> that contains the
predicate isdigit. Also in the declarations part are declarations of grammar tokens. In the
following example, the statement
%token DIGIT
declares DIGIT to be a token. Tokens declared in this section can then be used in the second
and third parts of the Yacc specification. If Lex is used to create the lexical analyzer that passes
tokens to the Yacc parser, then these token declarations are also made available to the analyzer
generated by Lex.
(ii) The Translation Rules Part
In the part of the Yacc specification after the first %% pair, we put the translation rules. Each
rule consists of a grammar production and the associated semantic action. A set of productions
that we have been writing as
<head> → <body>1 | <body>2 | … | <body>n
It would be written in Yacc as:
<head> : <body>1 {<semantic action>1}
       | <body>2 {<semantic action>2}
       ...
       | <body>n {<semantic action>n}
       ;
(iii) The Supporting C-Routines Part
The third part of a Yacc specification consists of supporting C routines. A lexical analyzer with
the name yylex() must be provided; using Lex to produce yylex() is a common choice. The
lexical analyzer yylex() produces tokens consisting of a token name and its associated attribute
value. If a token name such as DIGIT is returned, the token name must be declared in the first
section of the Yacc specification. The attribute value associated with a token is communicated
to the parser through the Yacc-defined variable yylval.
The resulting Yacc specification is shown below.
%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double /* double type for Yacc stack */
%}
%token NUMBER
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
lines : lines expr '\n' { printf("%g\n", $2); }
| lines '\n'
| /* empty */
;
expr : expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '(' expr ')' { $$ = $2; }
| '-' expr %prec UMINUS { $$ = - $2; }
| NUMBER
;
%%
yylex() {
    int c;
    while ( ( c = getchar() ) == ' ' )
        ;                            /* skip blanks */
    if ( (c == '.') || (isdigit(c)) ) {
        ungetc(c, stdin);
        scanf("%lf", &yylval);
        return NUMBER;
    }
    return c;
}
PART-A (Two marks with answers)
1. Define parser
Hierarchical analysis is one in which the tokens are grouped hierarchically into nested
collections with collective meaning. It is also termed parsing.
2.Mention the basic issues in parsing
There are two important issues in parsing.
· Specification of syntax
· Representation of input after parsing.
3.Why lexical and syntax analyzers are separated out?
Reasons for separating the analysis phase into lexical and syntax analyzers:
 Simpler design.
 Compiler efficiency is improved.
 Compiler portability is enhanced
4.Define a context free grammar
A context free grammar G is a collection of the following
 V is a set of non terminals
 T is a set of terminals
 S is a start symbol
 P is a set of production rules
G can be represented as G = (V,T,S,P)
Production rules are given in the following form
Non terminal → (V U T)*
5.Briefly explain the concept of derivation
Derivation from S means generation of string w from S. For constructing derivation two
things are important.
i) Choice of non terminal from several others.
ii) Choice of rule from production rules for corresponding non terminal.
Instead of choosing an arbitrary non terminal, one can use
i) leftmost derivation – expand the leftmost non terminal in a sentential form, or
ii) rightmost derivation – expand the rightmost non terminal in a sentential form.
6.Define ambiguous grammar
A grammar G is said to be ambiguous if it generates more than one parse tree for some
sentence of language L(G).
i.e. there is more than one leftmost derivation (or more than one rightmost derivation) for the
same sentence.
7.What is a operator precedence parser?
A grammar is said to be operator precedence if it possess the following properties:
1. No production on the right side is ε.
2. There should not be any production rule possessing two adjacent non terminals at the right
hand side.
8.List the properties of LR parser
1. LR parsers can be constructed to recognize most of the programming languages for which
the context free grammar can be written.
2. The class of grammar that can be parsed by LR parser is a superset of class of grammars
that can be parsed using predictive parsers.
3. LR parsers work using a non-backtracking shift-reduce technique, yet they are efficient.
9. Mention the types of LR parser
 SLR parser- simple LR parser
 LALR parser- lookahead LR parser
 Canonical LR parser
10. What are the problems with top down parsing?
The following are the problems associated with top down parsing:
 Backtracking
 Left recursion
 Left factoring
 Ambiguity
11. Write the algorithm for FIRST
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1).
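The three rules above are applied repeatedly until no FIRST set changes. A small C sketch for one hypothetical grammar (E → TX, X → +TX | ε, T → (E) | i), with '#' standing in for ε, illustrates the fixpoint; the string encoding of productions is our own, chosen only for compactness.

```c
#include <string.h>

/* Fixpoint computation of FIRST sets for a small hypothetical grammar:
     E -> T X      X -> + T X | ε      T -> ( E ) | i
   Uppercase letters are nonterminals; '#' encodes ε; an empty
   right-hand side is an ε-production. */
#define NONTERMS "EXT"
static const char *prods[] = { "E:TX", "X:+TX", "X:", "T:(E)", "T:i" };
#define NPRODS (sizeof prods / sizeof prods[0])

char first[3][16];                        /* first[k] = FIRST(NONTERMS[k]) */

static int add(char *set, char c)         /* add c; report whether it was new */
{
    if (strchr(set, c)) return 0;
    size_t n = strlen(set);
    set[n] = c; set[n + 1] = '\0';
    return 1;
}

const char *first_of(char nt)             /* FIRST set of a nonterminal */
{
    return first[strchr(NONTERMS, nt) - NONTERMS];
}

void compute_first(void)
{
    int changed = 1;
    while (changed) {                     /* iterate rules 1-3 to a fixpoint */
        changed = 0;
        for (size_t p = 0; p < NPRODS; p++) {
            const char *rhs = prods[p] + 2;
            char *set = first[strchr(NONTERMS, prods[p][0]) - NONTERMS];
            int all_eps = 1;              /* did every Yi so far derive ε? */
            for (int i = 0; rhs[i] && all_eps; i++) {
                if (!strchr(NONTERMS, rhs[i])) {      /* rule 1: terminal */
                    changed |= add(set, rhs[i]);
                    all_eps = 0;
                } else {                  /* rule 3: add FIRST(Yi) minus ε */
                    const char *sub = first_of(rhs[i]);
                    for (const char *s = sub; *s; s++)
                        if (*s != '#') changed |= add(set, *s);
                    if (!strchr(sub, '#')) all_eps = 0;
                }
            }
            if (all_eps) changed |= add(set, '#');    /* rule 2: body ⇒ ε */
        }
    }
}
```

For this grammar the fixpoint yields FIRST(E) = FIRST(T) = { (, i } and FIRST(X) = { +, ε }.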
12. Write short notes on YACC
YACC is an automatic tool for generating the parser program. YACC stands for Yet Another
Compiler Compiler, which is basically a utility available on UNIX. Basically, YACC is an
LALR parser generator. It can report conflicts or ambiguities in the form of error messages.
13. What is meant by handle pruning?
A rightmost derivation in reverse can be obtained by handle pruning.
If w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-sentential
form of some as yet unknown rightmost derivation
S = γ0 => γ1…=> γn-1 => γn = w
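For example, with the grammar E → E+T | T, T → T*F | F, F → id, handle pruning of w = id+id*id reverses the rightmost derivation step by step:

```
Right-sentential form    Handle    Reducing production
id + id * id             id        F → id
F  + id * id             F         T → F
T  + id * id             T         E → T
E  + id * id             id        F → id
E  + F  * id             F         T → F
E  + T  * id             id        F → id
E  + T  * F              T * F     T → T * F
E  + T                   E + T     E → E + T
E
```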
14. Define LR(0) items
An LR(0) item of a grammar G is a production of G with a dot at some position of the right
side. Thus, production A → XYZ yields the four items
A→.XYZ
A→X.YZ
A→XY.Z
A→XYZ.
15. What is phrase level error recovery?
Phrase level error recovery is implemented by filling in the blank entries in the predictive
parsing table with pointers to error routines. These routines may change, insert, or delete
symbols on the input and issue appropriate error messages. They may also pop from the stack.
16.What are the Error-recovery actions in a lexical analyzer?
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters
17. Construct a regular expression for the language
L = { w ∈ {a, b}* | w ends in abb }
Ans: (a|b)*abb.
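The answer corresponds to the standard four-state DFA for strings ending in abb; a small C recognizer as a sketch (the function name is ours, and the input is assumed to be over {a, b}):

```c
/* DFA for (a|b)*abb: state k records how long a suffix of the input
   read so far matches a prefix of "abb"; state 3 is accepting. */
int ends_in_abb(const char *w)
{
    static const int next[4][2] = {   /* next[state][c], c: 0 = 'a', 1 = 'b' */
        {1, 0},                       /* state 0: nothing matched  */
        {1, 2},                       /* state 1: "a" matched      */
        {1, 3},                       /* state 2: "ab" matched     */
        {1, 0},                       /* state 3: "abb" matched    */
    };
    int s = 0;
    for (; *w; w++)
        s = next[s][*w == 'b'];       /* input alphabet assumed {a, b} */
    return s == 3;
}
```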
PART-B(Possible Questions)
1. What are the roles and tasks of a syntax analyzer?
2. Consider the following grammar:
a. S → AS | b
b. A → AS | a
Construct the SLR parse table for the grammar. Show the actions of the parser for the input
string "abab".
3. What is an ambiguous grammar? Is the following grammar ambiguous?
4. Draw NFA for the regular expression ab*/ab.
5. Find the LALR for the given grammar and parse the sentence (a+b)*c
E → E+T | T, T → T*F | F, F → (E) | id.
6. Eliminate left recursion from the following grammar:
S → Aa | b
A → Ac | Sd | ε
7. Write the stack implementation of shift-reduce parsing?
8. Draw and discuss the model of LR parser?
9. Construct a canonical parsing table for the grammar given below
S →CC, C→cC|d
10. Write an a algorithm for Non recursive predictive parsing.
PART-C(Possible Questions)
1. Show that every SLR(1) grammar is unambiguous, but that there are many unambiguous
grammars that are not SLR(1).
2. Consider the grammar, E→TE′, E′→+TE′|ε, T→FT′, T′→*FT′|ε, F→ (E)|id.
Construct a predictive parsing table for the grammar given above. Verify whether the input
string id+id*id is accepted by the grammar or not.
3. Describe the conflicts that may occur during shift reduce parsing.
4. Explain the ambiguous grammar G: E → E+E | E*E | (E) | -E | id for the sentence id+id*id.
5. Construct a parse tree for the input string w = cad using a top-down parser.
S → cAd
A → ab | a
6. Explain context-free grammars with examples.