Compiler Design Notes (All Modules)

PART – A
UNIT – 1
Introduction, Lexical analysis: Language processors; The structure of a Compiler; The
evolution of programming languages; The science of building a Compiler; Applications
of compiler technology; Programming language basics.
Lexical analysis: The Role of Lexical Analyzer; Input Buffering; Specifications of
Tokens; Recognition of Tokens.
UNIT – 2: 6 Hours
Syntax Analysis – 1: Introduction; Context-free Grammars; Writing a Grammar; Top-down Parsing; Bottom-up Parsing.
UNIT – 3: 6 Hours
Syntax Analysis – 2: Top-down Parsing; Bottom-up Parsing.
UNIT – 4: 6 Hours
Syntax Analysis – 3: Introduction to LR Parsing: Simple LR; More powerful LR parsers
(excluding efficient construction and compaction of parsing tables); Using ambiguous
grammars; Parser Generators.
PART – B
UNIT – 5: 7 Hours
Syntax-Directed Translation: Syntax-directed definitions; Evaluation orders for SDDs;
Applications of syntax-directed translation; Syntax-directed translation schemes.
UNIT – 6: 6 Hours
Reference Books:
1. Charles N. Fischer, Richard J. LeBlanc, Jr.: Crafting a Compiler with C, Pearson
Education, 1991.
2. Andrew W. Appel: Modern Compiler Implementation in C, Cambridge University
Press, 1997.
3. Kenneth C. Louden: Compiler Construction: Principles and Practice, Cengage Learning,
1997.
UNIT – 1 INTRODUCTION
UNIT – 3 SYNTAX ANALYSIS – 2
PART B
UNIT – 6 INTERMEDIATE CODE GENERATION
Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of DFA and NFA (FAFL concepts).
• Knowledge of a high-level programming language for the programming assignments.
Textbook:
Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”,
Addison-Wesley, 2nd Edition.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
– Context-Free Grammars
– Top-Down Parsing, LL Parsing
– Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
– Attribute Definitions
– Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
• The science of building a Compiler;
• Applications of compiler technology;
• Programming language basics.
• Lexical analysis: The Role of Lexical Analyzer;
• Input Buffering;
• Specifications of Tokens;
• Recognition of Tokens.
If the target program is executable machine language, it can then be run directly on its inputs:
Fig: input → target program → output

INTERPRETER
Pictorial representation of an Interpreter:
Fig: source program + input → interpreter → output (a hybrid scheme first translates the source into an intermediate program, which a virtual machine then executes)

COMPILERS
• A compiler is a program that takes a program written in a source language and
translates it into an equivalent program in a target language, reporting any error
messages it finds in the source program.
Other Applications
• Many of the techniques used in compiler design are useful for problems outside
compiler design.
• A symbolic equation solver takes an equation as input; such a program must parse
the given input equation.
– Most of the techniques used in compiler design can be used in Natural
Language Processing (NLP) systems.
Major Parts of Compilers
Fig: source program → Analysis → Synthesis → target program
• There are two major parts of a compiler: Analysis and Synthesis
• In analysis phase, an intermediate representation is created from the given source
program.
– Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of
this phase.
• In synthesis phase, the equivalent target program is created from this intermediate
representation.
– Intermediate Code Generator, Code Generator, and Code Optimizer are
the parts of this phase.
Fig: the phases of a compiler, with the error handler and the symbol table shared by all phases.
• Each phase transforms the source program from one representation
into another representation.
• The phases communicate with the symbol table.
Compiler v/s interpreter
• Compiler: source program → compiler → target program (translation happens
before execution; the interpreter, by contrast, executes the source directly).
Example: the statement <id,1> = <id,2> + <id,3> * 60 as it passes through the phases.

Syntax Analyzer (parse tree):
            =
    <id,1>      +
          <id,2>     *
                <id,3>   60

Semantic Analyzer (inserts the inttofloat coercion):
            =
    <id,1>      +
          <id,2>     *
                <id,3>   inttofloat
                              60

Intermediate Code Generator:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Code Generator (target code):
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Lexical Analyzer
• The lexical analyzer reads the source program character by character and returns the
tokens of the source program.
• The lexical analyzer is also called a scanner.
• It reads the stream of characters making up the source program and groups the
characters into meaningful sequences called lexemes.
• For each lexeme, it produces as output a token of the form
<token_name, attribute_value>.
• It puts information about identifiers into the symbol table; the token itself carries
a pointer into the table rather than all the attributes.
• Regular expressions are used to describe tokens (lexical constructs).
Syntax Analyzer
A syntax analyzer creates the syntactic structure (generally a parse tree) of the
given program.
A syntax analyzer is also called a parser.
A parse tree describes a syntactic structure. Ex: for newval = oldval + 12:
          =
   newval     +
         oldval   12
4. E → E * E
5. E → (E)
• where E is an expression.
The syntax analyzer checks, for example:
by rule 1, newval is an expression.
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing, Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root and proceeds towards the
leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL
Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves and proceeds towards
the root.
– Normally, efficient bottom-up parsers are created with the help of some
software tools.
– Bottom-up parsing is also known as shift-reduce parsing.
– Operator-Precedence Parsing – simple, restrictive, easy to implement.
– LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR.
Semantic Analyzer
• A semantic analyzer checks the source program for semantic errors and collects
the type information for the code generation.
• It determines the meaning of the source string, e.g.:
– matching of parentheses,
– matching of if-else statements.
• The type of the identifier newval must match the type of the
expression (oldval + 12).
Intermediate Code Generation
• A compiler may produce explicit intermediate code representing the source
program.
• Intermediate code is a machine-independent kind of code that is:
– easy to generate, and
– easily converted to target code.
• It can be three-address code, quadruples, triples, etc.
• Ex:
newval = oldval + 1
id1 = id2 + 1
Symbol-table implementation: suppose there are n entries in the table and e enquiries
are made to fetch information from it. If n and e are large, the linear-list method
performs poorly; a hash table performs better than the list method.
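For instance (illustrative numbers, not from the text): with n = 1,000 entries and e = 10,000 enquiries, a linear list costs on the order of e·n/2 = 5,000,000 comparisons, while a hash table with, say, 211 buckets costs roughly e·n/(2·211) ≈ 24,000 comparisons, assuming the entries spread evenly over the buckets.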
Grouping of phases into passes
• In an implementation, activities from several phases may be grouped together
into a pass that reads an input file and writes an output file.
• For example,
o the front-end phases of lexical analysis, syntax analysis, semantic analysis,
and intermediate code generation might be grouped together into one pass;
o code optimization might be an optional pass;
o then there could be a back-end pass consisting of code generation for a
particular target machine.
CLASSIFICATION OF LANGUAGES
1. Based on generation:
a) First-generation languages – machine languages.
b) Second-generation languages – assembly languages.
c) Third-generation languages – higher-level languages.
d) Fourth-generation languages – designed for specific applications, like SQL for
database applications.
e) Fifth-generation languages – applied to logic- and constraint-based languages
like Prolog and OPS5.
Imperative languages:
languages in which a program specifies how a computation is to be done.
e.g.: C, C++.
Declarative languages:
languages in which a program specifies what computation is to be done.
e.g.: Prolog.
The science of building a compiler
• Compiler design deals with complicated real-world problems.
• First the problem is taken.
• A mathematical abstraction of the problem is formulated.
• The abstraction is solved using mathematical techniques.
Compiler optimizations – design objectives
• Optimization must improve the performance of many programs.
• Optimization must be correct.
• Compilation time must be kept reasonable.
• The engineering effort required must be manageable.
APPLICATIONS OF COMPILER TECHNOLOGY
IMPLEMENTATION OF HIGH-LEVEL PROGRAMMING LANGUAGES
• The programmer expresses an algorithm using the language, and the compiler must
translate that program to the target language.
• Generally, high-level languages are easier to program in, but are less efficient, i.e.,
the target program runs more slowly.
• Programmers using a low-level language have more control over a computation and
can produce more efficient code.
• Unfortunately, low-level programs are harder to write and, still worse, less portable,
more prone to errors and harder to maintain.
• Optimizing compilers include techniques to improve the performance of generated
code, thus offsetting the inefficiency introduced by high-level abstractions.
PARALLELISM
• Parallelism can be found at several levels: at the instruction level, where multiple
operations are executed simultaneously, and at the processor level, where different
threads of the same application are run on different processors.
MEMORY HIERARCHIES
• Memory hierarchies are a response to the basic limitation that we can build very
fast storage or very large storage, but not storage that is both fast and large.
• A memory hierarchy consists of several levels of storage with different speeds and
sizes.
• A processor usually has a small number of registers consisting of hundreds of bytes,
several levels of caches containing kilobytes to megabytes, and finally secondary
storage that contains gigabytes and beyond.
• Correspondingly, the speed of accesses between adjacent levels of the hierarchy
can differ by two or three orders of magnitude.
• The performance of a system is often limited not by the speed of the processor but
by the performance of the memory subsystem.
• While compilers traditionally focus on optimizing the processor execution, more
emphasis is now placed on making the memory hierarchy more effective.
DESIGN OF NEW COMPUTER ARCHITECTURES
• In modern computer architecture development, compilers are developed in the
processor-design stage, and compiled code, running on simulators, is used to
evaluate the proposed architectural design.
• One of the best-known examples of how compilers influenced the design of computer
architecture was the invention of the RISC (reduced instruction-set computer)
architecture.
PROGRAM TRANSLATIONS
Normally we think of compiling as a translation of a high-level language to machine-level
language, but the same technology can be applied to translate between different kinds of
languages.
The following are some of the important applications of program translation techniques:
BINARY TRANSLATION
• Compiler technology can translate the binary code of one machine to that of
another, and companies have used it to increase the availability of software
for their machines.
HARDWARE SYNTHESIS
• Not only is most software written in high-level languages; even hardware designs
are mostly described in high-level hardware description languages like Verilog and
VHDL (Very High Speed Integrated Circuit Hardware Description Language).
• Hardware designs are typically described at the register transfer level (RTL).
• Hardware synthesis tools translate RTL descriptions automatically into gates,
which are then mapped to transistors and eventually to a physical layout. Unlike
compilers for programming languages, this process takes long hours to optimize
the circuits.
DATABASE QUERY INTERPRETERS
• Query languages like SQL are used to search databases.
• These database queries consist of relational and Boolean operators.
• They can be compiled into commands to search a database for records satisfying
the query.
Example (scope of a declaration):
{
    int i;
    .....
    i = 3;
}
x = i + 1;   /* i is no longer in scope here */
Static Scope and Block Structure
Most languages, including C and its family, use static scope. The scope rules for
C are based on program structure: the scope of a declaration is determined implicitly
by where the declaration appears in the program. Languages such as C++ and Java also
provide explicit control over scopes through keywords like public, private, and protected.
Static-scope rules for a language with blocks – groupings of declarations and
statements – e.g., C uses braces:
main()
{
    int a = 1;
    int b = 1;
    {
        int b = 2;
        {
            int a = 3;
            cout << a << b;   /* prints 3 2 */
        }
        {
            int b = 4;
            cout << a << b;   /* prints 1 4 */
        }
        cout << a << b;       /* prints 1 2 */
    }
    cout << a << b;           /* prints 1 1 */
}
Fig: the lexical analyzer sends tokens to the parser on demand:
source program → Lexical Analyzer —token→ Parser → to semantic analysis
(the parser requests "get next token"; both communicate with the Symbol Table)
– how to implement the symbol table (for identifiers), and what kinds of operations:
• hash table – open addressing, chaining;
• putting a token into the hash table; finding the position of a token from
its lexeme.
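A minimal sketch in C of such a symbol table using hashing with chaining (the names NBUCKETS, Entry and lookupInsert are illustrative assumptions, not from the text):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211                      /* a prime number of buckets  */

struct Entry {
    char *lexeme;                         /* the identifier's text      */
    int   token;                          /* e.g. ID or a keyword code  */
    struct Entry *next;                   /* chaining on collisions     */
};
static struct Entry *bucket[NBUCKETS];

static unsigned hash(const char *s) {     /* simple multiplicative hash */
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns the entry for lexeme, inserting it with the given token code
 * if not present (so keywords can be pre-installed before lexing).     */
struct Entry *lookupInsert(const char *lexeme, int token) {
    unsigned h = hash(lexeme);
    for (struct Entry *e = bucket[h]; e; e = e->next)
        if (strcmp(e->lexeme, lexeme) == 0) return e;
    struct Entry *e = malloc(sizeof *e);
    e->lexeme = malloc(strlen(lexeme) + 1);
    strcpy(e->lexeme, lexeme);
    e->token = token;
    e->next = bucket[h];
    bucket[h] = e;
    return e;
}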
Token: describes the class or category of an input string. For example, identifiers,
keywords and constants are tokens.
Pattern: the set of rules that describes a token.
Lexeme: a sequence of characters in the source program that matches the pattern of a
token. Eg: int, i, num, ans etc.
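As a minimal illustration (the names are assumptions, not from the text), a token of the form <token_name, attribute_value> can be represented in C as:

enum TokenName { ID, NUMBER, IF, THEN, ELSE, RELOP };

struct Token {
    enum TokenName name;   /* the token class (which pattern matched)      */
    int attribute;         /* symbol-table index for id/number, or an      */
};                         /* operator code such as LT/LE for relop        */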
There are three general ways to implement a lexical analyzer:
• Using a lexical-analyzer generator such as Lex (easiest to implement).
• Writing it in a conventional programming language.
• Writing it in assembly language and explicitly managing the input
(hardest to implement, but most efficient).
List out the lexemes and tokens in the following example:
1.
int Max(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

Lexeme   Token
int      Keyword
Max      Identifier
(        Operator
a        Identifier
,        Operator
.        .
.        .
2. Examples of tokens
Token     Informal description                     Sample lexemes
if        characters i, f                          if
id        letter followed by letters and digits
number    any numeric constant                     3.14, 0.6, 20
literal   anything surrounded by " "               "total= %d\n", "core dumped"

In FORTRAN (where blanks are insignificant):
DO 5 I = 1.25    here DO5I is a lexeme (an assignment)
DO 5 I = 1, 25   here DO begins a do-statement
This is a disadvantage for the lexical analyzer: it cannot tell whether DO is a
keyword until it reaches the . or the ,.
Lexical analysis v/s parsing (reasons for separating them):
• Simplicity of design is the most important consideration.
• Compiler efficiency is improved:
– we can apply specialized techniques that serve only the lexical task;
– specialized buffering techniques for reading input characters can speed up the
compiler.
• Compiler portability is enhanced:
– input-device-specific peculiarities can be restricted to the lexical analyzer.
Input Buffering
To recognize tokens, the lexical analyzer reads the source program from disk.
Accessing the disk for each character is time-consuming, so special buffering
techniques have been developed to reduce the amount of overhead required.
With a buffer pair, each buffer half ends with a sentinel character eof. When the
forward pointer hits a sentinel, either a buffer half must be reloaded or the input
has truly ended:

switch (*forward++)
{
case eof:
    if (forward is at end of first buffer)
    {   reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer)
    {   reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters
}
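A concrete rendering of this two-buffer scheme in C — a sketch only; BUFSZ, fillBuffer, the sentinel choice and the initialization convention are assumptions, not from the text:

#include <stdio.h>

#define BUFSZ 4096
#define SENTINEL '\0'       /* stands in for the special eof character    */

/* One array holding both halves; each half ends with a sentinel:
 * [ BUFSZ data | sentinel | BUFSZ data | sentinel ]                      */
static char buf[2 * BUFSZ + 2];
static char *forward = buf;
static FILE *src;           /* caller opens src and calls fillBuffer(0)   */

static void fillBuffer(int half) {          /* reload one buffer half     */
    char *base = buf + half * (BUFSZ + 1);
    size_t n = fread(base, 1, BUFSZ, src);
    base[n] = SENTINEL;     /* marks end of half, or true end of input    */
}

int nextChar(void) {
    char c = *forward++;
    if (c != SENTINEL) return (unsigned char)c;
    if (forward == buf + BUFSZ + 1) {       /* at end of first half       */
        fillBuffer(1);
        forward = buf + BUFSZ + 1;
        return nextChar();
    }
    if (forward == buf + 2 * (BUFSZ + 1)) { /* at end of second half      */
        fillBuffer(0);
        forward = buf;
        return nextChar();
    }
    return EOF;             /* sentinel inside a half: end of input       */
}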
Terminology of Languages
• Alphabet: a finite set of symbols (e.g., the ASCII characters).
• String:
– a finite sequence of symbols over an alphabet;
– "sentence" and "word" are also used in place of "string";
– ε is the empty string.
Example
• L1 = {a,b,c,d}   L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• L1³ = all strings of length three over {a,b,c,d}
Regular Expressions
• A regular expression r denotes a language L(r). Rules:
ε             denotes {ε}
a             denotes {a}, for each symbol a in Σ
(r1) | (r2)   denotes L(r1) ∪ L(r2)
(r1) (r2)     denotes L(r1) L(r2)
(r)*          denotes (L(r))*
(r)           denotes L(r)
• (r)+ = (r)(r)*
• (r)? = (r) | ε
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules:
– * highest
– concatenation next
– | lowest
• ab*|c means (a(b)*)|(c)
• Ex: Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε, 0, 00, 000, 0000, ...}
– (0|1)* => all strings of 0s and 1s, including the empty string
Regular Definitions
• We can give names to regular expressions, and we can use these names
as symbols to define other regular expressions.
• A regular definition is a sequence of definitions of the form:
d1 → r1     where di is a distinct name, and
d2 → r2     ri is a regular expression over symbols in
...         Σ ∪ {d1, d2, ..., di-1}
dn → rn
(i.e., over the basic symbols and the previously defined names)
Regular Definitions
• Ex: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
– If we try to write the regular expression representing identifiers without
using regular definitions, that regular expression will be complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
• Ex: Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit+
opt-fraction → ( . digits )?
opt-exponent → ( E (+|-)? digits )?
unsigned-num → digits opt-fraction opt-exponent
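A small C checker for the unsigned-num definition above — a hand translation of digits (. digits)? (E (+|-)? digits)?; the function names are illustrative:

#include <ctype.h>

static int digits(const char **p) {               /* digit+                */
    int n = 0;
    while (isdigit((unsigned char)**p)) { (*p)++; n++; }
    return n > 0;
}

int isUnsignedNum(const char *s) {
    if (!digits(&s)) return 0;                    /* digits                */
    if (*s == '.') { s++; if (!digits(&s)) return 0; }  /* opt-fraction    */
    if (*s == 'E') {                              /* opt-exponent (E only, */
        s++;                                      /* as in the definition) */
        if (*s == '+' || *s == '-') s++;
        if (!digits(&s)) return 0;
    }
    return *s == '\0';                            /* consumed everything   */
}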
Our current goal is to perform the lexical analysis needed for the following grammar.
Recall that the terminals are the tokens, and the non-terminals produce terminals.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E[+-]? digits)?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
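For instance, a hand translation of id → letter (letter | digit)* into C might look like this sketch (scanId is an illustrative name):

#include <ctype.h>

/* Scans an identifier starting at *p; advances *p past the lexeme.
 * Returns 1 on success, with the lexeme in [start, *p).                */
int scanId(const char **p) {
    const char *start = *p;
    if (!isalpha((unsigned char)**p)) return 0;   /* must start with letter */
    while (isalnum((unsigned char)**p)) (*p)++;   /* then letters or digits */
    return *p > start;
}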
Lexeme           Token     Attribute
an identifier    id        pointer to table entry
a number         number    pointer to table entry
<                relop     LT
<=               relop     LE
=                relop     EQ
<>               relop     NE
>                relop     GT
>=               relop     GE
On the board, show how this can be done with just REs.
We also want the lexer to remove whitespace, so we define a new token
ws → ( blank | tab | newline )+
Recall that the lexer will be called by the parser when the latter needs a new token. If the
lexer then recognizes the token ws, it does not return it to the parser but instead goes on
to recognize the next token, which is then returned. Note that you can't have two
consecutive ws tokens in the input because, for a given token, the lexer will match the
longest lexeme starting at the current position that yields this token. The table above
summarizes the situation.
For the parser, all the relational operators are to be treated the same, so they are all the
same token, relop. Naturally, other parts of the compiler, for example the code generator,
will need to distinguish between the various relational operators so that appropriate code
is generated. Hence, they have distinct attribute values.
Specification of Tokens
To specify tokens, regular expressions are used.
Recognition of Tokens
To recognize tokens, transition diagrams are drawn and then translated into code.
Transition Diagrams
A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for
each possible token. It shows the decisions that must be made based on the input seen.
The two main components are circles representing states (think of them as decision
points of the lexer) and arrows representing edges (think of them as the decisions made).
1. The double circles represent accepting or final states, at which point a lexeme has
been found. There is often an action to be done (e.g., returning the token), which
is written to the right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or
more) stars are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where
to begin the process.
It is fairly clear how to write code corresponding to this diagram. You look at the first
character; if it is <, you look at the next character. If that character is =, you return
(relop, LE) to the parser. If instead that character is >, you return (relop, NE). If it is
another character, return (relop, LT) and adjust the input buffer so that you will read this
character again, since you have not used it for the current lexeme. If the first character
was =, you return (relop, EQ).
• Each state represents a condition that could occur during the process of
scanning the input, looking for a lexeme that matches one of several
patterns.
• Edges are directed from one state of the transition diagram to another.
Fig: start → state 1 —<→ state 2 ... (the transition diagram for relop)
We will continue to assume that the keywords are reserved, i.e., may not be used as
identifiers. (What if this is not the case, as in PL/I, which had no reserved words? Then
the lexer does not distinguish between keywords and identifiers, and the parser must.)
We will use the method mentioned in the last chapter and have the keywords installed into
the identifier table prior to any invocation of the lexer. The table entry will indicate that
the entry is a keyword.
gettoken() examines the lexeme and returns the token name: either id or a name
corresponding to a reserved keyword.
The text also gives another method to distinguish between identifiers and keywords.
Completion of the Running Example
So far we have transition diagrams for identifiers (this diagram also handles keywords)
and for the relational operators. What remain are whitespace and numbers, which are
respectively the simplest and most complicated diagrams seen so far.
Recognizing Whitespace
The diagram itself is quite simple, reflecting the simplicity of the corresponding regular
expression.
• The delim in the diagram represents any of the whitespace characters, say space,
tab, and newline.
• The final star is there because we needed to find a non-whitespace character in
order to know when the whitespace ends and this character begins the next token.
• There is no action performed at the accepting state. Indeed, the lexer does not
return to the parser, but starts again from its beginning as it still must find the next
token.
Recognizing Numbers
This certainly looks formidable, but it is not that bad; it follows from the regular
expression.
In class, go over the regular expression and show the corresponding parts in the diagram.
When an accepting state is reached, action is required but is not shown on the diagram.
Just as identifiers are stored in an identifier table and a pointer is returned, there is a
corresponding number table in which numbers are stored. These numbers are needed
when code is generated. Depending on the source language, we may wish to indicate in
the table whether this is a real or an integer. A similar, but more complicated, transition
diagram could be produced if the language permitted complex numbers as well.
The idea is that we write a piece of code for each decision diagram. I will show the one
for relational operators below. This piece of code contains a case for each state, which
typically reads a character and then goes to the next case depending on the character read.
The numbers in the circles are the names of the cases.
Accepting states often need to take some action and return to the parser. Many of these
accepting states (the ones with stars) need to restore one character of input. This is called
retract() in the code.
What should the code for a particular diagram do if at one state the character read is not
one of those for which a next state has been defined? That is, what if the character read is
not the label of any of the outgoing arcs? This means that we have failed to find the token
corresponding to this diagram.
The code calls fail(). This is not an error case. It simply means that the current input does
not match this particular token. So we need to go to the code section for another diagram,
after restoring the input pointer so that we start the next diagram at the point where this
failing diagram started. If we have tried all the diagrams, then we have a real failure and
need to print an error message and perhaps try to repair the input.
Note that the order in which the diagrams are tried is important: if the input matches more
than one token, the first one tried will be chosen.
while (true)
    switch (state) {
    case 0: c = nextChar();
        if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else fail();          /* lexeme is not a relop */
        break;
    /* cases for the remaining states follow the same pattern */
    }
Alternative ways of arranging the diagrams:
1. Unlike the method above, which tries the diagrams one at a time, the first new
method tries them in parallel. That is, each character read is passed to each
diagram (that hasn't already failed). Care is needed when one diagram has
accepted the input, but others still haven't failed and may accept a longer prefix of
the input.
2. The final possibility discussed, which appears to be promising, is to combine all
the diagrams into one. That is easy for the example we have been considering, because
all the diagrams begin with different characters being matched. Hence we just have
one large start state with multiple outgoing edges. It is more difficult when there is a
character that can begin more than one diagram.
UNIT – 2 topics:
• Syntax Analysis – 1: Introduction; Context-free Grammars; Writing a Grammar;
• Top-down Parsing;
• Bottom-up Parsing.
Introduction
• The syntax analyzer creates the syntactic structure of the given source program.
• This syntactic structure is mostly a parse tree.
• The syntax analyzer is also known as the parser.
• The syntax of a programming language is described by a context-free grammar (CFG).
We will use BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the
rules implied by a context-free grammar or not:
– if it satisfies, the parser creates the parse tree of that program;
– otherwise, the parser gives error messages.
• A context-free grammar:
– gives a precise syntactic specification of a programming language;
– the design of the grammar is an initial phase of the design of a compiler;
– a grammar can be directly converted into a parser by some tools.
• The parser works on a stream of tokens; the smallest item is a token.
Fig: position of the parser in the compiler model (lexical analyzer —token→ parser →
rest of front end, all communicating with the symbol table).
• We categorize the parsers into two groups:
1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers scan the input from left to right (one
symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for sub-
classes of context-free grammars:
– LL for top-down parsing,
– LR for bottom-up parsing.
Syntax Error Handling
• Common programming errors can occur at many different levels:
1. Lexical errors: include misspellings of identifiers, keywords, or operators.
2. Syntactic errors: include misplaced semicolons or extra or missing braces.
Error-Recovery Strategies
• Panic-Mode Recovery
• Phrase-Level Recovery
• Error Productions
• Global Correction
Panic-Mode Recovery
• On discovering an error, the parser discards input symbols one at a time until one
of a designated set of synchronizing tokens is found.
• Synchronizing tokens are usually delimiters,
e.g., a semicolon or }, whose role in the source program is clear and unambiguous.
• It often skips a considerable amount of input without checking it for additional
errors.
Advantages:
– simplicity;
– it is guaranteed not to go into an infinite loop.
Phrase-Level Recovery
• A parser may perform local correction on the remaining input, i.e.,
it may replace a prefix of the remaining input by some string that allows the parser to
continue.
Ex: replace a comma by a semicolon; insert a missing semicolon.
• Local correction is left to the compiler designer.
Error Productions
• We can augment the grammar for the language at hand with productions that
generate the erroneous constructs.
• Then we can use the grammar augmented by these error productions to
construct a parser.
• If an error production is used by the parser, we can generate appropriate error
diagnostics to indicate the erroneous construct that has been recognized in the
input.
Global Correction
• We use algorithms that perform a minimal sequence of changes to obtain a globally
least-cost correction.
• Given an incorrect input string x and a grammar G, these algorithms will find a
parse tree for a related string y, such that the number of insertions, deletions and
changes of tokens required to transform x into y is as small as possible.
• It is too costly to implement in terms of time and space, so these techniques are
only of theoretical interest.
Context-Free Grammars
• Inherently recursive structures of a programming language are defined by a
context-free grammar.
• In a context-free grammar, we have:
– A finite set of terminals (in our case, this will be the set of tokens).
– A finite set of non-terminals (syntactic variables).
– A finite set of production rules in the following form:
A → α    where A is a non-terminal and α is a string of terminals and
non-terminals (including the empty string).
– A start symbol (one of the non-terminal symbols).
NOTATIONAL CONVENTIONS
1. Symbols used for terminals are:
– lower-case letters early in the alphabet (such as a, b, c, ...);
– operator symbols (such as +, *, ...);
– punctuation symbols (such as parentheses, comma and so on);
– the digits (0...9);
– boldface strings and keywords (such as id or if), each of which represents
a single terminal symbol.
2. Symbols used for non-terminals are:
– upper-case letters early in the alphabet (such as A, B, C);
– the letter S, which usually stands for the start symbol;
– lower-case italic names (such as expr or stmt).
Derivations
E ⇒ E+E
• E+E derives from E:
– we can replace E by E+E;
– to be able to do this, we have to have a production rule E → E+E in our
grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id
• A sequence of replacements of non-terminal symbols is called a derivation of
id+id from E.
• In general, a derivation step is
αAβ ⇒ αγβ    if there is a production rule A → γ in our grammar,
where α and β are arbitrary strings of terminal and non-
terminal symbols.
α1 ⇒ α2 ⇒ ... ⇒ αn   (αn derives from α1, or α1 derives αn)
⇒  : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
CFG – Terminology
• L(G) is the language of G (the language generated by G), which is a set of
sentences.
• A sentence of L(G) is a string of terminal symbols of G.
• If S is the start symbol of G, then
ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G.
• If G is a context-free grammar, L(G) is a context-free language.
• Two grammars are equivalent if they produce the same language.
• S ⇒* α:
– if α contains non-terminals, it is called a sentential form of G;
– if α does not contain non-terminals, it is called a sentence of G.
Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
• At each derivation step, we can choose any of the non-terminals in the sentential
form of G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, this
derivation is called a left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, this
derivation is called a right-most derivation.
Left-Most and Right-Most Derivations
Left-most derivation:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Right-most derivation:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
• We will see that top-down parsers try to find the left-most derivation of the
given source program.
• We will see that bottom-up parsers try to find the right-most derivation of the
given source program in the reverse order.
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
Ex: the parse tree for -(id+id):
        E
       - E
        ( E )
        E + E
        id  id
Problems on derivation of a string with parse tree:
1. Consider the grammar
S → (L) | a
L → L,S | S
i. What are the terminals, non-terminals and the start symbol?
ii. Find the parse tree for the following sentences:
a. (a,a)
b. (a, (a, a))
c. (a, ((a,a), (a,a)))
iii. Construct the LMD and RMD for each.
2. Do the above steps for the grammar S → aS | aSbS | ε for the string “aaabaab”.
Ambiguity
• A grammar that produces more than one parse tree for some sentence is called an
ambiguous grammar.
• For most parsers, the grammar must be unambiguous.
• An unambiguous grammar gives a unique selection of the parse tree for a sentence.
• We should eliminate the ambiguity in the grammar during the design phase of the
compiler.
• An ambiguous grammar should be rewritten to eliminate the ambiguity.
• We have to prefer one of the parse trees of a sentence (generated by an ambiguous
grammar) to disambiguate that grammar, restricting it to this choice.
EG: Ambiguity (cont.)
stmt → if expr then stmt |
       if expr then stmt else stmt | otherstmts
The sentence  if E1 then if E2 then S1 else S2  has two parse trees:
1. if E1 then (if E2 then S1) else S2
2. if E1 then (if E2 then S1 else S2)
• We prefer the second parse tree (else matches the closest if).
• So, we have to disambiguate our grammar to reflect this choice.
• The unambiguous grammar will be:
stmt → matchedstmt | unmatchedstmt
Problems on ambiguity:
2. S → S+S | SS | (S) | S* | a    with the string (a+a)*a
3. S → aS | aSbS | ε              with the string abab
• Ambiguous grammars (because of ambiguous operators) can be disambiguated
according to the precedence and associativity rules.
E → E+E | E*E | E^E | id | (E)
disambiguate the grammar using
precedence: ^ (right to left)
            * (left to right)
            + (left to right)
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Left Recursion
• A grammar is left recursive if it has a non-terminal A such that there is a
derivation A ⇒+ Aα for some string α.
• Immediate left recursion A → Aα | β is eliminated by rewriting it as:
A → βA’
A’ → αA’ | ε
Left-Recursion – Problem
• A grammar may not be immediately left-recursive, but it still can be
left-recursive.
• By just eliminating the immediate left recursion, we may not get
a grammar which is not left-recursive.
S → Aa | b
A → Sc | d    This grammar is not immediately left-recursive,
but it is still left-recursive:
S ⇒ Aa ⇒ Sca, or
A ⇒ Sc ⇒ Aac, causes left recursion.
• So, we have to eliminate all left recursion from our grammar.
Eliminate Left-Recursion – Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
-   for j from 1 to i-1 do {
      replace each production
        Ai → Aj γ
      by
        Ai → α1 γ | ... | αk γ
      where Aj → α1 | ... | αk
    }
-   eliminate the immediate left recursion among the Ai productions
  }
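As a tiny illustration in C of the immediate-left-recursion rewrite A → Aα | β becomes A → βA', A' → αA' | ε (string manipulation only; the function name is an assumption):

#include <stdio.h>

/* Given A -> A alpha | beta, prints the equivalent right-recursive rules. */
void eliminateImmediateLR(char A, const char *alpha, const char *beta) {
    printf("%c  -> %s%c'\n", A, beta, A);
    printf("%c' -> %s%c' | epsilon\n", A, alpha, A);
}

int main(void) {
    eliminateImmediateLR('E', "+T", "T");   /* E -> E+T | T  becomes       */
    return 0;                               /* E -> TE' ; E' -> +TE' | eps */
}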
Example 2:
S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: A, S
for A:
- we do not enter the inner loop.
- Eliminate the immediate left recursion in A:
A → SdA’ | fA’
A’ → cA’ | ε
for S:
- Replace S → Aa with S → SdA’a | fA’a
So, we will have S → SdA’a | fA’a | b
- Eliminate the immediate left recursion in S:
S → fA’aS’ | bS’
S’ → dA’aS’ | ε
So, the resulting equivalent grammar which is not left-recursive is:
S → fA’aS’ | bS’
S’ → dA’aS’ | ε
A → SdA’ | fA’
A’ → cA’ | ε
Problems on left recursion:
1. S → S(S)S | ε
2. S → S+S | SS | (S) | S* | a
3. S → SS+ | SS* | a
4. bexpr → bexpr or bterm | bterm
   bterm → bterm and bfactor | bfactor
   bfactor → not bfactor | (bexpr) | true | false
5. S → (L) | a
   L → L,S | S
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists that the
grammar must be left-factored:
grammar → a new equivalent grammar suitable for predictive parsing
stmt → if expr then stmt else stmt |
       if expr then stmt
• When we see if, we cannot know which production rule to choose to rewrite stmt
in the derivation.
• In general,
A → αβ1 | αβ2    where α is non-empty and the first symbols
of β1 and β2 (if they have one) are different.
• When processing α, we cannot know whether to expand
A to αβ1 or
A to αβ2.
• But, if we rewrite the grammar as follows:
A → αA’
A’ → β1 | β2    so we can immediately expand A to αA’.
Left-Factoring – Algorithm
• For each non-terminal A with two or more alternatives (production rules) with a
common non-empty prefix, let us say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA’ | γ1 | ... | γm
A’ → β1 | ... | βn
Left-Factoring – Example 1
A → abB | aB | cdg | cdeB | cdfB
A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB
Example 2
A → ad | a | ab | abc | b
A → aA’ | b
A’ → d | ε | b | bc
A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c
Problems on left factoring (do both left factoring and left-recursion elimination
where needed):
1. S → iEtS | iEtSeS | a,  E → b
2. S → S(S)S | ε
3. S → aS | aSbS | ε
4. S → SS+ | SS* | a
5. bexpr → bexpr or bterm | bterm
   bterm → bterm and bfactor | bfactor
   bfactor → not bfactor | (bexpr) | true | false
6. S → 0S1 | 01
7. S → S+S | SS | (S) | S* | a
8. S → (L) | a,  L → L,S | S
9. rexpr → rexpr + rterm | rterm
   rterm → rterm rfactor | rfactor
   rfactor → rfactor* | rprimary
   rprimary → a | b
• There are some language constructs in programming languages which are
not context-free. This means that we cannot write a context-free grammar for
these constructs.
• L1 = { ωcω | ω is in (a|b)* } is not context-free:
declaring an identifier and checking whether it is declared or not later.
Top-Down Parsing
– Recursive-Descent Parsing
• Backtracking is needed (if a choice of a production rule does not
work, we backtrack to try other alternatives).
• It is a general parsing technique, but not widely used.
• Not efficient.
– Predictive Parsing
• No backtracking.
• Efficient.
• Needs a special form of grammars (LL(1) grammars).
• Recursive Predictive Parsing is a special form of Recursive
Descent parsing without backtracking.
• Non-Recursive (Table-Driven) Predictive Parsing is also known as LL(1) parsing.
Recursive-Descent Parsing (uses Backtracking)
• Backtracking is needed.
• It tries to find the left-most derivation.
S → aBc
B → bc | b
input: abc
First S → aBc is tried with B → bc, giving the tree for a bc c, which fails to match
the input abc; the parser backtracks and tries B → b, which succeeds.
Predictive Parser
• When rewriting a non-terminal in a derivation step, a predictive parser can
uniquely choose a production rule by just looking at the current symbol in the input
string.
• When we are trying to rewrite the non-terminal stmt, we can uniquely choose the
production rule by just looking at the current token.
• We eliminate the left recursion in the grammar and left-factor it. But the grammar
may still not be suitable for predictive parsing (not an LL(1) grammar).
Non-Recursive Predictive Parsing – LL(1) Parser
• Non-recursive predictive parsing is a table-driven parser.
• It is a top-down parser.
• It is also known as an LL(1) parser.
LL(1) Parser
input buffer
– our string to be parsed. We will assume that its end is marked with a
special symbol $.
output
– a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer.
stack
– contains the grammar symbols;
– at the bottom of the stack there is a special end-marker symbol $;
– initially the stack contains only $ and the starting symbol S;
– when the stack is emptied (i.e., only $ is left in the stack), parsing is
completed.
parsing table
– a two-dimensional array M[A, a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule.
Constructing LL(1) Parsing Tables
• Two functions are used in the construction of LL(1) parsing tables:
– FIRST and FOLLOW
• FIRST(α) is the set of the terminal symbols which occur as first symbols in strings
derived from α, where α is any string of grammar symbols.
• If α derives ε, then ε is also in FIRST(α).
• FOLLOW(A) is the set of the terminals which occur immediately after (follow)
the non-terminal A in the strings derived from the starting symbol:
– a terminal a is in FOLLOW(A) if S ⇒* αAaβ
– $ is in FOLLOW(A) if S ⇒* αA
Compute FIRST for Any String X
• If X is a terminal symbol, then FIRST(X) = {X}.
• If X is a non-terminal symbol and X → ε is a production rule, then ε is in
FIRST(X).
• If X is a non-terminal symbol and X → Y1Y2...Yn is a production rule:
– if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j = 1,...,i-1, then a
is in FIRST(X);
– if ε is in all FIRST(Yj) for j = 1,...,n, then ε is in FIRST(X).
• If X is ε, then FIRST(X) = {ε}.
• If X is a string Y1Y2...Yn, the same two rules apply:
if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j = 1,...,i-1,
then a is in FIRST(X);
if ε is in all FIRST(Yj) for j = 1,...,n, then ε is in FIRST(X).
Compute FOLLOW (for non-terminals)
• If S is the start symbol, then $ is in FOLLOW(S).
• If A → αBβ is a production rule, then everything in FIRST(β) except ε is in
FOLLOW(B).
• If A → αB is a production rule, or A → αBβ is a production rule and ε is in
FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).
We apply these rules until nothing more can be added to any FOLLOW set.
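A minimal C sketch of that fixpoint loop, hard-wired (as an assumption, not from the text) for the toy grammar S → aBa, B → bB | ε used in the example below:

#include <stdio.h>

/* Grammar: S -> aBa ; B -> bB ; B -> (epsilon).
 * Uppercase = non-terminal, lowercase = terminal, '$' = end marker.
 * inFirst[X][t] / inFollow[A][t] are set-membership flags.             */
static const char *lhs[] = { "S", "B", "B" };
static const char *rhs[] = { "aBa", "bB", "" };   /* "" is epsilon      */
enum { NPROD = 3 };

static int inFirst[128][128], nullable[128], inFollow[128][128];

int main(void) {
    /* FIRST sets for this grammar (from the FIRST rules above):        */
    inFirst['a']['a'] = inFirst['b']['b'] = 1;    /* FIRST(t) = {t}     */
    inFirst['S']['a'] = 1;                        /* FIRST(S) = {a}     */
    inFirst['B']['b'] = 1; nullable['B'] = 1;     /* FIRST(B) = {b,eps} */

    inFollow['S']['$'] = 1;           /* rule 1: $ in FOLLOW(start)     */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int A = lhs[p][0];
            const char *s = rhs[p];
            for (int i = 0; s[i]; i++) {
                int B = s[i];
                if (B < 'A' || B > 'Z') continue; /* non-terminals only */
                int restNullable = 1;
                /* rule 2: FIRST of what follows B goes into FOLLOW(B)  */
                for (int j = i + 1; s[j] && restNullable; j++) {
                    for (int t = 0; t < 128; t++)
                        if (inFirst[(int)s[j]][t] && !inFollow[B][t])
                            inFollow[B][t] = changed = 1;
                    restNullable = nullable[(int)s[j]];
                }
                /* rule 3: if the rest can vanish, add FOLLOW(A) too    */
                if (restNullable)
                    for (int t = 0; t < 128; t++)
                        if (inFollow[A][t] && !inFollow[B][t])
                            inFollow[B][t] = changed = 1;
            }
        }
    }
    printf("a in FOLLOW(B): %d, $ in FOLLOW(S): %d\n",
           inFollow['B']['a'], inFollow['S']['$']);   /* prints 1, 1    */
    return 0;
}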
ww
parser looks at the parsing table entry M[X, a]. If M[X, a] holds a production
rule XoY1Y2...Yk, it pops X from the stack and pushes Yk,Yk-1,...,Y1 into the stack. The
parser also outputs the production rule XoY1Y2...Yk to represent a step of the derivation.
4. none of the above error
– all empty entries in the parsing table are errors.
– If X is a terminal symbol different from a, this is also an error case.
m
Non-Recursive predictive parsing Algorithm
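Expressed as code, the driver loop might look like the following C sketch — the table M, the rhs encoding and the token handling are assumptions, hard-wired for the grammar S → aBa, B → bB | ε of the example below:

#include <stdio.h>
#include <string.h>

/* LL(1) table for  S -> aBa (rule 1),  B -> bB (2),  B -> epsilon (3).
 * M[X][a] holds the rule number; 0 means error.                        */
int M[128][128];
const char *rhs[] = { "", "aBa", "bB", "" };     /* "" means epsilon    */

int parse(const char *ip) {                      /* input must end in $ */
    char stack[100] = "$S";                      /* $ marker, start S   */
    int top = 1;                                 /* stack[top] = top    */
    while (1) {
        char X = stack[top], a = *ip;
        if (X == '$' && a == '$') return 1;      /* action 1: accept    */
        if (X == a) { top--; ip++; continue; }   /* action 2: match     */
        if (X >= 'A' && X <= 'Z' && M[(int)X][(int)a]) {  /* 3: expand  */
            int r = M[(int)X][(int)a], k = strlen(rhs[r]);
            printf("output rule %d\n", r);
            top--;                               /* pop X ...           */
            for (int i = k - 1; i >= 0; i--)     /* ... push Yk..Y1     */
                stack[++top] = rhs[r][i];
            continue;
        }
        return 0;                                /* action 4: error     */
    }
}

int main(void) {
    M['S']['a'] = 1; M['B']['b'] = 2; M['B']['a'] = 3;
    printf(parse("abba$") ? "accept\n" : "error\n");
    return 0;
}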
Example 1:  S → aBa
            B → bB | ε
FIRST FUNCTION
FIRST(S) = {a}      FIRST(aBa) = {a}
FIRST(B) = {b, ε}   FIRST(bB) = {b}   FIRST(ε) = {ε}
FOLLOW FUNCTION
FOLLOW(S) = {$}
FOLLOW(B) = {a}
LL(1) parse of the input abba:
stack   input   output
$S      abba$   S → aBa
$aBa    abba$
$aB     bba$    B → bB
$aBb    bba$
$aB     ba$     B → bB
$aBb    ba$
$aB     a$      B → ε
$a      a$
$       $       accept, successful completion
Outputs: S → aBa   B → bB   B → bB   B → ε
Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
Example 2:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Soln:
FIRST sets:
FIRST(F) = {(, id}     FIRST(T’) = {*, ε}
FIRST(T) = {(, id}     FIRST(E’) = {+, ε}
FIRST(E) = {(, id}
FOLLOW sets:
FOLLOW(E) = FOLLOW(E’) = {), $}
FOLLOW(T) = FOLLOW(T’) = {+, ), $}
FOLLOW(F) = {+, *, ), $}
Constructing the LL(1) parsing table, for each production A → α:
– for each terminal a in FIRST(α), add A → α to M[A, a];
– if ε is in FIRST(α), for each terminal a in FOLLOW(A), add A → α to M[A, a];
– if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
– All other undefined entries of the parsing table are error entries.
Constructing LL(1) Parsing Table – Example
E → TE’     FIRST(TE’) = {(, id}    E → TE’ into M[E, (] and M[E, id]
E’ → +TE’   FIRST(+TE’) = {+}       E’ → +TE’ into M[E’, +]
E’ → ε      FIRST(ε) = {ε}          none, but since ε is in FIRST(ε) and
                                    FOLLOW(E’) = {$, )}:
                                    E’ → ε into M[E’, $] and M[E’, )]
T → FT’     FIRST(FT’) = {(, id}    T → FT’ into M[T, (] and M[T, id]
T’ → *FT’   FIRST(*FT’) = {*}       T’ → *FT’ into M[T’, *]
T’ → ε      FIRST(ε) = {ε}          none, but since ε is in FIRST(ε) and
                                    FOLLOW(T’) = {$, ), +}:
                                    T’ → ε into M[T’, $], M[T’, )] and M[T’, +]
F → (E)     FIRST((E)) = {(}        F → (E) into M[F, (]
F → id      FIRST(id) = {id}        F → id into M[F, id]
        id        +           *           (         )        $
E       E → TE’                           E → TE’
E’                E’ → +TE’                         E’ → ε   E’ → ε
T       T → FT’                           T → FT’
T’                T’ → ε      T’ → *FT’             T’ → ε   T’ → ε
F       F → id                            F → (E)
stack     input     output
$E        id+id$    E → TE’
$E’T      id+id$    T → FT’
$E’T’F    id+id$    F → id
$E’T’id   id+id$
$E’T’     +id$      T’ → ε
$E’       +id$      E’ → +TE’
$E’T+     +id$
$E’T      id$       T → FT’
$E’T’F    id$       F → id
$E’T’id   id$
$E’T’     $         T’ → ε
$E’       $         E’ → ε
$         $         accept
Construct the predictive (LL(1)) parser for each of the following grammars and parse
the given string:
1. S → S(S)S | ε           with the string ( ( ) ( ) )
2. S → +SS | *SS | a       with the string +*aaa
3. S → aSbS | bSaS | ε     with the string aabbbab
4. bexpr → bexpr or bterm | bterm
   bterm → bterm and bfactor | bfactor
   bfactor → not bfactor | (bexpr) | true | false
                           with the string not(true or false)
5. S → 0S1 | 01            with the string 00011
6. S → aB | aC | Sd | Se
   B → bBc | f
   C → g
7. P → Ra | Qba
   R → aba | caba | Rbc
   Q → bbc | bc            with the string cababca
8. S → PQR
   P → a | Rb | ε
   Q → c | dP | ε
   R → e | f               with the string adeb
9. E → E+T | T
   T → id | id[ ] | id[X]
   X → E, E | E            with the string id[id]
10. S → (A) | 0
    A → SB
    B → ,SB | ε            with the string (0, (0,0))
11. S → a | n | (T)
    T → T,S | S            with the strings (a,(a,a)) and ((a,a), n, (a), a)
LL(1) Grammars
• A grammar whose parsing table has no multiply-defined entries is said to be an
LL(1) grammar: one input symbol is used as a look-ahead symbol to determine the
parser action; the input is scanned from left to right; a left-most derivation is produced.
• The parsing table of a grammar may contain entries with more than one production
rule. In this case, we say that it is not an LL(1) grammar.
Example (the dangling-else grammar):
S → iCtSE | a
E → eS | ε
C → b

        a       b       e       i           t       $
S       S → a                   S → iCtSE
E                       E → eS
                        E → ε                       E → ε
C               C → b

Two production rules appear in M[E, e]: E → eS and E → ε.
Problem: ambiguity.
• What do we have to do if the resulting parsing table contains multiply-defined
entries?
– If we didn’t eliminate left recursion, eliminate the left recursion in the
grammar.
– If the grammar is not left-factored, we have to left-factor the grammar.
– If the (new grammar’s) parsing table still contains multiply-defined
entries, that grammar is ambiguous or it is inherently not an LL(1)
grammar.
• A left-recursive grammar cannot be an LL(1) grammar:
– A → Aα | β
any terminal that appears in FIRST(β) also appears in FIRST(Aα),
because Aα ⇒ βα.
If β is ε, any terminal that appears in FIRST(α) also appears in
FIRST(Aα) and FOLLOW(A).
• A grammar that is not left-factored cannot be an LL(1) grammar:
– A → αβ1 | αβ2
any terminal that appears in FIRST(αβ1) also appears in
FIRST(αβ2).
• An ambiguous grammar cannot be an LL(1) grammar.
• A grammar G is LL(1) if and only if the following conditions hold for any two
distinct production rules A → α and A → β:
1. Both α and β cannot derive strings starting with the same terminals.
2. At most one of α and β can derive ε.
3. If β can derive ε, then α cannot derive any string starting with a
terminal in FOLLOW(A).
Error Recovery in Predictive Parsing
• An error may occur in predictive parsing (LL(1) parsing)
– if the terminal symbol on the top of the stack does not match the current
input symbol, or
– if a non-terminal A is on top of the stack, a is the current input symbol,
and the parsing table entry M[A, a] is empty.
Error Recovery Techniques
• Panic-Mode Error Recovery
– Skipping the input symbols until a synchronizing token is found.
• Phrase-Level Error Recovery
– Each empty entry in the parsing table is filled with a pointer to a specific
error routine to take care of that error case.
• Error Productions
– If we have a good idea of the common errors that might be encountered,
we can augment the grammar with productions that generate erroneous
constructs.
– When an error production is used by the parser, we can generate
appropriate error diagnostics.
– Since it is almost impossible to know all the errors that can be made by the
programmers, this method is not practical.
• Global Correction
– Ideally, we would like a compiler to make as few changes as possible in
processing incorrect inputs.
– We have to globally analyze the input to find the error.
– This is an expensive method, and it is not in practice.
Example (panic-mode recovery with sync entries), for the grammar
S → AbS | e | ε
A → a | cAd

        a         b       c         d       e       $
S       S → AbS   sync    S → AbS   sync    S → e   S → ε
A       A → a     sync    A → cAd   sync    sync    sync

Eg: input string “aab”
stack   input   output
$S      aab$    S → AbS
$SbA    aab$    A → a
$Sba    aab$
$Sb     ab$     Error: missing b, inserted
$S      ab$     S → AbS
$SbA    ab$     A → a
$Sba    ab$
$Sb     b$
$S      $       S → ε
$       $       accept
Bottom-Up Parsing
• A bottom-up parser creates the parse tree of the given input starting from the leaves
towards the root.
• A bottom-up parser tries to find the right-most derivation of the given input in the
reverse order.
S ⇒ ... ⇒ ω   (the right-most derivation of ω; the bottom-up parser finds this
derivation in the reverse order)
• Bottom-up parsing is also known as shift-reduce parsing because its two main
actions are shift and reduce.
– At each shift action, the current symbol in the input string is pushed to a
stack.
– At each reduction step, the symbols at the top of the stack (this symbol
sequence is the right side of a production) are replaced by the non-
terminal at the left side of that production.
– There are also two more actions: accept and error.
Shift-Reduce Parsing
• A shift-reduce parser tries to reduce the given input string into the starting
symbol:
a string  → (reduced to)  the starting symbol
• At each reduction step, a substring of the input matching the right side of a
production rule is replaced by the non-terminal at the left side of that production
rule.
• If the substring is chosen correctly, the right-most derivation of that string is
created in the reverse order.
Handle
• Informally, a handle of a string is a substring that matches the right side of a
production rule.
– But not every substring that matches the right side of a production rule is a
handle.
• A handle of a right-sentential form γ (≡ αβω) is
a production rule A → β and a position of γ
where the string β may be found and replaced by A to produce
the previous right-sentential form in a right-most derivation of γ:
S ⇒* αAω ⇒ αβω
• If the grammar is unambiguous, then every right-sentential form of the grammar
has exactly one handle.
• We will see that ω is a string of terminals.
Handle Pruning
A Shift-Reduce Parser
A Stack Implementation of a Shift-Reduce Parser
• There are four possible actions of a shift-reduce parser:
1. Shift: the next input symbol is shifted onto the top of the stack.
2. Reduce: replace the handle on the top of the stack by the non-terminal.
3. Accept: successful completion of parsing.
4. Error: the parser discovers a syntax error and calls an error recovery routine.
• Initially the stack contains only the end-marker $.
• The end of the input string is marked by the end-marker $.
Consider the following grammars and parse the respective strings using a shift-
reduce parser.
(1) E → E+T | T
    T → T*F | F
    F → (E) | id     with the string id+id*id
Rules of thumb:
1. If the incoming operator has higher priority than the operator in the stack,
perform shift.
2. If the operator in the stack has the same or higher priority than the incoming
operator, perform reduce.
stack     input      action
$         id+id*id$  shift
$id       +id*id$    reduce by F → id
$F        +id*id$    reduce by T → F
$T        +id*id$    reduce by E → T
$E        +id*id$    shift
$E+       id*id$     shift
$E+id     *id$       reduce by F → id
$E+F      *id$       reduce by T → F
$E+T      *id$       shift
$E+T*     id$        shift
$E+T*id   $          reduce by F → id
$E+T*F    $          reduce by T → T*F
$E+T      $          reduce by E → E+T
$E        $          accept

Parse tree (the numbers show the order in which the reductions build the nodes):
            E(8)
      E(3)    +    T(7)
      T(2)      T(5) * F(6)
      F(1)      F(4)    id
       id        id
(2) S → TL;
    T → int | float
    L → L, id | id
    Parse the string “int id, id;” with a shift-reduce parser.
(3) S → (L) | a
    L → L,S | S
Example: shift-reduce parsing of an English sentence.
Stack                    Input              Action
$                        the dog jumps$     SHIFT word onto stack
$the                     dog jumps$         REDUCE using grammar rule (Art → the)
$Art                     dog jumps$         SHIFT
$Art dog                 jumps$             REDUCE (Noun → dog)
$Art Noun                jumps$             REDUCE (NounPhrase → Art Noun)
$NounPhrase              jumps$             SHIFT
$NounPhrase jumps        $                  REDUCE (Verb → jumps)
$NounPhrase Verb         $                  REDUCE (VerbPhrase → Verb)
$NounPhrase VerbPhrase   $                  REDUCE (Sentence → NounPhrase VerbPhrase)
$Sentence                $                  SUCCESS
Shift-Reduce Parsers
• There are two main categories of shift-reduce parsers:
1. Operator-Precedence Parser
– simple, but handles only a small class of grammars.
2. LR-Parsers
– cover a wide range of grammars.
LR Parsers
• LR-Parsers
– cover a wide range of grammars.
– SLR – simple LR parser
– LR – most general LR parser (canonical LR)
– LALR – intermediate LR parser (look-ahead LR parser)
– SLR, LR and LALR work the same way (they use the same algorithm);
only their parsing tables are different.
LR Parsing Algorithm
Fig: the LR parser has an input buffer a1 ... ai ... an $, a stack holding states and
grammar symbols S0 X1 S1 ... Xm Sm (Sm on top), an output stream, and a parsing
table with two parts: the Action table (rows: states; columns: terminals and $; each
entry one of four different actions) and the Goto table (rows: states; columns:
non-terminals; each entry is a state number).
A Configuration of the LR Parsing Algorithm
• A configuration of an LR parser is a pair (stack contents, remaining input):
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )
Actions of an LR-Parser
1. shift s – shifts the next input symbol and the state s onto the stack:
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ ) → ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )
2. reduce A → β:
– pop 2|β| (= r) items from the stack;
– then push A and s, where s = goto[sm-r, A]:
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ ) → ( S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $ )
3. Accept – parsing successfully completed.
4. Error – the parser detected an error (an empty entry in the action table).
Reduce Action
• Pop 2|β| (= r) items from the stack; let us assume that β = Y1Y2...Yr.
• Then push A and s, where s = goto[sm-r, A]:
( S0 X1 S1 ... Xm-r Sm-r Y1 ... Yr Sm, ai ai+1 ... an $ )
→ ( S0 X1 S1 ... Xm-r Sm-r A s, ai ai+1 ... an $ )
• In fact, Y1Y2...Yr is a handle:
X1 ... Xm-r A ai ... an $ ⇒ X1 ... Xm-r Y1...Yr ai ai+1 ... an $
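A compact LR driver loop in C, hard-wired (as an assumption, not from the text) with the SLR(1) table for the expression grammar shown further below — rules 1) E→E+T 2) E→T 3) T→T*F 4) T→F 5) F→(E) 6) F→id. Only states are kept on the stack, which is equivalent to the symbol/state pairs above since the table is keyed by state:

#include <stdio.h>

/* Token codes index the action columns: 0=id 1=+ 2=* 3=( 4=) 5=$.
 * action[s][t]: >0 shift to that state, <0 reduce by rule -a,
 * 99 accept, 0 error.                                                */
enum { ID, PLUS, STAR, LP, RP, END };
int action[12][6] = {
  { 5, 0, 0, 4, 0, 0}, { 0, 6, 0, 0, 0,99}, { 0,-2, 7, 0,-2,-2},
  { 0,-4,-4, 0,-4,-4}, { 5, 0, 0, 4, 0, 0}, { 0,-6,-6, 0,-6,-6},
  { 5, 0, 0, 4, 0, 0}, { 5, 0, 0, 4, 0, 0}, { 0, 6, 0, 0,11, 0},
  { 0,-1, 7, 0,-1,-1}, { 0,-3,-3, 0,-3,-3}, { 0,-5,-5, 0,-5,-5} };
int go2[12][3] = { {1,2,3},{0},{0},{0},{8,2,3},{0},{0,9,3},{0,0,10},
                   {0},{0},{0},{0} };            /* columns: E T F    */
int rlen[] = {0,3,1,3,1,3,1};                    /* |rhs| of rule r   */
int rlhs[] = {0,0,0,1,1,2,2};                    /* lhs: 0=E 1=T 2=F  */

int parse(const int *tok) {                      /* ends with END ($) */
    int stack[100], top = 0; stack[0] = 0;       /* state stack, s0   */
    for (;;) {
        int a = action[stack[top]][*tok];
        if (a == 99) return 1;                   /* accept            */
        if (a > 0) { stack[++top] = a; tok++; }  /* shift state a     */
        else if (a < 0) {                        /* reduce by rule -a */
            top -= rlen[-a];                     /* pop handle states */
            stack[top + 1] = go2[stack[top]][rlhs[-a]];
            top++;
            printf("reduce by rule %d\n", -a);
        }
        else return 0;                           /* error             */
    }
}

int main(void) {                                 /* id + id * id $    */
    int toks[] = { ID, PLUS, ID, STAR, ID, END };
    printf(parse(toks) ? "accept\n" : "error\n");
    return 0;
}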
Constructing SLR Parsing Tables – LR(0) Items
• An LR(0) item of a grammar G is a production of G with a dot at some position of
the right side.
• Ex: A → aBb.  Possible LR(0) items: A → .aBb,  A → a.Bb,  A → aB.b,  A → aBb.
• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis
for constructing SLR parsers.
• Augmented grammar:
G’ is G with a new production rule S’ → S, where S’ is the new starting symbol.
The Closure Operation
• If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0)
items constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ is
added to closure(I). This rule is applied until no more new items can be added.
• Ex: for the augmented expression grammar
E’ → E,  E → E+T | T,  T → T*F | F,  F → (E) | id:
closure({E’ → .E}) = { E’ → .E        (kernel item)
                       E → .E+T
                       E → .T
                       T → .T*F
                       T → .F
                       F → .(E)
                       F → .id }
The Goto Operation
• If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal),
then goto(I, X) is defined as follows:
– if A → α.Xβ is in I,
then every item in closure({A → αX.β}) will be in goto(I, X).
Example:
I = { E’ → .E, E → .E+T, E → .T,
      T → .T*F, T → .F,
      F → .(E), F → .id }
goto(I, E) = { E’ → E., E → E.+T }
goto(I, T) = { E → T., T → T.*F }
goto(I, F) = { T → F. }
goto(I, () = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F,
               F → .(E), F → .id }
goto(I, id) = { F → id. }
Construction of the Canonical LR(0) Collection
• To create the SLR parsing tables for a grammar G, we will create the canonical
LR(0) collection of the grammar G’.
• Algorithm:
C is { closure({S’ → .S}) }
repeat the following until no more sets of LR(0) items can be added to C:
for each I in C and each grammar symbol X,
if goto(I, X) is not empty and not in C,
add goto(I, X) to C
• The goto function is a DFA on the sets in C.
The canonical LR(0) collection for the expression grammar:
I0: E’→.E, E→.E+T, E→.T, T→.T*F, T→.F, F→.(E), F→.id
I1: E’→E., E→E.+T
I2: E→T., T→T.*F
I3: T→F.
I4: F→(.E), E→.E+T, E→.T, T→.T*F, T→.F, F→.(E), F→.id
I5: F→id.
I6: E→E+.T, T→.T*F, T→.F, F→.(E), F→.id
I7: T→T*.F, F→.(E), F→.id
I8: F→(E.), E→E.+T
I9: E→E+T., T→T.*F
I10: T→T*F.
I11: F→(E).
Transition Diagram (DFA) of the Goto Function
I0 –E→ I1, I0 –T→ I2, I0 –F→ I3, I0 –(→ I4, I0 –id→ I5
I1 –+→ I6
I2 –*→ I7
I4 –E→ I8, I4 –T→ I2, I4 –F→ I3, I4 –(→ I4, I4 –id→ I5
I6 –T→ I9, I6 –F→ I3, I6 –(→ I4, I6 –id→ I5
I7 –F→ I10, I7 –(→ I4, I7 –id→ I5
I8 –)→ I11, I8 –+→ I6
I9 –*→ I7
Constructing the SLR Parsing Table
(State i of the parser is constructed from the set of items Ii.)
• If a is a terminal, A → α.aβ is in Ii, and goto(Ii, a) = Ij, then action[i, a] is shift j.
• If A → α. is in Ii, then action[i, a] is reduce A → α for all a in FOLLOW(A),
where A ≠ S’.
• If S’ → S. is in Ii, then action[i, $] is accept.
• If any conflicting actions are generated by these rules, the grammar is not SLR(1).
• For all non-terminals A, if goto(Ii, A) = Ij, then goto[i, A] = j.
(SLR) Parsing Table for the Expression Grammar
1) E → E+T   2) E → T   3) T → T*F   4) T → F   5) F → (E)   6) F → id

             Action Table                    Goto Table
state   id    +     *     (     )     $      E    T    F
0       s5                s4                 1    2    3
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4                 8    2    3
5             r6    r6          r6    r6
6       s5                s4                      9    3
7       s5                s4                           10
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5
Parse of id*id+id:
stack        input       action
$0           id*id+id$   [0,id]=s5   shift 5
$0id5        *id+id$     [5,*]=r6    reduce by F→id (pop 2|id| symbols, push F; [0,F]=3)
$0F3         *id+id$     [3,*]=r4    reduce by T→F ([0,T]=2)
$0T2         *id+id$     [2,*]=s7    shift 7
$0T2*7       id+id$      [7,id]=s5   shift 5
$0T2*7id5    +id$        [5,+]=r6    reduce by F→id ([7,F]=10)
$0T2*7F10    +id$        [10,+]=r3   reduce by T→T*F (pop 2|T*F| symbols, push T; [0,T]=2)
$0T2         +id$        [2,+]=r2    reduce by E→T ([0,E]=1)
$0E1         +id$        [1,+]=s6    shift 6
$0E1+6       id$         [6,id]=s5   shift 5
$0E1+6id5    $           [5,$]=r6    reduce by F→id ([6,F]=3)
$0E1+6F3     $           [3,$]=r4    reduce by T→F ([6,T]=9)
$0E1+6T9     $           [9,$]=r1    reduce by E→E+T (pop 2|E+T| symbols, push E; [0,E]=1)
$0E1         $           accept
SLR(1) Grammar
• An LR parser using SLR(1) parsing tables for a grammar G is called the
SLR(1) parser for G.
• If a grammar G has an SLR(1) parsing table, it is called an SLR(1) grammar (or SLR
grammar for short).
• Every SLR grammar is unambiguous, but not every unambiguous grammar is an
SLR grammar.
Shift/Reduce and Reduce/Reduce Conflicts
• If a state does not know whether it will make a shift operation or a reduction for a
terminal, we say that there is a shift/reduce conflict.
• If a state does not know whether it will make a reduction using
production rule i or j for a terminal, we say that there is a reduce/reduce conflict.
• If the SLR parsing table of a grammar G has a conflict, we say that the grammar
is not an SLR grammar.
Problems on SLR
1. S → SS+ | SS* | a     with the string “aa+a*”
2. S → (L) | a,  L → L,S | S
3. S → aSb | ab
4. S → aSbS | bSaS | ε
5. S → E#
   E → E-T | T
   T → F↑T | F
   F → (E) | i
   Construct the SLR parsing table for this grammar. The table has two actions in
   one entry, which is why it is not an SLR(1) grammar.
6. S → +SS | *SS | a     with the string “+*aaa”
7. Show that the following grammar is SLR(1) but not LL(1):
   S → SA | A
   A → a
8. X → Xb | a            parse the string “abb”
9. Given the grammar A → (A) | a, parse the string “((a))”
Conflict Example 2
S → AaAb      I0: S’ → .S
S → BbBa          S → .AaAb
A → ε             S → .BbBa
B → ε             A → .
                  B → .
Problem:
FOLLOW(A) = {a, b}
FOLLOW(B) = {a, b}
on a: reduce by A → ε and reduce by B → ε  →  reduce/reduce conflict
on b: reduce by A → ε and reduce by B → ε  →  reduce/reduce conflict

Problems: show that the following grammars are not SLR(1) by constructing the
parsing table.
1. S → S(S)S | ε
2. S → AaAb | BbBa,  A → ε,  B → ε
UNIT – 4 topics:
• Introduction to LR Parsing: Simple LR; More powerful LR parsers (excluding
efficient construction and compaction of parsing tables);
• Using ambiguous grammars;
• Parser Generators.
Why more powerful parsers? (canonical LR)
– In the SLR method, state i reduces by A → α on token a whenever A → α. is in Ii
and a is in FOLLOW(A). In some situations, however, βA cannot be followed by the
terminal a in a right-sentential form when βα and the state i are on the top of the
stack. This means that making the reduction in this case is not correct.
S → AaAb      S ⇒ AaAb ⇒ aAb ⇒ ab
S → BbBa      S ⇒ BbBa ⇒ bBa ⇒ ba
A → ε         (the A → ε and B → ε reductions are valid only in their own contexts)
B → ε
LR(1) Item
• To avoid some invalid reductions, the states need to carry more information.
• Extra information is put into a state by including a terminal symbol as a second
component of each item. An LR(1) item is
A → α.β, a
where a is the look-ahead of the item (a terminal or the end-marker $). When β is
empty (A → α., a), the reduction by A → α is done only if the next input symbol
is a, not for every terminal in FOLLOW(A).
Canonical Collection of Sets of LR(1) Items
• The construction of the canonical collection of the sets of LR(1) items is similar
to the construction of the canonical collection of the sets of LR(0) items, except
that the closure and goto operations work a little differently.
closure operation
• If A → α.Bβ, a is in closure(I) and B → γ is a production rule, then B → .γ, b is
added to closure(I) for each terminal b in FIRST(βa).
goto operation
• If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal),
then goto(I, X) is defined as follows:
– if A → α.Xβ, a is in I, then
every item in closure({A → αX.β, a}) will be in goto(I, X).
Construction of the collection:
C is { closure({S’ → .S, $}) }
repeat the following until no more sets of LR(1) items can be added to C:
for each I in C and each grammar symbol X,
if goto(I, X) is not empty and not in C,
add goto(I, X) to C
The goto function is a DFA on the sets in C.
A Short Notation for Sets of LR(1) Items
• A set of LR(1) items containing the items
A → α.β, a1
...
A → α.β, an
can be written as
A → α.β, a1/a2/.../an
Example: the grammar
1) S → L=R
2) S → R
3) L → *R
4) L → id
5) R → L

SLR(1) parsing table (note the shift/reduce conflict in state 2 on =):
state   id    *     =       $     S   L   R
0       s5    s4                  1   2   3
1                           acc
2                   s6/r5   r5
3                           r2
4       s5    s4                      8   7
5                   r4      r4
6       s5    s4                      10  9
7                   r3      r3
8                   r5      r5
9                           r1
Because action[2, =] holds both s6 and r5, this grammar is not SLR(1).
Canonical collection of sets of LR(1) items for this grammar:
I0: S’→.S,$   S→.L=R,$   S→.R,$   L→.*R,$/=   L→.id,$/=   R→.L,$
I1: S’→S.,$
I2: S→L.=R,$   R→L.,$
I3: S→R.,$
I4: L→*.R,$/=   R→.L,$/=   L→.*R,$/=   L→.id,$/=
I5: L→id.,$/=
I6: S→L=.R,$   R→.L,$   L→.*R,$   L→.id,$
I7: L→*R.,$/=
I8: R→L.,$/=
I9: S→L=R.,$
I10: R→L.,$
I11: L→*.R,$   R→.L,$   L→.*R,$   L→.id,$
I12: L→id.,$
I13: L→*R.,$
(goto(I0,S)=I1, goto(I0,L)=I2, goto(I0,R)=I3, goto(I0,*)=I4, goto(I0,id)=I5,
goto(I2,=)=I6, goto(I4,R)=I7, goto(I4,L)=I8, goto(I4,*)=I4, goto(I4,id)=I5,
goto(I6,R)=I9, goto(I6,L)=I10, goto(I6,*)=I11, goto(I6,id)=I12,
goto(I11,R)=I13, goto(I11,L)=I10, goto(I11,*)=I11, goto(I11,id)=I12)
Construction of LR(1) Parsing Tables
1. Construct the canonical collection of sets of LR(1) items for G’: C = {I0, ..., In}.
2. Create the parsing action table as follows:
• If a is a terminal, A → α.aβ, b is in Ii, and goto(Ii, a) = Ij, then action[i, a]
is shift j.
• If A → α., a is in Ii, then action[i, a] is reduce A → α, where A ≠ S’.
• If S’ → S., $ is in Ii, then action[i, $] is accept.
3. Create the goto table: for all non-terminals A, if goto(Ii, A) = Ij, then goto[i, A] = j.
4. All other entries are errors; the initial state is the one containing S’ → .S, $.
LR(1) Parsing Table
state   id     *      =     $    |   S   L   R
0       s5     s4                |   1   2   3
1                           acc  |
2                     s6    r5   |
3                           r2   |
4       s5     s4                |       8   7
5                     r4    r4   |
6       s12    s11               |       10  9
7                     r3    r3   |
8                     r5    r5   |
9                           r1   |
10                          r5   |
11      s12    s11               |       10  13
12                          r4   |
13                          r3   |

There is no shift/reduce or reduce/reduce conflict, so it is an LR(1) grammar.
• In fact, the number of the states of the LALR parser for a grammar will be equal
to the number of states of the SLR parser for that grammar.
Construction of LALR(1) parsing tables:
• Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
• Find each core; find all sets having that same core; replace those sets having the same core with a single set which is their union.
    C = {I0, ..., In}   →   C’ = {J1, ..., Jm} where m ≤ n
• Create the parsing tables (action and goto tables) in the same way as the construction of the parsing tables of an LR(1) parser.
  – Note that if J = I1 ∪ ... ∪ Ik, then since I1, ..., Ik have the same core, the cores of goto(I1,X), ..., goto(Ik,X) must also be the same.
  – So, goto(J,X) = K where K is the union of all sets of items having the same core as goto(I1,X).
• If no conflict is introduced, the grammar is an LALR(1) grammar. (We may only introduce reduce/reduce conflicts; we cannot introduce a shift/reduce conflict.)
Shift/Reduce Conflict
• We say that we cannot introduce a shift/reduce conflict during the shrink process for the creation of the states of an LALR parser.
• Assume that we could introduce a shift/reduce conflict. In this case, a state of the LALR parser would have to contain
    A → α., a   and   B → β.aγ, b
  But then some state of the canonical LR(1) parser contains
    A → α., a   and   B → β.aγ, c
  and that state already has the same shift/reduce conflict, since the shift operation does not depend on lookaheads. So the conflict was already present in the LR(1) parser.
Merging the same-core states (I4/I11, I5/I12, I7/I13, I8/I10) gives the LALR(1) collection for the grammar
1) S → L=R  2) S → R  3) L → *R  4) L → id  5) R → L

I0  : S’ → .S, $
      S → .L=R, $
      S → .R, $
      L → .*R, $/=
      L → .id, $/=
      R → .L, $
    goto: S → I1, L → I2, R → I3, * → I411, id → I512
I1  : S’ → S., $
I2  : S → L.=R, $         goto: = → I6
      R → L., $
I3  : S → R., $
I411: L → *.R, $/=
      R → .L, $/=
      L → .*R, $/=
      L → .id, $/=
    goto: R → I713, L → I810, * → I411, id → I512
I512: L → id., $/=
I6  : S → L=.R, $
      R → .L, $
      L → .*R, $
      L → .id, $
    goto: R → I9, L → I810, * → I411, id → I512
I713: L → *R., $/=
I810: R → L., $/=
I9  : S → L=R., $

Same cores: I4 and I11 (→ I411); I5 and I12 (→ I512); I7 and I13 (→ I713); I8 and I10 (→ I810).
Using Ambiguous Grammars
• Ex:
    Ambiguous grammar:                 E → E+E | E*E | (E) | id
    Equivalent unambiguous grammar:    E → E+T | T
                                       T → T*F | F
                                       F → (E) | id
Sets of LR(0) Items for the Ambiguous Grammar
I0: E’ → .E ; E → .E+E ; E → .E*E ; E → .(E) ; E → .id
    goto: E → I1, ( → I2, id → I3
I1: E’ → E. ; E → E.+E ; E → E.*E
    goto: + → I4, * → I5
I2: E → (.E) ; E → .E+E ; E → .E*E ; E → .(E) ; E → .id
    goto: E → I6, ( → I2, id → I3
I3: E → id.
I4: E → E+.E ; E → .E+E ; E → .E*E ; E → .(E) ; E → .id
    goto: E → I7, ( → I2, id → I3
I5: E → E*.E ; E → .E+E ; E → .E*E ; E → .(E) ; E → .id
    goto: E → I8, ( → I2, id → I3
I6: E → (E.) ; E → E.+E ; E → E.*E
    goto: ) → I9, + → I4, * → I5
I7: E → E+E. ; E → E.+E ; E → E.*E
    goto: + → I4, * → I5
I8: E → E*E. ; E → E.+E ; E → E.*E
    goto: + → I4, * → I5
I9: E → (E).
States I7 and I8 have shift/reduce conflicts for the symbols + and *. For I8, after reading E * E the stack holds I0 E I1 * I5 E I8.
  When the current token is *:
    shift  →  * is treated as right-associative
    reduce →  * is treated as left-associative
  When the current token is +:
    shift  →  + has higher precedence than *
    reduce →  * has higher precedence than +
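Parser generators resolve such conflicts from declared precedence and associativity rather than by rewriting the grammar. A minimal sketch of that decision rule in C; the helper and its encoding are illustrative, not part of any generator's API:

#include <stdio.h>

enum action { SHIFT, REDUCE };

/* Precedence: higher binds tighter; '+' = 1, '*' = 2 (assumed encoding). */
static int prec(char op) { return op == '*' ? 2 : 1; }

/* Both '+' and '*' are declared left-associative here. */
static int left_assoc(char op) { return op == '+' || op == '*'; }

/* Decide a shift/reduce conflict between the operator inside the handle
 * on the stack and the incoming lookahead operator. */
static enum action resolve(char stack_op, char lookahead_op)
{
    if (prec(lookahead_op) > prec(stack_op)) return SHIFT;   /* lookahead binds tighter */
    if (prec(lookahead_op) < prec(stack_op)) return REDUCE;  /* handle binds tighter */
    return left_assoc(stack_op) ? REDUCE : SHIFT;            /* equal precedence */
}

int main(void)
{
    /* state I8 (handle E*E), lookahead '+': reduce, since * > + */
    printf("%s\n", resolve('*', '+') == REDUCE ? "reduce" : "shift");
    /* state I7 (handle E+E), lookahead '*': shift, since * > + */
    printf("%s\n", resolve('+', '*') == SHIFT ? "shift" : "reduce");
    return 0;
}

This is exactly the choice recorded in the table below: row 7 reduces on + (left associativity) but shifts on *, while row 8 reduces on both.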
SLR-Parsing Table for the Ambiguous Grammar
(productions: 1) E → E+E  2) E → E*E  3) E → (E)  4) E → id; conflicts resolved with * above +, both left-associative)

        Action                              Goto
state   id    +     *     (     )     $     E
0       s3                s2                1
1             s4    s5                acc
2       s3                s2                6
3             r4    r4          r4    r4
4       s3                s2                7
5       s3                s2                8
6             s4    s5          s9
7             r1    s5          r1    r1
8             r2    r2          r2    r2
9             r3    r3          r3    r3
Error Detection in LR Parsing
• An LR parser will announce an error as soon as there is no valid continuation for the scanned portion of the input.
• A canonical LR parser (LR(1) parser) will never make even a single reduction before announcing an error.
• The SLR and LALR parsers may make several reductions before announcing an error.
• But all LR parsers (LR(1), LALR and SLR parsers) will never shift an erroneous input symbol onto the stack.
Panic Mode Error Recovery in LR Parsing
• Scan down the stack until a state s with a goto on a particular nonterminal A is found. (Get rid of everything from the stack before this state s.)
• Discard zero or more input symbols until a symbol a is found that can legitimately follow A.
  – The symbol a is simply in FOLLOW(A), but this may not work for all situations.
• The parser stacks the nonterminal A and the state goto[s,A], and it resumes normal parsing.
• This nonterminal A is normally a basic programming block (there can be more than one choice for A).

Phrase-Level Error Recovery in LR Parsing
• Each empty entry in the action table is marked with a specific error routine.
• An error routine reflects the error that the user most likely will make in that case.
• An error routine inserts symbols into the stack or the input (or it deletes symbols from the stack and the input, or it can do both insertion and deletion), for errors such as:
  – missing operand
  – unbalanced right parenthesis
PART-B
UNIT V: SYNTAX-DIRECTED DEFINITIONS
SYLLABUS:
x Syntax-directed definitions;
x Evaluation orders for SDDs;
x Applications of syntax-directed translation;
x Syntax-directed translation schemes
Overview
input → parse tree → dependency graph → attribute evaluation order

Grammar symbols are associated with attributes to associate information with the programming language constructs that they represent.
Values of these attributes are evaluated by the semantic rules associated with the production rules.
Evaluation of these semantic rules:
  o may generate intermediate code
  o may put information into the symbol table
  o may perform type checking
  o may issue error messages
  o may perform some other activities
  o in fact, they may perform almost any activity.
An attribute may hold almost anything:
  o a string, a number, a memory location, a complex record.
Attributes for expressions:
  o type of value: int, float, double, char, string, ...
  o type of construct: variable, constant, operation, ...
Attributes for constants: values
Attributes for variables: name, scope
  o Attributes for operations: arity, operands, operator, ...
When we associate semantic rules with productions, we use two notations:
  o Syntax-Directed Definitions
  o Translation Schemes
Syntax-Directed Definitions:
  o give high-level specifications for translations
  o each production has associated with it a set of semantic rules of the form
      a := f(b1, b2, ..., bk)
    where a is an attribute obtained from the function f.
• A syntax-directed definition is a generalization of a context-free grammar in which:
  – Each grammar symbol is associated with a set of attributes.
  – This set of attributes for a grammar symbol is partitioned into two subsets called synthesized and inherited attributes of that grammar symbol.
  – Each production rule is associated with a set of semantic rules.
• Semantic rules set up dependencies between attributes which can be represented by a dependency graph.
• This dependency graph determines the evaluation order of these semantic rules.
• Evaluation of a semantic rule defines the value of an attribute. But a semantic rule may also have some side effects such as printing a value.
The two kinds of attributes for non-terminals are:
1) Synthesized attribute (S-attribute):
   An attribute is said to be synthesized if its value at a parse tree node is determined from attribute values at the children of the node.
2) Inherited attribute:
   An attribute is said to be inherited if its value at a parse tree node is determined from attribute values at the parent and/or siblings of the node.
PROBLEM: Write the syntax-directed definition for the following grammar of a simple desk calculator:
S → EN
E → E + T
E → E - T
E → T
T → T * F
T → T / F
T → F
F → (E)
F → digit
N → ;
Solution:
The syntax-directed definition can be written for the above grammar by using semantic actions for each production.

Production rule     Semantic actions
S → EN              S.val = E.val
E → E1 + T          E.val = E1.val + T.val
E → E1 - T          E.val = E1.val - T.val
E → T               E.val = T.val
T → T1 * F          T.val = T1.val * F.val
T → T1 / F          T.val = T1.val / F.val
T → F               T.val = F.val
F → (E)             F.val = E.val
F → digit           F.val = digit.lexval

For the non-terminals E, T and F the values can be obtained using the attribute val. The token digit has the synthesized attribute lexval. In S → EN, the symbol S is the start symbol; this rule is used to obtain the final value of the expression.
The following steps are followed to compute an S-attributed definition (a simple evaluator in this style is sketched after the list):
1. Write the SDD using the appropriate semantic actions for the corresponding production rules of the given grammar.
2. Generate the annotated parse tree and compute the attribute values. The computation is done in bottom-up manner.
3. The value obtained at the root node is the final output.
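Step 2, the bottom-up computation of val, is exactly a post-order traversal of the tree. A minimal sketch in C; the tree for 5*6+7 is hard-coded and all names are illustrative:

#include <stdio.h>
#include <stdlib.h>

/* A node of the expression tree: a digit leaf or an operator node. */
struct node {
    char op;                  /* '+', '-', '*', '/' for interior nodes, 'd' for a digit leaf */
    int lexval;               /* used only when op == 'd' */
    struct node *left, *right;
};

static struct node *leaf(int v) {
    struct node *n = malloc(sizeof *n);
    n->op = 'd'; n->lexval = v; n->left = n->right = NULL;
    return n;
}

static struct node *inner(char op, struct node *l, struct node *r) {
    struct node *n = malloc(sizeof *n);
    n->op = op; n->lexval = 0; n->left = l; n->right = r;
    return n;
}

/* Bottom-up (post-order) evaluation: the val of a node is computed
 * from the vals of its children, exactly like a synthesized attribute. */
static int val(struct node *n) {
    if (n->op == 'd') return n->lexval;
    int l = val(n->left), r = val(n->right);
    switch (n->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        default:  return l / r;
    }
}

int main(void) {
    /* 5*6+7 parsed as (5*6)+7 */
    struct node *e = inner('+', inner('*', leaf(5), leaf(6)), leaf(7));
    printf("%d\n", val(e));   /* prints 37 */
    return 0;
}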
PROBLEM 1:
Consider the string 5*6+7; Construct the syntax tree, parse tree and annotated parse tree.
Solution:
Syntax tree: (figure)
Parse tree for the string 5*6+7; (indentation shows children):

S
  E
    E
      T
        T
          F
            digit (5)
        *
        F
          digit (6)
    +
    T
      F
        digit (7)
  N
    ;

The corresponding annotated parse tree for the string 5*6+7; is shown below (figure); evaluating bottom-up gives T.val = 30 for 5*6 and E.val = S.val = 37 at the root.
Advantages: SDDs are more readable and hence useful for specifications
Disadvantages: not very efficient.
Ex 2: PROBLEM: Consider the grammar that is used for a simple desk calculator. Obtain the semantic actions and also the annotated parse tree for the string 3*5+4n.
L → En
E → E1 + T
E → T
T → T1 * F
T → F
F → (E)
F → digit
Solution:
Production rule     Semantic actions
L → En              L.val = E.val
E → E1 + T          E.val = E1.val + T.val
E → T               E.val = T.val
T → T1 * F          T.val = T1.val * F.val
T → F               T.val = F.val
F → (E)             F.val = E.val
F → digit           F.val = digit.lexval
The corresponding annotated parse tree for the string 3*5+4n is shown below.
Fig: Annotated parse tree (figure)
Exercise:
For the SDD of problem 1 give the annotated parse tree for the following expressions:
a) (3+4)*(5+6)n
b) 1*2*3*(4+5)n
c) (9+8*(7+6)+5)*4n
Solution: the annotated parse trees for a), b) and c) are drawn as figures.
Dependency Graphs
A dependency graph depicts the flow of information among the attribute instances in a particular parse tree: an edge from one attribute instance to another means that the value of the first is needed to compute the second. Edges express the constraints implied by the semantic rules. (figure)
2) Inherited attributes:
Consider an example and compute the inherited attributes; annotate the parse tree for the computation of inherited attributes for the given string int a, b, c.
Ex:
S → TL
T → int
T → float
T → char
T → double
L → L, id
L → id
The steps to be followed are:
1) Construct the syntax-directed definition using semantic actions.
2) Annotate the parse tree with inherited attributes by processing in top-down fashion.

Production rule     Semantic actions
S → TL              L.inh = T.type
T → int             T.type = int
T → float           T.type = float
T → char            T.type = char
T → double          T.type = double
L → L1, id          L1.inh = L.inh; addtype(id.entry, L.inh)
L → id              addtype(id.entry, L.inh)

The annotated parse tree for the string int a, b, c is drawn as a figure.
Ex 2: PROBLEM: Consider the following context-free grammar for evaluating arithmetic expressions with operator *:
T → FT’
T’ → *FT’
T’ → ε
F → digit
(The SDD and the annotated parse tree for this grammar are drawn as figures.)
Dependency graph for the above example (the annotated parse tree for 3*5) is drawn as a figure. Its nodes are numbered 1 to 9: nodes 1 and 2 are the digit.lexval leaves, 3 and 4 are the F.val attributes, 5 and 6 are the T’.inh attributes, 7 and 8 are the T’.syn attributes, and 9 is T.val.
A topological sort of the above graph gives the order of evaluation of the SDD. One topological sort order is (1,2,3,4,5,6,7,8,9); another is (1,2,3,5,4,6,7,8,9).
Advantage: the dependency graph helps in computing the order of evaluation of the attributes.
Disadvantage: it cannot give the order of evaluation of attributes if there is a cycle in the graph. However, this disadvantage can be overcome by using S-attributed and L-attributed definitions.
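The evaluation order can be computed mechanically with a topological sort. A minimal sketch of Kahn's algorithm in C; the edge list encodes the dependencies of the SDD above (lexval → F.val, F.val → T’.inh, inh and F.val → the inner T’.inh, syn flowing upward), and the node numbering follows the text:

#include <stdio.h>

#define N 9

int main(void)
{
    /* dependency edges u -> v: "u is needed to compute v" */
    int edges[][2] = { {1,3}, {2,4}, {3,5}, {5,6}, {4,6}, {6,7}, {7,8}, {8,9} };
    int m = sizeof edges / sizeof edges[0];
    int indeg[N + 1] = {0};
    int done[N + 1] = {0};

    for (int i = 0; i < m; i++) indeg[edges[i][1]]++;

    /* Kahn's algorithm: repeatedly evaluate any attribute whose
     * prerequisites have all been evaluated. */
    printf("evaluation order:");
    for (int k = 0; k < N; k++) {
        int v = 0;
        for (int u = 1; u <= N; u++)
            if (!done[u] && indeg[u] == 0) { v = u; break; }
        if (v == 0) { printf(" cycle!\n"); return 1; }   /* cyclic SDD */
        done[v] = 1;
        printf(" %d", v);
        for (int i = 0; i < m; i++)
            if (edges[i][0] == v) indeg[edges[i][1]]--;
    }
    printf("\n");       /* prints: evaluation order: 1 2 3 4 5 6 7 8 9 */
    return 0;
}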
Exercises:
1. E → TE’
   E’ → +TE’
   E’ → ε
   T → FT’
   T’ → *FT’
   T’ → ε
   F → (E)
   F → id
   Strings: (i) “id+id*id”  (ii) “(id+id*id)”
2. T → BC
   B → int
   B → float
   C → [num]C
   C → ε
   Strings: (i) “int[2][3]” (note: int[2][3] should be passed as array(2, array(3, integer)))  (ii) “float [3]”  (iii) “float [3][3][2]”
S-Attributed Definitions
• Syntax-directed definitions are used to specify syntax-directed translations.
• Creating a translator for an arbitrary syntax-directed definition can be difficult.
• We would like to evaluate the semantic rules during parsing (i.e. in a single pass, we parse and also evaluate the semantic rules during the parsing).
• We will look at two sub-classes of the syntax-directed definitions:
  – S-Attributed Definitions: only synthesized attributes are used in the syntax-directed definitions.
  – L-Attributed Definitions: in addition to synthesized attributes, we may also use inherited attributes in a restricted fashion.
• Implementing S-attributed and L-attributed definitions is easy (we can evaluate the semantic rules in a single pass during the parsing).
• Implementations of S-attributed definitions are a little bit easier than implementations of L-attributed definitions.
L-Attributed Definitions
• In an L-attributed definition, attributes can be evaluated in a depth-first, left-to-right traversal of the parse tree.
• This means that they can also be evaluated during the parsing.
• A syntax-directed definition is L-attributed if each inherited attribute of Xj, where 1 ≤ j ≤ n, on the right side of A → X1X2...Xn depends only on:
  1. the attributes of the symbols X1,...,Xj-1 to the left of Xj in the production, and
  2. the inherited attributes of A.
• Every S-attributed definition is L-attributed; the restrictions only apply to inherited attributes (not to synthesized attributes).
Semantic Rules with Controlled Side Effects
• Permit incidental side effects that do not constrain attribute evaluation.
• Constrain the allowable evaluation orders, so that the same translation is produced for any allowable order.
  – Ex: for the production L → En the semantic rule is print(E.val).
Application: construction of syntax trees. The following SDD builds a syntax tree for an expression:

Production      Semantic Rules
E → E1 + T      E.node = new Node(‘+’, E1.node, T.node)
E → E1 - T      E.node = new Node(‘-’, E1.node, T.node)
E → T           E.node = T.node
T → (E)         T.node = E.node
T → id          T.node = new Leaf(id, id.entry)
T → num         T.node = new Leaf(num, num.val)

This is an example of an S-attributed definition.
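A minimal sketch in C of the Node and Leaf constructors this SDD assumes; the struct layout and names are illustrative:

#include <stdlib.h>

/* A syntax-tree node: an interior operator node or a leaf. */
struct tnode {
    char op;                    /* '+', '-', or 0 for a leaf */
    int  leafval;               /* value (or symbol-table index) for leaves */
    struct tnode *left, *right;
};

/* new Node('+', l, r) of the SDD */
struct tnode *new_node(char op, struct tnode *l, struct tnode *r)
{
    struct tnode *n = malloc(sizeof *n);
    n->op = op; n->leafval = 0; n->left = l; n->right = r;
    return n;
}

/* new Leaf(num, num.val) of the SDD */
struct tnode *new_leaf(int val)
{
    struct tnode *n = malloc(sizeof *n);
    n->op = 0; n->leafval = val; n->left = n->right = NULL;
    return n;
}

int main(void)
{
    /* the rule E -> E1 + T becomes: E.node = new_node('+', E1.node, T.node) */
    struct tnode *e = new_node('+', new_leaf(1), new_leaf(2));
    return e == NULL;
}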
Syntax-Directed Translation Schemes
An SDT scheme is a context-free grammar with program fragments embedded within production bodies. The program fragments are called semantic actions and can appear at any position within the production body.
Any SDT can be implemented by first building a parse tree and then performing the actions in a left-to-right depth-first order.
Two important classes of SDTs are:
  x Postfix SDTs, and
  x L-attributed SDTs.
(A leftmost derivation is one in which the leftmost nonterminal is always chosen for expansion at each step of the derivation; a parse tree shows a graphical depiction of a derivation.)
Postfix SDT implemented on an LR-parser value stack:

Production      Semantic Rules
L → E n         { print(stack[top-1].val); top = top - 1; }
E → E1 + T      { stack[top-2].val = stack[top-2].val + stack[top].val; top = top - 2; }
E → T
T → T1 * F      { stack[top-2].val = stack[top-2].val * stack[top].val; top = top - 2; }
T → F
F → (E)         { stack[top-2].val = stack[top-1].val; top = top - 2; }
F → digit

• At each shift of digit, we also push digit.lexval onto the val-stack.
• At all other shifts, we do not put anything into the val-stack because the other terminals do not have attributes (but we increment the stack pointer for the val-stack).
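A minimal sketch of how these stack actions execute for the input 3*5+4n. The reductions are hard-coded rather than driven by a real LR table, the unit reductions (E → T, T → F, F → digit) that leave the value stack unchanged are omitted, and all names are illustrative:

#include <stdio.h>

int val[100];      /* parallel value stack */
int top = -1;      /* index of the parser-stack top */

/* shift a digit: push its lexval */
void shift_digit(int lexval) { val[++top] = lexval; }

/* shift a terminal with no attribute: just grow the stack */
void shift_other(void) { ++top; }

/* reduce E -> E1 + T : E1, +, T occupy top-2, top-1, top */
void reduce_add(void) { val[top-2] = val[top-2] + val[top]; top -= 2; }

/* reduce T -> T1 * F */
void reduce_mul(void) { val[top-2] = val[top-2] * val[top]; top -= 2; }

int main(void)
{
    shift_digit(3); shift_other(); shift_digit(5);  /* 3 * 5 */
    reduce_mul();                                   /* T.val = 15 */
    shift_other(); shift_digit(4);                  /* + 4 */
    reduce_add();                                   /* E.val = 19 */
    printf("%d\n", val[top]);                       /* prints 19 */
    return 0;
}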
Translation Schemes
• Translation schemes indicate the order of evaluation of the semantic actions associated with a production rule (i.e. when the semantic rules should be evaluated).
• A translation scheme is a context-free grammar in which:
  • attributes are associated with the grammar symbols, and
  • semantic actions enclosed between braces {} are inserted within the right sides of productions.
• Ex: A → { ... } X { ... } Y { ... }
Semantic Actions
• The position of a semantic action on the right side indicates when that semantic action will be evaluated.
Translation Schemes for S-attributed Definitions
• If our syntax-directed definition is S-attributed, the construction of the corresponding translation scheme is simple.
• Each semantic rule in an S-attributed syntax-directed definition is inserted as a semantic action at the end of the right side of the associated production.

Production      Semantic Rule
E → E1 + T      E.val = E1.val + T.val         (a production of a syntax-directed definition)
E → E1 + T      { E.val = E1.val + T.val }     (the production of the corresponding translation scheme)

SDT for infix-to-prefix translation during parsing:
L → E n
E → { print(‘+’); } E1 + T
E → T
T → { print(‘*’); } T1 * F
T → F
F → (E)
F → digit { print(digit.lexval); }
A Translation Scheme Example
• A simple translation scheme that converts infix expressions to the corresponding postfix expressions:
E → T R
R → + T { print(“+”) } R1
R → ε
T → id { print(id.name) }

Example: a+b+c → ab+c+
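Because this grammar is suitable for predictive parsing, the scheme can be run directly by a recursive-descent parser, with each embedded action executed at its position in the production body. A minimal sketch, assuming single-character identifiers and no error handling:

#include <stdio.h>

static const char *in;          /* remaining input */

static void T(void);
static void R(void);

/* E -> T R */
static void E(void) { T(); R(); }

/* R -> + T { print("+") } R1  |  epsilon */
static void R(void)
{
    if (*in == '+') {
        in++;                   /* match '+' */
        T();
        putchar('+');           /* the embedded semantic action */
        R();
    }                           /* else: R -> epsilon */
}

/* T -> id { print(id.name) } */
static void T(void)
{
    putchar(*in);               /* print the identifier */
    in++;                       /* match id */
}

int main(void)
{
    in = "a+b+c";
    E();                        /* prints ab+c+ */
    putchar('\n');
    return 0;
}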
When designing a translation scheme for an L-attributed definition, the following conditions must be satisfied:
1. An inherited attribute of a symbol on the right side of a production must be computed in a semantic action placed before that symbol.
2. A semantic action must not refer to a synthesized attribute of a symbol to its right.
3. A synthesized attribute of the non-terminal on the left side can only be computed after all the attributes it references have been computed (normally the action computing it is placed at the end of the right side of the production).
4. With an L-attributed syntax-directed definition, it is always possible to construct a corresponding translation scheme which satisfies these three conditions (this may not be possible for a general syntax-directed translation).
A Translation Scheme with Inherited Attributes
D → T id { addtype(id.entry, T.type); L.in = T.type } L
T → int { T.type = integer }
T → real { T.type = real }
L → , id { addtype(id.entry, L.in); L1.in = L.in } L1
L → ε
This is a translation scheme for an L-attributed definition.
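In a predictive parser, an inherited attribute such as L.in becomes a parameter of the procedure for L. A minimal sketch of that idea in C; the token handling is faked with an identifier array, and all names are illustrative:

#include <stdio.h>

typedef char Type;             /* 'i' for integer, 'r' for real (assumed encoding) */

static void addtype(const char *id, Type t)
{
    printf("%s : %c\n", id, t);   /* stand-in for a symbol-table update */
}

/* L -> , id { addtype(id.entry, L.in); L1.in = L.in } L1  |  epsilon */
static void L(const char **names, Type in)
{
    if (*names == NULL) return;   /* L -> epsilon */
    addtype(*names, in);          /* semantic action */
    L(names + 1, in);             /* L1 inherits L.in */
}

/* D -> T id { addtype(id.entry, T.type); L.in = T.type } L */
static void D(Type t, const char *id, const char **rest)
{
    addtype(id, t);
    L(rest, t);                   /* pass the inherited attribute down */
}

int main(void)
{
    const char *rest[] = { "b", "c", NULL };
    D('i', "a", rest);            /* int a, b, c */
    return 0;
}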
UNIT VI: INTERMEDIATE CODE GENERATION
SYLLABUS:
x Control flow; Back patching;
x Switch statements;
x Procedure calls.
Background
A compiler front end translates the source program into an intermediate representation, from which the back end generates target code:

  source program → front end → intermediate code → back end → target code

Static checking includes type checking, which ensures that operators are applied to compatible operands. It also includes any syntactic checks that remain after parsing.
Ex: static checking assures that a break statement in C is enclosed within a while, for or switch statement; otherwise an error message is issued.
Intermediate representation
A compiler may construct a sequence of intermediate representations, as in Fig 6.2.
Fig 6.2: A compiler might use a sequence of intermediate representations (figure)
High-level representations are close to the source language and low-level representations are close to the target machine.
There are three types of intermediate representation:
1. Syntax trees
2. Postfix notation
3. Three-address code
Eg 2: syntax tree for the assignment statement a = b*-c + b*-c, and the DAG for a = b*-c + b*-c (figures; in the DAG the repeated subexpression b*-c is represented by a single shared node).
The DAG representation may expose instances where redundancies can be eliminated.
SDD to construct a DAG for the expression a + a * ( b - c ) + ( b - c ) * d (figure).
6.1.2 The Value-Number Method for Constructing DAG’s
In many applications, nodes are implemented as records stored in an array, as in Figure 7. In the figure, each record has a label field that determines the nature of the node. We can refer to a node by its index in the array; the integer index of a node is often called its value number. For example, using value numbers, we can say node 3 has label +, its left child is node 1, and its right child is node 2. The following algorithm can be used to create nodes for a DAG representation of an expression.
(Figure 7: nodes of a DAG stored in an array of records, with a leaf entry pointing to the symbol-table entry for i.)
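A minimal sketch of the search-then-create lookup that assigns value numbers; the fixed-size array and the label encoding are illustrative:

#include <stdio.h>

/* One record per DAG node: an operator label and the value numbers
 * of its children (0 = no child). */
struct node { char label; int left, right; };

static struct node nodes[100];
static int nnodes = 0;

/* Return the value number of node (label, l, r): search the array
 * first and create a new record only if no such node exists. */
int value_number(char label, int l, int r)
{
    for (int i = 1; i <= nnodes; i++)
        if (nodes[i].label == label && nodes[i].left == l && nodes[i].right == r)
            return i;                            /* reuse the existing node */
    nodes[++nnodes] = (struct node){ label, l, r };
    return nnodes;
}

int main(void)
{
    /* build the DAG for i + 10 twice: the second request reuses node 3 */
    int i_   = value_number('i', 0, 0);          /* leaf for i, value number 1 */
    int ten  = value_number('c', 0, 0);          /* leaf for constant 10, value number 2 */
    int add  = value_number('+', i_, ten);
    int add2 = value_number('+', i_, ten);
    printf("%d %d\n", add, add2);                /* prints 3 3 */
    return 0;
}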
6.2 Three-Address Code
• In three-address code, there is at most one operator on the right side of an instruction; that is, no built-up arithmetic expressions are permitted.
    x + y * z    translates to    t1 = y * z
                                  t2 = x + t1
• Example 6.4: (figure)
Problems: write the three-address code for the following:
1. if (x + y * z > x * y + z) a = 0;
2. (2 + a * (b - c / d)) / e
3. a := b * -c + b * -c
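As a worked check, one possible three-address sequence for problem 2, (2 + a * (b - c / d)) / e, is:

t1 = c / d
t2 = b - t1
t3 = a * t2
t4 = 2 + t3
t5 = t4 / e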
Kinds of three-address statements:
– Assignment statements: x = y op z, where op is a binary operation
– x = op y, where op is a unary operation
– Copy statements: x = y
– Indexed assignments: x = y[i] and x[i] = y
– Pointer assignments: x = &y, *x = y and x = *y
Control-flow statements:
– Unconditional jump: goto L
– Conditional jump: if x relop y goto L ; if x goto L ; ifFalse x goto L
– Procedure calls: call procedure p with n parameters, and return y, where y is optional:
    param x1
    param x2
    ...
    param xn
    call p, n
• Example 6.5:
  – do i = i + 1; while (a[i] < v);  (figure)
6.2.2 Quadruples
• Three-address instructions can be implemented as objects or as records with fields for the operator and the operands.
• Three such representations: quadruples, triples, and indirect triples.
• A quadruple (or quad) has four fields: op, arg1, arg2, and result.
• Example 6.6: (figure)
6.2.3 Triples
• A triple has only three fields: op, arg1, and arg2.
• Using triples, we refer to the result of an operation x op y by its position, rather than by an explicit temporary name.
• Example 6.7: (figure)
Fig 6.11: Representations of a = b * - c + b * - c (figure)
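A minimal sketch of the two record layouts in C; the field widths and the string encoding of operands are illustrative:

#include <stdio.h>

/* Quadruple: the result names a temporary explicitly. */
struct quad   { char op[8]; char arg1[8], arg2[8], result[8]; };

/* Triple: the result is the instruction's own position, so
 * operands may refer to earlier triples by index, e.g. "(0)". */
struct triple { char op[8]; char arg1[8], arg2[8]; };

int main(void)
{
    /* t1 = minus c ; t2 = b * t1 as quadruples */
    struct quad q[] = {
        { "minus", "c", "",   "t1" },
        { "*",     "b", "t1", "t2" },
    };
    /* the same code as triples: (0) minus c ; (1) * b (0) */
    struct triple t[] = {
        { "minus", "c", ""    },
        { "*",     "b", "(0)" },
    };
    printf("%s %s -> %s\n", q[1].op, q[1].arg2, q[1].result);
    printf("(1): %s %s %s\n", t[1].op, t[1].arg1, t[1].arg2);
    return 0;
}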
6.3 Types and Declarations
• The applications of types can be grouped under checking and translation:
  – Type checking uses logical rules to reason about the behavior of a program at run time. Specifically, it ensures that the types of the operands match the type expected by an operator.
  – Translation applications: from the type of a name, a compiler can determine the storage that will be needed for that name at run time.
• Type information is also needed to calculate the address denoted by an array reference such as a[i].
6.3.1 Type Expressions
– A basic type is a type expression.
– A type name is a type expression.
– A type expression can be formed by applying the array type constructor to a number and a type expression.
– A record is a data structure with named fields.
– A type expression can be formed by using the type constructor → for function types.
– If s and t are type expressions, then their Cartesian product s × t is a type expression.
– Type expressions may contain variables whose values are type expressions.
6.3.2 Type Equivalence
• When type expressions are represented by graphs, two types are structurally equivalent if and only if one of the following conditions is true:
  – They are the same basic type.
  – They are formed by applying the same constructor to structurally equivalent types.
  – One is a type name that denotes the other.
6.3.3 Declarations
• We shall study types and declarations using a simplified grammar that declares just one name at a time.
    D → T id ; D | ε
    T → B C | record ‘{‘ D ‘}’
    B → int | float
    C → ε | [ num ] C
• Example 6.9: storage layout for local names (figure)
• Computing types and their widths (figure)
• Syntax-directed translation of array types (figure)
Sequences of declarations (figure)
6.4 Translation of Expressions
• An expression with more than one operator, like a + b * c, will translate into instructions with at most one operator per instruction.
• An array reference A[i][j] will expand into a sequence of three-address instructions that calculate an address for the reference.
6.4.1 Operations within Expressions
• Example 6.11: a = b + -c translates to
    t1 = minus c
    t2 = b + t1
    a = t2
Three-address code for expressions (figure)
Incremental translation (figure)
Semantic actions for an array reference (figure; the nonterminal L carries the attributes L.addr, L.array and L.type)
Conversions between primitive types in Java (figure)
Introducing type conversions into expression evaluation (figure)
Abstract syntax tree for a function definition (figure)
The unification algorithm (pseudocode; find and union are the usual union-find operations on type-graph nodes):

boolean unify(Node m, Node n)
{
    s = find(m); t = find(n);          /* representatives of m and n */
    if ( s == t ) return true;         /* already the same class */
    else if ( nodes s and t represent the same basic type ) return true;
    else if ( s is an op-node with children s1 and s2 and
              t is an op-node with children t1 and t2 ) {
        union(s, t);
        return unify(s1, t1) and unify(s2, t2);   /* unify corresponding children */
    }
    else if ( s or t represents a variable ) {
        union(s, t);                   /* bind the variable to the other class */
        return true;
    }
    else return false;                 /* structural mismatch: not unifiable */
}
Control Flow
Boolean expressions are often used to:
  1. alter the flow of control, and
  2. compute logical values.
Flow-of-control statements (figure)
Syntax-directed definition for flow-of-control statements (figure)
Translation of a simple if-statement (figure)
Backpatching
The previous codes for Boolean expressions insert symbolic labels for jumps; a separate pass is therefore needed to set them to the appropriate addresses. We can use a technique named backpatching to avoid this.
We assume instructions are saved into an array, and labels are indices into this array.
For nonterminal B we use two attributes, B.truelist and B.falselist, together with the following functions:
  makelist(i): creates a new list containing only i, an index into the array of instructions
  merge(p1, p2): concatenates the lists pointed to by p1 and p2 and returns a pointer to the concatenated list
  backpatch(p, i): inserts i as the target label for each of the instructions on the list pointed to by p
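A minimal sketch of the three list-manipulation functions in C, with patch lists as singly linked lists and jump targets kept in a parallel array; all names are illustrative:

#include <stdio.h>
#include <stdlib.h>

/* An entry on a patch list: the index of one incomplete jump. */
struct list { int instr; struct list *next; };

static int target[1000];   /* target[i] = jump target of instruction i (0 = unfilled) */

/* makelist(i): a new list containing only instruction index i. */
struct list *makelist(int i)
{
    struct list *p = malloc(sizeof *p);
    p->instr = i; p->next = NULL;
    return p;
}

/* merge(p1, p2): concatenate two patch lists. */
struct list *merge(struct list *p1, struct list *p2)
{
    if (p1 == NULL) return p2;
    struct list *q = p1;
    while (q->next) q = q->next;
    q->next = p2;
    return p1;
}

/* backpatch(p, i): fill i in as the target of every jump on list p. */
void backpatch(struct list *p, int i)
{
    for (; p; p = p->next)
        target[p->instr] = i;
}

int main(void)
{
    /* two incomplete jumps at instructions 100 and 102 */
    struct list *truelist = merge(makelist(100), makelist(102));
    backpatch(truelist, 104);                     /* both now jump to 104 */
    printf("%d %d\n", target[100], target[102]);  /* prints 104 104 */
    return 0;
}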
Annotated parse tree for x < 100 || x > 200 && x != y (figure)
Flow-of-control statements (figure)
Translation of a switch-statement (figure)
UNIT VII: RUN-TIME ENVIRONMENTS
SYLLABUS:
x Heap management;
x Introduction to garbage collection.
The compiler must do the storage allocation and provide access to variables and data. This involves:
  Memory management
  Stack allocation
  Heap management
  Garbage collection
Storage Organization
• We assume a logical address space
  – The operating system will later map it to physical addresses, decide how to use cache memory, etc.
• Memory is typically divided into areas for
  – Program code
  – Other static data storage, including global constants and compiler-generated data
  – Stack, to support the call/return policy for procedures
  – Heap, to store data that can outlive a call to a procedure
Static vs. Dynamic Allocation
Static: compile-time allocation; Dynamic: run-time allocation.
Many compilers use some combination of the following:
  Stack storage: for local variables, parameters and so on
  Heap storage: for data that may outlive the call to the procedure that created it
Stack allocation is a valid allocation scheme for procedures since procedure calls are nested.
Activations for quicksort (figure)
Activation tree representing calls during an execution of quicksort (figure)
Activation records
Procedure calls and returns are usually managed by a run-time stack called the control stack.
Each live activation has an activation record (sometimes called a frame).
The root of the activation tree is at the bottom of the stack.
The current execution path specifies the content of the stack, with the record of the last activation at the top of the stack.
Activation Record
An activation record holds:
  Temporary values
  Local data
  A saved machine status
  An “access link”
  A control link
  Space for the return value of the called function
  The actual parameters used by the calling procedure
Elements in the activation record:
  temporary values that could not fit into registers
  local variables of the procedure
  saved machine status for the point at which this procedure was called; this includes the return address and the contents of registers to be restored
  access link to the activation record of the previous block or procedure in the lexical scope chain
  control link pointing to the activation record of the caller
  space for the return value of the function, if any
  actual parameters (or they may be placed in registers, if possible)
Access to dynamically allocated arrays (figure)
ML
ML is a functional language.
Variables are defined, and have their unchangeable values initialized, by a statement of the form:
  val (name) = (expression)
Functions are defined using the syntax:
  fun (name) ( (arguments) ) = (body)
Access links for finding nonlocal data (figure)
Maintaining the display (figure)
Memory Manager
Two basic functions:
  Allocation
  Deallocation
Properties of memory managers:
  Space efficiency
  Program efficiency
  Low overhead
Typical memory hierarchy configurations (figure)
Locality in Programs
The conventional wisdom is that programs spend 90% of their time executing 10% of the code:
  Programs often contain many instructions that are never executed.
  Only a small fraction of the code that could be invoked is actually executed in a typical run of the program.
  The typical program spends most of its time executing innermost loops and tight recursive cycles in a program.
UNIT VIII: CODE GENERATION
SYLLABUS:
x Basic blocks and Flow graphs;
x Optimization of basic blocks;
x A Simple Code Generator.
• Requirements on a code generator:
  – Preserving the semantic meaning of the source program and being of high quality
  – Making effective use of the available resources of the target machine
  – The code generator itself must run efficiently.
• A code generator has three primary tasks:
  – Instruction selection, register allocation, and instruction ordering
Issues in the Design of a Code Generator
• General tasks in almost all code generators: instruction selection, register allocation and assignment.
  – The details also depend on the specifics of the intermediate representation, the target language, and the run-time system.
• The most important criterion for a code generator is that it produce correct code. Given the premium on correctness, designing a code generator so it can be easily implemented, tested, and maintained is an important design goal.
Input to the Code Generator
• The input to the code generator is
  – the intermediate representation of the source program produced by the front end, along with
  – information in the symbol table that is used to determine the run-time addresses of the data objects denoted by the names in the IR.
The Target Program
• A stack-based target machine performs the operations on the operands at the top of the stack.
• Java Virtual Machine (JVM)
  – Just-in-time Java compiler
• The target program can be produced as
  – An absolute machine-language program
  – A relocatable machine-language program
  – An assembly-language program
• In this chapter we
  – use a very simple RISC-like computer as the target machine,
  – add some CISC-like addressing modes, and
  – use assembly code as the target language.
Instruction Selection
• The code generator must map the IR program into a code sequence that can be executed by the target machine.
• The complexity of the mapping is determined by factors such as
  – the level of the IR
  – the nature of the instruction-set architecture
  – the desired quality of the generated code
• If the IR is high level, code templates can be used to translate each IR statement into a sequence of machine instructions.
• The quality of the generated code is usually determined by its speed and size.
• A given IR program can be implemented by many different code sequences, with significant cost differences between the different implementations.
• A naïve translation of the intermediate code may therefore lead to correct but unacceptably inefficient target code.
• For example, use INC for a = a + 1 instead of
    LD R0, a
    ADD R0, R0, #1
    ST a, R0
• We need to know instruction costs in order to design good code sequences but, unfortunately, accurate cost information is often difficult to obtain.
Register Allocation
• A key problem in code generation is deciding what values to hold in what registers.
• Efficient utilization of registers is particularly important.
• The use of registers is often subdivided into two subproblems:
  1. Register allocation, during which we select the set of variables that will reside in registers at each point in the program, and
  2. Register assignment, during which we pick the specific register that a variable will reside in.
Example: (figure)
Evaluation Order
• The order in which computations are performed can affect the efficiency of the target code.
• Some computation orders require fewer registers to hold intermediate results than others.
• However, picking a best order in the general case is a difficult NP-complete problem.
A Simple Target Machine Model
• Our target computer models a three-address machine with load and store operations, computation operations, jump operations, and conditional jumps.
• The underlying computer is a byte-addressable machine with n general-purpose registers.
• Assume the following kinds of instructions are available:
  – Load operations: LD dst, addr
  – Store operations: ST x, r
  – Computation operations: OP dst, src1, src2
  – Unconditional jumps: BR L
  – Conditional jumps: Bcond r, L
Program and Instruction Costs
• For simplicity, we take the cost of an instruction to be one plus the costs associated with the addressing modes of the operands.
• Addressing modes involving registers have zero additional cost, while those involving a memory location or a constant have an additional cost of one.
• For example,
  – LD R0, R1         cost = 1
  – LD R0, M          cost = 2
  – LD R1, *100(R2)   cost = 3
Addresses in the Target Code
• We show how names in the IR can be converted into addresses in the target code by looking at code generation for simple procedure calls and returns, using static and stack allocation.
• In Section 7.1, we described how each executing program runs in its own logical address space that is partitioned into four code and data areas:
  1. A statically determined area Code that holds the executable target code.
  2. A statically determined data area Static, for holding global constants and other data generated by the compiler.
  3. A dynamically managed area Heap, for holding data objects that are allocated and freed during program execution.
  4. A dynamically managed area Stack, for holding activation records as they are created and destroyed during procedure calls and returns.
• Code optimization:
  – A transformation of a program to make it run faster and/or take up less space.
  – Optimization should be safe, i.e. preserve the meaning of the program.
  – Code optimization is an important component in a compiler.
  – Examples:
    • Flow-of-control optimization:
        goto L1                 goto L2
        ...             →       ...
        L1: goto L2             L1: goto L2
    • Algebraic simplification:
        x := x + 0  and  x := x * 1  are equivalent to a nop.
    • Reduction in strength:
        x^2    →  x * x
        x * 4  →  x << 2
    • Instruction selection:
        Sometimes some hardware instructions can implement certain operations efficiently.
• Code optimization can be either high level or low level:
  – High-level code optimizations:
    • Loop unrolling, loop fusion, procedure inlining
  – Low-level code optimizations:
    • Instruction selection, register allocation
  – Some optimizations can be done at both levels:
    • Common subexpression elimination, strength reduction, etc.
  – The flow graph is a common intermediate representation for code optimization.
• Basic block: a sequence of consecutive statements with exactly one entry and one exit.
• Flow graph: a directed graph where the nodes are basic blocks, with an edge B1 → B2 if and only if B2 can be executed immediately after B1.
• Algorithm to construct the flow graph (see the C sketch after the example below):
  – Finding the leaders of the basic blocks:
    • The first statement is a leader.
    • Any statement that is the target of a conditional or unconditional goto is a leader.
    • Any statement that immediately follows a goto or conditional goto statement is a leader.
  – For each leader, its basic block consists of all statements up to the next leader.
  – There is an edge B1 → B2 if and only if B2 can be executed immediately after B1.
• Example:
100: sum = 0
ww
101: j = 0
102: goto 107
103: t1 = j << 2
104: t2 = addr(a)
105: t3 = t2[t1]
106: sum = sum + t3
107: if j < n goto 103
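A minimal sketch of the leader-finding rules applied to the example above; the instruction encoding is illustrative:

#include <stdio.h>

/* One three-address instruction: whether it can jump, and where to. */
struct instr { int is_jump; int target; };

int main(void)
{
    /* instructions 100..107 of the example; target is an index 0..7 */
    struct instr code[8] = {
        {0, 0}, {0, 0}, {1, 7},          /* 102: goto 107 */
        {0, 0}, {0, 0}, {0, 0}, {0, 0},
        {1, 3},                          /* 107: if j < n goto 103 */
    };
    int n = 8, leader[8] = {0};

    leader[0] = 1;                            /* rule 1: first statement */
    for (int i = 0; i < n; i++)
        if (code[i].is_jump) {
            leader[code[i].target] = 1;       /* rule 2: jump target */
            if (i + 1 < n) leader[i + 1] = 1; /* rule 3: follows a jump */
        }

    for (int i = 0; i < n; i++)
        if (leader[i]) printf("leader: %d\n", 100 + i);
    /* prints 100, 103, 107 */
    return 0;
}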
• Optimizations within a basic block are called local optimizations.
• Optimizations across basic blocks are called global optimizations.
• Some common optimizations:
  – Instruction selection
  – Register allocation
  – Common subexpression elimination
  – Code motion
  – Strength reduction
  – Induction variable elimination
  – Dead code elimination
  – Branch chaining
  – Jump elimination
  – Instruction scheduling
  – Procedure inlining
  – Loop unrolling
  – Loop fusing
  – Code hoisting
• Instruction selection:
  – Using a more efficient instruction to replace a sequence of instructions (space and speed).
  – Example:
      Mov R2, (R3)
      Add R2, #1, R2     →     Add (R3), 1, (R3)
      Mov (R3), R2
• Register allocation: allocate variables to registers (speed). Example: (figure)
• Induction variable elimination: the value of one variable can be induced from another variable. Example: (figure)
• Common subexpression elimination: an expression was previously calculated and the variables in the expression have not changed since, so we can avoid re-computing the expression. Example: (figure)
ALGEBRAIC TRANSFORMATION
Statements such as
  x := x + 0
or
  x := x * 1
can be eliminated from a basic block without changing the set of expressions it computes. The exponentiation operator in the statement
  x := y ** 2
usually requires a function call to implement. Using an algebraic transformation, this statement can be replaced by the cheaper, but equivalent, statement
  x := y * y
FLOW GRAPHS
We can add the flow-of-control information to the set of basic blocks making up a program by constructing a directed graph called a flow graph. The nodes of the flow graph are the basic blocks. One node is distinguished as initial; it is the block whose leader is the first statement. There is a directed edge from block B1 to block B2 if B2 can immediately follow B1 in some execution sequence; that is, if
1. there is a conditional or unconditional jump from the last statement of B1 to the first statement of B2, or
2. B2 immediately follows B1 in the order of the program, and B1 does not end in an unconditional jump.
We say B1 is a predecessor of B2, and B2 is a successor of B1.
Example 4: The flow graph of the program of fig. 7 is shown in fig. 9; B1 is the initial node.
B1:  prod := 0
     i := 1
B2:  t1 := 4 * i
     t2 := a[t1]
     t3 := 4 * i
     t4 := b[t3]
     t5 := t2 * t4
     t6 := prod + t5
     prod := t6
     t7 := i + 1
     i := t7
     if i <= 20 goto B2

Basic blocks can be represented by a variety of data structures. For example, after partitioning the three-address statements by Algorithm 1, each basic block can be represented by a record consisting of a count of the number of quadruples in the block, followed by a pointer to the leader of the block, and by the lists of predecessors and successors of the block. Jumps should then refer to blocks rather than to quadruple positions: if the block B2, running from statements (3) through (12) in the intermediate code of figure 2, were moved elsewhere in the quadruples array or were shrunk, the (3) in "if i <= 20 goto (3)" would have to be changed.
LOOPS
A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected: from any node in the loop to any other, there is a path of length one or more, wholly within the loop, and
2. The collection of nodes has a unique entry, that is, a node in the loop such that the only way to reach a node of the loop from a node outside the loop is to first go through the entry.
A loop that contains no other loops is called an inner loop.
REDUNDANT LOADS AND STORES
If we see the instruction sequence
  (1) MOV R0, a
  (2) MOV a, R0
we can delete instruction (2), because whenever (2) is executed, (1) will ensure that the value of a is already in register R0. If (2) had a label, we could not be sure that (1) was always executed immediately before (2), and so we could not remove (2).
UNREACHABLE CODE
Another opportunity for peephole optimization is the removal of unreachable instructions. An unlabeled instruction immediately following an unconditional jump may be removed. This operation can be repeated to eliminate a sequence of instructions. For example, for debugging purposes, a large program may have within it certain segments that are executed only if a variable debug is 1. In C, the source code might look like:

#define debug 0
...
if ( debug ) {
    print debugging information
}

In the intermediate representation the if-statement may be translated as:

    if debug = 1 goto L1
    goto L2
L1: print debugging information
L2: ...                                        (a)

One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what the value of debug, (a) can be replaced by:

    if debug != 1 goto L2
    print debugging information
L2: ...

Similarly, we can replace the jump sequence
    goto L1
    ...
L1: goto L2
by the sequence
    goto L2
    ...
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement L1: goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence
    if a < b goto L1
    ...
L1: goto L2
can be replaced by
    if a < b goto L2
    ...
L1: goto L2
Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto. Then the sequence
    goto L1
    ...
L1: if a < b goto L2
L3:                                            (1)
may be replaced by
    if a < b goto L2
    goto L3
    ...
L3:                                            (2)
While the number of instructions in (1) and (2) is the same, we sometimes skip the unconditional jump in (2), but never in (1). Thus (2) is superior to (1) in execution time.
ALGEBRAIC SIMPLIFICATION
There is no end to the amount of algebraic simplification that can be attempted through peephole optimization; statements such as x := x + 0 or x := x * 1 can simply be eliminated.
REDUCTION IN STRENGTH
Reduction in strength replaces expensive operations by equivalent cheaper ones on the target machine. For example, x^2 is invariably cheaper to implement as x*x than as a call to an exponentiation routine. Fixed-point multiplication or division by a power of two is cheaper to implement as a shift. Floating-point division by a constant can be implemented as multiplication by a constant, which may be cheaper.
USE OF MACHINE IDIOMS
The target machine may have hardware instructions to implement certain specific operations efficiently. Detecting situations that permit the use of these instructions can reduce execution time significantly. For example, some machines have auto-increment and auto-decrement addressing modes. These add or subtract one from an operand before or after using its value. The use of these modes greatly improves the quality of code when pushing or popping a stack, as in parameter passing. These modes can also be used in code for statements like i := i + 1.
Getting Better Performance
Dramatic improvements in the running time of a program, such as cutting the running time from a few hours to a few seconds, are usually obtained by improving the program at all levels, from the source level to the target level, as suggested by the figure. At each level, the available options fall between the two extremes of finding a better algorithm and of implementing a given algorithm so that fewer operations are performed.
Algorithmic transformations occasionally produce spectacular improvements in running time. For example, Bentley relates that the running time of a program for sorting N elements dropped from 2.02N^2 microseconds to 12N log2 N microseconds when a carefully coded "insertion sort" was replaced by "quicksort".
THE PRINCIPAL SOURCES OF OPTIMIZATION
Here we introduce some of the most useful code-improving transformations. Techniques for implementing these transformations are presented in subsequent sections. A transformation of a program is called local if it can be performed by looking only at the statements in a basic block; otherwise, it is called global.
Frequently, a program will include several calculations of the same value, such as an offset in an array. Some of these duplicate calculations cannot be avoided by the programmer because they lie below the level of detail accessible within the source language. For example, block B5 shown in the figure recalculates 4*i and 4*j.
Local common subexpression elimination
Common Subexpressions
An occurrence of an expression E is called a common subexpression if E was previously computed, and the values of the variables in E have not changed since the previous computation. We can avoid recomputing the expression if we can use the previously computed value. For example, the assignments to t7 and t10 have the common subexpressions 4*i and 4*j, respectively, on the right side in the figure. They have been eliminated by using t6 instead of t7 and t8 instead of t10. This change is what would result if we reconstructed the intermediate code from the DAG for the basic block.
Example: The figure shows the result of eliminating both global and local common subexpressions from blocks B5 and B6 in the flow graph. We first discuss the transformation of B5 and then mention some subtleties involving arrays.
For example, the statements
  t8 := 4*j ; t9 := a[t8] ; a[t8] := x
in B5 can be replaced by
  t9 := a[t4] ; a[t4] := x
using t4 computed in block B3. In the figure, observe that as control passes from the evaluation of 4*j in B3 to B5, there is no change in j, so t4 can be used if 4*j is needed.
Another common subexpression comes to light in B5 after t4 replaces t8. The new expression a[t4] corresponds to the value of a[j] at the source level. Not only does j retain its value as control leaves B3 and then enters B5, but a[j], a value computed into a temporary t5, does too, because there are no assignments to elements of the array a in the interim. The statement
  t9 := a[t4] ; a[t6] := t9
in B5 can therefore be replaced by
  a[t6] := t5
The expression a[t1] in blocks B1 and B6 is not considered a common subexpression, although t1 can be used in both places. After control leaves B1 and before it reaches B6, it can go through B5, where there are assignments to a. Hence, a[t1] may not have the same value on reaching B6 as it did on leaving B1, and it is not safe to treat a[t1] as a common subexpression.
Copy Propagation
Block B5 in the figure can be further improved by eliminating x, using two new transformations. One concerns assignments of the form f := g, called copy statements, or copies for short. Had we gone into more detail in Example 10.2, copies would have arisen much sooner, because the algorithm for eliminating common subexpressions introduces them, as do several other algorithms. For example, when the common subexpression in c := d+e is eliminated in the figure, the algorithm uses a new variable t to hold the value of d+e. Since control may reach c := d+e either after the assignment to a or after the assignment to b, it would be incorrect to replace c := d+e by either c := a or by c := b.
The idea behind the copy-propagation transformation is to use g for f, wherever possible after the copy statement f := g. For example, the assignment x := t3 in block B5 of the figure is a copy. Copy propagation applied to B5 yields:
  x := t3
  a[t2] := t5
  a[t4] := t3
  goto B2
Dead-Code Elimination
A variable is live at a point in a program if its value can be used subsequently; otherwise, it is dead at that point. A related idea is dead or useless code: statements that compute values that never get used. Earlier we discussed the use of a variable debug that is set to true or false at various points in the program, and used in statements like
  if (debug) print ...
By a data-flow analysis, it may be possible to deduce that each time the program reaches this statement, the value of debug is false. Usually, it is because there is one particular statement
  debug := false
that we can deduce to be the last assignment to debug prior to the test, no matter what sequence of branches the program actually takes. If copy propagation replaces debug by false, then the print statement is dead because it cannot be reached. We can eliminate both the test and the printing from the object code. More generally, deducing at compile time that the value of an expression is a constant and using the constant instead is known as constant folding.
One advantage of copy propagation is that it often turns the copy statement into dead code. For example, copy propagation followed by dead-code elimination removes the assignment to x and transforms the code above into:
  a[t2] := t5
  a[t4] := t3
  goto B2
Loop Optimizations
We now give a brief introduction to a very important place for optimizations, namely loops, especially the inner loops where programs tend to spend the bulk of their time. The running time of a program may be improved if we decrease the number of instructions in an inner loop, even if we increase the amount of code outside that loop. Three techniques are important for loop optimization: code motion, which moves code outside a loop; induction-variable elimination, which we apply to eliminate i and j from the inner loops B2 and B3; and reduction in strength, which replaces an expensive operation by a cheaper one, such as a multiplication by an addition.
Code Motion
An important modification that decreases the amount of code in a loop is code motion. This transformation takes an expression that yields the same result independent of the number of times a loop is executed (a loop-invariant computation) and places the expression before the loop. Note that the notion "before the loop" assumes the existence of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant computation in the following while-statement:
  while ( i <= limit-2 )
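Code motion transforms this into the equivalent of:

  t = limit - 2
  while ( i <= t )

so that limit-2 is computed once, before the loop, instead of on every iteration.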
Induction Variables and Reduction in Strength
When there are two or more induction variables in a loop, it may be possible to get rid of all but one, by the process of induction-variable elimination. For the inner loop around B3 in the figure we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4. However, we can illustrate reduction in strength and a part of the process of induction-variable elimination. Eventually j will be eliminated when the outer loop of B2-B5 is considered.
Example: As the relationship t4 := 4*j surely holds after such an assignment to t4 in the figure, and t4 is not changed elsewhere in the inner loop around B3, it follows that just after the statement j := j-1 the relationship t4 = 4*j - 4 must hold. We may therefore replace the assignment t4 := 4*j by t4 := t4 - 4. The only problem is that t4 does not have a value when we enter block B3 for the first time. Since we must maintain the relationship t4 = 4*j on entry to block B3, we place an initialization of t4 at the end of the block where j itself is initialized, shown by the dashed addition to block B1 in the second figure.
The replacement of a multiplication by a subtraction will speed up the object code if multiplication takes more time than addition or subtraction, as is the case on many machines.
Before and after strength reduction (figure)