Material 1
Introduction
Roadmap
• Overview
• Front end
• Back end
• Multi-pass compilers
Compilers, Interpreters …
What is a compiler?
Why do we care?
Compiler construction is a microcosm of computer science:
• artificial intelligence: greedy algorithms, learning algorithms
• algorithms: graph algorithms, union-find, dynamic programming
• theory: DFAs for scanning, parser generators, lattice theory for analysis
• systems: allocation and naming, locality, synchronization
• architecture: pipeline management, hierarchy management, instruction set use
Abstract view
Traditional two pass compiler
A fallacy!
Front end
Parser
A grammar G = (S,N,T,P)
• S is the start-symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols
• P is a set of productions, P: N → (N ∪ T)*
Deriving valid sentences
Given a grammar, valid sentences can be derived by repeated substitution. To recognize a valid sentence in some CFG, we reverse this process and build up a parse.

Production   Result
             <goal>
1            <expr>
2            <expr> <op> <term>
5            <expr> <op> y
7            <expr> - y
2            <expr> <op> <term> - y
4            <expr> <op> 2 - y
6            <expr> + 2 - y
3            <term> + 2 - y
5            x + 2 - y
Parse trees
Abstract syntax trees
Parse trees record every step of a derivation, much of which is not needed later, so compilers often use an abstract syntax tree (AST) instead.
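As a small sketch (the class and field names below are illustrative, not taken from any particular compiler), the expression x + 2 - y from the derivation above could be represented by an AST like this in Java:

// Minimal, hypothetical AST classes for simple expressions.
abstract class Expr { }

class Id extends Expr {                    // a variable reference, e.g. x
    final String name;
    Id(String name) { this.name = name; }
}

class Num extends Expr {                   // an integer literal, e.g. 2
    final int value;
    Num(int value) { this.value = value; }
}

class BinOp extends Expr {                 // a binary operation such as + or -
    final char op;
    final Expr left, right;
    BinOp(char op, Expr left, Expr right) {
        this.op = op; this.left = left; this.right = right;
    }
}

class AstDemo {
    public static void main(String[] args) {
        // (x + 2) - y : the tree keeps operands and operators but drops
        // the intermediate non-terminals and other concrete syntax.
        Expr ast = new BinOp('-',
                             new BinOp('+', new Id("x"), new Num(2)),
                             new Id("y"));
        System.out.println(ast instanceof BinOp);   // true
    }
}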
Back end
Instruction selection
Register allocation
Traditional three-pass compiler
Optimizer (middle end)
Modern optimizers are usually built as a set of passes
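One common way to organise such passes (a sketch only; the Pass interface and class names here are hypothetical, not from a specific compiler) is to give every pass the same interface and run them in sequence over the intermediate representation:

import java.util.List;

// Placeholder for the compiler's intermediate representation.
class IR { /* ... */ }

// Every optimization pass transforms the IR and can be enabled and ordered independently.
interface Pass {
    IR run(IR input);
}

class Optimizer {
    private final List<Pass> passes;

    Optimizer(List<Pass> passes) { this.passes = passes; }

    IR optimize(IR ir) {
        for (Pass p : passes) {      // each pass sees the output of the previous one
            ir = p.run(ir);
        }
        return ir;
    }
}

Because each pass depends only on the IR, passes can be added, removed or reordered without touching the front end or the back end.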
Compiler phases

Lex: Break the source file into individual words, or tokens.
Parse: Analyse the phrase structure of the program.
Parsing Actions: Build a piece of abstract syntax tree for each phrase.
Semantic Analysis: Determine what each phrase means, relate uses of variables to their definitions, check types of expressions, request translation of each phrase.
Frame Layout: Place variables, function parameters, etc., into activation records (stack frames) in a machine-dependent way.
Translate: Produce intermediate representation trees (IR trees), a notation that is not tied to any particular source language or target machine.
Canonicalize: Hoist side effects out of expressions, and clean up conditional branches, for the convenience of later phases.
Instruction Selection: Group IR-tree nodes into clumps that correspond to actions of target-machine instructions.
Control Flow Analysis: Analyse the sequence of instructions into a control flow graph showing all possible flows of control the program might follow when it runs.
Data Flow Analysis: Gather information about the flow of data through the variables of the program; e.g., liveness analysis calculates the places where each variable holds a still-needed (live) value.
Register Allocation: Choose registers for variables and temporary values; variables not simultaneously live can share the same register.
Code Emission: Replace the temporary names in each machine instruction with registers.
A straight-line programming language (no loops or conditionals):

Stm → Stm ; Stm             CompoundStm
Stm → id := Exp             AssignStm
Stm → print ( ExpList )     PrintStm
Exp → id                    IdExp
Exp → num                   NumExp
Exp → Exp Binop Exp         OpExp
Exp → ( Stm , Exp )         EseqExp
ExpList → Exp , ExpList     PairExpList
ExpList → Exp               LastExpList
Binop → +                   Plus
Binop → -                   Minus
Binop → *                   Times
Binop → /                   Div
a := 5 + 3; b := (print(a, a-1), 10*a); print(b)

prints:
8 7
80
Tree representation
a := 5 + 3; b := (print(a, a-1), 10*a); print(b)
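One way to represent such trees in Java follows. This is a sketch: the class names mirror the grammar's rule names (CompoundStm, AssignStm, …), but the exact fields and constructors are only one reasonable choice.

// Abstract syntax for the straight-line language (illustrative field names).
abstract class Stm { }
class CompoundStm extends Stm {
    Stm stm1, stm2;
    CompoundStm(Stm s1, Stm s2) { stm1 = s1; stm2 = s2; }
}
class AssignStm extends Stm {
    String id; Exp exp;
    AssignStm(String i, Exp e) { id = i; exp = e; }
}
class PrintStm extends Stm {
    ExpList exps;
    PrintStm(ExpList e) { exps = e; }
}

abstract class Exp { }
class IdExp extends Exp {
    String id;
    IdExp(String i) { id = i; }
}
class NumExp extends Exp {
    int num;
    NumExp(int n) { num = n; }
}
class OpExp extends Exp {
    static final int PLUS = 1, MINUS = 2, TIMES = 3, DIV = 4;
    Exp left; int oper; Exp right;
    OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; }
}
class EseqExp extends Exp {
    Stm stm; Exp exp;
    EseqExp(Stm s, Exp e) { stm = s; exp = e; }
}

abstract class ExpList { }
class PairExpList extends ExpList {
    Exp head; ExpList tail;
    PairExpList(Exp h, ExpList t) { head = h; tail = t; }
}
class LastExpList extends ExpList {
    Exp head;
    LastExpList(Exp h) { head = h; }
}

class BuildProg {
    // a := 5 + 3; b := (print(a, a-1), 10*a); print(b)
    static Stm prog =
        new CompoundStm(
            new AssignStm("a",
                new OpExp(new NumExp(5), OpExp.PLUS, new NumExp(3))),
            new CompoundStm(
                new AssignStm("b",
                    new EseqExp(
                        new PrintStm(new PairExpList(new IdExp("a"),
                            new LastExpList(new OpExp(new IdExp("a"), OpExp.MINUS, new NumExp(1))))),
                        new OpExp(new NumExp(10), OpExp.TIMES, new IdExp("a")))),
                new PrintStm(new LastExpList(new IdExp("b")))));
}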
What you should know!
What is the difference between a compiler
and an interpreter?
What are important qualities of compilers?
Why are compilers commonly split into
multiple passes?
What are the typical responsibilities of the
different parts of a modern compiler?
How are context-free grammars specified?
What is “abstract” about an abstract syntax
tree?
Can you answer these questions?
Is Java compiled or interpreted? What about
Smalltalk? Ruby? PHP? Are you sure?
What are the key differences between modern
compilers and compilers written in the 1970s?
Why is it hard for compilers to generate good
error messages?
What is “context-free” about a context-free
grammar?
THE STRUCTURE OF A COMPILER
Source Program
  → Lexical Analysis
  → Syntax Analysis
  → Intermediate code generation
  → Code Optimization
  → Code generation
  → Target Program

Table Management and Error Handling interact with every phase.
• Structure of compiler
PHASES OF A COMPILER
• Example
• Tokenize the program statement below:
  1. a += 5.0;
• Answer: the statement is broken into the tokens
  a   +=   5.0   ;
  (the identifier a, the operator +=, the constant 5.0 and the punctuation symbol ;).
Example 2.
• Tokenize: if (a<b)
• Answer: the tokens are
  if   (   a   <   b   )
(Figure: the front end builds a syntax tree for the source, e.g. an expression over id1, id2 and id3, consulting the symbol table; the back end then turns the tree into code.)
Advantages of front/back end model
• By keeping the same front end and attaching different back ends, one can produce compilers for the same source language on different machines.
• By keeping different front ends and the same back end, one can compile several different languages for the same machine.
Front end/back end structure
Language 1, Language 2, …, Language n → Intermediate code → Machine 1, Machine 2, …, Machine n
Compiler Construction tools
Example 1: Take G1 = (∑, N, P, A), where
∑ = {a, b, c} … terminals
N = {A, B} … non-terminals
A … start symbol
P, the set of productions:
1) A → aB    4) B → a
2) A → bB    5) B → b
3) A → cB    6) B → c
We can apply the production rules, for instance the 1st and the last rule, to generate a string:
A ═> aB ═> ac
The sequence of steps from A to ac is called a derivation. The result is that we derive a string in the language. This may be written as
A → aB → ac
or
A →* ac   (the string ac can be derived from A by zero or more steps).
Using rule 4 instead of the last rule would give A ═> aB ═> aa.
G1 is an example of the kind of grammar used to define a scanner, although in this case the language is finite. Most scanner grammars define infinite languages.
Example 2: Take G2 = (∑, N, P, E)
∑ = {n, +, *, ), (}
N = {E, T, F}
P, the set of productions:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → (E)
6. F → n
We can rewrite production No. 1 repeatedly:
E ═> E+T ═> E+T+T ═> ........
The parse tree for the sentence n + n * n is:

        E
      / | \
     E  +  T
     |    / | \
     T   T  *  F
     |   |     |
     F   F     n
     |   |
     n   n
AMBIGUOUS GRAMMAR
If two different parse trees can be constructed for the same sentence, then the grammar is said to be ambiguous.
The example grammars described so far have exactly one non-terminal symbol on the left-hand side of each production.
This form is characteristic of grammars used in compiler construction.
However, there are other classes of grammar without this restriction, as given below:
Example 3.
G3 = ({a, b, c}, {A, B, C}, P, A)
P is the set of productions:
1) A → aABC    4) cC → cc
2) A → aBC     5) CB → BC
3) bC → bc     6) aB → ab
               7) bB → bb
To generate the string aabbcc the following derivation applies (the number in parentheses is the production used at each step):
A ═> aABC       (1)
  ═> aaBCBC     (2)
  ═> aaBBCC     (5)
  ═> aabBCC     (6)
  ═> aabbCC     (7)
  ═> aabbcC     (3)
  ═> aabbcc     (4)
The Chomsky Hierarchy
Chomsky defined four levels of language complexity. Subsequent research has identified four corresponding classes of automata (abstract machines) which can recognise exactly those strings in the languages generated by their respective grammars.
Chomsky class   Grammar             Recogniser
3               Regular             Finite State Automaton (FSA)
2               Context-free        Push-Down Automaton (PDA)
1               Context-sensitive   Linear Bounded Automaton (LBA)
0               Unrestricted        Turing Machine (TM)
GRAMMARS AND THEIR MACHINES
An automaton consists of a control mechanism with a finite number of states and some form of tape which may be read and advanced, as in a magnetic tape player, and possibly also written and moved in either direction. A finite alphabet defines the symbols that may appear on the tape, the same as the alphabet in the language definition.
The initial content of the tape is a string of characters from the alphabet; it is the input to the automaton. From the start state the automaton sequences through its states, based on what is on the tape and the rules in its control. Each step is called a transition. At any time the automaton may halt, and by doing so it is said to accept the input string. It may also block, i.e. reach a state for which it has no rule by which to proceed with the current tape input; if this happens, the automaton is said to reject the input. Automata are distinguished primarily by the kind of tape they have.
TURING MACHINE (TM)
It recognises the unrestricted grammars, i.e. level 0. A level 0 language is called a recursively enumerable language, because it is the set of those strings that can be enumerated (listed) by a Turing machine. Turing machines are an important basis for much of the theory of computability and computational complexity, but as a practical tool for compiling programs in a production environment they are hopelessly inefficient. Level 0 is therefore not a useful language class for compiler design.
LINEAR BOUNDED AUTOMATA (LBA)
A linear bounded automaton recognises the context-sensitive (level 1) languages.

The two parse trees below, for the sentence n + n * n under the ambiguous grammar with productions E → E + E, E → E * E, E → n, illustrate the ambiguity discussed earlier:

        E                        E
      / | \                    / | \
     E  *  E                  E  +  E
   / | \   |                  |   / | \
  E  +  E  n                  n  E  *  E
  |     |                        |     |
  n     n                        n     n
SCANNERS AND REGULAR LANGUAGE
Introduction to Lexical Analysis
The front end of a compiler reads and parses the source text. Most of its run time is spent on lexical analysis in the scanner, which reads characters from the input files and reduces them to manageable tokens (words or special symbols). The language defined by a scanner grammar is the set of all strings of characters in the external (text) alphabet that form tokens in the language to be compiled. For example, any string that is an identifier is a token and is therefore a sentence in the scanner's language.
A regular grammar is sufficient to fully define all strings in the language. Consequently, a FSA is adequate to implement the scanner. The notation for expressing a regular language is a regular expression.
REGULAR EXPRESSIONS
An integer constant such as 0 or 384 might be defined as at least one decimal digit, followed by 0 or more additional digits:
1. Integer constant → dd*
where 'd' represents a digit.
2. An identifier consists of a letter (a) followed by 0 or more letters or digits, i.e. a(a/d)*. This compact notation is called a regular expression.
Identifier → a(a/d)*
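The same two token definitions can be sketched with Java's java.util.regex; here the character classes [0-9] and [A-Za-z] stand in for the 'd' and 'a' of the compact notation (the class name is illustrative):

import java.util.regex.Pattern;

class TokenPatterns {
    // dd*  : one digit followed by zero or more digits
    static final Pattern INTEGER    = Pattern.compile("[0-9][0-9]*");
    // a(a/d)* : one letter followed by zero or more letters or digits
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    public static void main(String[] args) {
        System.out.println(INTEGER.matcher("384").matches());       // true
        System.out.println(IDENTIFIER.matcher("count1").matches()); // true
        System.out.println(IDENTIFIER.matcher("1count").matches()); // false: must start with a letter
    }
}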
The representation of regular expressions by means of a context-free grammar is given below:
RE = ({"/", "*", ")", "(", "σ"}, {RE, T (Term), P (Primary), F (Factor)}, P, RE)
σ represents any symbol in the alphabet of the language the regular expression defines.
P, the set of productions:
1. RE → RE "/" T   (alternation)
2. RE → T
3. T → T P         (concatenation)
4. T → P
5. P → F "*"       (iteration)
6. P → F
7. F → "(" RE ")"  (grouping)
8. F → σ           (any terminal)
The following rules transform a production whose right-hand side uses regular-expression operators into regular-grammar productions (B is a new non-terminal and ϵ is the empty string):
1  A → xy                           A → xB,  B → y
2  A → x*y                          A → xB/y,  B → xB/y
3  A → x/y                          A → x,  A → y
4  A → B,  B → x                    A → x,  B → x
5  A → ϵ,  B → xA                   B → xA,  B → x
6  S → ϵ (S is the start symbol)    G → S,  G → ϵ  (G is a new start symbol)
Ex. 1: Consider a regular expression we wish to transform into a regular grammar: a(a/d)*
Begin with S:
S → a(a/d)*
R1 (concatenation):
S → aA
A → (a/d)*
R2 (with x = (a/d), y = ϵ):
S → aA
A → (a/d)B    B → (a/d)B
A → ϵ         B → ϵ
R3 (apply the distributive law):
S → aA
A → aB    B → aB
A → dB    B → dB
A → ϵ     B → ϵ
The grammar is now right linear. The two empty productions can be eliminated by R5, leading to the regular grammar:
S → aA    S → a
A → aB    A → a
A → dB    A → d
B → aB    B → a
B → dB    B → d
CONVERTING GRAMMAR TO REGULAR EXPRESSIONS
Rule#   GRAMMAR PRODUCTION (GP)   REGULAR EXPRESSION (RE)
R1      A → xB,  B → y            A → xy
R2      A → xA/y                  A → x*y
R3      A → x,  A → y             A → x/y
Example 1
Convert the regular grammar into a regular expression:
S → aA    S → a
A → aA    A → a
A → dA    A → d
Using R3:
S → aA/a
A → aA/a
A → dA/d
A → aA/a/dA/d
Collect all the terms representing recursion in A together on the left:
S → aA/a
A → (aA/dA) / (a/d)
The distributive law allows us to factor out the recursive A, so that the production is now in the form required by rule R2:
A → (a/d)A / (a/d)
which, when R2 is applied, eliminates the recursion in favour of iteration:
A → (a/d)* (a/d)
Now we can substitute into S:
S → a(a/d)*(a/d) / a
Applying a → aϵ we have
S → a((a/d)*(a/d) / ϵ)
and applying x+ = xx*, i.e. (a/d)*(a/d) = (a/d)+:
S → a((a/d)+ / ϵ)
S → a(a/d)*
Regular Expressions
Keyword = BEGIN / END / IF / THEN / ELSE
Operator = < / <= / <> / > / >=
DESIGN OF LEXICAL ANALYSER
To design such a program, the usual approach is to describe the behaviour of the grammar with a flow chart. Remembering previous characters by means of the position in a flow chart is such a valuable tool that a specialised kind of flow chart for lexical analysers, called a transition diagram, has been developed.
It consists of states and edges. States are the boxes of the flow chart, while edges connect states using arrows. The transition diagram for an identifier, defined as a letter followed by any number of letters or digits, is shown below:
Transition diagram for an identifier:
state 0 --letter--> state 1;  state 1 loops on letter or digit;  state 1 --delimiter--> state 2
From state 1, if the next input is a delimiter for an identifier, which is assumed to be any character that is neither a letter nor a digit, then on reading the delimiter we enter state 2.
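The identifier diagram can be turned into code almost mechanically. A sketch in Java follows (the state numbering matches the diagram above; the class and method names are illustrative):

class IdentifierRecognizer {
    // Implements the 0 -> 1 -> 2 transition diagram for identifiers:
    // state 0 reads a letter, state 1 loops on letters or digits,
    // and any other character (the delimiter) moves us to the accepting state 2.
    // Returns the length of the identifier starting at 'pos', or -1 if there is none.
    static int identifier(String input, int pos) {
        int i = pos;
        if (i >= input.length() || !Character.isLetter(input.charAt(i))) {
            return -1;                                  // state 0: first character must be a letter
        }
        i++;                                            // move to state 1
        while (i < input.length()
               && Character.isLetterOrDigit(input.charAt(i))) {
            i++;                                        // state 1: loop on letter or digit
        }
        return i - pos;                                 // next character is the delimiter (state 2)
    }

    public static void main(String[] args) {
        System.out.println(identifier("count <= 100", 0));  // 5, i.e. "count"
    }
}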
(Transition diagrams for the keywords, states 7-22: each diagram spells the keyword letter by letter and, on a blank or newline delimiter, returns the keyword's token code: END returns (2, ), IF returns (3, ), THEN returns (4, ), ELSE returns (5, ). A state marked * indicates that the delimiter just read is not part of the token and must be given back to the input.)
Transition diagram for an identifier (states 23-25): start --letter--> loop on letter or digit; on reaching a character that is not a letter or digit, return (6, INSTALL()).
Transition diagram for a constant (states 26-28): start --digit--> loop on digit; on reaching a character that is not a digit, return (7, INSTALL()).
Relops
Transition diagram for the relational operators (states 29-37):
<    returns (8, 1)
<=   returns (8, 2)
=    returns (8, 3)
<>   returns (8, 4)
>    returns (8, 5)
>=   returns (8, 6)
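The relop diagram can be sketched in Java as straight-line code; the (8, n) codes below are the ones from the diagram, while the method shape and the treatment of retraction are simplifying assumptions:

class RelopRecognizer {
    // Returns the second component of the (8, n) token for a relational operator
    // starting at position 'pos', or -1 if no relational operator starts there.
    // Codes follow the transition diagram: < is 1, <= is 2, = is 3, <> is 4, > is 5, >= is 6.
    static int relop(String input, int pos) {
        char c    = pos < input.length()     ? input.charAt(pos)     : '\0';
        char next = pos + 1 < input.length() ? input.charAt(pos + 1) : '\0';
        switch (c) {
            case '<':
                if (next == '=') return 2;   // <=
                if (next == '>') return 4;   // <>
                return 1;                    // <   (the lookahead character is retracted)
            case '=':
                return 3;                    // =
            case '>':
                if (next == '=') return 6;   // >=
                return 5;                    // >   (the lookahead character is retracted)
            default:
                return -1;                   // not a relational operator
        }
    }
}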
Write a program in JAVA to tokenize the code below:
While(count <=100)
{
Count++;
}
2. Problems facing postgraduate supervision in
computer science.