
Compiler Construction

Introduction
Roadmap
• Overview
• Front end
• Back end
• Multi-pass compilers

Compilers, Interpreters …
What is a compiler?
a program that translates an executable program in one language into an executable program in another language

What is an interpreter?
a program that reads an executable program and produces the results of running that program

Why do we care?
Compiler construction is a microcosm of computer science:
• artificial intelligence: greedy algorithms, learning algorithms
• algorithms: graph algorithms, union-find, dynamic programming
• theory: DFAs for scanning, parser generators, lattice theory for analysis
• systems: allocation and naming, locality, synchronization
• architecture: pipeline management, hierarchy management, instruction set use

Inside a compiler, all these things come together.

Isn’t it a solved problem?
• Machines are constantly changing
  – Changes in architecture → changes in compilers
  – new features pose new problems
  – changing costs lead to different concerns
  – old solutions need re-engineering

• Innovations in compilers should prompt changes in architecture
  – New languages and features

What qualities are important in a compiler?
1. Correct code
2. Output runs fast
3. Compiler runs fast
4. Compile time proportional to program size
5. Good diagnostics for syntax errors
6. Works well with the debugger
7. Good diagnostics for flow anomalies
8. Cross language calls
9. Consistent, predictable optimization
A bit of history
• 1952: First compiler (linker/loader) written by Grace Hopper for the A-0 programming language
• 1957: First complete compiler for FORTRAN by John Backus and team
• 1960: COBOL compilers for multiple architectures

A compiler was originally a program that “compiled”
subroutines [a link-loader]. When in 1954 the combination
“algebraic compiler” came into use, or rather into misuse,
the meaning of the term had already shifted into the present
one.
— Bauer and Eickel [1975]

Abstract view

• recognize legal (and illegal) programs
• generate correct code
• manage storage of all variables and code
• agree on format for object (or assembly) code

Big step up from assembler — higher level notations

Traditional two pass compiler

• intermediate representation (IR)
• front end maps legal code into IR
• back end maps IR onto target machine
• simplify retargeting
• allows multiple front ends
• multiple passes → better code

A fallacy!

Front end, IR and back end must encode the knowledge needed for all n×m combinations!

Roadmap
• Overview
• Front end
• Back end
• Multi-pass compilers

Front end

• recognize legal code
• report errors
• produce IR
• preliminary storage map
• shape code for the back end

Much of front end construction can be automated

Scanner

• map characters to tokens
• character string value for a token is a lexeme
• eliminate white space

x = x + y   becomes   <id,x> = <id,x> + <id,y>
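To make the character-to-token mapping concrete, here is a minimal hand-written scanner sketch in Java. The Token class, the token kinds and the driver are illustrative assumptions, not part of these slides:

import java.util.ArrayList;
import java.util.List;

// Maps characters to <kind,lexeme> tokens, eliminating white space.
class Token {
    final String kind, lexeme;
    Token(String kind, String lexeme) { this.kind = kind; this.lexeme = lexeme; }
    public String toString() { return "<" + kind + "," + lexeme + ">"; }
}

class TinyScanner {
    static List<Token> scan(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }  // eliminate white space
            if (Character.isLetter(c)) {                       // lexeme: letter(letter|digit)*
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(new Token("id", src.substring(start, i)));
            } else {                                           // anything else: one-character operator
                tokens.add(new Token("op", String.valueOf(c)));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x = x + y")); // [<id,x>, <op,=>, <id,x>, <op,+>, <id,y>]
    }
}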
Parser

• recognize context-free syntax
• guide context-sensitive analysis
• construct IR(s)
• produce meaningful error messages
• attempt error correction

Parser generators mechanize much of the work

Context-free grammars
Context-free syntax is specified with a grammar, usually in Backus-Naur form (BNF):

1. <goal> := <expr>
2. <expr> := <expr> <op> <term>
3.         | <term>
4. <term> := number
5.         | id
6. <op>   := +
7.         | -

A grammar G = (S, N, T, P)
• S is the start symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols
• P is a set of productions — P: N → (N ∪ T)*

Deriving valid sentences
Given a grammar, valid sentences can be derived by repeated substitution. To recognize a valid sentence in some CFG, we reverse this process and build up a parse.

Production   Result
             <goal>
1            <expr>
2            <expr> <op> <term>
5            <expr> <op> y
7            <expr> - y
2            <expr> <op> <term> - y
4            <expr> <op> 2 - y
6            <expr> + 2 - y
3            <term> + 2 - y
5            x + 2 - y

Parse trees

A parse can be represented by a tree, called a parse or syntax tree.

Obviously, this contains a lot of unnecessary information.

Abstract syntax trees
So, compilers often use an abstract syntax tree (AST). ASTs are often used as an IR.

Roadmap
• Overview
• Front end
• Back end
• Multi-pass compilers

Back end

• translate IR into target machine code
• choose instructions for each IR operation
• decide what to keep in registers at each point
• ensure conformance with system interfaces

Automation has been less successful here

Instruction selection

• produce compact, fast code
• use available addressing modes
• pattern matching problem
  — ad hoc techniques
  — tree pattern matching
  — string pattern matching
  — dynamic programming

Register allocation

• have value in a register when used
• limited resources
• changes instruction choices
• can move loads and stores
• optimal allocation is difficult

Modern allocators often use an analogy to graph coloring
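As a rough illustration of the graph-coloring analogy, the sketch below greedily colors a small, made-up interference graph (variables that are live at the same time interfere and must get different registers). Real allocators also handle spilling and coalescing, so treat this as a toy model:

import java.util.*;

class ColoringDemo {
    public static void main(String[] args) {
        // Hypothetical interference graph: a, b, c are pairwise live together; d is not.
        Map<String, List<String>> interferes = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of("a", "b"),
            "d", List.of());
        Map<String, Integer> reg = new LinkedHashMap<>();
        for (String v : List.of("a", "b", "c", "d")) {
            Set<Integer> used = new HashSet<>();
            for (String n : interferes.get(v))
                if (reg.containsKey(n)) used.add(reg.get(n));
            int r = 0;
            while (used.contains(r)) r++;  // lowest register not used by an interfering neighbour
            reg.put(v, r);
        }
        System.out.println(reg);  // {a=0, b=1, c=2, d=0}: d shares a register with a
    }
}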
Roadmap
• Overview
• Front end
• Back end
• Multi-pass compilers

Traditional three-pass compiler

• analyzes and changes IR
• goal is to reduce runtime
• must preserve values

Optimizer (middle end)
Modern optimizers are usually built as a set of passes

• constant propagation and folding
• code motion
• reduction of operator strength
• common sub-expression elimination
• redundant store elimination
• dead code elimination
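To give one of these passes a concrete shape, here is a minimal constant-folding sketch. The tiny expression classes are hypothetical stand-ins for a real IR:

abstract class CExp {}
class CNum extends CExp { int v; CNum(int v) { this.v = v; } }
class CAdd extends CExp { CExp l, r; CAdd(CExp l, CExp r) { this.l = l; this.r = r; } }

class Folder {
    // Fold bottom-up: an addition of two constants is replaced by their sum.
    static CExp fold(CExp e) {
        if (e instanceof CAdd) {
            CAdd a = (CAdd) e;
            CExp l = fold(a.l), r = fold(a.r);
            if (l instanceof CNum && r instanceof CNum)
                return new CNum(((CNum) l).v + ((CNum) r).v);
            return new CAdd(l, r);
        }
        return e;  // numbers (and, in a real IR, variables) are already folded
    }

    public static void main(String[] args) {
        CExp e = new CAdd(new CNum(5), new CNum(3));  // the tree for 5 + 3
        System.out.println(((CNum) fold(e)).v);       // prints 8
    }
}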
The MiniJava compiler

Compiler phases

Lex: Break the source file into individual words, or tokens.
Parse: Analyse the phrase structure of the program.
Parsing Actions: Build a piece of abstract syntax tree for each phrase.
Semantic Analysis: Determine what each phrase means, relate uses of variables to their definitions, check types of expressions, request translation of each phrase.
Frame Layout: Place variables, function parameters, etc., into activation records (stack frames) in a machine-dependent way.
Translate: Produce intermediate representation trees (IR trees), a notation that is not tied to any particular source language or target machine.
Canonicalize: Hoist side effects out of expressions, and clean up conditional branches, for convenience of later phases.
Instruction Selection: Group IR-tree nodes into clumps that correspond to actions of target-machine instructions.
Control Flow Analysis: Analyse the sequence of instructions into a control flow graph showing all possible flows of control the program might follow when it runs.
Data Flow Analysis: Gather information about the flow of data through the variables of the program; e.g., liveness analysis calculates the places where each variable holds a still-needed (live) value.
Register Allocation: Choose registers for variables and temporary values; variables not simultaneously live can share the same register.
Code Emission: Replace the temporary names in each machine instruction with registers.

A straight-line programming language
(no loops or conditionals):

Stm → Stm ; Stm            CompoundStm
Stm → id := Exp            AssignStm
Stm → print ( ExpList )    PrintStm
Exp → id                   IdExp
Exp → num                  NumExp
Exp → Exp Binop Exp        OpExp
Exp → ( Stm , Exp )        EseqExp
ExpList → Exp , ExpList    PairExpList
ExpList → Exp              LastExpList
Binop → +                  Plus
Binop → -                  Minus
Binop → *                  Times
Binop → /                  Div

a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

prints:
8 7
80
Tree representation
a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)
[figure: this program represented as a tree of statement and expression nodes]

Java classes for trees

abstract class Stm {}

class CompoundStm extends Stm {
    Stm stm1, stm2;
    CompoundStm(Stm s1, Stm s2) {stm1=s1; stm2=s2;}
}

class AssignStm extends Stm {
    String id; Exp exp;
    AssignStm(String i, Exp e) {id=i; exp=e;}
}

class PrintStm extends Stm {
    ExpList exps;
    PrintStm(ExpList e) {exps=e;}
}

abstract class Exp {}

class IdExp extends Exp {
    String id;
    IdExp(String i) {id=i;}
}

class NumExp extends Exp {
    int num;
    NumExp(int n) {num=n;}
}

class OpExp extends Exp {
    Exp left, right; int oper;
    final static int Plus=1, Minus=2, Times=3, Div=4;
    OpExp(Exp l, int o, Exp r) {left=l; oper=o; right=r;}
}

class EseqExp extends Exp {
    Stm stm; Exp exp;
    EseqExp(Stm s, Exp e) {stm=s; exp=e;}
}

abstract class ExpList {}

class PairExpList extends ExpList {
    Exp head; ExpList tail;
    public PairExpList(Exp h, ExpList t) {head=h; tail=t;}
}

class LastExpList extends ExpList {
    Exp head;
    public LastExpList(Exp h) {head=h;}
}
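Using these classes, the straight-line program from the earlier slide, a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b), is built as a single tree. The construction below follows the textbook example; the variable name prog is incidental:

Stm prog =
    new CompoundStm(
        new AssignStm("a",
            new OpExp(new NumExp(5), OpExp.Plus, new NumExp(3))),
        new CompoundStm(
            new AssignStm("b",
                new EseqExp(
                    new PrintStm(new PairExpList(new IdExp("a"),
                        new LastExpList(new OpExp(new IdExp("a"), OpExp.Minus, new NumExp(1))))),
                    new OpExp(new NumExp(10), OpExp.Times, new IdExp("a")))),
            new PrintStm(new LastExpList(new IdExp("b")))));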

What you should know!
• What is the difference between a compiler and an interpreter?
• What are important qualities of compilers?
• Why are compilers commonly split into multiple passes?
• What are the typical responsibilities of the different parts of a modern compiler?
• How are context-free grammars specified?
• What is “abstract” about an abstract syntax tree?

Can you answer these questions?
• Is Java compiled or interpreted? What about Smalltalk? Ruby? PHP? Are you sure?
• What are the key differences between modern compilers and compilers written in the 1970s?
• Why is it hard for compilers to generate good error messages?
• What is “context-free” about a context-free grammar?

THE STRUCTURE OF A COMPILER

The compilation process is a complex one; it is therefore customarily divided into a series of sub-processes called PHASES, as shown below:

Source Program → Lexical Analysis → Syntax Analysis → Intermediate Code Generation → Code Optimization → Code Generation → Target Program

Table Management and Error Handling interact with every phase.

• Structure of a compiler
PHASES OF A COMPILER

• A phase is a logically cohesive operation that takes as input one representation of the source program and produces as output another representation. The source program is in the source language, the language in which the program to be translated was originally written; the target program is in the language into which it is to be converted.
Lexical Analysis(Scanner)

• Lexical Analysis: this is the 1st phase of the compilation process. The lexical analyzer, or scanner, groups the characters of the source language into units that logically belong together. These units are called TOKENS. Typical token classes are keywords, identifiers, operators, symbols, punctuation and so on.
• E.g. DO, IF, <, =, +, ×, NUM
• The output of lexical analysis is a stream of tokens, which is passed to the next phase, i.e. syntax analysis.
• It reads the characters in the source program and groups them into streams of tokens; each token represents a logically cohesive sequence of characters such as identifiers, operators and keywords.
Lexical Analysis
• The character sequence that forms a token is called a “lexeme”. Certain tokens are augmented by a lexical value: when an identifier like XYZ is found, the lexical analyzer not only returns id, but also enters the lexeme XYZ into the symbol table if it is not already there.
• It returns a pointer to this symbol table entry as the lexical value associated with this occurrence of the token id.
• Therefore a statement like
      X := Y + Z
  is represented internally after lexical analysis as
      id1 := id2 + id3
Function of Lexical Analyzer
• It produces a stream of tokens, e.g. keywords, identifiers and literals (numbers and strings), as well as separators such as blanks
• It eliminates blanks and comments
• It generates the symbol table, which stores information about the identifiers and constants encountered in the input
• It keeps track of line numbers
• It reports errors encountered while generating tokens
Function of Lexical Analyzer
• Scans the program statements character by character (left to right)

• DEFINITIONS
• Tokens: describe the class or category of an input string; e.g. identifiers, keywords and constants are tokens
• Pattern: the set of rules that describes a token
• Lexemes: sequences of characters in the source program that are matched by the pattern of a token, e.g. int, i, num, Block1, etc.

• Example
• Tokenize the program statement below:
• 1. a += 5.0; becomes
• Answer: the tokens are
      a      +=      5.0      ;

• Example 2: if (a<b) becomes
      if      (      a      <      b      )

• if: a keyword
• (: open parenthesis
• a: identifier
• <: operator, etc.
Syntax Analysis
• The syntax analyzer imposes a hierarchical structure on the token string.
• This analyzer is called a parser. Syntax refers to the rules governing the structure of programs in the language.
• The parser recognizes the phrase structure of the source language and builds an abstract syntax tree (AST) to pass on. It also enforces type and declaration rules.
• Its major function: it performs syntactic checks on the source program, i.e. it ensures that the rules governing the formulation of valid statements are obeyed.
• e.g. id1 = id2 + id3 is represented as:

        =
       / \
    id1   +
         / \
      id2   id3

• AST for id1 = id2 + id3

Intermediate code generation
• Intermediate Code Generator: it uses the structure produced by the syntax analyzer to create a stream of simple instructions. There are many approaches and styles to intermediate code generation. One style uses instructions with one operator and a small number of operands. These instructions can be viewed as simple macros, e.g. ADD2. The primary difference between intermediate code and assembly code is that the intermediate code need not specify the registers to be used for each operation.
Intermediate code generation
• Some compilers generate an explicit intermediate code representation of the source program. The intermediate code can have a variety of forms, e.g. a three-address code (TAC) representation of the tree structure above is:
• T1 = id2 + id3
• id1 = T1
  where T1 is a compiler-generated temporary.
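A compiler emits such code by walking the syntax tree, allocating a fresh temporary for each interior node. A minimal sketch; the node classes and the temporary-naming scheme are illustrative assumptions:

abstract class TExp {}
class TName extends TExp { String n; TName(String n) { this.n = n; } }
class TBin extends TExp {
    TExp l, r; char op;
    TBin(TExp l, char op, TExp r) { this.l = l; this.op = op; this.r = r; }
}

class TacGen {
    private int temps = 0;
    // Emits three-address code for e and returns the name holding its value.
    String gen(TExp e) {
        if (e instanceof TName) return ((TName) e).n;
        TBin b = (TBin) e;
        String l = gen(b.l), r = gen(b.r), t = "T" + (++temps);
        System.out.println(t + " = " + l + " " + b.op + " " + r);
        return t;
    }

    public static void main(String[] args) {
        TacGen g = new TacGen();
        String v = g.gen(new TBin(new TName("id2"), '+', new TName("id3")));
        System.out.println("id1 = " + v);  // prints: T1 = id2 + id3, then id1 = T1
    }
}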
code optimization
• This phase is designed to improve the intermediate code so that the ultimate object program runs faster and takes less space. Its output is another intermediate code program that does the same job as the original, but perhaps in a way that saves time and space.
• There are 4 common forms of intermediate representation:
• 1. postfix notation
• 2. three-address code
• 3. directed acyclic graph (DAG)
• 4. syntax tree
Infix notation    Postfix notation
a + b             ab+
a - b             ab-
a * b             ab*
a div b           ab/
a + b + c         ab+c+
a * b + c         ab*c+
a * (b + c)       abc+*
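Postfix notation is attractive as an intermediate form because it can be evaluated with a single stack and no parentheses. A minimal sketch; single-digit operands are a simplifying assumption:

import java.util.ArrayDeque;
import java.util.Deque;

class Postfix {
    static int eval(String postfix) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (char c : postfix.toCharArray()) {
            if (Character.isDigit(c)) {
                stack.push(c - '0');                  // operand: push its value
            } else {
                int b = stack.pop(), a = stack.pop(); // operator: pop two operands
                switch (c) {
                    case '+': stack.push(a + b); break;
                    case '-': stack.push(a - b); break;
                    case '*': stack.push(a * b); break;
                    case '/': stack.push(a / b); break;
                }
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // a*(b+c) with a=2, b=3, c=4 in postfix is abc+*, i.e. "234+*"
        System.out.println(eval("234+*"));  // prints 14
    }
}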
code generation
• The final phase in the compilation process is the generation of target code. This process involves selecting a memory location for each variable used by the program.
• The code generator produces the object code by deciding on the memory locations for data, selecting the code to access each datum, and selecting the registers in which each computation is to be done.
• Then each intermediate instruction is translated into a sequence of machine instructions that performs the same task.
Table Management
• Table Management: another name for it is book-keeping. This function of the compiler keeps track of the names used by the program and records essential information about each, such as its type. The data structure used to record this information is called a SYMBOL TABLE.
Symbol Table
• The symbol table is the repository of the semantic information attached by the compiler to the individual identifiers in the program being compiled. An entry in the symbol table must contain at least some reference to the identifier's spelling.
• It usually also holds a value representing the memory location or access path of the variable or procedure named by the identifier, together with such flags as are necessary to distinguish variables, procedures and other identifiers.
Error Handler
• It is invoked when a flaw in the source program is detected. It must warn the programmer by issuing a diagnostic, and adjust the information being passed from one phase to another.
Compiler Phase Organization
• This is the logical organization of the compiler. It
reveals that certain phases of the compiler are
dependent on the source language and are
independent of the code requirements of the
target machine.
• All such phases when grouped together constitute
the front end of the compiler
• Whereas those phases that are dependent on the
target machine constitute the back end of the
compiler.
Front End
• Consists of the lexical analyzer, the syntax analyzer and type checking.
• It generates a tree representing the syntactic structure of the source text. This tree is held in main store and constitutes the interface to the second part, which handles code generation.
Back End
• Includes code generation and code optimization.
• These phases are totally dependent upon the target language and the desired optimization.
• Alternatively, the tree produced by the front end can be processed by an interpreter, which is implementable with little effort.
Front end/back end structure

Program (declarations, statements)
    → Front end
    → syntax tree (plus symbol table)
    → Back end
    → code
Advantages of front/back end model
• Keeping the same front end and attaching different back ends, one can produce compilers for the same source language on different machines.
• Keeping different front ends and the same back end, one can compile several different languages for the same machine.
Front end/back end structure

Language 1, Language 2, …, Language n  →  intermediate code  →  Machine 1, Machine 2, …, Machine n
Compiler Construction tools

• Writing a compiler is a tedious and time consuming task. There are specialized tools that help in implementing the various phases of a compiler. These tools are called compiler construction tools:
1.Scanner generator:
• These generators generate lexical analyzers. The specification given to them is in the form of regular expressions. UNIX has a utility for scanner generation called LEX. The specification given to LEX consists of regular expressions representing the various tokens.
2.Parser generator:
• These produce syntax analyzers. The specification given to these generators is in the form of a context-free grammar. A typical UNIX system has a tool called YACC, which is a parser generator.
3. Syntax-directed translation engine:
• In this tool the parse tree is scanned completely to generate intermediate code. The translation is done for each node of the tree.
4. Automatic code generator:
• These generators take intermediate code as input and convert each rule of the intermediate language into equivalent machine language. A template matching technique is used: the intermediate code statements are replaced by templates that represent the corresponding sequences of machine instructions.
5.Data flow engines:
• Data flow analysis is required to perform good code optimization. Data flow engines are therefore chiefly useful in code optimization.
GRAMMARS: THE CHOMSKY HIERARCHY

A grammar is defined mathematically as a 4-tuple, i.e. it consists of four distinct components: the alphabet, the non-terminals, the productions and a start symbol (∑, N, P, S). Three of these symbols represent sets; the fourth identifies a particular element of the second set.
Alphabets and Strings
The first set, ∑, is the alphabet or set of terminals. It is a finite set consisting of all the input characters or symbols that can be arranged to form sentences in the language. E.g. the English language has the 26 letters A-Z, but under this definition its alphabet also includes punctuation, symbols and the spaces between letters.
Many programming languages are encoded as strings of text characters in some well defined character set such as ASCII. A compiler is usually defined with two grammars; the alphabet of the parser grammar is the set of tokens generated by the scanner, not ASCII.
The terminals in the alphabet can be assembled into strings of any length according to the rules of the grammar. A string is a sequence of 0 or more terminals in any particular order. E.g. let ∑ = {a, b, c, d}; possible strings of terminals from ∑ include aaa, aabbccdd, d, cba, abab.
The empty string is denoted by ϵ. The set of all possible strings over ∑ is denoted by ∑*. ∑* is called the closure of the alphabet, and the * is called the Kleene star. Kleene star means zero or more of whatever it is appended to.
Non-terminals & Productions
The non-terminals are represented by N. This is a finite set of symbols not in the alphabet. They are not strings themselves, but symbols that can be thought of as representing, or standing for, sets of strings which are subsets of ∑*. One particular non-terminal, the start symbol, represents exactly the set of all strings in the language.
The set of terminals and non-terminals taken together is called the vocabulary of the grammar.
Production
Productions (P) form a set of rewriting rules, each written as two strings of symbols separated by an arrow.
Example 1:
Consider the language described by the grammar
G1 = ({a, b, c}, {A, B}, {A → aB, A → bB, A → cB, B → a, B → b, B → c}, A)

∑ = {a, b, c} … terminals
Non-terminals (N) = {A, B}
A = start symbol
Set of productions:
1) A → aB    4) B → a
2) A → bB    5) B → b
3) A → cB    6) B → c

We can apply the production rules, using the 1st and the last rule, to generate a string:
A → aB → ac
The sequence of steps from A to ac is called a derivation; the result is that we derive a string in the language. This may be written as
A ⇒* ac   (the string ac can be derived from A in zero or more steps)
Similarly, using rule 4: A → aB → aa.
G1 is an example of the kind of grammar used to define a scanner, although in this case the language is finite. Most scanner grammars define infinite languages.
Example 2: Take G2 = (∑, N, P, E)
∑ = {n, +, *, ), (}
N = {E, T, F}
P, a set of productions:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → (E)
6. F → n

G2 can be used to derive n+n*n from E as follows (the number under each step is the production applied):
E ⇒ E+T ⇒ E+T*F ⇒ T+T*F ⇒ F+T*F ⇒ F+F*F ⇒ F+F*n ⇒ F+n*n ⇒ n+n*n
    1       3        2        4        4        6        6       6
E ⇒* n+n*n
Each of the 9 sentential forms of this derivation illustrates the definitions below, and the last is a sentence in the language.
A sentence (δ) in the language is any string in ∑* that can be derived from the start symbol in one or more steps.
A sentential form (ω) is any string in (∑ ∪ N)* that satisfies the relation
S ⇒* ω ⇒* δ
where S is the start symbol, ω is a sentential form and δ is a sentence in ∑*.
NB: a sentential form is any string of terminals and non-terminals that can be derived from the start symbol on the way to a sentence in the language.
Parse tree for the derivation E ⇒* n+n*n:

        E
      / | \
     E  +  T
     |   / | \
     T  T  *  F
     |  |     |
     F  F     n
     |  |
     n  n
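G2 is left-recursive, so it cannot be transcribed directly into a recursive-descent parser (E() would call itself forever). After the standard left-recursion removal, E → T (+ T)*, T → F (* F)*, F → (E) | n, a recognizer is a few short methods. A sketch in Java, not part of the original notes:

class G2Parser {
    private final String input;
    private int pos = 0;
    G2Parser(String input) { this.input = input; }

    private char peek() { return pos < input.length() ? input.charAt(pos) : '$'; }
    private void expect(char c) {
        if (peek() != c) throw new RuntimeException("expected " + c + " at " + pos);
        pos++;
    }

    void E() { T(); while (peek() == '+') { pos++; T(); } }  // E -> T (+ T)*
    void T() { F(); while (peek() == '*') { pos++; F(); } }  // T -> F (* F)*
    void F() {                                               // F -> ( E ) | n
        if (peek() == '(') { pos++; E(); expect(')'); } else expect('n');
    }

    static boolean accepts(String s) {
        try { G2Parser p = new G2Parser(s); p.E(); return p.pos == s.length(); }
        catch (RuntimeException ex) { return false; }
    }

    public static void main(String[] args) {
        System.out.println(accepts("n+n*n"));  // true
        System.out.println(accepts("n+*n"));   // false
    }
}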
AMBIGUOUS GRAMMAR
If two different parse trees can be constructed for the same sentence, then the grammar is said to be ambiguous.
The example grammars described so far have exactly one non-terminal symbol on the left-hand side of each production. This form is characteristic of grammars used in compiler construction. However, there are other classes without this restriction, as given below.
Example 3.
G3 = ({a, b, c}, {A, B, C}, P, A)
P is the set of productions:
1) A → aABC    4) cC → cc
2) A → aBC     5) CB → BC
3) bC → bc     6) aB → ab
               7) bB → bb
To generate the string aabbcc the following derivation applies (production numbers under the arrows):
A ⇒ aABC ⇒ aaBCBC ⇒ aaBBCC ⇒ aabBCC ⇒ aabbCC ⇒ aabbcC ⇒ aabbcc
    1        2         5        6        7        3        4
The Chomsky Hierarchy
Chomsky defined four levels of language complexity. Subsequent research has identified four corresponding classes of automata (abstract machine types) which can recognise exactly those strings in the languages generated by their respective grammars.

Chomsky class   Grammar             Recogniser
3               Regular             Finite State Automaton (FSA)
2               Context-free        Push-Down Automaton (PDA)
1               Context-sensitive   Linear Bounded Automaton (LBA)
0               Unrestricted        Turing Machine (TM)
GRAMMARS AND THEIR MACHINES
An automaton consists of a control mechanism with a finite number of states and some form of tape which may be read and advanced, like a magnetic tape player, and possibly also written and moved in either direction. A finite alphabet defines the symbols that may appear on the tape, just as the alphabet does in the language definition.
The initial content of the tape is a string of characters from the alphabet; it is the input to the automaton. From the start state the automaton sequences through its states, based on what is on the tape and the rules in its control. Each step is called a transition. At any time the automaton may halt, and by doing so it is said to accept the input string. It may also block, i.e. reach a state for which it has no rule by which to proceed on the current tape input; if this happens the automaton is said to reject the input. Automata are distinguished primarily by the kind of tape they have.
TURING MACHINE (TM)
It recognizes the unrestricted grammars, i.e. level 0. A level 0 language is called a recursively enumerable language, because it is a set of strings that can be enumerated (listed) by a Turing machine. Turing machines are an important basis for much of the theory of computability and computational complexity, but as a practical tool for compiling programs in a production environment they are hopelessly inefficient. Level 0 grammars are therefore not useful for compiler design.
LINEAR BOUNDED AUTOMATA

The LBA recognizes the context-sensitive languages. A context-sensitive grammar has two restrictions:
1. The left-hand side of each production must have at least one non-terminal in it.
2. The right-hand side must not have fewer symbols than the left.
In particular there can be no empty production, e.g. N → ϵ.
LINEAR BOUNDED AUTOMATA

The one exception to the 2nd restriction: if the left-hand side consists of the start symbol only, and the start symbol does not occur on the right-hand side of any production, then the right-hand side may be empty. This exception exists to allow the empty string in the language. G3 is an example of a context-sensitive grammar.
PUSH DOWN AUTOMATA (PDA)
It recognizes the context-free languages. A PDA can only read its input tape, but it has a stack that can grow to arbitrary depth, where it can save information. A stack is a linear data structure that permits access only at one end. There are 2 operations on a stack:
i) push, and
ii) pop,
both of which operate on the top of the stack. Push adds one item to the top of the stack; pop removes the top element of the stack, exposing the next element for access by a subsequent pop.
A context-free grammar is more restrictive than a context-sensitive one in that it allows at most a single non-terminal (and no terminals) on the left side of each production.
The grammar G2 is an example of a context-free grammar. In the case of context-free grammars the requirement about empty productions can be relaxed: any CFG with empty productions can be transformed into an equivalent grammar having at most a single empty production, meeting the empty-production restriction of CSGs.
REMOVING EMPTY PRODUCTIONS FROM A CFG
For every empty production A → ϵ in the grammar, find and copy all productions with A on the right-hand side, deleting A from the copy. If the start symbol has an empty production, introduce a new start symbol (step III below).

Consider the grammar:
A → AaB    B → Ba
A → ϵ      B → ϵ

I. Apply B → ϵ, adding:
A → Aa     B → a

II. Apply A → ϵ, adding:
A → aB     A → a

III. Add a new start symbol G, giving a final grammar in correct form with no excess empty productions:
G → A      G → ϵ
A → AaB    B → Ba
A → Aa     B → a
A → aB     A → a
COMPARISON BETWEEN CFG AND CSG
1) All context-free productions are applied without regard to any context, i.e. to the symbols that may be near the non-terminal being rewritten.
2) A context-sensitive production may include any number of context symbols on the left-hand side, so productions can be written to rearrange the symbols in a sentential form.
3) Context-free productions expand non-terminals where they are; thus no production in a CFG can affect the symbols in any other part of the working string. This makes context-sensitive languages more powerful, and harder to recognize, than context-free ones. All modern programming languages are specified with CFGs.
FINITE STATE AUTOMATA (FSA)
This is the highest and most restrictive level. The FSA recognizes a regular language, or regular set. Regular grammars are the most restrictive of all: they allow only one symbol on the left-hand side of each production (a non-terminal) and only one or two symbols on the right-hand side. The first symbol on the right must always be a terminal; the second, if present, is always a non-terminal. Like CFGs and CSGs, regular grammars do not allow empty productions, except for a single production with the start symbol on the left, when the start symbol does not occur on the right of any production.
RIGHT LINEAR GRAMMAR
A right linear grammar is a grammar in which every production has exactly one non-terminal on the left, and zero or more terminals on the right followed by at most one non-terminal:
A → xB
LEFT LINEAR GRAMMAR
A grammar of the same form, but with the non-terminal at the left end of the right-hand side, is called a left linear grammar:
A → Bx
EMPTY STRING
A grammar that generates the empty string (ϵ) is shown below:
G4 = ({a}, {A}, {A → ϵ}, A)
EMPTY LANGUAGE
The grammar below generates no strings at all, not even the empty string:
G5 = ({a}, {A, B}, {A → B, B → aA}, A)
This is an example of an empty language: the grammar cannot generate any string consisting of zero or more terminals.
AMBIGUITY
An ambiguous grammar is one for which there exists a string in its language with two different parse trees. Consider grammar G7, which generates the string n+n*n:
1. E → E + E
2. E → E * E
3. E → (E)
4. E → n

Two parse trees for n+n*n:

        E                      E
      / | \                  / | \
     E  *  E                E  +  E
   / | \   |                |   / | \
  E  +  E  n                n  E  *  E
  |     |                      |     |
  n     n                      n     n
SCANNERS AND REGULAR LANGUAGES
Introduction to Lexical Analysis
The front end of a compiler reads and parses source text. Most of its run time is spent on lexical analysis in the scanner, which reads characters from the input files, reducing them to manageable tokens (words or special symbols). The language defined by a scanner grammar is the set of all strings of characters in the external (text) alphabet that form tokens in the language to be compiled. For example, any string that is an identifier is a token and is therefore a sentence in the scanner's language.
A regular grammar is sufficient to fully define all strings in this language; consequently, an FSA is adequate to implement the scanner. The notation for expressing a regular language is the regular expression.
REGULAR EXPRESSIONS
An integer constant such as 0 or 384 might be defined as at least one decimal digit, followed by 0 or more additional digits:
1. IntegerConstant → dd*
where 'd' represents a digit.
2. An identifier consists of a letter (a) followed by 0 or more of either letters or digits, i.e. a(a/d)*. This compact notation is called a regular expression:
Identifier → a(a/d)*

A representation of regular expressions by means of a context-free grammar is given below:
RE = ({“/”, “*”, “)”, “(”, “σ”}, {RE, Term, Primary, Factor}, P, RE)
σ represents any symbol in the alphabet of the language the regular expression defines.
P is the set of productions (T = Term, P = Primary, F = Factor, RE = regular expression):
1. RE → RE “/” T   (alternation)
2. RE → T
3. T → TP          (concatenation)
4. T → P
5. P → F “*”       (iteration)
6. P → F
7. F → “(” RE “)”  (grouping)
8. F → σ           (any terminal)

RE has four meta-symbols, “/”, “*”, “)”, “(”, in addition to the characters of the alphabet represented by σ:
a/b – a or b
ab – a followed by b
a* – any number of a's, or none at all
Two further operators may be added:
P → F “+”
P → F “?”
+ means one or more times
* means zero or more times
? means none or once
a+ means aa*
a? means a/ϵ
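In Java's java.util.regex syntax the alternation meta-symbol is | rather than /, but the two definitions above carry over directly; a small illustrative sketch:

import java.util.regex.Pattern;

class ReDemo {
    // dd* (an integer constant) and a(a/d)* (an identifier), with d = [0-9], a = [a-zA-Z].
    static final Pattern INT = Pattern.compile("[0-9][0-9]*");          // dd*, i.e. [0-9]+
    static final Pattern ID  = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"); // a(a|d)*

    public static void main(String[] args) {
        System.out.println(INT.matcher("384").matches());    // true
        System.out.println(ID.matcher("Block1").matches());  // true
        System.out.println(ID.matcher("1Block").matches());  // false: must start with a letter
    }
}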
ALGEBRA OF REGULAR EXPRESSIONS
1. r/s = s/r (commutativity of alternation)
2. r/(s/t) = (r/s)/t (associativity of alternation)
3. r/r = r (idempotency of alternation)
4. r(st) = (rs)t (associativity of concatenation)
5. r(s/t) = rs/rt (left distributivity)
6. (s/t)r = sr/tr (right distributivity)
7. rϵ = ϵr = r (identity for concatenation)
8. r*r* = r* (closure absorption)
9. r* = ϵ/r/rr/rrr/… (Kleene closure)
10. (r*)* = r*
11. rr* = r*r
12. (r*/s*)* = (r/s)*
13. (r*s*)* = (r/s)*
14. (r/s)* = (r*s)*r*
Formal Properties of Regular Expressions
A regular expression can be defined as a compact notation for specifying a set of strings over some alphabet. For example, if our alphabet is {0,1}, the notation (0/1)* represents all strings of 0's and 1's.
Definition
A regular expression is a formal expression that is:
a. a single character in the alphabet ∑,
b. the empty string ϵ,
c. the empty set {}, or
d. any combination of regular expressions r and s formed with alternation (r/s), concatenation (rs) or iteration (r*).
Transforming Grammar and Regular Expressions
A regular language can be defined either by a regular expression or by a regular grammar. The productions of a regular grammar differ from a regular expression in one critical respect: a grammar is a set of rules for rewriting the start symbol, whereas the regular expression describes the finished strings.
With a regular expression it is easier to see which strings are in the language than it is by examining the grammar. The first step in transforming an RE into a grammar is to make it into a rewrite rule. This is done by attaching a start symbol: for any regular expression ω, choose some non-terminal S, make it the start symbol of the new grammar, and write the one production
S → ω
The second step is to remove the meta-symbols from the regular expression. Let x and y be any regular expressions.
1. For a production of the form
A → xy
choose some new non-terminal B and rewrite:
A → xB
B → y
2. For a production of the form
A → x*y
write instead the four productions:
A → xB
A → y
B → xB
B → y
3. For a production of the form
A → x/y
write instead:
A → x
A → y
The rules for transforming regular expressions into grammar productions:
Rule   RE production                  Grammar productions
1      A → xy                         A → xB,  B → y
2      A → x*y                        A → xB/y,  B → xB/y
3      A → x/y                        A → x,  A → y
4      A → B,  B → x                  A → x,  B → x
5      A → ϵ,  B → xA                 B → xA,  B → x (the empty production is dropped)
6      S → ϵ  (S the start symbol)    G → S,  G → ϵ (G a new start symbol)
Example 1: consider a regular expression we wish to transform into a regular grammar: a(a/d)*
Begin with S:
S → a(a/d)*
R1 (concatenation):
S → aA
A → (a/d)*
R2 (iteration, with x = (a/d) and y = ϵ):
S → aA
A → (a/d)B    B → (a/d)B
A → ϵ         B → ϵ
R3 (alternation):
S → aA        B → aB
A → aB        B → dB
A → dB        B → ϵ
A → ϵ
The grammar is now right linear. The two empty productions can be eliminated by R5, leading to the regular grammar:
S → aA        S → a
A → aB        A → dB
A → a         A → d
B → aB        B → dB
B → a         B → d
CONVERTING GRAMMAR TO REGULAR EXPRESSIONS
Rule#   Grammar productions (GP)   Regular expression (RE)
R1      A → xB,  B → y             A → xy
R2      A → xA/y                   A → x*y
R3      A → x,  A → y              A → x/y

Example 1
Convert the regular grammar below to a regular expression:
S → aA    S → a
A → aA    A → a
A → dA    A → d
Using R3:
S → aA/a
A → aA/a/dA/d
Collect the terms that are recursive in A together on the left:
S → aA/a
A → (aA/dA)/(a/d)
The distributive law allows us to factor out the recursive A, so that the production is now in the form required by R2:
A → (a/d)A/(a/d)
Applying R2 eliminates the recursion in favour of iteration:
A → (a/d)*(a/d)
Now we can substitute into S (using a → aϵ for the second alternative):
S → a((a/d)*(a/d)/ϵ)
Applying x+ = xx*:
S → a((a/d)+/ϵ)
S → a(a/d)*
Regular expressions for tokens:
Keyword = BEGIN/END/IF/THEN/ELSE
Operator = </<=/=/<>/>/>=
DESIGN OF A LEXICAL ANALYSER
To design a program, the usual approach is to describe its behaviour with a flow chart. Remembering previous characters by means of the position in a flow chart is so valuable a tool that a specialized kind of flow chart for lexical analyzers, called a transition diagram, has been developed.
It consists of states and edges: states are the boxes of the flow chart, while edges connect states using arrows. The transition diagram for an identifier, defined as a letter followed by any number of letters or digits, is shown below:

Transition diagram for an identifier:

          letter            delimiter
    (0) ─────────→ (1) ───────────────→ (2)
                    ↺ letter or digit

From state 1, if the next input is a delimiter for an identifier, which is assumed to be any character that is neither a letter nor a digit, on reading the delimiter we enter state 2.

To turn this into a program, we construct a segment of code for each state. The first step in the code for any state is to obtain the next character from the input buffer. For this purpose we use GETCHAR, which returns the next character, advancing the lookahead pointer at each call.

STATE 0: C := GETCHAR();
         IF LETTER(C) THEN GOTO STATE 1
         ELSE FAIL()
STATE 1: C := GETCHAR();
         IF LETTER(C) OR DIGIT(C) THEN GOTO STATE 1
         ELSE IF DELIMITER(C) THEN GOTO STATE 2
         ELSE FAIL()
STATE 2: RETRACT();
         RETURN(id, INSTALL())

State 2 indicates that an identifier has been found (delimiters include ; , ' { } ( ) ).

Since delimiters are not part of the identifier, we must retract the lookahead pointer one character, for which we use the procedure RETRACT. We use a * to indicate states on which input retraction must take place.
We must install the newly found identifier in the symbol table, if it is not already there, using the procedure INSTALL.
In state 2 we return to the parser a pair consisting of the integer code for an identifier, which we denote by id, and a value that is a pointer to the symbol table entry returned by INSTALL.
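The three states translate almost line for line into Java. In this sketch the input buffer and lookahead pointer are simplified assumptions; getChar, retract and the caller's INSTALL correspond to the procedures above:

class IdentifierRecognizer {
    private final String buf;
    private int lookahead = 0;
    IdentifierRecognizer(String buf) { this.buf = buf; }

    private char getChar() {  // returns '\0' past the end, always advancing the pointer
        char c = lookahead < buf.length() ? buf.charAt(lookahead) : '\0';
        lookahead++;
        return c;
    }
    private void retract() { lookahead--; }  // the delimiter is not part of the identifier

    // Returns the identifier's lexeme, or null if the input does not start with a letter.
    String recognize() {
        char c = getChar();                        // STATE 0
        if (!Character.isLetter(c)) return null;   // FAIL()
        while (true) {                             // STATE 1
            c = getChar();
            if (Character.isLetterOrDigit(c)) continue;
            retract();                             // STATE 2 (a * state)
            return buf.substring(0, lookahead);    // the caller would INSTALL() this lexeme
        }
    }

    public static void main(String[] args) {
        System.out.println(new IdentifierRecognizer("count1 ").recognize());  // count1
    }
}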
Let us consider a subset of a language, shown below. The table shows the list of tokens together with the code for each token type and the value returned, if any.

Token        Code   Value
Begin        1      -
End          2      -
If           3      -
Then         4      -
Else         5      -
Identifier   6      pointer to symbol table
Constant     7      pointer to symbol table
<            8      1
<=           8      2
=            8      3
<>           8      4
>            8      5
>=           8      6
Tokens Recognised
The figure below showed transition diagrams to recognise keywords, identifiers, constants and relops; in summary:

Keywords: chains of states spell out B-E-G-I-N (return (1,-)), E-N-D (return (2,-)), E-L-S-E (return (5,-)), I-F (return (3,-)) and T-H-E-N (return (4,-)); each path ends on a blank or newline, with retraction (*).

Identifier: start, then a letter, then a loop on letter or digit; exit on any other character with retraction, and return (6, INSTALL()).

Constant: start, then a digit, then a loop on digit; exit on a non-digit with retraction, and return (7, INSTALL()).

Relops: < returns (8,1), after retraction, if not followed by = or >; <= returns (8,2); <> returns (8,4); = returns (8,3); > returns (8,5), after retraction, if not followed by =; >= returns (8,6).
1. Write a program in JAVA to tokenize the code below:
   While(count <=100)
   {
   Count++;
   }
2. Problems facing postgraduate supervision in computer science.
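For exercise 1, a minimal sketch using java.util.regex; the four token categories and the regex-per-category approach are my own choices for illustration, not a prescribed solution:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // Longer operators (<=, ++) must be listed before their single-character prefixes.
    static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<kw>while)" +
        "|(?<id>[A-Za-z][A-Za-z0-9]*)" +
        "|(?<num>[0-9]+)" +
        "|(?<op><=|>=|==|\\+\\+|[-+*/<>=(){};]))",
        Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String src = "While(count <=100)\n{\nCount++;\n}";
        Matcher m = TOKEN.matcher(src);
        while (m.lookingAt()) {
            if (m.group("kw") != null)       System.out.println("keyword:    " + m.group("kw"));
            else if (m.group("id") != null)  System.out.println("identifier: " + m.group("id"));
            else if (m.group("num") != null) System.out.println("constant:   " + m.group("num"));
            else                             System.out.println("operator:   " + m.group("op"));
            m.region(m.end(), src.length());  // continue scanning after this token
        }
    }
}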
