
Compiler Design Dire Dawa University [DDIT]

Chapter Two – Lexical Analysis


1. Overview of Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from language
preprocessors as a stream of characters and breaks it into a series of tokens, removing whitespace and
comments along the way.
If the lexical analyzer finds an invalid token, it reports an error. The lexical analyzer works closely with
the syntax analyzer: it reads the character stream from the source code, checks for legal tokens, and passes
them to the syntax analyzer on demand.
The Role of a Lexical Analyzer
Main tasks of the LA are to:
 read a sequence of characters from the source program,
 group them into lexemes, and
 produce as output a sequence of tokens, one for each lexeme in the source program.
The scanner can also perform the following secondary tasks:
 strip out comments and whitespace (blanks, tabs, newlines, and perhaps other characters
that are used to separate tokens in the input) from the source program
 keep track of line numbers and associate error messages with positions in the source program

Prepared by: Andualem T. Page 1



Sometimes, lexical analyzers are divided into a cascade of two processes:


1. Scanning consists of the simple processes that do not require tokenization of the input, such as deletion
of comments and compaction of consecutive whitespace characters into one.
2. Lexical analysis proper is the more complex portion, which produces the sequence of tokens as
output.
Why separate lexical analysis from parsing?
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing.
Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis
often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments
and whitespace as syntactic units would be considerably more complex than one that can assume
comments and whitespace have already been removed by the lexical analyzer. If we are designing a new
language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for
reading input characters can speed up the compiler significantly.
Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical
analyzer.
Tokens, Patterns and Lexemes
Token: a sequence of characters from the source program having a collective meaning. It is a
classification for a common set of strings, represented as a pair consisting of a token name and an optional
attribute value. Token classes include:
 Identifiers: a string of letters and digits starting with a letter. Valid examples: A1, x, X, fgfgf2324,
k; invalid examples: 12sdfsd, 4, ?gfgf, 8?>
 Integer and real numbers: a non-empty sequence of digits, e.g., 00, 1.23, 002, 10, 34
 Keywords: reserved words such as if, else, then, begin, end, while, for, etc.
 Symbols or operators: +, -, *, /, =, <, >, -> …
 Other constants: ‘d’, “gghg”
Lexeme: a group of characters forming a token; that is, a sequence of
characters in the source program that is matched by the pattern for a token.
Pattern: a rule describing the set of lexemes that can represent a particular token in the source
program.

Tokens in programming languages


In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes, such as a single comparison token covering
all relational operators.
3. One token representing all identifiers.
4. Tokens for numbers, either individually or in classes, e.g., <NUM, 34>, <NUM, 9>.
5. Tokens representing literal constants; one token can represent all string literals enclosed in
double quotes, e.g., “this is”.
6. One token for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent
compiler phase’s additional information about the particular lexeme that matched. For example, both 0 and
1 match the pattern for the token num. But the code generator needs to know which number is recognized.
The lexical analyzer collects information about tokens into their associated attributes.
 Tokens influence parsing decisions;
 Attributes influence the translation of tokens after parse
 Practically, a token has one attribute:
 a pointer to the symbol table entry in which information about the token is kept.
 The symbol table entry contains various information about the token
 such as its lexeme, type, the line number in which it was first seen …
E.g., for E = M * C ** 2, the tokens and their attributes are written as:
<id, pointer to symbol-table entry for E>
<assign_op,>
<id, pointer to symbol-table entry for M>
<multi_op,>


<id, pointer to symbol-table entry for C>


<exp_op,>
<num,integer value 2>
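The token/attribute stream above can be produced by a small scanner. The sketch below is an illustration, not a production lexer: the token names follow the example (id, assign_op, multi_op, exp_op, num), and the semi token for ';' is an assumption added to make the input complete.

```python
import re

# Token patterns, tried in order; ** is listed before * so the longer
# operator wins when both could match.
TOKEN_SPEC = [
    ("num",       r"\d+"),
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op",    r"\*\*"),
    ("multi_op",  r"\*"),
    ("assign_op", r"="),
    ("semi",      r";"),
]

def tokenize(source):
    """Return a list of (token-name, lexeme) pairs for the input string."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():          # whitespace is a non-token
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                tokens.append((name, m.group()))
                pos += len(m.group())
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}")
    return tokens

print(tokenize("E=M*C**2;"))
```

Running this on the example yields the same token sequence listed above, with lexemes in place of symbol-table pointers.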
Examples of Non-Tokens
 comment: /* do not change */
 preprocessor directive: #include <stdio.h>
 preprocessor directive: #define NUM 5
 White space (blanks, tabs, Newlines )
Exercise: How many tokens are present? List all C++ tokens in the following code:

Lexical Errors and Recovery strategies


Lexical errors are errors detected by the lexer when it is unable to continue; an error is returned by the
lexer whenever it finds an unexpected pattern that does not match any of the defined tokens.
In what situations do errors occur?
Answer: when no prefix of the remaining input matches the pattern of any defined token.
Some errors are beyond the power of the lexical analyzer to recognize:
Example: fi (a == f(x)) …
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function
identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the
parser and let some other phase of the compiler (probably the parser in this case) handle an error due to the
transposition of the letters.
However, it may be able to recognize errors like:
Error: d = 2r; solution: d = 2*r. Such errors are recognized when no token pattern matches the
character sequence at that point.


Possible error recovery actions


1. Panic mode recovery: suppose a situation arises in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches any prefix of the remaining input. It deletes successive
characters from the remaining input until it can find a well-formed token (until the error is resolved).
2. Other possible error-recovery actions are:
 Delete one extraneous character from the remaining input.
 Insert a missing character into the remaining input.
 Replace an incorrect character with a correct character in the remaining input.
 Transpose two adjacent characters.
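Panic-mode recovery can be sketched as follows. This is an illustrative model, not a real compiler's recovery code: the pattern list is an assumption, and the scanner simply deletes one offending character whenever nothing matches, collecting the deletions as error reports.

```python
import re

def panic_mode_scan(source, patterns):
    """Scan `source`; on a lexical error, delete characters one at a time
    until some pattern matches again (panic-mode recovery)."""
    tokens, errors, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for name, pat in patterns:
            m = re.match(pat, source[pos:])
            if m:
                tokens.append((name, m.group()))
                pos += len(m.group())
                break
        else:
            errors.append(source[pos])   # delete the offending character
            pos += 1                     # and resume from the next one
    return tokens, errors

pats = [("id", r"[A-Za-z][A-Za-z0-9]*"), ("num", r"[0-9]+"), ("assign", r"=")]
print(panic_mode_scan("d = 2?r", pats))
```

Here the stray '?' is deleted and scanning resumes; the parser still receives a usable token stream.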
Input Buffering
The LA scans the characters of the source program one at a time to discover tokens. Because a
large amount of time can be consumed scanning characters, specialized buffering techniques have been
developed to reduce the amount of overhead required to process an input character.
This task is made difficult by the fact that we often have to look one or more characters beyond the next
lexeme before we can be sure we have the right lexeme. Thus, we shall introduce a two-buffer scheme that
handles large lookaheads safely.
Buffering techniques:
1. Buffer pairs: the technique we focus on here.
2. Sentinels: We then consider an improvement involving "sentinels" that saves time checking for the
ends of buffers.
Buffer Pairs: a special buffering technique developed to reduce the amount of
overhead required to process a single input character. The scheme involves two buffers that are
alternately reloaded, as suggested in the figure below.
A buffer is divided into two N-character halves. Each half has the same size N, where N is usually the
number of characters in one disk block, e.g., 1024 or 4096 bytes.
 Using one system read command we can read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character, represented by eof,
marks the end of the source file.


Example E=M*C**2

Two pointers to the input are maintained:


 Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to
determine.
 Pointer forward scans ahead until a pattern match is found;
Note:
 Once the next lexeme is determined, forward is set to the character at its right end.
 The string of characters between the two pointers is the current lexeme.
 After the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is
set to the character immediately after the lexeme just found.
Advancing forward pointer:
 Advancing forward pointer requires that we first test whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of
the newly loaded buffer.
 If the end of the second buffer is reached, we must reload the first buffer with input, and the pointer
wraps to the beginning of that buffer.
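The buffer-pair scheme can be sketched in code. The class below is an illustrative model, not a real compiler's I/O layer: N is set to 4 so the reloads are easy to trace, and the character '\0' stands in for the eof sentinel.

```python
import io

N = 4          # half-buffer size; 1024 or 4096 in a real compiler
EOF = "\0"     # sentinel marking the end of the source file

class TwoBufferReader:
    """Sketch of the buffer-pair scheme: two N-character halves are
    reloaded alternately, and the forward pointer wraps between them."""

    def __init__(self, stream):
        self.stream = stream
        self.buffers = ["", ""]
        self.half = 0
        self.buffers[0] = self._read_half()
        self.forward = 0

    def _read_half(self):
        data = self.stream.read(N)      # one system read of N characters
        # If fewer than N characters remain, append the eof marker.
        return data + (EOF if len(data) < N else "")

    def advance(self):
        """Return the next character, reloading the other half-buffer and
        wrapping the forward pointer when a half is exhausted."""
        ch = self.buffers[self.half][self.forward]
        if ch == EOF:
            return EOF
        self.forward += 1
        if self.forward == N:           # end of this half reached
            self.half = 1 - self.half   # switch to the other half
            self.buffers[self.half] = self._read_half()
            self.forward = 0
        return ch

reader = TwoBufferReader(io.StringIO("E=M*C**2;"))
chars = []
while (c := reader.advance()) != EOF:
    chars.append(c)
print("".join(chars))
```

With N = 4, reading E=M*C**2 triggers a reload after every four characters, yet the caller sees one continuous character stream.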
Lexical Analyzer Rules to Identify Lexemes
Most modern lexical-analyzer generators follow three rules:
1. Look for the longest token
The longest initial substring that can match any regular expression is taken as the next token. As the
lexical analyzer reads the source code, it scans the code character by character; when it encounters a whitespace,
operator symbol, or special symbol, it decides that a word is complete. The longest-match rule states that the lexeme
scanned should be determined based on the longest match among all the tokens available.
2. Rule priority: look for the first-listed pattern that matches the longest token. The lexical analyzer also
follows rule priority, where a reserved word, e.g., a keyword, of a language is given priority over user


input. That is, if the lexical analyzer finds a lexeme that matches an existing reserved word, it
should return the keyword token rather than an identifier token.
 When specifying keywords and identifiers, keyword patterns must be listed before the identifier pattern.
3. List frequently occurring patterns first.
 e.g., white space
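The three rules above can be demonstrated with a tiny matcher. This is a sketch under assumed patterns (id, num, and the operators <= and <), not a generated lexer; ties in length are broken by listing order, and keywords beat identifiers.

```python
import re

KEYWORDS = {"if", "then", "else", "while", "for"}

# Candidate patterns, in listing order (order breaks length ties).
PATTERNS = [
    ("id",  r"[A-Za-z][A-Za-z0-9]*"),
    ("num", r"[0-9]+"),
    ("le",  r"<="),
    ("lt",  r"<"),
]

def next_token(source):
    """Apply the longest-match rule, then rule priority: the first-listed
    pattern wins ties, and reserved words beat identifiers."""
    candidates = []
    for name, pat in PATTERNS:
        m = re.match(pat, source)
        if m:
            candidates.append((len(m.group()), name, m.group()))
    if not candidates:
        return None
    # max() keeps the first of equally long matches: first-listed wins.
    _, name, lexeme = max(candidates, key=lambda c: c[0])
    if name == "id" and lexeme in KEYWORDS:
        name = lexeme            # rule priority: keyword over identifier
    return (name, lexeme)

print(next_token("<=5"))    # longest match: '<=' beats '<'
print(next_token("ifx=1"))  # one identifier 'ifx', not 'if' plus 'x'
print(next_token("if x"))   # exactly 'if' -> the keyword token
```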
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns. While they cannot
express all possible patterns, they are very effective in specifying the types of patterns that we actually
need for tokens. To see the formal notation for regular expressions, let us first define the following terms: alphabet,
string, language and regular expression.
Strings and Languages
An alphabet is any finite set of symbols such as letters, digits, and punctuation.
 The set {0, 1} is the binary alphabet.
A string is a finite sequence of symbols drawn from an alphabet.
 If x and y are strings, then the concatenation of x and y, denoted xy, is also a string. For example,
if x = dog and y = house, then xy = doghouse.
 The empty string ε is the identity under concatenation; that is, for any string s, εs = sε = s.
 In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
 |s| represents the length of a string s; e.g., banana is a string of length 6.
 The empty string is the string of length zero.
A language is any countable set of strings over some fixed alphabet.
Definition: let ∑ be a set of characters; a language over ∑ is a set of strings of characters drawn from ∑.
Let the alphabet be {A, . . . , Z}; then L = {“A”, ”B”, ”C”, “BF”, …, ”ABZ”, …} is a language over this
alphabet.
Abstract languages like the empty set {} and the set containing only the empty string {ε} are also languages
under this definition.
Operations on strings
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
For example, ban, banana and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of
s. For example, nana, banana and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, banana, nan and ε
are substrings of banana.
4. The proper prefixes, suffixes and substrings of a string s are those prefixes, suffixes and substrings,
respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
positions of s. For example, baan is a subsequence of banana.
Operations on Languages
The most important operations on languages are:
 Union: L ∪ M = { s | s is in L or s is in M }
 Concatenation: LM = { st | s is in L and t is in M }
 Kleene closure: L* = the set of strings obtained by concatenating L zero or more times
 Positive closure: L+ = the set of strings obtained by concatenating L one or more times
Example:
 Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and
 Let D be the set of digits {0, 1, . . . , 9}.
L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits.
Other languages can be constructed from L and D using the operators illustrated above:
1. L ∪ D is the set of letters and digits: strictly speaking, the language of strings of length one, each of
which is either a letter or a digit.
2. LD is the set of all strings of length two, each consisting of one letter followed by one digit.
Ex: A1, a1, B0, etc.
3. L4 is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits, e.g., 1, 3, 1211, 78.
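These operations can be modeled directly with Python sets of strings. The sets below are tiny stand-ins for the full letter and digit alphabets, chosen so the results stay small; the bounded star function is an approximation of the (infinite) Kleene closure.

```python
# Tiny stand-ins for the letter and digit alphabets.
L = {"A", "B"}
D = {"0", "1"}

union = L | D                          # L U D: one letter or one digit
LD = {x + y for x in L for y in D}     # concatenation: letter then digit
L2 = {x + y for x in L for y in L}     # L^2; L^4 is built the same way

def star(lang, max_len):
    """Approximate the infinite Kleene star by bounding string length."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {s + x for s in frontier for x in lang
                    if len(s) + len(x) <= max_len} - result
        result |= frontier
    return result

print(sorted(LD))           # every letter-digit pair
print(sorted(star(D, 2)))   # '', the digits, and all two-digit strings
```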
Regular Expressions (RE)
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong
to the language at hand. It searches for the patterns defined by the language rules.


Regular expressions have the capability to express finite languages by defining a pattern for finite strings
of symbols. The grammar defined by regular expressions is known as regular grammar. The language
defined by regular grammar is known as regular language.
A regular expression is a set of rules/techniques for constructing sequences of symbols (strings)
from an alphabet. It is used to specify the patterns of tokens.
Each regular expression r denotes a language L(r). Here are the rules that define the regular expressions
over some alphabet Σ and the languages that those expressions denote:
A. Base definition:
1. ε is a regular expression denoting the language {ε}
2. For each a ∈ Σ, a is a regular expression denoting {a}
B. Inductive definition: if r and s are regular expressions denoting the languages L(r) and L(s) respectively,
then
1. r|s is a regular expression denoting L(r) ∪ L(s)
2. rs is a regular expression denoting L(r)L(s)
3. r* is a regular expression denoting (L(r))*
4. (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a Regular set or a Regular Language


Some Flex RE for Pattern Matching Primitives

Examples:
1. a | b = {a,b}
2. (a|b)a = {aa,ba}
3. (ab) | ε ={ab, ε}
4. ((a|b)a)* = {ε, aa,ba,aaaa,baba,....}
5. [ab] = a or b
6. [a-z] = a or b or c or … or z
7. [-+0-9] = all the digits and the two signs
8. [^a-zA-Z] = any character which is not a letter

Reverse direction: from a language description to a regular expression
1. Even binary numbers (0|1)*0
2. An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of all
strings over this alphabet that contains exactly one b.


(a|c)*b(a|c)*, with sample strings {b, abc, abaca, baaaac, ccbaca, cccccb}


Precedence and Associativity
Note: * (closure), concatenation (.), and | (alternation) are all left associative. * has the highest precedence,
concatenation (.) has the second highest, and | has the lowest precedence of all.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
 x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
 x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, … }; equivalent to x.x*
 x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}.
 Character classes [ ]: e.g., [abc] is shorthand for a|b|c
Exercises
Describe the languages denoted by the following regular expressions:
1. a(a|b)*a
2. ((ε|a)b*)*
3. (a|b)*a(a|b)(a|b)
4. a*ba*ba*ba*
5. (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic
symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
…
dn → rn where each di is a distinct name and each ri is a regular expression over the alphabet Σ ∪
{d1, d2, …, di-1}
Example: Regular expressions for tokens
a. Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this
set:
 letter → A | B | … | Z | a | b | … | z

 digit → 0 | 1 | … | 9
→ So, id → letter ( letter | digit )*
Exercise
a. letter → A | B | … | Z | a | b | … | z
b. digit → 0 | 1 | … | 9
Rewrite the above regular definitions using the shorthand notation for REs. Answer:
→ letter → [A-Za-z]
→ digit → [0-9]

b. Numbers: numbers can be sequences of digits (natural numbers), decimal numbers, or numbers with an
exponent (indicated by an e or E):
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat (“.” nat)? ((e|E) signedNat)?
c. relop → < | <= | = | <> | > | >=
d. delimiter → newline | blank | tab | comment
e. whitespace = (delimiter)+
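The regular definitions above carry over almost directly to Python's re syntax. The transcription below is a sketch: the character classes stand in for the letter/digit unions, and the classify helper and its ordering of token classes are assumptions for the demonstration.

```python
import re

# The regular definitions above, in Python's re syntax.
nat        = r"[0-9]+"
signed_nat = rf"[+-]?(?:{nat})"
number     = rf"{signed_nat}(?:\.{nat})?(?:[eE]{signed_nat})?"
identifier = r"[A-Za-z][A-Za-z0-9]*"
relop      = r"<=|<>|>=|<|=|>"      # longer alternatives listed first

def classify(lexeme):
    """Return the first token class whose pattern matches the whole lexeme."""
    for name, pattern in [("number", number), ("id", identifier),
                          ("relop", relop)]:
        if re.fullmatch(pattern, lexeme):
            return name
    return None

for s in ["42", "-3.14", "+2.5E-10", "x1", "<=", "2."]:
    print(s, "->", classify(s))
```

Note that "2." classifies as nothing: the optional fraction part requires digits after the dot, exactly as in the regular definition.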
Recognition of Tokens
Given the grammar of branching statements:

Where the terminals if, then, else, relop, id and num generate sets of strings given by the following
regular definitions:


The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" defined by ws:

Regular Definitions and Grammars


Regular grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Transition Diagram
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized
flowcharts, called "transition diagrams”.
Transition diagrams have a collection of:
 Nodes (circles), called states: initial, final, or intermediate (any state that is neither initial nor
final)
 Edges going from one state to another state
 Inputs: the values associated with the edges
• Actions: represented by arrows between states
• Start state: the beginning of a pattern (marked by an arrowhead)
• Final state(s): the end of a pattern (drawn as concentric circles)


Some important conventions about transition diagrams


1. Certain states are said to be accepting, or final; these states indicate that a lexeme has been found,
although the actual lexeme may not consist of all positions between the lexemeBegin and forward
pointers. We always indicate an accepting state by a double circle, and if there is an action to be taken,
typically returning a token and an attribute value to the parser, we shall attach that action to the
accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not
include the symbol that got us to the accepting state), then we shall additionally place a # near that
accepting state. In our example, it is never necessary to retract forward by more than one position, but if
it were, we could attach any number of #'s to the accepting state.
3. One state is designated as the start state, or initial state; it is indicated by an edge, labeled "start," entering
from nowhere. The transition diagram always begins in the start state before any input symbols have
been read.
Recognition of relational operators
Example: relop → < | <= | = | <> | > | >=
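A relop transition diagram like this is usually hand-coded as nested state tests. The sketch below is an illustration: the token names (LT, LE, NE, EQ, GT, GE) are assumptions, and the "retraction" in the one-character cases corresponds to the # convention described above.

```python
def relop_token(s):
    """Hand-coded transition diagram for relop (< | <= | <> | = | > | >=).
    Returns (token, lexeme) or None for input starting at s[0]."""
    if not s:
        return None
    if s[0] == "<":                   # state: saw '<'
        if s[1:2] == "=":
            return ("LE", "<=")       # accept <=
        if s[1:2] == ">":
            return ("NE", "<>")       # accept <>
        return ("LT", "<")            # accept <, retracting one position
    if s[0] == "=":
        return ("EQ", "=")            # accept =
    if s[0] == ">":                   # state: saw '>'
        if s[1:2] == "=":
            return ("GE", ">=")       # accept >=
        return ("GT", ">")            # accept >, retracting one position
    return None                       # no relop starts here

print(relop_token("<>"))   # NE
print(relop_token("<5"))   # LT: the accepting state retracts past '5'
print(relop_token(">="))   # GE
```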

Recognition of Identifier


Recognition of Reserved Words


Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these
strings are never ordinary identifiers and tells which token they represent. Alternatively, create separate
transition diagrams for each keyword; the transition diagram for the reserved word then is given below:

Recognition of numbers

RE with multiple accepting states


Two ways to implement a compound RE:
 Implement it as multiple regular expressions, each with its own start and accepting states. Start with
the longest one first; if it fails, change the start state to a shorter RE and re-scan.
 Implement it as a transition diagram with multiple accepting states. When the transition arrives at the
first accepting states, just remember them, but keep advancing until a failure occurs. Then
back up the input to the position of the last accepting state.
Finite Automaton (FA)
An FA is a recognizer for a language: it takes a string x and answers “yes” if x is a sentence of that
language, and “no” otherwise. A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
This means that we may use either a deterministic or a non-deterministic automaton as a lexical analyzer. Both
deterministic and non-deterministic finite automata recognize exactly the regular sets.


A deterministic automaton is one in which each move (transition from one state to another) is determined by
the current configuration.
→ If the internal state, input and contents of the storage are known, it is possible to predict the future
behavior of the automaton. Such an automaton is deterministic; otherwise it is non-deterministic.
a. Non-deterministic finite automata (NFA) have no restrictions on the labels of their edges;
ε, the empty string, is a possible label.
b. Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet,
exactly one edge with that symbol leaving that state.
 Deterministic: a faster recognizer, but it may take more space
 Non-deterministic: slower, but it may take less space
 Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer
for our tokens.
Note
 Regular expressions = specification of candidate Tokens
 Finite automata = implementation (Recognition of Tokens)
 Token  Pattern
 Pattern  Regular Expression
 Regular Expression  NFA
 NFA  DFA
 DFA’s or NFA’s for all tokens  Lexical Analyzer

1. Non-deterministic Finite Automata (NFA)


 A Non-deterministic Finite Automaton (NFA) is a machine M defined by a 5-tuple M = (Q, ∑, δ, q0, F),
where Q, ∑, δ, q0 and F are defined as follows:


 Q = finite set of internal states
 ∑ = finite set of symbols called the input alphabet
 δ: Q × (∑ ∪ {ε}) → 2^Q is the transition function
 q0 ∈ Q is the initial state
 F ⊆ Q is the set of final states
 An NFA accepts an input string x iff there is a path in the transition graph from the start state to
some accepting (final) state.
 ε-transitions are allowed in NFAs; in other words, we can move from one state to another
without consuming any symbol.
Note: the language defined by an NFA is the set of strings it accepts.
An example of an NFA is given below.
Transition Function
The transition function gives, for each state and for each symbol in ∑ ∪ {ε}, a set of next states. The
transition function can be implemented as a transition table.
The transition graph for an NFA recognizing the language of regular expression (a|b)*abb

Transition Table
The mapping T of an NFA can be represented in a transition table

T(0,a) = {0,1}
T(0,b) = {0}
T(1,b) = {2}
T(2,b) = {3}

The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example
NFA
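The transition table can be simulated directly by tracking the set of states the NFA could be in. The sketch below encodes the table above as a Python dict (missing entries stand for the empty set); start state 0 and accepting state 3 follow the example.

```python
# The transition table above, for the NFA recognizing (a|b)*abb.
NFA_TABLE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def nfa_accepts(s, start=0, accepting=frozenset({3})):
    """Simulate the NFA by tracking the whole set of reachable states."""
    states = {start}
    for ch in s:
        states = set().union(*(NFA_TABLE.get((q, ch), set())
                               for q in states))
    return bool(states & accepting)

print(nfa_accepts("aabb"))   # reaches state 3 -> accepted
print(nfa_accepts("abab"))   # never reaches state 3 -> rejected
```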


Acceptance of input strings by NFA


An NFA accepts input string x if and only if there is some path in the transition graph from the start state to
one of the accepting states. The string aabb is accepted by the NFA:

2. Deterministic Finite Automata


A Deterministic Finite Automaton (DFA) is a machine M defined by the quintuple M = (Q, ∑, δ, q0, F),
where:
 Q is a finite set of internal states
 ∑ is a finite set of symbols called the input alphabet
 δ: Q × ∑ → Q is a total function called the transition function
 q0 ∈ Q is the initial state
 F ⊆ Q is the set of final states
A DFA is a special form of NFA in which:
 no state has an ε-transition, and
 for each symbol a and state s, there is exactly one edge labeled a leaving s; i.e., the transition function
maps a state-symbol pair to a single state (not a set of states).
Example: A DFA that accepts (a|b)*abb
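Simulating a DFA is a simple table walk. The sketch below assumes a minimum-state DFA for (a|b)*abb with illustrative state names A-D (D accepting); note that every (state, symbol) pair has exactly one successor, so no set-tracking is needed.

```python
# A minimum-state DFA for (a|b)*abb; state names A-D are illustrative.
DTRAN = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "B", ("B", "b"): "C",
    ("C", "a"): "B", ("C", "b"): "D",
    ("D", "a"): "B", ("D", "b"): "A",
}

def dfa_accepts(s, start="A", accepting=frozenset({"D"})):
    state = start
    for ch in s:
        state = DTRAN[(state, ch)]   # deterministic: one next state
    return state in accepting

print(dfa_accepts("abb"))     # True
print(dfa_accepts("ababb"))   # True
print(dfa_accepts("abba"))    # False
```

Compare this with the NFA simulation: the DFA does one dictionary lookup per input character instead of manipulating state sets.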


Conversion of Regular Expression to DFA


Two algorithms:
1- Translate a regular expression into an NFA (Thompson’s construction)
2- Translate NFA into DFA (Subset construction)
1. Regular expression to NFA: known as Thompson’s construction
 This is one way to convert a regular expression into an NFA; it is a simple and systematic method.
 It guarantees that the resulting NFA has exactly one final state and one start state.
 Construction starts from the simplest parts (alphabet symbols).
 To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to
create its NFA.
Rules:
1. For ε and for each single symbol a of the alphabet, construct:

2. For a composition of regular expressions:


Case 1: Alternation: for the regular expression s|r, assume that NFAs equivalent to r and s have already
been constructed.

Case 2: Concatenation: regular expression sr

Case 3: Repetition r*


Example: RE= (a|b)*abb


Step 1: construct NFAs for a and b
Step 2: construct the NFA for a | b
Step 3: construct the NFA for (a|b)*
Step 4: concatenate the result with a, then b, then b


Exercise: Construct NFA for token identifier and regular expression


 letter(letter|digit)*
 (1 | 0)*1
2. Conversion of NFA to DFA
 Why?
 A DFA is difficult to construct directly from REs.
 An NFA is difficult to represent in a computer program and inefficient to simulate.
 Conversion algorithm: subset construction
 The idea is that each DFA state corresponds to a set of NFA states.
 After reading input a1 a2 … an, the DFA is in the state that represents the subset T of the
states of the NFA that are reachable from the start state along paths labeled a1 a2 … an.

Rules:
 The start state of D is ε-closure(s0), where s0 is the start state of N.
 Each new state of D is initially unmarked; it is marked once its transitions have been computed.
Example: NFA to DFA
 The start state A of the equivalent DFA is ε-closure(0):
 A = {0, 1, 2, 4, 7},
 since these are exactly the states reachable from state 0 via paths all of whose edges are labeled ε.
Note that a path can have zero edges, so state 0 is reachable from itself by an ε-labeled path.
 The input alphabet is {a, b}. Thus, our first step is to mark A and compute
Dtran[A, a] = ε-closure(move(A, a)) and
Dtran[A, b] = ε-closure(move(A, b)).
 Among the states 0, 1, 2, 4, and 7, only 2 and 7 have transitions on a, to 3 and 8 respectively. Thus,
move(A, a) = {3, 8}. Also, ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}; call this set B, and
let Dtran[A, a] = B.

 Compute Dtran[A, b]: among the states in A, only 4 has a transition on b, and it goes to 5.
Call the resulting set C.


 If we continue this process with the unmarked sets B and C, we eventually reach a point where all
the states of the DFA are marked.
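The walkthrough above can be sketched in code. The NFA below uses the 0-10 state numbering of the classic Thompson NFA for (a|b)*abb (state 10 accepting), which matches the sets computed above; eps_closure, move, and the subset construction follow the steps just described.

```python
# Thompson NFA for (a|b)*abb, states 0-10; state 10 is accepting.
EPS = "eps"
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9}, (9, "b"): {10},
}

def eps_closure(states):
    """All NFA states reachable from `states` via eps-edges alone."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in NFA.get((q, EPS), set()) - closure:
            closure.add(r)
            stack.append(r)
    return frozenset(closure)

def move(states, ch):
    """NFA states reachable from `states` on one edge labeled ch."""
    return set().union(*(NFA.get((q, ch), set()) for q in states))

def subset_construction(start, alphabet="ab"):
    """Each DFA state is the eps-closure of a set of NFA states."""
    start_state = eps_closure({start})
    dtran, seen, unmarked = {}, {start_state}, [start_state]
    while unmarked:
        T = unmarked.pop()                    # mark T
        for ch in alphabet:
            U = eps_closure(move(T, ch))
            dtran[(T, ch)] = U
            if U not in seen:                 # a new, unmarked DFA state
                seen.add(U)
                unmarked.append(U)
    return start_state, dtran, seen

A, dtran, dfa_states = subset_construction(0)
print(sorted(A))                      # the start state A
print(len(dfa_states), "DFA states")  # the sets A through E
```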

Example 2


How about ε-transitions?

Due to ε-transitions, we must compute ε-closure(s), the set of NFA states reachable from NFA
state s via ε-transitions alone, and ε-closure(T), where T is a set of NFA states.
Example:

Minimization of DFA
 If we implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as
possible, since each state requires entries in the table that describes the lexical analyzer.
 There is always a unique minimum state DFA for any regular language. Moreover, this minimum-state
DFA can be constructed from any DFA for the same language by grouping sets of equivalent states.


Minimized DFA

Summary for Regular Expression


. : matches any character except \n
* : matches 0 or more instances of the preceding regular expression
+ : matches 1 or more instances of the preceding regular expression
? : matches 0 or 1 instances of the preceding regular expression
| : matches the preceding or the following regular expression
[xyz] : matches one of the characters x, y, or z
[^xyz] : matches any character except x, y, and z
( ) : groups the enclosed regular expression into a new regular expression
“…” : matches everything within the “ ” literally
^x : x, but only at the beginning of a line


x$ : x, but only at the end of a line

{d} : matches the regular expression defined by the definition d.
Pattern matching examples
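Most of the Flex operators in the summary carry over to Python's re syntax, so they can be tried interactively. The pairs below are illustrative; the line anchors ^ and $ and Flex's "…" literal quoting behave differently under fullmatch and are omitted here.

```python
import re

# Flex-style patterns from the summary, in Python's re syntax;
# each (pattern, text) pair below matches in full.
examples = [
    (r"a.c",     "abc"),     # .  any character except \n
    (r"ab*",     "abbb"),    # *  zero or more
    (r"ab+",     "abb"),     # +  one or more
    (r"ab?",     "a"),       # ?  zero or one
    (r"cat|dog", "dog"),     # |  alternation
    (r"[xyz]",   "y"),       # character class
    (r"[^xyz]",  "q"),       # negated class
    (r"[-+0-9]", "+"),       # the digits and the two signs
    (r"(ab)*",   "ababab"),  # ( ) grouping
]
for pattern, text in examples:
    assert re.fullmatch(pattern, text), (pattern, text)
print("all", len(examples), "patterns matched")
```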

