@CD_ch2 compiler design
Example E=M*C**2
input. That is, if the lexical analyzer finds a lexeme that matches with any existing reserved word, it
should generate an error.
For keywords and identifiers, the keyword patterns must be written first, then the identifier pattern.
3. List frequently occurring patterns first.
white space
Specification of Tokens
Regular expressions are an important notation for specifying lexemes via patterns. While they cannot
express all possible patterns, they are very effective in specifying the types of patterns that we actually
need for tokens. To understand the formal notation for regular expressions, we first define the following
terms: alphabet, string, language, and regular expression.
Strings and Languages
An alphabet is any finite set of symbols such as letters, digits, and punctuation.
The set {0,1} is the binary alphabet.
A string is a finite sequence of symbols drawn from that alphabet.
If x and y are strings, then the concatenation of x and y is also a string, denoted xy. For example,
if x = dog and y = house, then xy = doghouse.
The empty string ε is the identity under concatenation; that is, for any string s, εs = sε = s.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
|s| represents the length of a string s. Ex: banana is a string of length 6.
The empty string is the string of length zero.
A language is any countable set of strings over some fixed alphabet.
Definition: let ∑ be a set of characters; a language over ∑ is a set of strings of characters drawn from ∑.
Let Alphabet = {A, . . . , Z}; then L = {"A", "B", "C", "BF", …, "ABZ", …} is a language over
Alphabet.
Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages
under this definition.
Operations on strings
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
For example: ban, banana and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of
s. For example: nana, banana and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example: banana, nan and ε
are substrings of banana.
4. The proper prefixes, suffixes and substrings of string s are those prefixes, suffixes and substrings,
respectively, of s that are neither ε nor equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
positions of s. For example: baan is a subsequence of banana.
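The five definitions above can be checked mechanically. The following is a minimal Python sketch; the function names (prefixes, suffixes, substrings, proper, is_subsequence) are chosen here for illustration and are not part of the notes. The empty string ε is represented by "".

```python
def prefixes(s):
    """Strings obtained by removing zero or more symbols from the end of s."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """Strings obtained by removing zero or more symbols from the beginning of s."""
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    """Strings obtained by deleting a prefix and a suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(parts, s):
    """Keep only the members that are neither the empty string nor s itself."""
    return parts - {"", s}

def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so symbols are matched in order.
    return all(ch in remaining for ch in t)

print(sorted(prefixes("banana"), key=len))
print(is_subsequence("baan", "banana"))
```

Representing each family as a set makes the "proper" variants a simple set difference.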
Operations on Languages
Example:
Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and
let D be the set of digits {0, 1, . . . , 9}.
L and D are, respectively, the alphabet of uppercase and lowercase letters and the alphabet of digits.
Other languages can be constructed from L and D using the operators union, concatenation, and closure:
1. L U D is the set of letters and digits: strictly speaking, the language with 62 strings of length one,
each of which is either one letter or one digit.
2. LD is the set of all strings of length two, each consisting of one letter followed by one digit.
Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including the empty string.
5. L (L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits. Ex: 1, 3, 1211, 78, etc.
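For finite languages these operators can be tried out directly with Python sets. The sketch below is illustrative only: the helper names (union, concat, power, closure_up_to) are assumptions made here, and because L* is infinite the closure is truncated to a chosen number of concatenations.

```python
import string

def union(L, M):
    """L U M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {x + y for x in L for y in M}

def power(L, n):
    """L^n: n-fold concatenation of L with itself; L^0 is {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, k):
    """Finite part of L*: the union of L^0, L^1, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

L = set(string.ascii_letters)   # the 52 letters
D = set(string.digits)          # the 10 digits

print(len(union(L, D)))         # number of one-symbol strings in L U D
print(len(concat(L, D)))        # number of letter-digit pairs in LD
print(sorted(power({"a", "b"}, 2)))
```

A small two-letter language is used for the power and closure examples because L^4 over all 52 letters already contains several million strings.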
Regular Expressions (RE)
The lexical analyzer needs to scan and identify only the valid strings (tokens/lexemes) that belong to
the language at hand. It searches for the patterns defined by the language rules.
Regular expressions can describe such languages, including infinite ones, by giving a finite pattern for
strings of symbols. The grammar defined by regular expressions is known as a regular grammar. The language
defined by a regular grammar is known as a regular language.
A regular expression is a set of rules for constructing sequences of symbols (strings)
from an alphabet. It is used to specify the patterns of tokens.
Each regular expression r denotes a language L(r). Here are the rules that define the regular expressions
over some alphabet Σ and the languages that those expressions denote:
A. Base definition:
1. ε is a regular expression denoting the language {ε}
2. a is a regular expression denoting {a}, for each symbol a in Σ
B. Inductive definition: If r and s are regular expressions denoting languages L(r) and L(s) respectively,
then
1. r|s is a regular expression denoting L(r) U L(s)
2. rs is a regular expression denoting L(r)L(s)
3. r* is a regular expression denoting (L(r))*
4. (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a Regular set or a Regular Language
Examples:
1. a | b = {a,b}
2. (a|b)a = {aa,ba}
3. (ab) | ε ={ab, ε}
4. ((a|b)a)* = {ε, aa,ba,aaaa,baba,....}
5. [ab] = a or b
6. [a-z] = a or b or c or … or z
7. [-+0-9] = all the digits and the two signs
8. [^a-zA-Z] = any character which is not a letter
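The examples above can be tried with Python's re module, whose syntax covers the same operators. This is a sketch rather than part of the notes: the helper matches is a name chosen here, and because Python has no literal for ε, example 3 is written as (ab)?, which denotes the same language {ab, ε}.

```python
import re

def matches(pattern, text):
    """True when the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, strings in the language, strings not in the language)
EXAMPLES = [
    (r"a|b",       ["a", "b"],          ["ab", ""]),
    (r"(a|b)a",    ["aa", "ba"],        ["ab", "a"]),
    (r"(ab)?",     ["ab", ""],          ["a", "abab"]),
    (r"((a|b)a)*", ["", "aa", "baba"],  ["ab", "a"]),
    (r"[-+0-9]",   ["-", "+", "7"],     ["a", "10"]),
    (r"[^a-zA-Z]", ["3", "_"],          ["q", "Q"]),
]

for pattern, members, non_members in EXAMPLES:
    assert all(matches(pattern, s) for s in members)
    assert not any(matches(pattern, s) for s in non_members)
```

re.fullmatch is used instead of re.search so that the whole string, not just a part of it, must fit the pattern.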
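Reverse (from a description of a language to a regular expression)
1. Even binary numbers: (0|1)*0
2. Let the alphabet consist of just three alphabetic characters: Σ = {a, b, c}. Consider the set of all
strings over this alphabet that contain exactly one b. One regular expression for it is (a|c)*b(a|c)*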
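A regular expression written from a description can be checked by brute force against the description itself. The sketch below does this for the two cases above. The pattern (a|c)*b(a|c)* used for the second case is one possible answer, and the helper names (all_strings, check_even_binary, check_exactly_one_b) are illustrative.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

EVEN_BINARY = re.compile(r"(0|1)*0")
EXACTLY_ONE_B = re.compile(r"(a|c)*b(a|c)*")

def check_even_binary(max_len=8):
    """The pattern should accept exactly the binary numerals whose value is even."""
    return all(
        (EVEN_BINARY.fullmatch(s) is not None) == (int(s, 2) % 2 == 0)
        for s in all_strings("01", max_len)
        if s  # the empty string is not a numeral
    )

def check_exactly_one_b(max_len=6):
    """The pattern should accept exactly the strings containing one b."""
    return all(
        (EXACTLY_ONE_B.fullmatch(s) is not None) == (s.count("b") == 1)
        for s in all_strings("abc", max_len)
    )

print(check_even_binary(), check_exactly_one_b())
```

An exhaustive check over short strings is not a proof, but it catches most mistakes in a hand-written pattern.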
digit → 0 | 1 | … | 9
→ So, id → letter (letter | digit)*
Exercise
a. letter → A | B | … | Z | a | b | … | z
b. digit → 0 | 1 | … | 9
Rewrite the above regular definitions using shorthand notation for REs. The answer is as follows:
→ letter → [A-Za-z]
→ digit → [0-9]
b. Numbers: A number can be a sequence of digits (a natural number), a decimal number, or a number with
an exponent (indicated by an e or E).
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat ("." nat)? (E signedNat)?
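The three definitions translate almost symbol for symbol into Python's re syntax. The following is a sketch under two assumptions: the constant and function names are chosen here, and the exponent marker accepts both e and E (as the prose says), although the definition of number writes only E.

```python
import re

NAT = r"[0-9]+"                       # nat = [0-9]+
SIGNED_NAT = rf"[+-]?{NAT}"           # signedNat = (+|-)? nat
# number = signedNat ("." nat)? (E signedNat)?, with e or E accepted
NUMBER = rf"{SIGNED_NAT}(\.{NAT})?([eE]{SIGNED_NAT})?"

def is_number(text):
    """True when the whole of text matches the definition of number."""
    return re.fullmatch(NUMBER, text) is not None

for sample in ["42", "-3.14", "+6.02E23", "1e-9", "1.", ".5"]:
    print(sample, is_number(sample))
```

Note that "1." and ".5" are rejected, because the definition requires at least one digit on both sides of the decimal point.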
c. relop = < | <= | = | <> | > | >=
d. delimiter = newline | blank | tab | comment
e. whitespace = (delimiter)+
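The definitions for identifiers, numbers, relational operators and white space can be combined into a small scanner. The sketch below is one way to do it with Python's re module, not the notes' own implementation. The names TOKEN_SPEC, MASTER and tokenize are assumptions, and comments are left out of the delimiter pattern because their syntax is not defined here. Since Python tries alternatives from left to right, the two-character operators are listed before the one-character ones.

```python
import re

# Each entry pairs a token name with its pattern.
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                                   # (delimiter)+
    ("number", r"[+-]?[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),                       # letter (letter | digit)*
    ("relop",  r"<=|<>|>=|<|=|>"),                             # longer operators first
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (token name, lexeme) pairs, with white space stripped out."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("count <= 10"))
```

Keywords are not handled in this sketch; a scanner that had them would list their patterns ahead of the id pattern.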
Recognition of Tokens
Given the grammar of a branching statement:
Where the terminals if, then, else, relop, id and num generate sets of strings given by the following
regular definitions:
The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" defined by ws:
Recognition of Identifier
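A recognizer for identifiers can be hand-coded as a two-state machine. The sketch below assumes the definition id → letter (letter | digit)*. The state numbers and the function name scan_identifier are illustrative and are not taken from a diagram in these notes.

```python
def is_letter(ch):
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

def is_digit(ch):
    return "0" <= ch <= "9"

def scan_identifier(text, start=0):
    """Return the index just past the longest identifier beginning at start,
    or None if no identifier begins there."""
    state = 0            # state 0: nothing read yet; state 1: inside an identifier
    pos = start
    while pos < len(text):
        ch = text[pos]
        if state == 0 and is_letter(ch):
            state = 1                      # first symbol must be a letter
        elif state == 1 and (is_letter(ch) or is_digit(ch)):
            pass                           # stay in state 1 on letter or digit
        else:
            break                          # any other symbol ends the lexeme
        pos += 1
    return pos if state == 1 else None     # state 1 is the accepting state

print(scan_identifier("x1 = 2"))
```

The symbol that ends the lexeme is not consumed, so the caller can continue scanning from the returned index.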
Recognition of numbers
A deterministic automaton is one in which each move (transition from one state to another) is uniquely
determined by the current configuration.
→ If the internal state, the input and the contents of the storage are known, it is possible to predict the
future behavior of the automaton. Such an automaton is said to be deterministic; otherwise it is
nondeterministic.
a. Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges.
ε, the empty string, is a possible label.
b. Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet,
exactly one edge with that symbol leaving that state.
Deterministic – faster recognizer, but it may take more space
Non-deterministic – slower, but it may take less space
Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer
for our tokens.
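Once a DFA is available, recognition is a simple table lookup per input symbol. The sketch below shows a table-driven recognizer for the pattern (a|b)*abb. The four states 0..3, the table DFA_TABLE and the function dfa_accepts are assumptions made for illustration.

```python
# DFA_TABLE[state][symbol] gives the single next state (exactly one edge per symbol).
DFA_TABLE = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START_STATE = 0
ACCEPTING_STATES = {3}

def dfa_accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = START_STATE
    for ch in text:
        if ch not in DFA_TABLE[state]:
            return False          # symbol outside the input alphabet {a, b}
        state = DFA_TABLE[state][ch]
    return state in ACCEPTING_STATES

for sample in ["abb", "ababb", "abba", ""]:
    print(repr(sample), dfa_accepts(sample))
```

Each input symbol costs one dictionary lookup, which is why a DFA is the faster recognizer; the price is one table row per state.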
Note
Regular expressions = specification of candidate Tokens
Finite automata = implementation (Recognition of Tokens)
Token → Pattern
Pattern → Regular Expression
Regular Expression → NFA
NFA → DFA
DFAs or NFAs for all tokens → Lexical Analyzer
Transition Table
The mapping T of an NFA can be represented in a transition table
T(0,a) = {0,1}
T(0,b) = {0}
T(1,b) = {2}
T(2,b) = {3}
The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example
NFA.
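The transition table above can be simulated by tracking the set of states the NFA could be in after each symbol. The following is a minimal sketch. It assumes that state 0 is the start state and state 3 is the only accepting state, which is consistent with the language (a|b)*abb but is not stated in the table itself. The function name nfa_accepts is illustrative.

```python
# T maps (state, symbol) to the set of possible next states.
T = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def nfa_accepts(text, start=0, accepting=frozenset({3})):
    """Simulate the NFA on text by following all possible moves at once."""
    current = {start}
    for ch in text:
        # Union of the targets of every state we might currently be in.
        current = set().union(*(T.get((s, ch), set()) for s in current))
    return bool(current & accepting)

for sample in ["abb", "aabb", "ab", "abba"]:
    print(sample, nfa_accepts(sample))
```

A missing table entry, such as T(1,a), is treated as the empty set: that branch of the computation simply dies.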
Case 3: Repetition r*
Rules:
The start state of D (the DFA being constructed) is assumed to be unmarked.
The start state of D is ε-closure(S0), where S0 is the start state of N (the given NFA).
Example NFA to DFA
The start state A of the equivalent DFA is ε-closure(0),
A = {0,1,2,4,7},
since these are exactly the states reachable from state 0 via a path all of whose edges have label ε.
Note that a path can have zero edges, so state 0 is reachable from itself by an ε-labeled path.
The input alphabet is {a, b}. Thus, our first step is to mark A and compute
Dtran[A, a] = ε-closure(move(A, a)) and
Dtran[A, b] = ε-closure(move(A, b)).
Among the states 0, 1, 2, 4, and 7, only 2 and 7 have transitions on a, to 3 and 8, respectively. Thus,
move(A, a) = {3,8}. Also, ε-closure({3,8}) = {1,2,3,4,6,7,8}, so we call this set B and
let Dtran[A, a] = B.
Next, compute Dtran[A, b]. Among the states in A, only 4 has a transition on b, and it goes to 5, so
Dtran[A, b] = ε-closure({5}). Call this set C.
If we continue this process with the unmarked sets B and C, we eventually reach a point where all
the states of the DFA are marked.
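The whole procedure can be sketched in Python. The NFA used below is an assumption: it is the usual 11-state ε-NFA for (a|b)*abb, chosen because it reproduces the sets A and B computed above. The names EPS, NFA, eps_closure, move and subset_construction are illustrative.

```python
from collections import deque

EPS = None  # label used for ε-edges in this sketch

# Assumed NFA for (a|b)*abb: state -> {label: set of next states}
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}},    5: {EPS: {6}},    6: {EPS: {1, 7}},
    7: {"a": {8}},    8: {"b": {9}},    9: {"b": {10}}, 10: {},
}

def eps_closure(states):
    """All states reachable from states using only ε-edges (zero or more)."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """States reachable from states by one edge labeled symbol."""
    return {t for s in states for t in NFA[s].get(symbol, set())}

def subset_construction(start, alphabet):
    """Return the DFA start state and its transition table Dtran."""
    start_set = eps_closure({start})
    dtran = {}
    seen = {start_set}
    unmarked = deque([start_set])
    while unmarked:
        current = unmarked.popleft()          # mark this DFA state
        for symbol in alphabet:
            target = eps_closure(move(current, symbol))
            dtran[current, symbol] = target
            if target not in seen:            # a new, still unmarked DFA state
                seen.add(target)
                unmarked.append(target)
    return start_set, dtran

A, dtran = subset_construction(0, "ab")
print(sorted(A), sorted(dtran[A, "a"]))
```

Each DFA state is a frozenset of NFA states, so it can be used directly as a dictionary key in Dtran.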
Example 2
Minimization of DFA
If we implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as
possible, since each state requires entries in the table that describes the lexical analyzer.
There is always a unique (up to renaming of states) minimum-state DFA for any regular language. Moreover,
this minimum-state DFA can be constructed from any DFA for the same language by grouping sets of
equivalent states.
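Grouping equivalent states can be done by partition refinement: start with accepting and non-accepting states as two groups, then keep splitting any group whose members disagree on which group a symbol leads to. The sketch below is illustrative. The five-state DFA for (a|b)*abb (states A to E, with E accepting) is an assumption, and minimize is a name chosen here.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the groups of equivalent states of a complete DFA."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        def group_of(state):
            for index, group in enumerate(partition):
                if state in group:
                    return index

        refined = []
        for group in partition:
            # States stay together only if every symbol sends them to the same group.
            buckets = {}
            for state in group:
                key = tuple(group_of(delta[state][c]) for c in alphabet)
                buckets.setdefault(key, set()).add(state)
            refined.extend(buckets.values())
        if len(refined) == len(partition):    # no group was split: done
            return refined
        partition = refined

# Assumed example DFA for (a|b)*abb
DELTA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
groups = minimize(DELTA.keys(), "ab", DELTA, {"E"})
print(sorted(sorted(g) for g in groups))
```

Each resulting group becomes one state of the minimized DFA; in this example A and C end up in the same group, so five states reduce to four.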
Minimized DFA