Chapter 7 Lexical Analysis
Outline
• Specifying lexers
– Regular expressions
– Examples of regular expressions
Recall: The Structure of a Compiler
[Figure: Source → Lexical analysis → Tokens → Parsing. Today we start with lexical analysis.]
The Role of the Lexical Analyzer
• The lexical analyzer is the first phase of a compiler.
• Its main task is to read the input characters and
produce as output a sequence of tokens that the
parser uses for syntax analysis.
Lexical Analysis
Example of Tokens
Examples of Tokens
const pi = 3.1416;
The substring pi is a lexeme for the token “identifier.”
Lexeme and Token
Index = 2 * count + 17;
Lexeme    Token
Index     identifier
=         equal_sign
2         int_literal
*         multi_op
count     identifier
+         plus_op
17        int_literal
;         semicolon
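The lexeme/token pairs above can be reproduced with a small hand-written scanner. The following is a minimal Python sketch, not the chapter's implementation: the function name `tokenize` and the `SINGLE_CHAR_TOKENS` table are illustrative assumptions, while the token names are taken from the table.

```python
# Minimal sketch: split a statement into (lexeme, token) pairs.
# Token names follow the slide; tokenize() is a hypothetical helper.
SINGLE_CHAR_TOKENS = {
    "=": "equal_sign",
    "*": "multi_op",
    "+": "plus_op",
    ";": "semicolon",
}

def tokenize(text):
    pairs = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                      # white space only separates lexemes
            i += 1
        elif ch.isalpha():                    # identifier: letter (letter | digit)*
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            pairs.append((text[i:j], "identifier"))
            i = j
        elif ch.isdigit():                    # integer literal: digit+
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            pairs.append((text[i:j], "int_literal"))
            i = j
        elif ch in SINGLE_CHAR_TOKENS:        # one-character operators and punctuation
            pairs.append((ch, SINGLE_CHAR_TOKENS[ch]))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return pairs

for lexeme, token in tokenize("Index = 2 * count +17;"):
    print(f"{lexeme:<6} {token}")
```

Note that the scanner does not need a blank between `+` and `17`: the change of character class is enough to end one lexeme and start the next.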
Attributes of Tokens
Example
• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Specification & Recognition
RE (specification) → NFA → DFA (recognition)
Specification of Tokens
• Concatenation: xy
– x = “lexical”, y = “analyzer” => xy = “lexicalanalyzer”
– sε = εs = s (ε, the empty string, is the identity for concatenation)
• Exponentiation: s^i = s^(i-1) s, with s^0 = ε
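As a quick illustration of these string operations, here is a small Python sketch. The names `EPSILON` and `power` are assumptions made for the example; Python's empty string stands in for ε.

```python
# Sketch of string concatenation and exponentiation using Python strings.
EPSILON = ""                      # the empty string, written ε in the slides

def power(s, i):
    """Return s^i, defined by s^0 = ε and s^i = s^(i-1) s."""
    return EPSILON if i == 0 else power(s, i - 1) + s

x, y = "lexical", "analyzer"
print(x + y)                              # concatenation xy
print(x + EPSILON == x == EPSILON + x)    # ε is the identity for concatenation
print(power("ab", 3))                     # "ab" concatenated with itself three times
```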
Operations on Languages
• Union
– {s | s is in L or s is in M}
• Concatenation
– {st | s is in L and t is in M}
• Kleene closure: zero or more concatenations
– L*: union of L^i (i = 0 … infinity)
– L^0 = {ε}, L^i = L^(i-1) L
• Positive closure: one or more concatenations
– L+: union of L^i (i = 1 … infinity)
Examples
• Union
– The union of two languages L1 and L2 is the set of strings that are in L1, in L2, or in both.
– It is denoted L1 + L2 or L1 ∪ L2.
e.g. L1 = {01, 11} and L2 = {111, 10}
Then L1 + L2 = {01, 11, 111, 10}
• Concatenation
– The concatenation of two languages L1 and L2 is the set of strings formed by
concatenating any string in L1 with any string in L2. It is denoted L1.L2 or L1 L2.
e.g. L1 = {01, 11} and L2 = {111, 10}
Then L1.L2 = {01111, 0110, 11111, 1110}
• Kleene closure
– The Kleene closure of a language L is the set of all strings formed by concatenating
zero or more strings taken from L (the same string may be used more than once).
It is denoted L*.
e.g. L = {0, 1} Then L* = {ε, 0, 1, 10, 01, 111, 000, 010, …}
• Positive closure
– The positive closure of a language L is the set of all strings formed by concatenating
one or more strings taken from L (the same string may be used more than once).
It is denoted L+.
e.g. L = {0, 1} Then L+ = {0, 1, 10, 01, 111, 000, 010, …}
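The examples above can be checked mechanically by modelling a language as a Python set of strings. This is only a sketch: the function names are illustrative, and because L* and L+ are infinite, `closure` computes a finite approximation up to a chosen power (the `max_power` parameter is an assumption of the example).

```python
# Sketch: languages as Python sets of strings.
def union(L, M):
    return L | M

def concat(L, M):
    return {s + t for s in L for t in M}

def power(L, i):
    """L^0 = {ε}, L^i = L^(i-1) L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure(L, max_power, positive=False):
    """Finite approximation of L* (or L+): the union of L^i for i up to max_power."""
    start = 1 if positive else 0
    result = set()
    for i in range(start, max_power + 1):
        result |= power(L, i)
    return result

L1, L2 = {"01", "11"}, {"111", "10"}
print(sorted(union(L1, L2)))
print(sorted(concat(L1, L2)))
print(sorted(closure({"0", "1"}, 2), key=lambda s: (len(s), s)))
```

The only difference between the two closures is whether the power 0 term, which contributes ε, is included.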
Regular Definitions
Examples of Regular Definitions
Recognition of Tokens
Finite Automata
• Graphical Representation:
– State transition diagram
• Implementation:
– State transition table
• Deterministic (DFA)
– At most one transition from each state on each input symbol
• Non-deterministic (NFA)
– More than one transition possible from at least one state
on some input symbol
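The distinction can be stated as a check on the transition relation. The sketch below is an illustration under assumptions: it stores transitions as a Python dict from (state, symbol) pairs to sets of next states, the two small automata are hypothetical, and ε-moves (which also make an automaton non-deterministic) are not modelled.

```python
# Sketch: a transition relation maps (state, symbol) to a set of next states.
def is_deterministic(delta):
    """True if no (state, symbol) pair has more than one next state."""
    return all(len(targets) <= 1 for targets in delta.values())

# Hypothetical two-state automata, for illustration only.
dfa_like = {("p", "a"): {"q"}, ("q", "a"): {"p"}}
nfa_like = {("p", "a"): {"p", "q"}, ("q", "a"): {"p"}}

print(is_deterministic(dfa_like))
print(is_deterministic(nfa_like))
```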
Finite Automata
• Transition
s1 →a s2
• Is read: in state s1, on input “a”, go to state s2
Finite Automata State Graphs
• A state: drawn as a circle
• An accepting state: drawn as a double circle
• A transition: drawn as an arrow labeled with an input symbol, e.g. a
NFA: Nondeterministic Finite Automata
• An NFA consists of
❑S: A finite set of states
❑Σ: A finite set of input symbols
❑d (δ): A transition function that maps (state,
symbol) pairs to sets of states
❑s0: A state distinguished as the start state
❑F: A set of states distinguished as final states
Transition Diagram (NFA)
(a | b)*abb
[Diagram: start → 0, 0 →a 1, 1 →b 2, 2 →b 3, with loops on state 0 for a and b; state 3 is final.]
States: {0 (start), 1, 2, 3 (final)}
Input symbols: {a, b}
NFA Transition function:
d(0,a) = {0,1}, d(0,b) = {0}
d(1,b) = {2}, d(2,b) = {3}
Acceptance of NFA
Example:
bbababb is accepted by (a|b)*abb
bbabab is NOT
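An NFA accepts a string if at least one sequence of choices leads from the start state to a final state. One common way to check this is to track the set of all states the NFA could be in after each character. The Python sketch below uses the transition function d given for (a|b)*abb; the names `NFA_DELTA` and `nfa_accepts` are assumptions of the example.

```python
# Sketch: simulate the NFA for (a|b)*abb by tracking the set of possible states.
NFA_DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, FINAL = 0, {3}

def nfa_accepts(text):
    current = {START}
    for ch in text:
        # Follow every transition on ch from every state we might be in.
        current = {t for s in current for t in NFA_DELTA.get((s, ch), set())}
    return bool(current & FINAL)      # accept if any reachable state is final

print(nfa_accepts("bbababb"))
print(nfa_accepts("bbabab"))
```

For bbababb the state sets are {0}, {0}, {0}, {0,1}, {0,2}, {0,1}, {0,2}, {0,3}; the last set contains the final state 3. For bbabab the run stops at {0,2}, which contains no final state.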
NFA with ε-transition
aa* | bb*
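[Diagram: start → 0; 0 →ε 1, 1 →a 2, with a loop on state 2 for a; 0 →ε 3, 3 →b 4, with a loop on state 4 for b; states 2 and 4 are final.]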
NFA Transition function:
d(0, ε) = {1, 3}, d(1, a) = {2}, d(2, a) = {2}
d(3, b) = {4}, d(4, b) = {4}
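Simulating an NFA with ε-transitions needs one extra step: after every move (and before the first one), add all states reachable through ε-moves alone. The Python sketch below does this for aa* | bb*. It rests on stated assumptions: `None` is used as the marker for ε, the helper names are illustrative, and the final states {2, 4} are inferred from the regular expression.

```python
# Sketch: NFA with ε-transitions for aa* | bb*.
EPS = None                      # marker for an ε-move (an assumption of this sketch)
DELTA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START, FINAL = 0, {2, 4}        # final states inferred from aa* | bb*

def eps_closure(states):
    """All states reachable from `states` using ε-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in DELTA.get((s, EPS), set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(text):
    current = eps_closure({START})
    for ch in text:
        moved = {t for s in current for t in DELTA.get((s, ch), set())}
        current = eps_closure(moved)
    return bool(current & FINAL)

print(accepts("aaa"), accepts("b"), accepts("ab"))
```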
Deterministic Finite Automata
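[Diagram: DFA with states 0 (start), 1, 2, 3 (final). Transitions: 0 →a 1, 0 →b 0, 1 →a 1, 1 →b 2, 2 →a 1, 2 →b 3, 3 →a 1, 3 →b 0.]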
Example of FA and Transition Diagrams
r = (abc)+
[Diagram: states joined by transitions labeled a, b and c; the legend marks a state, a transition, and a final state.]
FA and Transition Tables
          inputs
states    a     b     c
q0        q1    -     -
q1        -     q2    -
q2        -     -     q3
q3        q1    -     -
/* Recognizes (abc)+ at the start of the input. */
int c, state = 0;
while ((c = next_char()) != EOF) {
    switch (state) {
    case 0: if (c == 'a') state = 1;
            else return FALSE;            /* no transition: reject */
            break;
    case 1: if (c == 'b') state = 2;
            else return FALSE;
            break;
    case 2: if (c == 'c') state = 3;
            else return FALSE;
            break;
    case 3: if (c == 'a') state = 1;
            else { ungetchar(); return TRUE; }   /* push back the extra character */
            break;
    default:
            error();
    }
}
return state == 3;                        /* accept only if input ended in the final state */
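The same automaton can also be driven directly from its transition table instead of a hard-coded switch. The following Python sketch is one possible rendering, with assumed names (`TABLE`, `matches`); unlike the switch version, it checks whether the whole string is in (abc)+ and does not push back a trailing character. Treating q3 as the only final state is taken from the `state == 3` test in the code above.

```python
# Sketch: table-driven recognizer for (abc)+.
# Missing entries in the table mean "no transition", i.e. reject.
TABLE = {
    "q0": {"a": "q1"},
    "q1": {"b": "q2"},
    "q2": {"c": "q3"},
    "q3": {"a": "q1"},
}
START, FINAL = "q0", {"q3"}

def matches(text):
    state = START
    for ch in text:
        state = TABLE[state].get(ch)
        if state is None:          # no transition on ch from this state
            return False
    return state in FINAL

print(matches("abc"), matches("abcabc"), matches("abca"))
```

Changing the recognized language then only requires editing the table, not the control flow.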
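Finite Automata for the Lexical Tokens
[Diagrams: one automaton per token class.
IF: 1 →i 2, 2 →f 3
ID: 1 →a-z 2, with a loop on state 2 for a-z and 0-9
NUM: 1 →0-9 2, with a loop on state 2 for 0-9
White space (and comment starting with “--”): states 1 to 5, with transitions on “-”, a-z, “\n” and “blank, etc.”]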
Regular expressions for tokens
if                                     {return IF;}
[a-z][a-z0-9]*                         {return ID;}
[0-9]+                                 {return NUM;}
([0-9]+"."[0-9]*)|("."[0-9]+)          {return REAL;}
("--"[a-z]*"\n")|(" "|"\n"|"\t")+      {/* do nothing */}
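To see how such rules behave together, the sketch below tries them with Python's `re` module. It is an illustration rather than how Lex itself is implemented: it assumes the usual Lex conventions that the longest match wins and that ties go to the earlier rule, and the names `RULES` and `lex` are hypothetical.

```python
import re

# Rules in priority order; on equal match length the earlier rule wins.
RULES = [
    ("IF",   re.compile(r"if")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
    ("NUM",  re.compile(r"[0-9]+")),
    ("REAL", re.compile(r"([0-9]+\.[0-9]*)|(\.[0-9]+)")),
    (None,   re.compile(r"(--[a-z]*\n)|[ \n\t]+")),   # comment or white space: skipped
]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best, so ties keep the earlier rule.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_len == 0:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if best_name is not None:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(lex("if x1 8 3.14 .5 if8 --abc\n"))
```

The input `if8` shows the longest-match rule at work: the IF rule matches only two characters, while the ID rule matches all three, so the lexeme is an identifier.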
Recognition of Regular Expression Using DFA
Input. An input string terminated by eof, and a DFA with start state s0 and set of
final states F.
Output. The answer “yes” if the DFA accepts the input string, “no” otherwise.
begin
s := s0;
c := nextchar;
while c <> eof do begin
s := move(s, c); // transition function
c := nextchar
end;
if s is in F then return “yes”
else return “no”
end.
DFA: An Example
(a | b)*abb
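[Diagram: DFA with states 0 (start), 1, 2, 3 (final). Transitions: 0 →a 1, 0 →b 0, 1 →a 1, 1 →b 2, 2 →a 1, 2 →b 3, 3 →a 1, 3 →b 0.]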
An Example
Input: bbababb             Input: bbabab
s = 0                      s = 0
s = move(0, b) = 0         s = move(0, b) = 0
s = move(0, b) = 0         s = move(0, b) = 0
s = move(0, a) = 1         s = move(0, a) = 1
s = move(1, b) = 2         s = move(1, b) = 2
s = move(2, a) = 1         s = move(2, a) = 1
s = move(1, b) = 2         s = move(1, b) = 2
s = move(2, b) = 3         s is not in {3}: “no”
s is in {3}: “yes”
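The two traces can be reproduced with a direct Python rendering of the recognition loop. This is a sketch under assumptions: the names `MOVE` and `recognize` are illustrative, and the entries for move(1, a), move(3, a) and move(3, b) are not exercised by the traces; they are filled in from the usual DFA for (a | b)*abb.

```python
# Sketch of the DFA recognition loop for (a | b)*abb, with an optional trace.
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,   # state 3 entries assumed from the usual DFA
}
S0, F = 0, {3}

def recognize(text, trace=False):
    s = S0
    for c in text:
        nxt = MOVE[(s, c)]                  # exactly one next state: no backtracking
        if trace:
            print(f"s = move({s}, {c}) = {nxt}")
        s = nxt
    return "yes" if s in F else "no"

print(recognize("bbababb", trace=True))
print(recognize("bbabab"))
```

Because the automaton is deterministic, each input character costs a single table lookup, so recognition takes time proportional to the length of the input.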
Recognition of Regular Expression
Creating a lexical analyzer with Lex
Structure of Lex Programs
Lex Program for Tokens