
CHAPTER 7

LEXICAL ANALYSIS

1
Outline

• Informal sketch of lexical analysis
– Identifies tokens in input string
• Issues in lexical analysis
– Lookahead
– Ambiguities
• Specifying lexers
– Regular expressions
– Examples of regular expressions

2
Recall: The Structure of a Compiler

Source → Lexical analysis → Tokens → Parsing → Interm. Language →
Optimization → Code Gen. → Machine code

Today we start with lexical analysis.

3
The Role of the Lexical Analyzer
• The lexical analyzer is the first phase of a compiler.
• Its main task is to read input characters and
produce as output a sequence of tokens that
the parser uses for syntax analysis.

Winter 2007 SEG2101 Chapter 8 4


Issues in Lexical Analysis

• There are several reasons for separating the
analysis phase of compiling into lexical analysis
and parsing:
– Simpler design
– Compiler efficiency
– Compiler portability
• Specialized tools have been designed to help
automate the construction of lexical analyzers
and parsers when they are separated.

5
Lexical Analysis

• What do we want to do? Example:

if (i == j)
z = 0;
else
z = 1;

• The input is just a sequence of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition input string into substrings
– And classify them according to their role

6
What’s a Token?

• Output of lexical analysis is a stream of tokens
• A token is a syntactic category
– In English:
noun, verb, adjective, …
– In a programming language:
Identifier, Integer, Keyword, Whitespace, …
• Parser relies on the token distinctions:
– E.g., identifiers are treated differently than keywords

7
Example of Tokens

• Tokens correspond to sets of strings.
• Identifier: strings of letters or digits,
starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks,
newlines, and tabs
• OpenPar: a left parenthesis

8
Lexemes

• A lexeme is a sequence of characters in the
source program that is matched by the pattern
for a token.
• A lexeme is a basic lexical unit of a language
comprising one or several words, the elements of
which do not separately convey the meaning of
the whole.
• The lexemes of a programming language include its
identifiers, literals, operators, and special words.
• A token of a language is a category of its
lexemes.

9
Pattern

• A pattern is a rule describing the set of
lexemes that can represent a particular
token in the source program.
• The set of strings is described by a rule,
called the pattern, associated with each token.
The pattern must match each string in the
set.

10
Examples of Tokens

const pi = 3.1416;
The substring pi is a lexeme for the token “identifier.”

11
Lexeme and Token
Index = 2 * count + 17;

Lexemes    Tokens
Index      identifier
=          equal_sign
2          int_literal
*          mult_op
count      identifier
+          plus_op
17         int_literal
;          semicolon

12
Attributes of Tokens

• When a pattern matches more than one
lexeme, the lexical analyzer must provide
additional information about the particular
lexeme.
– For example, the pattern num matches both
the strings 10 and 15.
• The lexical analyzer gathers information about
tokens into their associated attributes.
• A token has a single attribute, a pointer to
the symbol table entry in which information
about the token is kept.

13
• For example, the tokens with associated
attributes for the following C statement
E = M * C ^ 2
are written as
– <id, pointer to symbol table entry for E>
– <assign_op>
– <id, pointer to symbol table entry for M>
– <mult_op>
– <id, pointer to symbol table entry for C>
– <exp_op>
– <num, integer value 2>

14
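The token–attribute pairs above can be sketched in code. The following is a minimal illustrative scanner (not the slides’ implementation; all names are made up) that emits <token, attribute> pairs for E = M * C ^ 2, using a dictionary as the symbol table so repeated identifiers share one entry:

```python
def tokenize(stmt):
    symtab = {}   # lexeme -> symbol-table index (stands in for a pointer)
    tokens = []
    i = 0
    while i < len(stmt):
        c = stmt[i]
        if c.isspace():
            i += 1
        elif c.isalpha():          # identifier: attribute is symbol-table entry
            j = i
            while j < len(stmt) and stmt[j].isalnum():
                j += 1
            lexeme = stmt[i:j]
            entry = symtab.setdefault(lexeme, len(symtab))
            tokens.append(("id", entry))
            i = j
        elif c.isdigit():          # number: attribute is its integer value
            j = i
            while j < len(stmt) and stmt[j].isdigit():
                j += 1
            tokens.append((int and "num", int(stmt[i:j])))
            i = j
        else:                      # single-character operators, no attribute
            op = {"=": "assign_op", "*": "mult_op", "^": "exp_op"}[c]
            tokens.append((op, None))
            i += 1
    return tokens, symtab

tokens, symtab = tokenize("E = M * C ^ 2")
```

Running it yields exactly the seven pairs listed above, with E, M, and C assigned distinct symbol-table entries.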
Lexical Analyzer: Implementation

• An implementation must do two things:

1. Recognize substrings corresponding to tokens

2. Return the value or lexeme of the token


– The lexeme is the substring

15
Example

• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Token-lexeme pairs returned by the lexer:
– (Whitespace, “\t”)
– (Keyword, “if”)
– (OpenPar, “(”)
– (Identifier, “i”)
– (Relation, “==”)
– (Identifier, “j”)
– …

16
Secondary Tasks of the Lexical Analyzer

• The lexer usually discards “uninteresting”
tokens that don’t contribute to parsing.
• Examples: Whitespace, Comments

17
Specification & Recognition

• Regular expressions = specification
• Finite automata = implementation/recognition
• Lexical analysis implementation: constructing FAs and a
simulator for the FAs

Lexical Analyzer Generator

RE specification → NFA → DFA → Recognition
Specification of Tokens

A Formal Specification for Tokens or Patterns


- Strings and Languages
- Regular Expressions & Definitions
Strings and Language

• alphabet (or character class) (∑): any finite set of symbols
• string (S) over some alphabet: a finite sequence of
symbols drawn from that alphabet ∑
• length of string S, |S|: number of symbols in S
• empty string (ε): a special string of length zero
• Prefix of S: a string obtained by removing zero or more
trailing characters of string S.
– Example: S = “compiler”; a prefix of S is “com”
• Suffix of S: a string obtained by removing zero or more
leading characters of string S.
– Example: S = “compiler”; a suffix of S is “ler”
• Substring of S: a string obtained by deleting a prefix and
a suffix from S.
– Example: S = “compiler”; a substring of S is “pi”
Strings and Language

• language: any set of strings over some alphabet ∑
• For example:
1. A language defined over the alphabet Σ = {a, b} is
L = {a^n b^n | n > 0}
Thus L consists of strings like:
L = {ab, aabb, aaabbb, aaaabbbb, …}
2. A language defined over the alphabet Σ = {0, 1} is
L = {w | w has an equal number of 0’s and 1’s}. Thus
the set of strings of this language is
L = {ε, 01, 10, 0101, 1100, 0011, 1001, …}
Operations on Strings

• Concatenation: xy
– sε = εs = s
– x = “lexical”, y = “analyzer” => xy = “lexicalanalyzer”
• Exponentiation: s^i = s^(i-1) s (s^0 = ε)
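The two string operations above map directly onto code. A tiny sketch (the function name is illustrative):

```python
def power(s, i):
    """String exponentiation: s^i = s^(i-1) s, with s^0 = ε (empty string)."""
    return "" if i == 0 else power(s, i - 1) + s

# Concatenation is just +:
x, y = "lexical", "analyzer"
```

Here x + y gives "lexicalanalyzer", and power("ab", 3) gives "ababab".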
Operations on Languages

• Union
– {s | s is in L or s is in M}
• Concatenation
– {st | s is in L and t is in M}
• Kleene closure: zero or more concatenations
– L*: union of L^i (i = 0 … infinity)
– L^0 = {ε}, L^i = L^(i-1) L
• Positive closure: one or more concatenations
– L+: union of L^i (i = 1 … infinity)
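These operations can be made concrete on small finite languages. A sketch (function names are made up; since L* is infinite, the closure is truncated to strings of bounded length):

```python
def concat(L, M):
    """Concatenation: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def kleene_upto(L, max_len):
    """All strings of L* no longer than max_len (L* itself is infinite)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {s + t for s in frontier for t in L
                    if len(s + t) <= max_len} - result
        result |= frontier
    return result

L, M = {"a"}, {"b", "c"}
```

For example, L | M is the union {a, b, c}, concat(L, M) is {ab, ac}, and kleene_upto(L, 3) enumerates {ε, a, aa, aaa}; dropping ε from it gives the corresponding slice of the positive closure L+.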
Examples

• L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}
• Union
– L ∪ D = {letters and digits of length 1}
• Concatenation
– LD = {a letter followed by a digit} (= {A0, A1, …, B0, …})
– L^4 = {4-letter strings} (= {AAAA, AABC, BBBB, …})
• Kleene closure
– L* = {all strings of letters of length zero (i.e., ε) or more}
– L(L ∪ D)* = {all strings of letters and digits, starting with a
letter}
• Positive closure
– D+ = {strings of one or more digits}
Regular Expression (R.E.)

• A Formal Specification for Tokens

Regular Expression:
• Regular expressions describe or represent sets of
strings.
• The recursive definition of regular expressions over ∑
is:
– The empty set (∅) and the empty string (ε) are regular
expressions over ∑.
– Every symbol a ∈ Σ is a regular expression over Σ.
– If R1 and R2 are regular expressions over Σ, then the following
are regular expressions:
• (R1 + R2): ‘+’ denotes alternation (ORing).
• (R1 · R2): ‘·’ denotes concatenation.
• (R1)*: ‘*’ denotes closure.
– Regular expressions are only those that are recursively
obtained by applying the above rules.
Operations on Regular Expressions

• Union
– The union of two languages L1 and L2 is the set of strings that are in L1, in L2,
or in both.
– It is denoted L1 + L2 or L1 ∪ L2.
e.g. L1 = {01, 11} and L2 = {111, 10}
Then L1 + L2 = {01, 11, 111, 10}
• Concatenation
– The concatenation of two languages L1 and L2 is the set of strings formed by
concatenating any string in L1 with any string in L2. It is denoted L1·L2 or L1 L2.
e.g. L1 = {01, 11} and L2 = {111, 10}
Then L1·L2 = {01111, 0110, 11111, 1110}
• Kleene Closure
– The closure of a language L includes all strings formed by taking any number of
strings from L, repeating them 0 or more times, and concatenating them.
It is denoted L*.
e.g. L = {0, 1} Then L* = {ε, 0, 1, 10, 01, 111, 000, 010, …}
• Positive Closure
– Like the Kleene closure, but strings from L are repeated 1 or more times.
It is denoted L+.
e.g. L = {0, 1} Then L+ = {0, 1, 10, 01, 111, 000, 010, …}
Regular Definitions

• If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
• where each di is a distinct name, and each ri is a
regular expression over the symbols in Σ ∪ {d1, d2, …, d(i-1)},
i.e., the basic symbols and the previously defined
names.

34
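A regular definition can be mimicked with Python’s re module by building each named expression out of the previously defined names, as a quick sketch (the names letter, digit, and ident are illustrative, in the style of the classic letter (letter | digit)* definition for identifiers):

```python
import re

# Each definition may use previously defined names, as in a regular definition:
letter = "[A-Za-z]"
digit  = "[0-9]"
ident  = f"{letter}({letter}|{digit})*"   # letter (letter | digit)*

id_re = re.compile(ident)
```

With this, id_re.fullmatch("x25") succeeds while id_re.fullmatch("2x") fails, since an identifier must start with a letter.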
Examples of Regular Definitions

35
Recognition of Tokens

36
Finite Automata

• A recognizer for a language is a program that
takes as input a string x and answers “yes” if x
is a sentence of the language and “no”
otherwise.
• We compile a regular expression into a
recognizer by constructing a generalized
transition diagram called a finite automaton.
• A finite automaton can be deterministic or
nondeterministic, where nondeterministic
means that more than one transition out of a
state may be possible on the same input
symbol.

37
Finite Automata

• A finite automaton consists of
– An input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions: state →(input) state

38
Finite Automata

• Graphical representation:
– State transition diagram
• Implementation:
– State transition table
• Deterministic (DFA)
– A single transition for every state on every input symbol
• Non-deterministic (NFA)
– More than one transition for at least one state
on some input symbol
Finite Automata

• Transition
s1 →(a) s2
• is read:
in state s1, on input “a”, go to state s2
• If end of input (or no transition possible):
– If in an accepting state => accept
– Otherwise => reject

40
Finite Automata State Graphs

[Figure: graph notation — a circle is a state; an arrow into a circle marks
the start state; a double circle is an accepting state; an edge labeled “a”
is a transition on input a.]

41
Finite Automata

• Nondeterministic finite automata
– No restrictions on the labels of their edges
– Several edges may leave the same state on the same symbol
– ε, the empty string, is a possible label
• Deterministic finite automata
– For each state and each symbol of the input
alphabet, exactly one edge with that symbol leaves the
state.

42
NFA: Nondeterministic Finite Automata

• An NFA consists of
– S: a finite set of states
– Σ: a finite set of input symbols
– δ: a transition function that maps (state,
symbol) pairs to sets of states
– s0: a state distinguished as the start state
– F: a set of states distinguished as final states
Transition Diagram (NFA)

(a | b)*abb

[Figure: NFA with states 0 (start), 1, 2, 3 (final); a self-loop on 0
labeled a and b, then edges 0 →(a) 1 →(b) 2 →(b) 3.]

• States: {0 (start), 1, 2, 3 (final)}
• Input symbols: {a, b}
• NFA transition function:
δ(0, a) = {0, 1}, δ(0, b) = {0}
δ(1, b) = {2}, δ(2, b) = {3}
Acceptance of NFA

• An NFA accepts an input string s iff there is some
path in the transition diagram from the start state
to some final state such that the edge labels along
this path spell out s.

Example:
• bbababb is accepted by (a|b)*abb
• bbabab is NOT
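One standard way to check acceptance is to simulate the NFA directly, tracking the set of states the automaton could be in after each symbol. A minimal sketch for the (a|b)*abb NFA, using the transition function given on the earlier slide (function and variable names are illustrative):

```python
# Transition function of the NFA for (a|b)*abb:
delta = {(0, "a"): {0, 1}, (0, "b"): {0},
         (1, "b"): {2},     (2, "b"): {3}}

def nfa_accepts(s, start=frozenset({0}), final=frozenset({3})):
    states = set(start)
    for c in s:
        # take every possible transition from every current state
        states = set().union(*(delta.get((q, c), set()) for q in states))
    return bool(states & final)
```

This accepts bbababb and rejects bbabab, matching the example above.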
NFA with ε-transitions

aa* | bb*

[Figure: start state 0 with ε-edges to states 1 and 3; edge 1 →(a) 2 with
a self-loop on 2 labeled a; edge 3 →(b) 4 with a self-loop on 4 labeled b;
states 2 and 4 are final.]

• NFA transition function:
δ(0, ε) = {1, 3}, δ(1, a) = {2}, δ(2, a) = {2}
δ(3, b) = {4}, δ(4, b) = {4}
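Simulating an ε-NFA adds one step: after every move, take the ε-closure (all states reachable by ε-edges alone). A sketch for the aa* | bb* automaton above, with the ε-edge encoded under the key "eps" (names are illustrative):

```python
# Transition function of the ε-NFA for aa* | bb*; "eps" marks ε-edges.
delta = {(0, "eps"): {1, 3},
         (1, "a"): {2}, (2, "a"): {2},
         (3, "b"): {4}, (4, "b"): {4}}

def eps_closure(states):
    """All states reachable from `states` using only ε-edges."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, "eps"), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def enfa_accepts(s, start=frozenset({0}), final=frozenset({2, 4})):
    states = eps_closure(start)
    for c in s:
        states = eps_closure(set().union(
            *(delta.get((q, c), set()) for q in states)))
    return bool(states & final)
```

So "aaa" and "b" are accepted, while "ab" and the empty string are not, as aa* | bb* requires.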
Deterministic Finite Automata

• A DFA is a special case of an NFA in which
– no state has an ε-transition
– for each state s and input symbol a, there is at
most one edge labeled a leaving s
Transition Diagram

A DFA for (a | b)*abb

[Figure: DFA with states 0 (start), 1, 2, 3 (final) and transitions
0 →(a) 1, 0 →(b) 0, 1 →(a) 1, 1 →(b) 2, 2 →(a) 1, 2 →(b) 3,
3 →(a) 1, 3 →(b) 0.]
Example of FA and Transition Diagrams

r = (abc)+

[Figure: DFA with start state q0 and transitions q0 →(a) q1 →(b) q2 →(c) q3,
plus q3 →(a) q1; q3 is the final state. A circle is a state, an arrow is a
transition, an incoming arrow marks the start state, and a double circle
marks a final state.]
FA and Transition Tables

            inputs
states    a     b     c
  q0      q1    -     -
  q1      -     q2    -
  q2      -     -     q3
  q3      q1    -     -

NextState = Move( CurrentState, Input )


Recognition

state = 0;
while ( (c = next_char()) != EOF ) {
    switch (state) {
    case 0: if (c == 'a') state = 1; else return FALSE; break;
    case 1: if (c == 'b') state = 2; else return FALSE; break;
    case 2: if (c == 'c') state = 3; else return FALSE; break;
    case 3: if (c == 'a') state = 1;
            else { ungetchar(); return TRUE; }
            break;
    default: error();
    }
}
return (state == 3) ? TRUE : FALSE;
Finite Automata for the Lexical Tokens

[Figures: one automaton per token class —
IF: 1 →(i) 2 →(f) 3 (final);
ID: 1 →(a-z) 2 (final), with a self-loop on 2 labeled a-z and 0-9;
NUM: 1 →(0-9) 2 (final), with a self-loop on 2 labeled 0-9;
White space (and comment starting with ‘--’): loops on blank, tab, and
newline, plus a ‘--’ branch consuming characters up to ‘\n’.]
Regular expressions for tokens

if                                   {return IF;}
[a-z][a-z0-9]*                       {return ID;}
[0-9]+                               {return NUM;}
([0-9]+ “.” [0-9]*) | (“.” [0-9]+)   {return REAL;}
(“--” [a-z]* “\n”) | (“ ” | “\n” | “\t”)+   {/* do nothing */}
Recognition of Regular Expressions Using DFA

• Simulating a Deterministic Finite Automaton (DFA):
– initialization:
• current_state = s0; input_symbol = first symbol
– while (current_state is not the fail state &&
input_symbol != EOF)
• next_state = δ(current_state, input_symbol)
• current_state = next_state
• input_symbol = next input symbol
– if (current_state is in final states) accept() else fail()
Simulating a DFA

Input. An input string ended with eof, and a DFA with start state s0 and final
states F.
Output. The answer “yes” if the DFA accepts the input string, “no” otherwise.
begin
s := s0;
c := nextchar;
while c <> eof do begin
s := move(s, c); // transition function
c := nextchar
end;
if s is in F then return “yes”
else return “no”
end.
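The algorithm above is a few lines of executable code. A sketch for the DFA recognizing (a | b)*abb, with the move table as reconstructed from the transition diagram (names are illustrative):

```python
# move(state, symbol) for the DFA recognizing (a|b)*abb; the table is
# total over the alphabet {a, b}, so no fail state is needed here.
move = {(0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 1, (2, "b"): 3,
        (3, "a"): 1, (3, "b"): 0}

def dfa_accepts(s, start=0, final=frozenset({3})):
    state = start
    for c in s:
        state = move[(state, c)]
    return state in final
```

This reproduces the worked example that follows: bbababb is accepted, bbabab is not.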
DFA: An Example

(a | b)*abb

[Figure: DFA with states 0 (start), 1, 2, 3 (final) and transitions
0 →(a) 1, 0 →(b) 0, 1 →(a) 1, 1 →(b) 2, 2 →(a) 1, 2 →(b) 3,
3 →(a) 1, 3 →(b) 0.]
An Example

bbababb:                    bbabab:

s = 0                       s = 0
s = move(0, b) = 0          s = move(0, b) = 0
s = move(0, b) = 0          s = move(0, b) = 0
s = move(0, a) = 1          s = move(0, a) = 1
s = move(1, b) = 2          s = move(1, b) = 2
s = move(2, a) = 1          s = move(2, a) = 1
s = move(1, b) = 2          s = move(1, b) = 2
s = move(2, b) = 3          s is not in {3} => reject
s is in {3} => accept
Recognition of Regular Expressions

• Simulating an NFA is harder than simulating a DFA
• Constructing an NFA is easier than constructing a
DFA
– Construct NFA
– Construct equivalent DFA
– (optional) State minimization
– Simulate DFA
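The "construct equivalent DFA" step is the subset construction: each DFA state is a set of NFA states. A compact sketch for the (a|b)*abb NFA from the earlier slides (function and variable names are illustrative):

```python
# NFA transition function for (a|b)*abb, as given earlier.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0},
       (1, "b"): {2},     (2, "b"): {3}}

def subset_construction(start, alphabet):
    """Build DFA transitions whose states are frozensets of NFA states."""
    d0 = frozenset(start)
    trans, seen, worklist = {}, {d0}, [d0]
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            T = frozenset(set().union(*(nfa.get((q, c), set()) for q in S)))
            trans[(S, c)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return d0, trans

d0, trans = subset_construction({0}, "ab")

def dfa_accepts(s, nfa_finals=frozenset({3})):
    S = d0
    for c in s:
        S = trans[(S, c)]
    return bool(S & nfa_finals)   # accept if any NFA final state is in S
```

The construction yields four DFA states ({0}, {0,1}, {0,2}, {0,3}), matching the 4-state DFA shown earlier, and the resulting DFA agrees with the NFA on bbababb (accept) and bbabab (reject).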
The Lexical-Analyzer Generator Lex

• Lex is a tool that allows one to specify a lexical
analyzer by writing regular expressions to
describe patterns for tokens.
• The input notation for the Lex tool is referred to
as the Lex language, and the tool itself is the Lex
compiler.

59
Creating a lexical analyzer with Lex
Structure of Lex Programs

61
Lex Program for Tokens

62
