Lecture II - Lexical Analysis.handouts

The document outlines the process of lexical analysis in programming, detailing how source code is transformed into tokens that represent logical pieces of the code. It discusses the importance of tokens, their association with lexemes, and the challenges faced during scanning. Additionally, it introduces regular expressions as a method for defining formal languages and categorizing tokens.


Where We Are

Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Optimization → Machine Code

(This lecture covers the first stage, Lexical Analysis.)

Lexical analysis turns the raw character stream into a token stream. The input characters

w h i l e ( i p < z ) \n \t + + i p ;

i.e. the source text

while (ip < z)
    ++ip;

become the token stream

T_While ( T_Ident < T_Ident ) ++ T_Ident
          ip        z            ip

The input

d o [ f o r ] = n e w 0 ;

i.e.

do[for] = new 0;

becomes

T_Do [ T_For ] = T_New T_IntConst
                         0

Scanning a Source File

Input characters:

w h i l e ( 1 3 7 < i ) \n \t + + i ;

The scanner reads the input left to right, grouping characters into tokens:

● Reading the characters w h i l e produces T_While. The piece of the original program from which we made the token is called a lexeme. T_While itself is called a token. You can think of it as an enumerated type representing what logical entity we read out of the source code.
● Sometimes we will discard a lexeme rather than storing it for later use. Here, we ignore whitespace, since it has no bearing on the meaning of the program.
● Some tokens can have attributes that store extra information about the token. Reading 1 3 7 produces T_IntConst, and here we store which integer is represented: 137.

Goals of Lexical Analysis

● Convert from the physical description of a program into a sequence of tokens.
  ● Each token represents one logical piece of the source file – a keyword, the name of a variable, etc.
● Each token is associated with a lexeme.
  ● The actual text of the token: “137,” “int,” etc.
● Each token may have optional attributes.
  ● Extra information derived from the text – perhaps a numeric value.
● The token sequence will be used in the parser to recover the program structure.
What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) {
    cout << k << endl;
}

for   int   <<   =   (   )   ++   {   }   ;   <   [   ]
Identifier   IntegerConstant

Choosing Good Tokens

● Very much dependent on the language.
● Typically:
  ● Give keywords their own tokens.
  ● Give different punctuation symbols their own tokens.
  ● Group lexemes representing identifiers, numeric constants, strings, etc. into their own groups.
  ● Discard irrelevant information (whitespace, comments).
Challenges in Scanning
● How do we determine which lexemes are
associated with each token?
● When there are multiple ways we could
scan the input, how do we know which
one to pick?
● How do we address these concerns
efficiently?
Associating Lexemes with Tokens

Lexemes and Tokens

● Tokens give a way to categorize lexemes by what information they provide.
● Some tokens might be associated with only a single lexeme:
  ● Tokens for keywords like if and while probably only match those lexemes exactly.
● Some tokens might be associated with lots of different lexemes:
  ● All variable names, all possible numbers, all possible strings, etc.

Sets of Lexemes

● Idea: Associate a set of lexemes with each token.
● We might associate the “number” token with the set { 0, 1, 2, …, 10, 11, 12, … }
● We might associate the “string” token with the set { "", "a", "b", "c", … }
● We might associate the token for the keyword while with the set { while }.

How do we describe which (potentially infinite) set of lexemes is associated with each token type?
Formal Languages

● A formal language is a set of strings.
● Many infinite languages have finite descriptions:
  ● Define the language using an automaton.
  ● Define the language using a grammar.
  ● Define the language using a regular expression.
● We can use these compact descriptions of the language to define sets of strings.
● Over the course of this class, we will use all of these approaches.

Regular Expressions

● Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).
● Often provide a compact and human-readable description of the language.
● Used as the basis for numerous software systems, including the flex tool we will use in this course.

Atomic Regular Expressions

● The regular expressions we will use in this course begin with two simple building blocks.
● The symbol ε is a regular expression that matches the empty string.
● For any symbol a, the symbol a is a regular expression that just matches a.

Compound Regular Expressions

● If R1 and R2 are regular expressions, R1R2 is a regular expression representing the concatenation of the languages of R1 and R2.
● If R1 and R2 are regular expressions, R1 | R2 is a regular expression representing the union of R1 and R2.
● If R is a regular expression, R* is a regular expression for the Kleene closure of R.
● If R is a regular expression, (R) is a regular expression with the same meaning as R.

Operator Precedence

● Regular expression operator precedence, from highest to lowest, is
  (R)
  R*
  R1R2
  R1 | R2
● So ab*c|d is parsed as ((a(b*))c)|d

Simple Regular Expressions

● Suppose the only characters are 0 and 1.
● A regular expression for strings containing 00 as a substring:

  (0 | 1)*00(0 | 1)*

  Matching examples: 11011100101, 0000, 11111011110011111

● A regular expression for strings of length exactly four:

  (0|1)(0|1)(0|1)(0|1)   — or, more compactly, (0|1){4}

  Matching examples: 0000, 1010, 1111, 1000

● A regular expression for strings that contain at most one zero:

  1*(0 | ε)1*   — or, more compactly, 1*0?1*

  Matching examples: 11110111, 111111, 0111, 0

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.
● A regular expression for even numbers is

  (+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

  or, using character-class shorthand,

  (+|-)?[0123456789]*[02468]   or   (+|-)?[0-9]*[02468]

  Matching examples: 42, +1370, -3248, -9999912

Matching Regular Expressions
Implementing Regular Expressions

● Regular expressions can be implemented using finite automata.
● There are two main kinds of finite automata:
  ● NFAs (nondeterministic finite automata), which we'll see in a second, and
  ● DFAs (deterministic finite automata), which we'll see later.
● Automata are best explained by example...

A Simple Automaton

(Diagram: from the start state, a transition on " leads to a second state; that state loops to itself on A, B, C, …, Z; a second " leads to a final state drawn with a double circle.)

● Each circle is a state of the automaton. The automaton's configuration is determined by what state(s) it is in.
● The arrows are called transitions. The automaton changes which state(s) it is in by following transitions.
● The automaton takes a string as input and decides whether to accept or reject the string.
● On the input " H E Y A ", the automaton follows the transitions and ends in the double-circled state. The double circle indicates that this state is an accepting state; the automaton accepts the string if it ends in an accepting state.
● On the input " " " " " ", after the first two quotes the automaton is in the accepting state, but there is no transition on " there, so the automaton dies and rejects.
● On the input " A B C, the automaton ends in a state that is not an accepting state, so the automaton rejects.

A More Complex Automaton

(Diagram: an automaton over 0 and 1 in which some states have multiple outgoing transitions on the same symbol.)

● Notice that there are multiple transitions defined here on 0 and 1. If we read a 0 or 1 here, we follow both transitions and enter multiple states.
● On the input 0 1 1 1 0 1, the automaton may therefore be in several states at once after each character. Since we are in at least one accepting state at the end, the automaton accepts.

An Even More Complex Automaton

(Diagram: an automaton over the alphabet a, b, c, with labeled transitions and with ε-transitions out of the start state.)
● These are called ε-transitions. These transitions are followed automatically and without consuming any input.
● On the input b c b a, the automaton first follows the ε-transitions for free, then follows the labeled transitions on each character, again tracking a set of states.
Simulating an NFA

● Keep track of a set of states, initially the start state and everything reachable by ε-moves.
● For each character in the input:
  ● Maintain a set of next states, initially empty.
  ● For each current state:
    – Follow all transitions labeled with the current letter.
    – Add these states to the set of next states.
  ● Add every state reachable by an ε-move to the set of next states.
● Complexity: O(mn²) for strings of length m and automata with n states.

From Regular Expressions to NFAs

● There is a (beautiful!) procedure for converting a regular expression to an NFA.
● Associate each regular expression with an NFA with the following properties:
  ● There is exactly one accepting state.
  ● There are no transitions out of the accepting state.
  ● There are no transitions into the starting state.
● These restrictions are stronger than necessary, but make the construction easier.

Base Cases

● Automaton for ε: a start state with a single ε-transition to an accepting state.
● Automaton for a single character a: a start state with a transition on a to an accepting state.

Construction for R1R2

● Place the automata for R1 and R2 side by side, then add an ε-transition from the accepting state of R1 to the start state of R2. The start state of R1 becomes the overall start state, and the accepting state of R2 becomes the overall accepting state.

Construction for R1 | R2

● Add a new start state with ε-transitions to the start states of R1 and R2, and a new accepting state with ε-transitions into it from the accepting states of R1 and R2.

Construction for R*

● Add a new start state and a new accepting state. Add ε-transitions from the new start state to the start of R and to the new accepting state, from the accepting state of R back to the start of R, and from the accepting state of R to the new accepting state.
Overall Result

● Any regular expression of length n can be converted into an NFA with O(n) states.
● Can determine whether a string of length m matches a regular expression of length n in time O(mn²).
● We'll see how to make this O(m) later (this is independent of the complexity of the regular expression!)
Challenges in Scanning

● How do we determine which lexemes are associated with each token?
● When there are multiple ways we could scan the input, how do we know which one to pick?
● How do we address these concerns efficiently?

Lexing Ambiguities

T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

The input f o r t can be split in many ways: as the single identifier fort, as for followed by the identifier t, as the identifier f followed by ort, and so on.
Conflict Resolution

● Assume all tokens are specified as regular expressions.
● Algorithm: Left-to-right scan.
● Tiebreaking rule one: Maximal munch.
  ● Always match the longest possible prefix of the remaining text.

Implementing Maximal Munch

● Given a set of regular expressions, how can we use them to implement maximal munch?
● Idea:
  ● Convert expressions to NFAs.
  ● Run all NFAs in parallel, keeping track of the last match.
  ● When all automata get stuck, report the last match and restart the search at that point.
Implementing Maximal Munch

T_Do        do
T_Double    double
T_Mystery   [A-Za-z]

(Three automata run in parallel: one matching d o, one matching d o u b l e, and one matching any single letter Σ.)

On the input D O U B D O U B L E, the scanner reads d o and records a match for T_Do; the T_Double automaton stays alive through d o u b, then all automata get stuck on the following character. The scanner therefore reports the last match (T_Do), restarts after do, matches u and b as T_Mystery, and finally matches double as T_Double.
A Minor Simplification

● Build a single automaton that runs all the matching automata in parallel: add a new start state with ε-transitions to the start state of each matching automaton.
● Annotate each accepting state with which automaton it came from.
Merging all automata into a single NFA

● In practice, all NFAs are merged and simulated as a single NFA.
● Accepting states are labeled with the token name.

(Diagram: a combined NFA whose accepting states are labeled IF, EQ, NUM, and ID, built from automata for the keyword if, the operator =, numbers over [0-9], identifiers over [a-z][a-z0-9]*, and a catch-all Σ branch.)
Other Conflicts

T_Do          do
T_Double      double
T_Identifier  [A-Za-z_][A-Za-z0-9_]*

The input d o u b l e matches both T_Double and T_Identifier – maximal munch alone cannot break this tie.

More Tiebreaking

● When two regular expressions apply, choose the one with the greater “priority.”
● Simple priority system: pick the rule that was defined first.
One Last Detail...

● We know what to do if multiple rules match.
● What if nothing matches – that is, what if we cannot reach any accepting state given the current input?
● Trick: Add a “catch-all” rule that matches any character and reports an error.

Summary of Conflict Resolution

● Construct an automaton for each regular expression.
● Merge them into one automaton by adding a new start state.
● Scan the input, keeping track of the last known match.
● Break ties by choosing higher-precedence matches.
● Have a catch-all rule to handle errors.
Challenges in Scanning

● How do we determine which lexemes are associated with each token?
● When there are multiple ways we could scan the input, how do we know which one to pick?
● How do we address these concerns efficiently?

DFAs

● The automata we've seen so far have all been NFAs.
● A DFA is like an NFA, but with tighter restrictions:
  ● Every state must have exactly one transition defined for every letter.
  ● ε-moves are not allowed.
A Sample DFA

(Diagram: four states A, B, C, D over the alphabet {0, 1}, with the transition table:)

        0   1
    A   C   B
    B   D   A
    C   A   D
    D   B   C

Code for DFAs

int kTransitionTable[kNumStates][kNumSymbols] = {
    {0, 0, 1, 3, 7, 1, …},
    …
};
bool kAcceptTable[kNumStates] = {
    false,
    true,
    true,
    …
};
bool simulateDFA(string input) {
    int state = 0;
    for (char ch: input)
        state = kTransitionTable[state][ch];
    return kAcceptTable[state];
}
Speeding up Matching
● In the worst case, an NFA with n states takes time O(mn²) to match a string of length m.
● DFAs, on the other hand, take only O(m): simulateDFA runs in time O(m) on a string of length m.
● There is another (beautiful!) algorithm to convert NFAs to DFAs.

Lexical Specification → Regular Expressions → NFA → DFA → Table-Driven DFA
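As a concrete, self-contained instance of the table-driven scheme (the table below is an invented example, not the one from the slide): a two-state DFA over {'0', '1'} that accepts exactly the strings containing an even number of 1s.

```cpp
#include <string>

// Table-driven DFA simulation, instantiated for a small example:
// state 0 = even number of 1s seen so far (accepting), state 1 = odd.
const int kEvenOddTable[2][2] = {
    {0, 1},  // from state 0: '0' -> 0, '1' -> 1
    {1, 0},  // from state 1: '0' -> 1, '1' -> 0
};
const bool kEvenOddAccept[2] = {true, false};

bool hasEvenOnes(const std::string& input) {
    int state = 0;
    for (char ch : input)
        state = kEvenOddTable[state][ch - '0'];  // map '0'/'1' to column 0/1
    return kEvenOddAccept[state];
}
```

Each input character costs one array lookup, so the whole match is a single O(m) pass.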

Subset Construction
● NFAs can be in many states at once, while DFAs can only be in a single state at a time.
● Key idea: Make the DFA simulate the NFA.
● Have the states of the DFA correspond to sets of states of the NFA.
● Transitions between states of the DFA correspond to transitions between sets of states in the NFA.
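The two core operations of the subset construction can be sketched as follows; the NFA representation and all names here are illustrative assumptions, not taken from the slides.

```cpp
#include <map>
#include <set>
#include <vector>

// A hypothetical NFA representation for sketching the subset construction.
struct NFA {
    // transitions[state] maps a symbol to the set of successor states;
    // epsilon[state] lists states reachable on an ε-move.
    std::vector<std::map<char, std::set<int>>> transitions;
    std::vector<std::set<int>> epsilon;
};

// All states reachable from `states` using only ε-moves.
std::set<int> epsilonClosure(const NFA& nfa, std::set<int> states) {
    std::vector<int> worklist(states.begin(), states.end());
    while (!worklist.empty()) {
        int s = worklist.back();
        worklist.pop_back();
        for (int t : nfa.epsilon[s])
            if (states.insert(t).second)  // newly discovered state
                worklist.push_back(t);
    }
    return states;
}

// One DFA transition: the set of NFA states reachable from `states` on `ch`.
std::set<int> move(const NFA& nfa, const std::set<int>& states, char ch) {
    std::set<int> result;
    for (int s : states) {
        auto it = nfa.transitions[s].find(ch);
        if (it != nfa.transitions[s].end())
            result.insert(it->second.begin(), it->second.end());
    }
    return epsilonClosure(nfa, result);
}
```

The DFA's start state is epsilonClosure of the NFA's start state, and each unexplored set is expanded with move() until no new sets appear.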
From NFA to DFA: a worked example

[Figure sequence: the combined NFA has start state 0 with ε-moves into three branches — states 1–3 matching "do", states 4–10 matching "double", and a catch-all branch (states 11–12) matching any single character in Σ. Running the subset construction step by step produces the DFA

    start {0, 1, 4, 11} —d→ {2, 5, 12} —o→ {3, 6} —u→ {7} —b→ {8} —l→ {9} —e→ {10}

with Σ–d from the start set and Σ–o from {2, 5, 12} leading to {12}, and all remaining unmatched transitions leading to the state for the empty set.]
Modified Subset Construction
● Instead of marking whether a state is accepting, remember which token type it matches.
● Break ties with priorities.
● When using the DFA as a scanner, consider the DFA "stuck" if it enters the state corresponding to the empty set.
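The tie-breaking rule can be sketched as follows: each accepting NFA state carries a (priority, token-kind) pair, and a DFA state — a set of NFA states — matches the highest-priority token among its accepting members. The numbering here is purely illustrative.

```cpp
#include <set>
#include <utility>

// Lower number = higher priority, e.g. 0 = keyword, 1 = identifier,
// 2 = catch-all error rule. Each pair is (priority, token kind).
int tokenFor(const std::set<std::pair<int, int>>& acceptingMembers) {
    if (acceptingMembers.empty()) return -1;  // not an accepting DFA state
    return acceptingMembers.begin()->second;  // set is ordered by priority
}
```

For the DFA state {2, 5, 12} in the worked example, both a keyword branch and the catch-all rule accept, and the keyword wins by priority.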
Performance Concerns
● The NFA-to-DFA construction can introduce exponentially many states.
● Time/memory tradeoff:
  ● A low-memory NFA has higher scan time.
  ● A high-memory DFA has lower scan time.
● Could use a hybrid approach by simplifying the NFA before generating code.

Real-World Scanning: Python

Python Blocks
● Scoping handled by whitespace:

    if w == z:
        a = b
        c = d
    else:
        e = f
    g = h

● What does that mean for the scanner?
Whitespace Tokens
● Special tokens are inserted to indicate changes in levels of indentation:
  ● NEWLINE marks the end of a line.
  ● INDENT indicates an increase in indentation.
  ● DEDENT indicates a decrease in indentation.
● Note that INDENT and DEDENT encode the change in indentation, not the total amount of indentation.
Scanning Python
The example above scans as the following token stream (identifier attributes in parentheses), as if the program had been written with explicit braces:

    if ident(w) == ident(z) : NEWLINE
    INDENT ident(a) = ident(b) NEWLINE
    ident(c) = ident(d) NEWLINE
    DEDENT else : NEWLINE
    INDENT ident(e) = ident(f) NEWLINE
    DEDENT ident(g) = ident(h) NEWLINE
Where to INDENT/DEDENT?
● The scanner maintains a stack of line indentations, keeping track of all indented contexts so far.
● Initially, this stack contains 0, since initially the contents of the file aren't indented.
● On a newline:
  ● See how much whitespace is at the start of the line.
  ● If this value exceeds the top of the stack:
    – Push the value onto the stack.
    – Emit an INDENT token.
  ● Otherwise, while the value is less than the top of the stack:
    – Pop the stack.
    – Emit a DEDENT token.

Source: http://docs.python.org/reference/lexical_analysis.html
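The stack discipline above can be sketched directly. This simplified version takes the indentation of each logical line and emits only the whitespace tokens; real Python additionally handles tabs, blank lines, mismatched dedents, and trailing DEDENTs at end of file.

```cpp
#include <string>
#include <vector>

// Emit NEWLINE/INDENT/DEDENT tokens from per-line indentation amounts,
// following the stack algorithm above (simplified sketch).
std::vector<std::string> indentTokens(const std::vector<int>& lineIndents) {
    std::vector<std::string> tokens;
    std::vector<int> stack = {0};       // initially, not indented
    for (int indent : lineIndents) {
        if (indent > stack.back()) {    // deeper: push and emit one INDENT
            stack.push_back(indent);
            tokens.push_back("INDENT");
        } else {
            while (indent < stack.back()) {  // shallower: pop, emit DEDENTs
                stack.pop_back();
                tokens.push_back("DEDENT");
            }
        }
        tokens.push_back("NEWLINE");    // every logical line ends in NEWLINE
    }
    return tokens;
}
```

Feeding it the indentations of the running example (0, 4, 4, 0, 4, 0) reproduces the INDENT/DEDENT placement shown in the token stream above.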
Interesting Observation
● Normally, more text on a line translates into more tokens.
● With DEDENT, less text on a line often means more tokens:

    if cond1:
        if cond2:
            if cond3:
                if cond4:
                    if cond5:
                        statement1
    statement2

  (Here statement2 is preceded by five DEDENT tokens.)

Summary
● Lexical analysis splits input text into tokens holding a lexeme and an attribute.
● Token classes are sets of strings, often defined with regular expressions.
● Regular expressions can be converted to NFAs and from there to DFAs.
● Maximal munch using an automaton allows for fast scanning.
● Not all tokens come directly from the source code.
Implementing a lexical analyzer
In practice (and for your project), there are two ways:
● Write an ad-hoc analyser.
● Use automatic tools like (F)LEX.
The first approach is more tedious; it is only useful to address specific needs. The second approach is more portable.

Flex
flex is a free implementation of the Unix lex program. flex implements what we have seen:
● It takes regular expressions as input.
● It generates a combined NFA.
● It converts it to an equivalent DFA.
● It minimizes the automaton as much as possible.
● It generates C code that implements it.
● It handles conflicts with the longest-matching-prefix principle and a preference order on the tokens.
More information: http://flex.sourceforge.net/manual/
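A minimal flex specification illustrating these points; the token names and print actions are invented for illustration, not taken from the lecture.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
"while"          { printf("T_While\n"); }
"++"             { printf("T_Increment\n"); }
[a-z][a-z0-9]*   { printf("T_Ident(%s)\n", yytext); }
[0-9]+           { printf("T_Num(%s)\n", yytext); }
[ \t\n]+         { /* skip whitespace */ }
.                { printf("unexpected character: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
```

Because flex uses longest match and, on ties, the earliest rule, "while" scans as T_While while "whilex" scans as a single identifier; building is just `flex scanner.l && cc lex.yy.c`.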
