Lecture II - Lexical Analysis.handouts

The document outlines the process of lexical analysis in programming, detailing how source code is transformed into tokens that represent logical pieces of the code. It discusses the importance of tokens, their association with lexemes, and the challenges faced during scanning. Additionally, it introduces regular expressions as a method for defining formal languages and categorizing tokens.


Where We Are

Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Optimization → Machine Code

(This lecture covers the first stage, Lexical Analysis.)

Lexical analysis turns the raw character stream into a token stream. The input characters

w h i l e ( i p < z ) \n \t + + i p ;

i.e. the source text

while (ip < z)
    ++ip;

become the token stream

T_While ( T_Ident < T_Ident ) ++ T_Ident
          ip        z            ip

The input

d o [ f o r ] = n e w 0 ;

i.e.

do[for] = new 0;

becomes

T_Do [ T_For ] = T_New T_IntConst
                         0

Scanning a Source File

Input characters:

w h i l e ( 1 3 7 < i ) \n \t + + i ;

The scanner reads the input left to right, grouping characters into tokens:

● Reading the characters w h i l e produces T_While. The piece of the original program from which we made the token is called a lexeme. T_While itself is called a token. You can think of it as an enumerated type representing what logical entity we read out of the source code.
● Sometimes we will discard a lexeme rather than storing it for later use. Here, we ignore whitespace, since it has no bearing on the meaning of the program.
● Some tokens can have attributes that store extra information about the token. Reading 1 3 7 produces T_IntConst, and here we store which integer is represented: 137.

Goals of Lexical Analysis

● Convert from the physical description of a program into a sequence of tokens.
  ● Each token represents one logical piece of the source file – a keyword, the name of a variable, etc.
● Each token is associated with a lexeme.
  ● The actual text of the token: “137,” “int,” etc.
● Each token may have optional attributes.
  ● Extra information derived from the text – perhaps a numeric value.
● The token sequence will be used in the parser to recover the program structure.
What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) {
    cout << k << endl;
}

for   int   <<   =   (   )   ++   {   }   ;   <   [   ]
Identifier   IntegerConstant

Choosing Good Tokens

● Very much dependent on the language.
● Typically:
  ● Give keywords their own tokens.
  ● Give different punctuation symbols their own tokens.
  ● Group lexemes representing identifiers, numeric constants, strings, etc. into their own groups.
  ● Discard irrelevant information (whitespace, comments).
Challenges in Scanning
● How do we determine which lexemes are
associated with each token?
● When there are multiple ways we could
scan the input, how do we know which
one to pick?
● How do we address these concerns
efficiently?
Associating Lexemes with Tokens

Lexemes and Tokens

● Tokens give a way to categorize lexemes by what information they provide.
● Some tokens might be associated with only a single lexeme:
  ● Tokens for keywords like if and while probably only match those lexemes exactly.
● Some tokens might be associated with lots of different lexemes:
  ● All variable names, all possible numbers, all possible strings, etc.

Sets of Lexemes

● Idea: Associate a set of lexemes with each token.
● We might associate the “number” token with the set { 0, 1, 2, …, 10, 11, 12, … }
● We might associate the “string” token with the set { "", "a", "b", "c", … }
● We might associate the token for the keyword while with the set { while }.

How do we describe which (potentially infinite) set of lexemes is associated with each token type?
Formal Languages

● A formal language is a set of strings.
● Many infinite languages have finite descriptions:
  ● Define the language using an automaton.
  ● Define the language using a grammar.
  ● Define the language using a regular expression.
● We can use these compact descriptions of the language to define sets of strings.
● Over the course of this class, we will use all of these approaches.

Regular Expressions

● Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).
● Often provide a compact and human-readable description of the language.
● Used as the basis for numerous software systems, including the flex tool we will use in this course.

Atomic Regular Expressions

● The regular expressions we will use in this course begin with two simple building blocks.
● The symbol ε is a regular expression that matches the empty string.
● For any symbol a, the symbol a is a regular expression that just matches a.

Compound Regular Expressions

● If R1 and R2 are regular expressions, R1R2 is a regular expression representing the concatenation of the languages of R1 and R2.
● If R1 and R2 are regular expressions, R1 | R2 is a regular expression representing the union of R1 and R2.
● If R is a regular expression, R* is a regular expression for the Kleene closure of R.
● If R is a regular expression, (R) is a regular expression with the same meaning as R.

Operator Precedence

● Regular expression operator precedence, from highest to lowest, is
  (R)
  R*
  R1R2
  R1 | R2
● So ab*c|d is parsed as ((a(b*))c)|d

Simple Regular Expressions

● Suppose the only characters are 0 and 1.
● A regular expression for strings containing 00 as a substring:

  (0 | 1)*00(0 | 1)*

  Matching examples: 11011100101, 0000, 11111011110011111

● A regular expression for strings of length exactly four:

  (0|1)(0|1)(0|1)(0|1)   — or, more compactly, (0|1){4}

  Matching examples: 0000, 1010, 1111, 1000

● A regular expression for strings that contain at most one zero:

  1*(0 | ε)1*   — or, more compactly, 1*0?1*

  Matching examples: 11110111, 111111, 0111, 0

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.
● A regular expression for even numbers is

  (+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

  or, using character-class shorthand,

  (+|-)?[0123456789]*[02468]   or   (+|-)?[0-9]*[02468]

  Matching examples: 42, +1370, -3248, -9999912

Matching Regular Expressions
Implementing Regular Expressions

● Regular expressions can be implemented using finite automata.
● There are two main kinds of finite automata:
  ● NFAs (nondeterministic finite automata), which we'll see in a second, and
  ● DFAs (deterministic finite automata), which we'll see later.
● Automata are best explained by example...

A Simple Automaton

(Diagram: from the start state, a transition on " leads to a second state; that state loops to itself on A, B, C, …, Z; a second " leads to a final state drawn with a double circle.)

● Each circle is a state of the automaton. The automaton's configuration is determined by what state(s) it is in.
● The arrows are called transitions. The automaton changes which state(s) it is in by following transitions.
● The automaton takes a string as input and decides whether to accept or reject the string.
● On the input " H E Y A ", the automaton follows the transitions and ends in the double-circled state. The double circle indicates that this state is an accepting state; the automaton accepts the string if it ends in an accepting state.
● On the input " " " " " ", after the first two quotes the automaton is in the accepting state, but there is no transition on " there, so the automaton dies and rejects.
● On the input " A B C, the automaton ends in a state that is not an accepting state, so the automaton rejects.

A More Complex Automaton

(Diagram: an automaton over 0 and 1 in which some states have multiple outgoing transitions on the same symbol.)

● Notice that there are multiple transitions defined here on 0 and 1. If we read a 0 or 1 here, we follow both transitions and enter multiple states.
● On the input 0 1 1 1 0 1, the automaton may therefore be in several states at once after each character. Since we are in at least one accepting state at the end, the automaton accepts.

An Even More Complex Automaton

(Diagram: an automaton over the alphabet a, b, c, with labeled transitions and with ε-transitions out of the start state.)
● These are called ε-transitions. These transitions are followed automatically and without consuming any input.
● On the input b c b a, the automaton first follows the ε-transitions for free, then follows the labeled transitions on each character, again tracking a set of states.
Simulating an NFA

● Keep track of a set of states, initially the start state and everything reachable by ε-moves.
● For each character in the input:
  ● Maintain a set of next states, initially empty.
  ● For each current state:
    – Follow all transitions labeled with the current letter.
    – Add these states to the set of next states.
  ● Add every state reachable by an ε-move to the set of next states.
● Complexity: O(mn²) for strings of length m and automata with n states.

From Regular Expressions to NFAs

● There is a (beautiful!) procedure for converting a regular expression to an NFA.
● Associate each regular expression with an NFA with the following properties:
  ● There is exactly one accepting state.
  ● There are no transitions out of the accepting state.
  ● There are no transitions into the starting state.
● These restrictions are stronger than necessary, but make the construction easier.

Base Cases

● Automaton for ε: a start state with a single ε-transition to an accepting state.
● Automaton for a single character a: a start state with a transition on a to an accepting state.

Construction for R1R2

● Place the automata for R1 and R2 side by side, then add an ε-transition from the accepting state of R1 to the start state of R2. The start state of R1 becomes the overall start state, and the accepting state of R2 becomes the overall accepting state.

Construction for R1 | R2

● Add a new start state with ε-transitions to the start states of R1 and R2, and a new accepting state with ε-transitions into it from the accepting states of R1 and R2.

Construction for R*

● Add a new start state and a new accepting state. Add ε-transitions from the new start state to the start of R and to the new accepting state, from the accepting state of R back to the start of R, and from the accepting state of R to the new accepting state.
Overall Result

● Any regular expression of length n can be converted into an NFA with O(n) states.
● Can determine whether a string of length m matches a regular expression of length n in time O(mn²).
● We'll see how to make this O(m) later (this is independent of the complexity of the regular expression!)
Challenges in Scanning

● How do we determine which lexemes are associated with each token?
● When there are multiple ways we could scan the input, how do we know which one to pick?
● How do we address these concerns efficiently?

Lexing Ambiguities

T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

The input f o r t can be split in many ways: as the single identifier fort, as for followed by the identifier t, as the identifier f followed by ort, and so on.
Conflict Resolution

● Assume all tokens are specified as regular expressions.
● Algorithm: Left-to-right scan.
● Tiebreaking rule one: Maximal munch.
  ● Always match the longest possible prefix of the remaining text.

Implementing Maximal Munch

● Given a set of regular expressions, how can we use them to implement maximal munch?
● Idea:
  ● Convert expressions to NFAs.
  ● Run all NFAs in parallel, keeping track of the last match.
  ● When all automata get stuck, report the last match and restart the search at that point.
Implementing Maximal Munch

T_Do        do
T_Double    double
T_Mystery   [A-Za-z]

(Three automata run in parallel: one matching d o, one matching d o u b l e, and one matching any single letter Σ.)

On the input D O U B D O U B L E, the scanner reads d o and records a match for T_Do; the T_Double automaton stays alive through d o u b, then all automata get stuck on the following character. The scanner therefore reports the last match (T_Do), restarts after do, matches u and b as T_Mystery, and finally matches double as T_Double.
A Minor Simplification

● Build a single automaton that runs all the matching automata in parallel: add a new start state with ε-transitions to the start state of each matching automaton.
● Annotate each accepting state with which automaton it came from.
Merging all automata into a single NFA

● In practice, all NFAs are merged and simulated as a single NFA.
● Accepting states are labeled with the token name.

(Diagram: a combined NFA whose accepting states are labeled IF, EQ, NUM, and ID, built from automata for the keyword if, the operator =, numbers over [0-9], identifiers over [a-z][a-z0-9]*, and a catch-all Σ branch.)
Other Conflicts

T_Do          do
T_Double      double
T_Identifier  [A-Za-z_][A-Za-z0-9_]*

The input d o u b l e matches both T_Double and T_Identifier – maximal munch alone cannot break this tie.

More Tiebreaking

● When two regular expressions apply, choose the one with the greater “priority.”
● Simple priority system: pick the rule that was defined first.
One Last Detail...

● We know what to do if multiple rules match.
● What if nothing matches – that is, what if we cannot reach any accepting state given the current input?
● Trick: Add a “catch-all” rule that matches any character and reports an error.

Summary of Conflict Resolution

● Construct an automaton for each regular expression.
● Merge them into one automaton by adding a new start state.
● Scan the input, keeping track of the last known match.
● Break ties by choosing higher-precedence matches.
● Have a catch-all rule to handle errors.
Challenges in Scanning

● How do we determine which lexemes are associated with each token?
● When there are multiple ways we could scan the input, how do we know which one to pick?
● How do we address these concerns efficiently?

DFAs

● The automata we've seen so far have all been NFAs.
● A DFA is like an NFA, but with tighter restrictions:
  ● Every state must have exactly one transition defined for every letter.
  ● ε-moves are not allowed.
A Sample DFA

(Diagram: four states A, B, C, D over the alphabet {0, 1}, with the transition table:)

        0   1
    A   C   B
    B   D   A
    C   A   D
    D   B   C

Code for DFAs

int kTransitionTable[kNumStates][kNumSymbols] = {
    {0, 0, 1, 3, 7, 1, …},
    …
};
bool kAcceptTable[kNumStates] = {
    false,
    true,
    true,
    …
};
bool simulateDFA(string input) {
    int state = 0;
    for (char ch: input)
        state = kTransitionTable[state][ch];
    return kAcceptTable[state];
}
Speeding up Matching
● In the worst case, an NFA with n states takes time O(mn²) to match a string of length m.
● DFAs, on the other hand, take only O(m): simulateDFA runs in time O(m) on a string of length m.
● There is another (beautiful!) algorithm to convert NFAs to DFAs.

Lexical Specification → Regular Expressions → NFA → DFA → Table-Driven DFA
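As a concrete, self-contained instance of the table-driven scheme (the table below is an invented example, not the one from the slide): a two-state DFA over {'0', '1'} that accepts exactly the strings containing an even number of 1s.

```cpp
#include <string>

// Table-driven DFA simulation, instantiated for a small example:
// state 0 = even number of 1s seen so far (accepting), state 1 = odd.
const int kEvenOddTable[2][2] = {
    {0, 1},  // from state 0: '0' -> 0, '1' -> 1
    {1, 0},  // from state 1: '0' -> 1, '1' -> 0
};
const bool kEvenOddAccept[2] = {true, false};

bool hasEvenOnes(const std::string& input) {
    int state = 0;
    for (char ch : input)
        state = kEvenOddTable[state][ch - '0'];  // map '0'/'1' to column 0/1
    return kEvenOddAccept[state];
}
```

Each input character costs one array lookup, so the whole match is a single O(m) pass.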

Subset Construction
● NFAs can be in many states at once, while DFAs can only be in a single state at a time.
● Key idea: Make the DFA simulate the NFA.
● Have the states of the DFA correspond to sets of states of the NFA.
● Transitions between states of the DFA correspond to transitions between sets of states in the NFA.
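The two core operations of the subset construction can be sketched as follows; the NFA representation and all names here are illustrative assumptions, not taken from the slides.

```cpp
#include <map>
#include <set>
#include <vector>

// A hypothetical NFA representation for sketching the subset construction.
struct NFA {
    // transitions[state] maps a symbol to the set of successor states;
    // epsilon[state] lists states reachable on an ε-move.
    std::vector<std::map<char, std::set<int>>> transitions;
    std::vector<std::set<int>> epsilon;
};

// All states reachable from `states` using only ε-moves.
std::set<int> epsilonClosure(const NFA& nfa, std::set<int> states) {
    std::vector<int> worklist(states.begin(), states.end());
    while (!worklist.empty()) {
        int s = worklist.back();
        worklist.pop_back();
        for (int t : nfa.epsilon[s])
            if (states.insert(t).second)  // newly discovered state
                worklist.push_back(t);
    }
    return states;
}

// One DFA transition: the set of NFA states reachable from `states` on `ch`.
std::set<int> move(const NFA& nfa, const std::set<int>& states, char ch) {
    std::set<int> result;
    for (int s : states) {
        auto it = nfa.transitions[s].find(ch);
        if (it != nfa.transitions[s].end())
            result.insert(it->second.begin(), it->second.end());
    }
    return epsilonClosure(nfa, result);
}
```

The DFA's start state is epsilonClosure of the NFA's start state, and each unexplored set is expanded with move() until no new sets appear.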
From NFA to DFA: a worked example

[Figure sequence: the combined NFA has start state 0 with ε-moves into three branches — states 1–3 matching "do", states 4–10 matching "double", and a catch-all branch (states 11–12) matching any single character in Σ. Running the subset construction step by step produces the DFA

    start {0, 1, 4, 11} —d→ {2, 5, 12} —o→ {3, 6} —u→ {7} —b→ {8} —l→ {9} —e→ {10}

with Σ–d from the start set and Σ–o from {2, 5, 12} leading to {12}, and all remaining unmatched transitions leading to the state for the empty set.]
Modified Subset Construction
● Instead of marking whether a state is accepting, remember which token type it matches.
● Break ties with priorities.
● When using the DFA as a scanner, consider the DFA "stuck" if it enters the state corresponding to the empty set.
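The tie-breaking rule can be sketched as follows: each accepting NFA state carries a (priority, token-kind) pair, and a DFA state — a set of NFA states — matches the highest-priority token among its accepting members. The numbering here is purely illustrative.

```cpp
#include <set>
#include <utility>

// Lower number = higher priority, e.g. 0 = keyword, 1 = identifier,
// 2 = catch-all error rule. Each pair is (priority, token kind).
int tokenFor(const std::set<std::pair<int, int>>& acceptingMembers) {
    if (acceptingMembers.empty()) return -1;  // not an accepting DFA state
    return acceptingMembers.begin()->second;  // set is ordered by priority
}
```

For the DFA state {2, 5, 12} in the worked example, both a keyword branch and the catch-all rule accept, and the keyword wins by priority.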
Performance Concerns
● The NFA-to-DFA construction can introduce exponentially many states.
● Time/memory tradeoff:
  ● A low-memory NFA has higher scan time.
  ● A high-memory DFA has lower scan time.
● Could use a hybrid approach by simplifying the NFA before generating code.

Real-World Scanning: Python

Python Blocks
● Scoping handled by whitespace:

    if w == z:
        a = b
        c = d
    else:
        e = f
    g = h

● What does that mean for the scanner?
Whitespace Tokens
● Special tokens are inserted to indicate changes in levels of indentation:
  ● NEWLINE marks the end of a line.
  ● INDENT indicates an increase in indentation.
  ● DEDENT indicates a decrease in indentation.
● Note that INDENT and DEDENT encode the change in indentation, not the total amount of indentation.
Scanning Python
The example above scans as the following token stream (identifier attributes in parentheses), as if the program had been written with explicit braces:

    if ident(w) == ident(z) : NEWLINE
    INDENT ident(a) = ident(b) NEWLINE
    ident(c) = ident(d) NEWLINE
    DEDENT else : NEWLINE
    INDENT ident(e) = ident(f) NEWLINE
    DEDENT ident(g) = ident(h) NEWLINE
Where to INDENT/DEDENT?
● The scanner maintains a stack of line indentations, keeping track of all indented contexts so far.
● Initially, this stack contains 0, since initially the contents of the file aren't indented.
● On a newline:
  ● See how much whitespace is at the start of the line.
  ● If this value exceeds the top of the stack:
    – Push the value onto the stack.
    – Emit an INDENT token.
  ● Otherwise, while the value is less than the top of the stack:
    – Pop the stack.
    – Emit a DEDENT token.

Source: http://docs.python.org/reference/lexical_analysis.html
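The stack discipline above can be sketched directly. This simplified version takes the indentation of each logical line and emits only the whitespace tokens; real Python additionally handles tabs, blank lines, mismatched dedents, and trailing DEDENTs at end of file.

```cpp
#include <string>
#include <vector>

// Emit NEWLINE/INDENT/DEDENT tokens from per-line indentation amounts,
// following the stack algorithm above (simplified sketch).
std::vector<std::string> indentTokens(const std::vector<int>& lineIndents) {
    std::vector<std::string> tokens;
    std::vector<int> stack = {0};       // initially, not indented
    for (int indent : lineIndents) {
        if (indent > stack.back()) {    // deeper: push and emit one INDENT
            stack.push_back(indent);
            tokens.push_back("INDENT");
        } else {
            while (indent < stack.back()) {  // shallower: pop, emit DEDENTs
                stack.pop_back();
                tokens.push_back("DEDENT");
            }
        }
        tokens.push_back("NEWLINE");    // every logical line ends in NEWLINE
    }
    return tokens;
}
```

Feeding it the indentations of the running example (0, 4, 4, 0, 4, 0) reproduces the INDENT/DEDENT placement shown in the token stream above.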
Interesting Observation
● Normally, more text on a line translates into more tokens.
● With DEDENT, less text on a line often means more tokens:

    if cond1:
        if cond2:
            if cond3:
                if cond4:
                    if cond5:
                        statement1
    statement2

  (Here statement2 is preceded by five DEDENT tokens.)

Summary
● Lexical analysis splits input text into tokens holding a lexeme and an attribute.
● Token classes are sets of strings, often defined with regular expressions.
● Regular expressions can be converted to NFAs and from there to DFAs.
● Maximal munch using an automaton allows for fast scanning.
● Not all tokens come directly from the source code.
Implementing a lexical analyzer
In practice (and for your project), there are two ways:
● Write an ad-hoc analyser.
● Use automatic tools like (F)LEX.
The first approach is more tedious; it is only useful to address specific needs. The second approach is more portable.

Flex
flex is a free implementation of the Unix lex program. flex implements what we have seen:
● It takes regular expressions as input.
● It generates a combined NFA.
● It converts it to an equivalent DFA.
● It minimizes the automaton as much as possible.
● It generates C code that implements it.
● It handles conflicts with the longest-matching-prefix principle and a preference order on the tokens.
More information: http://flex.sourceforge.net/manual/
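A minimal flex specification illustrating these points; the token names and print actions are invented for illustration, not taken from the lecture.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
"while"          { printf("T_While\n"); }
"++"             { printf("T_Increment\n"); }
[a-z][a-z0-9]*   { printf("T_Ident(%s)\n", yytext); }
[0-9]+           { printf("T_Num(%s)\n", yytext); }
[ \t\n]+         { /* skip whitespace */ }
.                { printf("unexpected character: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
```

Because flex uses longest match and, on ties, the earliest rule, "while" scans as T_While while "whilex" scans as a single identifier; building is just `flex scanner.l && cc lex.yy.c`.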
