Lecture II - Lexical Analysis
[Diagram: compiler pipeline — Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Optimization → Machine Code]
Input:  w h i l e ( i p < z ) \n \t + + i p ;
Tokens: T_While ( T_Ident(ip) < T_Ident(z) ) ++ T_Ident(ip)

Even a meaningless input like d o [ f o r ] = n e w 0 ; can be broken into tokens.
The piece of the original program from which we made the token is called a lexeme.
T_While is called a token. You can think of it as an enumerated type representing what logical entity we read out of the source code.
Scanning a Source File

Input: w h i l e ( 1 3 7 < i ) \n \t + + i ;

The first token produced is T_While. Sometimes we will discard a lexeme rather than storing it for later use. Here, we ignore whitespace, since it has no bearing on the meaning of the program.
Tokens produced so far: T_While ( T_IntConst(137)

Some tokens can have attributes that store extra information about the token. Here we store which integer is represented.
Sets of Lexemes
● Idea: Associate a set of lexemes with each token.
● We might associate the “number” token with the set { 0, 1, 2, …, 10, 11, 12, … }.
● We might associate the “string” token with the set { "", "a", "b", "c", … }.
● We might associate the token for the keyword while with the set { while }.

How do we describe which (potentially infinite) set of lexemes is associated with each token type?
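The association between token types and lexeme sets can be sketched directly in code. This is an illustrative fragment, not from the lecture: the rule format and the `classify` helper are my own, and std::regex stands in for the lexeme sets.

```cpp
#include <regex>
#include <string>
#include <vector>

// A token type together with (a description of) its set of lexemes.
struct TokenRule { std::string type; std::regex lexemes; };

// Return the first token type whose lexeme set contains `lexeme`,
// or "?" if no rule matches.
std::string classify(const std::string& lexeme,
                     const std::vector<TokenRule>& rules) {
    for (const auto& r : rules)
        if (std::regex_match(lexeme, r.lexemes)) return r.type;
    return "?";
}
```

Listing the keyword rule before the identifier rule matters: "while" is also a valid identifier, a point the later slides on conflict resolution return to.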
Formal Languages
● A formal language is a set of strings.
● Many infinite languages have finite descriptions:
  ● Define the language using an automaton.
  ● Define the language using a grammar.
  ● Define the language using a regular expression.
● We can use these compact descriptions of the language to define sets of strings.
● Over the course of this class, we will use all of these approaches.

Regular Expressions
● Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).
● Often provide a compact and human-readable description of the language.
● Used as the basis for numerous software systems, including the flex tool we will use in this course.
Simple Regular Expressions
● Suppose the only characters are 0 and 1.
● Here is a regular expression for strings containing 00 as a substring:

    (0 | 1)*00(0 | 1)*

  Matching examples: 11011100101, 0000, 11111011110011111

● Here is a regular expression for strings of length exactly four:

    (0|1)(0|1)(0|1)(0|1)

  or, equivalently:

    (0|1){4}

  Matching examples: 0000, 1010, 1111, 1000
Simple Regular Expressions
● Suppose the only characters are 0 and 1.
● Here is a regular expression for strings that contain at most one zero:

    1*(0 | ε)1*

  Matching examples: 11110111, 111111, 0111, 0
Applied Regular Expressions
● Suppose that our alphabet is all ASCII characters.
● A regular expression for even numbers is

    (+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

  or, written with character classes,

    (+|-)?[0123456789]*[02468]

  Matching examples: 42, +1370, -3248, -9999912
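As a sanity check, the even-number expression can be tried with std::regex. This is a hedged sketch, not part of the lecture: the character-class form is translated into ECMAScript regex syntax, and `isEven` is my own name for the helper.

```cpp
#include <regex>
#include <string>

// True if `s` is matched by the even-number regular expression:
// optional sign, any digits, and a final even digit.
bool isEven(const std::string& s) {
    static const std::regex even("[+-]?[0-9]*[02468]");
    return std::regex_match(s, even);
}
```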
Implementing Regular Expressions
● Regular expressions can be implemented using finite automata.
● There are two main kinds of finite automata:
  ● NFAs (nondeterministic finite automata), which we'll see in a second, and
  ● DFAs (deterministic finite automata), which we'll see later.
● Automata are best explained by example...

A Simple Automaton
[Diagram: a start state, a transition on " into a middle state with a self-loop on A, B, C, ..., Z, and a transition on " into an accepting state.]

Each circle is a state of the automaton. The automaton's configuration is determined by what state(s) it is in. These arrows are called transitions. The automaton changes which state(s) it is in by following transitions.
A Simple Automaton
On the input " " " ", after the automaton has read a complete quoted string, there is no transition on " out of the accepting state, so the automaton dies and rejects.

On the input " A B C ", the automaton follows a transition on each character and finishes in the accepting state, so it accepts.
On the input " A B C (with no closing quote), the input ends while the automaton is not in an accepting state, so the automaton rejects.

A More Complex Automaton
[Diagram: an NFA over {0, 1} with transitions labeled 0, 1, and 0, 1.]

Notice that there are multiple transitions defined here on 0 and 1. If we read a 0 or 1 here, we follow both transitions and enter multiple states.
[Animation: running the NFA on the input 0 1 1 1 0 1, tracking the set of active states after each character.]
An Even More Complex Automaton
[Diagram: an NFA over {a, b, c} with ε-transitions; sample input b c b a.]

These are called ε-transitions. These transitions are followed automatically and without consuming any input.
Simulating an NFA
● Keep track of a set of states, initially the start state and everything reachable by ε-moves.
● For each character in the input:
  ● Maintain a set of next states, initially empty.
  ● For each current state:
    – Follow all transitions labeled with the current letter.
    – Add these states to the set of new states.
  ● Add every state reachable by an ε-move to the set of next states.
● Complexity: O(mn²) for strings of length m and automata with n states.

From Regular Expressions to NFAs
● There is a (beautiful!) procedure for converting a regular expression to an NFA.
● Associate each regular expression with an NFA with the following properties:
  ● There is exactly one accepting state.
  ● There are no transitions out of the accepting state.
  ● There are no transitions into the starting state.
● These restrictions are stronger than necessary, but make the construction easier.
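The simulation above can be sketched as follows, assuming the NFA is encoded with transition maps. The encoding and names are illustrative, not the lecture's code.

```cpp
#include <map>
#include <set>
#include <string>

using State = int;
using StateSet = std::set<State>;

struct NFA {
    std::map<std::pair<State, char>, StateSet> trans;  // labeled transitions
    std::map<State, StateSet> eps;                     // ε-transitions
    State start = 0;
    StateSet accepting;
};

// Expand a state set with everything reachable by ε-moves.
StateSet closure(const NFA& nfa, StateSet states) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (State s : StateSet(states)) {   // iterate over a snapshot
            auto it = nfa.eps.find(s);
            if (it == nfa.eps.end()) continue;
            for (State t : it->second)
                if (states.insert(t).second) changed = true;
        }
    }
    return states;
}

bool accepts(const NFA& nfa, const std::string& input) {
    StateSet current = closure(nfa, {nfa.start});
    for (char ch : input) {
        StateSet next;                       // set of next states, initially empty
        for (State s : current) {            // follow all transitions on ch
            auto it = nfa.trans.find({s, ch});
            if (it != nfa.trans.end())
                next.insert(it->second.begin(), it->second.end());
        }
        current = closure(nfa, next);        // then follow ε-moves
    }
    for (State s : current)
        if (nfa.accepting.count(s)) return true;
    return false;
}
```

Each input character touches at most n states and each state's outgoing edges, giving the O(mn²) bound from the slide.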
Automaton for ε:
[Diagram: a start state with an ε-transition to the accepting state.]

Automaton for a single character a:
[Diagram: a start state with an a-transition to the accepting state.]

Construction for R1 R2:
[Diagram: connect the accepting state of R1's automaton to the start state of R2's automaton with an ε-transition.]

Construction for R1 | R2:
[Diagram: a new start state with ε-transitions into the automata for R1 and R2, and ε-transitions from each of their accepting states into a new common accepting state.]

Construction for R*:
[Diagram: a new start state with an ε-transition to a new accepting state; ε-transitions lead into R's automaton and from R's accepting state back around, so that R can be traversed zero or more times.]
Lexing Ambiguities
T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

Given the input f o r t, we could scan it as T_For followed by an identifier t, or as the single identifier fort.

Conflict Resolution
● Assume all tokens are specified as regular expressions.
● Algorithm: Left-to-right scan.
● Tiebreaking rule one: Maximal munch.
  ● Always match the longest possible prefix of the remaining text.
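A maximal-munch scan over regex-specified tokens might look like this sketch. It uses std::regex rather than the hand-built automata the lecture develops, and the `scan` helper and rule format are my own assumptions.

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Token { std::string type, lexeme; };

// Rules pair a token type with a regular expression, listed in priority
// order; the longest match wins, and ties go to the earlier rule.
using Rules = std::vector<std::pair<std::string, std::regex>>;

std::vector<Token> scan(const std::string& input, const Rules& rules) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t bestLen = 0;
        std::string bestType;
        std::string rest = input.substr(pos);
        for (const auto& [type, re] : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position
            if (std::regex_search(rest, m, re,
                                  std::regex_constants::match_continuous)
                && static_cast<std::size_t>(m.length(0)) > bestLen) {
                bestLen = static_cast<std::size_t>(m.length(0));
                bestType = type;
            }
        }
        if (bestLen == 0)                  // nothing matched; see "catch-all" later
            throw std::runtime_error("no rule matches");
        tokens.push_back({bestType, input.substr(pos, bestLen)});
        pos += bestLen;
    }
    return tokens;
}
```

On the input fort, the identifier rule's four-character match beats the keyword rule's three-character match, so the whole input becomes one identifier token.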
Implementing Maximal Munch
T_Do        do
T_Double    double
T_Mystery   [A-Za-z]

[Animation: the automata for T_Do, T_Double, and T_Mystery (transition on any Σ) run in parallel, character by character, on the inputs D O U B and D O U B L E.]
A Minor Simplification
Build a single automaton that runs all the matching automata in parallel.

[Diagram: the do, double, and Σ automata merged into one NFA by ε-transitions from a common start state.]

Annotate each accepting state with which automaton it came from.
[Diagram: a combined automaton whose accepting states are annotated with token types, e.g. NUM after digits and ID after letters.]
Other Conflicts
T_Do          do
T_Double      double
T_Identifier  [A-Za-z_][A-Za-z0-9_]*

On the input d o u b l e, maximal munch alone does not settle things: both T_Double and T_Identifier match the entire input. Break ties by choosing higher-precedence matches.

One Last Detail...
● We know what to do if multiple rules match.
● What if nothing matches?
● Trick: Add a “catch-all” rule that matches any character and reports an error.
Challenges in Scanning
● How do we determine which lexemes are associated with each token?
● When there are multiple ways we could scan the input, how do we know which one to pick?
● How do we address these concerns efficiently?
Code for DFAs
[Diagram: a four-state DFA over {0, 1} with states A, B, C, D.]

    int kTransitionTable[kNumStates][kNumSymbols] = {
        {0, 0, 1, 3, 7, 1, …},
        …
    };
    bool kAcceptTable[kNumStates] = {
        false,
        true,
        true,
        …
    };
    bool simulateDFA(string input) {
        int state = 0;
        for (char ch: input)
            state = kTransitionTable[state][ch];
        return kAcceptTable[state];
    }

Runs in time O(m) on a string of length m.

Speeding up Matching
● In the worst case, an NFA with n states takes time O(mn²) to match a string of length m.
● DFAs, on the other hand, take only O(m).
● There is another (beautiful!) algorithm to convert NFAs to DFAs.

Lexical Specification → Regular Expressions → NFA → DFA → Table-Driven DFA
From NFA to DFA
[Animation: applying the subset construction to the combined NFA for do, double, and Σ. The DFA's start state is the ε-closure {0, 1, 4, 11}. Reading d leads to {2, 5, 12}; any other character (Σ–d) leads to {12}. From {2, 5, 12}, reading o leads to {3, 6}; then u, b, l, e lead through {7}, {8}, {9}, {10}.]
Modified Subset Construction
● Instead of marking whether a state is accepting, remember which token type it matches.
● Break ties with priorities.
● When using the DFA as a scanner, consider the DFA “stuck” if it enters the state corresponding to the empty set.

[Diagram: the resulting DFA: start {0, 1, 4, 11} --d--> {2, 5, 12} --o--> {3, 6} --u--> {7} --b--> {8} --l--> {9} --e--> {10}, with the remaining transitions (Σ–d, Σ–o, Σ–u, Σ–b, Σ–l, Σ–e) falling out of the match.]
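One step of the subset construction can be sketched as follows, reusing the state numbering of the do/double example. The encoding is an assumption (and only the edges needed for the example are filled in), not the lecture's code.

```cpp
#include <map>
#include <set>

using State = int;
using StateSet = std::set<State>;
using Trans = std::map<std::pair<State, char>, StateSet>;
using Eps   = std::map<State, StateSet>;

// Everything reachable from `states` by ε-moves alone.
StateSet closure(const Eps& eps, StateSet states) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (State s : StateSet(states)) {   // iterate over a snapshot
            auto it = eps.find(s);
            if (it == eps.end()) continue;
            for (State t : it->second)
                if (states.insert(t).second) changed = true;
        }
    }
    return states;
}

// One DFA transition: each DFA state is a *set* of NFA states; on input
// ch, union all ch-successors, then close under ε-moves.
StateSet dfaMove(const Trans& trans, const Eps& eps,
                 const StateSet& from, char ch) {
    StateSet next;
    for (State s : from) {
        auto it = trans.find({s, ch});
        if (it != trans.end())
            next.insert(it->second.begin(), it->second.end());
    }
    return closure(eps, next);
}
```

Iterating `dfaMove` from the start set and recording each new set as a DFA state reproduces the construction animated above.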
Performance Concerns
● The NFA-to-DFA construction can introduce exponentially many states.
● Time/memory tradeoff:
  ● Low-memory NFA has higher scan time.
  ● High-memory DFA has lower scan time.
● Could use a hybrid approach by simplifying the NFA before generating code.

Real-World Scanning: Python
Python Blocks
● Scoping handled by whitespace:

    if w == z:
        a = b
        c = d
    else:
        e = f
        g = h

● What does that mean for the scanner?
Scanning Python: Whitespace Tokens
● Special tokens inserted to indicate changes in levels of indentation.
● NEWLINE marks the end of a line.
● INDENT indicates an increase in indentation.
● DEDENT indicates a decrease in indentation.
● Note that INDENT and DEDENT encode a change in indentation, not the total amount of indentation.
Scanning Python
The example above scans as:

    if ident(w) == ident(z) : NEWLINE
    INDENT ident(a) = ident(b) NEWLINE
    ident(c) = ident(d) NEWLINE
    DEDENT else : NEWLINE
    INDENT ident(e) = ident(f) NEWLINE
    ident(g) = ident(h) NEWLINE
    DEDENT

which corresponds to the brace-and-semicolon program:

    if (w == z) {
        a = b;
        c = d;
    } else {
        e = f;
        g = h;
    }
Scanning Python
Where to INDENT/DEDENT?
if w == z: { if ident == ident :
a = b; w z
c = d; ● Scanner maintains a stack of line indentations
} else { keeping track of all indented contexts so far.
{ ident = ident ;
e = f; ● Initially, this stack contains 0, since initially the
} a b
contents of the file aren't indented.
g = h;
ident = ident ; ● On a newline:
c d ● See how much whitespace is at the start of the line.
● If this value exceeds the top of the stack:
} else :
– Push the value onto the stack.
– Emit an INDENT token.
{ ident = ident ; ● Otherwise, while the value is less than the top of the stack:
e f – Pop the stack.
– Emit a DEDENT token.
} ident = ident ;
g h Source: http://docs.python.org/reference/lexical_analysis.html
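The indentation-stack algorithm above can be sketched as follows. This is a simplified model that sees only the leading-whitespace width of each logical line; the names are mine, and details of the real Python scanner (tab handling, blank lines, final DEDENTs at end of file) are omitted.

```cpp
#include <string>
#include <vector>

// Emit INDENT/DEDENT tokens from the leading-whitespace widths of
// successive logical lines, maintaining a stack of indentation levels.
std::vector<std::string> indentTokens(const std::vector<int>& widths) {
    std::vector<int> stack = {0};          // initially, nothing is indented
    std::vector<std::string> tokens;
    for (int w : widths) {
        if (w > stack.back()) {            // deeper than the top: one INDENT
            stack.push_back(w);
            tokens.push_back("INDENT");
        } else {
            while (w < stack.back()) {     // shallower: one DEDENT per level
                stack.pop_back();
                tokens.push_back("DEDENT");
            }
        }
    }
    return tokens;
}
```

Because the stack can hold several nested levels, one short line can close many blocks at once, which is exactly the observation the next slide makes.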
Interesting Observation
● Normally, more text on a line translates into more tokens.
● With DEDENT, less text on a line often means more tokens:

    if cond1:
        if cond2:
            if cond3:
                if cond4:
                    if cond5:
                        statement1
    statement2

Summary
● Lexical analysis splits input text into tokens holding a lexeme and an attribute.
● Lexemes are sets of strings often defined with regular expressions.
● Regular expressions can be converted to NFAs and from there to DFAs.
● Maximal-munch using an automaton allows for fast scanning.
● Not all tokens come directly from the source code.