
CHAPTER 7

LEXICAL ANALYSIS

1
Outline

• Informal sketch of lexical analysis
– Identifies tokens in input string
• Issues in lexical analysis
– Lookahead
– Ambiguities
• Specifying lexers
– Regular expressions
– Examples of regular expressions

2
Recall: The Structure of a Compiler

Source → Lexical analysis → Tokens → Parsing → Interm. Language →
Optimization → Code Gen. → Machine code

Today we start with lexical analysis.

3
The Role of the Lexical Analyzer
• The lexical analyzer is the first phase of a compiler.
• Its main task is to read input characters and
produce as output a sequence of tokens that
the parser uses for syntax analysis.

Winter 2007 SEG2101 Chapter 8 4


Issues in Lexical Analysis

• There are several reasons for separating the
analysis phase of compiling into lexical analysis
and parsing:
– Simpler design
– Compiler efficiency
– Compiler portability
• Specialized tools have been designed to help
automate the construction of lexical analyzers
and parsers when they are separated.

5
Lexical Analysis

• What do we want to do? Example:

if (i == j)
z = 0;
else
z = 1;

• The input is just a sequence of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition input string into substrings
– And classify them according to their role

6
What’s a Token?

• Output of lexical analysis is a stream of tokens
• A token is a syntactic category
– In English:
noun, verb, adjective, …
– In a programming language:
Identifier, Integer, Keyword, Whitespace, …
• Parser relies on the token distinctions:
– E.g., identifiers are treated differently than keywords

7
Example of Tokens

• Tokens correspond to sets of strings.
• Identifier: strings of letters or digits,
starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks,
newlines, and tabs
• OpenPar: a left parenthesis

8
Lexemes

• A lexeme is a sequence of characters in the
source program that is matched by the pattern
for a token.
• A lexeme is a basic lexical unit of a language
comprising one or several words, the elements of
which do not separately convey the meaning of
the whole.
• The lexemes of a programming language include its
identifiers, literals, operators, and special words.
• A token of a language is a category of its
lexemes.

9
Pattern

• A pattern is a rule describing the set of
lexemes that can represent a particular
token in the source program.
• The set of strings is described by a rule,
called the pattern, associated with each token.
The pattern must match each string in the
set.

10
Examples of Tokens

const pi = 3.1416;
The substring pi is a lexeme for the token “identifier.”

11
Lexeme and Token
Index = 2 * count + 17;

Lexemes    Tokens
Index      identifier
=          equal_sign
2          int_literal
*          mult_op
count      identifier
+          plus_op
17         int_literal
;          semicolon

12
Attributes of Tokens

• When a pattern matches more than one
lexeme, the lexical analyzer must provide
additional information about the particular
lexeme.
– For example, the pattern num matches both
the strings 10 and 15.
• The lexical analyzer gathers information about
tokens into their associated attributes.
• A token has a single attribute, a pointer to
the symbol table entry in which information
about the token is kept.

13
• For example, the tokens with associated
attributes for the following C statement
E = M * C ^ 2
are written as
– <id, pointer to symbol table entry for E>
– <assign_op>
– <id, pointer to symbol table entry for M>
– <mult_op>
– <id, pointer to symbol table entry for C>
– <exp_op>
– <num, integer value 2>

14
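The token–attribute pairs above can be sketched in code. The following is a minimal illustrative scanner (not the slides’ implementation; all names are made up) that emits <token, attribute> pairs for E = M * C ^ 2, using a dictionary as the symbol table so repeated identifiers share one entry:

```python
def tokenize(stmt):
    symtab = {}   # lexeme -> symbol-table index (stands in for a pointer)
    tokens = []
    i = 0
    while i < len(stmt):
        c = stmt[i]
        if c.isspace():
            i += 1
        elif c.isalpha():          # identifier: attribute is symbol-table entry
            j = i
            while j < len(stmt) and stmt[j].isalnum():
                j += 1
            lexeme = stmt[i:j]
            entry = symtab.setdefault(lexeme, len(symtab))
            tokens.append(("id", entry))
            i = j
        elif c.isdigit():          # number: attribute is its integer value
            j = i
            while j < len(stmt) and stmt[j].isdigit():
                j += 1
            tokens.append((int and "num", int(stmt[i:j])))
            i = j
        else:                      # single-character operators, no attribute
            op = {"=": "assign_op", "*": "mult_op", "^": "exp_op"}[c]
            tokens.append((op, None))
            i += 1
    return tokens, symtab

tokens, symtab = tokenize("E = M * C ^ 2")
```

Running it yields exactly the seven pairs listed above, with E, M, and C assigned distinct symbol-table entries.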
Lexical Analyzer: Implementation

• An implementation must do two things:

1. Recognize substrings corresponding to tokens

2. Return the value or lexeme of the token


– The lexeme is the substring

15
Example

• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Token-lexeme pairs returned by the lexer:
– (Whitespace, “\t”)
– (Keyword, “if”)
– (OpenPar, “(”)
– (Identifier, “i”)
– (Relation, “==”)
– (Identifier, “j”)
– …

16
Secondary Tasks of the Lexical Analyzer

• The lexer usually discards “uninteresting”
tokens that don’t contribute to parsing.
• Examples: Whitespace, Comments

17
Specification & Recognition

• Regular expressions = specification
• Finite automata = implementation/recognition
• Lexical analysis implementation: constructing FAs and a
simulator for the FAs

Lexical Analyzer Generator

RE specification → NFA → DFA → Recognition
Specification of Tokens

A Formal Specification for Tokens or Patterns


- Strings and Languages
- Regular Expressions & Definitions
Strings and Language

• alphabet (or character class) (∑): any finite set of symbols
• string (S) over some alphabet: a finite sequence of
symbols drawn from that alphabet ∑
• length of string S, |S|: number of symbols in S
• empty string (ε): a special string of length zero
• Prefix of S: a string obtained by removing zero or more
trailing characters of string S.
– Example: S = “compiler”; a prefix of S is “com”
• Suffix of S: a string obtained by removing zero or more
leading characters of string S.
– Example: S = “compiler”; a suffix of S is “ler”
• Substring of S: a string obtained by deleting a prefix and
a suffix from S.
– Example: S = “compiler”; a substring of S is “pi”
Strings and Language

• language: any set of strings over some alphabet ∑
• For example:
1. A language defined over the alphabet Σ = {a, b} is
L = {a^n b^n | n > 0}
Thus L consists of strings like:
L = {ab, aabb, aaabbb, aaaabbbb, …}
2. A language defined over the alphabet Σ = {0, 1} is
L = {w | w has an equal number of 0’s and 1’s}. Thus
the set of strings of this language is
L = {ε, 01, 10, 0101, 1100, 0011, 1001, …}
Operations on Strings

• Concatenation: xy
– sε = εs = s
– x = “lexical”, y = “analyzer” => xy = “lexicalanalyzer”
• Exponentiation: s^i = s^(i-1) s (s^0 = ε)
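The two string operations above map directly onto code. A tiny sketch (the function name is illustrative):

```python
def power(s, i):
    """String exponentiation: s^i = s^(i-1) s, with s^0 = ε (empty string)."""
    return "" if i == 0 else power(s, i - 1) + s

# Concatenation is just +:
x, y = "lexical", "analyzer"
```

Here x + y gives "lexicalanalyzer", and power("ab", 3) gives "ababab".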
Operations on Languages

• Union
– {s | s is in L or s is in M}
• Concatenation
– {st | s is in L and t is in M}
• Kleene closure: zero or more concatenations
– L*: union of L^i (i = 0 … infinity)
– L^0 = {ε}, L^i = L^(i-1) L
• Positive closure: one or more concatenations
– L+: union of L^i (i = 1 … infinity)
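These operations can be made concrete on small finite languages. A sketch (function names are made up; since L* is infinite, the closure is truncated to strings of bounded length):

```python
def concat(L, M):
    """Concatenation: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def kleene_upto(L, max_len):
    """All strings of L* no longer than max_len (L* itself is infinite)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {s + t for s in frontier for t in L
                    if len(s + t) <= max_len} - result
        result |= frontier
    return result

L, M = {"a"}, {"b", "c"}
```

For example, L | M is the union {a, b, c}, concat(L, M) is {ab, ac}, and kleene_upto(L, 3) enumerates {ε, a, aa, aaa}; dropping ε from it gives the corresponding slice of the positive closure L+.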
Examples

• L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}
• Union
– L ∪ D = {letters and digits of length 1}
• Concatenation
– LD = {a letter followed by a digit} (= {A0, A1, …, B0, …})
– L^4 = {4-letter strings} (= {AAAA, AABC, BBBB, …})
• Kleene closure
– L* = {all strings of letters of length zero (i.e., ε) or more}
– L(L ∪ D)* = {all strings of letters and digits, starting with a
letter}
• Positive closure
– D+ = {strings of one or more digits}
Regular Expression (R.E.)

• A Formal Specification for Tokens

Regular Expression:
• Regular expressions describe or represent sets of
strings.
• The recursive definition of regular expressions over ∑
is:
– The empty set (∅) and the empty string (ε) are regular
expressions over ∑.
– Every symbol a ∈ Σ is a regular expression over Σ.
– If R1 and R2 are regular expressions over Σ, then the following
are regular expressions:
• (R1 + R2): ‘+’ denotes alternation (ORing).
• (R1 · R2): ‘·’ denotes concatenation.
• (R1)*: ‘*’ denotes closure.
– Regular expressions are only those that are recursively
obtained by applying the above rules.
Operations on Regular Expressions

• Union
– The union of two languages L1 and L2 is the set of strings that are in L1, in L2,
or in both.
– It is denoted L1 + L2 or L1 ∪ L2.
e.g. L1 = {01, 11} and L2 = {111, 10}
Then L1 + L2 = {01, 11, 111, 10}
• Concatenation
– The concatenation of two languages L1 and L2 is the set of strings formed by
concatenating any string in L1 with any string in L2. It is denoted L1·L2 or L1 L2.
e.g. L1 = {01, 11} and L2 = {111, 10}
Then L1·L2 = {01111, 0110, 11111, 1110}
• Kleene Closure
– The closure of a language L includes all strings formed by taking any number of
strings from L, repeating them 0 or more times, and concatenating them.
It is denoted L*.
e.g. L = {0, 1} Then L* = {ε, 0, 1, 10, 01, 111, 000, 010, …}
• Positive Closure
– Like the Kleene closure, but strings from L are repeated 1 or more times.
It is denoted L+.
e.g. L = {0, 1} Then L+ = {0, 1, 10, 01, 111, 000, 010, …}
Regular Definitions

• If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
• where each di is a distinct name, and each ri is a
regular expression over the symbols in Σ ∪ {d1, d2, …, d(i-1)},
i.e., the basic symbols and the previously defined
names.

34
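A regular definition can be mimicked with Python’s re module by building each named expression out of the previously defined names, as a quick sketch (the names letter, digit, and ident are illustrative, in the style of the classic letter (letter | digit)* definition for identifiers):

```python
import re

# Each definition may use previously defined names, as in a regular definition:
letter = "[A-Za-z]"
digit  = "[0-9]"
ident  = f"{letter}({letter}|{digit})*"   # letter (letter | digit)*

id_re = re.compile(ident)
```

With this, id_re.fullmatch("x25") succeeds while id_re.fullmatch("2x") fails, since an identifier must start with a letter.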
Examples of Regular Definitions

35
Recognition of Tokens

36
Finite Automata

• A recognizer for a language is a program that
takes as input a string x and answers “yes” if x
is a sentence of the language and “no”
otherwise.
• We compile a regular expression into a
recognizer by constructing a generalized
transition diagram called a finite automaton.
• A finite automaton can be deterministic or
nondeterministic, where nondeterministic
means that more than one transition out of a
state may be possible on the same input
symbol.

37
Finite Automata

• A finite automaton consists of
– An input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions: state →(input) state

38
Finite Automata

• Graphical representation:
– State transition diagram
• Implementation:
– State transition table
• Deterministic (DFA)
– A single transition for every state on every input symbol
• Non-deterministic (NFA)
– More than one transition for at least one state
on some input symbol
Finite Automata

• Transition
s1 →(a) s2
• is read:
in state s1, on input “a”, go to state s2
• If end of input (or no transition possible):
– If in an accepting state => accept
– Otherwise => reject

40
Finite Automata State Graphs

[Figure: graph notation — a circle is a state; an arrow into a circle marks
the start state; a double circle is an accepting state; an edge labeled “a”
is a transition on input a.]

41
Finite Automata

• Nondeterministic finite automata
– No restrictions on the labels of their edges
– Several edges may leave the same state on the same symbol
– ε, the empty string, is a possible label
• Deterministic finite automata
– For each state and each symbol of the input
alphabet, exactly one edge with that symbol leaves the
state.

42
NFA: Nondeterministic Finite Automata

• An NFA consists of
– S: a finite set of states
– Σ: a finite set of input symbols
– δ: a transition function that maps (state,
symbol) pairs to sets of states
– s0: a state distinguished as the start state
– F: a set of states distinguished as final states
Transition Diagram (NFA)

(a | b)*abb

[Figure: NFA with states 0 (start), 1, 2, 3 (final); a self-loop on 0
labeled a and b, then edges 0 →(a) 1 →(b) 2 →(b) 3.]

• States: {0 (start), 1, 2, 3 (final)}
• Input symbols: {a, b}
• NFA transition function:
δ(0, a) = {0, 1}, δ(0, b) = {0}
δ(1, b) = {2}, δ(2, b) = {3}
Acceptance of NFA

• An NFA accepts an input string s iff there is some
path in the transition diagram from the start state
to some final state such that the edge labels along
this path spell out s.

Example:
• bbababb is accepted by (a|b)*abb
• bbabab is NOT
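One standard way to check acceptance is to simulate the NFA directly, tracking the set of states the automaton could be in after each symbol. A minimal sketch for the (a|b)*abb NFA, using the transition function given on the earlier slide (function and variable names are illustrative):

```python
# Transition function of the NFA for (a|b)*abb:
delta = {(0, "a"): {0, 1}, (0, "b"): {0},
         (1, "b"): {2},     (2, "b"): {3}}

def nfa_accepts(s, start=frozenset({0}), final=frozenset({3})):
    states = set(start)
    for c in s:
        # take every possible transition from every current state
        states = set().union(*(delta.get((q, c), set()) for q in states))
    return bool(states & final)
```

This accepts bbababb and rejects bbabab, matching the example above.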
NFA with ε-transitions

aa* | bb*

[Figure: start state 0 with ε-edges to states 1 and 3; edge 1 →(a) 2 with
a self-loop on 2 labeled a; edge 3 →(b) 4 with a self-loop on 4 labeled b;
states 2 and 4 are final.]

• NFA transition function:
δ(0, ε) = {1, 3}, δ(1, a) = {2}, δ(2, a) = {2}
δ(3, b) = {4}, δ(4, b) = {4}
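Simulating an ε-NFA adds one step: after every move, take the ε-closure (all states reachable by ε-edges alone). A sketch for the aa* | bb* automaton above, with the ε-edge encoded under the key "eps" (names are illustrative):

```python
# Transition function of the ε-NFA for aa* | bb*; "eps" marks ε-edges.
delta = {(0, "eps"): {1, 3},
         (1, "a"): {2}, (2, "a"): {2},
         (3, "b"): {4}, (4, "b"): {4}}

def eps_closure(states):
    """All states reachable from `states` using only ε-edges."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, "eps"), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def enfa_accepts(s, start=frozenset({0}), final=frozenset({2, 4})):
    states = eps_closure(start)
    for c in s:
        states = eps_closure(set().union(
            *(delta.get((q, c), set()) for q in states)))
    return bool(states & final)
```

So "aaa" and "b" are accepted, while "ab" and the empty string are not, as aa* | bb* requires.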
Deterministic Finite Automata

• A DFA is a special case of an NFA in which
– no state has an ε-transition
– for each state s and input symbol a, there is at
most one edge labeled a leaving s
Transition Diagram

A DFA for (a | b)*abb

[Figure: DFA with states 0 (start), 1, 2, 3 (final) and transitions
0 →(a) 1, 0 →(b) 0, 1 →(a) 1, 1 →(b) 2, 2 →(a) 1, 2 →(b) 3,
3 →(a) 1, 3 →(b) 0.]
Example of FA and Transition Diagrams

r = (abc)+

[Figure: DFA with start state q0 and transitions q0 →(a) q1 →(b) q2 →(c) q3,
plus q3 →(a) q1; q3 is the final state. A circle is a state, an arrow is a
transition, an incoming arrow marks the start state, and a double circle
marks a final state.]
FA and Transition Tables

            inputs
states    a     b     c
  q0      q1    -     -
  q1      -     q2    -
  q2      -     -     q3
  q3      q1    -     -

NextState = Move( CurrentState, Input )


Recognition

state = 0;
while ( (c = next_char()) != EOF ) {
    switch (state) {
    case 0: if (c == 'a') state = 1; else return FALSE; break;
    case 1: if (c == 'b') state = 2; else return FALSE; break;
    case 2: if (c == 'c') state = 3; else return FALSE; break;
    case 3: if (c == 'a') state = 1;
            else { ungetchar(); return TRUE; }
            break;
    default: error();
    }
}
return (state == 3) ? TRUE : FALSE;
Finite Automata for the Lexical Tokens

[Figures: one automaton per token class —
IF: 1 →(i) 2 →(f) 3 (final);
ID: 1 →(a-z) 2 (final), with a self-loop on 2 labeled a-z and 0-9;
NUM: 1 →(0-9) 2 (final), with a self-loop on 2 labeled 0-9;
White space (and comment starting with ‘--’): loops on blank, tab, and
newline, plus a ‘--’ branch consuming characters up to ‘\n’.]
Regular expressions for tokens

if                                   {return IF;}
[a-z][a-z0-9]*                       {return ID;}
[0-9]+                               {return NUM;}
([0-9]+ “.” [0-9]*) | (“.” [0-9]+)   {return REAL;}
(“--” [a-z]* “\n”) | (“ ” | “\n” | “\t”)+   {/* do nothing */}
Recognition of Regular Expressions Using DFA

• Simulating a Deterministic Finite Automaton (DFA):
– initialization:
• current_state = s0; input_symbol = first symbol
– while (current_state is not the fail state &&
input_symbol != EOF)
• next_state = δ(current_state, input_symbol)
• current_state = next_state
• input_symbol = next input symbol
– if (current_state is in final states) accept() else fail()
Simulating a DFA

Input. An input string ended with eof, and a DFA with start state s0 and final
states F.
Output. The answer “yes” if the DFA accepts the input string, “no” otherwise.
begin
s := s0;
c := nextchar;
while c <> eof do begin
s := move(s, c); // transition function
c := nextchar
end;
if s is in F then return “yes”
else return “no”
end.
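The algorithm above is a few lines of executable code. A sketch for the DFA recognizing (a | b)*abb, with the move table as reconstructed from the transition diagram (names are illustrative):

```python
# move(state, symbol) for the DFA recognizing (a|b)*abb; the table is
# total over the alphabet {a, b}, so no fail state is needed here.
move = {(0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 1, (2, "b"): 3,
        (3, "a"): 1, (3, "b"): 0}

def dfa_accepts(s, start=0, final=frozenset({3})):
    state = start
    for c in s:
        state = move[(state, c)]
    return state in final
```

This reproduces the worked example that follows: bbababb is accepted, bbabab is not.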
DFA: An Example

(a | b)*abb

[Figure: DFA with states 0 (start), 1, 2, 3 (final) and transitions
0 →(a) 1, 0 →(b) 0, 1 →(a) 1, 1 →(b) 2, 2 →(a) 1, 2 →(b) 3,
3 →(a) 1, 3 →(b) 0.]
An Example

bbababb:                    bbabab:

s = 0                       s = 0
s = move(0, b) = 0          s = move(0, b) = 0
s = move(0, b) = 0          s = move(0, b) = 0
s = move(0, a) = 1          s = move(0, a) = 1
s = move(1, b) = 2          s = move(1, b) = 2
s = move(2, a) = 1          s = move(2, a) = 1
s = move(1, b) = 2          s = move(1, b) = 2
s = move(2, b) = 3          s is not in {3} => reject
s is in {3} => accept
Recognition of Regular Expressions

• Simulating an NFA is harder than simulating a DFA
• Constructing an NFA is easier than constructing a
DFA
– Construct NFA
– Construct equivalent DFA
– (optional) State minimization
– Simulate DFA
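The "construct equivalent DFA" step is the subset construction: each DFA state is a set of NFA states. A compact sketch for the (a|b)*abb NFA from the earlier slides (function and variable names are illustrative):

```python
# NFA transition function for (a|b)*abb, as given earlier.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0},
       (1, "b"): {2},     (2, "b"): {3}}

def subset_construction(start, alphabet):
    """Build DFA transitions whose states are frozensets of NFA states."""
    d0 = frozenset(start)
    trans, seen, worklist = {}, {d0}, [d0]
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            T = frozenset(set().union(*(nfa.get((q, c), set()) for q in S)))
            trans[(S, c)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return d0, trans

d0, trans = subset_construction({0}, "ab")

def dfa_accepts(s, nfa_finals=frozenset({3})):
    S = d0
    for c in s:
        S = trans[(S, c)]
    return bool(S & nfa_finals)   # accept if any NFA final state is in S
```

The construction yields four DFA states ({0}, {0,1}, {0,2}, {0,3}), matching the 4-state DFA shown earlier, and the resulting DFA agrees with the NFA on bbababb (accept) and bbabab (reject).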
The Lexical-Analyzer Generator Lex

• Lex is a tool that allows one to specify a lexical
analyzer by writing regular expressions to
describe patterns for tokens.
• The input notation for the Lex tool is referred to
as the Lex language, and the tool itself is the Lex
compiler.

59
Creating a lexical analyzer with Lex
Structure of Lex Programs

61
Lex Program for Tokens

62
