
Lexical Analysis

Textbook: Modern Compiler Design, Chapter 2.1

A motivating example
Create a program that counts the number of lines in a given input text file

Solution (a Flex specification)

        int num_lines = 0;
%%
\n      ++num_lines;
.       ;
%%
main()
{
    yylex();
    printf("# of lines = %d\n", num_lines);
}


Outline
- Roles of lexical analysis
- What is a token
- Regular expressions and regular descriptions
- Lexical analysis
- Automatic creation of lexical analyzers
- Error handling

Basic Compiler Phases

Source program (string)
  → lexical analysis (Front-End)
  → tokens
  → syntax analysis
  → abstract syntax tree
  → semantic analysis
  → annotated abstract syntax tree
  → Back-End
  → final assembly

Example Tokens

Type     Examples
ID       foo  n_14  last
NUM      73  00  517  082
REAL     66.1  .5  10.  1e67  5.5e-10
IF       if
COMMA    ,
NOTEQ    !=
LPAREN   (
RPAREN   )

Example Non-Tokens

Type                    Examples
comment                 /* ignored */
preprocessor directive  #include <foo.h>
                        #define NUMS 5, 6
macro                   NUMS
whitespace              \t \n \b

Example

void match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0.;
}

is tokenized as:

VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE
IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
RETURN REAL(0.0) SEMI
RBRACE EOF

Lexical Analysis (Scanning)

Input: program text (file)
Output: sequence of tokens

- Read the input file
- Identify language keywords and standard identifiers
- Handle include files and macros
- Count line numbers
- Remove whitespace
- Report illegal symbols
- Produce a symbol table

Why Lexical Analysis?

- Simplifies the syntax analysis, and the language definition itself
- Modularity
- Reusability
- Efficiency

What is a token?

- Defined by the programming language
- Can be separated by spaces
- The smallest units of the language
- Defined by regular expressions

A simplified scanner for C

Token nextToken()
{
    char c;
loop:
    c = getchar();
    switch (c) {
    case ' ':  goto loop;
    case ';':  return SemiColumn;
    case '+':
        c = getchar();
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); return Plus;
        }
    case '<':  /* ... similarly for <= vs. < ... */
    case 'w':  /* ... keywords such as "while" ... */
    }
}

Regular Expressions

Escape Characters in Regular Expressions

- \ converts a single operator into text:  a\+   (a\+\*)+
- Double quotes surround text:  "a+*"+
- Esthetically ugly, but standard

Regular Descriptions
EBNF where non-terminals are fully defined before first use:

letter            → [a-zA-Z]
digit             → [0-9]
underscore        → _
letter_or_digit   → letter | digit
underscored_tail  → underscore letter_or_digit+
identifier        → letter letter_or_digit* underscored_tail*
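Such a regular description can be checked directly. The following is a minimal sketch (assuming the repetition underscored_tail*; the function names `is_identifier`, `is_letter`, `is_letter_or_digit` are illustrative, not from the slides):

```c
#include <ctype.h>

/* Sketch of a checker for the regular description above:
       identifier -> letter letter_or_digit* underscored_tail*
       underscored_tail -> underscore letter_or_digit+        */
static int is_letter(int c)          { return isalpha(c); }
static int is_letter_or_digit(int c) { return isalnum(c); }

int is_identifier(const char *s)
{
    if (!is_letter((unsigned char)*s))              /* must start with a letter */
        return 0;
    s++;
    while (is_letter_or_digit((unsigned char)*s))   /* letter_or_digit* */
        s++;
    while (*s == '_') {                             /* underscored_tail* */
        s++;
        if (!is_letter_or_digit((unsigned char)*s)) /* tail needs letter_or_digit+ */
            return 0;
        while (is_letter_or_digit((unsigned char)*s))
            s++;
    }
    return *s == '\0';
}
```

Note how each loop corresponds to one non-terminal of the description; this only works because every non-terminal is fully defined before its first use.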

A token description consists of:
- a token name
- a regular expression

The Lexical Analysis Problem

Given:
- a set of token descriptions
- an input string

partition the string into tokens (class, value).

Ambiguity resolution:
- prefer the longest matching token
- between two matching tokens of equal length, select the one declared first
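The resolution rule above can be sketched as follows, using the LessOrEqual/LessThen token names from the Flex example in these slides (the matcher functions themselves are illustrative assumptions):

```c
#include <string.h>

/* Sketch of ambiguity resolution: try each token's matcher, keep the
   strictly longest match; on equal length the rule declared first is
   kept, because only a strictly longer match replaces the current one. */
typedef struct {
    const char *name;
    int (*match)(const char *);  /* returns matched prefix length, 0 = no match */
} Rule;

static int match_less_or_equal(const char *s) { return strncmp(s, "<=", 2) == 0 ? 2 : 0; }
static int match_less_then(const char *s)     { return s[0] == '<' ? 1 : 0; }

const char *classify(const char *s)
{
    static const Rule rules[] = {
        { "LessOrEqual", match_less_or_equal },
        { "LessThen",    match_less_then     },
    };
    const char *best = 0;
    int best_len = 0;
    for (int i = 0; i < 2; i++) {
        int len = rules[i].match(s);
        if (len > best_len) {  /* '>' (not '>=') keeps the first-declared rule on ties */
            best_len = len;
            best = rules[i].name;
        }
    }
    return best;
}
```

On input "<=" both rules match a prefix, but LessOrEqual's match is longer and wins; on "<" only LessThen matches.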

A Flex Specification of a C Scanner

Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]       {;}
[\n]        {line_count++;}
";"         { return SemiColumn; }
"++"        { return PlusPlus; }
"+="        { return PlusEqual; }
"+"         { return Plus; }
"while"     { return While; }
{Letter}({Letter}|{Digit})*   { return Id; }
"<="        { return LessOrEqual; }
"<"         { return LessThen; }

Flex

Input: regular expressions and actions (C code)
Output: a scanner program that reads the input and applies the actions when an input regular expression is matched

regular expressions → flex → scanner
input program → scanner → tokens

Naïve Lexical Analysis

Automatic Creation of Efficient Scanners

- Naïve approach on regular expressions (dotted items)
- Construct a non-deterministic finite automaton over the items
- Convert it to a deterministic automaton
- Minimize the resultant automaton
- Optimize (compress) the representation

Dotted Items

Example:
    T → a+ b+
Input: aab
After parsing aa:
    T → a+ • b+

Item Types

Shift item: the dot is in front of a basic pattern
    A → (ab)+ • c (de|fe)*

Reduce item: the dot is at the end of the right-hand side
    A → (ab)+ c (de|fe)* •

Basic items: shift or reduce items

Character Moves

For shift items character moves are simple:
    T → • c    reads c    ⇒    T → c •

For example, with Digit defined as [0-9]:
    T → • [0-9]    reads 7    ⇒    T → [0-9] •

Moves

For non-shift items the situation is more complicated:
- What character do we need to see?
- Where are we in the matching?
      T → • (a*)

Moves for Repetitions

Where can we get from T → • (R)* ?
- If R occurs zero times:         T → (R)* •
- If R occurs one or more times:  T → ( • R)*
When R ends, the item ( R • )* can move to (R)* • or back to ( • R)*

Moves

    I → [0-9]+
    F → [0-9]* . [0-9]+
Input: 3.1;

The moves of F's items while scanning 3.1:

    F → • ([0-9])* . ([0-9])+      ε⇒  F → ( • [0-9])* . ([0-9])+
reading 3:
    F → ( [0-9] • )* . ([0-9])+    ε⇒  F → ( • [0-9])* . ([0-9])+  and  F → ([0-9])* • . ([0-9])+
reading . :
    F → ([0-9])* . • ([0-9])+      ε⇒  F → ([0-9])* . ( • [0-9])+
reading 1:
    F → ([0-9])* . ( [0-9] • )+    ε⇒  F → ([0-9])* . ( • [0-9])+  and  F → ([0-9])* . ([0-9])+ •   (reduce item)

Concurrent Search

How to scan multiple token classes in a single run? Track the items of all token classes simultaneously.

    I → [0-9]+
    F → [0-9]* . [0-9]+
Input: 3.1;

Initially:
    I → ( • [0-9])+        F → ( • [0-9])* . ([0-9])+   and   F → ([0-9])* • . ([0-9])+
reading 3:
    I → ( [0-9] • )+   ε⇒  I → ( • [0-9])+  and  I → ([0-9])+ •   (I accepts)
    F → ( [0-9] • )* . ([0-9])+   ε⇒  F → ( • [0-9])* . ([0-9])+  and  F → ([0-9])* • . ([0-9])+
reading . :
    all items of I die;   F → ([0-9])* . ( • [0-9])+
reading 1:
    F → ([0-9])* . ( [0-9] • )+   ε⇒  F → ([0-9])* . ( • [0-9])+  and  F → ([0-9])* . ([0-9])+ •   (F accepts)

The Need for Backtracking

- A simple-minded solution may require unbounded backtracking:
      T1 → a+ ;
      T2 → a
- Quadratic behavior
- Does not occur in practice
- A linear solution exists
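The quadratic behavior can be made concrete with a small counting sketch, assuming T1 stands for a+ followed by ';' and T2 for a single a (the function `scan_all` and its read counter are illustrative): on a run of a's with no semicolon, each attempt to complete T1 rescans the rest of the input before falling back to a one-character T2 match.

```c
#include <string.h>

/* Illustrative sketch of the simple-minded scanner, assuming
       T1 -> a+ ';'       T2 -> a
   On input aaa...a (no ';') every attempt to complete T1 scans to
   the end of the input, then backtracks to the one-character T2
   match. 'reads' counts character inspections.                   */
long scan_all(const char *input)
{
    long reads = 0;
    size_t pos = 0, n = strlen(input);
    while (pos < n) {
        size_t i = pos;
        while (i < n && input[i] == 'a') { i++; reads++; }  /* try a+ */
        if (i > pos && i < n && input[i] == ';') {
            reads++;
            pos = i + 1;          /* T1 matched */
        } else {
            if (i < n) reads++;   /* looked at the char that broke the match */
            pos += 1;             /* backtrack: emit T2 = one 'a' */
        }
    }
    return reads;
}
```

For n a's the count is n + (n-1) + … + 1 = n(n+1)/2, i.e. quadratic; the linear-time scanner shown later avoids the rescans by remembering the last accepting position.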

A Non-Deterministic Finite State Machine

- Add a production S → T1 | T2 | … | Tn
- Construct an NFA over the items:
  - The initial state is S → • (T1 | T2 | … | Tn)
  - For every character move, construct a character transition ⟨T → α • c β, c⟩ → T → α c • β
  - For every ε move, construct an ε transition
  - The accepting states are the reduce items; the reduce item of Ti accepts the language defined by Ti

Moves

    I → [0-9]+
    F → [0-9]* . [0-9]+
    S → • (I|F)

(NFA over the items: ε-transitions lead from S → • (I|F) to I → • ([0-9]+) and F → • ([0-9]*).[0-9]+, and on to I → ( • [0-9])+, F → ( • [0-9]*).[0-9]+ and F → ([0-9]*) • .[0-9]+. [0-9]-transitions advance the dot over a digit, e.g. I → ( • [0-9])+ to I → ( [0-9] • )+; a '.'-transition moves F to F → [0-9]*. ( • [0-9]+). The reduce items I → ([0-9]+) • and F → [0-9]*.([0-9]+) • are the accepting states.)

Efficient Scanners

- Construct a deterministic finite automaton
  - Every state is a set of items
  - Every transition is followed by an ε-closure
  - When a set contains two reduce items, select the one declared first
- Minimize the resultant automaton
  - Rejecting states are initially indistinguishable
  - Accepting states of the same token are indistinguishable
- Exponential worst-case complexity
  - Does not occur in practice
- Compress the representation

    I → [0-9]+
    F → [0-9]* . [0-9]+

(DFA over sets of items: the initial state is { S → • (I|F), I → • ([0-9]+), I → ( • [0-9])+, F → • ([0-9]*).[0-9]+, F → ( • [0-9]*).[0-9]+, F → ([0-9]*) • .[0-9]+ }. On [0-9] it reaches a state containing the reduce item I → ([0-9]+) • (accepting I), which loops on [0-9]; on '.' both of these states reach { F → [0-9]*. ( • [0-9]+) }; a further [0-9] reaches a state containing F → [0-9]*.([0-9]+) • (accepting F), which loops on [0-9]. All remaining characters ([^0-9.], [^0-9], .|\n) lead to the Sink.)

A Linear-Time Lexical Analyzer

IMPORT Input Char [1..];
set Read Index to 1;

Procedure Get_Next_Token:
    set Start of token to Read Index;
    set End of last token to uninitialized;
    set Class of last token to uninitialized;
    set State to Initial;
    while State /= Sink:
        set Ch to Input Char [Read Index];
        set State to δ[State, Ch];
        if accepting(State):
            set Class of last token to Class(State);
            set End of last token to Read Index;
        set Read Index to Read Index + 1;
    set Token.class to Class of last token;
    set Token.repr to Input Char [Start of token .. End of last token];
    set Read Index to End of last token + 1;
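A runnable C rendering of this procedure, hard-wired to the I/F example, might look as follows (the state numbering matches the 3.1; trace; names such as `get_next_token` and `delta` are illustrative):

```c
#include <string.h>

/* Sketch of the linear-time algorithm, hard-wired to
       I -> [0-9]+        F -> [0-9]*.[0-9]+
   States: 1 initial, 2 accepts I, 3 after the dot, 4 accepts F,
   0 is the Sink.                                               */
enum { SINK = 0, S1 = 1, S2, S3, S4 };
enum { NONE = 0, TOK_I, TOK_F };

static int delta(int state, char ch)                 /* transition table */
{
    int digit = (ch >= '0' && ch <= '9');
    switch (state) {
    case S1: case S2: return digit ? S2 : (ch == '.' ? S3 : SINK);
    case S3: case S4: return digit ? S4 : SINK;
    default:          return SINK;
    }
}

static int accepting(int state)  /* class of an accepting state, or NONE */
{
    return state == S2 ? TOK_I : state == S4 ? TOK_F : NONE;
}

/* Remember the last accepting position; on reaching the Sink (or end of
   input) roll Read Index back to just after it.                         */
int get_next_token(const char *input, int *read_index, char *repr)
{
    int start = *read_index, end = -1, cls = NONE;
    int state = S1, i = start;
    while (state != SINK && input[i] != '\0') {
        state = delta(state, input[i]);
        if (accepting(state)) { cls = accepting(state); end = i; }
        i++;
    }
    if (cls == NONE)
        return NONE;                       /* no token starts here */
    memcpy(repr, input + start, (size_t)(end - start + 1));
    repr[end - start + 1] = '\0';
    *read_index = end + 1;
    return cls;
}
```

Each input character is inspected a bounded number of times, because the roll-back goes to the last accepting position rather than restarting a per-token match from scratch.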

Scanning 3.1;

    remaining input   3.1;   .1;   1;    ;
    state               1     2     3    4
    next state          2     3     4   Sink
    last token          —     I     I    F

On reading ';' the automaton reaches the Sink; the scanner rolls back to the last accepting position and returns F with representation 3.1.

(DFA: 1 —[0-9]→ 2 (accepts I); 2 —[0-9]→ 2; 1, 2 —'.'→ 3; 3 —[0-9]→ 4 (accepts F); 4 —[0-9]→ 4; [^0-9.] and [^0-9] lead to the Sink.)

Scanning aaa

    T1 → a+ ;
    T2 → a

    remaining input   aaa$   aa$   a$    $
    state               1     2     4    4
    next state          2     4     4   Sink
    last token          —    T1    T1   T1

The scanner returns T1 with representation aaa.

(DFA: 1 —a→ 2 (accepts T1); 2 —a→ 4 (accepts T1); 4 —a→ 4; ';' and the remaining characters ([^a;], [^a], .|\n) lead to the Sink.)

Error Handling
- Illegal symbols
- Common errors

Missing (topics not covered here)
- Creating a lexical analyzer by hand
- Table compression
- Symbol tables
- Handling macros
- Start states
- Nested comments

Summary
- For most programming languages, lexical analyzers can be constructed automatically
- Exceptions: Fortran, PL/1
- Lex/Flex/JLex are useful beyond compilers
