
Bonga University

College of Engineering and Technology


Department of Computer Science
CoSc4103 – COMPILER DESIGN
Chapter 2 Handouts – Lexical Analysis

Overview of Lexical Analysis

A lexical analyzer, also called a scanner, typically has the following functionality and
characteristics.

• Its primary function is to convert an (often very long) sequence of characters into a
(much shorter, perhaps 10x shorter) sequence of tokens.
• The scanner must identify and categorize specific character sequences into tokens. It
must know whether two adjacent characters in the file belong together in the same
token, or whether the second character starts a different token.
• Most lexical analyzers discard comments and whitespace. In most languages these
characters serve to separate tokens from each other, but once lexical analysis is
completed they serve no purpose.
• It handles lexical errors (illegal characters, malformed tokens) by reporting them
intelligibly to the user.
• Efficiency is crucial; a scanner may perform elaborate input buffering.
• Token categories can be specified precisely and formally using regular expressions, e.g.
IDENTIFIER = [a-zA-Z][a-zA-Z0-9]*
• Lexical analyzers can be written by hand (a minimal hand-written sketch follows below),
or implemented automatically using finite automata.
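
To make the hand-written option concrete, here is a sketch in C of a scanner that recognizes
only identifiers and integers on standard input; the token names and the error handling are
illustrative assumptions, not part of this handout.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical token categories, for this sketch only. */
enum token { TOK_ID, TOK_INT, TOK_EOF };

/* Read one token from stdin and store its lexeme in buf. */
enum token next_token(char *buf)
{
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n')   /* discard whitespace */
        c = getchar();
    if (c == EOF)
        return TOK_EOF;
    char *p = buf;
    if (isalpha(c)) {            /* IDENTIFIER = [a-zA-Z][a-zA-Z0-9]* */
        do { *p++ = (char)c; c = getchar(); } while (isalnum(c));
        ungetc(c, stdin);        /* push the lookahead character back */
        *p = '\0';
        return TOK_ID;
    }
    if (isdigit(c)) {            /* INTEGER = [0-9]+ */
        do { *p++ = (char)c; c = getchar(); } while (isdigit(c));
        ungetc(c, stdin);
        *p = '\0';
        return TOK_INT;
    }
    fprintf(stderr, "lexical error: illegal character '%c'\n", c);
    return next_token(buf);      /* report the error and skip the character */
}

A driver that calls next_token() until it returns TOK_EOF turns the character stream into the
token stream that the parser consumes.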

Role of Lexical Analysis

Issues (why separate lexical analysis from parsing?):

• Simpler design

• Compiler efficiency

• Compiler portability (e.g., Linux to Windows)



What’s a Token?

A syntactic category

In English:

noun, verb, adjective, …

In a programming language:

Identifier, Integer, Keyword, Whitespace …

Tokens, Patterns and Lexemes

• A token is a pair consisting of a token name and an optional token value (attribute).

• A pattern is a description of the form that the lexemes of a token may take.

• A lexeme is a sequence of characters in the source program that matches the pattern for a
token.
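
For example (a standard textbook illustration, not from the handout): in count = count + 1, the
character sequence count is a lexeme matching the identifier pattern letter(letter | digit)*, so
the scanner returns the token (id, pointer to the symbol-table entry for count); the lexeme =
matches the assignment-operator pattern and yields a token with no attribute value.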

Input buffering

• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to
return.

• In C: we need to look at the character after -, = or < to decide what token to return
(e.g., - vs. --, = vs. ==, < vs. <=).

• In Fortran: DO 5 I = 1.25. Since Fortran ignores blanks, the scanner cannot tell whether DO
is a keyword or the start of the identifier DO5I until it reaches the . (an assignment to
DO5I) or a , (a DO loop).

• We need to introduce a two-buffer scheme to handle large look-aheads safely, as sketched
below.
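
Here is a minimal sketch of that two-buffer scheme in C, with a sentinel character marking the
end of each half so the scanner pays only one test per character. The buffer size, the choice of
'\0' as sentinel (assuming the source contains no NUL bytes), and the omission of the
lexeme_begin pointer are simplifications made for this illustration.

#include <stdio.h>

#define BUF_SIZE 4096
#define SENTINEL '\0'   /* assumes the source text contains no NUL bytes */

static char buf[2 * BUF_SIZE + 2];  /* two halves, each followed by a sentinel */
static char *forward;               /* scanning pointer */
static FILE *src;

/* Fill one half of the buffer and terminate it with the sentinel. */
static void load(char *half)
{
    size_t n = fread(half, 1, BUF_SIZE, src);
    half[n] = SENTINEL;
}

static void init(FILE *f) { src = f; load(buf); forward = buf; }

/* Return the next character.  Only a sentinel triggers the boundary
 * test, so the common case costs a single comparison. */
static int next_char(void)
{
    for (;;) {
        char c = *forward;
        if (c != SENTINEL) { forward++; return (unsigned char)c; }
        if (forward == buf + BUF_SIZE) {                /* end of first half */
            load(buf + BUF_SIZE + 1);
            forward = buf + BUF_SIZE + 1;
        } else if (forward == buf + 2 * BUF_SIZE + 1) { /* end of second half */
            load(buf);
            forward = buf;
        } else {
            return EOF;   /* sentinel inside a half: true end of input */
        }
    }
}

A real scanner also keeps a lexeme_begin pointer and must ensure that no lexeme grows longer
than one buffer half.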

Specification of tokens

• In the theory of compilation, regular expressions are used to formalize the specification of
tokens.

• Regular expressions are a means of specifying regular languages.

Example:

letter(letter | digit)*

• Each regular expression is a pattern specifying the form of strings.

Terminology of Languages

• Alphabet: a finite set of symbols (e.g., the ASCII characters)

• String:

A finite sequence of symbols over an alphabet.

"Sentence" and "word" are also used as synonyms for "string".

ε is the empty string.

|s| is the length of string s.




• Language: a set of strings over some fixed alphabet.

∅, the empty set, is a language.

{ε}, the set containing only the empty string, is a language.

The set of well-formed C programs is a language.

The set of all possible identifiers is a language.

Operators on Strings:

• Concatenation: xy represents the concatenation of strings x and y;  sε = εs = s

• sⁿ = s s s … s (n times);  s⁰ = ε
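
For example: if s = ab, then s² = abab, s³ = ababab, s⁰ = ε, and |s| = 2.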
Regular Expressions

• We use regular expressions to describe the tokens of a programming language.

• A regular expression is built up from simpler regular expressions (using defining rules).

• Each regular expression denotes a language.

• A language denoted by a regular expression is called a regular set.

Rules

Regular expressions over alphabet Σ:

Reg. Expr.      Language it denotes
ε               {ε}
a ∈ Σ           {a}
(r1) | (r2)     L(r1) ∪ L(r2)
(r1) (r2)       L(r1) L(r2)
(r)*            (L(r))*
(r)             L(r)

• (r)+ = (r)(r)*

• (r)? = (r) | ε

We may remove parentheses by using precedence rules:

• * highest
• concatenation next
• | lowest

• Thus ab*|c means (a(b)*)|(c)
Example:

Σ = {0,1}

0|1 => {0,1}

(0|1)(0|1) => {00,01,10,11}

0* => {ε, 0, 00, 000, 0000, ...}

(0|1)* => all strings of 0s and 1s, including the empty string
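
These denotations can be checked directly on a machine. The sketch below uses the POSIX
<regex.h> interface (a standard C library facility, not something the handout itself uses) to
test strings against (0|1)*:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* Anchored extended-RE equivalent of the regular expression (0|1)*. */
    if (regcomp(&re, "^(0|1)*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    const char *tests[] = { "", "0", "0110", "012" };
    for (int i = 0; i < 4; i++)
        printf("%-6s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accepted" : "rejected");
    regfree(&re);
    return 0;
}

The first three strings (including the empty string) are accepted; 012 is rejected because 2 is
outside the alphabet.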



Finite Automata

• A recognizer for a language is a program that takes a string x and answers "yes" if x is a
sentence of that language, and "no" otherwise.

• We call the recognizer of the tokens a finite automaton.

• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).

• This means that we may use a deterministic or a non-deterministic automaton as a lexical
analyzer.

• Both deterministic and non-deterministic finite automata recognize regular sets.

Which one?

• deterministic – faster recognizer, but it may take more space

• non-deterministic – slower, but it may take less space

• Deterministic automata are widely used in lexical analyzers.

First, we define regular expressions for tokens; then we convert them into a DFA to get a
lexical analyzer for our tokens.

• Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)

• Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)

Non-Deterministic Finite Automaton (NFA)

A non-deterministic finite automaton (NFA) is a mathematical model that consists of:

• S – a set of states

• Σ – a set of input symbols (the alphabet)

• move – a transition function mapping state-symbol pairs to sets of states

• s0 – a start (initial) state

• F – a set of accepting (final) states

• ε-transitions are allowed in NFAs. In other words, we can move from one state to another
without consuming any symbol.

An NFA accepts a string x if and only if there is a path from the start state to one of the
accepting states such that the edge labels along this path spell out x.
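
As an illustration (added here; the handout's own diagrams did not survive reproduction): the
classic NFA for (a|b)*abb has states 0–3, transitions 0→0 on a and on b, 0→1 on a, 1→2 on b,
2→3 on b, start state 0 and accepting state 3. It accepts aabb because the path 0→0→1→2→3
spells a a b b; the non-determinism lies in state 0, which has two possible moves on a.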

Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of an NFA:

o no state has an ε-transition

o for each symbol a and state s, there is at most one edge labeled a leaving s

o i.e., the transition function maps a state-symbol pair to a single state (not to a set of
states)
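
Because a DFA has a single move per (state, symbol) pair, it maps directly onto a transition
table. Below is a minimal table-driven sketch in C for the identifier pattern
letter(letter | digit)*; the state numbering and table layout are assumptions made for this
illustration.

#include <ctype.h>
#include <stdio.h>

/* States: START, IN_ID (accepting), DEAD. */
enum { START = 0, IN_ID = 1, DEAD = 2 };

/* Map a character onto an input class: 0 = letter, 1 = digit, 2 = other. */
static int cls(int ch)
{
    if (isalpha(ch)) return 0;
    if (isdigit(ch)) return 1;
    return 2;
}

/* delta[state][class]: exactly one next state per pair -- the DFA property. */
static const int delta[3][3] = {
    /* letter  digit  other */
    {  IN_ID,  DEAD,  DEAD },   /* START */
    {  IN_ID,  IN_ID, DEAD },   /* IN_ID */
    {  DEAD,   DEAD,  DEAD },   /* DEAD  */
};

/* Return 1 if s matches letter(letter|digit)*, else 0. */
static int is_identifier(const char *s)
{
    int state = START;
    for (; *s; s++)
        state = delta[state][cls((unsigned char)*s)];
    return state == IN_ID;      /* accept iff we stop in an accepting state */
}

int main(void)
{
    printf("%d %d %d\n", is_identifier("x9"), is_identifier("9x"), is_identifier(""));
    return 0;                   /* prints: 1 0 0 */
}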

Converting a Regular Expression into an NFA (Thompson's Construction)

• This is one way to convert a regular expression into an NFA.

• There are other (more efficient) ways to do the conversion.

• Thompson's Construction is a simple and systematic method. It guarantees that the resulting
NFA will have exactly one final state and one start state.




Construction starts from the simplest parts (the alphabet symbols). To create an NFA for a
complex regular expression, the NFAs of its sub-expressions are combined.
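
The handout's construction diagrams are not reproduced here, so the construction is described in
words for the example a(b|c)* (an illustration added for this text):

• For each alphabet symbol x, build a two-state fragment: a start state with a single edge
labeled x to a final state.

• For b|c, add a new start state with ε-moves into the b fragment and the c fragment, and
ε-moves from their final states into a new common final state.

• For (b|c)*, wrap that fragment in a new start and final state, with ε-moves that bypass the
fragment (zero repetitions) and an ε-move from its final state back to its start state
(further repetitions).

• For the concatenation a(b|c)*, connect the final state of the a fragment to the start state
of the (b|c)* fragment with an ε-move.

Each step introduces at most two new states, so the NFA for a regular expression of length n has
O(n) states, one start state, and one final state.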

Minimizing the Number of States of a DFA

• Partition the set of states into two groups:

G1: the set of accepting states

G2: the set of non-accepting states

• For each new group G:

partition G into subgroups such that states s1 and s2 are in the same subgroup iff, for
all input symbols a, states s1 and s2 have transitions to states in the same group.

• The start state of the minimized DFA is the group containing the start state of the original
DFA.

• The accepting states of the minimized DFA are the groups containing the accepting states of
the original DFA.
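
A tiny worked example (added for illustration): consider a DFA over {a} with states S0 (start),
S1 and S2, where S1 and S2 are accepting, and moves S0 --a--> S1, S1 --a--> S2, S2 --a--> S2.
The initial partition is G1 = {S1, S2} (accepting) and G2 = {S0} (non-accepting). On input a,
both S1 and S2 move to states in G1, so G1 is not split and the partitioning stops. The
minimized DFA has two states, {S0} and {S1, S2}, and still recognizes aa*.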



Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)

One transition per input symbol per state

No ε-moves

• Nondeterministic Finite Automata (NFA)

Can have multiple transitions for one input symbol in a given state

Can have ε-moves

• Finite automata have finite memory

They need only encode the current state

NFA vs. DFA



NFAs and DFAs recognize the same set of languages (regular languages)

DFAs are easier to implement

There are no choices to consider

Regular Expressions to Finite Automata

(The diagram from the handout is not reproduced here; it shows the pipeline from a lexical
specification through regular expressions to an NFA, from the NFA to a DFA, and from the DFA to
a table-driven implementation.)


Overview of Lex and Yacc

• Lex (A LEXical Analyzer Generator)

Generates lexical analyzers (scanners or lexers)

• Yacc (Yet Another Compiler-Compiler)

Generates a parser based on an analytic grammar

• Flex is a free scanner generator, an alternative to Lex

• Bison is a free parser generator written for the GNU project, an alternative to Yacc

Lex: what is it?

1. Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l
file)
2. A lexer or scanner is used to perform lexical analysis, or the breaking up of an input
stream into meaningful units, or tokens.
3. For example, consider breaking a text file up into individual words.

Lexical analyzer: scans the input stream and converts sequences of characters into tokens.

Token: a classification of groups of characters.


Examples:

Lexeme        Token
Sum ID
for FOR
= ASSIGN_OP
== EQUAL_OP
57 INTEGER_CONST
* MULT_OP
, COMMA
( LEFT_PAREN

• Lex / Flex is a tool for writing lexical analyzers.

• Lex / Flex reads a specification file containing regular expressions and generates a C
routine that performs lexical analysis: it matches character sequences that identify tokens.



Skeleton of a lex specification (.l file)

Running lex on x.l generates a *.c file.

%{
< C global variables, prototypes, comments >     <-- this part is embedded into *.c
%}

[DEFINITION SECTION]     <-- substitutions, code and start states; copied into *.c

%%

[RULES SECTION]          <-- defines how to scan and what action to take for each token

%%

< C auxiliary subroutines >     <-- any user code, e.g., a main function that calls the
                                    scanning function yylex()



The rules section

%%

<pattern>    { <action to take when matched> }

<pattern>    { <action to take when matched> }

%%

Patterns are specified by regular expressions.


For example:
%%
[A-Za-z]* { printf("this is a word"); }
%%

Two Rules

1. lex will always match the longest possible token (the greatest number of characters).

2. If two or more possible tokens are of the same length, then the token whose regular
expression is defined first in the lex specification is favored.
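
A small illustration of both rules (an example added here, with made-up actions):

%%
if        { printf("IF\n"); }
[a-z]+    { printf("ID\n"); }
%%

On input ifdef, the second rule fires because it matches five characters while if matches only
two (rule 1). On input if, both patterns match two characters, so the keyword rule fires because
it is listed first (rule 2).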

Regular Expressions in lex / Flex:

a matches a
abc matches abc
[abc] matches a, b or c
[a-f] matches a, b, c, d, e, or f
[0-9] matches any digit
X+ matches one or more of X
X* matches zero or more of X
[0-9]+ matches any integer
(…) grouping an expression into a single unit
| alternation (or)
(a|b|c)* is equivalent to [a-c]*
X? X is optional (0 or 1 occurrence)
if(def)? matches if or ifdef (equivalent to if|ifdef)
[A-Za-z] matches any alphabetical character
. matches any character except newline character
\. matches the . character
\n matches the newline character
\t matches the tab character
\\ matches the \ character
[ \t] matches either a space or tab character
[^a-d] matches any character other than a,b,c and d

Examples:

Real numbers, e.g., 0, 27, 2.10, .17, can be described by any of:

[0-9]+|[0-9]+\.[0-9]+|\.[0-9]+
[0-9]+(\.[0-9]+)?|\.[0-9]+
[0-9]*(\.)?[0-9]+

To include an optional preceding sign: [+-]?[0-9]*(\.)?[0-9]+

Special Functions
• yytext
–where text matched most recently is stored
• yyleng
–number of characters in text most recently matched
• yylval
–associated value of current token
• yymore()
–append next string matched to current contents of yytext
• yyless(n)
–remove from yytext all but the first n characters
• unput(c)
–return character c to input stream
• yywrap()
–may be replaced by the user
–The yywrap function is called by the lexical analyzer whenever it inputs an
EOF as the first character when trying to match a regular expression
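
When the program is linked without lex's default library, a definition such as the following is
commonly supplied (with flex, the directive %option noyywrap has the same effect); returning 1
tells the scanner that there is no further input file:

int yywrap(void) { return 1; }   /* no more input after EOF */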

Yacc / Bison: what is it?

Yacc: a tool for automatically generating a parser given a grammar written in a yacc
specification (.y file).

A grammar specifies a set of production rules, which define a language. A production rule
specifies a sequence of symbols that forms a legal "sentence" of the language.

Skeleton of a yacc specification (.y file)

Running yacc on x.y generates a *.c file.

%{
< C global variables, prototypes, comments >     <-- this part is embedded into *.c
%}

[DEFINITION SECTION]     <-- contains token declarations; the tokens are recognized in the lexer

%%

[PRODUCTION RULES SECTION]     <-- defines how to "understand" the input language, and what
                                   actions to take for each "sentence"

%%

< C auxiliary subroutines >     <-- any user code, e.g., a main function that calls the parser
                                    function yyparse()

Structure of a yacc File

• Definition section

Declarations of tokens

Type of values used on the parser stack

• Rules section

List of grammar rules with semantic routines

• User code

The Production Rules Section

%%

production : symbol1 symbol2 …   { action }
           | symbol3 symbol4 …   { action }
           | …
           ;

production : symbol1 symbol2     { action }
           ;

%%
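
As a concrete instance of this shape (an illustrative fragment, not from the handout), assuming
NUMBER is declared as a token in the definition section and the lexer supplies its value through
yylval:

%%
expr : expr '+' term   { $$ = $1 + $3; }
     | term            { $$ = $1; }
     ;
term : NUMBER          { $$ = $1; }
     ;
%%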



1. Lex program to count the number of vowels and consonants
%{
/* Vowels hit the first rule; other letters fall through to the second. */
int v = 0, c = 0;
%}
%%
[aeiouAEIOU] v++;
[a-zA-Z]     c++;
%%
int main()
{
    printf("ENTER INPUT : \n");
    yylex();
    printf("VOWELS=%d\nCONSONANTS=%d\n", v, c);
    return 0;
}
2. Lex program to count the type of numbers
%{
int pi=0,ni=0,pf=0,nf=0;
%}
%%
\+?[0-9]+ pi++;
\+?[0-9]*\.[0-9]+ pf++;
\-[0-9]+ ni++;
\-[0-9]*\.[0-9]+ nf++;
%%
int main()
{
printf("ENTER INPUT : ");
yylex();
printf("\nPOSITIVE INTEGER : %d",pi);
printf("\nNEGATIVE INTEGER : %d",ni);
printf("\nPOSITIVE FRACTION : %d",pf);
printf("\nNEGATIVE FRACTION : %d\n",nf);
}
3. Lex program to find simple and compound statements
%{
#include <stdlib.h>   /* for exit() */
%}
%%
"and"          |
"or"           |
"but"          |
"because"      |
"nevertheless" { printf("COMPOUND STATEMENT"); exit(0); }
.  ;
\n return 0;
%%
int main()
{
    printf("\nENTER THE STATEMENT : ");
    yylex();
    printf("SIMPLE STATEMENT");
    return 0;
}
4. Lex program to count words
/* just like Unix wc */
%{
#include <string.h>   /* for strlen() */
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+ { words++; chars += strlen(yytext); }
\n        { chars++; lines++; }
.         { chars++; }
%%
int main(int argc, char **argv)
{
    yylex();
    printf("%8d%8d%8d\n", lines, words, chars);
    return 0;
}
5. Lex program for English to American
%%
"colour" { printf("color"); }
"flavour" { printf("flavor"); }
"clever" { printf("smart"); }
"smart" { printf("elegant"); }
"conservative" { printf("liberal"); }
… lots of other words …
. { printf("%s", yytext); }
%%

