Chapter Two: Lexical Analysis
A lexical analyzer, also called a scanner, typically has the following functionality and
characteristics.
Its primary function is to convert an (often very long) sequence of characters into a
(much shorter, perhaps ten times shorter) sequence of tokens.
The scanner must identify and categorize specific character sequences into tokens. It
must know whether every two adjacent characters in the file belong together in the
same token, or whether the second character must begin a different token.
Most lexical analyzers discard comments and whitespace. In most languages these
characters serve to separate tokens from each other, but once lexical analysis is
completed they serve no purpose.
It must handle lexical errors (illegal characters, malformed tokens) by reporting them
intelligibly to the user.
Efficiency is crucial; a scanner may perform elaborate input buffering.
Token categories can be (precisely, formally) specified using regular expressions, e.g.
IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
Lexical analyzers can be written by hand or implemented automatically using finite
automata.
Input buffering
⚫ Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to
return.
⚫ In the C language: we need to look at the character after -, = or < to decide what token to
return.
⚫ In Fortran: DO 5 I = 1.25. Because Fortran ignores blanks, the scanner cannot tell until it
reaches the period (or a comma) whether this is an assignment to the variable DO5I or the
header of a DO loop.
⚫ We need to introduce a two-buffer scheme to handle large look-aheads safely.
Specification of tokens
⚫ In the theory of compilation, regular expressions are used to formalize the specification of
tokens.
⚫ Regular expressions are a means of specifying regular languages.
⚫ Example: letter(letter | digit)*
⚫ Each regular expression is a pattern specifying the form of strings.
Terminology of Languages
⚫ Alphabet: a finite set of symbols (e.g. the ASCII characters)
⚫ String: a finite sequence of symbols over an alphabet
⚫ Sentence and word are also used to mean string
⚫ ε is the empty string
⚫ |s| is the length of string s
⚫ sⁿ = s s s .. s (n times); s⁰ = ε
Regular Expressions
⚫ We use regular expressions to describe the tokens of a programming language.
⚫ A regular expression is built up from simpler regular expressions (using defining rules).
⚫ Each regular expression denotes a language.
⚫ A language denoted by a regular expression is called a regular set.
Rules
⚫ A recognizer for a language is a program that takes a string x and answers "yes" if x is a
sentence of that language, and "no" otherwise.
⚫ We call the recognizer of the tokens a finite automaton.
⚫ A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
⚫ This means that we may use either a deterministic or a non-deterministic automaton as a
lexical analyzer.
⚫ Both deterministic and non-deterministic finite automata recognize regular sets.
⚫ Which one?
⚫ deterministic: faster recognizer, but it may take more space
⚫ non-deterministic: slower, but it may take less space
⚫ Deterministic automata are widely used in lexical analyzers.
⚫ First, we define regular expressions for tokens; then we convert them into a DFA to get a
lexical analyzer for our tokens.
⚫ Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
⚫ Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
Minimizing the DFA
⚫ Partition the states into groups such that, for all input symbols a, states s1 and s2 remain in
the same group only if they have transitions on a to states in the same group.
⚫ The start state of the minimized DFA is the group containing the start state of the original
DFA.
⚫ The accepting states of the minimized DFA are the groups containing the accepting states of
the original DFA.
Lex (A LEXical Analyzer Generator)
Generates lexical analyzers (scanners or lexers)
Yacc (Yet Another Compiler-Compiler)
Generates a parser based on an analytic grammar
Flex is a free scanner generator, an alternative to Lex
Bison is a free parser-generator program
1. Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l
file)
2. A lexer or scanner is used to perform lexical analysis, or the breaking up of an input
stream into meaningful units, or tokens.
3. For example, consider breaking a text file up into individual words.
Lexical analyzer: scans the input stream and converts sequences of characters into tokens.
Two Rules
1. Lex will always match the longest possible token (greatest number of characters).
2. If two or more possible tokens are of the same length, then the token whose regular
expression is defined first in the lex specification is favored.
a matches a
abc matches abc
[abc] matches a, b or c
[a-f] matches a, b, c, d, e, or f
[0-9] matches any digit
X+ matches one or more of X
X* matches zero or more of X
[0-9]+ matches any integer
(…) grouping an expression into a single unit
| alternation (or)
(a|b|c)* is equivalent to [a-c]*
X? X is optional (0 or 1 occurrence)
if(def)? matches if or ifdef (equivalent to if|ifdef)
[A-Za-z] matches any alphabetical character
. matches any character except newline character
\. matches the . character
\n matches the newline character
\t matches the tab character
\\ matches the \ character
[ \t] matches either a space or tab character
[^a-d] matches any character other than a,b,c and d
Examples:
Special Functions
• yytext
–where text matched most recently is stored
• yyleng
–number of characters in text most recently matched
• yylval
–associated value of current token
• yymore()
–append next string matched to current contents of yytext
• yyless(n)
–remove from yytext all but the first n characters
• unput(c)
–return character c to input stream
• yywrap()
–may be replaced by the user
–yywrap is called by the lexical analyzer when it reaches end-of-file; it returns 1
to stop scanning, or 0 to signal that more input is available
Yacc: a tool for automatically generating a parser given a grammar written in a yacc
specification (.y file)
%{
/* C global variables, prototypes, comments.
   This part will be embedded into the generated *.c file. */
%}
/* definition section: declarations of tokens */
%%
/* rules section:
     production : body { action }
                | ...
                ;
*/
%%
/* user code section */
2. Lex program to count the types of numbers
%{
int pi=0,ni=0,pf=0,nf=0;
%}
%option noyywrap
%%
\+?[0-9]+ pi++;
\+?[0-9]*\.[0-9]+ pf++;
\-[0-9]+ ni++;
\-[0-9]*\.[0-9]+ nf++;
.|\n ; /* ignore everything else */
%%
int main()
{
printf("ENTER INPUT : ");
yylex();
printf("\nPOSITIVE INTEGER : %d",pi);
printf("\nNEGATIVE INTEGER : %d",ni);
printf("\nPOSITIVE FRACTION : %d",pf);
printf("\nNEGATIVE FRACTION : %d\n",nf);
return 0;
}
3. Lex program to find simple and compound
statements
%{
#include <stdlib.h>
%}
%option noyywrap
%%
"and"|
"or"|
"but"|
"because"|
"nevertheless" {printf("COMPOUND STATEMENT"); exit(0); }
. ;
\n return 0;
%%
int main()
{
yylex();
printf("SIMPLE STATEMENT");
return 0;
}