Lexical Analyzer Construction
Lexical Analysis
Lexical Analysis (LA) is the first phase of the compiler.
It is the only phase that interacts directly with the source program.
The basic responsibility of the LA is to read the source program and convert it
into a stream of tokens.
The Role of the Lexical Analyzer
As the first phase of a compiler, the main task of the lexical analyzer is to read
the input characters of the source program, group them into lexemes, and
produce as output a sequence of tokens for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis. It is common for
the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs
to enter that lexeme into the symbol table.
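The lexer/symbol-table interaction can be sketched as follows. Here install_id is a hypothetical helper (the name and the flat-array table are illustrative choices, not a fixed compiler API): it enters a lexeme into the table on first sight and returns its index, which the lexer can attach to the id token as its attribute value.

```c
#include <string.h>

/* Sketch: a flat symbol table interned by the lexical analyzer.
 * Sizes are arbitrary; a real compiler would use a hash table. */
#define MAX_IDS 64
#define MAX_LEN 32

static char table[MAX_IDS][MAX_LEN];
static int n_ids = 0;

/* Return the symbol-table index for a lexeme, inserting it on first sight. */
int install_id(const char *lexeme) {
    for (int i = 0; i < n_ids; i++)
        if (strcmp(table[i], lexeme) == 0)
            return i;                  /* already present: reuse the entry */
    strcpy(table[n_ids], lexeme);      /* first occurrence: new entry */
    return n_ids++;
}
```

Because the same identifier always maps to the same entry, every later occurrence of a lexeme such as score yields the same attribute value for its id token.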
The Role of the Lexical Analyzer
Since the lexical analyzer is the part of the compiler that reads the source
text, it may perform certain other tasks besides the identification of lexemes.
One such task is stripping out comments and whitespace (blank, newline, tab,
and perhaps other characters that are used to separate tokens in the input).
Another task is correlating error messages generated by the compiler with the
source program: the lexical analyzer may keep track of the number of newlines
seen, so that it can associate a line number with each error message.
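Both tasks above can be combined in one routine. The sketch below uses a hypothetical helper skip_blanks (the name is invented for illustration) that advances past blanks, tabs, newlines, and C-style block comments while counting newlines, so later phases can report a line number with each error message.

```c
/* Advance past whitespace and block comments, counting newlines.
 * Returns a pointer to the first character that can start a token. */
const char *skip_blanks(const char *p, int *line) {
    for (;;) {
        if (*p == ' ' || *p == '\t') {
            p++;
        } else if (*p == '\n') {
            (*line)++;                         /* track line numbers */
            p++;
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;                            /* strip a block comment */
            while (p[0] && !(p[0] == '*' && p[1] == '/')) {
                if (*p == '\n') (*line)++;
                p++;
            }
            if (p[0]) p += 2;                  /* skip the closing delimiter */
        } else {
            return p;                          /* start of the next lexeme */
        }
    }
}
```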
Separation of Lexical Analysis from Syntax Analysis
Simpler design is perhaps the most important consideration. The separation of
lexical analysis from syntax analysis often allows us to simplify one or the
other of these phases.
Compiler efficiency is improved.
Compiler portability is enhanced.
Tokens, Patterns and Lexemes
A token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier.
A pattern is a description of the form that the lexemes of a token may take. In
the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword.
A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of
that token.
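The (token name, attribute value) pair described above can be sketched directly as a data type. The enum members and the use of an int attribute are illustrative choices, not a fixed representation:

```c
/* A token is a pair: an abstract token name plus an optional attribute. */
enum token_name { ID, NUMBER, ASSIGN_OP, MULT_OP, EXP_OP };

struct token {
    enum token_name name;   /* the kind of lexical unit */
    int attribute;          /* e.g. symbol-table index or integer value */
};

struct token make_token(enum token_name name, int attribute) {
    struct token t = { name, attribute };
    return t;
}
```

For a keyword or an operator the attribute field is simply unused, which matches the remark later that such tokens need no attribute value.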
Examples of Tokens
Figure 3.2 gives some typical tokens, their informally described patterns, and
some sample lexemes.
To see how these concepts are used in practice, consider the C statement
printf("Total = %d\n", score);
Both printf and score are lexemes matching the pattern for token id, and
"Total = %d\n" is a lexeme matching the pattern for token literal.
Attributes for Token / Examples of Tokens
The token names and associated attribute values for the Fortran statement are
written below as a sequence of pairs.
E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
Note that in certain pairs, especially operators, punctuation, and keywords,
there is no need for an attribute value.
Lexical Error
It is hard for a lexical analyzer to tell, without the aid of other components, that
there is a source-code error. For instance, if the string fi is encountered for the
first time in a C program in the context
fi(a==b)
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or
an undeclared function identifier, since fi is a valid lexeme for the token id.
The lexical analyzer must return the token id to the parser and let some other
phase of the compiler - probably the parser in this case - handle the error.
Lexical Error
However, suppose a situation arises in which the lexical analyzer is unable to
proceed because none of the patterns for tokens matches any prefix of the
remaining input.
Panic Mode Recovery
The simplest recovery strategy is "panic mode" recovery. We delete successive
characters from the remaining input until the lexical analyzer can find a well-
formed token at the beginning of what input is left.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character with another character.
4. Transpose two adjacent characters.
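The panic-mode strategy can be sketched as a loop that deletes characters until one that can begin a well-formed token appears. Here "can begin a token" is approximated by a hypothetical predicate accepting letters, digits, and a few operator characters; a real lexer would consult its token patterns instead.

```c
#include <string.h>

/* Illustrative approximation: characters that may start a token. */
static int can_start_token(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           strchr("+-*/=<>(){};", c) != NULL;
}

/* Panic-mode recovery: delete successive characters from the
 * remaining input until a possible token start is found. */
const char *panic_recover(const char *p) {
    while (*p && !can_start_token(*p))
        p++;
    return p;
}
```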
Example: Converting Source Text into Tokens
float limitedSquare(float x) {
    /* returns x-squared, but never more than 100 */
    return (x <= -10.0 || x >= 10.0) ? 100 : x * x;
}
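The conversion performed on a statement like the one above can be sketched with a toy tokenizer. It recognizes only identifiers, numbers, and single- or double-character operators such as <= and ||; the token codes are invented for illustration, and real lexers handle far more lexeme classes.

```c
#include <ctype.h>

enum tok { T_ID, T_NUM, T_OP, T_EOF };

/* Read one token starting at *pp; advance *pp past its lexeme. */
enum tok next_token(const char **pp) {
    const char *p = *pp;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                                        /* skip blanks */
    if (*p == '\0') { *pp = p; return T_EOF; }
    if (isalpha((unsigned char)*p) || *p == '_') {  /* identifier */
        while (isalnum((unsigned char)*p) || *p == '_') p++;
        *pp = p; return T_ID;
    }
    if (isdigit((unsigned char)*p)) {               /* number, e.g. 10.0 */
        while (isdigit((unsigned char)*p) || *p == '.') p++;
        *pp = p; return T_NUM;
    }
    /* single- or double-character operator such as <=, >=, ||, ** */
    char c = *p++;
    if ((c == '<' || c == '>') && *p == '=') p++;
    else if ((c == '|' || c == '&') && *p == c) p++;
    else if (c == '*' && *p == '*') p++;
    *pp = p; return T_OP;
}
```

Run over a fragment like x <= 10.0, it emits the stream id, op, number - the same lexeme-to-token conversion the slide's C function would undergo.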