Day 2 - Lexical Analyzer
Yared Y.
Lexical Analysis (lexing)
• The primary function of a scanner is to transform a character
stream into a token stream.
• A scanner is sometimes called a lexical analyzer, or lexer.
• The lexical analyzer breaks the source program into a series of tokens,
removing any whitespace or comments in the source code.
• If the lexical analyzer finds a token invalid, it generates an error.
• It reads character streams from the source code, checks for legal
tokens, and passes the data to the syntax analyzer on demand.
• Main task: to read input characters and group them into “tokens.”
• Secondary tasks:
• Skip comments and whitespace;
• Correlate error messages with source program (e.g., line number of error).
Tokens
• A lexeme is the sequence of characters in the source
program that makes up a token.
• Keywords, constants, identifiers, strings, numbers,
operators, and punctuation symbols can be
considered as tokens.
• For example, in the C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Tokens
• Keywords (int, float)
• Identifiers (variables: sum, x, y, total)
• Constants (10, -3)
• Strings (“Hello”)
• Special symbols ((), {})
• Operators (-, +, *, /)
Finding the tokens
int a, b, sum;
printf("\n Enter two numbers:");
scanf("%d %d", &a, &b);
sum = a + b;
printf("Sum: %d", sum);

Keywords: int
Identifiers: a b sum printf scanf
Symbols: , ; ( ) &
Strings: "\n Enter two numbers:" "%d %d" "Sum: %d"
Operators: + =
Find the total number of tokens in the given C statement
printf(" String %d", ++i ++ && & i **a);
The tokens are:
printf ( " String %d" , ++ i ++ && & i * * a ) ;
So there are 15 tokens in total.
Find the tokens
int main() {
Keywords: int return
• What is a token?
Tokens in compiler design are sequences of characters that represent a
unit of information in the source program.
• A syntactic category:
in English: noun, verb, adjective
in programming languages:
identifier, constant, keyword, whitespace,….
• Choice of tokens depends on:
language
design of parser
Steps performed by the lexer
• Initialization
#include <stdio.h>
int maximum(int x, int y) {
// This will compare 2 numbers
if (x > y)
return x;
else {
return y;
}
}
Examples of Tokens created
Lexeme    Token
int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword
Examples of non-tokens
• Comment (e.g., // This will compare 2 numbers)
• Pre-processor directive (e.g., #include <stdio.h>)
• Whitespace
The task of lexical Analysis
• Let us assume that the source program is stored in a file. It
consists of a sequence of characters. Lexical analysis, i.e., the
scanner, reads this sequence from left to right and decomposes
it into a sequence of lexical units, called symbols.
• The scanner starts the analysis with the character that follows
the end of the last found symbol. It searches for the longest
prefix of the remaining input that is a symbol of the language. It
passes a representation of this symbol on to the screener, which
checks whether this symbol is relevant for the parser. If not, it is
ignored, and the screener reactivates the scanner. Otherwise, it
passes a possibly transformed representation of the symbol on
to the parser.
Scanner generator
• A scanner generator, e.g., lex or flex, automatically
generates a lexical analyzer from a high-level description
of the tokens.
• Tools:
lex
flex
ANTLR
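For illustration, a minimal flex specification in this spirit might look like the sketch below (a toy token description, not a complete language; flex generates the scanning loop and the maximal-munch matching from these regular-expression rules):

```lex
%option noyywrap
%%
[ \t\n]+                ;  /* skip whitespace */
"//".*                  ;  /* skip line comments */
int|float               { printf("KEYWORD %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("IDENT %s\n", yytext); }
[0-9]+                  { printf("NUMBER %s\n", yytext); }
.                       { printf("SYMBOL %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
```

Note that the keyword rule is listed before the identifier rule: flex prefers the longest match, and on a tie it prefers the earlier rule, so "int" is reported as a keyword while "integer" still matches as an identifier.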