Day 2 - Lexical Analyzer

Uploaded by Yeabsira

Lexical Analyzer

Yared Y.
Lexical Analysis (lexing)
• The primary function of a scanner is to transform a character
stream into a token stream.
• A scanner is sometimes called a lexical analyzer, or lexer.
• The lexical analyzer breaks the input into a series of tokens,
removing any whitespace or comments in the source code.
• If the lexical analyzer finds an invalid token, it generates an error.
• It reads character streams from the source code, checks for legal
tokens, and passes them to the syntax analyzer on demand.
• Main task: to read input characters and group them into “tokens.”
• Secondary tasks:
• Skip comments and whitespace;
• Correlate error messages with source program (e.g., line number of error).
Tokens
• A lexeme is the sequence of characters in the source
program that makes up a single token.
• Keywords, constants, identifiers, strings, numbers,
operators, and punctuation symbols can all be
considered tokens.
• For example, in C, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator),
100 (constant) and ; (symbol).
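The declaration above can be tokenized with a tiny scanner. The sketch below is in Python rather than C only to keep it short; the category names and patterns are illustrative, not a complete C lexer:

```python
import re

# Minimal, illustrative tokenizer for "int value = 100;".
# Categories follow the slide: keyword, identifier, operator, constant, symbol.
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b"),              # keyword rule tried before identifiers
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),                  # whitespace separates lexemes; not a token
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]
```

Placing the keyword pattern before the identifier pattern is what makes "int" come out as a keyword rather than an identifier.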
Tokens
• Keywords (int, float)
• Identifiers (variables: sum, x, y, total)
• Constants (10, -3)
• Strings ("Hello")
• Special symbols ((), {})
• Operators (-, +, *, /)
Finding the tokens
int a, b, sum;
printf("\n Enter two numbers:");
scanf("%d %d", &a, &b);
sum = a + b;
printf("Sum: %d", sum);

Keywords: int
Identifiers: a b sum
Function identifiers: printf scanf
Symbols: , ; ( ) &
Strings: "\n Enter two numbers:" "%d %d" "Sum: %d"
Operators: + =
Find the total number of tokens in the given C statement

printf(" String %d", ++i ++ && & i **a);

The tokens are

printf  (  " String %d"  ,  ++  i  ++  &&  &  i  *  *  a  )  ;

So there are 15 tokens in total.
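A quick way to check the count is a maximal-munch scan. This Python sketch (the pattern set is illustrative and deliberately tiny) tokenizes the statement and counts the non-whitespace matches; note that "**" lexes as two "*" tokens, since C has no "**" operator:

```python
import re

# Order matters: multi-character operators ("++", "&&") and string literals
# are tried before the single-character fallback class.
SPEC = (r'(?P<STR>"[^"]*")|(?P<INC>\+\+)|(?P<AND>&&)'
        r'|(?P<ID>[A-Za-z_]\w*)|(?P<CH>[(),;&*+])|(?P<WS>\s+)')
MASTER = re.compile(SPEC)

def count_tokens(src):
    n, pos = 0, 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
        if m.lastgroup != "WS":        # whitespace separates tokens; not counted
            n += 1
        pos = m.end()
    return n

print(count_tokens('printf(" String %d", ++i ++ && & i **a);'))  # 15
```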
Find the tokens
int main() {
    int a = 5; int b = 10;
    int sum = a + b;
    printf("Sum: %d\n", sum);
    return 0;
}

Keywords: int return
Identifiers: main a b sum
Function identifier: printf
Symbols: ( ) { } ; ,
Strings: "Sum: %d\n"
Operators: = +
Constants: 5 10 0
Specify the <token, attribute> set for the C statement

a = (b + c) * 2;

<id, 100> <=> <(> <id, 101> <+> <id, 102> <)> <*> <constant, 103> <;>

The numbers after the comma represent pointers to
symbol table entries. The entry at location 100 stores the
identifier 'a', 101 stores 'b', and 102 stores 'c'. The
constant 2 is stored at location 103.
Find the lexemes and the <token type, attribute> set for
a = (b + c - d) / f;

The lexemes are


a=(b+c-d)/f;
The tokens along with their associated attributes are
<id, 1> <=> <(> <id, 2> <+> <id, 3> <-> <id,4> <)>
</> <id,5>
Symbol table example for the previous statement

Location   Token type   Value
1          Identifier   a
2          Identifier   b
3          Identifier   c
4          Identifier   d
5          Identifier   f
How are categories (tokens) determined in compilers?

• Based on predefined rules (specifications).
• These rules are defined by the grammar and syntax of
the programming language being compiled.
Example:
• Keywords:
These are reserved words predefined in the language's
grammar. The lexer matches sequences of characters
against a list of reserved words.
For example, int, return, if, else, etc.
• Identifiers:
These are names defined by the programmer, such as
variable names, function names, etc. The lexer
identifies them by recognizing sequences of characters
that start with a letter or underscore (_) followed by
letters, digits, or underscores.
• Symbols:
These are single or multi-character tokens that have a
specific meaning in the language's syntax, such as (, ),
{, }, ;, ,, etc. The lexer matches these against a
predefined list of symbols.
• Strings:
These are sequences of characters enclosed in quotes.
The lexer recognizes string literals by identifying
characters between quotation marks (" ").
• Operators:
These are symbols that represent operations, such as =,
+, -, *, /, &&, ||, etc. The lexer matches these against a
predefined list of operators.
• Literals:
These are fixed values, such as numbers (5, 10, 0),
characters ('a', 'b'), and string literals ("Sum: %d\n").
The lexer recognizes them based on the rules for each
type of literal.
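The category rules above map naturally onto one regular expression per category. The sketch below is a hedged illustration: the keyword and operator sets are small samples, not a full C specification, and `classify` assumes the lexeme has already been cut out of the input:

```python
import re

# One pattern per token category from the slides. Order matters: keywords
# are checked before identifiers, since every keyword also looks like one.
patterns = {
    "keyword":    re.compile(r"\b(?:int|float|return|if|else)\b"),
    "identifier": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
    "string":     re.compile(r'"[^"\n]*"'),
    "operator":   re.compile(r"&&|\|\||[=+\-*/]"),
    "symbol":     re.compile(r"[(){};,]"),
    "literal":    re.compile(r"\d+"),
}

def classify(lexeme):
    """Return the first category whose pattern matches the whole lexeme."""
    for category, pat in patterns.items():
        if pat.fullmatch(lexeme):
            return category
    return "unknown"

print(classify("int"))     # keyword (matched before the identifier rule)
print(classify("total"))   # identifier
print(classify('"hi"'))    # string
print(classify("&&"))      # operator
```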
Steps performed by lexical analyzer:
1. Identify the lexical units in the source code
2. Classify lexical units into classes like identifiers and
constants
3. Identify tokens that are not part of the language

• Two main activities
1. Remove irrelevant parts (comments, whitespace,
directives)
2. Specify tokens as <token_class, token_id>
• A lexical analyser, also called a lexer or scanner, takes
as input a string of individual characters and divides this
string into a sequence of classified tokens. Additionally,
it filters out whatever separates the tokens (the so-
called white space), i.e., layout characters (spaces,
newlines, etc.) and comments.
Basic Terminologies
• What is a lexeme?
A lexeme is a sequence of characters in the source program that matches
the pattern of a token. It is nothing but an instance of a token.

• What is a token?
A token in compiler design is a sequence of characters that represents a
unit of information in the source program.
• A syntactic category:
in English: noun, verb, adjective
in programming languages:
identifier, constant, keyword, whitespace,….
• Choice of tokens depends on:
language
design of parser
Steps performed by the lexer
• Initialization
Read Input: The lexer reads the entire source code as a
string or a stream of characters.
Set Up Data Structures: The lexer sets up data structures
like symbol tables and state machines needed for
tokenization.
• Character reading
Read Character: The lexer reads the next character from
the input.
Buffering: Characters may be buffered to handle multi-
character tokens.
• Token recognition
Match Patterns: The lexer uses regular expressions, state
machines, or lookup tables to match sequences of characters
against patterns for keywords, identifiers, operators,
symbols, literals, and other tokens.
State Transition: The lexer transitions between states in a state
machine based on the current character & the current state.
• Token creation
Create Token: When a pattern matches, the lexer creates a
token of a specific type (e.g., keyword, identifier, symbol).
Store Token: The token is stored in a list or passed to the parser.
• Ignore whitespace and comments
Skip Whitespace: The lexer skips over whitespace
characters (spaces, tabs, newlines).
Skip Comments: The lexer skips over comments, which can
be single-line or multi-line depending on the language
syntax.
• Error handling
Detect Errors: The lexer detects sequences of characters
that do not match any known pattern and generates a
lexical error.
Report Errors: The lexer reports errors with information
about the location and nature of the error.
• Repeat
Continue Reading: The lexer continues reading
characters and creating tokens until the entire input is
processed.
• End of Input
EOF Token: The lexer generates an end-of-file (EOF)
token to signal the end of input to the parser.
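The whole loop described above (read, match, create tokens, skip whitespace and comments, report errors, emit EOF) can be sketched in a few lines. The token names and the tiny pattern set here are illustrative, assumed for the example only:

```python
import re

# Rule table: comments and whitespace are matched like tokens but discarded.
SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("WS",      r"\s+"),
    ("NUMBER",  r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"[=+\-*/]"),
    ("SYM",     r"[(){};,]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def lex(source):
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:                            # error handling: no rule matched
            raise SyntaxError(
                f"lexical error at position {pos}: {source[pos]!r}")
        if m.lastgroup not in ("WS", "COMMENT"):  # skip non-tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()                            # continue after this lexeme
    tokens.append(("EOF", ""))                   # signal end of input to the parser
    return tokens

print(lex("sum = a + b; // add"))
```

Note how the single-line comment is matched and dropped, and how an unmatched character surfaces as a lexical error with its position.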
Activities performed by the
lexer
•Pattern Matching:
Using regular expressions & state machines to recognize different types of tokens.
•Token Classification:
Classifying tokens into categories like keywords, identifiers, literals, operators, & symbols.
•Buffer Management:
Managing buffers for multi-character tokens & handling lookahead for complex patterns.
•Symbol Table Management:
Maintaining a symbol table for identifiers and literals, storing information like names and types.
•Skipping Non-Essential Elements:
Skipping whitespace & comments to streamline the tokenization process.
•Error Detection and Reporting:
Detecting invalid sequences of characters & reporting lexical errors with context.
• The main task of lexical analysis is to read input
characters in the code and produce tokens.
• Lexical analyzer scans the entire source code of the
program. It identifies each token one by one.
Roles of a lexical analyzer
• Helps identify tokens and enter them into the symbol table
• Removes white space and comments from the source
program
• Correlates error messages with the source program
• Expands macros if they are found in the source program
• Reads input characters from the source program
Example:
• Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>
int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme    Token
int       Keyword
maximum   Identifier
(         Symbol
int       Keyword
x         Identifier
,         Symbol
int       Keyword
y         Identifier
)         Symbol
{         Symbol
if        Keyword
Examples of non-tokens
Comment (e.g. // This will compare 2 numbers)
Pre-processor directive (e.g. #include <stdio.h>)
White space
The task of lexical analysis
• Let us assume that the source program is stored in a file. It
consists of a sequence of characters. Lexical analysis, i.e., the
scanner, reads this sequence from left to right and decomposes
it into a sequence of lexical units, called symbols.
• The scanner starts the analysis with the character that follows
the end of the last found symbol. It searches for the longest
prefix of the remaining input that is a symbol of the language. It
passes a representation of this symbol on to the screener, which
checks whether this symbol is relevant for the parser. If not, it is
ignored, and the screener reactivates the scanner. Otherwise, it
passes a possibly transformed representation of the symbol on
to the parser.
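The "longest prefix" rule above is often called maximal munch: at each position the scanner takes the longest lexeme it can, so "++" becomes one token rather than two "+" tokens. A hedged Python sketch (listing the longer alternative first makes `re` prefer it; this mimics, not fully implements, maximal munch):

```python
import re

# "++" is listed before "+" so the alternation tries the longer lexeme first.
MASTER = re.compile(
    r"(?P<INC>\+\+)|(?P<PLUS>\+)|(?P<ID>[A-Za-z_]\w*)|(?P<WS>\s+)")

def scan(src):
    out, pos = [], 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()       # resume after the longest prefix just consumed
    return out

print(scan("i ++ + j"))
# [('ID', 'i'), ('INC', '++'), ('PLUS', '+'), ('ID', 'j')]
```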
Scanner generator
• A scanner generator, e.g., lex or flex, automatically
generates a lexical analyzer from a high-level
description of the tokens.

• Tools:
lex
flex
ANTLR
