Day 2 - Lexical Analyzer

Uploaded by Yeabsira

Lexical Analyzer

Yared Y.
Lexical Analysis (lexing)
• The primary function of a scanner is to transform a character
stream into a token stream.
• A scanner is sometimes called a lexical analyzer, or lexer.
• The lexical analyzer breaks the input into a series of tokens,
removing any whitespace or comments in the source code.
• If the lexical analyzer finds an invalid token, it generates an error.
• It reads character streams from the source code, checks for legal
tokens, and passes them to the syntax analyzer on demand.
• Main task: to read input characters and group them into “tokens.”
• Secondary tasks:
• Skip comments and whitespace;
• Correlate error messages with source program (e.g., line number of error).
Tokens
• A lexeme is the sequence of characters in the source
program that makes up a single token.
• Keywords, constants, identifiers, strings, numbers,
operators, and punctuation symbols can all be
considered tokens.
• For example, in C, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator),
100 (constant) and ; (symbol).
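The declaration above can be tokenized with a tiny scanner. The sketch below is in Python rather than C only to keep it short; the category names and patterns are illustrative, not a complete C lexer:

```python
import re

# Minimal, illustrative tokenizer for "int value = 100;".
# Categories follow the slide: keyword, identifier, operator, constant, symbol.
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b"),              # keyword rule tried before identifiers
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),                  # whitespace separates lexemes; not a token
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]
```

Placing the keyword pattern before the identifier pattern is what makes "int" come out as a keyword rather than an identifier.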
Tokens
• Keywords (int, float)
• Identifiers (variables: sum, x, y, total)
• Constants (10, -3)
• Strings ("Hello")
• Special symbols ((), {})
• Operators (-, +, *, /)
Finding the tokens
int a, b, sum;
printf("\n Enter two numbers:");
scanf("%d %d", &a, &b);
sum = a + b;
printf("Sum: %d", sum);

Keywords: int
Identifiers: a b sum
Function identifiers: printf scanf
Symbols: , ; ( ) &
Strings: "\n Enter two numbers:" "%d %d" "Sum: %d"
Operators: + =
Find the total number of tokens in the given C statement

printf(" String %d", ++i ++ && & i **a);

The tokens are

printf  (  " String %d"  ,  ++  i  ++  &&  &  i  *  *  a  )  ;

So there are 15 tokens in total.
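A quick way to check the count is a maximal-munch scan. This Python sketch (the pattern set is illustrative and deliberately tiny) tokenizes the statement and counts the non-whitespace matches; note that "**" lexes as two "*" tokens, since C has no "**" operator:

```python
import re

# Order matters: multi-character operators ("++", "&&") and string literals
# are tried before the single-character fallback class.
SPEC = (r'(?P<STR>"[^"]*")|(?P<INC>\+\+)|(?P<AND>&&)'
        r'|(?P<ID>[A-Za-z_]\w*)|(?P<CH>[(),;&*+])|(?P<WS>\s+)')
MASTER = re.compile(SPEC)

def count_tokens(src):
    n, pos = 0, 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
        if m.lastgroup != "WS":        # whitespace separates tokens; not counted
            n += 1
        pos = m.end()
    return n

print(count_tokens('printf(" String %d", ++i ++ && & i **a);'))  # 15
```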
Find the tokens
int main() {
    int a = 5; int b = 10;
    int sum = a + b;
    printf("Sum: %d\n", sum);
    return 0;
}

Keywords: int return
Identifiers: main a b sum
Function identifier: printf
Symbols: ( ) { } ; ,
Strings: "Sum: %d\n"
Operators: = +
Constants: 5 10 0
Specify the <token, attribute> set for the C statement

a = (b + c) * 2;

<id, 100> <=> <(> <id, 101> <+> <id, 102> <)> <*> <constant, 103> <;>

The numbers after the comma represent pointers to
symbol table entries. The entry at location 100 stores the
identifier 'a', 101 stores 'b', and 102 stores 'c'. The
constant 2 is stored at location 103.
Find the lexemes and the <token type, attribute> set for
a = (b + c - d) / f;

The lexemes are


a=(b+c-d)/f;
The tokens along with their associated attributes are
<id, 1> <=> <(> <id, 2> <+> <id, 3> <-> <id,4> <)>
</> <id,5>
Symbol table example for the previous statement

Location   Token type   Value
1          Identifier   a
2          Identifier   b
3          Identifier   c
4          Identifier   d
5          Identifier   f
How are categories (tokens) determined in compilers?

• Based on predefined rules (specifications).
• These rules are defined by the grammar and syntax of
the programming language being compiled.
Example:
• Keywords:
These are reserved words predefined in the language's
grammar. The lexer matches sequences of characters
against a list of reserved words.
For example, int, return, if, else, etc.
• Identifiers:
These are names defined by the programmer, such as
variable names, function names, etc. The lexer
identifies them by recognizing sequences of characters
that start with a letter or underscore (_) followed by
letters, digits, or underscores.
• Symbols:
These are single or multi-character tokens that have a
specific meaning in the language's syntax, such as (, ),
{, }, ;, ,, etc. The lexer matches these against a
predefined list of symbols.
• Strings:
These are sequences of characters enclosed in quotes.
The lexer recognizes string literals by identifying
characters between quotation marks (" ").
• Operators:
These are symbols that represent operations, such as =,
+, -, *, /, &&, ||, etc. The lexer matches these against a
predefined list of operators.
• Literals:
These are fixed values, such as numbers (5, 10, 0),
characters ('a', 'b'), and string literals ("Sum: %d\n").
The lexer recognizes them based on the rules for each
type of literal.
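The category rules above map naturally onto one regular expression per category. The sketch below is a hedged illustration: the keyword and operator sets are small samples, not a full C specification, and `classify` assumes the lexeme has already been cut out of the input:

```python
import re

# One pattern per token category from the slides. Order matters: keywords
# are checked before identifiers, since every keyword also looks like one.
patterns = {
    "keyword":    re.compile(r"\b(?:int|float|return|if|else)\b"),
    "identifier": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
    "string":     re.compile(r'"[^"\n]*"'),
    "operator":   re.compile(r"&&|\|\||[=+\-*/]"),
    "symbol":     re.compile(r"[(){};,]"),
    "literal":    re.compile(r"\d+"),
}

def classify(lexeme):
    """Return the first category whose pattern matches the whole lexeme."""
    for category, pat in patterns.items():
        if pat.fullmatch(lexeme):
            return category
    return "unknown"

print(classify("int"))     # keyword (matched before the identifier rule)
print(classify("total"))   # identifier
print(classify('"hi"'))    # string
print(classify("&&"))      # operator
```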
Steps performed by lexical analyzer:
1. Identify the lexical units in the source code
2. Classify lexical units into classes like identifiers and
constants
3. Identify tokens that are not part of the language

• Two main activities
1. Remove irrelevant parts (comments, whitespace,
directives)
2. Specify tokens as <token_class, token_id>
• A lexical analyser, also called a lexer or scanner, takes
as input a string of individual characters and divides this
string into a sequence of classified tokens. Additionally,
it filters out whatever separates the tokens (the so-
called white space), i.e., layout characters (spaces,
newlines, etc.) and comments.
Basic Terminologies
• What is a lexeme?
A lexeme is a sequence of characters in the source program that matches
the pattern of a token. It is nothing but an instance of a token.

• What is a token?
A token in compiler design is a sequence of characters that represents a
unit of information in the source program.
• A syntactic category:
in English: noun, verb, adjective
in programming languages:
identifier, constant, keyword, whitespace,….
• Choice of tokens depends on:
language
design of parser
Steps performed by the lexer
• Initialization
Read Input: The lexer reads the entire source code as a
string or a stream of characters.
Set Up Data Structures: The lexer sets up data structures
like symbol tables and state machines needed for
tokenization.
• Character reading
Read Character: The lexer reads the next character from
the input.
Buffering: Characters may be buffered to handle multi-
character tokens.
• Token recognition
Match Patterns: The lexer uses regular expressions, state
machines, or lookup tables to match sequences of characters
against patterns for keywords, identifiers, operators,
symbols, literals, and other tokens.
State Transition: The lexer transitions between states in a state
machine based on the current character & the current state.
• Token creation
Create Token: When a pattern matches, the lexer creates a
token of a specific type (e.g., keyword, identifier, symbol).
Store Token: The token is stored in a list or passed to the parser.
• Ignore whitespace and comments
Skip Whitespace: The lexer skips over whitespace
characters (spaces, tabs, newlines).
Skip Comments: The lexer skips over comments, which can
be single-line or multi-line depending on the language
syntax.
• Error handling
Detect Errors: The lexer detects sequences of characters
that do not match any known pattern and generates a
lexical error.
Report Errors: The lexer reports errors with information
about the location and nature of the error.
• Repeat
Continue Reading: The lexer continues reading
characters and creating tokens until the entire input is
processed.
• End of Input
EOF Token: The lexer generates an end-of-file (EOF)
token to signal the end of input to the parser.
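The whole loop described above (read, match, create tokens, skip whitespace and comments, report errors, emit EOF) can be sketched in a few lines. The token names and the tiny pattern set here are illustrative, assumed for the example only:

```python
import re

# Rule table: comments and whitespace are matched like tokens but discarded.
SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("WS",      r"\s+"),
    ("NUMBER",  r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"[=+\-*/]"),
    ("SYM",     r"[(){};,]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def lex(source):
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:                            # error handling: no rule matched
            raise SyntaxError(
                f"lexical error at position {pos}: {source[pos]!r}")
        if m.lastgroup not in ("WS", "COMMENT"):  # skip non-tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()                            # continue after this lexeme
    tokens.append(("EOF", ""))                   # signal end of input to the parser
    return tokens

print(lex("sum = a + b; // add"))
```

Note how the single-line comment is matched and dropped, and how an unmatched character surfaces as a lexical error with its position.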
Activities performed by the
lexer
•Pattern Matching:
Using regular expressions & state machines to recognize different types of tokens.
•Token Classification:
Classifying tokens into categories like keywords, identifiers, literals, operators, & symbols.
•Buffer Management:
Managing buffers for multi-character tokens & handling lookahead for complex patterns.
•Symbol Table Management:
Maintaining a symbol table for identifiers and literals, storing information like names and types.
•Skipping Non-Essential Elements:
Skipping whitespace & comments to streamline the tokenization process.
•Error Detection and Reporting:
Detecting invalid sequences of characters & reporting lexical errors with context.
• The main task of lexical analysis is to read input
characters in the code and produce tokens.
• Lexical analyzer scans the entire source code of the
program. It identifies each token one by one.
Roles of a lexical analyzer
• Helps identify tokens and enter them into the symbol table
• Removes white space and comments from the source
program
• Correlates error messages with the source program
• Expands macros if they are found in the source program
• Reads input characters from the source program
Example:
• Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>
int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme    Token
int       Keyword
maximum   Identifier
(         Symbol
int       Keyword
x         Identifier
,         Symbol
int       Keyword
y         Identifier
)         Symbol
{         Symbol
if        Keyword
Examples of non-tokens
Comment (e.g. // This will compare 2 numbers)
Pre-processor directive (e.g. #include <stdio.h>)
White space
The task of lexical analysis
• Let us assume that the source program is stored in a file. It
consists of a sequence of characters. Lexical analysis, i.e., the
scanner, reads this sequence from left to right and decomposes
it into a sequence of lexical units, called symbols.
• The scanner starts the analysis with the character that follows
the end of the last found symbol. It searches for the longest
prefix of the remaining input that is a symbol of the language. It
passes a representation of this symbol on to the screener, which
checks whether this symbol is relevant for the parser. If not, it is
ignored, and the screener reactivates the scanner. Otherwise, it
passes a possibly transformed representation of the symbol on
to the parser.
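The "longest prefix" rule above is often called maximal munch: at each position the scanner takes the longest lexeme it can, so "++" becomes one token rather than two "+" tokens. A hedged Python sketch (listing the longer alternative first makes `re` prefer it; this mimics, not fully implements, maximal munch):

```python
import re

# "++" is listed before "+" so the alternation tries the longer lexeme first.
MASTER = re.compile(
    r"(?P<INC>\+\+)|(?P<PLUS>\+)|(?P<ID>[A-Za-z_]\w*)|(?P<WS>\s+)")

def scan(src):
    out, pos = [], 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()       # resume after the longest prefix just consumed
    return out

print(scan("i ++ + j"))
# [('ID', 'i'), ('INC', '++'), ('PLUS', '+'), ('ID', 'j')]
```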
Scanner generator
• A scanner generator, e.g., lex or flex, automatically
generates a lexical analyzer from a high-level
description of the tokens.

• Tools:
lex
flex
ANTLR
