Gitika Mandal BE4 A 17 NLP EXP1
PRN: 21UF16885CM030
Roll No: 17
Batch: A
AIM: Apply various text preprocessing techniques to any given text:
Tokenization, Filtration, and Script Validation.
INPUT/OUTPUT:
WRITE UP:
1. What is Meant by Word Tokenization?
Tokenization is a fundamental process in Natural Language Processing (NLP) that involves
breaking down a text into smaller units, known as tokens. These tokens can be words, phrases,
subwords, or even characters, depending on the level of granularity required. Tokenization
serves as a preliminary step in text processing and is crucial for various NLP applications such
as text analysis, machine translation, and information retrieval.
There are several common types of tokenization:
● Word Tokenization: This involves splitting a sentence or text into individual words. For
instance, the sentence "NLP is fascinating." would be tokenized into ["NLP", "is",
"fascinating", "."].
● Subword Tokenization: This breaks words into smaller meaningful units, which is useful
for handling complex words, rare words, and morphological variations. An example is
the use of Byte-Pair Encoding (BPE) in transformer-based models such as GPT, where the
word “unhappiness” may be split into ["un", "happiness"].
● Sentence Tokenization: This divides a text into separate sentences. For instance, the text
"Hello! How are you? I’m fine." would be tokenized into ["Hello!", "How are you?", "I’m
fine."].
● Character Tokenization: This involves splitting a word into individual characters, useful
for languages without clear word boundaries, such as Chinese or Japanese.
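The granularities above can be sketched in plain Python. The following is a rough regex-based illustration, not a production tokenizer; real pipelines typically use libraries such as NLTK or spaCy:

```python
import re

# Word tokenization: capture runs of word characters, keeping
# punctuation marks as separate tokens.
words = re.findall(r"\w+|[^\w\s]", "NLP is fascinating.")
# → ['NLP', 'is', 'fascinating', '.']

# Sentence tokenization: a naive split on sentence-ending punctuation.
text = "Hello! How are you? I'm fine."
sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
# → ['Hello!', 'How are you?', "I'm fine."]

# Character tokenization: every character becomes a token.
chars = list("NLP")
# → ['N', 'L', 'P']
```

Note that this naive sentence splitter would fail on abbreviations such as "Dr." or "e.g.", which is exactly why dedicated sentence tokenizers exist.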
Tokenization is more complex than simple splitting based on spaces or punctuation. Many
languages have unique challenges:
In Chinese, Japanese, and Thai, words are not always separated by spaces, requiring specialized
algorithms like Jieba or MeCab. In German, compound words like
"Donaudampfschifffahrtsgesellschaftskapitän" need to be split carefully. In Arabic,
tokenization must handle root-based morphology, where words undergo inflectional changes.
Modern NLP models often use subword tokenization methods such as WordPiece (used in
BERT), SentencePiece (used in T5 and ALBERT), or Unigram Language Model (used in
XLNet) to handle rare words effectively. By breaking words into smaller meaningful parts,
these models achieve better generalization in handling vocabulary.
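The greedy longest-match-first idea behind WordPiece can be sketched with a toy vocabulary (the vocabulary and the "##" continuation marker here are illustrative; BERT's actual vocabulary has roughly 30,000 entries learned from data):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization,
    in the style of WordPiece (toy vocabulary, not BERT's)."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker, as in WordPiece
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if match is None:
            return ["[UNK]"]  # no subword matched
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##happi", "##ness", "happy"}
print(wordpiece_tokenize("unhappiness", vocab))
# → ['un', '##happi', '##ness']
```

Because rare words decompose into known pieces, the model never has to treat them as entirely unknown tokens, which is the generalization benefit described above.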
2. What Are the Uses of Tokenization, Filterization, and Script Validation?
Uses of Tokenization
Tokenization is a crucial step in various NLP tasks, including:
● Text Preprocessing: Before applying machine learning algorithms, text needs to be
converted into a structured format. Tokenization enables this by breaking text into
analyzable units.
● Sentiment Analysis: Tokenizing reviews or social media posts helps identify individual
words or phrases that contribute to sentiment classification.
● Machine Translation: Tokenization allows translation models to handle words efficiently,
especially in morphologically rich languages.
● Speech Recognition: Tokenized words or subwords are mapped to phonemes, improving
transcription accuracy.
● Search Engines: Search algorithms tokenize user queries to match relevant documents in
databases.
● Chatbots and Virtual Assistants: Tokenization helps break down user input into
understandable components for response generation.
● Named Entity Recognition (NER): Tokenization assists in identifying names, locations,
and organizations from unstructured text.
Uses of Filterization
Filterization (or filtering) refers to the process of removing unwanted words, characters, or
symbols from a text. It enhances the quality of text processing by eliminating irrelevant data.
Some common applications include:
● Stopword Removal: In search engines and NLP models, frequent but unimportant words
like “the,” “is,” and “of” are removed to improve efficiency.
● Noise Reduction in Data Processing: Text documents often contain unnecessary
punctuation, special characters, or repeated words that need filtering.
● Spam Detection: Filtering out spam keywords helps classify emails or messages as spam
or non-spam.
● Normalization: Inconsistent text formatting, such as varied capitalization or redundant
whitespace, can be filtered to improve text uniformity.
● Profanity Filtering: Social media platforms use filtering techniques to remove offensive
words from user-generated content.
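Several of these filtering steps (stopword removal, noise reduction, and normalization) can be combined into one small pipeline. The stopword list below is a hand-picked sample for illustration; real systems use the much larger lists shipped with NLTK or spaCy:

```python
import re

# Illustrative stopword list (real lists contain hundreds of entries).
STOPWORDS = {"the", "is", "of", "a", "an", "and", "to", "in"}

def filter_text(text):
    # Normalization: lowercase and collapse redundant whitespace.
    text = re.sub(r"\s+", " ", text.lower()).strip()
    # Noise reduction: keep only alphanumeric tokens, dropping
    # punctuation and special characters.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stopword removal: keep only content-bearing tokens.
    return [t for t in tokens if t not in STOPWORDS]

print(filter_text("The   quick, brown fox is one of THE fastest!"))
# → ['quick', 'brown', 'fox', 'one', 'fastest']
```

The same skeleton extends naturally to spam or profanity filtering by swapping in a different blocklist for STOPWORDS.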