Gitika Mandal BE4 A 17 NLP EXP1
PRN: 21UF16885CM030
Roll No: 17
Batch: A
AIM: Apply various text preprocessing techniques to any given text:
Tokenization, Filtration, and Script Validation.
INPUT/OUTPUT:
WRITE UP:
1. What is Meant by Word Tokenization?
Tokenization is a fundamental process in Natural Language Processing (NLP) that involves
breaking down a text into smaller units, known as tokens. These tokens can be words, phrases,
subwords, or even characters, depending on the level of granularity required. Tokenization
serves as a preliminary step in text processing and is crucial for various NLP applications such
as text analysis, machine translation, and information retrieval.
There are several common types of tokenization:
● Word Tokenization: This involves splitting a sentence or text into individual words. For
instance, the sentence "NLP is fascinating." would be tokenized into ["NLP", "is",
"fascinating", "."].
● Subword Tokenization: This breaks words into smaller meaningful units, which is useful
for handling complex words, rare words, and morphological variations. An example is
the use of Byte-Pair Encoding (BPE) in transformer-based models such as GPT, where the
word “unhappiness” may be split into ["un", "happiness"].
● Sentence Tokenization: This divides a text into separate sentences. For instance, the text
"Hello! How are you? I’m fine." would be tokenized into ["Hello!", "How are you?", "I’m
fine."].
● Character Tokenization: This involves splitting a word into individual characters, useful
for languages without clear word boundaries, such as Chinese or Japanese.
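The granularities above can be sketched in plain Python. The following is a rough regex-based illustration, not a production tokenizer; real pipelines typically use libraries such as NLTK or spaCy:

```python
import re

# Word tokenization: capture runs of word characters, keeping
# punctuation marks as separate tokens.
words = re.findall(r"\w+|[^\w\s]", "NLP is fascinating.")
# → ['NLP', 'is', 'fascinating', '.']

# Sentence tokenization: a naive split on sentence-ending punctuation.
text = "Hello! How are you? I'm fine."
sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
# → ['Hello!', 'How are you?', "I'm fine."]

# Character tokenization: every character becomes a token.
chars = list("NLP")
# → ['N', 'L', 'P']
```

Note that this naive sentence splitter would fail on abbreviations such as "Dr." or "e.g.", which is exactly why dedicated sentence tokenizers exist.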
Tokenization is more complex than simple splitting based on spaces or punctuation. Many
languages have unique challenges:
In Chinese, Japanese, and Thai, words are not always separated by spaces, requiring specialized
algorithms like Jieba or MeCab. In German, compound words like
"Donaudampfschifffahrtsgesellschaftskapitän" need to be split carefully. In Arabic,
tokenization must handle root-based morphology, where words undergo inflectional changes.
Modern NLP models often use subword tokenization methods such as WordPiece (used in
BERT), SentencePiece (used in T5 and ALBERT), or Unigram Language Model (used in
XLNet) to handle rare words effectively. By breaking words into smaller meaningful parts,
these models achieve better generalization in handling vocabulary.
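The greedy longest-match-first idea behind WordPiece can be sketched with a toy vocabulary (the vocabulary and the "##" continuation marker here are illustrative; BERT's actual vocabulary has roughly 30,000 entries learned from data):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization,
    in the style of WordPiece (toy vocabulary, not BERT's)."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker, as in WordPiece
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if match is None:
            return ["[UNK]"]  # no subword matched
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##happi", "##ness", "happy"}
print(wordpiece_tokenize("unhappiness", vocab))
# → ['un', '##happi', '##ness']
```

Because rare words decompose into known pieces, the model never has to treat them as entirely unknown tokens, which is the generalization benefit described above.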
2. What Are the Uses of Tokenization, Filterization, and Script Validation?
Uses of Tokenization
Tokenization is a crucial step in various NLP tasks, including:
● Text Preprocessing: Before applying machine learning algorithms, text needs to be
converted into a structured format. Tokenization enables this by breaking text into
analyzable units.
● Sentiment Analysis: Tokenizing reviews or social media posts helps identify individual
words or phrases that contribute to sentiment classification.
● Machine Translation: Tokenization allows translation models to handle words efficiently,
especially in morphologically rich languages.
● Speech Recognition: Tokenized words or subwords are mapped to phonemes, improving
transcription accuracy.
● Search Engines: Search algorithms tokenize user queries to match relevant documents in
databases.
● Chatbots and Virtual Assistants: Tokenization helps break down user input into
understandable components for response generation.
● Named Entity Recognition (NER): Tokenization assists in identifying names, locations,
and organizations from unstructured text.
Uses of Filterization
Filterization (or filtering) refers to the process of removing unwanted words, characters, or
symbols from a text. It enhances the quality of text processing by eliminating irrelevant data.
Some common applications include:
● Stopword Removal: In search engines and NLP models, frequent but unimportant words
like “the,” “is,” and “of” are removed to improve efficiency.
● Noise Reduction in Data Processing: Text documents often contain unnecessary
punctuation, special characters, or repeated words that need filtering.
● Spam Detection: Filtering out spam keywords helps classify emails or messages as spam
or non-spam.
● Normalization: Inconsistent text formatting, such as varied capitalization or redundant
whitespace, can be filtered to improve text uniformity.
● Profanity Filtering: Social media platforms use filtering techniques to remove offensive
words from user-generated content.
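Several of these filtering steps (stopword removal, noise reduction, and normalization) can be combined into one small pipeline. The stopword list below is a hand-picked sample for illustration; real systems use the much larger lists shipped with NLTK or spaCy:

```python
import re

# Illustrative stopword list (real lists contain hundreds of entries).
STOPWORDS = {"the", "is", "of", "a", "an", "and", "to", "in"}

def filter_text(text):
    # Normalization: lowercase and collapse redundant whitespace.
    text = re.sub(r"\s+", " ", text.lower()).strip()
    # Noise reduction: keep only alphanumeric tokens, dropping
    # punctuation and special characters.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stopword removal: keep only content-bearing tokens.
    return [t for t in tokens if t not in STOPWORDS]

print(filter_text("The   quick, brown fox is one of THE fastest!"))
# → ['quick', 'brown', 'fox', 'one', 'fastest']
```

The same skeleton extends naturally to spam or profanity filtering by swapping in a different blocklist for STOPWORDS.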