NLP Practical
PRACTICAL NO. 01
AIM:
Apply various text preprocessing techniques for any given text: Tokenization, Filtration, and
Script Validation.
THEORY:
In Natural Language Processing (NLP), raw text data is often noisy, unstructured, and
difficult to analyze directly. Text preprocessing is a crucial step in converting this raw data
into a cleaner and more structured format that is suitable for analysis or modelling. Three of
the most important preprocessing techniques are tokenization, filtration, and script validation.
Each serves a distinct purpose, contributing to improving data quality and relevance.
1. Tokenization
Tokenization is the process of splitting text into smaller, manageable units called tokens.
These tokens can be words, sentences, or subword units, depending on the level of
granularity required for the task at hand.
Word Tokenization: This splits text into individual words or terms. For example:
o Input: "Hello, world!"
o Output: ["Hello", "world"]
Sentence Tokenization: This splits text into sentences.
o Input: "Hello world! Welcome to NLP."
o Output: ["Hello world!", "Welcome to NLP."]
2. Filtration
Filtration refers to the process of removing irrelevant elements from the text, such as
punctuation, stop words, and inconsistencies that may affect analysis.
Removing Punctuation: Punctuation marks like commas and periods do not usually
provide meaningful information in most NLP tasks and are often removed.
Lowercasing: Converting all text to lowercase helps ensure uniformity. For example,
"Word" and "word" should be treated the same way.
Removing Stop Words: Stop words are common words like "the", "is", and "and".
These are often filtered out because they provide little to no useful information.
3. Script Validation
Script validation is the process of ensuring that the text data follows specific format rules or
contains valid characters.
Character Validation: Ensures that tokens consist only of valid characters, such as
letters or digits. For instance, we can validate that a token contains only alphabetic
characters if our focus is on standard text analysis.
Pattern Matching: Regular expressions can be used to validate certain patterns, like
email addresses or phone numbers.
CONCLUSION:
In this practical, we successfully implemented various text preprocessing techniques on a
given text. We applied tokenization to break the text into smaller units, filtration to remove
irrelevant components, and script validation to ensure data correctness. These preprocessing
techniques form the foundation for effective and efficient text analysis in NLP.
PROGRAM:
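A minimal sketch of the three steps using NLTK; it assumes the nltk package is installed and downloads the "punkt" and "stopwords" data at run time. The sample text and the e-mail pattern are chosen for illustration only.

# Practical 01: tokenization, filtration, and script validation
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)      # sentence/word tokenizer models
nltk.download("stopwords", quiet=True)  # stop word lists

text = "Hello world! Welcome to NLP. Email me at test@example.com."

# 1. Tokenization: split into sentences and words
sentences = sent_tokenize(text)
words = word_tokenize(text)

# 2. Filtration: lowercase, drop punctuation tokens and stop words
stop_words = set(stopwords.words("english"))
filtered = [w.lower() for w in words
            if w not in string.punctuation and w.lower() not in stop_words]

# 3. Script validation: keep purely alphabetic tokens; match e-mail-like patterns
alphabetic = [w for w in filtered if w.isalpha()]
emails = re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text)

print("Sentences:  ", sentences)
print("Words:      ", words)
print("Filtered:   ", filtered)
print("Alphabetic: ", alphabetic)
print("E-mails:    ", emails)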
OUTPUT:
PRACTICAL NO. 02
AIM:
Apply various other text preprocessing techniques for any given text: Stop Word Removal
and Lemmatization/Stemming.
THEORY:
Text preprocessing is a fundamental part of Natural Language Processing (NLP) that helps
prepare raw textual data for analysis. Beyond basic tokenization and filtration, techniques
such as stop word removal, lemmatization, and stemming are essential for improving the
quality and efficiency of text analysis.
1. Stop Word Removal
Stop words are commonly used words in a language that typically do not contribute
significant meaning to the text. These include words like "and," "the," "is," "in," and "on."
While they are important in forming grammatically correct sentences, they often carry little
importance in text analysis and are removed to reduce noise.
Example:
o Input: "The cat is on the mat."
o Output after stop word removal: "cat mat"
2. Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form (lemma).
Unlike stemming (which simply chops off word endings), lemmatization uses linguistic rules
to ensure that the base form is a valid word. For example, the words "running" and "ran" are
both reduced to the base form "run."
Example:
o Input: "The boys are running quickly."
o Output after lemmatization: "The boy be run quickly"
3. Stemming
Stemming is a process of reducing words to their root form by stripping affixes (such as
suffixes and prefixes) from words. Unlike lemmatization, stemming does not necessarily
produce valid words but still helps in reducing word variations.
Example:
o Input: "The boys are running quickly."
o Output after stemming: "the boy are run quickli" (note that "quickli" is not a valid English word)
PROCEDURE:
We will implement the following preprocessing steps on a sample text:
1. Stop Word Removal
2. Lemmatization
3. Stemming
CONCLUSION:
In this practical, we applied additional text preprocessing techniques: stop word removal,
lemmatization, and stemming. Stop word removal filtered out common, insignificant words
to reduce noise in the text. Lemmatization reduced words to their dictionary base forms,
while stemming truncated words to their root forms. These preprocessing steps help prepare
text data for more effective analysis in NLP tasks by reducing variability in word forms and
focusing on the most relevant information.
PROGRAM:
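A minimal sketch using NLTK's stop word list, WordNet lemmatizer, and Porter stemmer; it assumes nltk is installed and its data can be downloaded. Exact outputs depend on the tools chosen.

# Practical 02: stop word removal, lemmatization, and stemming
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

text = "The boys are running quickly."
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]

# 1. Stop word removal
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t not in stop_words]

# 2. Lemmatization: WordNet needs a POS tag, so try verb first, then noun
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(lemmatizer.lemmatize(t, pos="v"), pos="n")
          for t in content]

# 3. Stemming with the Porter algorithm
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content]

print("After stop word removal:", content)  # ['boys', 'running', 'quickly']
print("After lemmatization:    ", lemmas)   # ['boy', 'run', 'quickly']
print("After stemming:         ", stems)    # ['boy', 'run', 'quickli']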
OUTPUT:
PRACTICAL NO. 03
AIM:
Perform Morphological Analysis and Word Generation for any given text.
THEORY:
Morphological analysis and word generation are essential in Natural Language Processing
(NLP), as they help break words into morphemes—the smallest meaningful units in a
language. Morphemes are used to understand and generate new words. This analysis enables
a better understanding of how words are structured and how they can be modified to convey
different meanings or grammatical roles.
Types of Morphemes:
1. Free Morphemes:
Morphemes that can stand alone as words. Examples include "book," "walk," and
"help."
2. Bound Morphemes:
Morphemes that cannot stand alone and must be attached to other morphemes. These
include prefixes (e.g., "dis-") and suffixes (e.g., "-ed," "-ly").
Morphological Analysis:
In morphological analysis, we break down words into their base forms (stems) and identify
their affixes (prefixes or suffixes), revealing their meaning and grammatical function.
Example Words:
Let’s consider the following words for morphological analysis:
"disagreement"
"playing"
"bikes"
"driver"
Morphological Breakdown:
1. "disagreement"
o Base Word: agree
o Prefix: dis- (indicates negation or opposition)
o Suffix: -ment (indicates a state or condition)
o Morpheme Breakdown: dis- + agree + -ment
o Meaning: The state of not agreeing or being in opposition.
2. "playing"
o Base Word: play
o Suffix: -ing (indicates present participle)
o Morpheme Breakdown: play + -ing
o Meaning: The act of engaging in a playful or recreational activity.
3. "bikes"
o Base Word: bike
o Suffix: -s (indicates plural)
o Morpheme Breakdown: bike + -s
o Meaning: More than one bike.
4. "driver"
o Base Word: drive
o Suffix: -er (indicates a person who performs the action)
o Morpheme Breakdown: drive + -er
o Meaning: A person who drives.
Word Generation:
Word generation is the process of creating new words by applying morphological rules, such
as adding prefixes or suffixes, or by modifying the tense or form of the base word.
CONCLUSION:
Morphological analysis and word generation are crucial processes in NLP, as they help us
understand the structure of language and generate new words from existing morphemes.
These tasks enhance language processing in applications such as text generation, machine
translation, and sentiment analysis. Through the analysis of morphemes and the generation of
new words, we can improve language comprehension and produce more sophisticated NLP
models.
PROGRAM:
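Since no standard library performs this exact breakdown, the sketch below uses a small hand-written affix table; PREFIXES, SUFFIXES, analyze, and generate are illustrative names, and the stripping rules are deliberately naive (for instance, "driver" yields the stem "driv" rather than "drive").

# Practical 03: simple rule-based morphological analysis and word generation
# (illustrative sketch with a tiny hand-written affix table, not a full analyzer)

PREFIXES = {"dis": "negation/opposition", "un": "negation", "re": "again"}
SUFFIXES = {"ment": "state or condition", "ing": "present participle",
            "s": "plural", "er": "agent (one who does)"}

def analyze(word):
    """Strip one known prefix and one known suffix, if present."""
    prefix = suffix = None
    stem = word
    for p in PREFIXES:
        if stem.startswith(p):
            prefix, stem = p, stem[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):  # try longest suffix first
        if stem.endswith(s) and len(stem) > len(s) + 1:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix

def generate(stem, prefix=None, suffix=None):
    """Word generation: build a new word from a stem plus optional affixes."""
    return (prefix or "") + stem + (suffix or "")

for word in ["disagreement", "playing", "bikes", "driver"]:
    prefix, stem, suffix = analyze(word)
    parts = [x for x in (prefix, stem, suffix) if x]
    print(f"{word:14} -> {' + '.join(parts)}")

print("Generated:", generate("agree", prefix="dis"), generate("play", suffix="ing"))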
OUTPUT:
PRACTICAL NO. 04
AIM:
Implement Word Sense Disambiguation (WSD) using LSTM or GRU.
THEORY:
Word Sense Disambiguation (WSD) is a key task in Natural Language
Processing (NLP) that involves identifying which sense (or meaning) of a word
is used in a given context. Many words in natural language have multiple
meanings (polysemy), and determining the correct meaning based on context is
critical for tasks like machine translation, information retrieval, and question
answering.
Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory
(LSTM) and Gated Recurrent Units (GRU), are commonly used to model
sequences and capture contextual dependencies in text, making them suitable
for WSD tasks. They can learn word representations and use context from
surrounding words to predict the correct sense of a target word.
Model Overview:
Input: A sequence of words (text) where one word is ambiguous.
Output: The correct sense of the ambiguous word, based on the context
provided by the surrounding words.
LSTM and GRU models can be used to learn the context of a word in a sentence
and then classify the sense of the ambiguous word.
CONCLUSION:
In this practical, we implemented a basic LSTM-based model to perform Word
Sense Disambiguation (WSD) on a small dataset. The model learns to predict
the correct sense of an ambiguous word based on its context in the sentence. In
real-world applications, a larger, labeled dataset such as SemCor would be used
to achieve better accuracy.
PROGRAM:
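A toy sketch assuming TensorFlow/Keras; the six "bank" sentences and the two sense labels are made up for illustration, and a model trained on so little data only demonstrates the pipeline, not an accurate disambiguator.

# Practical 04: toy Word Sense Disambiguation with an LSTM
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Two senses of "bank": 0 = financial institution, 1 = river bank (toy data)
sentences = [
    "he deposited money in the bank",
    "the bank approved my loan",
    "she opened an account at the bank",
    "we sat on the bank of the river",
    "fish swam near the muddy bank",
    "the river bank was covered in grass",
]
labels = np.array([0, 0, 0, 1, 1, 1])

# Encode words as integers and pad every sentence to the same length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
X = pad_sequences(tokenizer.texts_to_sequences(sentences), maxlen=8)
vocab_size = len(tokenizer.word_index) + 1

# Embedding -> LSTM -> sigmoid: the LSTM summarizes the context around "bank"
model = models.Sequential([
    layers.Embedding(vocab_size, 16),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=50, verbose=0)

test = pad_sequences(tokenizer.texts_to_sequences(
    ["i withdrew cash from the bank"]), maxlen=8)
senses = ["financial institution", "river bank"]
pred = model.predict(test, verbose=0)[0, 0]
print("Predicted sense:", senses[int(pred > 0.5)])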
OUTPUT:
PRACTICAL NO. 05
AIM:
Implement N-gram Model for the given text input.
THEORY:
An N-gram model is a probabilistic language model used in Natural Language
Processing (NLP) and computational linguistics to predict the next item in a
sequence. It operates by analyzing the probability of a word based on the N-1 words
that came before it. In essence, an N-gram represents a contiguous
sequence of N items (words, characters, etc.) from a given text.
Types of N-grams:
1. Unigram (1-gram): One word at a time.
2. Bigram (2-gram): Sequence of two words.
3. Trigram (3-gram): Sequence of three words.
4. N-gram: A sequence of N words, for any chosen N.
For example, consider the sentence:
Sentence: "The quick brown fox"
o Unigrams: "The", "quick", "brown", "fox"
o Bigrams: "The quick", "quick brown", "brown fox"
o Trigrams: "The quick brown", "quick brown fox"
The N-gram model is widely used in text prediction, machine translation, and
speech recognition systems.
OUTPUT EXPLANATION:
For the input text "The quick brown fox jumps over the lazy dog. The quick
fox is fast.", the output will include:
1. Unigrams:
o ('The',), ('quick',), ('brown',), ('fox',), ('jumps',), ...
2. Bigrams:
o ('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ...
3. Trigrams:
o ('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox',
'jumps'), ...
4. Bigram Frequencies:
o ('The', 'quick'): 2
o ('quick', 'brown'): 1
o ('fox', 'jumps'): 1
o ('lazy', 'dog'): 1
CONCLUSION:
By using N-grams, we can understand the sequence of words in a given text and the
probability of certain word combinations. This is particularly useful in predictive text models,
where knowing the likelihood of a sequence of words can help make accurate predictions
about the next word. N-gram models are fundamental in many NLP applications, such as
speech recognition, text generation, and machine translation.
PROGRAM:
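A minimal sketch using NLTK's ngrams helper and collections.Counter; the input text matches the example above.

# Practical 05: N-grams and bigram frequencies with NLTK
from collections import Counter

import nltk
from nltk import ngrams, word_tokenize

nltk.download("punkt", quiet=True)

text = "The quick brown fox jumps over the lazy dog. The quick fox is fast."
tokens = [t for t in word_tokenize(text) if t.isalpha()]  # drop punctuation

for n, name in [(1, "Unigrams"), (2, "Bigrams"), (3, "Trigrams")]:
    print(f"{name}:", list(ngrams(tokens, n)))

# Frequency of each bigram, e.g. ('The', 'quick') occurs twice
bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common():
    print(bigram, ":", count)

# A simple bigram probability estimate: P(w2 | w1) = count(w1 w2) / count(w1)
unigram_counts = Counter(tokens)
print("P(quick | The) =", bigram_counts[("The", "quick")] / unigram_counts["The"])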
OUTPUT:
PRACTICAL NO. 06
AIM:
Study the different Part-of-Speech (POS) taggers and perform POS tagging on
the given text.
THEORY:
Part-of-Speech (POS) tagging is a core task in Natural Language Processing
(NLP) that involves assigning a grammatical label (or tag) to each word in a text
based on its syntactic role. POS tags can include categories such as nouns,
verbs, adjectives, adverbs, pronouns, and more. POS tagging helps in
understanding the grammatical structure of a sentence and is useful for tasks
like syntactic parsing, named entity recognition, text summarization, and
machine translation.
There are various POS tagging techniques and models, which use different
methods to classify the POS of words.
Types of POS Taggers:
1. Rule-Based Taggers:
o These taggers use manually created rules to determine the POS of a
word based on its context. These rules take into account the word
itself and the surrounding words to determine its grammatical role.
o Example: The Brill Tagger is a famous rule-based tagger.
2. Stochastic (Probabilistic) Taggers:
o Stochastic taggers use statistical models to assign POS tags based
on probabilities derived from a large corpus. These models often
use algorithms such as Hidden Markov Models (HMM) to predict
the most likely sequence of POS tags.
o Example: The HMM Tagger and Maximum Entropy Taggers are
well-known probabilistic taggers.
3. Neural Network-Based Taggers:
o These taggers use deep learning models to predict POS tags.
Neural networks, especially Recurrent Neural Networks (RNNs)
like LSTM and BiLSTM (Bidirectional Long Short-Term
Memory), and transformer-based models such as BERT, are
commonly used for POS tagging and provide highly accurate
results.
o Example: Models like BiLSTM-CRF or BERT are state-of-the-art
for POS tagging.
4. Hybrid Taggers:
o Hybrid taggers combine rule-based methods with probabilistic or
machine learning techniques to achieve better performance and
accuracy. This approach aims to balance between hand-crafted
rules and the power of machine learning models.
CONCLUSION:
POS tagging is a crucial task in NLP, allowing for the grammatical
classification of words within a sentence. Different POS taggers employ various
techniques, from rule-based systems to advanced neural networks. These
taggers help in analyzing and understanding text more deeply, which is
fundamental for further NLP applications like machine translation, text
summarization, and sentiment analysis.
PROGRAM:
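A minimal sketch using NLTK's averaged perceptron tagger, one of the stochastic taggers described above; it assumes nltk is installed and the tagger model can be downloaded.

# Practical 06: POS tagging with NLTK's averaged perceptron tagger
import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The quick brown fox jumps over the lazy dog."
tagged = pos_tag(word_tokenize(text))  # (word, Penn Treebank tag) pairs

for word, tag in tagged:
    print(f"{word:8} -> {tag}")
# Expected tags include DT (determiner), JJ (adjective), NN (noun), VBZ (verb)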
OUTPUT:
PRACTICAL NO. 07
AIM:
Implement Named Entity Recognition (NER) for the given text input.
THEORY:
Named Entity Recognition (NER) is a task in Natural Language Processing
(NLP) that involves identifying and classifying named entities (e.g., people,
organizations, locations, dates, quantities) mentioned in a text. The goal is to tag
each word or phrase with a label indicating whether it is a named entity and
what type of entity it is.
NER helps in various applications such as information retrieval, question
answering, summarization, and machine translation by extracting meaningful
entities from text.
Types of Named Entities:
PERSON: Names of people.
ORGANIZATION: Companies, institutions, etc.
LOCATION: Cities, countries, geographical entities.
DATE: Dates or periods.
GPE (Geopolitical Entities): Countries, cities, states.
MONEY: Monetary values.
TIME: Specific times, durations.
PERCENT: Percentage values.
NER Approaches:
1. Rule-Based Systems:
o Use hand-crafted rules and patterns to identify named entities.
2. Statistical Models:
o Use probabilistic models such as Hidden Markov Models (HMM)
or Conditional Random Fields (CRF) to predict named entities.
CONCLUSION:
Named Entity Recognition (NER) is an essential task in NLP that identifies and
classifies key entities in text. By recognizing important entities such as people,
organizations, and places, NER helps in understanding the content of a text and
extracting meaningful information for downstream applications such as
information extraction, summarization, and question answering.
In this practical, we implemented a basic NER system using spaCy, a state-of-
the-art NLP library, which successfully identified and labeled entities in the
provided text.
PROGRAM:
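A minimal sketch using spaCy, as noted in the conclusion; it assumes the small English model has been installed separately (python -m spacy download en_core_web_sm). The sample sentence is chosen to exercise several entity types.

# Practical 07: Named Entity Recognition with spaCy
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Apple Inc. was founded by Steve Jobs in California in 1976 "
        "and is now worth over $2 trillion.")

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:15} -> {ent.label_} ({spacy.explain(ent.label_)})")
# Expected entity types here include ORG, PERSON, GPE, DATE and MONEY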
OUTPUT:
PRACTICAL NO. 08
AIM:
Perform Exploratory Data Analysis (EDA) of the given text and generate a
Word Cloud.
THEORY:
Exploratory Data Analysis (EDA) is an approach to summarizing and
visualizing the important characteristics of a dataset, often used to uncover
patterns, trends, or anomalies. When applied to textual data, it involves
analyzing the frequency and distribution of words, which can reveal insights
about the text's content and structure.
One popular technique for visualizing textual data is generating a Word Cloud,
which is a visual representation where the size of each word is proportional to
its frequency in the text. The more frequently a word appears, the larger it is in
the word cloud. Word clouds provide an intuitive way to understand the most
significant words in a dataset.
EXAMPLE TEXT:
For demonstration, we will use the following sample text:
Text: "Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics to
extract meaningful insights from data."
EXPECTED OUTPUT:
The code will generate a Word Cloud visual that highlights important
words like "data," "science," "study," "programming," "knowledge," and
"statistics." Words that occur more frequently will appear larger and
bolder in the word cloud.
CONCLUSION:
In this practical, we performed exploratory data analysis on the given text and
visualized word frequencies as a word cloud. The word cloud makes the most frequent
and significant words immediately visible, providing an intuitive summary of the
text's content and a useful starting point for deeper analysis.
PROGRAM:
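A minimal sketch assuming the wordcloud and matplotlib packages are installed; it counts word frequencies for basic EDA and then renders the word cloud for the sample text above.

# Practical 08: simple EDA and a word cloud
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud

text = ("Data science is the field of study that combines domain expertise, "
        "programming skills, and knowledge of mathematics and statistics to "
        "extract meaningful insights from data.")

# Basic EDA: total word count and the most frequent content words
words = [w.strip(".,").lower() for w in text.split()]
content = [w for w in words if w not in STOPWORDS]
print("Total words:", len(words))
print("Most common:", Counter(content).most_common(5))

# Word cloud: each word's size is proportional to its frequency
cloud = WordCloud(width=800, height=400, background_color="white",
                  stopwords=STOPWORDS).generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()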
OUTPUT: