NATURAL LANGUAGE PROCESSING
(END-TO-END PIPELINE)
quratulain.ssc@stmu.edu.pk
Learning Objectives of this Topic
What is an end-to-end NLP pipeline?
Steps involved in the NLP pipeline: Data Acquisition, Text Preparation, Feature Engineering, Modeling, Deployment
Where do machine learning and deep learning fit in?
Points to remember
DATA ACQUISITION
Text data is available on websites, in emails, on social media, in PDFs, and in many other places. But the challenge is twofold: is it in a machine-readable format, and if so, is it relevant to our problem?
Public Dataset: We can search for publicly available datasets that match our problem statement.
Web Scraping: Web scraping is a technique for extracting data from websites. We can use Beautiful Soup to scrape text data from a web page.
Image to Text: We can also extract data from image files with the help of Optical Character Recognition (OCR). The Tesseract library uses OCR to convert images to text.
PDF to Text: We have multiple Python packages to convert PDF data into text. With the PyPDF2 library, PDF data can be extracted to a .txt file.
Data Augmentation: If the acquired data is not sufficient for our problem statement, we can generate synthetic data from the existing data by synonym replacement, back translation, and similar techniques.
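As a sketch of the scraping step, the visible text of a page can be pulled out with Python's standard library alone (in practice Beautiful Soup handles messy real-world HTML more robustly); the HTML string here is illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script/style tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>News</h1><script>var x=1;</script><p>NLP is fun.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # -> News NLP is fun.
```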
DATA ACQUISITION
NLP offers a number of techniques for taking a small dataset and using some tricks to create more data. These tricks are collectively called data augmentation.
Synonym replacement: Randomly choose "k" words in a sentence that are not stop words, and replace these words with their synonyms.
Bigram flipping: Divide the sentence into bigrams, take one bigram at random, and flip it. For example, in "I am going to the supermarket," we take the bigram "going to" and replace it with the flipped version: "to going."
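The two tricks above can be sketched in a few lines of Python; the synonym table and stop-word list here are toy stand-ins for a real thesaurus such as WordNet:

```python
import random

# Toy synonym table; a real system would use WordNet or similar.
SYNONYMS = {"small": ["tiny"], "data": ["dataset"]}
STOP_WORDS = {"a", "the", "to", "i", "am"}

def synonym_replace(words, k=1, rng=random):
    """Replace up to k non-stop-words with one of their synonyms."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in STOP_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(k, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def bigram_flip(words, rng=random):
    """Pick a random bigram and swap its two words."""
    out = list(words)
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(synonym_replace(["small", "data"], k=2))  # -> ['tiny', 'dataset']
print(bigram_flip("i am going to the supermarket".split()))
```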
DATA ACQUISITION
• Back translation: Translate a sentence into another language and then back into the original language; the round-trip result is a paraphrase that can be added to the training data.
[Diagram: Text Preparation — Cleansing, Basic Preprocessing, Advanced Preprocessing; shown here: HTML tag removal under Cleansing]
TEXT PREPARATION
Pre-processing refers to the series of steps taken to clean and transform raw data before it is used for natural language processing (NLP) tasks. Preprocessing is important because it helps standardize the data, remove noise, and enhance the quality of the data for better analysis and modeling.
Text Cleaning
Sometimes our acquired data is not very clean: it may contain HTML tags, spelling mistakes, or special characters. So, let's look at some techniques to clean our text data.
Unicode Normalization: Text data may contain symbols, emojis, graphic characters, or other special characters; Unicode normalization converts them to a canonical, machine-readable form.
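A minimal Unicode-normalization sketch using the standard library's unicodedata module (the sample string is illustrative, and dropping non-ASCII characters outright is only one of several possible policies):

```python
import unicodedata

def normalize_text(text):
    """NFKD-normalize, then drop characters that don't map to ASCII
    (accents, emojis, typographic symbols)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(normalize_text("café résumé"))  # -> cafe resume
```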
import re
pattern = re.compile(r"<[^>]+>")  # Compile the pattern for substitution
cleaned = pattern.sub("", "<p>Hello</p>")  # Replace all occurrences of the pattern with ""
In the world of fast typing and fat-finger typing, incoming text data often contains spelling errors, so a spelling-checker step is applied as part of cleansing.
OPTIONAL PREPROCESSING
Stop word removal: Words used for sentence formation, e.g., "and," "for," "the," "an," are removed.
Stemming refers to the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form (e.g., "car" and "cars" are both reduced to "car").
Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. While this seems close to the definition of stemming, they are, in fact, different. For example, the adjective "better," when stemmed, remains the same; upon lemmatization, however, it becomes "good."
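The contrast can be illustrated with a toy rule-based stemmer and a toy dictionary lemmatizer; real pipelines would use mature tools (e.g., NLTK's Porter stemmer and a WordNet-based lemmatizer), and the suffix rules and lemma table below are simplified assumptions:

```python
# Toy suffix-stripping stemmer: crude, rule-based, no dictionary.
def toy_stem(word):
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: dictionary lookup handles irregular forms.
LEMMAS = {"better": "good", "ran": "run", "cars": "car"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(toy_stem("cars"), toy_stem("better"))            # -> car better
print(toy_lemmatize("cars"), toy_lemmatize("better"))  # -> car good
```

Note how the stemmer leaves "better" untouched while the lemmatizer maps it to "good," exactly the distinction described above.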
OPTIONAL PREPROCESSING
Text normalization is the process of converting written text into a standardized form
that's easier to process, analyze, and understand.
Some common steps in text normalization are converting all text to lowercase or uppercase, converting digits to words (e.g., 9 to nine), expanding abbreviations, and so on.
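A minimal normalization sketch covering those steps; the abbreviation and digit tables are illustrative assumptions, not a standard resource:

```python
# Toy lookup tables; real systems use much larger resources.
DIGITS = {"3": "three", "9": "nine"}
ABBREV = {"u": "you", "r": "are"}

def normalize(text):
    """Lowercase, expand abbreviations, convert digits to words."""
    words = text.lower().split()
    words = [ABBREV.get(w, w) for w in words]
    words = [DIGITS.get(w, w) for w in words]
    return " ".join(words)

print(normalize("R u coming at 9"))  # -> are you coming at nine
```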
[Diagram] Text Preparation:
- Cleansing: HTML tags removal, emoji handling, spelling checker
- Basic Preprocessing — Basic: tokenization (word, sentence); Optional: stop word removal, stemming, removing punctuation and digits, lowercasing, language detection
- Advanced Preprocessing: POS tagging, parsing, coreference resolution
TEXT PREPARATION
Our cleaned text data may contain a group of sentences, and each sentence is a group of words. So, first, we need to tokenize our text data.
Tokenization: Tokenization is the process of segmenting text into a list of tokens. In sentence tokenization, the token is a sentence; in word tokenization, it is a word. It is a good idea to perform sentence tokenization first and then word tokenization, so the output is a list of lists. Tokenization is performed in every NLP pipeline.
Lowercasing: This step converts all text to lowercase letters. It is useful in NLP tasks such as text classification, information retrieval, and sentiment analysis.
Stop word removal: Stop words are commonly occurring words in a language, such as "the," "and," and "a." They are usually removed from the text during preprocessing.
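The three steps above can be sketched as a small pipeline; the regex-based tokenizers and the stop-word set are simplified assumptions (production code would use a proper tokenizer such as NLTK's or spaCy's):

```python
import re

STOP_WORDS = {"the", "and", "a", "an", "to", "is"}

def sentence_tokenize(text):
    """Naive split on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def word_tokenize(sentence):
    """Lowercase, then keep alphabetic word characters."""
    return re.findall(r"[a-z']+", sentence.lower())

def preprocess(text):
    """Sentence tokenize, then word tokenize + lowercase + stop word removal.
    Returns a list of lists, one inner list per sentence."""
    return [[w for w in word_tokenize(s) if w not in STOP_WORDS]
            for s in sentence_tokenize(text)]

print(preprocess("The cat sat. The dog barked!"))
# -> [['cat', 'sat'], ['dog', 'barked']]
```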
TEXT PREPARATION
Stemming or lemmatization: Stemming and lemmatization are used to reduce words to their base form, which can help reduce the vocabulary size and simplify the text. Stemming involves stripping the suffixes from words to get their stem, whereas lemmatization involves reducing words to their base form based on their part of speech. This step is commonly used in NLP tasks such as text classification, information retrieval, and topic modeling.
Removing digits/punctuation: This step removes digits and punctuation from the text. This is useful in NLP tasks such as text classification, sentiment analysis, and topic modeling.
POS tagging: POS tagging involves assigning a part-of-speech tag to each word in a text. This step is commonly used in NLP tasks such as named entity recognition, sentiment analysis, and machine translation.
Named Entity Recognition (NER): NER involves identifying and classifying named entities in text, such as people, organizations, and locations. This step is commonly used in various NLP tasks.
ADVANCED PREPROCESSING
CO-REFERENCE RESOLUTION
Christopher Robin is alive and well. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book.
(Coreference resolution determines that "He," "Chris," and "his" all refer to the same entity, Christopher Robin.)
[Diagram] Feature Engineering — Text Representation: converting your text into numbers, for both machine learning and deep learning models.
FEATURE ENGINEERING
In feature engineering, our main agenda is to represent the text as a numeric vector in such a way that the ML algorithm can understand the text attribute. In NLP, this process is known as Text Representation or Text Vectorization.
There are two common approaches to text representation.
1. Classical or Traditional Approach:
In the traditional approach, we create a vocabulary of unique words, assign a unique id (integer value) to each word, and then replace each word of a sentence with its unique id. Each word of the vocabulary is treated as a feature, so when the vocabulary is large, the feature size becomes very large, which makes things tough for the ML model.
One-Hot Encoding:
One-hot encoding represents each token as a binary vector. First, each token is mapped to an integer value; then each integer value is represented as a binary vector in which all values are 0 except the index of the integer, which is marked 1.
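A minimal sketch of the integer-id vocabulary plus one-hot encoding described above:

```python
def build_vocab(tokens):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
    return vocab

def one_hot(token, vocab):
    """Binary vector: all zeros except a 1 at the token's integer id."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

tokens = "dog bites man".split()
vocab = build_vocab(tokens)   # {'dog': 0, 'bites': 1, 'man': 2}
print(one_hot("man", vocab))  # -> [0, 0, 1]
```

Note that the vector length equals the vocabulary size, which is exactly why a large vocabulary blows up the feature size.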
FEATURE ENGINEERING
Feature engineering is an integral step in any ML pipeline. Feature engineering converts the raw data into a format that can be consumed by a machine learning model.
[Diagram: a dataset of over 50,000 tweets represented as a numeric feature matrix]
Modelling
[Diagram] Modelling approaches — Heuristic, Machine Learning, Deep Learning, Cloud API — chosen based on the amount of data and the nature of the problem.
[Diagram] Model assessment — Intrinsic Evaluation (technical criteria, i.e., metrics) and Extrinsic Evaluation (business criteria).
EVALUATION
Intrinsic and extrinsic evaluation are two primary approaches to assess the quality
and performance of machine learning models, especially in natural language
processing (NLP) and other AI applications. Here’s a breakdown of each:
Intrinsic Evaluation
•Definition: Measures a model's performance by directly evaluating the quality of
its output on specific tasks without considering its application in real-world
scenarios.
•Goal: Focuses on internal model metrics, like accuracy, precision, recall, F1 score,
or other task-specific benchmarks.
•Examples:
• Word Embeddings: Evaluating word similarity scores between embeddings
(e.g., using cosine similarity).
• Language Models: Measuring perplexity on a text dataset.
• Text Classification: Calculating classification accuracy on a labeled test set.
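As a sketch of how such intrinsic metrics are computed, precision, recall, and F1 for a binary classifier can be written from scratch (libraries such as scikit-learn provide these in practice; the label vectors below are illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)  # precision = recall = F1 = 2/3 here
```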
EVALUATION
Extrinsic Evaluation
•Definition: Measures a model's performance based on how well it performs within a
larger, real-world application or downstream task.
•Goal: Evaluates how the model contributes to the performance of a broader task,
often providing insight into practical utility.
•Examples:
• Information Retrieval: Testing how well embeddings improve document
retrieval accuracy.
• Machine Translation: Evaluating if a language model enhances translation
quality in a complete system.
• Sentiment Analysis: Testing a sentiment classifier’s impact on user
engagement in a recommendation system.
EVALUATION
The evaluation metric depends on the type of NLP task or problem. Here are some popular evaluation methods by NLP task:
•Classification: Accuracy, Precision, Recall, F1-score, AUC
•Sequence Labelling: F1-score
•Information Retrieval: Mean Reciprocal Rank (MRR), Mean Average Precision (MAP)
•Text Summarization: ROUGE
•Regression (e.g., stock market price prediction, temperature prediction): Root Mean Square Error, Mean Absolute Percentage Error
•Text Generation: BLEU (Bilingual Evaluation Understudy), Perplexity
•Machine Translation: BLEU, METEOR
Deployment
Draw non-crossing PIPELINES between three
houses and three utility companies.
THANK YOU!