
NATURAL LANGUAGE PROCESSING
(END-TO-END PIPELINE)

Instructor: Ms. Qurat-Ul-Ain
quratulain.ssc@stmu.edu.pk
Learning Objectives of this Topic

• Machine Learning
• Deep Learning
• What is an End-to-End NLP Pipeline?
• Points to Remember
• Steps Involved in the NLP Pipeline
• Data Acquisition
• Text Preparation
• Feature Engineering
• Modelling
• Deployment
So what about ML and Deep Learning?

• Conversational AI uses Machine Learning (ML) and Deep Learning (DL) to complete many high-level NLP tasks.
Machine Learning Basics

• Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.

[Diagram: in the training phase, labeled data is fed to a machine learning algorithm to produce a learned model; in the prediction phase, new (labeled or unlabeled) data is passed through the learned model to produce predictions.]

• Machine learning methods can learn from and make predictions on data.
Types of Learning

• Supervised: learning with a labeled training set. Example: email classification with already-labeled emails.
• Unsupervised: discover patterns in unlabeled data. Example: cluster similar documents based on their text.
• Reinforcement learning: learn to act based on feedback/reward. Example: learn to play Go, where the reward is a win or a loss.

[Diagram: example task types — classification, regression, clustering, anomaly detection, sequence labeling.]
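
To make the supervised case concrete, here is a minimal sketch of labeled-email classification, assuming scikit-learn is installed; the tiny email/label lists are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labeled training set: each email comes with a spam/ham label
emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "claim your free reward", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)        # training: text -> numeric features
model = MultinomialNB().fit(X, labels)      # training: learn a model
print(model.predict(vectorizer.transform(["claim a free prize"])))  # prediction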



What is Deep Learning?
• A machine learning subfield concerned with learning representations of data.
• Exceptionally effective at learning patterns.
• Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
• If you provide the system tons of information, it begins to understand it and respond in useful ways.
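
As an illustration of "a hierarchy of multiple layers", here is a minimal sketch of a stacked text model, assuming TensorFlow/Keras is installed; the layer sizes are arbitrary choices:

from tensorflow.keras import layers, models

# Each layer learns a higher-level representation of the layer below it
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),  # words -> vectors
    layers.LSTM(32),                                   # sequence -> summary vector
    layers.Dense(1, activation="sigmoid"),             # summary -> prediction
])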
WHAT IS AN END-TO-END NLP PIPELINE?

POINTS TO REMEMBER
• The NLP pipeline is not universal.
• It varies from task to task.
• Deep learning pipelines are slightly different.
• The NLP pipeline is not linear.
STEPS INVOLVED IN THE NLP PIPELINE

Data Acquisition → Text Preparation → Feature Engineering → Modelling → Deployment

Data Acquisition

[Diagram: three data situations —
• Available data (tables, database): handled with data engineering;
• Less data: data augmentation (synonym replacement, bigram flip, back translation, adding noise);
• No data: public datasets, web scraping (BeautifulSoup), APIs (RapidAPI), PDF tools, image (OCR), audio (speech to text).]
DATA ACQUISITION
Text data is available on websites, in emails, on social media, in PDFs, and in many more places. But the challenge is: is it in a machine-readable format? And if it is, is it relevant to our problem?
Public datasets: We can search for publicly available data that matches our problem statement.
Web scraping: Web scraping is a technique to extract data from a website. For this, we can use Beautiful Soup to scrape the text data from a web page.
Image to text: We can also extract data from image files with the help of optical character recognition (OCR). The Tesseract library uses OCR to convert images to text data.
PDF to text: We have multiple Python packages to convert PDF data into text. With the PyPDF2 library, PDF data can be extracted into a text file.
Data augmentation: If our acquired data is not sufficient for our problem statement, we can generate synthetic data from the existing data by synonym replacement, back translation, bigram flipping, or adding noise.
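
A minimal web-scraping sketch with requests and Beautiful Soup (the URL is hypothetical):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/article").text   # hypothetical page
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ", strip=True)  # visible text, HTML tags stripped
print(text[:200])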
DATA ACQUISITION

NLP has a bunch of techniques through which we can take a small dataset and use some tricks to create more data. These tricks are also called data augmentation.
Synonym replacement: Randomly choose "k" words in a sentence that are not stop words. Replace these words with their synonyms.
Bigram flipping: Divide the sentence into bigrams. Take one bigram at random and flip it.
For example: "I am going to the supermarket." Here, we take the bigram "going to" and replace it with the flipped one: "to going."
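
A minimal sketch of bigram flipping in plain Python (randomly swapping one adjacent word pair, per the description above):

import random

def bigram_flip(sentence):
    words = sentence.split()
    i = random.randrange(len(words) - 1)             # pick a bigram at random
    words[i], words[i + 1] = words[i + 1], words[i]  # flip it
    return " ".join(words)

print(bigram_flip("I am going to the supermarket."))  # e.g. "I am to going the supermarket."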
DATA ACQUISITION
• Back translation: Translate a sentence into another language and then back into the original language.
• Adding noise to data: Deliberately introduce variations or disruptions into the training data.
  • Original sentence: "The quick brown fox jumps over the lazy dog."
  • Noisy sentence: "The quikc brown fxo jumps oevr teh lazy dg."
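
A minimal sketch of adding typo-style noise by swapping adjacent characters inside random words (the probability p is an arbitrary choice):

import random

def add_noise(sentence, p=0.3):
    noisy = []
    for word in sentence.split():
        if len(word) > 3 and random.random() < p:
            i = random.randrange(1, len(word) - 2)                  # interior position
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]  # swap two characters
        noisy.append(word)
    return " ".join(noisy)

print(add_noise("The quick brown fox jumps over the lazy dog."))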
Text Preparation

[Diagram: Text Preparation splits into Cleansing, Basic Preprocessing, and Advanced Preprocessing; Cleansing begins with HTML tag removal.]
TEXT PREPARATION

Preprocessing refers to the series of steps taken to clean and transform raw data before it is used for natural language processing (NLP) tasks. Preprocessing is important because it helps standardize the data, remove noise, and enhance the quality of the data for better analysis and modeling.
Text Cleaning
Sometimes our acquired data is not very clean: it may contain HTML tags, spelling mistakes, or special characters. So, let's see some techniques to clean our text data.
Unicode normalization: Text data may contain symbols, emojis, graphic characters, or other special characters; Unicode normalization converts them into a consistent, machine-readable form.

import re
# Compile the pattern for substitution (here, a pattern matching HTML tags)
pattern = re.compile(r"<.*?>")
# Replace all occurrences of the pattern with a specified replacement
clean_text = pattern.sub("", "<p>Hello <b>world</b></p>")  # -> "Hello world"
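
A minimal Unicode-normalization sketch using Python's built-in unicodedata module (the sample string is made up):

import unicodedata

text = "café visit 🍕"
# NFKD decomposition + ASCII encoding drops accents, emojis, and other symbols
ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
print(ascii_text)  # "cafe visit "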
[Diagram update: Cleansing expands to HTML tag removal, emoji handling, and spelling checking.]

In the world of fast typing and fat-finger typing, incoming text data often contains spelling errors.
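
One possible sketch of spelling correction, assuming the TextBlob library is installed:

from textblob import TextBlob

typo_text = "The quikc brown fxo jumps oevr teh lazy dog."
print(TextBlob(typo_text).correct())  # attempts to fix the misspelled words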
[Diagram update: Basic Preprocessing adds tokenization, split into word tokenization and sentence tokenization.]
OPTIONAL PREPROCESSING

Stop word removal: Words used mainly for sentence formation, e.g. "and", "for", "the", "an", are removed.
Stemming refers to the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form (e.g., "car" and "cars" are both reduced to "car").
Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma.
While this seems close to the definition of stemming, they are, in fact, different. For example, the adjective "better", when stemmed, remains the same. However, upon lemmatization, it becomes "good".
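
A minimal sketch contrasting stemming and lemmatization, assuming NLTK and its WordNet data are installed:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet data
print(PorterStemmer().stem("cars"))                      # "car"
print(PorterStemmer().stem("better"))                    # unchanged: "better"
print(WordNetLemmatizer().lemmatize("better", pos="a"))  # lemma: "good"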
OPTIONAL PREPROCESSING

Text normalization is the process of converting written text into a standardized form that is easier to process, analyze, and understand.
Some common steps for text normalization are to convert all text to lowercase or uppercase, convert digits to text (e.g., 9 to nine), expand abbreviations, and so on.
[Diagram update: Basic Preprocessing splits into Basic (word and sentence tokenization) and Optional (stopword removal, stemming, removing punctuation and digits, lowercasing, language detection).]
[Diagram: the complete Text Preparation tree — Cleansing (HTML tag removal, emoji handling, spelling checker); Basic Preprocessing: Basic (word and sentence tokenization) and Optional (stopword removal, stemming, removing punctuation and digits, lowercasing, language detection); Advanced Preprocessing (POS tagging, parsing, coreference resolution).]
TEXT PREPARATION
Our cleaned text data may contain a group of sentences, and each sentence is a group of words. So, first, we need to tokenize our text data.
Tokenization: Tokenization is the process of segmenting the text into a list of tokens. In the case of sentence tokenization, the token will be a sentence; in the case of word tokenization, it will be a word. It is a good idea to first complete sentence tokenization and then word tokenization; here the output will be a list of lists. Tokenization is performed in every NLP pipeline.
Lowercasing: This step converts all the text to lowercase letters. This is useful in various NLP tasks such as text classification, information retrieval, and sentiment analysis.
Stop word removal: Stop words are commonly occurring words in a language, such as "the", "and", "a", etc. They are usually removed from the text during preprocessing.
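
A minimal tokenization sketch, assuming NLTK and its punkt tokenizer data are installed; note how sentence-then-word tokenization produces a list of lists:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
text = "NLP is fun. Tokenization comes first."
sentences = sent_tokenize(text)                 # ["NLP is fun.", "Tokenization comes first."]
tokens = [word_tokenize(s) for s in sentences]  # list of lists of words
print(tokens)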
TEXT PREPARATION
Stemming or lemmatization: Stemming and lemmatization are used to reduce words to their base form, which can help reduce the vocabulary size and simplify the text. Stemming involves stripping the suffixes from words to get their stem, whereas lemmatization involves reducing words to their base form based on their part of speech. This step is commonly used in various NLP tasks such as text classification, information retrieval, and topic modeling.
Removing digits/punctuation: This step is used to remove digits and punctuation from the text. This is useful in various NLP tasks such as text classification, sentiment analysis, and topic modeling.
POS tagging: POS tagging involves assigning a part-of-speech tag to each word in a text. This step is commonly used in various NLP tasks such as named entity recognition, sentiment analysis, and machine translation.
Named Entity Recognition (NER): NER involves identifying and classifying named entities in text, such as people, organizations, and locations. This step is commonly used in various NLP tasks.
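
A minimal NER sketch, assuming spaCy and its small English model (en_core_web_sm) are installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Chaplin wrote, directed, and composed the music for most of his films.")
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. [("Chaplin", "PERSON")]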
ADVANCED PREPROCESSING

• POS (part-of-speech) tagging is the process of assigning a part of speech to each word in a sentence based on its definition and context.
• Each tag represents the role a word plays in the sentence, which helps in understanding its syntactic and grammatical structure.
Here are some common POS tags, with examples based on the sentence: "Chaplin wrote, directed, and composed the music for most of his films."
NN - Noun, singular: a singular noun — a person, place, thing, or concept. Example: music, film
NNS - Noun, plural: a plural noun. Example: films
NNP - Proper noun, singular: a specific name of a person, place, or organization. Example: Chaplin
VB - Verb, base form: the base form of a verb. Example: compose (lemmatized form)
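
A minimal POS-tagging sketch for the same sentence, assuming NLTK and its tagger data are installed:

import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
print(pos_tag(word_tokenize("Chaplin wrote, directed, and composed the music for most of his films.")))
# e.g. [("Chaplin", "NNP"), ("wrote", "VBD"), ..., ("films", "NNS")]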
COREFERENCE RESOLUTION
Christopher Robin is alive and well. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book.
Coreference resolution links mentions such as "He", "Chris", and "his" back to the entity they refer to: Christopher Robin.
Feature Engineering

[Diagram: Feature Engineering is text representation — converting your text into numbers — whether the downstream approach is machine learning or deep learning.]
FEATURE ENGINEERING
In feature engineering, our main agenda is to represent the text as a numeric vector in such a way that the ML algorithm can understand the text attributes. In NLP, this process is known as text representation or text vectorization.
There are two common approaches to text representation.
1. Classical or Traditional Approach:
In the traditional approach, we create a vocabulary of unique words and assign a unique id (an integer value) to each word, then replace each word of a sentence with its unique id. Here, each word of the vocabulary is treated as a feature. So, when the vocabulary is large, the feature size becomes very large, which makes it tough for the ML model.
One-Hot Encoding:
One-hot encoding represents each token as a binary vector. First, each token is mapped to an integer value; then each integer value is represented as a binary vector in which all values are 0 except at the index of the integer, which is marked with 1.
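
A minimal one-hot encoding sketch in plain Python (the toy vocabulary is made up):

vocab = {"the": 0, "cat": 1, "sat": 2}   # token -> unique integer id

def one_hot(token):
    vec = [0] * len(vocab)               # all zeros...
    vec[vocab[token]] = 1                # ...except at the token's index
    return vec

print(one_hot("cat"))  # [0, 1, 0]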
FEATURE ENGINEERING
Feature engineering is an integral step in any ML pipeline. Feature engineering steps convert the raw data into a format that can be consumed by a machine.

[Example: a feature matrix built from over 50,000 tweets, one row per tweet and one column per feature.]
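
A minimal sketch of turning a handful of texts into such a feature matrix with scikit-learn's bag-of-words vectorizer (the tweets are made up):

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["nlp pipelines are fun", "nlp is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)      # one row per tweet, one column per word
print(vectorizer.get_feature_names_out())
print(X.toarray())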
Modelling

[Diagram: Modelling covers applying a model and evaluating it. Ways to apply a model: heuristics, machine learning, deep learning, and cloud APIs; the right choice depends on the amount of data and the nature of the problem. Evaluation splits into intrinsic evaluation (technical criteria/metrics) and extrinsic evaluation (business criteria).]
EVALUATION

Intrinsic and extrinsic evaluation are the two primary approaches to assessing the quality and performance of machine learning models, especially in natural language processing (NLP) and other AI applications. Here's a breakdown of each:
Intrinsic Evaluation
• Definition: Measures a model's performance by directly evaluating the quality of its output on specific tasks, without considering its application in real-world scenarios.
• Goal: Focuses on internal model metrics, like accuracy, precision, recall, F1 score, or other task-specific benchmarks.
• Examples:
  • Word embeddings: evaluating word similarity scores between embeddings (e.g., using cosine similarity).
  • Language models: measuring perplexity on a text dataset.
  • Text classification: calculating classification accuracy on a labeled test set.
EVALUATION

Extrinsic Evaluation
• Definition: Measures a model's performance based on how well it performs within a larger, real-world application or downstream task.
• Goal: Evaluates how the model contributes to the performance of a broader task, often providing insight into practical utility.
• Examples:
  • Information retrieval: testing how well embeddings improve document retrieval accuracy.
  • Machine translation: evaluating whether a language model enhances translation quality in a complete system.
  • Sentiment analysis: testing a sentiment classifier's impact on user engagement in a recommendation system.
EVALUATION

• Intrinsic evaluation focuses on intermediary objectives, while extrinsic evaluation focuses on performance on the final objective.
• For example, consider a spam-classification system.
• The ML metrics will be precision and recall, while the business metric will be "the amount of time users spent on a spam email."
• Intrinsic evaluation will focus on measuring the system performance using precision and recall.
• Extrinsic evaluation will focus on measuring the time a user wasted because a spam email went to their inbox or a genuine email went to their spam folder.
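
A minimal sketch of the intrinsic side — computing precision and recall with scikit-learn on made-up spam predictions:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0]   # 1 = spam, 0 = not spam (ground truth)
y_pred = [1, 0, 0, 1, 1]   # model predictions
print(precision_score(y_true, y_pred))  # 2 of 3 predicted spam are truly spam
print(recall_score(y_true, y_pred))     # 2 of 3 true spam were caught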
EVALUATION

The evaluation metric depends on the type of NLP task or problem. Here are some of the popular evaluation methods by NLP task:
• Classification: Accuracy, Precision, Recall, F1-score, AUC
• Sequence labelling: F1-score
• Information retrieval: Mean Reciprocal Rank (MRR), Mean Average Precision (MAP)
• Text summarization: ROUGE
• Regression (stock market price prediction, temperature prediction): Root Mean Square Error, Mean Absolute Percentage Error
• Text generation: BLEU (Bilingual Evaluation Understudy), Perplexity
• Machine translation: BLEU (Bilingual Evaluation Understudy), METEOR
Deployment

Deploy → Monitoring → Update

DEPLOYMENT
Making a trained NLP model usable in a production setting is known as deployment. The precise deployment process can vary based on the platform and use case; however, the following are some typical steps that may be involved:
1. Export the trained model
2. Prepare the input pipeline
3. Set up the inference service
4. Monitor performance and scale
5. Continuous improvement
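
A minimal inference-service sketch with Flask, assuming a trained scikit-learn pipeline has been exported to a (hypothetical) model.pkl:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # hypothetical exported model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    return jsonify({"label": str(model.predict([text])[0])})

if __name__ == "__main__":
    app.run()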
Draw non-crossing PIPELINES between three houses and three utility companies.

THANK YOU!
