NATURAL LANGUAGE PROCESSING
(END-TO-END PIPELINE)
quratulain.ssc@stmu.edu.pk
Learning Objectives of this Topic
What is an end-to-end NLP pipeline?
Steps involved in the NLP pipeline: Data Acquisition, Text Preparation, Feature Engineering, Modeling, Deployment
Where do machine learning and deep learning fit in?
Points to remember
DATA ACQUISITION
Text data is available on websites, in emails, on social media, in PDFs, and in many other places. But the challenge is twofold: is it in a machine-readable format, and if so, is it relevant to our problem?
Public Dataset: We can search for publicly available datasets that match our problem statement.
Web Scraping: Web scraping is a technique for extracting data from websites. We can use Beautiful Soup to scrape text data from a web page.
Image to Text: We can also extract data from image files with the help of Optical Character Recognition (OCR). The Tesseract library uses OCR to convert images to text.
PDF to Text: We have multiple Python packages to convert PDF data into text. With the PyPDF2 library, PDF data can be extracted to a .txt file.
Data Augmentation: If the acquired data is not sufficient for our problem statement, we can generate synthetic data from the existing data by synonym replacement, back translation, and similar techniques.
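As a sketch of the scraping step, the visible text of a page can be pulled out with Python's standard library alone (in practice Beautiful Soup handles messy real-world HTML more robustly); the HTML string here is illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script/style tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>News</h1><script>var x=1;</script><p>NLP is fun.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # -> News NLP is fun.
```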
DATA ACQUISITION
NLP offers a number of techniques for taking a small dataset and using some tricks to create more data. These tricks are collectively called data augmentation.
Synonym replacement: Randomly choose "k" words in a sentence that are not stop words, and replace these words with their synonyms.
Bigram flipping: Divide the sentence into bigrams, take one bigram at random, and flip it. For example, in "I am going to the supermarket," we take the bigram "going to" and replace it with the flipped version: "to going."
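The two tricks above can be sketched in a few lines of Python; the synonym table and stop-word list here are toy stand-ins for a real thesaurus such as WordNet:

```python
import random

# Toy synonym table; a real system would use WordNet or similar.
SYNONYMS = {"small": ["tiny"], "data": ["dataset"]}
STOP_WORDS = {"a", "the", "to", "i", "am"}

def synonym_replace(words, k=1, rng=random):
    """Replace up to k non-stop-words with one of their synonyms."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in STOP_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(k, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def bigram_flip(words, rng=random):
    """Pick a random bigram and swap its two words."""
    out = list(words)
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(synonym_replace(["small", "data"], k=2))  # -> ['tiny', 'dataset']
print(bigram_flip("i am going to the supermarket".split()))
```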
DATA ACQUISITION
• Back translation: Translate a sentence into another language and then back into the original language; the round-trip result is a paraphrase that can be added to the training data.
[Diagram: Text Preparation — Cleansing, Basic Preprocessing, Advanced Preprocessing; shown here: HTML tag removal under Cleansing]
TEXT PREPARATION
Pre-processing refers to the series of steps taken to clean and transform raw data before it is used for natural language processing (NLP) tasks. Preprocessing is important because it helps standardize the data, remove noise, and enhance the quality of the data for better analysis and modeling.
Text Cleaning
Sometimes our acquired data is not very clean: it may contain HTML tags, spelling mistakes, or special characters. So, let's look at some techniques to clean our text data.
Unicode Normalization: Text data may contain symbols, emojis, graphic characters, or other special characters; Unicode normalization converts them to a canonical, machine-readable form.
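A minimal Unicode-normalization sketch using the standard library's unicodedata module (the sample string is illustrative, and dropping non-ASCII characters outright is only one of several possible policies):

```python
import unicodedata

def normalize_text(text):
    """NFKD-normalize, then drop characters that don't map to ASCII
    (accents, emojis, typographic symbols)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(normalize_text("café résumé"))  # -> cafe resume
```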
import re
pattern = re.compile(r"<[^>]+>")  # Compile the pattern for substitution
cleaned = pattern.sub("", "<p>Hello</p>")  # Replace all occurrences of the pattern with ""
In the world of fast typing and fat-finger typing, incoming text data often contains spelling errors, so a spelling-checker step is applied as part of cleansing.
OPTIONAL PREPROCESSING
Stop word removal: Words used for sentence formation, e.g., "and," "for," "the," "an," are removed.
Stemming refers to the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form (e.g., "car" and "cars" are both reduced to "car").
Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. While this seems close to the definition of stemming, they are, in fact, different. For example, the adjective "better," when stemmed, remains the same; upon lemmatization, however, it becomes "good."
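The contrast can be illustrated with a toy rule-based stemmer and a toy dictionary lemmatizer; real pipelines would use mature tools (e.g., NLTK's Porter stemmer and a WordNet-based lemmatizer), and the suffix rules and lemma table below are simplified assumptions:

```python
# Toy suffix-stripping stemmer: crude, rule-based, no dictionary.
def toy_stem(word):
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: dictionary lookup handles irregular forms.
LEMMAS = {"better": "good", "ran": "run", "cars": "car"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(toy_stem("cars"), toy_stem("better"))            # -> car better
print(toy_lemmatize("cars"), toy_lemmatize("better"))  # -> car good
```

Note how the stemmer leaves "better" untouched while the lemmatizer maps it to "good," exactly the distinction described above.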
OPTIONAL PREPROCESSING
Text normalization is the process of converting written text into a standardized form
that's easier to process, analyze, and understand.
Some common steps in text normalization are converting all text to lowercase or uppercase, converting digits to words (e.g., 9 to nine), expanding abbreviations, and so on.
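A minimal normalization sketch covering those steps; the abbreviation and digit tables are illustrative assumptions, not a standard resource:

```python
# Toy lookup tables; real systems use much larger resources.
DIGITS = {"3": "three", "9": "nine"}
ABBREV = {"u": "you", "r": "are"}

def normalize(text):
    """Lowercase, expand abbreviations, convert digits to words."""
    words = text.lower().split()
    words = [ABBREV.get(w, w) for w in words]
    words = [DIGITS.get(w, w) for w in words]
    return " ".join(words)

print(normalize("R u coming at 9"))  # -> are you coming at nine
```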
[Diagram] Text Preparation:
- Cleansing: HTML tags removal, emoji handling, spelling checker
- Basic Preprocessing — Basic: tokenization (word, sentence); Optional: stop word removal, stemming, removing punctuation and digits, lowercasing, language detection
- Advanced Preprocessing: POS tagging, parsing, coreference resolution
TEXT PREPARATION
Our cleaned text data may contain a group of sentences, and each sentence is a group of words. So, first, we need to tokenize our text data.
Tokenization: Tokenization is the process of segmenting text into a list of tokens. In sentence tokenization, the token is a sentence; in word tokenization, it is a word. It is a good idea to perform sentence tokenization first and then word tokenization, so the output is a list of lists. Tokenization is performed in every NLP pipeline.
Lowercasing: This step converts all text to lowercase letters. It is useful in NLP tasks such as text classification, information retrieval, and sentiment analysis.
Stop word removal: Stop words are commonly occurring words in a language, such as "the," "and," and "a." They are usually removed from the text during preprocessing.
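The three steps above can be sketched as a small pipeline; the regex-based tokenizers and the stop-word set are simplified assumptions (production code would use a proper tokenizer such as NLTK's or spaCy's):

```python
import re

STOP_WORDS = {"the", "and", "a", "an", "to", "is"}

def sentence_tokenize(text):
    """Naive split on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def word_tokenize(sentence):
    """Lowercase, then keep alphabetic word characters."""
    return re.findall(r"[a-z']+", sentence.lower())

def preprocess(text):
    """Sentence tokenize, then word tokenize + lowercase + stop word removal.
    Returns a list of lists, one inner list per sentence."""
    return [[w for w in word_tokenize(s) if w not in STOP_WORDS]
            for s in sentence_tokenize(text)]

print(preprocess("The cat sat. The dog barked!"))
# -> [['cat', 'sat'], ['dog', 'barked']]
```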
TEXT PREPARATION
Stemming or lemmatization: Stemming and lemmatization are used to reduce words to their base form, which can help reduce the vocabulary size and simplify the text. Stemming involves stripping the suffixes from words to get their stem, whereas lemmatization involves reducing words to their base form based on their part of speech. This step is commonly used in NLP tasks such as text classification, information retrieval, and topic modeling.
Removing digits/punctuation: This step removes digits and punctuation from the text. This is useful in NLP tasks such as text classification, sentiment analysis, and topic modeling.
POS tagging: POS tagging involves assigning a part-of-speech tag to each word in a text. This step is commonly used in NLP tasks such as named entity recognition, sentiment analysis, and machine translation.
Named Entity Recognition (NER): NER involves identifying and classifying named entities in text, such as people, organizations, and locations. This step is commonly used in various NLP tasks.
ADVANCED PREPROCESSING
CO-REFERENCE RESOLUTION
Christopher Robin is alive and well. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book.
(Coreference resolution determines that "He," "Chris," and "his" all refer to the same entity, Christopher Robin.)
[Diagram] Feature Engineering — Text Representation: converting your text into numbers, for both machine learning and deep learning models.
FEATURE ENGINEERING
In feature engineering, our main agenda is to represent the text as a numeric vector in such a way that the ML algorithm can understand the text attribute. In NLP, this process is known as Text Representation or Text Vectorization.
There are two common approaches to text representation.
1. Classical or Traditional Approach:
In the traditional approach, we create a vocabulary of unique words, assign a unique id (integer value) to each word, and then replace each word of a sentence with its unique id. Each word of the vocabulary is treated as a feature, so when the vocabulary is large, the feature size becomes very large, which makes things tough for the ML model.
One-Hot Encoding:
One-hot encoding represents each token as a binary vector. First, each token is mapped to an integer value; then each integer value is represented as a binary vector in which all values are 0 except the index of the integer, which is marked 1.
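A minimal sketch of the integer-id vocabulary plus one-hot encoding described above:

```python
def build_vocab(tokens):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
    return vocab

def one_hot(token, vocab):
    """Binary vector: all zeros except a 1 at the token's integer id."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

tokens = "dog bites man".split()
vocab = build_vocab(tokens)   # {'dog': 0, 'bites': 1, 'man': 2}
print(one_hot("man", vocab))  # -> [0, 0, 1]
```

Note that the vector length equals the vocabulary size, which is exactly why a large vocabulary blows up the feature size.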
FEATURE ENGINEERING
Feature engineering is an integral step in any ML pipeline. Feature engineering converts the raw data into a format that can be consumed by a machine learning model.
[Diagram: a dataset of over 50,000 tweets represented as a numeric feature matrix]
Modelling
[Diagram] Modelling approaches — Heuristic, Machine Learning, Deep Learning, Cloud API — chosen based on the amount of data and the nature of the problem.
[Diagram] Model assessment — Intrinsic Evaluation (technical criteria, i.e., metrics) and Extrinsic Evaluation (business criteria).
EVALUATION
Intrinsic and extrinsic evaluation are two primary approaches to assess the quality
and performance of machine learning models, especially in natural language
processing (NLP) and other AI applications. Here’s a breakdown of each:
Intrinsic Evaluation
•Definition: Measures a model's performance by directly evaluating the quality of
its output on specific tasks without considering its application in real-world
scenarios.
•Goal: Focuses on internal model metrics, like accuracy, precision, recall, F1 score,
or other task-specific benchmarks.
•Examples:
• Word Embeddings: Evaluating word similarity scores between embeddings
(e.g., using cosine similarity).
• Language Models: Measuring perplexity on a text dataset.
• Text Classification: Calculating classification accuracy on a labeled test set.
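As a sketch of how such intrinsic metrics are computed, precision, recall, and F1 for a binary classifier can be written from scratch (libraries such as scikit-learn provide these in practice; the label vectors below are illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)  # precision = recall = F1 = 2/3 here
```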
EVALUATION
Extrinsic Evaluation
•Definition: Measures a model's performance based on how well it performs within a
larger, real-world application or downstream task.
•Goal: Evaluates how the model contributes to the performance of a broader task,
often providing insight into practical utility.
•Examples:
• Information Retrieval: Testing how well embeddings improve document
retrieval accuracy.
• Machine Translation: Evaluating if a language model enhances translation
quality in a complete system.
• Sentiment Analysis: Testing a sentiment classifier’s impact on user
engagement in a recommendation system.
EVALUATION
The evaluation metric depends on the type of NLP task or problem. Here are some popular evaluation methods by NLP task:
•Classification: Accuracy, Precision, Recall, F1-score, AUC
•Sequence Labelling: F1-score
•Information Retrieval: Mean Reciprocal Rank (MRR), Mean Average Precision (MAP)
•Text Summarization: ROUGE
•Regression (e.g., stock market price prediction, temperature prediction): Root Mean Square Error, Mean Absolute Percentage Error
•Text Generation: BLEU (Bilingual Evaluation Understudy), Perplexity
•Machine Translation: BLEU, METEOR
Deployment
Draw non-crossing PIPELINES between three
houses and three utility companies.
THANK YOU!