MODULE 1
Natural Language Processing
• Introduction – Past, present, and future of NLP; Classical problems in text processing; Necessary math concepts for NLP; Regular expressions in NLP; Basic text processing: tokenisation, stemming, lemmatization, stop words, etc.; Spelling error correction – Minimum edit distance, Bayesian method
What is NLP?
Need For NLP
• In neuropsychology, linguistics & the philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use & repetition, without conscious planning or premeditation.
• Natural languages can take different forms, such as speech or signing.
• They are distinguished from constructed & formal languages, such as those used to program computers or to study logic.
Common NLP Tasks or Applications of NLP:
❖ Cars with an automatic speech recognition system
❖ Capturing words in soundtracks from millions of hours of video on the web
❖ Cross-language information retrieval and translation – Google Translate
❖ Automatic essay analysers; tools to summarise research papers
❖ Automated interactive virtual agents – tutors
Even Humans Made Blunders
Chatbots: TayTweet @TayandYou
Goal of NLP
Approaches to NLP
Heuristic Method (Lexical Dictionary):
• Any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, short-term goal or approximation.
• Probability-based classification algorithms
• Parts of speech
• Sequence learning
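As a hedged illustration of the first item, a probability-based classification algorithm applied to text: a minimal scikit-learn Naive Bayes sketch (the toy data is purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy sentiment data, for illustration only
texts = ['great movie', 'awful plot', 'loved it', 'terrible acting']
labels = ['pos', 'neg', 'pos', 'neg']

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(['great acting'])))  # likely ['pos']
```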
NLP Pipeline
• Data acquisition
• Text preparation – steps for cleaning (see the sketch after this list):
  i. Removing HTML tags
  ii. Unicode normalization & spell checking
• Basic preprocessing: tokenization, lemmatization
• Feature engineering
• Modelling
• Deployment
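A minimal sketch of the first two cleaning steps; the naive tag-stripping regex and the NFKC normalization form are illustrative choices, not the only options:

```python
import re
import unicodedata

def remove_html_tags(text):
    # naive tag stripper: drop anything between '<' and '>'
    return re.sub(r'<[^>]+>', ' ', text)

def normalize_unicode(text):
    # NFKC folds compatibility characters, e.g. the '\ufb01' ligature -> 'fi'
    return unicodedata.normalize('NFKC', text)

raw = '<p>The \ufb01nal report costs $45.55</p>'
print(normalize_unicode(remove_html_tags(raw)).strip())
# -> The final report costs $45.55
```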
Alternative Views on NLP
□ Morphology
❖ The way words break down into component parts that carry meanings, like singular vs. plural, or I'm vs. I am
Complex Language Behavior Requires
□ Syntax
❖ Structural knowledge to properly string together the words
that constitute the response
□ Semantics
❖ Lexical Semantics: meaning of all the words
❖ Compositional Semantics: knowledge about the relationship of the words
□ Pragmatics
❖ Kind of actions that speakers intend by their use of sentences
Complex Language Behavior Requires
□ Discourse
❖ Knowledge about linguistic units larger than a single utterance
Ambiguity
□ Morphological Ambiguity → Part-of-Speech Tagging
□ Semantic Ambiguity → Lexical Disambiguation
□ Syntactic Ambiguity → Probabilistic Parsing
What Makes Natural Language Processing Difficult?
□ Ambiguity at the meaning level
The Evolution of Natural Language Processing
History of NLP
• In 1952, Bell Labs created Audrey, the first speech recognition system. It could recognize all ten numerical digits.
• Under DARPA's Speech Understanding Research program (begun in 1971), Carnegie Mellon University developed Harpy, the first system to recognize over a thousand words.
Current Trends in NLP
□ Text to Speech
□ NLP integrated with deep learning and machine learning has enabled chatbots and virtual assistants to carry out complicated interactions.
Chatbot types
Various NLP Algorithms
Future Predictions of NLP
□ NLP evolution could create robots that can see, touch, hear, and speak, much like humans.
Regular Expressions in NLP
□ A language for specifying text search strings.
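A first hedged example of a search string using Python's re module:

```python
import re

match = re.search(r'woodchucks', 'interesting links to woodchucks and lemurs')
print(match.group() if match else 'no match')  # -> woodchucks
```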
Regular Expressions in NLP
The question mark ? marks optionality of the previous expression.
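For instance, /woodchucks?/ matches both the singular and the plural (a minimal sketch):

```python
import re

print(re.findall(r'woodchucks?', 'one woodchuck, two woodchucks'))
# -> ['woodchuck', 'woodchucks']
```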
Regular Expressions in NLP
Kleene * : "zero or more occurrences of the immediately previous character or regular expression".
Consider the language of certain sheep, which consists of strings that look like the following:
baa!
baaa!
baaaa!
baaaaa!
...
The sheep language: /baaa*!/
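A quick check of the sheep-language pattern (re.fullmatch requires the whole string to match):

```python
import re

for s in ['baa!', 'baaa!', 'baaaaa!', 'ba!']:
    print(s, bool(re.fullmatch(r'baaa*!', s)))
# only 'ba!' fails: at least two a's are required
```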
Regular Expressions in NLP
Disjunction, Grouping, and Precedence
Disjunction:
Negation in Disjunction:
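These two sub-headings are illustrated in this hedged sketch: square brackets for disjunction, parentheses for grouping with |, and [^ ] for negation:

```python
import re

print(re.findall(r'[wW]oodchuck', 'Woodchuck or woodchuck'))  # disjunction
print(re.findall(r'gupp(?:y|ies)', 'guppy and guppies'))      # grouping with |
print(re.findall(r'[^A-Z]', 'A1b'))   # negation: anything but A-Z -> ['1', 'b']
```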
Regular Expressions in NLP
Substitution, Capture Groups, and ELIZA
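A hedged sketch of an ELIZA-style substitution: the capture group (...) stores the matched text, and \1 pastes it into the replacement:

```python
import re

line = 'I am feeling happy'
print(re.sub(r'.*I am (.*)', r'WHY DO YOU SAY YOU ARE \1?', line))
# -> WHY DO YOU SAY YOU ARE feeling happy?
```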
Regular Expressions in NLP
Evaluate the Regular Expressions
• all strings that start at the beginning of the line with an integer and that end at the end of the line with a word;
• all strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto);
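Hedged candidate solutions to the two exercises (other equivalent patterns exist):

```python
import re

# 1) starts with an integer, ends with a word
p1 = re.compile(r'^\d+\b.*\b[a-zA-Z]+$')
# 2) contains both 'grotto' and 'raven' as whole words, in either order
p2 = re.compile(r'\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b')

print(bool(p1.search('35 students passed the exam')))     # True
print(bool(p2.search('the raven flew into the grotto')))  # True
print(bool(p2.search('grottos hide no ravens')))          # False
```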
Text Normalization
• Tokenizing (segmenting) words
• Normalizing word formats
• Segmenting sentences
Text Tokenization
• Break off punctuation as a separate token:
  □ commas are a useful piece of information for parsers,
  □ periods help indicate sentence boundaries.
• But keep together:
  ✔ punctuation that occurs word-internally, like m.p.h., Ph.D., Dr., U.S.;
  ✔ prices ($45.55) and dates (01/02/06);
  ✔ URLs (http://www.stanford.edu);
  ✔ Twitter hashtags (#nlproc);
  ✔ email addresses (someone@cs.colorado.edu);
  ✔ commas used inside numbers in English, every three digits: 555,500.50.
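A hedged sketch of such a rule-based tokenizer (in the spirit of the NLTK regexp-tokenizer example; the pattern is illustrative and far from complete):

```python
import re

pattern = r'''(?x)          # verbose mode
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.
    | \$?\d+(?:[.,]\d+)*%?  # prices, percentages, numbers with internal . or ,
    | \w+(?:-\w+)*          # words, with optional internal hyphens
    | \#\w+                 # hashtags, e.g. #nlproc
    | [][.,;"'?():!_`-]     # these punctuation marks become their own tokens
'''
print(re.findall(pattern, 'The U.S. price is $45.55, see #nlproc!'))
# -> ['The', 'U.S.', 'price', 'is', '$45.55', ',', 'see', '#nlproc', '!']
```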
Text Tokenization
• A tokenizer can also be used to expand clitic contractions that are marked by apostrophes, for example converting what're to the two tokens what are, and we're to we are.
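A hedged sketch of clitic expansion using a small lookup table (the entries are illustrative, not exhaustive):

```python
import re

CLITICS = {"what're": "what are", "we're": "we are", "i'm": "i am"}

def expand_clitics(text):
    return re.sub(r"\b\w+'\w+\b",
                  lambda m: CLITICS.get(m.group(0).lower(), m.group(0)),
                  text)

print(expand_clitics("What're the options we're given?"))
# -> what are the options we are given?  (case is folded by the lookup)
```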
Text Tokenization – Byte-Pair Encoding
• Deals with the unknown word problem.
• Subwords can be arbitrary substrings, or they can be meaning-bearing units like the morphemes -est or -er.
(A morpheme is the smallest meaning-bearing unit of a language.)
Once the vocabulary is learned, the token parser is used to tokenize a test sentence.
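A minimal sketch of the BPE token learner (following Sennrich et al., 2016; the toy vocabulary echoes the classic low/lowest/newer/wider example):

```python
import re
from collections import Counter

def pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs

def merge_pair(pair, vocab):
    # merge only whole, space-separated symbol pairs
    pat = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pat.sub(''.join(pair), w): f for w, f in vocab.items()}

# words pre-split into characters plus an end-of-word marker '_'
vocab = {'l o w _': 5, 'l o w e s t _': 2, 'n e w e r _': 6, 'w i d e r _': 3}
for _ in range(4):                # the number of merges is a hyperparameter
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print('merged:', best)
```

The learned merges, applied in the same order, are exactly what the token parser later uses on unseen text.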
Word Normalization, Lemmatization and Stemming
• Word normalization:
  □ putting words/tokens in a standard format,
  □ choosing a single normal form for words with multiple forms.
• Case folding (normalization) – for sentiment analysis, text classification, information extraction, machine translation, and other tasks:
  □ mapping everything to lower case.
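A one-line sketch; Python's str.casefold is a slightly more aggressive lower():

```python
print('Woodchuck'.lower())   # -> woodchuck
print('Straße'.casefold())   # -> strasse
```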
Lemmatization
• Complete morphological parsing of the word.
(Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.)
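A hedged sketch with NLTK's WordNet lemmatizer (requires a one-time nltk.download('wordnet')):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('mice'))          # -> mouse
print(lemmatizer.lemmatize('running', 'v'))  # -> run  (pos tag 'v' = verb)
```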
• Text to which the Porter stemmer is applied:
  This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single ...
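A hedged sketch applying NLTK's Porter stemmer to (a naive word-split of) the passage above:

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = ("This was not the map we found in Billy Bones's chest, "
        "but an accurate copy, complete in all things")
tokens = re.findall(r'[A-Za-z]+', text)
print(' '.join(stemmer.stem(t) for t in tokens))
# stems such as 'thi', 'wa', 'accur', 'copi', 'complet' show how crude
# stemming is compared with lemmatization
```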
Stop Words
Stop Words – Applications
❖ Supervised machine learning – removing stop words from the feature space
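A hedged sketch of stop-word removal with NLTK's English list (requires a one-time nltk.download('stopwords')):

```python
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
tokens = 'this was not the map we found in the chest'.split()
print([t for t in tokens if t not in stops])  # -> ['map', 'found', 'chest']
```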
Regular Expression: Text Normalization
spaCy:
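A hedged sketch of a typical spaCy pipeline (assumes the small English model has been installed via `python -m spacy download en_core_web_sm`); it performs tokenization, lemmatization, and stop-word flagging in one pass:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The striped bats are hanging on their feet.')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)
```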
Minimum Edit Distance
• The minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other.
For example, transforming intention into execution:
• delete an i,
• substitute e for n,
• substitute x for t,
• insert c,
• substitute u for n.
The operations are abbreviated:
□ d for deletion,
□ s for substitution,
□ i for insertion.
• In the Levenshtein variant used here, each insertion or deletion has a cost of 1 and substitutions are not allowed (equivalently, a substitution costs 2: one deletion plus one insertion).
How to find the Min Edit Distance?
Dynamic Programming for Minimum Edit Distance
D[i, j] is defined as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.
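A sketch of the standard dynamic-programming recurrence under the cost model above (insertions and deletions cost 1, substitutions cost 2):

```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / copy
    return D[n][m]

print(min_edit_distance('intention', 'execution'))  # -> 8
```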
Viterbi algorithm for Minimum Edit Distance
BAYESIAN APPROACH TO SPELLING CORRECTION
'Noisy channels'
❖ In a number of tasks involving natural language, the problem can be viewed as recovering an 'original signal' distorted by a 'noisy channel':
– Speech recognition
– Spelling correction
– OCR / handwriting recognition
– (less felicitously, perhaps) pronunciation variation
Spelling Errors
Types of Spelling Errors
Dealing with Spelling Errors
Noisy Channel Model
Bayesian inference
In this approach, Bayes' theorem is used to compute the probability that the intended word is w when the typist has in fact typed x, e.g. P(the | thme).
The word in the dictionary with the highest posterior probability is chosen as the intended word: we form a set of candidates C for the input word x and do the maximization over C.
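The maximization referred to above is the standard noisy-channel formulation; since P(x) is the same for every candidate, it drops out of the argmax:

```latex
\hat{w} \;=\; \operatorname*{argmax}_{w \in C} P(w \mid x)
        \;=\; \operatorname*{argmax}_{w \in C} \frac{P(x \mid w)\,P(w)}{P(x)}
        \;=\; \operatorname*{argmax}_{w \in C} P(x \mid w)\,P(w)
```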
We will also rank candidates according to the log-posterior instead of the posterior probability:
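In log space the product becomes a sum, which is numerically more stable:

```latex
\hat{w} \;=\; \operatorname*{argmax}_{w \in C} \bigl[\log P(x \mid w) + \log P(w)\bigr]
```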
Steps to develop the algorithm
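A hedged, Norvig-style sketch of the overall algorithm: generate candidates within edit distance 1, keep those in the dictionary, and score them. The word counts here are illustrative placeholders, not real corpus frequencies, and the channel model P(x|w) is taken as uniform, so only the prior P(w) decides:

```python
from collections import Counter

# illustrative counts only; a real system uses corpus frequencies
WORD_COUNTS = Counter({'actress': 1343, 'across': 8030, 'acres': 2879})
TOTAL = sum(WORD_COUNTS.values())

def edits1(word):
    """All strings within one insert/delete/replace/transpose of `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    inserts = [L + c + R for L, R in splits for c in letters]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    return set(deletes + inserts + replaces + transposes)

def correct(x):
    candidates = [w for w in edits1(x) if w in WORD_COUNTS] or [x]
    return max(candidates, key=lambda w: WORD_COUNTS[w] / TOTAL)

print(correct('acress'))  # -> 'across' under these toy counts
```

With a real channel model, P(x|w) would come from confusion matrices of typical typing errors, and the product P(x|w)P(w) can rank 'actress' above 'across' even though 'across' has the higher prior.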
The best suggestion for the correct word, given the incorrect spelling "acress", can be calculated using Bayes' rule in just this way.