The document provides an overview of natural language processing, including its history, common tasks, challenges, and evolution. It discusses key NLP concepts such as tokenization, stemming, lemmatization, part-of-speech tagging, and regular expressions. The future of NLP is predicted to incorporate other signals, such as biometrics, and to enable more human-like computer interactions.



MODULE 1
Natural Language Processing

• Introduction: past, present, and future of NLP; classical problems in text processing; necessary math concepts for NLP; regular expressions in NLP; basic text processing: lemmatization, stop words, tokenization, stemming, etc.; spelling error correction: minimum edit distance, Bayesian method.
What is NLP?
Need For NLP
• In neuropsychology, linguistics, and the philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation.
• Natural languages can take different forms, such as speech or signing.
• They are distinguished from constructed and formal languages, such as those used to program computers or to study logic.
Common NLP Tasks (Applications of NLP):
❖ Cars with an automatic speech recognition system
❖ Capture words in soundtrack from millions of hours of
video on the web
❖ Cross language information retrieval and translation-
Google Translator
❖ Automatic essay analyzers; tools to summarize research papers
❖ Automated interactive virtual agents, e.g. tutors
Even Human-Made Systems Made Blunders
Chatbots: Tay (@TayandYou)
Goal of NLP
Approaches to NLP
(Lexical Dictionary)
Heuristic Method:
• Any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, short-term goal or approximation.

• Advantages: i. quick approaches (reasonably accurate)
              ii. fewer errors

Current scenario: even when using ML and DL, we still often rely on heuristic approaches.
• Rule-based algorithms
• Probability-based methods
• Classification algorithms
• Linear discriminant analysis and topic modelling
• Part-of-speech tagging
• Sequence learning
Data Acquisition:
Text preparation:
Steps For cleaning:
i. Removing HTML tags:
Unicode Normalization & Spell Checking:
Basic preprocessing:
Tokenization:
NLP Pipeline
Feature Engineering:
Modelling:
Deployment:
Alternative Views on NLP

□ Computational models of human language processing


❖ Programs that operate internally the way humans do
□ Computational models of human communication
❖ Programs that interact like humans
□ Computational systems that efficiently process text and speech
Complex Language Behavior Requires
□ Phonetics and Phonology
❖ Words as sequences of sounds
❖ How the sound is realized acoustically

□ Morphology
❖ The way words break down into component parts that
carry meanings like: singular vs plural or I’m vs I am
□ Syntax
❖ Structural knowledge to properly string together the words
that constitute the response

❖ Scrambled: I’m I do, sorry that afraid Dave I’m I can’t

❖ With syntactic knowledge: I’m sorry Dave, I’m afraid I can’t do that


□ Semantics
❖ Lexical semantics: the meaning of individual words
❖ Compositional semantics: knowledge about how the meanings of words combine

❖ How much Chinese silk was exported to Western Europe by the end of the 18th century?
□ Pragmatics
❖ The kinds of actions that speakers intend by their use of sentences
□ Discourse
❖ Knowledge about linguistic units larger than a single utterance

❖ How many students graduated that year?

"that year" may be:
▪ when the first batch graduated, or
▪ when COVID-19 hit the world
Ambiguity

What Makes Natural Language Processing Difficult?
□ Ambiguity at the word level
□ Ambiguity at the sentence level
□ Morphological ambiguity --> part-of-speech tagging
□ Semantic ambiguity --> lexical disambiguation
□ Syntactic ambiguity --> probabilistic parsing

□ Ambiguity at the meaning level
The Evolution of Natural Language Processing
History of NLP

• In 1952, Bell Labs created Audrey, the first speech recognition system. It could recognize all
ten numerical digits.

• Harpy, developed at Carnegie Mellon University under DARPA funding in the 1970s, was the first system to recognize over a thousand words.
The Evolution of Natural Language Processing:
Current trends in NLP

□ Speech-to-text conversion
□ Text-to-speech conversion
□ NLP integrated with deep learning and machine learning has enabled chatbots and virtual assistants to carry out complicated interactions.
□ NLP in healthcare can monitor treatments and analyze reports and health records.
□ Cognitive analytics and NLP are combined to automate routine tasks.
Chatbot types
Various NLP Algorithms

□ Bag of words: this model counts the frequency of each unique word in a document.

□ TF-IDF: TF (term frequency) is the number of times a given term appears in a document divided by the total number of terms in that document; IDF (inverse document frequency) discounts terms that appear in many documents.

□ Co-occurrence matrix: introduced to address semantic ambiguity. It tracks the context of the text but requires a lot of memory to store all the data.

□ Transformer models: encoder-decoder models that use attention, letting machines imitate human attention and train faster. BERT, developed by Google, revolutionized NLP.
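The bag-of-words and TF-IDF calculations above can be sketched in plain Python (a minimal illustration with a made-up two-document corpus; real pipelines typically use a library such as scikit-learn):

```python
import math
from collections import Counter

docs = [
    "natural language processing is fun",
    "language models process natural text",
]

# Bag of words: frequency of each unique word per document
bows = [Counter(d.split()) for d in docs]

def tf_idf(term, doc_index):
    # TF: occurrences of the term / total terms in the document
    tf = bows[doc_index][term] / sum(bows[doc_index].values())
    # IDF: log of (number of documents / documents containing the term)
    df = sum(1 for bow in bows if term in bow)
    return tf * math.log(len(docs) / df)

# "processing" occurs only in doc 0, so it gets a non-zero weight there;
# "natural" occurs in every document, so its IDF (hence TF-IDF) is 0.
print(tf_idf("processing", 0))
print(tf_idf("natural", 0))
```

Note how the IDF term automatically zeroes out words shared by all documents, which is why TF-IDF is preferred over raw counts for retrieval.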
The Evolution of Natural Language Processing:
Future Predictions of NLP

□ Non-verbal communication, like body language, gestures, and facial expressions: use biometrics such as facial recognition and retina scanners.

□ Creation of humanoid robots by integrating NLP with biometrics; computer-human interaction will move toward computer-human communication.

□ NLP evolution can create robots that can see, touch, hear, and speak, much like humans.
Regular Expressions in NLP
□ A language for specifying text search strings.
□ An RE is a sequence of characters mainly used to find or replace patterns present in text.
□ An RE is a set of characters used to find a substring in a given string.
□ An RE is a formula in a special language that can be used to specify simple classes of strings -> a sequence of symbols.
□ An RE is an algebraic expression for characterizing a set of strings.
□ An RE is an instruction given to a function on what or how to match, search, or replace a set of strings.
□ An RE requires two things:
❖ Pattern: what we wish to search for
❖ Corpus: the text within which we search
The brackets [] specify a disjunction of characters.
The question mark ? marks optionality of the previous expression.

The caret ^ negates a character class inside brackets, anchors the start of a line elsewhere, or, escaped as \^, just means a literal ^.
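A quick sketch of these three operators using Python's `re` module:

```python
import re

# [] : a disjunction of characters - [Ww] matches either case
assert re.search(r"[Ww]oodchuck", "How much wood would a Woodchuck chuck")

# ? : optionality of the previous expression - matches color and colour
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# ^ inside brackets negates: [^0-9] matches any non-digit character
assert re.search(r"[^0-9]", "abc123")

# ^ outside brackets anchors the start of the string
assert re.match(r"^The", "The dog barked")
print("all regex checks passed")
```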
Kleene * : "zero or more occurrences of the immediately previous character or regular expression".

Consider the language of certain sheep, which consists of strings that look like the following:
baa!
baaa!
baaaa!
baaaaa!
...
The sheep language: /baaa*!/

Kleene + : "one or more occurrences of the immediately preceding character or regular expression".
The sheep language: /baa+!/
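Both sheep-language patterns can be checked with Python's `re`:

```python
import re

sheep = ["baa!", "baaa!", "baaaa!", "baaaaa!"]

# Kleene * : /baaa*!/ - "baa" followed by zero or more extra a's
assert all(re.fullmatch(r"baaa*!", s) for s in sheep)

# Kleene + : /baa+!/ - "ba" followed by one or more a's (same language)
assert all(re.fullmatch(r"baa+!", s) for s in sheep)

# "ba!" has too few a's for either pattern
assert re.fullmatch(r"baaa*!", "ba!") is None
print("sheep language verified")
```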
Anchors

Anchors are special characters that anchor regular expressions to particular places in a string: ^ (start of line), $ (end of line), \b (word boundary), \B (non-word-boundary).
Disjunction, Grouping, and Precedence

The order of RE operator precedence, from highest to lowest: parentheses (), counters (* + ? {}), sequences and anchors, and finally disjunction (|).
Disjunction: the pipe symbol | matches either of the patterns it separates.

Negation in disjunction: [^...] matches any single character not listed inside the brackets.
Aliases for common sets of characters: \d (digit), \D (non-digit), \w (word character), \W (non-word character), \s (whitespace), \S (non-whitespace).

Regular expression operators for counting: * (zero or more), + (one or more), ? (zero or one), {n} (exactly n), {n,m} (from n to m), {n,} (at least n).

Some characters need to be backslashed to be matched literally: * . ? + [ ] ( ) \
Substitution, Capture Groups, and ELIZA

Substitutions and capture groups are very useful in implementing simple chatbots like ELIZA (Weizenbaum, 1966).
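A minimal sketch of an ELIZA-style substitution using a capture group (the rule and wording here are illustrative, not Weizenbaum's actual rule set):

```python
import re

# Capture what follows "I'm" and echo it back in the response.
# The parentheses form a capture group, referenced as \1 in the substitution.
def eliza(utterance):
    return re.sub(r".*I'm (depressed|sad|happy).*",
                  r"WHY DO YOU THINK YOU ARE \1?",
                  utterance, flags=re.IGNORECASE)

print(eliza("I'm depressed much of the time"))
# WHY DO YOU THINK YOU ARE depressed?
```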
Evaluate the Regular Expressions

• the set of all alphabetic strings;

• the set of all lower case alphabetic strings ending in a b;

• the set of all strings from the alphabet a, b such that each a is immediately preceded by and immediately followed by a b;

• the set of all strings with two consecutive repeated words (e.g., "Humbert Humbert" and "the the" but not "the bug" or "the big bug");
• all strings that start at the beginning of the line with an integer and end at the end of the line with a word;

• all strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto);

• write a pattern that places the first word of an English sentence in a register. Deal with punctuation.
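One possible set of answers to the first few exercises, sketched and checked with Python's `re` (many equivalent patterns exist):

```python
import re

# all alphabetic strings
assert re.fullmatch(r"[a-zA-Z]+", "Humbert")

# lower-case alphabetic strings ending in a b
assert re.fullmatch(r"[a-z]*b", "absorb")
assert not re.fullmatch(r"[a-z]*b", "bee")

# strings over {a, b} where each a is immediately preceded by and
# followed by a b (only-b strings and the empty string also qualify)
ab = r"(b+(ab+)*)?"
assert re.fullmatch(ab, "bab") and re.fullmatch(ab, "bbb")
assert not re.fullmatch(ab, "ab") and not re.fullmatch(ab, "ba")

# two consecutive repeated words, bounded so "the theory" does not match
doubled = r".*\b([A-Za-z]+)\s+\1\b.*"
assert re.fullmatch(doubled, "Humbert Humbert sat down")
assert not re.fullmatch(doubled, "the big bug")
print("exercise patterns behave as described")
```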
Text Normalization

• Tokenizing (segmenting) words
• Normalizing word formats
• Segmenting sentences
Text Tokenization
• Break off punctuation as a separate token:
□ commas are a useful piece of information for parsers,
□ periods help indicate sentence boundaries.
But watch out for:
✔ punctuation that occurs word-internally, as in m.p.h., Ph.D., Dr., U.S.;
✔ prices ($45.55) and dates (01/02/06);
✔ URLs (http://www.stanford.edu);
✔ Twitter hashtags (#nlproc);
✔ email addresses (someone@cs.colorado.edu);
✔ commas used inside numbers in English, every three digits: 555,500.50.
• A tokenizer can also be used to expand clitic contractions that are marked by apostrophes, for example converting what're to the two tokens what are, and we're to we are.
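A toy regex tokenizer in this spirit: it keeps word-internal periods and apostrophes together, splits off other punctuation, and expands a few clitic contractions from a small hand-written table (the table and the pattern are illustrative only; real tokenizers are far more thorough):

```python
import re

# Illustrative clitic table; a real tokenizer would use a fuller list.
# Note that the lowercase lookup also lowercases expanded forms.
CLITICS = {"what're": ["what", "are"], "we're": ["we", "are"],
           "can't": ["can", "not"]}

def tokenize(text):
    tokens = []
    # word characters, optionally joined by internal . or ' (m.p.h., we're),
    # with an optional trailing period; otherwise a single punctuation mark
    for tok in re.findall(r"\w+(?:[.']\w+)*\.?|[^\w\s]", text):
        tokens.extend(CLITICS.get(tok.lower(), [tok]))
    return tokens

print(tokenize("We're flying at 55 m.p.h., Dave!"))
# ['we', 'are', 'flying', 'at', '55', 'm.p.h.', ',', 'Dave', '!']
```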
Text Tokenization - Byte-Pair Encoding

• Subword tokenization deals with the unknown-word problem. (A morpheme is the smallest meaning-bearing unit of a language.)
• Subwords can be arbitrary substrings, or they can be meaning-bearing units like the morphemes -est or -er.

• Most tokenization schemes have two parts:
  • a token learner, trained on a raw training corpus, and
  • a token segmenter, applied to a raw test sentence.

• Common subword schemes: byte-pair encoding, unigram language modeling, and SentencePiece.
• BPE begins with a vocabulary that is just the set of all individual characters.

• It continues to count and merge, creating new, longer and longer character strings, until k merges have been done, creating k novel tokens.
• Example: an input corpus of 18 word tokens, with counts for each word. First count all pairs of adjacent symbols; the most frequent is the pair (e, r).
• Once the vocabulary is learned, the token segmenter is used to tokenize a test sentence.
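The learner half of BPE can be sketched compactly as follows (the corpus below is a made-up toy, and the tie-breaking rule is a simplification of real implementations such as Sennrich et al.'s):

```python
from collections import Counter

def bpe_learn(corpus, k):
    """Token learner: start from single characters, perform k merges.
    corpus maps word -> count; '_' marks the end of a word."""
    vocab = {tuple(word) + ("_",): n for word, n in corpus.items()}
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, n in vocab.items():           # count adjacent pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent pair
        merges.append(best)
        new_vocab = {}
        for symbols, n in vocab.items():           # merge that pair everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = n
        vocab = new_vocab
    return merges

# toy corpus in the spirit of the textbook example
print(bpe_learn({"low": 5, "lower": 2, "newer": 6, "wider": 3}, 3))
```

On this toy corpus the first merge is (e, r), since that pair has the highest total count; subsequent merges build longer subwords such as er_ out of earlier ones.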
Word Normalization, Lemmatization and Stemming

• Word normalization:
□ putting words/tokens in a standard format,
□ choosing a single normal form for words with multiple forms.
• Needed for sentiment analysis, text classification, information extraction, machine translation, and other tasks.

Case folding (normalization):
□ mapping everything to lower case,
□ so that two morphologically different forms of a word behave similarly:
India, to India, of India, Indian, India's, for India
Lemmatization is the task of
▪ determining that two words have the same root,
▪ despite their surface differences.
am, are, and is have the shared lemma 'be'.

❖ He is reading detective stories
□ He be read detective story.
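In practice lemmatization is done with a morphological analyzer or a library such as spaCy or NLTK; the idea can be sketched with a toy lookup table (the entries below are illustrative only):

```python
# Toy lemma dictionary; a real lemmatizer performs full morphological parsing
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "reading": "read", "stories": "story"}

def lemmatize(tokens):
    # map each token to its lemma, falling back to the lowercased token
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize("He is reading detective stories".split()))
# ['he', 'be', 'read', 'detective', 'story']
```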
Lemmatization requires complete morphological parsing of the word. Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.

❖ Two broad classes of morphemes can be distinguished:
• stems: the central morpheme of the word, supplying the main meaning; and
• affixes: adding "additional" meanings of various kinds.
• fox: one morpheme (the morpheme fox)
• cats: two morphemes (the morpheme cat and the morpheme -s)
Stemming
• A simpler but cruder method, which mainly consists of chopping off word-final affixes.

• Text to which the Porter stemmer is applied:
This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

• Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note
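The "chopping" idea can be sketched with a few suffix rules (illustrative only; the real Porter stemmer has several ordered steps with conditions on the stem):

```python
# A tiny suffix-stripping cascade in the spirit of the Porter stemmer.
# Rules are tried in order; the first applicable one wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, repl in RULES:
        # require a reasonably long remainder so "this" -> "thi", not ""
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repl
    return word

print([stem(w) for w in ["grasses", "ponies", "reading", "maps", "this"]])
# ['grass', 'poni', 'read', 'map', 'thi']
```

Note how "this" becomes "thi", matching the crude output seen in the stemmed passage above: stemming trades accuracy for simplicity.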
• The Porter stemmer applies a cascade of rewrite rules, e.g. ATIONAL -> ATE (relational -> relate) and SSES -> SS (grasses -> grass).
Stop Words

Stop Words - Applications
❖ Supervised machine learning: removing stop words from the feature space
❖ Clustering: removing stop words prior to generating clusters
❖ Information retrieval: preventing stop words from being indexed
❖ Text summarization: excluding stop words from contributing to summarization scores, and removing stop words when computing ROUGE scores

Removing stop words can be detrimental, however; for instance, in sentiment analysis, words like "not" carry signal.
Stop Words - Types

❖ Determiners: determiners tend to mark nouns; a determiner is usually followed by a noun.
Examples: the, a, an, another
❖ Coordinating conjunctions: these connect words, phrases, and clauses.
Examples: for, and, nor, but, or, yet, so
❖ Prepositions: these express temporal or spatial relations.
Examples: in, under, towards, before
Stop Words - Benefits
Key benefits of removing stop words:

❖ On removing stop words, the dataset size decreases, and the time to train the model also decreases.
❖ Removing stop words can potentially improve performance, as fewer and only meaningful tokens remain; this can increase classification accuracy.
❖ Even search engines like Google remove stop words for fast and relevant retrieval of data from the database.
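Stop-word removal itself is a one-liner once a list is chosen (the list below is a small illustrative subset; real lists, e.g. NLTK's, are much longer):

```python
# Small illustrative stop-word list
STOP_WORDS = {"the", "a", "an", "and", "or", "in", "of", "is", "to"}

def remove_stop_words(text):
    # lowercase, split on whitespace, drop any token in the stop list
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("The quick brown fox is in the barn"))
# ['quick', 'brown', 'fox', 'barn']
```

For sentiment analysis one would keep negations like "not" out of the stop list, for the reason noted above.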
Regular Expressions for Text Normalization (e.g. with spaCy)
Minimum Edit Distance
• One application is coreference, the task of deciding whether two strings refer to the same entity:

❖ Stanford President Marc Tessier-Lavigne
❖ Stanford University President Marc Tessier-Lavigne
• The minimum edit distance between two strings is the minimum number of editing operations
o insertion
o deletion
o substitution
needed to transform one string into the other.

• delete an i,
• Substitute e for n,
• Substitute x for t,
• Insert c,
• Substitute u for n

□ d for deletion,
□ s for
substitution, 116
□ i for insertion.
• Under an alternative cost model, each insertion or deletion has a cost of 1 and substitutions are not allowed (equivalent to giving a substitution a cost of 2, i.e. one deletion plus one insertion).
How to Find the Minimum Edit Distance?
Dynamic Programming for Minimum Edit Distance

❖ Dynamic programming: a tabular computation of D(n, m).
❖ Solving problems by combining solutions to subproblems.
Define D[i, j] as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.
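The table is filled with the standard recurrence D[i,j] = min(D[i-1,j] + del-cost, D[i,j-1] + ins-cost, D[i-1,j-1] + sub-cost), where the sub-cost is 0 when the two characters match. A direct Python sketch of this tabular computation:

```python
def min_edit_distance(source, target, ins=1, dele=1, sub=2):
    """Tabular computation of D(n, m); the default sub=2 follows the
    cost model above (equivalent to disallowing substitutions)."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele          # delete all of source
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins           # insert all of target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + dele,           # deletion
                D[i][j - 1] + ins,            # insertion
                D[i - 1][j - 1]               # substitution or copy
                + (0 if source[i - 1] == target[j - 1] else sub),
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))          # 8
print(min_edit_distance("intention", "execution", sub=1))   # 5 (Levenshtein)
```

Keeping backpointers alongside each cell would additionally recover the alignment (the d/s/i operation sequence), not just the distance.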
Viterbi Algorithm for Minimum Edit Distance

❖ The Viterbi algorithm is a probabilistic extension of minimum edit distance.
❖ Instead of computing the "minimum edit distance" between two strings, Viterbi computes the "maximum probability alignment" of one string with another.
BAYESIAN APPROACH TO SPELLING CORRECTION
'Noisy channels'
❖ In a number of tasks involving natural language, the problem can be viewed as recovering an 'original signal' distorted by a 'noisy channel':
– Speech recognition
– Spelling correction
– OCR / handwriting recognition
– (less felicitously perhaps) pronunciation variation
Spelling Errors
Types of Spelling Errors

Damerau (1964): 80% of all misspelled words (non-word errors) are caused by SINGLE-ERROR MISSPELLINGS.
Dealing with Spelling Errors

Noisy Channel Model
Bayesian Inference

❖ 'Bayesian inference' is the name given to techniques typically used in diagnostics to identify the CAUSE of certain OBSERVATIONS.
❖ The name 'Bayesian' comes from the fact that Bayes' rule is used to 'turn a problem around': from the probability of the OBSERVATIONS given a CAUSE to the posterior probability of the CAUSE given the OBSERVATIONS.
❖ Using Bayes' rule, this probability can be 'turned around':

P(cause | observation) = P(observation | cause) P(cause) / P(observation)
❖ In this approach, Bayes' theorem is used to compute the probability that the intended word is w when the typist has in fact typed x, e.g.:

P(the | thme)

This is called the posterior probability of w being the intended word.
❖ The word in the dictionary with the highest posterior probability is chosen as the intended word. In practice we take a set of candidates C for the input word x and maximize over C:

ŵ = argmax_{w ∈ C} P(x | w) P(w)
❖ We can also rank candidates according to log-posterior instead of posterior probability:

ŵ = argmax_{w ∈ C} [ log P(x | w) + log P(w) ]
Steps to develop the algorithm
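A minimal noisy-channel speller in the spirit of these slides (and of Norvig's well-known essay): generate candidates within one edit of the input, then pick the candidate maximizing the posterior. The word counts below are made up, and the channel model P(x|w) is simplified to uniform over single-edit candidates, so the ranking reduces to the prior P(w); real systems estimate P(x|w) from confusion matrices:

```python
from collections import Counter

# Toy unigram counts standing in for P(w); a real system uses a large corpus
COUNTS = Counter({"the": 1000, "they": 200, "them": 150, "thee": 2})
TOTAL = sum(COUNTS.values())
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def candidates(x):
    """Dictionary words within one edit of x (the candidate set C)."""
    splits = [(x[:i], x[i:]) for i in range(len(x) + 1)]
    edits = {L + R[1:] for L, R in splits if R}                          # deletion
    edits |= {L + c + R[1:] for L, R in splits if R for c in ALPHABET}   # substitution
    edits |= {L + c + R for L, R in splits for c in ALPHABET}            # insertion
    edits |= {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}  # transposition
    return {w for w in edits if w in COUNTS}

def correct(x):
    """Rank candidates by the prior P(w), the channel being uniform here."""
    C = candidates(x) or {x}
    return max(C, key=lambda w: COUNTS[w] / TOTAL)

print(correct("thme"))  # "the": most frequent word within one edit
```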
The best suggestion for the correct word given the incorrect spelling "acress" can be calculated using Bayes' rule; the candidates within one edit include words such as actress, cress, caress, access, across, and acres.
