
EE782 Advanced Topics in Machine Learning

05 Introduction to NLP
Amit Sethi
Electrical Engineering, IIT Bombay
Module objectives

• Review what NLP is

• Understand some key problems in NLP

• Appreciate earlier frameworks used for NLP

• Some example solutions to NLP problems


Outline

• NLP basics

• Pre-processing in NLP

• Language model with an example

• From words to vectors

• Some applications
What is Natural Language Processing?
• NLP is the analysis or generation of natural language text using computers, for example:
  – Machine translation
  – Spell check (autocorrect)
  – Automated query answering
  – Speech parsing (a problem that overlaps with ASR)
• NLP is based on:
  – Probability and statistics
  – Machine learning
  – Linguistics
  – Common sense
Why do NLP?
• Language is one of the defining characteristics of our species
• A large body of knowledge can be organized and easily accessed using NLP
• The original conception of the Turing test was based on NLP
Some standard terms
• Corpus: A body of text samples

• Document: A text sample

• Vocabulary: A list of words used in the corpus

• Language model: How the words are supposed to be organized


Examples of NLP tasks
• Corpus → Extract documents
• Document → Extract sentences
• Sentences → Extract tokens
• Tokens → Normal, stem, lemma forms
• Sentence, tokens → PoS tagging, NER
• Sentence, tokens → Parsing, e.g. chunking, chinking, syntax tree
• Document → Classification, e.g. sentiment analysis, topic extraction
• Sentence → Text synthesis, e.g. translation, Q&A
Example: text classification
• Sentiment analysis – positive or negative
  – "This is a ridiculously priced toothbrush. Seriously, no way to get around it. It is absurdly priced and I'm almost embarrassed to be admitting that I bought it. With that said... Wow, this thing is amazing."
  – "These pens make me feel so feminine and desirable. I can barely keep the men away when I'm holding one of these in my dainty hand. My husband has started to take fencing lessons just to keep the men away."

Text source: Amazon.com reviews


Example: Named entity recognition
• A real-world person, place, or object that can be given a proper noun:
  – "India posted a score of 256/8 in their allotted 50 overs in the third and deciding ODI of the series. Virat Kohli was the top-scorer for the men in blue with a classy 71, while Adil Rashid and David Willey picked up three wickets each."
  – India → Place, Virat Kohli → Person, …

Text source: Reuters


Example: PoS and Parsing text

[Figure: example syntax tree and PoS tags omitted] Tool source: http://mshang.ca/syntree/


Importance of context of a word
• “We were on a crash course.”

• Crash can mean an accident, a percussion strike, or a collapse.

• Course can mean a study plan, or a path.


Challenges in NLP

• Large vocabulary
• Multiple meanings
• Many word forms
• Synonyms
• Sarcasm, jokes, idioms, figures of speech
• Fluid style and usage
Basic text classification using ML
• Text sample: variable number of words
• Pre-processing: tokenization, normalization, etc.
• Feature: fixed-length vector
• Class: discrete set
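A minimal sketch of this pipeline using scikit-learn (assumed available); the texts, labels, and choice of classifier below are illustrative placeholders, not from the lecture:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# pre-processing + fixed-length feature vector, followed by a discrete classifier
clf = Pipeline([
    ("features", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])

train_texts = ["this toothbrush is absurdly priced but amazing",
               "these pens are flimsy and terrible"]
train_labels = ["positive", "negative"]
clf.fit(train_texts, train_labels)
print(clf.predict(["an amazing pen"]))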
Outline

• NLP basics

• Pre-processing in NLP

• Language model with an example

• From words to vectors

• Some applications
Tokenization
• Chopping up text into pieces called tokens
• Usually, each word is a token
§ Jeevan / saved / the / puppy
• How do you tokenize?
§ Split up at all non-alpha-numeric characters
§ What about apostrophes?
§ What about two-word entities, e.g. “New Delhi”?
§ What about compound words in Sanskrit and German?
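A minimal regex-based tokenizer sketch (standard library only), which also shows the apostrophe problem raised above:

import re

def tokenize(text):
    # keep runs of letters/digits; everything else (including apostrophes) splits tokens
    return re.findall(r"\w+", text.lower())

print(tokenize("Jeevan saved the puppy in New Delhi; it wasn't hurt."))
# ['jeevan', 'saved', 'the', 'puppy', 'in', 'new', 'delhi', 'it', 'wasn', 't', 'hurt']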
Stop words
• Words that are common
• Non-selective (excluding negation)
• Examples:
§ Articles: a, an, the
§ Common verbs: is, was, are
§ Pronouns: he, she, it
§ Conjunctions: for, and
§ Prepositions: at, on, with
§ Need not be used to classify text
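A short sketch of stop-word filtering with NLTK's built-in list (assuming the 'stopwords' corpus has been downloaded via nltk.download):

from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = ["jeevan", "saved", "the", "puppy"]
print([t for t in tokens if t not in stop])   # ['jeevan', 'saved', 'puppy']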
Normalization
§ Words appear in many forms:
§ School, school, schools
§ U.S.A, USA, U.S., US
§ But not “us”
§ Windows vs. windows/window
§ These need not be considered separate terms
§ Normalization is counting equivalent forms as one term
Stemming and Lemmatization
§ Stemming – chopping off the ends of words
  § Nannies becomes nanni (rule: *ies → *i)
  § Caresses becomes caress (rule: *sses → *ss)
  § This is a heuristic approach
§ Finding the lemma of a word is the more exact task
  § Nannies should become nanny
  § Privatization should become private
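A sketch of both operations with NLTK (assuming the WordNet data is downloaded); the stemmer applies suffix-chopping rules, while the lemmatizer looks up dictionary forms:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("nannies"), stemmer.stem("caresses"))                  # expected: nanni caress
print(lemmatizer.lemmatize("nannies"), lemmatizer.lemmatize("caresses"))  # expected: nanny caress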
Word vectors
• “India posted a score of 256/8 in their allotted 50 overs
in the third and deciding ODI of the series. Virat Kohli
was the top-scorer for men in blue with a classy 71,
while Adil Rashid and David Willey picked up three
wickets each”
• One-hot encoding (or 1-of-N encoding)

Text source: Reuters
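A minimal one-hot encoding sketch over a toy vocabulary (plain Python, no libraries assumed):

tokens = "india posted a score of 256/8 in their allotted 50 overs".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)     # one dimension per vocabulary word
    vec[index[word]] = 1       # a single 1 at the word's position
    return vec

print(one_hot("score"))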


Bag-of-words as a feature
§ "India posted a score of 256/8 in their allotted 50 overs in the third and deciding ODI of the series. Virat Kohli was the top-scorer for men in blue with a classy 71, while Adil Rashid and David Willey picked up three wickets each"
§ Represent the document by the count of each vocabulary word
  § The counts can be normalized
  § The words can be standardized, e.g. score and scorer
  § What about uninformative words?

Text source: Reuters
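A bag-of-words sketch using scikit-learn's CountVectorizer (assumed available); normalization and word standardization would be applied on top of these raw counts:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat is chasing the mouse", "the dog chased the cat"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(counts.toarray())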
TF-IDF as a feature
§ Term frequency – inverse document frequency
§ TF: f(t, d) is the count of term t in document d
  § Usually normalized in some way, e.g. tf(t, d) = f(t, d) / Σ_{t′ ∈ d} f(t′, d)
§ IDF penalizes terms that occur often in all documents, e.g. "the":
  § idf(t, D) = log( |D| / (1 + |{d ∈ D : t ∈ d}|) )
§ TF-IDF is the product: tfidf(t, d, D) = tf(t, d) × idf(t, D)
§ Form a vector of TF-IDF values for various terms
§ Which terms?
Examples of TF-IDF
§ Suppose the word dog appears 4 times in a document of 1000 words
  § TF = 4/1000 = 0.004
§ Suppose dog appears in 50 of 1 million documents
  § IDF = log(1,000,000 / 50) ≈ 4.3
  § So, TF-IDF = 0.004 × 4.3 ≈ 0.0172
§ Now suppose the word is appears 50 times in a document of 1000 words
  § TF = 50/1000 = 0.05
§ Suppose is appears in 40,000 of 1 million documents
  § IDF = log(1,000,000 / 40,000) ≈ 1.398
  § So, TF-IDF = 0.05 × 1.398 ≈ 0.0699

Without IDF, dog would not be able to compete with is.
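The worked example above can be reproduced in a few lines of plain Python; base-10 logarithms are assumed, since they match the slide's numbers (the 1+ smoothing term from the IDF definition is omitted here, as in the example):

import math

def tf_idf(term_count, doc_len, docs_with_term, total_docs):
    tf = term_count / doc_len
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf

print(tf_idf(4, 1000, 50, 1_000_000))       # dog: ~0.0172
print(tf_idf(50, 1000, 40_000, 1_000_000))  # is:  ~0.0699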


We can then use traditional ML methods
• Text: “India posted a score of 256/8 in their allotted 50 overs in the third
and deciding ODI of the series. Virat Kohli was the top-scorer for men in
blue with a classy 71, while Adil Rashid and David Willey picked up three
wickets each”

• Sum the one-hot word vectors to form a document-level feature:

• Topic: “Cricket”

Text source: Reuters


Outline

• NLP basics

• Pre-processing in NLP

• Language model with an example

• From words to vectors

• Some applications
Language model: predicting words

§ Can you predict the next word?

The stocks fell again today for a third day ________ in this week.
§ Clearly, we can narrow down the choice of
next word, and sometimes even get it right.
§ How?
§ Domain knowledge: third day vs. third minute
§ Syntactic knowledge: a …<adjective | noun>
A language model is perhaps
fundamental to how our mind works
• Even illiterate people can predict the next spoken word with
some certainty in their native language
• This comes from experience with lots of conversational
sentences
• Can a machine gain such “experience?”
• How would such “experience” be modeled?
• What can it be used for?
A probabilistic model of language
• What is the probability of a word? Which words are highly likely?
  – A, an, the, he, she, it: P(w_m)
  – What about "obsequious?"
  – …
• What is the probability of a word given its:
  – previous word? P(w_m | w_{m−1})
  – previous two words? P(w_m | w_{m−1}, w_{m−2})
  – previous three words? P(w_m | w_{m−1}, w_{m−2}, w_{m−3})
  – …
An example: Guess the word!
• *** *** ****** **** ** ?
• *** *** ****** **** me ?
• *** *** ****** pick me ?
• *** *** please pick me ?
• *** you please pick me ?
• Can you please pick me ?
• Can you please pick me up?
N-gram: Markovian assumption
• The information provided by the immediately
previous word(s) is the most useful for
prediction
• We need not use more than n previous words
Unigram: P(w_m | w_{m−1}, w_{m−2}, ..., w_1) ≈ P(w_m)
Bigram:  P(w_m | w_{m−1}, w_{m−2}, ..., w_1) ≈ P(w_m | w_{m−1})
Trigram: P(w_m | w_{m−1}, w_{m−2}, ..., w_1) ≈ P(w_m | w_{m−1}, w_{m−2})
n-gram:  P(w_m | w_{m−1}, w_{m−2}, ..., w_1) ≈ P(w_m | w_{m−1}, w_{m−2}, ..., w_{m−n+1})

• This simplifies our model
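A minimal bigram model sketch with maximum-likelihood estimates from raw counts; the two-sentence corpus is a stand-in for Shakespeare or the Wall Street Journal:

from collections import Counter, defaultdict

corpus = ["the stocks fell again today", "the stocks rose today"]
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1

def p_bigram(cur, prev):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

print(p_bigram("stocks", "the"))   # 1.0: 'the' is always followed by 'stocks' in this corpus
print(p_bigram("fell", "stocks"))  # 0.5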


How many n-grams are there?
• About 20,000 words (unigrams)
• So, about 400,000,000 (20,000²) possible bigrams, and
• about 8,000,000,000,000 (20,000³) possible trigrams

• But, are all the bigrams and trigrams equally likely?


– The is a common word.
– The the does not even make sense.
• Yet, we want n to be small
Learn N-grams through examples
• Examples from corpora
– Shakespeare
– Wall Street Journal
– Thomson Reuters
• Depending on the corpus, the machine will learn that vocabulary and style; it can even sound like Shakespeare
– Where art thou ****
– Where art thou my ****
– Where art thou my forlorn ****
– Where art thou my forlorn prince?
How does this help us?

• Automatic speech recognition (ASR)


– “There was a bay-er behind the bushes”
– Did she say bear or bare or beer or bar?
– Noun, adjective, verb?
– Or simply use the previous words
– This requires many, many examples such that
all n-grams that we are ever likely to
encounter are seen with reliable frequencies
It also helps spell check software

• Context for the word being checked


• Two types of spelling mistakes:
– Non words
• “There was a baer behind the bushes”
– Wrong words
• “There was a bare behind the bushes”
• Both benefit from a language model
Typical causes of spelling mistakes

• Exchanging two letters, e.g. baer


• Typing the wrong key, e.g. bwar
• Missing a letter, e.g. bar (for bear)
• Adding an extra letter, e.g. beear
• Wrong homophone, e.g. bare or beer
• OCR errors, e.g. bcar
Let us model word distortion
• What is the probability of exchanging two letters?
• What is the probability of typing the wrong key?
– Does it depend on the distance from the right key on
keyboard?
• What is the probability of missing a letter?
• …

The distortion model is called the channel model


Channel model example: edit distance
• How many additions, deletions?
– BEAR: (1) FEAR
– BEAR: (1) FEAR, (2) F-AR
– BEAR: (1) FEAR, (2) F-AR, (3) FARE
• Should additions and deletions have equal
weight?
• What about exchange of two letters?
• What about pressing wrong neighboring key?

Channel model: P(typed word | candidate word)


Putting the two models together
• Bayes theorem and the chain rule to the rescue:
  P(A, B) = P(A|B) × P(B) = P(B|A) × P(A)
• Let W′ be the typed word, F the phrase before it, and W a candidate word
• Find the W that maximizes P(W | W′, F), i.e. the candidate's probability given the observed data:
  P(W | W′, F) = P(W, W′, F) / P(W′, F)
              ∝ P(W, W′, F)
              = P(W′ | W, F) × P(W, F)
              = P(W′ | W, F) × P(W | F) × P(F)
              ∝ P(W′ | W) × P(W | F)
              = Channel model × Language model
• That is, the best candidate is the one most likely to have produced the observed distortion AND to make sense language-wise
A probabilistic model of spell check has two parts
• Noisy channel model: P(w′_m | w_m)
  – Could be based on the edit distance between strings
  – E.g. a Gaussian function of edit distance
• Markov language model: P(w_m | w_{m−1}, ..., w_{m−n+1})
  – This could be an n-gram model
  – It gives the relative probability of each candidate word given the preceding n−1 words
• The correct word maximizes the product:
  arg max_{w_m} P(w′_m | w_m) × P(w_m | w_{m−1}, ..., w_{m−n+1})
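A toy noisy-channel sketch that combines the two parts: the channel model is taken here as exp(−edit distance), one possible monotone choice, and the language model is a hand-made bigram table (all numbers are made up for illustration):

import math

# made-up P(candidate | previous word) values standing in for a learned bigram model
bigram_lm = {("a", "bear"): 0.01, ("a", "bare"): 0.0001,
             ("a", "beer"): 0.002, ("a", "bar"): 0.003}

def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(typed, prev_word):
    candidates = [w for (p, w) in bigram_lm if p == prev_word]
    def score(w):
        channel = math.exp(-edit_distance(typed, w))   # P(typed | candidate), decreasing in edit distance
        language = bigram_lm[(prev_word, w)]           # P(candidate | previous word)
        return channel * language
    return max(candidates, key=score)

print(correct("baer", "a"))   # 'bear': the best trade-off between channel and language model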
Hidden Markov model
[Graphical model: a chain of hidden states s_{t−3} → s_{t−2} → s_{t−1} → s_t, each emitting an observation x_{t−3}, x_{t−2}, x_{t−1}, x_t]

• A discrete set of hidden states


• Each state depends only on the previous state
• A discrete set of observations
• Each observation only depends on the current
state
• Inference is based on maximum likelihood
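Maximum-likelihood inference in such an HMM is usually done with the Viterbi algorithm; below is a compact sketch with toy transition and emission tables (illustrative values, not from the lecture):

def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best state path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max((best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p) for p in states)
            best[t][s], back[t][s] = prob, prev
    # trace back the most likely state sequence
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

states = ("Noun", "Verb")
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dogs": 0.7, "bark": 0.3}, "Verb": {"dogs": 0.1, "bark": 0.9}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['Noun', 'Verb']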
Role of linguistics in NLP, an example
• What if an n-gram wasn't in the corpus?
• Knowledge of parts of speech (POS) can help
• Another NLP problem: POS tagging
• Linguistics uncovers language syntax, grammar, and POS patterns
• Now word choices can be limited by POS for ASR or spell check
  – No bare! There was a **** behind the bushes
[Figure: syntax tree (S → NP VP, with P, V, A, N, PP nodes) omitted]


Outline

• NLP basics

• Pre-processing in NLP

• Language model with an example

• From words to vectors

• Some applications
Encoding
• Moving from sparse (e.g. one-hot) to dense vectors

• Each dimension could represent attributes such as geography, gender, PoS, etc.
Why do we need dense vectors?
• Sparse (one-hot) vectors are high dimensional

• Sparse vectors do not have a neighborhood or directional relationship between words

• Inserting a new word into the vocabulary will lead to catastrophic changes in the input space
CBOW and Skip-Gram
§ Example: It was a cat that made all the noise
§ In continuous bag-of-words (CBOW), we try to predict a word given its surrounding context (e.g. positions ± 2):
  § (was → cat), (a → cat), (that → cat), (made → cat)
§ In a skip-gram model, we try to predict the contextual words (e.g. positions ± 2) given a particular word:
  § (cat → was), (cat → a), (cat → that), (cat → made)
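A small sketch that generates the (centre, context) skip-gram pairs for a ±2 window, matching the example sentence above:

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "it was a cat that made all the noise".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "cat"])
# [('cat', 'was'), ('cat', 'a'), ('cat', 'that'), ('cat', 'made')]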
Visualizing CBOW and Skip-Gram
[Architecture diagram: CBOW takes the one-hot encodings of the words from ± w positions (was, a, that, made), passes them through a shared weight matrix to a dense encoding, and a softmax layer predicts the one-hot encoding of the centre word (cat). Skip-gram reverses the arrangement: the one-hot encoding of cat is mapped through a weight matrix to a dense encoding, and a softmax layer predicts the one-hot encodings of the surrounding words (was, a, that, made).]

Adapted from: https://arxiv.org/pdf/1301.3781.pdf
How it is trained
• The objective is to maximize the probability of actual skip-grams, while minimizing the probability of non-existent (negatively sampled) skip-grams
• Model P((w, c) is a real pair) = σ(v_c · v_w), where σ is the sigmoid function
• Maximize Σ_{(w,c) ∈ D} log σ(v_c · v_w) + Σ_{(w,c) ∈ D′} log σ(−v_c · v_w), where D is the set of observed (word, context) pairs and D′ is a set of negative samples
Adapted from: https://arxiv.org/pdf/1402.3722.pdf


Importance of negative sampling
• If we just had positive pairs, the objective would reduce to Σ_{(w,c) ∈ D} log σ(v_c · v_w)

• Then, we could trivially maximize it by making every v_c · v_w a very large number (e.g. by making all vectors identical and long), without learning anything useful
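In practice the whole skip-gram-with-negative-sampling setup is available off the shelf, e.g. in gensim (assumed installed; parameter names follow gensim 4.x). The two sentences below are a toy placeholder corpus:

from gensim.models import Word2Vec

sentences = [["it", "was", "a", "cat", "that", "made", "all", "the", "noise"],
             ["the", "cat", "is", "chasing", "the", "mouse"]]
model = Word2Vec(sentences, vector_size=50, window=2,
                 sg=1,        # 1 = skip-gram, 0 = CBOW
                 negative=5,  # number of negative samples per positive pair
                 min_count=1, epochs=50)
print(model.wv["cat"].shape)  # (50,)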
Other considerations
• Make it more likely to drop frequent words (subsampling)
  – Probability of keeping word w_i decreases with its frequency f(w_i), e.g. P(keep w_i) = sqrt(t / f(w_i)) for a small threshold t
  – This effectively counters stop words such as 'the'
• Negative sampling is based on frequency
  – E.g., probability of drawing w_i as a negative sample: P(w_i) = f(w_i) / Σ_{j=1}^{|V|} f(w_j)
  – Practically, raising frequencies to the 3/4 power worked better: P(w_i) = f(w_i)^{3/4} / Σ_{j=1}^{|V|} f(w_j)^{3/4}
The new vectors can directly be used to find analogs
• E.g. v_prince − v_boy + v_girl ≈ v_princess

[Figure: 2-D sketch showing the offset from boy to girl parallel to the offset from prince to princess]
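A usage sketch of the analogy query, assuming `model` is a gensim Word2Vec model trained on a corpus large enough to contain these words (the toy model above would not be):

result = model.wv.most_similar(positive=["prince", "girl"], negative=["boy"], topn=1)
print(result)  # for a well-trained model, 'princess' should rank near the top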
Word2Vec example results

[Table of example analogy results omitted] Source: https://arxiv.org/pdf/1301.3781.pdf
Word2vec design choices
§ Dimension of the vector
§ Large dimension is more expressive
§ Small dimension trains faster
§ No incremental gain after a particular dimension
§ Number of negative samples
§ Increases the search space
§ Gives better models
§ Neural network architecture
§ Hidden units to convert the one-hot input into a dense vector
Co-occurrence matrix
Raw counts within a certain window:
          cat   is   chasing  the  mouse
cat        0     2      0      3     0
is         2     0      2      0     0
chasing    0     2      0      3     4
the        3     0      3      0     5
mouse      0     0      4      5     0

Counts converted to probabilities (row-normalized):
          cat    is   chasing   the   mouse
cat      0.00   0.40    0.00   0.60   0.00
is       0.50   0.00    0.50   0.00   0.00
chasing  0.00   0.22    0.00   0.33   0.44
the      0.27   0.00    0.27   0.00   0.45
mouse    0.00   0.00    0.44   0.56   0.00

• Consider two similar words (cat, kitty) with similar contexts

• Their co-occurrence vectors will be similar
• The co-occurrence matrix will therefore be (approximately) low rank
• We can factorize the co-occurrence matrix using SVD: C = U Σ Vᵀ
• SVD is an expensive operation for a large vocabulary
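A sketch of this count-based route: build a windowed co-occurrence matrix and factorize it with numpy's SVD (toy sentence, window of ±2):

import numpy as np

tokens = "the cat is chasing the mouse".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
window = 2

C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            C[idx[w], idx[tokens[j]]] += 1

U, S, Vt = np.linalg.svd(C)       # C = U diag(S) Vt
embeddings = U[:, :2] * S[:2]     # keep the top-2 singular directions as dense word vectors
print(dict(zip(vocab, embeddings.round(2))))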
GloVe: Global Vectors
§ GloVe captures word-word co-occurrences over the entire corpus better
§ Let X_ij be the number of times word j occurs in the context of word i
§ Let X_i = Σ_j X_ij
§ And, let P_ij = X_ij / X_i be the co-occurrence probability
§ What GloVe models is F((w_i − w_j)ᵀ w̃_k) = P_ik / P_jk

"GloVe: Global Vectors for Word Representation" by Jeffrey Pennington, Richard Socher, Christopher D. Manning
GloVe explanation
§ Cost function: J = Σ_{i,j=1}^{|V|} f(X_ij) (w_iᵀ w̃_j + b_i + b̃_j − log X_ij)²

§ For words i and j, X_ij is their co-occurrence count and P_ij the co-occurrence probability

§ And f(x) is a weighting function, e.g. f(x) = (x / x_max)^α for x < x_max and 1 otherwise

  § It suppresses rare co-occurrences

  § And prevents frequent co-occurrences from taking over

"GloVe: Global Vectors for Word Representation" by Jeffrey Pennington, Richard Socher, Christopher D. Manning
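A one-function sketch of the weighting term, using the defaults reported in the GloVe paper (x_max = 100, α = 3/4):

def glove_weight(x, x_max=100.0, alpha=0.75):
    # small for rare co-occurrences, capped at 1 for frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1), glove_weight(100), glove_weight(1000))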
GloVe is more accurate than word2vec

[Comparison figure omitted] The accuracy shown is on the word analogy task.

"GloVe: Global Vectors for Word Representation" by Jeffrey Pennington, Richard Socher, Christopher D. Manning
Outline

• NLP basics

• Pre-processing in NLP

• Language model with an example

• From words to vectors

• Some applications
Application: PoS tagging
• Goal: Find part-of-speech of each word
• Application: Use in language model to structure
sentences better
• Example:
Amit found the tray and started to bring it to the guest
NNP VBD DT NN CC VBD TO VB PRP IN DT NN

• Certain regular expressions can be helpful


• For example, words ending with *ing are usually verbs
• Corpora with tagged words can be used
• For example, Brown corpus
Examples of tags
• Nouns
  • Singular noun → NN (cat)
  • Plural noun → NNS (cats)
  • Proper noun → NNP (Garfield)
  • Personal pronoun → PRP (he)
• Verbs
  • Base verb → VB (sleep)
  • Gerund → VBG (sleeping)
• Preposition → IN (over)
• Adjectives
  • Basic → JJ (bad)
  • Comparative → JJR (worse)
• Adverbs
  • Basic → RB (quickly)
• Determiners
  • Basic → DT (a, an, the)
  • WH → WDT (which, who)
• Coordinating conjunction → CC (and, or, however)
Some PoS Tagging Challenges
• Ambiguity that needs context
– It is a quick read (NN)
– I like to read (VB)

• Differences in numbers of tags


• Brown has 87 tags
• British National Corpus has 61 tags
• Penn Treebank has 45 tags (several merged)
Approaches to PoS Tagging
• Learn from corpora
• Use regular expressions
– Words ending with ‘ed’ or ‘ing’ are likely to be of a
certain kind
• Use context
– POS of preceding words and grammar structure
– For example, n-gram approaches
• Map untagged words using an embedding
• Use recurrent neural networks
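A corpus-trained tagger is available off the shelf in NLTK (assuming the tokenizer and tagger resources have been downloaded with nltk.download); tags follow the Penn Treebank set listed earlier:

import nltk

tokens = nltk.word_tokenize("Amit found the tray and started to bring it to the guest")
print(nltk.pos_tag(tokens))
# e.g. [('Amit', 'NNP'), ('found', 'VBD'), ('the', 'DT'), ('tray', 'NN'), ...]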
Application: Named entity recognition
• Something which has a name:
• Person, place, thing, time
• Example:
• Thereafter, Amit went to D-Mart
• Amit → person, D-Mart → place
• Application:
• Tag texts for relevance and search
Some challenges with NER
• Different entities sharing the same name
• Manish Jindal  Person
• Jindal Steel  Thing (company)
• Common words that are also names
• Do you want it with curry or dry
• Tyler Curry
• Ambiguity in the order, abbreviation, style
• Jindal, Manish
• Dept. of Electrical Engineering
• De Marzo, DeMarzo
Approaches to NER
• Match to an NE in a tagged corpus
– Fast, but cannot deal with ambiguities
• Rule based
– E.g. capitalization of first letter
– Does not always work, especially between different types of proper
nouns
• Recurrent neural network based
– Learn from a NE tagged corpus
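A sketch of the learned approach using a pretrained spaCy pipeline (assuming the 'en_core_web_sm' model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Thereafter, Amit went to D-Mart in Mumbai.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Amit', 'PERSON'), ('D-Mart', 'ORG'), ('Mumbai', 'GPE')]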
