05 Introduction to NLP
Amit Sethi
Electrical Engineering, IIT Bombay
Module objectives
• NLP basics
• Pre-processing in NLP
• Some applications
What is Natural Language Processing?
• NLP is the analysis or generation of natural language text using computers, for example:
  – Machine translation
  – Spell check (autocorrect)
  – Automated query answering
  – Speech parsing (a problem that overlaps with ASR)
• NLP is based on:
  – Probability and statistics
  – Machine learning
  – Linguistics
  – Common sense
Why do NLP?
• Language is one of the defining characteristics
of our species
• A large body of knowledge can be organized
and easily accessed using NLP
• Original conception of the Turing test was
based on NLP
Some standard terms
• Corpus: a body of text samples
What makes natural language hard?
• Large vocabulary
• Multiple meanings
• Many word forms
• Synonyms
• Sarcasm, jokes, idioms, figures of speech
• Fluid style and usage
Basic text classification using ML
Text sample (variable number of words)
  → Pre-processing (tokenization, normalization, etc.)
  → Feature (fixed-length vector)
  → Class (discrete set)
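As a concrete illustration of this pipeline, here is a minimal sketch using scikit-learn; the library choice and the toy corpus are assumptions, not from the slides:

    # Sketch of the pipeline above: text -> pre-processing -> fixed-length
    # feature vector -> discrete class (toy data, illustrative only).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["Kohli scored a classy 71 in the ODI",
             "The election results were announced today",
             "Rashid picked up three wickets"]
    labels = ["cricket", "politics", "cricket"]

    # CountVectorizer handles tokenization and normalization and produces a
    # fixed-length vector (one dimension per vocabulary term)
    model = make_pipeline(CountVectorizer(lowercase=True, stop_words="english"),
                          MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["Willey bowled a tight over"]))  # -> ['cricket']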
Outline
• NLP basics
• Pre-processing in NLP
• Some applications
Tokenization
• Chopping up text into pieces called tokens
• Usually, each word is a token
§ Jeevan / saved / the / puppy
• How do you tokenize?
§ Split at every non-alphanumeric character (a naive version is sketched below)
§ What about apostrophes?
§ What about two-word entities, e.g. “New Delhi”?
§ What about compound words in Sanskrit and German?
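A minimal sketch of the naive splitting rule above, showing exactly where it breaks down:

    import re

    # Naive rule from the slide: split at every non-alphanumeric character.
    def tokenize(text):
        return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

    print(tokenize("Jeevan saved the puppy"))  # ['Jeevan', 'saved', 'the', 'puppy']
    print(tokenize("Jeevan's puppy"))          # ['Jeevan', 's', 'puppy'] - apostrophes break
    print(tokenize("New Delhi"))               # ['New', 'Delhi'] - two-word entity split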
Stop words
• Words that are common
• Non-selective (negation words are an important exception)
• Examples:
  § Articles: a, an, the
  § Common verbs: is, was, are
  § Pronouns: he, she, it
  § Conjunctions: for, and
  § Prepositions: at, on, with
• Stop words need not be used to classify text and are often removed (see the sketch below)
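A minimal stop-word filter; the stop list below is a hand-picked illustration built from the examples above, while real systems use curated lists such as NLTK's:

    # Toy stop-word removal; negation words like "not" are deliberately
    # kept out of the list because they flip meaning.
    STOP_WORDS = {"a", "an", "the", "is", "was", "are",
                  "he", "she", "it", "for", "and", "at", "on", "with"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["Jeevan", "saved", "the", "puppy"]))
    # -> ['Jeevan', 'saved', 'puppy']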
Normalization
§ Words appear in many forms:
  § School, school, schools
  § U.S.A., USA, U.S., US (but not "us")
  § Windows vs. windows/window
§ These need not be treated as separate terms
§ Normalization means counting equivalent forms as one term (a toy example follows)
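A toy normalization function along the lines above; the equivalence rules are illustrative assumptions, not a standard:

    # Case folding plus collapsing punctuation variants; all-caps acronyms
    # are kept as-is so that "US" and "us" stay distinct.
    def normalize(token):
        token = token.replace(".", "")   # U.S.A -> USA
        if token.upper() != token:       # not an all-caps acronym
            token = token.lower()        # School -> school
        return token

    for t in ["School", "school", "U.S.A", "USA", "us"]:
        print(t, "->", normalize(t))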
Stemming and Lemmatization
§ Stemming: heuristically chopping off word endings
  § "nannies" becomes "nanni" (rule: *ies → *i)
  § "caresses" becomes "caress" (rule: *sses → *ss)
§ Lemmatization: finding the lemma (dictionary form) of a word is the more exact task
  § "nannies" should become "nanny"
  § "privatization" should become "private"
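The contrast can be reproduced with NLTK's Porter stemmer and WordNet lemmatizer, assuming NLTK is installed and the WordNet data has been downloaded (nltk.download('wordnet')):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["nannies", "caresses"]:
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
    # nannies  -> stem 'nanni',  lemma 'nanny'
    # caresses -> stem 'caress', lemma 'caress'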
Word vectors
• “India posted a score of 256/8 in their allotted 50 overs in the third and deciding ODI of the series. Virat Kohli was the top-scorer for men in blue with a classy 71, while Adil Rashid and David Willey picked up three wickets each”
• One-hot encoding (or 1-of-N encoding)
• Topic: “Cricket”
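A minimal sketch of 1-of-N encoding over a toy vocabulary; the vocabulary itself is an assumption:

    import numpy as np

    # One-hot (1-of-N) encoding: each word becomes a vector with a single 1.
    vocab = ["cricket", "score", "overs", "wickets", "kohli"]
    index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[index[word]] = 1.0
        return v

    print(one_hot("overs"))  # [0. 0. 1. 0. 0.]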
Outline
• NLP basics
• Pre-processing in NLP
• Some applications
Language model: predicting words
• A language model assigns probabilities to word sequences, e.g. to pick the most likely correction in spell check (a toy sketch follows)
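As a minimal sketch of word prediction, here is a bigram language model estimated from raw counts, which a spell checker could use to rank candidate corrections; the corpus and probabilities are toy values:

    from collections import Counter, defaultdict

    # Bigram language model: estimate P(next | prev) from raw counts.
    corpus = "the cat is chasing the mouse and the mouse is hiding".split()
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def p_next(prev, nxt):
        total = sum(bigrams[prev].values())
        return bigrams[prev][nxt] / total if total else 0.0

    # A spell checker could rank candidate corrections by such probabilities:
    print(p_next("the", "mouse"))  # 2/3
    print(p_next("the", "cat"))    # 1/3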
Outline
• NLP basics
• Pre-processing in NLP
• Some applications
Encoding
• Moving from sparse (e.g. one-hot) representations to dense vectors
[Figure: a one-hot encoded word, e.g. "cat", multiplied by a weight matrix yields a dense vector; related words such as prince/princess and boy/girl end up close together]
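The figure's matrix multiplication can be written out directly: multiplying a one-hot vector by the weight matrix simply selects one row, which is the word's dense embedding (sizes below are toy values):

    import numpy as np

    # One-hot times weight matrix = row lookup: that row is the embedding.
    vocab_size, embed_dim = 5, 3
    W = np.random.randn(vocab_size, embed_dim)  # learned embedding matrix

    one_hot = np.zeros(vocab_size)
    one_hot[2] = 1.0                            # say index 2 is "cat"

    dense = one_hot @ W                         # equals W[2]
    assert np.allclose(dense, W[2])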
Word2Vec example results
Source: https://arxiv.org/pdf/1301.3781.pdf
Word2vec design choices
§ Dimension of the vectors
  § A larger dimension is more expressive
  § A smaller dimension trains faster
  § Little incremental gain beyond a certain dimension
§ Number of negative samples
  § More negatives enlarge the search space and training cost
  § But tend to give better models
§ Neural network architecture
  § Hidden units convert the one-hot input into a dense vector
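These choices map directly onto the parameters of, for example, gensim's Word2Vec implementation (assuming gensim >= 4.0; the toy sentences are illustrative):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "is", "chasing", "the", "mouse"],
                 ["the", "dog", "is", "chasing", "the", "cat"]]

    model = Word2Vec(sentences,
                     vector_size=50,  # embedding dimension
                     sg=1,            # 1 = skip-gram, 0 = CBOW
                     negative=5,      # number of negative samples
                     window=2,
                     min_count=1)
    print(model.wv["cat"].shape)      # (50,)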
Co-occurrence matrix
Raw counts within a certain window:

             cat    is   chasing   the   mouse
    cat       0     2       0       3      0
    is        2     0       2       0      0
    chasing   0     2       0       3      4
    the       3     0       3       0      5
    mouse     0     0       4       5      0

Counts converted to row-wise probabilities:

             cat    is   chasing   the   mouse
    cat     0.00  0.40    0.00    0.60   0.00
    is      0.50  0.00    0.50    0.00   0.00
    chasing 0.00  0.22    0.00    0.33   0.44
    the     0.27  0.00    0.27    0.00   0.45
    mouse   0.00  0.00    0.44    0.56   0.00
“GloVe: Global Vectors for Word Representation” by Jeffrey Pennington, Richard Socher, Christopher D. Manning
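A sketch of how such a matrix is accumulated with a symmetric window; the slide's table was presumably built from a larger window or corpus, so the window size here is an assumption:

    from collections import defaultdict

    # Co-occurrence counts with a symmetric +/-1 token window.
    tokens = "the cat is chasing the mouse".split()
    window = 1
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(w, tokens[j])] += 1

    print(counts[("cat", "is")])  # 1: "cat" and "is" are adjacent once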
GloVe explanation
§ Cost function:
    J = \sum_{i,j=1}^{V} f(X_{ij}) ( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} )^2
  where X_{ij} is the co-occurrence count of words i and j, w and \tilde{w} are word and context vectors, b and \tilde{b} are biases, and f is a weighting function that caps the influence of very frequent pairs
“GloVe: Global Vectors for Word Representation” by Jeffrey Pennington, Richard Socher, Christopher D. Manning
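Numerically, the weighting function and a single pair's loss term look like this; x_max = 100 and alpha = 0.75 are the defaults from the cited paper, and the vectors are random toy values:

    import numpy as np

    # GloVe weighting function: down-weights rare pairs, caps frequent ones.
    def f(x, x_max=100.0, alpha=0.75):
        return (x / x_max) ** alpha if x < x_max else 1.0

    # Loss contribution of one word/context pair (i, j).
    def pair_loss(w_i, w_j, b_i, b_j, x_ij):
        return f(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

    w_i, w_j = np.random.randn(50), np.random.randn(50)
    print(pair_loss(w_i, w_j, 0.0, 0.0, x_ij=3.0))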
GloVe is more accurate than word2vec (on the word-analogy benchmarks reported in the paper)
“GloVe: Global Vectors for Word Representation” by Jeffrey Pennington, Richard Socher, Christopher D. Manning
Outline
• NLP basics
• Pre-processing in NLP
• Some applications
Application: PoS tagging
• Goal: find the part of speech of each word
• Application: helps a language model capture sentence structure better
• Example:
  Amit/NNP found/VBD the/DT tray/NN and/CC started/VBD to/TO bring/VB it/PRP to/IN the/DT guest/NN
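The example can be reproduced with NLTK's off-the-shelf tagger, assuming the required NLTK data is downloaded (nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')); the exact tags may differ slightly from the slide:

    import nltk

    sentence = "Amit found the tray and started to bring it to the guest"
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # [('Amit', 'NNP'), ('found', 'VBD'), ('the', 'DT'), ('tray', 'NN'), ...]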