NLP - 1 - 250119 - 222702
By Vivek Anand
Introduction
• What is NLP?
• Why is NLP important?
• Core concepts:
• Preprocessing
• Tokenization
• Parts of Speech tagging
• Vectorization
Main topics
Text Preprocessing
Text Vectorization
Potential Use Cases
Sentiment Analysis
Spam Detection
Topic Modelling
Neural Networks for Text Classification
Chatbots
Text Summarization
WHAT IS NLP?
Natural Language Processing
For example:
• "course." and "course?" become two different tokens, which causes redundancy and requires additional training data for the model to learn that they mean the same thing.
• The number of dimensions becomes large.
Accented characters
• Words containing accented characters can be mistaken for separate tokens despite having the same meaning.
• For example, "résumé" and "resume" would be treated as two different tokens.
• They can be converted into plain ASCII characters to standardize them (see the sketch below).
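A minimal sketch of accent stripping using Python's standard-library unicodedata module (the slide does not name a tool, so this approach and the helper name strip_accents are just one common choice):

    import unicodedata

    def strip_accents(text: str) -> str:
        """Convert accented characters to plain ASCII equivalents."""
        # NFKD decomposition splits each accented character into a base
        # character plus a combining mark
        normalized = unicodedata.normalize("NFKD", text)
        # Dropping non-ASCII bytes removes the combining marks
        return normalized.encode("ascii", "ignore").decode("ascii")

    print(strip_accents("résumé"))  # -> resume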
Is punctuation important?
• Example 1: “He likes the course.”
• Example 2: “He likes the course?”
• The sentences contain the same words, yet their meanings differ, so stripping punctuation can discard information.
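When punctuation is not needed for the task at hand, a common preprocessing step is to strip it. A minimal sketch using Python's built-in string module (the helper name remove_punctuation is mine, not from the slides):

    import string

    def remove_punctuation(text: str) -> str:
        """Remove all ASCII punctuation characters from the text."""
        return text.translate(str.maketrans("", "", string.punctuation))

    print(remove_punctuation("He likes the course?"))  # -> He likes the course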
• Process: Input word → Part-of-speech tagging → Database lookup → Output lemma
Lemmatization
• Looks beyond chopping words and considers the language's vocabulary.
• More robust and not rule based.
• The resulting words are always real words.
3. Text Processing - Case sensitivity
• How does case sensitivity affect outcomes?
• For example, the words "apple" and "Apple" have the same meaning.
• Models shouldn't mistake them for two separate tokens.
• This can be overcome with the built-in string method lower(), e.g. s.lower() (see below).
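A one-line illustration of lowercasing during preprocessing:

    tokens = ["Apple", "apple", "APPLE"]
    # Lowercasing collapses case variants into a single token
    print([t.lower() for t in tokens])  # -> ['apple', 'apple', 'apple']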
Hands on using NLTK – Stemming & Lemmatization
• Stemmers (see the sketch below):
• Porter
• Snowball
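A minimal NLTK stemming sketch (assumes the nltk package is installed; note that stems need not be real words, unlike lemmas):

    from nltk.stem import PorterStemmer, SnowballStemmer

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")

    for word in ["running", "courses", "thrilling"]:
        # Stemming chops suffixes by rule; "courses" becomes the
        # non-word "cours", which lemmatization would avoid
        print(word, porter.stem(word), snowball.stem(word))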
Part-of-speech tagging aids lemmatization (see the sketch below)
• E.g.: "He is running" – here "running" is a verb
• E.g.: "The running bulls are thrilling" – here "running" acts as a noun
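A POS-aware lemmatization sketch with NLTK's WordNetLemmatizer (assumes the wordnet corpus has been downloaded via nltk.download("wordnet")):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # The POS tag changes the lemma: as a verb "running" maps to "run",
    # as a noun it stays "running"
    print(lemmatizer.lemmatize("running", pos="v"))  # -> run
    print(lemmatizer.lemmatize("running", pos="n"))  # -> running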
• NER Applications (a sketch follows this list):
• Information extraction
• Knowledge graph construction
• Chatbot development
• Text-to-speech conversion
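A minimal NER sketch using NLTK's chunker (assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources have been downloaded; the sentence is my own example):

    import nltk

    sentence = "Apple is opening a new office in London."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # POS tags
    tree = nltk.ne_chunk(tagged)                         # named-entity chunks
    # Entity subtrees carry labels such as ORGANIZATION, PERSON, GPE
    for subtree in tree:
        if hasattr(subtree, "label"):
            print(subtree.label(), " ".join(tok for tok, _ in subtree.leaves()))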
TEXT VECTORIZATION
Convert text to vectors
Text Vectorization
• Introduction
• One Hot Encoding
• Word to index mapping
• Python hands on
• Bag of Words (BOW)
• Count vectorizer
• Hands on
• ML application using Count vectorizer
• TF-IDF
• Hands on
• Word2vec
• Shallow Neural network approach
• GloVe (Global Vectors for Word Representation)
ONE HOT ENCODING
Text Vectorization – One Hot encoding
• E.g.: "The food is good. The food is bad. Tiramisu is Amazing."
• Vocabulary = {The, food, is, good, bad, Tiramisu, Amazing}
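A from-scratch one-hot sketch over this vocabulary (plain Python, no library assumed; the helper name one_hot is mine):

    vocab = ["The", "food", "is", "good", "bad", "Tiramisu", "Amazing"]

    def one_hot(word):
        """Return a vector with a 1 at the word's vocabulary position."""
        vec = [0] * len(vocab)
        vec[vocab.index(word)] = 1
        return vec

    # Each word becomes a 7-dimensional vector; a sentence becomes a matrix
    for w in "The food is good".split():
        print(w, one_hot(w))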
Advantages and disadvantages of One Hot Encoding

Advantages:
• Easy to implement

Disadvantages:
• Sparse matrix → overfitting
• No fixed matrix size across different sentences, which ML models require (refer to the example below)
• No semantic meaning is captured
WORD TO INDEX MAPPING
Word to index mapping
• Disadvantages (a sketch of the mapping follows this list):
• Ignores semantics
• Vocabulary dependent (any new word requires re-indexing)
• Fails to capture context
• Not scalable to large vocabularies
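A minimal word-to-index mapping sketch in plain Python (the sentences reuse the corpus from the Bag of Words slide below):

    sentences = ["good boy", "good girl", "boy girl good"]

    # Each new word gets the next integer id as it is first seen
    word_to_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)

    print(word_to_index)  # -> {'good': 0, 'boy': 1, 'girl': 2}
    print([word_to_index[w] for w in "good boy".split()])  # -> [0, 1]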
BOW – COUNT VECTORIZER
Text Vectorization – Bag of Words – Count Vectorizer

Text                      Operation                 Result
He is a good boy          Lowercase conversion      S1: good boy
She is a good girl        Remove stop words         S2: good girl
Boy and girl are good                               S3: boy girl good
Advantages:
• Simplicity
• Always has a fixed-size matrix (unlike one-hot encoding)

Disadvantages (a CountVectorizer sketch follows this list):
• Sparse matrix here too
• Does not preserve the order of words
• Semantic meaning is still not captured
• E.g.: "boy" and "good" are given equal importance
• Out-of-vocabulary words are an issue (a missing word may happen to be important)
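A minimal scikit-learn CountVectorizer sketch over the three preprocessed sentences (assumes scikit-learn is installed):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["good boy", "good girl", "boy girl good"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

    print(vectorizer.get_feature_names_out())  # -> ['boy' 'girl' 'good']
    print(X.toarray())
    # [[1 0 1]
    #  [0 1 1]
    #  [1 1 1]]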
Major disadvantage of this technique
• Two vectors can mean the exact opposite of one another, yet the distance between them is small because only one word differs. For example, "The food is good" and "The food is not good" differ by a single word but have opposite meanings.
TF-IDF
(Term Frequency – Inverse Document Frequency)
TF-IDF – (Enhanced BOW method)
• S1 = good boy
• S2 = good girl
• S3 = boy girl good
• TF = (# of repetitions of a word in the sentence) / (# of words in the sentence)
• IDF = log_e(# of sentences / # of sentences containing the word)
Term Frequency (TF)
       S1    S2    S3
good   1/2   1/2   1/3
boy    1/2   0/2   1/3
girl   0/2   1/2   1/3
TF-IDF – Term Frequency – Inverse Document Frequency (Enhanced BOW method)

Inverse Document Frequency (IDF)
good   log_e(3/3) = 0
boy    log_e(3/2)
girl   log_e(3/2)
TF-IDF = TF × IDF
       S1                 S2                 S3
good   0                  0                  0
boy    (1/2)·log_e(3/2)   0                  (1/3)·log_e(3/2)
girl   0                  (1/2)·log_e(3/2)   (1/3)·log_e(3/2)
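A from-scratch sketch that reproduces these TF-IDF values using the slide's formulas (note that scikit-learn's TfidfVectorizer uses a smoothed IDF by default, so its numbers would differ):

    import math

    corpus = ["good boy", "good girl", "boy girl good"]
    docs = [s.split() for s in corpus]

    def tf(word, doc):
        # (# of repetitions of the word in the sentence) / (# of words in the sentence)
        return doc.count(word) / len(doc)

    def idf(word):
        # log_e(# of sentences / # of sentences containing the word)
        containing = sum(1 for d in docs if word in d)
        return math.log(len(docs) / containing)

    for word in ["good", "boy", "girl"]:
        print(word, [round(tf(word, d) * idf(word), 3) for d in docs])
    # good [0.0, 0.0, 0.0]
    # boy  [0.203, 0.0, 0.135]
    # girl [0.0, 0.203, 0.135]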