
NLP and its applications

By Vivek Anand
Introduction
• What is NLP?
• Why is NLP important?
• Core concepts:
• Preprocessing
• Tokenization
• Parts of Speech tagging
• Vectorization
Main topics
• Text Preprocessing
• Text Vectorization
• Potential Use Cases:
  • Sentiment Analysis
  • Spam Detection
  • Topic Modelling
  • Neural Networks for Text Classification
  • Chatbots
  • Text Summarization
WHAT IS NLP??
Natural Language Processing

• NLP is a challenging field of AI
  • Unstructured data
  • Inconsistent
• NLP is a field of Computer Science, AI, and Linguistics concerning interactions between humans and computers
• NLP enables computers to understand, interpret, and generate human language
Why NLP?
• Two major sources of information: Vision (photos & videos) and Text
• Most of the qualitative inputs available online are in the form of text (blogs, transcripts, tweets, comments, medical reports, etc.)
• The principal method of human communication is language, often in the form of text
• Language used by humans is called natural language, in contrast to an artificial language or computer code
• NLP is essential in making computers understand how humans speak and communicate, so that computers can engage back in communication through speech and writing
NLP History
NLP Today
• Text Summarization
• Text Generation
• Language translation
• Attention-based models; LSTMs
• Large language models
• Speech recognition
Real Time use cases – SPAM/HAM
• Spam vs. Ham detection
• A filter applies specific rules to emails and messages so that spam is sent straight to the spam folder, while legitimate mail ("ham") reaches the inbox.
Real Time use cases – Sentiment Analysis
• Sentiment Analysis
• Opinion mining / sentiment analysis is an NLP technique to determine whether data is positive, neutral, or negative. It is used by businesses to see in which of the three categories customers perceive a product or service.
Real Time use cases – Text Summarization
• Use of ML algorithms to condense large bodies of text into their most important parts.
• Extractive (selects the most important existing sentences)
• Abstractive (generates new sentences that capture the meaning)
Real Time use cases – Text Generation
• For example, Input = “Deep learning has revolutionised AI”
• Generated text from trained ML Model = “Deep learning has
revolutionized AI and researchers are testing it for new roles such
as predictive intelligence and social networking. But the company
is also looking for ways to improve its AI”
Real Time use cases – Topic model
• Topic modeling is an unsupervised machine learning technique to
identify hidden topics in a collection of documents.
• Automatically discover abstract "topics" that best describe the
content of a text corpus
Real Time use cases – Language Translation
• Language translation, or machine translation, enables computers to understand one language and translate it into other languages
NLP Terminology
• Corpus: the text under study, e.g., entire articles or many Twitter feeds.
• Document: a single unit of the corpus, such as an entire sentence in a paragraph.
• Vocabulary: the collection of words considered; not necessarily all words, since a subset may be selected depending on the experiment. Pre-trained models have set vocabularies.
• Words: the individual tokens that make up documents.
Summary of learnings
• Cleaning: special character removal, numbers, alphanumerics, lowercasing, stop words, white spaces, stemming, lemmatization, spellings
• Transformation (Text → Quantitative): bag of words, index mapping, TF-IDF, Word2vec (shallow NN), GloVe (matrix factorization)
• Modelling: logistic regression, Naïve Bayes, decision trees, KNN, SVM, NLP classifiers
Text Preprocessing - Tokenization
• Tokenization is the process of breaking a piece of text into smaller
units, called tokens, which can be words, phrases, or even
characters. These tokens are the building blocks for further
natural language processing tasks.
• Paragraph → Sentence tokenization
• Sentence → Word-based tokenization
• Words → Character-based tokenization
• Subword tokenization → used by models like BERT, e.g., "Unbelievable" → ["un", "believe", "able"]
Tokenization
Tokenization is the breaking down of text into smaller fragments or components, also known as TOKENS.

Each word in the tokenized list is called a token, and collectively the tokens are called a "bag of words".
A simple version of this can be performed in Python using s.split(), as sketched below.
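A minimal sketch of whitespace tokenization with Python's built-in str.split(); the sample sentence is just an illustration:

s = "He likes the course"
tokens = s.split()  # splits on whitespace by default
print(tokens)       # ['He', 'likes', 'the', 'course']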
How does it work?
• Tokenization is the first step in extracting information from text. The tokens are then converted into numbers, and those numbers (in various formats) are fed into machine learning models.
Challenges
1. Punctuation
2. Case Sensitivity
3. Accented Characters
Punctuation
• "course." and "course?" are two different tokens, which causes redundancy. It also requires additional data to train the model to learn the difference.
• Dimensionality becomes large.
Accented characters
• Words with different accents can be mistaken for separate tokens despite meaning the same thing
• For example, "résumé" vs. "resume"
• They can be converted into plain ASCII characters to standardize them, as sketched below
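A minimal sketch of accent removal using Python's standard unicodedata module (the example word is an assumption):

import unicodedata

def remove_accents(text):
    # Decompose accented characters into base letter + combining mark,
    # then drop anything that cannot be encoded as ASCII
    nfkd = unicodedata.normalize("NFKD", text)
    return nfkd.encode("ascii", "ignore").decode("ascii")

print(remove_accents("résumé"))  # resume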
Is punctuation important?
• Example 1: “He likes the course.”
• Example 2: “He likes the course?”
• The two sentences have different meanings

• Higher-order algorithms (deep learning based: BERT, GPT) can make use of punctuation as a meaningful signal


Word based tokenization
• Each word is made into a separate token
• Commonly used scheme
• Provides rich information as every word has meaning
• Disadvantage: there are way too many words, thus requiring a large vocabulary. This also means a large feature matrix, increasing dimensionality
• For e.g.: String = "I like this item"
  when tokenized = ['I', 'like', 'this', 'item']
Character-Based Tokenization
• Every character becomes a token
• Requires a much smaller vocabulary (26 letters, plus digits and punctuation)
• Individual characters do not carry much information
• Many words start with the same character, hence model learning could be complex
• For e.g.: String = "I like this item"
  when tokenized = ['I', 'l', 'i', 'k', 'e', 't', 'h', 'i', 's', 'i', 't', 'e', 'm']
Subword-based tokenization
• This sits in between the word-based and character-based methods
• The word "Learning" could be tokenized into "Learn" + "ing"
• Advantage: this technique makes model learning simpler
• Without it, model complexity increases; for example, the model has to learn the similarities between the words "learn", "learning", "learned", "learns", which could be avoided
• Used in deep learning; see the sketch below
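A short sketch of subword tokenization using a pre-trained BERT WordPiece tokenizer from the Hugging Face transformers library (an assumption on top of these slides, which only mention BERT; the exact splits depend on the learned vocabulary):

from transformers import BertTokenizer

# Load the WordPiece vocabulary learned for the bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Continuation pieces are prefixed with "##"
print(tokenizer.tokenize("unbelievable"))
# Produces subword pieces, e.g., something like ['un', '##belie', '##vable'],
# depending on the vocabulary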
Tokenization using NLTK (Natural Language Toolkit)
• NLTK is open source
• Integrates well with other libraries (scikit-learn, TensorFlow)
• Works well for models being built from scratch
• Capabilities:
  • Tokenization
  • Text preprocessing
  • Contains inbuilt models
  • Part-of-Speech (POS) tagging
  • Parsing
  • Named Entity Recognition (NER)
  • Text classification
  • Sentiment analysis
  • Language translation
HANDS ON!!!!!!!
• Hands-on approach for Tokenization
  • Word based
  • Character based
  • Subword based

• Hands-on approach using NLTK (Python), sketched below
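A minimal sketch of what this hands-on could look like with NLTK (assumes nltk is installed; the sample text is an illustration):

import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "He likes the course. Does she like it too?"
print(sent_tokenize(text))  # ['He likes the course.', 'Does she like it too?']
print(word_tokenize(text))  # ['He', 'likes', 'the', 'course', '.', 'Does', ...]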


TEXT PROCESSING
1. Text Processing - Stemming
• Stemming means getting the stem of a word. E.g., "learn" is the stem word for "learning", "learned", "learns"
• Stemming is a rule-based technique. The result may not be a real word
• A crude method: it chops letters off the end of a word
1. Text Processing - Stemming
• Rule-based process that removes suffixes to reduce a word to its root form
• No guarantee stemmed words are valid words
• Simplistic approach
• Fast but less accurate
• No context consideration

Original Text → Stemmed Text
Running → Run
Happier → Happi
Excited → Excite
Studies → Studi
Stemming Algorithms
• Porter stemmer
• Developed By: Martin Porter (1980)
• Relatively conservative, meaning it doesn’t over-stem aggressively.
• Works through a series of rule-based steps (e.g., removing suffixes like "ing", "ed"); designed essentially for English
• Snowball stemmer
• Developed By: Martin Porter (improvement over Porter Stemmer).
• Supports multiple languages (not just English), making it more versatile.
• Follows a refined set of rules and handles edge cases better.
• Lancaster stemmer
• Developed By: Paice and Husk, Lancaster University (1990).
• Most aggressive of the three stemmers.
• Reduces words more heavily, often leading to over-stemming (e.g., "maximum" → "max").
2. Text Processing - Lemmatization
• Lemmatization is the process of reducing a word to its base or
root form (called the lemma) while considering its context and
grammatical role. Unlike stemming, lemmatization ensures that
the root form of the word is a valid word in the language.
• Context awareness: considers the part of speech
• Linguistic rules: uses WordNet analysis
• Semantic validity

• Process: Input word → Part-of-speech tagging → Database lookup → Output lemma
Lemmatization
• Looks beyond chopping words and considers the language vocabulary
• More robust; not purely rule based
• End words are always real words
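A minimal NLTK sketch of lemmatization (assumes the WordNet data has been downloaded); note the lemmatizer treats words as nouns by default, so the part-of-speech argument matters:

import nltk
nltk.download("wordnet")  # one-time download of the WordNet database
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run (as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # good (as an adjective)
print(lemmatizer.lemmatize("studies", pos="n"))  # study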
3. Text Processing - Case sensitivity
• How does case sensitivity affect outcomes?
• For example, the words "apple" and "Apple" have the same meaning
• Models shouldn't mistake them for two separate tokens
• This can be overcome by using the string method s.lower(), as sketched below
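A tiny sketch of case folding before tokenization:

tokens = ["Apple", "apple", "APPLE"]
print({t.lower() for t in tokens})  # {'apple'}: a single token after case folding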
Hands on using NLTK – Stemming & Lemmatization
• Stemmers
  • Porter
  • Snowball

• Lemmatization using NLTK
When to use Stemming vs Lemmatization
• Definition: stemming reduces words to their root form using rules (heuristics); lemmatization reduces words to their base form (lemma) based on context.
• Output: stemming may produce non-dictionary words (e.g., "happier" → "happi"); lemmatization produces valid dictionary words (e.g., "running" → "run").
• Speed: stemming is faster and simpler; lemmatization is slower due to its reliance on linguistic analysis.
• Accuracy: stemming is lower and may produce incorrect stems; lemmatization is higher and produces linguistically meaningful results.
• Resource requirement: stemming is minimal, with no external dictionaries needed; lemmatization requires lexical resources like WordNet, which contains relationships between words and their lemmas.
4. Text Processing - Stop Words
• Common words such as "and", "the", "at", "an", "because", etc. occur so frequently that they add no meaning or variance and can safely be removed.
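A minimal NLTK sketch of stop-word removal (the sentence is an illustration):

import nltk
nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stop-word lists
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("He is a good boy and she is a good girl")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['good', 'boy', 'good', 'girl']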
5a. Text Processing – Parts of speech (POS)
Parts of speech are grammatical categories that define the roles and
relationships of words in a sentence. Common parts of speech include:

1. Nouns: Names of people, places, things (e.g., dog, city).
2. Verbs: Actions or states (e.g., run, is).
3. Adjectives: Descriptors for nouns (e.g., beautiful, fast).
4. Adverbs: Describe verbs, adjectives, or other adverbs (e.g., quickly, very).
5. Pronouns: Replace nouns (e.g., he, they).
6. Prepositions: Indicate relationships between nouns (e.g., on, under).
7. Conjunctions: Link words, phrases, or clauses (e.g., and, but).
8. Interjections: Express emotions (e.g., wow!, ouch!).
5a. Text Processing – Parts of speech (POS)
• Assigns grammatical roles to each word in text
• Helps in understanding sentence structure and formation, as it operates at a word-wise level only
• Improves NLP translation
• Aids in lemmatization
  • E.g.: "He is running": here "running" is a verb
  • E.g.: "The running bulls are thrilling": here "running" modifies the noun
• The focus is on syntax and grammatical roles


Real use case for POS tagging
• Consider building a chatbot. If a user says, "I want to book a flight
to Paris," POS tagging can:
• Identify "book" as a verb (action to be performed).
• Recognize "flight" as a noun (object of action).
• Mark "Paris" as a proper noun (destination).
• This information can then guide the chatbot's intent recognition and
response generation.
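A minimal NLTK sketch of POS tagging on that sentence (tag names follow the Penn Treebank convention):

import nltk
nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger model
from nltk import pos_tag, word_tokenize

tokens = word_tokenize("I want to book a flight to Paris")
print(pos_tag(tokens))
# e.g., [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('book', 'VB'),
#        ('a', 'DT'), ('flight', 'NN'), ('to', 'TO'), ('Paris', 'NNP')]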
5b. Text Processing – Named entity recognition (NER)
• Identifies and categorizes entities in text
• Operates at a word level, phrase level, or any level
• Relies on context to detect information
• While POS focuses on grammatical roles, NER focuses on semantics, extracting real-world information entities
Example to show the difference between POS & NER

Sentence: "Barack Obama was born in Hawaii in 1961"

Word → POS Tag
Barack → NNP (Proper Noun)
Obama → NNP (Proper Noun)
was → VBD (Verb, past tense)
born → VBN (Verb, past participle)
in → IN (Preposition)
Hawaii → NNP (Proper Noun)
in → IN (Preposition)
1961 → CD (Cardinal Digit)

• Focus: syntax and grammatical structure of the sentence.
• Purpose: to identify the syntactic roles of words, which is useful for parsing and understanding sentence structure.
Example to explain NER

Entity → NER Tag
Barack Obama → PERSON
Hawaii → LOCATION
1961 → DATE
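A minimal NLTK sketch of NER on the same sentence (NLTK's chunker has its own label set, e.g., GPE for geographic locations rather than LOCATION):

import nltk
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg)  # one-time downloads for the NER pipeline

tokens = nltk.word_tokenize("Barack Obama was born in Hawaii in 1961")
tagged = nltk.pos_tag(tokens)  # POS tags feed the named-entity chunker
tree = nltk.ne_chunk(tagged)   # groups tokens into named-entity subtrees
print(tree)
# Entities appear as subtrees, e.g., (PERSON Barack/NNP Obama/NNP)
# and (GPE Hawaii/NNP)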
Usage of POS Vs NER
• POS Applications
  • Grammar analysis
  • Machine translation
  • Text-to-speech conversion

• NER Applications
  • Information extraction
  • Knowledge graph construction
  • Chatbot development
  • Text-to-speech conversion
TEXT VECTORIZATION
Convert text to vectors
Text Vectorization
• Introduction
• One Hot Encoding
• Word to index mapping
• Python hands on
• Bag of Words (BOW)
• Count vectorizer
• Hands on
• ML application using Count vectorizer
• TF-IDF
• Hands on
• Word2vec
• Shallow Neural network approach
• GloVe (Global Vectors for word representation)
ONE HOT ENCODING
Text Vectorization – One Hot encoding
• E.g.: "The food is good. The food is bad. Tiramisu is Amazing."
• Vocabulary = {The, food, is, good, bad, Tiramisu, Amazing}
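A minimal pure-Python sketch of one-hot encoding over this vocabulary:

sentences = ["The food is good", "The food is bad", "Tiramisu is Amazing"]
# Build the vocabulary from all unique words
vocab = sorted({w for s in sentences for w in s.split()})

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(vocab)
print(one_hot("food"))  # 1 only at the position of "food"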
Advantages and disadvantages of One Hot

Advantages:
• Easy to implement

Disadvantages:
• Sparse matrix → overfitting
• Will not have a fixed matrix size across different sentences, which ML models need
• No semantic meaning is captured
WORD TO INDEX MAPPING
Word to index mapping
• Start by performing tokenization
• Assign an index to each unique word (token) in a dictionary
• Create a vector using these index values, as sketched below
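A minimal pure-Python sketch of word-to-index mapping (the sentence is an illustration):

sentence = "good boy and good girl"
tokens = sentence.split()

# Assign each unique token the next available index
word_to_index = {}
for t in tokens:
    if t not in word_to_index:
        word_to_index[t] = len(word_to_index)

print(word_to_index)                       # {'good': 0, 'boy': 1, 'and': 2, 'girl': 3}
print([word_to_index[t] for t in tokens])  # [0, 1, 2, 0, 3]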


Advantages & Disadvantages of Word to index mapping
• Advantages:
• Simplicity
• Maintains unique identifiers
• Compact
• Handles rare words
• Allows fixed size matrix

• Disadvantages:
• Ignores semantics
• Vocabulary dependent (any new word requires re-indexing)
• Fails to capture context
• Not scalable to large vocabularies
BOW – COUNT VECTORIZER
Text Vectorization - Bag of Words – Count Vectorizer

Text                        Operation              Result
He is a good boy            Lowercase conversion   S1: good boy
She is a good girl          Remove stop words      S2: good girl
Boy and girl are good boy                          S3: boy girl good boy

Count matrix:
      good   boy   girl
S1    1      1     0
S2    1      0     1
S3    1      2     1

• Comes with a binary as well as a non-binary (count) option; see the scikit-learn sketch below
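A minimal scikit-learn sketch reproducing this example (note that CountVectorizer orders its feature columns alphabetically):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["He is a good boy", "She is a good girl", "Boy and girl are good boy"]
vectorizer = CountVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # ['boy' 'girl' 'good']
print(X.toarray())
# [[1 0 1]
#  [0 1 1]
#  [2 1 1]]
# Pass binary=True to CountVectorizer for the binary option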


Pros and Cons of Bag of Words – Count Vectorizer

Advantages:
• Simplicity
• Always has a fixed-size matrix (unlike one-hot encoding)

Disadvantages:
• Sparse matrix here too
• Does not preserve the order of words
• Semantic meaning is still not captured
  • E.g.: "boy" and "good" are given equal importance
• Out of vocabulary (OOV) is an issue (in case a word is missing and happens to be important)
Major disadvantage of this technique
• Consider an example with 2 vectors that mean the exact opposite of one another (e.g., "The food is good" vs. "The food is bad"). The distance between these 2 vectors isn't large, as just one word differs between them.
TF-IDF
(Term Frequency – Inverse document frequency)
TF IDF – (Enhanced BOW method)
• S1 = good boy
• S2 = good girl
• S3 = boy girl good

• TF = (# of repetitions of a word in the sentence) / (# of words in the sentence)
• IDF = log_e(# of sentences / # of sentences containing the word)

Term Frequency (TF):
       S1     S2     S3
good   1/2    1/2    1/3
boy    1/2    0/2    1/3
girl   0/2    1/2    1/3
TF IDF – Term Frequency – Inverse Document Frequency (Enhanced BOW method)

Inverse Document Frequency (IDF):
good   log_e(3/3) = 0
boy    log_e(3/2)
girl   log_e(3/2)
TF IDF
TF-IDF = TF × IDF

       S1                 S2                 S3
good   1/2 × 0            1/2 × 0            1/3 × 0
boy    1/2 × log_e(3/2)   0 × log_e(3/2)     1/3 × log_e(3/2)
girl   0 × log_e(3/2)     1/2 × log_e(3/2)   1/3 × log_e(3/2)
Pros and cons of TF IDF

Final TF-IDF matrix:
       S1                 S2                 S3
good   0                  0                  0
boy    1/2 × log_e(3/2)   0                  1/3 × log_e(3/2)
girl   0                  1/2 × log_e(3/2)   1/3 × log_e(3/2)

Advantages:
• Intuitive
• Fixed matrix size
• Word importance is captured (if a word is present in every sentence, it is scored zero)

Disadvantages:
• Sparse matrix
• Out of vocabulary (OOV): any word that is not in the training data but appears in the test data gets ignored
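A small pure-Python sketch computing TF-IDF exactly as defined on these slides (scikit-learn's TfidfVectorizer uses a smoothed IDF formula, so its numbers would differ):

import math

sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]

def idf(word):
    # log_e(# of sentences / # of sentences containing the word)
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

for word in vocab:
    tfidf = [(s.count(word) / len(s)) * idf(word) for s in sentences]
    print(word, [round(x, 3) for x in tfidf])
# good [0.0, 0.0, 0.0]
# boy  [0.203, 0.0, 0.135]
# girl [0.0, 0.203, 0.135]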
Word2vec
• Word2Vec turns words into dense vectors (a series of numbers) so that computers can understand and process text.
• It uses a shallow neural network trained on a large text dataset.
• Captures meaning ("King" and "Queen" will be close together, having similar meanings).
• Semantic relationships are maintained.
• Analogy: "King – Man + Woman = Queen"
• Similarity between words: King ≈ Monarch

Think of Word2Vec as creating a "map" where every word is a point. Words with similar meanings or usage patterns are placed closer together on this map.
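A minimal sketch using the gensim library (an assumption, since these slides do not name a library; the toy corpus is far too small for meaningful vectors, and real training needs a large dataset):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
corpus = [
    ["king", "queen", "royal", "palace"],
    ["man", "woman", "boy", "girl"],
    ["king", "man", "queen", "woman"],
]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["king"][:5])                  # first 5 dimensions of the dense vector
print(model.wv.similarity("king", "queen"))  # cosine similarity of two words
# With a large corpus, analogies can be probed via:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])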
