Dsa Unit 3
• Content Analysis
One of the key aspects of social media engagement is the ability to track and measure the performance of your
content. This helps you understand what is working well and what could be improved.
Influencer identification
• Influencer identification is an important part of
influencer marketing.
• It involves finding the right influencers to work with
to promote your brand or product.
• It is a great way to reach a larger audience and
build relationships with influencers who can help
you reach your goals.
Attrition Analysis
• Customer attrition is the loss of customers by a
business.
• Most customers of a given business will not
remain active customers indefinitely.
• Whether a one-time purchaser or a loyal
customer over many years, every customer will
eventually cease his or her relationship with the
business.
Text Vectorization
What is Text vectorization?
• Vectorization is jargon for a classic approach of
converting input data from its raw format (i.e.
text) into vectors of real numbers, which is the
format that ML models support. This approach
has existed ever since computers were first
built, has worked wonderfully across various
domains, and is now used in NLP.
• In Machine Learning, vectorization is a step in
feature extraction. The idea is to get some distinct
features out of the text for the model to train on,
by converting text to numerical vectors.
• NLP (Natural language processing)
is a branch of artificial intelligence
that helps machines to understand,
interpret and manipulate human
language.
• Machine learning algorithms most
often take numeric feature vectors
as input. Thus, when working with
text documents, we need a way to
convert each document into a
numeric vector. This process is
known as text vectorization. In
much simpler words, the process of
converting words into numbers is
called Vectorization.
What is a Vector?
• A vector is a mathematical or geometrical
representation of a quantity.
• Consider a vector of geometrical point P [2, 3, 4].
• This vector basically represents the point P in 3-dimensional
space.
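A minimal sketch (using NumPy, an assumption since the slides name no library) of representing the point P as a vector:

import numpy as np

# The point P from the example, as a vector in 3-dimensional space
P = np.array([2, 3, 4])
print(P.shape)            # (3,)
print(np.linalg.norm(P))  # the vector's length (magnitude), about 5.39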
“Text2Vector” Conversion in Machine
Learning
• We generally see, particularly in Natural Language
Processing (NLP), that before feeding our raw text data
to any machine/deep learning algorithm, we convert the
text to vector form so that it can be processed. This is
called featurization.
• Featurization is the process of converting varied forms of
data into numerical data which can be used by basic ML
algorithms. The data can be text, images, videos,
graphs, various database tables, time series, categorical
features, etc.
• Feature Engineering : Feature engineering is simply the
process of changing numerical features in such a way that
machine learning models can work efficiently.
Bag of Words (BoW) model
The core idea behind the Bag of Words (BoW)
representation is that any given piece of text can be
represented by a list of all unique words post
stopwords removal. In the BoW approach order of
words does not matter. For example, the below
financial message can be represented in the form of
a bag as shown in Fig.
• So, by creating the “bags” we can represent each of
the messages in our training and test set.
• But the main question still remains: how can we build a
financial message sentiment analyzer from this "bags"
representation?
• Now, let's say the bags for most of the positive
sentiment messages contain words such
as significant, increased, rose, appreciated, improved,
etc., whereas negative sentiment bags contain words
such as decreased, declined, loss, liquidated, etc.
BoW Example
Customers like the value, connectivity, and
appearance of the television. They mention
that it's a branded budget TV, and easy to
connect with any network. Customers also
like the clarity, quality, and performance.
That said, opinions are mixed on sound
quality, and performance.
• So, whenever we get a new message, we have to look at its
“bag-of-words” representation.
• Does the bag for this message resemble the bags of
positive sentiment messages or not?
• Based on the answer to the above question, we can classify
the message into appropriate sentiment.
• Now, the next obvious question that comes to our mind is
how the machine will create the bags of words automatically
for sentiment classification.
• So, the answer to this question is that we have to represent
all the bags in a matrix format, after which we can apply
different machine learning algorithms like naive Bayes,
logistic regression, support vector machines, etc., to do the
final sentiment classification (a minimal sketch follows).
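A minimal sketch of that pipeline using scikit-learn (an assumption; the example messages and labels below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical financial messages with sentiment labels
messages = [
    "profit increased and margins improved significantly",
    "revenue rose as demand appreciated",
    "sales decreased and the unit was liquidated",
    "earnings declined again this quarter",
]
labels = ["positive", "positive", "negative", "negative"]

# Build the bag-of-words matrix and train a naive Bayes classifier on it
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(messages)
clf = MultinomialNB().fit(X, labels)

# A new message is classified via its own bag-of-words representation
new_message = ["the stock price declined sharply"]
print(clf.predict(vectorizer.transform(new_message)))  # expected: ['negative']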
Bag of Words Example
Now, let’s understand the BoW approach by the
following example. Let’s assume we have three
documents as given below:
Document 1: Machine learning uses historical data
to predict output values.
Document 2: It is seen as a part of artificial
intelligence.
Document 3: Machine learning programs can
perform tasks without being explicitly programmed
So, in the BoW approach, each document is represented on a
separate row and each word of the vocabulary (post
stopwords removal) has its own column. These vocabulary
words are also known as features of the text.
Word           D1  D2  D3
machine         1   0   1
learning        1   0   1
uses            1   0   0
historical      1   0   0
data            1   0   0
predict         1   0   0
output          1   0   0
values          1   0   0
seen            0   1   0
part            0   1   0
artificial      0   1   0
intelligence    0   1   0
programs        0   0   1
perform         0   0   1
tasks           0   0   1
without         0   0   1
explicitly      0   0   1
programmed      0   0   1
The document-term matrix shown in this table has size (3,18), where 3 denotes the
number of documents and 18 represents the number of unique words over all the
three documents excluding stopwords. These unique words, also known as the
vocabulary, are used as the features of the text.
• The above matrix was created by filling each cell
with the frequency of each word in the
documents Document 1, Document 2, and
Document 3 represented by D1, D2, and D3
respectively.
• The bag of words representation is also known
as the bag of words model, but it shouldn't be
confused with a machine learning model. A bag
of words model is just the matrix representation
of the frequency of words per document from a
given corpus of documents.
• It is important to note that the values inside the cells
can be filled in two ways:
1. We can fill the cell with the frequency of the word
(values >= 0), or
2. We can fill the cell with 0 if the word is not present
and 1 if it is present, which is also known as the
binary bag of words model.
• Out of the above two methods, the frequency
approach is more commonly used in practice and the
NLTK library in Python also fills the BoW model with
word frequencies instead of binary 0 or 1 values.
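A minimal sketch of building this document-term matrix for Documents 1-3 with scikit-learn's CountVectorizer (an assumption; the slides mention only NLTK, and stopword lists vary, so the column count may differ slightly from 18):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Machine learning uses historical data to predict output values.",
    "It is seen as a part of artificial intelligence.",
    "Machine learning programs can perform tasks without being explicitly programmed.",
]

# Frequency-filled bag of words (method 1): cells hold word counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the vocabulary (feature/column names)
print(X.toarray())                         # a matrix like the (3, 18) example above

# Binary bag of words (method 2): cells hold 0/1 presence flags
binary_vectorizer = CountVectorizer(stop_words="english", binary=True)
print(binary_vectorizer.fit_transform(docs).toarray())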
Stemming
Stemming, in the realm of Natural Language Processing (NLP),
is quite like getting to the root of a plant, but language-wise.
It's a technique used for trimming down a word to its root form,
known as the 'stem'. The purpose of this linguistic "pruning" is
to bring varying forms of a word down to their common base form.
PORTER STEMMING ALGORITHM – BASIC INTRO
https://www.youtube.com/watch?v=GQ1sXx8hH4k
https://www.youtube.com/watch?v=HHAilAC3cXw
Example Inputs
Let’s consider a few example inputs and check what will be their
stem outputs.
Example 1
In the first example, we input the word MULTIDIMENSIONAL to the
Porter Stemming algorithm. Let’s see what happens as the word goes
through steps 1 to 5.
• The suffix will not match any of the cases
found in steps 1, 2, and 3.
• Then it comes to step 4.
• The stem of the word has m > 1 (since m = 5)
and ends with “AL”.
• Hence in step 4, “AL” is deleted (replaced with
null).
• Calling step 5 will not change the stem
further.
• Finally, the output will be MULTIDIMENSION.
MULTIDIMENSIONAL → MULTIDIMENSION
Example 2
In the second example, we input the word
CHARACTERIZATION to the Porter Stemming
algorithm. Let’s see what happens as the word goes through
steps 1 to 5.
The suffix will not match any of the cases found in step 1.
So it will move to step 2.
The stem of the word has m > 0 (since m = 3) and ends with
“IZATION”.
Hence in step 2, “IZATION” will be replaced with “IZE”.
Then the new stem will be CHARACTERIZE.
Step 3 will not match any of the suffixes and hence will move to
step 4.
Now m > 1 (since m = 3) and the stem ends with “IZE”.
So in step 4, “IZE” will be deleted (replaced with null).
No change will happen to the stem in other steps.
Finally the output will be CHARACTER.
CHARACTERIZATION → CHARACTERIZE → CHARACTER
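A minimal sketch of reproducing these two examples with NLTK's PorterStemmer (assuming the NLTK package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Expected outputs per the worked examples above
print(stemmer.stem("multidimensional"))  # multidimension
print(stemmer.stem("characterization"))  # character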
Generating Unigram, Bigram, Trigram and
Ngrams
What is n-gram Model
In natural language processing, an n-gram is a contiguous sequence of n
items generated from a given sample of text, where the items can be
characters or words and n can be any number like 1, 2, 3, etc.
For example, let us consider a line – “Either my way or no way”; the
possible n-gram models that we can generate are listed in the sketch below.
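For this line, the unigrams are: Either, my, way, or, no, way; the bigrams are: Either my, my way, way or, or no, no way; and the trigrams are: Either my way, my way or, way or no, or no way. A minimal sketch of generating these n-grams in plain Python (the helper name generate_ngrams is just for illustration):

def generate_ngrams(text, n):
    # Split the text into words and slide a window of length n across them
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Either my way or no way"
print(generate_ngrams(sentence, 1))  # unigrams
print(generate_ngrams(sentence, 2))  # bigrams
print(generate_ngrams(sentence, 3))  # trigrams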
TF-IDF
• TF-IDF or Term Frequency–Inverse Document
Frequency, is a numerical statistic that’s intended to
reflect how important a word is to a
document. Although it’s another frequency-based
method, it’s not as naive as Bag of Words.
How does TF-IDF improve over Bag
of Words?
• In Bag of Words, we witnessed how vectorization was
just concerned with the frequency of vocabulary words
in a given document. As a result, articles, prepositions,
and conjunctions which don’t contribute a lot to the
meaning get as much importance as, say, adjectives.
• TF-IDF helps us to overcome this issue. Words that
get repeated too often don’t overpower less frequent
but important words.
EXAMPLE
The initial step is to make a vocabulary of unique words and calculate
TF for each document. TF will be more for words that frequently
appear in a document and less for rare words in a document.
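A commonly used definition of TF (one of several variants; the exact formula used in the worked example may differ) is:

$$ \mathrm{TF}(w, d) = \frac{\text{number of occurrences of } w \text{ in document } d}{\text{total number of terms in } d} $$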
Inverse Document Frequency (IDF)
• It is the measure of the importance of a word. Term frequency (TF)
does not consider the importance of words. Some words such as ‘of’,
‘and’, etc. can be most frequently present but are of little significance.
IDF provides weightage to each word based on its frequency in the
corpus D.
• IDF of a word (w) is defined as
In our example, since we have two documents in the corpus,
N=2.
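A commonly used form of this definition (one of several variants; some implementations add smoothing terms), where df(w) is the number of documents in the corpus D that contain w:

$$ \mathrm{IDF}(w) = \log\frac{N}{\mathrm{df}(w)} $$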
Term Frequency — Inverse Document Frequency
(TFIDF)
It is the product of TF and IDF.
•TFIDF gives more weightage to the word that is rare in the
corpus (all the documents).
•TFIDF provides more importance to the word that is more
frequent in the document.
After applying TFIDF, text in A and B documents can be represented as a
TFIDF vector of dimension equal to the vocabulary words. The value
corresponding to each word represents the importance of that word in a
particular document.
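A minimal sketch of producing such TF-IDF vectors with scikit-learn's TfidfVectorizer (an assumption; the slides name no library, and scikit-learn's weighting adds smoothing, so values may differ slightly from hand-computed TF x IDF). The two short documents below stand in for A and B:

from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical documents standing in for A and B
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary words = vector dimensions
print(tfidf.toarray())  # one TF-IDF vector per document; rarer words get higher weight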
Word2Vec
• Word2Vec, short for “word to vector,” is a technology
used to represent words and the relationships
between them as numerical vectors. This technology is
widely used in machine learning for embedding and
text analysis.
• Google introduced Word2Vec for their search engine
and patented the algorithm, along with several
following updates, in 2013. This collection of
interconnected algorithms was developed by Tomas
Mikolov.
Word2Vec:
Word2Vec is a neural network-based language model
that learns distributed vector representations of words
from a large corpus of text. It is a popular technique
used for natural language processing (NLP) tasks such
as text classification, machine translation, and sentiment
analysis. The Word2Vec model generates vector
representations for each word in the vocabulary based
on the context in which the word appears in the training
data. It does so by training a neural network to predict
the likelihood of a word given its surrounding words in a
context window.
What is a word embedding?
• If you ask someone which word is
more similar to “king” – “ruler” or
“worker” – most people would say
“ruler” makes more sense, right? But
how do we teach this intuition to a
computer? That’s where word
embeddings come in handy.
• A word embedding is a
representation of a word used
in text analysis.
• It usually takes the form of a
vector, which encodes the
word’s meaning in such a way
that words closer in the vector
space are expected to be similar
in meaning.
• Language modeling and feature
learning techniques are typically
used to obtain word
embeddings, where words or
phrases from the vocabulary are
mapped to vectors of real
numbers.
• The meaning of a term is determined by its context: the
words that come before and after it, which is called the
context window. Typically, this window extends four words
to the left and four words to the right of the target
term. To create vector representations of words, we look
at how often they appear together.
• Word embeddings are one of the most fascinating
concepts in machine learning. If you’ve ever used virtual
assistants like Siri, Google Assistant, or Alexa, or even a
smartphone keyboard with predictive text, you’ve already
interacted with a natural language processing model based
on embeddings.
What’s the difference between word representation, word
vectors, and word embeddings?
• Word meanings and relations between them can be established through
semantics analysis. For this, we need to convert unstructured text data into a
structured format suitable for comparison.
Word representations are visualizations that can be depicted as
independent units (e.g. dots) or by vectors that measure the similarity
between words in multidimensional space.
Word vectors are multidimensional numerical representations where
words with similar meanings are mapped to nearby vectors in space.
Word embedding is a technique for representing words with
low-dimensional vectors, which makes it easier to understand
similarity between them.
How is Word2Vec trained?
Word to vector is trained using a neural network that learns the relationships
between words in large databases of texts. To represent a particular word as a
vector in multidimensional space, the algorithm uses one of the two modes:
continuous bag of words (CBOW) or skip-gram.
Continuous bag-of-words
(CBOW)
The continuous bag-of-words
model predicts the central word
using the surrounding context
words, which comprises a few
words before and after the
current word.
Skip-gram
The skip-gram model architecture is
designed to achieve the opposite of the
CBOW model. Instead of predicting the
center word from the surrounding context
words, it aims to predict the surrounding
context words given the center word.
The choice between the two approaches
depends on the specific task at hand. The
skip-gram model performs well with
limited amounts of data and is particularly
effective at representing infrequent words.
In contrast, the CBOW model produces
better representations for more commonly
occurring words.
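A minimal sketch of training both modes with the gensim library (an assumption; the slides name no implementation). The sg parameter switches between CBOW (sg=0) and skip-gram (sg=1):

from gensim.models import Word2Vec

# Tiny toy corpus of tokenized sentences; a real corpus would be far larger
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "worker", "builds", "the", "castle"],
]

# CBOW: predict the center word from the surrounding context words
cbow_model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=0)

# Skip-gram: predict the surrounding context words from the center word
skipgram_model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1)

print(cbow_model.wv["king"].shape)             # (50,) -> the embedding vector for "king"
print(skipgram_model.wv.most_similar("king"))  # nearest words to "king" in vector space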
• Word2Vec is an algorithm that uses a
shallow neural network model to learn
the meaning of words from a
large corpus of texts.
• Unlike deep neural networks (DNNs),
which have multiple hidden layers,
shallow neural networks only have one
or two hidden layers between the input
and output.
• This makes the processing fast and
transparent.
• The shallow neural network of
Word2Vec can quickly recognize
semantic similarities and identify
synonymous words using logistic
regression methods, making it faster
than DNNs.
Avg Word2Vec:
=> Avg Word2Vec is an extension of the
Word2Vec model that generates vector
representations for sentences or documents
instead of individual words. It works by taking the
average of the vector representations of all the
words in a sentence or document to generate a
single vector representation for the entire text.
This approach can be useful in cases where we
want to classify or compare entire texts rather
than individual words.
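A minimal sketch of this averaging step (reusing the hypothetical gensim setup from the earlier sketch; the helper avg_word2vec is illustrative only):

import numpy as np
from gensim.models import Word2Vec

# Toy corpus and model (same hypothetical setup as the earlier Word2Vec sketch)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
model = Word2Vec(sentences, vector_size=50, window=4, min_count=1)

def avg_word2vec(tokens, model):
    # Average the Word2Vec vectors of the words present in the model's vocabulary
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vector = avg_word2vec(["the", "king", "rules", "the", "kingdom"], model)
print(doc_vector.shape)  # (50,) -> a single vector for the whole sentence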
TF-IDF weighted Word2Vec: