Ch6 - Text Vectorization - 1
• Word Frequency Operations
• Bag of Words
• TF-IDF
• Framework
Instructor: Pauline Maouad, PhD
Normalized Word Frequency
The problem here is that unless Corpus A and Corpus B are exactly the same size, comparing raw counts (as in the chart) is misleading: it does not accurately reflect the relative frequencies in each corpus.
Corpus A:
• We have 18 occurrences per 821,273 words. Scaling this to occurrences per 1,000,000 words gives the normalized frequency x:
• Normalized Freq = 18 / 821,273 × 1,000,000 = 21.917194404
Corpus B:
• We have 47 occurrences per 4,337,846 words. Scaling this to occurrences per 1,000,000 words gives the normalized frequency x:
• Normalized Freq = 47 / 4,337,846 × 1,000,000 = 10.834870579
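A minimal sketch of this per-million normalization in Python (the function name is illustrative):

def normalized_freq(count, corpus_size, per=1_000_000):
    # occurrences per `per` words = count / corpus_size * per
    return count / corpus_size * per

print(normalized_freq(18, 821_273))    # Corpus A -> ~21.92
print(normalized_freq(47, 4_337_846))  # Corpus B -> ~10.83

Once normalized, the two corpora can be compared directly even though their sizes differ.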
Text Vectorization with Word Embedding:
1 – Bag of Words
Motivation for text vectorization: Let’s say you have a review for a product, and the
text reviews provided by the customers are of different lengths.
How do we deal with text data when building a machine learning model?
By converting from text to numbers, we can represent a review by a vector of finite
length.
This way, the length of the vector will be equal for each review, irrespective of the
text length.
For example:
Review 1: The movie is very stressful and long.
Review 2: The movie is not stressful and slow.
Review 3: The movie is exciting. Watch it, you will love it too.
Bag of Words
Initial step: build a vocabulary of unique words, ignoring punctuation and case.
Vocabulary (columns): The, Movie, Is, Very, Stressful, and, Long, Not, Slow, Exciting, Watch, It, You, Will, Love, Too. The last column gives the length of each review in words.
Review 1: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 (length 7)
Review 2: 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 (length 7)
Review 3: 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 (length 11)
• Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]
• Vector of Review 2: [1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0]
• Vector of Review 3: [1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1]
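A minimal sketch of this bag-of-words step with scikit-learn's CountVectorizer (assumes scikit-learn is installed; binary=True reproduces the 0/1 presence table above, and the learned vocabulary is sorted alphabetically rather than in the order the words first appear):

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "The movie is very stressful and long.",
    "The movie is not stressful and slow",
    "The movie is exciting. Watch it, you will love it too.",
]

# lowercasing and the default tokenizer take care of case and punctuation
vectorizer = CountVectorizer(lowercase=True, binary=True)
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the 16-word vocabulary
print(X.toarray())                         # one 16-dimensional vector per review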
Text Vectorization with Word Embedding:
2 – TF-IDF
Implementation in Python
N: the total number of documents in the corpus (the document set)
Given a sentence, for example “This skyscraper is so huge”, it is easy for us to understand it because we know the semantics of the words and of the sentence. But how will a computer understand this sentence?
A computer can understand data only in numerical form. For this reason we vectorize all the text, so that the computer can understand it better.
Term Frequency-Inverse Document Frequency (Cont…)
By vectorizing the documents, we can further perform multiple tasks such as:
- finding relevant documents
- ranking
- clustering, and so on.
This is the same thing that happens when you perform a Google search. The web pages are called documents, and the text with which you search is called a query. When you search with a query, Google computes the relevance of the query to all the documents, ranks them in order of relevance, and shows you the top k documents. This process is done using the vectorized form of the query and the documents. Although Google's algorithms are highly sophisticated and optimized, this is their underlying structure.
As pointed out, the most relevant words are not necessarily the most frequent ones, since stopwords like “the”, “of” or “a” tend to occur very often in many documents.
Intuitively, a word's relevance is proportional to the amount of information it gives about its context. That is, the most relevant words are those that would help a human understand a whole document without reading all of it.
There is another caveat: if we want to summarize a document relative to a whole dataset about a specific topic, there will be words that occur many times in that document as well as in the other documents. These words are not useful for summarizing the document because they carry little discriminating power; they say very little about what the document contains compared to the other documents.
2 – Term Frequency
If doc1 contains far fewer words than doc2 (10,000 words), there is a high probability that a common word such as “was” appears more often in the 10,000-word document, simply because it is longer. This is why raw counts are normalized by document length.
2 – Term Frequency (Cont…)
Mathematical definition:
tf(t, d) = (number of times term t appears in document d) / (number of terms in d)
Number of words in Review 2 = 7
TF for the word ‘the’ = (number of times ‘the’ appears in Review 2) / (number of terms in Review 2) = 1/7
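A small sketch of this term-frequency computation (pure Python; the function name is illustrative):

def term_frequency(term, document_tokens):
    # fraction of the document's tokens that are `term`
    return document_tokens.count(term) / len(document_tokens)

review_2 = "the movie is not stressful and slow".split()
print(term_frequency("the", review_2))  # 1/7 ≈ 0.1429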
3 – Document Frequency (DF)
df(t) = the number of documents in which the term t occurs.
Our main goal is to measure the informativeness of a term, and DF is essentially the inverse of it: the more documents a term appears in, the less informative it is.
Inverse Document Frequency (IDF)
idf(t) = log(N / (df + 1))
Words with a higher score are more important, and those with a lower score are less important.
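A small sketch of this IDF formula (the N and df values below are illustrative):

import math

def idf(df, N):
    # smoothed inverse document frequency, as defined above
    return math.log(N / (df + 1))

print(idf(df=1, N=3))  # a rare term: log(3/2) ≈ 0.405 (more important)
print(idf(df=3, N=3))  # a term in every document: log(3/4) ≈ -0.288 (less important)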
Term Frequency-Inverse Document Frequency (TF-IDF)
tf-idf(t, d) = tf(t, d) × idf(t)
TF-IDF(‘the’, Review 2) = TF(‘the’, Review 2) × IDF(‘the’) = 1/7 × 0 = 0
(‘the’ occurs in every review, so its IDF, taken as log(N/df) = log(3/3), is 0 and the word carries no discriminating information.)
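A sketch reproducing this worked example over the three reviews; the slide's IDF(‘the’) = 0 corresponds to the unsmoothed form log(N/df), so both variants are shown for comparison:

import math

reviews = [
    "the movie is very stressful and long".split(),
    "the movie is not stressful and slow".split(),
    "the movie is exciting watch it you will love it too".split(),
]

term = "the"
N = len(reviews)
df = sum(term in doc for doc in reviews)       # 'the' appears in all 3 reviews -> df = 3
tf = reviews[1].count(term) / len(reviews[1])  # 1/7 in Review 2

idf_unsmoothed = math.log(N / df)              # log(3/3) = 0, as in the example
idf_smoothed = math.log(N / (df + 1))          # log(3/4) ≈ -0.288

print(tf * idf_unsmoothed)  # 0.0 -> 'the' carries no discriminating information
print(tf * idf_smoothed)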
Previous Example
Vocabulary = ‘The’, ‘Movie’, ‘Is’, ‘Very’, ‘Stressful’, ‘and’, ‘Long’, ‘Not’, ‘Slow’, ‘Exciting’, ‘Watch’, ‘It’, ‘You’, ‘Will’, ‘Love’, ‘Too’.
In Summary
Bag-of-Words vectors are easy to interpret. However, TF-IDF usually performs better in machine learning models.
Cosine Similarity
Why cosine similarity? It is used in applications such as finding sequences similar to a given DNA sequence in bioinformatics.
Cosine similarity measures the cosine of the angle θ between two vectors. This means that for two overlapping vectors, cos(θ) will be at its highest value.
Example
Doc1 = “Kind words do not cost much. Yet they accomplish much.” ― Blaise Pascal
Doc2 = “When one does not love too much, one does not love enough.” ― Blaise Pascal
Vocab = {‘kind’, ‘words’, ‘do’, ‘not’, ‘cost’, ‘much’, ‘yet’, ‘they’, ‘accomplish’, ‘when’, ‘one’, ‘does’, ‘love’, ‘too’, ‘enough’}
Step 1: Vectorize the text using Bag-of-Words.
Vocabulary = “kind, words, do, not, cost, much, yet, they, accomplish, when, one, does, love, too, enough”
Example: BoW
kind words do not cost much yet they accomplish when one does love too enough
Doc1 1 1 1 1 1 2 1 1 1 0 0 0 0 0 0
Doc2 0 0 0 2 0 1 0 0 0 1 2 2 2 1 1
BoW Vectors:
Doc1 = [1 1 1 1 1 2 1 1 1 0 0 0 0 0 0]
Doc2 = [0 0 0 2 0 1 0 0 0 1 2 2 2 1 1]
1. The numerator of the formula is the dot product of the two vectors and the denominator is the product of the L2-norms of both vectors: cos(θ) = (A · B) / (‖A‖ ‖B‖).
   1. The dot product of two vectors is the sum of the elementwise products of the vectors.
   2. The L2-norm is the square root of the sum of the squares of a vector's elements.
2. We can either use built-in functions from the NumPy library to calculate the dot product and L2-norms of the vectors and plug them into the formula, or directly use cosine_similarity from sklearn.metrics.pairwise. Considering two vectors A and B in 2-D, the following code calculates the cosine similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([1.0, 2.0])  # illustrative 2-D vectors
B = np.array([2.0, 3.0])
cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
print(f"Cosine Similarity between A and B: {cos_sim}")
print(f"Cosine Distance between A and B: {1 - cos_sim}")