
Text Vectorization with Word Embedding

Instructor: Pauline Maouad, PhD 1


• Language Modeling and N-Grams
• Word Frequency
• Normalized Word Frequency
• Word Operations
• Bag of Words
• TF-IDF

2
Framework

Instructor: Pauline Maouad, PhD 3


Real-life Scenario: Movie Reviews

1. We all love watching movies. We generally tend to look at the reviews of a movie before we commit to watching it.
2. Here's a sample of reviews about a particular action movie:
   1. Review 1: The movie is very stressful and long.
   2. Review 2: The movie is not stressful and slow.
   3. Review 3: The movie is exciting. Watch it, you will love it too.

Instructor: Pauline Maouad, PhD 4


Real-life Scenario: Movie Reviews (Cont…)

1. These are contrasting reviews about the movie, as well as about its length and pace.
2. Imagine looking at a thousand reviews like these. Clearly, there are a lot of interesting insights we can draw from them and build upon to gauge how well the movie performed.
3. However, we cannot simply give these sentences to a machine learning model and ask it to tell us whether a review was positive or negative. We need to perform certain text preprocessing steps first.
4. Bag-of-Words and TF-IDF are two examples of how to do this.

Instructor: Pauline Maouad, PhD 5


Creating Vectors from Text

1. What are some techniques that we could use to vectorize a sentence?
   1. We should be able to retain most of the linguistic information present in the sentence.
2. Word Embedding is a technique to represent a text using vectors.
3. The more popular forms of word embeddings are:
   1. BoW, which stands for Bag of Words
   2. TF-IDF, which stands for Term Frequency-Inverse Document Frequency
4. Now, let us see how we can represent the above movie reviews as embeddings and get them ready for a machine learning model.
5. However, before that, let's take a look at how we can compute Word Frequency.

Instructor: Pauline Maouad, PhD 6


I - Word Frequency: FreqDist from NLTK

from nltk import FreqDist
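For instance, a minimal sketch of counting word frequencies with FreqDist (the sample sentence and variable names are illustrative only):

# Minimal sketch: word frequencies with NLTK's FreqDist
from nltk import FreqDist

text = "The movie is exciting. Watch it, you will love it too."   # sample review
tokens = [w.strip(".,!?").lower() for w in text.split()]           # naive tokenization

freq = FreqDist(tokens)
print(freq.most_common(3))   # e.g. [('it', 2), ('the', 1), ('movie', 1)]
print(freq['it'])            # frequency of a single word -> 2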

Instructor: Pauline Maouad, PhD 7


III - Normalized Word Frequency

• One of the things we often do in corpus linguistics is to compare one corpus (or one part of a corpus) with another.
• For example: we're interested in the frequency of the word moonlighting in two corpora, Corpus A and Corpus B.
• We find 18 occurrences in Corpus A and 47 occurrences in Corpus B. So, we chart the raw counts.

8
Instructor: Pauline Maouad, PhD
Normalized Word Frequency

The problem here is that unless Corpus A and Corpus B are exactly the same size, this chart is misleading: it doesn't accurately reflect the relative frequencies in each corpus.

In order to accurately compare corpora (or sub-corpora) of different sizes, we need to normalize their frequencies.

Let's say Corpus A contains 821,273 words and Corpus B contains 4,337,846 words.
9
Instructor: Pauline Maouad, PhD
Normalized Word Frequency

1. Corpus A = 18 per 821,273 words
2. Corpus B = 47 per 4,337,846 words

Ø To normalize, we want to calculate the frequencies for each per the same number of words.
Ø The convention is to calculate per 1,000 (or 10,000) words for small corpora and per 1,000,000 for larger ones.

10
Instructor: Pauline Maouad, PhD
Normalized Word Frequency

1. Corpus A = 18 per 821,273 words
2. Corpus B = 47 per 4,337,846 words

Ø Corpus A:
Ø We have 18 occurrences per 821,273 words, which is the same as x (our normalized frequency) per 1,000,000 words:
Ø x = 18 / 821,273 × 1,000,000
Ø Normalized Freq ≈ 21.917194404

11
Instructor: Pauline Maouad, PhD
Normalized Word Frequency

1. Corpus A = 18 per 821,273 words
2. Corpus B = 47 per 4,337,846 words

Ø Corpus B:
Ø We have 47 occurrences per 4,337,846 words, which is the same as x (our normalized frequency) per 1,000,000 words:
Ø x = 47 / 4,337,846 × 1,000,000
Ø Normalized Freq ≈ 10.834870579

12
Instructor: Pauline Maouad, PhD
Normalized Word Frequency

• The raw frequencies seemed to suggest that moonlighting appeared more than 2.5 times as often in Corpus B.
• The normalized frequencies, however, show that moonlighting is actually twice as frequent in Corpus A.
• Bear in mind as well that if you are comparing your results to those in some other corpus, you should use the same normalizing factor, i.e., normalize to the same number (e.g., one million words).
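The same normalization is easy to compute in Python; here is a minimal sketch using the counts and corpus sizes from the example above (the function name is illustrative):

# Minimal sketch: normalize raw counts to a per-million-words rate
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Return the frequency per `base` words."""
    return raw_count / corpus_size * base

print(normalized_frequency(18, 821_273))    # Corpus A -> ~21.92
print(normalized_frequency(47, 4_337_846))  # Corpus B -> ~10.83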

13
Instructor: Pauline Maouad, PhD
Text Vectorization with Word Embedding:
1 – Bag of Words
Motivation for text vectorization: Let’s say you have a review for a product, and the
text reviews provided by the customers are of different lengths.

How to deal with text data when building a machine learning model?
By converting from text to numbers, we can represent a review by a vector of finite
length.

This way, the length of the vector will be equal for each review, irrespective of the
text length.

Instructor: Pauline Maouad, PhD 14


Text Vectorization: 1 - Bag of Words

Bag of Words (BoW) is a technique that converts text to numbers.

With BoW, each column of a vector represents a word.

The values in each cell of a row show the number of occurrences of a word in a sentence.
Instructor: Pauline Maouad, PhD 15
Bag of Words: Example

• Motivation for text vectorization: Let's say you have a review for a product, and the text reviews provided by the customers are of different lengths.
• By converting from text to numbers, we can represent a review by a vector of finite length.
• This way, the length of the vector will be equal for each review, irrespective of the text length.

Movie Reviews
Review 1: The movie is very stressful and long.
Review 2: The movie is not stressful and slow.
Review 3: The movie is exciting. Watch it, you will love it too.

16
Instructor: Pauline Maouad, PhD
Bag of Words

Initial step: find a vocabulary of unique words, ignoring punctuation and case.

In our vocabulary, we have 16 unique words:

Vocabulary = ‘The’, ‘Movie’, ‘Is’, ‘Very’, ‘Stressful’, ‘and’, ‘Long’, ‘Not’, ‘Slow’, ‘Exciting’, ‘Watch’, ‘It’, ‘You’, ‘Will’, ‘Love’, ‘Too’.

Each movie review will be represented by a vector of 16 dimensions.

Instructor: Pauline Maouad, PhD 17


Bag of Words

Ø For the first review:

Term:      The  Movie  Is  Very  Stressful  and  Long  Not  Slow  Exciting  Watch  It  You  Will  Love  Too
Review 1:  1    1      1   1     1          1    1     0    0     0         0      0   0    0     0     0

Instructor: Pauline Maouad, PhD 18


Bag of Words

Ø For all reviews:

Term:      The  Movie  Is  Very  Stressful  and  Long  Not  Slow  Exciting  Watch  It  You  Will  Love  Too   Length of the Review in Words
Review 1:  1    1      1   1     1          1    1     0    0     0         0      0   0    0     0     0     7
Review 2:  1    1      1   0     1          1    0     1    1     0         0      0   0    0     0     0     7
Review 3:  1    1      1   0     0          0    0     0    0     1         1      2   1    1     1     1     11

Instructor: Pauline Maouad, PhD 19


Feature Vectors

• Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]

• Vector of Review 2: [1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0]

• Vector of Review 3: [1 1 1 0 0 0 0 0 0 1 1 2 1 1 1 1]
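For reference, a minimal sketch of producing the same kind of count vectors with scikit-learn's CountVectorizer (assuming scikit-learn is available; note that it lowercases the text and orders the vocabulary alphabetically, so the columns will not match the slide order exactly):

# Minimal Bag-of-Words sketch with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "The movie is very stressful and long.",
    "The movie is not stressful and slow.",
    "The movie is exciting. Watch it, you will love it too.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)       # sparse 3 x 16 document-term matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary (alphabetical)
print(X.toarray())                          # one count vector per review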

Instructor: Pauline Maouad, PhD 20


Drawbacks of using a Bag-of-Words (BoW) Model

Ø In the above example, we have vectors of length 16. However, we start facing issues when we come across new sentences:
1. If the new sentences contain new words, then our vocabulary size would increase, and hence the length of the vectors would increase too.
2. Additionally, the vectors would contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).
3. We retain no information about the grammar of the sentences or the ordering of the words in the text.

Instructor: Pauline Maouad, PhD 21


2 - Term Frequency – Inverse Document Frequency

22
Instructor: Pauline Maouad, PhD
Text Vectorization with Word Embedding: 2 – TF-IDF

Term frequency–inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection of documents, or corpus.

Term frequency is a measure of how frequently a term, t, appears in a document, d:

tf(t, d) = count of t in d / number of words in d

Instructor: Pauline Maouad, PhD 23


Terminology

Concepts:
• Term Frequency (TF)
• Document Frequency (DF)
• Inverse Document Frequency (IDF)
• Implementation in Python

Instructor: Pauline Maouad, PhD 24


1 – Terminology

t: term (word)

d: document (a set of words)

N: the total number of documents

Corpus: the total document set

Instructor: Pauline Maouad, PhD 25


Term Frequency-Inverse Document Frequency

TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a technique to quantify a word in documents: we compute a weight for each word which indicates the importance of the word in the document and in the corpus. This method is a widely used technique in Information Retrieval and Text Mining.

Given a sentence, for example "This skyscraper is so huge", it's easy for us to understand the sentence because we know the semantics of the words and the sentence. But how will the computer understand this sentence?

The computer can understand data only in numerical form. So, for this reason, we vectorize all the text so that the computer can understand it better.

26
Term Frequency-Inverse Document Frequency (Cont…)

By vectorizing the documents, we can further perform multiple tasks, such as:
- Finding the relevant documents
- Ranking
- Clustering, and so on.

This is the same thing that happens when you perform a Google search.

The web pages are called documents, and the search text with which you search is called a query.

Instructor: Pauline Maouad, PhD 27


Term Frequency-Inverse Document Frequency (Cont…)

Google maintains a fixed representation for all the documents.

When you search with a query, Google finds the relevance of the query to all the documents, ranks them in order of relevance, and shows you the top k documents. This process is done using the vectorized form of the query and the documents.

Although Google's algorithms are highly sophisticated and optimized, this is their underlying structure.

Instructor: Pauline Maouad, PhD 28


Term Frequency-Inverse Document Frequency (Cont…)

Intuition: if a word occurs multiple times in a document, we boost its relevance, since it is more meaningful than other words that appear fewer times (TF). However, if a word occurs many times in a document but also across many other documents, it may simply be a frequent word, not a relevant or meaningful one (IDF).

As pointed out, relevant words are not necessarily the most frequent words, since stopwords like "the", "of" or "a" tend to occur very often in many documents.

Intuitively, a word's relevance is proportional to the amount of information that it gives about its context. That is, the most relevant words are those that would help humans better understand a whole document without reading it all.

There is another caveat: if we want to summarize a document relative to a whole dataset about a specific topic, there will be words that occur many times in that document as well as in other documents. These words are not useful for summarizing the document because they convey little discriminating power; they say very little about what the document contains compared to the other documents.

Instructor: Pauline Maouad, PhD 29


2 – Term Frequency

The number of times a term occurs in a document is called its term frequency. This depends strongly on the length of the document and the generality of the term.

For example, a very common word such as "was" can appear multiple times in a document. Given two documents, doc1 with 100 words and doc2 with 10,000 words, there is a high probability that "was" is present more often in the 10,000-word document.

But we cannot say that the longer document is more important than the shorter document. For this reason, we perform a normalization on the frequency value: we divide the frequency by the total number of words in the document.
Instructor: Pauline Maouad, PhD 30
2 – Term Frequency

Recall that when vectorizing documents, we cannot just consider the words that are present in that particular document. If we do that, then the vector length will be different for each document, and it isn't feasible to compute the similarity.

So, instead, we vectorize the documents over the vocab. The vocab is the list of all possible words in the corpus. When we are vectorizing the documents, we check the count of each word.

If the term doesn't exist in the document, then its TF value will be 0; in the other extreme case, if all the words in the document are the same, then it will be 1.

Instructor: Pauline Maouad, PhD 31


2 – Term Frequency

The final normalized TF value will be in the range [0, 1]. So, in order to further distinguish which document is more relevant, we count the frequency of each term in each document.

Suppose we wish to rank which document in a set of English texts is most relevant to the query: "Natural Language Processing is fantastic!"

A simple way would be to start by eliminating documents that do not contain all 4 words "Natural", "Language", "Processing" and "fantastic". However, this still leaves many documents to search in.

Instructor: Pauline Maouad, PhD 32


2 – Term Frequency

Definition: The weight of a term that occurs in a document is proportional to the term frequency.

Mathematical definition:

tf(t, d) = count of t in d / number of words in d

Instructor: Pauline Maouad, PhD 33


2 – Term Frequency (Cont…)

Review 2:
Vocabulary = ‘The’, ‘Movie’, ‘Is’, ‘Very’, ‘Stressful’, ‘and’, ‘Long’, ‘Not’, ‘Slow’, ‘Exciting’, ‘Watch’, ‘It’, ‘You’, ‘Will’, ‘Love’, ‘Too’.

Number of Words (Review 2) = 7
TF for the word ‘the’ = (number of times ‘the’ appears in Review 2) / (number of terms in Review 2) = 1/7

Instructor: Pauline Maouad, PhD 34


2 – Term Frequency (Cont…)

Similarly:
TF(‘movie’) = 1/7
TF(‘is’) = 1/7
TF(‘Exciting’) = 0/7 = 0
TF(‘Slow’) = 1/7
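A minimal Python sketch of the same calculation (the token list is simply the lower-cased Review 2, and the vocabulary is the one defined above):

# Minimal sketch: term frequency for Review 2
review_2 = "the movie is not stressful and slow".split()
vocab = ["the", "movie", "is", "very", "stressful", "and", "long", "not",
         "slow", "exciting", "watch", "it", "you", "will", "love", "too"]

tf_review_2 = {t: review_2.count(t) / len(review_2) for t in vocab}
print(tf_review_2["the"])       # 1/7 ~ 0.143
print(tf_review_2["exciting"])  # 0.0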

Instructor: Pauline Maouad, PhD 35


Term Frequency for the 3 Reviews

Term        Review 1   Review 2   Review 3   TF1    TF2    TF3
The         1          1          1          1/7    1/7    1/11
Movie       1          1          1          1/7    1/7    1/11
Is          1          1          1          1/7    1/7    1/11
Very        1          0          0          1/7    0      0
Stressful   1          1          0          1/7    1/7    0
And         1          1          0          1/7    1/7    0
Long        1          0          0          1/7    0      0
Not         0          1          0          0      1/7    0
Slow        0          1          0          0      1/7    0
Exciting    0          0          1          0      0      1/11
Watch       0          0          1          0      0      1/11
It          0          0          2          0      0      2/11
You         0          0          1          0      0      1/11
Will        0          0          1          0      0      1/11
Love        0          0          1          0      0      1/11
Too         0          0          1          0      0      1/11

Instructor: Pauline Maouad, PhD 36
3 – Document Frequency (DF)

Document Frequency measures how widely a term is used across the whole corpus, i.e., the set of documents.

It is similar to TF, with the difference that TF is a frequency counter for a term t in a document d, whereas DF is the count of documents in the document set N in which the term t occurs.

In other words, DF is the number of documents in which the word is present:

df(t) = occurrence of t in documents

We count one occurrence if the term exists in the document at least once; we do not need to know the number of times the term is present.

Instructor: Pauline Maouad, PhD 37


3 – Document Frequency (DF) (Cont…)

To keep this value in range as well, we normalize by dividing by the total number of documents.

Our main goal is to know the informativeness of a term, and DF is the exact inverse of it.

df(t) = occurrence of t in documents / N

Instructor: Pauline Maouad, PhD 38


3 – Inverse Document Frequency (IDF)

IDF is the inverse of the document frequency; it measures the informativeness of a term t, i.e., how important t is. We need the IDF value because computing the TF alone is not sufficient to understand the importance of words.

When we calculate IDF, it will be very low for the most frequently occurring words such as stopwords ("is" is present in almost all documents, and N/df will give a very low value to that word).

Instructor: Pauline Maouad, PhD 39


3 – Inverse Document Frequency (Cont…)

We can calculate the IDF values for all the words in Review 2:

IDF(‘the’) = log(number of documents / number of documents containing the word ‘the’) = log(3/3) = log(1) = 0

Instructor: Pauline Maouad, PhD 40


3 – Inverse Document Frequency (Cont…)

Similarly (using base-10 logarithms):
IDF(‘movie’) = log(3/3) = 0
IDF(‘is’) = log(3/3) = 0
IDF(‘not’) = log(3/1) = log(3) = 0.48
IDF(‘stressful’) = log(3/2) = 0.18
IDF(‘and’) = log(3/2) = 0.18
IDF(‘exciting’) = log(3/1) = 0.48

Instructor: Pauline Maouad, PhD 41


3 – Inverse Document Frequency (Cont…)

There are a few other problems with the IDF: in the case of a large corpus, say 10,000 documents, the IDF value explodes. So, to dampen the effect, we take the log of the IDF.

At query time, when a word that is not in the vocab occurs, its df will be 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator:

idf(t) = log(N / (df + 1))
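A minimal sketch of the IDF calculation for the three reviews (base-10 logarithm, matching the 0.48 and 0.18 values above; the word sets are the lower-cased reviews):

# Minimal sketch: IDF for the three movie reviews
import math

doc_sets = [
    {"the", "movie", "is", "very", "stressful", "and", "long"},
    {"the", "movie", "is", "not", "stressful", "and", "slow"},
    {"the", "movie", "is", "exciting", "watch", "it", "you", "will", "love", "too"},
]
N = len(doc_sets)

def idf(term):
    df = sum(1 for doc in doc_sets if term in doc)
    return math.log10(N / df)            # use N / (df + 1) for the smoothed variant

print(round(idf("the"), 2))        # 0.0
print(round(idf("not"), 2))        # 0.48
print(round(idf("stressful"), 2))  # 0.18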

Instructor: Pauline Maouad, PhD 42


IDF for the 3 Reviews

Term        Review 1   Review 2   Review 3   IDF
The         1          1          1          log(3/3) = 0
Movie       1          1          1          log(3/3) = 0
Is          1          1          1          log(3/3) = 0
Very        1          0          0          log(3/1) = 0.48
Stressful   1          1          0          log(3/2) = 0.18
And         1          1          0          log(3/2) = 0.18
Long        1          0          0          log(3/1) = 0.48
Not         0          1          0          log(3/1) = 0.48
Slow        0          1          0          log(3/1) = 0.48
Exciting    0          0          1          log(3/1) = 0.48
Watch       0          0          1          log(3/1) = 0.48
It          0          0          2          log(3/1) = 0.48
You         0          0          1          log(3/1) = 0.48
Will        0          0          1          log(3/1) = 0.48
Love        0          0          1          log(3/1) = 0.48
Too         0          0          1          log(3/1) = 0.48

Instructor: Pauline Maouad, PhD 43
3 – Term Frequency-Inverse Document Frequency

Finally, by multiplying the TF and IDF values, we get the TF-IDF score. There are many different variations of TF-IDF, but for now let us concentrate on a basic version of it:

tf-idf(t, d) = tf(t, d) × idf(t)

Words with a higher score are more important, and those with a lower score are less important.

Instructor: Pauline Maouad, PhD 44


3 – Term Frequency-Inverse Document Frequency

We can now calculate the TF-IDF score for every word in Review 2:

TF-IDF(‘the’, Review 2) = TF(‘the’, Review 2) * IDF(‘the’) = 1/7 * 0 = 0
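A short sketch extending this to every word of Review 2 (unsmoothed, base-10 IDF as in the worked example; helper names are illustrative):

# Minimal sketch: TF-IDF for every word in Review 2
import math

review_2 = "the movie is not stressful and slow".split()
doc_sets = [
    {"the", "movie", "is", "very", "stressful", "and", "long"},
    {"the", "movie", "is", "not", "stressful", "and", "slow"},
    {"the", "movie", "is", "exciting", "watch", "it", "you", "will", "love", "too"},
]

for term in sorted(set(review_2)):
    tf = review_2.count(term) / len(review_2)
    df = sum(1 for doc in doc_sets if term in doc)
    idf = math.log10(len(doc_sets) / df)
    print(f"{term:10s} tf={tf:.3f}  idf={idf:.2f}  tf-idf={tf * idf:.3f}")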

Instructor: Pauline Maouad, PhD 45


Previous Example

In our vocabulary, we have 16 unique words:

Vocabulary = ‘The’, ‘Movie’, ‘Is’, ‘Very’, ‘Stressful’, ‘and’, ‘Long’, ‘Not’, ‘Slow’, ‘Exciting’, ‘Watch’, ‘It’, ‘You’, ‘Will’, ‘Love’, ‘Too’.

Instructor: Pauline Maouad, PhD 46


TF-IDF for the 3 Reviews

Term        Review 1   Review 2   Review 3   IDF    TF-IDF1   TF-IDF2   TF-IDF3
The         1          1          1          0      0         0         0
Movie       1          1          1          0      0         0         0
Is          1          1          1          0      0         0         0
Very        1          0          0          0.48   0.07      0         0
Stressful   1          1          0          0.18   0.03      0.03      0
And         1          1          0          0.18   0.03      0.03      0
Long        1          0          0          0.48   0.07      0         0
Not         0          1          0          0.48   0         0.07      0
Slow        0          1          0          0.48   0         0.07      0
Exciting    0          0          1          0.48   0         0         0.04
Watch       0          0          1          0.48   0         0         0.04
It          0          0          2          0.48   0         0         0.09
You         0          0          1          0.48   0         0         0.04
Will        0          0          1          0.48   0         0         0.04
Love        0          0          1          0.48   0         0         0.04
Too         0          0          1          0.48   0         0         0.04

Instructor: Pauline Maouad, PhD 47
In Summary

Instructor: Pauline Maouad, PhD 48


TF-IDF also gives larger values to less frequent words. TF-IDF is high when both the IDF and TF values are high, i.e., the word is rare in all the documents combined but frequent in a single document.

Bag of Words just creates a set of vectors containing the count of word occurrences in the documents (reviews), while the TF-IDF model also contains information on which words are more important and which are less important.

In Summary

Bag of Words vectors are easy to interpret. However, TF-IDF usually performs better in machine learning models.

Detecting the similarity between the words ‘spooky’ and ‘scary’, or translating our given documents into another language, requires a lot more information about the documents.

This is where Word Embedding techniques such as Word2Vec, Continuous Bag of Words (CBOW), etc. come in.

Instructor: Pauline Maouad, PhD 49


Example using Python

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
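A runnable sketch building on these imports (the review strings are the ones from the earlier example; TfidfVectorizer's defaults use a smoothed, natural-log IDF and L2-normalize each row, so the exact numbers differ from the hand computation):

# Minimal TF-IDF sketch with scikit-learn
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "The movie is very stressful and long.",
    "The movie is not stressful and slow.",
    "The movie is exciting. Watch it, you will love it too.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df.round(2))    # one TF-IDF vector per review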

Instructor: Pauline Maouad, PhD 50


Cosine Similarity

Instructor: Pauline Maouad, PhD 51


What is cosine similarity?

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between the two vectors.

Cosine similarity is one of the most widely used and powerful similarity measures in Data Science. It is used in multiple applications, such as:
• Finding similar documents in NLP (information retrieval)
• Finding a similar sequence to a DNA sequence in bioinformatics
• Detecting plagiarism, and many more.
Cosine similarity is calculated as follows:

cos(θ) = (A · B) / (‖A‖ × ‖B‖)

Instructor: Pauline Maouad, PhD 53


Why does the cosine of the angle between A and B give us the similarity?

If you look at the cosine function: Cos(θ) = 1 at θ = 0°, and Cos(θ) = -1 at θ = 180°.

This means that for two overlapping vectors, Cosine(θ) will be the highest.

For two exactly opposite vectors, Cosine(θ) will be the lowest.

You can consider 1 - cosine as a distance.

Instructor: Pauline Maouad, PhD 54


[Figure slides: the cosine function, and two vectors A and B in 2 dimensions, such as the ones in the example below]

Instructor: Pauline Maouad, PhD 57

Example

Doc1 = “Kind words do not cost much. Yet they accomplish much.” ― Blaise Pascal
Doc2 = “When one does not love too much, one does not love enough.” ― Blaise Pascal

Vocab = {‘kind’, ‘words’, ‘do’, ‘not’, ‘cost’, ‘much’, ‘yet’, ‘they’, ‘accomplish’, ‘when’, ‘one’, ‘does’, ‘love’, ‘too’, ‘enough’}

58
Step 1: Use Bag-of-Words to vectorize the text

Vocabulary = “kind, words, do, not, cost, much, yet, they, accomplish, when, one, does, love, too, enough”

Instructor: Pauline Maouad, PhD 59



Example: BoW

Term:  kind  words  do  not  cost  much  yet  they  accomplish  when  one  does  love  too  enough
Doc1:  1     1      1   1    1     2     1    1     1           0     0    0     0     0    0
Doc2:  0     0      0   2    0     1     0    0     0           1     2    2     2     1    1

60
BoW Vectors

Doc1 = [1 1 1 1 1 2 1 1 1 0 0 0 0 0 0]
Doc2 = [0 0 0 2 0 1 0 0 0 1 2 2 2 1 1]

Now compute the cosine similarity.
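A minimal NumPy sketch of that computation for these two vectors (the result, roughly 0.26, follows from the fact that the two quotes share only 'not' and 'much'):

# Minimal sketch: cosine similarity of the two BoW vectors
import numpy as np

doc1 = np.array([1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0])
doc2 = np.array([0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 2, 2, 2, 1, 1])

cos_sim = doc1.dot(doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(cos_sim)   # ~0.258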
Instructor: Pauline Maouad, PhD 61
How to calculate it in Python?

1. The numerator of the formula is the dot product of the two vectors, and the denominator is the product of the L2-norms of the two vectors.
   1. The dot product of two vectors is the sum of the elementwise multiplication of the vectors.
   2. The L2-norm is the square root of the sum of squares of the elements of a vector.

2. We can either use built-in functions from the NumPy library to calculate the dot product and L2 norms and put them in the formula, or directly use cosine_similarity from sklearn.metrics.pairwise. Consider two vectors A and B in 2-D; the following code calculates the cosine similarity.

Instructor: Pauline Maouad, PhD 62


How to calculate it in Python?

# using sklearn to calculate cosine similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([2, 3])   # example 2-D vectors (illustrative values)
B = np.array([4, 1])

cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
print(f"Cosine Similarity between A and B: {cos_sim}")
print(f"Cosine Distance between A and B: {1 - cos_sim}")

# using scipy: distance.cosine returns 1 - cosine similarity
from scipy.spatial import distance
print(distance.cosine(A, B))

Instructor: Pauline Maouad, PhD 63
