
Natural Language Processing - AIGC 5501

Mid-Term Report
Parul Batish – N01603670

1. Introduction
Word embedding is a fundamental technique in Natural Language Processing (NLP)
that converts words into dense vectors of real numbers. By capturing the semantic links
between words, these vectors improve the comprehension and processing of textual input
by machine learning algorithms. In this report, we build word embeddings using Gensim, a widely
used Python NLP library. We specifically examine Word2Vec and FastText, two popular word
embedding techniques implemented in Gensim. The primary goal is to develop numerical
representations of words that capture their semantic meaning and sentiment within the context
of movie-related discourse.

Project Scope: This report outlines the following key steps:


• Preparation of a movie review dataset.
• Learning word embeddings through the training of Word2Vec and FastText models.
• A qualitative assessment of the acquired embeddings, covering word analogies, word
similarity, and the management of out-of-vocabulary (OOV) words.

2. Data Preparation and Preprocessing


We use the "Bag of Words Meets Bags of Popcorn" dataset, which includes both labeled and
unlabeled movie reviews. For word embedding learning, a combined subset of 75,000 reviews
(25,000 labeled, 50,000 unlabeled) was used. After the dataset is loaded into Pandas
DataFrames, the text is cleaned: it is converted to lowercase, and HTML tags and special
characters are removed. We then use NLTK's tokenizers to split the text into sentences and
words.

To prepare the raw reviews for word embedding, the following steps were applied:

Normalization: The text was lowercased, and HTML elements and some special characters were
replaced with spaces. This cuts down on redundant vocabulary and standardises the format.

Tokenization: To maintain some context, reviews were first split into sentences using the
NLTK sentence tokenizer. Each sentence was then tokenized further into individual words,
which serve as the basic analytical units.

Filtering: Non-alphanumeric characters (except spaces) were removed to focus the
learning on meaningful words and reduce noise.

These preprocessing steps aimed to transform the raw movie reviews into a
structured and cleaned format, enabling the Word2Vec and FastText models to
efficiently identify semantic relationships between words.
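
As a rough sketch of how these steps can be implemented with NLTK and regular expressions (the function and variable names here are illustrative, not the project's exact code), the cleaning pipeline could look like this:

import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used below

def review_to_sentences(review):
    """Clean one raw review and return it as a list of tokenized sentences."""
    text = re.sub(r"<[^>]+>", " ", review)         # replace HTML tags with spaces
    sentences = []
    for sent in sent_tokenize(text):
        sent = sent.lower()                        # normalization: lowercase
        sent = re.sub(r"[^a-z0-9\s]", " ", sent)   # filtering: keep alphanumerics and spaces
        words = word_tokenize(sent)
        if words:
            sentences.append(words)
    return sentences

# Example usage on a single review string
print(review_to_sentences("An <b>excellent</b> film! 10/10 would watch again."))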

3. Word2Vec Embedding

num_features (Dimensionality): This parameter is set to 50 in our word embedding models.
It determines the number of dimensions in each word embedding vector. We chose a value of 50
to strike a balance between capturing rich semantic information and maintaining computational
efficiency, given the size of our dataset.

min_word_count: This parameter is set to 20 in our word embedding models. It controls the
minimum frequency threshold for words to be included in the vocabulary. Words that appear
less frequently than this threshold are ignored during training, reducing noise and focusing
on a more meaningful vocabulary.

window_size: This parameter is set to 10 in our word embedding models. It defines the maximum
distance between a target word and its neighbouring words within a sentence.

down_sampling: This parameter is set to 0.001 in our word embedding models. Down sampling
discards some occurrences of very frequent words during training. By reducing the dominance
of these highly frequent words, it improves training speed and the representation of less
common words in the embedding space.

num_threads: This parameter is set to 5. It determines the number of parallel worker threads
used for training, making use of available CPU resources for efficiency.

iterations: Set to 20. This controls how many times the model processes the entire dataset,
which impacts embedding quality.

Model Choice (CBOW vs. Skip-gram): The CBOW architecture was chosen. It predicts a target
word from its surrounding context, making it suitable for our dataset size and our emphasis
on representing less frequent words.
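
A minimal Gensim training call using the hyperparameters described above might look like the following sketch, assuming `sentences` holds the tokenized sentences from Section 2 (the project's exact script may differ; Gensim 4.x names these arguments vector_size, min_count, window, sample, workers, and epochs):

from gensim.models import Word2Vec

# Hyperparameters described above
num_features = 50      # embedding dimensionality
min_word_count = 20    # ignore words occurring fewer than 20 times
window_size = 10       # max distance between target and context words
down_sampling = 1e-3   # threshold for downsampling very frequent words
num_threads = 5        # parallel worker threads
iterations = 20        # passes over the full corpus

model = Word2Vec(
    sentences,                 # tokenized sentences from Section 2
    vector_size=num_features,
    min_count=min_word_count,
    window=window_size,
    sample=down_sampling,
    workers=num_threads,
    epochs=iterations,
    sg=0,                      # sg=0 selects the CBOW architecture
)

print(len(model.wv))           # vocabulary size (about 24,800 in this project)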

4. Training Results
During the model training phase, basic logging techniques were employed to monitor
the progress and ensure the smooth execution of the training process. The use of
logging allowed for real-time tracking of various training metrics and provided
valuable insights into the model's performance.

With a vocabulary size of 24,800 unique words, the model effectively learned
embeddings for a diverse range of vocabulary terms, ensuring comprehensive
coverage. This enables nuanced relationships between words to be captured,
enhancing semantic representations.

Qualitative evaluation of the word embeddings revealed their ability to capture meaningful
semantic relationships. Notably, the model assigned a high similarity score to contextually
related words such as '1' and '2', indicating its capacity to discern subtle semantic
associations.

Moreover, the model showcased its aptitude for solving analogical reasoning tasks,
effectively understanding and extrapolating word relationships. For instance, the
model accurately deduced the analogy "best" - "good" + "bad" = "worst", showcasing
its capability to perform logical reasoning and infer semantic associations between
words.
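
These qualitative checks can be reproduced with Gensim's KeyedVectors interface, sketched below for the trained model from Section 3:

# Word similarity: cosine similarity between two in-vocabulary tokens
print(model.wv.similarity("1", "2"))

# Analogy: "best" - "good" + "bad" is expected to rank "worst" highly
print(model.wv.most_similar(positive=["best", "bad"], negative=["good"], topn=3))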

Word embeddings and the vocabulary were saved as separate arrays, incorporating
special tokens for padding ('-') and unknown words ('unk'). This comprehensive
storage scheme supports further analysis and application in various downstream
tasks, such as recommendation systems.
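
One possible way to build and persist such arrays, with the padding token '-' and the 'unk' token prepended, is sketched below; the file names and the choice of a zero vector for padding and a mean vector for 'unk' are illustrative assumptions, not necessarily the project's exact scheme:

import numpy as np

vocab = ["-", "unk"] + list(model.wv.index_to_key)        # special tokens first
pad_vec = np.zeros((1, model.wv.vector_size))             # all-zero vector for padding
unk_vec = model.wv.vectors.mean(axis=0, keepdims=True)    # mean vector for unknowns (one possible choice)
embeddings = np.vstack([pad_vec, unk_vec, model.wv.vectors])

np.save("word2vec_embeddings.npy", embeddings)
np.save("word2vec_vocab.npy", np.array(vocab))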

Overall, the qualitative evaluation underscores the effectiveness of the trained word
embedding model in capturing semantic nuances and representing lexical
relationships, thus validating its utility for a wide range of natural language
processing tasks.

5. FastText Embedding
Motivation: FastText Implementation for Addressing Word2Vec Limitations

Out-of-Vocabulary (OOV) Handling: FastText effectively handles unseen words, which is vital
for diverse movie reviews containing slang, specialized terms ('cinematic'), and
misspellings.

Morphological Richness: By breaking words into subword units, FastText captures internal
word structures, facilitating analysis of stylistic similarities, genre identification, and
nuanced review understanding.

Dataset Alignment: FastText's capabilities align with the dynamic and niche vocabulary
characteristics of our movie review dataset.

Subword Embeddings: FastText produced a meaningful representation for 'citi', a word unseen
in the training data. This showcases its capacity to handle novel vocabulary, a distinct
advantage for our dataset.

Morphological Awareness: By decomposing words into subword units (character n-grams),
FastText captures intricate internal structures and relationships. This enriches our
analysis of stylistic similarities, genre identification, and nuanced sentiment
comprehension within movie reviews.
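
A comparable FastText model can be trained on the same tokenized sentences and queried for the out-of-vocabulary token 'citi'. The sketch below reuses the Word2Vec hyperparameters from Section 3 as an assumption; the project's exact settings may differ:

from gensim.models import FastText

ft_model = FastText(
    sentences,          # same tokenized sentences as in Section 2
    vector_size=50,
    min_count=20,
    window=10,
    sample=1e-3,
    workers=5,
    epochs=20,
)

# 'citi' never enters the vocabulary, but FastText composes a vector for it
# from character n-grams shared with in-vocabulary words such as 'city'.
print("citi" in ft_model.wv.key_to_index)         # False: out of vocabulary
print(ft_model.wv.most_similar("citi", topn=5))   # still returns nearest neighbours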

6. Evaluation
Model Performance: Both Word2Vec and FastText models were trained effectively,
resulting in word embeddings that accurately represent semantic similarities and
capture morphological nuances. The models demonstrated proficiency in tasks such
as word similarity comparison, analogy completion, and outlier detection, indicating
robust performance.
Vocabulary Coverage: The trained models exhibited comprehensive vocabulary
coverage, enabling the representation of a wide range of words present in the movie
review dataset. This extensive coverage ensures that semantic relationships and
contextual nuances are adequately captured, enhancing the models' utility in
downstream tasks.

Out-of-Vocabulary Handling: The FastText model showed superior capability in handling
out-of-vocabulary instances by generating meaningful representations for unseen words. This
feature is particularly advantageous in datasets with diverse vocabulary, slang terms, and
specialized terminology, such as movie reviews.

Morphological Analysis: FastText's ability to break words into subword units facilitated
deeper morphological analysis, allowing internal word structures and relationships to be
captured. This enhanced the understanding of stylistic similarities, genre identification,
and nuanced sentiment analysis within movie reviews.

Storage and Accessibility: The project implemented efficient storage mechanisms for
word embeddings and vocabularies, facilitating easy access and utilization in
downstream tasks. By saving embeddings as separate arrays, including special tokens
for padding and unknown words, the models' representations can be readily
integrated into various applications, such as recommendation systems or sentiment
analysis pipelines.

7. Conclusion

In conclusion, this project delved into the realm of natural language processing (NLP)
by exploring the efficacy of Word2Vec and FastText word embedding techniques in
the analysis of movie reviews. Through rigorous training and evaluation, both models
exhibited commendable performance in capturing semantic relationships and
contextual nuances within the dataset. Notably, FastText showcased superior
handling of out-of-vocabulary instances, providing meaningful representations for
unseen words—an invaluable asset in datasets with diverse vocabularies and
specialized terminology. Furthermore, FastText's capability to analyze word
morphology, facilitated by its subword unit decomposition, offered deeper insights
into internal word structures and relationships, enhancing tasks like stylistic similarity
analysis and sentiment interpretation. By implementing efficient storage mechanisms
for embeddings and vocabularies, the models' representations are readily accessible
for integration into various downstream applications. Overall, this project
underscores the significance of word embedding techniques in NLP and highlights the
potential of Word2Vec and FastText models for extracting meaningful insights from textual
data, with implications spanning sentiment analysis, recommendation systems, and text
classification in movie reviews and beyond.
