Natural Language Processing
Mid-Term Report
Parul Batish – N01603670
1. Introduction
Word embedding is a fundamental technique in Natural Language Processing (NLP)
that converts words into dense vectors of real numbers. By capturing the semantic links
between words, these vectors help machine learning algorithms understand and process
textual input. In this report, we build word embeddings using Gensim, a widely used
Python NLP library. We specifically look into Word2Vec and FastText, two popular word
embedding techniques implemented in Gensim. The primary goal is to develop numerical
representations of words that capture their semantic meaning and sentiment within the
context of movie-related discourse.
2. Preprocessing
To prepare the raw reviews for word embedding, the following steps were applied:
Normalization: The text was lowercased, and HTML elements and certain special
characters were replaced with spaces. This cuts down on redundant vocabulary and
standardises the format.
Tokenization: To maintain some context, reviews were divided into sentences using the
NLTK sentence tokenizer. Each sentence was then tokenized further into individual words.
These tokens serve as the basic units of analysis.
These preprocessing steps aimed to transform the raw movie reviews into a
structured and cleaned format, enabling the Word2Vec and FastText models to
efficiently identify semantic relationships between words.
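The sketch below outlines this pipeline, assuming NLTK for tokenization; the regular
expressions and the `reviews` variable are illustrative assumptions rather than the exact
code used in the project.

import re
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk.download('punkt')

def preprocess_review(review):
    """Normalize one raw review and return a list of token lists, one per sentence."""
    text = review.lower()                       # lowercase the text
    text = re.sub(r"<[^>]+>", " ", text)        # replace HTML elements with spaces
    text = re.sub(r"[^a-z0-9'\s]", " ", text)   # replace special characters with spaces
    sentences = sent_tokenize(text)             # NLTK sentence tokenizer
    return [word_tokenize(s) for s in sentences]

# Flatten all reviews into a list of tokenized sentences for training
# (`reviews`, an iterable of raw review strings, is assumed to exist).
corpus = [tokens for review in reviews for tokens in preprocess_review(review)]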
3. Word2Vec Embedding
Iterations: Set to 20. This controls how many times the model processes the entire
dataset, impacting embedding quality.
Model Choice (CBOW vs. Skip-gram): The CBOW architecture was chosen. It predicts
a target word from its surrounding context and trains efficiently, making it suitable
for our dataset size; Skip-gram remains the alternative when rarer words need
stronger representation.
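A minimal training sketch under these settings is shown below; hyperparameters other
than the epoch count and the CBOW choice (vector size, window, minimum count, worker
threads) are assumptions, not values confirmed by the report.

from gensim.models import Word2Vec

# Train Word2Vec (CBOW) on the tokenized sentences from the preprocessing step.
w2v_model = Word2Vec(
    sentences=corpus,   # list of token lists produced during preprocessing
    vector_size=100,    # embedding dimensionality (assumed)
    window=5,           # context window size (assumed)
    min_count=5,        # ignore very rare words (assumed)
    sg=0,               # 0 = CBOW, 1 = Skip-gram
    epochs=20,          # "Iterations: Set to 20"
    workers=4,          # parallel training threads (assumed)
)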
4. Training Results
During the model training phase, basic logging techniques were employed to monitor
the progress and ensure the smooth execution of the training process. The use of
logging allowed for real-time tracking of various training metrics and provided
valuable insights into the model's performance.
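Gensim reports its progress through Python's standard logging module; a minimal
configuration along these lines, set up before training, is assumed here.

import logging

# Route Gensim's INFO-level messages (vocabulary building, epoch progress,
# effective words per second) to the console during training.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)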
With a vocabulary size of 24,800 unique words, the model effectively learned
embeddings for a diverse range of vocabulary terms, ensuring comprehensive
coverage. This enables nuanced relationships between words to be captured,
enhancing semantic representations.
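The vocabulary size and individual vectors can be inspected directly on the trained
model; the query word "movie" below is only an illustrative example.

# Number of unique words the model learned embeddings for (24,800 in this run).
vocab_size = len(w2v_model.wv.key_to_index)

# Look up the dense vector for a single word.
vector = w2v_model.wv["movie"]
print(vocab_size, vector.shape)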
Moreover, the model showcased its aptitude for analogical reasoning tasks, effectively
capturing and extrapolating word relationships. For instance, the model correctly
resolved the analogy "best" - "good" + "bad" = "worst", showing that its vector space
encodes consistent semantic associations between words.
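In Gensim this analogy is expressed as a vector-arithmetic query; the call below is a
sketch of that check.

# "best" - "good" + "bad" ≈ ?  (expected to rank "worst" first, as reported above)
result = w2v_model.wv.most_similar(positive=["best", "bad"], negative=["good"], topn=1)
print(result)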
Word embeddings and the vocabulary were saved as separate arrays, incorporating
special tokens for padding ('-') and unknown words ('unk'). This comprehensive
storage scheme supports further analysis and application in various downstream
tasks, such as recommendation systems.
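A sketch of that storage scheme is given below; the file names and the zero-vector
initialisation of the special tokens are assumptions.

import numpy as np

# Prepend the padding ('-') and unknown ('unk') tokens to the vocabulary, give them
# zero vectors, then save vocabulary and embeddings as parallel arrays.
words = ["-", "unk"] + list(w2v_model.wv.index_to_key)
dim = w2v_model.wv.vector_size
vectors = np.vstack([np.zeros((2, dim)), w2v_model.wv.vectors])

np.save("w2v_vocab.npy", np.array(words))
np.save("w2v_embeddings.npy", vectors)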
Overall, the qualitative evaluation underscores the effectiveness of the trained word
embedding model in capturing semantic nuances and representing lexical
relationships, thus validating its utility for a wide range of natural language
processing tasks.
5. FastText Embedding
Motivation: FastText was implemented to address a key limitation of Word2Vec: because
it builds word vectors from character n-gram (subword) units, it can produce meaningful
embeddings for out-of-vocabulary and misspelled words.
Dataset Alignment: These capabilities align with the dynamic and niche vocabulary found
in our movie review dataset.
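A minimal FastText training sketch is shown below; the hyperparameters mirror the
assumed Word2Vec settings, and the n-gram range is Gensim's default rather than a value
stated in this report.

from gensim.models import FastText

# Train FastText on the same tokenized sentences. Subword n-grams let it build
# vectors for words never seen during training.
ft_model = FastText(
    sentences=corpus,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0,
    epochs=20,
    min_n=3,            # smallest character n-gram (Gensim default)
    max_n=6,            # largest character n-gram (Gensim default)
    workers=4,
)

# Even a misspelled or unseen word receives a vector composed from its subwords.
oov_vector = ft_model.wv["cinematographyy"]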
6. Evaluation
Model Performance: Both Word2Vec and FastText models trained successfully, producing
word embeddings that represent semantic similarities and capture morphological
nuances. The models performed well on tasks such as word similarity comparison,
analogy completion, and outlier detection (sketched at the end of this section),
indicating robust performance.
Vocabulary Coverage: The trained models exhibited comprehensive vocabulary
coverage, enabling the representation of a wide range of words present in the movie
review dataset. This extensive coverage ensures that semantic relationships and
contextual nuances are adequately captured, enhancing the models' utility in
downstream tasks.
Storage and Accessibility: The project implemented efficient storage mechanisms for
word embeddings and vocabularies, facilitating easy access and utilization in
downstream tasks. By saving embeddings as separate arrays, including special tokens
for padding and unknown words, the models' representations can be readily
integrated into various applications, such as recommendation systems or sentiment
analysis pipelines.
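The evaluation tasks above can be exercised with short Gensim queries; the sketch below
illustrates them, with query words chosen for illustration rather than taken from the
project's actual test items.

# Word similarity: nearest neighbours of a query word.
print(w2v_model.wv.most_similar("terrible", topn=5))

# Analogy completion (the example reported in Section 4).
print(w2v_model.wv.most_similar(positive=["best", "bad"], negative=["good"], topn=1))

# Outlier ("odd one out") detection.
print(w2v_model.wv.doesnt_match(["actor", "director", "producer", "pizza"]))

# Direct similarity score between two words, here on the FastText model.
print(ft_model.wv.similarity("good", "great"))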
7. Conclusion
In conclusion, this project explored the effectiveness of Word2Vec and FastText word
embedding techniques for natural language processing (NLP) in the analysis of movie
reviews. Through training and evaluation, both models
exhibited commendable performance in capturing semantic relationships and
contextual nuances within the dataset. Notably, FastText showcased superior
handling of out-of-vocabulary instances, providing meaningful representations for
unseen words—an invaluable asset in datasets with diverse vocabularies and
specialized terminology. Furthermore, FastText's capability to analyze word
morphology, facilitated by its subword unit decomposition, offered deeper insights
into internal word structures and relationships, enhancing tasks like stylistic similarity
analysis and sentiment interpretation. By implementing efficient storage mechanisms
for embeddings and vocabularies, the models' representations are readily accessible
for integration into various downstream applications. Overall, this project
underscores the significance of word embedding techniques in NLP and highlights the
potential of Word2Vec and FastText models for extracting meaningful insights
from textual data, with implications spanning sentiment analysis, recommendation
systems, and text classification in movie reviews and beyond.