
Natural Language Processing - AIGC 5501

Mid-Term Report
Parul Batish – N01603670

1. Introduction
Word embedding is a fundamental technique in Natural Language Processing (NLP)
that converts words into dense vectors of real numbers. By capturing the semantic links
between words, these vectors improve the comprehension and processing of textual input
by machine learning algorithms. In this report, we build word embeddings using Gensim, a widely
used Python NLP library. We specifically examine Word2Vec and FastText, two popular word
embedding techniques implemented in Gensim. The primary goal is to develop numerical
representations of words that capture their semantic meaning and sentiment within the context
of movie-related discourse.

Project Scope: This report outlines the following key steps:


• Preparation of a movie review dataset.
• Learning word embeddings through the training of Word2Vec and FastText models.
• A qualitative assessment of the acquired embeddings, covering word analogies, word
similarity, and the management of out-of-vocabulary (OOV) words.

2. Data Preparation and Preprocessing


We use the "Bag of Words Meets Bags of Popcorn" dataset, which includes both labeled and
unlabeled movie reviews. For word embedding learning, a combined subset of 75,000 reviews
(25,000 labeled, 50,000 unlabeled) was used. After the dataset is loaded into Pandas
DataFrames, the text is cleaned: it is converted to lowercase, and HTML tags and special
characters are removed. We then use NLTK's tokenizers to split the text into sentences and
words.

To prepare the raw reviews for word embedding, the following steps were applied:

Normalization: The text was lowercased, and HTML elements and some special characters were
replaced with spaces. This cuts down on redundant vocabulary and standardises the format.

Tokenization: To maintain some context, reviews were first split into sentences using the
NLTK sentence tokenizer. Each sentence was then tokenized further into individual words,
which serve as the basic analytical units.

Filtering: Non-alphanumeric characters (except spaces) were removed to focus the
learning on meaningful words and reduce noise.

These preprocessing steps aimed to transform the raw movie reviews into a
structured and cleaned format, enabling the Word2Vec and FastText models to
efficiently identify semantic relationships between words.
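
As a rough sketch of how these steps can be implemented with NLTK and regular expressions (the function and variable names here are illustrative, not the project's exact code), the cleaning pipeline could look like this:

import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used below

def review_to_sentences(review):
    """Clean one raw review and return it as a list of tokenized sentences."""
    text = re.sub(r"<[^>]+>", " ", review)         # replace HTML tags with spaces
    sentences = []
    for sent in sent_tokenize(text):
        sent = sent.lower()                        # normalization: lowercase
        sent = re.sub(r"[^a-z0-9\s]", " ", sent)   # filtering: keep alphanumerics and spaces
        words = word_tokenize(sent)
        if words:
            sentences.append(words)
    return sentences

# Example usage on a single review string
print(review_to_sentences("An <b>excellent</b> film! 10/10 would watch again."))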

3. Word2Vec Embedding

num_features (Dimensionality): This parameter is set to 50 in our word embedding models.
It determines the number of dimensions in each word embedding vector. We chose a value of 50
to strike a balance between capturing rich semantic information and maintaining computational
efficiency, given the size of our dataset.

min_word_count: This parameter is set to 20 in our word embedding models. It controls the
minimum frequency threshold for words to be included in the vocabulary. Words that appear
less frequently than this threshold are ignored during training, reducing noise and focusing
on a more meaningful vocabulary.

window_size: This parameter is set to 10 in our word embedding models. It defines the maximum
distance between a target word and its neighbouring words within a sentence.

down_sampling: This parameter is set to 0.001 in our word embedding models. Down sampling
discards some occurrences of very frequent words during training. By reducing the dominance
of these highly frequent words, it improves training speed and the representation of less
common words in the embedding space.

num_threads: This parameter is set to 5. It determines the number of parallel worker threads
used for training, making use of available CPU resources for efficiency.

iterations: Set to 20. This controls how many times the model processes the entire dataset,
which impacts embedding quality.

Model Choice (CBOW vs. Skip-gram): The CBOW architecture was chosen. It predicts a target
word from its surrounding context, making it suitable for our dataset size and our emphasis
on representing less frequent words.
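
A minimal Gensim training call using the hyperparameters described above might look like the following sketch, assuming `sentences` holds the tokenized sentences from Section 2 (the project's exact script may differ; Gensim 4.x names these arguments vector_size, min_count, window, sample, workers, and epochs):

from gensim.models import Word2Vec

# Hyperparameters described above
num_features = 50      # embedding dimensionality
min_word_count = 20    # ignore words occurring fewer than 20 times
window_size = 10       # max distance between target and context words
down_sampling = 1e-3   # threshold for downsampling very frequent words
num_threads = 5        # parallel worker threads
iterations = 20        # passes over the full corpus

model = Word2Vec(
    sentences,                 # tokenized sentences from Section 2
    vector_size=num_features,
    min_count=min_word_count,
    window=window_size,
    sample=down_sampling,
    workers=num_threads,
    epochs=iterations,
    sg=0,                      # sg=0 selects the CBOW architecture
)

print(len(model.wv))           # vocabulary size (about 24,800 in this project)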

4. Training Results
During the model training phase, basic logging techniques were employed to monitor
the progress and ensure the smooth execution of the training process. The use of
logging allowed for real-time tracking of various training metrics and provided
valuable insights into the model's performance.

With a vocabulary size of 24,800 unique words, the model effectively learned
embeddings for a diverse range of vocabulary terms, ensuring comprehensive
coverage. This enables nuanced relationships between words to be captured,
enhancing semantic representations.

Qualitative evaluation of the word embeddings revealed their ability to capture meaningful
semantic relationships. Notably, the model assigned a high similarity score to contextually
related words such as '1' and '2', indicating its capacity to discern subtle semantic
associations.

Moreover, the model showcased its aptitude for solving analogical reasoning tasks,
effectively understanding and extrapolating word relationships. For instance, the
model accurately deduced the analogy "best" - "good" + "bad" = "worst", showcasing
its capability to perform logical reasoning and infer semantic associations between
words.
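
These qualitative checks can be reproduced with Gensim's KeyedVectors interface, sketched below for the trained model from Section 3:

# Word similarity: cosine similarity between two in-vocabulary tokens
print(model.wv.similarity("1", "2"))

# Analogy: "best" - "good" + "bad" is expected to rank "worst" highly
print(model.wv.most_similar(positive=["best", "bad"], negative=["good"], topn=3))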

Word embeddings and the vocabulary were saved as separate arrays, incorporating
special tokens for padding ('-') and unknown words ('unk'). This comprehensive
storage scheme supports further analysis and application in various downstream
tasks, such as recommendation systems.
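
One possible way to build and persist such arrays, with the padding token '-' and the 'unk' token prepended, is sketched below; the file names and the choice of a zero vector for padding and a mean vector for 'unk' are illustrative assumptions, not necessarily the project's exact scheme:

import numpy as np

vocab = ["-", "unk"] + list(model.wv.index_to_key)        # special tokens first
pad_vec = np.zeros((1, model.wv.vector_size))             # all-zero vector for padding
unk_vec = model.wv.vectors.mean(axis=0, keepdims=True)    # mean vector for unknowns (one possible choice)
embeddings = np.vstack([pad_vec, unk_vec, model.wv.vectors])

np.save("word2vec_embeddings.npy", embeddings)
np.save("word2vec_vocab.npy", np.array(vocab))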

Overall, the qualitative evaluation underscores the effectiveness of the trained word
embedding model in capturing semantic nuances and representing lexical
relationships, thus validating its utility for a wide range of natural language
processing tasks.

5. FastText Embedding
Motivation: FastText Implementation for Addressing Word2Vec Limitations

Out-of-Vocabulary (OOV) Handling: FastText effectively handles unseen words, which is vital
for diverse movie reviews containing slang, specialized terms ('cinematic'), and
misspellings.

Morphological Richness: By breaking words into subword units, FastText captures internal
word structures, facilitating analysis of stylistic similarities, genre identification, and
nuanced review understanding.

Dataset Alignment: FastText's capabilities align with the dynamic and niche vocabulary
characteristics of our movie review dataset.

Subword Embeddings: FastText produced a meaningful representation for 'citi', a word unseen
in the training data. This showcases its capacity to handle novel vocabulary, a distinct
advantage for our dataset.

Morphological Awareness: By decomposing words into subword units (character n-grams),
FastText captures intricate internal structures and relationships. This enriches our
analysis of stylistic similarities, genre identification, and nuanced sentiment
comprehension within movie reviews.
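
A comparable FastText model can be trained on the same tokenized sentences and queried for the out-of-vocabulary token 'citi'. The sketch below reuses the Word2Vec hyperparameters from Section 3 as an assumption; the project's exact settings may differ:

from gensim.models import FastText

ft_model = FastText(
    sentences,          # same tokenized sentences as in Section 2
    vector_size=50,
    min_count=20,
    window=10,
    sample=1e-3,
    workers=5,
    epochs=20,
)

# 'citi' never enters the vocabulary, but FastText composes a vector for it
# from character n-grams shared with in-vocabulary words such as 'city'.
print("citi" in ft_model.wv.key_to_index)         # False: out of vocabulary
print(ft_model.wv.most_similar("citi", topn=5))   # still returns nearest neighbours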

6. Evaluation
Model Performance: Both Word2Vec and FastText models were trained effectively,
resulting in word embeddings that accurately represent semantic similarities and
capture morphological nuances. The models demonstrated proficiency in tasks such
as word similarity comparison, analogy completion, and outlier detection, indicating
robust performance.
Vocabulary Coverage: The trained models exhibited comprehensive vocabulary
coverage, enabling the representation of a wide range of words present in the movie
review dataset. This extensive coverage ensures that semantic relationships and
contextual nuances are adequately captured, enhancing the models' utility in
downstream tasks.

Out-of-Vocabulary Handling: The FastText model showed superior capability in handling
out-of-vocabulary instances by generating meaningful representations for unseen words. This
feature is particularly advantageous in datasets with diverse vocabulary, slang terms, and
specialized terminology, such as movie reviews.

Morphological Analysis: FastText's ability to break words into subword units facilitated
deeper morphological analysis, allowing internal word structures and relationships to be
captured. This enhanced the understanding of stylistic similarities, genre identification,
and nuanced sentiment analysis within movie reviews.

Storage and Accessibility: The project implemented efficient storage mechanisms for
word embeddings and vocabularies, facilitating easy access and utilization in
downstream tasks. By saving embeddings as separate arrays, including special tokens
for padding and unknown words, the models' representations can be readily
integrated into various applications, such as recommendation systems or sentiment
analysis pipelines.

7. Conclusion

In conclusion, this project delved into the realm of natural language processing (NLP)
by exploring the efficacy of Word2Vec and FastText word embedding techniques in
the analysis of movie reviews. Through rigorous training and evaluation, both models
exhibited commendable performance in capturing semantic relationships and
contextual nuances within the dataset. Notably, FastText showcased superior
handling of out-of-vocabulary instances, providing meaningful representations for
unseen words—an invaluable asset in datasets with diverse vocabularies and
specialized terminology. Furthermore, FastText's capability to analyze word
morphology, facilitated by its subword unit decomposition, offered deeper insights
into internal word structures and relationships, enhancing tasks like stylistic similarity
analysis and sentiment interpretation. By implementing efficient storage mechanisms
for embeddings and vocabularies, the models' representations are readily accessible
for integration into various downstream applications. Overall, this project
underscores the significance of word embedding techniques in NLP and highlights the
potential of Word2Vec and FastText models for extracting meaningful insights from textual
data, with implications spanning sentiment analysis, recommendation systems, and text
classification in movie reviews and beyond.
