ISSS609 Project Proposal Group 7
ISSS609
Project Proposal
29 Sep 2024
SMU Classification: Restricted
1. Introduction
With social media platforms and e-commerce websites prevalent across all consumer segments,
businesses now have access to a wealth of user-generated content that offers valuable insight into
consumer sentiment. This project aims to develop a sentiment analysis tool that processes and
categorizes the sentiments behind user reviews, helping Sephora, a popular e-commerce platform,
understand how its brands and products are perceived. The tool will enable businesses to respond
more effectively to customer feedback by classifying the sentiment of each review as positive,
negative, or neutral.
Project Components:
• Data Sources: The dataset (downloaded from Kaggle) was collected with a Python scraper
from the Sephora US website in March 2023 and contains two data tables, “Products” and
“review”. (See Appendix 1)
• Challenges: User reviews often include informal language, slang, and emojis, making
natural language processing (NLP) essential for cleaning and normalizing the data. The
model will be trained on English text only, so it will not be able to analyze mixed-language
reviews.
• Machine Learning Models: A mix of traditional machine learning techniques and deep
learning architectures will be applied for sentiment classification.
Benefits for different business functions include:
• Marketing teams can leverage consumer sentiment insights to fine-tune campaigns and
better target audiences
• Customer service teams can respond quickly to negative sentiments, addressing concerns
• Product development teams can gather feedback on product features and identify areas
for improvement and new value propositions
By providing insights into consumer opinions, this tool allows businesses to track and adjust their
strategies, addressing customer concerns and improving overall product satisfaction. Staying
attuned to consumer sentiment can help businesses remain responsive, customer-centric, and
competitive.
2. Proposed Methodology
Analyzing user reviews poses various challenges for sentiment analysis, especially because of
the subtle language nuances often found in the data. This makes it difficult to detect ambiguity,
sarcasm, or irony, given that this heavily depends on context which models may not capture
explicitly. Moreover, traditional models often struggle with contextual understanding, leading to
misclassification of sentiments. Data quality issues, such as noise, slang, abbreviations, and
emojis in comments, can affect model performance. Additionally, comments may be written
in languages other than English, which might not be supported by the models. Finally, training
Large Language Models (LLMs) like BERT [2] and GPT [3] requires significant computational resources.
This project aims to develop and compare sentiment analysis models using traditional machine
learning techniques, deep learning architectures, and Large Language Models (LLMs). The
methodology is structured into three core steps: data collection and preprocessing, model
development, and comparison and evaluation. Figure 1 shows the flowchart for the proposed
methodology [1]:
We will utilize datasets of user reviews and product information. Data preprocessing will remove
noise such as punctuation, stop words, emojis, slang, and abbreviations by cleaning, normalizing
words (stemming or lemmatization), and tokenizing the text. After preprocessing, methods like
word embeddings will be used to capture semantic relationships. The cleaned dataset will then be
split so that 80% is used for training and the remaining 20% is held out for validation and
testing (see Section 4).
Model Development
To demonstrate different approaches to sentiment classification, both traditional machine
learning models and deep learning models (including LLMs) will be implemented. Feature
extraction techniques will be employed to transform text data into numerical representations for
the traditional machine learning models. This will facilitate an assessment of how engineered
textual features impact model performance. Deep learning models, including CNNs and
transformers like BERT and GPT, will be used for their ability to capture complex contextual
relationships. Comparing these methods will reveal the trade-offs between computational
efficiency and the ability to capture nuanced sentiment.
3. Solution Details
Data Pre-processing
Data cleaning and pre-processing steps such as stop-word removal, stemming, lemmatization, and
tokenization will be applied to the text before model training.
Feature Extraction:
Classic machine learning models require numerical inputs rather than raw text. We will explore
multiple methods for converting text data into numerical representations, including TF-IDF and
Word2Vec (see Section 4).
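As an illustration of one such representation, TF-IDF (named in Section 4) can be computed by hand in a few lines. This is a minimal sketch using the unsmoothed textbook formula, tf × log(N / df); the function name `tfidf` and the sample reviews are hypothetical, and libraries such as scikit-learn apply a smoothed variant, so their numbers would differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:  # an empty document has no terms to weight
            vectors.append({})
            continue
        counts = Counter(doc)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already-tokenized reviews (not taken from the dataset).
reviews = [
    ["love", "this", "serum"],
    ["this", "serum", "broke", "me", "out"],
    ["love", "the", "scent"],
]
weights = tfidf(reviews)
```

Terms that occur in few reviews (such as "broke") receive a higher weight than terms shared by most reviews, and a term present in every review gets a weight of zero.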
Model Selection:
Logistic Regression (LR): The LR model is a simple classification model which we will use as
a baseline. It will be trained on the preprocessed text data to classify the sentiment of each
text. We will explore using (1) a binary classifier to classify reviews as positive or negative,
and (2) a multiclass classifier to classify reviews as positive, neutral, or negative.
Support Vector Machine (SVM): SVM is a margin-based classifier that works well with
high-dimensional data (i.e., it copes well with the large vocabularies that text corpora
produce). Similar to LR, the SVM model will be trained on the preprocessed text data to classify
the sentiment of the text as positive or negative. In addition, we will explore the use of the
One-vs-All approach to perform 3-class classification with SVM.
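The One-vs-All setup described above could look like the following sketch. It assumes scikit-learn is installed and uses TF-IDF features with a linear SVM; the toy texts and labels are invented for illustration, and the choice of `LinearSVC` with default settings is our assumption rather than a decided configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled reviews (not taken from the dataset).
texts = [
    "love this serum, my skin glows",
    "amazing scent and great texture",
    "broke me out, very disappointed",
    "terrible smell and sticky feel",
    "it is okay, nothing special",
    "average product, does the job",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# One binary SVM is trained per class; the class with the highest score wins.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, labels)

predictions = model.predict(["great scent", "sticky and disappointed"])
```

With three sentiment classes, the wrapper fits three binary classifiers, each separating one class from the other two.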
Deep learning models use artificial neural networks to learn the underlying patterns from
large amounts of data. They are capable of handling more complex data than classic
machine learning models and are therefore less reliant on data pre-processing.
Model Selection:
Convolutional Neural Networks (CNN): CNNs are traditionally used for image recognition
tasks, but a one-dimensional (1D) CNN can be used for text sentiment analysis.
The CNN model will be trained on the tokenized data to do a 3-class classification of the
sentiment.
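To make the 1D convolution idea concrete, the toy sketch below slides a single filter over a sequence of token embeddings and keeps the strongest response (global max pooling). All numbers, the helper names `conv1d` and `global_max_pool`, and the two-dimensional embeddings are hypothetical; an actual model would learn many filters with a framework such as TensorFlow or PyTorch.

```python
def conv1d(sequence, kernel):
    """Slide `kernel` over `sequence`; both are lists of equal-width vectors."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Dot product between the window of tokens and the filter weights.
        total = sum(
            x * w
            for vec, kvec in zip(window, kernel)
            for x, w in zip(vec, kvec)
        )
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_max_pool(feature_map):
    """Keep only the strongest filter response across all positions."""
    return max(feature_map)


# Hypothetical 2-dimensional embeddings for a 4-token review.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One hypothetical filter that spans 2 consecutive tokens.
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d(embedded, kernel)  # one value per 2-token window
feature = global_max_pool(feature_map)  # single feature passed to the classifier
```

Each filter acts as a detector for a short phrase pattern, and pooling makes the detection independent of where in the review the phrase occurs.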
We will explore using various transformer models such as BERT (Bidirectional Encoder
Representations from Transformers) [2] and GPT (Generative Pretrained Transformer) [3] to
perform sentiment analysis and compare their performance.
4. Proposed Experiments
For the sentiment analysis experiment, we plan to create a systematic process for data
preparation, model training, and evaluation.
Our first step will involve text preprocessing, where tokenization is performed first to break down
the text into individual words or tokens. Following this, we will remove noise such as
punctuation, emojis, and special characters. Stop-word removal will come next, eliminating
common words that do not add significant meaning to the text. Finally, we will apply stemming
or lemmatization to normalize words to their root forms, ensuring consistency across the
dataset. This process is critical for reducing noise and improving the overall performance of the
sentiment analysis models, as detailed in our methodology.
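The four steps above can be sketched as follows. This is a minimal illustration using only the Python standard library; the short stop-word list and the `naive_stem` helper are hypothetical stand-ins for the fuller resources a dedicated NLP library would provide, and the sample review is invented.

```python
import re

# Tiny illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "its", "this", "to", "of", "i", "my"}


def naive_stem(token):
    """Very rough suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # 1. Tokenize: lowercase and split on whitespace.
    tokens = text.lower().split()
    # 2. Remove noise: strip punctuation, emojis, and special characters.
    tokens = [re.sub(r"[^a-z0-9]", "", token) for token in tokens]
    # 3. Remove stop words (and tokens emptied by step 2).
    tokens = [token for token in tokens if token and token not in STOP_WORDS]
    # 4. Normalize each remaining token to a rough root form.
    return [naive_stem(token) for token in tokens]


cleaned = preprocess("I LOVED this moisturizer!!! 😍 It's hydrating and smells amazing.")
# cleaned == ["lov", "moisturizer", "hydrat", "smell", "amaz"]
```

The crude stems ("lov", "amaz") show why a proper stemmer or lemmatizer is preferable in practice; the sketch only demonstrates the order of operations.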
Next, we will partition the data into three sets: 80% for training, 10% for validation, and 10% for
testing. This will allow us to fine-tune the model using the validation set and evaluate the
model's generalization using the test set.
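A minimal sketch of this 80/10/10 partition is shown below. The function name `split_80_10_10`, the fixed seed, and the placeholder reviews are assumptions for illustration; the sketch shuffles uniformly and does not stratify by sentiment class, which may be worth adding if the classes turn out to be imbalanced.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle a copy of `items` and cut it into train, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 10%
    return train, val, test


# Placeholder data standing in for the cleaned reviews.
reviews = [f"review {i}" for i in range(1000)]
train, val, test = split_80_10_10(reviews)
```

Keeping the test set untouched until the final comparison avoids tuning decisions leaking into the reported results.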
For the experiment, we will use Python, leveraging key libraries such as Pandas for data handling,
NLTK and spaCy for text preprocessing, and Scikit-learn for traditional machine learning models.
Advanced deep learning models will be implemented using frameworks like TensorFlow and
PyTorch, ensuring efficient model training on GPU resources available through Google Colab.
Our approach involves running two phases of experiments:
1. Phase 1: We will start with traditional machine learning models, such as Logistic
Regression (LR) and Support Vector Machine (SVM). These will be used as baselines to
classify sentiments using TF-IDF and Word2Vec for feature extraction.
2. Phase 2: We will then advance to more sophisticated models like Convolutional Neural
Networks (CNN) and transformers (such as BERT and GPT) to capture complex
relationships between words and improve contextual understanding. Fine-tuning will be
conducted for the transformers to optimize performance.
Evaluation metrics will include accuracy, precision, recall, and F1-score to compare the
performance of traditional models against deep learning models.
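For reference, the four metrics can be computed as in the sketch below. Macro-averaging across the three classes is our assumption here, since the averaging scheme is not fixed yet; the function names and the sample labels are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    scores = [per_class_scores(y_true, y_pred, label) for label in labels]
    # Macro average: unweighted mean of each metric over the classes.
    precision, recall, f1 = (sum(col) / len(labels) for col in zip(*scores))
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and model predictions.
y_true = ["pos", "pos", "neg", "neu", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neu", "neg", "pos"]
report = evaluate(y_true, y_pred)
```

Macro-averaging gives each sentiment class equal weight, which keeps a rare class (for example, neutral) from being hidden by a dominant one.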
This step-by-step approach allows us to assess the efficiency and accuracy of various models,
ultimately determining the most effective method for sentiment analysis of user reviews.
References
[1] Arun, K., & Srinagesh, A. (2020). Multi-lingual Twitter sentiment analysis using machine
learning. International Journal of Electrical and Computer Engineering (IJECE), 10(6),
5992-6000. doi:10.11591/ijece.v10i6.pp5992-6000
[2] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT).
[3] Radford, A. (2018). Improving language understanding by generative pre-training.
Appendix 1
- “Products” contains information about all beauty products (over 8,000) from the
Sephora online store, including product and brand names, prices, ingredients, ratings,
and all features.
- “review” contains user reviews (about 1 million, covering over 2,000 products) of all
products from the Skincare category, including user appearances and review ratings by
other users.