DS - Lab Report.
Submitted by:
Smriti Tomar : 8024320108
Swaijot Kaur : 8024320115
Vaishali Varshney : 8024320117
Vibhu Tyagi : 8024320123
Submitted to:
Dr. Divisha Garg
DATA SCIENCE - LAB REPORT
Notebook link:
https://colab.research.google.com/drive/1lNl6VgZ0w9LwuIRohX8VY4Pkeyi7vPTi?usp=sharing
About Dataset
The dataset provides a snapshot of user-generated content, encompassing text, timestamps, hashtags,
countries, likes, and retweets.
Key Features
GOAL:
1. SENTIMENT ANALYSIS:
To conduct sentiment analysis on the "Text" column, classifying user-generated content into categories such as positive, negative, and neutral.
2. PLATFORM-SPECIFIC ANALYSIS:
To examine variations in content across different social media platforms using the "Platform" column.
3. GEOGRAPHICAL ANALYSIS:
COLUMNS CONSIDERED:
DATA UNDERSTANDING:
DATA VISUALIZATION
HISTOGRAM:
CODE AND RESULT
CONCLUSION:
Since our main objective is to classify the text into positive, negative, and neutral sentiments only, we
can discard the existing sentiment column and create a new one with values from these three
categories.
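Dropping the original fine-grained sentiment column might look roughly like the sketch below; the column names and toy values are assumptions, not the actual dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "Text": ["I love this!", "This is awful.", "It is a tweet."],
    "Sentiment": [" Joy ", " Anger ", " Calm "],  # original fine-grained labels
})

# Discard the existing fine-grained sentiment column; a new three-class
# column (positive / negative / neutral) is derived later from VADER scores.
df = df.drop(columns=["Sentiment"])
print(df.columns.tolist())
```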
BOX PLOT: CODE AND RESULTS
OUTLIERS: From the above we can see that there are outlier values, but since they are not too extreme, we decided to keep them.
CODE:
We can observe that, among the columns we are considering, "Likes" and "Retweets" are highly correlated (positive correlation). This may cause redundancy.
CONCLUSION: We can remove the "Retweets" column, as it would only introduce redundancy.
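A minimal sketch of how such a correlation check and column drop could look with pandas; the numbers below are toy stand-ins for the real "Likes" and "Retweets" values.

```python
import pandas as pd

# Toy numeric data standing in for the real "Likes" and "Retweets" columns.
df = pd.DataFrame({
    "Likes":    [10, 50, 200, 5, 120],
    "Retweets": [ 2, 11,  45, 1,  30],  # moves almost in lockstep with Likes
    "Hour":     [ 9, 14,  20, 3,  17],
})

# Pairwise Pearson correlations; Likes vs Retweets should be close to 1.
corr = df.corr()
print(corr.loc["Likes", "Retweets"])

# Drop the redundant column.
df = df.drop(columns=["Retweets"])
```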
DATA PREPROCESSING:
2. CHECK DUPLICATES:
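The duplicate check can be sketched as follows; the toy frame is illustrative, and the notebook presumably runs the same calls on the full dataset.

```python
import pandas as pd

# Toy frame with one fully duplicated row.
df = pd.DataFrame({"Text": ["hi", "hi", "bye"], "Likes": [1, 1, 2]})

# Count fully duplicated rows, then drop them.
n_dupes = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(df))
```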
3. CHECK FOR UNIQUE VALUES:
BEFORE:
AFTER:
BEFORE:
AFTER:
AGAIN CHECKING UNIQUE VALUES FOR SPECIFIC COLUMNS:
6. TEXT PREPROCESSING:
TEXT CLEANING
1. Removing non-alphabetic characters.
2. Tokenizing the text.
3. Removing stopwords.
4. Stemming the text:
Stemming is the process of reducing a word to its root word.
WHY? To reduce the dimensionality and complexity of the data.
e.g., actor, actress, acting → act
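The four cleaning steps above can be sketched roughly as below. The helper name, the simple whitespace tokenizer, and the tiny stopword set are illustrative assumptions; the actual notebook presumably uses NLTK's full stopword list and tokenizer.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Tiny illustrative stopword set; the notebook would use NLTK's full list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean_text(text):
    # 1. Keep alphabetic characters only (everything else becomes a space).
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    # 2. Tokenize (simple whitespace split for this sketch).
    tokens = text.split()
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token to its root form.
    return [stemmer.stem(t) for t in tokens]

out = clean_text("The actors were acting in a play!")
print(out)
```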
SENTIMENT SCORING
• The VADER Sentiment Analyzer was used to calculate sentiment polarity scores based on the
text.
• Sentiment scores were categorized as positive, negative, or neutral based on the compound
VADER score.
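The categorization of the compound score can be sketched as a small helper; the ±0.05 cutoffs below are the conventional VADER thresholds and are an assumption about what the notebook used.

```python
# Conventional VADER thresholds on the compound score (an assumption here):
# compound >= 0.05 -> positive, compound <= -0.05 -> negative, else neutral.
def label_from_compound(compound):
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# With VADER itself the score would come from something like:
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
print(label_from_compound(0.7), label_from_compound(-0.6), label_from_compound(0.0))
```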
NOTICE: IT GIVES THE WRONG OUTPUT, PREDICTING THE WRONG SENTIMENT FOR THE 2nd SENTENCE.
PROBLEM FACED:
It leads to over-stemming in our case: separate inflected words are reduced to the same word stem even though they are not related.
For example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same word stem. Though they are etymologically related, their meanings and the contexts they are used in are totally different.
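The over-stemming problem can be reproduced directly with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# All three words, despite different meanings, collapse to the same stem.
stems = {w: stemmer.stem(w) for w in ["universal", "university", "universe"]}
print(stems)
```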
SOLUTION:
Use Text Lemmatization, which returns the base or dictionary form of a word, known as the "lemma." For example, you can expect a lemmatization algorithm to map "runs," "running," and "ran" to the lemma "run."
TEXT CLEANING:
The function was applied to clean the text data, resulting in a new column called Lemmatised Text.
PERFORMING SENTIMENT ANALYSIS ON LEMMATIZED TEXT:
NOTE: We will be working on the Lemmatised Text column only from now onwards.
Import all the important dependencies for text transformation and model training.
DATA SPLITTING:
1. Extracting Features (X) from Processed Text i.e. “Lemmatised Text” col and Labels (y) i.e.
“Sentiment” value for that respective text.
• X: This variable contains the features for the model. In this case, it consists of the 'Lemmatised Text' column from the DataFrame df4: the preprocessed (lemmatised) text data used as input for the model.
• y: This variable contains the labels or target variable, i.e. the 'Sentiment' value, representing the sentiment category for each text.
The .values attribute converts the DataFrame columns into NumPy arrays for compatibility with machine learning libraries.
Purpose:
• Training set (X_train and y_train): Used to train the machine learning model.
• Testing set (X_test and y_test): Used to evaluate the performance of the trained model on unseen data, ensuring the model generalises well.
• test_size=0.2: Specifies that 20% of the data will be used for testing, while 80% will be used for training.
• random_state=52: Fixes the random seed for the split, ensuring reproducibility. Any value can be used; with the seed fixed at 52, everyone who runs the code gets the same split.
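The split described above can be sketched as follows; the toy texts and labels stand in for the real lemmatised texts and sentiment values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real lemmatised texts and sentiment labels.
X = np.array(["good day", "bad news", "ok fine", "love it", "hate it",
              "so so", "great job", "awful mess", "meh", "nice one"])
y = np.array(["positive", "negative", "neutral", "positive", "negative",
              "neutral", "positive", "negative", "neutral", "positive"])

# 80/20 split, reproducible thanks to the fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=52)
print(len(X_train), len(X_test))
```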
TEXT TRANSFORMATION AND BUILD MODEL (MACHINE LEARNING)
Since machine learning models need numbers to work with, not raw text, we use TF-IDF Vectorization (Term Frequency-Inverse Document Frequency) to convert text data into numbers while keeping the focus on important words.
Working of TF-IDF: It converts text (words) into a numerical format by assigning importance to words based on:
Term Frequency (TF): How often the word appears within a given document.
Inverse Document Frequency (IDF): How unique the word is across all documents.
Words that appear frequently in one document but not across many documents get higher scores (e.g., specific keywords like "happy").
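A minimal sketch of TF-IDF vectorisation with scikit-learn; the three toy documents are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["happy happy day", "sad day", "happy news today"]

vectorizer = TfidfVectorizer()
# Sparse matrix of shape (n_documents, vocabulary_size).
X_tfidf = vectorizer.fit_transform(docs)
print(X_tfidf.shape, sorted(vectorizer.vocabulary_))
```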
CODE:
ALGORITHMS USED AND EVALUATING THE MODELS:
We're creating 3 different types of models for our sentiment analysis problem:
Logistic Regression
Support Vector Machine (SVM)
Random Forest
We are choosing Accuracy as our evaluation metric. Furthermore, we're plotting the Confusion Matrix to get an understanding of how our models perform on the 3 sentiment classes: Positive, Negative, and Neutral.
LOGISTIC REGRESSION:
CODE:
EXPLANATION:
Model Initialization:
Model Training:
• Fits (trains) the model using X_train_tfidf (training features) and y_train (training labels).
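A self-contained sketch of the initialization and training steps; the toy corpus below stands in for X_train_tfidf and y_train, and the hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; the notebook fits on X_train_tfidf / y_train instead.
texts  = ["love this", "great stuff", "hate this", "awful thing", "it is ok", "just fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(texts)

# Model initialization, then training on the TF-IDF features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, labels)

# Predict on a new piece of text (transform with the SAME fitted vectorizer).
pred = model.predict(vectorizer.transform(["love this stuff"]))
print(pred)
```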
Model Prediction and Evaluation:
OUTPUT: Generates a detailed classification report with metrics like precision, recall, and F1-score
and a heatmap representing the confusion matrix.
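The evaluation step can be sketched as below with toy predictions; in the notebook y_pred comes from model.predict on the test TF-IDF features, and the heatmap is drawn from the same confusion matrix.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy values; in the notebook these come from model.predict(X_test_tfidf).
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "neutral", "positive", "neutral", "neutral"]

class_order = ["negative", "neutral", "positive"]
# Precision, recall, and F1 per class.
print(classification_report(y_true, y_pred, labels=class_order))

# Rows = true classes, columns = predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=class_order)
print(cm)
# A heatmap of cm could then be drawn, e.g. with seaborn.heatmap(cm, annot=True).
```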
Similarly, we performed SUPPORT VECTOR MACHINE (SVM) and RANDOM FOREST.
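An SVM sketch on the same kind of TF-IDF features; LinearSVC is a common choice for text classification, but the notebook's exact SVM variant and parameters are not shown here, so this is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for the real training data.
texts  = ["love this", "great stuff", "hate this", "awful thing", "it is ok", "just fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

# Linear SVM trained on the TF-IDF features.
svm_model = LinearSVC()
svm_model.fit(X_tfidf, labels)
pred = svm_model.predict(vectorizer.transform(["awful stuff"]))
print(pred)
```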
CODE:
OUTPUT: Classification Report
RANDOM FOREST: CODE:
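A Random Forest sketch on TF-IDF features; the toy corpus and the hyperparameters (n_estimators, random_state) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy corpus standing in for the real training data.
texts  = ["love this", "great stuff", "hate this", "awful thing", "it is ok", "just fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

# An ensemble of decision trees; the seed makes the forest reproducible.
rf_model = RandomForestClassifier(n_estimators=100, random_state=52)
rf_model.fit(X_tfidf, labels)
pred = rf_model.predict(vectorizer.transform(["great thing"]))
print(pred)
```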
OUTPUT: Classification Report
RESULT: We can clearly see that the Random Forest model performs slightly better than the other models we tried, achieving 79% accuracy when classifying the sentiment of a tweet.