DS - Lab Report.
Submitted by:
Smriti Tomar : 8024320108
Swaijot Kaur : 8024320115
Vaishali Varshney : 8024320117
Vibhu Tyagi : 8024320123
Submitted to:
Dr. Divisha Garg
DATA SCIENCE - LAB REPORT
Notebook link:
https://colab.research.google.com/drive/1lNl6VgZ0w9LwuIRohX8VY4Pkeyi7vPTi?usp=sharing
About Dataset
The dataset provides a snapshot of user-generated content, encompassing text, timestamps, hashtags,
countries, likes, and retweets.
Key Features
GOAL:
1. SENTIMENT ANALYSIS:
To conduct sentiment analysis on the "Text" column, classifying user-generated content into categories such as positive, negative, and neutral.
2. PLATFORM-SPECIFIC ANALYSIS:
To examine variations in content across different social media platforms using the "Platform" column.
3. GEOGRAPHICAL ANALYSIS:
COLUMNS CONSIDERED:
DATA UNDERSTANDING:
DATA VISUALIZATION
HISTOGRAM:
CODE AND RESULT
CONCLUSION:
Since our main objective is to classify the text into positive, negative, and neutral sentiments only, we
can discard the existing sentiment column and create a new one with values from these three
categories.
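Dropping the original fine-grained sentiment column might look roughly like the sketch below; the column names and toy values are assumptions, not the actual dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "Text": ["I love this!", "This is awful.", "It is a tweet."],
    "Sentiment": [" Joy ", " Anger ", " Calm "],  # original fine-grained labels
})

# Discard the existing fine-grained sentiment column; a new three-class
# column (positive / negative / neutral) is derived later from VADER scores.
df = df.drop(columns=["Sentiment"])
print(df.columns.tolist())
```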
BOX PLOT: CODE AND RESULTS
OUTLIERS: From the above we can see that there are outlier values, but since they are not too extreme, we decided to keep them.
CODE:
We can observe that, among the columns we are considering, "Likes" and "Retweets" are highly correlated (positive correlation). This may cause redundancy.
CONCLUSION: We can remove the "Retweets" column, as it would only introduce redundancy.
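A minimal sketch of how such a correlation check and column drop could look with pandas; the numbers below are toy stand-ins for the real "Likes" and "Retweets" values.

```python
import pandas as pd

# Toy numeric data standing in for the real "Likes" and "Retweets" columns.
df = pd.DataFrame({
    "Likes":    [10, 50, 200, 5, 120],
    "Retweets": [ 2, 11,  45, 1,  30],  # moves almost in lockstep with Likes
    "Hour":     [ 9, 14,  20, 3,  17],
})

# Pairwise Pearson correlations; Likes vs Retweets should be close to 1.
corr = df.corr()
print(corr.loc["Likes", "Retweets"])

# Drop the redundant column.
df = df.drop(columns=["Retweets"])
```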
DATA PREPROCESSING:
2. CHECK DUPLICATES:
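The duplicate check can be sketched as follows; the toy frame is illustrative, and the notebook presumably runs the same calls on the full dataset.

```python
import pandas as pd

# Toy frame with one fully duplicated row.
df = pd.DataFrame({"Text": ["hi", "hi", "bye"], "Likes": [1, 1, 2]})

# Count fully duplicated rows, then drop them.
n_dupes = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(df))
```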
3. CHECK FOR UNIQUE VALUES:
BEFORE:
AFTER:
BEFORE:
AFTER:
AGAIN CHECKING UNIQUE VALUES FOR SPECIFIC COLUMNS:
6. TEXT PREPROCESSING:
TEXT CLEANING
1. Removing non-alphabetic characters.
2. Tokenizing the text.
3. Removing stopwords.
4. Stemming the text:
Stemming is the process of reducing a word to its root word.
WHY? To reduce the dimensionality and complexity of the data.
e.g., actor, actress, acting → act
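The four cleaning steps above can be sketched roughly as below. The helper name, the simple whitespace tokenizer, and the tiny stopword set are illustrative assumptions; the actual notebook presumably uses NLTK's full stopword list and tokenizer.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Tiny illustrative stopword set; the notebook would use NLTK's full list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean_text(text):
    # 1. Keep alphabetic characters only (everything else becomes a space).
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    # 2. Tokenize (simple whitespace split for this sketch).
    tokens = text.split()
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token to its root form.
    return [stemmer.stem(t) for t in tokens]

out = clean_text("The actors were acting in a play!")
print(out)
```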
SENTIMENT SCORING
• The VADER Sentiment Analyzer was used to calculate sentiment polarity scores based on the
text.
• Sentiment scores were categorized as positive, negative, or neutral based on the compound
VADER score.
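The categorization of the compound score can be sketched as a small helper; the ±0.05 cutoffs below are the conventional VADER thresholds and are an assumption about what the notebook used.

```python
# Conventional VADER thresholds on the compound score (an assumption here):
# compound >= 0.05 -> positive, compound <= -0.05 -> negative, else neutral.
def label_from_compound(compound):
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# With VADER itself the score would come from something like:
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
print(label_from_compound(0.7), label_from_compound(-0.6), label_from_compound(0.0))
```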
NOTICE: IT GIVES THE WRONG OUTPUT, PREDICTING THE WRONG SENTIMENT FOR THE 2nd SENTENCE.
PROBLEM FACED:
It leads to over-stemming in our case: separate inflected words are reduced to the same word stem even though they are not related.
For example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same word stem. Though they are etymologically related, their meanings and the contexts they are used in are totally different.
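The over-stemming problem can be reproduced directly with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# All three words, despite different meanings, collapse to the same stem.
stems = {w: stemmer.stem(w) for w in ["universal", "university", "universe"]}
print(stems)
```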
SOLUTION:
Use Text Lemmatization, which returns the base or dictionary form of a word, known as the "lemma." For example, you can expect a lemmatization algorithm to map "runs," "running," and "ran" to the lemma "run."
TEXT CLEANING:
The function was applied to clean the text data, resulting in a new column called Lemmatised Text.
PERFORMING SENTIMENT ANALYSIS ON LEMMATIZED TEXT:
NOTE: We will be working on the Lemmatised Text column only from now onwards.
Import all the important dependencies for text transformation and model training.
DATA SPLITTING:
1. Extracting Features (X) from Processed Text i.e. “Lemmatised Text” col and Labels (y) i.e.
“Sentiment” value for that respective text.
• X: This variable contains the features for the model. In this case, it consists of the 'Lemmatised Text' column from the DataFrame df4: the preprocessed (lemmatised) text data used as input for the model.
• y: This variable contains the labels or target variable, i.e. the 'Sentiment' value, representing the sentiment category for each text.
The .values attribute converts the DataFrame columns into NumPy arrays for compatibility with machine learning libraries.
Purpose:
• Training set (X_train and y_train): Used to train the machine learning model.
• Testing set (X_test and y_test): Used to evaluate the performance of the trained model on unseen data, ensuring the model generalises well.
• test_size=0.2: Specifies that 20% of the data will be used for testing, while 80% will be used for training.
• random_state=52: Fixes the random seed for the split, ensuring reproducibility. Any value can be used; with the seed fixed at 52, everyone who runs the code gets the same split.
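The split described above can be sketched as follows; the toy texts and labels stand in for the real lemmatised texts and sentiment values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real lemmatised texts and sentiment labels.
X = np.array(["good day", "bad news", "ok fine", "love it", "hate it",
              "so so", "great job", "awful mess", "meh", "nice one"])
y = np.array(["positive", "negative", "neutral", "positive", "negative",
              "neutral", "positive", "negative", "neutral", "positive"])

# 80/20 split, reproducible thanks to the fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=52)
print(len(X_train), len(X_test))
```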
TEXT TRANSFORMATION AND BUILD MODEL (MACHINE LEARNING)
Since machine learning models need numbers to work with, not raw text, we use TF-IDF Vectorization (Term Frequency-Inverse Document Frequency) to convert text data into numbers while keeping the focus on important words.
Working of TF-IDF: It converts text (words) into a numerical format by assigning importance to words based on:
Term Frequency (TF): How often the word appears within a given document.
Inverse Document Frequency (IDF): How unique the word is across all documents.
Words that appear frequently in one document but not across many documents get higher scores (e.g., specific keywords like "happy").
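A minimal sketch of TF-IDF vectorisation with scikit-learn; the three toy documents are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["happy happy day", "sad day", "happy news today"]

vectorizer = TfidfVectorizer()
# Sparse matrix of shape (n_documents, vocabulary_size).
X_tfidf = vectorizer.fit_transform(docs)
print(X_tfidf.shape, sorted(vectorizer.vocabulary_))
```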
CODE:
ALGORITHMS USED AND EVALUATING THE MODELS:
We're creating 3 different types of models for our sentiment analysis problem:
Logistic Regression
Support Vector Machine (SVM)
Random Forest
We are choosing Accuracy as our evaluation metric. Furthermore, we're plotting the Confusion Matrix to get an understanding of how our models perform on the 3 sentiment classes: Positive, Negative, and Neutral.
LOGISTIC REGRESSION:
CODE:
EXPLANATION:
Model Initialization:
Model Training:
• Fits (trains) the model using X_train_tfidf (training features) and y_train (training labels).
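A self-contained sketch of the initialization and training steps; the toy corpus below stands in for X_train_tfidf and y_train, and the hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; the notebook fits on X_train_tfidf / y_train instead.
texts  = ["love this", "great stuff", "hate this", "awful thing", "it is ok", "just fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(texts)

# Model initialization, then training on the TF-IDF features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, labels)

# Predict on a new piece of text (transform with the SAME fitted vectorizer).
pred = model.predict(vectorizer.transform(["love this stuff"]))
print(pred)
```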
Model Prediction and Evaluation:
OUTPUT: Generates a detailed classification report with metrics like precision, recall, and F1-score
and a heatmap representing the confusion matrix.
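The evaluation step can be sketched as below with toy predictions; in the notebook y_pred comes from model.predict on the test TF-IDF features, and the heatmap is drawn from the same confusion matrix.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy values; in the notebook these come from model.predict(X_test_tfidf).
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "neutral", "positive", "neutral", "neutral"]

class_order = ["negative", "neutral", "positive"]
# Precision, recall, and F1 per class.
print(classification_report(y_true, y_pred, labels=class_order))

# Rows = true classes, columns = predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=class_order)
print(cm)
# A heatmap of cm could then be drawn, e.g. with seaborn.heatmap(cm, annot=True).
```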
Similarly, we performed SUPPORT VECTOR MACHINE (SVM) and RANDOM FOREST.
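An SVM sketch on the same kind of TF-IDF features; LinearSVC is a common choice for text classification, but the notebook's exact SVM variant and parameters are not shown here, so this is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for the real training data.
texts  = ["love this", "great stuff", "hate this", "awful thing", "it is ok", "just fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

# Linear SVM trained on the TF-IDF features.
svm_model = LinearSVC()
svm_model.fit(X_tfidf, labels)
pred = svm_model.predict(vectorizer.transform(["awful stuff"]))
print(pred)
```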
CODE:
OUTPUT: Classification Report
RANDOM FOREST: CODE:
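A Random Forest sketch on TF-IDF features; the toy corpus and the hyperparameters (n_estimators, random_state) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy corpus standing in for the real training data.
texts  = ["love this", "great stuff", "hate this", "awful thing", "it is ok", "just fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

# An ensemble of decision trees; the seed makes the forest reproducible.
rf_model = RandomForestClassifier(n_estimators=100, random_state=52)
rf_model.fit(X_tfidf, labels)
pred = rf_model.predict(vectorizer.transform(["great thing"]))
print(pred)
```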
OUTPUT: Classification Report
RESULT: We can clearly see that the Random Forest model performs slightly better than the other models we tried, achieving 79% accuracy when classifying the sentiment of a tweet.