DS - Lab Report.

The lab assignment focuses on conducting sentiment analysis on a social media dataset, aiming to classify user-generated content into positive, negative, and neutral sentiments while also analyzing user behavior and geographical variations. Key methodologies include data preprocessing, text cleaning, and applying machine learning models such as Logistic Regression, Random Forest, and Support Vector Machine, with Random Forest achieving the highest accuracy of 79%. The report also discusses challenges faced during sentiment analysis and the implementation of lemmatization for improved accuracy.

Lab Assignment

Data Science Foundations


Subject Code: PCS110

Submitted by:
Smriti Tomar : 8024320108
Swaijot Kaur : 8024320115
Vaishali Varshney : 8024320117
Vibhu Tyagi : 8024320123

Submitted to:
Dr. Divisha Garg

Department of Computer Science and Engineering


Thapar Institute of Engineering and Technology, Patiala

DATA SCIENCE - LAB REPORT

Sentiment Analysis on Social Media Dataset

Notebook link:
https://colab.research.google.com/drive/1lNl6VgZ0w9LwuIRohX8VY4Pkeyi7vPTi?usp=sharing

About Dataset

The dataset provides a snapshot of user-generated content, encompassing text, timestamps, hashtags,
countries, likes, and retweets.

Key Features

GOAL:

1. SENTIMENT ANALYSIS:

To conduct sentiment analysis on the "Text" column, classifying user-generated content into categories
such as positive, negative, and neutral.

2. USER BEHAVIOUR INSIGHTS:

Analyse user engagement through the "Likes" column.

3. PLATFORM-SPECIFIC ANALYSIS:

Examine variations in content across different social media platforms using the "Platform" column.

Understand how sentiments vary across platforms.

4. GEOGRAPHICAL ANALYSIS:

Explore content distribution based on the "Country" column.

Understand regional variations in sentiment and topic preferences.

COLUMNS CONSIDERED:

['Text', 'Platform', 'Likes', 'Country', 'Year', 'Sentiment']

IMPORT LIBRARIES AND DEPENDENCIES:

DATA UNDERSTANDING:

DATA VISUALIZATION

HISTOGRAM:

SENTIMENT COLUMN VISUALIZATION:

CODE AND RESULT

CONCLUSION:

Since our main objective is to classify the text into positive, negative, and neutral sentiments only, we
can discard the existing sentiment column and create a new one with values from these three
categories.

LIKES AND RETWEETS COLUMN VISUALIZATION:

CODE AND RESULTS

BOX PLOT: CODE AND RESULTS

OUTLIERS: The box plots above show some outlier values, but since they are not too
extreme, we decided to keep them.
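
The outlier check behind a box plot can be sketched as follows. This is a minimal illustration that assumes the usual 1.5 x IQR whisker rule; the values in `likes` are hypothetical stand-ins, not the real "Likes" column.

```python
import pandas as pd

# Hypothetical stand-in for the real "Likes" column.
likes = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 35, 80], name="Likes")

# Box-plot convention: points more than 1.5 * IQR beyond the quartiles are outliers.
q1, q3 = likes.quantile(0.25), likes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall outside the whisker bounds.
outliers = likes[(likes < lower) | (likes > upper)]
print(f"bounds: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
```

Listing the flagged values this way makes it easier to judge whether they are extreme enough to drop or, as in our case, mild enough to keep.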

Let's check for correlation between the different features:

CODE:

RESULT : CORRELATION MATRIX

We can observe that, among the columns we are considering, "Likes" and "Retweets" are
highly positively correlated, which may cause redundancy.

Let's make a scatter plot of the two.

CONCLUSION: We can remove the "Retweets" column, as it would only add redundant information.
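
The redundancy check can be sketched as below. The numbers are hypothetical, and the 0.9 cut-off is an illustrative assumption rather than a value taken from the notebook.

```python
import pandas as pd

# Hypothetical numeric columns standing in for the real dataset.
df = pd.DataFrame({
    "Likes":    [10, 20, 30, 40, 50, 60],
    "Retweets": [5, 10, 16, 20, 24, 31],
    "Year":     [2019, 2021, 2020, 2023, 2018, 2022],
})

# Pearson correlation matrix of the numeric columns.
corr = df.corr()
likes_retweets_r = corr.loc["Likes", "Retweets"]
print(f"Likes vs Retweets: r = {likes_retweets_r:.3f}")

# Assumed rule of thumb: drop one of two columns when |r| exceeds 0.9.
if abs(likes_retweets_r) > 0.9:
    df = df.drop(columns=["Retweets"])

print(df.columns.tolist())
```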

WORD CLOUD: CODE AND RESULT:

DATA PREPROCESSING:

1. MISSING VALUES: (NO MISSING VALUES)

2. CHECK DUPLICATES:

3. CHECK FOR UNIQUE VALUES:

4. REMOVE UNNECESSARY COLUMNS:
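
Steps 1 to 4 can be sketched with pandas as below. The DataFrame is a hypothetical stand-in, "Hashtags" is used only as an example of a column outside the considered set, and dropping the duplicate rows is an assumption (the step above only checks for them).

```python
import pandas as pd

# Hypothetical rows; the first and third are exact duplicates.
df = pd.DataFrame({
    "Text": ["great day", "bad service", "great day", "just ok"],
    "Platform": ["Twitter", "Instagram", "Twitter", "Facebook"],
    "Likes": [30, 12, 30, 7],
    "Hashtags": ["#happy", "#angry", "#happy", "#meh"],
})

# 1. Missing values per column.
missing_per_column = df.isnull().sum()

# 2. Number of fully duplicated rows, then drop them (assumed follow-up).
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 3. Number of unique values per column.
unique_counts = df.nunique()

# 4. Remove a column that is not needed for the analysis.
df = df.drop(columns=["Hashtags"])

print(missing_per_column, n_duplicates, unique_counts, sep="\n")
```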

5. HANDLE TRAILING SPACES:

FOR PLATFORM COLUMN

BEFORE :

AFTER:

FOR COUNTRIES COLUMN:

BEFORE

AFTER
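
The whitespace cleanup for both columns can be sketched as follows. The values are hypothetical, and `str.strip()` is used here as an assumption; it removes leading as well as trailing spaces.

```python
import pandas as pd

# Hypothetical values: the same category appears with and without padding.
df = pd.DataFrame({
    "Platform": ["Twitter ", " Twitter", "Instagram", "Facebook  "],
    "Country": ["India ", "India", " USA", "USA "],
})

platforms_before = df["Platform"].nunique()

# Strip surrounding whitespace so padded variants collapse into one category.
for col in ["Platform", "Country"]:
    df[col] = df[col].str.strip()

platforms_after = df["Platform"].nunique()
print(platforms_before, "->", platforms_after)
print(sorted(df["Country"].unique()))
```

Comparing the unique-value counts before and after is a quick way to confirm that the padded variants were merged.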

RE-CHECKING UNIQUE VALUES FOR SPECIFIC COLUMNS:

6. TEXT PREPROCESSING:

TEXT CLEANING

1. Remove non-alphabetic characters.
2. Tokenize the text.
3. Remove stopwords.
4. Stem the text.

Stemming is the process of reducing a word to its root form.
WHY? To reduce the dimensionality and complexity of the data.
e.g. "acting", "acted", and "acts" all reduce to "act".
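
The four cleaning steps can be sketched as a single function. This is a minimal illustration, not the notebook's code: it assumes NLTK's `PorterStemmer`, whitespace tokenization, and a small hard-coded `STOPWORDS` set standing in for a full stopword list.

```python
import re
from nltk.stem import PorterStemmer

# Small illustrative stopword set (assumption); a full list would be larger.
STOPWORDS = {"a", "an", "the", "is", "was", "and", "i", "to", "of", "in", "it", "this"}

stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    # 1. Replace every non-alphabetic character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 2. Lowercase and tokenize on whitespace.
    tokens = letters_only.lower().split()
    # 3. Remove stopwords.
    kept = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token and join back into one string.
    return " ".join(stemmer.stem(t) for t in kept)

print(clean_text("Acting, acted & acts!"))
print(clean_text("The cats!"))
```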

7. CREATE NEW SENTIMENT COLUMN

SENTIMENT SCORING

• The VADER Sentiment Analyzer was used to calculate sentiment polarity scores based on the
text.
• Sentiment scores were categorized as positive, negative, or neutral based on the compound
VADER score.
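
The categorization step can be sketched as below. The cut-offs are an assumption: the report does not state them, so the sketch uses the commonly cited VADER convention of +0.05 and -0.05. The `compound` argument is the value VADER returns under the "compound" key of its polarity scores.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The default threshold of 0.05 is an assumed convention, not a value
    taken from the report.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores for three texts.
for score in (0.62, -0.41, 0.01):
    print(score, "->", label_from_compound(score))
```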

NOTICE: THE OUTPUT IS WRONG. THE SENTIMENT PREDICTED FOR THE 2nd SENTENCE IS
INCORRECT.

Expected: Negative Actual: Neutral

PROBLEM FACED:

Stemming leads to over-stemming in our case: it reduces separate inflected words to the same word stem even
though they are not related in meaning.

For example, the Porter Stemmer algorithm stems "universal", "university", and "universe" to the same
word stem. Though they are etymologically related, their meanings and the contexts in which they are
used are quite different.
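
The over-stemming example can be reproduced with a short check, assuming NLTK's `PorterStemmer` is the stemmer in use.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Three words with different meanings.
words = ["universal", "university", "universe"]
stems = {word: stemmer.stem(word) for word in words}
print(stems)

# Over-stemming: all three collapse to a single shared stem.
distinct_stems = set(stems.values())
print("distinct stems:", len(distinct_stems))
```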

SOLUTION:

Use text lemmatization, which returns the base or dictionary form of a word, known as the
"lemma". For example, a lemmatization algorithm can be expected to map "runs", "running", and "ran"
to the lemma "run".

This leads to more accurate results.

IMPLEMENTATION: import dependencies:

TEXT CLEANING:

• Removing non-alphabetic characters.
• Tokenizing the text.
• Removing stopwords.
• Lemmatizing tokens.

The function was applied to clean the text data, resulting in a new column called "Lemmatized Text".

COMPARING OUTPUTS OF BOTH STEMMING (Clean Text Column) AND
LEMMATIZATION (Lemmatized Text Column) WITH THE ORIGINAL TEXT COLUMN (Text)

PERFORMING SENTIMENT ANALYSIS ON LEMMATIZED TEXT:

Now we get the correct output for the Sentiment column.

NOTE: From here onwards, we work only with the "Lemmatized Text" column.

The NEXT STEP is to PERFORM TEXT VECTORIZATION.

Import all the important dependencies for text transformation and model training.

DATA SPLITTING:

1. Extracting features (X) from the processed text, i.e. the "Lemmatized Text" column, and labels (y), i.e. the
"Sentiment" value for the respective text.

• X: This variable contains the features for the model. In this case, it consists of the
column 'Lemmatized Text' from the DataFrame df4. This is the preprocessed
(lemmatized) text data used as input for the model.
• y: This variable contains the labels or target variable ('Sentiment'), representing
the sentiment classification labels: positive, negative, or neutral.

The .values attribute converts the DataFrame columns into NumPy arrays for compatibility with machine
learning libraries.

2. Splitting the Data into Training and Testing Sets.

Purpose:

Training set (X_train and y_train): Used to train the machine learning model.

Testing set (X_test and y_test): Used to evaluate the performance of the trained model on unseen
data, to check how well it generalizes.

• train_test_split: This function, from sklearn.model_selection, divides the dataset into:

X_train: Training features.

X_test: Testing features.

y_train: Training labels.

y_test: Testing labels.

• test_size=0.2: Specifies that 20% of the data will be used for testing, while 80% will be used
for training.
• random_state=52: Ensures reproducibility by fixing the random seed for the split, so the same
split occurs every time the code runs. Any integer value can be used; we chose 52.
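
The split described above can be sketched as follows. The arrays `X` and `y` are small hypothetical stand-ins for the lemmatized text and sentiment labels; `test_size` and `random_state` match the values given above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for df4['Lemmatized Text'].values and df4['Sentiment'].values.
X = np.array([f"sample text {i}" for i in range(10)])
y = np.array(["positive", "negative"] * 5)

# 80% training, 20% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=52
)

# Repeating the call with the same seed gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=52
)

print(len(X_train), len(X_test))
print(list(X_test) == list(X_test_again))
```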

TEXT TRANSFORMATION AND BUILD MODEL (MACHINE LEARNING)

Since machine learning models need numbers to work with, not raw text, we use TF-IDF Vectorization
(Term Frequency-Inverse Document Frequency) to convert text data into numbers while keeping
the focus on important words.

Working of TF-IDF: It converts text (words) into a numerical format by assigning importance to
words based on:

Term Frequency (TF): How often a word appears in a document.

Inverse Document Frequency (IDF): How rare the word is across all documents.

Words that appear frequently in one document but in few other documents get higher scores (e.g.,
specific keywords like "happy").

CODE:

1. TfidfVectorizer(max_features=5000): Limits the vocabulary to the 5,000 most frequent terms across the
corpus (features).
2. fit_transform(X_train): Learns the vocabulary and IDF weights from the training data and
transforms it into a matrix of numbers.
3. transform(X_test): Converts the test data into numbers using the same learned vocabulary and IDF weights.
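
The three calls above can be sketched on a toy corpus. The sentences are hypothetical; only `max_features=5000` and the fit/transform pattern come from the steps listed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test texts.
X_train = ["happy sunny day", "sad rainy day", "happy happy holiday"]
X_test = ["happy rainy holiday", "unseen word"]

vectorizer = TfidfVectorizer(max_features=5000)

# Learn vocabulary and IDF weights from the training texts only.
X_train_tfidf = vectorizer.fit_transform(X_train)

# Reuse the learned vocabulary; words never seen in training are ignored.
X_test_tfidf = vectorizer.transform(X_test)

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting on the training texts only keeps information from the test set out of the learned weights, which is why the test data goes through `transform` rather than `fit_transform`.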

ALGORITHMS USED AND EVALUATING THE MODELS:

We’re creating 3 different types of model for our sentiment analysis problem:

Logistic Regression

Random Forest

Support Vector Machine (SVM)

We chose accuracy as our evaluation metric. Furthermore, we plot the confusion
matrix to understand how each model performs on the 3 classes: Positive,
Negative, and Neutral.

LOGISTIC REGRESSION:

CODE:

EXPLANATION:

Model Initialization:

• Creates a Logistic Regression model.
• max_iter=50: Limits the number of iterations for optimization to 50.
• random_state=42: Ensures reproducibility by fixing the random seed.

Model Training:

• Fits (trains) the model using X_train_tfidf (training features) and y_train (training labels).

Model Prediction and Evaluation:

• Predicts labels for the test data (X_test_tfidf).
• Calculates the model's accuracy by comparing predicted labels (y_pred_logistic) with actual
labels (y_test).

OUTPUT: Generates a detailed classification report with metrics such as precision, recall, and F1-score,
and a heatmap representing the confusion matrix.
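
The Logistic Regression steps described above can be sketched end to end as follows. The texts and labels are hypothetical toy data, so the printed scores say nothing about the real dataset; the hyperparameters and variable names follow the explanation above, and the confusion matrix is printed rather than drawn as a heatmap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical toy data standing in for the lemmatized text and its labels.
X_train = [
    "love this great product", "wonderful happy experience", "great amazing day",
    "hate this terrible product", "awful sad experience", "terrible horrible day",
    "package arrived today", "meeting scheduled monday", "report submitted today",
]
y_train = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3
X_test = ["great happy product", "terrible awful day", "meeting today"]
y_test = ["positive", "negative", "neutral"]

# Vectorize: fit on the training texts, reuse the fitted vectorizer for the test texts.
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Model initialization and training.
logistic_model = LogisticRegression(max_iter=50, random_state=42)
logistic_model.fit(X_train_tfidf, y_train)

# Prediction and evaluation.
y_pred_logistic = logistic_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred_logistic)

labels = ["negative", "neutral", "positive"]
cm = confusion_matrix(y_test, y_pred_logistic, labels=labels)

print(f"accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred_logistic, labels=labels, zero_division=0))
print(cm)
```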

Similarly, we trained RANDOM FOREST and SUPPORT VECTOR MACHINE (SVM) models.

CODE:

OUTPUT: Classification Report

RANDOM FOREST: CODE:

OUTPUT- Classification Report

RESULT: The Random Forest model performs slightly better than the other
models we tried. It achieves 79% accuracy when classifying the sentiment of a tweet.
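
A side-by-side comparison of the three models can be sketched as below. This is an illustration on hypothetical toy data and will not reproduce the 79% figure. The Random Forest and SVM settings are assumptions (scikit-learn defaults with a fixed seed); only the Logistic Regression parameters come from the report.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Hypothetical toy data standing in for the lemmatized text and its labels.
X_train = [
    "love this great product", "wonderful happy experience", "great amazing day",
    "hate this terrible product", "awful sad experience", "terrible horrible day",
    "package arrived today", "meeting scheduled monday", "report submitted today",
]
y_train = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3
X_test = ["great happy product", "terrible awful day", "meeting today"]
y_test = ["positive", "negative", "neutral"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Random Forest and SVM hyperparameters are assumed defaults, not the report's values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=50, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Train each model on the same features and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test_tfidf))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```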
