
Vidyavardhini’s College of Engineering & Technology

Department of Computer Engineering

Name: Sachin Yadav


Roll No.: 22
Experiment No.: 05
Develop Content (text, emoticons, image, audio, video) based social media analytics model for business.
Date of Performance: 24/02/2024
Date of Submission: 08/03/2024

CSDL8023: Social Media Analytics Lab



Aim: Develop Content (text, emoticons, image, audio, video) based social media analytics
model for business.

Objective: To build a comprehensive social media analytics model utilizing Python, capable of
analyzing various content types including text, emoticons, images, audio, and video. This model
aims to extract valuable insights from diverse content sources, enabling businesses to understand
user sentiment, preferences, and engagement patterns across multiple media formats for
informed decision-making and targeted marketing strategies.

Theory:

Social media sentiment analysis is the process of collecting and analyzing information on the
emotions behind how people talk about your brand on social media. Rather than a simple count
of mentions or comments, sentiment analysis considers feelings and opinions. Social media
sentiment analysis is sometimes called “opinion mining.”

Sentiment analysis, a fundamental aspect of Natural Language Processing (NLP), entails the
classification of text based on polarity, typically categorized as positive, negative, or neutral.
Early approaches to sentiment analysis relied on rule-based methodologies, exemplified by
Python libraries such as TextBlob and NLTK-VADER.
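As an illustration, here is a minimal NLTK-VADER sketch (the example sentence is invented; it assumes the vader_lexicon corpus has been downloaded):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos proportions plus a compound score in [-1, 1]
print(sia.polarity_scores("I love this song, it is absolutely brilliant!"))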

A consistent observation is that the efficacy of sentiment classification improves with methods
capable of capturing contextual nuances. Various techniques for encoding or embedding text
have been developed to enhance context awareness, consequently leading to higher accuracy in
sentiment classification tasks.

What is TextBlob?
TextBlob is a Python library for processing textual data. It provides a simple API for common
natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, and more. TextBlob is built on top of the Natural Language
Toolkit (NLTK) and provides an easier-to-use interface with additional functionalities.
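A minimal usage sketch (the example sentence is illustrative; TextBlob's corpora can be fetched once with python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("The new video is wonderful, great job!")
# sentiment is a namedtuple: polarity in [-1, 1], subjectivity in [0, 1]
print(blob.sentiment)
print(blob.tags)          # part-of-speech tags
print(blob.noun_phrases)  # extracted noun phrases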


What is WordCloud?
A word cloud is a visual representation of text data. It is a collection of words depicted in
different sizes, where the size of each word indicates its importance or frequency within the
given text. The more a specific word appears in a source of textual data, the bigger and bolder it
appears in the word cloud. Word clouds are used to quickly identify the most common words in a
text and to help visualize the main themes of the text.
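A minimal sketch with the wordcloud package (the sample text is invented for illustration):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "hello world hello song song song walker"
# word size in the rendered cloud is proportional to term frequency
wc = WordCloud(background_color='white', collocations=False).generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()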

Performing sentiment analysis in Python typically involves the following steps:

Data Collection:-
Obtain a dataset containing text data for sentiment analysis. This could come from social media
platforms, review websites, or any other relevant source, gathered with collection tools such as Netlytic or Octoparse.

Data Preprocessing:-
● Text Cleaning: Remove noise such as HTML tags, special characters, punctuation, and
stopwords.
● Tokenization: Split the text into individual words or tokens.
● Normalization: Convert the text to lowercase to ensure uniformity.
● Stemming or Lemmatization: Reduce words to their base or root form to improve
analysis accuracy.
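A minimal NLTK sketch of these steps (assuming the stopwords, punkt, and wordnet corpora are downloaded; the sample comment is invented):

import re, string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    text = re.sub(r'<[^>]+>', '', text)  # text cleaning: strip HTML tags
    text = text.lower()                  # normalization: lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    tokens = word_tokenize(text)         # tokenization
    stop = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()     # lemmatization to base forms
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop]

print(preprocess("<b>I LOVED the new videos!</b>"))  # ['loved', 'new', 'video']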

Feature Extraction:-
Convert the preprocessed text into numerical features that can be understood by machine
learning algorithms. Common techniques include:
● Bag of Words (BoW): Represent each document as a vector of word counts.
● Term Frequency-Inverse Document Frequency (TF-IDF): Assign weights to words based
on their importance in the document and across the entire corpus.
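A minimal scikit-learn sketch of both encodings (the two documents are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great song great vibes", "nice song"]

bow = CountVectorizer()                      # Bag of Words: raw term counts
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()                    # TF-IDF: counts reweighted by corpus rarity
print(tfidf.fit_transform(docs).toarray())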


Post-processing:-
Optionally, perform additional steps such as thresholding or confidence scoring to refine the
sentiment predictions.
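For example, a simple thresholding rule (the same ±0.05 cut-off applied to the TextBlob scores later in the code):

def label_sentiment(polarity, threshold=0.05):
    # scores inside the dead zone around 0 are treated as Neutral
    if polarity >= threshold:
        return 'Positive'
    if polarity <= -threshold:
        return 'Negative'
    return 'Neutral'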

Visualization and Interpretation:-
Visualize the results to gain insights into the sentiment distribution and analyze the model's
behavior.

Code:



import pandas as pd

!pip install snscrape

Collecting snscrape
  Downloading snscrape-0.7.0.20230622-py3-none-any.whl (74 kB)
Requirement already satisfied: requests[socks], lxml, beautifulsoup4, filelock and their dependencies in /usr/local/lib/python3.10/dist-packages
Installing collected packages: snscrape
Successfully installed snscrape-0.7.0.20230622

import snscrape.modules.twitter as sntwitter

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Unzipping corpora/stopwords.zip.
True

from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

import string
import re
import textblob
from textblob import TextBlob

from wordcloud import WordCloud, STOPWORDS

from wordcloud import ImageColorGenerator

import warnings
%matplotlib inline

import os

# Using the OS library to call CLI commands from Python
os.system("snscrape --jsonl --max-results 5000 --since 2023-01-31 twitter-search 'Budget 2023 until:2023-02-03' > text-query-tweets.json")

256

import pandas as pd

# Reads the json generated from the CLI commands above and creates a pandas dataframe
tweets_df = pd.read_json('text-query-tweets.json', lines=True)

tweets_df.head()
tweets_df.to_csv()

'""\n'

df = pd.read_csv("./Alan_walker_Hello_world.csv")

df.head(5)

    id  author             description                                          guid                         to   likecount
0   1   @steveodyuo192     Perfect song n music..good job..from north eas...   Ugxa6AhoP2tgLxuTfx54AaABAg   NaN  0
1   2   @HieuHoang-        great                                                UgyXaMpozA_GxcuyXfR4AaABAg   NaN  0
2   3   @VideosGospels     🏛Romans 2: 9 to 16. Tribulation and 🤮🤮 anguish ...  UgwH_Kryk9j51c3VBLp4AaABAg   NaN  0
3   4   @skibniewska3782   Hello, hello, hello, world.                          UgxWKbvkC816aLIU38J4AaABAg   NaN  0
4   5   @Thethunder007k    I love ❤❤ alan walker                               Ugzop-cnCZ6sHhvQ1PV4AaABAg   NaN  0

print(df.shape)

(557, 12)

df.info()
df.pubdate.value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 557 entries, 0 to 556
Data columns (total 12 columns):
# Column Non-Null Count Dtype

0 id 557 non-null int64


1 author 557 non-null object
2 description 557 non-null object
3 guid 557 non-null object
4 to 111 non-null object
5 likecount 557 non-null int64
6 link 557 non-null object
7 pubdate 557 non-null object
8 replycount 557 non-null int64
9 title 557 non-null object
10 authorChannelUrl 557 non-null object
11 Unnamed: 11 0 non-null float64
dtypes: float64(1), int64(3), object(8)
memory usage: 52.3+ KB
2022-03-03 21:14:39 2
2022-03-03 21:13:44 2
2022-03-03 21:14:29 2
2022-04-05 10:13:43 2
2022-03-08 01:36:27 1
..
2022-08-08 13:54:55 1
2022-08-08 13:55:10 1
2022-08-08 17:27:37 1
2022-08-09 03:43:02 1
2022-03-03 21:13:13 1
Name: pubdate, Length: 553, dtype: int64

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Heat Map for missing values
plt.figure(figsize=(17, 5))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("Column_Name", size=14, weight="bold")
plt.title("Places of missing values in column",fontweight="bold",size=17)
plt.show()

import plotly.graph_objects as go
Top_Title_Of_tweet= df['title'].value_counts().head(10)

print (Top_Title_Of_tweet)

Get best remex hello world by Alan walker here https://youtu.be/SfwPUuGOmw0    34
Yes                                                                             6
Hello you are best among others hello MALAWI                                    2
Me                                                                              2
Nice                                                                            2
Love this song                                                                  2
Nice song                                                                       2
You                                                                             2
🤮🤮🤮🤮                                                                        2
The best                                                                        2
Name: title, dtype: int64

import nltk

stop=nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Package stopwords is already up-to-date!

from nltk.corpus import stopwords


stop = stopwords.words('english')
# Remove English stopwords word-by-word (the original cell referenced an undefined
# df1['renderedContent']; the comment text actually lives in df['description'], and
# iterating the raw string would filter characters rather than words)
df['description'] = df['description'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop))
df.shape

(557, 12)

!pip install tweet-preprocessor

Collecting tweet-preprocessor
Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0
#Remove unnecessary characters
punct = ['%', '/', ':', '\\', '&amp;', '&', ';', '?']

def remove_punctuations(text):
    for punctuation in punct:
        text = text.replace(punctuation, '')
    return text

df=df.drop_duplicates('description')

from nltk.corpus import stopwords


stop = stopwords.words('english')
# Re-apply word-level stopword removal after dropping duplicate descriptions
df['description'] = df['description'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop))
df.shape

(502, 12)

df['description'] = df['description'].apply(lambda x: remove_punctuations(x))

<ipython-input-36-3d2e3b3af81c>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus


df['description'] = df['description'].apply(lambda x: remove_punctuations(x))

#Drop tweets which have empty text field


df['description'].replace(' ', np.nan, inplace=True)
df.dropna(subset=['description'], inplace=True)
len(df)

<ipython-input-37-fed08228fedf>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus


df['description'].replace(' ', np.nan, inplace=True)
<ipython-input-37-fed08228fedf>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus


df.dropna(subset=['description'], inplace=True)
502

df = df.reset_index(drop=True)
df.sample(5)

     id   author                description                                         guid                         to   likec
4    5    @Thethunder007k       I love ❤❤ alan walker                              Ugzop-cnCZ6sHhvQ1PV4AaABAg   NaN
396  418  @mei3094              I am not home. I am an immigrant. This is not ...  Ugw5i7Q3VV2iCfWWLKR4AaABAg   NaN
419  450  @ayanocybergod        Thanks for this video, u make good mood for me...  UgyyyykL52PuD3FQuo94AaABAg   NaN
89   94   @patrickparreno7741   Is this stupid heart                               UgzoM74YWPzx4wOHNLx4AaABAg   NaN
45   48   @kwokhocheng          田中,                                               Ugz1qlqdmI0PhQ5KuMx4AaABAg   NaN

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


AttributeError Traceback (most recent call last)
<ipython-input-41-492b1091e77a> in <cell line: 10>()
8 X = vectorizer.fit_transform(corpus)
9
---> 10 features = vectorizer.get_feature_names()
11
12 features

AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'
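The traceback comes from a scikit-learn API rename: get_feature_names() was deprecated in 1.0 and removed in 1.2, so on current versions the failing line should read (as the plotting function below already does):

features = vectorizer.get_feature_names_out()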

sns.set_style('whitegrid')
%matplotlib inline
stop = stop + ['hello','good','http','love','happy','Thank','heart','me','mood','video','india']

def plot_20_most_common_words(count_data, count_vectorizer):
    import matplotlib.pyplot as plt
    words = count_vectorizer.get_feature_names_out()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts += t.toarray()[0]

    count_dict = sorted(zip(words, total_counts), key=lambda x: x[1], reverse=True)[0:20]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words))

    plt.figure(2, figsize=(40, 40))
    plt.subplot(title='20 most common words')
    sns.set_context("notebook", font_scale=4, rc={"lines.linewidth": 2.5})
    sns.barplot(x=x_pos, y=counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90)
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()

# Initialise the count vectorizer with the extended stop-word list
count_vectorizer = CountVectorizer(stop_words=stop)
# Fit and transform the processed descriptions
count_data = count_vectorizer.fit_transform(df['description'])
# Visualise the 20 most common words
plot_20_most_common_words(count_data, count_vectorizer)
plt.savefig('saved_figure.png')
<ipython-input-50-6216072cd042>:21: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.

sns.barplot(x=x_pos, y=counts, palette='husl')

<Figure size 640x480 with 0 Axes>


import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 4), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['description'], 8)
bigram_df = pd.DataFrame(common_words, columns=['ngram', 'count'])
bigram_df.groupby('ngram').sum()['count'].sort_values().plot.barh(title='Top 8 bigrams ', color='orange')
#iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 10 bigrams ')

<Axes: title={'center': 'Top 8 bigrams '}, ylabel='ngram'>

!pip install textblob

Requirement already satisfied: textblob in /usr/local/lib/python3.10/dist-packages (0.17.1)


Requirement already satisfied: nltk>=3.1 in /usr/local/lib/python3.10/dist-packages (from textblob) (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk>=3.1->textblob) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk>=3.1->textblob) (1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk>=3.1->textblob) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk>=3.1->textblob) (4.66.1)

from textblob import TextBlob

def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def get_polarity(text):
    return TextBlob(text).sentiment.polarity

df['subjectivity'] = df['description'].apply(get_subjectivity)
df['polarity'] = df['description'].apply(get_polarity)
df.head()

    id  author             description                                          guid                         to   likecount
0   1   @steveodyuo192     Perfect song n music..good job..from north eas...   Ugxa6AhoP2tgLxuTfx54AaABAg   NaN  0
1   2   @HieuHoang-        great                                                UgyXaMpozA_GxcuyXfR4AaABAg   NaN  0
2   3   @VideosGospels     🏛Romans 2 9 to 16. Tribulation and 🤮🤮 anguish w...  UgwH_Kryk9j51c3VBLp4AaABAg   NaN  0
3   4   @skibniewska3782   Hello, hello, hello, world.                          UgxWKbvkC816aLIU38J4AaABAg   NaN  0
4   5   @Thethunder007k    I love ❤❤ alan walker                               Ugzop-cnCZ6sHhvQ1PV4AaABAg   NaN  0

# Obtain polarity scores generated by TextBlob
df['textblob_score'] = df['description'].apply(lambda x: TextBlob(x).sentiment.polarity)

neutral_threshold = 0.05

# Convert polarity score into sentiment categories
df['textblob_sentiment'] = df['textblob_score'].apply(
    lambda c: 'Positive' if c >= neutral_threshold else ('Negative' if c <= -neutral_threshold else 'Neutral'))

textblob_df = df[['description', 'textblob_sentiment', 'likecount']]


textblob_df

description textblob_sentiment likecount


0 Perfect song n music..good job..from north eas... Positive 0

1 great Positive 0

2 🏛Romans 2 9 to 16. Tribulation and 🤮🤮 anguish w... Positive 0

3 Hello, hello, hello, world. Neutral 0

4 I love ❤❤ alan walker Positive 0

... ... ... ...

497 Who's awake Neutral 31

498 Mee! Neutral 3

499 I’m awake Neutral 3

500 I AMMMM. It's only 914pm- Neutral 2

501 First Positive 4

502 rows × 3 columns

textblob_df['textblob_sentiment'].value_counts()

Neutral 286
Positive 195
Negative 21
Name: textblob_sentiment, dtype: int64

textblob_df['textblob_sentiment'].value_counts().plot.barh(title='Sentiment Analysis ', color='orange', width=.4, figsize=(10, 8), stacked=True)

<Axes: title={'center': 'Sentiment Analysis '}>

df_positive=textblob_df[textblob_df['textblob_sentiment']=='Positive']
df_Very_positive=df_positive[df_positive['likecount']>0]
df_Very_positive.head()

description textblob_sentiment likecount

35 Alan walker it's just unique ☺ Positive 2

65 I Love this Alan walker song ❤ Positive 1

94 The best Positive 1

104 Great Positive 1

106 I love all songs of Alan walker Positive 1

df_negative=textblob_df[textblob_df['textblob_sentiment']=='Negative']

df_negative.head()

description textblob_sentiment likecount

30 Is it a bad thing i associate this song with t... Negative 0

37 This song reminds me of somthong some memory b... Negative 0

89 Is this stupid heart Negative 1

122 "lost in the dark , but ill never be alone" ca... Negative 0

140 I want to build video in youtube.. i want to s... Negative 0

df_neutral=textblob_df[textblob_df['textblob_sentiment']=='Neutral']

df_neutral.head()

description textblob_sentiment likecount


3 Hello, hello, hello, world. Neutral 0

5 I close my eyes and goodbye to the world. Neutral 0

6 🤮✌🤮G🤮🤮🤮 🤮love it💛 Neutral 0

7 My name is Christian walker county is Rwanda I... Neutral 0

8 Waw I like it Neutral 1

from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# Creating the text variable
positive_tw = " ".join(t for t in df_Very_positive.description)

# Creating word_cloud with text as argument in the .generate() method
word_cloud1 = WordCloud(collocations=False, background_color='white').generate(positive_tw)

# Display the generated Word Cloud
plt.imshow(word_cloud1, interpolation='bilinear')
plt.axis("off")
plt.show()

Conclusion: Using Python, we have developed a comprehensive content-based social media analytics
model tailored to business needs. Leveraging TextBlob for sentiment analysis and WordCloud for
visual representation, the model efficiently processes diverse content types such as text,
emoticons, images, audio, and video, enabling businesses to extract valuable insights into user
engagement, sentiment distribution, and preferences for informed decision-making and strategic
planning in the dynamic landscape of social media.

