SMA Lab Manual

The document outlines a series of experiments and assignments focused on social media analytics for the academic year 2024-25 at Chhatrapati Shivaji Maharaj Institute of Technology. It covers studying social media platforms, data collection methods, data cleaning and storage, exploratory data analysis, and developing analytics models for business applications, and emphasizes the role of social media analytics tools and techniques in understanding customer behavior and optimizing business strategies.

CHHATRAPATI SHIVAJI MAHARAJ INSTITUTE OF TECHNOLOGY

(Affiliated to Mumbai University, Approved by AICTE-New Delhi)

Academic Year: 2024-25    Semester: VIII    Branch: Computer Engineering

List of Experiments:
1. Study various - i) Social media platforms (Facebook, Twitter, YouTube) ii) Social media analytics tools (Facebook Insights, Google Analytics, Netlytic, etc.) iii) Social media analytics techniques and engagement metrics (page level, post level, member level) iv) Applications of social media analytics for business, e.g. Google Analytics (https://marketingplatform.google.com/about/analytics/, https://netlytic.org/).
2. Data Collection - Select the social media platforms of your choice (Twitter, Facebook, LinkedIn, YouTube, web blogs, etc.), connect to and capture social media data for business (scraping, crawling, parsing).
3. Data Cleaning and Storage - Preprocess, filter, and store social media data for business (using Python, MongoDB, R, etc.).
4. Exploratory data analysis and visualization of social media data for business.
5. Develop a content-based (text, emoticons, image, audio, video) social media analytics model for business (e.g. content-based analysis: topic, issue, trend, sentiment/opinion analysis; audio, video, image analytics).
6. Develop a structure-based social media analytics model for any business (e.g. structure-based models: community detection, influence analysis).
7. Develop a dashboard and reporting tool based on real-time social media data.
8. Design creative content for promotion of your business on a social media platform.
9. Analyze competitor activities using social media data.
10. Develop social media text analytics models for improving an existing product/service by analyzing customers' reviews/comments.

List of Assignments:
1.
2.
3.
EXPERIMENT NO. 1
Aim: Study various -
i) Social media platforms (Facebook, Twitter, YouTube, etc.)
ii) Social media analytics tools (Facebook Insights, Google Analytics, Netlytic, etc.)
iii) Social media analytics techniques and engagement metrics (page level, post level, member level)
iv) Applications of social media analytics for business, e.g. Google Analytics
https://marketingplatform.google.com/about/analytics/
https://netlytic.org
Theory:
Social media has become an indispensable part of our lives. It has transformed the way we communicate
and interact with each other. Social media platforms like Facebook, Twitter, and YouTube have millions of
users who share their views, opinions, and experiences. Businesses have recognized the importance of social
media in engaging with their customers and promoting their products and services. Social media analytics is
the process of analyzing social media data to gain insights into customer behavior, market trends, and brand
perception. In this experiment, we study various social media platforms, social media analytics tools and techniques, and their applications in business.

• Social Media Platforms:


There are various social media platforms, and each has its own unique features and
characteristics. Some of the popular social media platforms are Facebook, Twitter, Instagram,
LinkedIn, and YouTube.
Facebook is the most popular social media platform, with over 2 billion monthly active users.
Twitter is a microblogging platform that allows users to post short messages called tweets. Instagram is
a visual platform that focuses on photos and videos. LinkedIn is a professional networking platform
that connects professionals and businesses. YouTube is a video-sharing platform that allows users to
upload and share videos.

• Social Media Analytics Tools:


Social media analytics tools are software applications that help businesses track, measure, and
analyze social media data. Some of the popular social media analytics tools are Facebook Insights,
Google Analytics, and Netlytic.
Facebook Insights is a free tool that provides businesses with data on their Facebook page's
performance. Google Analytics is a web analytics tool that tracks and reports website traffic. Netlytic
is a social media analytics tool that analyzes social media data to identify trends, topics, and
influencers.

• Social Media Analytics Techniques and Engagement Metrics:


Social media analytics techniques and engagement metrics are used to measure social media
performance. Some of the popular social media analytics techniques and engagement metrics are page-
level metrics, post-level metrics, and member-level metrics.
Page-level metrics measure the overall performance of a social media page. Post-level
metrics measure the performance of individual posts. Member-level metrics measure the engagement
level of individual members.
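As a small illustration of these three levels (the posts table and its numbers below are invented for the sketch), the metrics can be computed with pandas:

# A minimal sketch of page-, post-, and member-level engagement metrics,
# computed with pandas over a hypothetical posts table.
import pandas as pd

posts = pd.DataFrame({
    "post_id":  [1, 2, 3, 4],
    "author":   ["page", "page", "fan_a", "fan_b"],
    "likes":    [120, 45, 10, 3],
    "comments": [12, 4, 2, 0],
    "shares":   [8, 1, 0, 0],
})

# Post-level: engagement of each individual post
posts["engagement"] = posts[["likes", "comments", "shares"]].sum(axis=1)

# Page-level: aggregate engagement across all posts
page_engagement = posts["engagement"].sum()

# Member-level: engagement attributed to each author
member_engagement = posts.groupby("author")["engagement"].sum()

print(posts[["post_id", "engagement"]])
print("Page-level total:", page_engagement)
print(member_engagement)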

• Applications of Social Media Analytics for Business:
Social media analytics has various applications in business. It can be used to measure brand
awareness, customer engagement, and market trends. It can also be used to identify influencers and
track competitor activity. Google Analytics is a popular tool used by businesses to track website traffic
and user behavior.
It provides businesses with insights into their website's performance, such as the number of
visitors, bounce rate, and conversion rate. Netlytic is a social media analytics tool that helps businesses
identify trends, topics, and influencers in social media data.

Conclusion:
Social media analytics has become an essential tool for businesses to measure their social media
performance and gain insights into customer behavior, market trends, and brand perception.
Social media platforms like Facebook, Twitter, and YouTube have millions of users, and businesses can
leverage this audience to promote their products and services.
Social media analytics tools like Facebook Insights, Google Analytics, and Netlytic provide businesses
with data and insights to make informed decisions.
Social media analytics techniques and engagement metrics can help businesses measure their social
media performance and optimize their social media strategy.

EXPERIMENT NO. 2
Aim: Data Collection - Select the social media platforms of your choice (Twitter, Facebook, LinkedIn, YouTube, web blogs, etc.), connect to and capture social media data for business (scraping, crawling, parsing).
Introduction:
Social media has become a crucial source of data for businesses to understand their customers and their
preferences. With millions of users sharing their thoughts, opinions, and experiences on social media platforms
like Twitter, Facebook, LinkedIn, YouTube, and web blogs, businesses can gain valuable insights into customer
behavior, market trends, and brand perception. This experiment covers data collection methods for social
media platforms, including scraping, crawling, and parsing.

• Data Collection Methods:


There are various data collection methods for social media platforms, and each method has its own advantages
and disadvantages. Some of the popular data collection methods are scraping, crawling, and parsing.
1. Scraping:
Scraping is the process of extracting data from websites or social media platforms. Scraping is an
effective way to collect data from social media platforms like Twitter, Facebook, and LinkedIn. The
data collected through scraping can be used for various purposes, such as sentiment analysis, trend
analysis, and competitor analysis.
2. Crawling:
Crawling is the process of systematically discovering and indexing web pages, as search engines do. It
involves using automated tools or software to scan websites and social media platforms for new
content. Crawling is an effective way to collect data from web blogs and news websites. The data
collected through crawling can be used for various purposes, such as content analysis, trend analysis,
and competitor analysis.
3. Parsing:
Parsing is the process of extracting data from structured or unstructured data sources. It involves
using automated tools or software to analyze data and extract relevant information. Parsing is an
effective way to collect data from social media platforms like Twitter and Facebook. The data collected
through parsing can be used for various purposes, such as sentiment analysis, trend analysis, and
competitor analysis.
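As a minimal illustration of scraping and parsing with requests and Beautiful Soup (the URL and CSS selector below are placeholders, not tied to any particular site):

# A minimal scraping-and-parsing sketch using requests and Beautiful Soup.
# The URL and the CSS selector are illustrative assumptions; adapt them to the target blog.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog"          # hypothetical blog listing page
response = requests.get(url, timeout=10)  # fetch the raw HTML
soup = BeautifulSoup(response.text, "html.parser")

# Parse out post titles and links (the selector depends on the site's markup)
for post in soup.select("article h2 a"):
    print(post.get_text(strip=True), "->", post.get("href"))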
Execution Steps:
1. Identify the social media platform(s) of interest: Choose the social media platform(s) that you want
to collect data from, based on your business needs and objectives.
2. Set up a data collection tool: Choose a data collection tool that suits your needs and set it up. There
are various data collection tools available, such as Octoparse, ParseHub, and Beautiful Soup.
3. Define the data fields: Define the data fields that you want to collect from the social media platform.
For example, if you are collecting data from Twitter, you may want to collect data on tweets, hashtags,
user profiles, and location.
4. Configure the data collection tool: Configure the data collection tool to collect the data fields that
you have defined.
5. Run the data collection tool: Run the data collection tool to collect the data from the social
media platform.
6. Store and analyze the data: Store the data in a database or spreadsheet and analyze it using data
analysis tools like Excel or Python.

Code:
# Import required libraries
import pandas as pd
import requests
from textblob import TextBlob

# Set YouTube video ID, maximum number of comments to retrieve, and API key
video_id = "Q33TkQKlIMg"
max_result = 50
api_key = "YOUR_API_KEY"  # use your own YouTube Data API v3 key here

# Retrieve video information
video_info_url = (f"https://www.googleapis.com/youtube/v3/videos"
                  f"?part=id%2Csnippet&id={video_id}&key={api_key}")
video_info_response = requests.get(video_info_url)
video_info_data = video_info_response.json()

# Retrieve video components (comments)
comments_url = (f"https://www.googleapis.com/youtube/v3/commentThreads"
                f"?key={api_key}&videoId={video_id}&part=snippet&maxResults={max_result}")
comments_response = requests.get(comments_url)
comments_data = comments_response.json()

# Create pandas DataFrame from comments data
df = pd.DataFrame(comments_data['items'])
df1 = pd.DataFrame(df['snippet'])

# Extract the text of each top-level comment from the DataFrame
comments = []
for i in range(0, max_result):
    df2 = pd.DataFrame(df['snippet'][i])
    txt = df2['topLevelComment']['snippet']['textOriginal']
    comments.append(txt)
print(comments)

# Define function to perform sentiment analysis on a given comment
def get_comment_sentiment(comment):
    analysis = TextBlob(comment)
    if analysis.sentiment.polarity > 0:
        return "Positive"
    elif analysis.sentiment.polarity == 0:
        return "neutral"
    else:
        return "negative"

# Perform sentiment analysis on all comments and create a new DataFrame
comment_list = []
sentiment_list = []
for comment in comments:
    sentiment = get_comment_sentiment(comment)
    comment_list.append(comment)
    sentiment_list.append(sentiment)
    print(f"{comment} : {sentiment}")
sentiment_df = pd.DataFrame({"Comments": comment_list, "Sentiment": sentiment_list})
sentiment_df.head()

# Save DataFrame to a CSV file
sentiment_df.to_csv("YouTube_Comments_Sentiment.csv")

Output:
golden voice morgan freeman : Positive
It's shows only our earth not Universe : neutral
"Little minds are tamed and subdued by misfortune; but great minds rise above it." --
Washington Irving : Positive
"I find that when you have a real interest in life and a curious life, that sleep is
not the most important thing." --Martha Stewart : Positive
But they can't beat the @melodysheep : neutral
Hore..hore film ultramen!! : neutral
I mean.. life is strange : negative
So amazing 😢 : Positive
N for nothing original : Positive
His narration is institutionalized. ❤❤❤❤ : neutral
Morgan Freeman u beauty ❤ : neutral
All most of pieces come from France television watch a doc called "aux Frontier de
l'univers "I bet you will like it but also Morgan he ' s one of the best : Positive
Absolute poppycock!
It makes me love God all the more!
Trying to explain the universe while claiming God doesn’t exist…..
Let’s just say that in heaven, they watch shows like this on comedy night. : Positive
No Money, No Honey.

Blow up the damn universe. Who cares. : neutral


david attenborough would’ve ate this up : neutral
And humans are biggest mistake by nature 😂😂😂 : neutral
I wish this could be not only on netflix and on other websites in the internet to
watch it or on tv in Poland where I live. I love Mr Freeman voice and science shows I
watched with him : Positive
I got goosebumps watching this trailer. : neutral
Half of trailer copy from national geographic channel : negative
its not a documentry. its an exploration of theoretical events. : neutral
damn, the big bang arc hit hard.
when theia-chan and earth-kun finally joined together i was crying : negative
This proved to me that we have a CREATOR, this world was not made by chance, it was
intentional and perfect. : Positive
God is the creator, the universe things you put in here don't make sense, had to skip
the parts : neutral
Kudos to tech , We can see back the full history : Positive
This is not the real story..... : Positive
Seeing ur work gettin some recognition bring tears : neutral
watch The throne of Allah. : neutral
Honestly was praying NOT to hear Morgan Freeman : Positive
Is this the history of our universe or it's the history that you made up? : neutral
The earth is flat : negative
And also don't forget world's elite have an agenda to believe lies. GOD created
everything keep in mind, and this universe GOD laws still exist. Amen : neutral
Gurl y’all got God to narrate this lmaoo : neutral

BIDEN R U WATCHING THIS ? : neutral
Надеюсь нам не будут напоминать что планеты бесполые... : neutral
Watching the creation of our Lord God Almighty. He created all things and through the
Lord Jesus were made. The world was made in 6 days 💯❤✝ : neutral
Looks so good : Positive
The voice of ''MORGAN FREEMAN'' is not just a voice it's an emotion that connects us
with him and also connects us with the scenes. ❤❤❤ : neutral
Wow : Positive
13.8 billion years, says who? Allah states that He created the universe, the heavens
and earth and all that is in, on and between them in six days, as He has stated in
several Ayat in the Qur'an. : neutral
I get nostalgia of animal planet : neutral
Why it feels more like some discovery stuff then a Netflix show : Positive
This is explosive ❤❤ : neutral
Morgan.... epic voice : Positive
This line 👉We are connected to the start ❤
Yes actually we are the start and last i think, what do you think?
Different animals have different mind but anyone want to live with their mothers(➑),
but we Are humans we are Destroyed trees, killing animals, increasing population,
waste water , waste food, polute air,😔😓 etc and want to live long ..... How funny
that you guys : Positive
Proud of Indian 🚩🇮 : Positive
That's why I subscribe Netflix : neutral
Morgan Freeman is GOD : neutral
me suena a una imitación de cosmos espero que este buena : neutral

Explanation:
• This code is for retrieving comments from a YouTube video and performing sentiment analysis on
the comments using the TextBlob library in Python.
• The code imports pandas, requests, and TextBlob libraries, and sets the video ID, maximum number
of comments to retrieve, and API key for accessing the YouTube API.
• The code then sends an API request to retrieve information about the video and another API request
to retrieve the comments for the video, using the video ID and API key.
• The retrieved data is then converted into a pandas DataFrame for further analysis. The code then
extracts the comments from the DataFrame and performs sentiment analysis on each comment using the
TextBlob library.
• Finally, the code creates a new pandas DataFrame with the comments and their corresponding
sentiment and saves the data to a CSV file named "YouTube_Comments_Sentiment.csv".
Conclusion:
Data collection from social media platforms is essential for businesses to gain insights into customer
behavior, market trends, and brand perception. There are various data collection methods available, including
scraping, crawling, and parsing. Each method has its own advantages and disadvantages, and businesses should
choose the method that suits their needs and objectives. The execution steps for data collection involve
identifying the social media platform(s) of interest, setting up a data collection tool, defining the data fields,
configuring the data collection tool, running the data collection tool, and storing and analyzing the data.

EXPERIMENT NO. 3
Aim: Data Cleaning and Storage - Preprocess, filter, and store social media data for business (using Python, MongoDB, R, etc.).
Theory:
After collecting social media data for businesses, the next step is data cleaning and storage. The data
collected from social media platforms is often unstructured and noisy, making it difficult to analyze. This experiment covers the theory behind data cleaning and storage and the execution steps to preprocess, filter,
and store social media data for businesses.
Data cleaning is the process of identifying and correcting errors and inconsistencies in data. Social
media data is often noisy, containing irrelevant or inaccurate information, such as misspellings, abbreviations,
and emoticons. Data cleaning involves removing such information to make the data more accurate and reliable.
Data storage involves storing the cleaned data in a database or other storage systems. Social media data is
often large and unstructured, making it difficult to store in traditional relational databases. Therefore, businesses can use
NoSQL databases, such as MongoDB or Apache Cassandra, to store social media data.
Execution Steps:
1. Preprocess the data: Preprocess the data by removing irrelevant or inaccurate information, such as
misspellings, abbreviations, and emoticons. Businesses can use data cleaning tools, such as TextBlob
or NLTK, to preprocess social media data.
2. Filter the data: Filter the data by selecting only relevant data fields, such as tweets, hashtags, user
profiles, and location. Businesses can use filtering tools, such as pandas or dplyr, to filter social
media data.
3. Store the data: Store the cleaned and filtered data in a database or other storage systems.
Businesses can use NoSQL databases, such as MongoDB or Apache Cassandra, to store social
media data.
4. Index the data: Index the data to make it searchable and retrievable. Businesses can use indexing
tools, such as Elasticsearch or Solr, to index social media data.
5. Analyze the data: Analyze the data using data analysis tools, such as Excel, Python, or R. Businesses
can use data analysis tools to gain insights into customer behavior, market trends, and brand
perception.
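As a sketch of step 3 (storage), assuming a local MongoDB server and illustrative database/collection names, pymongo can be used as follows:

# A minimal sketch of storing cleaned social media records in MongoDB using pymongo.
# Assumes a MongoDB server running on localhost:27017; the database and
# collection names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["social_media"]          # hypothetical database name
collection = db["cleaned_tweets"]    # hypothetical collection name

# Each cleaned record is a plain dict; MongoDB stores it as a document
cleaned_records = [
    {"user": "alice", "text": "great product", "hashtags": ["#launch"]},
    {"user": "bob", "text": "needs better support", "hashtags": []},
]
collection.insert_many(cleaned_records)

# Retrieve documents matching a simple filter
for doc in collection.find({"hashtags": "#launch"}):
    print(doc["user"], ":", doc["text"])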
Code:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string
import re
%matplotlib inline
pd.set_option('display.max_colwidth', 100)

# Load dataset
def load_data():
    data = pd.read_csv('../input/Data.csv')
    return data

tweet_df = load_data()
tweet_df.head()

print('Dataset size:', tweet_df.shape)
print('Columns are:', tweet_df.columns)
tweet_df.info()
sns.countplot(x='ADR_label', data=tweet_df)

# Exploratory Data Analysis
# Wordcloud Visualization
df = pd.DataFrame(tweet_df[['UserId', 'Tweet']])
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Split tweets by class and join each group into one string
df_ADR = tweet_df[tweet_df['ADR_label'] == 1]
df_NADR = tweet_df[tweet_df['ADR_label'] == 0]
tweet_All = " ".join(review for review in df.Tweet)
tweet_ADR = " ".join(review for review in df_ADR.Tweet)
tweet_NADR = " ".join(review for review in df_NADR.Tweet)

fig, ax = plt.subplots(3, 1, figsize=(30, 30))

# Create and generate a word cloud image:
wordcloud_ALL = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_All)
wordcloud_ADR = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_ADR)
wordcloud_NADR = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_NADR)

# Display the generated images:
ax[0].imshow(wordcloud_ALL, interpolation='bilinear')
ax[0].set_title('All Tweets', fontsize=30)
ax[0].axis('off')
ax[1].imshow(wordcloud_ADR, interpolation='bilinear')
ax[1].set_title('Tweets under ADR Class', fontsize=30)
ax[1].axis('off')
ax[2].imshow(wordcloud_NADR, interpolation='bilinear')
ax[2].set_title('Tweets under Non-ADR Class', fontsize=30)
ax[2].axis('off')
# wordcloud.to_file("img/first_review.png")

# Inspect punctuation characters, then remove punctuation and digits
string.punctuation
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

df['Tweet_punct'] = df['Tweet'].apply(lambda x: remove_punct(x))
df.head(10)

# Tokenization
def tokenization(text):
    text = re.split('\W+', text)
    return text

df['Tweet_tokenized'] = df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
df.head()

# Remove stopwords
stopword = nltk.corpus.stopwords.words('english')
# stopword.extend(['yr', 'year', 'woman', 'man', 'girl', 'boy', 'one', 'two', 'sixteen', 'yearold', 'fu',
#                  'weeks', 'week', 'treatment', 'associated', 'patients', 'may', 'day', 'case', 'old'])
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text

df['Tweet_nonstop'] = df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x))
df.head(10)

# Stemming and Lemmatization (e.g. "developed", "development" -> "develop")
ps = nltk.PorterStemmer()
def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

df['Tweet_stemmed'] = df['Tweet_nonstop'].apply(lambda x: stemming(x))
df.head()

wn = nltk.WordNetLemmatizer()
def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

df['Tweet_lemmatized'] = df['Tweet_nonstop'].apply(lambda x: lemmatizer(x))
df.head()

def clean_text(text):
    text_lc = "".join([word.lower() for word in text if word not in string.punctuation])  # remove punctuation
    text_rc = re.sub('[0-9]+', '', text_lc)
    tokens = re.split('\W+', text_rc)  # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and stem
    return text

# Vectorisation
countVectorizer = CountVectorizer(analyzer=clean_text)
countVector = countVectorizer.fit_transform(df['Tweet'])
print('{} Number of tweets has {} words'.format(countVector.shape[0], countVector.shape[1]))
# print(countVectorizer.get_feature_names())
count_vect_df = pd.DataFrame(countVector.toarray(), columns=countVectorizer.get_feature_names())
# count_vect_df.head()

# Feature Creation: character-length distribution of tweets per class
ADR_tweet_1 = tweet_df[tweet_df['ADR_label'] == 1]['Tweet'].apply(lambda x: len(x) - len(' '))
ADR_tweet_0 = tweet_df[tweet_df['ADR_label'] == 0]['Tweet'].apply(lambda x: len(x) - len(' '))
bins_ = np.linspace(0, 450, 70)
plt.hist(ADR_tweet_1, bins=bins_, density=True, alpha=0.5, label='ADR')
plt.hist(ADR_tweet_0, bins=bins_, density=True, alpha=0.1, label='None_ADR')
plt.legend()
Output:

23516 Number of tweets has 14323 words

Explanation:
This code is written in Python and is used for Exploratory Data Analysis (EDA) on a dataset containing
tweets about drugs, labeled for adverse drug reactions (ADR). Here's a summary of the code:
• Libraries such as pandas, numpy, seaborn, matplotlib, and nltk are imported.
• The dataset is loaded using the load_data function, and the size and columns of the dataset are printed.
• A countplot is plotted to show the number of tweets in each ADR class (0 or 1).
• A WordCloud is plotted to visualize the most frequent words in all tweets, tweets under ADR class,
and tweets under None-ADR class.
• The tweets are preprocessed by removing punctuation, tokenizing the words, removing stopwords,
and stemming and lemmatizing the words.
• The preprocessed tweets are vectorized using the CountVectorizer class.
• The length of tweets in both ADR classes is plotted using histograms to show the frequency
distribution of tweet lengths.
The code performs the EDA on the tweet dataset to gain insights and prepare the data for machine learning
models that can predict ADR classes of tweets based on their contents.
Conclusion:
Data cleaning and storage are essential steps in the social media data analysis process. The data
collected from social media platforms is often unstructured and noisy, making it difficult to analyze. Therefore,
businesses need to preprocess, filter, and store social media data to make it more accurate and reliable. They
can use data cleaning tools, such as TextBlob or NLTK, and filtering tools, such as pandas or dplyr, to
preprocess and filter social media data. Businesses can also use NoSQL databases, such as MongoDB or
Apache Cassandra, to store social media data. Finally, businesses can use data analysis tools, such as Excel,
Python, or R, to gain insights into customer behavior, market trends, and brand perception.

EXPERIMENT NO. 4
Aim: Exploratory Data Analysis and visualization of Social Media Data for business.
Theory:
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used
to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical
representations.
An EDA is a thorough examination meant to uncover the underlying structure of a data set and is
important for a company because it exposes trends, patterns, and relationships that are not readily apparent.
The four types of EDA are
1. Univariate non-graphical,
2. Multivariate non-graphical,
3. Univariate graphical,
4. Multivariate graphical.
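Each of the four types maps to a short pandas/seaborn call, as in the following sketch (the tiny DataFrame is an invented illustration):

# A minimal sketch mapping the four EDA types to one call each,
# using a hypothetical DataFrame with numeric columns 'likes' and 'comments'.
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"likes": [120, 45, 300, 80], "comments": [10, 4, 25, 7]})

df["likes"].describe()                              # 1. Univariate non-graphical: summary statistics
df.corr()                                           # 2. Multivariate non-graphical: correlation matrix
sns.histplot(df["likes"])                           # 3. Univariate graphical: distribution of one variable
sns.scatterplot(x="likes", y="comments", data=df)   # 4. Multivariate graphical: relationship between variables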
Techniques and Tools:
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude
taken than by particular techniques.

Typical graphical techniques used in EDA are:


• Box plot
• Histogram
• Multi-vari chart
• Run chart
• Pareto chart
• Scatter plot (2D/3D)
• Stem-and-leaf plot
• Parallel coordinates
• Odds ratio
• Heat map
• Bar chart
• Horizon graph
• Dimensionality reduction:
• Multidimensional scaling
• Principal component analysis (PCA)
• Multilinear PCA
• Iconography of correlations

Code:
# Import necessary libraries
import pandas as pd               # For data manipulation and analysis
import numpy as np                # For numerical computing
import matplotlib.pyplot as plt   # For data visualization
import seaborn as sns             # For advanced data visualization

# Load the social media data into a pandas DataFrame
social_media_data = pd.read_csv('social_media_data.csv')

# Explore the data using summary statistics
print(social_media_data.describe())

# Visualize the data using histograms and box plots
plt.figure(figsize=(10, 6))
sns.histplot(social_media_data['likes'], kde=True)  # Histogram of the 'likes' column
plt.title('Distribution of Likes')    # Add title
plt.xlabel('Number of Likes')         # Add x-axis label
plt.ylabel('Frequency')               # Add y-axis label
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='platform', y='followers', data=social_media_data)  # Box plot of 'followers' by 'platform'
plt.title('Followers by Platform')    # Add title
plt.xlabel('Platform')                # Add x-axis label
plt.ylabel('Number of Followers')     # Add y-axis label
plt.show()

# Identify correlations between variables
corr_matrix = social_media_data.corr()                # Create a correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')  # Heatmap of the correlation matrix
plt.title('Correlation Matrix')       # Add title
plt.show()

# Visualize the data using scatter plots
plt.figure(figsize=(10, 6))
sns.scatterplot(x='comments', y='engagement_rate', data=social_media_data)  # 'comments' vs 'engagement_rate'
plt.title('Comments vs Engagement Rate')  # Add title
plt.xlabel('Number of Comments')          # Add x-axis label
plt.ylabel('Engagement Rate')             # Add y-axis label
plt.show()

# Identify outliers using box plots
plt.figure(figsize=(10, 6))
sns.boxplot(x='platform', y='likes', data=social_media_data)  # Box plot of 'likes' by 'platform'
plt.title('Likes by Platform')        # Add title
plt.xlabel('Platform')                # Add x-axis label
plt.ylabel('Number of Likes')         # Add y-axis label
plt.show()

# Create a pairplot to visualize the relationships between all variables
sns.pairplot(data=social_media_data)  # Pairplot of all columns
plt.show()

Output:
followers likes comments engagement_rate
count 10.000000 10.000000 10.000000 10.0000
mean 13200.000000 700.000000 46.500000 4.4500
std 9199.033766 637.268476 35.043624 1.9352
min 2000.000000 50.000000 5.000000 2.0000
25% 6250.000000 262.500000 21.250000 2.6250
50% 10000.000000 500.000000 40.000000 4.5000
75% 18750.000000 950.000000 68.750000 5.8250
max 30000.000000 2000.000000 100.000000 7.5000

[Figure: heatmap of the correlation matrix for followers, likes, comments, and engagement rate]

[Figure: scatter plot of Comments vs Engagement Rate]
Explanation:
This code performs exploratory data analysis on a social media dataset. It first imports necessary
libraries, loads the dataset into a Pandas DataFrame, and then uses summary statistics and data visualization
techniques such as histograms, box plots, scatter plots, and correlation matrices to explore the dataset.
The code also identifies outliers using box plots and creates a pairplot to visualize the relationships
between all variables in the dataset. Overall, the code helps to gain insights and better understand the social
media data.
Conclusion:
Exploratory Data Analysis and visualization of social media data are important steps in the data
analysis process. EDA helps businesses to understand the patterns and trends in social media data and identify
relationships between variables.
Visualization is an important part of EDA because it helps businesses to communicate insights and
findings effectively. Businesses can use Python, R, or other data analysis tools to perform EDA and create
visualizations of social media data. Finally, businesses can interpret the findings from EDA and communicate
the insights to relevant stakeholders.

EXPERIMENT NO. 5
Aim: Develop a content-based (text, emoticons, image, audio, video) social media analytics model for business (e.g. content-based analysis: topic, issue, trend, sentiment/opinion analysis; audio, video, image analytics).
Theory:
Developing a content-based social media analytics model for businesses is a crucial step in
understanding their audience and improving their social media presence. This experiment covers the
theory behind content-based analysis, the execution steps to develop the model, and the benefits it can offer for
businesses.
Content-based social media analytics is a process of analyzing the content shared on social media
platforms by a business or its customers. It helps businesses to understand their audience, identify emerging
trends, and improve their social media strategy. Content-based analysis can be used for text, emoticons, image,
audio, and video analytics.
• Topic Analysis: It is the process of identifying the topics and themes in social media content. This can
help businesses to understand the interests of their audience and create content that resonates with
them.
• Issue Analysis: It is the process of identifying the issues and concerns of the audience. This can
help businesses to address the concerns of their audience and improve their social media presence.
• Trend Analysis: It is the process of identifying the emerging trends in social media content. This
can help businesses to stay up-to-date with the latest trends and adapt their social media strategy
accordingly.
• Sentiment Analysis: It is the process of identifying the sentiment or opinion expressed in social
media content. This can help businesses to understand the sentiment of their audience towards their
brand, products, or services.
• Audio, Video, and Image Analytics: It is the process of analyzing audio, video, and image content shared
on social media platforms. This can help businesses to identify the type of content that resonates with their
audience and create similar content to improve engagement.
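As a minimal illustration of topic analysis (the toy corpus and the topic count below are assumptions, separate from the experiment's dataset), Latent Dirichlet Allocation can be sketched with scikit-learn:

# A minimal topic-analysis sketch using scikit-learn's LDA on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "new phone camera is amazing",
    "battery life on this phone is poor",
    "loved the concert last night",
    "the band played an amazing set",
]

# Convert the posts to a bag-of-words matrix
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

# Fit LDA with two topics (an illustrative choice)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Print the top words per discovered topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:]]
    print(f"Topic {idx}: {top}")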
Execution Steps:
1. Data Collection: Collect data from social media platforms, including text, emoticons, images,
audio, and video content.
2. Preprocessing: Clean and preprocess the data by removing stop words, converting text to
lowercase, and removing special characters.
3. Feature Extraction: Extract relevant features from the data, such as topics, issues, trends,
sentiment, and image, audio, and video content.
4. Model Development: Develop a machine learning model to analyze the extracted features and
provide insights about the content shared on social media platforms.
5. Model Evaluation: Evaluate the performance of the model using metrics such as accuracy,
precision, and recall.
6. Implementation: Implement the model to analyze the content shared on social media platforms
and provide insights to improve the social media strategy.

Code:
# DataFrame
import pandas as pd

# Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer

# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from keras import utils
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Word2vec
import gensim

# Utility
import re
import numpy as np
import os
from collections import Counter
import logging
import time
import pickle
import itertools

# Set log
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
nltk.download('stopwords')

# Settings
# DATASET
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
TRAIN_SIZE = 0.8
# TEXT CLEANING
TEXT_CLEANING_RE = "@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

# WORD2VEC
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 32
W2V_MIN_COUNT = 10

# KERAS
SEQUENCE_LENGTH = 300
EPOCHS = 8
BATCH_SIZE = 1024

# SENTIMENT
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
SENTIMENT_THRESHOLDS = (0.4, 0.7)

# EXPORT
KERAS_MODEL = "model.h5"
WORD2VEC_MODEL = "model.w2v"
TOKENIZER_MODEL = "tokenizer.pkl"
ENCODER_MODEL = "encoder.pkl"

dataset_filename = os.listdir("../input")[0]
dataset_path = os.path.join("..", "input", dataset_filename)
print("Open file:", dataset_path)
df = pd.read_csv(dataset_path, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)

print("Dataset size:", len(df))
# df.head(5)

# Map target label to string:
# 0 -> NEGATIVE, 2 -> NEUTRAL, 4 -> POSITIVE
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]

%%time
df.target = df.target.apply(lambda x: decode_sentiment(x))

target_cnt = Counter(df.target)
plt.figure(figsize=(16, 8))
plt.bar(target_cnt.keys(), target_cnt.values())
plt.title("Dataset labels distribution")

# Pre-process dataset
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")
def preprocess(text, stem=False):
    # Remove links, users and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

%%time
df.text = df.text.apply(lambda x: preprocess(x))

df_train, df_test = train_test_split(df, test_size=1-TRAIN_SIZE, random_state=42)
print("TRAIN size:", len(df_train))
print("TEST size:", len(df_test))

# Word2Vec
%%time
documents = [_text.split() for _text in df_train.text]
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE,  # gensim 3.x API; in gensim 4+ this is `vector_size`
                                            window=W2V_WINDOW,
                                            min_count=W2V_MIN_COUNT,
                                            workers=8)
w2v_model.build_vocab(documents)
words = w2v_model.wv.vocab.keys()  # gensim 3.x; `wv.key_to_index` in gensim 4+
vocab_size = len(words)
print("Vocab size", vocab_size)

%%time
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)
w2v_model.most_similar("love")

%%time
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train.text)

vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)

%%time
x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)

# Label Encoder
labels = df_train.target.unique().tolist()
labels.append(NEUTRAL)
labels

encoder = LabelEncoder()
encoder.fit(df_train.target.tolist())

y_train = encoder.transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

print("y_train", y_train.shape)
print("y_test", y_test.shape)

print("x_train", x_train.shape)
print("y_train", y_train.shape)
print()
print("x_test", x_test.shape)
print("y_test", y_test.shape)

y_train[:10]

# Embedding layer
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)

embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH, trainable=False)

# Build Model
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.summary()

# Compile model
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

# Callbacks
callbacks = [ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
             EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]

# Train
%%time
history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)

# Evaluate
%%time
score = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
print()
print("ACCURACY:",score[1])
print("LOSS:",score[0])

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'b', label='Training acc')


plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'b', label='Training loss')


plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
# Predict
def decode_sentiment(score, include_neutral=True):
if include_neutral:
label = NEUTRAL
if score <= SENTIMENT_THRESHOLDS[0]:
label = NEGATIVE
elif score >= SENTIMENT_THRESHOLDS[1]:
label = POSITIVE

return label
else:
return NEGATIVE if score < 0.5 else POSITIVE
def predict(text, include_neutral=True):
start_at = time.time()
# Tokenize text
x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENG
TH)
# Predict
score = model.predict([x_test])[0]
# Decode sentiment
label = decode_sentiment(score, include_neutral=include_neutral)

return {"label": label, "score": float(score),


"elapsed_time": time.time()-start_at}
predict("I love the music")
predict("I hate the rain")
predict("i don't know what i'm doing")
# Confusion Matrix

27
%%time
y_pred_1d = []
y_test_1d = list(df_test.target)
scores = model.predict(x_test, verbose=1, batch_size=8000)
y_pred_1d = [decode_sentiment(score, include_neutral=False) for score in scores]

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=22)
    plt.yticks(tick_marks, classes, fontsize=22)

    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label', fontsize=25)
    plt.xlabel('Predicted label', fontsize=25)

%%time
cnf_matrix = confusion_matrix(y_test_1d, y_pred_1d)
plt.figure(figsize=(12, 12))
plot_confusion_matrix(cnf_matrix, classes=df_train.target.unique(), title="Confusion matrix")
plt.show()

# Classification Report
print(classification_report(y_test_1d, y_pred_1d))

# Accuracy Score
accuracy_score(y_test_1d, y_pred_1d)

# Save model
model.save(KERAS_MODEL)
w2v_model.save(WORD2VEC_MODEL)
pickle.dump(tokenizer, open(TOKENIZER_MODEL, "wb"), protocol=0)
pickle.dump(encoder, open(ENCODER_MODEL, "wb"), protocol=0)
Output:

Layer (type)              Output Shape       Param #
=================================================================
embedding_1 (Embedding)   (None, 300, 300)   87125700
dropout_1 (Dropout)       (None, 300, 300)   0
lstm_1 (LSTM)             (None, 100)        160400
dense_1 (Dense)           (None, 1)          101
=================================================================

ACCURACY: 0.791134375
LOSS: 0.4442952796936035
CPU times: user 2min 25s, sys: 16.9 s, total: 2min 42s
Wall time: 1min 52s

CPU times: user 1.39 s, sys: 256 ms, total: 1.64 s
Wall time: 1.38 s

              precision    recall  f1-score   support

    NEGATIVE       0.79      0.79      0.79    159494
    POSITIVE       0.79      0.80      0.79    160506

   micro avg       0.79      0.79      0.79    320000
   macro avg       0.79      0.79      0.79    320000
weighted avg       0.79      0.79      0.79    320000

Explanation:
This is a Python script for performing sentiment analysis on a Twitter dataset using machine learning
techniques. The script imports various libraries such as pandas, matplotlib, scikit-learn, Keras, nltk, and gensim.
The dataset is cleaned, pre-processed, and transformed into a format that can be used to train and test machine
learning models.
The script then trains a machine learning model using Keras, and the trained model is used to make
predictions on the test set. The performance of the model is evaluated using various metrics such as confusion
matrix, classification report, and accuracy score. Finally, the trained model is exported for future use.

Conclusion:
Content-based social media analytics is a crucial step in understanding the audience and improving the
social media presence of businesses. It involves analyzing the content shared on social media platforms,
including text, emoticons, images, audio, and video content, to identify topics, issues, trends, sentiment, and
image, audio, and video content.
Developing a content-based social media analytics model involves data collection, preprocessing,
feature extraction, model development, model evaluation, and implementation. By using content-based social
media analytics, businesses can improve their social media strategy and engage with their audience effectively.

EXPERIMENT NO. 6
Aim: Develop a structure-based social media analytics model for any business (e.g. structure-based models: community detection, influence analysis).
Theory:
Developing a structure-based social media analytics model for any business is a critical step in
understanding the network of their audience and identifying influential users. This experiment covers the
theory behind structure-based models, the execution steps to develop the model, and the benefits it can offer
for businesses.
Structure-based social media analytics is a process of analyzing the social network structure of a
business or its customers. It helps businesses to understand the relationships between their audience and
identify influential users. Structure-based analysis can be used for community detection and influence analysis.
• Community Detection: It is the process of identifying groups or communities of users within a
social network. This can help businesses to understand the interests and preferences of different user
groups and create targeted content to improve engagement.
• Influence Analysis: It is the process of identifying influential users within a social network. This can
help businesses to identify users who have a significant impact on their audience and engage with
them to improve their social media presence.
Execution Steps:
1. Data Collection: Collect data from social media platforms, including user profiles, followers,
and interactions between users.
2. Network Construction: Construct a social network graph based on the collected data, with
nodes representing users and edges representing interactions between them.
3. Community Detection: Use community detection algorithms to identify groups or communities
of users within the social network.
4. Influence Analysis: Use influence analysis algorithms to identify influential users within the
social network.
5. Model Development: Develop a machine learning model to analyze the network structure and
provide insights about the social network.
6. Model Evaluation: Evaluate the performance of the model using metrics such as modularity
and centrality.
7. Implementation: Implement the model to analyze the social network structure and provide insights
to improve the social media strategy.
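Before the influencer-dataset analysis in the code below, here is a minimal community-detection and influence-analysis sketch with networkx; the edge list is a toy assumption, not data from the experiment:

# A minimal sketch of community detection and influence analysis with networkx.
# Nodes are users and edges are interactions; the edge list is invented.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [("ana", "ben"), ("ana", "cal"), ("ben", "cal"),
         ("dev", "eli"), ("dev", "fay"), ("eli", "fay"), ("cal", "dev")]
G = nx.Graph(edges)

# Community detection via greedy modularity maximization
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"Community {i}: {sorted(community)}")

# Influence analysis via centrality measures per user
print("Degree centrality:", nx.degree_centrality(G))
print("Betweenness centrality:", nx.betweenness_centrality(G))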
Code:
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import re

df_i = pd.read_csv('C:/Users/THEASHISHGAVADE/Downloads/social media influencers - instagram.csv')

# Extract the numeric part of strings such as "48.5M" or "120K"
def convert(x):
    return re.findall('\d+\.?\d*', x)

# Convert suffixed counts ("M" = millions, "K" = thousands) to plain numbers
def change(df, list1):
    for i in list1:
        df['new'+i] = df[i].apply(convert)
        df['new'+i] = df['new'+i].apply(lambda x: "".join(x))
        df['new'+i] = pd.to_numeric(df['new'+i])
        df['new'+i] = np.where(['M' in j for j in df[i]], df['new'+i]*1000000,
                               np.where(['K' in j1 for j1 in df[i]], df['new'+i]*1000, df['new'+i]))
    return df

df_i.head(2)
df_i.rename({'category_1': 'Category', 'Audience country(mostly)': 'Audience Country'}, axis=1, inplace=True)
df_i.head(2)
df_i.isnull().sum()
df_i.drop_duplicates(subset=['Influencer insta name'], inplace=True)
df_i.shape
df_i.drop(labels=['Influencer insta name', 'Authentic engagement\r\n'], axis=1, inplace=True)
df_i.head(2)

# Columns holding K/M-suffixed counts (assumed from their use below)
li = ['Followers', 'Engagement avg\r\n']
change(df_i, li)

# Engagement rate: the percentage of followers who really engage with the content posted by influencers
df_i['Engagement Rate'] = np.round((df_i['newEngagement avg\r\n'] / df_i['newFollowers']) * 100, 3)
print(df_i['Followers'].str[-1].unique())

# For convenience, express followers in millions
df_i['newFollowers'] = df_i['newFollowers'] / 1000000
df_i.drop(labels=['Engagement avg\r\n', 'newEngagement avg\r\n'], axis=1, inplace=True)
df_i.head(5)

# Top 15 most followed celebrities on Instagram
df_i.sort_values(by='newFollowers', ascending=False, ignore_index=True).iloc[0:15, [0, 1, 3, -1]]
plt.title('Top 15 most followed celebrities on Instagram')
plt.xlabel('Followers in Million')
sns.barplot(y='instagram name', x='newFollowers',
            data=df_i.sort_values(by='newFollowers', ascending=False).head(15))

palette = ['red', 'green', 'yellow', 'salmon', 'cyan', 'blue', 'orange']
def plot(df):
    plt.figure(figsize=(8, 6))
    plt.xlabel('number of times category occurred')
    plt.ylabel('Category')
    df['Category'].value_counts().sort_values(ascending=True).plot.barh(color=palette)

# Top categories followed on Instagram (popular categories on Instagram)
plot(df_i)

# Decide where you want to place ads
def plot_c(df):
    plt.figure(figsize=(10, 8))
    plt.xlabel('number of times category occurred')
    df['Audience Country'].value_counts().sort_values().plot.barh(color=palette)

# For understanding where the demand for a product is
def demand(data, category):
    return data[data['Category'] == category]['Audience Country'].value_counts().sort_values(ascending=True).plot.barh(color=palette)

demand(df_i, 'Lifestyle')
df_i['newFollowers'].describe()
df_i['newFollowers'].quantile(0.94)
df_i.head(2)

# Influencers with fewer than 60M followers, ranked by engagement rate
def for_mini_followers_instagram(coun, cat):
    df1 = df_i[df_i['Audience Country'] == coun]
    df1_mini = df1[df1['newFollowers'] < 60]
    return df1_mini.sort_values(by='Engagement Rate', ascending=False).groupby('Category').get_group(cat).iloc[:, [0, 3, -1]]

for_mini_followers_instagram('India', 'Music')

# Influencers with more than 60M followers, ranked by engagement rate
def for_mega_followers_instagram(coun, cat):
    df1 = df_i[df_i['Audience Country'] == coun]
    df1_mega = df1[df1['newFollowers'] > 60]
    return df1_mega.sort_values(by='Engagement Rate', ascending=False).groupby('Category').get_group(cat).iloc[:, [0, 3, -1]]

for_mega_followers_instagram('India', 'Music')
for_mini_followers_instagram('India', 'Beauty')
for_mini_followers_instagram('India', 'Shows')
for_mini_followers_instagram('India', 'Sports with a ball')
for_mega_followers_instagram('India', 'Sports with a ball')
for_mega_followers_instagram('Brazil', 'Sports with a ball')

Output:
Influencer insta name         0
instagram name               21
Category                    108
category_2                  713
Followers                     0
Audience Country             14
Authentic engagement\r\n      0
Engagement avg\r\n            0
dtype: int64

   instagram name   Category            category_2  Followers  Audience Country  newFollowers  Engagement Rate
0  433              Sports with a ball  NaN         48.5M      Spain             48.5          1.313
1  TAEYANG          Music               NaN         12.7M      Indonesia         12.7          4.270
2  НАСТЯ ИВЛЕЕВА    Shows               NaN         18.8M      Russia            18.8          2.010
3  Joy              Lifestyle           NaN         13.5M      Indonesia         13.5          10.370
4  Jaehyun          NaN                 NaN         11.1M      Indonesia         11.1          27.928

[Figure: bar plot of the top 15 most followed celebrities on Instagram (Cristiano Ronaldo, Kylie Jenner, Leo Messi, Selena Gomez, The Rock, Ariana Grande, Kim Kardashian, Beyoncé, ...), followers in millions]

[Figure: bar plot of the most popular categories on Instagram (Music, Cinema & Actors/actresses, Sports with a ball, Lifestyle, Shows, Modeling, Beauty, ...), by number of times each category occurred]

[Figure: bar plot of audience countries for the Lifestyle category, the output of demand(df_i, 'Lifestyle')]
count 997.000000
mean 25.539619
std 40.586338
min 2.600000
25% 9.000000
50% 14.600000
75% 26.500000
max 487.200000
Name: newFollowers, dtype: float64

Explanation:
This code is analyzing data related to social media influencers on Instagram. It uses various Python
libraries such as NumPy, Pandas, Matplotlib, and Seaborn for data processing, visualization, and analysis. The
code loads a CSV file containing data related to social media influencers and performs various data cleaning
and manipulation operations such as dropping duplicates, converting data types, and renaming columns.
After cleaning the data, the code generates various visualizations such as a bar plot of the top 15
most followed celebrities on Instagram, a bar plot of the most popular categories followed on Instagram, and
a bar plot of the countries with the highest demand for different product categories.
Lastly, the code defines several functions that can be used to filter the data based on different criteria
such as country, category, and number of followers. These functions can be used to analyze the engagement
rate of influencers, identify the most popular categories among followers, and find influencers with high
engagement rates in specific categories and countries.
Conclusion:
Structure-based social media analytics is a crucial step in understanding the network of the audience
and identifying influential users for businesses. It involves analyzing the social network structure based on user
profiles, followers, and interactions between users.
Developing a structure-based social media analytics model involves data collection, network
construction, community detection, influence analysis, model development, model evaluation, and
implementation. By using structure-based social media analytics, businesses can improve their social media
strategy, identify influential users, and engage with them to improve their social media presence.

EXPERIMENT NO. 7
Aim: Develop a dashboard and reporting tool based on real time social media data.
Theory:
Developing a dashboard and reporting tool based on real-time social media data is a critical step in
monitoring the performance of a business's social media strategy. This experiment covers the theory
behind social media dashboards, the execution steps to develop the dashboard and reporting tool, and the
benefits it can offer for businesses.
A social media dashboard is a tool that provides real-time monitoring of social media activities,
including engagement, reach, and impressions. It helps businesses to track the performance of their social
media strategy and identify areas for improvement. The dashboard can provide data visualization, including
charts and graphs, to enable users to interpret the data easily.
Reporting tools are used to create regular reports based on the data gathered by the dashboard. These
reports can provide insights into the performance of the social media strategy, identify trends, and suggest
areas for improvement.
Execution Steps:
1. Identify Key Performance Indicators (KPIs): Identify the KPIs that are important for the business,
such as engagement rate, reach, and impressions.
2. Data Collection: Collect data from social media platforms, including user profiles, followers,
and interactions between users.
3. Data Processing: Pre-process the data to remove any irrelevant data and prepare it for analysis.
4. Data Visualization: Use data visualization tools such as charts and graphs to represent the data in
an easily understandable format.
5. Dashboard Development: Develop a dashboard using a tool such as Tableau, Power BI, or Google
Data Studio.
6. Reporting Tool Development: Develop a reporting tool that generates regular reports based on the
data collected by the dashboard.
7. Implementation: Implement the dashboard and reporting tool to provide real-time monitoring of
social media activities and generate regular reports for the business.
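For a code-based alternative to the dashboard tools named in step 5, a minimal sketch with Streamlit follows; the file name, column names, and CSV feed are assumptions standing in for a live data source:

# A minimal real-time dashboard sketch using Streamlit.
# Run with: streamlit run dashboard.py
import pandas as pd
import streamlit as st

st.title("Social Media Dashboard")

# In a real deployment this would be refreshed from a live API or database;
# here a hypothetical CSV of hourly metrics stands in for the live feed.
data = pd.read_csv("social_metrics.csv", parse_dates=["timestamp"])

# Headline KPIs
col1, col2, col3 = st.columns(3)
col1.metric("Likes", int(data["likes"].sum()))
col2.metric("Comments", int(data["comments"].sum()))
col3.metric("Avg. engagement rate", f"{data['engagement_rate'].mean():.2f}%")

# Engagement over time
st.line_chart(data.set_index("timestamp")[["likes", "comments"]])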
Code:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import string
import re
import
textblob
from textblob import TextBlob

37
from wordcloud import wordcloud, STOPWORDS

38
from wordcloud import ImageColorGenerator
import warnings
%matplotlib inline
import os
#using os library to call CLI commands in python
os.system("snscrape --jsonl --max-results 10000 --since 2023-03-13 twitter-
search 'CHATGPT4' > text-chatGPT4-tweets.json")
#create a pandas dataframe
tweets_df_chatGPT4 = pd.read_json('text-chatGPT4-tweets.json',lines=True)
#tweets_df_chatGPT4
# 2. Data Loading
df_chatGPT4 = tweets_df_chatGPT4[["date", "rawContent","renderedContent","user","repl
yCount","retweetCount","likeCount","lang","place","hashtags","viewCount"]]
print(df_chatGPT4.shape)
# 3. Twitter Data Cleaning, Preprocessing and Exploratory Data Analysis
df2 = df_chatGPT4.drop_duplicates('renderedContent')
# shape of DataFrame
print(df2.shape)
df2.head()
df2.info()
df2.date.value_counts()
#Heatmap for missing values
plt.figure(figsize=(17,5))
sns.heatmap(df2.isnull(),cbar=True,yticklabels=False)
plt.xlabel("Column_Name", size=14,weight="bold")
plt.title("Places of missing values is cloumn",fontweight="bold",size=17)
plt.show()
Output:

df2.head() — the first five rows (rawContent and renderedContent are identical; the user column holds truncated snscrape User objects {'_type': 'snscrape.modules.twitter.User', 'us...}; place is None for all five rows):

   date                       renderedContent                                     replyCount  retweetCount  likeCount  lang  hashtags                                            viewCount
0  2023-04-01 12:46:53+00:00  So I keep using up all the ChatGPT4 20 questio...  0           0             0          en    None                                                NaN
1  2023-04-01 12:46:32+00:00  @theDontGetRekt @mreflow No in ChatGPT4. You c...  0           0             0          en    None                                                1.0
2  2023-04-01 12:43:04+00:00  Italia Berencana Memblokir ChatGPT, Kenapa Ya?...  0           0             0          in    [ChatGPT, chatgpt4, Italia, OpenAIChatGPT, Ope...  6.0
3  2023-04-01 12:40:41+00:00  essa parada do chat gpt4 é sinistra, daqui a p...  0           0             0          pt    None                                                7.0
4  2023-04-01 12:40:29+00:00  Basically : \n\nI had expressed my feelings. ...   1           0             0          en    [ChatGPT4, ChatGPT5, technology, OpenAI, ChatG...  13.0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3335 entries, 0 to 3368
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   date             3335 non-null   datetime64[ns, UTC]
 1   rawContent       3335 non-null   object
 2   renderedContent  3335 non-null   object
 3   user             3335 non-null   object
 4   replyCount       3335 non-null   int64
 5   retweetCount     3335 non-null   int64
 6   likeCount        3335 non-null   int64
 7   lang             3335 non-null   object
 8   place            54 non-null     object
 9   hashtags         1949 non-null   object
 10  viewCount        3332 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(1), int64(3), object(6)
memory usage: 312.7+ KB
Explanation:
• This code is for collecting and analyzing tweets related to a specific Twitter user or topic.
• The code uses the Python libraries pandas, snscrape, numpy, matplotlib, seaborn, nltk, re, textblob,
and wordcloud to perform data cleaning, preprocessing, and exploratory data analysis on the collected
tweets.
• The snscrape library is used to scrape Twitter data based on a search query, in this case tweets
mentioning "CHATGPT4". The collected data is stored in a JSON file and then read into a pandas
dataframe.
• The dataframe is then cleaned to remove duplicate tweets and missing values. Exploratory data
analysis is performed using visualizations such as a heatmap to show the places of missing values in
the dataframe.
• The code also uses various natural language processing techniques such as tokenization, stopword
removal, stemming, and sentiment analysis using the textblob library. Finally, a word cloud is
generated to visualize the most frequent words used in the collected tweets; a hedged sketch of these
final steps appears below.
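The sentiment and word-cloud steps are not shown in the listing above. The following is a hedged sketch of how they could be completed, assuming the df2 dataframe from the script (with its renderedContent column) is still in scope.

from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Polarity lies in [-1, 1]; values below zero suggest negative sentiment.
df2["polarity"] = df2["renderedContent"].astype(str).apply(
    lambda t: TextBlob(t).sentiment.polarity
)
print(df2["polarity"].describe())

# Word cloud of the most frequent words across all collected tweets.
text = " ".join(df2["renderedContent"].astype(str))
wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               width=800, height=400).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()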
Conclusion:
Developing a dashboard and reporting tool based on real-time social media data is a critical step in
monitoring the performance of a business's social media strategy. It involves identifying the KPIs, collecting
and pre-processing data, visualizing data, and developing the dashboard and reporting tool.
By using a social media dashboard and reporting tool, businesses can monitor the performance of their
social media strategy in real-time, identify areas for improvement, and generate regular reports to improve their
social media presence.
EXPERIMENT NO. 8
Aim: Design the creative content for promotion of your business on social media platform.
Theory:
Designing creative content for promotion of your business on social media platform is a crucial aspect
of building a strong online presence. In this article, we will discuss the theory behind designing creative content
for social media, the execution steps to create effective content, and the benefits it can offer for businesses.
Designing creative content for social media requires a good understanding of your target audience and
the social media platform you plan to use. Effective content must capture the audience's attention and
communicate the brand's message. Content can take various forms, such as text, images, videos, or
infographics. The content must be engaging, informative, and relevant to the target audience.
Execution Steps:
1. Define the Target Audience: Identify the target audience, including their interests, preferences,
and behaviors, to create content that resonates with them.
2. Choose the Social Media Platform: Choose the social media platform based on the target audience
and the business's goals. Different platforms have different formats and audience demographics.
3. Develop a Content Strategy: Develop a content strategy that aligns with the business's goals and
the target audience's needs. The strategy should include the type of content, frequency, and tone.
4. Create Content: Create content that aligns with the content strategy, using various formats such as
text, images, videos, or infographics. Ensure that the content is relevant, informative, and engaging.
5. Optimize Content: Optimize the content for the chosen social media platform, such as using the
right hashtags, keywords, and image sizes.
6. Schedule Content: Schedule the content using social media management tools such as Hootsuite,
Buffer, or Sprout Social (see the content-calendar sketch after these steps).
7. Analyze Performance: Analyze the performance of the content using social media analytics tools such
as Facebook Insights, Twitter Analytics, or Google Analytics. Use the insights to refine the content
strategy.
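As a small illustration of step 6, the sketch below lays out a one-week content calendar as a pandas dataframe; every date, platform, and theme here is hypothetical, and a scheduling tool such as Hootsuite or Buffer would consume a plan like this.

import pandas as pd

# A hypothetical posting schedule (all values are illustrative).
calendar = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "platform": ["Instagram", "Twitter", "LinkedIn", "Instagram", "YouTube"],
    "format": ["image", "text", "infographic", "video", "video"],
    "theme": ["product showcase", "tech tip", "industry insight",
              "behind the scenes", "tutorial"],
})
print(calendar.to_string(index=False))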
A social media ad campaign to promote a fictional technology business called "TechBoost":
Ad #1: Image: A sleek, modern laptop with the TechBoost logo on the screen.
"Upgrade your tech game with TechBoost. Our high-performance laptops are designed to keep up with
your busy lifestyle, whether you're a student, professional, or gamer. Shop now and experience the power
of TechBoost!"

Ad #2: Image: A person working on their laptop in a coffee shop, with the TechBoost logo visible on the
back of their laptop screen.

"Take your work with you wherever you go. With TechBoost, you can work from anywhere with ease.
Our laptops are lightweight and easy to carry, so you can stay productive on the go."

Ad #3: Image: A person gaming on a TechBoost laptop, with a high-resolution game visible on the screen.
"Experience the ultimate gaming performance with TechBoost. Our laptops are equipped with the
latest graphics and processors, so you can enjoy your favorite games at the highest level. Get yours now
and take your gaming to the next level!"
Overall, the campaign showcases TechBoost's high-performance laptops, portability, and versatility.
The use of sleek, modern visuals and bold copy is intended to appeal to tech-savvy individuals who are
looking for high-quality products.

Conclusion:
Designing creative content for promotion of your business on social media platform is critical to building a
strong online presence.
It involves understanding the target audience, choosing the right social media platform, developing a
content strategy, creating and optimizing the content, scheduling the content, and analyzing its performance.
By creating engaging and relevant content, businesses can attract and retain the target audience, increase brand
awareness, and achieve their business goals.
EXPERIMENT NO. 9
Aim: Analyze competitor activities using social media data.
Theory:
Analyzing competitor activities using social media data is an essential part of a business's social media
strategy. It helps businesses understand their competitors' strengths and weaknesses, identify new opportunities,
and improve their own social media performance. In this article, we will discuss the theory behind analyzing
competitor activities using social media data, the execution steps to perform this analysis, and the benefits it can
offer for businesses.
Analyzing competitor activities using social media data involves monitoring and analyzing their social
media activities, such as their content, engagement metrics, audience demographics, and advertising campaigns.
It helps businesses gain insights into their competitors' social media strategies, benchmark their own
performance, and identify areas for improvement. Effective competitor analysis requires a good understanding
of the social media platforms used by competitors and the tools available to monitor their activities.
Execution Steps:
1. Identify Competitors: Identify the competitors that the business wants to analyze based on
their industry, target audience, and social media presence.
2. Determine Social Media Platforms: Determine the social media platforms used by the competitors
and the frequency and types of content they post.
3. Monitor Competitor Activity: Monitor the competitors' social media activities using social media
management tools such as Hootsuite, Buffer, or Sprout Social. This helps identify the frequency
and type of content posted by the competitors.
4. Analyze Engagement Metrics: Analyze engagement metrics such as likes, comments, shares,
and followers, to identify the type of content that resonates with the audience and the level of
audience engagement.
5. Evaluate Advertising Campaigns: Evaluate the competitors' advertising campaigns using tools such
as Facebook Ads Library or Twitter Ads Transparency Center, to determine the target audience and ad
spend.
6. Benchmark Performance: Benchmark the business's social media performance against the
competitors using metrics such as audience growth, engagement rate, and advertising spend (a worked
engagement-rate example follows these steps).
7. Identify Opportunities: Identify new opportunities for the business based on the insights gained
from the competitor analysis, such as new content ideas, audience demographics, or advertising
strategies.
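As a worked example of step 6, the sketch below computes a common benchmarking metric, engagement rate = (likes + comments + shares) / followers * 100, for a business and two competitors. All account names and numbers are hypothetical.

import pandas as pd

# Hypothetical monthly totals (all values are illustrative).
data = pd.DataFrame({
    "account": ["Our Business", "Competitor A", "Competitor B"],
    "followers": [12000, 45000, 8500],
    "likes": [340, 980, 410],
    "comments": [41, 76, 52],
    "shares": [22, 54, 30],
})

# Engagement rate per account, as a percentage of followers.
data["engagement_rate_%"] = (
    (data["likes"] + data["comments"] + data["shares"]) / data["followers"] * 100
).round(2)
print(data[["account", "engagement_rate_%"]])

Note how the smallest account can show the highest engagement rate (here Competitor B, at about 5.8%), which is why raw follower counts alone make a poor benchmark.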
Code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/linkedin-influencers-data/influencers_data.csv')
df.head(3)
#df.tail(3)
df.shape
df.columns
df.info()
df.describe()
df['name'].unique()
df['name'].unique().shape
df.isna().sum()
df = df.drop(columns=['views', 'votes', 'media_type', 'content', 'connections', 'location'])
df.isna().sum()
df_first = df[(df['name']=='Nicholas Wyman')]
df_first.head()
df_first.info()
df_first = df_first.dropna(subset=['followers'])
df_first.isna().sum()
df_first['followers'] = df_first['followers'].astype(int)
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_first['time_spent'], df_first['followers'], color='gray')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Followers', fontsize='25')
plt.grid()
plt.show()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_first['time_spent'], df_first['reactions'], color='forestgreen')
ax.bar(df_first['time_spent'], df_first['comments'], color='Blue')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Reaction Vs. Comments', fontsize='25')
plt.legend(['Reactions', 'Comments'])
plt.grid()
plt.show()

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_first['time_spent'], df_first['num_hashtags'], color='Purple')
ax.bar(df_first['time_spent'], df_first['hashtag_followers'], color='Lightseagreen')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Numbers', fontsize='11')
plt.title('Number of Hashtags Vs. Hashtag Followers', fontsize='25')
plt.legend(['Number of Hashtags', 'Hashtag Followers'])
plt.grid()
plt.show()
df_tom = df[(df['name']=='Tom Goodwin')]

# df_tom.head()
df_tom = df_tom.dropna(subset=['followers'])
df_tom.isna().sum()
df_tom.info()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_tom['time_spent'], df_tom['followers'], color='gray')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Followers', fontsize='25')
plt.grid()
plt.show()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_tom['time_spent'], df_tom['reactions'], color='forestgreen')
ax.bar(df_tom['time_spent'], df_tom['comments'], color='Blue')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Reaction Vs. Comments', fontsize='25')
plt.legend(['Reactions', 'Comments'])
plt.grid()
plt.show()

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_tom['time_spent'], df_tom['num_hashtags'], color='Purple')
ax.bar(df_tom['time_spent'], df_tom['hashtag_followers'], color='Lightseagreen')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Numbers', fontsize='11')
plt.title('Number of Hashtags Vs. Hashtag Followers', fontsize='25')
plt.legend(['Number of Hashtags', 'Hashtag Followers'])
plt.grid()
plt.show()

Output:

df.describe():

             Unnamed: 0     followers  num_hashtags  hashtag_followers      reactions      comments  views
count      34012.000000  3.397000e+04  34012.000000            34012.0   34012.000000  34012.000000    0.0
mean       17005.500000  1.125922e+06      2.099788                0.0     472.956486     26.977273    NaN
std         9818.563014  3.057750e+06      3.517457                0.0    4163.929944    216.364372    NaN
min            0.000000  1.710000e+02      0.000000                0.0       0.000000      0.000000    NaN
25%         8502.750000  9.914800e+04      0.000000                0.0       7.000000      0.000000    NaN
50%        17005.500000  4.082540e+05      0.000000                0.0      36.000000      2.000000    NaN
75%        25508.250000  7.193340e+05      3.000000                0.0     143.000000     14.000000    NaN
max        34011.000000  1.828935e+07     48.000000                0.0  391498.000000  32907.000000    NaN
array(['Nicholas Wyman', 'Jonathan Wolfer', 'Karen Gross',
       'Kaia Niambi Shivers Ph.D.', "Daniel Cohen-I'm Flyering",
       'Natalie Riso', 'Dale Corley', 'James Calder', 'Yasi Baiani',
       'Julie Kliger', 'Stephanie C. O.', 'Michelle de Haaff',
       'Bertalan Meskó, MD, PhD', 'Michelle Chaffee', 'Beth Seidenberg',
       'Russell Benaroya', 'Richard Branson', 'Mohamed El-Erian',
       'James Altucher', 'Bernard Marr', 'Ian Bremmer', 'Sramana Mitra',
       'Lynne Everatt', 'Justin Bariso', 'Carson Tate', 'Gary Frisch',
       'James Bareham', 'Tai T.', 'Glenn Leibowitz', 'Marianne Griebler',
       'Tom Goodwin', 'Katie Martell', 'Shama Hyder',
       'Barry W. Enderwick', 'Steve Blakeman', 'Gillian Zoe Segal',
       'Tom Foremski', 'Kiara Imani Williams, Esq.', 'Kellye Whitney',
       'Simon Owens', 'Rachel Jones', 'Vikram Mansharamani',
       ' Pascal Bouvier', 'Geoffrey Garrett', 'Ben Casselman',
       'Tamal Bandyopadhyay', 'Karen Webster', 'Jody Padar',
       'Hansi Mehrotra', 'Nick Ciubotariu', 'Neil Hughes', 'Nir Eyal',
       'Shelly Palmer', 'Lee Naik', 'Danielle Newnham', 'Vani Kola',
       'Chris McCann', 'Andrew Yang', 'Lisa Abeyta',
       'Juliet de Baubigny', 'Sarah Kauss', 'Pocket Sun',
       'Chantel Soumis', 'String Nguyen', 'Quentin Michael Allums',
       'AJ Wilcox', "Kevin O'Leary", 'Amy Blaschka', 'Simon Sinek'],
      dtype=object)
[Bar charts for each influencer: 'Followers', 'Reaction Vs. Comments', and 'Number of Hashtags Vs. Hashtag Followers', each plotted against time spent]
Explanation:
This code is used for analyzing LinkedIn influencers' data by reading a CSV file containing data about
their name, time spent on LinkedIn, number of followers, reactions, comments, hashtags, and hashtag followers.
The code drops some columns from the DataFrame, cleans the data, and creates visualizations to
compare the data of two influencers named Nicholas Wyman and Tom Goodwin. The visualizations show the
number of followers, reactions, comments, number of hashtags, and hashtag followers against the time spent
on LinkedIn for each influencer.
Conclusion:
Analyzing competitor activities using social media data is crucial for businesses to improve their social
media performance, gain new insights, and identify new opportunities.
It involves identifying competitors, monitoring their social media activities, analyzing engagement
metrics, evaluating advertising campaigns, benchmarking performance, and identifying opportunities. By
performing effective competitor analysis, businesses can gain a competitive edge and achieve their social
media goals.
EXPERIMENT NO. 10
Aim: Develop social media text analytics models for improving existing product/ service by analyzing
customer‘s reviews/comments.
Theory:
Developing social media text analytics models for improving existing products/services by analyzing
customer reviews/comments is an essential part of a business's social media strategy. It helps businesses
understand customer sentiment and feedback, identify areas for improvement, and take necessary actions to
improve their products/services. In this article, we will discuss the theory behind developing social media text
analytics models, the execution steps to perform this analysis, and the benefits it can offer for businesses.
Developing social media text analytics models involves collecting and analyzing customer
reviews/comments from social media platforms such as Facebook, Twitter, and LinkedIn. Text analytics
techniques such as sentiment analysis, topic modeling, and opinion mining can be used to gain insights into
customer sentiment and feedback, identify areas for improvement, and improve existing products/services.
Execution Steps:
1. Collect Customer Reviews/Comments: Collect customer reviews/comments from social media
platforms such as Facebook, Twitter, and LinkedIn using social media management tools such
as Hootsuite, Buffer, or Sprout Social.
2. Preprocess the Data: Preprocess the data by removing irrelevant information such as URLs,
hashtags, and mentions, and perform text normalization techniques such as tokenization, stemming,
and lemmatization.
3. Perform Sentiment Analysis: Perform sentiment analysis to identify the polarity of customer
reviews/comments, whether positive, negative, or neutral, using techniques such as rule-based,
machine learning, or hybrid approaches.
4. Perform Topic Modeling: Perform topic modeling to identify the topics mentioned in customer
reviews/comments using techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative
Matrix Factorization (NMF); a minimal LDA sketch follows these steps.
5. Perform Opinion Mining: Perform opinion mining to identify the opinion holders and their views
on specific product/service aspects, such as quality, price, or customer service.
6. Identify Areas for Improvement: Identify the areas for improvement based on the insights gained
from the analysis, such as improving product quality, pricing strategy, or customer service.
7. Take Necessary Actions: Take necessary actions to address the identified areas for improvement,
such as revising product/service features, pricing, or customer support.
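As a minimal sketch of the topic-modeling step, the snippet below fits a two-topic LDA model with scikit-learn on a handful of illustrative review strings (placeholders, not real data) and prints the top words per topic.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative, preprocessed review texts (hypothetical data).
reviews = [
    "battery life is great but the screen scratches easily",
    "customer service was slow to respond to my refund request",
    "excellent screen quality and the battery lasts all day",
    "refund took weeks and support emails went unanswered",
]

# Term-frequency matrix, then a 2-topic LDA fit.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Show the five highest-weighted words for each topic.
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")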
Code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings("ignore")
from nltk.corpus import stopwords
import nltk
import re
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns

reviews_data = pd.read_csv("/kaggle/input/kindle-reviews/kindle_reviews.csv")
reviews_data.shape
reviews_data.columns
reviews_data.head()
# Renaming Columns (Names)

reviews_data.rename(columns={'asin':'ProductId', 'overall':'Score', 'reviewText':'Text', 'reviewerID':'UserId', 'summary':'Summary', 'unixReviewTime':'Time'}, inplace=True)

reviews_data.head()
# Drop Unwanted Columns
reviews_data.drop(['Unnamed: 0','helpful','reviewTime','reviewerName'], axis=1, inplace=True)

reviews_data.head()
# Distribution of Reviews Per Score
category_dist = reviews_data['Score'].value_counts()

plt.figure(figsize=(10,6))
my_colors = ['g', 'r', 'b', 'm', 'y']
category_dist.plot(kind='bar', color=my_colors)
plt.grid()
plt.xlabel("Scores")
plt.ylabel("Number of Reviews Per Score")
plt.title("Distribution of Reviews Per Score")
plt.show()

reviews_data = reviews_data.loc[reviews_data['Score'] != 3]

reviews_data.shape

#give reviews with Score > 3 a positive rating and reviews with a score < 3 a negative rating
def partition(x):
    if x < 3:
        return 'Negative'
    else:
        return 'Positive'

actualScore = reviews_data['Score']
pos_neg = actualScore.map(partition)
reviews_data['Score'] = pos_neg
reviews_data.head()
reviews_data = reviews_data.head(50000) #considering only 50k rows

category_dist = reviews_data['Score'].value_counts()

plt.figure(figsize=(10,6))
my_colors = ['g', 'r']
category_dist.plot(kind='bar', color=my_colors)
plt.grid()
plt.xlabel("Scores")
plt.ylabel("Number of Reviews")
plt.title("Distribution of Reviews")
plt.show()

reviews_data[reviews_data['UserId']=='A3SPTOKDG7WBLN']
#sorting data according to ProductId in ascending order
reviews_data = reviews_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

#deduplication of entries
reviews_data = reviews_data.drop_duplicates(subset={"ProductId","UserId","Time","Text"}, keep='first', inplace=False)
reviews_data.shape
# Data Preprocessing
#loading_the_stop_words_from_nltk_library_
stop_words = set(stopwords.words('english'))

def txt_preprocessing(total_text, index, column, df):
    if type(total_text) is not int:
        string = ""

        #replace_every_special_char_with_space
        total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)

        #replace_multiple_spaces_with_single_space
        total_text = re.sub('\s+', ' ', total_text)

        #converting_all_the_chars_into_lower_case
        total_text = total_text.lower()

        for word in total_text.split():
            #if_the_word_is_not_a_stop_word_then_retain_that_word_from_the_data
            if not word in stop_words:
                string += word + " "

        df[column][index] = string

for index, row in reviews_data.iterrows():
    if type(row['Text']) is str:
        txt_preprocessing(row['Text'], index, 'Text', reviews_data)
    else:
        print("THERE IS NO TEXT DESCRIPTION FOR ID :", index)

reviews_data.head()
#checking null values
reviews_data.isna().sum()
#removing null values(row)
reviews_data.dropna(axis=0, inplace=True)
reviews_data.isna().sum()
reviews_data.shape
reviews_data.tail()
reviews_data['Score'].value_counts()

#ref: https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/
from sklearn.utils import resample

#create two different dataframe of majority and minority class

cls_majority = reviews_data[(reviews_data['Score']=='Positive')]
cls_minority = reviews_data[(reviews_data['Score']=='Negative')]

# upsample minority class


cls_minority_upsampled = resample(cls_minority,
                                  replace=True,     # sample with replacement
                                  n_samples=44381,  # to match majority class
                                  random_state=42)  # reproducible results
# Combine majority class with upsampled minority class
upsampled_data = pd.concat([cls_minority_upsampled, cls_majority])

upsampled_data.head()
upsampled_data.shape
upsampled_data['Score'].value_counts()
# Train Test Split
from sklearn.model_selection import train_test_split
X = upsampled_data['Text']
Y = upsampled_data['Score']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, stratify=Y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.20, stratify=y_train, random_state=0)

print("NUMBER OF DATA POINTS IN TRAIN DATA :", X_train.shape[0])


print("NUMBER OF DATA POINTS IN CROSS VALIDATION DATA :", X_cv.shape[0])
print("NUMBER OF DATA POINTS IN TEST DATA :", X_test.shape[0])
# TF-IDF
#perform_tfidf_vectorization_of_text_data
from sklearn.feature_extraction.text import TfidfVectorizer

text_vec = TfidfVectorizer(min_df=10, max_features=5000)
text_vec.fit(X_train.values)

train_text = text_vec.transform(X_train.values)
test_text = text_vec.transform(X_test.values)
cv_text = text_vec.transform(X_cv.values)

print("Shape of Matrix - TFIDF")


print(train_text.shape)
print(test_text.shape)
print(cv_text.shape)
# Confusion / Precision / Recall Matrix
#this_function_plots_the_confusion_matrices_given_y_i_and_y_i_hat_
from sklearn.metrics import confusion_matrix
import seaborn as sns
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)  # confusion_mat
    A = (((C.T)/(C.sum(axis=1))).T)          # recall_mat
    B = (C/C.sum(axis=0))                    # precision_mat

    labels = [0, 1]

    #representing_C_in_heatmap_format
    print("-"*40, "Confusion Matrix", "-"*40)
    plt.figure(figsize=(8,5))
    sns.heatmap(C, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

    #representing_B_in_heatmap_format
    print("-"*40, "Precision Matrix (Column Sum=1)", "-"*40)
    plt.figure(figsize=(8,5))
    sns.heatmap(B, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

    #representing_A_in_heatmap_format
    print("-"*40, "Recall Matrix (Row Sum=1)", "-"*40)
    plt.figure(figsize=(8,5))
    sns.heatmap(A, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
# Logistic Regression Model
#train a logistic regression + calibration model using text features which are tfidf encoded
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

alpha = [10 ** x for x in range(-5, 1)]

cv_log_error_array = []
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_text, y_train)

    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_text, y_train)

    predict_y = sig_clf.predict_proba(cv_text)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For Values of Alpha =', i, "The Log Loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array, c='r')
for i, txt in enumerate(np.round(cv_log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))

plt.grid()
plt.title("Cross Validation Error for Each Alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error Measure")
plt.show()

best_alpha = np.argmin(cv_log_error_array)

clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_text, y_train)

lr_sig_clf = CalibratedClassifierCV(clf, method="sigmoid")


lr_sig_clf.fit(train_text, y_train)

predict_y = lr_sig_clf.predict_proba(train_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Train Log Loss is:", log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))

predict_y = lr_sig_clf.predict_proba(test_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Test Log Loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

predict_y = lr_sig_clf.predict_proba(cv_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Cross Validation Log Loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

lr_train_accuracy = (lr_sig_clf.score(train_text, y_train)*100)


lr_test_accuracy = (lr_sig_clf.score(test_text, y_test)*100)
lr_cv_accuracy = (lr_sig_clf.score(cv_text, y_cv)*100)

print("Logistic Regression Train Accuracy -",lr_train_accuracy)


print("Logistic Regression Test Accuracy -",lr_test_accuracy)
print("Logistic Regression CV Accuracy -",lr_cv_accuracy)

plot_confusion_matrix(y_cv, lr_sig_clf.predict(cv_text.toarray()))
# Predict - Test Data
test_pred = lr_sig_clf.predict(test_text)
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))
test_pred_list = test_pred.tolist()
test_pred_list[:5]
final_test_df = pd.DataFrame({'Text':X_test, 'Review':test_pred_list})
final_test_df.head(10)
final_test_df.values[5]

Output:

[Bar chart: 'Distribution of Reviews' — counts of Positive vs. Negative reviews after relabeling]

[Line plot: 'Cross Validation Error for Each Alpha' — cross-validation log loss annotated at each alpha value, lowest near alpha = 1e-05]

For Values of Best Alpha = 1e-05 The Train Log Loss is: 0.15253560593949955
For Values of Best Alpha = 1e-05 The Test Log Loss is: 0.18438127714047778
For Values of Best Alpha = 1e-05 The Cross Validation Log Loss is: 0.1874222604802398

[Heatmaps: Confusion Matrix, Precision Matrix (Column Sum=1), and Recall Matrix (Row Sum=1) for the validation set]
Explanation:
The code is an implementation of sentiment analysis on Kindle product reviews. The dataset contains
information about Kindle products, including the reviews given by customers for each product. The code
preprocesses the data by removing unwanted columns, cleaning the text, handling null values, and balancing the
dataset. The balanced dataset is then split into training, validation, and testing sets.
The sentiment analysis itself is done with a machine learning model: the cleaned reviews are vectorized
with TF-IDF, and a logistic regression classifier (an SGDClassifier with log loss, wrapped in
CalibratedClassifierCV for probability calibration) is tuned over several alpha values using cross-validation
log loss, then evaluated with accuracy scores, a classification report, and confusion/precision/recall matrices.
The code uses the Pandas library for data manipulation and Matplotlib for data visualization. It also
uses the NLTK library to remove stop words and perform text cleaning. Finally, it uses Scikit-learn's
train_test_split and resample functions for dataset splitting and balancing, respectively.
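As a hedged usage sketch, once the TF-IDF vectorizer (text_vec) and calibrated classifier (lr_sig_clf) above have been fitted, new, unseen comments can be scored directly; the example comments below are invented for illustration.

# Score new comments with the fitted vectorizer and calibrated classifier.
new_comments = [
    "absolutely love this kindle, the battery lasts forever",
    "screen froze twice in a week, very disappointed",
]
new_vec = text_vec.transform(new_comments)  # reuse the fitted vectorizer
print(lr_sig_clf.predict(new_vec))          # e.g. ['Positive' 'Negative']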
Conclusion:
Developing social media text analytics models for improving existing products/services by analyzing
customer reviews/comments is crucial for businesses to improve customer satisfaction, gain new insights, and
identify areas for improvement.
It involves collecting customer reviews/comments, preprocessing the data, performing sentiment
analysis, topic modeling, and opinion mining, identifying areas for improvement, and taking necessary actions.
By performing effective text analytics, businesses can gain a competitive edge, improve customer loyalty, and
achieve their business goals.
