Sma Exp4 Ayu
4
Roll No.: B856
Date:
Aim: To study exploratory data analysis (EDA) and visualization of social media data for business using
Python, with charts such as the histogram, line chart, pie chart, and scatter plot.
Theory:
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an essential step in the data analysis process that involves examining,
summarizing, and visualizing data to understand its structure, detect patterns, and uncover hidden insights. It
serves as the foundation for data-driven decision-making and is commonly used in business analytics, machine
learning, and statistical modeling.
EDA is not just about looking at numbers; it is about understanding the story behind the data. It helps analysts
identify errors, missing values, outliers, and relationships between variables, ensuring that the data is clean
and ready for further analysis. Without a thorough EDA process, analysts risk making inaccurate assumptions
or drawing misleading conclusions.
Importance of EDA
EDA plays a critical role in transforming raw data into meaningful information. It allows businesses to make
informed decisions by answering key questions about the dataset. For example, in the retail sector, EDA can
help businesses understand customer buying patterns, sales trends, and product demand fluctuations. In
finance, it can help detect fraudulent transactions, while in healthcare, it can identify risk factors for diseases.
EDA is particularly valuable for:
• Identifying missing or inconsistent data
• Detecting outliers and unusual patterns
• Understanding data distributions
• Examining relationships between different variables
• Validating assumptions before applying predictive models
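The first checks in the list above can be sketched with pandas. This is a minimal illustration, not part of the experiment's code: the `posts` table, its column names, and its values are hypothetical and were made up for this example.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
posts = pd.DataFrame({
    "status": ["verified", "Verified ", "not verified", "verified", None],
    "likes": [120, 135, 128, 120, 140],
})

# Count missing values in each column.
missing_counts = posts.isnull().sum()

# Normalize inconsistent labels (extra spaces, mixed case) before counting them.
posts["status"] = posts["status"].str.strip().str.lower()
label_counts = posts["status"].value_counts()

# Count rows that are exact duplicates of an earlier row.
duplicate_rows = posts.duplicated().sum()

print(missing_counts)
print(label_counts)
print("Duplicate rows:", duplicate_rows)
```

Normalizing the labels first matters here: without it, "verified" and "Verified " would be counted as two separate categories.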
Data Visualization
Data visualization is the graphical representation of data and information, making it easier to interpret. It
involves using visual elements such as charts, graphs, and
maps to make complex data more accessible, understandable, and useful for decision-making. In the modern
world, where businesses and organizations generate vast amounts of data, visualization plays a crucial role in
transforming raw numbers into meaningful insights.
People generally take in visual information more quickly than raw text or numerical data. This makes data
visualization an essential tool in data analysis, as it allows stakeholders to quickly grasp trends, patterns, and
anomalies that might be difficult to detect in spreadsheets or databases.
Types of Data Visualization
Different types of data visualization methods are used depending on the nature of the data and the insights
needed. Some of the most commonly used visualization techniques include:
1. Line Charts
Line charts are used to display trends over time. They connect data points with a continuous line, making it
easy to observe upward or downward trends. Businesses often use line charts to track monthly sales, stock
market movements, or website traffic over time.
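A trend is often easier to read when short-term noise is smoothed before plotting. The sketch below computes a 3-day rolling mean with pandas. The daily visit counts are hypothetical, and the window size of 3 is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical daily website visits, for illustration only.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
visits = pd.Series([100, 120, 90, 150, 170, 160], index=dates)

# 3-day rolling mean; the first two days have no full window and stay NaN.
rolling = visits.rolling(window=3).mean()

print(rolling)
```

Plotting `visits` and `rolling` on the same axes would show the raw series alongside its smoothed trend.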
2. Bar Charts
Bar charts represent categorical data with rectangular bars, where the length of each bar is proportional to the
value it represents. These charts are useful for comparing different categories, such as sales performance across
various regions or customer preferences for different products.
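A minimal Matplotlib sketch of such a comparison is given below. The region names and sales figures are hypothetical. The `Agg` backend is selected only so that the sketch runs without a display; in an interactive session that line can be dropped and `plt.show()` called at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# Hypothetical sales per region, for illustration only.
regions = ["North", "South", "East", "West"]
sales = [240, 180, 310, 150]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(regions, sales)  # one bar per category, height proportional to value
ax.set_title("Sales by Region")
ax.set_xlabel("Region")
ax.set_ylabel("Sales")
```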
3. Pie Charts
Pie charts divide data into slices, showing the proportion of each category in a dataset. They are commonly
used in market share analysis, financial reports, and customer segmentation studies. However, pie charts
should be used carefully, as too many slices can make interpretation difficult.
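One common way to avoid too many slices is to merge the small categories into a single "Other" slice before plotting. The sketch below assumes a 5% cut-off, which is an arbitrary choice for illustration. The category names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical post counts per content type, for illustration only.
counts = pd.Series(
    {"Video": 50, "Image": 30, "Text": 12, "Poll": 4, "Link": 3, "Audio": 1}
)

threshold = 0.05  # assumed cut-off: categories under 5% of the total are merged
shares = counts / counts.sum()

# Keep the large categories and sum the small ones into "Other".
grouped = counts[shares >= threshold].copy()
other_total = counts[shares < threshold].sum()
if other_total > 0:
    grouped["Other"] = other_total

print(grouped)
```

The resulting `grouped` series can be passed to `plt.pie` in the same way as any other series of counts.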
4. Scatter Plots
Scatter plots are used to show relationships between two numerical variables. Each point in the plot represents
an observation, helping analysts identify correlations or patterns. For example, a scatter plot can show how
advertising spend is related to sales revenue.
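The strength of a linear relationship seen in a scatter plot can be summarized with the Pearson correlation coefficient. The sketch below uses NumPy on hypothetical advertising and revenue figures that were made up for this example.

```python
import numpy as np

# Hypothetical advertising spend and sales revenue, for illustration only.
ad_spend = np.array([10, 20, 30, 40, 50])
revenue = np.array([25, 41, 62, 79, 103])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(ad_spend, revenue)[0, 1]

print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship. It does not by itself show that one variable causes the other.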
5. Histograms
Histograms display the distribution of numerical data by dividing it into bins or intervals. This helps in
understanding the spread and shape of the data, making it useful for analyzing customer age distribution,
income levels, or exam scores.
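The binning step behind a histogram can be shown directly with `numpy.histogram`, which returns the count in each bin and the bin edges. The ages and the bin edges below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical customer ages, for illustration only.
ages = np.array([18, 19, 21, 22, 22, 25, 27, 31, 34, 38, 41, 45])

# Bins are [10, 20), [20, 30), [30, 40), and [40, 50]; only the last bin
# includes its right edge.
counts, edges = np.histogram(ages, bins=[10, 20, 30, 40, 50])

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Each count corresponds to the height of one bar in the histogram.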
6. Box Plots (Box-and-Whisker Plots)
Box plots summarize data distributions and highlight key statistical measures such as the median, quartiles,
and outliers. They are particularly useful for comparing multiple datasets and identifying unusual values.
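The statistics that a box plot draws can be computed by hand. The sketch below uses the common 1.5 × IQR rule to flag outliers. The view counts are hypothetical and were chosen so that one value stands out.

```python
import numpy as np

# Hypothetical video view counts, for illustration only.
views = np.array([120, 150, 160, 175, 180, 200, 210, 950])

# Quartiles and median (the box and its center line).
q1, median, q3 = np.percentile(views, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = views[(views < lower_fence) | (views > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Outliers:", outliers)
```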
Data Visualization Tools and Libraries
Several tools and libraries are available for creating data visualizations. Python, one of the most popular
programming languages for data analysis, offers powerful libraries for visualization, including:
• Matplotlib: A foundational library for creating static, animated, and interactive visualizations.
• Seaborn: Built on Matplotlib, it provides advanced statistical visualization with aesthetically pleasing
themes.
• Plotly: Used for interactive and dynamic visualizations, making it ideal for web applications.
• Bokeh: Specialized in interactive and web-based visualizations.
Code:
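# Import the libraries used for loading and plotting the data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the TikTok dataset (the path is specific to the author's machine)
df = pd.read_csv(r"C:\Users\Ayushi\Desktop\submissions\SMA\tiktok_dataset.csv")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("Dataset Info:")
df.info()
print("\nSummary Statistics:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())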
plt.figure(figsize=(10, 6))
sns.histplot(df['video_duration_sec'], bins=30, kde=True)
plt.title('Histogram of Video Duration (seconds)')
plt.xlabel('Video Duration (seconds)')
plt.ylabel('Frequency')
plt.show()
# Line chart of the first 1,000 rows; the row index is used as the x-axis
df_line_graph = df.head(1000)
plt.figure(figsize=(10, 6))
plt.plot(df_line_graph.index, df_line_graph['video_view_count'], marker='o', linestyle='-')
plt.title('Line Chart of Video Views by Row Index (First 1,000 Rows)')
plt.xlabel('Row Index')
plt.ylabel('Video Views')
plt.show()
claim_status_counts = df['claim_status'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(claim_status_counts, labels=claim_status_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Pie Chart of Claim Status Distribution')
plt.show()
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['video_like_count'], y=df['video_share_count'], hue=df['verified_status'])
plt.title('Scatter Plot of Video Likes vs. Video Shares')
plt.xlabel('Video Likes')
plt.ylabel('Video Shares')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['author_ban_status'], y=df['video_view_count'])
plt.title('Box Plot of Video Views by Author Ban Status')
plt.xlabel('Author Ban Status')
plt.ylabel('Video Views')
plt.show()
numeric_df = df.select_dtypes(include=['number'])
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
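# Select the 10 rows with the highest view counts (top_videos was not defined in the original)
top_videos = df.nlargest(10, 'video_view_count')
plt.figure(figsize=(12, 6))
# video_id is converted to a string so that it is treated as a category on the y-axis
sns.barplot(y=top_videos['video_id'].astype(str), x=top_videos['video_view_count'], palette="viridis")
plt.title('Top 10 Videos by View Count')
plt.xlabel('View Count')
plt.ylabel('Video ID')
plt.show()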
Output:
Conclusion
EDA and data visualization are essential for business analytics because they turn raw data into insights that
support decision-making. In this experiment, Python's pandas, Matplotlib, and Seaborn libraries were used to
summarize a TikTok dataset and to plot a histogram, line chart, pie chart, scatter plot, box plot, correlation
heatmap, and bar chart.