INDEX

The document outlines three Exploratory Data Analysis (EDA) projects on different datasets: Global Superstore Sales, COVID-19 Global Data, and YouTube Trending Videos. Each project includes steps such as data loading, cleaning, analysis, visualizations, and insights derived from the data. Key findings highlight sales trends, COVID-19 case distributions, and video engagement patterns across the datasets.

Uploaded by

saruhasan1103

INDEX

S.NO    TOPIC                                       SIGN

1.      EDA ON GLOBAL SUPERSTORE SALES DATASET
2.      EDA ON COVID-19 GLOBAL DATASET
3.      EDA ON YOUTUBE TRENDING VIDEOS DATASET
Ex.No:1 EDA ON GLOBAL SUPERSTORE SALES DATASET

EXPLORATORY DATA ANALYSIS (EDA):


Exploratory Data Analysis (EDA) is the process of examining and understanding a
dataset before applying any modeling or predictive techniques. It involves summarizing the
dataset’s main characteristics using statistical measures and visualizations to uncover patterns,
spot anomalies, test hypotheses, and check assumptions. EDA typically includes cleaning the
data (handling missing values and duplicates), generating descriptive statistics (like mean,
median, and standard deviation), and using plots such as histograms, bar charts, and line graphs
to visualize trends and relationships. This step is crucial for gaining insights and making
informed decisions about the direction of further analysis or modeling.

DATA SOURCE:

Dataset link: https://www.kaggle.com/datasets/fatihilhan/global-superstore-dataset

STEP 1: LOAD THE DATASET

PROGRAM:

import pandas as pd

file_path = "/content/GLOBAL DATASTORE.csv"

df = pd.read_csv(file_path)

OUTPUT:
STEP 2: DATA CLEANING

• Check and remove missing values

• Remove duplicates

PROGRAM:

df.dropna(inplace=True)

df.drop_duplicates(inplace=True)
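
Before dropping rows, it can help to see how many are affected. The sketch below uses a small hypothetical DataFrame (not the Superstore file) to count missing values and duplicate rows, then applies the same two cleaning calls as above.

```python
import pandas as pd

# Hypothetical toy data standing in for the Superstore file
toy = pd.DataFrame({
    "Region": ["West", "West", "East", None],
    "Sales": [100.0, 100.0, 250.0, 80.0],
})

missing_per_column = toy.isna().sum()           # missing values per column
duplicate_rows = int(toy.duplicated().sum())    # rows identical to an earlier row

# Same cleaning as in the program above, without modifying toy in place
cleaned = toy.dropna().drop_duplicates()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("rows before/after:", len(toy), len(cleaned))
```

Comparing the row counts before and after shows how much data the cleaning step discards.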

STEP 3: SUMMARY STATISTICS

PROGRAM:

sales_summary = df["Sales"].describe()[["mean", "50%", "std"]]

profit_summary = df["Profit"].describe()[["mean", "50%", "std"]]

print("Sales Summary:\n", sales_summary)

print("Profit Summary:\n", profit_summary)

OUTPUT:

Sales Summary:
mean 246.498440
50% 85.000000
std 487.567175
Name: Sales, dtype: float64
Profit Summary:
mean 28.610982
50% 9.240000
std 174.340972
Name: Profit, dtype: float64
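
In the output above, the "50%" row is the median. A mean well above the median (246.50 against 85.00 for Sales) usually suggests a right-skewed distribution, where a few large orders pull the mean up. The sketch below illustrates this on hypothetical values that are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values chosen to be right-skewed; not taken from the dataset
sales = pd.Series([10, 20, 30, 40, 400], name="Sales")

summary = sales.describe()[["mean", "50%", "std"]]
median_matches = summary["50%"] == sales.median()      # "50%" is the median
mean_above_median = summary["mean"] > summary["50%"]   # hint of right skew

print(summary)
print("50% equals median:", median_matches)
print("mean above median:", mean_above_median)
```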
STEP 4: ANALYSIS
Total Sales per Region
PROGRAM:
sales_per_region = df.groupby("Region")["Sales"].sum()
print(sales_per_region)
OUTPUT:
Region
Africa 783776
Canada 66932
Caribbean 324281
Central 2822399
Central Asia 752839
EMEA 806184
East 678834
North 1248192
North Asia 848349
Oceania 1100207
South 1600960
Southeast Asia 884438
West 725514
Name: Sales, dtype: int64
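
Sorting the totals makes the ranking easier to read. The sketch below re-enters the figures printed above by hand, sorts them, and adds each region's share of total sales.

```python
import pandas as pd

# Regional totals re-entered by hand from the printed output above
sales_per_region = pd.Series({
    "Africa": 783776, "Canada": 66932, "Caribbean": 324281,
    "Central": 2822399, "Central Asia": 752839, "EMEA": 806184,
    "East": 678834, "North": 1248192, "North Asia": 848349,
    "Oceania": 1100207, "South": 1600960, "Southeast Asia": 884438,
    "West": 725514,
}, name="Sales")

ranked = sales_per_region.sort_values(ascending=False)
share_pct = (ranked / ranked.sum() * 100).round(1)   # percent of total sales

print(ranked.head(3))
print(share_pct.head(3))
```

With the real DataFrame, the same two lines can be applied directly to the sales_per_region result of the groupby.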

Top 5 Most Profitable Product Categories


PROGRAM:
top_profitable_categories=df.groupby("Category")["Profit"].sum().nlargest(5)
print(top_profitable_categories)
OUTPUT:
Category
Technology 663778.73318
Office Supplies 518473.83430
Furniture 285204.72380
Name: Profit, dtype: float64
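
The output lists three rows even though five were requested, because nlargest(5) returns at most as many rows as there are groups. A short sketch using the profit totals printed above, re-entered by hand:

```python
import pandas as pd

# Profit totals re-entered by hand from the printed output above
profit_by_category = pd.Series({
    "Technology": 663778.73318,
    "Office Supplies": 518473.83430,
    "Furniture": 285204.72380,
}, name="Profit")

top = profit_by_category.nlargest(5)   # only three categories exist
print(len(top), "categories returned")
print(top)
```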
Year-wise Sales Trend
PROGRAM:
df["Order.Date"] = pd.to_datetime(df["Order.Date"])
df["Year"] = df["Order.Date"].dt.year
yearly_sales = df.groupby("Year")["Sales"].sum()
print(yearly_sales)
OUTPUT:
Year
2025 12642905
Name: Sales, dtype: int64
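
The printed result contains a single year, which may mean the date strings were not interpreted as intended. One possible sanity check is to parse with an explicit format and count the values that fail. The sketch below uses hypothetical date strings; the day-month-year format is an assumption, since the file's actual format is not shown here.

```python
import pandas as pd

# Hypothetical date strings; the day-month-year format is an assumption
raw_dates = pd.Series(["01-01-2011", "15-06-2012", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")

unparsed_count = int(parsed.isna().sum())
distinct_years = sorted(parsed.dt.year.dropna().astype(int).unique().tolist())

print("unparsed rows:", unparsed_count)
print("years found:", distinct_years)
```

If the list of years found is shorter than expected, the format string or the source column is worth inspecting before plotting a trend.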
STEP 5: VISUALIZATIONS
Bar Chart: Sales by Region
PROGRAM:
import matplotlib.pyplot as plt
sales_per_region.plot(kind="bar", color="skyblue")
plt.title("Total Sales by Region")
plt.xlabel("Region")
plt.ylabel("Total Sales")
plt.show()
OUTPUT:

Line Chart: Year-wise Sales Trend


PROGRAM:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
yearly_sales = df.groupby('Year')['Sales'].sum()
sns.lineplot(x=yearly_sales.index, y=yearly_sales.values, marker='o',
color='orange')
plt.title("Year-wise Sales Trend")
plt.xlabel("Year")
plt.ylabel("Total Sales")
plt.tight_layout()
plt.show()
OUTPUT:

GOOGLE COLAB LINK:


https://colab.research.google.com/drive/12ok_SXN84wnqSQL9AzV4OA4e7kDohCWQ?usp=sharing
STEP 6: INSIGHTS
Bar Chart – Sales by Region:
• The Central region shows the highest total sales (2,822,399), followed by South (1,600,960) and North (1,248,192).
• Canada has the lowest total (66,932), indicating potential for growth or marketing focus.

Line Chart – Year-wise Sales Trend:

• The year-wise output above contains only one year (2025), so it cannot show a year-over-year trend.
• The Order.Date parsing should be verified before drawing conclusions about business growth over time.
Ex.No:2 EDA ON COVID-19 GLOBAL DATASET

INTRODUCTION:
The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has had a profound global
impact since early 2020, affecting millions of lives and disrupting economies. To better
understand the spread, trends, and regional impact of the virus, data-driven approaches such as
Exploratory Data Analysis (EDA) are essential. By exploring confirmed cases, recoveries, and
deaths, this analysis aims to uncover insights into the progression of the pandemic, identify the
most affected states, and visualize daily trends in new infections.

Dataset link: https://www.kaggle.com/datasets/ COVID-19 in India

GOOGLE COLAB LINK:


https://colab.research.google.com/drive/1VEuFN6gRCyIMnIEkwccqlMqENFi11BRv?usp=sharing

STEP 1: LOAD AND INSPECT THE DATASET


PROGRAM:
import pandas as pd
df = pd.read_csv('path_to_covid_dataset.csv')
print(df.head())
OUTPUT:
PROGRAM:
print(df.columns)

print(df.info())

OUTPUT:

STEP 2: HANDLE MISSING DATA AND CONVERT DATES


PROGRAM:
df.fillna(0, inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
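
Note that fillna(0) on the whole DataFrame also writes 0 into text columns. A minimal sketch of one alternative, filling only numeric columns, is shown below on hypothetical rows; the column names follow the ones used in this exercise.

```python
import pandas as pd

# Hypothetical rows; column names follow the ones used in this exercise
toy = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02"],
    "State/UnionTerritory": ["Kerala", None],
    "Confirmed": [10.0, None],
})

# Fill missing values in numeric columns only, leaving text columns untouched
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(0)

toy["Date"] = pd.to_datetime(toy["Date"])
print(toy)
```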

STEP 3: COMPUTE METRICS


a) Total confirmed, recovered, and death cases per state:
PROGRAM:
statewise_total=df.groupby('State/UnionTerritory')[['Confirmed','Cured',
'Deaths']].max().reset_index()
print(statewise_total)

OUTPUT:
b) State with the highest number of confirmed cases:
PROGRAM:
top_state=statewise_total[statewise_total['Confirmed']==
statewise_total['Confirmed'].max()]
print("State with highest confirmed cases:\n", top_state)
OUTPUT:
State with highest confirmed cases:
State/UnionTerritory Confirmed Cured Deaths
27 Maharashtra 6363442 6159676 134201
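
The program in (a) uses max() rather than sum(), which fits if the Confirmed, Cured and Deaths columns are running totals: the maximum is then the latest total, while a sum would count earlier days repeatedly. The sketch below shows the difference on hypothetical cumulative counts and uses idxmax() as a compact way to find the top state.

```python
import pandas as pd

# Hypothetical cumulative counts for two states over three days
toy = pd.DataFrame({
    "State/UnionTerritory": ["A", "A", "A", "B", "B", "B"],
    "Confirmed": [1, 3, 6, 2, 2, 5],
})

grouped = toy.groupby("State/UnionTerritory")["Confirmed"]
per_state_max = grouped.max()   # latest running total per state
per_state_sum = grouped.sum()   # overcounts when values are cumulative

top_state = per_state_max.idxmax()   # label of the state with the largest total

print(per_state_max)
print(per_state_sum)
print("Top state:", top_state)
```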

c) Daily trend of new cases:


PROGRAM:
daily_cases = df.groupby('Date')['Confirmed'].sum().diff().fillna(0)
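
The diff() call turns running totals into day-to-day changes, and fillna(0) sets the first day to 0 because it has no previous value. If a total is revised downward, the difference becomes negative; clipping at zero is one possible way to handle that, shown here on hypothetical values.

```python
import pandas as pd

# Hypothetical national running totals; day 3 is revised downward
dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"])
cumulative = pd.Series([5, 8, 7, 15], index=dates)

daily = cumulative.diff().fillna(0)     # first day has no previous value
daily_clipped = daily.clip(lower=0)     # treat negative corrections as zero

print(daily)
print(daily_clipped)
```

Whether to clip, keep, or investigate negative values depends on the analysis; clipping is only an assumption made for this sketch.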

STEP 4: VISUALIZATIONS
a) Pie Chart: Top 5 States by Confirmed Cases
PROGRAM:
import matplotlib.pyplot as plt
top5_states = statewise_total.sort_values('Confirmed', ascending=False).head(5)
plt.figure(figsize=(8, 8))
plt.pie(top5_states['Confirmed'],labels=top5_states['State/UnionTerritory'],
autopct='%1.1f%%', startangle=140)
plt.title('Top 5 Indian States by Confirmed COVID-19 Cases')
plt.show()
OUTPUT:

b) Line Graph: Daily Trend of Confirmed Cases


PROGRAM:
plt.figure(figsize=(10, 6))
plt.plot(daily_cases.index, daily_cases.values, color='blue')
plt.title('Daily New Confirmed COVID-19 Cases in India')
plt.xlabel('Date')
plt.ylabel('New Cases')
plt.grid(True)
plt.show()
OUTPUT:

STEP 5: OBSERVATION
• Top affected states (e.g., Maharashtra, Kerala, Karnataka) account for the majority of confirmed cases.
• Trend graph shows multiple waves: sharp increases followed by declines.
• Lockdown periods and vaccination rollouts align with noticeable trend changes.
• Deaths and recovery rates vary by region and wave, highlighting healthcare disparities.
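
The death and recovery rates mentioned above can be computed directly from the totals. As a worked example, the sketch below uses the Maharashtra figures from the printed output in Step 3(b).

```python
# Figures for Maharashtra copied from the printed output in Step 3(b)
confirmed = 6363442
cured = 6159676
deaths = 134201

fatality_rate = deaths / confirmed * 100    # deaths per 100 confirmed cases
recovery_rate = cured / confirmed * 100     # recoveries per 100 confirmed cases

print(f"Case fatality rate: {fatality_rate:.2f}%")
print(f"Recovery rate: {recovery_rate:.2f}%")
```

Applying the same two divisions to every row of statewise_total would allow the rates to be compared across states.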
Ex.No:3 EDA ON YOUTUBE TRENDING VIDEOS DATASET

INTRODUCTION:
YouTube has become a dominant platform for video sharing, content
creation, and audience engagement worldwide. The YouTube Trending Videos
Dataset provides a snapshot of videos that were trending in various regions over
time, offering valuable insights into user preferences, content popularity, and
engagement metrics.
This Exploratory Data Analysis (EDA) aims to uncover trends in video
categories, the frequency of trending videos across different channels, and
patterns in user interactions such as views, likes, and comments. By analyzing
this data, we can better understand what makes a video trend, which content types
perform best, and how users engage with trending content.
DATA SOURCE:

Dataset link: https://www.kaggle.com/datasets/anushabellam/Trending videos on Youtube

GOOGLE COLAB LINK:


https://colab.research.google.com/drive/1xkMAoAhJsC8CxQH-ZaAoNUXe9g2HoFxs?usp=sharing
STEP 1: LOAD AND INSPECT THE DATASET
PROGRAM:
import pandas as pd
df = pd.read_csv('USvideos.csv')
print(df.info())
print(df.head())
OUTPUT:
STEP 2: DATA CLEANING
PROGRAM:
df = df.drop_duplicates()
df = df.dropna()
df['publishedAt'] = pd.to_datetime(df['publishedAt'], errors='coerce')
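
Because errors='coerce' replaces unparseable dates with NaT after dropna() has already run, those rows remain in the DataFrame. A minimal sketch of a follow-up check, using hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; the second date string is deliberately invalid
toy = pd.DataFrame({
    "videoTitle": ["Video one", "Video two"],
    "publishedAt": ["2023-01-05T10:00:00Z", "not a date"],
})

toy["publishedAt"] = pd.to_datetime(toy["publishedAt"], errors="coerce", utc=True)

unparsed = int(toy["publishedAt"].isna().sum())   # NaT values created by coerce
toy = toy.dropna(subset=["publishedAt"])          # drop rows without a valid date

print("unparsed dates:", unparsed)
print("rows kept:", len(toy))
```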
STEP 3: KEY CALCULATIONS
PROGRAM:
most_common_categories = df['videoCategoryId'].value_counts().head(10)
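# Note: this counts repeated video titles, not channels; to rank channels,
# count the dataset's channel column instead (its name is not shown here).
top_channels = df['videoTitle'].value_counts().head(5)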
avg_likes = df['likeCount'].mean()
avg_views = df['viewCount'].mean()
avg_comments = df['commentCount'].mean()
average_metrics = {
'Average Likes': avg_likes,
'Average Views': avg_views,
'Average Comments': avg_comments
}
print(average_metrics)
OUTPUT:
{'Average Likes': np.float64(182.2095238095238), 'Average Views':
np.float64(9999.657142857142), 'Average Comments':
np.float64(82.97142857142858)}
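
The top_channels variable above is built from video titles. If the file has a channel column, counting that column would rank channels directly. The sketch below assumes a hypothetical column named channelTitle; the real column name is not shown in this exercise and may differ.

```python
import pandas as pd

# Hypothetical rows; "channelTitle" is an assumed column name
toy = pd.DataFrame({
    "channelTitle": ["Channel A", "Channel B", "Channel A", "Channel A", "Channel C"],
    "videoTitle": ["v1", "v2", "v3", "v4", "v5"],
})

# Number of trending rows per channel, largest first
top_channels = toy["channelTitle"].value_counts().head(5)
print(top_channels)
```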
STEP 4: VISUALIZATIONS
a) Bar chart: Video count by category
PROGRAM:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
most_common_categories.plot(kind='bar', color='skyblue')
plt.title('Top 10 Video Categories by Count')
plt.xlabel('Category ID')
plt.ylabel('Number of Videos')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
b) Scatter plot: Likes vs Views
PROGRAM:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='viewCount', y='likeCount', data=df, alpha=0.5)
plt.title('Likes vs. Views')
plt.xlabel('Views')
plt.ylabel('Likes')
plt.xscale('log')
plt.yscale('log')
plt.tight_layout()
plt.show()
OUTPUT:

STEP 5: OBSERVATION
• Top Categories: Certain categories like music, entertainment, and news dominate the trending list.
• Channel Popularity: A few channels consistently produce trending content.
• Engagement Patterns: There's a strong positive correlation between views and likes.
• Outliers: Some videos have extremely high views but relatively low likes/comments, suggesting passive viewing.
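
The correlation and outlier observations can be backed by numbers. The sketch below, on hypothetical values, computes the Pearson correlation of log-scaled views and likes (matching the log axes of the scatter plot) and flags videos with a low like-to-view ratio. The 0.01 threshold is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical engagement figures; column names follow this exercise
toy = pd.DataFrame({
    "viewCount": [100, 1000, 10000, 100000, 1000000],
    "likeCount": [5, 60, 400, 6000, 3000],
})

# Pearson correlation on log-scaled values
log_views = np.log10(toy["viewCount"])
log_likes = np.log10(toy["likeCount"])
correlation = log_views.corr(log_likes)

# Likes per view; the 0.01 cut-off is an assumed threshold
toy["like_rate"] = toy["likeCount"] / toy["viewCount"]
low_engagement = toy[toy["like_rate"] < 0.01]

print("log-log correlation:", round(correlation, 3))
print(low_engagement)
```

Rows with zero views or likes would need to be filtered out before taking logarithms.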
