DEV Manual

The document outlines a series of exploratory data analysis (EDA) exercises using various datasets, including email data, car specifications, geographical data, and Titanic passenger information. Each exercise includes specific algorithms, programming steps, and visualizations to extract insights and present findings. The successful completion of these analyses demonstrates the application of EDA and visualization techniques across different contexts.

EX NO : 1 EXPLORATORY DATA ANALYSIS (EDA) ON AN

EMAIL DATASET
DATE :

Aim

To perform Exploratory Data Analysis (EDA) on an email dataset by
importing the data into a pandas DataFrame, visualizing it, and
extracting insights.

Algorithm:
Step 1: Import libraries and load the dataset.
Step 2: Clean the data (remove missing values and duplicates).
Step 3: Compute summary statistics and label counts.
Step 4: Plot label distribution (spam vs. non-spam).
Step 5: Standardize features and plot pair plots.
Step 6: Apply PCA and visualize with a scatter plot.
Step 7: Extract and summarize key insights.

PROGRAM :

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# 1. Load dataset
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       'spambase/spambase.data')
columns = [f'feature_{i}' for i in range(1, 58)] + ['label']
df = pd.read_csv(url, header=None, names=columns)
# 2. Data Cleaning
df.dropna(inplace=True) # Remove missing values
df.drop_duplicates(inplace=True) # Remove duplicate rows
# 3. Summary Statistics
print("Summary Statistics:")
print(df.describe())
print("\nSpam vs Non-Spam Counts:")
print(df['label'].value_counts())

# 4. Visualizations
# Spam vs Non-Spam Distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='label', palette='Set2')
plt.title('Spam vs Non-Spam Distribution')
plt.xlabel('Email Type (0: Non-Spam, 1: Spam)')
plt.ylabel('Count')
plt.show()
# Feature Distributions by Label
# Standardize features for better comparison
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df.iloc[:, :-1]),
columns=columns[:-1])
df_scaled['label'] = df['label'].values
# Pairplot of the first few features, coloured by label
sns.pairplot(df_scaled.iloc[:, :5].assign(label=df_scaled['label']),
hue='label', palette='husl')
plt.suptitle('Pairplot of Features by Label', y=1.02)
plt.show()
# PCA for Dimensionality Reduction
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled.iloc[:, :-1])
df_pca = pd.DataFrame(data=principal_components, columns=['PC1',
'PC2'])
df_pca['label'] = df_scaled['label']
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='label', palette='Set1',
alpha=0.7)
plt.title('PCA of Email Features')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
# 5. Insights
print("Insights:")
print("- The dataset contains a higher number of non-spam emails compared to spam.")
print("- Certain features show distinct distributions between spam and non-spam emails.")
print("- PCA visualization indicates potential separability between spam and non-spam emails based on the features.")
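The PCA scatter plot shows only the first two components, so it is worth checking how much of the total variance those two components actually capture before reading too much into the separability. A minimal sketch on synthetic stand-in data (in practice the standardized spambase features would be used in place of X):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the 57 spambase features (200 rows)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 57))

# Standardize, then fit a 2-component PCA, as in the program above
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance retained by PC1 and PC2
print(pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())
```

If the two ratios sum to only a small fraction, the 2-D scatter is a rough projection and apparent overlap (or separation) should be interpreted cautiously.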
OUTPUT:
RESULT :

Exploratory data analysis on an email dataset is successfully completed.
EX NO : 2
VARIABLE AND ROW FILTERS IN R
DATE :

Aim

To explore and apply variable and row filters in R for cleaning data, and
to visualize data using various plotting features available in R.

Algorithm:
1. Install Dependencies: Install r-base, rpy2, and required R packages
(ggplot2, dplyr) in Google Colab.
2. Set Up Environment: Enable the R magic (%load_ext rpy2.ipython)
to run R code in Colab.
3. Load Dataset: Load the built-in mtcars dataset in R.
4. Clean Data:
• Remove unnecessary columns (carb, gear).
• Filter rows where mpg is greater than or equal to 15.
5. Visualizations:
• Pair Plot: Use the pairs() function to plot relationships between
selected variables (mpg, wt, hp, disp).
• Scatter Plot: Create a scatter plot of mpg vs wt.
• Histogram: Show the distribution of mpg values.
• Box Plot: Compare mpg across different cylinder counts (cyl).
• Bar Plot: Display the count of cars for each cylinder type.
6. Render Plots: Use print() to display each ggplot2 plot in the Colab
environment.

PROGRAM:
!sudo apt-get update -y
!sudo apt-get install r-base -y
!pip install rpy2
# Activate the R magic
%load_ext rpy2.ipython
%%R
# Step 3: Install necessary libraries
install.packages("ggplot2")
install.packages("dplyr")
# Step 4: Load libraries
library(ggplot2)
library(dplyr)
# Step 5: Load and Clean Data
data <- mtcars # Load built-in dataset
data_cleaned <- data %>%
select(-c(carb, gear)) %>% # Remove unwanted columns
filter(mpg >= 15) # Filter rows where mpg >= 15
# Step 6: Pair Plot (Base R)
print("Pair Plot of Selected Variables:")
pairs(data_cleaned[, c("mpg", "wt", "hp", "disp")],
main = "Pair Plot of Selected Variables")
# Step 7: Scatter Plot
scatter_plot <- ggplot(data_cleaned, aes(x = mpg, y = wt)) +
geom_point(color = "blue") +
ggtitle("Scatter Plot of MPG vs Weight") +
xlab("Miles Per Gallon (MPG)") +
ylab("Weight (WT)")
print(scatter_plot) # Explicitly display the plot
# Step 8: Histogram
histogram <- ggplot(data_cleaned, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
ggtitle("Histogram of MPG") +
xlab("Miles Per Gallon (MPG)") +
ylab("Frequency")
print(histogram)
# Step 9: Box Plot
box_plot <- ggplot(data_cleaned, aes(x = as.factor(cyl), y = mpg)) +
geom_boxplot(fill = "pink") +
ggtitle("Box Plot of MPG by Cylinder Count") +
xlab("Number of Cylinders") +
ylab("Miles Per Gallon (MPG)")
print(box_plot)
# Step 10: Bar Plot
bar_plot <- ggplot(data_cleaned, aes(x = as.factor(cyl))) +
geom_bar(fill = "purple") +
ggtitle("Bar Plot of Cylinder Count") +
xlab("Number of Cylinders") +
ylab("Count")
print(bar_plot)
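For readers more comfortable in Python, the dplyr select/filter steps above have direct pandas equivalents. A minimal sketch, using a small hand-made frame as a stand-in for mtcars (column names follow the R dataset):

```python
import pandas as pd

# Small stand-in for the mtcars dataset
data = pd.DataFrame({
    "mpg":  [21.0, 14.3, 32.4, 10.4],
    "cyl":  [6, 8, 4, 8],
    "wt":   [2.62, 3.57, 2.20, 5.25],
    "gear": [4, 3, 4, 3],
    "carb": [4, 4, 1, 4],
})

# select(-c(carb, gear))  ->  drop the unwanted columns
# filter(mpg >= 15)       ->  keep rows where mpg >= 15
data_cleaned = data.drop(columns=["carb", "gear"]).query("mpg >= 15")
print(data_cleaned)
```

As in the R pipeline, the two filters chain naturally, and the result keeps only the mpg, cyl, and wt columns for the two qualifying rows.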
OUTPUT:
RESULT:

Variable and row filters for cleaning data in R, together with data
visualization using the various plotting features available in R, are
successfully implemented.
EX NO : 3 DATA ANALYSIS AND REPRESENTATION
ON A MAP
DATE :

Aim:
To perform data analysis and representation on a map using various
map datasets, with mouse rollover effects, user interaction, etc.

Algorithm:

Step 1: Install necessary libraries:


• Install folium and requests using pip if they are not already
installed.

Step 2: Import necessary libraries:


• Import the folium library for map visualization.
• Import the requests library for making HTTP requests to fetch the
GeoJSON data.

Step 3: Load the GeoJSON Data from the web:


• Define the URL from which the GeoJSON data (US States
boundaries) will be fetched.
• Use the requests.get() function to send an HTTP GET request to
the URL and retrieve the data.

Step 4: Check if the response is valid:


• Check if the response's status code is 200 (indicating success).
• If the response is successful, proceed to attempt loading the data.

Step 5: Parse the JSON data:


• Use response.json() to parse the fetched data into a Python
dictionary.
• If an error occurs during parsing, catch the ValueError exception
and print an error message.

Step 6: Create the Base Map:


• Use folium.Map() to create a map object. Set the initial center of
the map (latitude: 37.0902, longitude: -95.7129 for the United States) and
the zoom level (4, giving a country-wide view).

Step 7: Add GeoJSON Data to the Map:


• Use folium.GeoJson() to add the parsed GeoJSON data (US state
boundaries) to the map.
• Set the tooltip to show the state name when hovering over a state.
• Set the popup to display additional information (like the state
name) when clicking on a state.

Step 8: Display the Map:


• The map is displayed in an interactive environment (such as
Jupyter Notebooks) by returning the folium.Map() object.

PROGRAM:

# Step 1: Install necessary libraries


!pip install folium requests
# Step 2: Import necessary libraries
import folium
import requests
# Step 3: Load the GeoJSON Data from the web (checking the response)
url = ("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/"
       "master/data/geojson/us-states.geojson")
response = requests.get(url)

# Check if the response is valid
if response.status_code == 200:
    try:
        # Step 4: Attempt to parse the response as GeoJSON
        geojson_data = response.json()  # Convert the response into a JSON object
        print("GeoJSON data successfully loaded!")
    except ValueError as e:
        print(f"Error in loading JSON: {e}")
else:
    print(f"Failed to fetch data. Status Code: {response.status_code}")

# Step 5: Create the Base Map (centering it over the United States)
m = folium.Map(location=[37.0902, -95.7129], zoom_start=4)

# Step 6: Add the GeoJSON layer (US states) to the map if the data is valid
if 'geojson_data' in locals():
    folium.GeoJson(
        geojson_data,
        name="US States",
        tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=["State"]),
        popup=folium.GeoJsonPopup(fields=["name"], aliases=["State"])
    ).add_to(m)
# Step 7: Display the Map
m
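The tooltip and popup above assume every GeoJSON feature carries a "name" property; if it is missing, the tooltips render blank with no error. A quick structural check before building the map catches this early. A minimal sketch with an inline FeatureCollection (in practice the fetched geojson_data would be passed in; `has_property` is a helper introduced here for illustration):

```python
def has_property(geojson, prop):
    """Return True if every feature in the collection has the given property."""
    features = geojson.get("features", [])
    return bool(features) and all(
        prop in f.get("properties", {}) for f in features
    )

# Inline sample mimicking the us-states GeoJSON layout
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"name": "Alabama"},
         "geometry": {"type": "Polygon", "coordinates": []}},
    ],
}

print(has_property(sample, "name"))     # → True
print(has_property(sample, "density"))  # → False
```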

OUTPUT:
RESULT:

Data analysis and representation on a map using various map
datasets, with mouse rollover effects and user interaction, is successfully
completed.
EX NO : 4
CARTOGRAPHIC VISUALIZATION
DATE :

Aim:

To build cartographic visualizations for multiple datasets involving
various countries of the world, and states and districts in India.

Algorithm:

1. Install required libraries: folium, requests.
2. Define a function to fetch and validate GeoJSON data from the web.
3. Use reliable URLs for:
• World countries.
• Indian states.
• Indian districts.
4. Fetch and validate the GeoJSON datasets.
5. Create a base map centered over India.
6. Add layers for each dataset (world, states, districts) with tooltips.
7. Add a layer control for toggling visibility.
8. Display the interactive map.
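The fetch-and-validate helper in step 2 hinges on handling invalid JSON gracefully. The parsing half can be sketched offline, with raw strings standing in for the network responses (`parse_geojson` is a hypothetical name introduced here for illustration):

```python
import json

def parse_geojson(text):
    """Parse raw text as JSON, returning None instead of raising on bad input."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return None

print(parse_geojson('{"type": "FeatureCollection", "features": []}'))
print(parse_geojson("<html>not json</html>"))  # → None
```

Returning None on failure lets each map layer be added conditionally, exactly as the program below does with its `if <dataset>:` guards.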

Program :

# Install necessary libraries
!pip install folium geopandas requests

import folium
import requests

# Function to validate and fetch GeoJSON data
def fetch_geojson(url):
    response = requests.get(url)
    if response.status_code == 200:
        try:
            return response.json()  # Parse the JSON
        except ValueError:
            print(f"Invalid JSON format for URL: {url}")
    else:
        print(f"Failed to fetch data from {url}. "
              f"Status code: {response.status_code}")
    return None

# URLs for GeoJSON datasets
url_world = ("https://raw.githubusercontent.com/johan/world.geo.json/"
             "master/countries.geo.json")
url_india_states = ("https://raw.githubusercontent.com/datameet/maps/"
                    "master/States/Admin2.geojson")
url_india_districts = ("https://raw.githubusercontent.com/datameet/maps/"
                       "master/Districts/Admin3.geojson")

# Fetch data
world_geojson = fetch_geojson(url_world)
india_states_geojson = fetch_geojson(url_india_states)
india_districts_geojson = fetch_geojson(url_india_districts)

# Create the base map
m = folium.Map(location=[20.5937, 78.9629], zoom_start=4,
               tiles="cartodbpositron")

# Add GeoJSON layers (if data is valid)
if world_geojson:
    folium.GeoJson(
        world_geojson,
        name="World Countries",
        tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=["Country"])
    ).add_to(m)

if india_states_geojson:
    folium.GeoJson(
        india_states_geojson,
        name="Indian States",
        tooltip=folium.GeoJsonTooltip(fields=["NAME_1"], aliases=["State"])
    ).add_to(m)

if india_districts_geojson:
    folium.GeoJson(
        india_districts_geojson,
        name="Indian Districts",
        tooltip=folium.GeoJsonTooltip(fields=["NAME_2"], aliases=["District"])
    ).add_to(m)

# Add layer control
folium.LayerControl().add_to(m)

# Display the map
m

OUTPUT:

RESULT:
Cartographic visualization for multiple datasets is successfully
completed.
EX NO : 5
EDA AND VISUALIZATION TECHNIQUES
DATE :

Aim:

To use a case study on a dataset, apply the various EDA and
visualization techniques, and present an analysis report.

Algorithm:

1. Load and explore the dataset.


2. Perform univariate analysis using histograms and count plots.
3. Conduct bivariate analysis to study survival rates by gender and class.
4. Visualize missing data using a heatmap.
5. Analyze correlations among numerical features using a heatmap.
6. Present findings with advanced visualizations (FacetGrid, pairplot).

PROGRAM :

!pip install pandas numpy matplotlib seaborn


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Load the Titanic dataset
df = sns.load_dataset('titanic')
# Display the first few rows
print(df.head())
# Summary of the dataset
print(df.info())
# Check for missing values
print(df.isnull().sum())
# Distribution of Age
sns.histplot(df['age'], bins=20, kde=True, color='blue')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# Countplot for survival
sns.countplot(x='survived', data=df, palette='pastel')
plt.title('Survival Count')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.xticks([0, 1], ['Not Survived', 'Survived'])
plt.show()
# Survival rate by gender
sns.countplot(x='sex', hue='survived', data=df, palette='Set2')
plt.title('Survival Count by Gender')
plt.xlabel('Gender')

plt.ylabel('Count')
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()
# Survival rate by class
sns.countplot(x='class', hue='survived', data=df, palette='Set1')
plt.title('Survival Count by Passenger Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()
# Heatmap for missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()
# Correlation heatmap
# Select only numerical columns for the correlation matrix
corr_matrix = df.select_dtypes(include=np.number).corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
# Pairplot for numerical features
sns.pairplot(df[['age', 'fare', 'survived']], hue='survived', palette='husl')
plt.show()
# Survival rate by gender and class using FacetGrid
g = sns.FacetGrid(df, col='sex', row='class', hue='survived',
palette='coolwarm', height=3, aspect=1.5)
g.map(sns.histplot, 'age', kde=True)
g.add_legend()
plt.show()
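The count plots above show absolute numbers; the survival *rates* that step 3 of the algorithm asks for are one groupby away, since the mean of the 0/1 survived flag is exactly the survival rate. A minimal sketch on a tiny hand-made frame (in practice the df loaded from sns.load_dataset('titanic') would be used):

```python
import pandas as pd

# Tiny stand-in for the Titanic data
df = pd.DataFrame({
    "sex":      ["male", "female", "female", "male", "female", "male"],
    "class":    ["Third", "First", "First", "Second", "Third", "Third"],
    "survived": [0, 1, 1, 0, 1, 0],
})

# Mean of the 0/1 survived flag = survival rate per group
rates_by_sex = df.groupby("sex")["survived"].mean()
rates_by_class_sex = df.groupby(["class", "sex"])["survived"].mean()

print(rates_by_sex)        # in this toy data: female 1.0, male 0.0
print(rates_by_class_sex)
```

These rate tables complement the FacetGrid view, giving exact numbers behind the visual gender and class differences.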
OUTPUT :
RESULT:
The various EDA and visualization techniques are successfully
implemented and the analysis is completed.
