DEV Manual
EX NO : 1 EMAIL DATASET
DATE :
Aim:
To perform exploratory data analysis on the Spambase email dataset: clean the data, compute summary statistics and label counts, visualize distributions, and apply PCA.
Algorithm:
Step 1: Import libraries and load the dataset.
Step 2: Clean the data (remove missing values and duplicates).
Step 3: Compute summary statistics and label counts.
Step 4: Plot label distribution (spam vs. non-spam).
Step 5: Standardize features and plot pair plots.
Step 6: Apply PCA and visualize with a scatter plot.
Step 7: Extract and summarize key insights.
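Steps 2 and 3 can be sketched on a tiny toy frame (hypothetical values, not the spambase data) to show what the cleaning and label counts do:

```python
import pandas as pd

# Toy frame with one missing value and one duplicate row (illustrative only)
df_toy = pd.DataFrame({
    'feature_1': [0.1, 0.1, None, 0.5],
    'label':     [1,   1,   0,    0],
})

cleaned = df_toy.dropna().drop_duplicates()  # Step 2: drop NaNs, then duplicates
counts = cleaned['label'].value_counts()     # Step 3: label counts
```

After cleaning, two rows remain here: the duplicate of the first row and the row with the missing value are both gone.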
PROGRAM :
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# 1. Load dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
columns = [f'feature_{i}' for i in range(1, 58)] + ['label']
df = pd.read_csv(url, header=None, names=columns)
# 2. Data Cleaning
df.dropna(inplace=True) # Remove missing values
df.drop_duplicates(inplace=True) # Remove duplicate rows
# 3. Summary Statistics
print("Summary Statistics:")
print(df.describe())
print("\nSpam vs Non-Spam Counts:")
print(df['label'].value_counts())
# 4. Visualizations
# Spam vs Non-Spam Distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='label', palette='Set2')
plt.title('Spam vs Non-Spam Distribution')
plt.xlabel('Email Type (0: Non-Spam, 1: Spam)')
plt.ylabel('Count')
plt.show()
# Feature Distributions by Label
# Standardize features for better comparison
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df.iloc[:, :-1]),
columns=columns[:-1])
df_scaled['label'] = df['label'].values
# Pairplot of the first few features, including 'label'
sns.pairplot(df_scaled.iloc[:, :5].assign(label=df_scaled['label']),
hue='label', palette='husl')
plt.suptitle('Pairplot of Features by Label', y=1.02)
plt.show()
# PCA for Dimensionality Reduction
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled.iloc[:, :-1])
df_pca = pd.DataFrame(data=principal_components, columns=['PC1',
'PC2'])
df_pca['label'] = df_scaled['label']
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='label', palette='Set1',
alpha=0.7)
plt.title('PCA of Email Features')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
# 5. Insights
print("Insights:")
print("- The dataset contains a higher number of non-spam emails compared to spam.")
print("- Certain features show distinct distributions between spam and non-spam emails.")
print("- PCA visualization indicates potential separability between spam and non-spam emails based on the features.")
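The PCA scatter plot shows only the 2-D projection; how much variance those two components actually capture can be checked from the covariance eigenvalues. A minimal numpy sketch on toy 2-feature data (not the spambase features):

```python
import numpy as np

# Toy data: almost all of the variance lies along the first axis
X = np.array([[2.0, 0.0], [-2.0, 0.0], [1.0, 0.1], [-1.0, -0.1]])
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix = variance along each principal axis
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending order
ratio = eigvals / eigvals.sum()  # explained-variance ratio per component
```

scikit-learn's `PCA` exposes the same quantity directly as `explained_variance_ratio_` after `fit`.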
OUTPUT:
RESULT :
Exploratory data analysis on the email dataset is successfully implemented.
EX NO : 2 VARIABLE AND ROW FILTERS IN R
DATE :
Aim:
To explore and apply variable and row filters in R for cleaning data, and
to visualize data using various plotting features available in R.
Algorithm:
1. Install Dependencies: Install r-base, rpy2, and required R packages
(ggplot2, dplyr) in Google Colab.
2. Set Up Environment: Enable the R magic (%load_ext rpy2.ipython)
to run R code in Colab.
3. Load Dataset: Load the built-in mtcars dataset in R.
4. Clean Data:
- Remove unnecessary columns (carb, gear).
- Filter rows where mpg is greater than or equal to 15.
5. Visualizations:
- Pair Plot: Use the pairs() function to plot relationships between selected variables (mpg, wt, hp, disp).
- Scatter Plot: Create a scatter plot of mpg vs. wt.
- Histogram: Show the distribution of mpg values.
- Box Plot: Compare mpg across different cylinder counts (cyl).
- Bar Plot: Display the count of cars for each cylinder type.
6. Render Plots: Use print() to display each ggplot2 plot in the Colab environment.
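For readers coming from pandas, the select/filter steps of the dplyr pipeline map directly onto column drops and boolean filters. A sketch on a hypothetical miniature mtcars-like frame (not the real dataset):

```python
import pandas as pd

# Three hypothetical cars with mtcars-style columns
cars = pd.DataFrame({
    'mpg':  [21.0, 14.3, 30.4],
    'wt':   [2.62, 3.57, 1.51],
    'gear': [4, 3, 4],
    'carb': [4, 4, 2],
})

cleaned = cars.drop(columns=['carb', 'gear'])  # select(-c(carb, gear))
cleaned = cleaned[cleaned['mpg'] >= 15]        # filter(mpg >= 15)
```

The second car (mpg 14.3) is filtered out, leaving two rows with only the mpg and wt columns.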
PROGRAM:
!sudo apt-get update -y
!sudo apt-get install r-base -y
!pip install rpy2
# Activate the R magic
%load_ext rpy2.ipython
%%R
# Step 3: Install necessary libraries
install.packages("ggplot2")
install.packages("dplyr")
# Step 4: Load libraries
library(ggplot2)
library(dplyr)
# Step 5: Load and Clean Data
data <- mtcars # Load built-in dataset
data_cleaned <- data %>%
select(-c(carb, gear)) %>% # Remove unwanted columns
filter(mpg >= 15) # Filter rows where mpg >= 15
# Step 6: Pair Plot (Base R)
print("Pair Plot of Selected Variables:")
pairs(data_cleaned[, c("mpg", "wt", "hp", "disp")],
main = "Pair Plot of Selected Variables")
# Step 7: Scatter Plot
scatter_plot <- ggplot(data_cleaned, aes(x = mpg, y = wt)) +
geom_point(color = "blue") +
ggtitle("Scatter Plot of MPG vs Weight") +
xlab("Miles Per Gallon (MPG)") +
ylab("Weight (WT)")
print(scatter_plot) # Explicitly display the plot
# Step 8: Histogram
histogram <- ggplot(data_cleaned, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
ggtitle("Histogram of MPG") +
xlab("Miles Per Gallon (MPG)") +
ylab("Frequency")
print(histogram)
# Step 9: Box Plot
box_plot <- ggplot(data_cleaned, aes(x = as.factor(cyl), y = mpg)) +
geom_boxplot(fill = "pink") +
ggtitle("Box Plot of MPG by Cylinder Count") +
xlab("Number of Cylinders") +
ylab("Miles Per Gallon (MPG)")
print(box_plot)
# Step 10: Bar Plot
bar_plot <- ggplot(data_cleaned, aes(x = as.factor(cyl))) +
geom_bar(fill = "purple") +
ggtitle("Bar Plot of Cylinder Count") +
xlab("Number of Cylinders") +
ylab("Count")
print(bar_plot)
OUTPUT:
RESULT:
Variable and row filters for cleaning data in R, together with data visualization using various plotting features of R, are successfully implemented.
EX NO : 3 DATA ANALYSIS AND REPRESENTATION
ON A MAP
DATE :
Aim:
To perform data analysis and representation on a map using various map datasets, with mouse-rollover effects, user interaction, etc.
Algorithm:
Step 1: Import folium and load the US states GeoJSON data.
Step 2: Create a folium base map.
Step 3: Add the GeoJSON layer with a tooltip and popup for mouse-rollover and click interaction.
Step 4: Display the map.
PROGRAM:
import folium

# geojson_data (the US states GeoJSON) and the base map m are created in
# earlier steps that are missing from the source listing.
folium.GeoJson(
    geojson_data,
    name="US States",
    tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=["State"]),
    popup=folium.GeoJsonPopup(fields=["name"], aliases=["State"])
).add_to(m)
# Step 7: Display the Map
m
OUTPUT:
RESULT:
Data analysis and representation on a map with mouse-rollover effects and user interaction is successfully implemented.
EX NO : 4 CARTOGRAPHIC VISUALIZATION FOR MULTIPLE DATASETS
DATE :
Aim:
To perform cartographic visualization of multiple map datasets (world countries, Indian states, and Indian districts) using folium, with tooltips and layer control.
Algorithm:
Step 1: Fetch the GeoJSON datasets for world countries, Indian states, and Indian districts.
Step 2: Create a folium base map centred on India.
Step 3: Add each GeoJSON layer with a tooltip, skipping any dataset that failed to load.
Step 4: Add layer control and display the map.
Program :
import folium
import requests

# Minimal reconstruction of the fetch helper; the original defines
# fetch_geojson and url_world in earlier cells not shown in this listing.
def fetch_geojson(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        return None

url_india_states = "https://raw.githubusercontent.com/datameet/maps/master/States/Admin2.geojson"
url_india_districts = "https://raw.githubusercontent.com/datameet/maps/master/Districts/Admin3.geojson"
# Fetch data
world_geojson = fetch_geojson(url_world)
india_states_geojson = fetch_geojson(url_india_states)
india_districts_geojson = fetch_geojson(url_india_districts)
# Create the base map
m = folium.Map(location=[20.5937, 78.9629], zoom_start=4,
tiles="cartodbpositron")
# Add GeoJSON layers (if data is valid)
if world_geojson:
folium.GeoJson(
world_geojson,
name="World Countries",
tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=["Country"])
).add_to(m)
if india_states_geojson:
folium.GeoJson(
india_states_geojson,
name="Indian States",
tooltip=folium.GeoJsonTooltip(fields=["NAME_1"], aliases=["State"])
).add_to(m)
if india_districts_geojson:
folium.GeoJson(
india_districts_geojson,
name="Indian Districts",
tooltip=folium.GeoJsonTooltip(fields=["NAME_2"], aliases=["District"])
).add_to(m)
# Add layer control
folium.LayerControl().add_to(m)
# Display the map
m
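The tooltip `fields` above (e.g. `NAME_1`) refer to keys inside each GeoJSON feature's `properties` object. A minimal FeatureCollection illustrating that structure, with one hypothetical feature rather than the real datameet data:

```python
import json

# One-feature GeoJSON FeatureCollection; GeoJsonTooltip reads 'properties'
feature_collection = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"NAME_1": "Kerala"},
        "geometry": {"type": "Point", "coordinates": [76.27, 10.85]},
    }],
}

# Round-trip through JSON text, as fetch_geojson does when parsing a response
parsed = json.loads(json.dumps(feature_collection))
state_name = parsed["features"][0]["properties"]["NAME_1"]
```

If a tooltip field is missing from `properties`, folium raises an error when the layer is added, which is why the field names must match the dataset exactly.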
OUTPUT:
RESULT:
Cartographic visualization for multiple datasets is successfully
visualized.
EX NO : 5
EDA AND VISUALIZATION TECHNIQUES
DATE :
Aim:
Use a case study on a data set and apply the various EDA and
visualization techniques and present an analysis report.
Algorithm:
Step 1: Import libraries and load the Titanic dataset.
Step 2: Plot survival counts by gender and by passenger class.
Step 3: Visualize missing values and feature correlations with heatmaps.
Step 4: Draw a pair plot of numerical features and a FacetGrid of age distributions by gender, class, and survival.
Step 5: Summarize the findings in an analysis report.
PROGRAM :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset bundled with seaborn
df = sns.load_dataset('titanic')

# Survival count by gender (the opening plot call is missing from the
# source; reconstructed as a countplot to match the labels below)
sns.countplot(x='sex', hue='survived', data=df, palette='Set2')
plt.title('Survival Count by Gender')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()
# Survival rate by class
sns.countplot(x='class', hue='survived', data=df, palette='Set1')
plt.title('Survival Count by Passenger Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()
# Heatmap for missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()
# Correlation heatmap
corr_matrix = df.select_dtypes(include=np.number).corr()  # numerical columns only
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
# Pairplot for numerical features
sns.pairplot(df[['age', 'fare', 'survived']], hue='survived', palette='husl')
plt.show()
# Survival rate by gender and class using FacetGrid
g = sns.FacetGrid(df, col='sex', row='class', hue='survived',
palette='coolwarm', height=3, aspect=1.5)
g.map(sns.histplot, 'age', kde=True)
g.add_legend()
plt.show()
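Beyond the plots, the survival rates behind the FacetGrid can be tabulated directly with a groupby; a sketch on a tiny hypothetical frame (not the real Titanic data):

```python
import pandas as pd

# Four hypothetical passengers
df_mini = pd.DataFrame({
    'sex':      ['male', 'female', 'female', 'male'],
    'survived': [0, 1, 1, 0],
})

# Mean of the 0/1 'survived' column per group = survival rate
rates = df_mini.groupby('sex')['survived'].mean()
```

The same pattern (`df.groupby('sex')['survived'].mean()`) works on the loaded Titanic frame and gives numbers to quote in the analysis report.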
OUTPUT :
RESULT:
The various EDA and visualization techniques are successfully implemented and the analysis is done.