EDA Withoutcode (1)
Problem Statement:
We use the Cars dataset from Kaggle, whose features include the make, model, year, engine, and other properties of a car, to predict the car's
price.
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
https://colab.research.google.com/drive/1jY5FhPVVnbIvNZlRZA5VOA3dvRqZ-PSq?authuser=1#scrollTo=VslkQJNWBxAU&printMode=true 1/36
16/09/2022, 16:40 EDA_withoutcode.ipynb - Colaboratory
5 points
Please download the dataset from here and extract the csv file. Load the csv file as pandas dataframe.
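A minimal sketch of the loading step. The file name is not given in this notebook, so the example parses a small in-memory CSV instead; with the real data you would pass the path of the extracted file to pd.read_csv (for instance "data.csv", which is an assumed name).

```python
import io
import pandas as pd

# Stand-in for the extracted file: two rows and a few columns only.
# With the real dataset, replace io.StringIO(csv_text) by the file path,
# e.g. pd.read_csv("data.csv") (the file name is an assumption).
csv_text = """Make,Model,Year,Engine HP,MSRP
BMW,1 Series M,2011,335.0,46135
BMW,1 Series,2011,300.0,40650
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())   # first rows, to confirm the file parsed as expected
print(df.shape)    # (number of rows, number of columns)
```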
   Make  Model       Year  Engine Fuel Type             Engine HP  Engine Cylinders  Transmission Type  Driven_Wheels     Number of Doors  Market Category                        Vehicle Size  ...
0  BMW   1 Series M  2011  premium unleaded (required)  335.0      6.0               MANUAL             rear wheel drive  2.0              Factory Tuner,Luxury,High-Performance  Compact       ...
1  BMW   1 Series    2011  premium unleaded (required)  300.0      6.0               MANUAL             rear wheel drive  2.0              Luxury,Performance                     Compact       ...
2  BMW   1 Series    2011  premium unleaded (required)  300.0      6.0               MANUAL             rear wheel drive  2.0              Luxury,High-Performance                Compact       ...
3  BMW   1 Series    2011  premium unleaded (required)  230.0      6.0               MANUAL             rear wheel drive  2.0              Luxury,Performance                     Compact       ...
4  BMW   1 Series    2011  premium unleaded (required)  230.0      6.0               MANUAL             rear wheel drive  2.0              Luxury                                 Compact       ...
2 points
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
# Column Non-Null Count Dtype
If we keep every column in the dataset, the unnecessary ones can hurt the model's accuracy.
Not all the columns in the given dataframe are important to us, so we drop the columns that are irrelevant.
The list cols_to_drop contains the names of the irrelevant columns; drop all of these columns from the dataframe.
cols_to_drop = ["Engine Fuel Type", "Market Category", "Vehicle Style", "Popularity", "Number of Doors", "Vehicle Size"]
These features are not necessary for the model, as they do not contain information relevant to this analysis.
# initialise cols_to_drop
cols_to_drop =
# drop the irrevalent cols and print the head of the dataframe
df =
# print df head
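One possible way to fill in the blanks above, shown as a sketch on a tiny stand-in dataframe. The "Vehicle Style" and "Popularity" values below are made up for illustration; only the column names and the cols_to_drop list come from this notebook.

```python
import pandas as pd

# Tiny illustrative frame with the columns to drop plus two columns to keep.
df = pd.DataFrame({
    "Make": ["BMW", "BMW"],
    "Engine Fuel Type": ["premium unleaded (required)"] * 2,
    "Market Category": ["Luxury,Performance", "Luxury"],
    "Vehicle Style": ["style A", "style B"],   # made-up values
    "Popularity": [1, 2],                      # made-up values
    "Number of Doors": [2.0, 2.0],
    "Vehicle Size": ["Compact", "Compact"],
    "MSRP": [40650, 34500],
})

# initialise cols_to_drop
cols_to_drop = ["Engine Fuel Type", "Market Category", "Vehicle Style",
                "Popularity", "Number of Doors", "Vehicle Size"]

# drop the irrelevant columns; drop() returns a new dataframe
df = df.drop(columns=cols_to_drop)

# print df head
print(df.head())
```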
Make Model Year Engine HP Engine Cylinders Transmission Type Driven_Wheels highway MPG city mpg MSRP
0 BMW 1 Series M 2011 335.0 6.0 MANUAL rear wheel drive 26 19 46135
1 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 19 40650
2 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 20 36350
3 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 29450
4 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 34500
5 points
Now it is time to rename the features to more useful names. This will make them easier to use when training the model.
We have already dropped the unnecessary columns, and now we are left with the useful ones. One extra thing we will do is rename
the columns so that each name clearly represents the essence of the column.
The given dict maps (as key-value pairs) each previous column name to its new name.
# rename cols
rename_cols =
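The dict itself is not reproduced in this print-out. The sketch below infers a mapping by comparing the column headers of the table before renaming with those of the table after it, so treat the mapping as an assumption rather than the original cell.

```python
import pandas as pd

# One row with the column names that remain after dropping columns.
df = pd.DataFrame({
    "Make": ["BMW"], "Model": ["1 Series M"], "Year": [2011],
    "Engine HP": [335.0], "Engine Cylinders": [6.0],
    "Transmission Type": ["MANUAL"], "Driven_Wheels": ["rear wheel drive"],
    "highway MPG": [26], "city mpg": [19], "MSRP": [46135],
})

# rename cols: old name -> new name (inferred from the two printed tables)
rename_cols = {
    "Engine HP": "HP",
    "Engine Cylinders": "Cylinders",
    "Transmission Type": "Transmission",
    "Driven_Wheels": "Drive Mode",
    "highway MPG": "MPG_H",
    "city mpg": "MPG-C",
    "MSRP": "Price",
}
df = df.rename(columns=rename_cols)
print(df.head())
```

Columns that are not keys of the dict (Make, Model, Year) keep their names.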
Make Model Year HP Cylinders Transmission Drive Mode MPG_H MPG-C Price
0 BMW 1 Series M 2011 335.0 6.0 MANUAL rear wheel drive 26 19 46135
1 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 19 40650
2 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 20 36350
3 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 29450
4 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 34500
Many rows in the dataframe are duplicates, so they just repeat information. It's better to remove these
rows, as they don't add any value to the dataframe.
For the given data, we would like to see how many rows were duplicates. To do this, we count the number of rows, remove the duplicated rows,
and count the number of rows again.
Make 11914
Model 11914
Year 11914
HP 11845
Cylinders 11884
Transmission 11914
Drive Mode 11914
MPG_H 11914
MPG-C 11914
Price 11914
dtype: int64
# print head of df
Make Model Year HP Cylinders Transmission Drive Mode MPG_H MPG-C Price
0 BMW 1 Series M 2011 335.0 6.0 MANUAL rear wheel drive 26 19 46135
1 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 19 40650
2 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 20 36350
3 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 29450
4 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 34500
# Count Number of rows after deleting duplicated rows
Make 10925
Model 10925
Year 10925
HP 10856
Cylinders 10895
Transmission 10925
Drive Mode 10925
MPG_H 10925
MPG-C 10925
Price 10925
dtype: int64
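The count, de-duplicate, count sequence above can be sketched as follows on a four-row stand-in. The missing HP value in the last row is made up, to show that count() reports non-null entries per column (which is why HP and Cylinders show lower counts than the other columns in the output above).

```python
import pandas as pd

df = pd.DataFrame({
    "Make":  ["BMW", "BMW", "BMW", "Acura"],
    "Model": ["1 Series", "1 Series", "1 Series", "ZDX"],
    "HP":    [300.0, 300.0, 230.0, None],      # None is a made-up gap
    "Price": [40650, 40650, 29450, 46120],
})

# Count rows (non-null values per column) before removing duplicates
print(df.count())

# duplicated() marks every repeat of an earlier, fully identical row
n_dupes = df.duplicated().sum()
print("duplicate rows:", n_dupes)

# Keep the first occurrence of each row and drop the repeats
df = df.drop_duplicates()

# Count rows after deleting duplicated rows
print(df.count())
```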
10 points
Missing values are usually represented as NaN, null, or None in a dataset.
We can check whether the data has null values by using the isnull() function.
Many values are missing; in a pandas dataframe these values are referred to as np.nan. We want to deal with them
because we can't use NaN values to train models. We can either remove them or apply some strategy to replace them with other values.
Make 0
Model 0
Year 0
HP 69
Cylinders 30
Transmission 0
Drive Mode 0
MPG_H 0
MPG-C 0
Price 0
dtype: int64
As we can see, HP and Cylinders have 69 and 30 null values respectively. These null values would affect the model's accuracy, so we
drop the rows that contain them. Because they are few compared with the size of the dataset, dropping them should not have a major effect on the
model.
Make 0
Model 0
Year 0
HP 0
Cylinders 0
Transmission 0
Drive Mode 0
MPG_H 0
MPG-C 0
Price 0
dtype: int64
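A small sketch of the two steps whose outputs are shown above: counting nulls per column with isnull().sum(), then dropping the affected rows with dropna(). The positions of the missing values in this stand-in frame are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Make":      ["BMW", "BMW", "Acura", "Lincoln"],
    "HP":        [335.0, np.nan, 300.0, 221.0],   # one made-up gap
    "Cylinders": [6.0, 6.0, np.nan, 6.0],         # one made-up gap
    "Price":     [46135, 40650, 46120, 28995],
})

# Number of null values in each column
missing_before = df.isnull().sum()
print(missing_before)

# Drop every row that has at least one null value
df = df.dropna()

# Check again: every count should now be zero
missing_after = df.isnull().sum()
print(missing_after)
```

If dropping rows were too costly, the replacement strategy mentioned earlier could be tried instead, for example df["HP"].fillna(df["HP"].median()).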
#Describe statistics of df
8. Removing outliers
Sometimes a dataset contains extreme values that lie outside the expected range and are unlike the rest of the data. These are called
outliers, and machine learning models can often be improved by understanding and even removing these outlier
values.
Detecting outliers
There are many techniques to detect outliers. Let us first see the simplest one: visualizing them.
Box plots are a graphical depiction of numerical data through their quantiles. It is a very simple but effective way to visualize outliers. Think
about the lower and upper whiskers as the boundaries of the data distribution. Any data points that show above or below the whiskers, can be
considered outliers or anomalous.
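A minimal sketch of such a box plot with seaborn. The first five prices come from the rows shown earlier; the last value is made up so that one point falls beyond the upper whisker.

```python
import pandas as pd
import seaborn as sns

prices = pd.DataFrame(
    {"Price": [29450, 34500, 36350, 40650, 46135, 500000]}  # last value is made up
)

# Horizontal box plot: the box spans Q1 to Q3, whiskers reach the most
# extreme points within 1.5 * IQR, and anything beyond is drawn as a dot
ax = sns.boxplot(x=prices["Price"])
ax.set_title("Price")
```

In a notebook the figure renders when the cell finishes; in a script, call matplotlib.pyplot.show() afterwards.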
15 points
Observation:
As you can see, some values lie near 1.5 and 2.0. These values are called outliers because they are far away from the typical
values. We have now detected the outliers in the Price feature. Similarly, we will check the other features.
Observation:
Here the box plot shows the spread between the 25th and 75th percentiles of the HP feature.
print all the columns which are of int or float datatype in df.
# print all the columns which are of int or float datatype in df.
Save the column names of the above output in variable list named 'l'
l=
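One way to collect the numeric column names is select_dtypes. This is a sketch on a two-row stand-in frame; include="number" is used here because it covers every int and float width.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["BMW", "Acura"], "Model": ["1 Series M", "ZDX"],
    "Year": [2011, 2012], "HP": [335.0, 300.0], "Cylinders": [6.0, 6.0],
    "Transmission": ["MANUAL", "AUTOMATIC"],
    "Drive Mode": ["rear wheel drive", "all wheel drive"],
    "MPG_H": [26, 23], "MPG-C": [19, 16], "Price": [46135, 46120],
})

# Keep only the int/float columns and print their dtypes
numeric = df.select_dtypes(include="number")
print(numeric.dtypes)

# Save the column names in the list l
l = list(numeric.columns)
print(l)
```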
IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data.
Calculate IQR and give a suitable threshold to remove the outliers and save this new dataframe into df2.
Let us help you decide on a threshold: outliers in this case are defined as the observations that are below (Q1 − 1.5 × IQR) or above (Q3 + 1.5 × IQR).
## define Q1 and Q3
Q1 =
Q3 =
df2 =
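A sketch of the IQR rule described above, applied to two numeric columns of a stand-in frame. The last row (HP 1001, Price 2000000) is made up so that there is something to remove; a row is dropped if it is an outlier in any of the listed columns.

```python
import pandas as pd

df = pd.DataFrame({
    "HP":    [230.0, 230.0, 300.0, 300.0, 335.0, 1001.0],
    "Price": [29450, 34500, 36350, 40650, 46135, 2000000],
})
l = ["HP", "Price"]   # numeric columns to check

# define Q1 and Q3 (one value per column)
Q1 = df[l].quantile(0.25)
Q3 = df[l].quantile(0.75)
IQR = Q3 - Q1

# True where a value is below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
is_outlier = (df[l] < (Q1 - 1.5 * IQR)) | (df[l] > (Q3 + 1.5 * IQR))

# Keep only the rows with no outlier in any checked column
df2 = df[~is_outlier.any(axis=1)]
print(df2.shape)
```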
The intuition behind the Z-score is to describe any data point by its relationship to the mean and standard deviation of the
group of data points, i.e. z = (x − mean) / standard deviation.
We will use the Z-score function defined in the scipy library to detect outliers in the columns of dataframe df that are listed in the variable 'l'.
# use stats.zscore on list l from above code and take abs value
z =
# print z
Hey buddy! Do you understand the above output? Difficult, right? Let's define a threshold to identify an outlier so that we get a clear
picture of what's going on.
In most cases a threshold of 3 is used: if the absolute value of the Z-score is greater than 3 (that is, z > 3 or z < -3), the data point is
identified as an outlier.
# print the values in dataframe which are less than the threshold and save this dataframe as df3
threshold = # set threshold
df3 = # set condition
# print df3
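A sketch of the Z-score filter on made-up data: nineteen ordinary rows plus one extreme row. A sample of this size is needed because with only a handful of rows no point can reach |z| > 3. The columns are converted to a NumPy array first so that the result is a plain array regardless of the scipy version.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data: 19 regular cars and one extreme car in the last row
df = pd.DataFrame({
    "HP":    [200 + 10 * i for i in range(19)] + [1001],
    "Price": [20000 + 1000 * i for i in range(19)] + [2000000],
})
l = ["HP", "Price"]

# use stats.zscore on the columns in l and take the absolute value
z = np.abs(stats.zscore(df[l].to_numpy(dtype=float)))

# keep the rows whose |z| is below the threshold in every column
threshold = 3
df3 = df[(z < threshold).all(axis=1)]
print(df3.shape)
```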
Make Model Year HP Cylinders Transmission Drive Mode MPG_H MPG-C Price
0 BMW 1 Series M 2011 335.0 6.0 MANUAL rear wheel drive 26 19 46135
1 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 19 40650
2 BMW 1 Series 2011 300.0 6.0 MANUAL rear wheel drive 28 20 36350
3 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 29450
4 BMW 1 Series 2011 230.0 6.0 MANUAL rear wheel drive 28 18 34500
... ... ... ... ... ... ... ... ... ... ...
11909 Acura ZDX 2012 300.0 6.0 AUTOMATIC all wheel drive 23 16 46120
11910 Acura ZDX 2012 300.0 6.0 AUTOMATIC all wheel drive 23 16 56670
11911 Acura ZDX 2012 300.0 6.0 AUTOMATIC all wheel drive 23 16 50620
11912 Acura ZDX 2013 300.0 6.0 AUTOMATIC all wheel drive 23 16 50920
11913 Lincoln Zephyr 2006 221.0 6.0 AUTOMATIC front wheel drive 26 17 28995
(10827, 10)
(9191, 10)
(10338, 10)
Interesting, right? Bam! You have removed 489 rows that were detected as outliers by the Z-score technique, and 1636
rows that were detected as outliers by the IQR technique.
By the way, there are many other techniques for removing outliers, and you can explore them further.
Don't worry, this dilemma is faced by many data analysts. Here are some good references for you to explore further:
https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
https://www.researchgate.net/post/Which-is-the-best-method-for-removing-outliers-in-a-data-set
Let's find the unique values and their counts in each column of df using the value_counts function.
# find unique values and their counts in each column in df using the value_counts function.
for i in df.columns:
    print ("--------------- %s ----------------" % i)
    # code here
GMC 475
Honda 429
Cadillac 396
Mazda 392
Mercedes-Benz 340
Suzuki 338
Infiniti 326
BMW 324
Audi 320
Hyundai 254
Acura 246
Volvo 241
Subaru 229
Kia 219
Mitsubishi 202
Lexus 201
Chrysler 185
Buick 184
Pontiac 163
Lincoln 152
Porsche 134
Land Rover 126
Oldsmobile 111
Saab 101
Aston Martin 91
Bentley 74
Ferrari 69
Plymouth 62
Scion 60
FIAT 58
Maserati 55
Lamborghini 52
Rolls-Royce 31
Lotus 28
HUMMER 17
Maybach 16
McLaren 5
Alfa Romeo 5
Genesis 3
Bugatti 3
Spyker 2
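The loop that produces output like the listing above could be completed as follows. This is a sketch on a three-row stand-in frame, so the counts differ from the real ones.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["BMW", "BMW", "Acura"],
    "Transmission": ["MANUAL", "MANUAL", "AUTOMATIC"],
})

for i in df.columns:
    print("--------------- %s ----------------" % i)
    # value_counts() lists each unique value with its frequency,
    # most frequent first
    print(df[i].value_counts())

# The same call on a single column, kept for later use
make_counts = df["Make"].value_counts()
```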
Did you know? You have already explored one univariate plot. Guess which one? Yes, it's the box plot.
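Histograms and density plots show the frequency of a numeric variable along the y-axis and its value along the x-axis. The sns.distplot()
function plots a histogram with a density curve; note that it has been deprecated since seaborn 0.11 in favor of sns.histplot() and sns.displot(). Notice that the result is aesthetically better than vanilla matplotlib.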
Documentation Link : Must go through this - https://seaborn.pydata.org/generated/seaborn.displot.html
Observation:
Since seaborn uses matplotlib behind the scenes, the usual matplotlib functions work well with seaborn. For example, you can use subplots to
plot multiple univariate distributions.
# plot all the columns present in list l together using subplots of dimension (2,3).
c=0
plt.figure(figsize=(15,10))
for i in l:
    # code here
plt.show()
2. Bar plots
10 points
Plot a bar chart with the make on the x-axis and the number of cars on the y-axis.
plt.figure(figsize = (12,8))
# use nlargest and then .plot to get bar plot like below output
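A sketch of the nlargest-then-plot approach hinted at in the comment. The stand-in data has three makes and keeps the top two; the number of makes shown in the original figure is not stated, so the argument to nlargest is an assumption to adjust.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Make": ["BMW"] * 3 + ["Acura"] * 2 + ["Lincoln"]})

plt.figure(figsize=(12, 8))

# Count cars per make, keep the most frequent makes, and draw them as bars
top_makes = df["Make"].value_counts().nlargest(2)
ax = top_makes.plot(kind="bar")
ax.set_xlabel("Make")
ax.set_ylabel("Number of cars")
```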
Observation:
This bar plot shows the number of cars for each make.
3. Count Plot
10 points
A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
Plot a vertical countplot for the variable Transmission with hue set to Drive Mode.
plt.figure(figsize=(15,5))
# 'Cylinders', y='Price'
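A minimal sketch of the requested count plot on five stand-in rows: putting Transmission on the x-axis gives vertical bars, and hue splits each bar by Drive Mode.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Transmission": ["MANUAL", "MANUAL", "AUTOMATIC", "AUTOMATIC", "AUTOMATIC"],
    "Drive Mode": ["rear wheel drive", "rear wheel drive",
                   "all wheel drive", "all wheel drive", "front wheel drive"],
})

plt.figure(figsize=(15, 5))
# One group of bars per transmission type, one bar per drive mode
ax = sns.countplot(data=df, x="Transmission", hue="Drive Mode")
```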
Observation:
In this count plot, we have plotted the Transmission feature, with Drive Mode shown through hue.
We can see the number of cars for each transmission type (for example, automated manual), broken down by drive mode.
10 points
Using a scatter plot, examine the correlation between the 'HP' and 'Price' columns of the data.
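A sketch of this step on seven rows taken from the tables printed earlier. The plot shows the relationship visually, and Series.corr gives the Pearson coefficient as a number to go with it.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "HP":    [335.0, 300.0, 300.0, 230.0, 230.0, 300.0, 221.0],
    "Price": [46135, 40650, 36350, 29450, 34500, 46120, 28995],
})

# Each point is one car: horsepower on x, price on y
ax = sns.scatterplot(data=df, x="HP", y="Price")

# Pearson correlation between the two columns (between -1 and 1)
corr = df["HP"].corr(df["Price"])
print(round(corr, 3))
```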
Observation:
A scatter plot is a type of plot or mathematical diagram that uses Cartesian coordinates to display the values of, typically, two variables for a set of data.
We have drawn the scatter plot with HP on the x-axis and Price on the y-axis.
Both features must have the same number of data points; otherwise plotting raises an error.
4. joint distributions
Seaborn's jointplot displays the relationship between 2 variables (bivariate) as well as 1D profiles (univariate) in the margins. This function is a
convenience wrapper around the JointGrid class.
Observations:
jointplot is specific to seaborn and can be used to quickly visualize and analyze the relationship between two variables and describe their individual
distributions on the same plot.
In this plot we can see the relationship between MPG-C and MPG_H.
You can adjust the arguments of the jointplot() to make the plot more readable.
Bar plots are used to display aggregated values of a variable, rather than entire distributions. This is especially useful when you have a lot of
data which is difficult to visualise in a single figure.
For example, say you want to visualise and compare the Price across Cylinders. The sns.barplot() function can be used to do that.
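A sketch of that comparison on made-up prices for 4- and 8-cylinder cars (the two 6-cylinder prices come from the tables above). The groupby line computes the same per-category means that the bars display.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Cylinders": [4, 4, 6, 6, 8, 8],
    "Price": [20000, 22000, 40650, 46135, 60000, 70000],  # 4- and 8-cylinder prices are made up
})

# One bar per cylinder count; bar height is the mean price of that group
ax = sns.barplot(data=df, x="Cylinders", y="Price")

# The same aggregation as numbers
mean_price = df.groupby("Cylinders")["Price"].mean()
print(mean_price)
```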
Observation:
By default, seaborn plots the mean value across categories, though you can plot the count, median, sum etc.
Also, barplot computes and shows the confidence interval of the mean as well.
When you want to visualise a variable with a large number of categories, it is helpful to plot the
categories along the y-axis. Let's now drill down into the Transmission subcategories.
Plot bar plot for Price and Transmission with hue="Drive Mode"
# Plot bar plot for Price and Transmission , specify hue="Drive Mode"
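A sketch of the requested plot on five rows from the tables printed earlier. Placing Transmission on the y-axis (horizontal bars) follows the advice above about plotting categories along the y-axis; that orientation is a choice, not something the exercise fixes.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Transmission": ["MANUAL", "MANUAL", "AUTOMATIC", "AUTOMATIC", "AUTOMATIC"],
    "Drive Mode": ["rear wheel drive", "rear wheel drive",
                   "all wheel drive", "all wheel drive", "front wheel drive"],
    "Price": [46135, 40650, 46120, 56670, 28995],
})

# Mean price per transmission type, split by drive mode through hue
ax = sns.barplot(data=df, x="Price", y="Transmission", hue="Drive Mode")
```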
These plots look beautiful, don't they? In a data analyst's life, such charts are an unavoidable friend. :)
Multivariate Plots
1. Pairplot
10 points
# plot pairplot on df
Observation:
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. It shows the relationship for every pair
of variables in a DataFrame (n choose 2 combinations for n variables) as a matrix of plots, and the diagonal plots are the univariate plots.
2. Heatmaps
A heat map is a two-dimensional representation of information with the help of colors. Heat maps can help the user visualize simple or
complex information.
20 points
Using heatmaps plot the correlation between the features present in the dataset.
# print corr
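A sketch of the correlation heatmap on three numeric columns from the tables printed earlier. The colormap and the annot option are illustrative choices; fixing vmin and vmax pins the colour scale to the full correlation range.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Make":  ["BMW", "BMW", "BMW", "BMW", "BMW", "Acura", "Lincoln"],
    "HP":    [335.0, 300.0, 300.0, 230.0, 230.0, 300.0, 221.0],
    "MPG_H": [26, 28, 28, 28, 28, 23, 26],
    "Price": [46135, 40650, 36350, 29450, 34500, 46120, 28995],
})

# Pairwise Pearson correlations of the numeric columns only
corr = df.select_dtypes(include="number").corr()
print(corr)

# Colour-coded matrix; annot=True writes each coefficient in its cell
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```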
Observation:
A heatmap represents each value to be plotted as a shade of colour. Usually the darker shades of the chart
represent higher values than the lighter shades. A completely different colour can also be used for a very different value.
The heatmap above shows the correlation between the variables on a colour scale from -1 to 1.
Amazing work! You have made some really eye-catching visualizations so far. Did you find the plot above complicated to understand? Hey
smarty, don't worry: in upcoming assignments you will get enough practice analysing such plots and drawing insights from them to become a
pro in this field.
Have a sweet cookie :) Congratulations! You have completed the 6th milestone challenge too.
FeedBack
We hope you’ve enjoyed this course so far. We’re committed to helping you use AIforAll course to its full potential so you can grow with us. And
that’s why we need your help in form of a feedback here
https://forms.gle/SedkKUD2TNPCnafj8