SML Lab 1
Date:11/02/2023
Objective
1) EDA
Question: Take a suitable dataset having at least six features and build a linear regression ML
model. Should the p-values of the feature variables be taken into account to check the
adequacy of the model?
We can use different graphs and plots, and handle missing observations and outliers appropriately.
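One way to approach the p-value question is to compute a p-value for each regression coefficient. The sketch below is not part of the original notebook: it uses synthetic data and NumPy only, the function name `ols_p_values` is hypothetical, and the p-values rely on a normal approximation to the t distribution, which is only reasonable when there are many more rows than features.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS with an intercept; return (coefficients, two-sided p-values).

    Assumption: p-values use a normal approximation to the t distribution.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the estimates
    se = np.sqrt(np.diag(cov))                    # standard errors
    z = beta / se
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, p

# Synthetic illustration: y depends on the first feature only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=500)
coef, p_values = ols_p_values(X_demo, y_demo)
print(np.round(coef, 2), np.round(p_values, 4))
```

A small p-value suggests that a coefficient differs from zero, but it does not by itself show that the model is adequate; residual checks and the error on held-out data remain relevant.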
Dataset
We have also included a few qualitative parameters to demonstrate how to deal with
such data.
https://colab.research.google.com/drive/1TaSL_2telgbE4BEnmlqHEpPyaTjjhhu-#printMode=true 1/19
2/11/24, 10:48 PM Copy of Assignment Lab1.ipynb - Colaboratory
Codes
We have adopted code from the following sources:
1) Kaggle, GitHub, and the official Python library websites (sklearn, pandas, numpy, matplotlib, seaborn)
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
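# `insurance` is the dataset as a pandas DataFrame (the cell that loads it is not shown in this printout)
len(insurance.values)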
1338
insurance.shape  # (rows, columns)
(1338, 7)
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 expenses 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
insurance.isnull()
insurance.isnull().sum()
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
expenses 0
dtype: int64
insurance.tail()
insurance
col=list(insurance.columns)
col
display(insurance['smoker'].mode()[0])
insurance['children'].mean()
'no'
1.0949177877429
insurance['age'].dtype
dtype('int64')
col
We have used a loop to create bar graphs showing the frequency (count) of each value for all
numerical columns (such as children, bmi, etc.).
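The plotting cell itself is not visible in this printout, so the following is only a minimal sketch of such a loop. The rows in `demo` are made up to keep the example self-contained, and the `Agg` backend line is only there so the sketch runs without a display (omit it inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows, used only to make the sketch runnable.
demo = pd.DataFrame({
    "children": [0, 1, 3, 0, 0, 2],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
})

fig, axes = plt.subplots(1, len(demo.columns), figsize=(8, 3))
counts = {}
for ax, name in zip(axes, demo.columns):
    # Frequency of each distinct value in this column
    counts[name] = demo[name].value_counts().sort_index()
    counts[name].plot(kind="bar", ax=ax)
    ax.set_title(name)
    ax.set_ylabel("count")
fig.tight_layout()
```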
insurance.describe()
# Treating outliers with the IQR rule
Q1 = insurance.bmi.quantile(0.25)
Q3 = insurance.bmi.quantile(0.75)
Q1
Q3
34.7
IQR = Q3 - Q1
IQR
8.400000000000002
Q1 - 1.5*IQR
Q3 + 1.5*IQR
47.300000000000004
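With the values shown above (Q3 = 34.7 and IQR = 8.4, hence Q1 = 26.3), the fences work out to 26.3 - 12.6 = 13.7 and 34.7 + 12.6 = 47.3. The same rule can be sketched as a small helper; the function name `iqr_fences` and the sample values are hypothetical and not taken from the notebook.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obvious outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(sample)
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```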
Box Plot
plt.boxplot(insurance['bmi'])
plt.show()
fig=plt.figure(figsize=(10,20))
The above code uses a loop to build these box plots, where v is the variable that takes a
different column name, such as bmi or children, on each iteration of the loop.
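Only the first line of that cell survives in this printout, so the loop below is a sketch of how it may have looked rather than the original code. The column list and the subplot layout are assumptions; the five rows are the bmi and children values shown in the head() output later in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9],
    "children": [0, 1, 3, 0, 0],
})

numeric_cols = ["bmi", "children"]  # assumed list of columns to plot
fig = plt.figure(figsize=(10, 20))
for position, v in enumerate(numeric_cols, start=1):
    # One subplot per variable, stacked vertically
    ax = fig.add_subplot(len(numeric_cols), 1, position)
    ax.boxplot(demo[v])
    ax.set_title(v)
titles = [ax.get_title() for ax in fig.axes]
```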
Q1 = insurance.bmi.quantile(0.25)
Q3 = insurance.bmi.quantile(0.75)
IQR = Q3 - Q1
insurance = insurance[(insurance.bmi >= Q1 - 1.5*IQR) & (insurance.bmi <= Q3 + 1.5*IQR)]
insurance['bmi'].dtype
dtype('float64')
Q1 = insurance.expenses.quantile(0.25)
Q3 = insurance.expenses.quantile(0.75)
IQR = Q3 - Q1
insurance = insurance[(insurance.expenses >= Q1 - 1.5*IQR) & (insurance.expenses <= Q3 + 1.5*IQR)]
insurance['expenses'].dtype
dtype('float64')
insurance.shape
(1191, 7)
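The two filtering steps above follow the same pattern, so they could be wrapped in one helper and applied column by column. This is a sketch under assumptions: the name `drop_iqr_outliers` is hypothetical and the data is made up. As in the notebook, the quartiles for the second column are computed on the rows that survived the first filter.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)  # inclusive bounds
    return df[keep]

# Made-up data: one extreme bmi value, no extreme expenses.
demo = pd.DataFrame({
    "bmi": [20, 22, 24, 26, 28, 30, 32, 34, 90],
    "expenses": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
filtered = demo
for name in ["bmi", "expenses"]:
    filtered = drop_iqr_outliers(filtered, name)
print(len(demo), "->", len(filtered))
```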
Drop the object type columns because regression needs only numeric data
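for i in col:
    # The target column in this dataset is named 'expenses' (the original check used 'charges')
    if i != 'expenses' and insurance[i].dtype == 'float':
        # Fill missing values in this column with the column's own mean
        insurance[i] = insurance[i].fillna(insurance[i].mean())
    elif i != 'expenses' and insurance[i].dtype == 'object':
        # Drop qualitative (object-type) columns
        insurance.drop(i, axis=1, inplace=True)
    else:
        pass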
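The same effect can be sketched more compactly with `select_dtypes`, which keeps only the numeric columns, followed by a column-wise mean fill. The rows in `demo` are made up for illustration; this is an alternative formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing bmi value and one qualitative column.
demo = pd.DataFrame({
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", "no"],
    "children": [0, 1, 3],
})

numeric = demo.select_dtypes(include="number")  # drops object-type columns
numeric = numeric.fillna(numeric.mean())        # fill each column with its own mean
print(numeric)
```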
insurance.corr().T
insurance.shape
(1191, 4)
col_list = list(insurance.columns)  # the four numeric columns that remain
X = insurance[col_list]
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1191 entries, 0 to 1337
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1191 non-null int64
1 bmi 1191 non-null float64
2 children 1191 non-null int64
3 expenses 1191 non-null float64
dtypes: float64(2), int64(2)
memory usage: 46.5 KB
X.values
for i in range(len(X.columns)):
    print(i)
0
1
2
3
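from statsmodels.stats.outliers_influence import variance_inflation_factor

X = insurance[col_list]
vif_data = pd.DataFrame()
print(X.columns)
vif_data["feature"] = X.columns
# VIF of column i, computed against all the other columns in X
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)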
VIF: The Variance Inflation Factor is used to assess multicollinearity between the variables. A VIF
> 5 implies that multicollinearity exists for that variable, hence we drop such parameters.
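For intuition, the VIF of a column is 1 / (1 - R^2), where R^2 comes from regressing that column on the remaining ones. The sketch below computes it with NumPy on synthetic data; the function name `vif` is hypothetical, and it assumes an intercept in each auxiliary regression. statsmodels' `variance_inflation_factor` does not add an intercept on its own, so its values can differ from this sketch.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns.

    Assumption: each auxiliary regression includes an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic illustration: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```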
insurance.size
4764
len(X.columns)
insurance=insurance.drop(['age'], axis=1)
col_list = []
for col in insurance.columns:
    # Note: the target column is named 'expenses', so the 'charges' check never
    # excludes it and 'expenses' appears in the VIF table below
    if insurance[col].dtype != 'object' and col != 'charges':
        col_list.append(col)
X = insurance[col_list]
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(len(X.columns))]
print(vif_data)
# All VIF values are now below 5
feature VIF
0 bmi 3.120146
1 children 1.784762
2 expenses 2.676247
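# Note: 'expenses' is selected as a feature here and is also the target y (the last column),
# so the model sees the target among its inputs
x=insurance.loc[:,['bmi','children','expenses']]
y=insurance.iloc[:,-1]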
insurance.head()
    bmi  children  expenses
0  27.9         0  16884.92
1  33.8         1   1725.55
2  33.0         3   4449.46
3  22.7         0  21984.47
4  28.9         0   3866.86
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(952, 3)
(952,)
(239, 3)
(239,)
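The cells that create x_train, x_test, y_train, y_test and predictions are not visible in this printout. The shapes above are consistent with an 80/20 split of 1191 rows, so a possible reconstruction is sketched below. The `test_size`, the `random_state` and the synthetic data are all assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned data (1191).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1191, 3))
y_demo = x_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1191)

# test_size=0.2 and random_state=0 are assumptions.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression().fit(x_train, y_train)  # fit on the training part only
predictions = model.predict(x_test)               # predict on the held-out part
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```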
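# Note: y_test keeps its original row labels, while pd.Series(predictions) is labelled 0..n-1.
# pandas aligns the two columns by label, which is why Actual_data is mostly NaN in the output.
# Passing y_test.values instead would pair the values by position.
error_pred=pd.DataFrame({'Actual_data':y_test,'Prediction_data':pd.Series(predictions)})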
error_pred
Actual_data Prediction_data
0 NaN 4673.39
1 1725.55 8551.35
2 NaN 15170.07
3 NaN 2261.57
4 NaN 1631.82
error_pred['Error']=error_pred['Actual_data']-error_pred['Prediction_data']
error_pred
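The NaN entries in Actual_data above come from pandas aligning the two columns by index label rather than by position. A minimal sketch of the difference follows; the row labels and prediction values are made up for illustration, and the variable names `misaligned` and `aligned` are hypothetical.

```python
import numpy as np
import pandas as pd

# y_test keeps the row labels it had before the split (labels made up here).
y_test = pd.Series([1725.55, 4449.46, 3866.86], index=[1, 7, 12])
predictions = np.array([8551.35, 4673.39, 2261.57])  # made-up predictions

# Aligned by label: only label 1 exists on both sides, other rows get NaN.
misaligned = pd.DataFrame({"Actual_data": y_test,
                           "Prediction_data": pd.Series(predictions)})

# Aligned by position: strip the labels from y_test first.
aligned = pd.DataFrame({"Actual_data": y_test.to_numpy(),
                        "Prediction_data": predictions})
aligned["Error"] = aligned["Actual_data"] - aligned["Prediction_data"]
print(misaligned)
print(aligned)
```

With positional alignment every row has both an actual and a predicted value, so the Error column is defined for all rows.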