
SML Lab 1

Machine learning (ML) is a branch of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to learn and make decisions without being explicitly programmed for specific tasks. In a machine learning lab exercise, students gain hands-on experience with various ML techniques, tools, and frameworks, allowing them to understand the practical applications and underlying principles of this rapidly evolving field.

Uploaded by

rsprashu1104

2/11/24, 10:48 PM Copy of Assignment Lab1.ipynb - Colaboratory

INSURANCE PREDICTION USING MULTIPLE REGRESSION MODEL

Students: Mayuri.A, Apoorva Tumma, Prashanthi R S, Ragortham R C, R M Uma, Swetha.R, J Jeya Sharmila, Anjali Gupta, Antony Joshy, Gokul Premkumar

Subject: Statistical Machine Learning

Date:11/02/2023

Objective

1) EDA

2) Multiple linear regression

Question: Take a suitable dataset with at least six features and build a linear regression ML model. Should the p-values of the feature variables be taken into account when checking the adequacy of the model?

Insurance charges prediction


EDA

Exploratory Data Analysis (EDA) is a preliminary phase for understanding the data holistically.

We can use different graphs and plots, and handle missing observations and outliers appropriately.

Dataset

The dataset below was sourced from the Kaggle website:

https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction

The dataset is used by practitioners to understand multiple linear regression. However, the original dataset had more than 12 attributes; the team cleaned it and reduced it to the variables of interest.

We have also kept a few qualitative variables to demonstrate how to deal with such data.

Sources of reference for our work

1) Applied Regression Analysis and Generalized Linear Models - John Fox

2) Regression Modeling - Michael Panik

3) Applied Linear Regression - Sanford Weisberg

https://colab.research.google.com/drive/1TaSL_2telgbE4BEnmlqHEpPyaTjjhhu-#printMode=true

Code

We have adapted code from:

1) Kaggle, GitHub, and the official Python websites (sklearn, pandas, numpy, matplotlib, seaborn)

import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import drive


drive.mount('/content/drive')

## Importing the Given Dataset


insurance=pd.read_csv("/content/drive/MyDrive/insurance data.csv")
insurance.head()

age sex bmi children smoker region expenses

0 19 female 27.9 0 yes southwest 16884.92

1 18 male 33.8 1 no southeast 1725.55

2 28 male 33.0 3 no southeast 4449.46

3 33 male 22.7 0 no northwest 21984.47

4 32 male 28.9 0 no northwest 3866.86

len(insurance.values)

1338

The length (number of observations) of the dataset is 1338.

insurance.shape

(1338, 7)

insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64

3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 expenses 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

insurance.isnull()

age sex bmi children smoker region expenses

0 False False False False False False False

1 False False False False False False False

2 False False False False False False False

3 False False False False False False False

4 False False False False False False False

... ... ... ... ... ... ... ...

1333 False False False False False False False

1334 False False False False False False False

1335 False False False False False False False

1336 False False False False False False False

1337 False False False False False False False

1338 rows × 7 columns

insurance.isnull().sum()

age 0
sex 0
bmi 0
children 0
smoker 0
region 0
expenses 0
dtype: int64

There are no null values in the dataset.

insurance.tail()


age sex bmi children smoker region expenses

1333 50 male 31.0 3 no northwest 10600.55

1334 18 female 31.9 0 no northeast 2205.98

1335 18 female 36.9 0 no southeast 1629.83

1336 21 female 25.8 0 no southwest 2007.95

1337 61 female 29.1 0 yes northwest 29141.36

insurance

age sex bmi children smoker region expenses

0 19 female 27.9 0 yes southwest 16884.92

1 18 male 33.8 1 no southeast 1725.55

2 28 male 33.0 3 no southeast 4449.46

3 33 male 22.7 0 no northwest 21984.47

4 32 male 28.9 0 no northwest 3866.86

... ... ... ... ... ... ... ...

1333 50 male 31.0 3 no northwest 10600.55

1334 18 female 31.9 0 no northeast 2205.98

1335 18 female 36.9 0 no southeast 1629.83

1336 21 female 25.8 0 no southwest 2007.95

1337 61 female 29.1 0 yes northwest 29141.36

1338 rows × 7 columns

col=list(insurance.columns)
col

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'expenses']

display(insurance['smoker'].mode()[0])
insurance['children'].mean()

'no'
1.0949177877429


insurance['age'].dtype


dtype('int64')

col

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'expenses']

for col_name in col:
    if(insurance[col_name].dtypes=='int64' or insurance[col_name].dtypes=='float64'):
        plt.hist(insurance[col_name])
        plt.xlabel(col_name)
        plt.ylabel('count')
        plt.show()


We used a loop to draw a histogram of each numeric column (for example children and bmi), showing the frequency (count) of its values.
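The dtype test in the loop above can also be expressed with `select_dtypes`, which picks numeric columns without naming each integer or float type. The sketch below is only an illustration: the frame `toy` is a stand-in built from the first five rows shown earlier, and `np.histogram` is used in place of plotting so the bin counts can be printed.

```python
import numpy as np
import pandas as pd

# Stand-in frame (assumption): a few columns from the first five rows shown earlier.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# select_dtypes keeps every numeric column, whatever its exact int/float width.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# np.histogram returns the bin counts a histogram with the same bins would draw.
for name in numeric_cols:
    counts, edges = np.histogram(toy[name], bins=3)
    print(name, counts.tolist())
```

Replacing `np.histogram` with `plt.hist(toy[name])` inside the loop would give the plotted version.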

import seaborn as sns


sns.scatterplot(data=insurance, x="age", y="expenses")

<Axes: xlabel='age', ylabel='expenses'>

Check for outliers


for col_name in col:
    if(insurance[col_name].dtypes=='int64' or insurance[col_name].dtypes=='float64'):
        plt.boxplot(insurance[col_name])
        plt.xlabel(col_name)
        plt.ylabel('value')  # a box plot shows the values themselves, not counts
        plt.show()


insurance.describe()


age bmi children expenses

count 1338.000000 1338.000000 1338.000000 1338.000000

mean 39.207025 30.665471 1.094918 13270.422414

std 14.049960 6.098382 1.205493 12110.011240

min 18.000000 16.000000 0.000000 1121.870000

25% 27.000000 26.300000 0.000000 4740.287500

50% 39.000000 30.400000 1.000000 9382.030000

75% 51.000000 34.700000 2.000000 16639.915000

max 64.000000 53.100000 5.000000 63770.430000

Inter Quartile Range

#treating outliers ()
Q1 = insurance.bmi.quantile(0.25)
Q3 = insurance.bmi.quantile(0.75)

Q1
Q3

34.7

IQR = Q3 - Q1
IQR

8.400000000000002

Q1 - 1.5*IQR
Q3 + 1.5*IQR

47.300000000000004
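A notebook cell displays only its last expression, which is why only Q3 and the upper fence appear above. As a small worked check, the sketch below recomputes both fences from the bmi quartiles reported by insurance.describe() (26.3 and 34.7); the variable names are illustrative.

```python
# Quartiles of bmi as reported by insurance.describe() above.
q1, q3 = 26.3, 34.7
iqr = q3 - q1

# Tukey fences: points outside [lower_fence, upper_fence] are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(round(lower_fence, 1), round(upper_fence, 1))  # 13.7 47.3
```

Since the minimum bmi in the summary is 16.0 and the maximum is 53.1, only the high side of bmi can contain flagged points.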

Box Plot

plt.boxplot(insurance['bmi'])
plt.show()


fig=plt.figure(figsize=(10,20))

for i,v in enumerate(col):
    if insurance[v].dtype != 'object':
        plt.subplot(8,2,i+1)
        sns.boxplot(insurance[v])


The above code uses a loop to build these box plots, where v is the variable that takes a different column name, such as bmi or children, on each iteration of the loop.

Removing the outliers in the data

Q1 = insurance.bmi.quantile(0.25)
Q3 = insurance.bmi.quantile(0.75)
IQR = Q3 - Q1
insurance = insurance[(insurance.bmi >= Q1 - 1.5*IQR) & (insurance.bmi <= Q3 + 1.5*IQR)]


insurance['bmi'].dtype

dtype('float64')

Q1 = insurance.expenses.quantile(0.25)
Q3 = insurance.expenses.quantile(0.75)
IQR = Q3 - Q1
insurance = insurance[(insurance.expenses >= Q1 - 1.5*IQR) & (insurance.expenses <= Q3 + 1.5*IQR)]

insurance['expenses'].dtype

dtype('float64')

insurance.shape

(1191, 7)
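The same IQR filter is applied twice above, once for bmi and once for expenses. A minimal sketch of how it could be wrapped in a helper is given below; `drop_iqr_outliers` and the frame `toy` are hypothetical names and made-up values, not part of the original notebook. As in the cells above, the quartiles for the second column are computed after the first column has already been filtered.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Made-up values with one extreme bmi and one extreme expense.
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 60.0],
    "expenses": [2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 90000.0, 7000.0],
})

cleaned = toy
for name in ["bmi", "expenses"]:
    cleaned = drop_iqr_outliers(cleaned, name)
print(cleaned.shape)  # (5, 2)
```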

After removing outliers


for col_name in col:
    if(insurance[col_name].dtypes=='int64' or insurance[col_name].dtypes=='float64'):
        sns.boxplot(insurance[col_name])
        plt.xlabel(col_name)
        plt.ylabel('count')
        plt.show()


Drop the object type columns because regression needs only numeric data

for i in col:
    if i != 'charges' and insurance[i].dtype == 'float':
        insurance.fillna(insurance[i].mean(),inplace=True)
    elif i != 'charges' and insurance[i].dtype == 'object':
        insurance.drop(i,axis=1,inplace=True)
    else:
        pass
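Dropping the object columns removes sex, smoker and region from the model. An alternative way to deal with qualitative variables, which the notebook does not use, is to encode them as 0/1 indicator columns. The sketch below assumes `pd.get_dummies` for this and uses a stand-in frame `toy` built from the first four rows shown earlier.

```python
import pandas as pd

# Stand-in frame (assumption): a few columns from the first rows of the dataset.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "expenses": [16884.92, 1725.55, 4449.46, 21984.47],
})

# drop_first=True removes one level per variable, so the indicator columns
# are not perfectly collinear with the intercept of the regression.
encoded = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
print(encoded["smoker_yes"].tolist())
```

Each encoded column is numeric, so it can be passed to the regression together with age, bmi and children.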

insurance.corr().T

age bmi children expenses

age 1.000000 0.123845 0.038179 0.448798

bmi 0.123845 1.000000 0.007357 -0.064589

children 0.038179 0.007357 1.000000 0.089083

expenses 0.448798 -0.064589 0.089083 1.000000

insurance.shape

(1191, 4)

Linear Regression


We use modules and built-in features from scikit-learn (sklearn) for linear regression and the train/test split.

from sklearn.linear_model import LinearRegression


from sklearn.model_selection import train_test_split

from statsmodels.stats.outliers_influence import variance_inflation_factor


col_list = []
for col in insurance.columns:
    if ((insurance[col].dtype != 'object') & (col != 'charges') ):#only num cols except f
        col_list.append(col)

X = insurance[col_list]
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1191 entries, 0 to 1337
Data columns (total 4 columns):

# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1191 non-null int64
1 bmi 1191 non-null float64
2 children 1191 non-null int64
3 expenses 1191 non-null float64
dtypes: float64(2), int64(2)
memory usage: 46.5 KB

X.values

array([[1.900000e+01, 2.790000e+01, 0.000000e+00, 1.688492e+04],


[1.800000e+01, 3.380000e+01, 1.000000e+00, 1.725550e+03],
[2.800000e+01, 3.300000e+01, 3.000000e+00, 4.449460e+03],
...,
[1.800000e+01, 3.690000e+01, 0.000000e+00, 1.629830e+03],
[2.100000e+01, 2.580000e+01, 0.000000e+00, 2.007950e+03],
[6.100000e+01, 2.910000e+01, 0.000000e+00, 2.914136e+04]])

for i in range(len(X.columns)):
    print(i)

0
1
2
3

from statsmodels.stats.outliers_influence import variance_inflation_factor


col_list = []
for col in insurance.columns:
    if ((insurance[col].dtype != 'object') & (col != 'charges') ):
        col_list.append(col)

X = insurance[col_list]
vif_data = pd.DataFrame()
print(X.columns)
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)

Index(['age', 'bmi', 'children', 'expenses'], dtype='object')


feature VIF
0 age 10.418850
1 bmi 7.955137
2 children 1.786420
3 expenses 3.660522

VIF: The Variance Inflation Factor is used to detect multicollinearity between the variables. A VIF > 5 implies that multicollinearity exists for that variable, so we drop such variables, starting with age, which has the highest VIF.
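Two details of the VIF cell above are worth checking. The filter excludes a column named 'charges', but the response here is named 'expenses', so the response is still part of X. In addition, statsmodels' variance_inflation_factor uses the matrix exactly as passed, so a constant column has to be supplied by the caller; without one, variables with large means such as age and bmi can show inflated values. The sketch below is a from-scratch illustration of VIF with an intercept, computed on predictors only. The helper `vif_with_intercept` and the synthetic data are assumptions, not the notebook's code or data.

```python
import numpy as np
import pandas as pd

def vif_with_intercept(features):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns plus an intercept."""
    values = features.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    result = {}
    for j in range(n_cols):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        result[features.columns[j]] = 1.0 / (1.0 - r_squared)
    return pd.Series(result, name="VIF")

# Synthetic, independently drawn predictors (made-up data, response excluded).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.uniform(18, 64, size=500),
    "bmi": rng.normal(30, 6, size=500),
    "children": rng.integers(0, 6, size=500),
})
vifs = vif_with_intercept(toy)
print(vifs.round(2))
```

Because the synthetic columns are drawn independently, every VIF comes out close to 1, which is the value expected when there is no multicollinearity.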

insurance.size

4764


len(X.columns)

insurance=insurance.drop(['age'], axis=1)

col_list = []
for col in insurance.columns:
    if ((insurance[col].dtype != 'object') & (col != 'charges') ):
        col_list.append(col)

X = insurance[col_list]
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(len(X.columns))]
print(vif_data)
# vif value < 5

feature VIF
0 bmi 3.120146
1 children 1.784762
2 expenses 2.676247

x=insurance.loc[:,['bmi','children','expenses']]
y=insurance.iloc[:,-1]
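In the cell above, x lists 'expenses' among the predictors and y is the last column, which is also 'expenses', so the model is given the response as an input. That would explain why values such as 1725.55 appear verbatim in the predictions printed later. A minimal sketch of separating predictors from the response is shown below, using a stand-in frame `toy` with the five rows displayed by insurance.head().

```python
import pandas as pd

# Stand-in frame (assumption): the five rows displayed by insurance.head().
toy = pd.DataFrame({
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9],
    "children": [0, 1, 3, 0, 0],
    "expenses": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})

target = "expenses"
x = toy.drop(columns=[target])   # predictors only
y = toy[target]                  # response
print(x.columns.tolist(), y.name)
```

With this split, x would have two columns rather than the three reported in the shapes below.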

insurance.head()

bmi children expenses

0 27.9 0 16884.92

1 33.8 1 1725.55

2 33.0 3 4449.46

3 22.7 0 21984.47

4 28.9 0 3866.86

x_train, x_test, y_train, y_test=train_test_split(x,y,train_size=0.8, random_state=0)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(952, 3)
(952,)
(239, 3)
(239,)

from sklearn.linear_model import LinearRegression


l_model=LinearRegression()
#building the model
l_model.fit(x_train, y_train)
predictions=l_model.predict(x_test)
predictions

array([ 4673.39, 8551.35, 15170.07, 2261.57, 1631.82, 10141.14,


7729.65, 5630.46, 33732.69, 11015.17, 3756.62, 13143.86,
9850.43, 4185.1 , 2138.07, 14133.04, 10422.92, 12347.17,
23807.24, 17081.08, 7151.09, 2473.33, 33900.65, 10370.91,
8823.28, 8457.82, 11093.62, 6406.41, 5415.66, 4189.11,
1837.24, 4320.41, 7256.72, 4058.12, 8347.16, 22218.11,
10796.35, 14043.48, 8825.09, 2585.85, 18608.26, 5458.05,
20149.32, 5227.99, 6593.51, 3556.92, 11299.34, 4779.6 ,
11944.59, 9447.25, 18972.5 , 11763. , 2680.95, 5124.19,
1711.03, 1875.34, 3757.84, 1769.53, 13880.95, 10325.21,
2699.57, 12244.53, 4133.64, 28340.19, 20167.34, 13204.29,
24180.93, 30284.64, 10795.94, 14474.68, 9625.92, 25309.49,
5152.13, 7209.49, 18903.49, 10450.55, 12265.51, 20420.6 ,
12741.17, 24227.34, 1880.49, 10381.48, 2134.9 , 6272.48,
20781.49, 18033.97, 19539.24, 11931.13, 4428.89, 14256.19,
5397.62, 12430.95, 5080.1 , 4670.64, 8211.1 , 22478.6 ,
5966.89, 10579.71, 6414.18, 3994.18, 8606.22, 6289.75,
3353.28, 7243.81, 2207.7 , 24535.7 , 3176.29, 7749.16,
11082.58, 11534.87, 8515.76, 14455.64, 13747.87, 8023.14,
11272.33, 10214.64, 11187.66, 8334.59, 13352.1 , 3847.67,
13228.85, 14988.43, 8978.19, 1141.45, 12222.9 , 10959.33,
11345.52, 4687.8 , 6238.3 , 17748.51, 19214.71, 5253.52,
18310.74, 15359.1 , 14210.54, 7419.48, 19023.26, 14478.33,
4074.45, 7935.29, 22462.04, 10493.95, 2721.32, 3268.85,
10928.85, 6548.2 , 5246.05, 2352.97, 2302.3 , 11356.66,
34166.27, 4561.19, 5383.54, 24869.84, 6360.99, 8627.54,
1708. , 13887.2 , 7640.31, 1725.55, 10704.47, 1980.07,
8603.82, 10577.09, 2897.32, 8162.72, 6653.79, 21771.34,
6875.96, 21880.82, 11848.14, 9058.73, 17663.14, 17942.11,
13041.92, 4762.33, 4260.74, 4151.03, 9225.26, 4237.13,
28468.92, 5693.43, 6186.13, 11512.41, 15019.76, 20234.85,
2850.68, 10106.13, 8442.67, 6799.46, 4931.65, 10355.64,
7804.16, 17904.53, 9778.35, 4234.93, 15555.19, 17179.52,
10226.28, 13224.06, 1635.73, 9095.07, 7153.55, 3062.51,
18955.22, 8252.28, 2523.17, 14235.07, 7325.05, 19350.37,
11552.9 , 4571.41, 1622.19, 7623.52, 8068.19, 6474.01,
11305.93, 12094.48, 11482.63, 12957.12, 7740.34, 11881.97,
11411.69, 1136.4 , 10797.34, 13224.69, 23887.66, 6435.62,
4433.92, 9620.33, 27375.9 , 14394.4 , 27218.44, 2020.55,
9048.03, 3056.39, 1972.95, 11737.85, 9301.89])

error_pred=pd.DataFrame({'Actual_data':y_test,'Prediction_data':pd.Series(predictions)})
error_pred


Actual_data Prediction_data

0 NaN 4673.39

1 1725.55 8551.35

2 NaN 15170.07

3 NaN 2261.57

4 NaN 1631.82

... ... ...

1311 4571.41 NaN

1315 11272.33 NaN

1329 10325.21 NaN

1331 10795.94 NaN

1332 11411.69 NaN

440 rows × 2 columns


error_pred['Error']=error_pred['Actual_data']-error_pred['Prediction_data']
error_pred


Actual_data Prediction_data Error

0 NaN 4673.39 NaN

1 1725.55 8551.35 -6825.8

2 NaN 15170.07 NaN

3 NaN 2261.57 NaN

4 NaN 1631.82 NaN
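The table above has 440 rows and many NaN values because pd.Series(predictions) is labelled 0, 1, 2, ... while y_test keeps the original row labels, and pandas aligns the two columns by label. A minimal sketch of one way to keep each actual value next to its own prediction is shown below; the prediction values are made up and the three row labels are taken from the table above.

```python
import numpy as np
import pandas as pd

# Stand-ins: y_test keeps original row labels, predictions is a plain array.
y_test = pd.Series([1725.55, 4571.41, 11272.33],
                   index=[1, 1311, 1315], name="expenses")
predictions = np.array([1800.0, 4400.0, 11500.0])   # made-up values

# Label alignment: the union of both indexes is used, leaving NaN gaps.
misaligned = pd.DataFrame({"Actual_data": y_test,
                           "Prediction_data": pd.Series(predictions)})

# Positional pairing: drop the labels from y_test, then reattach its index.
aligned = pd.DataFrame({"Actual_data": y_test.to_numpy(),
                        "Prediction_data": predictions},
                       index=y_test.index)
aligned["Error"] = aligned["Actual_data"] - aligned["Prediction_data"]
print(len(misaligned), len(aligned))  # 5 3
```

With the aligned frame, the Error column has one value per test row and can be summarised directly, for example with aligned["Error"].abs().mean().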
