This document performs exploratory data analysis on the Haberman's Survival dataset using Pandas and Seaborn in Python. The analysis includes: 1) summarizing the distribution of each variable and identifying right skew in the axillary-node counts; 2) calculating correlations, which show little relationship between survival status and the other variables; 3) creating KDE plots and box plots to visualize variable distributions and identify outliers; 4) concluding that bivariate analysis reveals minimal insight and a model may be needed.

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from scipy.stats import zscore

data=pd.read_csv("haberman.csv")
print(data)
print(data.shape)
print(data.columns)

30 64 1 1.1
0 30 62 3 1
1 30 65 0 1
2 31 59 2 1
3 31 65 4 1
4 33 58 10 1
5 33 60 0 1
6 34 59 0 2
7 34 66 9 2
8 34 58 30 1
9 34 60 1 1
10 34 61 10 1
11 34 67 7 1
12 34 60 0 1
13 35 64 13 1
14 35 63 0 1
15 36 60 1 1
16 36 69 0 1
17 37 60 0 1
18 37 63 0 1
19 37 58 0 1
20 37 59 6 1
21 37 60 15 1
22 37 63 0 1
23 38 69 21 2
24 38 59 2 1
25 38 60 0 1
26 38 60 0 1
27 38 62 3 1
28 38 64 1 1
29 38 66 0 1
.. .. .. .. ...
275 67 66 0 1
276 67 61 0 1
277 67 65 0 1
278 68 67 0 1
279 68 68 0 1
280 69 67 8 2
281 69 60 0 1
282 69 65 0 1
283 69 66 0 1
284 70 58 0 2
285 70 58 4 2
286 70 66 14 1
287 70 67 0 1
288 70 68 0 1
289 70 59 8 1
290 70 63 0 1
291 71 68 2 1
292 72 63 0 2
293 72 58 0 1
294 72 64 0 1
295 72 67 3 1
296 73 62 0 1
297 73 68 0 1
298 74 65 3 2
299 74 63 0 1
300 75 62 1 1
301 76 67 0 1
302 77 65 3 1
303 78 65 1 2
304 83 58 2 2

[305 rows x 4 columns]


(305, 4)
Index(['30', '64', '1', '1.1'], dtype='object')
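The Index output above shows that the first data record was consumed as the header row, which is why the frame has 305 rows rather than the 306 records in the original dataset. A minimal sketch of the fix, using an inline miniature of the file since the real haberman.csv is not available here:

```python
import io
import pandas as pd

# A miniature haberman.csv: three records, no header row.
raw = "30,64,1,1\n30,62,3,1\n31,65,4,1\n"

# Default read_csv treats the first record as the header -> one row lost.
lost = pd.read_csv(io.StringIO(raw))

# Passing header=None with explicit names keeps every record.
cols = ["Age", "Op_year", "axil_nodes", "Surv_status"]
kept = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(lost.shape)  # (2, 4): first record consumed as column names
print(kept.shape)  # (3, 4): all three records preserved
```

Applied to the real file, `pd.read_csv("haberman.csv", header=None, names=cols)` would give the full 306 rows and make the later renaming step unnecessary.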

In [2]: # The column names are not helpful, so we rename them per the Kaggle description
# to make more sense of the columns.
data.columns=['Age','Op_year','axil_nodes','Surv_status']
print(pd.isnull(data).sum())
# no null values are present in the given dataset

print("\n classes:",data.Surv_status.groupby(data.Surv_status).count())
#this is an imbalanced dataset

Age 0
Op_year 0
axil_nodes 0
Surv_status 0
dtype: int64

classes: Surv_status
1 224
2 81
Name: Surv_status, dtype: int64
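The class counts above can also be expressed as proportions, which makes the imbalance noted in the cell explicit. A small sketch reconstructing the 224/81 split:

```python
import pandas as pd

# Hypothetical reconstruction of the Surv_status column
# (224 ones and 81 twos, as counted above).
surv = pd.Series([1] * 224 + [2] * 81, name="Surv_status")

# value_counts(normalize=True) returns class proportions directly.
ratio = surv.value_counts(normalize=True)
print(ratio)  # class 1 ~ 0.73, class 2 ~ 0.27 -> a roughly 73/27 imbalance
```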

In [3]: # Survival status within 5 years of operation is represented by 1 in the dataset;
# the plot shows most survivors are between ages 30 and 40.
sns.boxplot(x="Surv_status",y="Age",data=data)

plt.show()

In [4]: # There seems to be a higher concentration at axil_nodes == 0 for both survival
# statuses, comparatively higher for status 1. The same can be seen with box plots.
sns.FacetGrid(data,col='Surv_status', hue='Surv_status').map(plt.scatter,"Age","axil_nodes").add_legend()
plt.show()

In [5]: # We cannot deduce anything substantial from this distribution, other than that
# most of the data lies between 0 and 10 axil nodes across all operation years,
# for both survival statuses.
sns.FacetGrid(data,col='Surv_status',col_wrap=2).map(plt.scatter,"Op_year","axil_nodes").add_legend()

plt.show()

In [92]: np.random.seed(1234)
# The z-score probability table and norm.cdf give the same answer.
#print(data.Age.loc[data.Age==67],data.Age[data.Age==60])
m=data.Age.mean()
med=data.Age.median()
d=data.Age.std()
print(m,d,med)
print("cdf: ",norm(52.53,10.74).cdf(67))
print("pdf: ",norm(52.53,10.74).pdf(67))
data['zscore']=zscore(data.Age)  # compute the zscore column before using it below
print("calculated z score:",(data.Age[272]-52.53)/10.74)
print("stats z score:",data.zscore[272])
print(data.Age.loc[302],data.Age.loc[222])
print("Zscore for age 77 and 60 is:",data.zscore.loc[302],data.zscore.loc[222])
print("value for z score difference:",0.9890-0.7549)
print("inference: the probability of observing an age between 60 and 77 is 23.40%")
y=round(data.zscore)
data['y']=y
print(data.groupby(['y']).count())

52.5311475409836 10.744024363993269 52.0


cdf: 0.9110581539648359
pdf: 0.014987750926376829
calculated z score: 1.3472998137802605
stats z score: 1.3489014809138864
77 60
Zscore for age 77 and 60 is: 2.2811809997870687 0.696305817702659
value for z score difference: 0.23409999999999997
inference: the probability of observing an age between 60 and 77 is 23.40%
Age Op_year axil_nodes Surv_status zscore
y
-2.0 17 17 17 17 17
-1.0 89 89 89 89 89
-0.0 101 101 101 101 101
1.0 73 73 73 73 73
2.0 24 24 24 24 24
3.0 1 1 1 1 1
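The 23.40% figure above is the difference of two rounded z-table values; the same interval probability can be computed directly from the fitted normal. A sketch assuming the mean and standard deviation printed above:

```python
from scipy.stats import norm

mu, sigma = 52.53, 10.74  # Age mean and std, as used in the cell above
dist = norm(mu, sigma)

# P(60 <= Age <= 77) under the fitted normal is CDF(77) - CDF(60);
# this lands close to the ~23.4% obtained from the rounded table values.
p = dist.cdf(77) - dist.cdf(60)
print(round(p, 4))
```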

In [61]: # Univariate analysis: the Age vector is nearly normally distributed, with
# mean and median both around 52.
#x=data.Age
x=sample_data.Age  # sample_data appears to be a random sample of data drawn in a cell not shown here
sns.distplot(x,kde=True,fit=norm);
plt.show()

x.mean()

Out[61]: 52.9875

In [8]: # Operation year seems to follow a bimodal distribution.
x=data.Op_year
sns.distplot(x,norm_hist=True,fit=norm);
plt.show()

x.mode()

Out[8]: 0 58
dtype: int64

In [9]: # axil_nodes is highly right-skewed with mean ~4, which indicates most of the
# axil node counts in our dataset lie between 0 and 4.
# The difference between mean and median arises from the skewness.
x=data.axil_nodes
sns.distplot(x);
plt.show()
print("mean:",x.mean())
print("median:",x.median())

mean: 4.036065573770492
median: 1.0
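The right skew claimed above can be quantified with scipy's skew statistic. A sketch on hypothetical node counts (not the real column), showing the positive skew and the mean-above-median symptom:

```python
import pandas as pd
from scipy.stats import skew

# Hypothetical counts mimicking axil_nodes: many zeros plus a long right tail.
nodes = pd.Series([0] * 10 + [1, 1, 2, 3, 5, 13, 23, 30])

# A positive skew statistic confirms right skewness; the heavy tail
# pulls the mean above the median, as noted for the real data.
print(skew(nodes))                   # positive -> right-skewed
print(nodes.mean(), nodes.median())  # mean above the median
```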

In [10]: # CDF of Age with a vertical line at age 70: it is almost 90% probable to
# find a value equal to or less than 70.
sns.kdeplot(data.Age, cumulative=True)
plt.axvline(x=70)

plt.show()
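The cumulative KDE smooths the data; the empirical CDF at a point is simply the fraction of observations at or below it, so the "almost 90%" readings can be checked without smoothing. A sketch on hypothetical ages:

```python
import pandas as pd

# Hypothetical ages standing in for data.Age.
ages = pd.Series([30, 34, 38, 45, 52, 58, 63, 66, 70, 77])

# Empirical CDF at 70 = fraction of values <= 70.
p_le_70 = (ages <= 70).mean()
print(p_le_70)  # 0.9 -> 90% of these ages are <= 70
```

On the real column this would be `(data.Age <= 70).mean()`.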

In [11]: # CDF of Op_year with a vertical line at year 68: it is almost 90% probable
# to find a value equal to or less than 68.
sns.kdeplot(data.Op_year, cumulative=True)
plt.axvline(x=68)

plt.show()

# CDF of axil_nodes with a vertical line at 10: it is almost 90% probable to
# find a value equal to or less than 10.
sns.kdeplot(data.axil_nodes, cumulative=True)
plt.axvline(x=10)

plt.show()

In [12]: # There seems to be very little correlation between Surv_status and all the other vectors.
print("Pearson correlation matrix:\n",data.corr(method='pearson'))
print("\n\nSpearman correlation matrix:\n",data.corr(method='spearman'))

Pearson correlation matrix:


Age Op_year axil_nodes Surv_status
Age 1.000000 0.092623 -0.066548 0.064351
Op_year 0.092623 1.000000 -0.003277 -0.004076
axil_nodes -0.066548 -0.003277 1.000000 0.286191
Surv_status 0.064351 -0.004076 0.286191 1.000000

Spearman correlation matrix:


Age Op_year axil_nodes Surv_status
Age 1.000000 0.093534 -0.097884 0.052806
Op_year 0.093534 1.000000 -0.036001 -0.007028
axil_nodes -0.097884 -0.036001 1.000000 0.327468
Surv_status 0.052806 -0.007028 0.327468 1.000000
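Since Surv_status is a binary class label, Pearson correlation against it is effectively a point-biserial correlation, and comparing per-class means is often easier to read. A sketch on a hypothetical miniature of the data:

```python
import pandas as pd

# Hypothetical miniature of the dataset: node counts with class labels.
df = pd.DataFrame({
    "axil_nodes":  [0, 1, 2, 0, 9, 21, 30, 8],
    "Surv_status": [1, 1, 1, 1, 2, 2, 2, 2],
})

# Mean node count per class: class 2 tends to have more positive nodes,
# matching the modest positive correlation reported above.
means = df.groupby("Surv_status")["axil_nodes"].mean()
print(means)  # class 1 -> 0.75, class 2 -> 17.0
```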

In [13]: # The pair plot summarizes all the plots mentioned above.
sns.pairplot(data, hue='Surv_status')
plt.show()
# Using these EDA techniques, I realized there isn't much we can deduce from
# bivariate analysis.
# We can comment on the distributions of individual vectors using PDFs and CDFs
# (comments are in the respective cells).
# The correlation matrices indicate the absence of any strong correlation between
# survival status and the other variables.
