This document performs exploratory data analysis on the Haberman's Survival dataset using Pandas and Seaborn in Python. The analysis includes: 1) summarizing the distribution of each variable and identifying right skew in the axillary-node counts; 2) calculating correlations, which show little relationship between survival status and the other variables; 3) creating KDE plots and box plots to visualize variable distributions and identify outliers; 4) concluding that bivariate analysis reveals minimal insight and a model may be needed.

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from scipy.stats import zscore

data=pd.read_csv("haberman.csv")
print(data)
print(data.shape)
print(data.columns)

30 64 1 1.1
0 30 62 3 1
1 30 65 0 1
2 31 59 2 1
3 31 65 4 1
4 33 58 10 1
5 33 60 0 1
6 34 59 0 2
7 34 66 9 2
8 34 58 30 1
9 34 60 1 1
10 34 61 10 1
11 34 67 7 1
12 34 60 0 1
13 35 64 13 1
14 35 63 0 1
15 36 60 1 1
16 36 69 0 1
17 37 60 0 1
18 37 63 0 1
19 37 58 0 1
20 37 59 6 1
21 37 60 15 1
22 37 63 0 1
23 38 69 21 2
24 38 59 2 1
25 38 60 0 1
26 38 60 0 1
27 38 62 3 1
28 38 64 1 1
29 38 66 0 1
.. .. .. .. ...
275 67 66 0 1
276 67 61 0 1
277 67 65 0 1
278 68 67 0 1
279 68 68 0 1
280 69 67 8 2
281 69 60 0 1
282 69 65 0 1
283 69 66 0 1
284 70 58 0 2
285 70 58 4 2
286 70 66 14 1
287 70 67 0 1
288 70 68 0 1
289 70 59 8 1
290 70 63 0 1
291 71 68 2 1
292 72 63 0 2
293 72 58 0 1
294 72 64 0 1
295 72 67 3 1
296 73 62 0 1
297 73 68 0 1
298 74 65 3 2
299 74 63 0 1
300 75 62 1 1
301 76 67 0 1
302 77 65 3 1
303 78 65 1 2
304 83 58 2 2

[305 rows x 4 columns]


(305, 4)
Index(['30', '64', '1', '1.1'], dtype='object')
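The Index output above shows that the first data record was consumed as the header row, which is why the frame has 305 rows rather than the 306 records in the original dataset. A minimal sketch of the fix, using an inline miniature of the file since the real haberman.csv is not available here:

```python
import io
import pandas as pd

# A miniature haberman.csv: three records, no header row.
raw = "30,64,1,1\n30,62,3,1\n31,65,4,1\n"

# Default read_csv treats the first record as the header -> one row lost.
lost = pd.read_csv(io.StringIO(raw))

# Passing header=None with explicit names keeps every record.
cols = ["Age", "Op_year", "axil_nodes", "Surv_status"]
kept = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(lost.shape)  # (2, 4): first record consumed as column names
print(kept.shape)  # (3, 4): all three records preserved
```

Applied to the real file, `pd.read_csv("haberman.csv", header=None, names=cols)` would give the full 306 rows and make the later renaming step unnecessary.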

In [2]: # The column names are not helpful, so we rename them per the Kaggle description
# to make more sense of the columns.
data.columns=['Age','Op_year','axil_nodes','Surv_status']
print(pd.isnull(data).sum())
# no null values are present in the given dataset

print("\n classes:",data.Surv_status.groupby(data.Surv_status).count())
#this is an imbalanced dataset

Age 0
Op_year 0
axil_nodes 0
Surv_status 0
dtype: int64

classes: Surv_status
1 224
2 81
Name: Surv_status, dtype: int64
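The class counts above can also be expressed as proportions, which makes the imbalance noted in the cell explicit. A small sketch reconstructing the 224/81 split:

```python
import pandas as pd

# Hypothetical reconstruction of the Surv_status column
# (224 ones and 81 twos, as counted above).
surv = pd.Series([1] * 224 + [2] * 81, name="Surv_status")

# value_counts(normalize=True) returns class proportions directly.
ratio = surv.value_counts(normalize=True)
print(ratio)  # class 1 ~ 0.73, class 2 ~ 0.27 -> a roughly 73/27 imbalance
```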

In [3]: # Survival status within 5 years of operation is represented by 1 in the dataset;
# the plot shows most survivors are between ages 30 and 40.
sns.boxplot(x="Surv_status",y="Age",data=data)

plt.show()

In [4]: # There seems to be a higher concentration at axil_nodes == 0 for both survival
# statuses, comparatively higher for status 1. The same can be seen with box plots.
sns.FacetGrid(data,col='Surv_status', hue='Surv_status').map(plt.scatter,"Age","axil_nodes").add_legend()
plt.show()

In [5]: # We cannot deduce anything substantial from this distribution, other than that
# most of the data lies between 0 and 10 axil nodes across all operation years,
# for both survival statuses.
sns.FacetGrid(data,col='Surv_status',col_wrap=2).map(plt.scatter,"Op_year","axil_nodes").add_legend()

plt.show()

In [92]: np.random.seed(1234)
# The z-score probability table and norm.cdf give the same answer.
#print(data.Age.loc[data.Age==67],data.Age[data.Age==60])
m=data.Age.mean()
med=data.Age.median()
d=data.Age.std()
print(m,d,med)
print("cdf: ",norm(52.53,10.74).cdf(67))
print("pdf: ",norm(52.53,10.74).pdf(67))
data['zscore']=zscore(data.Age)  # compute the zscore column before using it below
print("calculated z score:",(data.Age[272]-52.53)/10.74)
print("stats z score:",data.zscore[272])
print(data.Age.loc[302],data.Age.loc[222])
print("Zscore for age 77 and 60 is:",data.zscore.loc[302],data.zscore.loc[222])
print("value for z score difference:",0.9890-0.7549)
print("inference: the probability of observing an age between 60 and 77 is 23.40%")
y=round(data.zscore)
data['y']=y
print(data.groupby(['y']).count())

52.5311475409836 10.744024363993269 52.0


cdf: 0.9110581539648359
pdf: 0.014987750926376829
calculated z score: 1.3472998137802605
stats z score: 1.3489014809138864
77 60
Zscore for age 77 and 60 is: 2.2811809997870687 0.696305817702659
value for z score difference: 0.23409999999999997
inference: the probability of observing an age between 60 and 77 is 23.40%
Age Op_year axil_nodes Surv_status zscore
y
-2.0 17 17 17 17 17
-1.0 89 89 89 89 89
-0.0 101 101 101 101 101
1.0 73 73 73 73 73
2.0 24 24 24 24 24
3.0 1 1 1 1 1
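The 23.40% figure above is the difference of two rounded z-table values; the same interval probability can be computed directly from the fitted normal. A sketch assuming the mean and standard deviation printed above:

```python
from scipy.stats import norm

mu, sigma = 52.53, 10.74  # Age mean and std, as used in the cell above
dist = norm(mu, sigma)

# P(60 <= Age <= 77) under the fitted normal is CDF(77) - CDF(60);
# this lands close to the ~23.4% obtained from the rounded table values.
p = dist.cdf(77) - dist.cdf(60)
print(round(p, 4))
```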

In [61]: # Univariate analysis: the Age vector is nearly normally distributed, with
# mean and median both around 52.
#x=data.Age
x=sample_data.Age  # sample_data appears to be a random sample of data drawn in a cell not shown here
sns.distplot(x,kde=True,fit=norm);
plt.show()

x.mean()

Out[61]: 52.9875

In [8]: # Operation year seems to follow a bimodal distribution.
x=data.Op_year
sns.distplot(x,norm_hist=True,fit=norm);
plt.show()

x.mode()

Out[8]: 0 58
dtype: int64

In [9]: # axil_nodes is highly right-skewed with mean ~4, which indicates most of the
# axil node counts in our dataset lie between 0 and 4.
# The difference between mean and median arises from the skewness.
x=data.axil_nodes
sns.distplot(x);
plt.show()
print("mean:",x.mean())
print("median:",x.median())

mean: 4.036065573770492
median: 1.0
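The right skew claimed above can be quantified with scipy's skew statistic. A sketch on hypothetical node counts (not the real column), showing the positive skew and the mean-above-median symptom:

```python
import pandas as pd
from scipy.stats import skew

# Hypothetical counts mimicking axil_nodes: many zeros plus a long right tail.
nodes = pd.Series([0] * 10 + [1, 1, 2, 3, 5, 13, 23, 30])

# A positive skew statistic confirms right skewness; the heavy tail
# pulls the mean above the median, as noted for the real data.
print(skew(nodes))                   # positive -> right-skewed
print(nodes.mean(), nodes.median())  # mean above the median
```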

In [10]: # CDF of Age with a vertical line at age 70: it is almost 90% probable to
# find a value equal to or less than 70.
sns.kdeplot(data.Age, cumulative=True)
plt.axvline(x=70)

plt.show()
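The cumulative KDE smooths the data; the empirical CDF at a point is simply the fraction of observations at or below it, so the "almost 90%" readings can be checked without smoothing. A sketch on hypothetical ages:

```python
import pandas as pd

# Hypothetical ages standing in for data.Age.
ages = pd.Series([30, 34, 38, 45, 52, 58, 63, 66, 70, 77])

# Empirical CDF at 70 = fraction of values <= 70.
p_le_70 = (ages <= 70).mean()
print(p_le_70)  # 0.9 -> 90% of these ages are <= 70
```

On the real column this would be `(data.Age <= 70).mean()`.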

In [11]: # CDF of Op_year with a vertical line at year 68: it is almost 90% probable
# to find a value equal to or less than 68.
sns.kdeplot(data.Op_year, cumulative=True)
plt.axvline(x=68)

plt.show()

# CDF of axil_nodes with a vertical line at 10: it is almost 90% probable to
# find a value equal to or less than 10.
sns.kdeplot(data.axil_nodes, cumulative=True)
plt.axvline(x=10)

plt.show()

In [12]: # There seems to be very little correlation between Surv_status and all the other vectors.
print("Pearson correlation matrix:\n",data.corr(method='pearson'))
print("\n\nSpearman correlation matrix:\n",data.corr(method='spearman'))

Pearson correlation matrix:


Age Op_year axil_nodes Surv_status
Age 1.000000 0.092623 -0.066548 0.064351
Op_year 0.092623 1.000000 -0.003277 -0.004076
axil_nodes -0.066548 -0.003277 1.000000 0.286191
Surv_status 0.064351 -0.004076 0.286191 1.000000

Spearman correlation matrix:


Age Op_year axil_nodes Surv_status
Age 1.000000 0.093534 -0.097884 0.052806
Op_year 0.093534 1.000000 -0.036001 -0.007028
axil_nodes -0.097884 -0.036001 1.000000 0.327468
Surv_status 0.052806 -0.007028 0.327468 1.000000
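Since Surv_status is a binary class label, Pearson correlation against it is effectively a point-biserial correlation, and comparing per-class means is often easier to read. A sketch on a hypothetical miniature of the data:

```python
import pandas as pd

# Hypothetical miniature of the dataset: node counts with class labels.
df = pd.DataFrame({
    "axil_nodes":  [0, 1, 2, 0, 9, 21, 30, 8],
    "Surv_status": [1, 1, 1, 1, 2, 2, 2, 2],
})

# Mean node count per class: class 2 tends to have more positive nodes,
# matching the modest positive correlation reported above.
means = df.groupby("Surv_status")["axil_nodes"].mean()
print(means)  # class 1 -> 0.75, class 2 -> 17.0
```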

In [13]: # The pair plot summarizes all the plots mentioned above.
sns.pairplot(data, hue='Surv_status')
plt.show()
# Using these EDA techniques, I realized there isn't much we can deduce from
# bivariate analysis.
# We can comment on the distributions of individual vectors using PDFs and CDFs
# (comments are in the respective cells).
# The correlation matrices indicate the absence of any strong correlation between
# survival status and the other variables.
