(3.12) Exercise:: Observation

The document provides an analysis of the Haberman Cancer Survival dataset using univariate and bivariate techniques (histograms, scatter plots, box plots, CDF/PDF plots, etc.) to understand the relationship between the features and survival status. The key observations are: 1) The positive axillary nodes feature separates the classes better than the other features, based on pair plots, histograms and box plots. 2) Most data points overlap, so the classes are not linearly separable and classification is difficult. 3) Patients tend to have shorter survival times as the number of positive nodes increases, based on the CDF/PDF plots. 4) Patients with nodes <= 4 have a better survival probability than those with nodes > 4.

Uploaded by

Sam Shankar


(3.12) Exercise:
1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle account to download the data. (https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform a similar analysis as above on this dataset with the following sections:

High-level statistics of the dataset: number of points, number of features, number of classes, data points per class.
Explain our objective.
Perform univariate analysis (PDF, CDF, box plots, violin plots) to understand which features are useful for classification.
Perform bivariate analysis (scatter plots, pair plots) to see if combinations of features are useful in classification.
Write your observations in English as crisply and unambiguously as possible. Always quantify your results.

Haberman Cancer Survival Dataset

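Objective: To predict a patient's survival status after the operation (survived five years or longer, or died within five years) from 3 features: age, year of operation and number of positive axillary nodes. The fourth column is the class label.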

In [4]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

hab_data = pd.read_csv("haberman.csv")
print("********-----NUMBER OF DATA POINTS AND FEATURES-------*********")
print(hab_data.shape)

# Rename the columns; the last column is the class label, the other three are features
col_name = ['AGE', 'YEAR OF OPERATION', 'POSITIVE_AUX_NODES', 'SURVIVAL_STATUS']
hab_data.columns = col_name

print("**********---------NUMBER OF DATA POINTS PER CLASS-------**********")
# Map the numeric labels [1, 2] to ["YES", "NO"] for better readability
hab_data["SURVIVAL_STATUS"] = hab_data["SURVIVAL_STATUS"].map({1: "YES", 2: "NO"})
hab_data["SURVIVAL_STATUS"] = hab_data["SURVIVAL_STATUS"].astype('category')
hab_data["SURVIVAL_STATUS"].value_counts()

********-----NUMBER OF DATA POINTS AND FEATURES-------*********
(306, 4)
**********---------NUMBER OF DATA POINTS PER CLASS-------**********

Out[4]: YES    225
NO      81
Name: SURVIVAL_STATUS, dtype: int64
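
The class counts above can also be expressed as percentages, which makes the imbalance between the two classes explicit. The sketch below rebuilds the label column from the two printed counts (225 and 81) instead of reading haberman.csv; the helper name `class_balance` and the reconstructed list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: (count, percent of all points)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}

# Rebuild the label column from the counts printed in Out[4] (assumption:
# only the counts matter here, not the row order).
labels = ["YES"] * 225 + ["NO"] * 81

balance = class_balance(labels)
for label, (n, pct) in balance.items():
    print(f"{label}: {n} points ({pct:.1f}%)")
```

About three quarters of the points belong to the YES class, so the dataset is imbalanced.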

In [52]: sns.set_style("whitegrid")
g = sns.pairplot(hab_data, hue='SURVIVAL_STATUS', height=3)
# Title the whole grid; plt.title would only label the last subplot
g.fig.suptitle("PAIR PLOTS", y=1.02)
plt.show()

OBSERVATION

1. Among all the pair plots, the plot of YEAR OF OPERATION against POSITIVE_AUX_NODES separates the two classes best.

2. The classes are not linearly separable, because most of the points overlap with each other.

3. It is difficult to build an accurate model with simple if-else conditions, because most of the points overlap.

2-D SCATTER PLOTS

In [39]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data, hue="SURVIVAL_STATUS", height=4) \
    .map(plt.scatter, "YEAR OF OPERATION", "POSITIVE_AUX_NODES") \
    .add_legend()
plt.title("2-D SCATTER PLOTS")
plt.show()

OBSERVATION

1. The data is very difficult to analyse from this plot, because most of the points overlap with each other.

2. The classes are not linearly separable, so building a model requires more specific techniques.

HISTOGRAMS
In [18]: sns.set_style("whitegrid")
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws the same histogram-plus-density view
sns.FacetGrid(hab_data, hue="SURVIVAL_STATUS", height=8) \
    .map(sns.histplot, "AGE", kde=True, stat="density") \
    .add_legend()
plt.title("AGE HISTOGRAM")
plt.show()

In [19]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data, hue="SURVIVAL_STATUS", height=10) \
    .map(sns.histplot, "YEAR OF OPERATION", kde=True, stat="density") \
    .add_legend()
plt.title("YEAR OF OPERATION HISTOGRAM")
plt.show()

In [20]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data, hue="SURVIVAL_STATUS", height=10) \
    .map(sns.histplot, "POSITIVE_AUX_NODES", kde=True, stat="density") \
    .add_legend()
plt.title("POSITIVE_AUX_NODES HISTOGRAM")
plt.show()

OBSERVATION
1. PLOT 1:

The AGE distributions of the two classes overlap almost completely, so AGE alone is not a useful feature for classification.
Because the data points overlap, the density curves can be misleading.

2. PLOT 2:

Compared with plot 1, plot 2 (YEAR OF OPERATION) separates the classes even worse and is harder to analyse.

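3. PLOT 3:

Plot 3 (POSITIVE_AUX_NODES) gives the best separation of the three. The points are still difficult to categorise, but the classes are separable up to about x = 5 and y = 0.13.
A model built from simple if-else statements on this feature will still misclassify many points. Always predicting YES is already right for 225 of 306 patients (about 73.5%), which is the baseline such a rule must beat.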
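
To quantify how well a single if-else rule on the node count works, its accuracy can be measured directly rather than estimated. The following is a minimal sketch: the function names `predict_by_threshold` and `accuracy`, the threshold of 4 and the tiny sample are all hypothetical, so the printed number says nothing about the real dataset.

```python
def predict_by_threshold(nodes, threshold):
    """Predict "YES" (survived) when the node count is at or below the threshold."""
    return ["YES" if n <= threshold else "NO" for n in nodes]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical sample, NOT taken from haberman.csv.
sample_nodes = [0, 1, 0, 3, 7, 12, 2, 25, 0, 5]
sample_status = ["YES", "YES", "YES", "NO", "NO", "NO", "YES", "NO", "NO", "YES"]

predicted = predict_by_threshold(sample_nodes, threshold=4)
print(accuracy(predicted, sample_status))
```

Applied to the real columns (`hab_data["POSITIVE_AUX_NODES"]` and `hab_data["SURVIVAL_STATUS"]`), the same two functions give a measured accuracy for any candidate threshold.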

CDF and PDF

In [53]: import numpy as np
sns.set_style("whitegrid")
hab_data_survived = hab_data.loc[hab_data["SURVIVAL_STATUS"] == "YES"]
hab_data_not_survived = hab_data.loc[hab_data["SURVIVAL_STATUS"] == "NO"]

# Survived: per-bin fraction of patients (PDF) and its running sum (CDF)
counts, bin_edges = np.histogram(hab_data_survived['POSITIVE_AUX_NODES'], bins=10, density=True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF, YES")
plt.plot(bin_edges[1:], cdf, label="CDF, YES")

# Not survived: same computation
counts, bin_edges = np.histogram(hab_data_not_survived['POSITIVE_AUX_NODES'], bins=10, density=True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF, NO")
plt.plot(bin_edges[1:], cdf, label="CDF, NO")

# Add the legend only after all four curves are drawn, so each one is labelled
plt.title("PDF and CDF of POSITIVE_AUX_NODES")
plt.legend(title="SURVIVAL_STATUS", loc="right")
plt.show()

[0.83555556 0.08       0.02222222 0.02666667 0.01777778 0.00444444
 0.00888889 0.         0.         0.00444444]
[ 0.   4.6  9.2 13.8 18.4 23.  27.6 32.2 36.8 41.4 46. ]
[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
 0.01234568 0.         0.         0.01234568]
[ 0.   5.2 10.4 15.6 20.8 26.  31.2 36.4 41.6 46.8 52. ]

OBSERVATION
1. About 84% of the patients who survived five years or longer had at most 4 positive nodes (first bin, 0 to 4.6), compared with about 57% of those who did not survive (first bin, 0 to 5.2).

2. In both classes the CDF reaches 100% only beyond 40 nodes; very few patients (about 0.4% of survivors and 1.2% of non-survivors) had more than 40 nodes.

CONCLUSION:

Patients with more positive nodes tend to have shorter survival.
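
The CDF values behind these observations follow directly from the printed PDF arrays. As a sketch, the code below copies the two PDF arrays and the upper bin edges from the output above and accumulates them; the variable names are illustrative.

```python
from itertools import accumulate

# Per-bin fractions and upper bin edges copied from the printed output.
pdf_survived = [0.83555556, 0.08, 0.02222222, 0.02666667, 0.01777778,
                0.00444444, 0.00888889, 0.0, 0.0, 0.00444444]
edges_survived = [4.6, 9.2, 13.8, 18.4, 23.0, 27.6, 32.2, 36.8, 41.4, 46.0]

pdf_not_survived = [0.56790123, 0.14814815, 0.13580247, 0.04938272, 0.07407407,
                    0.0, 0.01234568, 0.0, 0.0, 0.01234568]
edges_not_survived = [5.2, 10.4, 15.6, 20.8, 26.0, 31.2, 36.4, 41.6, 46.8, 52.0]

# The CDF at a bin's upper edge is the running sum of the PDF up to that bin.
cdf_survived = list(accumulate(pdf_survived))
cdf_not_survived = list(accumulate(pdf_not_survived))

for edge, value in zip(edges_survived, cdf_survived):
    print(f"survived,     nodes <= {edge:4.1f}: {value:.3f}")
for edge, value in zip(edges_not_survived, cdf_not_survived):
    print(f"not survived, nodes <= {edge:4.1f}: {value:.3f}")
```

The first entries show the gap between the classes in the lowest bin, and both running sums reach 1.0 at the last edge.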

MEDIAN,MEAN,STD-DEV
In [21]: print("MEANS")
print(np.mean(hab_data_survived["POSITIVE_AUX_NODES"]))
# Mean after appending one artificial outlier (225), to show how sensitive the mean is
print(np.mean(np.append(hab_data_survived["POSITIVE_AUX_NODES"], 225)))
print(np.mean(hab_data_not_survived["POSITIVE_AUX_NODES"]))
print("-----------")
print("STD-DEV")
print(np.std(hab_data_survived["POSITIVE_AUX_NODES"]))
print(np.std(hab_data_not_survived["POSITIVE_AUX_NODES"]))
print("-----------")
print("MEDIAN")
print(np.median(hab_data_survived["POSITIVE_AUX_NODES"]))
print(np.median(hab_data_not_survived["POSITIVE_AUX_NODES"]))

MEANS
2.7911111111111113
3.774336283185841
7.45679012345679
-----------
STD-DEV
5.857258449412131
9.128776076761632
-----------
MEDIAN
0.0
4.0

OBSERVATION
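1. Patients who did not survive five years had more positive nodes on average (mean 7.46) than those who survived (mean 2.79), so the probability of surviving beyond 5 years is lower when there are more nodes.

2. From the std-dev, the node counts of patients who did not survive 5 years are spread more widely (9.13) than those of patients who survived beyond 5 years (5.86).

3. From the median, a typical patient with a shorter survival period had 4 positive nodes, while a typical long-term survivor had zero.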
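
The second mean printed above (3.77) comes from appending a single value of 225 to the survivors' node counts. The sketch below repeats that experiment on a small hypothetical list to show why the median is the more robust summary; the data and the name `summarize` are assumptions for illustration only.

```python
from statistics import mean, median, pstdev

def summarize(values):
    """Return (mean, median, population std-dev), mirroring np.mean, np.median and np.std."""
    return mean(values), median(values), pstdev(values)

# Hypothetical node counts, NOT taken from haberman.csv.
nodes = [0, 0, 0, 1, 2, 3, 4, 8]
with_outlier = nodes + [225]  # one extreme value, as in the notebook's experiment

print("without outlier:", summarize(nodes))
print("with outlier:   ", summarize(with_outlier))
```

In this toy example the mean jumps from 2.25 to 27 while the median only moves from 1.5 to 2.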

QUANTILES,PERCENTILES
In [3]: print("QUANTILES")
print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
print("-----------")
print("PERCENTILES")
print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],90))
print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],90))

QUANTILES
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-399fc1c3d17a> in <module>
      1 print("QUANTILES")
----> 2 print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
      3 print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
      4 print("-----------")
      5 print("PERCENTILES")

NameError: name 'np' is not defined
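
The NameError above means the cell ran in a session where NumPy had not yet been imported as `np`. A minimal sketch of the same computation with the import in place is shown below; the two short arrays are hypothetical stand-ins for the POSITIVE_AUX_NODES columns, so the printed values are not the dataset's quantiles.

```python
import numpy as np  # the missing import that caused the NameError

# Hypothetical stand-ins for the two POSITIVE_AUX_NODES columns.
survived_nodes = np.array([0, 0, 0, 1, 2, 3, 4, 8])
not_survived_nodes = np.array([0, 1, 3, 4, 6, 11, 13, 23])

quartile_points = np.arange(0, 100, 25)  # 0th, 25th, 50th and 75th percentiles

print("QUANTILES")
print(np.percentile(survived_nodes, quartile_points))
print(np.percentile(not_survived_nodes, quartile_points))
print("-----------")
print("PERCENTILES")
print(np.percentile(survived_nodes, 90))
print(np.percentile(not_survived_nodes, 90))
```

In the notebook itself, re-running the cell after the one that executes `import numpy as np` should produce the quantiles instead of the traceback.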

OBSERVATION
1. 50% of the patients who survived had zero positive nodes, and 25% of them had more than 3 nodes.

2. Among the patients who did not survive, 25% had at most 1 node, 50% had at most 4 nodes, and the 75th percentile lies above 10 nodes.

3. 90% of the patients who survived had fewer than 8 nodes; patients with between 8 and 20 nodes have a lower chance of surviving.

BOX PLOTS
In [38]: sns.boxplot(x='SURVIVAL_STATUS', y='POSITIVE_AUX_NODES', hue='SURVIVAL_STATUS', data=hab_data)
plt.title("POSITIVE_AUX_NODES")
plt.show()

sns.boxplot(x='SURVIVAL_STATUS', y='AGE', hue='SURVIVAL_STATUS', data=hab_data)
plt.title("AGE")
plt.show()

sns.boxplot(x='SURVIVAL_STATUS', y='YEAR OF OPERATION', hue='SURVIVAL_STATUS', data=hab_data)
plt.title("YEAR OF OPERATION")
plt.show()

OBSERVATION
1. In the POSITIVE_AUX_NODES plot, the 25th and 50th percentiles nearly coincide for patients who survived more than five years, and both lie between 0 and 5 nodes. The whiskers for this class span roughly 0 to 10 nodes.

2. For patients who did not survive, the box is much taller: the 25th percentile is about 1 or 2 nodes and the 75th percentile is about 12 nodes (the median computed earlier is 4). The whiskers for this class span roughly 0 to 25 nodes.

3. The box plots of AGE and YEAR OF OPERATION do not separate the two classes well, so they are of little use for the analysis.
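
The percentiles and whisker ranges read off the box plots can be computed rather than estimated by eye. The sketch below assumes the common 1.5 × IQR whisker rule (the default for seaborn and matplotlib box plots) and uses a hypothetical array; `box_stats` is an illustrative name, not a library function.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Return quartiles and whisker ends as a box plot with a whis * IQR rule would draw them."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers end at the most extreme data points still inside the fences.
    lower = values[values >= q1 - whis * iqr].min()
    upper = values[values <= q3 + whis * iqr].max()
    return {"q1": q1, "median": med, "q3": q3,
            "whisker_low": lower, "whisker_high": upper}

# Hypothetical node counts, NOT taken from haberman.csv.
nodes = [0, 0, 1, 2, 4, 6, 11, 13, 23, 52]
stats = box_stats(nodes)
print(stats)
```

In this toy sample the value 52 lies beyond the upper fence, so a box plot would draw it as a separate outlier point rather than extend the whisker to it.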

VIOLIN PLOTS
In [40]: sns.violinplot(x='SURVIVAL_STATUS', y='POSITIVE_AUX_NODES', data=hab_data, hue='SURVIVAL_STATUS')
plt.title("POSITIVE_AUX_NODES")
plt.show()

sns.violinplot(x='SURVIVAL_STATUS', y='AGE', data=hab_data, hue='SURVIVAL_STATUS')
plt.title("AGE")
plt.show()

sns.violinplot(x='SURVIVAL_STATUS', y='YEAR OF OPERATION', data=hab_data, hue='SURVIVAL_STATUS')
plt.title("YEAR OF OPERATION")
plt.show()

OBSERVATION
1. In the POSITIVE_AUX_NODES plot, the violin for patients who survived more than 5 years is widest (densest) at 0 nodes, and its whiskers run from 0 to 7 nodes.
2. For patients who did not survive, there is more density around 12 nodes, and the whiskers run from 0 to 25 nodes.

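In [36]: g = sns.jointplot(x="YEAR OF OPERATION", y="POSITIVE_AUX_NODES", data=hab_data, kind="kde")
# Title the whole figure; plt.title would only label the last (marginal) axes
g.fig.suptitle("YEAR OF OPERATION vs POSITIVE_AUX_NODES", x=0.05, ha="left")
g.ax_joint.grid(True)
plt.show()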

SUMMARY OF THE ANALYSIS

1. Survival status depends on the number of positive nodes: the more nodes a patient has, the lower the chance of survival.

2. Age shows no clear effect on the survival status of the individual.

3. Patients treated after 1966 have a slightly higher chance of surviving than the rest, and patients treated before 1959 have a slightly lower chance (from the box and violin plots).
