
(3.12) Exercise:: Observation

The document provides analysis of the Haberman Cancer Survival dataset using univariate and bivariate analysis techniques like histograms, scatter plots, box plots, CDF/PDF plots etc. to understand the relationship between features and the survival status. The key observations are: 1) Positive auxiliary nodes feature best separates the classes compared to other features based on pair plots, histograms and box plots. 2) Most data points overlap making the classification non-linearly separable and difficult. 3) Patients have shorter survival time with increasing number of positive nodes based on CDF/PDF plots. 4) Patients with nodes <=4 have better survival probability compared to those with nodes >4. The document concludes with positive auxiliary

Uploaded by

Sam Shankar

(3.12) Exercise:
1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle
account to download the data. (https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform an analysis similar to the one above on this dataset, with the following sections:

High-level statistics of the dataset: number of points, number of features, number of classes,
data points per class.
Explain our objective.
Perform univariate analysis (PDF, CDF, box plots, violin plots) to understand which features are
useful for classification.
Perform bivariate analysis (scatter plots, pair plots) to see if combinations of features are
useful in classification.
Write your observations in English as crisply and unambiguously as possible. Always quantify
your results.

Haberman Cancer Survival Dataset


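Objective:: To predict a patient's survival status after the operation from 3 features (AGE, YEAR OF OPERATION, POSITIVE_AUX_NODES). The fourth column, SURVIVAL_STATUS, is the class label.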

In [4]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

hab_data=pd.read_csv("haberman.csv")
print("********-----NUMBER OF DATA POINTS AND FEATURES-------*********")
print(hab_data.shape)
col_name=['AGE','YEAR OF OPERATION','POSITIVE_AUX_NODES','SURVIVAL_STATUS']
hab_data.columns=col_name
print("**********---------NUMBER OF DATA POINTS PER CLASS-------**********")
# map the numeric labels [1,2] to [YES,NO] for better understanding
hab_data["SURVIVAL_STATUS"]=hab_data["SURVIVAL_STATUS"].map({1:"YES",2:"NO"})
hab_data["SURVIVAL_STATUS"]=hab_data["SURVIVAL_STATUS"].astype('category')
hab_data["SURVIVAL_STATUS"].value_counts()

********-----NUMBER OF DATA POINTS AND FEATURES-------*********
(306, 4)
**********---------NUMBER OF DATA POINTS PER CLASS-------**********

Out[4]: YES    225
NO      81
Name: SURVIVAL_STATUS, dtype: int64
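
To quantify the class imbalance, the counts printed above can be turned into shares. The sketch below only reuses the two counts from the output; the variable names are illustrative and not part of the original notebook.

```python
# Class counts as printed by value_counts() above.
class_counts = {"YES": 225, "NO": 81}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# A classifier that always predicts the majority class is right this often.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_label]

print(total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority baseline ({majority_label}): {baseline_accuracy:.1%}")
```

The dataset is imbalanced: roughly three patients in four belong to the YES class, so any model should be compared against that baseline rather than against 50%.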

In [52]: sns.set_style("whitegrid")
g=sns.pairplot(hab_data,hue='SURVIVAL_STATUS',height=3)
# plt.title would label only the last subplot, so title the whole figure instead
g.fig.suptitle("PAIR PLOTS",y=1.02)
plt.show()

OBSERVATION

1. Among all the pair plots, the plot of YEAR OF OPERATION against POSITIVE_AUX_NODES separates the
classes best.

2. The classes are not linearly separable, as most of the points overlap with each other.

3. Because most of the points overlap, it is difficult to build an accurate model with simple
if-else conditions.

2-D SCATTER PLOTS


In [39]: sns.set_style("whitegrid");
sns.FacetGrid(hab_data, hue="SURVIVAL_STATUS", height=4) \
.map(plt.scatter, "YEAR OF OPERATION", "POSITIVE_AUX_NODES") \
.add_legend();
plt.title("2-D SCATTER PLOTS")
plt.show();

OBSERVATION

1. Analysis of the data is very difficult, as most of the points overlap with each other.

2. The data is not linearly separable, so building a model requires specific techniques.

HISTOGRAMS
In [18]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data,hue="SURVIVAL_STATUS",height=8)\
.map(sns.distplot,"AGE")\
.add_legend();
plt.title("AGE HISTOGRAM")
plt.show()

In [19]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data,hue="SURVIVAL_STATUS",height=10)\
.map(sns.distplot,"YEAR OF OPERATION")\
.add_legend();
plt.title(" YEAR OF OPERATION HISTOGRAM")
plt.show()

In [20]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data,hue="SURVIVAL_STATUS",height=10)\
.map(sns.distplot,"POSITIVE_AUX_NODES")\
.add_legend();
plt.title("POSITIVE_AUX_NODES HISTOGRAM")
plt.show()

OBSERVATION
1.PLOT 1:

The AGE distributions of the two classes overlap almost entirely, so AGE on its own is not a good
feature for classification.
The data points overlap with each other and the density curves can be misleading.

2.PLOT 2:

Compared to plot 1, plot 2 (YEAR OF OPERATION) separates the classes even less and is harder to analyse.

3.PLOT 3:

Plot 3 (POSITIVE_AUX_NODES) is the best of the three. The classes still overlap, but they can be
partly separated up to about x=5 and y=0.13.
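A model built from simple if-else statements on this feature would still misclassify many points. For reference, always predicting YES is already correct for 225 of the 306 patients (about 73.5%).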
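
A minimal sketch of how such a simple if-else rule could be scored follows. The function `threshold_rule_accuracy` and the toy sample are hypothetical and are not taken from haberman.csv; with the real data, the POSITIVE_AUX_NODES and SURVIVAL_STATUS columns would be passed in instead.

```python
def threshold_rule_accuracy(nodes, labels, threshold):
    """Accuracy of the rule: predict YES when nodes <= threshold, else NO."""
    predictions = ["YES" if n <= threshold else "NO" for n in nodes]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical toy sample, NOT taken from haberman.csv.
toy_nodes = [0, 0, 1, 2, 3, 7, 0, 4, 9, 13, 22, 1]
toy_labels = ["YES", "YES", "YES", "YES", "YES", "YES",
              "NO", "NO", "NO", "NO", "NO", "YES"]

# Try a few candidate thresholds and report the accuracy of each rule.
for threshold in (2, 4, 8):
    accuracy = threshold_rule_accuracy(toy_nodes, toy_labels, threshold)
    print(threshold, round(accuracy, 3))
```

Scoring several thresholds this way turns the visual impression from the histogram into a number that can be compared against the majority-class share.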

CDF and PDF


In [53]: import numpy as np
sns.set_style("whitegrid")
hab_data_survived = hab_data.loc[hab_data["SURVIVAL_STATUS"] == "YES"]
hab_data_not_survived = hab_data.loc[hab_data["SURVIVAL_STATUS"] == "NO"]

# PDF and CDF of POSITIVE_AUX_NODES for patients who survived
counts, bin_edges = np.histogram(hab_data_survived['POSITIVE_AUX_NODES'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF (YES)")
plt.plot(bin_edges[1:], cdf, label="CDF (YES)")

# PDF and CDF of POSITIVE_AUX_NODES for patients who did not survive
counts, bin_edges = np.histogram(hab_data_not_survived['POSITIVE_AUX_NODES'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF (NO)")
plt.plot(bin_edges[1:], cdf, label="CDF (NO)")

# add the title and legend only after all four curves have been drawn
plt.title("PDF and CDF of POSITIVE_AUX_NODES")
plt.legend(title="SURVIVAL_STATUS",loc="right")
plt.show()

[0.83555556 0.08 0.02222222 0.02666667 0.01777778 0.00444444
0.00888889 0. 0. 0.00444444]
[ 0. 4.6 9.2 13.8 18.4 23. 27.6 32.2 36.8 41.4 46. ]
[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
0.01234568 0. 0. 0.01234568]
[ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]

OBSERVATION
1. About 84% of the patients who survived had at most 4 positive nodes (first bin, 0 to 4.6),
compared with about 57% of the patients who did not survive (first bin, 0 to 5.2).

2. In both classes, patients with more than 40 nodes are rare: the last bin holds about 0.4% of the
patients who survived and about 1.2% of those who did not.

CONCLUSION:

Hence, patients with more positive nodes are less likely to survive.
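
The CDF values behind these observations can be reproduced from the printed PDF alone. The sketch below copies the YES-class arrays from the output above and accumulates them. It uses only the standard library, and the variable names are illustrative.

```python
from itertools import accumulate

# Per-bin probabilities and bin edges for the YES class, copied from the output above.
pdf_yes = [0.83555556, 0.08, 0.02222222, 0.02666667, 0.01777778,
           0.00444444, 0.00888889, 0.0, 0.0, 0.00444444]
edges_yes = [0.0, 4.6, 9.2, 13.8, 18.4, 23.0, 27.6, 32.2, 36.8, 41.4, 46.0]

# The CDF at the right edge of each bin is the running sum of the PDF.
cdf_yes = list(accumulate(pdf_yes))

for right_edge, value in zip(edges_yes[1:], cdf_yes):
    print(f"nodes <= {right_edge:4.1f}: {value:.3f}")
```

Reading the first two rows gives the share of surviving patients with at most 4.6 and at most 9.2 nodes. The same steps apply to the NO-class arrays.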

MEDIAN,MEAN,STD-DEV
In [21]: print("MEANS")
print(np.mean(hab_data_survived["POSITIVE_AUX_NODES"]))
# append a single outlier (225) to show how strongly one value shifts the mean
print(np.mean(np.append(hab_data_survived["POSITIVE_AUX_NODES"],225)))
print(np.mean(hab_data_not_survived["POSITIVE_AUX_NODES"]))
print("-----------")
print("STD-DEV")
print(np.std(hab_data_survived["POSITIVE_AUX_NODES"]))
print(np.std(hab_data_not_survived["POSITIVE_AUX_NODES"]))
print("-----------")
print("MEDIAN")
print(np.median(hab_data_survived["POSITIVE_AUX_NODES"]))
print(np.median(hab_data_not_survived["POSITIVE_AUX_NODES"]))

MEANS
2.7911111111111113
3.774336283185841
7.45679012345679
-----------
STD-DEV
5.857258449412131
9.128776076761632
-----------
MEDIAN
0.0
4.0

OBSERVATION
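1. From the means, patients who did not survive 5 years had more positive nodes on average (7.46)
than patients who survived (2.79). Appending a single outlier of 225 raises the survivors' mean to
3.77, so the mean is sensitive to outliers.

2. From the std-dev, the spread of node counts is wider for patients who did not survive 5 years
(9.13) than for those who survived beyond 5 years (5.86).

3. From the medians, half of the patients who survived had zero positive nodes, while half of the
patients who did not survive had 4 or more.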
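
The second mean in the output can be checked by hand from the other printed values. The sketch below assumes only the YES count (225) and the printed mean; the variable names are illustrative.

```python
n_survived = 225                      # YES count from value_counts()
mean_survived = 2.7911111111111113    # printed mean of POSITIVE_AUX_NODES

# Recover the total node count; it must be a whole number.
total_nodes = round(mean_survived * n_survived)

# Appending one value of 225 adds it to the sum and 1 to the count.
outlier = 225
mean_with_outlier = (total_nodes + outlier) / (n_survived + 1)

print(total_nodes)
print(mean_with_outlier)
```

One extra point moves the mean by almost a full node, while the median of this class would stay at 0 because more than half of the values are zero.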

QUANTILES,PERCENTILES
In [3]: import numpy as np  # the recorded run omitted this import and raised NameError: name 'np' is not defined
print("QUANTILES")
# np.arange(0,100,25) requests the 0th, 25th, 50th and 75th percentiles
print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
print("-----------")
print("PERCENTILES")
print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],90))
print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],90))

OBSERVATION
1. 50% of the patients who survived had zero positive nodes, and 25% of the patients who survived
had more than 3 nodes.

2. Among the patients who did not survive, 25% had at most 1 node, 50% had at most 4 nodes, and the
75th percentile is above 10 nodes.

3. 90% of the patients who survived had at most 8 nodes; patients with between 8 and 20 nodes have
a lower chance of surviving.
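
To make clear what a percentile call computes, here is a minimal pure-Python sketch of a percentile with linear interpolation between the closest ranks, which is the default behaviour of `np.percentile`. The helper `percentile` and the toy node counts are hypothetical and are not taken from haberman.csv.

```python
import math

def percentile(values, p):
    """p-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical node counts, NOT taken from haberman.csv.
toy_nodes = [0, 0, 0, 1, 2, 3, 8, 20]

# Same percentile points as np.arange(0,100,25): the stop value 100 is excluded.
quartile_points = list(range(0, 100, 25))
print(quartile_points)
print([percentile(toy_nodes, p) for p in quartile_points])
print(percentile(toy_nodes, 90))
```

Because the range stops before 100, the maximum is not reported. Requesting `range(0, 101, 25)` would include it.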

BOX PLOTS
In [38]: sns.boxplot(x='SURVIVAL_STATUS',y='POSITIVE_AUX_NODES',hue='SURVIVAL_STATUS',data=hab_data)
plt.title("POSITIVE_AUX_NODES")
plt.show()

sns.boxplot(x='SURVIVAL_STATUS',y='AGE',hue='SURVIVAL_STATUS',data=hab_data)
plt.title("AGE")
plt.show()

sns.boxplot(x='SURVIVAL_STATUS',y='YEAR OF OPERATION',data=hab_data,hue='SURVIVAL_STATUS')
plt.title("YEAR OF OPERATION")
plt.show()

OBSERVATION
1. In the POSITIVE_AUX_NODES plot, the 25th and 50th percentiles are nearly the same for patients
who survived more than five years, and the box lies between 0 and 5 nodes. The whiskers for
survival span roughly 0 to 10 nodes.

2. For patients who did not survive, the 25th percentile is 1 or 2 nodes and the 75th percentile is
about 12, so the upper part of the box lies above 10 nodes. The whiskers span roughly 0 to 25 nodes.

3. The box plots of AGE and YEAR OF OPERATION do not separate the classes well.
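
The whisker limits in a box plot follow from the quartiles. By default seaborn draws whiskers to the most extreme data points within 1.5 times the interquartile range of the box. The sketch below applies that rule to the YES class; the helper `tukey_fences` is hypothetical, the 75th percentile of 3 is read from the quantile observations, and the 25th percentile of 0 is an assumption implied by the median being 0.

```python
def tukey_fences(q1, q3, whis=1.5):
    """Lower and upper limits beyond which points are drawn as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# YES class: q3 = 3 from the quantile observations; q1 = 0 is an assumption
# that follows from the median of this class being 0.
lower, upper = tukey_fences(q1=0, q3=3)
print(lower, upper)
```

Node counts are non-negative integers, so with an upper fence of 7.5 the whisker for the YES class can reach at most 7 nodes, and larger counts are drawn as outliers.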

VIOLIN PLOTS
In [40]: sns.violinplot(x='SURVIVAL_STATUS',y='POSITIVE_AUX_NODES', data=hab_data,hue='SURVIVAL_STATUS')
plt.title("POSITIVE_AUX_NODES")
plt.show()

sns.violinplot(x='SURVIVAL_STATUS',y='AGE', data=hab_data,hue='SURVIVAL_STATUS')
plt.title("AGE")
plt.show()

sns.violinplot(x='SURVIVAL_STATUS',y='YEAR OF OPERATION', data=hab_data,hue='SURVIVAL_STATUS')
plt.title("YEAR OF OPERATION")
plt.show()

OBSERVATION
1. In the POSITIVE_AUX_NODES plot, the density for patients who survived more than 5 years is widest at 0 nodes. The whiskers run from 0 to 7 nodes.
2. For patients who did not survive, the density remains high up to about 12 nodes, and the whiskers run from 0 to 25 nodes.

In [36]: sns.jointplot(x="YEAR OF OPERATION",y="POSITIVE_AUX_NODES",data=hab_data,kind="kde")
plt.title("YEAR OF OPERATION vs POSITIVE_AUX_NODES",loc="left")
plt.grid()
plt.show()

SUMMARY OF THE ANALYSIS

1. Survival status depends on the number of positive nodes: as the number of nodes increases, the
chance of survival decreases.

2. Age does not show a clear effect on the survival status of the individual.

3. Patients treated after 1966 have a slightly higher chance of surviving than the rest, and
patients treated before 1959 have a slightly lower chance (from the box and violin plots).
