(3.12) Exercise:
1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle
account to download the data. (https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform a similar analysis as above on this dataset with the following sections:
High-level statistics of the dataset: number of points, number of features, number of classes,
data points per class.
Explain our objective.
Perform univariate analysis (PDF, CDF, box plots, violin plots) to understand which features are
useful for classification.
Perform bivariate analysis (scatter plots, pair plots) to see whether combinations of features are
useful in classification.
Write your observations in English as crisply and unambiguously as possible. Always quantify
your results.
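The high-level statistics asked for in the first item can be sketched as follows. The tiny DataFrame is made-up illustrative data, not the Haberman dataset; its column names mirror the ones used in the cells of this notebook, and the 1/2 coding of the class label is an assumption.

```python
import pandas as pd

# Toy stand-in for hab_data (illustrative values only, NOT the real dataset).
toy = pd.DataFrame({
    "AGE": [30, 34, 38, 42, 47, 52, 61, 65],
    "YEAR OF OPERATION": [64, 60, 59, 62, 66, 68, 58, 67],
    "POSITIVE_AUX_NODES": [1, 0, 10, 0, 23, 4, 0, 13],
    "SURVIVAL_STATUS": [1, 1, 2, 1, 2, 1, 1, 2],  # assumed coding: 1 = survived, 2 = did not
})

n_points = toy.shape[0]
n_features = toy.shape[1] - 1                     # every column except the class label
class_counts = toy["SURVIVAL_STATUS"].value_counts()
n_classes = class_counts.size

print("points:", n_points)
print("features:", n_features)
print("classes:", n_classes)
print("points per class:")
print(class_counts)
```

Reporting the per-class counts up front also shows whether the dataset is balanced, which matters when judging any accuracy figure later on.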
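In [52]: sns.set_style("whitegrid")
# Pair plot of every feature pair, coloured by survival status
g = sns.pairplot(hab_data,hue='SURVIVAL_STATUS',height=3)
# suptitle labels the whole grid; plt.title would only label the last subplot
g.fig.suptitle("PAIR PLOTS",y=1.02)
plt.show()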
OBSERVATION
1. Among all the plots, the plot of YEAR OF OPERATION against POSITIVE_AUX_NODES separates the
classes best.
2. The points are not linearly separable, as most of them overlap with each other.
3. It is difficult to build an accurate model from simple if-else conditions because most of the
points overlap.
OBSERVATION
1. Analysis of the data is difficult because most of the points overlap with each other.
HISTOGRAMS
In [18]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data,hue="SURVIVAL_STATUS",height=8)\
.map(sns.distplot,"AGE")\
.add_legend();
plt.title("AGE HISTOGRAM")
plt.show()
In [19]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data,hue="SURVIVAL_STATUS",height=10)\
.map(sns.distplot,"YEAR OF OPERATION")\
.add_legend();
plt.title("YEAR OF OPERATION HISTOGRAM")
plt.show()
In [20]: sns.set_style("whitegrid")
sns.FacetGrid(hab_data,hue="SURVIVAL_STATUS",height=10)\
.map(sns.distplot,"POSITIVE_AUX_NODES")\
.add_legend();
plt.title("POSITIVE_AUX_NODES HISTOGRAM")
plt.show()
OBSERVATION
1. PLOT 1 (AGE):
The distributions of the two classes overlap almost completely, so AGE is not a good feature for
classification.
Because the data points overlap, the fitted curves can be misleading.
2. PLOT 2 (YEAR OF OPERATION):
Compared to Plot 1, Plot 2 separates the classes even worse and is harder to analyse.
3. PLOT 3 (POSITIVE_AUX_NODES):
Plot 3 is the best of the three. It is still difficult to categorise the points, but the classes can
be told apart up to about x=5 and y=0.13 (approx.).
A model built from simple if-else statements on this feature would still misclassify many points
because of the overlap.
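One way to put a number on how well a simple if-else rule performs is to score it against the labels and compare it with always predicting the majority class. The following is a minimal sketch on made-up data: the arrays are illustrative (not the real dataset), the 1/2 label coding is an assumption, and the threshold of 5 nodes is only the value read off Plot 3.

```python
import numpy as np

# Illustrative values only (not the Haberman data).
nodes = np.array([0, 0, 1, 2, 3, 0, 8, 12, 1, 25, 4, 0])
# Assumed coding: 1 = survived 5 years or longer, 2 = did not.
status = np.array([1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 2])

def predict(n_nodes, threshold=5):
    """If-else rule: few positive nodes -> predict survival (1), otherwise 2."""
    if n_nodes <= threshold:
        return 1
    return 2

predicted = np.array([predict(n) for n in nodes])
rule_accuracy = np.mean(predicted == status)

# Baseline: always predict the most frequent class.
labels, counts = np.unique(status, return_counts=True)
baseline_accuracy = counts.max() / status.size

print(f"if-else rule accuracy: {rule_accuracy:.2f}")
print(f"majority baseline:     {baseline_accuracy:.2f}")
```

Comparing against the majority baseline is useful on an imbalanced dataset, where a rule can look accurate simply by favouring the larger class.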
OBSERVATION
1. About 85% of the patients in the longer-survival class have 4 or fewer positive nodes
(nodes <= 4).
2. Across both classes of patients, all those (100%) with more than 40 nodes have shorter
survival.
CONCLUSION:
Hence, patients with more positive nodes tend to have shorter survival.
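Statements of the form "X% of patients have at most k nodes" are read off the empirical CDF, which is simply the fraction of values less than or equal to k. A minimal sketch with made-up node counts (not the real data); the helper name `ecdf_at` is hypothetical:

```python
import numpy as np

# Illustrative node counts only (not the Haberman data).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 15])

def ecdf_at(values, k):
    """Fraction of values that are <= k (empirical CDF evaluated at k)."""
    return np.mean(values <= k)

for k in (0, 4, 10):
    print(f"P(nodes <= {k}) = {ecdf_at(nodes, k):.2f}")
```

Applied separately to each class, the same function would give the per-class percentages quoted in the observations.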
MEDIAN,MEAN,STD-DEV
In [21]: print("MEANS")
print(np.mean(hab_data_survived["POSITIVE_AUX_NODES"]))
# Mean again after appending one extreme value (225), to show how a single outlier shifts it
print(np.mean(np.append(hab_data_survived["POSITIVE_AUX_NODES"],225)))
print(np.mean(hab_data_not_survived["POSITIVE_AUX_NODES"]))
print("-----------")
print("STD-DEV")
# np.std computes the population standard deviation by default (ddof=0)
print(np.std(hab_data_survived["POSITIVE_AUX_NODES"]))
print(np.std(hab_data_not_survived["POSITIVE_AUX_NODES"]))
print("-----------")
print("MEDIAN")
print(np.median(hab_data_survived["POSITIVE_AUX_NODES"]))
print(np.median(hab_data_not_survived["POSITIVE_AUX_NODES"]))
MEANS
2.7911111111111113
3.774336283185841
7.45679012345679
-----------
STD-DEV
5.857258449412131
9.128776076761632
-----------
MEDIAN
0.0
4.0
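OBSERVATION
1. Survival beyond 5 years is less likely with more positive nodes: the mean is 7.46 nodes for
patients who did not survive, against 2.79 for those who did.
2. From the std-dev, the spread of nodes is wider for patients who did not survive 5 years (9.13)
than for those who survived beyond 5 years (5.86).
3. From the median, the typical patient with a shorter survival period had 4 nodes, while the
typical long-term survivor had zero nodes.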
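The MEANS cell also prints the mean after appending a single large value (225), which shows how sensitive the mean is to one outlier. The same effect can be reproduced on a small made-up sample (not the real data), where the median barely moves:

```python
import numpy as np

# Illustrative node counts only (not the Haberman data).
nodes = np.array([0, 0, 0, 1, 2, 3, 8])
with_outlier = np.append(nodes, 225)   # one extreme value added

print("mean  :", np.mean(nodes), "->", np.mean(with_outlier))
print("median:", np.median(nodes), "->", np.median(with_outlier))
print("std   :", np.std(nodes), "->", np.std(with_outlier))
```

This is why the median and percentiles are often preferred over the mean and standard deviation for skewed features such as node counts.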
QUANTILES,PERCENTILES
In [3]: print("QUANTILES")
# hab_data_survived and hab_data_not_survived must already be defined in the session
# (they are the same frames used in In [21]); otherwise this cell raises a NameError
print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
print("-----------")
print("PERCENTILES")
print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],90))
print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],90))
QUANTILES
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-399fc1c3d17a> in <module>
1 print("QUANTILES")
----> 2 print(np.percentile(hab_data_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
3 print(np.percentile(hab_data_not_survived["POSITIVE_AUX_NODES"],np.arange(0,100,25)))
4 print("-----------")
5 print("PERCENTILES")
OBSERVATION
1. 50% of the patients who survived had zero positive nodes, and about 25% of them had 3 or
more nodes.
2. Among the patients who did not survive, the 25th percentile is 1 node, the 50th percentile is
4 nodes and the 75th percentile is above 10 nodes.
3. 90% of the patients who survived had 8 or fewer nodes; patients with between 8 and 20 nodes
(8<nodes<20) have a lower chance of surviving.
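For reference, `np.arange(0,100,25)` yields the percentile levels 0, 25, 50 and 75, so the QUANTILES lines print the minimum and the three quartiles of each class. A small sketch on made-up values (not the real data) shows how to read them:

```python
import numpy as np

# Illustrative node counts only (not the Haberman data).
nodes = np.array([0, 0, 0, 0, 1, 2, 3, 4, 8])

levels = np.arange(0, 100, 25)              # array([ 0, 25, 50, 75])
quartiles = np.percentile(nodes, levels)    # minimum, Q1, median, Q3
p90 = np.percentile(nodes, 90)

for level, value in zip(levels, quartiles):
    print(f"{level}th percentile: {value}")
print(f"90th percentile: {p90}")
```

Note that `np.percentile` interpolates linearly between data points by default, so a percentile need not be a value that occurs in the data.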
BOX PLOTS
In [38]: sns.boxplot(x='SURVIVAL_STATUS',y='POSITIVE_AUX_NODES',hue='SURVIVAL_STATUS',data=hab_data)
plt.title("POSITIVE_AUX_NODES")
plt.show()
sns.boxplot(x='SURVIVAL_STATUS',y='AGE',hue='SURVIVAL_STATUS',data=hab_data)
plt.title("AGE")
plt.show()
sns.boxplot(x='SURVIVAL_STATUS',y='YEAR OF OPERATION',data=hab_data,hue='SURVIVAL_STATUS')
plt.title("YEAR OF OPERATION")
plt.show()
OBSERVATION
1. In the POSITIVE_AUX_NODES plot, the 25th and 50th percentiles are nearly the same for patients
who survived more than five years, and the box lies between 0 and 5 nodes. The range for this
class runs from 0 to about 10 nodes.
2. For patients who did not survive, the 25th percentile is 1 or 2 nodes, the median is 4 nodes
(see the MEDIAN output) and the 75th percentile is about 12 nodes. The range for this class runs
from 0 to about 25 nodes.
3. The box plots of AGE and YEAR OF OPERATION do not separate the two classes well.
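The numbers read off a box plot can also be computed directly instead of being estimated by eye. This sketch assumes seaborn's default whisker rule (whiskers reach the most extreme data points within 1.5 x IQR of the box) and uses made-up values rather than the real data:

```python
import numpy as np

# Illustrative node counts only (not the Haberman data).
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 6, 9, 30])

q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Whiskers end at the most extreme data points inside the 1.5 * IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = nodes[(nodes >= lower_fence) & (nodes <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```

Running the same computation per class on the real data would replace approximate readings such as "1 or 2" with exact values.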
VIOLIN PLOTS
In [40]: sns.violinplot(x='SURVIVAL_STATUS',y='POSITIVE_AUX_NODES',data=hab_data,hue='SURVIVAL_STATUS')
plt.title("POSITIVE_AUX_NODES")
plt.show()
sns.violinplot(x='SURVIVAL_STATUS',y='AGE',data=hab_data,hue='SURVIVAL_STATUS')
plt.title("AGE")
plt.show()
sns.violinplot(x='SURVIVAL_STATUS',y='YEAR OF OPERATION',data=hab_data,hue='SURVIVAL_STATUS')
plt.title("YEAR OF OPERATION")
plt.show()
OBSERVATION
1. In the POSITIVE_AUX_NODES plot, the class that survived more than 5 years is widest (densest)
at 0 nodes. The whiskers run from 0 to 7.
2. For patients who did not survive, there is more density around 12 nodes, and the whiskers run
from 0 to 25 nodes.
1. Survival status depends on the number of positive nodes: the more nodes a patient has, the
lower the chance of survival.
2. Patients treated after 1966 have a slightly higher chance of surviving than the rest, and
patients treated before 1959 have a slightly lower chance (from the box and violin plots).
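The year-of-operation claim can be quantified by grouping patients into periods and computing the survival rate in each. This is a sketch on made-up rows: the two-digit year coding, the 1 = survived coding and the helper `period_of` are assumptions for illustration, not verified against the dataset.

```python
import pandas as pd

# Illustrative rows only (not the Haberman data).
toy = pd.DataFrame({
    "YEAR OF OPERATION": [58, 58, 59, 60, 62, 64, 65, 67, 68, 69],
    "SURVIVAL_STATUS":   [2, 1, 2, 1, 1, 2, 1, 1, 1, 2],
})

def period_of(year):
    """Map a two-digit operation year to one of three assumed periods."""
    if year < 59:
        return "before 1959"
    if year <= 66:
        return "1959-1966"
    return "after 1966"

period = toy["YEAR OF OPERATION"].apply(period_of)
survived = toy["SURVIVAL_STATUS"] == 1          # assumed coding: 1 = survived
rate_by_period = survived.groupby(period).mean()

print(rate_by_period)
```

Reporting the rate together with the number of patients in each period would make it clearer whether a "slightly higher" chance is more than noise.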