Assignment Instructions:: Import As
1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle account to download the data
(https://www.kaggle.com/gilsousa/habermans-survival-data-set), or you can run the cell below and load the data directly.
2. Perform an analysis on this dataset similar to the one done in the reference notebook.
   age  year  nodes  status
1   30    64      1       1
2   30    62      3       1
3   30    65      0       1
4   31    59      2       1
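One way to load the file directly is sketched below. It rests on assumptions: the CSV is taken to have no header row, the column names are copied from the `hab.columns` output in this notebook, and `load_haberman` is a hypothetical helper name. The four rows displayed above stand in for the real file so the sketch runs without a download.

```python
import io

import pandas as pd

# Column names taken from the notebook's `hab.columns` output.
COLUMNS = ["age", "year", "nodes", "status"]


def load_haberman(source):
    """Read a header-less Haberman CSV from a path or a file-like object."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Stand-in for the real file: the four rows displayed above, written as CSV text.
sample = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n")
hab_sample = load_haberman(sample)
print(hab_sample.shape)
print(hab_sample.columns.tolist())
```

With the downloaded file, the same helper would be called with the file path instead of the in-memory sample.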
1.1 Analyze high level statistics of the dataset: number of points, number of features, number
of classes, data-points per class.
You have to write all of your observations in Markdown cells with proper formatting. You can go through the following blog to
understand formatting in Markdown cells - https://www.markdownguide.org/basic-syntax/
Do not write your observations as comments in code cells.
Write comments in your code cells to explain the code that you are writing. Proper commenting makes code
maintenance much easier and helps you find bugs faster.
You can add extra cells using the Insert Cell Below command in the Insert tab. You can also use the shortcut Alt+Enter.
It is good programming practice to import all the libraries that you will be using in a single cell.
306
In [3]: print(hab.shape[1])
In [4]: print(hab.columns)
Index(['age', 'year', 'nodes', 'status'], dtype='object')
In [5]: hab["status"].value_counts()
Out[5]:
1    225
2     81
Name: status, dtype: int64
In [6]: hab.describe()
That is, whether a patient survives 5 years after the cancer operation.
Survival_status: the class attribute, which records whether a patient survived 5 years after the operation. The dataset is imbalanced,
with 225 patients who survived and 81 who died within 5 years of the operation. This is a binary classification problem.
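The degree of imbalance can be made explicit with a little arithmetic. The sketch below uses only the two class counts reported by `hab["status"].value_counts()`; the variable names are illustrative.

```python
# Class counts reported by hab["status"].value_counts() in this notebook.
class_counts = {1: 225, 2: 81}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"class {label}: {class_counts[label]} points ({share:.1%})")
print(f"majority-to-minority ratio: {imbalance_ratio:.2f}")
```

Class 1 makes up roughly 73.5% of the points, so a model that always predicts survival would already reach that accuracy. This is why accuracy alone is a weak yardstick on this dataset.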
1.3 Perform Univariate analysis - Plot PDF, CDF, Boxplot, Violin plots
Plot the required charts to understand which features are important for classification.
Make sure that you add titles, legends and labels for each and every plot.
Suppress the warnings you get in Python; this makes your notebook more presentable.
Do write observations/inference for each plot.
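Warnings can be suppressed with the standard-library `warnings` module. The sketch below shows a notebook-wide filter and a scoped alternative; `legacy_call` is a hypothetical stand-in for any library call that emits a warning.

```python
import warnings

# Notebook-wide: silence all warnings for the rest of the session.
warnings.filterwarnings("ignore")


def legacy_call():
    """Hypothetical helper that emits a warning, standing in for a library call."""
    warnings.warn("this argument is deprecated", FutureWarning)
    return "ok"


# Scoped alternative: filters changed inside the block are restored on exit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    result = legacy_call()

print(result, len(caught))
```

The notebook-wide filter is usually placed in the same cell as the imports. The scoped form is the safer choice when only one noisy call needs silencing, since it leaves other warnings visible.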
In [8]: print(hab.shape)
(306, 4)
In [22]: # CDF of age,year,nodes of patients at the time of surgery with survival status as survived after 5 years.
counts, bin_edges=np.histogram(hab_live['age'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('AGE')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on age of patients at the time of surgery(survived)')
plt.show()
counts, bin_edges=np.histogram(hab_live['year'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('YEAR')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on year in which surgery is done(survived)')
plt.show()
counts, bin_edges=np.histogram(hab_live['nodes'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('AXIL_NODES')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on number of axillary nodes(survived)')
plt.show()
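The three blocks above repeat the same histogram computation for each feature. A minimal sketch of a reusable helper follows; `pdf_cdf` is a hypothetical name, and the toy values are made up for illustration rather than taken from the dataset.

```python
import numpy as np


def pdf_cdf(values, bins=10):
    """Return (right bin edges, per-bin probability, cumulative probability)."""
    counts, bin_edges = np.histogram(values, bins=bins, density=True)
    pdf = counts / counts.sum()  # normalise so the bin probabilities sum to 1
    cdf = np.cumsum(pdf)         # running total of the bin probabilities
    return bin_edges[1:], pdf, cdf


# Made-up values for illustration only (not taken from the dataset).
toy_nodes = np.array([0, 0, 1, 2, 3, 10])
edges, pdf, cdf = pdf_cdf(toy_nodes, bins=5)
print(edges)
print(pdf)
print(cdf)
```

In the notebook, each plotting block could then call `pdf_cdf(hab_live['age'])` and pass the returned arrays to `plt.plot`, which keeps the per-feature code down to the labels and title.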
Out[134]: <AxesSubplot:xlabel='status', ylabel='nodes'>
1.4 Perform Bivariate analysis - Plot 2D Scatter plots and Pair plots
Plot the required Scatter plots and Pair plots of different features to see which combinations of features are useful for the classification task
Make sure that you add titles, legends and labels for each and every plot.
Suppress the warnings you get in Python; this makes your notebook more presentable.
Do write observations/inference for each plot.
In [137… plt.close()
sns.set_style('whitegrid')
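g = sns.pairplot(hab, hue='status', palette='dark', height=3)  # `size` was renamed to `height` in seaborn 0.9
g.fig.suptitle('Pair Plot', y=1.02)  # figure-level title; plt.title would label only the last subplot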
plt.show()
OBSERVATIONS:
1. Number of points: 306.
2. Number of columns: 4.
3. The columns comprise 3 features and 1 class attribute.
4. It is an imbalanced dataset with 225 data points of class 1 (patients who survived 5 years after surgery) and 81 data points of class
2 (patients who died within 5 years of surgery).
=>From describe:
1. The PDFs of the two classes overlap, so no conclusion can be drawn from this plot.
1. The PDFs of the two classes overlap, so no conclusion can be drawn from this plot.
1. The PDFs of the two classes overlap, but for 0 to 3 nodes the density of survivors is high.
=> In the 'Age' vs 'Axil_nodes' plot, a large number of points are concentrated at axil_nodes <= 4.
1. There are more blue points than orange points at nodes=0, i.e., patients with nodes=0 are more likely to survive
irrespective of their age.
2. Patients with more than 4 nodes and age above 50 are less likely to survive.
3. Patients younger than 40 with fewer than 10 nodes have a higher survival rate.
4. The higher the number of nodes, the lower the chances of survival.
1. Patients aged 53 to 58 account for 18% of the survivors, which is higher than for other age ranges.
1. In class 1, 92% of the patients had 'axil_nodes' <= 10.
1. Patients aged 48 to 54 account for 20% of the deaths, which is higher than for other age ranges.
2. In class 2 (death within 5 years of the operation), 40% of the patients had 'age' <= 50.
1. In class 2 (death within 5 years of the operation), 70% of the patients had 'axil_nodes' <= 10.
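Percentages of this kind are values of the empirical CDF at a threshold, so they can be checked directly instead of being read off a plot. The sketch below assumes a hypothetical helper `fraction_at_or_below` and uses made-up node counts, not values from the dataset.

```python
import numpy as np


def fraction_at_or_below(values, threshold):
    """Share of observations that are <= threshold (empirical CDF at threshold)."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))


# Made-up node counts for illustration only (not taken from the dataset).
toy_nodes = [0, 1, 2, 4, 8, 10, 13, 22]
share = fraction_at_or_below(toy_nodes, 10)
print(f"{share:.0%} of the toy values are <= 10")
```

Applied to the notebook's data, `fraction_at_or_below(hab_live['nodes'], 10)` would be the check for the class 1 figure, and the same call on the corresponding class 2 frame would check the others.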
Final Conclusion:
Yes, we can diagnose breast cancer using Haberman's dataset using: