
Assignment Instructions:: Import As

The document provides instructions for analyzing the Haberman Cancer Survival dataset from Kaggle to predict patient survival. Key points:
1. Download and load the Haberman dataset, which covers 306 patients and has 4 columns: age, operation year, number of positive axillary nodes, and survival status.
2. Perform exploratory data analysis: describe the dataset statistics, define the objective (predicting survival), and run univariate and bivariate analysis using plots to understand feature relationships.
3. Calculate class distributions and generate histograms, CDFs and scatter plots to analyze each feature individually and together and judge its predictive power for survival classification. Comment on the observations from each plot.

Uploaded by

Tayub khan.A
Copyright: © All Rights Reserved

Assignment Instructions:

1. Download the Haberman Cancer Survival dataset from Kaggle (https://www.kaggle.com/gilsousa/habermans-survival-data-set). You may have to create a Kaggle account to download the data. Alternatively, run the cell below to load the data directly.
2. Perform an analysis on this dataset similar to the one done in the reference notebook.

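In [1]: import pandas as pd
        # Note: haberman.csv already contains a header row. Because names= is passed
        # without header=0, that header row is read as data and appears as row 0 below.
        df = pd.read_csv('haberman.csv', names=["age", "operation_Year", "axil_nodes", "survival_status"])
        df.head()

Out[1]:    age  operation_Year  axil_nodes  survival_status
        0  age            year       nodes           status
        1   30              64           1                1
        2   30              62           3                1
        3   30              65           0                1
        4   31              59           2                1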

1.1 Analyze high level statistics of the dataset: number of points, number of features, number of classes, data-points per class.
Write all of your observations in Markdown cells with proper formatting. The following guide explains formatting in Markdown cells: https://www.markdownguide.org/basic-syntax/
Do not write your observations as comments in code cells.
Write comments in your code cells to explain the code that you are writing. Proper commenting makes code maintenance much easier and helps you find bugs faster.
You can add extra cells using the Insert Cell Below command in the Insert tab. You can also use the shortcut Alt+Enter.
It is good programming practice to import all the libraries that you will be using in a single cell.

In [2]: import pandas as pd
        hab = pd.read_csv("haberman.csv")  # the first row of the file is used as the header
        print(hab.shape[0])  # number of data points (rows)
306

In [3]: print(hab.shape[1])  # number of columns
4

In [4]: print(hab.columns)
Index(['age', 'year', 'nodes', 'status'], dtype='object')

In [5]: hab["status"].value_counts()
Out[5]: 1    225
        2     81
        Name: status, dtype: int64

1. Number of data points = 306
2. Number of features = 3 (4 columns: 3 features and 1 class attribute)
3. Number of classes = 2
4. Number of data points per class: class 1 = 225 (survived 5 years or longer), class 2 = 81 (died within 5 years)
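The class counts above can also be expressed as proportions, which makes the imbalance easier to quantify. The following is a minimal sketch; it rebuilds the label column from the two reported counts rather than reading haberman.csv, and the variable names `status` and `class_share` are chosen here for illustration.

```python
import pandas as pd

# Labels rebuilt from the counts reported above: 225 of class 1, 81 of class 2.
status = pd.Series([1] * 225 + [2] * 81, name="status")

# normalize=True returns fractions of the total instead of raw counts.
class_share = status.value_counts(normalize=True).sort_index()
print(class_share.round(3))
```

On the real data the same call would be `hab["status"].value_counts(normalize=True)`. Roughly 73.5% of patients fall in class 1 and 26.5% in class 2.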

In [6]: hab.describe()

Out[6]:               age        year       nodes      status
        count  306.000000  306.000000  306.000000  306.000000
        mean    52.457516   62.852941    4.026144    1.264706
        std     10.803452    3.249405    7.189654    0.441899
        min     30.000000   58.000000    0.000000    1.000000
        25%     44.000000   60.000000    0.000000    1.000000
        50%     52.000000   63.000000    1.000000    1.000000
        75%     60.750000   65.750000    4.000000    2.000000
        max     83.000000   69.000000   52.000000    2.000000

1.2 - Explain the objective of the problem.
(The objective of a problem can be defined as a brief explanation of the problem that you are trying to solve using the given dataset.)

Objective: To predict the survival of a breast cancer patient, that is, whether a patient will survive for 5 years or longer after the cancer operation.

The prediction is based on the given features:

1. age: Age of the patient at the time of the operation.
2. operation_Year (year): The year in which the operation was performed, recorded as year minus 1900 (for example, 64).
3. axil_nodes (nodes): The number of positive axillary lymph nodes detected, i.e. lymph nodes in the armpit area found to contain cancer cells. Lymph nodes are glands that filter lymph fluid.

survival_status (status): The class attribute, indicating whether the patient survived for 5 years or longer after the operation (1) or died within 5 years (2). The dataset is imbalanced, with 225 survivals and 81 deaths, and the task is binary classification.
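Because the class attribute is stored as the codes 1 and 2, plot legends are easier to read if the codes are mapped to descriptive labels first. A minimal sketch follows; the small `sample` frame and the `STATUS_LABELS` wording are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Illustrative rows only; column names follow hab.columns.
sample = pd.DataFrame({
    "age":    [30, 30, 31, 34, 38],
    "nodes":  [1, 3, 2, 0, 21],
    "status": [1, 1, 1, 2, 2],
})

# Hypothetical label text for the two class codes described above.
STATUS_LABELS = {1: "survived 5+ years", 2: "died within 5 years"}

# map() replaces each code with its label; unknown codes would become NaN.
sample["status_label"] = sample["status"].map(STATUS_LABELS)
print(sample)
```

Passing `hue="status_label"` instead of `hue="status"` to the seaborn calls would then show these labels in the legend.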

1.3 Perform Univariate analysis - Plot PDF, CDF, Boxplot, Violin plots
Plot the required charts to understand which features are important for classification.
Make sure that you add titles, legends and labels to every plot.
Suppress the warnings you get in Python; this makes your notebook more presentable.
Write observations/inferences for each plot.

In [7]: import numpy as np
        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        import warnings
        warnings.filterwarnings('ignore')  # suppress warnings to keep the notebook output clean
        hab = pd.read_csv("haberman.csv")

In [8]: print(hab.shape)
(306, 4)

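In [9]: # Split the data by class: status 1 = survived 5 years or longer, status 2 = died within 5 years
        hab_live = hab.loc[hab["status"] == 1]
        hab_dead = hab.loc[hab["status"] == 2]
        # 1-D scatter: node counts on the x-axis, all points placed at y = 0
        plt.plot(hab_live["nodes"], np.zeros_like(hab_live['nodes']), 'o', label='status 1 (survived)')
        plt.plot(hab_dead["nodes"], np.zeros_like(hab_dead['nodes']), 'o', label='status 2 (died)')
        plt.title('1D plot')
        plt.xlabel('axil_nodes')
        plt.legend()
        plt.show()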

In [10]: # height= replaces the size= argument, which was renamed in seaborn 0.9.
         # sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement.
         sns.FacetGrid(hab, hue='status', height=5) \
             .map(sns.distplot, "age") \
             .add_legend()
         plt.title('PDF of age')
         plt.show()

In [11]: sns.FacetGrid(hab, hue='status', height=5) \
             .map(sns.distplot, "year") \
             .add_legend()
         plt.title('PDF of year of operation')
         plt.show()

In [12]: sns.FacetGrid(hab, hue='status', height=5) \
             .map(sns.distplot, "nodes") \
             .add_legend()
         plt.title('PDF of axil_nodes')
         plt.show()

In [22]: # CDF of age,year,nodes of patients at the time of surgery with survival status as survived after 5 years.
counts, bin_edges=np.histogram(hab_live['age'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('AGE')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on age of patients at the time of surgery(survived)')
plt.show()
counts, bin_edges=np.histogram(hab_live['year'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('YEAR')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on year in which surgery is done(survived)')
plt.show()
counts, bin_edges=np.histogram(hab_live['nodes'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('AXIL_NODES')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on number of axillary nodes(survived)')
plt.show()

In [23]: # CDF of age,year,nodes of patients at the time of surgery with survival status as died within 5 years.
counts, bin_edges=np.histogram(hab_dead['age'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('AGE')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on age of patients at the time of surgery(died)')
plt.show()
counts, bin_edges=np.histogram(hab_dead['year'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('YEAR')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on year in which surgery is done(died)')
plt.show()
counts, bin_edges=np.histogram(hab_dead['nodes'], bins=10, density=True)
pdf=counts/(sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('Axil_Nodes')
plt.ylabel('PROBABILITIES')
plt.legend(labels=['PDF plot', 'CDF plot'])
plt.grid()
plt.title('pdf and cdf on number of axillary nodes(died)')
plt.show()
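The two cells above repeat the same histogram, PDF and CDF steps six times. One way to avoid the repetition is a small helper, sketched below. The function name `pdf_cdf` and the `demo` values are hypothetical. It normalizes raw counts directly, which for equal-width bins gives the same per-bin probabilities as the `density=True` followed by `counts/sum(counts)` approach used in the cells.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities, and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points in each bin
    cdf = np.cumsum(pdf)          # running total, ends at 1.0
    return bin_edges, pdf, cdf

# Hypothetical node counts, used only to show the output shape.
demo = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 10])
edges, pdf, cdf = pdf_cdf(demo, bins=5)
print(edges)
print(pdf)
print(cdf)
```

In the notebook, `pdf_cdf(hab_live['age'])` would return the arrays that are then passed to `plt.plot(edges[1:], pdf)` and `plt.plot(edges[1:], cdf)`.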

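In [133… sns.boxplot(x='status', y='nodes', data=hab)
         plt.title('Box plot of axil_nodes by survival status')
         plt.show()

In [134… # size= is not a violinplot parameter, so it is dropped here
         sns.violinplot(x='status', y='nodes', data=hab)
         plt.title('Violin plot of axil_nodes by survival status')
         plt.show()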
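The quartiles that a box plot draws can also be computed numerically, which avoids reading values off the chart by eye. A minimal sketch, using a small hypothetical `demo` frame in place of `hab`:

```python
import pandas as pd

# Hypothetical values for illustration; column names follow hab.columns.
demo = pd.DataFrame({
    "nodes":  [0, 0, 1, 2, 4, 1, 3, 5, 9, 12],
    "status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# One row per class, one column per requested quantile.
quartiles = demo.groupby("status")["nodes"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Replacing `demo` with `hab` gives the 25th, 50th and 75th percentiles of `nodes` for each survival status.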

1.4 Perform Bivariate analysis - Plot 2D Scatter plots and Pair plots
Plot the required scatter plots and pair plots of different features to see which combinations of features are useful for the classification task.
Make sure that you add titles, legends and labels to every plot.
Suppress the warnings you get in Python; this makes your notebook more presentable.
Write observations/inferences for each plot.

In [135… hab.plot(kind='scatter', x='age', y='nodes')
         plt.title("2-D scatter plot between age and axil_nodes")
         plt.grid()
         plt.show()

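In [24]: # 2-D scatter plot with color-coding by survival status
         # height= replaces the size= argument, which was renamed in seaborn 0.9
         sns.set_style("whitegrid")
         sns.FacetGrid(hab, hue='status', height=10) \
             .map(plt.scatter, "age", "nodes") \
             .add_legend()
         plt.title('2-D scatter plot with color-coding')
         plt.show()

In [137… plt.close()
         sns.set_style('whitegrid')
         sns.pairplot(hab, hue='status', palette='dark', height=3)
         # suptitle titles the whole figure; plt.title would only label the last subplot
         plt.suptitle('Pair Plot', y=1.02)
         plt.show()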


1.5 Summarize your final conclusions of the Exploration
You can describe the key features that are important for the classification task.
Try to quantify your results, i.e. include numbers, percentages, fractions etc. when writing observations.
Write a brief summary of your exploratory analysis in 3-5 points.
Write your observations in English as crisply and unambiguously as possible.

OBSERVATIONS:

=> From statistical data:

1. Number of data points: 306.
2. Number of columns: 4.
3. The columns consist of 3 features and 1 class attribute.
4. It is an imbalanced dataset with 225 data points of class 1 (patients who survived 5 years or longer after surgery) and 81 data points of class 2 (patients who died within 5 years of surgery).

=> From describe:

1. Mean age is 52.46 and mean axil_nodes is 4.026.
2. Minimum age is 30 and minimum axil_nodes is 0.
3. 25th percentile for age is 44 and for axil_nodes is 0.
4. 50th percentile for age is 52 and for axil_nodes is 1.
5. 75th percentile for age is 60.75 and for axil_nodes is 4.

=> From PDF plots:

=> PDF of age (age vs density):

1. The PDFs of the two classes overlap almost entirely, so no conclusion can be drawn from age alone.

=> PDF of year (year vs density):

1. The PDFs of the two classes overlap almost entirely, so no conclusion can be drawn from the year of operation alone.

=> PDF of axil_nodes (nodes vs density):

1. The PDFs overlap, but for nodes from 0 to 3 the density of survivors is higher.

=> From pair plot and scatter plot:

=> In the 'Age' vs 'Axil_nodes' plot, most points are concentrated at axil_nodes <= 4.

1. There are more blue points than orange points at nodes = 0, i.e. patients with nodes = 0 are more likely to survive irrespective of their age.
2. Patients with nodes above 4 and age above 50 are less likely to survive.
3. Patients aged less than 40 with fewer than 10 nodes have a higher survival rate.
4. The higher the number of nodes, the lower the chance of survival.
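Statements such as "the higher the number of nodes, the lower the chance of survival" can be quantified by grouping patients into node ranges and computing the share of class 1 in each range. The sketch below uses hypothetical data and hypothetical bin edges; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical values for illustration; column names follow hab.columns.
demo = pd.DataFrame({
    "nodes":  [0, 0, 0, 2, 3, 6, 8, 15, 20, 30],
    "status": [1, 1, 2, 1, 1, 1, 2, 2, 1, 2],
})

# Assumed node ranges: 0, 1-4, 5-10, more than 10 (upper edge 100 is arbitrary).
bins = [-1, 0, 4, 10, 100]
labels = ["0", "1-4", "5-10", ">10"]
demo["node_group"] = pd.cut(demo["nodes"], bins=bins, labels=labels)

# Mean of a boolean column is the fraction of True values, i.e. the survival rate.
survival_rate = (demo["status"] == 1).groupby(demo["node_group"], observed=False).mean()
print(survival_rate)
```

Run against `hab`, this would give a survival percentage per node range that can be quoted directly in the observations.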

=> From PDFs and CDFs:

=> In the probability vs age plot (survived):

1. About 18% of the patients who survived are aged 53 to 58, the highest share of any age bin.

=> In the probability vs axil_nodes plot (survived):

1. About 92% of the patients in class 1 (survived) had axil_nodes <= 10.

=> In the probability vs age plot (died):

1. About 20% of the patients who died are aged 48 to 54, the highest share of any age bin.

2. About 40% of the patients in class 2 (died within 5 years of the operation) were aged <= 50.

=> In the probability vs axil_nodes plot (died):

1. About 70% of the patients in class 2 (died within 5 years of the operation) had axil_nodes <= 10.
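Percentages read from a binned CDF plot depend on where the bin edges fall. The share of a class at or below a threshold can instead be computed directly from the raw values, as in this minimal sketch. The helper name `fraction_at_or_below` and the `demo_nodes` values are hypothetical.

```python
import numpy as np

def fraction_at_or_below(values, threshold):
    """Empirical CDF at `threshold`: share of values that are <= threshold."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))

# Hypothetical node counts for one class, for illustration only.
demo_nodes = [0, 1, 2, 5, 9, 10, 11, 14, 23, 35]
print(fraction_at_or_below(demo_nodes, 10))
```

With the notebook's data, `fraction_at_or_below(hab_dead['nodes'], 10)` would give the exact share of class 2 patients with axil_nodes <= 10.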

=> From BOX plot:

=> Patients who survived 5 years or longer after surgery:

1. 25th percentile of axil_nodes is 0.
2. 75th percentile of axil_nodes is about 4.

=> Patients who died within 5 years of surgery:

1. 25th percentile of axil_nodes is about 2.
2. 50th percentile of axil_nodes is about 4.
3. 75th percentile of axil_nodes is about 12.

=> From VIOLIN plot:

The percentiles are the same as in the box plot.

Final Conclusion:

Yes, Haberman's dataset can be used to predict whether a breast cancer patient survives 5 years after surgery:

1. The most important feature is the number of axil_nodes.
2. The 'Age' feature is the next largest contributor to the classification.
3. Combining these 2 features contributes best to the classification.

