Exploratory Data Analysis On Haberman Dataset

The dataset contains 306 cases from a study on breast cancer survival conducted between 1958 and 1970. There are 4 attributes: age, year of operation, number of positive axillary nodes, and survival status. The dataset is imbalanced, with 225 patients surviving 5 or more years and 81 dying within 5 years. Scatter plots of age against year show the two survival classes overlapping. Pairwise plots examine relationships between all attributes. Distribution plots suggest that node counts are higher for patients who died within 5 years. CDF plots compare the distribution of operation year between the two survival statuses.

Uploaded by

Syed Subahani

3/17/2020 Exploratory Data Analysis on Haberman Dataset

Data Set Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.

Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
   1 = the patient survived 5 years or longer
   2 = the patient died within 5 years

Source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings

warnings.filterwarnings("ignore")

habermandf = pd.read_csv("haberman.csv")
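The cell above assumes that haberman.csv already carries a header row with the names age, year, nodes and status. If your copy is the raw file from the UCI repository, it may have no header line (worth checking by opening the file), in which case the column names have to be supplied explicitly. A minimal sketch of that case follows; the three rows are hypothetical values in the same four-column layout, not rows taken from the real file.

```python
import io
import pandas as pd

# Hypothetical rows in the four-column layout (age, year, nodes, status).
# They only stand in for a headerless file; they are not real records.
raw = io.StringIO("30,64,1,1\n45,60,0,1\n52,66,4,2\n")

# header=None tells pandas that the first line is data, not column names;
# names= supplies the labels used throughout this notebook.
df = pd.read_csv(raw, header=None, names=["age", "year", "nodes", "status"])

print(df.shape)
print(df.columns.tolist())
```

With a real file, the `io.StringIO` object would be replaced by the file path.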

In [2]: # (Q) How many data points and features?
habermandf.shape

Out[2]: (306, 4)

The dataset contains 306 data points (observations) and 4 attributes (characteristics).

In [3]: habermandf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

All four columns are integers. There is no missing data: every column has 306 non-null values.

localhost:8888/notebooks/Subahani/Study/Applied AI/Assignments/Mandatory/Exploratory Data Analysis on Haberman Dataset.ipynb 1/12



In [4]: #(Q) What are the column names in our dataset?


habermandf.columns

Out[4]: Index(['age', 'year', 'nodes', 'status'], dtype='object')

In [5]: #(Q) How many data points for each class are present?
habermandf["status"].value_counts()

Out[5]: 1 225
2 81
Name: status, dtype: int64

This is an imbalanced dataset: the number of data points differs widely between the two classes
(225 with status 1 against 81 with status 2).
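To put a number on the imbalance, the counts from Out[5] can be turned into class proportions. The sketch below rebuilds a status column from those two counts only, so it runs without the CSV file; on the real DataFrame the same call would be `habermandf["status"].value_counts(normalize=True)`.

```python
import pandas as pd

# Rebuild the status column from the counts printed in Out[5]:
# 225 patients with status 1 and 81 with status 2.
status = pd.Series([1] * 225 + [2] * 81, name="status")

# normalize=True returns fractions of the total instead of raw counts.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of the majority class to the minority class.
ratio = (status == 1).sum() / (status == 2).sum()
print(round(ratio, 2))
```

Roughly three quarters of the patients belong to class 1, which is worth keeping in mind when reading the plots that follow: the larger class dominates any plot that is not normalized per class.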

2-D Scatter Plot


In [6]: habermandf.plot(kind="scatter", x="age", y="year")  # kind must be lowercase in recent pandas
plt.grid()
plt.show()


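In [7]: # 2-D scatter plot with color-coding for each survival class.
# How many feature pairs exist? 3C2 = 3
# Relabel the class codes: 1 -> "Positive" (survived), 2 -> "Negative" (died)
habermandf["status"] = habermandf["status"].apply(lambda x: "Positive" if x == 1 else "Negative")

sns.set_style("whitegrid")
# seaborn 0.9+ names this parameter height; older releases called it size
sns.FacetGrid(habermandf, hue="status", height=4) \
.map(plt.scatter, "age", "year") \
.add_legend()
plt.show()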

Observation(s):
1. Positive: the patient survived 5 years or longer
2. Negative: the patient died within 5 years

The two classes are very hard to separate using age and year, because their data points overlap.

Pair-plot

In [8]: # Pairwise scatter plot: pair-plot
plt.close()
sns.set_style("whitegrid")
# seaborn 0.9+ names this parameter height; older releases called it size
sns.pairplot(habermandf, hue="status", height=3)
plt.show()

Histogram, PDF, CDF


In [9]: # What about a 1-D scatter plot using just one feature?
haberman_pos = habermandf.loc[habermandf["status"] == 'Positive']
haberman_neg = habermandf.loc[habermandf["status"] == 'Negative']

# Every point is drawn at y = 0, so only the x position carries information
plt.plot(haberman_pos['nodes'], np.zeros_like(haberman_pos['nodes']), 'o', label='Positive')
plt.plot(haberman_neg['nodes'], np.zeros_like(haberman_neg['nodes']), 'o', label='Negative')
plt.xlabel("Nodes")
plt.title("Haberman")
plt.legend()
plt.show()


In [10]: # Nodes
# seaborn 0.9+ names this parameter height; older releases called it size
sns.FacetGrid(habermandf, hue='status', height=5)\
.map(sns.distplot, "nodes")\
.add_legend()
plt.title('Haberman')
plt.show()


In [11]: # Year
# seaborn 0.9+ names this parameter height; older releases called it size
sns.FacetGrid(habermandf, hue='status', height=5)\
.map(sns.distplot, "year")\
.add_legend()
plt.title('Haberman')
plt.show()


In [12]: # Age
# seaborn 0.9+ names this parameter height; older releases called it size
sns.FacetGrid(habermandf, hue='status', height=5)\
.map(sns.distplot, "age")\
.add_legend()
plt.title('Haberman')
plt.show()


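In [13]: # Plot the PDF and CDF of year for the Positive class
counts, bin_edges = np.histogram(haberman_pos['year'], bins=5, density=True)
pdf = counts / sum(counts)  # fraction of Positive patients in each bin
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)  # running total of the bin fractions
# Both curves describe the Positive class, so label them by curve type
plt.plot(bin_edges[1:], pdf, label='PDF (Positive)')
plt.plot(bin_edges[1:], cdf, label='CDF (Positive)')
plt.xlabel('year')
plt.ylabel('probability')
plt.title("Haberman")
plt.legend()
plt.show()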

[0.29333333 0.17333333 0.2 0.16444444 0.16888889]
[58. 60.2 62.4 64.6 66.8 69. ]
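The cell above divides the density-normalized histogram by its own sum. Because all five bins have the same width, that gives the same result as dividing the raw bin counts by the number of patients, and the cumulative sum of those fractions is the empirical CDF evaluated at the upper bin edges. The sketch below checks this equivalence on a small array of hypothetical operation years (made up for illustration, not taken from the dataset).

```python
import numpy as np

# Hypothetical operation years, for illustration only.
years = np.array([58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 69])

# density=True returns count / (n * bin_width) for each bin.
density, edges = np.histogram(years, bins=5, density=True)
# The default returns plain counts per bin.
counts, _ = np.histogram(years, bins=5)

# With equal-width bins the width cancels out, so both routes agree.
pmf_from_density = density / density.sum()
pmf_from_counts = counts / counts.sum()

# The cumulative sum gives the fraction of points at or below each upper edge.
cdf = np.cumsum(pmf_from_counts)

print(pmf_from_counts)
print(cdf)
```

Strictly speaking, the quantity called `pdf` in the notebook is a per-bin probability rather than a density, which is why its values sum to 1.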


In [14]: # PDF and CDF of year for each status (Positive/Negative)
counts, bin_edges = np.histogram(haberman_pos['year'], bins=5, density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF (Positive)')
plt.plot(bin_edges[1:], cdf, label='CDF (Positive)')

counts, bin_edges = np.histogram(haberman_neg['year'], bins=5, density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF (Negative)')
plt.plot(bin_edges[1:], cdf, label='CDF (Negative)')
plt.xlabel('year')
plt.ylabel('probability')
plt.title("Haberman")
plt.legend()
plt.show()

[0.29333333 0.17333333 0.2 0.16444444 0.16888889]
[58. 60.2 62.4 64.6 66.8 69. ]
[0.30864198 0.12345679 0.19753086 0.2345679 0.13580247]
[58. 60.2 62.4 64.6 66.8 69. ]
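One way to read the two sets of printed values is to accumulate them and compare the resulting CDFs edge by edge. The sketch below copies the bin probabilities from the output above and reports the gap between the two cumulative curves at each upper bin edge. The gap is only evaluated at these five edges, so it is a coarse comparison rather than a formal test statistic.

```python
import numpy as np

# Bin probabilities copied from the printed output of the cell above.
pdf_pos = np.array([0.29333333, 0.17333333, 0.2, 0.16444444, 0.16888889])
pdf_neg = np.array([0.30864198, 0.12345679, 0.19753086, 0.2345679, 0.13580247])
# Upper edge of each bin (bin_edges[1:] in the notebook).
upper_edges = np.array([60.2, 62.4, 64.6, 66.8, 69.0])

cdf_pos = np.cumsum(pdf_pos)
cdf_neg = np.cumsum(pdf_neg)

# Absolute difference between the two cumulative curves at each edge.
gap = np.abs(cdf_pos - cdf_neg)

for edge, p, n, g in zip(upper_edges, cdf_pos, cdf_neg, gap):
    print(f"year <= {edge:4.1f}: positive {p:.3f}, negative {n:.3f}, gap {g:.3f}")

max_gap = gap.max()
print(f"largest gap: {max_gap:.3f}")
```

At these bin edges the two curves stay within about 0.04 of each other, which is consistent with the year of operation being a weak separator of the two classes.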


Box plot and Whiskers


In [15]: sns.boxplot(x='status',y='year',data=habermandf)
plt.title("Haberman")
plt.show()

Violin Plots
In [16]: sns.violinplot(x='status', y='year', data=habermandf)
plt.title("Haberman")
plt.show()
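The box of a box plot spans the first to the third quartile, with a line at the median, and a violin plot adds a density estimate around the same summary. The numbers behind the box can be computed directly, which makes the two classes easier to compare than reading them off a figure. The sketch below uses a small hypothetical DataFrame so that it runs on its own; with the notebook's data, `demo` would be replaced by `habermandf`.

```python
import pandas as pd

# Hypothetical values, for illustration only.
demo = pd.DataFrame({
    "status": ["Positive"] * 5 + ["Negative"] * 5,
    "year": [58, 60, 63, 66, 69, 58, 59, 63, 65, 69],
})

# Quartiles of year within each class; unstack puts the quantiles in columns.
summary = demo.groupby("status")["year"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]

# The interquartile range is the height of the box in a box plot.
summary["iqr"] = summary["q3"] - summary["q1"]

print(summary)
```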

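Conclusion:
No clear-cut relationship could be found, in part because the dataset is imbalanced.

1. The plots suggest that patients younger than about 35 tend to survive 5 years or longer.

2. The plots suggest that patients older than about 75 tend not to survive 5 years.

Both age ranges contain only a few patients, so these are tendencies in this sample rather than rules.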
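The two age statements can be checked by counting outcomes per age band rather than reading them off the plots. The sketch below is one way to do that; the helper name `survival_by_age_band` and the band edges are assumptions introduced here, and the demo rows are hypothetical, so the printed table illustrates the method only and says nothing about the real data.

```python
import pandas as pd

def survival_by_age_band(df, edges=(0, 35, 76, 120)):
    """Count patients per age band and status (hypothetical helper).

    With right=False the bands are [0, 35), [35, 76) and [76, 120), which
    for integer ages means "under 35", "35 to 75" and "over 75".
    """
    bands = pd.cut(df["age"], bins=list(edges), right=False)
    return pd.crosstab(bands, df["status"])

# Hypothetical rows, for illustration only.
demo = pd.DataFrame({
    "age": [30, 34, 50, 62, 76, 83],
    "status": ["Positive", "Positive", "Negative", "Positive", "Negative", "Negative"],
})

table = survival_by_age_band(demo)
print(table)
```

Calling `survival_by_age_band(habermandf)` after the status column has been relabeled would show how many patients in each band fall into each class, including how small the youngest and oldest bands are.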