
Exploratory Data Analysis On Haberman Dataset PDF

The dataset contains 306 cases from a study on breast cancer survival conducted between 1958-1970. There are 4 attributes: age, year of operation, number of positive axillary nodes, and survival status. The dataset is imbalanced, with 225 patients surviving 5+ years and 81 dying within 5 years. Scatter plots show overlap between age and year. Pairwise plots examine relationships between all attributes. Distribution plots find nodes are higher for patients who died within 5 years. CDF plots compare year distributions between survival statuses.

Uploaded by

Syed Subahani


3/17/2020 Exploratory Data Analysis on Haberman Dataset

Data Set Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.

Attribute Information:

1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died
within 5 years

Source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival

In [1]: import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings

warnings.filterwarnings("ignore")

habermandf = pd.read_csv("haberman.csv")
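The cell above assumes that haberman.csv already carries a header row with the names age, year, nodes and status. If the file were instead a raw, headerless copy (an assumption, not something the notebook states), the column names could be supplied explicitly. A minimal sketch, where `load_haberman` is a hypothetical helper and the three rows are illustrative stand-ins for the real file:

```python
import io
import pandas as pd

# Column names used throughout this notebook.
HABERMAN_COLUMNS = ["age", "year", "nodes", "status"]

def load_haberman(source, has_header=True):
    """Read the Haberman data from a path or file-like object.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if has_header:
        return pd.read_csv(source)
    # Headerless file: supply the column names ourselves.
    return pd.read_csv(source, header=None, names=HABERMAN_COLUMNS)

# Three illustrative rows in the headerless layout (not the full dataset).
sample = io.StringIO("30,64,1,1\n30,62,3,1\n34,59,0,2\n")
sample_df = load_haberman(sample, has_header=False)
print(sample_df.columns.tolist())
```

With a real file, the same call would take the file path in place of the in-memory sample.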

In [2]: # (Q) how many data-points and features?

habermandf.shape

Out[2]: (306, 4)

The dataset contains 306 data points (observations) and 4 attributes (characteristics).

In [3]: habermandf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

All four columns are integers (int64), and there is no missing data: every column has 306 non-null values.
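The absence of missing values can also be checked directly rather than read off the info() listing. A small sketch on a toy frame that reuses the notebook's column names; `missing_report` is a hypothetical helper, and the missing value is planted deliberately to show what a non-zero count looks like:

```python
import pandas as pd

def missing_report(df):
    """Return the number of missing values per column as a Series."""
    return df.isnull().sum()

# Toy frame with one deliberately missing age (illustration only).
toy = pd.DataFrame({
    "age": [30, 34, None],
    "year": [64, 59, 66],
    "nodes": [1, 0, 9],
    "status": [1, 2, 2],
})
report = missing_report(toy)
print(report)
```

Applied to habermandf, every count should be zero if the info() output above holds.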

localhost:8888/notebooks/Subahani/Study/Applied AI/Assignments/Mandatory/Exploratory Data Analysis on Haberman Dataset.ipynb 1/12


In [4]: #(Q) What are the column names in our dataset?

habermandf.columns

Out[4]: Index(['age', 'year', 'nodes', 'status'], dtype='object')

In [5]: #(Q) How many data points for each class are present?
habermandf["status"].value_counts()

Out[5]: 1 225
2 81
Name: status, dtype: int64

This is an imbalanced dataset: the two classes differ widely in size, with 225 patients in class 1
(about 73.5%) and 81 patients in class 2 (about 26.5%).
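The degree of imbalance can be quantified from the counts printed above. The sketch below rebuilds a status column from those two counts only (225 and 81), so it does not need the CSV file; the variable names are illustrative:

```python
import pandas as pd

# Rebuild the status column from the counts printed above (225 and 81).
status = pd.Series([1] * 225 + [2] * 81, name="status")

counts = status.value_counts()
# normalize=True returns class shares instead of raw counts.
proportions = status.value_counts(normalize=True)
# Ratio of the majority class (1) to the minority class (2).
imbalance_ratio = counts.loc[1] / counts.loc[2]

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

On the real frame, `habermandf["status"].value_counts(normalize=True)` gives the same shares in one call.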

2-D Scatter Plot

In [6]: habermandf.plot(kind = "scatter", x = "age", y = "year")
plt.grid()
plt.show()


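In [7]: # 2-D scatter plot with color-coding for each survival status.
# How many feature-pair combinations exist? 3C2 = 3
# Relabel the class: 1 (survived 5 years or longer) -> "Positive", 2 -> "Negative"
habermandf["status"] = habermandf["status"].apply(lambda x: "Positive" if x == 1 else "Negative")

sns.set_style("whitegrid")
# `height` replaces the older `size` argument of FacetGrid
sns.FacetGrid(habermandf, hue="status", height=4) \
    .map(plt.scatter, "age", "year") \
    .add_legend()
plt.show()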

Observation(s):
1. Positive: the patient survived 5 years or longer
2. Negative: the patient died within 5 years

The two classes overlap heavily in the age vs. year plot, so these two features alone cannot separate them.

Pair-plot

In [8]: # pairwise scatter plot: Pair-Plot

plt.close();
sns.set_style("whitegrid");
sns.pairplot(habermandf, hue="status", height=3);
plt.show()

Histogram, PDF, CDF


In [9]: # What about 1-D scatter plot using just one feature?
haberman_pos=habermandf.loc[habermandf["status"]=='Positive'];
haberman_neg=habermandf.loc[habermandf["status"]=='Negative'];

plt.plot(haberman_pos['nodes'],np.zeros_like(haberman_pos['nodes']),'o',label='Positive')
plt.plot(haberman_neg['nodes'],np.zeros_like(haberman_neg['nodes']),'o',label='Negative')
plt.ylabel("Counts")
plt.xlabel("Nodes")
plt.title("Haberman")
plt.legend()
plt.show()


In [10]: # Nodes
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"nodes")\
.add_legend()
plt.title('Haberman')
plt.show()


In [11]: # Year
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"year")\
.add_legend()
plt.title('Haberman')
plt.show()


In [12]: # Age
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"age")\
.add_legend()
plt.title('Haberman')
plt.show()


In [13]: #Plot CDF

counts,bin_edges=np.histogram(haberman_pos['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF (Positive)')
plt.plot(bin_edges[1:],cdf,label='CDF (Positive)')
plt.xlabel('year')
plt.ylabel('proportion of patients')
plt.title("Haberman")
plt.legend()
plt.show()

[0.29333333 0.17333333 0.2 0.16444444 0.16888889]

[58. 60.2 62.4 64.6 66.8 69. ]
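The first printed array holds the share of Positive patients in each of the five year bins, and the second holds the bin edges. The CDF plotted above is simply the running total of those shares. A short worked sketch that starts from the printed values (copied from the output above, so the CSV file is not needed):

```python
import numpy as np

# Bin shares and edges printed by the cell above (Positive class, year).
pdf = np.array([0.29333333, 0.17333333, 0.2, 0.16444444, 0.16888889])
bin_edges = np.array([58.0, 60.2, 62.4, 64.6, 66.8, 69.0])

# The CDF at each bin's right edge is the running total of the bin shares.
cdf = np.cumsum(pdf)

for right_edge, value in zip(bin_edges[1:], cdf):
    print(f"year <= {right_edge:.1f}: {value:.3f}")
```

The last CDF value comes out at 1 (up to rounding of the printed shares), which is a quick sanity check that the shares cover all Positive patients.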


In [14]: # Plots of CDF of Year for Status (Positive/Negative)

counts,bin_edges=np.histogram(haberman_neg['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)

cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF (Negative)')
plt.plot(bin_edges[1:],cdf,label='CDF (Negative)')
plt.xlabel('year')
plt.ylabel('proportion of patients')
plt.title("Haberman")
plt.legend()
plt.show()

[0.29333333 0.17333333 0.2 0.16444444 0.16888889]

[58. 60.2 62.4 64.6 66.8 69. ]
[0.30864198 0.12345679 0.19753086 0.2345679 0.13580247]
[58. 60.2 62.4 64.6 66.8 69. ]
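Because both classes share the same bin edges, their binned CDFs can be compared value by value. The sketch below works only from the shares printed above; the largest vertical gap it computes is an informal summary chosen here for illustration, not a statistic used in the notebook:

```python
import numpy as np

# Bin shares printed above for each class (identical bin edges for both).
pdf_pos = np.array([0.29333333, 0.17333333, 0.2, 0.16444444, 0.16888889])
pdf_neg = np.array([0.30864198, 0.12345679, 0.19753086, 0.2345679, 0.13580247])
bin_edges = np.array([58.0, 60.2, 62.4, 64.6, 66.8, 69.0])

cdf_pos = np.cumsum(pdf_pos)
cdf_neg = np.cumsum(pdf_neg)

# Largest vertical gap between the two binned CDFs.
max_gap = np.max(np.abs(cdf_pos - cdf_neg))

print(np.round(cdf_pos, 3))
print(np.round(cdf_neg, 3))
print(round(float(max_gap), 3))
```

The gap stays below 0.04 at every bin edge, which suggests that year of operation is distributed similarly in the two classes and is unlikely to separate them on its own.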


Box plot and Whiskers

In [15]: sns.boxplot(x='status',y='year',data=habermandf)
plt.title("Haberman")
plt.show()
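The box in the plot above spans the first to third quartile, the line inside it marks the median, and by default the whiskers reach the most extreme points within 1.5 times the interquartile range. A minimal sketch of those quantities, assuming the default whisker rule; `box_stats` is a hypothetical helper and the years are made up for illustration:

```python
import numpy as np

def box_stats(values):
    """Quartiles and whisker ends under the default 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers extend to the most extreme data points within 1.5 * IQR of the box.
    whisker_low = values[values >= q1 - 1.5 * iqr].min()
    whisker_high = values[values <= q3 + 1.5 * iqr].max()
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": whisker_low, "whisker_high": whisker_high}

# Made-up operation years (year - 1900), for illustration only.
example_years = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69]
stats = box_stats(example_years)
print(stats)
```

Calling the helper once per class, for example on `haberman_pos['year']` and `haberman_neg['year']`, would give the numbers behind each box.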

Violin Plots
In [16]: sns.violinplot(x='status',y='year',data=habermandf)
plt.title("Haberman")
plt.show()

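Conclusion:
No single attribute cleanly separates the two classes: the classes overlap heavily and the dataset is imbalanced.

1. Patients younger than about 35 years tended to survive 5 years or longer.

2. Patients older than about 75 years tended not to survive 5 years.

Both observations rest on very few patients at either extreme of age, so they are tentative rather than rules.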
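The two age thresholds can be checked by counting patients and survivors in each age band. A hedged sketch: `survival_by_age_band` is a hypothetical helper, the band edges follow the thresholds quoted above for integer ages (under 35, 35 to 75, over 75), and the rows are made up, so the printed figures say nothing about the real data:

```python
import pandas as pd

def survival_by_age_band(df, bins=(0, 35, 76, 120)):
    """Count patients and the share labelled "Positive" in each age band.

    Bands are left-closed: [0, 35), [35, 76), [76, 120).
    """
    bands = pd.cut(df["age"], bins=list(bins), right=False)
    survived = df["status"].eq("Positive")
    return (pd.DataFrame({"band": bands, "survived": survived})
            .groupby("band", observed=False)["survived"]
            .agg(patients="size", survival_rate="mean"))

# Made-up rows for illustration only; run on habermandf for the real figures.
toy = pd.DataFrame({
    "age": [30, 33, 40, 52, 60, 76, 80],
    "status": ["Positive", "Positive", "Negative", "Positive",
               "Negative", "Negative", "Negative"],
})
summary = survival_by_age_band(toy)
print(summary)
```

The patients column matters as much as the rate: a band with only a handful of patients cannot support a firm rule in either direction.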
