Haberman Datasets Analysis - Ipynb - Colaboratory

This document analyzes a breast cancer dataset called Haberman using Python. As loaded, it contains 305 observations with 4 features: age, operation year, number of positive axillary nodes, and survival status. The analysis summarizes the data, draws scatter plots and pair plots to visualize relationships between features, generates histograms to compare distributions between survival groups, and calculates metrics such as means and percentiles. Overall, it explores the dataset through statistical and visualization techniques to gain insight into factors that affect survival.

Uploaded by

Shyamal Hazarika

5/22/2019 Copy of haberman datasets analysis.ipynb - Colaboratory

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from google.colab import files

uploaded = files.upload()

haberman.csv(n/a) - 3103 bytes, last modified: 5/16/2019 - 100% done
Saving haberman.csv to haberman (1).csv

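# haberman.csv has no header row, so read_csv treats the first record as column
# names (see the Index output below); header=None would keep that record as data.
haberman = pd.read_csv("haberman.csv")

print(haberman.shape)

(305, 4)

1. The shape (rows, columns) gives (data points, features).

print(haberman.columns)
haberman.columns = ["age", "operation_year", "axil_nodes", "survival status"]
haberman.head()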

Index(['30', '64', '1', '1.1'], dtype='object')


   age  operation_year  axil_nodes  survival status
0   30              62           3                1
1   30              65           0                1
2   31              59           2                1
3   31              65           4                1
4   33              58          10                1
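The column index printed above (`['30', '64', '1', '1.1']`) suggests that the first data record was consumed as a header. A minimal sketch of the difference, using an in-memory copy of the first three records visible in this notebook rather than the real file (the `raw` string and the variable names are illustrative assumptions):

```python
import io
import pandas as pd

# First three records as they appear in the notebook output;
# the file itself carries no header line.
raw = "30,64,1,1\n30,62,3,1\n30,65,0,1\n"

names = ["age", "operation_year", "axil_nodes", "survival status"]

# Default behaviour: the first record is consumed as the header.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= labels the columns directly.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(with_default.shape)          # one row fewer than the input has
print(list(with_default.columns))  # the first record, used as labels
print(with_names.shape)
```

Reading the real file with `header=None, names=...` would keep that first record as data, so the row count would be one higher than the 305 reported here.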

1. This is an imbalanced dataset.
2. The columns are named ["age", "operation_year", "axil_nodes", "survival status"].

haberman.info()

https://colab.research.google.com/drive/1o1_WfES3ATAM2cXHemuFB6JVMSf76syP#scrollTo=nCsQxUiSTW7_&printMode=true 1/13
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 305 entries, 0 to 304
Data columns (total 4 columns):
age                305 non-null int64
operation_year     305 non-null int64
axil_nodes         305 non-null int64
survival status    305 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

haberman["survival status"].value_counts()

1    224
2     81
Name: survival status, dtype: int64


Observation: out of 305 observations, 224 patients lived more than 5 years and 81 died within 5 years.
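To put a number on the imbalance, the counts above can be turned into class proportions. This is only a sketch: the label column is rebuilt from the reported counts (224 and 81) instead of being read from the file, and `imbalance_ratio` is an illustrative name.

```python
import pandas as pd

# Rebuild the label column from the counts reported above (224 ones, 81 twos).
status = pd.Series([1] * 224 + [2] * 81, name="survival status")

# normalize=True returns fractions instead of raw counts.
proportions = status.value_counts(normalize=True)
imbalance_ratio = proportions.max() / proportions.min()

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

Roughly 73% of the patients fall in class 1, so a model that always predicts class 1 would already be right about 73% of the time.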

haberman.describe()

       age         operation_year  axil_nodes  survival status
count  305.000000  305.000000      305.000000  305.000000
mean    52.531148   62.849180        4.036066    1.265574
std     10.744024    3.254078        7.199370    0.442364
min     30.000000   58.000000        0.000000    1.000000
25%     44.000000   60.000000        0.000000    1.000000
50%     52.000000   63.000000        1.000000    1.000000
75%     61.000000   66.000000        4.000000    2.000000
max     83.000000   69.000000       52.000000    2.000000

Observation: age ranges from 30 to 83 with a median of 52. The maximum number of positive axil_nodes is 52. At least 25% of the patients have no positive axil_nodes, and 75% of the patients have 4 or fewer.
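A quartile only bounds a share of the data: a 25th percentile of 0 means at least a quarter of the values are 0, not exactly a quarter. The sketch below uses a small hypothetical array (not taken from haberman.csv) whose quartiles happen to match the axil_nodes quartiles in the table above:

```python
import numpy as np

# Hypothetical node counts, chosen so the quartiles are 0, 1 and 4.
nodes = np.array([0, 0, 0, 0, 1, 1, 2, 4, 4, 9, 23])

q25, q50, q75 = np.percentile(nodes, [25, 50, 75])

# Actual shares of the data at or below the quartile values.
share_zero = np.mean(nodes == 0)
share_le_q75 = np.mean(nodes <= q75)

print(q25, q50, q75)
print(round(share_zero, 3), round(share_le_q75, 3))
```

Here the share of zeros is about 36%, which is above the 25% lower bound that the quartile guarantees.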

2-D scatter plot

haberman.plot(kind='scatter', x='age', y='axil_nodes')
plt.show()


Observation: most patients have very few positive axil_nodes; the points cluster near zero.

sns.set_style("whitegrid")
# `height` replaces the deprecated `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="survival status", height=8) \
    .map(plt.scatter, 'age', 'axil_nodes') \
    .add_legend()
plt.show()


Observation: the orange and blue dots overlap, so the two classes cannot be distinguished on this plot. Many patients have 0 axil_nodes.

plt.close()
sns.set_style("whitegrid")
sns.pairplot(haberman, hue = 'survival status', vars =("age", "operation_year", "axil_no
plt.show()

/usr/local/lib/python3.6/dist-packages/seaborn/axisgrid.py:2065: UserWarning: The `si


warnings.warn(msg, UserWarning)

Observation: the pair plots do not let us distinguish the two classes because most of the points overlap.

UNIVARIATE ANALYSIS: HISTOGRAM, PDF, CDF

# `height` replaces the deprecated `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="survival status", height=5) \
    .map(sns.distplot, 'axil_nodes') \
    .add_legend()
plt.show()

# `height` replaces the deprecated `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="survival status", height=5) \
    .map(sns.distplot, "age") \
    .add_legend()
plt.show()

# `height` replaces the deprecated `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue='survival status', height=7) \
    .map(sns.distplot, "operation_year") \
    .add_legend()
plt.show()

Observation:
1. Only axil_nodes is useful for telling the two classes apart in these plots.
2. age and operation_year are not useful because their distributions overlap.
3. Among patients operated on in 1965, a larger number did not survive.

alive = haberman.loc[haberman['survival status'] == 1]
dead = haberman.loc[haberman["survival status"] == 2]

# density=True is redundant here: dividing by the sum of the bin values below
# yields the same per-bin probabilities as raw counts would.
counts, bin_edges = np.histogram(alive['axil_nodes'], bins=15, density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF for people who survived more than 5 years",
            "CDF for people who survived more than 5 years"])
plt.show()  # the original omitted the parentheses, so no figure was forced

[0.79017857 0.07142857 0.05357143 0.01785714 0.02232143 0.00892857
 0.00892857 0.00892857 0.00446429 0.00892857 0.         0.
 0.         0.         0.00446429]
[ 0.          3.06666667  6.13333333  9.2        12.26666667 15.33333333
 18.4        21.46666667 24.53333333 27.6        30.66666667 33.73333333
 36.8        39.86666667 42.93333333 46.        ]
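The PDF/CDF computation in this cell can be wrapped in a small helper. The sketch below is one possible way to write it; the function name `empirical_pdf_cdf` and the sample array are assumptions for illustration, not part of the notebook. It also checks that raw counts and `density=True` give the same per-bin probabilities once both are divided by their sum.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=15):
    """Return (bin_edges, pdf, cdf); pdf holds the probability mass per bin."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    return bin_edges, pdf, cdf

# Hypothetical sample, not taken from haberman.csv.
sample = np.array([0, 0, 0, 1, 1, 2, 3, 5, 8, 13, 21, 46])
edges, pdf, cdf = empirical_pdf_cdf(sample)

# Equal-width bins: density values renormalised by their sum match the pdf.
density, _ = np.histogram(sample, bins=15, density=True)
same_result = np.allclose(pdf, density / density.sum())

print(pdf)
print(cdf[-1], same_result)
```

Because `np.histogram` uses equal-width bins by default, the density values differ from the counts only by a constant factor, which the renormalisation cancels.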

# density=True is redundant here: dividing by the sum of the bin values below
# yields the same per-bin probabilities as raw counts would.
counts, bin_edges = np.histogram(dead['axil_nodes'], bins=15, density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF for people who died within 5 years",
            "CDF for people who died within 5 years"])
plt.show()  # the original omitted the parentheses, so no figure was forced

[0.48148148 0.12345679 0.11111111 0.09876543 0.04938272 0.03703704
 0.07407407 0.         0.         0.         0.01234568 0.
 0.         0.         0.01234568]
[ 0.          3.46666667  6.93333333 10.4        13.86666667 17.33333333
 20.8        24.26666667 27.73333333 31.2        34.66666667 38.13333333
 41.6        45.06666667 48.53333333 52.        ]
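The two printed PDFs can be compared through their cumulative sums. The sketch below copies the printed values into arrays (the variable names are illustrative). The two histograms use different bin widths (46/15 versus 52/15), so the bins line up only loosely. For integer node counts, however, the first two bins of both histograms cover the same values (0 to 3, then 4 to 6), which makes the first two CDF entries comparable.

```python
import numpy as np

# Per-bin probabilities copied from the two outputs above.
pdf_alive = np.array([0.79017857, 0.07142857, 0.05357143, 0.01785714,
                      0.02232143, 0.00892857, 0.00892857, 0.00892857,
                      0.00446429, 0.00892857, 0.0, 0.0, 0.0, 0.0, 0.00446429])
pdf_dead = np.array([0.48148148, 0.12345679, 0.11111111, 0.09876543,
                     0.04938272, 0.03703704, 0.07407407, 0.0, 0.0, 0.0,
                     0.01234568, 0.0, 0.0, 0.0, 0.01234568])

cdf_alive = np.cumsum(pdf_alive)
cdf_dead = np.cumsum(pdf_dead)

# Share of each group with at most 3 nodes, then with at most 6 nodes.
print(round(cdf_alive[0], 3), round(cdf_dead[0], 3))
print(round(cdf_alive[1], 3), round(cdf_dead[1], 3))
```

Under that reading, about 79% of the patients who survived have 3 or fewer positive nodes, against about 48% of those who died. Passing the same explicit bin edges to both `np.histogram` calls would make every bin comparable.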

MEAN, MEDIAN, PERCENTILE

print("mean:")
print(np.mean(alive["age"]))
print(np.mean(dead["age"]))


mean:
52.11607142857143
53.67901234567901

print(np.mean(alive["operation_year"]))
print(np.mean(dead["operation_year"]))

62.857142857142854
62.82716049382716

print(np.mean(alive["axil_nodes"]))
print(np.mean(dead["axil_nodes"]))

2.799107142857143
7.45679012345679

print('std')
print(np.std(alive["age"]))
print(np.std(dead["age"]))

std
10.913004640364269
10.10418219303131

print(np.std(alive['operation_year']))
print(np.std(dead["operation_year"]))

3.2220145175061514
3.3214236255207883

print(np.std(alive["axil_nodes"]))
print(np.std(dead["axil_nodes"]))

5.869092706952767
9.128776076761632
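One detail to keep in mind when comparing these values with the `std` row of `describe()`: `np.std` defaults to the population formula (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`). A short sketch on the five ages shown in the `head()` output (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# First five ages from the head() output earlier in the notebook.
ages = pd.Series([30, 30, 31, 31, 33])

population_std = np.std(ages.to_numpy())      # divides by n
sample_std = np.std(ages.to_numpy(), ddof=1)  # divides by n - 1
pandas_std = ages.std()                       # pandas default is ddof=1

print(round(population_std, 4))
print(round(sample_std, 4), round(pandas_std, 4))
```

With a few hundred rows the gap is small, but it explains why the two tools do not print identical numbers for the same column.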

print("median")
print(np.median(alive['age']))
print(np.median(dead["age"]))

median
52.0
53.0

print(np.median(alive['operation_year']))
print(np.median(dead["operation_year"]))

63.0
63.0

print(np.median(alive['axil_nodes']))
print(np.median(dead['axil_nodes']))

0.0
4.0
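The gap between the mean and the median of axil_nodes (about 2.8 versus 0 for the patients who survived) is typical of a right-skewed column. A small sketch with hypothetical values, not taken from the dataset, shows how one large count moves the mean far more than the median:

```python
import numpy as np

# Hypothetical node counts, mostly zeros and small values.
nodes = np.array([0, 0, 0, 0, 1, 2, 3])
with_outlier = np.append(nodes, 46)  # add a single extreme value

print(np.mean(nodes), np.median(nodes))
print(np.mean(with_outlier), np.median(with_outlier))
```

This is why the notebook also reports the median and, further down, the median absolute deviation for each group.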

print('quantiles')

quantiles

print(np.percentile(alive["age"], np.arange(0, 100, 25)))
print(np.percentile(dead["age"], np.arange(0, 100, 25)))
print(np.percentile(alive["operation_year"], np.arange(0, 100, 25)))
print(np.percentile(dead["operation_year"], np.arange(0, 100, 25)))
print(np.percentile(alive["axil_nodes"], np.arange(0, 100, 25)))
print(np.percentile(dead["axil_nodes"], np.arange(0, 100, 25)))

[30. 43. 52. 60.]
[34. 46. 53. 61.]
[58. 60. 63. 66.]
[58. 59. 63. 65.]
[0. 0. 0. 3.]
[ 0.  1.  4. 11.]
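Each printed row holds four values because `np.arange(0, 100, 25)` stops before 100, so the rows give the minimum, 25th, 50th and 75th percentiles but not the maximum. A sketch of the difference on a hypothetical five-value array:

```python
import numpy as np

without_max = np.arange(0, 100, 25)  # 0, 25, 50, 75
with_max = np.arange(0, 101, 25)     # 0, 25, 50, 75, 100

# Hypothetical values, not taken from haberman.csv.
sample = np.array([0, 1, 4, 11, 52])

print(np.percentile(sample, without_max))
print(np.percentile(sample, with_max))
```

Using 101 as the stop value adds the 100th percentile, which is the maximum of the column.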

print("90th percentile")

90th percentile

print(np.percentile(alive["age"], 90))
print(np.percentile(alive["operation_year"], 90))
print(np.percentile(alive["axil_nodes"], 90))

print(np.percentile(dead["age"], 90))
print(np.percentile(dead["operation_year"], 90))
print(np.percentile(dead["axil_nodes"], 90))

67.0
67.0
8.0
67.0
67.0
20.0

from statsmodels import robust

print("Median absolute deviation ")
print(robust.mad(alive["age"]))
print(robust.mad(alive["operation_year"]))
print(robust.mad(alive["axil_nodes"]))
print(robust.mad(dead["age"]))
print(robust.mad(dead["operation_year"]))
print(robust.mad(dead["axil_nodes"]))
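The values this cell prints are not raw median absolute deviations. They are consistent with a raw MAD divided by roughly 0.6745 (the 0.75 quantile of the standard normal), which rescales the MAD so that it estimates a standard deviation for normally distributed data. The sketch below reproduces that scaling with NumPy only; `scaled_mad`, `NORMAL_SCALE` and the sample ages are illustrative assumptions, not statsmodels code.

```python
import numpy as np

# 1 / 0.6744897501960817, the reciprocal of the standard normal 0.75 quantile.
NORMAL_SCALE = 1.482602218505602

def scaled_mad(values, scale=NORMAL_SCALE):
    """Median absolute deviation around the median, times a scale factor."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return scale * np.median(np.abs(values - center))

# Hypothetical ages: median 52, absolute deviations 12, 7, 0, 6, 18.
ages = [40, 45, 52, 58, 70]
print(scaled_mad(ages))             # 7 * 1.4826...
print(scaled_mad(ages, scale=1.0))  # raw MAD

# A raw MAD of 9 would print as 13.3434..., matching the first value reported.
print(9 * NORMAL_SCALE)
```

Read this way, the six reported values correspond to raw MADs of 9, 3, 0, 8, 3 and 4.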


Median absolute deviation
13.343419966550417
4.447806655516806
0.0
11.860817748044816
4.447806655516806
5.930408874022408

BOX PLOT AND WHISKERS

sns.boxplot(x='survival status', y='age', data=haberman)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb8a5ed8b38>

print("\n.......year......")
sns.boxplot(x='survival status', y='operation_year', data=haberman)

.......year......
<matplotlib.axes._subplots.AxesSubplot at 0x7fb8a5eb8e80>

print("\n......axil_nodes.....")
sns.boxplot(x="survival status", y="axil_nodes", data=haberman)


......axil_nodes.....
<matplotlib.axes._subplots.AxesSubplot at 0x7fb8a5e1d358>

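OBSERVATION:
1. For patients who died, the operation_year box plot places the upper part of the distribution (around the 75th percentile) between years 64 and 66, and the lower part (around the 25th percentile) between years 58 and 63.
2. 75% of the people operated on in 1965 survived.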
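When reading these box plots it helps to know where the whiskers can end. With seaborn's default setting (`whis=1.5`), points more than 1.5 times the interquartile range beyond the box are drawn as outliers. The sketch below applies that rule to the axil_nodes quartiles printed in the quantiles cell (0 and 3 for survivors, 1 and 11 for non-survivors); `tukey_fences` is an illustrative helper name.

```python
def tukey_fences(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Quartiles of axil_nodes taken from the quantiles output above.
alive_fences = tukey_fences(0.0, 3.0)
dead_fences = tukey_fences(1.0, 11.0)

print(alive_fences)
print(dead_fences)
```

The whiskers themselves stop at the most extreme data points inside these limits, so for survivors any patient with more than 7.5 positive nodes appears as an individual outlier point.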

VIOLIN PLOT

sns.violinplot(x='survival status', y='age', data=haberman, size=8)
plt.show()

sns.violinplot(x="survival status", y='operation_year', data=haberman, size=8)
plt.show()


sns.violinplot(x='survival status', y='axil_nodes', data=haberman, size=8)
plt.show()

sns.jointplot(x='age', y='operation_year', data=dead, kind='kde')
plt.show()


sns.jointplot(x='axil_nodes', y='operation_year', data=alive, kind='kde')
plt.show()
