Breast Cancer Classification

This document provides a summary of a project to classify breast cancer tumors using machine learning techniques. It includes an introduction to breast cancer and machine learning, a literature review of previous classification studies, a description of the Wisconsin Breast Cancer Dataset used, data visualization, preprocessing steps like standardization and principal component analysis, and evaluations of k-Nearest Neighbors and Naive Bayes classification models. The models are compared using accuracy scores and confusion matrices to determine the most effective approach for this breast cancer classification task.

BREAST CANCER CLASSIFICATION

Project report submitted for EED363 Applied Machine Learning

ENDTERM REPORT

Submitted by

1. Satwik Boojala ( 1610110340 )

2. Kaliki Sai Preetham ( 1610110452 )

Department of Electrical Engineering

School of Engineering

INDEX
1) Introduction

2) Literature review

3) Dataset

4) Data Visualisation

5) Creating Training and Test data

6) Correlation matrix of the Data

7) Standardising the Data

8) Principal component analysis

9) Training and Evaluating different Classification Models

● K- Nearest Neighbors (KNN)

● Naive-Bayes (NB)

10) Confusion Matrix and Accuracy scores

11) ROC curves

12) References

INTRODUCTION
Many women are diagnosed with breast cancer. Second only to lung cancer, it is the second
leading cause of cancer death among women in both developed and developing countries. Every
year about one million women are newly diagnosed with breast cancer and, according to a World
Health Organization report, half of them will die because the cancer is usually detected late.
Breast cancer begins with a mutation in a single cell; the body may shut that cell down, or the
mutation may lead to uncontrolled cell division. The disease is characterized by gene mutations,
constant pain, and changes in the size, color (redness) and skin texture of the breasts.
Classifying breast cancer helps pathologists reach a systematic and objective prognosis. The
most frequent classification is binary (benign tumor/malignant tumor).

Early diagnosis of breast cancer can significantly improve the prognosis and chance of
survival, because it allows timely clinical treatment. Furthermore, accurate classification of
benign tumors can spare patients unnecessary treatment. The correct diagnosis of breast cancer
and the classification of patients into malignant or benign groups are therefore the subject of
much research. Because of their advantages in detecting critical features in complex breast
cancer datasets, machine learning (ML) techniques are widely used for the breast cancer
classification problem. They can provide high classification accuracy and effective diagnostic
capabilities.

The relation between breast cancer and machine learning is not recent: ML has been used for
decades to classify tumors and other malignancies, to predict the gene sequences responsible
for cancer, and to determine prognosis. The aim of classification is to assign each observation
to the category it belongs to.

LITERATURE REVIEW:

Many studies have addressed breast cancer classification. Some used mammography images, while
others applied techniques such as the Softmax Discriminant Classifier (SDC), Linear
Discriminant Analysis (LDA), and Fuzzy C-Means clustering. The k-nearest neighbors (KNN)
algorithm is one of the most widely used algorithms in machine learning; in cancer
classification, its performance has been assessed in terms of false positive rates. Naive Bayes
classifiers (NBC) are generally used to predict biological, chemical and physiological
properties; in cancer classification they are sometimes combined with other classifiers, such
as decision trees, to determine prognosis or build classification models. The accuracy of many
of the techniques developed for breast cancer diagnosis was evaluated on data taken from the
Wisconsin breast cancer database. For example, the optimized learning vector method reached
96.7%, the big LVQ method was also evaluated, and an SVM reached 97.13%, the highest accuracy
reported in the literature.

DATASET:

Breast Cancer Wisconsin dataset:
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
Classification type: Binary
Class distribution: 357 benign, 212 malignant

Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:


a) Radius (mean of distances from center to points on the perimeter)
b) Texture (standard deviation of gray-scale values)
c) Perimeter
d) Area
e) Smoothness (local variation in radius lengths)
f) Compactness (perimeter^2 / area - 1.0)
g) Concavity (severity of concave portions of the contour)
h) Concave points (number of concave portions of the contour)
i) Symmetry
j) Fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
All feature values are recoded with four significant digits.
Missing attribute values: none
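
The figures above can be checked in a few lines. This is a minimal sketch that assumes the copy of the Wisconsin diagnostic data bundled with scikit-learn (load_breast_cancer) rather than the Kaggle CSV; the bundled copy has no ID column and encodes the diagnosis as 0 = malignant, 1 = benign.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the Kaggle CSV.
data = load_breast_cancer()
X, y = data.data, data.target
n_samples, n_features = X.shape

# Map the integer targets back to their class names and count them.
counts = Counter(str(data.target_names[label]) for label in y)

print(n_samples, n_features, dict(counts))
```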

Class distribution (B = benign, M = malignant): 357 benign, 212 malignant
DATA VISUALIZATION:

Histograms of all the feature vectors

These histograms show that a feature such as mean fractal dimension plays very little role in
separating malignant from benign cases, whereas worst concave points and worst perimeter are
useful features that give strong hints about the class. Even a single feature such as worst
perimeter can go a long way towards separating malignant from benign cases.
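
One rough way to put a number on this observation is to classify with a single feature and a single threshold. The sketch below is illustrative only: the midpoint-of-class-means threshold and the helper name midpoint_threshold_accuracy are assumptions made for this example, not part of the original analysis, and the data again comes from scikit-learn's bundled copy.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)
is_malignant = data.target == 0  # scikit-learn encodes malignant as 0


def midpoint_threshold_accuracy(feature_name):
    """Accuracy of a one-feature rule with the threshold halfway between the class means."""
    values = data.data[:, names.index(feature_name)]
    mal_mean = values[is_malignant].mean()
    ben_mean = values[~is_malignant].mean()
    threshold = (mal_mean + ben_mean) / 2
    # Predict malignant on the side of the threshold where the malignant mean lies.
    if mal_mean > ben_mean:
        predicted_malignant = values > threshold
    else:
        predicted_malignant = values < threshold
    return float((predicted_malignant == is_malignant).mean())


acc_worst_perimeter = midpoint_threshold_accuracy("worst perimeter")
acc_fractal_dimension = midpoint_threshold_accuracy("mean fractal dimension")
print(acc_worst_perimeter, acc_fractal_dimension)
```

A feature whose class histograms overlap almost completely should score close to chance under this rule, while a well-separated feature should score much higher.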

CREATE TRAINING AND TESTING DATA:


Scikit-Learn provides several functions for splitting a dataset into subsets. The simplest is
train_test_split. Its random_state parameter sets the random generator seed, so that the split
is reproducible. It also accepts multiple datasets with an identical number of rows and splits
them on the same indices (this is very useful, for example, if the labels are held in a
separate DataFrame). The split used here is:
80% of the samples for training
20% of the samples for testing
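
A minimal sketch of such a split follows. It assumes the conventional arrangement with 20% of the samples held out for testing (change test_size to match the split actually used), an arbitrary seed of 42, and scikit-learn's bundled copy of the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Features and labels are passed together so they are split on the same indices.
# test_size=0.2 and random_state=42 are assumptions for this illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```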

CORRELATION MATRIX:

The Pandas corr() function is used to find the pairwise correlation of all columns in the breast
cancer dataframe. Correlation measures the strength of the relationship between two variables:
a high (strong) correlation means that the two variables vary closely together.

Correlation matrix size: 30 x 30


A heatmap is a two-dimensional graphical representation of the data values contained in a
matrix. The seaborn Python package allows the creation of heatmaps, which can be tweaked using
matplotlib tools.
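
The correlation matrix itself takes one call. The sketch below assumes scikit-learn's bundled copy of the data loaded as a DataFrame (as_frame=True); the heatmap call is left as a comment so that the example runs without seaborn installed.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the 30 feature columns come from scikit-learn's bundled copy.
features = load_breast_cancer(as_frame=True).data

# Pairwise Pearson correlation of all 30 feature columns.
corr = features.corr()
print(corr.shape)

# Radius and perimeter are geometrically related, so they correlate strongly.
radius_perimeter = corr.loc["mean radius", "mean perimeter"]
print(round(radius_perimeter, 3))

# To draw the heatmap (requires seaborn and matplotlib):
# import seaborn as sns
# sns.heatmap(corr, cmap="coolwarm")
```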


STANDARDISING DATA:

Features are standardized by removing the mean (so that each feature has mean 0) and scaling to
unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples (or zero if with_mean=False) and s is the standard
deviation of the training samples (or one if with_std=False).

Standardization of a dataset is a common requirement for many machine learning estimators: they
might behave badly if the individual features do not more or less look like standard normally
distributed data (e.g. Gaussian with 0 mean and unit variance).
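
The formula can be checked against scikit-learn's StandardScaler. This is a sketch under assumptions: the split parameters (test_size=0.2, random_state=42) are illustrative, and the scaler is fitted on the training samples only so that the test data does not influence u and s.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn u and s from the training samples, then apply them to both subsets.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# The same transformation written out by hand: z = (x - u) / s.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_std, manual))
```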

PRINCIPAL COMPONENT ANALYSIS:

Principal component analysis (PCA) is a technique for reducing the dimensionality of a dataset,
increasing interpretability while minimizing information loss. It does so by creating new
uncorrelated variables that successively maximize variance.
The correlation matrix shows that many of the 30 features are strongly correlated, so we use
PCA to reduce the dimensionality of our data.
PCA scatterplot with two components (classes: benign, malignant)
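
A two-component projection of this kind can be sketched as follows. The example assumes scikit-learn's bundled copy of the data and, for simplicity, standardizes and projects all 569 samples at once; the scatterplot itself is omitted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_std = StandardScaler().fit_transform(X)

# Keep the two directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

explained = pca.explained_variance_ratio_
# The new variables are uncorrelated by construction.
component_corr = np.corrcoef(X_pca, rowvar=False)[0, 1]

print(X_pca.shape, explained.sum())
```

Plotting X_pca[:, 0] against X_pca[:, 1], colored by y, gives a scatterplot of the two classes in the reduced space.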
K- NEAREST NEIGHBORS CLASSIFIER:

k-Nearest Neighbors (k-NN) is a classification algorithm: it places a data point in a category
based on the labelled training points that lie closest to it. In this algorithm, k is the
number of neighbors consulted. To classify a query point, the algorithm finds the k training
points nearest to it (typically by Euclidean distance) and assigns the query point the class
held by the majority of those neighbors. A different value of k, and hence a different
neighborhood, can lead to a different result. k-NN is helpful for determining relationships when
only a limited amount is known about the data, because it makes no assumption about the
underlying distribution.
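
The neighbor vote can be sketched from scratch in a few lines. The function name knn_predict and the toy points below are hypothetical, made up purely for illustration; they are not taken from the dataset.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical 2-D toy data: three benign (B) and three malignant (M) points.
X_toy = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
)
y_toy = np.array(["B", "B", "B", "M", "M", "M"])

label_near_benign = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
label_near_malignant = knn_predict(X_toy, y_toy, np.array([4.0, 4.0]), k=3)
print(label_near_benign, label_near_malignant)
```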

Finding the optimum k value:


The plot of accuracy against the number of neighbors shows that k = 3 gives the highest
accuracy across the training and testing data. We therefore set the number of neighbors to 3
and evaluate the k-NN score.
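
A search of this kind might look like the sketch below. The range of k values (1 to 15), the split parameters and the use of a scaling pipeline are assumptions for illustration, so the best k it reports need not match the value found in the report.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# For each k, record (training accuracy, testing accuracy).
scores = {}
for k in range(1, 16):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

# Pick the k with the highest testing accuracy.
best_k = max(scores, key=lambda k: scores[k][1])
print(best_k, scores[best_k])
```

Plotting both accuracies against k reproduces the kind of curve used to choose the number of neighbors.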
NAÏVE BAYES CLASSIFIER:

Naive Bayes is a classification technique based on Bayes' theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite
its simplicity, Naive Bayes can sometimes outperform more sophisticated classification
methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c):

P(c|x) = P(x|c) P(c) / P(x)

In this equation,

● P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
● P(c) is the prior probability of class.
● P(x|c) is the likelihood which is the probability of a predictor given class.
● P(x) is the prior probability of the predictor.
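
A small worked example shows how these four quantities fit together. The priors below are the class proportions of the dataset (212 malignant and 357 benign out of 569); the two likelihood values are hypothetical numbers chosen only to make the arithmetic concrete.

```python
# Prior probability of each class, P(c), from the class distribution.
prior = {"malignant": 212 / 569, "benign": 357 / 569}

# Hypothetical likelihoods P(x|c) of observing some feature value x in each class.
likelihood = {"malignant": 0.80, "benign": 0.05}

# P(x): total probability of observing x, summed over both classes.
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior P(c|x) for each class by Bayes' theorem.
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

print(posterior)
```

Although the benign class is more common a priori, the observation is much more likely under the malignant class in this made-up case, so the posterior favors malignant.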

The scikit-learn library provides several Naive Bayes models, including:

● Gaussian: It is used in classification and it assumes that features follow a normal
distribution.
● Multinomial: It is used for discrete counts.
● Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros
and ones).

Since the features in our problem are continuous, real-valued measurements, we have used the
Gaussian type of Naïve Bayes classifier.
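
Fitting a Gaussian Naive Bayes model with scikit-learn can be sketched as follows. The split parameters are illustrative assumptions and the model is fitted on the raw features of scikit-learn's bundled copy of the data, so the accuracy it prints is not the figure obtained in the report.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# GaussianNB models each feature within each class as a normal distribution.
nb = GaussianNB()
nb.fit(X_train, y_train)

nb_accuracy = nb.score(X_test, y_test)
# Posterior probability of each class for every test sample.
nb_posteriors = nb.predict_proba(X_test)

print(nb_accuracy)
```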
CONFUSION MATRIX:

In the field of machine learning, and specifically the problem of statistical classification, a
confusion matrix, also known as an error matrix, is a specific table layout that allows
visualization of the performance of an algorithm, typically a supervised learning one (in
unsupervised learning it is usually called a matching matrix). Each row of the matrix
represents the instances in a predicted class while each column represents the instances in an
actual class (or vice versa). The name stems from the fact that it makes it easy to see if the
system is confusing two classes (i.e. commonly mislabelling one as another).
It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and
identical sets of "classes" in both dimensions (each combination of dimension and class is a
variable in the contingency table).

True positive (TP) - Number of positives correctly predicted

True negative (TN) - Number of negatives correctly predicted

False positive (FP) - Number of negatives predicted as positives

False negative (FN) - Number of positives predicted as negatives
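
The four counts can be read off scikit-learn's confusion_matrix. The labels below are hypothetical toy values (1 = positive, 0 = negative), not model output from the report. In scikit-learn's layout the rows are the actual classes and the columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten samples: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tp, tn, fp, fn)
```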

ACCURACY SCORE:
It is calculated from the confusion matrix, which holds the numbers of true positives (TP),
true negatives (TN), false positives (FP) and false negatives (FN). The formula for the
accuracy score is

Accuracy = (TP + TN) / (TP + TN + FP + FN)
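
The accuracy score can be checked by hand from the four counts. The counts and labels below are hypothetical toy values and the helper name accuracy_from_counts is made up for this sketch; the result is compared with scikit-learn's accuracy_score on matching labels.

```python
from sklearn.metrics import accuracy_score


def accuracy_from_counts(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels giving TP = 3, TN = 4, FP = 2, FN = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

manual_accuracy = accuracy_from_counts(tp=3, tn=4, fp=2, fn=1)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```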
ROC CURVES:

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the
diagnostic ability of a binary classifier system as its discrimination threshold is varied.
It gives us the trade-off between the True Positive Rate (TPR) and the False Positive Rate
(FPR) at different classification thresholds.

The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is on
the x-axis.

ROC scores for both k-NN and Naive Bayes
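
ROC points and the area under the curve (AUC) for the two classifiers can be computed as sketched below. Several choices here are assumptions for illustration: the split parameters, the scaling pipeline for k-NN, treating malignant as the positive class, and using scikit-learn's bundled copy of the data. The plot itself is omitted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# scikit-learn encodes malignant as 0; relabel so that malignant is the positive class.
y_malignant = (y == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_malignant, test_size=0.2, random_state=42
)

models = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "Naive Bayes": GaussianNB(),
}

roc_points = {}
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Predicted probability of the positive (malignant) class.
    prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, prob)
    roc_points[name] = (fpr, tpr)
    auc_scores[name] = roc_auc_score(y_test, prob)

print(auc_scores)
```

Plotting each tpr against its fpr, with FPR on the x-axis and TPR on the y-axis, gives the ROC curve for that model.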


REFERENCES:

[1] M. Amrane, S. Oukid, I. Gagaoua and T. Ensari, "Breast cancer classification using
machine learning," 2018 Electric Electronics, Computer Science, Biomedical Engineerings'
Meeting (EBBT), Istanbul, 2018, pp. 1-4.

[2] S.K. Prabhakar, H. Rajaguru, "Performance Analysis of Breast Cancer Classification with
Softmax Discriminant Classifier and Linear Discriminant Analysis", In: Maglaveras N.,
Chouvarda I., de Carvalho P. (eds) Precision Medicine Powered by pHealth and Connected
Health. IFMBE Proceedings, vol 66. Springer, Singapore, 2018.

[3] P. Bhuvaneswari, B. Therese, "Detection of Cancer in Lung with K-NN Classification
Using Genetic Algorithm", Procedia Materials Science, Vol. 10, pp. 433-440, 2015.

[4] https://towardsdatascience.com/building-a-simple-machine-learning-model-on-breast-cancer-data-eca4b3b99fa3

We also referred to the following resources, which helped us in the implementation:

1. Hands-On Machine Learning with Scikit-Learn and TensorFlow (book)
2. Kaggle
3. Geeks for Geeks
4. Stack Overflow
5. Towards Data Science
6. Medium.com
7. statisticsbyjim.com
8. Analyticstraining.com
9. Levelup.gitconnected.com
10. Github
