Breast Cancer Classification
ENDTERM REPORT
Submitted by
School of Engineering
INDEX
1) Introduction
2) Literature Review
3) Dataset
4) Data Visualisation
5) Correlation Matrix
6) K-Nearest Neighbors Classifier (KNN)
7) Naive Bayes Classifier (NB)
8) Confusion Matrix
9) Accuracy Score
10) ROC Curves
11) References
INTRODUCTION
Many women are diagnosed with breast cancer; after lung cancer, it is the second most common cause of cancer death in both the developed and the developing world. Every year about one million women are newly diagnosed with breast cancer and, according to the World Health Organization, half of them die, largely because the cancer is usually detected late. Breast cancer begins with a mutation in a single cell, which can either be shut down by the immune system or lead to uncontrolled cell division. It is characterized by gene mutations, constant pain, and changes in the size, color (redness), and skin texture of the breast.
Classification of breast cancer helps pathologists arrive at a systematic and objective prognosis; the most frequent classification is binary (benign cancer / malignant cancer).
The early diagnosis of breast cancer can significantly improve the prognosis and chance of survival, as it makes timely clinical treatment possible. Accurate classification of benign tumors can also spare patients unnecessary treatment. Thus, the correct diagnosis of breast cancer and the classification of patients into malignant or benign groups is the subject of much research. Because of their advantages in detecting critical features in complex breast cancer datasets, Machine Learning (ML) techniques are broadly used for the breast cancer classification problem; they provide high classification accuracy and effective diagnostic capabilities.
The relationship between breast cancer and machine learning is not recent: ML has been used for decades to classify tumors and other malignancies, to predict sequences of genes responsible for cancer, and to determine prognoses. The aim of classification is to assign each observation to the category it belongs to.
LITERATURE REVIEW:
Many studies have been carried out in the field of breast cancer classification. Some of them use mammography images, while others classify breast cancers with techniques such as the Softmax Discriminant Classifier (SDC), Linear Discriminant Analysis (LDA), and Fuzzy C-Means clustering. The k-nearest neighbors (KNN) algorithm is one of the most widely used algorithms in machine learning; in cancer classification, KNN has been evaluated in terms of false positive rates. Naive Bayesian classifiers are generally used to predict biological, chemical and physiological properties; in cancer classification they are sometimes combined with other classifiers, such as decision trees, to build prognostic or classification models. Different classification techniques have been developed for breast cancer diagnosis, and the accuracy of many of them has been evaluated on the Wisconsin breast cancer database. For example, the optimized learning vector quantization (LVQ) method achieved 96.7% accuracy, the big LVQ method was also evaluated on this dataset, and SVM-based diagnosis reached 97.13%, the highest accuracy reported in this literature.
DATASET:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
Classification type: Binary
Class distribution: 357 benign, 212 malignant
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3) Thirty real-valued features computed for each cell nucleus: the mean, standard error, and "worst" (largest) value of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
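As a minimal sketch of loading this dataset, assuming scikit-learn's bundled copy of the Wisconsin Diagnostic Breast Cancer data (the variable names df and diagnosis are our own choices, not from the report):

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin Diagnostic Breast Cancer data bundled with scikit-learn.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # scikit-learn encodes 0 = malignant, 1 = benign

print(df.shape)                        # (569, 31)
print(df["diagnosis"].value_counts())  # 357 benign (1), 212 malignant (0)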
DATA VISUALIZATION:
From these histograms we see that a feature like mean fractal dimension plays very little role in separating malignant from benign cases, whereas worst concave points or worst perimeter are useful features that give strong hints about the class of a sample. So even a single feature, e.g. worst perimeter, can be good enough to separate most malignant from benign cases, as sketched below.
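A hedged sketch of how such per-class histograms can be drawn with matplotlib, reusing the df and diagnosis column assumed above:

import matplotlib.pyplot as plt

# Compare the distribution of one feature for benign vs. malignant cases.
feature = "worst perimeter"   # try "mean fractal dimension" to see a weakly separating feature
benign = df.loc[df["diagnosis"] == 1, feature]
malignant = df.loc[df["diagnosis"] == 0, feature]

plt.hist(benign, bins=30, alpha=0.5, label="benign")
plt.hist(malignant, bins=30, alpha=0.5, label="malignant")
plt.xlabel(feature)
plt.ylabel("count")
plt.legend()
plt.show()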
CORRELATION MATRIX:
The pandas corr() function is used to find the pairwise correlation of all columns in the breast cancer dataframe. Correlation describes the strength of the relationship between two variables: a high/strong correlation means the two variables move closely together. The seaborn Python package allows the creation of heatmaps, which present the correlation matrix as a visualized matrix, as sketched below.
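A minimal sketch of the correlation heatmap, assuming the df built above:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlation of all feature columns, visualized with seaborn.
corr = df.drop(columns=["diagnosis"]).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()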
Before reducing dimensionality, the features are standardized by removing the mean (i.e. centring it at 0) and scaling to unit variance:
z = (x - u) / s
where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False).
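A short sketch of this standardization step with scikit-learn's StandardScaler (X and y are our own variable names):

from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["diagnosis"]).values
y = df["diagnosis"].values

scaler = StandardScaler()            # with_mean=True, with_std=True by default
X_scaled = scaler.fit_transform(X)   # applies z = (x - u) / s to every column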
Principal component analysis (PCA) is a technique for reducing the dimensionality of such
datasets, increasing interpretability but at the same time minimizing information loss. It does
so by creating new uncorrelated variables that successively maximize variance.
By analysing the correlation matrix, and since the dimensionality of our dataset is high, we use PCA to reduce the number of dimensions of our data.
PCA scatterplot with two components (benign and malignant classes), as sketched below.
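A sketch of the two-component PCA projection and scatterplot, assuming the standardized X_scaled and labels y from the previous step:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)  # project the standardized features onto 2 components

plt.scatter(X_pca[y == 1, 0], X_pca[y == 1, 1], alpha=0.5, label="benign")
plt.scatter(X_pca[y == 0, 0], X_pca[y == 0, 1], alpha=0.5, label="malignant")
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.legend()
plt.show()

print(pca.explained_variance_ratio_)  # fraction of variance captured by each component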
K-NEAREST NEIGHBORS CLASSIFIER:
The k-nearest neighbors algorithm takes data points, which may be quantitative or qualitative, and uses them to place a particular data point in a particular category or classification. The way this algorithm works is through demarcation lines and decision boundaries. In this algorithm, K is the data point that the operator is trying to find out more about; the operator wants to figure out which category K fits into. In order to do this, the algorithm draws a perimeter around K and studies the other data points within that perimeter. The data points within the perimeter push the decision towards their own category, so a different perimeter (a different number of neighbors) would lead to potentially different results for this algorithm. K-nearest neighbors is helpful for guiding machine learning and determining relationships while only knowing a limited amount of information. We split the data into training and testing sets, set the number of nearest neighbors to 3, and validate the k-NN score on the test data, as sketched below.
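A minimal sketch of this step; the 80/20 split and random_state are illustrative choices, not taken from the report:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out a test set from the standardized features.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=3)   # 3 nearest neighbors, as in the report
knn.fit(X_train, y_train)
print("k-NN test accuracy:", knn.score(X_test, y_test))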
NAÏVE BAYES CLASSIFIER:
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and
P(x|c). Look at the equation below:
P(c|x) = ( P(x|c) * P(c) ) / P(x)
Above,
● P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
● P(c) is the prior probability of class.
● P(x|c) is the likelihood which is the probability of a predictor given class.
● P(x) is the prior probability of the predictor.
There are three common types of Naive Bayes model in the scikit-learn library: Gaussian, Multinomial and Bernoulli. Since our features are continuous real-valued measurements, we have used the Gaussian Naive Bayes classifier.
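A minimal sketch of fitting the Gaussian Naive Bayes classifier on the same train/test split assumed above:

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes test accuracy:", nb.score(X_test, y_test))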
CONFUSION MATRIX:
In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).
It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and
identical sets of "classes" in both dimensions (each combination of dimension and class is a
variable in the contingency table).
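A short sketch of computing the confusion matrix for, e.g., the k-NN predictions assumed above:

from sklearn.metrics import confusion_matrix

y_pred = knn.predict(X_test)
# In scikit-learn's convention, rows are actual classes and columns are predicted classes.
print(confusion_matrix(y_test, y_pred))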
ACCURACY SCORE:
It is calculated from the confusion matrix, which contains the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The formula for the accuracy score is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
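The same quantity can be computed with scikit-learn, using the predictions assumed above:

from sklearn.metrics import accuracy_score

# Equivalent to (TP + TN) / (TP + TN + FP + FN) from the confusion matrix.
print("accuracy:", accuracy_score(y_test, y_pred))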
ROC CURVES:
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the
diagnostic ability of a binary classifier system as its discrimination threshold is varied.
It gives us the trade-off between the True Positive Rate (TPR) and the False Positive Rate
(FPR) at different classification thresholds.
The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is on
the x-axis.
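A hedged sketch of plotting the ROC curve, here for the Gaussian Naive Bayes model assumed above (treating benign = 1 as the positive class):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Scores are the predicted probabilities of the positive class.
y_score = nb.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)

plt.plot(fpr, tpr, label="AUC = %.3f" % roc_auc_score(y_test, y_score))
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()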
REFERENCES:
[1] M. Amrane, S. Oukid, I. Gagaoua and T. Ensari, "Breast cancer classification using machine learning," 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), Istanbul, 2018, pp. 1-4.
[2] S.K. Prabhakar, H. Rajaguru, "Performance Analysis of Breast Cancer Classification with
Softmax Discriminant Classifier and Linear Discriminant Analysis", In: Maglaveras N.,
Chouvarda I., de Carvalho P. (eds) Precision Medicine Powered by pHealth and Connected
Health. IFMBE Proceedings, vol 66. Springer, Singapore, 2018.
[4] https://towardsdatascience.com/building-a-simple-machine-learning-model-on-breast-cancer-data-eca4b3b99fa3