Project Report
decision tree which are among the most influential data mining algorithms.
Breast cancer can be detected early during a screening examination through
mammography or with a portable cancer diagnostic tool. Cancerous breast
tissues change with the progression of the disease, and these changes can be
directly linked to cancer staging. The stage of breast cancer (I–IV) describes
how far a patient's cancer has proliferated. Indicators such as tumour size,
lymph node metastasis and distant metastasis are used to determine the
stage. To prevent the cancer from spreading, patients undergo breast cancer
surgery, chemotherapy, radiotherapy and endocrine therapy.
The goal of this research is to classify patients as Malignant or Benign and
to determine how to parametrize our classification techniques so as to
achieve high accuracy. We examine several datasets and how further Machine
Learning algorithms can be used to characterize breast cancer, with the aim
of minimizing the error rate. 10-fold cross-validation, a Machine Learning
evaluation technique, is used in Jupyter to evaluate the models and analyse
the data in terms of effectiveness and efficiency.
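The 10-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration of how the sample indices are split into folds, not the evaluation code used in the report; the function name `k_fold_indices` and the sample count of 569 (the size of the Wisconsin Diagnostic dataset) are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # Spread samples as evenly as possible: the first (n % k) folds get one extra.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Assumed sample count; every sample lands in exactly one test fold.
folds = list(k_fold_indices(569, k=10))
```

Each model would be trained on `train_idx` and scored on `test_idx` ten times, and the ten scores averaged.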
1.1 MOTIVATION
Breast cancer is one of the most common diseases affecting women
worldwide. An estimated 246,660 new cases of invasive breast cancer were
expected to be diagnosed among women in the U.S. during 2016, along with
an estimated 40,450 deaths. Developments in breast cancer research and its
prediction motivated this work. The Wisconsin Breast Cancer Dataset from the
UCI Machine Learning Repository was attractive because it provides a large
sample of patients described by multivariate attributes.
of the disease: survival time, life expectancy, progression, drug sensitivity, etc.
The survivability rate and cancer relapse depend heavily on the medical
treatment and the quality of the diagnosis. Data pre-processing is a data
mining technique used to filter data into a usable format. Real-world datasets
come in many different formats and are rarely available in the form we
require, so they must be converted into an understandable format. Data
pre-processing is a proven method of resolving such issues. For pre-processing
we have used the standardization method.
The following is the summary of the existing works on the given domain:
No. | Title | Date (Source) | Dataset | Objective | Tool | Algorithm | Accuracy
3 | Lung cancer detection and classification using machine learning algorithm | 2021 | TCIA dataset and LIDC | Lung cancer detection and SVM classifier for classification | - | U-Net, Random Forest, Convolutional Network | 99%
4 | A novel approach to perform analysis and prediction on breast cancer dataset using R | 2018 (ResearchGate) | - | Analyse the breast cancer data and do efficiency prediction | R | Decision tree, Random forest, SVMs | -
5 | A deep learning model based on concatenation approach for the diagnosis of brain tumor | 5th Mar, 2020 (IEEE Access) | Brain tumor dataset | A method of multi-level feature extraction and concatenation for early diagnosis of brain tumor | Python | Inception-V3 and DenseNet201 | 99.34% (Inception-V3), 99.51% (DenseNet201)
7 | Automated breast mass classification system using deep learning in digital mammogram | 5th Apr, 2021 (IEEE Xplore) | MIAS (Mammographic Image Analysis Society) and DDSM (Digital Database for Screening Mammography) | Combination of various techniques to classify the breast mass into benign, malignant and normal | Python | Random forest and DL | 97.50% (MIAS), 96% (DDSM)
8 | Deep learning to improve breast cancer detection on screening mammography | 29th Aug, 2019 | CBIS-DDSM | Improve cancer detection with deep learning | Python | DL using the CNN | 97%
10 | COVID-19 detection through transfer learning using multimodal imaging data | 14th July, 2020 (IEEE Access) | COVID-19 x-ray and CT scan | COVID-19 detection through transfer learning | - | VGG19 and DenseNet | 98.40%
11 | Feature selection from colon cancer dataset for cancer classification using Artificial Neural Network (ANN) | 2018 | Colon cancer dataset | Cancer classification using ANN and SVMs | MATLAB | ANN and SVMs | 98.40%
12 | An automated detection of breast cancer diagnosis and prognosis based on machine learning using ensemble of classifier | 14th Apr, 2022 (IEEE Xplore) | DDSM/323 | Benign vs malignant | CAD (computer aided diagnostic) | ANN, SVM and KNN | 98.83%
13 | A sustainable IoHT based computationally intelligent healthcare monitoring system for lung cancer risk detection | 9th June, 2021 (Elsevier) | Lung cancer dataset | Overcome the rise of lung cancer diseases | Python | Greedy Best First Search (GBFS) | 98.80% (Random Forest)
14 | Deep-chest, multiclassification deep learning model for diagnosing COVID-19 pneumonia and lung cancer chest diseases | 2022 (Elsevier) | COVID-CT; Chest X-ray dataset and CT (computed tomography) images | Diagnose and measure diseases with a deep learning model | Deep learning (AI) | VGG19+(CNN), ResNet152V2 | 98.05%, 95.31%
15 | Automatic detection and classification of skin cancer using deep transfer learning | 30th May, 2022 (Sensors) | HAM10000, 10015 dermoscopic images | On the raw deep transfer learning in classifying images of skin lesion | MATLAB R2021a | Deep transfer learning of a CNN | 82.90%
16 | Prediction factors for survival of breast cancer patient | 19(1), 1-17, 2019 | Hospital based dataset, n=8066, with diagnosis information between 1993 and 2016 | Build models for detecting and visualising significant prognostic indicators of survival rate | Python | Decision tree and Random forest | 79.8% (Decision tree), 82.7% (Random forest)
17 | Deep learning model on concatenation approach for the diagnosis of brain tumour | 5 March 2020 (IEEE Access) | Brain dataset comprised of 3064 T1-weighted contrast image of 233 | Features concatenation using pre-trained model as compared to the current research method for brain tumour classification | Python | Inception-v3 and DenseNet201 | 99.34% (Inception-v3) and 99.51% (DenseNet201)
19 | Detection of skin cancer based on skin lesion images using deep learning | 30 May 2022 (Sensors) | 3533 skin lesions (benign, malignant and melanocytic tumour) | Deep learning method CNN was used to detect malignant and benign using ISIC2018 dataset | Python | CNN, Resnet50, Inception V3 | 83.2% (CNN), 83.7% (Resnet50), 85.8% (Inception V3)
20 | Comparison of nomogram with machine learning techniques for prediction of overall survival in patients with tongue cancer | 31 May 2013 (Springer Link) | 7596 tongue cancer patients | Algorithms were evaluated in terms of ROC curve and accuracy value and the result was compared with nomogram to predict survival of patients | Python | Decision tree and nomogram | 88.7% (Decision tree) and 60.4% (nomogram)
21 | Analysis of breast cancer detection using different machine learning | 27 April 2020 (IEEE Xplore) | WBC dataset (699 instances and 11 attributes) and Breast cancer dataset (286 instances and 10 attributes) | Proposing a suitable method that can manage the imbalanced dataset and the missing values to enhance the classifier's performance | WEKA 3.8.3 | J48, NB, SMO | 98.2% (J48), 99.56% (SMO) and 99.24% (NB)
22 | Lung cancer prediction using machine learning and advanced imaging techniques | 2021 (IEEE Access) | 3593 LUNGRADS | Reduce the variability in assessing and reporting the lung cancer risk between interpreting physicians | CAD software | SVM | 98.56%
23 | Breast cancer prognosis using a machine learning approach | 2019 | N=318 (training set) | Used to predict outcomes in individual cancer patients | Python | Kernel-based learning | 96.30%
24 | Breast cancer prediction using machine learning | 2020 (www.researchgate.net) | Electronic health record | 5 year survivability prediction | WEKA | Logistic regression | 92.30%
25 | Breast cancer prediction using machine learning approach | 2020 (www.researchgate.net) | Wisconsin breast cancer | Patient features sorted out from data materials are statistically tested | WEKA | KNN and SVM | 96.85% (KNN) and 96.85% (SVM)
26 | Breast cancer prediction using machine learning approach | 2020 (www.researchgate.net) | UC Irvine machine learning repository | Decision tree is the best predictor on holdout sample | WEKA | Naïve Bayes, J48 decision tree and bagging algorithm | 96.50%
27 | Breast cancer prediction using machine learning approach | 2020 (www.researchgate.net) | Cancer society | Requires less input parameter, performing well in the low noise dataset | WEKA | ADABOOST | 97.50%
28 | Breast cancer prediction using machine learning approach | 2020 (www.researchgate.net) | Cancer society | Getting higher accuracy | MATLAB | Logistic and Neural network | 96.30%
29 | Breast cancer prediction using machine learning approach | 2020 (www.researchgate.net) | BCI Dataset | Reduce the variability in assessing and reporting the lung cancer risk between interpreting physicians | CAD system | Logistic regression and back propagation | 94.20%
30 | Breast cancer prediction using machine learning approach | 2020 (www.researchgate.net) | Gene expression dataset collection | Build models for detecting and visualising significant prognostic indicators of survival rate | WEKA | C4.5 Bagging and ADABOOST decision trees | 93.29% (C4.5 Bagging), 92.62% (ADABOOST)
2. PROPOSED METHODOLOGY
Fig. 1: Phases of Machine Learning (Data Pre-processing, Data Preparation,
Feature Selection, Feature Projection, Feature Scaling, Model Selection,
Prediction)
The process consists of seven phases, which are elaborated below:
Phase 1 - Pre-Processing Data
The first phase is to collect the data of interest for pre-processing and for
applying classification and regression methods. Data pre-processing is a data
mining technique that transforms raw data into an understandable format.
Real-world data is often incomplete, inconsistent, lacking in certain
behaviours or trends, and likely to contain many errors. Data pre-processing
is a proven method of resolving such issues and prepares raw data for further
processing. We have used the standardization method to pre-process the UCI
dataset. This step is very important because the quality and quantity of the
data gathered directly determine how good the predictive model can be. In
this case we collect breast cancer samples labelled Benign or Malignant. This
will be our training data.
Data Preparation is where we load our data into a suitable place and prepare
it for use in machine learning training. We first put all our data together
and then randomize the ordering.
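The standardization and shuffling steps described in this phase can be sketched as below. This is a minimal pure-Python illustration under stated assumptions, not the report's pipeline: the helper `standardize_columns` and the toy values are hypothetical, and the z-score uses the population standard deviation.

```python
import random
import statistics

def standardize_columns(rows):
    """Rescale every column to zero mean and unit variance (z-score)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for a constant column.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical toy rows (radius_mean, area_mean); values are illustrative only.
samples = [[10.0, 300.0], [14.0, 600.0], [18.0, 900.0], [22.0, 1200.0]]
labels = ["B", "B", "M", "M"]

# Shuffle features and labels together so each row keeps its own label.
paired = list(zip(samples, labels))
random.Random(42).shuffle(paired)
shuffled_samples, shuffled_labels = map(list, zip(*paired))

scaled = standardize_columns(shuffled_samples)
```

After this step every column has mean 0 and standard deviation 1, so no single attribute dominates purely because of its unit.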
Data File and Feature Selection: We use the Breast Cancer Wisconsin
(Diagnostic) Data Set from the Kaggle repository. Out of 31 parameters we
have selected about 8-9. Our target parameter is the breast cancer diagnosis,
malignant or benign. We have used the Wrapper Method for feature selection.
The important features found by the study are: Concave points worst, Area
worst, Area se, Texture worst, Texture mean, Smoothness worst, Smoothness
mean, Radius mean, Symmetry mean.
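A wrapper method scores candidate feature subsets by training and evaluating a model on each of them. The report does not say which wrapper variant or estimator was used, so the sketch below is only an assumption: greedy forward selection with a leave-one-out nearest-centroid classifier as a stand-in model, run on hypothetical toy data in which only the second column separates the classes.

```python
def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier on chosen features."""
    correct = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        # Build class centroids from every sample except the held-out one.
        sums, counts = {}, {}
        for j, (other, other_label) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(other_label, [0.0] * len(feature_idx))
            for k, f in enumerate(feature_idx):
                acc[k] += other[f]
            counts[other_label] = counts.get(other_label, 0) + 1

        def squared_distance(cls):
            return sum((row[f] - sums[cls][k] / counts[cls]) ** 2
                       for k, f in enumerate(feature_idx))

        if min(sums, key=squared_distance) == label:
            correct += 1
    return correct / len(rows)

def forward_select(rows, labels, n_features):
    """Greedy wrapper: repeatedly add the feature that gives the best accuracy."""
    selected = []
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < n_features:
        best = max(remaining,
                   key=lambda f: loo_accuracy(rows, labels, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: only column 1 separates "B" from "M".
rows = [
    [5.0, 1.0, 3.0], [1.0, 1.2, 9.0], [4.0, 0.9, 2.0],
    [2.0, 3.0, 8.0], [5.5, 3.1, 3.5], [1.5, 2.9, 2.5],
]
labels = ["B", "B", "B", "M", "M", "M"]
chosen = forward_select(rows, labels, n_features=1)
```

On real data the inner model would normally be the classifier that is later deployed, which is what distinguishes a wrapper from a filter method.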
Attribute Information:
dimensional space (with few attributes). Both linear and nonlinear reduction
techniques can be used in accordance with the type of relationships among the
features in the dataset.
Most of the time, a dataset contains features that vary widely in magnitude,
unit and range. Since many machine learning algorithms use the Euclidean
distance between two data points in their computations, we need to bring all
features to the same level of magnitude. This can be achieved by scaling.
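The effect of scale on Euclidean distance can be seen in a small example. The sketch below uses min-max scaling purely because it keeps the arithmetic easy to follow (the report itself uses standardization); the helper names and the sample values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale each column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

# Hypothetical samples (smoothness, area); values are illustrative only.
a, b, c = [0.08, 500.0], [0.16, 510.0], [0.08, 560.0]

# Unscaled: area dominates, so b looks much closer to a than c does.
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Scaled: the doubling in smoothness now counts, and c becomes the closer point.
sa, sb, sc = min_max_scale([a, b, c])
scaled_ab, scaled_ac = euclidean(sa, sb), euclidean(sa, sc)
```

Before scaling the smoothness difference is practically invisible next to the area difference; after scaling both attributes contribute on the same footing and the nearest neighbour of `a` changes.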
Supervised learning is the method in which the machine is trained on data
whose inputs and outputs are well labelled. The model learns from the
training data and can then process future data to predict the outcome.
Supervised techniques are grouped into Regression and Classification. A
regression problem is one where the result is a real or continuous value, such
as “salary” or “weight”. A classification problem is one where the result is a
category, such as filtering emails into “spam” or “not spam”. Unsupervised
learning gives the machine information that is neither classified nor labelled
and lets the algorithm analyse it without any guidance. In our dataset the
outcome or dependent variable, Y, has only two values, M (Malignant) or
B (Benign), so a supervised classification algorithm is applied. We have
chosen three different types of classification algorithms in Machine
Learning. We can use a small linear model, which is a simple.
2.1 METHODS USED:
1) LOGISTIC REGRESSION
Logistic regression was introduced by the statistician D. R. Cox in 1958
and so predates the field of machine learning. It is a supervised machine
learning technique employed in classification tasks (for predictions based on
training data). Logistic regression uses an equation similar to that of linear
regression, but its outcome is a categorical variable, whereas other
regression models output a continuous value. Binary outcomes can be
predicted from the independent variables.
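How a linear equation becomes a binary outcome can be sketched as follows: the linear score is passed through the sigmoid function to obtain a probability, which is then thresholded. The coefficients below are hypothetical and not fitted to any data; the function names are chosen for illustration only.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Linear combination of the inputs, squashed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Return 1 (e.g. Malignant) when the probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients, not fitted to any data.
weights, bias = [1.5, -0.5], -1.0
high_score = predict_class([2.0, 1.0], weights, bias)  # score z = 1.5
low_score = predict_class([0.0, 2.0], weights, bias)   # score z = -2.0
```

Training a logistic regression model amounts to choosing `weights` and `bias` so that these probabilities match the labelled training data as closely as possible.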
2) RANDOM FOREST:
Random forest, like its name implies, consists of many individual
decision trees that operate as an ensemble. Each individual tree in the random
forest spits out a class prediction and the class with the most votes becomes our
model’s prediction.
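The voting step can be sketched in a few lines. The three "trees" below are hypothetical threshold rules standing in for trained decision trees, and the thresholds are invented for illustration; only the majority-vote logic reflects what the text describes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: each is a simple threshold rule.
trees = [
    lambda s: "M" if s["radius_mean"] > 15.0 else "B",
    lambda s: "M" if s["area_worst"] > 800.0 else "B",
    lambda s: "M" if s["concave_points_worst"] > 0.14 else "B",
]

def forest_predict(sample):
    """Collect one vote per tree and return the winning class."""
    return majority_vote([tree(sample) for tree in trees])

sample_1 = {"radius_mean": 16.0, "area_worst": 700.0, "concave_points_worst": 0.20}
sample_2 = {"radius_mean": 12.0, "area_worst": 900.0, "concave_points_worst": 0.10}
vote_1 = forest_predict(sample_1)  # votes: M, B, M
vote_2 = forest_predict(sample_2)  # votes: B, M, B
```

Using an odd number of trees avoids ties in a two-class problem; in a real random forest each tree is also trained on a different bootstrap sample and a random subset of features.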
3) DECISION TREE:
Decision Tree is a supervised learning technique that can be used
for both classification and regression problems, but it is mostly preferred
for solving classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two types of nodes: the Decision
Node and the Leaf Node. Decision nodes are used to make a decision and have
multiple branches, whereas Leaf nodes are the outputs of those decisions and
do not contain any further branches.
The decisions or tests are performed on the basis of the features of the
given dataset.
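One common way a decision node picks its test is to choose the split with the highest information gain, that is, the largest drop in entropy. The sketch below illustrates that criterion on hypothetical values; it is not a full tree-building algorithm, and the function names are chosen for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Hypothetical feature values with their class labels.
values = [1.0, 2.0, 8.0, 9.0]
labels = ["B", "B", "M", "M"]
gain_good = information_gain(values, labels, threshold=5.0)  # separates classes
gain_poor = information_gain(values, labels, threshold=1.0)  # leaves a mixed side
```

A tree builder would evaluate many candidate thresholds per feature, keep the one with the highest gain as the decision node, and recurse on each branch until the leaves are pure or a stopping rule applies.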
3. PROGRAMMING USED:
THE CODE:
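# importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# load the dataset and inspect it
df = pd.read_csv("data.csv")
df.info()
df.isna().sum()
df.shape
# drop columns that contain missing values
df = df.dropna(axis=1)
df.shape

# describe the dataset
df.describe()
df['diagnosis'].value_counts()
sns.countplot(x=df['diagnosis'])

# encode the diagnosis labels as integers (B -> 0, M -> 1)
labelencoder_Y = LabelEncoder()
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])
df.head()

# pairwise plots and correlations
sns.pairplot(df.iloc[:, 1:5], hue="diagnosis")
df.iloc[:, 1:32].corr()
plt.figure(figsize=(10, 10))
sns.heatmap(df.iloc[:, 1:10].corr(), annot=True, fmt=".0%")

# split features (X) and target (Y)
# note: the slice 2:31 stops before column 31; if that column is a feature,
# use 2:32 to include it
X = df.iloc[:, 2:31].values
Y = df.iloc[:, 1].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)

# feature scaling: fit on the training set only, then apply to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# models / algorithms
def models(X_train, Y_train):
    # logistic regression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)
    # decision tree
    tree = DecisionTreeClassifier(random_state=0, criterion="entropy")
    tree.fit(X_train, Y_train)
    # random forest
    forest = RandomForestClassifier(random_state=0, criterion="entropy", n_estimators=10)
    forest.fit(X_train, Y_train)
    return log, tree, forest

model = models(X_train, Y_train)

# evaluate each model on the test set
for i in range(len(model)):
    print("Model", i)
    print(classification_report(Y_test, model[i].predict(X_test)))
    print('Accuracy : ', accuracy_score(Y_test, model[i].predict(X_test)))

# prediction of random-forest
pred = model[2].predict(X_test)
print('Predicted values:')
print(pred)
print('Actual values:')
print(Y_test)

# save the trained random forest model
dump(model[2], "Cancer_prediction.joblib")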
RESULT AND DISCUSSION OF PROPOSED
METHODOLOGY
Table No. 1
Algorithms Accuracy Recall F1 Score
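The metrics named in the table header can be computed from a model's predictions as sketched below. The labels and predictions are hypothetical and are not the report's results; the function name `classification_metrics` is chosen for illustration, and class 1 is assumed to denote Malignant.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels and predictions (1 = Malignant, 0 = Benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In a screening setting recall on the Malignant class deserves particular attention, because a false negative (a missed cancer) is usually costlier than a false positive.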
Fig 2 :- Comparison graphs between features where represents Malignant and blue represents
Benign.
4.1 CONCLUSION
Breast cancer is one of the diseases that causes the highest number of
deaths every year. At present, only a few accurate prognostic and predictive factors are
used clinically for managing patients with breast cancer. Here, by making use of
algorithms with the Level Set approach, high accuracy can be achieved in detecting
affected cell shapes, with exact marking of the detected contours. The proposed system
helps to enhance the performance of mammogram retrieval by selecting optimal features.
After creating the predictive model, we can analyse the results obtained in
evaluating the efficiency of our algorithms. Random forest achieved the highest accuracy,
97.36%, followed by logistic regression at 96.49% and decision tree at 93.85%.
The analysis of the results indicates that integrating multidimensional data with
different classification, feature selection and dimensionality reduction techniques can
provide promising tools for inference in this domain. Further research in this field should be
carried out to improve the performance of the classification techniques so that they can
predict on more variables. We intend to study how to parametrize our classification
techniques to achieve high accuracy, and we are looking into more datasets and how further
Machine Learning algorithms can be used to characterize breast cancer, with the aim of
minimizing the error rate.