Activity Report On
“MACHINE LEARNING”
(21CS601)
in
Computer Science and Engineering
Under the Guidance of
Dr H M Keerthi Kumar
Associate Professor
Department of Computer Science and Engineering
Malnad College of Engineering
Submitted by
References:
3. Saima Sharleen Islam, Md. Samiul Haque, M. Saef Ullah Miah, Talha Bin Sarwar, Ramdhan Nugraha, "Application of machine learning algorithms to predict the thyroid disease risk: an experimental comparative study".
4. A.K.M Sazzadur Rahman, F. M. Javed Mehedi Shamrat, Zarrin Tasnim, Joy Roy, Syed Akhter Hossain, "A Comparative Study on Liver Disease Prediction Using Supervised Machine Learning Algorithms", International Journal of Scientific & Technology Research, Volume 8, Issue 11, November 2019.
INTRODUCTION:
DOMAIN: HUMAN DISEASES
Datasets in human diseases are collections of data gathered specifically to study various aspects of diseases affecting humans. These datasets can include information such as clinical data, genomic data, imaging data, etc. Human disease datasets are used by researchers, clinicians, and public health officials to improve our understanding of diseases, develop new treatments and interventions, and enhance patient care. They are often stored in databases and repositories and can be accessed for research purposes while ensuring patient privacy and data security.
Decision Tree: A Decision Tree builds a tree structure for the classification problem. The final nodes in the tree are leaf nodes or decision nodes. Decision Tree classification is based on the core ID3 algorithm, which uses entropy and information gain to build the decision tree.
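A minimal sketch of the entropy and information-gain computation at the core of ID3, under the assumption of categorical attributes; the function and parameter names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on one attribute.
    rows: list of dicts mapping attribute name -> value."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(r[attribute] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attribute] == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain
```

ID3 repeatedly picks the attribute with the highest information gain, splits on it, and recurses on each subset until a node is pure.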
Naïve Bayes algorithm: The Naïve Bayes algorithm is widely applied in classification and prediction problems. Due to its simplicity, it is very useful for large datasets and produces promising outcomes. Under Bayes' theorem, the value of an unknown data point in a given dataset is treated as independent of the other unknown data points. The Naïve Bayesian classifier assumes each individual feature has an independent distribution and therefore estimates the probability of each class for an unknown data point as given below:

P(C | x1, ..., xn) ∝ P(C) · P(x1 | C) · ... · P(xn | C)
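A minimal sketch of this independence assumption in code; the prior and likelihood tables are hypothetical inputs assumed to have been estimated beforehand from a training set:

```python
def naive_bayes_posterior(x, prior, likelihood):
    """Unnormalized posterior P(C) * prod_i P(x_i | C) for each class.
    prior: dict class -> P(C)
    likelihood: dict (class, feature_index, value) -> P(x_i = value | C)"""
    scores = {}
    for c, p in prior.items():
        score = p
        for i, v in enumerate(x):
            # small floor stands in for smoothing of unseen feature values
            score *= likelihood.get((c, i, v), 1e-9)
        scores[c] = score
    return scores
```

The class with the largest score is the prediction; normalizing the scores by their sum yields the class probabilities.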
k-NN algorithm: The k-nearest neighbor (k-NN) algorithm has been widely used for classification, estimation, and prediction. Based on the training set, the k-NN algorithm selects the nearest known data points and labels/predicts the unknown data point. The k represents the number of nearest neighbors considered for the unknown data point. For instance, if k=2 then k-NN would choose the two closest known data points to classify the new unknown data point. The similarity between data points is measured by various distance measures.
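A short sketch of this procedure with Euclidean distance (other distance measures can be swapped into the key function); names are illustrative:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=2):
    """Label x by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pl: math.dist(pl[0], x))
    k_labels = [label for _, label in neighbors[:k]]
    return Counter(k_labels).most_common(1)[0][0]
```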
The performance of the classification model is computed based on the confusion matrix. This matrix is the basis for the common evaluation metrics such as precision, recall, accuracy, and F-measure (Fawcett, 2004).
PRECISION = TP / (TP + FP)
Recall - This is the sensitivity. In the expression of recall, P is the actual total of POSITIVE data points.

RECALL = TP / P
Accuracy - This value presents the correctness of the classification model in predicting unknown data points. In the expression of accuracy, N is the actual total of NEGATIVE data points, and P is the actual total of POSITIVE data points.

ACCURACY = (TP + TN) / (P + N)
F-MEASURE = 2 / (1/PRECISION + 1/RECALL)
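The four formulas translate directly into code; a minimal sketch, assuming the confusion-matrix counts are already known and that there is at least one actual and one predicted positive:

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F-measure from confusion-matrix counts.
    P = tp + fn (actual positives), N = tn + fp (actual negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 / (1 / precision + 1 / recall)
    return precision, recall, accuracy, f_measure
```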
Results:
To evaluate the performance of each considered classification algorithm, the dataset has been segmented into different cases. Each case represents a distribution of the dataset. The different cases considered in this study allow the performance of the classification model to be evaluated with respect to a varying number of attributes and their varying values in the dataset (a sketch of this attribute selection follows the case list below).
Case 1: In this case, all the attributes (i.e., 14 attributes) of the dataset are used. Attributes: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, num
Case 2: A subset (i.e., 8 attributes) of the dataset is used in this case. Attributes: age, sex, num, trestbps, chol, restecg, thalach, ca
Case 3: A subset (i.e., 7 attributes) of the dataset is used in this case. Attributes: ca, fbs, exang, oldpeak, slope, thal, num
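A sketch of how the three cases could be realized with pandas, assuming the heart-disease data sits in a CSV with the column names listed above (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("heart_disease.csv")  # hypothetical file name

cases = {
    "case1": ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
              "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"],
    "case2": ["age", "sex", "num", "trestbps", "chol", "restecg",
              "thalach", "ca"],
    "case3": ["ca", "fbs", "exang", "oldpeak", "slope", "thal", "num"],
}

for name, cols in cases.items():
    subset = df[cols]
    # "num" is the target class; the remaining columns are the features
    X, y = subset.drop(columns="num"), subset["num"]
    # ...train and evaluate a classifier on (X, y) for this case...
```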
Fig. 2. k-NN Classification/Prediction Results
Description of dataset:
This is a classic dataset for training and benchmarking machine learning algorithms. It contains biopsy features for classification of 569 malignant (cancer) and benign (not cancer) breast masses. Features were computationally extracted from digital images of fine needle aspirate biopsy slides and correspond to properties of cell nuclei, such as size, shape, and regularity. The mean, standard error, and worst value of each of 10 nuclear parameters are reported, for a total of 30 features.
Algorithms:
In our project, predictive analysis is performed using machine learning algorithms. The machine learning algorithms applied in our project are:
• Support Vector Machine (SVM) is a classifier which divides the dataset into classes by finding a maximum marginal hyperplane (MMH) via the nearest data points.
• Random forests, or random decision forests, are an ensemble method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
• Decision Tree C4.5 is a predictive modeling tool that can be applied across many areas. It can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions (a training sketch of these three classifiers follows this list).
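A minimal scikit-learn sketch of the three classifiers on the breast-cancer dataset described above; the hyperparameters are illustrative, and note that scikit-learn's DecisionTreeClassifier implements CART rather than C4.5, so the entropy criterion is used as the closest stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# 569 samples, 30 nuclear features, benign/malignant labels
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # testing-set accuracy
```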
The classifiers' performances were evaluated using metrics such as Accuracy, Precision, Recall, and F1-Score.
The Accuracy metric defines how correctly the classifiers used in the experiment performed the classification task. Accuracy measures the proportion of correctly classified instances (TP + TN) against the overall number of classified instances. The metric is defined as follows:

Acc. = (TP + TN) / (TP + FP + TN + FN)
Precision evaluates the proportion of the data instances that were predicted as true and were actually true in the experiment, i.e., the fraction of relevant instances among all retrieved instances. It is defined as follows:

Precision = TP / (TP + FP)
Recall evaluates the proportion of the actual true data instances that were predicted correctly as true in the experiment, i.e., the fraction of retrieved instances among all relevant instances. It is defined as follows:

Recall = TP / (TP + FN)
The F1-Score evaluation metric was used to measure the harmonic mean between the precision and recall scores of the classifiers. The metric was used to find a fair balance between the two metric values of the classifiers. It is defined as follows:

F1-score = (2 × Recall × Precision) / (Recall + Precision)
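All four metrics are also available directly from scikit-learn, so they do not need to be computed by hand; a short sketch with hypothetical label arrays:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1]   # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]   # hypothetical classifier output

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```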
Results:
The confusion matrix for each model is formulated to evaluate the classifier.
From the results on the training set and testing set, we can see that all the classifiers have varying accuracies, but SVM consistently has a higher testing-set accuracy (97.2%) than the other classifiers. Since confusion matrices are a useful way to assess a classifier, each row in Table 2 represents the rates in an actual class while each column displays the predictions. Table 3 presents the calculated performance measures of the classification models based on the confusion matrix results: precision, sensitivity, and F1-score for the benign and malignant classes.
Description of dataset:
The dataset has 3,162 rows and 25 data columns. The related allbp.data dataset includes only 2,800 instances and contains no missing values. The training set contains 2,800 instances and the test set 972 instances. Although the dataset is sizable, we discovered that relatively few researchers have previously worked with it. As a result, we concentrated on this dataset and studied the results in order to aid future researchers in predicting similar types of multiclass thyroid datasets.
Algorithms:
In our project, predictive analysis is performed using machine learning algorithms. The machine learning algorithms applied in our project are:
KNN
K-Nearest Neighbors (KNN) is a lazy learning technique: all training data is incorporated into the testing process. While this makes training fast, it slows testing and requires a great deal of time and memory. When building the model, the number of neighbors (K) must be specified in KNN. Here, K acts as a controlling variable for the prediction model. K is usually chosen as an odd number when the number of classes is even, to avoid ties.
ANN
The input neurons receive the data that we feed the ANN in the input layer. These neurons transmit the data to the hidden layer, which performs the core computation, and then pass the results to the output neurons in the output layer, which stores the network's final calculations for future use. After training, the ANN can produce outputs even from insufficient data. It can also learn on its own and provide results.
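A minimal sketch of such a feed-forward network using scikit-learn's MLPClassifier; the synthetic data is only a stand-in for the thyroid dataset and the layer size is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# synthetic multiclass stand-in (hypothetical shapes, not the real data)
X, y = make_classification(n_samples=500, n_features=25, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# one hidden layer of 16 neurons between the input and output layers
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
ann.fit(X_train, y_train)
print(ann.score(X_test, y_test))  # testing-set accuracy
```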
Decision tree
The decision tree technique is a subset of supervised machine learning that is based on a continuous data-splitting mechanism across specified parameters. After obtaining the training data, it splits the dataset into small subsets using criteria such as the Gini Index, Information Gain, Entropy, Gain Ratio, and Chi-Square. The Gini Index and Information Gain are measured in the majority of datasets using Eqs. (1) and (2), respectively (their standard forms are given below). The process is repeated for each child tuple until all tuples belong to the same class and no additional attributes are needed. The Gini Index is used to increase precision and diagnostic accuracy.
This attribute then subdivides the dataset into smaller subsets for each child until there are no
more attributes.
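For reference, the standard forms behind Eqs. (1) and (2), with p_i the proportion of class i in a node S and S_v the subset of S taking value v for attribute A:

Gini(S) = 1 − Σ_i p_i²   (Eq. 1)
Entropy(S) = − Σ_i p_i log2 p_i
Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)   (Eq. 2)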
GaussianNB
The Gaussian Naive Bayes (GaussianNB) technique is predicated on the assumption of predictor independence. As each feature is independent, the inclusion of one feature does not affect the contribution of the other features in the GaussianNB algorithm. The algorithm is based on Bayes' theorem, and the probability can be calculated as follows:
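P(A|B) = P(B|A) · P(A) / P(B)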
The probability P(A|B) that we are interested in computing is referred to as the posterior
probability. P(A) is the probability of occurrence of event A. Similarly, P(B) is the probability
of occurrence of event B.
The classifiers' performances were evaluated using metrics such as Accuracy, Precision, Recall, and F1-Score.
Accuracy:
In classification problems, accuracy refers to the proportion of correct predictions made by the model over all predictions. It is calculated by dividing the number of correct predictions by the total number of predictions and multiplying by 100.
Here, TP is True Positive: a person actually has euthyroid sick syndrome, and the model classifying the case as sick-euthyroid counts as a True Positive.
Precision:
Precision is the ratio of true positives to the total predicted positives. Here, precision is a measure that tells us what proportion of patients diagnosed as having euthyroid sick syndrome actually had euthyroid sick syndrome. The predicted positives (people predicted as sick-euthyroid) are TP and FP, and the people among them actually having euthyroid sick syndrome are TP.
F1-score: Precision and recall are combined in the F1-score metric. Indeed, the F1-score is the harmonic mean of the two. A high F1-score indicates both high precision and high recall. It strikes a good balance between precision and recall and performs well on problems involving imbalanced classification.
Recall:
Recall is the ratio of true positives to all positives in the ground truth. The actual positives (those with euthyroid sick syndrome) are TP and FN, and the patients correctly diagnosed with euthyroid sick syndrome by the model are TP. FN is included since the person did indeed have euthyroid sick syndrome, despite the model's prediction.
Results:
The confusion matrix for each model is formulated to evaluate the classifier.
We used one neural network model (ANN), six tree-based models (CatBoost, XGBoost, Random Forest, LightGBM, Decision Tree, and Extra-Trees), and three statistical models (SVC, KNN, and GaussianNB) in this study. Experimental results are scrutinized using accuracy, precision, recall, F1-score, and learning curves. These evaluation metrics are compared across the ten classification algorithms.
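A sketch of such a comparison loop over the scikit-learn members of that list; CatBoost, XGBoost, and LightGBM come from their own packages but follow the same fit/predict API, and the synthetic data below is only a stand-in for the thyroid dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    "ANN": MLPClassifier(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Extra-Trees": ExtraTreesClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f}")
```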
Description of dataset:
The dataset consists of 583 liver patients' records, of which 75.64% are male patients and 24.36% are female patients. The dataset contains 11 parameters, of which we chose 10 parameters for our further analysis and 1 parameter as the target class.
Algorithms:
In our project, predictive analysis is performed using machine learning algorithms. The machine learning algorithms applied in our project are:
Decision Tree:
The general idea of utilizing a Decision Tree is to build a training model that can be used to predict the class or value of target variables by learning decision rules derived from prior data (training data).
Fig. 4: Sample of the process of Decision Trees.
K-Nearest Neighbors (KNN):
KNN is one of the most fundamental instance-based classification algorithms in Machine Learning. KNN works on the idea that samples that are near each other belong to the same class. A KNN assigns an example to the class that is most common among its K nearest neighbors. K is a parameter for tuning the classification algorithm.
The classifiers' performances were evaluated using metrics such as Accuracy, Precision, Recall, and F1-Score.
Accuracy
This value presents the correctness of the classification model in predicting unknown data points. In the expression of accuracy, N is the actual total of NEGATIVE data points, and P is the actual total of POSITIVE data points.
Precision:
It is otherwise called the positive predictive value. It gives the proportion of correctly predicted positive outcomes by the classifier algorithm.
F1:
It measures the performance of the model as a blend of precision and recall. It accounts for both the FP and FN of a model.
Results:
The confusion matrix for each model is formulated to evaluate the classifier.