
Malnad College of Engineering, Hassan

(An Autonomous Institution affiliated to VTU, Belagavi)

Activity Report on
"MACHINE LEARNING"
(21CS601)
in
Computer Science and Engineering
Under the Guidance of
Dr H M Keerthi Kumar
Associate Professor
Department of Computer Science and Engineering
Malnad College of Engineering
Submitted by

Bindu Prasad GS 4MC21CS027
Deeksha K 4MC21CS042
Deeksha S 4MC21CS043
Dhawan S 4MC21CS048

Department of Computer Science and Engineering 2023-24


LITERATURE SURVEY

1. Anoud Shaikh, Naeem A. Mahoto, Faheem Khuhawar, "Performance Evaluation of Classification Methods for Heart Disease Dataset", Sindh Univ. Res. Jour. (Sci. Ser.), Vol. 47(3):389-394 (2015).

2. Sulyman Age Abdulkareem, Zainab Olorunbukademi Abdulkareem, "An Evaluation of the Wisconsin Breast Cancer Dataset", Institute for Communication Systems, Home of 5G and 6G Innovation Centre, University of Surrey, Guildford, GU2 7XH, UK.

3. Saima Sharleen Islam, Md. Samiul Haque, M. Saef Ullah Miah, Talha Bin Sarwar, Ramdhan Nugraha, "Application of machine learning algorithms to predict the thyroid disease risk: an experimental comparative study".

4. A.K.M Sazzadur Rahman, F. M. Javed Mehedi Shamrat, Zarrin Tasnim, Joy Roy, Syed Akhter Hossain, "A Comparative Study on Liver Disease Prediction Using Supervised Machine Learning Algorithms", International Journal of Scientific & Technology Research, Volume 8, Issue 11, November 2019.

INTRODUCTION:
DOMAIN: HUMAN DISEASES
Human disease datasets are collections of data gathered specifically to study various aspects of diseases affecting humans. They can include clinical data, genomic data, imaging data, and more. Researchers, clinicians, and public health officials use these datasets to improve our understanding of diseases, develop new treatments and interventions, and enhance patient care. They are often stored in databases and repositories and can be accessed for research purposes while ensuring patient privacy and data security.

HEART DISEASE DATASET:


Description of dataset:
This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to a subset of 14 of them. The "target" field refers to the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.
Algorithms:

Decision Tree: A decision tree builds a tree structure for the classification problem. The final nodes in the tree are leaf (decision) nodes. Decision tree classification here builds on the core ID3 algorithm, which uses entropy and information gain to construct the tree.
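
A minimal sketch of this classifier, assuming the 14-attribute heart dataset is available as a hypothetical heart.csv with the binary target column named num; criterion="entropy" selects splits by information gain as ID3 does, although scikit-learn's underlying tree is an optimized CART rather than ID3 itself:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file: the 14-attribute Cleveland subset with a binary "num" target.
df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["num"]), df["num"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# criterion="entropy" chooses splits by information gain, as ID3 does;
# note that scikit-learn's tree is an optimized CART, not ID3 itself.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```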

Naïve Bayes algorithm: The Naïve Bayes algorithm is widely applied in classification and prediction problems. Due to its simplicity, it is very useful for large datasets and produces promising outcomes. The Naïve Bayesian classifier assumes that each feature of a data point is independent of the others given the class, so it estimates the posterior probability of each class for an unknown data point using Bayes' theorem, as given below:

$$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

k-NN algorithm: The k-nearest neighbor (k-NN) algorithm has been widely used for classification, estimation, and prediction. Based on the training set, k-NN selects the nearest known data points and uses them to label the unknown data point. Here k is the number of nearest neighbors considered: for instance, if k = 2, k-NN chooses the two closest known data points to classify the new unknown point. The similarity between data points is measured by various distance measures.
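
A short sketch of k-NN on the same assumed heart data, reusing the train/test split from the decision-tree sketch above; n_neighbors and the Euclidean metric are illustrative choices, not the paper's settings:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Distance measures are scale-sensitive, so features are standardized first.
# n_neighbors is the k described above; metric picks the distance measure.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)   # reusing the split from the decision-tree sketch
print("Test accuracy:", knn.score(X_test, y_test))
```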

Performance Evaluation Metrics:

The performance of a classifier/prediction model is measured using evaluation metrics such as Precision, Recall, Accuracy, and F-measure. In particular, the performance of the classification model is measured by how well it distinguishes between the actual and predicted class/label.

The performance of the classification model is computed from the confusion matrix. This matrix is the basis for the common evaluation metrics such as precision, recall, accuracy, and F-measure (Fawcett, 2004).

Precision - This is the positive predictive value (PPV):

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall - This is the sensitivity. In the expression of recall, P is the actual total number of POSITIVE data points:

$$\text{Recall} = \frac{TP}{P}$$

Accuracy - This value presents the correctness of the classification model in predicting unknown data points. In the expression of accuracy, N is the actual total number of NEGATIVE data points, and P is the actual total number of POSITIVE data points:

$$\text{Accuracy} = \frac{TP + TN}{P + N}$$

F-measure - It is the harmonic mean of precision and recall (Sasaki, 2007):

$$F\text{-measure} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$
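
A minimal sketch of these four metrics computed by hand from a binary confusion matrix; the label vectors are toy values chosen purely for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # toy labels: 1 = disease
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                  # PPV
recall    = tp / (tp + fn)                  # P = TP + FN actual positives
accuracy  = (tp + tn) / (tp + tn + fp + fn)
f_measure = 2 / (1 / precision + 1 / recall)
print(precision, recall, accuracy, f_measure)   # 0.8 0.8 0.8 0.8
```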

Results:

To evaluate the performance of each considered classification algorithm, the dataset has been segmented into different cases, each representing a different distribution of the dataset. The cases considered in this study allow evaluating the performance of the classification model with respect to a varying number of attributes and their varying values in the dataset.

The different cases used in this study are described below; a short subsetting sketch follows Case 3.

Case 1: All 14 attributes of the dataset are used. Attributes: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, num.

Case 2: A subset of 8 attributes is used. Attributes: age, sex, trestbps, chol, restecg, thalach, ca, num.

Case 3: A subset of 7 attributes is used. Attributes: ca, fbs, exang, oldpeak, slope, thal, num.
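
One way the three attribute configurations could be expressed, assuming the dataframe `df` from the earlier loading sketch and lower-case column names that must match the CSV header:

```python
# Column names as listed above; they must match the CSV header exactly.
cases = {
    "case1": ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
              "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"],
    "case2": ["age", "sex", "trestbps", "chol", "restecg", "thalach",
              "ca", "num"],
    "case3": ["ca", "fbs", "exang", "oldpeak", "slope", "thal", "num"],
}
for name, cols in cases.items():
    subset = df[cols]                # df from the earlier loading sketch
    print(name, subset.shape)        # each case feeds the same classifiers
```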
Fig. 2: k-NN classification/prediction results.

Fig. 3: Naïve Bayesian classification/prediction results.

Fig. 4: Decision Tree classification/prediction results.

Fig. 5: Case 3 results of the classifiers.
BREAST CANCER DATASET:

Description of dataset:

This is a classic dataset for training and benchmarking machine learning algorithms. It contains biopsy features for classifying 569 breast masses as malignant (cancer) or benign (not cancer). Features were computationally extracted from digital images of fine needle aspirate biopsy slides and correspond to properties of cell nuclei, such as size, shape, and regularity. The mean, standard error, and worst value of each of 10 nuclear parameters are reported, for a total of 30 features.
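
This dataset ships with scikit-learn as the Wisconsin Diagnostic Breast Cancer dataset, so a minimal loading sketch needs no external files:

```python
from sklearn.datasets import load_breast_cancer

# 569 samples, 30 features: mean / standard error / worst of 10 nuclear measures.
data = load_breast_cancer()
print(data.data.shape)        # (569, 30)
print(data.target_names)      # ['malignant' 'benign']
print(data.feature_names[:3]) # e.g. mean radius, mean texture, mean perimeter
```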

Algorithms:

In our project, a predictive analysis with several machine learning algorithms is carried out. The algorithms applied are:

• Support Vector Machine (SVM) is a classifier that divides the dataset into classes by finding a maximum marginal hyperplane (MMH) via the nearest data points.

• Random forests, or random decision forests, are an ensemble method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

• k-Nearest Neighbors (k-NN) is a supervised classification algorithm. It takes a set of labeled points and uses them to learn how to label other points. To label a new point, it looks at the labeled points closest to that new point (its nearest neighbors) and has those neighbors vote.
• Logistic regression, a generalization of linear regression, is a very powerful modeling tool [11]. It is used to assess the likelihood of a disease or health condition as a function of a risk factor (and covariates). Both simple and multiple logistic regression assess the association between independent variables (Xi), sometimes called exposure or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable. It is used primarily for predicting binary or multiclass dependent variables.

• Decision Tree C4.5 is a predictive modeling tool that can be applied across many areas. It can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. A combined training sketch of these five classifiers follows this list.
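
A hedged sketch wiring the five classifiers to the loaded dataset from the sketch above; hyperparameters are scikit-learn defaults rather than the paper's settings, and criterion="entropy" stands in for C4.5's information-gain splitting since scikit-learn does not ship a true C4.5:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target)

# criterion="entropy" stands in for C4.5's information-gain splitting;
# scikit-learn does not ship a true C4.5 implementation.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=42),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: {model.score(X_te, y_te):.3f}")
```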

Performance Evaluation Metrics:

The classifiers' performances were evaluated using the metrics Accuracy, Precision, Recall, and F1-score.

The Accuracy metric defines how correctly the classifiers used in the experiment performed the classification task. Accuracy measures the proportion of correctly classified instances (TP + TN) among all classified instances. The metric is defined as follows:

$$\text{Acc.} = \frac{TP + TN}{TP + FP + TN + FN}$$

where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.

Precision evaluates the proportion of the data instances predicted as true that were actually true in the experiment (i.e., the fraction of relevant instances among all retrieved instances). It is defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall evaluates the proportion of the actual true data instances that were predicted correctly as true in the experiment (i.e., the fraction of relevant instances that were retrieved). It is defined as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1-score metric measures the harmonic mean between the precision and recall scores of the classifiers, providing a fair balance between the two metric values. It is defined as follows:

$$F1\text{-score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
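
These four metrics and the confusion matrix can be obtained together; a sketch reusing the fitted SVM pipeline and test split from the comparison sketch above (the class-name ordering follows scikit-learn's 0 = malignant, 1 = benign encoding):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = models["SVM"].predict(X_te)    # fitted pipeline from the sketch above
print(confusion_matrix(y_te, y_pred))   # rows: actual class, columns: predicted
print(classification_report(y_te, y_pred,
                            target_names=["malignant", "benign"]))
```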

Results:

The confusion matrix for each model is formulated to evaluate the classifier.

From the results on the training and testing sets, we can see that all the classifiers have varying accuracies, but SVM consistently achieves a higher testing-set accuracy (97.2%) than the other classifiers.

Table 1. Accuracy percentage for breast cancer diagnostic dataset

Since confusion matrices are a useful way to assess a classifier, each row in Table 2 represents the rates for an actual class while each column displays the predictions. Table 3 presents the calculated performance measures of the classification models based on the confusion matrix results: precision, sensitivity (recall), and F1-score for the benign and malignant classes.

Table 2. Confusion matrix.

Table 3. Classifier performances.

Fig. 3. Comparative graph of the different classifiers.


THYROID DATASET:

Description of dataset:
The full dataset has 3,162 rows and 25 data columns. We also consider the alternative allbp.data version, which includes only 2,800 instances and no missing values; its training set contains 2,800 instances and its test set 972 cases. Despite the volume of cases, we discover that relatively few researchers have previously worked with this dataset. As a result, we concentrated on this dataset and studied the results in order to aid future researchers in predicting similar types of multiclass thyroid datasets.

Table 1: Sick-euthyroid dataset structure.
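
A hypothetical loading sketch for the raw UCI file described above; the file name, the absence of a header row, and "?" as the missing-value marker are assumptions based on the usual UCI distribution:

```python
import pandas as pd

# The raw UCI file has no header row; attribute names come from the
# accompanying .names file. "?" marking missing values is assumed here.
df = pd.read_csv("allbp.data", header=None, na_values="?")
print(df.shape)                  # expected 2,800 training instances
print(df.isna().sum().sum())     # total missing values
```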

Algorithms:

In our project, a predictive analysis with several machine learning algorithms is carried out. The algorithms applied are:

KNN
K-Nearest Neighbors (KNN) is a lazy learning technique: all training data is retained and incorporated into the testing process. While this makes training fast, it slows testing and requires a great deal of time and memory. When building the model, the number of neighbors (K) must be specified; K acts as a controlling variable for the prediction model. When the number of classes is even, K is usually chosen as an odd number to avoid tied votes, as in the sketch below.
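
A sketch of picking K by cross-validation over odd values only; X and y stand for an assumed preprocessed version of the thyroid features and labels:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# X, y stand for assumed preprocessed thyroid features and labels.
# Only odd values of K are searched so a two-class vote cannot tie.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [1, 3, 5, 7, 9, 11]},
                    cv=5, scoring="f1_weighted")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```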

ANN
The input neurons receive the data that we feed the ANN in the input layer. These neurons transmit the data to the hidden layer, which performs the computation, and then pass the result to the output neurons in the output layer, which holds the network's final outputs. After training, the ANN can produce outputs even from incomplete data; it can also learn on its own and provide results.
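
A minimal sketch of the input, hidden, and output layer structure described above using scikit-learn's MLPClassifier; the single 32-neuron hidden layer, iteration budget, and train/test variables are illustrative assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One hidden layer of 32 neurons between the input and output layers;
# the layer size and iteration budget are illustrative assumptions.
ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                  random_state=42))
ann.fit(X_train, y_train)        # assumed train split of the thyroid data
print("Test accuracy:", ann.score(X_test, y_test))
```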
Decision tree
The decision tree technique is a subset of supervised machine learning based on a continuous data-splitting mechanism across specified parameters. After obtaining the training data, it splits the dataset into small subsets using criteria such as the Gini Index, Information Gain, Entropy, Gain Ratio, and Chi-Square. The Gini Index and Information Gain are measured in the majority of datasets using Eqs. (1) and (2), respectively. The process is repeated for each child tuple until all tuples belong to the same class and no additional attributes are needed. The Gini Index is used to increase precision and diagnostic accuracy. The chosen attribute then subdivides the dataset into smaller subsets for each child until there are no more attributes.
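
Since Eqs. (1) and (2) from the cited paper are not reproduced here, a sketch of the standard impurity measures they refer to:

```python
import numpy as np

def gini(labels):
    """Gini index over class proportions: 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy over class proportions: -sum(p_i * log2 p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))   # 0.5 1.0
```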
GaussianNB
The Gaussian Naive Bayes (GaussianNB) technique is predicated on the assumption of predictor independence. Because each feature is treated as independent, the inclusion of one feature does not affect the contribution of other features in the GaussianNB algorithm. The algorithm is based on Bayes' theorem, and the probability can be calculated using

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

The probability P(A|B) that we are interested in computing is referred to as the posterior probability. P(A) is the probability of occurrence of event A; similarly, P(B) is the probability of occurrence of event B.
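
A minimal sketch showing how this posterior surfaces in practice; the train/test variables are the assumed thyroid split from the earlier sketches:

```python
from sklearn.naive_bayes import GaussianNB

# GaussianNB fits one independent Gaussian per feature and class;
# predict_proba returns the posterior P(class | features) from Bayes' theorem.
nb = GaussianNB()
nb.fit(X_train, y_train)                   # assumed thyroid train split
print(nb.predict_proba(X_test[:3]))        # posterior probabilities per class
```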

Performance Evaluation Metrics:

The classifiers' performances were evaluated using the metrics Accuracy, Precision, Recall, and F1-score.
Accuracy:
In classification problems, accuracy refers to the number of correct predictions made by the model over all possible predictions. It is calculated by dividing the number of correct predictions by the total number of predictions and multiplying by 100.

Here, TP is True Positive: a person who actually has euthyroid sick syndrome and whose case the model classifies as sick-euthyroid counts as a True Positive.
Precision:
Precision is the ratio of true positives to the total predicted positives.

Here, precision tells us what proportion of the patients diagnosed as having euthyroid sick syndrome actually had it. The predicted positives (people predicted as sick-euthyroid) are TP and FP, and those among them actually having euthyroid sick syndrome are TP.
F1-score: Precision and recall are combined in the F1-score metric; indeed, the F1-score is the harmonic mean of the two. A high F1-score indicates both high precision and high recall. It strikes a good balance between precision and recall and performs well on problems involving imbalanced classification.

Recall:
Recall is the ratio of true positives to all positives in the ground truth.

The actual positives (those with euthyroid sick syndrome) are TP and FN, and the patients among them correctly diagnosed by the model are TP. FN is included because those persons did indeed have euthyroid sick syndrome, despite the model's prediction.

Results:

The confusion matrix for each model is formulated to evaluate the classifier. We used one neural network model (ANN), six tree-based models (CatBoost, XGBoost, Random Forest, LightGBM, Decision Tree, and Extra-Trees), and three statistical models (SVC, KNN, and GaussianNB) in this study. Experimental results are scrutinized using accuracy, precision, recall, F1-score, and learning curves. These evaluation metrics are compared across the ten classification algorithms, as in the sketch below.
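
A hedged sketch of how such a ten-model comparison could be wired up; xgboost, lightgbm, and catboost are third-party packages that must be installed separately, hyperparameters are library defaults, and the train/test variables stand for an assumed split of the thyroid data:

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

models = {
    "ANN": MLPClassifier(max_iter=500),
    "CatBoost": CatBoostClassifier(verbose=0),
    "XGBoost": XGBClassifier(),
    "Random Forest": RandomForestClassifier(),
    "LightGBM": LGBMClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Extra-Trees": ExtraTreesClassifier(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)              # assumed thyroid train split
    print(name, f1_score(y_test, model.predict(X_test), average="weighted"))
```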

Figure 14: Comparison of F1-scores for the different machine learning algorithms employed in this study.

Table 3: Accuracy, precision, recall, and F1-scores for the different classification methods employed in this study.

Figure 15: Learning curves for the top four classification algorithms.
LIVER DISEASE DATASET:

Description of dataset:
The dataset consists of records for 583 liver patients, of whom 75.64% are male and 24.36% female. The dataset contains 11 parameters, of which we chose 10 for our further analysis and 1 as the target class.

Algorithms:

In our project, a predictive analysis with several machine learning algorithms is carried out. The algorithms applied are:

Decision Tree (DT):

The general idea of using a decision tree is to build a training model that can be used to predict the class or value of target variables by learning decision rules derived from prior data (training data).
Fig. 4: Sample of the process of decision trees.
K-Nearest Neighbors (KNN):
KNN is one of the most fundamental instance-based classification algorithms in machine learning. KNN works on the idea that similar examples lie close together and belong to the same class. KNN assigns an example to the class most common among its K nearest neighbors; K is a parameter for tuning the classification algorithm.

Support Vector Machine (SVM):

SVM is a supervised learning algorithm. It can be used for both classification and regression problems, though it is generally used for classification. SVMs work well for many healthcare problems and can handle both linear and non-linear cases. Training involves the minimization of the (soft-margin) error function

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

subject to the margin constraints, where C weighs the penalty on the slack variables ξi.
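
A minimal sketch showing where C enters in practice; the kernel choice and train/test variables are assumptions, with the split taken from an assumed preparation of the liver dataset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# C is the penalty weight on the slack terms in the error function above;
# kernel="rbf" covers the non-linear case, kernel="linear" the linear one.
svm = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
svm.fit(X_train, y_train)        # assumed train split of the liver dataset
print("Test accuracy:", svm.score(X_test, y_test))
```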

Naive Bayes (NB):

Naive Bayes is one of the simplest, most effective, and most commonly used machine learning techniques. It is a probabilistic classifier that classifies using the assumption of conditional independence, with probabilities estimated from the training data.

Performance Evaluation Metrics:

The classifiers' performances were evaluated using the metrics Accuracy, Precision, Recall, and F1-score.

Accuracy:
This value presents the correctness of the classification model in predicting unknown data points. In the expression of accuracy, N is the actual total number of NEGATIVE data points and P the actual total number of POSITIVE data points.

Precision:
Also called the positive predictive value, it gives the proportion of correctly predicted positive outcomes by the classifier algorithm.

F1:
It measures the performance of the model by blending precision and recall, accounting for both the FP and FN of a model.

Results:

The confusion matrix for each model is formulated to evaluate the classifier.

In this experiment, we considered different analyses to examine the six machine learning classifiers for the classification of the liver disease dataset. In terms of accuracy, LR achieved the highest accuracy (75%) and NB the worst performance (53%). With respect to precision, LR achieved the highest score (91%) and NB performed worst (36%). When considering sensitivity, SVM achieved the highest value (88%) and KNN the worst (76%).
The Receiver Operating Characteristic (ROC) curve is used to represent the performance of machine learning techniques based on the true positive rate (TPR) and false positive rate (FPR) of the classification results, as in the sketch below.
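
A sketch of plotting an ROC curve from continuous classifier scores, reusing the fitted SVM pipeline from above and assuming binary 0/1 labels in y_test:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Continuous scores from the fitted SVM pipeline above; y_test is assumed
# to be binary 0/1, so no pos_label argument is needed.
scores = svm.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, scores)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr, label="SVM")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()
```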

Fig. 7: Receiver Operating Characteristic (ROC).


