
HEART DISEASE PREDICTION USING MACHINE LEARNING ALGORITHMS

A Minor Project report submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

IN

INFORMATION TECHNOLOGY

Submitted by

SHUBHAM KUMAR - 08315003120
SONIKESH - 08615003120
DISHANT KANGOTRA - 75315003120

Under the guidance of

Dr. Tripti Sharma
HOD
DEPTT. OF INFORMATION TECHNOLOGY

MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY,
NEW DELHI-110058
2020-2024

ACKNOWLEDGEMENT

We would like to express our deep gratitude to our project guide, Dr. Tripti Sharma, Head of the Department, Information Technology, for her guidance, unsurpassed knowledge, and immense encouragement, and for providing us with the facilities required for the completion of the project work.
We thank all the teaching faculty of the Department of IT, whose suggestions during reviews helped us in the accomplishment of our project. We would like to thank our parents, friends, and classmates for their encouragement throughout the project period. Last but not the least, we thank everyone who supported us directly or indirectly in completing this project successfully.
DEPARTMENT OF INFORMATION TECHNOLOGY
MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY
NEW DELHI – 110058

CERTIFICATE

This is to certify that the project report entitled “MULTIPLE DISEASE PREDICTION USING MACHINE LEARNING ALGORITHMS” submitted by SHUBHAM KUMAR (08315003120), SONIKESH (08615003120), and DISHANT KANGOTRA (75315003120) in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Information Technology of Maharaja Surajmal Institute of Technology is a record of bonafide work carried out under my guidance and supervision.

Mentor:
Dr. Tripti Sharma
(Head of The Department)
DECLARATION

We, SHUBHAM KUMAR, SONIKESH, and DISHANT KANGOTRA, fifth-semester B.Tech. students in the Department of Information Technology at MSIT, New Delhi, hereby declare that the Minor Project work entitled MULTIPLE DISEASE PREDICTION USING MACHINE LEARNING ALGORITHMS has been carried out by us and submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Information Technology under Maharaja Surajmal Institute of Technology, and has not been submitted to any other university for the award of any kind of degree.
ABSTRACT

Machine learning techniques have revolutionized the field of healthcare by enabling accurate and
timely disease prediction. The ability to predict multiple diseases simultaneously can significantly
improve early diagnosis and treatment, leading to better patient outcomes and reduced healthcare
costs.

This research paper explores the application of machine learning algorithms in predicting multiple diseases, focusing on their benefits, challenges, and future directions. We present an overview of various machine learning models and data sources commonly used for disease prediction. Additionally, we discuss the importance of feature selection, model evaluation, and the integration of multiple data modalities for enhanced disease prediction.
The research findings highlight the potential of machine learning in multi-disease prediction and its possible impact on public health. In this work, we train machine learning models on sample data so that they can predict whether a person is affected by a given disease.

Keywords: SVM, ANN, RT, etc.


CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Motivation of the Work
1.3 Problem Statement
CHAPTER 2 LITERATURE SURVEY
CHAPTER 3 METHODOLOGY
3.1 Existing System
3.2 Proposed System
3.2.1 Collection of Dataset
3.2.2 Selection of Attributes
3.2.3 Pre-processing of Data
3.2.4 Balancing of Data
3.2.5 Prediction of Disease
CHAPTER 4 WORKING OF SYSTEM
4.1 SYSTEM ARCHITECTURE
4.2 MACHINE LEARNING
4.3 ALGORITHMS
4.3.1 SUPPORT VECTOR MACHINE
4.3.2 NAIVE BAYES
4.3.3 DECISION TREE
4.3.4 RANDOM FOREST
4.3.5 LOGISTIC REGRESSION
CHAPTER 5 EXPERIMENTAL ANALYSIS
5.1 SYSTEM CONFIGURATION
5.2 SAMPLE CODE
5.3 DATASET DETAILS
5.4 PERFORMANCE ANALYSIS
5.5 PERFORMANCE MEASURES
5.6 INPUT AND OUTPUT
5.6.1 INPUT
5.6.2 OUTPUT
5.7 RESULTS
CHAPTER 6 CONCLUSION AND FUTURE WORK
APPENDIX
REFERENCES
LIST OF FIGURES

S.No.    Figure Description
3.2.1    Collection of Dataset
3.2.2    Correlation Matrix
3.2.3    Preprocessing of Data
3.2.4    Data Balancing
3.2.5    Prediction of Disease
4        System Architecture
4.3.1    Support Vector Machine
4.3.5    Logistic Regression
5.5      Accuracies of Various Algorithms
5.6.1    Input
5.6.2    Output
LIST OF TABLES

Table No.    Topic Name
1            Attribute Table
2            Accuracy Table
LIST OF ABBREVIATIONS

ML Machine Learning

AI Artificial Intelligence

ANN Artificial Neural Networks

SVM Support Vector Machine


CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION
According to the World Health Organization, every year 12 million deaths occur worldwide due to heart disease. Heart disease is one of the biggest causes of morbidity and mortality among the population of the world. Prediction of cardiovascular disease is regarded as one of the most important subjects in the field of data analysis, and the burden of cardiovascular disease has been rapidly increasing all over the world over the past few years. Many studies have been conducted in an attempt to pinpoint the most influential factors of heart disease as well as to accurately predict the overall risk. Heart disease is even highlighted as a silent killer, leading to death without obvious symptoms. The early diagnosis of heart disease plays a vital role in making decisions on lifestyle changes in high-risk patients and, in turn, reduces complications.

Machine learning proves effective in assisting decision-making and prediction from the large quantity of data produced by the healthcare industry. This project aims to predict future heart disease by analyzing patient data and classifying whether patients have heart disease or not using machine learning algorithms. Machine learning techniques can be a boon in this regard. Even though heart disease can occur in different forms, there is a common set of core risk factors that influence whether someone will ultimately be at risk for heart disease or not. By collecting data from various sources, classifying it under suitable headings, and finally analyzing it to extract the desired information, this technique can be very well adapted to the prediction of heart disease.
1.2 MOTIVATION FOR THE WORK

The main motivation of this research is to present a heart disease prediction model for predicting the occurrence of heart disease. Further, this work aims to identify the best classification algorithm for detecting the possibility of heart disease in a patient. This is done through a comparative study and analysis using three classification algorithms, namely Naïve Bayes, Decision Tree, and Random Forest, at different levels of evaluation. Although these are commonly used machine learning algorithms, heart disease prediction is a vital task that demands the highest possible accuracy. Hence, the three algorithms are evaluated at numerous levels and with several types of evaluation strategies. This will help researchers and medical practitioners establish a better basis for choosing a prediction model.

1.3 PROBLEM STATEMENT

This research highlights a significant challenge in the early detection of cardiovascular diseases,
Parkinson's, and diabetes, emphasizing the limitations of current detection instruments in terms of
cost and efficiency. Early detection is crucial to reduce mortality rates and complications associated
with these diseases. However, continuous patient monitoring is impractical, and 24-hour doctor
consultations are not readily available. Leveraging the wealth of data available today, this paper
explores the application of various machine learning algorithms to analyze data and uncover hidden
patterns for predictive health diagnostics across cardiovascular diseases, Parkinson's, and diabetes.
CHAPTER 2
LITERATURE SURVEY

With growing development in the field of medical science alongside machine learning, various experiments and studies have been carried out in recent years, producing the following relevant papers.

[1] Liang H, Tsui BY, Ni H, et al. Evaluation and accurate diagnoses of pediatric diseases using artificial
intelligence. Nat Med. 2019;25(3):433- 438.

[2] Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920-1930.

[3] Rajendra Acharya U, Fujita H, Oh SL, et al. Application of deep convolutional neural network for
automated detection of myocardial infarction using ECG signals. Inf Sci (Ny). 2017;415-416:190-198.

[4] Paniagua JA, Molina-Antonio JD, LopezMartinez F, et al. Heart disease prediction using random forests. J
Med Syst. 2019;43(10):329.

[5] Poudel RP, Lamichhane S, Kumar A, et al. Predicting the risk of type 2 diabetes mellitus using data
mining techniques. J Diabetes Res. 2018;2018:1686023.

[6] Al-Mallah MH, Aljizeeri A, Ahmed AM, et al. Prediction of diabetes mellitus type-II using machine
learning techniques. Int J Med Inform. 2014;83(8):596-604.

[7] Tsanas A, Little MA, McSharry PE, Ramig LO. Nonlinear speech analysis algorithms mapped to a standard
metric achieve clinically useful quantification of average Parkinson's disease symptom severity. J R Soc
Interface. 2012;9(65):2756-2764.

[8] Arora S, Aggarwal P, Sivaswamy J. Automated diagnosis of Parkinson's disease using ensemble machine
learning. IEEE Trans Inf Technol Biomed. 2017;21(1):289-299.

[9] Ahmad F, Hussain M, Khan MK, et al. Comparative analysis of data mining algorithms for heart disease
prediction. J Med Syst. 2019;43(4):101.

[10] Parashar A, Gupta A, Gupta A. Machine learning techniques for diabetes prediction. Int J Emerg Technol
Adv Eng. 2014;4(3):672-675.

[11] Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.

[12] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and
Prediction. 2nd ed. Springer; 2009.

[13] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. J Mach Learn
Res. 2011;12:2825-2830.

[14] McKinney W, van der Walt S, Lamoureux C, et al. Data structures for statistical computing in Python. In:
Proceedings of the 9th Python in Science Conference; 2010.

[15] pickle — Python object serialization. Python documentation.


https://docs.python.org/3/library/pickle.html. Accessed May 26, 2023.

[16] Huang ML, Hung CC, Hsu CY, et al. Predicting ischemic stroke using the Framingham Stroke Risk Score and a simple decision rule in Chinese patients with type 2 diabetes. Diabetes Care. 2010;33(2):427-429.
CHAPTER 3
METHODOLOGY

3.1 EXISTING SYSTEM

Diseases like heart disease, diabetes, and Parkinson's are even highlighted as silent killers that lead to death without obvious symptoms. The nature of these diseases is the cause of growing anxiety about them and their consequences. Hence, continued efforts are being made to predict the possibility of these deadly diseases in advance, and various tools and techniques are regularly being experimented with to suit present-day health needs. Machine learning techniques can be a boon in this regard. Even though these diseases can occur in different forms, there is a common set of core risk factors that influence whether someone will ultimately be at risk or not. By collecting data from various sources, classifying it under suitable headings, and finally analyzing it to extract the desired information, we can draw conclusions; this technique can be very well adapted to the prediction of these diseases. As the well-known quote says, “Prevention is better than cure”: early prediction and control can help prevent and decrease the death rates due to these diseases.

3.2 PROPOSED SYSTEM

The working of the system starts with the collection of data and the selection of important attributes. The required data is then preprocessed into the required format and divided into two parts: training data and testing data. The algorithms are applied and the model is trained on the training data. The accuracy of the system is obtained by testing it on the testing data. The system is implemented using the following modules.

1.) Collection of Dataset


2.) Selection of attributes
3.) Data Pre-Processing
4.) Balancing of Data
5.) Disease Prediction

3.2.1 Collection of dataset


Initially, we collect a dataset for our heart disease prediction system. After collection, we split the dataset into training data and testing data: the training dataset is used for learning the prediction model, and the testing data is used for evaluating it. For this project, 70% of the data is used for training and 30% for testing. The dataset used for this project is Heart Disease UCI. It consists of 76 attributes, of which 14 are used by the system.

Figure: Collection of Data
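The report's code listings survive in this copy only as figure placeholders, so the following is a minimal sketch of this step. The file name heart.csv and the target column name are assumptions, not the report's original code.

# Minimal sketch of dataset collection and the 70/30 split described above.
# Assumes the Heart Disease UCI data is saved locally as "heart.csv"
# with a binary "target" column (1 = disease, 0 = no disease).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")
X = df.drop(columns=["target"])   # the predictor attributes
y = df["target"]                  # the class label

# 70% training data, 30% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)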


3.2.2 Selection of attributes
Attribute or feature selection is the selection of appropriate attributes for the prediction system and is used to increase the efficiency of the system. Various attributes of the patient, such as gender, chest pain type, fasting blood sugar, serum cholesterol, exang, etc., are selected for the prediction. A correlation matrix is used for attribute selection in this model.

Figure: Correlation matrix
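As a hedged illustration (not the report's original listing), correlation-based selection might look like the sketch below; it continues from the dataframe loaded earlier, and the 0.1 threshold is purely illustrative.

# Sketch of attribute selection via a correlation matrix.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)

# Visualize correlations to spot attributes related to the target.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()

# Keep attributes whose absolute correlation with the target exceeds 0.1.
strength = corr["target"].abs().sort_values(ascending=False)
features = strength[strength > 0.1].index.drop("target").tolist()
print(features)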


3.2.3 Pre-processing of Data
Data pre-processing is an important step in the creation of a machine learning model. Initially, data may not be clean or in the required format for the model, which can cause misleading outcomes. In pre-processing, we transform the data into our required format and deal with noise, duplicates, and missing values in the dataset. Data pre-processing includes activities like importing datasets, splitting datasets, attribute scaling, etc. Pre-processing of data is required for improving the accuracy of the model.

Figure: Data Pre-processing
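A minimal pre-processing sketch under the same assumptions (a pandas dataframe df with numeric attributes and a target column) might be:

# Sketch of the pre-processing activities described above.
from sklearn.preprocessing import StandardScaler

df = df.drop_duplicates()                     # remove duplicate rows
df = df.fillna(df.median(numeric_only=True))  # fill missing values

# Attribute scaling: zero mean, unit variance for the predictors.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(columns=["target"]))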

3.2.4 Balancing of Data


Imbalanced datasets can be balanced in two ways: under-sampling and over-sampling.
(a) Under-sampling: the dataset is balanced by reducing the size of the abundant class. This approach is used when the amount of data is adequate.
(b) Over-sampling: the dataset is balanced by increasing the size of the scarce class. This approach is used when the amount of data is inadequate.
Figure: Data Balancing
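A hedged over-sampling sketch using scikit-learn's resample utility (assuming class 1 is the scarce class) is shown below; under-sampling is the mirror image, drawing a smaller sample from the abundant class with replace=False.

# Over-sampling: grow the scarce class to the size of the abundant one.
import pandas as pd
from sklearn.utils import resample

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["target"].value_counts())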

3.2.5 Prediction of Disease


Various machine learning algorithms like SVM, Naive Bayes, Decision Tree, Random Forest, and Logistic Regression are used for classification. A comparative analysis is performed among the algorithms, and the algorithm that gives the highest accuracy is used for heart disease prediction.

Figure: Prediction of Disease
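A minimal sketch of this comparative analysis, reusing the earlier X_train/X_test split (an assumption carried over from the collection step), might be:

# Train each classifier and keep the one with the highest test accuracy.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, "->", best)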


CHAPTER 4
WORKING OF SYSTEM
4.1 SYSTEM ARCHITECTURE
The system architecture gives an overview of the working of the system.
The working of this system is described as follows:

Dataset collection means collecting data that contains patient details. The attribute selection process selects the attributes useful for the prediction of heart disease. After the available data resources are identified, the data is selected, cleaned, and transformed into the desired form. Different classification techniques, as stated, are then applied to the preprocessed data to predict heart disease, and an accuracy measure compares the accuracy of the different classifiers.

Figure: System Architecture

4.2 MACHINE LEARNING


In machine learning, classification refers to a predictive modeling problem where
a class label is predicted for a given example of input data.
● Supervised Learning

Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means input data that is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher. Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

● Unsupervised learning

Unsupervised learning cannot be directly applied to a regression or classification


problem because unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the underlying
structure of dataset, group that data according to similarities, and represent that dataset
in a compressed format.

● Unsupervised learning is helpful for finding useful insights from the data.

● Unsupervised learning is much more similar to how a human learns to think by their own experiences, which makes it closer to real AI.

● Unsupervised learning works on unlabelled and uncategorized data, which makes it even more important.

● In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.

Reinforcement learning

Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that supervised training data comes with the answer key, so the model is trained with the correct answer itself, whereas in reinforcement learning there is no answer: the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.

4.3 ALGORITHMS

4.3.1 SUPPORT VECTOR MACHINE (SVM)


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a
hyperplane. SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed as Support Vector Machine.

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, though generally they are applied to classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. They have their own unique way of implementation compared to other machine learning algorithms and have lately become extremely popular because of their ability to handle multiple continuous and categorical variables.

The following are important concepts in SVM:

Support Vectors - Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

Hyperplane - A decision plane or space that separates a set of objects belonging to different classes.

Margin - The gap between the two lines on the closest data points of different classes, calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin, and a small margin is considered a bad margin.

Types of SVM:

SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

● Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.

The advantages of support vector machines are:

● Effective in high-dimensional spaces.

● Still effective in cases where the number of dimensions is greater than the number of samples.

● Uses a subset of training points in the decision function (called support vectors), so it is also memory-efficient.

● Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

● If the number of features is much greater than the number of samples, avoiding over-fitting in the choice of kernel functions and regularization term is crucial.

● SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Figure: Support Vector Machine
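As a hedged sketch (not the report's original listing), an SVM classifier for the heart data could be trained as follows, reusing the earlier split; scaling is included because SVMs are sensitive to feature ranges.

# Minimal SVM sketch with scikit-learn.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# probability=True enables predict_proba via the (expensive)
# internal five-fold cross-validation mentioned above.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
svm_clf.fit(X_train, y_train)
print("SVM accuracy:", svm_clf.score(X_test, y_test))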

4.3.2 NAIVE BAYES ALGORITHM


Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems. It is mainly used in text
classification that includes a high-dimensional training dataset.

The Naive Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

It is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

The Naive Bayes model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.

The Naive Bayes algorithm comprises two words, Naive and Bayes, which can be described as follows:
● Naive: It is called naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
● Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes’s theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.

The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the Likelihood: the probability of the evidence given that hypothesis A is true.

P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the Marginal probability: the probability of the evidence.

Types of Naive Bayes model:

There are three types of Naive Bayes Model, which are given below:

● Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.

● Multinomial: The Multinomial Naive Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as sports, politics, education, etc. The classifier uses the frequency of words as the predictors.

● Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
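Since the heart-disease attributes are largely continuous, the Gaussian variant is the natural choice here; a minimal sketch under the earlier assumptions is:

# Gaussian Naive Bayes: continuous attributes modelled as normal.
from sklearn.naive_bayes import GaussianNB

nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
print("Naive Bayes accuracy:", nb_clf.score(X_test, y_test))

# predict_proba exposes the per-class posterior P(A|B).
print(nb_clf.predict_proba(X_test[:1]))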

4.3.3 DECISION TREE ALGORITHM


Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome. A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches. The decisions or tests are performed on the basis of the features of the given dataset. A decision tree is a graphical representation for getting all the possible solutions to a problem or decision based on given conditions. It is called a decision tree because, similar to a tree, it starts at a root node and expands further branches to construct a tree-like structure. To build the tree, we use the CART algorithm, which stands for Classification and Regression Tree. A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

The Decision Tree Algorithm belongs to the family of supervised machine learning
algorithms. It can be used for both a classification problem as well as for a regression
problem.

The goal of this algorithm is to create a model that predicts the value of a target
variable, for which the decision tree uses the tree representation to solve the problem
in which the leaf node corresponds to a class label and attributes are represented on the
internal node of the tree.
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision Tree:

● Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.

● The logic behind the decision tree can be easily understood because it shows a
tree-like structure.

In decision trees, the major challenge is to identify the attribute to place at the root node at each level.
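A minimal decision tree sketch under the earlier assumptions follows; max_depth=4 is an illustrative choice to keep the tree readable, not a tuned value.

# CART-style decision tree with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

dt_clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
dt_clf.fit(X_train, y_train)
print("Decision Tree accuracy:", dt_clf.score(X_test, y_test))

# Print the learned rules: internal nodes test attributes, leaves hold classes.
print(export_text(dt_clf, feature_names=list(X_train.columns)))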

4.3.4 RANDOM FOREST ALGORITHM

Random Forest is a supervised learning algorithm. It is an extension of machine learning classifiers that uses bagging to improve the performance of decision trees. It combines tree predictors, where each tree depends on an independently sampled random vector and all trees share the same distribution. Random Forests split nodes using the best among a randomly chosen subset of predictors at that node, instead of splitting on the best of all variables. The worst-case time complexity of learning with Random Forests is O(M(d n log n)), where M is the number of trees grown, n is the number of instances, and d is the data dimension.

It can be used both for classification and regression. It is also the most flexible and
easy to use algorithm. A forest consists of trees. It is said that the more trees it has, the
more robust a forest is. Random Forests create Decision Trees on randomly selected
data samples, get predictions from each tree and select the best solution by means of
voting. It also provides a pretty good indicator of the feature importance.

Random Forests have a variety of applications, such as recommendation engines, image classification, and feature selection. They can be used to classify loyal loan applicants, identify fraudulent activity, and predict diseases. Random Forest lies at the base of the Boruta algorithm, which selects important features in a dataset.

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, outputs the final prediction. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

Assumptions:

Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:

● There should be some actual values in the feature variables of the dataset so that the classifier can predict accurate results rather than guessed ones.

● The predictions from each tree must have very low correlations.

Algorithm Steps:

It works in four steps:

● Select random samples from the given dataset.

● Construct a decision tree for each sample and get a prediction result from each tree.

● Perform a vote for each predicted result.

● Select the prediction result with the most votes as the final prediction.
Advantages:

● Random Forest is capable of performing both classification and regression tasks.

● It is capable of handling large datasets with high dimensionality.

● It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages:

Although Random Forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
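A minimal random forest sketch following the four steps above (bootstrap samples, one tree per sample, majority vote) might be:

# Random forest with a random predictor subset at each split.
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=100,     # number of trees grown (M)
    max_features="sqrt",  # randomly chosen predictor subset per split
    random_state=42,
)
rf_clf.fit(X_train, y_train)
print("Random Forest accuracy:", rf_clf.score(X_test, y_test))

# The feature-importance indicator mentioned above.
for name, imp in zip(X_train.columns, rf_clf.feature_importances_):
    print(f"{name}: {imp:.3f}")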

4.3.5 LOGISTIC REGRESSION ALGORITHM


Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of giving an exact value of 0 or 1, it gives probabilistic values lying between 0 and 1.

Logistic regression is much like linear regression except in how it is used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). The curve of the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.

Logistic regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.

Advantages:
Logistic Regression is one of the simplest machine learning algorithms and is easy to
implement yet provides great training efficiency in some cases. Also due to these
reasons, training a model with this algorithm doesn't require high computation power.

The predicted parameters (trained weights) give inference about the importance of
each feature. The direction of association i.e. positive or negative is also given. So we
can use Logistic Regression to find out the relationship between the features.
This algorithm allows models to be updated easily to reflect new data, unlike Decision
Tree or Support Vector Machine. The update can be done using stochastic gradient
descent.

Logistic regression outputs well-calibrated probabilities along with classification results. This is an advantage over models that only give the final classification as the result. If one training example has a 95% probability for a class and another has a 55% probability for the same class, we get an inference about which training examples are more accurate for the formulated problem.

Disadvantages:
Logistic regression is a statistical analysis model that attempts to predict precise probabilistic outcomes based on independent features. On high-dimensional datasets, this may lead to the model over-fitting the training set, overstating the accuracy of predictions on the training set, so the model may fail to predict accurate results on the test set. This usually happens when the model is trained on little training data with many features. On high-dimensional datasets, regularization techniques should therefore be considered to avoid over-fitting (though this makes the model more complex). Very high regularization factors may instead lead to the model under-fitting the training data.
Non-linear problems cannot be solved with logistic regression since it has a linear decision surface. Linearly separable data is rarely found in real-world scenarios, so a transformation of non-linear features is required, for instance by increasing the number of features so that the data becomes linearly separable in higher dimensions.

Non-Linearly Separable Data:

It is difficult to capture complex relationships using logistic regression. More powerful and complex algorithms, such as neural networks, can easily outperform it.

Figure: Logistic Regression
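A minimal logistic regression sketch under the earlier assumptions is shown below; the pipeline scales the attributes, and predict_proba returns the calibrated probabilities discussed above.

# Logistic regression: hard labels plus well-calibrated probabilities.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lr_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_clf.fit(X_train, y_train)
print("Logistic Regression accuracy:", lr_clf.score(X_test, y_test))

# Probability of heart disease (class 1) for the first five test rows.
print(lr_clf.predict_proba(X_test[:5])[:, 1])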
CHAPTER 5
EXPERIMENTAL ANALYSIS

5.1 SYSTEM CONFIGURATION

5.1.1 Hardware requirements:

Processor : Any modern processor
RAM : Min 4 GB
Hard Disk : Min 100 GB

5.1.2 Software requirements:

Operating System : Windows family
Technology : Python 3.7
IDE : Jupyter Notebook

5.2 SAMPLE CODE


For Diabetes:
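The original listings appear to have been screenshots and were not preserved in this copy. A hedged sketch of the kind of pipeline Chapter 3 describes is given below; the file diabetes.csv, its Outcome column, and the three-feature subset are assumptions, and the same pattern applies to the heart and Parkinson's models listed next.

# Sketch: train a diabetes model and persist it for the web app.
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")
features = ["Glucose", "BMI", "Age"]   # illustrative subset
X, y = df[features], df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = SVC()
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Persist the trained model with pickle (see reference [15]).
with open("diabetes_model.sav", "wb") as f:
    pickle.dump(model, f)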

For Heart Disease:

For Parkinson's:
Web App using Streamlit:
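Again as a hedged sketch rather than the original listing, a minimal Streamlit front end that loads the pickled model above could look like this (run with "streamlit run app.py"; the input fields must match the training features assumed earlier):

# Sketch of a Streamlit web app for the diabetes model.
import pickle

import streamlit as st

model = pickle.load(open("diabetes_model.sav", "rb"))

st.title("Multiple Disease Prediction")
glucose = st.number_input("Glucose level", min_value=0.0)
bmi = st.number_input("BMI", min_value=0.0)
age = st.number_input("Age", min_value=1, step=1)

if st.button("Predict"):
    # Feature order must match the training data: Glucose, BMI, Age.
    pred = model.predict([[glucose, bmi, age]])[0]
    st.write("Diabetic" if pred == 1 else "Not diabetic")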
CHAPTER 6
CONCLUSION AND FUTURE WORK
APPENDIX

Python

Python is an interpreted, high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.

Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modeling including classification, regression, clustering and dimensionality reduction
via a consistent interface in Python. This library, which is largely written in Python, is
built upon NumPy, SciPy and Matplotlib.

Numpy
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.

Librosa
Librosa provides the building blocks necessary to create music information retrieval systems. It helps to visualize audio signals and also perform feature extraction on them using different signal processing techniques.
Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged.

Seaborn

Seaborn is a Python data visualization library based on matplotlib and closely integrated with pandas data structures. It provides a high-level interface for drawing attractive and informative statistical graphics, and it is predominantly used for making statistical graphics. Visualization is the central part of Seaborn and helps in the exploration and understanding of data.

SciPy

SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering. SciPy is also a family of conferences for users and developers of these tools: SciPy (in the United States), EuroSciPy (in Europe), and SciPy.in (in India). Enthought originated the SciPy conference in the United States and continues to sponsor many of the international conferences as well as host the SciPy website. SciPy is a scientific computation library that uses NumPy underneath and provides additional utility functions for optimization, statistics, and signal processing.

Streamlit
Streamlit is a free and open-source framework to rapidly build and share beautiful
machine learning and data science web apps. It is a Python-based library specifically
designed for machine learning engineers.
