Machine Learning For The Prediction of Infectious Diseases: Introduction
Abstract—Infectious diseases harm both the health and economic sectors of any society. By predicting an outbreak of such diseases, the appropriate parties can better prepare and, in certain cases, avoid the outbreak altogether. For this reason, we looked into how machine learning is used to predict infectious diseases. Through this research, we looked into the current models that use structured data and how they compare with other similar infectious disease prediction algorithms. Following this, we looked into how algorithms can use unstructured data to achieve the same goal. When comparing both approaches, it was evident that unstructured data can positively impact this field of study, given the high prediction rates obtained. However, further research that uses unstructured data as its basis for prediction is required.

Index Terms—Machine Learning, Prediction, Detection, Disease Outbreak, Infectious Diseases

I. INTRODUCTION

In recent years, a pandemic struck worldwide, which to this day is still ongoing. As a result, COVID-19, categorized as an infectious disease, has killed up to approximately 10.5 million people [1]. Fortunately, through the advancements of technology, more specifically Artificial Intelligence (AI), we were able to find a vaccine in a shorter period when compared to Influenza, another infectious disease that caused a pandemic. Furthermore, AI aided the health sector during COVID-19 [2], [3], for example by screening and evaluating COVID-19 on CT images [4] and by helping combat COVID-19 through its entire life cycle [5].

Apart from the negative impact on health, we observe that both diseases negatively impacted different economic sectors. For example, “Spanish flu reduced real GDP per capita by around 6 per cent in the typical country over the period 1918–21” [6]. Moreover, recent statistics [7] show that COVID-19 has had severe negative consequences on the economic and financial sectors.

Although both are generic phrases used every day, AI and Machine Learning (ML) differ. AI is a technology that allows us to create intelligent systems that can simulate human intelligence. In contrast, ML, a subfield of AI, enables machines to learn from past experiences without explicit programming. Throughout the past two decades, we slowly integrated AI and ML into our day-to-day life, from the health and financial sectors, such as hospitals and banks, to the leisure sectors, such as gaming and betting companies. These companies are turning to AI to solve, and aid in, problems that were previously unsolvable. Bioinformatics and disease detection are no exceptions to this. By implementing these technologies within the biological domain, as clearly shown in the imaging techniques used to diagnose COVID-19 [4], we gain a deeper understanding of what we currently know. With the help of artificial intelligence, we can tackle highly complex problems, such as disease prediction and possible cures for said diseases. Furthermore, artificial intelligence algorithms allow us to be proactive rather than reactive, simply by modelling and predicting future events using what we have observed and learnt from history. This type of information minimizes negative impact and allows us to avoid unnecessary danger.

The main goal of this research is to see how ML can be used to predict infectious diseases. We will start by reviewing and analyzing the approaches studied in the systematic review [10]. Said research will give us an insight into the research between 2010 and 2020 and how models compare against each other. Afterwards, using this newly gained information, we will look into [8]. This type of research provides an implementation of the previous research for a specific disease, thus providing elaborate detail about the prediction of outbreaks. Finally, this paper compares algorithms and their heuristics.

We will start by looking in detail into how we selected the literature for this paper. Following this, we will discuss our results, and finally, we will discuss our findings and identify any gaps in the current knowledge.

II. METHOD

We carried out a methodical search to find relevant sources that provide critical insight into our research question. Our field of research is very diverse by nature. Therefore, we dedicated ample time to coming up with suitable search criteria. We started our research by deriving a list of critical keywords from our objective. Below is the list of these keywords:
• Predict / Prediction
• Detect / Detection
• Machine Learning
• Infectious disease(s)
• Review
These were then formulated into the following query string:
(“Predict” OR “Prediction”) AND (“Detect” OR “Detection”) AND (“Machine Learning”) AND (“Infectious disease” OR “Infectious diseases”) AND “Review”
Using the Technical University of Delft Library Website as our search database, our complete query string returned 295 results. Even though these are ample, we decided to narrow down our results. Therefore, we experimented with the different abbreviations of the keywords to get a dataset
that related more to our primary objective. After numerous attempts, the following query string yielded the best and most concise results:
“predict” AND “detect” AND “machine learning” AND “infectious disease” AND “review”
This amounted to a total of 35 results. No other formatting or filtering was applied through the university’s website.

For all 35 sources, the title was initially reviewed to see if it aimed to tackle problems of interest to us. An example of this is how ML was used to predict the outbreak. The source was disregarded if the title did not contribute or was of little relevance to our main question. Otherwise, it was saved locally for later review.

After reviewing the titles of all the sources, we read and evaluated the abstract, introduction, and conclusion of any source we thought could contribute to our research. After reviewing said source, if it was relevant to us and our research, an entry for the reference was recorded in EndNote, along with a short note highlighting critical points of the research and any essential characteristics.

III. RESULTS

Within this section, we will discuss the papers found using the method described in the Method section. To facilitate understanding, we will be splitting our results into subsections. The first subsection will tackle the systematic review [10]. This review gives us an in-depth analysis of the Machine Learning models currently available, how these are used within the disease detection domain and any drawbacks found in these models. In the second subsection, we will be looking at a specific approach and reviewing the papers relating to that research.

A. Systematic Review

Within the systematic review [10], the authors illustrated numerous ML models. However, as stated, their targeted algorithms and evaluation criteria were those that can detect and predict the spread and outbreak of a disease. The review includes references from 2010 until 2020, and these were found using the search terms “Predicting Disease Outbreaks” and “Detecting Disease” using Machine Learning.

In order to select the appropriate research and models, the authors created a model split into five stages. Through this model, they filtered through multiple papers and journals found in Scopus to obtain diverse yet detailed research allowing them to conduct this study. After methodically reviewing the relevant research, the authors tabulated their findings based on each paper’s research question.

They graded the research according to whether said research question aligned with that of the systematic review, using a scoring equation in which:
• AQ is the relevance of assessment, defined by the authors and illustrated in Table I

TABLE I
QUALITY ASSESSMENT QUESTIONS, ADAPTED FROM [10, P. 3]

ID     Assessment Question
AQ1    Does the study define a main research objective or problem related to the spread of deadly disease outbreaks (e.g., prediction, detection, responses)?
AQ2    Does the study specify the relevant disease datasets used?
AQ3    Does the study specify the availability of these datasets (e.g., public datasets, private datasets)?
AQ4    Does the study define the parameters or variables used or learnt by the machine learning algorithms?
AQ5    Does the study define the type of parameters used or learnt by the machine learning algorithms?
AQ6    Does the study specify the type of machine learning models used (e.g., classification, regression, clustering) in solving the problem?
AQ7    Does the study specify the individual models explicitly (e.g., neural network, linear regression)?
AQ8    Does the study specify the evaluation measures (e.g., Accuracy, Precision, Recall, F-Measure, ROC) used to assess the performance of the proposed machine learning approach?
AQ9    Does the study specify the evaluation approaches (e.g., cross-validation, holdout) used to assess the performance of the proposed machine learning approach?
AQ10   Does the study specify the ensemble models (e.g., bagging, boosting) used and compare the performance with individual models?

For their purposes, Rayner Alfred and Joe Henry Obit only used research with a score greater than or equal to 0.65, denoted as good. Following the filtering and evaluation of the papers and journals used, they proceeded to summarize and categorize the relevant content of each paper at a technical level, that is, datasets and parameters, problems addressed, assessment measures and methods used.

Finally, the authors summarized their findings and listed the observations made from this research. They noted the following observations and conclusions:
• Further research needs to be conducted on ensemble/hybrid models based on deep learning methods using multi-source data, primarily because these have demonstrated an improvement in the base model’s performance.
• By integrating multi-source data, a deeper understanding of a particular disease outbreak is obtainable.
• Analysis of complex relationships between multi-source data can produce better modelling results.
• There is a limited number of studies revolving around unstructured data (e.g., blogs, websites and news articles), even though these have demonstrated improvements in the prediction of disease outbreaks.
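
The quality filter described in this subsection, which keeps only studies whose score is at least 0.65, can be illustrated with a minimal Python sketch. The scoring equation from [10] is not reproduced here, so the sketch assumes, purely for illustration, that each of the ten assessment questions in Table I is answered with a value between 0 and 1 and that a study's score is the mean of those answers. The names `study_score` and `select_good_studies`, as well as the example study labels, are hypothetical.

```python
from statistics import mean

GOOD_THRESHOLD = 0.65  # threshold stated in the text ("denoted as good")
NUM_QUESTIONS = 10     # AQ1..AQ10 from Table I


def study_score(aq_answers):
    """Return a study's quality score.

    ASSUMPTION (not from the reviewed paper): the score is the mean of the
    ten per-question answers, each expressed as a value in [0, 1].
    """
    if len(aq_answers) != NUM_QUESTIONS:
        raise ValueError(f"expected {NUM_QUESTIONS} answers, got {len(aq_answers)}")
    if any(not 0.0 <= a <= 1.0 for a in aq_answers):
        raise ValueError("each answer must lie in [0, 1]")
    return mean(aq_answers)


def select_good_studies(studies, threshold=GOOD_THRESHOLD):
    """Keep the names of studies whose score is >= threshold."""
    return [name for name, answers in studies.items()
            if study_score(answers) >= threshold]


# Hypothetical example data: two made-up studies and their AQ answers.
example_studies = {
    "study_a": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # mean 0.8, kept
    "study_b": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # mean 0.5, dropped
}
good = select_good_studies(example_studies)
```

Under these assumptions, `good` contains only `study_a`. A different scoring rule would only require replacing the body of `study_score`; the thresholding step stays the same.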