M Akaba 2019
Abstract— Dealing with missing values in data is an important feature engineering task in data science to prevent negative impacts on machine learning classification models in terms of accurate prediction. However, it is often unclear what the underlying cause of the missing values in real-life data is, or rather which missing data mechanism is causing the missingness. Thus, it becomes necessary to evaluate several missing data approaches for a given dataset. In this paper, we perform a comparative study of several approaches for handling missing values in data, namely listwise deletion, mean, mode, k-nearest neighbors, expectation-maximization, and multiple imputation by chained equations. The comparison is performed on two real-world datasets, using the following evaluation metrics: accuracy, root mean squared error, receiver operating characteristics, and the F1 score. Most classifiers performed well across the missing data strategies. However, based on the results obtained, the support vector classifier overall performed marginally better for the numerical data, and the naïve Bayes classifier for the categorical data, when compared to the other evaluated classifiers.

Keywords — missing data; imputation methods; performance metrics; machine learning; classification

I. INTRODUCTION

Approaches to dealing with missing data have been well researched in the literature, using either statistical [1], [2] or computational intelligence (such as machine learning (ML)) [3], [4] approaches. Missing values in data are broadly categorized into three missingness mechanisms [1], [2]: data missing completely at random (MCAR), when the probability of an instance or variable having a missing value depends neither on the known value itself nor on any other value or variable in the given dataset; data missing at random (MAR), when the probability of an instance or variable having a missing value depends on other known variables but not on the value of the missing data itself; and data missing not at random (MNAR), when the probability of an instance or variable having a missing value depends on the value of that variable itself.

Missing data are now a common problem in many real-world datasets across numerous domains, such as fraud detection, sensor readings and anomaly detection. The missingness can be attributed to numerous sources and reasons, such as measurement error, mechanical faults, non-response or deletion of values [5]. Missing data, if not addressed during the preprocessing stage prior to feeding the data into an ML model, could complicate the data analysis and affect the performance of ML algorithms in terms of the conclusions that can be inferred from the data, because of reduced data samples and bias in the estimation of the algorithms' parameters. Numerous missing data handling techniques have been developed [6], which can be broadly categorized as listwise or case deletion, single imputation and multiple imputation. Researchers continue to develop enhanced variants. On the other hand, some researchers have carried out comparative evaluations of the current missing data techniques to provide more insight and guidance on the choice of technique, depending on the percentage, pattern and mechanism underlying the missingness in a dataset [3], [5], [7]-[10].

This study compares six missing data-handling methods, namely listwise deletion (LD), mean, mode, k-nearest neighbors (k-NN), expectation-maximization single imputation (EMSI) and multiple imputation by chained equations (MICE), on six ML algorithms: logistic regression (LR), k-NN, support vector machine (SVM), random forest (RF), naïve Bayes (NB) and artificial neural network (ANN). Two real-life datasets are used, evaluated on the following performance metrics: accuracy, root mean squared error (RMSE), receiver operating characteristics (ROC) and the F1-score.

The rest of the paper is organized as follows: Section II reviews the missing data methods and imputation strategies, and Section III the classifiers employed in this study. Section IV surveys related work. Section V outlines the study methodology, which comprises the experimental set-up, the datasets used, and the performance metrics for evaluation. Section VI presents the results achieved and a discussion of these. Finally, Section VII concludes the paper.

II. MISSING DATA METHODS

The term missing data refers to the absence of records, values or observations usually expected to be present in a dataset. Missing data strategies are broadly categorized into three: (1) filling with zero, or ignoring, deleting or dropping data with missing values, (2) single imputation strategies and (3) multiple imputation strategies. Four of the methods used in this study are based on single imputation, while one is based on multiple imputation methods (IM). The methods considered in this study are briefly described as follows:

A. Listwise Deletion

LD is a statistical method that handles missing data by deleting or ignoring any record with missing values in a dataset, thus excluding these records from the analysis. Only the complete data are retained, which can result in biased estimations. This method is also referred to as complete-case analysis and assumes that data are MCAR [8].
Authorized licensed use limited to: University College London. Downloaded on May 23,2020 at 09:30:39 UTC from IEEE Xplore. Restrictions apply.
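As a concrete illustration, listwise deletion can be sketched in Python with pandas (our choice of library for the sketch; the paper does not prescribe an implementation, and the toy records below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy records in the spirit of a water quality table; NaN marks a missing value.
df = pd.DataFrame({
    "ph":        [7.1, np.nan, 6.8, 7.4],
    "turbidity": [1.2, 0.9, np.nan, 1.1],
    "label":     [0, 1, 0, 1],
})

# Listwise (case) deletion: drop every record that contains at least one
# missing value, so only complete cases enter the analysis.
complete_cases = df.dropna()
print(complete_cases.shape[0])  # 2 of the 4 records survive
```

Note how quickly records are lost: half the rows are discarded here, which is exactly the sample-size reduction and potential estimation bias the method is criticized for.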
B. Imputation Methods

Imputation is an approach to handling missing data by estimating the missing values in a dataset. IM can be subdivided into single and multiple IM. The methods considered in this paper are briefly described as follows:

1) Mean/Mode: This method consists of replacing the missing data for a given variable by the mean or mode of all known values of that variable. Generally, the mean method is suitable for numerical variables and the mode for categorical variables. Mean or mode imputation usually assumes MCAR [1].

2) k-Nearest Neighbors: k-NN defines a set of nearest neighbors for each sample and then replaces the missing data for a given variable by averaging the (non-missing) values of its neighbors. The size of the dataset to be analyzed and the optimal k value are crucial for this method. k-NN usually assumes data are MCAR [8].

3) Expectation maximization (EM): EM is an iterative means of imputing one or more plausible values for the missing data (EM single or multiple imputation), resulting in a complete new dataset through a repeated procedure [2], [11]. EM usually assumes that data are MAR.

4) Multiple imputation by chained equations: The MICE method is an iterative algorithm based on chained equations that uses an imputation model specified separately for each variable, involving the other variables as estimators. MICE is a multiple imputation method that involves imputing missing values in a dataset not once, but many times [1]. MICE usually assumes that data are MAR.

The criteria and justification for choosing the missing data methods are based on their popularity and how often they have been cited and used in the literature, as suggested in Table 1.

III. MACHINE LEARNING MODELS

The six classifiers are selected based on their different forms of learning methods. This ensures a broader consideration of families of algorithms according to their learning philosophies: linear, density-based, instance-based, tree-based and neural network-based models [12]. These allow a robust assessment of the missing data methods.

1) Logistic Regression (LR): LR is a linear classifier that calculates the linear output, followed by a squashing function over the regression output. LR is an easy, fast and simple ML method.

2) k-Nearest Neighbors: The k-NN classifier is an instance-based method in which a new query instance is classified according to the majority category of its k nearest neighbors, using the Euclidean distance. The basic logic of k-NN is to explore the nearest neighbors by assigning an initial neighborhood size k [13]. One of the main advantages of k-NN is that it is an easy and simple ML algorithm.

3) Support Vector Machine: SVM is a supervised ML algorithm that uses a technique called the kernel trick to transform the dataset, and from the transformation it finds the best boundary between the possible results.

4) Random Forest (RF): The RF model is an ensemble, tree-based learning method that can be used to build predictive models. It combines a number of decision tree classifiers and averages their predictive accuracy, in the process improving the overall model performance. Ensemble learning uses multiple learning models to gain better predictive results [12].

5) Naïve Bayes: The NB classifier is a probabilistic learning technique based on the Bayes theorem, which assumes features are statistically independent. NB uses prior knowledge to calculate the probability of a sample belonging to a certain category [12].

6) Artificial Neural Networks: An ANN examines the relationship between inputs and outputs by using the training dataset, without much detail about the system; it mimics the workings of the human brain [12].

IV. RELATED WORK

A considerable number of research articles deal with missing values across several domains. Some of the earlier works focused on developing enhanced missing data IM, such as [4], while others focused on a comparative analysis of existing missing data methods on different ML algorithms, such as [3], [7], [14]. Most of the articles apply single imputation strategies in dealing with missing values, since it is very often unclear what the underlying causes of missing values in any given data are, and it is hard to know in advance which missing value method is ideal for a given dataset or problem [10]. In addition, applying missing data imputation is likely to distort variable distributions and associated interactions, and in a way also affects the ML model. It is for this reason that we conduct an experimental comparison of several missing data approaches on our real-world datasets against different ML classification algorithms. In this way we can gain valuable insights into the biases introduced by these missing value strategies and how they affect different classification algorithms for our given datasets. From the summary of related works outlined in Table 1, it appears that the following missing data methods are the most popularly used: mean/mode, k-NN, EM and multiple imputations such as MICE.

V. STUDY METHODOLOGY

A. Experimental Set-up

The aim of this experiment was to carry out a comparative analysis and evaluate the impact of six missing data-handling methods against six ML classification algorithms, with four performance metrics, using two real-world datasets.
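The imputation strategies of Section II can be sketched with scikit-learn (an assumption on our part, consistent with the Python environment used in this study; `IterativeImputer` serves here only as a MICE-style chained-equations stand-in, and the toy arrays are invented for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[7.1, 1.2],
              [np.nan, 0.9],
              [6.8, np.nan],
              [7.4, 1.1]])

# Mean single imputation: each hole is filled with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-NN imputation: average the feature over the k nearest rows (here k=2).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Chained-equations imputation: model each column on the others, iterating.
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# Mode (most frequent) imputation suits a categorical attribute.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X_cat)
```

There is no direct scikit-learn routine for EMSI; an EM-based imputer would be a separate implementation, which is why it is omitted from this sketch.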
TABLE 1. SUMMARY OF RELATED WORKS
Our experiments were conducted in ‘SPyDER’ (Scientific Python Development EnviRonment) on the Anaconda Python distribution, each time using one missing data method to test the chosen ML algorithms. The experimental simulation is a three-way repeated-measures strategy, which allows the main effect factors (6 classifiers, 6 missing data methods and 4 performance metrics) to be evaluated against interaction with the random effect factor (numerical and categorical datasets). Throughout the experimentation, we kept the default settings of the presented classifiers. However, for the categorical data, we only considered the LD and most frequent (mode) missing data strategies, because of the size of the dataset, the number of missing values, and our observation that the k-NN, EMSI and MICE strategies did not show much difference on the numeric dataset, as shown in Table 3.

B. Dataset

The experiments were carried out using two real-life datasets, namely the Gauteng road traffic and water quality datasets. The characteristics of the datasets are summarized in Table 2.

TABLE 2. CHARACTERISTICS OF THE DATASETS

Dataset              | Data Type            | Instances | Attributes | Class | Missing values | Missing values %
Gauteng road traffic | Nominal categorical  | 672       | 4          | 3     | 21             | 3.12
Water quality data   | Continuous numerical | 1000      | 9          | 2     | 200            | 20
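The repeated-measures design (missing data methods crossed with classifiers and scored on several metrics) can be sketched as a simple evaluation grid. The synthetic data and the two-method, two-classifier subset below are our own illustration, not the paper's datasets or exact protocol:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in dataset with roughly 10% MCAR missingness injected.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

imputers = {"mean": SimpleImputer(strategy="mean"),
            "knn":  KNNImputer(n_neighbors=5)}
classifiers = {"LR": LogisticRegression(max_iter=1000),
               "NB": GaussianNB()}

results = {}
for im_name, imputer in imputers.items():
    X_full = imputer.fit_transform(X)  # one missing data method at a time
    X_tr, X_te, y_tr, y_te = train_test_split(X_full, y, random_state=0)
    for clf_name, clf in classifiers.items():
        y_pred = clf.fit(X_tr, y_tr).predict(X_te)
        results[(im_name, clf_name)] = (accuracy_score(y_te, y_pred),
                                        f1_score(y_te, y_pred))
```

Extending the dictionaries to all six methods, all six classifiers and all four metrics reproduces the full grid of the study design.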
C. Performance Metrics

The following performance metrics were used to evaluate the performance of the models after applying the missing data methods: accuracy, RMSE, ROC and F1-score. The four chosen metrics are the most popular methods used for evaluating classification ML algorithms [17].

VI. RESULTS AND DISCUSSION

Table 3 and Figure 1 show the results for the numerical water quality data, while Table 4 and Figure 2 show the results for the categorical Gauteng road traffic data. The results report the performance of the examined classifiers under the different missing data methods with a constant percentage of missing values. The following is observed:

With regard to the numerical data, generally all classifiers performed well across the different missing data strategies used in this study. However, overall SVC performed consistently and slightly better in terms of all the performance metrics evaluated, with the NB classifier showing the marginally lowest performance except when using the LD and mode methods. In addition, LD, mean and mode performed well across all the classifiers compared to the more advanced k-NN and EMSI. The reasons for their performance, apart from ease of implementation, are the low occurrence of missing values in the numerical dataset and variance reduction. Moreover, we observed that the MICE method performed well for all the classifiers. One possible reason is that it takes into account the uncertainties resulting from guesses created by other IM, by taking into cognizance all the available information from other variables in the data and averaging their results for better estimates of the unknown true missing value. It can thus provide more valid standard errors, p-values and final inferences. However, computational cost is one of MICE's drawbacks.

With regard to the categorical data, overall the NB classifier seems to perform slightly better on both the LD and mode strategies in comparison to the other classifiers. One reason for this is that, generally, NB performs well on a smaller dataset with a low missing rate. On the other hand, ANN had the lowest RMSE for the LD and mode methods in comparison to all the other classifiers, indicating a better fit of the ANN model and better classification accuracy. Furthermore, all the classifiers examined performed slightly better with the mode strategy than with the LD method. Because data are lost when using the LD method, complexity could be added in terms of variance and bias. In general, we observed that the results obtained varied depending on the classifier, the type of data (numerical or categorical), and the percentage of missing values. This means that no single missing data method is superior or fits all dataset types and problems. We have seen in our case that results varied with both the numerical and categorical datasets, for reasons such as how correlated the attributes are, the data distribution pattern, the data size, the missing value rate and the data type. Different missing value methods induce biases, particularly if the methods are based on certain assumptions, as pointed out earlier in Section II.
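The pooling behaviour credited to MICE above (several imputations averaged into one estimate of the unknown value) can be sketched with repeated stochastic draws. Using scikit-learn's `IterativeImputer` with `sample_posterior=True` as the draw mechanism is our assumption, and full MICE inference would also pool variances, not just point estimates; the data below are synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 2] += X[:, 0]                     # make column 2 predictable from column 0
X[rng.random(200) < 0.2, 2] = np.nan   # roughly 20% missingness in column 2

# Draw m stochastic imputations and average the filled-in values
# (Rubin-style pooling of the point estimates).
m = 5
draws = [IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
         for s in range(m)]
X_pooled = np.mean(draws, axis=0)      # observed entries are left untouched
```

Averaging over draws is what smooths out the single-guess uncertainty, at the computational cost of running the imputer m times, matching the MICE trade-off noted in the discussion.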
[Results table fragment (column headers lost in extraction):]
RF   1.00  1.00  1.00  1.00  0.994   1.00
ANN  1.00  1.00  1.00  1.00  0.9825  1.00
Fig. 2. Performance results of the ML classifiers vs the missing data strategies on Dataset 2 (categorical)
VII. CONCLUSION

The aim of this work was to evaluate the performance of six ML classifiers under different missing data strategies, using numerical and categorical datasets. We observed a very marginal difference in overall performance across all the classifiers. However, SVC performed marginally better for the numerical dataset, while the NB classifier did the same for the categorical dataset, across the missing data methods examined. In addition, ANN had the lowest RMSE of all the classifiers on the categorical dataset, indicating a better fit of the ANN model. Nonetheless, for the categorical dataset, we noticed slightly improved performance by the classifiers with the mode method in comparison to the LD method. We intend to test other missing value strategies, including ML-based missing data methods, in the future, using larger datasets and different missing value rates. The authors would like to pay detailed attention to employing ML approaches to handling missing data, statistical quantification of biases, and sensitivity analysis for the missing data strategies as areas of interest in future work. Finally, our preliminary submission is that knowing the cause of missing values in a dataset is key to tackling the missingness problem, since the missing value methods are based on certain assumptions.

ACKNOWLEDGMENT

We would like to thank the University of Johannesburg for funding and making the resources available to complete this work. The authors are also thankful to Mikros Traffic Monitoring (Pty) Ltd and Prof. T. Bartz-Beielstein for making the datasets available.

REFERENCES

[1] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, Third Edition. 2019. DOI: 10.1002/9781119482260.
[2] D. B. Rubin, "Inference and missing data," Biometrika, vol. 63, (3), pp. 581-592, 1976.
[3] J. M. Jerez, L. Molina, P. J. García-Laencina, E. Alba, N. Ribelles, M. Martín, and L. Franco, "Missing data imputation using statistical and machine learning methods in a real breast cancer problem," Artificial Intelligence in Medicine, vol. 50, (2), pp. 105-115, 2010. Available: https://www.clinicalkey.es/playcontent/1-s2.0-S0933365710000679. DOI: 10.1016/j.artmed.2010.05.002.
[4] H. de Silva and A. S. Perera, "Missing data imputation using evolutionary k-nearest neighbour algorithm for gene expression data," Sep 2016. Available: https://ieeexplore.ieee.org/document/7829911. DOI: 10.1109/ICTER.2016.7829911.
[5] P. Schmitt, J. Mandel and M. Guedj, "A comparison of six methods for missing data imputation," Journal of Biometrics & Biostatistics, vol. 6, (1), 2015. DOI: 10.4172/2155-6180.1000224.
[6] H. Kang, "The prevention and handling of the missing data," Korean Journal of Anesthesiology, vol. 64, (5), pp. 402-406, 2013. Available: http://synapse.koreamed.org/search.php?where=aview&id=10.4097/kjae.2013.64.5.402&code=0011KJAE&vmode=FULL. DOI: 10.4097/kjae.2013.64.5.402.
[7] B. Twala, "An empirical comparison of techniques for handling incomplete data using decision trees," Applied Artificial Intelligence, vol. 23, (5), pp. 373-405, 2009. Available: http://www.tandfonline.com/doi/abs/10.1080/08839510902872223. DOI: 10.1080/08839510902872223.
[8] T. Nkonyana and B. Twala, Eds., Impact of Poor Data Quality in Remotely Sensed Data. (Artificial Intelligence and Evolutionary Computations in Engineering Systems ed.) Singapore: Springer Nature.
[9] M. R. Stavseth, T. Clausen and J. Røislien, "How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data," SAGE Open Medicine, vol. 7, pp. 2050312118822912, 2019. Available: https://www.ncbi.nlm.nih.gov/pubmed/30671242.
[10] T. Marwala, Computational Intelligence for Missing Data Imputation, Estimation, and Management. 2009. Available: https://ebookcentral.proquest.com/lib/[SITE_ID]/detail.action?docID=3309570.
[11] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, (1), pp. 1-38, 1977. Available: http://www.econis.eu/PPNSET?PPN=388257237.
[12] B. Twala and F. Mekuria, "Ensemble multisensor data using state-of-the-art classification methods," Sep 2013. Available: https://ieeexplore.ieee.org/document/6757711. DOI: 10.1109/AFRCON.2013.6757711.
[13] B. Twala, "Dancing with dirty road traffic accidents data: The case of Gauteng Province in South Africa," Journal of Transportation Safety & Security, vol. 4, (4), pp. 323-335, 2012. Available: http://www.tandfonline.com/doi/abs/10.1080/19439962.2012.702711. DOI: 10.1080/19439962.2012.702711.
[14] D. Ferreira-Santos, M. Monteiro-Soares and P. P. Rodrigues, "Impact of imputing missing data in Bayesian network structure learning for obstructive sleep apnea diagnosis," Studies in Health Technology and Informatics, vol. 247, pp. 126-130, 2018. Available: https://www.ncbi.nlm.nih.gov/pubmed/29677936.
[15] G. Chhabra, V. Vashisht and J. Ranjan, "A comparison of multiple imputation methods for data with missing values," Indian Journal of Science and Technology, vol. 10, (19), pp. 1-7, 2017. DOI: 10.17485/ijst/2017/v10i19/110646.
[16] M. Singh, "Learning Bayesian networks from incomplete data," in AAAI/IAAI, pp. 539, 1997.
[17] C. Ferri, J. Hernández-Orallo and R. Modroiu, "An experimental comparison of performance measures for classification," Pattern Recognition Letters, vol. 30, (1), pp. 27-38, 2009.