Meta-learning-based Approach for IoT Data Analytics
https://doi.org/10.1007/s12046-025-02713-1
Sadhana
Abstract. Missing data significantly impacts Internet of Things (IoT) applications, causing various issues depending on the type and amount of the missing information. For example, in wearable device applications, missing acceleration readings can result in misclassification of physical activities. Handling missing data in IoT applications, particularly for tasks such as human activity classification, is a complex challenge. In this paper, we propose a two-stage approach for IoT data classification that handles missing data effectively. In the first stage, we build an ensemble of heterogeneous classifiers, while in the second stage, a meta-learner is employed to address high levels of data sparsity without resorting to data imputation or reconstruction methods. The meta-learner is trained on a dataset that combines the original features with the predictions from the first-stage classifiers. By leveraging both feature analysis and classifier performance, the meta-learner produces accurate predictions, ensuring robust results even in the presence of missing data. We address three mechanisms for handling incomplete data: (i) missing at random, (ii) missing completely at random, and (iii) missing not at random, each with various levels of missing data. Our results demonstrate the viability of the proposed approach in handling extreme levels of missing data in both training and testing datasets, consistently outperforming state-of-the-art models.
dataset, while data-independent methods use statistical measures or techniques that are not tightly coupled to the dataset’s unique features. Ensembles are mostly based on the structure of the predictors, the pattern of missing data, and the type of relationship between dependent and independent variables.

An ensemble can also be categorized based on the underlying ML models used. The models within an ensemble can be the same (homogeneous) or different (heterogeneous). In homogeneous ensemble techniques, the models tend to be stable because they are less sensitive to minor changes in the training data. Moreover, these models are easier to interpret, as the base models share similar structures and typically identify comparable patterns and relationships. However, if all the base models are flawed, the ensemble may not offer significant improvement over the individual models. On the other hand, a heterogeneous ensemble can result in a more robust final model. This is because the diverse models capture various aspects of the problem and complement each other’s weaknesses. Each classifier contributes unique insights and perspectives, resulting in an aggregate outcome that is often more accurate than any single classifier.

In an ensemble, each classifier’s prediction can be treated as a vote for a particular class. The final outcome is typically determined by different voting strategies. However, when handling missing data, the majority voting principle has the following limitations [7]:
• It determines the result solely based on the class with the majority votes, without considering class importance.
• A subset of classifiers may agree on an incorrect classification, leading to misclassification.
• When misclassification costs are unequal, relying on votes from a standalone model does not yield an optimal prediction.

In this work, we propose a meta-learning approach for IoT data analysis in the presence of missing values. Meta-learning is a process of learning to learn, which helps determine the best way of combining classifiers. The meta-learner combines the predictions of different heterogeneous models and learns optimally, weighing them to produce a more accurate final prediction. Figure 1 illustrates a high-level overview of constructing a meta-learner. The training dataset (D) is the input, followed by the induction of MAR, MCAR, and MNAR patterns of missing data $D_m$. Heterogeneous classifiers are then applied to $D_m$ to obtain predictions. These predictions, combined with $D_m$, are used to train the meta-learner, which acts as the final classification model. This approach is particularly effective when independent models misclassify similar patterns [8].

The main contributions of this work are as follows:
(i) Three different combinations of handling missing values in both training and test datasets are explored.
(ii) A meta-learning approach is presented for effectively combining base models in a heterogeneous ensemble classifier, leveraging their individual strengths and weaknesses.
(iii) A robust meta-learner is developed to support extreme cases of missing data.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 presents the proposed approach. Section 4 discusses the experimental results. Finally, section 5 concludes the paper.

2. Related work

Different approaches to handling missing data have been extensively studied in the literature [9–11]. Generating diversity between classifiers can be achieved by training them individually with different parameters, training datasets, and subsets of feature sets. Dung et al [12] presented a random subspace method to train classifiers on distinct subdomains of the feature set. However, this method is effective only when the dataset contains redundant information across all features.

Ensemble methods have also been proposed to address the challenges of missing data. Tasci et al [13] demonstrated that ensembles combining decision trees and neural networks achieve higher accuracy than individual classifiers when applied to high-dimensional datasets. Hasan et al [14] proposed an ensemble of decision tree classifiers, where each decision tree is trained on data that has been imputed using different techniques, such as nearest neighbor, single imputation, and Bayesian multiple imputation. Banjarnahor et al [15] compared the results of an ensemble classifier based on the C4.5 decision tree and k-nearest neighbors (kNN), using different imputations with the expectation maximization multiple imputation (EMMI) technique. All these approaches showed that the ensemble results achieved improved accuracy.

Tran et al [16] employed multiple imputations with random subspace methods within an ensemble classifier, demonstrating effective handling of missing data up to 30%. Aleryani et al [17] explored the use of ensemble learning combined with multiple imputations to build diverse classifiers specifically for predicting missing data and compared this approach against the hot deck and kNN-based single imputation methods.

Melville et al [18] proposed an ensemble method called Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE), which constructs diverse committees by generating artificial training examples. This approach effectively handles missing values by creating complete datasets. Polikar et al [19] proposed
Sådhanå (2025)50:52 Page 3 of 9 52
To handle missing values, we aim to use landmark classifiers (simple models that serve as benchmarks) efficiently based on the correlations between their expertise. When selecting landmark classifiers, we evaluate all candidates to identify which model performs best under the current conditions. If any of these classifiers demonstrate computational efficiency, that classifier can serve as a landmarker. The overall model efficiency will be optimized if one of these efficient classifiers is selected at the next level of classification.

We have chosen different classifiers such as support vector machine (SVM), logistic regression (LR), kNN, and RF along with a meta-learner to address the issue of sparsity in a dataset. The meta-learner can be any ML model, but we typically select a simpler model to ensure quicker training and a lower likelihood of overfitting.

The following considerations guide the selection of individual classifiers:
• SVM: It handles missing values by ignoring the features with missing data in its calculations for each data point. It uses the available non-missing features to train the model and make predictions. This approach works well when missingness is completely random.
• LR: By default, LR excludes any rows with missing values from the analysis; that is, incomplete observations are omitted. This can lead to a loss of valuable data if many observations have missing values.
• kNN: It identifies the nearest neighbors based on a distance matrix. Missing values are then replaced with weighted averages of corresponding attributes from these neighbors, preserving the data’s correlational structure. However, the default method often ignores observations with missing values when calculating distances.
• RF: It can handle missing data natively by ignoring the missing features, and also, it is a bagging ensemble technique that mainly focuses on reducing the variance.

In this work, we considered RF as a meta-learner as it constructs multiple trees that are less correlated with each other, sampling both observations and features. This helps maintain a high level of diversity in the ensemble. RF is a bagging ensemble technique that mainly focuses on reducing variance, which is crucial when dealing with datasets that may be sparse or contain noise.

Training principle of meta-learner. The training metadata can be represented as $\langle f(x), t(x) \rangle$, where f(x) is the feature vector that captures important characteristics of the input data x and t(x) is the target value indicating which model from a set of models M performed best on x. The performance of each model is assessed using a measure like accuracy, with the target determined:

$$t(x) = \arg\max_{m \in M} p(m, x). \qquad (1)$$

This equation signifies the model that achieves the highest performance on input x. The set of models M addresses classification problems by extracting statistical measures such as the number of features, the number of classes, and feature–target correlations. A variation of this approach uses indirect data characterization to explain algorithm performance. In contrast, landmarking evaluates learners based on their task performance, emphasizing their expertise as a primary source of information rather than relying solely on statistical properties [23]. Each classifier offers a unique method for managing missing data, enabling selection based on the characteristics of the data and the nature of missingness.

The key difference between our approach and stacking is as follows:
• Stacking splits the data into training and validation sets. The base models are trained on the training set to make estimates on the validation set. The predictions from the base models are then used as input to train the meta-learner on the validation set.
• However, in our approach, the meta-learner is trained on the incomplete training data alongside the predictions from the base models, and this combined model is applied directly to the test data.

Algorithm 1. Generating a Meta-learner

3.2 Proposed meta-learning-based approach

Figure 3 illustrates a specific instance of a meta-learning framework (see Figure 1), which consists of four best-performing heterogeneous classifiers from the expertise space: SVM, LR, kNN, and RF. Below is a step-by-step procedure for invoking the heterogeneous classifiers and the meta-learner.

Stage I: invocation of heterogeneous classifiers

Step 1: A training dataset D without missing values is used. Missing data is then induced using the three mechanisms, MAR, MCAR, and MNAR, with varying percentages of missing values (ranging from 10 to 95%) in D, denoted as $D_m$ (line 1 in Algorithm 1).

Suppose we have N training examples, each with M features having missing values ($D_m$) and a target value y, represented by training data ($D_m$, y), where $D_m$ is an $N \times M$ feature matrix and y is an $N \times 1$ vector of target values.

Step 2: Using expertise space, select the best-performing classifiers from the various individual classifiers applied to $D_m$, such as SVM, LR, kNN, and RF. These heterogeneous models inherently handle missing values in different ways, either by ignoring them or by incorporating them into their training processes (lines 2–3 in Algorithm 1).

Step 3: Each classifier makes predictions on $D_m$. The predictions from these k models are denoted as $f_1(D_m), f_2(D_m), \ldots, f_k(D_m)$. Each $f_i(D_m)$ is an $N \times 1$ vector of predictions (P) for the target value, which are represented as follows: $f_1(D_m)$ for SVM, $f_2(D_m)$ for LR, $f_3(D_m)$ for kNN, and $f_4(D_m)$ for RF.

Since each classifier brings unique predictive capabilities, this ensemble approach offers more value than simply averaging the base models’ predictions. The diversity and expertise of the classifiers provide the meta-learner with richer information, leading to better handling of incomplete data (lines 4–6 in Algorithm 1).

Step 4: A new dataset $D'$ is constructed, including predictions from each classifier and the corresponding percentage of missing data (line 7 in Algorithm 1).

Stage II: invocation of meta-learner

Step 5: Train the meta-learner on $D'$. RF is used as the meta-learner, taking the predictions from the heterogeneous models and the original features as input. By providing all the predictions along with the data that includes varying percentages of missing values, the meta-learner becomes more robust, as it learns to associate input features with model predictions (line 8 in Algorithm 1).

Step 6: The trained meta-learner serves as an ensemble classifier designed to handle and classify data with missing values. By training the base models and the meta-learner on varying degrees of missing data, the overall classification performance improves (line 9 in Algorithm 1).

Table 2. Number of samples for each activity in the HAR dataset

S.no.  Activity name        No. of samples (%)
1      Walking              19.14
2      Walking Upstairs     18.69
3      Walking Downstairs   17.49
4      Sitting              16.68
5      Standing             14.59
6      Laying               13.41
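As a minimal sketch of Steps 1, 3, and 4 of the procedure in Section 3.2, the Python fragment below induces missing values in a complete dataset and then builds the meta-dataset by appending base-model predictions and the fraction of missing values to each incomplete row. Everything named here is an assumption made for illustration and is not taken from the paper: the `None` marker, the function names, the toy threshold models standing in for SVM, LR, kNN, and RF, and the toy data. The MAR-style and MNAR-style variants follow the usual textbook definitions and may differ from the procedure used in the experiments. Training the RF meta-learner (Step 5) is deliberately left out.

```python
import random

MISSING = None  # marker for an unobserved feature value (assumption)


def induce_mcar(rows, rate, rng):
    """MCAR: blank every cell independently with probability `rate`."""
    return [[MISSING if rng.random() < rate else v for v in row] for row in rows]


def induce_conditional(rows, target, driver, threshold, rate, rng):
    """Blank column `target` with probability `rate` where column `driver` > `threshold`.

    driver != target: MAR-style (missingness depends on another, observed column).
    driver == target: MNAR-style (missingness depends on the value that is lost).
    Expects complete input rows.
    """
    out = []
    for row in rows:
        new_row = list(row)
        if row[driver] > threshold and rng.random() < rate:
            new_row[target] = MISSING
        out.append(new_row)
    return out


def missing_fraction(row):
    """Fraction of feature values in `row` that are missing."""
    return sum(v is MISSING for v in row) / len(row)


def threshold_model(feature, threshold, default=0):
    """Hypothetical base model: thresholds one feature and falls back to
    `default` when that feature is missing (no imputation)."""
    def predict(row):
        value = row[feature]
        return default if value is MISSING else int(value > threshold)
    return predict


def build_meta_dataset(rows_m, base_models):
    """Each D' row = incomplete features + one prediction per base model
    + fraction of missing values in that row."""
    return [
        list(row) + [model(row) for model in base_models] + [missing_fraction(row)]
        for row in rows_m
    ]


# Toy data: two features per row (values are made up for illustration).
D = [[0.2, 1.5], [0.9, 0.3], [0.4, 2.1], [0.8, 0.1]]
rng = random.Random(0)
D_m = induce_mcar(D, rate=0.3, rng=rng)
base_models = [threshold_model(0, 0.5), threshold_model(1, 1.0)]
D_prime = build_meta_dataset(D_m, base_models)
for row in D_prime:
    print(row)
```

In this sketch the meta-dataset keeps the missing cells as they are, so any meta-learner trained on it would need to accept missing inputs, in line with the stated aim of avoiding imputation.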
Table 4. Accuracy comparison between RF, meta-learner, and EM with RF for MCAR pattern on the HAR dataset
dealing with high levels of missing data. We observed that when missing data is present in both training and test datasets, our method outperforms the scenario where missing data is only present during the test phase.

To demonstrate our model, we compared it with a neural network. For this, we used Keras, a high-level API for building and testing feedforward neural networks on classification tasks. The process begins with data preprocessing, where (a) the input features are standardized using the standard scaler to normalize the data distribution; (b) categorical labels are encoded using a label encoder, converting them into numerical representation; (c) one-hot encoding is applied to the target variable, transforming it into a binary format suitable for multi-class classification tasks; and (d) principal component analysis is used to reduce the dimensionality of the feature space while retaining the maximum variance in the data. The neural network architecture consists of three hidden layers, with 64, 128, and 64 units, respectively, each employing a rectified linear unit (ReLU) activation function. The output layer uses Softmax activation for multi-class classification. The model is trained using the Adam optimizer, and categorical cross-entropy is used as the loss function. The model is trained for 22 epochs with a batch size of 256, and validation performance is monitored using the test dataset. Missing data is generated under the MCAR pattern, and predictions are made by selecting the highest probability from the Softmax output. Table 5 presents a comparison of the proposed approach against a neural network, in which better results are shown in boldface for the ARSCM dataset. The meta-learner outperforms the neural network, especially as the percentage of missing data increases, even in the case of an imbalanced dataset. This superiority is particularly evident when dealing with high levels of missing data, as neural networks tend to struggle when a large portion of data is absent. On the other hand, our meta-learner handles missing data more effectively due to the diversity introduced by the base learners, which improves robustness and classification performance.

5. Conclusion

In IoT deployments, missing data is often inevitable due to environmental factors or communication disruptions. Human activity classification becomes particularly
Table 5. Accuracy comparison between neural network and meta-learner for MCAR pattern on the ARSCM dataset
review and comparative study. Expert Sys. Appl. 227: 120201
[11] Luo Y 2022 Evaluating the state of the art in missing data imputation for clinical data. Briefings Bioinform. 23(1): bbab489
[12] Dung N V, Trung N L and Abed-Meraim K 2021 Robust subspace tracking with missing data and outliers: novel algorithm with convergence guarantee. IEEE Trans. Signal Process. 69: 2070–2085
[13] Taşcı E, Ülütürk C and Uğur A 2021 A voting-based ensemble deep learning method focusing on image augmentation and preprocessing variations for tuberculosis detection. Neural Comput. Appl. 33(22): 15541–15555
[14] Hasan M, Alam M, Roy S, Dutta A, Jawad M T and Das S 2021 Missing value imputation affects the performance of machine learning: a review and analysis of the literature 2010–2021. Inf. Med. Unlocked 27: 100799
[15] Banjarnahor J, Zai F, Sirait J, Nainggolan D W and Sihombing N G D 2023 Comparison analysis of C4.5 algorithm and KNN algorithm for predicting data of non-active students at prima indonesia university. Sinkron: jurnal dan penelitian teknik informatika 7(4): 2027–2035
[16] Tran C and Nguyen B 2024 Random subspace ensemble for directly classifying high-dimensional incomplete data. Evolut. Intell. 1–13
[17] Aleryani A, Bostrom A, Wang W and Iglesia B 2023 Multiple imputation ensembles for time series (MIE-TS). ACM Trans. Knowl. Discovery Data 17(3): 1–28
[18] Melville P and Mooney R 2003 Constructing diverse classifier ensembles using artificial training examples. Ijcai. 3: 505–510
[19] Polikar R, DePasquale J, Mohammed H S, Brown G and Kuncheva L I 2010 Learn++.MF: a random subspace approach for the missing feature problem. Pattern Recognit. 43(11): 3817–3832
[20] Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B and Tabona O 2021 A survey on missing data in machine learning. J. Big Data. 8: 1–37
[21] Utukuru S, Krishna P R and Karlapalem K 2023 Missing data resilient ensemble subspace decision Tree Classifier, in: Proceedings of the 6th Joint International Conference on Data Science and Management of Data 10th ACM IKDD CODS and 28th COMAD: 104–107
[22] Giraud-Carrier C, Brazdil P, Soares C and Vilalta R 2009 Meta-learning. In Encyclopedia of Data Warehousing and Mining, IGI Global, (2nd edition): 1207–1215
[23] Khwaja A S, Anpalagan A, Naeem M and Venkatesh B 2020 Joint bagged-boosted artificial neural networks: using ensemble machine learning to improve short-term electricity load forecasting. Electric Power Syst. Res. 179: 106080
[24] Garcia-Gonzalez D, Rivero D, Fernandez-Blanco E and Luaces M R 2020 A public domain dataset for real-life human activity recognition using smartphone sensors. Sensors 20(8): 2200
[25] Casale P, Pujol O and Radeva P 2012 Personalization and user verification in wearable systems using biometric walking patterns. Personal Ubiquitous Comput. 16: 563–580
[26] Mayer I, Sportisse A, Josse J, Tierney N and Vialaneix N 2019 R-miss-tastic: a unified platform for missing values methods and workflows. arXiv preprint: arXiv:1908.04822

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.