This paper introduces a novel approach for occupancy estimation in smart buildings. In particular, we focus on the challenging yet common situation where the training data are scarce and imbalanced (i.e. the classes are not approximately equally represented). Our model has two parts: predictive modelling, performed via the inverted Dirichlet mixture model (IDMM), and an over-sampling approach that we propose. The first part, whose main goal is to tackle the small training data problem, computes the predictive distribution of the IDMM by marginalizing over its parameters with respect to their posterior distributions, which are estimated by a Bayesian variational inference approach that we develop. The second part, based on over-sampling, complements the first by tackling the imbalanced domains problem. Extensive experiments and simulations involving synthetic data as well as real data extracted from smart building sensors show the merits of our statistical framework.

Data could be made available on reasonable request.
It is noteworthy that when the training set becomes sufficiently large, the variance of the posterior distribution decreases, and the predictive distribution, which can be viewed as an average over the model’s parameters (Snelson and Ghahramani 2005), can then be approximated by \(f({\textbf {x}}|{\hat{\theta }})\), where \({\hat{\theta }}\) is a point estimate (obtained via maximum a posteriori, expectation propagation or variational Bayes, for instance) (Gelman et al. 1996; Sinharay and Stern 2003).
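The effect described above can be illustrated with a toy conjugate model. The sketch below is an assumption-laden illustration, not the IDMM of this paper: it uses a Bernoulli likelihood with a Beta(2, 2) prior, and the function names `predictive_prob`, `plug_in_prob` and `gap` are hypothetical. It compares the exact predictive probability (the likelihood averaged over the posterior) with the plug-in value \(f(x|{\hat{\theta }})\) at the MAP estimate.

```python
def predictive_prob(successes, failures, a0=2.0, b0=2.0):
    """Exact predictive P(x = 1 | data) for a Beta-Bernoulli model.

    This is the likelihood averaged over the Beta(a, b) posterior,
    which equals the posterior mean a / (a + b).
    """
    a = a0 + successes
    b = b0 + failures
    return a / (a + b)


def plug_in_prob(successes, failures, a0=2.0, b0=2.0):
    """Plug-in approximation f(x = 1 | theta_hat) with theta_hat the MAP.

    The mode of Beta(a, b) is (a - 1) / (a + b - 2) for a, b > 1,
    which always holds here because the assumed prior is Beta(2, 2).
    """
    a = a0 + successes
    b = b0 + failures
    return (a - 1.0) / (a + b - 2.0)


def gap(n, rate=0.3):
    """Absolute difference between the two quantities for n observations."""
    successes = round(rate * n)
    failures = n - successes
    return abs(predictive_prob(successes, failures)
               - plug_in_prob(successes, failures))


if __name__ == "__main__":
    # The gap shrinks as the training set grows.
    for n in (10, 100, 10_000):
        print(n, gap(n))
```

With few observations the two values differ noticeably, which is the regime where the full predictive distribution is worth computing; with many observations the plug-in estimate becomes an adequate substitute.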
Data generation techniques themselves can be categorized into two groups (Branco et al. 2016). The first group of approaches introduces perturbations (i.e. producing noisy replicates of existing data). The second group is based on interpolating existing data. Our approach belongs to the second group.
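The two groups can be contrasted with a minimal sketch. This is a generic illustration only: the interpolation shown is a simple SMOTE-style convex combination of two minority samples, not the over-sampling scheme proposed in this paper, and the names `perturb`, `interpolate`, `oversample` and the parameter `sigma` are hypothetical.

```python
import random


def perturb(sample, sigma, rng):
    """Group 1: return a noisy replicate of an existing sample."""
    return [v + rng.gauss(0.0, sigma) for v in sample]


def interpolate(sample, neighbour, rng):
    """Group 2: return a random point on the segment between two samples."""
    t = rng.random()  # t lies in [0, 1)
    return [a + t * (b - a) for a, b in zip(sample, neighbour)]


def oversample(minority, n_new, rng, method="interpolate", sigma=0.05):
    """Generate n_new synthetic minority samples with the chosen method.

    The interpolation branch needs at least two minority samples.
    """
    new_samples = []
    for _ in range(n_new):
        x = rng.choice(minority)
        if method == "interpolate":
            others = [m for m in minority if m is not x]
            new_samples.append(interpolate(x, rng.choice(others), rng))
        else:
            new_samples.append(perturb(x, sigma, rng))
    return new_samples


if __name__ == "__main__":
    rng = random.Random(0)
    minority = [[1.0, 2.0], [3.0, 4.0], [2.0, 1.0]]
    print(oversample(minority, 3, rng))
    print(oversample(minority, 3, rng, method="perturb"))
```

One practical difference: a convex combination of strictly positive vectors is itself strictly positive, so interpolated samples remain in the support of an inverted Dirichlet distribution, whereas additive Gaussian noise can push a coordinate below zero.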
The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC), the “Nouveaux arrivants Université Grenoble Alpes Grenoble INP - UGA”/G-SCOP program and the National Natural Science Foundation of China (61876068). The authors would like to thank the associate editor and reviewers for their helpful comments.
The authors declare that they have no conflict of interest.
Appendix: Proof of equation 11
The logarithm of the Multivariate-Inverse-Beta has been proved to be concave (Ma et al. 2014). Thus, the following inequality can be easily obtained by a first-order Taylor expansion:
where \(\tilde{\alpha }_{d}, d=1,2,\ldots ,D+1\) is any point from the posterior distribution. Taking the exponential of both sides, we have
Substituting (24) into (7) and rearranging, we obtain the following upper bound:
For simplicity, let us denote
where \(d = 1,2,\ldots ,D+1\). Thus, the integral in Eq. 25 has the same form as the Gamma function and can be reduced to
Here, we assume \({\textbf {G}}(x_{d},\tilde{\varvec{\alpha }}) > 0\) for every d, because the integral cannot be solved when \({\textbf {G}}(x_{d},\tilde{\varvec{\alpha }}) \le 0\). Finally, the analytically tractable form of the finite upper bound of the predictive distribution is
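For reference, the two standard facts this proof relies on can be stated generically. The symbols \(g\), \(a\), \(b\) and \(t\) below are illustrative placeholders rather than the paper's notation, and these lines are not a reconstruction of the numbered equations. First, a differentiable concave function lies below each of its tangent planes, which is what the first-order Taylor expansion exploits:

\[ g(\varvec{\alpha }) \le g(\tilde{\varvec{\alpha }}) + \nabla g(\tilde{\varvec{\alpha }})^{\top }(\varvec{\alpha } - \tilde{\varvec{\alpha }}) \quad \text {for all } \varvec{\alpha },\ \tilde{\varvec{\alpha }}. \]

Second, the Gamma-type integral has a closed form only for a positive rate:

\[ \int _{0}^{\infty } t^{a-1} e^{-bt}\, dt = \frac{\Gamma (a)}{b^{a}}, \qquad a> 0,\ b > 0. \]

For \(a > 0\) the integral diverges when \(b \le 0\), which is consistent with the positivity requirement on \({\textbf {G}}(x_{d},\tilde{\varvec{\alpha }})\), assuming it plays the role of the rate \(b\).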
Guo, J., Amayri, M., Najar, F. et al. Occupancy estimation in smart buildings using predictive modeling in imbalanced domains. J Ambient Intell Human Comput 14, 10917–10929 (2023). https://doi.org/10.1007/s12652-022-04359-x