Monte Carlo Averaging For Uncertainty Estimation I
Abstract. Although convolutional neural networks (CNNs) are widely used in modern classifiers, they are affected by overfitting and a lack of robustness, leading to overconfident false predictions (FPs). Preventing FPs would avoid consequences such as accidents and financial losses and would make the use of CNNs in safety- and/or mission-critical applications effective. In this work, we aim to improve the separability of true predictions (TPs) and FPs by enforcing the confidence, which quantifies uncertainty, to be high for TPs and low for FPs. To this end, we propose the use of Monte Carlo averaging (MCA) and compare it with related methods: the baseline (single CNN), Monte Carlo dropout (MCD), the ensemble, and the mixture of Monte Carlo dropout (MMCD). The comparison is based on experiments conducted on four datasets with three different architectures. The results show that MCA performs as well as or better than MMCD, which in turn performs better than the baseline, the ensemble, and MCD. Consequently, MCA could be used instead of MMCD for uncertainty estimation, especially because it does not require a predefined distribution and is less expensive than MMCD.
Keywords: Convolutional neural network (CNN), ensemble, Monte Carlo dropout (MCD),
mixture of Monte Carlo dropout (MMCD), Monte Carlo averaging (MCA), separating true
predictions (TPs) and false predictions (FPs), confidence calibration
1. Introduction
Because of the emergence of large datasets, increasing computational power, and advances in
deep learning, convolutional neural networks (CNNs) have become the standard for solving
classification problems. Despite the widespread use of CNNs in modern classifiers [1, 2, 3], they
are faced with several problems, such as overfitting [4], which causes overconfident predictions [5],
and lack of robustness. An example of a lack of robustness was described by Hendrycks
and Dietterich [6], who empirically showed that CNNs can change their predictions when
perturbations, such as blur or noise, are applied to the input image. This overconfidence
and lack of robustness can lead to overconfident FPs. Several other authors also empirically
showed that CNNs can overconfidently misclassify out-of-domain (OOD) examples (situations
not present in the training data) [7, 5, 8]. In [9], the authors showed that CNNs can also
overconfidently misclassify domain-shift examples, which are in-domain examples (situations
present in the training data) affected by a set of perturbations, such as changes in the camera
lens and lighting conditions. Overconfident FPs can be costly and dangerous, especially when
CNN-based classifiers are part of the decision making unit of systems for safety- and/or mission-
critical applications, such as collision avoidance [10], door recognition for visual-based robot
navigation [11], and pedestrian detection [12]. In such applications, FPs can result in false actions in the environment, leading to robot collisions, potentially false medical treatments, and/or increased financial costs. By preventing FPs, we can avoid these consequences and encourage the widespread adoption of CNNs in safety- and/or mission-critical applications. We pursue this goal by estimating and evaluating the predictive uncertainty of CNNs and ensuring that the confidence, which quantifies uncertainty, is high ([50%, 100%]) for TPs and low ([0%, 50%[) for FPs.
This raises a research question: What method is required to achieve high and low confidence of CNN-based classifiers for TPs and FPs, respectively? In our paper A Survey of Uncertainty in Deep Neural Networks [13], we suggested an uncertainty estimation technique that combines the strengths of both Bayesian and ensemble principles. We adopt this technique in the present work and propose MCA, a method similar to the mixture of Monte Carlo dropout (MMCD), which combines the strengths of an ensemble and Monte Carlo dropout (MCD). Like an ensemble, MCA is deterministic; like the ensemble and MMCD, it evaluates multiple features; and like MCD and MMCD, both of which are stochastic, it evaluates the uncertainty associated with the extracted features. We therefore empirically compared
MCA and other related methods (baseline (single CNN), MCD, ensemble, and MMCD) based
on results from experiments conducted on four datasets (CIFAR10, FashionMNIST, MNIST,
and GTSRB) using three different architectures (DenseNets, ResNets, and VGGNets). Results
show that MCA can perform as well as or even better than MMCD. Like MMCD, MCA can preserve the classification accuracy of the underlying ensemble; the ensemble, in turn, can increase the accuracy of the baseline, whereas MCD can only preserve it. Like MMCD, MCA can separate TPs and FPs better than the baseline, the ensemble, and MCD, but at the cost of an increased calibration error on the test data.
2. Related works
MCD [14, 15] is one of the most widely used approximations for Bayesian inference. It samples
features by dropping neurons using Bernoulli masks. Several related works have investigated
various extensions of MCD. For example, Tassi [16] investigated the use of dropout in pooling and/or convolutional layers instead of fully-connected layers. Zeng et al. [17] investigated the position
and number of Bayesian layers required to approximate a fully Bayesian neural network (BNN)
and found that only a few Bayesian layers near the output of the BNN are sufficient. Similarly,
Brosse et al. [18] evaluated the quality of uncertainty that results when only the last layer is
Bayesian, and found that last-layer BNNs perform similarly well compared with full BNNs.
Kristiadi et al. [19] complemented the empirical evidence of Brosse et al. [18] with a theoretical
justification showing why it is sufficient to make the last layer Bayesian at low cost overhead.
According to Zeng et al. [17] and further supported by Brosse et al. [18], the use of multiple
Bayesian layers in a BNN can compromise accuracy without improving the quality of uncertainty.
However, the more Bayesian a network is made, for example, by using more Bayesian layers or a higher dropout probability, the more uncertainty we can capture, at the cost of sacrificing accuracy [17, 16]. Other studies
investigated the use of different dropout strategies, such as drop-connect, where connections
are dropped instead of neurons [20, 21], or structured dropout, where layers or channels are
dropped [22]. Other studies [21, 16] evaluated the use of different dropout masks, such as
Gaussian, Bernoulli, or a cascade of Gaussian and Bernoulli. Taken together, all these works
show that a single MCD layer near the output of a CNN, for example, at the input of the first
fully-connected layer, is sufficient for uncertainty quantification. Moreover, all the studies show
that the MCD is sensitive to the sampling masks drawn from a predefined distribution. The
proposed MCA does not require a predefined distribution from which masks are drawn and
therefore overcomes the drawback of the MCD.
The ensemble was initially proposed for improving accuracy [23, 1], and existing methods for introducing diversity among ensemble members, such as random initialization, data shuffling, bagging, and data augmentation, were likewise originally designed to improve accuracy.
Nevertheless, ensembles have become a popular method for uncertainty estimation through
the pioneering work of Lakshminarayanan et al. [5]. Several related works have evaluated the
performance of ensembles in capturing in-domain uncertainties [24] or OOD uncertainties [9].
Other works [25, 26, 9] compared ensembles with other uncertainty estimation methods such as
MCD and concluded that ensembles perform better than MCD. Lakshminarayanan et al. [5] used
random initialization and data shuffling to build ensembles for uncertainty estimation. Lee et
al. [27] experimentally showed that the diversity introduced into ensemble members via random
parameter initialization is more useful than that introduced via bagging. They concluded that
random initialization is not only sufficient, but also preferable to bagging for building ensembles
of CNNs because CNNs have a large parameter space and require large training data. They
also showed that bagging can result in poorly calibrated ensembles. According to Wen et
al. [28], data augmentation approaches, such as mixup [29], can also harm the calibration of
ensembles. This has also been reported in other studies [30, 31]. In [31, 24, 32], the authors
improved the calibration of ensembles using temperature scaling. In this work, to avoid harming
the calibration of ensembles, diversity was introduced into ensemble members using random
initialization, data shuffling, and standard label-preserving data augmentation techniques, such
as rotation, translation, flipping, shear and additive Gaussian noise.
MMCD was used in [10, 33, 34] for uncertainty estimation. It combines the strengths of MCD
and ensemble. MCD evaluates only a single local optimum in a given solution space but additionally considers the uncertainty around that local optimum, whereas an ensemble comprises multiple deterministic CNNs representing different local optima in the solution space and therefore evaluates multiple modes (extracted features) [35]. However, an ensemble does not account
for the uncertainty around the individual modes. To explore the uncertainty around each mode,
MMCD applies MCD to each ensemble member.
3. Background
3.1. Convolutional neural network
For image classification, a CNN is a function $f$ that maps an input image $x \in \mathbb{R}^{H \times W \times C}$ to a class label $y \in U^K$, where $H$, $W$, and $C$ are the height, width, and number of channels of the input image, respectively. $U^K$ and $K$ denote the set of standard unit vectors of $\mathbb{R}^K$ and the number
of possible classes, respectively. A CNN consists of two main modules: a features extractor,
which is realized using convolutional and pooling layers, and a discriminator, which is realized
using fully-connected layers. Thus, a CNN is a composite of two functions, $f_{\mathrm{FeatureExtractor}}(\cdot)$ and $f_{\mathrm{Discriminator}}(\cdot)$. That is,
$$p(y|x) := f(x) := f_{\mathrm{Discriminator}}(f_{\mathrm{FeatureExtractor}}(x)), \qquad (1)$$
with predicted class label $y = \arg\max_k\,(p_k(y|x))$ and predicted confidence $c = \max_k\,(p_k(y|x))$ for $k = 1, \ldots, K$. A single CNN is referred to as the baseline.
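To make the composition in (1) concrete, the following is a minimal numpy sketch of how the predicted class label and confidence are read off $p(y|x)$; the feature extractor, discriminator weights, and dimensions are illustrative stand-ins, not the architectures used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 10, 64                                   # number of classes, feature dimension (illustrative)

def feature_extractor(x):
    # Stand-in for the convolutional/pooling stages: flattens and truncates the image.
    return x.reshape(-1)[:D]

W, b = 0.01 * rng.standard_normal((K, D)), np.zeros(K)  # placeholder discriminator parameters

def discriminator(features):
    # Stand-in for the fully-connected stages: softmax over a linear map.
    logits = W @ features + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.standard_normal((32, 32, 3))            # input image of shape (H, W, C)
p = discriminator(feature_extractor(x))         # p(y|x), as in equation (1)
y_hat = int(np.argmax(p))                       # predicted class label
confidence = float(np.max(p))                   # predicted confidence
print(y_hat, confidence)
```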
3.2. Monte Carlo dropout
MCD samples features $\hat{x}^{s}$ from the extracted features by applying dropout masks drawn from a Bernoulli distribution, where $q$, $\alpha_i^{s}$, and $\beta_i^{s}$ denote the dropout probability and the elements of the sampled masks $\alpha^{s}$ and $\beta^{s}$, respectively. MCD estimates $p(y|x)$ using the mean of $S$ feature-sampling operations. That is,
$$p(y|x) \approx \frac{1}{S} \sum_{s=1}^{S} p^{s}(y|x) \approx \frac{1}{S} \sum_{s=1}^{S} f_{\mathrm{Discriminator}}(\hat{x}^{s}). \qquad (3)$$
MCD is referred to as the average of S stochastic CNNs. Its main drawback is that it is
sensitive to the sampling mask drawn from a predefined distribution parameterized by a dropout
probability q, which is sensitive to the dataset and/or architecture [16].
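As an illustration of the averaging in (3), the following is a minimal numpy sketch of MC dropout with a single Bernoulli dropout layer at the discriminator input; the linear discriminator, the dimensions, and the dropout probability are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, S, q = 10, 64, 20, 0.5             # classes, feature dim, samples, dropout probability (illustrative)
W = 0.01 * rng.standard_normal((K, D))   # placeholder discriminator weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x_hat = rng.standard_normal(D)           # features extracted once from the input image

# Average the predictions of S stochastic forward passes, each with a fresh Bernoulli mask.
p = np.zeros(K)
for _ in range(S):
    mask = rng.binomial(1, 1.0 - q, size=D) / (1.0 - q)   # keep with probability 1-q, rescale
    p += softmax(W @ (mask * x_hat))
p /= S                                   # p(y|x) as in equation (3)
print(p.argmax(), p.max())
```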
3.3. Ensemble
Given a set of CNNs $f_m$ for $m \in \{1, 2, \ldots, M\}$, the ensemble prediction $p(y|x)$ is estimated by averaging over the predictions of all CNNs. That is,
$$p(y|x) := \frac{1}{M} \sum_{m=1}^{M} p^{m}(y|x) := \frac{1}{M} \sum_{m=1}^{M} f_{\mathrm{Discriminator}_m}(f_{\mathrm{FeatureExtractor}_m}(x)). \qquad (4)$$
An ensemble is referred to as the average of M deterministic CNNs (M << S). Its major
drawback is the inability to evaluate the uncertainty associated with the extracted features.
3.4. Mixture of Monte Carlo dropout
MMCD applies MCD to each member of an ensemble and estimates $p(y|x)$ as
$$p(y|x) \approx \frac{1}{M \cdot S} \sum_{m=1}^{M} \sum_{s=1}^{S} p^{ms}(y|x) \approx \frac{1}{M \cdot S} \sum_{m=1}^{M} \sum_{s=1}^{S} f_{\mathrm{Discriminator}_m}(\hat{x}^{ms}), \qquad (5)$$
where $\hat{x}^{ms}$ is a feature sampled (as shown in (2)) from $\hat{x}^{m} = f_{\mathrm{FeatureExtractor}_m}(x)$. MMCD is referred to as the average of $M \cdot S$ stochastic CNNs. It has a similar drawback to MCD.
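For comparison with the MCD sketch above, the following minimal numpy sketch illustrates the ensemble average in (4) and the MMCD average in (5); the random stand-ins for the members' feature extractors and discriminators, and the values of M, S, and q, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, M, S, q = 10, 64, 5, 20, 0.5     # classes, feature dim, members, dropout samples, dropout prob

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Random stand-ins for the M members: per-member features x_hat^m and discriminator weights.
features = [rng.standard_normal(D) for _ in range(M)]
weights = [0.01 * rng.standard_normal((K, D)) for _ in range(M)]

# Ensemble, equation (4): average the M deterministic member predictions.
p_ens = np.mean([softmax(weights[m] @ features[m]) for m in range(M)], axis=0)

# MMCD, equation (5): apply MC dropout to each member and average all M*S stochastic passes.
p_mmcd = np.zeros(K)
for m in range(M):
    for _ in range(S):
        mask = rng.binomial(1, 1.0 - q, size=D) / (1.0 - q)   # sampled feature x_hat^{ms}
        p_mmcd += softmax(weights[m] @ (mask * features[m]))
p_mmcd /= M * S
print(p_ens.argmax(), p_mmcd.argmax())
```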
4. Monte Carlo averaging
[Figure: features extractor, features averaging, and discriminator blocks of the ensemble members.]
A straightforward way to evaluate multiple features and the uncertainty associated with them would be to let the discriminator of every member $m$ classify the features $\hat{x}^{n} = f_{\mathrm{FeatureExtractor}_n}(x)$ extracted by every member $n$, that is,
$$p(y|x) := \frac{1}{M^2} \sum_{m=1}^{M} \sum_{n=1}^{M} f_{\mathrm{Discriminator}_m}(\hat{x}^{n}). \qquad (6)$$
We found that with this, the classification accuracy drops drastically. For example, the
classification accuracy of an ensemble trained on CIFAR10 dropped from 89.50% to 17.73%
when we estimated p(y|x) as shown in (6). This means that the discriminators of members m
cannot correctly classify the features extracted from other members n. This proves that the
features extracted from different ensemble members are different. Therefore, MCA perturbs the features $\hat{x}^{m}$ (extracted from member $m$) by averaging them sequentially with the features $\hat{x}^{n}$ (extracted from the other members $n$). That is, given a set of CNNs $f_m$, MCA estimates $p(y|x)$ as
$$p(y|x) := \frac{1}{M^2} \sum_{m=1}^{M} \sum_{n=1}^{M} p^{amn}(y|x) := \frac{1}{M^2} \sum_{m=1}^{M} \sum_{n=1}^{M} f_{\mathrm{Discriminator}_m}(\hat{x}^{amn}), \qquad (7)$$
where $\hat{x}^{amn} = \frac{1}{2}\hat{x}^{m} + \frac{1}{2}\hat{x}^{n}$. Here, $m$ can be equal to $n$, which preserves the classification accuracy.
MCA is referred to as the average of $M^2$ deterministic CNNs. Overall, the proposed MCA is an
alternative to MMCD. Both MCA and MMCD have the same purpose and underlying principle.
Specifically, both approaches evaluate multiple features extracted from multiple members and
evaluate the uncertainty associated with the extracted features based on feature averaging or
sampling.
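The following is a minimal numpy sketch of the MCA estimate in (7): every discriminator classifies the pairwise-averaged features, including the case m = n. As before, the random stand-ins for the members' features and discriminators and the chosen dimensions are illustrative assumptions rather than the trained CNNs used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, M = 10, 64, 5                    # classes, feature dimension, members (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Random stand-ins for the M members: per-member features x_hat^m and discriminator weights.
features = [rng.standard_normal(D) for _ in range(M)]
weights = [0.01 * rng.standard_normal((K, D)) for _ in range(M)]

# MCA, equation (7): each discriminator m classifies the averaged features x_hat^{amn},
# and all M^2 deterministic predictions are averaged; n == m is allowed, preserving accuracy.
p = np.zeros(K)
for m in range(M):
    for n in range(M):
        x_avg = 0.5 * features[m] + 0.5 * features[n]
        p += softmax(weights[m] @ x_avg)
p /= M * M
print(p.argmax(), p.max())
```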
5. Experimental setup
Training details Performance was expected to be dependent on task difficulty (dataset). This
is because some datasets (e.g., GTSRB [36]) have more noise in their samples than others (e.g.,
MNIST [37]). Additionally, some datasets (e.g., CIFAR10 [38]) are more challenging to learn
than others (e.g., MNIST [37]). Performance was also expected to be dependent on architecture
because architecture determines how information is propagated from the input to subsequent
layers and different architectures can result in different gradient computations and, thus, different
solutions. Therefore, we compared MCA and related methods using four different datasets to
evaluate their abilities to perform on different tasks with different difficulties. We compared these
methods using three different architectures to evaluate their abilities to perform on different
architectures. Specifically, we evaluated MNIST on VGGNets [1], FashionMNIST [39] on
ResNets [2], CIFAR10 on DenseNets [3], and GTSRB on ResNets [2]. All CNNs were randomly
initialized and trained with a random shuffling of training samples. All CNNs were trained with
categorical cross entropy and stochastic gradient descent with a momentum of 0.9, a learning
rate of 0.02, a batch size of 128, and 100 epochs. All CNNs were regularized with batch
normalization [40] layers placed before each convolutional activation function and dropout layers
placed at the inputs of the fully-connected layers. Regularization was also conducted using
standard data augmentation, such as rotation, translation, scaling, and shear. All images were
standardized and normalized by dividing the pixel values by 255.
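The following PyTorch sketch illustrates a training configuration of the kind described above (SGD with momentum 0.9, learning rate 0.02, batch size 128, 100 epochs, categorical cross entropy, random shuffling, and affine augmentation). The pairing of ResNet-18 with CIFAR10 and the augmentation parameters are assumptions for the example, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

train_tf = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=10),   # rotation, translation, scaling, shear
    transforms.ToTensor(),                                  # also divides pixel values by 255
])
train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=train_tf)
loader = DataLoader(train_set, batch_size=128, shuffle=True)  # random shuffling of training samples

model = models.resnet18(num_classes=10)                     # one randomly initialized member
criterion = nn.CrossEntropyLoss()                           # categorical cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

model.train()
for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```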
Evaluation data We used five types of evaluation data (test data, subsets of correctly classified test data, OOD data, swap data, and noisy data) for different purposes. The test data were used to evaluate the classification accuracy and the expected calibration error (ECE). We expect the classification accuracy to be high and the ECE to be low for test data. The subsets of correctly classified test data each contain 1000 correctly classified test samples and were used to evaluate the average confidence for TPs. Swap data were simulated from subsets of correctly classified test data that were structurally perturbed by dividing the images into four regions and swapping the regions diagonally (see figure 2b). The swap data were used to evaluate the average confidence for FPs caused by structurally perturbed objects. Noisy data were simulated from subsets of correctly classified test data perturbed by additive Gaussian noise with a standard deviation of 500 (see figure 2c). The noisy data were used
to evaluate the average confidence for FPs caused by noisy objects. OOD data were simulated
using 1000 test data from CIFAR100 [38] and were used to evaluate the average confidence on
FPs caused by unknown objects. TPs and FPs are separable when the confidence for TPs is
high and the confidence for FPs is low. Therefore, we expect the average confidence to be high
for TPs and low for FPs.
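The following numpy sketch illustrates how the swap and noisy perturbations described above could be generated; the function names, the clipping of noisy pixels back to [0, 255], and the handling of the image border are assumptions for the example.

```python
import numpy as np

def swap_quadrants(img):
    # Divide the image into four regions and swap the regions diagonally.
    h, w = img.shape[0] // 2, img.shape[1] // 2
    out = img.copy()
    out[:h, :w], out[h:2 * h, w:2 * w] = img[h:2 * h, w:2 * w], img[:h, :w]   # top-left <-> bottom-right
    out[:h, w:2 * w], out[h:2 * h, :w] = img[h:2 * h, :w], img[:h, w:2 * w]   # top-right <-> bottom-left
    return out

def add_gaussian_noise(img, std=500.0, seed=0):
    # Add zero-mean Gaussian noise with the given standard deviation and clip to the pixel range.
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, std, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

image = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
swapped = swap_quadrants(image)
noisy = add_gaussian_noise(image)
```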
6. Experimental results
Comparison of classification accuracy and calibration error of MCA and related methods
Table 1 summarizes the classification accuracy, average confidence, and ECE for test data.
The results show that MCD can preserve the classification accuracy of the baseline (single
CNN), since the increase/decrease in classification accuracy of the baseline caused by MCD
is minimal. For example, for CIFAR10, MCD decreases the classification accuracy of the
baseline from 86.02% to (only) 85.65%. Moreover, for FashionMNIST, MCD increases the
classification accuracy of the baseline from 90.23% to (only) 90.32%. Furthermore, the results
show that an ensemble can increase the classification accuracy of the baseline, since the increase
in classification accuracy of the baseline caused by an ensemble is significant. For example,
for CIFAR10, the ensemble increases the classification accuracy of the baseline from 86.02%
to 90.15%. Furthermore, the results show that MMCD and MCA can preserve the classification
accuracy of the underlying ensemble, since the increase/decrease in classification accuracy of an
ensemble caused by MMCD or MCA is minimal. Table 1 shows that the ECE of a baseline is
lower than that of an ensemble, which is in turn lower than that of MCD, MMCD, and MCA.
This means that baseline is better calibrated than ensemble, which is in turn better calibrated
than MCD, MMCD, and MCA. Table 1 also shows that the ECE of MCD is lower than that of
MMCD, which is in turn lower than that of MCA. This means that MCD is better calibrated than MMCD, which is in turn better calibrated than MCA.
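For reference, the following numpy sketch computes an expected calibration error of the kind reported in Table 1, by grouping predictions into equal-width confidence bins and accumulating the bin-weighted gap between accuracy and mean confidence; the number of bins (15) is an assumption for the example.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    # Bin predictions by confidence and accumulate the bin-weighted absolute gap
    # between accuracy and mean confidence within each bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Toy example: confidences, predicted labels, and true labels for three test samples.
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 2], [1, 2, 2]))
```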
7. Discussion
To achieve our goal of improving the separability of TPs and FPs by enforcing the confidence
to be high for TPs and low for FPs, the research question What method is required to achieve
high and low confidence for TPs and FPs, respectively? must be answered. To address this
question, MCA was proposed and compared to related methods (baseline (single CNN), MCD,
ensemble, and MMCD). We showed that MCD could preserve the accuracy of the baseline,
while reducing the confidence for TPs. This finding indicates that MCD (mainly) affects the
degree of confidence rather than the accuracy. Conversely, we showed that an ensemble can increase the
accuracy of the baseline, while maintaining high confidence for TPs. This result is consistent with
previous studies [1, 2] showing that ensemble can increase accuracy. This is because an ensemble
evaluates multiple features. We showed that MMCD and MCA can preserve the accuracy of
the underlying ensemble, while reducing the confidence for TPs. This result suggests that, similar to MCD and MMCD, MCA (mainly) affects the degree of confidence rather than the accuracy. This is because feature sampling in MCD/MMCD and feature averaging in MCA evaluate the uncertainty associated with a given feature but do not change the prediction associated with
the feature. We showed that baseline is (often) better calibrated than ensemble, which is (often)
better calibrated than MCD, MMCD, and MCA. Moreover, we showed that MCD is (often)
better calibrated than MMCD, which is (often) better calibrated than MCA. This is because
MCD, MMCD, and MCA reduce the confidence of TPs. The larger the decrease in the degree
of confidence for TPs, the larger the calibration error. We showed that ensemble can reduce the
confidence of baseline for FPs, while maintaining the confidence for TPs (nearly) unchanged.
However, MCD, MMCD, and MCA can reduce the confidence of baseline for FPs at the cost
of reducing the confidence for TPs. We showed that the ability of an ensemble to reduce the
confidence of FPs better than MCD is dependent on the dataset, architecture, or FP type. This
result suggests that we cannot claim that the ensemble performs better than MCD in terms
of capturing uncertainty. This finding contradicts previous studies [25, 26, 9], which claim that an ensemble captures uncertainty better than MCD. Although MMCD and MCA reduce the
confidence for TPs, the remaining confidence for TPs is still high ([50%, 100%]) whereas the
confidence for FPs is low ([0%, 50%[). Therefore, we hypothesized that MCA and MMCD can
separate TPs and FPs better than ensemble or MCD. This is because MCA and MMCD not only
evaluate multiple features extracted from different members like an ensemble, but also evaluate
the uncertainty associated with the extracted features. For this reason, MCA and MMCD
capture the diversity between the different members better than an ensemble and therefore
improve the uncertainty. We showed that MCA can maintain low confidence for FPs similar
to or sometimes even better than MMCD. Hence, we hypothesized that MCA can perform (in
terms of separating TPs and FPs) similar to or sometimes even better than MMCD. Although
MMCD and MCA have similar performance, the design process of MMCD is more complex
than that of MCA. This is because MMCD requires the specification of a prior distribution from
which masks will be drawn for feature sampling, whereas MCA relies on features extracted from
ensemble members. Besides, MMCD is more expensive than MCA because of the large number
of sampling operations.
8. Conclusion
By sequentially averaging the features of ensemble members, MCA evaluates the uncertainty
associated with the extracted features like MMCD. Based on the empirical comparison of MCA
and related methods, we conclude that MCA can obtain performance similar to or sometimes
even better than MMCD. Particularly, like MMCD, MCA can preserve the accuracy of the
underlying ensemble. MCA, like MMCD, can separate TPs and FPs better than baseline,
ensemble, and MCD. This finding suggests that we can use MCA instead of MMCD for
applications (such as collision prediction [10]), where the separability of TPs and FPs is
essential. MCA can also benefit other fields (such as active learning [17], online learning [25],
and reinforcement learning [33]) where uncertainty is required.
9. Limitations
Although MCA can improve the separability of TPs and FPs, it can increase the calibration
error because it reduces the confidence in TPs. This suggests that improving the separability of
TPs and FPs may negatively affect confidence calibration, and vice versa. We argue that the
confidence drop in TPs is caused by inductive biases inherent in ensemble members or introduced
by feature averaging. To reduce the level of inductive biases, we can combine ensemble members
by averaging logits instead of probabilities. This will, however, be investigated in future works.
MCA relies on multiple members like ensemble and MMCD, and for a large number of members,
it may require a large amount of storage memory. This may limit its adoption in applications
with a limited amount of storage memory. To overcome this limitation, future research should
explore pruning methods [43] to reduce the number of members to three or five and therefore
reduce the memory requirement.
References
[1] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:14091556. 2014.
[2] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings
of the IEEE conference on computer vision and pattern recognition; 2016. p. 770-8.
[3] Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional
networks. In: Proceedings of the IEEE conference on computer vision and pattern
recognition; 2017. p. 4700-8.
[4] Bejani MM, Ghatee M. A systematic review on overfitting control in shallow and deep
neural networks. Artificial Intelligence Review. 2021;54(8):6391-438.
[5] Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty
estimation using deep ensembles. arXiv preprint arXiv:161201474. 2016.
[6] Hendrycks D, Dietterich T. Benchmarking neural network robustness to common
corruptions and perturbations. arXiv preprint arXiv:190312261. 2019.
[7] Hendrycks D, Gimpel K. A Baseline for Detecting Misclassified and Out-of-Distribution
Examples in Neural Networks. In: 5th International Conference on Learning
Representations; 2017. .
[8] Liang S, Li Y, Srikant R. Enhancing the reliability of out-of-distribution image detection
in neural networks. arXiv preprint arXiv:170602690. 2017.
[9] Ovadia Y, Fertig E, Ren J, Nado Z, Sculley D, Nowozin S, et al. Can you trust your model’s
uncertainty? Evaluating predictive uncertainty under dataset shift. In: Advances in Neural
Information Processing Systems; 2019. p. 13991-4002.
[10] Kahn G, Villaflor A, Pong V, Abbeel P, Levine S. Uncertainty-aware reinforcement learning
for collision avoidance. arXiv preprint arXiv:170201182. 2017.
[11] Chen W, Qu T, Zhou Y, Weng K, Wang G, Fu G. Door recognition and deep learning
algorithm for visual based robot navigation. In: 2014 ieee international conference on
robotics and biomimetics (robio 2014). IEEE; 2014. p. 1793-8.
[12] Ouyang W, Wang X. Joint deep learning for pedestrian detection. In: Proceedings of the
IEEE international conference on computer vision; 2013. p. 2056-63.
[13] Gawlikowski J, Tassi CRN, Ali M, Lee J, Humt M, Feng J, et al. A survey of uncertainty
in deep neural networks. arXiv preprint arXiv:210703342. 2021.
[14] Gal Y, Ghahramani Z. Bayesian convolutional neural networks with Bernoulli approximate
variational inference. arXiv preprint arXiv:150602158. 2015.
[15] Gal Y, Ghahramani Z. Dropout as a bayesian approximation: Representing model
uncertainty in deep learning. In: international conference on machine learning; 2016. p.
1050-9.
[16] Tassi CRN. Bayesian Convolutional Neural Network: Robustly Quantify Uncertainty for
Misclassifications Detection. In: Mediterranean Conference on Pattern Recognition and
Artificial Intelligence. Springer; 2019. p. 118-32.
[17] Zeng J, Lesnikowski A, Alvarez JM. The relevance of Bayesian layer positioning to model
uncertainty in deep Bayesian active learning. arXiv preprint arXiv:181112535. 2018.
[18] Brosse N, Riquelme C, Martin A, Gelly S, Moulines É. On last-layer algorithms for
classification: Decoupling representation from uncertainty estimation. arXiv preprint
arXiv:200108049. 2020.
[19] Kristiadi A, Hein M, Hennig P. Being bayesian, even just a bit, fixes overconfidence in relu
networks. In: International Conference on Machine Learning. PMLR; 2020. p. 5436-46.
[20] Mobiny A, Nguyen HV, Moulik S, Garg N, Wu CC. DropConnect Is Effective in Modeling
Uncertainty of Bayesian Deep Networks. arXiv preprint arXiv:190604569. 2019.
[21] McClure P, Kriegeskorte N. Robustly representing uncertainty through sampling in deep
neural networks. arXiv preprint arXiv:161101639. 2016.
[22] Zhang Z, Dalca AV, Sabuncu MR. Confidence calibration for convolutional neural networks
using structured dropout. arXiv preprint arXiv:190609551. 2019.
[23] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional
neural networks. Advances in neural information processing systems. 2012;25.
[24] Ashukha A, Lyzhov A, Molchanov D, Vetrov D. Pitfalls of in-domain uncertainty estimation
and ensembling in deep learning. arXiv preprint arXiv:200206470. 2020.
[25] Beluch WH, Genewein T, Nürnberger A, Köhler JM. The power of ensembles for active
learning in image classification. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition; 2018. p. 9368-77.
[26] Gustafsson FK, Danelljan M, Schon TB. Evaluating scalable bayesian deep learning
methods for robust computer vision. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops; 2020. p. 318-9.
[27] Lee S, Purushwalkam S, Cogswell M, Crandall D, Batra D. Why M heads are better than
one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:151106314. 2015.
[28] Wen Y, Jerfel G, Muller R, Dusenberry MW, Snoek J, Lakshminarayanan B, et al.
Combining Ensembles and Data Augmentation Can Harm Your Calibration. In:
International Conference on Learning Representations; 2021. .
[29] Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. mixup: Beyond empirical risk minimization.
In: International Conference on Learning Representations; 2018. .
[30] Maroñas J, Ramos D, Paredes R. Improving Calibration in Mixup-trained Deep Neural
Networks through Confidence-Based Loss Functions. arXiv preprint arXiv:200309946. 2020.
[31] Rahaman R, Thiery AH. Uncertainty Quantification and Deep Ensembles. stat.
2020;1050:20.
[32] Wu X, Gales M. Should Ensemble Members Be Calibrated? arXiv preprint
arXiv:210105397. 2021.
[33] Lütjens B, Everett M, How JP. Safe reinforcement learning with model uncertainty
estimates. In: 2019 International Conference on Robotics and Automation (ICRA). IEEE;
2019. p. 8662-8.
[34] Wilson AG, Izmailov P. Bayesian Deep Learning and a Probabilistic Perspective of
Generalization. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors.
Advances in Neural Information Processing Systems. vol. 33; 2020. p. 4697-708.
[35] Fort S, Hu H, Lakshminarayanan B. Deep ensembles: A loss landscape perspective. arXiv
preprint arXiv:191202757. 2019.
[36] Houben S, Stallkamp J, Salmen J, Schlipsing M, Igel C. Detection of traffic signs in real-
world images: The German Traffic Sign Detection Benchmark. In: The 2013 international
joint conference on neural networks (IJCNN). Ieee; 2013. p. 1-8.
[37] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document
recognition. Proceedings of the IEEE. 1998;86(11):2278-324.
[38] Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images. 2009.
[39] Xiao H, Rasul K, Vollgraf R. Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:170807747. 2017.
[40] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In: International conference on machine learning. PMLR; 2015. p.
448-56.
[41] Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using
bayesian binning. In: Proceedings of the... AAAI Conference on Artificial Intelligence.
AAAI Conference on Artificial Intelligence. vol. 2015. NIH Public Access; 2015. p. 2901.
[42] Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In:
International Conference on Machine Learning. PMLR; 2017. p. 1321-30.
[43] Tsoumakas G, Partalas I, Vlahavas I. A taxonomy and short review of ensemble selection.
In: Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications;
2008. p. 1-6.