Unsupervised Anomaly Detection Algorithms On Real-World Data: How Many Do We Need?
Abstract
In this study we evaluate 33 unsupervised anomaly detection algorithms on 52 real-world
multivariate tabular data sets, performing the largest comparison of unsupervised anomaly
detection algorithms to date. On this collection of data sets, the EIF (Extended Isolation
Forest) algorithm significantly outperforms most other algorithms. Visualizing and
then clustering the relative performance of the considered algorithms on all data sets, we
identify two clear clusters: one with “local” data sets, and another with “global” data sets.
“Local” anomalies occupy a region with low density when compared to nearby samples,
while “global” anomalies occupy an overall low density region in the feature space. On the local data
sets the kNN (k-nearest neighbor) algorithm comes out on top. On the global data sets,
the EIF (extended isolation forest) algorithm performs the best. Also taking into consid-
eration the algorithms’ computational complexity, a toolbox with these two unsupervised
anomaly detection algorithms suffices for finding anomalies in this representative collection
of multivariate data sets. By providing access to code and data sets, our study can be
easily reproduced and extended with more algorithms and/or data sets.
Keywords: Unsupervised Anomaly Detection, Anomaly Analysis, Algorithm Compari-
son, Outlier Detection, Outlier Analysis
1. Introduction
Anomaly detection is the study of finding data points that do not fit the expected struc-
ture of the data. Anomalies can be caused by unexpected processes generating the data. In
chemistry an anomaly might be caused by an incorrectly performed experiment, in medicine
a certain disease might induce rare symptoms, and in predictive maintenance an anomaly
can be indicative of early system failure. Depending on the application domain, anomalies
have different properties, and may also be called by different names. Within the domain
of machine learning (and hence also in this paper), anomaly detection is often used inter-
changeably with outlier detection.
Unsupervised, data-driven, detection of anomalies is a standard technique in machine
learning. Throughout the years, many methods, or algorithms, have been developed in
order to detect anomalies. Some of these algorithms aim to solve specific issues, such
as high dimensionality. Other methods try to detect anomalies in the general sense, and
focus on high performance or low computational or memory complexity. Due to the many
algorithms available, it is hard to determine which algorithm is best suited for a particular
use case, especially for a user who is not intimately familiar with the field of anomaly
detection. More details on the specific algorithms evaluated in this study can be found in
section 3.1.
Several studies have been performed to provide guidelines on when to apply which
algorithm. Some review studies (Malik et al., 2014; Ruff et al., 2021), give advice based
on the theoretical properties of the algorithms. In recent years, several studies have been
conducted that empirically compare a number of anomaly detection algorithms on a range
of data sets.
Emmott et al. (2015) study 8 well-known algorithms on 19 data sets. They find Isolation
Forest to perform the best overall, but recommend using ABOD (Angle-Based anomaly
Detection) or LOF (Local Outlier Factor) when there are multiple clusters present in the
data.
Campos et al. (2016) compare 12 k-nearest-neighbours-based algorithms on 11 base
data sets. They find LOF to significantly outperform a number of other methods, while
KDEOS (Kernel Density Estimation anomaly Score) performs significantly worse than most
algorithms.
Goldstein and Uchida (2016) compare 19 algorithms on 10 data sets. Unlike Campos
et al., Goldstein and Uchida perform no explicit optimization or selection, but rather evalu-
ate the average performance over a range of sensible hyperparameter settings. With methods
based on k-nearest neighbours generally giving stable results, Goldstein and Uchida recom-
mend kNN (k-nearest neighbours) for global anomalies, LOF for local anomalies, and HBOS
(Histogram-Based anomaly Score) in general (see 2.2 for an explanation of global and
local anomalies). Goldstein and Uchida compare on a data set basis, without any overall
statistical analysis.
More recently, Domingues et al. (2018) apply 14 algorithms on 15 data sets, some of
which are categorical. They find IF (Isolation Forest) and robust KDE (Kernel Density
Estimation) to perform best, but note that robust KDE is often too expensive to calculate
for larger data sets.
Steinbuss and Böhm (2021) propose a novel strategy for synthesizing anomalies in real-
world data sets using several statistical distributions as a sampling basis. They compare 4
algorithms across multiple data sets derived from 19 base data sets, both using the original
and synthesized anomalies. They find kNN and IF to work best for detecting global anoma-
lies, and LOF to work best for local and dependency anomalies. In the same year, Soenen
et al. (2021) study the effect of hyperparameter optimization strategies on the evaluation
and propose to optimize hyperparameters on a small validation set, with evaluation on a
much larger test set. In their comparison of 6 algorithms on 16 data sets, IF performs
the best, closely followed by CBLOF/u-CBLOF ((unweighted-)Cluster-Based Local Out-
lier Factor) and kNN, while OCSVM (One-Class Support Vector Machine) performs worst
unless optimized using a substantially larger validation set than the other algorithms.
Han et al. (2022) performed an extensive comparison of anomaly detection methods,
including supervised and semi-supervised algorithms. They compare 14 unsupervised algo-
rithms on 47 tabular data sets using out-of-the-box hyperparameter settings, that is, those
suggested by the algorithm authors or implementations. They subsample larger data sets
to a maximum of 10,000 samples, and duplicate samples for data sets smaller than 1,000 samples.
They find no significant differences between unsupervised algorithms. While real-world data
sets are being used, the anomalies in each data set are generated synthetically according
to 4 different type definitions (see section 2.2), and they compare the performance for each
different type. Additionally, they have analyzed more complex benchmark data sets used
in CV and NLP, such as CIFAR10 (Krizhevsky and Hinton, 2009) and the Amazon data
set (He and McAuley, 2016) by performing neural-based feature extraction.
Other studies are of a more limited scope, and cover for example methods for high-
dimensional data (Xu et al., 2018), or consider only ensemble methods (Zimek et al., 2014).
The studies done by Campos et al. (2016); Goldstein and Uchida (2016); Domingues
et al. (2018); Steinbuss and Böhm (2021); Soenen et al. (2021); Han et al. (2022) have
several limitations when used as a benchmark. Firstly, with the exception of Han et al., all
studies were done on a rather small collection of data sets. Secondly, these studies cover
only a small number of methods. Campos et al. compare only kNN-based approaches, while
Goldstein and Uchida fail to cover many of the methods that have gained traction in the
last few years, such as IF (Liu et al., 2008) and variants thereof (Hariri et al., 2019). Soenen
et al. consider just 6 commonly used methods, Steinbuss and Böhm cover 4 methods and
Han et al. cover 14 unsupervised methods.
Some of these studies consider the performance on data sets containing specific types of
anomalies, such as global or local anomalies. Specifically, Steinbuss and Böhm look at the
performance of different algorithms on data sets containing synthesized global, local, and
dependency anomalies. Similarly, Han et al. synthesize these three types of anomalies as
well as cluster anomalies for use in their comparison. Goldstein and Uchida’s study is, to
the best of our knowledge, the only one that analyzes real-world, that is, non-synthesized
global and local anomalies. In particular, they analyze the ‘pen-local’ and ‘pen-global’ data
sets, two variants of the same data set where different classes were selected to obtain local
and global anomalies specifically.
In practice, very little is known regarding what types of anomalies are present in com-
monly used benchmark data sets, and thus large scale comparisons on real-world data for
specific types of anomalies are still missing. In this study we apply a large number of com-
monly used anomaly detection methods on a large collection of multivariate data sets, to
discover guidelines on when to apply which algorithms. We explicitly choose to perform
no optimization of hyperparameters, so as to evaluate the performance of algorithms in a
truly unsupervised manner. Instead, we evaluate every algorithm over a range of sensible
hyperparameters, and compare average performances. This contrasts with Soenen et al.
(2021), who perform extensive optimization on a small validation set and thereby supply
guidelines for semi-supervised detection or active learning. Our approach rather is similar
to that used by Domingues et al. (2018), who also compare out-of-the-box performance. To
the best of our knowledge, ours is the largest study of its kind performed so far.
2. Background
2.1 Unsupervised Anomaly Detection
Most anomaly detection tasks, including the one in this study, are conducted unsupervised,
meaning that no labels are available to the user. Consequently, regular optimization
procedures, such as the grid searches for optimal hyperparameters used in supervised
learning, cannot be applied in unsupervised anomaly detection. Most unsupervised anomaly
detection algorithms assign scores, rather than labels, to samples. The most common
convention is that a higher score indicates a higher likelihood that a sample is an anomaly,
making unsupervised anomaly detection a ranking problem.
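As a minimal sketch of this convention, the snippet below scores a toy data set with a kNN-based detector from the PyOD library (used throughout this study) and ranks the samples by their scores; the data and the hyperparameter value are illustrative only.

```python
# Sketch: unsupervised anomaly detection as a ranking problem.
import numpy as np
from pyod.models.knn import KNN  # kNN detector from PyOD (Zhao et al., 2019)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))    # toy data: 500 samples, 5 features
X[:5] += 6.0                     # shift 5 samples so they act as anomalies

detector = KNN(n_neighbors=10)   # illustrative hyperparameter value
detector.fit(X)                  # unsupervised: no labels are used
scores = detector.decision_scores_   # higher score = more likely an anomaly

ranking = np.argsort(scores)[::-1]   # samples ranked by anomalousness
print("Most anomalous samples:", ranking[:5])
```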
[Figure 1: eight two-dimensional scatter plots (features X1 and X2), including panels titled “Peripheral Anomalies” and “Enclosed Anomalies”.]
Figure 1: 8 examples of different types of anomalies along the 4 defined property axes.
Normal data are visualized as blue points, while anomalies are visualized as red
crosses.
Most often, anomalies are isolated: single data points without any additional data points,
normal or anomalous, nearby. In many practical cases, however, anomalies are not that singular, and
small groups of anomalies form clusters, leading to clustered anomalies. Clustered anomalies
are closely related to the phenomenon known as “masking”, where similar anomalies mask
each other’s presence by forming a cluster (Liu et al., 2008). Examples of both isolated and
clustered anomalies can be found in Figure 1c.
Some anomalies are clearly univariate in nature. That is, they can be identified by just
a single feature score in an anomalous range. Other anomalies are multivariate in nature,
requiring a specific combination of feature scores to be identified as anomalies. These
multivariate anomalies are also often called dependency anomalies, as they differ from the
normal dependency, or causal, structure of the data. Examples of both univariate and multivariate
anomalies can be found in Figure 1d.
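As a toy sketch of this distinction (the values below are our own illustration, not taken from any of the benchmark data sets): a univariate anomaly is extreme in a single feature, while a dependency anomaly only stands out through an unusual combination of otherwise unremarkable feature values.

```python
# Sketch: univariate versus dependency (multivariate) anomalies.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 500)
x2 = x1 + rng.normal(0.0, 0.1, 500)        # normal data: x2 strongly depends on x1

univariate_anomaly = np.array([6.0, 6.0])   # can be identified from x1 alone
dependency_anomaly = np.array([1.5, -1.5])  # each feature is unremarkable on its
                                            # own; the combination breaks the
                                            # dependency structure of the data
```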
3. Methods

3.1 Algorithms
Of the 33 methods, 27 were used as implemented in the popular Python library for anomaly
detection, PyOD (Zhao et al., 2019). As part of this research, we made several contribu-
tions to this open source library, such as a memory-efficient implementation of the COF
(Connectivity-based Outlier Factor) method, as well as an implementation of the Birgé and
Rozenholc (2006) method for histogram size selection, which is used in HBOS and LODA
(lightweight on-line detector of anomalies). For EIF (extended isolation forest) we used the
implementation provided by the authors in the Python package “eif” by Hariri et al. (2019).
We implemented the ODIN (Outlier Detection using Indegree Number) method in Python,
and it is being prepared as a submission to the PyOD package. The ensemble-LOF method,
which implements the LOF score calculation in line with the original paper (Breunig et al.,
2000), was implemented using the base LOF algorithm from PyOD. DeepSVDD is applied
based on the publicly available code by its author Lukas Ruff, and we modified it to work
on general tabular data sets. The DynamicHBOS method was applied using the code by
Kanatoko.

Table 1: Overview of the algorithms, the setting of the hyperparameters, the year of original
publication, and the author(s). For the neural networks, the “shrinkage factor”
hyperparameter indicates that any subsequent layer in the encoder is defined by:
$\text{layer size}_{n+1} = \text{layer size}_n \times \text{shrinkage factor}$.
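As a small sketch of the shrinkage-factor rule from Table 1 (the function name and the rounding choice are our own illustration):

```python
# Sketch: encoder layer sizes from the "shrinkage factor" hyperparameter,
# following layer_size_{n+1} = layer_size_n * shrinkage_factor.
def encoder_layer_sizes(n_features: int, shrinkage_factor: float, n_layers: int) -> list[int]:
    sizes = [n_features]
    for _ in range(n_layers):
        sizes.append(max(1, round(sizes[-1] * shrinkage_factor)))
    return sizes

print(encoder_layer_sizes(64, 0.5, 3))  # [64, 32, 16, 8]
```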
We left out several of the methods implemented in the PyOD package, such as LOCI and
ROD, because they have a time or memory complexity of O(n³), with n the number of data
points. The PyOD SOS method was also excluded, due to its O(n²) memory requirement.
None of these methods performed notably well compared to the included algorithms
on the smaller data sets where evaluation was feasible. We have thoroughly
optimized several of the slower methods in PyOD, specifically the SOD, COF, and LMDD
methods.
3.2 Data
3.2.1 Datasets
In this study we consider a large collection of data sets from a variety of sources. We focus
on real-valued, multivariate, tabular data, comparable to the data sets used by Fernández-
Delgado et al. (2014); Campos et al. (2016); Goldstein and Uchida (2016); Soenen et al.
(2021); Domingues et al. (2018). Table 2 contains a summary of the data sets, listing
each data set’s origin, number of samples, number of features, number and percentage of
anomalies. While we recognize that other types of data, such as time series, categorical, or
visual data, are of interest, they cannot be readily compared in a single study.
Our collection consists for the most part of data sets from the ODDS data collec-
tion (Rayana, 2016), specifically the multi-dimensional point data sets. It is a collection
of various data sets, mostly adapted from the UCI machine learning repository (Dua and
Graff, 2017). All data sets are real-valued, without any categorical data. Curation of this
collection is sadly not fully up-to-date, causing some of the listed data sets to be unavailable.
The unavailable data sets were omitted from this comparison.
In addition to the ODDS data set, we also incorporate publicly available data sets used
in earlier anomaly detection research. These include several data sets from the compari-
son by Goldstein and Uchida (2016), from the comparison of Campos et al. (2016) using
ELKI (Schubert and Zimek, 2019), from a study on Generative Adversarial Active Learning,
or GAAL (Liu et al., 2019), from a study on extended Autoencoders (Shin and Kim, 2020),
from the ADBench comparison (Han et al., 2022), and from a study on Efficient Online
Anomaly Detection (EOAD) (Brandsæter et al., 2019). Data sets from these latter sources
that are (near-)duplicates of data sets present in the ODDS collection are left out. In Table
2 we specify exactly where each data set was downloaded or reconstructed from.
Emmott et al. (2013, 2015) present a systematic methodology to construct anomaly
detection benchmarks, which is then also extensively applied by Pevný (2016). In this paper,
we chose not to construct our own benchmark data sets, which inevitably leads to some
arbitrariness and possibly bias, but instead we rely on a large collection of different data
sets used in earlier comparison studies. Synthetic data sets are not included in this study,
as real-world data sets are generally considered the best available tool for benchmarking
algorithms (Emmott et al., 2015; Domingues et al., 2018; Ruff et al., 2021). While real-
world data sets are preferred for benchmarking, we note the usefulness of synthetic data
when studying specific properties of anomaly detection algorithms.
3.2.2 Preprocessing
Several steps have been undertaken to be able to compare the performance of the various
algorithms on the different data sets. Most importantly all features in all data sets have
been scaled and centered. Centering is done for each feature in a data set by subtracting the
median. Scaling is performed by dividing each feature by its interquartile range. Our choice
of centering and scaling procedure is deliberate, as both the median and interquartile range
are influenced less by the presence of anomalies than the mean and standard deviation. This
procedure is generally considered to be more stable than standardization when anomalies
are known to be present (Rousseeuw and Croux, 1993). Our choice of scaling is further
motivated because although some algorithms, such as Isolation Forest, can implicitly handle
Table 2: Summary of the 52 multivariate data sets used in our anomaly detection algorithm
comparison: the colloquial name of the data set, origin of the data set, the number
of samples, features, and anomalies, as well as the percentage of anomalies, and
the number of removed features.
features with different scales, methods that involve, for example, distance or cross-product
calculations are strongly affected by the scale of the features.
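A sketch of this centering and scaling step is given below; equivalent functionality is provided by scikit-learn's RobustScaler. The guard against constant features (zero interquartile range) is our own addition.

```python
# Sketch: robust centering (median) and scaling (interquartile range).
import numpy as np

def robust_scale(X: np.ndarray) -> np.ndarray:
    median = np.median(X, axis=0)                  # robust alternative to the mean
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    iqr = q75 - q25                                # robust alternative to the std
    iqr[iqr == 0] = 1.0                            # guard for constant features
    return (X - median) / iqr
```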
3.3 Evaluation

In the unsupervised anomaly detection setting, it is generally more common and useful to
evaluate anomaly scores, rather than the binary labels also produced by some algorithms.
An anomaly score is a real-valued score, where a higher value indicates a higher likelihood,
according to the score producing algorithm, that a specific sample is an anomaly. Using
these scores, samples can be ranked according to apparent anomalousness, providing in-
sights into the underlying nature of anomalies. For each data set, we calculate anomaly
scores on all available data at once, without using any cross-validation or train-test splits,
procedures common in the supervised setting. The scores from this unsupervised analysis
are then compared to the ground-truth labels, which indicate whether a sample is an
anomaly (1) or not (0), to evaluate the performance of the algorithm. In order to compare
the different algorithms we calculate the performance for each algorithm-data set combina-
tion in terms of the AUC (area under the curve) value resulting from the ROC (receiver
operating characteristic) curve. This is the most commonly used metric in anomaly detec-
tion evaluations (Goldstein and Uchida, 2016; Campos et al., 2016; Xu et al., 2018), which
can be readily interpreted from a probabilistic view. We considered using other metrics,
such as the R-precision or average precision and their chance-adjusted variants introduced
by Campos et al. (2016), but found these to be less stable, and harder to interpret.
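As a small illustration of this evaluation (toy scores and labels; scikit-learn's roc_auc_score is assumed):

```python
# Sketch: ROC AUC of unsupervised anomaly scores against ground-truth labels.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]             # ground truth: 1 = anomaly, 0 = normal
scores = [0.1, 0.4, 0.2, 0.8, 0.3]   # anomaly scores from a detector
# AUC is the probability that a random anomaly is scored above a random
# normal sample; here 5 of the 6 anomaly-normal pairs are ordered correctly.
print(roc_auc_score(y_true, scores))  # 0.833...
```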
For each data set we rank the AUC scores calculated from the scores produced by each
algorithm. Following the recommendations for the comparison of classifiers by Demšar
(2006), we use the Iman-Davenport statistic (Iman and Davenport, 1980) in order to deter-
mine whether there is any significant difference between the algorithms. If this statistic
exceeds the critical value corresponding to a p-value of 0.05, we apply the Nemenyi
post-hoc test (Nemenyi, 1963) to assess which algorithms differ significantly from each
other.
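A sketch of this procedure is given below, assuming an AUC matrix with one row per data set and one column per algorithm; the helper function is our own, following the Iman-Davenport correction of the Friedman statistic as described by Demšar (2006).

```python
# Sketch: Iman-Davenport test over per-data-set performance ranks.
import numpy as np
from scipy import stats

def iman_davenport(auc: np.ndarray) -> tuple[float, float]:
    """auc: (N data sets x k algorithms) matrix of AUC values."""
    n, k = auc.shape
    chi2_f = stats.friedmanchisquare(*auc.T).statistic  # Friedman statistic
    f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)     # Iman-Davenport correction
    # F-distributed with (k - 1) and (k - 1)(N - 1) degrees of freedom
    p_value = stats.f.sf(f_f, k - 1, (k - 1) * (n - 1))
    return f_f, p_value
```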
In some of the visualizations in this paper we plot the percentage of maximum AUC,
defined as

$$\widetilde{\mathrm{AUC}}(a, d) = \frac{\mathrm{AUC}(a, d)}{\max_{a' \in A} \mathrm{AUC}(a', d)} \times 100,$$

where $a$ denotes an algorithm from the set of evaluated algorithms $A$, and $d$ a data set.
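In code this amounts to a row-wise normalization of the AUC matrix (toy values for illustration):

```python
# Sketch: percentage of maximum AUC per data set.
import numpy as np

auc = np.array([[0.80, 0.75, 0.60],    # rows: data sets
                [0.70, 0.90, 0.85]])   # columns: algorithms
pct_of_max = auc / auc.max(axis=1, keepdims=True) * 100
print(pct_of_max)  # the best algorithm on each data set scores 100
```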
3.4 Reproducibility
In order to reproduce all our experiments, we have provided access to a public GitHub
repository containing the code, including the optimized PyOD methods, and the data sets
used for all experiments, as well as for the production of all figures and tables presented in
this paper.
4. Results
4.1 Overall Algorithm Performance
In order to gauge the performance, we evaluated each algorithm on each data set using the
AUC measure corresponding to a ROC curve. To evaluate the performance across multiple
sensible hyperparameter settings, the AUC value for a given method is the average of the
AUCs over all evaluated hyperparameter settings. In our analysis, we found two data sets on which
nearly every evaluated algorithm produced close to random results, that is, with all AUC
values between 0.4 and 0.6. These data sets, the ‘hrss anomalous standard’ and ‘wpbc’
data sets, were therefore excluded from further analysis. It is likely that these data sets
contain no discernible anomalies. The existence of newer versions of these data sets, further
motivates our choice of removal. Some data sets showed no AUC values above 0.6, but did
show AUC values below 0.4. In these cases, the detector performs better when the labels
are inverted. This behaviour was observed in the ‘yeast’, ‘skin’ and ‘vertebral’ data sets.
The original construction of the latter two data sets was done based on treating the largest
group of samples as the normal (label 0) class, and the smaller group as the anomaly (label
1) class, as is commonly done in anomaly detection research. Yet, for both these sets, the
more heterogeneous group was chosen as the normal class, in contrast to common anomaly
definitions. For these data sets, we inverted the labelling and recalculated the AUC values
to be more in line with the general anomaly property that anomalies are more heterogeneous
than normal data.
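A sketch of this protocol for a single algorithm-data set combination is shown below; the detector and the hyperparameter grid are illustrative, not the exact settings of Table 1.

```python
# Sketch: average AUC over a range of sensible hyperparameter settings.
import numpy as np
from sklearn.metrics import roc_auc_score
from pyod.models.knn import KNN

def mean_auc(X, y_true, n_neighbors_grid=(5, 10, 20, 50)):
    aucs = [roc_auc_score(y_true, KNN(n_neighbors=k).fit(X).decision_scores_)
            for k in n_neighbors_grid]
    return float(np.mean(aucs))
```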
Figure 2 shows the distribution of the performance for each method. In order to compare
the AUC across different data sets, which may have different baseline performances, the
boxplots express the AUC as a percentage of the maximum AUC value obtained
by the best performing algorithm on that particular data set.
It can be seen that many of the algorithms perform comparably, with a median percent-
age of maximum AUC around 90%. Several lower medians, as well as wider quartiles, can
be observed.
To determine whether any of the observed differences in performance in Figure 2 are
significant, we applied the Iman-Davenport test. This yielded a test statistic of 16.395, far
above the critical value of 0.625, thus rejecting the null hypothesis. We then applied the
Nemenyi post-hoc test to establish which algorithms significantly outperform which other
algorithms. The results of this Nemenyi post-hoc test are summarized in Table 3.
Table 3 reveals that there are indeed several algorithms significantly outperforming
many other algorithms. Most notable here is EIF, which significantly outperforms 14/16
algorithms evaluated in this study at the p = 0.05/p = 0.10 significance level respectively.
Since the computational complexity of Isolation Forest and variants thereof scales linearly
with the number of samples n, this may give them a further edge for large data sets over
methods such as kNN and its derivatives, whose computational complexity scales
quadratically, or at best as O(n log n) when optimized.
From Figure 2 and Table 3 we can also observe that the original CBLOF method is
by far the worst performing method based on the mean AUC, being significantly outperformed
by 22 algorithms at the p = 0.05 significance level. This corroborates the results of Goldstein
and Uchida (2016), who also found CBLOF to consistently underperform, while its un-
weighted variant, u-CBLOF, performs comparably to other algorithms. Notable is also
Figure 2: Boxplots of the performance of each algorithm on each data set in terms of per-
centage of maximum AUC. The maximum AUC is the highest AUC value ob-
tained by the best performing algorithm on that particular data set. The whiskers
in the boxplots extend 1.5 times the interquartile range past the low and high
quartiles. data set-algorithm combinations outside of the whiskers are marked as
diamonds.
Table 3: Significant differences between algorithms on all data sets based on Nemenyi post-hoc analysis. ++/+ denotes that the row algorithm outperforms the column algorithm at p = 0.05/p = 0.10, and --/- that it is outperformed at p = 0.05/p = 0.10. Rows and columns are sorted by descending and ascending mean performance, respectively. The last column shows the mean AUC.

[Significance matrix not reproduced; mean AUC per algorithm, in descending order:]
EIF 0.770, MCD 0.762, IF 0.759, gen2out 0.747, kth-NN 0.745, INNE 0.743, KDE 0.743, kNN 0.741, u-CBLOF 0.740, COPOD 0.729, OCSVM 0.723, ABOD 0.721, HBOS 0.720, AE 0.719, DynamicHBOS 0.717, ECOD 0.713, beta-VAE 0.711, VAE 0.704, PCA 0.700, LUNAR 0.691, SOD 0.682, LMDD 0.675, GMM 0.664, LODA 0.658, ensemble-LOF 0.643, LOF 0.617, DeepSVDD 0.596, ODIN 0.588, COF 0.586, ALAD 0.576, SO-GAAL 0.575, sb-DeepSVDD 0.551, CBLOF 0.515
the sb-DeepSVDD method, which performs slightly better in terms of mean AUC, but is
significantly outperformed by 24/25 algorithms at the p = 0.05/p = 0.10 significance level
respectively.
From these overall results it is clear that many of the neural networks do not perform
well. (Soft-boundary) DeepSVDD, ALAD, and SO-GAAL all occupy the lower segment
of overall method performance. We surmise that there are three likely reasons for this
phenomenon. Firstly, these methods were not designed with tabular data in mind, and
they cannot leverage the same feature extraction capabilities that give them an edge on their
typical computer vision tasks. Secondly, these methods are relatively complex, making it
exceedingly hard to specify general hyperparameter settings and architectures which work
on a large variety of data sets. Lastly, many of the data sets in this study are likely not
sufficiently large enough to leverage the strengths of neural network approaches. Not all
neural networks suffer from these problems equally, as the auto-encoder and variants, as
well as LUNAR, perform about average. This is likely caused by a more straightforward
architecture and optimisation criterion. More specifically, we suspect lack of convergence is
a problem for the generative adversarial methods and DeepSVDD.
In addition to the neural networks, the local methods, such as LOF, ODIN, COF, and
CBLOF, are some of the most underperforming methods. This result for LOF stands in
stark contrast to the results of Campos et al. (2016), who found LOF to be among the best
performing methods. This is most likely caused by their evaluation on a small number of
data sets with a low percentage of anomalies, which causes LOF to suffer less from swamping
or masking (Liu et al., 2008). We further study this finding in Section 4.3.
4.2 Similarities Between Algorithms and Data Sets

To visualize the similarities between algorithms on one hand, and the data sets on the other,
Figure 3 shows a heatmap of the performance of each data set/algorithm combination and
dendrograms of two hierarchical clusterings, one on the data sets, and one on the algorithms.
For these clustering steps, the Pearson correlation was used as a distance measure, as this
best shows how similar methods or data sets are when looking at the calculated performance.
We furthermore used average linkage cluster analysis to construct more robust clusters. For
the sake of visualisation the leaf orderings were optimized using the method of Bar-Joseph
et al. (2001).
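A sketch of this clustering for the algorithm axis is given below, assuming again an AUC matrix of data sets by algorithms; SciPy provides the correlation distance, average linkage, and the optimal leaf ordering of Bar-Joseph et al. (2001).

```python
# Sketch: hierarchical clustering of algorithms by performance correlation.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
auc = rng.uniform(0.5, 1.0, size=(49, 33))     # illustrative (data sets x algorithms)

dist = pdist(auc.T, metric="correlation")      # 1 - Pearson correlation
Z = hierarchy.linkage(dist, method="average")  # average linkage clustering
Z = hierarchy.optimal_leaf_ordering(Z, dist)   # Bar-Joseph et al. (2001)
leaf_order = hierarchy.dendrogram(Z, no_plot=True)["leaves"]
```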
Figure 3 shows that many similar algorithms cluster together in an expected way, with
families of algorithms forming their own clusters. Some interesting patterns can be observed
at a larger level. For the algorithms, we obtain several fairly distinct clusters. Firstly,
CBLOF is distinct, as it underperforms on nearly every data set. Similarly, underperforming
methods such as SO-GAAL, GMM, and ALAD only cluster together at a large distance,
indicating that they have little correlation with other methods. The local methods, COF,
ensemble-LOF, LOF and ODIN, form a separate cluster. These algorithms, which are
specifically designed to detect local anomalies, work well on a few (approximately a quarter)
of the data sets, but do not perform well for most other data sets. We have a small cluster
of kNN and related methods such as LUNAR and SOD, that performs decently for all
data sets. Lastly, as can be seen in the top left half of Figure 3, a large cluster of
Figure 3: Clustered heatmap of the ROC/AUC performance of each algorithm. The algo-
rithms and data sets are each clustered using hierarchical clustering with average
linkage and the Pearson correlation as metric.
methods seems to negatively correlate with the local methods, performing well for most
(approximately three-quarters) of the data sets, but less so for the remainder.
The data sets split into two clearly distinct clusters: one cluster of data sets on which
the local algorithms perform well, and another cluster of data sets on which the large clus-
ter of algorithms performs well. Combining the two-way clustering with knowledge of the
algorithms suggests that approximately one third of the data sets comprises so-called local
problems, while the other two-thirds comprises global problems. This is corroborated by
the specifically constructed local and global data sets ‘pen-local’ and ‘pen-global’, which clearly
belong to their expected clusters. This observation is further supported by Steinbuss and
Böhm (2021) and Emmott et al. (2015), who similarly find differences between what they
categorize as local/dependency and multi-cluster anomalies, respectively, and global anomalies.
We do not observe any other clear patterns of different anomaly properties
arising from our analysis. The ‘vertebral’ data set furthermore seems distinct from both
the global and the local cluster.
4.3 Local Versus Global Anomaly Detection Problems

To the best of our knowledge, no previous study on naturally occurring anomalies in real-
world data has looked, in detail, into the difference between the performance of algorithms
when specifically being applied to either global or local anomaly detection problems.
In the previous section, we discovered a clear distinction between two clusters of data sets:
one with the “local” data sets ‘aloi’, ‘fault’, ‘glass’, ‘internetads’, ‘ionosphere’, ‘landsat’,
‘letter’, ‘magic.gamma’, ‘nasa’, ‘parkinson’, ‘pen-local’, ‘pima’, ‘skin’, ‘speech’, ‘vowels’,
‘waveform’, and ‘wilt’, and another with the remaining “global” data sets, excluding
‘vertebral’. Suspecting that different methods may do well on different types of data sets, we
repeated the significance testing procedure from Section 4.1 for both clusters separately.
Performance boxplots for all algorithms applied on the collection of local data sets can
be found in Figure 4. Figure 4 clearly shows the reversed performance of some of the
local methods for anomaly detection. Where COF, ensemble-LOF, and LOF were among
the worst performers over the entire collection, they are among the best performers when
applied to the problems for which they were specifically developed. This phenomenon is
a fine example of Simpson’s paradox (Simpson, 1951). This also partially explains the
difference in findings of our overall comparison and the comparison of Campos et al. (2016).
We then repeated the Nemenyi post-hoc test on just the local data sets.
The results of this analysis are summarized in Table 4. kNN is the top performer, significantly
outperforming 17/18 other methods at the p = 0.05/p = 0.10 significance level
respectively.
We then repeated the analysis for the global data sets, leading to the performance
boxplots in Figure 5 and the significance results in Table 5.
From Figure 5 and Table 5 we can see that the Extended Isolation Forest has the highest
mean performance, closely followed by the regular Isolation Forest. The Extended Isolation
Forest outperforms 13/14 methods at p = 0.05/p = 0.10 respectively. Coincidentally, these
methods also have the lowest computational and memory requirements, leaving them as the
most likely choices for global anomalies.
Figure 4: Boxplots of the performance of each algorithm on the “local” data sets in terms
of percentage of maximum AUC. The maximum AUC is the highest AUC value
obtained by the best performing algorithm on that particular data set. The
whiskers in the boxplots extend 1.5 times the interquartile range past the low
and high quartiles. data set-algorithm combinations outside of the whiskers are
marked as diamonds.
[Significance matrix not reproduced; mean AUC per algorithm on the local data sets, in descending order:]
kNN 0.737, kth-NN 0.722, GMM 0.710, ABOD 0.709, ensemble-LOF 0.708, MCD 0.702, COF 0.698, LOF 0.695, KDE 0.693, SOD 0.693, LUNAR 0.693, u-CBLOF 0.657, ODIN 0.646, EIF 0.646, INNE 0.644, IF 0.637, HBOS 0.609, gen2out 0.601, VAE 0.585, DynamicHBOS 0.578, OCSVM 0.573, CBLOF 0.566, AE 0.565, COPOD 0.561, LODA 0.555, DeepSVDD 0.554, LMDD 0.539, beta-VAE 0.539, ECOD 0.538, sb-DeepSVDD 0.531, ALAD 0.514, PCA 0.513, SO-GAAL 0.444
Table 4: Significant differences between algorithms on the collection of local problems based
on Nemenyi post-hoc analysis. ++/+ denotes that the row algorithm outperforms
the column algorithm at p = 0.05/p = 0.10. Rows and columns are sorted by
descending and ascending mean performance, respectively. Columns are not shown
when the column algorithm is not outperformed by any other algorithms at p =
0.05 or p = 0.10. The last column shows the mean AUC.
Figure 5: Boxplots of the performance of each algorithm on the global data sets in terms
of percentage of maximum AUC. The maximum AUC is the highest AUC value
obtained by the best performing algorithm on that particular data set. The
whiskers in the boxplots extend 1.5 times the interquartile range past the low
and high quartiles. data set-algorithm combinations outside of the whiskers are
marked as diamonds.
Table 5: Significant differences between algorithms on the collection of global problems based on Nemenyi post-hoc analysis. ++/+ denotes that the row algorithm outperforms the column algorithm at p = 0.05/p = 0.10, and --/- that it is outperformed at p = 0.05/p = 0.10. Rows and columns are sorted by descending and ascending mean performance, respectively. The last column shows the mean AUC.

[Significance matrix not reproduced; mean AUC per algorithm on the global data sets, in descending order:]
EIF 0.849, IF 0.837, gen2out 0.836, COPOD 0.831, ECOD 0.815, OCSVM 0.813, beta-VAE 0.813, AE 0.812, INNE 0.807, PCA 0.807, MCD 0.805, DynamicHBOS 0.804, u-CBLOF 0.793, HBOS 0.790, KDE 0.782, VAE 0.778, kth-NN 0.770, LMDD 0.757, kNN 0.755, ABOD 0.740, LODA 0.724, LUNAR 0.701, SOD 0.679, GMM 0.650, SO-GAAL 0.642, DeepSVDD 0.621, ensemble-LOF 0.615, ALAD 0.611, LOF 0.580, sb-DeepSVDD 0.566, ODIN 0.560, COF 0.530, CBLOF 0.487
5. Discussion
In our study we compared the performance of anomaly detection algorithms on 52 seman-
tically meaningful real-world tabular data sets, more than any other recent comparison
study (Campos et al., 2016; Goldstein and Uchida, 2016; Xu et al., 2018; Soenen et al.,
2021; Steinbuss and Böhm, 2021; Domingues et al., 2018; Han et al., 2022). A somewhat
comparable study by Fernández-Delgado et al. (2014) on classification algorithms easily
considered 121 data sets. The main reason for this discrepancy is that data sets for com-
paring anomaly detection algorithms rarely include categorical features, which are not an
issue for comparing classification algorithms. It is certainly possible to further extend the
collection of data sets, for example, through data set modifications. Campos et al. (2016),
Emmott et al. (2013, 2015), and Steinbuss and Böhm (2021) modified data sets in different
ways to create similar data sets with differing characteristics from a single base data set.
While such modifications can be useful for targeted studies, near-duplicate data sets are far
from independent and therefore seem detrimental to a proper statistical comparison of anomaly
detection algorithms, as can be observed in Emmott et al. (2015).
In this study we compared 33 of the most commonly used algorithms for anomaly detec-
tion. This collection is certainly not exhaustive: many more methods exist (Schubert and
Zimek, 2019; Goldstein and Uchida, 2016; Emmott et al., 2015; Ruff et al., 2021; Domingues
et al., 2018), and likely even more will be invented. Also along this axis, there is a clear
discrepancy with the study by Fernández-Delgado et al. (2014) on classification algorithms,
who incorporated 179 classifiers from 17 different families. Apparently, the number of
classification algorithms far exceeds the number of anomaly detection algorithms. But
perhaps more importantly, there are many more solid and easy-to-use implementations of
classification algorithms in many different machine learning libraries than there are out-of-
the-box implementations of anomaly detection algorithms. Before being able to perform
the comparison in this study, we had to spend quite some effort to clean up and sometimes
re-implement (parts of) existing code.
In this research we chose not to cover meta-techniques for ensembling. While ensembles
are of great interest, a better understanding of the performance of base learners is an
essential prerequisite before moving on to a study of ensemble methods.
While we evaluated neural networks in our comparison, no general guidelines exist on
how to construct a well-performing network for any given data set, which is essential for
the unsupervised setup considered in this study. Additionally, the strength of many of
these methods comes from high-level feature extraction implicitly performed by the net-
work, which cannot be leveraged on the smaller tabular data sets used in this benchmark.
Like Ruff et al. (2021), we recognize that there is a major discrepancy between the avail-
ability of classification and anomaly detection benchmark data sets useful for deep learning
approaches. More anomaly detection benchmark data sets useful for deep learning based
anomaly detection would be a welcome addition to the field.
Cross-comparing the performance of algorithms on data sets, we noticed a clear separa-
tion between two clusters of data sets and roughly two clusters of corresponding algorithms.
We characterized these clusters as “local” and “global” data sets and algorithms, in corre-
spondence with common nomenclature in the literature (Breunig et al., 2000; Goldstein and
Uchida, 2016). However, we are well aware that this characterization may turn out to be
an oversimplification when analyzing more data sets and more algorithms in closer detail.
For example, the local and global problems likely have quite some overlap, but need not be
fully equivalent with multimodal and unimodal anomaly detection problems, respectively.
Overlap between multimodal and local problems occurs when the different modes start hav-
ing different densities, so that local algorithms that try to estimate these local densities fare
better than global algorithms that cannot make this distinction. Further theoretical and
empirical studies, for example on carefully simulated data sets, are needed to shed further
light on this issue. We acknowledge that there is a gap in both theoretical and empirical studies on
determining what types of anomalies are present in a data set, which would directly help
in selecting an appropriate algorithm in conjunction with this research. Furthermore, we
have not readily observed several well-described properties of anomalies. This exemplifies
the need for more, varied, benchmark data sets.
6. Conclusion
Based on our research we can establish general guidelines on when users should apply which
anomaly detection methods for their problem.
In general, when a user has no a priori knowledge on whether their data set
contains local or global anomalies, EIF (Extended Isolation Forest) is the best choice. It
outperforms 14 out of 33 other evaluated methods at p = 0.05, and is the highest
performing method based on its mean AUC score.
When a data set is known or suspected to contain local anomalies, which might for
example occur when the data is known to contain multiple different density clusters, the
best performing method is kNN, which outperforms 17 out of 33 methods at p = 0.05.
Datasets containing just global anomalies are best analyzed using EIF, which is the
top performing algorithm on the data sets containing global anomalies. COPOD, gen2out,
INNE and kth-NN all perform comparably, and these methods all outperform at least 10
other methods at p = 0.05. IF and EIF are the algorithms with the lowest computational
complexity, which are usually preferable in practice.
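As a sketch of such a two-algorithm toolbox, combining the authors' “eif” package (Hariri et al., 2019) with PyOD's kNN; all parameter values below are illustrative.

```python
# Sketch: a minimal anomaly detection toolbox with EIF and kNN.
import numpy as np
import eif                               # package by Hariri et al. (2019)
from pyod.models.knn import KNN          # from PyOD (Zhao et al., 2019)

X = np.random.default_rng(0).normal(size=(1000, 8))

# EIF: default choice, and the best performer on global anomaly problems
forest = eif.iForest(X, ntrees=200, sample_size=256,
                     ExtensionLevel=X.shape[1] - 1)
eif_scores = forest.compute_paths(X_in=X)

# kNN: preferred when local anomalies are suspected
knn = KNN(n_neighbors=10).fit(X)
knn_scores = knn.decision_scores_
```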
Contemplating the above considerations, we are tempted to answer the question in the
title of our paper with “two”: a toolbox with kNN and EIF seems sufficient to perform
well on the type of multivariate data sets considered in our study. Given the scope of this
study, these two algorithms are likely to perform well on unseen real-world multivariate
data sets. This conclusion is open for further consideration when other algorithms and/or
data sets are added to the bag, which should be relatively easy to check by extending
the code and the data set pre-processing procedures that we open-sourced.
Future work following this study may seek to extend our comparative analysis with
diverse types of data such as raw images, texts and time-series. All of these types of data
require specific methods and tailored comparisons. Furthermore, automatically determining
properties of anomalies in a data set before further analysis is an unexplored avenue of study
which might provide users with even more detailed guidelines on which algorithm to apply.
Acknowledgments
The research reported in this paper has been partly funded by the NWO grant
NWA.1160.18.238 (https://primavera-project.com/), as well as BMK, BMDW, and the
State of Upper Austria in the frame of the SCCH competence center INTEGRATE (FFG
grant no. 892418), part of the FFG COMET Competence Centers for Excellent Technologies
Programme.
Data sets (columns, left to right): ionosphere, yeast, spambase, hrss, glass, seismic-bumps, donors, pageblocks, internetads, wilt, breastw, shuttle, pen-global, skin, satimage-2, fault, pima, mammography, smtp, hepatitis, mi-f, cover, parkinson, pen-local, arrhythmia
MCD 0.95 0.61 0.49 0.59 0.75 0.73 0.91 0.93 0.75 0.86 0.99 1.00 0.91 0.82 0.99 0.51 0.68 0.81 0.95 0.79 0.38 0.84 0.62 0.74 0.72
OCSVM 0.84 0.58 0.52 0.57 0.65 0.71 0.87 0.93 0.69 0.44 0.98 0.98 0.87 0.55 0.99 0.47 0.63 0.83 0.95 0.76 0.49 0.93 0.45 0.47 0.80
ODIN 0.85 0.55 0.53 0.57 0.64 0.55 0.28 0.58 0.57 0.66 0.51 0.50 0.66 0.33 0.55 0.59 0.56 0.44 0.30 0.67 0.51 0.52 0.43 0.88 0.63
SO-GAAL 0.78 0.58 0.44 0.57 0.77 0.57 0.60 0.88 0.45 0.44 0.48 0.58 0.76 0.66 0.72 0.39 0.33 0.61 0.66 0.72 0.58 0.60 0.30 0.05 0.70
LODA 0.47 0.48 0.59 0.58 0.22 0.71 0.74 0.57 0.60 0.49 0.92 0.67 0.72 0.54 0.98 0.43 0.58 0.81 0.76 0.55 0.80 0.91 0.76 0.64 0.51
DynamicHBOS 0.21 0.58 0.70 0.57 0.60 0.75 0.90 0.71 0.50 0.45 0.98 0.99 0.71 0.61 0.98 0.55 0.63 0.86 0.95 0.78 0.34 0.79 0.68 0.66 0.80
COPOD 0.80 0.62 0.68 0.60 0.64 0.71 0.82 0.88 0.68 0.34 0.99 0.99 0.79 0.47 0.97 0.46 0.65 0.91 0.91 0.80 0.66 0.88 0.54 0.52 0.80
kNN 0.92 0.60 0.65 0.55 0.82 0.74 0.67 0.95 0.73 0.69 0.98 0.82 0.98 0.67 0.95 0.66 0.71 0.83 0.91 0.79 0.39 0.79 0.64 0.98 0.80
kth-NN 0.89 0.60 0.67 0.55 0.79 0.73 0.67 0.95 0.71 0.69 0.98 0.84 0.98 0.67 0.97 0.64 0.71 0.83 0.92 0.81 0.39 0.79 0.60 0.98 0.80
PCA 0.80 0.59 0.55 0.57 0.53 0.64 0.77 0.90 0.61 0.31 0.82 0.99 0.78 0.44 0.97 0.50 0.59 0.84 0.87 0.76 0.83 0.93 0.31 0.34 0.78
gen2out 0.76 0.60 0.66 0.57 0.73 0.71 0.86 0.91 0.56 0.49 0.99 0.99 0.92 0.65 0.99 0.53 0.65 0.86 0.94 0.70 0.70 0.92 0.53 0.66 0.78
GMM 0.74 0.57 0.58 0.60 0.76 0.66 0.75 0.83 0.61 0.78 0.96 0.90 0.92 0.74 0.75 0.65 0.59 0.85 0.94 0.35 0.45 0.90 0.61 0.95 0.34
DeepSVDD 0.75 0.50 0.47 0.55 0.48 0.59 0.52 0.67 0.68 0.50 0.71 0.49 0.64 0.50 0.75 0.50 0.53 0.53 0.78 0.65 0.41 0.47 0.52 0.72 0.67
SOD 0.89 0.53 0.54 0.53 0.74 0.71 0.77 0.76 0.53 0.59 0.94 0.74 0.81 0.61 0.78 0.64 0.58 0.78 0.88 0.57 0.37 0.63 0.70 0.92 0.76
KDE 0.93 0.61 0.60 0.55 0.77 0.74 0.88 0.94 0.66 0.50 0.98 0.93 0.96 0.62 0.99 0.63 0.70 0.83 0.95 0.78 0.36 0.92 0.62 0.87 0.67
CBLOF 0.72 0.50 0.59 0.54 0.52 0.45 0.43 0.63 0.56 0.57 0.29 0.98 0.58 0.49 0.33 0.59 0.42 0.63 0.40 0.36 0.56 0.83 0.45 0.67 0.48
INNE 0.90 0.60 0.59 0.56 0.70 0.71 0.79 0.96 0.69 0.56 0.77 0.98 0.89 0.53 1.00 0.53 0.67 0.72 0.94 0.71 0.52 0.95 0.47 0.74 0.73
AE 0.83 0.60 0.55 0.58 0.60 0.67 0.76 0.92 0.61 0.34 0.97 0.99 0.90 0.60 0.98 0.53 0.66 0.88 0.82 0.72 0.83 0.92 0.35 0.55 0.77
LOF 0.90 0.54 0.50 0.56 0.79 0.54 0.57 0.67 0.63 0.65 0.40 0.58 0.72 0.57 0.56 0.57 0.61 0.74 0.48 0.72 0.55 0.51 0.54 0.98 0.62
IF 0.86 0.61 0.62 0.59 0.69 0.67 0.77 0.90 0.68 0.45 0.99 1.00 0.93 0.67 0.99 0.58 0.68 0.86 0.91 0.70 0.78 0.89 0.50 0.80 0.81
ECOD 0.74 0.56 0.64 0.59 0.62 0.68 0.89 0.91 0.68 0.39 0.99 1.00 0.77 0.49 0.97 0.47 0.59 0.91 0.88 0.74 0.56 0.93 0.38 0.45 0.81
ABOD 0.93 0.60 0.52 0.57 0.81 0.74 0.25 0.94 0.74 0.69 0.98 0.79 0.98 0.25 0.95 0.68 0.70 0.55 0.94 0.79 0.44 0.80 0.63 0.96 0.82
ensemble-LOF 0.90 0.53 0.51 0.56 0.80 0.52 0.68 0.71 0.68 0.64 0.31 0.62 0.81 0.68 0.67 0.56 0.61 0.78 0.83 0.64 0.59 0.50 0.54 0.99 0.62
sb-DeepSVDD 0.71 0.54 0.41 0.52 0.58 0.51 0.28 0.66 0.63 0.41 0.72 0.46 0.66 0.49 0.60 0.51 0.55 0.36 0.71 0.60 0.30 0.46 0.44 0.63 0.63
u-CBLOF 0.87 0.42 0.51 0.57 0.80 0.70 0.81 0.92 0.69 0.53 0.97 0.99 0.90 0.67 1.00 0.56 0.66 0.79 0.90 0.72 0.60 0.89 0.57 0.81 0.79
LUNAR 0.93 0.58 0.51 0.58 0.82 0.72 0.67 0.72 0.65 0.43 0.98 0.65 0.91 0.66 0.89 0.71 0.70 0.83 0.90 0.65 0.41 0.73 0.51 0.90 0.79
HBOS 0.77 0.59 0.64 0.56 0.71 0.71 0.81 0.75 0.68 0.36 0.99 0.99 0.77 0.60 0.98 0.59 0.70 0.83 0.83 0.78 0.40 0.64 0.52 0.73 0.80
VAE 0.86 0.60 0.56 0.57 0.59 0.67 0.79 0.92 0.61 0.33 0.96 0.79 0.85 0.55 0.87 0.55 0.66 0.84 0.82 0.72 0.51 0.94 0.42 0.63 0.77
COF 0.88 0.55 0.47 0.54 0.78 0.55 0.68 0.59 0.66 0.63 0.43 0.56 0.64 0.70 0.54 0.57 0.60 0.75 0.42 0.57 0.52 0.51 0.60 0.95 0.66
LMDD 0.74 0.53 0.50 0.50 0.59 0.71 0.90 0.76 0.68 0.40 0.66 0.99 0.75 0.43 0.48 0.40 0.64 0.78 0.86 0.83 0.56 0.91 0.59 0.45 0.80
ALAD 0.57 0.51 0.56 0.52 0.55 0.64 0.60 0.58 0.61 0.43 0.83 0.58 0.62 0.72 0.62 0.41 0.54 0.72 0.31 0.66 0.74 0.42 0.45 0.33 0.67
beta-VAE 0.79 0.60 0.55 0.57 0.59 0.66 0.82 0.91 0.61 0.33 0.95 0.99 0.81 0.52 0.98 0.49 0.66 0.89 0.82 0.74 0.80 0.93 0.36 0.44 0.77
EIF 0.88 0.61 0.67 0.58 0.70 0.71 0.83 0.91 0.69 0.49 0.98 1.00 0.94 0.71 1.00 0.52 0.69 0.84 0.94 0.75 0.79 0.89 0.52 0.79 0.82
Table 6: The AUC values for the first half of the algorithm-data set combinations.
Data sets (columns, left to right): musk, wbc2, vertebral, mnist, pendigits, stamps, nasa, wbc, optdigits, waveform, landsat, cardio, satellite, yeast6, campaign, speech, letter, wine, aloi, mi-v, vowels, thyroid, annthyroid, http, magic.gamma
MCD 1.00 0.99 0.39 0.83 0.84 0.84 0.61 0.93 0.39 0.59 0.56 0.82 0.76 0.67 0.77 0.50 0.81 0.77 0.54 0.58 0.91 0.99 0.93 1.00 0.74
OCSVM 0.87 0.98 0.38 0.70 0.93 0.88 0.52 0.94 0.51 0.65 0.38 0.88 0.63 0.72 0.80 0.47 0.61 0.86 0.54 0.62 0.71 0.99 0.94 0.99 0.67
ODIN 0.55 0.79 0.50 0.62 0.51 0.59 0.56 0.79 0.51 0.66 0.49 0.57 0.49 0.45 0.56 0.65 0.86 0.66 0.74 0.53 0.87 0.57 0.50 0.92 0.65
SO-GAAL 0.90 0.83 0.67 0.58 0.90 0.71 0.51 0.41 0.50 0.41 0.46 0.80 0.55 0.56 0.56 0.47 0.43 0.20 0.50 0.53 0.04 0.93 0.86 0.66 0.57
LODA 0.99 0.98 0.32 0.82 0.91 0.92 0.57 0.96 0.42 0.70 0.39 0.64 0.62 0.71 0.40 0.55 0.55 0.32 0.53 0.72 0.72 0.93 0.55 0.97 0.69
DynamicHBOS 1.00 0.99 0.27 0.44 0.88 0.89 0.65 0.95 0.83 0.69 0.59 0.84 0.77 0.67 0.83 0.47 0.63 0.92 0.47 0.51 0.71 0.99 0.89 0.94 0.72
COPOD 0.95 0.99 0.33 0.77 0.90 0.93 0.54 0.96 0.68 0.73 0.42 0.92 0.63 0.81 0.78 0.49 0.56 0.87 0.51 0.67 0.50 0.94 0.78 0.99 0.68
kNN 0.54 0.99 0.35 0.81 0.79 0.86 0.65 0.94 0.45 0.74 0.60 0.80 0.70 0.64 0.80 0.51 0.85 0.81 0.60 0.59 0.97 0.99 0.92 0.15 0.79
kth-NN 0.63 0.99 0.34 0.82 0.82 0.89 0.65 0.94 0.48 0.74 0.61 0.84 0.71 0.67 0.81 0.48 0.81 0.90 0.58 0.60 0.96 0.99 0.92 0.15 0.78
PCA 1.00 0.99 0.49 0.85 0.92 0.85 0.46 0.92 0.48 0.60 0.40 0.95 0.63 0.71 0.74 0.47 0.52 0.81 0.55 0.74 0.64 0.96 0.69 1.00 0.65
gen2out 1.00 1.00 0.39 0.78 0.95 0.90 0.53 0.94 0.63 0.63 0.48 0.93 0.72 0.74 0.70 0.46 0.65 0.78 0.50 0.71 0.71 0.99 0.87 1.00 0.70
GMM 0.76 0.39 0.34 0.67 0.76 0.74 0.62 0.38 0.45 0.69 0.59 0.64 0.67 0.49 0.77 0.57 0.89 0.17 0.53 0.60 0.95 0.91 0.87 0.20 0.80
DeepSVDD 0.73 0.92 0.50 0.63 0.58 0.69 0.46 0.85 0.55 0.45 0.60 0.73 0.59 0.65 0.65 0.51 0.63 0.72 0.50 0.48 0.64 0.70 0.63 0.41 0.48
SOD 0.45 0.96 0.59 0.71 0.66 0.74 0.59 0.93 0.52 0.62 0.57 0.68 0.58 0.59 0.73 0.56 0.89 0.47 0.66 0.57 0.93 0.96 0.81 0.26 0.75
KDE 0.08 0.97 0.33 0.73 0.96 0.87 0.64 0.90 0.41 0.76 0.60 0.81 0.78 0.72 0.84 0.43 0.88 0.77 0.59 0.59 0.89 0.98 0.94 0.99 0.68
CBLOF 0.61 0.23 0.53 0.64 0.45 0.32 0.45 0.35 0.48 0.69 0.55 0.53 0.62 0.39 0.51 0.52 0.59 0.31 0.54 0.55 0.81 0.24 0.37 0.42 0.49
INNE 0.99 0.91 0.40 0.82 0.87 0.82 0.59 0.93 0.55 0.74 0.54 0.89 0.75 0.69 0.81 0.47 0.68 0.82 0.53 0.64 0.89 0.98 0.92 1.00 0.71
AE 1.00 0.98 0.36 0.85 0.93 0.88 0.50 0.93 0.51 0.64 0.37 0.93 0.60 0.67 0.73 0.47 0.55 0.80 0.55 0.74 0.76 0.91 0.65 1.00 0.69
LOF 0.54 0.75 0.47 0.60 0.51 0.64 0.56 0.92 0.49 0.72 0.54 0.53 0.54 0.48 0.55 0.53 0.87 0.76 0.74 0.55 0.93 0.59 0.48 0.37 0.69
IF 1.00 1.00 0.36 0.81 0.95 0.90 0.57 0.94 0.72 0.72 0.48 0.93 0.70 0.73 0.72 0.47 0.64 0.80 0.54 0.77 0.77 0.98 0.82 1.00 0.73
ECOD 0.96 0.99 0.42 0.75 0.91 0.88 0.44 0.90 0.60 0.72 0.37 0.94 0.75 0.70 0.77 0.49 0.57 0.73 0.53 0.65 0.59 0.98 0.79 0.98 0.64
ABOD 0.18 0.99 0.36 0.81 0.77 0.85 0.63 0.93 0.46 0.70 0.58 0.76 0.66 0.61 0.79 0.57 0.82 0.76 0.61 0.59 0.96 0.98 0.91 0.97 0.80
ensemble-LOF 0.63 0.92 0.45 0.63 0.53 0.70 0.55 0.94 0.51 0.72 0.55 0.59 0.58 0.48 0.46 0.55 0.87 0.88 0.75 0.59 0.94 0.71 0.50 0.14 0.70
sb-DeepSVDD 0.64 0.86 0.43 0.60 0.47 0.67 0.52 0.77 0.47 0.47 0.54 0.76 0.52 0.57 0.61 0.50 0.53 0.60 0.51 0.37 0.58 0.76 0.61 0.43 0.45
u-CBLOF 0.85 0.99 0.42 0.82 0.91 0.78 0.44 0.93 0.52 0.71 0.57 0.84 0.77 0.62 0.80 0.47 0.70 0.59 0.54 0.59 0.86 0.99 0.91 1.00 0.70
LUNAR 0.73 0.97 0.36 0.76 0.72 0.71 0.54 0.93 0.44 0.74 0.59 0.60 0.66 0.61 0.67 0.46 0.75 0.69 0.71 0.62 0.86 0.93 0.71 0.16 0.81
HBOS 1.00 0.99 0.36 0.35 0.93 0.91 0.49 0.96 0.87 0.68 0.58 0.80 0.76 0.75 0.78 0.46 0.60 0.91 0.50 0.59 0.66 0.97 0.68 0.97 0.71
VAE 0.80 0.97 0.38 0.84 0.92 0.88 0.54 0.87 0.47 0.66 0.51 0.87 0.63 0.77 0.72 0.47 0.63 0.77 0.55 0.64 0.70 0.92 0.67 1.00 0.67
COF 0.50 0.61 0.47 0.61 0.51 0.52 0.54 0.81 0.43 0.71 0.53 0.50 0.53 0.41 0.50 0.54 0.86 0.41 0.77 0.54 0.90 0.51 0.47 0.12 0.65
LMDD 0.97 1.00 0.36 0.75 0.94 0.89 0.49 0.72 0.59 0.56 0.45 0.71 0.42 0.63 0.73 0.48 0.52 0.86 0.50 0.60 0.63 0.99 0.91 1.00 0.63
ALAD 0.56 0.65 0.50 0.51 0.61 0.53 0.49 0.65 0.70 0.54 0.43 0.71 0.54 0.62 0.67 0.49 0.50 0.81 0.50 0.56 0.59 0.43 0.51 0.91 0.57
beta-VAE 0.86 0.99 0.37 0.85 0.94 0.90 0.49 0.93 0.51 0.65 0.39 0.95 0.60 0.77 0.73 0.47 0.52 0.81 0.55 0.74 0.62 0.96 0.67 1.00 0.67
EIF 1.00 1.00 0.35 0.81 0.95 0.89 0.59 0.95 0.66 0.73 0.50 0.93 0.72 0.73 0.76 0.47 0.65 0.85 0.53 0.78 0.81 0.99 0.90 0.99 0.72
Table 7: The AUC values for the second half of the algorithm-data set combinations.
Algorithms (columns, in the same order as the rows): PCA, LMDD, SO-GAAL, IF, LOF, EIF, ALAD, kNN, LUNAR, KDE, COF, MCD, GMM, u-CBLOF, INNE, ECOD, beta-VAE, AE, LODA, ODIN, HBOS, SOD, ensemble-LOF, DynamicHBOS, VAE, COPOD, OCSVM, DeepSVDD, gen2out, ABOD, CBLOF, kth-NN, sb-DeepSVDD
PCA 1.0 0.9 0.188 0.188 0.9 0.012 0.045 0.155 0.9 0.482 0.835 0.167 0.9 0.9 0.736 0.9 0.9 0.9 0.9 0.616 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.216 0.651 0.747 0.035 0.058 0.001
LMDD 0.9 1.0 0.581 0.031 0.9 0.001 0.245 0.024 0.9 0.125 0.9 0.026 0.9 0.658 0.326 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.888 0.9 0.613 0.9 0.616 0.245 0.338 0.204 0.007 0.015
SO-GAAL 0.188 0.581 1.0 0.001 0.9 0.001 0.9 0.001 0.032 0.001 0.9 0.001 0.048 0.001 0.001 0.006 0.024 0.007 0.877 0.9 0.001 0.238 0.553 0.001 0.074 0.001 0.001 0.9 0.001 0.001 0.9 0.001 0.9
IF 0.188 0.031 0.001 1.0 0.001 0.9 0.001 0.9 0.574 0.9 0.001 0.9 0.496 0.9 0.9 0.852 0.63 0.849 0.006 0.001 0.9 0.146 0.036 0.9 0.401 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
LOF 0.9 0.9 0.9 0.001 1.0 0.001 0.9 0.001 0.9 0.002 0.9 0.001 0.9 0.062 0.011 0.718 0.9 0.722 0.9 0.9 0.33 0.9 0.9 0.171 0.9 0.049 0.434 0.9 0.007 0.012 0.9 0.001 0.383
EIF 0.012 0.001 0.001 0.9 0.001 1.0 0.001 0.9 0.088 0.9 0.001 0.9 0.062 0.9 0.9 0.274 0.116 0.271 0.001 0.001 0.662 0.008 0.001 0.845 0.04 0.9 0.574 0.001 0.9 0.9 0.001 0.9 0.001
ALAD 0.045 0.245 0.9 0.001 0.9 0.001 1.0 0.001 0.005 0.001 0.9 0.001 0.008 0.001 0.001 0.001 0.004 0.001 0.56 0.9 0.001 0.062 0.221 0.001 0.014 0.001 0.001 0.9 0.001 0.001 0.9 0.001 0.9
kNN 0.155 0.024 0.001 0.9 0.001 0.9 0.001 1.0 0.525 0.9 0.001 0.9 0.442 0.9 0.9 0.803 0.581 0.799 0.004 0.001 0.9 0.119 0.028 0.9 0.342 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
LUNAR 0.9 0.9 0.032 0.574 0.9 0.088 0.005 0.525 1.0 0.863 0.451 0.542 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.221 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.039 0.9 0.9 0.004 0.284 0.001
KDE 0.482 0.125 0.001 0.9 0.002 0.9 0.001 0.9 0.863 1.0 0.001 0.9 0.785 0.9 0.9 0.9 0.9 0.9 0.03 0.001 0.9 0.409 0.141 0.9 0.701 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
COF 0.835 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.451 0.001 1.0 0.001 0.532 0.003 0.001 0.175 0.383 0.178 0.9 0.9 0.032 0.898 0.9 0.011 0.616 0.002 0.05 0.9 0.001 0.001 0.9 0.001 0.9
MCD 0.167 0.026 0.001 0.9 0.001 0.9 0.001 0.9 0.542 0.9 0.001 1.0 0.462 0.9 0.9 0.821 0.599 0.817 0.005 0.001 0.9 0.127 0.031 0.9 0.362 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
GMM 0.9 0.9 0.048 0.496 0.9 0.062 0.008 0.442 0.9 0.785 0.532 0.462 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.291 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.058 0.9 0.9 0.006 0.216 0.001
u-CBLOF 0.9 0.658 0.001 0.9 0.062 0.9 0.001 0.9 0.9 0.9 0.003 0.9 0.9 1.0 0.9 0.9 0.9 0.9 0.342 0.001 0.9 0.9 0.687 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
INNE 0.736 0.326 0.001 0.9 0.011 0.9 0.001 0.9 0.9 0.9 0.001 0.9 0.9 0.9 1.0 0.9 0.9 0.9 0.105 0.001 0.9 0.673 0.358 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
ECOD 0.9 0.9 0.006 0.852 0.718 0.274 0.001 0.803 0.9 0.9 0.175 0.821 0.9 0.9 0.9 1.0 0.9 0.9 0.9 0.068 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.008 0.9 0.9 0.001 0.584 0.001
beta-VAE 0.9 0.9 0.024 0.63 0.9 0.116 0.004 0.581 0.9 0.9 0.383 0.599 0.9 0.9 0.9 0.9 1.0 0.9 0.9 0.178 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.029 0.9 0.9 0.003 0.342 0.001
AE 0.9 0.9 0.007 0.849 0.722 0.271 0.001 0.799 0.9 0.9 0.178 0.817 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.069 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.008 0.9 0.9 0.001 0.581 0.001
LODA 0.9 0.9 0.877 0.006 0.9 0.001 0.56 0.004 0.9 0.03 0.9 0.005 0.9 0.342 0.105 0.9 0.9 0.9 1.0 0.9 0.775 0.9 0.9 0.591 0.9 0.295 0.863 0.9 0.071 0.11 0.51 0.001 0.074
ODIN 0.616 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.221 0.001 0.9 0.001 0.291 0.001 0.001 0.068 0.178 0.069 0.9 1.0 0.009 0.68 0.9 0.003 0.383 0.001 0.015 0.9 0.001 0.001 0.9 0.001 0.9
HBOS 0.9 0.9 0.001 0.9 0.33 0.662 0.001 0.9 0.9 0.9 0.032 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.775 0.009 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
SOD 0.9 0.9 0.238 0.146 0.9 0.008 0.062 0.119 0.9 0.409 0.898 0.127 0.9 0.9 0.673 0.9 0.9 0.9 0.9 0.68 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.271 0.588 0.683 0.048 0.042 0.002
ensemble-LOF 0.9 0.9 0.553 0.036 0.9 0.001 0.221 0.028 0.9 0.141 0.9 0.031 0.9 0.687 0.358 0.9 0.9 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.9 0.641 0.9 0.588 0.271 0.371 0.183 0.008 0.013
DynamicHBOS 0.9 0.888 0.001 0.9 0.171 0.845 0.001 0.9 0.9 0.9 0.011 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.591 0.003 0.9 0.9 0.9 1.0 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
VAE 0.9 0.9 0.074 0.401 0.9 0.04 0.014 0.342 0.9 0.701 0.616 0.362 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.383 0.9 0.9 0.9 0.9 1.0 0.9 0.9 0.087 0.87 0.9 0.01 0.155 0.001
COPOD 0.9 0.613 0.001 0.9 0.049 0.9 0.001 0.9 0.9 0.9 0.002 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.295 0.001 0.9 0.9 0.641 0.9 0.9 1.0 0.9 0.001 0.9 0.9 0.001 0.9 0.001
OCSVM 0.9 0.9 0.001 0.9 0.434 0.574 0.001 0.9 0.9 0.9 0.05 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.863 0.015 0.9 0.9 0.9 0.9 0.9 0.9 1.0 0.001 0.9 0.9 0.001 0.863 0.001
DeepSVDD 0.216 0.616 0.9 0.001 0.9 0.001 0.9 0.001 0.039 0.001 0.9 0.001 0.058 0.001 0.001 0.008 0.029 0.008 0.9 0.9 0.001 0.271 0.588 0.001 0.087 0.001 0.001 1.0 0.001 0.001 0.9 0.001 0.9
gen2out 0.651 0.245 0.001 0.9 0.007 0.9 0.001 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.071 0.001 0.9 0.588 0.271 0.9 0.87 0.9 0.9 0.001 1.0 0.9 0.001 0.9 0.001
ABOD 0.747 0.338 0.001 0.9 0.012 0.9 0.001 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.11 0.001 0.9 0.683 0.371 0.9 0.9 0.9 0.9 0.001 0.9 1.0 0.001 0.9 0.001
CBLOF 0.035 0.204 0.9 0.001 0.9 0.001 0.9 0.001 0.004 0.001 0.9 0.001 0.006 0.001 0.001 0.001 0.003 0.001 0.51 0.9 0.001 0.048 0.183 0.001 0.01 0.001 0.001 0.9 0.001 0.001 1.0 0.001 0.9
kth-NN 0.058 0.007 0.001 0.9 0.001 0.9 0.001 0.9 0.284 0.9 0.001 0.9 0.216 0.9 0.9 0.584 0.342 0.581 0.001 0.001 0.9 0.042 0.008 0.9 0.155 0.9 0.863 0.001 0.9 0.9 0.001 1.0 0.001
sb-DeepSVDD 0.001 0.015 0.9 0.001 0.383 0.001 0.9 0.001 0.001 0.001 0.9 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.074 0.9 0.001 0.002 0.013 0.001 0.001 0.001 0.001 0.9 0.001 0.001 0.9 0.001 1.0
Table 8: The p-values from Nemenyi post-hoc analysis on all algorithm pairs based on all data sets. P-values below 0.05 have been printed bold.
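For readers who want to regenerate p-value matrices of this kind, the sketch below follows the standard recipe from Demšar (2006): a Friedman omnibus test on the per-data-set AUCs, followed by the Nemenyi post-hoc test. It is a hedged illustration rather than the authors' code; the scikit-posthocs package and the toy AUC table are assumptions. The many 0.001 and 0.9 entries in Tables 8-10 are consistent with implementations that clip Nemenyi p-values to that range.

    # Sketch: Friedman test + Nemenyi post-hoc over a (data sets x algorithms) AUC table.
    import numpy as np
    import pandas as pd
    from scipy.stats import friedmanchisquare
    import scikit_posthocs as sp             # assumed helper package for the Nemenyi step

    rng = np.random.RandomState(0)
    auc = pd.DataFrame(rng.uniform(0.4, 1.0, size=(52, 4)),  # toy stand-in for the AUC tables
                       columns=["EIF", "kNN", "HBOS", "LOF"])

    # Omnibus test: do the algorithms' rank distributions differ across data sets?
    _, p = friedmanchisquare(*[auc[c] for c in auc.columns])
    print("Friedman p-value:", round(p, 3))

    # Pairwise Nemenyi p-values; each cell corresponds to one entry of Tables 8-10.
    print(sp.posthoc_nemenyi_friedman(auc).round(3))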
Table 9 (column order, identical to the row order below): PCA, LMDD, SO-GAAL, IF, LOF, EIF, ALAD, kNN, LUNAR, KDE, COF, MCD, GMM, u-CBLOF, INNE, ECOD, beta-VAE, AE, LODA, ODIN, HBOS, SOD, ensemble-LOF, DynamicHBOS, VAE, COPOD, OCSVM, DeepSVDD, gen2out, ABOD, CBLOF, kth-NN, sb-DeepSVDD
PCA 1.0 0.9 0.9 0.129 0.001 0.042 0.9 0.001 0.001 0.001 0.002 0.001 0.001 0.028 0.064 0.9 0.9 0.9 0.9 0.407 0.804 0.002 0.001 0.78 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
LMDD 0.9 1.0 0.9 0.436 0.009 0.193 0.9 0.001 0.009 0.003 0.019 0.004 0.004 0.144 0.263 0.9 0.9 0.9 0.9 0.755 0.9 0.017 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
SO-GAAL 0.9 0.9 1.0 0.068 0.001 0.019 0.9 0.001 0.001 0.001 0.001 0.001 0.001 0.012 0.03 0.9 0.9 0.9 0.9 0.252 0.659 0.001 0.001 0.635 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
IF 0.129 0.436 0.068 1.0 0.9 0.9 0.144 0.263 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.463 0.407 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.755 0.9 0.695 0.335
LOF 0.001 0.009 0.001 0.9 1.0 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.011 0.008 0.144 0.392 0.9 0.9 0.9 0.9 0.9 0.526 0.31 0.241 0.129 0.538 0.9 0.263 0.9 0.006
EIF 0.042 0.193 0.019 0.9 0.9 1.0 0.047 0.526 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.212 0.175 0.731 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.864 0.707 0.9 0.9 0.888 0.9 0.136
ALAD 0.9 0.9 0.9 0.144 0.001 0.047 1.0 0.001 0.001 0.001 0.003 0.001 0.001 0.032 0.072 0.9 0.9 0.9 0.9 0.436 0.828 0.002 0.001 0.804 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
kNN 0.001 0.001 0.001 0.263 0.9 0.526 0.001 1.0 0.9 0.9 0.9 0.9 0.9 0.598 0.436 0.001 0.001 0.001 0.001 0.072 0.009 0.9 0.9 0.01 0.001 0.001 0.001 0.001 0.001 0.9 0.001 0.9 0.001
LUNAR 0.001 0.009 0.001 0.9 0.9 0.9 0.001 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.01 0.008 0.136 0.378 0.9 0.9 0.9 0.9 0.9 0.514 0.298 0.231 0.125 0.526 0.9 0.252 0.9 0.005
KDE 0.001 0.003 0.001 0.9 0.9 0.9 0.001 0.9 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.003 0.003 0.064 0.212 0.9 0.755 0.9 0.9 0.78 0.322 0.159 0.118 0.057 0.335 0.9 0.129 0.9 0.002
COF 0.002 0.019 0.001 0.9 0.9 0.9 0.003 0.9 0.9 0.9 1.0 0.9 0.9 0.9 0.9 0.022 0.017 0.231 0.526 0.9 0.9 0.9 0.9 0.9 0.647 0.45 0.363 0.212 0.659 0.9 0.392 0.9 0.012
MCD 0.001 0.004 0.001 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 1.0 0.9 0.9 0.9 0.004 0.003 0.076 0.241 0.9 0.792 0.9 0.9 0.816 0.363 0.184 0.136 0.068 0.378 0.9 0.152 0.9 0.002
GMM 0.001 0.004 0.001 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.9 0.004 0.003 0.076 0.241 0.9 0.792 0.9 0.9 0.816 0.363 0.184 0.136 0.068 0.378 0.9 0.152 0.9 0.002
u-CBLOF 0.028 0.144 0.012 0.9 0.9 0.9 0.032 0.598 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.159 0.129 0.659 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.864 0.792 0.635 0.9 0.9 0.816 0.9 0.1
INNE 0.064 0.263 0.03 0.9 0.9 0.9 0.072 0.436 0.9 0.9 0.9 0.9 0.9 0.9 1.0 0.286 0.241 0.816 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.792 0.9 0.9 0.9 0.852 0.193
ECOD 0.9 0.9 0.9 0.463 0.011 0.212 0.9 0.001 0.01 0.003 0.022 0.004 0.004 0.159 0.286 1.0 0.9 0.9 0.9 0.78 0.9 0.019 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
beta-VAE 0.9 0.9 0.9 0.407 0.008 0.175 0.9 0.001 0.008 0.003 0.017 0.003 0.003 0.129 0.241 0.9 1.0 0.9 0.9 0.731 0.9 0.014 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
AE 0.9 0.9 0.9 0.9 0.144 0.731 0.9 0.001 0.136 0.064 0.231 0.076 0.076 0.659 0.816 0.9 0.9 1.0 0.9 0.9 0.9 0.212 0.032 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 0.9
LODA 0.9 0.9 0.9 0.9 0.392 0.9 0.9 0.001 0.378 0.212 0.526 0.241 0.241 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.9 0.502 0.125 0.9 0.9 0.9 0.9 0.9 0.9 0.006 0.9 0.004 0.9
ODIN 0.407 0.755 0.252 0.9 0.9 0.9 0.436 0.072 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.78 0.731 0.9 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.436 0.9 0.363 0.671
HBOS 0.804 0.9 0.659 0.9 0.9 0.9 0.828 0.009 0.9 0.755 0.9 0.792 0.792 0.9 0.9 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.622 0.9 0.9 0.9 0.9 0.9 0.9 0.106 0.9 0.081 0.9
SOD 0.002 0.017 0.001 0.9 0.9 0.9 0.002 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.019 0.014 0.212 0.502 0.9 0.9 1.0 0.9 0.9 0.622 0.421 0.335 0.193 0.635 0.9 0.363 0.9 0.01
ensemble-LOF 0.001 0.001 0.001 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.001 0.032 0.125 0.9 0.622 0.9 1.0 0.647 0.202 0.089 0.064 0.028 0.212 0.9 0.072 0.9 0.001
DynamicHBOS 0.78 0.9 0.635 0.9 0.9 0.9 0.804 0.01 0.9 0.78 0.9 0.816 0.816 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.647 1.0 0.9 0.9 0.9 0.9 0.9 0.118 0.9 0.089 0.9
VAE 0.9 0.9 0.9 0.9 0.526 0.9 0.9 0.001 0.514 0.322 0.647 0.363 0.363 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.622 0.202 0.9 1.0 0.9 0.9 0.9 0.9 0.012 0.9 0.009 0.9
COPOD 0.9 0.9 0.9 0.9 0.31 0.9 0.9 0.001 0.298 0.159 0.45 0.184 0.184 0.864 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.421 0.089 0.9 0.9 1.0 0.9 0.9 0.9 0.004 0.9 0.003 0.9
OCSVM 0.9 0.9 0.9 0.9 0.241 0.864 0.9 0.001 0.231 0.118 0.363 0.136 0.136 0.792 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.335 0.064 0.9 0.9 0.9 1.0 0.9 0.9 0.002 0.9 0.002 0.9
DeepSVDD 0.9 0.9 0.9 0.9 0.129 0.707 0.9 0.001 0.125 0.057 0.212 0.068 0.068 0.635 0.792 0.9 0.9 0.9 0.9 0.9 0.9 0.193 0.028 0.9 0.9 0.9 0.9 1.0 0.9 0.001 0.9 0.001 0.9
gen2out 0.9 0.9 0.9 0.9 0.538 0.9 0.9 0.001 0.526 0.335 0.659 0.378 0.378 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.635 0.212 0.9 0.9 0.9 0.9 0.9 1.0 0.013 0.9 0.009 0.9
ABOD 0.001 0.001 0.001 0.755 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.001 0.001 0.006 0.436 0.106 0.9 0.9 0.118 0.012 0.004 0.002 0.001 0.013 1.0 0.003 0.9 0.001
CBLOF 0.9 0.9 0.9 0.9 0.263 0.888 0.9 0.001 0.252 0.129 0.392 0.152 0.152 0.816 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.363 0.072 0.9 0.9 0.9 0.9 0.9 0.9 0.003 1.0 0.002 0.9
kth-NN 0.001 0.001 0.001 0.695 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.852 0.001 0.001 0.001 0.004 0.363 0.081 0.9 0.9 0.089 0.009 0.003 0.002 0.001 0.009 0.9 0.002 1.0 0.001
sb-DeepSVDD 0.9 0.9 0.9 0.335 0.006 0.136 0.9 0.001 0.005 0.002 0.012 0.002 0.002 0.1 0.193 0.9 0.9 0.9 0.9 0.671 0.9 0.01 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.001 1.0
Table 9: The p-values from Nemenyi post-hoc analysis on all algorithm pairs based on the 17 local data sets. P-values below 0.05 have been printed bold.
Table 10 (column order, identical to the row order below): PCA, LMDD, SO-GAAL, IF, LOF, EIF, ALAD, kNN, LUNAR, KDE, COF, MCD, GMM, u-CBLOF, INNE, ECOD, beta-VAE, AE, LODA, ODIN, HBOS, SOD, ensemble-LOF, DynamicHBOS, VAE, COPOD, OCSVM, DeepSVDD, gen2out, ABOD, CBLOF, kth-NN, sb-DeepSVDD
PCA 1.0 0.9 0.015 0.9 0.001 0.669 0.001 0.9 0.731 0.9 0.001 0.9 0.599 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.118 0.005 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
LMDD 0.9 1.0 0.23 0.634 0.002 0.126 0.026 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.651 0.105 0.9 0.9 0.638 0.9 0.009 0.484 0.9 0.001 0.9 0.001
SO-GAAL 0.015 0.23 1.0 0.001 0.9 0.001 0.9 0.054 0.9 0.004 0.9 0.001 0.9 0.03 0.001 0.001 0.001 0.002 0.9 0.9 0.005 0.9 0.9 0.001 0.216 0.001 0.001 0.9 0.001 0.335 0.9 0.002 0.9
IF 0.9 0.634 0.001 1.0 0.001 0.9 0.001 0.9 0.004 0.9 0.001 0.9 0.002 0.9 0.9 0.9 0.9 0.9 0.021 0.001 0.9 0.001 0.001 0.9 0.651 0.9 0.9 0.001 0.9 0.524 0.001 0.9 0.001
LOF 0.001 0.002 0.9 0.001 1.0 0.001 0.9 0.001 0.519 0.001 0.9 0.001 0.651 0.001 0.001 0.001 0.001 0.001 0.223 0.9 0.001 0.9 0.9 0.001 0.002 0.001 0.001 0.9 0.001 0.004 0.9 0.001 0.9
EIF 0.669 0.126 0.001 0.9 0.001 1.0 0.001 0.424 0.001 0.898 0.001 0.9 0.001 0.546 0.9 0.9 0.9 0.9 0.001 0.001 0.85 0.001 0.001 0.9 0.136 0.9 0.9 0.001 0.9 0.078 0.001 0.9 0.001
ALAD 0.001 0.026 0.9 0.001 0.9 0.001 1.0 0.003 0.9 0.001 0.9 0.001 0.9 0.002 0.001 0.001 0.001 0.001 0.669 0.9 0.001 0.9 0.9 0.001 0.023 0.001 0.001 0.9 0.001 0.046 0.9 0.001 0.9
kNN 0.9 0.9 0.054 0.9 0.001 0.424 0.003 1.0 0.9 0.9 0.001 0.9 0.836 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.295 0.019 0.9 0.9 0.9 0.9 0.001 0.819 0.9 0.001 0.9 0.001
LUNAR 0.731 0.9 0.9 0.004 0.519 0.001 0.9 0.9 1.0 0.502 0.056 0.198 0.9 0.854 0.312 0.138 0.286 0.424 0.9 0.36 0.55 0.9 0.9 0.219 0.9 0.004 0.195 0.775 0.001 0.9 0.147 0.371 0.126
KDE 0.9 0.9 0.004 0.9 0.001 0.898 0.001 0.9 0.502 1.0 0.001 0.9 0.35 0.9 0.9 0.9 0.9 0.9 0.784 0.001 0.9 0.039 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
COF 0.001 0.001 0.9 0.001 0.9 0.001 0.9 0.001 0.056 0.001 1.0 0.001 0.105 0.001 0.001 0.001 0.001 0.001 0.012 0.9 0.001 0.572 0.9 0.001 0.001 0.001 0.001 0.9 0.001 0.001 0.9 0.001 0.9
MCD 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.9 0.198 0.9 0.001 1.0 0.116 0.9 0.9 0.9 0.9 0.9 0.488 0.001 0.9 0.007 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
GMM 0.599 0.9 0.9 0.002 0.651 0.001 0.9 0.836 0.9 0.35 0.105 0.116 1.0 0.722 0.195 0.078 0.175 0.277 0.9 0.51 0.408 0.9 0.9 0.128 0.9 0.002 0.114 0.9 0.001 0.9 0.245 0.237 0.216
u-CBLOF 0.9 0.9 0.03 0.9 0.001 0.546 0.002 0.9 0.854 0.9 0.001 0.9 0.722 1.0 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.195 0.01 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
INNE 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.9 0.312 0.9 0.001 0.9 0.195 0.9 1.0 0.9 0.9 0.9 0.616 0.001 0.9 0.016 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
ECOD 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.9 0.138 0.9 0.001 0.9 0.078 0.9 0.9 1.0 0.9 0.9 0.387 0.001 0.9 0.004 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
beta-VAE 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.9 0.286 0.9 0.001 0.9 0.175 0.9 0.9 0.9 1.0 0.9 0.59 0.001 0.9 0.013 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
AE 0.9 0.9 0.002 0.9 0.001 0.9 0.001 0.9 0.424 0.9 0.001 0.9 0.277 0.9 0.9 0.9 0.9 1.0 0.713 0.001 0.9 0.027 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
LODA 0.9 0.9 0.9 0.021 0.223 0.001 0.669 0.9 0.9 0.784 0.012 0.488 0.9 0.9 0.616 0.387 0.59 0.713 1.0 0.126 0.832 0.9 0.9 0.515 0.9 0.022 0.484 0.493 0.009 0.9 0.039 0.669 0.033
ODIN 0.001 0.001 0.9 0.001 0.9 0.001 0.9 0.001 0.36 0.001 0.9 0.001 0.51 0.001 0.001 0.001 0.001 0.001 0.126 1.0 0.001 0.9 0.9 0.001 0.001 0.001 0.001 0.9 0.001 0.001 0.9 0.001 0.9
HBOS 0.9 0.9 0.005 0.9 0.001 0.85 0.001 0.9 0.55 0.9 0.001 0.9 0.408 0.9 0.9 0.9 0.9 0.9 0.832 0.001 1.0 0.05 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
SOD 0.118 0.651 0.9 0.001 0.9 0.001 0.9 0.295 0.9 0.039 0.572 0.007 0.9 0.195 0.016 0.004 0.013 0.027 0.9 0.9 0.05 1.0 0.9 0.009 0.634 0.001 0.007 0.9 0.001 0.762 0.784 0.021 0.748
ensemble-LOF 0.005 0.105 0.9 0.001 0.9 0.001 0.9 0.019 0.9 0.001 0.9 0.001 0.9 0.01 0.001 0.001 0.001 0.001 0.9 0.9 0.001 0.9 1.0 0.001 0.096 0.001 0.001 0.9 0.001 0.167 0.9 0.001 0.9
DynamicHBOS 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.9 0.219 0.9 0.001 0.9 0.128 0.9 0.9 0.9 0.9 0.9 0.515 0.001 0.9 0.009 0.001 1.0 0.9 0.9 0.9 0.001 0.9 0.9 0.001 0.9 0.001
VAE 0.9 0.9 0.216 0.651 0.002 0.136 0.023 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.634 0.096 0.9 1.0 0.656 0.9 0.008 0.502 0.9 0.001 0.9 0.001
COPOD 0.9 0.638 0.001 0.9 0.001 0.9 0.001 0.9 0.004 0.9 0.001 0.9 0.002 0.9 0.9 0.9 0.9 0.9 0.022 0.001 0.9 0.001 0.001 0.9 0.656 1.0 0.9 0.001 0.9 0.528 0.001 0.9 0.001
OCSVM 0.9 0.9 0.001 0.9 0.001 0.9 0.001 0.9 0.195 0.9 0.001 0.9 0.114 0.9 0.9 0.9 0.9 0.9 0.484 0.001 0.9 0.007 0.001 0.9 0.9 0.9 1.0 0.001 0.9 0.9 0.001 0.9 0.001
DeepSVDD 0.001 0.009 0.9 0.001 0.9 0.001 0.9 0.001 0.775 0.001 0.9 0.001 0.9 0.001 0.001 0.001 0.001 0.001 0.493 0.9 0.001 0.9 0.9 0.001 0.008 0.001 0.001 1.0 0.001 0.018 0.9 0.001 0.9
gen2out 0.9 0.484 0.001 0.9 0.001 0.9 0.001 0.819 0.001 0.9 0.001 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.009 0.001 0.9 0.001 0.001 0.9 0.502 0.9 0.9 0.001 1.0 0.355 0.001 0.9 0.001
ABOD 0.9 0.9 0.335 0.524 0.004 0.078 0.046 0.9 0.9 0.9 0.001 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.001 0.9 0.762 0.167 0.9 0.9 0.528 0.9 0.018 0.355 1.0 0.001 0.9 0.001
CBLOF 0.001 0.001 0.9 0.001 0.9 0.001 0.9 0.001 0.147 0.001 0.9 0.001 0.245 0.001 0.001 0.001 0.001 0.001 0.039 0.9 0.001 0.784 0.9 0.001 0.001 0.001 0.001 0.9 0.001 0.001 1.0 0.001 0.9
kth-NN 0.9 0.9 0.002 0.9 0.001 0.9 0.001 0.9 0.371 0.9 0.001 0.9 0.237 0.9 0.9 0.9 0.9 0.9 0.669 0.001 0.9 0.021 0.001 0.9 0.9 0.9 0.9 0.001 0.9 0.9 0.001 1.0 0.001
sb-DeepSVDD 0.001 0.001 0.9 0.001 0.9 0.001 0.9 0.001 0.126 0.001 0.9 0.001 0.216 0.001 0.001 0.001 0.001 0.001 0.033 0.9 0.001 0.748 0.9 0.001 0.001 0.001 0.001 0.9 0.001 0.001 0.9 0.001 1.0
Table 10: The p-values from Nemenyi post-hoc analysis on all algorithm pairs based on the 32 global data sets. P-values below 0.05 have been printed bold.
References
D. Agarwal. Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl-
edge and Information Systems, 11(1):29–44, 2007.
A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large
databases. Knowledge Discovery and Data Mining, 1141(50):972–981, 1996.
Z. Bar-Joseph, D. K. Gifford, and T. S. Jaakkola. Fast optimal leaf ordering for hierarchical
clustering. Bioinformatics, 17(suppl 1):S22–S29, 2001.
L. Birgé and Y. Rozenholc. How many bins should be put in a regular histogram. ESAIM:
Probability and Statistics, 10:24–45, 2006.
A. Brandsæter, E. Vanem, and I. K. Glad. Efficient on-line anomaly detection for ship
systems in operation. Expert Systems with Applications, 121:418–437, 2019.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of
Machine Learning Research, 7:1–30, 2006.
D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
A. Goodge, B. Hooi, S.-K. Ng, and W. S. Ng. LUNAR: Unifying local outlier detection
methods via graph neural networks. In AAAI Conference on Artificial Intelligence, vol-
ume 36, pages 6737–6745, 2022.
S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao. ADBench: Anomaly detection benchmark.
In Neural Information Processing Systems, 2022.
R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends
with one-class collaborative filtering. In International Conference on World Wide Web,
pages 507–517, 2016.
Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition
Letters, 24(9-10):1641–1650, 2003.
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech-
nical report, University of Toronto, 2009.
L. J. Latecki, A. Lazarevic, and D. Pokrajac. Outlier detection with kernel density functions.
In International Workshop on Machine Learning and Data Mining in Pattern Recognition,
pages 61–75. Springer, 2007.
Z. Li, Y. Zhao, N. Botta, C. Ionescu, and X. Hu. COPOD: copula-based outlier detection. In
2020 IEEE International Conference on Data Mining (ICDM), pages 1118–1123. IEEE,
2020.
Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. H. Chen. ECOD: Unsupervised outlier
detection using empirical cumulative distribution functions, 2022.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In IEEE International Conference
on Data Mining, pages 413–422. IEEE, 2008.
Y. Liu, Z. Li, C. Zhou, Y. Jiang, J. Sun, M. Wang, and X. He. Generative adversarial active
learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data
Engineering, 32(8):1517–1528, 2019.
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large
data sets. In International Conference on Management of Data, pages 427–438, 2000.
P. J. Rousseeuw and K. V. Driessen. A fast algorithm for the minimum covariance deter-
minant estimator. Technometrics, 41(3):212–223, 1999.
E. Schubert and A. Zimek. ELKI: A large open-source library for data analysis. CoRR,
abs/1902.03616, 2019. URL https://arxiv.org/abs/1902.03616.
S. Y. Shin and H.-j. Kim. Extended autoencoder for novelty detection with reconstruction
along projection pathway. Applied Sciences, 10(13):4497, 2020.
M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. A novel anomaly detection
scheme based on principal component classifier. Technical report, University of Miami,
Department of Electrical and Computer Engineering, 2003.
X. Xu, H. Liu, L. Li, and M. Yao. A comparison of outlier detection techniques for high-
dimensional data. International Journal of Computational Intelligence Systems, 11(1):
652–662, 2018.
Y. Zhao, Z. Nasrullah, and Z. Li. PyOD: A Python toolbox for scalable outlier detection,
2019.
L. Zhou, W. Deng, and X. Wu. Unsupervised anomaly localization using VAE and beta-
VAE. arXiv preprint arXiv:2005.10686, 2020.