Outlier Detection With The Use of Isolation Forests
boundaries differ significantly. The cluster parameters, with and without point B,
would be completely different.
For the above reasons, outliers should be detected and removed from the dataset
at the stage of data preparation for analysis. A number of methods can be used for
this purpose, including Isolation Forests. The purpose of the research presented is to
assess the usefulness of this method in outlier detection, in particular in cluster
analysis. Properties of a basic algorithm and its extensions will be analyzed. The
results of simulation and empirical research on selected datasets will be presented.
The algorithm evaluation will take into account the impact of particular features of
big datasets on the effectiveness of the methods analyzed.
Isolation Forests (iForest) (Liu et al. 2008) are a method of outlier detection proposed in 2008 by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. The algorithm aims to isolate outliers by random partitioning of the dataset, using multiple isolation trees. The mechanism stems from the assumptions made about anomalies, i.e., their small share in the dataset and their considerable distance from other, typical observations. In Isolation Forests, the step of building a given population's profile is omitted, owing to which the impact of the anomaly's distance from the rest of the data is not biased by inaccurate mapping of the dataset structure. Given the risk of examining inconsistent datasets, this makes the method more universal.
To better understand how the iForest algorithm works, the Isolation Tree
algorithm needs to be explained. As the name suggests, its structure is closely related to Classification Trees. The main difference is that Isolation Trees are unsupervised algorithms; thus, they can be used for outlier detection.
Unlike Classification Trees, where the dataset is partitioned into nodes, based on a
selected measure that determines the quality of the partition, Isolation Trees divide
the dataset according to a random variable and its random value. This change
allowed significant acceleration of the algorithm, since selection of the best partition
requires many potential variants to be checked. The degree of observation isolation
depends on the length of the path between the tree root and the leaf containing the
observation. This is equivalent to the number of the dataset partitions needed to
extract a given value. Typical observations require more nodes for their isolation,
because there are more units within their immediate neighborhood. The probability that an outlier will be separated earlier under a random data partition is higher (Fig. 2).
Definition 1 Let X be the set of observations, and y the variables in the set. The Isolation Tree is a type of tree in which a node T is either a leaf or a node with one partition and exactly two descendants $T_i$, $T_j$, such that for a random variable $y_k \in y$ and a random value $p \in (\min(y_k : y_k \in T), \max(y_k : y_k \in T))$, the elements of the set are divided into:

$T_i : y_k < p; \quad T_j : y_k \ge p. \qquad (1)$
Fig. 2 Isolation of a typical and an outlying observation in Isolation Trees. Source Own
elaboration
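To make the partitioning scheme of Definition 1 concrete, the following minimal sketch builds an Isolation Tree in Python. The helper names (grow_isolation_tree, path_length) and the NumPy-based implementation are illustrative assumptions rather than the authors' code; a full implementation would additionally correct the path length at leaves containing more than one observation (see Eq. (2) below).

```python
import numpy as np

def grow_isolation_tree(X, depth=0, max_depth=10):
    """Recursively build an Isolation Tree as in Definition 1."""
    n, d = X.shape
    if n <= 1 or depth >= max_depth:
        # Leaf; a full implementation would also store c(size) here.
        return {"type": "leaf", "size": n}

    k = np.random.randint(d)                      # random variable y_k
    lo, hi = X[:, k].min(), X[:, k].max()
    if lo == hi:                                  # cannot split on this variable
        return {"type": "leaf", "size": n}
    p = np.random.uniform(lo, hi)                 # random split value p

    left = X[X[:, k] < p]                         # T_i : y_k < p
    right = X[X[:, k] >= p]                       # T_j : y_k >= p
    return {
        "type": "node", "var": k, "split": p,
        "left": grow_isolation_tree(left, depth + 1, max_depth),
        "right": grow_isolation_tree(right, depth + 1, max_depth),
    }

def path_length(x, tree, depth=0):
    """Number of partitions needed to isolate observation x."""
    if tree["type"] == "leaf":
        return depth
    branch = "left" if x[tree["var"]] < tree["split"] else "right"
    return path_length(x, tree[branch], depth + 1)
```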
Using the binary search tree (BST) theory, it is possible to estimate the average
height of the Isolation Tree in a set of n elements. This value is the same as the
average length of a failed BST search, that is:
$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad (2)$

where $H(i) \approx \ln(i) + \gamma$ (with $\gamma \approx 0.5772$, the Euler-Mascheroni constant) is the harmonic number. The value is used to calculate the
anomaly score in the iForest algorithm.
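A direct translation of Eq. (2) is short; the sketch below is an illustrative assumption, relying on NumPy and its built-in Euler-Mascheroni constant.

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search among n
    observations (Eq. 2); used to normalise path lengths in iForest."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + np.euler_gamma   # H(i) ~ ln(i) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n
```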
Similarly to random forests, the iForest algorithm is more effective when individual trees have low isolation capabilities, provided that a large number of trees is used for the prediction. For this reason, to train an Isolation Tree, not all the
observations from the set are needed; it is even advisable to sample a small number
of observations. In this way, the tree can work on separating the unusually atypical
values rather than unnecessarily creating a large number of partitions among the
values that are consistent with the population profile.
Definition 2 Let X be the dataset. The Isolation Forest is a statistical model consisting of t single, independent Isolation Trees $\{h(\psi, \Theta_k), k = 1, \ldots, t\}$, where $\psi$ is a subset of X drawn without replacement and the $\Theta_k$ are i.i.d. random vectors.
Using the algorithm, the anomaly score for an observation x is predicted with the following equation:

$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad (3)$

where E(h(x)) is the average path length for observation x in all the Isolation Trees from the iForest. Dividing this value by c(n), a normalized path length is obtained, which can be interpreted universally, regardless of the set of observations.
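Combining Definition 2 with Eq. (3), a simplified scoring routine can be sketched as follows. It reuses the helpers from the previous sketches, the function names are assumptions, and it omits the leaf-size correction applied in the full algorithm.

```python
import numpy as np

def train_iforest(X, t=100, psi=256):
    """Train t Isolation Trees, each on a subsample of psi observations
    drawn without replacement from X (Definition 2)."""
    trees = []
    for _ in range(t):
        idx = np.random.choice(len(X), size=min(psi, len(X)), replace=False)
        trees.append(grow_isolation_tree(X[idx],
                                         max_depth=int(np.ceil(np.log2(psi)))))
    return trees

def anomaly_score(x, trees, psi=256):
    """Eq. (3): s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    mean_path = np.mean([path_length(x, tree) for tree in trees])
    return 2.0 ** (-mean_path / c(psi))
```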
In the Extended Isolation Forest (EIF), the partition in each node is determined by a random normal vector $\vec{n}$ and a random intercept point $\vec{p}$, and the observations $\vec{x}$ in the node are divided according to the condition:

$(\vec{x} - \vec{p}) \cdot \vec{n} \le 0. \qquad (4)$
A partition that takes more than one dimension into account has a positive impact on the isolation capabilities of the algorithm, owing to which the map of anomaly score values better reflects the dataset. At the same time, it does not increase the computational time needed to train the Isolation Tree. It should be added that in the EIF, the number of the variables taken into account when determining the normal vector can be established, whereas for the variables that are not included, the corresponding components of vector $\vec{n}$ assume the value of 0. In the following part of the article, the algorithm's behavior will be tested with all the variables taken into account when determining the partition of the dataset (Fig. 3).
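A single EIF partition from Eq. (4) can be sketched as follows; the function name and the use of a standard normal distribution for $\vec{n}$ are assumptions made for illustration.

```python
import numpy as np

def eif_split(X):
    """One Extended Isolation Forest partition (Eq. 4): a random normal
    vector n and a random intercept point p define a hyperplane; the
    condition (x - p) . n <= 0 sends an observation to one branch."""
    d = X.shape[1]
    n_vec = np.random.normal(size=d)              # random normal vector
    # Components of n_vec could be zeroed out for excluded variables.
    p_vec = np.array([np.random.uniform(X[:, j].min(), X[:, j].max())
                      for j in range(d)])         # random intercept point
    mask = (X - p_vec) @ n_vec <= 0
    return X[mask], X[~mask]
```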
Fig. 3 Map of iForest and EIF anomaly score. Source Own elaboration
Although only a few parameters are set a priori by the researcher, their impact on the iForest and the EIF is very significant. The
number of the trees used in the iForest and in the Extended Isolation Forest
algorithms affects the method’s effectiveness as well as the computational time
needed to train the model and to predict new observations. As in other ensemble
learning algorithms (Probst and Boulesteix 2018), a greater number of trees reduces
the deviation of the anomaly score values for both typical observations and outliers.
The model is better fitted for the input data; therefore, small changes in the values of
observations do not drastically affect the value of the anomaly score. At the same time, each additional tree increases the computational complexity of the algorithm linearly, so the number of trees should be set at the value for which the anomaly score converges.
The empirical research conducted on a dataset of 100,000 observations from a normal distribution and one thousand outliers shows that the standard deviation of the anomaly score stabilizes at as few as 100 trees (Fig. 4).
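An experiment of this kind can be reproduced approximately with scikit-learn's IsolationForest, used here as a stand-in implementation; note that its score_samples method returns the negative of the score defined in Eq. (3), and the data below only mimic the setting described (100,000 typical observations and 1,000 separable outliers).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100_000, 2)),    # typical observations
               rng.normal(8, 1, size=(1_000, 2))])     # separable outliers

for n_trees in (10, 50, 100, 300):
    model = IsolationForest(n_estimators=n_trees, max_samples=256,
                            random_state=0).fit(X)
    scores = -model.score_samples(X)   # negate to match the paper's s(x, n)
    print(n_trees, scores.std())       # deviation stabilises around 100 trees
```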
Another parameter, the value of which is determined by the researcher, is the
sample size. The number of the observations used to train individual Isolation
Trees, similarly to the number of trees, affects the fit of the model to the input
dataset. The difference between the two parameters is that an increase in the sample size carries the risk of tree overfitting. When sampling a large number of
observations, the probability of anomaly inclusion in the process of training is
higher, due to which some of these values can be treated as typical. Concurrently,
the mean path length for typical observations is extended, which heightens the
differences in the values of the anomaly index for inlying and outlying observations
(Fig. 5).
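The effect of the sample size can be probed analogously by varying scikit-learn's max_samples parameter (a stand-in for ψ); the data generation below is again only a rough imitation of the experiment described.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20_000, 2)),     # typical observations
               rng.normal(8, 1, size=(200, 2))])       # outliers (last 200 rows)

for psi in (64, 256, 1024, 4096):
    scores = -IsolationForest(n_estimators=100, max_samples=psi,
                              random_state=0).fit(X).score_samples(X)
    print(psi, scores[:-200].mean(), scores[-200:].mean())
```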
Fig. 4 Impact of the number of trees on the anomaly score deviation. Source Own elaboration
Fig. 5 Impact of the sample size on the anomaly score values. Source Own elaboration
Parameters of the method are not the only factors that significantly influence the
anomaly score. One important element that may affect the method’s effectiveness is
the dimensionality of the dataset. More variables mean a greater number of
potential dataset partitions in individual isolation trees, which means that the values
of the anomaly index provide poorer reflection of the phenomenon under exami-
nation (Fig. 6).
Fig. 6 Impact of the dataset dimensionality on the anomaly score value. Source Own elaboration
The Isolation Forest and the Extended Isolation Forest were trained on datasets
of various dimensions, where typical observations assume, for each variable, the
values from the N(0, 1) distribution, whereas 1% of the observations are outliers clearly separable from them. As the number of the features describing the data increases, the values
of the anomaly index are more dispersed—small distances between the observa-
tions result in a smaller difference in the values of the anomaly index.
Concomitantly, the values of the anomaly index for outlying observations become
lower. The impact of the data’s high dimensionality on the distribution of the s(x, n)
values can be effectively compensated for by the use of a greater number of indi-
vidual Isolation Trees in the process of the model training.
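The dimensionality experiment can be approximated with the following sketch, which again relies on scikit-learn's IsolationForest as a substitute implementation and on synthetic data mimicking the design described above (N(0, 1) inliers and 1% separable outliers).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    inliers = rng.normal(0, 1, size=(9_900, d))
    outliers = rng.normal(6, 1, size=(100, d))      # 1% separable outliers
    X = np.vstack([inliers, outliers])
    scores = -IsolationForest(n_estimators=300,
                              random_state=0).fit(X).score_samples(X)
    # The gap between the two means tends to shrink as d grows.
    print(d, scores[:-100].mean(), scores[-100:].mean())
```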
So far, the research has involved datasets with a single cluster. The method for calculating the anomaly score in Isolation Forests and Extended Isolation Forests
indicates the global nature of the algorithms. Often, in empirical data, researchers
encounter situations when various groups of observations with completely different
distributions of the variables stand out in a given dataset. In such a case, the
algorithm’s ability to detect local anomalies in the dataset is significant.
Simulation studies show that the effectiveness of Isolation Forests in detecting
local anomalies is most affected by the disproportions in the number and the dis-
persion of the observations for individual groups. The assumption regarding the
anomalies in the algorithms described entails their rare occurrence and large dis-
tance from the other values. Observations from smaller clusters have a lower chance
of being included in the sample, which makes the path length from the tree root to
the observation significantly shorter. Observation dispersion also significantly
affects the anomaly score values. The random value of partition p in each node is
selected from between the minimum and the maximum value of the variable in the
observations drawn. Suppose, for instance, that 100 observations assume values within the interval [0, 1] and another 100 within [2, 50]. In the best-case scenario, the first
partition will separate the objects in disparate groups from each other, owing to
which the dispersion effect will be reduced. If the value of p is within the range
[2, 50], however, the objects in the group with the greater standard deviation will be
treated as atypical in the tree branch encompassing the values of the variable that
are less than p. The values from the dispersed group will be underrepresented in this
part of the tree. At the same time, in its remaining part, the partition will take place
without a compact group, owing to which the observation dispersion will not affect
the path length from the root to the observation. Summing up, both phenomena
significantly impact the method’s effectiveness. In parallel, the disproportion in the
number of observations in a group has greater impact on the increase in the
anomaly score (Fig. 7).
Fig. 7 Impact of the group structures on the anomaly score value. Source Own elaboration
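A rough illustration of the group-structure effect, using scikit-learn's IsolationForest on two synthetic clusters of very different size and dispersion (an assumed stand-in for the simulation described):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
big = rng.normal(0, 1, size=(10_000, 2))      # large, compact cluster
small = rng.normal(10, 5, size=(100, 2))      # small, dispersed cluster
X = np.vstack([big, small])

scores = -IsolationForest(n_estimators=100, max_samples=256,
                          random_state=0).fit(X).score_samples(X)
print("large cluster:", scores[:10_000].mean())
print("small cluster:", scores[10_000:].mean())   # noticeably higher scores
```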
The last feature characterizing the datasets constituting the object of simulation
research is the share of noise in the data. Data noise occurs when some observations
do not form any group structures, and their distribution approximates randomness.
Simultaneously, the spatial distribution of these values does not differ significantly
from typical units. If the share of such observations constitutes a significant part of
the dataset examined, they cannot be referred to as outliers. Data noise should be
considered in two contexts when detecting outliers. First of all, a good method of
anomaly detection should be able to distinguish anomalies from typical values and
noise. Due to the conventional boundary between noise and outliers, it is difficult to
determine exactly how the method would distinguish between the two phenomena.
From the perspective of data analysis, it seems more important to distinguish typical
observations from noise, e.g., by assigning higher values to the anomaly score.
Another aspect to be kept in mind is the impact of the noise itself on the effec-
tiveness of the method used. In terms of anomaly detection, it manifests itself by
inclusion of typical observations as outliers or outliers as typical (Fig. 8).
Fig. 8 Impact of the share of data noise on the anomaly score values. Source Own elaboration
An Extended Isolation Forest with parameters t = 100 and ψ = 256 was trained
for three datasets characterized by different spatial structures. Regardless of the
percentage share of noise in the data, the value of the anomaly score is noticeably
lower near the typical observations. At a thirty percent share of noise in the data, the
algorithm is able to distinguish between typical values, anomalies, and noise; as the distance from the main cluster increases, the coefficient's values increase significantly. This allows the researcher to make a decision regarding
the cut-off point between the anomalies, the noise, and the typical observations. Due
to the fact that a properly trained Isolation Forest does not require large samples in
individual trees, data noise does not have any critical impact on the effectiveness of
the iForest or the EIF algorithms. Interestingly, any share of noise in the dataset causes the anomaly score values for typical observations to be lower.
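The noise experiment can be imitated along the following lines; the parameter values (t = 100, ψ = 256, about a 30% share of uniform noise) follow the description above, while the concrete data and the scikit-learn implementation are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
cluster = rng.normal(0, 1, size=(7_000, 2))       # typical observations
noise = rng.uniform(-6, 6, size=(3_000, 2))       # ~30% unstructured noise
outliers = np.array([[15.0, 15.0], [-14.0, 13.0]])
X = np.vstack([cluster, noise, outliers])

scores = -IsolationForest(n_estimators=100, max_samples=256,
                          random_state=0).fit(X).score_samples(X)
print("typical: ", scores[:7_000].mean())
print("noise:   ", scores[7_000:10_000].mean())
print("outliers:", scores[10_000:].mean())
```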
The simulation studies show the algorithms’ strong outlier detection abilities. The
next stage of the research entails the testing of the Isolation Forests and the
Extended Isolation Forests on the datasets described in more detail in the literature.
The local outlier factor (LOF) method, with a neighborhood parameter k = 10, was
used as a comparative algorithm. The datasets examined differ in the number of
observations, the dimensions, and the share of anomalous values (Table 1).
The dataset “Wine Quality” (Cortez 2009) consists of a description of red and
white wines, the quality of which has been rated on a scale of 1–10. Observations,
the quality of which was extremely low (Quality = 3), were marked as outliers. In
the dataset “Lymphography” (Zwitter and Soklic), most of the observations were in
the “metastases” and “malignant lymph” classes, while the “normal find” and the
“fibrosis” ones were marked as outliers. The dataset “Musk” (Dua and Graff 2019a)
describes various molecular structures that have been assessed by experts in a given
field. In outlier detection, the dataset is limited to muskless classes (j146, j147 and
252), marked as typical observations, and musk classes (213 and 211), added as
anomalies. For the dataset “Satellite” (The Center for remote sensing), the three
least numerous classes were selected as anomalies, which in total constitute 32% of
the dataset. In this study, only the single least numerous class was adopted as atypical, consistently with the definition of outliers, which should constitute a small part of the observations. “SMTP3” (Dua and Graff 2019b) is a subset of the
KDD CUP 99 dataset, in which the “attacks” class is treated as anomalies. In the
sets “Shuttle” (Dua and Graff 2019c) and “Mammography” (https://www.openml.org/d/310), the observations assigned to the least numerous classes were marked as
outliers (Table 2).
The values of the area under the receiver operating characteristic (ROC) curve
indicate the effectiveness of the iForest and the EIF algorithms, with regard to
anomaly detection. In most cases, the area under the ROC curve (AUC) values are
significantly higher, compared to the local outlier factor (LOF) algorithm. The
results of both tree-based algorithms are comparable. The reason for the signifi-
cantly lower AUC values, in the case of the LOF method, may be the global nature
of the outliers considered, which favors the isolation methods. In some of the
datasets, observations from the least numerous classes were marked as atypical. The
simulation studies showed that the number of observations in a group is the decisive
factor, in the context of the anomaly score value.
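The evaluation procedure itself, comparing the area under the ROC curve for the Isolation Forest and for LOF with k = 10, can be sketched as follows; a synthetic labelled dataset is used here as a stand-in for the benchmark datasets of Table 1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: y = 1 marks outliers, y = 0 typical observations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(5_000, 6)),
               rng.normal(5, 1, size=(50, 6))])
y = np.r_[np.zeros(5_000), np.ones(50)]

if_scores = -IsolationForest(n_estimators=100,
                             random_state=0).fit(X).score_samples(X)
lof_scores = -LocalOutlierFactor(n_neighbors=10).fit(X).negative_outlier_factor_

print("iForest AUC:", roc_auc_score(y, if_scores))
print("LOF AUC:    ", roc_auc_score(y, lof_scores))
```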
6 Final Conclusions
Summing up, the simulation and empirical studies have confirmed the outlier
detection capability of the iForest and the EIF algorithms. The main advantages of
the methods analyzed can be recapped in the following points:
• Linear computational complexity, which makes the method suitable for the
analysis of big datasets. The duration of the forest training stage depends on the
sample size and the number of trees; thus, the dataset dimensionality and the
total number of observations do not play any role here. Additionally, it is worth remembering that the algorithms converge at a small number of trees and a small sample size.
• Resistance of the algorithms to data noise. The simulation studies have shown
that the anomaly score for outliers and for the noise differ significantly, owing to
which the researcher can independently decide about the cut-off point between
them. At the same time, at a higher share of noise in the data, the anomaly score
values for typical observations are lower.
• Ability to analyze multidimensional datasets. A larger number of variables does
not negatively affect the computational time or the algorithm’s ability to detect
anomalies. Even when an observation is considered an outlier only because of
one of the variables, the algorithm, with an appropriate number of trees, is able
to successfully detect such a unit.
• Effectiveness of the algorithms is independent of the spatial structure of the data.
• The small number of the parameters to be determined by the researcher, which
facilitates the adaptation of the method to the dataset examined.
The limitations of the algorithms are less obvious and more debatable. Although in certain situations they may pose a difficulty in outlier analysis, in other cases they may prove to be helpful. The features that raise most doubts pertain to:
• Lower effectiveness on datasets with significant differences in the number of observa-
tions in each group. The small sample size used in the algorithm means that the
values of the smallest clusters are used less frequently to train trees, which
References