
Outlier Detection with the Use of Isolation Forests

Krzysztof Najman and Krystian Zieliński

Abstract Appropriate preparation of data for analysis is a key element in empirical research. Considering the source of the data or the nature of the phenomenon studied, some observations may differ significantly from others. Inclusion of such cases in research may seriously distort the profile of the population under examination. Nevertheless, their omission can be equally disadvantageous. When analyzing dynamically changing phenomena, especially in the case of big data, a relatively small number of outliers may constitute a coherent and internally homogeneous group which, as subsequent observations are registered, may grow into an independent cluster. Whether or not an outlier is removed from the dataset, the researcher must first be aware of its existence. For this purpose, an appropriate method of anomaly detection should be used. Identification of such units allows the researcher to make an appropriate decision regarding the further steps in the analysis.
Assessment of the usefulness of outlier value detection methods has been
increasingly influenced by the possibility of their application for big data problems.
The algorithms should be effective for large volume and diverse sets of data, which
are additionally subject to constant changes. For these reasons, apart from high
sensitivity, the following are also important: low computational time and the
algorithm’s adaptability.
The aim of the research presented is to assess the usefulness of Isolation Forests
in outlier detection. Properties of the algorithm, with its extensions, will be ana-
lyzed. The results of simulation and empirical research on selected datasets will be
presented. The algorithm evaluation will take into account the impact of particular
features of big datasets on the effectiveness of the methods analyzed.

Keywords Outliers · Anomalies · Isolation forests

K. Najman (✉) · K. Zieliński
University of Gdańsk, Gdańsk, Poland
e-mail: krzysztof.najman@ug.edu.pl

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
K. Jajuga et al. (eds.), Data Analysis and Classification, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-030-75190-6_5

1 The Essence of Outliers in Cluster Analysis

By analyzing the properties of complex populations, researchers collect data on many of the characteristics describing them. Individual units within a population differ from each other, and the level of this differentiation is one of the significant features characterizing it. It may happen, however, that some units, usually a few, differ so much that a suspicion emerges that a mistake (a measurement error) was made during data collection or that the units do not belong to the given population. Since, due to its values, such a unit lies on the edge of a given feature's distribution, these values are called outliers (Grubbs 1969). Hawkins formally defined the concept of an outlier as follows: "An outlier is an observation that deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins 1980).
Occurrence of outliers in a dataset poses a considerable problem from the perspective of its analysis. In many cases, even a single observation can significantly change the statistics describing a given phenomenon. This is particularly visible in analyses employing the arithmetic mean as well as in regression analysis (Anscombe 1973). In cluster analysis this is also an important issue, because even a small number of atypical values can distort the resulting group structure. Some methods of cluster analysis, e.g., self-learning artificial neural networks of the growing neural gas (GNG) type, are so sensitive that they will try to create a separate cluster even for a few single outlying observations (Migdał-Najman and Najman 2013). As such, one-element clusters can emerge which do not contribute to the understanding of the phenomenon under examination. Alternatively, a cluster may absorb an outlying value, which can significantly deform the cluster's shape and parameters.
From the perspective of the impact on the results of grouping, the nature of anomalies can be diverse. Such a unit may be distant from all the others in the dataset, with its feature values lying well outside the range of variation of all other units. Such an unusual value is called an extreme value. It may also happen that the values of a unit's features do not fall outside that range, and yet the unit lies outside the group structure. This is an unusual value, i.e., an outlier, but not an extreme value (Aggarwal 2015).
Figure 1 illustrates such problems. Point A is considerably distanced from all the others; thus, it is both an outlier and an extreme value. Many grouping methods will create a separate cluster for it, making it relatively easy to detect and remove as non-contributing to the understanding of the structure of the population under examination. Point B is also an outlier, but the values of its features fall within the variability range of the remaining part of the population. It is also not that considerably distant from the other objects. Yet, the point is controversial, because it distorts, to a large extent, the parameters of the cluster it belongs to. The red line marks the boundary of the cluster that would emerge without this point, while the navy blue line marks the boundary including it. It is easy to see that these boundaries differ significantly. The cluster parameters, with and without point B, would be completely different.

Fig. 1 Outliers and extreme values. Source Own elaboration
For the above reasons, outliers should be detected and removed from the dataset
at the stage of data preparation for analysis. A number of methods can be used for
this purpose, including Isolation Forests. The purpose of the research presented is to
assess the usefulness of this method in outlier detection, in particular in cluster
analysis. Properties of a basic algorithm and its extensions will be analyzed. The
results of simulation and empirical research on selected datasets will be presented.
The algorithm evaluation will take into account the impact of particular features of
big datasets on the effectiveness of the methods analyzed.

2 Introduction to Isolation Forests and Extended Isolation Forests

Isolation Forests (iForest) (Liu et al. 2008) are a method of outlier detection proposed in 2008 by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. The algorithm aims to isolate outliers by random partitioning of the dataset, using multiple isolation trees. The mechanism stems from the assumptions made about anomalies, i.e., their small share in the dataset and their considerable distance from other, typical observations. In Isolation Forests, the step of building a given population's profile is omitted, owing to which the influence of the anomaly's distance from the rest of the data is not biased by an inaccurate mapping of the dataset structure. Taking into account the risk of examining inconsistent datasets, the method gains in universality.
To better understand how the iForest algorithm works, the Isolation Tree algorithm needs to be explained first. As the name suggests, its structure is closely related to Classification Trees. The main difference is that Isolation Trees are unsupervised algorithms; thus, they can be used for outlier detection. Unlike Classification Trees, where the dataset is partitioned into nodes based on a selected measure that determines the quality of the partition, Isolation Trees divide the dataset according to a randomly chosen variable and a random value of that variable. This change allows a significant acceleration of the algorithm, since selection of the best partition would require many potential variants to be checked. The degree of observation isolation depends on the length of the path between the tree root and the leaf containing the observation, which is equivalent to the number of dataset partitions needed to isolate a given value. Typical observations require more nodes for their isolation, because there are more units within their small neighborhood. The probability that an outlier will be separated earlier, under random partitioning, is correspondingly higher (Fig. 2).
Definition 1 Let X be the set of observations and y the set of variables describing them. An Isolation Tree is a tree in which every node T is either a leaf or an internal node with one partition and exactly two descendants $T_i$ and $T_j$, such that for a randomly chosen variable $y_k \in y$ and a random value $p \in (\min(y_k : y_k \in T), \max(y_k : y_k \in T))$, the elements of the set are divided into:

$$T_i : y_k < p, \qquad T_j : y_k \ge p. \tag{1}$$

The degree of isolation of a given observation x is determined by the path length h(x) from the root of the tree to the leaf in which point x is located. Because the value of h(x) depends on many factors, such as the sample size or the dimensionality, it cannot by itself be regarded as a universal coefficient of isolation.

Fig. 2 Isolation of a typical and an outlying observation in Isolation Trees. Source Own elaboration
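
To make Definition 1 and the notion of the path length h(x) concrete, the sketch below builds a single Isolation Tree by recursive random splits and measures h(x) for a query point. It is a minimal illustration written for this text, not the authors' code; the height limit and the helper names are our own choices.

```python
import numpy as np

def build_itree(X, height=0, height_limit=10, rng=None):
    """Grow one Isolation Tree: split on a random variable y_k at a random
    value p drawn between the minimum and maximum of y_k in the node."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    if height >= height_limit or n <= 1:
        return {"leaf": True, "size": n}
    k = rng.integers(X.shape[1])
    lo, hi = X[:, k].min(), X[:, k].max()
    if lo == hi:                       # node cannot be split any further
        return {"leaf": True, "size": n}
    p = rng.uniform(lo, hi)
    return {"leaf": False, "k": k, "p": p,
            "left":  build_itree(X[X[:, k] < p],  height + 1, height_limit, rng),
            "right": build_itree(X[X[:, k] >= p], height + 1, height_limit, rng)}

def path_length(x, node, height=0):
    """h(x): number of partitions from the root to the leaf containing x.
    (The full algorithm adds a correction c(size) for non-singleton leaves.)"""
    if node["leaf"]:
        return height
    child = "left" if x[node["k"]] < node["p"] else "right"
    return path_length(x, node[child], height + 1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(256, 2)), [[8.0, 8.0]]])   # typical points plus one outlier
tree = build_itree(X, rng=rng)
print(path_length(np.array([0.0, 0.0]), tree))   # typical point: long path expected
print(path_length(np.array([8.0, 8.0]), tree))   # outlier: short path expected
```
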

Using binary search tree (BST) theory, it is possible to estimate the average height of an Isolation Tree built on a set of n elements. This value is the same as the average length of an unsuccessful BST search, that is:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \tag{2}$$

where $H(i) \approx \ln(i) + \gamma$ ($\gamma$ being the Euler–Mascheroni constant) is the harmonic number. This value is used to calculate the anomaly score in the iForest algorithm.
Similarly to random forests, the iForest algorithm is more effective when individual trees have weak isolation capabilities on their own, provided that a large number of trees is used for the prediction. For this reason, not all the observations from the set are needed to train an Isolation Tree; it is even advisable to sample only a small number of observations. In this way, the tree can focus on separating the unusually atypical values rather than unnecessarily creating a large number of partitions among the values that are consistent with the population profile.
Definition 2 Let X be the dataset. An Isolation Forest is a statistical model consisting of t single, independent Isolation Trees $\{h(w, \Theta_k),\ k = 1, \ldots, t\}$, where w is a subset of X drawn without replacement and the $\Theta_k$ are i.i.d. random vectors.
Using the algorithm, the anomaly score for observation x is predicted with the following equation:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \tag{3}$$

where E(h(x)) is the average path length for observation x across all the Isolation Trees in the iForest. Dividing this value by c(n) yields a normalized path length, which can be interpreted universally, regardless of the set of observations. If:

• s(x) is significantly higher than 0.5, the observation is an outlier,
• s(x) is clearly lower than 0.5, the observation is not treated as an outlier,
• s(x) ≈ 0.5 for all observations, no significant outliers are present in the dataset (see the sketch below).
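
The following is a direct transcription of formulas (2) and (3), assuming that the constant in $H(i) \approx \ln(i) + \gamma$ is the Euler–Mascheroni constant; the thresholds from the list above are repeated as comments. It is a minimal sketch, not library code.

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant used in H(i) ~ ln(i) + gamma

def c(n):
    """Average path length of an unsuccessful BST search over n elements, formula (2)."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + EULER_GAMMA      # H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), formula (3)."""
    return 2.0 ** (-mean_path_length / c(n))

# Interpretation of s(x, n):
#   close to 1        -> x behaves like an outlier
#   clearly below 0.5 -> x is a typical observation
#   all values ~ 0.5  -> no distinct outliers in the dataset
print(anomaly_score(mean_path_length=3.0, n=256))   # short path  -> high score
print(anomaly_score(mean_path_length=12.0, n=256))  # long path   -> low score
```
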
The iForest algorithm comprises two steps. In the first, Isolation Trees are trained on the training part of the dataset. Then, new observations are passed through the model, which calculates their anomaly score. In the process of the model's training, the researcher determines a priori the values of two parameters: the size of the subsample drawn for each tree and the number of independent Isolation Trees.
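
As an illustration of these two steps, the snippet below uses scikit-learn's IsolationForest, whose n_estimators and max_samples correspond to the number of trees and the subsample size chosen a priori; this is a generic library example on synthetic data, not the code used in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.normal(size=(10_000, 2))                    # typical observations
X_new = np.vstack([rng.normal(size=(5, 2)),               # typical test points
                   rng.normal(loc=8.0, size=(5, 2))])     # clear outliers

# Step 1: train the forest; both parameters are fixed a priori by the researcher.
iforest = IsolationForest(n_estimators=100,   # number of Isolation Trees t
                          max_samples=256,    # subsample size w per tree
                          random_state=42).fit(X_train)

# Step 2: pass new observations through the model to obtain their scores.
# score_samples returns the negated anomaly score, so lower = more anomalous.
print(iforest.score_samples(X_new))
```
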
One important advantage of the algorithm is that the iForest is distinguished by low computational requirements. The structure of single trees is the same as that of a BST; thus, the computational complexity of training an Isolation Tree is $O(w \lceil \log_2 w \rceil)$, and for the whole iForest it is $O(t\,w \lceil \log_2 w \rceil)$. The increase in complexity is therefore only linear (it depends on the sample size and the number of trees), which makes the Isolation Forest well suited to the analysis of big datasets. In the process of evaluation, the observations pass through each of the t trees and, based on the path lengths, the value of s(x, w) is calculated. The computational complexity, just as in the process of learning, is $O(t\,n \lceil \log_2 w \rceil)$, where n is the number of observations for which the anomaly score is calculated.
In the process of Isolation Tree training, partitioning the data at random values of individual variables means that each split of the dataset is perpendicular to the axis of the chosen variable, at the partition point p. The problem grows in higher dimensions and increases the risk of carrying an anomaly into further analysis if the observation assumes typical values for at least one of the variables. Optimization of the partitioning value (Liu et al. 2019) increases the method's effectiveness; nevertheless, it still assumes partitioning according to a single variable. One idea for reducing the impact of this phenomenon on the anomaly score is the Extended Isolation Forest (EIF) (Sahand et al. 2019).
The change in the approach to outlier isolation entails a different definition of dataset partitioning in the Isolation Tree nodes. In classic Isolation Forests, for a k-dimensional dataset, a partition by a random value p of the i-th variable is the same as cutting the space $\mathbb{R}^k$ with a (k − 1)-dimensional hyperplane that is parallel to $\mathbb{R}^k \setminus \mathbb{R}_i$ and shifted by the value of p. In this way, each partition is parallel to the axes of the remaining variables in the dataset. Nothing, however, prevents the partitioning hyperplane from being sloped with respect to the other variables. If the slope angle is random, the maps of the anomaly score values should be topologically closer to the dataset under examination.
Selecting a random slope is equivalent to indicating a normal vector $\vec{n}$ pointing at a random point on a k-dimensional sphere. To do so, it is enough to draw each coordinate of the vector $\vec{n}$ from a standard normal distribution. Then, to determine the displacement vector $\vec{p}$, values are randomly selected for each of the k coordinates within the range of the subset being divided. The formula for partitioning the observations in a node into two consecutive subsets can be written as follows:

$$(\vec{x} - \vec{p}) \cdot \vec{n} \le 0 \tag{4}$$

A partition that takes more than one dimension into account has a positive impact on the isolation capabilities of the algorithm, owing to which the activation map better reflects the dataset. At the same time, it does not increase the computational time needed to train an Isolation Tree. It should be added that in the EIF the number of variables taken into account when determining the normal vector can be fixed, whereas for the variables that are not included, the corresponding coordinates of $\vec{n}$ assume the value of 0. In the following part of the article, the algorithm's behavior will be tested with all the variables taken into account when determining the partition of the dataset (Fig. 3).

Fig. 3 Map of iForest and EIF anomaly score. Source Own elaboration
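
The sketch below is our minimal rendering of the EIF branching rule (4): the normal vector n is drawn coordinate-wise from a standard normal distribution, the displacement point p uniformly within the range of the node's subset, and observations are routed by the sign of (x − p) · n. It illustrates the idea only and is not the reference implementation of the EIF authors.

```python
import numpy as np

def eif_split(X, rng=None):
    """One Extended Isolation Forest node split according to formula (4)."""
    rng = np.random.default_rng() if rng is None else rng
    n_vec = rng.standard_normal(X.shape[1])                 # random slope: normal vector n
    p_vec = rng.uniform(X.min(axis=0), X.max(axis=0))       # random displacement point p
    to_left = (X - p_vec) @ n_vec <= 0                      # branching criterion (4)
    return X[to_left], X[~to_left]

# Example: one oblique split of a 2-D sample into two child nodes.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
left, right = eif_split(X, rng)
print(left.shape, right.shape)
```
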

3 The Impact of Algorithm Parameters on the Method's Effectiveness

Despite the small number of parameters, the value of which is determined a priori
by the researcher, their impact on the iForest and the EIF is very significant. The
number of the trees used in the iForest and in the Extended Isolation Forest
algorithms affects the method’s effectiveness as well as the computational time
needed to train the model and to predict new observations. As in other ensemble
learning algorithms (Probst and Boulesteix 2018), a greater number of trees reduces
the deviation of the anomaly score values for both typical observations and outliers.
The model is better fitted to the input data; therefore, small changes in the values of observations do not drastically affect the anomaly score. At the same time, adding new trees increases the computational complexity of the algorithm linearly, so the parameter should be set at the value for which the anomaly score has converged.
The empirical research, conducted on a dataset of 100,000 observations drawn from a normal distribution together with a thousand outliers, shows that the standard deviation of the anomaly score stabilizes at as few as 100 trees (Fig. 4).
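
A rough sketch of this kind of experiment, scaled down for brevity: typical points from a normal distribution plus injected outliers, scored by several independently trained forests for each number of trees, so that the per-point spread of the anomaly score can be tracked as t grows. The data, sizes, and seeds are our own assumptions, so the numbers will not reproduce Fig. 4 exactly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(10_000, 2)),           # typical observations
               rng.normal(loc=10.0, size=(100, 2))])   # injected outliers

for t in (10, 50, 100, 200):
    # Train several forests of size t with different seeds and measure how much
    # the score of each observation fluctuates between them.
    runs = np.stack([
        IsolationForest(n_estimators=t, max_samples=256, random_state=seed)
        .fit(X).score_samples(X)
        for seed in range(5)
    ])
    print(f"t = {t:3d}  mean per-point std of the score = {runs.std(axis=0).mean():.4f}")
```
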
Another parameter whose value is determined by the researcher is the sample size. The number of observations used to train individual Isolation Trees, similarly to the number of trees, affects how well the model fits the input dataset. The difference between the two parameters is that an increase in the sample size carries the risk of tree overtraining. When sampling a large number of observations, the probability of including anomalies in the training process is higher, due to which some of these values can be treated as typical. Concurrently, the mean path length for typical observations is extended, which heightens the differences in the anomaly score between inlying and outlying observations (Fig. 5).

Fig. 4 Impact of the number of trees on the anomaly score deviation. Source Own elaboration

Fig. 5 Impact of the sample size on the anomaly score values. Source Own elaboration

4 The Impact of Dataset Characteristics on the Anomaly Score Values

Parameters of the method are not the only factors that significantly influence the anomaly score. One important element that may affect the method's effectiveness is the dimensionality of the dataset. More variables mean a greater number of potential dataset partitions in individual Isolation Trees, so the values of the anomaly score reflect the phenomenon under examination more poorly (Fig. 6).

Fig. 6 Impact of the dataset dimensionality on the anomaly score value. Source Own elaboration

The Isolation Forest and the Extended Isolation Forest were trained on datasets of various dimensions, where typical observations assume, for each variable, values from the N(0, 1) distribution, whereas 1% of the observations are outliers clearly separable from them. As the number of features describing the data increases, the values of the anomaly score become more dispersed: small distances between the observations result in smaller differences in the anomaly score. Concomitantly, the values of the anomaly score for outlying observations become lower. The impact of the data's high dimensionality on the distribution of the s(x, n) values can be effectively compensated for by the use of a greater number of individual Isolation Trees in the process of model training.
So far, the research has involved datasets with a single cluster of typical points. The method for calculating the anomaly score in Isolation Forests and Extended Isolation Forests indicates the global nature of the algorithms. Often, in empirical data, researchers encounter situations where various groups of observations with completely different distributions of the variables stand out in a given dataset. In such a case, the algorithm's ability to detect local anomalies in the dataset becomes significant.
Simulation studies show that the effectiveness of Isolation Forests in detecting local anomalies is most affected by disproportions in the number and the dispersion of the observations in individual groups. The assumption regarding anomalies in the algorithms described entails their rare occurrence and large distance from the other values. Observations from smaller clusters have a lower chance of being included in the sample, which makes the path length from the tree root to the observation significantly shorter. Observation dispersion also significantly affects the anomaly score values. The random partition value p in each node is selected from between the minimum and the maximum value of the variable in the observations drawn. For instance, suppose 100 observations assume values within the interval [0, 1] and another 100 within [2, 50]. In the best-case scenario, the first partition will separate the two disparate groups from each other, owing to which the dispersion effect will be reduced. If the value of p falls within the range [2, 50], however, the objects in the group with the greater standard deviation will be treated as atypical in the tree branch encompassing the values of the variable that are less than p. The values from the dispersed group will be underrepresented in this part of the tree. At the same time, in its remaining part, the partitioning will take place without a compact group, owing to which the observation dispersion will not affect the path length from the root to the observation. Summing up, both phenomena significantly impact the method's effectiveness, while the disproportion in the number of observations in a group has the greater impact on the increase in the anomaly score (Fig. 7).

Fig. 7 Impact of the group structures on the anomaly score value. Source Own elaboration
The last feature characterizing the datasets constituting the object of simulation
research is the share of noise in the data. Data noise occurs when some observations
do not form any group structures, and their distribution approximates randomness.
Simultaneously, the spatial distribution of these values does not differ significantly
from typical units. If the share of such observations constitutes a significant part of
the dataset examined, they cannot be referred to as outliers. Data noise should be
considered in two contexts when detecting outliers. First of all, a good method of
anomaly detection should be able to distinguish anomalies from typical values and
noise. Due to the conventional boundary between noise and outliers, it is difficult to
determine exactly how the method would distinguish between the two phenomena.
From the perspective of data analysis, it seems more important to distinguish typical
observations from noise, e.g., by assigning higher values to the anomaly score.
Another aspect to be kept in mind is the impact of the noise itself on the effectiveness of the method used. In anomaly detection, it manifests itself in typical observations being classified as outliers or outliers as typical (Fig. 8).
An Extended Isolation Forest with parameters t = 100 and w = 256 was trained on three datasets characterized by different spatial structures. Regardless of the percentage share of noise in the data, the value of the anomaly score is noticeably lower near the typical observations. At a thirty percent share of noise in the data, the algorithm is able to distinguish between typical values, anomalies, and noise, whereas along with an increase in the distance from the main cluster, the coefficient's values increase significantly. This allows the researcher to make a decision regarding the cut-off point between the anomalies, the noise, and the typical observations. Because a properly trained Isolation Forest does not require large samples in individual trees, data noise does not have any critical impact on the effectiveness of the iForest or the EIF algorithms. Interestingly, the presence of noise in the dataset causes the anomaly score values for typical observations to be lower.

Fig. 8 Impact of the share of data noise on the anomaly score values. Source Own elaboration
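
A simplified sketch of the noise experiment: a compact cluster of typical points, uniformly scattered noise, and a few distant outliers, scored by a forest with t = 100 and w = 256. We use scikit-learn's IsolationForest here instead of an EIF implementation, so this only approximates the setting described above; the group labels and sizes are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
typical = rng.normal(loc=0.0, scale=1.0, size=(2_000, 2))          # compact main cluster
noise = rng.uniform(low=-10.0, high=10.0, size=(600, 2))           # scattered data noise
outliers = rng.normal(loc=(20.0, 20.0), scale=0.5, size=(20, 2))   # distant anomalies
X = np.vstack([typical, noise, outliers])

forest = IsolationForest(n_estimators=100, max_samples=256, random_state=3).fit(X)
scores = -forest.score_samples(X)      # higher value = more anomalous

# Compare the score distributions of the three groups; the researcher can then
# choose a cut-off between typical observations, noise, and anomalies.
for name, part in (("typical", scores[:2_000]),
                   ("noise", scores[2_000:2_600]),
                   ("outliers", scores[2_600:])):
    print(f"{name:8s}  mean score = {part.mean():.3f}")
```
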

5 Discussion of the Empirical Research Results

The simulation studies show the algorithms' strong outlier detection abilities. The next stage of the research entails testing the Isolation Forests and the Extended Isolation Forests on datasets described in more detail in the literature. The local outlier factor (LOF) method, with a neighborhood parameter of k = 10, was used as the comparative algorithm. The datasets examined differ in the number of observations, the dimensionality, and the share of anomalous values (Table 1).

Table 1 Description of empirical datasets

Dataset         n        d    Anomaly percentage (%)
SMTP            95,156   3    0.03
Satellite*      5100     36   9.7
Shuttle         49,097   9    7
Mammography     11,183   6    2
Wine_red        1599     11   0.6
Wine_white      4898     11   0.4
Lymphography    148      18   4.1
Musk            3062     30   3.2

Source Own elaboration

The dataset "Wine Quality" (Cortez 2009) consists of descriptions of red and white wines, the quality of which has been rated on a scale of 1–10. Observations whose quality was extremely low (Quality = 3) were marked as outliers. In the dataset "Lymphography" (Zwitter and Soklic), most of the observations belong to the "metastases" and "malignant lymph" classes, while the "normal find" and "fibrosis" ones were marked as outliers. The dataset "Musk" (Dua and Graff 2019a) describes various molecular structures that have been assessed by experts in the field. For outlier detection, the dataset is limited to non-musk classes (j146, j147, and 252), marked as typical observations, and musk classes (213 and 211), added as anomalies. For the dataset "Satellite" (The Centre for Remote Sensing), the three least numerous classes were selected as anomalies, which in total constitute 32% of the dataset. In this study, however, only the single least numerous class was adopted as atypical, in consistency with the definition of outliers, which should constitute a small part of the observations. "SMTP" (Dua and Graff 2019b) is a subset of the KDD CUP 99 dataset, in which the "attacks" class is treated as anomalies. In the sets "Shuttle" (Dua and Graff 2019c) and "Mammography" (https://www.openml.org/d/310), the observations assigned to the least numerous classes were marked as outliers (Table 2).

Table 2 Results of outlier detection in empirical data: comparison of the Area Under Curve (AUC)

Dataset         IForest   EIF     LOF
SMTP            0.879     0.882   0.811
Satellite       0.804     0.741   0.515
Shuttle         0.998     0.996   0.521
Mammography     0.859     0.862   0.67
Wine_red        0.828     0.85    0.598
Wine_white      0.775     0.788   0.698
Lymphography    0.984     0.993   0.982
Musk            1         1       0.392

Source Own elaboration

The values of the area under the receiver operating characteristic (ROC) curve
indicate the effectiveness of the iForest and the EIF algorithms, with regard to
anomaly detection. In most cases, the area under the ROC curve (AUC) values are
significantly higher, compared to the local outlier factor (LOF) algorithm. The
results of both tree-based algorithms are comparable. The reason for the signifi-
cantly lower AUC values, in the case of the LOF method, may be the global nature
of the outliers considered, which favors the isolation methods. In some of the
datasets, observations from the least numerous classes were marked as atypical. The
simulation studies showed that the number of observations in a group is the decisive
factor, in the context of the anomaly score value.
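
A sketch of this evaluation protocol on a synthetic stand-in dataset (the benchmark sets from Table 1 are not bundled here): labeled outliers, with the ROC AUC of the Isolation Forest score compared against the LOF score with k = 10 neighbors, both via scikit-learn. The EIF column is omitted because it requires a separate package, and the resulting numbers are not those from Table 2.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a benchmark dataset with roughly 1% labeled anomalies.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(5_000, 6)),
               rng.normal(loc=6.0, size=(50, 6))])
y = np.r_[np.zeros(5_000), np.ones(50)]            # 1 = outlier

# Isolation Forest: negate score_samples so that higher means more anomalous.
if_scores = -IsolationForest(n_estimators=100, max_samples=256,
                             random_state=0).fit(X).score_samples(X)

# Local outlier factor with k = 10, as in the comparison above.
lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(X)                                  # fills negative_outlier_factor_
lof_scores = -lof.negative_outlier_factor_

print("iForest AUC:", roc_auc_score(y, if_scores))
print("LOF AUC:   ", roc_auc_score(y, lof_scores))
```
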

6 Final Conclusions

Summing up, the simulation and empirical studies have confirmed the outlier
detection capability of the iForest and the EIF algorithms. The main advantages of
the methods analyzed can be recapped in the following points:
• Linear computational complexity, which makes the method suitable for the analysis of big datasets. The duration of the forest training stage depends on the sample size and the number of trees; thus, the dataset dimensionality and the total number of observations play no role here. Additionally, it is worth remembering that the algorithms converge at a small number of trees and a small sample size.
• Resistance of the algorithms to data noise. The simulation studies have shown
that the anomaly score for outliers and for the noise differ significantly, owing to
which the researcher can independently decide about the cut-off point between
them. At the same time, at a higher share of noise in the data, the anomaly score
values for typical observations are lower.
• Ability to analyze multidimensional datasets. A larger number of variables does
not negatively affect the computational time or the algorithm’s ability to detect
anomalies. Even when an observation is considered an outlier only because of
one of the variables, the algorithm, with an appropriate number of trees, is able
to successfully detect such a unit.
• Effectiveness of the algorithms is independent of the spatial structure of the data.
• The small number of the parameters to be determined by the researcher, which
facilitates the adaptation of the method to the dataset examined.
The limitations of the algorithms are less obvious and debatable. Although in certain situations they may pose a difficulty in outlier analysis, in other cases they may prove to be helpful. The features that raise the most doubts pertain to:

• Ineffectiveness on datasets with significant differences in the number of observations in each group. The small sample size used in the algorithm means that values from the smallest clusters are used less frequently to train the trees, which results in higher values of the coefficient of isolation for them. This relationship, however, does not have to be a disadvantage, because in the case of global anomalies small clusters of observations are still treated as outliers.
• Lack of an exact cut-off point for the anomaly score values between the anomalies, the data noise, and the typical observations. Before marking some observations as anomalies, one should analyze the distribution of the coefficient of isolation in the dataset, which makes full automation of the algorithm difficult. At the same time, this allows expert knowledge or business needs to be taken into account in the final decision.
Despite the extensive analysis of the iForest and the EIF algorithms, some issues related to the detection of outlying values typical for big datasets remain to be settled. Increasingly often, the datasets analyzed are characterized by a real-time influx of observations (streaming data) (Zhiguo 2013), which means that the methods used should have high adaptability. With regard to the iForest and the EIF, the question of how often the model should be re-trained to adequately reflect the current relations between observations remains to be answered. Another way to analyze streaming data could entail adding newly trained trees to the forest while simultaneously removing the oldest ones, yet this approach could still be ineffective for the analysis of periodically changing phenomena. The effectiveness of the algorithms with regard to local outliers is another issue. It is most clearly visible in the case of group structures; thus, it could be alleviated by combining the iForest or the EIF with grouping methods (Rongfang et al. 2019). In the literature, attempts have been made at such modifications, but the most effective combination of methods, taking into account a larger number of grouping algorithms, has not yet been established. It should be remembered, however, that combining different methods will affect the computational time of the algorithm. The key determining aspect may therefore be the appropriate use of grouping algorithms with linear computational complexity. At the same time, it should be remembered that incorrect assignment of observations to the corresponding clusters, at the grouping stage, may disturb the effectiveness of the entire method. As such, the combination of grouping methods with algorithms that isolate outliers should only take place when it is possible to properly separate the clusters of observations from each other.

References

Aggarwal CC (2015) Data mining. Springer International Publishing, Switzerland. https://doi.org/10.1007/978-3-319-14142-8
Anscombe FJ (1973) Graphs in statistical analysis. Am Stat 27(1):17–21
Cortez P (2009) http://www3.dsi.uminho.pt/pcortez. Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal. https://archive.ics.uci.edu/ml/datasets/wine+quality. Accessed 4 Sept 2020
Dataset published by courtesy of Aleksandar Lazarevic. https://www.openml.org/d/310. Accessed 5 Sept 2020

Dua D, Graff C (2019a) UCI machine learning repository [http://archive.ics.uci.edu/ml].


University of California, School of Information and Computer Science, Irvine, CA. https://
archive.ics.uci.edu/ml/datasets/Musk+%28Version+2%29. Accessed 4 Sept 2020
Dua D, Graff C (2019b) UCI machine learning repository [http://archive.ics.uci.edu/ml].
University of California, School of Information and Computer Science, Irvine, CA. https://
archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/. Accessed 4 Sept 2020
Dua D, Graff C (2019c) UCI machine learning repository [http://archive.ics.uci.edu/ml].
University of California, School of Information and Computer Science, Irvine, CA. https://
archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29. Accessed 4 Sept 2020
Grubbs FE (1969) Procedures for detecting outlying observations in samples. Technometrics
11(1):1–21
Hawkins DM (1980) Identification of outliers. Chapman and Hall
Liu FT, Ting KM, Zhou Z (2008) Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, Pisa, pp 413–422. https://doi.org/10.1109/ICDM.2008.17
Liu Z, Liu X, Ma J, Gao H (2019) An optimized computational framework for isolation forest.
Mathematical Problems in Engineering, vol 2018, Article ID 2318763
Migdał-Najman K, Najman K (2013) Samouczące się sztuczne sieci neuronowe w grupowaniu i
klasyfikacji danych. Teoria i zastosowania w ekonomii, Wydawnictwo Uniwersytetu
Gdańskiego, Gdańsk
Probst P, Boulesteix A-L (2018) To tune or not to tune the number of trees in random forest.
J Mach Learn Res 18:10–18
Rongfang G, Tiantian Z, Shaohua S, Zhanyu L (2019) Research and improvement of isolation
forest in detection of local anomaly points. J Phys Conf Ser 1237:052023. https://doi.org/10.
1088/1742-6596/1237/5/052023
Sahand H, Kind MC, Brunner RJ (2019) Extended isolation forest. IEEE Transactions on Knowledge and Data Engineering, pp 1–1
The Centre for Remote Sensing, University of New South Wales, Kensington, PO Box 1, NSW
2033. https://archive.ics.uci.edu/ml/datasets/Statlog+%28Landsat+Satellite%29. Accessed 4
Sept 2020
Zhiguo D (2013) An anomaly detection approach based on isolation forest algorithm for streaming
data using sliding window. IFAC Proc 46:12–17. https://doi.org/10.3182/20130902-3-CN-
3020.00044
Zwitter M, Soklic M University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia.
https://archive.ics.uci.edu/ml/datasets/Lymphography. Accessed 4 Sept 2020
