
International Journal of Data Science and Analytics (2021) 12:297–331

https://doi.org/10.1007/s41060-021-00265-1

REVIEW

On the nature and types of anomalies: a review of deviations in data


Ralph Foorthuis1

Received: 19 July 2020 / Accepted: 17 May 2021 / Published online: 4 August 2021
© The Author(s) 2021

Abstract
Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of
the anomaly is typically ill defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of
publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been
published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-
independent typology of data anomalies and presents a full overview of anomaly types and subtypes. To concretely define
the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of
relationship, anomaly level, data structure, and data distribution. These fundamental and data-centric dimensions naturally
yield 3 broad groups, 9 basic types, and 63 subtypes of anomalies. The typology facilitates the evaluation of the functional
capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant
topics such as local versus global anomalies.

Keywords Anomalies · Outliers · Deviants · Typology · Anomaly detection · Explainable data science

* Ralph Foorthuis
Ralph.Foorthuis@Heineken.com
1 Digital & Technology, HEINEKEN, Amsterdam, The Netherlands

1 Introduction

The physical and social world is known to bring about abnormal and bizarre phenomena that are seemingly hard to explain. Although rare by definition, such strange and unusual occurrences can actually also be said to be relatively abundant due to the huge amount of objects and interactions in the world. Owing to the massive data collection taking place in the current era and the imperfect measurement systems used for this, anomalous observations can thus be expected to be amply present in our datasets. These large collections of data are mined in both academia and practice, with the aim of identifying patterns as well as peculiarities. The term anomalies in this context refers to cases, or groups of cases, that are in some way unusual and deviate from some notion of normality [1–13]. Such occurrences are often also referred to as outliers, novelties, deviants or discords [5, 14–16]. Anomalies are assumed to be both rare and different, and pertain to a wide variety of phenomena, which include static entities and time-related events, single (atomic) cases and grouped (aggregated) cases, as well as desired and undesired observations [7, 9, 16–21, 300, 319, 326]. Although anomalies can form a noise factor hindering the data analysis, they may also constitute the actual signals that one is looking for. Identifying them can be a difficult task due to the many shapes and sizes they come in, as illustrated in Fig. 1. Anomaly detection (AD) is the process of analyzing the data to identify these unusual occurrences.

Outlier research has a long history and traditionally focused on techniques for rejecting or accommodating the extreme cases that hamper statistical inference. Bernoulli seems to be the first to address the issue in 1777 [22], with subsequent theory building throughout the 1800s [23–26, 327, 328], 1900s [27–36, 177, 274] and beyond [e.g., 37–39]. Although it was occasionally recognized that anomalies may be interesting in their own right [e.g., 12, 29, 33, 40–42], it was not until the end of the 1980s that they started to play a crucial role in the detection of system intrusions and other sorts of unwarranted behavior [43–50]. At the end of the 1990s another surge in AD research focused on general-purpose, nonparametric approaches for detecting interesting deviations [51–56]. Anomaly detection has now been studied for a wide variety of purposes, such as fraud discovery, data quality analysis, security scanning, system and process control, and, as indeed practiced in classical statistics for some


[Figure 1: four panels. (a) Scatter plot with axes x1 and x2: the anomalous data point is multivariately isolated (but not extreme on x1 or x2). (b) Graph with vertices labeled A or B: the anomalous vertex has a different class label than its adjacent vertices. (c) Time series of y over time: the anomalous time interval deviates from the cyclical pattern. (d) Text: the anomalous text section is comprised of unusually long words.]

Fig. 1  Red bold occurrences illustrate the wide variety of anomalies, resulting in the anomaly being perceived as an ambiguous concept. Resolving this requires typifying all these manifestations in a single overarching framework
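
The first manifestation in Fig. 1, a case that is multivariately isolated without being extreme on x1 or x2, can be made tangible with a small sketch. The code below is only an illustration under stated assumptions: the data points, the nearest-neighbor distance criterion, the cutoff factor of 2, and the function names are all hypothetical and are not taken from this study, which deliberately abstracts from specific detection techniques.

```python
import math
from statistics import median

def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]

def isolated_points(points, factor=2.0):
    """Indices whose nearest-neighbor distance exceeds factor times the median.

    The factor is an arbitrary illustrative choice, not a recommended value.
    """
    distances = nearest_neighbor_distances(points)
    cutoff = factor * median(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]

def is_extreme_on_any_axis(points, index):
    """True if the point holds the minimum or maximum value on some axis."""
    for axis in range(len(points[index])):
        values = [p[axis] for p in points]
        if points[index][axis] in (min(values), max(values)):
            return True
    return False

# Hypothetical data: ten regular cases on the diagonal x2 = x1, plus (2, 8).
regular = [(float(v), float(v)) for v in range(10)]
points = regular + [(2.0, 8.0)]

flagged = isolated_points(points)
for i in flagged:
    print(points[i], "extreme on an axis:", is_extreme_on_any_axis(points, i))
```

In this toy set the case at (2, 8) has unremarkable values on each individual axis and only stands out when both attributes are considered jointly, which is why a purely univariate check would miss it.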

250 years, data handling prior to statistical inference [e.g., 3, 5, 14, 21, 24, 25, 57, 58, 158]. The topic of AD has not only gained ample academic attention over the years, but is also deemed crucial for industrial practice [59–63].

Despite abundant research and valuable progress, the field of anomaly detection cannot claim maturity yet. It lacks an overall, integrative framework to understand the nature and different manifestations of its focal concept, the anomaly [6, 69, 184]. The general definitions of an anomaly are often said to be ‘vague’ and dependent on the application domain [11, 12, 20, 64–68, 160, 316–318], which is likely due to the wide variety of ways anomalies manifest themselves. In addition, although the data mining, artificial intelligence and statistics literature does offer various ways to distinguish between different kinds of anomalies, research has hitherto not resulted in overviews and conceptualizations that are both comprehensive and concrete. Existing discussions on anomaly classes tend to be either only relevant for specific situations or so abstract that they neither provide a tangible understanding of anomalies nor facilitate the evaluation of AD algorithms (see Sects. 2.2 and 4). Moreover, not all conceptualizations focus on the intrinsic properties of the data and almost none of them use clear and explicit theoretical principles to differentiate between the acknowledged classes of anomalies (see Sect. 2.2). Finally, the research on this topic is fragmented and studies on AD algorithms usually provide little insight into the kinds of anomalies the tested solutions can and cannot detect [6, 8, 184]. This literature study therefore presents an integrative and data-centric typology that defines the key dimensions of anomalies and provides a concrete description of the different types of deviations one may encounter in datasets. To the best of my knowledge this is the first comprehensive overview of the ways anomalies can manifest themselves, which, given that the field is about 250 years old, can be safely said to be overdue. The concept of the anomaly, including its different types and subtypes, is meaningfully characterized by five fundamental dimensions of anomalies, namely data type, cardinality of relationship, anomaly level, data structure, and data distribution. The value of the typology lies in offering a theoretical yet tangible understanding of the essence and types of data anomalies, assisting researchers with systematically evaluating and clarifying the functional capabilities of detection algorithms, and aiding in analyzing the conceptual characteristics and levels of data, patterns, and anomalies. Preliminary versions of the typology have been employed for evaluating AD algorithms [6, 69, 70, 297]. This study extends the initial versions of the typology, discusses its theoretical properties in more depth, and provides a full overview of the anomaly (sub)types it accommodates. Real-world examples from fields such as evolutionary biology, astronomy and, from my own research, organizational data management serve to illustrate the anomaly types and their relevance for both academia and industry.

A key property of the typology presented in this work is that it is fully data-centric. The anomaly types are defined in terms of characteristics intrinsic to data, thus without any reference to external factors such as measurement errors, unknown natural events, employed algorithms, domain knowledge or arbitrary analyst decisions. This is different


from many other conceptualizations, as will be discussed in Sects. 2.2 and 4. Note that ‘defining an anomaly type’ in this context does not imply an ex ante domain-specific definition known before the actual analysis (e.g., based on rules or supervised learning). Unless specified otherwise, the anomalies discussed in this study can in principle be detected by unsupervised AD methods, thus based on the intrinsic properties of the data at hand, without any need for domain knowledge, rules, prior model training or specific distributional assumptions. Such anomalies are therefore universally deviant, regardless of the given problem.

A clear understanding of the nature and types of anomalies in data is crucial for various reasons. First, it is important in data mining, artificial intelligence, and statistics to have a fundamental yet tangible understanding of anomalies, their defining characteristics and the various anomaly types that may be present in datasets. The typology’s theoretical dimensions describe the nature of data and capture (deviations from) patterns therein and as such offer a deep understanding of the field’s focal concept, the anomaly. This is not only relevant for academia, but also for practical applications, especially now that AD has gained increased attention from industry [61–63]. Second, with the criticism on ‘black box’ and ‘opaque’ AI and data mining methods that may result in biased and unfair outcomes, it has become clear that it is often undesirable to have techniques and analysis results that lack transparency and cannot be explained meaningfully [71–76]. This is especially true for AD algorithms, as these may be used to identify and act on ‘suspicious’ cases [48–50, 326, 330]. Moreover, the definitions of anomalies are sometimes non-obvious and hidden in the designs of algorithms [8, 65, 184], and true deviations may be declared anomalous for the wrong reasons [306]. Although the typology presented here does not increase the transparency of the algorithms, a clear understanding of (the types of) anomalies and their properties, abstracted from detailed formulas and algorithms, does increase post hoc interpretability by making the analysis results and data more understandable [20, 52, 69, 76, 184, 276]. Third, even if techniques from computer science and statistics are functionally transparent and understandable, the implementations of these algorithms may be done poorly or simply fail due to overly complex real-world settings [73, 77–79]. A clear view on anomalies is therefore needed to determine whether detected occurrences indeed constitute true deviations. This is especially relevant for unsupervised AD settings, as these do not involve pre-labeled data. Fourth, the no free lunch theorem, which posits that no single algorithm will demonstrate superior performance in all problem domains, also holds for anomaly detection [17, 60, 80–87, 184, 286, 320]. Individual AD algorithms are generally not able to detect all types of anomalies and do not perform equally well in different situations. The typology provides a functional evaluation framework that enables researchers to systematically analyze which algorithms are able to detect what types of anomalies to what degree. Fifth, a comprehensive overview of anomalies contributes to making implemented systems more robust and stable, as it allows injecting test datasets with deviations that represent unexpected and possibly faulty behavior [314, 329]. Finally, a principled overall framework, grounded in extant knowledge, offers students and researchers foundational knowledge of the field of anomaly analysis and detection and allows them to position and scope their own academic endeavors.

This study therefore puts forward an overall typology of anomalies and provides an overview of known anomaly types and subtypes. Rather than presenting a mere summing-up, the different manifestations are discussed in terms of the theoretical dimensions that describe and explain their essence. The anomaly (sub)types are described in a qualitative fashion, using meaningful and explanatory textual descriptions. Formulas are not presented, as these often represent the detection techniques (which are not the focus of this study) and may draw attention away from the anomaly’s cardinal properties. Also, each (sub)type can be detected by multiple techniques and formulas, and the aim is to abstract from those by typifying them on a somewhat higher level of meaning. A formal description would also bring with it the risk of unnecessarily excluding anomaly variations. As a final introductory remark it should be noted that, despite this study’s extensive literature review, the long and rich history of anomaly research makes it impossible to include each and every relevant publication.

This article proceeds as follows. Section 2 explains key concepts and discusses related research. Section 3 introduces the typology of anomalies. Section 4 discusses various properties of the typology and compares it with other research. Finally, Sect. 5 presents the conclusions.

2 Theory

2.1 Key terms and concepts

This section defines the employed concepts to ensure that the reader understands the terms as intended, regardless of his or her discipline (senior scholars may choose to only do a quick scan). An anomaly, in its broadest meaning, is something that is different or peculiar given what is usual or expected [88–90]. In the philosophy of science, anomalies play a crucial role as observations or predictions that are inconsistent with the models in the prevailing academic paradigm [91–94]. Such anomalies require an explanation and consequently initiate the advancement of knowledge by the refinement of current theories. Over time, anomalies that constitute fundamental novelties may accumulate and trigger


an academic crisis in which the old paradigm is replaced by a wholly different one. Newtonian physics, for example, was succeeded by Einstein’s theory of general relativity, which was better capable of predicting and explaining a variety of observed astronomical phenomena, such as anomalies pertaining to the perihelion of Mercury. In statistics, data mining and AI an anomalous occurrence deviates from some notion of normality for the given data and setting. Deviants that can be detected in an unsupervised fashion, which are the focus of this study, can be defined more precisely. An anomaly in this context is a case, or a group of cases, that in some way is unusual and does not fit the general patterns exhibited by the majority of the data [3, 4, 8, 10, 11, 69, 325, 326]. The detection of anomalies is a highly relevant task, not only because they should be handled appropriately during inferential research, but also because the goal of analyses is often to discover interesting new phenomena [9, 37–39, 95–98]. The remainder of this section will focus on terms and concepts pertaining to anomalies in data.

The term cases refers to the individual instances in a dataset, also called data points, rows, records, or observations [57, 99, 323]. These cases are described by one or more attributes, also referred to as variables, columns, fields, dimensions or features. Some of these attributes will be required for data management and context, such as identification (ID) and time variables. In addition, the dataset will contain substantive attributes, i.e., the meaningful domain-specific variables of interest, such as income and temperature. Measuring and recording the actual attribute values is prone to errors, the discovery of which may indeed be one of the reasons to conduct anomaly detection. The term occurrence is used here in a broad fashion and may refer to an individual case or a group of cases, an object or an event, and anomalous or regular data.

The term dependency is used in the literature to refer to two aspects of relationships, both of which are relevant for this study. First, there can be a dependency between the attributes, meaning there is a relationship between the variables [59, 96, 99–101, 182]. Income, for example, may be correlated with education and parental financial status. A second form of dependency, referred to as dependent data, deals with the relationship between the dataset’s individual cases or rows [7, 20, 57, 102, 323]. A set with such dependent cases contains an intrinsic relation between the observations. Examples are time series, spatial and graph data, and sets with hierarchical relationships. The dependencies in such datasets are typically captured by time, location, linking or grouping attributes. These inter-case relations are absent from independent data, such as in i.i.d. random samples for cross-sectional surveys, in which every row represents a stand-alone observation.

Describing and understanding the different types of anomalies in a concrete and data-centric manner is not feasible without referring to the functional data structures that host them. This section therefore shortly discusses several important formats for organizing and storing data [cf. 5, 57, 95, 106, 110–115, 184]. Some analyses are conducted on unstructured and semi-structured text documents. However, most datasets have an explicitly structured format. Cross-sectional data consist of observations on unit instances (e.g., individual people, organizations or countries) at one point in time. The cases in such a set are generally considered to be unordered and otherwise independent, as opposed to the following structures with dependent data. Time series data consist of observations on one unit instance (e.g., one country) at different points in time. Time-oriented panel data, or longitudinal data, consist of a set of time series and are therefore comprised of observations on multiple individual entities at different points in time (e.g., income history for a sample of citizens followed over a five-year period). The general term sequence data will be used to refer to time series, time-oriented panel data, as well as to sets with an ordering not based on time. Sequence data have broad applications and, besides time-oriented phenomena, are able to capture genomic and other biological features, user actions, spectroscopy wavelengths, trajectories, audio, and even visual information such as the shape of physical objects and moving elements in a video [5, 10, 114, 116–123, 278]. Each of the above data structures can be implemented with a single matrix or table, but when several inter-related entities need to be modeled a relational model is often used [95, 124, 125]. This allows many functional structures, including domain-specific designs and analytical star schemas [95, 103, 104]. A graph is a related data structure and typically consists of vertices (nodes), edges (connections or links), edge directions and edge weights [20, 57, 95, 106, 112, 113]. An attributed graph has, in addition to these structural properties, any number of substantive domain-specific variables. Such structures are highly relevant for modeling, e.g., social networks, chemical compounds, Internet data, and wireless sensor networks. Graphs can take many forms, including tree data structures. A tree consists of a root, a given amount of parent and child nodes, and does not feature any closed paths (so-called cycles). Storing graphs typically involves both a set of vertices (e.g., as a list of nodes and their properties) and a set of edges (e.g., as an adjacency matrix with relations, directions and weights). This is similar to spatial data, which usually consist of a set of coordinates and a set of substantive features [8, 66, 95, 126–129, 277, 323]. The latter may pertain to, e.g., population density, settlement type or availability of utility infrastructure elements. The representations often constitute points (atomic positions such as addresses), lines, arcs (e.g., roads or rivers) and polygons (regions


such as neighborhoods or states). However, a rasterization of continuous data can also be used, such as satellite imagery and brain scans represented as a grid of pixels. This format therefore also captures image material in general, with the data points being 2D pixels or 3D voxels. The coordinates represent a position on a canvas or frame, while the features store the visual information as gray intensities or color information (e.g., RGB, multi- or hyperspectral). Spatio-temporal data feature a sequence dimension in addition to the coordinates and may capture, e.g., video and historical geographical information [281, 283]. Even more detailed distinctions can be made, but the key formats described above suffice to properly discuss the wide variety of anomaly types.

The concept of an aggregate is often used in the context of dealing with noise or obtaining a more abstract representation at the level of interest. When aggregating the cases the analyst is able to treat multiple individual rows as a whole or a group and consequently obtain summary statistics (e.g., means and totals) or other derived properties of the collective [5, 57, 95, 103–106, 282, 323]. This allows the data and perspective to be transformed, for example, from days to months or from individual people to households. In terms of data structures, typical examples of aggregates are subgraphs, subsequences, and regions in spatial data. Often, however, the AD analysis will simply focus on the individual cases, i.e., on the atomic level of the set’s microdata. One can also aggregate the attributes in order to reduce dimensionality or to obtain for individual or grouped cases meaningful and complex latent variables or manifolds [3, 96, 107–109]. However, this may mainly be relevant in the context of high-level semantic anomalies, in which such aggregates are generally difficult to determine without any prior theory, rules or supervised training (see Sect. 3.2 for more information) [13, 310].

To conclude this section it is valuable to briefly discuss what constitutes a typology. To theoretically distinguish between concepts, scholars have various intellectual tools at their disposal, amongst which are taxonomies, classifications, dendrograms, and typologies [130]. These all make use of one or more classificatory principles (explicit dimensions) to differentiate between the relevant elements. A classification uses a single principle, whereas a typology uses two or more simultaneously. A typology is therefore well suited to theoretically distinguish between complex concept types, offering not only a fundamental and summarized description of a general concept, but also an exhaustive and mutually exclusive overview of its distinct but related types. The term classification will be used more loosely in this study and will also refer to conceptualizations with classes that are not based on clear principles and that are neither mutually exclusive nor jointly exhaustive.

2.2 Related work

The literature acknowledges various ways to distinguish between different manifestations of anomalies. Barnett and Lewis [2, cf. 31, 131] make a distinction between extreme but genuine members of the main population, i.e., random fluctuations at the tails of the focal distribution, and contaminants, which are observations from a different distribution. Wainer [34] differentiates between distant outliers, which exhibit extreme values and are clearly in error, and fringeliers, which are unusual but with their position about three standard deviations from the majority of the data cannot be said to be extremely rare and unequivocally erroneous. Essentially the same distinction is made in [132] with white crows and in-disguise anomalies, respectively. Relatedly, in [5, 133] a distinction is made between a weak outlier (noise) and a strong outlier (a significant deviation from normal behavior). The latter category can be sub-divided in events, i.e., unusual changes in the real-world state, and measurement errors, such as a faulty sensor [134, 135]. An overall classification is presented in [96], with the classes of anomalies indicating the underlying reasons for their deviant nature: a procedural error (e.g., a coding mistake), an extraordinary event (such as a hurricane), an extraordinary observation (unexplained deviation), and a unique value combination (which has normal values for its individual attributes). Other sources refer to similar explanations in a more free-format fashion [39, 97, 136]. In [184] a distinction is made between 9 types of anomalies. Another broad classification is that of [7], which differentiates between three general categories. A point anomaly refers to one or several individual cases that are deviant with respect to the rest of the data. A contextual anomaly appears normal at first, but is deviant when an explicitly selected context is taken into account [cf. 137]. An example is a temperature value that is only remarkably low in the context of the summer season. Finally, a collective anomaly refers to a collection of data points that belong together and, as a group, deviate from the rest of the data.

Several specific and concrete classifications are also known, especially those dedicated to sequence and graph analysis. Many of their anomaly types will be described in detail in Sect. 3. In time series analysis several within-sequence types are acknowledged, such as the additive outlier, temporary change, level shift and innovational outlier [138–141, 191]. The taxonomy presented in [142] focuses on between-sequence anomalies in panel data and makes a distinction between isolated outliers, shift outliers, amplitude outliers, and shape outliers. Another specific classification is known from regression analysis, in which it is common to distinguish between outliers, high-leverage points and influential points [3, 143–145]. Two anomalies with regard to the trajectories of moving entities are presented in [118,


Table 1  Existing classifications for distinguishing between anomalies.

Reference | G/S | DC? | Classes of anomalies | Explicit classificatory dimensions
[6, 69, 70] | G | Y | Extreme value ano, rare class ano, simple mixed data ano, multidimensional numerical ano, multidimensional rare class ano, multidimensional mixed data ano | Types of data, cardinality of relationship
[2, cf. 31] | G | N | Extreme genuine member, contaminant | None
[34] | G | Y | Fringelier, distant outlier | None
[52] | G | Y | Strongest outlier, weak outlier, trivial outlier | Attribute subspace
[132] | G | Y | White crow, in-disguise anomaly | None
[5, 133] | G | Y | Weak outlier, strong outlier | None
[96] | G | N | Procedural error, extraordinary event, extraordinary observation, unique value combination | None
[136] | G | N | Data error, normal variance, data from other distributions, distributional assumption | None
[7] | G | N | Point anomaly, contextual anomaly, collective anomaly | None
[184] | G | Y | Known distribution ano, sparse distribution ano, local density-based ano, global density-based ano, rare instance ano, burst ano, deviant sequence ano, trend ano, irregularity ano | None
[182] | G | Y | Trivial outlier, non-trivial outlier | None
[3, 143] | S | N | Outlier, high-leverage point, influential point | None
[138, cf. 141] | S | Y | Additive outlier, temporary change, level shift, innovational outlier | None
[187] | S | Y | Isolated outlier, patch outlier, level shift | None
[142] | S | Y | Isolated outlier, shift outlier, amplitude outlier, shape outlier | None
[233] | S | Y | Trend anomaly, seasonality anomaly | None
[314] | S | Y/N | Outlier, spike, stuck-at, high-noise (plus several non-data-centric anomalies) | None
[281] | S | Y | Various spatio-temporal change patterns | Temporal, spatial, raster/vector
[20] | S | Y | Deviant vertex, deviant edge, deviant subgraph | None
[205] | S | Y | Near-star, near-clique, heavy vicinity, dominant edge | None
[125] | S | Y | Insertion, update and deletion anomaly | Based on database CRUD functions
[60] | S | Y | Foreign-symbol, foreign n-gram, rare n-gram | None
[118, cf. 146] | S | Y | Positional outlier, angular outlier | None

G/S refers to a general (broad and usually abstract) versus specific way to distinguish between classes of anomalies. DC stands for data-centric, meaning the anomalies can be distinguished by analyzing the dataset, without a reference to or dependency on external factors (such as unknown real-world events or arbitrary analyst decisions)

cf. 146, 147], namely the positional outlier, which is positioned in a low-density area of the trajectory space, and the angular outlier, which has a direction different from regular trajectories. The subfield of graph mining has also acknowledged several specific classes of anomalies, with anomalous vertices, edges, and subgraphs being the basic forms [18, 20, 112, 113, 148, 149]. Table 1 summarizes the anomaly classes acknowledged in the extant literature. In Sect. 3 these anomalies, particularly those that allow a data-centric definition, will be discussed in more detail and positioned within this study’s typology.

The classifications in Table 1 are either too general and abstract to provide a clear and concrete understanding of anomaly types, or feature well-defined types that are only relevant for a specific purpose (such as time series analysis, graph mining or regression modeling). The fifth column also makes clear that extant overviews hardly offer clear principles to systematically partition the classificatory space to obtain meaningful categories of anomalies. They thus do not constitute a classification or typology as defined by [130]. To the best of my knowledge this study’s framework and its predecessors offer the first overall typology of anomalies that presents a comprehensive overview of concrete anomaly types.

Many of the existing overviews also do not offer a data-centric conceptualization. Classifications often involve algorithm- or formula-dependent definitions of anomalies [cf. 8, 11, 17, 86, 150, 184], choices made by the data analyst regarding the contextuality of attributes [e.g., 7, 137], or assumptions, oracle knowledge, and references to unknown populations, distributions, errors and phenomena [e.g., 1, 2, 39, 96, 131, 136]. This does not mean these conceptualizations are not valuable. On the contrary, they often provide important insights as to the underlying reasons why


Types of data (columns): Quantitative attributes | Qualitative attributes | Mixed attributes

Univariate, atomic: Type I Uncommon number anomaly | Type II Uncommon class anomaly | Type III Simple mixed data anomaly
Multivariate, atomic: Type IV Multidimensional numerical anomaly | Type V Multidimensional categorical anomaly | Type VI Multidimensional mixed data anomaly
Multivariate, aggregate: Type VII Aggregate numerical anomaly | Type VIII Aggregate categorical anomaly | Type IX Aggregate mixed data anomaly

(Rows are defined by the cardinality of relationship and the anomaly level.)

Fig. 2  The framework for the typology of anomalies
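
The framework of Fig. 2 can be read as a lookup from three dimensions (data type, cardinality of relationship, anomaly level) to one of nine basic types. The following minimal Python sketch encodes that reading. The names `FRAMEWORK` and `basic_anomaly_type` and the lower-case string labels are illustrative assumptions and are not part of the typology itself.

```python
# Illustrative encoding of the nine cells of Fig. 2.
# The dictionary layout and all identifiers are hypothetical.
FRAMEWORK = {
    ("univariate", "atomic"): {
        "quantitative": "Type I: Uncommon number anomaly",
        "qualitative": "Type II: Uncommon class anomaly",
        "mixed": "Type III: Simple mixed data anomaly",
    },
    ("multivariate", "atomic"): {
        "quantitative": "Type IV: Multidimensional numerical anomaly",
        "qualitative": "Type V: Multidimensional categorical anomaly",
        "mixed": "Type VI: Multidimensional mixed data anomaly",
    },
    ("multivariate", "aggregate"): {
        "quantitative": "Type VII: Aggregate numerical anomaly",
        "qualitative": "Type VIII: Aggregate categorical anomaly",
        "mixed": "Type IX: Aggregate mixed data anomaly",
    },
}


def basic_anomaly_type(data_type, cardinality, level):
    """Return the basic anomaly type for one cell of the framework."""
    row = FRAMEWORK.get((cardinality, level))
    if row is None or data_type not in row:
        # The framework has no cell for, e.g., univariate aggregate anomalies.
        raise ValueError(
            f"no cell for {data_type!r}, {cardinality!r}, {level!r}"
        )
    return row[data_type]


print(basic_anomaly_type("mixed", "multivariate", "atomic"))
```

The sketch deliberately raises an error for the univariate/aggregate combination, because the framework as depicted contains no such cell.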

anomalies exist and the options that a data analyst can exploit. However, this study exclusively uses the intrinsic properties of the data to define and distinguish between the different sorts of anomalies, because this yields a typology that is generally and objectively applicable. Referencing external and unknown phenomena in this context would be problematic because the true underlying causes usually cannot be ascertained, which means distinguishing between, e.g., extreme genuine observations and contaminants is difficult at best and subjective judgments necessarily play a major role [2, 4, 5, 34, 314, 323]. A data-centric typology also allows for an integrative and all-encompassing framework, as all anomalies are ultimately represented as part of a data structure. This study’s principled and data-oriented typology therefore offers an overview of anomaly types that not only is general and comprehensive, but also comes with tangible, meaningful and practically useful descriptions.

To end this section it is good to note that many valuable classifications of anomaly detection techniques are available [5, 7, 13, 14, 55, 84, 135, 150–152, 299–301, 318–320, 330]. Because the core focus of the current study is on anomalies, detection techniques are only discussed if valuable in the context of the typification of data deviations. A review of AD techniques is therefore out of scope, but note that the many references direct the reader to information on this topic.

3 A typology of anomalies

3.1 Classificatory principles

This section presents the five fundamental data-oriented dimensions employed to describe the types and subtypes of anomalies: data type, cardinality of relationship, anomaly level, data structure, and data distribution. The typology’s framework, as depicted in Fig. 2, comprises three main dimensions, namely data type, cardinality of relationship and anomaly level, each of which represents a classificatory principle that describes a key characteristic of the nature of data [57, 96, 101, 106]. Together these dimensions distinguish between nine basic anomaly types. The first dimension represents the types of data involved in describing the behavior of the occurrences. This pertains to the data types of the attributes responsible for the deviant character of a given anomaly type [10, 57, 96, 97, 114, 161]:

• Quantitative: The variables that capture the anomalous behavior all take on numerical values. Such attributes indicate both the possession of a certain property and the degree to which the case may be characterized by it and are measured at the interval or ratio scale. This kind of data generally allows meaningful arithmetic operations, such as addition, subtraction, multiplication, division, and differentiation. Examples of such variables are temperature, age, and height, which are all continuous.


Quantitative attributes can also be discrete, however, such as the number of people in a household.
• Qualitative: The variables that capture the anomalous behavior are all categorical in nature and thus take on values in distinct classes (codes or categories). Qualitative data indicate the presence of a property, but not the amount or degree. Examples of such variables are gender, country, color and animal species. Words in a social media stream and other symbolic information also constitute qualitative data. Identification attributes, such as unique names and ID numbers, are categorical in nature as well because they are essentially nominal (even if they are technically stored as numbers). Note that although qualitative attributes always have discrete values, there can be a meaningful order present, such as with the ordinal martial arts classes ‘lightweight,’ ‘middleweight’ and ‘heavyweight.’ However, arithmetic operations such as subtraction and multiplication are not allowed for qualitative data.
• Mixed: The variables that capture the anomalous behavior are both quantitative and qualitative in nature. At least one attribute of each type is thus present in the set describing the anomaly type. An example is an anomaly that involves both country of birth and body length.

The second dimension is the cardinality of relationship, which represents how the various attributes relate to each other when describing anomalous behavior. These attributes are individually or jointly responsible for the deviant character of the occurrences [39, 59, 96, 100, 105, 106, 136, 158, 285]:

• Univariate: Except for being part of the same set, no relationship between the variables exists to which the anomalous behavior of the deviant case can be attributed. To describe and detect the anomaly, its variables can therefore be referred to separately. In other words, the analysis can assume independence between the attributes.
• Multivariate: The deviant behavior of the anomaly can be attributed to the relationship between its variables. The anomaly needs to be described and detected by referring to the joint distribution, meaning the individual attributes cannot be studied separately. Variables have to be analyzed jointly in order to take into account their relation, i.e., their combination of values. The term ‘relationship’ should be interpreted broadly here and includes correlations, partial correlations, interactions, collinearity, concurvity, (non)collapsibility, and associations between attributes of different data types.

The cardinality of relationship essentially refers to whether one attribute is sufficient to define and detect the anomaly type or whether multiple attributes need to be taken into account simultaneously. Note that, owing to the massive data collection nowadays, a dataset is likely to contain many attributes beyond the hosting subspace (i.e., the subset of attributes required to describe and detect a given anomaly). As a matter of fact, an occurrence can be deviant in one subspace and normal in others [133, 162–164, 180, 182, 303]. An occurrence could even be one type of anomaly in one subspace and another type in a second subspace.

The third dimension is the anomaly level, which represents the distinction between atomic anomalies (individual low-level cases or data points) versus aggregate anomalies (groups or collective structures). In theory this is also an independent dimension, but in practice univariate data only contain atomic anomalies. Multivariate data may also host aggregate anomalies, which alongside substantive attributes typically require data management attributes (e.g., time stamps or group designations) that allow the formation of collective structures. However, should future research introduce noteworthy examples of aggregate anomalies in univariate data, the framework can be easily extended to accommodate this.

The fourth dimension represents the data structure, which is used to distinguish between the subtypes within the nine cells of the typology. A given cell may contain multiple anomaly subtypes, which have defining characteristics that can be traced back to the specific data formats that host them, such as graphs and time series. Also note that the difference between dependent and independent data is treated as a characteristic of the data structure here. See Sect. 2.1 for an overview of the different structures.

The fifth dimension is the data distribution, which refers to the collection of attribute values and their pattern or dispersion throughout the data space [98, 165]. An anomaly, by definition, is characterized by its difference with regard to the remainder of the data, which makes the distribution of the dataset an important factor to take into account. The distribution is strongly dependent on the classificatory factors mentioned above, but allows focusing on density and other dispersion-related aspects of the set. It therefore offers additional descriptive and delineating capabilities (see footnote 1). This dimension will not only be used to distinguish between anomaly subtypes within the typology’s nine cells, but occasionally also to illustrate how an altered distribution would result in a different manifestation of a given anomaly.

Footnote 1: The term ‘distribution’ is usually not explicitly defined. One could argue that the concept of data distribution alone is sufficient to describe anomalies. However, such a simplified stance would defeat the purpose of this study, namely to offer fundamental and concrete insights into the nature and types of anomalies. The distribution here thus excludes the other dimensions.

The five classificatory principles of the typology are not only fundamental in the sense that they describe theoretically crucial properties of data, but also because they


deeply impact analysis and storage solutions. Some examples of this: jointly analyzing qualitative and quantitative data requires specialized multivariate techniques; analyzing dependent data usually needs to account for autocorrelation; and locating clusters and other patterns in multidimensional data implies discovering inter-variable relationships and dealing with exponential scaling issues as datasets increase in size.

The preliminary typology presented in [69, 70, cf. 6] is summarized in the first row of Table 1 (it essentially lacked the lowest layer for aggregate anomalies present in Fig. 2). Although this version was able to implicitly accommodate complex anomalies, several discussions at conferences pointed to the fact that the types in the multivariate row were rather broad and demanded further subdivision and clarification. Atomic and aggregate anomalies are therefore acknowledged explicitly in the framework now, yielding nine basic anomaly types. In addition, some of the terminology is updated. The new typology framework is depicted in Fig. 2. Detailed subtypes are included in Fig. 3 and illustrated in the data plots throughout this article. The nine main types of anomalies, which follow naturally and objectively from the classificatory principles, are described in Sect. 3.2. (Note on visualization: in many of the diagrams the data points’ colors and shapes represent different categorical values; the reader might want to zoom in on a digital screen to see colors, shapes and patterns in detail.)

3.2 Overview of anomaly types and subtypes

This section presents the anomaly types and their concrete subtypes. The typology’s rows represent three broad groups of anomaly types. Atomic univariate anomalies are single cases with a deviant value for one or possibly multiple individual attributes. They are relatively easy to describe and detect because the individual values of these observations are unusual. Relationships between attributes or cases are not relevant for such occurrences. An example is an extremely high numerical value, such as a person reported to be 246 cm tall. There may be several anomalous atomic occurrences (albeit apparently not in a way so as to form a ‘normal pattern’), but the essence is that each individual case is anomalous in its own right. Moreover, should an individual case have multiple unusual values, then each of them will be anomalous (e.g., not only 246 cm tall, but also 117 years old). Atomic multivariate anomalies are single cases whose deviant nature lies in their relationships, with the individual values not being anomalous. In independent data this will manifest itself in the unusual combination of a case’s own attribute values, such as a 10-year-old person with a body length of 180 cm. However, the multivariate nature also allows defining and detecting deviations in dependent data, i.e., in the relation with the other cases to which the given case is linked. An example is a time series temperature measurement that is unusually high for winter, but that would have been normal in summer. Atomic multivariate anomalies hide in multidimensionality, as they cannot be described and detected by simply analyzing the individual variables separately. Finally, aggregate anomalies are groups of cases that deviate as a collective, of which the constituent cases usually are not individually anomalous. Relationships between attributes and between cases play a key role here, not only to position an occurrence in the set with dependent data, but often also to form a pre-defined or derived group. Owing to their complex and intricate nature, these anomalies are generally the most difficult to describe and detect. A deviant subsequence is one manifestation of an aggregate anomaly, such as a whole winter with many unusually high temperatures compared to other winters. Although the above may be abstract at first reading, the detailed discussion below will offer a concrete understanding.

Before presenting the individual types and subtypes it is worthwhile to make some assumptions explicit. It is assumed that an atomic case (individual row) with p attributes represents a single data point in p-dimensional space (not a set of p distinct data points). A time series consists of multiple atomic data points, each of which has p attributes. The typology also assumes a parsimonious data structure, without redundant information. For example, the degree of a vertex, i.e., the number of edges that connect it to other vertices in the graph, is a latent characteristic whose value is seen as being derived at runtime during the analysis and is therefore assumed not to be explicitly included in the original dataset. When relevant the set does explicitly include management and dependency attributes, such as ID, sequence, group and link information, as these represent crucial structural properties (note that in the special case of text or symbolic data the sequence may be present implicitly). The typology also assumes the unaltered and original dataset in which one aims to declare anomalies. The reason for this is that, e.g., normalization [16, 86], dimensionality reduction [166], log transformations [167] and data type conversions [70] have all been shown to have significant impact on the presence and detection of anomalies. To be sure, transformations are allowed, but the typology then either assumes the newly derived dataset as the starting point for typification or remains agnostic as to any transformations performed as part of the AD algorithm. Finally, if one needs to choose between potential anomaly types, then the norm is to opt for the simplest type that captures the deviant occurrence (see the Discussion for more on this).

The types and subtypes are visualized schematically in Fig. 3 and discussed in detail in the remainder of this section. Since even the subtypes can be quite broad when multivariate in nature, ample examples are also provided.
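
The contrast between atomic univariate and atomic multivariate anomalies can be made tangible with the example of a 10-year-old person with a body length of 180 cm. The following Python sketch is only an illustration under stated assumptions: the age and height records are synthetic, the cut-off of 3 is an arbitrary choice, and per-attribute z-scores and the Mahalanobis distance are used as stand-ins for a separate and a joint analysis, respectively. None of these choices is prescribed by the typology.

```python
import math
from statistics import mean, stdev

# Synthetic (age in years, height in cm) records; all values are hypothetical.
BASE_HEIGHT = {6: 115, 7: 121, 8: 127, 9: 133, 10: 138, 11: 144, 12: 150,
               13: 156, 14: 162, 15: 168, 16: 172, 17: 175, 18: 178}
records = [(age, h + d) for age, h in BASE_HEIGHT.items() for d in (-3, 3)]
records.append((10, 180))  # normal age, normal height, unusual combination


def z_scores(values):
    """Standard scores of one attribute, analyzed on its own."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def mahalanobis_2d(points):
    """Distance of each point to the centroid, using the joint covariance."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


CUTOFF = 3.0  # arbitrary illustrative threshold
ages, heights = zip(*records)
univariate_hits = [r for r, za, zh in zip(records, z_scores(ages), z_scores(heights))
                   if abs(za) > CUTOFF or abs(zh) > CUTOFF]
multivariate_hits = [r for r, d in zip(records, mahalanobis_2d(records))
                     if d > CUTOFF]

print("separate analysis:", univariate_hits)
print("joint analysis:", multivariate_hits)
```

In this synthetic set the separate analysis flags nothing, because neither an age of 10 nor a height of 180 cm is unusual by itself, whereas the joint analysis singles out the combination.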

Fig. 3  The typology including all types and subtypes. Each anomaly subtype is represented by an icon that depicts the essence of the deviation. An icon that includes lines represents a set with dependent data. (Zoom in on a digital screen to see details.) The legend distinguishes normal from anomalous points or objects, and independent from dependent data. The subtypes listed per type are:

Atomic univariate anomaly (univariate, atomic)
Type I: Uncommon number anomaly (quantitative attributes): a) Extreme tail value; b) Isolated intermediate value
Type II: Uncommon class anomaly (qualitative attributes): a) Unusual class; b) Deviant repeater
Type III: Simple mixed data anomaly (mixed attributes): a) Extreme tail uncommon class; b) Intermediate uncommon class

Atomic multivariate anomaly (multivariate, atomic)
Type IV: Multidimensional numerical anomaly (quantitative attributes): a) Peripheral point; b) Enclosed point; c) Local density anomaly; d) Global density anomaly; e) Local additive anomaly; f) Deviant numerical spatial point (typically in images); g) Deviant numerical spatio-temporal point (typically in videos)
Type V: Multidimensional categorical anomaly (qualitative attributes): a) Uncommon class combination; b) Deviant categorical vertex; c) Deviant categorical edge
Type VI: Multidimensional mixed data anomaly (mixed attributes): a) Incongruous common class; b) Incongruous common sequential class; c) Deviant vertex; d-f) Unusual vertex insertion/change/removal; g) Deviant edge; h-j) Unusual edge insertion/change/removal; k) Deviant spatial point (typically in geo data); l) Deviant spatio-temporal point (typically in geo data)

Aggregate anomaly (multivariate, aggregate)
Type VII: Aggregate numerical anomaly (quantitative attributes): a) Deviant cycle; b) Temporary change; c) Level shift; d) Innovational outlier; e) Trend change; f) Variation change; g) Deviant numerical spatial region (typically in images); h) Deviant numerical spatio-temporal region (typically in videos); i) Point-based aggregate anomaly; j) Distribution-based aggregate anomaly
Type VIII: Aggregate categorical anomaly (qualitative attributes): a) Deviant class aggregate (typically in texts); b) Deviant categorical subgraph; c) Deviant relational aggregate
Type IX: Aggregate mixed data anomaly (mixed attributes): a) Class change; b) Deviant class cycle; c) Deviant class sequence; d-i) Deviant isolation/shift/shape/amplitude/trend/variation sequence; j) Deviant subgraph; k-r) Appearing/disappearing/flickering/merging/splitting/growing/shrinking/eccentric (sub)graph; s) Deviant spatial region (typically in geo data); t) Deviant spatio-temporal region (typically in geo data); u) Point-based mixed data aggregate anomaly; v) Distribution-based mixed data aggregate anomaly

3.2.1 Atomic univariate anomalies

This section provides an overview of anomaly types that consist of a single case with a deviant value for one or possibly several attributes, with each individual value being deviant in its own right. The more unusual a value is or the more attributes take on unusual values, the more anomalous the respective case is.

I. Uncommon number anomaly
This is a case with an extremely high, low, or otherwise unusual value for one or multiple individual quantitative attributes [5, 97, 168]. These deviant numbers often manifest themselves as an extreme tail value (depicted as subtype ST-Ia in Fig. 3). They are hosted by the given attribute’s numerical vector, which may contain one or more extreme values at the far ends of its statistical distribution. Figure 4 shows two plots of the national Polis administration of Dutch income transactions [6], with various ST-Ia occurrences. Figures 5 and 6 present examples as well. Traditional univariate statistics typically offers methods to detect this subtype, e.g., by using a measure of central tendency and a given degree of variation [5, 7, 30, 96, 97, 143, 153, 184]. Cases that clearly exceed a threshold are considered extreme and very distant ST-Ia instances. It follows that cases lying near this decision boundary, so-called fringeliers, are more difficult to interpret [34, cf. 5, 132]. Extreme tail values are literally ‘outliers’, as they lie in an isolated region of the numerical space. However, this does not necessarily mean the case is located far from the other data. An example of an extreme value lying relatively near the bulk of the data points can be found in real-world income data, as depicted at the left of Fig. 5 (which in this context may point to an error or improvised corrective transaction). Moreover, given a different data distribution, isolated low-density values can also be located in the middle of the value range [5, 59, 169, 307]. These isolated intermediate values (subtype ST-Ib) do not only lie outside the dense regions, but also in between them. They can manifest themselves, for example, in multimodal or disjoint probability distributions, where they may be extreme members of one of the populations. Traditional AD techniques for ST-Ia anomalies often cannot detect ST-Ib cases (see the Grubbs and GESD example in the Discussion section).

The distribution of the variable also affects in other ways how Type I anomalies can manifest themselves [1, 153, 170, 171]. Skewed distributions [39, 167, 172], leptokurtic distributions [173, 174] and heavy-tailed distributions [131, 175, 176] tend to generate substantially more extreme cases than normal distributions do. Masking and swamping are also relevant from a distributional perspective [1, 2, 28, 36, 177–179].

II. Uncommon class anomaly
The unusual class (ST-IIa) is a case with a unique or rare categorical value for one or several individual qualitative variables. The studies [60, 184, 309] discuss this type of anomaly. Case ST-IIa in Fig. 4 is a truly unique class, with the orange color representing the sole instance of the respective categorical value. The two ST-IIa cases in the left panel of Fig. 6 are the only square and reversed triangle. The red

Fig. 4  Real-world income data from the Polis administration with anomalies shown as large dots. The left plot has two and the right plot three
numerical variables (wage and social charges). The social security code attribute is represented by color
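
The difference between an extreme tail value (ST-Ia) and an isolated intermediate value (ST-Ib) can be sketched in a few lines of Python. The example below is a hedged illustration, not one of the methods referenced in the text: the data are synthetic, `tail_rule` is a plain z-score rule standing in for traditional tail-oriented tests, and `gap_rule` is a hypothetical nearest-neighbor check whose factor of 10 is chosen arbitrarily.

```python
from statistics import mean, median, pstdev

# Synthetic bimodal attribute with one value (5.0) isolated between the modes.
values = [0.9, 0.95, 1.0, 1.05, 1.1, 1.2, 5.0, 8.9, 8.95, 9.0, 9.05, 9.1, 9.2]


def tail_rule(xs, threshold=3.0):
    """Flag values far from the mean, measured in standard deviations."""
    m, s = mean(xs), pstdev(xs)
    return [x for x in xs if abs(x - m) / s > threshold]


def gap_rule(xs, factor=10.0):
    """Flag values whose nearest neighbor is unusually far away."""
    ordered = sorted(xs)
    nearest = []
    for i, x in enumerate(ordered):
        left = x - ordered[i - 1] if i > 0 else float("inf")
        right = ordered[i + 1] - x if i < len(ordered) - 1 else float("inf")
        nearest.append(min(left, right))
    typical = median(nearest)
    return [x for x, d in zip(ordered, nearest) if d > factor * typical]


tail_hits = tail_rule(values)
gap_hits = gap_rule(values)
print("tail rule:", tail_hits)
print("gap rule:", gap_hits)
```

Because the isolated value lies close to the overall mean, the tail-oriented rule does not flag it, while the density-oriented gap rule does. This mirrors the observation that techniques aimed at ST-Ia cases often cannot detect ST-Ib cases.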

Fig. 5  (Left) Univariate social charges data from the Polis administration. Note that the vertical dimension represents random scatterplot jitter for visualization purposes. (Right) Two-dimensional synthetic dataset (axes x1 and x2). Cases labeled in the plots: ST-Ia and ST-IVb

and orange colors of the ST-IIa points in the right panel of Fig. 6 are non-unique rare code values. These may not be clear-cut anomalies and thus may at some moment in the AD process demand that the concept of rarity is arbitrarily defined, e.g., by using a threshold [60, 85, cf. 34]. The deviant repeater (ST-IIb) is a category that occurs frequently, while the norm is to occur infrequently due to a highly skewed distribution. Such anomalies can occur in, e.g., identification, IP address or name attributes [181].

III. Simple mixed data anomaly
This is a case that is both a Type I and a Type II anomaly, i.e., with at least one isolated numerical value and one uncommon class. The subtype extreme tail uncommon class (ST-IIIa) has a rare or unique class value at the tail of the distribution, whereas the subtype intermediate uncommon class (ST-IIIb) has an unusual class at an isolated intermediate location in the numerical space. Case ST-IIIa in Fig. 6 is an example. A Type III anomaly deviates with regard to multiple data types, requiring deviant values for at least two attributes, each anomalous in its own right [69, 70]. However, as with Type I and II anomalies, analyzing the attributes jointly is unnecessary because the case in question is not multivariately anomalous. In other words, this type requires a set of individually deviant attribute values, not a deviant combination of attribute values. This is fundamentally different from the types described in Sect. 3.2.2.

3.2.2 Atomic multivariate anomalies

This section discusses anomalies that comprise a single case with a deviant combination of attribute values. In dependent data the deviancy typically lies in the relationship between the cases.

IV. Multidimensional numerical anomaly
This is a case that does not fit the general patterns when the relationship between multiple quantitative attributes is taken into account, without showing unusual values for any of the individual attributes that partake in this relationship. Type IV anomalies may reside not only in independent data, but also in dependent data because the multivariate character of the set allows taking into account the inter-case relationships. In independent data the anomalous nature of a case of this type lies in the unusual combination of its numerical attribute values [38, 39, 52, 182, 185].

Several quantitative attributes therefore need to be jointly taken into account to describe and detect such an anomaly. An example is a person who is 182 cm tall and weighs 53 kilos, i.e., an unusual combination of normal individual values [34]. In independent data such a case is literally ‘outlying’ from the relatively dense multivariate clouds and is thus located in an isolated area [cf. 101, 179]. This can be a peripheral point (ST-IVa), such as illustrated by the ST-IVa cases in Figs. 4 and 6. A second subtype is the enclosed point (ST-IVb), which means the anomaly is surrounded by normal data. An example is an anomaly located inside an annular region [302, 303, 307], illustrated by case ST-IVb in Fig. 5. Another example is a case in a spiral shape or other maze-like distribution [305]. Some methods, such as one-class support vector machines or iForest, are able to detect peripheral points, but are not geared toward identifying enclosed points [6, 305].

Another question becomes relevant if the dataset contains multiple clusters that have different densities. The outlyingness of an individual case can then be seen as being dependent upon the degree of isolation relative to its local area rather than to the global space [8, 17, 53, 55]. A local density anomaly (ST-IVc) is a case that is only isolated in the context of its neighborhood. Techniques to detect these anomalies, such as LOF and LOCI, need to account for the density of both the case in question and its neighbors. See the Discussion section for more on the topic of locality. A dataset may also predominantly consist of random noise or large clusters, except for a few data points located close to each other. These points are global density anomalies (ST-IVd). Although they could be perceived as a single (albeit tiny) cluster, they are often conceptualized and detected as
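
The notion of a local density anomaly (ST-IVc) can be illustrated with a simplified density ratio. The Python sketch below is written in the spirit of neighborhood-based techniques but is explicitly not an implementation of LOF or LOCI: it merely divides a point's mean distance to its k nearest neighbors by the same quantity averaged over those neighbors. The data, the function names and the choice of k = 3 are all assumptions made for illustration.

```python
import math


def mean_knn_distance(points, k):
    """Per point: mean distance to its k nearest neighbors, and their indices."""
    mean_d, neighbors = [], []
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        mean_d.append(sum(d for d, _ in ranked) / k)
        neighbors.append([j for _, j in ranked])
    return mean_d, neighbors


def local_density_ratio(points, k=3):
    """Ratio > 1 means the point is sparser than its own neighborhood."""
    mean_d, neighbors = mean_knn_distance(points, k)
    return [
        mean_d[i] / (sum(mean_d[j] for j in neighbors[i]) / k)
        for i in range(len(points))
    ]


# Synthetic data: a dense cluster, a sparse cluster, and one point that is
# isolated only relative to the dense cluster next to it.
dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10 + 2 * i, 10 + 2 * j) for i in range(3) for j in range(3)]
points = dense + sparse + [(0.8, 0.1)]
candidate = len(points) - 1

global_scores, _ = mean_knn_distance(points, 3)
local_scores = local_density_ratio(points, 3)

most_isolated_globally = max(range(len(points)), key=global_scores.__getitem__)
most_isolated_locally = max(range(len(points)), key=local_scores.__getitem__)
print(points[most_isolated_globally], points[most_isolated_locally])
```

In this synthetic set a purely global distance score ranks a member of the sparse cluster highest, whereas the local ratio ranks the point near the dense cluster highest, which is the behavior the local perspective is meant to capture.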


Fig. 6  (Left) Synthetic set with two numerical attributes and two categorical attributes (color and shape); (Right) Real-world Polis set with one
categorical and three numerical attributes, and large dots representing anomalies
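
The local additive anomaly (ST-IVe), i.e., a short-lived spike that deviates from its local temporal neighborhood without being globally extreme, can be sketched as follows. This is a minimal illustration under assumptions: the temperature readings are synthetic, and both the window of two observations on each side and the threshold of 5 degrees are arbitrary choices rather than values taken from the literature.

```python
from statistics import mean, median, pstdev

# Synthetic weekly temperatures (degrees Celsius) running from winter to summer.
# The reading at index 4 is high for winter but ordinary for the series overall.
temps = [2, 3, 2, 3, 12, 3, 2, 3, 4, 6, 8, 10,
         12, 14, 16, 18, 20, 21, 20, 21, 22, 21, 20, 21]


def local_spikes(series, window=2, threshold=5.0):
    """Flag indices that deviate strongly from the median of nearby values."""
    hits = []
    for i, value in enumerate(series):
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        nearby = series[lo:i] + series[i + 1:hi]  # neighbors, excluding i
        if abs(value - median(nearby)) > threshold:
            hits.append(i)
    return hits


def global_extremes(series, threshold=3.0):
    """Flag indices that are extreme with regard to the whole series."""
    m, s = mean(series), pstdev(series)
    return [i for i, v in enumerate(series) if abs(v - m) / s > threshold]


local_hits = local_spikes(temps)
global_hits = global_extremes(temps)
print("local rule:", local_hits)
print("global rule:", global_hits)
```

The local rule uses the sequence position to define the neighborhood, so the substantive and sequence attributes are analyzed jointly. The global rule ignores the position and therefore does not flag the winter spike.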

individual cases [9, 184, 305, 307]. ST-IVc and ST-IVd occurrences could in principle also have univariate equivalents, although these do not seem to be discussed in the literature.

For independent data the description of a Type IV instance requires multiple substantive attributes, as illustrated in the examples provided above. With dependent data the anomaly may also be defined by a single substantive variable, e.g., temperature, although at least one other attribute is typically still needed to link the related individual cases. In a quantitative context this usually concerns time series data, comprising a variable for ordered linking and a numerical substantive measure such as weight, wage, volume or heart rate [16, 114]. The local additive anomaly (ST-IVe) captures anomalous observations in time series and other numerical sequences. It features a short-lived spike that deviates from the local temporal neighborhood (e.g., the current season or trend) without exhibiting globally extreme values [cf. 141, 187–189]. As such this subtype implies that the substantive and sequence attributes are acknowledged and described jointly, i.e., multivariately. Case ST-IVe in Fig. 8.A is an example. Note that a globally extreme occurrence, such as case ST-Ia in the same figure, is simply a Type I extreme tail value for which the time attribute is irrelevant (unless the position in the sequence should be known). If the anomalous event transpires slowly relative to the measurement resolution, it may span multiple observations and should be considered an aggregate Type VII occurrence. However, the classic definition of an additive anomaly is that it is an abrupt change that pertains to a single observation, i.e., a Type IV anomaly [21, 105, 138, 141, 193].

A deviant numerical spatial point (ST-IVf) is a case that is unusual due to its quantitative spatial and possibly substantive features. If time is also a relevant factor, the case is a deviant numerical spatio-temporal point (ST-IVg). Due to their quantitative nature these anomalies typically reside in images and videos, respectively [68, 122, 194]. An anomaly then is an individual pixel or voxel that, given its location in the frame and possibly in time, has an unusual color or multispectral measurement. Anomalies are known to occur at this granular data point level, for example in satellite imagery [67, 68]. However, anomalies in this context are usually aggregates (e.g., a group of pixels), so this topic will be discussed in detail in the section on Type VII deviations. ST-IVf and ST-IVg occurrences can also be geographical anomalies, but examples thereof will be discussed as Type VI cases because they usually reside in mixed data.

V. Multidimensional categorical anomaly
This is a case that does not fit the general patterns when the relationship between multiple qualitative attributes is taken into account, without showing unusual values for any of the individual attributes that partake in this relationship. In short, it is a case with a rare or unique combination of class values, which can reside in independent or dependent data. In independent data two or more substantive categorical attributes from the same case need to be jointly taken into


account to describe and identify a multidimensional cate- outbreak, a phenomenon observed for Covid-19 and other
gorical anomaly. An example is this curious combination of viruses [154–156]. Likewise, in a security context this may
values from three attributes used to describe dogs: ‘male’, ‘puppy’, and ‘pregnant’. A visual example is case ST-Va in Fig. 6, as it is the only red circle in the set, despite the fact that neither circles nor red shapes are unusual. These two illustrations are instances of an uncommon class combination (ST-Va). The studies [181, 195] deal with this subtype.

A high-dimensional set may also constitute a so-called corpus, in which the individual cases represent different texts (e.g., documents, blog posts, or e-mails). In this purely qualitative context the case’s word order is irrelevant for the anomaly’s description and detection. The anomaly may reside in unstructured or semi-structured documents, CSV files with a single message on each row, bag-of-words representations, or sets of a similar nature such as market basket data [100, 323]. A transaction consisting of an unusual combination of common retail products is an example of a market basket anomaly [196, 197]. Text cases such as blog posts or e-mails may be deviant because they contain unexpected topics or feature a different writing style [198–201]. See the ST-VIIIa subtype below for more details on text style and topic analysis.

Dependent data afford wholly different subtypes. A tree, essentially an acyclic graph comprising qualitative identifiers of parent and child nodes, is a data structure well-suited for hosting Type V anomalies. One subtype in this context is the deviant categorical vertex (ST-Vb). An individual node in a tree can be anomalous as a result of its structural relationships. This requires at least the vertex ID (a qualitative designation identifying the individual nodes) and the edges (parent–child relationships). For example, a leaf node per definition is dependent on its structural relationships: there needs to be a parent, but children are absent. A leaf node that is deviant due to its graph context, e.g., because it is the terminal node of an extremely short path to the root, is therefore an instance of an ST-Vb anomaly (see Fig. 7a). Other examples are vertices with an unusual amount of children (see Fig. 7a as well) and vertices connected by an edge that has unexpected labels [cf. 132]. Note that a vertex with a single uncommon categorical value is simply a Type II anomaly, since no dependent data are required to describe and detect it in a flat node list. An ST-Vb anomaly is not necessarily a node in a tree. It can also be part of a regular graph, assuming that weights or other numerical properties are not involved in the anomalous behavior. An example is a vertex that is entirely unconnected (see ‘A.F’ in Fig. 7b) or does not belong to an identifiable community [202]. A node can also be deviant because it is connected to an extreme amount of other vertices, which in various domains is known as a ‘super spreader’. In biology this refers to a single individual who disproportionately infects a large number of other people and as such contributes to the speed and degree of the

be an infected source node in a computer network that communicates with many other nodes, possibly with malicious intent [203, 296]. Graphs are well-equipped to deal with notions of locality by taking into account adjacent nodes or the broader community. This allows anomalies such as a vertex with a class label that is unexpected at that position in the graph [20, 204, 295]. Examples are a smoker in a group of non-smokers and vertex ‘B.X’ in Fig. 7b (seemingly mislocated in the ‘A’ group). The connectedness of neighboring nodes can also be analyzed [205]. An anomalous vertex then is a node whose neighbors are all highly connected or mostly unconnected. A node connected to two otherwise separated graph communities, such as vertex ‘A.E’ connecting the ‘A’ and ‘B’ groups in Fig. 7b, can also be seen as an anomaly [20, 206]. In real-world data such community-crossing occurrences may point to intrusion attempts [207]. A related subtype is the deviant categorical edge (ST-Vc). Many of the examples provided above for the ST-Vb subtype have an analogue for the deviant categorical edge. Examples are a relationship that connects two otherwise separated communities [cf. 20, 206], a hyperlink between two web pages with unrelated information [324], and a link that is attached to a vertex with an uncommon class label.

VI. Multidimensional mixed data anomaly
This is a case that does not fit the general patterns when the relationship between multiple quantitative and qualitative attributes is taken into account, without being an atomic univariate anomaly with regard to any of the individual attributes that partake in this relationship. It concerns a case with an unusual combination of qualitative and quantitative attributes, which can reside in both independent and dependent data. As with all multivariate anomalies, multiple attributes need to be jointly taken into account to describe and identify them. As a matter of fact, multiple data types need to be considered, as anomalies of this type per definition are comprised of both numerical and categorical variables.

In a set with independent data the anomalous case generally has a class value, or a combination of class values, that in itself is not rare in the dataset from a global perspective, but is only uncommon in its own neighborhood. Such cases therefore seem to be mislabeled or misplaced. The incongruous common class (ST-VIa) is such an anomaly and this subtype has been described in various studies [6, 70, 160, 208]. The right panel of Fig. 6 shows several real-world ST-VIa occurrences identified in a data quality analysis of the Polis administration, with multiple blue and pink dots seemingly misplaced or mislabeled. Not all detected anomalies necessarily represent erroneous data, as complex real-world phenomena sometimes simply result in strange (but correct) data. However, this specific analysis showed that


Fig. 7 Various types of anomalies in (a) a tree and (b) a cyclic graph. [Anomalies labeled in the figure: ST-Ia, ST-IIa, ST-Vb, ST-VIg, ST-VIIIb, ST-IXi]

some occurrences proved to be indicative of real data quality problems, which were subsequently remedied by improving the software [6]. Cases ST-VIa in Fig. 4 are also examples in this administration, as they are data points with a color (class label) rarely seen in their respective neighborhoods.

In dependent data a Type VI anomaly can manifest itself in many other ways. An incongruous common sequential class (ST-VIb) is an individual deviant in a sequence of class values of one or several substantive attributes. A quantitative time or sequence indicator is required here to link the dependent substantive values, although in symbolic data the order may be implicit. An example is the bold italic class phaseB at an unexpected position (the seventh element) in this symbolic sequence:

phaseA, phaseB, phaseC, phaseA, phaseB,
phaseC, phaseB, phaseA, phaseB, phaseC, phaseD,
phaseA, phaseB, phaseC

(Note that the bold underlined case phaseD is a Type II anomaly because it is an entirely novel class.) Another example can be found in a DNA segment. This is a symbolic sequence in which each of the characters represents one of four nucleotide bases, namely A, G, T or C [10, 57, 183]. After reading the data the individual characters of the genome sequence are automatically verified and corrected in order to obtain a complete and accurate representation of the chromosomes. The order of the base symbols herein contains information that can be used for this verification and correction task [121, 209]. The characters in this example constitute qualitative data, but the substantive information in ST-VIb occurrences may also consist of mixed data. In fact, the verification process that determines the quality of the DNA character reading often also utilizes the underlying quantitative chromatogram data that are available for a given base [ibid.]. Another example is when a numerical sequence variable, substantive numerical variables (e.g., ‘amount of money’) and categorical variables (e.g., ‘type of transaction’) form a time series that hosts anomalies. The blue dot in Fig. 10 is an example in crop biomass data, which may indicate a wrong label of the data point.

A graph, comprised of numerical weights and nominal vertex IDs and edge directions, is also a structure capable of hosting Type VI anomalies [cf. 112, 113, 148, 149]. In this context an anomaly can take the form of a deviant vertex (ST-VIc). A specific example is a node connected by multiple edges with high weights, which may be of interest because such a vertex potentially has a high impact in the network. From a security perspective such an ST-VIc node could be infected if it sends many packages or may be subject to a DDoS (distributed denial-of-service) attack if it receives many packages from a great number of sources [152, 210]. From a technical perspective a node with many high weights may also point to faulty equipment [205]. A node with exactly one very heavy link constitutes an anomaly as well, which in a who-calls-whom network could indicate a stalker who keeps calling one of his or her contacts [ibid.]. (Note that the edge with the overly high weight, such as shown in Fig. 7b, is itself an ST-Ia anomaly because it simply is an extreme value in a weight vector.) Attributed
graphs readily afford the detection of local anomalies, i.e., vertices featuring substantive values that differ from their neighbors [211]. An interesting example can be found in a graph representing individual people (vertices with attributes such as monthly income) and their friendships (edges that connect people). A person with an income below average and connected only to rich people will likely be a rare occurrence [169]. A deviant edge (ST-VIg) is another subtype of the Type VI anomaly. An example is a link with a weight that can be considered normal in the entire graph but is relatively high or low in the local community or subgraph, such as the ST-VIg example in Fig. 7b. In a communication network such edges may point to fake or redundant message exchanges between vertices [152].

In biology a phylogenetic tree represents the evolutionary relationships of species, individuals or genes from ancestors to descendants [106, 212, 213]. Such a tree shows that the brown bear and the polar bear evolved relatively recently from a common ancestor, while the split between these two species and the giant panda occurred closer to the root, thus in a more distant past. Although the topology can be represented with purely categorical data, biologists often use a mixed data tree in which the branch lengths (edge weights) represent the genetic distance. An individual tree branch (edge) can be anomalous if it is significantly shorter or longer than the other branches in that neighborhood of the tree, and as such may point to an interesting difference in the evolutionary rate of species or to a methodological problem [212, 214–219]. Individual branches may also be anomalous because they are unstable, which means these edges easily jump to very different positions in the tree. This may be observed when comparing trees generated with different samples or by different tree-building algorithms, or when the parameters are slightly changed [215–217].

When dealing with mixed data the focus of study can also be a dynamic graph, in which time-dependent behavior is taken into account [20, 112, 113, 149]. This may take the form of irregular changes in the respective time series, such as spikes or level shifts with regard to the edge weights, attribute values, or the frequency of vertices and edges [20, 112, cf. 204, 220]. See the different sequence-based subtypes for a full overview of how these dynamics can manifest themselves (i.e., ST-IVe and ST-VIb for atomic anomalies, and ST-VIIa-f and ST-IXa-i for aggregate occurrences). Notable events in this context can also be graph-specific, in the form of unusual insertions, modifications or removals of vertices (ST-VId-f) and edges (ST-VIh-j) in the network [112]. An example is an ‘evolutionary community outlier’, i.e., a node whose time-dependent behavior is different from that of its neighbors or community members [221]. A node that at some moment in time switches in terms of community membership can be regarded as an ST-VIe anomaly [112].

A deviant spatial point (ST-VIk) is a case with coordination data, often in combination with substantive properties, that can be seen as unusual. Although deviations in this context can be global [279], the explicit coordinates can be naturally exploited to define neighborhoods and detect local anomalies. For Type VI cases this typically concerns geographical sets, which are known to generally involve mixed data [126–128, 222]. An ST-VIk anomaly usually represents a unit location with one or more properties considered to be abnormal in that spatial neighborhood [211, 222, 223]. Examples that combine labels and the point’s area are a dam in a residential area and a car in the middle of an ocean [cf. 316]. When a time dimension is also present the set may host deviant spatio-temporal points (ST-VIl). These are cases with one or more values that seem unusual when both their temporal patterns and neighboring points are taken into account [277, 283]. For example, a case’s temperature, wind direction (e.g., ‘NNW’) and wind speed, measured at a given time and location, can be unexpected in the context of the historical data of that geographical area [128, 224]. Spatio-temporal anomalies like this have been shown to point to complex climatic phenomena such as El Niño [128, 184, 225].

3.2.3 Aggregate anomalies

This section provides an overview of aggregate anomalies. Such an anomaly is a group of related cases that deviate as a collective. The cases are generally not individually anomalous, but multiple cases are jointly involved in a deviation from the dataset’s regular inter-case patterns that can be expressed in terms of several qualitative and/or quantitative variables.

VII. Aggregate numerical anomaly
This is a group of related cases that deviate as a collective with regard to their quantitative attributes. Such an anomaly is typically found in time series data, in which it constitutes a subsequence of the entire sequence. A time series is capable of hosting a variety of aggregate anomaly subtypes, the discovery of which is strongly related to the task of change detection. A first subtype is the deviant cycle anomaly (ST-VIIa), illustrated in Fig. 8c. This subtype occurs when the time series consists of cycles, such as climatic seasons, that demonstrate similar patterns, with the discord following a different pattern [7, 16, 69, 189]. Deviant cycles can be observed in many natural and societal phenomena, and have also been shown to correspond to unexpected physical gestures that were originally represented as video imagery [114]. The cycles can generally be detected in an unsupervised fashion, but for very specific deviations, such as certain medically relevant heartbeat patterns in an electrocardiogram, a supervised approach may be required [226].
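
The unsupervised detection of a deviant cycle (ST-VIIa) can be illustrated with a minimal sketch. It assumes a univariate series with a known, fixed cycle length and compares every cycle with the element-wise median cycle. The function name `deviant_cycles`, the parameters `period` and `threshold`, and the median-based scoring are illustrative assumptions, not a method taken from the cited studies.

```python
from statistics import median


def deviant_cycles(series, period, threshold=3.0):
    """Return the indices of cycles whose shape deviates from the typical cycle.

    Hypothetical helper for illustration: `period` (the assumed fixed cycle
    length) and `threshold` (the cut-off on the robust score) are assumptions.
    """
    n_cycles = len(series) // period
    if n_cycles < 3:
        raise ValueError("need at least three complete cycles")

    # Cut the series into consecutive, non-overlapping cycles.
    cycles = [series[i * period:(i + 1) * period] for i in range(n_cycles)]

    # The element-wise median over all cycles acts as the typical cycle shape.
    typical = [median(c[j] for c in cycles) for j in range(period)]

    # Mean absolute deviation of each cycle from the typical shape.
    dist = [sum(abs(c[j] - typical[j]) for j in range(period)) / period
            for c in cycles]

    # Robust score: how far a cycle's distance lies above the median distance,
    # measured in units of the median absolute deviation (MAD).
    med = median(dist)
    mad = median(abs(d - med) for d in dist)
    if mad == 0:
        # Degenerate case: most cycles are equally close to the typical
        # shape, so any cycle that is further away is flagged.
        return [i for i, d in enumerate(dist) if d > med]
    return [i for i, d in enumerate(dist) if (d - med) / mad > threshold]


# Six cycles of length 4; the fourth cycle (index 3) follows a shifted pattern.
series = [
    0.0, 1.0, 0.0, -1.0,
    0.1, 1.1, 0.0, -0.9,
    0.0, 0.9, -0.1, -1.0,
    1.0, 0.0, -1.0, 0.0,
    -0.1, 1.0, 0.1, -1.1,
    0.0, 1.0, 0.0, -1.0,
]
print(deviant_cycles(series, period=4))
```

Because the score is computed per cycle rather than per data point, the sketch flags the cycle as a collective even when none of its individual values is extreme, which is the defining property of an aggregate anomaly. Real data with drifting or unknown cycle lengths would need a segmentation step that is not shown here.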


Fig. 8 Time series and panel data anomalies (panels A to D). Gray dots represent individual measurements, red lines show temporal dependencies. [Anomalies labeled in the figure: ST-Ia, ST-IVe, ST-VIIa, ST-VIIb, ST-VIIc, ST-IXe, ST-IXf]

Another subtype is the temporary change (ST-VIIb), which is a rise or fall of the substantive value that requires a certain period to get back to the regular level [138, 140]. This subtype has many practical uses, such as detecting wild fires and other ecosystem disturbances [304]. Figure 8b shows an example of an abrupt change that gradually returns to the normal range, a real-life example of which is the burst in the volume of news articles following an earthquake, major crime or other dramatic event [227, 228]. Another example is a so-called ‘transient’ in audio data, i.e., a sudden increase in amplitude followed by a decay [123, 229]. Examples of this are a gunshot and a microphone being dropped on the floor. Although the initial spike can be identified as one or several atomic extreme value anomalies, the slowly diminishing tail is also an intrinsic part of the anomaly. It renders this a collective that can only be described and detected as a combination of multiple cases and attributes. A traditional temporary change anomaly is often described as abruptly starting and gradually returning to the regular level [138, 140, 315]. However, the ST-VIIb subtype in this typology also allows gradual starts and sudden recoveries so as to accommodate more variations, akin to how the ‘patch outlier’ of [187] and the ‘spike’ and ‘stuck-at’ deviations of [314] can manifest themselves. An interesting astronomical example can be found in the light curves that represent the brightness of stars. Figure 9a shows the time series of the star WASP-47, based on publicly available observations from NASA’s Kepler space telescope [250]. The sudden dip in brightness, which extends over several data points, can be explained by various phenomena. In this case it represents a planet transiting the star, and such dimming events thus offer a way to discover exoplanets [230, 231]. Note that this exhibits only a 1% dip, which requires these measurements to be very precise.

Another subtype of a Type VII anomaly is the level shift (ST-VIIc), which is an abrupt structural change to a higher or lower value level, i.e., a permanent step change [138, 140]. Such a shift, illustrated in Fig. 8b, comprises at least two consecutive data points and should therefore be regarded as a dyad or a larger group. A variation of this subtype, the ‘seasonal level shift’, implies a step change of the level of the recurring cycles or their amplitude [140, cf. 188]. An innovational outlier (ST-VIId) is a complex anomaly that goes beyond a temporary change and usually consists of an initial shock followed by an impact on potentially both the cycle and trend components and may be temporary or permanent in nature [21, 138–141]. A trend change (ST-VIIe) represents the start or end of a structural trend [232, cf. 184, 233]. Finally, the variation change (ST-VIIf) is an anomaly in which the random fluctuation changes over time [313, 329]. For example, the variance that was observed earlier may suddenly increase or drop to zero, possibly due to sensor failures, low battery levels or environmental circumstances that exceed the hardware range [314]. Note that in practice the variation change may be observed simultaneously with a temporary change or level shift, so in certain situations it may be worthwhile to acknowledge such a combined subtype separately and add it to the typology.

When the data exhibit non-stationarity or drift, there may be an overlap with the aggregate time-dependent anomalies. Level shifts, trend changes, and possibly innovational outliers are associated with a permanent alteration of the time series. These are related to the types of change in the


Fig. 9 Real-world data: A and B are measurements from the Kepler space telescope. C is an aerial photograph. [Anomalies labeled in the figure: ST-VIIb, ST-IXf]

underlying distribution that are also acknowledged in the literature on concept drift [234].

Spatial and spatio-temporal data comprise a set of coordinates, a set of numerical substantive features and possibly a time dimension, and often manifest themselves as images and videos. Although sometimes residing at the data point level, an image or video anomaly usually represents a region-of-interest, i.e., an aggregate rather than an individual pixel or voxel [67, 68, 122, 194, 235, 281]. A deviant numerical spatial region (ST-VIIg) is a quantitative aggregate that is unusual due to its spatial and possibly substantive features, e.g., a surprising object, area or person in a static image. Due to the spatial context local anomalies can readily be acknowledged. For example, the small blue region at the right of Fig. 9c is anomalous with regard to its neighborhood but normal from a global perspective. Other occurrences may feature an unusual texture in an image, such as the different surface pattern to the south-east of the blue region. Identifying deviating textures has practical applications, including detecting scratches, dents and other sorts of damage during production processes in the manufacturing industry [312]. Nuanced high-level semantic anomalies, such as detecting a cat-like dog amongst multiple cats, can also be detected, although this may require prior theory, rules or supervised training because the needed domain knowledge cannot be leveraged otherwise [13, 310, 322].

A deviant numerical spatio-temporal region (ST-VIIh) is similar but also takes the time dimension into account [277]. This pertains to an area of the frame that is significantly different between images or video shots [194]. The identified differences between images or shots can reflect, e.g., moved objects, altered light sources, transformed colors and changed camera positions. The region of the image change can also cover the whole frame, e.g., when it is significantly brighter or inserted as a subliminal shot [68, 122, 235, 236]. Deviant spatio-temporal patterns also figure largely in the video surveillance of streets, parks and train stations. In non-crowded scenes anomalies may be an individual person or vehicle that demonstrates abnormal walking, running, crawling, driving or stopping behavior [68, 293]. In crowded scenes individual people cannot easily be distinguished, so the focus is on even more aggregated and abstract motion patterns that capture multiple subjects simultaneously [237]. Examples are pedestrians that move against the general flow, an empty local area that is normally crowded (e.g., a ticket gate during rush hour), and groups of people that obstruct traffic.

Aggregate anomalies are mostly found in dependent data. However, research has shown that they may also occur in i.i.d. data [288–291]. The first subtype in this context is the point-based aggregate anomaly (ST-VIIi) and simply consists of a set of anomalous data points, i.e., a microcluster of points that are individually deviant. The current typology positions them as an aggregate because the literature regularly discusses this subtype as such [e.g., 288, 290, 291]. However, in certain contexts it may be more appropriate to treat them as Type IV cases. The second subtype is the distribution-based aggregate anomaly (ST-VIIj), which necessarily is an aggregate anomaly because its deviant nature is dependent on group-level characteristics. A dataset may consist of different clusters, with the anomalous cluster exhibiting, e.g., a deviant variance, covariance, mean or group frequency [282, 290, 291]. The scagnostics approach also acknowledges relevant distributional metrics regarding


Fig. 10  Crop biomass time series, with color representing the class of crop. The large dot in cycle 2 highlights an atomic anomaly, i.e., a data
point with an unexpected class label, whereas cycle 14 is an aggregate anomaly

density and shape, such as to what degree a collection of data points is striated, stringy, skewed and clumpy [307, 308]. An example of a density-based VIIj anomaly is when it takes the form of an excess density pattern on top of the normally observed distribution [289].

VIII. Aggregate categorical anomaly
This is a group of related cases that deviate as a collective with regard to their qualitative attributes. A deviant class aggregate (ST-VIIIa) often manifests itself as a deviant text paragraph, section, or document. An individual symbolic segment in this context is not conveniently stored in a neatly structured format (as with Type V occurrences, with one case being a single document, e-mail, transaction, or blog post). The segment here exists as an aggregate concept, e.g., as multiple rows or sentences in an unstructured text or bag-of-words, or as multiple sections in a semi-structured text. For example, an aggregate of transactions in a market basket set, grouped over, e.g., time, region or client, may be anomalous. An ST-VIIIa anomaly may also be a novel text that differs with regard to topics and content [100, 198, 227, 238]. Similarly, a text may stand out in terms of style or tone [199, 201, 239]. This has practical applications because a segment with a deviant style may point to plagiarism [200]. Examples of textual and distributional features relevant here are the percentage of words of a given type (e.g., pronouns, adjectives, nouns, prepositions), the ratio of adjectives to nouns, the average number of syllables per word, spelling mistakes, the ratio of positive versus negative words, and the percentage of words that occur only once (note that the unique word itself is a Type II anomaly). These characteristics, which do not require word order, can be used to identify the most dissimilar texts.

Similar to Type V cases a tree or other graph is a structure suitable for hosting Type VIII anomalies. A deviant categorical subgraph (ST-VIIIb) is a subset of a graph and is unusual in terms of its qualitative attributes. A typical manifestation is a subtree, which comprises multiple vertices and edges. Various characteristics of the dataset may underlie the anomalous structure, such as an unusually long or short path from leaf node to root, or an otherwise uncommon path in the collection of paths. Figure 7a features two examples, namely the unusually short path ‘AA-AC-AH’ as well as vertex ‘BA’ in combination with its unusually large number of descendants. An uncommon split in a tree should also be regarded as an aggregate anomaly, which comprises at least two edges and three vertices. In addition to these topological aspects, the anomaly may involve substantive attributes. A subgraph that deviates from a regular pattern of domain-specific class labels is an example of this, such as linked vertices of classes that normally are observed to be unconnected [18, 204].

A deviant relational aggregate (ST-VIIIc) is an anomalous occurrence of a non-atomic concept represented by its relationships. Such complex collectives are usually comprised of several domain-specific entities that are interrelated by one-to-many relationships and are typically stored in different tables in a relational database [95, 124, 125]. The structural relations between the entities are modeled using nominal ID or key attributes, which can demonstrate patterns and deviations [19, 95, 240]. Similar to graphs, the topological structure alone, thus not taking into account any substantive attributes, allows for patterns and anomalies. For example, if normal cases relate to one or two relationship-instances of another entity, an occurrence with a large collection of such instances can be considered an anomaly, such as a person with seven jobs in income data. Also, suboptimal data models and systems may lead to several well-known database issues, namely insertion, deletion, and update anomalies [125]. An example is an ‘orphaned record’ whose foreign key relation points to a previously deleted record in the primary table [241]. Substantive categorical attributes may also play a role. For example when a case from a given entity type (stored in database table A) is unlike regular occurrences because it has an unexpected class value in a related entity type (stored in table B). A concrete example is the sales transaction of an item that at that moment had the unusual status of being out of stock. What is ‘expected’ can be based solely on patterns in the given dataset, but in practice often also on the data model, domain-specific rules or a supervised approach. Note that if numerical attributes are also part of the definition of the aggregate deviant, it will be a Type IX anomaly (to avoid
redundancy this variant is not explicitly described in the next section).

IX. Aggregate mixed data anomaly
This is a group of related cases that deviate as a collective with regard to both numerical and categorical variables. Owing to its mixed data and diverse ways in which relationships can manifest themselves, this type allows for a wide variety of complex anomaly subtypes. To start, an anomaly can be a deviant group of class values within the larger symbolic sequence of one or more attributes [5, 7, 85, 114, 183, 242]. Such subsequences can for example reside in DNA strings, system logs or time-based text messages. Order or time information is required to link the consecutive data points. A first subtype in this context is the class change (ST-IXa), in which a sequence suddenly changes its symbolic value or pattern. This is akin to the aforementioned level shift in purely numerical data (ST-VIIc) and comprises at least two consecutive cases. A simple example is when the function of a given real estate asset is at some point changed from ‘commercial’ to ‘residential’. A more complex manifestation occurs when the stochastic univariate or multivariate pattern of the given categorical attributes changes. This could be observed in a time series of market basket data when two or more products are increasingly bought together, with this pattern being nonexistent in the past [192]. A third example of the ST-IXa subtype is akin to the temporary change and can be observed when one or multiple news streams suddenly introduce a new topic, perhaps as a result of a major disaster or crime [227, 228]. (Note that the example of bursty news topics presented above as a Type VII anomaly simply pertains to the time-related frequency of news items on a given topic, whereas the current example focuses on the topics themselves, i.e., as a pattern of words.) In the real estate and market basket examples the underlying distribution of the dataset changes, meaning that it is subject to some form of data drift [190, 192, 234].

Another subtype that may be found in sequence data is the deviant class cycle (ST-IXb), in which an entire subsequence is anomalous because it does not adhere to the cycle pattern. The pair of bold italic classes (the seventh and eighth elements, phaseA and phaseC) in the following phase-sequence can therefore be regarded as such a subtype:

phaseA, phaseB, phaseC, phaseA, phaseB,
phaseC, phaseA, phaseC, phaseA, phaseB, phaseC,
phaseB, phaseA, phaseB, phaseC

(Note that the bold underlined phaseB case, the twelfth element, is an ST-VIb anomaly because it is a known single class at an unexpected position.) Such anomalous n-grams or other forms of subsequences are typically of interest in (intrusion) detection systems, where deviations from regular symbolic sequences may indicate a defect or attack [60, 85]. A concrete example is when the short login-password event, which usually is observed once every now and then, suddenly occurs very frequently in a row and as such may represent a hacking attempt [183, 275]. An example from a different domain is a DNA sequence that contains genes from another organism [243]. A more intricate variation of the ST-IXb subtype combines, apart from the sequence attribute, substantive numerical variables with categorical variables [244]. An interesting example can be found in crop biomass monitoring, in which vegetation quantities (represented by the NDVI index) and crop classes are included in a single time series [292]. Cycle 14 in Fig. 10 is an example, with the anomaly possibly pointing out a data quality problem because the cycle’s crop labels may very well be erroneous. A symbolic sequence can also be anomalous in its entirety when compared to other sequences. A real-life example of such a deviant class sequence (ST-IXc) would be an anomalous trace in a collection of traces representing logged system or business processes [85, 183, 294]. Another example can be found in a dedicated genomic database that contains an erroneously included DNA sequence from an entirely different organism [243]. Finally, subtype ST-IXc may represent an anomalous text paragraph, section or document. This is similar to the topic and style deviations of ST-VIIIa, although the order of the words now plays a crucial role. For example, texts may deviate in terms of sentiment, which is a relevant issue in detecting fake reviews [157]. Word order is especially important here to properly deal with negations and other types of sequential dependencies [157, 245, 246, 299].

Several related subtypes are unusual individual time series in a panel dataset hosting multiple time series [116, 247, 282]. The set’s substantive information consists of numerical (and possibly categorical) data, whereas the individual series are distinguished by a nominal attribute [57, 111]. The deviant sequence (ST-IXd-i) consists of six subtypes [142, 233, 280, 314]. The anomalous series may be isolated from the other series for a short time interval or an extended period. In case of the latter, the curve may be deviant due to a shift (having a normal shape but at a different location), a distinct shape (positioned at the same location as other curves but showing another shape), a different amplitude (having the same shape but with a different range), an unusual trend (that makes the sequence slowly drift away from the other sequences) or with different variation levels (exhibiting a different random fluctuation). Figure 8d shows an ST-IXe anomaly that has a deviant shift with respect to its categorical property represented by a blue color, as well as an ST-IXf occurrence with a deviant shape. Another example is a deviant sound recording in a music collection, such as a song that stands out in terms of loudness, rhythm, melody, harmony or timbre [248].

The field of astronomy also offers an intriguing example with the star KIC 8462852. When analyzing its light curves
as measured by the Kepler space telescope one can observe brightness dips, a type of event that was
classified earlier in this study as a temporary change anomaly. However, to understand exactly why
astronomers were so excited about KIC 8462852 it is necessary to view the anomaly in the context of
the usually observed dips. Figure 9a represents a normal occurrence, i.e., a smooth dip of around 1%
that lasts a couple of hours, which may very well point to a transiting exoplanet (which is indeed
the case here). KIC 8462852, on the other hand, exhibits several irregular, aperiodic brightness
fluctuations that last multiple days, with strong light dips of up to 22% [249]. Figure 9b shows this
anomaly, based on publicly available real data from the Kepler telescope [250]. Compared with the
normal dips of other stars this is thus an ST-IXf deviant shape sequence due to its erratic and
extended dimming. The cause of this anomalous event is unknown, but the fluctuations cannot be
attributed to the light being blocked by a planet orbiting the star. Hypotheses that have been put
forward are an uneven ring of dust, a swarm of asteroids or comet fragments, and even the spectacular
explanation of an artificial megastructure built by an advanced extraterrestrial civilization, i.e.,
a so-called Dyson sphere for harvesting the star’s energy [249, 250, 251].

Note that deviations with regard to seasonality, which besides the trend is another basic component
of time series [111, 232, 253], are captured by the deviant shape and amplitude subtypes.
Furthermore, the five ST-IXd-h subtypes can also be used to typify trajectory anomalies because time
series are conceptually very similar to sets with information on moving entities [5, 116, 119, 120,
278, 325]. Examples are a ship that deviates from the normal route [254] and a taxi that takes an
unusual number of turns [316]. Beyond this, time series are able to represent visual shapes of
physical objects. In this context sequences with a deviating shape have been used to detect skulls
from a primate species that differs from the rest of the collection [114].

A graph dataset can host a deviant subgraph (ST-IXj), which is an anomaly comprised of multiple
vertices and edges. A specific example is a group of linked vertices with significantly different
substantive values than those observed in other parts of the graph. The anomaly can also reside in a
set of domain-specific subgraphs, which are defined by a group ID or shared property. The deviation
of the subgraph may be due to unusual structural relationships, weights or attribute values.
Alternatively, a set of subgraphs may be derived by community detection, after which uncommon
(sub)communities can be identified [95, 169, 255]. The subcommunity below left in Fig. 7b with its
high weights forms an example of such a phenomenon. The evolutionary trees used in biology also
provide informative examples of graph anomalies. A so-called molecular clock may be assumed in such a
tree, in which the branch length (edge weight) represents a time estimate for the evolution of one
species into another [106, 213]. The evolutionary paths from the leaf nodes to the root need to have
the same length because the evolution of a group is generally expected to have a constant rate. A
given path from root to leaf that is significantly shorter or longer than the other paths can
therefore be seen as an example of an anomaly [106, 212, 214, 218, 219, 256]. A clade (i.e., a
subtree of an ancestor and its descendants) may also be considered anomalous if it is unstable when
certain parameters are changed, features conspicuous differences in its branch lengths, or has
several long branches, all of which may indicate methodological errors in the tree building process
[212, 215, 216, 218, 257, 258].

In dynamic graphs a time dimension affords analyzing the evolution of graphs and subgraphs [20, 112,
220, 259]. Analogous to Type VI graph dynamics this may take the form of the aforementioned
sequence-based anomalies such as spikes, level shifts and other changes (see the sequence-dependent
subtypes ST-IVe and ST-VIb for atomic anomalies, and ST-VIIa-f and ST-IXa-i for aggregate
occurrences). Such time-aware analysis of aggregate anomalies has been shown to be relevant for
detecting faults in a network of application services, where monitoring nodes and edges at the
individual level would not properly account for clusters and noisy traffic fluctuations [260].
Anomalous events can also be graph-specific, manifesting themselves as (sub)graphs that appear,
disappear, flicker, merge, split, grow, shrink or demonstrate eccentric behavior (ST-IXk-r) [20, 112,
221, 261, 262, 287]. An example of a subgraph exhibiting highly eccentric behavior, which
incidentally requires a rule-based detection approach, is a fraudulent group in a financial network
that features specific trading ring patterns [263]. The group members first follow a ‘blackhole’
pattern by exclusively trading amongst themselves, and subsequently a ‘volcano’ pattern by selling
the stocks, which by then have increased in price, to non-involved traders.

A deviant spatial region (ST-IXs) is an aggregate that is unusual due to its quantitative spatial and
possibly substantive features. Such an anomaly is typically hosted by a geographical dataset with
mixed data. An example is a deviant area, i.e., a polygon comprised of multiple lines, such as a land
parcel with an unusual structure, color or class (e.g., water, development area or coastal scrub).
Another example is a (poly)line located in an area where that class of object is normally not found,
such as a river in the middle of a lake. These are examples that require not only linking different
data elements to form polylines or other aggregates, but also relating them to both the wider area or
polygon they are located in as well as any relevant domain-specific attributes. This is illustrated
in [222, cf. 223] by counties (represented as polygons) in the USA, with Los Angeles and downtown
Chicago being anomalies with an unusually high population density compared to adjacent counties.
Spatial data can also be analyzed from a time perspective. A deviant
spatio-temporal region (ST-IXt) is a polygon or other aggregated object with one or more substantive
values that deviate when both its temporal pattern and spatial area are taken into account [279,
283]. Alternatively, it can be viewed as a temporal occurrence that shows a pattern unlike its
spatial neighbors. In other words, it is a deviant (sub)sequence, such as one of the ST-IXc-i
subtypes described above, but typically one that is uncommon in the local region rather than in the
entire global space [224, 277]. An example is a region and time interval where the risk of suffering
from a global disease outbreak is increasing significantly faster than elsewhere [281]. A more
specific manifestation is a ‘flow anomaly’, a spatially marked subsequence that does not adhere to
the pattern of values flowing from one location to another with a given time lag [283, 284]. Such
occurrences may be observed if multiple sensors are placed in rivers and may point to flood
conditions or chemical spills.

Finally, i.i.d. data with mixed data types may also host aggregate anomalies. A point-based mixed
data aggregate anomaly (ST-IXu) is a group of cases that each have one or more unusual categorical
and numerical values [290]. A distribution-based mixed data aggregate anomaly (ST-IXv) is a group of
cases that is unusual in terms of its categorically and numerically determined group-level
characteristics. An example is a cluster of neighboring points that is anomalous with respect to its
distribution of categorical values [288]. See subtypes ST-VIIi and ST-VIIj for more information on
their purely numerical counterparts.

To summarize, Sect. 3 has introduced nine basic anomaly types, each of which is discussed in a
tangible way by means of a variety of subtypes and ample real-world and synthetic examples. The basic
types are stable due to the fundamental classificatory principles of the typology, while the set of
subtypes is extensible.

4 Discussion

This section discusses several relevant topics, such as deciding on the anomaly type, the evaluation
of AD algorithms, explainable anomaly detection, and local versus global anomalies.

4.1 Deciding on the anomaly type

In order to sharply determine the nature of a given data deviation, one should opt for the simplest
applicable (sub)type. For example, a case that is a Type II anomaly will, by definition, also have
unique class value combinations in a larger subspace that includes additional categorical
attributes. However, this does not imply one should see this instance as a Type V anomaly. After all,
the deviation can be defined more parsimoniously. To be sure, this does not exclude the possibility
that the case in question is also a Type V anomaly, but that should then pertain to a different
subset of attributes.

There are several general principles to determine which (sub)types are simpler than others. The
univariate types are simpler than the multivariate types. Anomalies that require only qualitative or
quantitative attributes are simpler than those defined in terms of mixed data. Subtypes based on
independent data are simpler than those based on dependent data, and atomic anomalies are simpler
than aggregate types.

Simple subtypes such as the extreme tail value and unusual class may be part of a larger set with
dependent data, e.g., one that constitutes a graph. The fact that the graph structure need not be
referenced in the definition of these anomalies obviously does not prohibit the researcher from
meaningfully discussing the deviation in the context of the given graph. Often it makes perfect sense
to discuss a Type II occurrence as an ‘anomalous node’, as long as one acknowledges that the graph
structure itself is not required to define and detect the anomaly from a data perspective. The same
holds for other complex data structures, such as spatial data and time series. This insight is the
reason that the typology does not feature an anomaly subtype for a globally extreme high or low time
series value. This is simply a Type I anomaly and, unless the position in the sequence should be
identified, the time-related dependency with other cases in the dataset is irrelevant for defining
and detecting the deviation.

This section concludes with a brief tightening of the terminology used in this article. The terms
anomaly and deviant are synonyms and are used as general terms. The term outlier refers to
numerically isolated occurrences, such as the classical Type I and Type IVa-d cases. The term novelty
can also be defined more strictly, as referring to occurrences that in some way represent new and
hitherto unknown events or objects.

4.2 Evaluation of algorithms

In addition to offering an understanding of the nature and types of anomalies, the typology is aimed
at facilitating the evaluation of AD algorithms. This is a relevant contribution because most
research publications do not make it very clear which types can be identified by the anomaly
detection algorithms studied, nor do they position the targeted anomalies in a broader context.
However, given the wide variety of anomalies, it is clear that individual algorithms will be
incapable of identifying all types [6, 17, 60, 82, 84–86, 184]. Researchers can thus employ the
predefined typology to provide a clear and objective insight into the functional capabilities of
their AD algorithms by explicitly stating which anomaly (sub)types can be detected. Using the
typology in this way also gives due acknowledgment of the no free lunch

Table 2  Illustration of using the typology to evaluate anomaly detection algorithms.

| Algorithm | Type I | Type II | Type III | Type IV | Type V | Type VI | Remarks |
|---|---|---|---|---|---|---|---|
| Grubbs/GESD test | a: ✓ b: × | a: × b: × | a: ✓ b: × | a: × b: × c: × d: × | a: × | a: × | Also provides statistical significance metric. ST-IIIa will be detected using quantitative data only, and thus cannot be directly distinguished from ST-Ia. |
| SECODA | a: ✓ b: ✓ | a: ✓ b: ✓ | a: ✓ b: ✓ | a: ✓ b: ✓ c: ✓ d: ✓ | a: ✓ | a: ✓ | No data type transformations or rescaling required, but vulnerable to the curse of dimensionality. ST-IIb cases are represented by high anomaly scores instead of low scores. |
| Distance-based AD | a: ✓ b: ✓ | a: × b: × | a: ✓ b: ✓ | a: ✓ b: ✓ c: × d: ✓ | a: × | a: × | Needs rescaling for optimal performance that corresponds with human intuition. With pre-processing (e.g., dummy variables or IDF) categorical data can also be analyzed. Type III occurrences cannot be directly distinguished from Type I outliers. |
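
As a hedged illustration of the kind of per-subtype evaluation summarized in Table 2, the sketch below injects one labeled ST-Ia case and one labeled ST-IVa case into a small synthetic dataset and reports, for each subtype, whether two deliberately simple detectors recover it. Everything in it is an assumption made for illustration: the data, the thresholds, the function names, and the detectors themselves. The univariate z-score rule is only a stand-in for a univariate test (it is not Grubbs' test or GESD), and the k-nearest-neighbor rule is only one possible distance-based detector.

```python
import math
import statistics

# Synthetic "normal" cases: two correlated numerical attributes (y tracks x).
normal = [(i / 10, i / 10 + 0.05 * ((i % 3) - 1)) for i in range(-20, 21)]

# Injected anomalies, labeled with the subtype they are meant to represent.
injected = {
    "ST-Ia": [(6.0, 6.0)],    # extreme tail value on the individual attributes
    "ST-IVa": [(1.5, -1.5)],  # each value is ordinary, the combination is not
}

data = normal + [p for pts in injected.values() for p in pts]


def zscore_flags(points, limit=3.0):
    """Flag a case if any single attribute lies more than `limit` std devs from its mean."""
    flagged = set()
    for dim in range(2):
        values = [p[dim] for p in points]
        mean, std = statistics.mean(values), statistics.pstdev(values)
        flagged |= {p for p in points if abs(p[dim] - mean) / std > limit}
    return flagged


def knn_flags(points, k=3, limit=1.0):
    """Flag a case if the distance to its k-th nearest neighbor exceeds `limit`."""
    flagged = set()
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > limit:
            flagged.add(p)
    return flagged


def recall_per_subtype(flagged):
    """Share of injected cases of each subtype that the detector flagged."""
    return {name: sum(p in flagged for p in pts) / len(pts)
            for name, pts in injected.items()}


z_recall = recall_per_subtype(zscore_flags(data))
knn_recall = recall_per_subtype(knn_flags(data))
print("z-score rule:", z_recall)
print("kNN distance:", knn_recall)
```

With this toy data the univariate rule should recover the ST-Ia case but miss the ST-IVa case, whereas the distance-based rule should recover both, which mirrors the qualitative pattern in the corresponding table cells. A real evaluation would inject many instances per subtype and report metrics such as sensitivity per subtype.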

Table 3  Illustration of using the typology to evaluate anomaly detection algorithms, with the focus on the types.

| Type | Impact? | Useful? | Explanation [ED = equidepth / EW = equiwidth discretization] |
|---|---|---|---|
| I | Y | N | ED cannot discriminate between univariate numerical values and is intrinsically not equipped to detect this type. |
| II | N/Y | Y | ED is identical to EW when analyzing a single categorical attribute. It can be more useful than EW if the goal is to detect (non-unique) rare Type II anomalies in numerically high-density regions in an analysis of mixed data. |
| … | … | … | … |
| VI | Y | Y | ED tends to favor the detection of Type VI anomalies and can be more useful than EW if identifying them is the aim of the analysis. |
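
The Type I row of Table 3 can be made concrete with a small sketch. It bins a hypothetical univariate sample, consisting of 100 ordinary values plus one extreme tail value, with an equiwidth and an equidepth scheme and then counts how many cases share a bin with the extreme value. The data, the number of bins, and the helper functions are all assumptions introduced here for illustration; this is not the discretization code of any particular algorithm.

```python
from collections import Counter

# Hypothetical univariate data: 100 ordinary values plus one extreme tail value.
values = list(range(100)) + [500]
K = 10  # number of bins (an arbitrary choice for this sketch)


def equiwidth_bins(vals, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in vals]


def equidepth_bins(vals, k):
    """Assign each value to one of k bins holding (nearly) equal numbers of cases."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(vals)
    return bins


def bin_frequency(bins):
    """For every case, the number of cases that fall into the same bin."""
    counts = Counter(bins)
    return [counts[b] for b in bins]


ew_freq = bin_frequency(equiwidth_bins(values, K))
ed_freq = bin_frequency(equidepth_bins(values, K))
print("equiwidth: extreme value shares its bin with", ew_freq[-1] - 1, "other cases")
print("equidepth: extreme value shares its bin with", ed_freq[-1] - 1, "other cases")
```

Under equiwidth binning the extreme value ends up alone in a sparse bin, so a frequency-based score can single it out. Under equidepth binning every bin holds roughly the same number of cases by construction, so bin frequency carries no information about univariate extremeness.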

theorem [80–83] and the demand for transparent and explainable analytics and AI [71–75] (also see
next section).

However, the typology offers more opportunities for algorithm evaluation than merely clarifying which
types and subtypes can be detected, since the typology is ideally also used for creating test sets.
AD studies often evaluate algorithms by using existing AD benchmark datasets that are flawed [321].
Moreover, the common practice of treating (a sample of) a minority class as anomalies [9, 17, 86,
133, 159, 163, 182, 186, 195, 316] also poses problems. The cases of such a class may be very similar
because they are part of a true pattern, and may even be very similar to other normal classes in the
dataset. This latter situation can indeed be observed for several classes in the above-mentioned
real-world Polis administration. In addition, there is no guarantee that all relevant anomaly
subtypes will be present in such a test set.

A better approach for creating AD test sets is therefore to use the typology as a basis for
systematically creating and inserting instances of each relevant anomaly subtype in a real-world or
simulated dataset. Such an injection approach ensures that the different anomalies are present and a
thorough evaluation of the algorithm can subsequently be conducted. Researchers should aim to include
the subtypes that, given the domain and typology dimensions, are relevant for the problem being
studied.

Table 2 illustrates how the typology can facilitate the evaluation of several unsupervised
algorithms for detecting anomalies in independent data. The focus is mainly on the algorithms, which
are presented in the rows of the table, with the columns representing evaluation characteristics such
as the capability to detect the individual anomaly types, metrics such as the AUC (area under the
curve), F1 and sensitivity, as well as any remarks that may be relevant. Such an evaluation is
appropriate when various (versions of) algorithms are evaluated and evaluation characteristics need
to be presented at the level of the algorithm.

Grubbs’ parametric test aims at verifying whether a vector contains one or two outliers [30, 264],
while the related GESD procedure is intended for testing whether a vector contains multiple outliers
[35]. SECODA is a nonparametric AD algorithm for mixed data [6, 70]. Distance-based algorithms use a
nearest neighbor approach to detect anomalies [51, 52, 54]. One could also include several versions
of the algorithms or pre-processing steps. For example, with distance-based AD techniques the
numerical attributes could be normalized and the categorical attributes transformed using dummy
variables or IDF (inverse document frequency), and such techniques could be evaluated in several
combinations. Also, evaluation metrics based on the ROC (receiver operating characteristic) and
confusion matrices could be calculated and reported for each anomaly (sub)type separately.

Table 3 presents the anomaly types in the rows, with the columns providing more details on different
evaluation characteristics. This format was used in [70] for studying the impact of discretization
and is appropriate if the focus is mainly on the anomaly (sub)types and few algorithms are being
compared.

4.3 Explainable anomaly detection

The typology provides a generic framework for understanding anomalies through a data lens and is
relevant for both research and practice. Typifying an individual deviant case as a specific anomaly
subtype means it is described meaningfully in terms of five fundamental dimensions. This provides
transparency and explains the nature of the deviation, i.e., makes clear how it differs from the
normal cases in terms of key data characteristics. This adds value to an anomaly analysis because AD
algorithms yield an anomaly score or label, but, although they may be able to detect multiple
subtypes, typically do not provide information on how the individual anomalies differ from the rest
of the data. The typology provides a structured way to analyze this and clarify the detection
results.

For example, a general-purpose AD analysis on independent data, using, e.g., distance- or
density-based algorithms, may detect 12 subtypes, namely ST-Ia-b, IIa-b, IIIa-b, IVa-d, Va and VIa.
These obviously represent a wide variety of deviations that may be present in the dataset. If the
detection algorithm merely informs the analyst on whether (or to what degree) a case is anomalous,
the results therefore remain a black box. The typology can be used to bring about the necessary
clarification, with its 12 clearly distinguishable subtypes and names, as well as five dimensions
that explain how each subtype differs from normal data. Figure 4 illustrates different anomalies in
a real-life dataset. Some of these occurrences differ from regular data because they have an extreme
numerical value (ST-Ia) or have numerical values that, although individually normal, in combination
position them outside the regular space (ST-IVa). Some cases are deviant because they feature a rare
code (ST-IIa), while others have a code that, although normal, is unusual in their own numerical
neighborhood (ST-VIa). By employing the typology these different subtypes can be clearly
distinguished, labeled and understood. The plot also makes clear that visualization can be valuable
during this explanation process and can be of help in typifying the detected anomalies.

Understanding the nature of the anomaly types present in a set is also relevant because of their
different implications. In the context of data management AD may be conducted as an exploratory data
quality analysis, which can be used to decide what kind of automated checks and corrections should be
implemented. In real-time monitoring processes ST-Ia cases may then be handled with a simple
threshold, ST-Ib cases with an interval, ST-IVa cases with a rule that combines thresholds or
intervals, and perhaps it is decided that ST-IVc cases should be ignored because they merely
represent subtle deviations.

The typology also yields tangible clues for further interpretation and sense-making. For example, a
limited number of dispersed ST-Ia or ST-Ib anomalies are likely to be non-informative random events
or glitches, while the aggregate ST-VIIa, b, i and j occurrences may very well imply a more
significant event or mechanism, and ST-VIIc and e even come with a drift in the data that represents
a fundamental distributional change.

4.4 Local anomalies

There are several perspectives on the concept of locality, all of which can be meaningfully described
in terms of the dimensions of anomalies that have been put forward in this study. The first
perspective focuses on data structures with independent data and uses the cardinality of
relationship to define the difference between local and global. Univariate anomalies are simply seen
as global because they are unconditionally deviant relative to the remainder of the rows in the
dataset [69, 70]. A single variable describes the entire (univariate) data space, which renders an
unusual case in this view by definition a global anomaly. Therefore, when taking all the set’s cases
into account, Type I anomalies will always have an extremely low, high, or otherwise unusual
numerical value for the given attribute, without any condition and regardless of the other
attributes. A similar argument can be given for Type II and III anomalies. The multivariate anomaly
types, on the other hand, are only deviant given the categorical condition or the specific numerical
area the case in question is located in. This is due to the fact that a multivariate anomaly is only
unusual as a combination of values from multiple attributes and is therefore normal across the
entire one-dimensional space. This is clearly illustrated with the mixed data types of the ST-VIa
cases in the right plot of Fig. 6. These anomalies are normal with regard to the values of each
individual numerical and categorical attribute. However, while, e.g., the blue points are normal in
the global space, they are deviant in their own local numerical neighborhood. A local anomaly thus
exists in some area, subsegment or class of the data [278]. Other examples are provided by [57] in a
discussion on global versus local anomalies. A male with a body length of 175 cm is normal in the
general population, but is an anomaly within the local class of professional basketball players. The
reverse is also true: someone with a body length of 195 cm may be unusually tall with respect to the
population, but not when only considering the class of professional basketball players.
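
The body length example can be sketched in a few lines. The snippet below compares how unusual a 175 cm case is relative to the whole dataset with how unusual it is relative to its own class. The heights, the class labels, and the use of a z-score as the measure of deviation are all assumptions made for this illustration; they are not taken from [57] or from any real dataset.

```python
import statistics

# Hypothetical data: body length in cm plus a nominal class attribute.
general = [(160 + i, "general") for i in range(31)] * 3     # 160..190 cm
players = [(198 + i % 9, "basketball") for i in range(27)]  # 198..206 cm
case = (175, "basketball")                                  # the case under study
records = general + players + [case]


def z_score(value, reference):
    """Number of (population) standard deviations between value and the reference mean."""
    return (value - statistics.mean(reference)) / statistics.pstdev(reference)


all_heights = [h for h, _ in records]
class_heights = [h for h, c in records if c == case[1]]

global_z = z_score(case[0], all_heights)   # relative to the entire dataset
local_z = z_score(case[0], class_heights)  # relative to the case's own class
print(f"global z = {global_z:.2f}, local z = {local_z:.2f}")
```

In this toy setting the global z-score stays well within one standard deviation, while the class-conditional z-score is several standard deviations below the class mean. The case is thus unremarkable on the numerical attribute alone and only becomes deviant once the categorical condition is taken into account.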

A second perspective on the concept of locality focuses on data structures with independent data as
well, but uses the data distribution to make the distinction between local and global. Local
anomalies are described in terms of neighborhood density or similar characteristics, a perspective
that is particularly relevant if the set contains multiple clusters that differ in this regard [8,
17, 53, 55, 273]. The outlyingness of a single case can be seen as being dependent upon the degree of
isolation relative to its local neighborhood (rather than to the global space). Techniques for this
setting, such as LOF and LOCI, therefore determine the density of a case in relation to the density
of its neighboring points.

A third perspective on locality focuses on data structures with dependent data. In such datasets the
individual cases are by their very nature intrinsically related, allowing the data management or
contextual attributes used for linking the data points to naturally define neighborhoods [8, 126,
128, 222, 223]. Local anomalies in this context are typically deviations from autocorrelation
patterns. They can perhaps most clearly be illustrated with spatial data, in which latitude–longitude
attributes or other sorts of coordinates explicitly define locations and neighborhoods. An occurrence
is locally anomalous if the values of one or more substantive attributes are unusual in its own
spatial neighborhood, but normal in other regions. This logic also pertains to images and videos,
with data points being pixels or voxels with a fixed position in the canvas or frame. An example of a
local anomaly in an aerial photograph is an isolated tree in a certain region of the picture, while
a large forest is present in another region [67]. The small blue region in Fig. 9c is an example as
well. A similar reasoning holds for time series data, in which the timestamp explicitly positions
cases at time points and in periods, which are the temporal equivalents of spatial locations and
neighborhoods. A local anomaly then has a value that is normal when taking into account the entire
history, but unusual in the period in which the occurrence lies. The ST-IVe subtype is such an
anomaly and is illustrated in Fig. 8a. Local spatio-temporal anomalies can be viewed in the same
vein. Finally, graphs also feature explicit structural positions and neighborhoods, although their
visualization generally allows more freedom in where to graphically depict the nodes or cases.

4.5 Other classifications

The typology presented here offers an all-encompassing framework to describe the types of anomalies
acknowledged in the literature, on the condition that they can be defined in terms of their data
properties. For example, the typology can accommodate the well-known time series anomalies, i.e., the
additive, temporary change, level shift, and innovational outlier [138, 140, 141]. In fact, the
typology makes a more detailed distinction, because the classic additive outlier type does not
distinguish between outliers that are globally extreme and outliers that are only deviant from a
local perspective, e.g., in their climatic season [141]. This study’s typology identifies them as
extreme tail values (ST-Ia) and local additive anomalies (ST-IVe), respectively. Another notable
classification is that of [184], as it is both broad and data-oriented. However, it distinguishes
between 9 concrete anomaly classes instead of 63 and does not use any classificatory principles.

The anomalies acknowledged in classifications that are not data-centric are not explicitly present in
the current typology. For example, [96] presents classes of outliers that are not defined in terms of
observed data characteristics, but instead refer to external causal phenomena that are often beyond
the knowledge of the data analyst. This particularly holds for the procedural error (e.g., a data
entry mistake), the extraordinary event (e.g., a hurricane), and the extraordinary observation (an
unexplained measurement). Other classifications seemingly refer to the data, but are ultimately
grounded in phenomena external to the dataset. Causes of outliers presented in [136] are, e.g.,
measurement errors and data from other distributions. To ascertain whether this is the case, one
often requires additional information and subjective interpretation [2, 4, 34, 323]. This is by no
means to say that these conceptualizations are not valuable, because analysts should certainly
possess knowledge of the potential causes of anomalies. However, this study defines anomalies in
terms of observable data characteristics and five theoretical dimensions. It allows the objective and
principled declaration of anomaly (sub)types in a tangible and explainable fashion, using the data at
hand and leaving relatively little room for discussion and doubt.

The classification in [7] distinguishes between point, contextual and collective anomalies. This
differs from the three broad groups of this study’s typology (i.e., atomic univariate anomalies,
atomic multivariate anomalies, and aggregate anomalies), which are rooted in the requirements a
typology brings with it. The former classification features classes of anomalies that are not
mutually exclusive, because a collective anomaly can also be a contextual anomaly. This is an
undesirable property for any well-formed typology or classification, as the aim is to offer clear
distinctions between concepts [130]. Strong classificatory principles and mutual exclusiveness were
therefore demanded for the study at hand. The classification in [7] is also very general in nature,
yielding rather abstract anomaly types. This is made clear by the fact that Type I to VI occurrences
can all manifest themselves as a point anomaly, and a similar argument holds for contextual
anomalies. In order to provide a concrete understanding, the current study offers not only a
high-level framework hosting 3 broad groups and 9 basic types, but also a full typology with 63
tangible and detailed subtypes.

4.6 Conceptual variations

The concept of the anomaly refers to occurrences being both rare and different [9, 17, 319]. This can
be seen as ultimately referring to anomalies being non-concentrated and positioned in the lowest
density areas [13, 66, 311], with them being more extreme if they have more attributes that deviate
and with values that deviate in a more severe fashion [297, 298]. However, it is worth discussing
some nuances, mainly in the context of the typology’s data distribution dimension. For example, the
global density anomaly (ST-IVd) shows that for a numerical dataset in which random noise is the norm,
a small cluster actually forms the anomaly [184, 307]. Similarly, high-density regions may constitute
the anomalies in dependent data (ST-VIk-l), such as when detecting geographical hot-spots of traffic
accidents or disease outbreaks [66, 277]. Relatedly, categorical data may consist of nearly unique
values, meaning that values that do repeat form the exception (ST-IIb) [181].

Anomalies can also be deviant on only one or two attributes and show unremarkable behavior on the
other variables [57, 162, 163, 298]. A more complex manifestation is the high-density anomaly (HDA),
which is an occurrence that deviates from the norm but in some subspace is located in a high-density
region, and as such is positioned amongst or is a member of the most normal cases [297]. In
cross-sectional (independent) data, subtypes IIa-b, Va, and VIa may turn out to be a HDA in its most
basic form, but in principle all 63 subtypes can have high-density counterparts when taking into
account an additional subspace. HDAs can be interpreted as deviant occurrences that hide in
normality, and are especially relevant in misbehavior detection. Identifying them implies solving a
delicate balancing problem, taking both anomalousness and normalness into account. The centrally
positioned ST-IIa case in the left panel of Fig. 6 is a HDA, while the top one is positioned in a
moderate or low density area.

4.7 Domain-specific anomalies

The AD field features many overviews of anomalies that are specific to a given domain. An example is
the classification of [265, cf. 254], which describes a collection of anomaly types in maritime
traffic data that may point to erroneous or falsified messages. Examples of types are ‘too fast for
the given vessel,’ ‘vessel type incompatible with size,’ ‘non-declared flag change’ and ‘outside of
usual roads.’ Examples of anomaly types observed on stock markets are ‘forward rate bias,’ ‘the new
December effect,’ ‘short-term price drift’ and ‘the weekend effect’ [266]. Examples from the domain
of computer network traffic are ‘port scans,’ ‘denial of service attacks,’ ‘alpha flows,’ and
‘outage events’ [267]. Other domain-spe-

center anomalies [269], biological malformations [270], and wireless sensor network threats [152,
271]. Some of these domain types can be detected by unsupervised techniques (and can be typified
using this study’s typology). However, many types represent deviations from specific patterns and are
only relevant for a given domain or problem and thus require supervised or rule-based detection
methods.

5 Conclusion

This study has presented a comprehensive theoretical conceptualization of anomalies that offers a
concrete understanding of the nature and different types of anomalies in datasets. The contributions
of this study can be summarized as follows:

• It presents the first all-encompassing, theoretically principled, data-centric, general and
  domain-independent typology that offers a tangible understanding of the nature and types of
  anomalies. Apart from preliminary versions [6, 69, 70] no comparable typology, classification or
  conceptualization seems to have been published before. An extensive literature review has been
  conducted to ground the typology and overview of anomaly types in the rich contributions of extant
  research.
• Rather than presenting a mere summing-up, the anomaly types are discussed in terms of fundamental
  dimensions, or classificatory principles. These dimensions offer a deep insight into the nature of
  the theoretical concept of the anomaly, and, as they systematically partition the classificatory
  space, serve to differentiate between the various mutually exclusive types and subtypes. The study
  employs five data-oriented dimensions to understand and define the concept of anomalies and to
  distinguish between the multitude of anomaly types and subtypes. These five cardinal aspects of
  anomalies are the data type, cardinality of relationship, anomaly level, data structure, and data
  distribution. By employing these dimensions this work aims to turn the generally ‘vague’ view on
  anomalies into a grounded and tangible theoretical concept, and to yield a typology that is
  principled, meaningful, non-arbitrary and offers explanatory power. The typology’s framework is
  presented in Fig. 2, and the full typology, this study’s core contribution, is summarized in
  Fig. 3.
• By using three levels of abstraction the typology offers a hierarchical insight into the different
  manifestations of anomalies:

  o Three general groups: atomic univariate anomalies, atomic multivariate anomalies, and aggregate
cific classifications describe land parcel anomalies [268], data anomalies.


o Nine basic and stable types of anomalies.
o An extensible set of different subtypes that offer a concrete understanding of how the basic types can manifest themselves in datasets. Based on the data types, cardinality and level, the subtypes can be positioned principally and logically in one of the framework’s nine main anomaly type cells, within which they can be described in more detail using the data structure and distribution. This study has presented 63 subtypes, but future research may discover new ones, for example as a result of entirely new data structures. Figure 3 provides a summary of all subtypes.

• The typology can be used to meaningfully comprehend and explain the results of data analyses in both academia and practice, and to evaluate anomaly detection algorithms in a transparent and understandable fashion. It facilitates the creation of test sets and serves to clarify which types and subtypes can and cannot be detected by different (versions or parameterizations of) AD algorithms.
• More generally, this study shows that attention should not merely be paid to algorithms and the detection process, but also to understanding the anomalies themselves by using a data-centric perspective. This helps to comprehend and explain the data and ultimately the world and may offer opportunities for developing new knowledge. For this reason both academics and practitioners would do well to not see the field as limited to anomaly detection, but to adopt the broader perspective of anomaly analysis and detection (AAAD). After all, in addition to detecting anomalies it is important to understand and explain why a given occurrence is anomalous, especially because follow-up actions are often required to manage the identified deviations.
• Finally, the typology can be employed to scope and position academic studies, and to structure publications, courses, lectures, tutorials and projects on the basis of, for example, three general anomaly groups or nine basic types.

With these contributions this study aims to advance the maturity of the field, not only by presenting a comprehensive overview of anomaly types, but also by offering an overarching, integrative and fundamental conceptualization of the field’s focal topic, the anomaly. This research has centered its attention on anomaly types that can be meaningfully described in terms of data and can be identified by unsupervised AD methods. Future research may therefore extend the scope to rare classes and categories that require (semi-)supervised methods to be detected [164, 272]. Research has shown that these share various characteristics with anomalous occurrences that may be unlabeled and detectable by unsupervised methods [182, 196]. As the current study has primarily focused on anomalies that can be characterized by their intrinsic data properties and thus be detected in an unsupervised mode, this similarity may bring with it interesting opportunities. Another topic for further research is studying in more detail how the data distribution can be used for further classification and clarification of anomaly subtypes. The way in which the distribution impacts anomalies in graphs and other complex data structures can also be studied, for example in the context of the current trend of AD in data streams. Regardless of future work, it is hoped this study has shown the inspiring richness of anomaly analysis and detection as well as the many contributions this field can make to both science and practice.

Acknowledgements This research has been partly supported by HEINEKEN, Loonaangifteketen and UWV, but did not receive a specific grant from funding agencies, nor is there any conflict of interest. The author thanks Arno Schilperoord and the anonymous reviewers for their valuable contributions. An earlier version of this article was published on arXiv on July 30, 2020. Preliminary versions of the typology were presented in [69] and the poster of [6]. Feel free to contact the author if anomaly (sub)types are deemed missing.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Hawkins, D.M.: Identification of Outliers. Chapman & Hall, London (1980)
2. Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, Chichester (1994)
3. Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York (2008)
4. Boddy, R., Smith, G.: Statistical Methods in Practice: For Scientists and Technologists. John Wiley & Sons, Chichester (2009)
5. Aggarwal, C.C.: Outlier Analysis, 2nd edn. Springer, New York (2017)
6. Foorthuis, R.: SECODA: Segmentation- and Combination-Based Detection of Anomalies. Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics (DSAA 2017), Tokyo, Japan, pp. 755–764 (2017) (Also see the poster for the typology’s framework: Foorthuis, R.: Anomaly Detection with SECODA. Poster Presentation at DSAA 2017)
7. Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. ACM Computing Surveys, Vol. 41, No. 3 (2009)
8. Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to


spatial, video, and network outlier detection. Data Min. Knowl. Disc. 28(1), 190–237 (2014)
9. Goldstein, M., Uchida, S.: A comparative evaluation of unsupervised anomaly detection algorithms. PloS ONE, 11(4) (2016)
10. Shahbaba, B.: Biostatistics with R: An Introduction to Statistics Through Biological Data. Springer, New York (2012)
11. Taha, A., Hadi, A.S.: Anomaly detection methods for categorical data: a review. ACM Comput. Surv., 52(2) (2019)
12. Beckman, R.J., Cook, R.D.: Outliers. Technometrics 25(2), 119–149 (1983)
13. Ruff, L., Kauffmann, J.R., Vandermeulen, R.A., Montavon, G., Samek, W., Kloft, M., Dietterich, T.G., Müller, K.: A Unifying Review of Deep and Shallow Anomaly Detection. In: Proceedings of the IEEE, https://doi.org/10.1109/JPROC.2021.3052449 (2021)
14. Pimentel, M.A.F., Clifton, D.A., Clifton, L., Tarassenko, L.: A review of novelty detection. Signal Process. 99, 215–249 (2014)
15. Fu, T.: A review on time series data mining. Eng. Appl. Artif. Intell. 24, 164–181 (2011)
16. Esling, P., Agon, C.: Time-series data mining. ACM Comput. Surv. 45(1) (2012)
17. Campos, G.O., Zimek, A., Sander, J., Campello, R.J.G.B., Micenková, B., Schubert, E., Assent, I., Houle, M.E.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Disc. 30(4), 891–927 (2016)
18. Noble, C.C., Cook, D.J.: Graph-based anomaly detection. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003)
19. Maervoet, J., Vens, C., Vanden Berghe, G., Blockeel, H., De Causmaecker, P.: Outlier detection in relational data: a case study in geographical information systems. Expert Syst. Appl. 39(5), 4718–4728 (2012)
20. Akoglu, L., Tong, H., Koutra, D.: Graph-based anomaly detection and description: a survey. Data Min. Knowl. Disc. 29(3), 626–688 (2015)
21. Box, G.E.P., Jenkins, G.M., Reinsel, G.C.: Time Series Analysis: Forecasting and Control, 3rd edn. Prentice-Hall, Upper Saddle River (1994)
22. Bernoulli, D.: The Most probable choice between several discrepant observations and the formation therefrom of the most likely induction. In: Transl by C.G. Allen, Biometrika, 48(1–2), pp. 3–18 (1777) (The original was published in Latin in Acta Acad. Petrop.) (1961)
23. Legendre, A.M.: On the Method of Least Squares. In: Smith, D.E., A Source Book in Mathematics, Vol. II, pp. 576–579, McGraw-Hill 1929 and Dover 1959 (1805)
24. Peirce, B.: Criterion for the rejection of doubtful observations. Astron. J. 2(45), 161–163 (1852)
25. Glaisher, J.W.L.: On the Rejection of Discordant Observations. Mon. Not. R. Astron. Soc. 33, 391–402 (1873)
26. Edgeworth, F.Y.: On discordant observations. Lond. Edinburgh Dublin Philos. Mag. J. Sci. 23(143), 364–375 (1887)
27. Irwin, J.O.: On a criterion for the rejection of outlying observations. Biometrika 17(3/4), 238–250 (1925)
28. Pearson, E.S., Chandra Sekar, C.: The efficiency of statistical tools and a criterion for the rejection of outlying observations. Biometrika 28(3/4), 308–320 (1936)
29. Nair, K.R.: The distribution of the extreme deviate from the sample mean and its studentized form. Biometrika 35(1/2), 118–144 (1948)
30. Grubbs, F.E.: Sample criteria for testing outlying observation. Ann. Math. Stat. 21(1), 27–58 (1950)
31. Dixon, W.J.: Analysis of Extreme Values. Ann. Math. Stat. 21(4), 488–506 (1950)
32. Proschan, F.: Rejection of outlying observations. Am. J. Phys. 21(7), 520–525 (1953)
33. Kruskal, W.H.: Some remarks on wild observations. Technometrics 2(1), 346–348 (1960)
34. Wainer, H.: Robust statistics: a survey and some prescriptions. J. Educ. Stat. 1(4), 285–312 (1976)
35. Rosner, B.: Percentage points for a generalized ESD many-outlier procedure. Technometrics 25(2), 165–172 (1983)
36. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. John Wiley & Sons, New York (1987)
37. Osborne, J.W., Overbay, A.: The power of outliers (and why researchers should always check for them). Prac. Assess. Res. Eval. 9(6) (2004)
38. Hoaglin, D.C., Mosteller, F., Tukey, J.W.: Exploring Data Tables, Trends, and Shapes. Wiley, Hoboken (2006)
39. Tabachnick, B.G., Fidell, L.S.: Using Multivariate Statistics, 6th edn. Pearson, Boston (2012)
40. Anderson, J.P.: Computer Security Threat Monitoring and Surveillance. In: Technical Report, Washington (1980)
41. Mann, N.R.: Optimal outlier tests for a Weibull model - to identify process changes or to predict failure times. In: Technical paper, Office of Naval Research, Arlington, Virginia (1981)
42. Tietjen, G.L.: The Analysis and detection of Outliers. In: D’Agostino, R.B., Stephens, M.A., Goodness-of-Fit Techniques, pp. 497–522. Marcel Dekker, New York (1986)
43. Denning, D.E.: An intrusion-detection model. IEEE Trans. Softw. Eng. SE-13(2), 222–232 (1987)
44. Smaha, S.E.: Haystack: An intrusion detection system. In: Proceedings of the Fourth Aerospace Computer Security Applications Conference, pp. 37–44 (1988)
45. Lunt, T.F., Jagannathan, R.: A Prototype real-time intrusion-detection expert system. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 59–66 (1988)
46. Heberlein, L.T., Dias, G.V., Levitt, K.N., Mukherjee, B., Wood, J., Wolber, D.: A Network security monitor. In: Technical Report, University of California (1989)
47. Javitz, H., Valdes, A.: The SRI IDES statistical anomaly detector. In: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, USA (1991)
48. Vaccaro, H.S., Liepins, G.E.: Detection of anomalous computer session activity. In: Proceedings of the IEEE Symposium on Security and Privacy (1989)
49. Major, J.A., Riedinger, D.R.: EFD: A hybrid knowledge/statistical-based system for the detection of fraud. Int. J. Intell. Syst. 7(7), 687–703 (1992)
50. Burge, P., Shawe-Taylor, J.: Detecting cellular fraud using adaptive prototypes. In: Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management (1997)
51. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: VLDB-98, Proceedings of the 24th International Conference on Very Large Data Bases, New York (1998)
52. Knorr, E.M., Ng, R.T.: Finding intensional knowledge of distance-based outliers. In: VLDB-99, Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland (1999)
53. Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the ACM SIGMOD Conference on Management of Data (2000)
54. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, USA (2000)
55. Papadimitriou, S., Kitagawa, H., Faloutsos, C., Gibbons, P.B.: LOCI: Fast outlier detection using the local correlation integral.


In: Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE’03), Bangalore, India (2003)
56. Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the Ninth ACM SIGKDD, pp. 29–38 (2003)
57. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley, Boston (2006)
58. van der Loo, M., de Jonge, E.: Statistical Data Cleaning with Applications in R. Wiley, Hoboken (2018)
59. Mason, R.L., Young, J.C.: Multivariate Statistical Process Control with Industrial Applications. ASA-SIAM, Philadelphia (2002)
60. Maxion, R.A., Tan, K.M.C.: Benchmarking anomaly-based detection systems. In: First International Conference on Dependable Systems & Networks: New York, USA (2000)
61. Gartner: Hype Cycle for Data Science and Machine Learning, 2018. Gartner, Inc. (2018)
62. Forrester: The Forrester Wave: Security Analytics Platforms, Q1 2017. Forrester Research, Inc. (2017)
63. Anodot: Ultimate guide to building a machine learning anomaly detection system. Anodot (2017)
64. Riveiro, M.: Visual Analytics for Maritime Anomaly Detection. Örebro University, Örebro (2011)
65. Zimek, A., Campello, R.J.G.B., Sander, J.: Ensembles for unsupervised outlier detection: challenges and research questions. ACM SIGKDD Explor. 15(1), 11–22 (2013)
66. Schubert, E., Weiler, M., Zimek, A.: Outlier detection and trend detection: two sides of the same coin. In: Proceedings of the 15th IEEE International Conference on Data Mining Workshops (2015)
67. Matteoli, S., Diani, M., Corsini, G.: A tutorial overview of anomaly detection in hyperspectral images. IEEE Aerosp. Electron. Syst. Mag. 25(7), 5–28 (2010)
68. Roshtkhari, M.J., Levine, M.D.: An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions. Comput. Vis. Image Underst. 117(10), 1436–1452 (2013)
69. Foorthuis, R.: A typology of data anomalies. In: Proceedings of the 17th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2018), Cádiz, Spain, CCIS 854 (2018)
70. Foorthuis, R.: All or in-cloud: how the identification of six types of anomalies is affected by the discretization method. In: Atzmueller M., Duivesteijn W. (eds) Artificial Intelligence. BNAIC 2018. Springer, Communications in Computer and Information Science, vol. 1021, pp 25–42 (2019)
71. Mittelstadt, B.D., Allo, P., Taddeo, M., Wachter, S., Floridi, L.: The ethics of algorithms: mapping the debate. Big Data Soc., 3(2), July–December (2016)
72. Ziewitz, M.: Governing algorithms: myth, mess, and methods. Sci. Technol. Human Values 41(1), 3–16 (2016)
73. Marcus, G.: Deep learning: a critical appraisal. arXiv: 1801.00631 (2018)
74. O’Neil, C.: Weapons of Math Destruction. Crown Publishers, New York (2016)
75. EU: Ethics guidelines for trustworthy AI. In: The EU’s High-Level Expert Group on Artificial Intelligence. Brussels: European Commission (2019)
76. Lipton, Z.C.: The mythos of model interpretability. In: Proceedings of the ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York (2016)
77. Sculley, D. et al.: Hidden technical debt in machine learning systems. In: Proceedings of NIPS’15, Vol. 2, pp. 2503–2511 (2015)
78. Breck, E., Cai, S., Nielsen, E., Salib, M., Sculley, D.: What’s your ML test score? A rubric for ML production systems. In: Proceedings of NIPS’16 (2016)
79. Lazer, D., Kennedy, R., King, G., Vespignani, A.: The parable of Google Flu: traps in big data analysis. Science 343(6176), 1203–1205 (2014)
80. Wolpert, D.H., Macready, W.G.: No free lunch theorems for search. In: Technical Report SFI-TR-95–02–010, Santa Fe Institute (1996)
81. Clarke, B., Fokoué, E., Zhang, H.H.: Principles and Theory for Data Mining and Machine Learning. Springer, New York (2009)
82. Janssens, J.H.M.: Outlier selection and one-class classification. In: PhD Thesis, Tilburg University (2013)
83. Rokach, L., Maimon, O.: Data Mining With Decision Trees: Theory and Applications, 2nd edn. World Scientific Publishing, Singapore (2015)
84. Orair, G.H., Teixeira, C.H.C., Meira Jr., W., Wang, Y., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. Proceedings of the VLDB Endowment, 3(2) (2010)
85. Warrender, C., Forrest, S., Pearlmutter, B.: Detecting intrusions using system calls: alternative data models. In: Proceedings of the IEEE Symposium on Security and Privacy, Washington, USA, pp. 133–145 (1999)
86. Kandanaarachchi, S., Muñoz, M.A., Hyndman, R.J., Smith-Miles, K.: On normalization and algorithm selection for unsupervised outlier detection. In: Working paper, ISSN 1440–771X, Monash University (2018)
87. Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA (2004)
88. Collins: Collins Cobuild Advanced Learner’s English Dictionary. HarperCollins Publishers (2006)
89. Merriam-Webster: Merriam-Webster Online Dictionary. Retrieved 9 December 2018, URL: https://www.merriam-webster.com/dictionary/anomaly (2018)
90. Stevenson, A.: Oxford Dictionary of English, 3rd edn. Oxford University Press, Oxford (2010)
91. Kuhn, T.S.: The Structure of Scientific Revolutions, 3rd edn. The University of Chicago Press, London (1996)
92. Lakatos, I.: The Methodology of Scientific Research Programmes. Philosophical Papers, vol. 1. Cambridge University Press, Cambridge (1978)
93. Audi, R.: The Cambridge Dictionary of Philosophy, 2nd edn. Cambridge University Press, Cambridge (1999)
94. Hollis, M.: The Philosophy of Social Science: An Introduction. Cambridge University Press, Cambridge (1994)
95. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Elsevier, Amsterdam (2006)
96. Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E.: Multivariate Data Analysis. Seventh Ed. Pearson Prentice Hall (2010)
97. Bluman, A.G.: Elementary Statistics: A Step by Step Approach, Eight McGraw-Hill, New York (2012)
98. Cramer, D., Howitt, D.: The SAGE Dictionary of Statistics. SAGE Publications, London (2004)
99. Fielding, J., Gilbert, N.: Understanding Social Statistics. SAGE Publications, London (2000)
100. Lantz, B.: Machine Learning with R, 2nd edn. Packt Publishing, Birmingham (2015)
101. Johnson, R., Wichern, D.: Applied Multivariate Statistical Analysis. Sixth Edition. Pearson, Harlow (2014)
102. Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Elsevier, Amsterdam (2011)


103. Kimball, R., Ross, M.: The Datawarehouse Toolkit. The Complete Guide to Dimensional Modeling. Second Edition. Wiley, New York (2002)
104. Inmon, W.H.: Building the Data Warehouse, 3rd edn. Wiley, New York (2002)
105. Everitt, B.S., Skrondal, A.: The Cambridge Dictionary of Statistics, 4th edn. Cambridge University Press, Cambridge (2010)
106. Sullivan, R.: Introduction to Data Mining for the Life Sciences. Springer, New York (2012)
107. Law, K.S., Wong, C., Mobley, W.H.: Toward a taxonomy of multidimensional constructs. Acad. Manag. Rev. 23(4), 741–755 (1998)
108. Polites, G.L., Roberts, N., Thatcher, J.: Conceptualizing models using multidimensional constructs: a review and guidelines for their use. Eur. J. Inf. Syst. 21(1), 22–48 (2012)
109. Maaten, L.J.P. van der, Postma, E.O., Herik, H.J. van der: Dimensionality reduction: a comparative review. In: Technical Report, TiCC TR 2009–005, Tilburg University (2009)
110. Feelders, A.: Data mining in economic science. In: Meij, J. (ed.), Dealing with the Data Flood: Mining Data, Text and Multimedia. STT65. Study Centre for Technology Trends, The Hague (2002)
111. Wooldridge, J.M.: Introductory Econometrics: A Modern Approach. Fifth Edition. Cengage Learning (2012)
112. Ranshous, S., Shen, S., Koutra, D., Harenberg, S., Faloutsos, C., Samatova, N.F.: Anomaly detection in dynamic networks. A survey. WIREs Computational Statistics 7(3), 223–247 (2015)
113. Jurdak, R., Wang, R., Obst, O., Valencia, P.: Wireless Sensor network anomalies. Diagnosis and detection strategies. In A. Tolk, & L. C. Jain (Eds.), Intelligence-Based Systems Engineering. Berlin: Springer, Springer Nature (2011)
114. Keogh, E., Lin, J., Lee, S., Van Herle, H.: Finding the most unusual time series subsequence: algorithms and applications. Knowl. Inf. Syst. 11(1), 1–27 (2006)
115. Brockwell, P.J., Davis, R.A.: Introduction to Time Series and Forecasting, 3rd edn. Springer, Switzerland (2016)
116. Gupta, M., Gao, J., Aggarwal, C.C., Han, J.: Outlier detection for temporal data: a survey. IEEE Trans. Knowl. Data Eng., 25(1) (2014)
117. Gupta, M., Gao, J., Aggarwal, C.C., Han, J.: Outlier detection for temporal data: tutorial. In: SIAM International Conference on Data Mining (2013)
118. Lee, J.G., Han, J., Li, X.: Trajectory outlier detection: a partition-and-detect framework. In: Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), Cancun, Mexico (2008)
119. Li, X., Han, J., Kim, S., Gonzales, H.: ROAM: Rule- and motif-based anomaly detection in massive moving object data sets. In: Proceedings of the 2007 SIAM International Conference on Data Mining (2007)
120. Agrawal, R., Psaila, G., Wimmers, E.L., Zaït, M.: Querying shapes of histories. In: Proceedings of the 21st VLDB Conference, Zürich, Switzerland (1995)
121. Gajer, P., Schatz, M., Salzberg, S.L.: Automated correction of genome sequence errors. Nucleic Acids Res. 32(2), 562–569 (2004)
122. Rousseeuw, P.J., Raymaekers, J., Hubert, M.: A measure of directional outlyingness with applications to image data and video. J. Comput. Graph. Stat. 27(2), 345–359 (2018)
123. Schedl, M., Gómez, E., Urbano, J.: Music Information Retrieval: Recent Developments and Applications. Found. Trends Inf. Retr. 8(2–3), 127–261 (2014)
124. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
125. Date, C.J.: Database Design and Relational Theory. O’Reilly, Sebastopol (2012)
126. Burrough, P.A., McDonnell, R.A.: Principles of Geographical Information Systems. Oxford University Press, Oxford (1998)
127. Galati, S.R.: Geographic Information Systems Demystified. Artech House, Boston (2006)
128. By, R.A. de: Principles of Geographical Information Systems: An Introductory Textbook. ITC, Enschede (2001)
129. Daróczi, G.: Mastering Data Analysis with R. Packt Publishing, Birmingham (2015)
130. Marradi, A.: Classification, typology, taxonomy. Qual. Quant. 24(2), 129–157 (1990)
131. Schluter, C., Trede, M.: Identifying multiple outliers in heavy-tailed distributions with an application to market crashes. J. Empir. Financ. 15(4), 700–713 (2008)
132. Padmanabhan, K., Chen, Z., Lakshminarasimhan, S., Ramaswamy, S.S., Richardson, B.T.: Graph-based anomaly detection. In: Samatova et al. (Eds.), Practical Graph Mining with R. CRC Press, Boca Raton (2014)
133. Aggarwal, C.C., Yu, P.S.: An effective and efficient algorithm for high-dimensional outlier detection. VLDB J. 14(2), 211–221 (2005)
134. Fawzy, A., Mokhtar, H.M.O., Hegazy, O.: Outliers detection and classification in wireless sensor networks. Egypt. Inf. J. 14(2), 157–164 (2013)
135. Zhang, Y., Meratnia, N., Havinga, P.: Outlier detection techniques for wireless sensor networks: a survey. IEEE Commun. Surv. Tutorials, 12(2) (2010)
136. Kotu, V., Deshpande, B.: Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Elsevier, Amsterdam (2015)
137. Song, X., Wu, M., Jermaine, C., Ranka, S.: Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 19(5), 631–645 (2007)
138. Chen, C., Liu, L.: Joint estimation of model parameters and outlier effects in time series. J. Am. Stat. Assoc. 88(421), 284–297 (1993)
139. López-de-Lacalle, J.: Tsoutliers: R package for detection of outliers in time series. Draft version. URL: https://jalobe.com/doc/tsoutliers.pdf (2016)
140. Kaiser, R., Maravall, A.: Seasonal outliers in time series. Universidad Carlos III de Madrid, working paper number 99–49 (1999)
141. Fox, A.J.: Outliers in time series. J. R. Stat. Soc. Se. B Methodol. 34(3): 350–363 (1972)
142. Hubert, M., Rousseeuw, P., Segaert, P.: Multivariate functional outlier detection. Stat. Methods Appl. 24(2), 177–202 (2015)
143. Chatterjee, S., Hadi, A.: Regression Analysis by Example, 4th edn. Wiley, Hoboken (2006)
144. James, G., Witten, D., Hastie, T., Tibshirani, R.: An introduction to statistical learning: with applications in R, 8th edn. Springer, New York (2017)
145. Fox, J., Weisberg, S.: An R Companion to Applied Regression, 3rd edn. Sage, Los Angeles (2019)
146. Ge, Y., Xiong, H., Zhou, Z., Ozdemir, H., Yu, J., Lee, K.C.: TOP-EYE: top-k evolving trajectory outlier detection. In: Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010), pp. 1733–1736 (2010)
147. Chabiyyam, M., Reddy, R.D., Dogra, D.P., Bhaskar, H., Mihaylova, L.: Motion anomaly detection and trajectory analysis in visual surveillance. Multimedia Tools Appl. 77(13), 16223–16248 (2017)
148. Suthaharan, S., Alzahrani, M., Rajasegarar, S., Alzahrani, M., Leckie, C., Palaniswami, M.: Labelled data collection for anomaly detection in wireless sensor networks. In: Proceedings of the 6th International Conference on Intelligent Sensors, Sensor Networks and Information Processing, Brisbane (2010)
149. Henderson, K., Eliassi-Rad, T., Faloutsos, C., Akoglu, L., Li, L., Maruhashi, K., Prakash, B.A., Tong, H.: Metric forensics: a multi-level approach for mining volatile graphs. In: Proceedings


of the ACM SIGKDD International Conference on Knowledge 168. Embrechts, P., Resnick, S.I., Samorodnitsky, G.: Extreme value
Discovery and Data Mining, Washington, United States (2010) theory as a risk management tool. North Am. Actuarial J. 3(2),
150. Vries, T. de, Chawla, S., Houle, M.E.: Finding local anomalies in 30–41 (1999)
very high dimensional space. In: Proceedings of the 2010 IEEE 169. Gao, J., Liang, F., Fan, W., Wang, C., Sun, Y., Han, J.: On com-
International Conference on Data Mining (2010) munity outliers and their efficient detection in information net-
151. Kathiresan, V., Vasanthi, N.A.: A survey on outlier detection works. In: Proceedings of the 16th ACM SIGKDD Conference
techniques useful for financial card fraud detection. Int. J. Innov. on Knowledge Discovery and Data Mining (KDD 2010), Wash-
Eng. Technol. 6(1), 226–235 (2015) ington, USA (2010)
152. Xie, M., Han, S., Tian, B., Parvin, S.: Anomaly detection in wire- 170. Green, R.F.: Outlier-prone and outlier-resistant distributions. J.
less sensor networks: a survey. J. Netw. Comput. Appl. 34(4), Am. Stat. Assoc. 71(354), 502–505 (1976)
1302–1325 (2011) 171. Neyman, J., Scott, E.L.: Outlier Proneness of Phenomena and of
153. Leys, C., Ley, C., Klein, O., Bernard, P., Licata, L.: Detecting Related Distributions. In: Proceedings of the Symposium Opti-
outliers: do not use standard deviation around the mean, use mizing Methods in Statistics, Ohio, USA, pp. 413-430 (1971)
absolute deviation around the median. J. Exp. Soc. Psychol. 172. Kennedy, D., Lakonishok, J., Shaw, W.H.: Accommodating outli-
49(4), 764–766 (2013) ers and nonlinearity in decision models. J. Acc. Audit. Financ.
154. Small, M., Tse, C.K., Walker, D.M.: Super-spreaders and the 7(2), 161–190 (1992)
Rate of Transmission of the SARS Virus. Physica D 215(2), 173. DeCarlo, L.T.: On the meaning and use of Kurtosis. Psychol.
146–158 (2006) Methods 2(3), 292–307 (1997)
155. Wong, G., Liu, W., Liu, Y., Zhou, B., Bi, Y., Gao, G.F.: MERS, 174. Kennedy, J.: Probability and dynamics in the particle swarm. In:
SARS, and Ebola: the role of super-spreaders in infectious dis- Proceedings of the IEEE Congress on Evolutionary Computa-
ease. Cell Host Microbe 18, 398–401 (2015) tion, Portland, USA (2004)
156. Al-Tawfiq, J.A., Rodriguez-Morales, A.J.: Super-spreading 175. Katz, R.W., Brush, G.S., Parlange, M.B.: Statistics of extremes:
Events and Contribution to Transmission of MERS, SARS, and modeling ecological disturbances. Ecology 86(5), 1124–1134
COVID-19. J. Hosp. Inf. Doi: https://d​ oi.o​ rg/1​ 0.1​ 016/j.j​ hin.2​ 020.​ (2005)
04.​002 (2020) 176. Reiss, R., Thomas, M.: Statistical Analysis of Extreme Values:
157. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & With Applications to Insurance, Finance, Hydrology and Other
Claypool Publishers, Williston (2012) Fields. Third Edition. Birkhäuser, Basel (2007)
158. Shyu, M., Chen, S., Sarinnapakorn, K., Chang, L.: A novel anom- 177. Fieller, N.R.J.: Some problems related to the rejection of outlying
aly detection scheme based on principal component classifier. observations. PhD Thesis, The University of Hull (1976)
In: Proceedings of the IEEE Foundations and New Directions 178. Woolley, T.W.: An investigation of the effect of the swamping
of Data Mining Workshop, Melbourne, FL., USA, pp. 172–179 phenomenon on several block procedures for multiple outliers in
(2003) univariate samples. Open J. Stat. 3(5), 299–304 (2013)
159. Hawkins, S., He, H., Williams, G., Baxter, R.: Outlier detection 179. Ben-Gal I.: Outlier Detection. In: Maimon, O., Rokach, L. (Eds.),
using replicator neural networks. In: Proceedings of the Interna- Data Mining and Knowledge Discovery Handbook. Kluwer Aca-
tional Conference on Data Warehousing and Knowledge Discov- demic Publishers (2005)
ery (2002) 180. Trittenbach, H., Böhm, K.: Dimension-based subspace search for
160. Valko, M., Kveton, B., Valizadegan, H., Cooper, G.F., Hauskre- outlier detection. Int. J. Data Sci. Anal. 7, 87–101 (2018)
cht, M.: Conditional anomaly detection with soft harmonic func- 181. Das, K., Schneider, J.: Detecting anomalous records in categori-
tions. In: Proceedings of the 11th International Conference on cal datasets. In: Proceedings of the 13th ACM SIGKDD Inter-
Data Mining (ICDM), Vancouver, Canada (2011) national Conference on Knowledge Discovery and Data Mining,
161. Yang, Y., Webb, G.I., Wu, X.: Discretization Methods. In: Mai- San Jose, USA (2007)
mon, O., Rokach, L. (Eds.), Data Mining and Knowledge Dis- 182. Keller, F., Muller, E., Bohm, K.: HiCS: High contrast subspaces
covery Handbook. Springer, New York (2005) for density-based outlier ranking. In: Proceedings of the 28th
162. Müller, E., Assent, I., Iglesias, P., Mülle, Y., Böhm, K.: Outlier IEEE International Conference on Data Engineering (2012)
ranking via subspace analysis in multiple views of the data. In: 183. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for
Proceedings of the 12th IEEE International Conference on Data discrete sequences: a survey. IEEE Trans. Knowl. Data Eng.,
Mining (2012) 24(5) (2012)
163. Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. 184. King, J., Fineman, J.T., Palermo, V., Singh, L.: Combining pure
In: Proceedings of the 11th ACM SIGKDD International Confer- anomalies to describe unusual behavior in real world data sets.
ence on Knowledge Discovery in Data Mining, Chicago, USA In: Presented at ODDx3, the ACM SIGKDD Workshop on Out-
(2005) lier Definition, Detection and Description, Sydney, Australia
164. Zhou, D., He, J., Candan, K.S., Davulcu, H.: MUVIR: Multi- (2015)
view rare category detection. In: Proceedings of the 24th Interna- 185. Penny, K.I.: Appropriate critical values when testing for a single
tional Joint Conference on Artificial Intelligence (IJCAI), Buenos multivariate outlier by using the Mahalanobis distance. J. Roy.
Aires, Argentina (2015) Stat. Soc. 45(1), 73–81 (1996)
165. Urdan, T.C.: Statistics in Plain English, 3rd edn. Routledge, New 186. Steinbuss, G., Böhm, K.: Hiding outliers in high-dimensional
York (2010) data spaces. Int. J. Data Sci. Anal. 4(3), 173–189 (2017)
166. Onderwater, M.: Outlier preservation by dimensionality reduc- 187. Maronna, R., Martin, D., Yohai, V.: Robust Statistics: Theory
tion techniques. Int. J. Data Anal. Techniq. Strateg. 7(3), 231– and Methods. Wiley, Chichester (2006)
252 (2015) 188. Huang, A., Lai, K., Li, Y., Wang, S.: Forecasting container
167. Seo, S.: A review and comparison of methods for detecting outli- throughput of Qingdao Port with a hybrid model. J. Syst. Sci.
ers in univariate data sets. In: Master Thesis, University of Pitts- Complexity 28(1), 105–121 (2015)
burgh (2006) 189. Dunning, T., Friedman, E.: Practical Machine Learning: A New
Look at Anomaly Detection. O’Reilly, Sebastopol (2014)


190. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with 209. Ma, E.Y.T., Ratnasingham, S., Kremer, S.C.: Machine learned
drift detection. In: Proceedings of the 17th Brazilian Symposium replacement of N-labels for basecalled sequences in DNA bar-
on Artificial Intelligence (SBIA), Sao Luis, Brazil (2004) coding. IEEE/ACM Trans. Comput. Biol. Bioinf. 15(1), 191–204
191. Kirchgässner, G., Wolters, J.: Introduction to Modern Time (2018)
Series Analysis. Springer, Berlin (2007) 210. Sun, J., Xie, Y., Zhang, H., Faloutsos, C.: Less is More: Compact
192. Chakrabarti, S., Sarawagi, S., Dom, B.: Mining surprising pat- matrix decomposition for large sparse graphs. In: Proceedings of
terns using temporal description length. In: VLDB-98, Proceed- the 7th SIAM International Conference on Data Mining (SDM),
ings of the 24th International Conference on Very Large Data Minneapolis, USA (2007)
Bases, New York (1998) 211. Shekhar, S., Lu, C., Zhang, P.: Detecting graph-based spatial
193. Burridge, P., Taylor, A.M.R.: Additive outlier detection via outliers: algorithms and applications (a summary of results). In:
extreme-value theory. J. Time Ser. Anal. 7(5), 685–701 (2006) Proceedings of the Seventh ACM SIGKDD International Confer-
194. Radke, R.J., Andra, S., Al-Kofahi, O., Roysam, B.: Image change ence on Knowledge Discovery and Data Mining, San Francisco,
detection algorithms: a systematic survey. IEEE Trans. Image USA (2001)
Process. 14(3), 294–307 (2005) 212. Graur, D., Li, W.: Fundamentals of Molecular Evolution. Second
195. Pang, G., Cao, L., Chen, L.: Outlier detection in complex cat- Edition. Sinauer Associates, Sunderland (2000)
egorical data by modelling the feature value couplings. In: Pro- 213. Jones, N.C., Pevzner, P.A.: An Introduction to Bioinformatics
ceedings of the 25th International Joint Conference on Artificial Algorithms. The MIT Press, Cambridge (2004)
Intelligence (2016) 214. Lesk, A.M.: Introduction to Bioinformatics. Oxford University
196. Weiss, G.M.: Mining with rarity: a unifying framework. ACM Press, Oxford (2002)
SIGKDD Exploration Newsletter 6(1), 7–19 (2004) 215. Spears, T., DeBry, R.W., Abele, L.G., Chodyla, K.: Peracarid
197. Dash, M., Lie, N.: Outlier detection in transactional data. Intell. monophyly and interordinal phylogeny inferred from nuclear
Data Anal. 14(3), 283–298 (2010) small-subunit ribosomal DNA sequences (Crustacea: Mala-
198. Hansen, L.K., Sigurdsson, S., Kolenda, T., Nielsen, F.A, Kjems, costraca: Peracarida). Proc. Biol. Soc. Wash. 118(1), 117–157
U., Larsen, J.: Modeling text with generalizable Gaussian mix- (2005)
tures. In: Proceedings of the IEEE International Conference 216. Jenner, R.A., Dhubhghaill, C.N., Ferla, M.P., Wills, M.A.:
on Acoustics, Speech, and Signal Processing, Istanbul, Turkey Eumalacostracan phylogeny and total evidence: limitations of
(2000) the usual suspects. BMC Evolution. Biol., 9(21) (2009)
199. Guthrie, D., Guthrie, L., Allison, B., Wilks, Y.: Unsupervised 217. Giribet, G., Distel, D.L., Polz, M., Sterrer, W., Wheeler, W.C.:
anomaly detection. In: Proceedings of the 20th International Joint Triploblastic relationships with emphasis on the acoelomates and
Conference on Artificial Intelligence (IJCAI’07), Hyderabad, the position of Gnathostomulida, Cycliophora, Plathelminthes,
India (2007) and Chaetognatha: a combined approach of 18S rDNA sequences
200. Oberreuter, G., Velásquez, J.D.: Text mining applied to plagia- and morphology. Syst. Biol. 49(3), 539–562 (2000)
rism detection: the use of words for detecting deviations in the 218. Struck, T.H.: TreSpEx—detection of misleading signal in phy-
writing style. Expert Syst. Appl. 40(9), 3756–3763 (2013) logenetic reconstructions based on tree information. Evol. Bio-
201. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for author- informa. 10, 51–67 (2014)
ship identification of online messages: writing-style features 219. Petitjean, C., Makarova, K.S., Wolf, Y.I., Koonin, E.V.: Extreme
and classification techniques. J. Am. Soc. Inform. Sci. Technol. deviations from expected evolutionary rates in archaeal protein
57(3), 378–393 (2006) families. Genome Biol. Evol. 9(10), 2791–2811 (2017)
202. Chouchane, A., Bouguessa, M.: Identifying anomalous nodes 220. Pincombe, B.: Anomaly detection in time series of graphs using
in multidimensional networks. In: Proceedings of the 4th IEEE ARMA processes. ASOR Bull 24(4), 2–10 (2005)
International Conference on Data Science and Advanced Analyt- 221. Gupta, M., Gao, J., Sun, Y., Han, J.: Integrating community
ics (DSAA), Tokyo, Japan (2017) matching and outlier detection for mining evolutionary commu-
203. Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A.: New nity outliers. In: Proceedings of the 18th ACM SIGKDD Inter-
streaming algorithms for fast detection of superspreaders. In: national Conference on Knowledge Discovery and Data Mining,
Proceedings of Network and Distributed System Security Sym- Beijing, China (2012)
posium (NDSS’05), pp. 149–166 (2005) 222. Lu, C., Chen, D., Kou, Y.: Detecting spatial outliers with multiple
204. Eberle, W., Holder, L.: Discovering structural anomalies in attributes. In: Proceedings of the 15th IEEE International Con-
graph-based data. In: Proceedings of the 7th IEEE International ference on Tools with Artificial Intelligence, Sacramento, USA
Conference on Data Mining (2007) (2003)
205. Akoglu, L., McGlohon, M., Faloutsos, C.: Anomaly detection in 223. Chawla, S., Sun, P.: SLOM: a new measure for local spatial outli-
large graphs. In: Technical Report, CMU-CS-09–173, Carnegie ers. Knowl. Inf. Syst. 9(4), 412–429 (2006)
Mellon University (2009) 224. Izakian, H., Pedrycz, W.: Anomaly detection and characteriza-
206. Sun, J., Qu, H., Chakrabarti, D., Faloutsos, C.: Neighborhood tion in spatial time series data: a cluster-centric approach. IEEE
formation and anomaly detection in bipartite graphs. In: Proceed- Trans. Fuzzy Syst. 22(6), 1612–1624 (2013)
ings of the Fifth IEEE International Conference on Data Mining 225. Das, M., Parthasarathy, S.: Anomaly detection and spatio-tem-
(2005) poral analysis of global climate system. In: Proceedings of the
207. Ding, Q., Katenka, N., Barford, P., Kolaczyk, E., Crovella, M.: Third International Workshop on Knowledge Discovery from
Intrusion as (anti)social communication: characterization and Sensor Data (SensorKDD’09), Paris, France (2009)
detection. In: Proceedings of the 18th ACM SIGKDD Interna- 226. Kiranyaz, S., Ince, T., Gabbouj, M.: Real-time patient-specific
tional Conference on Knowledge Discovery and Data Mining, ECG classification by 1-D convolutional neural networks. IEEE
Beijing, China (2012) Trans. Biomed. Eng. 63(3), 664–675 (2015)
208. Barata, A.P., Bruin, G.J. de, Takes, F., Herik, J. van den, Veen- 227. Allan, J., Papka, R., Lavrenko, V.: On-line new event detection
man, C.: Finding anomalies in waste transportation data with and tracking. In: Proceedings of the 21st Annual International
supervised category models. In: Proceedings of the ACM SIGIR Conference on Research and Development in Infor-
30th Benelux Conference on Artificial Intelligence mation Retrieval (1998)
(BNAIC), Den Bosch, the Netherlands (2018)


228. Wang, X., Zhai, C., Hu, X., Sproat, R.: Mining correlated bursty 250. MAST: Barbara A. Mikulski archive for space telescopes.
topic patterns from coordinated text streams. In: Proceedings of URL: https://archive.stsci.edu/k2/hlsp/k2sff/search.php (2019)
the 13th ACM SIGKDD Conference on Knowledge Discovery Accessed April 6th 2019
and Data Mining (KDD 2007), San Jose, USA (2007) 251. Wright, J.T., Cartier, K.M.S., Zhao, M., Jontof-Hutter, D., Ford,
229. Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., E.B.: The G search for extraterrestrial civilizations with large
Sandler, M.B.: A tutorial on onset detection in music signals. E.B.: The Ĝ search for extraterrestrial civilizations with large
IEEE Trans. Speech Audio Process. 13(5), 1035–1047 (2005) Transiting Megastructures. Astrophys. J., 816(1) (2016)
230. Vanderburg, A.: Transit Light Curve Tutorial: The Transit Light 252. Thompson, M.A., Scicluna, P., Kemper, F., Geach, J.E. et al.:
Curve. URL: https://www.cfa.harvard.edu/~avanderb/tutorial/tutorial.html Constraints on the Circumstellar Dust Around KIC 8462852.
(2019). Accessed 6 April 2019 Monthly Notices R. Astron. Soc., 458(1): L39-L43 (2016)
231. Batalha, N.M., Rowe, J.F., Bryson, S.T. et al.: Planetary candi- 253. Wang, X., Smith, K., Hyndman, R.: Characteristic-based clus-
dates observed by Kepler III: analysis of the first 16 months tering for time series data. Data Min. Knowl. Disc. 13, 335–364
of data. Astrophys. J. Suppl. Ser., 204(2) (2013) (2006)
232. Cleveland, R.B., Cleveland, W.S., McRae, J.E., Terpenning, I.: 254. Pallotta, G., Jousselme, A.: Data-driven detection and context-
STL: a seasonal-trend decomposition procedure based on loess based classification of maritime anomalies. In: Proceedings of
(with discussion). J. Off. Stat. 6(1), 3–73 (1990) the 18th International Conference on Information Fusion, Wash-
233. Hyndman, R.J., Wang, E., Laptev, N.: Large-scale unusual time ington DC, USA (2015)
series detection. In: Proceedings of the IEEE International Con- 255. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–
ference on Data Mining. Atlantic City, USA, 14–17 (2015) 5), 75–174 (2010)
234. Gama, J., Žliobaite, I., Bifet, A., Pechenizkiy, M., Bouchachia, 256. Grassly, N.C., Harvey, P.H., Holmes, E.C.: Population dynam-
A.: A survey on concept drift adaptation. ACM Comput. Surv., ics of HIV-1 inferred from gene sequences. Genetics 151(2),
46(4) (2014) 427–438 (1999)
235. Roshtkhari, M.J., Levine, M.D.: Online dominant and anomalous 257. Felsenstein, J.: Cases in which parsimony or compatibility meth-
behavior detection in videos. In: IEEE Conference on Computer ods will be positively misleading. Syst. Biol. 27(4), 401–410
Vision and Pattern Recognition, CVPR (2013) (1978)
236. Cooper, J., Cooper, G.: Subliminal motivation: a story revisited. 258. Driver, F., Milner, R.J., Trueman, J.W.H.: A taxonomic revi-
J. Appl. Soc. Psychol. 32(11), 2213–2227 (2002) sion of Metarhizium based on a phylogenetic analysis of rDNA
237. Kratz, L., Nishino, K.: Anomaly detection in extremely crowded sequence data. Mycol. Res. 104(2), 134–150 (2000)
scenes using spatio-temporal motion pattern models. In: Proceed- 259. Shoubridge, P., Kraetzl, M., Wallis, W., Bunke, H.: Detection of
ings of the IEEE Conference on Computer Vision and Pattern abnormal change in a time series of graphs. J. Interconn. Netw.
Recognition, Miami, USA (2009) 3(1–2), 85–101 (2002)
238. Wang, S., Manning, C.D.: Baselines and bigrams: simple, good 260. Idé, T., Kashima, H.: Eigenspace-based anomaly detection in
sentiment and topic classification. In: Proceedings of the 50th computer systems. In: Proceedings of the 10th ACM SIGKDD
Annual Meeting of the Association for Computational Linguis- International Conference on Knowledge Discovery and Data
tics, Jeju, Korea (2012) Mining, Seattle, USA (2004)
239. Ross, D., Jr., Rasche, R.H.: EYEBALL: a computer program for 261. Araujo, M., Papadimitriou, S., Günnemann, S., Faloutsos, C.,
description of style. Comput. Humanit. 6(4), 213–221 (1972) Basu, P., Swami, A., Papalexakis, E.E., Koutra, D.: Com2: Fast
240. Riahi, F., Schulte, O.: Propositionalization for unsupervised automatic discovery of temporal (‘comet’) communities. In:
outlier detection in multi-relational data. In: Proceedings of the Proceedings of the 18th Pacific-Asia Conference on Knowledge
29th International Florida Artificial Intelligence Research Soci- Discovery and Data Mining (PAKDD), Tainan, Taiwan (2014)
ety Conference (2016) 262. Gupta, M., Aggarwal, C.C., Han, J., Sun, Y.: Evolutionary clus-
241. Feldman, J.: What’s wrong with my data? In: Purba, S. (ed.) tering and analysis of bibliographic networks. In: Proceedings
High-Performance Web Databases: Design, Development, and of the IEEE International Conference on Advances in Social
Deployment. Auerbach, Boca Raton (2001) Networks Analysis and Mining, Kaohsiung, Taiwan (2011)
242. Hofmeyr, S.A., Forrest, S., Somayaji, A.: Intrusion detection 263. Li, Z., Xiong, H., Liu, Y., Zhou, A.: Detecting blackhole and
using sequences of system calls. J. Comput. Secur. 6, 151–180 volcano patterns in directed networks. In: Proceedings of the
(1998) IEEE International Conference on Data Mining, Sydney, Aus-
243. Murphy, D.J.: The future of oil palm as a major global crop: tralia (2010)
opportunities and challenges. J. Oil Palm Res. 26(1), 1–24 (2014) 264. Komsta, L.: Package ‘outliers’. Tests for outliers, Version 0.14,
244. Pilastre, B., Boussouf, L., D’Escrivan, S., Tourneret, J.: Anomaly CRAN Repository. URL: https://cran.r-project.org/web/packages/outliers/outliers.pdf
detection in mixed telemetry data using a sparse representation (2015)
and dictionary learning. Signal Process., 168 (2020) 265. Iphar, C., Napoli, A., Ray, C., Alincourt, E., Brosset, D.: Risk
245. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. analysis of falsified automatic identification system for the
Trends Inf. Retr. 2(1–2), 1–135 (2008) improvement of maritime traffic safety. In: Proceedings of
246. Mukherjee, S.: F# for Machine Learning Essentials. Packt Pub- ESREL 2016, Glasgow, United Kingdom, pp. 606–613 (2016)
lishing, Birmingham (2016) 266. Singal, V.: Beyond the Random Walk: A Guide to Stock Market
247. Venturini, A.: Time Series outlier detection: a new non-paramet- Anomalies and Low-Risk Investing. Oxford University Press,
ric methodology (Washer). Statistica (Bologna) 71(3), 329–344 Oxford (2003)
(2011) 267. Lakhina, A., Crovella, M., Diot, C.: Mining anomalies using
248. Panteli, M., Benetos, E., Dixon, S.: A computational study on traffic feature distributions. In: Proceedings of the SIG-
outliers in world music. PLoS ONE, 12(12) (2017) COMM’05 Conference on Applications, Technologies, Archi-
249. Boyajian, T.S., LaCourse, D.M., Rappaport, S.A., Fabrycky, tectures and Protocols for Computer Communications, Phila-
D. et al.: Planet Hunters X. KIC 8462852—Where’s the Flux? delphia, USA, pp. 217–228 (2005)
Monthly Notices R. Astron. Soc., 457(4): 3988– 268. Grandgirard, D., Zielinski, R.: Land parcel identification sys-
4004 (2016) tem (LPIS) anomalies’ sampling and spatial pattern. In: JRC


Scientific and Technical Reports. European Commission, Lux- with an application in high energy particle physics. In: Proceed-
embourg (2008) ings of the International Joint Conference on Neural Networks
269. Aydillo, D.F.: Trust-ware: A methodology to analyze, design, (IJCNN), Brisbane, Australia (2012)
and secure trust and reputation systems. In: Doctoral Thesis, 290. Muandet, K., Schölkopf, B.: One-class support measure machines
University of Madrid (2015) for group anomaly detection. In: Proceedings of the 29th Con-
270. Guernaoui, S., Ramaoui, K., Rahola, N., Barnabe, C., Ser- ference on Uncertainty in Artificial Intelligence (UAI’13), pp.
eno, D., Boumezzough, A.: Malformations of the Genitalia in 449–458 (2013)
Male Phlebotomus Papatasi (Scopoli) (Diptera: Psychodidae). 291. Guevara, J., Canu, S., Hirata, R.: Support measure data descrip-
J. Vector Ecol. 35(1), 13–19 (2010) tion for group anomaly detection. In: Proceedings of the ODDx3
271. Karlof, C., Wagner, D.: Secure routing in wireless sensor net- Workshop on Outlier Definition, Detection, and Description at
works: attacks and countermeasures. Ad Hoc Netw. 1(2–3), the 21st ACM International Conference on Knowledge Discovery
293–315 (2003) and Data Mining (SIG KDD), Sydney, Australia (2015)
272. Pelleg, D., Moore, A.: Active learning for anomaly and rare- 292. Chandola, V., Vatsavai, R.R.: A scalable Gaussian process analy-
category detection. In: Proceedings of NIPS’04, the 17th Inter- sis algorithm for biomass monitoring. Stat. Anal. Data Min.
national Conference on Neural Information Processing Sys- 4(4), 430–445 (2011)
tems, pp. 1073–1080 (2004) 293. Ramachandra, B., Dutton, B., Vatsavai, R.R.: Anomalous cluster
273. Kumpulainen, P., Hätönen, K.: Local anomaly detection for detection in spatio-temporal meteorological fields. Stat. Anal.
network system log monitoring. In: Proceedings of the 10th Data Min. 12(2), 88–100 (2018)
International Conference on Engineering Applications of Neu- 294. van der Aalst, W.M.P., de Medeiros, A.K.A.: Process mining and
ral Networks (EANN), Greece (2007) security: detecting anomalous process executions and checking
274. Fisher, R.A.: On the mathematical foundations of theoretical process conformance. Electron. Notes Theor. Comput. Sci. 121,
statistics. Philos. Trans. R. Soc. Lond. 222, 309–368 (1922) 3–21 (2005)
275. Gwadera, R., Atallah, M.J., Szpankowski, W.: Reliable detec- 295. Li, Y., Huang, X., Li, J., Du, M., Zou, N.: SpecAE: Spectral
tion of episodes in event sequences. Knowl. Inf. Syst. 7(4), autoencoder for anomaly detection in attributed networks. In:
415–437 (2005) Proceedings of CIKM, the 28th ACM International Conference
276. Antwarg, L., Shapira, B., Rokach, L.: Explaining anomalies on Information and Knowledge Management, Beijing, China
detected by autoencoders using SHAP. arXiv: 1903.02407v1 (2019)
(2019) 296. Cao, J., Yu, J., Chen, A., Bu, T., Zhang, Z.: Identifying high car-
277. Atluri, G., Karpatne, A., Kumar, V.: Spatio-temporal data min- dinality internet hosts. In: Proceedings of the 28th IEEE INFO-
ing: a survey of problems and methods. ACM Comput. Surv., COM Conference on Computer Communications, Rio de Janeiro,
Vol. 51, No. 4, Article 83 (2018) Brazil (2009)
278. Yang, W., Gao, Y., Cao, L.: TRASMIL: a local anomaly detec- 297. Foorthuis R.M.: Algorithmic frameworks for the detection of
tion framework based on trajectory segmentation and multi- high-density anomalies. In: Proceedings of IEEE SSCI CIDM
instance learning. Comput. Vis. Image Underst. 117(10), (Symposium on Computational Intelligence in Data Mining),
1273–1286 (2013) Canberra Australia (2020)
279. Wu, M., Jermaine, C., Ranka, S., Song, X., Gums, J.: A model- 298. Pijnenburg, M., Kowalczyk, W.: Singular outliers: finding com-
agnostic framework for fast spatial anomaly detection. ACM mon observations with an uncommon feature. In: Proceedings
Trans. Knowl. Dis. Data, Vol. 4, No. 4, Article 20 (2010) of the International IPMU Conference, Cádiz, Spain, Springer
280. Chen, X.C., Steinhaeuser, K., Boriah, S., Chatterjee, S., Kumar, CCIS 854 (2018)
V.: Contextual time series change detection. In: Proceedings of 299. Chalapathy, R., Chawla, S.: Deep learning for anomaly detection:
the 2013 SIAM International Conference on Data Mining (2013) a survey. arXiv: 1901.03407 (2019)
281. Zhou, X., Shekhar, S., Ali, R.Y.: Spatio-temporal change footprint 300. Pang, G., Shen, C., Cao, L., Hengel, A. van den: Deep learning
pattern discovery: an inter-disciplinary survey. WIREs Data Min. for anomaly detection: A Review. arXiv: 2007.02500 (2020)
Knowl. Discovery 4(1), 1–23 (2014) 301. Himeur, Y., Ghanem, K., Alsalemi, A., Bensaali, F., Amira, A.:
282. Toth, E., Chawla, S.: Group deviation detection methods: a sur- Artificial intelligence based anomaly detection of energy con-
vey. ACM Comput. Surv., Vol. 51, No. 4, Article 77 (2018) sumption in buildings: A Review, Current Trends and new Per-
283. Shekhar, S., Jian, Z., Ali, R.Y., Eftelioglu, E., Tang, X., Gunturi, spectives. Applied Energy, Vol. 287 (2021)
V.M.V., Zhou, X.: Spatio-temporal data mining: a computational detection using local kernel density estimation and context-based
perspective. ISPRS Int. J. Geo Inf. 4(4), 2306–2338 (2015) detection using local kernel density estimation and context-based
284. Kang, J.M., Shekhar, S., Wennen, C., Novak, P.: Discovering regression. IEEE Trans. Knowl. Data Eng., 32(2) (2020)
flow anomalies: a SWEET approach. In: Proceedings of the 303. Henrion, M., Hand, D.J., Gandy, A., Mortlock, D.J.: CASOS:
Eighth IEEE International Conference on Data Mining, Pisa, A subspace method for anomaly detection in high dimensional
Italy (2008) astronomical databases. Stat. Anal. Data Min. 6(1), 53–72 (2013)
285. Torgo, L.: Data Mining with R: Learning with Case Studies, 2nd 304. Cheng, H., Tan, P., Potter, C., Klooster, S.: Detection and charac-
edn. CRC Press, Boca Raton (2017) terization of anomalies in multivariate time series. In: Proceed-
286. Hodge, V.J., Austin, J.: A survey of outlier detection methodolo- ings of the 2009 SIAM International Conference on Data Mining,
gies. Artif. Intell. Rev. 22, 85–126 (2004) Sparks, USA (2009)
287. Yu, R., He, X., Liu, Y.: GLAD: Group anomaly detection in 305. Bandaragoda, T.B., Ting, K.M., Albrecht, D., Liu, F.T., Wells,
social media analysis. In: ACM Trans. Knowl. Dis. Data, Vol. J.R.: Efficient anomaly detection by isolation using nearest neigh-
10, No. 2, Article 18 (2015) bour ensemble. In: Proceedings of the IEEE International Con-
288. Xiong, L., Póczos, B., Schneider, J.: Group anomaly detection ference on Data Mining Workshops (2014)
using flexible genre models. In: Proceedings of NIPS 2011, 306. Kauffmann, J., Ruff, L., Montavon, G., Müller, K.: The clever
Advances in Neural Information Processing Systems 24 (2011) hans effect in anomaly detection. arXiv: 2006.10609 (2020)
289. Vatanen, T., Kuusela, M., Malmi, E., Raiko, T., Aaltonen, T.,
Nagai, Y.: Semi-supervised detection of collective anomalies


307. Talagala, P.D., Hyndman, R.J., Smith-Miles, K.: Anomaly detec- 320. Baddar, S., Merlo, A., Migliardi, M.: Anomaly detection in com-
tion in high dimensional data. J. Comput. Graph. Stat. online puter networks: a state-of-the-art review. J. Wireless Mob. Netw.
accepted author version of 13 Aug 2020 (2020) Ubiquitous Comput. Depend. Appl. 5(4), 29–64 (2014)
308. Wilkinson, L., Anand, A., Grossman, R.: Graph-theoretic scag- 321. Wu, R., Keogh, E.J.: Current time series anomaly detection
nostics. In: Proceedings of the IEEE Symposium on Information benchmarks are flawed and are creating the illusion of progress.
Visualization (2005) arXiv: 2009.13807 (2020)
309. Koufakou, A., Ortiz, E., Georgiopoulos, M., Anagnostopoulos, 322. Ahmed, F., Courville, A.: Detecting semantic anomalies. In: Pro-
G., Reynolds, K.: A scalable and efficient outlier detection strat- ceedings of the 34th AAAI Conference on Artificial Intelligence,
egy for categorical data. In: Proceedings of ICTAI (2007) pp. 3154–3162 (2020)
310. Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., 323. Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., Lichtendahl,
Schölkopf, B., Bachem, O.: Challenging common assumptions K.C.: Data Mining for Business Analytics: Concepts, Techniques
in the unsupervised learning of disentangled representations. In: and Applications in R. Wiley, Hoboken (2018)
Proceedings of the 36th International Conference on Machine 324. Agyemang, M., Barker, K., Alhajj, R.: Web outlier mining:
Learning, California, PMLR 97 (2019) discovering outliers from web datasets. Intell. Data Anal. 9(5),
311. Steinwart, I., Hush, D., Scovel, C.: A classification framework for 473–486 (2005)
anomaly detection. J. Mach. Learn. Res. 6(8), 211–232 (2005) 325. Suzuki, N., Hirasawa, K., Tanaka, K., Kobayashi, Y., Sato, Y.,
312. Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: MVTec Fujino, Y.: Learning motion patterns and anomaly detection by
AD—A comprehensive real-world dataset for unsupervised human trajectory analysis. In: Proceedings of the IEEE Interna-
anomaly detection. In: Proceedings of the IEEE/CVF Confer- tional Conference on Systems, Man and Cybernetics, Montreal,
ence on Computer Vision and Pattern Recognition (CVPR), pp. Canada (2007)
9592–9600 (2019) 326. Aradau, C., Blanke, T.: Governing others: anomaly and the algo-
313. Page, E.S.: On problems in which a change in a parameter occurs rithmic subject of security. Eur. J. Int. Secur. 3(1), 1–21 (2017)
at an unknown point. Biometrika 44(1–2), 248–252 (1957) 327. Stone, E.J.: On the rejection of discordant observations. Mon.
314. Ni, K., Ramanathan, N., Chehade, M.N.H., Balzano, L., Nair, Not. R. Astron. Soc. 28, 165–168 (1873)
S., Zahedi, S., Kohler, E., Pottie, G., Hansen, M., Srivastava, 328. Gould, B.A.: On Peirce’s criterion for the rejection of doubtful
M.: Sensor network data fault types. ACM Trans. Sensor Netw., observations, with tables for facilitating its application. Astron.
5(3) (2009) J. 6(11), 81–86 (1855)
315. Chan, W.: Understanding the effect of time series outliers on 329. Sharma, A.B., Golubchik, L., Govindan, R.: Sensor faults: detec-
sample autocorrelations. TEST 4(1), 179–186 (1995) tion methods and prevalence in real-world datasets. ACM Trans.
316. Smolyak, D., Gray, K., Badirli, S., Mohler, G.: Coupled IGMM- Sensor Netw., 6(3) (2010)
GANs with Applications to anomaly detection in human mobility 330. Ahmed, M., Mahmood, A.N., Hu, J.: A survey of network
data. ACM Trans. Spat. Algo. Syst., 6(4) (2020) anomaly detection techniques. J. Netw. Comput. Appl. 60, 19–31
317. Zhang, X., Dou, W., He, Q., Zhou, R., Leckie, C., Kotagiri, R., (2016)
Salcic, Z.: LSHiForest: A generic framework for fast tree isola-
tion based ensemble anomaly analysis. In: IEEE 33rd Interna- Publisher’s Note Springer Nature remains neutral with regard to
tional Conference on Data Engineering, San Diego, USA (2017) jurisdictional claims in published maps and institutional affiliations.
318. Brax, C.: Anomaly Detection in the Surveillance Domain. Örebro
University. Örebro, Sweden (2011)
319. Braei, M., Wagner, S.: Anomaly detection in univariate time
series: a survey on the state-of-the-art. arXiv: 2004.00433v1
(2020)
