Mathematics 2022, 10, 4043
Article
Design and Evaluation of Unsupervised Machine Learning
Models for Anomaly Detection in Streaming
Cybersecurity Logs
Carmen Sánchez-Zas * , Xavier Larriva-Novo , Víctor A. Villagrá , Mario Sanz Rodrigo
and José Ignacio Moreno
ETSI Telecomunicación, Universidad Politécnica de Madrid (UPM), Avda. Complutense 30, 28040 Madrid, Spain
* Correspondence: carmen.szas@upm.es
Abstract: Companies, institutions or governments process large amounts of data for the development
of their activities. This knowledge usually comes from devices that collect data from various sources.
Processing them in real time is essential to ensure the flow of information about the current state
of infrastructure, as this knowledge is the basis for management and decision making in the event
of an attack or anomalous situations. Therefore, this article presents three unsupervised machine learning models based on clustering techniques and threshold definitions to detect anomalies from heterogeneous streaming cybersecurity data sources. After evaluation, this paper presents a case of heterogeneous cybersecurity devices, comparing WSSSE, Silhouette and training time metrics for all models, where K-Means was identified as the optimal algorithm for anomaly detection in streaming data processing. The accuracy achieved in anomaly detection is also significantly high. A comparison with other research studies is also performed, in which the proposed method proved its strong points.
Keywords: machine learning; clustering; real-time; data pre-processing; threshold; Spark; cybersecurity; K-Means; anomaly detection; logs

MSC: 68T99

Citation: Sánchez-Zas, C.; Larriva-Novo, X.; Villagrá, V.A.; Rodrigo, M.S.; Moreno, J.I. Design and Evaluation of Unsupervised Machine Learning Models for Anomaly Detection in Streaming Cybersecurity Logs. Mathematics 2022, 10, 4043. https://doi.org/10.3390/math10214043

Academic Editor: Denis N. Sidorov

Received: 29 September 2022; Accepted: 24 October 2022; Published: 31 October 2022

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The analysis and subsequent extraction of information from heterogeneous volumes of data processed in business environments are computationally intensive, especially when anomaly detection must be performed immediately and in real time. Therefore, security experts began to use methods and mechanisms from areas of computer science that have experienced exponential growth in recent years: the fields of machine learning and Big Data.

Unsupervised learning models [1] attract the most attention because they are able to generate knowledge about unidentifiable events or behaviours. They are supplied with unlabelled data, so knowledge must be extracted from the metastructure of the supplied information itself. This feature is essential for the analysis of most data generated by devices, as these data are not labelled.

Furthermore, in the field of Big Data, various mechanisms and models have been developed to handle and process these volumes of data in real time. This property is critical in security environments where devices are continuously transmitting data.

Regarding anomaly detection, there is currently a challenge in dealing with large volumes of information feeds in real time, which come from a wide variety of cybersecurity sources. These heterogeneous data must be processed upon receipt using models trained via unsupervised algorithms to recognise normal traffic, against which the incoming input is compared in order to classify it as normal or abnormal. These models
would follow similar training, and each one is focused on attacks from the logs of a
cybersecurity device.
The motivation of this research is to develop an unsupervised machine learning model to detect real-time anomalies in a custom environment with heterogeneous log sources that monitor communications or behaviours. The presence of different devices as input data was the trigger for gathering our own dataset, which was labelled by experts through analysis of the parameters of the different logs, and led us to use unsupervised detection techniques.
The aim of this work is therefore to present a real environment based on a scalable
open-source system that allows the management of large amounts of data in real time
by parallelising the work, to develop an unsupervised learning system that can detect
anomalies in a set of data from different devices in real time. The machine learning models
are trained with pre-processed logs representing normal traffic from each source. This
step is vital to assure a correct input to the system. The anomaly detection architecture
is based on the development of a thresholding system in conjunction with the clustering
algorithm selected after its comparison, which makes it possible to classify which data from
the different data sources are anomalous based on the clusters previously formed by the
algorithm. Metrics such as WSSSE and Silhouette are used to optimise the model via its
hyperparameters and, after the system is developed, its behaviour is evaluated. With the
tests performed, whether the model correctly classifies data as anomalous/non-anomalous
is determined in order to obtain a comparison among the different algorithms applied.
Finally, the devices mentioned above that provide the data sources are integrated into
the system, as well as a module where the results can be stored, managed and analysed.
With this proposal, the objective is to process logs from heterogeneous cybersecurity
devices that are streaming in real time using similarly trained models to detect whether
any communication is anomalous.
To test its functioning, we also developed a use case representing a system working in a
real environment. Therefore, the article proposes the use of Apache Spark [2], as mentioned
in [3], particularly one of the Python APIs offered by Spark, PySpark [4], which includes
a library of machine learning models (MLlib) for processing streaming logs from these
heterogeneous sources. Unsupervised learning algorithms can be run and integrated
into the overall Spark ecosystem to accomplish the task of detecting anomalies in real
time. Similarly, the environment provides all necessary tools for data pre-processing and a
subsequent connection to an ELK [5] system (ElasticSearch, Logstash and Kibana) where the
result can be visualised and analysed. Three unsupervised learning models are proposed:
K-Means, Bisecting K-Means and GMM, as they are among the best known unsupervised
learning models that allow grouping data into clusters [6].
Training of the model is performed with anomaly-free datasets generated for each
source, which are different physical and logical devices (Wi-Fi, Bluetooth, Mobile Networks,
Radio Frequency, UEBA and SIEM and Firewall log sources) that represent various logs
existing in business environments. The correct modelling of the system requires pre-
processing the data with the PySpark tool mentioned earlier so that the algorithms can be
trained correctly.
The results of the research highlight the K-Means model as the optimal model for the case under study, obtaining better metrics and prediction results close to 99%.
For these reasons, the main findings of this research are as follows: the process of
obtaining datasets formed from the logs of different devices and synthetic data generation,
real-time processing of data using an ensemble of several models trained specifically for
each data type to detect anomalies.
In Section 2, we introduce a general overview of the state of the art by analysing previous related projects. In Section 3, an overview of the technologies involved in this development is provided. In Sections 4–6, the proposal, its implementation and the results obtained in the validation are presented, and the system is checked against the requirements set out in a real scenario. To sum up, Section 7 includes conclusions, exposing the advantages and disadvantages offered by the architecture as well as possible improvements and future lines of work that may result from the project.
2. Related Works
To solve the problem of detecting anomalies in real time, many studies have been
presented to create a mechanism capable of accomplishing it in an effective, simple and
powerful manner.
The nature of this problem means that most current studies need to use Big Data technologies, such as Apache Spark, because these systems enable them to handle the large amount of data that has to be processed in real time. From this point of view, the problem is usually addressed via unsupervised clustering, as there is a lack of tagged data to train a supervised algorithm.
The authors in [7] use a K-Means clustering algorithm to tag an unlabelled dataset in
order to use it as a basis for training supervised algorithms. They use service and customer
data collected from IoT networks to detect abnormal SIM cards.
The proposal from [8] makes use of principal component analysis (PCA) to reduce
dimensionality and to apply its results to a Mini Batch K-Means. This proposal allows
improvements in execution time as well as the CH metric that evaluates the formed clusters.
A proposal similar to the one described in this document, in which thresholds are used in each of the formed clusters to eliminate or detect outliers, noise or anomalies, is described in [9]: a framework for analysing large IoT datasets that parallelises the implementation of K-Means methods using Spark and afterwards applies outlier detection to remove fraudulent data.
The authors in [10] propose a method that combines clustering techniques and Sup-
port Vector Machines (SSC-OCSVM) to detect anomalies in the NSL-KDD dataset. The re-
searchers in [11] describe a system to detect anomalies from streaming IoT sensors using
statistical and deep-learning-based models.
In [12], the authors present their unsupervised anomaly detection method based on
autoencoders with Generative Adversarial Networks using five public datasets (SWaT,
WADI, SMD, SMAP and MSL) and an internal dataset. Meanwhile, in [13], they propose a real-time implementation of the isolation forest to detect anomalies in the Aero-Propulsion System Simulation dataset.
In [14], researchers use an Adversarial Auto Encoder (AAE) model in wireless spec-
trum data anomaly detection, compression and signal classification over three spectrum
datasets along with one synthetic anomaly set to test the model in a controlled environment.
Moreover, the authors in [15] describe an unsupervised anomaly detection approach based
on Bluetooth tracking data and the isolation forest algorithm.
In the Results Section, we will present a summary of the procedures of these related
research studies in comparison with our work.
An important aspect that is taken into account in various related studies is the metric
upon which the results are evaluated. As there are no labels for comparing and measuring
the performance of the model, we apply various metrics that provide an indirect evaluation
of it. In this way, studies such as [16] compile different metrics in a clustering algorithm.
Another approach is the application of non-iterative methods, such as geometric
data transformations (GTM) [17] and the successive geometric transformations model
(SGTM) [18].
It can be seen that the efforts and attempts to achieve an unsupervised anomaly detector are quite numerous. However, despite the number of proposals, a concrete architecture to solve the problem is not yet in sight. Therefore, this paper proposes an unsupervised method for anomaly detection using a paradigm based on training models with normal data and detection based on the misclassification of events relative to the clusters formed. We will also apply different metrics and methods of visualising the clusters to obtain a more robust idea of what occurs during clustering, so as to evaluate the performance of the clusters more comprehensively.
3. Unsupervised Learning
3.1. Clustering
As one of the large families of unsupervised learning, the main task of clustering
models [19] is the automatic grouping of unlabelled data to build subsets of data known as
clusters. Each cluster is a collection of data that, according to certain metrics, are similar to
each other. By extension, data belonging to different clusters have differentiating elements,
thus providing a simultaneous dual function: the aggregation of data that have similar
elements between them and the segmentation of data that have non-common characteristics.
This aggregation/segmentation that occurs when grouping data into clusters, as indicated
above, is conditional on the method of analysis or metric used to group the data. Variations
in this metric will result in different clusters, both in terms of the number and size. Therefore,
knowing and selecting the method in which the clusters are formed is vital, as using one
or the other will lead to variations in the results: Choosing a cluster formation method is
equivalent to choosing a different clustering model.
3.1.1. K-Means
The K-Means [20] clustering model is one of the simplest and most popular models of
the unsupervised learning paradigm. Its use in all types of applications and branches of
knowledge has made this model one of the most widely used and best known.
The fundamental idea of K-Means lies in correctly locating what are called centroids,
a reference where the data of a set can be compared. Various data that we want to group
together will be compared with these centroids and, depending on how similar the data
are, they will be grouped with one or the other within a cluster.
It should be noted that each cluster has only one centroid; therefore, a centroid can be
seen as a centre of masses.
To measure the distance between each point and the centroids, two methods can be
used [21]:
• The cosine distance between two points is calculated using the angle between the vectors obtained from them. Given that X is an m × n data matrix that can be decomposed into m 1 × n row vectors (x_1, ..., x_m), the cosine distance between vectors x_s and x_t is as follows:

$d = 1 - \frac{x_s x_t'}{\sqrt{(x_s x_s')(x_t x_t')}}$ (1)

• The Euclidean distance between points a and b is computed as follows:

$d = \sqrt{\sum_{j=1}^{k} (a_j - b_j)^2}$ (2)
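As an illustration, the two distance measures can be written directly in NumPy (a sketch for clarity only; the article's implementation relies on PySpark MLlib):

```python
import numpy as np

def cosine_distance(xs, xt):
    """Cosine distance between two row vectors, as in Equation (1)."""
    return 1.0 - np.dot(xs, xt) / np.sqrt(np.dot(xs, xs) * np.dot(xt, xt))

def euclidean_distance(a, b):
    """Euclidean distance between two points, as in Equation (2)."""
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
d_euc = euclidean_distance(a, b)   # sqrt(2) for orthogonal unit vectors
d_cos = cosine_distance(a, b)      # 1.0 for orthogonal vectors
```

Note that the cosine distance is 0 for collinear vectors regardless of their magnitude, whereas the Euclidean distance grows with the gap between coordinates; this difference is why normalisation matters for distance-based clustering.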
An important aspect to stress is the need to normalise the data before running them through the model, as K-Means uses distances to generate clusters.
Algorithm 1 K-Means
1: Prefix the number of clusters (k).
2: Randomly choose k centroids among the points from the dataset (D).
3: ∀ x ∈ D → Measure the Euclidean or Cosine distance to the centroids.
4: ∀ x ∈ D → x ∈ nearest cluster.
5: Mean of each cluster = new centroids.
6: Run 3, 4 and 5 again until the new value of each centroid no longer varies with respect to the previous iteration (or varies less than a preset tolerance).
The failure to normalise it results in data with high numerical values being weighted
more heavily than others and vice versa, resulting in poor measurements and therefore
poor clustering.
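The steps of Algorithm 1 can be sketched in a few lines of NumPy (illustrative only; the paper uses the MLlib KMeans implementation, and the empty-cluster guard and random initialisation below are simplifications of this sketch):

```python
import numpy as np

def kmeans(D, k, tol=1e-4, max_iter=100, seed=0):
    """Minimal sketch of Algorithm 1 using the Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]   # step 2
    for _ in range(max_iter):
        # steps 3-4: assign every point to its nearest centroid
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 5: new centroids are the means of each cluster
        new_centroids = centroids.copy()
        for c in range(k):
            members = D[labels == c]
            if len(members):
                new_centroids[c] = members.mean(axis=0)
        # step 6: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

# two well-separated synthetic blobs
rng = np.random.default_rng(42)
D = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centroids, labels = kmeans(D, k=2)
```

On this toy input the two recovered centroids sit near (0, 0) and (5, 5), matching the generating blobs.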
- Step E, for calculating the probability that a point (x_i) belongs to each cluster c_k:

$r_{ic} = \frac{\text{Prob. of } x_i \text{ belonging to } c}{\text{Sum of prob. of } x_i \text{ belonging to } c_1, c_2, \ldots, c_k}$ (3)

- Step M, for updating the values $\mu$, $\Sigma$ and $\Pi$ (representing the point density of a distribution):

$\Pi = \frac{\text{Number of points assigned to a cluster}}{\text{Total number of points}}$ (4)

$\mu = \frac{\sum r_{ic} x_i}{\text{Number of points allocated to a cluster}}$ (5)
Both steps will be performed iteratively, optimising the parameters and maximising
the associated likelihood function.
In a nutshell, the GMM algorithm (Algorithm 4) works as follows:
Algorithm 4 GMM
1: Number of clusters (k) selected.
2: Set randomly parameters of the different distributions.
3: Calculate the likelihood of the Gaussians with the data in the dataset.
4: Maximise the log-likelihood function by optimising parameters.
5: Iterate steps 3 and 4 until the indicated number of iterations is completed or a given
tolerance is reached.
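The E and M steps of Algorithm 4 can be sketched for a one-dimensional mixture of k Gaussians (a simplified illustration; the quantile-based initialisation is an assumption of this sketch, not part of the original method, and the paper's implementation uses MLlib's GaussianMixture):

```python
import numpy as np

def gmm_em_1d(x, k=2, iters=60):
    """Sketch of the EM loop behind Algorithm 4 for 1-D data."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # spread-out initial means
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step (Equation (3)): responsibilities; the Gaussian normalising
        # constant 1/sqrt(2*pi) cancels when each row is normalised.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step (Equations (4) and (5)): update weights, means and spreads
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, 200), rng.normal(10.0, 0.5, 200)])
pi, mu, sigma = gmm_em_1d(x)
```

On this balanced two-component sample, EM recovers means near 0 and 10 with mixing weights near 0.5 each.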
Algorithm 5 PCA
1: Calculate covariance matrix (C).
2: Calculate eigenvalues and eigenvectors of C.
3: Select the m eigenvectors with the highest eigenvalue (m, the number of dimensions to
reduce C to).
4: Project the data onto the selected eigenvectors.
5: Result: Data reduced to m dimensions.
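Algorithm 5 maps almost line by line onto NumPy (a sketch; MLlib's PCA transformer performs the equivalent operation at scale on DataFrames):

```python
import numpy as np

def pca(X, m):
    """Dimensionality reduction to m components, following Algorithm 5."""
    Xc = X - X.mean(axis=0)               # centre the data first
    C = np.cov(Xc, rowvar=False)          # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # step 2 (eigh: C is symmetric)
    order = np.argsort(eigvals)[::-1]     # step 3: top-m eigenvectors
    W = eigvecs[:, order[:m]]
    return Xc @ W                         # step 4: projection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2.0 * X[:, 0]                   # make one dimension redundant
Z = pca(X, m=2)                           # step 5: data reduced to 2 dims
```

The first projected component carries at least as much variance as the second, as expected from sorting the eigenvalues in descending order.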
3.2.2. ISOMAP
ISOMAP is another dimensional reduction algorithm [26], which is part of what is
called manifold learning, a mathematical space where Euclidean space is recreated locally
but not globally. This implies that the points of the dataset that are distributed in the space
will be conditioned by a hyperplane of a certain shape, which may prevent determining
the distance between two points from necessarily following a straight line. For a more
accurate description of the distance/similarity between two points, it is necessary to
traverse the dimensional space of the manifold and measure their distances using a geodesic
of that space.
When dealing with high-dimensional data, the assumption that the data lies within
Euclidean space is not always true. Isomap therefore adopts the premise that the data
to be reduced belongs to a manifold, so it will perform the dimensional reduction in a
non-linear manner. This method will try to preserve the geodesics found in the manifold
when projecting it in a lower dimension. To achieve this, Isomap will first create a graph
with the shape of the manifold from clustering algorithms such as K-Nearest Neighbours
(KNN). Once the network is formed, it calculates the geodesics from the distance of the
nodes in the graph. Then, it uses eigenvectors and eigenvalues to make a projection on the
eigenvectors with the highest eigenvalue and, thus, reduces it dimensionally.
Analysing the details of this method, the procedure to undertake the dimensional
reduction is described as follows (Algorithm 6).
Algorithm 6 ISOMAP
1: Determine the neighbourhood of each point.
2: Construct a manifold graph, connecting each point to the nearest neighbour.
3: Calculate the minimum distance between two nodes of the graph using the Dijkstra
algorithm, obtaining a matrix with geodesic distances of the points in the manifold.
4: Projection of the data: the distance matrix is squared, double centred, and the eigen-
value decomposition of a matrix is computed.
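Steps 1–3 of Algorithm 6 can be illustrated on a toy dataset (for brevity this sketch uses Floyd–Warshall instead of Dijkstra for the all-pairs shortest paths; the resulting geodesic-distance matrix is the same):

```python
import numpy as np

def geodesic_distances(X, n_neighbors=2):
    """Steps 1-3 of Algorithm 6: build a kNN graph, then compute all-pairs
    shortest paths as an approximation of the manifold's geodesics."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):                    # connect each point to its neighbours
        for j in np.argsort(D[i])[1:n_neighbors + 1]:
            G[i, j] = G[j, i] = D[i, j]
    for k in range(n):                    # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    return G

# four points on a line: the geodesic 0 -> 3 is the sum of the hops
X = np.array([[0.0], [1.0], [2.0], [3.0]])
G = geodesic_distances(X, n_neighbors=1)
```

With one neighbour per point the graph becomes the chain 0-1-2-3, so the geodesic distance between the endpoints is the sum of the three hops.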
Algorithm 7 t-SNE
1: Measure similarity of data in high-dimensional space: assign each point a Gaussian
distribution with a given standard deviation. Points that are close to each other will
have a high density value in that distribution, while distant points will have a low
density value.
2: Construct a similarity matrix in high-dimensional space.
3: Data are randomly projected from high-dimensional space to low-dimensional space.
4: Calculate the similarity of the data in low-dimensional space.
5: Construct a similarity matrix in low-dimensional space.
6: Make the low-dimensional matrix values as similar as possible to the high-dimensional matrix by applying the Kullback–Leibler divergence metric and gradient descent. This causes similar points to be grouped together and separated from the rest of the points that are not similar.
As a result, a representation of the data is presented by taking into account the possible
distribution that could occur in the high-dimensional space.
t-SNE has an associated hyperparameter (perplexity) [28] that determines the value of
the standard deviation of the distributions used to perform the similarity calculation.
3.3. Metrics
One of the key aspects when designing and evaluating a learning model is to check its
performance, operation or accuracy. In order to know these characteristics of the model,
it is necessary to carry out different tests to check the real performance of the algorithm,
applying a metric that allows measuring a value or property to compare different models
and represents a feature of the algorithm’s operation or result.
However, estimating how well a machine learning model is working becomes com-
plicated when working within the unsupervised learning paradigm. Despite not having
a ground truth for comparisons, there are several metrics that can be used to infer how a
model is working. The metrics shown below are mainly intended for clustering algorithms,
which is the type of model used in this proposal.
3.3.1. WSSSE

The Within Set Sum of Squared Errors (WSSSE) measures the compactness of the clustering as the sum of squared deviations of each point from its cluster centroid:

$\sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{p} (x_{ij} - \overline{x}_{kj})^2$ (6)
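Equation (6) translates directly into code (a sketch; MLlib exposes an equivalent value for a trained K-Means model, e.g. through the model summary's training cost):

```python
import numpy as np

def wssse(X, labels, centroids):
    """Within Set Sum of Squared Errors, Equation (6)."""
    return float(sum(((X[labels == k] - c) ** 2).sum()
                     for k, c in enumerate(centroids)))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
cost = wssse(X, labels, centroids)   # each point lies 1 unit from its centroid
```

Here every point sits exactly one unit from its centroid, so the four squared errors sum to 4.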
3.3.2. Silhouette
The Silhouette is another measure of how good a clustering model is. In essence, it is a metric that captures the cohesion within a cluster as well as its separation from other clusters, thus measuring how well a datum is classified in a cluster. To obtain this information [32], the following distances need to be calculated:
• Mean distance between a datum with respect to all other points belonging to the same
cluster: This distance is called the mean intra-cluster distance.
• Mean distance between a datum with respect to the rest of the points belonging to the
next nearest cluster: This distance is called the mean nearest-cluster distance.
The range of values that this metric can take is [−1, 1], where high values indicate that the data are well assigned (high cohesion with its cluster and high separation from other clusters) and vice versa.
The mathematical expression representing this metric is as follows, where o denotes
an observation, a denotes the mean intra-cluster distance and b denotes the mean nearest-
cluster distance [33].
$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$ (7)
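Equation (7), averaged over all observations, can be computed as follows (a small-scale sketch; MLlib's ClusteringEvaluator computes the Silhouette on DataFrames at scale):

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean Silhouette coefficient over all observations (Equation (7))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        a = D[i, own].sum() / (own.sum() - 1)     # mean intra-cluster distance
        b = min(D[i, labels == c].mean()          # mean nearest-cluster distance
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# two tight, well-separated clusters -> score close to 1
X = np.array([[0.0], [0.1], [10.0], [10.1]])
labels = np.array([0, 0, 1, 1])
score = silhouette_score(X, labels)
```

Note that this simple version assumes every cluster has at least two members; a singleton cluster would make the intra-cluster mean undefined.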
4. Proposal
For this development, we start from the need to control heterogeneous environments
with data sources that stream in real time. This information flow can represent normal
traffic or attacks and, therefore, needs to be analysed and classified upon receipt.
When there are different devices simultaneously emitting amounts of data in a secure
environment, it is essential to have a system that reacts immediately to incoming data in
the event that these logs contain potentially anomalous characteristics.
Due to the possible lack of logs in some of the devices that make up a heterogeneous environment, and the difficulty of extracting features common to anomalous traffic, training the system directly to detect anomalous entries is difficult. Instead, a model is previously trained to recognise normal traffic in the environment so that, via unsupervised algorithms, logs that do not resemble the training logs are automatically identified. The definition of anomalies in this context is conditioned by these training data, because it is determined by a threshold on the Euclidean distance that takes into consideration the centre of each cluster and the position beyond which any datum can be considered an anomaly.
This proposal follows the architecture described in Figure 1 and the structure below
(Algorithm 8).
Algorithm 8 Proposal
1: Let n be the number of devices analysed in an environment.
2: Let i be a device from the set of n devices analysed.
3: Let Fi be the flow of information coming in real time from device i.
4: Let m be the number of machine learning models trained.
5: Let j be a model from the set of algorithms, applying function M j .
6: M j ( Fi ) = {0, 1} | 0: normal traffic; 1: anomalous log.
In Figure 1, the pre-processing of the flow of information for selecting the best features
that will be introduced to the model is shown, and the generation of synthetic data for
devices that do not have enough data to be trained is also demonstrated. These processes
will be explained in the context of the use case presented in the next section, where we
introduce the solution proposed for the problem described.
Figure 1. Systems involved in the architecture of the proposed model for anomaly classification
(pre-processing, real-time processing, training and validation and synthetic data generation).
5. Designed Solution
As mentioned earlier, the aim of this study focuses on the use of unsupervised clus-
tering algorithms to detect anomalies in real time. The main reason for using clustering
algorithms is that, as seen in Section 2 in Related Work, they perform well in dealing
with these types of situations. To achieve this, it is first necessary to provide a general
introduction to the architecture (Figure 2), where one can observe how the entire work-
flow is constructed from the reception of the raw data to the classification of events by
different models.
Essentially, the machine learning system is responsible for detecting possible anoma-
lies in the system’s input data, which comes from heterogeneous sources. This information
consists of sets of values in various formats that are fed into the Kafka Streaming mod-
ule [35]. These data must be pre-processed by mathematical functions to transform the
input values so that the machine learning algorithms embedded in the real-time processing
subsystem and the training and validation subsystem can process them.
The machine learning algorithms must be previously trained with a set of data with the
same characteristics as the data provided by the devices, and this is explained in Section 5.1.
In the training and validation subsystem, machine learning algorithms are trained.
These algorithms are trained using various unsupervised methods. If existing datasets are
not sufficient for training and validation, the synthetic data generation subsystem is used
to generate a pair of normal/abnormal datasets so that the models can be properly trained
and validated. Once machine learning algorithms have been trained, they are moved to
the production phase. This is characterised by the real-time processing subsystem, which
enables the processing of data flow from devices and identifies possible anomalies within
the flow. The result is then stored in the anomaly database.
Each of these modules sends the data it generates via the Kafka Flow Management Subsystem, which uses differentiated topics (one per device), with each learning model subscribing to its corresponding topic.
Therefore, one unsupervised learning model is developed, trained and validated per
topic (after selecting one of the possible models) and, thus, per the type of data source.
Some of these devices have been previously tested in other projects [36,37].
Table 1. Cont.
All these configurations are related to Trumania [38], a library for the generation of realistic datasets, which provides the necessary tools for the synthetic generation of data, the definition of the internal structures of the dataset and the prevention or enablement of the generation of particular events. The generated data meet the following requirements:
• Suitable types for all values of the dataset that facilitate their subsequent modification;
• Non-uniform structure for the generation of events;
• Plausible time structure in relation to real normal and abnormal scenarios;
• The impossibility of generating events that cannot occur in real environments.
The output data of the subsystem correspond to a pair of records generated with
normal characteristics and a record with abnormal characteristics. The size of these created
records is about 100,000 entries, and they can be varied in any order.
The characteristics of the generated datasets are identical to those generated and sent
by the specific device with which we are up-sampling.
5.3. Preprocessing
This subsystem is responsible for normalising, transforming and standardising the
values of the dataset before training the machine learning algorithms and real-time process-
ing. Data preprocessing is defined in Figure 4 and divided into the different modules that
make it up.
The input data of this system consist of a set of data values from each device. This information is first structured so that the type of each datum is defined. This structuring process is performed by defining a schema at the beginning of the data load that contains the names of the attributes of the device data to which it refers, followed by the definition of the type of value contained in these attributes [39].
Then, preprocessing modules offered by Apache Spark and the respective functions
defined for each type of event appear (String Indexer [40], Min-Max Scaler [41], Stan-
dard Scaler [42], One Hot Encoder [43], Hash Encoder [44], Regex Tokenizer [45], Count
Vectorizer [46], TF-IDF [47], Word2Vec [48] and Vector Assembler [49]), performing rele-
vant transformations and adjusting the data as a result. Some examples are presented in
Figures 5–8.
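As a rough, plain-Python illustration of what two of these transformers do (an analogue, not the Spark API itself; Spark's StringIndexer orders labels by descending frequency by default, and breaking ties alphabetically here is an assumption of this sketch):

```python
def string_index(values):
    """Plain-Python analogue of Spark's StringIndexer: labels are ordered
    by descending frequency and mapped to 0, 1, 2, ... (ties: alphabetical)."""
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    order = sorted(freq, key=lambda v: (-freq[v], v))
    mapping = {v: i for i, v in enumerate(order)}
    return [mapping[v] for v in values], mapping

def min_max_scale(values, lo=0.0, hi=1.0):
    """Plain-Python analogue of Spark's MinMaxScaler: rescale to [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0            # guard for constant columns
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

protocols = ["tcp", "udp", "tcp", "icmp", "tcp"]
indexed, mapping = string_index(protocols)     # most frequent label -> 0
scaled = min_max_scale([10.0, 20.0, 30.0])     # -> values in [0, 1]
```

In the actual system these transformations are chained as Spark pipeline stages, so the same fitted pipeline can be reused both at training time and on the streaming input.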
The training and validation subsystem is responsible for preparing and monitoring
the correct functioning of the machine learning model for anomaly detections. This is
performed by using input data consisting of training data and validation data.
The training data are preprocessed by the preprocessing subsystem, which performs
the above transformations depending on the type of event. For the identification of the
best machine learning model for a given device type, hyperparameter selection modules,
metrics and validation data are used.
The hyperparameter selection module is responsible for determining which feature
set of a given machine learning model architecture is best suited for anomaly detection.
Hyperparameters are selected based on the results provided by the metrics module. Therefore,
the evaluation of the machine learning model for each set of devices is performed by the
metrics module together with the validation dataset. This module allows the evaluation of
the accuracy of an anomaly detection algorithm by applying a set of mathematical functions.
The mathematical functions chosen to determine the best hyperparameters for the model were the WSSSE and Silhouette metrics, used under the criterion of the 'elbow point' of the graph when selecting hyperparameters. This consists of plotting the variation of the metric as the hyperparameter value varies; from the graph, we select the point where the change in slope occurs.
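A simple numeric reading of the elbow criterion picks the hyperparameter value with the largest change in slope, i.e. the largest second difference of the metric (this formalisation of the visual rule is an assumption of the sketch, not a procedure stated in the paper):

```python
def elbow_point(ks, metric):
    """Pick the hyperparameter value where the slope change is largest."""
    best_k, best_bend = ks[1], 0.0
    for i in range(1, len(ks) - 1):
        # second difference: (drop before this point) - (drop after it)
        bend = (metric[i - 1] - metric[i]) - (metric[i] - metric[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = [2, 3, 4, 5, 6, 7]
wssse_values = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
print(elbow_point(ks, wssse_values))  # → 3
```

The metric drops sharply until k = 3 and then flattens, so the bend (and hence the selected value) is at k = 3.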
Once the most accurate model has been obtained, the final model is determined for
each device. The result, after going through the training and validation subsystem, is the
different final models created for each device in the system.
• Assigning the optimal hyperparameters for each model depending on WSSSE, Silhouette and training time;
• Comparing the results of the metrics with each other and selecting the model with the best result;
• Training the selected model with the dataset of events considered normal, creating different clusters and setting a threshold for each;
• By default, the threshold is equal to the distance of the furthest point from the centroid of the cluster to which it belongs. A threshold value is set for each cluster created, and the value can be changed as needed.
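The default thresholding rule described above can be sketched as follows (illustrative; in the actual system the distances are computed over PySpark feature vectors):

```python
import numpy as np

def fit_thresholds(X, labels, centroids):
    """Default threshold per cluster: distance of the furthest training
    point to its own centroid."""
    return np.array([np.linalg.norm(X[labels == k] - c, axis=1).max()
                     for k, c in enumerate(centroids)])

def classify(x, centroids, thresholds):
    """Return 1 (anomalous) if x falls outside the threshold of its
    nearest cluster, 0 (normal) otherwise."""
    d = np.linalg.norm(centroids - x, axis=1)
    k = int(d.argmin())
    return int(d[k] > thresholds[k])

# two clusters of "normal" training points and their centroids
X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.25, 0.0], [10.25, 10.0]])
thr = fit_thresholds(X, labels, centroids)
normal = classify(np.array([0.3, 0.1]), centroids, thr)    # inside → 0
anomaly = classify(np.array([5.0, 5.0]), centroids, thr)   # far away → 1
```

Because the thresholds are learnt from normal traffic only, any log that lands outside every cluster's radius is flagged, without ever needing labelled anomalies at training time.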
Table 2. Cont.
The high computational cost of training with the bisecting K-Means clustering model
automatically implies that it is rejected as a solution for this proposal.
6. Results
6.1. Model Comparison
The comparison of the three models associated with each device is displayed after selecting the hyperparameters for each of them. The purpose of this comparison is to select the best model for each data source, taking into account the three metrics used: WSSSE, Silhouette and training time. For each of the three models per device, the same dataset and hardware/software environment were used so that only aspects related to the performance of the models themselves are considered.
From the results presented in Table 3, the K-Means model performs equally to or similarly to the bisecting K-Means model in most cases, but reduces training time significantly, by roughly a factor of 10.
The GMM model appears to be the worst: in most tests, the Silhouette metric shows that its performance is significantly inferior to both K-Means and bisecting K-Means. This can be explained by the tendency of GMM to become trapped in local minima, which produces those Silhouette values. Nevertheless, its training time is similar to that of K-Means. Another advantage of K-Means over the other two models is that it does not suffer from the convergence or training-time problems that prevent the use of GMM and bisecting K-Means for some devices. Therefore, the model that works best for each device is K-Means, and this model is used for anomaly detection.
Table 3. Cont.

Device      Model                 WSSSE      Silhouette   Training Time
Bluetooth   Bisecting K-Means     3.47       0.62         7.01 s
Bluetooth   GMM                   None       0.57         0.82 s
Bluetooth   K-Means               8571.55    0.50         1.13 s
WiFi        Bisecting K-Means     -          -            >>60 s
WiFi        GMM                   None       0.2          2.81 s
WiFi        K-Means               1.99       0.71         3.35 s
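For readers unfamiliar with the Silhouette metric reported in Table 3, a simplified pure-Python version of it can be sketched as below. This is an assumed, brute-force illustration with Euclidean distance, not the Spark implementation used in the paper, and the sample points are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean Silhouette coefficient: for each point, compare its mean
    intra-cluster distance (a) with the mean distance to the nearest
    other cluster (b); the per-point score is (b - a) / max(a, b)."""
    score = 0.0
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(dist(p, q) for q in same) / len(same)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        score += (b - a) / max(a, b)
    return score / len(points)

# Two well-separated clusters: the score is close to 1
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
print(round(silhouette(points, labels), 3))
```

Values near 1 indicate compact, well-separated clusters, which is why the low Silhouette scores of GMM in Table 3 point to poor cluster quality.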
For the test, two sets of samples are used for each device: the first set is considered normal and the second anomalous.
Table 4 shows the result of measuring the accuracy of the system in an environment where normal data and data with possible anomalies have been defined. The results show that almost all data considered normal, and almost all data with possible anomalies, are classified as such.
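The accuracy measurement described above can be expressed as a short sketch. This is an assumed formulation, not the authors' code: given the detector's boolean decisions over the two labelled test sets, accuracy is the fraction of correct decisions, and the flag lists below are hypothetical.

```python
def accuracy(normal_flags, anomalous_flags):
    """Fraction of correct decisions over both test sets: samples from the
    normal set should be flagged False, samples from the anomalous set True."""
    correct = normal_flags.count(False) + anomalous_flags.count(True)
    total = len(normal_flags) + len(anomalous_flags)
    return correct / total

# Hypothetical detector outputs for 8 normal and 8 anomalous samples
normal_flags = [False] * 7 + [True]   # one false positive
anomalous_flags = [True] * 8          # all anomalies detected
print(accuracy(normal_flags, anomalous_flags))  # -> 0.9375
```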
It should be noted that all these results depend on the training and validation data
provided/generated by each device. Although the data are promising, further testing is
required to substantiate the results.
Figure 11. Clustering representation of the information from each device. (a) Mobile Network. (b) Ra-
dio Frequency. (c) Bluetooth. (d) WiFi. (e) Firewall Logs. (f) SIEM logs. (g) UEBA/Activity Track.
(h) UEBA/Browsers. (i) UEBA/Files. (j) UEBA/Network. (k) UEBA/Process. (l) UEBA/Sockets.
7. Conclusions
Anomaly detection is a commonly addressed issue in recent cybersecurity research and is examined through various approaches. For the real-time processing of these incidents, unsupervised methods are very useful when the data containing outliers are heterogeneous or specific to a certain device.
To do so, we have designed a real-time solution to deal with anomalies from various
sources in a heterogeneous real environment.
After outlining all the modules and processes that make up the real-time anomaly
detection system and conducting various tests with respect to its detection capability and
performance, we determined that the best of the proposed algorithms for tackling the threshold detection problem is K-Means, which sometimes performs equivalently to bisecting K-Means but has better training times. GMM performed the worst, scoring the lowest in the Silhouette metric. In the clusters obtained with K-Means, the configured threshold determines the difference between normal traffic and anomalies.
The test conducted to check the system's ability to detect anomalies showed that the detection of normal and anomalous events performed acceptably, with an accuracy metric close to 99%. The UEBA–Browsers model had the lowest result, 86% for anomaly detection, while the Process model was the most accurate at identifying anomalous logs.
Author Contributions: Conceptualization, C.S.-Z., X.L.-N. and V.A.V.; methodology, C.S.-Z., X.L.-N.
and V.A.V.; software, C.S.-Z. and X.L.-N.; validation, C.S.-Z., X.L.-N. and V.A.V.; formal analysis,
C.S.-Z., X.L.-N. and V.A.V.; investigation, C.S.-Z. and X.L.-N.; resources, C.S.-Z.; data curation,
C.S.-Z. and X.L.-N.; writing—original draft preparation, C.S.-Z.; writing—review and editing, C.S.-Z.,
X.L.-N., V.A.V., M.S.R. and J.I.M.; visualization, C.S.-Z.; supervision, X.L.-N., V.A.V., M.S.R. and J.I.M.;
project administration, V.A.V.; funding acquisition, V.A.V. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was partially supported by the Ministerio de Defensa of the Spanish Govern-
ment within the frame of PLICA project (Ref. 1003219004900-Coincidente).
Institutional Review Board Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160.
[CrossRef] [PubMed]
2. Apache Spark™—Unified Engine for Large-Scale Data Analytics. Available online: https://spark.apache.org/ (accessed on
9 December 2021).
3. Larriva-Novo, X.A.; Vega-Barbas, M.; Villagrá, V.A.; Sanz Rodrigo, M. Evaluation of Cybersecurity Data Set Characteristics
for Their Applicability to Neural Networks Algorithms Detecting Cybersecurity Anomalies. IEEE Access 2020, 8, 9005–9014.
[CrossRef]
4. Zanid Haytam. Outliers Detection in Pyspark #3—K-Means. Available online: https://blog.zhaytam.com/2019/08/06/outliers-detection-in-pyspark-3-k-means/ (accessed on 24 August 2022).
5. The ELK Stack: From the Creators of Elasticsearch | Elastic. Available online: https://www.elastic.co/es/what-is/elk-stack (accessed on 15 August 2022).
6. Jawale, A.; Magar, G. Survey of Clustering Methods for Large Scale Dataset. Int. J. Comput. Sci. Eng. 2019, 7, 1338–1344. [CrossRef]
7. Zhang, T.; Li, H.; Xu, L.; Gao, J.; Guan, J.; Cheng, X. Comprehensive IoT SIM Card Anomaly Detection Algorithm Based on Big
Data. In Proceedings of the IEEE International Conferences on Ubiquitous Computing & Communications (IUCC) and Data
Science and Computational Intelligence (DSCI) and Smart Computing, Networking and Services (SmartCNS), Shenyang, China,
21–23 October 2019.
8. Peng, K.; Leung, V.C.; Huang, Q. Clustering Approach Based on Mini Batch Kmeans for Intrusion Detection System Over Big
Data. IEEE Access 2018, 6, 11897–11906. [CrossRef]
9. Erdem, Y.; Ozcan, C. Fast Data Clustering and Outlier Detection using K-Means Clustering on Apache Spark. Int. J. Adv. Comput.
Eng. Netw. 2017, 5–7, 86–90.
10. Pu, G.; Wang, L.; Shen, J.; Dong, F. A hybrid unsupervised clustering-based anomaly detection method. Tsinghua Sci. Technol.
2021, 26, 146–153. [CrossRef]
11. Munir, M.; Siddiqui, S.A.; Chattha, M.A.; Dengel, A.; Ahmed, S. FuseAD: Unsupervised Anomaly Detection in Streaming Sensors
Data by Fusing Statistical and Deep Learning Models. Sensors 2019, 19, 2451. [CrossRef] [PubMed]
12. Audibert, J.; Michiardi, P.; Guyard, F.; Marti, S.; Zuluaga, M.A. USAD: UnSupervised Anomaly Detection on Multivariate Time
Series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20),
Virtual Event, 6–10 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 3395–3404. [CrossRef]
13. Khan, S.; Liew, C.F.; Yairi, T.; McWilliam, R. Unsupervised anomaly detection in unmanned aerial vehicles. Appl. Soft Comput.
2019, 83, 105650. [CrossRef]
14. Rajendran, S.; Meert, W.; Lenders, V.; Pollin, S. Unsupervised Wireless Spectrum Anomaly Detection with Interpretable Features.
IEEE Trans. Cogn. Commun. Netw. 2019, 5, 637–647. [CrossRef]
15. Mercader, P.; Haddad, J. Automatic incident detection on freeways based on Bluetooth traffic monitoring. Accid. Anal. Prev. 2020,
146, 105703. [CrossRef] [PubMed]
16. Palacio-Niño, J.; Galiano, F. Evaluation Metrics for Unsupervised Learning Algorithms. arXiv 2019, arXiv:1905.05667.
17. Tkachenko, R.; Izonin, I. Model and Principles for the Implementation of Neural-Like Structures Based on Geometric Data
Transformations. In International Conference on Computer Science, Engineering and Education Applications; Springer: Cham, Switzer-
land, 2018. [CrossRef]
18. Tkachenko, R. An Integral Software Solution of the SGTM Neural-Like Structures Implementation for Solving Different Data
Mining Tasks. In International Scientific Conference “Intellectual Systems of Decision Making and Problem of Computational Intelligence”;
Springer: Cham, Switzerland, 2021. [CrossRef]
19. Unsupervised Learning and Data Clustering | by Sanatan Mishra | Towards Data Science. Available online: https://
towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a (accessed on 26 August 2022).
20. Roman, V. Medium. 12 June 2019. Available online: https://medium.com/datos-y-ciencia/aprendizaje-no-supervisado-en-
machine-learning-agrupaci%C3%B3n-bb8f25813edc (accessed on 3 August 2022).
21. Bora, M.; Jyoti, D.; Gupta, D.; Kumar, A. Effect of Different Distance Measures on the Performance of K-Means Algorithm: An Experimental Study in Matlab. arXiv 2014, arXiv:1405.7471.
22. K Means Clustering | K Means Clustering Algorithm in Python. Available online: https://www.analyticsvidhya.com/blog/2019
/08/comprehensive-guide-k-means-clustering/ (accessed on 5 August 2022).
23. Understanding the Concept of Hierarchical Clustering Technique | by Chaitanya Reddy Patlolla | Towards Data Science.
Available online: https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e82437
58ec (accessed on 25 August 2022).
24. Gaussian Mixture Models | Clustering Algorithm Python. Available online: https://www.analyticsvidhya.com/blog/2019/10/
gaussian-mixture-models-clustering/ (accessed on 11 September 2022).
25. Lavrenko and Sutton. IAML: Dimensionality Reduction. 2011. Available online: http://www.inf.ed.ac.uk/teaching/courses/
iaml/2011/slides/pca.pdf (accessed on 15 August 2022).
26. Tenenbaum, J.B.; Silva, V.D.; Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction; Science: New York,
NY, USA, 2001.
27. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
28. Cao, Y.; Wang, L. Automatic Selection of t-SNE Perplexity. arXiv 2017, arXiv:1708.03229.
29. McInnes, L.; Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426.
30. Coenen, A.; Pearce, A. Understanding UMAP. Available online: https://pair-code.github.io/understanding-umap/ (accessed on
1 August 2022).
31. Finding the K in K-Means Clustering | The Data Science Lab. Available online: https://datasciencelab.wordpress.com/2013/12/
27/finding-the-k-in-k-means-clustering/ (accessed on 5 September 2022).
32. Wei, H. How to Measure Clustering Performances When There Are No Ground Truth? Available online: https://medium.com/
@haataa/how-to-measure-clustering-performances-when-there-are-no-ground-truth-db027e9a871c (accessed on 14 August 2022).
33. Chaudhary, M. Silhouette Analysis in K-Means Clustering. Available online: https://medium.com/@cmukesh8688/silhouette-
analysis-in-k-means-clustering-cefa9a7ad111 (accessed on 15 August 2022).
34. Drakos, G. Silhouette Analysis vs. Elbow Method vs. Davies-Bouldin Index: Selecting the Optimal Number of Clusters for
KMeans Clustering. Available online: https://gdcoder.com/silhouette-analysis-vs-elbow-method-vs-davies-bouldin-index-
selecting-the-optimal-number-of-clusters-for-kmeans-clustering/ (accessed on 12 August 2022).
35. Apache Kafka. Available online: https://kafka.apache.org/documentation/streams/ (accessed on 13 September 2022).
36. Alvarez-Campana, M.; López, G.; Vázquez, E.; Villagrá, V.A.; Berrocal, J. Smart CEI Moncloa: An IoT-based Platform for People
Flow and Environmental Monitoring on a Smart University Campus. Sensors 2017, 17, 2856. [CrossRef] [PubMed]
37. Vega-Barbas, M.; Álvarez-Campana, M.; Rivera, D.; Sanz, M.; Berrocal, J. AFOROS: A Low-Cost Wi-Fi-Based Monitoring System
for Estimating Occupancy of Public Spaces. Sensors 2021, 21, 3863. [CrossRef] [PubMed]
38. Sv3ndk, Milanvdm, FHachez, Thomas-jakemeyn, Petervandenabeele. Trumania. 2020. Available online: https://github.com/
RealImpactAnalytics/trumania (accessed on 12 August 2022).
39. Larriva-Novo, X.; Vega-Barbas, M.; Villagrá, V.A.; Rivera, D.; Álvarez-Campana, M.; Berrocal, J. Efficient distributed preprocessing
model for machine learning-based anomaly detection over large-scale cybersecurity datasets. Appl. Sci. 2020, 10, 3430. [CrossRef]
40. StringIndexer—PySpark 3.3.0 Documentation. Available online: https://spark.apache.org/docs/latest/api/python/reference/
api/pyspark.ml.feature.StringIndexer.html (accessed on 13 September 2022).
41. MinMaxScaler—PySpark 3.3.0 Documentation. Available online: https://spark.apache.org/docs/latest/api/python/reference/
api/pyspark.ml.feature.MinMaxScaler.html (accessed on 13 September 2022).
42. StandardScaler—PySpark 3.3.0 Documentation. Available online: https://spark.apache.org/docs/latest/api/python/reference/
api/pyspark.ml.feature.StandardScaler.html (accessed on 13 September 2022).
43. OneHotEncoder—PySpark 3.3.0 Documentation. Available online: https://spark.apache.org/docs/latest/api/python/
reference/api/pyspark.ml.feature.OneHotEncoder.html (accessed on 13 September 2022).
44. FeatureHasher—PySpark 3.1.3 Documentation. Available online: https://spark.apache.org/docs/3.1.3/api/python/reference/
api/pyspark.ml.feature.FeatureHasher.html (accessed on 13 September 2022).
45. RegexTokenizer—PySpark 3.1.3 Documentation. Available online: https://spark.apache.org/docs/3.1.3/api/python/reference/
api/pyspark.ml.feature.RegexTokenizer.html (accessed on 13 September 2022).
46. CountVectorizer—PySpark 3.1.3 Documentation. Available online: https://spark.apache.org/docs/3.1.3/api/python/reference/
api/pyspark.ml.feature.CountVectorizer.html (accessed on 13 September 2022).
47. IDF—PySpark 3.1.3 Documentation. Available online: https://spark.apache.org/docs/3.1.3/api/python/reference/api/
pyspark.ml.feature.IDF.html (accessed on 13 September 2022).
48. Word2Vec—PySpark 3.1.3 Documentation. Available online: https://spark.apache.org/docs/3.1.3/api/python/reference/api/
pyspark.ml.feature.Word2Vec.html (accessed on 13 September 2022).
49. VectorAssembler—PySpark 3.1.3 Documentation. Available online: https://spark.apache.org/docs/3.1.3/api/python/reference/
api/pyspark.ml.feature.VectorAssembler.html (accessed on 13 September 2022).