Batch 7 Conference Paper
A. Feature Selection

● False Positive Rate quantifies the rate at which normal activities are incorrectly identified as anomalies. This metric is crucial for maintaining operational efficiency and trust in the anomaly detection system, as a high false positive rate can lead to alert fatigue, where users become desensitized to alerts due to their frequent occurrence. Monitoring the false positive rate is essential for fine-tuning the model and ensuring that it remains reliable in real-world applications.

In addition to these metrics, other considerations such as computational efficiency, model interpretability, and robustness to noise are also vital in the evaluation process.
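As a concrete illustration (not from the paper itself), the false positive rate is FP / (FP + TN), the share of normal samples flagged as anomalies; a minimal sketch with toy labels:

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN): fraction of normal samples (label 0) flagged as anomalies (label 1)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

# 10 normal samples, 2 incorrectly flagged as anomalies -> FPR = 0.2
y_true = [0] * 10 + [1] * 2
y_pred = [1, 1] + [0] * 8 + [1, 1]
print(false_positive_rate(y_true, y_pred))  # 0.2
```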
C. Implementation of Cross-Validation Techniques

In this study, k-fold cross-validation is employed as the primary method for assessing model performance. In k-fold cross-validation, the dataset is divided into k equal-sized folds. The model is trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, with each fold serving as the validation set once. By averaging the performance metrics across all folds, we obtain a more robust estimate of the model's generalization capabilities. A typical choice for k is 5 or 10, balancing computational efficiency with reliability in performance estimation.

Additionally, stratified cross-validation can be utilized to ensure that each fold maintains the original class distribution of the dataset. This is particularly important in anomaly detection tasks, where the class distribution may be imbalanced, as it helps provide a more realistic evaluation of the model's performance across different classes.
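The stratified k-fold procedure can be sketched with scikit-learn; the data, classifier, and k=5 below are illustrative placeholders, not the paper's actual configuration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder features
y = np.array([0] * 180 + [1] * 20)     # imbalanced labels, as in anomaly data

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # each fold preserves the 90/10 class ratio of the full dataset
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(f"mean accuracy over 5 folds: {np.mean(scores):.2f}")
```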
D. Focus on False Positives

Minimizing the false-positive rate is crucial in anomaly detection systems because excessive false alarms can lead to alert fatigue among security personnel, causing them to overlook genuine threats. This is particularly problematic in high-volume environments, where a continuous stream of alerts can dilute the team's focus and responsiveness. During model evaluation, special attention is given to this metric by employing strategies such as threshold tuning and cost-sensitive learning, which prioritize true positive identification while minimizing false alarms. Effective management of false positives ensures that alerts remain actionable, enhancing the overall reliability and effectiveness of the anomaly detection system in operational settings.
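One common form of threshold tuning, sketched here on synthetic anomaly scores (this is an illustrative approach, not the paper's exact procedure): pick the score cutoff so that only a chosen fraction of normal traffic exceeds it.

```python
import numpy as np

def tune_threshold(scores, y_true, max_fpr=0.05):
    """Choose a score threshold so that roughly max_fpr of normal samples exceed it."""
    normal = scores[y_true == 0]
    # threshold at the (1 - max_fpr) quantile of normal scores
    return np.quantile(normal, 1 - max_fpr)

rng = np.random.default_rng(1)
# 950 normal flows scoring near 0, 50 anomalous flows scoring near 4
scores = np.concatenate([rng.normal(0, 1, 950), rng.normal(4, 1, 50)])
y_true = np.array([0] * 950 + [1] * 50)

thr = tune_threshold(scores, y_true, max_fpr=0.05)
fpr = np.mean(scores[y_true == 0] > thr)
print(f"threshold={thr:.2f}, FPR={fpr:.3f}")  # FPR close to the 0.05 budget
```

Raising `max_fpr` trades fewer missed attacks for more false alarms, which is exactly the balance discussed above.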
IV. Machine Learning Models

This section explains implementation details related to the developed system entities, the graphical user interface, and the database used to store probing information collected from the ISP network.

A. K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a straightforward and widely used classification algorithm that classifies data points based on the majority class of their k-nearest neighbors. Its simplicity and effectiveness make it useful for anomaly detection, as it can identify outliers by recognizing instances that do not belong to known clusters. However, KNN faces several challenges, especially when applied to large datasets.

One significant issue is its high computational complexity. KNN requires calculating the distance between a query point and all other points in the dataset, which becomes slow and resource-intensive with large-scale data. This is compounded by the curse of dimensionality, where high-dimensional data causes distances between points to become less meaningful, diminishing KNN's accuracy.

The algorithm is also highly sensitive to feature scaling. Since KNN relies on distance metrics, features with larger ranges can dominate the classification process, leading to biased results. Proper normalization is essential to mitigate this effect. Additionally, KNN's memory requirements can be substantial, as it stores the entire dataset to make predictions, which is a challenge with large network traffic datasets.

Another critical factor is the choice of k, the number of neighbors considered. A small k may result in overfitting to noise, while a large k can cause the model to overlook outliers. Balancing this parameter is key to effective anomaly detection.

Finally, KNN struggles with imbalanced datasets, a common issue in network security, where normal traffic far outweighs anomalies. In such cases, the majority class can overshadow minority (anomalous) instances, leading to misclassifications. Techniques like weighted distance metrics or oversampling can help address this problem.

Despite these limitations, KNN's simplicity and interpretability make it a valuable tool, especially when combined with strategies to reduce computational costs and improve performance on imbalanced data.
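The majority-vote classification and the need for feature scaling can be sketched as follows; the toy flow features and labels are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# toy stand-in for flow features; scaling matters because KNN is distance-based
X_train = np.array([[1.0, 100], [1.2, 110], [0.9, 95], [8.0, 900], [8.5, 950]])
y_train = np.array([0, 0, 0, 1, 1])   # 0 = normal, 1 = anomalous

scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_train), y_train)

# each query is labelled by the majority class of its 3 nearest neighbors
X_new = np.array([[1.1, 105], [8.2, 920]])
print(knn.predict(scaler.transform(X_new)))  # [0 1]
```

Without the `StandardScaler`, the second feature's much larger range would dominate the distance computation, which is the scaling bias described above.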
B. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a powerful unsupervised algorithm that is particularly well-suited for anomaly detection due to its ability to identify outliers. Unlike traditional clustering algorithms that require specifying the number of clusters beforehand, DBSCAN determines clusters based on the density of data points. It groups data points that are closely packed together, while those that are sparsely distributed are classified as noise or anomalies. This feature makes DBSCAN highly effective in detecting anomalies in network traffic, where unusual patterns do not fit neatly into predefined groups.

One of the key advantages of DBSCAN is its ability to handle non-linear cluster shapes. Many clustering algorithms, like k-means, struggle with irregular or complex cluster boundaries, but DBSCAN excels in identifying clusters of arbitrary shape, which is essential in dynamic environments like network traffic, where attack patterns can vary significantly.

Another strength is its resilience to noise. Since DBSCAN explicitly labels points that do not belong to any dense cluster as outliers, it is particularly well-suited for scenarios where there is a mix of regular and irregular behaviors, such as in network security. This ability to distinguish between normal and anomalous traffic patterns without excessive false positives enhances its effectiveness in detecting rare or novel attacks.

However, DBSCAN's performance is sensitive to its parameters: epsilon (ε), which defines the neighborhood size, and minimum points (minPts), the minimum number of points required to form a dense region. Choosing appropriate values for these parameters can be challenging and requires domain knowledge or trial-and-error. Despite this, DBSCAN's flexibility, robustness in handling noise, and ability to discover outliers without predefined cluster numbers make it an excellent choice for detecting anomalies in complex datasets like network traffic.
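The role of ε and minPts can be sketched with scikit-learn's implementation (where they appear as `eps` and `min_samples`); the synthetic points below are illustrative only:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# dense cluster of "normal" points plus a few far-away outliers
normal = rng.normal(0, 0.3, size=(50, 2))
outliers = np.array([[5.0, 5.0], [-4.0, 6.0], [6.0, -5.0]])
X = np.vstack([normal, outliers])

# eps = neighborhood radius, min_samples = minPts for a dense region
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("points labelled as noise:", np.sum(labels == -1))
```

The injected outliers receive the label -1 (noise) because their ε-neighborhoods contain fewer than minPts points; shrinking `eps` or raising `min_samples` flags more points as anomalous, which is the tuning trade-off described above.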
C. Random Forest

Random Forest is a powerful ensemble learning algorithm that operates by constructing multiple decision trees during training and outputting the majority vote (classification) or average (regression) of the individual trees. For anomaly detection, Random Forest excels due to its ability to handle high-dimensional datasets and its robustness against overfitting, especially in complex scenarios like network traffic analysis.

Random Forests are particularly well-suited for detecting anomalies because they can learn from a variety of features simultaneously and capture intricate patterns in the data. Each tree in the forest is built on a random subset of the data and features, which introduces variability and helps the model generalize better to unseen data. This property is highly beneficial in detecting network traffic anomalies, where patterns of normal and malicious activities may vary significantly.
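The bagged-tree ensemble described above can be sketched with scikit-learn; the synthetic data and tree count are illustrative, not the paper's actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
# synthetic stand-in: normal traffic near 0, attack traffic shifted in feature space
X_normal = rng.normal(0, 1, size=(300, 10))
X_attack = rng.normal(3, 1, size=(60, 10))
X = np.vstack([X_normal, X_attack])
y = np.array([0] * 300 + [1] * 60)

# 100 trees, each grown on a bootstrap sample with a random feature subset
# per split; the forest prediction is the majority vote of the trees
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", rf.score(X, y))
```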
V. Experiments and Results

A. Experimental Setup

The experiments are carried out on Google Colab, leveraging its cloud-based infrastructure for collaborative development and access to high-performance computational resources such as GPUs. The UNSW-NB15 dataset, a well-known benchmark for network intrusion detection, is used in the experiments. The dataset is preprocessed by handling missing values, encoding categorical features, and normalizing numerical attributes to ensure the machine learning models perform optimally. The dataset is split into training and testing sets with an 80-20 ratio, providing ample data for model validation while avoiding overfitting. Cross-validation is also applied to ensure robustness and reduce the risk of overfitting the training data.
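A minimal sketch of this preprocessing and 80-20 split; the column names and values are placeholders, not the actual UNSW-NB15 schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# toy frame standing in for UNSW-NB15 records (columns are illustrative)
df = pd.DataFrame({
    "duration": [0.1, 0.5, None, 2.0, 0.3, 1.1, 0.2, 0.9],
    "proto":    ["tcp", "udp", "tcp", "tcp", "udp", "tcp", "udp", "tcp"],
    "label":    [0, 0, 0, 1, 0, 1, 0, 1],
})

df["duration"] = df["duration"].fillna(df["duration"].median())   # missing values
df = pd.get_dummies(df, columns=["proto"])                        # categorical encoding
X, y = df.drop(columns="label"), df["label"]
X[["duration"]] = MinMaxScaler().fit_transform(X[["duration"]])   # normalization

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)              # 80-20 split
print(len(X_tr), len(X_te))  # 6 2
```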
B. Results

The KNN model demonstrates strong performance with an accuracy of approximately 92%, indicating its effectiveness in identifying both normal and malicious traffic. However, KNN experiences increased computation times as the dataset size grows, a limitation that becomes more prominent when dealing with large-scale network traffic data. Despite this drawback, KNN achieves a low false-positive rate, making it reliable for scenarios where minimizing false alarms is critical. Tuning the value of k and optimizing the algorithm through dimensionality reduction further improves its performance, making it a solid choice for real-time anomaly detection.
On the other hand, DBSCAN excels at identifying outliers, making it highly effective in detecting previously unseen or novel attacks. It shows a strong detection rate for malicious traffic, especially when dealing with irregular patterns that deviate from normal network behavior. However, DBSCAN produces a slightly higher false-positive rate compared to KNN, which could lead to more frequent false alarms in practical applications. Despite this, DBSCAN's ability to uncover anomalies without predefined cluster counts makes it particularly valuable in complex network environments where threats do not follow predictable patterns.

Through hyperparameter tuning, feature selection, and careful preprocessing, both models deliver competitive results. While KNN offers better overall accuracy, DBSCAN's strength lies in its ability to detect novel and complex attack patterns, making each model suitable for different aspects of network security.
VI. Literature Review

The growing volume of encrypted network traffic and new attack types, such as zero-day exploits, have highlighted the limitations of traditional signature-based packet inspection methods. Consequently, network traffic anomaly detection has become vital in cybersecurity for identifying unknown threats. This review focuses on flow-based data, entropy-based methods, and machine learning techniques.

NetFlow and similar protocols effectively collect network communication data by aggregating packets into flows based on shared attributes like IP addresses and ports. While basic flow data may not suffice for advanced machine learning, it is suitable for entropy-based anomaly detection, which analyzes traffic randomness and distribution.[1]

Entropy-based methods excel at detecting high-intensity anomalous traffic by identifying spikes in data distribution. For example, a DDoS attack may result in lower entropy from concentrated traffic from a single source IP.[2] Additionally, parametrized entropies can outperform traditional Shannon entropy in specific scenarios, enhancing the detection of various network anomalies.[3]
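The entropy drop described above can be illustrated with Shannon entropy over a source-IP distribution; the addresses and counts below are a toy example, not data from the cited works:

```python
import math
from collections import Counter

def shannon_entropy(items):
    """H = -sum p*log2(p) over the empirical distribution of items."""
    counts = Counter(items)
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

normal = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"] * 25   # varied sources
ddos = ["203.0.113.9"] * 95 + ["10.0.0.1"] * 5                   # one dominant source
print(round(shannon_entropy(normal), 3))  # 2.0 (uniform over 4 IPs)
print(round(shannon_entropy(ddos), 3))    # 0.286, much lower
```

The concentrated DDoS traffic collapses the source-IP distribution toward a single value, so its entropy falls well below the baseline, which is the spike such detectors look for.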
The European Union Agency for Cybersecurity (ENISA) highlights the growing threat of ransomware, social engineering, and zero-day exploits, underscoring the need for effective anomaly detection methods.[4] The 2022 ENISA report and Truesec findings indicate a significant rise in cyberattacks, emphasizing advanced detection techniques.[5]

Machine learning plays a crucial role in network anomaly detection, with successful techniques including artificial neural networks (ANNs), Naive Bayes classifiers, and Support Vector Machines (SVMs). ANNs model complex relationships, while Naive Bayes is valued for its efficiency. SVMs handle high-dimensional data effectively, and Random Forest classifiers enhance accuracy through aggregation.[6]

Support Vector Machines (SVMs) are increasingly popular in network anomaly detection due to their ability to handle high-dimensional data and perform well with small sample sizes, as demonstrated by Mukkamala et al. (2002) and Zhang et al. (2008b). Additionally, Random Forest classifiers enhance detection accuracy by aggregating multiple decision trees, as noted in Zhang et al. (2008a).[7,8]

In some instances, machine learning techniques have been combined with other methodologies to enhance performance. For example, genetic algorithms have been integrated with SVMs and other classifiers to optimize detection processes, as shown in studies by Shon and Moon (2007) and Jongsuebsuk et al. (2013). These hybrid approaches harness the strengths of various algorithms, leading to more robust and accurate anomaly detection systems.[9,10]

Researchers have created methods that combine information entropy with multi-order autoregressive models to improve anomaly detection accuracy by capturing subtle variations in traffic patterns. This enhances the detection rate of unknown anomalies.[11] Real-time detection modules typically extract characteristic values from network traffic and compare them with historical patterns to identify anomalies.[12,13]

In the context of Internet Service Provider (ISP) networks, anomalies can degrade the quality of service provided to end users. P2P overlay networks have been explored as a potential solution for network anomaly detection in ISP infrastructures. These networks offer distinct benefits depending on the approach used, whether centralized or decentralized.[14,15]

The integration of machine learning techniques with flow-based and entropy-based methods has significantly advanced network traffic anomaly detection. These approaches effectively identify a wide range of anomalies, from known threats to new attacks. As network traffic volume and complexity increase, developing more sophisticated and adaptive detection methods will be essential for maintaining cybersecurity in a digital world.

VII. CONCLUSIONS

This paper presents a comprehensive machine learning-based anomaly detection system utilizing the UNSW-NB15 dataset, a well-regarded benchmark in network intrusion detection. Through a systematic approach that includes data preprocessing, feature selection, and the application of various machine learning algorithms, the study successfully achieves high detection accuracy while effectively minimizing false positives. The results underscore the effectiveness of K-Nearest Neighbors (KNN) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in detecting network anomalies, each offering unique strengths: KNN provides high accuracy with straightforward implementation, while DBSCAN excels in identifying outliers without the need for predefined cluster counts.

Moreover, the paper discusses the importance of model evaluation metrics, particularly the false-positive rate, in ensuring the reliability of anomaly detection systems in real-world environments. The findings contribute valuable insights into the practical application of these algorithms, emphasizing the necessity for continuous monitoring and model refinement.

Future work could explore the integration of advanced techniques, such as deep learning models, which may enhance the system's ability to capture complex patterns in network traffic. Additionally, incorporating real-time detection mechanisms would significantly improve the responsiveness of the system, allowing for timely interventions against evolving cyber threats. Implementing adaptive learning strategies to continually update the model based on new data could further bolster its effectiveness, making it a resilient solution for contemporary network security challenges. Overall, this study lays a solid foundation for ongoing research in the field of anomaly detection, highlighting the critical role of machine learning in safeguarding network integrity.

VIII. ACKNOWLEDGMENT

We extend our sincere gratitude to our research team for their collaborative efforts and insights in developing this network traffic anomaly detection system. Special thanks to our academic advisors for their invaluable guidance and support. We also appreciate the organizations that provided access to the UNSW-NB15 dataset and other essential resources. Lastly, we thank our families and friends for their encouragement throughout this project.
IX. REFERENCES
[1] S. Gajin, "Network Traffic Anomaly Detection and Analysis – from Research to the Implementation," Journal of Network and Computer Applications, 2019.
[2] W. Liu, "Traffic Anomaly Detection Based on Information Entropy," Southeast University, 2012.
[3] P. Bereziński, B. Jasiul, M. Szpyrka, "An Entropy-based Network Anomaly Detection Method," Entropy, vol. 17, pp. 2367-2408, 2015.
[4] European Union Agency for Cybersecurity (ENISA), "ENISA Threat Landscape 2022," ENISA, 2022. URL: https://www.enisa.europa.eu/publications/enisa-threat-landscape-2022.
[5] Truesec, "An In-depth Analysis of the Cyber Threat Landscape," Truesec Threat Intelligence Report 2022, 2022. URL: https://www.truesec.com/hub/report/threat-intelligence-report-2022.
[6] S. Panda, R. Patra, "Network Intrusion Detection Using Naive Bayes," International Journal of Computer Science and Network Security, vol. 7, no. 12, pp. 258-263, 2007.
[7] K. Ghosh, A. Schwartzbard, "A Study in Using Neural Networks for Anomaly and Misuse Detection," Proceedings of the 8th USENIX Security Symposium, 1999.
[8] Z. Zhang, "Random Forest-based Network Intrusion Detection Systems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 38, pp. 497-502, 2008a.
[9] D. Mukkamala, A. H. Sung, "Identifying Significant Features for Network Forensic Analysis Using Artificial Intelligent Techniques," International Journal of Digital Evidence, vol. 1, no. 4, 2002.
[10] H. Shon, J. Moon, "A Hybrid Machine Learning Approach to Network Anomaly Detection," Information Sciences, vol. 177, no. 18, pp. 3799-3821, 2007.
[11] Y. Zhou, J. Li, "Research of Network Traffic Anomaly Detection Model Based on Multilevel."
[12] F. Palmieri, "Network Anomaly Detection Based on Logistic Regression of Nonlinear Chaotic Invariants," Journal of Network and Computer Applications, vol. 148, 2019.
[13] J. Xu, Y. Zhou, et al., "Network Traffic Anomaly Detection Based on Flow Time Domain," Journal of Northeast University, vol. 40, no. 1, pp. 27-31, 2019.
[14] E.K. Lua, J.A. Crowcroft, M. Pias, R. Sharma, S. Lim, "A Survey and Comparison of Peer-to-Peer Overlay Network Schemes," IEEE Communications Surveys & Tutorials, vol. 7, pp. 72-93, 2005.
[15] M. Silva, R. Mendonça, P. Sousa, "A P2P Overlay System for Network Anomaly Detection in ISP Infrastructures," University of Minho, 2019.