Malware Detection
1. Introduction
2. Objective
3. Motivation
4. Literature Review
5. Problem Statement
6. Methodology
7. Conclusion
8. Future Work
9. References
INTRODUCTION
● Today's interconnected world faces a constant threat of cyber attacks, driven by
sophisticated automated technologies and evolving tactics.
● Given the scale of cyber threats, human efforts alone are insufficient, highlighting the need
for machine learning to analyze large datasets and proactively identify security concerns.
OBJECTIVE
• Antivirus Limitations: Current scanners are inadequate against diverse and sophisticated
malware.
• Need for Advanced Methods: Polymorphic and automated malware necessitate better
detection techniques.
• Innovative Defense: Adopting new methods strengthens protection against modern digital
threats.
LITERATURE REVIEW
SL. No / TITLE / METHODOLOGY / FINDINGS
1. TITLE: “Android malware detection and identification by leveraging the machine and deep
   learning techniques.” by Santosh K. Smmarwar (2024).
   METHODOLOGY: Naive Bayes, KNN, Decision Tree, Support Vector Machine, Random Forest, LSTM,
   DNN, GAN.
   FINDINGS: This study surveys malware detection methods, focusing on feature selection, ML,
   and DL techniques. Challenges persist in detecting obfuscated and zero-day malware, prompting
   the exploration of advanced DL ensembles and reinforcement learning. To improve accuracy
   against evolving threats, recent malware samples are incorporated across various ML and DL
   models.
2. TITLE: “Detection of malware in downloaded files using various machine learning models” by
   A. Kamboj, P. Kumar, A.K. Bairwa, S. Joshi (2023).
   METHODOLOGY: Linear Regression, Logistic Regression, Decision Tree, SVM, Random Forest,
   K-Means.
   FINDINGS: This project employed various supervised and unsupervised ML models, alongside
   dataset balancing techniques such as Oversampling, Undersampling, SMOTE, and Balanced Bagging
   Classifier. Random Forest achieved the highest accuracy.
PROBLEM STATEMENT
● Traditional malware detection techniques are often inadequate, allowing various forms of malware
to enter systems undetected.
● Machine learning algorithms offer promising solutions for detecting malware in computer systems.
● Convolutional neural networks (CNNs), recurrent neural networks (RNNs), decision trees, and
random forests are among the important machine learning algorithms being utilized for malware
detection.
● Research will focus on testing the accuracy of these algorithms in detecting malware using
relevant data.
Methodology
The approach for detecting malware involves using a machine learning method with a one-sided
perceptron.
This method will be applied to a dataset to identify malware among different files in computer
systems.
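The one-sided perceptron idea can be sketched as follows. This is a simplified interpretation, not necessarily the exact procedure used in this project: each ordinary perceptron pass over all samples is followed by extra passes over benign samples only, repeated until no benign file is flagged as malware. The function names, the +1/-1 label convention, the learning rate, and the iteration caps are all assumptions made for illustration.

```python
import numpy as np

def train_one_sided_perceptron(X, y, epochs=20, lr=0.1, max_fix_passes=1000):
    """X: (n, d) feature matrix; y: +1 = malware, -1 = benign (assumed convention)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    benign = np.flatnonzero(y == -1)

    def update(indices):
        """One perceptron pass over the given sample indices; returns mistake count."""
        nonlocal w, b
        mistakes = 0
        for i in indices:
            if y[i] * (X[i] @ w + b) <= 0:   # misclassified or on the boundary
                w = w + lr * y[i] * X[i]
                b = b + lr * y[i]
                mistakes += 1
        return mistakes

    for _ in range(epochs):
        update(range(len(y)))                # ordinary pass on all samples
        for _ in range(max_fix_passes):      # one-sided step: benign samples only
            if update(benign) == 0:
                break                        # no benign file is flagged any more
    return w, b

def predict(X, w, b):
    """Return +1 (malware) where the score is positive, else -1 (benign)."""
    return np.where(X @ w + b > 0, 1, -1)
```

The benign-only passes bias the decision boundary toward fewer false positives, usually at the cost of a lower detection rate.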
Overview:
● Various machine learning algorithms will be employed to address the malware detection
problem.
● A database will be designed to fit the dataset, followed by analysis of the dataset and
design of the detection models.
Proposed Model Workflow
Architecture Design
2. Data Exploration / Analysis
Dataset Used: EMBER Dataset
EMBER is a labeled benchmark dataset for training machine learning models to statically detect
malicious Windows Portable Executable (PE) files. The dataset includes features extracted from
1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K
test samples (100K malicious, 100K benign).
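Because a third of the training split is unlabeled, supervised training first has to drop those rows. The sketch below assumes the EMBER label convention of 1 for malicious, 0 for benign, and -1 for unlabeled (worth verifying against the release in use) and runs on a tiny synthetic array instead of the real feature files. The helper name `split_labeled` is hypothetical.

```python
import numpy as np

def split_labeled(X, y):
    """Keep only rows with a known label; -1 is assumed to mark unlabeled samples."""
    mask = y != -1
    return X[mask], y[mask]

# Tiny synthetic stand-in with the same 1:1:1 label ratio as the training split.
rng = np.random.default_rng(0)
X_demo = rng.random((9, 4))
y_demo = np.array([1, 0, -1, 1, 0, -1, 1, 0, -1])

X_lab, y_lab = split_labeled(X_demo, y_demo)
print(X_lab.shape, np.bincount(y_lab))  # shape and per-class counts of labeled rows
```

The unlabeled rows are not wasted: they can still feed unsupervised or semi-supervised steps.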
DATA DESCRIPTION
The EMBER authors designed the dataset with several practical use cases and research studies in
mind, including the following:
● Compare machine learning models for malware detection.
● Quantify model degradation and concept drift over time.
● Research interpretable machine learning.
● Compare features for malware classification, particularly novel features not represented in
the EMBER dataset. This requires an extensible dataset.
● Compare to featureless end-to-end deep learning. This may require code to extract features
from a new dataset, or sha256 hashes to build a raw binary dataset to match EMBER.
● Research adversarial attacks against machine learning malware detectors, and subsequent
defense strategies.
● Leverage unlabeled samples via unsupervised learning for PE file representation or semi-
supervised learning for classification.
Consideration of these use cases led to the data structure used by the EMBER dataset.
Eliminating Noisy Data
Challenges of Noisy Data:
● Data mining algorithms operate on real-world data affected by various factors, including noise.
● Noise, unintended fluctuations in data, is a major contributor to inaccuracies in collected
data. Sources of noise include human error and limitations of data-gathering instruments.
● Machine learning algorithms can misinterpret noise as patterns if not properly trained,
leading to incorrect generalizations.
● Noisy datasets compromise the quality of the analysis process.
Signal-to-Noise Ratio:
● Analysts and data scientists use the signal-to-noise ratio as a primary metric of data quality.
● A higher signal-to-noise ratio indicates better data quality and less disturbance from noise.
● Diagrams can visually represent how noise degrades the quality of a signal, illustrating the
importance of noise reduction in data analysis.
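As a small worked example, the signal-to-noise ratio can be computed as the ratio of signal power to noise power, usually reported in decibels: SNR_dB = 10 * log10(P_signal / P_noise). The sketch assumes a clean reference signal is available, which is rarely true for real datasets. The sine wave and noise levels are illustrative only.

```python
import numpy as np

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels, given a clean reference and its noisy version."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noisy, dtype=float) - clean
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

t = np.linspace(0.0, 1.0, 1000)
clean = np.sin(2 * np.pi * 5 * t)             # illustrative clean signal
rng = np.random.default_rng(0)
light = clean + rng.normal(0.0, 0.1, t.shape)  # mild noise
heavy = clean + rng.normal(0.0, 0.5, t.shape)  # strong noise

print("light noise SNR (dB):", snr_db(clean, light))
print("heavy noise SNR (dB):", snr_db(clean, heavy))
```

The lightly corrupted signal yields the higher SNR, matching the bullet above.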
Eliminating Noisy Features
Data Cleaning
Auto-Encoders for De-Noising:
● Auto-encoders, particularly stochastic variants, are effective for denoising data.
● They are trained to recognize and remove noise from input data, producing clean output.
● They comprise an encoder and a decoder: the encoder converts data into an encoded form,
while the decoder reconstructs the original input from it.
● Denoising autoencoders learn robust hidden-layer features by reconstructing the clean input
from a deliberately corrupted copy.
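The training loop of a denoising autoencoder can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the model used in this project: one tanh hidden layer, a linear decoder, Gaussian corruption, full-batch gradient descent, and synthetic low-rank data. All names and hyperparameters are hypothetical.

```python
import numpy as np

def train_denoising_autoencoder(X, hidden=4, noise_std=0.3, lr=0.05,
                                epochs=1000, seed=0):
    """Train on corrupted inputs while measuring reconstruction error on clean X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        X_noisy = X + rng.normal(0.0, noise_std, X.shape)  # corrupt the input
        H = np.tanh(X_noisy @ W1 + b1)                     # encoder
        X_hat = H @ W2 + b2                                # decoder (linear)
        err = X_hat - X                                    # compare with CLEAN data
        losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))
        # Backpropagation for loss = 1/(2n) * sum of squared errors.
        g_out = err / n
        g_W2 = H.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_Z = (g_out @ W2.T) * (1.0 - H ** 2)              # tanh derivative
        g_W1 = X_noisy.T @ g_Z
        g_b1 = g_Z.sum(axis=0)
        W1 -= lr * g_W1; b1 -= lr * g_b1
        W2 -= lr * g_W2; b2 -= lr * g_b2

    def denoise(X_in):
        return np.tanh(X_in @ W1 + b1) @ W2 + b2

    return denoise, losses

# Synthetic data lying near a 2-dimensional subspace of a 10-dimensional space.
rng = np.random.default_rng(1)
X_clean = 0.3 * rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
denoise, losses = train_denoising_autoencoder(X_clean)
X_noisy = X_clean + rng.normal(0.0, 0.3, X_clean.shape)
noisy_mse = np.mean((X_noisy - X_clean) ** 2)
denoised_mse = np.mean((denoise(X_noisy) - X_clean) ** 2)
print("noisy MSE:", noisy_mse, "denoised MSE:", denoised_mse)
```

The key design choice is that the loss compares the reconstruction with the clean data, not with the corrupted input, so the network cannot simply copy its input.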
1. Model of statically extracted features: Identifies characteristics without executing the file,
such as PE header data and entropy measurements.
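Two such static features can be computed without running the file at all. The sketch below shows the Shannon entropy of the byte distribution (0 to 8 bits per byte; packed or encrypted content tends toward the high end) and a check for the "MZ" signature that opens the DOS header of a PE file. The function names are hypothetical, and real feature extractors compute many more fields.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (range 0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def has_mz_magic(data: bytes) -> bool:
    """True if the buffer starts with the 'MZ' signature of a DOS/PE header."""
    return data[:2] == b"MZ"

sample = b"MZ" + bytes(64)      # illustrative buffer, not a real executable
print(has_mz_magic(sample), byte_entropy(sample))
```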
● A recent approach utilizes raw byte sequences instead of static or dynamic features for
malware detection.
● This method, relying on a deep learning neural network, processes entire executable files
for detection.
● It achieved high accuracy (94%) and AUC (98.1%), even without domain-specific
knowledge.
● Manual feature engineering may not keep pace with the constantly evolving nature of
malware, which makes CNNs advantageous because they learn features directly from the data.
● CNNs can process various representations of malware, including images, providing more
comprehensive information compared to traditional methods.
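One common way to give a CNN an image representation of a file is to read its bytes as grayscale pixel values. The sketch below is an illustration, not the preprocessing used in this project: the fixed row width of 256 and the zero padding of the last row are assumptions, and `bytes_to_image` is a hypothetical helper.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2-D uint8 array (one byte per pixel, 0-255)."""
    arr = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(arr) // width)                 # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(arr)] = arr                       # zero-pad the final row
    return padded.reshape(height, width)

img = bytes_to_image(bytes(range(10)), width=4)
print(img.shape)
print(img)
```

The resulting array can be resized to a fixed shape and fed to a standard image CNN.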
● Once trained, the ensemble of decision trees forms a Random Forest, capable of collectively
analyzing and classifying new malware samples.
● Detection of new malware samples involves feeding them into the Random Forest and
observing the collective response of the trees.
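The "collective response" of the trees can be illustrated with a hard majority vote. The sketch takes already-computed per-tree predictions as input (1 = malware, 0 = benign, an assumed convention) and leaves out tree training entirely. Some libraries average per-tree class probabilities instead of counting votes.

```python
import numpy as np

def forest_vote(tree_predictions):
    """tree_predictions: (n_trees, n_samples) array of 0/1 votes.

    Returns the majority label per sample and the fraction of trees voting malware.
    """
    votes = np.asarray(tree_predictions)
    malware_fraction = votes.mean(axis=0)
    labels = (malware_fraction > 0.5).astype(int)
    return labels, malware_fraction

# Five hypothetical trees voting on three files.
votes = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
labels, fraction = forest_vote(votes)
print(labels, fraction)
```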
Sigmoid Function:
Probability Estimation:
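Assuming these two headings refer to the standard logistic regression formulation, the sigmoid function is sigma(z) = 1 / (1 + e^(-z)), and the estimated probability that a sample x is malicious is P(y = 1 | x) = sigma(w · x + b). A minimal sketch follows; the weights and features are illustrative values, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def malicious_probability(x, w, b):
    """Logistic regression estimate of P(y = 1 | x) for feature vector x."""
    return sigmoid(np.dot(x, w) + b)

w = np.array([0.5, -0.25])      # illustrative weights
b = 0.0                         # illustrative bias
x = np.array([1.0, 2.0])        # illustrative feature vector
print(malicious_probability(x, w, b))
```

A sample is then labeled malicious when this probability exceeds a chosen threshold, commonly 0.5.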
Algorithm Accuracy
CNN 86%
RNN 70%
CNN achieved the highest accuracy, suggesting that it is the best of the four algorithms
compared for malware detection.
ROC curve for URL detector
Confusion Matrix for URL detector
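The quantities behind both plots can be computed directly from predictions. The sketch assumes labels of 1 for malicious and 0 for benign; the sample labels are made up for illustration. Each point on a ROC curve is one (false positive rate, true positive rate) pair obtained at a particular decision threshold.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = malicious."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Accuracy, true positive rate, and false positive rate."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # illustrative detector output
counts = confusion_counts(y_true, y_pred)
print(counts, rates(*counts))
```

The false positive rate is the figure to watch when the goal is to avoid flagging benign items.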
For URL files:
Algorithm Accuracy
Logistic Regression is considered the best algorithm for detecting malware in URLs.
Conclusion
● While the results are promising, achieving zero false positives remains a challenge,
indicating room for improvement in the detection framework.
● Decision Tree and Random Forest algorithms perform well at detecting malware in large
datasets, and further tuning may improve their detection accuracy.
● Among the compared algorithms (CNN, RNN, Decision Tree, and Random Forest), CNN
demonstrates the highest accuracy, making it the preferred choice for malware detection.
● For URL malware detection, logistic regression is the preferred choice.
REFERENCES
1. “Malware Detection by Leveraging The Machine and Deep Learning Techniques”, 2024.
2. A. Kamboj, P. Kumar, A.K. Bairwa, S. Joshi, “Detection of malware in downloaded files using
various machine learning models”, Egypt. Informatics J., 24 (1), 2023.
3. Santosh K. Smmarwar, “Android malware detection and identification by leveraging the machine
and deep learning techniques”, 2024.
4. Chia-Mei Chen, Shi-Hao Wang, Dan-Wei Wen, Gu-Hsin Lai, Ming-Kung Sun, “Applying Convolutional
Neural Network for Malware Detection”, 2021.
5. Sanket Agarkar, Soma Ghosh, “Classification Of Malware Detection using Machine Learning
Algorithms”, 2020.
6. Manal Abdullah, Afnan Agal, Mariam Alharthi, Mariam Alrashidi, Arabic Handwriting Recognition
Model based on, International Journal of Advanced Trends in Computer Science and Engineering,
Vol. 8, No. 1.1, 2019.
7. R. Tomar, Y. Awasthi, “Prevention Techniques Employed In Wireless Ad-Hoc Networks”,
International Journal of Advanced Trends in Computer Science and Engineering, Vol. 8, No. 1.2,
2019.
8. R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, M. Ahmadi, “Microsoft Malware Classification
Challenge”, arXiv, 1802.10135, 2018.
9. A. Shalaginov, S. Banin, A. Dehghantanha, K. Franke, “Machine learning aided static malware
analysis: A survey and tutorial”, Cyber Threat Intelligence, pp. 7-45, 2018.
10. R. Kumar, P. Kumar, R. Tripathi, G.P. Gupta, S. Garg, M.M. Hassan, “A distributed intrusion
detection system to detect DDoS attacks in blockchain-enabled IoT network”, J. Parallel Distrib.
Comput., 164, 2022.
11. Statista Research Department, “Annual Number of Malware Attacks Worldwide from 2015 to First
Half 2022”, 2023.
12. A.K. Dey, G.P. Gupta, S.P. Sahu, “A metaheuristic-based ensemble feature selection framework
for cyber threat detection in IoT-enabled networks”, 2023.
13. “AI-empowered malware detection system for industrial internet of things”, Comput. Electr.
Eng., 108 (April), 2023.
14. M. Gopinath, S.C. Sethuraman, “A comprehensive survey on deep learning based malware
detection techniques”, Comput. Sci. Rev., 47, 2023.
Thank you!