
Malware Detection

The document discusses leveraging machine learning techniques for malware detection. It outlines using convolutional neural networks, recurrent neural networks, decision trees and random forests for malware classification. The methodology proposes applying these algorithms to a malware dataset and evaluating their effectiveness at identifying malware.

MALWARE DETECTION BY LEVERAGING THE MACHINE AND DEEP LEARNING TECHNIQUES

Guided by: Dr. Subhadarshini Mohanty

Group members: Sanskar Maharana (2001106514), Arpit Mohapatra (2001106485)

SCHOOL OF COMPUTER SCIENCES


ODISHA UNIVERSITY OF TECHNOLOGY AND RESEARCH
BHUBANESWAR, ODISHA
OUTLINE

1. Introduction
2. Objective
3. Motivation
4. Literature Review
5. Problem Statement
6. Methodology
7. Conclusion
8. Future Work
9. References
INTRODUCTION
● Today's interconnected world faces a constant threat of cyber attacks, driven by
sophisticated automated technologies and evolving tactics.

● Malicious software, or malware, poses a significant challenge in the digital realm,
targeting systems and networks for espionage or financial gain.

● The emergence of malware targeting embedded platforms like IoT devices
complicates the threat landscape further.

● Traditional signature-based protections fall short against the diverse range of
malware, necessitating a more comprehensive cybersecurity approach.

● Deep learning algorithms excel in identifying evolving malware by implicitly extracting and
representing features.

● Given the scale of cyber threats, human efforts alone are insufficient, highlighting the need
for machine learning to analyze large datasets and proactively identify security concerns.
OBJECTIVE

● To highlight the impact of malware on hardware systems.

● To explain and analyse the use of Convolutional Neural Networks and Recurrent Neural
Networks for malware detection.
● To evaluate how well decision trees and random forests detect malware in PE files.
● To evaluate logistic regression for malicious URL detection.
● To implement both PE header-based malware detection and malicious URL detection
techniques.
MOTIVATION
• Cybersecurity Challenges: Traditional measures are outpaced by the fast-evolving cyber
threats.

• Antivirus Limitations: Current scanners are inadequate against diverse and sophisticated
malware.

• Need for Advanced Methods: Polymorphic and automated malware necessitate better
detection techniques.

• Machine Learning Advantage: Offers improved threat detection by analyzing behavioral
patterns.

• Innovative Defense: Adopting new methods strengthens protection against modern digital
threats.
LITERATURE REVIEW
SL. No. 1
Title: “Android malware detection and identification by leveraging the machine and deep
learning techniques.” by Santosh K. Smmarwar (2024).
Methodology: Naive Bayes, KNN, Decision Tree, Support Vector Machine, Random Forest, LSTM,
DNN, GAN.
Findings: This study surveys malware detection methods, focusing on feature selection, ML, and
DL techniques. Challenges persist in detecting obfuscated and zero-day malware, prompting the
exploration of advanced DL ensembles and reinforcement learning. To enhance accuracy for
evolving threats, recent malware samples are incorporated across various ML and DL models.

SL. No. 2
Title: “Detection of malware in downloaded files using various machine learning models” by
A. Kamboj, P. Kumar, A.K. Bairwa, S. Joshi (2023).
Methodology: Linear Regression, Logistic Regression, Decision Tree, SVM, Random Forest,
K-Means.
Findings: This project employed various supervised and unsupervised ML models, alongside
dataset balancing techniques like Oversampling, Undersampling, SMOTE, and Balanced Bagging
Classifier. Random Forest achieved the highest accuracy.

SL. No. 3
Title: “Applying Convolutional Neural Network for Malware Detection” by Chia-Mei Chen;
Shi-Hao Wang; Dan-Wei Wen; Gu-Hsin Lai; Ming-Kung Sun (2021).
Methodology: CNN.
Findings: Focuses only on the use of the CNN algorithm for malware detection.

SL. No. 4
Title: “Classification Of Malware Detection using Machine Learning Algorithms” by Sanket
Agarkar; Soma Ghosh (2020).
Methodology: Naive Bayes, Support Vector Machine, Random Forest, K-Nearest Neighbor.
Findings: Focuses only on the use of machine learning algorithms for malware detection.
PROBLEM STATEMENT
● Increased internet demand has led to a heightened risk of cyber attacks, resulting in the
proliferation of malware files in computer systems.

● Traditional malware detection techniques are often inadequate, allowing various forms of malware
to enter systems undetected.

● There is a pressing need for more effective malware detection methods.

● Machine learning algorithms offer promising solutions for detecting malware in computer systems.

● Convolutional neural networks (CNNs), recurrent neural networks (RNNs), decision trees, and
random forests are among the important machine learning algorithms being utilized for malware
detection.

● Research will focus on testing the accuracy of these algorithms in detecting malware using
relevant data.
Methodology
The approach for detecting malware involves using a machine learning method with a one-sided
perceptron.

This method will be applied to a dataset to identify malware among different files in computer
systems.

Overview :

● Various machine learning algorithms will be employed to address the malware detection
problem.

● A database will be designed according to the dataset, followed by analysis and design on the
dataset.
Proposed Model Workflow
Architecture Design
Data Exploration / Analysis
Dataset Used : EMBER DATASET

A labeled benchmark dataset for training machine learning models to statically detect malicious
Windows portable executable files. The dataset includes features extracted from 1.1M binary
files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test
samples (100K malicious, 100K benign).
DATA DESCRIPTION

The EMBER dataset was crafted with several practical use cases and research studies in mind,
including the following:
● Compare machine learning models for malware detection.
● Quantify model degradation and concept drift over time.
● Research interpretable machine learning.
● Compare features for malware classification, particularly novel features not represented in
the EMBER dataset. This requires an extensible dataset.
● Compare to featureless end-to-end deep learning. This may require code to extract features
from a new dataset, or sha256 hashes to build a raw binary dataset to match EMBER.
● Research adversarial attacks against machine learning malware detectors, and subsequent
defense strategies.
● Leverage unlabeled samples via unsupervised learning for PE file representation or
semi-supervised learning for classification.

Consideration of these use cases led to the data structure of the dataset.
Eliminating Noisy Data
Challenges of Noisy Data :
● Data mining algorithms operate on real-world data affected by various factors, including noise.
● Noise, unintended fluctuations in data, is a major contributor to inaccuracies in collected
data. Sources of noise include human error and limitations of data-gathering instruments.
● Machine learning algorithms can misinterpret noise as patterns if not properly trained,
leading to incorrect generalizations.
● Noisy datasets compromise the quality of the analysis process.

Signal-to-Noise Ratio :
● Analysts and data scientists use the signal-to-noise ratio as a primary metric of data quality.
● A higher signal-to-noise ratio indicates better data quality and less disturbance from noise.
● Diagrams can visually represent how noise degrades the quality of a signal, illustrating the
importance of noise reduction in data analysis.
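The signal-to-noise ratio described above can be computed directly when a clean reference signal is available. The following is a minimal sketch under that assumption; the function names and the toy samples are illustrative and do not come from the project.

```python
import math

def power(samples):
    """Mean squared value of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    noise = [n - c for c, n in zip(clean, noisy)]
    noise_power = power(noise)
    if noise_power == 0:
        return math.inf  # the observation contains no noise at all
    return 10 * math.log10(power(clean) / noise_power)

# Toy samples, made up for illustration.
clean = [1.0, -1.0, 1.0, -1.0]
slightly_noisy = [1.1, -0.9, 1.1, -0.9]
very_noisy = [1.5, -0.5, 1.5, -0.5]
```

Here `snr_db(clean, slightly_noisy)` is larger than `snr_db(clean, very_noisy)`, matching the statement that a higher ratio means less disturbance.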
Eliminating Noisy Features
Data Cleaning
Auto-Encoders for De-Noising :
● Auto-encoders, particularly stochastic variations, are effective for denoising data.
● They are trained to recognize and remove noise from input data, producing clean output.
● They comprise an encoder and a decoder: the encoder converts data into an encoded form,
while the decoder reconstructs the original input from it.
● Denoising autoencoders manipulate hidden layers to extract robust features.

Principal Component Analysis (PCA) :


● PCA is a statistical technique for separating potentially correlated variables into
uncorrelated components.
● Objective: Improve signal/image quality by reducing noise while preserving key information.
● Geometrically and statistically projects input data along axes to minimize dimensionality,
akin to projecting a point onto the X-axis.
● Removes noisy axes, effectively cleaning noisy input data.
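The projection idea in the PCA bullets can be sketched for two-dimensional data without any external library. This is only an illustration under simplifying assumptions (2-D points, one retained component, made-up data); the function name `pca_denoise_2d` is hypothetical, and a real pipeline would more likely use a library PCA on high-dimensional feature vectors.

```python
import math

def pca_denoise_2d(points):
    """Project 2-D points onto their first principal axis and map them back."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix of the centered data.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Direction of largest variance (closed form for a symmetric 2x2 matrix).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    cleaned = []
    for x, y in points:
        t = (x - mx) * ux + (y - my) * uy           # coordinate along the kept axis
        cleaned.append((mx + t * ux, my + t * uy))  # the discarded axis is set to zero
    return cleaned

# Made-up points scattered around the line y = 2x.
noisy = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.9), (4.0, 8.1)]
cleaned = pca_denoise_2d(noisy)
```

After the call, all points in `cleaned` lie on a single line: the component orthogonal to the main direction, which is treated as noise here, has been removed.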
Two-Stage Noise Reduction Technique :
● PCA used in a two-stage noise reduction approach:
1. Takes noisy input
2. Produces clean output
● Addresses concerns over separating signal from noise, including overfitting and
performance issues.

Addressing Noisy Data :


● Noisy data poses challenges, including potential overfitting and degradation of dataset
quality.
● Solutions include techniques like feature selection and dimensionality reduction.
● Removing or reducing noise is crucial for improving data quality and algorithm
performance.
Building Machine Learning Models

PE Files and ML Approaches:

ML approaches for PE files can be categorized into:

1. Model of statically extracted features: Identifies characteristics without executing the file,
such as PE header data and entropy measurements.

2. Dynamic extracted features model: Requires execution in a virtual environment and


analyzes dynamic aspects like API calls, system calls, and instruction traces.
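As an illustration of a statically extracted feature, the entropy measurement mentioned above can be computed from a file's bytes without executing it. This is a minimal sketch of byte-level Shannon entropy; the function name is illustrative, and it is not necessarily how the features used in this project were computed.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniformly distributed bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In practice the function would be applied to the contents of a file or of a single PE section, for example `shannon_entropy(open(path, "rb").read())`, where `path` is a placeholder. Values close to 8 are often associated with compressed or encrypted content, although that alone does not prove a file is malicious.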
URL Detection:
● Sanitization Method:
○ Custom Python function implemented to sanitize and extract relevant data from raw
URLs.
○ Dataset structured into two columns: URLs and their labels (malicious or not).
● Feature Extraction:
○ Employed Tf-idf text feature extraction from the sklearn module.
○ Data read into data frames and matrices, processed by the vectorizer, then applied to
Tf-idf.
● Logistic Regression Training:
○ After vectorization, the Logistic Regression model is trained and tested.
● Whitelist Filtering:
○ To improve accuracy, good URLs are combined with bad ones using a Whitelist Filter.
○ Known non-malicious sites are allowed, enhancing the model’s prediction capabilities.
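The sanitization and whitelist steps above could look roughly like the sketch below. It is an assumption-laden illustration: the function names, the token separators, and the whitelist entries are hypothetical and are not taken from the project's code. The resulting tokens would then be passed to a Tf-idf vectorizer and a logistic regression model.

```python
import re
from urllib.parse import urlparse

# Hypothetical whitelist of known-good hosts; a real deployment would load a curated list.
WHITELIST = {"example.com", "python.org"}

def sanitize_url(url):
    """Lower-case a raw URL, drop the scheme, and split it into word-like tokens."""
    url = url.strip().lower()
    url = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url)  # remove "http://", "https://", ...
    tokens = re.split(r"[/\-.?=&_:]+", url)
    # "www" and "com" appear in almost every URL, so they carry little signal.
    return [t for t in tokens if t and t not in {"www", "com"}]

def host_of(url):
    """Return the host part of a URL, without a leading 'www.'."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlparse only fills in the host when a scheme is present
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def is_whitelisted(url):
    """True when the URL's host is on the whitelist, so the classifier can be skipped."""
    return host_of(url) in WHITELIST
```

Matching on the exact host rather than on a substring matters: a URL such as `example.com.evil.test/login` contains a whitelisted name but is not served by the whitelisted host.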
Malware Detection by Eating a Whole EXE:

● A recent approach utilizes raw byte sequences instead of static or dynamic features for
malware detection.
● This method, relying on a deep learning neural network, processes entire executable files
for detection.
● Achieved high accuracy (94%) and AUC (98.1%), even without domain-specific
knowledge.
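To make the raw-byte idea concrete, the sketch below shows one way a file's bytes might be turned into a fixed-length integer sequence before being fed to a neural network. The padding id `256`, the function name, and the truncation strategy are assumptions made for illustration; the referenced approach may prepare its input differently.

```python
PAD = 256  # hypothetical padding id, chosen outside the 0-255 byte range

def bytes_to_sequence(data: bytes, max_len: int) -> list:
    """Turn raw file bytes into a fixed-length list of integer ids."""
    seq = list(data[:max_len])                # each byte becomes an int in 0..255
    seq.extend([PAD] * (max_len - len(seq)))  # pad short files up to max_len
    return seq
```

For example, `bytes_to_sequence(b"MZ\x90", 5)` yields the three byte values followed by two padding ids. No header parsing or other domain knowledge is involved, which is the point of this family of methods.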

Challenges and Limitations:

● Computational limitations due to memory requirements, as training with millions of
observations can take months.
● Despite challenges, these ML approaches showcase promising results in effectively
detecting malware, including zero-day threats.
RNN Algorithm
● Utilized for detecting malware due to its capability to identify
patterns indicative of malicious behavior within data.
● Analyzes new data, and if it detects patterns resembling those in
the training dataset, it flags the new data as potentially
malicious.
● Learns characteristics commonly linked with malware by
analyzing a vast dataset of known malicious software instances.
● Detection of Malware Behaviors: Configurable to identify
various malware behaviors, including unusual system activity,
unexpected network traffic, and unusual file access patterns.
It identifies these behaviors by recognizing patterns in
data that indicate abnormal activity.
● Implementing RNN for Malware Detection: Creation of
separate RNN models for different behaviors: one RNN for
detecting unusual system activity, one for detecting unexpected
network traffic, and one for detecting unusual file access patterns.
● Each RNN model is trained on specific data representing the
respective behavior.
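The way a recurrent network carries information from earlier steps of a sequence into later ones can be illustrated with a single recurrent unit. This is a toy sketch only: the weights are hand-picked placeholders rather than trained values, and the input is assumed to be a sequence of numeric activity measurements, which is a simplification of the behavior data described above.

```python
import math

def rnn_score(sequence, w_in=0.8, w_rec=0.5, bias=-1.0, w_out=2.0):
    """Run a one-unit recurrent cell over a sequence and return a score in (0, 1)."""
    h = 0.0  # hidden state, the cell's memory of earlier steps
    for x in sequence:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_in * x + w_rec * h + bias)
    return 1.0 / (1.0 + math.exp(-w_out * h))  # squash the final state to a score
```

With these placeholder weights, a sequence of consistently high values such as `[3, 3, 3]` produces a higher score than `[0, 0, 0]`. A trained model would learn its weights from labeled sequences instead.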
CNN Algorithm
● CNNs (Convolutional Neural Networks) are increasingly favored for malware detection due to
their ability to automatically learn distinctive features of malware.

● Manual feature engineering may not keep pace with the constantly changing and evolving
nature of malware, making CNN advantageous.

● CNNs can process various representations of malware, including images, providing more
comprehensive information compared to traditional methods.

● Detection methodologies using CNN algorithms:


1. Analyzing system behavior (e.g., system calls, network traffic) to detect anomalies indicative of
malware presence.
2. Examining file and application content for suspicious patterns that signify potential malware.
3. Analyzing patterns of malicious activity over time to detect new, previously unknown threats.
4. Comparing known malware signatures with newly detected samples to identify known malware
instances.
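One commonly described way to give a CNN an image representation of a file is to lay its bytes out as grayscale pixels. The sketch below assumes a fixed row width and zero padding for the last row; both choices, and the function name, are illustrative rather than taken from the project.

```python
def bytes_to_image(data: bytes, width: int) -> list:
    """Arrange bytes row by row into a 2-D grid; each byte is one grayscale pixel (0-255)."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))  # zero-pad the last, possibly short, row
        rows.append(row)
    return rows
```

For instance, `bytes_to_image(bytes([1, 2, 3, 4, 5]), 2)` produces three rows of two pixels each, with the final pixel padded. The resulting grid can then be resized and passed to a convolutional network like any other single-channel image.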
CNN Architecture :
Decision Tree
● Decision trees are effective tools for detecting malware files by identifying the presence of
malicious code.
● The process begins with a set of labeled data, which is used to construct a decision tree
model.
● Decision trees classify new data points based on learned patterns from the training data.
● The decision tree predicts the presence of malicious code in a file by analyzing its features.
● Additionally, decision trees can pinpoint areas within the code that are more likely to contain
malicious code.
● This ability enhances the accuracy of malware detection systems by focusing on specific
areas for further analysis.
● Decision trees offer a transparent and interpretable approach to malware detection, aiding in
understanding and refining detection methodologies.
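The way a decision tree learns a rule from labeled data can be illustrated with its smallest building block, a single split on one feature. The sketch below picks the threshold that minimizes weighted Gini impurity; the feature values and labels are made up, and a full tree would repeat this search recursively over many features.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Find the threshold on one feature that minimizes the weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up toy data: one entropy-like feature per file, label 1 = malicious.
entropy = [3.1, 4.0, 4.4, 7.2, 7.6, 7.9]
label = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(entropy, label)
```

On this toy data the learned rule is "feature greater than 4.4 means malicious", which separates the two groups perfectly. Rules of this form are what make a decision tree easy to inspect.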
Random Forest
● It involves training multiple decision trees on labeled malware samples, each tree learning
distinct characteristics of malware, such as code size, system calls, and code complexity.

● Once trained, the ensemble of decision trees forms a Random Forest, capable of collectively
analyzing and classifying new malware samples.

● Detection of new malware samples involves feeding them into the Random Forest and
observing the collective response of the trees.

● If the reaction of the Random Forest aligns consistently with a malicious sample, the sample
can be confidently labeled as malware.

● Random Forest excels in handling complex and dynamic malware characteristics, making it a
valuable tool in the battle against evolving cyber threats.
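The "collective response" of the trees can be sketched as a majority vote over the individual tree predictions. This is one common aggregation rule, shown here as an assumption; library implementations may combine trees differently, for example by averaging predicted probabilities. The function name is illustrative.

```python
from collections import Counter

def forest_predict(tree_votes):
    """Return the majority label and the fraction of trees that agree with it."""
    counts = Counter(tree_votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(tree_votes)
```

For example, if four of five trees vote "malware", `forest_predict` returns that label together with an agreement of 0.8, which can serve as a rough confidence value.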
Logistic Regression
● Suitable for binary classification problems.
● Used to differentiate between malicious and non-malicious URLs.
● Estimates the probability of a given input belonging to a certain class.

Sigmoid Function:

● The logistic function is central to the model.


● Defined as: \( p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n)}} \)
● Outputs a probability value between 0 and 1.
● Where:
i. \( p(x) \): Probability of the URL being malicious.
ii. \( e \): Base of the natural logarithm.
iii. \( \beta_0, \beta_1, \ldots, \beta_n \): Parameters of the model.

Probability Estimation:

● Output ideal for estimating probabilities.


● Ensures the predicted probability is always within the range of 0 to 1.
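The logistic function can be written in a few lines. The sketch below evaluates \( p(x) \) for given parameters; the parameter values in the usage note are arbitrary examples, not fitted coefficients from the URL model.

```python
import math

def predict_proba(features, beta0, betas):
    """p(x) = 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i)))."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    # Note: math.exp overflows for extremely negative z; real code would guard against that.
    return 1.0 / (1.0 + math.exp(-z))
```

With all features at zero and `beta0 = 0`, the result is exactly 0.5. Large positive values of the linear term push the probability toward 1 and large negative values toward 0, so the output always stays within the range 0 to 1.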
RESULTS AND ANALYSIS
For PE File :

Algorithm        Accuracy
CNN              86%
RNN              70%
Decision Tree    60%
Random Forest    60%

Accuracy comparison of the algorithms:

CNN achieved the highest accuracy (86%), suggesting that it may be considered the best algorithm
among the four for malware detection in PE files.
ROC curve for URL detector
Confusion Matrix for URL detector
For URL File :

Algorithm              Accuracy
Logistic Regression    98.46%

Logistic Regression, the only algorithm listed for this task, reached 98.46% accuracy on
malicious URL detection.
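Accuracy alone does not show how many benign items are wrongly flagged. The sketch below derives accuracy together with precision, recall, and false positive rate from the four cells of a confusion matrix. The counts used are hypothetical and chosen only for illustration; they are not the results reported above.

```python
def metrics(tp, fp, tn, fn):
    """Derive common evaluation metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # flagged items that are truly malicious
        "recall": tp / (tp + fn),               # malicious items that were caught
        "false_positive_rate": fp / (fp + tn),  # benign items wrongly flagged
    }

# Hypothetical counts for illustration only; they are not the results reported above.
m = metrics(tp=90, fp=5, tn=95, fn=10)
```

With these made-up counts the accuracy is 0.925 while 5% of benign items are still flagged, which shows why the false positive rate is worth reporting next to accuracy.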
Conclusion

● Malware attacks pose significant threats to personal computer systems, necessitating
effective detection and removal methods to safeguard against potential damage.

● While the results are promising, achieving zero false positives remains a challenge,
indicating room for improvement in the detection framework.

● Decision tree and Random Forest algorithms excel in detecting malware from large
datasets, suggesting potential for further improvement to enhance detection accuracy.

● Among the compared algorithms (CNN, RNN, Decision Tree, and Random Forest), CNN
demonstrates the highest accuracy, making it the preferred choice for malware detection.

● For URL malware detection, Logistic regression algorithm is the preferred choice.
REFERENCES
1. “Malware Detection by Leveraging The Machine and Deep Learning Techniques”, 2024.
2. “Detection of malware in downloaded files using various machine learning models” by A. Kamboj, P.
Kumar, A.K. Bairwa, S. Joshi, 2023.
3. Santosh K. Smmarwar, “Android malware detection and identification by leveraging the machine and
deep learning techniques.”, 2024.
4. “Applying Convolutional Neural Network for Malware Detection” by Chia-Mei Chen; Shi-Hao Wang;
Dan-Wei Wen; Gu-Hsin Lai; Ming-Kung Sun, 2021.
5. “Classification Of Malware Detection using Machine Learning Algorithms” by Sanket Agarkar; Soma
Ghosh, 2020.
6. Manal Abdullah, Afnan Agal, Mariam Alharthi and Mariam Alrashidi. Arabic Handwriting Recognition
Model based on, International Journal of Advanced Trends in Computer Science and Engineering, Vol.
8, No.1.1, 2019.
7. R. Tomar and Y. Awasthi. “Prevention Techniques Employed In Wireless Ad-Hoc Networks”.
International Journal of Advanced Trends in Computer Science and Engineering, Vol. 8, No.1.2, 2019.
8. R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov and M. Ahmadi. “Microsoft Malware Classification
Challenge”, arXiv, 1802.10135, 2018.
9. A. Shalaginov, S. Banin, A. Dehghantanha and K. Franke, "Machine learning aided static malware
analysis: A survey and tutorial", Cyber Threat Intelligence, pp. 7-45, 2018.
10. R. Kumar, P. Kumar, R. Tripathi, G.P. Gupta, S. Garg, M.M. Hassan, "A distributed intrusion
detection system to detect DDoS attacks in blockchain-enabled IoT network", J. Parallel Distrib.
Comput., 164, 2022.
11. "Annual Number of Malware Attacks Worldwide from 2015 to First Half 2022", Statista Research
Department, 2023.
12. A.K. Dey, G.P. Gupta, S.P. Sahu, "A metaheuristic-based ensemble feature selection framework
for cyber threat detection in IoT-enabled networks", 2023.
13. "AI-empowered malware detection system for industrial internet of things", Comput. Electr.
Eng., 108 (April), 2023.
14. A. Kamboj, P. Kumar, A.K. Bairwa, S. Joshi, "Detection of malware in downloaded files using
various machine learning models", Egypt. Informatics J., 24 (1), 2023.
15. M. Gopinath, S.C. Sethuraman, "A comprehensive survey on deep learning based malware
detection techniques", Comput. Sci. Rev., 47, 2023.
16. A. Kamboj, P. Kumar, A.K. Bairwa, S. Joshi, "Detection of malware in downloaded files using
various machine learning models", Egypt. Informatics J., 24 (1), 2023.
Thank you!
