BlackBook-Report FY-ML MalwareDetection1


SAVITRIBAI PHULE UNIVERSITY

A PRELIMINARY PROJECT REPORT ON

Malware Detection
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN
COMPUTER APPLICATION

By
KASHAF KHAN
NIVEDYA SHAJI

Under the Guidance of :


Prof. SANKET LODHA

Department of Computer Science


PROGRESSIVE EDUCATION SOCIETY'S

Vishwakarma College of Arts, Commerce &


Science Sr.No. 3/6 VIIT Campus, Laxmi
Nagar, Kondhwa(BK), Pune 411048.
Maharashtra, India. (An Institute affiliated to
Savitribai Phule Pune University) Pune - 411005

2022-2023

Malware Detection using Machine Learning 1


Department of Computer Science
Vishwakarma College of Arts, Commerce & Science

CERTIFICATE

This is to certify that Miss Kashaf Khan and Miss Nivedya Shaji have satisfactorily
completed the project titled Malware Detection for M.Sc. (Computer Application)
Semester-IV during the academic year 2022-2023.

Guide HOD
Prof. Sanket Lodha Mr. S. D. Chitnis

A Report on Major Project/Dissertation Stage-I



PROJECT SYNOPSIS

Group Members :
Roll No Name Contact No Email-ID

Kashaf Khan 7972432226 Khankashaf190@gmail.com

Nivedya Shaji 8666653230 Nivedyapayyathil123@gmail.com

Class : SYMsc(CA)
Academic Year : 2022-23
Project Title : Malware Detection
Project Area : Machine Learning
Guide : Prof.Sanket Lodha

Internal Guide External Guide Head of Department


Name : Name : Name :
Signature : Signature : Signature :



ABSTRACT

The purpose of this project is to develop an effective malware detection system using Support Vector
Machine (SVM) algorithm. Malware poses a significant threat to computer systems and networks, and its
detection is crucial for maintaining system security. In this report, we present a detailed analysis of the
SVM algorithm and its application in malware detection. We also discuss the dataset used, feature
extraction techniques, and the evaluation metrics employed to assess the performance of the system. The
results demonstrate the effectiveness of the SVM algorithm in accurately identifying malware instances
and its potential to enhance computer security.

The continuous evolution of malware poses a significant threat to computer systems and network security.
Traditional signature-based approaches often struggle to keep up with the rapid emergence of new
malware variants. To address this challenge, machine learning techniques have gained prominence for
malware detection. This research paper focuses on the application of Support Vector Machine (SVM)
algorithm for effective malware detection. The SVM algorithm is known for its ability to handle high-
dimensional data and has shown promising results in various classification tasks. The paper discusses the
methodology, experimental setup, feature selection, dataset preparation, and performance evaluation of
SVM-based malware detection. The results demonstrate the efficacy of the SVM algorithm in accurately
classifying malware samples, thereby enhancing the overall security of computer systems.

Android plays a vital role in today's market. According to a recent survey, nearly 84.4% of users rely
on Android, which has rapidly become popular for both personal and business purposes. Android
applications are well established in the market, and their features and benefits continue to attract
users.

Android places significant responsibility on application developers to design applications with an
understanding of security risks. Malware protection is a major security concern, because Android has
been a major target of malicious applications. In Android applications, permission control is one of the
main security mechanisms. This project explores permission-induced risk in applications and the
fundamentals of the Android security architecture, and it also considers security ranking algorithms that
are specific to particular applications. We therefore propose a system that detects malware based on the
permissions an application requests and that limits access to unwanted permissions. The system is also
designed to reduce the probability of attacks that exploit vulnerabilities.



TABLE OF CONTENTS

1. COMPANY INTRODUCTION

2. INTRODUCTION AND OBJECTIVE OF THE PROJECT
1. Existing System
2. Proposed System
3. Objective of System
4. Scope of System

3. PROJECT DETAILS
3.1 Project Description

4. SYSTEM DESIGN
4.1 Feasibility Study
4.2 ER Diagram

5. PLATFORM DETAILS
5.1 Introduction
5.2 Terms Of Reference
5.3 Expertise

6. HARDWARE & SOFTWARE REQUIREMENT

7. UML DIAGRAMS
7.1 Use Case Diagram
7.2 Class Diagram
7.3 Sequence Diagram
7.4 Deployment Diagram
7.5 Activity Diagram

8. FUTURE ENHANCEMENT

9. CONCLUSION

10. BIBLIOGRAPHY



IT IMPACT
IT IMPACT has fast become a dynamic company in Client Solution Management in Information
Technology and has proven itself to be one of the market leaders.

Established in 2001, the company has quickly expanded its operations globally and has served
customers from Brunei, Hong Kong, Indonesia, Macau, Mauritius, and many other locations.

The company strives to provide the best solutions to its clients' business system needs. It places
great emphasis on each client's problems and goals and develops solutions that best fit those needs.
As a partner, it carefully evaluates a client's business needs, decides on the best methods to
represent the client's company, and develops a strong and effective solution with an enduring impact.



CHAPTER 2 INTRODUCTION

1.1 Existing System


With the expansion of the Android market and the increasing dependence on mobile phones, malicious
applications are growing rapidly. Improving the efficiency of malicious application detection has
therefore become an urgent need. Applying machine learning to malicious application detection, which
can reduce labor costs and improve detection efficiency, has become an active research direction.

Malware, short for malicious software, refers to any software or code specifically designed to damage,
disrupt, or gain unauthorized access to computer systems or networks. With the rapid growth of the
digital landscape, malware threats have become more sophisticated, posing a significant challenge to
system security. Timely detection of malware is crucial to prevent potential damage and protect sensitive
information.

1.2 Proposed System

The aim of this project is to develop a malware detection system using machine learning techniques,
specifically the Support Vector Machine (SVM) algorithm. Malware, or malicious software, poses a
significant threat to computer systems and networks, and it is crucial to detect and prevent their execution.
Traditional signature-based approaches are limited in their ability to detect new and unknown malware.
Machine learning algorithms offer a promising solution by leveraging the patterns and characteristics of
malware samples to identify malicious behavior. In this project, we focus on training an SVM model to
classify malware samples accurately.
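
The limitation of signature-based detection mentioned above can be illustrated with a minimal sketch. It assumes exact SHA-256 hash matching as a simplified stand-in for a signature database; real products use richer signatures, and the byte strings and the `signature_match` helper are hypothetical placeholders rather than part of this project's code.

```python
import hashlib

# Placeholder bytes standing in for a known malicious file (not real malware).
known_sample = b"example payload bytes used only for illustration"

# Hypothetical signature database: SHA-256 digests of known malicious files.
SIGNATURES = {hashlib.sha256(known_sample).hexdigest()}


def signature_match(file_bytes, signatures=SIGNATURES):
    """Return True if the file's SHA-256 digest is in the signature set."""
    return hashlib.sha256(file_bytes).hexdigest() in signatures


# A variant that differs by one appended byte has a different digest,
# so the exact-match lookup no longer recognizes it.
variant = known_sample + b"\x00"

known_detected = signature_match(known_sample)
variant_detected = signature_match(variant)
print(known_detected, variant_detected)
```

A learned classifier works on features of the sample rather than on an exact digest, which is why it may still flag such a variant.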

1.3 Motivation
Android has over one billion active users across its mobile devices, and this market reach increases
the amount of information collected from users. These facts have motivated cybercriminals to develop
malware. To address the problems caused by malware, Android implements a distinct architecture and
security controls, such as a unique user ID for each application, system permissions, and its
distribution platform, Google Play.

1.4 Objective
The primary objective of this research project is to develop a robust and accurate malware detection
system using the Support Vector Machine (SVM) algorithm. The SVM algorithm, known for its
effectiveness in classification tasks, holds promise in identifying malware instances by learning patterns
from labeled training data. By achieving high accuracy and low false positive rates, this system aims to
enhance computer security and safeguard against malware attacks.

The specific objectives of malware detection using the SVM algorithm include:

1. Accurate Classification: The SVM algorithm aims to accurately classify instances of malware
by learning from labeled training data. The objective is to develop a system that can effectively
differentiate between malware and benign files or activities.



2. Robust Performance: The SVM-based malware detection system should demonstrate robust
performance in terms of accuracy, precision, recall, and other evaluation metrics. It should be
capable of handling diverse malware samples, including both known and unknown variants.

3. Low False Positive Rate: False positives occur when benign files or activities are incorrectly
identified as malware. The objective is to minimize false positives and ensure that legitimate files
or activities are not mistakenly flagged as malicious.

4. Adaptability to New Malware Variants: Malware is continuously evolving, with new variants
and attack vectors emerging regularly. The SVM algorithm should possess the ability to adapt
and generalize well to new and previously unseen malware samples.

5. Efficiency and Scalability: The objective is to develop a malware detection system that is
efficient and scalable, capable of handling large-scale datasets and real-time detection
requirements without significant performance degradation.

6. Comparison with Other Techniques: The performance of the SVM algorithm in malware
detection should be evaluated and compared with other state-of-the-art techniques, such as deep
learning models, ensemble methods, or traditional signature-based approaches, to assess its
competitiveness and effectiveness.

By achieving these objectives, the use of the SVM algorithm in malware detection aims to enhance the
security of computer systems and networks, effectively identify and mitigate malware threats, and
contribute to the overall field of cybersecurity.
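
The evaluation metrics named in the objectives above (accuracy, precision, recall, and false positive rate) can be computed from the four confusion-matrix counts. The following is a minimal sketch, assuming labels are encoded as 1 for malware and 0 for benign; the `detection_metrics` helper and the toy label lists are illustrative and are not results from this project.

```python
def detection_metrics(y_true, y_pred):
    """Compute common detection metrics; labels are 1 = malware, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy example: 4 malware and 6 benign samples (made-up labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

Reporting the false positive rate alongside accuracy matters because a detector can reach high accuracy on an imbalanced dataset while still flagging many benign files.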



LITERATURE SURVEY

2.1 Study of Research Paper


Numerous techniques have been employed for malware detection, including signature-based, behavior-
based, and machine learning-based approaches. Signature-based detection relies on predefined patterns or
signatures of known malware, while behavior-based detection focuses on identifying malicious activities.
Machine learning algorithms, such as SVM, have gained popularity due to their ability to learn from data
and adapt to new malware variants. Support Vector Machine is a supervised learning algorithm that
analyzes and classifies data by finding the optimal hyperplane that maximally separates different classes.
SVM maps the input data into a high-dimensional feature space and constructs a decision boundary that
maximizes the margin between classes. This algorithm has proven effective in various domains, including
text categorization, image recognition, and now, malware detection.
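
To make the hyperplane and margin concrete, the sketch below evaluates a linear SVM decision rule. It assumes the usual canonical scaling in which support vectors satisfy |w·x + b| = 1, so the margin width is 2/||w||. The weight vector and bias are hand-picked for illustration and are not learned from any data in this report.

```python
import math


def decision_score(weights, bias, features):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return +1 (e.g. malware) or -1 (e.g. benign) for a feature vector."""
    return 1 if decision_score(weights, bias, features) >= 0 else -1


def weight_norm(weights):
    """Euclidean norm ||w|| of the weight vector."""
    return math.sqrt(sum(w * w for w in weights))


def margin_width(weights):
    """Width of the margin, 2 / ||w||, under canonical scaling."""
    return 2.0 / weight_norm(weights)


def distance_to_hyperplane(weights, bias, features):
    """Geometric distance from a point to the separating hyperplane."""
    return abs(decision_score(weights, bias, features)) / weight_norm(weights)


# Illustrative parameters (assumed, not trained).
weights, bias = [3.0, 4.0], -5.0

print(classify(weights, bias, [3.0, 4.0]))                # positive side
print(classify(weights, bias, [0.0, 0.0]))                # negative side
print(margin_width(weights))                              # 2 / 5
print(distance_to_hyperplane(weights, bias, [3.0, 4.0]))  # 20 / 5
```

Training an SVM amounts to choosing the weights and bias that make this margin as wide as possible while keeping the training samples on the correct side.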

1. Paper Name :
A Malicious Application Detection Model to Remove the Influence of Interference API Sequence
Author : Peng Tian and Xiaojun Huang
Abstract : This paper proposes a new model for detecting Android malicious applications. The
model obtains the API call sequences of APP runtime, and extracts features from them. These
features have the highest correlation with malicious attributes detection, and have the
characteristics of small redundancy between each other. We noticed that API subsequences
generated by normal behavior that may exist in a malicious application can interfere with the
training of the detector. We use VSM and K-means combined with GBDT algorithm to eliminate
this interference and improve the detection accuracy. Experiments show that this method can
effectively eliminate the influence of interference API sequence and obtain higher detection
accuracy.

2. Paper Name :
A Detecting Method for Malicious Mobile Application Based on Incremental SVM
Author : Yong Li
Abstract : Due to the rapid growth of Android malicious application samples, traditional
detection methods need to spend a lot of time training. A detecting method for malicious mobile
application based on incremental SVM was proposed to achieve incremental learning of the
detection system. The method used the SVM as the classification and training algorithm, and
extracted sensitive permissions and APIs as application characteristics. On the basis of SVM, a
dual weight function was designed to filter the historical training samples to avoid redundant
samples, and the incremental learning method of SVM was implemented in combination with
KKT conditions. Therefore, the training time could be reduced and the learning efficiency of the
malicious application detection system could be improved without reducing the training accuracy.



3. Paper Name :
Application Layer Anomaly Detection Based on HSMM
Author : Xie Bailin, Yu Shunzheng, Wang Tao
Abstract : Today more and more network-based attacks occur at the application layer. Observed
from the network layer and transport layer, these attacks may not contain significant malicious
activities, and generate abnormal network traffic. However, traditional security techniques
usually detect attacks from those two layers. Although some security techniques can detect some
application layer attacks, these techniques can only detect some known attacks and these
techniques can’t detect the unknown or novel attacks happening on the application layer. In
theory, application layer anomaly detection can detect the unknown and novel attacks happening
on the application layer, so the research of application layer anomaly detection is very important.
This paper presents a new application layer anomaly detection method which is based on HSMM.
The experimental results show that this method has high detection accuracy and low false positive
ratio.

4. Paper Name :
Anomaly Detection of Malicious Users’ Behaviors for Web Applications Based on Web Logs
Author : Yang Gao
Abstract : With more and more online services developed into web applications, security
problems based on web applications become more serious now. Most intrusion detection systems
are based on every single request to find the cyber-attack instead of users’ behaviors, and these
systems can only protect web applications from known vulnerability rather than some zero-day
attacks. In order to detect newly developed attacks, we analyze web logs from web servers and
define users’ behaviors to divide them into normal and malicious ones. The result shows that by
using the feature of web resources to define users’ behaviors, a higher accuracy rate and lower
false alarm rate of intrusion detection can be obtained.

5. Paper Name :
Malicious Android Application Detection based on Naive Bayes using Multiple Feature Set
Author : Parnika Bhat, Kamlesh Dutta
Abstract : Android is currently the most popular operating system for mobile devices in the
market. Android devices are being used by every other person for everyday life activities and it
has become a center for storing personal information. Because of these reasons it attracts many
hackers, who develop malicious software for attacking the platform; thus a technique that can
effectively prevent the system from malware attacks is required. In this paper, a malware
detection technique, MaplDroid has been proposed for detecting malware applications on
Android platform. The proposed technique statically analyzes the application files using features
which are extracted from the manifest file. A supervised learning model based on Naive Bayes is
used to classify the application as benign or malicious. MaplDroid achieved Recall score 99.12



6. Paper Name :
A Time Interval based Blockchain Model for Detection of Malicious Nodes in MANET Using
Network Block Monitoring Node
Author : Dr.V.Lakshman Narayana
Abstract : Mobile Ad Hoc Networks (MANETs) are infrastructure-less networks that are mainly
used for establishing communication during the situation where a wired network fails. Security
related information collection is a fundamental part of the identification of attacks in Mobile Ad
Hoc Networks (MANETs). A node should find accessible routes to remaining nodes for
information assortment and gather security related information during route discovery for
choosing secured routes. During data communication, malicious nodes enter the network and
cause disturbances during data transmission and reduce the performance of the system. In this
manuscript, a Time Interval Based Blockchain Model (TIBBM) for security related information
assortment that identifies malicious nodes in the MANET is proposed. The proposed model
builds the Blockchain information structure which is utilized to distinguish malicious nodes at
specified time intervals. To perform a malicious node identification process, a Network Block
Monitoring Node (NBMN) is selected after route selection and this node will monitor the blocks
created by the nodes in the routing table. At long last, NBMN nodes understand the location of
malicious nodes by utilizing the Blocks created. The proposed model is compared with the
traditional malicious node identification model and the results show that the proposed model
exhibits better performance in malicious node detection.



CHAPTER 3 PROJECT DETAILS

3.1 Project Details


The objective of this project is to develop a robust malware detection system using the Support Vector
Machine (SVM) algorithm. Malware, such as viruses, worms, and trojans, poses a significant threat to
computer systems and user data. Traditional signature-based detection methods are limited in their ability
to detect emerging and unknown malware variants. Therefore, there is a need to explore machine learning
techniques to effectively identify and classify malware samples.

The primary problem addressed in this project is the detection of malware samples using features
extracted from various sources, such as API calls, file properties, and network traffic. The challenge lies
in developing a model that can accurately distinguish between malware and benign samples, while also
generalizing well to new and unseen malware instances. Additionally, the project aims to analyze the
effectiveness of the SVM algorithm for malware detection and compare its performance with other
machine learning algorithms.
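
Before a classifier such as an SVM can use extracted features, each sample has to be turned into a fixed-length numeric vector. One common option is a binary presence vector over a fixed vocabulary, sketched below. The vocabulary entries (two API call names and three Android permission names) and the `encode_sample` helper are hypothetical examples chosen for illustration, not the feature set used in this project.

```python
# Hypothetical feature vocabulary: the order fixes the vector layout.
FEATURE_VOCABULARY = [
    "SEND_SMS",         # permission
    "READ_CONTACTS",    # permission
    "INTERNET",         # permission
    "getDeviceId",      # API call
    "sendTextMessage",  # API call
]


def encode_sample(observed_features, vocabulary=FEATURE_VOCABULARY):
    """Return a 0/1 vector marking which vocabulary features were observed."""
    observed = set(observed_features)
    # Features outside the vocabulary are ignored.
    return [1 if name in observed else 0 for name in vocabulary]


# Example: a sample that requests INTERNET and SEND_SMS plus an unknown feature.
vector = encode_sample(["INTERNET", "SEND_SMS", "SOME_UNLISTED_FEATURE"])
print(vector)
```

Keeping the vocabulary fixed between training and prediction ensures that every position in the vector always refers to the same feature.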

By successfully developing an SVM-based malware detection system, this project aims to contribute to
the field of cybersecurity by providing an efficient and reliable solution for identifying and mitigating the
risks associated with malware infections.



CHAPTER 4 FEASIBILITY STUDY

4.1 Introduction
The continuous evolution of malware poses a significant threat to computer systems and network security.
Traditional signature-based approaches often struggle to keep up with the rapid emergence of new
malware variants. As a result, there is a growing interest in exploring the feasibility of using machine
learning techniques for effective malware detection. This feasibility study aims to evaluate the practicality
and viability of implementing malware detection using machine learning algorithms, specifically focusing
on the application of Support Vector Machine (SVM) algorithm.

4.2 Purpose
The purpose of this feasibility study is to evaluate the practicality and viability of implementing
malware detection using machine learning algorithms, focusing specifically on the Support Vector
Machine (SVM) algorithm, as a response to the limitations of signature-based approaches described in
Section 4.1.

4.3 Research Questions


1. How effective is the Support Vector Machine (SVM) algorithm in detecting and classifying
malware samples compared to traditional signature-based methods?
2. What are the key performance metrics (e.g., detection accuracy, false positive rate, computational
efficiency) of the SVM algorithm for malware detection, and how do they compare to other
machine learning algorithms?
3. How does the performance of the SVM algorithm vary with different feature extraction and
selection techniques in malware detection?
4. What is the impact of dataset size and diversity on the performance of the SVM algorithm for
malware detection?
5. How does the SVM algorithm perform in detecting unknown and zero-day malware samples
compared to known malware variants?
6. What are the computational resource requirements (e.g., processing power, memory, storage) for
training and deploying an SVM-based malware detection model?
7. How does the SVM algorithm perform in differentiating between malware and benign software,
and what is the potential for false positives or false negatives?

8. How does the SVM algorithm handle evasive techniques employed by malware, such as
obfuscation or polymorphism?
9. What are the legal and ethical considerations associated with using the SVM algorithm for
malware detection, particularly in terms of privacy, data protection, and compliance with
regulations?
10. What are the potential limitations and challenges in implementing the SVM algorithm for
malware detection, and how can they be addressed or mitigated?

4.4 Limitations
Implementing malware detection using machine learning techniques, including the SVM algorithm,
comes with certain limitations. These limitations should be considered to ensure a realistic understanding
of the challenges involved. Some common limitations in the context of malware detection using machine
learning are:

1. Availability and Quality of Training Data: Machine learning models require a diverse and
representative dataset for effective training. However, obtaining high-quality and comprehensive
malware datasets can be challenging due to limited availability and restrictions on sharing
malicious samples. Biases in the training data, such as an overrepresentation of certain types of
malware, can impact the model's performance and generalizability.
2. Evolution of Malware: Malware is constantly evolving, with new variants and obfuscation
techniques emerging regularly. Machine learning models may struggle to adapt to unknown or
zero-day malware samples that do not match patterns learned during training. The need for
continuous retraining and updating of models to keep up with evolving threats is a challenge.
3. Generalization and False Positives: Machine learning models may have difficulty generalizing
to new and unseen malware samples, leading to false positives or false negatives. Overly
aggressive models can result in a high false positive rate, flagging legitimate software as
malware. Striking a balance between detection accuracy and false positives is crucial but
challenging.
4. Feature Engineering and Selection: Selecting relevant features that capture the distinctive
characteristics of malware is a non-trivial task. Feature engineering requires domain expertise and
may vary based on malware families and types. Choosing the appropriate feature set and
optimizing feature selection methods can significantly impact the model's performance.
5. Adversarial Attacks: Malware creators may intentionally design samples to evade detection by
machine learning models. Adversarial attacks can involve various techniques like obfuscation,
polymorphism, or using evasion strategies to mislead the model. These attacks can reduce the
effectiveness of machine learning-based malware detection systems.
6. Computational Resources and Performance: Training and deploying machine learning models
for malware detection can be computationally intensive. The SVM algorithm, for instance, may
require significant computational resources, especially for large-scale datasets and complex
feature sets. Ensuring efficient deployment and scalability can be challenging, particularly in
resource-constrained environments.
7. Interpretability and Explainability: Machine learning models, including SVM, can be
inherently complex and lack interpretability. Understanding the reasoning behind the model's
decisions and providing explanations for detection outcomes can be challenging. This can hinder
trust, accountability, and regulatory compliance.
8. Legal and Ethical Considerations: Using machine learning for malware detection raises ethical
and legal considerations, particularly related to privacy, data protection, and compliance with
regulations. Collecting and processing malware samples, as well as sharing and storing sensitive
information, must adhere to relevant laws and regulations.
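
The trade-off described in limitation 3 (detection accuracy versus false positives) can be seen by moving the decision threshold applied to a classifier's scores. The sketch below assumes each sample has a real-valued score, such as an SVM decision value, where higher means more likely malware. The scores and labels are made-up illustration data and `rates_at_threshold` is a hypothetical helper.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at a score threshold."""
    flagged = [score >= threshold for score in scores]
    tp = sum(1 for f, label in zip(flagged, labels) if f and label == 1)
    fp = sum(1 for f, label in zip(flagged, labels) if f and label == 0)
    positives = labels.count(1)
    negatives = labels.count(0)
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Made-up decision scores; label 1 = malware, 0 = benign.
scores = [2.1, 1.4, 0.3, -0.2, 0.6, -0.5, -1.1, -1.8]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A lower threshold catches more malware but flags more benign files.
sweep = {t: rates_at_threshold(scores, labels, t) for t in (-1.0, 0.0, 1.0)}
for threshold, (tpr, fpr) in sweep.items():
    print(f"threshold={threshold:+.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

On this toy data, no threshold gives both full detection and zero false positives, which is the balance the limitation refers to.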

4.5 References
1. https://www.ijraset.com/research-paper/malware-detection-using-machine-learning
2. https://www.researchgate.net/publication/224089748_Malware_detection_using_machine_learning
3. https://www.mdpi.com/2073-8994/14/11/2304
4. https://ieeexplore.ieee.org/document/6616872

ER DIAGRAM:



CHAPTER 5 SOFTWARE REQUIREMENTS SPECIFICATIONS

5.1 Introduction
The aim of this document is to gather and analyze the requirements of the malware detection system and
to give an in-depth insight into it by defining the problem statement in detail. The SRS describes the
main functionalities of the software with the purpose of creating an appropriate model.

A writer does not consciously draw each letter by his or her hand while writing, just like how a person
does not consciously remember and locate the position of each letter on a computer keyboard while
typing. These graphic movements generated by the subconscious mind reflect the state of the
subconscious itself. Humans have always been intrigued by variability and uniqueness of each individual.
A Graphologist can roughly interpret an individual’s character and personality traits by analyzing the
handwriting. We can use graphology to determine the personality and character profile of a person.

1. Purpose :
The purpose of implementing malware detection using machine learning in the research
paper is to address the challenges posed by evolving malware threats and explore the
effectiveness of machine learning techniques, specifically the SVM algorithm, in
detecting and classifying malware.

The traditional signature-based approaches for malware detection often struggle to keep
up with the rapid emergence of new malware variants. Machine learning algorithms, on
the other hand, have the potential to learn patterns and behaviors from large datasets,
enabling them to detect previously unseen or unknown malware samples.

5.2 Terms Of Reference


The software requirements for implementing malware detection using machine learning are as follows:

Machine Learning Algorithm: Select an appropriate machine learning algorithm, such as
Support Vector Machines (SVM), Random Forests, Decision Trees, or Neural Networks, that is
well-suited for the malware detection task. The algorithm should be capable of learning from the
extracted features and making accurate predictions.

Training and Evaluation: Implement mechanisms to split the dataset into training and testing
sets. The algorithm should be trained on the training set and evaluated on the testing set to assess
its performance. Employ appropriate evaluation metrics such as accuracy, precision, recall, and
F1 score to measure the effectiveness of the model.

Model Optimization: Fine-tune the model's hyperparameters to optimize its performance. This
may involve conducting parameter searches, cross-validation, or employing techniques such as
grid search or Bayesian optimization to find the best combination of hyperparameters.

Handling Class Imbalance: Implement techniques to handle class imbalance if present in the
dataset. Class imbalance occurs when the number of malware samples differs significantly from
the number of benign files. Techniques like oversampling, undersampling, or using class weights
can address this issue and prevent bias towards the majority class.

Validation and Generalization: Validate the trained model using an independent validation set
or through cross-validation techniques. Ensure that the model generalizes well to unseen data and
is not overfitting or underfitting the training data.
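
Two of the requirements above, splitting the dataset and handling class imbalance, can be sketched with the standard library alone. The split below is stratified, so both sets keep roughly the same malware-to-benign ratio, and the class weights follow the common "balanced" heuristic n_samples / (n_classes * class_count). The function names, the 25% test fraction, and the toy labels are assumptions made for illustration. The sketch also assumes every class has at least two samples.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices into train/test sets, preserving class ratios."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample of every class for testing.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy imbalanced dataset: 4 malware (1) and 12 benign (0) samples.
labels = [1] * 4 + [0] * 12
train_idx, test_idx = stratified_split(labels)
weights = balanced_class_weights([labels[i] for i in train_idx])
print(len(train_idx), len(test_idx), weights)
```

The resulting weights give the minority class more influence on the training loss, which is one of the class-weight techniques mentioned above for reducing bias toward the majority class.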

5.3 Expertise
The expertise needed for a project defines a set of professional requirements for the individuals
and teams involved in project implementation. It is the basis for team building, including training and
skill assessment.

Name                 Roles
Priyanka Pawar       Leader, Developer, Documentation
Pratiksha Wayale     Asset Management, Co-Developer
Aditi Patil          Asset Management, Design, Documentation
Chaitanya Ingale     Design, Documentation

Table 5.3: Team members and their roles



CHAPTER 6 PROJECT REQUIREMENT

6.1 External Interface Requirement


6.1.1 User Interface

Application Based malicious Application Detection.

6.1.2 Hardware Interfaces


Application Based malicious Application Detection.

RAM : 8 GB

Because the project uses machine learning algorithms and several high-level libraries, a minimum of
8 GB of RAM is required.

Hard Disk : 40 GB

The dataset of malware and benign samples is stored locally, so a minimum of 40 GB of hard disk space is required.

Processor : Intel i5 Processor

The PyCharm Integrated Development Environment (IDE) is used and data loading should be
fast, hence a fast processor is required.

IDE : PyCharm

PyCharm offers code-completion suggestions while typing, which makes writing code faster.

Coding Language : Python Version 3.5

Python is widely used for machine learning because of the availability of high-performance
libraries.

Operating System : Windows 10

Windows 10 supports the installations and development environment required for this project.

6.1.3 Software Interface

Operating System: Windows 10


IDE: PyCharm, Spyder
Programming Language : Python



6.2 Non Functional Requirement
6.2.1 Performance Requirement
The performance of every function and module must be good, so that the overall performance of
the software enables users to work efficiently. Encryption of data should be fast, and the virtual
environment should be provided quickly.

6.2.2 Safety Requirement


The application is designed in modules where errors can be detected and fixed easily. This makes
it easier to install and update new functionality if required.

6.2.3 Software Quality Attributes Requirement


Our software has many quality attributes, which are given below:

1. Adaptability: The software can be adopted by all users.
2. Availability: The software is freely and easily available to all users.
3. Maintainability: If an error occurs after deployment, it can be easily fixed by the
software developer.
4. Reliability: The software performs well, which increases its reliability.
5. User Friendliness: Since the software is a GUI application, its output is presented in a
user-friendly way.
6. Integrity: Integrity refers to the extent to which access to software or data by
unauthorized persons can be controlled.
7. Security: Users are authenticated in several security phases, so reliable security is
provided.
8. Testability: The software will be tested considering all aspects.

CHAPTER 7 SYSTEM ANALYSIS

7.1 System Architecture

7.2 Methodology
Data Collection :
To train and evaluate the SVM model, a dataset of malware samples and benign files is required. We
collect a diverse set of malware samples from public malware repositories and benign files from
legitimate software sources. The dataset is carefully curated to ensure a balanced representation of
different malware families and benign applications.

To evaluate the performance of the SVM-based malware detection system, a comprehensive dataset
containing both malware and benign samples is required. The dataset should represent diverse malware
families and cover a wide range of features. In this project, we obtained a dataset from a reputable
malware research lab, consisting of approximately 10,000 samples.

Feature Extraction :
Next, we extract relevant features from the collected samples. These features can include static features
such as file size, entropy, and opcode frequency, as well as dynamic features obtained by analyzing the
behavior of malware samples in a controlled environment. Feature extraction techniques play a crucial
role in capturing the discriminative characteristics of malware and distinguishing it from benign files.

In this research, we focused on extracting both static and dynamic features, including system calls, byte-
level n-grams, and file entropy.
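
Two of the static features named above, file entropy and byte-level n-grams, can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names and the sample bytes are hypothetical and are not taken from the project code.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n consecutive bytes."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def static_features(data: bytes, n: int = 2, top_k: int = 3) -> dict:
    """Collect file size, entropy and the top_k most frequent byte n-grams."""
    return {
        "size": len(data),
        "entropy": shannon_entropy(data),
        "top_ngrams": byte_ngram_counts(data, n).most_common(top_k),
    }


sample = b"MZ\x90\x00" * 8  # stand-in bytes, not a real executable
print(static_features(sample))
```

Packed or encrypted files tend to have entropy close to 8 bits per byte, which is one reason entropy is commonly used as a static feature.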

Preprocessing :
Before training the SVM model, we preprocess the dataset to handle missing values, normalize the feature
values, and address class imbalance if present. Preprocessing techniques such as feature scaling and
oversampling/undersampling are employed to enhance the model's performance.
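
The preprocessing steps described above could look roughly like the following sketch. It assumes median imputation, standardization, and random oversampling of the minority class; the function name preprocess and the toy data are hypothetical, and the project may have used different choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample


def preprocess(X, y, random_state=0):
    """Impute missing values, standardize features, then oversample the minority class."""
    X = SimpleImputer(strategy="median").fit_transform(X)
    X = StandardScaler().fit_transform(X)

    classes, counts = np.unique(y, return_counts=True)
    majority = counts.max()
    X_parts, y_parts = [], []
    for label, count in zip(classes, counts):
        X_c, y_c = X[y == label], y[y == label]
        if count < majority:
            # Sample with replacement until this class matches the majority size
            X_c, y_c = resample(X_c, y_c, replace=True, n_samples=majority,
                                random_state=random_state)
        X_parts.append(X_c)
        y_parts.append(y_c)
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy data: 6 benign rows (0), 2 malware rows (1), one missing value
X_demo = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [2.5, 210.0],
                   [1.5, 190.0], [2.2, 205.0], [9.0, 900.0], [8.5, 950.0]])
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = preprocess(X_demo, y_demo)
print(X_bal.shape, np.bincount(y_bal))
```

In a real experiment the imputer and scaler should be fitted on the training split only, and oversampling should be applied only to training data, so that no information from the test set leaks into the model.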

SVM Algorithm :

Support Vector Machines (SVMs) are a popular class of supervised learning algorithms used for
classification tasks. They work by mapping input data to a high-dimensional feature space and finding the
optimal hyperplane that separates different classes. In this project, we utilize the SVM algorithm for
binary classification, with malware and benign files as the two classes. We explore different kernel
functions, such as linear, polynomial, and radial basis function (RBF), to find the best configuration for
our dataset.

We implemented the SVM algorithm using the Scikit-learn library in Python. Before training the SVM
model, the dataset was preprocessed to remove noise, handle missing values, and balance the class
distribution if necessary. We experimented with different kernel functions and hyperparameters to
optimize the performance of the SVM classifier.
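
The kernel and hyperparameter exploration mentioned above can be sketched with scikit-learn's GridSearchCV. The parameter values and the synthetic data below are assumptions chosen for illustration; they are not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling inside the pipeline is refitted on each training fold
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10], "svc__degree": [2, 3]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
]
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1: %.3f" % search.best_score_)
```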

The malware detection approach is based on machine learning with SVM classifiers. It is extended with
feature filtering in an attempt to improve performance. The malware detection system is implemented in
the following stages.

Dataset Preparation. The dataset is prepared from two sets of executables, one malicious and one
benign. The malicious executables can perform a range of malicious activities, such as back-doors,
downloaders, system attacks, fake alerts, fake warnings, adware, and information stealers. Our system
uses the SVM classifier machine learning technique, which has two phases: training and testing.

Feature Extraction. An opcode is an operational code, a machine language instruction that specifies the
operation to be performed. The operands associated with each opcode are omitted. Opcode occurrences are
then counted and parsed into density histograms for each executable. The next step is arranging the
extracted opcode sequences in the same logical order as they appear in the executable file.

Opcode Filtering. We employed a feature filtering approach to limit the explosion in the number of
features and to reduce interference and the misclassification of benign and malicious software. The
proposed system uses Principal Component Analysis to find a subspace, determine the importance of the
individual opcodes, and weed out irrelevant opcodes.
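
A minimal sketch of the opcode density histograms and the PCA-based filtering follows. The opcode sequences are hypothetical, and ranking opcodes by the magnitude of their loading on the first principal component is one assumed way of judging opcode importance, not necessarily the one used in the project.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical opcode sequences, operands already stripped
samples = {
    "benign_a": ["push", "mov", "call", "mov", "pop", "ret"],
    "benign_b": ["push", "mov", "mov", "add", "pop", "ret"],
    "malware_a": ["xor", "xor", "jmp", "call", "xor", "jmp"],
    "malware_b": ["xor", "jmp", "jmp", "call", "xor", "mov"],
}

vocabulary = sorted({op for seq in samples.values() for op in seq})


def density_histogram(opcodes):
    """Relative frequency of each vocabulary opcode in one executable."""
    counts = Counter(opcodes)
    return np.array([counts[op] / len(opcodes) for op in vocabulary])


X = np.vstack([density_histogram(seq) for seq in samples.values()])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Rank opcodes by the magnitude of their loading on the first component
loadings = np.abs(pca.components_[0])
ranking = [vocabulary[i] for i in np.argsort(loadings)[::-1]]
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Opcodes ranked by first-component loading:", ranking)
```

Opcodes at the bottom of the ranking contribute little to the main direction of variation and are candidates for removal.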

SVM Based Classification. Feature values are scaled before classification. Scaling prevents attributes
in greater numeric ranges from dominating those in smaller numeric ranges, and it also avoids numerical
difficulties during the calculation of kernel values, which depend on the inner products of feature
vectors. SVM works in two phases: a training phase and a testing phase.

Behavior Monitoring. Every file in the dataset can be executed in an automated environment, with dynamic
analysis running in parallel, so that the behavior of the programs can be monitored. This performs
automatic behavior analysis of the executed files in a sandbox and generates XML reports based on each
behavior profile.

7.3 Implementation

The implementation of the malware detection system is carried out using Python programming language
and various libraries such as scikit-learn, pandas, and numpy. The code is developed and tested in popular
integrated development environments (IDEs) such as Jupyter and Spyder.
The dataset is loaded from CSV files, and the SVM model is trained using the extracted features. Cross-
validation techniques are employed to evaluate the model's performance, and appropriate evaluation
metrics, including accuracy, precision, recall, and F1 score, are calculated.

The dataset was split into training and testing sets using a stratified sampling technique. The training set
was used to train the SVM model on the extracted features, while the testing set was used to evaluate its
performance. Cross-validation was employed to ensure robustness and minimize overfitting.
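
The stratified split and cross-validated metrics described above can be sketched as follows. The synthetic data stands in for the feature CSV, and the fold count of 5 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the feature CSV (about 70% class 0)
X, y = make_classification(n_samples=300, n_features=12, weights=[0.7, 0.3], random_state=42)

# stratify=y keeps the class ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(SVC(kernel="linear"), X_train, y_train, cv=cv,
                        scoring=["accuracy", "precision", "recall", "f1"])

for metric in ["accuracy", "precision", "recall", "f1"]:
    values = scores["test_" + metric]
    print("%-9s mean=%.3f std=%.3f" % (metric, values.mean(), values.std()))
```

Cross-validation runs on the training portion only, so the held-out test set remains untouched for the final evaluation.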

Working of SVM Algorithm :

import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset
data = np.loadtxt('malware_dataset.csv', delimiter=',')
X = data[:, :-1]  # Features
y = data[:, -1]   # Labels

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the SVM model
svm_model = svm.SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Step 4: Make predictions on the testing set
y_pred = svm_model.predict(X_test)

# Step 5: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

The code will load the dataset, split it into training and testing sets, train the SVM model with a linear
kernel, make predictions on the testing set, and evaluate the model's accuracy. The accuracy score will be
displayed in the Spyder console.

In the code above:

1. Import the necessary libraries: numpy for loading the data, train_test_split from
scikit-learn for splitting the dataset, the svm module from scikit-learn for creating the SVM
classifier, and accuracy_score from scikit-learn for evaluating the model's performance.
2. Load the malware dataset using the np.loadtxt() function. Make sure to replace
'malware_dataset.csv' with the actual path to your dataset. The file must be purely numeric, with
no header row and with the label in the last column.
3. Separate the features (X) and labels (y) from the dataset.
4. Split the data into training and testing sets using train_test_split(). Here, we are using
80% of the data for training and 20% for testing, but you can adjust these percentages as needed.
5. Create an SVM classifier object (svm_model) using the linear kernel.
6. Train the classifier using the training data with the fit() function.
7. Predict the labels for the test set using the predict() function.
8. Evaluate the model by calculating the accuracy using accuracy_score(). For a more detailed
evaluation, a confusion matrix and a classification report can be produced in the same way with
confusion_matrix() and classification_report() from scikit-learn.
9. Print the accuracy.

Working of Random Forest Algorithm :

Random Forest is an ensemble learning algorithm that combines multiple decision trees to make
predictions. It is also commonly used for malware detection tasks. Here's an example of how to
implement Random Forest for malware detection using a CSV file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset from CSV file
data = pd.read_csv('malware_dataset.csv')

# Separate features (X) and labels (y)
X = data.drop('label', axis=1)
y = data['label']

# Split the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest classifier
rf = RandomForestClassifier()

# Train the Random Forest classifier
rf.fit(X_train, y_train)

# Make predictions on the test set
predictions = rf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

The classification process had two main phases: training and testing. During training, the system was
given both malicious and benign files, and a learning algorithm was used to train the classifiers. Each
classifier (KNN, CNN, NB, RF, SVM, or DT) improved with each set of labeled data it processed. In the
testing phase, a classifier was given a collection of new files, some malicious and some benign, and
determined whether each file was malicious or clean.

In this section, we provide a detailed explanation of our algorithm for malware detection using the SVM
classifier. We also discuss the feature extraction techniques employed in the system.

Code Implementation of Main Module :

import tkinter as tk
from subprocess import call

import pandas as pd
from joblib import dump
from PIL import Image, ImageTk
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

CSV_PATH = "new1.csv"
MODEL_PATH = "botnet_MODEL.joblib"

# Columns converted to integer codes with LabelEncoder
ENCODED_COLUMNS = [
    'TelephonyManager.*getDeviceId', 'TelephonyManager.*getSubscriberId',
    'abortBroadcast', 'SEND_SMS', 'DELETE_PACKAGES', 'PHONE_STATE',
    'RECEIVE_SMS', 'Ljava.net.InetSocketAddress', 'READ_SMS',
    'android.intent.action.BOOT_COMPLETED', 'io.File.*delete(',
    'chown', 'chmod', 'mount', '.apk', '.zip', '.dex', 'CAMERA',
    'ACCESS_FINE_LOCATION', 'INSTALL_PACKAGES',
    'android.intent.action.BATTERY_LOW', '.so',
    'android.intent.action.ACTION_POWER_CONNECTED',
    'System.*loadLibrary', '.exe',
]

# Feature Selection => Manual: columns left out of the feature matrix
DROPPED_COLUMNS = [
    'ACCESS_NETWORK_STATE', 'BLUETOOTH', 'ACCESS_WIFI_STATE', 'BROADCAST_SMS',
    'CALL_PHONE', 'CALL_PRIVILEGED', 'CLEAR_APP_CACHE', 'CLEAR_APP_USER_DATA',
    'CONTROL_LOCATION_UPDATES', 'INTERNET', 'Result',
]


def load_dataset():
    """Read the CSV file, drop incomplete rows and label-encode the feature columns."""
    data = pd.read_csv(CSV_PATH).dropna()
    le = LabelEncoder()
    for column in ENCODED_COLUMNS:
        data[column] = le.fit_transform(data[column])
    x = data.drop(DROPPED_COLUMNS, axis=1)
    y = data['Result']
    return x, y


root = tk.Tk()
root.title("GUI")

w, h = root.winfo_screenwidth(), root.winfo_screenheight()
root.geometry("%dx%d+0+0" % (w, h))

image2 = Image.open('bg1.jpg')
image2 = image2.resize((w, h), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
background_image = ImageTk.PhotoImage(image2)
background_label = tk.Label(root, image=background_image)
background_label.image = background_image
background_label.place(x=0, y=0)
lbl = tk.Label(root, text="Malicious Application Prediction using ML", font=('times', 25, 'bold'),
               height=1, width=70, bg="black", fg="red")
lbl.place(x=0, y=0)


def Data_Preprocessing():
    # Load, encode and split the data, then report the split in the GUI
    x, y = load_dataset()
    print(type(x))
    print(type(y))
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
    load = tk.Label(root, font=("Tempus Sans ITC", 15, "bold"), width=50, height=2,
                    background="green", foreground="white",
                    text="Data Loaded=>Split into 80% for Training & 20% for Testing")
    load.place(x=200, y=80)


def Model_Training():
    x, y = load_dataset()
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=123)

    svcclassifier = SVC(kernel='linear')
    svcclassifier.fit(x_train, y_train)
    y_pred = svcclassifier.predict(x_test)
    print(y_pred)

    print("=" * 40)
    repo = classification_report(y_test, y_pred)
    ACC = accuracy_score(y_test, y_pred) * 100
    print("Classification Report : ", repo)
    print("Accuracy: %.2f%%" % ACC)

    # Save the trained model before reporting it in the GUI
    dump(svcclassifier, MODEL_PATH)
    print("Model saved as " + MODEL_PATH)

    label4 = tk.Label(root, text=str(repo), width=45, height=10, bg='khaki', fg='black',
                      font=("Tempus Sans ITC", 14))
    label4.place(x=205, y=200)

    label5 = tk.Label(root, text="Accuracy : " + str(ACC) + "%\nModel saved as " + MODEL_PATH,
                      width=45, height=3, bg='khaki', fg='black', font=("Tempus Sans ITC", 14))
    label5.place(x=205, y=420)


def call_file():
    # Launch the prediction screen as a separate process
    call(['python', 'Check_predict.py'])


def window():
    root.destroy()


# button2 = tk.Button(root, foreground="white", background="black", font=("Tempus Sans ITC", 14, "bold"),
#                     text="Data_Preprocessing", command=Data_Preprocessing, width=15, height=2)
# button2.place(x=5, y=90)

# button3 = tk.Button(root, foreground="white", background="black", font=("Tempus Sans ITC", 14, "bold"),
#                     text="Model Training", command=Model_Training, width=15, height=2)
# button3.place(x=5, y=170)

button4 = tk.Button(root, foreground="white", background="black", font=("Tempus Sans ITC", 14, "bold"),
                    text="Detect Malicious", command=call_file, width=15, height=1)
button4.place(x=600, y=150)

button5 = tk.Button(root, text="Exit", command=window, width=15, height=1, font=('times', 14, 'bold'),
                    bg="red", fg="white")
button5.place(x=600, y=200)

root.mainloop()

Performance of different models on different dataset :

7.4 Modelling and Analysis


We use the area under the ROC curve (AUC) to quantify the success of the experiments. Given a scatter
plot of scores, an ROC curve is obtained by plotting the true positive rate against the false positive
rate as the threshold varies through the range of data values. An AUC of 1.0 implies that there exists a
threshold that results in no false positives or false negatives, which is the ideal case. The AUC can be
interpreted as the probability that a randomly selected positive instance scores higher than a randomly
selected negative instance.

Mathematical Model: The classifier used is the Support Vector Machine (SVM). The support vector machine
was developed by Vapnik for binary classification. Its objective is to find the optimal hyperplane
f(w, x) = w · x + b that separates two classes in a given dataset, with features x ∈ R^m. SVM learns the
parameters w and b by solving a constrained optimization problem. SVM may be used for multinomial
classification as well, and a linear model can be converted to a non-linear model by applying kernel
functions such as the radial basis function. However, for this study, we utilized the linear L2-SVM for
the multinomial classification problem. We then employed the one-versus-all approach, which treats a
given class as the positive class and all others as the negative class.

Suppose that the AUC for a given experiment is x, where x < 0.5. Then by simply reversing the sense of
the binary classifier, we obtain an AUC of 1 − x > 0.5. Consequently, some of the low AUC graphs actually
represent relatively strong scores when properly interpreted. It appears that the SVM is able to properly
interpret such scores, which is entirely plausible based on the geometric intuition behind the SVM
technique. These results provide additional evidence of the strength of SVMs for this particular
application.
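
The two AUC properties discussed above, the pairwise-ranking interpretation and the effect of reversing the classifier, can be checked numerically. The labels and scores below are hypothetical, and pairwise_auc is an illustrative helper rather than part of the project code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = malware) and classifier scores
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])  # a classifier that scores "backwards"

auc = roc_auc_score(y_true, scores)
flipped_auc = roc_auc_score(y_true, -scores)  # reverse the sense of the classifier


def pairwise_auc(y, s):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos, neg = s[y == 1], s[y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print("AUC:", auc)
print("AUC after flipping:", flipped_auc)
print("Pairwise estimate:", pairwise_auc(y_true, scores))
```

Here only one of the nine positive/negative pairs is ranked correctly, so the AUC is 1/9, and negating the scores gives 8/9.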

Computational Results :

Input Data Description


The Support Vector Machine method was used to classify heterogeneous datasets. The input data were
collected from the real malware database of the N6 platform, which was developed at the Research and
Academic Computer Network (NASK). The purpose of the system is to monitor computer networks and to
collect and analyze data about events such as threats and incidents. Most of the information is
updated daily. The N6 can be compared to a sorting plant for incidents, the heart of which is the N6
engine. Thanks to a sophisticated tagging system, incidents can be assigned to unique entities (e.g. based
on IP addresses and AS numbers). Data is collected into a special package that keeps the original source
format (each source in a separate file). Additionally, it is possible to provide other information, e.g. about
C&C servers, which are not part of a client network but can be used to detect infected computers.
Information about malicious sources is transferred by the platform as URLs, domains, IP addresses or
names of malware.

Preliminary input dataset analysis


The classification process should be preceded by a preliminary analysis of the training data, which
consists of 398 samples. We preprocessed the data taken from the N6 platform and compared the samples
using four parameters assigned to each of them: time, format, domain, and address. The record time
consists of the date and time at which an event was inserted into the N6 database. To perform the
analysis, the domains were clustered and converted to a numerical format.

Evaluation of SVM classification
The SVM method was validated on a heterogeneous malware dataset. Three commonly used validation
techniques were used to evaluate the results of our classification:

CR: Cross Validation with 5 folds.
LOO: Leave-One-Out method.
RS: Random Sampling with 5 repetitions of the training process and a relative training set size of 50%.
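
The three validation schemes can be expressed with scikit-learn splitters, as in the sketch below. The synthetic data and the use of a linear SVC are assumptions made for illustration; ShuffleSplit is used here as the scikit-learn counterpart of repeated random sampling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the malware feature table
X, y = make_classification(n_samples=60, n_features=8, random_state=1)
model = SVC(kernel="linear")

schemes = {
    "CR": KFold(n_splits=5, shuffle=True, random_state=1),            # 5-fold cross validation
    "LOO": LeaveOneOut(),                                             # one sample held out per fit
    "RS": ShuffleSplit(n_splits=5, train_size=0.5, random_state=1),   # 5 random 50% splits
}

results = {}
for name, splitter in schemes.items():
    scores = cross_val_score(model, X, y, cv=splitter, scoring="accuracy")
    results[name] = scores.mean()
    print("%-3s fits=%2d mean accuracy=%.3f" % (name, len(scores), scores.mean()))
```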

Then, we evaluated the quality of our classification system based on the results of these validations.
Several criteria were taken into consideration:

CA - Classification accuracy,
Sens - Sensitivity,
Spec - Specificity,
AUC - Area under ROC curve,
F1 - F-measure,
Prec - Precision
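
The six criteria can be computed from predictions and decision scores as in the following sketch. The evaluate helper and the example values are hypothetical; specificity is derived from the confusion matrix because scikit-learn has no dedicated function for it.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def evaluate(y_true, y_pred, y_score):
    """Return the six criteria for a binary problem where 1 is the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "CA": accuracy_score(y_true, y_pred),
        "Sens": recall_score(y_true, y_pred),   # TP / (TP + FN)
        "Spec": tn / (tn + fp),                 # TN / (TN + FP)
        "AUC": roc_auc_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred),
    }


# Hypothetical predictions and decision scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]
metrics = evaluate(y_true, y_pred, y_score)
for name, value in metrics.items():
    print("%-4s %.3f" % (name, value))
```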

Dataset Creation
At first, we downloaded 27104 malicious executables compiled by the VX Heaven website. We created a
dataset of 52803 elements by combining 51243 unpacked malicious files from this collection with 1560
benign files from various sources. We extracted the text from the dataset and constructed vectors
accordingly. There are various weighting methods, including frequency counting and TF-IDF. In this
research, we used the frequency counting and TF-IDF approaches together. Constructing vectors from
frequencies is called bag-of-words. In addition to word frequencies, we counted bigrams (sequences of two
adjacent words) and constructed vectors from them. Our dataset has 40 classes.

Experiment
We split the dataset into two subsets, a training set and a test set. The training set used 67 percent of
the whole dataset and the test set used 33 percent. We performed the experiment 4 times. In each
experiment, we randomly selected the training and test sets from the primary dataset. After creating the
training set, we trained a linear SVM on it for the classification. The remaining 33% of the dataset, the
test set, was then classified by the previously trained model. During the experiment, we constructed
vectors of 297003 features per sample.

Result of Experiment

7.5 Diagrams
7.5.1 Data Flow Diagram :
The Data Flow Diagrams show the flow of data through our system. DFD 0 is the base diagram, in
which rectangles represent the input and output and the circle represents our system. DFD 1 shows
the actual input and output of the system: the input is the feature data of the application under
test and the output is the detection result (malicious or benign). DFD 2 shows the same flow in
more detail.

Figure 7.5.1: Data Flow(0) diagram

Figure 7.5.2: Data Flow(1) diagram

Figure 7.5.3: Data Flow(2) diagram

7.5.2 Class Diagram
The class diagram provides an overview of the system's structure, depicting the classes, their
attributes, and the relationships between them. In the context of malware detection using SVM
algorithm, the class diagram would include classes such as MalwareDetector, FeatureExtractor,
SVM, DataPreprocessor, and possibly others depending on the specific implementation.

7.5.3 Sequence Diagram


The sequence diagram illustrates the interactions and flow of messages between different objects
or components over a specific period of time. In the case of malware detection, a sequence
diagram could demonstrate the steps involved in training the SVM model, such as data
preprocessing, feature extraction, and training the SVM classifier. It can also depict the process of
predicting the class of a new sample using the trained SVM model.

User: Represents the user interacting with the malware detection system.
MalwareDetector: Represents the main component responsible for coordinating the malware
detection process.
FeatureExtractor: Represents the component responsible for extracting features from malware
samples.
SVM: Represents the component responsible for training the SVM model and making
predictions.

7.5.4 Activity Diagram
The activity diagram represents the flow of activities or processes within the system. In the
context of malware detection, an activity diagram can depict the overall workflow of the malware
detection system, including steps such as data preprocessing, feature extraction, training the SVM
model, and classifying new samples. It can show decision points, loops, and parallel activities.

7.5.5 Use Case Diagram
The use case diagram provides an overview of the system's functionalities from a user's
perspective. In the case of malware detection, it could represent different use cases, such as
"Train SVM Model," "Classify Sample," and "Evaluate Performance." These use cases would
involve interactions between the user (or external systems) and the malware detection system,
showcasing the system's capabilities.

Results of the experiments demonstrated that the technique achieves its best results for the detection of
mobile malware such as DDoS, spyware, SMS malware, and botnets.

At the same time, the efficiency of the system with respect to rootkits is rather low. This is because the
behavior of some malware is very similar to that of legitimate users, and some malware features were not
taken into account in the detection process.

CHAPTER 8 RESULTS

Figure 8.1: Output (1)

Figure 8.2: Output (2)

Figure 8.3: Output (3)

Figure 8.4: Output (4)

Figure 8.5: Output (5)

Figure 8.6: Output (6)

CHAPTER 9 TEST CASES

9.1 Test case No 1

9.2 Test case No 2

9.3 Test case No 3

CHAPTER 10 CONCLUSION

In this project, we have successfully implemented a malware detection system using the SVM algorithm.
The objective of the project was to develop an effective system that can accurately identify and classify
instances of malware within computer systems or networks.

A new technique for mobile malware detection based on the analysis of the malware's network features is
proposed. It uses SVM to detect malicious programs. The approach provides the ability to detect malware
on mobile devices.

The support vector machine was used as the inference engine for malware detection. The detection
process takes into account the malware's features captured on the mobile devices.

Experimental research showed that SVMs are able to produce accurate classification results.
Integrating the SVM-based inference engine into the mobile malware detection process yielded a mean
detection accuracy of up to 98.01%. Experiments demonstrated that this technique is able to detect
different types of malware with accuracy ranging from 90.28% to 98.21%, while the false positive rate is
about 5%.

CHAPTER 11 FUTURE SCOPE

Ideally, future work will involve a larger dataset so that our system may be taught to recognise and
classify the exact target data endpoints that are less well-known in current scenarios.

In the future, we can follow the directions outlined below to achieve the desired outcome:

Enhancing Feature Extraction: Explore more advanced and comprehensive feature extraction
techniques to capture diverse aspects of malware behavior. This can involve incorporating static and
dynamic features, considering file metadata, analyzing network traffic patterns, or utilizing
behavior-based features.
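As one possible illustration of static features, the sketch below derives three simple values from a file's raw bytes: size, byte entropy, and the share of printable ASCII bytes. The feature set and the function names `byte_entropy` and `static_features` are assumptions made for this example and are not the features used in this project.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def static_features(data: bytes) -> dict:
    """Hypothetical static feature vector computed from raw file bytes."""
    printable = sum(32 <= b < 127 for b in data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "printable_ratio": printable / len(data) if data else 0.0,
    }


# Example input: every byte value once, which gives the maximum entropy.
sample = bytes(range(256))
print(static_features(sample))
```

Byte entropy is often used as a hint that content is packed or encrypted, but on its own it does not separate malware from benign files; it would be one column among many in a feature vector.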

Class Imbalance Handling: Address the challenge of class imbalance in the dataset. Class imbalance
occurs when there are significantly more instances of benign files than malware samples (or vice versa).
Investigate techniques such as oversampling, undersampling, or generating synthetic samples to balance
the classes and prevent the model from being biased towards the majority class.
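Random oversampling is the simplest of the balancing techniques named above. The sketch below duplicates randomly chosen minority-class samples until every class has as many samples as the largest one. It uses only the Python standard library; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) only for under-represented classes.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_samples.append(x)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data for illustration: six benign samples and two malware samples.
samples = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [6.0]]
labels = ["benign"] * 6 + ["malware"] * 2
xs, ys = random_oversample(samples, labels)
print(Counter(ys))
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated samples do not leak into the evaluation data.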

Multi-Class Classification: Extend the malware detection system to handle multi-class classification,
where different types of malware are classified into multiple categories. This would involve training the
SVM algorithm on a dataset with more than two classes and adapting the decision boundaries.
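One common way to extend a binary SVM to several classes is the one-vs-rest scheme: train one binary classifier per class and predict the class whose classifier gives the highest score. The sketch below shows only the relabeling and prediction steps. The class names, weights, and biases are hypothetical, hand-set values standing in for trained linear SVMs.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class target as +1 (this class) or -1 (all others)."""
    return [1 if y == positive_class else -1 for y in labels]


def linear_score(weights, bias, x):
    """Decision value of a linear classifier: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_one_vs_rest(models, x):
    """Pick the class whose binary classifier scores x the highest."""
    scores = {cls: linear_score(w, b, x) for cls, (w, b) in models.items()}
    return max(scores, key=scores.get)


# Hypothetical, hand-set (weights, bias) pairs standing in for trained models.
models = {
    "benign": ([-1.0, -1.0], 0.5),
    "trojan": ([1.0, 0.0], 0.0),
    "ransomware": ([0.0, 1.0], 0.0),
}

print(one_vs_rest_labels(["benign", "trojan", "ransomware"], "trojan"))
print(predict_one_vs_rest(models, [2.0, 0.5]))
```

In practice each `(weights, bias)` pair would come from training a binary SVM on the output of `one_vs_rest_labels` for that class.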

Advanced SVM Configurations: Experiment with different SVM configurations, such as non-linear
kernels (e.g., polynomial, radial basis function) or using support vector regression (SVR) for
continuous-valued outputs. Explore the impact of these configurations on the detection performance and compare
them with the linear SVM.
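To make the kernel options concrete, the sketch below writes the linear, polynomial, and radial basis function (RBF) kernels as plain Python functions. The parameter names `gamma`, `coef0`, and `degree` follow common SVM conventions, and the default values are assumptions chosen for this example only.

```python
import math


def linear_kernel(a, b):
    """k(a, b) = a . b"""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2, gamma=1.0, coef0=1.0):
    """k(a, b) = (gamma * a . b + coef0) ** degree"""
    return (gamma * linear_kernel(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2)"""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))      # 1*3 + 2*4 = 11.0
print(polynomial_kernel(u, v))  # (11 + 1) ** 2 = 144.0
print(rbf_kernel(u, u))         # identical points give 1.0
```

A non-linear kernel lets the SVM separate classes that are not linearly separable in the original feature space, at the cost of extra hyperparameters (here `gamma`, `coef0`, `degree`) that need tuning on validation data.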

Incremental Learning: Implement incremental learning techniques that allow the SVM model to adapt
and learn from new data over time. This would enable the system to update its knowledge and improve
detection accuracy as new malware samples are encountered.
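One way to approximate incremental learning for a linear SVM is to apply a stochastic subgradient step on the regularized hinge loss for each new sample, so the model can be updated batch by batch without retraining from scratch. The sketch below is an illustration under that assumption, not the training procedure used in this project; the class name `OnlineLinearSVM`, its hyperparameters, and the toy batches are hypothetical.

```python
class OnlineLinearSVM:
    """Linear SVM updated one sample at a time (hinge loss, L2 penalty)."""

    def __init__(self, n_features, learning_rate=0.1, reg=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate
        self.reg = reg

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, batch):
        """Update the model with (features, label) pairs; labels are +1 or -1."""
        for x, y in batch:
            margin = y * self.decision(x)
            # The L2 penalty shrinks the weights on every step.
            self.w = [wi * (1 - self.lr * self.reg) for wi in self.w]
            if margin < 1:
                # The sample is inside the margin or misclassified: move toward it.
                self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
                self.b += self.lr * y

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1


model = OnlineLinearSVM(n_features=2)
model.partial_fit([([2.0, 2.0], 1), ([-2.0, -2.0], -1)])   # initial data
weights_before = list(model.w)
model.partial_fit([([0.5, 0.5], 1), ([-0.5, -0.5], -1)])   # samples seen later
print(model.predict([1.0, 1.5]), model.predict([-1.0, -0.5]))
```

For reference, scikit-learn's `SGDClassifier` with `loss="hinge"` provides a `partial_fit` method built on a similar idea. Kernel SVMs do not update this simply, which is a trade-off to weigh against the non-linear configurations discussed earlier.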

Malware Variant Detection: Focus on detecting new and emerging malware variants that exhibit
different characteristics than the known malware samples. Investigate techniques such as transfer
learning, where knowledge gained from known malware types is transferred to identify new variants with
similar characteristics.

Real-Time Detection and Scalability: Optimize the malware detection system for real-time analysis and
scalability. This includes improving the efficiency of feature extraction, model training, and prediction to
enable fast and accurate detection even in high-traffic or resource-constrained environments.

Malware Attribution and Integration with Security Systems: Integrate the malware detection system
with existing security systems, such as intrusion detection systems (IDS), firewalls, or security
information and event management (SIEM) platforms. This would enhance overall cybersecurity
measures by enabling proactive malware detection and response.



CHAPTER 12 REFERENCES

1. Franklin Tchakount, Computers & Security, “Permission-based Malware Detection Mechanisms
on Android: Analysis and Perspectives”, 2014.
2. McAfee Mobile Threat Report Q1, 2019,
https://www.mcafee.com/enterprise/en-us/assets/reports/rp-mobile-threat-report-2019.pdf
3. AV-Comparatives Security Survey, 2019,
https://www.av-comparatives.org/wp-content/uploads/2019/02/Security_Survey_2019_en.pdf
4. Amro, B.: Personal Mobile Malware Guard PMMG: a mobile malware detection technique based
on user’s preferences. IJCSNS International Journal of Computer Science and Network
Security, Vol. 18, No. 1, pp. 18–24 (2018)
5. McLaughlin, N., Martinez del Rincon, J., Kang, B., et al.: Deep android malware detection. In
Proc. of the Seventh ACM on Conference on Data and Application Security and Privacy, pp.
301–308 (2017)
6. Idrees, F., Rajarajan, M., Conti, M., Chen, T., Rahulamathavan, Y.: Pindroid: a novel android
malware detection system using ensemble learning methods. Computers & Security, Vol. 68, pp.
36–46 (2017)
7. Amro, B.: Personal Mobile Malware Guard PMMG: a mobile malware detection technique based
on user’s preferences. IJCSNS International Journal of Computer Science and Network
Security, Vol. 18, No. 1, pp. 18–24 (2018)
8. A. P. Felt, K. Greenwood, and D. Wagner, “The effectiveness of install-time permission systems
for third-party applications,” 2010.
9. B. P. Sarma, N. Li, C. Gates, R. Potharaju, C. Nita-Rotaru, and I. Molloy, “Android permissions:
perspective combining risks and benefits,” 2012.
10. Y. Zhou and X. Jiang, “Dissecting android malware: Characterization and evolution,” 2012.
11. V. Rastogi, Y. Chen, and X. Jiang, “Droidchameleon: evaluating android anti malware against
transformation attacks,” 2013.
12. G. Canfora, F. Mercaldo, and C. A. Visaggio, “A classifier of malicious android
applications,” 2013.

