FYP Report FINAL Done
Program: BS(CS)
Project Title:
PREDICTIVE MALWARE DEFENSE
USING MACHINE LEARNING
Submitted By:
TALHA KHAN 9651
RAO HUSNAIN 9919
(Name) (Enrollment No.)
Submission Date
20/06/22
(Date: DD/MM/YY)
In the name of Allah, the most Gracious and the Most Merciful.
Peace and blessing of Allah be upon Prophet Muhammadﷺ
First and foremost, we would like to express our gratitude to Allah for
giving us the opportunity, the strength, and the perseverance to
complete our FYP despite the challenges. We want to thank our supervisor, Mr.
Usman Shahid, for being a strong leader, an inspiration, and, most
importantly, for making a big contribution to our project. Additionally, we
are grateful to the project's experts, Mr. Syed Minhal Raza, Mr. Iftikhar,
Mr. Faisal Ahmed, and Dr. Umaima Hani Syeda, for allowing us to work on
it. We also want to thank our parents for their moral and material help, as
well as our friends for their constant support and affection. May Allah
bestow upon them all a rich recompense. Ameen.
DEDICATION
Revision History
1. Introduction
1.1 Definition
1.2 Purpose
2. Overall Description
3. System Diagram
4.2 Dataset
4.7 Characteristics
5. Test Cases
6. Algorithms Used
6.1 Random Forest
8. Implementation of ML Model
algorithm for finding dangerous code is Random Forest. If we increase the size
of the collection of files used to train the algorithms in the future, this
accuracy can be increased. Every algorithm also has a number of parameters that
can be evaluated with various values to improve accuracy.
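Trying several parameter values can be automated with a grid search. The following is a minimal sketch, assuming scikit-learn's GridSearchCV and a synthetic dataset from make_classification. The grid values are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the project's malware dataset is not used here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid: these values are illustrative only.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# Evaluate every parameter combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The combination stored in best_params is the one with the highest mean cross-validated accuracy on this toy data.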
1.4 Project Scope
Our aim is to classify the behavior of different malicious programs according to the system's specifications
and to predict the probability of malware attacks on a machine based on various features of its
hardware and software. Much research has been done in this field, but the accuracy of existing models is not
high enough for them to be used effectively. In this project we therefore try to improve those models, and we add
some updated features and models to predict malicious attacks.
2. Overall Description
2.1 Product Features
Systems that use machine learning to detect malware must
meet specific criteria in order to automate the analysis and lessen the workload of
human analysts. First, the feature extraction procedure requires detailed
information about the malware, but this data is difficult to obtain for
every malware instance. Second, for the feature extraction process to
function correctly, it must be scalable enough to handle the surge in malware
releases.
⮚ Python 3.7.9
⮚ Jupyter Notebook
⮚ TensorFlow 2.5.0
⮚ Keras 2.5.0
⮚ Spyder IDE
⮚ Python Data Visualization tools
⮚ Scikit-learn 1.0.1
2.4 Problem Statement
Using various operating system and software specifications as well as
hardware component specifications, it is possible to predict the
likelihood that a Windows system will be compromised by various viruses
at the time it is manufactured. In this project, we concentrate on several
facets of automated malware prediction based on machine learning and
deep learning. The initial phase of our project required us to
analyse the dataset's strongest features and to employ the
machine learning techniques that gave the highest prediction
accuracy. Jupyter Notebook and PyCharm are used to create the
models. The best results are obtained with the Gradient
Boosting, Random Forest, and Decision Tree classifiers. In order to
predict attacks, we must implement artificial neural networks in the
d. Phase 4: Saving the entire trained model into a pickle model file for later
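Saving a trained model to a pickle file can be sketched as follows. This is a minimal example, assuming a small scikit-learn decision tree trained on toy data. The file name model.pkl is a hypothetical placeholder, not the name used in this project.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Toy training data; the label simply follows the first feature.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Hypothetical file name, written to a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Serialize the fitted model to disk.
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Later (for example, in the web application), load it back.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

restored_predictions = list(restored.predict(X))
print(restored_predictions)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.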
[Diagram with the following labeled stages: Data Collection; Raw Dataset; Malware Dataset; Data Preprocessing (Data Cleansing, Feature Encoding); Visualizing Cleaned Data; Exploration of Important Features; Data Splitting into train & test; Coding Classifiers like Decision Tree, Random Forest & GBA; Visualizing the Classifiers through Confusion Matrix; Implementing an ANN and training the data using ANN; Prediction; Accuracy of ANN, Decision Tree, Random Forest & GBA]
Figure 2:- Sequence Diagram
4.4 DOS Header
The "MS-DOS Header" is used to identify file types that are
compatible with MS-DOS. All MS-DOS-compatible executable files
set its first field, the magic number, to 0x5A4D, which is stored on disk as the bytes
4D 5A, the ASCII characters "MZ". As a result, "MZ headers" is a common
nickname for MS-DOS headers. Using a hex editor, the MS-DOS header
can be seen to begin at offset 0.
4.5 DOS Stub
A notice stating that the program cannot be run in DOS mode is displayed
by the DOS stub, which is normally present in Windows executables. It may
be only a text message, but it can also be a complete DOS program.
While an application is being built on Windows, the linker takes the binary
file winstub.exe and adds it to the executable program. At offset 0x3c, the
file stores the offset of the PE signature, which points to the PE header section after the stub.
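The relationship between the MZ header, the value at offset 0x3c, and the PE signature can be sketched with Python's struct module. This is a minimal illustration on a synthetic byte buffer, not a real executable. The function name find_pe_signature is a hypothetical helper introduced for this example.

```python
import struct


def find_pe_signature(data: bytes) -> int:
    """Return the file offset of the PE signature, or raise ValueError."""
    # The MS-DOS header starts at offset 0 with the bytes "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (MS-DOS) header")
    # A 4-byte little-endian value at 0x3c holds the PE signature offset.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # The signature is the letters "PE" followed by two null bytes.
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return pe_offset


# Synthetic buffer: MZ magic, pointer at 0x3c, PE signature at 0x80.
sample = bytearray(0x84)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x80)
sample[0x80:0x84] = b"PE\0\0"

pe_offset = find_pe_signature(bytes(sample))
print(hex(pe_offset))
```

On a real file, the bytes between the end of the MS-DOS header and the returned offset contain the DOS stub.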
signature, a mandatory header, and the COFF file header. The COFF file
header and the optional header, which together make up the COFF object
file header, are followed by the section headers. The header
section defines a number of fundamental sub-sections, including:
4.7 Characteristics
a. Signature
This field contains only the signature, which makes the file simple for the
Windows loader to identify. It consists of the letters "PE" followed by two
null bytes.
b. Machine
This number identifies the type of the target machine. The
CPU type can be determined from the values of the Machine field shown
below. An image file can run only on the specified machine or on a system
that emulates the specified machine.
c. Number Of Sections
The number of sections. This indicates the size of the section table, which
immediately follows the headers.
h. Size of Image
The total size of the image in memory, in bytes, including all headers. It must
be a multiple of Section Alignment.
i. Section Alignment
The alignment, in bytes, of the sections once they have been loaded into memory.
It must be greater than or equal to the file alignment, and its default value is the
page size of the architecture.
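The fields described above sit at fixed positions relative to the PE signature, so they can be read with struct. The following is a minimal sketch on a synthetic buffer. The helper name read_pe_fields and the sample values are assumptions for illustration. The offsets used (20-byte COFF header, SectionAlignment at byte 32 and SizeOfImage at byte 56 of the optional header) follow the Microsoft PE format documentation.

```python
import struct


def read_pe_fields(data: bytes) -> dict:
    """Read a few header fields from a PE image held in memory."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4      # COFF file header follows the signature
    optional = coff + 20      # COFF file header is 20 bytes long
    machine, number_of_sections = struct.unpack_from("<HH", data, coff)
    (section_alignment,) = struct.unpack_from("<I", data, optional + 32)
    (size_of_image,) = struct.unpack_from("<I", data, optional + 56)
    return {
        "Machine": machine,
        "NumberOfSections": number_of_sections,
        "SectionAlignment": section_alignment,
        "SizeOfImage": size_of_image,
    }


# Synthetic image: only the fields read above are filled in.
buf = bytearray(0x80 + 4 + 20 + 64)
buf[0:2] = b"MZ"
struct.pack_into("<I", buf, 0x3C, 0x80)
buf[0x80:0x84] = b"PE\0\0"
struct.pack_into("<HH", buf, 0x84, 0x8664, 3)     # x64 machine type, 3 sections
struct.pack_into("<I", buf, 0x98 + 32, 0x1000)    # SectionAlignment
struct.pack_into("<I", buf, 0x98 + 56, 0x4000)    # SizeOfImage

fields = read_pe_fields(bytes(buf))
print(fields)
```

Values such as these can then be used as numeric features for a classifier.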
5. Test Cases
A test case is a group of conditions or guidelines that a tester will use to
determine if the system under test complies with requirements and
performs as intended. Writing test cases is also a way of
identifying problems in the requirements or in the requirement analysis.
Figure 5:- Decision Tree
6.3 Gradient boosting
Gradient Boosting is a powerful machine learning technique that is
commonly used in regression and classification problems. It works by
combining multiple weak prediction models, usually decision trees, to
form a more robust prediction model.
Figure 6:- Gradient Boosting
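The idea of combining weak models can be sketched from scratch. The example below is a simplified illustration, not the scikit-learn implementation used in this project. It assumes a one-dimensional regression problem with squared-error loss, where each weak model is a one-split decision stump fitted to the current residuals. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Build an additive model of stumps, each fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]
one_round = gradient_boost(xs, ys, rounds=1)
many_rounds = gradient_boost(xs, ys, rounds=30)
print(mean_squared_error(one_round, xs, ys),
      mean_squared_error(many_rounds, xs, ys))
```

Each stump alone is a poor predictor, but the training error shrinks as more stumps are added, which is the behavior the paragraph above describes.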
6.4 Ada boosting
The AdaBoost (Adaptive Boosting) algorithm is a type of Ensemble
Method used in Machine Learning. It aims to reduce bias and
variance in supervised learning by re-assigning weights to each
instance, with higher weights given to instances that were incorrectly
classified.
Figure 7:- Ada Boosting
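The re-weighting step can be illustrated with a single boosting round. This is a minimal sketch of the classic binary AdaBoost update, assuming labels of +1 and -1. The function name adaboost_reweight and the toy predictions are assumptions for illustration.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight and new instance weights."""
    total = sum(weights)
    # Weighted error rate of the current weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified instances are scaled up, correct ones scaled down.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # the last instance is misclassified

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After this round the misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.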
Figure 9:- Naïve Bayes
Figure 12:- Activation Functions
7 Implementation:
7.1 Importing the Data set and Modules
As far as modules are concerned, for app.py, i.e. for deploying the
model on Flask, we used the following modules and dependencies:
The first file was the model file, for which we imported the following dependencies:
The second part, after coding the tree classifier, was to extract the best-known
features from the dataset to get more accuracy:
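Tree-based feature selection of this kind can be sketched as follows. This is a minimal example, assuming scikit-learn's ExtraTreesClassifier and SelectFromModel on synthetic data. The choice of these two classes is an assumption, and the feature columns are generated rather than taken from the project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 20 features, only a few of them informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Fit a tree ensemble to obtain an importance score per feature.
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X)
kept = selector.get_support(indices=True)

print(X.shape[1], "features reduced to", X_selected.shape[1])
print("kept column indices:", list(kept))
```

The reduced matrix X_selected can then be passed to the classifiers in place of the full feature set.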
7.3 Implementing all ML Models:
We applied three algorithms in order to train the system. The first
machine learning algorithm we used was the random forest
classification algorithm, the second was the decision tree, and the last was the gradient
boosting algorithm. The main purpose was to train these
algorithms on the features extracted from the dataset.
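Training the three classifiers can be sketched as below. This is a minimal example, assuming scikit-learn and a synthetic dataset. The parameter values (such as n_estimators=50 and max_depth=8) are illustrative assumptions, and the printed scores do not reproduce the results of this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features and labels.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative settings only.
models = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=8, random_state=0
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, "train=%.3f test=%.3f" % scores[name])
```

Comparing the train and test columns for each model gives a quick check for overfitting.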
The parameter defined besides the number of estimators sets the maximum depth of
each tree. In a random forest classifier, each decision tree is a
single classifier, and the target prediction is based on
Here, the URL '/' rule is bound to the 'index.html' template. As a result, if a user visits the
http://localhost:5000/hello URL, the output of the
def upload():
Result.html
Result.html is also located in the templates folder. Here, we finally display our
result, that is, whether the file is malicious or not. This is done by using a simple if
statement that checks the value of the prediction variable.
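The upload route and the result page can be sketched together. The example below is a minimal, self-contained illustration that assumes Flask is installed. The templates are inlined as strings instead of files in the templates folder, the /upload route path is assumed, and stand_in_predict is a toy placeholder for the trained model, with the value 1 assumed to mean "malicious".

```python
import io

from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline stand-ins for index.html and Result.html.
INDEX_HTML = """<form action="/upload" method="post" enctype="multipart/form-data">
<input type="file" name="file"><input type="submit" value="Scan"></form>"""

RESULT_HTML = (
    "{% if prediction == 1 %}The file is malicious."
    "{% else %}The file is legitimate.{% endif %}"
)


def stand_in_predict(file_bytes):
    """Toy placeholder for the pickled model; not a real detector."""
    return 1 if b"EVIL" in file_bytes else 0


@app.route("/")
def index():
    return render_template_string(INDEX_HTML)


@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files["file"]
    prediction = stand_in_predict(uploaded.read())
    # The template branches on the value of the prediction variable.
    return render_template_string(RESULT_HTML, prediction=prediction)


# Exercise the route with Flask's test client instead of a live server.
client = app.test_client()
response = client.post(
    "/upload",
    data={"file": (io.BytesIO(b"EVIL payload"), "sample.bin")},
    content_type="multipart/form-data",
)
result_text = response.get_data(as_text=True)
print(result_text)
```

In a deployed application, the placeholder function would be replaced by feature extraction followed by a call to the loaded model's predict method.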
8 Results & Discussions
8.1 Accuracy on train & test Models
After training all the machine learning and deep learning models,
we calculated the accuracy of each algorithm on the training and
test sets. For the neural network we got 91.04% on training and 91.28% on
testing. For GBA we got an accuracy of
98.98% and 99.02%. For the decision tree we got training and testing accuracy
of 99.31% and 99.42%. Finally, for random forest we
got up to 99.99% on both sets, i.e. for training as well as for testing,
which is the best accuracy of all the models.
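Accuracy figures like these are the share of correct predictions, which can be read directly from a confusion matrix. The following is a minimal sketch in plain Python. The labels are made-up toy values, not the project's predictions, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Accuracy = correct predictions divided by all predictions."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Toy labels: 1 marks the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts))
```

Running the same calculation once on the training predictions and once on the test predictions gives the pair of figures reported for each model.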
8.3 Conclusion:
These days, anti-malware providers receive a very large number of suspected malware
files every day. Machine learning techniques, such as classification and
clustering, are used to group similar viruses in order to deal with this
influx of malware files. The amount of discriminative information contained in the
features retrieved from these files has a significant impact on
how well the machine learning approaches perform. The classification
and clustering methods are compelled to work in an open set scenario,
where they are presented with cases from never-before-seen families,
because it is challenging to get training samples from each malware
family. The malware classification and clustering methods we
suggested in this dissertation are capable of operating in an open set
environment. We worked on extracting useful malware features. With
different data sets and different properties for the ensemble of
machine learning algorithms, the research could be further expanded.
References:
⮚ https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
⮚ https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
⮚ https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
⮚ https://docs.microsoft.com/en-us/windows/win32/debug/pe-format
⮚ https://github.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing/blob/master/Chapter03/MalwareData.csv.gz
⮚ https://en.wikipedia.org/wiki/Supervised_learning
⮚ https://en.wikipedia.org/wiki/Unsupervised_learning
⮚ https://github.com/krishnaik06/Deployment-Deep-Learning-Model
⮚ https://www.lastline.com/blog/history-of-malware-its-evolution-and-impact/
⮚ https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
⮚ https://docs.python.org/2/library/pickle.html