LAKSHMI Documentation
Submitted by
VOLLA LAKSHMI
(22RH1D5807)
Under the Esteemed Guidance of
Dr.C.V.P.R. PRASAD
Professor
In partial fulfillment of the Academic Requirements for the Degree of
MASTER OF TECHNOLOGY
CERTIFICATE
This is to certify that the Technical Seminar Report entitled “AUTOMATED ANDROID
MALWARE DETECTION USING ML AND DEEP LEARNING ALGORITHMS FOR
CYBERSECURITY” is carried out by “VOLLA LAKSHMI (22RH1D5807)” in partial
External Examiner
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution-UGC,Govt. of India)
Accredited by NBA & NAAC with ‘A’ Grade
Permanently Affiliated to JNTUH, Hyderabad, Approved by AICTE-ISO 9001:2015
Certified, NIRF Indian Ranking 2020, Accepted by MHRD Govt. of India
AAA+ Rated by Careers 360 Magazine, Top Hundred Rank Band by Outlook, 3th Rank CSR
Maisammaguda, Dullapally (Post), Secunderabad, TELANGANA
DECLARATION
I hereby declare that the Technical Seminar Report entitled “AUTOMATED ANDROID
MALWARE DETECTION USING ML AND DEEP LEARNING ALGORITHMS FOR
CYBERSECURITY” submitted to Malla Reddy Engineering College for Women,
Secunderabad for the award of the Degree of Master of Technology in Computer Science
and Engineering is a result of original research work done by me.
It is declared that the Technical Seminar report has not been previously submitted to
any University or Institute for the award of Degree.
Being submitted by
VOLLA LAKSHMI
(22RH1D5807)
ACKNOWLEDGEMENT
I feel honored and privileged to offer my warm salutations to our college,
Malla Reddy Engineering College for Women, and to the Department of Computer Science and
Engineering, which gave me the opportunity to gain engineering expertise and profound
technical knowledge.
I would like to deeply thank our Honorable Minister of Telangana State, Sri. Ch.
Malla Reddy Garu, founder chairman of MRGI, the largest cluster of institutions in the state
of Telangana, for providing all the resources in the college to make this project a
success.
I wish to convey my gratitude to our Principal, Dr. Y. Madhavee Latha, for providing
the environment and means to enrich my skills, for motivating me in my endeavour,
and for helping me to realize my full potential.
I express my sincere gratitude to Dr. C.V.P.R. Prasad, Head of the Department of
Computer Science and Engineering, for inspiring me to take up a project on this subject and
successfully guiding me towards its completion.
I would also like to thank our Technical Seminar coordinator, Mr. G. Bhanu
Prasad, for his kind encouragement and overall guidance throughout this program,
which I acknowledge with profound gratitude.
I would like to thank my internal guide, Dr. --------------, and all the faculty
members for their valuable guidance and encouragement towards the completion of this
project work.
ABSTRACT
Our project focuses on addressing the critical issue of Android malware detection
through the utilization of machine learning (ML) and deep learning models. We begin by
performing extensive data preprocessing on a dataset containing phone logs, aiming to extract
meaningful information. Subsequently, we categorize the data into five distinct classes: benign,
SMS malware, riskware, banking malware, and adware. Through the application of various
machine learning algorithms such as K-Nearest Neighbors (KNN), Logistic Regression, Random
Forest, and Recurrent Neural Networks (RNN), our system effectively identifies and classifies
the malware within Android phones. This project not only aids in pinpointing malicious software
but also provides insights into the prevalence of different malware types within the dataset. Our
approach serves as a valuable tool for enhancing Android security and safeguarding users against
potential threats.
CONTENTS
Abstract
Contents
1. INTRODUCTION
2. SYSTEM ANALYSIS
3. SYSTEM STUDY
5. SYSTEM DESIGN
6. IMPLEMENTATION
7. TESTING
8. OUTPUT SCREENS
9. CONCLUSION
10. FUTURE SCOPE
11. REFERENCES
AUTOMATED ANDROID MALWARE DETECTION USING ML AND DL ALGORITHMS FOR CYBERSECURITY
1. INTRODUCTION
In the rapidly evolving landscape of mobile technology, Android devices have
become integral to our daily lives, serving as gateways to communication,
information, and services. However, the ubiquity of Android smartphones has
attracted the attention of malicious actors who exploit vulnerabilities for nefarious
purposes, leading to the rise of Android malware. The ever-growing sophistication
of these threats necessitates advanced detection mechanisms, and this project aims
to address this critical issue through the application of machine learning (ML) and
deep learning models.
The absence of a robust and scalable system for Android malware detection
hampers user security, leading to potential data breaches, financial losses, and
privacy violations. This project seeks to bridge this gap by leveraging the power of
machine learning and deep learning models to enhance the accuracy and efficiency
of Android malware detection.
The project will rigorously evaluate the performance of these models using defined
metrics and validation techniques to ensure robustness and generalization. Insights
into the prevalence and distribution of different malware types within the dataset
will provide valuable understanding. Emphasizing scalability and integration, we
plan to design the system to accommodate growing datasets and seamlessly
integrate with existing Android security frameworks. Ultimately, our project aspires
to contribute to the ongoing efforts in fortifying mobile security and offering users
and professionals an adaptive and efficient tool against evolving Android malware
threats.
2. SYSTEM ANALYSIS
3. SYSTEM STUDY
ECONOMICAL FEASIBILITY
Our project demonstrates strong economic feasibility, as it offers cost-effective
Android malware detection solutions. Leveraging existing hardware and open-
source software reduces development expenses, while potential savings from
preventing malware-related damages make it a financially viable investment.
A return on investment (ROI) analysis that accounts for ongoing costs and
regulatory compliance can determine the project's potential competitive advantage
and payback period, and a sensitivity analysis helps assess the impact of changing
assumptions. The decision to proceed should rest on a positive ROI, alignment with
strategic goals, and adaptability to evolving cybersecurity threats.
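The ROI, payback-period, and sensitivity calculations mentioned above can be sketched as follows. This is only an illustration: every monetary figure is a hypothetical placeholder rather than an estimate for this project, and the function names are invented for the example.

```python
def roi(total_benefit: float, total_cost: float) -> float:
    """Return on investment, expressed as a fraction of total cost."""
    return (total_benefit - total_cost) / total_cost


def payback_period_years(initial_cost: float, annual_net_benefit: float) -> float:
    """Years needed for the yearly net benefit to cover the initial cost."""
    if annual_net_benefit <= 0:
        raise ValueError("annual net benefit must be positive")
    return initial_cost / annual_net_benefit


# Hypothetical inputs (assumptions for illustration only).
initial_cost = 10_000.0        # one-time development cost
annual_running_cost = 2_000.0  # maintenance, compliance, retraining
annual_savings = 6_000.0       # avoided malware-related damage
years = 3

total_cost = initial_cost + annual_running_cost * years
total_benefit = annual_savings * years
project_roi = roi(total_benefit, total_cost)
payback = payback_period_years(initial_cost, annual_savings - annual_running_cost)

# Simple sensitivity analysis: vary the savings assumption by +/- 20 %.
sensitivity = {
    factor: roi(annual_savings * factor * years, total_cost)
    for factor in (0.8, 1.0, 1.2)
}

print(f"ROI: {project_roi:.1%}, payback: {payback:.1f} years")
print(sensitivity)
```

With these placeholder numbers the ROI turns negative when savings fall by 20 %, which is the kind of dependence on assumptions that a sensitivity analysis is meant to expose.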
TECHNICAL FEASIBILITY
The project exhibits high technical feasibility, given the availability of well-
established machine learning and deep learning frameworks. Accessible datasets
and ample online resources facilitate model development. Furthermore,
advancements in computational power and cloud services enhance the scalability
and implementation of the system.
Assessing technical feasibility involves evaluating data quality, algorithm suitability, computational resources,
feature engineering, scalability, integration, regulatory compliance, real-time
capabilities, maintenance, testing, and deployment. A successful implementation
hinges on addressing these technical challenges and ensuring effective model
training, deployment, and monitoring within the evolving cybersecurity landscape.
SOCIAL FEASIBILITY
Our project addresses social feasibility by contributing to enhanced Android
security. Protecting user privacy and personal data from malware threats aligns
with societal expectations for safer digital experiences. Public awareness
campaigns can further promote responsible smartphone usage.
Social feasibility also involves ensuring user acceptance, addressing ethical and privacy concerns,
promoting transparency and accountability, and managing public perception.
Compliance with regulations, societal impact, collaboration, accessibility, and
continuous improvement are critical factors in building trust and fostering the
acceptance of this technology in the cybersecurity landscape.
➢ Processor - Pentium IV
➢ Hard Disk - 20 GB
➢ Keyboard - Standard Windows Keyboard
➢ Mouse - Two or Three Button Mouse
➢ Monitor - SVGA
Python:
Python is a versatile, high-level programming language renowned for its simplicity and
readability. Its extensive library ecosystem and active community support make it an
ideal choice for our project.
Scikit-Learn (sklearn):
Scikit-Learn is a powerful machine learning library in Python that simplifies the
implementation of various machine learning algorithms.
Pandas:
Pandas is a data manipulation library that facilitates data preprocessing and analysis.
We rely on Pandas for efficient data handling, including cleaning, transformation, and
feature engineering.
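As an illustration of the kind of cleaning described above, the following minimal sketch removes duplicate rows and fills a missing value with the column median. The table and its column names (`duration`, `protocol`, `Class`) are hypothetical and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical log records; column names are illustrative assumptions.
raw = pd.DataFrame({
    "duration": [1.5, None, 3.0, 3.0],
    "protocol": ["tcp", "udp", "tcp", "tcp"],
    "Class": ["Benign", "Adware", "Riskware", "Riskware"],
})

clean = (
    raw.drop_duplicates()  # the last two rows are identical, so one is dropped
       .assign(duration=lambda d: d["duration"].fillna(d["duration"].median()))
       .reset_index(drop=True)
)
print(clean)
```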
Seaborn:
Seaborn is a Python data visualization library built on top of Matplotlib. It provides a
high-level interface for creating aesthetically pleasing statistical graphics.
Matplotlib:
Matplotlib is a versatile 2D plotting library in Python. We employ Matplotlib to create
various visualizations, including bar charts, line plots, and heatmaps, to convey project
insights and results effectively.
5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE
The architecture of an automated Android malware detection system using ML and
deep learning algorithms comprises data collection from diverse sources, data
preprocessing for feature extraction and labeling, and the selection and training
of ML and DL models. Around this core sit real-time monitoring, alert generation,
and user-friendly interfaces, supported by scalability, security measures, and
compliance with privacy regulations. Maintenance and ongoing improvement through
feedback loops keep the solution robust as threats evolve, so that it continues
to safeguard Android devices effectively.
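The preprocessing and model-training core of such an architecture could be wired together as shown in the sketch below. It uses scikit-learn's `Pipeline` on synthetic data generated by `make_classification`; the synthetic data, the step names, and the choice of a Random Forest are assumptions for illustration, not the deployed design.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted feature matrix (5 classes).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=5, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing and model are chained so that exactly the same steps
# run at training time and at prediction time.
detector = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
detector.fit(X_train, y_train)
print(f"Hold-out accuracy: {detector.score(X_test, y_test):.3f}")
```

Keeping the scaler inside the pipeline means it is fitted on training data only, which avoids leaking test-set statistics into the model.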
5.2 UML DIAGRAMS
Class Diagram:
A class diagram is a static diagram. It represents the static view of an application. A class
diagram is used not only for visualizing, describing, and documenting different aspects
of a system but also for constructing executable code of the software application.
[Class diagram: SYSTEM and USER]
SEQUENCE DIAGRAM
A sequence diagram is a type of interaction diagram in Unified Modeling Language
(UML) used to visualize the interactions and the order of messages exchanged
between objects or components in a system. It shows how objects or components
collaborate over time to achieve a particular functionality.
[Sequence diagram: SYSTEM and USER; steps: Input APIs, Data Preprocessing, Feature Extraction, Applying Algorithms, Metrics Evaluation]
Use Case
5.3 DATA FLOW DIAGRAM
A data flow diagram (DFD) maps out the flow of information for any process or system. It
uses defined symbols like rectangles, circles and arrows, plus short text labels, to show data
inputs, outputs, storage points and the routes between each destination. Data flowcharts
can range from simple, even hand-drawn process overviews, to in-depth, multi-level
DFDs that dig progressively deeper into how the data is handled. They can be used to
analyze an existing system or model a new one. Like all the best diagrams and charts,
a DFD can often visually “say” things that would be hard to explain in words, and they
work for both technical and nontechnical audiences, from developer to CEO.
6. IMPLEMENTATION
6.1 SYSTEM MODULES
Implementing the Android malware detection project involves several key steps, from data
preprocessing to model development and deployment.
Feature Extraction:
- Identify and extract features from the phone logs that indicate malware behavior.
- Transform categorical features into a suitable format for machine learning models.
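One common way to transform a categorical feature is one-hot encoding. The sketch below assumes a hypothetical `protocol` column; the column names and values are invented for illustration and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical log features; "protocol" is a categorical column.
logs = pd.DataFrame({
    "packets": [10, 250, 40],
    "protocol": ["tcp", "udp", "tcp"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
features = pd.get_dummies(logs, columns=["protocol"], dtype=int)
print(features.columns.tolist())
```

Each category becomes its own 0/1 column, so distance-based models such as KNN do not read an artificial ordering into the category codes.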
Dataset Labeling:
- Categorize the dataset into the predefined classes: benign, SMS malware, riskware,
banking malware, and adware.
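If the class names are stored as text, they can be mapped to integer codes before training. The sketch below uses scikit-learn's `LabelEncoder`; the sample labels are made up, and whether the project dataset stores names or numeric codes is not stated in this report.

```python
from sklearn.preprocessing import LabelEncoder

class_names = ["Benign", "SMS malware", "Riskware", "Banking", "Adware"]
sample_labels = ["Adware", "Benign", "Banking", "Adware", "Riskware"]

encoder = LabelEncoder().fit(class_names)
encoded = encoder.transform(sample_labels)

# classes_ is sorted, so code 0 is "Adware" and code 4 is "SMS malware".
print(list(encoder.classes_))
print(encoded.tolist())
```

Because the codes follow the sorted order in `encoder.classes_`, any axis labels on a later confusion matrix should be taken from that same attribute rather than typed by hand.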
Model Selection:
- Choose machine learning algorithms such as K-Nearest Neighbors, Logistic Regression,
and Random Forest for initial classification.
- Implement a Recurrent Neural Network (RNN) for deep learning-based classification.
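What distinguishes a recurrent network from a feed-forward one is a hidden state that is carried from one timestep to the next. The NumPy sketch below shows the forward pass of a single recurrent cell followed by a softmax over five classes. The weights are random and untrained, and all dimensions are hypothetical; it illustrates the mechanism only and is not the model used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 4, 8, 5  # hypothetical sizes

# Randomly initialised weights (untrained; for illustration only).
W_xh = rng.normal(scale=0.1, size=(n_features, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_classes))


def rnn_forward(sequence: np.ndarray) -> np.ndarray:
    """Run one sequence of shape (timesteps, n_features) through the cell
    and return class probabilities of shape (n_classes,)."""
    h = np.zeros(n_hidden)
    for x_t in sequence:
        # The hidden state h carries information from earlier timesteps.
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    logits = h @ W_hy
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


probs = rnn_forward(rng.normal(size=(6, n_features)))
print(probs.round(3), int(probs.argmax()))
```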
Model Training:
- Split the dataset into training and validation sets.
- Train the selected models and validate their performance on the validation set.
- Tune hyperparameters to optimize model performance.
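The split, train, and tune steps above can be sketched with scikit-learn's `GridSearchCV`. The data here is synthetic, and the parameter grid and split ratio are assumptions chosen for the example rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (3 classes).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 5-fold cross-validation on the training set for each candidate k.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
val_accuracy = search.score(X_val, y_val)  # refit best model, scored on held-out data
print(f"best k = {best_k}, validation accuracy = {val_accuracy:.3f}")
```

Tuning on cross-validation folds of the training set keeps the validation set untouched until the final check.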
Evaluation Metrics:
- Define and calculate evaluation metrics such as accuracy, precision, recall, and F1 score
to assess the effectiveness of each model.
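For a multi-class problem, precision, recall, and F1 must be averaged across classes. The sketch below uses macro averaging on a small made-up example; the averaging mode used in the project is not stated in this report, so `average="macro"` is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for three classes (invented for illustration).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Macro averaging weights every class equally, which makes weak performance on a rare malware class visible even when the overall accuracy looks high.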
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the plots below
import seaborn as sns  # used by the heatmaps below
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # used in the preprocessing step below
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
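# Class distribution. df holds the loaded dataset; label_counts is assumed to be
# the per-class row count (its definition is not shown in the listing).
label_counts = df['Class'].value_counts()
labels = label_counts.index.tolist()
counts = label_counts.tolist()
plt.bar(labels, counts)
plt.xlabel('Malware')
plt.ylabel('Counts')
plt.title('Malware Distribution in Dataset')
plt.show()
df.columns
df["Class"].unique()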
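# Data Preprocessing
X = df.drop(columns=['Class'])  # Features
y = df['Class']  # Target
# Hold-out split (the split is not shown in the listing; test_size and random_state are assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training data only, then apply it to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)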
# Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
# Random Forest
rfModel = RandomForestClassifier(n_estimators=300, random_state=42)
rfModel.fit(X_train, y_train)
y_pred = rfModel.predict(X_test)  # prediction step added; it is missing from the listing
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
conf = confusion_matrix(y_test, y_pred)
class_names = ['Adware', 'Banking', 'SMS malware', 'Riskware', 'Benign']
sns.heatmap(conf, cmap='YlGnBu', fmt='d', annot=True,
            xticklabels=class_names, yticklabels=class_names)
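# SVM
SVmodel = SVC(kernel='linear', C=1.0, random_state=42)
SVmodel.fit(X_train, y_train)
y_pred = SVmodel.predict(X_test)  # prediction step added; it is missing from the listing
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
conf = confusion_matrix(y_test, y_pred)
class_names = ['Adware', 'Banking', 'SMS malware', 'Riskware', 'Benign']
sns.heatmap(conf, cmap='YlGnBu', fmt='d', annot=True,
            xticklabels=class_names, yticklabels=class_names)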
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier(n_estimators=100, random_state=0)
abc.fit(X_train, y_train)
y_pred = abc.predict(X_test)  # prediction step added; it is missing from the listing
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
conf = confusion_matrix(y_test, y_pred)
class_names = ['Adware', 'Banking', 'SMS malware', 'Riskware', 'Benign']
sns.heatmap(conf, cmap='YlGnBu', fmt='d', annot=True,
            xticklabels=class_names, yticklabels=class_names)
# KNN
# A range of k values to try
k_values = [2, 3, 5, 7, 9, 11]
best_k = None
best_accuracy = 0
for k in k_values:
    knn_model = KNeighborsClassifier(n_neighbors=k)
    # Use cross-validation to evaluate the model
    scores = cross_val_score(knn_model, X_train, y_train, cv=5, scoring='accuracy')
    mean_accuracy = scores.mean()
    if mean_accuracy > best_accuracy:
        best_accuracy = mean_accuracy
        best_k = k
# Train the KNN model with the best k
best_knn_model = KNeighborsClassifier(n_neighbors=best_k)
best_knn_model.fit(X_train, y_train)
# Metric computation added (missing from the listing); weighted averaging is an assumption
y_pred = best_knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred)
knn_precision = precision_score(y_test, y_pred, average='weighted')
knn_recall = recall_score(y_test, y_pred, average='weighted')
knn_f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Best k: {best_k}")
print(f"K-Nearest Neighbors Classifier Accuracy: {knn_accuracy:.4f}")
print(f"K-Nearest Neighbors Classifier Precision: {knn_precision:.4f}")
print(f"K-Nearest Neighbors Classifier Recall: {knn_recall:.4f}")
print(f"K-Nearest Neighbors Classifier F1-Score: {knn_f1:.4f}")
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
conf = confusion_matrix(y_test, y_pred)
class_names = ['Adware', 'Banking', 'SMS malware', 'Riskware', 'Benign']
sns.heatmap(conf, cmap='YlGnBu', fmt='d', annot=True,
            xticklabels=class_names, yticklabels=class_names)
# Combined Result Visualization and Comparison
# Classifiers and their colors. The lr_*, svm_*, rf_* and nb_* scores are assumed
# to have been computed earlier; that code is not shown in the listing.
classifiers = ['Logistic Regression', 'SVM', 'Random Forest', 'KNN', 'Naive Bayes']
colors = ['blue', 'orange', 'green', 'red', 'purple']
# Metric scores, one entry per classifier
accuracy_scores = [lr_accuracy, svm_accuracy, rf_accuracy, knn_accuracy, nb_accuracy]
precision_scores = [lr_precision, svm_precision, rf_precision, knn_precision, nb_precision]
recall_scores = [lr_recall, svm_recall, rf_recall, knn_recall, nb_recall]
f1_scores = [lr_f1, svm_f1, rf_f1, knn_f1, nb_f1]
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
x = np.arange(len(metrics))
# Table of scores (rows = classifiers, columns = metrics). It is named metrics_df
# so that it does not overwrite the dataset held in df.
metrics_df = pd.DataFrame(
    dict(zip(metrics, [accuracy_scores, precision_scores, recall_scores, f1_scores])),
    index=classifiers,
)
# Heatmap
plt.figure(figsize=(10, 6))
sns.set(font_scale=1.2)
sns.heatmap(metrics_df, annot=True, cmap='YlGnBu', fmt='.4f')
plt.title("Model Evaluation Metrics Heatmap")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
metrics_df
# RNN
# Note: the network below is built from Dense (fully connected) layers only;
# it contains no recurrent layer.
import tensorflow as tf
from tensorflow import keras
model = keras.Sequential([
    keras.layers.Dense(32, activation=tf.nn.relu, input_shape=[120]),  # 120 input features
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(5)  # one output per class
])
optimizer = tf.keras.optimizers.RMSprop(0.0099)
# Mean squared error against 5 outputs assumes y_train is one-hot encoded with shape (n, 5)
model.compile(loss='mean_squared_error', optimizer=optimizer)
history = model.fit(X_train, y_train, epochs=500)
7. TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the
software system meets its requirements and user expectations and does not fail in an
unacceptable manner. There are various types of tests. Each test type addresses a specific testing
requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches
and internal code flow should be validated. It is the testing of individual software units of
the application. Unit tests ensure that each unique path of a business process performs
accurately to the documented specifications and contains clearly defined inputs and
expected results.
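A unit test in this sense can be sketched with Python's built-in `unittest` module. The helper `normalize_label` is a hypothetical function written only for this example; it is not part of the project code.

```python
import unittest


def normalize_label(label: str) -> str:
    """Hypothetical helper: map raw label text to a canonical class name."""
    canonical = {
        "benign": "Benign",
        "adware": "Adware",
        "riskware": "Riskware",
        "banking": "Banking",
        "sms malware": "SMS malware",
    }
    key = label.strip().lower()
    if key not in canonical:
        raise ValueError(f"unknown class label: {label!r}")
    return canonical[key]


class NormalizeLabelTest(unittest.TestCase):
    def test_valid_input(self):
        # Valid input with stray whitespace and mixed case.
        self.assertEqual(normalize_label("  ADWARE "), "Adware")

    def test_invalid_input(self):
        # Invalid input must be rejected, not silently passed through.
        with self.assertRaises(ValueError):
            normalize_label("spyware")
```

Saved in a file, these tests can be run with `python -m unittest`; each test covers one decision branch of the helper, with a defined input and expected result.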
Integration testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components
were individually satisfactory, as shown by successful unit testing, the combination of
components is correct and consistent.
Functional test
Functional tests provide systematic demonstrations that functions tested are available
as specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items: Valid Input, Invalid Input, Functions, Output, and System
Procedures.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests
a configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.
8. OUTPUT SCREENS
The figure below shows how the dataset is distributed across the malware classes, as a count
of samples per class.
Naïve Bayes:
The figure above shows the metric values of the Naïve Bayes classifier, broken down by
the classes Adware, Banking, SMS malware, Riskware, and Benign.
Random Forest:
The figure above shows the metric values of the Random Forest classifier, broken down by
the classes Adware, Banking, SMS malware, Riskware, and Benign. This classifier has the
highest accuracy of all the models used in this project.
Logistic Regression:
The figure above shows the metric values of the Logistic Regression classifier, broken down
by the classes Adware, Banking, SMS malware, Riskware, and Benign.
SVM:
The figure above shows the metric values of the Support Vector Machine classifier, broken
down by the classes Adware, Banking, SMS malware, Riskware, and Benign.
AdaBoost Classifier (ABC):
The figure above shows the metric values of the AdaBoost classifier, broken down by the
classes Adware, Banking, SMS malware, Riskware, and Benign. AdaBoost combines a
sequence of weak learners into a single boosted ensemble.
K-Nearest Neighbour:
The figure above shows the metric values of the k-nearest neighbour classifier, broken down
by the classes Adware, Banking, SMS malware, Riskware, and Benign.
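Per-class figures of this kind can be read directly off a confusion matrix. The sketch below uses a toy three-class matrix whose counts are invented for illustration; they are not results from this project.

```python
import numpy as np

# Toy confusion matrix (rows = true class, columns = predicted class).
# The counts are invented for illustration, not project results.
classes = ["Adware", "Banking", "Benign"]
conf = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])

true_positives = np.diag(conf)
recall = true_positives / conf.sum(axis=1)     # share of each true class found
precision = true_positives / conf.sum(axis=0)  # share of each prediction that is right
accuracy = true_positives.sum() / conf.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"overall accuracy={accuracy:.3f}")
```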
The figure above shows the overall performance metrics of all the classifiers together so
that the models can be compared. A bar graph is used as the visualization because it makes
the comparison easier to read.
9. CONCLUSION
In conclusion, our proposed project for Android malware detection represents a significant
advancement in the field of cybersecurity. Through rigorous experimentation with machine
learning and deep learning models, including Random Forest with a remarkable accuracy
rate of 94% and Recurrent Neural Networks (RNN) consistently achieving 90%, we have
achieved outstanding results in classifying and identifying malicious software in Android
devices.
The advantages of our system extend far beyond these impressive accuracy metrics. It offers
a robust defense mechanism for Android users, safeguarding their personal data, privacy,
and overall digital experience. By accurately detecting and preventing malware, our system
mitigates the risks associated with malicious software, such as data theft, financial fraud,
and compromised device performance.
Furthermore, the accuracy of our RNN model can potentially improve with repeated
training runs, which leaves room for ongoing optimization. This adaptability matters
because the landscape of malware threats keeps evolving, and it helps our system remain
resilient against new and emerging challenges.
In a world increasingly reliant on mobile technology, our project not only enhances Android
device security but also contributes to creating a safer digital environment for individuals
and organizations alike. The combination of machine learning and deep learning models
positions our system as a valuable asset in the fight against Android malware, offering peace
of mind and protection to users in an interconnected world.
10. FUTURE SCOPE
These hurdles are based on various stages of our work and may be gradually rectified in
the work to be undertaken in the future. Features declared mostly on the device are more durable
than the features specific to the applications and can therefore usually be used to automate malware
detection. The range of Android parameters for processing is rather large, and malware is difficult
to detect properly if the features are not extracted properly. The number of apps is still
increasing quickly. Malware apps can potentially be identified by combining such features with
methods based on AI or machine learning, such as deep learning, to make the detection more
sophisticated and to make it easier to identify and regulate the app prediction rate.
Over time, applications introduce new features with enhanced malware abilities, which
is why we would have to upgrade the system whenever the model's false positive rate (FPR)
increases after execution. The simplest explanation for why the model degrades
on evolved features is that our datasets are binary matrices extracted from
features that are currently implemented in these applications, not from features that will be
present in evolved apps in the coming years. With new features, we would have to reverse
engineer the apps and extract those features to form an updated dataset on which to retrain
these classifiers.
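The idea of upgrading the system when the false positive rate rises can be sketched as a simple monitoring check. The 5 % threshold, the function names, and the sample labels below are all hypothetical choices for illustration; the report does not specify a retraining policy.

```python
import numpy as np


def false_positive_rate(y_true, y_pred, benign_label="Benign") -> float:
    """Fraction of benign samples that were flagged as some malware class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    benign = y_true == benign_label
    if benign.sum() == 0:
        raise ValueError("no benign samples in y_true")
    false_positives = np.sum(benign & (y_pred != benign_label))
    return float(false_positives / benign.sum())


def needs_retraining(fpr: float, threshold: float = 0.05) -> bool:
    """Hypothetical policy: retrain once the FPR exceeds the threshold."""
    return fpr > threshold


# Made-up labels for a recent batch of apps.
y_true = ["Benign", "Benign", "Benign", "Benign", "Adware", "Riskware"]
y_pred = ["Benign", "Adware", "Benign", "Benign", "Adware", "Benign"]

fpr = false_positive_rate(y_true, y_pred)
print(f"FPR = {fpr:.2f}, retrain: {needs_retraining(fpr)}")
```

Here one of four benign apps is flagged as malware, so the check signals that the feature set and model should be refreshed.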
11. REFERENCES
[2] H. Wang, W. Zhang and H. He, "You are what the permissions told me! Android
malware detection based on hybrid tactics", J. Inf. Secur. Appl., vol. 66, May 2022.
[4] M. Ibrahim, B. Issa and M. B. Jasser, "A method for automatic Android malware
detection based on static analysis and deep learning", IEEE Access, vol. 10, pp.
117334-117352, 2022.
[6] P. Bhat and K. Dutta, "A multi-tiered feature selection model for Android malware
detection based on feature discrimination and information gain", J. King Saud Univ.-
Comput. Inf. Sci., vol. 34, no. 10, pp. 9464-9477, Nov. 2022.
[7] D. Wang, T. Chen, Z. Zhang and N. Zhang, "A survey of Android malware
detection based on deep learning", Proc. Int. Conf. Mach. Learn. Cyber Secur., pp. 228-
242, 2023.
[8] Y. Zhao, L. Li, H. Wang, H. Cai, T. F. Bissyandé, J. Klein, et al., "On the impact of
sample duplication in machine-learning-based Android malware detection", ACM
Trans. Softw. Eng. Methodol., vol. 30, no. 3, pp. 1-38, Jul. 2021.
[10] H.-J. Zhu, W. Gu, L.-M. Wang, Z.-C. Xu and V. S. Sheng, "Android malware
detection based on multi-head squeeze-and-excitation residual network", Expert Syst.
Appl., vol. 212, Feb. 2023.