4.1. Datasets
The proposed model was evaluated with four malware datasets: Malimg [10], Microsoft's BIG 2015 [11], MaleVis [12], and Malicia [16]. The first three datasets were used for training, and the fourth (Malicia) dataset was used for testing. The experiments were carried out with 1043 cleanware samples. These samples were collected from executable files (.exe) of the Windows operating system and checked using the VirusTotal portal. The malware families in the datasets used to evaluate the proposed malware detection method are given in Table 1. The number of samples per malware class varies across the datasets. The Malimg dataset contains 9339 malicious samples represented as grayscale images. Each malware sample in the dataset belongs to one of 25 malware classes.
The BIG 2015 dataset contains 21,741 malware samples, of which 10,868 form the training set and the remaining 10,873 form the test set. In our experiments, the training set samples are used for evaluation. Each malware file has an identifier and a class. The identifier is a hash value that uniquely identifies the file, while the class is one of nine distinct malware families. Each malware sample has two files, namely, .bytes and .asm. We use the .bytes files, which contain the raw hexadecimal code of the file, to generate malware images.
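The conversion from a .bytes file to an image can be illustrated with a minimal sketch. It assumes the usual BIG 2015 text layout, in which each line starts with an address followed by hexadecimal byte values and unknown bytes are written as "??". The function names, the mapping of "??" to 0, the near-square layout, and the nearest-neighbour resampling are illustrative assumptions, not the procedure confirmed by this paper.

```python
import math


def parse_bytes_text(text, unknown_value=0):
    """Parse the text of a .bytes file into a list of integers in 0..255.

    Assumption: the first token of each line is an address and is skipped;
    "??" marks an unknown byte and is mapped to `unknown_value`.
    """
    values = []
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:
            if token == "??":
                values.append(unknown_value)
            else:
                values.append(int(token, 16))
    return values


def to_square_image(values, size=64):
    """Lay the bytes out row by row and resample them to size x size.

    The grid width is floor(sqrt(n)); trailing bytes that do not fill a
    complete row are dropped. Resampling is nearest-neighbour.
    """
    if not values:
        raise ValueError("no byte values to convert")
    width = max(1, int(math.sqrt(len(values))))
    height = len(values) // width
    rows = [values[r * width:(r + 1) * width] for r in range(height)]
    return [
        [rows[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


# Hypothetical two-line excerpt in the .bytes layout.
sample = "00401000 56 8D 44 24\n00401010 ?? FF 00 10\n"
pixels = parse_bytes_text(sample)
image = to_square_image(pixels, size=4)
```

Each pixel value is one byte of the file, so the resulting grid can be treated directly as a grayscale image.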
The MaleVis dataset consists of 14,226 RGB byte images assigned to one of 26 families (25 malware + 1 cleanware). The Malicia dataset includes 9670 malware samples in 8 classes. The Malicia dataset is not used to train the proposed DenseNet model; it is held out to evaluate how well the model performs on previously unseen samples. The three training datasets contain completely different classes from those of the Malicia dataset.
Figure 4 illustrates the distribution of samples over classes for all four datasets.
4.2. Results and Discussion
Each dataset was randomly divided so that 70% of the overall samples were used for training and the remaining 30% were used for testing. The results were taken with 1043 cleanware samples and each of the three malware datasets. For the Malimg dataset with cleanware samples (9339 + 1043), the proposed malware detection system was trained on 7268 samples and tested on 3115 samples. For the BIG 2015 dataset with cleanware samples (10,868 + 1043), the model was trained on 8338 samples and tested on 3573 samples. For the MaleVis dataset, 9958 samples were used for training and 4268 for testing.
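The split sizes can be checked with a short sketch. It assumes that the test-set size is obtained by rounding 30% of the total to the nearest integer, which reproduces the reported BIG 2015 and MaleVis counts; the function names and the fixed seed are illustrative and not taken from the paper.

```python
import random


def split_counts(n_samples, test_fraction=0.3):
    """Return (n_train, n_test), rounding the test size to the nearest integer."""
    n_test = round(n_samples * test_fraction)
    return n_samples - n_test, n_test


def random_split(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of `items` and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_counts(len(shuffled), test_fraction)
    return shuffled[:n_train], shuffled[n_train:]


big2015_counts = split_counts(10868 + 1043)  # (8338, 3573)
malevis_counts = split_counts(14226)         # (9958, 4268)
train_ids, test_ids = random_split(range(10))
```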
The experiments were run on a Linux system with an Intel® Xeon® E3-1226 v3 CPU at 3.30 GHz × 4, 32 GB RAM, and an NVIDIA GM107GL Quadro K2200/PCIe/SSE2 GPU. The performance evaluations were carried out with the following hyperparameter settings: 100 epochs, learning rate 0.0001, and batch size 32. The proposed deep neural network model was implemented in Python using the Keras v0.1.1 deep learning library. The experiments were performed for input binary image sizes of 32 × 32 and 64 × 64. Images reshaped to 64 × 64 retained more information and gave better predictive accuracy.
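To give a sense of scale for these settings, the following sketch computes the number of mini-batches per epoch and the total number of weight updates implied by the reported batch size, epoch count, and training-set sizes. It assumes that the final, partially filled batch of each epoch is kept; the constant and function names are illustrative.

```python
import math

# Hyperparameters as reported in the text.
EPOCHS = 100
LEARNING_RATE = 1e-4
BATCH_SIZE = 32


def batches_per_epoch(n_train, batch_size=BATCH_SIZE):
    """Number of mini-batches per epoch, counting a final partial batch."""
    return math.ceil(n_train / batch_size)


def total_updates(n_train, epochs=EPOCHS, batch_size=BATCH_SIZE):
    """Total number of gradient updates over the whole training run."""
    return epochs * batches_per_epoch(n_train, batch_size)


# Reported training-set sizes.
malimg_batches = batches_per_epoch(7268)   # 228
big2015_batches = batches_per_epoch(8338)  # 261
malevis_batches = batches_per_epoch(9958)  # 312
malimg_updates = total_updates(7268)       # 22800
```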
Four types of outcomes are counted to assess classification predictions.
True Positive (TP): the prediction that an observation belongs to a class and it actually does belong to that class, i.e., a binary image that is classified as malware and is actually malware.
True Negative (TN): the prediction that an observation does not belong to a class and it actually does not belong to that class, i.e., a binary image that is classified as not malware (negative) and is actually not malware (negative).
False Positive (FP): the prediction that an observation belongs to a class and it actually does not belong to that class, i.e., a binary image that is classified as malware and is actually not malware (negative).
False Negative (FN): the prediction that an observation does not belong to a class and it actually does belong to that class, i.e., a binary image that is classified as not malware (negative) and is actually malware.
These four outcomes are presented in a confusion matrix to better describe the results of the proposed model. If there are $N$ classes, the confusion matrix is an $N \times N$ matrix $C$, with the true class on the left axis and the class assigned to an element with that true class on the top axis. Each element $C_{ij}$ of the matrix is the number of elements with actual class $i$ that are classified as belonging to class $j$.
The elements of the confusion matrix for each class $i$ are defined by
$$TP_i = C_{ii}, \qquad FP_i = \sum_{j \neq i} C_{ji}, \qquad FN_i = \sum_{j \neq i} C_{ij}, \qquad TN_i = \sum_{j \neq i} \sum_{k \neq i} C_{jk}.$$
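The per-class counts can be read off a multiclass confusion matrix as in the following minimal sketch. It assumes the orientation described above, with rows indexing the true class and columns the assigned class; the function name and the small 3 × 3 matrix are illustrative only.

```python
def per_class_counts(cm):
    """Return one dict of TP, FP, FN, TN per class for a square matrix.

    cm[i][j] is the number of samples of true class i assigned to class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    counts = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                         # true class i, assigned elsewhere
        fp = sum(cm[r][i] for r in range(n)) - tp    # assigned to i, true class differs
        tn = total - tp - fn - fp                    # everything not involving class i
        counts.append({"TP": tp, "FP": fp, "FN": fn, "TN": tn})
    return counts


# Hypothetical 3-class confusion matrix.
example_cm = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 0, 4],
]
example_counts = per_class_counts(example_cm)
```

For every class, the four counts sum to the total number of samples, which is a quick sanity check on the orientation of the matrix.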
Accuracy (Acc), Precision (Pr), Recall (Re), and F1 score are the four main classification metrics. Accuracy is the number of correct predictions divided by the total number of predictions. It is defined as
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}.$$
Precision is the number of correct positive outcomes divided by the number of positive outcomes predicted by the classifier.
$$Pr = \frac{TP}{TP + FP}.$$
Recall is the fraction of all actual positives that are correctly identified as positive.
$$Re = \frac{TP}{TP + FN}.$$
F1 score is the harmonic mean of precision and recall. It reflects both the classifier's precision (how many of the instances it flags are correct) and its robustness (it does not miss a substantial number of instances). It is given by
$$F1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}.$$
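The four metrics can be computed from the outcome counts as in the sketch below. The function name, the zero-division convention (returning 0.0 when a denominator is zero), and the example counts are assumptions made for illustration; they are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 score from outcome counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 90 TP, 95 TN, 5 FP, 10 FN.
example_metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

With these counts the accuracy is 185/200 = 0.925, the precision is 90/95, the recall is 90/100, and the F1 score is 180/195.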
The comparison results of Machine Learning (ML) and Deep Learning (DL) methods for malware detection are presented in Table 2 and Table 3, respectively. The proposed model is compared with various ML techniques such as K-Nearest Neighbor (KNN), Logistic Regression (LR), Naïve Bayes (NB), SVM, Decision Tree (DT), Random Forest (RF), and Adaboost. Malware detectors based on pretrained DL models such as CNN and its variants are used to analyze the efficiency of the proposed DenseNet-based malware detection method. The results obtained for the proposed model are better than those of the ML- and DL-based malware detection models on the three datasets. The proposed model obtained an accuracy of 98.23% on Malimg, 98.46% on BIG 2015, and 98.21% on MaleVis.
The generalization ability of the proposed method is assessed using the unseen Malicia dataset, which is not used to train the proposed DenseNet model and whose classes are completely different from those of the three training datasets. The comparison of the proposed method with the ML and DL methods on the unseen Malicia dataset is given in Table 4. The results on the unseen Malicia dataset show an accuracy of 89.48%, which is lower than the accuracies obtained on the datasets used for training.
Table 5 provides details about the time taken by the proposed model to train and test the binary samples. The proposed model is compared with malware detectors based on various DL methods in terms of computational efficiency. The results indicate that the proposed DenseNet-based malware detection model takes less time to train and test the samples than the other deep learning-based malware detection systems.
Table 6 compares the results of the proposed malware detection model with previous works on the four malware datasets (three training datasets and one unseen test dataset). On the Malimg dataset, the accuracy of the proposed model (98.23%) is slightly lower than the accuracy of the method by Roseline et al. (98.65%). On the BIG 2015, MaleVis, and Malicia datasets, the proposed model outperforms the existing methods in the literature.
Figure 5, Figure 6 and Figure 7 present the plots of training accuracy, test accuracy, and loss over the number of epochs for the proposed model with the Malimg, BIG 2015, and MaleVis datasets. The figures show that the accuracy rises and the loss decreases as the number of epochs increases.
The confusion matrices for the models trained on the three malware datasets along with the cleanware class are given in Figure 8, Figure 9 and Figure 10. For the Malimg dataset with 26 classes, the confusion matrix is a 26 × 26 matrix with the columns representing the actual class and the rows indicating the predicted class. The diagonal elements show the number of correctly classified samples, where the predicted class matches the actual class. The off-diagonal elements represent misclassified samples. The diagonal elements for all three datasets show higher values than the off-diagonal elements. Although the Simda class has few samples, most of the samples in that class were correctly classified by the proposed model.
A Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds, and it is used here to examine the performance of the proposed malware detection model. Figure 11, Figure 12 and Figure 13 show the ROC curves of the proposed model for the three training malware datasets. Each figure contains N ROC curves, one for each of the N classes. For instance, the Malimg dataset includes 26 classes. The graph shows 26 ROC curves, with the first curve representing the first class classified against the other 25 classes, the next ROC curve representing the second class classified against the rest of the classes, and so on. TPR is approximately one and FPR is close to zero on the curves for each class against every other class. The area under the curve is higher for all the classes of the Malimg and BIG 2015 malware datasets than for the MaleVis dataset. This indicates the strong classification performance of the proposed DenseNet-based malware detection model.
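The one-vs-rest construction of a single ROC curve can be sketched as follows. The sketch assumes that the model outputs a score for the chosen class for every sample; the function names, the trapezoidal area computation, and the toy scores are illustrative and not taken from the paper.

```python
def one_vs_rest_labels(true_classes, positive_class):
    """Label samples of `positive_class` as 1 and all other classes as 0."""
    return [1 if c == positive_class else 0 for c in true_classes]


def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative samples are required")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores for class "A" on five samples.
scores_a = [0.9, 0.8, 0.7, 0.3, 0.2]
labels_a = one_vs_rest_labels(["A", "A", "B", "A", "B"], "A")
auc_a = area_under_curve(roc_points(scores_a, labels_a))
```

Repeating this for every class yields one curve per class, and a curve that stays close to the top-left corner corresponds to an area close to one.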
The proposed malware detection system is effective and can produce improved results, as shown in Table 7. Any new malware that resembles these malware families is expected to be detected with similar accuracy because of the generalization property of the proposed model. If the new malware is completely unseen, i.e., a zero-day malware attack, the proposed system may fail to detect it. If such zero-day attacks accumulate, the performance of the proposed model could fall, and the resulting false alarms may indicate that the model needs to be retrained. The model would then be retrained with the new samples, similar to a top-up of the training set, so that it detects both the malware it has already been trained on and the newly seen malware. As a result, the proposed model would be able to keep up with malware evolution over time and to understand anti-malware evasion techniques.
Experiments were also conducted for binary classification (malware or cleanware) with the Malimg, BIG 2015, and MaleVis datasets. For each of the three datasets, 1000 samples were picked to form the malware class, while the other class contained the 1043 cleanware samples. The results were used to assess the performance of the proposed DenseNet-based malware detection system on the three binary datasets. The BIG 2015 binary dataset shows the highest detection accuracy, 97.72%. The accuracy for the Malimg binary dataset is 97.55%, and the accuracy for the MaleVis binary dataset is 96.81%. The other metrics, namely precision, recall, and F1 score, are similarly higher for BIG 2015 than for the other two binary datasets.