Towards explainable deep neural networks (xDNN)

P. Angelov and E. Soares

Neural Networks 130 (2020) 185–194
https://doi.org/10.1016/j.neunet.2020.07.010

Abstract
In this paper, we propose an elegant solution that directly addresses the bottlenecks of traditional deep learning approaches and offers an explainable internal architecture that can outperform the existing methods, requires very little computational resources (no need for GPUs) and needs only short training times (in the order of seconds). The proposed approach, xDNN, uses prototypes. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution, called typicality, as well as of the data density. This generative model is identified in a closed form and equates to the pdf, but is derived automatically and entirely from the training data, with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable. We tested it on challenging problems such as the classification of different lighting conditions for driving scenes (iROADS), object detection (Caltech-256 and Caltech-101), and SARS-CoV-2 identification via computed tomography scans (COVID CT-scans dataset). xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train, and offers an explainable classifier.

Keywords: Explainable AI; Interpretability; Prototype-based models; Deep learning
Fig. 2. Pre-training a traditional deep neural network (the weights of the network are optimized/trained). Using the transfer-learning concept, this architecture with its weights is used as a feature extractor (the last fully connected layer is taken as the feature vector).
Source: Adapted from Simonyan and Zisserman (2014).
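As a concrete illustration of this transfer-learning step, the sketch below extracts the 1 × 4096 output of the first fully connected layer of a pre-trained VGG-16 with PyTorch/torchvision; the weight identifier and the helper `extract_features` are our illustrative choices, not code from the paper:

```python
# Sketch (not the authors' code): use a pre-trained VGG-16 as a fixed
# feature extractor, keeping only the first fully connected layer.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Standard ImageNet preprocessing for VGG-16.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights="IMAGENET1K_V1")
vgg16.eval()
# Keep only the first fully connected layer of the classifier head,
# so the network outputs a 4096-dimensional feature vector.
vgg16.classifier = torch.nn.Sequential(*list(vgg16.classifier.children())[:1])

def extract_features(image_path: str) -> torch.Tensor:
    """Return the 1 x 4096 feature vector for one image (no gradients needed)."""
    img = Image.open(image_path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)   # shape: (1, 3, 224, 224)
    with torch.no_grad():
        return vgg16(batch)                # shape: (1, 4096)
```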
The training architecture of the proposed xDNN is composed of the following layers:

1. Features layer;
2. Density layer;
3. Typicality layer;
4. Prototypes layer;
5. MegaClouds layer.

1. Features layer: (defines the data space)
The features layer is the first phase of the proposed xDNN method. This layer is in charge of extracting a global feature vector from each image. It can be formed by more traditional 'handcrafted' methods such as GIST (Solmaz et al., 2013) or HOG (Mizuno et al., 2012). Alternatively, it can be formed by the fully connected layer (FCL) of a pre-trained convolutional neural network such as AlexNet (Krizhevsky et al., 2012), VGG–VD–16 (Simonyan & Zisserman, 2014) or Inception (Szegedy et al., 2015), or by a residual neural network such as ResNet (He et al., 2016) or Inception-ResNet (Szegedy et al., 2017). Using a pre-trained deep neural network allows automatic extraction of more abstract and discriminative high-level features. In this paper, the pre-trained VGG–VD–16 DCNN is employed for feature extraction. According to Ren et al. (2016), VGG–VD–16 has a simple structure and can achieve better performance than other pre-trained deep neural networks. The first fully connected layer of VGG–VD–16 provides a 1 × 4096-dimensional vector.

(a) The values are then standardized using Eq. (1):

$$\hat{x}_{i,j} = \frac{x_{i,j} - \mu(x_{i,j})}{\sigma(x_{i,j})} \quad (1)$$

where $\hat{x}$ denotes the standardized feature vector $x$ of the image $I$ ($x$ holds the values provided by the FCL); $i = 1, 2, \ldots, N$ denotes the time stamp or the ID of the image, and $j = 1, 2, \ldots, n$ indexes the features of the given $x$; in our case, $n = 4096$.

(b) The standardized values are normalized to bring them to the range [0, 1]:

$$\bar{x}_{i,j} = \frac{\hat{x}_{i,j} - \min_i(\hat{x}_{i,j})}{\max_i(\hat{x}_{i,j}) - \min_i(\hat{x}_{i,j})} \quad (2)$$

where $\bar{x}$ denotes the normalized feature vector. For clarity, in the rest of the paper we use $x$ instead of $\bar{x}$.

Initialization:
Meta-parameters for xDNN are initialized with the first observed data sample (image). The proposed algorithm works per class; therefore, all calculations are done for each class separately:

$$P \leftarrow 1; \quad \mu \leftarrow x_1 \quad (3)$$

where $\mu$ denotes the global mean of the data samples of the given class and $P$ is the total number of prototypes identified from the observed data samples (images).
Each class $C$ is initialized by the first data sample of that class:

$$C_1 \leftarrow x_1; \quad p_1 \leftarrow x_1; \quad Support_1 \leftarrow 1; \quad r_1 \leftarrow r^*; \quad \hat{I}_1 \leftarrow I_1 \quad (4)$$

where $p_1$ is the vector of features that describes the prototype $\hat{I}_1$ of $C_1$; $\hat{I}_1$ is the identified prototype; $Support_1$ is the corresponding support (number of members) associated with this prototype; and $r_1$ is the corresponding radius of the area of influence of $C_1$.

In this paper, we use $r^* = \sqrt{2 - 2\cos(30^\circ)}$, the same as Angelov and Gu (2019); the rationale is that two vectors whose angle is less than $\pi/6$ (30°) point in close/similar directions. That is, we consider two feature vectors to be similar if the angle between them is smaller than 30 degrees. Note that $r^*$ is data-derived, not a problem- or user-specific parameter; in fact, it can be defined without prior knowledge of the specific problem or data through the following Eq. (5):

$$d(x_i, p_i) = \left\| \frac{x_i}{\|x_i\|} - \frac{p_i}{\|p_i\|} \right\| \quad (5)$$

The proposed method is free from prior assumptions about the data distribution type, as well as about the random or deterministic nature of the data. In contrast, it empirically extracts the distribution from the data samples (images) bottom-up (Angelov & Gu, 2019). The prototypes are independent from each other; therefore, one can change the structure by adding a new prototype without influencing the already existing prototypes. In other words, the proposed xDNN is highly parallelizable and suitable for evolving forms of application, where new prototypes may be added (if the data pattern requires this). The proposed xDNN method is trained per class, forming a set of prototypes per class; therefore, all calculations are done for each class separately. Prototypes are the local peaks of the data density (and typicality) identified in the previous layers/stages of the algorithm from the images of the corresponding class, based on their feature vectors. The prototypes can be used to form linguistic logical IF ... THEN rules of the following form:

$$R_c: \text{IF } (I \sim \hat{I}_P) \text{ THEN } (\text{class } c)$$
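A hedged sketch of the per-class initialization (Eqs. (3)–(4)) and of the direction-aware distance of Eq. (5); the class and attribute names are ours, not the authors':

```python
import numpy as np

# r* = sqrt(2 - 2*cos(30 deg)) ~= 0.518: chord length between two unit vectors 30 degrees apart.
R_STAR = np.sqrt(2 - 2 * np.cos(np.radians(30)))

def direction_distance(x: np.ndarray, p: np.ndarray) -> float:
    """Eq. (5): distance between the unit-normalized vectors, i.e. a monotone
    function of the angle between x and p (direction-aware, scale-free)."""
    return float(np.linalg.norm(x / np.linalg.norm(x) - p / np.linalg.norm(p)))

class ClassModel:
    """Per-class meta-parameters of xDNN, initialized from the first image of
    the class as in Eqs. (3)-(4). Attribute names are our own."""
    def __init__(self, x1: np.ndarray, image_id):
        self.k = 1                      # samples seen so far (for recursive means)
        self.mu = x1.copy()             # global mean of the class (Eq. (3))
        self.P = 1                      # number of prototypes (Eq. (3))
        self.prototypes = [x1.copy()]   # p1 <- x1 (Eq. (4))
        self.supports = [1]             # Support1 <- 1
        self.radii = [R_STAR]           # r1 <- r*
        self.images = [image_id]        # I1: prototypes are actual images
```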
5. MegaClouds layer:
In the MegaClouds layer, the clouds formed by the prototypes in the previous layer are merged if the neighbouring prototypes have the same class label; in other words, they are merged if they belong to the same class. MegaClouds are used to facilitate human interpretability. Fig. 5 illustrates the formation of the MegaClouds.

Fig. 5. MegaClouds: Voronoi tessellation.

Rules in the MegaClouds layer have the following format:

$$R: \text{IF } (I \sim MC_1) \text{ OR } (I \sim MC_2) \text{ OR } \ldots \text{ THEN } (\text{class } c)$$

where $MC_1, MC_2, \ldots$ denote the MegaClouds of the given class.
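One possible realization of the merge step is sketched below; it assumes a precomputed list of neighbouring prototype pairs (e.g., Voronoi neighbours in the feature space) and uses a union-find structure, which is our illustrative choice rather than something prescribed by the paper:

```python
def merge_megaclouds(labels, neighbours):
    """labels: class label per prototype; neighbours: index pairs of adjacent
    (Voronoi-neighbouring) prototypes. Same-class neighbours are merged
    into one MegaCloud via union-find."""
    parent = list(range(len(labels)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in neighbours:
        if labels[i] == labels[j]:         # merge only within the same class
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(labels))]  # MegaCloud id per prototype
```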
The validation (classification) process of xDNN is composed of the following layers:

1. Features layer;
2. Similarity layer (density);
3. Local decision-making;
4. Global decision-making.

These are described in detail as follows:

1. Features layer:
The same as the features layer described for the training process.

2. Prototypes layer:
In this layer, the degrees of similarity to the nearest prototypes (per class) are extracted for each unlabelled (new/validation) data sample/image $I_i$, defined as follows:

$$S(x, p_i) = \frac{1}{1 + \frac{\|x - p_i\|^2}{\|\sigma\|_N^2}} \quad (15)$$

where $S$ denotes the similarity degree.

3. Local (per class) decision-making layer:
Local (per class) decision-making is calculated based on the 'winner-takes-all' principle and can be obtained by:

$$\lambda_c = \max_{j=1,2,\ldots,P}(S_j) \quad (16)$$
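As an illustration of the two decision layers, here is a minimal sketch of Eq. (15) and Eq. (16); treating the global decision-making layer as the arg-max over the per-class scores $\lambda_c$ is our reading of the architecture, and the container names are ours:

```python
import numpy as np

def similarity(x: np.ndarray, p: np.ndarray, sigma: np.ndarray) -> float:
    """Eq. (15): Cauchy-type similarity between sample x and prototype p,
    scaled by the (per-feature) spread sigma of the class."""
    return 1.0 / (1.0 + np.sum((x - p) ** 2) / np.sum(sigma ** 2))

def classify(x, class_prototypes, class_sigmas):
    """class_prototypes: dict mapping class -> (P, n) array of prototypes;
    class_sigmas: dict mapping class -> (n,) spread vector.
    Local layer (Eq. (16)): winner-takes-all within each class.
    Global layer: arg-max over the per-class winners."""
    lambdas = {c: max(similarity(x, p, class_sigmas[c]) for p in protos)  # Eq. (16)
               for c, protos in class_prototypes.items()}
    winner = max(lambdas, key=lambdas.get)   # global decision-making
    return winner, lambdas
```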
Table 1
Performance comparison: iRoads dataset.

Method                                  Accuracy   Time (s)   # Parameters
xDNN                                    99.59%     4.32       27
VGG–16 (Simonyan & Zisserman, 2014)     99.51%     836.28     Not reported
DRB (Angelov & Gu, 2019)                99.02%     2.95       521
SVM (Suykens & Vandewalle, 1999)        94.17%     5.67       Not reported
KNN (Bishop, 2006)                      93.49%     4.43       4656
Naive Bayes (Bishop, 2006)              88.35%     5.31       Not reported
Table 2
Performance comparison: Caltech-256 dataset.

Method                                  Accuracy
xDNN                                    75.41%
MSVM (Cao et al., 2019)                 70.18%
VGG–16 (Simonyan & Zisserman, 2014)     73.2%
VGG–19 (Simonyan & Zisserman, 2014)     70.62%
ResNet–101 (He et al., 2016)            75.14%
GoogLeNet (Szegedy et al., 2015)        72.42%
Softmax(7) (Zeiler & Fergus, 2014)      74.2%
Fig. 10. Final rule given by the proposed xDNN classifier for COVID-19 identification. In contrast to 'black box' approaches such as deep neural networks, the proposed approach provides highly interpretable rules which can be used by human experts for the early evaluation of patients suspected of SARS-CoV-2 infection.

Fig. 11. Final non-COVID rule given by the proposed explainable deep learning classifier.
Table 3
Performance comparison: Caltech-101 dataset.

Method                                    Accuracy
xDNN                                      94.31%
SPP–net (He et al., 2015)                 91.44%
ResNet–50 (He et al., 2016)               90.39%
CNN S TUNE-CLS (Chatfield et al., 2014)   88.35%
Zeiler & Fergus (2014)                    86.5%
VGG–16 (Simonyan & Zisserman, 2014)       90.32%
KNN (Bishop, 2006)                        85.65%
DT (Quinlan, 1986)                        54.42%

... amount of information and, therefore, we do not report these methods. Apart from them, to the best of our knowledge, there is no better result achieved on the Caltech data sets.

5.3. COVID CT-scan dataset

In this section we report the results obtained by the proposed xDNN classification approach when applied to the COVID CT-scan dataset (Zhao et al., 2020). The results presented in Table 4 compare the proposed algorithm with other state-of-the-art approaches, including a traditional 'black-box' deep neural network, support vector machines, etc.

Table 4
Performance comparison: COVID CT-scan dataset.

Method                              Accuracy   Precision   Recall   F1 score   AUC
xDNN                                88.6%      89.7%       88.6%    89.2%      88.6%
Baseline (Zhao et al., 2020)        84.7%      97.0%       76.2%    85.3%      82.4%
SVM (Suykens & Vandewalle, 1999)    80.5%      84.4%       83.5%    84.0%      79.7%
KNN (Bishop, 2006)                  83.9%      90.4%       82.4%    86.2%      84.3%
AdaBoost (Hastie et al., 2009)      83.9%      87.7%       83.5%    85.5%      84.0%
Naive Bayes (Bishop, 2006)          70.5%      77.0%       73.6%    75.3%      69.6%

The proposed xDNN classifier provided better results in terms of accuracy, recall, F1 score, and AUC. Moreover, the proposed approach also provided highly interpretable results that may be helpful for specialists (in this case, medical doctors). The proposed classifier identified 30 prototypes for non-COVID and 33 prototypes for COVID patients. The rules generated by the identified prototypes for COVID and non-COVID patients are illustrated in Figs. 10 and 11, respectively. The baseline approach (Zhao et al., 2020) is a deep neural network, which is a 'black box' (it offers no interpretability).

Using the proposed method, we extracted from the data linguistic IF ... THEN rules which involve actual images of both cases (COVID-19 and non-COVID), as illustrated in Figs. 10 and 11. Such transparent rules can be used in the decision-making process for early diagnostics of COVID-19 infection. Rapid detection of viral infection with high sensitivity may allow better control of the viral spread. Early diagnosis of COVID-19 is crucial for disease treatment and control.

Fig. 12 illustrates the evolving nature of the proposed approach. xDNN is able to continuously learn as new data are presented to it; therefore, no full re-training is required, due to its life-long learning architecture. On the contrary, the baseline approach (Zhao et al., 2020) is based on a deep neural network that requires full re-training for any new data sample, which can be very costly in terms of time, computational complexity and requirements for hardware and computer experts. xDNN continuously learns as new training data arrive. It can be observed that, with 478 training data samples, the proposed approach obtained better accuracy (84.56%) than the baseline approach (84.0%) achieved with 537 training data samples (Zhao et al., 2020). The baseline is a deep neural network that needs a large amount of training data to obtain high classification accuracy and, once trained, cannot be further improved unless fully re-trained. In contrast, the proposed approach can obtain higher performance using less training data, due to its prototype-based nature.

Fig. 12. The evolving nature of the proposed xDNN approach.
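To make the continual-learning behaviour illustrated in Fig. 12 concrete, the following sketch (reusing `ClassModel`, `direction_distance` and `R_STAR` from the earlier sketch) shows one simplified per-class update: a sample far from all existing prototypes becomes a new prototype, otherwise the nearest prototype's mean and support are updated recursively. This is our simplified reading, not the authors' exact density-based update rule:

```python
import numpy as np

def learn_one(model: "ClassModel", x: np.ndarray, image_id) -> None:
    """Absorb one new (preprocessed) sample of this class without re-training
    anything that already exists (simplified reading of the prototype layer)."""
    model.k += 1
    model.mu += (x - model.mu) / model.k     # recursive global mean update
    # Nearest prototype under the direction-aware distance of Eq. (5).
    dists = [direction_distance(x, p) for p in model.prototypes]
    j = int(np.argmin(dists))
    if dists[j] > model.radii[j]:
        # Outside every existing area of influence: x becomes a new prototype.
        model.prototypes.append(x.copy())
        model.supports.append(1)
        model.radii.append(R_STAR)
        model.images.append(image_id)
        model.P += 1
    else:
        # Inside the nearest prototype's area: update its mean and support.
        s = model.supports[j]
        model.prototypes[j] = (s * model.prototypes[j] + x) / (s + 1)
        model.supports[j] = s + 1
```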
Experiments have demonstrated that the proposed xDNN approach is able to produce highly accurate results, surpassing state-of-the-art methods on different challenging datasets. Moreover, xDNN presents highly interpretable results that can be expressed in the form of IF ... THEN logical rules, Voronoi tessellations, and/or typicality (an empirically derived form of the pdf) in a closed analytical form allowing further analysis. Because of its recursive, non-iterative and non-parametric form, it allows computationally very efficient implementations to be realized.

6. Conclusion

In this paper we propose a new method, the explainable deep neural network (xDNN), that directly addresses the bottlenecks of traditional deep learning approaches and offers an explainable internal architecture that can outperform the existing methods. The proposed xDNN approach requires very little computational resources (no need for GPUs) and short training times (in the order of seconds). The proposed approach, xDNN, is prototype-based. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution, called typicality, as well as of the data density. This generative model is identified in a closed form and equates to the pdf, but is derived automatically and entirely from the training data, with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable. Results for well-known benchmark data sets such as iRoads, Caltech-256, Caltech-101, and COVID CT-scan show that xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train, while offering an explainable classifier.

References

Angelov, P. (2012). Autonomous learning systems: From data streams to knowledge in real-time. John Wiley & Sons.
Angelov, P. P., & Gu, X. (2018). Deep rule-based classifier with human-level performance and characteristics. Information Sciences, 463, 196–213.
Angelov, P. P., & Gu, X. (2019). Empirical approach to machine learning. Springer.
Angelov, P. P., Gu, X., & Príncipe, J. C. (2017). A generalized methodology for data analysis. IEEE Transactions on Cybernetics, 48(10), 2981–2993.
Biehl, M., Hammer, B., & Villmann, T. (2013). Distance measures for prototype based classification. In International workshop on brain-inspired computing (pp. 100–116). Springer.
Biehl, M., Hammer, B., & Villmann, T. (2016). Prototype-based models in machine learning. Wiley Interdisciplinary Reviews: Cognitive Science, 7(2), 92–111.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Cao, J., Wang, M., Li, Y., & Zhang, Q. (2019). Improved support vector machine classification algorithm based on adaptive feature weight updating in the Hadoop cluster environment. PLoS ONE, 14(4).
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop (p. 178). IEEE.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. California Institute of Technology.
Hastie, T., Rosset, S., Zhu, J., & Zou, H. (2009). Multi-class AdaBoost. Statistics and its Interface, 2(3), 349–360.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hu, J., Lu, J., & Tan, Y.-P. (2015). Deep transfer metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 325–333).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Leng, J., Liu, Y., & Chen, S. (2019). Context-aware attention network for image recognition. Neural Computing and Applications, 31(12), 9295–9305.
Li, O., Liu, H., Chen, C., & Rudin, C. (2018). Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Thirty-second AAAI conference on artificial intelligence.
Liu, C., Bellec, G., Vogginger, B., Kappel, D., Partzsch, J., Neumärker, F., Höppner, S., Maass, W., Furber, S. B., & Legenstein, R. (2018). Memory-efficient deep learning on a SpiNNaker 2 prototype. Frontiers in Neuroscience, 12, 840.
Mizuno, K., Terachi, Y., Takagi, K., Izumi, S., Kawaguchi, H., & Yoshimoto, M. (2012). Architectural study of HOG feature extraction processor for real-time object detection. In 2012 IEEE workshop on signal processing systems (pp. 197–202). IEEE.
Nebel, D., Kaden, M., Villmann, A., & Villmann, T. (2017). Types of (dis-)similarities and adaptive mixtures thereof for improved classification learning. Neurocomputing, 268, 42–54.
Oyedotun, O. K., & Khashman, A. (2017). Prototype-incorporated emotional neural network. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3560–3572.
Qian, G., Zhang, L., & Wang, Y. (2019). Single-label and multi-label conceptor classifiers in pre-trained neural networks. Neural Computing and Applications, 31(10), 6179–6188.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Ren, S., He, K., Girshick, R., Zhang, X., & Sun, J. (2016). Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7), 1476–1481.
Rezaei, M., & Terauchi, M. (2013). Vehicle detection based on multi-feature clues and Dempster–Shafer fusion theory. In Pacific-Rim symposium on image and video technology (pp. 60–72). Springer.
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Saralajew, S., Holdijk, L., Rees, M., & Villmann, T. (2018). Prototype-based neural network layers: Incorporating vector quantization. arXiv preprint arXiv:1812.01214.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
Sejnowski, T. J. (2018). The deep learning revolution. MIT Press.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Soares, E., & Angelov, P. (2019). Novelty detection and learning from extremely weak supervision. arXiv preprint arXiv:1911.00616.
Soares, E., Angelov, P., Costa, B., & Castro, M. (2019). Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios. In 2019 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.
Solmaz, B., Assari, S. M., & Shah, M. (2013). Classifying web videos using a global video descriptor. Machine Vision and Applications, 24(7), 1473–1485.
Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5934–5938). IEEE.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818–833). Springer.
Zhao, J., Zhang, Y., He, X., & Xie, P. (2020). COVID-CT-Dataset: A CT scan dataset about COVID-19. arXiv preprint arXiv:2003.13865.
Zhuang, F., Cheng, X., Luo, P., Pan, S. J., & He, Q. (2015). Supervised representation learning: Transfer learning with deep autoencoders. In Twenty-fourth international joint conference on artificial intelligence.