Neural Networks 130 (2020) 185–194

Contents lists available at ScienceDirect

Neural Networks
journal homepage: www.elsevier.com/locate/neunet

Towards explainable deep neural networks (xDNN)



Plamen Angelov, Eduardo Soares
School of Computing and Communications, LIRA Research Centre, Lancaster University, Lancaster, LA1 4WA, UK

article info

Article history:
Received 5 December 2019
Received in revised form 11 May 2020
Accepted 6 July 2020
Available online 11 July 2020

Keywords:
Explainable AI
Interpretability
Prototype-based models
Deep-learning

abstract

In this paper, we propose an elegant solution that directly addresses the bottlenecks of traditional deep learning approaches and offers an explainable internal architecture that can outperform the existing methods while requiring very little computational resources (no need for GPUs) and short training times (in the order of seconds). The proposed approach, xDNN, uses prototypes. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution called typicality as well as of the data density. This generative model is identified in a closed form and equates to the pdf, but is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. We tested it on challenging problems such as the classification of different lighting conditions for driving scenes (iROADS), object detection (Caltech-256 and Caltech-101), and SARS-CoV-2 identification via computed tomography scans (COVID CT-scans dataset). xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train, and offers an explainable classifier.
© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Deep learning has demonstrated the ability to achieve highly accurate results in different application domains such as speech recognition (Xiong et al., 2018), image recognition (He et al., 2016), language translation (LeCun et al., 2015) and other complex problems (Goodfellow et al., 2016). It has attracted the attention of the media and the wider public (Sejnowski, 2018). It has also proven to be very valuable and efficient in automating the usually laborious and sometimes controversial pre-processing stage of feature extraction. The main criticism of deep learning is usually related to its 'black-box' nature and its requirements for a huge amount of labelled data, computational resources (GPU accelerators as a standard), long training times (hours), and high power and energy consumption (Rudin, 2019). Indeed, a traditional deep learning (e.g. convolutional neural network) algorithm involves hundreds of millions of weights/coefficients/parameters that require iterative optimization procedures. In addition, these hundreds of millions of parameters are abstract and detached from the physical nature of the problem being modelled. However, the automated way to extract them is very attractive in high-throughput applications of complex problems like image processing, where human expertise may simply be unavailable or very expensive.

Feature extraction is an important pre-processing stage, which defines the data space and may influence the level of accuracy the end result provides. Therefore, we consider this a very useful property of traditional deep learning and build on it, combined with another important recent result in the deep learning domain, namely transfer learning. This concept postulates that knowledge in the form of a model architecture learned in one context can be re-used and be useful in another context (Hu et al., 2015). Transfer learning helps to considerably reduce the amount of time used for training. Moreover, it may also help to improve the accuracy of the models (Zhuang et al., 2015).

Building on the two main achievements of deep learning, namely top accuracy combined with an automatic approach to feature extraction for complex problems such as image classification, we try to address its deficiencies, such as the lack of explainability (Rudin, 2019), the computational burden, the power and energy resources required, and the limited ability to self-adapt and evolve (Soares & Angelov, 2019). Interpretability and explainability are extremely important for high-stakes applications, such as autonomous cars, medical or court decisions, etc. For example, it is extremely important to know the reasons why a car took some action, especially if this car is involved in an accident (Doshi-Velez & Kim, 2017).

∗ Corresponding author.
E-mail addresses: p.angelov@lancaster.ac.uk (P. Angelov), e.almeidasoares@lancaster.ac.uk (E. Soares).

https://doi.org/10.1016/j.neunet.2020.07.010
0893-6080/© 2020 Elsevier Ltd. All rights reserved.

The state-of-the-art classifiers offer a choice between higher explainability at the price of lower accuracy, or vice versa (Fig. 1). Before deep learning (Schmidhuber, 2015), machine learning and pattern recognition required substantial domain expertise to model a feature extractor that could transform the raw data into a feature vector which defines the data space within which the learning subsystem could detect or classify data patterns (LeCun et al., 2015). Deep learning offers a new way to extract abstract features automatically. Moreover, pre-trained structures can be reused for different tasks through the transfer learning technique (Hu et al., 2015). Transfer learning helps to considerably reduce the amount of time used for training; moreover, it may also help to improve the accuracy of the models (Zhuang et al., 2015). In this paper, we propose a new approach, xDNN, that offers a high level of explainability combined with top accuracy.

Fig. 1. Trade-off between accuracy and explainability.

The proposed approach, xDNN, offers a new deep learning architecture that combines reasoning and learning in a synergy. It is based on prototypes and the data density (Angelov & Gu, 2019) as well as typicality, an empirically derived pdf (Angelov et al., 2017). It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. We tested it on some well-known benchmark data sets such as iRoads (Rezaei & Terauchi, 2013) and Caltech-256 (Griffin et al., 2007), and xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train; moreover, it offers an explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record (He et al., 2015).

The remainder of this paper is organized as follows: The next section introduces a brief literature review. The proposed explainable deep learning approach is presented in Section 3. The data employed in the analysis is presented in Section 4, and the results are presented in Section 5. The discussion is presented in the last section of this paper.

2. Brief literature review

Deep Neural Networks have often been designed purely for accuracy. The decisions made by these networks are at best interpreted by post hoc techniques (Li et al., 2018) or not interpreted at all. That is, the first step is the selection of the network architecture by the human, and the attempt to interpret the trained model and the learned high-level features follows. Therefore, the post hoc interpretability analysis requires a separate modelling effort (Saralajew et al., 2018) and is an approximation rather than a deep explanation of the cause–effect relations and reasoning. One of the problems with the post hoc approach is that the explanations can change for different models used. In other words, it is easy to create multiple conflicting yet convincing explanations for how the network would classify a single object.

Prototype-based classifiers follow a reasoning process that does not rely on post hoc analysis (Biehl et al., 2016). They rely on the similarity (proximity in the feature space) of a data sample to a given prototype (Biehl et al., 2013, 2016). Different works attach different meanings to the word ''prototype'' (Biehl et al., 2013, 2016; Saralajew et al., 2018); in our case we consider prototypes to be the most representative data samples of the training set (the data samples which have local peaks of the density (Angelov & Gu, 2019)). In other cases, a prototype can be considered as a convex combination of several observations, and is not necessarily required to be close to any data sample of the training set or even to be feasible (Liu et al., 2018; Oyedotun & Khashman, 2017).

Our work is closely aligned with other prototype classification techniques in machine learning. Prototype classification is a classical form of case-based reasoning (Li et al., 2018); however, as Li et al. (2018) use neural networks, the distance between prototypes and observations is measured in a latent space. Li et al. (2018) use an autoencoder to create a latent low-dimensional space, and distances to prototypes are computed in that latent space. In other works, the Euclidean distance calculation is expressed in terms of convolution operations in the neural network sense (Biehl et al., 2013; Nebel et al., 2017). This and the computation of the Euclidean distance in terms of a dot product are essential steps towards efficient computational schemes for prototype-based neural network layers.

In contrast, the proposed method uses local densities and global multivariate generative distributions based on an empirically derived form of the probability distribution function (Angelov & Gu, 2019). Furthermore, differently from other prototype-based classifiers, the presented method is non-iterative and non-parametric, as it uses recursive calculations and no search procedures. Moreover, the proposed algorithm can learn continuously without full re-training; it can also benefit from the transfer learning technique (see Fig. 2).

3. Explainable deep neural network

3.1. Architecture and training of the proposed xDNN

The proposed explainable deep neural network (xDNN) classifier is formed of several layers with a very clear semantic and functional meaning. In addition to the internal clarity and transparency, it also offers a set of prototype-based IF . . . THEN rules that is very clear from the user point of view. Prototypes are selected data samples (images) that the user can easily view and understand, and whose similarity to other validation images the user can appreciate. xDNN offers a synergy between statistical learning and reasoning, bringing both together. In most of the other approaches there is a dichotomy and a preference of one over the other. We advocate and demonstrate that both learning and reasoning can work together in a synergy and produce very impressive results. Indeed, the proposed xDNN method outperforms all published results (Angelov & Gu, 2018; He et al., 2015; Rezaei & Terauchi, 2013) in terms of accuracy. Moreover, in terms of time for training, computational simplicity, and the low power and energy required, it is also far ahead. The proposed approach can be described as a feedforward neural network which has an incremental learning algorithm that autonomously self-develops and evolves its structure, adding new prototypes to reflect the possibly changing (dynamically evolving) data pattern (Soares & Angelov, 2019). As shown in Fig. 3, xDNN is composed of the following layers:

Fig. 2. Pre-training a traditional deep neural network (weights of the network are being optimized/trained). Using the transfer learning concept, this architecture with its weights is used as a feature extractor (the last fully connected layer is considered as a feature vector).
Source: Adapted from Simonyan and Zisserman (2014).

Fig. 3. xDNN training architecture (per class).

1. Features layer;
2. Density layer;
3. Typicality layer;
4. Prototypes layer;
5. MegaClouds layer;

1. Features layer: (Defines the data space)
The features layer is the first phase of the proposed xDNN method. This layer is in charge of extracting a global feature vector from the images. This first layer can be formed by more traditional 'handcrafted' methods such as GIST (Solmaz et al., 2013) or HoG (Mizuno et al., 2012). Alternatively, it can be formed by the fully connected layer (FCL) of pre-trained convolutional neural network approaches such as AlexNet (Krizhevsky et al., 2012), VGG–VD–16 (Simonyan & Zisserman, 2014), and Inception (Szegedy et al., 2015), residual neural networks such as Resnet (He et al., 2016) or Inception-Resnet (Szegedy et al., 2017), etc. Using a pre-trained deep neural network approach allows automatic extraction of more abstract and discriminative high-level features. In this paper, the pre-trained VGG–VD–16 DCNN is employed for feature extraction. According to Ren et al. (2016), VGG–VD–16 has a simple structure and it can achieve a better performance in comparison with other pre-trained deep neural networks. The first fully connected layer from VGG–VD–16 provides a 1 × 4096 dimensional vector.

(a) The values are then standardized using Eq. (1):

\[ \hat{x}_{i,j} = \frac{x_{i,j} - \mu(x_{i,j})}{\sigma(x_{i,j})} \tag{1} \]

where x̂ denotes the standardized feature vector x of the image I (x are the values provided by the FCL), i = 1, 2, . . . , N denotes the time stamp or the ID of the image, and j = 1, 2, . . . , n indexes the features of the given x; in our case n = 4096.

(b) The standardized values are normalized to bring them to the range [0;1]:

\[ \bar{x}_{i,j} = \frac{\hat{x}_{i,j} - \min_i(\hat{x}_{i,j})}{\max_i(\hat{x}_{i,j}) - \min_i(\hat{x}_{i,j})} \tag{2} \]

where x̄ denotes the normalized value of the feature vector. For clarity, in the rest of the paper we will use x instead of x̄.

Initialization:
Meta-parameters for the xDNN are initialized with the first observed data sample (image). The proposed algorithm works per class; therefore, all the calculations are done for each class separately.

\[ P \leftarrow 1; \quad \mu \leftarrow x_i; \tag{3} \]

where µ denotes the global mean of data samples of the given class. P is the total number of the identified prototypes from the observed data samples (images).
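The two pre-processing steps of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the mean, standard deviation, minimum and maximum are assumed to be taken per feature (column) over all images, the function name `preprocess_features` and the guard `eps` against constant features are hypothetical, and random numbers stand in for the 4096-dimensional FCL outputs.

```python
import numpy as np

def preprocess_features(X, eps=1e-12):
    """Standardize (Eq. 1), then min-max normalize (Eq. 2), a feature matrix.

    X has shape (N, n): one row per image, one column per feature.
    Assumption: all statistics are computed per feature over the N images.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_hat = (X - mu) / np.maximum(sigma, eps)         # Eq. (1)
    lo = X_hat.min(axis=0)
    hi = X_hat.max(axis=0)
    X_bar = (X_hat - lo) / np.maximum(hi - lo, eps)   # Eq. (2)
    return X_bar

# Random stand-in for extracted features (5 images, 8 features).
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))
normalized = preprocess_features(features)
```

After this step every feature column spans the range [0, 1], which is the form the later layers assume.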

Each class C is initialized by the first data sample of that class:

\[ C_1 \leftarrow x_1; \quad p_1 \leftarrow x_1; \quad Support_1 \leftarrow 1; \quad r_1 \leftarrow r; \quad \hat{I}_1 \leftarrow I_1 \tag{4} \]

where p_1 is the vector of features that describes the prototype Î of C_1; Î is the identified prototype; Support_1 is the corresponding support (number of members) associated with this prototype; r_1 is the corresponding radius of the area of influence of C_1.

In this paper, we use \( r^{*} = \sqrt{2 - 2\cos(30^{\circ})} \), the same as Angelov and Gu (2019); the rationale is that two vectors for which the angle between them is less than π/6 or 30° are pointing in close/similar directions. That is, we consider that two feature vectors are similar if the angle between them is smaller than 30 degrees. Note that r∗ is data derived, not a problem- or user-specific parameter. In fact, it can be defined without prior knowledge of the specific problem or data through the following Eq. (5).

\[ d(x_i, p_i) = \left\| \frac{x_i}{\|x_i\|} - \frac{p_i}{\|p_i\|} \right\| \tag{5} \]

2. Density layer:
The density layer defines the mutual proximity of the images in the data space defined by the features from the previous layer. The data density, if the Euclidean form of distance is used, has a Cauchy form (cf. Eq. (15)) (Angelov & Gu, 2019):

\[ D(x_i) = \frac{1}{1 + \frac{\|x_i - \mu_N\|^2}{\|\sigma_N\|^2}} \tag{6} \]

where D is the density, µ is the global mean, and σ is the variance. The reason it is Cauchy is not arbitrary (Angelov & Gu, 2019). It can be demonstrated theoretically that if Euclidean or Mahalanobis types of distances in the feature space are considered, the data density reduces to the Cauchy type referred to in Eq. (15). Density can also be updated online (Angelov, 2012):

\[ D(x_i) = \frac{1}{1 + \|x_i - \mu_i\|^2 + \Sigma_i - \|\mu_i\|^2} \tag{7} \]

where µ_i and the scalar product, Σ_i, can be updated recursively as follows:

\[ \mu_i = \frac{i-1}{i}\,\mu_{i-1} + \frac{1}{i}\,x_i \tag{8} \]

\[ \Sigma_i = \frac{i-1}{i}\,\Sigma_{i-1} + \frac{1}{i}\,\|x_i\|^2, \qquad \Sigma_1 = \|x_1\|^2 \tag{9} \]

Data samples (images) that are closer to the global mean have higher density values. Therefore, the value of the data density indicates how strongly a particular data sample is influenced by other data samples in the data space due to their mutual proximity.

3. Typicality layer:
Typicality is an empirically derived form of probability distribution function (pdf). Typicality τ is given by Eq. (10). The value of τ even at the point x = p_i is much less than 1; the integral \( \int_{-\infty}^{\infty} \tau \, dx = 1 \) (Angelov & Gu, 2019).

\[ \tau(x_i) = \frac{\sum_{i=1}^{c} Support_i \, D_j(x_i)}{\sum_{i=1}^{c} Support_i \int_{-\infty}^{\infty} D_j(x_i)\, dx} \tag{10} \]

4. Prototypes layer:
The prototypes identification layer is the core of the proposed xDNN classifier. This layer is responsible for providing the clearly explainable model. The xDNN classifier is free from prior assumptions about the data distribution type, as well as the random or deterministic nature of the data. In contrast, it empirically extracts the distribution from the data samples (images) bottom up (Angelov & Gu, 2019). The prototypes are independent from each other. Therefore, one can change the structure by adding a new prototype without influencing the other already existing prototypes. In other words, the proposed xDNN is highly parallelizable and suitable for evolving forms of application where new prototypes may be added (if the data pattern requires this). The proposed xDNN method is trained per class, forming a set of prototypes per class. Therefore, all the calculations are done for each class separately. Prototypes are the local peaks of the data density (and typicality) identified in the previous layers/stages of the algorithm from the images of the corresponding class based on their feature vectors. The prototypes can be used to form linguistic logical IF . . . THEN rules of the following form:

R_c : IF (I ∼ Î_P) THEN (class c)

where ∼ stands for similarity, which can also be seen as a fuzzy degree of membership; p is the identified prototype; P is the number of identified prototypes; c is the class, c = 1, 2, . . . , C; and I denotes an image.

One rule per prototype can be formed. All rules per class can be combined together using logical OR, also known as disjunction or S-norm:

R_c : IF (I ∼ Î_1) OR (I ∼ Î_2) OR . . . OR (I ∼ Î_P) THEN (class c)

Fig. 4 illustrates the area of influence of the identified prototypes. These areas around the identified prototypes are called data clouds (Angelov & Gu, 2019). Thus, each prototype defines a data cloud.

We call all data points associated with a prototype data clouds, because their shape is not regular (e.g., hyper-spherical, hyper-ellipsoidal, etc.) and the prototype is not necessarily the statistical and geometric mean, but an actual image (Angelov & Gu, 2019). The algorithm absorbs the new data samples one by one by assigning them to the nearest (in the feature space) prototype:

\[ j^{*} = \underset{j=1,2,\ldots,P}{\operatorname{argmin}} \left( \|x_i - p_j\|^2 \right) \tag{11} \]

In case the following condition (Angelov & Gu, 2019) is met:

\[ \text{IF } \Big( D(x_i) \ge \max_{j=1,2,\ldots,P} D(p_j) \Big) \text{ OR } \Big( D(x_i) \le \min_{j=1,2,\ldots,P} D(p_j) \Big) \text{ THEN (add a new data cloud } (P \leftarrow P + 1)) \tag{12} \]

It means that x_i is out of the influence area of p_j. Therefore, the vector of features x_i becomes a new prototype of a new data cloud with meta-parameters initialized by Eq. (13).

Add a new data cloud:

\[ P \leftarrow P + 1; \quad C_P \leftarrow x_i; \quad p_P \leftarrow x_i; \quad Support_P \leftarrow 1; \quad r_P \leftarrow r_o; \quad \hat{I}_P \leftarrow I_i \tag{13} \]

Otherwise, data cloud parameters are updated online by Eq. (14). It has to be stressed that all calculations per data cloud are performed on the basis of data points associated with a certain data cloud only (i.e. locally, not globally, on the basis of all data points).

\[ \begin{aligned}
C_{j^{*}} &\leftarrow C_{j^{*}} + 1; \\
p_{j^{*}} &\leftarrow \frac{Support_{j^{*}}}{Support_{j^{*}} + 1}\, p_{j^{*}} + \frac{Support_{j^{*}}}{Support_{j^{*}} + 1}\, x_i; \\
Support_{j^{*}} &\leftarrow Support_{j^{*}} + 1; \\
r_{j^{*}}^{2} &\leftarrow \frac{r_{j^{*}}^{2} + (1 - \|p_{j^{*}}\|^{2})}{2}.
\end{aligned} \tag{14} \]
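The recursive density computation of Eqs. (7)–(9) can be illustrated with a small NumPy sketch. The class name `RecursiveDensity` and its attributes are hypothetical, and the random samples merely stand in for normalized feature vectors; the point is that the running mean and the running mean of squared norms are enough to evaluate the density without storing past samples.

```python
import numpy as np

class RecursiveDensity:
    """Running mean (Eq. 8) and mean squared norm (Eq. 9) for the density of Eq. (7)."""

    def __init__(self, first_sample):
        x = np.asarray(first_sample, dtype=float)
        self.i = 1                    # number of samples seen so far
        self.mu = x.copy()            # mu_1 = x_1
        self.sq_norm = float(x @ x)   # Sigma_1 = ||x_1||^2

    def update(self, x):
        """Absorb one new sample and refresh mu_i and Sigma_i."""
        x = np.asarray(x, dtype=float)
        self.i += 1
        w = (self.i - 1) / self.i
        self.mu = w * self.mu + x / self.i                        # Eq. (8)
        self.sq_norm = w * self.sq_norm + float(x @ x) / self.i   # Eq. (9)

    def density(self, x):
        """Density of Eq. (7) at any point x, using the current mu_i and Sigma_i."""
        x = np.asarray(x, dtype=float)
        diff = x - self.mu
        spread = self.sq_norm - float(self.mu @ self.mu)
        return 1.0 / (1.0 + float(diff @ diff) + spread)

# Random stand-in data: 50 samples with 4 features each.
rng = np.random.default_rng(1)
samples = rng.random((50, 4))
stats = RecursiveDensity(samples[0])
for sample in samples[1:]:
    stats.update(sample)
density_at_mean = stats.density(stats.mu)
density_far_away = stats.density(stats.mu + 10.0)
```

As expected from the text, a point at the global mean receives a higher density than a point far away from it.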
The xDNN learning procedure can be summarized by the following algorithm.

xDNN: Learning Procedure

1: Read the first feature vector sample xi representing the image Ii of the class c;
2: Set i ← 1; n ← 1; P1 ← 1; p1 ← xi; µ ← x1; Support ← 1; r1 ← r0; Î1 ← I1;
3: FOR i = 2, ...
4:    Read xi;
5:    Calculate D(xi) and D(pj) (j = 1, 2, ..., P) according to Eqs. (7)–(9);
6:    IF Eq. (12) holds
7:       Create rule according to Eq. (13);
8:    ELSE
9:       Search for pj according to Eq. (11);
10:      Update rule according to Eq. (14);
11:   END
12: END

Fig. 4. Identified prototypes: Voronoi tessellation.

5. MegaClouds layer:
In the MegaClouds layer the clouds formed by the prototypes in the previous layer are merged if the neighbouring prototypes have the same class label. In other words, they are merged if they belong to the same class. MegaClouds are used to facilitate the human interpretability. Fig. 5 illustrates the formation of the MegaClouds.

Fig. 5. MegaClouds: Voronoi tessellation.

Rules in the MegaClouds layer have the following format:

R_c : IF (x ∼ MC_1) OR (x ∼ MC_2) OR . . . OR (x ∼ MC_mc) THEN (class c)

where MC are the MegaClouds, or the areas formed from the merging of the clouds, and mc is the number of identified MegaClouds. Multimodal typicality, τ, can also be used to illustrate the MegaClouds as illustrated by Fig. 6.
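To make the per-class training of Section 3.1 more concrete, the following is a minimal single-pass sketch of the learning procedure, not the authors' implementation. It assumes the recursive density of Eqs. (7)–(9), the new-cloud condition of Eq. (12) and the nearest-prototype assignment of Eq. (11). Two simplifications are assumptions of this sketch: the prototype feature vector is updated with a standard running mean (weights s/(s+1) and 1/(s+1), so that they sum to one), and the radius update of Eq. (14) and the MegaClouds merging are omitted. The names `density` and `learn_class` are hypothetical.

```python
import numpy as np

def density(x, mu, sq_norm):
    """Density of Eq. (7) given the running mean and mean squared norm."""
    diff = x - mu
    return 1.0 / (1.0 + diff @ diff + sq_norm - mu @ mu)

def learn_class(X):
    """One pass over the feature vectors X (N x n) of a single class."""
    X = np.asarray(X, dtype=float)
    mu = X[0].copy()               # global mean, initialized as in Eq. (3)
    sq_norm = float(X[0] @ X[0])   # Sigma_1 of Eq. (9)
    protos = [X[0].copy()]         # prototype feature vectors p_j
    support = [1]                  # number of members per data cloud
    proto_ids = [0]                # index of the image that founded each cloud

    for i in range(1, X.shape[0]):
        x, k = X[i], i + 1         # k is the 1-based sample count
        mu = (k - 1) / k * mu + x / k                        # Eq. (8)
        sq_norm = (k - 1) / k * sq_norm + float(x @ x) / k   # Eq. (9)
        d_x = density(x, mu, sq_norm)
        d_p = [density(p, mu, sq_norm) for p in protos]
        if d_x >= max(d_p) or d_x <= min(d_p):               # Eq. (12)
            protos.append(x.copy())                          # new data cloud, Eq. (13)
            support.append(1)
            proto_ids.append(i)
        else:
            dists = [np.sum((x - p) ** 2) for p in protos]
            j = int(np.argmin(dists))                        # Eq. (11)
            s = support[j]
            # Assumed running-mean update of the cloud's feature vector.
            protos[j] = s / (s + 1) * protos[j] + x / (s + 1)
            support[j] = s + 1
    return np.array(protos), np.array(support), proto_ids

# Random stand-in for the normalized features of one class.
rng = np.random.default_rng(2)
X_demo = rng.random((200, 6))
prototypes, supports, prototype_ids = learn_class(X_demo)
```

Each sample is visited exactly once and either founds a new data cloud or joins an existing one, which is why the supports always add up to the number of samples seen.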

3.2. Architecture and validation of the proposed xDNN

The architecture for the validation process of the proposed xDNN method is illustrated by Fig. 7.
The validation process of xDNN is composed of the following layers:

1. Features layer;
2. Similarity layer (density);
3. Local decision-making.
4. Global decision-making.

Fig. 6. Typicality for the iRoads dataset.

These layers are described in detail as follows:

1. Features layer:
Similar to the features layer described in the training process.

2. Prototypes layer:
In this layer the degrees of similarity to the nearest prototypes (per class) are extracted for each unlabelled (new/validation) data sample/image I_i, defined as follows:

\[ S(x, p_i) = \frac{1}{1 + \frac{\|x - p_i\|^2}{\|\sigma_N\|^2}} \tag{15} \]

where S denotes the similarity degree.

3. Local (per class) decision-making layer:
Local (per class) decision-making is calculated based on the 'winner-takes-all' principle and can be obtained by:

\[ \lambda_c = \max_{j=1,2,\ldots,P} (S_j) \tag{16} \]
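The similarity and 'winner-takes-all' steps of Eqs. (15) and (16), followed by the selection of the class with the largest local winner (the global decision-making layer, Eqs. (17)–(18)), can be sketched as follows. This is a toy illustration under assumptions: `sigma_sq` is a scalar placeholder for the term ‖σ_N‖², whose computation at validation time is not spelled out here, and the function names, class labels and two-dimensional prototypes are made up for the example.

```python
import numpy as np

def similarity(x, prototypes, sigma_sq):
    """Cauchy-type similarity of Eq. (15) between x and each prototype (row)."""
    sq_dist = np.sum((prototypes - x) ** 2, axis=1)
    return 1.0 / (1.0 + sq_dist / sigma_sq)

def classify(x, prototypes_per_class, sigma_sq=1.0):
    """Return the winning label and the per-class winners lambda_c (Eq. 16)."""
    lambdas = {
        label: float(similarity(x, protos, sigma_sq).max())
        for label, protos in prototypes_per_class.items()
    }
    # Global decision: the class whose best prototype is most similar to x.
    label = max(lambdas, key=lambdas.get)
    return label, lambdas

# Hypothetical prototypes for two classes in a 2-D feature space.
toy_prototypes = {
    "day": np.array([[0.9, 0.9], [0.8, 0.7]]),
    "night": np.array([[0.1, 0.2]]),
}
predicted, scores = classify(np.array([0.85, 0.8]), toy_prototypes)
```

Because the decision is driven by a single best-matching prototype per class, the prototype image behind the winning score can be shown to the user as the explanation of the decision.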

Fig. 7. Architecture for the validation process of the proposed xDNN.

4. Global decision-making layer:
The global decision-making layer is in charge of forming the decision by assigning labels to the validation images based on the degree of similarity of the prototypes obtained by the prototype identification layer, as illustrated by Fig. 7, and determining the winning class.

\[ \lambda_c^{*} = \max_{c=1,2,\ldots,C} (\lambda_c) \tag{17} \]

In order to determine the overall degree of satisfaction, the maximum of the local, per class winners is applied.
The label is obtained by the following Eq. (18):

\[ \text{label} = \underset{c=1,2,\ldots,C}{\operatorname{argmax}} (\lambda_c^{*}) \tag{18} \]

4. Experimental data

We validated our proposed approach, xDNN, using several complex, well-known image classification benchmark datasets (iRoads, Caltech-256, Caltech-101); we also propose our own dataset for SARS-CoV-2 identification.

4.1. iRoads dataset

The iROADS dataset (Rezaei & Terauchi, 2013) was considered in the analysis first. The dataset contains 4656 image frames recorded from moving vehicles on a diverse set of road scenes, recorded in day, night, and under various weather and lighting conditions, as described below:

• Daylight: 903 images
• Night: 1050 images
• Rainy day: 1049 images
• Rainy night: 431 images
• Snowy: 569 images
• Sun strokes: 307 images
• Tunnel: 347 images

4.2. Caltech-256

Caltech-256 has 30,607 images divided into 257 object categories (one of which is the background) (Griffin et al., 2007).

4.3. Caltech-101

Caltech-101 is divided into 102 object categories (one of which is the background) (Fei-Fei et al., 2004).

4.4. COVID-CT dataset

The COVID-CT dataset contains 275 computed tomography scans positive for COVID-19 (Zhao et al., 2020).

4.5. Performance evaluation

We used the following metrics for classification evaluation:

Accuracy:
\[ ACC(\%) = \frac{TP + TN}{TP + FP + TN + FN} \times 100 \tag{19} \]

Precision:
\[ Precision(\%) = \frac{TP}{TP + FP} \times 100 \tag{20} \]

Recall:
\[ Recall(\%) = \frac{TP}{TP + FN} \times 100 \tag{21} \]

F1 Score:
\[ F1\ Score(\%) = 2 \times \frac{Precision \times Recall}{Precision + Recall} \times 100 \tag{22} \]

where TP, FP, TN, FN denote true positives, false positives, true negatives and false negatives, respectively.
The area under the curve, AUC, is defined through the TP rate and FN rate.

All the experiments were conducted with MATLAB 2018a using a personal computer with a 1.8 GHz Intel Core i5 processor, 8-GB RAM, and MacOS operating system. The classification experiments were executed using 10-fold cross validation under the same ratio of training-to-testing (90% to 10%) sample sets.

5. Results and analysis

Computational simulations were performed to assess the accuracy of the proposed explainable deep learning method, xDNN, against other state-of-the-art approaches.
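For reference when reading the result tables, the metrics of Section 4.5 (Eqs. (19)–(22)) can be computed from confusion-matrix counts as in the short sketch below. The function name and the counts are hypothetical, precision and recall are treated as fractions inside the F1 formula and converted to percent only at the end, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score in percent (Eqs. 19-22)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }

# Made-up counts for a binary problem, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these made-up counts the accuracy is 85% and the precision is 80%; such numbers are unrelated to the results reported in the tables.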

Table 1
Performance comparison: iRoads dataset.

Method                               Accuracy   Time (s)   # Parameters
xDNN                                 99.59%     4.32       27
VGG–16 (He et al., 2016)             99.51%     836.28     Not reported
DRB (Angelov & Gu, 2019)             99.02%     2.95       521
SVM (Suykens & Vandewalle, 1999)     94.17%     5.67       Not reported
KNN (Bishop, 2006)                   93.49%     4.43       4656
Naive Bayes (Bishop, 2006)           88.35%     5.31       Not reported

Table 2
Performance comparison: Caltech-256 dataset.

Method                                       Accuracy
xDNN                                         75.41%
MSVM (Cao et al., 2019)                      70.18%
VGG–16 (He et al., 2016)                     73.2%
VGG–19 (He et al., 2016)                     70.62%
ResNet–101 (Simonyan & Zisserman, 2014)      75.14%
GoogLeNet (Szegedy et al., 2015)             72.42%
Softmax(7) (Zeiler & Fergus, 2014)           74.2%

5.1. iRoads Dataset

Table 1 shows that the proposed xDNN method provides the best result in terms of classification accuracy as well as time/complexity and simplicity of the model structure (number of parameters/prototypes). The number of model parameters for xDNN (and DRB) is, strictly speaking, zero, because the 2 parameters (mean, µ, and standard deviation, σ) per prototype (data cloud) are derived from the data and are not algorithmic or user-defined parameters. For the kNN method one can argue that the number of parameters is the number of data samples, N. The proposed explainable DNN surpasses in terms of accuracy the state-of-the-art VGG–16 algorithm, which is a well-established convolutional deep neural network. Moreover, the proposed xDNN has at its top layer a very small number of MegaClouds (27 or, on average, 4 MegaClouds per class), which makes it very easy to explain and visualize. For comparison, our earlier version of deep rule-based models, called DRB (Angelov & Gu, 2018), also produced a high accuracy and was trained a bit faster, but ended up with 521 prototypes (on average 75 prototypes per class) (Soares et al., 2019). With xDNN we generate meaningful IF . . . THEN rules as well as an analytical description of the typicality, which is the empirically derived pdf in a closed form that lends itself to further analysis and processing.

MegaClouds generated by the proposed xDNN model can be visualized in terms of rules as illustrated by Fig. 8.

Fig. 8. xDNN rule generated for the 'Daylight scene'.

Voronoi tessellation can also be used to visualize the resulting MegaClouds as illustrated by Fig. 9.

Fig. 9. MegaClouds for the iRoads dataset.

5.2. Caltech-256 and Caltech-101 dataset

Results for Caltech-256 are presented in Table 2.

Results presented in Table 2 demonstrate that the proposed xDNN approach can obtain highly accurate results compared to state-of-the-art approaches for this complex problem. It is important to highlight that we only compared the proposed approach with DNNs that do not use any trick for image augmentation. The proposed approach offers explainable models which can be visualized in terms of IF . . . THEN rules. xDNN produced on average 3 MegaClouds per class (a total of 721), which are clearly explainable. Rules have the following format:

We also tested the proposed xDNN approach on the Caltech-101 dataset. Results for the Caltech-101 dataset, presented in Table 3, showed that the proposed approach could surpass other state-of-the-art approaches in terms of accuracy.

Table 3
Performance comparison: Caltech-101 dataset.

Method                                       Accuracy
xDNN                                         94.31%
SPP–net (He et al., 2015)                    91.44%
ResNet–50 (He et al., 2016)                  90.39%
CNN S TUNE-CLS (Chatfield et al., 2014)      88.35%
(Zeiler & Fergus, 2014)                      86.5%
VGG–16 (He et al., 2016)                     90.32%
KNN (Bishop, 2006)                           85.65%
DT (Quinlan, 1986)                           54.42%

We compared the proposed xDNN approach with the best published single-label classifier methods and achieved a better result. There are a couple of alternative methods that report higher results on Caltech problems, but they use additional information such as the context (Leng et al., 2019) or multiple labels (Qian et al., 2019) in order to enhance the classification performance. They include extra features (labels and descriptions), and this makes the underlying problem different even if the name is still the same (Caltech-101 or Caltech-256). We believe that the comparison has to be on the same playing field using the same amount of information and, therefore, we do not report these methods. Apart from them, to the best of our knowledge, there is no better result achieved on the Caltech data sets.

5.3. COVID CT-scan dataset

In this section we report the results obtained by the proposed xDNN classification approach when applied to the COVID CT-scan dataset (Zhao et al., 2020). Results presented in Table 4 compare the proposed algorithm with other state-of-the-art approaches, including a traditional ''black-box'' Deep Neural Network, Support Vector Machines, etc.

The proposed xDNN classifier provided better results in terms of accuracy, recall, F1 score, and AUC. Moreover, the proposed approach also provided highly interpretable results that may be helpful for specialists (in this case, medical doctors). The proposed classifier identified 30 prototypes for non-COVID and 33 prototypes for COVID patients. Rules generated by the identified prototypes for COVID and non-COVID patients are illustrated by Figs. 10 and 11 respectively. The baseline approach (Zhao et al., 2020) is a Deep Neural Network approach which is a 'black box' (offers no interpretability).

Fig. 10. Final rule given by the proposed xDNN classifier for the COVID-19 identification. Differently from 'black box' approaches such as deep neural networks, the proposed approach provides highly interpretable rules which can be used by human experts for the early evaluation of patients suspected of SARS-CoV-2 infection.

Fig. 11. Non-Covid final rule given by the proposed explainable Deep Learning classifier.

Using the proposed method we extracted from the data linguistic IF...THEN rules which involve actual images of both cases (COVID-19 and non-COVID), as illustrated in Figs. 10 and 11. Such transparent rules can be used in the decision-making process for early diagnostics of COVID-19 infection. Rapid detection of viral infection with high sensitivity may allow better control of the viral spread. Early diagnosis of COVID-19 is crucial for the disease treatment and control.

Fig. 12 illustrates the evolving nature of the proposed approach. xDNN is able to learn continuously as new data is presented to it. Therefore, no full re-training is required due to its life-long learning architecture. On the contrary, the baseline approach (Zhao et al., 2020) is based on a Deep Neural Network that requires full re-training for any new data sample, which can be very costly in terms of time, computational complexity and requirements for hardware and computer experts. xDNN continuously learns as new training data arrives at the system. It can be observed that with 478 training data samples the proposed approach could obtain better results in terms of accuracy (84.56%) than the baseline approach (84.0%) with 537 training data samples (Zhao et al., 2020). The baseline approach is a Deep Neural
P. Angelov and E. Soares / Neural Networks 130 (2020) 185–194 193

Table 4
Performance comparison: COVID CT-scan dataset.

Method                              Accuracy  Precision  Recall  F1 score  AUC
xDNN                                88.6%     89.7%      88.6%   89.2%     88.6%
Baseline (Zhao et al., 2020)        84.7%     97.0%      76.2%   85.3%     82.4%
SVM (Suykens & Vandewalle, 1999)    80.5%     84.4%      83.5%   84%       79.7%
KNN (Bishop, 2006)                  83.9%     90.4%      82.4%   86.2%     84.3%
AdaBoost (Hastie et al., 2009)      83.9%     87.7%      83.5%   85.5%     84%
Naive Bayes (Bishop, 2006)          70.5%     77%        73.6%   75.3%     69.6%
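
As a consistency check on Table 4, the F1 score can be recomputed from the reported precision and recall as their harmonic mean, F1 = 2PR/(P + R). The Python sketch below is illustrative only: the function names `f1_score` and `metrics_from_counts` are hypothetical and are not part of the xDNN implementation, and AUC is not covered because it cannot be derived from these quantities alone. The precision, recall and F1 values are copied from Table 4.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# (precision, recall, reported F1) in percent, copied from Table 4.
TABLE_4 = {
    "xDNN": (89.7, 88.6, 89.2),
    "Baseline": (97.0, 76.2, 85.3),
    "SVM": (84.4, 83.5, 84.0),
    "KNN": (90.4, 82.4, 86.2),
    "AdaBoost": (87.7, 83.5, 85.5),
    "Naive Bayes": (77.0, 73.6, 75.3),
}

for method, (p, r, reported) in TABLE_4.items():
    recomputed = 100 * f1_score(p / 100, r / 100)
    print(f"{method}: recomputed F1 = {recomputed:.2f}%, reported = {reported}%")
```

For every row, the recomputed value agrees with the reported F1 score to within 0.1 percentage points. With purely hypothetical counts, `metrics_from_counts(tp=8, fp=2, fn=2, tn=8)` gives 0.8 for all four metrics (up to floating-point rounding).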

Network that needs a large amount of training data to obtain a high performance in terms of classification accuracy and, once trained, cannot be further improved unless fully re-trained. In contrast, the proposed approach can obtain higher performance using less training data due to its prototype-based nature.

Fig. 12. The figure illustrates the evolving nature of the proposed xDNN approach.

Experiments have demonstrated that the proposed xDNN approach is able to produce highly accurate results surpassing state-of-the-art methods for different challenging datasets. Moreover, xDNN presents highly interpretable results that can be presented in the form of IF...THEN logical rules, Voronoi tessellations, and/or typicality (an empirically derived form of pdf) in a closed analytical form allowing further analysis. Because of its recursive, non-iterative and non-parametric form, it allows computationally very efficient implementations to be realized.

6. Conclusion

In this paper we propose a new method, explainable deep neural network (xDNN), that directly addresses the bottlenecks of the traditional deep learning approaches and offers an explainable internal architecture that can outperform the existing methods. The proposed xDNN approach requires very little computational resources (no need for GPUs) and short training times (in the order of seconds). The proposed approach, xDNN, is prototype-based. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution called typicality as well as of the data density. This generative model is identified in a closed form and equates to the pdf but is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. Results for some well-known benchmark data sets such as iRoads, Caltech-256, Caltech-101, and COVID CT-scan show that xDNN outperforms the other methods, including state-of-the-art deep learning approaches, in terms of accuracy and time to train, and offers an explainable classifier. Future research will concentrate on the development of a tree-based architecture, synthetic data generation, and local optimization in order to improve the proposed deep explainable approach.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Angelov, P. (2012). Autonomous learning systems: from data streams to knowledge in real-time. John Wiley & Sons.
Angelov, P. P., & Gu, X. (2018). Deep rule-based classifier with human-level performance and characteristics. Information Sciences, 463, 196–213.
Angelov, P. P., & Gu, X. (2019). Empirical approach to machine learning. Springer.
Angelov, P. P., Gu, X., & Príncipe, J. C. (2017). A generalized methodology for data analysis. IEEE Transactions on Cybernetics, 48(10), 2981–2993.
Biehl, M., Hammer, B., & Villmann, T. (2013). Distance measures for prototype based classification. In International workshop on brain-inspired computing (pp. 100–116). Springer.
Biehl, M., Hammer, B., & Villmann, T. (2016). Prototype-based models in machine learning. Wiley Interdisciplinary Reviews: Cognitive Science, 7(2), 92–111.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Cao, J., Wang, M., Li, Y., & Zhang, Q. (2019). Improved support vector machine classification algorithm based on adaptive feature weight updating in the hadoop cluster environment. PloS one, 14(4).
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop (p. 178). IEEE.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. California Institute of Technology.
Hastie, T., Rosset, S., Zhu, J., & Zou, H. (2009). Multi-class adaboost. Statistics and its Interface, 2(3), 349–360.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hu, J., Lu, J., & Tan, Y.-P. (2015). Deep transfer metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 325–333).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Leng, J., Liu, Y., & Chen, S. (2019). Context-aware attention network for image recognition. Neural Computing and Applications, 31(12), 9295–9305.
Li, O., Liu, H., Chen, C., & Rudin, C. (2018). Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Thirty-second AAAI conference on artificial intelligence.
Liu, C., Bellec, G., Vogginger, B., Kappel, D., Partzsch, J., Neumärker, F., Höppner, S., Maass, W., Furber, S. B., & Legenstein, R. (2018). Memory-efficient deep learning on a spinnaker 2 prototype. Frontiers in Neuroscience, 12, 840.
Mizuno, K., Terachi, Y., Takagi, K., Izumi, S., Kawaguchi, H., & Yoshimoto, M. (2012). Architectural study of HOG feature extraction processor for real-time object detection. In 2012 IEEE workshop on signal processing systems (pp. 197–202). IEEE.
Nebel, D., Kaden, M., Villmann, A., & Villmann, T. (2017). Types of (dis-) similarities and adaptive mixtures thereof for improved classification learning. Neurocomputing, 268, 42–54.
Oyedotun, O. K., & Khashman, A. (2017). Prototype-incorporated emotional neural network. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3560–3572.
Qian, G., Zhang, L., & Wang, Y. (2019). Single-label and multi-label conceptor classifiers in pre-trained neural networks. Neural Computing and Applications, 31(10), 6179–6188.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Ren, S., He, K., Girshick, R., Zhang, X., & Sun, J. (2016). Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7), 1476–1481.
Rezaei, M., & Terauchi, M. (2013). Vehicle detection based on multi-feature clues and Dempster-Shafer fusion theory. In Pacific-Rim symposium on image and video technology (pp. 60–72). Springer.
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Saralajew, S., Holdijk, L., Rees, M., & Villmann, T. (2018). Prototype-based neural network layers: incorporating vector quantization. arXiv preprint arXiv:1812.01214.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
Sejnowski, T. J. (2018). The deep learning revolution. MIT Press.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Soares, E., & Angelov, P. (2019). Novelty detection and learning from extremely weak supervision. arXiv preprint arXiv:1911.00616.
Soares, E., Angelov, P., Costa, B., & Castro, M. (2019). Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios. In 2019 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.
Solmaz, B., Assari, S. M., & Shah, M. (2013). Classifying web videos using a global video descriptor. Machine Vision and Applications, 24(7), 1473–1485.
Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5934–5938). IEEE.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818–833). Springer.
Zhao, J., Zhang, Y., He, X., & Xie, P. (2020). COVID-CT-Dataset: a CT scan dataset about COVID-19. arXiv preprint arXiv:2003.13865.
Zhuang, F., Cheng, X., Luo, P., Pan, S. J., & He, Q. (2015). Supervised representation learning: Transfer learning with deep autoencoders. In Twenty-fourth international joint conference on artificial intelligence.
