Deep Learning Applications in Image Analysis (2023)

The document outlines the 'Studies in Big Data' series, which publishes advancements in Big Data across various fields, including engineering and life sciences. It highlights the significance of deep learning applications in image analysis, detailing specific projects such as Bangla handwritten character recognition and COVID-19 diagnosis from X-ray images. The book aims to benefit researchers and students in deep learning and machine learning through practical applications and theoretical insights.


Volume 129

Studies in Big Data

Series Editor
Janusz Kacprzyk
Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Big Data” (SBD) publishes new developments and
advances in the various areas of Big Data, quickly and with high quality.
The intent is to cover the theory, research, development, and applications of
Big Data, as embedded in the fields of engineering, computer science,
physics, economics and life sciences. The books of the series refer to the
analysis and understanding of large, complex, and/or distributed data sets
generated from recent digital sources coming from sensors or other physical
instruments as well as simulations, crowd sourcing, social networks or other
internet transactions, such as emails or video click streams and other. The
series contains monographs, lecture notes and edited volumes in Big Data
spanning the areas of computational intelligence including neural networks,
evolutionary computation, soft computing, fuzzy systems, as well as
artificial intelligence, data mining, modern statistics and Operations
research, as well as self-organizing systems. Of particular value to both the
contributors and the readership are the short publication timeframe and the
world-wide distribution, which enable both wide and rapid dissemination of
research output.
The books of this series are reviewed in a single blind peer review
process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web
of Science.
Editors
Sanjiban Sekhar Roy, Ching-Hsien Hsu and Venkateshwara Kagita

Deep Learning Applications in Image Analysis


Editors
Sanjiban Sekhar Roy
School of Computer Science and Engineering, Vellore Institute of
Technology, Vellore, TN, India

Ching-Hsien Hsu
College of Information and Electrical Engineering, Asia University,
Musashino, Taiwan

Venkateshwara Kagita
Department of Computer Science and Engineering, National Institute of
Technology Warangal, Warangal, India

ISSN 2197-6503 e-ISSN 2197-6511


Studies in Big Data
ISBN 978-981-99-3783-7 e-ISBN 978-981-99-3784-4
https://doi.org/10.1007/978-981-99-3784-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively
licensed by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any
other physical way, and transmission or information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service
marks, etc. in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice
and information in this book are believed to be true and accurate at the date
of publication. Neither the publisher nor the authors or the editors give a
warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The
publisher remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.

This Springer imprint is published by the registered company Springer
Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway
East, Singapore 189721, Singapore
This book is dedicated to my mother “Papri Roy”
–Sanjiban Sekhar Roy
Preface
In recent times, deep learning applications have achieved cutting-edge
results on various image-related problems. Deep learning models are
fascinating because they can understand images and perform vision tasks
without requiring a complex series of specialized methods. In recent years,
deep learning has emerged as the fastest-growing field in artificial
intelligence. It has found widespread application across various domains,
showcasing its effectiveness and rapid development. These applications
range from handwritten character recognition, automatic diagnosis of
COVID-19 from X-ray images, and classification of imbalanced image
datasets to image captioning, vehicle overspeed detection systems, and
many others. The topics included in this book will be of interest to
academicians and researchers working on image-related problems in deep
learning and machine learning. Graduates, postgraduates, and Ph.D.
scholars working in these fields will also benefit greatly. This edited
book covers the following works on the applications of deep learning to
various image-related problems.
Autoencoder and Deep Convolutional Generative Adversarial Network in
Improving Classification Performance of Bangla Handwritten
Deep Learning-Based Approaches Using Feature Selection Methods for
Automatic Diagnosis of COVID-19 Disease from X-RAY Images
Image Captioning Using Deep Transfer Learning
Vehicle Over speed Detection system
An Intelligent System for Video-Based Proximity Analysis
Melanoma cancer detection using deep learning
Plant Diseases Classification using Neural Network: AlexNet
Hyperspectral Images: A Succinct Analytical Deep Learning Study
Chest X-Ray image classification of Pneumonia Disease using EfficientNet
and InceptionV3
Detection of Cancer using Deep Learning Techniques
The intention of compiling this book is to give readers a good idea of
both the theory and the practice related to the above-mentioned
applications by showcasing the uses of deep learning. We hope that
readers will benefit significantly from learning the state of the art of
deep learning applications in the domain of imagery.
Keep reading, learning, and inquiring.
Dr. Sanjiban Sekhar Roy
Vellore, TN, India
September 2020
Contents
Autoencoder and Deep Convolutional Generative Adversarial Network
in Improving the Performance of Bangla Handwritten Character
Recognition
Tanzina Akter Tani, Mir Moynuddin Ahmed Shibly,
Md. Shoumique Hasan, Nilofa Yeasmin and Shamim Ripon
Deep Learning-Based Approaches Using Feature Selection Methods for
Automatic Diagnosis of COVID-19 Disease from X-Ray Images
Burak Taşci
Image Captioning Using Deep Transfer Learning
Tapan Kumar Das
Vehicle Over Speed Detection System
K. Ganesan, N. S. Manikandan and Vijayan Sugumaran
An Intelligent System for Video-Based Proximity Analysis
Sergey Antonov, Mikhail Bogachev, Pavel Leyba, Aleksandr Sinitca
and Dmitrii Kaplun
Deep Learning-Based Conjunctival Melanoma Detection Using Ocular
Surface Images
Kanchon Kanti Podder, Mohammad Kaosar Alam,
Zakaria Shams Siam, Khandaker Reajul Islam, Proma Dutta,
Adam Mushtak, Amith Khandakar, Shona Pedersen and
Muhammad E. H. Chowdhury
Plant Diseases Classification Using Neural Network: AlexNet
Mohd Anas, Sanjiban Sekhar Roy, Kunwar S. Srivastava and
Jashabir Chakraborty
Hyperspectral Images: A Succinct Analytical Deep Learning Study
L. Sandeep Kumar, G. K. Panda and B. K. Tripathy
Chest X-Ray Image Classification of Pneumonia Disease Using
EfficientNet and InceptionV3
Neel Ghoshal, Mohd Anas and Sanjiban Sekhar Roy
Detection of Cancer Using Deep Learning Techniques
Apoorv Singh, Arjunaditya and B. K. Tripathy
About the Editors
Sanjiban Sekhar Roy is currently a Professor with the School of
Computer Science and Engineering, Vellore Institute of Technology. He
received his Ph.D. degree from the Vellore Institute of Technology, Vellore,
India, in 2016. He has edited a handful of special issues for journals and
published numerous articles in high-impact SCI journals such as IEEE
Transactions on Computational Social Systems; Scientific Reports, Nature;
Computers and Electrical Engineering, Elsevier; and many other reputed
journals. Dr. Roy has published nine books with reputed international
publishers such as Springer, Elsevier and IGI Global. His research interests
are deep learning and advanced machine learning. Dr. Roy was a recipient
of the “Diploma of Excellence” Award for academic research from the
Ministry of National Education, Romania. He was also an Associate
Researcher with Ton Duc Thang University, Ho Chi Minh City, Vietnam,
from 2019 to 2020.

Ching-Hsien Hsu is Chair Professor of the College of Information and
Electrical Engineering, Asia University, Taiwan; Professor in the
Department of Computer Science and Information Engineering, National
Chung Cheng University; and Research Consultant, Department of Medical
Research, China Medical University Hospital, China Medical University,
Taiwan. His research includes cloud and edge computing, big data
analytics, high performance computing systems, parallel and distributed
systems, artificial intelligence, medical AI and natural language processing.
He has published 350+ papers in top journals such as IEEE TPDS, IEEE
TSC, ACM TOMM, IEEE TCC, IEEE TETC, IEEE System, IEEE
Network, top conferences, and book chapters in these areas. Dr. Hsu is the
editor-in-chief of International Journal of Grid and High Performance
Computing, and International Journal of Big Data Intelligence; and serves
on the editorial boards of a number of prestigious journals, including IEEE
Transactions on Service Computing, IEEE Transactions on Cloud
Computing, International Journal of Cloud Computing, Journal of
Communication Systems, International Journal of Computational Science,
and AutoSoft Journal. He has been an author/co-author or an
editor/co-editor of 10 books from Elsevier, Springer, IGI Global, World
Scientific and McGraw-Hill. Dr. Hsu has received talent awards seven times
from the Ministry of Science and Technology and the Ministry of Education,
and the distinguished award for excellence in research nine times from
Chung Hua University, Taiwan. Prof. Hsu is president of Taiwan Association
of Cloud Computing; Chair of IEEE Technical Committee on Cloud Computing
(TCCLD); Fellow of the IET (IEE) and senior member of the IEEE.

Venkateswara Rao Kagita is an Assistant Professor at NIT Warangal.
He obtained his Ph.D. from the University of Hyderabad. His research
interests are data mining, machine learning, and deep learning, with a
specific focus on machine learning techniques for recommender systems.
His research has been published in various reputed journals and
conference proceedings. He has also delivered guest lectures at
several international and national workshops, IITs, NITs, and universities.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_1

Autoencoder and Deep Convolutional
Generative Adversarial Network in Improving
the Performance of Bangla Handwritten
Character Recognition
Tanzina Akter Tani (1), Mir Moynuddin Ahmed Shibly (2),
Md. Shoumique Hasan (1), Nilofa Yeasmin (1) and Shamim Ripon (1)
(1) Department of Computer Science and Engineering, East West
University, Dhaka, Bangladesh
(2) Department of Computer Science and Engineering, United
International University, Dhaka, Bangladesh

Mir Moynuddin Ahmed Shibly
Email: moynuddin@cse.uiu.ac.bd

Shamim Ripon (Corresponding author)
Email: dshr@ewubd.edu

Keywords Generative adversarial network – Class imbalance –
Autoencoder – Outlier detection – Bangla handwritten character
classification

1 Introduction
Handwritten character recognition has been an area of interest among deep
learning researchers and practitioners in recent years. Because of its wide
range of possible applications, a significant number of studies
have been carried out on handwritten text and character recognition in
different languages, such as English [1], Japanese [2], and Latin [3]. Bangla
is the first and official language of Bangladesh, and it is the fourth most
popular language in the world, spoken by almost 300 million people [4].
Considering this large number of native users, handwritten character
recognition of the Bangla language plays a very important role in a wide
range of applications, including bank cheque processing, identifying postal
codes, zip code scanning, interpreting national ID numbers, Bangla optical
character recognition (OCR), and many more [5, 6].
In the Bangla language, there are 11 vowels, 39 consonants, and a
considerable number of vowel diacritics, consonant conjuncts and
diacritics, as well as digits, symbols, and punctuation marks. Recognizing
handwritten Bangla characters is difficult and complicated for a
number of reasons: (a) there are many compound characters in the Bangla
alphabet, (b) the forms of certain characters are very similar, and (c) because
different people write in different ways, the same character written by
different people has different forms, sizes, and curvatures.
To overcome these problems, several efforts have been made to improve
the recognition accuracy. Convolutional neural networks (CNN) [4, 7–9],
Deep CNN [10], and ensemble learning methods [11, 12] have been applied
in recent years. However, the scarcity of Bangla datasets and the class
imbalance in those datasets are barriers to the recognition problem. Ensemble
methods and image augmentation are among the many ways to overcome
this issue. The generative adversarial network (GAN) introduced in [13] is
another way to produce new instances of data. The presence of outliers in
the dataset can also make recognition difficult, as they mislead the
training of the models. So, eliminating outliers can yield statistically
more meaningful results.
In Bangla handwriting-related studies, researchers have used different
classification approaches. The authors in [14] suggested a hierarchical
method for segmenting characters from sentences, with multilayer
perceptron (MLP) as the classification algorithm, whereas an MLP, RBF
network, and SVM fusion classifier is suggested in [15]. In [16], Bangla
handwriting images are classified into 50 groups by using a multilayer
perceptron neural network.
Deep learning methods such as convolutional neural network (CNN)-based
architectures have been used in the majority of recent works. Some of
these works are limited to simple characters [17], while others
concentrate on handwritten digits [18, 19]. Additionally, work has been
done on a subset of the compound characters of the Bangla language [20].
One of the major issues in Bangla handwritten character recognition is the
limited availability of a complete handwritten character dataset.
Generating Bangla handwritten characters is a way to solve this problem.
Deep convolutional generative adversarial network (DCGAN) [21] has
been used by some researchers to generate Bangla handwritten digits [22,
23]. However, there has not been much work focused on generating
complicated curative characters and classifying Bangla handwritten
characters using them.
Deep neural networks are widely used for analyzing and
classifying different types of images [24–27]. The residual network (ResNet)
is one of the prominent neural network architectures that has long been
used for image classification and identification with excellent results.
For example, researchers have used transfer learning with
ResNet-50 for malaria cell-image classification [28] and malicious software
classification [29]. Residual networks have also been applied in some
Bangla handwritten character recognition studies [30–32].
This chapter proposes a two-fold approach with a residual
network classifier to classify Bangla handwritten characters. First, a
model is created using a ResNet variant called ResNet-50 to
classify the target dataset, which in this case is the Ekush [33]
dataset. After this classification, the dataset is stabilized by removing
the outliers using an autoencoder, and the classification is performed again
using the same ResNet-50 model. Finally, the classes with fewer
images are augmented with more images by applying DCGAN,
so that the number of images among the classes becomes balanced. This
dataset is then classified with the ResNet-50 model. In the end, a detailed
comparative analysis is conducted over the results obtained from the
above-mentioned experiments to measure the strengths of the adopted methods.
The rest of the chapter is structured as follows. Section 2 covers a
detailed review of the state-of-the-art of Bangla handwritten character
recognition. The methods and materials of this study and elaborate result
analysis with discussion are presented in Sects. 3, 4, and 5. The chapter
ends with an appropriate conclusion section.
2 Related Work
The researchers in [4] introduced a CNN model named EkushNet,
which produced satisfactory results on the Ekush [33] and CMATERdb
[34] datasets. The authors reported that EkushNet
performed extremely well and produced the best results on Bangla
character recognition relative to prior work.
Their proposed model achieved 96.90% accuracy on the training set and
97.73% on the validation set of the Ekush dataset after 50 iterations. The
authors also applied cross-validation on the CMATERdb dataset and found
that EkushNet is 95.01% accurate. Another research work [7]
applied a CNN model to Bangla handwritten character
identification and obtained 85.96% accuracy on the
test dataset, whereas the authors in [10] achieved 95% accuracy using
a Deep CNN model. Both works used the 50 alphabet classes of the
Ekush dataset. Another study [20] achieved 95.05% accuracy on 122
classes of the Ekush dataset using a DCNN model. The
authors also experimented on two other datasets, CMATERdb and
BanglaLekha-Isolated [35]. The authors in [36] reported an
excellent accuracy of 98.78% on the CMATERdb dataset, using
five different approaches for classification.
The authors of [11] found that an ensemble of convolutional neural
networks outperforms a single CNN model at
recognizing Bangla handwriting. They proposed a stacked
generalization ensemble framework consisting of six CNN models, which
reached 96.72% accuracy on the test set after only 40 epochs.
Another study [37] applied three
approaches. First, seven CNN models were applied to recognize
Bangla handwritten characters. Then the best-performing model, ResNet-50,
which gave 97.81% accuracy, was used for feature extraction, with
classification done by traditional classification algorithms. In the last
step, the authors employed different ensemble techniques for the
classification task. The stacked generalization ensemble method achieved
98.68% test accuracy, the best result among all the adopted
methods. All the experiments of this study were done on the Ekush and
BanglaLekha-Isolated datasets.
The authors of another study [38] experimented with six CNN models and
evaluated which DCNN model performs best on the
CMATERdb [34] dataset. The results showed that all the DCNN
models performed well, but the DenseNet model
outperformed the others. They also pointed out that the DCNN
framework works better than other object recognition methods.
Another work [17] has shown that data augmentation can improve
handwritten character identification accuracy. The authors tested their
algorithms on the alphabets of the BanglaLekha-Isolated dataset and
obtained 91.81% accuracy without augmented images and 95.25%
accuracy with augmented images. They also compared other
machine learning approaches to assess the efficiency of these
methods. The comparative analysis revealed that CNN outperforms
SVM and LSTM with or without data augmentation. They further
tested their proposed approaches on other datasets with similar
characteristics, obtaining 95.07% test accuracy on
the 59 classes of the Ekush dataset.
The performance of a classifier can be enhanced by enlarging the
dataset. GAN as a data augmentation technique can help to expand the
dataset [23]. In [22], the authors proposed a DCGAN architecture that
successfully enlarged four Bangla handwritten character datasets. In that
work, the authors focused only on the digit dataset and
did not attempt to determine the CNN model performance.
Evidence that adding GAN-generated images improves the performance of a
classifier on handwritten datasets has been provided by another
study [23]. The proposed method increased the
accuracy on the MNIST dataset by using the GAN approach. The authors
also used GAN on three Indian handwritten numeral datasets: Bangla,
Devanagari, and Oriya. The accuracy on all the datasets improved.
However, the results also showed that combining too
many GAN-generated images with the real dataset might degrade
efficiency. Another digit recognition and generation work [39]
proposed a network architecture that achieved 99.44% on the BHAND [40]
dataset. That study then applied Semi-Supervised GAN (SGAN) to
generate Bangla digits. One more GAN-related work [41] proposed a
conditional GAN-based method for generating character images based on
class. That study used three separate Bangla handwritten character
datasets and obtained very realistic images after 1500 epochs.
However, the authors did not perform any classification with the generated
images.
The literature review reveals that most of the research has been done
with either CNN or Deep CNN models. Apart from performing
classification, there has been little variation in the approaches taken in
Bangla handwritten character recognition works. Only a few studies have
combined the GAN method with classifiers. Furthermore, none of the
studies has used outlier detection. So, this literature review identifies a
knowledge gap regarding outlier
identification and elimination. In this work, both approaches are explored to
enhance the recognition performance of Bangla handwritten characters.

3 Materials and Methods


In order to improve handwritten character recognition of the Bangla
language, a series of steps has been followed and a set of experiments have
been conducted to evaluate the effectiveness of the proposed model. These
steps are illustrated in this section. The schematic view of the overall steps
is shown in Fig. 1.

Fig. 1 Schematic view of the proposed model


As shown in Fig. 1, various experiments are conducted over the dataset.
Algorithm 1 outlines the pseudocode of the proposed methods. The adopted
steps of the model are described in the subsequent sections.
3.1 Dataset
BanglaLekha Isolated [35], ISI [42], NumtaDB [43], CMATERdb [34], and
Ekush [33] are a few datasets that contain Bangla handwritten characters
and numerals. The Ekush dataset is selected in this study for experimental
purposes because it contains more classes than any other Bangla
handwritten dataset. The Ekush dataset consists of basic and compound
characters, numerals, and modifiers. The 122 classes of characters are
grouped into four categories: 10 modifiers, 50 basic characters (11 vowels
and 39 consonants), 52 widely used compound letters, and 10 numeral
digits. The dataset contains about 729,750 images. A few images from the
dataset are shown in Fig. 2. The images are grayscale with a size of
28 × 28 pixels.

Fig. 2 Different handwritten characters from the Ekush dataset

3.2 Outlier Detection


Outliers increase the uncertainty of the results, lowering statistical power.
Therefore, removing outliers can lead to statistically significant results. In
the Ekush dataset, Bangla handwritten characters are categorized into 122
groups. While individual characters were being grouped into their respective
classes, some characters were placed in the wrong classes. As a result, a few
characters from various classes are mixed up [37]. It has already been
mentioned that some Bangla handwritten characters bear a striking
resemblance to one another. During the pre-processing of the dataset, it was
discovered that classes 87 and 97, classes 19 and 84, and classes 69, 76, 110,
and 111 contain outliers of one another due to their resemblance. Since
these character instances are anomalous only in a specific context, they are
termed contextual outliers. To locate outliers in the individual groups, a
semi-supervised outlier detection approach using autoencoders has been
applied.
In Fig. 3, the process of outlier detection and preparing a purer dataset
is presented. Initially, the images of the 122 classes were analyzed
and eight classes that potentially contain more outliers were identified.
From each of these classes that contains more than 3000 images, 1000 inlier
images were selected to form the training set for its autoencoder-based
outlier detection model. For the classes with fewer than 3000 images, the
training set contained 500 images. No outlier image
was fed to the outlier detection model during training. After training,
the rest of the images were tested using the model and the outliers
were identified. The inlier images along with the previously selected
pure training set were used to develop robust classifiers.

Fig. 3 Workflow of the outlier detection
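
The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function names `training_set_size` and `split_class` are hypothetical, the inlier identifiers are assumed to have been verified by hand beforehand, and because the text does not say how a class with exactly 3000 images is treated, the sketch places it in the smaller group.

```python
def training_set_size(num_images, cutoff=3000, large=1000, small=500):
    """Number of inlier images used to train the autoencoder of one class."""
    # Assumption: a class with exactly `cutoff` images gets the smaller set.
    return large if num_images > cutoff else small


def split_class(image_ids, inlier_ids):
    """Split one class into an autoencoder training set and images to screen.

    image_ids  : identifiers of all images in the class
    inlier_ids : identifiers already verified as inliers (assumed to be given)
    """
    n_train = training_set_size(len(image_ids))
    train = list(inlier_ids)[:n_train]
    train_lookup = set(train)
    # Everything that was not used for training is screened by the model later.
    to_screen = [i for i in image_ids if i not in train_lookup]
    return train, to_screen
```

With the counts in this chapter, a class of 4186 images would contribute 1000 training images and a class of 986 images would contribute 500.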


3.2.1 Autoencoder
Autoencoders are special neural networks that learn features of complex
data in lower dimensions from unlabeled data [44], then try to reconstruct
the original complex input from the simpler encoded features. This type of
neural network has been proven to perform well in numerous fields such as
generative models, classification, clustering, recommender system,
dimensionality reduction, and so on [45], but in this work, it has been used
as an outlier detection model.
A convolutional autoencoder depicted in Fig. 4 has been used for
detecting outliers from Bangla handwritten character dataset. The network
consists of three major components – an encoder network, a bottleneck
layer, and a decoder network. The encoder network starts with an input
layer. After that, there are three convolutional layers; the output of the last
such layer is flattened and passed through a dense layer which produces a
vector containing features in a lower dimension. This is also known as a
bottleneck which is followed by a decoder network. The job of the decoder
is to reproduce the input as closely as possible to the original. Transposed
convolutional layers have been used in the decoder; they act roughly as the
reverse of typical convolutional layers, mapping the encoded features back
toward the shape of the input.

Fig. 4 Convolutional autoencoder for outlier detection

The autoencoder network has been trained with only a few inlier
images. The intuition behind using only inlier images is to make the model
familiar with what is normal, so that during testing the model
reconstructs outlier images poorly and their reconstruction error becomes
high. The images with reconstruction errors higher than a specific threshold
have then been labeled as outliers and discarded from the dataset.
The reconstruction error is calculated using the mean squared error.
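
The thresholding step can be illustrated with a short sketch. It assumes each image and its reconstruction are available as flat sequences of pixel values; the helper names `mse` and `flag_outliers` are hypothetical, and the threshold is left as a parameter because the text does not state the value that was used.

```python
def mse(original, reconstructed):
    """Mean squared error between two images given as flat pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    squared = ((o - r) ** 2 for o, r in zip(original, reconstructed))
    return sum(squared) / len(original)


def flag_outliers(originals, reconstructions, threshold):
    """Return the indices of images whose reconstruction error exceeds threshold."""
    return [
        idx
        for idx, (o, r) in enumerate(zip(originals, reconstructions))
        if mse(o, r) > threshold
    ]
```

An image that the autoencoder reproduces almost exactly has an error close to zero and is kept, while an image it reproduces badly exceeds the threshold and is flagged for removal.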
3.3 Generative Adversarial Network
The Ekush Bangla handwritten dataset contains several imbalanced classes.
Data augmentation is one way of generating additional images in order
to balance a dataset. Data augmentation approaches such as rotation and
scaling can expand a dataset but do not always add information. A generative
adversarial network (GAN), on the other hand, can generate synthetic
images that can bring additional information to the dataset. We have chosen
a deep convolutional generative adversarial network (DCGAN), as it has been
reported to be a highly effective architecture for improving classification
and identification [46]. We have taken only five classes from the Ekush
dataset, as these classes have far fewer images than the others. Where one of
these classes is also among the classes from which outliers were removed, its
outlier-removed version is used as input data in the proposed
GAN model. Table 1 shows the classes that have been used in DCGAN.

Table 1 Classes used in DCGAN

Class No   Number of images   Image example
72         4186               (image)
76         4261               (image)
97         4100               (image)
110        2012               (image)
111        986                (image)
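
As a rough illustration of what balancing these classes involves, the sketch below computes how many generated images each class in Table 1 would need to reach a common size. The target size is an assumption made only for this example (the largest of the five classes); the chapter does not state the target that was actually used, and the function name `synthetic_images_needed` is hypothetical.

```python
def synthetic_images_needed(class_counts, target):
    """Return, per class, how many generated images would bring it up to target."""
    return {cls: max(0, target - n) for cls, n in class_counts.items()}


# Image counts of the five classes listed in Table 1.
table_1_counts = {72: 4186, 76: 4261, 97: 4100, 110: 2012, 111: 986}

# Hypothetical target: the size of the largest of these five classes.
assumed_target = max(table_1_counts.values())
deficits = synthetic_images_needed(table_1_counts, assumed_target)
```

Under this assumed target, class 111 would need the most generated images (3275) and class 76 none.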

The generative adversarial network is a method for creating new
synthetic data that consists of two models: a generator and a discriminator.
The generator attempts to create a new image from random noise and feeds
it into the discriminator model, which determines whether the image is fake
or real. If the discriminator determines it to be fake, the generator attempts
again to create a new image to deceive the discriminator. The contest between
these two models continues until the generator becomes powerful enough
to create synthetic images that the discriminator model is unable
to differentiate from real ones. A general view of GAN is shown in Fig. 5.
Fig. 5 General overview of GAN
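
The contest between the two models is usually expressed through two loss functions. The sketch below uses the widely used binary cross-entropy formulation with the non-saturating generator loss; this is a standard choice and is assumed here for illustration, since the chapter does not spell out its loss functions. The inputs are the discriminator's output probabilities (the probability that an image is real), and the function names are hypothetical.

```python
import math

EPS = 1e-12  # small constant that avoids taking log(0)


def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss of the discriminator.

    d_real : discriminator outputs for real images (should approach 1)
    d_fake : discriminator outputs for generated images (should approach 0)
    """
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)
```

When the discriminator can no longer tell the two kinds of images apart, it outputs about 0.5 for everything and its loss settles near 2 ln 2 ≈ 1.386.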

Though a few experimental setups have been altered, we have adopted
the DCGAN model shown in [20] as their approach has achieved good
results for generating Ekush dataset images. The DCGAN architecture is
described briefly in the following sections. A CNN is used for both the
discriminator and the generator in the DCGAN architecture. Before being
passed to the DCGAN model, the images have been prepared by converting
them all to a single channel and resizing them to 28 × 28 pixels.
After that, all the images are normalized to the range [-1, 1]. The GAN
model has been run separately for each of the chosen classes.
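
The normalization step can be sketched as below, assuming the single-channel images hold 8-bit pixel values in [0, 255] (the text does not state the input range). The function names are hypothetical; the inverse mapping is included because generated images in [-1, 1] have to be converted back before they can be viewed or saved.

```python
def to_minus_one_one(pixel):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return pixel / 127.5 - 1.0


def from_minus_one_one(value):
    """Map a value in [-1, 1] back to an 8-bit pixel value."""
    return int(round((value + 1.0) * 127.5))


def normalize_image(image):
    """Normalize a 2D image given as a list of rows of 8-bit pixel values."""
    return [[to_minus_one_one(p) for p in row] for row in image]
```

This range matches the Tanh output of the generator, so real and generated images reach the discriminator on the same scale.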

3.3.1 Generator Architecture


In GAN, the generator model is used to create new images from a random
variable. A random noise vector of size 100 is given to the generator of our
model. This is forwarded to a dense layer with 1024 hidden
units. To keep the GAN model stable, batch normalization is used in both
the generator and discriminator [21]. The ReLU activation function is used in
all layers except the output layer, where the Tanh activation is used. The
Tanh activation function keeps the output pixels in the [-1, 1] range, which
is later used as the discriminator input [23]. Next, another dense layer with
6272 neurons is used. After the following batch normalization and ReLU
activation, the output is reshaped. Upsampling of the data in the generator
model is required to generate the output image. Two convolution layers
have been used, where the first layer consists of 64 filters with a kernel
size of 3, and the second layer consists of one filter with a kernel size of 3.
In both layers, zero padding has been applied. A 2D upsampling layer is used
just before each convolutional layer. The architecture of the generator model
is given in Fig. 6.

Fig. 6 Generator architecture
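
The layer sizes above can be checked with a small shape trace. This is a
sketch under stated assumptions, not the authors' code: the reshape target of
7 × 7 × 128 and the upsampling factor of 2 are not given in the text and are
inferred from 6272 = 7 × 7 × 128 and from the 28 × 28 × 1 output size. The
helper names are hypothetical.

```python
def upsample2d(shape, factor=2):
    """Upsampling multiplies height and width (factor of 2 is assumed)."""
    height, width, channels = shape
    return (height * factor, width * factor, channels)


def conv2d_same(shape, filters):
    """A stride-1 convolution with zero padding keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)


def generator_shapes(noise_dim=100):
    """Return (layer name, output shape) pairs for the described generator."""
    shapes = [("noise", (noise_dim,)),
              ("dense 1024", (1024,)),
              ("dense 6272", (6272,))]
    feature_map = (7, 7, 128)  # assumed reshape target: 7 * 7 * 128 == 6272
    shapes.append(("reshape", feature_map))
    feature_map = upsample2d(feature_map)
    shapes.append(("upsample", feature_map))
    feature_map = conv2d_same(feature_map, filters=64)
    shapes.append(("conv 3x3, 64 filters", feature_map))
    feature_map = upsample2d(feature_map)
    shapes.append(("upsample", feature_map))
    feature_map = conv2d_same(feature_map, filters=1)
    shapes.append(("conv 3x3, 1 filter, tanh", feature_map))
    return shapes


for name, shape in generator_shapes():
    print(f"{name:28s} {shape}")
```

Under these assumptions the two upsampling steps take the feature map from
7 × 7 to 14 × 14 and then to 28 × 28, which matches the input size expected
by the discriminator.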

3.3.2 Discriminator Architecture


In the discriminator model, two convolutional layers have been used. The
first convolutional layer has 64 filters, a kernel size of 5, a stride of 2,
and ‘same’ padding, and receives an input of shape 28 × 28 × 1. The second
convolutional layer uses the same kernel size, stride, and padding with 128
filters. The outputs are then flattened and passed to a dense layer with 256
hidden neurons. The LeakyReLU activation function with α = 0.2 has been used
in every layer of the discriminator model, as it helps the GAN model perform
well [21]. The parameter α is the leakiness of the LeakyReLU activation: it
sets the slope applied to negative inputs, so that negative values still pass
through the network, which prevents the dying-ReLU state. After that, a 25%
dropout has been used to keep the discriminator model from overfitting.
Finally, a dense layer with a single output unit and sigmoid activation has
been used. The architecture of the discriminator model is given in Fig. 7.
Fig. 7 Discriminator architecture
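
A short sketch can make the discriminator's shapes and the role of α
concrete. It assumes the common convention that a strided convolution with
‘same’ padding produces ceil(input / stride) outputs per spatial dimension;
the function names are hypothetical.

```python
import math


def conv2d_same_strided(shape, filters, stride):
    """'same' padding with a stride: output size is ceil(input / stride)."""
    height, width, _ = shape
    return (math.ceil(height / stride), math.ceil(width / stride), filters)


def leaky_relu(x, alpha=0.2):
    """Pass positive inputs unchanged; scale negative inputs by alpha."""
    return x if x > 0 else alpha * x


def discriminator_shapes(input_shape=(28, 28, 1)):
    """Return (layer name, output shape) pairs for the described model."""
    shapes = [("input", input_shape)]
    x = conv2d_same_strided(input_shape, filters=64, stride=2)
    shapes.append(("conv 5x5, 64 filters, stride 2", x))
    x = conv2d_same_strided(x, filters=128, stride=2)
    shapes.append(("conv 5x5, 128 filters, stride 2", x))
    flat = (x[0] * x[1] * x[2],)
    shapes.append(("flatten", flat))
    shapes.append(("dense 256", (256,)))
    shapes.append(("dense 1, sigmoid", (1,)))
    return shapes


for name, shape in discriminator_shapes():
    print(f"{name:34s} {shape}")
print(leaky_relu(-1.0), leaky_relu(3.0))
```

With α = 0.2, an input of -1.0 becomes -0.2 instead of 0, so the unit still
passes a gradient for negative inputs.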

Following the research in [21], we have used the Adam optimizer in our
DCGAN model. Although another study [47] has used an Adam optimizer
with of learning rate and 0.1 of momentum, we have changed
the learning rate to and momentum term to in both the
discriminator and generator model as we have found that these values of
parameters helped to stabilize the training. The β_1 momentum term controls
the exponential decay rate of the running average of the gradient, which is
updated at the end of each batch step [48]. Binary cross-entropy has been
employed to measure the loss of the discriminator and the generator. We have
used two batch sizes: 64 for classes 110 and 111, because the number of real
images is limited, and 128 for the other three classes. The model has been
trained for 2000 epochs for each class except class 111, which has been
trained for 4000 epochs. Class 111 needs more epochs because its training
data is very scarce, which prevents the generator from producing quality
synthetic images in the early epochs. Every 50 epochs, we saved and inspected
the generated images. For each of the five classes, we compared images from
various epochs and identified the epoch at which the quality of the synthetic
images is good enough compared to the actual images. We have taken a fixed
number of images for each of these classes so that the classification model
is trained with at least 4000 images per class. Table 2 shows the selected
epoch and the number of generated images added to the actual training
dataset.

Table 2 DCGAN generated dataset details


Class no Epoch number Total DCGAN image
72 1900 964
76 1750 1360
97 750 896
110 1850 3267
111 3750 3594
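
The binary cross-entropy losses used for the DCGAN can be sketched as
follows. This is an illustration under assumptions: the text does not say how
the real and fake terms are combined, so the discriminator loss below simply
sums them, and the generator uses the common form that labels its own outputs
as real. All function names are hypothetical.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and probabilities."""
    total = 0.0
    for target, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return total / len(targets)


def discriminator_loss(real_scores, fake_scores):
    """Real images should score 1, generated images should score 0."""
    real_loss = binary_cross_entropy([1.0] * len(real_scores), real_scores)
    fake_loss = binary_cross_entropy([0.0] * len(fake_scores), fake_scores)
    return real_loss + fake_loss


def generator_loss(fake_scores):
    """The generator is rewarded when its images are scored as real."""
    return binary_cross_entropy([1.0] * len(fake_scores), fake_scores)


# Made-up discriminator outputs for two real and two generated images.
print(discriminator_loss([0.9, 0.8], [0.1, 0.3]))
print(generator_loss([0.1, 0.3]))
```

A discriminator that separates the two sets well has a low loss, while the
generator's loss falls only as the discriminator's scores for generated
images approach 1.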

3.4 Classification
Before applying the classification model, all the images have been resized
to 28 × 28 pixels in grayscale mode. We have used ResNet-50 to classify the
122 classes of the Ekush dataset. As the name implies, the model consists of
50 layers. A brief description of ResNet-50 is given in the following
section.

3.4.1 ResNet-50
Identity blocks and convolutional blocks are the two kinds of blocks used in
the ResNet-50 architecture; which one is used depends on the dimensions of
the input and output. Both blocks have a skip connection over the main path,
which helps the model learn an identity function. The identity function lets
the model skip layers whose training would not add to the accuracy [49]. In
the identity block, there are three Conv2D layers with stride (1,1) and
random initialization with a seed of zero. Only the second Conv2D has
padding. Batch normalization and ReLU activation follow each Conv2D, except
that the shortcut is added before the final ReLU activation. In the
convolutional block, the skip connection has a Conv2D layer and batch
normalization, which the identity block does not have. Apart from this, the
structure is almost the same as the identity block. The first convolutional
layer on the main path and the convolutional layer on the shortcut path have
a stride of (s,s), and the rest have (1,1).
The ResNet-50 architecture has five stages. The dataset image dimension of
28 × 28 × 1 is given as the input shape to the ResNet-50 architecture. The
first stage has a 7 × 7 convolutional layer with 32 filters and (1,1)
strides, followed by batch normalization and a 3 × 3 MaxPooling layer. The
remaining four stages have two, three, five, and two identity blocks,
respectively, each together with a convolutional block. After the five
stages, there is an average pooling layer with (2,2) strides, which reduces
the output size. Finally, a fully connected (dense) layer with SoftMax
activation produces the output over the 122 classes. The diagram of the
ResNet-50 architecture is given in Fig. 8.

Fig. 8 ResNet-50 architecture
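
The difference between the two block types can be sketched on plain vectors.
This is a deliberately simplified illustration: the main and shortcut paths
are passed in as ordinary functions standing in for the Conv2D and batch
normalization layers, and all names are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def identity_block(x, main_path):
    """y = ReLU(F(x) + x): the shortcut carries x unchanged."""
    fx = main_path(x)
    return relu([f + s for f, s in zip(fx, x)])


def convolutional_block(x, main_path, shortcut_path):
    """y = ReLU(F(x) + S(x)): the shortcut also transforms x so that the
    two paths have matching dimensions."""
    fx = main_path(x)
    sx = shortcut_path(x)
    return relu([f + s for f, s in zip(fx, sx)])


# If the main path contributes nothing, the identity block returns its
# (non-negative) input unchanged, so the layers are effectively skipped.
x = [1.0, 2.0, 3.0]
y = identity_block(x, lambda v: [0.0 for _ in v])

# In the convolutional block both paths change the size (here: halve it).
z = convolutional_block(
    [1.0, 2.0, 3.0, 4.0],
    main_path=lambda v: [0.5 for _ in v[::2]],
    shortcut_path=lambda v: v[::2],
)
print(y, z)
```

The addition happens before the final ReLU in both cases, which mirrors the
ordering described above.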

To train the ResNet-50 models, the Adam optimizer with its default learning
rate of 0.001 has been used, with categorical cross-entropy as the loss
function. A batch size of 1024 provided the best accuracy in the study [11];
following that study, the batch size has been set to 1024. Furthermore, all
the approaches have been trained for 100 epochs.

3.5 Train, Test, and Validation


Three different datasets have been created after the removal of outliers and
DCGAN image generation. In each approach, the images have been split so that
70% of the images are in the training set, 20% are in the test set, and 10%
are in the validation set. To avoid bias, the DCGAN-generated images are
added only to the training set, after the split, to balance it. The total
number of images and the sizes of the training, test, and validation sets
are presented in Table 3.

Table 3 Overview of the dataset for all approaches


Datasets                         Total number of images  Train set  Test set  Validation set
Original dataset                 729,750                 547,131    109,777   72,842
Outlier removed dataset          727,849                 545,724    109,467   72,658
DCGAN + Outlier removed dataset  737,659                 555,224    109,777   72,658
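
The order of operations described for the split (split the real images
first, then add generated images to the training set only) can be sketched
as follows. The fractions come from the text; the file names, class size,
and random seed are made up for illustration.

```python
import random


def split_class(images, seed=0, train_frac=0.7, test_frac=0.2):
    """Shuffle one class and split it into train, test, validation parts."""
    rng = random.Random(seed)
    shuffled = list(images)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Hypothetical class with 1000 real images and 300 generated images.
real_images = [f"real_{i}" for i in range(1000)]
generated_images = [f"gan_{i}" for i in range(300)]

train, test, validation = split_class(real_images)
train = train + generated_images  # synthetic images go to training only

print(len(train), len(test), len(validation))
```

Because the generated images are appended after the split, the test and
validation sets contain real images only.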

4 Results
To improve the performance of Bangla handwritten character recognition, a
semi-supervised image outlier detection model has first been proposed, and a
generative adversarial network model has then been used to balance the
dataset. For both strategies, a subset of the 122 classes has been chosen
based on the recommendations of other works [37] and on domain knowledge of
Bangla handwritten characters. Outliers have been excluded from 7 classes
using an autoencoder-based model, and 5 classes have been balanced using the
DCGAN model. In this section, the outcomes of the experiments are explained
in detail.

4.1 Result Analysis


The Ekush dataset has been classified using the ResNet-50 framework on three
different datasets (the original dataset, the outlier-removed dataset, and
the outlier-removed dataset with DCGAN-generated images). The ResNet-50 model
has achieved 97.63% test accuracy on the original dataset consisting of 122
classes. The second approach, where the outliers are removed from seven
classes, has achieved 97.95% test accuracy. The final approach, where
outliers are removed from the original dataset and DCGAN-generated images are
used to balance the training set, has achieved 97.92% test accuracy. The
precision, recall, F1-score, and accuracy yielded by the ResNet-50 model for
the three approaches are shown in Table 4. It shows that the ResNet-50 models
trained on the outlier-removed dataset and on the dataset balanced with
DCGAN-generated images both outperform the model trained on the original
dataset.

Table 4 Performance of all proposed approaches


Methods                          Precision (%)  Recall (%)  F1 Score (%)  Test accuracy (%)
Original dataset                 97.64          97.63       97.62         97.63
Outlier removed dataset          97.96          97.95       97.95         97.95
DCGAN + Outlier removed dataset  97.93          97.92       97.92         97.92

4.1.1 Result of Outlier Detection


The model accuracy has improved from 97.63% to 97.95% after outliers are
removed from seven classes of the dataset, demonstrating the benefit of
outlier removal. The same trend can be observed in individual classes. In
Table 5, the precision for classes 76, 87, and 97 has increased by 4, 1, and
5 percentage points, respectively, compared to the original dataset. In
classes 84 and 111, the recall has improved by 1 and 6 percentage points,
which also indicates an improvement of the classifier. For four of the seven
classes (76, 84, 97, and 111), the F1-score is higher than on the original
dataset, indicating that their images are classified better. The remaining
three classes (19, 69, and 87) show no change in F1-score. However, a few
classes have experienced small drops in precision or recall even after the
outliers were removed. One reason is that some noise remains in those classes
even though the outliers have been eliminated. Another explanation is that
removing outliers also removes some valid images from these classes,
resulting in a dataset that is less balanced than the original. These drops
are small (at most 2 percentage points in Table 5). Removing outliers from
specific classes has also reduced the number of false positives and false
negatives for classes other than those discussed. This, along with the
improved performance in these specific classes, is the key to the better
overall performance. Overall, the results indicate that the classifier
performs better when outliers are excluded from the original dataset.

Table 5 Classifier evaluation after outlier exclusion


        Precision            Recall               F1-Score
Class   Original  Outlier    Original  Outlier    Original  Outlier
        dataset   removed    dataset   removed    dataset   removed
19      0.95      0.95       0.98      0.96       0.96      0.96
69      0.91      0.89       0.94      0.94       0.92      0.92
76      0.90      0.94       0.89      0.89       0.89      0.92
84      0.95      0.95       0.92      0.93       0.93      0.94
87      0.97      0.98       0.97      0.96       0.97      0.97
97      0.88      0.93       0.95      0.94       0.92      0.93
111     0.86      0.84       0.62      0.68       0.72      0.75
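
The F1-scores in Table 5 are the harmonic mean of precision and recall. The
short sketch below recomputes the class 111 entries from the rounded table
values; because the published precision and recall are themselves rounded to
two decimals, recomputed F1-scores for other rows may differ from the table
in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 111 values taken from Table 5.
f1_original = f1_score(0.86, 0.62)
f1_outlier_removed = f1_score(0.84, 0.68)

print(round(f1_original, 2), round(f1_outlier_removed, 2))
```

For class 111 the gain in recall outweighs the small loss in precision,
which is why the F1-score rises even though precision drops.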

The improvement in classification performance shows the effectiveness of the
autoencoder-based outlier detection model on Bangla handwritten character
images. Table 6 shows the number of images discarded from the seven chosen
classes. The largest number of outliers has been removed from class 19, and
the smallest from class 111. The number of outliers removed also correlates
with the original size of the class: the more images a class has, the more
outliers have been removed.

Table 6 Outliers removed dataset details


Class No. of images in the original dataset No. of outliers removed
19 6180 272
69 5676 234
76 4446 185
84 5788 240
87 6136 257
97 4264 164
111 1028 42
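
The relationship between class size and the number of outliers removed can
be quantified directly from Table 6. The sketch below computes the share of
each class that was removed and the Pearson correlation between the two
columns; the choice of Pearson correlation is ours for illustration and is
not stated in the text.

```python
import math

# (class, images in the original dataset, outliers removed) from Table 6.
table_6 = [
    (19, 6180, 272),
    (69, 5676, 234),
    (76, 4446, 185),
    (84, 5788, 240),
    (87, 6136, 257),
    (97, 4264, 164),
    (111, 1028, 42),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


sizes = [size for _, size, _ in table_6]
removed = [count for _, _, count in table_6]
rates = {cls: count / size for cls, size, count in table_6}
r = pearson(sizes, removed)

for cls, rate in rates.items():
    print(f"class {cls}: {rate:.1%} removed")
print(f"correlation: {r:.3f}")
```

In every class roughly 4% of the images are removed, which is what produces
the strong correlation with class size.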

Fig. 9 presents a few representative inlier and outlier images from class 69
as detected by the model. By looking at the images, one can see that the
images in the right part of the figure are anomalous, while the images on the
left side are inliers. However, there are cases where outliers have not been
detected and inliers have been wrongly identified as outliers. Despite this,
the overall outlier detection scheme has been successful, as it has improved
the performance of the ResNet-50 model.

Fig. 9 Inliers versus outliers in class 69

Figure 10 also supports the effectiveness of the outlier detection model. The
inlier images of class 19 have been divided into four batches, and all the
images of each batch have been superimposed into a single image. Each batch
consists of approximately 1500 images. For comparison, the 272 outlier images
detected by the model have also been superimposed into a single image. It is
apparent from the figure that the superimposed inliers retain the inherent
shape of the character even with 1500 images. In contrast, only 272 outliers
are enough to make the corresponding superimposed image jumbled, which
further supports the effectiveness of the outlier detector.
Fig. 10 Superimposed inliers versus superimposed outliers
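
One simple way to superimpose a batch of images is to take the pixel-wise
mean. The text does not state which operation was used for Fig. 10, so the
sketch below is an assumption, shown on made-up 3 × 3 binary images.

```python
def superimpose(images):
    """Pixel-wise mean of equally sized 2D images."""
    count = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / count for c in range(width)]
            for r in range(height)]


# Four consistent "inliers": the same vertical stroke.
stroke = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
inliers = [stroke, stroke, stroke, stroke]

# Four inconsistent "outliers": strokes in different places.
outliers = [
    [[1, 0, 0], [1, 0, 0], [1, 0, 0]],
    [[0, 0, 1], [0, 0, 1], [0, 0, 1]],
    [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [1, 1, 1]],
]

mean_inliers = superimpose(inliers)
mean_outliers = superimpose(outliers)
print(mean_inliers)
print(mean_outliers)
```

Consistent images reinforce each other and keep the stroke at full
intensity, while inconsistent images spread low intensities over the whole
grid.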

4.1.2 Result of Balancing Dataset with DCGAN


Apart from the existence of outliers, the Ekush dataset is also imbalanced.
Five such imbalanced classes (72, 76, 97, 110, and 111) have been selected,
and their training sets have been balanced using DCGAN-generated images. No
generated image has been added to either the validation or the test set. The
test accuracy of ResNet-50 has improved from 97.63% to 97.92% after adding
synthesized images to the training set. Moreover, Table 7 shows that almost
all the evaluation metrics improve through the use of DCGAN-generated images.
Class 111 in particular has improved considerably. Only the precision of
class 72 has dropped, by 2 percentage points compared to the original
dataset. This means that when the classifier predicts class 72, it is correct
less often than the classifier trained on the original dataset. The drop in
precision for class 72 may be caused by noisy images generated in the DCGAN
experiment. However, the drop is small, and the corresponding recall has
increased, which means that the classifier identifies more of the images
belonging to this class than before. There are three classes (76, 97, and
111) to which both the outlier detection model and DCGAN have been applied.
For all three classes, ResNet-50 with DCGAN-generated images has outperformed
the ResNet-50 trained on the outlier-removed dataset. The reason for this
improvement is that the DCGAN model has been trained on those individual
classes after the removal of outliers, which has produced good-quality
images. The overall performance indicates that adding the proposed
DCGAN-generated images to the real dataset can improve the classification
result.

Table 7 Classifier evaluation after applying DCGAN


        Precision             Recall                F1-Score
Class   Original  Balanced    Original  Balanced    Original  Balanced
        dataset   with DCGAN  dataset   with DCGAN  dataset   with DCGAN
72      0.99      0.97        0.93      0.94        0.96      0.95
76      0.90      0.97        0.89      0.94        0.89      0.96
97      0.88      0.95        0.95      0.96        0.92      0.96
110     0.97      0.99        0.92      0.93        0.94      0.96
111     0.86      0.93        0.62      0.88        0.72      0.90

Figure 11 shows a comparison of original images and DCGAN-generated images.
From the figure, it is difficult to distinguish between original and
synthesized images without the labels, which suggests that the DCGAN has
generated good-quality images. However, for class 111, the generated images
are not up to the mark because fewer images were available to train the
DCGAN. Using this generative adversarial network has helped us tackle the
class imbalance problem. Excluding the five classes to which the DCGAN has
been applied, the average training size is nearly 4575 images per class. In
contrast, those five classes had only 2389 training images per class on
average. One class, class 111, had only 770 training instances, which led the
classifier to achieve only a 72% F1-score. After balancing only the training
set with 3594 synthesized images, the F1-score has improved to 90%.
Fig. 11 Some original images versus some DCGAN generated images
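
The arithmetic behind the balancing of class 111 can be spelled out in a few
lines. The figures 770 and 3594 come from the text and Table 2, and the
target of 4000 training images is the threshold stated for the
classification model; the function name is hypothetical.

```python
def synthetic_needed(n_train, target=4000):
    """Smallest number of generated images that reaches the target size."""
    return max(0, target - n_train)


real_111 = 770      # real training images in class 111
added_111 = 3594    # DCGAN images added to class 111 (Table 2)

minimum_111 = synthetic_needed(real_111)
final_111 = real_111 + added_111

print(minimum_111, final_111)
```

Class 111 therefore needs at least 3230 generated images to reach the
threshold, and the 3594 images actually added bring its training set to 4364
images.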

The changes in the training sample sizes are illustrated in Fig. 12. In
classes 110 and 111, more than 3000 synthesized images have been added, and
for the other three classes this number is around 1000. For four of these
five classes, the classifier performance in terms of the F1-score has
improved. Moreover, the overall performance of the ResNet-50 classifier
trained on the balanced dataset is better than that of the one trained on the
original dataset. This supports the applicability of DCGAN in generating
synthesized Bangla handwritten character images.
Fig. 12 Training size before versus training size after applying DCGAN

4.2 Overfitting Handling


Training and validation accuracy and loss are illustrated in Figs. 13, 14,
and 15. On both the original dataset and the outlier-removed dataset, the
ResNet-50 model fits the handwritten characters well, as illustrated in
Figs. 13 and 14. Fig. 15, in which the model is applied to the Ekush dataset
after outliers are removed and DCGAN is used, shows one exception. Except for
one epoch in the validation loss, the model predicts well, because the
training and validation accuracy and loss stay close to each other.
Additionally, in the learning curve of each approach, the training and
validation losses are initially high and then gradually decrease together,
indicating that the model is not overfitting. Though there is a slight gap
between the training and validation curves, it is too small to be considered
overfitting. However, the validation loss at epoch 88 rises to about 1.6,
which is high compared to the other epochs. This spike at epoch 88 may be due
to noise in the dataset: for this particular batch of images, the model is
unable to predict the classes correctly. No such spike appears for the
outlier-removed dataset or the original dataset, which suggests that there is
some noise in the DCGAN-generated images. Also, because the training and
validation sets are split randomly, this particular batch may have received
the noisiest images.
Fig. 13 Accuracy and loss of ResNet-50 on the original dataset

Fig. 14 Accuracy and loss of ResNet-50 on outlier removed dataset

Fig. 15 Accuracy and loss of ResNet-50 on outlier removed and DCGAN applied
dataset
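
A spike such as the one at epoch 88 can also be located programmatically
rather than by inspecting the plot. The sketch below flags epochs whose
validation loss exceeds a multiple of the median loss. Only the spike itself
(epoch 88, loss of about 1.6) comes from the text; the rest of the loss
curve, the threshold factor, and the function name are made up for
illustration.

```python
from statistics import median


def find_spikes(losses, factor=4.0):
    """Return 1-based epochs whose loss exceeds factor * median loss."""
    threshold = factor * median(losses)
    return [epoch for epoch, loss in enumerate(losses, start=1)
            if loss > threshold]


# Hypothetical smoothly decreasing validation loss over 100 epochs...
val_loss = [0.4 * 0.97 ** epoch + 0.1 for epoch in range(100)]
# ...with the reported spike inserted at epoch 88.
val_loss[87] = 1.6

spikes = find_spikes(val_loss)
print(spikes)
```

Using the median keeps a single outlying epoch from shifting the threshold,
which a mean-based rule would not guarantee.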
4.3 Comparison with State-of-the-Art
Outlier elimination on the Ekush dataset has not been reported before; to our
knowledge, we are the first to experiment on the outlier-removed Ekush
dataset. The authors of [22] used DCGAN only to enlarge the Ekush dataset,
and no classification was performed on the generated images. A comparison of
the current work with others that used only the Ekush dataset is given in
Table 8. Our ResNet-50 model on the original dataset has achieved 97.63%
accuracy on the test set (Table 4), which is higher than the results of
[10, 11, 20] but lower than EkushNet [4] and the single ResNet-50 of [37].
Shibly et al. [37] achieved the best test accuracy of 98.68% on the Ekush
dataset, but that was obtained through an ensemble of ten CNN models. Their
highest performance with a single CNN model was 97.81% using ResNet-50, which
both of our proposed methods exceed. Our work has also achieved better
performance than an ensemble [11] and deep CNN techniques [10, 20] applied to
the same dataset. Although we have applied outlier removal to only seven
classes and DCGAN to only five classes, both approaches outperform all the
other works listed in Table 8. However, the improvement is minor, as only
some of the 122 classes of the Ekush dataset have been considered in our
study. Still, the results indicate that the proposed outlier removal and
DCGAN approaches can improve the classification performance.

Table 8 Performance comparison with the state of the art


Work references  Methods                                                   Number of classes  Test accuracy (%)
[4]              CNN                                                       122                97.73
[10]             Deep CNN (Bengali handwritten alphabets of Ekush dataset) 50                 95.00
[37]             ResNet-50                                                 122                97.81
[20]             Deep CNN                                                  122                95.05
[11]             Stacked generalization ensemble method                    122                96.72
Proposed method  Outlier Removal + ResNet-50                               122                97.95
Proposed method  DCGAN + ResNet-50                                         122                97.92
5 Discussion
Removing outliers, applying DCGAN, and comparing the character recognition
results of these two approaches is a new experiment on the Ekush dataset.
ResNet-50 is one of the most popular models and can achieve very good results
on the Ekush dataset, as in [37]. In addition to mitigating the vanishing
gradient problem, the ResNet-50 model can achieve low error rates. Moreover,
through its skip connections, it can bypass layers that do not benefit the
output [50]. The results show that ResNet-50 performs better than the widely
used CNN models.
Outlier detection is beneficial when images may have been placed in the wrong
classes. The result analysis has shown that the test accuracy as well as the
precision, recall, and F1-scores improved after outlier detection was applied
to seven classes of the Ekush dataset. There is also an improvement in the
overall classification result for the 122 classes. Moreover, outlier
detection and elimination in three classes (76, 97, and 111) helped our DCGAN
generate good-quality images. However, some of the classes chosen for outlier
detection have a small volume of data, so training the model with this
limited data reduced the precision. The outcome could be better if outlier
detection were applied to the whole dataset.
Using DCGAN-generated images as an augmentation technique on the
outlier-removed dataset has improved the test accuracy by 0.29 percentage
points over the original Ekush dataset. DCGAN has not only increased the size
of the dataset but also created variant images that add information to the
original dataset. In our study, only five classes have been augmented by the
DCGAN approach, and the number of generated images only brings the training
set of each of these classes to about four thousand. The rest of the dataset
still has imbalanced classes besides the chosen ones. Yet even with a small
number of generated images, the study has shown an improvement in the
classification result. If we could generate more images for these classes,
the accuracy might improve further. However, as mentioned in [23], we should
be careful not to generate too many images, as doing so may degrade the
performance.
6 Conclusion
Handwritten character recognition is a widely known research problem. This
study adopts a two-fold approach on one of the largest Bangla handwritten
datasets, the Ekush dataset, with the ResNet-50 classifier. In the first
approach, outliers are detected and eliminated, which achieves a test
accuracy of 97.95%. In the second approach, DCGAN is used to generate images
that balance the training set, which yields an accuracy of 97.92%. The
results could be improved further if the adopted approaches were applied to
the whole dataset. Because of limited computing resources, we have taken only
a few classes of the Ekush dataset for our experiments. Despite this, the
results obtained from the adopted approaches are better than those of the
majority of the related works. In the future, other Bangla handwritten
character datasets may also be used to evaluate the efficacy of these
methods. In addition, other classifier models, such as VGG-16, Xception,
DenseNet, and AlexNet, can also be explored with these two proposed methods.

Data Availability Statement


All the codes and the dataset can be accessed at the following repositories.
Tanzina Akter Tani and Shibly, Moynuddin Ahmed (2022): Codes. figshare.
Journal contribution. https://doi.org/10.6084/m9.figshare.18933470.
Tanzina Akter Tani and Shibly, Moynuddin Ahmed (2022): Dataset. figshare.
Dataset. https://doi.org/10.6084/m9.figshare.18931760.
DCGAN generated images: http://doi.org/10.6084/m9.figshare.14754309.

References
1. Yuan, A., Bai, G., Jiao, L., & Liu, Y. (2012). Offline handwritten English
character recognition based on convolutional neural network. In Proceedings 10th
IAPR International Workshop on Document Analysis Systems, DAS 2012 (pp.
125–129). https://doi.org/10.1109/DAS.2012.61
2. Kimura, F., Wakabayashi, T., Tsuruoka, S., & Miyake, Y. (1997). Improvement of
handwritten Japanese character recognition using weighted direction code
histogram. Pattern Recognition, 30(8), 1329–1337.
https://doi.org/10.1016/S0031-3203(96)00153-7

3. Ciresan, D. C., Meier, U., & Schmidhuber, J. (2012). Transfer learning for Latin
and Chinese characters with deep neural networks. In Proceedings of the
international joint conference on neural networks (pp. 1–6).
https://doi.org/10.1109/IJCNN.2012.6252544

4. Azad Rabby, A. K. M. S., Haque, S., Abujar, S., & Hossain, S. A. (2018).
Ekushnet: Using convolutional neural network for Bangla handwritten
recognition. Procedia Computer Science, 143, 603–610.
https://doi.org/10.1016/j.procs.2018.10.437

5. Ahmed, S., et al. (2019). Hand sign to bangla speech: A deep learning in vision
based system for recognizing hand sign digits and generating bangla speech.
https://doi.org/10.2139/ssrn.3358187

6. Manisha, N., Sreenivasa, E., & Krishna, Y. (2016). Role of offline handwritten
character recognition system in various applications. International Journal of
Computer Applications. https://doi.org/10.5120/ijca2016908349

7. Rahman, Md. M., Akhand, M. A. H., Islam, S., Chandra Shill, P., & Hafizur
Rahman, M. M. (2015). Bangla handwritten character recognition using
convolutional neural network. International Journal of Image, Graphics and
Signal Processing, 7(8), 42–49. https://doi.org/10.5815/ijigsp.2015.08.05

8. Ghosh, T., Abedin, M. H. Z., Al Banna, H., Mumenin, N., & Abu Yousuf, M.
(2021). Performance analysis of state of the art convolutional neural network
architectures in Bangla handwritten character recognition. Pattern Recognition
and Image Analysis, 31(1), 60–71. https://doi.org/10.1134/S1054661821010089

9. Chowdhury, R. R., Hossain, M. S., ul Islam, R., Andersson, K., & Hossain, S.
(2019). Bangla handwritten character recognition using convolutional neural
network with data augmentation. In 2019 Joint 8th international conference on
informatics, electronics & vision (ICIEV) and 2019 3rd international conference
on imaging, vision & pattern recognition (icIVPR) (pp. 318–323).
https://doi.org/10.1109/ICIEV.2019.8858545
10. Ahmed, S., Tabsun, F., Reyadh, A. S., Shaafi, A. I., & Shah, F. M. (2019). Bengali
handwritten alphabet recognition using deep convolutional neural network. In 5th
International conference on computer, communication, chemical, materials and
electronic engineering, IC4ME2 2019.
https://doi.org/10.1109/IC4ME247184.2019.9036572

11. Shibly, M. M. A., Tisha, T. A., & Ripon, S. H. (2021). Stacked generalization
ensemble method to classify Bangla handwritten character. In Proceedings of
international conference on sustainable expert systems. Lecture Notes in
Networks and Systems 176. https://doi.org/10.1007/978-981-33-4355-9_46

12. Mamun, M. R., Al Nazi, Z., & Yusuf, M. S. (2018). Bangla handwritten digit
recognition approach with an ensemble of deep residual networks. In
International conference on bangla speech and language processing, ICBSLP
2018 (pp. 21–22). https://doi.org/10.1109/ICBSLP.2018.8554674

13. Goodfellow, I., et al. (2014). Generative adversarial nets. Advances in Neural
Information Processing Systems, 27.

14. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., & Basu, D. K. (2009). A
hierarchical approach to recognition of handwritten Bangla characters. Pattern
Recognition, 42(7), 1467–1484. https://doi.org/10.1016/j.patcog.2009.01.008

15. Bhowmik, T. K., Ghanty, P., Roy, A., & Parui, S. K. (2009). SVM-based
hierarchical architectures for handwritten Bangla character recognition.
International Journal on Document Analysis and Recognition, 12(2), 97–108.
https://doi.org/10.1007/s10032-009-0084-x

16. Bhattacharya, U., Gupta, B. K., & Parui, S. K. (2007). Direction code based
features for recognition of online handwritten characters of Bangla. In
Proceedings of the international conference on document analysis and
recognition, ICDAR, 2007. https://doi.org/10.1109/ICDAR.2007.4378675

17. Chowdhury, R. R., Hossain, M. S., Ul Islam, R., Andersson, K., & Hossain, S.
(2019). Bangla handwritten character recognition using convolutional neural
network with data augmentation. In 2019 Joint 8th international conference on
informatics, electronics and vision, ICIEV 2019 and 3rd international conference
on imaging, vision and pattern recognition, icIVPR 2019 with international
conference on activity and behavior computing, ABC 2019 (pp. 318–323).
https://doi.org/10.1109/ICIEV.2019.8858545
18. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Bangla handwritten digit
recognition using autoencoder and deep convolutional neural network. In IWCI
2016-2016 International Workshop on Computational Intelligence.
https://doi.org/10.1109/IWCI.2016.7860340

19. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Image augmentation by
blocky artifact in deep convolutional neural network for handwritten digit
recognition. In IEEE international conference on imaging, vision and pattern
recognition, icIVPR 2017 (pp. 1–6).
https://doi.org/10.1109/ICIVPR.2017.7890867

20. Mashrukh Zayed, M., Neyamul Kabir Utsha, S. M., & Waheed, S. (2021).
Handwritten bangla character recognition using deep convolutional neural
network: Comprehensive analysis on three complete datasets. Advances in
Intelligent Systems and Computing. https://doi.org/10.1007/978-981-33-4673-4_7

21. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation
learning with deep convolutional generative adversarial networks. In 4th
International conference on learning representations, ICLR 2016-conference track
proceedings.

22. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A.
(2018). OnkoGan: Bangla handwritten digit generation with deep convolutional
generative adversarial networks. In Recent Trends in image processing and
pattern recognition, second international conference, RTIP2R 2018, Solapur,
India, 21–22 Dec 2018, Revised Selected Papers, Part III, 2018, vol. 1037 (pp.
108–117). https://doi.org/10.1007/978-981-13-9187-3_10

23. Jha, G., & Cecotti, H. (2020). Data augmentation for handwritten digit recognition
using generative adversarial networks. Multimedia Tools and Applications.
https://doi.org/10.1007/s11042-020-08883-w

24. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for
segmentation of retinal blood vessels in fundus images. Iranian Journal of
Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.
https://doi.org/10.1007/s40998-019-00213-7

25. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN
for brain tumor classification. Applied Sciences, 10(14), 4915.
https://doi.org/10.3390/app10144915
26. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep
convolutional neural network for environmental sound classification via dilation.
Journal of Intelligent & Fuzzy Systems, 43(2), 1827–1833.
https://doi.org/10.3233/JIFS-219283

27. Roy, S. S., et al. (2022). L2 regularized deep convolutional neural networks for
fire detection. Journal of Intelligent & Fuzzy Systems, 43(2), 1799–1810.
https://doi.org/10.3233/JIFS-219281

28. Reddy, A. S. B., & Juliet, D. S. (2019). Transfer learning with ResNet-50 for
malaria cell-image classification. In International Conference on Communication
and Signal Processing (ICCSP) (pp. 945–949).
https://doi.org/10.1109/ICCSP.2019.8697909

29. Rezende, E., Ruppert, G., Carvalho, T., Ramos, F., & de Geus, P. (2017).
Malicious software classification using transfer learning of ResNet-50 deep neural
network. In Proceedings of the 16th IEEE international conference on machine
learning and applications, ICMLA 2017 (pp. 1011–1014).
https://doi.org/10.1109/ICMLA.2017.00-19

30. Alif, M. A. R., Ahmed, S., & Hasan, M. A. (2017). Isolated Bangla handwritten
character recognition with convolutional neural network. In 2017 20th
International conference of computer and information technology (ICCIT) (pp. 1–
6).

31. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2018).
Handwritten Bangla character recognition using the state-of-the-art deep
convolutional neural networks. Computational Intelligence and Neuroscience.
https://doi.org/10.1155/2018/6747098

32. Khan, M. M., Uddin, M. S., Parvez, M. Z., & Nahar, L. (2022). A squeeze and
excitation ResNeXt-based deep learning model for Bangla handwritten compound
character recognition. Journal of King Saud University Computer and Information
Sciences, 34(6), 3356–3364. https://doi.org/10.1016/j.jksuci.2021.01.021

33. Rabby, A. K. M. S. A., Haque, S., Islam, M. S., Abujar, S., & Hossain, S. A.
(2019). Ekush: A multipurpose and multitype comprehensive database for online
off-line Bangla handwritten characters. Communications in Computer and
Information Science. https://doi.org/10.1007/978-981-13-9187-3_14

34. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., & Basu, D. K. (2012).
CMATERdb1: A database of unconstrained handwritten Bangla and Bangla-
English mixed script document image. International Journal on Document
Analysis and Recognition. https://doi.org/10.1007/s10032-011-0148-6

35. Biswas, M., et al. (2017). BanglaLekha-Isolated: A multi-purpose comprehensive
dataset of handwritten Bangla isolated characters. Data in Brief.
https://doi.org/10.1016/j.dib.2017.03.035

36. Alom, Z., Sidike, P., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla digit
recognition using deep learning, p. 1712.

37. Shibly, M. M. A., Tisha, T. A., Tani, T. A., & Ripon, S. (2021). Convolutional
neural network-based ensemble methods to recognize Bangla handwritten
character. PeerJ Computer Science, 7, 1–30. https://doi.org/10.7717/peerj-cs.565

38. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2017).
Handwritten bangla character recognition using the state-of-art deep convolutional
neural networks, p.1712.

39. Sikder, M. F. (2020). Bangla handwritten digit recognition and generation. In:
Proceedings of international joint conference on computational intelligence (pp.
547–556).

40. Rahman, M. S. (2016). Towards optimal convolutional neural network parameters
for bengali handwritten numerals recognition. In 19th international conference on
computer and information technology (ICCIT) (pp. 431–436).

41. Nishat, Z. K., & Shopon, M. (2019). Synthetic class specific Bangla handwritten
character generation using conditional generative adversarial networks. In 2019
International conference on bangla speech and language processing (ICBSLP
2019). https://doi.org/10.1109/ICBSLP47725.2019.201475

42. Chaudhuri, B. B. (2006). A complete handwritten numeral database of Bangla-A
major Indic script. In 10th international workshop on frontiers of handwriting
recognition (IWFHR), La Baule, France.

43. Alam, S., Reasat, T., Doha, R. M., & Humayun, A. I. (2018). NumtaDB-
assembled Bengali handwritten digits, pp 1–4.

44. Kramer, M. A. (1991). Nonlinear principal component analysis using
autoassociative neural networks. AIChE Journal, 37(2), 233–243.
https://doi.org/10.1002/aic.690370209

45. Bank, D., Koenigstein, N., & Giryes, R. (2020). Autoencoders. In Machine
learning: Methods and applications to brain disorders (pp. 193–208).
https://doi.org/10.1016/B978-0-12-815739-8.00011-0

46. Alqahtani, H., Kavakli-Thorne, M., & Kumar, G. (2021). Applications of
generative adversarial networks (GANs): An updated review. Archives of
Computational Methods in Engineering, 28(2), 525–552.
https://doi.org/10.1007/s11831-019-09388-y

47. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A.
(2019). OnkoGan: Bangla handwritten digit generation with deep convolutional
generative adversarial networks. Communications in Computer and Information
Science. https://doi.org/10.1007/978-981-13-9187-3_10

48. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization.
Preprint at arXiv arXiv:1412.6980.

49. Theckedath, D., & Sedamkar, R. R. (2020). Detecting affect states using VGG16,
ResNet50 and SE-ResNet50 networks. SN Computer Science.
https://doi.org/10.1007/s42979-020-0114-9

50. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
recognition. In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_2

Deep Learning-Based Approaches Using Feature Selection Methods for Automatic
Diagnosis of COVID-19 Disease from X-Ray Images
Burak Taşci1
(1) Fırat University Vocational School of Technical Sciences Elazığ, Elazığ, Turkey

Burak Taşci
Email: btasci@firat.edu.tr

Keywords COVID-19 – Deep learning – Pre-trained models – Feature selections

1 Introduction
The novel coronavirus pandemic (COVID-19) created worldwide chaos in a very short time.
As of July 2021, more than 206 million confirmed cases had been reported worldwide, and
the number of deaths due to COVID-19 had exceeded 4 million [1]. Many countries have
developed policies to cope with the pandemic and minimize its effects. In particular,
Turkey is among the few countries that set an example to the world through early measures
and social isolation rules. Taking early action is vital for COVID-19 and similar
pandemics. If COVID-19 cases can be detected early, patients can be isolated so that
uninfected individuals remain safe. Science and technology contribute greatly to the
precautionary policies implemented to this end. One of the most important contributions is
predicting how the pandemic will evolve. Two main approaches stand out here: the first
comprises statistical approaches and mathematical models; the second comprises artificial
intelligence-based approaches, which have received more attention in recent years.
In the literature, there are various approaches for disease detection using biomedical
images based on machine learning and deep learning methods [2–8].
Javaheri et al. [9] used BCDU-Net (U-Net) to distinguish COVID-19, CAP, and other
diseases in 89,145 images obtained from five hospitals. They reported 91.66% accuracy,
87.5% sensitivity, 95% AUC, and 94% specificity. Rehmen et al. [10] used CT and X-ray
images of 200 COVID-19(+), 200 healthy, 200 bacterial pneumonia, and 200 viral pneumonia
cases. With ResNet101 transfer learning, the reported results were 98.75% accuracy, 97.5%
sensitivity, 96.43% precision, and 100% specificity. JavadiMoghaddam et al. [11] proposed
a deep learning model called Wavelet CNN-4, which consists of a wavelet layer, four
convolution layers, and a Squeeze Excitation Block in the coupling layer. They compared
the proposed model with pre-trained models such as VGG11, ResNet18, ResNet50, and
Inception-v3; the proposed model achieved 99.03% accuracy. Chen et al. [12] used U-Net++
to detect COVID-19 and other diseases in 35,355 images, obtaining 98.85% accuracy, 94.34%
sensitivity, 99.16% specificity, 88.37% precision, and 99.4% AUC. Wu et al. [13] used CT
images of 368 COVID-19(+) cases and 127 cases of other diseases. With ResNet50 transfer
learning, the reported results were 76% accuracy, 81.1% sensitivity, 61.5% AUC, and 81.9%
specificity. Mobiny et al. [14] used 349 COVID-19(+) and 397 COVID-19(-) CT images. After
preprocessing with GAN, rescaling, and cropping, the images were fed to the DECAPS and
DECAPS + Peekaboo architectures. The DECAPS + Peekaboo method reached 87.6% accuracy,
84.3% sensitivity, 87.1% F1-score, and 85.2% specificity.
Balaha et al. [15] proposed a hybrid learning and optimization approach based on
pre-trained models to detect COVID-19. The Harris Hawks Optimization (HHO) algorithm was
used to optimize the hyperparameters, and data augmentation was performed on a
combination of three publicly available datasets. The Weighted Summation Method (WSM) was
used as an evaluation metric to compare combinations of models; the best accuracy was
99.33%, with VGG19. Li et al. [16] proposed COVNet, an automated deep learning framework
for identifying COVID-19 in chest CT scans. A dataset of 4356 chest CT images was
reported to be used to build the model. In distinguishing COVID-19 patients from other
pneumonia patients, the model obtained a sensitivity of 87% and an area under the curve
(AUC) of 0.95. He et al. [17] used 349 COVID-19(+) and 397 COVID-19(-) CT images, with
the Self-Trans method as preprocessing. With DenseNet-169 transfer learning, the results
were 85% F1-score, 94% AUC, and 86% accuracy. Ahamed et al. [18] trained their proposed
model on datasets of chest X-ray and CT images. Images were preprocessed and enlarged
before entering the proposed ResNet50V2 model, and extra layers were added to the base
model with regularization and fine-tuning. They classified the images in two-class,
three-class, and four-class settings, with and without preprocessing. In the three-class
setting, the model achieved 99.01% accuracy with preprocessing and 83.6% without it.
Pathak et al. [19] used CT images of 413 COVID-19(+) cases and 439 normal or pneumonia
cases. ResNet50 was used for feature extraction as preprocessing, and a CNN was used for
classification. The CNN reached 93.01% accuracy, 91.45% sensitivity, 94.77% specificity,
and 95.18% precision. Shi et al. [20] applied a machine learning algorithm, Random Forest
(RF), to screen for COVID-19, using CT images of 2685 patients to evaluate the models.
Evaluated with fivefold cross-validation, the model achieved accuracy, sensitivity, and
specificity of 87.9%, 90.7%, and 83.3%, respectively.
The following are the primary contributions of this study:
• The proposed model exploits the classification capabilities of features derived from
  the pre-trained deep architectures AlexNet and ResNet101.
• The study examines the Chi-square, NCA, mRMR, and ReliefF feature selection algorithms
  in order to reduce the number of features obtained from pre-trained deep neural
  networks and identify the most effective deep features. The AlexNet and ResNet101
  features that give the highest results are combined, and the mRMR feature selection
  algorithm is applied to the combined features. In experimental studies, a highly
  successful diagnostic model was obtained by using these selected features for chest
  X-ray image classification.
• Deep features were obtained using pre-trained CNNs, and those features were used to
  optimize the parameters of the best SVM classifier. This method achieved the highest
  performance, with a score of 98.21%, in the classification of chest X-ray images.
The rest of the paper is organized as follows: the second section presents the material
and method, the third section the experimental studies and results, and the fourth
section the discussion.

2 Material and Method


2.1 Methodology
This research proposes an efficient method for detecting COVID-19 with high accuracy
using pre-trained network models. Figure 1 depicts the workflow of the approach. First,
preprocessing techniques are applied to the X-ray images, primarily to improve
classification performance. To highlight salient regions in the X-ray images and reduce
the number of gray tones, the image gradient was computed with the Sobel operator. In the
second step, Marker-Controlled Watershed Segmentation (MCWS) was used to segment the
regions in the gradient images. In the last step, features were extracted with 13
pre-trained models. The number of extracted features was reduced using the Chi-square,
NCA, mRMR, and ReliefF feature selection methods, and the selected features were given to
13 different classifiers. The highest performance was observed with AlexNet and
ResNet101, so these two architectures were reused for feature extraction. The FC8 layer
of AlexNet and the FC1000 layer of ResNet101 each yield 1000 features. In the proposed
method, feature extraction was carried out in both the training and testing phases. The
combined 2000 (1000 + 1000) features were reduced to 200 features with the mRMR feature
selection method. Finally, the selected features were again given to 13 different
classifiers, and the highest performance was obtained with SVM.
Fig. 1 Framework of the proposed approach
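The feature fusion and reduction step of this workflow (1000 + 1000 features reduced to 200) can be illustrated with a minimal sketch. It is not the author's MATLAB implementation: the random activations stand in for the FC8 and FC1000 outputs, and the ranking is a hypothetical placeholder for the order that mRMR would produce.

```python
import random

def fuse_features(alexnet_rows, resnet_rows):
    """Concatenate per-image feature vectors from the two networks."""
    return [a + r for a, r in zip(alexnet_rows, resnet_rows)]

def keep_top_ranked(rows, ranked_indices, keep):
    """Keep only the `keep` best-ranked feature columns, in rank order."""
    chosen = ranked_indices[:keep]
    return [[row[i] for i in chosen] for row in rows]

rng = random.Random(0)
n_images = 4
# Placeholder activations standing in for AlexNet FC8 and ResNet101 FC1000 outputs.
alexnet_rows = [[rng.random() for _ in range(1000)] for _ in range(n_images)]
resnet_rows = [[rng.random() for _ in range(1000)] for _ in range(n_images)]

fused = fuse_features(alexnet_rows, resnet_rows)       # 4 images x 2000 features
# Hypothetical ranking; in the study this order would come from mRMR.
ranking = list(range(1999, -1, -1))
selected = keep_top_ranked(fused, ranking, keep=200)   # 4 images x 200 features
```

The reduced matrix `selected` is what would then be passed to the classifiers.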

2.1.1 Preprocessing
The gradient method is applied to the input images. Gradient magnitudes and directions
are calculated from the directional gradients.
The watershed method is usually applied to the gradient of the image. Using the 8
neighbors of each pixel, the directions of steepest change in the image are detected
[21]. Pixels at local minima of the image are marked with individual identifiers. Using
the gradient information, the descending paths are followed, and the watershed method
associates every pixel with its respective minimum [22].
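As an illustration of the gradient step, the following minimal sketch computes the Sobel gradient magnitude and direction of a small grayscale image with the standard 3 × 3 Sobel kernels. The toy image and the choice to leave border pixels at zero are assumptions made for the example; the study itself was carried out in MATLAB.

```python
import math

# Sobel kernels for horizontal (x) and vertical (y) intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_gradient(image):
    """Return (magnitude, direction) grids for a 2-D list of gray values.

    Border pixels are left at zero because the 3x3 window does not fit there.
    """
    rows, cols = len(image), len(image[0])
    magnitude = [[0.0] * cols for _ in range(rows)]
    direction = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    pixel = image[r + dr][c + dc]
                    gx += SOBEL_X[dr + 1][dc + 1] * pixel
                    gy += SOBEL_Y[dr + 1][dc + 1] * pixel
            magnitude[r][c] = math.hypot(gx, gy)
            direction[r][c] = math.atan2(gy, gx)
    return magnitude, direction

# Toy image: dark left half, bright right half (a vertical edge).
toy_image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
toy_magnitude, toy_direction = sobel_gradient(toy_image)
```

The magnitude is large only at the columns where the intensity jumps, which is the kind of map a watershed segmentation would then be applied to.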

2.1.2 Feature Selection Algorithms


In short, feature selection builds a feature vector that is smaller and more useful than
the original one, yet equivalent to it for classification, by choosing a subset of the
class-related features obtained from deep learning models.

Neighborhood Component Analysis (NCA)


NCA is a feature weighting method that selects the optimal subset of features by
maximizing an objective function that evaluates classification accuracy over the training
data. To obtain the weight vector w for the feature vectors x_i, the approach optimizes a
nearest-neighbor classifier [23]. Within the NCA framework, a reference point x_j is
selected for each sample x_i, and x_i takes the label of that reference point. The closer
two samples are, the higher the probability that x_j is selected as the reference point
for x_i. Closeness is measured with the weighted distance D_w given in Eq. 1.

$$D_w(x_i, x_j) = \sum_{m=1}^{d} w_m^2 \, |x_{im} - x_{jm}| \tag{1}$$

Here w_m is the weight assigned to the mth feature and d is the number of features. A
kernel function k that returns large values for small D_w links the probability P_ij to
the weighted distance D_w. P_ij is defined by the following equation:

$$P_{ij} = \frac{k(D_w(x_i, x_j))}{\sum_{l \neq i} k(D_w(x_i, x_l))}, \quad i \neq j \tag{2}$$

Also, P_ii = 0. The kernel function is defined as k(z) = exp(−z/σ), where σ is the
kernel width, which affects the probability that sample x_j is selected as the reference
point. The probability that x_i is classified correctly is written in Eq. 3, where
y_ij = 1 if x_i and x_j have the same class label and y_ij = 0 otherwise.

$$P_i = \sum_{j} y_{ij} P_{ij} \tag{3}$$
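A minimal sketch of these quantities follows, assuming the usual NCA feature-selection form in which the weighted distance is the sum of w_m² |x_im − x_jm| and the kernel is exp(−z/σ). The toy samples and the two weight vectors are hypothetical and only show that weighting an informative feature raises the probability of correct classification.

```python
import math

def weighted_distance(x_i, x_j, weights):
    """D_w(x_i, x_j) = sum_m w_m^2 * |x_im - x_jm|."""
    return sum(w * w * abs(a - b) for w, a, b in zip(weights, x_i, x_j))

def nca_correct_probabilities(samples, labels, weights, sigma=1.0):
    """Return p_i, the leave-one-out probability that sample i is classified correctly."""
    n = len(samples)
    probabilities = []
    for i in range(n):
        # Kernel values exp(-D_w / sigma) for every candidate reference j != i.
        kernel = [
            0.0 if j == i
            else math.exp(-weighted_distance(samples[i], samples[j], weights) / sigma)
            for j in range(n)
        ]
        total = sum(kernel)
        # p_i sums p_ij over the references j that share the label of sample i.
        same_class = sum(kernel[j] for j in range(n) if labels[j] == labels[i])
        probabilities.append(same_class / total)
    return probabilities

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
toy_samples = [[0.0, 5.0], [0.1, -3.0], [1.0, 4.0], [1.1, -4.0]]
toy_labels = [0, 0, 1, 1]
informative = nca_correct_probabilities(toy_samples, toy_labels, weights=[3.0, 0.0])
noisy = nca_correct_probabilities(toy_samples, toy_labels, weights=[0.0, 1.0])
```

NCA searches for the weight vector that maximizes the sum of these probabilities (usually with a regularization term), so features whose weights stay near zero can be discarded.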

ReliefF
One of the best-known feature selection approaches is the Relief algorithm, which
estimates the quality of features by assigning weights to them. If a feature is useful,
the nearest samples of the same class are expected to be closer to one another along that
feature than the nearest samples of any other class [24]. The feature weights are
obtained by solving a convex optimization problem. However, the Relief algorithm is
limited to two-class problems and cannot process incomplete data. ReliefF, an enhanced
version of Relief, was proposed to address these and other difficulties; it is more
robust and can deal with noisy and incomplete data. ReliefF works as follows: first, a
sample R_i is randomly selected; then its k nearest neighbors from the same class, called
H_j, and its k nearest neighbors from each of the other classes, called M_j(C), are
selected. Depending on the values of R_i, H_j, and M_j(C), the weight w[A] is updated for
every feature A. Feature weights range from −1 to +1, and the largest positive values
indicate the most important features. This process is repeated a user-defined number of
times. The diff function calculates the difference, that is, the distance, between two
samples on a feature. Its calculation depends on whether the feature is nominal or
numeric. Let I_1 and I_2 be samples and A be a feature; the difference is calculated as
in Eq. 4 for nominal features and as in Eq. 5 for numeric features. Using several
neighbors increases the robustness of the algorithm against noisy data. The value of k
can be set by the user, but if k is chosen as 1, the algorithm is sensitive to noisy
data. In many studies k was chosen as 10, although trying different values of k is useful
when examining the importance of the features. Finally, choosing k too small likewise
leads to poor results.

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & \mathrm{value}(A, I_1) = \mathrm{value}(A, I_2) \\ 1, & \text{otherwise} \end{cases} \tag{4}$$

$$\mathrm{diff}(A, I_1, I_2) = \frac{|\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)|}{\max(A) - \min(A)} \tag{5}$$
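The weight update can be sketched as follows for numeric features only, assuming the common ReliefF formulation in which misses from each other class are weighted by that class's prior probability. The toy data, the function names, and the fixed random seed are hypothetical choices for the example.

```python
import random

def diff(feature, a, b, ranges):
    """Normalised difference of two samples on one numeric feature."""
    return abs(a[feature] - b[feature]) / ranges[feature]

def relieff(samples, labels, k=2, iterations=20, seed=0):
    rng = random.Random(seed)
    n_features = len(samples[0])
    features = range(n_features)
    # Value range of every feature; a constant feature gets range 1 to avoid dividing by 0.
    ranges = [(max(s[f] for s in samples) - min(s[f] for s in samples)) or 1.0
              for f in features]
    priors = {c: labels.count(c) / len(labels) for c in set(labels)}
    weights = [0.0] * n_features

    def nearest(index, target_class):
        """Indices of the k samples of target_class closest to samples[index]."""
        candidates = [j for j in range(len(samples))
                      if j != index and labels[j] == target_class]
        candidates.sort(key=lambda j: sum(
            diff(f, samples[index], samples[j], ranges) for f in features))
        return candidates[:k]

    for _ in range(iterations):
        i = rng.randrange(len(samples))
        hits = nearest(i, labels[i])
        for f in features:
            for j in hits:
                # Nearest hits that differ on feature f lower its weight.
                weights[f] -= diff(f, samples[i], samples[j], ranges) / (iterations * len(hits))
        for c in priors:
            if c == labels[i]:
                continue
            misses = nearest(i, c)
            scale = priors[c] / (1.0 - priors[labels[i]])
            for f in features:
                for j in misses:
                    # Nearest misses that differ on feature f raise its weight.
                    weights[f] += scale * diff(f, samples[i], samples[j], ranges) / (
                        iterations * len(misses))
    return weights

# Hypothetical toy data: feature 0 follows the class, feature 1 does not.
toy_samples = [[0.0, 0.0], [0.1, 1.0], [0.0, 0.5],
               [1.0, 0.0], [1.1, 1.0], [1.0, 0.5],
               [2.0, 0.0], [2.1, 1.0], [2.0, 0.5]]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
toy_weights = relieff(toy_samples, toy_labels, k=2, iterations=30)
```

In this toy case the class-related feature receives a positive weight and the unrelated one a negative weight, matching the interpretation of the weight range given above.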

Chi Square Test


The chi-square test is a univariate filter method. It operates on categorical variables
and detects the relationships and dependencies between them. The chi-square test is a
two-step test. In the first step, the chi-square statistic of the observed values is
calculated with respect to the expected values. In the second step, the obtained
statistic is compared with a chosen threshold value and a decision is made accordingly.
The features are scored according to the chi-square statistic, and the features with the
best scores are used. The chi-square statistic is obtained using Eqs. 6–8. In Eq. 6, I is
the number of intervals and J is the number of classes; N_ij is the number of samples in
the ith interval that belong to the jth class; and E_ij is the number of samples expected
in the ith interval and the jth class if the feature and the class were independent,
computed from the interval total N_i·, the class total N_·j, and the total number of
samples N. Finally, d in Eq. 7 is the number of degrees of freedom of the chi-square
distribution used for the test statistic [25, 26].

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(N_{ij} - E_{ij})^2}{E_{ij}} \tag{6}$$

$$d = (I - 1)(J - 1) \tag{7}$$

$$E_{ij} = \frac{N_{i\cdot} \, N_{\cdot j}}{N} \tag{8}$$
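A small worked example may help. The sketch below computes the chi-square statistic and the degrees of freedom for a contingency table of interval-by-class counts, using the standard Pearson form with expected counts derived from the row and column totals. The two toy tables are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square for table[i][j] = samples in interval i and class j.

    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature interval and the class were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# A feature whose intervals track the class, and one that is independent of it.
dependent_feature = [[30, 10], [10, 30]]
independent_feature = [[20, 20], [20, 20]]
dependent_score = chi_square_statistic(dependent_feature)
independent_score = chi_square_statistic(independent_feature)
```

The feature that tracks the class receives the larger score and would therefore be ranked higher by the filter.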

Minimum Redundancy Maximum Relevance Feature Selection (mRMR)


The mRMR algorithm is an entropy-based feature selection algorithm proposed by Peng et
al. in 2005 [27]. It is a filter algorithm that selects the features most strongly
associated with the class labels of the data to be classified, while keeping the
redundancy among the selected features low. The algorithm uses mutual information to
measure the similarity between two features or between a feature and the class labels
[28]. In essence, mRMR ranks all features from the most valuable to the least valuable
and leaves it to the user to decide how many features to use for the classification
problem. It is therefore better regarded as a feature ranking algorithm than as a feature
selection algorithm.
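The ranking behaviour can be sketched with a greedy procedure that scores each remaining feature by its mutual information with the labels minus its mean mutual information with the features already chosen. This difference form is one common mRMR criterion and is an assumption here, as are the discrete toy features; continuous deep features would first need to be discretized.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * math.log((count * n) / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )

def mrmr_rank(columns, labels):
    """Rank feature indices greedily by relevance minus mean redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    remaining = list(range(len(columns)))
    ranked = []
    while remaining:
        def score(f):
            if not ranked:
                return relevance[f]
            redundancy = sum(
                mutual_information(columns[f], columns[s]) for s in ranked
            ) / len(ranked)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 0: strongly related to the label
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 1: exact copy of feature 0 (redundant)
    [0, 0, 1, 0, 1, 1, 0, 1],  # feature 2: weaker, but adds new information
]
toy_ranking = mrmr_rank(toy_columns, toy_labels)
```

Although features 0 and 1 are equally relevant, the duplicate is ranked last because it adds nothing once feature 0 has been selected. Keeping the first entries of the ranking gives the reduced feature set.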

2.1.3 Pre-trained Networks


Transfer learning is defined here as the learning structure in which the features
obtained by deep learning models developed for other purposes are used as inputs to other
machine learning methods. In this study, the deep learning models AlexNet, EfficientNet
B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, NASNet-Large, MobileNetV2,
ResNet18, ResNet50, ResNet101, VGG16, and VGG19 were used. The number of layers, depth,
number of parameters, and image input size of these networks are given in Table 1.

Table 1 Deep learning networks used in the study


Pre-trained model      Layers   Depth   Parameters (millions)   Image input size
AlexNet                25       8       61.0                    227 × 227
EfficientNet B0        290      82      5.3                     224 × 224
GoogleNet              144      22      7.0                     224 × 224
Inception ResNet-v2    825      164     55.9                    299 × 299
Inception v3           316      48      23.9                    299 × 299
DenseNet201            708      201     20.0                    224 × 224
NASNet-Large           1243     -       88.9                    331 × 331
MobileNetV2            154      53      3.5                     224 × 224
ResNet-18              71       18      11.7                    224 × 224
ResNet-50              177      50      25.6                    224 × 224
ResNet-101             347      101     44.6                    224 × 224
VGG16                  41       16      138.0                   224 × 224
VGG19                  47       19      144.0                   224 × 224

AlexNet
Deep learning pioneers Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton developed the
network that became known as AlexNet [29]. This deep convolutional neural network has a
total of 25 layers: 5 convolution layers, 3 max pooling layers, 2 dropout layers, 3 fully
connected layers, 7 ReLU layers, 2 normalization layers, a softmax layer, an input layer,
and a classification (output) layer. The image fed to the input layer of AlexNet is
227 × 227 × 3 in size. Classification takes place in the final layer, which outputs the
predicted class of the input image.

DenseNet201
In DenseNet (Densely Connected Convolutional Network), each layer is connected to every
subsequent layer in a feed-forward fashion. Each layer takes the feature maps of all
preceding layers as input and passes its own feature maps on to all the layers that come
after it [30]. DenseNet topologies have the advantage of strengthening feature
propagation and reducing the number of parameters by permitting feature reuse [31]. The
DenseNet-121 design, for example, is composed of four dense blocks, three transition
layers, and 121 layers in total (117 convolution layers, 3 transition layers, and 1
classification layer). DenseNet201, the variant used in this study, follows the same
design with a depth of 201 (Table 1).

MobileNetV2
MobileNet designs are built on a modular architecture that allows for the development of
both shallow and deep neural networks. The architecture's two global hyperparameters
provide a trade-off between latency and accuracy. Based on the constraints of the
problem, these hyperparameters allow the model builder to select an appropriately sized
model for the application.

Nasnet-Large
NASNet-Large is a 1243-layer convolutional neural network trained on more than one
million images from the ImageNet collection. The network can classify images into one
thousand object categories, including animals, balloons, and flowers. As a result, it has
learned rich feature representations for a wide range of images. The image input size of
the network is 331 × 331 pixels.

EfficientNet B0
EfficientNet, a CNN family developed by Google in 2019, provides significant improvements
in accuracy and efficiency. The approach presented in that work is notable because it can
also be applied to other CNN models. EfficientNet-B0 is the baseline network, developed
using AutoML MNAS [32]. EfficientNet-B0 consists of 290 layers. The image fed to the
input layer of EfficientNet B0 is 224 × 224 × 3 in size.

GoogleNet
GoogLeNet won the ImageNet 2014 image classification competition with a success rate of
93.33%. The GoogLeNet architecture consists of 144 layers and showed that, on large
datasets, classification performance can be improved by increasing the number of layers.
The image fed to the input layer of GoogLeNet is 224 × 224 × 3 in size. To cope with
large images, it applies filters of several sizes, such as 1 × 1, 3 × 3, and 5 × 5, at
the same stage. Unlike other architectures, it processes these filters in parallel rather
than stacking the layers, because stacking brings drawbacks such as increased memory use
and longer processing time [25].

Inception ResNet-V2
The Inception-ResNet-V2 architecture combines residual connections with a newer version
of the Inception architecture and makes efficient use of those residual connections [33].
Its feature extraction performance is quite good. In this architecture, residual units
are added to each Inception module to prevent the degradation of the network gradient
that usually accompanies an increase in the number of layers. The Inception ResNet-v2
architecture consists of 825 layers. The image fed to its input layer measures
299 × 299 × 3.

Inception V3
The Inception architecture emerged with the GoogLeNet model. The GoogLeNet model,
proposed by Szegedy et al. (2015), tries to keep the computational cost constant while
increasing depth and width. To this end, the model uses the Inception concept, in which
the outputs of different convolution filters applied together are combined [34]. The
Inception-v3 architecture consists of 316 layers. The image fed to the input layer of
Inception v3 measures 299 × 299 × 3.

ResNet-18
The ResNet-18 pre-trained model, which provides rich features, was trained on more than
one million images from the ImageNet dataset with an input size of 224 × 224. Although it
has only 71 layers and a depth of 18, it has been reported to give successful results,
and to do so faster than some deeper models [35].

ResNet-50
The ResNet micro-architecture module differs structurally from other architectures. It
can be preferable to skip some layers and pass the input directly on to a later layer. By
allowing this in the ResNet architecture, performance was raised to higher levels.
The ResNet50 architecture consists of a network of 177 layers with a depth of 50. In
addition to this layered structure, it specifies how the inter-layer connections are made
[36].

ResNet-101
The ResNet-101 structure has 347 layers and a depth of 101. ResNet's bypass (skip)
connection between layers is referred to as a ResBlock. Even if nothing is learned in the
intermediate layers, the ResBlock makes the model more robust by carrying the information
from the previous layer over to the new layer. The ResBlock thereby addresses the
vanishing gradient problem. Gradient descent is used as the optimization algorithm. The
input layer dimensions of ResNet-101 are 224 × 224 × 3 [36].
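The skip connection can be illustrated with a minimal sketch in which the block output is F(x) + x. Here `transform` is a hypothetical stand-in for the stacked convolutional layers F, and plain Python lists replace feature maps.

```python
def residual_block(x, transform):
    """Return F(x) + x, where `transform` plays the role of the stacked layers F."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]

def learned_nothing(x):
    """Stand-in for layers whose output is zero (nothing learned)."""
    return [0.0 for _ in x]

def scale_by_half(x):
    """Stand-in for layers that learned a simple transformation."""
    return [0.5 * xi for xi in x]

identity_output = residual_block([1.0, -2.0, 3.0], learned_nothing)
refined_output = residual_block([2.0, 4.0], scale_by_half)
```

When the layers contribute nothing, the block simply passes its input through unchanged, which is why adding such blocks does not make a deeper network harder to optimize.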

VGG16
The VGG16 model consists of a total of 41 layers, 16 of which have learnable weights,
together with ReLU and pooling layers. The learnable layers comprise thirteen
convolutional and three fully connected layers. The VGG16 model uses 3 × 3 filters with a
stride of 1 pixel in all convolutional layers and, as in AlexNet, max pooling layers
follow convolutional layers. Max pooling uses a 2 × 2 window with a stride of 2. To
extract feature vectors, the activations of the first and second fully connected layers
(fc6, fc7) were used; the fc6 and fc7 output vectors each contain 4096 features. The
network is trained on 224 × 224 RGB images [37].
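The effect of these filter and pooling settings on the feature map size can be traced with a short calculation. The sketch assumes a padding of 1 pixel for the 3 × 3 convolutions (the usual VGG setting, not stated in the text) and groups the thirteen convolutions into the five standard VGG16 blocks.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# Number of 3x3 convolutions in each of the five VGG16 blocks (13 in total).
VGG16_BLOCKS = [2, 2, 3, 3, 3]

def vgg16_feature_map_sizes(input_size=224):
    """Return the feature map side length after each max pooling layer."""
    size = input_size
    sizes_after_pooling = []
    for n_convs in VGG16_BLOCKS:
        for _ in range(n_convs):
            # 3x3 filter, stride 1; padding 1 is assumed, so the size is unchanged.
            size = conv_output_size(size, kernel=3, stride=1, padding=1)
        # 2x2 max pooling with stride 2 halves the size.
        size = conv_output_size(size, kernel=2, stride=2, padding=0)
        sizes_after_pooling.append(size)
    return sizes_after_pooling

vgg16_sizes = vgg16_feature_map_sizes(224)
```

Under these assumptions a 224 × 224 input shrinks to 7 × 7 after the fifth pooling layer, which is the map that feeds the first fully connected layer (fc6).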

VGG19
The Visual Geometry Group (VGG) at the University of Oxford developed the VGG19 network.
It has 19 layers with learnable weights, 16 of which are convolutional and 3 of which are
fully connected, along with 5 max pooling layers and 1 softmax layer. The input to this
network is an image with dimensions (224, 224, 3). The network has approximately 144
million trainable parameters. 3 × 3 filters with a stride of one pixel were used so that
the whole image is covered [37].

2.1.4 Support Vector Machine


SVM is a machine learning model developed by Vapnik–Chervonenkis in 1995. It is used
mainly for classification and also for clustering and regression problems. In recent
years it has been one of the most successful machine learning algorithms for solving
classification problems. The purpose of the SVM model is to find the hyperplane that best
separates the classes of the target variable from each other [38].

2.1.5 K-Nearest Neighbors(K-NN)


Although the k-NN classifier is a simple type of classifier, it is one of the classifiers
that give good results. It is called "simple" because it does not require a training
step, which distinguishes it from other classifiers: the training data are used directly
during classification. Given a test sample, its k nearest neighbors in the training set
are found and the number of neighbors belonging to each class is counted. The test sample
is assigned to the class with the largest number of neighbors [39]. The distance used by
the k-NN classifier can be defined with several formulas, given in Eqs. 9–11 for two
n-dimensional samples x and y. In the Minkowski distance, choosing p = 1 gives the
Manhattan distance and choosing p = 2 gives the Euclidean distance.

$$d_{\text{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p} \tag{9}$$

$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{10}$$

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{11}$$
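A minimal sketch of these distances and of the majority vote follows. The Minkowski order is written `p` in the code so that it does not clash with the number of neighbors `k`; the two-dimensional toy points and class names are hypothetical.

```python
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan and p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_x, train_y, query, k=3, p=2):
    """Label the query with the majority class among its k nearest training samples."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: minkowski(train_x[i], query, p)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors for two classes.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["normal", "normal", "normal", "covid", "covid", "covid"]

manhattan = minkowski([0, 0], [3, 4], p=1)
euclidean = minkowski([0, 0], [3, 4], p=2)
first_prediction = knn_predict(train_x, train_y, [0.5, 0.5], k=3)
second_prediction = knn_predict(train_x, train_y, [5.5, 5.5], k=3)
```

No model is fitted in advance: every prediction searches the stored training samples directly, which is the sense in which k-NN has no training stage.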

2.1.6 Decision Trees


Decision trees allow data to be processed rapidly. They classify data according to the
values of their features. For this purpose, some features are designated as inputs and
some as outputs and are presented to the algorithm. The resulting tree shows which input
values lead to each result in the output feature. One of the methods used to create a
model is the EBT method.
To increase the prediction accuracy of individual learning algorithms, ensemble
approaches combine several learning methods. They are a linear combination of different
modeling methods that produces better predictions without increasing complexity
significantly. Bagged and boosted ensembles are two of the most widely used ensemble
methods. While bagged approaches reduce the error variance of the base learning
algorithms, boosted methods specifically reduce their bias [40, 41].

2.2 Dataset
The dataset consists of 1061 chest X-ray images labeled by radiologists. It was
downloaded from the Kaggle website and then edited [42, 43]. The images fall into three
classes: COVID-19, Pneumonia, and Normal. There are 361 COVID-19, 500 Pneumonia, and 200
Normal chest X-ray images in the dataset. The COVID-19 cases consist of chest X-ray
images of 200 male and 161 female patients, whose mean age is over 45. Image heights
range from 143 to 1637 pixels (average 491 pixels) and widths from 76 to 1225 pixels
(average 383 pixels). Figure 2 shows example X-ray scans of COVID-19, Normal, and
Pneumonia patients from the dataset.
Fig. 2 COVID-X-Ray scan dataset sample images

3 Performance Measurement Metrics


The success of the machine learning classifiers was determined by the agreement between
the predicted class labels and the actual class values. Labeling data with a positive true class
value as positive was referred to as true positive (TP), while labeling it as negative was
referred to as false negative (FN); labeling data with a negative true class value as negative
was referred to as true negative (TN), while labeling it as positive was referred to as false
positive (FP). For the suggested method, the performance measurement metrics were
computed from the TP, TN, FP, and FN counts of the confusion matrix. Accuracy,
sensitivity, specificity, precision, and F-score were used as performance measures and were
computed using the following equations.

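$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{13}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{14}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$

$$F\text{-}\mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{16}$$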
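As a worked illustration of these metrics, the sketch below derives TP, FN, FP and TN for one class of a three-class confusion matrix (one class against the rest) and evaluates the five standard definitions. The function name and the matrix entries are hypothetical; only the row sums (361, 200 and 500 images) are taken from the dataset description in Sect. 2.2.

```python
def one_vs_rest_metrics(confusion, labels, positive):
    """Compute the five metrics for one class of a multi-class confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    p = labels.index(positive)
    total = sum(sum(row) for row in confusion)
    tp = confusion[p][p]
    fn = sum(confusion[p]) - tp                  # positives predicted as another class
    fp = sum(row[p] for row in confusion) - tp   # other classes predicted as positive
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }

labels = ["COVID-19", "Normal", "Pneumonia"]
# Hypothetical counts; only the row sums (361, 200, 500) come from Sect. 2.2.
confusion = [
    [342, 19, 0],
    [19, 181, 0],
    [0, 0, 500],
]
covid = one_vs_rest_metrics(confusion, labels, "COVID-19")
```

With these illustrative counts, TP = 342, FN = 19, FP = 19 and TN = 681, so sensitivity and precision are both 342/361 and specificity is 681/700.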

4 Experimental Studies
The MATLAB environment was used to obtain the experimental results in this study. The
results were obtained on an all-in-one computer with an i7 processor, 16 GB RAM, and a
4 GB graphics card. The images in the data set were resized to 224 × 224, 227 × 227, 299 ×
299 and 331 × 331, and classification was performed. The convolutional neural network
models used in the study were AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2,
Inception-v3, DenseNet201, MobileNetV2, Nasnet-Large, ResNet18, ResNet50, ResNet101,
VGG16 and VGG19. Chi-square, NCA, mRMR and ReliefF feature selection methods
were used. A total of 2000 features were selected, 1000 from the FC8 layer of AlexNet
and 1000 from the FC1000 layer of ResNet101. The selected features were reduced to
200 features with the mRMR feature selection method. The 200 features were then
passed to 13 different classifiers. In this study, it was observed that the highest
performance was obtained with SVM. Figure 3 shows the confusion matrices of the
classifiers with which each of the 13 pre-trained networks and the combined network
reached its highest accuracy. The ResNet50 + AlexNet network with the Cubic SVM
classifier and mRMR feature selection had the best accuracy with 98.21%, and the Inception
ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had the worst
accuracy with 95.00%.
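The feature fusion and selection step described above can be sketched as follows. This is a simplified stand-in, not the study's MATLAB pipeline: mRMR is normally defined with mutual information, whereas this sketch uses absolute Pearson correlation as a proxy for both relevance and redundancy, and the three toy feature columns replace the 2000 deep features. All names and values are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def mrmr_select(columns, labels, k):
    """Greedily pick k column indices: high relevance to the labels, low redundancy."""
    relevance = [abs(pearson(col, labels)) for col in columns]
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < k:
        def score(j):
            redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected)
            return relevance[j] - redundancy / len(selected)
        candidates = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected

labels = [0, 0, 0, 0, 1, 1, 1, 1]
net_a = [[0, 0, 0, 1, 1, 1, 1, 1]]           # stand-in for features of one network
net_b = [[0, 0, 0, 1, 1, 1, 1, 1],            # duplicate of the first column: redundant
         [0, 1, 0, 0, 1, 0, 1, 1]]            # weaker but complementary feature
fused = net_a + net_b                          # concatenate the two feature sets
chosen = mrmr_select(fused, labels, k=2)
```

The duplicated column is as relevant as the first one but fully redundant with it, so the second pick goes to the weaker, complementary column. This is the behavior that motivates selecting from the fused feature set rather than ranking each network's features separately.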
Fig. 3 Confusion matrices with the highest accuracy

Figure 4 shows the accuracy values of the pre-trained networks for each classifier and
feature selection method.
Fig. 4 Graphs of accuracy values of pre-trained networks according to classifiers and feature selections

For the AlexNet network, the Cubic SVM classifier had the highest accuracy with 96.42%,
and the Medium Gaussian SVM classifier with mRMR feature selection had the worst
accuracy with 89.2%. For the DenseNet-201 network, the Cubic SVM classifier with NCA
feature selection had the highest accuracy with 96.61%, and the Quadratic Discriminant
classifier with Chi2 feature selection had the worst accuracy with 89.6%. For the
EfficientNet B0 network, the Cubic SVM classifier with NCA feature selection had the
highest accuracy with 96.51%, and the Fine Tree classifier with Chi2 feature selection had
the worst accuracy with 89.3%. For the GoogleNet network, the Cubic SVM classifier with
NCA feature selection had the highest accuracy with 96.06%, and the Quadratic SVM
classifier with mRMR feature selection had the worst accuracy with 89.7%. For the
Inception ResNet-v2 network, the Cubic SVM classifier had the highest accuracy with
95.0%, and the Medium Gaussian SVM classifier with NCA feature selection had the worst
accuracy with 89.7%.
For the Inception v3 network, the Cubic SVM classifier with Chi2 feature selection had
the highest accuracy with 96.14%, and the Bilayered Neural Network had the worst
accuracy with 89.2%. For the MobileNetV2 network, the Cubic SVM classifier with Chi2
feature selection had the highest accuracy with 96.14%, and the Quadratic Discriminant
classifier with ReliefF feature selection had the worst accuracy with 90.0%. For the
Nasnet-Large network, the Cubic SVM classifier with ReliefF feature selection had the
highest accuracy with 96.32%, and the Medium Gaussian SVM classifier with ReliefF
feature selection had the worst accuracy with 89.7%. For the ResNet18 network, the Cubic
SVM classifier had the highest accuracy with 96.04%, and the Quadratic Discriminant
classifier with ReliefF feature selection had the worst accuracy with 90.0%. For the
ResNet50 network, the Cubic SVM classifier with NCA feature selection had the highest
accuracy with 97.08%, and the Quadratic Discriminant classifier with NCA feature
selection had the worst accuracy with 90.1%. For the ResNet101 network, the Quadratic
Discriminant classifier with NCA feature selection had the highest accuracy with 96.04%,
and the Fine Tree classifier with ReliefF feature selection had the worst accuracy with
90.4%. For the VGG16 network, the Cubic SVM classifier with mRMR feature selection
had the highest accuracy with 96.42%, and the Medium Gaussian SVM classifier with NCA
feature selection had the worst accuracy with 90.0%. For the VGG19 network, the
Quadratic Discriminant classifier with NCA feature selection had the highest accuracy with
95.66%, and the Medium Gaussian SVM classifier with NCA feature selection had the
worst accuracy with 89.3%.
Table 2 gives the sensitivity, specificity, precision and F-score results of the classifiers
used in the proposed method. For the Pneumonia class, the sensitivity, specificity, precision
and F-score metrics were all 100%. In the COVID-19 class, for the sensitivity metric, the
GoogleNet network with the Cubic SVM classifier and NCA feature selection had the best
result with 100%, and the Inception ResNet-v2 network with the Cubic SVM classifier and
NCA feature selection had the worst result with 94.18%. For the specificity metric, the
ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection
had the best result with 98.14%, and the VGG19 network with the Cubic SVM classifier and
mRMR feature selection had the worst result with 93.71%. For the precision metric, the
ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection
had the best result with 96.47%, and the VGG19 network with the Cubic SVM classifier and
mRMR feature selection had the worst result with 89.08%. For the F-score metric, the
ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection
had the best result with 97.39%, and the Inception ResNet-v2 network with the Cubic SVM
classifier and NCA feature selection had the worst result with 92.77%.

Table 2 Other performance metrics of classifiers


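Model | Class | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-Score (%)
AlexNet-Cubic SVM | COVID-19 | 96.42 | 96.95 | 96.14 | 92.84 | 94.85
AlexNet-Cubic SVM | Normal | | 98.72 | 86.50 | 96.92 | 97.81
AlexNet-Cubic SVM | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
DenseNet-201-Cubic SVM-NCA | COVID-19 | 96.61 | 97.51 | 96.14 | 92.88 | 95.14
DenseNet-201-Cubic SVM-NCA | Normal | | 98.95 | 86.50 | 96.93 | 97.93
DenseNet-201-Cubic SVM-NCA | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
Efficient-B0-Cubic SVM-mRMR | COVID-19 | 96.51 | 96.12 | 96.71 | 93.78 | 94.94
Efficient-B0-Cubic SVM-mRMR | Normal | | 98.37 | 88.50 | 97.36 | 97.86
Efficient-B0-Cubic SVM-mRMR | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
GoogleNet-Cubic SVM-NCA | COVID-19 | 96.06 | 100.00 | 94.86 | 90.93 | 95.25
GoogleNet-Cubic SVM-NCA | Normal | | 100.00 | 98.37 | 86.00 | 97.58
GoogleNet-Cubic SVM-NCA | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
Inception Resnet-v2-Cubic SVM-NCA | COVID-19 | 95.00 | 94.18 | 95.43 | 91.40 | 92.77
Inception Resnet-v2-Cubic SVM-NCA | Normal | | 97.56 | 84.00 | 96.33 | 96.94
Inception Resnet-v2-Cubic SVM-NCA | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
Inception-v3-Cubic SVM-Chi2 | COVID-19 | 96.14 | 97.51 | 95.43 | 91.67 | 94.50
Inception-v3-Cubic SVM-Chi2 | Normal | | 96.14 | 98.95 | 84.00 | 96.38
Inception-v3-Cubic SVM-Chi2 | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
MobileNetV2-Subspace KNN-NCA | COVID-19 | 96.14 | 95.57 | 96.43 | 93.24 | 94.39
MobileNetV2-Subspace KNN-NCA | Normal | | 98.14 | 87.50 | 97.13 | 97.63
MobileNetV2-Subspace KNN-NCA | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
NasNet Large-Cubic SVM-ReliefF | COVID-19 | 96.32 | 96.68 | 96.14 | 92.82 | 94.71
NasNet Large-Cubic SVM-ReliefF | Normal | | 98.61 | 86.50 | 96.92 | 97.75
NasNet Large-Cubic SVM-ReliefF | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
ResNet18-Cubic SVM | COVID-19 | 96.04 | 96.12 | 96.00 | 92.53 | 94.29
ResNet18-Cubic SVM | Normal | | 98.37 | 86.00 | 96.80 | 97.58
ResNet18-Cubic SVM | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
ResNet101-Cubic SVM-NCA | COVID-19 | 97.08 | 95.57 | 97.86 | 95.83 | 95.70
ResNet101-Cubic SVM-NCA | Normal | | 98.14 | 92.50 | 98.26 | 98.20
ResNet101-Cubic SVM-NCA | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
ResNet50-Cubic SVM-NCA | COVID-19 | 96.04 | 97.23 | 95.43 | 91.64 | 94.35
ResNet50-Cubic SVM-NCA | Normal | | 98.84 | 84.00 | 96.38 | 97.59
ResNet50-Cubic SVM-NCA | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
VGG16-Cubic SVM-mRMR | COVID-19 | 96.42 | 94.74 | 97.29 | 94.74 | 94.74
VGG16-Cubic SVM-mRMR | Normal | | 97.79 | 90.50 | 97.79 | 97.79
VGG16-Cubic SVM-mRMR | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
VGG19-Cubic SVM-mRMR | COVID-19 | 95.66 | 99.45 | 93.71 | 89.08 | 93.98
VGG19-Cubic SVM-mRMR | Normal | | 99.77 | 78.00 | 95.13 | 97.39
VGG19-Cubic SVM-mRMR | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00
ResNet50 + AlexNet-Cubic SVM-mRMR | COVID-19 | 98.21 | 98.34 | 98.14 | 96.47 | 97.39
ResNet50 + AlexNet-Cubic SVM-mRMR | Normal | | 99.30 | 93.50 | 98.50 | 98.90
ResNet50 + AlexNet-Cubic SVM-mRMR | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00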

In the Normal class, for the sensitivity metric, the VGG19 network with the Cubic SVM
classifier and mRMR feature selection had the best result with 99.77%, and the GoogleNet
network with the Cubic SVM classifier and NCA feature selection had the worst result with
96.04%. For the specificity metric, the Inception ResNet-v2 network with the Cubic SVM
classifier and NCA feature selection had the best result with 98.95%, and the VGG19
network with the Cubic SVM classifier and mRMR feature selection had the worst result
with 78.00%. For the precision metric, the ResNet50 + AlexNet network with the Cubic
SVM classifier and mRMR feature selection had the best result with 98.50%, and the
Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had
the worst result with 84.00%. For the F-score metric, the ResNet50 + AlexNet network with
the Cubic SVM classifier and mRMR feature selection had the best result with 98.90%, and
the Inception-v3 network with the Cubic SVM classifier and Chi2 feature selection had the
worst result with 96.38%.

5 Discussion
In this section, the performance of studies based on pre-trained models and of the proposed
method is discussed in terms of accuracy, sensitivity and specificity. Evaluations in the
literature are usually made on combined data sets. Since the data sets and the evaluation
criteria used in the studies differ, no method can be said to be strictly superior to the
others. The performance scores of these methods are given in Table 3.

Table 3 Literature studies and results


Ref | Dataset | Method | Accuracy (%) | Sensitivity | Specificity | Precision | F-Score
Abbas et al. [44] | COVID-19 image data collection [49] | DeTraC-ResNet-18 | 95.12 | 97.97% | 91.87% | 93.36% | –
Wang et al. [45] | The chest x-ray images (pneumonia) [42] | VGG16, VGG19, DenseNet201, Inception_ResNet_V2, Inception_V3, Resnet50, MobileNet_V2, Xception | 82.9 | 81.00% | 77.00% | – | 84.00%
Alqudah et al. [46] | COVID-19 image data collection [49] | SVM, Random Forest, CNN | 95.20 | 93.30% | 100.00% | 100.00% | –
Hemdan et al. [47] | COVID-19 image data collection [49] | VGG19, DenseNet201, InceptionV3, ResNetV2, InceptionResNetV2, Xception, MobileNetV2 | 90.00 | – | – | 83.00% | 91.00%
Narin et al. [48] | COVID-19 image data collection [49] | ResNet50, InceptionV3, InceptionResNetV2 | 98.00 | – | – | 100.00% | 98.00%
Proposed method | COVID-19 image dataset [42, 43] | AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, MobileNetV2, Nasnet-Large, ResNet18, ResNet50, ResNet101, VGG16, VGG19 | COVID-19 = 98.21 | 98.34% | 98.14% | 96.47% | 97.39%
Proposed method | | | Normal = 98.21 | 99.30% | 93.50% | 98.50% | 98.90%
Proposed method | | | Pneumonia = 98.21 | 100.00% | 100.00% | 100.00% | 100.00%

Abbas et al. [44] proposed a modified deep neural network for X-ray images to
distinguish COVID-19 cases more effectively. The model, which they call DeTraC,
includes three inner layers. It was created using ResNet18 as the backend and
achieved 95.12% accuracy on the X-ray dataset. Wang et al. [45] used 44 COVID-19(+) and
55 typical viral pneumonia CT images in their study. As preprocessing, a visual inspection
of the ROI extraction was performed. With the applied M-inception algorithm, the results
obtained were 82.9%, 81%, 84%, 77%, and 90% for accuracy, sensitivity, F1-score, AUC,
and specificity, respectively. Alqudah et al. [46] used SVM, Random Forest, and CNN in
their study and achieved 95.2% accuracy, 93.3% sensitivity, 100% specificity and 100%
precision. Hemdan et al. [47] suggested the COVIDX-Net deep learning classifier
framework for COVID-19 diagnosis using X-ray images. In addition, they validated seven
distinct DCNN models, such as VGG19 and DenseNet201, in their investigation. They
demonstrated that the VGG19 and DenseNet classifications are superior. Narin et al. [48]
used deep CNN-based models to classify X-ray images for COVID-19. Using chest X-ray
radiographs, CNN-based models (InceptionResNetV2, ResNet50, and InceptionV3) were
utilized to detect people infected with coronavirus pneumonia. According to the results of
the experiments, 98.00% accuracy was reached with the ResNet50 model.
The proposed approach reached an accuracy of 98.21%. It reached 100% in the
sensitivity and specificity criteria for the Pneumonia class. For the COVID-19 class,
sensitivity, specificity, precision and F-score values of 98.34%, 98.14%, 96.47%, and
97.39% were obtained, respectively.

6 Results
The rapid spread of the COVID-19 pandemic all over the world and its negative effects on
people clearly demonstrate the importance of detecting positive cases in the early stages and
of rapid and correct intervention. In this study, a three-class data set consisting of X-ray
images obtained during the COVID-19 epidemic was classified by the transfer learning
method. Preprocessing techniques were applied to the X-ray images to improve
classification performance. The Sobel gradient operator was used to highlight the point
regions in the X-ray images and to reduce the number of gray tones. Chi-square, NCA,
mRMR and ReliefF feature selection methods were used. First, the results of 13 pre-trained
models were compared. Then, a total of 2000 features were selected from AlexNet
and ResNet101. The selected features were reduced to 200 features with the mRMR feature
selection method, and the 200 features were passed to 13 different classifiers. The highest
performance, 98.21%, was obtained with the SVM classifier after applying mRMR feature
selection to the combined ResNet50 + AlexNet models. For the COVID-19 class, the
highest accuracy, sensitivity, specificity, precision and F-score values were obtained,
respectively, with ResNet50 + AlexNet Cubic SVM (98.21%), GoogleNet Cubic SVM
(100%), ResNet50 + AlexNet Cubic SVM (98.14%), ResNet50 + AlexNet Cubic SVM
(96.47%), and ResNet50 + AlexNet Cubic SVM (97.39%). The proposed approach shows
that pre-trained CNN architectures and feature extraction methods can be used together. In
addition, this study confirmed that the weights can be combined efficiently, rather than
considering the performance of feature selection methods separately. The major limitation
of this study is that the method used requires more powerful hardware if applied to larger
datasets.

References
1. CoronaVirus Updates. (2022). https://www.worldometers.info/coronavirus/

2. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., & Nahavandi, S. (2022). X-
ray image based COVID-19 detection using evolutionary deep learning approach. Expert Systems
with Applications, 201, 116942.

3. Dhiman, G., Chang, V., Kant Singh, K., & Shankar, A. (2022). Adopt: Automatic deep learning and
optimization-based approach for detection of novel coronavirus covid-19 disease using x-ray images.
Journal of Biomolecular Structure and Dynamics, 40(13), 5836–5847.

4. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N., & Mohammadi-
Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of
Intelligent & Fuzzy Systems, 1–12.

5. Ravi, V., Narasimhan, H., Chakraborty, C., & Pham, T. D. (2022). Deep learning-based meta-
classifier approach for COVID-19 classification using CT scan and chest X-ray images. Multimedia
Systems, 28(4), 1401–1415.

6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor
classification. Applied Sciences, 10(14), 4915.

7. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal
blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of
Electrical Engineering, 44(1), 505–518.

8. Samui, P., Roy, S. S., & Balas, V. E. (2017). Handbook of neural computation. Academic Press.

9. Javaheri, T., Homayounfar, M., Amoozgar, Z., Reiazi, R., Homayounieh, F., Abbas, E., Laali, A.,
Radmard, A. R., Gharib, M. H., & Mousavi, S. A. J. (2021). CovidCTNet: An open-source deep
learning approach to diagnose covid-19 using small cohort of CT images. NPJ Digital Medicine,
4(1), 1–10.

10. Rehman, A., Naz, S., Khan, A., Zaib, A., & Razzak, I. (2022) Improving coronavirus (COVID-19)
diagnosis using deep transfer learning. In Proceedings of international conference on information
technology and applications (pp. 23–37). Springer.

11. JavadiMoghaddam, S., & Gholamalinejad, H. (2021). A novel deep learning based method for
COVID-19 detection from CT image. Biomedical Signal Processing and Control, 70, 102987.

12. Chen, J., Wu, L., Zhang, J., Zhang, L., Gong, D., Zhao, Y., Chen, Q., Huang, S., Yang, M., & Yang,
X. (2020). Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-
resolution computed tomography. Scientific Reports, 10(1), 1–11.

13. Wu, X., Hui, H., Niu, M., Li, L., Wang, L., He, B., Yang, X., Li, L., Li, H., & Tian, J. (2020). Deep
learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A
multicentre study. European Journal of Radiology, 128, 109041.

14. Mobiny, A., Cicalese, P., Zare, S., Yuan, P., Abavisani, M., Wu, C., Ahuja, J., de Groot, P., & Van
Nguyen, H. (2020). Covid R-l detection using CT scans with detail-oriented capsule networks.

15. Balaha, H. M., El-Gendy, E. M., & Saafan, M. M. (2021). CovH2SD: A COVID-19 detection
approach based on Harris Hawks Optimization and stacked deep learning. Expert Systems with
Applications, 186, 115805.

16. Li, L., Qin, L., Xu, Z., Yin, Y., Wang, X., Kong, B., Bai, J., Lu, Y., Fang, Z., & Song, Q. (2020)
Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
Radiology.

17. He, X., Yang, X., Zhang, S., Zhao, J., Zhang, Y., Xing, E., & Xie, P. (2020) Sample-efficient deep
learning for COVID-19 diagnosis based on CT scans. Medrxiv.

18. Ahamed, K. U., Islam, M., Uddin, A., Akhter, A., Paul, B. K., Yousuf, M. A., Uddin, S., Quinn, J.
M., & Moni, M. A. (2021). A deep learning approach using effective preprocessing techniques to
detect COVID-19 from chest CT-scan and X-ray images. Computers in Biology and Medicine, 139,
105014.

19. Pathak, Y., Shukla, P. K., Tiwari, A., Stalin, S., & Singh, S. (2020). Deep transfer learning based
classification model for COVID-19 disease. Irbm.

20. Shi, F., Xia, L., Shan, F., Song, B., Wu, D., Wei, Y., Yuan, H., Jiang, H., He, Y., & Gao, Y. (2021).
Large-scale screening to distinguish between COVID-19 and community-acquired pneumonia using
infection size-aware classification. Physics in Medicine & Biology, 66(6), 065031.

21. Tarabalka, Y., Chanussot, J., & Benediktsson, J. A. (2010). Segmentation and classification of
hyperspectral images using watershed transformation. Pattern Recognition, 43(7), 2367–2379.

22. Gauch, J. M. (1999). Image segmentation and analysis via multiscale gradient watershed hierarchies.
IEEE Transactions on Image Processing, 8(1), 69–79.

23. Yang, W., Wang, K., & Zuo, W. (2012). Neighborhood component feature selection for high-
dimensional data. Journal of Computers, 7(1), 161–168.

24. Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and
RReliefF. Machine Learning, 53(1), 23–69.

25. Liu, H., Li, J., & Wong, L. (2002). A comparative study on feature selection and classification
methods using gene expression profiles and proteomic patterns. Genome Informatics, 13, 51–60.

26. McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143–149.

27. Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of
max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(8), 1226–1238.

28. Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene
expression data. Journal of Bioinformatics and Computational Biology, 3(02), 185–205.

29. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84–90.

30. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 4700–4708).

31. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., & Keutzer, K. (2014) Densenet:
Implementing efficient convnet descriptor pyramids. Preprint at arXiv:1404.1869

32. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks.
In International conference on machine learning, PMLR (pp. 6105–6114).

33. Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the
impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.

34. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., &
Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 1–9).

35. Ou, X., Yan, P., Zhang, Y., Tu, B., Zhang, G., Wu, J., & Li, W. (2019). Moving object detection
method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access, 7, 108152–
108160.

36. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)

37. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. Preprint at arXiv:1409.1556.

38. Vapnik, V. (1999). The nature of statistical learning theory. Springer science & business media.

39. McRoberts, R. E., Tomppo, E. O., Finley, A. O., & Heikkinen, J. (2007). Estimating areal means and
variances of forest attributes using the k-Nearest Neighbors technique and satellite imagery. Remote
Sensing of Environment, 111(4), 466–480.

40. Bühlmann, P. (2012). Bagging, boosting and ensemble methods. In Handbook of computational
statistics. Springer, pp 985–1022.

41. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

42. COVID-19 chest xray. (2022). https://www.kaggle.com/bachrr/covid-chest-xray

43. Chest X-Ray Images (Pneumonia). (2022). Retrieved from
https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

44. Abbas, A., Abdelsamea, M. M., & Gaber, M. M. (2021). Classification of COVID-19 in chest X-ray
images using DeTraC deep convolutional neural network. Applied Intelligence, 51(2), 854–864.

45. Wang, S., Kang, B., Ma, J., Zeng, X., Xiao, M., Guo, J., Cai, M., Yang, J., Li, Y., & Meng, X.
(2021). A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-
19). European Radiology, 31(8), 6096–6104.

46. Alqudah, A. M., Qazan, S., Alquran, H., Qasmieh, I. A., & Alqudah, A. (2020). COVID-2019
detection using X-ray images and artificial intelligence hybrid systems. Biomedical Signal and
Image Analysis and Project.

47. Hemdan, E. E.-D., Shouman, M. A., & Karar, M. E. (2020). Covidx-net: A framework of deep
learning classifiers to diagnose covid-19 in x-ray images. Preprint at arXiv:2003.11055.

48. Narin, A., Kaya, C., & Pamuk, Z. (2021). Automatic detection of coronavirus disease (covid-19)
using x-ray images and deep convolutional neural networks. Pattern Analysis and Applications,
24(3), 1207–1220.

49. Cohen, J. P., Morrison, P., Dao, L., Roth, K., Duong, T. Q., & Ghassemi, M. (2020). Covid-19 image
data collection: Prospective predictions are the future. Preprint at arXiv:2006.11988.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_3

Image Captioning Using Deep Transfer Learning
Tapan Kumar Das (1)
(1) School of Information Technology and Engineering, Vellore Institute
of Technology, Vellore, 632014, India

Email: tapan.das@vit.ac.in

Keywords Image captioning – Encoder – Decoder – CNN – RNN – NLP

1 Introduction
Generating a textual description of an image is an easy task for a human
being; for a machine, however, explaining an image requires computer
vision to interpret the image and NLP to describe it [1]. Hence, in
order to generate a caption automatically for a particular photograph, the
system must be trained to recognise the content of the image and
then to express that content in natural language [2]. With the
advent of deep learning methods, especially for image feature extraction and
processing [3], this problem has been addressed rapidly.
Deep learning techniques such as convolutional neural networks (CNNs)
are widely used for image processing tasks because of their ability to deal
with millions of underlying features [4]. CNN techniques have proved quite
efficient for a variety of medical image processing tasks, e.g.
COVID-19 lung CT scans [5], MRI images for brain tumor diagnosis [6, 7],
retinal blood vessels [8], angiograms [9], chest X-rays [10] and many more.
Just by looking at the picture in Fig. 1, some of us might say “A
Little is talking brown guiding grassy”, some may say “Little boy is playing
with toys” and yet others might say “A little boy is designing the
house”. All of these descriptions are valid, and a few additional captions
are also possible. Producing them requires no special training or effort
from a human being; a machine, however, cannot produce an appropriate
description merely by glancing at the image.

Fig. 1 A sample image

The significance of this study on generating captions for images is as
follows.
The experiments are based on transfer learning coupled with a Convolutional Neural Network (CNN).
We aim to boost the model performance by making subtle changes to the block diagram.
The objective is to produce semantically and syntactically correct captions for the input images by using phrases, instead of words, as elementary units.
Motivation
This problem is immensely useful in real-world applications. A few
applications to which this study is relevant are listed below:
Self-driving cars: Automatically and readily generating a caption of the scene around the car would help make the self-driving system truly autonomous.
Aid to the blind: A product that guides blind persons when walking on the roads would fulfil many aspirations. This is possible by converting the surrounding scene into text and then the text into voice.
Google image search: Like Google search, image search may become popular if an image could first be transformed into a caption and the underlying text then searched.

2 Related Studies
Different techniques for image captioning exist; they are retrieval based or
template based. Recently, deep learning based captioning has become very
popular due to the quality and appropriateness of the textual descriptions of
images. Deep learning based attention mechanisms also deliver
promising results in captioning [11]. Most of the models are encoder-
decoder based, and LSTM and bidirectional LSTM networks are used as the
decoder in most of the systems [12]. Similarly, for encoding, VGG16 and
ResNet50 are employed for their effectiveness in vectorising [13].
A few studies on image captioning that have used deep learning for
image processing and text description are presented in Table 1.

Table 1 Contemporary studies on caption generation using deep learning method


Studies | Objective | Methodology | Result
Chen et al. [14] | Mapping between images and their textual descriptions | Generating the caption using the recurrent neural network | Capable of generating novel captions
Sharma et al. [15] | Image captioning by integrating visual and external knowledge | The methodologies used are LSTM, CNN, and knowledge from external source | Extracted classes belong to the true class in general
You et al. [16] | Image captioning with semantic attention | Combines both top-down and bottom-up strategies | State-of-the-art performance as compared to standard benchmarks
Rampal et al. [17] | Image captioning using neural network compression | VGG16 or ResNet50 as an encoder, LSTM as decoder and flickr8k dataset | As compared to uncompressed model, achieves a 73.1% reduction in model size, and 7.7% increase in BLEU score
Arnav et al. [18] | Image captioning using deep learning | Two input streams are merged and passed to an LSTM layer | Results show 5 sentences, generated using a beam size of 5 along with average log probability of the sequence of words
Yao et al. [19] | Boosting image captioning with attributes | Devised the CNN plus RNN architecture to generate descriptions | Examines image representations and high-level attributes
Singh et al. [20] | Image Captioning using Artificial Intelligence | CNN, RNN–LSTM | Discussed the various algorithms like CNN, RNN, LSTM
Wang et al. [21] | Image captioning | CNN and two separate LSTM networks | Achieved high performance

3 Methodology
We used a combined CNN-RNN model to extract features from the
images and the text. We then used an evaluation model to check the accuracy
of the proposed model, and the performance of the model at each epoch is
tracked with the help of the error rate. We use a top-down approach and
transfer learning to extract the features, to train the model and to obtain
accurate captions for the images. In fact, the concept of transfer learning is
applied twice in our model: InceptionV3 for extracting features from
images and GloVe for extracting features from text/captions for better
accuracy. Finally, we test the model with some test images to determine
its accuracy. The detailed methodology consists of the following
steps:
Data collection.
Data cleaning and pre-processing.
The result of pre-processing is a vocabulary of 1652 unique words from the training dataset. We employed the InceptionV3 transfer learning model.
We encoded all the training and testing images, which are the input to our model.
After removing the stop-words in the process of data cleaning, we have 7578 words in our vocabulary.
We also used a transfer learning model (GloVe) to extract the features from our pre-processed text data.
Then we built and trained our network/model. Finally, we evaluated the performance on the test data.
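The cleaning and vocabulary steps listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the study: the function names, the small stop-word set and the two sample captions are made up for the example.

```python
import string
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of"}  # illustrative, not the study's list

def clean_caption(text):
    """Lower-case, strip punctuation, collapse whitespace, drop non-alphabetic tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w.isalpha()]

def build_vocabulary(captions, remove_stop_words=False):
    """Count the words of all cleaned captions and return a word -> count mapping."""
    counts = Counter()
    for caption in captions:
        words = clean_caption(caption)
        if remove_stop_words:
            words = [w for w in words if w not in STOP_WORDS]
        counts.update(words)
    return counts

captions = [
    "A little boy is playing with toys .",
    "A  little boy is designing the house!",
]
tokens = clean_caption(captions[0])
vocab = build_vocabulary(captions, remove_stop_words=True)
```

The size of the resulting mapping is the vocabulary size; in practice a minimum-frequency threshold is often applied as well, although the text does not say whether the study did so.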

3.1 Dataset
We have utilised the Flickr8k dataset, which contains around 8000 images, of
which 6000 images are used for training the model, 1000 images for
validating the model and the remaining 1000 images for testing the model in
order to determine the model efficiency. Each image has five
captions (Fig. 2).

Fig. 2 Encoder-decoder based image captioning process

Figure 3 shows a few sample images from the Flickr8k dataset.

Fig. 3 Sample images in the dataset

As Fig. 4 shows, each image has five different captions.
Fig. 4 Caption for the images

The Flickr8k dataset is loaded from the repository, and the data is then
pre-processed by removing extra whitespace, punctuation, and other
distractions. A CNN is used for encoding: the input image is fed to the CNN
to extract its features. After the features are processed by a series of
layers, the last hidden state of the CNN is connected to the decoder. In
this framework, an RNN serves as the decoder, which performs language
modelling at the word level. A schematic diagram of the encoder-decoder
based image captioning process is shown in Fig. 2.
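
The caption pre-processing described above can be sketched as follows. This is only an illustrative sketch, not the chapter's actual code: the function names, the choice to drop tokens that contain digits, and the sample captions are all assumptions.

```python
import string
from collections import Counter

def clean_caption(text):
    """Lowercase a caption, strip punctuation, and collapse extra whitespace."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    # Assumption: tokens that are not purely alphabetic (e.g. digits) are dropped.
    return " ".join(w for w in words if w.isalpha())

def build_vocabulary(captions, min_count=1):
    """Collect the unique words that occur at least min_count times."""
    counts = Counter(w for c in captions for w in clean_caption(c).split())
    return sorted(w for w, n in counts.items() if n >= min_count)

# Hypothetical captions, not taken from Flickr8k.
captions = ["A dog  runs, quickly!", "The dog jumps over 2 logs."]
cleaned = [clean_caption(c) for c in captions]
vocabulary = build_vocabulary(captions)
```

Raising `min_count` is one common way to shrink the vocabulary to frequent words only.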

3.2 Inception Model for Images


Here we used the pretrained Inception V3 model to extract the features
from the images. Inception V3 is a widely used image recognition model
that has been shown to attain greater than 78.1% accuracy on the standard
ImageNet dataset.
The architecture of Inception V3 is depicted in Fig. 5.
Fig. 5 Inception V3 architecture diagram

The encoding and decoding processes, together with the detailed layers of
those models and the parameters involved, are shown in Figs. 6 and 7,
respectively.

Fig. 6 Encoding the model summary


Fig. 7 Decoding the vectored model summary

The summary of the caption model, showing the detailed network layers and
the total number of parameters trained by the proposed model, is given in
Fig. 8.
Fig. 8 Caption model summary

4 Result
The main objective is to predict a caption for a given image. For
prediction, we applied an efficient predictive model based on deep
learning, focusing on its ability to find a suitable caption for a given
image in the dataset.
To evaluate the quality of the generated text, we used BLEU (Bilingual
Evaluation Understudy), which works by matching each generated text against
a set of reference texts written by humans. It yields a score that reflects
the overall quality of the generated text. We achieved a BLEU score of
0.645 on the considered dataset.
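
To make the matching principle concrete, a simplified sentence-level BLEU can be sketched as below. It is an illustration under stated assumptions, not the evaluation code behind the reported score: it uses clipped n-gram precision, a geometric mean, and a brevity penalty, applies no smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against all references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU: brevity penalty times geometric mean of precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    # Use the reference length closest to the candidate length.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothetical generated caption and human reference.
generated = "a dog runs on grass".split()
human = ["a dog runs on the grass".split()]
score = bleu(generated, human, max_n=2)
```

A generated caption identical to a reference scores 1, and the score drops as n-grams go unmatched or the caption becomes too short.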

4.1 Sample Output


To test the effectiveness of the designed model, we ran it on a few images
from the Flickr8k dataset. The output captions obtained are shown in
Figs. 9, 10, 11, 12 and 13.

Fig. 9 Sample output for image 1


Fig. 10 Sample output for image 2

Fig. 11 Sample output for image 3


Fig. 12 Sample output for image 4

Fig. 13 Sample output for image 5


5 Conclusion
In this chapter, we carried out the image captioning task by integrating
two deep learning techniques, a CNN and an RNN. We used the Flickr8k
dataset to train the encoder-decoder model. The trained model achieved a
BLEU score of 0.645 when tested with unseen images of the dataset. The
efficiency of content-based image retrieval is assessed by the quality of
the textual description of the image. Image caption generation can
therefore widen the scope of application areas such as medicine, security,
and other fields where the underlying image carries implicit meaning.
Moreover, the image captioning framework can automate and promote
large-scale image annotation, which can in turn lead to video captioning
and video dialog.

References
1. Sharma, H., Agrahari, M., Singh, S. K., Firoj, M., & Mishra, R. K. (2020). Image
captioning: A comprehensive survey. In 2020 International Conference on Power
Electronics & IoT Applications in Renewable Energy and its Control (PARC) (pp.
325–328). IEEE.

2. Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara,
R. (2022). From show to tell: a survey on deep learning-based image captioning.
IEEE Transactions on Pattern Analysis and Machine Intelligence.

3. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A
comprehensive survey of deep learning for image captioning. ACM Computing
Surveys (CSUR), 51(6), 1–36.

4. Chohan, M., Khan, A., Mahar, M. S., Hassan, S., Ghafoor, A., & Khan, M.
(2020). Image captioning using deep learning: A systematic. Image, 11(5).

5. Tiwari, R. S., Das, T. K., Srinivasan, K., & Chang, C. Y. (2022). Conceptualising
a channel-based overlapping CNN tower architecture for COVID-19 identification
from CT-scan images. Scientific Reports, 12(1), 1–15.

6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN
for brain tumor classification. Applied Sciences, 10(14), 4915.

7. Das, T. K., Roy, P. K., Uddin, M., Srinivasan, K., Chang, C. Y., & Syed-Abdul, S.
(2021). Early tumor diagnosis in brain MR images via deep convolutional neural
network model. Computers, Materials and Continua, 68(2), 2413–2429.

8. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for
segmentation of retinal blood vessels in fundus images. Iranian Journal of
Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.

9. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels
segmentation in angiograms using convolutional neural network: A deep learning
based approach. CMES-Computer Modeling in Engineering & Sciences, 136(1),
241–255.

10. Das, T. K., Chowdhary, C. L., & Gao, X. Z. (2020). Chest X-ray investigation: a
convolutional neural network approach. Journal of Biomimetics, Biomaterials and
Biomedical Engineering, 45, 57–70. Trans Tech Publications Ltd.

11. Zohourianshahzadi, Z., & Kalita, J. K. (2022). Neural attention for image
captioning: Review of outstanding methods. Artificial Intelligence Review, 55(5),
3833–3862.

12. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep
bidirectional LSTMs. In Proceedings of the 24th ACM International Conference
on Multimedia (pp. 988–997).

13. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning
using neural network compression. Preprint retrieved from arXiv:2012.09708.

14. Chen, X., & Zitnick, C. L. (2014). Learning a recurrent visual representation for
image caption generation. Preprint retrieved from arXiv:1411.5654.

15. Sharma, H., & Jalal, A. S. (2020). Incorporating external knowledge for image
captioning using CNN and LSTM. Modern Physics Letters B, 34(28), 2050315.

16. You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with
semantic attention. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 4651–4659).

17. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning
using neural network compression. Preprint retrieved from arXiv:2012.09708.

18. Arnav, J. H., & Pulkit, M. (2018). Image captioning using deep learning.

19. Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2017). Boosting image captioning
with attributes. In Proceedings of the IEEE International Conference on Computer
Vision (pp. 4894–4902).

20. Singh, Y. P., Ahmed, S. A. L. E., Singh, P., Kumar, N., & Diwakar, M. (2021).
Image captioning using artificial intelligence. In Journal of Physics: Conference
Series (Vol. 1854, No. 1, p. 012048). IOP Publishing.

21. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep
bidirectional LSTMs. In Proceedings of the 24th ACM International Conference
on Multimedia (pp. 988–997).
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_4

Vehicle Over Speed Detection System


K. Ganesan1 , N. S. Manikandan2 and Vijayan Sugumaran3
(1) Professor, Higher Academic Grade, School of Information Technology and
Engineering, Vellore Institute of Technology (VIT), Vellore, 632014, Tamil
Nadu, India
(2) Senior System Architect, TIFAC-CORE Automotive Infotronics, Vellore
Institute of Technology, Vellore, 632014, Tamil Nadu, India
(3) Distinguished Professor of Management Information Systems, School of
Business Administration, Oakland University, Rochester, MI, USA

K. Ganesan
Email: kganesan@vit.ac.in

Keywords License plate detection – Object detection – Electronic toll system –
Road curvature detection – Vehicle over-speed detection

1 Introduction
Vehicle accidents are among the most common causes of death worldwide, and they
injure a large number of people in addition to those they kill. Among the several
causes of accidents, high-speed vehicles are the most important, so high-speed
vehicles must be managed. To this end, government organisations, academic
institutions, and automobile manufacturers have begun various studies and projects
to lower the likelihood of accidents and provide safety to passengers and drivers.
Researchers have used several mechanisms to detect vehicle over-speed on
highways, such as VANET technology connected to a cloud server [1], video-based
analysis of a specific region of interest (ROI) [2], and speed prediction based on
electronic toll collection data [3]. To manage high-speed vehicles on the highway,
the Tamil Nadu government planned to install an over-speed detecting device in the
toll plaza. Figure 1 depicts a block diagram of over-speed detection in a toll plaza.
This architecture is made up of a vehicle detection system, a common cloud server
that is linked to an RTO server, and an over-speed detection system.
Fig. 1 Block diagram of the proposed system for high speed detection

Neural computation [6] and deep learning [7] have enabled state-of-the-art
applications in many domains, including vehicle detection from satellite images [4],
tumour extraction from medical images [5], and many others.
Vehicle detection and licence plate detection and recognition play a major role
in the initial stages of the vehicle over-speed detection system. In particular, the
system must detect Indian vehicles and extract Indian licence plate information.
For Indian vehicle detection at the toll plaza, Rajput et al. [8] utilized YOLOv3
object detection and classification to detect and classify vehicles. They classified
six types of vehicles, each of which can be charged a separate toll cost. Their
findings at the toll plaza revealed an average recall of 86.3% and precision of
94.1%.
The core task of this system is to locate the vehicle license plate and extract its
information. Many researchers have used deep learning techniques [9–15]; others
have used image processing techniques [16, 17]. However, locating and extracting
Indian vehicle license plates is often complicated by improper plate position,
damage, occlusion, or the use of unauthorized fonts [18]. In terms of
Indian vehicle license plate localization and data extraction, a novel method for
detecting license plates with various font styles on vehicles was proposed by Jagtap
et al. [19]. It relies on adaptive image segmentation in conjunction with Artificial
Neural Network (ANN) character recognition. The proposed approach combines
morphological operations with horizontal and vertical edge histograms to
accomplish plate localization and character segmentation. To recognize characters,
a two layer feed forward back propagation ANN is used. The results show an
overall accuracy of 89.5%. When it comes to Indian license plate irregularities, a
pipeline is built by Ravirathinam et al. [20] using a number of cutting-edge Faster
Regional Convolutional Neural Networks to effectively address the Indian situation
in a variety of scenarios. There is no publicly accessible dataset for Indian licence
plates, so they created a balanced dataset using frames from videos and images
from mobile devices, accounting for all the irregularities. Their pipeline generated
an overall total correctness of 88.5% and a partial correctness of 10% for Indian
plates. The overall correctness increased to 91% with the addition of a new
heuristics system. The accuracy of licence plate detection for all kinds of vehicles
was 94.98%. The extracted license plate information is sometimes incorrect. To
correct OCR output in chaotic Indian traffic videos with complicated licence plate
patterns, Singh et al. [18] proposed a modular framework. Its suggestions are
produced by a cutting-edge deep learning model that was trained on video frames.
Because the framework reads text from videos rather than still images, it uses
multi-frame consensus to generate these suggestions. Their human-interactive
framework uses an object detector and a tracker to first separate the
multi-vehicle videos into multiple clips, each of which contains a single vehicle
from the video, to aid in the correction process. Their framework then offers
recommendations for a single vehicle using multi-frame consensus. The user is then
given interactive suggestions that only show them certain extracted clips, allowing
them to quickly and easily verify or correct their predictions. This high-quality
output can be used to update a sizable database continuously for surveillance, which
will improve the accuracy of deep models in difficult real-world scenarios.
In view of the cloud platform, an IoT-based system that uses two detection
points with surveillance cameras to measure the average speed between them was
proposed by Khan et al. [21]. To enforce speed limits, the measured data is sent to
the cloud for additional processing. Entry and exit points are used to detect any
uncertainty in a particular area. The failure of a car, for instance, to reach the end
point after passing through the entrance point, can be highlighted. The system is
made up of a mobile phone application and a web network that exchange real-time
data, including information about passing vehicles like entrance time, pictures, and
license plate registration numbers. Such a system has the advantages of requiring
little human involvement, requiring fewer speed guns to be installed, and
monitoring vehicles even when they are not in the camera’s field of view.
The speed limit between two toll gates is determined by traffic density or
government traffic rules and regulations. However, the roads between the toll plazas
are generally curvy and have speed limits. The majority of cloud-based vehicle
over-speed detection systems are unaware of road curvature. In terms of extracting
data about horizontal curves from road GIS maps, Li et al. [22] present a fully
automated method. Their proposed methodology aims at four different things: (a)
Regardless of the type of curve, each road's curves in the selected road's surface
layers are identified; (b) each curve is automatically classified as either simple or
compound; (c) Each simple curve’s radius, degree of curvature, length, and
compound curve's radius are all automatically determined; and (d) curve
characteristics and layers are automatically created in the GIS for all detected
curves. 96.7% of curves were correctly identified and their geometric information
was computed using the proposed technique. However, the existing road curvature
extraction method is unaware of curvature noise and curvature in hilly terrain.
Thus, existing over-speed detection systems have some gaps: they are not aware of
the curvature of the highway, and they are not aware of curvature noise. To bridge
these gaps, the proposed system includes the following features:
The YOLO object detection model has been proposed for vehicle detection and
vehicle type extraction.
An image processing technique is used to locate and extract licence plates from
detected vehicle images.
The information on the localised licence plate is extracted using the CRNN deep
learning text extraction model.
The proposed curvature aware travel time estimation model calculates the travel
time between two toll plazas, and the cloud-based system detects over-speed of
vehicles.
The remainder of this chapter is arranged as follows. Section 2 describes the
proposed model in two parts. The first is the vehicle detection and license plate
extraction system, which is subdivided into vehicle detection and type
classification, license plate localization, and license plate recognition. The second
is the travel time estimating and over-speed detection system, which is subdivided
into the new curve finding method, curve speed limit database creation, curve aware
travel time estimation, and the vehicle over speed detection system. Section 3
discusses the results of vehicle detection and license plate localization and text
extraction, the new curve finding method, curve aware travel time estimation, and
vehicle over speed detection. Finally, Sect. 4 provides the conclusion and future
work.
2 Proposed Model
Figure 2 depicts the proposed system’s architecture. This system has been
subdivided into three subsystems. The first subsystem detects vehicles at toll gates
and extracts license plate information as well as vehicle type. The second
subsystem uses a road curvature extraction module, a curve aware speed limitation
module, and a curvature aware travel time estimator to characterize the curvature
aware journey time between two toll gates. Over speed detection is the third and
final subsystem. It is made up of a toll gate system and a common cloud server
infrastructure. The three subsystems are briefly described below.

Fig. 2 The architecture of the proposed system

2.1 Vehicle Detection and License Plate Extraction System


This system consists of three modules: vehicle detection and vehicle
type classification, license plate localization, and license plate recognition. Details
of each of these modules are provided below.

2.1.1 Vehicle Detection and Type Classification: YOLO


YOLO [23] applies a single CNN to the entire image and divides the image into an
M × M grid. For each grid cell, B bounding boxes and their associated confidence
scores are predicted. These bounding boxes are then assessed with the class
confidence score, using the formula given below.
Class confidence score = box confidence score × conditional class probability.
It measures the level of certainty in both classification and localization. The
mathematical definitions are as follows:
box confidence score $\equiv P_r(\text{object}) \cdot \text{IoU}$
conditional class probability $\equiv P_r(\text{class}_i \mid \text{object})$
class confidence score $\equiv P_r(\text{class}_i) \cdot \text{IoU}$, then
class confidence score $=$ box confidence score $\times$ conditional class probability (1)
where $P_r(\text{object})$ denotes the likelihood that an object is present in the box
and IoU is the intersection over union between the predicted box and the ground
truth. $P_r(\text{class}_i \mid \text{object})$ is the probability that an object
belongs to a given $\text{class}_i$, given its presence, and $P_r(\text{class}_i)$ is
the probability that an object belongs to a given $\text{class}_i$.
YOLO resizes an input image to 448 × 448 pixels. The image is then passed through
a convolutional network, yielding a 7 × 7 × 30 tensor. The tensor contains (1) the
coordinates of each bounding box's rectangle and (2) the probability distribution
over all classes for which the system has been trained. Predictions whose
confidence score (probability) is below 30% are eliminated.
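
The scoring and thresholding steps can be illustrated with a short sketch. It assumes the original YOLO configuration of B = 2 boxes per cell and 20 classes, which is what makes the tensor depth 30; these values, the function names, and the sample detections are assumptions rather than details given in this chapter.

```python
def output_depth(boxes_per_cell, num_classes):
    """Depth of the YOLO output tensor: each box has x, y, w, h, confidence."""
    return boxes_per_cell * 5 + num_classes

def class_confidence_scores(box_confidence, class_probs):
    """Class confidence = box confidence x conditional class probability."""
    return [box_confidence * p for p in class_probs]

def filter_detections(boxes, threshold=0.30):
    """Keep the best class of each box if its score reaches the threshold."""
    kept = []
    for box in boxes:
        scores = class_confidence_scores(box["confidence"], box["class_probs"])
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            kept.append((box["name"], best, scores[best]))
    return kept

# Hypothetical predictions: one confident box and one weak box.
boxes = [
    {"name": "box_a", "confidence": 0.9, "class_probs": [0.8, 0.2]},
    {"name": "box_b", "confidence": 0.4, "class_probs": [0.5, 0.5]},
]
detections = filter_detections(boxes)
```

Here `box_a` survives with a score of about 0.72 for class 0, while `box_b` is discarded because its best score is 0.2.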
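To calculate the loss, YOLO uses the sum-squared error between the predictions and
the ground truth. The loss function has three parts. The classification loss is the
error in the predicted class probabilities. The localization loss is the error between
the predicted boundary box and the ground truth. The confidence loss is the error in
the box confidence score, which is also computed for the boxes that do not contain
any object at all. The overall formula is:
$\text{Loss} = L_{\text{classification}} + L_{\text{localization}} + L_{\text{confidence}}$ (2)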
2.1.2 License Plate Localization
Finding the location of the license plate in the vehicle image is a critical task.
Grayscale conversion, thresholding, and morphological operations such as dilation
and erosion are used to localize plates. The Canny edge detector is then used to
detect the license plate edges and to crop the located license plate from the vehicle
image [24].

2.1.3 License Plate Recognition: CRNN


The CRNN [25] consists of a CNN, a bi-directional LSTM, and a CTC layer, and can
be viewed as an encoder-decoder structure. The CNN is the feature sequence encoder
that creates image feature sequences. The decoder, made up of the bi-directional
LSTM and CTC layers, produces character sequences.
The CNN input is resized to a height of 32 pixels and a width of (W × 32)/H pixels,
where W and H are the original image's width and height, in order to maintain the
original aspect ratio. Because characters are tall and thin, with a height greater
than their width, the later pooling layers use a stride of 2 × 1 rather than 2 × 2. As a
result, each point of the final feature map corresponds to a tall, thin receptive field
in the original image. The input image is downsampled by two pooling layers with a
2 × 2 stride and three pooling layers with a 2 × 1 stride. The final dimension of the
feature map is b × 1 × [(W × 8)/H] × C, where b is the batch size, 1 is the height,
(W × 8)/H is the width, and C represents the number of channels. The structure of
the CNN used for feature extraction is displayed in Table 1.

Table 1 CNN network of CRNN


Layers                CNN network                         Output size
Conv1                 (3 × 3 × conv) × 6                  32 × [(W × 32)/H]
Connection layer      Relu, 1 × 1 conv, dropout           32 × [(W × 32)/H]
                      2 × 2 average pool, stride 2 × 2    16 × [(W × 16)/H]
Conv2                 (3 × 3 × conv) × 6                  16 × [(W × 16)/H]
Connection layer      Relu, 1 × 1 conv, dropout           16 × [(W × 16)/H]
                      2 × 2 average pool, stride 2 × 2    8 × [(W × 8)/H]
Conv3                 (3 × 3 × conv) × 6                  8 × [(W × 8)/H]
Connection layer_1    Relu, 1 × 1 conv, dropout           8 × [(W × 8)/H]
                      2 × 1 average pool, stride 2 × 1    4 × [(W × 8)/H]
Conv4                 (3 × 3 × conv) × 6                  4 × [(W × 8)/H]
Connection layer_1    Relu, 1 × 1 conv, dropout           4 × [(W × 8)/H]
                      2 × 1 average pool, stride 2 × 1    2 × [(W × 8)/H]
Conv5                 (3 × 3 × conv) × 6                  2 × [(W × 8)/H]
Connection layer_1    Relu, 1 × 1 conv, dropout           2 × [(W × 8)/H]
                      2 × 1 average pool, stride 2 × 1    1 × [(W × 8)/H]
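
The output sizes in Table 1 follow from applying the five pooling strides in order. The sketch below traces them for a given plate image; the function name, the use of integer division, and the example image size are assumptions made for illustration.

```python
def crnn_feature_map_sizes(width, height):
    """Trace (height, width) of the feature map through the five pooling layers."""
    h = 32
    w = (width * 32) // height  # resized width that keeps the aspect ratio
    # Two 2 x 2 pools followed by three 2 x 1 pools (height stride, width stride).
    strides = [(2, 2), (2, 2), (2, 1), (2, 1), (2, 1)]
    sizes = [(h, w)]
    for stride_h, stride_w in strides:
        h, w = h // stride_h, w // stride_w
        sizes.append((h, w))
    return sizes

# Hypothetical cropped plate of 200 x 50 pixels (W x H).
sizes = crnn_feature_map_sizes(200, 50)
```

For this example the final feature map is 1 × 32, which matches 1 × [(W × 8)/H] = 1 × (200 × 8)/50.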

The CRNN decoder is composed of the bi-directional LSTM layer and the CTC layer.
The bi-directional LSTM receives the column vectors of the feature map as its input.
Its output is a probability matrix of size [(W × 8)/H] × C that gives the probabilities
of the characters for each column vector. Here (W × 8)/H is the width of the
extracted feature map and C is the number of character labels, which comprise the
26 English uppercase letters, the 26 English lowercase letters, and a space. The CTC
layer is applied to the output of the bi-directional LSTM to determine the likelihood
of a label sequence. During training, this likelihood is the conditional probability
defined in the CTC layer, and its negative log-likelihood serves as the loss function
for training the network. The CTC layer sums the probabilities of all paths that map
to the true label sequence. For example, the paths ‘hee-ll-o’ and ‘hh-ee-ll-oo’ (where
‘-’ signifies a space) both reduce to the label sequence ‘helo’ once duplicates and
spaces are removed. At test time, the recognition result is the character sequence
with the highest probability.
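
The path-to-label mapping can be sketched in a few lines. The decoder shown is greedy (best-path) decoding, which is a common approximation and an assumption here, since the chapter does not state which decoding strategy was used; the function names and the toy probability rows are hypothetical.

```python
def ctc_collapse(path, blank="-"):
    """Merge repeated characters, then drop the blank symbol."""
    out = []
    previous = None
    for ch in path:
        if ch != previous and ch != blank:
            out.append(ch)
        previous = ch
    return "".join(out)

def greedy_decode(prob_rows, labels, blank="-"):
    """Pick the most probable label per column, then collapse the path."""
    path = "".join(
        labels[max(range(len(row)), key=row.__getitem__)] for row in prob_rows
    )
    return ctc_collapse(path, blank)

# Toy output for a three-label alphabet: blank, 'a', 'b'.
rows = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
decoded = greedy_decode(rows, labels="-ab")
```

Note that a doubled letter survives only when a blank separates the two occurrences, so ‘hel-lo’ collapses to ‘hello’ while ‘hee-ll-o’ collapses to ‘helo’.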

2.2 Travel Time Estimating and Over-Speed Detection System


This system has four modules: road curvature identification, curve speed restriction
declaration, curve aware travel time computation, and vehicle overspeed detection.
They are described in detail below.

2.2.1 New Curve Detection Algorithm


A path from the source to the destination is constructed. The path, as shown in
Fig. 3a, is made up of a series of segment points (S1 to S9), with consecutive points
connected by straight lines. (Note: In India, all vehicles drive on the left side of the
road [26].)
Fig. 3 a Google map road segment points b Identified curve

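The equations for detecting a curve from the sequence of segment points are
shown below. Before calculating the radius of curvature, we must first calculate the
great-circle distance between two points using the ‘Haversine’ formula.
$h = \sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)$ (3)

$\theta = 2\,\operatorname{atan2}\left(\sqrt{h},\ \sqrt{1-h}\right)$ (4)

$d = ER \cdot \theta$ (5)
where $\varphi$ is latitude, $\lambda$ is longitude, $\Delta\varphi$ and $\Delta\lambda$
are the differences in latitude and longitude between the two points, d is the
distance, and ER is the radius of the earth (ER = 6,371 km). Let the distance from
segment point S1 to S2 be a, from S2 to S3 be b, and from S1 to S3 be c. The radius
is then

$R = \dfrac{abc}{\sqrt{(a+b+c)(b+c-a)(c+a-b)(a+b-c)}}$ (6)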
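
The distance and radius calculations can be sketched as follows. This is an illustration under assumptions: the radius is taken to be the circumradius of the triangle formed by three consecutive segment points, computed with Heron's formula, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # ER = 6,371 km expressed in metres

def haversine_m(p, q):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.atan2(math.sqrt(h), math.sqrt(1 - h))

def circumradius(a, b, c):
    """Radius of the circle through a triangle with side lengths a, b, c."""
    s = (a + b + c) / 2
    area_squared = s * (s - a) * (s - b) * (s - c)  # Heron's formula
    if area_squared <= 0:
        return math.inf  # collinear points: a straight line has no finite radius
    return a * b * c / (4 * math.sqrt(area_squared))

def radius_at(p1, p2, p3):
    """Curve radius in metres at three consecutive segment points."""
    return circumradius(haversine_m(p1, p2), haversine_m(p2, p3), haversine_m(p1, p3))
```

A radius below the 1000 m limit discussed next would mark the three points as part of a curve, while near-collinear points produce a very large radius.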

According to the Indian Roads Congress [27, 28], a vehicle can travel at a speed
of 70 to 80 km/h in a 1000 m radius curve on the Indian highways [26]. So, we
assume that the maximum radius of an Indian road curve is 1000 m. Using Eq. (6)
at segment points S1, S2, and S3 from Fig. 3a, we find that the radius of these three
segment points is more than 1000 m because they are interconnected like a straight
line. So, we check the next adjacent three-segment points S2, S3, and S4. The
radius of these three-segment points (S2, S3, and S4) is less than 1000 m because it
looks like a curve. These three segment points’ radius values are recorded in the
radius list, and this procedure is repeated for subsequent segment points until we
reach the final set of segment points along the path.
Figure 3a segment points S3–S9 yield six curve radii R1, R2, R3, R4, R5, and
R6. This information is saved in the radius list. After that, the average curve radius
(R1 to R6) of the segment points S2 to S9 is calculated. The detected curve of the
path in Fig. 3a is shown in Fig. 3b. (Explained in Algorithm 1).
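
The grouping of consecutive three-point radii into one curve can be sketched as below. This is not Algorithm 1 itself but a minimal illustration of the idea; the record layout, the function name, and the radius values are hypothetical.

```python
def detect_curves(radii, max_radius=1000.0):
    """Group consecutive windows whose radius is below max_radius into curves.

    radii[i] is the radius computed from segment points i, i + 1, and i + 2.
    """
    curves, run = [], []
    # A sentinel value closes a curve that runs to the end of the path.
    for i, r in enumerate(list(radii) + [float("inf")]):
        if r < max_radius:
            run.append(i)
        elif run:
            start, end = run[0], run[-1] + 2
            curves.append({
                "start": start,
                "end": end,
                "mid": (start + end) // 2,
                "avg_radius": sum(radii[j] for j in run) / len(run),
            })
            run = []
    return curves

# Hypothetical radii in metres for nine segment points S1..S9 (indices 0..8):
# the first window is nearly straight, the remaining six form one curve.
radii = [5000.0, 400.0, 420.0, 380.0, 410.0, 390.0, 400.0]
curves = detect_curves(radii)
```

With these values a single curve is found from index 1 (S2) to index 8 (S9) with an average radius of 400 m.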
The curve list keeps track of the curve's starting segment point S2, ending
segment point S8, mid-segment point S6, and computed average curve radius. The
method is utilized with the route between the origin and destination, and the found
curves and their attributes (curve starting point, curve ending point, curve mid-
point, and average curve radius) are saved in the curve list. Figure 4 shows that the
source location is on a highway, but the destination location is on a mountainous
(hilly) terrain. The curves on the highway are always large. A single curve that is
1000 m long, as seen in Fig. 4 (top red solid circle), is a good example. The
mountainous landscape here features several hairpin bends. These curves have a
radius of 50 to 150 m, and some curves are 500 m long. As the assumed maximum
curve radius of roads in India is 1000 m, multiple hairpin curves form a single
curve, as seen in Fig. 4 (bottom two red solid circles).
Fig. 4 Many types of curve in different terrain

To solve this problem, a recursive curve detection technique is applied. In this
technique, a single detected curve that actually contains several curves is broken up
into numerous smaller curves. The recursive curve detection method (lines 18 and
19 of Algorithm 1) takes as input each curve discovered along the path and recorded
in the curve list. Because one or more hairpin curves may fall within the 500 m
radius, the technique uses a sequence of curve radii of 500, 250, 150, and 50 m. As a
result, one large single curve is divided into smaller curves, as seen in Fig. 5a.
Figure 5b shows the curve-finding result for the twin tunnel road in Mumbai.
Fig. 5 a Result of proposed curve finding algorithm b Curve in tunnel road (Twin tunnel
Mumbai)

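Finding the curve direction (bearing) is based on the following calculation.

$\beta = \operatorname{atan2}\left(\sin\Delta\lambda \cos\varphi_2,\ \cos\varphi_1 \sin\varphi_2 - \sin\varphi_1 \cos\varphi_2 \cos\Delta\lambda\right)$ (7)

where $\varphi$ is latitude, $\lambda$ is longitude, and $\Delta\lambda$ is the
difference in longitude between the two points.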
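
A minimal sketch of the bearing calculation is given below, assuming the standard initial-bearing formula for two (latitude, longitude) points. Deriving a left or right turn from the change in bearing is an added illustration, not a step stated in the chapter, and both function names are hypothetical.

```python
import math

def initial_bearing_deg(p, q):
    """Initial bearing from p to q in degrees clockwise from north (0 to 360)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    return (math.degrees(math.atan2(y, x)) + 360) % 360

def turn_direction(p1, p2, p3):
    """Classify the turn at p2 from the signed change in bearing (assumption)."""
    change = (initial_bearing_deg(p2, p3) - initial_bearing_deg(p1, p2) + 540) % 360 - 180
    if change > 0:
        return "right"
    if change < 0:
        return "left"
    return "straight"
```

Heading north and then east, for example, gives a positive change in bearing and is classified as a right turn.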


There is sometimes ‘noise’ in the path: the lines that connect the segment points
are not uniform, as shown by the yellow circle in Fig. 6b. This noise can be
eliminated by the proposed curve detector method. The next section describes the
process of creating the curve speed limit database.
Fig. 6 a Noisy curve b Curve noise eliminated

2.2.2 Curve Speed Limit Database Creation


The database of curve speed restrictions on Indian roads was developed with
reference to the Indian Roads Congress (IRC) articles [27, 28]. Following these
articles, Table 2 lists the design speed limit for each range of curve radius. The
radius range is established on the basis of the super-elevation of the curve.

Table 2 Database of curve speed restrictions

S.No    Range of curve radius (in m)    Designed speed limit (in km/h)
1       50−100                          20
2       70−150                          25
3       100−200                         30
4       180−320                         35
5       280−650                         40
6       470−1100                        60
7       700−1400                        80
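
A lookup against Table 2 could be sketched as follows. Because the radius ranges overlap, the sketch assumes that the lowest matching speed is chosen as the conservative option; the chapter does not say how overlaps are resolved, so this rule and the function name are assumptions.

```python
# (minimum radius in m, maximum radius in m, designed speed limit in km/h)
CURVE_SPEED_TABLE = [
    (50, 100, 20),
    (70, 150, 25),
    (100, 200, 30),
    (180, 320, 35),
    (280, 650, 40),
    (470, 1100, 60),
    (700, 1400, 80),
]

def curve_speed_limit(radius_m):
    """Return the designed speed limit for a curve radius, or None if out of range."""
    matches = [speed for low, high, speed in CURVE_SPEED_TABLE
               if low <= radius_m <= high]
    # Assumption: when ranges overlap, take the most conservative (lowest) speed.
    return min(matches) if matches else None
```

For example, a 1000 m curve matches rows 6 and 7 and is assigned 60 km/h under this rule.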

2.2.3 Curve Aware Travel Time Estimation


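This module computes the travel time between two toll gates, as described by the
following equations.
$D_s = D_T - \sum_{i=1}^{n} D_{c_i}$ (8)
$T_s = \dfrac{D_s}{V_d}, \qquad T_{c_i} = \dfrac{D_{c_i}}{V_{c_i}}$ (9)
$T_{ca} = T_s + \sum_{i=1}^{n} T_{c_i}$ (10)

where $D_s$ is the straight road distance, obtained by subtracting the total distance
of curved road $\sum D_{c_i}$ from the toll road distance $D_T$. The time taken to
travel on the straight road alone, $T_s$, is the straight road distance $D_s$ divided
by the declared speed $V_d$.
Here, $T_{c_1}$, $T_{c_2}$, and $T_{c_n}$ are the times to travel over the curves,
each computed from the curve distance $D_{c_1}$, $D_{c_2}$, $D_{c_n}$ divided by
the corresponding curve speed restriction $V_{c_1}$, $V_{c_2}$, $V_{c_n}$. Finally,
the curvature aware travel time $T_{ca}$ is obtained by adding the travel time on the
straight road $T_s$ to every curve travel time $T_{c_1}$, $T_{c_2}$, and $T_{c_n}$.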
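
The computation described for Eqs. (8) to (10) can be sketched as a single function. The function name, the argument layout, and the example distances and speeds are assumptions chosen only for illustration.

```python
def curve_aware_travel_time_h(toll_distance_km, declared_speed_kmh, curves):
    """Travel time in hours between two toll gates.

    curves is a list of (curve distance in km, curve speed limit in km/h) pairs.
    """
    curve_distance = sum(distance for distance, _ in curves)
    straight_distance = toll_distance_km - curve_distance
    if straight_distance < 0:
        raise ValueError("curve distances exceed the toll road distance")
    straight_time = straight_distance / declared_speed_kmh
    curve_time = sum(distance / speed for distance, speed in curves)
    return straight_time + curve_time

# Hypothetical route: 50 km between toll gates at a declared 80 km/h,
# with a 2 km curve limited to 40 km/h and a 1 km curve limited to 20 km/h.
travel_time = curve_aware_travel_time_h(50, 80, [(2, 40), (1, 20)])
```

In this example the straight part takes 47/80 h and the curves add 0.1 h, giving 0.6875 h (41.25 min) instead of the 0.625 h a curve-unaware estimate would allow.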

2.2.4 Vehicle Over Speed Detection System


Figure 7 depicts the block diagram of over speed detection system. In a toll gate,
every booth has an over speed detection system. This system connects with camera
situated outside the toll booth which is focusing on the vehicle for detecting vehicle
type as well as extracting license plate information. This system connects with
cloud server, which stores vehicle information with timestamp and it connects with
the RTO server.
Fig. 7 Over speed detection module

The system connects with the camera to detect vehicle type and extract license
plate and add the current timestamp. Before that, the system downloads vehicle
information and timestamp of vehicle entry in the previous toll gate, along with
RTO information of vehicle from the cloud server. When the vehicle enters the toll
booth, the over speed detection system in each booth checks the vehicle information
from the downloaded information from cloud server. If there is no match, then it
considers the vehicle entering the toll booth as new, so the vehicle information with
current timestamp is added to the cloud server. In case the entered vehicle
information is matched with downloaded information, then it checks for the over-
speed. The over-speed is calculated using the following formula.
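$T_v = t_{\text{current}} - t_{\text{previous}}$ (11)

$\text{over speed} = \begin{cases} \text{true}, & T_v < T_{ca} \\ \text{false}, & \text{otherwise} \end{cases}$ (12)
where $T_v$ is the vehicle's travelled time, calculated by subtracting its previous
toll booth timestamp $t_{\text{previous}}$ from its current toll booth timestamp
$t_{\text{current}}$. The vehicle is over speed when its travelled time is less than
the toll gate's declared curve aware travel time $T_{ca}$ (Eq. 10). If over-speed is
detected, the vehicle is entered into the violator database and fined by the field
inspector.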
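
The matching and over-speed check can be sketched as below. An in-memory dictionary stands in for the cloud server, which is purely an assumption for illustration, as are the function name and the sample plate number.

```python
from datetime import datetime, timedelta

def check_vehicle(entries, plate, now, curve_aware_time):
    """Return None for a first sighting, else True if the vehicle was over speed.

    entries maps a plate number to its last toll booth timestamp
    (a stand-in for the records held on the cloud server).
    """
    previous = entries.get(plate)
    entries[plate] = now  # record the current toll booth timestamp
    if previous is None:
        return None  # no match: treated as a new vehicle
    travelled = now - previous  # travelled time, as in Eq. (11)
    return travelled < curve_aware_time  # over-speed test, as in Eq. (12)

# Hypothetical example with a curve aware travel time of 41.25 minutes.
limit = timedelta(minutes=41.25)
entries = {}
start = datetime(2023, 1, 1, 10, 0)
first = check_vehicle(entries, "TN00XX0000", start, limit)
second = check_vehicle(entries, "TN00XX0000", start + timedelta(minutes=35), limit)
```

Here the first call only records the vehicle, and the second call flags it because 35 minutes is less than the 41.25 minutes the route should take.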

3 Result
The testbed for the vehicle over-speed detection system is set up at two toll
plazas in Tamil Nadu, India: Pallikonda and Ranipet. The results of this testbed are
described below.

3.1 Vehicle Detection and License Plate Localization and Text Extraction
Using the camera outside the toll booth, which was focused on the vehicles moving
forward in line, YOLO detected each vehicle and classified its type. The YOLO
result is shown in Figs. 8 and 9 at the top left (Vehicle detection) and bottom left
(Detected vehicle type). The detected vehicle image is then sent to the license
plate localization process described in Sect. 2.1.2. The bottom right panels
(Number plate cropped) of Figs. 8 and 9 show the output of this license plate
localization.

Fig. 8 Car detection and license plate information extraction


Fig. 9 Truck detection and license plate information extraction

The cropped license plate image is then fed into the CRNN text recognition
algorithm, which produces the extracted text of the license plate, as shown in the
bottom right (Predicted num) of Figs. 8 and 9. The output of the vehicle detection
and license plate extraction system, namely the vehicle type and the recognized
license plate text, is sent to the vehicle over-speed detection module.

3.2 New Curve Detection Algorithm


The detected curves are assessed using the Type 1 error, Type 2 error, and Type 2
error ratio (TIIR) metrics [22]. A Type 2 error occurs when the detected curve
extends beyond the ground truth curve by an additional segment. A Type 1 error
occurs when the curve is either not detected at all or the detected curve is missing
25, 50, or 75% of the ground truth curve.
Figure 10a, b, and c show a 25% Type 1 curve identification error, a 50% Type 1
curve identification error, and a Type 2 error, respectively. A Type 1 error is risky
because part or all of a real curve goes undetected.
Fig. 10 a 25% type 1 curve identification error b 50% type 1 curve identification error c type 2
error

The Type 2 error ratio is defined as TIIR = m/n, where m is the number of
Type 2 errors and n is the number of ground truth curves.
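As a small worked example of this ratio, the sketch below evaluates TIIR = m/n for row 2 of Table 3 (9 Type 2 errors, 48 ground truth curves). The function name `type2_error_ratio` is an assumption made for illustration only.

```python
def type2_error_ratio(num_type2_errors: int, num_ground_truth_curves: int) -> float:
    """TIIR = m / n, with m Type 2 errors and n ground truth curves."""
    if num_ground_truth_curves <= 0:
        raise ValueError("the number of ground truth curves must be positive")
    return num_type2_errors / num_ground_truth_curves


# Row 2 of Table 3: m = 9 Type 2 errors, n = 48 ground truth curves.
tiir = type2_error_ratio(9, 48)
print(round(tiir, 2))  # 0.19
```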
Table 3 lists the Type 1 and Type 2 errors, TIIR, the actual and predicted numbers
of curves, the overall distance between source and destination, performance delay,
the types of curves predicted, and the number of noise-corrected curves for locations
in India (rows 1 to 5), France (row 6), and the United States (row 7). Google Maps
road segment data has the same form all over the world, so the proposed method can
extract curves from road segments anywhere. One minor distinction is
that vehicles in India travel on the left side of the road, whereas vehicles in France
and the United States drive on the right. As a result, depending on the direction of
travel, the starting and ending points of a curve vary from country
to country. For each of the seven source-destination pairs in Table 3, the Google Maps
road segment data was collected and the starting and ending point of each curve was
identified manually; this ground truth was then compared with the curves recognized
by the proposed model. In India, a highway (row 1), a hilly terrain road (row 2),
a university road (row 3), and a tunnel road (row 5) were tested.
Row 4 of Table 3 covers a route that starts at the highway location and ends at the
hilly terrain destination. The proposed method can also extract curves
from a tunnel road; Fig. 5b shows the row 5 route through the twin tunnel in Mumbai.
Noise arises when a GPS segment is drawn incorrectly over an existing
segment. This form of noise can be corrected using the proposed approach, as
shown in Fig. 6b. However, such noise frequently misrepresents a straight road as a
curved one. Because a hilly terrain road includes more noise segments than a road
in the plains, it produces more errors of this kind, which are classified as Type 2
errors. These errors are not dangerous, because the driver is still alerted wherever
a curve exists. By contrast, failing to detect a curve with a radius of less than 60 m
is dangerous, because such roads include sharp or blind bends and are prone
to accidents. Curves of this type were successfully identified by the proposed
method. The final column of Table 3 (column 11) gives the number of curves
predicted for each location by the existing method [22]. The
existing method cannot remove curve noise and therefore reports the
noise as curves.

Table 3 Curve detection research and analysis

| S.no | Source and destination | Total distance | No. of actual curves | No. of predicted curves | Type 1 error | Type 2 error | No. of noise on path corrected | Type of detected curve | TIIR | Predicted no. of curves by ref. [22] |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.932459, 79.138573 & 13.047688, 80.081534 | 113 km | 18 | 19 | 2 (1–25%, 1–50%) | 1 | 2 | Simple curve = 17, compound = 2, reverse = 0, sharp = 0 | 0.05 | 20 |
| 2 | 12.600237, 78.596748 & 12.593029, 78.631709 | 13 km | 48 | 48 | 1–25% | 9 | 33 | Simple curve = 33, compound = 0, reverse = 2, sharp = 13 | 0.19 | 93 |
| 3 | 12.968459, 79.155885 & 12.971857, 79.163570 | 1.5 km | 6 | 6 | 0 | 1 | 8 | Simple curve = 2, compound = 0, reverse = 1, sharp = 3 | 0.17 | 15 |
| 4 | 12.931632, 79.135380 & 12.593029, 78.631709 | 91.5 km | 64 | 64 | 1–25% | 10 | 33 | Simple curve = 44, compound = 0, reverse = 2, sharp = 18 | 0.15 | 98 |
| 5 | 19.059460, 72.913796 & 18.949226, 72.840700 | 10.7 km | 5 | 5 | 0 | 1 | 0 | Simple curve = 5, compound = 0, reverse = 2, sharp = 18 | 0.05 | 5 |
| 6 | 47.082396, 3.929467 & 47.136291, 4.016592 | 12.5 km | 40 | 45 | 4 (2–25%, 2–50%) | 6 | 7 | Simple curve = 28, compound = 5, reverse = 6, sharp = 6 | 0.13 | 52 |
| 7 | 37.202597, −87.010464 & 37.312765, −86.614955 | 46.9 km | 24 | 25 | 3 (1–25%, 2–50%) | 3 | 2 | Simple curve = 21, compound = 0, reverse = 2, sharp = 2 | 0.12 | 30 |

3.3 Curve Aware Travel Time Estimation


Table 4 summarizes the study of the proposed curve-aware travel time estimation.
Column 2 gives the latitude and longitude of the source and destination
toll plazas, column 3 the toll plazas and the type of highway, column 4 the
distance between the two toll plazas, column 5 the number of curves on the
route obtained with the method of Sect. 2.2.1, column 6 the declared speed for
cars and trucks, column 7 the declared reaching times for cars and trucks between
the toll plazas, and column 8 the curve-aware reaching times obtained in
Sect. 2.2.3.

Table 4 The study of proposed curve aware travel time estimation

| S.No | Source to destination | Location information | Distance | No. of curves | Declared speed (km/h) | Declared reaching time (in minutes) | Curve-aware reaching time (in minutes) |
|---|---|---|---|---|---|---|---|
| 1 | 12.544464, 78.201390 to 12.006492, 78.080849 | Highway (Krishnakiri to Thopur toll gate) | 66.1 km | 6 | 100 (car), 80 (truck) | 39 (car), 49 (truck) | 42 (car), 54 (truck) |
| 2 | 12.006198, 78.080653 to 11.720234, 78.073370 | Mountain pass (Thopur to Omalur toll gate) | 37.7 km | 9 | 80 (car), 60 (truck) | 28 (car), 37 (truck) | 30 (car), 40 (truck) |
| 3 | 13.647111, 79.401600 to 13.672812, 79.351193 | Ghat road (Alibri to GNC toll gate) | 16.6 km | 110 | 40 (car), 40 (truck) | 28 (car), 28 (truck) | 35 (car), 43 (truck) |
| 4 | 13.672823, 79.351400 to 13.647667, 79.405564 | Ghat road (GNC to Alibri toll gate) | 17.1 km | 107 | 40 (car), 40 (truck) | 40 (car), 40 (truck) | 42 (car), 50 (truck) |
| 5 | 12.910950, 79.400920 to 12.905704, 78.951853 | Highway (Ranipet to Pallikonda toll gate) | 52.1 km | 12 | 100 (car), 80 (truck) | 31 (car), 39 (truck) | 33 (car), 42 (truck) |
| 6 | 12.905825, 78.951824 to 12.911158, 79.401029 | Highway (Pallikonda to Ranipet toll gate) | 51.8 km | 13 | 100 (car), 80 (truck) | 30 (car), 38 (truck) | 32 (car), 40 (truck) |

3.4 Vehicle Over Speed Detection System


In the Vehicle Over Speed Detection System architecture, each toll gate uses ten
booth systems and one server system. As shown in Fig. 11a, the GUI of each
toll booth system is connected to a camera that automatically captures the
vehicle's type and license plate number. The toll booth attendant can update or
correct this information. When the attendant clicks the check button,
the details are sent to the local server system, which looks up the vehicle number
and type in its local database. If the vehicle is entering for the first time, the
server stores the vehicle details and a timestamp (date and time) in the local
database and pushes the data to the next toll gate via a cloud server. Figure 11b
shows the details of a vehicle that has just entered the Pallikonda toll gate. If the
vehicle has previously registered at a toll gate, its information is already stored
locally, so the local server system calculates the average speed of the vehicle
between the two toll gates; if this exceeds the permitted speed, a fine is assessed,
as shown in Fig. 11c. If the vehicle travelled at normal speed, it is simply recorded
with its timestamp in the local server, as shown in Fig. 11d. Figure 12 shows the
Google Cloud Platform, which is used to store the vehicle data of each toll gate.
Fig. 11 a License plate no. and vehicle type entry b Vehicle entered to toll plaza first time c
Vehicle over-speed detected d Vehicle passed between toll plazas in normal speed
Fig. 12 Google cloud platform

Figure 13a demonstrates how to search the Log cloud database. The log can be
searched by date, time, vehicle number, or toll gate ID. Figure 13b shows how
to search the Violator cloud database, likewise by vehicle
identification, date, time, or toll gate ID.

Fig. 13 a RTO log search b Violator log search

3.5 Discussion
Across both toll plazas, a total of 3552 vehicles passed through the toll
booths during the two-hour test period. Figure 14 displays a bar graph of
vehicle passes broken down by booth. During the two hours of testing, two vehicles
were fined for exceeding the government-mandated speed limits of 100 km/h for
cars and 80 km/h for trucks. Considerably more vehicles would be fined
if the speed limit were set using the curve-aware travel time estimation.
According to Fig. 15, 13 to 14 vehicles would be fined if the speed limit for
cars were 90 km/h.

Fig. 14 Analysis report: booth wise vehicle pass

Fig. 15 Suggestion to reduce vehicle speed

The Pallikonda and Ranipet toll plazas were used for two hours of testing.
The test site was overseen by the Vellore branch of the RTO, Government of
Tamil Nadu. Figure 16a shows the RTO and inspector present at the
Pallikonda toll plaza during the test period. Figure 16b shows the experts testing
the vehicle over-speed detection system at the Ranipet toll plaza.

Fig. 16 Field test at a Pallikonda and b Ranipet toll plazas

4 Conclusion
Highway traffic moving at excessive speed needs to be controlled. The proposed
vehicle over-speed detection system can determine whether a vehicle travelling
between two toll plazas was moving at an excessive speed. To this end, a new
curve-finding algorithm is proposed to determine the vehicle's travel time
precisely, and this curve-aware travel time is used in the proposed over-speed
detection system. The Pallikonda and Ranipet toll plazas participated in the
real-time test-bed for a two-hour testing period under the direction of the RTO,
Government of Tamil Nadu. Two vehicles were found speeding and fined. The system
is currently being tested in two plazas; in the future, it could be expanded to all
toll plazas. In the future, the camera-based license plate extraction module will
also be replaced by a vehicle information extraction module based on the RFID
tags that are currently fitted to every vehicle in India under the brand name
FASTag.

References
1. Nayak, R. P., Sethi, S., & Bhoi, S. K. (2018). PHVA: A position based high speed vehicle detection algorithm for detecting high speed vehicles using vehicular cloud. In 2018 International Conference on Information Technology (ICIT). https://doi.org/10.1109/icit.2018.00054

2. Krishnakumar, B., Kousalya, K., Mohana, R., Vellingiriraj, E., Maniprasanth, K., & Krishnakumar, E. (2022). Detection of vehicle speeding violation using video processing techniques. In 2022 International Conference on Computer Communication and Informatics (ICCCI). https://doi.org/10.1109/iccci54379.2022.9740909

3. Zou, F., Ren, Q., Tian, J., Guo, F., Huang, S., Liao, L., & Wu, J. (2022). Expressway speed prediction based on electronic toll collection data. Electronics, 11(10), 1613. https://doi.org/10.3390/electronics11101613

4. Shen, J., Zhou, W., Liu, N., Sun, H., Li, D., & Zhang, Y. (2022). An anchor-free lightweight deep convolutional network for vehicle detection in aerial images. IEEE Transactions on Intelligent Transportation Systems.

5. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915.

6. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic Press.

7. Biswas, R., Vasan, A., & Roy, S. S. (2019). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 1–14.

8. Rajput, S. K., Patni, J. C., Alshamrani, S. S., Chaudhari, V., Dumka, A., Singh, R., Rashid, M., Gehlot, A., & AlGhamdi, A. S. (2022). Automatic vehicle identification and classification model using the YOLOv3 algorithm for a toll management system. Sustainability, 14(15), 9163. https://doi.org/10.3390/su14159163

9. Wang, W., Yang, J., Chen, M., & Wang, P. (2019). A light CNN for end-to-end car license plates detection and recognition. IEEE Access, 7, 173875–173883. https://doi.org/10.1109/access.2019.2956357

10. Huang, Q., Cai, Z., & Lan, T. (2021). A new approach for character recognition of multi-style vehicle license plates. IEEE Transactions on Multimedia, 23, 3768–3777. https://doi.org/10.1109/tmm.2020.3031074

11. Seo, T., & Kang, D. (2022). A robust layout-independent license plate detection and recognition model based on attention method. IEEE Access, 10, 57427–57436. https://doi.org/10.1109/access.2022.3178192

12. Henry, C., Ahn, S. Y., & Lee, S. (2020). Multinational license plate recognition using generalized character sequence detection. IEEE Access, 8, 35185–35199. https://doi.org/10.1109/access.2020.2974973

13. Park, S., Yu, S., Kim, J., & Yoon, H. (2022). An all-in-one vehicle type and license plate recognition system using YOLOv4. Sensors, 22(3), 921. https://doi.org/10.3390/s22030921

14. Alam, N., Ahsan, M., Based, M. A., & Haider, J. (2021). Intelligent system for vehicles number plate detection and recognition using convolutional neural networks. Technologies, 9(1), 9. https://doi.org/10.3390/technologies9010009

15. Alghyaline, S. (2022). Real-time Jordanian license plate recognition using deep learning. Journal of King Saud University-Computer and Information Sciences, 34(6), 2601–2609. https://doi.org/10.1016/j.jksuci.2020.09.018

16. Raghunandan, K. S., Shivakumara, P., Jalab, H. A., Ibrahim, R. W., Kumar, G. H., Pal, U., & Lu, T. (2018). Riesz fractional based model for enhancing license plate detection and recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28(9).

17. Dalarmelina, N. D., Teixeira, M. A., & Meneguette, R. I. (2019). A real-time automatic plate recognition system based on optical character recognition and wireless sensor networks for ITS. Sensors, 20(1), 55. https://doi.org/10.3390/s20010055

18. Singh, P., Patwa, B., Saluja, R., Ramakrishnan, G., & Chaudhuri, P. (2019). StreetOCRCorrect: An interactive framework for OCR corrections in chaotic Indian street videos. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). https://doi.org/10.1109/icdarw.2019.10036

19. Jagtap, J., & Holambe, S. (2018). Multi-style license plate recognition using artificial neural network for Indian vehicles. In 2018 International Conference on Information, Communication, Engineering and Technology (ICICET). https://doi.org/10.1109/icicet.2018.8533707

20. Ravirathinam, P., & Patawari, A. (2019). Automatic license plate recognition for Indian roads using Faster-RCNN. In 2019 11th International Conference on Advanced Computing (ICoAC). https://doi.org/10.1109/icoac48765.2019.246853

21. Khan, S. U., Alam, N., Jan, S. U., & Koo, I. S. (2022). IoT-enabled vehicle speed monitoring system. Electronics, 11(4), 614. https://doi.org/10.3390/electronics11040614

22. Li, Z., Chitturi, M., Bill, A., & Noyce, D. (2012). Automated identification and extraction of horizontal curve information from geographic information system roadway maps. Transportation Research Record: Journal of the Transportation Research Board, 2291, 80–92.

23. Horzyk, A., & Ergun, E. (2020). YOLOv3 precision improvement by the weighted centers of confidence selection. In 2020 International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/ijcnn48605.2020.9206848

24. Jayaraman, S., Esakkirajan, S., & Veerakumar, T. (2015). Digital image processing. Tata McGraw Hill publication, Indian Edition.

25. Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304. https://doi.org/10.1109/tpami.2016.2646371

26. Bains, M. S., Bhardwaj, A., Arkatkar, S., & Velmurugan, S. (2013). Effect of speed limit compliance on roadway capacity of Indian expressways. Procedia-Social and Behavioral Sciences, 104, 458–467.

27. IRC: 73. (1980). Geometric design standards for rural (Non-urban) highways. Indian Roads Congress.

28. IRC: 38. (1988). Guidelines for design of horizontal curves for highways and design tables. Indian Roads Congress.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_5

An Intelligent System for Video-Based Proximity Analysis
Sergey Antonov (1), Mikhail Bogachev (2), Pavel Leyba (2),
Aleksandr Sinitca (2) and Dmitrii Kaplun (1, 2)
(1) Department of Automation and Control Processes, Saint Petersburg
Electrotechnical University “LETI”, St. Petersburg, 197022, Russia
(2) Centre for Digital Telecommunication Technologies, Saint Petersburg
Electrotechnical University “LETI”, St. Petersburg, 197022, Russia

Sergey Antonov
Email: serg2209157.antono@yandex.ru

Mikhail Bogachev
Email: mibogachev@etu.ru

Aleksandr Sinitca
Email: amsinitca@etu.ru

Dmitrii Kaplun (Corresponding author)
Email: dikaplun@etu.ru

Keywords Walker tracing – Face recognition – Proximity analysis – Random walk model – Public space planning

1 Introduction
Boosted by the COVID-19 pandemic, digital technologies have played
an increasingly significant role in the public-health response worldwide,
including contact tracing. Budd et al. [12] provide a comprehensive review of
digital innovations developed in response to COVID-19 worldwide,
including legal, ethical and privacy barriers to their implementation, as well
as organizational and workforce restrictions. The review covers
technologies developed in response to five public-health needs:
epidemiological surveillance, rapid case identification, control of
community transmission, communication of essential medical information
and clinical support [5].
Interrupting community transmission requires rapid tracing and
quarantining of contacts in order to prevent further transmission.
Technologies supporting such activities are largely based on proximity
tracing [17], which is usually implemented using smartphone apps [57,
59] and low-power Bluetooth technologies. Hossain et al. [18] recently
proposed a B5G framework that exploits the high throughput and low latency
of the modern 5G network standard to exchange chest X-ray [20] or CT scan
images [41] for early instrumental detection of COVID-19, as well as
to develop a mass surveillance system to control and manage social
distancing, mask wearing, and body temperature monitoring. This
approach belongs to the broader family of AI-based integrated emergency
response solutions that have attracted increasing interest in recent years [40, 42, 44].
Privacy is one of the major concerns in this context, strongly limiting
the applicability of various solutions. As a prominent example, Norway has
stopped using the Smittestopp app and switched to the Bluetooth approach
[60]. Several international frameworks with various systematic approaches
to privacy preservation are emerging, including Decentralized Privacy-
Preserving Proximity Tracing [58], the Pan-European Privacy-Preserving
Proximity Tracing initiative [61] and the joint Google–Apple framework
[56].
A key limitation of contact-tracing apps such as those mentioned above
is that they require a large proportion of the population to use the app.
In practice, their effectiveness is strongly limited by
smartphone ownership, user compliance, and technical compatibility [12].
An alternative approach, which can be more effective in a variety of
scenarios, is proximity tracing based on video surveillance.
Only a few works address video surveillance in the context
of the COVID-19 pandemic. Punn et al. [38] propose a framework that uses
the YOLO v3 object detection model to detect humans and the Deepsort
approach to distinguish between them, further
tracking the identified persons according to their assigned IDs. The results
of the YOLO v3 model are further compared with other popular
convolutional neural network architectures, such as SSD (Single Shot
Detector), R-CNN (Region-Based CNN) and their modifications. Rezaei et
al. [39] use a YOLOv4-based framework and inverse perspective mapping
to improve the accuracy of personal identification for social
distance tracking in the presence of disturbance factors such as crowd
occlusion, partial visibility, and lighting variations. They also provide a risk
assessment scheme based on the statistical analysis of personalized
movement trajectories and the rate of social distancing violations.
As with mobile apps for proximity tracing, any solution
based on video surveillance needs to address privacy concerns. In this paper
we propose a framework that builds on the ideas of object detection and
trajectory analysis from the literature on pedestrian tracking,
but also integrates elements that address privacy issues:
a facial recognition system that maps faces to anonymized IDs, and the
construction of an anonymized potential spread graph, which can be used in
scenarios such as contact tracing and epidemiological surveillance.
Now, more than two years after the onset of the pandemic, public
attention is increasingly shifting towards finding optimal exit strategies,
including adaptation of the technologies that were rapidly deployed
earlier in the course of the pandemic, and finding their place in the post-
pandemic society. Here we show explicitly how the proposed AI-based framework
for proximity tracing based on video surveillance in public places
can be used in different scenarios, ranging from individual contact
tracing and epidemiological surveillance of crowds to improved planning of
public spaces.
The existing body of work, e.g., on automatic pedestrian behavior analysis,
can be adapted to this context [52]. These approaches usually employ
various models for object detection. However, the pandemic has largely
changed our vision of the goals to be achieved in public space
planning. There is compelling evidence that various social distancing
measures also reduced the spread of other infectious diseases such as the
common cold or flu. Even outside of the pandemic context, these account for
around 166 million lost working days in the U.S. alone, a figure that nearly
doubles when parents who miss work because of colds caught by their children
are taken into account. Therefore, adapting the technologies widely used
during the COVID-19 pandemic to reduce community transmission
of other respiratory diseases, such as the common cold and flu, could
at least partially reduce these losses.
The rest of the paper is organized as follows. Section 2 presents an
overview of the proposed framework and the corresponding video data
processing pipeline. Section 3 focuses on the proximity networks, which
can be used in a variety of scenarios to address public-health needs.
Section 4 describes the evaluation of our approach on a series of videos
captured by street surveillance cameras. Section 5 introduces statistical
quantities that are associated with the risks of community transmission and
discusses how they could be used for future improvements in public space
planning aiming at the reduction of community transmission risks in the
post-pandemic society.

2 Overview of the Framework


A schematic overview of the proposed framework is presented in Fig. 1.
Fig. 1 Diagram of workflow

The proposed framework contains four main modules that are
responsible for person detection, distance calculation, face recognition and
network construction, respectively. The first three are repeated for
each frame, while the last one combines all information gathered from the
entire scene.
Frames are fed into a trained convolutional neural network model,
“ssdlite mobilenet v2 coco”, for object detection. The output of the model
contains the coordinates of the bounding boxes of all persons detected in the frame.
To find the actual coordinates of each person in a frame relative to other
people, the bounding box coordinates are passed to the OpenCV computer
vision library, which performs a bird's eye projection using the homography
matrix. The next step is to link the coordinates with individual person IDs by
matching them with previous frames and to update the trajectories. To facilitate
the linkage, whenever a position does not appear in close proximity
to the trajectory of an already identified person, a facial recognition algorithm is
applied to the cropped image for identification. To calculate
distances between people, a VP-tree is built and a modified nearest
neighbor algorithm is applied to this tree.
Once individual trajectories have been obtained for each detected person
(identified by a unique ID) from the video analysis, the quantities of
interest for contact tracing include the groups of people who appear in close
proximity to each person and the durations for which they remain in
proximity.

3 Construction of Proximity Networks


3.1 People Detection and Accuracy Evaluation
A common approach to detecting individual persons in video
obtained from fixed surveillance cameras is based on convolutional neural
networks. There are several approaches to neural network training,
among them supervised, unsupervised and reinforcement learning. While
supervised learning generally leads to superior accuracy and
performance, it requires large amounts of data at the learning stage. Datasets
used for network training should contain “ground truth” information,
including segmentation, localization and object classification,
typically summarized in associated annotation files. Among multiple
variants, convolutional neural networks [27] are a common
solution for object detection.
The choice of a particular solution and its validation are largely based on
accuracy metrics such as Precision (Prc), Recall (Rec), Intersection over
Union (IoU) and mean Average Precision (mAP). Figure 2 illustrates the
IoU, a measure based on the Jaccard index that evaluates the overlap between
the reference and the predicted bounding boxes.
Fig. 2 Intersection over Union (IoU), a measure based on the Jaccard index that
evaluates the overlap between the reference (indicated by green border) and the
predicted (indicated by red border) bounding boxes
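The IoU of two axis-aligned bounding boxes can be computed as in the sketch below. It assumes boxes given as `(x_min, y_min, x_max, y_max)` tuples, a convention chosen here for illustration and not taken from the chapter.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


reference = (0, 0, 2, 2)
predicted = (1, 1, 3, 3)
print(iou(reference, predicted))  # 1/7, about 0.143: below a 0.5 threshold
```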

To obtain the accuracy metrics, the IoU is next compared against a fixed
threshold $t$, which equals 0.5 in our example. When $\mathrm{IoU} \geq t$, the
decision is made in favor of hypothesis $H_1$; otherwise the decision is made in
favor of hypothesis $H_0$. The accuracy of the decision-making procedure is
quantified by the true positive (TP) rate, indicating the rate of
decisions in favor of hypothesis $H_1$ under the validity of hypothesis $H_1$,
and by the false positive (FP) rate, indicating the rate of decisions in favor
of hypothesis $H_1$ under the validity of hypothesis $H_0$ (see, e.g., [48] and
references therein). In a numerical treatment, based on the above rates, one
can estimate precision

$$\mathrm{Prc} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \qquad (1)$$

and recall

$$\mathrm{Rec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \qquad (2)$$

where FN denotes false negatives, i.e., decisions in favor of $H_0$ under the validity of $H_1$.

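Similarly to the approach taken in [16], we also calculate the widely
used detection accuracy measure mAP, obtained as the area under the
$\mathrm{Prc}(\mathrm{Rec})$ curve. By definition, both precision and recall are bounded
between 0 and 1, and thus mAP is also bounded between 0 and 1. It is
common to estimate mAP from interpolated curves

$$\mathrm{mAP} = \int_0^1 \mathrm{Prc}_{\mathrm{interp}}(\mathrm{Rec}) \, d\mathrm{Rec} \qquad (3)$$

where $\mathrm{Prc}_{\mathrm{interp}}(\mathrm{Rec})$ is the interpolation of the $\mathrm{Prc}(\mathrm{Rec})$ curve.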
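A minimal numerical sketch of these metrics is given below. It is an illustration under assumptions rather than the evaluation code used in the chapter: precision and recall are computed from hypothetical TP, FP and FN counts, and the area under the curve uses all-point interpolation (each precision value is replaced by the maximum precision at any equal or higher recall), which is one common choice; the chapter does not say which interpolation was used. For a single object class this area is the average precision, and mAP is its mean over classes.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def interpolated_average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    `recalls` must be sorted in ascending order and aligned with `precisions`.
    """
    # Pad the curve so that it spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolation: make precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical counts and a hypothetical two-point precision-recall curve.
print(precision_recall(tp=8, fp=2, fn=2))                       # (0.8, 0.8)
print(interpolated_average_precision([0.5, 1.0], [1.0, 0.5]))   # 0.75
```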

3.2 Finding Coordinates of Each Person


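Before proceeding to walking trajectories, one has to transform from
the homogeneous (also known as projective) coordinates to the world
coordinates (corresponding to the bird's eye view) by means of projective
geometry techniques [31]. For simplicity, each detected object
represented by a bounding box is associated with its pivot point $(x_p, y_p)$,
resulting in a simplified transformation expressed by a 3 × 3 matrix

$$\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x_p \\ y_p \\ 1 \end{pmatrix} \qquad (4)$$

In order to transform from homogeneous coordinates to world
coordinates, one has to divide the resulting coordinates by $w'$. Accordingly,
the procedure of finding the location of each person in world coordinates
can be expressed as

$$\begin{pmatrix} x_w \\ y_w \\ 1 \end{pmatrix} = \frac{1}{w'} H \begin{pmatrix} x_p \\ y_p \\ 1 \end{pmatrix} \qquad (5)$$

where $H$ is the projection matrix, sometimes also referred to as the
homography matrix, which can be estimated using a number of approaches
[33], such as direct linear transformation (DLT) and robust estimation
(RANSAC). Assuming that the pivot point of each bounding box is
located in the center of its lower edge, it can be found as

$$x_p = \frac{x_{\min} + x_{\max}}{2} \qquad (6)$$

$$y_p = y_{\max} \qquad (7)$$

where $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ are the bounding box coordinates,
with the image $y$ axis pointing downward. Thus,
for a given homography matrix, transformation to the world coordinates can
be expressed as

$$x_w = \frac{h_{11} x_p + h_{12} y_p + h_{13}}{h_{31} x_p + h_{32} y_p + h_{33}}, \qquad y_w = \frac{h_{21} x_p + h_{22} y_p + h_{23}}{h_{31} x_p + h_{32} y_p + h_{33}} \qquad (8)$$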
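The pivot point and its projection to the bird's eye view can be sketched in plain Python as follows. The homography entries used here are made-up numbers for illustration; in practice the matrix comes from camera calibration. The helper names `pivot_point` and `to_world` are assumptions, and image coordinates are assumed to have the y axis pointing downward, so the lower edge of a box has the largest y value.

```python
def pivot_point(box):
    """Center of the lower edge of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)


def to_world(point, H):
    """Apply a 3x3 homography H (nested lists) to an image point and
    divide by the third homogeneous coordinate."""
    x, y = point
    x_h = H[0][0] * x + H[0][1] * y + H[0][2]
    y_h = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (x_h / w, y_h / w)


# Hypothetical homography and bounding box, chosen only to make the arithmetic easy.
H_example = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [0.0, 0.0, 2.0],
]
box = (10, 20, 30, 60)
foot = pivot_point(box)            # (20.0, 60)
print(to_world(foot, H_example))   # (20.5, 60.5)
```

Using the center of the lower edge as the pivot places the point where the person touches the ground, which is the plane the homography maps.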

3.3 Extracting Walking Trajectories


Constructing walking trajectories from locations obtained
in individual video frames requires linking the location points
that correspond to the same person in consecutive frames. The first step
is commonly a search for the nearest neighbor points. This
is usually performed using one of several algorithms, such as linear (full)
search, search in kd-trees [35], search in BSP-trees [28], LS-hashing [36],
the keyword method [50] and search in VP-trees [54]. As linear search is
computationally inefficient due to its linear complexity of $O(n)$ per query,
alternative algorithms are in focus. The LS-hashing algorithm is based on finding a
simple hash function that can be used instead of direct comparison of point
coordinates. It is highly efficient once such a hash function is
known, but finding one is not straightforward in
many real-world scenarios. The idea of the keyword algorithm is to store a
list of objects with rarely observed coordinates, which also limits its
applicability. Therefore, in our case the remaining options, namely the kd-tree,
BSP-tree and VP-tree search algorithms, are of greater interest. In the
following, we focus on the VP-tree search, since it searches for other points
in a circular vicinity around the current pivot point, which is relevant to the
contact proximity analysis problem.

3.3.1 VP-Tree Construction


Like most tree construction algorithms, building a hierarchical VP-
tree is a recursive procedure. In the first iteration of the algorithm, a vantage
point is selected and the average distance from this point to all other points
is calculated. The input set of points is divided into two subtrees: a point is
assigned to the inner (left) subtree if its distance
to the vantage point is less than the average, and to the
outer (right) subtree otherwise. The same operation is repeated for each
subtree. Thus, each node in the tree stores a vantage point and a radius that
bounds the points belonging to its inner subtree. The complexity of the tree
construction algorithm is $O(n \log n)$.

3.3.2 Finding Nearest Neighbor in VP-Tree


The algorithm for finding the nearest neighbor to a query point $q$ is also
recursive. At any given step, one focuses on a tree node that has a vantage
point $v$ and a radius $r$. Let us assume that $q$ is located at some
distance $d$ from $v$. If $d$ is below $r$, the search descends recursively
into the inner subtree of the node, which contains the points closer to the
vantage point than the radius $r$. Otherwise, it descends into the outer
subtree, which contains the points located further from $v$ than $r$. Upon
reaching a leaf subtree, we perform a linear search among its points. When
constructing the trajectory of a single walker, $q$ is obtained from the
coordinates of this person in the previous frame, and the desired nearest
point is the coordinates of the same person in the current frame.
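The construction and search procedures of Sects. 3.3.1 and 3.3.2 can be sketched together as follows. This is a minimal illustration rather than the implementation used in this work: the names `Node`, `build` and `nearest` are hypothetical, the first point of each subset is simply taken as the vantage point, and the search additionally backtracks into the second subtree whenever the current best distance crosses the node radius, which is an assumption added here to keep the result exact.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    vantage: Point
    radius: float             # mean distance from the vantage point to the rest
    inner: Optional["Node"]   # points closer to the vantage point than radius
    outer: Optional["Node"]   # all remaining points


def build(points: List[Point]) -> Optional[Node]:
    """Recursively build a VP-tree, splitting each subset at the mean distance."""
    if not points:
        return None
    # Assumption: the first point of the subset serves as the vantage point.
    vantage, rest = points[0], points[1:]
    dists = [math.dist(vantage, p) for p in rest]
    radius = sum(dists) / len(dists) if dists else 0.0
    inner = [p for p, d in zip(rest, dists) if d < radius]
    outer = [p for p, d in zip(rest, dists) if d >= radius]
    return Node(vantage, radius, build(inner), build(outer))


def nearest(node: Optional[Node], q: Point,
            best: Optional[Tuple[float, Point]] = None
            ) -> Optional[Tuple[float, Point]]:
    """Return (distance, point) for the stored point closest to q."""
    if node is None:
        return best
    d = math.dist(q, node.vantage)
    if best is None or d < best[0]:
        best = (d, node.vantage)
    # Descend first into the subtree on the same side of the radius as q.
    if d < node.radius:
        first, second = node.inner, node.outer
    else:
        first, second = node.outer, node.inner
    best = nearest(first, q, best)
    # Backtrack only if a closer point could lie across the node radius.
    if abs(d - node.radius) <= best[0]:
        best = nearest(second, q, best)
    return best


# Usage: match a walker's previous position to the current frame.
random.seed(1)
current_frame = [(random.uniform(0, 20), random.uniform(0, 20)) for _ in range(50)]
tree = build(current_frame)
distance, match = nearest(tree, (10.0, 10.0))
print(f"nearest point {match} at distance {distance:.2f}")
```

Splitting at the mean distance follows the description above; splitting at the median instead would yield a more balanced tree.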

3.4 Finding People That Appear in Close Proximity


In the context of contact tracing, the next step is typically finding all
points that appear in close proximity, usually determined by a circular area
of a certain radius around each person, first for a given video frame
corresponding to a single point in time. Since the original VP-tree based
search algorithm focuses on finding a single nearest neighbor only, it should
be generalized to search for potentially multiple nearest neighbors within a
circle of a given radius. In this context, there are several possible
situations:
1. The entire search area is included in the internal subtree:

$$d + r_s < r_n \qquad (9)$$

where $d$ is the distance from the center of the node to the search point,
$r_s$ is the search radius, and $r_n$ is the node radius, determining the
border of the inner subtree. If this condition is met, the search can
continue in the internal subtree only. The world scale of the distances
between two bird's eye view points is determined using the size $s$ of the
camera pixel obtained from the calibration procedure, and the distance
between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated as

$$d = s \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \qquad (11)$$

2. The entire search area is included in the external subtree:

$$d - r_s > r_n \qquad (12)$$

If this condition is met, the search can continue in the external subtree
only.
3. The search area is distributed over both subtrees. In this case, the
search is performed in both subtrees.

The complexity of searching for nearest neighbors is $O(\log n)$.
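The three situations listed above translate into a short recursive range search. The following sketch is illustrative only: the tuple-based node layout and the names `build` and `within` are assumptions, the vantage point is taken as the first point of each subset, and the two pruning tests are written as "distance plus search radius below node radius" and "distance minus search radius above node radius".

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical node layout: (vantage, radius, inner_subtree, outer_subtree).


def build(points: List[Point]):
    """Build a VP-tree as nested tuples, splitting at the mean distance."""
    if not points:
        return None
    v, rest = points[0], points[1:]
    dists = [math.dist(v, p) for p in rest]
    r_n = sum(dists) / len(dists) if dists else 0.0
    inner = [p for p, d in zip(rest, dists) if d < r_n]
    outer = [p for p, d in zip(rest, dists) if d >= r_n]
    return (v, r_n, build(inner), build(outer))


def within(node, q: Point, r_s: float,
           found: Optional[List[Point]] = None) -> List[Point]:
    """Collect all stored points no further than r_s from q."""
    found = [] if found is None else found
    if node is None:
        return found
    v, r_n, inner, outer = node
    d = math.dist(q, v)
    if d <= r_s:
        found.append(v)
    if d + r_s < r_n:        # case 1: search disc lies inside the node radius
        within(inner, q, r_s, found)
    elif d - r_s > r_n:      # case 2: search disc lies outside the node radius
        within(outer, q, r_s, found)
    else:                    # case 3: search disc overlaps both subtrees
        within(inner, q, r_s, found)
        within(outer, q, r_s, found)
    return found


# Usage: people standing on a 1 m grid, 2 m proximity radius around (4.5, 4.5).
people = [(float(i % 10), float(i // 10)) for i in range(100)]
tree = build(people)
print(len(within(tree, (4.5, 4.5), 2.0)), "people within 2 m")
```

If the search point itself is stored in the tree, it is returned as well (at distance zero) and has to be discarded by the caller.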

3.5 Face Analysis and Recognition


Facial recognition is a long-studied problem that has attracted increasing
attention in recent years, leading to considerable advances in methodology
and algorithm development (see, e.g., [4] for a detailed review). One of the
key issues regarding the widespread use of facial recognition technologies
is privacy, and thus methodological approaches that are capable of
integrating anonymization in a systematic way are favorable. In this work,
detected faces are mapped to anonymized IDs, which can then be stored in the
system, allowing for identities to be revealed in a controlled way.
Face analysis and recognition is generally a stepwise procedure, including
finding and selecting all faces in the images, their initial preprocessing
and alignment, identification of unique facial features, and their
comparison against a database of known people. This procedure is typically
implemented as a pipeline in which the steps are independent of each other,
so that the technique used at each step can be chosen separately from a
number of available solutions. Since there is a large body of recent work in
this field, we only provide a brief overview of the available solutions and
their pros and cons.
Face detection techniques largely rely upon several well-established
approaches. In retrospect, the Viola-Jones method [53], while being one of
the first widely available and computationally efficient solutions, was
characterized by relatively high false detection rates, as well as by the
requirement of frontal facial images and low robustness against occlusion.
The Histogram of Oriented Gradients (HOG) method [15] is based on analyzing
the gradient of the binarized image, followed by its segmentation into small
segments and finding those where the arrangement of gradients is close to a
known facial image, often denoted as the HOG pattern. The key advantages of
this method are its computational efficiency, reasonable effectiveness for
slightly non-frontal images, and moderate robustness against occlusions,
while its major drawback is the requirement of high resolution images, and
failure with low resolution images due to discreteness effects.
In recent years, Multi-Task Cascaded Convolutional Networks (MTCNN) [55]
became one of the most popular solutions for finding faces in images based
on the DNN (Deep Neural Network) approach. After an initial image rescaling
step, the algorithm proceeds in three consecutive stages: the Proposal
Network (or P-Net) looks for candidate facial regions, the Refine Network
(or R-Net) filters the bounding boxes, and the Output Network (or O-Net)
localizes facial landmarks (such as eyes and mouth). Another recent and
powerful alternative is the MMOD algorithm introduced by Davis E. King and
implemented in the Dlib library [23]. Since it appears to be among the most
accurate of the methods discussed above, while also working well for
different face orientations and even under substantial occlusion, it has
been chosen as the face detector in this work.
However, it is also important to note that deep learning algorithms, while
typically outperforming other approaches in terms of accuracy, require
considerably more computational resources, which may be a limiting factor
for their application in resource-constrained scenarios, with large amounts
of data, or under online analysis requirements.
Face rescaling and alignment is an intermediate step between face
detection and face recognition. Common solutions are based on finding
specific face landmarks that can be used in the rescaling and alignment
procedure as pivot points.
Face recognition techniques are also well developed. Early approaches were
largely based on algorithms such as Eigenface [21], Fisherface [3] and
Local Binary Patterns Histogram (LBPH) [46]. As these algorithms proved to
have numerous drawbacks, here we follow a more recent approach based on
Convolutional Neural Networks (CNN), which remain among the most effective
and reliable solutions to date. Prominent examples include Google FaceNet
[45], based on convolutional layers learning face representations directly
from the image. FaceNet was evaluated on the Labeled Faces in the Wild (LFW)
[19] dataset, demonstrating robustness to illumination, pose, and other
variable conditions. Other notable examples include OpenFace [2]. In this
work, we also used a neural network based solution implemented in the Dlib
library.
Finally, recognized faces should be associated with the IDs of particular
persons. This is a typical machine learning classification problem. If no
match is found, a new ID is added to the database. In this work, we used a
KNN classifier, although many alternative classifiers would also be
suitable.
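The assignment of anonymized IDs can be illustrated with a nearest-neighbor rule over face descriptors. The sketch below is a simplification under stated assumptions: it uses a single neighbor (k = 1) and one reference descriptor per ID, the class name `AnonymousIdRegistry` is hypothetical, and the distance threshold of 0.6 is an assumed value that would have to be tuned for the descriptor actually used.

```python
import math
from typing import List, Sequence


class AnonymousIdRegistry:
    """Maps face descriptors to integer IDs without storing identities."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold               # assumed matching distance
        self.known: List[List[float]] = []       # reference descriptor per ID

    def assign(self, descriptor: Sequence[float]) -> int:
        """Return the ID of the closest known face, or register a new ID."""
        if self.known:
            dists = [math.dist(descriptor, ref) for ref in self.known]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] < self.threshold:
                return best
        # No match found: a new ID is added to the database.
        self.known.append(list(descriptor))
        return len(self.known) - 1


# Usage with toy 4-dimensional descriptors (real descriptors are much longer).
registry = AnonymousIdRegistry()
first = registry.assign([0.0, 0.0, 0.0, 0.0])
same = registry.assign([0.1, 0.0, 0.0, 0.0])
other = registry.assign([1.0, 1.0, 1.0, 1.0])
print(first, same, other)
```

Because only integer IDs and descriptors are kept, the mapping from an ID to a real identity can be stored separately and disclosed in a controlled way.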

4 Experiments
4.1 Combined Dataset for Neural Network Training
Next, we evaluated the approach using several sample videos recorded by
surveillance cameras in busy outdoor public places. For neural network
training, we combined two datasets that are among the most popular for
training object detection algorithms, PASCAL VOC [16] and COCO [25].
Although they differ in the amount of annotation, both contain sufficient
information to extract bounding boxes around detected people. Figure 3 shows
the histogram of the person count per image for the resulting dataset,
indicating that the majority of images contained a single person, while a
significant number of images contained up to twenty different people.
Fig. 3 Distribution of the number of people per image

4.2 Training a Neural Network Model


We used the combined dataset described above to train the convolutional
neural network model “ssdlite_mobilenet_v2_coco”, which is a lightweight
version of SSD (Single Shot MultiBox Detector) [26] based on the joint
architecture of SSDLite and MobileNetV2 [43], characterized by high object
detection accuracy (evaluated by mAP) and computational performance in
various image analysis problems. Figure 4 shows the loss function obtained
during model training, while Fig. 5 shows the mean average precision of
object detection (mAP), together indicating that the chosen neural network
model detects objects with high accuracy.
Fig. 4 Evolution of the loss during model training
Fig. 5 Evolution of the mAP@0.5IoU during model training
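For reference, the mAP@0.5IoU criterion counts a detection as correct when the intersection over union (IoU) between the predicted and the annotated bounding box is at least 0.5. A minimal sketch of this test is given below; the function names and the (x_min, y_min, x_max, y_max) box layout are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1])
             - intersection)
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted: Box, annotated: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct if its IoU reaches the threshold."""
    return iou(predicted, annotated) >= threshold


print(is_true_positive((10, 10, 60, 110), (12, 8, 62, 112)))
```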

4.3 Frame Processing


When processing a video frame, the system first detects people using the
previously trained neural network model “ssdlite_mobilenet_v2_coco”.
The bounding box coordinates of the detected people are then subjected to
the homography matrix based transformation and the nearest neighbor search
algorithm, followed by the face detection and recognition algorithms, as
described above. Figure 6 exemplifies a processed video frame with indicated
bounding boxes, where those appearing within close proximity (for an
arbitrary 2 m threshold) are shown in red, while others are shown in green.
Figure 7 shows the corresponding bird's eye view for the same frame, using
the same color notation.
Fig. 6 Results of processing frame 1

Fig. 7 Bird's eye view of frame 1
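The per-frame processing chain (bounding boxes, homography, proximity test) can be sketched as follows. This is an illustration under stated assumptions: the bottom center of each bounding box is taken as the person's position on the ground, the homography is a 3 x 3 matrix given as nested lists, `metres_per_unit` stands for the scale obtained from calibration, and all function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def to_ground(H: Sequence[Sequence[float]], x: float, y: float) -> Tuple[float, float]:
    """Apply a 3x3 homography to an image point (homogeneous coordinates)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def foot_point(box: Box) -> Tuple[float, float]:
    """Assumed ground contact point: bottom center of the bounding box."""
    x_min, _, x_max, y_max = box
    return (x_min + x_max) / 2.0, y_max


def close_pairs(boxes: List[Box], H: Sequence[Sequence[float]],
                metres_per_unit: float, threshold_m: float = 2.0
                ) -> List[Tuple[int, int]]:
    """Index pairs of people closer than threshold_m in the bird's eye view."""
    ground = [to_ground(H, *foot_point(b)) for b in boxes]
    return [(i, j) for i, j in combinations(range(len(ground)), 2)
            if metres_per_unit * math.dist(ground[i], ground[j]) < threshold_m]


# Usage with an identity homography and a scale of 1 cm per unit.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
boxes = [(0, 0, 10, 100), (100, 0, 110, 100), (1000, 0, 1010, 100)]
print(close_pairs(boxes, identity, metres_per_unit=0.01))
```

For crowded frames, the exhaustive pair loop can be replaced by the VP-tree range search described in Sect. 3.4.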

Another example is shown in Figs. 8 and 9, respectively.


Fig. 8 Results of processing frame 2

Fig. 9 Bird's eye view of frame 2

4.4 Contact Network Graph


In epidemiological contact tracing, an important quantity that strongly
influences the transmission risk is the duration of contact between each pair
of individuals. The corresponding framework for a given public space can
be represented by a weighted graph, where the nodes correspond to
individual persons, while the weights of the links between them represent
contact durations. Figure 10 exemplifies a contact graph for a representative
short scene, where link weights represent contact durations in seconds.

Fig. 10 Contact graph

In order to reduce the risk of infection transmission in public spaces, it
is essential to reduce the duration of contacts. Alternatively, under the
assumption that contact duration above a certain threshold is associated
with increased risk of infection transmission, one can focus on the reduction
of the number of links above a certain threshold weight, i.e., the number of
pairs of individuals that appear in close proximity to each other for
durations above a certain threshold value.
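A weighted contact graph of this kind can be accumulated directly from the per-frame proximity results. The sketch below assumes that each frame yields a list of ID pairs found in close proximity and that the frame rate `fps` is known; the function names are hypothetical, and durations are total (not necessarily contiguous) times in proximity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Pair = Tuple[int, int]


def contact_durations(frames: Iterable[Iterable[Pair]], fps: float) -> Dict[Pair, float]:
    """Total time in seconds that each pair of IDs spends in close proximity."""
    counts: Dict[Pair, int] = defaultdict(int)
    for pairs in frames:
        # Sort each pair so that (a, b) and (b, a) denote the same link.
        for link in {tuple(sorted(p)) for p in pairs}:
            counts[link] += 1
    return {link: n / fps for link, n in counts.items()}


def links_above(durations: Dict[Pair, float], min_seconds: float) -> int:
    """Number of links whose weight exceeds the given duration threshold."""
    return sum(1 for d in durations.values() if d > min_seconds)


# Usage: four frames recorded at 2 frames per second.
frames = [[(1, 2)], [(2, 1), (1, 3)], [(1, 2)], []]
graph = contact_durations(frames, fps=2.0)
print(graph, links_above(graph, min_seconds=1.0))
```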

5 Further Interpretation and Outlook Towards Adaptation to the Post-pandemic Society Goals
Now, more than two years after the onset of the pandemic, public attention
is increasingly shifting towards finding optimal exit strategies, including
adaptation of the technologies that were rapidly deployed earlier in the
course of the pandemic, and finding their place in the post-pandemic
society. In the following, we consider how the above solutions could be used
in scenarios other than individual contact tracing or epidemiological
surveillance of crowds, for example, to improve the planning of public
spaces.
Planning of public spaces strongly affects the probability of congestion,
the formation of crowds and the organization of queues, which in turn
largely determine the total number of contacts that remain in close
proximity above a certain duration. There are a number of well-known
mathematical models widely used to simulate collective dynamics, from
particle movement to walking trajectories. One of the simplest models for
simulating walking trajectories is a 2D random walk characterized by random
increments. In real-world settings, purely random increments are unlikely,
due to inevitable interactions between walkers and stationary objects, as
well as between walkers themselves, which lead to adjustments of their
trajectories, and thus correlated and self-avoiding walks appear more
relevant. For a recent literature overview of the problem from a
multidisciplinary perspective, we refer to [30, 37, 51], as well as to
several relevant special cases, including the presence of obstacles [47] and
compactness constraints [24], capable of representing typical features of
real-world public space settings.
In the following, we consider several short scenes, calculate statistics
for the quantities of interest and compare them against similar results for
both uncorrelated and correlated random walk models obtained by
computer simulations.
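For orientation, the uncorrelated reference model can be simulated in a few lines. The sketch below is not the simulation code used in this study: it assumes independent Gaussian increments, a fixed set of walkers that are all present for the whole scene (no arrivals or departures), and hypothetical names and parameter values throughout.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def simulate_walkers(n_walkers: int, n_steps: int, step: float = 0.5,
                     size: float = 20.0, seed: int = 0) -> List[List[Point]]:
    """Positions of all walkers for each time step of a 2D random walk."""
    rng = random.Random(seed)
    pos = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_walkers)]
    frames = [list(pos)]
    for _ in range(n_steps):
        # Independent Gaussian increments in x and y (no correlations).
        pos = [(x + rng.gauss(0, step), y + rng.gauss(0, step)) for x, y in pos]
        frames.append(list(pos))
    return frames


def contact_steps(frames: List[List[Point]], threshold: float) -> Dict[Tuple[int, int], int]:
    """Number of time steps each pair of walkers spends closer than threshold."""
    counts: Dict[Tuple[int, int], int] = {}
    for pos in frames:
        for i, j in combinations(range(len(pos)), 2):
            if math.dist(pos[i], pos[j]) < threshold:
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts


frames = simulate_walkers(n_walkers=10, n_steps=200)
print(len(contact_steps(frames, threshold=2.0)), "pairs came into proximity")
```

Replacing the independent increments with long-range correlated ones gives the second reference model considered in this section.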
Figure 11 shows the pairwise contact duration matrices representing the
duration of time each pair of individuals remains in close proximity, for a
sample proximity threshold value. To simplify the comparison between
different scenes, as well as between video analysis based and random walk
simulation based results, we define the proximity threshold as a certain
quantile of the distance distribution for all walkers that can be observed
simultaneously within the scene. This kind of normalization is a common
approach to the comparison of datasets at different scales, see, e.g., [7].
In this particular example, we have chosen the threshold such that, on
average, every fifth pairwise distance falls below it.
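The quantile-based threshold can be computed directly from the pooled pairwise distances. In the sketch below, the parameter name `every_nth` is hypothetical: a value of 5 selects the distance below which roughly every fifth pairwise distance falls. Only walkers observed in the same frame are paired, and the inclusive quantile definition of the Python standard library is an assumed choice.

```python
import math
import statistics
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def pairwise_distances(frames: List[List[Point]]) -> List[float]:
    """Distances between all walkers observed simultaneously in each frame."""
    return [math.dist(a, b) for pos in frames for a, b in combinations(pos, 2)]


def proximity_threshold(frames: List[List[Point]], every_nth: int = 5) -> float:
    """Distance below which about one in every_nth pairwise distances falls."""
    dists = pairwise_distances(frames)
    # The first cut point of an n-quantile split is the 1/n quantile.
    return statistics.quantiles(dists, n=every_nth, method="inclusive")[0]


# Usage: a single frame with four walkers standing on a line.
frames = [[(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]]
print(proximity_threshold(frames, every_nth=5))
```

Because the threshold is defined relative to the observed distance distribution, scenes recorded at different scales become directly comparable.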

Fig. 11 Examples of pairwise contact duration matrices for six representative short
scenes captured from a street video surveillance camera, for a sample proximity
threshold value. Matrix sizes are determined by the total number of individuals
captured in each scene, with their total pairwise duration of proximity (in
seconds) indicated by color

For a statistical characterization of the contact graph properties, a
straightforward approach is to consider the distribution of contact
durations obtained for all possible pairs of individuals. Figure 12
shows the statistics for six different short scenes, including complementary
cumulative distribution functions (CCDFs) indicating the probabilities that
inter-arrival times and durations that each individual remains within the
scene exceed the function argument, as well as similar quantities of the
pairwise contact durations for all possible pairs of individuals, each for
three different threshold values, corresponding to , 5 and 10,
respectively. The figure shows that the normalized distributions expressed
in the units of the average contact durations obtained separately for each
scene and each threshold value, tend to follow a simple exponential.
Fig. 12 Statistics for six different short scenes, including (a) CCDFs of inter-arrival
times and (b) durations that each individual remains within the scene, as well as (c)
distributions of the contact times for all possible pairs of individuals, each for three
different thresholds , 5 and 10. Straight black lines show a simple exponential,
while dashed colored curves show results for simulated random walks with
parameters similar to those of the observational data (blue curves correspond to the
absence of correlations, while red curves correspond to long-range correlated
random walks with Hurst exponent $H > 0.5$)

This is generally an expected result, which is in good agreement with
similar quantities obtained by computer simulations of random walks
characterized by the same average inter-arrival times and durations (for
simplicity, exponential distributions of arrivals and durations within the
scene have been considered). The theoretical background behind this
distribution is rather simple and can be explained via an event-based
concept, considering any pair of individuals following random trajectories
coming into proximity as a random event. In this simplest scenario, these
events constitute Poisson processes with parameters generally depending on
both inter-arrival and duration times, as well as average distances between
different walkers and step sizes performed by a single walker in a given
time unit (e.g. one second). However, since the inter-event distribution for
any Poisson process decays exponentially with only one free parameter, namely
the average value, dividing each distribution by its average value results
in a data collapse, with all curves following the same pattern close to a
simple exponential with unit average. Deviations from this simple
theoretical scenario can be attributed
to the discreteness and finite size effects. As one can see from the figure,
these deviations are comparable for the observational and for the simulated
data, given that the simulated data contains similar number of frames,
average inter-arrival intervals and durations of individuals remaining within
the scene, and thus also similar total numbers of individual trajectories in
the entire scene.
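The data collapse can be made explicit with a short derivation. The notation is introduced here for illustration only: $\lambda$ denotes the rate of the Poisson process, $\tau$ the inter-event time, and $x$ the time measured in units of the average $\langle \tau \rangle$.

```latex
\begin{align}
P(\tau > t) &= e^{-\lambda t}, \qquad
\langle \tau \rangle = \int_0^\infty \lambda t \, e^{-\lambda t} \, dt = \frac{1}{\lambda}, \\
P\!\left(\frac{\tau}{\langle \tau \rangle} > x\right)
  &= P\!\left(\tau > \frac{x}{\lambda}\right) = e^{-x}.
\end{align}
```

The rescaled exceedance probability no longer depends on $\lambda$, so curves obtained for different scenes and thresholds are expected to fall onto the same unit exponential.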
However, in most real-world scenarios walking trajectories strongly
deviate from the simplest random model. Typical reasons for that are
localization of the objects of attraction (e.g. counters, doors, passages etc.),
as well as obstacles (e.g. barriers, billboards, kiosks etc.) in both indoor and
outdoor public spaces, leading to the spatial clustering of the walking
trajectories. In addition, traffic regulations (e.g. revolving doors, traffic
lights at crosswalks etc.) lead to additional temporal clustering of the
walking trajectories. Among various models used to characterize motion
from the statistical physics viewpoint, long-range correlations appear the
most relevant in the context of human dynamics (for a recent and
comprehensive review of literature on the topic, we refer to [22]). To
account for both spatial and temporal clustering, two-dimensional long-
range correlated fields seem to be a relevant model.
Recent data including our own results indicate that long-range
correlations are strongly associated with clustering of events, generally
leading to heavy-tailed distributions of both inter-event times and event
durations, with the latter being crucial for the contact proximity durations.
The impact of long-range temporal correlations on the event dynamics has
been investigated both analytically [32] and numerically [1, 14], indicating
that the interval distributions between consecutive events in a series
broaden from a single exponential for the simplest Poisson process scenario
to a stretched exponential for linear long-term correlations, and finally
converge to a power-law decay for strong long-term correlations, especially
in the presence of nonlinear interactions in the system [6, 7]. Moreover, in
recent years similar distributions of the inter-event times have been
observed in a number of real-world complex systems, ranging from bursty
access patterns driven by user interactions in public computer networks [6,
29, 34, 49] to various natural phenomena, e.g. in geophysics [8, 13].
Finally, our recent data indicate that spatial long-range correlations lead to
the manifestations of similar laws in biological polymer structures [9–11].
Figure 13 exemplifies similar distributions obtained by computer
simulations for walks with random increments with Hurst exponent $H = 0.5$
and long-range spatiotemporally correlated increments with Hurst
exponent $H > 0.5$. The figure shows explicitly that stronger spatio-
temporal correlations lead to broader contact duration distributions,
indicating that a larger fraction of pairs of individuals remain within the
same proximity thresholds for longer times (depicted by a more pronounced
initial decay in the exceedance probability distributions), compared to the
random increments scenario. The figure also shows that, while some
general qualitative conclusions are possible based on these simulations,
particular functional forms of the distributions obtained for finite systems
exhibit non-trivial shapes that are determined by a complex interplay of
correlations, discreteness and finite size effects, and thus are determined not
only by their asymptotic behaviors that could be eventually derived from
known theoretical assumptions, but also depend explicitly on the system
size.
As a remark, obtaining Fig. 13 required simulated datasets that contained
110 times more time steps and 11 times more individual walkers, altogether
resulting in $\sim 10^3$ times more walker positions, and potentially up to
$\sim 10^6$ times more pairwise distances, compared to the observational
video examples used in our study. Since obtaining comparable statistics for
different public places from video analysis requires considerable
computational effort, we believe that a more detailed analysis, including
long-term video analysis and fitting of the best correlated walkers model,
for a better understanding of how public space planning affects both the
spatio-temporal walking trajectory correlation patterns and contact
proximity distributions, remains beyond the scope of this study and could be
considered as an outlook for future research directions.
Fig. 13 Pairwise proximity duration distributions obtained by computer simulations
for a random walk with random increments with Hurst exponent $H = 0.5$ (blue
curves) and long-range spatiotemporally correlated increments with Hurst exponent
$H > 0.5$ (red curves), respectively

6 Conclusion and Outlook


To summarize, digital technologies have played a major role in the global
response to COVID-19 from the onset of the pandemic, especially in the
context of digital epidemiological surveillance and contact tracing, and
have proved their effectiveness in the real-world context, being strongly
associated with a number of success stories leading to the rapid suppression
of community transmission and reduction of incidence rates.
While AI and machine learning techniques have been widely applied in
web-based epidemic information support tools and online case tracing, they
have not yet been fully explored in the context of proximity tracing and
subsequent analysis for more informed planning of public spaces aimed at
reducing the number and duration of contacts.
In this paper, we have proposed a framework for proximity tracing based on
video surveillance. However, as with the use of mobile apps and Bluetooth,
privacy considerations cannot be emphasized enough for any approach to be of
practical use. This is one of the fundamental ideas of our framework,
realized by using anonymized IDs to identify individuals. Further exploring
how privacy can be integrated in the proposed solution is the most immediate
future research direction. Other directions include training other neural
network models and comparing them to find the best model. Trained models
will be evaluated based on the above parameters, such as mAP with a set IoU
threshold of 0.5, the error of the trained model, and the number of frames
per second (FPS) achieved in object detection. In addition, we will further
evaluate the approach using large datasets from crowded streets.
Now, more than two years after the onset of the pandemic, public attention
is increasingly shifting towards finding optimal exit strategies, including
adaptation of these technologies and finding their place in the
post-pandemic society. Looking forward towards this goal, we also consider
how proximity tracing based on video surveillance in public places could be
adapted to facilitate improved planning of public spaces.

Acknowledgment
The work of Sergey Antonov was supported by the Ministry of Science and
Higher Education of the Russian Federation “Goszadanie” No 075-01024-
21-02 from 29.09.2021 (Project No. FSEE-2021-0014).

References
1. Altmann, E., & Kantz, H. (2005). Recurrence time analysis, long-term
correlations, and extreme events. Physical Review E, 71(5), 056106.

2. Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). Openface: A
general-purpose face recognition library with mobile applications. Technical
report, CMU-CS-16-118, CMU School of Computer Science.

3. Anggo, M., & Arapu, L. (2018). Face recognition using fisherface method.
Journal of Physics: Conference Series, 1028, 012119.
https://doi.org/10.1088/1742-6596/1028/1/012119

4. Balaban, S. (2015). Deep learning and face recognition: the state of the art. In
Biometric and Surveillance Technology for Human and Activity Identification XII
(vol. 9457, p. 94570B). International Society for Optics and Photonics.

5. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for
segmentation of retinal blood vessels in fundus images. Iranian Journal of
Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.

6. Bogachev, M., & Bunde, A. (2009). On the occurrence and predictability of
overloads in telecommunication networks. EPL (Europhysics Letters), 86(6),
66002.

7. Bogachev, M., Eichner, J., & Bunde, A. (2007). Effect of nonlinear correlations on
the statistics of return intervals in multifractal data sets. Physical Review Letters,
99(24), 240601.

8. Bogachev, M., Eichner, J., & Bunde, A. (2008). On the occurrence of extreme
events in long-term correlated and multifractal data sets. Pure and Applied
Geophysics, 165, 1195–1207.

9. Bogachev, M., Kayumov, A., & Bunde, A. (2014). Universal internucleotide
statistics in full genomes: A footprint of the DNA structure and packaging? PLoS
ONE, 9(12), e112534.

10. Bogachev, M., Kayumov, A., Markelov, O., & Bunde, A. (2016). Statistical
prediction of protein structural, localization and functional properties by the
analysis of its fragment mass distributions after proteolytic cleavage. Scientific
Reports, 6, 22286.

11. Bogachev, M., Markelov, O., Kayumov, A., & Bunde, A. (2017). Superstatistical
model of bacterial DNA architecture. Scientific Reports, 7, 43034.

12. Budd, J., Miller, B. S., Manning, E. M., Lampos, V., Zhuang, M., Edelstein, M.,
Rees, G., Emery, V. C., Stevens, M. M., Keegan, N., et al. (2020). Digital
technologies in the public-health response to covid-19. Nature Medicine, 1–10.

13. Bunde, A., Bogachev, M., & Lennartz, S.: Precipitation and river flow: Long-term
memory and predictability of extreme events. Extreme Events and Natural
Hazards: The Complexity Perspective, 139–152.

14. Bunde, A., Eichner, J., Havlin, S., & Kantelhardt, J. (2004). Return intervals of
rare events in records with long-term persistence. Physica A: Statistical
Mechanics and its Applications, 342(1), 308–314.

15. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human
detection. In 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR 2005) (vol. 1, pp. 886–893). IEEE.
https://doi.org/10.1109/cvpr.2005.177

16. Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., &
Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective.
International Journal of Computer Vision, 111(1), 98–136.

17. Ferretti, L., Wymant, C., Kendall, M., Zhao, L., Nurtay, A., Abeler-Dörner, L.,
Parker, M., Bonsall, D., & Fraser, C. (2020). Quantifying sars-cov-2 transmission
suggests epidemic control with digital contact tracing. Science, 368(6491).

18. Hossain, M. S., Muhammad, G., & Guizani, N. (2020). Explainable ai and mass
surveillance system-based healthcare framework to combat covid-i9 like
pandemics. IEEE Network, 34(4), 126–132.

19. Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces
in the wild: A database for studying face recognition in unconstrained
environments. Technical Report 07-49, University of Massachusetts, Amherst.

20. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., &
Nahavandi, S. (2022). X-ray image based COVID-19 detection using evolutionary
deep learning approach. Expert Systems with Applications, 201, 116942.

21. Jalled, F. (2017). Face recognition machine vision system using eigenfaces.

22. Karsai, M., Jo, H. H., Kaski, K., et al. (2018). Bursty human dynamics. Springer

23. King, D. E. (2015). Max-margin object detection

24. Lellouche, S., & Souris, M. (2020). Distribution of distances between elements in
a compact set. Stats, 3(1), 1–15.

25. Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J.,
Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO:
Common objects in context. CoRR abs/1405.0312

26. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., Berg, A. C.
(2016). Ssd: Single shot multibox detector (pp. 21–37). Lecture Notes in
Computer Science. https://doi.org/10.1007/978-3-319-46448-0_2

27. Li, Z., Yang, W., Peng, S., & Liu, F. (2020). A survey of convolutional neural
networks: Analysis, applications, and prospects

28. Maneewongvatana, S., & Mount, D. M. (2001). An empirical study of a new
approach to nearest neighbor searching. In Algorithm Engineering and
Experimentation (pp. 172–187). Springer Berlin Heidelberg.
https://doi.org/10.1007/3-540-44808-x_14

29. Markelov, O., Nguyen, V., & Bogachev, M. (2017). Statistical modeling of the
internet traffic dynamics: To which extent do we need long-term correlations?
Physica A: Statistical Mechanics and its Applications, 485, 48–60.

30. Moltchanov, D. (2012). Distance distributions in random networks. Ad Hoc
Networks, 10(6), 1146–1166.

31. Mundy, J. L., Zisserman, A., et al. (1992). Geometric invariance in computer
vision (Vol. 92). MIT press Cambridge.

32. Newell, G., & Rosenblatt, M. (1962). Zero crossing probabilities for gaussian
stationary processes. The Annals of Mathematical Statistics, 33(4), 1306–1313.

33. Nguyen, T., Chen, S.W., Shivakumar, S. S., Taylor, C. J., & Kumar, V. (2017).
Unsupervised deep homography: A fast and robust homography estimation model.

34. Nguyen, V., Markelov, O., Serdyuk, A., Vasenev, A., & Bogachev, M. (2018).
Universal rank-size statistics in network traffic: Modeling collective access
patterns by zipf’s law with long-term correlations. EPL (Europhysics Letters),
123(5), 50001.

35. Panigrahy, R. (2008). An improved algorithm finding nearest neighbor using kd-
trees. Lecture Notes in Computer Science, pp. 387–398. Springer Berlin
Heidelberg. https://doi.org/10.1007/978-3-540-78773-0_34

36. Pan, J., & Manocha, D. (2011). Fast gpu-based locality sensitive hashing for k-
nearest neighbor computation. In Proceedings of the 19th ACM SIGSPATIAL
international conference on advances in geographic information systems, GIS, pp.
211–220. Association for Computing Machinery, New York, NY, USA.
https://doi.org/10.1145/2093973.2094002

37. Pönisch, W., & Zaburdaev, V. (2018). Relative distance between tracers as a
measure of diffusivity within moving aggregates. The European Physical Journal
B, 91(2), 1–7.

38. Punn, N. S., Sonbhadra, S. K., & Agarwal, S. (2020). Monitoring covid-19 social
distancing with person detection and tracking via fine-tuned yolo v3 and deepsort
techniques.

39. Rezaei, M., & Azarmi, M. (2020). Deepsocial: Social distancing monitoring and
infection risk assessment in covid-19 pandemic. arXiv preprint arXiv:2008.11672

40. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N. &
Mohammadi-Ivatloo, B. (2014). L2 regularized deep convolutional neural
networks for fire detection. Journal of Intelligent & Fuzzy Systems, 1–12.

41. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN
for brain tumor classification. Applied Sciences, 10(14), 4915.

42. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep
convolutional neural network for environmental sound classification via dilation.
Journal of Intelligent & Fuzzy Systems, 1–7.

43. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018).
Mobilenetv2: Inverted residuals and linear bottlenecks.

44. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural
computation. Academic Press.

45. Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding
for face recognition and clustering. In 2015 IEEE conference on computer vision
and pattern recognition (CVPR), pp. 815–823.
https://doi.org/10.1109/CVPR.2015.7298682

46. Singh, S., Kaur, A., & Taqdir, A. (2015). A face recognition technique using local
binary pattern method. IJARCCE, 165–168. https://doi.org/10.17148/IJARCCE.2015.4340

47. Skliros, A., & Chirikjian, G. S. (2008). Position and orientation distributions for
locally self-avoiding walks in the presence of obstacles. Polymer, 49(6), 1701–
1715.

48. Sokolova, A., Uljanitski, Y., Kayumov, A. R., & Bogachev, M. I. (2021).
Improved online event detection and differentiation by a simple gradient-based
nonlinear transformation: Implications for the biomedical signal and image
analysis. Biomedical Signal Processing and Control, 66, 102470.

49. Tamazian, A., Nguyen, V., Markelov, O., & Bogachev, M. (2016). Universal
model for collective access patterns in the internet traffic dynamics: A
superstatistical approach. EPL (Europhysics Letters), 115(1), 10008.

50. Tao, Y., & Sheng, C. (2014). Fast nearest neighbor search with keywords. IEEE
Transactions on Knowledge and Data Engineering, 26, 878–888. https://doi.org/10.1109/TKDE.2013.66

51. Tejedor, V., Schad, M., Bénichou, O., Voituriez, R., & Metzler, R. (2011).
Encounter distribution of two random walkers on a finite one-dimensional
interval. Journal of Physics A: Mathematical and Theoretical, 44(39), 395005.

52. Vannoorenberghe, P., Motamed, C., Blosseville, J. M., & Postaire, J. G. (1997).
Automatic pedestrian recognition using real-time motion analysis. In International
conference on image analysis and processing (pp. 493–500). Springer.

53. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of
simple features. In Proceedings of the 2001 IEEE computer society conference on
computer vision and pattern recognition (CVPR 2001, vol. 1, pp. I–I). IEEE

54. Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search
in general metric spaces. In Proceedings of the fourth annual ACM-SIAM
symposium on discrete algorithms, SODA, pp. 311–321. Society for Industrial and
Applied Mathematics, USA.

55. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and
alignment using multitask cascaded convolutional networks. IEEE Signal
Processing Letters, 23(10), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342

56. Apple and google framework. https://www.apple.com/newsroom/2020/04/apple-and-google-partner-on-covid-19-contact-tracing-technology/

57. Covidsafe app, Australia. https://www.health.gov.au/resources/apps-and-tools/covidsafe-app

58. The dp-3t project. https://github.com/DP-3T/documents

59. Hamagen app, Israel. https://govextra.gov.il/ministry-of-health/hamagen-app/download-en/

60. Norway halting smittestop app. https://www.amnesty.org/en/latest/news/2020/06/norway-covid19-contact-tracing-app-privacy-win/

61. Pepp-pt project. https://github.com/pepp-pt/pepp-pt-documentation/blob/master/PEPP-PT-high-level-overview.pdf

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_6

Deep Learning-Based Conjunctival Melanoma Detection Using Ocular Surface Images
Kanchon Kanti Podder1, Mohammad Kaosar Alam2, Zakaria Shams Siam2,3,
Khandaker Reajul Islam4, Proma Dutta5, Adam Mushtak6,
Amith Khandakar4, Shona Pedersen7 and Muhammad E. H. Chowdhury4


(1) Department of Biomedical Physics and Technology, University of
Dhaka, Dhaka, 1000, Bangladesh
(2) Department of Electrical, Electronic and Systems Engineering,
Universiti Kebangsaan Malaysia, 43600 Bangi, Malaysia
(3) Department of Electrical and Computer Engineering, Presidency
University, Dhaka, Bangladesh
(4) Department of Electrical Engineering, Qatar University, 2713 Doha,
Qatar
(5) Department of Electrical and Electronic Engineering, Chittagong
University of Engineering and Technology, Chittagong, 4349,
Bangladesh
(6) Clinical Imaging Department, Hamad Medical Corporation, Doha,
Qatar
(7) Department of Basic Medical Sciences, College of Medicine, Qatar
University, 2713 Doha, Qatar

Muhammad E. H. Chowdhury
Email: mchowdhury@qu.edu.qa

Supplementary Information
The online version contains supplementary material available at
https://doi.org/10.1007/978-981-99-3784-4_6.
Keywords Conjunctival melanoma – Computer-aided diagnosis – Deep
learning – Ocular surface images – Pretrained models

1 Introduction
The eye is one of the most important and intricate human sensory organs. It
allows us to see objects and to perceive light, depth, and colour.
Conjunctival nevus [1], a relatively common disorder, has several distinct
clinical presentations [2]. Patients presenting with conjunctival lesions are
frequently encountered in routine clinical practice [3]. Conjunctival nevi
can exhibit a range of benign or malignant characteristics [4]. Conjunctival
melanoma [1] is an uncommon but potentially fatal malignant growth of the
eye that develops from melanocytes in the basal layer of the conjunctival
epithelium [5]. This tumour accounts for around 2% of all eye tumours, 5% of
ocular melanomas [6], and 0.25% of all melanomas [7]. Conjunctival melanoma
is associated with mortality rates of at least 30% [8] and demands costly
treatment, and a late diagnosis is linked to a poor prognosis [3, 5]. It
often manifests as a pigmented or coloured, sharply defined conjunctival
lesion; however, unusual cases with a variety of morphologies can delay the
diagnosis [9]. The condition may arise either from a nevus or from acquired
melanosis [10]. Reducing the mortality caused by this condition requires
prompt diagnosis and a practical means of detection, particularly in the many
countries that face an ageing population and insufficient healthcare
resources. In a conventional clinical examination, an ophthalmologist views
the ocular surface under a slit lamp to determine whether a patient has
conjunctival melanoma, and a biopsy is necessary to confirm the diagnosis
[3]. Such in-clinic investigations have, however, been considerably disrupted
by the COVID-19 pandemic [11]. Ophthalmologists therefore face significant
difficulties in the prompt identification of conjunctival melanoma [3].
Deep learning has already had a major impact on medical imaging, and this
influence is expected to grow [12, 13]. Several experts regard deep learning
as a key factor in the future of medicine and a key instrument for medical
practice and research [14–18]. In medical image analysis, deep learning
methods have already demonstrated impressive, and often unprecedented,
performance across a wide range of low- and high-level image processing
tasks, including image classification, detection, segmentation, enhancement,
denoising, reconstruction, and registration [19–26]. Deep learning techniques
that use digital images of pathological lesions are considered useful for
improving the detection of skin malignancies [27, 28]. Although many deep
learning studies have concentrated on skin melanoma [29–32], the use of
modern deep learning to identify conjunctival melanoma remains
underexplored. Training conventional deep neural networks for this task is
very difficult because substantial data with ground truth labels for
conjunctival diseases are lacking. Very recently, deep learning techniques
for identifying conjunctival melanoma from ocular surface images were
explored [3]. However, the dataset used in that work was not well curated,
and more research is required to improve classification performance. The
goal of the current study is to examine contemporary deep learning techniques
for detecting conjunctival melanoma using a sizable, enhanced ocular surface
image dataset. Four classes of images are used in the present study:
conjunctival melanoma, melanosis or nevus, normal conjunctiva, and pterygium
[33]. To address the research gap in classifying conjunctival melanoma, this
study makes the following contributions:
• A well-curated dataset for conjunctival melanoma is proposed, which is
  validated by medical experts.
• An effective and faster augmentation technique is proposed as an
  alternative to CycleGAN-based augmentation [3] for enlarging a small
  conjunctival melanoma dataset.
• A high-performing deep learning model is proposed, which can classify the
  different eye conditions with high accuracy.
Additionally, we examined the interpretability of our findings. This study
tests the hypothesis that deep learning can classify conjunctival lesions and
detect conjunctival melanoma from ocular surface images. Such an approach
could make the prompt identification of conjunctival lesions easier.
The remainder of this study is organized as follows. The next section
describes the materials and methods in detail. The results are then
presented and discussed. Finally, we conclude the study and outline potential
future research.

2 Methodology
This study proposes a system that classifies a smartphone image of the eye as
either normal or showing an eye-related medical condition. The system
comprises data collection, data cleaning and validation, CNN training and
evaluation, and visual interpretation. Figure 1 illustrates the step-by-step
workflow of the methodology proposed in this study.

Fig. 1 Depiction of methodology adopted in this study

2.1 Data Collection


This research focused on analyzing the anterior segment using a deep learning
system and images of the eye's surface. Our dataset was developed from the
preliminary melanoma dataset of [3], which contains four categories: normal,
pterygium, nevus, and conjunctival melanoma. The medical experts on our team
identified some irrelevant and problematic images in that dataset. Because
ocular images of subjects with conjunctival anomalies are widely available
online and can be found through keyword searches (for example, “normal
conjunctiva”, “pterygium”, “conjunctival nevus”, “conjunctival melanosis”),
we removed the irrelevant images from the dataset proposed in [3] and added
new images. Expert physicians double-checked the data to make sure they were
accurate and relevant. The details of the original dataset and the dataset
proposed in this study are illustrated in Fig. 2, and a sample representation
of the different classes in the dataset is available in Fig. 3.

Fig. 2 Dataset details before and after cleaning and validation


Fig. 3 A sample representation of the proposed dataset displaying images from class
“Normal Conjunctiva”, “Conjunctival Melanoma”, “Nevus”, and “Pterygium”

2.2 Data Augmentation


The dataset proposed in this study is referred to as the “Four Class”
dataset. A second, “Binary Class” dataset was derived from it, in which
“Normal Conjunctiva” is separated from “Abnormal Conjunctiva” (which includes
pterygium, nevus, and conjunctival melanoma). Both datasets were divided into
a training set, a validation set, and a test set in proportions of 70%, 10%,
and 20%, respectively (Fig. 4).
Fig. 4 A representation of single and multiple augmentation techniques on an ocular
surface image

Because the dataset is small, four augmentation techniques were employed to
enlarge the training set: random rotation, random affine transformation,
padding, and colour correction. Data augmentation is a proven way to counter
the problem of small datasets, as shown in other publications [34–36]. The
specifics of the four techniques explored in this study are provided in
Table 1. Both single and multiple augmentations were applied at random: the
augmentation model first decided randomly whether to apply a single
augmentation or several, and, if several were chosen, it then randomly
decided which combination to use. Figure 4 shows single and multiple
augmentation techniques applied to an image of the ocular surface.

Table 1 Augmentation techniques and ranges used in the training set of proposed
datasets

| Augmentation technique | Range |
| --- | --- |
| Random rotation | −20 to +20 degrees |
| Random affine | Degree = 0; translate range = (0.05, 0.15); scaling range = (0.9, 0.95) |
| Padding | Range = (0, 10); fill = (black, white); mode = (‘Constant’, ‘Edge’) |
| Colour correction | Brightness = (0, 0.2); contrast = (0, 0.2) |
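
The random choice between single and multiple augmentations can be sketched as follows. This is a minimal illustration of the selection logic only, not the authors' augmentation model: the technique names, the function `sample_augmentation_plan`, and the 50/50 probability of choosing a single augmentation are assumptions made for the example, and the image transforms themselves (with the ranges in Table 1) are not applied here.

```python
import random

# Names of the four techniques in Table 1 (identifiers are illustrative).
TECHNIQUES = ["random_rotation", "random_affine", "padding", "colour_correction"]


def sample_augmentation_plan(rng, p_single=0.5):
    """Return the list of techniques to apply to one training image.

    p_single is an assumed probability of applying a single augmentation;
    the source does not state the value that was used.
    """
    if rng.random() < p_single:
        # Single augmentation: pick exactly one technique.
        return [rng.choice(TECHNIQUES)]
    # Multiple augmentations: pick a random combination of two or more.
    k = rng.randint(2, len(TECHNIQUES))
    return rng.sample(TECHNIQUES, k)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_augmentation_plan(rng))
```

Each selected name would then be mapped to the corresponding image transform before the augmented image is added to the training set.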

In each of the two datasets, the size of the training set for each class was
expanded to three thousand samples by applying these four augmentation
techniques. As the validation and test sets were used for evaluating deep
learning models in a real-world setting, these two sets were left unchanged
throughout the process. Table 2 contains a description of the sizes of the
datasets along with the augmentation [37] factors.

Table 2 The detailed description of proposed datasets. The curated dataset is validated
by expert doctors and the training samples are increased by an augmentation factor
using different augmentation techniques

| Dataset | Class | Original data samples | Validation set | Testing set | Training set | Augmentation factor | Training set after augmentation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Binary class | Normal conjunctiva | 125 | 13 | 25 | 87 | 34.48 | 3000 |
| Binary class | Abnormal conjunctiva | 285 | 28 | 57 | 200 | 15 | 3000 |
| Four class dataset | Normal conjunctiva | 125 | 13 | 25 | 87 | 34.48 | 3000 |
| Four class dataset | Nevus | 85 | 8 | 17 | 60 | 50 | 3000 |
| Four class dataset | Pterygium | 70 | 7 | 14 | 49 | 61.44 | 3000 |
| Four class dataset | Conjunctival melanoma | 130 | 13 | 26 | 91 | 32.97 | 3000 |
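
As a small worked example, the augmentation factor appears to be the ratio of the target training-set size (3000) to the original training-set size of a class. The sketch below assumes that definition; the function name `augmentation_factor` is illustrative.

```python
def augmentation_factor(train_size, target_size=3000):
    """Ratio of the augmented training-set size to the original one (assumed definition)."""
    return target_size / train_size


if __name__ == "__main__":
    # Original training-set sizes taken from Table 2.
    for name, size in [
        ("Normal conjunctiva", 87),
        ("Abnormal conjunctiva", 200),
        ("Nevus", 60),
        ("Conjunctival melanoma", 91),
    ]:
        print(f"{name}: {augmentation_factor(size):.2f}")
```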

2.3 Convolutional Neural Network (CNN) Based Classification Models
This study used state-of-the-art CNNs to classify ocular surface images of
normal eyes and of different eye conditions. Four CNN architecture families,
ResNet, DenseNet, GoogLeNet, and EfficientNet, were used with pre-trained
weights. We selected these architectures because of their efficacy in a
previous publication [38]. All of them had been trained on the large
benchmark dataset “ImageNet” [39]. Following the well-known concept of
transfer learning, the CNN models were initialized with these pre-trained
weights and then optimized during training on the ocular surface images.
Details of the trained CNN architectures are given below:

2.3.1 GoogLeNet
GoogLeNet, proposed in [40], is built on the Inception module. Its authors
proposed a wider and deeper Inception network that performed slightly better
in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014
competition. In GoogLeNet's Inception module with dimensionality reduction, a
1 × 1 convolution is added before every 3 × 3 and 5 × 5 convolution. The
model is 22 layers deep (27 if pooling layers are counted), with 9 Inception
modules stacked linearly. The last Inception module is connected to a global
average pooling layer. The detailed model architecture, with convolutional
layers, pooling, and activations, is available in [40].
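
The benefit of placing a 1 × 1 convolution before a larger convolution can be seen from a simple weight count. The sketch below is only an illustration: the channel sizes (192 input channels, a reduction to 16 channels, 32 output channels) are assumed values chosen for the example, and biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Assumed channel sizes, for illustration only.
IN_CHANNELS, REDUCED_CHANNELS, OUT_CHANNELS = 192, 16, 32

# 5 x 5 convolution applied directly to the input.
direct = conv_weights(5, IN_CHANNELS, OUT_CHANNELS)

# 1 x 1 reduction first, then the 5 x 5 convolution on fewer channels.
reduced = conv_weights(1, IN_CHANNELS, REDUCED_CHANNELS) + conv_weights(
    5, REDUCED_CHANNELS, OUT_CHANNELS
)

if __name__ == "__main__":
    print(direct, reduced)
```

With these assumed sizes, the reduced branch needs roughly a tenth of the weights of the direct 5 × 5 convolution.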

2.3.2 ResNet
The ResNet architecture proposed in [41] was designed to counter the
vanishing gradient problem in deeper CNN architectures. As a network grows
deeper and stacks more complex feature extractors, the features of the
earlier layers, and the gradients that flow back to them, tend to vanish.
The residual connection in ResNet addresses this problem with a skip
connection that carries features from an earlier layer directly to deeper
layers. In this study, ResNet18, ResNet50, and ResNet152 were used. The
number that follows the name ResNet indicates the number of layers in that
variant of the architecture. Thus, ResNet architectures with 18, 50, and 152
layers were evaluated and compared with the other CNN architectures in this
ocular surface image classification research.
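
In the usual notation for residual learning (the symbols below are not used elsewhere in this chapter), a residual block with input $x$, learned mapping $\mathcal{F}$, and weights $\{W_i\}$ computes

$$
y = \mathcal{F}(x, \{W_i\}) + x, \qquad \frac{\partial y}{\partial x} = \frac{\partial \mathcal{F}}{\partial x} + I .
$$

The identity term $I$ in the derivative gives the gradient a direct path through the skip connection, which is why the gradient is less likely to vanish even when $\partial \mathcal{F} / \partial x$ is small.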

2.3.3 DenseNet
The authors in [42] observed that deeper CNN models can be more accurate and
more efficient to train when they contain short connections between layers
close to the input and layers close to the output. Based on this observation,
they proposed DenseNet, which connects each layer to every other layer in a
feed-forward fashion. The authors reported several benefits, including
alleviation of the vanishing-gradient problem and better feature propagation
and reuse. This connectivity pattern achieved benchmark results on the
ImageNet dataset while also significantly reducing the number of parameters.
DenseNet161 and DenseNet201 were used in this study; their depths are 161 and
201 layers, respectively.
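
In the notation commonly used for dense connectivity (introduced here only for illustration), the $\ell$-th layer receives the concatenated feature maps of all preceding layers:

$$
x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big),
$$

where $H_\ell$ is the composite function of layer $\ell$ and $[\cdot]$ denotes concatenation. A dense block with $L$ layers therefore has $L(L+1)/2$ direct connections rather than the $L$ connections of a plain feed-forward stack.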

2.3.4 EfficientNet
CNNs such as VGGNet, ResNet, MobileNet, and SENet employ a variety of methods
to improve network accuracy. Each of these methods scales at least one of
three dimensions: width, depth, or resolution. The authors in [43] studied
these scaling methods and integrated them into EfficientNet by proposing a
compound scaling mechanism that scales all three dimensions uniformly.
EfficientNet_B7, a member of the EfficientNet family, achieved 84.3% top-1
accuracy on ImageNet, and the pre-trained weights of this model were used in
our ocular surface image classification.
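
The compound scaling rule can be sketched numerically. In the EfficientNet paper, depth, width, and resolution are scaled as α^φ, β^φ, and γ^φ for a single coefficient φ, with α · β² · γ² ≈ 2, so that the computational cost roughly doubles for each unit increase in φ. The base values below (α = 1.2, β = 1.1, γ = 1.15) are those reported in that paper for the baseline network; the function name is illustrative, the symbols are unrelated to those used for the evaluation metrics in Sect. 2.6, and the sketch does not reproduce the exact configuration of EfficientNet_B7.

```python
# Base coefficients reported in the EfficientNet paper for the baseline model.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, cost) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Cost grows linearly with depth and quadratically with width and resolution.
    cost = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, cost


if __name__ == "__main__":
    for phi in range(4):
        d, w, r, c = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, cost x{c:.2f}")
```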

2.4 Visualization Methods


How a CNN reaches its decisions is always an intriguing question, and the
visualization tools developed over the years help to answer it. These tools
expose the rationale behind a model's inference in a way that humans can
follow, which builds confidence in the CNN's outputs. Among the various
visualization tools, Grad-CAM [44] was chosen for this investigation because
it has shown promising performance in recent computer vision problems [45].
Gradient-weighted Class Activation Mapping uses the gradients of the class
score with respect to the feature maps of the final convolutional layer to
produce a localization map that highlights the image regions contributing to
the decision. An advantage of Grad-CAM over other visualization techniques is
that it is applicable to a wide variety of CNN architectures, with or without
fully connected layers [45]. Because this study classifies sensitive medical
conditions, it was necessary to confirm through visualization that the CNN
models attended to the region of interest. This, in turn, strengthened trust
in the models' decision-making. A discussion of the visual representation and
explanation provided by Grad-CAM in this ocular surface image classification
is given at the end of the results section.
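
The combination step of Grad-CAM can be sketched in a few lines. The example below assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been obtained (in practice through a framework's backward pass); it uses plain nested lists, and the function name `grad_cam` and the toy values are illustrative.

```python
def grad_cam(feature_maps, gradients):
    """Combine feature maps and gradients into a Grad-CAM heat map.

    feature_maps[k][i][j]: activation of channel k at position (i, j).
    gradients[k][i][j]: gradient of the class score w.r.t. that activation.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Channel weight = spatial average of that channel's gradients.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w_k, a_k in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w_k * a_k[i][j]
    # ReLU keeps only the regions that support the target class.
    return [[max(0.0, value) for value in row] for row in cam]


if __name__ == "__main__":
    maps = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]          # two 2 x 2 channels
    grads = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]     # toy gradients
    print(grad_cam(maps, grads))
```

The resulting map is usually upsampled to the input resolution and overlaid on the image as a heat map.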

2.5 Experimental Setup


Five-fold cross-validation was used in the investigation of the “Binary
Class” and “Four Class” datasets. The PyTorch library and Python 3.7 were
used in this study. Training, validation, and testing were carried out on the
Google Colab Pro platform with a 16 GB Tesla T4 GPU and 120 GB of High RAM.
The hyper-parameters used for all investigations are given in Table 3.

Table 3 Details for hyper-parameters used for all CNN models to train on “Binary
Class” and “Four Class” datasets

| Hyper-parameter | Details |
| --- | --- |
| Batch size | 4 |
| Optimizer | Adam |
| Loss function | NLL Loss |
| Learning rate | 0.0001 |
| Total epoch | 20 |
| Epoch patience | 6 |
| Drop factor of learning rate | 0.1 |
| Maximum epoch stop | 10 |
| Stop criteria | Loss |
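
One possible reading of the schedule in Table 3 is sketched below: the learning rate is multiplied by the drop factor (0.1) after 6 epochs without an improvement in the loss, and training stops after 10 epochs without improvement. This interpretation of “epoch patience” and “maximum epoch stop”, the function name, and the simulated loss values are assumptions; the sketch is a simplified pure-Python simulation and not the authors' training loop.

```python
def simulate_training(losses, lr=1e-4, patience=6, factor=0.1, stop_after=10):
    """Return (last epoch run, final learning rate) for a sequence of epoch losses."""
    best = float("inf")
    stale_for_lr = 0    # epochs without improvement since the last LR change
    stale_for_stop = 0  # epochs without improvement since the best loss
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale_for_lr, stale_for_stop = loss, 0, 0
        else:
            stale_for_lr += 1
            stale_for_stop += 1
        if stale_for_lr >= patience:
            lr *= factor          # drop the learning rate on a plateau
            stale_for_lr = 0
        if stale_for_stop >= stop_after:
            return epoch, lr      # early stopping
    return len(losses), lr


if __name__ == "__main__":
    plateau = [1.0] + [2.0] * 19                    # loss never improves after epoch 1
    improving = [1.0 / (i + 1) for i in range(20)]  # loss improves every epoch
    print(simulate_training(plateau))
    print(simulate_training(improving))
```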

2.6 Evaluation Metrics


The performance of the CNN models was assessed using overall accuracy,
precision, sensitivity/recall, F1 score, and specificity. Let α be the number
of ocular surface images predicted as true positives, γ the number predicted
as false positives, δ the number predicted as true negatives, and θ the
number predicted as false negatives. The overall accuracy, precision,
sensitivity/recall, F1 score, and specificity are then given by Eqs. (1–5):

$$\text{Overall accuracy} = \frac{\alpha + \delta}{\alpha + \gamma + \delta + \theta} \tag{1}$$

$$\text{Precision} = \frac{\alpha}{\alpha + \gamma} \tag{2}$$

$$\text{Sensitivity/Recall} = \frac{\alpha}{\alpha + \theta} \tag{3}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

$$\text{Specificity} = \frac{\delta}{\delta + \gamma} \tag{5}$$

The confusion matrix and ROC curve are important tools for evaluating the
performance of deep learning models on medical image classification. In this
study, the confusion matrix and ROC curve of each CNN model were examined to
identify the best-performing model by comparison with its counterparts.
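
The five metrics of this section can be computed directly from the four counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classification_metrics(tp=8, fp=2, tn=9, fn=1).items():
        print(f"{name}: {value:.4f}")
```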

3 Results
3.1 Binary Classification
“Normal Conjunctiva” and “Abnormal Conjunctiva” are the two classes
considered for binary classification using seven CNN models. The learning
curves of these seven CNNs are available in Supplementary tables 1 to 7. All
the learning curves suggest that the models are well trained and show no
signs of overfitting or underfitting. Figure 5 displays the mean and standard
deviation of the accuracies across five-fold cross-validation for these seven
pre-trained CNN models. EfficientNet_B7 achieved the highest mean accuracy
and the lowest standard deviation in fold-wise accuracy. GoogLeNet's
performance varied more across the five folds than that of EfficientNet_B7,
which indicates that GoogLeNet was comparatively less consistent from fold to
fold.
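
The fold-wise summary can be computed as in the following sketch. The fold accuracies are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the source does not say whether the sample or the population form was used.

```python
from statistics import mean, stdev


def summarize_folds(accuracies):
    """Mean and sample standard deviation of per-fold accuracies."""
    return mean(accuracies), stdev(accuracies)


if __name__ == "__main__":
    # Hypothetical per-fold accuracies (%), for illustration only.
    fold_accuracies = [99.0, 100.0, 99.5, 98.5, 100.0]
    avg, spread = summarize_folds(fold_accuracies)
    print(f"{avg:.2f} ± {spread:.2f}")
```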
Fig. 5 Representation of mean and standard deviation in the five-fold accuracy of all
models for binary classification

Table 4 presents the binary classification results of all the employed
models, along with the number of trainable parameters and the inference time
of each model.

Table 4 Performance metrics of different CNN models in detection of “Normal
Conjunctiva” versus “Abnormal Conjunctiva” with five-fold cross-validation method
in a binary class dataset

| Model | Trainable parameters | Inference time (second) | Overall accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | Specificity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet18 | 11,177,538 | 0.00216 | 99.27 | 99.27 | 99.27 | 99.26 | 98.78 |
| ResNet50 | 23,512,130 | 0.00532 | 99.27 | 99.27 | 99.27 | 99.26 | 99.22 |
| ResNet152 | 58,147,906 | 0.01557 | 98.78 | 98.78 | 98.78 | 98.78 | 98.76 |
| GoogLeNet | 5,601,954 | 0.00641 | 98.78 | 98.79 | 98.78 | 98.78 | 98.27 |
| DenseNet161 | 26,476,418 | 0.01893 | 98.29 | 98.30 | 98.29 | 98.29 | 98.22 |
| DenseNet201 | 18,096,770 | 0.02592 | 99.02 | 99.05 | 99.02 | 99.03 | 99.17 |
| EfficientNet_B7 | 63,792,082 | 0.03334 | 99.51 | 99.52 | 99.51 | 99.51 | 99.70 |

As can be seen from Table 4, of the seven CNN architectures that we used,
EfficientNet_B7 was the best-performing model across the evaluation metrics.
Its accuracy, precision, recall, F1 score, and specificity are 99.51%,
99.52%, 99.51%, 99.51%, and 99.70%, respectively. EfficientNet_B7 is also the
heaviest network in terms of trainable parameters (more than 63 million).
However, all the other models achieved very similar performance on the
evaluation metrics in Table 4.
Notably, GoogLeNet is the lightest network in terms of trainable parameters
(only ~5.6 million), yet it achieved classification performance comparable to
that of EfficientNet_B7 on the evaluation metrics. In terms of trainable
parameters, EfficientNet_B7 is roughly 11 times heavier than GoogLeNet, while
its classification accuracy and precision are 0.73 percentage points higher.
ResNet18 had the shortest inference time (around 2.16 ms) and also achieved
very good classification performance (accuracy of 99.27%). Because the
inference time of every network is very short (less than 0.04 s), all the
employed networks can be used for real-time applications.
The effectiveness of a model in distinguishing critical medical conditions
from normal cases can also be understood through ROC curves, AUC scores, and
confusion matrices. Figure 6 shows the ROC curve and confusion matrix of the
best-performing network, EfficientNet_B7, for binary classification. The
confusion matrices and ROC curves of the other models used in binary
classification can be found in Supplementary Figures (1–14).
Fig. 6 (a) ROC curve and (b) confusion matrix for the best-performing EfficientNet_B7
model, which has been trained and tested on binary class data. The confusion matrices
and the ROC curves of the other models can be found in the supplementary materials

Figure 6a depicts the true positive rate (TPR) versus the false positive rate
(FPR) of EfficientNet_B7 in classifying “Normal Conjunctiva” versus “Abnormal
Conjunctiva” at different thresholds. The AUC was close to 1.00, indicating
that EfficientNet_B7 classified the samples accurately across all
classification thresholds. The numbers of true positive, true negative, false
positive, and false negative cases for EfficientNet_B7 are shown in the
confusion matrix in Fig. 6b. Only one of the 285 test instances of the
“Abnormal Conjunctiva” class across the five folds was identified as “Normal
Conjunctiva”. Overall, EfficientNet_B7 outperformed the other CNNs.
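
The area under the ROC curve summarizes the TPR-versus-FPR trade-off over all thresholds; it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it from that pairwise definition; the function name and the scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical labels (1 = abnormal) and predicted scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```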

3.2 Multi-class Classification


The seven CNN models used in binary classification were also used for
four-class classification. The learning curves of these models are available
in Supplementary Tables 8 to 14 and show the trend of well-fitted models.
Figure 7 illustrates the mean and standard deviation of the accuracies in
five-fold cross-validation of all models on multi-class classification.
Fig. 7 Graphical representation of mean and standard deviation in the five-fold
accuracy of all the models on multi-class classification

Multi-class classification of three ocular conditions and the normal
condition from ocular images presents significant challenges.
EfficientNet_B7, a recently developed and robust CNN, had the highest mean
accuracy across the five folds (94.43%). Whereas other CNNs, such as
DenseNet161, showed larger standard deviations, the standard deviation of the
EfficientNet_B7 model’s accuracy was small (±1.54), indicating steady
fold-wise performance. Besides fold-wise accuracy, metrics such as overall
accuracy, precision, recall, F1 score, and specificity are important for
understanding the behaviour of a deep learning model. Table 5 presents the
four-class classification results of all the CNN models, along with the
number of trainable parameters and the inference time of each model.

Table 5 The performance metrics of different state-of-the-art CNN models in the
detection of conjunctival melanoma with a five-fold cross-validation method on a four-
class dataset

| Model | Trainable parameters | Inference time (second) | Overall accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | Specificity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet18 | 11,178,564 | 0.00226 | 93.45 | 93.49 | 93.45 | 93.46 | 97.86 |
| ResNet50 | 23,516,228 | 0.00539 | 91.02 | 91.22 | 91.02 | 91.01 | 96.91 |
| ResNet152 | 58,152,004 | 0.01494 | 92.72 | 92.92 | 92.72 | 92.76 | 97.66 |
| GoogLeNet | 5,604,004 | 0.00627 | 91.75 | 91.76 | 91.75 | 91.72 | 96.98 |
| DenseNet161 | 26,480,836 | 0.01789 | 91.02 | 91.34 | 91.02 | 91.1 | 96.90 |
| DenseNet201 | 18,100,612 | 0.02269 | 94.42 | 94.43 | 94.42 | 94.42 | 98.15 |
| EfficientNet_B7 | 63,797,204 | 0.03188 | 94.42 | 94.55 | 94.42 | 94.43 | 98.20 |

The multi-class classification results in Table 5 vary more across models
than the binary classification results (Table 4). Table 5 shows that
EfficientNet_B7 had the highest performance across all metrics used for
assessing the models: its accuracy, precision, recall/sensitivity, F1 score,
and specificity are 94.42%, 94.55%, 94.42%, 94.43%, and 98.20%, respectively.
DenseNet201 matched EfficientNet_B7 exactly in accuracy and recall, but
EfficientNet_B7 was better in precision, F1 score, and specificity. Although
DenseNet161 and ResNet50 have more trainable parameters than GoogLeNet, the
lighter GoogLeNet still outperformed them by a small margin. ResNet18 once
again had the fastest inference time (approximately 2.26 ms), with a
classification accuracy of 93.45%. In addition, the inference time of every
model was less than 0.04 s, making them suitable for use in real-time
settings.
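
The trainable-parameter counts in Tables 4 and 5 differ only slightly between the binary and four-class settings. One plausible explanation, sketched below, is that only the final linear classification layer changes: a linear layer with `in_features` inputs and `num_classes` outputs has `in_features * num_classes + num_classes` parameters. The feature widths listed are assumptions (the widths of the final feature vectors in common implementations of these architectures); the parameter counts are copied from the two tables.

```python
def linear_head_parameters(in_features, num_classes):
    """Weights plus biases of a fully connected classification layer."""
    return in_features * num_classes + num_classes


# model: (binary count from Table 4, four-class count from Table 5, assumed head input width)
MODELS = {
    "ResNet18": (11_177_538, 11_178_564, 512),
    "ResNet50": (23_512_130, 23_516_228, 2048),
    "ResNet152": (58_147_906, 58_152_004, 2048),
    "GoogLeNet": (5_601_954, 5_604_004, 1024),
    "DenseNet161": (26_476_418, 26_480_836, 2208),
    "DenseNet201": (18_096_770, 18_100_612, 1920),
    "EfficientNet_B7": (63_792_082, 63_797_204, 2560),
}


def head_difference_matches(binary_count, four_class_count, width):
    """True if the count difference equals the extra parameters of a 4-way head."""
    expected = linear_head_parameters(width, 4) - linear_head_parameters(width, 2)
    return four_class_count - binary_count == expected


if __name__ == "__main__":
    for name, (binary, four, width) in MODELS.items():
        print(name, four - binary, head_difference_matches(binary, four, width))
```

Under these assumptions, the difference for every model equals the two extra output units of the four-class head.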
Figure 8 shows the ROC curve and confusion matrix of the best-performing
model, EfficientNet_B7, for multi-class classification. The ROC curve in
Fig. 8a has an AUC of around 0.99, indicating close-to-perfect performance in
multi-class classification. Figure 8b shows the true positive (TP), true
negative (TN), false positive (FP), and false negative (FN) outcomes of the
best-performing EfficientNet_B7 model. The true positive rates of
EfficientNet_B7 in classifying normal, pterygium, nevus, and melanoma are
0.98, 0.94, 0.94, and 0.91, respectively, which indicates the model’s strong
capability in distinguishing the classes. The confusion matrices and ROC
curves of the other models can be found in Supplementary Figures (15–28).
Fig. 8 (a) ROC curve and (b) confusion matrix for the best-performing EfficientNet_B7
model (the “others” class is labelled as the “abnormal” class) trained and tested on the
multi-class dataset

3.3 Comparative Analysis with Existing Literature


The proposed method of using data augmentation and pre-trained CNNs improved
model performance. The comparison between the previous literature [3] and the
method proposed in this study is given in Table 6. In four-class
classification, the proposed method improved accuracy by 13.42 percentage
points and AUC by 0.036. EfficientNet_B7 with image augmentation techniques
outperformed the CycleGAN-based image augmentation and MobileNetV2 approach
reported in [3]. The proposed method also outperformed the previous
literature in binary classification, by 3.23 percentage points in accuracy
and 0.024 in AUC.

Table 6 Comparative analysis of the performance of the proposed method with
counterpart literature

| Datasets | Reference | Technique | AUC | Accuracy |
| --- | --- | --- | --- | --- |
| Four class | Yoo et al. | CycleGAN-based image augmentation, MobileNetV2 | 0.954 | 81 |
| Four class | Proposed method | Dataset cleaning, inclusion of related images, image augmentations, EfficientNet_B7 | 0.99 | 94.42 |
| Binary class | Yoo et al. | CycleGAN-based image augmentation, MobileNetV2 | 0.976 | 96.5 |
| Binary class | Proposed method | Dataset cleaning, inclusion of related images, image augmentations, EfficientNet_B7 | 1.00 | 99.73 |

3.4 Visualization Using Grad-CAM
Figures 9 and 10 represent the visual interpretation of the best-performing
models in the “Binary Class” and “Four Class” datasets, respectively using
Grad-CAM. It is easier to comprehend the model's prediction process when
using this visual representation. Figure 9 provides a visual interpretation of
EfficientNet_B7 and ResNet18, two of the top-performing models in the
“Binary Class” investigation.

Fig. 9 Visual interpretation of ResNet18 and EfficientNet_B7 model predictions on
the “Binary Class” dataset

Fig. 10 Visual interpretation of DenseNet201 and EfficientNet_B7 model predictions
on the “Four Class” dataset

Both models effectively predicted the classes by attending to the
corresponding region of interest. This study was undertaken to categorize
three different medical conditions (Nevus, Pterygium, and Conjunctival
Melanoma) in addition to Normal subjects, so visual interpretability is
especially crucial in the “Four Class” investigation. Figure 10 displays the
visual interpretation of the best-performing model, EfficientNet_B7, beside
that of DenseNet201. The EfficientNet_B7 heat maps indicate that the
features learned from the relevant region of interest during training are the
key to the model’s capacity to classify ocular surface images with high
accuracy.
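
The core Grad-CAM computation [44] can be sketched in a few lines: the gradients of the class score are averaged over each feature map to obtain one weight per map, the maps are combined with those weights, and a ReLU keeps only positive evidence. The sketch below uses plain Python lists and toy values; the function name `grad_cam` and the inputs are illustrative assumptions, and in practice the activations and gradients would come from the last convolutional layer of the trained model.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from K feature maps and the matching gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    """
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # One weight per feature map: global average of the class-score gradients.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of the feature maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: the first map supports the class, the second argues against it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(toy_activations, toy_gradients))
```

In the toy example, only the location highlighted by the positively weighted map survives the ReLU.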

4 Conclusion
In conclusion, the proposed study used state-of-the-art CNN models with
data curation, validation and single and multiple augmentation techniques
to classify ocular surface images for different medical condition
investigations (“Binary Class” and “Four Class”). EfficientNet_B7 was the
best-performing model with 99.73% and 94.42% accuracy for “Binary
Class” and “Multi-Class” respectively utilizing the methodology proposed
in this study. The results for both types of investigation outperformed
previously published literature [3]. Moreover, this model showed a high
degree of sensitivity of 99.51% and 99.42% for the “Binary Class” and
“Four Class” investigations, respectively. The performance of the best
model, EfficientNet_B7, was also evaluated through Grad-CAM-based
visual interpretation as this study includes the diagnosis of sensitive
medical conditions using ocular surface images. In future, the proposed
model can be deployed on a server so that it can produce predictions with
visual interpretation for clinicians and patients. Such a server-based
deployment could support telemedicine facilities in remote areas and help
people in rural areas to have eye conditions diagnosed easily, with visual
interpretation.
Funding
This work was made possible by Qatar National Research Fund (QNRF)
NPRP12S-0227–190164 and International Research Collaboration Co-Fund
(IRCC) grant: IRCC-2021–001. The statements made herein are solely the
responsibility of the authors.

References
1. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a
reappraisal of terminology, classification and staging. Clinical & Experimental
Ophthalmology, 36(8), 786–795.

2. Oellers, P., & Karp, C. L. (2012). Management of pigmented conjunctival lesions.
The Ocular Surface, 10(4), 251–263.

3. Yoo, T. K., Choi, J. Y., Kim, H. K., Ryu, I. H., & Kim, J. K. (2021). Adopting
low-shot deep learning for the detection of conjunctival melanoma using ocular
surface images. Computer Methods and Programs in Biomedicine, 205, 106086.

4. Shields, C. L., Fasiudden, A., Mashayekhi, A., & Shields, J. A. (2004).
Conjunctival nevi: clinical features and natural course in 410 consecutive patients.
Archives of Ophthalmology, 122(2), 167–175.

5. Wong, J. R., Nanji, A. A., Galor, A., & Karp, C. L. (2014). Management of
conjunctival malignant melanoma: a review and update. Expert Review of
Ophthalmology, 9(3), 185–204.

6. Isager, P., Engholm, G., Overgaard, J., & Storm, H. (2002). Uveal and
conjunctival malignant melanoma in Denmark 1943–97: observed and relative
survival of patients followed through 2002. Ophthalmic Epidemiology, 13(2), 85–
96.

7. Chang, A. E., Karnell, L. H., & Menck, H. R. (1998). The National Cancer Data
Base report on cutaneous and noncutaneous melanoma: A summary of 84,836
cases from the past decade. Cancer: Interdisciplinary International Journal of the
American Cancer Society, 83(8), 1664–1678.

8. Larsen, A. C., Dahmcke, C. M., Dahl, C., Siersma, V. D., Toft, P. B., Coupland, S.
E., et al. (2015). A retrospective review of conjunctival melanoma presentation,
treatment, and outcome and an investigation of features associated with BRAF
mutations. JAMA Ophthalmology, 133(11), 1295–1303.

9. Kao, A., Afshar, A., Bloomer, M., & Damato, B. (2016). Management of primary
acquired melanosis, nevus, and conjunctival melanoma. Cancer Control, 23(2),
117–125.

10. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a
reappraisal of terminology, classification and staging. Clinical & Experimental
Ophthalmology, 36(8), 786–795.

11. Hallak, J. A., Scanzera, A., Azar, D. T., & Chan, R. P. (2020). Artificial
intelligence in ophthalmology during COVID-19 and in the post COVID-19 era.
Current Opinion in Ophthalmology, 31(5), 447.

12. Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T.,
Way, G. P., et al. (2018). Opportunities and obstacles for deep learning in biology
and medicine. Journal of The Royal Society Interface, 15(141), 20170387.

13. Topol, E. J. (2019). High-performance medicine: the convergence of human and
artificial intelligence. Nature Medicine, 25(1), 44–56.

14. DuBois, K. N. (2019). Deep medicine: How artificial intelligence can make
healthcare human again. Perspectives on Science and Christian Faith, 71(3), 199–
201.

15. Rahman, T., Akinbi, A., Chowdhury, M. E., Rashid, T. A., Şengür, A., Khandakar,
A., et al. (2022). COV-ECGNET: COVID-19 detection using ECG trace images
with deep convolutional neural network. Health Information Science and Systems,
10(1), 1–16.

16. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed,
A., et al. (2022). HipXNet: Deep learning approaches to detect aseptic loosening
of hip implants using X-ray images. IEEE Access, 10, 53359–53373.

17. Abir, F. F., Alyafei, K., Chowdhury, M. E., Khandakar, A., Ahmed, R., Hossain,
M. M., et al. (2022). PCovNet: A presymptomatic COVID-19 detection
framework using deep learning model using wearables data. Computers in Biology
and Medicine, 147, 105682.

18. Chowdhury, M. H., Shuzan, M. N. I., Chowdhury, M. E., Reaz, M. B. I., Mahmud,
S., Al Emadi, N., et al. (2022). Lightweight end-to-end deep learning solution for
estimating the respiration rate from photoplethysmogram signal. Bioengineering,
9(10), 558.

19. Wang, G., Ye, J. C., Mueller, K., & Fessler, J. A. (2018). Image reconstruction is a
new frontier of machine learning. IEEE Transactions On Medical Imaging, 37(6),
1289–1296.

20. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks
for biomedical image segmentation. In International Conference on Medical
image computing and computer-assisted intervention (pp. 234–241).

21. Haskins, G., Kruger, U., & Yan, P. (2020). Deep learning in medical image
registration: A survey. Machine Vision and Applications, 31(1), 1–18.

22. Karimi, D., Dou, H., Warfield, S. K., & Gholipour, A. (2020). Deep learning with
noisy labels: Exploring techniques and remedies in medical image analysis.
Medical Image Analysis, 65, 101759.

23. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A.,
Alhatou, A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine
learning technique for mortality risk prediction of COVID-19 patients using chest
x-ray Images and clinical data. Neural Computing and Applications.

24. Tahir, A. M., Qiblawey, Y., Khandakar, A., Rahman, T., Khurshid, U.,
Musharavati, F., et al. (2022). Deep learning for reliable classification of COVID-
19, MERS, and SARS from chest X-ray images. Cognitive Computation, 1–21.

25. Tahir, A. M., Chowdhury, M. E., Khandakar, A., Rahman, T., Qiblawey, Y.,
Khurshid, U., et al. (2021). COVID-19 infection localization and severity grading
from chest X-ray images. Computers in Biology and Medicine, 139, 105002.

26. Qiblawey, Y., Tahir, A., Chowdhury, M. E., Khandakar, A., Kiranyaz, S., Rahman,
T., et al. (2021). Detection and severity classification of COVID-19 in CT images
using deep learning. Diagnostics, 11(5), 893.

27. Pacheco, A. G. C., & Krohling, R. A. (2020). The impact of patient clinical
information on automated skin cancer detection. Computers in Biology and
Medicine, 116, 103545.

28. Han, S. S., Park, G. H., Lim, W., Kim, M. S., Na, J. I., Park, I., et al. (2018). Deep
neural networks show an equivalent and often superior performance to
dermatologists in onychomycosis diagnosis: Automatic construction of
onychomycosis datasets by region-based convolutional deep neural network. PloS
one, 13(1), e0191493.

29. Bhimavarapu, U., & Battineni, G. (2022). Skin lesion analysis for melanoma
detection using the novel deep learning model fuzzy GC-SCNN. In Healthcare, p.
962.

30. Martin-Gonzalez, M., Azcarraga, C., Martin-Gil, A., Carpena-Torres, C., Jaen, P.,
& Health, P. (2022). Efficacy of a deep learning convolutional neural network
system for melanoma diagnosis in a hospital population. International Journal of
Environmental Research and Public Health, 19(7), 3892.

31. Haenssle, H. A., Fink, C., Schneiderbauer, R., Toberer, F., Buhl, T., Blum, A., et
al. (2018). Man against machine: diagnostic performance of a deep learning
convolutional neural network for dermoscopic melanoma recognition in
comparison to 58 dermatologists. Annals of Oncology, 29(8), 1836–1842.

32. Brinker, T. J., Hekler, A., Enk, A. H., Klode, J., Hauschild, A., Berking, C., et al.
(2019). A convolutional neural network trained with dermoscopic images
performed on par with 145 dermatologists in a clinical melanoma image
classification task. European Journal of Cancer, 111, 148–154.

33. Yin, G., Gendler, S., & Teichman, J. (2022). Ocular surface squamous neoplasia in
a patient following oral steroids for contralateral necrotising scleritis. BMJ Case
Reports CP, 15(12), e253300.

34. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A.,
Alhatou, A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine
learning technique for mortality risk prediction of COVID-19 patients using chest
x-ray images and clinical data. arXiv preprint arXiv:2206.07595

35. Khandakar, A., Chowdhury, M. E., Reaz, M. B. I., Ali, S. H. M., Kiranyaz, S.,
Rahman, T., et al. (2022). A novel machine learning approach for severity
classification of diabetic foot complications using thermogram images. Sensors,
22(11), 4249.

36. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed,
A., et al. (2022). HipXNet: Deep learning approaches to detect aseptic loosening
of hip implants using X-ray images. IEEE Access, 10, 53359–53373.

37. Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., et al. (2020). Score-
CAM: Score-weighted visual explanations for convolutional neural networks. In
Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition workshops (pp. 24–25).

38. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., et al.
(2019). Attention gated networks: Learning to leverage salient regions in medical
images. Medical Image Analysis, 53, 197–207.

39. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015).
ImageNet large scale visual recognition challenge. International Journal of
Computer Vision, 115(3), 211–252.

40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015).
Going deeper with convolutions. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 1–9).
41. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 770–778).

42. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely
connected convolutional networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 4700–4708).

43. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for
convolutional neural networks. In International conference on machine learning
(pp. 6105–6114).

44. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D.
(2017). Grad-cam: Visual explanations from deep networks via gradient-based
localization. In Proceedings of the IEEE international conference on computer
vision (pp. 618–626).

45. Podder, K. K., Chowdhury, M. E., Tahir, A. M., Mahbub, Z. B., Khandakar, A.,
Hossain, M. S., et al. (2022). Bangla sign language (bdsl) alphabets and numerals
classification using a deep learning model. Sensors, 22(2), 574.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_7

Plant Diseases Classification Using Neural Network: AlexNet
Mohd Anas1, Sanjiban Sekhar Roy1, Kunwar S. Srivastava1 and
Jashabir Chakraborty2
(1) School of Computer Science and Engineering, Vellore Institute of
Technology, Vellore, 632014, India
(2) Mata Gujri College of Pharmacy, Mata Gujri University, Bihar, India

Sanjiban Sekhar Roy
Email: sanjibanroy09@gmail.com

1 Introduction
Until not so long ago, India was primarily an agricultural country. Even
today, there are roughly 118 million farmers in the country [1]. One of the
major issues that these farmers and cultivators face is the range of diseases
that affect their plants. Disease not only worsens their economic problems
but also affects their social life, since it can wipe out many hours, and
sometimes years, of hard work. Several chemicals can be employed to
alleviate the problem, but the major issue is diagnosis: unless farmers have a
lab in their vicinity, diseases are likely to be misidentified. Furthermore, the
situation may worsen, as diseases often spread to other farms. India has
seen a large increase in smartphone sales, coupled with the rise of the
middle class. Telecommunication companies competing for a hold on this
growing market have driven the cost of internet usage to nearly zero. There
are nearly 833 million internet users [2], which is equal to 59.28% of the
population of India. In this chapter, we work on the premise that if farmers
and cultivators with smartphones and internet access are given a way to
diagnose plant diseases, food loss in the country could be reduced.
To help these farmers, David P. Hughes and Marcel Salathé created
PlantVillage, an open-access database of more than 50,000 images of
healthy and diseased crops [3]. The database covers more than 150 crops
and 1800 diseases. PlantVillage is also a community of people who help
each other by answering questions and identifying diseases from the
pictures posted with those questions. This is helpful, but human diagnosis
has the drawbacks stated above, and in their paper Hughes and Salathé
describe the advantages of computer diagnostic tools over human diagnosis
[3]. The full set of images in their dataset cannot be downloaded. In April
2016, however, PlantVillage released a subset of the dataset for an image
classification challenge on CrowdAI [3].

2 Machine Learning and Deep Learning


In this section, we discuss machine learning and neural networks in
detail.

2.1 History
Deep learning was long an underappreciated field for several reasons, such
as the absence of powerful GPUs, the absence of the required data, and
limited scientific work. In fact, “deep learning” is a term coined to attract
interest in neural networks again. There have been three phases of
development in the field: it was known as cybernetics in the 1940s–1960s,
connectionism in the 1980s–1990s, and deep learning from the late 2000s.
Such models are also known as artificial neural networks (ANNs) because
their design is inspired by biological neural networks [4].
The earliest neural network models were simple linear models. They
were designed to take inputs {x1, x2, …, xN} at the input layer and associate
them with an output y. The network would learn the weights {w1, w2, …, wN}
such that
$$y = w_1 x_1 + w_2 x_2 + \dots + w_N x_N \qquad (1)$$
The McCulloch-Pitts neuron, the perceptron, and ADALINE (adaptive linear
element) were some of these linear models. Although these models were
very useful, they had limitations; most importantly, they could not represent
the XOR function. Neural networks lost popularity after this discovery.
A great deal of research took place during the second phase, popularly
known as connectionism. The most important development in this phase
was the successful use of the backpropagation algorithm for training
purposes.
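
The XOR limitation is easy to reproduce. The following minimal sketch trains a single linear threshold unit with the classic perceptron update rule on the AND and XOR truth tables; the function names and the integer learning rate of 1 are illustrative choices, not taken from the chapter.

```python
def train_perceptron(samples, epochs=20):
    """Perceptron rule with integer weights; returns (w1, w2, bias)."""
    w1 = w2 = bias = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            error = target - predicted  # -1, 0 or +1
            w1 += error * x1
            w2 += error * x2
            bias += error
    return w1, w2, bias


def accuracy(samples, weights):
    """Fraction of samples that the linear threshold unit classifies correctly."""
    w1, w2, bias = weights
    hits = sum(
        (1 if w1 * x1 + w2 * x2 + bias > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND accuracy:", accuracy(AND, train_perceptron(AND)))
print("XOR accuracy:", accuracy(XOR, train_perceptron(XOR)))
```

AND is linearly separable, so the unit learns it exactly. No single line separates the XOR classes, so at least one of the four XOR points is always misclassified, however long the unit is trained.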
Algorithms such as backpropagation and LSTM are still popular. The
popularity of neural networks nevertheless declined, because companies
made unrealistic claims and then under-delivered on them. Meanwhile,
various other machine learning models were performing far better than
neural networks, which reduced their popularity further. In 2006, Geoffrey
Hinton trained a neural network called a deep belief network, which
sparked interest in neural networks again. The world now had more
computational power and more data. By 2012, deep learning had proved to
be a useful state-of-the-art technology in object detection, image
classification, and computer vision.

2.2 Machine Learning Basics


A learning program is said to learn from experience E on task T with
respect to performance measure P, if its performance on T improves with
experience E. A learning program produces a representation R (often called
a hypothesis h) of what it has learned. Another program can use R to
perform T. A learning program uses a learning algorithm A to produce R
from E [4].

2.2.1 Capacity, Overfitting and Under Fitting


The main challenge in machine learning is that the trained model must
perform well on new data points. This ability to perform well on new data
points is called generalization. When we train a model on a dataset, we have
an error measure known as the training error, and we want this error to be
as low as possible. But for the model to be useful, it must also generalize
well, which means that the test error should be low as well [4, 5].
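Take linear regression, for example. We train the model by minimizing
the training error, which is:

$$\frac{1}{m^{(\text{train})}} \left\lVert X^{(\text{train})} w - y^{(\text{train})} \right\rVert_2^2 \qquad (2)$$

where $X^{(\text{train})}$ and $y^{(\text{train})}$ hold the $m^{(\text{train})}$ training inputs and targets and $w$ is
the weight vector. Similarly, we want our test error to be small, which would be:

$$\frac{1}{m^{(\text{test})}} \left\lVert X^{(\text{test})} w - y^{(\text{test})} \right\rVert_2^2 \qquad (3)$$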

There are two factors determining the performance of a machine learning
model: first, its ability to make the training error small, and second, its
ability to keep the difference between the training and test errors small.
Underfitting occurs when the model is not able to make the training error
small, and overfitting occurs when it cannot reduce the gap between the
training error and the test error.
In simpler words, when the model has not sufficiently learned the
features, we call it underfitting, whereas when the model memorizes the
training data instead of learning from it, we call it overfitting. We can
control whether a model is more likely to overfit or underfit by altering its
capacity.
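
The gap between training and test error can be illustrated with two toy models fitted to the same noisy samples of a straight line. One is a least-squares line, which has low capacity. The other is a lookup model that memorizes the training set and answers with the target of the nearest training point. The data values and function names below are invented for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns a predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(xs, ys):
    """High-capacity model: return the target of the nearest training input."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def mse(model, xs, ys):
    """Mean squared error of a predictor on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training targets follow y = 2x + 1 with small fixed noise; test targets are noise-free.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.5, 2.5, 5.4, 6.6, 9.3, 10.7]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [2.0, 4.0, 6.0, 8.0, 10.0]

line = fit_line(train_x, train_y)
lookup = memorize(train_x, train_y)
for name, model in (("line", line), ("lookup", lookup)):
    print(name, "train:", mse(model, train_x, train_y), "test:", mse(model, test_x, test_y))
```

The lookup model reaches zero training error but a much larger test error than the line, which is the overfitting pattern described above.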

2.2.2 Hyperparameters and Validation Set


Most machine learning algorithms have several settings that control the
behaviour of the training algorithm; these settings are called
hyperparameters. Hyperparameters are usually not learned by the training
algorithm itself, because it is not appropriate to learn them on the training
set: a hyperparameter fitted on the training set will almost always favour
overfitting. To solve this problem, we need another dataset, known as the
validation set. The validation set is taken from the training data but is not
used during the training process.
The validation set is used during and after training to estimate the
generalization (test) error, and the hyperparameters are updated
accordingly [4]. Typically, we use 80% of the training data for training and
20% for validation.
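
A minimal sketch of such an 80/20 split is shown below. The helper name `train_validation_split` and the fixed seed are illustrative assumptions; any shuffling scheme that keeps the two subsets disjoint would serve the same purpose.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_validation = int(round(len(examples) * validation_fraction))
    validation = [examples[i] for i in indices[:n_validation]]
    training = [examples[i] for i in indices[n_validation:]]
    return training, validation


training, validation = train_validation_split(list(range(100)))
print(len(training), len(validation))
```

Shuffling before splitting avoids holding out a block of examples that all belong to the same class when the data are stored class by class.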

2.2.3 Gradient Descent and Stochastic Gradient Descent


Gradient descent and its variations are widely used in deep learning
algorithms [6]. Gradient descent minimizes an error function such as the
in-sample error

$$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} e(h(x_n), y_n) \qquad (4)$$

To compute the error, or the gradient of the error, we have to evaluate
the hypothesis at every point in the sample. We move down the error surface
along the direction given by the negative gradient. The procedure is
iterative: we take one step at a time, and each step is a full epoch, that is, a
pass that uses all the examples at once. The weight update formula in this
case is
$$\Delta w = -\eta \nabla E_{in}(w) \qquad (5)$$
where $\eta$ is the learning rate.
In stochastic gradient descent, instead of basing each move in the w
space on all the examples, we base it on one example at a time.
∇Ein is based on all examples (xn, yn). Because we are introducing another
method, we call standard gradient descent “batch” GD. In stochastic
gradient descent, we pick one example at a time and apply gradient descent
to the error on that point, e(h(xn), yn), instead of the whole dataset. Now,
consider the average direction in which we are sent.
Average direction:
$$\mathbb{E}_n\left[-\nabla e(h(x_n), y_n)\right] = \frac{1}{N} \sum_{n=1}^{N} -\nabla e(h(x_n), y_n) \qquad (6)$$
If we take the error measure that we are going to minimize, in this case
on just one example, and take its expected value, we get the gradient of the
error function given above [4, 6, 7].
Average direction:

$$\mathbb{E}_n\left[-\nabla e(h(x_n), y_n)\right] = -\nabla E_{in}(w) \qquad (7)$$

So it is as if we are going in the direction we want, except that we use
only one example in each computation and keep repeating. In expectation
we always move in the desired direction, and over time the noise averages
out.
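
The difference between the two update schemes can be sketched on a one-weight linear model h(x) = w·x with squared error. The data, learning rate, and step counts below are illustrative assumptions; the targets are noise-free, so both methods should approach the true weight of 3.

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]  # noise-free targets; the true weight is 3


def example_gradient(w, x, y):
    """Gradient of e = (w * x - y)^2 with respect to w for a single example."""
    return 2.0 * x * (w * x - y)


def batch_gd(w=0.0, lr=0.01, steps=200):
    """Batch gradient descent: every step averages the gradient over all examples."""
    for _ in range(steps):
        gradient = sum(example_gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * gradient
    return w


def sgd(w=0.0, lr=0.01, steps=500, seed=0):
    """Stochastic gradient descent: every step uses one randomly picked example."""
    rng = random.Random(seed)
    for _ in range(steps):
        n = rng.randrange(len(xs))
        w -= lr * example_gradient(w, xs[n], ys[n])
    return w


print("batch GD:", batch_gd())
print("SGD:     ", sgd())
```

Each SGD step costs one gradient evaluation instead of N, at the price of a noisier path; on average the step points in the same direction as the batch gradient.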

2.2.4 Neural Network and Backpropagation


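Suppose we denote the weights by $w_{ij}^{(l)}$, where $l$ indexes the layers, $i$ the
inputs to a layer, and $j$ its outputs [7, 8] (Fig. 1):

$$w_{ij}^{(l)}: \quad 1 \le l \le L, \quad 0 \le i \le d^{(l-1)}, \quad 1 \le j \le d^{(l)} \qquad (8)$$

And if we use $\theta$ as the activation function, where (Fig. 2):

$$\theta(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}} \qquad (9)$$

The output of unit $j$ in layer $l$ of the neural network is

$$x_j^{(l)} = \theta\left(s_j^{(l)}\right) = \theta\left(\sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)}\right) \qquad (10)$$

Apply the input $x$ to the first layer, $x_1^{(0)}, \dots, x_{d^{(0)}}^{(0)}$, and propagate forward
until the output $x_1^{(L)} = h(x)$ is reached.

Fig. 1 A multi-layer perceptron

Fig. 2 Graph for tanh(x) activation function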

2.2.5 Applying Stochastic Gradient Descent


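We take one example at a time, apply it to the network, and adjust the
weights of the network in the direction of the negative gradient; picking
the example at random is what makes the method stochastic [7].
All the weights $w = \{w_{ij}^{(l)}\}$ determine h(x).
The error on example (xn, yn) is:

$$e(h(x_n), y_n) = e(w) \qquad (11)$$

So, to implement SGD, all we have to do is compute the gradient

$$\nabla e(w): \quad \frac{\partial e(w)}{\partial w_{ij}^{(l)}} \qquad (12)$$
We compute this for every i, j, l and then move the entire weight
vector along the negative gradient (Fig. 3).

Fig. 3 Backpropagation: phase I

We can evaluate this derivative using the chain rule:

$$\frac{\partial e(w)}{\partial w_{ij}^{(l)}} = \frac{\partial e(w)}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} x_i^{(l-1)} \qquad (13)$$
where $s_j^{(l)}$ is the weighted input to unit $j$ of layer $l$, $x_i^{(l-1)}$ is the output of
unit $i$ of the previous layer, and $\delta_j^{(l)} = \partial e(w) / \partial s_j^{(l)}$.
Now let us find $\delta$ for the final layer. When we computed the outputs, we
started with the x values of the first layer and propagated them forward
until we reached the output. Here the order is reversed: if we know $\delta$ for the
final layer, we can use it to find $\delta$ for the previous layers by propagating
backwards, hence the name backpropagation.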

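So, for the final layer,

$$\delta_1^{(L)} = \frac{\partial e(w)}{\partial s_1^{(L)}} \qquad (14)$$

where $s_1^{(L)}$ is the weighted input to the output unit and e(w) is the error
measure. The input is passed through each layer until we reach the
output, h(xn), which is compared with the target output yn:

$$e(w) = e(h(x_n), y_n) = e\left(x_1^{(L)}, y_n\right) \qquad (15)$$
Suppose we are using the mean squared error; then (Fig. 4)

Fig. 4 Backpropagation: phase II

$$e(w) = \left(x_1^{(L)} - y_n\right)^2 \qquad (16)$$

$$x_1^{(L)} = \theta\left(s_1^{(L)}\right) \qquad (17)$$

$$\theta'(s) = 1 - \theta^2(s) \quad \text{for } \theta = \tanh \qquad (18)$$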
2.2.6 Backpropagation Algorithm
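1. Initialize all weights $w_{ij}^{(l)}$ at random.
2. For t = 0, 1, 2, … do
3. Pick n from {1, 2, …, N}.
4. Forward: compute all $x_j^{(l)}$.
5. Backward: compute all $\delta_j^{(l)}$.
6. Update the weights: $w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, x_i^{(l-1)} \delta_j^{(l)}$.
7. Iterate to the next step until it is time to stop.
8. Return the final weights $w_{ij}^{(l)}$.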
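
The steps above can be sketched for a small fully connected network with tanh units and squared error. This is a minimal illustration under stated assumptions rather than the chapter's implementation: the helper names, the layer sizes, and the learning rate are hypothetical, and the backward pass uses the standard recursion in which each delta of a layer is (1 − x²) times the weighted sum of the deltas of the next layer.

```python
import math
import random


def init_weights(sizes, seed=0):
    """weights[l][i][j] links unit i of layer l (i = 0 is the bias) to unit j of layer l + 1."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(sizes[l + 1])] for _ in range(sizes[l] + 1)]
        for l in range(len(sizes) - 1)
    ]


def forward(weights, x):
    """Forward pass: return the outputs of every layer, starting with the input."""
    outputs = [list(x)]
    for layer in weights:
        prev = [1.0] + outputs[-1]  # constant 1.0 feeds the bias weights
        signals = [
            sum(layer[i][j] * prev[i] for i in range(len(prev)))
            for j in range(len(layer[0]))
        ]
        outputs.append([math.tanh(s) for s in signals])
    return outputs


def backward(weights, outputs, y):
    """Backward pass: gradients de/dw for e = (h(x) - y)^2, same shape as weights."""
    out = outputs[-1][0]
    # Final-layer delta, using tanh'(s) = 1 - tanh(s)^2.
    deltas = [2.0 * (out - y) * (1.0 - out ** 2)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        prev = [1.0] + outputs[l]
        grads[l] = [[prev[i] * d for d in deltas] for i in range(len(prev))]
        if l > 0:
            # Propagate the deltas one layer back (the bias row i = 0 has no delta).
            deltas = [
                (1.0 - outputs[l][i - 1] ** 2)
                * sum(weights[l][i][j] * deltas[j] for j in range(len(deltas)))
                for i in range(1, len(prev))
            ]
    return grads


def error(weights, x, y):
    """Squared error of the network output on one example."""
    return (forward(weights, x)[-1][0] - y) ** 2


def sgd_step(weights, x, y, lr=0.1):
    """One stochastic gradient descent update on a single example."""
    grads = backward(weights, forward(weights, x), y)
    for l, layer in enumerate(weights):
        for i, row in enumerate(layer):
            for j in range(len(row)):
                row[j] -= lr * grads[l][i][j]


net = init_weights([2, 3, 1])
print("error before:", error(net, [0.5, -1.0], 1.0))
sgd_step(net, [0.5, -1.0], 1.0)
print("error after: ", error(net, [0.5, -1.0], 1.0))
```

A useful sanity check for any backpropagation code is to compare its gradients with finite differences of the error, perturbing one weight at a time.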

2.3 Convolution Neural Network


A convolutional neural network (CNN) is a special kind of neural network.
It takes its name from the fact that it uses convolution in at least one layer.
It is widely used in computer vision tasks such as image segmentation and
classification, among other things [9, 10].

2.3.1 Convolution
In mathematics and engineering, convolution is a mathematical operation
on two functions. It is defined as the integral of the product of the two
functions after one of them is reversed and shifted.

$$(x * w)(t) = \int x(a) \, w(t - a) \, da \qquad (19)$$
Convolution is denoted by an asterisk (*).
In deep learning, the function x(a) is known as the input and the
function w(t − a) as the kernel.
Convolution leverages three important ideas that help a machine
learning system: sparse interactions, parameter sharing, and equivariant
representations. Additionally, convolution provides a means for working
with inputs of variable size.
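
For sampled signals the integral becomes a sum over the index a. The sketch below implements this discrete "full" convolution for two short sequences; the function name `convolve` and the example values are illustrative.

```python
def convolve(x, w):
    """Discrete convolution: out[t] = sum over a of x[a] * w[t - a]."""
    out = [0.0] * (len(x) + len(w) - 1)
    for t in range(len(out)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):  # the kernel index must stay in range
                out[t] += x[a] * w[t - a]
    return out


signal = [1, 2, 3]
kernel = [1, 0, -1]
print(convolve(signal, kernel))
```

The same small kernel is reused at every position t, which is the parameter sharing mentioned above, and each output depends only on a few neighbouring inputs, which is the sparse interaction.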

2.3.2 Pooling
A typical layer of a convolutional network has three stages: a convolution
stage, an activation function such as ReLU, and a pooling stage. A pooling
layer modifies the output of the network by replacing each region of its
input with a statistical summary of that region. It performs downsampling
along the height and width dimensions. The most commonly used pooling
layer is max pooling.
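
As a small illustration, non-overlapping 2 × 2 max pooling over a single-channel feature map can be written as follows. The function name `max_pool2d` and the 4 × 4 example are illustrative.

```python
def max_pool2d(feature_map, size=2):
    """Replace every non-overlapping size x size window by its maximum."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, width - size + 1, size)
        ]
        for i in range(0, height - size + 1, size)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool2d(feature_map))
```

The 4 × 4 input is reduced to 2 × 2, halving both the height and the width.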

2.3.3 ReLU
The rectified linear unit (ReLU) is an activation function defined as
$$f(x) = \max(0, x) \qquad (20)$$
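
In code, the rectified linear unit and the derivative used during backpropagation are one-liners. The function names are illustrative, and the value of the derivative at exactly zero is set to 0 by convention here.

```python
def relu(x):
    """Rectified linear unit: passes positive values, clamps the rest to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU; the value at x = 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


print([relu(v) for v in [-2.0, 0.0, 3.5]])
```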
Convolutional nets were some of the first working deep networks
trained with backpropagation. It is not fully clear why convolutional
networks succeeded when general backpropagation networks were
considered to have failed.

2.4 Various Deep Learning Libraries


There are several deep learning libraries to choose from. Some popular ones
are:

2.4.1 Theano
Theano is a Python-based framework developed by the LISA group, led by
Yoshua Bengio, at the University of Montreal [11].

2.4.2 Torch
Torch is a deep learning framework developed by Ronan Collobert,
Clement Farabet and Koray Kavukcuoglu [12].

2.4.3 Caffe
Caffe is a deep learning framework written in C++ with Python bindings,
developed by Yangqing Jia at the Berkeley Vision and Learning Center. The
biggest advantage of Caffe is the number of pre-trained networks that can
be downloaded from its model zoo [13].

2.4.4 Tensorflow
TensorFlow is an open-source software library for machine learning across
a range of tasks. It was created by Google to meet its need for systems
capable of building and training neural networks that detect and decipher
patterns and correlations.

2.4.5 Deep Learning 4J


Deeplearning4j is an open-source, distributed deep learning framework for
Java and Scala programming languages. It supports a variety of neural
network architectures such as feedforward, recurrent, and convolutional
networks, and enables deployment of models on GPUs, CPUs, and
embedded devices [14].

3 Experimental Work and Results


In this section, we discuss the model used and the experimental
results.

3.1 Dataset
The dataset on CrowdAI consists of 54,309 images for training the neural
network. It covers 14 crop species and includes 17 fungal diseases, 4
bacterial diseases, 2 mold diseases, 2 viral diseases, 1 disease caused by a
mite, and 12 crop species that are visibly healthy. This gives 38 classes of
images in total.
The 14 crop species are: Apple, Blueberry, Cherry, Corn, Grape,
Orange, Peach, Bell Pepper, Potato, Raspberry, Soybean, Squash,
Strawberry, and Tomato (Fig. 5).
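
The class count follows directly from the categories listed above, as the short check below shows. The dictionary keys are illustrative labels for those categories.

```python
# Number of classes per category, as listed for the CrowdAI PlantVillage subset.
class_counts = {
    "fungal diseases": 17,
    "bacterial diseases": 4,
    "mold diseases": 2,
    "viral diseases": 2,
    "mite-caused disease": 1,
    "healthy crop species": 12,
}

total_classes = sum(class_counts.values())
print(total_classes)
```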

Fig. 5 The 38 disease classes of leaves

Fig. 5 shows 38 images, each corresponding to a different class [3].

3.2 Data Pre-processing


Since we are fine-tuning AlexNet, we have to make sure that the input
images have exactly the same size as those originally used to train it.
AlexNet was trained on images of 256 × 256 pixels with a central crop of
227 × 227 pixels, so all the images of the PlantVillage dataset have to be
resized. Instead of reading images straight from disk, we store them in
LMDB, a high-performance embedded transactional database. While Caffe
does support reading images directly from disk, using LMDB as the data
store gives significant performance gains. Finally, we compute the mean of
all the images, which is used in both the training and testing processes.
After updating the LMDB store references, adjusting the parameters in the
configuration files, and changing the hyperparameters in the solver
configuration file, we train the model.
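
Two of the pre-processing steps, the central crop and the mean image, can be sketched with plain nested lists standing in for single-channel images. Resizing and LMDB storage are not shown, since they depend on an image library and on Caffe tooling; the function names below are illustrative assumptions.

```python
def center_crop(image, crop):
    """Cut a crop x crop window from the middle of an H x W image."""
    height, width = len(image), len(image[0])
    top, left = (height - crop) // 2, (width - crop) // 2
    return [row[left:left + crop] for row in image[top:top + crop]]


def mean_image(images):
    """Pixel-wise mean over a list of equally sized images."""
    count = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(image[i][j] for image in images) / count for j in range(width)]
        for i in range(height)
    ]


blank = [[0] * 256 for _ in range(256)]  # stand-in for a resized 256 x 256 image
cropped = center_crop(blank, 227)
print(len(cropped), len(cropped[0]))     # size of the cropped image
print(mean_image([[[0, 2]], [[4, 6]]]))  # mean of two tiny 1 x 2 images
```

For a 256 × 256 image and a 227 × 227 crop, the window starts 14 pixels from the top and left edges.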

3.3 Architecture
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton submitted a
convolutional network called AlexNet to the ImageNet ILSVRC challenge.
The ILSVRC challenge, also known as the ImageNet challenge, is conducted
every year; participants have to build a model that can classify millions of
images into 1000 object classes. They won the challenge that year, and
since then a variation of a CNN has always won it (Fig. 6).

Fig. 6 Architecture of AlexNet

The input layer of AlexNet is formed by the raw pixel values of the image,
and the final layer gives a probability distribution across all the classes.
Each intermediate layer uses a “processed version” of the output of the
previous layer as its input, and over the course of training the layers learn
to activate on more and more complex features, depending on how deep
they are in the overall architecture. Neural networks such as AlexNet are
computationally very expensive; training on the ImageNet dataset usually
takes several weeks. Fortunately, the features learned by the earlier layers
are very generic in nature and can thus be reused on a new dataset with
totally different classes. This approach is known as transfer learning or
fine-tuning. In transfer learning, we take a pre-trained model, keep the
learned weights, modify the final fully connected layers, and then train on
the new dataset. This gives better results. Our PlantVillage dataset has 38
classes instead of the 1000 classes of ImageNet, so we have to change the
num_output parameter of the final fully connected layer in the Caffe
training configuration file [3, 8, 15, 16].
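
The effect of changing num_output can be quantified with a short calculation. It assumes the standard AlexNet layout, in which the final fully connected layer receives 4096 inputs from the layer before it; the function name is illustrative.

```python
def fully_connected_params(num_inputs, num_outputs):
    """Number of weights plus biases in a fully connected layer."""
    return num_inputs * num_outputs + num_outputs


imagenet_head = fully_connected_params(4096, 1000)  # original 1000-class layer
plantvillage_head = fully_connected_params(4096, 38)  # replacement 38-class layer
print(imagenet_head, plantvillage_head)
```

Only this final layer has to be learned from scratch; the earlier layers start from the pre-trained weights.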

3.4 Results
If the data are pre-processed and the files are correctly configured,
training the model is straightforward. While training, we make sure to keep
a log file, both to understand the training process and to generate graphs
from it. Training the model for 2000 iterations took roughly 2 h
(Fig. 7).
Fig. 7 Training curve for accuracy and loss with 2000 iterations

We can see the development of three performance measures: training
loss, test loss, and test accuracy. The training and test losses decreased
significantly, from nearly 1 to 0.1, while the accuracy on the test dataset
was around 91.3%, which is pretty impressive. The two most important
factors to consider in transfer learning are the size of the new dataset and
its similarity to the original dataset. If the new dataset is small and similar
to the original dataset, there is a high chance that the model will overfit. If
we have a large dataset, fine-tuning may work, given that both datasets are
similar [17–27].

4 Conclusion
In conclusion, the use of deep learning in the form of image classification
can provide a budget-friendly and efficient solution to the problem of plant
diseases affecting farmers and cultivators. Otherwise, farmers would need
well-equipped labs to determine the disease. AlexNet is able to obtain 98 to
99% accuracy on the training set and 91.3% accuracy on the test set. In the
future, we would like to employ different deep learning models and perform
different types of augmentation.

References
1. Agarwal, K. (2021). Indian agriculture’s enduring question: Just how many
farmers does the country have?. The Wire. Retrieved, 22.

2. BBC. (2023, January 23). India media guide. BBC News. https://www.bbc.com/
news/world-south-asia-12557390

3. Hughes, D., & Salathé, M. (2015). An open access repository of images on plant
health to enable the development of mobile disease diagnostics. arXiv preprint
arXiv:1511.08060.

4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Book in
preparation for MIT Press. http://www.deeplearningbook.org

5. Jabbar, H., & Khan, R. Z. (2015). Methods to avoid over-fitting and under-fitting
in supervised machine learning (comparative study). Computer Science,
Communication and Instrumentation Devices, 70, 163–172.

6. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals
of mathematical statistics, 400–407.

7. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in
Machine Learning, 2, 1–127. Also published as a book. Now Publishers.

8. Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network.
In Neural networks for perception (pp. 65–93). Academic Press.

9. Roy, S. S., Awad, A. I., Amare, L. A., Erkihun, M. T., & Anas, M. (2022).
Multimodel phishing URL detection using LSTM, bidirectional LSTM, and GRU
models. Future Internet, 14(11), 340.

10. O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural
networks. arXiv preprint arXiv:1511.08458.

11. Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N.,
Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A.,
Bergstra, J., Bisson, V., Snyder, J. B., Bouchard, N., Boulanger-Lewandowski, N.,
Bouthillier, X., de Brébisson, A., … Zhang, Y. (2016). Theano: A python
framework for fast computation of mathematical expressions. arXiv e-prints,
arXiv-1605.

12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T.,
Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito,
Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., ... Chintala, S.
(2019). Pytorch: An imperative style, high-performance deep learning
library. Advances in neural information processing systems, 32.

13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., …
Guadarrama, S. & Darrell, T. (2014). Caffe: Convolutional architecture for fast
feature embedding. In Proceedings of the 22nd ACM international conference on
Multimedia (pp. 675–678).

14. Gibson, A., Nicholson, C., Patterson, J., Warrick, M., Black, A. D., Kokorin, V., ...
& Eraly, S. (2016). Deeplearning4j: Distributed, opensource deep learning for
Java and Scala on hadoop and spark. Towards Data Science.

15. Fei Fei, L., Karpathy, A., Johnson, J. CS231N–Stanford University

16. Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet classification with deep
convolutional neural networks. University of Toronto.

17. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N.,
Mohammadi-Ivatloo, B. (2014). L2 regularized deep convolutional neural
networks for fire detection. Journal of Intelligent & Fuzzy Systems, (Preprint), 1–
12.

18. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022) Deep
convolutional neural network for environmental sound classification via
dilation. Journal of Intelligent & Fuzzy Systems Preprint, 1–7.

19. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN
for brain tumor classification. Applied Sciences, 10(14), 4915.

20. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for
segmentation of retinal blood vessels in fundus images. Iranian Journal of
Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.

21. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep
convolutional neural network for environmental sound classification via dilation.
Journal of Intelligent & Fuzzy Systems, (Preprint), 1–7.

22. Deep learning research should be encouraged for diagnosis and treatment of
antibiotic resistance of microbial infections in treatment associated emergencies in
hospitals.

23. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in
biomedical engineering and healthcare. Academic Press.

24. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural
computation. Academic Press.

25. Forecasting stock price by hybrid model of cascading multivariate adaptive
regression splines and deep neural network

26. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered
gene expression and m6A profiles during hypoxia using tensor decomposition
based unsupervised feature extraction. Scientific reports, 11(1), 1–18.

27. Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. T. Learning from data.
https://amlbook.com/
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_8

Hyperspectral Images: A Succinct Analytical Deep Learning Study
L. Sandeep Kumar1, G. K. Panda2 and B. K. Tripathy3
(1) Biju Patnaik University of Technology, Rourkela, Odisha, India
(2) MITS School of Biotechnology, Utkal University, Bhubaneswar,
Odisha, 765017, India
(3) School of Information Technology and Engineering, VIT, Vellore,
Tamil Nadu, 632014, India

B. K. Tripathy
Email: tripathybk@vit.ac.in

Keywords Feature Extraction – Convolutional Neural Networks – Spinal Fully
Connected Networks – Hyperspectral Image – Spectral and Spatial Features

1 Introduction
Since the advent of imaging spectroscopy in the 1980s, hyperspectral images
(HIs) have been acquired because their fine spectral resolution gives them
strong discriminative power across a diverse range of applications. These
include remote-sensing-based environmental, atmospheric and ocean
observations [66], meteorological applications, military uses [37],
geological exploration and mining [53], crop, vegetation and food analysis,
and biomedical fields [56]. In addition to having high spectral and spatial
resolution, HIs have many bands and abundant information because they cover
ultraviolet, visible, near-infrared, and mid-infrared wavelengths. This opens
avenues of research in HI-based image correction [77], noise reduction [40],
transformation [48], dimensionality reduction, and classification [8].
Machine learning (ML) methods for processing HIs need a large number of
reliably labeled samples for training. Early research in this area focused on
HI classification methods based on spectral information, such as support
vector machines [72], random forests, neural networks [20, 67], and
polynomial logistic regression [45]. An HI is represented as a “hypercube
(x, y, λ)”, in which the first two dimensions are the spatial coordinates and
the third indexes the spectral bands. As a result, each pixel represents a
pattern with as many attributes as there are bands. Because an HI has a large
number of bands, the volume of data to be processed grows rapidly, which
motivates reducing the dimensionality and minimizing the computational
complexity in many real-life HI applications. Numerous dimension reduction
approaches based on feature extraction and feature selection have been
proposed for HIs. Prominent methods include principal component analysis
(PCA) [32], independent component analysis (ICA) [33, 71], and linear
discriminant analysis (LDA) [17]. Deep learning has shown excellent
capabilities in image processing, particularly in recent years, in tasks such
as image classification and target detection. A number of deep learning
network models are available to improve the performance of HI processing,
such as the convolutional neural network (CNN), the deep belief network (DBN)
[24], and the recurrent neural network (RNN). In addition, to resolve the
problem of poor classification results caused by a lack of training samples,
a tensor-based classification model [51, 52] was proposed, and experiments
revealed that when the number of training samples is small, this method
outperforms support vector machines and deep learning. In the first part of
our discussion, one of our primary goals is to enhance classification
accuracy. We use a hyperspectral image of the Sundarban mangrove area
acquired by the Sentinel-2 satellite (Fig. 1). The input image, comprising 12
bands, is processed with spatial analysis using a DL-based 3D CNN. Principal
component analysis (PCA) is applied to derive 3D patches of the Sundarban
satellite image. The process achieves 96% classification accuracy.
Fig. 1 Bhitarkanika mangrove
(Source Google Maps): a Binary image b Grayscale image c RGB image

The remainder of the chapter is organized as follows. In Sect. 2 we review
the related research. In Sect. 3 we highlight some concepts of hyperspectral
images. Section 4 gives an overview of deep learning and CNNs. Section 5
presents an empirical study of 3D-CNN classification on HIs of the Sundarban
mangrove region. In Sect. 6, we present a hybrid-MSSN model, validate it
experimentally on three HI datasets and discuss the outcomes.

2 Related Researches
In HI classification, the spectral dimension (Fig. 2) helps to identify the
significant variations in reflectance between image pixels, which change with
wavelength [38]. A study [31] observed that classification accuracy drops
dramatically once the number of spectral bands grows beyond a certain value.
Since a majority of spectral bands are redundant, taking all bands into
consideration degrades the model’s performance. Dimension reduction
techniques [28, 57] are used to identify such unnecessary bands without
compromising the image’s information content. The modified brown stick rule
for HI [3] makes a notable contribution to dimension reduction. In the
majority of cases, however, the reduced band features alone lead to anomalies
in object identification, and discriminative spatial features become
necessary. According to a study [19], pixels next to each other in an HI tend
to belong to the same class; hence, using the spatial features of an HI along
with its spectral features is an intuitive motivation for effective
classification. Feature extraction methods such as the gray level
co-occurrence matrix [44, 54], stationary wavelet transform (SWT) [43, 73],
discrete wavelet transform [10, 22] and morphological profiles [4, 55] have
been used in many real-world applications.

Fig. 2 Image dimension: a Hyperspectral image b RGB image

Neural network-based techniques have been implemented to tackle many complex
problems of remote sensing [67]. DL techniques have become extremely popular
in recent years, with several real-life applications such as the study of
gene characteristics [25], text-based image retrieval [60], audio signal
classification [9], image processing [2], health care analysis [36],
measuring confidence in interviewees [61], face mask detection [64],
classification of skin cancer [63] and computer vision [70]. In general, DL
has influenced research in AI in a major way. Some such studies are presented
in [1, 69]. The application of deep learning (DL) in ANNs has led to the
development of deep neural networks (DNNs) [6]. DL algorithms used in HI
classification include stacked autoencoders (SAEs) [5] and deep belief
networks (DBNs) [30]. Convolutional neural networks (CNNs) [21, 49] are also
used in HI classification [11]. CNNs have wide applications such as MRI
segmentation [68], diabetic retinopathy [58], the study of Big Data [7],
classification of pests [16], COVID-19 detection [62] and weather
classification [23]. Innovative approaches such as 2D-CNN [50], 3D-CNN [26,
46], spectral-spatial LSTMs [80], SSUN [76] and SSRN [79] have also been
employed in HI classification. The literature shows that a 2D-CNN alone is
not able to generate discriminative features for classification [59], whereas
a 3D-CNN is suitable for volumetric samples. However, a 3D-CNN falls short in
generating discriminative features for classes that have textural similarity
across several spectral bands. To address these shortcomings, the HybridSN
model [59] was proposed, which comprises 2-D and 3-D convolution layers to
generate discriminative spectral and spatial features. The MCNN-CP model [78]
also achieves promising accuracy with a solution based on 3-D and 2-D
convolution layers.
DL methods have achieved noteworthy performance in the domains of visual
information processing and AI. Special DNNs such as gated recurrent unit
networks are used for detecting toxicity [39], and Wide ResNet is used for
age and gender estimation [14]. Deep learning pioneered the automatic
extraction of hierarchical deep features from an HI in a practicable way.
Such models consider an image to be organized into hierarchical components
such as pixels, edges, parts and objects. In contrast to shallow handcrafted
features, end-to-end deep features are capable of representing more abstract
and complex shapes in the image. They perform well even where there are rapid
regional changes in an image.
Conventional image classifiers presume that the data are uniformly
distributed across the classes, and are therefore biased towards samples of
the majority classes when the data are imbalanced, as is the case for HIs.
Hence, special measures are needed to tackle the imbalanced characteristics
of HI classification [65]. The studies in [29, 47, 74] demonstrate the
efficacy of data augmentation, pixel pairing and automatic allocation of
unlabeled samples, respectively, in HI classification. The studies in [27,
41–43] model recent concepts: SWT with CNN, decomposition with deep residual
nets, 3D, 2D and depth-wise separable 1D CNNs, CNN with Grey Wolf
optimization, and 1D-EWT with 3D-CNN.
The following outcomes of the literature motivate the models proposed in the
following sections.
1. To assemble a DL model that addresses hierarchical feature extraction.
2. To learn and perform well with limited training data.
3. To minimize the information loss due to dimension reduction,
convolution and max pooling operations.
4. To address the issues of vanishing gradients, computational time,
class imbalance and tolerance to noise.

3 Hyperspectral Images
To start with the fundamental concepts, a digital image can be categorized
as a binary, grayscale, colour or hyperspectral image. Binary images consist
of 0s and 1s, representing black and white respectively, arranged in a 2-D
matrix (r rows × c columns). Grayscale digital images take values from 0 to
255, representing black to white with intermediate levels of gray. Mirroring
how the cone cells of the human eye render colours, colour images combine
red, green and blue (RGB) scales and are digitized as (r rows × c columns) ×
3 channels. RGB coloration is based on the light reflected from objects,
which falls into separate wavelength ranges (long, medium and short for red,
green and blue respectively) in the visible spectrum (the part of the
electromagnetic spectrum perceived by human eyes).
Many wavelengths beyond the visible spectrum also carry valuable information
that the human eye cannot perceive. Formally, a spectral image is similar to
an RGB colour image but has many channels describing the spatial and spectral
information. A multispectral image consists of n band images, where each band
records the light intensity at its wavelength (not necessarily spread over a
contiguous wavelength range). A λ-band hyperspectral image consists of λ
grayscale images, one per band, each recording the light intensity at its
wavelength, stacked on top of each other over a contiguous wavelength range
(r rows × c columns × λ bands).
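
The cube layout described above can be made concrete with a small array example. This is only an illustrative sketch: the sizes (4 rows, 5 columns, 12 bands) and the random values are assumptions chosen for the demonstration, not data from the chapter.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
rows, cols, bands = 4, 5, 12
rng = np.random.default_rng(0)

# An r x c x lambda hypercube filled with random reflectance values.
cube = rng.random((rows, cols, bands))

# A single band is an ordinary grayscale image of shape (rows, cols).
gray_band = cube[:, :, 3]

# A single pixel is a spectral vector with one value per band.
spectrum = cube[2, 1, :]

# Flattening the spatial axes gives one pattern per pixel,
# with as many attributes as there are bands.
pixels = cube.reshape(-1, bands)
```

Viewed this way, spectral classifiers operate on the rows of `pixels`, while spatial methods operate on neighbourhoods within `cube`.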
4 Deep Learning and CNN
The idea behind deep learning (DL) is to train machines to model complex
patterns, learn from experience, and classify and recognize data or images
much as a human brain does. As a type of ANN, the CNN is widely used for
image and object recognition (processing images, analyzing videos, and
detecting obstacles in autonomous vehicles). There have been phenomenal
developments in ANN-based methods for DL-based classification and
object/image recognition. Three core DL layers (dense, convolution and
output) support learning-based HI solutions in supervised, semi-supervised or
unsupervised designs.
HI-based DL models are being developed for many classification and object
identification purposes using these three designs. Which design suits an
application depends on the availability of labeled HI data. Specifically, if
the HI model maps labeled datasets to the ground truth, the supervised design
is used. To extract or unveil properties of HI data from unlabeled datasets,
the unsupervised design is used, and when only a small portion of the HI data
is labeled, the semi-supervised design is preferred. Further, convolutional
neural networks (CNNs), in contrast to deep feed-forward neural networks
(DNNs) and autoencoders (AEs), play a vital role in many HI-intensive
applications. In a high-dimensional recognition or prediction system, the
convolution layers of a CNN are specifically oriented to learning local
patterns from images or sequences of images. CNN models (feed-forward, in one
direction) for HI classification generally follow three simple operational
steps. First, the input layer reads the input image and converts it into an
array of pixels. The array then passes through multiple hidden layers, where
feature extraction is handled by convolution, followed by pooling and
rectified linear units as needed. Finally, object classification is handled
by the fully connected layer, and the label is assigned at the output layer.
The most general form of a CNN groups convolutional and pooling layers into
modules; however, variants of this grouping are possible.
From an HI research point of view, the ten most popular deep learning
architectures are convolutional neural networks (CNNs), long short-term
memory networks (LSTMs), recurrent neural networks (RNNs), generative
adversarial networks (GANs), radial basis function networks (RBFNs),
multilayer perceptrons (MLPs), self-organizing maps (SOMs), deep belief
networks (DBNs), restricted Boltzmann machines (RBMs) and autoencoders.

4.1 HI Based Deep Feature Selection
Because HIs have high spectral resolution, the information in each pixel is
generally represented as a one-dimensional spectral vector. A 1D-CNN model
identifies specific features (from the pool of spectral information) of the
HI through such pixels for further classification. Put simply, a 1D-CNN takes
labeled HI data as input, processes it with the class labels during training,
updates the network weights iteratively using a stochastic gradient descent
algorithm, and outputs a classification for each pixel. Convolution
operations on 1-D feature vectors are performed using the 1-D convolution
kernel defined in Eq. 1.

(1)

The 2-D CNN (Eq. 2) uses a 2-D convolution kernel (a 2-D filter) to perform
the convolution operation on a 2-D matrix.

(2)

To perform a convolution operation on 3-D data, 3-D CNNs use 3-D convolution
kernels (Ri refers to the size of each kernel). As the main objective is to
extract the low-level features contained in the HIs, we apply a 3-D filter to
the input image, which generates a cube or cuboid in the 3-D volume space. In
3-D convolution, the same 3-D kernel is applied to overlapping 3-D cubes in
the input to extract the features. Max pooling, dropout, batch normalization
and flatten operations are generally used to process the multi-scale feature
maps generated by each 3-D convolution layer.

(3)
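
The idea of applying one 3-D kernel to overlapping 3-D cubes of the input can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's implementation: the function name `conv3d_valid`, the 5 × 5 × 5 volume and the averaging kernel are assumptions, and, as in most deep learning libraries, the kernel is not flipped (the operation is technically a cross-correlation).

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Slide one 3-D kernel over a 3-D volume (no padding, stride 1)."""
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # The same kernel weights are reused for every overlapping cube.
                cube = volume[i:i + kd, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(cube * kernel)
    return out

# Hypothetical input volume and a simple 3 x 3 x 3 averaging kernel.
volume = np.arange(5 * 5 * 5, dtype=float).reshape(5, 5, 5)
kernel = np.ones((3, 3, 3)) / 27.0
features = conv3d_valid(volume, kernel)
```

Each output value summarizes a spatial neighbourhood and a spectral neighbourhood at once, which is why 3-D kernels are a natural fit for hypercubes.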

In addition to the considerable advantages of deep neural networks (DNNs),
some observed limitations include: (a) difficulty in accommodating a large
number of input features when the first hidden layer is small; (b) a sharp
increase in the number of weights when a large number of input features is
fed to a large first hidden layer; and (c) the vanishing gradient problem
when the number of layers is large, where the gradient is high at neurons
near the output and comparatively low near the inputs. The Spinal Fully
Connected Network (SFCN), or SpinalNet [34], is modeled on the human
somatosensory system and offers solutions to these issues of conventional
DNNs. SFCN is based on gradual input, local output with probable global
influence, and reconfiguration of weights during training. The architecture
[34] of the model is shown in Fig. 3.
Fig. 3 SpinalNet (Source [34])
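
The gradual-input idea can be illustrated with a small forward pass. The sketch below follows the general SpinalNet pattern as we understand it (each hidden layer sees one half of the input plus the previous layer's output, and the output layer sees all hidden outputs); the function name `spinal_forward`, the layer sizes and the omission of biases are assumptions for illustration only, not the configuration used in this chapter.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def spinal_forward(x, hidden_weights, out_weight):
    """Forward pass of a spinal fully connected block (biases omitted)."""
    half = x.shape[1] // 2
    halves = (x[:, :half], x[:, half:2 * half])
    outputs, prev = [], None
    for i, w in enumerate(hidden_weights):
        part = halves[i % 2]  # alternate between the two input halves
        # Gradual input: part of the input plus the previous local output.
        inp = part if prev is None else np.concatenate([part, prev], axis=1)
        prev = relu(inp @ w)
        outputs.append(prev)
    # Every hidden layer contributes directly to the output layer.
    return np.concatenate(outputs, axis=1) @ out_weight

# Hypothetical sizes: 8 input features, 4 spinal layers of width 4, 3 classes.
rng = np.random.default_rng(0)
n_features, width, n_layers, n_classes = 8, 4, 4, 3
half = n_features // 2
hidden_weights = [rng.normal(size=(half, width))]
hidden_weights += [rng.normal(size=(half + width, width))
                   for _ in range(n_layers - 1)]
out_weight = rng.normal(size=(width * n_layers, n_classes))

x = rng.normal(size=(2, n_features))
logits = spinal_forward(x, hidden_weights, out_weight)
```

Because each hidden layer receives only half of the input, the first-layer weight count stays small even when the flattened feature vector is large.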

4.2 HI Based Optimization
In HI-based classification, optimizers are the algorithms used during the
learning process of a neural network. Their main purpose is to minimize the
difference between the expected and actual values by adjusting or updating
the weights so as to make the most accurate predictions.
Gradient descent is one of the prominent methods adopted in image
classification research in the context of deep learning to obtain an
optimized neural network. It may be classified into three basic variants
according to the amount of data used per update: batch gradient descent,
stochastic gradient descent (SGD), and mini-batch gradient descent. In
addition to SGD, adaptive moment estimation (Adam), AdaDelta, root mean
square propagation (RMSProp), Nesterov, AdaMax and Nadam [65, 79] are also
used in many applications.
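
The three gradient descent variants differ only in how many samples feed each weight update, which a single routine with a `batch_size` argument can show. The following is a hedged sketch on a toy least-squares problem; the data, learning rates and function name `gradient_descent` are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Least-squares fit; batch_size selects the gradient descent variant."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ residual / idx.size
            w -= lr * grad
    return w

# Toy noise-free regression problem (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w

w_batch = gradient_descent(X, y, batch_size=64)         # batch GD
w_sgd = gradient_descent(X, y, batch_size=1, lr=0.02)   # stochastic GD
w_mini = gradient_descent(X, y, batch_size=16)          # mini-batch GD
```

Batch descent uses all samples per update, SGD uses one, and mini-batch descent sits in between, trading gradient noise against the cost of each update.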
Adaptive Moment Estimation (Adam) is a replacement optimization algorithm
for SGD for training deep learning models [75], which combines the
capabilities of both RMSProp and AdaGrad [35]. This optimizer needs little
memory and tuning, can handle sparse gradients on very noisy problems,
converges quickly with a low mean absolute error, and is among the most
preferred optimizers, including for hyperspectral image analysis. The mean
and variance (1st and 2nd moments) of the gradient are calculated as:
(4)

(5)

The relative contribution of the past history with respect to the present
gradient is controlled through the decay rates (the hyperparameters β1 and
β2). Each updated parameter wt replaces w; η is the initial learning rate, εt
represents the gradient at time t, vt is the exponential average of the
gradient and st is the exponential average of the square of the gradient.
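
A single Adam update can be sketched as follows. This is the standard formulation of Adam, not necessarily the exact form of Eqs. 4 and 5; the default hyperparameter values, the bias-correction step, the variable name `grad` for the gradient and the toy objective are assumptions made for the illustration.

```python
import numpy as np

def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; v and s are the running 1st and 2nd moments."""
    v = beta1 * v + (1 - beta1) * grad        # exponential average of gradient
    s = beta2 * s + (1 - beta2) * grad ** 2   # exponential average of gradient^2
    v_hat = v / (1 - beta1 ** t)              # bias correction for early steps
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy objective f(w) = (w - 3)^2 with gradient 2 (w - 3).
w, v, s = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, v, s = adam_step(w, grad, v, s, t, lr=0.05)
```

The decay rates `beta1` and `beta2` play the role of β1 and β2 in the text: values close to 1 give the past gradients a long memory.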

5 3D-Convolutional Neural Network Based HI
Classification on Sentinel-2 Satellite Data of Sundarban
Mangrove Regions
Sundarban is one of the largest mangrove areas in the world, stretching from
India to Bangladesh across a delta formed by the rivers Brahmaputra, Meghna
and Padma in the Bay of Bengal. It comprises around 106 islands (Fig. 5) and
is home to a wide range of wildlife, including endangered species, supporting
rich biodiversity.

5.1 Dataset Description
We have used hyperspectral images (HIs) of the Sundarban mangrove region.
The images, with 12 bands, are collected from Sentinel-2 satellite imagery
(COAH, [12]).
Remote-sensing satellite images contain more than three bands and therefore
carry a more diverse set of information about a geographical location than
general images (3 bands: red, green and blue). With more data in the form of
bands, we can understand and analyze the data more effectively. The image in
Fig. 4a represents a satellite image cube that contains R rows, C columns,
and B bands. As stated above, the input Sentinel-2 HIs for the experiment
comprise 12 bands (coastal aerosol, near infra-red (NIR), short wave
infra-red (SWIR), and RGB), with wavelengths ranging from 0.443 to 2.190
micrometres and 10–60 m of spatial resolution. Using the COAH tool, HIs with
less than one percent cloud, filtered with the cloud cover map, were selected
for input image analysis (Fig. 5).

Fig. 4 HI/classification map: a Sundarban b Indian pines [13] c PaviaU [13]


Fig. 5 Satellite data: Sundarban mangrove a Composite image b Ground truth image
c 12-band HI visualization

5.2 Experimental Setup and Hyper-Parameters
The experiment on the Sundarban satellite HI data is run on the Google Colab
Pro™ cloud platform with a graphics processing unit (GPU). Using Python
libraries and methods such as rasterio, loadmat and EarthPy, the input HI
bands are stacked and compared against six major classes: Barren land (BL):
land devoid of vegetation, or sand dunes; River (RV) bodies; Dense Mangrove
(DM): mangrove forest with dense canopy cover; Open Mangrove (OM): mangrove
forest with open canopy and mudflats with very little mangrove cover;
Agriculture (AG): active agricultural practice; and Human habitat (HUM):
human habitation, often under the canopy shade of non-mangrove plants.
Principal Component Analysis (PCA) and the TensorFlow-based Keras package of
Python are used to extract 3D patches (containing the true classes) and to
split the reduced high-dimensional input in a 0.7 to 0.3 ratio (on a scale
of 1) for encoding.
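
The PCA reduction and 3D patch extraction described above might look roughly like the following NumPy sketch. It is not the authors' code: the function names, the zero padding at the image border, the convention that label 0 marks unlabeled pixels, and the toy sizes are all assumptions.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project the spectral axis of an (rows, cols, bands) cube onto its
    leading principal components."""
    rows, cols, bands = cube.shape
    flat = cube.reshape(-1, bands)
    flat = flat - flat.mean(axis=0)
    # Rows of vt are the principal axes of the centred data.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:n_components].T
    return reduced.reshape(rows, cols, n_components)

def extract_patches(cube, labels, window):
    """Cut a window x window x bands patch around every labeled pixel."""
    margin = window // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)))
    patches, targets = [], []
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            if labels[r, c] == 0:  # assumption: 0 marks unlabeled pixels
                continue
            patches.append(padded[r:r + window, c:c + window, :])
            targets.append(labels[r, c] - 1)
    return np.stack(patches), np.array(targets)

# Hypothetical 10 x 10 scene with 12 bands and six classes (labels 1..6).
rng = np.random.default_rng(0)
cube = rng.random((10, 10, 12))
labels = rng.integers(0, 7, size=(10, 10))
labels[0, 0] = 1  # make sure at least one pixel is labeled

reduced = pca_reduce(cube, n_components=3)
X, y = extract_patches(reduced, labels, window=5)
```

Each patch keeps the labeled pixel at its centre, so the class of the patch is the class of that central pixel.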
Next, the patches are processed by a 3D-CNN built from convolution, dropout
and dense layers, with 1,204,098 trainable parameters. The model tries six of
the optimizers discussed in Sect. 4.2 and selects the best (here, Adam).
TensorBoard, EarlyStopping and ModelCheckpoint were used, respectively, to
keep track of learning logs during every batch, to monitor a learning metric
so as to avoid overfitting, and to checkpoint the model at epoch level based
on the loss. The classification accuracy for the input HI is shown below.



To cope with the unbalanced classes and to minimize the loss in the training
and validation of HI patches, the categorical cross-entropy (CCE) loss is
used (Fig. 6). The CCE can be written as
$\mathrm{CCE} = -\sum_{i \in C} t_i \log(S_i)$, with C the set of
classes, ti the ground truth and Si the corresponding CNN score for each
class i after the softmax activation function. The data were augmented using
random horizontal and vertical flips. After tuning, EarlyStopping was
configured with monitor = ‘val_loss’ and restore_best_weights = True, the
batch size was set to (1024 × 6), and the optimizer used was Adam with CCE.
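
The loss and the flip augmentation described above can be sketched with NumPy. This is an illustrative example under stated assumptions: the scores and one-hot targets are made up, the small `eps` guards against log(0), and a flip probability of 0.5 per axis is assumed since the text does not give one.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(t, s, eps=1e-12):
    """Mean over samples of -sum_i t_i * log(S_i)."""
    return float(np.mean(-np.sum(t * np.log(s + eps), axis=1)))

def random_flip(patch, rng):
    """Randomly flip a (rows, cols, bands) patch horizontally and vertically."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]  # horizontal flip (reverse columns)
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]  # vertical flip (reverse rows)
    return patch

# Hypothetical raw CNN scores for two samples and three classes.
scores = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
t = np.array([[1, 0, 0], [0, 1, 0]])  # one-hot ground truth
s = softmax(scores)
loss = categorical_cross_entropy(t, s)

rng = np.random.default_rng(0)
patch = np.arange(2 * 3 * 1, dtype=float).reshape(2, 3, 1)
flipped = random_flip(patch, rng)
```

Because the targets are one-hot, only the softmax score of the true class contributes to the loss for each sample.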

Fig. 6 Training, testing: sundarban mangrove a Accuracy b Loss

6 A Novel Deep Learning Hybrid-MSSN Architecture
for Hyperspectral Image Classification
Researchers often combine two or more types of architectures instead of
relying on a single approach (hybrid models), which can give better results
on complex problems. In other words, hybrid models are a class of methods
that integrate the advantages of different models in the same system. The
following sections describe the methodology for deep classification of
hyperspectral images (HIs) from three HI datasets.

6.1 Architecture of Hybrid-MSSN Model
The architecture of our HI-based deep classification model is presented in
Fig. 7.
Fig. 7 Proposed architecture of our model

In the model, we use multi-scale CNNs and a spinal fully connected network
(SFCN). For joint spectral and spatial feature extraction from the HI we use
3D-CNNs, and for spatial feature learning we use a 2D-CNN. First, the model
takes as input the high-dimensional, multi-band satellite HI, which carries
high-spectral features. We use principal component analysis (PCA) to filter
out unnecessary bands, de-correlate the data and reduce the spectral
dimension without compromising the HI’s information content.

6.2 Dataset Description
We have used the following three popular HI datasets [13] to validate our
Hybrid-MSSN model (Table 1).

Table 1 Description of experimental HI-datasets

| HI dataset | HI-captured source | HI description (rows, columns, filtered bands) | Ground truth description |
|---|---|---|---|
| Indian Pines (IPD) | North-western Indiana (AVIRIS sensor) | Spatial dimension, 145 × 145 × 200 | Classes-16, patches 21,025; land coverage (forest) 66%; land coverage (farming) 33%; crop coverage (corn, soybeans); highways (dual lane) 1; railway line 1; houses/structures/roads |
| Salinas (SD) | Salinas valley, California (AVIRIS sensor) | Spatial dimension, 512 × 217 × 20 | Classes-16, patches; land coverage (bare soils); land coverage (vineyard fields); land coverage (farming); crop coverage (vegetables) |
| Pavia University (PUD) | Pavia, northern Italy (ROSIS sensor) | Spatial dimension, 610 × 610 × 103 | Classes-09, patches |

6.3 Experimental Setup and Result Analysis
The detailed process of the model is outlined in Algorithm 1. The
experimental setup is based on the Google Colaboratory Pro cloud platform
with Python, Jupyter notebooks and GPUs. Keras, a deep learning tool, was
used to validate the model.
In deep CNN classification, as the layers become deeper, the spatial
dimensions of the feature maps shrink sharply, which results in a loss of
information. Conventionally, the FC layer is attached to the deepest
convolutional (or pooling) layer, so the network depends heavily on global
data, which leads to high computation time. To overcome such issues, we use
both shallow and deep convolution features [76] to account for the complexity
of HIs, where distinct items are likely to have varying scales, and we use
the Spinal Fully Connected network (SpinalNet, [34]) instead of the dense
layer.
For the experiment, we first use the PCA transformation to extract the r
most informative spectral bands (r = 30 for IPD, r = 15 for PUD and r = 15
for SD) as per the Modified Brown Stick Rule (MBR) [3]. With iterative noise
filtering we obtain HI cubes of reduced dimension (13 × 13 × r), which is in
line with the findings in [59].
The HI cubes were further divided into two groups with distinct training and
testing samples (Fig. 8). One group uses 10 percent of the samples for
training and 90 percent for testing, and the other uses 30 and 70 percent, to
compensate for the problem of class imbalance. Table 2 presents the
classification outcomes of the three datasets with oversampling.
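
A train/test split followed by oversampling of the weak classes could be sketched as below. The chapter does not state which oversampling technique was used, so random duplication of minority-class training samples is an assumption here, as are the function names and the toy class sizes.

```python
import numpy as np

def stratified_split(y, train_fraction, rng):
    """Pick train_fraction of the samples of every class for training."""
    train_idx = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        n_train = max(1, int(round(train_fraction * idx.size)))
        train_idx.extend(idx[:n_train])
    train_idx = np.array(sorted(train_idx))
    test_idx = np.setdiff1d(np.arange(y.size), train_idx)
    return train_idx, test_idx

def oversample(indices, y, rng):
    """Duplicate minority-class training samples (assumed technique) until
    every class is as large as the largest one."""
    labels = y[indices]
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    out = []
    for cls, count in zip(classes, counts):
        cls_idx = indices[labels == cls]
        out.append(cls_idx)
        if count < target:
            out.append(rng.choice(cls_idx, size=target - count, replace=True))
    return np.concatenate(out)

# Hypothetical imbalanced labels: 100, 20 and 10 samples per class.
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 20 + [2] * 10)
train_idx, test_idx = stratified_split(y, 0.10, rng)  # the 10:90 group
balanced_idx = oversample(train_idx, y, rng)
```

Only the training indices are oversampled; the test set keeps its original class distribution so that the reported accuracies remain comparable.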

Fig. 8 Testing: IPD, SD, PUD a Overall accuracy b Average accuracy c Kappa score

Table 2 Classification performance: with oversampling

| HI dataset | Window size (3D patch) | Train:test ratio (10:90) | Train:test ratio (30:70) |
|---|---|---|---|
| IPD | 9 × 9 | 99.956 ± 0.01 | 99.958 ± 0.01 |
| IPD | 11 × 11 | 99.967 ± 0.02 | 99.986 ± 0.01 |
| IPD | 13 × 13 | 99.967 ± 0.02 | 100.000 ± 0.01 |
| SD | 9 × 9 | 99.934 ± 0.002 | 99.981 ± 0.002 |
| SD | 11 × 11 | 100.000 ± 0.001 | 100.000 ± 0.001 |
| SD | 13 × 13 | 100.000 ± 0.000 | 100.000 ± 0.000 |
| PUD | 9 × 9 | 99.989 ± 0.002 | 99.981 ± 0.002 |
| PUD | 11 × 11 | 99.994 ± 0.002 | 100.000 ± 0.002 |
| PUD | 13 × 13 | 100.000 ± 0.001 | 100.000 ± 0.001 |

The approach also achieves impressive results on all 3D patches of the three
datasets without oversampling; for instance, the accuracies on the three
datasets with a (13 × 13) patch size are given in Table 3. We use Overall
Accuracy (OA), Average Accuracy (AA), Kappa value (KA) and class-wise
accuracy to evaluate the model.
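
These accuracy measures all follow from the confusion matrix, as the sketch below shows. The definitions used are the standard ones (OA as the fraction of correctly classified samples, AA as the mean per-class recall, and Cohen's kappa); the label vectors are made-up examples, not results from the chapter.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy_measures(cm):
    total = cm.sum()
    oa = np.trace(cm) / total                  # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)   # class-wise accuracy
    aa = per_class.mean()                      # average accuracy
    # Agreement expected by chance, from the row and column marginals.
    expected = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa, per_class

# Hypothetical labels for ten test pixels and three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
oa, aa, kappa, per_class = accuracy_measures(cm)
```

OA can hide poor performance on small classes, which is why AA and kappa are reported alongside it for imbalanced HI datasets.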

Table 3 Classification performance: without oversampling

| Dataset | Accuracy measure (%) | Training:testing 20:80 (%) | Time (s) | Training:testing 30:70 (%) | Time (s) | Training:testing 80:20 (%) | Time (s) |
|---|---|---|---|---|---|---|---|
| IPD | OA | 98.65 ± 0.05 | 79.74 | 99.27 ± 0.03 | 85.38 | 99.65 ± 0.01 | 144:71 |
| IPD | AA | 98.69 ± 0.20 | 02.21 | 98.07 ± 0.12 | 02.77 | 99.43 ± 0.02 | 0.718 |
| IPD | K | 98.47 ± 0.10 | | 99.15 ± 0.03 | | 99.61 ± 0.02 | |
| SD | OA | 99.20 ± 0.05 | 184.22 | 99.58 ± 0.03 | 138.38 | 99.95 ± 0.01 | 213.21 |
| SD | AA | 99.32 ± 0.20 | 07.59 | 99.68 ± 0.12 | 06.67 | 99.92 ± 0.02 | 01.97 |
| SD | K | 99.10 ± 0.10 | | 99.53 ± 0.00 | | 99.94 ± 0.02 | |
| PUD | OA | 99.84 ± 0.05 | 59.18 | 99.84 ± 0.03 | 86.62 | 99.99 ± 0.01 | 206.87 |
| PUD | AA | 99.70 ± 0.20 | 06.57 | 99.65 ± 0.12 | 05.85 | 99.99 ± 0.02 | 01.37 |
| PUD | K | 99.79 ± 0.10 | | 99.80 ± 0.03 | | 99.99 ± 0.02 | |

We use the first group, with 10 percent training samples, for model
validation. With 26, 27 and 28 filters (3 × 3 × 3 dimension) in the first,
second and third 3D-convolution layers respectively, we adopt the ‘ReLU’
activation function. In the model, each 3D convolution layer is followed by
3D max pooling with pooling size 2 and a dropout ratio of 0.5. The
2D-convolution layer in the model has 256 filters (3 × 3 dimension) and a
dropout ratio of 0.25. In all SFCNs (1–5), the layer width is set to 256 and
the half width is set to half of the layer width rounded to an integer, a
choice that plays a significant role. We use the Adam optimizer with the
categorical cross-entropy loss function (Fig. 9). The learning rate and decay
were set to 0.001 and 1e−06 respectively. The model is trained for 20 epochs
with a batch size of 256. The model is compared with four published methods
(Fig. 10): EMP-SVM [18], MCNN-CP [78], 2D-CNN [50] and HybridSN [59].

Fig. 9 Epochs and training/validation loss a–c 10% Training IPD, SD, PUD d–f 30%
Training IPD, SD, PUD

Fig. 10 Class-wise classification accuracy: Training sample (T.S.) with oversample
(O.S.) a & b IPD c & d SD e & f PUD

The performance of the model is also investigated (Table 4) by
repeating the experiments with noisy data, with and without weak-class
oversampling, and with different spatial sizes and train-test ratios
(Fig. 11).

Table 4 Accuracy of datasets with noise


Data Accuracy (%) With noise
sets Speckle noise Gaussian noise Salt & Pepper
v = 10 v = 30 v = 50 v = 10 v = 30 v = 50 a = 0 a = 0.5 a = 1
(IPD) OA 98.68 99.60 99.75 98.87 96.54 97.59 99.95 99.60 95.56
AA 99.09 98.50 98.62 99.26 96.54 97.59 99.91 99.50 97.12
K 98.49 99.55 99.72 98.71 96.55 98.44 99.94 99.55 94.95
(SD) OA 99.94 96.91 99.38 99.88 99.94 99.85 99.83 99.52 97.16
AA 99.89 96.17 99.33 99.70 99.88 99.80 99.79 99.16 93.85
K 99.93 96.56 99.31 99.87 99.93 99.83 99.81 99.47 96.83
(PUD) OA 99.97 99.92 99.76 99.27 99.10 95.89 100.00 99.73 99.53
AA 99.94 99.75 99.70 99.56 98.31 93.77 100.00 99.38 99.49
K 99.96 99.90 99.69 99.88 98.80 94.58 100.00 99.64 99.38
Fig. 11 Classification maps: a–c Ground truth of IPD, SD and PUD d–f Our model
(with 30% training) of IPD, SD and PUD g–i Our model (with 10% training and
oversampling) of IPD, SD and PUD
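
The three noise types listed in Table 4 can be emulated with a few lines of NumPy. This is only a sketch under stated assumptions: the chapter does not define how v and a map onto the noise generators, so the variance and amount values below, the function names and the random 13 × 13 × 30 patch are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(x, var):
    """Additive zero-mean Gaussian noise with variance `var`."""
    return x + rng.normal(0.0, np.sqrt(var), x.shape)

def add_speckle(x, var):
    """Multiplicative (speckle) noise: x + x * n, with n ~ N(0, var)."""
    return x + x * rng.normal(0.0, np.sqrt(var), x.shape)

def add_salt_pepper(x, amount, salt_ratio=0.5):
    """Set a fraction `amount` of entries to the max (salt) or min (pepper)."""
    out = x.copy()
    hit = rng.random(x.shape) < amount       # entries to corrupt
    salt = rng.random(x.shape) < salt_ratio  # which corrupted entries become salt
    out[hit & salt] = x.max()
    out[hit & ~salt] = x.min()
    return out

# Hypothetical hyperspectral patch: 13 x 13 pixels, 30 bands, values in [0, 1).
cube = rng.random((13, 13, 30))
noisy = {
    "speckle": add_speckle(cube, var=0.01),
    "gaussian": add_gaussian(cube, var=0.01),
    "salt_pepper": add_salt_pepper(cube, amount=0.05),
}
for name, patch in noisy.items():
    print(name, patch.shape, float(np.abs(patch - cube).mean()))
```

Each noisy patch keeps the shape of the input, so it can be fed to the same classifier to repeat an experiment under noise.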

7 Conclusions and Future Scope


This chapter addresses basic issues related to satellite imaging techniques
and hyperspectral classification techniques. In the first part of the
experimental analysis, we used a Sentinel-2 satellite image of the Sundarban
mangrove and classified the land cover with respect to six ground truth
labels with comparatively better accuracy. To address the identified issues
of limited training size, computational time and classification
performance under noise, we then adopted a combined 3D-2D DL approach for
the generation of hierarchical discriminative deep spectral-spatial features
and HI classification. A multi-scale feature learning technique is employed
in the framework, which increases the ability of the model to classify
objects of diverse shapes even after the information loss caused by the
convolution mechanism. The use of the SpinalNet model enhances the accuracy
and controls the error. Experimental results demonstrate that our model can
classify with a limited number of training samples, and thus avoids the need
for oversampling, and that it performs well even in the presence of Gaussian
and Poisson noise. The model is demonstrated on three benchmark datasets,
giving consistent and competitive values for Overall Accuracy (OA), Average
Accuracy (AA) and Kappa Accuracy (KA) compared to four other
state-of-the-art models. Being a supervised classification model, it is best
used on labeled hyperspectral datasets and is most suitable for applications
in land cover mapping, agriculture and global climate.

References
1. Adate, A., Arya, D., Shaha, A., & Tripathy, B. K. (2020). Impact of deep neural
learning on artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian,
S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research and Applications
(pp. 69–84). De Gruyter Publications. https://doi.org/10.1515/9783110670905-004

2. Adate, A., & Tripathy, B. K. (2018). Deep learning techniques for image
processing. In S. Bhattacharyya, H. Bhaumik, A. Mukherjee & S. De (Eds.),
Machine Learning for Big Data Analysis (pp. 69–90). De Gruyter. https://doi.org/
10.1515/9783110551433-00357

3. Bajorski, P. (2010). Investigation of virtual dimensionality and broken stick rule
for hyperspectral images. In 2010 2nd Workshop on Hyperspectral Image and
Signal Processing: Evolution in Remote Sensing (pp. 1–4).

4. Benediktsson, J. A., Palmason, J. A., & Sveinsson, J. R. (2005). Classification of
hyperspectral data from urban areas based on extended morphological profiles.
IEEE Transactions on Geoscience and Remote Sensing, 43(3), 480–491.

5. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. (2007). Greedy
layer-wise training of deep networks. Advances in Neural Information Processing
Systems, 19, 153.

6. Bhattacharyya, S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K.
(2020). Deep learning research with engineering applications. De Gruyter
Publications. ISBN: 3110670909, 9783110670905. https://doi.org/10.1515/
9783110670905

7. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the
lens of CNN (Chapter 5). In S. S. Roy & Y.-H. Taguchi (Eds.), Handbook
of Machine Learning Applications for Genomics, Studies in Big Data (vol. 103).
ISBN: 978-981-16-9157-7

8. Binol, H. (2018). Ensemble learning based multiple kernel principal component
analysis for dimensionality reduction and classification of hyperspectral imagery.
Mathematical Problems in Engineering, 2018, 14. Article ID 9632569.

9. Bose, A., & Tripathy, B. K. (2020). Deep learning for audio signal classification.
In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep
Learning Research and Applications (pp. 105–136). De Gruyter Publications.
https://doi.org/10.1515/9783110670905-00660

10. Bruce, L. M., Li, J., & Huang, Y. (2022). Automated detection of subpixel
hyperspectral targets with adaptive multichannel discrete wavelet transform.
IEEE Transactions on Geoscience and Remote Sensing, 40(4), 977–980.

11. Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2014). Deep learning-based
classification of hyperspectral data. IEEE Journal of Selected Topics in Applied
Earth Observations and Remote Sensing, 7(6), 2094–2107.

12. COAH: Copernicus Open Access Hub. https://scihub.copernicus.eu

13. Grupo de Inteligencia Computacional. (2014). Hyperspectral remote sensing
scenes. http://www.ehu.eus/ccwintco/index.php

14. Debgupta, R., Chaudhuri, B. B., & Tripathy, B. K. (2020). A wide ResNet-based
approach for age and gender estimation in face images. In A. Khanna, D. Gupta,
S. Bhattacharyya, V. Snasel, J. Platos, A. Hassanien (Eds.), International
Conference on Innovative Computing and Communications, Advances in
Intelligent Systems and Computing (vol. 1087, pp. 517–530). Springer.
https://doi.org/10.1007/978-981-15-1286-5_44

15. Deepa, P., & Thilagavathi, K. (2015). Feature extraction of hyperspectral image
using principal component analysis and folded-principal component analysis. In
2015 2nd International Conference on Electronics and Communication Systems
(ICECS) (pp. 656–660).

16. Dharmasastha, K. N. S., Banu, K. S., Kalaichevlan, G., Lincy, B., & Tripathy,
B.K. (2022). Classification of pest in tomato plants using CNN. In M. N.
Mohanty, S. Das, M. Ray, B. Patra (Eds.), Meta Heuristic Techniques in Software
Engineering and Its Applications. METASOFT 2022. Artificial Intelligence-
Enhanced Software and Systems Engineering (vol. 1). Springer. https://doi.org/10.
1007/978-3-031-11713-8_6

17. Du, Q. (2007). Modified Fisher’s linear discriminant analysis for hyperspectral
imagery. IEEE Geoscience and Remote Sensing Letters, 4(4), 503–507.

18. Fauvel, M., Benediktsson, J. A., Chanussot, J., & Sveinsson, J. R. (2008). Spectral
and spatial classification of hyperspectral data using SVMs and morphological
profiles. IEEE Transactions on Geoscience and Remote Sensing, 46(11), 3804–3814.

19. Fauvel, M., Tarabalka, Y., Benediktsson, J. A., Chanussot, J., & Tilton, J. C.
(2012). Advances in spectral-spatial classification of hyperspectral images.
Proceedings of the IEEE, 101(3), 652–675.

20. Fu, A., Ma, X., & Wang, H. (2018). Classification of hyperspectral image based
on hybrid neural networks. In IGARSS 2018 - 2018 IEEE International Geoscience
and Remote Sensing Symposium (pp. 2643–2646).

21. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural
network model for a mechanism of visual pattern recognition. In Competition and
Cooperation in Neural Nets (pp. 267–285). Springer.

22. Ghasemzadeh, A., & Demirel, H. (2016). Hyperspectral face recognition using 3D
discrete wavelet transform. In 2016 Sixth International Conference on Image
Processing Theory, Tools and Applications (IPTA) (pp. 1–4).

23. Ghiya, A. S., Vijay, V., Ranganath, A., Chaturvedi, P., Tripathy, B. K., & Banu, K.
S. (2021). Weather classification: Image embedding using convolutional
autoencoder and predictive analysis using stacked generalization. In ANTIC
conference. BHU.

24. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep
learning for visual understanding: A review. Neurocomputing, 187, 27–48.

25. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene
characteristics and their applications using deep learning (Chapter 4). In
S. S. Roy & Y.-H. Taguchi (Eds.), Handbook of Machine Learning
Applications for Genomics, Studies in Big Data (vol. 103). ISBN: 978-981-16-9157-7

26. Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-d deep learning
approach for remote sensing image classification. IEEE Transactions on
geoscience and remote sensing, 56(8), 4420–4434.

27. Harikiran, J., Ladi, S. K., Panda, G. K., Dash, R., & Ladi, P. K. (2020).
Hyperspectral image classification bi-dimensional empirical mode decomposition
and deep residual networks. In 2020 International Conference on Artificial
Intelligence and Signal Processing (AISP) (pp. 1–6).

28. Harsanyi, J. C., & Chang, C.-I. (1994). Hyperspectral image classification and
dimensionality reduction: An orthogonal subspace projection approach. IEEE
Transactions on geoscience and remote sensing, 32(4), 779–785.

29. Haut, J. M., Paoletti, M. E., Plaza, J., Plaza, A., & Li, J. (2019). Hyperspectral
image classification using random occlusion data augmentation. IEEE Geoscience
and Remote Sensing Letters, 16(11), 1751–1755.

30. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for
deep belief nets. Neural computation, 18(7), 1527–1554.

31. Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE
transactions on information theory, 14(1), 55–63.

32. Imani, M., & Ghassemian, H. (2014). Principal component discriminant analysis
for feature extraction and classification of hyperspectral images. In 2014 Iranian
Conference on Intelligent Systems (ICIS) (pp. 1–5).

33. Jayaprakash, C., Damodaran, B. B., Sowmya, V., & Soman, K. P. (2018).
Dimensionality reduction of hyperspectral images for classification using
randomized independent component analysis. In 2018 5th International
Conference on Signal Processing and Integrated Networks (SPIN) (pp. 492–496).

34. Kabir, H. M. D., Abdar, M., Jalali, S. M. J., Khosravi, A., Atiya, A.F., Nahavandi,
S., & Srinivasan, D. (2020). SpinalNet: Deep neural network with gradual input

35. Kathuria, A. (2018). Intro to optimization in deep learning: Momentum, RMSProp
and Adam. https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/

36. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare.
In D. P. Acharjya, A. Mitra, & N. Zaman (Eds.),
Deep Learning in Data Analytics-Recent Techniques, Practices and Applications,
Studies in Big Data (vol. 91, pp. 97–115). Springer.
https://doi.org/10.1007/978-3-030-75855-4_6

37. Ke, C. (2017). Military object detection using multiple information extracted from
hyperspectral imagery. In 2017 International Conference on Progress in
Informatics and Computing (PIC) (pp. 124–128).

38. Khan, M. J., Khan, H. S., Yousaf, A., Khurshid, K., & Abbas, A. (2018). Modern
trends in hyperspectral image analysis: A review. IEEE Access, 6, 14118–14129.

39. Kumar, V., & Tripathy, B. K. (2020). Detecting toxicity with bidirectional gated
recurrent unit networks. In V. Bhateja, S. Satapathy, Y.D. Zhang, V. Aradhya
(Eds.), Intelligent Computing and Communication. ICICC 2019. Advances in
Intelligent Systems and Computing (vol. 1034). Springer. https://doi.org/10.1007/
978-981-15-1084-7_57

40. Kwon, H., Hu, X., Theiler, J., Zare, A., & Gurram, P. (2013). Algorithms for
multispectral and hyperspectral image analysis. Journal of Electrical and
Computer Engineering, 2013, 2. Article ID 908906

41. Ladi, S. K., Panda, G. K., Dash, R., et al. (2022). A novel grey wolf optimisation
based CNN classifier for hyperspectral image classification. Multimed Tools Appl,
81, 28207–28230.

42. Ladi, S. K., Panda, G. K., Dash, R., et al. (2022). A novel strategy for classifying
spectral-spatial shallow and deep hyperspectral image features using 1D-EWT and
3D-CNN. Earth Science Informatics.

43. Ladi, S. K., Dash, R., Panda, G. K., Ladi, P. K., & Dhupar, R. (2019).
Hyperspectral image classification using SWT and CNN. In 2019 International
Conference on Information Technology (ICIT) (pp. 172–177).

44. Li, C., Zuo, H., & Fan, T. (2017). Hyperspectral image classification based on gray
level co-occurrence matrix and local mean decomposition. In 2017 4th
International Conference on Systems and Informatics (ICSAI) (pp. 1219–1223).

45. Li, J., Bioucas-Dias, J. M., & Plaza, A. (2010). Semisupervised hyperspectral
image segmentation using multinomial logistic regression with active learning.
IEEE Transactions on Geoscience and Remote Sensing, 48(11), 4085–4098.

46. Li, Y., Zhang, H., & Shen, Q. (2017). Spectral–spatial classification of
hyperspectral imagery with 3d convolutional neural network. Remote Sensing,
9(1), 67.

47. Li, W., Wu, G., Zhang, F., & Du, Q. (2017). Hyperspectral image classification
using deep pixel-pair features. IEEE Transactions on Geoscience and Remote
Sensing, 55(2), 844–853.

48. Ma, Y., Li, R., Yang, G., Sun, L., & Wang, J. (2018). A research on the
combination strategies of multiple features for hyperspectral remote sensing
image classification. Journal of Sensors, 2018, 14. Article ID 7341973.

49. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020).
Convolutional neural networks: A bottom-up approach. In S. Bhattacharyya, A. E.
Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research with
Engineering Applications (pp. 21–50). De Gruyter Publications.
https://doi.org/10.1515/9783110670905-002

50. Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015). Deep
supervised learning for hyperspectral data classification through convolutional
neural networks. In 2015 IEEE International Geoscience and Remote Sensing
Symposium (IGARSS) (pp. 4959–4962).

51. Makantasis, K., Doulamis, A. D., Doulamis, N. D., & Nikitakis, A. (2018).
Tensor-based classification models for hyperspectral data analysis. IEEE
Transactions on Geoscience and Remote Sensing, 56(12), 6884–6898.

52. Makantasis, K., Doulamis, A., Doulamis, N., Nikitakis, A., & Voulodimos, A.
(2018). Tensor-based nonlinear classifier for high-order data analysis. In 2018
IEEE International Conference

53. Notesco, G., Dor, E. B., & Brook, A. (2014). Mineral mapping of Makhtesh Ramon
in Israel using hyperspectral remote sensing day and night LWIR images. In 2014
6th Workshop on Hyperspectral Image and Signal Processing: Evolution in
Remote Sensing (WHISPERS) (pp. 1–4).

54. Pesaresi, M., Gerhardinger, A., & Kayitakire, F. (2008). A robust built-up area
presence index by anisotropic rotation-invariant textural measure. IEEE Journal of
selected topics in applied earth observations and remote sensing, 1(3), 180–192.

55. Pesaresi, M., & Benediktsson, J. A. (2001). A new approach for the
morphological segmentation of high-resolution satellite imagery. IEEE
transactions on Geoscience and Remote Sensing, 39(2), 309–320.

56. Pike, R., Lu, G., Wang, D., Chen, Z. G., & Fei, B. (2016). A minimum spanning
forest-based method for noninvasive cancer detection with hyperspectral imaging.
IEEE Transactions on Biomedical Engineering, 63(3), 653–663.

57. Plaza, A., Martínez, P., Plaza, J., & Pérez, R. (2005). Dimensionality reduction and
classification of hyperspectral image data using sequences of extended
morphological transformations. IEEE Transactions on Geoscience and remote
sensing, 43(3), 466–479.

58. Prabhavathy, P., Tripathy, B. K., & Venkatesan, M. (2022). Analysis of diabetic
retinopathy detection techniques using CNN models. In S. Mishra, H. K.
Tripathy, P. Mallick, K. Shaalan (Eds.), Augmented Intelligence in Healthcare: A
Pragmatic and Integrated Analysis. Studies in Computational Intelligence (vol.
1024). Springer. https://doi.org/10.1007/978-981-19-1076-0_6

59. Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2020). HybridSN:
Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification.
IEEE Geoscience and Remote Sensing Letters, 17(2), 277–281.

60. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep
learning. In Encyclopedia of Information Science and Technology (5th ed., p. 11).
https://doi.org/10.4018/978-1-7998-3479-3.ch007

61. Rungta, R. K., Jaiswal, P., & Tripathy, B. K. (2022). A deep learning based approach
to measure confidence for virtual interviews. In A. K. Das et al. (Eds.),
Proceedings of the 4th International Conference on Computational Intelligence in
Pattern Recognition (CIPR) (pp. 278–291). CIPR 2022, LNNS 480.

62. Sihare, P., Khan, A. U., Bardhan, P., & Tripathy, B. K. (2022). COVID-19
detection using deep learning: A comparative study of segmentation algorithms.
In A. K. Das et al. (Eds.), Proceedings of the 4th International Conference on
Computational Intelligence in Pattern Recognition (CIPR) (pp. 1–10). CIPR
2022, LNNS 480.

63. Jain, S., Singhania, U., Tripathy, B.K., Abouel, E. N., Aboudaif, M. K., & Ali, K.
K. (2021). Deep learning based transfer learning for classification of skin cancer.
Sensors (Basel), 21(23), 8142. https://doi.org/10.3390/s21238142

64. Surya, Y. S., Geetha Rani, K. T., & Tripathy, B. K. (2022). Social distance
monitoring and face mask detection using deep learning. In J. Nayak, H. Behera,
B. Naik, S. Vimal, D. Pelusi (Eds.), Computational Intelligence in Data Mining.
Smart Innovation, Systems and Technologies (vol. 281). Springer.
https://doi.org/10.1007/978-981-16-9447-9_36

65. Sun, T., Jiao, L., Feng, J., Liu, F., & Zhang, X. (2015). Imbalanced hyperspectral
image classification based on maximum margin. IEEE Geoscience and Remote
Sensing Letters, 12(3), 522–526.

66. Teng, M. Y., Mehrubeoglu, R., King, S. A., Cammarata, K., & Simons, J. (2013).
Investigation of epifauna coverage on seagrass blades using spatial and spectral
analysis of hyperspectral images. In 2013 5th Workshop on Hyperspectral Image
and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1–4).

67. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and
applications. Cengage Learning publishers. ASIN: 8131526194, ISBN:
9788131526194

68. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C. (2022). Brain MRI
segmentation techniques based on CNN and its variants (Chapter 10). In J. Chaki
(Ed.), Brain Tumor MRI Image Segmentation Using Deep Learning Techniques
(pp. 161–182). Elsevier publications.
https://doi.org/10.1016/B978-0-323-91171-9.00001-6

69. Tripathy, B. K., & Adate, A. (2021). Impact of deep neural learning on artificial
intelligence research, Chapter 8. In D. P. Acharjya et al. (Eds.), Springer
publications.

70. Voulodimos, A. (2018). Deep learning for computer vision: a brief review.
Computational Intelligence and Neuroscience, 2018, 13. Article ID 7068349.

71. Wang, & Chang, C. I. (2006). Independent component analysis based
dimensionality reduction with applications in hyperspectral image analysis.
IEEE Transactions on Geoscience and Remote Sensing, 44(6), 1586–1600.

72. Wang, X., & Feng, Y. (2008). New method based on support vector machine in
classification for hyperspectral data. In 2008 International Symposium on
Computational Intelligence and Design (pp. 76–80)

73. Wang, Y., & Cui, S. (2014). Hyperspectral image feature classification using
stationary wavelet transform. In 2014 International Conference on Wavelet
Analysis and Pattern Recognition (pp. 104–108)

74. Wu, Y., Mu, G., Qin, C., Miao, Q., Ma, W., & Zhang, X. (2020). Semi-supervised
hyperspectral image classification via spatial-regulated self-training. Remote
Sensing, 12(1)

75. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., & Woo, W.C. (2015).
Convolutional LSTM network: A machine learning approach for precipitation
nowcasting. In Proceedings of the 28th International Conference on Neural
Information Processing Systems (Vol. 1, pp. 802–810).

76. Xu, Y., Zhang, L., Du, B., & Zhang, F. (2018). Spectral–spatial unified networks
for hyperspectral image classification. IEEE Transactions on Geoscience and
Remote Sensing, 56(10), 5893–5909.

77. Zhang, X., Zhang, A., & Meng, X. (2015). Automatic fusion of hyperspectral
images and laser scans using feature points. Journal of Sensors, 2015, 9. Article
ID 415361

78. Zheng, J., Feng, Y., Bai, C., & Zhang, J. (2021). Hyperspectral image
classification using mixed convolutions and covariance pooling. IEEE
Transactions on Geoscience and Remote Sensing, 59(1), 522–534.

79. Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2018). Spectral–spatial residual
network for hyperspectral image classification: A 3-d deep learning framework.
IEEE Transactions on Geoscience and Remote Sensing, 56(2), 847–858.

80. Zhou, F., Hang, R., Liu, Q., & Yuan, X. (2019). Hyperspectral image classification
using spectral-spatial lstms. Neurocomputing, 328, 39–47.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_9

Chest X-Ray Image Classification of Pneumonia Disease Using
EfficientNet and InceptionV3
Neel Ghoshal1, Mohd Anas1 and Sanjiban Sekhar Roy1
(1) School of Computer Science and Engineering, Vellore Institute of
Technology, Vellore, India

Sanjiban Sekhar Roy


Email: sanjibanroy09@gmail.com

Keywords Image classification – Diagnostic imaging – EfficientNet –
InceptionV3 – CNN – Soft computing

1 Introduction
Pneumonia is a respiratory infection that affects the lungs. It causes
inflammation in the lungs and fluid buildup in the air sacs, leading to
breathing difficulties and accompanying cardiovascular effects.
Pneumonia is considered to be the single largest cause of death in children
worldwide, leading to an estimated 5.9 million deaths annually among
children under 5 years old [1]. Chest X-rays and radiography have long been
used in medicine to diagnose and treat illnesses such as cancer, infections,
emphysema and pneumonia. The analysis of X-ray outputs is generally
conducted in person by expert radiologists. In recent times, the number of
cases requiring chest X-rays has increased substantially [2], and
radiologists consequently have to devote more time to this task. Expertise
is required because the structures present in the lung are highly detailed
and must be analyzed through intricate characteristics and traits that
together point towards a general illness category. Given the increased
frequency of chest X-ray examinations, the volume of data to be processed
manually can lead to time delays, higher costs and errors, which any medical
institution needs to avoid. In this chapter, we propose an automated medical
image diagnosis system that gives radiologists and staff an alternative,
convenient way to process and analyze data efficiently with little manual
work. For our problem statement, we have used two Convolutional Neural
Network (CNN) based algorithms to classify chest X-ray scans for pneumonia.
CNN-based algorithms work well for this image classification problem
because of their inherent ability to reduce the dimensionality of the data
and process it efficiently for accurate results [3]. These advantages come
from the network's building blocks and their tasks: the convolution layer
breaks the image down into smaller sub-parts, giving an efficient,
lower-dimensional representation; the pooling layer takes the output of the
convolution layer as input and reduces the dimensionality further; and the
fully connected layer is the final layer, in which the network learns which
parts are relevant to the classification problem at hand.
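
The way convolution and pooling shrink the representation can be seen on a toy input. The sketch below is illustrative only: the 8 × 8 array, the 3 × 3 averaging filter and the helper names are hypothetical and are unrelated to the models used later in the chapter.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    blocks = fmap[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

image = np.arange(64, dtype=float).reshape(8, 8)  # toy 8 x 8 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3 x 3 averaging filter
fmap = conv2d_valid(image, kernel)                # convolution layer output
pooled = max_pool2d(fmap)                         # pooling layer output
features = pooled.reshape(-1)                     # flattened input to a dense layer
print(image.shape, fmap.shape, pooled.shape, features.shape)
```

The 64 input values are reduced to a 6 × 6 feature map by the convolution and to a 3 × 3 map by pooling, so the fully connected layer only has to weigh 9 values per filter.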

2 Literature Survey
To date, there have been a number of proposals and advances towards similar
and specific medical diagnosis problems. CNNs and Deep Neural Networks
have allowed researchers to build sophisticated models for medical
issues including pneumonia, tuberculosis, Covid-19, lung cancer and many
more [4].
Researchers have employed many different techniques and methodologies for
medical diagnosis tasks in their respective fields, including Convolutional
Neural Networks, Transfer Learning, Image-Level Prediction,
Segmentation Networks, Localization Networks, Image Generation
Networks and Domain Adaptation Networks [4].
For example, Crosby et al. employed deep CNNs to distinguish between
binary labelled chest radiograph data [5]. Deep learning has also been
employed to detect foreign objects in chest radiographs using similar
data [6]. Generative Adversarial Networks have been used for organ
segmentation and bone suppression in chest X-rays [7]. Transfer-learning-based
image classifiers have been studied by Showkat et al. for the detection of
Covid-19 pneumonia [8]. Hirata et al. used deep learning techniques to
estimate pulmonary artery wedge pressure from standard chest X-ray data. The
research community has established a foothold for CNNs in computer vision
problems such as these: in 2015 and 2016, more than 300 papers on
applications of deep learning were published in workshops, conferences,
journals and special issues in this domain [9, 10].

3 Dataset
The dataset used to train our proposed models was obtained from Kaggle
and is named “Chest X-Ray Images (Pneumonia)”. It consists of 5863 images
as training samples, each of which has a binary label marking it as either
‘normal’ or ‘pneumonia’. Because the labels are binary, the proposed models
are tasked with detecting the presence of pneumonia in an image, rather than
with identifying the specific type of pneumonia, such as bacterial or viral.
The images in the dataset are formatted X-ray images of the lungs (Fig. 1).

Fig. 1 Three samples from normal and pneumonia classes

Normal lung X-rays make up 27% of the dataset; the remaining images
correspond to pneumonia (Fig. 2).

Fig. 2 Image category distribution in dataset


4 Data Pre-processing
For data pre-processing, all images are converted to grayscale and a
Gaussian blur is applied to them. Converting the images to grayscale tailors
the dataset to the classification task by representing each pixel as a value
that encodes the intensity of light. Gaussian blur is applied to reduce the
noise and redundant detail in the pixels; it smooths the edges and
boundaries of objects, which softens the transitions between regions. Image
erosion is also applied: the erosion function reduces or removes pixels on
object boundaries, and the number of pixels affected depends on the
characteristics of the image. The Canny edge detection algorithm, as
implemented in OpenCV, is also used; it reduces noise, finds the intensity
gradient of the image and suppresses unwanted pixels (Figs. 3, 4 and 5).

Fig. 3 Grayscale conversion and Gaussian blurring

Fig. 4 Image erosion

Fig. 5 Canny edge detection

5 Proposed Model
We have used two distinct models for this classification problem: the
EfficientNet model and the InceptionV3 model. Both of these models are
based on CNNs (Convolutional Neural Networks).

5.1 EfficientNet
EfficientNet is an architecture based on the methodology of model scaling
in Convolutional Neural Networks. It uniformly scales the dimensions of
depth, width and resolution using a compound coefficient. The
distinguishing feature of this architecture is that it does not scale these
factors arbitrarily; it uses a fixed set of scaling coefficients to scale
the network width, depth and resolution uniformly. Using this technique, its
creators surpassed the accuracy of almost all high-performing convolutional
network models, while simultaneously achieving better efficiency.
Conventional model scaling starts from (a) a baseline model and applies
(b) width scaling, (c) depth scaling or (d) resolution scaling individually,
whereas the EfficientNet model uses a methodology known as compound scaling,
which combines all of these techniques into one hybrid and dynamic structure
(Figs. 6 and 7).
Fig. 6 Baseline network with connecting layers
Fig. 7 Compound scaled network with connecting layers

To obtain the compound scaling factor, it was observed that the
network depth should be increased for higher-resolution images, which helps
capture features that span more pixels in bigger images, and correspondingly
that the network width should also be increased when the resolution is
higher, in order to capture the fine-grained patterns present in the
images [11]. The compound scaling method employed by the EfficientNet model
uses a coefficient φ to uniformly scale the width, depth and resolution of
the neural network.
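The equations for the same are:

$$d = a^{\varphi}, \qquad w = b^{\varphi}, \qquad r = c^{\varphi},$$
$$\text{subject to } a \cdot b^{2} \cdot c^{2} \approx 2, \quad a \ge 1,\ b \ge 1,\ c \ge 1,$$

where d, w and r are the depth, width and resolution multipliers, and
a, b, c are constants that are determined by a small grid search.
Here φ is a user-specified coefficient that controls how many
more resources are available for model scaling, while a, b, c specify how to
assign the extra resources to network depth, width and resolution
respectively [11].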
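
As a worked example of compound scaling, the sketch below computes the depth, width and resolution multipliers a^φ, b^φ and c^φ for a few values of φ. The default constants (1.2, 1.1, 1.15) are the grid-search values reported in the original EfficientNet paper for its baseline network, not values tuned in this chapter, and the function name is hypothetical.

```python
def compound_scale(phi, a=1.2, b=1.1, c=1.15):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    depth = a ** phi
    width = b ** phi
    resolution = c ** phi
    # Compute cost grows roughly with depth * width^2 * resolution^2,
    # which the constraint a * b^2 * c^2 ~ 2 keeps close to 2 ** phi.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{f:.2f}")
```

Because a · b² · c² is close to 2, each increment of φ roughly doubles the compute budget while keeping the three dimensions in a fixed proportion.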
The EfficientNet Architecture is the baseline network for implementing
a framework employing the above criterion and characteristics.

5.2 InceptionV3
InceptionV3 is an image recognition model that has demonstrably achieved
state-of-the-art accuracy on image-related tasks. It builds upon the base
architecture of the InceptionV1 model, which consists of multiple filters
arranged in parallel layers instead of the purely sequential deep layers of
a typical CNN model. Each subpart of a basic Inception model is made of 4
parallel layers: 1 × 1, 3 × 3 and 5 × 5 convolutions and a 3 × 3 max
pooling layer.
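
To make the parallel structure concrete, the following sketch computes the
output shape of one Inception-style block whose four branches are
concatenated along the channel axis. It assumes stride 1 and "same"
padding, so every branch preserves the input height and width, and it
assumes that the pooling branch is followed by a 1 × 1 projection. The
branch filter counts and the function name are hypothetical values chosen
for illustration.

```python
# Sketch: output shape of one Inception-style block whose four parallel
# branches are concatenated along the channel axis. Assumes stride 1 and
# "same" padding, so every branch keeps the input height and width.
# The branch filter counts are illustrative, not taken from the chapter.

def inception_output_shape(height, width, branch_filters):
    """branch_filters maps a branch name to its number of output channels."""
    out_channels = sum(branch_filters.values())
    return height, width, out_channels


branches = {
    "1x1 conv": 64,
    "3x3 conv": 128,
    "5x5 conv": 32,
    "3x3 max pool + 1x1 projection": 32,
}

print(inception_output_shape(28, 28, branches))
```

Because the branches are concatenated rather than summed, the number of
output channels is simply the sum of the branch widths.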
The implementable InceptionV3 model consists of building blocks
including (a) convolutions, (b) average pooling, (c) max pooling, (d)
concatenations, (e) dropouts and (f) softmax (Fig. 8).
Fig. 8 Input layer and output layer dimensions for InceptionV3 model

The model builds upon the base work of the InceptionV1 model. It
factorizes large convolutions into smaller ones, i.e. it replaces a
convolution with a large kernel by a stack of convolutions with smaller
kernels for more effective processing. The model also uses spatial
factorization into asymmetric convolutions, which entails replacing an
n × n convolution by a 1 × n convolution followed by an n × 1
convolution and allows for higher processing efficiency [12]. In addition,
the model uses auxiliary classifiers, which in essence act as regularizers,
and parallel stride blocks, which allow efficient grid size reduction while
avoiding a representational bottleneck.
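
The saving obtained from these factorizations can be estimated with simple
arithmetic. The sketch below counts weights (biases ignored) for a 5 × 5
convolution versus two stacked 3 × 3 convolutions, and for a 3 × 3
convolution versus a 1 × 3 followed by a 3 × 1 convolution. The channel
count of 64 and the helper name are arbitrary assumptions used only for
illustration.

```python
# Sketch: weight counts (biases ignored) for factorized convolutions,
# with an equal number of input and output channels. The channel count
# of 64 is an arbitrary illustrative value.

def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    return kernel_h * kernel_w * in_channels * out_channels


C = 64

# One 5x5 convolution versus two stacked 3x3 convolutions
# (both cover a 5x5 receptive field).
single_5x5 = conv_weights(5, 5, C, C)
stacked_3x3 = 2 * conv_weights(3, 3, C, C)

# One 3x3 convolution versus a 1x3 followed by a 3x1 convolution.
single_3x3 = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)

print(f"5x5: {single_5x5}, two 3x3: {stacked_3x3} "
      f"({1 - stacked_3x3 / single_5x5:.0%} fewer)")
print(f"3x3: {single_3x3}, 1x3 + 3x1: {asymmetric} "
      f"({1 - asymmetric / single_3x3:.0%} fewer)")
```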

6 Experimental Outcome and Analysis


6.1 InceptionV3
Figure 9 shows the training accuracy and validation accuracy curves for
the InceptionV3 model, which was trained for 15 epochs. The peak accuracy
achieved by the model is 92.93%. The training accuracy alternates between
gradual increases and decreases as the prediction confidence of the model
is fine-tuned, until it reaches its peak and declines thereafter. The
validation accuracy curve follows a similar shape and then drops to an
extremely low value before stabilizing, which shows how strongly the
overall accuracy fluctuates as the model parameters change.

Fig. 9 Accuracy curve of inception model

The loss curve of the model, shown in Fig. 10, takes a steep initial
decline and then approaches its lowest value in a stable and coherent
manner. The validation loss curve does not take a steep dive but goes
through a sudden high peak midway through training, after which it
stabilizes and reaches final values that are close to those of the
training loss curve.
Fig. 10 Loss value curve of inception model

These results set a benchmark for pneumonia diagnosis using CNN-based
algorithms. Compared with other models for similar tasks, this model
performs demonstrably better and at the same time is more efficient due to
the built-in design features of the baseline Inception models, as
described in Sect. 5.2.

6.2 EfficientNet
Figure 11 shows the training accuracy and validation accuracy curves for
the EfficientNet model, which was trained for 10 epochs. The peak accuracy
achieved by the model is 95.39%. The training accuracy increases steeply
after the first epoch and then rises gradually and stably to its peak
value after the last epoch. The validation accuracy curve follows a
similar shape until it drops to an extremely low value, rises steeply in
the subsequent epoch, and drops extremely low again two epochs later.

Fig. 11 Accuracy curve of EfficientNet model

The loss curve of the model, depicted in Fig. 12, takes an initial
decline and reaches its lowest value with only negligible ups and downs
along the way. The validation loss curve shows a steep initial decline
similar to that of the training loss and reaches its lowest value in the
following steps, but it then suddenly increases by an enormous amount,
decreases in the following epoch, and increases substantially again 2
epochs later.
Fig. 12 Loss value curve of EfficientNet model

These results likewise set a benchmark for pneumonia diagnosis using
CNN-based algorithms. Compared with other models for similar tasks, this
model performs demonstrably better and at the same time is more efficient
and customizable due to the built-in scaling features of the baseline
EfficientNet model, as described in Sect. 5.1.

7 Discussion
Radiologists, clinicians and staff working on the detection and treatment
of pneumonia and related conditions are constrained by the time available,
the frequency and volume of data to be processed, and the expertise
required. Classifiers already exist for other medical diagnosis tasks,
including breast cancer detection [13], and CNNs have recently been used
for brain tumour classification [14]. Almost all of these constraints can
be eased to a significant extent by machine learning and neural network
based models. At the same time, it must be noted that the final diagnosis
and the inferences drawn from it should ultimately be made by a trained
professional; for now, these classification models are present only to aid
clinicians and trained experts in streamlining their tasks. One limitation
of a model like this is the difficulty of explaining the achieved metrics
and the reasons behind them. Another is its inability to characterize a
few key indicators that point to a subtype of the general illness and that
could necessitate alternate remedies, for example when multiple disorders
either cause or are caused by the pneumonia. The accuracies achieved in
this chapter can be improved further by incorporating a larger dataset or
by developing more specific, custom models based exclusively on X-ray
diagnostics. Another way to achieve improvement is to incorporate the
medical history of the patient as a feature variable in the dataset.
Furthermore, data augmentation techniques can be identified and
incorporated in future models to achieve better results [15–30].

8 Conclusion
In this chapter, we have discussed the experimental use and outcomes of
the EfficientNet and InceptionV3 models for the medical diagnosis of
pneumonia via chest X-rays. We have achieved high accuracies of 95.39%
and 92.93% respectively at a significantly low computational cost. Using
the discussed frameworks can therefore be highly beneficial in the medical
diagnosis of the disease and can come in handy for professional medical
practitioners and radiologists working on the problem. Further refinement
of the approaches and methodologies should have a positive impact on this
cause and pave the way for further improvements.
References
1. Yadav, K. K., & Awasthi, S. (2016). The current status of community-acquired
pneumonia management and prevention in children under 5 years of age in India:
A review. Therapeutic Advances in Infectious Disease, 3(3–4), 83–97.

2. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K.
(2021). Deep learning for chest X-ray analysis: A survey. Medical Image Analysis,
72, 102125.

3. Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., & Chen, M. (2014). Medical
image classification with convolutional neural network. In 13th International
Conference on Control Automation Robotics & Vision (ICARCV), Singapore, pp.
844–848. https://doi.org/10.1109/ICARCV.2014.7064414

4. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K.
(2021). Deep learning for chest X-ray analysis: A survey. Medical Image Analysis,
72, 102125. ISSN 1361-8415 https://doi.org/10.1016/j.media.2021.102125

5. https://www.spiedigitallibrary.org/journals/journal-of-medical-imaging/volume-
7/issue-1/016501/Deep-convolutional-neural-networks-in-the-classification-of-
dual-energy/https://doi.org/10.1117/1.JMI.7.1.016501.short?SSO=1

6. Deshpande, H., Harder, T., Saalbach, A., Sawarkar, A., Buelow, T. (2020).
Detection of foreign objects in chest radiographs using deep learning. In IEEE
17th International Symposium on Biomedical Imaging Workshops (ISBI
Workshops). Iowa City, IA, USA, pp. 1–4. https://doi.org/10.1109/
ISBIWorkshops50223.2020.9153350

7. Eslami, M., Tabarestani, S., Albarqouni, S., Adeli, E., Navab, N., & Adjouadi, M.
(2020). Image-to-images translation for multi-task organ segmentation and bone
suppression in chest X-ray radiography. IEEE Transactions on Medical Imaging,
39(7), 2553–2565. https://doi.org/10.1109/TMI.2020.2974159

8. Showkat, S., & Qureshi, S. (2022). Efficacy of transfer learning-based resnet
models in chest x-ray image classification for detecting COVID-19 pneumonia.
Chemometrics and Intelligent Laboratory Systems, 224, 104534.

9. Hirata, Y., Kusunose, K., Tsuji, T., Fujimori, K., Kotoku, J. I., & Sata, M. (2021).
Deep learning for detection of elevated pulmonary artery wedge pressure using
standard chest x-ray. Canadian Journal of Cardiology, 37(8), 1198–1206.

10. Greenspan, H., Summers, R. M., & van Ginneken, B. (2016). Deep learning in
medical imaging: Overview and future promise of an exciting new technique.
IEEE Transactions on Medical Imaging, 35(5), 1153–1159.

11. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for
convolutional neural networks. In International Conference on Machine
Learning (pp. 6105–6114). PMLR.

12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking
the inception architecture for computer vision. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).

13. Mittal, D., Gaurav, D., & Sekhar Roy, S. (2015). An effective hybridized classifier
for breast cancer diagnosis. In 2015 IEEE International Conference on Advanced
Intelligent Mechatronics (AIM), Busan, Korea (South), pp. 1026–1031. https://doi.
org/10.1109/AIM.2015.7222674

14. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN
for brain tumor classification. Applied Sciences 10(14):4915. https://doi.org/10.
3390/app10144915

15. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation
for deep learning. Journal of Big Data, 6, 60.

16. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels
segmentation in angiograms using convolutional neural network: A deep learning
based approach. CMES-Computer Modeling in Engineering & Sciences, 136(1),
241–255.

17. Turki, T., & Roy, S. S. (2022). Novel hate speech detection using word cloud
visualization and ensemble learning coupled with count vectorizer. Applied
Sciences, 12(13), 6611.

18. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D.,
Mohammadi-Ivatloo, B., et al. (2014). L2 regularized deep convolutional neural
networks for fire detection. Journal of Intelligent & Fuzzy Systems, 1–12.

19. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep
convolutional neural network for environmental sound classification via dilation.
Journal of Intelligent & Fuzzy Systems, 1–7.

20. Forecasting stock price by hybrid model of cascading multivariate adaptive
regression splines and deep neural network.

21. Bose, A., Hsu, C. H., Roy, S. S., Lee, K. C., Mohammadi-Ivatloo, B., &
Abimannan, S. (2021). Forecasting stock price by hybrid model of cascading
multivariate adaptive regression splines and deep neural network. Computers and
Electrical Engineering, 95, 107405.

22. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered
gene expression and m6A profiles during hypoxia using tensor decomposition
based unsupervised feature extraction. Scientific Reports, 11(1), 1–18.

23. Roy, S. S., & Samui, P. (2021). Predicting longitudinal dispersion coefficient in
natural streams using minimax probability machine regression and multivariate
adaptive regression spline. International Journal of Advanced Intelligence
Paradigms, 19(2), 119–127.

24. Marques, G., Agarwal, D., & de la Torre, I. (2020). Automated medical diagnosis
of COVID-19 through EfficientNet convolutional neural network. Applied Soft
Computing, 96, 106691.

25. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for
segmentation of retinal blood vessels in fundus images. Iranian Journal of
Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.

26. Roy, S. S., Samui, P., Nagtode, I., Jain, H., Shivaramakrishnan, V., &
Mohammadi-Ivatloo, B. (2020). Forecasting heating and cooling loads of
buildings: A comparative performance analysis. Journal of Ambient Intelligence
and Humanized Computing, 11(3), 1253–1264.

27. Roy, S. S., Chopra, R., Lee, K. C., Spampinato, C., & Mohammadi-Ivatloo, B.
(2020). Random forest, gradient boosted machines and deep neural network for
stock price forecasting: A comparative analysis on South Korean companies.
International Journal of Ad Hoc and Ubiquitous Computing, 33(1), 62–71.

28. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep
convolutional neural network for environmental sound classification via
dilation. Journal of Intelligent & Fuzzy Systems, 1–7.

29. Chakraborty, C., Bhattacharya, M., Sharma, A. R., Roy, S. S., Islam, M. A.,
Chakraborty, S., Dhama, K., et al. (2022). Deep learning research should be
encouraged for diagnosis and treatment of antibiotic resistance of microbial
infections in treatment associated emergencies in hospitals. International Journal
of Surgery (London, England), 105, 106857.

30. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in
biomedical engineering and healthcare. Academic Press.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129
https://doi.org/10.1007/978-981-99-3784-4_10

Detection of Cancer Using Deep Learning Techniques
Apoorv Singh1, Arjunaditya2 and B. K. Tripathy3
(1) School of Electronics Engineering, VIT, Vellore, TN, 632014, India
(2) School of Computer Science and Engineering, VIT, Vellore, TN,
632014, India
(3) School of Information Technology and Engineering, VIT, Vellore, TN,
632014, India

B. K. Tripathy
Email: tripathybk@vit.ac.in

Keywords Deep learning – Cancer diagnosis – Deep neural network –
Medical imaging – Image segmentation

1 Introduction
Cancer is a dreaded disease that poses a threat to human society.
According to data provided by the World Health Organisation, cancer
accounted for 13% of all fatalities in 2018 [1]. In the coming years it
is predicted to rank among the deadliest diseases in the world, and the
number of cancer cases is expected to rise dramatically: as projected, 12
million individuals are likely to be affected by cancer in 2030. Experts,
specialists, and medical professionals are developing new methods to
combat cancer, but it is well recognized that this battle is quite
challenging [2–4].
The evaluation of medical images by technicians, supported by computers,
is referred to as interpretation. Diagnostic ultrasound images, however,
present the physician with a large volume of data that requires thorough
analysis in a short amount of time. Other imaging processes use
high-energy electromagnetic radiation. Digital images are analyzed by
computer-assisted methods to detect the presence or absence of cancer in
the early stages [5].
Analysis of medical images using computer tools supports medical
professionals in interpreting the medical information inherent in the
images. On the other hand, diagnosis from ultrasound images, or from
images acquired with processes such as high-intensity electromagnetic
radiation, requires the doctor to handle a significant quantity of data
and to analyze it thoroughly in a short amount of time. Digital images
analyzed by computer-assisted methods can potentially be used to detect
the presence or absence of the disease at an early stage. Early cancer
detection is therefore the top goal for saving lives. Many visual
examinations and manual methods are used to find and diagnose cancer in
its early stages. Because human analysis of medical images requires
considerable time and expertise, computerized systems for disease
diagnosis have been proposed to improve the efficiency of medical image
interpretation [5].
Developments in the areas of AI and machine learning (ML) have
progressed fast in recent years, and their rise in the fields of computer
vision, image processing, and computer-assisted diagnosis is eye-catching
[6]. Some of these applications use traditional machine learning
techniques like Support Vector Machines (SVM), decision trees, K-Nearest
Neighbour (KNN) and back propagation [7]. Figure 1 illustrates the overall
relationship among AI, ML and their components. An Artificial Neural
Network (ANN) has an input layer, an output layer and a number of hidden
layers of neurons according to the requirements of the application. The
input layer accepts attributes in the form of input data; the weights
associated with the connections are used to compute the total input to
each hidden node, and an activation function is then applied to obtain the
outputs at the hidden layer nodes. This process is repeated layer after
layer until it reaches the output layer, which generates the final outputs
[8]. The resulting increase in prediction accuracy aids clinicians in
mapping out a subject's treatment and in reducing the emotional and
physical challenges caused by sickness. An important aspect supporting
clinical researchers is the increase in the number of diagnoses made using
the latest AI technology. Computer engineers and health scientists can now
successfully diagnose patients by using multi-factor analysis, classical
logistic regression, and analysis assisted by AI. This is made possible by
theoretical and technical advancements in computer programs and
statistics. These estimations are much more accurate than the
experimental estimates. Recently, researchers have started to develop new
models to predict and detect cancer using AI. These models are crucial for
increasing the precision of cancer survival and sensitivity estimations
[3].
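
The layer-by-layer computation of an ANN described in this paragraph can
be sketched as follows. This is a minimal illustration, assuming a sigmoid
activation and a small 2-3-1 network; the function names, layer sizes and
weight values are hypothetical and are not taken from any model discussed
in this chapter.

```python
import math

# Sketch of an ANN forward pass: each layer forms a weighted sum of its
# inputs, adds a bias and applies an activation function.
# The layer sizes and weight values are arbitrary illustrative numbers.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """weights[j][i] connects input i to neuron j of the layer."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs


def network_forward(inputs, layers):
    """layers is a list of (weights, biases) pairs, applied in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])  # 2 -> 3
output = ([[0.7, -0.5, 0.2]], [0.05])                                # 3 -> 1

prediction = network_forward([1.0, 2.0], [hidden, output])
print(prediction)
```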

Fig. 1 Categorization of DL neural networks

For the detection and management of cancer to succeed, the diagnosis
must be made in the earliest stages of the illness. The most important
thing is to diagnose cancer early in order to preserve the lives of many
individuals [9]. For cancer diagnosis, visual examination and manual
techniques are typically used. Interpreting medical imagery takes a lot of
effort and is quite error-prone [10]. Due to the ambiguous nature of the
symptoms, the limitations of mammography and other screening methods, and
the potential for recurrence after care, detecting a cancer in its initial
phases is extremely challenging [11]. Therefore, high-resolution medical
diagnostics in cancer investigations will lead to the development of
better predictive models [12]. An analysis of studies on the
identification and management of cancer in the literature shows that the
application of AI approaches is expanding [13]. Additionally, it has come
to light that AI techniques are more effective than conventional analysis
methods like statistical and multivariate analysis. Among AI techniques,
the DL approach in particular produces excellent outcomes [14].
DL refers to a specific kind of neural network that has numerous hidden
layers. DL has recently been adopted in many different industries [15]. It
has demonstrated particularly high efficiency in use cases like voice
recognition, as well as image detection within advanced devices such as
driverless cars and drones [14, 16, 17]. Fundamental classification
tasks, including the identification of cancerous and healthy tissue, are
also carried out with conventional ML techniques. Deep neural networks, on
the other hand, offer a better way to use data matrices to create
classification models. With the use of these models, cancer may be
identified, its progression can be observed and predicted, and timely and
effective cancer therapy can then be administered [18].
DL approaches operate by using a backpropagation algorithm to
uncover fine structures in huge and frequently complex datasets.
Conventional machine learning techniques are limited in their ability to
handle raw data in its native format without preprocessing [19]. The
ability to learn invariant features is a property of convolutional neural
networks (CNN), a type of DL system. To build patterns for various object
identification tasks like detection, segmentation, and classification,
CNNs use feature pooling layers, filter banks, dropout layers, batch
normalization layers, and dense layers. CNNs form a multilevel hierarchy
in which the distribution of the inputs to each layer varies throughout
training. To achieve improved performance across tasks, preprocessed data
is extremely desirable [20]. There are many other CNN variations,
including those with shorter connections, like the DenseNet architecture,
which significantly reduces the number of parameters needed to develop
effective designs and has benefits for feature propagation [21].
ResNet, Xception, and GoogLeNet are other varieties of CNN
architectures that have proved effective recently. These networks are
needed because multiscale processing is required, because performance
across the board degrades as a network gets deeper, and because better
topologies with fewer parameters are sought [22–25].
Another critical challenge in DL is the capability of an architecture to
store information over long time periods. Long Short-Term Memory (LSTM)
has been suggested as a potential remedy for this issue. Through the
states of specialized units, the LSTM design enforces constant error flow,
and its learning algorithm is local in time and space [26].
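
To illustrate how an LSTM unit carries information across time steps, the
sketch below implements a single step with scalar state. It assumes the
commonly used formulation with input, forget and output gates, which is
not spelled out in this chapter; the function name, the parameter layout
and all weight values are hypothetical.

```python
import math

# Sketch of a single LSTM step with scalar state, using the common
# formulation with input, forget and output gates. All weights are
# arbitrary illustrative values; real layers use weight matrices.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """params maps each gate name to (input weight, recurrent weight, bias)."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # how much new content to write
    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # candidate memory content

    c = f * c_prev + i * g   # additive update of the cell state
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c


params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the
cell state is copied almost unchanged from one step to the next, which is
what allows information to persist over long sequences.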
Transfer learning is another DL concept worth mentioning. It involves
applying features learned by deep convolutional neural networks to new
tasks. The need for this arises because the new tasks may differ
significantly from the original tasks and there may not be enough labels
or inputs to train a DL architecture for the new tasks. Transfer learning
also allows the learned features to be adapted with ease so that they
generalize dependably [27–29].
DL techniques utilized in cancer detection and treatment are
investigated in this chapter. The purpose of the study is to demonstrate,
with the help of the literature, the effectiveness of the deep learning
approach, one of the machine learning techniques, in treating a condition
like cancer, as well as the methodologies and techniques that are employed
and how they are applied [30].

2 Deep Learning
2.1 Basics of Deep Learning
DL has gained a lot of popularity and success in nearly every industry
and has emerged as a useful tool for understanding how machines perceive
the world. DL techniques are applied in fields including speech
recognition, image classification, video analysis and natural language
processing [31]. Analysis is performed on a mathematical model created by
DL, without using any hand-crafted attribute extractor. The scope for
generalization is one of the key benefits of DL techniques: a learnt
neural network can be reused for additional applications and data types.
When the data set is inadequate, however, DL performs poorly [32].
DL is a kind of machine learning approach that capitalizes on the
benefits of layers of nonlinear processing units [15]. The result of the
preceding layer is fed into the subsequent layer as input. In the DL
approach, representations of the data are learned at multiple feature
levels [33]. A hierarchy is created in the representation by deriving
higher-level features from lower-level features. While generally based on
ANN, DL techniques include more hidden layers and neurons [34]. DL
techniques show excellent outcomes when processing a variety of data
kinds, including text, audio, and video [35, 36]. There are several
applications of DL, including information retrieval, audio and speech
processing [14], multi-modal and multi-task learning, Natural Language
Processing (NLP), image segmentation and image recognition [16].

2.2 Cancer Diagnosis with DL


When making a diagnosis, doctors frequently draw on their own
knowledge, abilities, and experience. A doctor can never guarantee that a
diagnosis of a condition is accurate, regardless of how talented the
doctor is, and diseases are often misdiagnosed. Technologies involving AI
therefore appear on the agenda, because AI possesses the capacity to
evaluate vast quantities of data, resolve complicated issues, and make
very accurate predictions [4]. One of the most modern methods in AI, the
DNN, covers a number of computational methods that are useful for
extracting information from images. Many medical disciplines, such as
radiology and pathology, have used DL algorithms for various medical
tasks. Good results have also been achieved using DL tools in tumor
biology and other fields, such as medical imaging of many kinds [16].

2.3 Deep Neural Network Characteristics


A basic neural network consists of an input layer that is connected
directly to the output layer. DNNs contain several hidden layers that are
efficient at handling complicated problems, and the weights of each layer
are modified using the delta learning rule. By including more hidden
layers, deep neural networks can also discover complex nonlinear
interactions. DNNs are employed in both unsupervised and supervised
learning situations, typically for classification and regression
purposes. Although learning occurs relatively slowly, good performance
outcomes can be produced [34, 35].
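
The delta learning rule mentioned above can be sketched for a single
linear neuron, where each weight is moved in proportion to the error and
to its input. The toy data, learning rate and function name below are
illustrative assumptions; in a deep network the same idea is applied
layer by layer through backpropagation.

```python
# Sketch of the delta learning rule for a single linear neuron:
# w <- w + lr * (target - output) * input. The data and learning rate
# are illustrative values chosen so that the example converges quickly.

def delta_rule_epoch(weights, samples, learning_rate=0.1):
    """Update weights in place over one pass through (inputs, target) pairs."""
    for inputs, target in samples:
        output = sum(w * x for w, x in zip(weights, inputs))
        error = target - output
        for i, x in enumerate(inputs):
            weights[i] += learning_rate * error * x
    return weights


# Learn y = 2 * x1 - 1 * x2 from four noise-free examples.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
           ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
weights = [0.0, 0.0]
for _ in range(200):
    delta_rule_epoch(weights, samples)
print(weights)
```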
In [37], lesions were identified and differentiated using a DNN and
endoscopic imaging. No appreciable difference in diagnostic performance
was found between the artificial intelligence system and skilled
endoscopists. The neural network approach the authors built demonstrated
high sensitivity and great accuracy in discriminating non-cancerous
lesions [37].
The ability of deep neural networks to identify cancer, specifically lung
cancer, from low-dose computed tomography and positron emission
tomography scans was examined in [38]. The DNN algorithm was shown to
achieve excellent results in detecting lung cancer. The work also
demonstrated that efforts to screen for lung cancer became more
successful as a result of the continued development of this technique
[38].
A DNN is a type of neural network with more than two layers and a
certain level of complexity [27].
Advanced mathematical modeling is used to gain deeper understanding,
and as a result, the processing of data or features is considered to be
complex.
The task of pattern recognition is carried out by a neural network, which
is modeled on the activity of the human brain [8, 20]. In particular,
patterns are recognized to classify cells into non-cancerous and cancerous
ones and to track input through the various simulated layers of neural
connections [39, 40].
Dealing with unlabeled data is the major objective of using this network,
with each layer carrying out specific types of tasks [11].

3 Architectures of Deep Learning Neural Networks


Based on the learning technique, the DL neural network architectures are
classified into 4 categories: supervised, semi-supervised, unsupervised, and
reinforcement learning [41]. Figure 1 shows how DL neural networks are
categorized.

3.1 Deep Unsupervised Learning


Deep unsupervised learning architectures examine the internal
representation of the data, employing a few features without the need for
any labeled data. Dimensionality reduction and clustering techniques are
unsupervised methods. Restricted Boltzmann Machines (RBM) and
Auto-Encoders (AE) are a few deep unsupervised learning architectures
[42].

3.2 Deep Supervised Learning


Architectures for supervised learning use labeled data for training.
Inputs together with their target results are fed to the network [43]. The
model obtained in the training phase is validated during testing.
Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM),
Convolution Neural Networks (CNN), and gated recurrent units (GRU) are a
few typical methods used under supervised learning [17].

3.3 Deep Semi-supervised Learning


Partially labeled data is used for the training phase under deep
semi-supervised learning architectures. A few semi-supervised learning
architectures include LSTM, RNN, Generative Adversarial Networks (GAN),
GRU, and deep reinforcement learning [44].

4 Types of Deep Learning Architectures for Cancer Detection
4.1 Convolutional Neural Networks (CNN)
CNNs have been effective in the analysis of 2D as well as 3D images.
A gradient-based algorithm is used to train the majority of CNN systems
[26]. Compared to other neural network models, there are fewer parameters
to be tuned. The CNN architecture comprises both feature extraction and
classification components [45]. Each feature extraction layer receives
input from the layer before it and passes its output to the layer after
it. Convolution, max pooling, and classification are the three types of
layers that make up the CNN architecture. Even numbers are used to
represent convolution layers, while odd numbers are used to represent
max-pooling layers. The classification layer, the final stage of the
architecture, is a fully connected layer. For more accuracy, back
propagation is used to train the classification stage. Max pooling,
global average, average, and minimum pooling are some of the several
types of pooling operations. The convolution layer convolves the data
with a kernel and applies a linear or nonlinear activation function to
create feature maps. The activation functions include the rectified
linear, sigmoid, softmax, identity and hyperbolic tangent functions.
Downsampling occurs in the pooling layer, which is also known as the
subsampling layer. Depending on the application, there are different
numbers of classification layers. Figure 2 shows the convolution neural
network architecture.
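
The feature extraction stages of a CNN can be sketched on a tiny grayscale
image as a convolution, followed by a ReLU activation and 2 × 2 max
pooling. This is a minimal pure-Python illustration, assuming a square
kernel, no padding and stride 1; the image values, the kernel and the
function names are hypothetical.

```python
# Sketch of three CNN stages on a tiny grayscale image: a "valid"
# convolution, a ReLU activation and 2x2 max pooling.
# The image and kernel values are arbitrary illustrative numbers.

def convolve2d(image, kernel):
    """Valid cross-correlation, the operation CNN libraries call convolution."""
    k = len(kernel)  # assumes a square kernel
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 0],
         [2, 1, 0, 1, 1],
         [0, 3, 1, 0, 2]]
vertical_edge = [[1, -1],
                 [1, -1]]

features = relu(convolve2d(image, vertical_edge))  # 5x5 -> 4x4
pooled = max_pool(features)                        # 4x4 -> 2x2
print(pooled)
```

In a full CNN the pooled feature maps would be flattened and passed to the
fully connected classification layer.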

Fig. 2 Architecture of a convolution neural network

4.2 Multi-scale Convolution Neural Network


A multi-scale convolutional neural network is created by modifying the
conventional CNN [46]. It consists of three convolution layers, a
rectified linear unit layer, a max-pooling layer, and two fully connected
layers. The input image is downsampled and features are extracted before
being passed to the multi-scale CNN.

4.3 LeNet-5
This is a 7-layer convolutional neural network that is used to classify
handwritten digits. It takes input images of size 32 × 32, and the number
of convolution layers is increased for more complicated scenarios.
Figure 3 shows the LeNet design, which consists of two convolutional
layers, subsampling layers, and fully connected layers. Gaussian
connections are used in the single output layer [47].
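
The way a 32 × 32 input shrinks through the network can be traced with the
standard output-size formula, (size - kernel) / stride + 1. The sketch
below assumes the layer settings of the original LeNet-5 design (5 × 5
kernels with stride 1 and no padding, 2 × 2 subsampling with stride 2, and
6, 16 and 120 feature maps); these settings are not listed in this chapter
and are stated here as assumptions. The final stage, C5, is a convolution
that covers the whole 5 × 5 map and therefore behaves like a fully
connected layer.

```python
# Sketch: feature-map sizes through LeNet-5 for a 32 x 32 input, assuming
# the layer settings of the original LeNet-5 design (5 x 5 kernels with
# stride 1 and no padding, 2 x 2 subsampling with stride 2).

def output_size(size, kernel, stride=1):
    return (size - kernel) // stride + 1


# (layer name, kernel size, stride, number of feature maps)
layers = [
    ("C1 convolution", 5, 1, 6),
    ("S2 subsampling", 2, 2, 6),
    ("C3 convolution", 5, 1, 16),
    ("S4 subsampling", 2, 2, 16),
    ("C5 convolution", 5, 1, 120),
]

size = 32
shapes = []
for name, kernel, stride, maps in layers:
    size = output_size(size, kernel, stride)
    shapes.append((name, size, size, maps))
    print(f"{name}: {size} x {size} x {maps}")
```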

Fig. 3 Architecture of LeNet-5

4.4 AlexNet
While AlexNet's design is similar to that of LeNet, it has deeper
layers, more filters in every layer, and stacked convolutional layers.
A ReLU activation is applied after every convolutional and fully
connected layer. By reducing the error rate from 26% to 15.3%, this
architecture won the 2012 challenge. It includes data augmentation,
dropout, max pooling, and ReLU activations in addition to 11 × 11, 5 × 5,
and 3 × 3 convolutional kernels [48]. In Fig. 4, the AlexNet architecture
is shown.
Fig. 4 Architecture of AlexNet

4.5 ZFNet
Although ZFNet's architecture was similar to AlexNet's, its settings had
been fine-tuned, making it the 2013 challenge winner. The error rate fell to
14.8%. The number of weights is reduced by using 7 × 7 kernels rather than
11 × 11 kernels. Accuracy improved as a result of the reduced number of
tuning parameters [49].
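
The effect of the smaller kernel on the number of weights can be checked with simple arithmetic. In the sketch below, the 3 input channels and 96 filters are illustrative assumptions for a first convolution layer; they are not stated in the text, and the function name is hypothetical.

```python
def conv_layer_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels


# Assumed layer shape: RGB input (3 channels) and 96 filters.
params_11 = conv_layer_params(11, 3, 96)
params_7 = conv_layer_params(7, 3, 96)
ratio = params_7 / params_11

print(params_11, params_7, round(ratio, 3))
```

With these assumed sizes, moving from 11 × 11 to 7 × 7 kernels leaves about 41% of the original parameters in that layer.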

4.6 GoogleNet
The GoogleNet design builds on LeNet and introduces an inception structure.
It has 22 layers, and during testing the error rate decreased gradually
from 6.66% to 3.66%. The architecture won ILSVRC 2014 [46]. When compared
to the conventional CNN architecture, it has a reduced computational
complexity. Compared to other architectures like AlexNet and VGG [50], it
was less frequently used. Figure 5 shows the GoogleNet architecture.
Fig. 5 Architecture of GoogleNet

4.7 VGGNet
The VGGNet, which consists of sixteen layers with several filters, placed
first in localization and second in classification at ILSVRC 2014 [39].
This architecture has been found to be effective for feature extraction;
however, parameter tuning is quite important. Three VGG models with 11, 16,
and 19 layers were proposed: VGG-11, VGG-16, and VGG-19. All VGG models have
three fully connected layers at the very end. Figure 6 shows the
architecture of the VGGNet.

Fig. 6 Architecture of VGGNet

4.8 ResNet
ResNet, which won ILSVRC 2015, employs skip connections and batch
normalization [51]. When compared to the VGGNet, its computational
complexity is lower. The skip connections resemble the gated units used in
recurrent networks. It has 152 layers in total, and the error rate is as
low as 3.57%. It addresses the vanishing gradient problem. It is a
traditional feed-forward network with residual connections [52]. It
consists of a number of residual blocks, and it operates differently
depending on the architecture. Figure 7 shows the residual network.

Fig. 7 Architecture of ResNet
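
A residual block adds its input back to the output of a small stack of layers, so the block computes y = ReLU(F(x) + x). The following is a minimal sketch, assuming F is two hypothetical linear layers with a ReLU in between; the weights are made up for illustration and are not from the chapter.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with F(x) = W2 . relu(W1 . x)."""
    fx = matvec(W2, relu(matvec(W1, x)))
    return relu([f + v for f, v in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
half = [[0.5 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
passthrough = residual_block(x, zeros, zeros)
# With non-zero weights the block learns a correction on top of x.
corrected = residual_block(x, identity, half)
```

Because the skip path carries x forward unchanged, a block whose weights are near zero behaves like the identity, which is one common explanation for why very deep residual networks remain trainable.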

4.9 Fully Convolutional Networks (FCNs)


In contrast to the classical CNN, the fully connected layer in the fully
convolutional network has been replaced with an up-sampling layer and a
de-convolution layer, as shown in Fig. 8. The up-sampling and
de-convolution layers are designed as the reversed equivalents of the
pooling and convolution layers. Adding up-sampling and de-convolution
layers to the design increased its accuracy [40, 41].

Fig. 8 Architecture of fully convolutional networks
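
To illustrate the idea of up-sampling as the reverse of pooling, the sketch below uses nearest-neighbor up-sampling, which is one simple choice and an assumption here; the chapter does not specify which up-sampling method is used. Function names are hypothetical.

```python
def upsample_nearest(feature_map, factor=2):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling."""
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(len(feature_map[0]) // size)
        ]
        for i in range(len(feature_map) // size)
    ]


coarse = [[1, 2], [3, 4]]
fine = upsample_nearest(coarse)   # 4x4 map at the higher resolution
restored = max_pool2d(fine)       # pooling the result returns the 2x2 map
```

Up-sampling restores the spatial resolution that pooling removed, which is what allows a fully convolutional network to produce an output map of the same size as its input.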

4.10 U-Net
U-Net, which has two paths, was created for the segmentation of medical
images. The first path is an encoder that captures the context of the
image. The second path is a decoder built from transposed convolutions
[53, 54]. Figure 9 shows the U-Net.
Fig. 9 Architecture of U-Net

4.11 Recurrent Neural Networks


Figure 10 shows the RNN's fundamental structure. In [55], various RNN
design variations are described. The recurrent neural network includes
numerous functional blocks, as seen in Fig. 10. Recurrent neural networks
are susceptible to the vanishing gradient problem. They require memory
because they use prior states as input to determine their present state.
They operate on sequential data, and the connections among nodes form a
directed graph. RNNs are used to convert input sequences into fixed-size
vectors. Using an RNN in combination with a convolutional layer extends
the effective pixel neighborhood. RNNs are used in machine translation,
time series prediction, and NLP. An example of an RNN is the long
short-term memory (LSTM) network [56].

Fig. 10 Architecture of recurrent neural networks
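
The way an RNN reuses its previous state and turns a sequence of any length into a fixed-size vector can be sketched as follows. The update rule h_t = tanh(W_x x_t + W_h h_(t-1) + b) is the basic (Elman-style) recurrence, and the weights and sequences are hypothetical values chosen for illustration.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t


def encode_sequence(sequence, W_x, W_h, b):
    """Run the recurrence over the sequence and return the final hidden state."""
    h = [0.0] * len(b)
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


# Hypothetical weights: 3 input features, 2 hidden units.
W_x = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_h = [[0.1, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]

h_short = encode_sequence(short_seq, W_x, W_h, b)
h_long = encode_sequence(long_seq, W_x, W_h, b)
```

Both sequences end up as a vector of length 2, regardless of how many time steps they contain. The repeated multiplication by W_h inside the loop is also where the vanishing gradient problem originates during training.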

4.12 Autoencoders
The autoencoder is a powerful unsupervised learning architecture with
three components: encoder, code, and decoder. The encoder compresses the
data into a more compact representation. The compressed image is therefore
a distorted version of the input. The code represents the compressed input.
The code layer, which sits between the encoder and the decoder, is also
referred to as the bottleneck. Figure 11 shows the construction of the
autoencoder. The decoder converts the code into a replica of the initial
input. Autoencoders are lossy and data-specific. Four hyperparameters,
namely the code size, layer count, nodes per layer, and loss function,
need to be set before training the architecture. The application areas of
the autoencoder include dimension reduction, image compression, image
denoising, and feature extraction [57, 58].
Fig. 11 Architecture of autoencoders
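
The encoder, code, and decoder roles, and the lossy nature of the reconstruction, can be shown with a toy example. The encoder and decoder below use fixed, hand-picked rules rather than learned weights; they are assumptions made purely to keep the sketch short, and a real autoencoder would learn both mappings by minimizing the loss.

```python
def encode(x):
    """Hypothetical encoder: average each adjacent pair, halving the dimension."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def decode(code):
    """Hypothetical decoder: repeat each code value twice."""
    return [c for c in code for _ in range(2)]


def mse(a, b):
    """Mean squared error, used here as the reconstruction loss."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


x = [1.0, 3.0, 5.0, 5.0]          # input with 4 values
code = encode(x)                  # bottleneck with 2 values (the code size)
reconstruction = decode(code)     # replica of the input, not exact
loss = mse(x, reconstruction)

print(code, reconstruction, loss)
```

The first pair (1, 3) cannot be recovered from its average, so the reconstruction loss is non-zero, while the second pair (5, 5) is recovered exactly. This mirrors the lossy, data-specific behavior noted above.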

4.13 Deep Belief Networks


A deep belief network consists of a restricted Boltzmann machine (RBM) for
the pre-training phase and a feed-forward network for the fine-tuning
phase. The feed-forward network receives the features that the RBM has
extracted from the input data vectors. Deep belief networks use
backpropagation with a slower learning rate and have several hidden layers.
The deep belief network's primary advantage is its capacity to learn
features layer by layer, with higher-level features learned from the
earlier layers [59, 60]. Figure 12 shows the deep belief network
architecture.

Fig. 12 Architecture of deep belief networks

5 Steps for Diagnosis of Cancer by Medical Imaging


Medical imaging techniques such as MRI, CT, and ultrasound are used to
evaluate the healthy function of anatomical organs and to analyze diseases
[61]. Cancer diagnosis and therapy planning depend crucially on medical
imaging modalities. Preprocessing, often known as filtering, is the initial
step in the processing of medical images. The goal of filtering is either
to eliminate image noise introduced during acquisition or to enhance image
quality so that details are more accurate [62]. The term "segmentation"
describes the method of identifying the region of interest (ROI); in the
context of medical images, the ROI corresponds to anatomical organs or any
abnormalities associated with them, such as tumors or cysts. To grade
cancer severity, the classification step typically uses an ML algorithm.
Compression is the process of using machine-assisted techniques to make
files smaller so that they can be stored and transferred more easily. The
table shows the machine learning methods that can be used in each stage of
cancer diagnosis [63].
When assessing an ailment, professionals depend heavily on their
first-hand observations, abilities, and experience. A doctor can never be
completely certain that an assessment of the condition is entirely right,
and errors inevitably occur. This motivates reliance on AI-powered
automated systems, because artificial intelligence (AI) can evaluate
enormous volumes of information, handle complex problems, and predict
accurately. Deep neural networks, one of the most modern methods for AI
systems, comprise a number of computational models that are useful for
extracting information from digital images. DL algorithms are used in
several medical fields [4, 16].
The steps of cancer diagnosis are as follows.

5.1 Cleaning and Pre-processing


The initial stage in the identification process is pre-processing, since
the raw images contain noise. Pre-processing improves the quality of an
image for subsequent use by eliminating unwanted image data, known as image
noise. If this issue is not resolved, misclassification may occur. It is
therefore crucial to clean the images properly and convert them into
standard forms in order to achieve high accuracy [3].
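
As one possible illustration of cleaning an image and converting it to a standard form, the sketch below applies a 3 × 3 median filter to suppress an isolated noise spike and then rescales intensities to the range [0, 1]. The choice of a median filter and min-max normalization is an assumption made for the example; the chapter does not prescribe specific filters.

```python
from statistics import median


def median_filter3x3(image):
    """Replace each pixel by the median of its 3x3 neighborhood (clipped at the border)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [
                image[i][j]
                for i in range(max(0, r - 1), min(h, r + 2))
                for j in range(max(0, c - 1), min(w, c + 2))
            ]
            row.append(median(window))
        out.append(row)
    return out


def min_max_normalize(image):
    """Rescale intensities to [0, 1]; a constant image maps to all zeros."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


# Hypothetical 5x5 image: uniform background with one noisy pixel.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255

denoised = median_filter3x3(noisy)
normalized = min_max_normalize([[0, 50], [100, 200]])
```

The spike at the center is removed because eight of the nine values in its neighborhood are background pixels, so the median ignores the outlier.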

5.2 Image Segmentation


Image segmentation refers to dividing an image into different regions.
Segmentation approaches are grouped into pixel- and region-based,
model-based, and threshold-based methods. There are also histogram
thresholding, adaptive thresholding, and edge detection approaches. These
strategies can also be used in combination [3, 64, 65].
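
A threshold-based segmentation can be sketched with an iterative intermeans (isodata-style) threshold, which repeatedly places the cutoff halfway between the mean intensities of the two groups it separates. This particular algorithm is chosen here only as a simple example of the threshold-based family; the image values and function names are hypothetical.

```python
def intermeans_threshold(pixels, tol=0.5, max_iter=100):
    """Iteratively set the threshold to the midpoint of the two class means."""
    t = sum(pixels) / len(pixels)
    for _ in range(max_iter):
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            break
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
    return t


def segment(image, threshold):
    """Binary mask: 1 for pixels brighter than the threshold, 0 otherwise."""
    return [[1 if v > threshold else 0 for v in row] for row in image]


# Hypothetical 4x4 image with a bright region on a dark background.
image = [
    [10, 12, 11, 200],
    [9, 210, 205, 198],
    [11, 10, 202, 12],
    [10, 11, 9, 10],
]

threshold = intermeans_threshold([v for row in image for v in row])
mask = segment(image, threshold)
```

The resulting mask marks the five bright pixels as the region of interest. On real medical images, the same idea is usually combined with the other approaches listed above.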

5.3 Post Processing


After image segmentation, post-processing operations such as opening and
closing, island removal, region merging, border expansion, and smoothing
are performed [3].
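
The opening and closing operations mentioned above can be sketched on a binary mask with a 3 × 3 structuring element. This is a simplified illustration: pixels outside the image are treated as background, which is an assumption of this sketch, and the masks are hypothetical.

```python
def _window(mask, r, c):
    """Values in the 3x3 window around (r, c); outside the image counts as 0."""
    h, w = len(mask), len(mask[0])
    return [
        mask[i][j] if 0 <= i < h and 0 <= j < w else 0
        for i in range(r - 1, r + 2)
        for j in range(c - 1, c + 2)
    ]


def erode(mask):
    """Keep a pixel only if its whole 3x3 window is foreground."""
    return [[int(all(_window(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 window is foreground."""
    return [[int(any(_window(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation: removes small islands."""
    return dilate(erode(mask))


def closing(mask):
    """Dilation followed by erosion: fills small holes."""
    return erode(dilate(mask))


# Hypothetical 7x7 mask: a 3x3 region plus one isolated pixel (an island).
with_island = [[0] * 7 for _ in range(7)]
for r in range(1, 4):
    for c in range(1, 4):
        with_island[r][c] = 1
with_island[5][5] = 1
opened = opening(with_island)

# Hypothetical 7x7 mask: a 5x5 region with a one-pixel hole in the middle.
with_hole = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)]
             for r in range(7)]
with_hole[3][3] = 0
closed = closing(with_hole)
```

Opening removes the isolated pixel while keeping the 3 × 3 region intact, and closing fills the hole in the 5 × 5 region.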

6 Diagnosis of Different Types of Cancers Using DL


Table 1 shows DL architectures for various cancer diagnoses. Neural
network architectures have been extremely useful in disease detection and
have also contributed to research on cancers that affect different organs.
In [66], a convolutional sparse autoencoder was found to be appropriate for
all categories of 3-dimensional datasets. In [67], lesions were identified
and the cancer stage was diagnosed using a CNN combined with handcrafted
features. In another work [68], GoogLeNet was determined to be more
successful, with an accuracy of 85%, as compared to AlexNet, with an
accuracy of 82%, and the VGGNet, with an accuracy of about 84%. When
compared to the conventional predictor based on texture analysis, the model
that combined a pre-trained SVM and CNN was more successful at categorizing
tumor tissues in digital mammograms [69].

Table 1 Deep learning architectures for cancer diagnoses


Reference | Cancer type(s) | Type of data/imaging | DL architecture used | Performance metrics
[70] | Breast | Gene expression data | Multi-omics NN | Enhanced performance with more omics data
[71] | Breast | The Cancer Genome Atlas | Random forest, NN | Log-rank p < 0.05
[72] | Breast | Pathology | Convolutional neural network (CNN) | Sensitivity: 73%
[73] | Breast | Pathology | ConvNet | Positive predictive value: 71.6%
[74] | Breast | Ultrasound | AlexNet (CNN) | FPs/image: 0.16, TPF: 0.98, F-measure: 0.91
[75] | Liver | Computed tomography (CT) scan/3D | Deep neural network | Accuracy: 99.4%
[76] | Liver | Computed tomography scan | Back propagation neural network | Accuracy: 73.2%
[77] | Liver | Computed tomography scan | Convolutional neural network (CNN) | Precision: 82.67%, Dice: 80.06%, Recall: 84.34%
[78] | Lung | Computed tomography scan | Deep neural network (DNN) | Sensitivity: 78.2%, Accuracy: 82.1%, Specificity: 86.13%
[79] | Lung | Computed tomography scan | Deep neural network (DNN) | Sensitivity: 78.9%
[80] | Lung | Computed tomography scan | ResNet | Sensitivity: 0.54
[81] | Skin | Standard images from camera | Deep convolutional neural networks (DCNN) | Accuracy: 98.55, Sensitivity: 95%
[82] | Skin | Dermoscopy images | CNN with rectified linear unit (ReLU) activation | Accuracy: 86.67%
[83] | Colon | Histopathology image | Shallow neural network | Accuracy: 84%
[84] | Astrocytic tumor | Microarray gene dataset | Artificial neural network (ANN) | Accuracy: 96.15%
[85] | Prostate | Multiparametric magnetic resonance imaging (mpMRI)/3D | XmasNet (CNN) | AUC: 0.84
[86] | Prostate | Multiparametric magnetic resonance imaging (mpMRI) | Deep convolutional neural networks (DCNN) | AUC: 0.897
[87] | Brain | Magnetic resonance imaging (MRI) | Input cascade convolutional neural network | Sensitivity: 0.84, Specificity: 0.9
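
The performance metrics reported in Table 1 are all derived from the counts of true and false positives and negatives. The sketch below computes them from a hypothetical set of counts; the numbers are invented for illustration and do not correspond to any study in the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common diagnostic metrics computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # also called recall or TPF
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),              # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),      # equals the F-measure (F1)
    }


# Hypothetical counts: 100 diseased and 100 healthy cases.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because each study in the table reports a different subset of these metrics, results from different rows are not directly comparable without access to the underlying counts.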

The researchers in [70] used a DL method to study breast cancer patients.
They used a Cox prediction model and genomic datasets to make predictions.
They show that performance improves when more data are available and are
used to integrate and simplify biomarkers and gene regulation for
prediction. Shimizu and Nakayama [71] used the TCGA database to identify
breast cancer genes and perform prognostic prediction. They employed AI to
identify 184 genes, after which they applied ML algorithms such as a random
forest classifier along with DL networks. Furthermore, they employed a
prognostic genetic score that used just 23 of the 184 identified genes.
Liu et al. [72] proposed a CNN model that is capable of identifying tiny
cancerous tumors in gigapixel pathology slides. The system proposed by
Cruz-Roa et al. [73] identifies invasive lesions in whole-slide images
while reducing manual effort and time. On breast ultrasound lesion images,
CNN approaches such as LeNet, U-Net, transfer learning, and AlexNet were
thoroughly analyzed, and AlexNet and patch-based LeNet were found to be the
most accurate architectures [74].
Before DNN-based tumor identification, the ROI was extracted using the
watershed and Gaussian mixture model (GMM) algorithms in Das et al. [75].
For the segmentation of liver tumors, the FCN structure U-Net was proposed,
with subsequent processing via 3D connected component labeling in order to
get better segmentation results [76]. CNNs were shown to be more accurate
classifiers than classical machine learning algorithms [77]. The DNN was
shown to be effective for segmenting cancerous growths, and it is also
appropriate for segmenting tiny lung nodules. DNN performance improves as
the amount of training data increases [78]. The convolutional neural
network suggested in Golan et al. [79] is divided into two stages: the
first gathers spatial characteristics, while the second performs
classification. The DL structure was used with an SVM classifier to
identify lung nodules; the rule-based method reduced false positives. The
new ResNet design outperforms the traditional ResNet structure for lesion
segmentation [80].
Additionally, a CNN was employed to detect melanoma in conventional camera
images [81]. A CNN has been proposed for detecting skin lesion borders in
dermoscopy images [82]. A shallow network is used to analyze
multi-dimensional gene data in order to diagnose cancerous cell growth in
histological images of the colon [83]. Petalidis et al. [84] published
genomic data for astrocytic tumors. To address the need for accurate
classification of these cancers, they used a neural network technique to
merge characteristics from their histological subtypes. They were able to
identify 59 genes in this research. They classified these variants on
custom and independent data with an accuracy of 96.15%.
Prostate cancer was identified in MRI images using XmasNet, a CNN-based
algorithm [85]. A deep CNN reached an AUC of 0.897 [86]. On the BRATS
dataset, brain tumors are segmented using a deep CNN that achieves good
performance through a cascaded design [87].

7 Conclusions
DL has demonstrated its effectiveness in feature extraction, which has
improved cancer prognosis and prediction. Because of these strengths, DL
models have changed cancer diagnosis and prediction, and DL architectures
are widely used in cancer cell segmentation and classification. Data
augmentation has been critical for improving system performance in cancer
diagnosis and prediction tasks. DL solutions are evaluated and verified on
criteria such as reproducibility and generalizability in the treatment of
cancer. These techniques have helped in the early detection of cancer and
contributed to patient recovery or longer survival.
DL-based technological innovation has started to benefit the local and
national medical sectors. Consequently, it is advantageous to use DL
technology in cancer diagnostics and general medicine in order to gain
further theoretical understanding. Researchers studying ML algorithms for
diagnosing diseases, as well as experts in treatment planning and delivery,
stand to gain from the conclusions of this work.

References
1. Grisold, W., Soffietti, R., Oberndorfer, S., & Cavaletti, G. (Eds.).
(2021). Effects of cancer treatment on the nervous system.
2. Tang, J., Rangayyan, R. M., Xu, J., El Naqa, I., & Yang, Y. (2009). Computer-
aided detection and diagnosis of breast cancer with mammography: Recent
advances. IEEE Transactions on Information Technology in Biomedicine, 13(2),
236–251.

3. Munir, K., Elahi, H., Ayub, A., Frezza, F., & Rizzi, A. (2019). Cancer diagnosis
using deep learning: A bibliographic review. Cancers, 11(9), 1235.

4. Huang, S., Yang, J., Fong, S., & Zhao, Q. (2020). Artificial intelligence in cancer
diagnosis and prognosis: Opportunities and challenges. Cancer letters, 471, 61–
71.

5. Cancer Facts and Figures. (2019). American Cancer Society.
https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annualcancerfacts-andfigures/2019/cancer-facts-and-figures-2019.pdf

6. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the
lens of CNN. In S. S. Roy & Y. H. Taguchi (Eds.), Handbook of machine learning
applications for genomics (Chapter 5). Studies in Big Data.
ISBN: 978-981-16-9157-7.

7. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and
applications. Cengage Learning Publishers, New Delhi. ASIN: 8131526194.
ISBN-10: 9788131526194.

8. Rungta, R. K., Jaiswal, P., & Tripathy, B. K. (2022). A deep learning based
approach to measure confidence for virtual interviews. In A. K. Das et al. (Eds.),
Proceedings of the 4th International Conference on Computational Intelligence in
Pattern Recognition (CIPR), CIPR 2022 (pp. 278–291). LNNS 480.

9. Bhandari, A., Tripathy, B. K., Jawad, K., Bhatia, S., Rahmani, M. K. I., & Mash,
A. (2022). Cancer detection and prediction using genetic algorithms. Comput
Intell Neurosci 2022, 18. https://doi.org/10.1155/2022/1871841

10. Allahyar, A., Ubels, J., & de Ridder, J. (2019). A data-driven interactome of
synergistic genes improves network-based cancer outcome prediction. PLoS
Computational Biology, 15(2), e1006657.

11. Adate, A., Tripathy, B. K., Arya, D., & Shaha, A. (2020). Impact of deep neural
learning on artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian,
S. Saha, & B. K. Tripathy (Eds.), Deep learning research and applications
(pp. 69–84). De Gruyter Publications. https://doi.org/10.1515/9783110670905-004

12. Mitchell, M. J., Jain, R. K., & Langer, R. (2017). Engineering and physical
sciences in oncology: Challenges and opportunities. Nature Reviews Cancer,
17(11), 659–675.

13. Obermeyer, Z., & Emanuel, E. J. (2016). Predicting the future—big data, machine
learning, and clinical medicine. The New England Journal of Medicine, 375(13),
1216.

14. Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep
recurrent neural networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing (pp. 6645–6649). IEEE.

15. Bhattacharyya, D. S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K.
(2020). Deep learning research with engineering applications. De Gruyter
Publications. ISBN: 3110670909, 9783110670905. https://doi.org/10.1515/
9783110670905

16. Bose, A., & Tripathy, B. K. (2020) Deep learning for audio signal classification.
In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep
learning research and applications (pp. 105–136). De Gruyter Publications. https://
doi.org/10.1515/9783110670905-00660

17. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep
learning. In Encyclopedia of information science and technology (5th edn, p. 11).
https://doi.org/10.4018/978-1-7998-3479-3.ch007

18. Yagna Sai Surya, K., Geetha Rani, T., & Tripathy, B. K. (2022). Social distance
monitoring and face mask detection using deep learning. In J. Nayak, H. Behera,
B. Naik, S. Vimal, & D. Pelusi (Eds.), Computational intelligence in data mining
(Vol. 281). Smart Innovation, Systems and Technologies. Springer, Singapore.
https://doi.org/10.1007/978-981-16-9447-9_36

19. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International Conference on
Machine Learning (pp. 448–456). PMLR.
20. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely
connected convolutional networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (pp. 4700–4708).

21. Kyi, C. W., Birriel, P. C., Davidsen, T. M., Ferguson, M. L., Gesuwan, P., Griner,
N. B., Gerhard, D. S., et al. (2020). NCI office of cancer genomics supports
multidisciplinary genomics research initiatives to advance precision oncology.
Cancer Research, 80(16_Supplement), 5862–5862.

22. Pogorelov, K., Randel, K. R., Griwodz, C., Eskeland, S. L., de Lange, T.,
Johansen, D., Halvorsen, P., et al. (2017). Kvasir: A multi-class image dataset for
computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM
on Multimedia Systems Conference (pp. 164–169).

23. Mesri, M., An, E., Hiltke, T., Robles, A. I., Rodriguez, H., & CPTAC
Investigators. (2022). NCI’s clinical proteomic tumor analysis consortium: A
proteogenomic cancer analysis program. Cancer Research, 82(12_Supplement),
6331–6331.

24. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene
characteristics and their applications using deep learning, (Chapter 4). In S. S.
Roy, & Y. H. Taguchi (Eds.), Handbook of Machine Learning Applications for
Genomics (Vol. 103). Studies in Big Data. ISBN: 978-981-16-9157-7.

25. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural
Computation, 9(8), 1735–1780.

26. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020).
Convolutional neural networks: A bottom-up approach. In S. Bhattacharyya, A. E.
Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research with
Engineering Applications (pp. 21–50). De Gruyter Publications. https://doi.org/10.
1515/9783110670905-002

27. Tripathy, B. K., & Deepthi, P. H. (2015). Application of spatial FCM in detecting
cancer cells. IIMT Research Network (pp. 1–6, 96–100). ISBN 878-93-82208-77-
8.

28. Zhong, Z., Sun, L., & Huo, Q. (2019). An anchor-free region proposal network
for Faster R-CNN-based text detection approaches. International Journal on
Document Analysis and Recognition (IJDAR), 22(3), 315–327.

29. Hanefi Calp, M. (2021). Use of deep learning approaches in cancer diagnosis. In
Deep Learning for Cancer Diagnosis (pp. 249–267). Springer, Singapore.

30. Karahan, Ş., & Akgül, Y. S. (2016). Eye detection by using deep learning. In 2016
24th Signal Processing and Communication Application Conference (SIU) (pp.
2145–2148). IEEE.

31. Özkan, İN. İK., & Ülker, E. (2017). Derin öğrenme ve görüntü analizinde
kullanılan derin öğrenme modelleri. Gaziosmanpaşa Bilimsel Araştırma Dergisi,
6(3), 85–104.

32. Şeker, A., Diri, B., & Balık, H. H. (2017). Derin öğrenme yöntemleri ve
uygulamaları hakkında bir inceleme. Gazi Mühendislik Bilimleri Dergisi, 3(3),
47–64.

33. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends®
in Machine Learning, 2(1), 1–127.

34. Tripathy, B. K., Raju, H., & Kaul, D. (2018). Deep learning in health care,
accepted in deep learning for remote sensing and GIS: Frontier advancements and
applications. In V. Santhi (Eds.) CRC publications

35. Ravì, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., &
Yang, G. Z. (2016). Deep learning for health informatics. IEEE Journal of
Biomedical and Health Informatics, 21(1), 4–21.

36. Küçük, D., & Arici, N. (2018). Doğal Dil İşlemede Derin Öğrenme Uygulamalari
Üzerine Bir Literatür Çalişmasi. Uluslararası Yönetim Bilişim Sistemleri ve
Bilgisayar Bilimleri Dergisi, 2(2), 76–86.

37. Ohmori, M., Ishihara, R., Aoyama, K., Nakagawa, K., Iwagami, H., Matsuura, N.,
& Tada, T., et al. (2020). Endoscopic detection and differentiation of esophageal
lesions using a deep neural network. Gastrointestinal Endoscopy, 91(2), 301–309.

38. Schwyzer, M., Ferraro, D. A., Muehlematter, U. J., Curioni-Fontecedro, A.,
Huellner, M. W., Von Schulthess, G. K., Kaufmann, P. A., Burger, I. A., &
Messerli, M. (2018). Automated detection of lung cancer at ultralow dose PET/CT
by deep neural networks–initial results. Lung Cancer, 126, 170–173.

39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich,
A., et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 1–9).

40. Sihare, P., Ullah Khan, A., Bardhan, P., & Tripathy, B. K. (2022). COVID-19
detection using deep learning: A comparative study of segmentation algorithms.
In A. K. Das et al. (Eds.), Proceedings of the 4th International Conference on
Computational Intelligence in Pattern Recognition (CIPR) (pp. 1–10), CIPR
2022, LNNS 480.

41. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based
fully convolutional networks. Advances in Neural Information Processing
Systems, 29.

42. Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised
learning using graphics processors. In Proceedings of the 26th Annual
International Conference on Machine Learning (pp. 873–880).

43. Tripathy, B. K., Dash, S., & Patro, B. N. (2012). Study of classification accuracy
of microarray data for cancer classification using multivariate and hybrid feature
selection method. IOSR Journal of Engineering (IOSRJEN), 2(8), 112–119 ISSN:
2250-302.

44. Adate, A., & Tripathy, B. K. (2017). Understanding single image super-resolution
techniques with generative adversarial networks. In J. Bansal, K. Das, A. Nagar,
K. Deep, & A. Ojha (Eds.), Soft computing for problem solving (Vol. 816,
pp. 833–840). Advances in Intelligent Systems and Computing. Springer.

45. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of
deep neural network architectures and their applications. Neurocomputing, 234,
11–26.

46. Mustafa, H. T., Yang, J., & Zareapoor, M. (2019). Multi-scale convolutional
neural network for multi-focus image fusion. Image and Vision Computing, 85,
26–35.

47. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

48. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare. In
D. P. Acharjya, A. Mitra, & N. Zaman (Eds.), Deep learning in data analytics:
Recent techniques, practices and applications (Vol. 91, pp. 97–115). Studies in
Big Data. Springer, Cham. https://doi.org/10.1007/978-3-030-75855-4_6

49. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for
large-scale image recognition. Preprint retrieved from arXiv:1409.1556.

50. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Fei-Fei, L., et
al. (2015). Imagenet large scale visual recognition challenge. International
Journal of Computer Vision, 115(3), 211–252.

51. Tripathy, B. K., Garg, N., & Nikhitha, P. (2014). Image retrieval using latent
feature learning by deep architecture. In Proceedings of the IEEE ICCIC2014 (pp.
663–666)

52. Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing
residual architectures. Preprint retrieved from arXiv:1603.08029.

53. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C.: Brain MRI segmentation
techniques based on CNN and its variants (Chapter-10). In J. Chaki (Ed.), Brain
tumor MRI image segmentation using deep learning techniques (pp.161–182.).
Elsevier publications. https://doi.org/10.1016/B978-0-323-91171-9.00001-6

54. Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016).
3D U-Net: learning dense volumetric segmentation from sparse annotation. In
International Conference on Medical Image Computing and Computer-Assisted
Intervention (pp. 424–432). Springer, Cham.

55. Baktha, K., & Tripathy, B. K. (2017). Investigation of recurrent neural networks in
the field of sentiment analysis. In International Conference on Communication
and Signal Processing (ICCSP), (pp. 2047–2050). https://doi.org/10.1109/ICCSP.
2017.8286763

56. Adate, A., & Tripathy, B. K. (2019). S-LSTM-GAN: Shared recurrent neural
networks with adversarial training. In A. Kulkarni, S. Satapathy, T. Kang, A.
Kashan (Eds.), Proceedings of the 2nd International Conference on Data
Engineering and Communication Technology (Vol. 828, pp. 107–115). Advances
in Intelligent Systems and Computing. Springer, Singapore.

57. Loey, M., El-Sawy, A., & El-Bakry, H. (2017). Deep learning autoencoder
approach for handwritten arabic digits recognition. Preprint retrieved from
arXiv:1706.06720.

58. Thomas, S. A., Race, A. M., Steven, R. T., Gilmore, I. S., & Bunch, J. (2016).
Dimensionality reduction of mass spectrometry imaging data using autoencoders.
In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1–7).
IEEE.

59. Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep
belief networks and introducing a new object oriented toolbox (DeeBNet).
Preprint retrieved from arXiv:1408.3264.

60. Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5), 5947.

61. Jeong, J. (2017). Deep learning for cancer screening in medical imaging. Hanyang
Medical Reviews, 37(2), 71–76.

62. Pereira, G. C., Traughber, M., & Muzic, R. F. (2014). The role of imaging in
radiation therapy planning: past, present, and future. BioMed Research
International.

63. Adate, A., & Tripathy, B. K. (2018) Deep learning techniques for image
processing. In S. Bhattacharyya, H. Bhaumik, A. Mukherjee, & S. De (Eds.),
Machine learning for big data analysis (pp. 69–90). De Gruyter, Berlin, Boston.
https://doi.org/10.1515/9783110551433-00357

64. Jain, S., Singhania, U., Tripathy, B., Nasr, E. A., Aboudaif, M. K., & Kamrani, A.
K. (2021). Deep learning-based transfer learning for classification of skin
cancer. Sensors (Basel), 21(23), 8142. https://doi.org/10.3390/s21238142

65. Tong, N., Lu, H., Ruan, X., & Yang, M. H. (2015). Salient object detection via
bootstrap learning. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 1884–1892).

66. Kallenberg, M., Petersen, K., Nielsen, M., Ng, A. Y., Diao, P., Igel, C., Lillholm,
M., et al. (2016). Unsupervised deep learning applied to breast density
segmentation and mammographic risk scoring. IEEE Transactions on Medical
Imaging, 35(5), 1322–1331.
67. Wang, H., Roa, A. C., Basavanhally, A. N., Gilmore, H. L., Shih, N., Feldman,
M., Madabhushi, A., et al. (2014). Mitosis detection in breast cancer pathology
images by combining handcrafted and convolutional neural network features.
Journal of Medical Imaging, 1(3), 034003.

68. Ertosun, M. G., & Rubin, D. L. (2015). Probabilistic visual search for masses
within mammography images using deep learning. In 2015 IEEE International
Conference on Bioinformatics and Biomedicine (BIBM) (pp. 1310–1315). IEEE.

69. Turkki, R., Linder, N., Kovanen, P. E., Pellinen, T., & Lundin, J. (2016).
Antibody-supervised deep learning for quantification of tumor-infiltrating
immune cells in hematoxylin and eosin stained breast cancer samples. Journal of
Pathology Informatics, 7(1), 38.

70. Huang, Z., Zhan, X., Xiang, S., Johnson, T. S., Helm, B., Yu, C. Y., Huang, K., et
al. (2019). SALMON: Survival analysis learning with multi-omics neural
networks on breast cancer. Frontiers in Genetics, 10, 166.

71. Shimizu, H., & Nakayama, K. I. (2019). A 23 gene–based molecular prognostic
score precisely predicts overall survival of breast cancer patients. eBioMedicine,
46, 150–159.

72. Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A.,
Stumpe, M. C., et al. (2017). Detecting cancer metastases on gigapixel pathology
images. Preprint retrieved from arXiv:1703.02442.

73. Cruz-Roa, A., Gilmore, H., Basavanhally, A., Feldman, M., Ganesan, S., Shih, N.
N., Tomaszewski, J., González, F. A., & Madabhushi, A. (2017). Accurate and
reproducible invasive breast cancer detection in whole-slide images: A deep
learning approach for quantifying tumor extent. Scientific Reports, 7(1), 1–14.

74. Yap, M. H., Pons, G., Marti, J., Ganau, S., Sentis, M., Zwiggelaar, R., Davison, A.
K., & Marti, R. (2017). Automated breast ultrasound lesions detection using
convolutional neural networks. IEEE Journal of Biomedical and Health
Informatics, 22(4), 1218–1226.

75. Das, A., Acharya, U. R., Panda, S. S., & Sabut, S. (2019). Deep learning based
liver cancer detection using watershed transform and Gaussian mixture model
techniques. Cognitive Systems Research, 54, 165–175.

76. Devi, P., & Dabas, P. (2015). Liver tumor detection using artificial neural
networks for medical images. International Journal of Innovative Research
Science Technology, 2(3), 34–38.

77. Li, W. (2015). Automatic segmentation of liver tumor in CT images with deep
convolutional neural networks. Journal of Computer and Communications, 3(11),
146.

78. Gruetzemacher, R., & Gupta, A. (2016). Using deep learning for pulmonary
nodule detection & diagnosis.

79. Golan, R., Jacob, C., & Denzinger, J. (2016). Lung nodule detection in CT images
using deep convolutional neural networks. In 2016 International Joint Conference
on Neural Networks (IJCNN) (pp. 243–250). IEEE.

80. Kuan, K., Ravaut, M., Manek, G., Chen, H., Lin, J., Nazir, B., Chen, C., Howe, T.
C., Zeng, Z., & Chandrasekhar, V. (2017). Deep learning for lung cancer
detection: Tackling the Kaggle Data Science Bowl 2017 challenge. Preprint retrieved
from arXiv:1705.09435.

81. Jafari, M. H., Karimi, N., Nasr-Esfahani, E., Samavi, S., Soroushmehr, S. M. R.,
Ward, K., & Najarian, K. (2016). Skin lesion segmentation in clinical images
using deep learning. In 2016 23rd International Conference on Pattern
Recognition (ICPR) (pp. 337–342). IEEE.

82. Sabouri, P., & GholamHosseini, H. (2016). Lesion border detection using deep
learning. In 2016 IEEE Congress on Evolutionary Computation (CEC) (pp. 1416–
1421). IEEE.

83. Chen, H., Zhao, H., Shen, J., Zhou, R., & Zhou, Q. (2015). Supervised machine
learning model for high dimensional gene data in colon cancer detection. In 2015
IEEE International Congress on Big Data (pp. 134–141). IEEE.

84. Petalidis, L. P., Oulas, A., Backlund, M., Wayland, M. T., Liu, L., Plant, K.,
Happerfield, L., Freeman, T. C., Poirazi, P., & Collins, V. P. (2008). Improved
grading and survival prediction of human astrocytic brain tumors by artificial
neural network analysis of gene expression microarray data. Molecular Cancer
Therapeutics, 7(5), 1013–1024.

85. Liu, S., Zheng, H., Feng, Y., & Li, W. (2017). Prostate cancer diagnosis using
deep learning with 3D multiparametric MRI. In Medical Imaging 2017:
Computer-Aided Diagnosis (Vol. 10134, pp. 581–584). SPIE.

86. Tsehay, Y. K., Lay, N. S., Roth, H. R., Wang, X., Kwak, J. T., Turkbey, B. I.,
Pinto, P. A., Wood, B. J., & Summers, R. M. (2017). Convolutional neural
network based deep-learning architecture for prostate cancer detection on
multiparametric magnetic resonance images. In Medical Imaging 2017:
Computer-Aided Diagnosis (Vol. 10134, pp. 20–30). SPIE.

87. Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C.,
Jodoin, P. M., & Larochelle, H. (2017). Brain tumor segmentation with deep
neural networks. Medical Image Analysis, 35, 18–31.
