
Appearance based Pedestrian Analysis using

Machine Learning

By
Muhammad Fayyaz
CIIT/SP15-PCS-001/WAH

PhD Thesis
In
Computer Science

COMSATS University Islamabad


Wah Campus - Pakistan
Spring, 2020
COMSATS University Islamabad

Appearance based Pedestrian Analysis using


Machine Learning

A Thesis Presented to

COMSATS University Islamabad, Wah Campus

In partial fulfillment
of the requirement for the degree of

PhD (Computer Science)

By

Muhammad Fayyaz
CIIT/SP15-PCS-001/WAH

Spring, 2020

Appearance based Pedestrian Analysis using
Machine Learning

A Post Graduate Thesis submitted to the Department of Computer Science in
partial fulfillment of the requirement for the award of the degree of Ph.D. in
Computer Science.

Name                Registration Number

Muhammad Fayyaz     CIIT/SP15-PCS-001/WAH

Supervisor

Dr. Mussarat Yasmin


Assistant Professor
Department of Computer Science
COMSATS University Islamabad, Wah Campus

Co-Supervisor

Dr. Muhammad Sharif


Associate Professor
Department of Computer Science
COMSATS University Islamabad, Wah Campus

DEDICATION
To ALLAH Almighty and the Holy Prophet
Muhammad (P.B.U.H)
&
My Parents, Siblings, Loving Family, and Teachers

ABSTRACT
Appearance based Pedestrian Analysis using
Machine Learning
In the present era, the popularity of visual surveillance has opened new avenues for
researchers in visual content analysis, and automatic pedestrian analysis is one of them.
Pedestrian analysis uses machine learning techniques that are categorized as biometric
based and full-body appearance based. Full-body appearance based pedestrian analysis
is preferred because of its additional capabilities and fewer constraints; however, it is
not without challenges such as environmental effects and different camera settings.
These challenges are associated with different tasks such as detection, orientation
analysis, gender classification, re-identification (ReID), and action classification of
pedestrians. This thesis deals with the challenges of full-body appearance based
pedestrian analysis for two tasks: 1) pedestrian ReID and 2) pedestrian gender
classification. In this regard, three methods are proposed; one of them addresses
pedestrian ReID, and the other two address pedestrian gender classification.

In the first method, person ReID is performed using features-based clustering and deep
features (FCDF). Initially, three types of handcrafted features, namely shape, color,
and texture, are extracted from the input image for feature representation. To acquire
optimal features, feature fusion and selection (FFS) techniques are applied to these
handcrafted features. For gallery search optimization, features-based clustering is
utilized, which splits the whole gallery into k consensus clusters. To learn the
relationship between gallery features and the labels of the chosen clusters, a radial
basis kernel is utilized. Afterwards, images are selected cluster-wise and provided to a
deep convolutional neural network (DCNN) model to obtain deep features. A cluster-wise
feature vector is then obtained by fusing the deep and handcrafted features. This is
followed by a feature matching process in which a multi-class support vector machine
(SVM) is applied to choose the related cluster. Finally, to find the accurate matching
pair from the classified cluster(s), a cross-bin histogram based distance similarity
measure is used instead of a whole-gallery search. The proposed FCDF framework attains
rank-1 recognition rates of 46.82%, 48.12%, and 40.67% on the VIPeR, CUHK01, and
iLIDS-VID datasets, respectively.
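
For illustration only, the matching stage of FCDF described above can be sketched as
follows. This is a minimal Python approximation under assumed inputs (pre-computed
fused gallery and probe feature vectors); the function names, the value k = 5, and the
chi-square distance used as a stand-in for the cross-bin histogram measure are
assumptions, not the thesis implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_gallery_index(gallery_feats, k=5):
    # Split the gallery into k features-based clusters and train an RBF-kernel
    # SVM to predict the cluster label of an unseen feature vector.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gallery_feats)
    svm = SVC(kernel='rbf', gamma='scale').fit(gallery_feats, km.labels_)
    return km.labels_, svm

def chi2_distance(h1, h2, eps=1e-10):
    # Simple histogram distance standing in for the cross-bin similarity measure.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def match_probe(probe_feat, gallery_feats, cluster_labels, svm):
    # Classify the probe into a cluster, then rank only that cluster's images
    # instead of searching the whole gallery.
    c = svm.predict(probe_feat[None, :])[0]
    candidates = np.where(cluster_labels == c)[0]
    dists = [chi2_distance(probe_feat, gallery_feats[i]) for i in candidates]
    return candidates[np.argsort(dists)]  # gallery indices, best match first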

In the second method, a joint feature representation is used for gender prediction. In
this regard, histogram of oriented gradients (HOG) and local maximal occurrence
(LOMO) assisted low-level features are extracted to handle rotation, viewpoint, and
illumination variations in the images. VGG19 and ResNet101 based standard deep
convolutional neural network (CNN) architectures are applied simultaneously to acquire
deep features, which are robust against pose variations. To avoid ambiguous and
unnecessary feature representations, entropy-controlled features are chosen from both
the low-level and deep representations to reduce the dimensions of the computed
features. By merging the selected low-level features with the deep features, a robust
joint feature representation is obtained for gender prediction. The proposed joint
low-level and deep CNN feature representation (J-LDFR) method achieves an AUC of 96%
and an accuracy of 89.3% on the PETA dataset, and an AUC of 86% and an accuracy of 82%
on the MIT dataset with a cubic SVM classifier. The computed results suggest that
J-LDFR improves performance in comparison with using either feature representation
individually.
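
As a rough illustration of the selection-and-fusion idea in J-LDFR, the sketch below
scores each feature dimension by its Shannon entropy, keeps the top-scoring dimensions
of the low-level and deep representations, and concatenates them before a cubic SVM.
The scoring rule, the subset sizes (n_low, n_deep), and the function names are
assumptions for illustration, not the exact thesis procedure.

import numpy as np
from sklearn.svm import SVC

def entropy_scores(X, bins=32):
    # Shannon entropy of each feature dimension, used here as the selection score.
    scores = []
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts / max(counts.sum(), 1)
        p = p[p > 0]
        scores.append(-np.sum(p * np.log2(p)))
    return np.asarray(scores)

def select_top(X, n):
    # Keep the n highest-entropy dimensions (in practice the selected indices
    # would be fit on training data and reused for test data).
    idx = np.argsort(entropy_scores(X))[::-1][:n]
    return X[:, idx]

def joint_representation(X_low, X_deep, n_low=500, n_deep=1000):
    # Serial fusion of the selected low-level and deep features.
    return np.hstack([select_top(X_low, n_low), select_top(X_deep, n_deep)])

cubic_svm = SVC(kernel='poly', degree=3, gamma='scale')  # gender classifier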

In the third method, the imbalanced and small sample space (IB-SSS) dataset problem
is addressed for pedestrian gender classification using fusion of selected deep and
traditional features (PGC-FSDTF). Initially, data preparation is applied, which
consists of data augmentation and data preprocessing steps. This is followed by the
investigation of multiple low- and high-level feature extraction schemes (pyramid HOG
(PHOG), hue saturation value (HSV) histogram, and deep visual features of the
DenseNet201 and InceptionResNetV2 CNN architectures), feature selection (PCA and
entropy), and fusion (parallel and serial) strategies for more accurate gender
prediction on imbalanced, augmented balanced, and customized balanced datasets. The
method achieves an overall accuracy (O-ACC) of 92.2% and an AUC of 93% on the
imbalanced MIT-IB dataset, and 93.7% O-ACC and 98% AUC on the augmented balanced
MIT-BROS-3 dataset. Similarly, the framework shows improved performance with 89.7%
O-ACC and 96% AUC on the imbalanced PKU-Reid-IB dataset, and 92.2% O-ACC and 97% AUC
on the augmented balanced PKU-Reid-BROS-2 dataset. It also provides superior outcomes
on customized balanced datasets, with 88.8% O-ACC and 95% AUC on the PETA-SSS-1
dataset and 94.7% O-ACC and 95% AUC on the VIPeR-SSS dataset. The method also performs
well in cross-dataset evaluation, achieving 90.8% O-ACC and 95% AUC on the
cross-dataset and 90.4% O-ACC and 96% AUC on cross-dataset-1. The best results on the
applied datasets are achieved using the PCA based selected optimal feature subset and
a medium Gaussian SVM classifier. Hence, the results on different datasets confirm
that the selected feature combination effectively handles the imbalanced and SSS
issues for the PGC task.
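
The PCA-selection, serial-fusion, and classification stage described above could look
roughly like the following sketch. The feature inputs, the number of retained
components, and the MATLAB-style "medium Gaussian" kernel scale are assumptions for
illustration rather than the thesis configuration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def serial_fusion(*feature_blocks):
    # Serial fusion = concatenation of traditional and deep feature blocks.
    return np.hstack(feature_blocks)

def medium_gaussian_svm(n_features):
    # MATLAB's "medium Gaussian" SVM uses kernel scale sqrt(n_features); the
    # equivalent scikit-learn setting is gamma = 1 / n_features.
    return SVC(kernel='rbf', gamma=1.0 / n_features)

def train_pgc(phog, hsv_hist, deep_feats, labels, n_components=500):
    # Fuse PHOG, HSV-histogram, and deep CNN features, reduce them with PCA,
    # and train the gender classifier (labels: 0 = male, 1 = female).
    X = serial_fusion(phog, hsv_hist, deep_feats)
    model = make_pipeline(PCA(n_components=n_components),
                          medium_gaussian_svm(n_components))
    return model.fit(X, labels)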

The abovementioned proposed methods are unique in using the strategy of features-based
clustering to optimize gallery search for efficient person ReID, in considering a joint
feature representation with both large-scale and small-scale datasets for accurate
gender prediction, and in applying data augmentation and robust feature engineering to
handle IB-SSS dataset issues. The computed results show that the proposed methods
outperform recent and closely related state-of-the-art methods by significant margins
in terms of recognition rates at different ranks, O-ACC, and AUC.

TABLE OF CONTENTS
Chapter 1 Introduction ................................................................................. 1

1.1 Introduction to Visual Surveillance ........................................................... 2

1.1.1 Significance of Visual Surveillance .................................................... 3

1.1.2 Automatic Visual Content Analysis .................................................... 3

1.2 Automatic Pedestrian Analysis .................................................................. 5

1.2.1 Appearance based Pedestrian Analysis ............................................... 6

1.2.1.1 Pedestrian Re-identification .......................................................... 7

1.2.1.2 Pedestrian Gender Classification .................................................. 8

1.2.2 Applications of Appearance based Pedestrian Analysis ................... 10

1.3 Research Challenges ................................................................................ 10

1.3.1 Physical Variations of Individuals ..................................................... 10

1.3.2 Illumination Variations ...................................................................... 11

1.3.3 Viewpoint Changes ............................................................................ 11

1.3.4 Pose Changes ..................................................................................... 11

1.3.5 Body Deformations ............................................................................ 11

1.3.6 Occlusion ........................................................................................... 12

1.3.7 Different Camera Settings ................................................................. 12

1.4 Problem Statement ................................................................................... 13

1.5 Research Objectives ................................................................................. 14

1.6 Research Contributions ............................................................................ 14

1.7 List of Publications .................................................................................. 16

1.8 Thesis Organization ................................................................................. 16

Chapter 2 Literature Review ...................................................................... 18

2.1 Introduction .............................................................................................. 19

2.2 State-of-the-art Methods for Pedestrian Re-identification....................... 19

2.2.1 Appearance based Approaches for Pedestrian Re-identification ...... 20

2.2.2 Metric Learning based Approaches for Person Re-identification ..... 24

2.2.3 Deep Learning based Approaches for Person Re-identification ....... 27

2.3 State-of-the-art Methods for Pedestrian Gender Classification ............... 29

2.3.1 Face-based Approaches for Gender Classification............................ 30

2.3.2 Parts-based Approaches for Pedestrian Gender Classification ......... 31

2.3.3 Full-body Appearance based Approaches for Pedestrian Gender


Classification .................................................................................... 33

2.4 Discussion and Analysis .......................................................................... 40

2.5 Summary .................................................................................................. 41

Chapter 3 Proposed Methodologies ........................................................... 42

3.1 Introduction .............................................................................................. 43

3.2 Proposed Method for Person Re-identification with Features-based


Clustering and Deep Features (FCDF) ................................................... 44

3.2.1 Feature Representation ...................................................................... 46

3.2.1.1 Color Feature Extraction ............................................................. 46

3.2.1.2 Texture Feature Extraction .......................................................... 48

3.2.1.3 HOG Feature Extraction.............................................................. 50

3.2.1.4 Feature Fusion and Selection ...................................................... 50

3.2.2 Feature Clustering .............................................................................. 54

3.2.2.1 Features-based Cluster Formation ............................................... 54

3.2.2.2 Deep Feature Extraction .............................................................. 54

3.2.3 Feature Matching ............................................................................... 59

3.2.3.1 Cluster Selection followed by Probe Deep Features ................... 60

3.2.3.2 Similarity Measure ...................................................................... 60

3.3 Proposed Pedestrian Gender Classification Method of Joint Low-level and
Deep CNN Feature Representation (J-LDFR) ....................................... 61

3.3.1 Data Preprocessing ............................................................................ 62

3.3.2 Feature Representation ...................................................................... 62

3.3.2.1 Low-level Feature Extraction ...................................................... 63

3.3.2.2 Deep CNN Feature Extraction .................................................... 65

3.3.2.3 Pre-trained Deep CNN Models ................................................... 68

3.3.3 Features selection and Joint Feature Representation ......................... 68

3.3.4 Classification Methods ...................................................................... 72

3.4 Proposed Method for Pedestrian Gender Classification on Imbalanced and


Small Sample Datasets using Parallel and Serial Fusion of Selected Deep
and Traditional Features (PGC-FSDTF) ................................................ 74

3.4.1 Data Preparation ................................................................................ 74

3.4.1.1 Data Augmentation ..................................................................... 74

3.4.1.2 Data Preprocessing ...................................................................... 79

3.4.2 Traditional Feature Extraction ........................................................... 80

3.4.2.1 Pyramid HOG based Feature Extraction ..................................... 80

3.4.2.2 HSV Histogram based Feature Extraction .................................. 81

3.4.3 Deep Convolution Neural Networks ................................................. 83

3.4.3.1 Pre-trained Deep CNNs and Fine Tuning ................................... 84

3.4.3.2 Deep CNN Feature Extraction and Parallel Fusion .................... 87

3.4.4 Features Selection and Fusion ........................................................... 89

3.4.5 Classification Methods ...................................................................... 93

3.5 Summary .................................................................................................. 93

Chapter 4 Results and Discussions ............................................................ 95

4.1 Introduction .............................................................................................. 96

4.2 Proposed Method for Person Re-identification with Features-based
Clustering and Deep Features (FCDF) ................................................... 97

4.2.1 Performance Evaluation Protocols and Implementation Settings for


Person ReID ...................................................................................... 97

4.2.2 Pedestrian Analysis Datasets Presentation for Person Re-


identification ..................................................................................... 97

4.2.2.1 VIPeR Dataset Presentation ........................................................ 98

4.2.2.2 CUHK01 Dataset Presentation .................................................... 98

4.2.2.3 iLIDS-VID Dataset Presentation ................................................. 98

4.2.3 Results Evaluation ............................................................................. 99

4.2.3.1 Proposed FCDF Framework Results ........................................... 99

4.2.3.2 Performance on Selected Datasets and Comparison with Existing


Methods ..................................................................................... 100

4.3 Proposed Pedestrian Gender Classification Method of Joint Low-level and


Deep CNN Feature Representations (J-LDFR) .................................... 111

4.3.1 Evaluation Protocols and Implementation Settings for Pedestrian


Gender Classification ...................................................................... 111

4.3.2 Pedestrian Analysis Datasets Presentation for Pedestrian Gender


Classification .................................................................................. 112

4.3.2.1 PETA Dataset Presentation ....................................................... 113

4.3.2.2 MIT Dataset Presentation .......................................................... 113

4.3.3 Results Evaluation ........................................................................... 113

4.3.3.1 Performance of Low-level Feature Representation for Gender


Classification ............................................................................. 117

4.3.3.2 Performance of Deep Feature Representation for Gender


Classification ............................................................................. 121

4.3.3.3 Performance of Joint Feature Representation for Gender
Classification ............................................................................. 125

4.3.3.4 Comparison with State-of-the-art Methods ............................... 133

4.4 Proposed Method for Pedestrian Gender Classification on Imbalanced and


Small Sample Datasets using Parallel and Serial Fusion of Selected Deep
and Traditional Features (PGC-FSDTF) .............................................. 136

4.4.1 Performance Evaluation Protocols and Implementation Settings for


Pedestrian Gender Classification .................................................... 136

4.4.2 Pedestrian Analysis Datasets Presentation for Pedestrian Gender


Classification including Augmented Datasets ................................ 138

4.4.2.1 MIT and Augmented MIT Dataset Presentation ....................... 139

4.4.2.2 PKU-Reid Dataset Presentation ................................................ 140

4.4.2.3 PETA-SSS Dataset Presentation ............................................... 142

4.4.2.4 VIPeR-SSS dataset Presentation ............................................... 143

4.4.2.5 Cross-dataset Presentation......................................................... 144

4.4.3 Results Evaluation ........................................................................... 146

4.4.3.1 Performance Evaluation on MIT and Augmented MIT Datasets


................................................................................................... 147

4.4.3.2 Performance Evaluation on PKU-Reid and Augmented PKU-


Reid Datasets ............................................................................. 160

4.4.3.3 Performance Evaluation on PETA-SSS and VIPeR-SSS Datasets


................................................................................................... 173

4.4.3.4 Performance Evaluation on Cross-datasets ............................... 182

4.4.3.5 Comparison with Existing Methods .......................................... 189

4.5 Discussion .............................................................................................. 199

4.6 Summary ................................................................................................ 201

Chapter 5 Conclusion and Future Work ................................................ 202

5.1 Conclusion ............................................................................................. 203

5.2 Future Work ........................................................................................... 204

Chapter 6 References ................................................................................ 206

Appendix A..................................................................................................... 232

A.1 English Proof Reading Certificate ........................................................ 233

LIST OF FIGURES
Figure 1.1: Increasing trend of CCTV cameras in top ten countries a) country wise b)
per 100 people [1] .......................................................................................................... 2
Figure 1.2: Visual surveillance viewing modes, real-time and posteriori [3] ................ 4
Figure 1.3: Full-body appearance based pedestrian analysis using features-based
descriptors and deep learning models ............................................................................ 7
Figure 1.4: Full-body appearance based person ReID scenario .................................... 8
Figure 1.5: Full-body appearance based pedestrian gender classification scenario ...... 9
Figure 1.6: Sample images showing challenges in pedestrian analysis ....................... 13
Figure 1.7: Detail overview of thesis ........................................................................... 17
Figure 2.1: Organization of chapter 2 .......................................................................... 19
Figure 2.2: General model for person ReID ................................................................ 20
Figure 2.3: Pipeline for full-body appearance and face-based gender classification .. 30
Figure 3.1: Block diagram of chapter 3, proposed methodologies with highlights ..... 43
Figure 3.2: FCDF framework for person ReID consisting of three modules, where a)
feature representation module is applied to compute different types of features from R,
G, B, H, S, and V channels, and then optimal features are selected using novel FFS
method, b) feature clustering module is used to split whole gallery into different
consensus clusters for gallery search optimization, whereas deep features of each
cluster sample are also examined, and c) feature matching module includes
classification of corresponding cluster(s), probe deep features and finally similarity
measure is applied to obtain recognition rates at different ranks................................. 45
Figure 3.3: LBP and LEP code generation [187] ......................................................... 49
Figure 3.4: Proposed feature extraction, fusion, and max-entropy based selection of
features ......................................................................................................................... 51
Figure 3.5: CNN model for deep feature extraction .................................................... 56
Figure 3.6: Parameters setting at each layer ................................................................ 56
Figure 3.7: Max pooling operation .............................................................................. 59
Figure 3.8: Proposed J-LDFR framework for pedestrian gender classification .......... 61
Figure 3.9: Process of formation of low-level (HOG) features ................................... 64
Figure 3.10: Process of formation of low-level (LOMO) features, a) overview of feature
extraction procedure using two different LOMO representation schemes such as HSV
and SILTP based representations including feature fusion step to compute LOMO
feature vector, and b) basic internal representation to calculate combined histogram
from different patches of input image [77] .................................................................. 65
Figure 3.11: Complete design of proposed low-level and deep feature extraction from
gender images for joint feature representation. The proposed framework J-LDFR
selects maximum score-based features and then fusion is applied to generate a robust
feature vector that has both low-level and deep feature representations. Selected
classifiers are applied to evaluate these feature representations for gender prediction ........ 70
Figure 3.12: An overview of the proposed PGC-FSDTF framework for pedestrian
gender classification..................................................................................................... 75
Figure 3.13: Proposed 1vs1 and 1vs4 strategies for data augmentation ...................... 78
Figure 3.14: PHOG feature extraction scheme ............................................................ 81
Figure 3.15: HSV histogram based color features extraction ...................................... 83
Figure 3.16: Different ways to deploy pre-trained deep learning models ................... 85
Figure 3.17: Schematic view of InceptionResNetV2 model (compressed) ................. 86
Figure 3.18: Schematic view of DenseNet201 model (compressed) ........................... 86
Figure 3.19: Deep CNN feature extraction and parallel fusion ................................... 88
Figure 3.20: FSDTF for gender prediction .................................................................. 92
Figure 4.1: Block of chapter 4 including section wise highlights................................ 96
Figure 4.2: CMC curves of existing and proposed FCDF method on VIPeR dataset ........ 102
Figure 4.3: CMC curves of existing and proposed FCDF method on CUHK01 dataset
.................................................................................................................................... 104
Figure 4.4: CMC curves of existing and proposed FCDF method on the iLIDS-VID
dataset ........................................................................................................................ 106
Figure 4.5: Performance comparison of CC and NNC based searching against all probe
images ........................................................................................................................ 110
Figure 4.6: Selected image pairs (column-wise) from VIPeR, CUHK01, and iLIDS-
VID datasets with challenging conditions such as a) Improper image appearances, b)
different background and foreground information, c) drastic illumination changes, and
d) pose variations including lights effects ................................................................. 110
Figure 4.7: Samples of pedestrian images selected from sub-datasets of PETA dataset,
column represents the gender (male and female) from each sub-dataset; the upper row
shows images of the male gender whereas the lower row shows images of the female
gender ......................................................................................................................... 112

Figure 4.8: Proposed J-LDFR method (a) training and (b) prediction time using
different classifiers on PETA and MIT datasets ........................................................ 117
Figure 4.9: Performance evaluation of proposed J-LDFR method on PETA dataset
using entropy controlled low-level feature representations, individually.................. 119
Figure 4.10: Performance evaluation of proposed method on MIT dataset using entropy
controlled low-level feature representations, individually......................................... 120
Figure 4.11: Performance evaluation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled low-level feature representations, jointly ............. 121
Figure 4.12: Performance estimation of proposed J-LDFR method on PETA dataset
using entropy controlled deep feature representations, individually ......................... 124
Figure 4.13: Performance evaluation of proposed J-LDFR method on MIT dataset using
entropy controlled deep feature representations, separately ...................................... 124
Figure 4.14: Performance estimation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled deep feature representations, jointly .................... 125
Figure 4.15: Proposed evaluation results using individual feature representation and
JFR on PETA dataset ................................................................................................. 128
Figure 4.16: Proposed evaluation results using individual feature representation and
JFR on the MIT dataset .............................................................................................. 130
Figure 4.17: AUC on PETA dataset .......................................................................... 131
Figure 4.18: AUC on MIT dataset ............................................................................. 132
Figure 4.19: Comparison of existing and proposed results in terms of AUC on PETA
dataset ........................................................................................................................ 134
Figure 4.20: Comparison of existing and proposed results in terms of overall accuracy
on MIT dataset ........................................................................................................... 135
Figure 4.21: Gender wise pair of sample images collected from MIT, PKU-Reid, PETA,
VIPeR, and cross-datasets where column represents the gender (male and female) from
each dataset, upper row shows images of male, and lower row shows images of female
.................................................................................................................................... 138
Figure 4.22: Sample images of pedestrian with back and front views (a) MIT/MIT-IB
dataset (b) augmented MIT datasets. First two rows represent male images, and next
two rows represent female images ............................................................................. 140
Figure 4.23: Sample images of pedestrians (a) first and second row represent male and
female samples, respectively collected from PKU-Reid-IB dataset, (b) first and second
row represent male and female samples respectively collected from augmented datasets,
and (c) male (top) and female (bottom) images to show pedestrian images with different
viewpoint angle changes from 0° to 315°, i.e., eight directions in total .......................... 142
Figure 4.24: Sample images of pedestrians, column represents gender (male and female)
selected from each customized SSS PETA dataset, upper row shows images of male,
and lower row shows images of female ................................................. 143
Figure 4.25: Gender wise sample images of pedestrian, column represents gender (male
and female) selected from sub-datasets of PETA dataset, upper row shows two images
of male, and lower row shows two images of female ................................................. 145
Figure 4.26: An overview of the selected, imbalanced, augmented balanced, and
customized datasets with the class-wise (male and female) distribution of samples for
pedestrian gender classification (a) imbalanced and augmented balanced SSS datasets
and (b) customized balanced SSS datasets ................................................................ 146
Figure 4.27: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on imbalanced MIT-IB dataset ............................................................. 150
Figure 4.28: Best AUC for males and females on imbalanced MIT-IB dataset using
PCA based selected features set ................................................................................. 151
Figure 4.29: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced MIT-BROS-1 dataset ........................................................ 153
Figure 4.30: Best AUC for males and females on balanced MIT-BROS-1 dataset using
PCA based selected FSs ............................................................................................. 154
Figure 4.31: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced MIT-BROS-2 dataset ........................................................ 156
Figure 4.32: Best AUC for males and females on balanced MIT-BROS-2 dataset using
PCA based selected FS .............................................................................................. 157
Figure 4.33: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced MIT-BROS-3 dataset ........................................................ 159
Figure 4.34: Best AUC for males and females on balanced MIT-BROS-3 dataset using
PCA based selected features subsets.......................................................................... 160

Figure 4.35: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on imbalanced PKU-Reid-IB dataset .................................................... 163
Figure 4.36: Best AUC for males and females on imbalanced PKU-Reid-IB dataset
using PCA based selected FSs ................................................................................... 164
Figure 4.37: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on the balanced PKU-Reid-BROS-1 dataset ........................................ 166
Figure 4.38: Best AUC for males and females on balanced PKU-Reid-BROS-1 dataset
using PCA based selected FSs ................................................................................... 167
Figure 4.39: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced PKU-Reid-BROS-2 dataset .............................................. 169
Figure 4.40: Best AUC for males and females on balanced PKU-Reid-BROS-2 dataset
using PCA based selected FSs ................................................................................... 170
Figure 4.41: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced PKU-Reid-BROS-3 dataset .............................................. 173
Figure 4.42: Best AUC for males and females on balanced PKU-Reid-BROS-3 dataset
using PCA based selected FSs ................................................................................... 173
Figure 4.43: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on the balanced PETA-SSS-1 dataset ................................................... 176
Figure 4.44: Best AUC for males and females on the balanced PETA-SSS-1 dataset
using PCA based selected FSs ................................................................................... 176
Figure 4.45: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced PETA-SSS-2 dataset ......................................................... 179
Figure 4.46: Best AUC for males and females on balanced PETA-SSS-2 dataset using
PCA based selected FSs ............................................................................................. 179
Figure 4.47: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced VIPeR-SSS dataset ........................................................... 182

Figure 4.48: Best AUC for males and females on balanced VIPeR-SSS dataset using
PCA based selected features subsets.......................................................................... 182
Figure 4.49: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced cross-dataset, best outcomes of two classifiers ................ 185
Figure 4.50: Best AUC for males and females on balanced cross-dataset using PCA
based selected FSs...................................................................................................... 186
Figure 4.51: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced cross-dataset-1, best outcomes of two classifiers ............. 188
Figure 4.52: Best AUC for males and females on balanced cross-dataset-1 using PCA
based selected FSs...................................................................................................... 189
Figure 4.53: Performance comparison in terms of overall accuracy between proposed
PGC-FSDTF method and existing methods on MIT/MIT-IB dataset ....................... 191
Figure 4.54: Comparison of training and prediction time of PGC-FSDTF with J-LDFR
.................................................................................................................................... 191
Figure 4.55: Performance comparison in terms of AUC between the proposed PGC-
FSDTF method and existing methods on cross-datasets ........................................... 193
Figure 4.56: Training time of PGC-FSDTF method on applied datasets .................. 198
Figure 4.57: Prediction time of PGC-FSDTF method on applied datasets................ 198
Figure 4.58: Complete overview of proposed PGC-FSDTF method results in terms of
best CW-ACC on selected, customized, and augmented datasets where proposed
approach achieved superior AUC (a) CW-ACC on customized balanced SSS datasets,
and (b) CW-ACC on imbalanced and augmented balanced SSS datasets ................. 199

LIST OF TABLES
Table 2.1: Summary of appearance based approaches for person ReID ..................... 23
Table 2.2: Summary of metric learning based approaches for person ReID ............... 26
Table 2.3: Summary of deep learning based approaches for person ReID ................. 29
Table 2.4: Summary of parts-based approaches using frontal, back and mixed view
images of a pedestrian for gender classification, (-) represents that no reported result is
available ....................................................................................................................... 32
Table 2.5: Summary of existing methods results using full-body frontal view images of
pedestrian for gender classification with 5-fold cross-validation (male=123, female=123
uncropped and cropped images of MIT dataset; whereas male=292, female=291
uncropped images of VIPeR dataset) ........................................................................... 34
Table 2.6: Summary of gender classification methods using handcrafted features and
different classifiers with full-body frontal, back, and mixed view images of a gender, (-
) represents that no reported result is available ............................................................ 36
Table 2.7: Summary of deep learning and hybrid methods for pedestrian gender
classification with full-body frontal, back, and mixed view images of a gender, (-)
represents that no reported result is available .............................................................. 39
Table 3.1: Description of different test FSs with different dimensions ........................ 53
Table 3.2: Experimental results using handcrafted features on three selected datasets,
top recognition rates at each rank written in bold ........................................................ 53
Table 3.3: Description of preprocessing for each feature representation scheme ....... 62
Table 3.4: Proposed J-LDFR framework selected features subset dimensions, classifiers
and their parameter settings ......................................................................................... 73
Table 3.5: Augmentation statistics for imbalanced and small sample MIT, and PKU-
Reid datasets, class-wise selected number of samples in a single set for data
augmentation, and resultantly, total augmented images and total images ................... 79
Table 3.6: Description of preprocessing for each feature representation scheme ....... 80
Table 4.1: Statistics of datasets for person ReID ......................................................... 99
Table 4.2: Experimental results using deep features (from higher to lower dimension)
on VIPeR dataset........................................................................................................ 100
Table 4.3: Experimental results using deep features (from higher to lower dimension)
.................................................................................................................................... 100

Table 4.4: Experimental results using deep features (from higher to lower dimension)
on i-LIDS-VID dataset ............................................................................................... 100
Table 4.5: Performance comparison in terms of top matching rates (%) of existing
methods including proposed FCDF method on VIPeR dataset (p=316), dash (-)
represents that no reported result is available ............................................................ 103
Table 4.6: Performance comparison in terms of top matching rates (%) of existing
methods including proposed FCDF method on CUHK01 dataset (p=486), dash (-)
represents that no reported result is available ............................................................ 105
Table 4.7: Performance comparison in terms of top matching rates (%) of existing
methods and proposed FCDF method on the iLIDS-VID dataset (p=150), dash (-)
represents that no reported result is available ............................................................ 107
Table 4.8: Cluster and gallery-based probe matching results of proposed FCDF
framework on VIPeR dataset .................................................................................... 109
Table 4.9: Cluster and gallery-based probe matching results of proposed FCDF
framework on CUHK01 dataset................................................................................. 109
Table 4.10: Cluster and gallery-based probe matching results of proposed FCDF
framework on the iLIDS-VID dataset........................................................................ 109
Table 4.11: Evaluation protocols ............................................................................... 111
Table 4.12: Statistics of PETA dataset for pedestrian gender classification ............. 113
Table 4.13: Statistics of MIT dataset for pedestrian gender classification ................ 113
Table 4.14: Description of different test FSs with different dimensions ................... 114
Table 4.15: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on PETA dataset ................................................................... 115
Table 4.16: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on MIT dataset ...................................................................... 116
Table 4.17: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation .............................................. 118
Table 4.18: Performance evaluation of proposed J-LDFR method on PETA dataset
using M-SVM classifier with 10-fold cross-validation ............................................. 118
Table 4.19: Performance evaluation of proposed J-LDFR method on PETA dataset
using Q-SVM classifier with 10-fold cross-validation .............................................. 118
Table 4.20: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation ........................................................ 119

Table 4.21: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation ....................................................... 120
Table 4.22: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation ........................................................ 120
Table 4.23: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation .............................................. 122
Table 4.24: Performance evaluation of the proposed J-LDFR method on PETA dataset
using M-SVM classifiers with 10-fold cross-validation ............................................ 122
Table 4.25: Performance evaluation of the proposed J-LDFR method on PETA dataset
using Q-SVM classifier with 10-fold cross-validation .............................................. 122
Table 4.26: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation ........................................................ 123
Table 4.27: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation ....................................................... 123
Table 4.28: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation ........................................................ 123
Table 4.29: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation .............................................. 126
Table 4.30: Performance evaluation of proposed J-LDFR method on PETA dataset
using M-SVM classifier with 10-fold cross-validation ............................................. 126
Table 4.31: Performance evaluation of proposed J-LDFR method on PETA dataset
using Q-SVM classifiers with 10-fold cross-validation............................................. 127
Table 4.32: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using cubic-SVM classifier with 10-fold cross-validation ............................ 128
Table 4.33: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using medium-SVM classifier with 10-fold cross-validation ........................ 129
Table 4.34: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using quadratic-SVM classifier with 10-fold cross-validation ...................... 129
Table 4.35: Confusion matrix using C-SVM on PETA dataset ................................. 130
Table 4.36: Confusion matrix using C-SVM on MIT dataset ................................... 131
Table 4.37: Performance comparison with existing methods using PETA dataset ... 133
Table 4.38: Performance comparison with existing methods using MIT dataset, dash (-
) represents that no reported result is available .......................................................... 135
Table 4.39: Proposed J-LDFR method results on PETA and MIT datasets .............. 136

Table 4.40: Evaluation protocols / metrics ................................................................ 137
Table 4.41: Statistics of MIT/MIT-IB dataset samples based imbalanced and
augmented balanced small sample datasets for pedestrian gender classification ...... 140
Table 4.42: Statistics of PKU-Reid dataset samples based imbalanced and augmented
balanced small sample datasets for pedestrian gender classification......................... 141
Table 4.43: Statistics of PETA dataset samples based customized PETA-SSS-1 and
PETA-SSS-2 datasets for pedestrian gender classification ....................................... 143
Table 4.44: Statistics of VIPeR dataset samples based customized VIPeR-SSS dataset
for pedestrian gender classification ........................................................................... 144
Table 4.45: Statistics of cross-datasets for pedestrian gender classification ............. 145
Table 4.46: Performance of proposed PGC-FSDTF method on imbalanced MIT-IB
dataset ........................................................................................................................ 148
Table 4.47: Performance of proposed PGC-FSDTF method on imbalanced MIT-IB
dataset ........................................................................................................................ 149
Table 4.48: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-1
dataset (male=864, and female=864 images) using different evaluation protocols.. 152
Table 4.49: Performance of proposed PGC-FSDTF method on balanced MIT- BROS-
1 dataset (male=864, and female=864 images) using different accuracies, AUC and
time ............................................................................................................................ 153
Table 4.50: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-2
dataset (male=864, and female=864 images) using different evaluation protocols... 155
Table 4.51: Performance of proposed PGC-FSDTF method on the balanced MIT-
BROS-2 dataset (male=864, and female=864 images) using accuracies, AUC, and time
.................................................................................................................................... 155
Table 4.52: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-3
dataset (male=600, and female=600 images) using different evaluation protocols... 158
Table 4.53: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-3
dataset (male=600, and female=600 images) using different accuracies, AUC, and time
.................................................................................................................................... 159
Table 4.54: Performance of proposed PGC-FSDTF method on imbalanced PKU-Reid-
IB dataset (male=1120, and female=704 images) using different evaluation protocols
.................................................................................................................................... 161

Table 4.55: Performance of proposed PGC-FSDTF method on imbalanced PKU-Reid-
IB dataset (male=1120, and female=704 images) using different accuracies, AUC, and
time ............................................................................................................................ 162
Table 4.56: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-
BROS-1 dataset (male=1300, and female=1300 images), different evaluation protocols
.................................................................................................................................... 165
Table 4.57: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-
BROS-1 dataset (male=1300, and female=1300 images) using different accuracies,
AUC, and time ........................................................................................................... 165
Table 4.58: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-2
dataset (male=1300, and female=1300 images) using different evaluation protocols
.................................................................................................................................... 168
Table 4.59: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-2
dataset (male=1300, and female=1300 images) using different accuracies, AUC, and
time ............................................................................................................................ 168
Table 4.60: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-
BROS-3 dataset (male=1120, and female=1120 images) different evaluation protocols
.................................................................................................................................... 171
Table 4.61: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-3
dataset (male=1120, and female=1120 images) using different accuracies, AUC, and
time ............................................................................................................................ 172
Table 4.62: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-1
dataset (male=864, and female=864 images) using different evaluation protocols... 175
Table 4.63: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-1
dataset (male=864, and female=864 images) using different accuracies, AUC and time
.................................................................................................................................... 175
Table 4.64: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-2
dataset (male=1300, and female=1300 images) using different evaluation protocols
.................................................................................................................................... 177
Table 4.65: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-2
dataset (male=1300, and female=1300 images) using different accuracies, AUC, and
time ............................................................................................................................ 178
Table 4.66: Performance of proposed PGC-FSDTF method on balanced VIPeR-SSS
dataset (male=544, and female=544 images) using different evaluation protocols... 180

Table 4.67: Performance of proposed PGC-FSDTF method on balanced VIPeR-SSS
dataset (male=544, and female=544 images) using different accuracies, AUC, and time
.................................................................................................................................... 181
Table 4.68: Performance of proposed PGC-FSDTF method on balanced cross-dataset
(male=175, and female=175 images) using different evaluation protocols ............... 184
Table 4.69: Performance of proposed PGC-FSDTF method on balanced cross-dataset
(male=175, and female=175 images) using different accuracies, AUC, and time .... 184
Table 4.70: Performance of proposed PGC-FSDTF method on balanced cross-dataset-
1 (male=350, and female=350 images) using different evaluation protocols ............ 187
Table 4.71: Performance of proposed PGC-FSDTF method on balanced cross-dataset-
1 (male=350, and female=350 images) using different accuracies, AUC and time .. 187
Table 4.72: Performance comparison with state-of-the-art methods on MIT/MIT-IB
dataset, dash (-) represents that no reported result is available.................................. 190
Table 4.73: Comparison of proposed PGC-FSDTF method results with state-of-the-art
methods on cross-datasets .......................................................................................... 192
Table 4.74: Summary of proposed method PGC-FSDTF results on all selected
imbalanced, and augmented balanced datasets where proposed approach recorded
superior AUC ............................................................................................................. 195
Table 4.75: Proposed approach PGC-FSDTF results on customized/non-augmented
balanced datasets where proposed approach recorded superior AUC ....................... 197
Table 4.76: Proposed PGC-FSDTF approach results on MIT-IB dataset and cross-
dataset ........................................................................................................................ 197
Table 4.77: Summary of proposed methods including tasks, datasets, and results ... 200

LIST OF ABBREVIATIONS

AD Average Deep
AFDA Adaptive Fisher Discriminative Analysis
AML Adaptive Metric Learning
AML-PSR Adaptive Metric Learning Probe Specific Re-ranking
AUC Area Under the Curve
AU Activation Unit
AvgDeep_FV Average score-based Deep Feature Vector
BoW Bag of Words
B-ACC Balanced Accuracy
B-SSS Balanced and Small Sample Space
BROS Balanced Random Over Sampling
BN Batch Normalization
BRM2L Bidirectional Reference Matric Metric Learning
BS Bin-Size
BIF Biological Inspired Features
CAM Camera
CAST Center for Applied Science and Technology
CPNI Center for the Protection of National Infrastructure
CUHK China University of Hong Kong
CW-ACC Class Wise Accuracy
CCTV Closed Circuit Television
CF Color Feature
CNN Convolution Neural Network
CL Convolutional Layer
CC Corresponding Cluster
CMGTN Cross Modal feature Generating and target information
preserving Transfer Network
CTC-GAN Cross-media Transfer Cycle Generative Adversarial Networks
XQDA Cross-view Quadratic Discriminant Analysis
CMC Cumulative Matching Characteristics
DDNN Deep De-compositional Neural Networks

DFR Deep Feature Ranking
DFBP Deep Features Body and Part
DMIL Deep Multi Instance Learning
DR Deep ResNet101
DV Deep VGG
DHFFN Deep-learned and Handcrafted Feature Fusion Network
DPML Deformable Patch based Metric Learning
DB Dense Block
DN DenseNet
DN201_FCL DenseNet201 Fully Connected Layer
DVR Discriminative Video fragments selection and Ranking
DVDL Discriminatively trained Viewpoint invariant Dictionaries
Learning
DHGM Dynamic Hyper Graph Matching
EML Enhanced Metric Learning
FN False Negative
FP False Positive
FPR False Positive Rate
FCDF Features-based Clustering and Deep Features
FFS Feature Fusion and Selection
FR Feature Representation
FSF Features selection and Fusion
FS Features Subset
FSD Features Subset Dimension
FV Feature Vector
FoVs Field of Views
FFV Final Feature Vector
FTCNN Fine Tuned Convolution Neural Network
FC Fully Connected
FCL Fully Connected Layer
FCUE Fully Controlled and Uncontrolled Environment
FMCF Fusion of Multiple Channel Features
GoG Gaussian of Gaussian

GM Greedy Matching
HRPID High Resolution Pedestrian re-Identification Dataset
HOG Histogram of Oriented Gradients
HDFL HOG assisted Deep Feature Learning
HH HSV Histogram
HSV Hue Saturation Value
HSV-Hist_FV Hue Saturation Value Histogram based Feature Vector
HM Hungarian Matching
HG Hyper Graph
HGL Hyper Graph Learning
iLIDS-VID imagery Library for Intelligent Detection System-Video re-
IDentification
IB Imbalanced
IB-SSS Imbalanced and Small Sample Space
IRV2 Inception ResNet V2
IRNV2_FCL InceptionResNetV2 Fully Connected Layer
IN Indoor
JFR Joint Feature Representation
J-LDFR Joint Low-level and Deep CNN Feature Representations
KNN k Nearest Neighbor
LWA Light Weight and Accurate
LDA Linear Discriminant Analysis
LBP Local Binary Patterns
LCDN Local Contrast Divisive Normalization
LEDF Local Embedded Deep Feature
LEP Local Extrema Patterns
LFDA Local Fisher Discriminant Analysis
LHSV Local Hue Saturation Value
LOMO Local Maximal Occurrence
LSDA Locality Sensitive Discriminant Analysis
MSDALF Mask-improved Symmetry Driven Accumulation of Local
Features
MD Maximum Deep

MaxDeep_Fv Maximum score-based Deep Feature Vector
M-ACC Mean Accuracy
MARPML Multi-feature Attribute Restricted Projection Metric Learning
M3L Multi-modality Mining Metric Learning
MCTS Multiple Camera Tracking Scenarios
MFA Marginal Fisher Analysis
MKSSL Multiple Kernel Sub Space Learning
NNC Nearest Neighbor Cluster
NP ReId Neuromorphic Person Re-identification
NFST Null Foley-Sammon Transfer
OFS Optimal Features Subsets
OF Optimal Features
OD Outdoor
OLPP Orthogonal Locality Preserving Projections
O-ACC Overall Accuracy
PBGR Part Based Gender Recognition
PLS Partial Least Square
PAN Pedestrian Alignment Network
PETA PEdesTrian Attribute
PGC-FSDTF Pedestrian Gender Classification using Fusion of Selected Deep
and Traditional Features
PKU-Reid Peking University Re-identification
Preid PFCC Person Re-identification via Prototype Formation and Color
Categorization
P2SNET Point to Set Network
PL Pooling Layer
PICB Pose Invariant Convolutional Baseline
PaMM Pose-aware Multi-shot Matching
PPV Positive Predictive Value
PCA Principal Component Analysis
PRDC Probabilistic Distance Comparison
PSR Probe Specific Re-ranking
PHDL Projection matrix and Heterogeneous Dictionary pair Learning

PHOG Pyramid Histogram of Oriented Gradients
PH Pyramid Histogram of Oriented Gradients Feature Vector
PHOG_FV Pyramid Histogram of Oriented Gradients Feature Vector
PHOW Pyramid Histogram of Words
QRKISS-FFN QR Keep It Simple and Straightforward Feature Fusion
Network
QC Quadratic Chi
QLBP Quaternionic Local Binary Pattern
RBF Radial Basis Function
RF Random Forest
ROS Random Over Sampling
RF+DDA+AC Raw low-level Features + Data Driven Attributes + Attribute
Correlations
RLU Rectified Linear Unit
RGB Red Green Blue
ROI Region of Interest
RCCA Regularized Canonical Correlation Analysis
RLML Regularized Local Metric Learning
ReID Re-identification
RDC Relative Distance Comparison
RDML Relative Distance Metric Learning
RDML-CCPVL Relative Distance Metric Learning Clustering Centralization
and Projection Vectors Learning
RR Re-Ranking
RN ResNet
ReLU Rectified Linear Unit
ROCCA Robust Canonical Correlation Analysis
RDs Robust Descriptor
SCD Salience Color Descriptor
SSSVM Sample Specific Support Vector Machine
SIFT Scale Invariant Feature Transform
SILTP Scale Invariant Local Ternary Patterns
SF Shape Feature

SCNN Siamese Convolutional Neural Network
SECGAN Similarity Embedded Cycle Generative adversarial Network
SVD Singular Value Decomposition
SSS Small Sample Space
sCSPA soft Cluster-based Similarity Partitioning Algorithm
SMA Stable Marriage Algorithm
SSAE Stacked Sparse Auto Encoder
SD Standard Deviation
SVM Support Vector Machine
SDALFs Symmetry Driven Accumulation of Local Features
TMSL Temporally Memorized Similarity Learning
TF Texture Feature
TPPLFDA Topology Properties Preserved Local Fisher Discriminant
Analysis
TLSTP Transfer Learning of Spatial Temporal Patterns
TL Transition Layer
TMD2L Triplet-based Manifold Discriminative Distance Learning
TN True Negative
TP True Positive
TPR True Positive Rate
UCDTL Unsupervised Cross Dataset Transfer Learning
VA Video Analytics
VIPeR Viewpoint Invariant Pedestrian Recognition
VSS Visual Surveillance System

LIST OF SYMBOLS

𝑁 Total number of samples (images) in a dataset/cluster


𝐺 Total number of samples (images) in gallery or gallery set
𝐺𝑖 Gallery image
𝑃 Probe/query images or probe set
𝑃𝑖 Probe image
𝐷 Distance measure
𝐴𝑀 Accurate matches
𝜇 Mean
𝜎² Sample variance
𝜎 Standard deviation
𝑖 Row indices
𝑗 Column indices
𝑀 Real matrix
𝐶 Multi-channel image
𝑐 Color space channel
𝜉 Single channel of color space
𝑚 Total number of rows
𝑛 Total number of columns
𝐻 Input channel
𝑈 Left singular vector matrix (orthogonal)
𝑢𝑖 Left singular vector
𝑉 Right singular vector matrix (orthogonal)
𝑣𝑖 Right singular vector
𝑆 Diagonal matrix of singular values
Ŝ Singular value decomposition function
𝜎𝑘 Singular values
𝑟 Rank of the channel
𝑟𝜉 Ranks of query channel
𝜎𝑖 Projection coefficient
𝑑 Dimension of row feature vector

𝑓𝑑 Feature vector with specific dimension
𝑏 Channel number
𝐹𝑉 Feature vector
𝑃𝑐𝑣 Value of center pixel
𝑃𝑛𝑣 Value of neighbor pixel
𝑃′𝑛𝑣 Corresponding differences from center pixel value
𝜃 Direction/angle/orientation of gradients
𝑃𝑛𝑏(𝜃) Pixel values at a particular direction
𝐼2 Binary code at given position
𝐹 Feature vector
𝐾 Number of clusters
𝐿 Set of cluster labels
Ƈ Set of clusters
𝑔𝑥 Horizontal gradients
𝑔𝑦 Vertical gradients
𝑀𝑔 Magnitude
𝐾𝑣 HOG feature vector
𝐶𝐹𝑣 Color feature vector
𝑐𝑓𝑖,𝑗 Color feature indices
𝑑𝑞 Dimension of color feature vector
𝑇𝐹𝑣 Texture feature vector
𝑡𝑓𝑖,𝑗 Texture feature indices
𝑑𝑟 Dimension of texture feature vector
𝑆𝐹𝑣 Shape feature vector
𝑠𝑓𝑖,𝑗 Shape feature indices
𝑑𝑠 Dimension of shape feature vector
𝑓𝑖 Current feature
𝑓𝑗 Next feature
𝛿 Entropy controlled feature vector
𝑣 Number of selected features
𝑁𝑒 Number of epochs
𝑓𝑛 Feature space/feature vector

𝑃𝑅 Probability of computed features
𝑙 Layer of network
𝑓𝑙 Filter at layer 𝑙
𝑓𝑥,𝑦𝑙 Filter size at layer 𝑙 and between 𝑥th and 𝑦th feature maps
𝑤𝑥,𝑦𝑙 Weight at layer 𝑙 and between 𝑥th and 𝑦th feature maps
𝑛𝑖 × 𝑛𝑗 Size of matrix for input
𝑈𝑖𝑗𝑘 Given pooling region
𝑘×𝑘 Filter size
𝐾×𝐾 Pooling region
𝑀𝑝,𝑞 Pooling region with filter size
𝑍𝑝𝑞𝑘 Output of max pooling
𝑀𝑎𝑏 Local area
𝐶𝑘 Consensus cluster
QC Quadratic-Chi
𝑘 Cluster index
𝐷 Distance
𝑋𝐼𝑃 Number of accurately matched probe images
𝑌𝐼𝑃 Total images in the probe set
𝑛𝑐 Neighbor cluster
𝑠𝑤 𝑙 Sliding window
𝑏𝑥𝑙 Bias matrix
𝑛𝑥𝑙 Feature map
𝑍𝑥𝑙 Output of convolution process
𝑍𝑥𝑙−1 Input channel values
𝐿 Pyramid level
𝑃𝐻𝑣 PHOG feature indices
𝑝ℎ𝑖,𝑗 PHOG feature indices
𝑑𝑝 Dimension of PHOG feature vector
𝐻𝐻𝑣 HSV Histogram feature vector
ℎℎ𝑖,𝑗 HSV histogram feature indices
𝑑ℎ Dimension of HSV-Histogram feature vector
𝑀𝐷𝑣 Maximum score-based deep feature vector

𝑚𝑑𝑖,𝑗 Maximum score-based deep feature indices
𝑑𝑚 Dimension of maximum score-based deep feature vector
𝐴𝐷𝑣 Average score-based deep feature vector
𝑎𝑑𝑖,𝑗 Average score-based deep feature indices
𝑑𝑎 Dimension of average score-based deep feature vector
𝐼𝐹𝑖𝑚 Filtered image
𝐻𝐹𝑖𝑚 Horizontal flipped image
𝐼𝑀 Image matrix
𝐺𝑇𝑖𝑚 Geometric transformed image
𝐵𝐴𝑖𝑚 Brightness adjusted image
𝑚𝑔𝑣 Matrix with gradient values
𝑚ℎ𝑣 Matrix with histogram values
𝑟𝑜𝑖 Region of interest
𝑏𝑠 Bin size on the histogram
ĉ Constant value
ɠ𝑐𝑐 Consensus cluster operation

Chapter 1 Introduction
Introduction

1.1 Introduction to Visual Surveillance
Visual surveillance is an emerging technology that deals with the monitoring of a
particular area or a person for security and safety. In view of recent security concerns
worldwide, many countries are committed to expanding their visual surveillance coverage
through cameras. In this regard, millions of visual surveillance or closed-circuit television
(CCTV) cameras are being installed around the globe. According to a recent report,
approximately 770 million surveillance cameras have been installed worldwide to date,
including 200 million cameras in 2020 alone. The report also states that
one billion cameras will be deployed globally by 2021 [1]. The increasing trend of
CCTV cameras, country-wise and per 100 people, in the top ten countries is shown in Figure
1.1 (a) and Figure 1.1 (b), respectively.

Figure 1.1: Increasing trend of CCTV cameras in top ten countries a) country wise b)
per 100 people [1]

The top countries are realizing the significance of visual surveillance; for instance,
China has established itself as a surveillance state by deploying more CCTV
cameras than any other country, whereas in terms of per capita usage the United States
ranks first.

1.1.1 Significance of Visual Surveillance

The importance of visual surveillance is increasing due to its applications in numerous
fields such as public safety and crime reduction. Surveillance systems keep an eye
on the people entering a particular place to sense threats arising from suspicious activities.
Thus, these systems enable security agencies to intervene and preempt
unwanted incidents at important buildings, train stations, airports, businesses, and private
and public departments.

Moreover, in the present era, the popularity of visual surveillance is linked with the
availability of powerful computing hardware and cost-effective CCTV cameras, which
are two important components of any visual surveillance system (VSS). These
components make visual surveillance more suitable and efficient for monitoring
target objects such as pedestrians, crowds, and unattended items. Because cameras
provide unique visualization properties, they are considered a more reliable source for
automatic visual content analysis than traditional modalities such as infrared,
sonar, and radar.

1.1.2 Automatic Visual Content Analysis

VSS offers two modes of processing for visual content analysis: a real-time
mode for online data processing and posteriori mode for stored or offline data
processing. VSS provides an infrastructure to capture, store, and distribute video while
leaving the task of threat detection exclusively to human operators in both modes. In
real-time mode, video operators continuously view the live video stream for crime
prevention, and they are also responsible for generating alerts when misconduct or an
event of interest occurs [2]. However, in this mode of processing, delays may occur due
to the live streams generated by the numerous CCTV cameras in a network. In addition,
live streams are stored on digital media for a predefined time. In the posteriori mode,
searching for a given person/event of interest in thousands of recorded images/video
frames provided by many cameras requires the allocation of a large number of
enforcement officers and a lot of time.

In both modes of processing, VSS requires continuous monitoring of pedestrians,
vehicles, and unattended items, which becomes a tedious job in which a human may
overlook important events and information, leading to security issues. Visual content
analysis demands a high level of human attention, making human based surveillance a
labor-intensive task. In addition, such surveillance is prone to errors due to lapses in
human attention. Moreover, it requires more operators to monitor the gigantic amount of
data generated by the cameras during different events. These limitations of human based
visual surveillance, together with the availability of powerful computing hardware and
cost-effective cameras, have set the basis for automatic visual content analysis, which
investigates captured visual data without human intervention. This automatic analysis
adopts the same two modes of processing: real-time mode for online data and posteriori
mode for stored or offline data. Visual surveillance modes [3] are shown in
Figure 1.2.

Figure 1.2: Visual surveillance viewing modes, real-time and posteriori [3]

In automatic visual content analysis, the real-time mode automatically analyzes the
live stream for events of interest, whereas the posteriori mode extracts important
information such as evidence from already stored visual data and replays specific events
overlooked by operators due to stream delays in the network. Thus, in automatic
settings for both modes, human operators are replaced with a machine based VSS. In
contrast to human based surveillance, both automatic modes are well organized to
examine the captured streams for detecting suspicious activities. Using either mode,
automatic visual content analysis opens new venues for research such as pedestrian [4-
6], medical [7-9], and agriculture [10-12] image analysis. Nowadays, automatic
pedestrian image analysis is one of the important research areas as it acts as an effective
deterrent to unlawful and anti-social activities [13].

1.2 Automatic Pedestrian Analysis


Pedestrian analysis is a challenging problem in the computer vision domain. Automatic
analysis helps to investigate the pedestrian as a target object by developing suitable
solutions. For automatic pedestrian analysis, several studies have been carried out using
different machine learning techniques [14-21]. Each technique has its own capabilities
and constraints. These techniques can be generally divided into two main categories:
biometric and full-body appearance based pedestrian analysis.

A biometric technique utilizes iris, face, gait, or fingerprint images of an individual
for pedestrian analysis. The techniques which use iris [22-24] and fingerprint [25-27]
images require the collaboration of the individuals in a monitored
environment. For instance, acquiring an iris image requires an individual to position the
eyes close to and directly in front of a particular sensor, whereas capturing a
fingerprint requires placing a finger on a particular sensor. However, the techniques
which use face [28-30] and gait [31-38] do not require the collaboration of the individual
being observed. Thus, these techniques are more suitable for automatic surveillance at
entry and exit points of public places under controlled environments.

The performance of these techniques may be influenced during image acquisition,
compression, and decompression. During image/video acquisition, the performance
suffers due to (1) small object size, which is not visible/distinguishable under large fields
of view (FoVs) of a camera, and (2) delays in the huge amount of live stream data
produced by the several cameras in a network. The performance may also be affected by
loss of information or addition of noise during the compression and decompression steps
used by the cameras or storage devices. These constraints show that biometric techniques
are highly dependent on camera settings and person orientation towards the camera. For
instance, if the targeted person's face is not visible or only a side view appears in front of
the cameras, facial recognition cannot be carried out. To address all these constraints,
full-body appearance based pedestrian analysis is becoming a dominant field of research.

1.2.1 Appearance based Pedestrian Analysis

The full-body appearance of a pedestrian in images and videos is widely used for
analysis, and this is often called appearance based pedestrian analysis. The pedestrian
appearance is captured under indoor (ID), outdoor (OD), and fully controlled and
uncontrolled environments (FCUE). Under these environments, variations in pedestrian
appearance such as pose, illumination, and viewpoint, along with different camera
settings, make pedestrian analysis more challenging. The relevant literature
presents two approaches: traditional and deep learning based, as shown in Figure 1.3.
In the traditional approach, the pedestrian analysis process consists of several
steps such as preprocessing, feature extraction or descriptor design, feature selection,
and classification/matching. In deep learning, convolution neural
network (CNN) models mainly learn features implicitly to establish an accurate
correspondence more effectively. In either case, supervised and unsupervised learning
algorithms may be used to improve matching and classification accuracy. In
addition, discriminative information from images and video frames plays a vital role in
pedestrian analysis tasks related to recognition, matching, and classification.

Both approaches preferably use the full-body appearance or body parts of a pedestrian as
input to model the appearance for a specific task. Since the full-body
appearance of a pedestrian comprises strong clues such as clothing style, carried
items, body shape, color, and texture details [39], these clues play an important role
in designing robust solutions for different visual content analysis tasks.
These tasks include but are not limited to 1) pedestrian detection, which automatically
detects and classifies pedestrians or non-pedestrians in images or videos [40], 2) pedestrian
ReID, which performs image recognition across multiple cameras [41-43], 3) pedestrian
orientation analysis, which classifies body/head poses and direction of movement [44],
4) pedestrian gender classification, which recognizes gender using the full body or its parts
[15, 45-49], and 5) pedestrian action classification, which classifies different actions of a
pedestrian such as walking, fighting, running, and sitting [50-52].

Figure 1.3: Full-body appearance based pedestrian analysis using features-based descriptors and deep learning models

In the present era, these tasks are still considered challenging because
matching and classification accuracies are usually affected by intra-class and
inter-class variations, appearance diversity, and environmental effects [53]. In
light of the aforementioned discussion, this dissertation deals with full-body appearance
based pedestrian analysis using a blend of traditional and deep learning techniques for
two tasks: 1) person ReID - how to re-identify the person of interest from static images
with improved performance? and 2) pedestrian gender classification - how to predict
pedestrian gender in large-scale as well as imbalanced and small sample space (IB-SSS)
datasets? It is worth mentioning that the posteriori mode of processing is used to perform
these tasks. A foreword about both tasks is given in the following subsections.

1.2.1.1 Pedestrian Re-identification

Pedestrian ReID is an important task in pedestrian analysis which aims to
recognize/match a person across multiple camera views. This task is different from
classical identification and detection tasks as it not only answers whether
a given image matches a gallery image but also finds when and where this
person appeared in each camera of a network. The full-body appearance based person
ReID process comprises different steps, as shown in Figure 1.4.

Figure 1.4: Full-body appearance based person ReID scenario

In the first step, the gallery set {𝐺} is formed by collecting cropped pedestrian images
from different camera scenes. To represent the gallery images precisely, multiple features
(shape, color, texture, deep CNN, etc.) are extracted. Then, similarity measures or
optimal metric learning methods are utilized to find an accurate image pair from the
gallery set against each probe image {𝑃𝑖}. Finally, the ranking of accurate matches is
presented as the empirical result. For example, for the query images shown in Figure 1.4,
the red boxes at positions 1, 5, and 48 indicate matches retrieved within rank-1, rank-10,
and rank-50, respectively.
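
To make the matching step concrete, the following minimal sketch (in Python with NumPy) illustrates how a probe feature vector can be ranked against a gallery using a plain Euclidean distance. It is only an assumed illustration: the proposed FCDF method relies on its own features, clustering, and the Quadratic-Chi measure rather than this simple distance, and all names, shapes, and values below are hypothetical.

import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    # Distance between the probe vector and every gallery vector (smaller = more similar).
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    # Gallery indices sorted from best to worst match; index 0 is the rank-1 match.
    return np.argsort(dists)

def retrieval_rank(probe_feat, gallery_feats, true_index):
    # 1-based rank at which the correct gallery image is retrieved for this probe.
    order = rank_gallery(probe_feat, gallery_feats)
    return int(np.where(order == true_index)[0][0]) + 1

# Hypothetical example: 100 gallery images represented by 128-D feature vectors.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 128))
probe = gallery[7] + 0.05 * rng.normal(size=128)      # a noisy second view of identity 7
print(retrieval_rank(probe, gallery, true_index=7))   # ideally prints 1 (a rank-1 match)

Counting, over all probes, the fraction whose retrieval rank is at most r yields the cumulative matching characteristics (CMC) values reported as rank-1, rank-10, and so on.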

The research task of person ReID aims to find a probe person within a certain vicinity
(gallery set), which is of great significance in the visual surveillance field for tracking
a lost person and verifying a certain individual.

1.2.1.2 Pedestrian Gender Classification

Pedestrian gender classification is one of the interesting tasks in the pedestrian analysis
domain. It considers the pedestrian's full-body appearance, consisting of clues such as
worn items, color, hairstyle, carried items, and poses [54]. Usually, gender
classification is desirable to analyze pedestrians' activities and behaviors based on their
gender. In manual settings, an individual performs pedestrian gender
recognition based on the clothes, hairstyle, body shape, voice, gait, skin color, and facial
looks of a target person [55]. In automatic settings, numerous techniques have been
proposed for gender recognition based on facial characteristics [56-59] and the full-
body appearance of pedestrians [45, 46, 49]. The face-based gender classification
techniques require the complete face of a person for gender recognition. However, these
techniques fail when the camera acquires an image from the left, right, or back side of
the pedestrian. On the other hand, appearance based pedestrian gender classification
techniques use hairstyle, carried items, body shape, shoes, and coat as strong
cues. Such cues provide detailed information about the pedestrian image.
Hence, full-body appearance based gender classification is more reliable than face-
based gender classification; here, a model is first trained using the given training samples,
and then a test (unknown) image is supplied to the trained model for prediction of
gender as male or female, as shown in Figure 1.5. The full-body appearance based
pedestrian gender classification techniques are mainly divided into two types: 1)
traditional techniques that adopt feature extraction and feature selection followed by
classification [60], and 2) deep CNN based techniques [61]. Both types use full-body
or parts-based images of the pedestrian as input for gender classification. The traditional
and deep CNN approaches for pedestrian gender recognition have individually shown
promising results.

Figure 1.5: Full-body appearance based pedestrian gender classification scenario

There are still classical research issues to be addressed, for instance, discriminative
feature representation, low classification accuracy, and SSS for model learning.
Traditional approaches are effective for examining the low-level information of an image,
whereas deep CNN based features are more appropriate in case of large variations in
pedestrian poses, illumination changes, and diverse appearances. Hence, deep
learning based approaches have more dominant feature representation capabilities for the
given images. A detailed description of both techniques is provided in Chapter 2.
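
As a simple illustration of the traditional pipeline described above, the following sketch (an assumed example, not the exact method proposed in this thesis) extracts HOG features from full-body crops and trains a linear SVM to predict male/female labels; the window size, HOG parameters, and label encoding are hypothetical choices.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def body_descriptor(image_gray):
    # Resize every crop to a fixed 128x64 window so all HOG vectors have the same length.
    patch = resize(image_gray, (128, 64), anti_aliasing=True)
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_gender_model(train_images, train_labels):
    # train_labels: 0 = female, 1 = male (hypothetical encoding).
    X = np.stack([body_descriptor(im) for im in train_images])
    clf = LinearSVC(C=1.0)
    clf.fit(X, np.asarray(train_labels))
    return clf

def predict_gender(clf, test_image):
    label = clf.predict(body_descriptor(test_image).reshape(1, -1))[0]
    return "male" if label == 1 else "female"

Replacing the HOG descriptor with deep CNN features, or fusing both representations, follows the same train-then-predict structure.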

1.2.2 Applications of Appearance based Pedestrian Analysis

Pedestrian analysis has various applications including human activity analysis, person
retrieval, area monitoring, and people behavior analysis in the domains of robotics,
entertainment, security, and visual surveillance. More specifically, the person ReID
task has a wide range of practical applications such as cross-camera person tracking,
tracking by detection, and long-term human behavior and activity analysis, whereas the
task of pedestrian gender classification is useful for population statistics, demographic
statistics, and population prediction.

1.3 Research Challenges


Several research challenges exist in pedestrian analysis. Many of these are common to
both tasks, person ReID and pedestrian gender classification, whereas a few of them are
faced by only one task. However, the influence of these challenges may vary across the
tasks. A detailed discussion on these challenges is provided in the
following subsections.

1.3.1 Physical Variations of Individuals

People's appearance may be similar or different in terms of color, texture, carried
items, body size, clothing, and style. The case of similar appearance becomes a
challenge in gender classification. For instance, it is hard to recognize the gender of a
person in a crowd when people are wearing a uniform, as the similarity of two individuals
increases the risk of false positives. According to the relevant literature, invariant
information such as color, shape, and texture successfully handles the physical
variations of an individual. This problem is significant but not critical, whereas the case
of different appearances can serve as a cue for pedestrian ReID.

1.3.2 Illumination Variations

Varying illumination and shading in different environments can affect the performance
of a task that depends on lighting conditions. For pedestrian gender classification,
this is a moderate-level issue because it depends on the feature extraction
schemes selected for modeling. According to the related literature, existing methods handle
this problem at a satisfactory level. This issue is critical for person ReID, especially in the
case of non-overlapping FoVs. For instance, the same person captured with two
cameras under different lighting conditions may have two different appearances. This
phenomenon becomes more critical in the case of poor and differing lighting conditions.

1.3.3 Viewpoint Changes

Different shapes can be observed with different aspect ratios, depending on the angle of
the camera. For the gender classification task, this is a crucial problem because a variation
in viewpoint implies a variation in the appearance and size of persons. Similarly, this issue
is also important for person ReID. For instance, since the properties perceived under
CAM1 FoVs may not be available under CAM2 FoVs, a person can be perceived
with different sizes and aspect ratios, which may cause failure in person ReID.

1.3.4 Pose Changes

Different poses can be observed depending on the direction in which people are
moving, where a pose variation implies a variation in appearance. For person ReID,
this is a challenging issue. For instance, the direction perceived under CAM1 FoVs may
not be available under CAM2 FoVs; therefore, a person can appear at a different angle,
causing person ReID to fail. Similarly, in gender classification, the left, right, back,
and side views of a pedestrian are more challenging than frontal views.

1.3.5 Body Deformations

Deformations in body shape may negatively affect the output when a task is highly
dependent on a person's structure. As appearance based methods rely on the full-body
image of a person, this problem influences both tasks. For gender classification, local
and global features based on shape characteristics provide a way to handle this issue,
but they fail when some important clues are missing due to the non-availability of certain
body parts in the given images. For the person ReID task, the extracted/learned properties
of a standing person in CAM1 may be insufficient if the person appears in CAM2 at a
different angle. In this scenario, a few features of interest may be observed at a different
location than those extracted/learned, which may cause difficulty in person matching.

1.3.6 Occlusion

Sometimes people are partially or completely occluded due to 1) objects or items they
are carrying, 2) overlap with another person or object, and 3) environmental structures
that are permanently/temporarily present under specific FoVs.
For both tasks, this problem influences the performance in the case of partially occluded
pedestrian images. Gender prediction becomes difficult and relies
highly on the selected structures to extract reliable information for pedestrian gender
classification. For example, if all the required information is available/visible outside the
occluded part of an input image, gender classification can still be performed; if only a
few properties are present, the outcome depends on how gender classification is performed.
For the person ReID task, this issue also affects recognition rates in the case of partial
occlusion, whereas the full occlusion case is not considered for ReID because the target
person is not available in another camera. Person ReID is performed only when
the target person returns and is observed under different FoVs. This is a key issue in
ReID, and if important information is missing due to a partially occluded image,
person ReID suffers.

1.3.7 Different Camera Settings

This problem concerns both gender prediction and person ReID. The state-
of-the-art gender classification and person ReID methods use the posteriori mode of
processing, which is highly dependent on camera settings such that images are
captured/stored with different qualities (e.g., low, medium, etc.). In this scenario, the
camera response is one of the most critical acquisition conditions under different
camera settings. In a multi-camera network, it is not certain that all cameras are of the same
model, and even if they are, the sensors can have small or significant variations in their
responses. In this situation, a differing color response is a key issue that arises under
different camera models or settings. For example, the same person with the same clothes
can be rendered in different ways by two different cameras. The aforementioned challenges
have a significant impact on the performance of both pedestrian gender classification
[45, 48, 49] and person ReID [62-64] tasks. In addition, overall and class-wise
accuracies decrease due to an imbalanced distribution of data, whereas model learning
is another problem when applying SSS datasets in both tasks. Figure 1.6 shows sample
images of pedestrians where these challenges can be observed more clearly.

Figure 1.6: Sample images showing challenges in pedestrian analysis

1.4 Problem Statement


In accomplishing the tasks of person ReID and pedestrian gender classification,
recognition rate and accuracy mostly suffer from challenging issues including physical
variations of individuals, illumination variations, viewpoint changes, pose changes,
body deformations, occlusion, and different camera settings. For person ReID, an
exhaustive gallery search is required to find the target image against each probe image
[65-68], whereas an imbalanced distribution of data is a research issue in gender
classification [45, 46, 69]. In addition to these concerns, model learning is another problem
when using SSS datasets in both tasks [65, 70-73]. In existing studies, feature
engineering has been proposed to deal with these problems in both tasks; however, the
relevant literature reveals that there is a need to design more robust feature
extraction/learning, feature selection, and feature fusion strategies that yield
powerful and rich features for person ReID and pedestrian gender classification with
improved performance [15, 45, 46, 48, 49, 74-76].

1.5 Research Objectives


In the domain of appearance based pedestrian analysis, this research work sets the
following objectives to:

1. Perform feature engineering using proposed feature subsets such as
handcrafted and deep features.
2. Design a robust framework for ReID of pedestrian appearance under different
camera settings to optimize gallery search and recognition rates.
3. Design a framework to classify gender using the full-body appearance of
pedestrians.
4. Improve accuracy and recognition rates under low-quality images, illumination
effects, diverse appearances, and even imbalanced and SSS training data.
5. Compute empirical results on standard datasets for both person ReID and
pedestrian gender classification and evaluate them against existing studies using
standard performance evaluation protocols.

1.6 Research Contributions


In view of the above-described challenges, this thesis focuses on the person ReID and
pedestrian gender classification tasks for pedestrian analysis. In this regard, three
methods are designed: one method is proposed for pedestrian ReID on static
images/video frames, and the other two methods are presented for pedestrian gender
classification, each with improved performance. These improvements depend on a few
major steps such as data preparation, feature extraction and selection, feature fusion,
classification, and matching. The main contributions made in this dissertation
correspond to the following major steps:
a) In the data preparation step of the proposed methodologies, image smoothing and
removal of noisy information are performed using bi-cubic interpolation and a
median filter, respectively. During image acquisition, noise and irrelevant
information are introduced into the images/video frames due to environmental
effects and different camera settings, resulting in false positives/mismatches.
Therefore, removing the noise or overcoming the environmental effects is an
important and necessary step. In addition, data augmentation is performed to
handle the class-imbalance problem for the classification task.
b) Feature extraction, selection, and fusion play an important role in pedestrian image
analysis. In this thesis, handcrafted features are used along with deep features,
as follows (an illustrative sketch of the fusion and selection step appears after this list).
• For the handcrafted features, color (HSV histogram and singular value
decomposition), texture (local extrema patterns), shape (HOG and
PHOG), and local maximal occurrence (LOMO) [77] features are extracted from
each image using its RGB, HSV, and grayscale color spaces.
• Features-based clustering is utilized based on the selected optimal
handcrafted features, where the whole gallery is split into k clusters to
optimize the searching process against each probe image; consequently,
each gallery image has a learned relationship with the kth cluster.
• The deep features are extracted using pre-trained CNN models such as
AlexNet, ResNet101, VGG19, InceptionResNetV2, and DenseNet201.
The deeply learned features effectively handle the inherent ambiguities
in pedestrian appearance(s) and provide better generalization capability
for an input image.
• The optimal features are selected using maximum entropy and PCA
methods. To create a single robust feature vector, the selected features are
fused using serial and parallel fusion methods.
c) In the classification step, handcrafted (color, texture, shape, and LOMO) and
deep (CNN) features are supplied to machine learning classifiers (k-nearest
neighbor (KNN), SVM, ensemble, and discriminant analysis) for pedestrian
gender classification. In the case of person ReID, these features are
supplied to the Quadratic-Chi (QC) cross-bin histogram distance measure to find
an accurate match for each probe image from a cluster instead of the whole gallery.
d) The experimental results of all methods stem from improving the
abovementioned major steps over existing methods, thereby supporting the
contributions of this study. In this regard, all methods obtained results that are
competitive with or better than state-of-the-art methods in the literature.
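
The sketch below, referred to in item b), is an assumed minimal illustration of serial fusion and PCA-based selection: per-image handcrafted and deep feature matrices are concatenated and then reduced to a compact representation. The matrix shapes and the number of retained components are hypothetical, and the actual proposed methods additionally use entropy-based selection and parallel fusion.

import numpy as np
from sklearn.decomposition import PCA

def serial_fuse(handcrafted, deep):
    # Serial fusion: concatenate the two feature matrices column-wise
    # (N x d1 and N x d2 -> N x (d1 + d2)).
    return np.concatenate([handcrafted, deep], axis=1)

def select_features(fused, n_components=256):
    # PCA keeps a reduced, decorrelated subset of the fused representation.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(fused), pca

# Hypothetical sizes: 500 images, 1000-D handcrafted and 2048-D deep features.
rng = np.random.default_rng(1)
handcrafted = rng.normal(size=(500, 1000))
deep = rng.normal(size=(500, 2048))
fused = serial_fuse(handcrafted, deep)            # shape (500, 3048)
selected, pca_model = select_features(fused)      # shape (500, 256)

At test time, the same pca_model.transform would be applied to the fused vector of an unseen image before classification or matching.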

1.7 List of Publications

1. Fayyaz, Muhammad, et al. "Person re-identification with features-based


clustering and deep features." Neural Computing and Applications 32, no. 14,
10519-10540, 2020. [Impact Factor: 5.606, Q1]
2. Fayyaz, Muhammad, et al. "J-LDFR: joint low-level and deep neural network
feature representations for pedestrian gender classification." Neural Computing
and Applications 33, 361-391, 2021. [Impact Factor: 5.606, Q1]
3. Fayyaz, Muhammad, et al. " PGC-FSDTF: Pedestrian gender classification on
imbalanced and small sample datasets using parallel and serial fusion of selected
deep and traditional features.” [Submitted]
4. Fayyaz, Muhammad, et al. "A review of state-of-the-art methods for full-body
appearance-based pedestrian gender classification/recognition." [Submitted]

1.8 Thesis Organization

This Ph.D. thesis is organized into five chapters. Figure 1.7 illustrates the organization
and a chapter-wise description is given below:

Chapter 1 presents a detailed introduction regarding visual content analysis for


surveillance and security. In this regard, appearance based pedestrian analysis tasks are
briefly discussed. Moreover, challenges and objectives in this work are also presented.

Chapter 2 gives a review of state-of-the-art techniques for person ReID and pedestrian
gender classification using machine learning approaches. This chapter also identifies the
research gaps in existing relevant studies.

Chapter 3 describes the proposed frameworks for appearance based pedestrian
analysis. It covers a detailed discussion of the three proposed approaches: 1) person
re-identification with features-based clustering and deep features, 2) joint low-level and
deep neural network feature representations for pedestrian gender classification, and 3)
pedestrian gender classification on imbalanced and small sample datasets using parallel
and serial fusion of selected deep and traditional features.

Chapter 4 presents the experimental results of the proposed methods. This chapter also
includes a detailed discussion and a comprehensive comparison with state-of-the-art
studies under three consistent headings: evaluation protocols, dataset presentation,
and result evaluation.

Chapter 5 provides the thesis conclusion along with future research directions.

Figure 1.7: Detailed overview of the thesis

Chapter 2 Literature Review
Literature Review

2.1 Introduction
In this chapter, relevant literature on appearance based pedestrian analysis is presented
for two scenarios, namely person ReID and pedestrian gender classification. For this
reason, the literature is divided into multiple sections that discuss the related
state-of-the-art methods. Figure 2.1 shows the organization of Chapter 2: the first section
presents literature on person ReID, and the second section covers the main developments
related to full-body and parts-based approaches for pedestrian gender classification. Both
tasks are concluded in a discussion and analysis section. This section-wise review gives
the reader a clear insight into the state-of-the-art methods for the given pedestrian analysis
tasks. A summary of the chapter is given in the last section.

Figure 2.1: Organization of chapter 2

2.2 State-of-the-art Methods for Pedestrian Re-identification


The available literature highlights various ways to re-identify a pedestrian image. In
this regard, person ReID methods are split into the following main categories:
appearance, metric learning, deep learning, dictionary learning, and graph learning
based methods. Appearance based ReID methods have become a hot topic in the feature
representation research domain.

Figure 2.2: General model for person ReID

The second category, metric learning based methods, learns an optimal
metric from a given set of features. This optimal metric learning process is also called
optimal subspace learning. Besides the traditional way of performing ReID, deep learning
based methods are also used for person ReID in the context of deep feature representation
and model learning. In addition, some other methods such as graph and dictionary learning
are also used to perform person ReID. The general model for person ReID is shown in
Figure 2.2.

2.2.1 Appearance based Approaches for Pedestrian Re-identification

Appearance based methods mainly focus on feature representation. In this regard,
many efforts have been carried out for robust feature representation that distinguishes a
person's appearance across changes in pose, view, and illumination. The main
categories of feature representations include color, texture, shape, saliency, and their
combinations. Zhao et al. [78] utilized a probabilistic distribution of salience which is
robust under non-overlapping camera views. In another work, the authors used learning
of mid-level filters from automatically selected clusters of patches for ReID [79]. Wang
et al. [80] presented a model in which discriminative fragments are first selected from
highly noisy frames of a pedestrian, and then distinct space-time features are extracted to
perform person ReID. To handle pose and illumination changes, Geng et al. [81]
divided the full-body image of a pedestrian into two parts, upper and lower. Then, parts-
based feature engineering is performed to select different features from these parts to
preserve salience. Moreover, color information is accurately represented by
considering the present regions and their adjacent regions in the salience color descriptor
(SCD).

For robust feature representation, Liao et al. [77] introduced LOMO features to fully
examine and maximize the horizontal occurrence of local features. An et al. [82]
formulated a robust canonical correlation analysis (ROCCA) method to match people
from different views in a coherent subspace. Local features are obtained to design a
codebook which comprises HOG, LBP, color, and HS histogram features [83].
Similarly, researchers applied multi-model feature encoding and hyper-graph
construction and learning [84] for person ReID. Wang et al. [68] formulated a descriptor
for ReID named fusion of multi-channel features (FMCF). In this descriptor, color
information is captured from the hue and saturation channels, shape information from
grayscale, and texture detail is computed from the value channel. A multiple kernel
subspace learning (MKSSL) strategy based method is designed to handle the complex
variations issue in person ReID [85].

Cho et al. [86] proposed a method that first estimates the target poses; then four
representative features are extracted from four different views, namely front, back, left,
and right. Later, ReID is performed by computing matching scores and weights. Zhang
et al. [87] presented the null Foley-Sammon transfer (NFST) method and proposed a
supervised model to learn a null discriminative space in order to address the SSS problem
in person ReID. They also extended this model to semi-supervised learning. Pedestrian
matching is performed using reference descriptors (RDs) produced with a reference set
[88]. The authors projected gallery and probe data into a regularized canonical correlation
analysis (RCCA) projection to learn the projection matrix, and an additional re-ranking
step is used to increase the recognition rates at different ranks. Furthermore, an 𝑙2
regularized sparse representation is adopted for matching in [89]. In this approach, the
stability of sparse coefficients is maintained for better performance. An et al. [19]
suggested a method in which appearance features are integrated with soft biometric
features to form the feature descriptor. Chahla et al. [90] applied prototype formation for
person ReID. The presented technique uses color categorization to handle appearance
variation in terms of clothing colors. In addition, for the robustness of the feature
descriptors, the proposed technique captures the relationship between color channels
using a quaternionic local binary pattern (QLBP).

Li et al. [91] proposed a method that extracts salient regions from an image of a person
and clusters the extracted regions by applying "log squares log density gradient
clustering". Similarly, Zhang et al. [92] used a color invariant algorithm for ReID that
exploits color to cluster the segmented areas of an image; however, it is
not robust against low-contrast images. An et al. [72] investigated multiple coding
schemes (LBP, HSV, LAB, and SCN) for intermediate data synthesis. They used group
sparse representation to produce particular intermediate data. Shah et al. [93] introduced
another approach that utilizes hexagonal SIFT and color information in combination
with time tree learning algorithms for robust feature representation; however, it is unable
to handle pose variations above 60°. Chu et al. [94] suggested a technique that splits the
image into local sub-regions along two directions to overcome the risks of mismatching
and pose variations; however, it does not handle illumination changes. Nanda et al. [95]
presented a three-fold framework to specifically solve the issue of illumination changes
but failed to tackle complex backgrounds and diverse pose variations. Moreover, they
extracted color and texture information from segmented body parts. Then, inlier-set based
ReID is performed [96] to handle partial occlusion, viewpoint, and illumination variations
effectively.

Gray et al. [97] considered color and texture channels, binning information, and
location for a combination of localized features for recognition. Bazzani et al. [98]
introduced the symmetry driven accumulation of local features (SDALFs) approach,
using symmetric and asymmetric perceptual principles to overcome environmental
variances in the appearance based ReID problem. To improve recognition rates, Ye et al.
[99] proposed an approach that combines the semantic attributes of body parts with
LOMO features. Semantic features also help to improve person ReID recognition rates
[100-102]. A few other appearance based methods also exploit the fusion of different
features such as ensembles of color invariant features, salient matching, kernel-based
features, camera correlation aware feature augmentation, and learned vs handcrafted
features [73, 103-106] for person ReID. Hashim et al. [107] formulated a method named
mask-improved symmetry-driven accumulation of local features (MSDALF) using
simultaneous image matching based on the stable marriage algorithm (SMA). They also
computed results using greedy matching (GM) and Hungarian matching (HM) in
combination with MSDALF. The summary of appearance based methods for person
ReID is given in Table 2.1.
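
Many of the color descriptors cited above are built on simple channel histograms. The following assumed sketch (using OpenCV) computes a normalized joint HSV histogram for a pedestrian crop as one such appearance feature; the bin counts are hypothetical and do not correspond to any specific published descriptor.

import cv2
import numpy as np

def hsv_histogram(image_bgr, bins=(8, 8, 8)):
    # Joint H/S/V histogram, flattened and L1-normalized so crops of different sizes are comparable.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    hist = hist.flatten().astype(np.float64)
    return hist / (hist.sum() + 1e-8)

Such histograms are typically concatenated with texture and shape descriptors before matching, as in several of the methods summarized in Table 2.1.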

Table 2.1: Summary of appearance based approaches for person ReID


Rank-1
Ref. Year Method Datasets
Results (%)
Patch matching is adopted and VIPeR 30.1
[78] 2013
salience is estimated. CUHK01 28.4
Effective patches are used in VIPeR 29.1
[79]
learning of mid-level filters. CUHK01 34.3
2014
HOG3D and optical energy PRID2011 28.9
[80]
profile over each frame. iLIDS-VID 23.3
Pedestrian body is divided into
upper and lower parts which
VIPeR 28.4
[81] are further divided into
iLIDS-VID 28.2
patches. Region based salience
for ReID is explored.
Used canonical correlation
2015 VIPeR 30.4
[82] analysis based technique to re-
CUHK01 29.7
identify pedestrian views.
Bag of words representation is
used which includes LBP,
[83] VIPeR 30.1
HOG, HS histogram and color
features.
Fusion of HOG, s-weight,
local dominant orientation, iLIDS-VID 31.7
[68]
local extrema pattern CUHK01 25.0
histograms.
Multi-pose model is used to 3DPeS 59.2
[86]
match the target pose for ReID iLIDS-VID 30.0
VIPeR 41.0
Used discriminative null space Market-1501 61.0
[87]
2016 of training data for ReID. CUHK03 62.5
PRID2011 35.8
Probe and gallery images are
projected into RCCA learned
space. RDs of gallery and
VIPeR 33.2
[88] probe are also calculated.
CUHK01 31.1
Then, identity of probe and
gallery images is determined
by comparing their RD.

Quantized mean value of Lab
and HSV color spaces, SDC, VIPeR 32.9
[89] and LBP features are CUHK01 31.3
extracted. CCA is used to learn PRID201 27.0
projected subspace.
LOMO and soft biometric
[19] features are used to re-identify VIPeR 43.9
the pedestrian views.
Color categorization, QLBP,
[90] 2017 and prototype formation is VIPeR 28.0
adopted for single-shot ReID
Multi-model feature encoding
VIPeR 44.5
[84] and multi-hyper graph fusion
GRID 19.8
for ReID.
LBP, HSV, LAB, and SCN
feature coding schemes are
VIPeR 34.5
[72] used to observe the influence
CUHK01 34.0
2018 of these schemes in ReID
process.
Salient region feature PRID201 52.3
[91]
extraction and clustering. MARS 62.1
Semantic partition is carried out
then hue weighted saturation, VIPeR 43.3
[96] CbCrY, texture, eight Gabor, iLIDS-VID 35.8
and thirteen Schimd filters are CUHK01 44.5
used.
2019
LOMO features are combined
with semantic attributes of VIPeR 44.7
[99] pedestrian image. Effectively QMUL-GRID 25.2
utilized the impact of upper PRID2011 28.2
and lower body parts.
Handcrafted vs learned
[73] features are obtained for CUHK01 35.0
person ReID.
HSV color histogram and
2020 VIPeR 40.4
modified features named
CUHK01 27.0
[107] MSDALF are extracted. SMA
iLIDS-VID 34.4
is used with simultaneous
PRID 16.5
matching.

2.2.2 Metric Learning based Approaches for Person Re-identification

Metric learning based person ReID is used to acquire a discriminative metric for
better performance. The process followed by metric learning methods considers feature
representation for metric learning; hence, the learned metric relies highly on reliable
feature representation. In this regard, existing efforts include partial least square
(PLS) [108], probabilistic distance comparison (PRDC) [109], local fisher discriminant
analysis (LFDA) [110], adaptive fisher discriminant analysis (AFDA) [111],
deformable patch based metric learning (DPML) [112], transferred metric learning
[113], and relative distance comparison (RDC) [114] approaches that learn an effective
metric for person ReID from reliable feature representations. For a feature representation
that is more stable against viewpoint changes, LOMO features are extracted and then a
cross-view quadratic discriminant analysis (XQDA) metric learning based method is
implemented [77]. Xie et al. formulated a method using adaptive metric learning
(AML) with probe specific re-ranking (PSR) [115]. Furthermore, Liu et al. [116]
proposed a method named multi-modality metrics learning (M3L) that attempts to
investigate the effect of multi-modality metrics in relation to long-run surveillance
scenes for person ReID. Region based metric learning is performed in [117] for person
ReID by using positive neighbors from imbalanced unlabeled data. Ma et al. presented
a method named triplet-based manifold discriminative distance learning
(TMD2L) [118] to learn a discriminative distance, which effectively handles the low
illumination issue in person ReID. Feng et al. [119] preferred a second order hessian
energy function because of its extrapolating power and richer null space and thus used
a metric learning based approach for person ReID. Jia et al. [120] suggested a semi-
supervised method to enable view specific transformations and update projections with
graph regularization. In addition, hessian regularized distance metric learning [119],
joint dictionary and metric learning [75, 121], and relative distance metric learning
(RDML) based on clustering centralization and projection vectors learning (CCPVL)
[122] achieved significant performance on publicly available datasets
for person ReID. A two stage metric learning method named QR keep it simple and
straightforward (QR-KISS) [123] is used to investigate the performance of different
feature descriptors such as LOMO and the salient color names based color descriptor
(SCNCD). To address the cross-domain issue in ReID, Gu et al. [74] proposed topology
properties preserved LFDA (TPPLFDA), which reliably projects the cross-domain data
into a lower dimensional subspace. Ma et al. [124] split the training data into positive
and negative pairs and proposed an extreme metric learning (EML) method based on
adaptive asymmetric and diversity regularization for ReID. A few other methods
addressed the ReID task by ranking methods [125, 126]. The existing metric learning
based methods for person ReID are described in Table 2.2. Furthermore, graph learning
based approaches have also achieved significant performance in the ReID process. For
example, in dynamic hyper graph matching (DHGM), the authors utilized content and
context information with metric learning for person ReID [76]. An et al. [127] proposed
a hyper graph matching method in which pairwise relationships are discovered at higher
order for both the gallery and probe sets. Similarly, the authors used multi-model feature
encoding and hyper-graph construction and learning [84] for person ReID.
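
Although the methods above differ in how the metric is learned, at test time most of them score an image pair with a Mahalanobis-style distance of the form d_M(x, y) = sqrt((x - y)^T M (x - y)) for some learned positive semi-definite matrix M. The sketch below is a generic assumed illustration of this scoring step only (it does not implement XQDA or any specific learner), with hypothetical toy dimensions and an identity M.

import numpy as np

def learned_metric_distance(x, y, M):
    # Mahalanobis-style distance under a learned matrix M;
    # with M = I it reduces to the ordinary Euclidean distance.
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))

# Hypothetical toy example with 64-D features and the identity matrix standing in for a learned M.
d = 64
M = np.eye(d)
rng = np.random.default_rng(2)
x, y = rng.normal(size=d), rng.normal(size=d)
print(learned_metric_distance(x, y, M))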

Table 2.2: Summary of metric learning based approaches for person ReID
Rank-1
Ref. Year Method Datasets
Results (%)
HSV and scale invariant
local ternary patterns VIPeR 40.0
[77] (SILTP) based features GRID 18.9
with XQDA metric CUHK03 52.2
2015 learning.
The clustering is integrated 44.4
SAIVT-SoftBio
with discriminant analysis 43.0
[111] PRID2011
to preserve sample diversity 37.5
iLIDS-VID
and local structure.
Common space is learned
34.1
and mapped, then hyper VIPeR
[127] 35.0
graph matching is applied CUHK01
for ReID.
Metric learning is
2016 performed using local
feature representations. 41.6
VIPeR
[112] Pose and occlusion issues 35.8
CUHK01
are handled by
implementing deformable
patch-based model.
Discriminative information 43.0
VIPeR
[115] is explored through better 62.5
PRID450S
deal with negative samples.
Clustering based multi-
2017
modality mining method is VIPeR
30.2
[116] used which automatically VIPeR+
29.3
determines the modalities PRID450S
of illumination changes.
VIPeR 16.6
CUHK01 18.5
Clustering centralization
3DpeS 24.5
[122] and projection vectors are
CAVIAR4REID 13.1
used for person ReID.
Town center 62.2
2018
Market-1501 26.9
Positive regions are CUHK01 39.6
assessed using region PRID450D 26.1
[117]
metric learning for person VIPeR 25.7
ReID. Market-1501 40.5

Manifold learning
technique is used to LIPS 57.8
[118] preserve intrinsic geometry LI-PRID2011 32.4
structure of low L1-iLIDS-VID 29.6
illumination data.
Applied two stage metric
VIPeR 42.2
learning method (QRKISS)
[123] PRID450S 57.1
2019 with different feature
CUHK01 44.9
descriptors for ReID.
Preferred second order
hessian energy function
because of extrapolating VIPeR 20.3
[119]
power and richer null space. CUHK01 21.9
Thus, used metric learning
approach for person ReID.
Labeled and unlabeled data, VIPeR 44.8
learned projections, built PRID450S 68.2
[120]
cross-view correspondence PRID201 35.4
for ReID Market-1501 74.8
Utilized the dictionary VIPeR 42.5
based projection PRID450S 64.1
[75]
transformation learning for CUHK03 60.5
ReID. QMUL-GRID 22.4
Split training data into two
VIPeR 44.3
groups of positive and
[124] PRID450S 63.5
2020 negative pairs then apply
GRID 19.47
EML for ReID.
Used single and multi-
source domains to transfer
VIPeR 42.3
[74] data and perform cross-
iLIDS-VID 33.8
domain transfer ReID with
TPPLFDA.
Context and content
information is investigated PRID2011 76.9
[76]
into graph and metric iLIDS-VID 36.8
learning approach for ReID.

2.2.3 Deep Learning based Approaches for Person Re-identification

Recently, applications of deep learning models showed promising performances in all


areas of research. In deep learning based methods, features are automatically learned in
a hierarchy, including learning complicated features directly from the data. For
example, Krizhevsky et al. [128] used eight layers deep CNN trained model for image
classification. In few other approaches, object detection and segmentation is also
carried out using an improved version of the same model with promising results [129].
Varior et. al [130] formulated a Siamese CNN model in which gaiting function is

27
utilized to highlight common patterns. The summary of existing deep learning based
methods for person ReID is presented in Table 2.3. Despite these, numerous CNN based
trained models are available that provide better discriminative features. For instance,
the authors in [131] proposed a deep features body and part (DFBP) based method in a
sensor network to obtain a discriminative feature. Zhang et al. [132] introduced a local
embedded deep feature (LEDF) based method for feature learning by extracting local
and holistic summing maps. Similarly, Huang et al. [133] proposed a model where
human body is used for deep feature learning. All these deep learning methods compute
the likeness scores of two input images, without any feature extraction process. Thus,
they have the potential to handle inherited appearance changes. For semi-supervised
feature representation, Xin et.al [134] presented a method using multiple CNN models
for clustering called multi-view clustering for accurate label estimation. To address an
image to video person ReID problem, feature extraction and similarity measure are
studied separately in [135]. Yuan et. al [136] proposed a deep multi instance learning
(DMIL) based approach to handle complex variations in pedestrian appearances.
Ordinal deep features are used to measure the distance [137]. This proposed method
achieved better recognition rate under diverse viewpoints. Hu et. al [138] combined the
DenseNet convolutional and batch normalization layer to design the lightweight image
classifier (LWC). Recently, pre-trained models such as ResNet50, AlexNet, VGG16,
and inceptionV4 are used and 47.1%, 25.4%, 37.2%, and 43.5% rank-1 recognition
rates are attained on CUHK01 dataset [139]. Similarly, the results are computed on
other datasets such as CUHK03, VIPeR, PRID, Market-1501, iLIDS-VID, and 3DPeS.
As an example, rank-1 recognition rates on CUHK01, iLIDS-VID, VIPeR datasets are
described in Table 2.3. A semantics aligning network (SAN) [64] is proposed which
comprises an encoder and a decoder; the encoder is used for ReID and the decoder is
applied to reconstruct dense semantics. A pooling fusion block is added in the proposed
pose invariant convolutional baseline (PICB) model to learn distinct features for ReID
[140]. Zhang et al. [43] used limited labeled data to learn cross-view features in the
suggested approach of similarity embedded cycle generative adversarial network
(SECGAN). Moreover, the researchers discussed a cross-modal feature generating and
target information preserving transfer network (CMGTN) for person ReID [141]. Lv et
al. [142] utilized learned spatio-temporal patterns with visual features in Bayesian
fusion for unsupervised cross-domain person ReID.

Table 2.3: Summary of deep learning based approaches for person ReID

Ref. | Year | Method | Dataset: Rank-1 results (%)
[130] | 2016 | Siamese CNN model for ReID. | Market-1501: 72.9; CUHK01: 63.9; VIPeR: 36.2
[135] | 2017 | Built an end-to-end deep CNN model to integrate feature and metric learning for ReID. | iLIDS-VID: 40.0; PRID2011: 73.3; MARS: 55.2
[136] | 2018 | Convolution feature representation is presented using multi DMIL. | VIPeR: 31.1; ETHZ: 77.7; CUHK01: 31.8
[142] | 2018 | Utilized spatio-temporal patterns and extracted visual features in the ReID process. | GRID: 64.1; Market-1501: 73.1
[137] | 2019 | Deep feature ranking is implemented for ReID. | CUHK01: 44.3; Market-1501: 83.1; Duke: 73.4
[138] | 2019 | Combined the DenseNet convolutional and batch normalization layers to design LWC for person ReID. | VIPeR: 44.3; CUHK01 (as source dataset): 46.7
[139] | 2019 | Multiple CNN models are trained to re-identify pedestrian images. | CUHK01: 47.1; VIPeR: 23.2; iLIDS: 36.2
[64] | 2020 | Addressed the semantic misalignment problem in large-scale datasets for person ReID. | CUHK03: 80.1; Market-1501: 96.1; MSMT17: 79.2; Duke: 87.9
[43] | 2020 | Learned cross-view features using cycle GAN. | VIPeR: 40.2; Market-1501: 62.3
[141] | 2020 | Designed a deep model to extract cross-modal features for person ReID while incorporating a target preserving loss into the model. | iLIDS-VID: 38.4; MARS: 41.2; PRID2011: 42.7

2.3 State-of-the-art Methods for Pedestrian Gender Classification


This section covers the state-of-the-art methods in pedestrian gender classification.
Depending on pedestrian body-parts/full-body, these classification methods are
categorized into face-based [143, 144], parts-based [48, 145], and full-body appearance
based gender classification [146, 147], as shown in Figure 2.3. In these types, gender
classification is performed using feature extraction, deep feature learning, features
selection and classification techniques. The existing studies using these methods of
gender classification are described in the following subsections.

Figure 2.3: Pipeline for full-body appearance and face-based gender classification

2.3.1 Face-based Approaches for Gender Classification

The face-based gender classification is entirely dependent on facial features [56-59].


Therefore, a clear and complete face image is required in this scenario for robust gender
prediction. The facial geometric features are captured from images and are used for
classifier learning to predict gender. Numerous face-based methods have been reported
in the last decade to classify gender. BenAbdelkader et al. [148] presented a technique
that analyzes the local regions of the provided face images to classify the gender.
Exploiting facial alignment techniques, Eidinger et al. [149] proposed a pipeline for
inferring facial attributes. Haval et al. proposed a method that extracts facial features
and then classifies their feelings using SVM [150]. However, these methods require
optimal extraction of geometric features of face images to achieve significant
classification results. In this regard, few existing approaches used low-level information
of given face image for gender prediction, for example, local binary patterns (LBP)
[151, 152], dense scale invariant feature transform (SIFT) features with shape contexts
[153], facial attribute features [154], HOG [154], facial photographic and sketches
[155], visual words [156] and LBP with biological inspired features (BIF) [157]. Divate
et al. [158] proposed a technique for gender recognition that considers different
biometrics of humans such as iris, face, gait, and fingerprints. Similarly, Ali et al.
effectively utilized low-level information of the input image for object and place
recognition [159]. Later, the computed low-level features are supplied to classifiers for
learning of discriminative model for instance, SVM [59, 160] and linear discriminant
analysis (LDA) [161]. Despite these approaches, CNN models attained superior results
in full face-based gender classification methodologies. For instance, Duan et al. [162]
presented an ensemble of features-based approach which consists of CNN and extreme
learning machine (ELM) for gender and age classification. Few other approaches which
performed gender classification using different deep CNN architectures are discussed
in [163-165]. The aforementioned discussion concludes that capturing a clear full-face
image in a real environment is a difficult task. Also, face-based gender recognition fails
when the camera acquires an image from the left, right, or back side of a pedestrian.
Alternatively, body-parts/full-body views of a pedestrian can be more useful for gender
classification as compared to face-based gender classification. In parallel, capturing
full-body appearances (images) of a pedestrian under specific FoVs in multi-camera
networks is easy.

2.3.2 Parts-based Approaches for Pedestrian Gender Classification

The existing approaches also consider different parts of body such as leg, torso, upper
and lower body to classify the gender. In this regard, Cao et al. [166] proposed part-
based gender recognition (PBGR) method using body patches and their characteristics.
They utilized raw pixels, edge map and HOG to characterize the gender image. In
addition, fine grid sampling is applied to partition the gender image into patches where
HOG features of each patch were extracted. Later, patch-based features are combined
and supplied to Ada boost and random forest (RF) classifiers for gender prediction. As
a result, PBGR outperformed in terms of mean accuracy (M-ACC) for frontal, back and
non-fixed (frontal and back) view images respectively. Moreover, they also presented
different feature and classifier combination based results, separately. In all experiments,
this approach adopted 5-cross validation to measure the M-ACC for restricted views of
gender such as frontal, back and mixed/non-fixed views. It is noted that researchers did
not consider the gender images with side view in experiments and it was the first
attempt where patch wise computed features were fed to Ada boost classifier to
investigate silhouette characteristics for pedestrian gender classification. Raza et al.
[145] trained a CNN model by utilizing upper-body foreground images of gender. As a
result, supplying upper-body images to the model produced slightly better overall
accuracy (O-ACC) and M-ACC on frontal, rear and mixed view images respectively.
Ng et al. [71] introduced a parts-based CNN model which investigates different regions
corresponding to the upper, middle and lower half of a pedestrian body image. Extensive
experiments are performed to increase the classification accuracy. Better results are
obtained because of the combination of CNN based global and local information.
However, overhead is found in the training of the CNN network for each region
separately. Yaghoubi et al. [49] suggested a region-based CNN methodology for gender
prediction on PETA and MIT datasets. They used body key points to estimate the
gender pose whereas head, legs, and torso based information is supplied to learn
appropriate classification models. The authors also utilized score fusion strategy which
produced better results on PETA dataset as compared to raw, head and polygon images
based trained networks. They suggested that segmented full-body image of pedestrian
is suitable for gender prediction. This technique has been evaluated on MIT dataset with
same settings, achieving better accuracies on frontal, rear and mixed view images. The
fine-tuned models rely heavily on extracted regions of interest and segmented body
parts, where performance may degrade for images with complex backgrounds.
Parts-based gender classification methods are summarized in Table 2.4.

Table 2.4: Summary of parts-based approaches using frontal, back and mixed view
images of a pedestrian for gender classification, (-) represents that no
reported result is available

Ref. | Year | Method | Dataset | Frontal (%) | Back (%) | Mixed (%)
[166] | 2008 | Raw information with Ada boost | MIT | 71.7 M-ACC | 61.5 M-ACC | -
[166] | 2008 | Edge map features with Ada boost | MIT | 71.1 M-ACC | 70.1 M-ACC | -
[166] | 2008 | HOG features with Ada boost | MIT | 70.9 M-ACC | 63.0 M-ACC | -
[166] | 2008 | Parts-based features with Ada boost | MIT | 71.9 M-ACC | 71.2 M-ACC | -
[166] | 2008 | Raw information with RF | MIT | 72.8 M-ACC | 60.2 M-ACC | -
[166] | 2008 | Edge map features with RF | MIT | 73.1 M-ACC | 63.9 M-ACC | -
[166] | 2008 | HOG features with RF | MIT | 73.1 M-ACC | 65.8 M-ACC | -
[166] | 2008 | Part-based features with RF | MIT | 73.2 M-ACC | 65.4 M-ACC | -
[166] | 2008 | PBGR with HOG and Ada boost | MIT | 76.0 M-ACC | 74.6 M-ACC | 75.0 M-ACC
[145] | 2017 | Upper body image and deep CNN model | MIT | 83.3 O-ACC / 80.5 M-ACC | 82.3 O-ACC / 82.3 M-ACC | 82.8 O-ACC / 81.4 M-ACC
[71] | 2019 | Model with CNN-1 | MIT & APiS mixed | 79.5 M-ACC | 85.0 M-ACC | 82.5 M-ACC
[71] | 2019 | Model with CaffeNet | MIT & APiS mixed | 80.0 M-ACC | 87.0 M-ACC | 83.8 M-ACC
[71] | 2019 | Model with VGG-19 | MIT & APiS mixed | 84.4 M-ACC | 88.9 M-ACC | 86.8 M-ACC
[49] | 2019 | Raw images supplied to base-net | PETA | 89.1 M-ACC | 89.9 M-ACC | 75.9 M-ACC
[49] | 2019 | Raw images supplied to frontal-net, rear-net, and lateral-net | PETA | 90.5 M-ACC | 93.0 M-ACC | 77.2 M-ACC
[49] | 2019 | Head images supplied to frontal-net, rear-net, and lateral-net | PETA | 88.7 M-ACC | 90.1 M-ACC | 77.3 M-ACC
[49] | 2019 | Polygon images supplied to frontal-net, rear-net, and lateral-net | PETA | 91.2 M-ACC | 91.4 M-ACC | 76.0 M-ACC
[49] | 2019 | Fusion strategy | PETA | 92.1 M-ACC | 93.5 M-ACC | 80.1 M-ACC
[49] | 2019 | Pedestrian gender recognition network (PGRN) using raw images, head-cropped and polygon-shape regions | MIT | 90.0 M-ACC | 87.9 M-ACC | 89.0 M-ACC

2.3.3 Full-body Appearance based Approaches for Pedestrian Gender


Classification

This section discusses those approaches that have performed gender classification using
full-body appearance of a pedestrian as an input. In this scenario, wearing items, body
shape, hairstyle, shoes, and coats are taken as strong cues for recognizing the gender of
an appearance based pedestrian image. In short, full-body appearance based pedestrian
gender recognition approaches are mainly split into two groups: i) traditional
approaches that adopt feature extraction followed by classification [60], and ii) deep
CNN based approaches [61]. According to relevant literature, deep CNN has attained
reliable results in many areas of computer vision under diverse appearance conditions
and different camera settings [167]. The traditional approaches compute different types
of features such as color, shape, texture, and deeply learned information to represent
gender images. For example, Collins et al. [168] proposed a gender classification
technique to examine different local features and their combinations with a linear SVM.
First, they cropped images of the MIT dataset to a size of 106 × 45 and then performed
the investigation on cropped and un-cropped images of size 128 × 64 for gender
prediction. The proposed strategy revealed average results on un-cropped MIT dataset
in terms of O-ACC using pyramid histogram of words (PHOW), PHOG, canny HOG,
local HSV (LHSV) color histogram, and pixel PHOG features respectively. The authors
utilized the VIPeR dataset for the first time for gender prediction and obtained satisfactory
performance in terms of class-wise accuracy (CW-ACC) and O-ACC on frontal view
images of pedestrian. However, features combinations with CHOG+LHSV and
PiHOG+LHSV attained better O-ACC on both datasets. Table 2.5 summarizes the
results in terms of CW-ACC and O-ACC on MIT and VIPeR datasets for gender
classification.

Table 2.5: Summary of existing methods results using full-body frontal view images of
pedestrian for gender classification with 5 cross-validation (male=123,
female=123 uncropped and cropped images of MIT dataset; whereas
male=292, female=291 uncropped images of VIPeR dataset)
Ref. | Year | Method | Images | Dataset | Male CW-ACC (%) | Female CW-ACC (%) | O-ACC (%)
[168] | 2009 | PHOW with SVM | un-cropped | MIT | 65.3 | 72.4 | 64.0
[168] | 2009 | PHOG with SVM | un-cropped | MIT | 56.0 | 68.0 | 56.9
[168] | 2009 | CHOG with SVM | un-cropped | MIT | 70.7 | 81.0 | 74.2
[168] | 2009 | LHSV with SVM | un-cropped | MIT | 73.1 | 54.0 | 60.9
[168] | 2009 | PiHOG with SVM | un-cropped | MIT | 80.2 | 83.0 | 79.1
[168] | 2009 | CHOG+LHSV with SVM | un-cropped | MIT | 80.2 | 82.6 | 75.8
[168] | 2009 | PiHOG+LHSV with SVM | un-cropped | MIT | 81.1 | 88.7 | 80.2
[168] | 2009 | PHOW with SVM | cropped | MIT | 70.3 | 63.0 | 63.7
[168] | 2009 | PHOG with SVM | cropped | MIT | 54.0 | 70.7 | 60.0
[168] | 2009 | CHOG with SVM | cropped | MIT | 73.8 | 77.5 | 72.5
[168] | 2009 | LHSV with SVM | cropped | MIT | 74.4 | 68.9 | 66.5
[168] | 2009 | PiHOG with SVM | cropped | MIT | 82.2 | 86.5 | 81.7
[168] | 2009 | CHOG+LHSV with SVM | cropped | MIT | 77.4 | 87.3 | 76.3
[168] | 2009 | PiHOG+LHSV with SVM | cropped | MIT | 78.8 | 95.9 | 84.1
[168] | 2009 | PHOW with SVM | un-cropped | VIPeR | 75.1 | 63.0 | 66.1
[168] | 2009 | PHOG with SVM | un-cropped | VIPeR | 72.1 | 50.5 | 57.9
[168] | 2009 | CHOG with SVM | un-cropped | VIPeR | 74.5 | 80.9 | 77.4
[168] | 2009 | LHSV with SVM | un-cropped | VIPeR | 77.8 | 74.8 | 73.2
[168] | 2009 | PiHOG with SVM | un-cropped | VIPeR | 80.1 | 80.1 | 78.4
[168] | 2009 | CHOG+LHSV with SVM | un-cropped | VIPeR | 80.5 | 82.9 | 78.5
[168] | 2009 | PiHOG+LHSV with SVM | un-cropped | VIPeR | 84.3 | 88.8 | 83.1

Guo et al. [60] presented an approach to handle pose changes (frontal and rear view
changes) in a better way for gender classification. They have used visualization of
biological inspired features (BIF) with two bands and four orientations, integrated with
various manifold learning methods. This approach utilized orthogonal locality
preserving projections (OLPP) based 117, PCA based 300, locality sensitive
discriminant analysis (LSDA) based 150, and marginal fisher analysis (MFA) based
117 features with linear SVM including 5-fold cross validation for view-based (frontal,
back, and mixed) gender recognition. Geelen et al. [169] utilized low-level information
of full-body image and extracted different features such as HSV, LBP, and HOG. They
also examined different combinations of these features for gender classification using
two supervised classifiers, SVM and RF. Antipov et al. [47] presented a
technique to examine familiar, unfamiliar and cross dataset for gender prediction. They
utilized HOG features and SVM classifier with linear kernel which produced poor
results in terms of area under curve (AUC) and M-ACC on unfamiliar dataset as
compared to familiar and cross datasets. From the relevant literature, the year-wise main
developments in the context of traditional approaches for full-body appearance based
pedestrian gender classification are summarized in Table 2.6.
Table 2.6: Summary of gender classification methods using handcrafted features and
different classifiers with full-body frontal, back, and mixed view images of
a gender, (-) represents that no reported result is available

Ref. | Year | Method | Dataset | Frontal (%) | Back (%) | Mixed (%)
[60] | 2009 | BIF, PCA with linear SVM | MIT | 79.1 O-ACC | 82.8 O-ACC | 79.2 O-ACC
[60] | 2009 | BIF, OLPP with linear SVM | MIT | 78.3 O-ACC | 82.8 O-ACC | 77.1 O-ACC
[60] | 2009 | BIF, LSDA with linear SVM | MIT | 79.5 O-ACC | 84.0 O-ACC | 78.2 O-ACC
[60] | 2009 | BIF+MFA with linear SVM | MIT | 79.1 O-ACC | 81.7 O-ACC | 75.2 O-ACC
[60] | 2009 | Gender recognition in each view, where BIF+LSDA is used for frontal and back views and BIF+PCA for mixed view images | MIT | 80.6 O-ACC (overall)
[169] | 2015 | HOG with linear SVM | MIT | 81.2 O-ACC / 76.9 M-ACC | 77.5 O-ACC / 72.6 M-ACC | 78.9 O-ACC / 75.9 M-ACC
[169] | 2015 | LBP with linear SVM | MIT | 78.4 O-ACC / 73.5 M-ACC | 77.7 O-ACC / 71.9 M-ACC | 76.1 O-ACC / 68.5 M-ACC
[169] | 2015 | HSV with linear SVM | MIT | 69.4 O-ACC / 65.0 M-ACC | 71.5 O-ACC / 63.5 M-ACC | 71.3 O-ACC / 64.8 M-ACC
[169] | 2015 | LBP and HSV with linear SVM | MIT | 77.6 O-ACC / 73.9 M-ACC | 78.3 O-ACC / 72.7 M-ACC | 77.6 O-ACC / 73.7 M-ACC
[169] | 2015 | HSV and HOG with linear SVM | MIT | 81.6 O-ACC / 79.0 M-ACC | 80.5 O-ACC / 74.1 M-ACC | 80.9 O-ACC / 75.3 M-ACC
[169] | 2015 | HOG and LBP with linear SVM | MIT | 81.2 O-ACC / 76.6 M-ACC | 80.3 O-ACC / 75.5 M-ACC | 79.8 O-ACC / 76.6 M-ACC
[169] | 2015 | HOG, LBP and HSV with linear SVM | MIT | 81.0 O-ACC / 73.9 M-ACC | 82.7 O-ACC / 79.3 M-ACC | 80.1 O-ACC / 76.7 M-ACC
[47] | 2015 | HOG features with SVM | Cross-dataset | 88 AUC, 80.0 M-ACC (overall)
[47] | 2015 | HOG features with SVM | Familiar | 84 AUC, 72 M-ACC (overall)
[47] | 2015 | HOG features with SVM | Unfamiliar | 64 AUC, 82.0 M-ACC (overall)

Although low-level feature representation using handcrafted features is robust against
pose and illumination changes, it is difficult to extract discriminative
information in the case of diverse pedestrian appearances. Therefore, low-level
information of the pedestrian full-body image needs further investigation to acquire
more distinct and optimal information for gender classification. Deep CNN models are
more suitable than handcrafted features to handle issues such as diverse appearances
and low-resolution images [170, 171]. The popularity of CNN is because of
remarkable improvements in the accuracy of different classification tasks [172-174]. In
few other research areas, deep neural network based methods are also used to improve
the performance of different tasks for instance long short term memory (LSTM)
architecture for breast cancer classification [175], facial expression recognition [176],
classification and recognition of action, text, face, image and speech etc. [177].
Moreover, already trained deep CNN models provided an effective way to extract the
features automatically and utilized as deep features [178]. Subsequently, the integration
of deep features with handcrafted features is also reported in existing studies such as
facial expression recognition [179, 180] and MRI brain scan [181] to improve the
classification and recognition results. Similarly, the relevant literature reports few
existing research studies which utilize deep CNN trained models for gender prediction.
In this regard, Ng et al. [61] exploited a CNN model for the gender classification problem.
This model utilizes two convolution layers to obtain feature maps; 1) a sub-sampling
layer for downsampling and 2) a fully connected (FC) layer before the output layer for
classification. According to model settings, a CNN network is trained for gender
prediction on MIT dataset and achieved acceptable O-ACC. The model performed
better for small-sized homogeneous datasets. Ng et al. [182] investigated full-body
pedestrian images with different color spaces including grayscale, RGB, and YUV for
image representation. Later on, these representations are fed to CNN network separately
as an input to train a model for gender classification. The method obtained promising
results on MIT dataset with grayscale image representation as compared to RGB and
YUV. In another study, the authors investigated a training strategy [183] for gender
prediction. This strategy, followed by pre-training, showed good results under the
limitation of a small amount of labeled training data. Antipov et al. [47] presented a
method to investigate handcrafted and learned feature extraction schemes for gender
classification using SSS familiar, unfamiliar and cross datasets. This study showed that
both feature extraction schemes equally performed on SSS homogeneous dataset but
learned features gave better results on unfamiliar datasets. Raza et al. [145] initially
examined pedestrian parsing on pedestrian images using deep de-compositional neural
networks (DDNN). The technique found two types of parsed images having foreground
views. Later on, the parsed images of pedestrian are given to the proposed CNN model
for view-wise gender classification. In another study, the principal investigator and his
team extended the previous work and utilized stack sparse auto encoder (SSAE) [46]
for gender classification based on full-body appearance of pedestrian. Initially, HOG
based feature map is computed and then supplied to the pedestrian parsing phase to
create a silhouette of an input image. The output of parsing phase in the form of a binary
parsed pedestrian’s map is fed to two layers SSAE model followed by a softmax
classifier to predict pedestrian image as male or female. Cai et al. [45] presented a
hybrid technique named HOG assisted deep feature learning (HDFL), which examines
deeply learned features and weighted HOG features for gender classification.
According to relevant literature, HDFL has achieved the best results in terms of AUC.
In another study, they presented a fusion method called deep-learned and handcrafted
feature fusion network (DHFFN) [15]. The authors achieved significant results by
combining handcrafted and deep characteristics of an input image. Further, to overcome
variations in scene and viewpoint for PGC, they also proposed a cascading scene and
viewpoint feature learning (CSVFL) method and showed considerable results with the
combination of deep and handcrafted characteristics of the input image [69]. Ng et al. [71]
introduced a methodology to investigate full-body of a pedestrian for gender prediction.
The extensive experiments are performed using CNN-1, CNN-2, CNN-3, CaffeNet, and
VGG-19 to increase the classification accuracy. Better results are attained with VGG-
19 model. From the relevant literature, year-wise main developments in the context of deep
learning and hybrid approaches for full-body appearance based pedestrian gender
classification are summarized in Table 2.7.

Table 2.7: Summary of deep learning and hybrid methods for pedestrian gender
classification with full-body frontal, back, and mixed view images of a
gender, (-) represents that no reported result is available

Ref. | Year | Method | Dataset | Frontal (%) | Back (%) | Mixed (%)
[61] | 2013 | Deep CNN | MIT | 80.4 O-ACC (overall)
[182] | 2013 | Deep CNN with gray-scale, RGB and YUV inputs | MIT | 81.5 O-ACC (overall)
[47] | 2015 | Deep learning (AlexNet-CNN) | Cross-dataset | 90 AUC, 82.0 M-ACC (overall)
[47] | 2015 | Deep learning (AlexNet-CNN) | Familiar | 91 AUC, 85.0 M-ACC (overall)
[47] | 2015 | Deep learning (AlexNet-CNN) | Unfamiliar | 85 AUC, 79.0 M-ACC (overall)
[47] | 2015 | Deep learning (Mini-CNN) | Cross-dataset | 86 AUC, 79.0 M-ACC (overall)
[47] | 2015 | Deep learning (Mini-CNN) | Familiar | 88 AUC, 80.0 M-ACC (overall)
[47] | 2015 | Deep learning (Mini-CNN) | Unfamiliar | 80 AUC, 75.0 M-ACC (overall)
[183] | 2017 | Deep CNN-e configuration | MIT | 81.5 O-ACC (overall)
[145] | 2017 | Full-body image forwarded to CNN model | MIT | 82.1 O-ACC / 81.1 M-ACC | 81.3 O-ACC / 81.7 M-ACC | 82.0 O-ACC / 80.7 M-ACC
[45] | 2018 | Deep CNN, HOG assisted features | PETA | 95 AUC (overall)
[15] | 2018 | Deep CNN, HOG, PCA | PETA | 95 AUC (overall)
[46] | 2018 | SSAE | PETA | 92 AUC (overall)
[46] | 2018 | SSAE | MIT | 82.9 O-ACC / 80.4 M-ACC / 89 AUC | 81.8 O-ACC / 80.8 M-ACC / 90 AUC | 82.4 O-ACC / 81.6 M-ACC / 89 AUC
[71] | 2019 | VGG-19 model | MIT & APiS mixed | 85.38 M-ACC (overall)
[69] | 2021 | Scene and viewpoint feature learning | MIT | 84.4 O-ACC | 85.9 O-ACC | 85.2 O-ACC
[69] | 2021 | Scene and viewpoint feature learning | VIPeR | 81.9 O-ACC | 84.7 O-ACC | 80.1 O-ACC
[69] | 2021 | Scene and viewpoint feature learning | PETA | 92.4 O-ACC | 94.6 O-ACC | 92.7 O-ACC

2.4 Discussion and Analysis


Although appearance, metric learning and deep learning based methods perform well
on existing person ReID benchmarks, appearance based methods are still not reliable
and distinctive enough under critical changes across a multi-camera network. Moreover,
in these methods, the probe/test image is matched against the whole gallery, which leads
to an extensive search and a low recognition rate. Comparatively, deep learning based ReID
approaches are constrained by SSS for model learning. For person ReID, our focus is
to select the optimal features subset (OFS) from the presented features based on their
maximum entropy values. Moreover, the objective is to choose such a feature
combination that does not lead to overfitting and provides distinct feature
representation. The major differences of proposed FCDF method from existing
approaches are 1) application of optimal features-based clustering to form 𝑘 consensus
clusters, 2) extraction of deep features from clustered sample, and 3) use of similarity
measures for probe matching within the classified cluster or nearest neighbor consensus
cluster(s) instead of whole gallery. The said differences contribute to gallery search
optimization and improve the recognition rate at different ranks. The traditional and
deep CNN approaches for pedestrian gender recognition showed promising results
individually, but they are still facing classical issues, for example, discriminative
feature representation, lower classification accuracy, and SSS for model learning. While
traditional approaches are effective for examining the low-level information of an image,
deep CNN based features are more appropriate in the case of large variations in
pedestrian poses, illumination changes, and diverse appearances, and they have more
dominant feature representation capabilities for a given image. In addition, a fusion
strategy effectively characterizes the image information for gender classification.
However, even with a fusion strategy, low-level information of an input image alone is not
sufficient to handle issues such as low image resolution, viewpoint changes,
and diversity in pedestrian appearances, so considering high-level information is
crucial. Moreover, SSS datasets have a class-wise imbalanced distribution of data which
directly affects the classification performance in terms of O-ACC as well as the accuracy
of the minority class. Considering these challenges, there is still a need to design a
robust method that effectively automates pedestrian gender recognition process. In this
regard, two methods named J-LDFR and PGC-FSDTF are proposed for gender
prediction where we have performed investigations using different low-level feature
extraction schemes and deeply learned features of already trained CNN models. The
features selection and fusion methods are also incorporated in proposed methodologies
for dimensionality reduction, optimal features selection and compact representations.
The first PGC method is tested on large-scale and small-scale datasets whereas the
second PGC method is evaluated on thirteen imbalanced, augmented balanced, and
customized balanced datasets, such that both methodologies successfully attain
competitive performance with existing methods.

2.5 Summary
This chapter provides a comprehensive review of existing work for both research tasks
of pedestrian analysis and covers the discussion on state-of-the-art methods with recent
developments in person ReID and pedestrian gender classification. Similarly, feature
extraction/learning, features selection, classification, and matching related to various
benchmark techniques for both tasks are described and tabulated with evaluation
metrics and datasets used in experimentation. After extensive study of literature, we
have developed three methodologies for appearance based pedestrian analysis as
discussed in chapter 3.

Chapter 3: Proposed Methodologies
3.1 Introduction
This chapter presents proposed methodologies for appearance based pedestrian analysis
using image processing and machine learning approaches. The prime concern of these
methodologies is to re-identify and classify pedestrian full-body image using normal
and low-quality images with improved performance. Figure 3.1 shows the organization
of chapter, three proposed methodologies, and their highlights. The selected tasks
demonstrate that full-body appearance of pedestrian is to be analyzed using proposed
methodologies for person ReID and pedestrian gender classification in this research
work.

Figure 3.1: Block diagram of chapter 3, proposed methodologies with highlights

All the proposed methodologies as per the general proposed work are listed below, and
details are discussed in the coming sections of this chapter.

• Method for person ReID with features-based clustering and deep features.
• Method for pedestrian gender classification using joint low-level and deep CNN
features representation.
• Method for pedestrian gender classification on imbalanced and SSS datasets
using parallel and serial fusion of selected deep and traditional features.

3.2 Proposed Method for Person Re-identification with Features-


based Clustering and Deep Features (FCDF)
A novel appearance based method for person ReID is presented in this section. It is
comprised of three modules, 1) feature representation, 2) feature clustering and 3)
feature matching. Figure 3.2 illustrates the functionality of each module, graphically.
To address the issues of ReID, an appearance based framework for person ReID with
features-based clustering and deep features (FCDF) is proposed. The feature
representation scheme is introduced which considers the properties of each channel of
RGB and HSV color spaces, texture details, and spatial structural information of input
image. The features are then fused and distinct features are selected by applying
maximum entropy. The features-based clustering is carried out to optimize gallery
search. The response of a fully connected (FC) layer of deep CNN model integrated
with shape, color, and texture information is also used to ensure the robustness of final
feature vector (FFV) for ReID. Four types of features are computed on each gallery
image where three features including color, texture, and shape are traditional and one
of them is deep feature. The key contributions are given below:

1. A new appearance based framework called FCDF is proposed.


2. Three types of traditional features (e.g. color, texture, shape) are computed from
RGB and HSV color spaces and a novel FFS method is proposed to select an
OFS.
3. Features-based clustering is used based on selected optimal features where
whole gallery is split into 𝑘 clusters to optimize the searching process against
each probe image; consequently, each gallery image has a learned relationship
with kth cluster. In addition, deeply learned features of each cluster sample are
computed to handle the inherited ambiguities in appearance(s) more effectively.
4. The application of cluster-based probe matching instead of whole gallery
matching in the proposed FCDF framework improves the recognition rate at
different ranks when tested on three commonly used challenging datasets.

The objective of the feature representation module is to extract color, texture, and shape
features of each gallery image which, once put into the FFS method, generate an OFS without
altering or losing vital features. The purpose of the feature clustering module is to split gallery
images into k consensus clusters in an efficient manner. In addition, it is responsible to
extract deep features of each clustered sample for handling the inherited ambiguities of
person(s) appearance. Finally, feature matching module aims to find an accurate match
against probe image from the filtered gallery subset (consensus cluster). A detailed
description of these modules is given in the following subsections.

Figure 3.2: FCDF framework for person ReID consisting of three modules, where a)
feature representation module is applied to compute different types of
features from R, G, B, H, S, and V channels, and then optimal features are
selected using novel FFS method, b) feature clustering module is used to
split whole gallery into different consensus clusters for gallery search
optimization, whereas deep features of each cluster sample are also
examined, and c) feature matching module includes classification of
corresponding cluster(s), probe deep features and finally similarity
measure is applied to obtain recognition rates at different ranks

3.2.1 Feature Representation

In the process of person ReID, discriminative appearance cues are integrated to form
a robust feature representation. The feature representation module extracts
handcrafted features of each gallery image to generate an OFS by putting these features
into FFS method while taking care of not altering or losing vital features. It exploits
color, texture patterns, orientation detail, and spatial structural information for feature
encoding. The module is responsible to perform three functionalities of feature
extraction, fusion and selection, explained in the following subsections.

3.2.1.1 Color Feature Extraction

To extract color features, two commonly used measures including mean and variance
are considered for feature extraction using RGB and HSV color spaces. To compute
color features, in the very first step, color spaces are separated into their respective
channels such as red, green, blue, hue, saturation, and value respectively. The sample
mean and variance of each channel are formulated through Eq. (3.1) and Eq. (3.2).

\mu = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \xi_{i,j}}{M}   (3.1)

\sigma^{2} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} (\xi_{i,j} - \mu)^{2}}{M}   (3.2)

where 𝜇 and 𝜎² denote the sample mean and sample variance of each extracted channel
and 𝜉𝑖,𝑗 is the related channel of the utilized color spaces. The parameters 𝑖 and 𝑗 index the
rows and columns of each channel, and 𝑀 = 𝑚 × 𝑛 is the total number of elements over
the 𝑚 rows and 𝑛 columns. These extracted features have a few
challenges such as a high correlation between color channels and perceptual non-
uniformity [184]. To overcome these challenges, standard deviation (SD) and singular
value decomposition (SVD) are further calculated for each channel of selected color
spaces. The formulation of these channels for SD features is defined by Eq. (3.3).

\sigma = \sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{n} (\xi_{i,j} - \mu)^{2}}{M}}   (3.3)

where 𝜎 depicts the standard deviation of each input channel. Moreover, SVD based
structural projections [185] are applied as follows. Let 𝑚 × 𝑛 be the dimensions of a real
matrix 𝑀 used as an input channel 𝐻, where 𝑚 and 𝑛 are the rows and columns of the input
channel/matrix. Here 𝑀 is expressed using Eq. (3.4).

𝑀 = 𝑈𝑆𝑉 𝑇 (3.4)

where

𝑈 = [𝑢1 , 𝑢2 , … 𝑢𝑖 , … 𝑢𝑚 ]𝑚×𝑚 (3.5)

𝑉 = [𝑣1 , 𝑣2 , … 𝑣𝑖 , … 𝑣𝑛 ]𝑛×𝑛 (3.6)

𝑆 = 𝑑𝑖𝑎𝑔(𝜎1 , 𝜎2 , … 𝜎𝑘 , … 𝜎𝑟 , 0 … 0)𝑚×𝑛 (3.7)

Here, 𝑈 and 𝑉 represent left and right singular vector matrices, and 𝑆 denotes diagonal
matrix of singular values (𝜎𝑘 ) in descending order respectively. Both 𝑈 and 𝑉 are
orthogonal matrices, 𝑢𝑖 are left, and 𝑣𝑖 are the right singular vectors as formulated in
Eq. (3.5) and Eq. (3.6) and 𝑟 is the rank of 𝑀. Eq. (3.4) can be rewritten as Eq. (3.8).

H = \sum_{i=1}^{r} \sigma_i u_i v_i^{T}   (3.8)

Aiming to capture the structural projections, the projection bases and projection
coefficients of local SVD from each 8 × 8 channel block with a half-overlapped sliding
window are utilized. The SVD based structural projections are formulated through Eq.
(3.9).

\xi = U^{\xi} S^{\xi} V^{\xi T} = \sum_{i=1}^{r_{\xi}} \sigma_i^{\xi} u_i^{\xi} v_i^{\xi T}   (3.9)

where 𝜉 ∈ {𝑅, 𝐺, 𝐵, 𝐻, 𝑆, 𝑉} channels and 𝑟𝜉 are the ranks of each query channel. The
projection 𝑢𝑖 𝑣𝑖𝑇 bases and projection coefficients 𝜎𝑖 preserve the structural and energy
information of decomposed query channel. Lastly, color feature vector is acquired by
fusing these features. The mathematical representation of color features is given in Eqs.
(3.10)-(3.15).

𝐹𝑉𝜇 = [𝑓𝑚1 , 𝑓𝑚2 , … 𝑓𝑚𝑑 ] (3.10)


𝐹𝑉𝜎2 = [𝑓𝑣1 , 𝑓𝑣2 , … 𝑓𝑣𝑑 ] (3.11)
𝐹𝑉𝜎 = [𝑓𝑠1 , 𝑓𝑠2 , … 𝑓𝑠𝑑 ] (3.12)

Here, 𝐹𝑉𝜇 , 𝐹𝑉𝜎2 and 𝐹𝑉𝜎 are extracted feature vectors that represent the mean, variance,
and standard deviation of each channel respectively, where the dimension 𝑑 = 6. We used
𝑓𝑚 , 𝑓𝑣 , and 𝑓𝑠 to denote the mean, variance, and standard deviation feature of a single channel
respectively. These features are computed from all channels 𝑅, 𝐺, 𝐵, 𝐻, 𝑆, and 𝑉
separately. Moreover, the mathematical representation of SVD feature vector is given
in Eq. (3.13).

𝐹𝑉𝑆𝑉𝐷 = [𝑓𝑠𝑣𝑑1 , 𝑓𝑠𝑣𝑑2 , … 𝑓𝑠𝑣𝑑𝑑 ] (3.13)

Here, 𝐹𝑉𝑆𝑉𝐷 represents SVD feature vector of any single channel, where 𝑑 = 48 in
which 𝑓𝑠𝑣𝑑 is used to represent the value of SVD feature vector. Eq. (3.14) is developed
to compute SVD features from multiple channels.

𝐹𝑉𝑆𝑉𝐷1×𝑑 = Ŝ(H) where 𝐻 ∈ {𝑅, 𝐺, 𝐵, 𝐻, 𝑆, 𝑉} (3.14)

where Ŝ is a function which computes SVD features and consequently feature vector of
each channel is obtained, where 𝑑 describes the dimension of one channel SVD
features-based on the size of input image. In addition, SVD features of six channels are
serially fused to generate combined SVD feature vector named as 𝐹𝑉𝑆𝑉𝐷1×𝑑 , here value
of 𝑑 = 288 as per calculation of six channels 6 × 48 = 288. Finally, all extracted
features (mean=6, variance=6, standard deviation=6 and SVD=288) are combined to
yield a compact color vector that preserves the color information for all channels of
input image. The mathematical representation of the combined color feature vector is
expressed in Eq. (3.15).

𝐹𝑉𝑐𝑜𝑙𝑜𝑟(𝐶𝐹𝑣) = [𝐹𝑉𝜇 , 𝐹𝑉𝜎2 , 𝐹𝑉𝜎 , 𝐹𝑉𝑆𝑉𝐷 ] (3.15)

where 𝐹𝑉𝑐𝑜𝑙𝑜𝑟(𝐶𝐹𝑣) denotes serially fused color feature vector of size 1×306.
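To make the color descriptor concrete, a minimal Python sketch is given below (using NumPy and OpenCV, which is an assumption about tooling rather than the thesis implementation). It computes the mean, variance, and standard deviation of each R, G, B, H, S, and V channel following Eqs. (3.1)-(3.3), and appends the dominant singular value of 8 × 8 half-overlapped blocks as a simplified stand-in for the SVD based structural projections of Eq. (3.9); the exact per-channel SVD dimensionality (48 in this work) depends on the adopted resizing and blocking details.

import cv2
import numpy as np

def color_features(bgr_image):
    # Sketch of the color descriptor: per-channel mean/variance/std (Eqs. 3.1-3.3)
    # plus block-wise singular values as a simplified stand-in for Eq. (3.9).
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    channels = list(cv2.split(bgr_image)) + list(cv2.split(hsv))  # B, G, R, H, S, V

    stats, svd_feats = [], []
    for ch in channels:
        ch = ch.astype(np.float64)
        stats += [ch.mean(), ch.var(), ch.std()]
        # 8x8 blocks with a half-overlapped (stride 4) sliding window.
        for r in range(0, ch.shape[0] - 8 + 1, 4):
            for c in range(0, ch.shape[1] - 8 + 1, 4):
                s = np.linalg.svd(ch[r:r + 8, c:c + 8], compute_uv=False)
                svd_feats.append(s[0])  # dominant projection coefficient of the block
    return np.concatenate([np.array(stats), np.array(svd_feats)])

# Toy usage on a random 128x48 pedestrian crop; the final length of this sketch
# depends on the blocking, whereas the thesis keeps 306 color features in total.
img = np.random.randint(0, 256, (128, 48, 3), dtype=np.uint8)
print(color_features(img).shape)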

3.2.1.2 Texture Feature Extraction

For texture analysis, a well-known grayscale invariant texture descriptor named LBP
[186] is utilized, which extracts texture details and its structural information from its
neighbors around the pixel value of center point. It performs well under a change of
illumination conditions. In LBP, the information of corresponding image pixels and their
dominant orientation are used to extract structural details in a confined path;
however, it ignores the correlation of neighboring points. Therefore, the advancement
of LBP in the form of local extrema patterns (LEP) [68, 187] is implemented in this
work for textural, orientation, and spatial structural information. The LEP descriptor
considers edge information in the 0°, 45°, 90° and 135° directions. In a particular direction,
it assigns a value of 1 when both neighboring pixel values are greater than, or both are less
than, the middle pixel, and it assigns a value of 0 otherwise. By considering the value of the center
pixel 𝑃𝑐𝑣 and the corresponding values of its neighbor pixels 𝑃𝑛𝑣 , the LEP code is computed
using Eqs. (3.16)-(3.19) as follows.


P'_{nv} = P_{nv} - P_{cv}   (3.16)

where 𝑛𝑣 ∈ {1, 2, …, 8} indexes the neighbors in a 3×3 window around the center pixel 𝑐𝑣, and P′_nv denotes
the corresponding difference from the center pixel value.


P'_{nv} = I_2(P'_{x}, P'_{x+4}), \quad x = \left(1 + \frac{\theta}{45^{\circ}}\right), \ \forall\, \theta \in \{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\}   (3.17)

where P′_x(θ) denotes the value at a particular direction in the 3×3 window, and I₂ is defined in Eq. (3.18).

I_2(P'_{x}, P'_{x+4}) = \begin{cases} 1, & P'_{x} \times P'_{x+4} \ge 0 \\ 0, & \text{else} \end{cases}   (3.18)

Here, I₂ yields the binary code for the pair of positions P′_x and P′_{x+4}.

LEP(P_{cv}) = \sum_{\theta} 2^{\theta/45^{\circ}} \times P'_{x}(\theta), \ \forall\, \theta \in \{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\}   (3.19)

The LEP descriptor is specifically designed to get information about spatial correlation
across the center and its neighboring points. The LBP and LEP code generation using
a 3×3 pattern window is shown in Figure 3.3. The LBP code generation process is
simple and straightforward; however, in the case of LEP code generation, positive
values are marked with an inward arrow and negative values with an outward arrow in the 3×3
differences window.

Figure 3.3: LBP and LEP code generation [187]

Based on these arrows, binary codes are computed to assign binary code 1 when both
arrows are in the same direction (either inside or outside in a particular direction) and
0 otherwise. The binary codes are multiplied by weights using Eq. (3.19) and LEP
feature vector is obtained having dimension 1× 256 and denoted by 𝐹𝑉𝐿𝐸𝑃(𝑇𝐹𝑣) .
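For illustration, a minimal NumPy sketch of the directional code computation in Eqs. (3.16)-(3.19) is provided below; the function and variable names are illustrative only. Each of the four directions contributes a bit when the two opposite neighbor differences share the same sign, and the bits are weighted by 2^(θ/45°); a histogram of the resulting codes then serves as the texture representation (the 1 × 256 LEP vector used in this work follows the construction of [187]).

import numpy as np

# Offsets of the 8 neighbors around the center pixel, ordered so that index x and
# x+4 are opposite positions along the same direction (0, 45, 90, 135 degrees).
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]

def lep_codes(gray):
    # Per-pixel LEP codes following Eqs. (3.16)-(3.19); names are illustrative.
    g = gray.astype(np.int32)
    h, w = g.shape
    center = g[1:h - 1, 1:w - 1]
    diffs = [g[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc] - center for dr, dc in OFFSETS]
    codes = np.zeros_like(center)
    for x in range(4):                                     # four directions
        same_sign = (diffs[x] * diffs[x + 4]) >= 0         # Eq. (3.18)
        codes += (2 ** x) * same_sign.astype(np.int32)     # weight 2^(theta/45), Eq. (3.19)
    return codes

gray = np.random.randint(0, 256, (128, 48), dtype=np.uint8)
hist, _ = np.histogram(lep_codes(gray), bins=16, range=(0, 16))  # histogram of codes
print(hist)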

3.2.1.3 HOG Feature Extraction

HOG is a commonly used feature descriptor for object detection [188]. Dalal and Triggs
[189] presented this descriptor which extracts HOG features by considering complete
dense grid locations and orientations in an image. HOG features are originally
computed for human detection [190] and later used in several domains [45, 191, 192].
To compute HOG features in this work, the algorithm begins with a localized area of
an image in which gradients are calculated horizontally and vertically. Afterwards, the
magnitude 𝑀𝑔 and orientation of gradients 𝜃 are calculated at each pixel using Eq.
(3.20) and Eq. (3.21).
M_g = \sqrt{g_x^{2} + g_y^{2}}   (3.20)

\theta = \arctan\left(\frac{g_y}{g_x}\right)   (3.21)

By considering the image size as 128 × 48, the pedestrian image is split into 8 × 8
non-overlapping cells, and overlapping patches (blocks) of 2 × 2 cells are formed, which
yields 15 × 5 patches per image. A histogram of gradients is calculated for each cell using
a 9-bin accumulation, so each patch contributes 36 values and 15 × 5 × 36 = 2700 HOG
features are generated. The objective of patch-by-patch feature extraction is to provide a compact
representation. Meanwhile, the histogram of every patch makes the representation more
robust to noise. Finally, the HOG feature vector 𝐹𝑉𝐻𝑂𝐺(𝑆𝐹𝑣) with the dimension of 1 × 2700
is comprised of these normalized histograms taken from the patches.
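With the stated settings (a 128 × 48 crop, 8 × 8 cells, 2 × 2 cells per patch, and 9 orientation bins), an off-the-shelf HOG implementation reproduces the 2700-dimensional vector. The short sketch below uses scikit-image, which is an assumption about tooling and not the exact code used in this work.

import numpy as np
from skimage.feature import hog

# A 128x48 grayscale pedestrian crop (random placeholder data).
image = np.random.rand(128, 48)

# 8x8-pixel cells, 2x2 cells per block (patch), 9 orientation bins:
# (16-1) x (6-1) blocks x (2*2*9) values = 15 x 5 x 36 = 2700 features.
hog_vector = hog(image,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm='L2-Hys',
                 feature_vector=True)

print(hog_vector.shape)  # (2700,)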

3.2.1.4 Feature Fusion and Selection

A novel FFS approach is proposed to acquire an OFS from extracted features as


depicted in Figure 3.4. The approach operates in two steps: 1) feature fusion, where
different features are combined to compute a single feature vector using feature level
fusion, a fundamental step in applications such as pattern
recognition and machine learning; feature level fusion is considered more efficient
because the fused feature vector carries additional information when compared with score or
decision level fusion at the classification/recognition phase [185]; and 2) features selection,
which selects the OFS by using the idea of maximum entropy.

Figure 3.4: Proposed feature extraction, fusion, and max-entropy based selection of
features

Let 𝐹𝑉𝑐𝑜𝑙𝑜𝑟(𝐶𝐹𝑣) , 𝐹𝑉𝐿𝐸𝑃(𝑇𝐹𝑣) , and 𝐹𝑉𝐻𝑂𝐺(𝑆𝐹𝑣) , representing three feature vectors such as
color, texture, and shape, respectively. If the first feature vector 𝐶𝐹𝑣 has dimension 𝑑𝑞
second feature vector 𝑇𝐹𝑣 has dimension 𝑑𝑟 , third feature vector 𝑆𝐹𝑣 has dimension 𝑑𝑠 ,
and 𝐺 is a total number of gallery images, then 𝐶𝐹𝑣 , 𝑇𝐹𝑣 and 𝑆𝐹𝑣 can be written by Eqs.
(3.22)-(3.24).

𝑐𝑓(1,1) ⋯ 𝑐𝑓(1,𝑑𝑞)
𝐶𝐹𝑣 = [ ⋮ ⋱ ⋮ ] (3.22)
𝑐𝑓(𝐺,1) ⋯ 𝑐𝑓(𝐺,𝑑𝑞 )

𝑡𝑓(1,1) ⋯ 𝑡𝑓(1,𝑑𝑟 )
𝑇𝐹𝑣 = [ ⋮ ⋱ ⋮ ] (3.23)
𝑡𝑓(𝐺,1) ⋯ 𝑡𝑓(𝐺,𝑑𝑟 )

𝑠𝑓(1,1) ⋯ 𝑠𝑓(1,𝑑𝑠 )
𝑆𝐹𝑣 = [ ⋮ ⋱ ⋮ ] (3.24)
𝑠𝑓(𝐺,1) ⋯ 𝑠𝑓(𝐺,𝑑𝑠 )

where 𝑐𝑓, 𝑡𝑓 and 𝑠𝑓 denote the feature entries of the feature vectors 𝐶𝐹𝑣 , 𝑇𝐹𝑣 and 𝑆𝐹𝑣
respectively. Then, these feature vectors are concatenated by using Eq. (3.25).

𝐹𝐹𝑉1×𝑑 = [𝐶𝐹𝑣 , 𝑇𝐹𝑣 , 𝑆𝐹𝑣 ] (3.25)

where 𝐹𝐹𝑉1×𝑑 denotes the fused feature vector (FFV) of a single image and 𝐹𝐹𝑉𝐺×𝑑 the FFVs of all gallery
images, 𝑑 = 𝑑𝑞 + 𝑑𝑟 + 𝑑𝑠 , and 𝐺 is the total number of samples/images belonging to the
gallery. The size of the computed FFV is 1×3262, which is large in dimension.

To reduce the dimension, many feature reduction approaches are applied for numerous
pattern recognition tasks such as classification, detection, and recognition. Similarly,
features selection is also used for this purpose. In this concern, entropy controlled
features selection approach is applied in this work which is rarely used in existing
literature related to person ReID. The objective of features selection is to select distinct
features to build a discriminative descriptor or model [193] having spatial, structural,
and statistical information about that observation. Likewise, features selection is
utilized to find enough optimal features instead of all or several features that have the
potential to improve results. Hence for optimal features selection, experiments are
conducted by selecting best features from color, texture, and shape feature vectors.
Initially, most suitable feature combination is chosen to produce highest results using
any common similarity measure. For this purpose, Canberra distance [68] has opted.
For experimentation, three datasets VIPeR, CUHK01, and iLIDS-VID are used as
described in section 4.2.2.

To select the maximum features based on their respective scores, a maximum entropy
selection technique is implemented. The entropy technique computes the randomness
of the feature space and applies it across all feature vectors. The features selection
description is as follows. The FFV computed in Eq. (3.25) is used. Initially, at the start
of features selection, PCA computes the score of each feature vector. It consists of four
steps to find principal components and scores: 1) Initially, the mean of FFV is
calculated, 2) mean from respective feature space is subtracted, 3) its covariance matrix
is computed, and 4) finally, the eigenvalues and eigenvectors of the covariance matrix
are calculated. By using the PCA scores and applying maximum entropy, the maximum score-
based features from 𝐹𝑉𝑐𝑜𝑙𝑜𝑟 (𝐶𝐹𝑣) , 𝐹𝑉𝐿𝐸𝑃(𝑇𝐹𝑣) , and 𝐹𝑉𝐻𝑂𝐺(𝑆𝐹𝑣) feature vectors are chosen.
Then, score-based computed feature vectors are sorted in descending order, and
empirically the features subset (FS) in dimensions 220, 212, and 820 are taken from
𝐶𝐹𝑣 , 𝑇𝐹𝑣 and 𝑆𝐹𝑣 feature vectors respectively. The reason to choose the best feature
combination is to perform cluster formation in the feature clustering module. Different
feature dimensions as given in Table 3.1 are applied to find a feature combination
(from higher to lower dimension) where maximum results are attained at different ranks,
as shown in Table 3.2. The mathematical description of entropy controlled features
selection is presented in Eq. (3.26) and Eq. (3.27).

Entropy(\delta) = \sum_{f_i}^{d} \sum_{f_j}^{d} P_R(f_i, f_j)\, \log P_R(f_i, f_j)   (3.26)

where 𝑓𝑖 and 𝑓𝑗 are current and next features respectively, 𝑃𝑅 represents the probability
of computed features, 𝑑 is the dimension of feature vector, and ẟ denotes the entropy
controlled features.

𝑂𝐹𝑆𝑑 = ẟ(𝑚𝑎𝑥(𝑓𝑑 , 𝑣)) (3.27)

Here, 𝑂𝐹𝑆𝑑 represents maximum score-based OFS having dimension 𝑑 and 𝑣 denotes
number of selected features from larger feature space 𝑓𝑑 . This process is repeated for
each feature space.
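A compact Python sketch of this selection step is given below (NumPy and scikit-learn); the ranking criterion shown, PCA scores weighted in an entropy-style manner, is an illustrative approximation of Eqs. (3.26)-(3.27), and the per-descriptor subset sizes (220, 212, and 820) are taken from the text.

import numpy as np
from sklearn.decomposition import PCA

def select_top_features(feature_matrix, n_keep):
    # Rank features by PCA scores weighted in an entropy style and keep the
    # top n_keep columns (illustrative approximation of Eqs. (3.26)-(3.27)).
    scores = np.abs(PCA(n_components=1).fit(feature_matrix).components_[0])
    p = scores / (scores.sum() + 1e-12)          # normalize scores to a distribution
    weight = -p * np.log(p + 1e-12)              # entropy-style weighting
    order = np.argsort(weight)[::-1]             # descending order
    return feature_matrix[:, order[:n_keep]]

# Toy gallery: 100 images with color (306), texture (256) and HOG (2700) features.
rng = np.random.default_rng(0)
color, texture, shape = rng.random((100, 306)), rng.random((100, 256)), rng.random((100, 2700))

ofs = np.hstack([select_top_features(color, 220),
                 select_top_features(texture, 212),
                 select_top_features(shape, 820)])
print(ofs.shape)  # (100, 1252) -- the OFS dimension reported above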

Table 3.1: Description of different test FS with different dimensions

FS no.   Color   Texture   Shape   Total optimal features
1 306 256 2700 3262
2 276 240 2200 2716
3 260 232 1508 2000
4 246 224 1120 1590
5 220 212 820 1252
6 190 202 600 990

Table 3.2: Experimental results using handcrafted features on three selected datasets,
top recognition rates at each rank written in bold

FS no.   VIPeR (Rank 1, 10, 20, 50)   CUHK01 (Rank 1, 10, 20, 50)   iLIDS-VID (Rank 1, 10, 20, 50)
1 18.3 39.7 47.0 56.8 21.7 38.1 44.6 50.1 16.3 35.4 41.2 46.7
2 20.1 41.3 50.2 58.1 24.5 41.0 48.3 53.2 17.0 37.8 42.4 49.3
3 24.6 46.7 52.6 59.2 26.3 44.0 52.9 56.6 20.7 41.0 45.1 51.8
4 25.9 47.2 54.4 62.4 27.7 46.6 53.1 57.5 21.9 43.1 52.0 55.1
5 27.0 51.4 56.3 62.8 28.8 49.0 55.1 60.3 23.4 45.8 51.8 56.8
6 26.1 50.1 56.8 61.0 26.2 48.2 53.9 58.0 22.8 45.1 50.6 56.4

Afterwards, all the best selected features are fused into one matrix which returns a feature
vector of size 1×1252 known as the OFS. Moreover, it is also observed that the total number
of selected features provides a sufficient and reliable portion of information for image
representation. Besides, the selected 𝑂𝐹𝑆𝑑 is supplied to the K-means module [193] for
features-based clustering of each gallery image. Later on, the OFS is integrated with the deep
features of each cluster sample for a more reliable representation of the gallery image.

3.2.2 Feature Clustering

For test data, the searching operation responds much faster if clustering is performed on
correctly selected optimal features. The objective of this module is to split the dataset
into clusters based on the selected features subsets. The feature clustering module
consists of two parts including features-based cluster formation and deep feature
extraction.

3.2.2.1 Features-based Cluster Formation

For features-based cluster formation, 𝐾-means clustering algorithm is first utilized to


group gallery images into k disjoint subsets. Applying the 𝐾-means algorithm has two
issues: 1) the value of K, and 2) the selection of the initial central point. In this concern, a self-
tuning algorithm [194] is used to estimate the value of 𝐾, whereas the central limit theorem
[195] is used to choose the initial cluster center. Hence, the process of 𝐾-means
clustering is repeated for a sufficient number of epochs (𝑁𝑒 ) with random cluster center
initialization. This process generates a total of (𝑛 × K) clusters. All features-based
cluster formation steps are listed in Algorithm 3.1. During this process, a few images may be
misclassified and assigned to an irrelevant cluster. To overcome this issue, the
consensus-based soft cluster-based similarity partitioning algorithm (sCSPA) [196] is
applied for re-clustering (update_cluster) of clusters by combining (𝑛 × 𝐾) clusters
into K consensus clusters named as 𝐺 = {𝑐1 , 𝑐2 , … , 𝑐𝐾 }, where 𝐺 represents gallery
images containing these clusters.
Lastly, the classification model SVM of radial basis function (RBF) kernel is utilized
to learn a model that acquires associations among gallery feature vectors and their
corresponding consensus cluster. The motivation behind the use of SVM is the
flexibility of transformation function during kernel selection [197, 198]. Later on, this
learned model is used in feature matching module of the proposed framework.
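A minimal scikit-learn sketch of this module is shown below. For brevity, the self-tuning estimate of K and the sCSPA consensus re-clustering are replaced by a fixed K and a single K-means run, which is an assumption of the example rather than the full procedure; the RBF-kernel SVM is trained to associate gallery feature vectors with their cluster labels.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(1)
gallery_ofs = rng.random((500, 1252))     # OFS of 500 gallery images (toy data)

# 1) Features-based cluster formation: a single K-means run stands in for the
#    repeated runs plus sCSPA consensus re-clustering described above.
K = 6
cluster_labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(gallery_ofs)

# 2) Learn the association between gallery feature vectors and their consensus
#    cluster with an RBF-kernel SVM; probability=True exposes posterior scores
#    used later to rank candidate clusters for a probe image.
cluster_model = SVC(kernel='rbf', probability=True).fit(gallery_ofs, cluster_labels)

probe_ofs = rng.random((1, 1252))
posteriors = cluster_model.predict_proba(probe_ofs)[0]
print(np.argsort(posteriors)[::-1])       # consensus clusters ranked for the probe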

3.2.2.2 Deep Feature Extraction

The existing work proves that CNN has a variety of pre-trained deep learning models
that have been implemented to solve the problems of classification and object detection
as well as for the person ReID task [131, 199]. These models have the potential to extract
features from the given data efficiently and automatically. AlexNet is one of the simple
and foundational pre-trained CNN models, introduced by Krizhevsky et al. [128], that can be
easily trained and optimized as compared to complex CNN architectures such as
VGGNet [200] and GoogLeNet [201].

Algorithm 3.1: Handcrafted features-based clusters formation


Input: 𝐹 = {𝑓1 , 𝑓2 , … 𝑓𝐺 } (Set of feature vectors of each gallery image as an input for
cluster formation)
𝐾 (The number of clusters)
Maximum-Iters (Maximum number of iterations)
Output: Ƈ = {𝑐1 , 𝑐2 , … 𝑐𝑘 } (The set of Ƈ clusters)
𝐿 = {ȴ(𝑓)|𝑓 = 1,2, … 𝐺} (The set of cluster labels of 𝐹)
Begin
Step 1: while ci є Ƈ cluster do
ci ← fj є F (e.g. by applying the self-tuning algorithm)
end while
Step 2: while fi є F do
ȴ(f) ← label_with_minDistance (fi , cj ) j є {1 … K}
end while
Step 3: Changed ← False
Step 4: Iteration ← 0
Step 5: do
while ci є Ƈ do
update_cluster (ci );
end while
while fi є F do
min_Distance ←label_with_minDistance (fi , cj ) j є {1 … K}
if min_Distance ≠ ȴ(fi ) then
ȴ(fi ) ← min_Distance;
Changed ← True;
end if
end while
iteration ++;
while changed = True and iteration ≤ Maximum-Iters
End

In addition, CNN architecture is widely used due to its performance. It consists of


multiple layers such as five convolutional layers (CL) including pooling layers (PL)
and fully connected (FC) layers. Based on AlexNet, a deep CNN model is used to learn
discriminative features from an input image. It consists of 5 CL, 3 PL, and 3 FC layers.
Each PL follows a CL. The deep CNN architecture is highlighted in Figure 3.5
and the parameter settings of the network are depicted in Figure 3.6. As the parameters of the
AlexNet model are already well tested, standard parameters are utilized to
learn deep features without any optimization. In this work, an RGB image is resized
into 227 × 227 × 3 dimensions and then a bi-cubic interpolation algorithm is applied
for equalization of image details.

Figure 3.5: CNN model for deep feature extraction

Figure 3.6: Parameters setting at each layer


Thus, image with size 227 × 227 × 3 is input into the network. The convolutional
operation is applied by using Eq. (3.28).

Z_x^{l} = b_x^{l} + \sum_{y} f_{xy}^{l} \otimes z_y^{l-1}   (3.28)

where Z_x^l represents the output values of the x-th feature map at layer l, z_y^{l-1} shows the input
values of the y-th feature map at layer l − 1, and f_{xy}^{l} represents the convolutional filter
between the x-th and y-th feature maps. The bias b_x^l is added to move the activation function
towards successful learning. For the activation of neurons, the rectified linear unit (ReLU)
is applied through Eq. (3.29).

𝑅𝑒𝐿𝑈(𝑧) = 𝑚𝑎𝑥(0, 𝑧) (3.29)

Max pooling is another important step used for downsampling in CNN architectures. It
is quite simple and does not have a learning process. Max pooling is applied after the
first, second, and fifth layers in the network. Figure 3.7 shows max pooling operation
on a given sample feature region. It simply applies 𝑘 × 𝑘 sized filter and selects a
maximum value through Eq. (3.30).

Z_{pqc} = \max_{(i,j) \in M_{p,q}} U_{ijc}   (3.30)

where 𝑀𝑝,𝑞 is the pooling region with indices 𝑖, 𝑗, 𝑈𝑖𝑗𝑐 is the corresponding region of the feature map for
channel 𝑐, and the pixel value 𝑍𝑝𝑞𝑐 is obtained as the output of the max pooling
operation. After the first two pooling layers, local contrast divisive normalization (LCDN)
is applied by considering the interaction between the 𝐶 channels (multi-channel images),
where the variance of the local area 𝑀𝑎𝑏 is computed by applying Eq. (3.31).
\sigma_{ab}^{2} = \frac{1}{C} \sum_{c=0}^{C-1} \sum_{(i,j) \in M_{ab}} w_{ijc}\, (x_{a+i,\,b+j,\,c} - \bar{x}_{ab})^{2}   (3.31)

The divisive normalization is then calculated according to Eq. (3.32).

Z_{abc} = \frac{x_{abc} - \bar{x}_{ab}}{\max(\hat{c}, \sigma_{ab})}   (3.32)

If 𝜎𝑎𝑏 < ĉ, the division is performed with ĉ. In divisive normalization, the
denominator depends upon two values, (1) the constant ĉ and (2) the variance, as
expressed in Eq. (3.33). The application of the normalization process is robust to variations
in illumination and contrast.

Z_{abk} = \frac{x_{abk} - \bar{x}_{ab}}{\sqrt{\hat{c} + \sigma_{ab}^{2}}}   (3.33)
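The local divisive normalization of Eqs. (3.31)-(3.33) can be approximated with a few NumPy/SciPy operations, as in the sketch below; the window size, the constant ĉ, and the uniform weights standing in for w_ijc are illustrative choices rather than the exact network settings.

import numpy as np
from scipy.ndimage import uniform_filter

def divisive_normalization(x, window=5, c_hat=1.0):
    # Local contrast divisive normalization of an HxWxC feature map, following
    # the form of Eqs. (3.31)-(3.32) with uniform weights w_ijc.
    mean_ab = uniform_filter(x.mean(axis=2), size=window)               # local mean over window and channels
    centered = x - mean_ab[..., None]
    var_ab = uniform_filter((centered ** 2).mean(axis=2), size=window)  # Eq. (3.31)
    sigma_ab = np.sqrt(var_ab)
    return centered / np.maximum(c_hat, sigma_ab)[..., None]            # Eq. (3.32)

feature_map = np.random.rand(27, 27, 96)   # e.g., the response of an early conv layer
print(divisive_normalization(feature_map).shape)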

The remaining part of architecture consists of three FC layers, where the first two layers
deal with extracted features of previous layers and decrease the dimensionality of such
features from 9216 to 4096. The response of the FC-7 layer of the trained network, called deep
features, is utilized. A deep architecture is applied because, in a multi-camera network,
the appearance of given pedestrian changes due to various reasons such as camera
settings, background clutter, viewpoints, and variation in illumination and pose
changes.
To handle these issues of appearance and inherent ambiguities, a deep enough
architecture is desirable. Consequently, CNN model is trained using a stochastic
gradient learning algorithm commonly used in various CNN models. The formulation
of deep feature extraction considering all k consensus clusters and their sample images
belonging to the gallery is given in Eq. (3.34).

𝐶𝑘 = {𝑆(𝑘,𝑠1 ) , … , 𝑆(𝑘,𝑠2 ) , … , 𝑆(𝑘,𝑠𝑁 ) } (3.34)

where 𝐶𝑘 represents the k-th consensus cluster, 𝑆(𝑘,𝑠𝑖 ) denotes its i-th sample image, 𝑘 is the
cluster index, and 𝑁 is the total number of sample images in one consensus cluster.

Finally, a deep CNN model is applied and feature extraction on each sample image of
consensus clusters ɠ𝑐𝑐 is carried out using Eq. (3.35).

𝐷𝐹1×𝑑 = ɠ𝑐𝑐 (𝑘, 𝑠𝑖 ) (3.35)

where 𝐷𝐹1×𝑑 denotes the deep features having dimension 𝑑 and 𝑠𝑖 depicts the i-th sample
image of cluster 𝑘, extracted by applying the ɠ𝑐𝑐 operation. Similarly, this procedure is
applied across all the consensus clusters to extract the deep features of each sample image.
PCA is then used to decrease the dimension of deep features. The reduced deep features
filter out the noise by discarding unnecessary information and preserving discriminative
information. Empirically, 1000 deep features are selected and fused with the
selected OFS to get a feature vector which is then used for all the experiments. The
deep features selection is discussed in section 4.2.3.1, and results are shown in Tables
(4.2)-(4.4). Thus, each gallery image specifically belongs to a particular consensus
cluster and it consists of optimal features subsets 𝑂𝐹𝑆𝑠𝑒𝑙𝑒𝑐𝑡 , reduced deep
features DFi×d and cluster index 𝑘. Hence, all gallery images are represented with the
concatenation of these features, and the size of FFV becomes 1×2252 with the clustered
index. Mathematically, it is formulated through Eq. (3.36) and complete steps of
proposed FCDF framework are presented in Algorithm 3.2.

Figure 3.7: Max pooling operation

𝐹𝐹𝑉𝑣 (𝑓𝑢𝑠𝑒𝑑) = [𝑂𝐹𝑆, 𝐷𝐹, 𝑘] (3.36)

where 𝐹𝐹𝑉𝑣 (𝑓𝑢𝑠𝑒𝑑) represents the final feature vector for ReID, which consists of the OFS,
the DF feature vector, and the cluster index 𝑘.
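As a concrete illustration of the deep feature step, the sketch below reads out the 4096-dimensional FC-7 response of the pre-trained AlexNet available in torchvision; the torchvision model and its preprocessing are assumptions of the example, not the exact training setup used in this work. In the full pipeline, this response is PCA-reduced to 1000 dimensions and fused with the 1252-dimensional OFS and the cluster index k, yielding the final feature vector of Eq. (3.36).

import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained AlexNet; the first five classifier modules end at the second fully
# connected layer, whose output is the 4096-dimensional FC-7 response.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
fc7_extractor = torch.nn.Sequential(model.features, model.avgpool, torch.nn.Flatten(),
                                    *list(model.classifier.children())[:5])

preprocess = transforms.Compose([
    transforms.Resize((227, 227)),                      # input size used in the text
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def deep_features(pil_image):
    with torch.no_grad():
        return fc7_extractor(preprocess(pil_image).unsqueeze(0)).squeeze(0).numpy()

img = Image.fromarray(np.uint8(np.random.rand(128, 48, 3) * 255))
df = deep_features(img)
print(df.shape)   # (4096,) FC-7 deep features of one cluster sample

# In the full pipeline this response is PCA-reduced to 1000 dimensions and fused
# with the 1252-dimensional OFS and the cluster index k (Eq. 3.36), giving the
# 1x2252 feature vector (plus cluster index) reported in the text.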

Algorithm 3.2: Proposed FCDF framework


Input: Subset of gallery images represented with cluster set
k = {cluster-1, cluster-2, …, cluster-6} and probe set P = {Pi where i = 1, 2, …, n}
Output: The recognition rate.
Begin
Step# 1: Extract color (mean, variance, SD, and SVD) information with RGB and
HSV channels separately.
Step# 2: Extract local extrema patterns with V channel and HOG on gray values.
Step# 3: Feature level fusion is applied across Fcolor (I), FLEP (I) and FHOG (I) to
obtain a fused feature vector by Eqs. (3.22)-(3.25).
Step# 4: Compute respective scores based on maximum entropy to select OFS by
Eq. (3.26) and Eq. (3.27).
Step# 5: Split gallery set into consensus clusters by using Algorithm 1.
Step# 6: Capture deep features across all the consensus clusters by Eq. (3.34) and
Eq. (3.35).
Step# 7: Combine OFS, deep features, and cluster index to obtain a FFV by Eq.
(3.36).
Step# 8: Select target cluster against each probe based on an OFS by using multi
SVM.
Step# 9: Obtain deep features of a probe and combine it with its OFS.
Step# 10: Obtain QC cross bin histogram distance with target cluster(s) and a
probe by Eq. (3.37).
Step# 11: Obtain recognition rate.
End

3.2.3 Feature Matching

In feature matching module, first step is to find a consensus cluster. It follows the
extraction of deep features of test image. Later on, OFS and deep features of probe
image are concatenated. Finally, a similarity measure is deployed to get target match

from the selected cluster or neighboring clusters. The details of all steps are given in the
following subsections.

3.2.3.1 Cluster Selection followed by Probe Deep Features

In the cluster selection, the probe OFS is initially provided to the learned model for
selecting the target cluster. For this purpose, a multi-class SVM is applied to assign a
score to all k consensus clusters using a score-to-posterior-probability transformation
function. Based on the highest score, any consensus cluster may qualify as the target
cluster for a given probe image. During this process, the target image may actually
belong to another cluster; in this situation, the nearest neighbor consensus cluster
is taken instead of the cluster with the highest score. However, at most half of the total
clusters, chosen as the nearest neighbors (minimum distance), are considered. Thus, the
probability of finding the target image increases when neighbor consensus clusters are
taken. The deep features of probe image are also selected and integrated with probe
OFS. Then, the search for the given probe image is carried out across n consensus
clusters instead of the whole gallery to effectively optimize the gallery search, where
n ∈ {1, 2, 3}.
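A hedged sketch of this cluster selection step is given below, assuming scikit-learn is used: a multi-class SVM with a radial basis kernel is fitted on the gallery OFS and cluster labels, its scores are converted to posterior probabilities, and the n highest scoring clusters (n never exceeding half of the total clusters) are returned as search targets. Function and variable names are illustrative, not the original implementation.

```python
import numpy as np
from sklearn.svm import SVC

def select_target_clusters(gallery_ofs, gallery_cluster_labels, probe_ofs, n=2):
    """Return up to n consensus-cluster labels to search for the given probe."""
    svm = SVC(kernel='rbf', probability=True)          # radial basis kernel, posterior scores
    svm.fit(gallery_ofs, gallery_cluster_labels)
    scores = svm.predict_proba(probe_ofs.reshape(1, -1))[0]
    n = max(1, min(n, len(svm.classes_) // 2))         # at most half of the k clusters
    return svm.classes_[np.argsort(scores)[::-1][:n]]  # highest scoring cluster(s) first
```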

3.2.3.2 Similarity Measure

The last phase of the presented framework is to compare the probe image within the consensus
cluster to find the accurate match. The QC cross-bin histogram distance measure [202]
is used for this purpose, where two histogram distance properties are utilized across the
bins: 1) the similarity-matrix quantization invariance property, which includes the cross-bin
relationships of the features, and 2) the sparseness invariance property, which obtains distances
between full histograms (e.g., probe and cluster feature vectors). Let the probe feature
vector be Ip = {X1, X2, …, Xn} and the cluster feature vector be Ic = {Y1, Y2, …, Yn}; hence,
the distance D between a probe feature vector and the cluster(s) feature vectors is calculated by
Eq. (3.37).

$$D(I_p, I_c) = \sum_{i=1}^{n} QC(X_i, Y_i) \qquad (3.37)$$

where n denotes the feature vector dimension. Once all the distances between the probe
image and the samples of the target cluster(s) are computed, the ReID ranks are obtained
by sorting the computed distances in ascending order. Consequently, the cluster sample
with minimum distance D to the probe is declared the accurate match.
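The ranking step can be sketched as follows (illustrative only); a simple chi-square histogram distance is used here as a stand-in for the QC cross-bin distance of [202], and probe_fv / cluster_fvs are assumed to be the concatenated OFS plus deep features of the probe and of the selected cluster samples.

```python
import numpy as np

def chi2_distance(x, y, eps=1e-12):
    # stand-in histogram distance; the actual method uses the QC distance [202]
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def rank_within_cluster(probe_fv, cluster_fvs):
    d = np.array([chi2_distance(probe_fv, g) for g in cluster_fvs])
    order = np.argsort(d)              # ascending: rank-1 is the minimum distance
    return order, d[order]             # cluster-sample indices and their distances
```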

This method utilized handcrafted features (color, HOG, and LEP) for cluster formation.
The extracted deep features of each cluster sample are fused for more distinct feature
representation. The proposed FCDF method optimized the gallery search by applying
cluster-based probe matching instead of gallery-based matching. To further analyze
pedestrians, gender classification is considered as a next step of this dissertation. The
subsequent methodology based on J-LDFR is discussed in section 3.3.

3.3 Proposed Pedestrian Gender Classification Method of Joint Low-level and Deep CNN Feature Representation (J-LDFR)
The proposed gender classification framework consists of four principal steps: 1)
preprocessing, 2) feature representation, 3) features selection and joint feature
representation, and 4) classification, as shown in Figure 3.8. To remove noise and
equalize the intensity values of the input gender image, preprocessing is applied; it is
followed by feature representation, where low-level and deep information of the pedestrian
full-body image is extracted. Then, maximum entropy is utilized to choose the best features
and discard irrelevant information by dropping lower scored features. The best selected
features are fused and then supplied to different classifiers to investigate the
performance of the proposed J-LDFR framework for gender classification. This framework
is discussed in the subsequent sections.

Figure 3.8: Proposed J-LDFR framework for pedestrian gender classification

3.3.1 Data Preprocessing

Preprocessing is a foremost step to improve the results of later processing. It enhances
visual quality and foreground information. It also mitigates environmental effects such
as illumination variations, poor contrast, and brightness related issues. For each feature
representation, the original gender image is resized as described in Table 3.3. Later on, two
basic steps, interpolation and non-linear digital filtering, are applied in preprocessing.
Initially, bi-cubic interpolation is performed by choosing 4 × 4 blocks of neighboring known
pixels located at different distances from an unknown pixel. Bi-cubic interpolation produces
sharper images with reasonable processing time and is applied for all feature types.
Afterwards, a median filter is used to remove noisy information from gender images caused
by the uncontrolled image acquisition environment. The median filter is chosen because it
retains edge information while removing the noise. In this work, all feature extraction
schemes operate on the output of the 2-D median filter: LOMO utilizes HSV and grayscale
images, HOG examines grayscale images, and both pre-trained CNN models exploit RGB
images as input during the implementation of the feature extraction schemes, as presented
in Table 3.3.

Table 3.3: Description of preprocessing for each feature representation scheme

Feature type        FR          Original image size   Resized RGB image   Interpolation (Bi-cubic)   Non-linear filtering (Median)   Input images
Low-level           LOMO        Vary                  128 × 64 × 3        Yes                        Yes                             HSV and grayscale
Low-level           HOG         Vary                  128 × 64 × 3        Yes                        Yes                             Grayscale
High-level (deep)   VGG19       Vary                  224 × 224 × 3       Yes                        Yes                             RGB
High-level (deep)   ResNet101   Vary                  224 × 224 × 3       Yes                        Yes                             RGB
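As an illustration only, the two preprocessing steps of Table 3.3 can be sketched with OpenCV as below; the 3 × 3 median kernel size is an assumption, since the filter size is not specified in the text, and the input is assumed to be an 8-bit BGR image.

```python
import cv2

def preprocess(bgr_image, size=(64, 128)):
    """Bi-cubic resize to (width, height) = 64 x 128, then median filtering."""
    resized = cv2.resize(bgr_image, size, interpolation=cv2.INTER_CUBIC)  # bi-cubic interpolation
    return cv2.medianBlur(resized, 3)                                     # 3 x 3 median filter (assumed size)
```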

3.3.2 Feature Representation

The visual description of an input image is a fundamental step for automated image
recognition systems. These systems use efficient and effective visual cues in numerous
computer vision tasks. In this context, known visual feature representations such as
shape, texture, color, deep, and structural information are often used to compute distinct
image features. In pedestrian gender analysis, feature representations are investigated
by extracting either low-level features or deep CNN features. Both feature extraction

schemes are introduced in the proposed framework for robust feature representation of
gender image, described in the subsequent sections.

3.3.2.1 Low-level Feature Extraction

Various handcrafted features are commonly studied in the area of pattern recognition.
Specifically, these features are used to represent low-level information (e.g. shape,
color, and texture) of an input image. Moreover, these features are used as a
complementary part of feature fusion and for reliable representation of an image.
Considering these facts, two well-known descriptors, the HOG descriptor for shape
information and the LOMO descriptor for color and texture information, are selected for
low-level feature representation of the input image. Besides, HOG and LOMO features
jointly handle the issues of rotation, viewpoint, and illumination variance in images.
HOG feature representation was initially used for pedestrian detection by Dalal et al.
[203], and it is currently used in many areas such as person ReID [68], railway track
detection [204], and other applications [205-207].

In this manuscript, illumination and rotation invariant local features of gender images
are computed using the HOG feature descriptor. Initially, the orientation of the input
image is calculated by splitting the whole image into different blocks; afterwards, the
gradients of the image are computed block by block, where each block consists of 2 × 2
cells. Considering the gradient information in a single block, a local histogram of
orientations of each cell is calculated with an accumulation of 9 bins. Meanwhile, the
local 1-D histogram over the pixels of a cell is normalized individually using the L1-norm
along with an interpolation operation.
The final histogram is then computed by concatenating the local histograms of all cells.
In the final histogram, gradient strengths vary due to local illumination changes;
therefore, the overall HOG features are normalized using the L2-norm. Finally, a feature
vector with 3780 HOG features is obtained. The HOG feature representation is shown in
Figure 3.9. From this larger HOG feature vector, a subset of 1000 features is empirically
selected, for two reasons: 1) the reduced dimension of the HOG feature vector discards
irrelevant features, and 2) only sufficiently contributing HOG features are passed to the
succeeding fusion stage. Next, the LOMO descriptor, initially introduced by [77] and
designed for pedestrian ReID, is also utilized for low-level feature representation.
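For reference, the 3780-D HOG vector described above (9 orientations, 8 × 8 pixel cells, 2 × 2 cell blocks on a 128 × 64 image) can be reproduced with scikit-image as sketched below; the 8 × 8 cell size is the standard Dalal-Triggs setting and is assumed here, and the per-block normalization differs slightly from the L1/L2 scheme in the text.

```python
from skimage.feature import hog
from skimage.color import rgb2gray

def hog_features(rgb_image_128x64):
    gray = rgb2gray(rgb_image_128x64)                       # HOG operates on gray values
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2')     # 15 x 7 blocks x 4 cells x 9 bins = 3780
```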

Figure 3.9: Process of formation of low-level (HOG) features

The LOMO descriptor effectively examines the horizontal occurrence of local features and
maximizes these occurrences to efficiently handle viewpoint changes; it applies the Retinex
transform to effectively tackle illumination changes of the input image. Recently, many
approaches have integrated LOMO features and achieved outstanding results [54, 67, 208-210].
Thus, this work also focuses on LOMO feature representation as low-level information
with HOG features for gender prediction. The formation of LOMO features is shown
in Figure 3.10 (a) and Figure 3.10 (b).
LOMO feature representation utilizes a sliding window having a size of 10×10 with a
stride of 5 pixels and patches of dimensions 128×64 to depict local information. The
purpose of sliding window is to handle viewpoint changes in input image. Each sub-
window is comprised of two types of information; 1) scale-invariant local ternary
patterns (SILTP) [211] histograms, and 2) HSV based 8×8×8 color bin histograms.
Both types of information describe the texture and color details of the input image. A
three-level pyramid is utilized to obtain multi-scale information, i.e., features at
different levels. The information from each window and scale is then combined to
produce the LOMO feature representation.

Figure 3.10: Process of formation of low-level (LOMO) features, a) overview of feature
extraction procedure using two different LOMO representation schemes
such as HSV and SILTP based representations including feature fusion
step to compute LOMO feature vector, and b) basic internal representation
to calculate combined histogram from different patches of input image
[77]
Besides, these procedures retain local features of a person from selected regions, as in
[77]. The calculated LOMO feature representations achieve a large invariance to
viewpoint and illumination changes. In this work, LOMO features are obtained from
each gender image. As the feature dimension is high (26960 features in total),
dimensionality reduction is carried out on the computed LOMO features, and the 600 best
features are empirically selected so that the succeeding fusion stage works with a
sufficient yet optimal set of LOMO features.

3.3.2.2 Deep CNN Feature Extraction

Already-trained deep CNN models have shown promising results in different areas of
computer vision with small and large datasets [212-214]. In related literature, two well-
known procedures including transfer learning and feature mining are applied using
already-trained CNN models. Transfer learning is a much faster and suitable procedure

65
as compared to training from the start, whereas feature mining is another useful and
fastest procedure when high-level representations of an input image are desirable and
supplied to train an image classifier. This procedure extracts deeply learned image
features using deep CNN models. Therefore, aiming to acquire high-level feature
depiction of gender image, deep CNN feature extraction procedure is utilized in this
work. Generally, deep CNN models consist of diverse layers (blocks) such as
convolutional, pooling, normalization, ReLU, and FC with a single softmax function.

The purpose of the CL in a CNN model is to preserve the spatial relationship between pixels
(edge and structure information) by learning input image features stepwise at different
levels, while the PL of a CNN is required for downsampling the detected features in the
feature maps. The max-pooling method is adopted in the CNN model to select the maximum
value from each patch of a feature map. In deep CNN models, ReLU is opted as the activation
function. The FC block contains the deep or high-level feature representations of the input
image, and these representations are supplied to the softmax layer for classification of the
input image based on the training dataset. Therefore, deep feature representation using an
already-trained CNN model is a suitable approach to adopt, whether compared with low-level
feature representation alone or used within an integration of both representations.

The proposed framework considers two already-trained CNN models, VGG19 and ResNet101,
to compute deep feature representations. Both models apply a stack of 3 × 3 convolutional
filters with stride 1 during the convolution operation. The depth of a CNN model supports
learning more complex information from the input image. Moreover, each model has different
characteristics that establish its significance. In this work, the objective is to examine
the deep information of two CNN models of different depths by considering their feature
representations at the FC layers. So, the feature representations of FC7 and FC1000 are
utilized using the VGG19 and ResNet101 models, respectively. The aforementioned types of
layers are commonly used in the design of a deep CNN model. Initially, the CL obtains local
features of the input image. This local feature extraction process is formulated in Eq. (3.38).

$$z_i^L = b_i^L + \sum_{j=1}^{x_i^{L-1}} F_{i,j}^{L} \times y_j^{L-1} \qquad (3.38)$$

where z_i^L represents the output of layer L, b_i^L is the bias value, F_{i,j}^L denotes
the weights joining the j-th feature map, and y_j^{L-1} represents the output of layer
L-1. Max-pooling is used to choose the highest value from a pooling region. Max-pooling
between CLs is applied to reduce
the number of irrelevant features, computational load, and overfitting issue.
Similarly, average pooling is another type of pooling that computes the average of the
filter values rather than the maximum value. Both max-pooling and average-pooling layers
are included in the ResNet101 model; however, VGG19 only comprises max-pooling. For
instance, a single feature map of a PL is denoted by R. Before applying pooling,
R = R_0, …, R_m can be considered as a collection of all small local regions, where m is
controlled by both the size of the pooling regions and the dimension of the input feature
map. A local region R_i is randomly selected, where i is an index between 0 and m.
Mathematically, local regions are represented through Eq. (3.39).

𝑅𝑖 = 𝑥1 , 𝑥2 , … , 𝑥𝐾×𝐾 (3.39)

where K denotes the size of a pooling region (filter size) and x represents a component
of the pooling region. At each pooling layer, the aforementioned max and average pooling
operations use different calculation procedures for the components in each pooling
region. Eq. (3.40) and Eq. (3.41) describe the max pooling and average pooling
operations, respectively.

$$MP_i^L = \max_{1 \le j \le K\times K} (x_j) \qquad (3.40)$$

$$AP_i^L = \frac{1}{K\times K} \sum_{j=1}^{K\times K} x_j \qquad (3.41)$$

where MP_i^L and AP_i^L denote the outputs of the max pooling and average pooling
operations, respectively, using a K × K pooling region at layer L. The max pooling
operation only chooses the maximum value from the pooling region, whereas the average
pooling operation computes the average of all pooling region values. The other layers,
including the ReLU layer, the FC layer FC_i^(l), and the FC layer FC_j^(l), are represented
in Eqs. (3.42)-(3.44).

$$ReLU_i^{(l)} = \max(0, p_i^{l-1}) \qquad (3.42)$$

$$FC_i^{(l)} = f(D1_i^{(l)}) \quad \text{with} \quad D1_i^{(l)} = \sum_{r=1}^{MP_i^{(l-1)}} w_{i,r}^{(l)} \left(FC_i^{(l-1)}\right)_r \qquad (3.43)$$

$$FC_j^{(l)} = f(D2_j^{(l)}) \quad \text{with} \quad D2_j^{(l)} = \sum_{r=1}^{AP_j^{(l-1)}} w_{j,r}^{(l)} \left(FC_j^{(l-1)}\right)_r \qquad (3.44)$$

where ReLU_i^(l) represents the output of the ReLU layer, and FC_i^(l) and FC_j^(l) denote
the responses of the FC7 and FC1000 layers of the VGG19 and ResNet101 models, respectively.
The FC layer efficiently depicts higher-level information, and most researchers consider
the output of FC layers as deep features to be applied in pattern recognition, person
ReID, and image classification tasks [172-174, 215].
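A minimal sketch of this deep feature extraction, assuming PyTorch/torchvision rather than the original implementation, is given below: the VGG19 classifier is truncated after its second linear layer to expose FC7 (4096-D), while the ResNet101 forward pass already ends at its FC1000 layer (1000-D, pre-softmax).

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

vgg19 = models.vgg19(pretrained=True).eval()
resnet101 = models.resnet101(pretrained=True).eval()

# Truncate VGG19's classifier just after the second linear layer (FC7, 4096-D).
vgg19_fc7 = nn.Sequential(vgg19.features, vgg19.avgpool, nn.Flatten(),
                          *list(vgg19.classifier.children())[:4])

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def deep_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)          # 1 x 3 x 224 x 224
    with torch.no_grad():
        fc7 = vgg19_fc7(x).squeeze(0)               # 4096-D response (FC7, VGG19)
        fc1000 = resnet101(x).squeeze(0)            # 1000-D response (FC1000, ResNet101)
    return fc7.numpy(), fc1000.numpy()
```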

3.3.2.3 Pre-trained Deep CNN Models

The selected VGG19 and ResNet101 pre-trained deep CNN models are used to acquire
deep features of gender images. Firstly, the VGG19 model [216] comprises different
blocks/layers: 16 CLs, 18 ReLU layers, 5 PLs, 3 FC layers, 2 dropout layers, and a softmax
classifier for input image prediction. This model is learned on the ImageNet dataset, where
the size of the input image at the input layer is 224 × 224 × 3. ReLU is utilized after each
CL in the VGG19 model. Both ReLU and dropout operations are performed before the FC7 layer,
where the dropout value is 0.5.

In the max-pooling operation, the pool size and stride are selected as (2, 2) with zero
padding. Secondly, ResNet101 is a pre-trained model used by He et al. [217] for image
recognition. This model consists of an input layer, 104 CLs, 100 ReLU layers, 104 batch
normalization layers, 33 addition layers, MP, AP, FC, softmax, and output layers. The size
of the input images used by this model is 224 × 224 × 3. The ResNet101 model utilizes a pool
and stride size of (7, 7) with zero padding in the average pooling operation, and a pool
size of (3, 3), stride of (2, 2), and padding of (0, 1, 0, 1) in the max-pooling operation.

3.3.3 Features selection and Joint Feature Representation

The complete illustration of proposed J-LDFR framework for pedestrian gender


classification is shown in Figure 3.11. The framework contains three stages; 1) different
types of feature representations, 2) features selection, and 3) joint feature
representation. In the feature representation stage, low-level and deep features are
computed. The low-level feature extraction is performed using HOG and LOMO, which are
designed to represent viewpoint and illumination invariant features and to provide a
robust low-level representation. Low-level feature vectors of size N × 3780 and
N × 26960 are computed from the HOG and LOMO descriptors separately, where N
represents the number of sample images. The deep features representations are
performed by considering the response of FC layers of VGG19 and ResNet101 models.

These responses are used as deep feature representations that have deeper information
on gender images to handle misclassification of gender due to pose variations. The
deeply extracted feature vectors of size 𝑁 × 4096 and 𝑁 × 1000 are obtained from
VGG19 and ResNet101, respectively. This computed deeper information helps to
reduce false gender predictions due to variations in pose and illumination. In this
process, four kinds of different features from a gender image are obtained as shown in
Figure 3.11. Let FV_HOG(H_v), FV_LOMO(L_v), FV_Deep1(DV_v), and FV_Deep2(DR_v) represent the
four different feature vectors, namely HOG, LOMO, deepV (VGG19), and deepR (ResNet101),
respectively. If the HOG feature vector H_v has dimension d_w, the LOMO feature vector L_v
has dimension d_x, the deepV feature vector DV_v has dimension d_y, the deepR feature vector
DR_v has dimension d_z, and N represents the total number of images in the selected dataset,
then H_v, L_v, DV_v, and DR_v can be calculated through Eqs. (3.45)-(3.48).

$$H_v = \begin{pmatrix} h_{(1,1)} & \cdots & h_{(1,d_w)} \\ \vdots & \ddots & \vdots \\ h_{(N,1)} & \cdots & h_{(N,d_w)} \end{pmatrix} \qquad (3.45)$$

$$L_v = \begin{pmatrix} l_{(1,1)} & \cdots & l_{(1,d_x)} \\ \vdots & \ddots & \vdots \\ l_{(N,1)} & \cdots & l_{(N,d_x)} \end{pmatrix} \qquad (3.46)$$

$$DV_v = \begin{pmatrix} dv_{(1,1)} & \cdots & dv_{(1,d_y)} \\ \vdots & \ddots & \vdots \\ dv_{(N,1)} & \cdots & dv_{(N,d_y)} \end{pmatrix} \qquad (3.47)$$

$$DR_v = \begin{pmatrix} dr_{(1,1)} & \cdots & dr_{(1,d_z)} \\ \vdots & \ddots & \vdots \\ dr_{(N,1)} & \cdots & dr_{(N,d_z)} \end{pmatrix} \qquad (3.48)$$

where h, l, dv, and dr denote the feature entries of the extracted H_v, L_v, DV_v, and DR_v
feature vectors, respectively. However, merging these feature vectors may increase the
feature dimension, which may require more execution time and influence classification
accuracy due to irrelevant information. To address these issues, feature reduction is
performed, which computes optimal features from the given feature vectors. These optimal
features comprise the discriminant features that are desirable for correctly classifying
gender images. In this manuscript, an entropy controlled method is applied to select the
best features subset from the extracted feature vectors. To the best of our knowledge,
entropy based features selection has never been used in the existing literature for the
pedestrian gender classification task.

Figure 3.11: Complete design of proposed low-level and deep feature extraction from
gender images for joint feature representation. The proposed framework
J-LDFR selects maximum score-based features and then fusion is applied
to generate a robust feature vector that has both low-level and deep feature
representations. Selected classifiers are applied to evaluate these feature
representations for gender prediction

Entropy is used as a measure of randomness in the information that is being processed


[218]. The entropy method can preserve feature behavior by computing valuable
information. This computed information is utilized in feature representation [172].
Several entropy based features selection methods are reported in the literature; in this
work, the maximum entropy method is adopted for features selection. The motivation behind
the use of maximum entropy is that it satisfies the given constraints and, in addition, is
considered the most uniform model [219]. The maximum entropy method returns generalized
scalar values representing the entropy of each feature of the given feature space
individually, and it is a suitable method for investigating the randomness that
characterizes the extracted features of an input image.

Maximum entropy is described as follows. Let ℎ1 , ℎ2 , … , ℎ𝑁 represent features from
feature space H, 𝑙1 , 𝑙2 , … , 𝑙𝑁 represent features from feature space L,
𝑑𝑣1 , 𝑑𝑣2 , … , 𝑑𝑣𝑁 are features from the feature space DV and 𝑑𝑟1 , 𝑑𝑟2 , … , 𝑑𝑟𝑁 denote
features from feature space DR, where H∈HOG, L∈LOMO, DV∈VGG19, and
DR∈ResNet101 feature vectors. The dimension of each feature vector is 1 × 3780,
1 × 26960, 1 × 4096, and 1 × 1000 for HOG, LOMO, VGG19, and ResNet101,
respectively. The optimal features selection using maximum entropy method is given
in Eqs. (3.49)-(3.58).

𝑃 = 𝐻𝑖𝑠𝑡(𝑓) (3.49)

where P denotes the histogram counts of feature vector f such that f ∈ I, and I depicts the
total feature vectors in a given feature space. Meanwhile, zero entries are removed from P,
and the matrix is returned in the form of bin values. Then, the entropy is computed using
Eq. (3.50).

$$Ent(G) = -\sum P \log_2(P) \qquad (3.50)$$

where 𝐺 ∈ (ℎ𝑁 , 𝑙𝑁 , 𝑑𝑣𝑁 , 𝑑𝑟𝑁 ), and maximum entropy controlled method obtains a
feature vector with 𝑁 × 𝑀 dimensions. It controls the randomness of each feature
space. Finally, the scores of each feature vector are arranged in descending order. The
entropy information of each feature vector is calculated through Eqs. (3.51)-(3.54).

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻) = 𝐸(ℎ𝑁 , ῤ) (3.51)

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐿) = 𝐸(𝑙𝑁 , ῤ) (3.52)

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷𝑉) = 𝐸(𝑑𝑣𝑁 , ῤ) (3.53)

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷𝑅) = 𝐸(𝑑𝑟𝑁 , ῤ) (3.54)

where 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻), 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐿), 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷𝑉), and 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷𝑅) represent the


entropy controlled information of feature space HOG, LOMO, deep VGG19, and deep
ResNet101, respectively. 𝐸 presents the sorting function and ῤ denotes descending
order operation. The deep and low-level features were obtained from the computed
entropy controlled information and then merged for joint feature representation. After
extensive experimentation, maximum score-based top 1000 features are chosen from
H, 600 features from L, and 1000 features from each DV and DR. Mathematically,
selection of top features subset is done through Eqs. (3.55)-(3.58).

𝐻𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻), ẟ) (3.55)

𝐿𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐿), ẟ) (3.56)

𝐷𝑉𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷𝑉), ẟ) (3.57)

𝐷𝑅𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷𝑅), ẟ) (3.58)

where H_v, L_v, DV_v, and DR_v indicate the selected subsets of features and ẟ represents
the number of top features taken from the computed entropy information. The joint feature
representation is computed using Eq. (3.59), where d denotes the FFV dimension. The
dimension of the FFV is 1 × 3600 (see section 4.3.3).

𝐹𝐹𝑉1×𝑑 = [𝐻𝑣 , 𝐿𝑣 , 𝐷𝑉𝑣 , 𝐷𝑅𝑣 ] (3.59)

Later, this FFV is supplied to different classifiers for classification. In this work,
supervised learning methods are used to train the data for gender prediction.
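A hedged NumPy sketch of this entropy controlled selection and of the joint representation of Eq. (3.59) follows; the per-feature histogram binning (256 bins) is an assumption, as the exact bin count is not stated, and the matrices H, L, DV, and DR are placeholders for the four extracted feature matrices.

```python
import numpy as np

def entropy_scores(F, bins=256):
    """Score each feature column of F (N samples x d features) by its
    histogram entropy; higher entropy is treated as more informative."""
    scores = np.empty(F.shape[1])
    for j in range(F.shape[1]):
        p, _ = np.histogram(F[:, j], bins=bins)
        p = p[p > 0] / p.sum()                      # drop zero bins, normalise
        scores[j] = -(p * np.log2(p)).sum()
    return scores

def select_top(F, k):
    idx = np.argsort(entropy_scores(F))[::-1][:k]   # descending entropy order
    return F[:, idx]

# Joint feature representation (Eq. 3.59): 1000 + 600 + 1000 + 1000 = 3600 features
# FFV = np.hstack([select_top(H, 1000), select_top(L, 600),
#                  select_top(DV, 1000), select_top(DR, 1000)])
```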

3.3.4 Classification Methods

Supervised learning methods from three families, discriminant analysis (including ensemble
subspace discriminant), KNN, and SVM, are used to train the classifiers. The parameter
settings of all selected classification methods are described in Table 3.4. The
discriminant analysis family investigates the classification results using linear [220]
and ensemble subspace discriminant [221] classification methods. Secondly, the fine,
medium, and cosine KNN classification methods from the KNN family are considered. These
three classification methods are selected for their satisfactory classification results as
compared to the coarse, cubic, and weighted KNN classification methods. The default metric
(Euclidean distance) is used in the KNN methods because it is most frequently preferred due
to its ease of computation and works well with different datasets [221, 222].

Lastly, SVM is considered due to its generalization ability and effectiveness for
classification and regression tasks, having shown promising results in the relevant
literature [220, 221, 223, 224]. In this work, linear, cubic, medium Gaussian, and
quadratic SVM classifiers are utilized for the gender classification task. All these
methods are executed using their default settings as given in Table 3.4. Extensive
experiments were conducted using the aforementioned classification methods, and it was
observed that the cubic, medium Gaussian, and quadratic SVM achieved significant results
as compared to existing approaches.

Table 3.4: Proposed J-LDFR framework selected features subset dimensions, classifiers
and their parameter settings
Proposed framework: J-LDFR framework
Features subset: Low-level feature representations (HOG 1000 features and LOMO 600 features)
+ Deep feature representations (VGG19 1000 features and ResNet101 1000 features)

Classifier                       Parameter                  Value
Linear discriminant              Covariance structure       Full
Ensemble subspace discriminant   Ensemble method            Subspace
                                 Learner type               Discriminant
                                 Number of learners         30
                                 Subspace dimension         1800
Fine-KNN                         Distance metric            Euclidean
                                 Distance weight            Equal
                                 Standardize data           True
                                 Number of neighbors        1
Medium-KNN                       Distance metric            Euclidean
                                 Distance weight            Equal
                                 Standardize data           True
                                 Number of neighbors        10
Cosine-KNN                       Distance metric            Euclidean
                                 Distance weight            Equal
                                 Standardize data           True
                                 Number of neighbors        10
Linear-SVM                       Kernel function            Linear
                                 Kernel scale               Automatic
                                 Box constraint level       1
                                 Multiclass method          one-vs-one
                                 Standardize data           True
Medium Gaussian-SVM              Kernel function            Gaussian
                                 Kernel scale               60
                                 Box constraint level       1
                                 Multiclass method          one-vs-one
                                 Standardize data           True
Quadratic-SVM                    Kernel function            Quadratic
                                 Kernel scale               Automatic
                                 Box constraint level       1
                                 Multiclass method          one-vs-one
                                 Standardize data           True
Cubic-SVM                        Kernel function            Cubic
                                 Kernel scale               Automatic
                                 Box constraint level       1
                                 Multiclass method          one-vs-one
                                 Standardize data           True

Therefore, all the experiments in this work are executed using the cubic, medium Gaussian,
and quadratic SVM methods. The J-LDFR method focuses on robust joint feature representation
for pedestrian gender classification using large-scale and small-scale datasets. Pedestrian
gender classification is further investigated in the next methodology as an imbalanced
binary classification problem with SSS datasets.
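As a rough scikit-learn analogue of the SVM settings in Table 3.4 (a sketch, not the original MATLAB implementation), the three best performing variants could be instantiated as below; mapping the kernel scale 60 to gamma = 1/60² is an assumption based on the usual Gaussian-kernel parameterization.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cubic_svm     = make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3, C=1))
quadratic_svm = make_pipeline(StandardScaler(), SVC(kernel='poly', degree=2, C=1))
medium_gauss  = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=1 / 60**2, C=1))

# Typical use: cubic_svm.fit(FFV_train, y_train); cubic_svm.score(FFV_test, y_test)
```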

3.4 Proposed Method for Pedestrian Gender Classification on
Imbalanced and Small Sample Datasets using Parallel and Serial
Fusion of Selected Deep and Traditional Features (PGC-FSDTF)
An approach named pedestrian gender classification using the fusion of selected deep
and traditional features (PGC-FSDTF) is proposed for gender prediction. It consists of
five main steps: 1) data preparation, 2) traditional feature extraction, 3) deep CNN
feature extraction and parallel fusion, 4) features selection and fusion, and 5)
application of classification methods. An overview of the proposed approach is shown in
Figure 3.12. In the subsequent sections, these steps are discussed in detail.

3.4.1 Data Preparation

Data preparation is an essential step in pattern recognition tasks and is widely used to
support later processing stages for accurate predictions. Its purpose is to refine the
information for an accurate match. These refinements are implemented as components of data
preparation, which typically includes data augmentation, data profiling, data
preprocessing, and data cleansing. Therefore, two components of data preparation are opted
for here: 1) data augmentation, and 2) data preprocessing. Data augmentation is utilized
for the equal distribution of data across the gender classes, and data preprocessing is
used to enhance the visual quality and foreground information of the gender image and to
resize it. The description of these steps is given in the subsections.

3.4.1.1 Data Augmentation

Pedestrian analysis datasets such as MIT and PKU-Reid contain SSS in which each pedestrian
appearance is captured from multiple non-overlapping cameras. As a result, the number of
captured views of each pedestrian (male/female) increases the total number of images in
these datasets. These datasets are used for different pedestrian analysis tasks such as
gender prediction, person ReID, and attribute recognition. MIT is a sub-dataset of PETA
that comprises 888 gender images, in which 600 pedestrian images belong to the male class
and 288 pedestrian images belong to the female class. This dataset is widely used for
pedestrian attribute recognition such as clothing style and gender. In the MIT dataset, it
can be observed that the class-wise distribution of data is imbalanced. This inequality
raises two research issues: a) a class-imbalanced dataset and b) an SSS dataset.

Figure 3.12: An overview of the proposed PGC-FSDTF framework for pedestrian
gender classification

For pedestrian gender classification on SSS datasets, another such dataset named PKU-Reid
is chosen. The PKU-Reid dataset is mainly used for pedestrian re-identification [225-227];
it consists of 114 individuals (70 males and 44 females), where the appearance of each
individual is captured in eight directions, and resultantly 1824 images are collected from
two non-overlapping cameras. In this research study, all images of the PKU-Reid dataset are
labeled with the male and female classes. Consequently, 1120 male and 704 female images are
obtained from the total of 1824 images, and this newly prepared dataset is named
PKU-Reid-IB; the class-wise data in this dataset is also imbalanced. Hence, both MIT and
PKU-Reid-IB are IB-SSS datasets

which are used in this work for pedestrian gender classification. As discussed earlier,
class-wise variation in the number of samples creates an imbalanced classification problem
that causes poor predictive performance, specifically for the class with fewer samples. If
the imbalanced classification problem is severe, it is more challenging to devise a robust
approach. Besides, a dataset with a small sample space is another problem for researchers
when training a model. To handle these problems, a data augmentation process is selected to
enhance class-wise data by considering the existing data of one or both gender classes.
Hence, this process is applied to both the imbalanced MIT and PKU-Reid datasets and is
controlled with a random oversampling (ROS) technique to generate synthetic data from
existing gender images.

The reason to choose ROS for data augmentation is two-fold: 1) to handle the
class-imbalanced problem by providing a suitable way to distribute class-wise data equally,
and 2) to increase the class-wise total number of samples up to an appropriate data size.
To form the balanced data, first, the required number of samples is randomly chosen for one
set as described in Table 3.5. Then, four different operations are applied for data
augmentation. The selected operations, namely image filtering, horizontal flipping,
geometric transformation, and brightness adjustment [228], are used to equalize the number
of samples in the gender classes. The randomly selected gender images are resized to
128 × 64, and then the data augmentation operations are performed on the selected samples
of the MIT and PKU-Reid datasets to handle the class-imbalanced problem. The description of
these operations is given below, followed by a short implementation sketch.

a) Image Filtering: Image filtering operation is implemented to produce the filtered


image 𝐼𝐹𝑖𝑚 using a Gaussian smoothing filter with a standard deviation of 1.8.

b) Horizontal Flipping: This operation is applied to generate a horizontally flipped image
HF_im. For example, if IM is assumed to be a matrix taken from an image, then the flip
operation reverses the elements of each row under horizontal flipping.

c) Geometric Transformation: This operation is performed to generate a transformed image
GT_im using a 2-D affine geometric transformation object, which supports two types of
transformations: 1) reverse transformation and 2) forward transformation. In this work, the
forward transformation is applied with the valid 3 × 3 affine transformation matrix
[1 0 0; .3 1 0; 0 0 1]. Resultantly, this operation yields the transformed image according
to the applied geometric transformation object.

d) Adjust Brightness: This operation is applied to adjust image intensity values by stating
contrast limits with low and high values in the range [0, 1]. Here, the low values
[0.2, 0.3, 0] and high values [0.6, 0.7, 1] are used. Finally, the output image BA_im is
acquired, which maps the values of the true-color RGB image to new values.
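The sketch below illustrates the four operations with recent scikit-image (an assumption; the original work uses different tooling). The Gaussian sigma, shear matrix, and intensity limits follow the values quoted above, and the input img is assumed to be an RGB array scaled to [0, 1].

```python
import numpy as np
from skimage import filters, transform, exposure

def filter_op(img):                       # a) Gaussian smoothing, sigma = 1.8
    return filters.gaussian(img, sigma=1.8, channel_axis=-1)

def flip_op(img):                         # b) horizontal flip (reverse each row)
    return img[:, ::-1, :]

def transform_op(img):                    # c) forward affine (shear) transformation
    shear = transform.AffineTransform(matrix=np.array([[1.0, 0.3, 0.0],
                                                       [0.0, 1.0, 0.0],
                                                       [0.0, 0.0, 1.0]]))
    return transform.warp(img, shear.inverse)

def brightness_op(img):                   # d) per-channel contrast stretching
    low, high = (0.2, 0.3, 0.0), (0.6, 0.7, 1.0)
    out = np.empty_like(img, dtype=float)
    for c in range(3):
        out[..., c] = exposure.rescale_intensity(img[..., c],
                                                 in_range=(low[c], high[c]),
                                                 out_range=(0.0, 1.0))
    return out
```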

Later, image filtering, horizontal flipping, geometric transformation, and adjust


brightness operations are referred to as filter, flip, transform, and brightness
respectively. While performing these operations, 1vs1, 1vs4, and both strategies are
utilized to generate synthetic data. Resultantly, three different augmented datasets are
generated based on these strategies. To realize 1vs1 strategy, first four different sets are
prepared from the male class of MIT dataset where 66 non-overlapped images are
randomly chosen for each set. Then a single operation (e.g. flip) is performed by
selecting an image one by one from a given set of male images. It means that against
one image only one operation will be performed instead of all operations as shown in
Figure 3.13. In addition, images of selected set will not be considered for further
operations. As an output, 264 augmented images are generated for the male class. The
detailed description of class-wise selected number of samples for each set, the number
of augmented images with original images, and total images are shown in Table 3.5.
Similarly, different images are collected from the female class of the MIT dataset to
prepare further sets. The female class contains fewer images than the male class, so there
is no option to create four sets from this class for data augmentation. Therefore, the
images of the female class are divided into two different sets of 144 images each,
collected from a total of 288 images. In this scenario, there are
two different sets for data augmentation, so two operations (e.g. filter and flip) are
applied against a single set instead of one operation as applied on selected samples in
case of male class. Hence, two operations of filter and flip are performed on the first
set of 144 images and other two operations of transform and brightness on the second
set of 144 images. Consequently, 576 augmented images are acquired for female class
as shown in Table 3.5. To apply 1vs4 strategy, a single set of 66 images and a single
set of 144 images from male and female classes of MIT dataset are randomly selected.
Then, four operations (filter, flip, transform, and brightness) are done on each image

chosen from the given set of male/ female instead of a single operation. It means that
four images will be generated from a single image as shown in Figure 3.13.

Figure 3.13: Proposed 1vs1 and 1vs4 strategies for data augmentation

As the 1vs1 and 1vs4 strategies are applied to generate augmented data from both gender
classes of the MIT and PKU-Reid datasets, a mixed (1vs1 + 1vs4) strategy is furthermore
adopted for data augmentation. To apply this strategy, the class with the smaller number of
samples is first identified.

In this work, the female class of both the MIT and PKU-Reid datasets has fewer samples than
the male class. Therefore, in the second step, images from only
the female class are randomly selected to create different sets, as described in Table
3.5. Then, all selected operations are executed. For example, using the female class of
PKU-Reid dataset, 1vs1 strategy on 64 images and 1vs4 strategy on 40 images are
applied. Similarly, using MIT dataset, 50 and 28 images are randomly chosen to apply
1vs1 and 1vs4 strategies respectively. Based on these strategies, augmentation is
performed for the balanced distribution of data in both classes to handle the imbalanced
binary classification problem for gender prediction. As this research work uses the
existing IB-SSS datasets together with the newly prepared imbalanced and augmented balanced
SSS datasets for gender classification, this may be considered a novel contribution

to the body of knowledge. Furthermore, impact of data augmentation is analyzed for
pedestrian gender classification.

Table 3.5: Augmentation statistics for imbalanced and small sample MIT, and PKU-
Reid datasets, class-wise selected number of samples in a single set for data
augmentation, and resultantly, total augmented images and total images

                                        Strategies for data augmentation operations
Images and set details      Dataset     1vs1                 1vs4                 Mixed (1vs1 + 1vs4)
                                        Male      Female     Male      Female     Male      Female
Class-wise selected         MIT         66        144        66        144        -         50 (1vs1), 28 (1vs4)
images in a single set      PKU-Reid    45        149        45        149        -         64 (1vs1), 40 (1vs4)
Augmented images            MIT         264       576        264       576        -         200 and 112
in all sets                 PKU-Reid    180       596        180       596        -         256 and 160
Original + augmented        MIT         600+264   288+576    600+264   288+576    600+0     288+312
images                      PKU-Reid    1120+180  704+596    1120+180  704+596    1120+0    704+416
Total images in dataset     MIT         864       864        864       864        600       600
                            PKU-Reid    1300      1300       1300      1300       1120      1120

3.4.1.2 Data Preprocessing

Data preprocessing is an important step for object classification and recognition. In this
work, data preprocessing is used to enhance the visual quality and foreground information
of input images. It also mitigates environmental effects such as illumination variations,
light effects, and poor contrast related issues. For each feature extraction scheme, the
original gender image is resized as described in Table 3.6. The traditional feature
extraction schemes, PHOG and HSV histogram, use grayscale and HSV color spaces,
respectively. For deep feature extraction, both pre-trained CNN models exploit RGB images
as input during the implementation. The PETA, VIPeR, and cross-datasets are used with a
small number of samples. These datasets are more challenging due to their complexity and
imagery variations. Therefore, a contrast adjustment step is implemented to balance high
and low contrast in gender images. Moreover, a median filter is applied to remove noisy
information from the gender images. This step is only applied to the images of the PETA,
VIPeR, and cross-datasets before the execution of all feature extraction schemes.

Table 3.6: Description of preprocessing for each feature representation scheme

Feature type        Original image size    Resized image      Input image
PHOG                Vary                   128 × 64 × 3       Grayscale
HSV_Histogram       Vary                   128 × 64 × 3       HSV
FCL_IRV2            Vary                   299 × 299 × 3      RGB
FCL_DenseNet201     Vary                   224 × 224 × 3      RGB

3.4.2 Traditional Feature Extraction

The traditional feature extraction step is usually used to extract basic level information
such as texture, color, and shape of an input image for detection and classification [68,
229]. In this study, two traditional methods namely pyramid HOG [230] and HSV
histogram are utilized to compute shape and color features. The output of both
traditional feature extraction schemes corresponds to an individual feature vector. The
description of both feature extraction schemes is given in subsequent sections.

3.4.2.1 Pyramid HOG based Feature Extraction

Presently, PHOG based features have been effectively applied for different pattern
recognition tasks such as detection [231, 232], recognition [21, 233], and classification
[234, 235]. PHOG features mainly comprise of edge based features at multiple pyramid
levels. These features play a vital role because they are insensitive to local geometric,
illumination, and pose variations. Therefore, the proposed approach utilizes pHOG
based features to detect distinct characteristics such as clothing style and carrying items
from gender images and classify an image as male or female. PHOG feature extraction
process exploits multiple representations of an input image in a pyramid fashion to
preserve spatial layout information and well-known HOG descriptor to construct the
local shape of this image. PHOG is an advanced version of HOG descriptor. To extract
HOG features, the image is divided into several blocks, and gradients of the image are
computed for each block consisting of 2 × 2 cells. Considering gradient information in
a single block, local histogram of orientations of each cell is calculated. The final HOG
features are then obtained by concatenating cell-wise histogram based extracted
features. The computed HOG features are capable to characterize local shape based
information. This information is then merged with pyramid based spatial layout
information of the gender image for accurate gender prediction. Canny edge detection
approach is opted to realize edge based information present in gender image.

Figure 3.14: PHOG feature extraction scheme

Then, level by level, the full gender image is split into $4^L$ sub-blocks, where
$L \in \{0, 1, 2, 3\}$, so the PHOG feature vector is acquired from a total of
$\sum_{L=0}^{3} 4^L = \frac{4^{4}-1}{3} = 85$ blocks. For example, at level 0, the
whole image is considered as one block because $4^0$ equals one. The histogram will
be computed by considering the complete image as a single block. Similarly, at level 1,
the gender image is split using $4^1$, and resultantly the image is divided into four
sub-blocks. Subsequently, the histograms of these sub-blocks are calculated, giving a total
of five histograms that comprise the previous histogram of level 0 and the four histograms
of level 1. Executing the next levels in the same way, 21 histograms (5 previous and 16 of
level 2) for level 2 and 85 histograms (21 previous and 64 of level 3) for level 3 are
calculated and normalized using the L2-norm. At the different pyramid levels L, the computed
histograms are shown in Figure 3.14. All the steps followed in pHOG feature extraction
scheme are given in Algorithm 3.3.

3.4.2.2 HSV Histogram based Feature Extraction

In recent research studies, the perceptual color space HSV is opted instead of RGB,
Luv, Lab, etc. and then mostly exploited to compute HSV histogram based color
information for image retrieval [236, 237]. This color information can resist multiple
types of changes in an image such as size, direction, rotation, distortion, and noise
[238]. Thus, a HSV histogram based color feature extraction scheme is used for the
gender classification task.

Algorithm 3.3: PHOG feature extraction
Input: Training and testing images
Output: Normalized pyramid HOG feature vector (PHOG_FV)
Begin
Step# 1: Initially, set the values of the required parameters: levels L = 3,
histogram bin size bs = 8, angle = 360, and roi = [1; 128; 1; 64].
Step# 2: Convert resized input RGB image into a grayscale image.
Step# 3: Compute edge features from grayscale image using a canny edge
detector.
Step# 4: Obtain a matrix with histogram values (𝑚ℎ𝑣 ) and a matrix with
gradient values (𝑚𝑔𝑣 ).
Step# 5: Using 𝑚ℎ𝑣 , 𝑚𝑔𝑣 , 𝐿, 𝑎𝑛𝑔𝑙𝑒 and 𝑏𝑠 , compute PHOG features over
given 𝑟𝑜𝑖.
Step# 6: Normalize computed PHOG features.
Step# 7: Finally, obtained normalized pHOG feature vector (PHOG_FV) with
the dimension of 1 × 680.
End
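The following NumPy/scikit-image sketch mirrors Algorithm 3.3 under stated assumptions (gradient orientations weighted by magnitude on Canny edge pixels over a 360° range): 8-bin orientation histograms over 1 + 4 + 16 + 64 = 85 pyramid blocks give the 1 × 680 PHOG vector.

```python
import numpy as np
from skimage import color, feature

def phog_fv(rgb_img, levels=3, bins=8):
    """680-D PHOG sketch: per-block 8-bin orientation histograms over a spatial
    pyramid (85 blocks), using gradients kept only on Canny edge pixels."""
    gray = color.rgb2gray(rgb_img)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.degrees(np.arctan2(gy, gx)), 360.0)
    mag = mag * feature.canny(gray)                       # keep edge responses only

    fv = []
    for L in range(levels + 1):
        n = 2 ** L                                        # n x n grid at level L
        for rows in np.array_split(np.arange(gray.shape[0]), n):
            for cols in np.array_split(np.arange(gray.shape[1]), n):
                a = ang[np.ix_(rows, cols)].ravel()
                w = mag[np.ix_(rows, cols)].ravel()
                h, _ = np.histogram(a, bins=bins, range=(0, 360), weights=w)
                fv.append(h)
    fv = np.concatenate(fv)                               # 85 blocks x 8 bins = 680
    return fv / (np.linalg.norm(fv) + 1e-12)              # L2 normalisation
```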

HSV histogram based color features are acquired through three steps: 1) convert the gender
RGB image into the HSV color space using Eqs. (3.60)-(3.62), where H represents hue, S
represents saturation, and V represents brightness (value); 2) perform color quantization
to decrease the feature vector dimension and reduce computational complexity, where the
quantization process minimizes the number of colors and levels used in an image; and 3)
compute the histogram of each quantized image according to the applied intervals for the H,
S, and V channels with 8, 2, and 2 bins, respectively. The computed histogram shows the
frequency distribution of the quantized HSV values of each pixel in a given image, as shown
in Figure 3.15.

Resultantly, normalized HSV histogram based feature vector (HSV-Hist_FV) is


obtained with a dimension of 1 × 32 by concatenating 8 × 2 × 2. The mathematical
representation of H, S, and V channels is given in Eqs. (3.60)-(3.62).

$$H = \cos^{-1}\left(\frac{\frac{1}{2}\left[(R-G)+(R-B)\right]}{\sqrt{(R-G)^2+(R-B)(G-B)}}\right) \qquad (3.60)$$

$$S = 1 - \frac{3\left[\min(R,G,B)\right]}{R+G+B} \qquad (3.61)$$

$$V = \frac{R+G+B}{3} \qquad (3.62)$$
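A compact sketch of these three steps (assuming scikit-image for the RGB-to-HSV conversion, which uses the standard hexcone formulas rather than Eqs. (3.60)-(3.62)) is given below; it produces the 32-D (8 × 2 × 2) normalized histogram.

```python
import numpy as np
from skimage.color import rgb2hsv

def hsv_hist_fv(rgb_img):
    """32-D HSV colour histogram (8 x 2 x 2 bins), L1-normalised."""
    hsv = rgb2hsv(rgb_img)                               # H, S, V each in [0, 1]
    hist, _ = np.histogramdd(hsv.reshape(-1, 3), bins=(8, 2, 2),
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / hist.sum()                             # frequency distribution
```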

Figure 3.15: HSV histogram based color features extraction

3.4.3 Deep Convolution Neural Networks

CNN architectures can automatically learn different types of distinct properties by using
the backpropagation method. Typically, these architectures are assembled with
different building blocks such as CLs, PLs, transition layers (TL), an activation unit
(AU), and FC layers. The convolutional block is formed of a CL, an AU, and a PL, in which
the task of the CL is to convolve the input image with the nominated kernel. The kernel
size may be set to 3 × 3, 5 × 5, or 7 × 7 pixels. Thus, the filter is applied to the input
of the next layer (n_2 × n_3). Activation maps with distinctive features are produced as a
result of the convolution process. Each CL enables the use of a filter. The output Z_x^l of
layer l comprises n_1^l feature maps of size n_2^l × n_3^l. The x-th feature map, denoted
Z_x^l, is computed using Eq. (3.63), where b_x^l and f_{x,y}^l denote the bias matrix and
filter size, respectively.

$$Z_x^l = f_m\left(b_x^l + \sum_{y=1}^{f_x^{l-1}} f_{x,y}^{l} \times Z_x^{l-1}\right) \qquad (3.63)$$

The PL diminishes the image size while keeping the image information intact, and it also
limits the number of parameters. A PL has two settings: (1) a particular filter size f^l
and (2) an overlapping/non-overlapping sliding window sw^l. The PL takes input data of size
n_1^{l-1} × n_2^{l-1} × n_3^{l-1} and yields an output of size n_1^l × n_2^l × n_3^l. In
short, the operation of the PL is shown in Eqs. (3.64)-(3.66).

𝑛1𝑙 = 𝑛1𝑙−1 (3.64)

𝑛2𝑙 = (𝑛2𝑙−1 − 𝑓 𝑙 )⁄𝑠𝑤 𝑙 + 1 (3.65)

𝑛3𝑙 = (𝑛3𝑙−1 − 𝑓 𝑙 )⁄𝑠𝑤 𝑙 + 1 (3.66)

Moreover, a nonlinear unit named ReLU allows complex relationships in the data to be
learned. The FC layers flatten the deeply learned information from the earlier layers into
a single vector. They also implement the update of weights and provide a value against each
label. The fully connected layers in a CNN essentially form a multilayer perceptron mapping
n_1^{l-1} × n_2^{l-1} × n_3^{l-1}. The operational procedure of the FC layers is shown in
Eq. (3.67).

$$y_x^l = f(Z_x^l) \quad \text{with} \quad Z_x^l = \sum_{y=1}^{f_1^{l-1}} w_{x,y}^{l} \, y_x^{l-1} \qquad (3.67)$$

3.4.3.1 Pre-trained Deep CNNs and Fine Tuning

Deep CNNs usually perform better on large-scale databases than on SSS databases.
Already-trained CNN models are easily accessible to the public because they offer learned
kernels and weights which can be applied directly or indirectly in different object
recognition [239], classification [240], segmentation [241], and detection [242] tasks of
computer vision. CNN models are mainly trained on large-scale datasets to classify 1000
different object classes. But these models are not stable with SSS datasets due to the
non-availability of large-scale data for model learning; therefore, model learning is a key
issue while investigating SSS datasets. In this
scenario, two ways are reported in literature to reuse already-trained deep CNN models:
(1) transfer learning with fine-tuning method and (2) transfer learning with deep feature
extractor method given in Figure 3.16.

Transfer learning with fine-tuning is often used to solve SSS dataset problem where a
model is already-trained on a very large-scale dataset, for instance, ImageNet [128,
243]. It is a much faster and more suitable procedure to train a model on new data than to
train a model from scratch. On the other hand, already-trained CNN models are utilized as
deep feature extractors, and their applications are also reported in the present research
for different pattern recognition tasks [244, 245]. In this context, existing studies have
utilized, independently or in combination, the early, middle, and last FC layer(s) of an
already-trained CNN model to extract deeply learned information from the input image and

then used this information or subset of information to train a new classifier such as
SVM and KNN.

Figure 3.16: Different ways to deploy pre-trained deep learning models


Deep feature extraction is a simple procedure in which a single round is required on the
training samples to acquire deeply learned information. Since, in this study, SSS datasets
are investigated for gender classification, and these datasets are more challenging due to
the complex and diverse appearances of gender in images, two already-trained deeper CNN
models, InceptionResNetV2 (IRNV2) and DenseNet201 (DN201), are selected and utilized as
deep feature extractors. There are two reasons to choose deeper networks: (1) these
networks can extract abstract representations of the input at each layer, and (2) they have
the ability to learn discriminative information that is significantly expressive and
general. The description of the selected IRNV2 and DN201 deep CNN
models is given as follows.

InceptionResNetV2 is a CNN model 164 layers deep, trained on a huge amount of data (images)
collected from the ImageNet database [246]. It is a hybrid model that combines the
Inception structure with residual connections. The model takes images of size
299 × 299 × 3 at the input layer, and its output provides rich learned information and an
estimated value against each class. The benefits of IRNV2 include converting inception
modules to residual inception blocks, adding more inception modules, and adding a new type
of inception module (inception-A) after the stem
module. A schematic view of InceptionResNetV2 architecture (compressed) is shown
in Figure 3.17.

Figure 3.17: Schematic view of InceptionResNetV2 model (compressed)

DenseNet201 is a CNN model 201 layers deep; it accepts an input size of 224 × 224 × 3 and
has been evaluated on the SVHN, CIFAR-10, CIFAR-100, and ImageNet databases [247]. This
network was designed to achieve deep and wide CNN architectures that can be useful to
enhance the performance of deep CNNs. In this way, DenseNet (DN) is an improvement over
ResNet (RN) that comprises dense connections between layers to transfer collective
knowledge for feature reusability.

Figure 3.18: Schematic view of DenseNet201 model (compressed)

Hence, the network layer obtains information from all preceding layers and sends it to
all subsequent layers. This activity encourages maximum flow of information from a
particular layer to the next layer including features reuse in a network. Unlike
traditional convolutional networks with 𝑙 layers having 𝑙 connections, DN has
l(l + 1)/2 direct connections. Moreover, DN can improve performance by alleviating the
vanishing gradient problem and through implicit deep supervision, model compactness, and a
reduced parameter count. A schematic view of the DenseNet201 architecture is
given in Figure 3.18.

3.4.3.2 Deep CNN Feature Extraction and Parallel Fusion

This section covers the discussion on deep CNN feature extraction from two already-
trained models and their fusion. As mentioned above, two different deep networks
IRNV2 and DN201 are selected for deep CNN feature extraction. It is noticeable that
the networks with different architectures have strength to acquire diverse characteristics
because of their different depths and structures. Furthermore, an effective ImageNet
output does not pass to further assignments, therefore, multiple networks may be
required. Keeping this, both networks are utilized in this work, where FC layer of both
networks is considered for deep feature extraction. The common thing about these
deeper networks is that they have only a FC layer at last with a feature matrix of size
1 × 1000 as shown in Figure 3.19. Consequently, two deep feature vectors
IRNV2_FCL and DN201_FCL are represented using InceptionResNetv2 and
DenseNet201 CNN models, respectively.

The use of deep feature fusion is apparent for the following two reasons: (1) it is useful
for acquiring powerful and rich features from two different models as compared to obtaining
them from an individual model, and (2) parallel fused features are more expressive because
they reflect the distinct properties of two different deep networks. Keeping in view these
benefits, the deeply learned features of both models are merged by applying parallel fusion
based on two procedures: (1) maximum score-based fusion and (2) average score-based fusion.
The objective of this research work is to represent the gender image with joint distinctive
and average feature depictions from two diverse depths. Both feature fusion procedures
follow two steps to compute the maximum score-based and average score-based deep feature
vectors: (1) utilization of an overlapped sliding window of size 2 × 2 over both deep
feature vectors (the first

row values belong to IRNV2_FCL feature vector and second row values belong to
DN201_FCL feature vector), and (2) implementation of both methods to compute
maximum score-based deep feature vector (MaxDeep_FV), and average score-based
deep feature vector (AvgDeep_FV).

Figure 3.19: Deep CNN feature extraction and parallel fusion


Moreover, parallel fusion transforms both deep feature representations into a single novel,
richer, and more powerful feature vector, which keeps the information of both models using
the abovementioned procedures. Mathematically, the formulas of both feature fusion
procedures are given in Eqs. (3.68)-(3.72). Let X and Y denote the matrices which contain
the extracted deep features of the IRNV2 and DN201 models, respectively.

𝑋 = [𝑥𝑖,𝑗 ] 𝑁×𝑀 𝑖 = 1 𝑡𝑜 𝑁, 𝑗 = 1 𝑡𝑜 𝑀 (3.68)

𝑌 = [𝑦𝑖,𝑗 ] 𝑁×𝑀 𝑖 = 1 𝑡𝑜 𝑁, 𝑗 = 1 𝑡𝑜 𝑀 (3.69)

where N represents the total number of samples, and M represents the dimension of the deep
feature vector. Before the fusion operations, a feature concatenation Z_k is formed to hold
the extracted deep information of both models. The mathematical representation is provided
in Eq. (3.70).

$$Z_k = \begin{bmatrix} x_{k,j} \\ y_{k,j} \end{bmatrix}, \quad k = 1\ \text{to}\ N,\ j = 1\ \text{to}\ M \qquad (3.70)$$

For each value of 𝑘, a 2 × 𝑀 matrix 𝑍𝑘 is obtained. Let us choose any of 𝑍𝑘 of


dimension 2 × 𝑀 and apply maximum and average score-based operations using Eq.
(3.71) and Eq. (3.72).

$$MaxDeep\_FV = \max_{1 \le q \le M-1} \begin{bmatrix} a_{1,q} & a_{1,q+1} \\ a_{2,q} & a_{2,q+1} \end{bmatrix} \qquad (3.71)$$

$$AvgDeep\_FV = \mathop{avg}_{1 \le q \le M-1} \begin{bmatrix} a_{1,q} & a_{1,q+1} \\ a_{2,q} & a_{2,q+1} \end{bmatrix} \qquad (3.72)$$

The maximum score-based fusion procedure chooses the maximum response from the
generated feature window of size 2 × 2. Similarly, average score-based fusion
procedure computes the average response of the generated feature window of size 2 × 2. Both
procedures are executed step by step with an overlapped sliding window; the response of
each feature window is stored to produce merged information, resultantly with a dimension
of 1 × 1000 for both MaxDeep_FV and AvgDeep_FV. The fused deep features deal with different
challenges such as pose changes, viewpoint variations, and environmental effects because of
the rich and powerful information of the two deep networks.
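A NumPy sketch of the two fusion procedures is shown below; note that, as an implementation detail not fixed in the text, the last column is repeated so that stride-1 2 × 2 windows over the 2 × M stack still yield M fused responses (1 × 1000).

```python
import numpy as np

def parallel_fuse(irnv2_fcl, dn201_fcl):
    """Maximum- and average-score parallel fusion of two 1 x M deep feature
    vectors via an overlapping 2 x 2 window with stride 1 (Eqs. 3.70-3.72)."""
    Z = np.vstack([irnv2_fcl, dn201_fcl])                 # 2 x M stack (Eq. 3.70)
    Z = np.hstack([Z, Z[:, -1:]])                         # repeat last column -> M windows
    windows = np.stack([Z[:, :-1], Z[:, 1:]], axis=-1)    # 2 x M x 2 window blocks
    max_deep_fv = windows.max(axis=(0, 2))                # MaxDeep_FV, 1 x M
    avg_deep_fv = windows.mean(axis=(0, 2))               # AvgDeep_FV, 1 x M
    return max_deep_fv, avg_deep_fv
```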

3.4.4 Features Selection and Fusion

One of the important factors to enhance classification rates is the accessibility and
utilization of distinct features from the extracted feature vector. It is notable that
extracted feature vectors consist of high dimensions and have irrelevant information
not related to robust modeling. Moreover, this irrelevant information not only decreases
overall performance of classifier but also increases computational cost. In existing
studies, there are different features selection techniques such as CCA, entropy, PCA,
and DCA commonly applied for dimensionality reduction. These techniques also
provide a suitable way to select OFS from a large feature vector by discarding irrelevant
information. Therein, few techniques have been implemented to compute the
relationship between two or more feature representations. Features selection is a key
step that is implemented to improve the classifier performance without much loss of

computed features. Keeping in view these reasons, PCA and entropy based features
selection methods are selected to use in this study before feature fusion of applied
feature extraction schemes. The prime objective of applying a features selection method is
to choose an OFS from the applied feature vector(s), which not only preserves the distinct
representations that were part of the original feature vector but also reduces the
dimension after eliminating the redundancy and noise present in it. Entropy and PCA are
straightforward methods; PCA, in particular, provides the benefit of minimizing
reconstruction error without strong assumptions about how the reduced feature vector is
utilized. Thus, the contribution of each applied feature vector is selected from the PCA
and entropy controlled feature vector(s) and explored for the pedestrian gender
classification task.
task.

The proposed approach utilizes the traditional feature vectors of size 𝑁 × 680 and 𝑁 ×
32 obtained from gender images with PHOG and HSV histogram based feature
extraction schemes respectively. In addition, deep learned information at FC layer of
two deeper networks IRNV2 and DN201 is used as deep features as presented in the
previous subsection with the name of IRNV2_FCL and DN201_FCL deep feature
vectors. Then, a parallel fusion of deep feature vectors is implemented to combine
deeply learned information of both networks. As a result, two fused feature vectors are
generated and used as MaxDeep_FV and AvgDeep_FV, each of size 𝑁 × 1000 such
that N denotes the total number of images in the selected dataset. The fused deep feature vector is more expressive and has a more distinct representation because it contains maximum and average score-based features from two deeper networks instead of a single one. These deeper-network based fused features help to reduce the false positive rate caused by large appearance variations under non-overlapping camera settings.

Let all extracted feature vectors PHOG_FV, HSV-Hist_FV, MaxDeep_FV, and AvgDeep_FV be denoted as 𝐹𝑉𝑃𝐻𝑂𝐺, 𝐹𝑉𝐻𝑆𝑉−𝐻𝑖𝑠𝑡, 𝐹𝑉𝑀𝑎𝑥𝐷𝑒𝑒𝑝, and 𝐹𝑉𝐴𝑣𝑔𝐷𝑒𝑒𝑝, respectively; they can be expressed through Eqs. (3.73)-(3.76).

𝐹𝑉𝑃𝐻𝑂𝐺 = [𝑓1 , 𝑓2 , … , 𝑓𝑛 ] (3.73)


𝐹𝑉𝐻𝑆𝑉−𝐻𝑖𝑠𝑡 = [𝑓1 , 𝑓2 , … , 𝑓𝑛 ] (3.74)
𝐹𝑉𝑀𝑎𝑥𝐷𝑒𝑒𝑝 = [𝑓1 , 𝑓2 , … , 𝑓𝑛 ] (3.75)

𝐹𝑉𝐴𝑣𝑔𝐷𝑒𝑒𝑝 = [𝑓1 , 𝑓2 , … , 𝑓𝑛 ] (3.76)

where $f$ denotes a numerical feature value and $n$ the dimension of the feature vector. As mentioned earlier, the entropy based feature selection defined in Eqs. (3.49)-(3.58), which responds to a specific feature vector with the maximum numerical values, is utilized. The mathematical representation is given in Eqs. (3.77)-(3.80).

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑃𝐻) = 𝐸(𝐹𝑉𝑝𝐻𝑂𝐺 , ῤ) (3.77)

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻𝐻) = 𝐸(𝐹𝑉𝐻𝑆𝑉−𝐻𝑖𝑠𝑡 , ῤ) (3.78)

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑀𝐷) = 𝐸(𝐹𝑉𝑀𝑎𝑥𝐷𝑒𝑒𝑝 , ῤ) (3.79)

𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐴𝐷) = 𝐸(𝐹𝑉𝐴𝑣𝑔𝐷𝑒𝑒𝑝 , ῤ) (3.80)

where 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑃𝐻), 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻𝐻), 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑀𝐷), and 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐴𝐷) represent


entropy based information of feature space PHOG, HSV histogram, maximum deep,
and average deep respectively. 𝐸 represents the sorting function and ῤ denotes the
descending order operation. After extensive experimentation, the maximum score-based top 200 features are chosen from each of the entropy controlled feature vectors PH, MD, and AD, while the top 30 features are selected from the HSV histogram based feature vector HH. Mathematically, the selection of top feature subsets is performed through Eqs. (3.81)-(3.84).

𝑃𝐻𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑃𝐻), ẟ) (3.81)

𝐻𝐻𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻𝐻), ẟ) (3.82)

𝑀𝐷𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑀𝐷), ẟ) (3.83)

𝐴𝐷𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐴𝐷), ẟ) (3.84)

where 𝑃𝐻𝑣, 𝐻𝐻𝑣, 𝑀𝐷𝑣, and 𝐴𝐷𝑣 indicate the selected subsets of features and ẟ represents the size of the top features set selected from the computed entropy based information.
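A minimal sketch of this entropy controlled selection is given below (Python/NumPy). The per-feature entropy score is computed here from a histogram of the feature values, which is an assumption made only for illustration; the exact scoring follows Eqs. (3.49)-(3.58) of this thesis.

    import numpy as np

    def entropy_select(FV, delta, bins=32):
        # FV: N x n feature matrix; keep the top `delta` columns ranked by entropy score
        n = FV.shape[1]
        scores = np.empty(n)
        for j in range(n):
            hist, _ = np.histogram(FV[:, j], bins=bins)
            p = hist / max(hist.sum(), 1)
            p = p[p > 0]
            scores[j] = -(p * np.log2(p)).sum()      # Shannon entropy of column j
        order = np.argsort(scores)[::-1]             # descending order (the role of ῤ)
        keep = order[:delta]                         # top ẟ features
        return FV[:, keep], keep

    # e.g. PH_v = entropy_select(FV_phog, 200)[0]; HH_v = entropy_select(FV_hsv_hist, 30)[0]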

Similarly, the score-based top 200 features are selected from each PCA controlled feature vector PH, MD, and AD, and the top 30 features are taken from the HSV histogram based feature vector HH. Mathematically, the selection of top feature subsets (FSs) is performed through Eqs. (3.85)-(3.88).

𝑃𝐻𝑣 = 𝑃𝐶𝐴(𝑃𝐻, ẟ) (3.85)

𝐻𝐻𝑣 = 𝑃𝐶𝐴(𝐻𝐻, ẟ) (3.86)

𝑀𝐷𝑣 = 𝑃𝐶𝐴(𝑀𝐷, ẟ) (3.87)

𝐴𝐷𝑣 = 𝑃𝐶𝐴(𝐴𝐷, ẟ) (3.88)

where 𝑃𝐻𝑣 , 𝐻𝐻𝑣 , M𝐷𝑣 , and 𝐴𝐷𝑣 indicate the selected subset of features and ẟ
represents the top features set from computed PCA based information.
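A corresponding sketch of the PCA controlled selection is shown below, using scikit-learn's PCA. Projecting each feature vector onto its leading ẟ principal components is one plausible reading of Eqs. (3.85)-(3.88) and is given only as an assumption-labelled illustration.

    from sklearn.decomposition import PCA

    def pca_select(FV, delta):
        # FV: N x n feature matrix; return the scores on the top `delta` principal components
        return PCA(n_components=delta).fit_transform(FV)

    # e.g. MD_v = pca_select(MaxDeep_FV, 200); HH_v = pca_select(FV_hsv_hist, 30)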

Figure 3.20: FSDTF for gender prediction


For pedestrian gender classification, there is little chance of achieving better performance with a single extracted feature vector, either before or after feature selection. Therefore, a feature fusion method is preferred to obtain a compact representation of all FSs and improved performance in the classification task. The proposed approach takes distinct information from each feature vector and serially fuses the entropy and PCA based selected OFSs for gender prediction, as shown in Figure 3.20. The mathematical form of the FFV is provided in Eq. (3.89).

𝐹𝐹𝑉1×𝑑 = [𝑃𝐻𝑣 , 𝐻𝐻𝑣 , 𝑀𝐷𝑣 , 𝐴𝐷𝑣 ] (3.89)

where $FFV_{1\times d}$ denotes the fused feature vector of a single image with dimension $d = 630$, and $FFV_{N\times d}$ represents the fused feature vectors of all $N$ sample images. The selected FFV is then supplied to different classifiers for classification. This work uses supervised learning methods that require training data for gender prediction. The performance of the proposed approach is tested separately with both feature selection methods; empirical analysis shows that the PCA based selected OFSs outperform the entropy based selected OFSs for gender prediction (see results and evaluation section 4.4.3).
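As a worked example of Eq. (3.89), the sketch below serially (horizontally) concatenates the four hypothetical selected subsets, giving the stated dimension d = 200 + 30 + 200 + 200 = 630 per image.

    import numpy as np

    def serial_fuse(PH_v, HH_v, MD_v, AD_v):
        # PH_v: N x 200, HH_v: N x 30, MD_v: N x 200, AD_v: N x 200  ->  FFV: N x 630
        FFV = np.hstack([PH_v, HH_v, MD_v, AD_v])
        assert FFV.shape[1] == 200 + 30 + 200 + 200    # d = 630, as in Eq. (3.89)
        return FFV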

3.4.5 Classification Methods

This section describes the selected SVM classifier, which is commonly used for classification problems on text and image datasets. The main reasons for choosing the SVM classifier are: 1) it improves the generalization ability of the learning machine and gives reliable classification rates on large-scale and even SSS data, 2) it offers efficient prediction speed, especially for binary classification tasks, a flexible model, and memory efficiency, and 3) it is capable of overcoming over-learning [248, 249]. The SVM classifier has a set of kernel functions to transform the data into the targeted form. Keeping this in view, this work utilizes linear, quadratic, cubic, and medium Gaussian kernels with SVM. Aiming to establish a reliable classification model for gender prediction, extensive experimentation is done on SVM based classification methods, namely linear SVM (L-SVM), quadratic SVM (Q-SVM), cubic SVM (C-SVM), and medium Gaussian SVM (M-SVM), to test the performance of the proposed PGC-FSDTF approach. All these methods are executed under their default settings.
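For reference, an approximate scikit-learn analogue of these four SVM variants is sketched below. It is not the MATLAB setup used in this work and its default hyperparameters differ; the kernel choices (linear, degree-2 and degree-3 polynomial, RBF) are the assumed counterparts of L-SVM, Q-SVM, C-SVM, and M-SVM.

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    svm_variants = {
        "L-SVM": SVC(kernel="linear"),
        "Q-SVM": SVC(kernel="poly", degree=2),
        "C-SVM": SVC(kernel="poly", degree=3),
        "M-SVM": SVC(kernel="rbf"),
    }

    def evaluate_svms(FFV, labels):
        # 10-fold cross-validated accuracy for each kernel, mirroring the evaluation protocol
        return {name: cross_val_score(make_pipeline(StandardScaler(), clf),
                                      FFV, labels, cv=10, scoring="accuracy").mean()
                for name, clf in svm_variants.items()}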

3.5 Summary
In this research work, the full-body appearance of pedestrians is examined for the person ReID and pedestrian gender classification tasks. Person ReID consists of feature representation, features-based clustering, and feature matching, whereas pedestrian gender classification consists of data preparation, feature extraction/learning, feature selection, fusion, and classification. Considering existing challenges and limitations, three methods are proposed in this dissertation for full-body appearance based pedestrian analysis, as described in this chapter. The significance of the proposed methodologies is summarized below:

 Method I performs person ReID on full-body diverse appearances of
pedestrians. This method presents a robust way to re-identify the pedestrians,
where cluster-based probe matching is carried out to optimize gallery search.
To the best of our knowledge, the FFS approach is proposed for the first time for cluster formation in ReID. The recognition rates are improved by acquiring PCA based deep features of each cluster sample, which are fused with the low-level features subset. Cluster-based probe matching, instead of searching the whole gallery, further improves the recognition rates.
 Method II performs pedestrian gender classification using low-level and deep
feature representations, jointly. The method is applicable for both large-scale
and small-scale datasets. The features selection on the combination of low-level
and deep feature representations is used to enhance O-ACC, CW-ACC, and
AUC.
 Method III performs pedestrian gender classification on imbalanced,
augmented, and customized balanced SSS datasets. The color and PHOG
features are fused with deep features acquired from deeper CNN networks in a parallel manner. Two different feature selection methods are used to investigate the performance of the proposed method for gender prediction on SSS datasets. A unique way of data augmentation is also proposed to handle the imbalanced binary classification problem. In addition, thirteen different datasets are also prepared from different existing datasets in the pedestrian analysis domain to investigate the robustness of the proposed method.

Chapter 4 Results and Discussions
4.1 Introduction
This chapter presents the experimental setup and results obtained after empirical
analysis of proposed methodologies for person ReID and pedestrian gender
classification tasks (as discussed in chapter 3). The evaluation of each task is described
separately in subsequent sections. Figure 4.1 provides the organization of this chapter
including section wise highlights. Firstly, performance evaluation protocols and
implementation settings are described.

Figure 4.1: Block of chapter 4 including section wise highlights

Then, the publicly available datasets used in this study are presented in detail. Finally, the results of all proposed methodologies are discussed and compared with state-of-the-art methods in the literature. All methods are implemented in MATLAB 2019a and executed on a desktop system with a Core i5-7400 CPU, 16 GB of RAM, and a GeForce GTX 1080 GPU.

4.2 Proposed Method for Person Re-identification with Features-based Clustering and Deep Features (FCDF)
In the following subsections, the evaluation protocols, implementation settings, and selected datasets for person ReID are described. Afterwards, the computed results are tabulated, compared with state-of-the-art methods, and discussed.

4.2.1 Performance Evaluation Protocols and Implementation Settings for Person ReID

The commonly used protocol named cumulative matching characteristics (CMC) [97]
is considered as standard metric for person ReID research evaluation. In the evaluation
process, initially, all datasets are separated into training and testing sets. Besides, the
training set (gallery images) of all datasets is divided into k consensus clusters. Then,
random initialization of cluster center is performed 150 times to address the problem of
selection of cluster center. For suitable number of clusters, value of k = 6 is obtained,
which is computed using self-tuning algorithm [194] across selected datasets. Later on,
probe image is matched with the classified cluster to increase the retrieval probability
of most similar image as compared to retrieving a similar image from the whole gallery.
For each test, single-trial CMC results are recorded at all ranks of true matches. This validation process is repeated ten times (10 trials), and the mean results are presented using CMC curves to show stable statistics. To assess the proposed FCDF framework,
316, 486, and 150 pedestrian images are selected randomly from VIPeR, CUHK01, and
iLIDS-VID datasets respectively for training and testing.
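A minimal sketch of the CMC computation used for reporting is given below (Python/NumPy). It assumes a single-shot setting in which the true gallery match of probe i sits at gallery index i; averaging the per-trial curves over the ten random splits yields the mean CMC reported in this chapter.

    import numpy as np

    def cmc(dist, ranks=(1, 10, 20, 50)):
        # dist[i, j]: distance between probe i and gallery j (true match assumed at j = i)
        order = np.argsort(dist, axis=1)
        true_rank = np.argmax(order == np.arange(dist.shape[0])[:, None], axis=1) + 1
        return {r: float((true_rank <= r).mean()) for r in ranks}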

4.2.2 Pedestrian Analysis Datasets Presentation for Person Re-identification

The datasets evaluated in this study are the benchmark, publicly available VIPeR [97], CUHK01 [113], and iLIDS-VID [80] datasets. More details of these datasets are provided in the following subsections. All of them are challenging due to different camera settings, viewpoint changes, low contrast/resolution, and variations in a person's pose across observations. These issues make person ReID a more challenging problem.

4.2.2.1 VIPeR Dataset Presentation

The single-shot VIPeR (viewpoint invariant pedestrian recognition) dataset comprises


of 632 pedestrian objects. It contains two different views of each object captured from
two separate cameras: CAM1 and CAM2. All the views captured using former are
considered as gallery set while those captured using latter are considered as probe set.
For recognition, each view in the probe set is matched against the gallery set and the gallery images are then ranked by their matching rate to the probe. The views are challenging because the viewpoint angle changes from 45° to 180°, together with different camera settings, poses, and lighting effects. All views are normalized to 128×48 pixels. Currently, the VIPeR dataset is widely used for the single-shot person ReID problem. In this work, 316 images are selected as the gallery set and as many images as the probe set for single-shot person ReID.

4.2.2.2 CUHK01 Dataset Presentation

The multi-shot CUHK01 (China University of Hong Kong 01) [113] dataset consists
of 971 identities taken from two disjoint cameras CAM1 and CAM2 in a campus
environment. CAM1 captures front and back views of each pedestrian and CAM2
captures their side views (left and right). As a whole, this dataset contains four different
views of each identity with a total of 3884 pedestrian images and is challenging because
of viewpoint and pose changes. For experiments, it is split into two equal sets of 486
images as a gallery set as well as probe set.

4.2.2.3 iLIDS-VID Dataset Presentation

The imagery library for intelligent detection systems video re-identification (i-LIDS-
VID) [80] is considered as a benchmark dataset for video analytics (VA) systems. It
has been collected by center for applied science and technology (CAST) in partnership
with center for the protection of national infrastructure (CPNI). The i-LIDS contains a
library of CCTV video footage based around “scenarios” central to UK government’s
requirements. The footage accurately represents real operating conditions and potential
threats. This dataset consists of two different scenarios: 1) event detection scenario
comprising of parked vehicle, doorway surveillance, sterilized zone and abandoned

baggage and new technology scenario, and 2) tracking scenario which includes multiple
camera tracking scenarios (MCTS) holding near to 50 hours of footage provided by five
cameras deployed in an airport. From this, subset of images is selected in this work for
the iLIDS-VID dataset, which contains 600 images of 300 identities such that each
identity has two images captured under non-overlapping camera settings. This dataset
is highly challenging because of different photometric variations and environmental
effects such as diverse occlusions, large variations in appearance from a camera to
another one, people wearing backpacks, carrying luggage or pushing carts. Table 4.1
describes main particulars in the form of total images, image pair, image size, and
challenges of selected datasets for person ReID.
Table 4.1: Statistics of datasets for person ReID

Name            Total images                Image pair   Image size   Challenges
VIPeR [97]      1264 (632 identity pairs)   316          128×48       Pose variations, occlusion and illumination changes.
CUHK01 [113]    3884 (971 identity pairs)   486          160×60       Front, back, and side views with illumination changes and pose variations.
iLIDS-VID [80]  600 (300 identity pairs)    150          128×48       Clothing similarity, partial occlusion, illumination changes and background clutter.

4.2.3 Results Evaluation

For the evaluation of proposed work, initially different combinations of deep features
are fused with optimal features subset OFSselect. For this purpose, all selected datasets
are taken into account and recognition rate is computed at different ranks (rank-1, rank-
10, rank-20, and rank-50), where probe image is to be matched within the classified
cluster(s) instead of whole gallery set. In subsequent tables, top recognition rates at
each rank are written in bold.

4.2.3.1 Proposed FCDF Framework Results

The experimental results using the concatenation of optimal handcrafted and deep
features are shown in Tables (4.2)-(4.4). During experiments, different number of top
PCA based deep features are chosen to find the best deep feature combination with
OFselect . Outcomes of experiments depict that variation in number of deep features
helps to find suitable set of deep features to provide the best recognition rates.

Table 4.2: Experimental results using deep features (from higher to lower dimension)
on VIPeR dataset
Ranks
Selected deep features
1 10 20 50
4000 30.2 72.1 80.0 86.8
3000 33.6 76.6 82.2 89.1
2000 36.6 79.4 87.6 92.2
1500 40.0 84.2 89.4 95.4
1000 46.8 87.2 93.2 98.5
750 43.6 86.6 92.0 97.9

Table 4.3: Experimental results using deep features (from higher to lower dimension)
on CUHK01 dataset
Ranks
Selected deep features
1 10 20 50
4000 36.7 75.1 87.6 90.1
3000 39.5 76.0 88.3 92.2
2000 41.7 78.0 89.9 93.6
1500 43.3 79.6 91.1 94.5
1000 48.1 81.6 92.4 96.1
750 47.2 80.2 90.1 95.0

Table 4.4: Experimental results using deep features (from higher to lower dimension)
on i-LIDS-VID dataset
Ranks
Selected deep features
1 10 20 50
4000 33.3 72.4 79.2 87.7
3000 35.0 73.8 82.4 90.3
2000 36.7 75.0 83.1 92.8
1500 38.9 77.1 84.8 93.5
1000 40.6 78.5 86.9 93.8
750 47.2 80.2 90.1 95.0

Moreover, this process removes irrelevant and redundant information through the PCA based approach. According to these results, 1000 deep features combined with the handcrafted OFS provide distinct and sufficient information for the single-shot person ReID process. Thus, the contribution of deep features with the OFS is desirable for cluster-based probe matching.

4.2.3.2 Performance on Selected Datasets and Comparison with Existing Methods

For comparison, relevant state-of-the-art methods are considered such as RBS [81],
Preid PFCC [90], HGL ReID [127], RLML [250], NP ReId [95], DPML [112], Multi-
scale CNN [251], Multi-HG [84], RD [88], Sparse [252], Soft-Bio [19], ML common

subspace [72], AML-PSR [115], semi supervised-ReID+XQDA [253], LOMOXQDA
[77], RDML-CCPVL [122], Inlier-set group modeling [96], FMCF [68], Salience [61],
RF+DDA+AC [254], MARPML [42], PaMM [86], DVDL [255], AFDA [111],
Midlevel [79], Salmatch [78], Semantic [83], ROCCA [82], DVR [80], Hessian [119],
RN+XQDA+RR+DFR, PAN+XQDA+RR+DFR [137], SSS with fully-
supervised+LOMO, semi supervised+LOMO and fusion [87], MKSSL,
LOMO+MKSSL and MKSSL-MRank [256], DMIL [136], SSSVM [99], VS-SSL with
GOG, LOMO and combined [120], ML-ReID+LOMO [75], MSDALF+HM,
HSV+SMA and MSDALF+SMA [107], TPPLFDA [74], SECGAN_s, SECGAN_c,
SECGAN [43], EML [124], LWA [138], QRKISS-FFN M1&M2 [123], SCNN
(handcrafted and learned features) [73], ResNet-50 and inceptionV4 [139],
BRM2L(GOG and FTCNN) [257], UCDTL [258], TLSTP [142], P2SNET [135],
TMSL [259], PHDL [260], CTC-GAN [261], TMD2L [118], HRPID [262], DHGM-
average pooling and regularized minimum [76], and CMGTN [141] on selected
datasets. The existing approaches used the selected datasets in different combinations
for evaluation and validation. For instance, the approaches that consider VIPeR do not
use CUHK01 and iLIDS-VID datasets, and vice-versa. The computed results are
presented for VIPeR, CUHK01, and iLIDS-VID datasets in Figure (4.2), Figure (4.3),
and Figure (4.4), respectively. Also, the presented results of proposed framework are
assessed with existing approaches at ranks 1, 10, 20, and 50. For further clarity, the
quantitative performance comparison is tabulated in Tables (4.5)-(4.7) with
corresponding VIPeR, CUHK01, and iLIDS-VID datasets.

a) VIPeR Dataset: According to the results presented in Table 4.5, the proposed
framework achieves a 46.82% rank-1 matching rate on VIPeR that outperforms
previous rank-1 results of RBS [81], NP ReId [95], DPML [112], Soft-Bio [19], AML-
PSR [115], RD [88], Sparse [252], Inlier-set group modeling [96], semantic attributes
[99], semi-supervised-ReID+XQDA [253], LOMOXQDA [187], MKSSL and
MKSSL-MRank [85] which achieve 28.4%, 43.3%, 41.4%, 43.9%, 45.1%, 33.2%,
32.9%, 41.3%, 44.7%, 40.5%, 40.0%, 40.6%, and 42.3% respectively. Similarly, results
of proposed FCDF framework are better than the graph learning methods such as HGL
ReID[127] and Multi-HG [84] as well as metric learning methods such as RLML [250],
EML [124], DMIL [136] and RDML-CCPVL [122]. Moreover, the proposed
framework obtains improved results as compared to those methods where LOMO

descriptor is used for person ReID such as SSS with fully-supervised+LOMO, semi
supervised+LOMO and fusion [87], and LOMO+MKSSL [85]. In addition, the
proposed framework outperforms the recent state-of-the-art methods such as SSSVM
[99], VS-SSL with GOG, LOMO and combined [120], ML-ReID+LOMO [75],
InceptionV4 [139], HSV+SMA [107], MSDALF+SMA [107], TPPLFDA [74],
SECGAN_s, SECGAN_c, and SECGAN [43]. Furthermore, the computed results in
comparison with existing methods are also presented using CMC as shown in Figure
4.2. The results on VIPeR confirm that, even though the presented framework has limited training data, it achieves significant performance on the challenging VIPeR dataset at rank-1 and rank-10. These outstanding outcomes confirm the usefulness of the presented FCDF framework in comparison to state-of-the-art approaches.

Figure 4.2: CMC curves of existing and proposed FCDF method on VIPeR dataset

Table 4.5: Performance comparison in terms of top matching rates (%) of existing
methods including proposed FCDF method on VIPeR dataset (p=316), dash
(-) represents that no reported result is available
Ranks
Methods Year
1 10 20 50
RBS [81] 2015 28.4 64.5 76.2 -
LOMOXQDA [187] 2015 40.0 68.9 81.5 91.1
RLML [250] 2015 35.2 81.6 - 90.8
HGL ReID [127] 2016 34.1 79.7 90.1 98.1
DPML [112] 2016 41.4 80.9 90.4 -
RD [88] 2016 33.2 78.3 88.4 97.5
Sparse [252] 2016 32.9 75.9 89.2 96.8
SSS fully-supervised + LOMO [87] 2016 42.2 82.9 92.0 -
SSS semi-supervised + LOMO [87] 2016 31.6 72.7 84.9 -
SSS + Fusion [87] 2016 41.0 81.6 91.0 -
NP ReId [95] 2017 43.3 85.2 96.0 -
Multi-HG [84] 2017 44.5 83.0 92.4 99.1
Soft-Bio [19] 2017 43.9 86.5 94.5 99.6
Preid PFCC [90] 2017 28.0 - - -
AML-PSR [115] 2017 45.1 73.5 85.3 90.5
MKSSL [85] 2017 40.6 78.1 85.9 -
LOMO+MKSSL [85] 2017 31.2 62.9 72.8 -
MKSSL-MRank [85] 2017 42.3 74.4 80.6 -
Semi supervised-ReID+XQDA [253] 2018 40.5 64.4 - -
RDML-CCPVL [122] 2018 13.3 60.0 83.3 -
ML common subspace [72] 2018 34.5 80.5 90.4 98.0
DMSCNN [136] 2018 31.1 71.7 86.3 -
DMIL [136] 2018 31.1 71.7 84.7 -
Hessian [119] 2019 20.3 52.1 68.1 -
Multi-scale CNN [251] 2019 22.6 57.6 71.2 -
Inlier-set group modeling [96] 2019 41.3 85.2 94.0 -
SSSVM [99] 2019 44.7 84.5 92.0 -
Semantic attributes [99] 2019 44.7 84.5 92.0 -
VS-SSL with GOG [120] 2020 43.9 80.9 87.8 -
VS-SSL with LOMO [120] 2020 35.1 67.4 77.6 -
VS-SSL Combined [120] 2020 44.8 79.3 86.1 -
ML-ReID + LOMO [75] 2020 42.5 81.5 91.2 -
InceptionV4 [139] 2020 23.2 41.5 - -
HSV + SMA [107] 2020 40.4 - - -
MSDALF + SMA [107] 2020 40.3 - - -
TPPLFDA [74] 2020 42.3 76.3 89.7 -
SECGAN_s [43] 2020 36.8 75.9 87.3 -
SECGAN_c [43] 2020 37.5 74.6 86.8 -
SECGAN [43] 2020 40.2 77.3 89.7 -
EML [124] 2020 44.3 85.2 92.3 -
Proposed (FCDF) 46.8 87.2 93.2 98.5

b) CUHK01 Dataset: The proposed FCDF method is tested on CUHK01 dataset in
which 486 images are randomly selected while taking one image per person from the
gallery set. As per the results illustrated in Figure 4.3, the presented approach
outperforms the existing methodologies.

Figure 4.3: CMC curves of existing and proposed FCDF method on CUHK01 dataset

FCDF framework achieves significant matching rates such as 48.14%, 81.68%,


92.47%, and 96.13% at ranks 1, 10, 20, and 50 respectively. The experimental results
improved as compared to the previous results reported by HGL ReID [127], NP ReId
[95], semi-supervised-ReID+XQDA [253], RN+XQDA+RR+DFR and
PAN+XQDA+RR+DFR [137], Inlier-set group modeling [96] and hessian [119]
methods which attained 35.0%, 44.5%, 37.0%, 44.3%, 44.5% and 21.9% recognition
rates respectively at rank-1 as tabulated in Table 4.6.

Table 4.6: Performance comparison in terms of top matching rates (%) of existing
methods including proposed FCDF method on CUHK01 dataset (p=486),
dash (-) represents that no reported result is available
Ranks
Method Year
1 10 20 50
Salmatch[78] 2013 28.4 55.6 67.9 -
Midlevel [79] 2014 34.3 64.9 74.9 -
Semantic [83] 2015 32.7 64.4 76.3 -
ROCCA [82] 2015 29.7 66.0 76.7 -
Sparse [252] 2016 31.3 68.3 78.1 87.6
HGL ReID [127] 2016 35.0 69.2 80.6 91.5
RD [88] 2016 31.1 68.5 79.1 90.3
FMCF [68] 2016 25.0 39.0 46.6 -
DPML [112] 2016 35.8 70.9 79.5 -
NP ReId[95] 2017 44.5 80.3 92.3 -
ML common subspace[72] 2018 34.0 69.7 80.5 91.6
Semi supervised-ReID+XQDA[253] 2018 37.0 64.4 - -
DMS CNN[136] 2018 31.8 69.9 80.7 -
RDML-CCPVL [122] 2018 18.5 57.1 79.8 -
DMIL [136] 2018 31.8 69.9 80.9 -
RN+XQDA+RR+DFR [137] 2019 44.3 71.4 80.3 -
PAN+ XQDA+RR+DFR [137] 2019 42.6 73.2 80.6 -
Inlier-set group modeling [96] 2019 44.5 80.3 92.3 -
Hessian [119] 2019 21.9 52.0 62.2 -
LWA [138] 2019 46.7 - - -
QRKISS-FFN (M1) [123] 2019 42.1 70.7 - -
QRKISS-FFN (M2) [123] 2019 44.9 73.8 - -
HSV + SMA [107] 2020 27.0 - - -
MSDALF + SMA [107] 2020 21.7 - - -
SCNN (learned features) [73] 2020 31.0 74.0 86.0 -
SCNN (handcrafted features) [73] 2020 22.0 57.0 68.0 -
ResNet-50 [139] 2020 47.1 71.4 - -
InceptionV4 [139] 2020 43.5 61.3 - -
BRM2L(GOG) [257] 2020 45.3 86.5 90.0 -
BRM2L(FTCNN) [257] 2020 47.4 88.3 98.3 -
Proposed (FCDF) 48.1 81.6 92.4 96.1

Similarly, FCDF acquires better rank-1 results when evaluated with previous descriptor
results of FMCF [68] with 25.0% and RD [88] with 31.1%. In comparison to other
recent approaches such as DMIL [136], LWA [138], QRKISS-FFN M1&M2 [123],
MSDALF+SMA and HSV+SMA [107], SCNN (handcrafted and learned features) [73],
ResNet-50 and inceptionV4 [139], and BRM2L(GOG and FTCNN) [257] proposed
rank-1 results are superior. The improved results stem not only from the optimal handcrafted features but also from the gallery search optimization. Moreover, the use of deep features with the OFS effectively handles appearance diversity, which proves the significance of the proposed FCDF framework.

c) iLIDS-VID Dataset: The results are also computed using the iLIDS-VID dataset,
having numerous issues in captured images such as variations in pose, random
occlusions, and clothing similarities. For performance comparison of proposed FCDF
framework with relevant methods, 150 image pairs of the dataset are used. The results
based CMC curves depict the evaluation of FCDF framework with the existing
approaches at ranks 1, 10, 20, and 50 as shown in Figure 4.4.

Figure 4.4: CMC curves of existing and proposed FCDF method on the iLIDS-VID
dataset

According to results shown in Table 4.7, rank-1 matching rate of 40.67% is attained
that outperforms the previous results reported by MS-color & LBP+DVR [80], semi-
supervised-ReID+XQDA [253], NP ReId [95], RF+DDA+AC [254], MARPML [42],
PaMM [86], AFDA [111], Inlier-set group modeling [96], and CMGTN [141] with

34.5%, 39.3%, 35.8%, 36.4%, 36.9%, 30.3%, 37.5%, 34.8%, and 38.4% matching rates
at rank-1 respectively.

Table 4.7: Performance comparison in terms of top matching rates (%) of existing
methods and proposed FCDF method on the iLIDS-VID dataset (p=150),
dash (-) represents that no reported result is available
Ranks
Method Year
1 10 20 50
Salience [61] 2013 10.2 35.5 52.9 -
DVR [80] 2014 23.3 55.3 68.4 -
Salience + DVR [80] 2014 30.9 65.1 77.1 -
MS color & LBP+DVR [80] 2014 34.5 67.5 77.5 -
DVDL [255] 2015 25.9 57.3 68.9 -
AFDA [111] 2015 37.5 73.0 81.8 -
UCDTL [258] 2016 21.5 53.7 73.6 -
PaMM [86] 2016 30.3 70.3 82.7 -
NP ReId [95] 2017 35.8 59.5 72.3 -
RF+DDA+AC [254] 2017 36.4 59.0 73.9 91.7
P2SNET [135] 2017 40.0 78.1 90.0 -
TMSL [259] 2017 39.5 75.4 86.5 -
PHDL [260] 2017 28.2 65.9 80.4 -
MARPML [42] 2018 36.9 77.1 89.9 -
Semi supervised-ReID+XQDA[253] 2018 39.3 71.1 - -
TLSTP [142] 2018 30.3 66.6 79.8 -
Multi-scale CNN [251] 2019 33.6 69.7 84.2 -
Inlier-set group modeling [96] 2019 34.8 59.5 72.3 -
CTC-GAN [261] 2019 35.3 72.0 82.9 -
TMD2L [118] 2019 29.6 74.0 86.2 -
HRPID [262] 2019 23.9 67.1 83.7 -
TPPLFDA [74] 2020 33.8 66.8 78.5 -
MSDALF + HM [107] 2020 34.4 - - -
MSDALF + SMA [107] 2020 34.3 - - -
DHGM-average pooling [76] 2020 33.5 - - -
DHGM-regularized minimum [76] 2020 40.0 - - -
CMGTN [141] 2020 38.4 74.8 86.3 -
InceptionV4 [139] 2020 32.9 74.2 - -
Proposed (FCDF) 40.6 78.5 86.9 93.8

From the results, it is illustrated that the proposed FCDF framework outperforms existing methodologies, for example UCDTL [258], TLSTP [142],
P2SNET [135], TMSL [259], PHDL [260], CTC-GAN [261], TMD2L [118], HRPID
[262], TPPLFDA [74], MSDALF+HM and SMA [107], DHGM-average pooling and
regularized minimum [76], and InceptionV4 [139], as shown in Figure 4.4. These
findings confirm effectiveness of FCDF under clothing similarity and occlusion issues
of iLIDS-VID dataset.

The proposed FCDF framework attained significant recognition rates at different ranks
for ReID problem. According to the computed results on VIPeR dataset, it outperforms
the previous methods of AML-PSR, Multi-HG, Semantic attribute, NP ReId, Soft-Bio,
Inlier-set group modeling, DPML, and LOMOXQDA with margins of 1.63%, 2.26%,
2.10%, 3.46%, 2.90%, 5.47%, 5.36%, and 6.82% respectively. Using VIPeR dataset,
this research work attained better results at rank-1 and rank-10 whereas NP ReId and
Soft-Bio perform best at ranks 20 and 50, respectively. The recognition rates at ranks-
1 and rank-10 confirm that proposed FCDF framework consistently outperforms under
illumination and viewpoint variations. Using CUHK01 dataset, experiments are
conducted at ranks 1, 10, 20, 50, and the proposed FCDF framework attained higher
recognition rates at rank-1 and rank-50 whereas BRM2L (FTCNN) performs best at
rank-10 and rank-20. Regarding rank-1 results on the CUHK01 dataset, the proposed method obtains a higher recognition rate than NP ReId, RN+XQDA+RR+DFR,
PAN+ XQDA+RR+DFR, Inlier-set group modeling, Hessian and DPML with
significant margins of 3.62%, 3.81%, 5.46%, 3.62%, 26.18% and 12.29% respectively.
Similarly, on iLIDS-VID dataset, the proposed FCDF framework shows improved
results as compared to previous results reported by semi-supervised-ReID+XQDA, NP
ReId, RF+DDA+AC, MARPML, AFDA, and Inlier-set group modeling with margins
of 1.29%, 4.80%, 4.27%, 3.77%, 3.17%, and 5.80% respectively. The proposed
framework performs best at three ranks including 1, 10, and 50 whereas P2SNET
performs best at rank-20. The recognition rate is also examined using the
target/corresponding cluster (initially classified cluster) and number of neighbor
cluster(s) during probe matching. For this purpose, experiments are conducted in which
gallery-based probe matching is performed at ranks 1, 10, 20, and 50. The prime
concern of these experiments is to compare cluster-based probe matching with gallery-
based probe matching. Tables (4.8)-(4.10) show the comparison of both cluster and
gallery-based probe matching results at ranks 1, 10, 20, and 50. The results show that
cluster-based probe matching rates are higher than gallery-based probe matching rates
at all ranks, for instance at rank-1 by margins of 9.2%, 6.57%, and 5.95% on VIPeR,
CUHK01, and iLIDS-VID datasets, respectively. According to the results, recognition
rates are improved using cluster-based probe matching technique, hence it proves that
recognition rate relies on accurate classification of the corresponding cluster (CC) and
its nearest neighbor cluster(s) (NNC’s). All probe images are matched either with
classified CC or NNC’s (cluster with minimum distance from the corresponding

cluster). Notably, not more than three clusters including CC for probe matching are
considered because of gallery search optimization.

Table 4.8: Cluster and gallery-based probe matching results of proposed FCDF
framework on VIPeR dataset
Ranks
Probe matching
1 10 20 50
Proposed cluster-based 46.82 87.24 93.23 98.58
Proposed gallery-based 37.62 77.89 85.25 92.12

Table 4.9: Cluster and gallery-based probe matching results of proposed FCDF
framework on CUHK01 dataset
Ranks
Probe matching
1 10 20 50
Proposed cluster-based 48.14 81.68 92.47 96.13
Proposed gallery-based 41.57 76.11 85.84 91.41

Table 4.10: Cluster and gallery-based probe matching results of proposed FCDF
framework on the iLIDS-VID dataset
Ranks
Probe matching
1 10 20 50
Proposed cluster-based 40.67 78.50 86.92 93.87
Proposed gallery-based 34.72 73.91 81.17 88.26

Mathematically, CC and NNC are calculated through Eq. (4.1) and Eq. (4.2).

$CC = \dfrac{X_{Ip}}{P}$   (4.1)

where $X_{Ip}$ represents the number of probe images that are accurately matched within the target (corresponding) clusters and $P$ denotes the total number of images in the probe set.

$NNC = \dfrac{n_c}{K}$   (4.2)

where $n_c$ denotes the neighbor clusters over which the cluster search is applied to find the accurate image pair, and $K$ is the number of consensus clusters. The cluster-wise matching results are presented
in Figure 4.5. The experimental results computed using the aforementioned equations
show that the success rate of accurate image matching using CC on VIPeR, CUHK01
and iLIDS-VID datasets is 42%, 47%, and 38%, respectively against the probe set.
However, when the CC is mismatched, the first NNC is considered. In this case, the success rate of accurate image matching on the VIPeR, CUHK01, and iLIDS-VID datasets is recorded as 60%, 64%, and 54% respectively against the probe set. Similarly, if both the CC and the first NNC are exhausted, the second NNC is considered; in this situation the success rate on the VIPeR, CUHK01, and iLIDS-VID datasets is observed as 83%, 89%, and 79% respectively against the probe set.
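The following sketch (Python/NumPy, illustrative only) summarizes how cluster-based probe matching restricts the search to the CC and, when required, its nearest neighbor clusters; the variable names are hypothetical and the Euclidean distance is a simple stand-in for the actual matching scheme.

    import numpy as np

    def cluster_probe_match(probe_f, gallery_f, gallery_ids, centers, labels, max_clusters=3):
        # Rank clusters by the distance of their centers to the probe (CC first, then NNCs)
        cluster_order = np.argsort(np.linalg.norm(centers - probe_f, axis=1))[:max_clusters]
        # Restrict the candidate set to gallery samples belonging to the CC and up to two NNCs
        candidates = np.where(np.isin(labels, cluster_order))[0]
        d = np.linalg.norm(gallery_f[candidates] - probe_f, axis=1)
        return gallery_ids[candidates[np.argsort(d)]]   # ranked identities from CC/NNCs only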

Figure 4.5: Performance comparison of CC and NNC based searching against all probe
images

Figure 4.6: Selected image pairs (column-wise) from VIPeR, CUHK01, and iLIDS-
VID datasets with challenging conditions such as a) Improper image
appearances, b) different background and foreground information, c)
drastic illumination changes, and d) pose variations including lights effects
It is worth mentioning that a few image pairs show false matching results (where the appearance of the same pair of images looks dissimilar) because of drastic changes in pose and viewpoint. For further clarity, the abovementioned challenges are shown in Figure 4.6. Similarly, the background and foreground information of some images is barely differentiable, making it difficult for the proposed

framework to re-identify the person accurately. These challenging conditions limit the
recognition rate.

4.3 Proposed Pedestrian Gender Classification Method of Joint Low-level and Deep CNN Feature Representations (J-LDFR)
This section contains evaluation of proposed J-LDFR framework. The discussion
covers evaluation protocols, implementation settings, and statistics regarding the
selected datasets. Besides, it includes a detailed discussion and comparison of computed
results with existing studies.

4.3.1 Evaluation Protocols and Implementation Settings for Pedestrian Gender Classification

Six evaluation metrics including O-ACC, M-ACC, AUC, false positive rate (FPR), hit
rate/ sensitivity/ true positive rate (TPR) or recall, precision or positive predictive value
(PPV) are selected to estimate the performance of proposed J-LDFR framework for
gender classification. Moreover, training time, prediction time, and CW-ACC are also
calculated and presented in the empirical analysis. The mathematical representation of
these metrics is given in Table 4.11.

Table 4.11: Evaluation protocols

S. No   Metric   Equation
1       O-ACC    $\mathrm{Overall\ accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN} \times 100$
2       M-ACC    $\mathrm{Mean\ accuracy} = \dfrac{1}{2}\left(\dfrac{\mathrm{Correct\ no.\ of\ males}}{\mathrm{Total\ males}} + \dfrac{\mathrm{Correct\ no.\ of\ females}}{\mathrm{Total\ females}}\right) \times 100$
3       AUC      $AUC = \int_{\infty}^{-\infty} TPR(t)\, FPR'(t)\, dt$
4       FPR      $FPR = \dfrac{FP}{TN + FP} \times 100$
5       TPR      $\mathrm{Hit\ rate/Sensitivity/Recall}/TPR = \dfrac{TP}{TP + FN} \times 100$
6       PPV      $\mathrm{Precision}/PPV = \dfrac{TP}{TP + FP} \times 100$

where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively. The three metrics O-ACC, M-ACC, and AUC are the key metrics for assessing the performance of a system; higher accuracy or AUC scores confirm the significance of the model with respect to prediction performance. The remaining two metrics are defined as follows: 1) precision is a measure of exactness or quality, and 2) recall is a measure of completeness or quantity. For testing the proposed J-LDFR framework, the k-fold cross-validation approach is applied with k = 10, which is considered a standard procedure for model assessment. All experiments are performed on an Intel Core i7-7700 CPU @ 3.60 GHz desktop computer with 16 GB RAM and an NVIDIA GeForce GTX 1070 GPU, using MATLAB 2018b including the pretrained deep CNN networks.
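A small sketch of the protocols in Table 4.11 is given below for the binary male/female setting; taking the male class as positive is an assumption made only for this illustration, and AUC is computed separately from the ROC curve.

    def gender_metrics(tp, tn, fp, fn):
        o_acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # O-ACC
        tpr   = 100.0 * tp / (tp + fn)                    # hit rate / sensitivity / recall
        tnr   = 100.0 * tn / (tn + fp)
        m_acc = 0.5 * (tpr + tnr)                         # class-wise mean accuracy (M-ACC)
        fpr   = 100.0 * fp / (tn + fp)                    # FPR
        ppv   = 100.0 * tp / (tp + fp)                    # precision (PPV)
        return {"O-ACC": o_acc, "M-ACC": m_acc, "FPR": fpr, "TPR": tpr, "PPV": ppv}

    # Scores are averaged over the 10 folds of cross-validation, as in the protocol above.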

4.3.2 Pedestrian Analysis Datasets Presentation for Pedestrian Gender Classification

In this research work, two datasets named pedestrian attribute recognition (PETA)
[263] and MIT [166] are selected to test the proposed J-LDFR framework. The
challenges of these datasets include diverse variations such as different camera settings,
variation in viewpoint, illumination changes, resolution, body deformation, and scene
(indoor/outdoor) effects. Figure 4.7 shows a few sample images from the selected
datasets as an example of abovementioned challenges of pedestrian gender
classification [263].

Figure 4.7: Samples of pedestrian images selected from sub-datasets of PETA dataset,
column representing the gender (male and female) from each sub-dataset,
upper row is showing the images of male gender whereas lower row is
showing the image of female gender

4.3.2.1 PETA Dataset Presentation

PETA dataset has annotated images of pedestrians taken from ten existing datasets. By combining these datasets, a total of 19000 images is acquired, where the
percentage of sub-dataset iLIDS is 2%, CUHK is 24%, 3DPeS is 5%, GRID is 7%,
CAVIAR4REID is 6%, MIT is 5%, PRID is 6%, SARC3D is 1%, VIPeR is 7%, and
Town Center is 37%, and each sub-dataset holds number of images as 477, 4563, 1012,
1275, 1220, 888, 1134, 200, 1264, and 6967, respectively. In this research work, 8472
images of each gender (male and female) are chosen from a total of 19000 images. This
equal distribution of samples per class is investigated for gender prediction. Also, male
and female annotation is used in terms of mixed views. PETA dataset is described in
Table 4.12 in terms of dataset name, class-wise images, total images, view, image size,
and scenario.

Table 4.12: Statistics of PETA dataset for pedestrian gender classification

Dataset   #Males   #Females   Total images   View    Image size (Height × Width)   Scenario
PETA      8472     8472       16944          Mixed   Vary                          Outdoor and indoor

4.3.2.2 MIT Dataset Presentation

The experiments are also done on the imbalanced MIT dataset separately and the
computed results are compared with existing studies. MIT dataset is described in Table
4.13 in terms of dataset name, class-wise images, total images, view, image size, and
scenario.

Table 4.13: Statistics of MIT dataset for pedestrian gender classification

Dataset   #Males   #Females   Total images   View    Image size (Height × Width)   Scenario
MIT       600      288        888            Mixed   128×64                        Outdoor

4.3.3 Results Evaluation

Detailed results of proposed J-LDFR framework are computed on the selected datasets.
For pedestrian gender classification, each feature vector is confined according to different feature subset dimensions (FSD), as described in Table 4.14. For experimentation, a set of five test FSs with different dimensions is selected from

entropy controlled low-level and deep CNN feature representations. In subsequent
tables, results of best classifier are written in bold.

Table 4.14: Description of different test FSs with different dimensions


Test FS no.   LOMO   HOG    VGG19   ResNet101   Total features
1 200 300 500 500 1500
2 400 600 750 750 2500
3 550 750 850 850 3000
4 600 1000 1000 1000 3600
5 700 1100 1200 1000 4000
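As a worked example of Table 4.14, the sketch below assembles test FS no. 4 by concatenating the selected blocks (600 + 1000 + 1000 + 1000 = 3600 features); it assumes, for illustration only, that each block is already ordered by its selection score so that the top-k features are its first columns.

    import numpy as np

    TEST_FS = {1: (200, 300, 500, 500), 2: (400, 600, 750, 750), 3: (550, 750, 850, 850),
               4: (600, 1000, 1000, 1000), 5: (700, 1100, 1200, 1000)}

    def build_test_fs(lomo, hog, vgg19, resnet101, fs_no=4):
        d_lomo, d_hog, d_vgg, d_res = TEST_FS[fs_no]
        parts = [lomo[:, :d_lomo], hog[:, :d_hog], vgg19[:, :d_vgg], resnet101[:, :d_res]]
        return np.hstack(parts)       # e.g. N x 3600 for test FS no. 4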

The objective of these experiments is to find a feature combination in size/ dimension,


where the proposed framework may produce significant results on PETA and MIT
datasets. The next step is to examine whether the gender predictive performance of low-
level features can be enhanced by merging deep feature representations or not. After
integrating low-level features with deep features, initially this work intends to test a
single feature combination represented as LOMO+HOG+VGG19+ResNet101. All the
experiments were carried out using this feature combination with few test FSs as
depicted in Table 4.14. Different classification methods are used for training and these
methods were executed using the settings described in Table 3.4. Table 4.15 and Table
4.16 show the results of the proposed framework using joint low-level and deep CNN
feature representation for gender prediction in terms of O-ACC, AUC, precision, and
recall using PETA and MIT datasets. It can be observed from Table 4.15 and Table
4.16 that SVM classifier with test FS number 4 achieves better results as compared to
other classification methods, and rest of test FSs in case of both PETA and MIT
datasets. Within SVM, cubic method produces better results as compared to linear,
median, and quadratic methods. In case of test FS number 4, C-SVM computes O-ACC,
AUC, precision and recall with 89.3%, 96%, 90%, and 88.7% respectively on PETA
datasets, whereas it achieves overall accuracy, AUC, precision and recall with 82%,
86%, 92.1%, and 84% respectively on MIT dataset. These results are superior to the
rest of test FSs and other classification methods on both datasets. Similarly, the results
are calculated using other test FSs and classification methods on selected datasets as
illustrated in Table 4.14 where best results are written in bold. Based on these
investigations, contribution of each feature vector is found for robust feature
representation jointly.

Table 4.15: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on PETA dataset

Classifier   Test FS no.   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LD           1             83.4        91        83.8      83.1
             2             84.1        92        84.4      83.8
             3             84.2        92        84.5      84.0
             4             84.0        92        84.4      83.8
             5             83.4        91        83.7      83.3
ES-D         1             83.5        91        83.5      83.4
             2             85.0        93        85.3      84.8
             3             85.6        93        86.2      85.1
             4             86.0        93        86.7      85.5
             5             85.9        93        86.6      85.4
F-KNN        1             81.2        81        83.8      79.7
             2             81.7        82        84.5      80.0
             3             82.0        82        84.9      80.2
             4             82.2        82        85.1      80.4
             5             82.2        82        85.0      80.6
M-KNN        1             81.6        91        90.1      77.1
             2             82.5        91        91.5      77.5
             3             82.4        92        91.7      77.4
             4             82.8        92        92.0      77.6
             5             83.0        92        82.3      77.8
C-KNN        1             82.1        91        90.7      77.3
             2             82.8        92        92.0      77.8
             3             83.0        92        90.0      78.0
             4             76.1        86        76.4      76.0
             5             83.3        92        92.2      78.3
L-SVM        1             82.9        91        83.1      82.8
             2             84.1        92        84.8      83.6
             3             84.6        92        85.6      83.9
             4             84.5        92        85.6      83.8
             5             84.6        92        85.9      83.7
M-SVM        1             87.0        94        88.1      86.1
             2             87.6        95        88.5      87.0
             3             88.4        95        90.0      87.6
             4             88.6        95        90.0      87.8
             5             85.5        95        90.0      87.5
Q-SVM        1             87.1        94        87.9      86.6
             2             88.1        95        88.9      87.4
             3             88.4        95        89.2      87.7
             4             88.5        95        89.3      87.7
             5             88.2        95        89.3      87.5
C-SVM        1             87.7        95        88.3      87.2
             2             88.8        95        89.5      88.2
             3             89.0        95        90.0      88.5
             4             89.3        96        90.0      88.7
             5             89.2        96        90.0      88.6

Table 4.16: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on MIT dataset

Classifier   Test FS no.   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LD           1             66.4        63        72.8      76.3
             2             73.6        71        79.3      81.2
             3             74.2        71        80.5      81.1
             4             73.6        70        79.5      81.1
             5             74.1        71        80.6      80.9
ES-D         1             60.5        61        66.5      72.8
             2             75.9        81        83.1      81.5
             3             77.6        82        83.5      65.5
             4             77.0        80        84.3      82.1
             5             77.2        82        86.1      81.2
F-KNN        1             64.8        57        79.0      71.8
             2             64.3        57        78.6      71.4
             3             64.9        57        79.1      71.7
             4             62.5        55        75.5      70.8
             5             63.6        57        76.5      71.6
M-KNN        1             68.5        64        97.0      69.0
             2             69.9        67        97.8      69.8
             3             69.5        66        97.6      69.5
             4             70.2        67        98.0      70.0
             5             69.1        67        97.8      69.2
C-KNN        1             68.5        65        95.5      69.5
             2             68.8        66        96.3      69.3
             3             70.0        67        97.3      70.0
             4             69.2        68        96.1      69.7
             5             68.1        68        96.0      68.9
L-SVM        1             76.7        81        92.6      77.3
             2             78.3        83        94.8      78.2
             3             78.5        84        94.3      78.2
             4             79.2        85        94.6      78.7
             5             78.7        85        94.1      78.5
M-SVM        1             74.4        80        96.3      73.8
             2             75.8        83        96.5      74.1
             3             77.4        84        97.1      76.1
             4             78.0        85        97.8      75.9
             5             76.1        85        97.0      75.0
Q-SVM        1             76.0        80        88.6      78.5
             2             78.7        84        89.6      80.1
             3             79.7        85        89.5      82.1
             4             81.4        86        91.0      82.7
             5             80.6        85        90.6      82.4
C-SVM        1             74.5        79        85.5      78.6
             2             78.5        83        88.3      81.4
             3             80.0        85        89.1      82.6
             4             82.0        86        91.2      84.0
             5             80.4        85        90.6      82.1

Moreover, the training and prediction times under all applied classification methods are computed and shown in Figure 4.8, where (a) presents the training time (sec) on the PETA and MIT datasets and (b) presents the prediction speed (obs/sec) on the PETA and MIT datasets, respectively. Since the experimental results confirm that the feature combination from test FS number 4 produces better results than the other test FSs, this feature combination is used for the rest of the experiments in this work.

Figure 4.8: Proposed J-LDFR method (a) training and (b) prediction time using
different classifiers on PETA and MIT datasets

4.3.3.1 Performance of Low-level Feature Representation for Gender Classification

To examine the impact of low-level feature representation of gender image generated


from HOG and LOMO features, selected FSs of both features are applied separately.

The main objective of this test is to estimate the performance of the low-level features and their fusion using the selected FSs. The selected FSs of HOG and LOMO are supplied to the SVM classifier and the results under the selected evaluation protocols are computed. It is observed that C-SVM provides better results than M-SVM and Q-SVM when the integration of LOMO and HOG features is considered. In case of C-SVM, O-ACC, AUC, precision, and recall are recorded as 85.5%, 93%, 87.7%, and 84% respectively on the PETA dataset using the LOMO+HOG feature representation. The performance of different low-level features and their integration using cubic, medium, and quadratic SVM on the PETA dataset is illustrated in Tables (4.17)-(4.19).

Table 4.17: Performance evaluation of proposed J-LDFR method on PETA dataset


using C-SVM classifier with 10-fold cross-validation
Classifier: C-SVM
Features type        Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
LOMO 600 82.2 90 84.9 80.5
HOG 1000 82.8 90 84.4 81.8
LOMO+HOG 1600 85.5 93 87.7 84.0

Table 4.18: Performance evaluation of proposed J-LDFR method on PETA dataset


using M-SVM classifier with 10-fold cross-validation
Classifier: M-SVM
Features type        Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
LOMO 600 83.1 91 86.3 81.1
HOG 1000 81.9 90 84.3 80.5
LOMO+HOG 1600 84.9 92 87.9 82.9

Table 4.19: Performance evaluation of proposed J-LDFR method on PETA dataset


using Q-SVM classifier with 10-fold cross-validation
Classifier: Q-SVM
Features type        Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
LOMO 600 81.7 90 85.6 79.4
HOG 1000 81.3 89 83.3 80.2
LOMO+HOG 1600 84.3 92 87.1 82.5

The medium and quadratic SVM classifiers produced O-ACC with a value of 84.9%,
and 84.3%, respectively when the integration of LOMO and HOG features is applied.
The low-level feature representations are also tested on MIT dataset using the same

classification methods, evaluation protocols, and settings. According to LOMO and
HOG feature fusion results on the MIT dataset, C-SVM again provides better results than the other classification methods, with O-ACC 79%, AUC 82%, precision 89.8%, and
recall 80.3%. In addition, these evaluation protocols are also tested on rest of two
classification methods such as M-SVM and Q-SVM as illustrated in Tables (4.20)-
(4.22). Using entropy controlled low-level feature representations, comparison between
HOG and LOMO separately on datasets PETA and MIT is shown in Figure 4.9 and
Figure 4.10, respectively. The joint low-level feature representation result using LOMO
and HOG is presented subsequently in Figure 4.11 for both PETA and MIT datasets.
The joint low-level feature representation results as described in Tables (4.17)-(4.19)
show improved performance on the PETA dataset. In case of HOG, the improvement
of 2.7%, 3%, 3.3%, and 2.2% is recorded for O-ACC, AUC, precision, and recall
respectively, whereas in case of LOMO the improvement of 3.3%, 3%, 2.6%, and 3.5%
is recorded for O-ACC, AUC, precision, and recall respectively using C-SVM.

Figure 4.9: Performance evaluation of proposed J-LDFR method on PETA dataset


using entropy controlled low-level feature representations, individually

Table 4.20: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation
Classifier: C-SVM
Features type        Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
LOMO 600 70.5 70 84.5 75.0
HOG 1000 78.0 82 89.5 80.2
LOMO+HOG 1600 79.0 82 89.8 80.3

Table 4.21: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation
Classifier: M-SVM
Features type        Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
LOMO 600 68.7 70 98.3 68.7
HOG 1000 77.1 81 97.8 75.5
LOMO+HOG 1600 77.5 82 98.0 75.7

Table 4.22: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation
Classifier: Q-SVM
Features type        Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
LOMO 600 72.3 72 87.6 75.3
HOG 1000 75.9 80 88.8 78.3
LOMO+HOG 1600 77.4 82 90.3 79.1

Similarly, joint low-level feature representation results as described in Tables (4.20)-


(4.22) indicated improved results on MIT dataset. In case of HOG, the improvement of
1%, 0.3%, and 0.1% is recorded for O-ACC, precision, and recall respectively, whereas
in case of LOMO the improvement of 7.5%, 12%, 5.3%, and 5.3 % is observed for O-
ACC, AUC, precision, and recall respectively under C-SVM.

Figure 4.10: Performance evaluation of proposed method on MIT dataset using entropy
controlled low-level feature representations, individually

Figure 4.11: Performance evaluation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled low-level feature representations, jointly

The improvement in computed results proves that the LOMO has the strength to deal
with viewpoint variations. Similarly, other classification methods also produced
exceptional results using the abovementioned joint low-level feature representation and
settings. Comparing Figure 4.11 with Figure 4.9 and Figure 4.10, it is obvious that the
joint low-level feature representation outperforms single feature representations on
both selected datasets and thus contributes to the improved performance of pedestrian gender classification. Since the PETA dataset consists of several sub-datasets in which diverse appearances with pose and illumination variations are major issues, deeply learned feature representations of CNN models are more suitable in these cases.
Consequently, deep feature representations are taken along with low-level feature
representations in this framework. The detail about deep feature representations with
performance evaluation is given in the following subsection.

4.3.3.2 Performance of Deep Feature Representation for Gender Classification

Considering LOMO and HOG FSs including their joint representation (LOMO+HOG),
it is observed that the performance of gender prediction decreases due to appearance variation issues such as pose and illumination. To handle these issues, an entropy
controlled deep CNN feature representation is proposed. Impact of deep feature
representations is also examined using two different deep CNN models. The results are
computed for both deep feature representations separately and jointly considering

selected datasets, evaluation protocols, and classification methods. In case of deep
feature representations separately, the performance of proposed framework on PETA
and MIT datasets is described in Tables (4.23)-(4.25) and (4.26)-(4.28) respectively.

Table 4.23: Performance evaluation of proposed J-LDFR method on PETA dataset


using C-SVM classifier with 10-fold cross-validation
Classifier: C-SVM
Features type            Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
VGG19 1000 82.3 90 82.6 82.1
ResNet101 1000 84.5 92 84.8 84.2
VGG19+ResNet101 2000 85.7 93 86.3 85.4

Table 4.24: Performance evaluation of the proposed J-LDFR method on PETA dataset
using M-SVM classifiers with 10-fold cross-validation
Classifier: M-SVM
Features type            Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
VGG19 1000 82.5 90 83.5 81.7
ResNet101 1000 84.1 92 84.1 84.2
VGG19+ResNet101 2000 85.4 93 86.1 84.1

Table 4.25: Performance evaluation of the proposed J-LDFR method on PETA dataset
using Q-SVM classifier with 10-fold cross-validation
Classifier: Q-SVM
Features type            Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
VGG19 1000 82.3 90 82.5 82.2
ResNet101 1000 84.2 92 84.5 83.9
VGG19+ResNet101 2000 85.6 93 86.1 85.2

According to the results, entropy controlled ResNet101 deep features outperformed


VGG19 deep features with an O-ACC of 84.5%, AUC of 92%, precision of 84.8%, and
recall of 84.2% on PETA dataset using C-SVM. Likewise, ResNet101 achieved better
performance on MIT dataset with O-ACC of 74.3% and recall of 79.7% under C-SVM.
Q-SVM provided better outcomes in case of AUC with 81% and precision with 89.1%.
Also, the performance is evaluated using two other classification methods including M-
SVM and Q-SVM, where significant results on PETA and MIT datasets are observed
and presented in Tables (4.23)-(4.25) and (4.26)-(4.28), respectively. The performance
using a single deep feature representation on PETA and MIT datasets is shown in Figure

4.12 and Figure 4.13, respectively. The joint deep feature representation
(VGG19+ResNet101) as described in Tables (4.23)-(4.25) showed improved results on
PETA dataset. In case of ResNet101, the improvement of 1.2%, 1.0%, 1.5%, and 1.2%
is recorded for O-ACC, AUC, precision, and recall respectively, whereas in case of
VGG19, improved results of 3.4%, 3%, 4.3%, and 3.3% are attained for O-ACC, AUC,
precision, and recall respectively. Similarly, joint deep feature representation
(VGG19+ResNet101) as described in Tables (4.26)-(4.28) indicated enhanced results
on MIT dataset. In case of ResNet101, the improvement of 2.1%, 1.8%, and 1.0% is
observed for O-ACC, precision, and recall respectively, whereas considering VGG19,
enhanced results 3.7%, 4%, 2.3%, and 2.7% are seen for O-ACC, AUC, precision, and
recall respectively. Also, other classification methods produced better results using the
aforementioned joint deep CNN feature representation and settings.
Table 4.26: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation
Classifier: C-SVM
Features type            Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
VGG19 1000 72.5 75 84.5 77.0
ResNet101 1000 74.3 79 85.0 78.7
VGG19+ResNet101 2000 76.2 79 86.8 79.7

Table 4.27: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation
Classifier: M-SVM
Features type            Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
VGG19 1000 72.5 76 96.5 72.1
ResNet101 1000 73.5 78 96.0 73.1
VGG19+ResNet101 2000 74.8 79 96.0 74.2

Table 4.28: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation
Classifier: Q-SVM
Features type            Selected FSD    O-ACC (%)    AUC (%)    PPV (%)    TPR (%)
VGG19 1000 73.2 76 87.5 76.3
ResNet101 1000 75.1 80 88.5 77.7
VGG19+ResNet101 2000 76.8 81 89.1 79.1

The computed results using deep CNN feature representation individually on PETA
and MIT datasets are shown in Figure 4.12 and Figure 4.13 whereas joint deep feature
representation (VGG19+ResNet101) based results are presented in Figure 4.14.

Figure 4.12: Performance estimation of proposed J-LDFR method on PETA dataset


using entropy controlled deep feature representations, individually

Comparing Figure 4.14 with Figure 4.12 and Figure 4.13, it is apparent that the joint deep feature representation outperforms single deep feature representations on both selected datasets and thus supports the improved performance of pedestrian gender classification.

Figure 4.13: Performance evaluation of proposed J-LDFR method on MIT dataset using
entropy controlled deep feature representations, separately

Figure 4.14: Performance estimation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled deep feature representations, jointly

Since deep feature representations provide detailed information about the input image, they have the potential to handle pose and illumination issues effectively, as the computed results show. In addition, it is observed that deep CNN feature representations provide better results than traditional feature representations.

4.3.3.3 Performance of Joint Feature Representation for Gender Classification

In the preceding sections, low-level and deep feature representations of input image for
gender classification are investigated either separately or jointly. The proposed
framework achieved superior results with joint feature representation. The computed
results show improvement when joint representation of either low-level features or deep
features is compared to individual feature representations. This further motivates investigating joint versions of the low-level and deep feature representations in different combinations. According to the calculated results, the joint feature representation
(HOG+LOMO+VGG19+ ResNet101) increased overall performance on PETA and
MIT datasets and achieved desirable results as presented in Tables (4.29)-(4.31) and
(4.32)-(4.34) respectively, which confirms that joint feature representation makes the
proposed framework more robust. The second best results are obtained using
HOG+LOMO+ResNet101 combinations.
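A minimal sketch of this serial fusion is given below. The per-representation dimensionalities (HOG=1000, LOMO=600, VGG19=1000, ResNet101=1000) are inferred from the FSD column of Tables (4.29)-(4.31); the random placeholder arrays and variable names are assumptions for illustration only, not the thesis code.

```python
import numpy as np

n = 500  # any number of pedestrian images
rng = np.random.default_rng(1)
hog_fv       = rng.random((n, 1000))  # entropy controlled HOG features (assumed size)
lomo_fv      = rng.random((n, 600))   # entropy controlled LOMO features (assumed size)
vgg19_fv     = rng.random((n, 1000))  # selected VGG19 FC features
resnet101_fv = rng.random((n, 1000))  # selected ResNet101 FC features

# Joint feature representation: simple column-wise (serial) concatenation.
jfr = np.concatenate([hog_fv, lomo_fv, vgg19_fv, resnet101_fv], axis=1)
assert jfr.shape[1] == 3600  # matches the HOG+LOMO+VGG19+ResNet101 FSD in the tables
```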

Table 4.29: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation
Joint feature representation        FSD     O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG                            1600    85.5        93        87.7      84.0
LOMO+ResNet101                      1600    87.8        95        88.7      87.1
LOMO+VGG19                          1600    86.5        94        87.4      85.8
HOG+ResNet101                       2000    87.8        95        88.5      87.2
HOG+VGG19                           2000    86.6        94        87.3      86.1
VGG19+ResNet101                     2000    85.7        93        86.3      85.4
HOG+LOMO+VGG19                      2600    87.6        95        88.8      86.7
HOG+LOMO+ResNet101                  2600    88.9        95        90.0      88.0
LOMO+VGG19+ResNet101                2600    88.1        95        89.0      87.4
HOG+VGG19+ResNet101                 3000    88.4        95        88.9      88.1
HOG+LOMO+VGG19+ResNet101            3600    89.3        96        90.0      88.7
Table 4.30: Performance evaluation of proposed J-LDFR method on PETA dataset using M-SVM classifier with 10-fold cross-validation
Joint feature representation        FSD     O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG                            1600    84.9        92        87.9      82.9
LOMO+ResNet101                      1600    87.3        94        88.6      86.4
LOMO+VGG19                          1600    86.1        93        87.8      85.0
HOG+ResNet101                       2000    86.7        94        87.8      86.2
HOG+VGG19                           2000    85.7        93        87.3      84.3
VGG19+ResNet101                     2000    85.4        93        86.1      84.1
HOG+LOMO+VGG19                      2600    87.2        94        89.0      85.9
HOG+LOMO+ResNet101                  2600    88.1        95        89.5      87.1
LOMO+VGG19+ResNet101                2600    87.6        95        88.1      86.7
HOG+VGG19+ResNet101                 3000    87.3        95        88.4      86.6
HOG+LOMO+VGG19+ResNet101            3600    88.6        95        90.0      87.8
The performance of entropy controlled LOMO, HOG, VGG19, and ResNet101 feature representations and their fusion in terms of O-ACC, AUC, precision, and recall on the PETA dataset is investigated and shown in Figure 4.15. It is noted that the contribution of individual features significantly influences the results computed through joint feature representation under the selected evaluation protocols and settings.
According to the results on the PETA dataset as presented in Tables (4.29)-(4.31), the combinations LOMO+VGG19+ResNet101 and HOG+VGG19+ResNet101 attained 88.1% and 88.4% O-ACC respectively, and 95% AUC in both cases.
However, in the case of the feature combination LOMO+HOG+VGG19+ResNet101, the proposed framework showed 89.3% O-ACC, 96% AUC, 90% precision, and 88.7% recall on PETA dataset. Similarly, an investigation is carried out for MIT dataset. The results of entropy controlled LOMO, HOG, VGG19, and ResNet101 feature representations and their fusion in terms of O-ACC, AUC, precision, and recall on MIT dataset are computed.

Table 4.31: Performance evaluation of proposed J-LDFR method on PETA dataset using Q-SVM classifier with 10-fold cross-validation
Joint feature representation        FSD     O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG                            1600    84.3        92        87.1      82.5
LOMO+ResNet101                      1600    87.4        94        88.8      86.4
LOMO+VGG19                          1600    86.0        93        87.2      85.1
HOG+ResNet101                       2000    87.0        94        88.0      86.2
HOG+VGG19                           2000    85.7        93        86.7      85.1
VGG19+ResNet101                     2000    85.6        93        86.1      85.2
HOG+LOMO+VGG19                      2600    86.6        94        88.0      85.7
HOG+LOMO+ResNet101                  2600    87.9        95        89.3      86.8
LOMO+VGG19+ResNet101                2600    87.6        95        88.6      86.9
HOG+VGG19+ResNet101                 3000    87.6        94        88.4      87.1
HOG+LOMO+VGG19+ResNet101            3600    88.5        95        89.3      87.7
According to the results as presented in Tables (4.32)-(4.34), the combination LOMO+HOG+VGG19+ResNet101 outperformed other feature combinations with an 82% O-ACC, 86% AUC, 92.2% precision, and 84% recall on MIT dataset. The abovementioned feature combinations are also tested with other classifiers and achieved promising results as shown in Tables (4.29)-(4.31) and Tables (4.32)-(4.34).
Since the joint feature representation approach considering both low-level and deep CNN features significantly improved the results, it confirms the robustness of the proposed J-LDFR framework for gender classification. For instance, comparing Figure 4.15 and Figure 4.16, it is obvious that deep CNN feature representation is effective when high-level information about the input image is required, whereas utilizing low-level feature representation contributes to enhancing the performance. In addition, it is observed from the results that the high-level deep features of two different models preserve distinct clues of a gender image, thus enabling reliable classification of a sample by different classification methods.
Figure 4.15: Proposed evaluation results using individual feature representation and
JFR on PETA dataset

Table 4.32: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using cubic-SVM classifier with 10-fold cross-validation
Joint feature representation        FSD     O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG                            1600    79.0        82        89.8      80.3
LOMO+ResNet101                      1600    76.4        81        87.5      79.5
LOMO+VGG19                          1600    75.6        80        87.8      78.5
HOG+ResNet101                       2000    79.3        84        90.0      81.3
HOG+VGG19                           2000    81.0        85        90.6      82.8
VGG19+ResNet101                     2000    76.2        79        86.8      79.7
HOG+LOMO+VGG19                      2600    81.0        85        90.3      83.0
HOG+LOMO+ResNet101                  2600    78.2        84        89.8      80.2
LOMO+VGG19+ResNet101                2600    76.9        81        87.1      80.3
HOG+VGG19+ResNet101                 3000    79.4        85        89.3      81.8
HOG+LOMO+VGG19+ResNet101            3600    82.0        86        91.2      84.0
Table 4.33: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using medium-SVM classifier with 10-fold cross-validation
Joint feature representation        FSD     O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG                            1600    77.5        82        98.0      75.7
LOMO+ResNet101                      1600    73.9        80        97.6      72.8
LOMO+VGG19                          1600    73.1        79        97.8      72.2
HOG+ResNet101                       2000    77.8        84        96.8      76.5
HOG+VGG19                           2000    76.4        85        97.3      75.1
VGG19+ResNet101                     2000    74.8        79        96.0      74.2
HOG+LOMO+VGG19                      2600    76.1        85        98.0      74.6
HOG+LOMO+ResNet101                  2600    76.4        84        97.5      75.0
LOMO+VGG19+ResNet101                2600    74.8        81        96.3      74.1
HOG+LOMO+VGG19+ResNet101 and HOG+VGG19+ResNet101 rows follow:
HOG+VGG19+ResNet101                 3000    77.1        85        96.1      76.2
HOG+LOMO+VGG19+ResNet101            3600    78.0        85        97.8      75.9
Furthermore, the computed AUC of the proposed framework on the PETA and MIT datasets is shown in Figure 4.17 and Figure 4.18 respectively. According to these results, C-SVM outperformed M-SVM and Q-SVM with 96% AUC on the PETA dataset as shown in Figure 4.17, while 95% AUC is obtained using both M-SVM and Q-SVM. Similarly, the best AUC on the MIT dataset is 86%, achieved using both C-SVM and Q-SVM as shown in Figure 4.18. These AUC results show the significance of joint feature representation for gender prediction.
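AUC values such as those plotted in Figure 4.17 and Figure 4.18 can be reproduced from classifier decision scores with a standard ROC routine. The scikit-learn sketch below uses toy labels and scores, so the printed value is illustrative and not a thesis result.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)            # toy labels: 1 = male, 0 = female
scores = 0.6 * y_true + 0.8 * rng.random(200)    # toy decision scores from a classifier

fpr, tpr, _ = roc_curve(y_true, scores)          # ROC curve points
print("AUC = %.2f" % roc_auc_score(y_true, scores))
```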

Table 4.34: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using quadratic-SVM classifier with 10-fold cross-validation
Joint feature representation        FSD     O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG                            1600    77.4        82        90.3      79.1
LOMO+ResNet101                      1600    77.0        82        89.6      79.1
LOMO+VGG19                          1600    76.1        80        88.5      78.7
HOG+ResNet101                       2000    79.4        84        90.8      80.9
HOG+VGG19                           2000    80.7        85        91.5      82.1
VGG19+ResNet101                     2000    76.8        81        89.1      79.1
HOG+LOMO+VGG19                      2600    80.4        85        91.1      81.8
HOG+LOMO+ResNet101                  2600    80.2        85        92.1      81.1
LOMO+VGG19+ResNet101                2600    77.1        82        88.2      80.0
HOG+VGG19+ResNet101                 3000    80.3        85        90.5      82.1
HOG+LOMO+VGG19+ResNet101            3600    81.4        86        91.0      82.7
The proposed framework also calculates CW-ACC to show the robustness of joint low-
level and deep CNN feature representation.

With 10-fold cross-validation, the confusion matrices under C-SVM are shown in Table 4.35 and Table 4.36 for the PETA and MIT datasets respectively. The male and female classification accuracies are 90% and 89% on the PETA dataset, whereas they are 92% and 62% respectively on the MIT dataset.

Figure 4.16: Proposed evaluation results using individual feature representation and
JFR on the MIT dataset

Table 4.35: Confusion matrix using C-SVM on PETA dataset

True classes     Predicted: Male     Predicted: Female
Male             90%                 10%
Female           11%                 89%

Table 4.36: Confusion matrix using C-SVM on MIT dataset

True classes     Predicted: Male     Predicted: Female
Male             92%                 8%
Female           38%                 62%
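Class-wise accuracy is simply the row-normalised diagonal of the confusion matrix. The sketch below approximately reproduces the MIT values of Table 4.36 from raw counts reconstructed out of the reported percentages (600 male and 288 female samples); the integer counts are therefore an assumption for illustration.

```python
import numpy as np

# Rows = true class, columns = predicted class, order: [male, female].
cm = np.array([[552,  48],    # 600 male samples, 92% correctly classified
               [109, 179]])   # 288 female samples, ~62% correctly classified

cw_acc = np.diag(cm) / cm.sum(axis=1)   # class-wise accuracy (CW-ACC)
o_acc  = np.trace(cm) / cm.sum()        # overall accuracy (O-ACC)
m_acc  = cw_acc.mean()                  # mean accuracy (M-ACC)
print(cw_acc.round(2), round(o_acc, 3), round(m_acc, 3))  # ~[0.92 0.62] 0.823 0.771
```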

Figure 4.17: AUC on PETA dataset

Figure 4.18: AUC on MIT dataset

4.3.3.4 Comparison with State-of-the-art Methods

To validate the usefulness of proposed framework, the results are compared with
various state-of-the-art pedestrian gender classification approaches such as HOG, Mini-
CNN, AlexNet-CNN [47], Hierarchical ELM [264], GoogleNet [201], ResNet50 [217],
SSAE [46], W-HOG, DFL, HDFL [45] and DFN, DHFFN [15]. In this regard, only
full-body appearance based methods are considered for pedestrian gender classification
using PETA dataset and AUC evaluation protocol. To the best of our knowledge,
DHFFN [15] is the most recent method reported for pedestrian gender classification on
the PETA dataset. The existing and computed results on the PETA dataset are
illustrated in Table 4.37.

The comparison depicts that the proposed J-LDFR framework outperforms other existing methods on the PETA dataset. The proposed framework achieves 96% AUC using joint low-level and deep feature representation with C-SVM. Compared with existing appearance based methods such as HOG, Mini-CNN, AlexNet-CNN, Hierarchical ELM, GoogleNet, ResNet50, SSAE, W-HOG, DFL, HDFL, DFN, and DHFFN, the proposed framework performs better in terms of AUC by 8%, 10%, 6%, 4%, 5%, 6%, 4%, 10%, 3%, 1%, 3%, and 1% respectively, as shown in Figure 4.19.
These improvements are obtained by combining viewpoint invariant LOMO features,
rotation-invariant HOG features, and deeply learned feature representations.

Table 4.37: Performance comparison with existing methods using PETA dataset

Methods                    Year    AUC (%)
HOG [47]                   2015    88
Mini-CNN [47]              2015    86
AlexNet-CNN [47]           2015    90
Hierarchical ELM [264]     2015    92
GoogleNet [201]            2015    91
ResNet50 [217]             2016    90
SSAE [46]                  2018    92
W-HOG [45]                 2018    86
DFL [45]                   2018    93
HDFL [45]                  2018    95
DFN [15]                   2018    93
DHFFN [15]                 2018    95
Proposed J-LDFR                    96

In addition, the formation of a robust joint feature representation is made possible by the successful use of entropy controlled FSs. Moreover, the computed results on the MIT dataset are compared with previous results as illustrated in Table 4.38. The comparison shows that the proposed J-LDFR framework, using C-SVM with the same settings, outperforms the existing methods. The existing methods HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, HOG-LBP-HSV [169], PBGR [166], BIO-PCA, BIO-OLPP, BIO-LSDA, BIO-MFA, BIF-PCA [60], CNN [61], CNN-e [183], and PiHOG-LHSV [168] are chosen for comparison. Furthermore, CNN and CaffeNet results based on the upper (U), middle (M), and lower (L) body patches are also taken for comparison [71]. Only appearance based methods are chosen because they are the ones reported in the literature for pedestrian gender classification on the MIT dataset. Overall, compared with the existing methods HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, HOG-LBP-HSV, PBGR, BIO-PCA, BIO-OLPP, BIO-LSDA, BIO-MFA, BIF-PCA, CNN, CNN-e, PiHOG-LHSV, CNN-1, CNN-2, CNN-3, and CaffeNet, the proposed framework outperforms all of them in terms of O-ACC with improvements of 3.1%, 5.9%, 10.7%, 4.4%, 1.1%, 2.2%, 1.9%, 7%, 2.8%, 4.9%, 3.8%, 6.8%, 1.4%, 1.6%, 0.5%, 9.7%, 1.2%, 0.8%, 0.7%, and 1.1% respectively. The comparison of existing and computed results is shown in Figure 4.20. The J-LDFR framework attained 77.3% M-ACC, which is the best among the seven existing methods reporting this metric, namely HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, and HOG-LBP-HSV, with improvements of 1.4%, 8.8%, 12.5%, 3.6%, 2%, 0.7%, and 0.6% respectively.

Figure 4.19: Comparison of existing and proposed results in terms of AUC on PETA
dataset
Table 4.38: Performance comparison with existing methods using MIT dataset, dash
(-) represents that no reported result is available
Methods Year O-ACC (%) M-ACC (%)
PBGR [166] 2008 75.0 -
PiHOG-LHSV [168] 2009 72.3 -
BIO-PCA [60] 2009 79.2 -
BIO-OLPP [60] 2009 77.1 -
BIO-LSDA [60] 2009 78.2 -
BIO-MFA [60] 2009 75.2 -
BIF-PCA [60] 2009 80.6 -
CNN [61] 2013 80.4 -
HOG [169] 2015 78.9 75.9
LBP [169] 2015 76.1 68.5
HSV [169] 2015 71.3 64.8
LBP-HSV [169] 2015 77.6 73.7
HOG-HSV [169] 2015 80.9 75.3
HOG-LBP [169] 2015 79.8 76.6
HOG-LBP-HSV [169] 2015 80.1 76.7
CNN-e [183] 2017 81.5 -
U+M+L (CNN-1) [71] 2019 80.8 -
U+M+L (CNN-2) [71] 2019 81.2 -
U+M+L (CNN-3) [71] 2019 81.3 -
U+M+L (CaffeNet) [71] 2019 80.9 -
Proposed J-LDFR 82.0 77.3

Figure 4.20: Comparison of existing and proposed results in terms of overall accuracy
on MIT dataset

In addition, the overall results of the proposed J-LDFR framework are listed in Table 4.39. The better outcome of the J-LDFR framework is because of the joint feature representation comprising illumination and rotation invariant HOG features, LOMO features, and deep feature representations of two CNN models. It further confirms that deep CNN feature representations are complementary to low-level feature representations, hence contributing to the improved results.

Table 4.39: Proposed J-LDFR method results on PETA and MIT datasets
Datasets    O-ACC (%)    M-ACC (%)    AUC (%)    CW-ACC Male (%)    CW-ACC Female (%)
PETA        89.3         89.5         96         90                 89
MIT         82.0         77.3         86         92                 62

The proposed J-LDFR framework achieved 89.3% O-ACC, 89.5% M-ACC, and 96%
AUC on PETA dataset. The results on MIT dataset are 82% O-ACC, 77.3% M-ACC,
and 86% AUC. The proposed J-LDFR method also obtained noteworthy results in terms
of CW-ACC as 90% and 89% on PETA dataset for male and female classes
respectively. Similarly, CW-ACC is 92% and 62% on MIT dataset for male and female
classes respectively. To the best of our knowledge, J-LDFR framework outcomes are
better than existing full-body appearance based pedestrian gender classification results
in terms of O-ACC, M-ACC, AUC, and CW-ACC on large scale PETA and small scale
MIT datasets.

4.4 Proposed Method for Pedestrian Gender Classification on Imbalanced and Small Sample Datasets using Parallel and Serial Fusion of Selected Deep and Traditional Features (PGC-FSDTF)
This section contains the evaluation of the proposed approach named pedestrian gender classification using fusion of selected deep and traditional features (PGC-FSDTF). Initially, the performance evaluation protocols and implementation settings considered for this work are described. Then, the discussion covers the presentation of the selected, prepared, and augmented datasets for a balanced class-wise distribution of data. Finally, the proposed method results on these datasets are evaluated in detail and afterwards compared with state-of-the-art methods.

4.4.1 Performance Evaluation Protocols and Implementation Settings for Pedestrian Gender Classification
Out of many evaluation protocols, six are taken into account for evaluating the proposed method; these are already described in Section 4.3.1 and mathematically represented in Table 4.11. A few other evaluation protocols such as F1-score, negative predictive value (NPV), specificity/true negative rate (TNR), balanced accuracy (B-ACC), and false negative rate (FNR) are also selected to estimate the performance of the proposed PGC-FSDTF approach for gender prediction. Furthermore, training time and CW-ACC are also calculated and presented in the experiments. The mathematical representation of F1-score, NPV, TNR, B-ACC, and FNR in equation form is provided in Table 4.40.

Table 4.40: Evaluation protocols / metrics

S. No    Metric    Equation
1        F1        $F1\text{-}score = \frac{2 \times precision \times recall}{precision + recall}$
2        NPV       $NPV = \frac{TN}{TN + FN} \times 100$
3        TNR       $TNR/Specificity = \frac{TN}{TN + FP} \times 100$
4        B-ACC     $Balanced\ accuracy = \frac{TPR + TNR}{2} \times 100$
5        FNR       $FNR = \frac{FN}{FN + TP} \times 100$

Evaluation protocols like AUC and accuracies (overall, mean, balanced, and class-wise) are key metrics to assess the performance of the proposed approach; higher accuracy or AUC scores confirm the significance of the model w.r.t. prediction performance. In particular, balanced accuracy is a suitable metric for classification tasks in the present research because of its invariance to imbalanced data. To the best of our knowledge, B-ACC is used for the first time in this work for gender classification. It is a powerful metric when the class-wise data distribution is imbalanced [265]. In this work, B-ACC is considered for both imbalanced and balanced data. In all investigations, the following procedure is followed: first, entropy and PCA based selected FSs are taken and then serially fused to produce the FFV. Secondly, after FSF, multiple classifiers (as discussed in Section 3.4.5) are trained on the proposed FFV. All the classifiers are executed with default settings.
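For clarity, the evaluation protocols of Table 4.40 can be computed directly from the confusion-matrix counts, as in the short Python sketch below; the counts passed in the example call are arbitrary toy values, not results of this thesis.

```python
def gender_metrics(tp, fp, tn, fn):
    # Evaluation protocols of Table 4.40 from raw confusion-matrix counts.
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)                   # TPR
    f1   = 2 * precision * recall / (precision + recall)
    npv  = tn / (tn + fn) * 100
    tnr  = tn / (tn + fp) * 100                  # specificity
    bacc = (recall * 100 + tnr) / 2              # balanced accuracy (B-ACC)
    fnr  = fn / (fn + tp) * 100
    return f1, npv, tnr, bacc, fnr

# Toy counts for a male-vs-female split (illustrative only).
print(gender_metrics(tp=520, fp=60, tn=230, fn=80))
```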

The proposed PGC-FSDTF approach followed a k-fold cross-validation procedure for the training and testing of the applied classifiers, where k=10, which is commonly applied for the assessment of models. In cross-validation, all samples are divided into k random subsamples, where a single subsample is used for testing and the remaining ones for training. Consequently, in each fold, new data is randomly selected for training and testing, and different accuracies are obtained using the classifiers. Finally, the AUC and the overall, average, and balanced accuracies of the selected classifier are calculated along with the other abovementioned evaluation protocols. Moreover, the performance of multiple classifiers on different datasets is tabulated and discussed in the following subsections. All experiments are performed on an Intel Core i7-7700 CPU @ 3.60 GHz desktop computer with 16 GB RAM and an NVIDIA GeForce GTX 1070, using MATLAB 2018b including the trained deep CNN networks.
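The 10-fold protocol can be mirrored in a few lines; the sketch below is an illustrative Python/scikit-learn approximation (the thesis experiments use MATLAB's classifiers), where the polynomial-kernel SVM roughly corresponds to a cubic SVM and the random arrays stand in for the fused feature vectors and gender labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((400, 200))            # stand-in for the fused feature vectors (FFV)
y = rng.integers(0, 2, size=400)      # stand-in gender labels: 0 = female, 1 = male

clf = SVC(kernel="poly", degree=3)    # rough analogue of a cubic SVM
acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")  # 10-fold cross-validation
print("mean 10-fold accuracy: %.3f" % acc.mean())
```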

4.4.2 Pedestrian Analysis Datasets Presentation for Pedestrian Gender Classification including Augmented Datasets
In this work, mainly five pedestrian analysis datasets MIT [166], PKU-Reid [225],
PETA [263], VIPeR [97], and cross-dataset [45] are selected to test the proposed
approach PGC-FSDTF for pedestrian gender classification. For gender prediction,
analyzing pedestrian full-body appearance is more challenging because of variations in
viewpoint angle changes from 0° to 315°, different camera settings, low resolution,
pose changes, and light effects.

Figure 4.21: Gender wise pair of sample images collected from MIT, PKU-Reid, PETA,
VIPeR, and cross-datasets where column represents the gender (male and
female) from each dataset, upper row shows images of male, and lower
row shows images of female

Figure 4.21 shows a few sample images from the selected datasets where mixed-views
of the pedestrian show these challenges clearly. A detailed explanation of selected
datasets is given in subsections.

4.4.2.1 MIT and Augmented MIT Dataset Presentation

MIT dataset has annotated images of pedestrians captured in an outdoor environment for pedestrian attribute recognition. This dataset is a part of large-scale PETA dataset
and is also widely used for pedestrian gender classification. It has a total number of 888
images such that gender class consists of 600 and 288 images for males and females,
respectively. The class-wise variation in samples identifies MIT dataset as an
imbalanced dataset because the number of samples belonging to female class is
approximately 50% lower than male class samples. This change directly affects overall
performance especially in terms of O-ACC and CW-ACC. To handle these problems,
a random sampling procedure is applied to equalize class-wise data distribution. In this
work, a random oversampling procedure is adopted for data augmentation of MIT
dataset. Resultantly, each class of gender is customized and two additional balanced
datasets are prepared for gender prediction: a) balanced with random over sampling-1
(MIT-BROS-1), and b) balanced with random over sampling-2 (MIT-BROS-2) to
observe the robustness of suggested approach on balanced datasets. Besides, only
female class of MIT dataset is customized and data augmentation is carried out to
balance the number of images of both classes. Another dataset named as balanced with
random over sampling-3 (MIT-BROS-3) is generated to test the proposed approach.
The only difference among three augmented datasets (MIT-BROS-1, MIT-BROS-2,
and MIT-BROS-3) is a procedure to select the samples from single/both classes of MIT
dataset (as discussed in chapter 3, section 3.4.1.1).
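A simplified sketch of the random oversampling step is shown below; it only re-draws existing minority-class samples with replacement, whereas the actual BROS augmentation additionally applies image-level transformations (chapter 3, section 3.4.1.1). The array shapes and helper name are assumptions for illustration.

```python
import numpy as np

def random_oversample(minority, target_count, seed=0):
    # Randomly re-draw minority-class samples (with replacement)
    # until the class reaches target_count.
    rng = np.random.default_rng(seed)
    extra = rng.choice(len(minority), size=target_count - len(minority), replace=True)
    return np.concatenate([minority, minority[extra]], axis=0)

# MIT: 600 male vs 288 female images; balance the female class to 600 (cf. MIT-BROS-3).
female_images = np.zeros((288, 128, 64, 3), dtype=np.uint8)   # placeholder image array
balanced_female = random_oversample(female_images, target_count=600)
print(balanced_female.shape[0])   # 600
```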

All the augmented datasets are also taken for experiments while comparing state-of-
the-art methods with MIT-IB dataset. Table 4.41 shows the statistics of MIT-IB and
augmented MIT datasets. Annotation for male and female classes is used only in terms
of mixed (front and back) views in all experiments. As discussed above, the MIT dataset has class-imbalance and SSS issues. Moreover, the frontal and back views of gender images, together with environmental effects such as low resolution and lighting conditions, make the gender classification task more challenging. Figure 4.22 shows a few images from the MIT/MIT-IB and augmented MIT
datasets as an example so that the diversity in the appearance of pedestrians can be
observed from these sample images.

Figure 4.22: Sample images of pedestrian with back and front views (a) MIT/MIT-IB
dataset (b) augmented MIT datasets. First two rows represent male
images, and next two rows represent female images
Table 4.41: Statistics of MIT/MIT-IB dataset samples based imbalanced and
augmented balanced small sample datasets for pedestrian gender
classification
Dataset        #Males    #Females    Total images    View     Image size    Scenario
MIT-IB         600       288         888             Mixed    128×64        Outdoor
MIT-BROS-1     864       864         1728            Mixed    128×64        Outdoor
MIT-BROS-2     864       864         1728            Mixed    128×64        Outdoor
MIT-BROS-3     600       600         1200            Mixed    128×64        Outdoor

4.4.2.2 PKU-Reid Dataset Presentation

PKU-Reid dataset has annotated images of pedestrian appearances in eight directions.
The images are captured from two disjoint cameras in an indoor environment and used
for person ReID [225]. This is a challenging dataset due to different viewpoints,
illumination conditions, pose changes, and background noises. It has a total number of
1824 images (mixed views) of 114 pedestrians (70 males and 44 females). According
to camera settings, it is noticed that the appearances of 70 males and 44 females are
captured in eight different directions (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°)
using two cameras, as shown in Figure 4.23. It means that each gender image is captured
from eight different orientations under a single camera. Resultantly, it is found that the
captured male images from one camera are 560 with a total of 560×2=1120, and female

images are 352 with a total of 352×2=704, under two cameras. Based on these
collections, these images are categorized into male class with total number of 1120
images and female class with a total number of 704 images. It is worth mentioning that
collection represented in this work as PKU-Reid-IB dataset is equally suitable for
pedestrian gender classification after normal categorization. This collection contains
mixed (front, back, and side) views of each gender image. The class-wise variation in
collected samples identifies that the prepared PKU-Reid-IB dataset is imbalanced
because the number of samples belonging to female class is approximately 37% lower
than male class samples. To balance the data, a random sampling procedure is applied
to equalize class-wise data distribution. For this purpose, a random oversampling
procedure is implemented for data augmentation of PKU-Reid dataset. Subsequently,
each class of gender is customized and two additional balanced datasets are prepared
for gender prediction: 1) balanced with random oversampling-1 (PKU-Reid-BROS-1), and 2) balanced with random oversampling-2 (PKU-Reid-BROS-2), to perceive the robustness of the proposed approach on balanced datasets. Only the female class of
PKU-Reid dataset is augmented and data augmentation operations are applied to
balance female class as equal to male class. For this purpose, another dataset named
(PKU-Reid-BROS-3) is created to examine the proposed approach. The only difference
among these augmented datasets (PKU-Reid-BROS-1, PKU-Reid-BROS-2, and PKU-
Reid-BROS-3) is a procedure to select samples from single/both classes of PKU-Reid
dataset. Then, data augmentation operations are carried out with suggested strategies
(as discussed in chapter 3, Section 3.4.1.1). Table 4.42 shows the statistics of PKU-
Reid-IB and augmented PKU-Reid datasets.

Table 4.42: Statistics of PKU-Reid dataset samples based imbalanced and augmented
balanced small sample datasets for pedestrian gender classification
Dataset            #Males    #Females    Total images    View     Image size    Scenario
PKU-Reid-IB        1120      704         1824            Mixed    128×64        Indoor
PKU-Reid-BROS-1    1300      1300        2600            Mixed    128×64        Indoor
PKU-Reid-BROS-2    1300      1300        2600            Mixed    128×64        Indoor
PKU-Reid-BROS-3    1120      1120        2240            Mixed    128×64        Indoor

To validate the proposed approach, these augmented datasets are also taken for
experiments and their performance is compared with an imbalanced PKU-Reid-IB
dataset. Figure 4.23 shows a few sample images from the PKU-Reid-IB and augmented PKU-Reid datasets as an example so that the pedestrian appearances under
viewpoint angle changes can be observed.

Figure 4.23: Sample images of pedestrians (a) first and second row represent male and
female samples, respectively collected from PKU-Reid-IB dataset, (b)
first and second row represent male and female samples respectively
collected from augmented datasets, and (c) male (top) and female (bottom)
images to show pedestrian images with different viewpoint angle changes
from 0° to 315° , total in eight directions

4.4.2.3 PETA-SSS Dataset Presentation

PETA dataset has annotated images of pedestrians taken from ten existing datasets. It contains 19000 images in total and is widely used as a large-scale dataset for pedestrian attribute recognition. The sub-datasets of PETA contribute 477 images for iLIDS, 4563 for CUHK, 1012 for 3DPeS, 1275 for GRID, 1220 for CAVIAR4REID, 888 for MIT, 1134 for PRID, 200 for SARC3D, 1264 for VIPeR, and 6967 for Town Center, corresponding to 2%, 24%, 5%, 7%, 6%, 5%, 6%, 1%, 7%, and 37% of the total respectively. The information regarding gender images in these datasets has variation
in terms of camera angle, viewpoint, illumination, resolution, body deformation, and
scenes (indoor/ outdoor) [263]. Some sample images from these datasets are shown in
Figure 4.24. In the literature, many studies conducted experiments on the PETA dataset for pedestrian gender classification, utilizing a maximum of 8472 images per class for both male and female classes [46, 48, 266].

Figure 4.24: Sample images of pedestrians; each column represents a gender (male and female) selected from the customized SSS PETA datasets, the upper row shows images of males, and the lower row shows images of females

As the investigations are performed on SSS datasets, despite PETA being a large-scale dataset, totals of 864 and 1300 samples (images) for each gender are randomly
chosen from the PETA dataset. There are two reasons to choose these samples: 1) to obtain a random SSS collection of data from the PETA dataset in order to test the proposed approach on a more challenging dataset, and 2) because, after data augmentation on the MIT and PKU-Reid datasets, the class-wise distribution of samples increases up to 864 and 1300 for the MIT and PKU-Reid datasets respectively. Based on these collections, two datasets are presented for gender classification with 864 images per gender in PETA-SSS-1 and 1300 images per gender in PETA-SSS-2. Table 4.43 shows the statistics of both datasets.

Table 4.43: Statistics of PETA dataset samples based customized PETA-SSS-1 and
PETA-SSS-2 datasets for pedestrian gender classification

Dataset       #Males    #Females    Total images    View     Image size    Scenario
PETA-SSS-1    864       864         1728            Mixed    Vary          Outdoor and indoor
PETA-SSS-2    1300      1300        2600            Mixed    Vary          Outdoor and indoor

4.4.2.4 VIPeR-SSS Dataset Presentation

VIPeR dataset is also considered for experiments. It is a sub-dataset of the PETA dataset and is widely used for person ReID [18, 215, 267]. This dataset was also examined by [168] for pedestrian gender classification such that they only categorized frontal views of pedestrians and found 292 male and 291 female samples. The original VIPeR dataset comprises 632 pedestrian objects and has two different views of each object captured
from two disjoint cameras: CAM1 and CAM2. The views of these cameras are
challenging because of variation in viewpoint angle from 45° to 180°, different camera settings, pose, and light effects, as shown in Figure 4.24. These views are normalized to 128×48 pixels (cf. Table 4.44). In this work, a new dataset named VIPeR-SSS is created by observing
CAM1 and CAM2 images. For this purpose, VIPeR dataset images are annotated and
categorized into male and female classes for pedestrian gender classification, resulting in 720 and 544 images of males and females, respectively. To maintain a class-wise equal distribution of data, 544 images from the male class are randomly chosen. Thus, an equal distribution of male and female classes with 544 images each is used to test the performance of the proposed method. It is also noticeable that the prepared VIPeR-SSS dataset is suitable for pedestrian gender classification as another SSS dataset for experiments. This dataset also contains mixed (frontal, back, and side) views of each gender image. The statistics of the customized VIPeR-SSS dataset are shown in Table 4.44.
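The class balancing of VIPeR-SSS is a plain random undersampling of the larger male class, as sketched below; the image size follows Table 4.44 and the arrays are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
male_images = np.zeros((720, 128, 48, 3), dtype=np.uint8)   # 720 annotated male images
keep = rng.choice(720, size=544, replace=False)             # match the 544 female images
male_balanced = male_images[keep]
print(male_balanced.shape[0])   # 544
```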

Table 4.44: Statistics of VIPeR dataset samples based customized VIPeR-SSS dataset
for pedestrian gender classification
Dataset      #Males    #Females    Total images    View     Image size    Scenario
VIPeR-SSS    544       544         1088            Mixed    128×48        Outdoor

4.4.2.5 Cross-dataset Presentation

To demonstrate the generalization ability, the proposed PGC-FSDTF approach is also tested by collecting images from sub-datasets of PETA dataset such as 3DPeS,
CAVIAR, i-LIDS, SARC3D, and Town Centre with random selection of 50, 34, 50,
20, and 21 images respectively for each gender [45]. In addition, more images from the
same sub-datasets are taken to prepare one more dataset used in this work with the name
of cross-dataset-1. For this purpose, 100, 68, 100, 32, and 50 images are randomly
selected from 3DPeS, CAVIAR, i-LIDS, SARC3D, and Town Centre datasets
respectively. For experiments, cross-dataset and cross-dataset-1 contain 350, and 700
total images respectively. Few sample images are shown in Figure 4.25. The objective
to create cross-dataset-1 is to investigate the performance of the proposed method on more mixed views of gender images selected from the same sub-datasets of PETA dataset. The statistics of the cross-datasets are described in Table 4.45.

Figure 4.25: Gender wise sample images of pedestrian, column represents gender (male
and female) selected from sub-datasets of PETA dataset, upper row
shows two images of male, and lower row shows two images of female
Table 4.45: Statistics of cross-datasets for pedestrian gender classification

Dataset            #Males    #Females    Total images    View     Image size    Scenario
Cross-dataset      175       175         350             Mixed    Vary          Outdoor and indoor
Cross-dataset-1    350       350         700             Mixed    Vary          Outdoor and indoor

An overview of all thirteen (selected, imbalanced, augmented, and customized balanced) datasets with the class-wise distribution of samples for pedestrian gender
classification is shown in Figure 4.26. Extensive experiments are conducted on these
datasets (MIT-IB, MIT-BROS-1, MIT-BROS-2, MIT-BROS-3, PKU-Reid-IB, PKU-
Reid-BROS-1, PKU-Reid-BROS-2, PKU-Reid-BROS-3, PETA-SSS-1, PETA-SSS-2,
VIPeR-SSS, cross-dataset, and cross-dataset-1) to evaluate the performance of
proposed PGC-FSDTF method. The applied datasets are divided into two categories:
(1) two imbalanced (MIT-IB, and PKU-Reid-IB) and six augmented balanced SSS
datasets as shown in Figure 4.26 (a) and (2) remaining five datasets are customized
balanced SSS datasets, as shown in Figure 4.26 (b). In this study, each dataset is used
separately and computed results are discussed in subsequent sections.

Figure 4.26: An overview of the selected, imbalanced, augmented balanced, and
customized datasets with the class-wise (male and female) distribution of
samples for pedestrian gender classification (a) imbalanced and
augmented balanced SSS datasets and (b) customized balanced SSS
datasets

4.4.3 Results Evaluation

A number of experiments are performed to analyze the performance of the proposed PGC-FSDTF approach on the selected, customized, and augmented SSS datasets. For
precise gender prediction, entropy and PCA based best selected PHOG, MaxDeep, and
AvgDeep features are serially fused with HSV-Hist based features. Before FSF, parallel
fusion is implemented on computed FCL based deep features of two CNN models (as
discussed in chapter 3). A detailed investigation on all datasets is provided separately
in subsections using both entropy and PCA based best feature combination (FFV).

Using selected features, multiple classifiers L-SVM, Q-SVM, C-SVM, and M-SVM
are trained with standard settings and configurations (as described in chapter 3).
Besides, different evaluation protocols, for example PPV, TPR, F1, NPV, TNR, FPR,
FNR, O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC are used to assess the
robustness of proposed approach. Moreover, the results are presented in multiple tables
containing numerical values indicating gender classification rates using common and
numerous evaluation protocols. The objective of these investigations is to observe
higher classification rates and robustness of proposed FSF approach for accurate gender
prediction. As a novel research contribution, results are also provided on few
augmented and customized SSS datasets, for instance, MIT-BROS-1, MIT-BROS-2,
MIT-BROS-3, PKU-Reid-IB, PKU-Reid-BROS-1, PKU-Reid-BROS-2, PKU-Reid-
BROS-3, PETA-SSS1, PETA-SSS2, VIPeR-SSS, and cross-dataset1 with a balanced
distribution of data. In this work, all experiments are conducted using entropy and PCA
based selected features sets (vectors) such as PHOG_FV, HSV-Hist_FV, MaxDeep_FV
and AvgDeep_FV, and their fusion. FFV is provided to selected classifiers and results
are analyzed using evaluation protocols. Comparisons of results with state-of-the-art
methods are also done on the abovementioned datasets (MIT/MIT-IB and cross-dataset).
In subsequent tables, best computed results are written in bold.

4.4.3.1 Performance Evaluation on MIT and Augmented MIT Datasets

In this section, the proposed PGC-FSDTF approach classifies a data instance into male
or female class on MIT-IB (imbalanced), MIT-BROS-1 (balanced), MIT-BROS-2
(balanced), and MIT-BROS-3 (balanced) datasets using multiple classifiers. Two kinds
of experiments are performed on each dataset and computed outcomes are evaluated
using standard performance evaluation metrics. These experiments are performed to 1)
observe entropy and PCA based features set performance on imbalanced and balanced
SSS datasets, and 2) validate the effect of proposed approach to overcome false
positives and improvements in terms of accuracies and AUC. Later, obtained results on
MIT-IB dataset are compared with state-of-the-art approaches in literature. The existing
studies utilized MIT dataset for gender prediction in which gender wise images are 600
males and 288 females, whereas in this study, MIT dataset is designated as imbalanced
and renamed as MIT-IB dataset to highlight its imbalanced nature. The following
subsections discuss the results using the abovementioned imbalanced and three
balanced datasets.

a) Performance Evaluation on MIT-IB Dataset: Results are calculated on an
imbalanced and SSS MIT-IB dataset using proposed PGC-FSDTF method with settings
described above. The computed results using different classifiers such as L-SVM, Q-
SVM, C-SVM, and M-SVM under evaluation protocols of PPV, TPR, F1, NPV, TNR,
FPR, and FNR are shown in Table 4.46 and Table 4.47. The results depict the role of both feature selection methods (entropy and PCA based feature sets) on the MIT-IB dataset. In the case of entropy based selected features, Q-SVM and M-SVM prove to be
better classifiers for accurate gender prediction as compared to other applied classifiers.
The empirical evaluation of the results revealed that the proposed approach using Q-
SVM produced better results in terms of F1, TNR, and FPR with 61.9%, 80.2%, and
19.7% respectively, while M-SVM outperformed other classifiers in terms of TPR,
NPV, FNR with 83.9%, 96.3%, and 16.1% respectively. C-SVM classifier, among
other applied classifiers, presented better results using a performance evaluation
measure of PPV with a value as high as 55.1%. In case of PCA based selected features,
M-SVM showed better performance as compared to other selected classifiers.
According to computed results, M-SVM classifier exhibited significant performance in
terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR with 79.8%, 97.0%, 87.6%, 98.8%,
91.1%, 8.9%, and 2.9%, respectively. Moreover, the proposed approach is tested on the other
selected classifiers such that L-SVM, Q-SVM, and C-SVM classifiers show acceptable
results. For instance, L-SVM, Q-SVM, and C-SVM produced lower values of PPV with
20.4%, 12.5%, and 22.2%, respectively as compared to M-SVM classifier. Both
entropy and PCA based computed outcomes in terms of PPV, TPR, F1, NPV, TNR,
FPR, and FNR are shown in Table 4.46.

Table 4.46: Performance of proposed PGC-FSDTF method on imbalanced MIT-IB dataset
Method                            Classifier    PPV (%)    TPR (%)    F1 (%)    NPV (%)    TNR (%)    FPR (%)    FNR (%)
Entropy based selected features   L-SVM         45.5       78.4       57.6      94.0       78.2       21.8       21.6
                                  Q-SVM         53.8       72.8       61.9      90.3       80.2       19.7       27.2
                                  C-SVM         55.1       61.2       57.9      83.3       79.3       20.6       38.8
                                  M-SVM         40.1       83.9       54.1      96.3       77.1       23.0       16.1
PCA based selected features       L-SVM         59.4       96.6       73.5      99.0       83.5       16.4       3.3
                                  Q-SVM         67.3       96.5       79.3      98.8       86.3       13.6       3.4
                                  C-SVM         57.6       93.7       71.3      98.1       82.8       17.1       6.2
                                  M-SVM         79.8       97.0       87.6      98.8       91.1       8.9        2.9

Results are also recorded with different accuracies (O-ACC, M-ACC, and B-ACC),
AUC, and time using entropy and PCA based FSs as presented in Table 4.47. Overall,
Q-SVM and M-SVM outperformed as compared to other classifiers, for example L-
SVM, and C-SVM. In case of entropy based FSs, Q-SVM classifier depicted better
outcomes in terms of O-ACC, M-ACC, and AUC with 78.4%, 72.1%, and 81%
respectively among the other applied supervised methods. Moreover, the highest B-ACC of 80.5% and CW-ACC male of 96% are observed for M-SVM, whereas Q-SVM produced the best CW-ACC female with 55%. L-SVM yields the lowest training time among the selected supervised methods with 1.99 sec.

Likewise, the results are acquired using PCA based FSs. M-SVM outperformed as
compared to other selected classifiers. According to computed results, M-SVM
classifier showed better performance in terms of O-ACC, M-ACC, B-ACC, AUC, time,
CW-ACC male, and CW-ACC female with 92.7%, 89.4%, 94.1%, 93%, 1.91 sec, 99%,
and 80%, respectively. The corresponding results on selected supervised methods are
also computed to examine the performance of proposed approach. Hence, showing
better performance under respective performance evaluation protocols strengthens the
working of proposed approach with PCA based FSs. Best entropy and PCA based
results of two classifiers are presented such that proposed method obtained higher O-
ACC among other classifiers as shown in Figure 4.27. This comparison provides an
insight into the different accuracies and AUC values for the comparison of both feature selection methods.

Table 4.47: Performance of proposed PGC-FSDTF method on imbalanced MIT-IB dataset
Method                            Classifier    O-ACC (%)    M-ACC (%)    B-ACC (%)    AUC (%)    Time (sec)    CW-ACC M (%)    CW-ACC F (%)
Entropy based selected features   L-SVM         78.2         69.7         78.3         82         1.99          94              45
                                  Q-SVM         78.4         72.1         76.5         81         2.11          90              55
                                  C-SVM         74.1         69.1         70.3         77         2.07          83              55
                                  M-SVM         78.0         68.1         80.5         80         2.64          96              40
PCA based selected features       L-SVM         86.1         79.1         90.0         88         2.25          96              60
                                  Q-SVM         88.6         83.1         91.4         91         2.02          97              67
                                  C-SVM         85.0         77.9         88.3         86         1.96          96              58
                                  M-SVM         92.7         89.4         94.1         93         1.91          99              80

Figure 4.27: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on imbalanced MIT-IB dataset

According to these results, PCA based FSs showed improved results as compared to
entropy based FSs on an imbalanced and SSS MIT-IB dataset. For example, the M-SVM classifier using PCA based FSs surpasses the entropy based results of the Q-SVM classifier by 14.3%, 17.4%, 17.6%, 12%, 9%, and 25% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively. Also, other notable improvements of 10.2%, 11.0%, 14.9%, 10%,
7%, and 12% are recorded for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and
CW-ACC female, respectively when Q-SVM results using PCA FSs are compared with
entropy based FSs. These improvements verify that PCA based features selection and
then serial fusion boost the performance as compared to entropy based FSs. Similarly,
the corresponding classification methods also produced better results using the same
FSs and settings. Moreover, it is observed that Q-SVM and M-SVM classifiers exhibit
higher AUC than other supervised methods on entropy and PCA based FSs
respectively. However, M-SVM classifier with PCA based FSs reveals a superior AUC
with a value of 93% as shown in Figure 4.28. The important factor is B-ACC with a
higher value of 80.5% on M-SVM classifier (entropy based) and 94.1% on M-SVM
classifier (PCA based), which is acceptable even in the case of an imbalanced data distribution.
MIT-IB dataset is imbalanced and also challenging due to variations in pose,
illumination, low contrast, etc. Therefore, presenting better performance under B-ACC
metric strengthens the working of proposed approach.
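As a complement, the PCA based selection stage can be sketched as a projection onto the leading principal components before serial fusion; the component count and input size below are assumptions for illustration, not the thesis settings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((888, 2000))                   # pre-selection feature matrix (placeholder)
X_pca = PCA(n_components=500).fit_transform(X)
print(X_pca.shape)                            # (888, 500) PCA based selected feature set
```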

Figure 4.28: Best AUC for males and females on imbalanced MIT-IB dataset using
PCA based selected features set

b) Performance Evaluation on Augmented MIT-BROS-1 Dataset: Experiments are also conducted in this work on MIT-BROS-1 dataset to test the proposed approach using
the same classifiers, evaluation protocols, and settings. The entropy and PCA based
computed results are presented in Table 4.48 and Table 4.49. These experiments
examine the impact of both entropy and PCA based selected FSs on an augmented
balanced and SSS MIT-BROS-1 dataset. In the case of entropy based FSs, the proposed
approach revealed better performance under C-SVM classifier as compared to other
classifiers. Outstanding results are attained in terms of PPV, TPR, F1, NPV, TNR, FPR,
and FNR with 82.0%, 83.0%, 82.5%, 83.5%, 82.2%, 17.7%, and 16.9% respectively.
In parallel, PCA based results using M-SVM classifier outperformed as compared to
entropy based results and also among other classifiers. For example, the trained
classifier M-SVM obtained PPV, TPR, F1, TNR, FPR, and FNR as 89.6%, 88.9%,
89.3%, 89.6%, 10.3%, and 11.0% respectively. Despite this, C-SVM provides a higher
NPV value of 90.5%. The corresponding results of all classifiers are also calculated
using both entropy and PCA based selected features and shown in Table 4.48. Further,
proposed approach is also tested under more evaluation protocols, for instance, O-ACC,
M-ACC, B-ACC, AUC, time, and CW-ACC. The calculated values for these evaluation
protocols are shown in Table 4.49. From this table, it can be seen that C-SVM classifier
outperformed entropy based computed results while achieving O-ACC, M-ACC, B-
ACC, AUC, CW-ACC male, and CW-ACC female as 82.6%, 82.6%, 82.6%, 91%, 84%, and 82% respectively using entropy based selected FSs. But in the case of PCA based
results, M-SVM classifier showed better performance than other applied classifiers and
entropy based results. For example, M-SVM showed the values of O-ACC, M-ACC,
B-ACC, AUC, CW-ACC male, and CW-ACC female as 89.2%, 89.2%, 89.2%, 96%,
89%, and 90%, respectively by considering PCA based selected FSs.

Table 4.48: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-1 dataset (male=864, and female=864 images) using different evaluation protocols
Method                            Classifier    PPV (%)    TPR (%)    F1 (%)    NPV (%)    TNR (%)    FPR (%)    FNR (%)
Entropy based selected features   L-SVM         73.8       80.7       77.1      82.4       75.9       24.0       19.2
                                  Q-SVM         79.1       82.8       80.9      83.5       80.0       19.9       17.1
                                  C-SVM         82.0       83.0       82.5      83.5       82.2       17.7       16.9
                                  M-SVM         77.6       82.3       79.9      83.3       78.8       21.1       17.6
PCA based selected features       L-SVM         74.8       86.6       80.3      88.4       77.8       22.1       13.3
                                  Q-SVM         74.4       87.9       80.6      89.8       77.8       22.1       12.0
                                  C-SVM         75.1       88.7       81.3      90.5       78.4       21.5       11.2
                                  M-SVM         89.6       88.9       89.3      88.8       89.6       10.3       11.0

It is obvious from the computed results that the combination of proposed features leads
to higher classification results when they are used in combination of M-SVM classifier
and PCA based selected features. The combination of traditional and deep features has
better discrimination potential to classify the gender image. Furthermore, numerous
features are less helpful, therefore, the selection of important features with PCA as
compared to entropy outperforms for pedestrian gender classification. The other applied
classifiers also showed adequate performance and validated against entropy and PCA
based selected FSs as presented in Table 4.49. For comparison, results are provided in
Figure 4.29, which show entropy and PCA based AUC and accuracies (O-ACC, M-
ACC, B-ACC, CW-ACC male, CW-ACC female) of multiple classifiers. According to
Figure 4.29, PCA based FSs exhibit better results as compared to entropy based FSs on
a balanced and SSS MIT-BROS-1 dataset. For example, the M-SVM classifier surpasses the entropy based results of the C-SVM classifier by 6.6%, 6.6%, 6.6%, 5%, 5%, and 8% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively, which verifies that the PCA based selected FSs combination enhances the performance as compared to entropy based FSs.

Table 4.49: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-1 dataset (male=864, and female=864 images) using different accuracies, AUC and time
Method                          Classifier    O-ACC (%)    M-ACC (%)    B-ACC (%)    AUC (%)    Time (sec)    CW-ACC M (%)    CW-ACC F (%)
Entropy based feature selection L-SVM         78.1         78.1         78.3         86         4.31          82              74
                                Q-SVM         81.3         81.3         81.4         90         4.95          84              79
                                C-SVM         82.6         82.6         82.6         91         4.92          84              82
                                M-SVM         80.4         80.4         80.5         88         4.21          83              78
PCA based feature selection     L-SVM         81.6         81.6         82.2         88         4.66          88              75
                                Q-SVM         82.1         82.1         82.8         92         4.57          90              74
                                C-SVM         82.8         82.8         83.6         93         3.69          91              75
                                M-SVM         89.2         89.2         89.2         96         5.46          89              90

Similarly, in parallel, other selected supervised methods also attained reliable results
using the same FSs and settings. Moreover, it can be noted that C-SVM and M-SVM
classifiers exhibit higher AUC than other classification methods using entropy and PCA
based FSs respectively. However, M-SVM classifier with PCA based FSs shows 96%
AUC being the highest one as depicted in Figure 4.30. All the corresponding AUCs are
also computed to show the performance of other selected classifiers as tabulated in
Table 4.49.

Figure 4.29: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced MIT-BROS-1 dataset

Figure 4.30: Best AUC for males and females on balanced MIT-BROS-1 dataset using
PCA based selected FSs

c) Performance Evaluation on Augmented MIT-BROS-2 Dataset: This work also performed experiments on another augmented balanced dataset named MIT-BROS-2
to verify the proposed approach performance. Through these experiments, contribution
of each entropy and PCA based selected feature vector is tested for gender prediction.
In case of entropy based selected features, the proposed approach showed superior
performance under Q-SVM and C-SVM classifiers as compared to other selected
classifiers. According to entropy based results, Q-SVM achieves higher TPR 84.1%,
NPV 85.3%, and FNR 15.8% whereas C-SVM exhibits 80.2% PPV, 81.7% F1, 80.9%
TNR, and 19.0% FPR in comparison with other applied classifiers. Similarly, PCA
based results using M-SVM classifier outperformed as compared to entropy based
results and also among other classifiers. For example, the trained classifier M-SVM
showed better results in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR with 89.7%,
90.9%, 90.3%, 91.0%, 89.8%, 10.1%, and 9.0% respectively. Results are also
calculated using other selected classifiers, which show satisfactory performance as
given in Table 4.50. Despite this, proposed approach is also evaluated by computing O-
ACC, M-ACC, B-ACC, AUC, time, and CW-ACC. The obtained outcomes against
these evaluation protocols are shown in Table 4.51. According to the results, C-SVM
classifier showed better results when entropy based selected features are applied.
Hence, C-SVM achieved O-ACC, M-ACC, B-ACC, AUC, and CW-ACC female with
values of 82.1%, 82.1%, 82.1%, 90%, and 80% respectively, whereas Q-SVM attained
better CW-ACC male with a value of 85%.

Table 4.50: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-2
dataset (male=864, and female=864 images) using different evaluation
protocols
Method                            Classifier    PPV (%)    TPR (%)    F1 (%)    NPV (%)    TNR (%)    FPR (%)    FNR (%)
Entropy based selected features   L-SVM         73.8       80.4       77.0      82.0       75.8       24.1       19.5
                                  Q-SVM         78.0       84.1       80.9      85.3       79.5       20.4       15.8
                                  C-SVM         80.2       83.3       81.7      84.0       80.9       19.0       16.6
                                  M-SVM         77.0       82.6       79.7      83.7       78.5       21.4       17.3
PCA based selected features       L-SVM         74.3       86.6       80.0      88.5       77.5       22.4       13.3
                                  Q-SVM         74.3       87.2       80.2      89.1       77.6       22.3       12.7
                                  C-SVM         72.8       89.0       80.1      91.0       77.0       22.2       10.9
                                  M-SVM         89.7       90.9       90.3      91.0       89.8       10.1       9.0

Table 4.51: Performance of proposed PGC-FSDTF method on the balanced MIT-BROS-2 dataset (male=864, and female=864 images) using accuracies, AUC, and time
Method                          Classifier    O-ACC (%)    M-ACC (%)    B-ACC (%)    AUC (%)    Time (sec)    CW-ACC M (%)    CW-ACC F (%)
Entropy based feature selection L-SVM         78.1         78.1         78.1         86         5.89          82              74
                                Q-SVM         81.6         81.6         81.8         90         5.05          85              78
                                C-SVM         82.1         82.1         82.1         90         5.08          84              80
                                M-SVM         80.4         80.4         80.5         88         4.11          84              77
PCA based feature selection     L-SVM         81.4         81.4         82.1         89         6.17          89              74
                                Q-SVM         81.7         81.7         82.4         91         4.48          89              74
                                C-SVM         81.9         81.9         83.0         92         4.62          91              73
                                M-SVM         90.3         90.3         90.4         97         5.43          91              90

But when PCA based selected features are used, M-SVM classifier gave superior
classification results as compared to other selected supervised methods and entropy
based results. Thus, M-SVM revealed O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female with values of 90.3%, 90.3%, 90.4%, 97%, 91%, and 90%
respectively by considering PCA based selected FSs. The other classification methods
including L-SVM, Q-SVM, and C-SVM produced comparatively lower results than M-SVM. From
Table 4.50 and Table 4.51, it is apparent from the calculated results that use of proposed
features assists in better classification specifically when M-SVM classifier is trained
with PCA based selected features. In this study, the combination of selected deep and
traditional features supports FFV with distinct features for accurate gender prediction,
which is the main reason for attaining outstanding results. The other applied classifiers
also displayed appropriate results and confirmed against entropy and PCA based
selected FSs as presented in Table 4.50 and Table 4.51. Hence, comparison of both
entropy and PCA based results of two classifiers confirms that proposed approach
achieved higher O-ACC. As presented in Figure 4.31, PCA based FSs exhibit better
results as compared to entropy based FSs on a balanced and SSS MIT-BROS-2 dataset.
It means that the proposed approach proved reliable results under both augmented
datasets while applying 1vs1 strategy for MIT-BROS-1 and 1vs4 strategy for MIT-
BROS-2 to include augmented images in these datasets for equal distribution of data.
For example, when the entropy based C-SVM classifier results are compared with the PCA based M-SVM classifier results, the M-SVM classifier is ahead by 8.2%, 8.2%, 8.3%, 7%, 7%, and 10% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively. M-SVM classifier has the highest 91% CW-ACC
male and 90% CW-ACC female using PCA based FSs when compared with other
classifiers, and entropy based results. The satisfactory improvements confirm that PCA
based selected FSs are more reliable for gender prediction as compared to entropy based
selected features.

Figure 4.31: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced MIT-BROS-2 dataset

Figure 4.32: Best AUC for males and females on balanced MIT-BROS-2 dataset using
PCA based selected FS

Similarly, in parallel, other selected classifiers also attained acceptable results using the
same selected FSs and settings. Moreover, it is observed that C-SVM and M-SVM
classifiers exhibit higher AUC than other classification methods using entropy and PCA
based FSs respectively. Hence, M-SVM classifier with PCA based FSs shows 97%
AUC being the highest one for both male and female classes as depicted in Figure 4.32.
All the corresponding AUCs are also calculated to show the performance of the other
selected classifiers as described previously in Table 4.51. Thus, it is determined that C-
SVM and M-SVM classifiers show improved AUC than other selected classifiers.

d) Performance Evaluation on Augmented MIT-BROS-3 Dataset: The proposed approach is also tested with a mixed strategy (1vs1 and 1vs4) based augmented dataset
named MIT-BROS-3. The selected classifiers are trained using entropy and PCA based
selected FSs, and acquired results are presented in Table 4.52 and Table 4.53. In view
of entropy based FSs, C-SVM and L-SVM showed higher PPV of 79.6% and TPR of
89.9% respectively, whereas Q-SVM classifier scores 82.9% F1, 80.7% TNR, and
19.2% FPR. In comparison, M-SVM classifier achieved better NPV with a score of
93.6%, and FNR with a score of 8.1%. With PCA based results, outstanding outcomes
of PPV, TPR, F1, NPV, TNR, FPR, and FNR are obtained with 90.0%, 97.2%, 93.5%,
97.5%, 90.6%, 9.3%, and 2.7% respectively than entropy based results and also among
other classifiers. The results verified that PCA based selected features worked much
better as compared to entropy based selected features. Moreover, the corresponding results
of all classifiers are also taken with both entropy and PCA based selected features and
tabulated in Table 4.52.

Table 4.52: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-3 dataset (male=600, and female=600 images) using different evaluation protocols
Method                            Classifier    PPV (%)    TPR (%)    F1 (%)    NPV (%)    TNR (%)    FPR (%)    FNR (%)
Entropy based selected features   L-SVM         73.3       89.9       80.8      91.8       77.4       22.5       10.0
                                  Q-SVM         78.8       87.4       82.9      88.6       80.7       19.2       12.5
                                  C-SVM         79.6       83.1       81.3      83.8       80.4       19.5       16.8
                                  M-SVM         71.3       91.8       80.3      93.6       76.5       23.4       8.1
PCA based selected features       L-SVM         69.1       95.1       80.1      96.5       75.7       24.2       4.8
                                  Q-SVM         64.5       96.7       77.4      97.5       73.3       26.6       3.2
                                  C-SVM         64.3       96.0       77.0      97.3       73.1       26.8       3.9
                                  M-SVM         90.0       97.2       93.5      97.5       90.6       9.3        2.7

For comparison, selected classifiers are trained and tested using entropy and PCA based
FSs for gender prediction with additional evaluation protocols of O-ACC, M-ACC, B-
ACC, AUC, time, and CW-ACC, and results are shown in Table 4.53. According to
these results, it is observed that the Q-SVM classifier exhibits better performance with
entropy based selected FSs such that it attained O-ACC, M-ACC, B-ACC, and AUC
with values of 83.7%, 83.7%, 84.2%, and 91% respectively, whereas M-SVM produced
a higher score of 91% against CW-ACC male with the best time of 2.4 sec. Moreover,
C-SVM has the best CW-ACC female with a value of 80%. In parallel, with PCA based
computed results, M-SVM classifier outperformed as compared to other applied
classifiers and entropy based results. Thus, M-SVM revealed O-ACC, M-ACC, B-
ACC, AUC, CW-ACC male, and CW-ACC female with values of 93.7%, 93.7%,
93.9%, 98%, 97%, and 90% respectively with best time 2.35 sec by considering PCA
based selected FSs. Again, the combination of proposed PCA based selected FSs, for
instance, max deep, average deep, and traditional features lead to higher classification
results. The selected feature combination has distinct properties that classify gender
image precisely. The other applied classifiers also showed acceptable performance and
validated against entropy and PCA based selected FSs as shown in Table 4.53. Figure
4.33 shows entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC,
CW-ACC male, CW-ACC female) of different classifiers verifying that PCA based FSs
produced better results than entropy based FSs on a balanced and SSS MIT-BROS-3 dataset. For example, the PCA based M-SVM classifier results exceed the best entropy based (Q-SVM) results by 10.0%, 10.0%, 9.7%, 7%, 8%, and 11% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively.

Table 4.53: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-3 dataset (male=600, and female=600 images) using different accuracies, AUC, and time
Method                          Classifier    O-ACC (%)    M-ACC (%)    B-ACC (%)    AUC (%)    Time (sec)    CW-ACC M (%)    CW-ACC F (%)
Entropy based feature selection L-SVM         82.6         82.6         83.7         91         4.31          92              73
                                Q-SVM         83.7         83.7         84.2         91         4.29          89              79
                                C-SVM         81.7         81.7         81.8         90         4.5           84              80
                                M-SVM         82.5         82.5         84.2         91         2.4           94              71
PCA based feature selection     L-SVM         82.8         82.8         85.4         91         2.64          97              69
                                Q-SVM         81.2         81.2         85.1         93         2.45          97              95
                                C-SVM         80.8         80.8         84.6         93         2.45          97              64
                                M-SVM         93.7         93.7         93.9         98         2.35          97              90

Also, with PCA based FSs, the M-SVM classifier attains 97% CW-ACC male together with 90% CW-ACC female, performing better overall than L-SVM and Q-SVM. These improvements confirm that the PCA based selected FSs combination enhances performance compared to the entropy based one.

Figure 4.33: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced MIT-BROS-3 dataset

Figure 4.34: Best AUC for males and females on balanced MIT-BROS-3 dataset using
PCA based selected features subsets

The other selected supervised methods likewise attain reliable results with the same FSs and settings. Among them, the Q-SVM and M-SVM classifiers exhibit higher AUC than the other classification methods when using entropy and PCA based FSs respectively; the corresponding AUCs of all selected classifiers are listed in Table 4.53 and the best curves are shown in Figure 4.34. Overall, the M-SVM classifier with PCA based FSs reaches the highest AUC of 98% for both the male and female classes, exceeding the other classifiers and the entropy based results.
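The per-class AUC values reported in Figure 4.34 and in the later AUC figures can be reproduced from classifier scores with a standard ROC analysis. The sketch below uses scikit-learn on synthetic data as a stand-in for the fused, PCA-selected feature vectors; the classifier settings are illustrative rather than those of the proposed method, and for a binary problem the ROC AUC is the same whichever class is taken as positive.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score, roc_curve

    # synthetic stand-in for the fused, PCA-selected features (0 = female, 1 = male)
    X, y = make_classification(n_samples=1200, n_features=60, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]            # score for the "male" class

    auc_male = roc_auc_score(y_te, scores)            # AUC with class 1 as positive
    auc_female = roc_auc_score(1 - y_te, 1 - scores)  # identical value with class 0 as positive
    fpr, tpr, _ = roc_curve(y_te, scores)             # points for a ROC plot such as Figure 4.34
    print(round(auc_male, 3), round(auc_female, 3))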

4.4.3.2 Performance Evaluation on PKU-Reid and Augmented PKU-Reid Datasets

In this research work, experiments are performed on the customized imbalanced and augmented balanced datasets PKU-Reid-IB (imbalanced), PKU-Reid-BROS-1 (balanced), PKU-Reid-BROS-2 (balanced), and PKU-Reid-BROS-3 (balanced) to test the proposed approach using the selected classifiers and evaluation protocols. Results are computed on these datasets individually and given in Tables (4.54)–(4.61). The purpose of this testing is two-fold: 1) to cross-validate the proposed approach on another imbalanced dataset, and 2) to observe the improvement in terms of accuracies and AUC. To the best of our knowledge, PKU-Reid has not previously been investigated for gender classification and remains a challenging dataset due to the large variation in pedestrian appearance views
under an unconstrained environment. In this study, the PKU-Reid dataset is labeled as the imbalanced PKU-Reid-IB dataset and is considered for pedestrian gender classification for the first time. The experimental results along with a detailed discussion on the datasets mentioned in this section are given in the subsequent sections.

a) Performance Evaluation on PKU-Reid-IB Dataset: The proposed method is evaluated on another imbalanced and SSS dataset, PKU-Reid-IB, using the same settings. The entropy and PCA based results are shown in Table 4.54. With the entropy based results, the Q-SVM classifier proves the better classifier for accurate gender prediction in comparison with the other applied classifiers, with PPV, TPR, F1, NPV, TNR, FPR, and FNR of 81.3%, 88.1%, 84.2%, 93.2%, 88.5%, 11.3%, and 11.8% respectively. The corresponding results of the selected classifiers are also calculated for the evaluation of the proposed method. With PCA based selected features, M-SVM shows better performance than the other selected classifiers, with PPV, TPR, F1, NPV, TNR, FPR, and FNR of 81.5%, 90.9%, 85.9%, 94.9%, 89.1%, 10.8%, and 9.0% respectively. The remaining classifiers also exhibit satisfactory performance; for example, the PPV of L-SVM and Q-SVM is lower than that of M-SVM by 9.7% and 4.6% respectively. Both the entropy and PCA based outcomes in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR are shown in Table 4.54.
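For context, the four classifiers compared in these experiments are usually read as linear, quadratic, cubic, and medium-Gaussian kernel SVMs (L-SVM, Q-SVM, C-SVM, and M-SVM). A rough scikit-learn analogue of training and validating them on a selected feature subset is sketched below; the kernel parameters, the synthetic data, and the 10-fold validation are assumptions and not the exact configuration used in this work.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # synthetic stand-in for an entropy/PCA selected feature subset with gender labels
    X, y = make_classification(n_samples=1000, n_features=80, random_state=1)

    svm_variants = {
        "L-SVM": SVC(kernel="linear"),
        "Q-SVM": SVC(kernel="poly", degree=2),
        "C-SVM": SVC(kernel="poly", degree=3),
        "M-SVM": SVC(kernel="rbf", gamma="scale"),   # medium Gaussian kernel (assumed gamma)
    }

    for name, svm in svm_variants.items():
        model = make_pipeline(StandardScaler(), svm)        # scale features, then train the SVM
        acc = cross_val_score(model, X, y, cv=10).mean()    # 10-fold cross-validated accuracy
        print(f"{name}: {100 * acc:.1f}% accuracy")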

Table 4.54: Performance of proposed PGC-FSDTF method on imbalanced PKU-Reid-IB dataset (male=1120 and female=704 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       74.7    85.1    79.6   91.8    85.2    14.7    14.8
                                  Q-SVM       81.3    88.1    84.2   93.2    88.5    11.3    11.8
                                  C-SVM       81.1    86.1    83.7   91.7    88.5    11.4    13.8
                                  M-SVM       76.5    87.5    81.6   93.1    86.3    13.6    12.5
PCA based selected features       L-SVM       71.8    86.0    78.3   92.6    83.9    16.1    13.9
                                  Q-SVM       76.9    88.5    82.3   93.7    86.6    13.3    11.4
                                  C-SVM       79.6    89.1    84.1   93.9    88.0    11.9    10.8
                                  M-SVM       81.5    90.9    85.9   94.9    89.1    10.8    9.0

Results are verified using the additional evaluation protocols, namely the accuracies (O-ACC, M-ACC, and B-ACC), AUC, time, and CW-ACC, computed with entropy and PCA based
FSs, as presented in Table 4.55, in which the Q-SVM and M-SVM classifiers outperform the other applied classifiers. With the entropy based results, the Q-SVM classifier yields the best O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 88.3%, 87.0%, 88.3%, 95%, 93%, and 81% respectively among the applied classifiers.

The parallel results of the applied classifiers are also obtained to see the significance of the proposed approach. Likewise, results are acquired using PCA based FSs, where M-SVM outperforms the other selected classifiers, showing better O-ACC, M-ACC, B-ACC, AUC, time, CW-ACC male, and CW-ACC female of 89.7%, 88.2%, 90.0%, 96%, 1.92 sec, 95%, and 93% respectively. The corresponding results of the selected classifiers are also calculated for the evaluation of the proposed approach; the better performance under the respective evaluation protocols strengthens the working of the proposed technique when PCA based FSs are utilized. Both entropy and PCA based experimental results are depicted in Figure 4.35 with the different accuracies and AUCs for comparison of both feature selection methods. PCA based FSs show improved results compared to entropy based FSs on the imbalanced and SSS PKU-Reid-IB dataset. For example, using PCA based FSs, the M-SVM classifier results exceed the entropy based Q-SVM results by 1.4%, 1.2%, 1.7%, 1%, 2%, and 12% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively.

Table 4.55: Performance of proposed PGC-FSDTF method on imbalanced PKU-Reid-IB dataset (male=1120 and female=704 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       85.2      83.2      85.1      92      3.22       92           75
                                   Q-SVM       88.3      87.0      88.3      95      3.20       93           81
                                   C-SVM       87.7      86.5      86.5      95      3.42       92           81
                                   M-SVM       86.7      84.8      86.9      94      3.25       93           77
PCA based features selection       L-SVM       84.6      82.2      85.0      90      2.35       93           72
                                   Q-SVM       87.2      85.3      87.6      95      2.43       94           77
                                   C-SVM       88.4      86.8      88.6      95      2.75       94           80
                                   M-SVM       89.7      88.2      90.0      96      1.92       95           93

It is evident from the computed results that the entropy and PCA based selected features perform almost equally well, with only slight differences in O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male, as shown in Table 4.55. The small improvements confirm that PCA based feature selection followed by serial fusion enhances performance compared to entropy based FSs. The remaining classification methods likewise produce good results using the same FSs and settings. Further, the Q-SVM and M-SVM classifiers exhibit higher AUC than the other supervised methods on the entropy and PCA based FSs respectively, and the M-SVM classifier with PCA based FSs reveals the best AUC of 96%, as shown in Figure 4.36. An important factor is B-ACC, which reaches 88.3% with the Q-SVM classifier (entropy based) and 90% with the M-SVM classifier (PCA based), acceptable values even for an imbalanced data distribution. Since the PKU-Reid-IB dataset is imbalanced and challenging due to variations in viewing angle, environment conditions, low contrast, etc., the good performance under the B-ACC metric strengthens the working of the proposed method.
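The value of B-ACC on an imbalanced split can be seen with a small made-up example: a predictor that favours the majority (male) class can still post a high overall accuracy, while the balanced accuracy exposes the weak minority-class performance. The numbers below are hypothetical and only mirror the 1120/704 class sizes of PKU-Reid-IB.

    import numpy as np

    # hypothetical imbalanced test set: 1120 male (1) and 704 female (0) samples
    y_true = np.array([1] * 1120 + [0] * 704)
    # hypothetical biased predictor: every male correct, only half of the females correct
    y_pred = np.array([1] * 1120 + [0] * 352 + [1] * 352)

    o_acc = np.mean(y_true == y_pred)                                # overall accuracy
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]    # class-wise accuracies
    b_acc = np.mean(recalls)                                         # balanced accuracy

    print(f"O-ACC = {100 * o_acc:.1f}%, B-ACC = {100 * b_acc:.1f}%")  # about 80.7% vs 75.0%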

Figure 4.35: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on imbalanced PKU-Reid-IB dataset

Figure 4.36: Best AUC for males and females on imbalanced PKU-Reid-IB dataset
using PCA based selected FSs

b) Performance Evaluation on Augmented PKU-Reid-BROS-1 Dataset: Experiments are conducted on the PKU-Reid-BROS-1 dataset to test the proposed approach using the same classifiers, evaluation protocols, and settings. The entropy and PCA based results are presented in Table 4.56 and Table 4.57. These experiments examine the impact of both entropy and PCA based selected FSs on an augmented, balanced, and SSS PKU-Reid-BROS-1 dataset. With entropy based FSs, the proposed approach performs best under the C-SVM classifier, which attains PPV, TPR, F1, TNR, FPR, and FNR of 90.2%, 89.0%, 89.5%, 90.1%, 9.9%, and 11.1% respectively, while the Q-SVM classifier achieves the higher NPV of 89.2%. In parallel, the PCA based results using the M-SVM classifier outperform the entropy based results and the other classifiers, with PPV, TPR, F1, NPV, TNR, FPR, and FNR of 89.3%, 93.5%, 91.4%, 93.8%, 89.8%, 10.1%, and 6.4% respectively. The corresponding results of all classifiers are also calculated with both entropy and PCA based selected features and presented in Table 4.56. Additional evaluation protocols, namely O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, are also used for the evaluation of the presented approach, with the calculated values tabulated in Table 4.57. From this table, it is clear that the C-SVM classifier performs best with the entropy based results, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 89.5%, 89.5%, 89.5%, 96%, 89%, and 90% respectively using entropy based selected
FSs. In comparison, the M-SVM classifier takes the least time of 5.05 sec among the applied classifiers.

Table 4.56: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-BROS-1 dataset (male=1300 and female=1300 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       83.6    85.3    84.4   85.6    83.9    16.0    16.6
                                  Q-SVM       87.7    89.0    88.4   89.2    87.9    12.1    10.9
                                  C-SVM       90.2    89.0    89.5   88.7    90.1    9.9     11.1
                                  M-SVM       84.9    86.6    85.7   86.9    85.2    14.7    13.3
PCA based selected features       L-SVM       84.3    90.8    87.5   91.5    85.4    14.5    9.1
                                  Q-SVM       87.4    93.3    90.3   93.7    88.2    11.7    6.5
                                  C-SVM       89.3    93.5    91.4   93.8    89.8    10.1    6.4
                                  M-SVM       89.3    93.0    91.2   93.3    89.9    10.1    6.9

With the PCA based results, the C-SVM classifier gives the best overall performance compared to the other applied classifiers and the entropy based results: it displays O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 91.6%, 91.6%, 91.6%, 97%, 94%, and 89% respectively using the PCA based selected feature subsets. The M-SVM classifier nevertheless achieves the higher CW-ACC female of 90%, and Q-SVM requires the least time of 12.61 sec among the PCA based classifiers. Again, the computed results show that the combination of the proposed PCA based FSs leads to higher classification performance under the C-SVM classifier.

Table 4.57: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-BROS-1 dataset (male=1300 and female=1300 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       84.6      84.6      84.6      92      5.06       86           84
                                   Q-SVM       88.5      88.5      88.5      95      6.61       89           88
                                   C-SVM       89.5      89.5      89.5      96      7.05       89           90
                                   M-SVM       85.9      85.9      85.9      94      5.05       87           85
PCA based features selection       L-SVM       87.9      87.9      88.1      95      13.05      92           84
                                   Q-SVM       90.6      90.6      90.7      97      12.61      94           87
                                   C-SVM       91.6      91.6      91.6      97      15.25      94           89
                                   M-SVM       91.4      91.4      91.4      97      15.67      93           90

The combination of traditional and deep features has better discrimination potential for classifying gender images even on an augmented balanced dataset. Furthermore, since a very large number of features is not necessarily helpful, selecting the important features with PCA outperforms entropy based selection for pedestrian gender classification. The other applied classifiers also give adequate performance with both entropy and PCA based selected FSs, as shown in Table 4.57. For comparison, Figure 4.37 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, CW-ACC female) of the different classifiers.
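A minimal sketch of the PCA route is shown below using scikit-learn, where the fused feature matrix is reduced to the components explaining a fixed share of the variance. The matrix dimensions and the 95% variance criterion are assumptions for illustration; the exact PCA based selection used in the proposed method may retain a different number of features.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    fused = rng.normal(size=(1200, 680))          # stand-in for fused deep + traditional features

    scaled = StandardScaler().fit_transform(fused)
    pca = PCA(n_components=0.95)                  # keep components explaining 95% of the variance
    selected = pca.fit_transform(scaled)

    print(fused.shape, "->", selected.shape)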

According to Figure 4.37, PCA based FSs give better results than entropy based FSs on the balanced and SSS PKU-Reid-BROS-1 dataset. For example, the PCA based C-SVM results exceed the entropy based C-SVM results by 2.1%, 2.1%, 2.1%, 1%, and 5% for O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male respectively, whereas the entropy based C-SVM results remain better in terms of time and CW-ACC female, with 7.05 sec and 90% respectively. With PCA based selected FSs, C-SVM produces the better CW-ACC for the male class (94%) and M-SVM for the female class (90%). These improvements verify that the PCA based selected FSs combination enhances O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male, whereas entropy based FSs perform better in terms of time and CW-ACC female.

Figure 4.37: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on the balanced PKU-Reid-BROS-1 dataset

Figure 4.38: Best AUC for males and females on balanced PKU-Reid-BROS-1 dataset
using PCA based selected FSs

Hence, Figure 4.37 depicts the best entropy and PCA based results of the two classifiers with the higher O-ACC. The other selected classifiers likewise attain reliable results using the same FSs and settings. Moreover, the C-SVM classifier exhibits higher AUC than the other classification methods with both entropy and PCA based FSs; with PCA based FSs it reaches a superior AUC of 97%, as presented in Figure 4.38. The corresponding AUCs of all selected classifiers are also calculated, as previously shown in Table 4.57. Thus, the C-SVM classifier gives improved AUC over the other selected classifiers, and overall the PCA based results are better than the entropy based results.

c) Performance Evaluation on Augmented PKU-Reid-BROS-2 Dataset: Experiments are also carried out on another augmented balanced dataset, PKU-Reid-BROS-2, to verify the performance of the proposed approach. These experiments test the contribution of the entropy and PCA based selected feature vectors for gender prediction when the data is equally distributed. With entropy based selected features, the proposed technique performs best under the C-SVM classifier, which obtains PPV 86.9%, TPR 87.7%, F1 87.3%, NPV 87.8%, TNR 87.0%, FPR 12.9%, and FNR 12.2%. Similarly, the PCA based results using the M-SVM classifier surpass the entropy based results and the other classifiers, achieving PPV, TPR, F1, NPV, TNR, FPR, and FNR of 91.2%, 92.9%, 92.1%, 93.1%, 91.3%, 8.6%, and 7.1% respectively.
Results for the other selected classifiers are also calculated and show satisfactory performance, as tabulated in Table 4.58.

Table 4.58: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-2 dataset (male=1300 and female=1300 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       83.3    83.8    83.5   84.0    83.4    16.5    16.1
                                  Q-SVM       85.3    86.2    85.7   86.3    85.4    14.5    13.7
                                  C-SVM       86.9    87.7    87.3   87.8    87.0    12.9    12.2
                                  M-SVM       85.6    85.6    85.6   85.6    85.6    14.3    14.3
PCA based selected features       L-SVM       85.1    88.8    86.9   89.3    85.7    14.2    11.1
                                  Q-SVM       86.3    91.2    88.6   91.6    87.1    12.9    8.7
                                  C-SVM       86.8    92.0    89.3   92.5    87.5    12.4    7.9
                                  M-SVM       91.2    92.9    92.1   93.1    91.3    8.6     7.1

In addition, the proposed technique is evaluated using the additional evaluation protocols of O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC; the obtained outcomes are shown in Table 4.59.

Table 4.59: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-2 dataset (male=1300 and female=1300 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       83.6      83.6      83.6      91      3.35       84           83
                                   Q-SVM       85.8      85.8      85.8      93      4.75       86           85
                                   C-SVM       87.3      87.3      87.3      94      5.35       88           87
                                   M-SVM       85.6      85.6      85.6      93      3.04       86           86
PCA based features selection       L-SVM       87.2      87.2      87.2      94      11.72      91           83
                                   Q-SVM       89.0      89.0      89.1      96      10.41      92           86
                                   C-SVM       89.6      89.6      89.8      96      11.51      93           87
                                   M-SVM       92.2      92.2      92.2      97      18.22      93           91

According to these results, the C-SVM classifier again exhibits the best entropy based performance, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 87.3%, 87.3%, 87.3%, 94%, 88%, and 87% respectively, while the M-SVM classifier takes the least time of 3.04 sec among the selected classifiers. When the PCA based selected features are
applied, the M-SVM classifier shows superior classification results in comparison with the other selected supervised methods and the entropy based results: it reaches O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 92.2%, 92.2%, 92.2%, 97%, 93%, and 91% respectively with the PCA based selected FSs. According to the results in Table 4.58 and Table 4.59, it is apparent that the proposed FSs assist in better classification, especially when the M-SVM classifier is trained on the PCA based selected FSs; the entropy based selected FSs yield lower results than the PCA based ones. It is therefore evident that the combination of deep and traditional features provides the FFV with distinct features for accurate gender prediction. The other applied classifiers give appropriate results with both entropy and PCA based selected FSs, as presented in Table 4.59. As shown in Figure 4.39, the proposed approach achieves a high O-ACC; this figure gives an insight into the different accuracies and the AUC for comparison of both feature selection methods, and PCA based FSs exhibit better results than entropy based FSs on the balanced and SSS PKU-Reid-BROS-2 dataset. The experiments confirm that the proposed approach produces reliable outcomes on the augmented dataset PKU-Reid-BROS-2, as previously observed on the augmented dataset PKU-Reid-BROS-1.

Figure 4.39: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced PKU-Reid-BROS-2 dataset

Comparing the PCA based M-SVM results with the entropy based C-SVM results, the M-SVM outcomes are higher by 4.9%, 4.9%, 4.9%, 3%, 5%, and 4% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively. These improvements again indicate that PCA based selected FSs are more reliable for gender prediction than entropy based selected features.

In parallel, the other selected classifiers also achieve acceptable results using the same selected FSs and settings. Further, the C-SVM and M-SVM classifiers depict higher AUC than the other classification methods using entropy and PCA based FSs respectively, and the M-SVM classifier with PCA based FSs shows the highest AUC of 97%, as shown in Figure 4.40. The corresponding AUCs of all selected classifiers are also calculated, as previously described in Table 4.59. Therefore, the C-SVM and M-SVM classifiers are preferred over the other selected classifiers owing to their higher AUC.

Figure 4.40: Best AUC for males and females on balanced PKU-Reid-BROS-2 dataset
using PCA based selected FSs

d) Performance Evaluation on Augmented PKU-Reid-BROS-3 Dataset: The proposed approach is evaluated on one more mixed-strategy (1vs and 1vs4) augmented dataset, named PKU-Reid-BROS-3. The selected supervised methods are trained using the entropy and PCA based selected FSs, and the results acquired for performance evaluation are presented in Table 4.60 and Table 4.61. With entropy based FSs, the Q-SVM classifier shows the higher TPR of 90.9%, NPV of 91.2%, and FNR of
9.1%, whereas the C-SVM classifier attains the superior 89.1% PPV, 89.8% F1, 89.3% TNR, and 10.6% FPR. With the PCA based outcomes, the best results of PPV, TPR, F1, NPV, TNR, FPR, and FNR reach 88.9%, 94.8%, 91.7%, 95.2%, 89.6%, 10.4%, and 5.1%, exceeding the entropy based results and the other selected classifiers. The achieved results reveal the better performance of PCA based selected features compared to entropy based selected features. Moreover, the corresponding results of the other classifiers are also computed with both entropy and PCA based selected features and tabulated in Table 4.60.

Table 4.60: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-BROS-3 dataset (male=1120 and female=1120 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       84.0    87.5    85.7   88.1    84.6    15.3    12.4
                                  Q-SVM       87.6    90.9    89.2   91.2    88.1    11.8    9.1
                                  C-SVM       89.1    90.4    89.8   90.6    89.3    10.6    9.5
                                  M-SVM       85.8    89.4    87.6   89.9    86.3    13.6    10.5
PCA based selected features       L-SVM       82.8    93.1    87.6   93.8    84.5    15.4    6.9
                                  Q-SVM       83.6    93.8    88.4   94.5    85.2    14.7    6.1
                                  C-SVM       85.3    94.2    89.5   94.7    86.6    13.3    5.8
                                  M-SVM       88.9    94.8    91.7   95.2    89.6    10.4    5.1

For comparison of the two feature selection methods, the selected classifiers are trained and tested using entropy and PCA based FSs for gender prediction with the additional evaluation protocols of O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC; the results are shown in Table 4.61. The obtained results verify that the C-SVM classifier performs best with the entropy based selected FSs, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 90.0%, 90.0%, 90.1%, 97%, 91%, and 89% respectively. With the PCA based results, the M-SVM classifier outperforms the other applied classifiers and the entropy based results, reaching O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 92.1%, 92.1%, 92.2%, 98%, 95%, and 89% respectively with the best time of 7.78 sec using the PCA based selected FSs. Again, the combination of the proposed PCA based selected FSs, i.e., max deep, average deep, and traditional features, produces higher classification results; the selected feature combination has distinct properties that classify gender images precisely on the customized datasets as well.
The other applied classifiers also give acceptable performance with both entropy and PCA based selected FSs.

Table 4.61: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-3 dataset (male=1120 and female=1120 images) using different accuracies, AUC, and time

Method                                Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy controlled features selection  L-SVM      86.0      82.0      86.1      94      2.71       88           84
                                       Q-SVM      89.4      89.4      89.5      96      3.22       91           88
                                       C-SVM      90.0      90.0      90.1      97      3.63       91           89
                                       M-SVM      87.8      87.8      87.9      95      2.65       90           86
PCA controlled features selection      L-SVM      88.3      88.3      88.8      95      8.22       94           83
                                       Q-SVM      89.1      89.1      89.5      97      7.79       95           84
                                       C-SVM      90.0      90.0      90.3      97      8.82       95           85
                                       M-SVM      92.1      92.1      92.2      98      7.78       95           89

The best entropy and PCA based results of the two classifiers with the superior O-ACC are presented in Figure 4.41, which gives an insight into the different accuracies and the AUC for comparison of both feature selection methods. According to Figure 4.41, PCA based FSs give better results than entropy based FSs on the balanced and SSS PKU-Reid-BROS-3 dataset. For example, the PCA based M-SVM results exceed the entropy based C-SVM results by 2.1%, 2.1%, 2.1%, and 1% for O-ACC, M-ACC, B-ACC, and AUC respectively, together with a higher CW-ACC male, while the CW-ACC female of 89% is the same for both. These improvements confirm that the PCA based selected FSs combination slightly improves performance compared to the entropy based results.
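For contrast with the PCA route, an entropy-style selection can be sketched as ranking features by their mutual information (an entropy-derived quantity) with the gender label and keeping the top-scoring subset. This criterion and the subset size below are assumptions for illustration and not necessarily the exact entropy measure used in the proposed method.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    # synthetic stand-in for a fused deep + traditional feature vector with gender labels
    X, y = make_classification(n_samples=800, n_features=300, n_informative=40, random_state=2)

    scores = mutual_info_classif(X, y, random_state=2)   # entropy-based relevance of each feature
    top_k = 120                                          # illustrative subset size
    keep = np.argsort(scores)[::-1][:top_k]              # indices of the most informative features
    X_selected = X[:, keep]

    print(X.shape, "->", X_selected.shape)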

In parallel, the other classification methods also attain reliable results with the same selected FSs and settings. Further, the C-SVM and M-SVM classifiers exhibit higher AUC than the other classification methods using entropy and PCA based FSs respectively; the corresponding AUCs of all selected classifiers are computed as described previously in Table 4.61. It is concluded that the C-SVM and M-SVM classifiers show improved AUC over the other selected classifiers. In particular, the M-SVM classifier with PCA based FSs shows the highest
AUC of 98% for both the male and female classes, exceeding the other selected classifiers and the entropy based results. Only the best AUC is shown in Figure 4.42.

Figure 4.41: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced PKU-Reid-BROS-3 dataset

Figure 4.42: Best AUC for males and females on balanced PKU-Reid-BROS-3 dataset
using PCA based selected FSs

4.4.3.3 Performance Evaluation on PETA-SSS and VIPeR-SSS Datasets

This section covers the performance evaluation of the proposed method on three customized balanced and SSS datasets: PETA-SSS-1, PETA-SSS-2, and VIPeR-SSS. The entropy and PCA based selected features are supplied to multiple classifiers for training.
Results are then computed on the test samples and tabulated using the different evaluation protocols, as presented in Tables (4.62)–(4.65). The purpose of these experiments is to observe the performance of the proposed method on balanced and SSS (B-SSS) challenging datasets. The following subsections cover the evaluation of these datasets.

a) Performance Evaluation on PETA-SSS-1 Dataset: Experiments are reported on a SSS dataset called PETA-SSS-1, which includes sample images of the more challenging PETA dataset. The fused entropy and PCA based selected feature (deep and traditional) vectors are individually supplied to multiple classifiers to compute results under the evaluation protocols PPV, TPR, F1, NPV, TNR, FPR, and FNR, as shown in Table 4.62. According to the achieved results, the C-SVM and M-SVM classifiers are overall the strongest classifiers for accurate gender prediction. For instance, with the entropy based results the proposed approach acquires the best PPV of 84.9%, TNR of 84.6%, and FPR of 15.3% with the C-SVM classifier, and the best TPR of 92.1%, NPV of 94.2%, and FNR of 7.8% with the L-SVM classifier; these values are better than those of the other selected classifiers.

Similarly, with the PCA based results, the best values of PPV 88.5%, TPR 89.0%, F1 88.7%, TNR 88.6%, FPR 11.3%, and FNR 10.9% are obtained with the M-SVM classifier, while the best NPV of 89.5% is obtained with the C-SVM classifier. It is apparent from the results on the balanced PETA-SSS-1 dataset that the PCA based selected FSs outperform the entropy based selected FSs; for example, the PCA based FPR and FNR of 11.3% and 10.9% are better than the corresponding entropy based rates and those of the other selected classifiers. Thus, the proposed approach achieves its best PPV, TPR, F1, etc. using PCA based selected features. Results are also computed under the other evaluation protocols, including O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC; the calculated figures in Table 4.63 show that C-SVM and M-SVM outperform the other classifiers. With the entropy based results, the proposed approach attains O-ACC 84.0%, M-ACC 84.0%, AUC 91%, and CW-ACC female 83% with the C-SVM classifier, B-ACC 85.5% with the M-SVM classifier, and CW-ACC male 94% with the L-SVM classifier. Similarly, with the PCA based results, superior values of O-ACC 88.3%, M-ACC 88.3%, B-ACC 88.8%, AUC 95%, and CW-ACC
female 88% are obtained with the M-SVM classifier in the least time of 4.52 sec, while the best CW-ACC male of 91% is obtained with the C-SVM classifier.

Table 4.62: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-1 dataset (male=864 and female=864 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       68.8    92.1    78.4   94.2    74.8    25.1    7.8
                                  Q-SVM       80.3    82.0    81.1   82.4    80.7    19.2    17.9
                                  C-SVM       84.9    83.5    84.2   83.2    84.6    15.3    16.4
                                  M-SVM       79.1    84.6    81.8   85.6    80.4    19.5    15.3
PCA based selected features       L-SVM       77.1    86.7    81.6   88.1    79.4    20.5    13.2
                                  Q-SVM       76.8    85.7    81.0   87.2    79.0    20.9    14.2
                                  C-SVM       75.2    87.8    81.0   89.5    78.3    21.6    12.1
                                  M-SVM       88.5    89.0    88.7   89.1    88.6    11.3    10.9

Table 4.63: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-1 dataset (male=864 and female=864 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       81.2      81.2      83.5      90      12.7       94           68
                                   Q-SVM       81.3      81.3      81.3      90      5.95       84           78
                                   C-SVM       84.0      84.0      84.0      91      5.71       85           83
                                   M-SVM       82.4      82.4      85.5      88      5.31       84           81
PCA based features selection       L-SVM       82.6      82.6      83.1      92      4.61       88           77
                                   Q-SVM       82.0      82.0      82.4      91      4.71       90           74
                                   C-SVM       82.4      82.4      83.1      92      4.91       91           74
                                   M-SVM       88.3      88.3      88.8      95      4.52       89           88

The results shown in Figure 4.43 provide a better understanding of the improvements obtained with PCA based FSs over entropy based FSs on the PETA-SSS-1 dataset. For example, improvements of 4.3%, 4.3%, 4.8%, 4%, 4%, and 5% are recorded for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively with the M-SVM classifier when compared with the entropy based results of the C-SVM classifier. This improvement proves that PCA controlled FSs boost performance compared to entropy controlled FSs. Likewise, the corresponding classification methods also produce better results using the same settings and FSs.

Figure 4.43: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on the balanced PETA-SSS-1 dataset

Figure 4.44: Best AUC for males and females on the balanced PETA-SSS-1 dataset
using PCA based selected FSs

Moreover, Figure 4.44 depicts the best AUC of 95%, achieved using PCA based FSs in the proposed approach. The corresponding AUCs of all other classifiers are also computed for performance evaluation, as presented previously in Table 4.63. The C-SVM classifier shows the highest entropy based AUC of 91%, and the M-SVM classifier reaches 95% with the PCA based results. Thus, it is concluded that the M-SVM classifier yields the best AUC with PCA based selected FSs.

b) Performance Evaluation on PETA-SSS-2 Dataset: To observe the robustness of the proposed approach, additional investigations are presented on another balanced dataset, PETA-SSS-2, which has more samples than PETA-SSS-1. The same selected classifiers, evaluation protocols, and settings are used in these investigations, and the entropy and PCA based results are computed and tabulated for discussion. According to the Q-SVM results shown in Table 4.64, the entropy feature selection method achieves PPV 84.2%, TPR 84.8%, F1 84.5%, NPV 84.9%, TNR 84.3%, FPR 15.6%, and FNR 15.1%, which are the best entropy based rates among the classifiers and even exceed the best PCA based PPV, TNR, and FPR. Nevertheless, when the PCA based selected features are supplied for training and testing, the M-SVM classifier outperforms L-SVM, Q-SVM, and C-SVM with PPV 80.3%, TPR 94.6%, F1 86.9%, NPV 95.4%, TNR 82.9%, FPR 17.0%, and FNR 5.3%. The corresponding results of all classifiers are also computed with both entropy and PCA based selected features and presented in Table 4.64.

Table 4.64: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-2 dataset (male=1300 and female=1300 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       81.9    81.3    81.6   81.2    81.7    18.2    18.6
                                  Q-SVM       84.2    84.8    84.5   84.9    84.3    15.6    15.1
                                  C-SVM       83.1    83.2    83.1   83.2    83.1    16.8    16.7
                                  M-SVM       82.6    82.5    82.5   82.5    82.6    17.3    17.4
PCA based selected features       L-SVM       74.4    90.2    81.6   92.0    78.2    21.7    9.7
                                  Q-SVM       70.0    90.8    79.0   93.0    75.6    24.4    9.1
                                  C-SVM       71.2    90.1    79.5   92.1    76.1    23.8    9.9
                                  M-SVM       80.3    94.6    86.9   95.4    82.9    17.0    5.3

Other evaluation protocols are also utilized to check the performance of the proposed approach, including O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC. The computed scores are shown in Table 4.65, where the Q-SVM classifier obtains the higher values of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 84.5%, 84.5%, 84.5%, 92%, 85%, and 84% when the entropy based selected features are used, whereas M-SVM requires the least time of 6.66 sec among the classifiers. In parallel, M-SVM attains O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of
87.9%, 87.9%, 88.8%, 95%, 95%, and 80% respectively using the PCA based selected FSs. The proposed combination of selected features performs well in the SSS setting even when the samples are collected from the more challenging PETA dataset, classifying gender images precisely when tested on PETA images. The other applied classifiers also show adequate performance with both entropy and PCA based selected FSs, as shown in Table 4.65.

Table 4.65: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-2 dataset (male=1300 and female=1300 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       81.5      81.5      81.5      90      7.68       81           82
                                   Q-SVM       84.5      84.5      84.5      92      9.19       85           84
                                   C-SVM       83.2      83.2      83.2      91      9.98       83           83
                                   M-SVM       82.6      82.6      82.6      91      6.66       83           83
PCA based features selection       L-SVM       83.2      83.2      84.2      93      8.49       92           74
                                   Q-SVM       81.4      81.4      83.2      92      7.85       93           70
                                   C-SVM       81.6      81.6      83.1      91      8.03       91           71
                                   M-SVM       87.9      87.9      88.8      95      6.84       95           80

For comparison, Figure 4.45 depicts the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, CW-ACC female) of multiple classifiers. PCA based FSs give higher results than entropy based FSs on the more challenging balanced and SSS PETA-SSS-2 dataset. For example, the PCA based M-SVM results exceed the entropy based Q-SVM results by 3.4%, 3.4%, 4.3%, 3%, and 10% for O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male respectively, although the CW-ACC female is 4% lower. These noteworthy improvements prove that the PCA based selected FSs combination enhances overall performance compared to the entropy based one. Likewise, the other selected supervised methods also attain good results in parallel using the same settings and FSs.

Further, the Q-SVM and M-SVM classifiers exhibit higher AUC than the other classification methods using entropy and PCA based FSs respectively, with the M-SVM classifier reaching the highest AUC of 95% with PCA based FSs, as shown in Figure 4.46. The corresponding AUCs of all classifiers are also calculated to show the
performance of the other selected classifiers, as shown previously in Table 4.65. Thus, the Q-SVM and M-SVM classifiers show improved AUC over the other selected classifiers.

Figure 4.45: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced PETA-SSS-2 dataset

Figure 4.46: Best AUC for males and females on balanced PETA-SSS-2 dataset using
PCA based selected FSs
c) Performance Evaluation on VIPeR-SSS Dataset: The proposed approach is verified on the customized VIPeR-SSS dataset for pedestrian gender classification. The entropy and PCA based selected FSs combinations are provided to the selected classifiers, and the acquired
results are presented in Tables (4.66)–(4.67). With entropy based FSs, the M-SVM classifier shows 75.8% PPV, 73.9% TPR, 74.9% F1, 73.3% NPV, 75.2% TNR, 24.7% FPR, and 26.0% FNR. With the PCA based outcomes, superior results of 90.4%, 98.7%, 94.4%, 98.8%, 91.2%, 18.7%, and 11.2% are attained for PPV, TPR, F1, NPV, TNR, FPR, and FNR respectively, exceeding the entropy based results and the other classifiers. The obtained results confirm that the PCA based selected features perform much better than the entropy based selected features. Moreover, the corresponding results of all classifiers are also computed using both entropy and PCA based selected features, as shown in Table 4.66.

Table 4.66: Performance of proposed PGC-FSDTF method on balanced VIPeR-SSS dataset (male=544 and female=544 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       74.2    73.2    73.7   72.8    73.9    26.0    26.7
                                  Q-SVM       74.9    73.6    74.2   73.1    74.4    25.5    26.3
                                  C-SVM       73.8    71.4    72.5   70.4    72.9    27.0    28.5
                                  M-SVM       75.8    73.9    74.9   73.3    75.2    24.7    26.0
PCA based selected features       L-SVM       75.7    85.0    80.1   86.6    78.1    21.8    14.9
                                  Q-SVM       74.9    94.5    83.6   95.7    79.2    20.7    15.4
                                  C-SVM       78.2    85.7    81.8   86.9    80.0    20.0    14.2
                                  M-SVM       90.4    98.7    94.4   98.8    91.2    18.7    11.2

Moreover, the selected classifiers are trained and tested with entropy and PCA based FSs for gender prediction under the additional evaluation protocols of O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, with the results shown in Table 4.67. According to these results, the M-SVM classifier performs best with entropy based selected FSs, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 74.6%, 74.6%, 74.6%, 81%, 73%, and 76% respectively. With the PCA based results, the M-SVM classifier again outperforms the other applied classifiers and the entropy based results, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 94.6%, 94.6%, 95.0%, 97%, 99%, and 90% respectively using the PCA based selected FSs. Again, the combination of the proposed PCA based selected FSs, i.e., max deep, average deep, and traditional features, produces higher classification results; the selected feature combination has distinct properties that classify gender images accurately even in
the case of the customized VIPeR dataset. Acceptable performance is also observed for the other classifiers with both entropy and PCA based selected FSs, as shown in Table 4.67.

Table 4.67: Performance of proposed PGC-FSDTF method on balanced VIPeR-SSS dataset (male=544 and female=544 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       73.5      73.5      73.5      81      3.78       73           74
                                   Q-SVM       74.0      74.0      74.0      80      4.10       73           75
                                   C-SVM       72.1      71.2      72.1      79      4.49       70           74
                                   M-SVM       74.6      74.6      74.6      81      3.98       73           76
PCA based features selection       L-SVM       81.2      81.2      81.5      90      2.59       87           76
                                   Q-SVM       85.3      85.3      86.9      93      5.92       96           75
                                   C-SVM       82.6      82.6      82.8      91      3.11       87           78
                                   M-SVM       94.6      94.6      95.0      97      6.23       99           90

Figure 4.47 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, CW-ACC female) of the top two classifiers, confirming that PCA based FSs give better results than entropy based FSs on the balanced and SSS VIPeR dataset. For example, the PCA based M-SVM results exceed the entropy based M-SVM results by 20.0%, 20.0%, 20.4%, 16%, 26%, and 14% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively, and the PCA based M-SVM achieves excellent CW-ACC of 99% (male) and 90% (female). These improvements make it obvious that the PCA based selected FSs combination and the contribution of each FS enhance performance over the entropy based results. Likewise, the other selected classification methods show reliable results in parallel with the same FSs and settings. The M-SVM classifier depicts higher AUC than the other classification methods with both entropy and PCA based FSs; all the corresponding AUCs are calculated as well for the performance evaluation of the selected classifiers, as described previously in Table 4.67. The M-SVM classifier with PCA based FSs shows the highest AUC of 97% for both the male and female classes, exceeding the other classifiers and the entropy method. The best AUC for both classes is shown in Figure 4.48. It is evident from these results that the proposed method performs very well on the VIPeR-SSS dataset.

Figure 4.47: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced VIPeR-SSS dataset

Figure 4.48: Best AUC for males and females on balanced VIPeR-SSS dataset using
PCA based selected features subsets

4.4.3.4 Performance Evaluation on Cross-datasets

In this section, the proposed approach is tested on two balanced cross-datasets using the same classifiers and evaluation protocols. All results are computed separately on each dataset and presented in Tables (4.68)–(4.71). The reason for performing experiments on these datasets is to confirm the strength of the proposed approach with a small
number of samples per dataset, for instance a total of 175 samples for each class (male/female), as used by [45]. The proposed method is also verified on a new dataset, named cross-dataset-1, in which each class consists of 350 samples. The analysis and discussion of the results on both datasets are given in the following subsections.

a) Performance Evaluation on Cross-dataset: Results are reported on the balanced and SSS cross-dataset using the proposed approach. The entropy and PCA based performance of the selected classifiers is examined, and the outcomes for the evaluation protocols PPV, TPR, F1, NPV, TNR, FPR, and FNR are presented in Table 4.68. These experiments show the role of both feature selection methods, entropy and PCA based FSs, on the cross-dataset. With entropy based selected FSs, Q-SVM proves the better classifier for accurate gender prediction compared to the other applied classifiers: it attains PPV, TPR, F1, TNR, FPR, and FNR of 77.7%, 71.2%, 74.3%, 75.4%, 24.5%, and 28.7% respectively, whereas L-SVM outperforms the other classifiers in terms of NPV with 69.1%. Results of the other selected classifiers are also reported in Table 4.68. With PCA based selected features, M-SVM reveals better performance than the other selected classifiers, achieving PPV, TPR, F1, NPV, TNR, FPR, and FNR of 89.1%, 92.3%, 90.6%, 92.5%, 89.5%, 10.4%, and 7.6% respectively. Among the remaining three classifiers, L-SVM shows the weakest performance, with PPV, TPR, F1, NPV, TNR, FPR, and FNR worse than the M-SVM results by 11.4%, 9.4%, 10.4%, 8.5%, 10.5%, 10.5%, and 9.4% respectively. Table 4.68 lists the performance of the other selected classifiers in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR one after the other. Results obtained with the different accuracies (O-ACC, M-ACC, and B-ACC), AUC, and time for both entropy and PCA based FSs are presented in Table 4.69. Overall, Q-SVM and M-SVM outperform the other classifiers, L-SVM and C-SVM. With entropy based FSs, the Q-SVM classifier yields the best O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 73.1%, 73.1%, 73.3%, 81%, 69%, and 78% among the classification methods, while C-SVM requires the least time of 2.53 sec. The corresponding results of the selected classifiers are also obtained to observe the performance of the proposed approach. Similarly, results
are attained using PCA based FSs, where the M-SVM classifier outperforms the rest by depicting O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 90.8%, 90.8%, 90.9%, 95%, 93%, and 89% respectively, while the C-SVM classifier gives the best time of 0.47 sec.

Table 4.68: Performance of proposed PGC-FSDTF method on balanced cross-dataset (male=175 and female=175 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       72.5    70.1    71.3   69.1    71.5    28.4    29.8
                                  Q-SVM       77.7    71.2    74.3   68.5    75.4    24.5    28.7
                                  C-SVM       76.0    70.7    73.2   68.5    74.0    25.9    29.2
                                  M-SVM       74.8    67.5    71.0   64.0    71.7    28.2    32.4
PCA based selected features       L-SVM       77.7    82.9    80.2   84.0    79.0    20.9    17.0
                                  Q-SVM       81.7    85.1    83.3   85.7    82.4    17.5    14.8
                                  C-SVM       78.8    87.8    83.1   89.1    80.8    19.1    12.1
                                  M-SVM       89.1    92.3    90.6   92.5    89.5    10.4    7.6

Parallel results for the selected supervised methods also examine the performance of the proposed approach. The better performance under the respective evaluation protocols strengthens the proposed work, with PCA based FSs showing enhanced performance over entropy based FSs on the balanced and SSS cross-dataset.

Table 4.69: Performance of proposed PGC-FSDTF method on balanced cross-dataset (male=175 and female=175 images) using different accuracies, AUC, and time

Method                                Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy controlled features selection  L-SVM      70.8      70.8      70.8      79      2.76       69           73
                                       Q-SVM      73.1      73.1      73.3      81      2.65       69           78
                                       C-SVM      72.2      72.2      72.4      80      2.53       69           76
                                       M-SVM      69.4      69.4      69.6      78      2.72       64           75
PCA controlled features selection      L-SVM      80.8      80.8      80.9      82      0.67       84           78
                                       Q-SVM      83.7      83.7      83.7      91      0.47       86           82
                                       C-SVM      84.0      84.0      84.3      92      0.47       89           79
                                       M-SVM      90.8      90.8      90.9      95      0.65       93           89

For example, using PCA based FSs, the M-SVM classifier results exceed the entropy based Q-SVM results by 17.7%, 17.7%, 17.6%, 14%, 24%, and 11% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively. For comparison, both entropy and PCA based experimental results are shown in Figure 4.49, which gives an insight into the best results of the two classifiers under both feature selection methods. Other notable improvements of 11.8%, 11.8%, 11.9%, 12%, 20%, and 3% are observed for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female when the C-SVM results are compared under both feature selection methods. These improvements prove that PCA based feature selection boosts performance in comparison with entropy based FSs. Likewise, the other classification methods also show improved results using the same FSs and settings. Further, the Q-SVM and M-SVM classifiers show higher AUC than the other supervised methods on entropy and PCA based FSs respectively, and the M-SVM classifier with PCA based FSs attains the best AUC of 95%, as shown in Figure 4.50.

Figure 4.49: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced cross-dataset, best outcomes of
two classifiers

Figure 4.50: Best AUC for males and females on balanced cross-dataset using PCA
based selected FSs
b) Performance Evaluation on Cross-dataset-1: The proposed technique is also examined on cross-dataset-1. The entropy and PCA based results are presented in Table 4.70 and Table 4.71. With entropy based FSs, the proposed approach performs best under the Q-SVM classifier, with PPV, TPR, F1, NPV, TNR, FPR, and FNR of 84.9%, 82.5%, 83.6%, 82.0%, 84.4%, 15.5%, and 17.5% respectively. In parallel, the PCA based results using the M-SVM classifier surpass the entropy based results and the other classifiers, attaining PPV, TPR, F1, NPV, TNR, FPR, and FNR of 88.2%, 92.2%, 90.2%, 92.5%, 88.7%, 11.2%, and 7.7% respectively. These experiments again show that PCA based selected features perform much better than entropy based selected features. Moreover, the corresponding results of all classifiers are computed using both entropy and PCA based selected features and shown in Table 4.70. The proposed technique is further confirmed using the additional evaluation protocols O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, with results shown in Table 4.71: the Q-SVM classifier performs best with the entropy based results, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 83.4%, 83.4%, 83.4%, 90%, 82%, and 85% using entropy based selected FSs. With the PCA based results, the M-
SVM classifier shows improved performance over the other classifiers and the entropy based results, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female of 90.4%, 90.4%, 90.5%, 96%, 89%, and 90% respectively using the PCA based selected FSs.

Table 4.70: Performance of proposed PGC-FSDTF method on balanced cross-dataset-1 (male=350 and female=350 images) using different evaluation protocols

Method                            Classifier  PPV(%)  TPR(%)  F1(%)  NPV(%)  TNR(%)  FPR(%)  FNR(%)
Entropy based selected features   L-SVM       82.8    81.2    81.9   80.1    82.3    17.7    18.8
                                  Q-SVM       84.9    82.5    83.6   82.0    84.4    15.5    17.5
                                  C-SVM       83.7    81.3    82.5   80.8    83.2    16.7    18.6
                                  M-SVM       80.8    81.5    81.2   81.7    81.0    18.9    18.4
PCA based selected features       L-SVM       77.1    84.3    80.5   85.6    78.8    21.1    15.6
                                  Q-SVM       78.8    84.4    81.5   85.4    80.1    19.8    15.5
                                  C-SVM       74.2    85.2    79.3   87.1    77.2    22.7    14.7
                                  M-SVM       88.2    92.2    90.2   92.5    88.7    11.2    7.7

It is evident from the computed results that the combination of the proposed deep and traditional features leads to higher classification results when used with PCA controlled feature selection and trained with the M-SVM classifier; the combination of selected deep and traditional features has the potential to classify gender images precisely. The other applied classifiers also show notable performance with both entropy and PCA based selected FSs, as shown in Table 4.71.

Table 4.71: Performance of proposed PGC-FSDTF method on balanced cross-dataset-1 (male=350 and female=350 images) using different accuracies, AUC, and time

Method                             Classifier  O-ACC(%)  M-ACC(%)  B-ACC(%)  AUC(%)  Time(Sec)  CW-ACC M(%)  CW-ACC F(%)
Entropy based features selection   L-SVM       81.7      81.7      81.7      89      2.11       81           83
                                   Q-SVM       83.4      83.4      83.4      90      3.74       82           85
                                   C-SVM       82.2      82.2      82.3      90      3.82       81           84
                                   M-SVM       81.2      81.2      81.2      88      2.07       82           81
PCA based features selection       L-SVM       81.3      81.1      81.6      89      0.82       85           77
                                   Q-SVM       82.1      82.1      82.2      90      0.99       85           79
                                   C-SVM       80.7      80.7      81.2      86      0.94       87           74
                                   M-SVM       90.4      90.4      90.5      96      1.16       89           90

Figure 4.51 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, CW-ACC female) of multiple classifiers. According to this figure, better results are achieved with PCA based FSs than with entropy based FSs on the balanced and SSS cross-dataset-1. For example, the PCA based M-SVM results exceed the entropy based Q-SVM results by 7.0%, 7.0%, 7.1%, 6%, 7%, and 5% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female respectively. These noteworthy improvements confirm that the PCA based selected FSs combination improves performance compared to the entropy based one.

Figure 4.51: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced cross-dataset-1, best outcomes
of two classifiers

In parallel, the other selected supervised methods also attain reliable results using the same FSs and settings. Moreover, the Q-SVM and M-SVM classifiers exhibit higher AUC than the other classification methods using entropy and PCA based FSs respectively. All the corresponding AUCs are calculated to show the performance of the selected classifiers, as previously described in Table 4.71 and shown in Figure 4.52. In conclusion, the M-SVM classifier with PCA based FSs shows the highest AUC of 96%, exceeding the rest of the classifiers and the entropy method.

Figure 4.52: Best AUC for males and females on balanced cross-dataset-1 using PCA
based selected FSs

4.4.3.5 Comparison with Existing Methods

To assess the performance of the proposed PGC-FSDTF approach, this work provides a comparison with those state-of-the-art approaches in which the MIT-IB and cross-dataset are utilized for pedestrian gender classification. This comparison is explained below.

a) Comparative Analysis on MIT-IB Dataset: The comparison covers existing approaches introduced for pedestrian gender classification on the MIT dataset, namely PBGR [166], PiHOG-LHSV [168], BIO-PCA [60], BIO-OLPP [60], BIO-LSDA [60], BIO-MFA [60], BIF-PCA [60], CNN [61], HOG [169], LBP [169], HSV [169], LBP-HSV [169], HOG-HSV [169], HOG-LBP [169], HOG-LBP-HSV [169], CNN-e [183], CSVFL [69], J-LDFR [266], and the CNN and CaffeNet variants of [48]. The proposed approach surpasses all of these existing methods, i.e., PBGR, PiHOG-LHSV, BIO-PCA, BIO-OLPP, BIO-LSDA, BIO-MFA, BIF-PCA, CNN, HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, HOG-LBP-HSV, CNN-e, CNN-1, CNN-2, CNN-3, CaffeNet, J-LDFR, and CSVFL, in terms of O-ACC, with improvements of 17.7%, 20.4%, 13.5%, 15.6%, 14.5%, 17.5%, 12.1%, 12.3%, 13.8%, 16.6%, 21.4%, 15.1%, 11.8%, 12.9%, 12.6%, 11.2%, 11.9%, 11.5%, 11.4%, 11.8%, 10.7%, and 7.5% respectively. In addition, the proposed approach obtains a higher AUC than J-LDFR, with a 7% improvement, as described in Table 4.72.

Table 4.72: Performance comparison with state-of-the-art methods on MIT/MIT-IB dataset; a dash (-) indicates that no reported result is available

Methods                    Year   AUC(%)   O-ACC(%)   M-ACC(%)
PBGR [166]                 2008   -        75.0       -
PiHOG-LHSV [168]           2009   -        72.3       -
BIO-PCA [60]               2009   -        79.2       -
BIO-OLPP [60]              2009   -        77.1       -
BIO-LSDA [60]              2009   -        78.2       -
BIO-MFA [60]               2009   -        75.2       -
BIF-PCA [60]               2009   -        80.6       -
CNN [61]                   2013   -        80.4       -
HOG [169]                  2015   -        78.9       75.9
LBP [169]                  2015   -        76.1       68.5
HSV [169]                  2015   -        71.3       64.8
LBP-HSV [169]              2015   -        77.6       73.7
HOG-HSV [169]              2015   -        80.9       75.3
HOG-LBP [169]              2015   -        79.8       76.6
HOG-LBP-HSV [169]          2015   -        80.1       76.7
CNN-e [183]                2017   -        81.5       -
U+M+L (CNN-1) [71]         2019   -        80.8       -
U+M+L (CNN-2) [71]         2019   -        81.2       -
U+M+L (CNN-3) [71]         2019   -        81.3       -
U+M+L (CaffeNet) [71]      2019   -        80.9       -
J-LDFR [266]               2020   86       82.0       77.3
CSVFL [69]                 2020   -        85.2       -
Proposed PGC-FSDTF         -      93       92.7       89.4

Moreover, the PGC-FSDTF approach attains 89.4% M-ACC, which is superior to the eight existing approaches reporting M-ACC, namely HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, HOG-LBP-HSV, and J-LDFR, with improvements of 13.5%, 20.9%, 24.6%, 15.7%, 14.1%, 12.8%, 12.7%, and 12.1% respectively, as shown in Figure 4.53. From these improvements, it is observed that the proposed approach achieves higher O-ACC than the existing approaches by a margin of 7.5% at minimum and 21.4% at maximum, and outperforms the eight existing approaches in terms of M-ACC by 12.1% at minimum and 24.6% at maximum. To the best of our knowledge, the performance of the proposed PGC-FSDTF method is superior in terms of AUC, O-ACC, and M-ACC to all existing appearance based pedestrian gender classification methods. The training and prediction times of J-LDFR are compared with those of the proposed PGC-FSDTF method in Figure 4.54. The PGC-FSDTF method takes less training time owing to the lower dimension of the applied FFV, but its prediction time is higher than that of J-LDFR because of the SVM kernel.

Figure 4.53: Performance comparison in terms of overall accuracy between proposed
PGC-FSDTF method and existing methods on MIT/MIT-IB dataset

Figure 4.54: Comparison of training and prediction time of PGC-FSDTF with J-LDFR

b) Comparative Analysis on Cross-dataset: A comparison of the proposed approach with the existing approaches reported in the literature for pedestrian gender classification on the cross-dataset, namely VGGNet16 [268], Mini-CNN [47], AlexNet-CNN [47], GoogleNet [201], ResNet50 [217], W-HOG [45], DFL [45], and HDFL [45], is presented here. The M-SVM classifier attains the highest AUC of 95% on the cross-dataset and 96% on cross-dataset-1. The proposed and existing results on the cross-datasets are tabulated in Table 4.73. The proposed PGC-FSDTF approach exceeds all eight approaches, VGGNet16, Mini-CNN, AlexNet-CNN, GoogleNet, ResNet50, W-HOG, DFL, and HDFL, with AUC improvements of 11%, 15%, 10%, 12%, 9%, 16%, 6%, and 4% respectively, as shown in Figure 4.55. From these improvements, it is obvious that the proposed approach achieves
higher AUC as compared to existing approaches with a difference of 4% minimum and
16% maximum. The superior results of PGC-FSDTF approach is because of PCA based
selected distinct information from the proposed FSs in which applied PHOG FS is
insensitive to local geometric, illumination, and pose variations. It contains significant
information at both local and global levels using HOG descriptor and pyramid structure
that effectively contribute for gender prediction. Despite this, HSV based histogram
contains color information which seems to be complimentary part of traditional features
because color information resist multiple types of changes in an image such as size,
direction, rotation, distortion, and noise. Furthermore, two different CNN architectures
are used to provide deeper level information of gender image according to the depth of
CNN architecture.
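For illustration, the following minimal sketch (Python/NumPy) shows the general idea behind a pyramid HOG descriptor and an HSV color histogram of the kind described above; the pyramid depth, bin counts, and image size are illustrative assumptions and not the exact PGC-FSDTF parameters.

import numpy as np
from matplotlib.colors import rgb_to_hsv

def phog(gray, levels=2, bins=9):
    """Pyramid Histogram of Oriented Gradients (illustrative parameters).

    At pyramid level l the image is split into 2^l x 2^l regions; an
    orientation histogram is computed per region and all histograms are
    concatenated, giving both local and global gradient information."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    feats = []
    h, w = gray.shape
    for level in range(levels + 1):
        cells = 2 ** level
        ys = np.linspace(0, h, cells + 1, dtype=int)
        xs = np.linspace(0, w, cells + 1, dtype=int)
        for i in range(cells):
            for j in range(cells):
                a = ang[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
                feats.append(hist / (hist.sum() + 1e-8))  # per-region L1 norm
    return np.concatenate(feats)

def hsv_histogram(rgb, bins=16):
    """Concatenated H, S, and V histograms of an RGB image in [0, 1]."""
    hsv = rgb_to_hsv(rgb)
    return np.concatenate([np.histogram(hsv[..., c], bins=bins,
                                        range=(0, 1), density=True)[0]
                           for c in range(3)])

# Example on a random pedestrian-sized image (128 x 64).
img = np.random.rand(128, 64, 3)
fv = np.concatenate([phog(img.mean(axis=2)), hsv_histogram(img)])
print(fv.shape)   # PHOG: (1 + 4 + 16) * 9 = 189 dims, plus 48 HSV bins

The pyramid concatenation is what makes the descriptor carry both global (level 0) and local (finer levels) shape cues, while the HSV histogram adds the color cue discussed above.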

Table 4.73: Comparison of proposed PGC-FSDTF method results with state-of-the-art methods on cross-datasets
Methods Year AUC (%) O-ACC (%) M-ACC (%)
VGGNet16 [268] 2014 84 - -
Mini-CNN [47] 2015 80 - -
AlexNet-CNN [47] 2015 85 - -
GoogleNet [201] 2015 83 - -
ResNet50 [217] 2016 86 - -
W-HOG [45] 2018 79 - -
DFL [45] 2018 89 - -
HDFL [45] 2018 91 - -
Proposed PGC-FSDTF (cross dataset) 95 90.8 90.8
Proposed PGC-FSDTF (cross dataset-1) 96 90.4 90.4

The utilization of parallel fused deep features is more robust because it provides distinct information (max features) and average information (average features) from two CNN architectures instead of a single CNN architecture; a minimal sketch of this fusion is given after this paragraph. Hence, the proposed results suggest that combining the features of two CNN architectures leads to better classification rates. The superior results also verify that the fusion of all FSs provides better discrimination ability and identifies the contribution of each FS to robust pedestrian gender classification. The proposed PGC-FSDTF method addresses the SSS dataset problem for pedestrian gender classification. These SSS datasets mainly face three issues: (1) lack of data generalization, (2) class-wise imbalanced data, and (3) model learning. The first two issues are addressed using data augmentation operations and random over-sampling.
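A minimal sketch of the parallel fusion idea is given below: the deep feature vectors of two CNN architectures are brought to a common length (zero-padding is an assumption made here for illustration) and combined element-wise into max features and average features.

import numpy as np

def parallel_fuse(feat_a, feat_b):
    """Parallel fusion of two deep feature vectors (illustrative sketch).

    The shorter vector is zero-padded to the length of the longer one,
    then the element-wise maximum (distinct information) and average
    (shared information) are concatenated into one fused vector."""
    n = max(feat_a.size, feat_b.size)
    a = np.zeros(n); a[:feat_a.size] = feat_a
    b = np.zeros(n); b[:feat_b.size] = feat_b
    max_part = np.maximum(a, b)     # "max features"
    avg_part = (a + b) / 2.0        # "average features"
    return np.concatenate([max_part, avg_part])

# Hypothetical FC-layer outputs of two CNNs (e.g. 4096-D and 1000-D).
f1 = np.random.rand(4096)
f2 = np.random.rand(1000)
fused = parallel_fuse(f1, f2)
print(fused.shape)   # (8192,)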

Figure 4.55: Performance comparison in terms of AUC between the proposed PGC-
FSDTF method and existing methods on cross-datasets

In this work, an equal distribution of data is achieved using four data augmentation operations that add synthetic data (more samples) to an SSS class. Typically, only the minority class is considered for data augmentation, but here the data is synthesized for both gender classes for two reasons: (1) to minimize biases while applying the data augmentation operations, and (2) the performance of the male gender class may be affected while trying to improve the performance of the female gender class and vice versa. To generate unbiased balanced data, three different strategies (1vs1, 1vs4, and mixed) are presented, and the selected data augmentation operations are implemented under these strategies one by one; a simplified sketch of this over-sampling step follows this paragraph. Consequently, three augmented datasets are generated from each imbalanced dataset for the experiments. Finally, all datasets are categorized as two class-wise imbalanced, six augmented balanced, and five non-augmented/customized balanced SSS datasets.
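The following simplified sketch illustrates balanced random over-sampling with augmentation; the two NumPy-only operations (horizontal flip and brightness jitter) and the toy data stand in for the four augmentation operations and the 1vs1/1vs4/mixed schedules actually used in this work.

import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Randomly apply one simple augmentation operation (illustrative)."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                           # horizontal flip
    return np.clip(img * rng.uniform(0.8, 1.2), 0, 1)   # brightness jitter

def balance_with_oversampling(images, labels):
    """Balanced random over-sampling: synthesize augmented copies of the
    under-represented class until both classes have the same sample count."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    out_imgs, out_lbls = list(images), list(labels)
    for cls, cnt in zip(classes, counts):
        idx = np.where(labels == cls)[0]
        for _ in range(target - cnt):
            out_imgs.append(augment(images[rng.choice(idx)]))
            out_lbls.append(cls)
    return out_imgs, np.array(out_lbls)

# Toy imbalanced set: six "male" (1) and two "female" (0) images.
imgs = [np.random.rand(128, 64, 3) for _ in range(8)]
lbls = [1, 1, 1, 1, 1, 1, 0, 0]
bal_imgs, bal_lbls = balance_with_oversampling(imgs, lbls)
print(np.bincount(bal_lbls))   # [6 6]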

This work has utilized different kernels with SVM to classify gender. The superior results in terms of O-ACC and AUC on these datasets validate the performance of the proposed PGC-FSDTF approach for gender classification when the C-SVM and M-SVM classifiers are trained on PCA based selected FSs; a sketch of this protocol is given after this paragraph. This study aimed to design a robust approach that accurately classifies the gender image under imbalanced and SSS dataset conditions. In parallel, detailed investigations are reported on these datasets using entropy based FSs with the selected classifiers to test the performance of the proposed approach. Moreover, it is observed that the proposed approach outperformed existing studies on the MIT-IB dataset and cross-dataset with significant improvements. In this study, deeply learned information is extracted from the FC layers of two different CNN architectures rather than training a model from scratch or applying transfer learning. Deep feature representations are robust against large intra-class variations, pose changes, and different illumination conditions. Multiple selected feature schemes are combined in the proposed PGC-FSDTF approach, which produces improved outcomes without demanding additional data for model learning. For comprehensive analysis, the applied datasets are classified into two categories: (1) imbalanced and augmented balanced datasets, and (2) customized/non-augmented balanced datasets. The empirical analysis is elaborated in the next section.
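The evaluation protocol sketched below (scikit-learn) illustrates PCA based feature selection followed by SVM classifiers with different kernels under 10-fold cross-validation; the cubic polynomial kernel for C-SVM, the RBF kernel for M-SVM, the number of retained components, and the synthetic data are assumptions made only for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the fused feature vectors (FFV) and gender labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1024))
y = rng.integers(0, 2, size=400)

# PCA-based feature selection followed by SVM with different kernels,
# evaluated with 10-fold cross-validation as in the reported protocol.
classifiers = {
    "C-SVM (cubic)":    SVC(kernel="poly", degree=3, C=1.0),
    "M-SVM (Gaussian)": SVC(kernel="rbf", gamma="scale", C=1.0),
}
for name, svm in classifiers.items():
    model = make_pipeline(StandardScaler(), PCA(n_components=200), svm)
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: O-ACC = {acc.mean():.3f}")

Placing PCA inside the cross-validation pipeline keeps the component estimation restricted to the training folds, which matches the spirit of the reported 10-fold protocol.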

Imbalanced and Augmented Balanced Datasets for Gender Prediction: A comparison of the optimal results obtained through empirical analysis on the imbalanced and augmented balanced datasets is tabulated in Table 4.74. According to the results, the proposed method showed 93% AUC, 99% CW-ACC for the male class, and 80% CW-ACC for the female class on the imbalanced MIT-IB dataset, outperforming state-of-the-art approaches with significant improvements. However, the class-wise accuracy of the male class is 19% higher than that of the female class because of class-wise similarities. This gap also indicates that the class-wise accuracy of the proposed approach is higher on the majority class (male) than on the minority class (female) due to the extremely imbalanced distribution of data. Moreover, the proposed approach achieved its best O-ACC of 92.7% on the MIT-IB dataset, which is slightly higher than on the two augmented balanced datasets MIT-BROS-1 and MIT-BROS-2. On the other augmented balanced dataset, MIT-BROS-3, the proposed approach produced a higher O-ACC with margins of 4.4% and 3.3% compared to the augmented balanced MIT-BROS-1 and MIT-BROS-2 datasets respectively, as shown in Table 4.74. As already mentioned, the PKU-Reid dataset was renamed as the imbalanced PKU-Reid-IB dataset due to the unequal distribution of data observed in the collected samples. This dataset is used for the first time for pedestrian gender classification. On this newly prepared imbalanced PKU-Reid-IB dataset, the proposed approach achieved improved results of 96% AUC, 95% CW-ACC male, and 93% CW-ACC female. Here, the class-wise accuracy of the male class is only 2% higher than that of the female class due to the less imbalanced ratio of data. Although better performance is seen on both imbalanced datasets, a noteworthy factor is that B-ACC, which is usually applied to unequal distributions of data, is also examined on the MIT-IB and PKU-Reid-IB datasets. These datasets are also challenging due to variations in pose, illumination, and low contrast. Considering the PKU-Reid based augmented balanced datasets, the proposed approach performed best in terms of O-ACC, M-ACC, and B-ACC with 92.2% for each evaluation metric on the PKU-Reid-BROS-2 dataset, such that its O-ACC is 0.6% and 2.3% higher than on the PKU-Reid-BROS-1 and PKU-Reid-BROS-3 datasets respectively.

It is also observed from the empirical results that the gender classification improvement on PKU-Reid-BROS-2 was achieved when more samples were added to the minority class (female) based on the 1vs4 strategy while implementing the data augmentation operations. The proposed approach maintains its classification accuracies with minor differences whether the data is augmented for a single gender class or for both. However, the proposed approach achieved its best AUC of 98% on both mixed-strategy augmented datasets, MIT-BROS-3 and PKU-Reid-BROS-3.

Table 4.74: Summary of proposed method PGC-FSDTF results on all selected imbalanced and augmented balanced datasets where the proposed approach recorded superior AUC
Dataset O-ACC (%) M-ACC (%) B-ACC (%) AUC (%) CW-ACC Male (%) CW-ACC Female (%)
MIT-IB 92.7 89.4 94.1 93 99 80
MIT-BROS-1 89.3 89.3 89.3 96 89 91
MIT-BROS-2 90.4 90.4 90.4 97 91 90
MIT-BROS-3 93.7 93.7 93.9 98 97 90
PKU-Reid-IB 89.7 88.2 90.0 96 95 93
PKU-Reid-BROS-1 91.6 91.6 91.6 97 94 89
PKU-Reid-BROS-2 92.2 92.2 92.2 97 93 91
PKU-Reid-BROS-3 89.9 89.9 89.9 98 95 89

It was also noticed from the overall outcomes that, for the extremely imbalanced MIT-IB dataset, the performance improvement was attained when BROS was applied using the mixed strategy rather than augmenting the data with the 1vs1 and 1vs4 strategies. Similarly, in the case of the PKU-Reid-BROS-3 dataset, the performance improvement was achieved when BROS was applied using the 1vs4 and mixed strategies rather than the 1vs1 strategy. Considering these findings, it can be concluded that the proposed approach achieves better results on imbalanced SSS datasets and can also be applied to augmented balanced datasets in an effective manner. Hence, its good performance under multiple respective evaluation measures strengthens the working of the proposed approach.

Customized/Non-augmented Balanced Datasets for Gender Prediction: The subsequent content covers the discussion on the superior results obtained using the PCA based selected FSs on the customized datasets, as presented in Table 4.75. To validate the robustness of the proposed approach, extensive experiments are conducted on the existing cross-dataset and four newly customized balanced datasets (PETA-SSS-1, PETA-SSS-2, VIPeR-SSS, and cross-dataset-1) for pedestrian gender classification. On the cross-dataset, the proposed approach attained its best results in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female with values of 90.8%, 90.8%, 90.9%, 95%, 93%, and 89% respectively.

The proposed approach also yields better classification rates than existing studies reported on the cross-dataset, with significant improvements. In addition, the proposed approach achieved comparable classification results on the newly customized cross-dataset-1. This dataset contains twice as many samples per class (350) as the cross-dataset. However, as far as performance is concerned, there is only a slight difference between the existing cross-dataset and the newly customized cross-dataset-1. The experimental outcomes validate that the proposed approach can achieve classification improvements on the customized SSS cross-dataset-1. These improvements also confirm that PGC-FSDTF maintains better performance on the cross-dataset than on the newly prepared cross-dataset-1. The proposed PGC-FSDTF approach is also evaluated on three other balanced SSS datasets where gender-wise samples are collected from the more challenging PETA and VIPeR datasets. Using the PETA dataset, 864 and 1300 samples per class are randomly selected for the PETA-SSS-1 and PETA-SSS-2 datasets respectively. These numbers were chosen to cross-check the performance of the proposed approach on the same number of class-wise samples as used in the augmented balanced datasets. Interestingly, the proposed approach yields better results on both the PETA-SSS-1 and PETA-SSS-2 datasets, acquiring the same AUC of 95% on both. In addition, the proposed approach attained O-ACC, M-ACC, and B-ACC of 88.8% each on the PETA-SSS-1 dataset, while on the PETA-SSS-2 dataset it acquired slightly lower accuracies. The notable factor on these datasets is the class-wise accuracies: comparing both datasets, CW-ACC male is higher by 6% on the PETA-SSS-2 dataset, while CW-ACC female is better by 8% on the PETA-SSS-1 dataset because of class-wise similarities in the randomly collected data.

Table 4.75: Proposed approach PGC-FSDTF results on customized/non-augmented balanced datasets where the proposed approach recorded superior AUC
Dataset O-ACC (%) M-ACC (%) B-ACC (%) AUC (%) CW-ACC Male (%) CW-ACC Female (%)
PETA-SSS-1 88.8 88.8 88.8 95 89 88
PETA-SSS-2 88.0 88.0 87.9 95 95 80
VIPeR-SSS 94.7 94.7 95.0 97 99 90
Cross-dataset 90.8 90.8 90.9 95 93 89
Cross-dataset-1 90.4 90.4 90.5 96 93 88

The VIPeR dataset is customized and renamed as the VIPeR-SSS dataset. This prepared dataset contains a balanced distribution of SSS data. On this dataset the proposed approach achieved its best performance among all applied datasets in terms of O-ACC, M-ACC, and B-ACC with 94.7%, 94.7%, and 95.0% respectively, whereas the other evaluation metrics also showed significant performance. To the best of our knowledge, the PGC-FSDTF approach entirely outperforms existing approaches reported in the literature for pedestrian gender classification in terms of O-ACC, M-ACC, AUC, CW-ACC male, and CW-ACC female on the MIT-IB dataset and cross-dataset. According to the results, the proposed approach achieved O-ACC 92.7%, M-ACC 89.4%, B-ACC 94.1%, AUC 93%, CW-ACC male 99%, and CW-ACC female 80% on the MIT-IB dataset. The improved results acquired on the cross-dataset show that O-ACC is 90.8%, M-ACC is 90.8%, B-ACC is 90.9%, AUC is 95%, CW-ACC male is 93%, and CW-ACC female is 89%, as described in Table 4.76.

Table 4.76: Proposed PGC-FSDTF approach results on MIT-IB dataset and cross-dataset
Dataset O-ACC (%) M-ACC (%) B-ACC (%) AUC (%) CW-ACC Male (%) CW-ACC Female (%)
MIT-IB 92.7 89.4 94.1 93 99 80
Cross-dataset 90.8 90.8 90.9 95 93 89

Further, the training time and prediction time of the proposed method on the applied datasets are efficient. Considering optimal values of training time and prediction time, the proposed PGC-FSDTF obtained its best AUC using entropy and PCA based feature subsets, as shown in Figure 4.56 and Figure 4.57. Across all applied datasets, Figure 4.56 shows the training time in seconds, and Figure 4.57 shows the prediction speed in observations per second (obs/sec) using entropy and PCA based feature subsets.

Figure 4.56: Training time of PGC-FSDTF method on applied datasets

Figure 4.57: Prediction time of PGC-FSDTF method on applied datasets

According to the overall empirical assessments, an overview of CW-ACC is provided on the imbalanced, augmented balanced, and customized SSS datasets, as shown in Figure 4.58. It is noticed that the PCA based results obtained using the proposed PGC-FSDTF method achieve significant performance on existing and even on newly customized datasets for gender prediction.

Figure 4.58: Complete overview of proposed PGC-FSDTF method results in terms of
best CW-ACC on selected, customized, and augmented datasets where
proposed approach achieved superior AUC (a) CW-ACC on customized
balanced SSS datasets, and (b) CW-ACC on imbalanced and augmented
balanced SSS datasets

4.5 Discussion
In this thesis, one method is presented for person ReID, named FCDF, and two methods for pedestrian gender classification, namely J-LDFR and PGC-FSDTF. For person ReID, the issues of gallery search optimization, environmental effects, and different camera settings are examined, and a robust three-stage method is proposed to address them. The first stage of the proposed method extracts distinct features using FFS. The second stage splits the dataset into clusters based on the selected feature subset, and the extracted deep features are fused with the OFS to handle issues of appearance and inherent ambiguities. In the third stage, feature matching is introduced to search for the probe image match within the classified cluster. This stage provides gallery search optimization because of the cluster-based search, as compared to a gallery-wide search, for probe image matching; a simplified sketch of this cluster-based search is given after this paragraph. The proposed FCDF approach is tested on the VIPeR, CUHK01, and iLIDS-VID datasets and obtains significant results at different ranks as compared to state-of-the-art methods. Recognition rates at rank-1 are shown in Table 4.77. As this work implemented person ReID on different pedestrian datasets, the next objective was to examine the abovementioned datasets, which are sub-datasets of the large-scale PETA dataset, together with the MIT dataset widely reported in the existing literature, to investigate full-body appearances of pedestrians for gender classification. Therefore, in the second proposed method, J-LDFR, the large-scale (PETA) and small-scale (MIT) datasets are selected for PGC.
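A simplified sketch of the cluster-based gallery search is shown below; k-means clustering, an RBF-kernel SVM as the cluster classifier, and Euclidean matching inside the predicted cluster are stand-ins for the consensus clustering and cross-bin distance used in FCDF, and the feature dimensions are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
gallery = rng.normal(size=(632, 512))      # one feature vector per gallery identity
gallery_ids = np.arange(632)

# 1) Split the gallery into k clusters on the selected feature subset.
k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gallery)

# 2) Learn the relationship between gallery features and cluster labels
#    with an RBF kernel, so a probe can be routed to a single cluster.
router = SVC(kernel="rbf", gamma="scale").fit(gallery, kmeans.labels_)

# 3) Match the probe only inside its predicted cluster instead of the
#    whole gallery (gallery search optimization).
probe = rng.normal(size=(1, 512))
cluster = router.predict(probe)[0]
members = np.where(kmeans.labels_ == cluster)[0]
dists = np.linalg.norm(gallery[members] - probe, axis=1)
ranked = gallery_ids[members[np.argsort(dists)]]
print("searched", members.size, "of", gallery.shape[0], "identities; rank-1 id:", ranked[0])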

Table 4.77: Summary of proposed methods including tasks, datasets, and results

FCDF (Person ReID):
VIPeR 46.8% Rank-1
CUHK01 48.1% Rank-1
iLIDS-VID 40.6% Rank-1

J-LDFR (Pedestrian gender classification on large-scale and small-scale datasets):
PETA 89.3% O-ACC, 96% AUC
MIT 82.0% O-ACC, 86% AUC

PGC-FSDTF (Pedestrian gender classification on imbalanced, augmented, and customized balanced datasets):
MIT-IB 92.7% O-ACC, 93% AUC
MIT-BROS-1 89.3% O-ACC, 96% AUC
MIT-BROS-2 90.4% O-ACC, 97% AUC
MIT-BROS-3 93.7% O-ACC, 98% AUC
PKU-Reid-IB 89.7% O-ACC, 96% AUC
PKU-Reid-BROS-1 91.6% O-ACC, 97% AUC
PKU-Reid-BROS-2 92.2% O-ACC, 97% AUC
PKU-Reid-BROS-3 89.9% O-ACC, 98% AUC
PETA-SSS-1 88.8% O-ACC, 95% AUC
PETA-SSS-2 88.0% O-ACC, 95% AUC
VIPeR-SSS 94.7% O-ACC, 97% AUC
Cross-dataset 90.8% O-ACC, 95% AUC
Cross-dataset-1 90.4% O-ACC, 96% AUC

Through extensive experimentation, it is found that the suggested joint feature representations are more robust than disjoint feature representations for accurate gender prediction. The computed results surpass relevant existing methods by significant margins on both the large-scale and small-scale datasets. To the best of our knowledge, no prior study addresses the binary imbalanced classification and SSS problems for PGC. Therefore, the third proposed method performs PGC on IB-SSS datasets. In this regard, the contribution of this research is twofold: (1) the generation of synthetic data for effective handling of the binary imbalanced classification and SSS problems in PGC, and (2) the investigation of multiple low- and high-level feature extraction schemes, feature selection (PCA and entropy), and fusion (parallel and serial) strategies to accurately classify gender under the same settings and protocols but with different types of datasets, including imbalanced, augmented balanced, and customized balanced. The suggested PGC-FSDTF produces equally good results on these datasets. The details about the datasets and the results in terms of O-ACC and AUC are given in Table 4.77.

4.6 Summary
In this chapter, the proposed methodologies are evaluated on several publicly available challenging datasets. Different experiments are performed to authenticate the performance of the proposed methods. The results based on the selected feature subsets are tested against a number of evaluation protocols such as CMC, O-ACC, M-ACC, B-ACC, and AUC using handcrafted and deep learning features, which are then fed to a distance measure and different classifiers for matching and classification of full-body pedestrian images. The probe matching results for person ReID are obtained from a cluster instead of the whole gallery, whereas the pedestrian gender classification results are obtained using 10-fold cross-validation.

Chapter 5 Conclusion and Future Work

5.1 Conclusion
Accurate pedestrian ReID and gender classification are challenging tasks owing to the high diversity in pedestrian appearances under non-overlapping camera settings. To overcome these challenges, three methods are proposed in this thesis and evaluated on benchmark datasets.

The first method addresses the person ReID problem and its challenges using OFS based features clustering and deeply learned features. A novel FCDF framework is presented which considers single-image based ReID (one probe and one gallery image per identity). The OFS is chosen to handle the challenges of illumination and viewpoint variations, and a reliable and effective method called FFS is proposed for its selection. A deep CNN model is also used to extract discriminative cues from the FC-7 layer, which effectively handle large appearance changes. For probe matching, gallery search optimization is performed using a features-based clustering technique to improve recognition rates. A cross-bin histogram distance measure is utilized to obtain an accurate image pair from the cluster(s). Moreover, the proposed framework handles the challenges of the chosen datasets efficiently. According to the computed results, recognition rates are significantly increased at different ranks on the VIPeR, CUHK01, and iLIDS-VID datasets, which shows the robustness of the FCDF framework.
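As an illustration of a cross-bin histogram distance, the sketch below uses the one-dimensional Wasserstein (earth mover's) distance between normalized histograms; it is a representative cross-bin measure rather than necessarily the exact measure used in FCDF, and unlike a bin-to-bin measure it grows gradually as histogram mass shifts across neighbouring bins.

import numpy as np
from scipy.stats import wasserstein_distance

def cross_bin_distance(hist_p, hist_q):
    """Earth mover's distance between two histograms defined on the same
    bin centers (an example of a cross-bin distance measure)."""
    centers = np.arange(len(hist_p))
    p = hist_p / hist_p.sum()
    q = hist_q / hist_q.sum()
    return wasserstein_distance(centers, centers, u_weights=p, v_weights=q)

h1 = np.array([0., 10., 0., 0.])   # all mass in bin 1
h2 = np.array([0., 0., 10., 0.])   # mass shifted by one bin
h3 = np.array([0., 0., 0., 10.])   # mass shifted by two bins
print(cross_bin_distance(h1, h2))  # 1.0  (small shift -> small distance)
print(cross_bin_distance(h1, h3))  # 2.0  (larger shift -> larger distance)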

A new framework, J-LDFR, is proposed and evaluated as the second method for pedestrian gender classification. The key contribution of this framework is the utilization of LOMO and HOG features to collectively depict low-level feature representations. Moreover, meaningful deep feature representations from two convolutional neural networks are used as well. The computed deep features not only provide pose and illumination invariant information but also capture distinct characteristics of gender images. These feature representations are effectively utilized through maximum entropy controlled feature selection, and the selected features are combined into a joint, robust feature representation. Different classification methods are applied to investigate the robustness of these feature representations individually and jointly. The computed results surpass existing methods by a minimum of 1% and a maximum of 10% AUC on the PETA dataset, and by a minimum of 0.5% and a maximum of 10.7% accuracy on the MIT dataset. The experimental outcomes show that the joint feature representation improves the results under different classification methods, specifically under C-SVM. The usefulness of the proposed framework is tested by comparing its results with various state-of-the-art methods. The cross-validation results demonstrate that J-LDFR is superior and outperforms existing pedestrian gender classification methods.
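A minimal sketch of entropy controlled feature selection and joint feature representation is given below; estimating per-dimension Shannon entropy from a value histogram and keeping the highest-entropy dimensions of each feature type before concatenation is one plausible reading for illustration, not the exact J-LDFR formulation, and the matrix sizes are assumptions.

import numpy as np

def entropy_per_dimension(features, bins=16):
    """Shannon entropy of every feature dimension, estimated from a
    histogram of its values across all samples."""
    ent = np.empty(features.shape[1])
    for d in range(features.shape[1]):
        hist, _ = np.histogram(features[:, d], bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        ent[d] = -(p * np.log2(p)).sum()
    return ent

def select_max_entropy(features, keep):
    """Keep the `keep` dimensions with the highest entropy."""
    idx = np.argsort(entropy_per_dimension(features))[::-1][:keep]
    return features[:, idx]

# Hypothetical handcrafted (LOMO + HOG) and deep feature matrices.
rng = np.random.default_rng(0)
handcrafted = rng.normal(size=(500, 2000))
deep = rng.normal(size=(500, 4096))

# Joint feature representation: the selected subsets concatenated sample-wise.
joint = np.hstack([select_max_entropy(handcrafted, 500),
                   select_max_entropy(deep, 1000)])
print(joint.shape)   # (500, 1500)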

The third proposed method, PGC-FSDTF, obtained superior results on SSS imbalanced, augmented balanced, and customized balanced datasets for pedestrian gender classification. Different techniques are implemented in the data preparation module for image smoothness, removal of noise, and a balanced distribution of samples in both classes. Traditional and deep features are extracted, and then FSF is carried out to compute a single robust feature vector. An SVM classifier with different kernels is applied to classify the gender image. The experimental outcomes show that the suggested method outperforms state-of-the-art methods on the original MIT dataset and cross-dataset. In addition, it achieves noteworthy results on eleven imbalanced, augmented balanced, and customized balanced datasets.
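The data preparation step mentioned above can be sketched as follows; Gaussian smoothing and median filtering are used here only as representative operations for image smoothness and noise removal, and the filter sizes are illustrative assumptions rather than the exact settings of the data preparation module.

import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def prepare_image(rgb):
    """Illustrative data-preparation step: smooth each channel with a
    Gaussian filter and suppress impulse noise with a median filter."""
    smoothed = gaussian_filter(rgb, sigma=(1.0, 1.0, 0))   # do not blur across channels
    return median_filter(smoothed, size=(3, 3, 1))

img = np.random.rand(128, 64, 3)
print(prepare_image(img).shape)   # (128, 64, 3)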

This thesis has validated the proposed methods on several benchmark pedestrian datasets, such as VIPeR, CUHK01, iLIDS-VID, PETA, MIT/MIT-IB, MIT-BROS-1, MIT-BROS-2, MIT-BROS-3, PKU-Reid-IB, PKU-Reid-BROS-1, PKU-Reid-BROS-2, PKU-Reid-BROS-3, PETA-SSS-1, PETA-SSS-2, VIPeR-SSS, cross-dataset, and cross-dataset-1, in terms of different evaluation protocols such as CMC, O-ACC, M-ACC, B-ACC, AUC, F1-score, FPR, TPR, PPV, NPV, TNR, and FNR. It is concluded that the proposed methods obtain improved results in terms of CMC, accuracy, and AUC as compared with existing methods for the pedestrian ReID and pedestrian gender classification tasks.

5.2 Future Work


Although the computed results are promising, there is room to enhance the performance on both pedestrian analysis tasks. In this regard, fine-tuned CNN models or fusing deep features from different CNNs can improve classification accuracy and recognition rates in the future. Other potential directions include considering both the body and its parts for handcrafted and deep feature extraction separately, by mapping the extracted features of body parts over full-body features for person ReID and gender classification. In addition, spatial-temporal information can be used with LSTM for both tasks. Since it is difficult to recognize gender under inter-class similarities such as similar clothes, hairstyles, and body shapes, this needs to be further investigated with more robust low-level and high-level feature extraction, feature optimization, and feature fusion techniques for gender classification. Finally, extensive pre-processing steps or data augmentation techniques may improve the classification and recognition results. Researchers may apply shallow and deep learning based schemes for low-level and deep feature extraction to test other pedestrian analysis tasks such as pedestrian detection and pedestrian activity and action recognition. This would further validate the usefulness of the feature engineering implemented in the different proposed methods. In addition, unsupervised learning can be used to automatically label pedestrian or non-pedestrian images based on deep features.

Chapter 6 References

[1] E. Cosgrove, "One billion surveillance cameras will be watching around the
world in 2021"
[2] M. P. Ashby, "The value of CCTV surveillance cameras as an investigative tool:
An empirical analysis," European Journal on Criminal Policy and Research,
vol. 23, pp. 441-459, 2017.
[3] M. Souded, "People detection, tracking and re-identification through a video
camera network," 2013.
[4] Q. Ma, Y. Kang, W. Song, Y. Cao, and J. Zhang, "Deep Fundamental Diagram
Network for Real-Time Pedestrian Dynamics Analysis," in Traffic and
Granular Flow 2019, ed: Springer, pp. 195-203, 2020.
[5] Z. Ji, Z. Hu, E. He, J. Han, and Y. Pang, "Pedestrian attribute recognition based
on multiple time steps attention," Pattern Recognition Letters, vol. 138, pp. 170-
176, 2020.
[6] P. Pandey and J. V. Aghav, "Pedestrian Activity Recognition Using 2-D Pose
Estimation for Autonomous Vehicles," in ICT Analysis and Applications, ed:
Springer, pp. 499-506, 2020.
[7] S. Kazeminia, C. Baur, A. Kuijper, B. van Ginneken, N. Navab, S. Albarqouni,
et al., "GANs for medical image analysis," Artificial Intelligence in Medicine,
p. 101938, 2020.
[8] X. Ma, Y. Niu, L. Gu, Y. Wang, Y. Zhao, J. Bailey, et al., "Understanding
adversarial attacks on deep learning based medical image analysis systems,"
Pattern Recognition, p. 107332, 2020.
[9] K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, K. Nikolaou, et al.,
"MedGAN: Medical image translation using GANs," Computerized Medical
Imaging and Graphics, vol. 79, p. 101684, 2020.
[10] D. Caballero, R. Calvini, and J. M. Amigo, "Hyperspectral imaging in crop
fields: precision agriculture," in Data Handling in Science and Technology. vol.
32, ed: Elsevier, pp. 453-473, 2020.
[11] A. Przybylak, R. Kozłowski, E. Osuch, A. Osuch, P. Rybacki, and P.
Przygodziński, "Quality Evaluation of Potato Tubers Using Neural Image
Analysis Method," Agriculture, vol. 10, p. 112, 2020.
[12] H. Tian, T. Wang, Y. Liu, X. Qiao, and Y. Li, "Computer vision technology in
agricultural automation—A review," Information Processing in Agriculture,
vol. 7, pp. 1-19, 2020.

[13] V. Tsakanikas and T. Dagiuklas, "Video surveillance systems-current status and
future trends," Computers & Electrical Engineering, vol. 70, pp. 736-753, 2018.
[14] M. Baqui, M. D. Samad, and R. Löhner, "A novel framework for automated
monitoring and analysis of high density pedestrian flow," Journal of Intelligent
Transportation Systems, vol. 24, pp. 585-597, 2020.
[15] L. Cai, J. Zhu, H. Zeng, J. Chen, and C. Cai, "Deep-learned and hand-crafted
features fusion network for pedestrian gender recognition," in Proceedings of
ELM-2016, ed: Springer, pp. 207-215, 2018.
[16] X. Li, L. Telesca, M. Lovallo, X. Xu, J. Zhang, and W. Song, "Spectral and
informational analysis of pedestrian contact force in simulated overcrowding
conditions," Physica A: Statistical Mechanics and its Applications, p. 124614,
2020.
[17] L. Yang, G. Hu, Y. Song, G. Li, and L. Xie, "Intelligent video analysis: A
Pedestrian trajectory extraction method for the whole indoor space without
blind areas," Computer Vision and Image Understanding, p. 102968, 2020.
[18] C. Zhao, X. Wang, W. Zuo, F. Shen, L. Shao, and D. Miao, "Similarity learning
with joint transfer constraints for person re-identification," Pattern Recognition,
vol. 97, p. 107014, 2020.
[19] L. An, X. Chen, S. Liu, Y. Lei, and S. Yang, "Integrating appearance features
and soft biometrics for person re-identification," Multimedia Tools and
Applications, vol. 76, pp. 12117-12131, 2017.
[20] H. J. Galiyawala, M. S. Raval, and A. Laddha, "Person retrieval in surveillance
videos using deep soft biometrics," in Deep Biometrics, ed: Springer, pp. 191-
214, 2020.
[21] P. P. Sarangi, B. S. P. Mishra, and S. Dehuri, "Fusion of PHOG and LDP local
descriptors for kernel-based ear biometric recognition," Multimedia Tools and
Applications, vol. 78, pp. 9595-9623, 2019.
[22] N. Ahmadi and G. Akbarizadeh, "Iris tissue recognition based on GLDM feature
extraction and hybrid MLPNN-ICA classifier," Neural Computing and
Applications, vol. 32, pp. 2267-2281, 2020.
[23] M. Regouid, M. Touahria, M. Benouis, and N. Costen, "Multimodal biometric
system for ECG, ear and iris recognition based on local descriptors,"
Multimedia Tools and Applications, vol. 78, pp. 22509-22535, 2019.

[24] R. F. Soliman, M. Amin, and F. E. Abd El-Samie, "Cancelable Iris recognition
system based on comb filter," Multimedia Tools and Applications, vol. 79, pp.
2521-2541, 2020.
[25] P. S. Chanukya and T. Thivakaran, "Multimodal biometric cryptosystem for
human authentication using fingerprint and ear," Multimedia Tools and
Applications, vol. 79, pp. 659-673, 2020.
[26] K. M. Sagayam, D. N. Ponraj, J. Winston, E. Jeba, and A. Clara,
"Authentication of Biometric System using Fingerprint Recognition with
Euclidean Distance and Neural Network Classifier," Project: Hand posture and
gesture recognition techniques for virtual reality applications: a survey, 2019.
[27] T. B. Moeslund, S. Escalera, G. Anbarjafari, K. Nasrollahi, and J. Wan,
"Statistical Machine Learning for Human Behaviour Analysis," ed:
Multidisciplinary Digital Publishing Institute, 2020.
[28] S. Gupta, K. Thakur, and M. Kumar, "2D-human face recognition using SIFT
and SURF descriptors of face’s feature regions," The Visual Computer, pp. 1-
10, 2020.
[29] N. Jahan, P. K. Bhuiyan, P. A. Moon, and M. A. Akbar, "Real Time Face
Recognition System with Deep Residual Network and KNN," in 2020
International Conference on Electronics and Sustainable Communication
Systems (ICESC), pp. 1122-1126, 2020.
[30] F. Zhao, J. Li, L. Zhang, Z. Li, and S.-G. Na, "Multi-view face recognition using
deep neural networks," Future Generation Computer Systems, pp.375-380,
2020.
[31] I. Chtourou, E. Fendri, and M. Hammami, "Walking Direction Estimation for
Gait Based Applications," Procedia Computer Science, vol. 126, pp. 759-767,
2018.
[32] A. Derbel, N. Mansouri, Y. B. Jemaa, B. Emile, and S. Treuillet, "Comparative
study between spatio/temporal descriptors for pedestrians recognition by gait,"
in International Conference Image Analysis and Recognition, pp. 35-42, 2013.
[33] M. Huang, H.-Z. Li, and X. Wu, "Three-dimensional pedestrian dead reckoning
method based on gait recognition," International Journal of Simulation and
Process Modelling, vol. 13, pp. 537-547, 2018.
[34] Z. Li, J. Xiong, and X. Ye, "Gait Energy Image Based on Static Region
Alignment for Pedestrian Gait Recognition," in Proceedings of the 3rd

209
International Conference on Vision, Image and Signal Processing, pp. 1-6,
2019.
[35] Y. Makihara, G. Ogi, and Y. Yagi, "Geometrically Consistent Pedestrian
Trajectory Extraction for Gait Recognition," in 2018 IEEE 9th International
Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1-11,
2018.
[36] B. Wang, T. Su, X. Jin, J. Kong, and Y. Bai, "3D reconstruction of pedestrian
trajectory with moving direction learning and optimal gait recognition,"
Complexity, vol. 2018, 2018.
[37] H. Wang, Y. Fan, B. Fang, and S. Dai, "Generalized linear discriminant analysis
based on euclidean norm for gait recognition," International Journal of
Machine Learning and Cybernetics, vol. 9, pp. 569-576, 2018.
[38] L.-F. Shi, C.-X. Qiu, D.-J. Xin, and G.-X. Liu, "Gait recognition via random
forests based on wearable inertial measurement unit," Journal of Ambient
Intelligence and Humanized Computing, pp. 1-12, 2020.
[39] S. Islam, T. Qasim, M. Yasir, N. Bhatti, H. Mahmood, and M. Zia, "Single-and
two-person action recognition based on silhouette shape and optical point
descriptors," Signal, Image and Video Processing, vol. 12, pp. 853-860, 2018.
[40] S. Govardhan and A. Vasuki, "Wavelet based iterative deformable part model
for pedestrian detection," Multimedia Tools and Applications, pp. 1-15, 2018.
[41] A. Li, L. Liu, K. Wang, S. Liu, and S. Yan, "Clothing attributes assisted person
reidentification," IEEE Transactions on Circuits and Systems for Video
Technology, vol. 25, pp. 869-878, 2015.
[42] S.-M. Li, C. Gao, J.-G. Zhu, and C.-W. Li, "Person Reidentification Using
Attribute-Restricted Projection Metric Learning," IEEE Transactions on
Circuits and Systems for Video Technology, vol. 28, pp. 1765-1776, 2018.
[43] X. Zhang, X.-Y. Jing, X. Zhu, and F. Ma, "Semi-supervised person re-
identification by similarity-embedded cycle GANs," Neural Computing and
Applications, pp. 1-10, 2020.
[44] M. Raza, Z. Chen, S.-U. Rehman, P. Wang, and P. Bao, "Appearance based
pedestrians’ head pose and body orientation estimation using deep learning,"
Neurocomputing, vol. 272, pp. 647-659, 2018.

[45] L. Cai, J. Zhu, H. Zeng, J. Chen, C. Cai, and K.-K. Ma, "Hog-assisted deep
feature learning for pedestrian gender recognition," Journal of the Franklin
Institute, vol. 355, pp. 1991-2008, 2018.
[46] M. Raza, M. Sharif, M. Yasmin, M. A. Khan, T. Saba, and S. L. Fernandes,
"Appearance based pedestrians’ gender recognition by employing stacked auto
encoders in deep learning," Future Generation Computer Systems, vol. 88, pp.
28-39, 2018.
[47] G. Antipov, S.-A. Berrani, N. Ruchaud, and J.-L. Dugelay, "Learned vs. hand-
crafted features for pedestrian gender recognition," in Proceedings of the 23rd
ACM international conference on Multimedia, pp. 1263-1266, 2015.
[48] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Pedestrian gender classification using
combined global and local parts-based convolutional neural networks," Pattern
Analysis and Applications, pp. 1-12, 2018.
[49] E. Yaghoubi, P. Alirezazadeh, E. Assunção, J. C. Neves, and H. Proença,
"Region-Based CNNs for Pedestrian Gender Recognition in Visual
Surveillance Environments," in 2019 International Conference of the
Biometrics Special Interest Group (BIOSIG), pp. 1-5, 2019.
[50] R. Q. Mínguez, I. P. Alonso, D. Fernández-Llorca, and M. Á. Sotelo,
"Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process
Dynamical Models and Pedestrian Activity Recognition," IEEE Transactions
on Intelligent Transportation Systems, pp.1803-1814, 2018.
[51] R. Cui, G. Hua, A. Zhu, J. Wu, and H. Liu, "Hard Sample Mining and Learning
for Skeleton-Based Human Action Recognition and Identification," IEEE
Access, vol. 7, pp. 8245-8257, 2019.
[52] Z. Gao, T.-t. Han, H. Zhang, Y.-b. Xue, and G.-p. Xu, "MMA: a multi-view and
multi-modality benchmark dataset for human action recognition," Multimedia
Tools and Applications, pp. 1-22, 2018.
[53] R. Vezzani, D. Baltieri, and R. Cucchiara, "People reidentification in
surveillance and forensics: A survey," ACM Computing Surveys (CSUR), vol.
46, p. 29, 2013.
[54] Y. Chen, S. Duffner, A. Stoian, J.-Y. Dufour, and A. Baskurt, "Deep and low-
level feature based attribute learning for person re-identification," Image and
Vision Computing, vol. 79, pp. 25-34, 2018.

[55] Y. Sun, M. Zhang, Z. Sun, and T. Tan, "Demographic analysis from biometric
data: Achievements, challenges, and new frontiers," IEEE transactions on
pattern analysis and machine intelligence, vol. 40, pp. 332-351, 2017.
[56] G. Azzopardi, A. Greco, A. Saggese, and M. Vento, "Fusion of domain-specific
and trainable features for gender recognition from face images," IEEE access,
vol. 6, pp. 24171-24183, 2018.
[57] S. Mane and G. Shah, "Facial Recognition, Expression Recognition, and Gender
Identification," in Data Management, Analytics and Innovation, ed: Springer,
pp. 275-290, 2019.
[58] J. Cheng, Y. Li, J. Wang, L. Yu, and S. Wang, "Exploiting effective facial
patches for robust gender recognition," Tsinghua Science and Technology, vol.
24, pp. 333-345, 2019.
[59] A. Geetha, M. Sundaram, and B. Vijayakumari, "Gender classification from
face images by mixing the classifier outcome of prime, distinct descriptors,"
Soft Computing, vol. 23, pp. 2525-2535, 2019.
[60] G. Guo, G. Mu, and Y. Fu, "Gender from body: A biologically-inspired
approach with manifold learning," in Asian Conference on Computer Vision,
pp. 236-245, 2009.
[61] R. Zhao, W. Ouyang, and X. Wang, "Unsupervised salience learning for person
re-identification," in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3586-3593, 2013.
[62] H. Yang, X. Wang, J. Zhu, W. Ma, and H. Su, "Resolution adaptive feature
extracting and fusing framework for person re-identification," Neurocomputing,
vol. 212, pp. 65-74, 2016.
[63] S. Gong, M. Cristani, C. C. Loy, and T. M. Hospedales, "The re-identification
challenge," in Person re-identification, ed: Springer, pp. 1-20, 2014.
[64] X. Jin, C. Lan, W. Zeng, G. Wei, and Z. Chen, "Semantics-Aligned
Representation Learning for Person Re-Identification," in AAAI, pp. 11173-
11180, 2020.
[65] L. An, X. Chen, and S. Yang, "Multi-graph feature level fusion for person re-
identification," Neurocomputing, pp.39-45, 2017.
[66] L. An, X. Chen, S. Yang, and X. Li, "Person Re-identification by Multi-
hypergraph Fusion," IEEE Transactions on Neural Networks and Learning
Systems, pp.2763-2774, 2016.

212
[67] Q.-Q. Ren, W.-D. Tian, and Z.-Q. Zhao, "Person Re-identification Based on
Feature Fusion," in International Conference on Intelligent Computing, pp. 65-
73, 2019.
[68] X. Wang, C. Zhao, D. Miao, Z. Wei, R. Zhang, and T. Ye, "Fusion of multiple
channel features for person re-identification," Neurocomputing, vol. 213, pp.
125-136, 2016.
[69] L. Cai, H. Zeng, J. Zhu, J. Cao, Y. Wang, and K.-K. Ma, "Cascading Scene and
Viewpoint Feature Learning for Pedestrian Gender Recognition," IEEE Internet
of Things Journal, pp.3014-3026, 2020.
[70] L. Cai, H. Zeng, J. Zhu, J. Cao, J. Hou, and C. Cai, "Multi-view joint learning
network for pedestrian gender classification," in 2017 International Symposium
on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 23-
27, 2017.
[71] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Pedestrian gender classification using
combined global and local parts-based convolutional neural networks," Pattern
Analysis and Applications, vol. 22, pp. 1469-1480, 2019.
[72] L. An, Z. Qin, X. Chen, and S. Yang, "Multi-level common space learning for
person re-identification," IEEE Transactions on Circuits and Systems for Video
Technology, vol. 28, pp. 1777-1787, 2018.
[73] C. Chahla, H. Snoussi, F. Abdallah, and F. Dornaika, "Learned versus
handcrafted features for person re-identification," International Journal of
Pattern Recognition and Artificial Intelligence, vol. 34, p. 2055009, 2020.
[74] X. Gu, T. Ni, W. Wang, and J. Zhu, "Cross-domain transfer person re-
identification via topology properties preserved local fisher discriminant
analysis," Journal of Ambient Intelligence and Humanized Computing, pp. 1-
11, 2020.
[75] H. Li, W. Zhou, Z. Yu, B. Yang, and H. Jin, "Person re-identification with
dictionary learning regularized by stretching regularization and label
consistency constraint," Neurocomputing, vol. 379, pp. 356-369, 2020.
[76] X. Xu, Y. Chen, and Q. Chen, "Dynamic Hybrid Graph Matching for
Unsupervised Video-based Person Re-identification," International Journal on
Artificial Intelligence Tools, vol. 29, p. 2050004, 2020.

[77] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal
occurrence representation and metric learning," in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 2197-2206., 2015.
[78] R. Zhao, W. Ouyang, and X. Wang, "Person re-identification by salience
matching," in Proceedings of the IEEE International Conference on Computer
Vision, pp. 2528-2535, 2013.
[79] R. Zhao, W. Ouyang, and X. Wang, "Learning mid-level filters for person re-
identification," in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 144-151, 2014.
[80] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video
ranking," in European Conference on Computer Vision, pp. 688-703, 2014.
[81] Y. Geng, H.-M. Hu, G. Zeng, and J. Zheng, "A person re-identification
algorithm by exploiting region-based feature salience," Journal of Visual
Communication and Image Representation, vol. 29, pp. 89-102, 2015.
[82] L. An, S. Yang, and B. Bhanu, "Person re-identification by robust canonical
correlation analysis," IEEE signal processing letters, vol. 22, pp. 1103-1107,
2015.
[83] L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, and Q. Tian, "Query-adaptive late
fusion for image search and person re-identification," in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 1741-1750,
2015.
[84] L. An, X. Chen, S. Yang, and X. Li, "Person re-identification by multi-
hypergraph fusion," IEEE transactions on neural networks and learning
systems, vol. 28, pp. 2763-2774, 2017.
[85] X. Yang, M. Wang, R. Hong, Q. Tian, and Y. Rui, "Enhancing Person Re-
identification in a Self-trained Subspace," arXiv preprint arXiv:1704.06020,
2017.
[86] Y.-J. Cho and K.-J. Yoon, "Improving person re-identification via pose-aware
multi-shot matching," in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1354-1362, 2016.
[87] L. Zhang, T. Xiang, and S. Gong, "Learning a discriminative null space for
person re-identification," in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1239-1248, 2016.

[88] L. An, M. Kafai, S. Yang, and B. Bhanu, "Person reidentification with reference
descriptor," IEEE Transactions on Circuits and Systems for Video Technology,
vol. 26, pp. 776-787, 2016.
[89] L. An, X. Chen, S. Yang, and B. Bhanu, "Sparse representation matching for
person re-identification," Information Sciences, vol. 355, pp. 74-89, 2016.
[90] C. Chahla, H. Snoussi, F. Abdallah, and F. Dornaika, "Discriminant quaternion
local binary pattern embedding for person re-identification through prototype
formation and color categorization," Engineering Applications of Artificial
Intelligence, vol. 58, pp. 27-33, 2017.
[91] T. Li, L. Sun, C. Han, and J. Guo, "Salient Region-Based Least-Squares Log-
Density Gradient Clustering for Image-To-Video Person Re-Identification,"
IEEE ACCESS, vol. 6, pp. 8638-8648, 2018.
[92] L. Zhang, K. Li, Y. Zhang, Y. Qi, and L. Yang, "Adaptive image segmentation
based on color clustering for person re-identification," Soft Computing, vol. 21,
pp. 5729-5739, 2017.
[93] J. H. Shah, M. Lin, and Z. Chen, "Multi-camera handoff for person re-
identification," Neurocomputing, vol. 191, pp. 238-248, 2016.
[94] H. Chu, M. Qi, H. Liu, and J. Jiang, "Local region partition for person re-
identification," Multimedia Tools and Applications, pp. 1-17, 2017.
[95] A. Nanda, P. K. Sa, S. K. Choudhury, S. Bakshi, and B. Majhi, "A neuromorphic
person re-identification framework for video surveillance," IEEE Access, vol.
5, pp. 6471-6482, 2017.
[96] A. Nanda, P. K. Sa, D. S. Chauhan, and B. Majhi, "A person re-identification
framework by inlier-set group modeling for video surveillance," Journal of
Ambient Intelligence and Humanized Computing, vol. 10, pp. 13-25, 2019.
[97] D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an
ensemble of localized features," Computer Vision–ECCV 2008, pp. 262-275,
2008.
[98] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, "Person re-
identification by symmetry-driven accumulation of local features," in Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2360-
2367, 2010.

[99] X. Ye, W.-y. Zhou, and L.-a. Dong, "Body Part-Based Person Re-identification
Integrating Semantic Attributes," Neural Processing Letters, vol. 49, pp. 1111-
1124, 2019.
[100] J. Dai, Y. Zhang, H. Lu, and H. Wang, "Cross-view semantic projection learning
for person re-identification," Pattern Recognition, vol. 75, pp. 63-76, 2018.
[101] C.-H. Kuo, S. Khamis, and V. Shet, "Person re-identification using semantic
color names and rankboost," in Applications of Computer Vision (WACV), 2013
IEEE Workshop on, pp. 281-287, 2013.
[102] X. Ye, W.-y. Zhou, and L.-a. Dong, "Body Part-Based Person Re-identification
Integrating Semantic Attributes," Neural Processing Letters, pp. 1-14, 2018.
[103] I. Kviatkovsky, A. Adam, and E. Rivlin, "Color invariants for person
reidentification," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 35, pp. 1622-1634, 2013.
[104] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li, "Salient color names for
person re-identification," in European conference on computer vision, pp. 536-
551, 2014.
[105] F. Xiong, M. Gou, O. Camps, and M. Sznaier, "Person re-identification using
kernel-based metric learning methods," in European conference on computer
vision, pp. 1-16, 2014.
[106] Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai, "Person re-identification by
camera correlation aware feature augmentation," IEEE transactions on pattern
analysis and machine intelligence, vol. 40, pp. 392-408, 2018.
[107] N. M. Z. Hashim, Y. Kawanishi, D. Deguchi, I. Ide, and H. Murase,
"Simultaneous image matching for person re‐identification via the stable
marriage algorithm," IEEJ Transactions on Electrical and Electronic
Engineering, vol. 15, pp. 909-917, 2020.
[108] W. R. Schwartz and L. S. Davis, "Learning discriminative appearance-based
models using partial least squares," in Computer Graphics and Image
Processing (SIBGRAPI), 2009 XXII Brazilian Symposium on, pp. 322-329,
2009.
[109] W.-S. Zheng, S. Gong, and T. Xiang, "Person re-identification by probabilistic
relative distance comparison," in Computer vision and pattern recognition
(CVPR), 2011 IEEE conference on, pp. 649-656, 2011.

[110] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, "Local fisher
discriminant analysis for pedestrian re-identification," in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 3318-3325,
2013.
[111] Y. Li, Z. Wu, S. Karanam, and R. J. Radke, "Multi-Shot Human Re-
Identification Using Adaptive Fisher Discriminant Analysis," in BMVC, p. 2,
2015.
[112] S. Bąk and P. Carr, "Person re-identification using deformable patch metric
learning," in 2016 IEEE Winter Conference on Applications of Computer Vision
(WACV), pp. 1-9, 2016.
[113] W. Li, R. Zhao, and X. Wang, "Human reidentification with transferred metric
learning," in Asian Conference on Computer Vision, pp. 31-44, 2012.
[114] W.-S. Zheng, S. Gong, and T. Xiang, "Reidentification by relative distance
comparison," IEEE transactions on pattern analysis and machine intelligence,
vol. 35, pp. 653-668, 2013.
[115] Y. Xie, H. Yu, X. Gong, and M. D. Levine, "Adaptive Metric Learning and
Probe-Specific Reranking for Person Reidentification," IEEE Signal Processing
Letters, vol. 24, pp. 853-857, 2017.
[116] X. Liu, X. Ma, J. Wang, and H. Wang, "M3L: Multi-modality mining for metric
learning in person re-Identification," Pattern Recognition, 2017.
[117] J. Li, A. J. Ma, and P. C. Yuen, "Semi-supervised Region Metric Learning for
Person Re-identification," International Journal of Computer Vision, pp. 1-20,
2018.
[118] F. Ma, X. Zhu, X. Zhang, L. Yang, M. Zuo, and X.-Y. Jing, "Low illumination
person re-identification," Multimedia Tools and Applications, vol. 78, pp. 337-
362, 2019.
[119] G. Feng, W. Liu, D. Tao, and Y. Zhou, "Hessian Regularized Distance Metric
Learning for People Re-Identification," Neural Processing Letters, pp. 1-14,
2019.
[120] J. Jia, Q. Ruan, Y. Jin, G. An, and S. Ge, "View-specific Subspace Learning and
Re-ranking for Semi-supervised Person Re-identification," Pattern
Recognition, p. 107568, 2020.

[121] Q. Zhou, S. Zheng, H. Ling, H. Su, and S. Wu, "Joint dictionary and metric
learning for person re-identification," Pattern Recognition, vol. 72, pp. 196-206,
2017.
[122] T. Ni, Z. Ding, F. Chen, and H. Wang, "Relative Distance Metric Leaning Based
on Clustering Centralization and Projection Vectors Learning for Person Re-
Identification," IEEE Access, vol. 6, pp. 11405-11411, 2018.
[123] C. Zhao, Y. Chen, Z. Wei, D. Miao, and X. Gu, "QRKISS: a two-stage metric
learning via QR-decomposition and KISS for person re-identification," Neural
Processing Letters, vol. 49, pp. 899-922, 2019.
[124] W. Ma, H. Han, Y. Kong, and Y. Zhang, "A New Date-Balanced Method Based
on Adaptive Asymmetric and Diversity Regularization in Person Re-
Identification," International Journal of Pattern Recognition and Artificial
Intelligence, vol. 34, p. 2056004, 2020.
[125] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo, "Person re-
identification by iterative re-weighted sparse ranking," IEEE transactions on
pattern analysis and machine intelligence, vol. 37, pp. 1629-1642, 2015.
[126] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by
discriminative selection in video ranking," IEEE transactions on pattern
analysis and machine intelligence, vol. 38, pp. 2501-2514, 2016.
[127] L. An, X. Chen, and S. Yang, "Person re-identification via hypergraph-based
matching," Neurocomputing, vol. 182, pp. 247-254, 2016.
[128] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with
deep convolutional neural networks," in Advances in neural information
processing systems, pp. 1097-1105, 2012.
[129] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional
networks for accurate object detection and segmentation," IEEE transactions
on pattern analysis and machine intelligence, vol. 38, pp. 142-158, 2016.
[130] R. R. Varior, M. Haloi, and G. Wang, "Gated siamese convolutional neural
network architecture for human re-identification," in European conference on
computer vision, pp. 791-808, 2016.
[131] Z. Zhang and T. Si, "Learning deep features from body and parts for person re-
identification in camera networks," EURASIP Journal on Wireless
Communications and Networking, vol. 2018, p. 52, 2018.

[132] Z. Zhang and M. Huang, "Learning local embedding deep features for person
re-identification in camera networks," EURASIP Journal on Wireless
Communications and Networking, vol. 2018, p. 85, 2018.
[133] Y. Huang, H. Sheng, Y. Zheng, and Z. Xiong, "DeepDiff: Learning deep
difference features on human body parts for person re-identification,"
Neurocomputing, vol. 241, pp. 191-203, 2017.
[134] X. Xin, J. Wang, R. Xie, S. Zhou, W. Huang, and N. Zheng, "Semi-supervised
person Re-Identification using multi-view clustering," Pattern Recognition, vol.
88, pp. 285-297, 2019.
[135] G. Wang, J. Lai, and X. Xie, "P2snet: Can an image match a video for person
re-identification in an end-to-end way?," IEEE Transactions on Circuits and
Systems for Video Technology, vol. 28, pp. 2777-2787, 2017.
[136] C. Yuan, C. Xu, T. Wang, F. Liu, Z. Zhao, P. Feng, et al., "Deep multi-instance
learning for end-to-end person re-identification," Multimedia Tools and
Applications, vol. 77, pp. 12437-12467, 2018.
[137] J. Nie, L. Huang, W. Zhang, G. Wei, and Z. Wei, "Deep Feature Ranking for
Person Re-Identification," IEEE Access, vol. 7, pp. 15007-15017, 2019.
[138] Y. Hu, X. Cai, D. Huang, and Y. Liu, "LWA: A lightweight and accurate
method for unsupervised Person Re-identification," in 2019 6th International
Conference on Systems and Informatics (ICSAI), pp. 1314-1318, 2019.
[139] R. Aburasain, "Application of convolutional neural networks in object
detection, re-identification and recognition," Loughborough University, 2020.
[140] W. Zhang, L. Huang, Z. Wei, and J. Nie, "Appearance feature enhancement for
person re-identification," Expert Systems with Applications, vol. 163, p. 113771,
2021.
[141] X. Zhang, S. Li, X.-Y. Jing, F. Ma, and C. Zhu, "Unsupervised domain adaption
for image-to-video person re-identification," Multimedia Tools and
Applications, pp. 1-18, 2020.
[142] J. Lv, W. Chen, Q. Li, and C. Yang, "Unsupervised cross-dataset person re-
identification by transfer learning of spatial-temporal patterns," in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7948-
7956, 2018.

[143] M. Afifi and A. Abdelhamed, "AFIF4: deep gender classification based on
adaboost-based fusion of isolated facial features and foggy faces," Journal of
Visual Communication and Image Representation, vol. 62, pp. 77-86, 2019.
[144] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "A review of facial gender recognition,"
Pattern Analysis and Applications, vol. 18, pp. 739-755, 2015.
[145] M. Raza, C. Zonghai, S. U. Rehman, G. Zhenhua, W. Jikai, and B. Peng, "Part-
wise pedestrian gender recognition via deep convolutional neural networks,"
2017.
[146] C. Zhao, X. Wang, W. K. Wong, W. Zheng, J. Yang, and D. Miao, "Multiple
Metric Learning based on Bar-shape Descriptor for Person Re-Identification,"
Pattern Recognition, 2017.
[147] X. Gao, F. Gao, D. Tao, and X. Li, "Universal blind image quality assessment
metrics via natural scene statistics and multiple kernel learning," IEEE
transactions on neural networks and learning systems, vol. 24, 2013.
[148] C. BenAbdelkader and P. Griffin, "A local region-based approach to gender
classification from face images," in 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR'05)-Workshops, pp. 52-
52, 2005.
[149] E. Eidinger, R. Enbar, and T. Hassner, "Age and gender estimation of unfiltered
faces," IEEE Transactions on Information Forensics and Security, vol. 9, pp.
2170-2179, 2014.
[150] H. A. Ahmed, T. A. Rashid, and A. Sidiq, "Face behavior recognition through
support vector machines," International Journal of Advanced Computer Science
and Applications, vol. 7, pp. 101-108, 2016.
[151] N. Sun, W. Zheng, C. Sun, C. Zou, and L. Zhao, "Gender classification based
on boosting local binary pattern," in International Symposium on Neural
Networks, pp. 194-201, 2006.
[152] C. Shan, "Learning local binary patterns for gender classification on real-world
face images," Pattern recognition letters, vol. 33, pp. 431-437, 2012.
[153] J.-G. Wang, J. Li, W.-Y. Yau, and E. Sung, "Boosting dense SIFT descriptors
and shape contexts of face images for gender recognition," in 2010 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition-
Workshops, pp. 96-102, 2010.

[154] J. Bekios-Calfa, J. M. Buenaposada, and L. Baumela, "Robust gender
recognition by exploiting facial attributes dependencies," Pattern Recognition
Letters, vol. 36, pp. 228-234, 2014.
[155] B. Patel, R. Maheshwari, and B. Raman, "Compass local binary patterns for
gender recognition of facial photographs and sketches," Neurocomputing, vol.
218, pp. 203-215, 2016.
[156] X. Li, X. Zhao, Y. Fu, and Y. Liu, "Bimodal gender recognition from face and
fingerprint," in 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pp. 2590-2597, 2010.
[157] S. E. Bekhouche, A. Ouafi, F. Dornaika, A. Taleb-Ahmed, and A. Hadid,
"Pyramid multi-level features for facial demographic estimation," Expert
Systems with Applications, vol. 80, pp. 297-310, 2017.
[158] C. P. Divate and S. Z. Ali, "Study of Different Bio-Metric Based Gender
Classification Systems," in 2018 International Conference on Inventive
Research in Computing Applications (ICIRCA), pp. 347-353, 2018.
[159] A. M. Ali and T. A. Rashid, "Kernel Visual Keyword Description for Object
and Place Recognition," in Advances in Signal Processing and Intelligent
Recognition Systems, ed: Springer, pp. 27-38, 2016.
[160] B. Moghaddam and M.-H. Yang, "Learning gender with support faces," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 707-
711, 2002.
[161] J. Bekios-Calfa, J. M. Buenaposada, and L. Baumela, "Revisiting linear
discriminant techniques in gender recognition," IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 33, pp. 858-864, 2010.
[162] M. Duan, K. Li, C. Yang, and K. Li, "A hybrid deep learning CNN–ELM for
age and gender classification," Neurocomputing, vol. 275, pp. 448-461, 2018.
[163] A. Dhomne, R. Kumar, and V. Bhan, "Gender Recognition Through Face Using
Deep Learning," Procedia Computer Science, vol. 132, pp. 2-10, 2018.
[164] K. Zhang, L. Tan, Z. Li, and Y. Qiao, "Gender and smile classification using
deep convolutional neural networks," in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops, pp. 34-38, 2016.
[165] J. Mansanet, A. Albiol, and R. Paredes, "Local deep neural networks for gender
recognition," Pattern Recognition Letters, vol. 70, pp. 80-86, 2016.
[166] L. Cao, M. Dikmen, Y. Fu, and T. S. Huang, "Gender recognition from body,"
in Proceedings of the 16th ACM international conference on Multimedia, pp.
725-728, 2008.
[167] K. Ahmad, A. Sohail, N. Conci, and F. De Natale, "A Comparative study of
Global and Deep Features for the analysis of user-generated natural disaster
related images," in 2018 IEEE 13th Image, Video, and Multidimensional Signal
Processing Workshop (IVMSP), pp. 1-5, 2018.
[168] M. Collins, J. Zhang, P. Miller, and H. Wang, "Full body image feature
representations for gender profiling," in 2009 IEEE 12th International
Conference on Computer Vision Workshops, ICCV Workshops, pp. 1235-1242,
2009.
[169] C. D. Geelen, R. G. Wijnhoven, and G. Dubbelman, "Gender classification in
low-resolution surveillance video: in-depth comparison of random forests and
SVMs," in Video Surveillance and Transportation Imaging Applications, p.
94070M, 2015.
[170] V. A. Sindagi and V. M. Patel, "A survey of recent advances in cnn-based single
image crowd counting and density estimation," Pattern Recognition Letters,
vol. 107, pp. 3-16, 2018.
[171] C. Li, J. Guo, F. Porikli, and Y. Pang, "Lightennet: A convolutional neural
network for weakly illuminated image enhancement," Pattern Recognition
Letters, vol. 104, pp. 15-22, 2018.
[172] M. Rashid, M. A. Khan, M. Sharif, M. Raza, M. M. Sarfraz, and F. Afza,
"Object detection and classification: a joint selection and fusion strategy of deep
convolutional neural network and SIFT point features," Multimedia Tools and
Applications, vol. 78, pp. 15751-15777, 2019.
[173] M. A. Khan, T. Akram, M. Sharif, M. Awais, K. Javed, H. Ali, et al., "CCDF:
Automatic system for segmentation and recognition of fruit crops diseases
based on correlation coefficient and deep CNN features," Computers and
Electronics in Agriculture, vol. 155, pp. 220-236, 2018.
[174] M. Sharif, M. Attique Khan, M. Rashid, M. Yasmin, F. Afza, and U. J. Tanik,
"Deep CNN and geometric features-based gastrointestinal tract diseases
detection and classification from wireless capsule endoscopy images," Journal
of Experimental & Theoretical Artificial Intelligence, pp. 1-23, 2019.
[175] T. A. Rashid, P. Fattah, and D. K. Awla, "Using accuracy measure for
improving the training of lstm with metaheuristic algorithms," Procedia
Computer Science, vol. 140, pp. 324-333, 2018.
[176] T. A. Rashid, "Convolutional neural networks based method for improving
facial expression recognition," in The International Symposium on Intelligent
Systems Technologies and Applications, pp. 73-84, 2016.
[177] A. S. Shamsaldin, P. Fattah, T. A. Rashid, and N. K. Al-Salihi, "A Study of The
Convolutional Neural Networks Applications," UKH Journal of Science and
Engineering, vol. 3, pp. 31-40, 2019.
[178] M. A. Uddin and Y.-K. Lee, "Feature fusion of deep spatial features and
handcrafted spatiotemporal features for human action recognition," Sensors,
vol. 19, p. 1599, 2019.
[179] X. Fan and T. Tjahjadi, "Fusing dynamic deep learned features and handcrafted
features for facial expression recognition," Journal of Visual Communication
and Image Representation, vol. 65, p. 102659, 2019.
[180] M.-I. Georgescu, R. T. Ionescu, and M. Popescu, "Local learning with deep and
handcrafted features for facial expression recognition," IEEE Access, vol. 7, pp.
64827-64836, 2019.
[181] A. M. Hasan, H. A. Jalab, F. Meziane, H. Kahtan, and A. S. Al-Ahmad,
"Combining deep and handcrafted image features for MRI brain scan
classification," IEEE Access, vol. 7, pp. 79959-79967, 2019.
[182] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Comparing image representations for
training a convolutional neural network to classify gender," in 2013 1st
International Conference on Artificial Intelligence, Modelling and Simulation,
pp. 29-33, 2013.
[183] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Training strategy for convolutional neural
networks in pedestrian gender classification," in Second International
Workshop on Pattern Recognition, p. 104431A, 2017.
[184] M. Tkalcic and J. F. Tasic, Colour spaces: perceptual, historical and
applicational background, vol. 1. IEEE, 2003.
[185] A. Hu, R. Zhang, D. Yin, and Y. Zhan, "Image quality assessment using a SVD-
based structural projection," Signal Processing: Image Communication, vol. 29,
pp. 293-302, 2014.
[186] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and
rotation invariant texture classification with local binary patterns," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 971-
987, 2002.
[187] M. Verma, B. Raman, and S. Murala, "Local extrema co-occurrence pattern for
color and texture image retrieval," Neurocomputing, vol. 165, pp. 255-269,
2015.
[188] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection,"
in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, pp. 886-893, 2005.
[189] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented
histograms of flow and appearance," in European conference on computer
vision, pp. 428-441, 2006.
[190] S. Zhang and X. Wang, "Human detection and object tracking based on
Histograms of Oriented Gradients," in Natural Computation (ICNC), 2013
Ninth International Conference on, pp. 1349-1353, 2013.
[191] O. Déniz, G. Bueno, J. Salido, and F. De la Torre, "Face recognition using
histograms of oriented gradients," Pattern Recognition Letters, vol. 32, pp.
1598-1603, 2011.
[192] K. Seemanthini and S. Manjunath, "Human Detection and Tracking using HOG
for Action Recognition," Procedia Computer Science, vol. 132, pp. 1317-1326,
2018.
[193] M. Dash and P. W. Koot, "Feature selection for clustering," in Encyclopedia of
database systems, ed: Springer, pp. 1119-1125, 2009.
[194] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Advances
in neural information processing systems, pp. 1601-1608, 2005.
[195] S. Ben-David, D. Pál, and H. U. Simon, "Stability of k-means clustering," in
International Conference on Computational Learning Theory, pp. 20-34, 2007.
[196] C. Zhong, X. Yue, Z. Zhang, and J. Lei, "A clustering ensemble: Two-level-
refined co-association matrix with path-based transformation," Pattern
Recognition, vol. 48, pp. 2699-2709, 2015.
[197] Y. D. Zhang, S. Chen, S. H. Wang, J. F. Yang, and P. Phillips, "Magnetic
resonance brain image classification based on weighted-type fractional Fourier
transform and nonparallel support vector machine," International Journal of
Imaging Systems and Technology, vol. 25, pp. 317-327, 2015.
[198] L. Chen, C. P. Chen, and M. Lu, "A multiple-kernel fuzzy c-means algorithm
for image segmentation," IEEE Transactions on Systems, Man, and
Cybernetics, Part B (Cybernetics), vol. 41, pp. 1263-1274, 2011.
[199] S. Wu, Y.-C. Chen, X. Li, A.-C. Wu, J.-J. You, and W.-S. Zheng, "An enhanced
deep feature representation for person re-identification," in Applications of
Computer Vision (WACV), 2016 IEEE Winter Conference on, pp. 1-8, 2016.
[200] Y. Gan, "Facial Expression Recognition Using Convolutional Neural Network,"
in Proceedings of the 2nd International Conference on Vision, Image and Signal
Processing, p. 29, 2018.
[201] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., "Going
deeper with convolutions," in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1-9, 2015.
[202] O. Pele and M. Werman, "The quadratic-chi histogram distance family," in
European conference on computer vision, pp. 749-762, 2010.
[203] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection,"
2005.
[204] Z. Qi, Y. Tian, and Y. Shi, "Efficient railway tracks detection and turnouts
recognition method using HOG features," Neural Computing and Applications,
vol. 23, pp. 245-254, 2013.
[205] K. W. Chee and S. S. Teoh, "Pedestrian Detection in Visual Images Using
Combination of HOG and HOM Features," in 10th International Conference on
Robotics, Vision, Signal Processing and Power Applications, pp. 591-597,
2019.
[206] Y. Wei, Q. Tian, J. Guo, W. Huang, and J. Cao, "Multi-vehicle detection
algorithm through combining Harr and HOG features," Mathematics and
Computers in Simulation, vol. 155, pp. 130-145, 2019.
[207] K. Firuzi, M. Vakilian, B. T. Phung, and T. R. Blackburn, "Partial discharges
pattern recognition of transformer defect model by LBP & HOG features," IEEE
Transactions on Power Delivery, vol. 34, pp. 542-550, 2018.
[208] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, "Joint detection and identification
feature learning for person search," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 3415-3424, 2017.
[209] J. Xu, L. Luo, C. Deng, and H. Huang, "Bilevel distance metric learning for
robust image recognition," in Advances in Neural Information Processing
Systems, pp. 4198-4207, 2018.
[210] I. N. Junejo, "A Deep Learning Based Multi-color Space Approach for
Pedestrian Attribute Recognition," in Proceedings of the 2019 3rd International
Conference on Graphics and Signal Processing, pp. 113-116, 2019.
[211] S. Liao, G. Zhao, V. Kellokumpu, M. Pietikäinen, and S. Z. Li, "Modeling pixel
process with scale invariant local patterns for background subtraction in
complex scenes," in 2010 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pp. 1301-1306, 2010.
[212] S. Arora and M. Bhatia, "A Robust Approach for Gender Recognition Using
Deep Learning," in 2018 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), pp. 1-6, 2018.
[213] R. D. Labati, E. Muñoz, V. Piuri, R. Sassi, and F. Scotti, "Deep-ECG:
Convolutional neural networks for ECG biometric recognition," Pattern
Recognition Letters, 2018.
[214] E.-J. Cheng, K.-P. Chou, S. Rajora, B.-H. Jin, M. Tanveer, C.-T. Lin, et al.,
"Deep Sparse Representation Classifier for facial recognition and detection
system," Pattern Recognition Letters, vol. 125, pp. 71-77, 2019.
[215] M. Fayyaz, M. Yasmin, M. Sharif, J. H. Shah, M. Raza, and T. Iqbal, "Person
re-identification with features-based clustering and deep features," Neural
Computing and Applications, pp. 1-22, 2019.
[216] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural
networks for the scene classification of high-resolution remote sensing
imagery," Remote Sensing, vol. 7, pp. 14680-14707, 2015.
[217] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image
recognition," in Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 770-778, 2016.
[218] M. A. Khan, T. Akram, M. Sharif, M. Y. Javed, N. Muhammad, and M. Yasmin,
"An implementation of optimized framework for action classification using
multilayers neural network on selected fused features," Pattern Analysis and
Applications, pp. 1-21, 2018.
[219] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text
classification," in IJCAI-99 workshop on machine learning for information
filtering, pp. 61-67, 1999.
[220] C. L. Morais, K. M. Lima, and F. L. Martin, "Uncertainty estimation and
misclassification probability for classification models based on discriminant
analysis and support vector machines," Analytica Chimica Acta, vol. 1063, pp.
40-46, 2019.
[221] K. Radhika and S. Varadarajan, "Ensemble Subspace Discriminant
Classification of Satellite Images," 2018.
[222] A. S. Shirkhorshidi, S. Aghabozorgi, and T. Y. Wah, "A comparison study on
similarity and dissimilarity measures in clustering continuous data," PloS one,
vol. 10, 2015.
[223] K. Lekdioui, R. Messoussi, Y. Ruichek, Y. Chaabi, and R. Touahni, "Facial
decomposition for expression recognition using texture/shape descriptors and
SVM classifier," Signal Processing: Image Communication, vol. 58, pp. 300-
312, 2017.
[224] X.-X. Niu and C. Y. Suen, "A novel hybrid CNN–SVM classifier for
recognizing handwritten digits," Pattern Recognition, vol. 45, pp. 1318-1325,
2012.
[225] L. Ma, H. Liu, L. Hu, C. Wang, and Q. Sun, "Orientation driven bag of
appearances for person re-identification," arXiv preprint arXiv:1605.02464,
2016.
[226] W.-L. Wei, J.-C. Lin, Y.-Y. Lin, and H.-Y. M. Liao, "What makes you look like
you: Learning an inherent feature representation for person re-identification,"
in 2019 16th IEEE International Conference on Advanced Video and Signal
Based Surveillance (AVSS), pp. 1-6, 2019.
[227] D. Moctezuma, E. S. Tellez, S. Miranda-Jiménez, and M. Graff, "Appearance
model update based on online learning and soft-biometrics traits for people re-
identification in multi-camera environments," IET Image Processing, vol. 13,
pp. 2162-2168, 2019.
[228] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for
deep learning," Journal of Big Data, vol. 6, p. 60, 2019.
[229] W. Hou, Y. Wei, Y. Jin, and C. Zhu, "Deep features based on a DCNN model
for classifying imbalanced weld flaw types," Measurement, vol. 131, pp. 482-
489, 2019.
[230] N. Hussain, M. A. Khan, M. Sharif, S. A. Khan, A. A. Albesher, T. Saba, et al.,
"A deep neural network and classical features based scheme for objects
recognition: an application for machine inspection," Multimedia Tools and
Applications, https://doi.org/10.1007/s11042-020-08852-3, 2020.
[231] N. Gour and P. Khanna, "Automated glaucoma detection using GIST and
pyramid histogram of oriented gradients (PHOG) descriptors," Pattern
Recognition Letters, 2019.
[232] I. Murtza, D. Abdullah, A. Khan, M. Arif, and S. M. Mirza, "Cortex-inspired
multilayer hierarchy based object detection system using PHOG descriptors and
ensemble classification," The Visual Computer, vol. 33, pp. 99-112, 2017.
[233] W. El-Tarhouni, L. Boubchir, M. Elbendak, and A. Bouridane, "Multispectral
palmprint recognition using Pascal coefficients-based LBP and PHOG
descriptors with random sampling," Neural Computing and Applications, vol.
31, pp. 593-603, 2019.
[234] D. A. Abdullah, M. H. Akpinar, and A. Sengür, "Local feature descriptors based
ECG beat classification," Health Information Science and Systems, vol. 8, p. 20, 2020.
[235] T. Xu, H. Zhang, C. Xin, E. Kim, L. R. Long, Z. Xue, et al., "Multi-feature
based benchmark for cervical dysplasia classification evaluation," Pattern
Recognition, vol. 63, pp. 468-475, 2017.
[236] P. Liu, J.-M. Guo, K. Chamnongthai, and H. Prasetyo, "Fusion of color
histogram and LBP-based features for texture image retrieval and
classification," Information Sciences, vol. 390, pp. 95-111, 2017.
[237] Y. Mistry, D. Ingole, and M. Ingole, "Content based image retrieval using
hybrid features and various distance metric," Journal of Electrical Systems and
Information Technology, vol. 5, pp. 874-888, 2018.
[238] N. Danapur, S. A. A. Dizaj, and V. Rostami, "An efficient image retrieval based
on an integration of HSV, RLBP, and CENTRIST features using ensemble
classifier learning," Multimedia Tools and Applications, pp. 1-24, 2020.
[239] H. Fujiyoshi, T. Hirakawa, and T. Yamashita, "Deep learning-based image
recognition for autonomous driving," IATSS research, vol. 43, pp. 244-252,
2019.
[240] J.-H. Choi and J.-S. Lee, "EmbraceNet: A robust deep learning architecture for
multimodal classification," Information Fusion, vol. 51, pp. 259-270, 2019.
[241] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, "Resunet-a: a deep
learning framework for semantic segmentation of remotely sensed data," ISPRS
Journal of Photogrammetry and Remote Sensing, vol. 162, pp. 94-114, 2020.
[242] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, et al., "Deep learning
for generic object detection: A survey," International Journal of Computer
Vision, vol. 128, pp. 261-318, 2020.
[243] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, et al., "Imagenet
large scale visual recognition challenge," International Journal of Computer
Vision, vol. 115, pp. 211-252, 2015.
[244] M. Toğaçar, B. Ergen, Z. Cömert, and F. Özyurt, "A deep feature learning
model for pneumonia detection applying a combination of mRMR feature
selection and machine learning models," IRBM, 2019.
[245] B. Yuan, L. Han, X. Gu, and H. Yan, "Multi-deep features fusion for high-
resolution remote sensing image scene classification," Neural Computing and
Applications, pp. 1-17, 2020.
[246] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, inception-
resnet and the impact of residual connections on learning," arXiv preprint
arXiv:1602.07261, 2016.
[247] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely
connected convolutional networks," in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 4700-4708, 2017.
[248] I. Lizarazo, "SVM‐based segmentation and classification of remotely sensed
data," International Journal of Remote Sensing, vol. 29, pp. 7277-7283, 2008.
[249] B. S. Bhati and C. Rai, "Analysis of Support Vector Machine-based Intrusion
Detection Techniques," Arabian Journal for Science and Engineering, vol. 45,
pp. 2371-2383, 2020.
[250] V. E. Liong, J. Lu, and Y. Ge, "Regularized local metric learning for person re-
identification," Pattern Recognition Letters, vol. 68, pp. 288-296, 2015.
[251] X. Yang and P. Chen, "Person re-identification based on multi-scale
convolutional network," Multimedia Tools and Applications, pp. 1-15, 2019.
[252] H. Wang, D. Oneata, J. Verbeek, and C. Schmid, "A robust and efficient video
representation for action recognition," International Journal of Computer
Vision, vol. 119, pp. 219-238, 2016.
[253] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for
action recognition," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, pp. 1510-1517, 2018.
[254] C. Su, S. Zhang, F. Yang, G. Zhang, Q. Tian, W. Gao, et al., "Attributes driven
tracklet-to-tracklet person re-identification using latent prototypes space
mapping," Pattern Recognition, vol. 66, pp. 4-15, 2017.
[255] S. Karanam, Y. Li, and R. J. Radke, "Person re-identification with
discriminatively trained viewpoint invariant dictionaries," in Proceedings of the
IEEE International Conference on Computer Vision, pp. 4516-4524, 2015.
[256] X. Yang, M. Wang, R. Hong, Q. Tian, and Y. Rui, "Enhancing person re-
identification in a self-trained subspace," ACM Transactions on Multimedia
Computing, Communications, and Applications (TOMM), vol. 13, pp. 1-23,
2017.
[257] C. Ying and X. Xiaoyue, "Matrix Metric Learning for Person Re-identification
Based on Bidirectional Reference Set," vol. 42, pp. 394-402, 2020.
[258] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, et al., "Unsupervised
cross-dataset transfer learning for person re-identification," in Proceedings of
the IEEE conference on computer vision and pattern recognition, pp. 1306-
1315, 2016.
[259] D. Zhang, W. Wu, H. Cheng, R. Zhang, Z. Dong, and Z. Cai, "Image-to-video
person re-identification with temporally memorized similarity learning," IEEE
Transactions on Circuits and Systems for Video Technology, vol. 28, pp. 2622-
2632, 2017.
[260] X. Zhu, X.-Y. Jing, X. You, W. Zuo, S. Shan, and W.-S. Zheng, "Image to video
person re-identification by learning heterogeneous dictionary pair with feature
projection matrix," IEEE Transactions on Information Forensics and Security,
vol. 13, pp. 717-732, 2017.
[261] B. Yu and N. Xu, "Urgent image-to-video person reidentification by cross-
media transfer cycle generative adversarial networks," Journal of Electronic
Imaging, vol. 28, p. 013052, 2019.
[262] F. Ma, X. Zhu, Q. Liu, C. Song, X.-Y. Jing, and D. Ye, "Multi-view coupled
dictionary learning for person re-identification," Neurocomputing, vol. 348, pp.
16-26, 2019.
[263] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at
far distance," in Proceedings of the 22nd ACM international conference on
Multimedia, pp. 789-792, 2014.
[264] W. Zhu, J. Miao, L. Qing, and G.-B. Huang, "Hierarchical extreme learning
machine for unsupervised representation learning," in 2015 International Joint
Conference on Neural Networks (IJCNN), pp. 1-8, 2015.
[265] A. Ali-Gombe and E. Elyan, "MFC-GAN: class-imbalanced dataset
classification using multiple fake class generative adversarial network,"
Neurocomputing, vol. 361, pp. 212-221, 2019.
[266] M. Fayyaz, M. Yasmin, M. Sharif, and M. Raza, "J-LDFR: joint low-level and
deep neural network feature representations for pedestrian gender
classification," Neural Computing and Applications, 2020.
[267] Z. Wang, J. Jiang, Y. Yu, and S. i. Satoh, "Incremental re-identification by
cross-direction and cross-ranking adaption," IEEE Transactions on Multimedia,
vol. 21, pp. 2376-2386, 2019.
[268] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-
scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
Appendix A
A.1 English Proof Reading Certificate