Appearance based Pedestrian Analysis using
Machine Learning
By
Muhammad Fayyaz
CIIT/SP15-PCS-001/WAH
PhD Thesis
In
Computer Science
A Thesis Presented to
In partial fulfillment
of the requirement for the degree of
PhD in Computer Science
By
Muhammad Fayyaz
CIIT/SP15-PCS-001/WAH
Spring, 2020
Appearance based Pedestrian Analysis using
Machine Learning
Computer Science.
Supervisor
Co-Supervisor
DEDICATION
To ALLAH Almighty and the Holy Prophet
Muhammad (P.B.U.H)
&
My Parents, Siblings, Loving Family, and Teachers
ABSTRACT
Appearance based Pedestrian Analysis using
Machine Learning
In the present era, the popularity of visual surveillance has opened new venues for
researchers in visual content analysis, and automatic pedestrian analysis is one of
them. Pedestrian analysis uses machine learning techniques that are categorized into
biometric and full-body appearance based approaches. Full-body appearance based
pedestrian analysis is preferred because of its additional capabilities and fewer
constraints; however, it is not without challenges, such as environmental effects and
different camera settings. These challenges are associated with different tasks such as
detection, orientation analysis, gender classification, re-identification (ReID), and
action classification of pedestrians. This thesis deals with challenges in full-body
appearance based pedestrian analysis for two tasks: 1) pedestrian ReID and 2)
pedestrian gender classification. In this regard, three methods are proposed; one of
them addresses pedestrian ReID, and the remaining two deal with pedestrian gender
classification.
In the first method, person ReID is performed by using features-based clustering and
deep features (FCDF). Initially, three types of handcrafted features, including shape,
color, and texture, are extracted from the input image for feature representation. To
acquire optimal features, feature fusion and selection (FFS) techniques are applied to
these handcrafted features. For gallery search optimization, features-based clustering
is utilized, which splits the whole gallery into k consensus clusters. A radial basis
kernel is utilized to learn the relationship between gallery features and the labels of
the chosen clusters. Afterwards, images are selected cluster-wise and provided to a
deep convolutional neural network (DCNN) model to obtain deep features. A
cluster-wise feature vector is then obtained by fusing deep and handcrafted features.
This is followed by a feature matching process in which a multi-class support vector
machine (SVM) is applied to choose the related cluster. Finally, to find the accurate
matching pair from the classified cluster(s), a cross-bin histogram based distance
similarity measure is used instead of a whole-gallery search. The proposed FCDF
framework attains rank-1 recognition rates of 46.82%, 48.12%, and 40.67% on the
selected VIPeR, CUHK01, and iLIDS-VID datasets, respectively.
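The gallery-search optimization behind FCDF can be illustrated with a minimal Python sketch, assuming pre-computed gallery descriptors, k-means as a stand-in for consensus clustering, an RBF-kernel SVM for cluster assignment, and a chi-square style histogram distance as a simple cross-bin surrogate; all function names, parameters, and the random descriptors are illustrative only and are not the thesis implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_cluster_index(gallery_feats, k=5, seed=0):
    """Split the gallery into k clusters and learn a cluster classifier.

    k-means is used here as a stand-in for the consensus clustering described
    in the thesis; the RBF-kernel SVM learns the mapping from gallery features
    to cluster labels.
    """
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(gallery_feats)
    svm = SVC(kernel="rbf", gamma="scale").fit(gallery_feats, km.labels_)
    return km.labels_, svm

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square style histogram distance (a simple cross-bin surrogate)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def match_probe(probe_feat, gallery_feats, cluster_labels, svm):
    """Rank only the gallery images of the cluster predicted for the probe."""
    cluster = svm.predict(probe_feat[None, :])[0]
    idx = np.where(cluster_labels == cluster)[0]
    dists = [chi2_distance(probe_feat, gallery_feats[i]) for i in idx]
    return idx[np.argsort(dists)]            # ranked gallery indices, best first

# Toy usage: random descriptors stand in for fused handcrafted + deep features.
rng = np.random.default_rng(0)
gallery = rng.random((200, 64))
labels, clf = build_cluster_index(gallery, k=5)
ranking = match_probe(rng.random(64), gallery, labels, clf)
print("rank-1 candidate:", ranking[0])
```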
In the second method, a joint feature representation is used for gender prediction. In
this regard, histogram of oriented gradients (HOG) and local maximal occurrence
(LOMO) assisted low-level features are extracted to handle rotation, viewpoint, and
illumination variations in the images. VGG19 and ResNet101 based standard deep
convolutional neural network (CNN) architectures are applied simultaneously to
acquire deep features, which are robust against pose variations. To avoid ambiguous
and unnecessary feature representations, entropy controlled features are chosen from
both the low-level and deep representations to reduce the dimensions of the computed
features. By merging the selected low-level features with the deep features, a robust
joint feature representation is used for gender prediction. The proposed joint
low-level and deep CNN feature representation (J-LDFR) method achieves an AUC
of 96% and an accuracy of 89.3% on the PETA dataset, and an AUC of 86% and an
accuracy of 82% on the MIT dataset with a cubic SVM classifier. The computed
results suggest that J-LDFR improves performance in comparison with using either
feature representation individually.
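A minimal sketch of the J-LDFR idea follows, assuming random matrices in place of the actual HOG/LOMO and VGG19/ResNet101 features, Shannon entropy of per-feature value histograms as the selection criterion, simple concatenation as the fusion step, and a degree-3 polynomial kernel standing in for the cubic SVM; the dimensions and the number of retained features are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def entropy_scores(X, bins=16):
    """Score each feature column by the Shannon entropy of its value histogram."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        p, _ = np.histogram(X[:, j], bins=bins)
        p = p / p.sum()
        p = p[p > 0]
        scores[j] = -np.sum(p * np.log2(p))
    return scores

def select_top(X, n_keep, bins=16):
    """Keep the n_keep columns with the highest entropy (widest informative spread)."""
    keep = np.argsort(entropy_scores(X, bins))[::-1][:n_keep]
    return X[:, keep]

# Placeholder feature matrices for 400 pedestrian images (random stand-ins for
# HOG/LOMO low-level features and VGG19/ResNet101 deep features).
rng = np.random.default_rng(1)
low_level = rng.random((400, 500))
deep = rng.random((400, 1000))
y = rng.integers(0, 2, 400)                 # toy labels: 0 = female, 1 = male

# Entropy-controlled selection followed by serial (concatenation) fusion.
joint = np.hstack([select_top(low_level, 100), select_top(deep, 200)])

# Degree-3 polynomial kernel as a stand-in for the "cubic SVM" classifier.
clf = SVC(kernel="poly", degree=3)
print("10-fold CV accuracy on toy data:", cross_val_score(clf, joint, y, cv=10).mean())
```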
In the third method, the imbalanced and small sample space (IB-SSS) dataset
problem is addressed for pedestrian gender classification using a fusion of selected
deep and traditional features (PGC-FSDTF). Initially, data preparation is applied,
which consists of data augmentation and data preprocessing steps. This is followed
by an investigation of multiple low- and high-level feature extraction schemes
(pyramid HOG (PHOG), hue saturation value (HSV) histogram, and deep visual
features of the DenseNet201 and InceptionResNetV2 CNN architectures), feature
selection (PCA and entropy), and fusion (parallel and serial) strategies for more
accurate gender prediction on imbalanced, augmented balanced, and customized
balanced datasets. The method achieves 92.2% O-ACC and 93% AUC on the
imbalanced MIT-IB dataset, and 93.7% O-ACC and 98% AUC on the augmented
balanced MIT-BROS-3 dataset. Similarly, it shows improved performance with
89.7% O-ACC and 96% AUC on the imbalanced PKU-Reid-IB dataset, and 92.2%
O-ACC and 97% AUC on the augmented balanced PKU-Reid-BROS-2 dataset. It
also provides superior outcomes on customized balanced datasets, such as 88.8%
O-ACC and 95% AUC on the PETA-SSS-1 dataset and 94.7% O-ACC and 95% AUC
on the VIPeR-SSS dataset. Moreover, it achieves 90.8% O-ACC and 95% AUC on
cross-dataset and 90.4% O-ACC and 96% AUC on cross-dataset-1. The superior
results on the applied datasets are achieved using a PCA based selected optimal
feature subset and a medium Gaussian SVM classifier. Hence, the results on different
datasets confirm that the selected feature combination effectively handles the
imbalanced and SSS issues for the PGC task.
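The following sketch illustrates, under simplifying assumptions, the IB-SSS handling strategy: minority-class images are augmented (horizontal flip and brightness adjustment) until the classes are balanced, after which PCA selects a compact feature subset for a Gaussian-kernel SVM. The image sizes, augmentation choices, and the use of flattened pixels in place of the fused PHOG/HSV/deep features are assumptions of this example, not the settings used in the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def augment(image, brightness=0.15, rng=None):
    """Return simple augmented variants: horizontal flip and brightness shift."""
    rng = rng or np.random.default_rng()
    flipped = image[:, ::-1, :]                       # horizontal flip
    shift = rng.uniform(-brightness, brightness)
    brightened = np.clip(image + shift, 0.0, 1.0)     # brightness adjustment
    return [flipped, brightened]

def oversample_minority(images, labels, minority=0):
    """Balance classes by augmenting minority-class images (a BROS-style idea)."""
    extra_x, extra_y = [], []
    deficit = int(np.sum(labels != minority) - np.sum(labels == minority))
    for img in images[labels == minority]:
        if len(extra_x) >= deficit:
            break
        for aug in augment(img):
            if len(extra_x) < deficit:
                extra_x.append(aug)
                extra_y.append(minority)
    return np.concatenate([images, np.array(extra_x)]), np.concatenate([labels, extra_y])

# Toy imbalanced set: 60 "male" and 20 "female" 64x32 RGB images in [0, 1].
rng = np.random.default_rng(2)
X_img = rng.random((80, 64, 32, 3))
y = np.array([1] * 60 + [0] * 20)
X_img, y = oversample_minority(X_img, y, minority=0)

# Flattened pixels stand in for fused PHOG/HSV/deep features; PCA selects a
# compact subset before a Gaussian-kernel ("medium Gaussian") SVM.
feats = PCA(n_components=50).fit_transform(X_img.reshape(len(X_img), -1))
clf = SVC(kernel="rbf", gamma="scale").fit(feats, y)
print("class balance after augmentation:", np.bincount(y))
```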
The abovementioned methods are unique in using features-based clustering to
optimize gallery search for efficient person ReID, in applying a joint feature
representation to both large-scale and small-scale datasets for accurate gender
prediction, and in combining data augmentation with robust feature engineering to
handle IB-SSS dataset issues. The computed results show that the proposed methods
outperform recent and closely related state-of-the-art methods by significant margins
in terms of recognition rates at different ranks, overall accuracy (O-ACC), and AUC.
TABLE OF CONTENTS
Chapter 1 Introduction ................................................................................. 1
2.2 State-of-the-art Methods for Pedestrian Re-identification....................... 19
3.3 Proposed Pedestrian Gender Classification Method of Joint Low-level and
Deep CNN Feature Representation (J-LDFR) ....................................... 61
4.2 Proposed Method for Person Re-identification with Features-based
Clustering and Deep Features (FCDF) ................................................... 97
4.3.3.3 Performance of Joint Feature Representation for Gender
Classification ............................................................................. 125
5.1 Conclusion ............................................................................................. 203
LIST OF FIGURES
Figure 1.1: Increasing trend of CCTV cameras in top ten countries a) country wise b)
per 100 people [1] .......................................................................................................... 2
Figure 1.2: Visual surveillance viewing modes, real-time and posteriori [3] ................ 4
Figure 1.3: Full-body appearance based pedestrian analysis using features-based
descriptors and deep learning models ............................................................................ 7
Figure 1.4: Full-body appearance based person ReID scenario .................................... 8
Figure 1.5: Full-body appearance based pedestrian gender classification scenario ...... 9
Figure 1.6: Sample images showing challenges in pedestrian analysis ....................... 13
Figure 1.7: Detail overview of thesis ........................................................................... 17
Figure 2.1: Organization of chapter 2 .......................................................................... 19
Figure 2.2: General model for person ReID ................................................................ 20
Figure 2.3: Pipeline for full-body appearance and face-based gender classification .. 30
Figure 3.1: Block diagram of chapter 3, proposed methodologies with highlights ..... 43
Figure 3.2: FCDF framework for person ReID consisting of three modules, where a)
feature representation module is applied to compute different types of features from R,
G, B, H, S, and V channels, and then optimal features are selected using novel FFS
method, b) feature clustering module is used to split whole gallery into different
consensus clusters for gallery search optimization, whereas deep features of each
cluster sample are also examined, and c) feature matching module includes
classification of corresponding cluster(s), probe deep features and finally similarity
measure is applied to obtain recognition rates at different ranks................................. 45
Figure 3.3: LBP and LEP code generation [187] ......................................................... 49
Figure 3.4: Proposed feature extraction, fusion, and max-entropy based selection of
features ......................................................................................................................... 51
Figure 3.5: CNN model for deep feature extraction .................................................... 56
Figure 3.6: Parameters setting at each layer ................................................................ 56
Figure 3.7: Max pooling operation .............................................................................. 59
Figure 3.8: Proposed J-LDFR framework for pedestrian gender classification .......... 61
Figure 3.9: Process of formation of low-level (HOG) features ................................... 64
Figure 3.10: Process of formation of low-level (LOMO) features, a) overview of feature
extraction procedure using two different LOMO representation schemes such as HSV
and SILTP based representations including feature fusion step to compute LOMO
feature vector, and b) basic internal representation to calculate combined histogram
from different patches of input image [77] .................................................................. 65
Figure 3.11: Complete design of proposed low-level and deep feature extraction from
gender images for joint feature representation. The proposed framework J-LDFR
selects maximum score-based features and then fusion is applied to generate a robust
feature vector that has both low-level and deep feature representations. Selected
classifiers are applied to evaluate these feature representations for gender prediction .... 70
Figure 3.12: An overview of the proposed PGC-FSDTF framework for pedestrian
gender classification..................................................................................................... 75
Figure 3.13: Proposed 1vs1 and 1vs4 strategies for data augmentation ...................... 78
Figure 3.14: PHOG feature extraction scheme ............................................................ 81
Figure 3.15: HSV histogram based color features extraction ...................................... 83
Figure 3.16: Different ways to deploy pre-trained deep learning models ................... 85
Figure 3.17: Schematic view of InceptionResNetV2 model (compressed) ................. 86
Figure 3.18: Schematic view of DenseNet201 model (compressed) ........................... 86
Figure 3.19: Deep CNN feature extraction and parallel fusion ................................... 88
Figure 3.20: FSDTF for gender prediction .................................................................. 92
Figure 4.1: Block of chapter 4 including section wise highlights................................ 96
Figure 4.2: CMC curves of existing and proposed FCDF method on VIPeR dataset .... 102
Figure 4.3: CMC curves of existing and proposed FCDF method on CUHK01 dataset
.................................................................................................................................... 104
Figure 4.4: CMC curves of existing and proposed FCDF method on the iLIDS-VID
dataset ........................................................................................................................ 106
Figure 4.5: Performance comparison of CC and NNC based searching against all probe
images ........................................................................................................................ 110
Figure 4.6: Selected image pairs (column-wise) from VIPeR, CUHK01, and iLIDS-
VID datasets with challenging conditions such as a) Improper image appearances, b)
different background and foreground information, c) drastic illumination changes, and
d) pose variations including lights effects ................................................................. 110
Figure 4.7: Samples of pedestrian images selected from sub-datasets of PETA dataset,
column representing the gender (male and female) from each sub-dataset, upper row is
showing the images of male gender whereas lower row is showing the image of female
gender ......................................................................................................................... 112
Figure 4.8: Proposed J-LDFR method (a) training and (b) prediction time using
different classifiers on PETA and MIT datasets ........................................................ 117
Figure 4.9: Performance evaluation of proposed J-LDFR method on PETA dataset
using entropy controlled low-level feature representations, individually.................. 119
Figure 4.10: Performance evaluation of proposed method on MIT dataset using entropy
controlled low-level feature representations, individually......................................... 120
Figure 4.11: Performance evaluation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled low-level feature representations, jointly ............. 121
Figure 4.12: Performance estimation of proposed J-LDFR method on PETA dataset
using entropy controlled deep feature representations, individually ......................... 124
Figure 4.13: Performance evaluation of proposed J-LDFR method on MIT dataset using
entropy controlled deep feature representations, separately ...................................... 124
Figure 4.14: Performance estimation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled deep feature representations, jointly .................... 125
Figure 4.15: Proposed evaluation results using individual feature representation and
JFR on PETA dataset ................................................................................................. 128
Figure 4.16: Proposed evaluation results using individual feature representation and
JFR on the MIT dataset .............................................................................................. 130
Figure 4.17: AUC on PETA dataset .......................................................................... 131
Figure 4.18: AUC on MIT dataset ............................................................................. 132
Figure 4.19: Comparison of existing and proposed results in terms of AUC on PETA
dataset ........................................................................................................................ 134
Figure 4.20: Comparison of existing and proposed results in terms of overall accuracy
on MIT dataset ........................................................................................................... 135
Figure 4.21: Gender wise pair of sample images collected from MIT, PKU-Reid, PETA,
VIPeR, and cross-datasets where column represents the gender (male and female) from
each dataset, upper row shows images of male, and lower row shows images of female
.................................................................................................................................... 138
Figure 4.22: Sample images of pedestrian with back and front views (a) MIT/MIT-IB
dataset (b) augmented MIT datasets. First two rows represent male images, and next
two rows represent female images ............................................................................. 140
Figure 4.23: Sample images of pedestrians (a) first and second row represent male and
female samples, respectively collected from PKU-Reid-IB dataset, (b) first and second
row represent male and female samples respectively collected from augmented datasets,
and (c) male (top) and female (bottom) images to show pedestrian images with different
viewpoint angle changes from 0° to 315°, total in eight directions .......................... 142
Figure 4.24: Sample images of pedestrians, where each column represents the gender
(male and female) selected from the customized SSS PETA datasets, the upper row shows
images of males, and the lower row shows images of females ................................. 143
Figure 4.25: Gender wise sample images of pedestrian, column represents gender (male
and female) selected from sub-datasets of PETA dataset, upper row shows two images
of male, and lower row shows two images of female ................................. 145
Figure 4.26: An overview of the selected, imbalanced, augmented balanced, and
customized datasets with the class-wise (male and female) distribution of samples for
pedestrian gender classification (a) imbalanced and augmented balanced SSS datasets
and (b) customized balanced SSS datasets ................................................................ 146
Figure 4.27: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on imbalanced MIT-IB dataset ............................................................. 150
Figure 4.28: Best AUC for males and females on imbalanced MIT-IB dataset using
PCA based selected features set ................................................................................. 151
Figure 4.29: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced MIT-BROS-1 dataset ........................................................ 153
Figure 4.30: Best AUC for males and females on balanced MIT-BROS-1 dataset using
PCA based selected FSs ............................................................................................. 154
Figure 4.31: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced MIT-BROS-2 dataset ........................................................ 156
Figure 4.32: Best AUC for males and females on balanced MIT-BROS-2 dataset using
PCA based selected FS .............................................................................................. 157
Figure 4.33: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced MIT-BROS-3 dataset ........................................................ 159
Figure 4.34: Best AUC for males and females on balanced MIT-BROS-3 dataset using
PCA based selected features subsets.......................................................................... 160
Figure 4.35: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on imbalanced PKU-Reid-IB dataset .................................................... 163
Figure 4.36: Best AUC for males and females on imbalanced PKU-Reid-IB dataset
using PCA based selected FSs ................................................................................... 164
Figure 4.37: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on the balanced PKU-Reid-BROS-1 dataset ........................................ 166
Figure 4.38: Best AUC for males and females on balanced PKU-Reid-BROS-1 dataset
using PCA based selected FSs ................................................................................... 167
Figure 4.39: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced PKU-Reid-BROS-2 dataset .............................................. 169
Figure 4.40: Best AUC for males and females on balanced PKU-Reid-BROS-2 dataset
using PCA based selected FSs ................................................................................... 170
Figure 4.41: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced PKU-Reid-BROS-3 dataset .............................................. 173
Figure 4.42: Best AUC for males and females on balanced PKU-Reid-BROS-3 dataset
using PCA based selected FSs ................................................................................... 173
Figure 4.43: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on the balanced PETA-SSS-1 dataset ................................................... 176
Figure 4.44: Best AUC for males and females on the balanced PETA-SSS-1 dataset
using PCA based selected FSs ................................................................................... 176
Figure 4.45: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced PETA-SSS-2 dataset ......................................................... 179
Figure 4.46: Best AUC for males and females on balanced PETA-SSS-2 dataset using
PCA based selected FSs ............................................................................................. 179
Figure 4.47: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced VIPeR-SSS dataset ........................................................... 182
Figure 4.48: Best AUC for males and females on balanced VIPeR-SSS dataset using
PCA based selected features subsets.......................................................................... 182
Figure 4.49: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced cross-dataset, best outcomes of two classifiers ................ 185
Figure 4.50: Best AUC for males and females on balanced cross-dataset using PCA
based selected FSs...................................................................................................... 186
Figure 4.51: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-
ACC female on balanced cross-dataset-1, best outcomes of two classifiers ............. 188
Figure 4.52: Best AUC for males and females on balanced cross-dataset-1 using PCA
based selected FSs...................................................................................................... 189
Figure 4.53: Performance comparison in terms of overall accuracy between proposed
PGC-FSDTF method and existing methods on MIT/MIT-IB dataset ....................... 191
Figure 4.54: Comparison of training and prediction time of PGC-FSDTF with J-LDFR
.................................................................................................................................... 191
Figure 4.55: Performance comparison in terms of AUC between the proposed PGC-
FSDTF method and existing methods on cross-datasets ........................................... 193
Figure 4.56: Training time of PGC-FSDTF method on applied datasets .................. 198
Figure 4.57: Prediction time of PGC-FSDTF method on applied datasets................ 198
Figure 4.58: Complete overview of proposed PGC-FSDTF method results in terms of
best CW-ACC on selected, customized, and augmented datasets where proposed
approach achieved superior AUC (a) CW-ACC on customized balanced SSS datasets,
and (b) CW-ACC on imbalanced and augmented balanced SSS datasets ................. 199
LIST OF TABLES
Table 2.1: Summary of appearance based approaches for person ReID ..................... 23
Table 2.2: Summary of metric learning based approaches for person ReID ............... 26
Table 2.3: Summary of deep learning based approaches for person ReID ................. 29
Table 2.4: Summary of parts-based approaches using frontal, back and mixed view
images of a pedestrian for gender classification, (-) represents that no reported result is
available ....................................................................................................................... 32
Table 2.5: Summary of existing methods results using full-body frontal view images of
pedestrian for gender classification with 5-fold cross-validation (male=123, female=123
uncropped and cropped images of MIT dataset; whereas male=292, female=291
uncropped images of VIPeR dataset) ........................................................................... 34
Table 2.6: Summary of gender classification methods using handcrafted features and
different classifiers with full-body frontal, back, and mixed view images of a gender, (-
) represents that no reported result is available ............................................................ 36
Table 2.7: Summary of deep learning and hybrid methods for pedestrian gender
classification with full-body frontal, back, and mixed view images of a gender, (-)
represents that no reported result is available .............................................................. 39
Table 3.1: Description of different test FS with different dimensions ........................ 53
Table 3.2: Experimental results using handcrafted features on three selected datasets,
top recognition rates at each rank written in bold ........................................................ 53
Table 3.3: Description of preprocessing for each feature representation scheme ....... 62
Table 3.4: Proposed J-LDFR framework selected features subset dimensions, classifiers
and their parameter settings ......................................................................................... 73
Table 3.5: Augmentation statistics for imbalanced and small sample MIT, and PKU-
Reid datasets, class-wise selected number of samples in a single set for data
augmentation, and resultantly, total augmented images and total images ................... 79
Table 3.6: Description of preprocessing for each feature representation scheme ....... 80
Table 4.1: Statistics of datasets for person ReID ......................................................... 99
Table 4.2: Experimental results using deep features (from higher to lower dimension)
on VIPeR dataset........................................................................................................ 100
Table 4.3: Experimental results using deep features (from higher to lower dimension)
.................................................................................................................................... 100
Table 4.4: Experimental results using deep features (from higher to lower dimension)
on iLIDS-VID dataset ............................................................................... 100
Table 4.5: Performance comparison in terms of top matching rates (%) of existing
methods including proposed FCDF method on VIPeR dataset (p=316), dash (-)
represents that no reported result is available ............................................................ 103
Table 4.6: Performance comparison in terms of top matching rates (%) of existing
methods including proposed FCDF method on CUHK01 dataset (p=486), dash (-)
represents that no reported result is available ............................................................ 105
Table 4.7: Performance comparison in terms of top matching rates (%) of existing
methods and proposed FCDF method on the iLIDS-VID dataset (p=150), dash (-)
represents that no reported result is available ............................................................ 107
Table 4.8: Cluster and gallery-based probe matching results of proposed FCDF
framework on VIPeR dataset .................................................................................... 109
Table 4.9: Cluster and gallery-based probe matching results of proposed FCDF
framework on CUHK01 dataset................................................................................. 109
Table 4.10: Cluster and gallery-based probe matching results of proposed FCDF
framework on the iLIDS-VID dataset........................................................................ 109
Table 4.11: Evaluation protocols ............................................................................... 111
Table 4.12: Statistics of PETA dataset for pedestrian gender classification ............. 113
Table 4.13: Statistics of MIT dataset for pedestrian gender classification ................ 113
Table 4.14: Description of different test FSs with different dimensions ................... 114
Table 4.15: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on PETA dataset ................................................................... 115
Table 4.16: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on MIT dataset ...................................................................... 116
Table 4.17: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation .............................................. 118
Table 4.18: Performance evaluation of proposed J-LDFR method on PETA dataset
using M-SVM classifier with 10-fold cross-validation ............................................. 118
Table 4.19: Performance evaluation of proposed J-LDFR method on PETA dataset
using Q-SVM classifier with 10-fold cross-validation .............................................. 118
Table 4.20: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation ........................................................ 119
Table 4.21: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation ....................................................... 120
Table 4.22: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation ........................................................ 120
Table 4.23: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation .............................................. 122
Table 4.24: Performance evaluation of the proposed J-LDFR method on PETA dataset
using M-SVM classifiers with 10-fold cross-validation ............................................ 122
Table 4.25: Performance evaluation of the proposed J-LDFR method on PETA dataset
using Q-SVM classifier with 10-fold cross-validation .............................................. 122
Table 4.26: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation ........................................................ 123
Table 4.27: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation ....................................................... 123
Table 4.28: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation ........................................................ 123
Table 4.29: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation .............................................. 126
Table 4.30: Performance evaluation of proposed J-LDFR method on PETA dataset
using M-SVM classifier with 10-fold cross-validation ............................................. 126
Table 4.31: Performance evaluation of proposed J-LDFR method on PETA dataset
using Q-SVM classifiers with 10-fold cross-validation............................................. 127
Table 4.32: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using cubic-SVM classifier with 10-fold cross-validation ............................ 128
Table 4.33: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using medium-SVM classifier with 10-fold cross-validation ........................ 129
Table 4.34: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using quadratic-SVM classifier with 10-fold cross-validation ...................... 129
Table 4.35: Confusion matrix using C-SVM on PETA dataset ................................. 130
Table 4.36: Confusion matrix using C-SVM on MIT dataset ................................... 131
Table 4.37: Performance comparison with existing methods using PETA dataset ... 133
Table 4.38: Performance comparison with existing methods using MIT dataset, dash (-
) represents that no reported result is available .......................................................... 135
Table 4.39: Proposed J-LDFR method results on PETA and MIT datasets .............. 136
Table 4.40: Evaluation protocols / metrics ................................................................ 137
Table 4.41: Statistics of MIT/MIT-IB dataset samples based imbalanced and
augmented balanced small sample datasets for pedestrian gender classification ...... 140
Table 4.42: Statistics of PKU-Reid dataset samples based imbalanced and augmented
balanced small sample datasets for pedestrian gender classification......................... 141
Table 4.43: Statistics of PETA dataset samples based customized PETA-SSS-1 and
PETA-SSS-2 datasets for pedestrian gender classification ....................................... 143
Table 4.44: Statistics of VIPeR dataset samples based customized VIPeR-SSS dataset
for pedestrian gender classification ........................................................................... 144
Table 4.45: Statistics of cross-datasets for pedestrian gender classification ............. 145
Table 4.46: Performance of proposed PGC-FSDTF method on imbalanced MIT-IB
dataset ........................................................................................................................ 148
Table 4.47: Performance of proposed PGC-FSDTF method on imbalanced MIT-IB
dataset ........................................................................................................................ 149
Table 4.48: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-1
dataset (male=864, and female=864 images) using different evaluation protocols.. 152
Table 4.49: Performance of proposed PGC-FSDTF method on balanced MIT- BROS-
1 dataset (male=864, and female=864 images) using different accuracies, AUC and
time ............................................................................................................................ 153
Table 4.50: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-2
dataset (male=864, and female=864 images) using different evaluation protocols... 155
Table 4.51: Performance of proposed PGC-FSDTF method on the balanced MIT-
BROS-2 dataset (male=864, and female=864 images) using accuracies, AUC, and time
.................................................................................................................................... 155
Table 4.52: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-3
dataset (male=600, and female=600 images) using different evaluation protocols... 158
Table 4.53: Performance of proposed PGC-FSDTF method on balanced MIT-BROS-3
dataset (male=600, and female=600 images) using different accuracies, AUC, and time
.................................................................................................................................... 159
Table 4.54: Performance of proposed PGC-FSDTF method on imbalanced PKU-Reid-
IB dataset (male=1120, and female=704 images) using different evaluation protocols
.................................................................................................................................... 161
Table 4.55: Performance of proposed PGC-FSDTF method on imbalanced PKU-Reid-
IB dataset (male=1120, and female=704 images) using different accuracies, AUC, and
time ............................................................................................................................ 162
Table 4.56: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-
BROS-1 dataset (male=1300, and female=1300 images), different evaluation protocols
.................................................................................................................................... 165
Table 4.57: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-
BROS-1 dataset (male=1300, and female=1300 images) using different accuracies,
AUC, and time ........................................................................................................... 165
Table 4.58: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-2
dataset (male=1300, and female=1300 images) using different evaluation protocols
.................................................................................................................................... 168
Table 4.59: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-2
dataset (male=1300, and female=1300 images) using different accuracies, AUC, and
time ............................................................................................................................ 168
Table 4.60: Performance of proposed PGC-FSDTF method on balanced PKU-Reid-
BROS-3 dataset (male=1120, and female=1120 images) different evaluation protocols
.................................................................................................................................... 171
Table 4.61: Performance of proposed PGC-FSDTF method on PKU-Reid-BROS-3
dataset (male=1120, and female=1120 images) using different accuracies, AUC, and
time ............................................................................................................................ 172
Table 4.62: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-1
dataset (male=864, and female=864 images) using different evaluation protocols... 175
Table 4.63: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-1
dataset (male=864, and female=864 images) using different accuracies, AUC and time
.................................................................................................................................... 175
Table 4.64: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-2
dataset (male=1300, and female=1300 images) using different evaluation protocols
.................................................................................................................................... 177
Table 4.65: Performance of proposed PGC-FSDTF method on balanced PETA-SSS-2
dataset (male=1300, and female=1300 images) using different accuracies, AUC, and
time ............................................................................................................................ 178
Table 4.66: Performance of proposed PGC-FSDTF method on balanced VIPeR-SSS
dataset (male=544, and female=544 images) using different evaluation protocols... 180
Table 4.67: Performance of proposed PGC-FSDTF method on balanced VIPeR-SSS
dataset (male=544, and female=544 images) using different accuracies, AUC, and time
.................................................................................................................................... 181
Table 4.68: Performance of proposed PGC-FSDTF method on balanced cross-dataset
(male=175, and female=175 images) using different evaluation protocols ............... 184
Table 4.69: Performance of proposed PGC-FSDTF method on balanced cross-dataset
(male=175, and female=175 images) using different accuracies, AUC, and time .... 184
Table 4.70: Performance of proposed PGC-FSDTF method on balanced cross-dataset-
1 (male=350, and female=350 images) using different evaluation protocols ............ 187
Table 4.71: Performance of proposed PGC-FSDTF method on balanced cross-dataset-
1 (male=350, and female=350 images) using different accuracies, AUC and time .. 187
Table 4.72: Performance comparison with state-of-the-art methods on MIT/MIT-IB
dataset, dash (-) represents that no reported result is available.................................. 190
Table 4.73: Comparison of proposed PGC-FSDTF method results with state-of-the-art
methods on cross-datasets .......................................................................................... 192
Table 4.74: Summary of proposed method PGC-FSDTF results on all selected
imbalanced, and augmented balanced datasets where proposed approach recorded
superior AUC ............................................................................................................. 195
Table 4.75: Proposed approach PGC-FSDTF results on customized/non-augmented
balanced datasets where proposed approach recorded superior AUC ....................... 197
Table 4.76: Proposed PGC-FSDTF approach results on MIT-IB dataset and cross-
dataset ........................................................................................................................ 197
Table 4.77: Summary of proposed methods including tasks, datasets, and results ... 200
LIST OF ABBREVIATIONS
AD Average Deep
AFDA Adaptive Fisher Discriminative Analysis
AML Adaptive Metric Learning
AML-PSR Adaptive Metric Learning Probe Specific Re-ranking
AUC Area Under the Curve
AU Activation Unit
AvgDeep_FV Average score-based Deep Feature Vector
BoW Bag of Words
B-ACC Balanced Accuracy
B-SSS Balanced and Small Sample Space
BROS Balanced Random Over Sampling
BN Batch Normalization
BRM2L Bidirectional Reference Matric Metric Learning
BS Bin-Size
BIF Biological Inspired Features
CAM Camera
CAST Center for Applied Science and Technology
CPNI Center for the Protection of National Infrastructure
CUHK China University of Hong Kong
CW-ACC Class Wise Accuracy
CCTV Closed-Circuit Television
CF Color Feature
CNN Convolution Neural Network
CL Convolutional Layer
CC Corresponding Cluster
CMGTN Cross Modal feature Generating and target information
preserving Transfer Network
CTC-GAN Cross-media Transfer Cycle Generative Adversarial Networks
XQDA Cross-view Quadratic Discriminant Analysis
CMC Cumulative Matching Characteristics
DDNN Deep De-compositional Neural Networks
DFR Deep Feature Ranking
DFBP Deep Features Body and Part
DMIL Deep Multi Instance Learning
DR Deep ResNet101
DV Deep VGG
DHFFN Deep-learned and Handcrafted Feature Fusion Network
DPML Deformable Patch based Metric Learning
DB Dense Block
DN DenseNet
DN201_FCL DenseNet201 Fully Connected Layer
DVR Discriminative Video fragments selection and Ranking
DVDL Discriminatively trained Viewpoint invariant Dictionaries
Learning
DHGM Dynamic Hyper Graph Matching
EML Enhanced Metric Learning
FN False Negative
FP False Positive
FPR False Positive Rate
FCDF Features-based Clustering and Deep Features
FFS Feature Fusion and Selection
FR Feature Representation
FSF Features selection and Fusion
FS Features Subset
FSD Features Subset Dimension
FV Feature Vector
FoVs Field of Views
FFV Final Feature Vector
FTCNN Fine Tuned Convolution Neural Network
FC Fully Connected
FCL Fully Connected Layer
FCUE Fully Controlled and Uncontrolled Environment
FMCF Fusion of Multiple Channel Features
GoG Gaussian of Gaussian
GM Greedy Matching
HRPID High Resolution Pedestrian re-Identification Dataset
HOG Histogram of Oriented Gradients
HDFL HOG assisted Deep Feature Learning
HH HSV Histogram
HSV Hue Saturation Value
HSV-Hist_FV Hue Saturation Value Histogram based Feature Vector
HM Hungarian Matching
HG Hyper Graph
HGL Hyper Graph Learning
iLIDS-VID imagery Library for Intelligent Detection System-Video re-
IDentification
IB Imbalanced
IB-SSS Imbalanced and Small Sample Space
IRV2 Inception ResNet V2
IRNV2_FCL InceptionResNetV2 Fully Connected Layer
IN Indoor
JFR Joint Feature Representation
J-LDFR Joint Low-level and Deep CNN Feature Representations
KNN k Nearest Neighbor
LWA Light Weight and Accurate
LDA Linear Discriminant Analysis
LBP Local Binary Patterns
LCDN Local Contrast Divisive Normalization
LEDF Local Embedded Deep Feature
LEP Local Extrema Patterns
LFDA Local Fisher Discriminant Analysis
LHSV Local Hue Saturation Value
LOMO Local Maximal Occurrence
LSDA Locality Sensitive Discriminant Analysis
MSDALF Mask-improved Symmetry Driven Accumulation of Local
Features
MD Maximum Deep
MaxDeep_Fv Maximum score-based Deep Feature Vector
M-ACC Mean Accuracy
MARPML Multi-feature Attribute Restricted Projection Metric Learning
M3L Multi-modality Mining Metric Learning
MCTS Multiple Camera Tracking Scenarios
MFA Marginal Fisher Analysis
MKSSL Multiple Kernel Sub Space Learning
NNC Nearest Neighbor Cluster
NP ReId Neuromorphic Person Re-identification
NFST Null Foley-Sammon Transfer
OFS Optimal Features Subsets
OF Optimal Features
OD Outdoor
OLPP Orthogonal Locality Preserving Projections
O-ACC Overall Accuracy
PBGR Part Based Gender Recognition
PLS Partial Least Square
PAN Pedestrian Alignment Network
PETA PEdesTrian Attribute
PGC-FSDTF Pedestrian Gender Classification using Fusion of Selected Deep
and Traditional Features
PKU-Reid Peking University Re-identification
PReID-PFCC Person Re-identification via Prototype Formation and Color
Categorization
P2SNET Point to Set Network
PL Pooling Layer
PICB Pose Invariant Convolutional Baseline
PaMM Pose-aware Multi-shot Matching
PPV Positive Predictive Value
PCA Principal Component Analysis
PRDC Probabilistic Distance Comparison
PSR Probe Specific Re-ranking
PHDL Projection matrix and Heterogeneous Dictionary pair Learning
PHOG Pyramid Histogram of Oriented Gradients
PH Pyramid Histogram of Oriented Gradients Feature Vector
PHOG_FV Pyramid Histogram of Oriented Gradients Feature Vector
PHOW Pyramid Histogram of Words
QRKISS-FFN QR Keep It Simple and Straightforward Feature Fusion
Network
QC Quadratic Chi
QLBP Quaternionic Local Binary Pattern
RBS Radial Basis Function
RF Random Forest
ROS Random Over Sampling
RF+DDA+AC Raw low-level Features + Data Driven Attributes + Attribute
Correlations
RLU Rectified Linear Unit
RGB Red Green Blue
ROI Region of Interest
RCCA Regularized Canonical Correlation Analysis
RLML Regularized Local Metric Learning
ReID Re-identification
RDC Relative Distance Comparison
RDML Relative Distance Metric Learning
RDML-CCPVL Relative Distance Metric Learning Clustering Centralization
and Projection Vectors Learning
RR Re-Ranking
RN ResNet
ReLU Rectified Linear Unit
ROCCA Robust Canonical Correlation Analysis
RDs Robust Descriptor
SCD Salience Color Descriptor
SSSVM Sample Specific Support Vector Machine
SIFT Scale Invariant Feature Transform
SILTP Scale Invariant Local Ternary Patterns
SF Shape Feature
SCNN Siamese Convolutional Neural Network
SECGAN Similarity Embedded Cycle Generative adversarial Network
SVD Singular Value Decomposition
SSS Small Sample Space
sCSPA soft Cluster-based Similarity Partitioning Algorithm
SMA Stable Marriage Algorithm
SSAE Stacked Sparse Auto Encoder
SD Standard Deviation
SVM Support Vector Machine
SDALFs Symmetry Driven Accumulation of Local Features
TMSL Temporally Memorized Similarity Learning
TF Texture Feature
TPPLFDA Topology Properties Preserved Local Fisher Discriminant
Analysis
TLSTP Transfer Learning of Spatial Temporal Patterns
TL Transition Layer
TMD2L Triplet-based Manifold Discriminative Distance Learning
TN True Negative
TP True Positive
TPR True Positive Rate
UCDTL Unsupervised Cross Dataset Transfer Learning
VA Video Analytics
VIPeR Viewpoint Invariant Pedestrian Recognition
VSS Visual Surveillance System
LIST OF SYMBOLS
𝑓𝑑 Feature vector with specific dimension
𝑏 Channel number
𝐹𝑉 Feature vector
𝑃𝑐𝑣 Value of center pixel
𝑃𝑛𝑣 Value of neighbor pixel
𝑃′𝑛𝑣 Corresponding differences from center pixel value
𝜃 Direction/angle/orientation of gradients
𝑃′𝑛𝑏(𝜃) Pixel values at a particular direction
𝐼2 Binary code at given position
𝐹 Feature vector
𝐾 Number of cluster
𝐿 Set of cluster labels
Ƈ Set of clusters
𝑔𝑥 Horizontal gradients
𝑔𝑦 Vertical gradients
𝑀𝑔 Magnitude
𝐾𝑣 HOG feature vector
𝐶𝐹𝑣 Color feature vector
𝑐𝑓𝑖,𝑗 Color feature indices
𝑑𝑞 Dimension of color feature vector
𝑇𝐹𝑣 Texture feature vector
𝑡𝑓𝑖,𝑗 Texture feature indices
𝑑𝑟 Dimension of texture feature vector
𝑆𝐹𝑣 Shape feature vector
𝑠𝑓𝑖,𝑗 Shape feature indices
𝑑𝑠 Dimension of shape feature vector
𝑓𝑖 Current feature
𝑓𝑗 Next feature
𝛿 Entropy controlled feature vector
𝑣 Number of selected features
𝑁𝑒 Number of epochs
𝑓𝑛 Feature space/feature vector
𝑃𝑅 Probability of computed features
𝑙 Layer of network
𝑓𝑙 Filter at layer 𝑙
$f_{x,y}^{l}$ Filter size at layer 𝑙 and between the 𝑥th and 𝑦th feature maps
$w_{x,y}^{l}$ Weight at layer 𝑙 and between the 𝑥th and 𝑦th feature maps
𝑛𝑖 × 𝑛𝑗 Size of matrix for input
𝑈𝑖𝑗𝑘 Given pooling region
𝑘×𝑘 Filter size
𝐾×𝐾 Pooling region
𝑀𝑝,𝑞 Pooling region with filter size
𝑍𝑝𝑞𝑘 Output of max pooling
𝑀𝑎𝑏 Local area
𝐶𝑘 Consensus cluster
QC Quadratic-Chi
𝑘 Cluster index
𝐷 Distance
𝑋𝐼𝑃 Number of accurately matched probe images
𝑌𝐼𝑃 Total images in the probe set
𝑛𝑐 Neighbor cluster
𝑠𝑤 𝑙 Sliding window
𝑏𝑥𝑙 Bias matrix
𝑛𝑥𝑙 Feature map
𝑍𝑥𝑙 Output of convolution process
𝑍𝑥𝑙−1 Input channel values
𝐿 Pyramid level
𝑃𝐻𝑣 PHOG feature vector
𝑝ℎ𝑖,𝑗 PHOG feature indices
𝑑𝑝 Dimension of PHOG feature vector
𝐻𝐻𝑣 HSV Histogram feature vector
ℎℎ𝑖,𝑗 HSV histogram feature indices
𝑑ℎ Dimension of HSV-Histogram feature vector
𝑀𝐷𝑣 Maximum score-based deep feature vector
𝑚𝑑𝑖,𝑗 Maximum score-based deep feature indices
𝑑𝑚 Dimension of maximum score-based deep feature vector
𝐴𝐷𝑣 Average score-based deep feature vector
𝑎𝑑𝑖,𝑗 Average score-based deep feature indices
𝑑𝑎 Dimension of average score-based deep feature vector
𝐼𝐹𝑖𝑚 Filtered image
𝐻𝐹𝑖𝑚 Horizontal flipped image
𝐼𝑀 Image matrix
𝐺𝑇𝑖𝑚 Geometric transformed image
𝐵𝐴𝑖𝑚 Brightness adjusted image
𝑚𝑔𝑣 Matrix with gradient values
𝑚ℎ𝑣 Matrix with histogram values
𝑟𝑜𝑖 Region of interest
𝑏𝑠 Bin size on the histogram
𝐿 Pyramid levels
ĉ Constant value
ɠ𝑐𝑐 Consensus cluster operation
Chapter 1
Introduction
1.1 Introduction to Visual Surveillance
Visual surveillance is an emerging technology that deals with the monitoring of a
particular area or person for security and safety. Owing to recent security concerns
worldwide, many countries are committed to expanding their visual surveillance
coverage through cameras. In this regard, millions of visual surveillance or
closed-circuit television (CCTV) cameras are being installed around the globe.
According to a recent report, approximately 770 million surveillance cameras have
been installed worldwide to date, including 200 million cameras in the year 2020
alone, and one billion cameras are expected to be deployed globally by 2021 [1]. The
increasing trend of CCTV cameras in the top ten countries, country wise and per 100
people, is shown in Figure 1.1 (a) and Figure 1.1 (b), respectively.
Figure 1.1: Increasing trend of CCTV cameras in top ten countries a) country wise b)
per 100 people [1]
The leading countries are realizing the significance of visual surveillance. For
instance, China maintains the largest surveillance network, having deployed more
CCTV cameras than any other country, whereas in terms of per capita usage, the
United States ranks first.
Moreover, in the present era, the popularity of visual surveillance is linked with the
availability of powerful computing hardware and cost-effective CCTV cameras,
which are two important components of any visual surveillance system (VSS). These
components make visual surveillance more suitable and efficient for monitoring
target objects such as pedestrians, crowds, and unattended items. Since cameras
provide unique visualization properties, they are considered a more reliable source
for automatic visual content analysis than traditional modalities such as infrared,
sonar, and radar.
A VSS offers two modes of processing for visual content analysis: a real-time mode
for online data processing and a posteriori mode for stored or offline data processing.
In both modes, the VSS provides an infrastructure to capture, store, and distribute
video while leaving the task of threat detection exclusively to human operators. In
the real-time mode, video operators continuously view the live stream video for the
prevention of crime, and they are also responsible for generating alerts when any
misconduct or event of interest occurs [2]. However, in this mode of processing,
delays may occur due to the live streams generated by numerous CCTV cameras in a
network. In addition, live streams are stored on digital media for a predefined time.
In the posteriori mode, searching for a given person or event of interest in thousands
of recorded images or video frames provided by many cameras requires the
allocation of a large number of enforcement officers and a considerable amount of
time.
Figure 1.2: Visual surveillance viewing modes, real-time and posteriori [3]
In automatic visual content analysis, the real-time mode automatically analyzes the
live stream for events of interest, whereas the posteriori mode extracts important
information, such as evidence, from already stored visual data and replays specific
events overlooked by operators due to stream delays in the network. Thus, in the
automatic setting, the human operators of both modes are replaced with a machine
based VSS. In contrast to human based surveillance, both automatic modes are well
organized to examine the captured streams for detecting suspicious activities. Using
either mode, automatic visual content analysis opens new venues for research such as
pedestrian [4-6], medical [7-9], and agriculture [10-12] image analysis. Nowadays,
automatic pedestrian image analysis is one of the important research areas, as it acts
as an effective deterrent to unlawful and anti-social activities [13].
Biometric techniques utilize iris, face, gait, and fingerprint images of an individual
for pedestrian analysis. Techniques that use iris [22-24] and fingerprint [25-27]
images require the cooperation of the individuals being monitored. For instance,
acquiring an iris image requires an individual to position the eyes close to and
directly in front of a particular sensor, whereas capturing a fingerprint requires
placing a finger on a particular sensor. In contrast, techniques that use the face
[28-30] and gait [31-38] do not require the cooperation of the individual being
observed. Thus, these techniques are more suitable for automatic surveillance at entry
and exit points of public places under controlled environments.
information or addition of noise during compression and decompression steps used by
the cameras or storage devices. These constraints show that biometric techniques are
highly dependent on camera settings and the person's orientation towards the camera.
For instance, if the targeted person's face is not visible or only a side view is
presented to the camera, facial recognition cannot be carried out. To address these
constraints, full-body appearance based pedestrian analysis is becoming a dominant
field of research.
The full-body appearance of a pedestrian in images and videos is widely used for analysis, which is often called appearance based pedestrian analysis. The pedestrian appearance is captured under indoor (ID), outdoor (OD), and fully controlled and uncontrolled environments (FCUE). Under these environments, variations in pedestrian appearance such as pose, illumination, and viewpoint, along with different camera settings, make pedestrian analysis more challenging. The relevant literature presents two approaches: traditional and deep learning based, as shown in Figure 1.3. In the traditional approach, the pedestrian analysis process consists of several steps such as preprocessing, feature extraction or descriptor design, feature selection, and classification/matching. In deep learning, convolutional neural network (CNN) models implicitly learn features to establish an accurate correspondence more effectively. Either way, supervised and unsupervised learning algorithms may be used to improve matching and classification accuracy. In addition, discriminative information from images and video frames plays a vital role in pedestrian analysis for recognition, matching, and classification tasks.
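To make this traditional pipeline concrete, the following is a minimal sketch assuming scikit-image and scikit-learn are available; the HOG descriptor, the 128 x 64 crop size, and the classifier settings are illustrative assumptions rather than the exact configuration used in this thesis.

# Minimal sketch of a traditional pedestrian-analysis pipeline:
# preprocessing -> feature extraction (HOG) -> feature selection (PCA) -> classification (SVM).
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_hog(image, size=(128, 64)):
    """Resize a grayscale pedestrian crop and compute a HOG descriptor."""
    image = resize(image, size, anti_aliasing=True)
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def build_classifier():
    """Feature scaling + PCA-based reduction + linear SVM."""
    return make_pipeline(StandardScaler(), PCA(n_components=100), SVC(kernel='linear'))

# Usage (with hypothetical arrays `train_images`, `train_labels`, `test_images`):
# X = np.array([extract_hog(img) for img in train_images])
# clf = build_classifier().fit(X, train_labels)
# pred = clf.predict(np.array([extract_hog(img) for img in test_images]))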
[15, 45-49], and 5) pedestrian action classification for classifying different actions of pedestrians such as walking, fighting, running, and sitting [50-52].
In the present era, these tasks are still considered challenging because matching and classification accuracies are usually influenced by intra-class and inter-class variations, appearance diversities, and environmental effects [53]. In light of the aforementioned discussion, this dissertation deals with full-body appearance based pedestrian analysis using a blend of traditional and deep learning techniques for two tasks: 1) person ReID - how to re-identify the person of interest from static images with improved performance? and 2) pedestrian gender classification - how to predict the pedestrian gender on large-scale and imbalanced and small sample space (IB-SSS) datasets? It is worth mentioning that the posteriori mode of processing is used to perform these tasks. A foreword to both tasks is given in the following subsections.
Figure 1.4: Full-body appearance based person ReID scenario
In the first step, the gallery set {G} is formed by collecting cropped pedestrian images from different camera scenes. To represent gallery images precisely, multiple features (shape, color, texture, deep CNN, etc.) are extracted. Then, similarity measures or optimal metric learning methods are utilized to find an accurate image pair from the gallery set against each probe image {P_i}. Finally, the ranking of accurate matches is presented as empirical results. For example, the red boxes at positions 1, 5, and 48 show the matches at rank-1, rank-10, and rank-50, respectively, considering the query images shown in Figure 1.4.
The research task of person ReID aims to find a probe person within a certain vicinity (gallery set), which is of great significance in the visual surveillance field for tracking a lost person and verifying a certain individual.
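The matching and ranking step described above can be sketched as follows, assuming the gallery and probe feature vectors have already been extracted; the Euclidean distance and the toy data are illustrative assumptions, not the similarity measure adopted later in this work.

# Minimal sketch of gallery-probe matching and rank-k evaluation for person ReID.
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Return gallery indices sorted from best (rank-1) to worst match."""
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)  # Euclidean distance
    return np.argsort(dists)

def rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1):
    """Fraction of probes whose true identity appears within the top-k gallery matches."""
    hits = 0
    for feat, pid in zip(probe_feats, probe_ids):
        order = rank_gallery(feat, gallery_feats)
        if pid in gallery_ids[order[:k]]:
            hits += 1
    return hits / len(probe_ids)

# Toy usage: 5 gallery identities, one noisy probe view per identity.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 16))
probes = gallery + 0.05 * rng.normal(size=(5, 16))
ids = np.arange(5)
print(rank_k_accuracy(probes, ids, gallery, ids, k=1))   # expected: 1.0 on this toy data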
Pedestrian gender classification is one of the interesting tasks in the pedestrian analysis domain. It considers the pedestrian's full-body appearance, consisting of clues such as wearing items, color, hairstyle, carrying items, and poses [54]. Usually, gender classification is desirable for analyzing pedestrians' activities and behaviors based on their gender. In manual settings, an individual performs the task of pedestrian gender recognition based on the clothes, hairstyle, body shape, voice, gait, skin color, and facial looks of a target person [55]. In automatic settings, numerous techniques have been proposed for gender recognition based on facial characteristics [56-59] and the full-body appearance of pedestrians [45, 46, 49]. The face-based gender classification techniques require the complete face of a person for gender recognition. However, these techniques fail when the camera acquires an image from the left, right, or backside of the pedestrian. On the other hand, appearance based pedestrian gender classification techniques use hairstyle, carrying items, body shape, shoes, and coat as strong cues, which provide detailed information about the pedestrian image. Hence, full-body appearance based gender classification is more reliable than face-based gender classification. A model is first trained using the given training samples, and then a test (unknown) gender image is supplied to the trained model for prediction of the gender as male or female, as shown in Figure 1.5. The full-body appearance based pedestrian gender classification techniques are mainly divided into two types: 1) traditional techniques that adopt feature extraction and feature selection followed by classification [60], and 2) deep CNN based techniques [61]. Both types use full-body or parts-based images of the pedestrian as input for gender classification. The traditional and deep CNN approaches for pedestrian gender recognition have shown promising results individually.
There are still classical research issues to be addressed, for instance, discriminative feature representation, lower classification accuracy, and SSS for model learning. Traditional approaches are effective at examining the low-level information of an image, whereas deep CNN based features are more appropriate in the case of large variations in pedestrian poses, illumination changes, and diverse appearances. Hence, deep learning based approaches have stronger feature representation capabilities for the given images. A detailed description of both techniques is provided in chapter 2.
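As an illustration of obtaining such deep representations from an already trained CNN, the following minimal sketch assumes PyTorch and torchvision (version 0.13 or later); the ResNet-50 backbone and the file name are assumptions for demonstration only, not the exact models used later in this thesis.

# Minimal sketch of deep feature extraction from a pre-trained CNN.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()       # drop the ImageNet classifier, keep 2048-D pooled features
backbone.eval()

preprocess = weights.transforms()        # resize/normalize exactly as the backbone expects

@torch.no_grad()
def deep_features(image_path):
    """Return a 2048-D deep feature vector for one pedestrian crop."""
    img = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    return backbone(img).squeeze(0).numpy()

# feats = deep_features('pedestrian_001.jpg')   # hypothetical file name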
Pedestrian analysis has various applications, including human activity analysis, person retrieval, area monitoring, and people behavior analysis in the domains of robotics, entertainment, security, and visual surveillance. More specifically, the person ReID task has a wide range of practical applications such as cross-camera person tracking, tracking by detection, and long-term human behavior and activity analysis, whereas the task of pedestrian gender classification is useful for population statistics, demographic statistics, and population prediction.
People's appearance may be similar or different in terms of color, texture, carrying items, body size, clothing, and style. The case of similar appearance becomes a challenge in gender classification. For instance, it is hard to recognize the gender of a person in a crowd when people are wearing uniforms, as the similarity of two individuals increases the risk of false positives. According to the relevant literature, invariant information such as color, shape, and texture successfully handles the physical variations of an individual. This problem is significant but not critical, whereas the case of different appearances can be regarded as a hint for pedestrian ReID.
1.3.2 Illumination Variations
1.3.3 Viewpoint Variations
Different shapes can be observed with different aspect ratios, depending on the angle of the camera. For the gender classification task, this is a crucial problem, because a variation in viewpoint implies a variation in the appearance and size of persons. Similarly, this issue is also important for person ReID. For instance, the perceived properties under the CAM1 FoVs may not be available under the CAM2 FoVs; hence, a person can be perceived with different sizes and aspect ratios, which may cause failure in person ReID.
1.3.4 Pose Variations
Different poses can be observed depending on the direction in which people are moving, where a pose variation implies a variation in appearance. For person ReID, this is a challenging issue. For instance, the perceived direction under the CAM1 FoVs may not be available under the CAM2 FoVs; therefore, a person can appear at a different angle, causing failure of person ReID. Similarly, in gender classification, the left, right, back, and side views of a pedestrian are more challenging compared to frontal views.
1.3.5 Body Shape Deformations
Deformations in body shape may negatively affect the output when a task is highly dependent on a person's structure. As appearance based methods rely on the full-body image of a person, this problem influences both tasks. For gender classification, local and global features based on shape characteristics provide a way to handle this issue, but they fail when some important clues are not found due to the non-availability of some body parts in the given images. For the person ReID task, the extracted/learned properties of a standing person in CAM1 may be insufficient if the person appears in CAM2 at a different angle. In this scenario, a few features of interest may be observed at a different location than those extracted/learned, which may cause difficulty in person matching.
1.3.6 Occlusion
Sometimes people are partially or completely occluded due to: 1) objects or items they are carrying, 2) overlap with a person of interest or an object, and 3) environmental structures that are permanently/temporarily present under specific FoVs. For both tasks, this problem influences the performance in the case of partially occluded pedestrian images. Gender prediction becomes a difficult task and relies highly on the selected structures to extract reliable information for pedestrian gender classification. For example, if all the required information is available/visible outside the occluded part of an input image, gender classification can still be performed; if only a few properties are visible, performance depends on how informative the remaining cues are. For the person ReID task, this issue also affects recognition rates in the case of partial occlusion, whereas the full occlusion case is not considered for ReID because of the non-availability of the target person in another camera. Person ReID is performed only when the target person returns and is observed under different FoVs. Occlusion is a key issue in ReID, and if important information is missing due to a partially occluded image, person ReID suffers.
1.3.7 Camera Settings and Color Response
This problem concerns both the prediction of gender and person ReID. The state-of-the-art gender classification and person ReID methods use the posteriori mode of processing, which is highly dependent on camera settings such that the images are captured/stored with different qualities (e.g. low, medium, etc.). In this scenario, the camera response is one of the most critical acquisition conditions under different camera settings. In a multi-camera network, it is not guaranteed that all the cameras are of the same model, and even if they are, the sensors can have small or significant variations in their response. In this situation, a different color response is a key issue that arises under different camera models or settings. For example, the same person with the same clothes can be rendered in different ways by two different cameras. The aforementioned challenges have a significant impact on the performance of both pedestrian gender classification [45, 48, 49] and person ReID [62-64] tasks. In addition, overall and class-wise accuracies are decreased due to the imbalanced distribution of data, whereas model learning is another problem when applying SSS datasets in both tasks. Figure 1.6 shows sample images of pedestrians where these challenges can be observed more clearly.
powerful and rich features for person ReID and pedestrian gender classification with
improved performance [15, 45, 46, 48, 49, 74-76].
environmental effects is an important and necessary step. In addition, data augmentation is performed to handle the class-imbalance problem in the classification task.
b) Feature extraction, selection, and fusion play an important role in pedestrian image analysis. In this thesis, handcrafted features are used along with deep features, as follows.
- In the handcrafted features, color (HSV histogram and singular value decomposition), texture (local extrema patterns), shape (HOG and PHOG), and local maximal occurrence (LOMO) [77] features are extracted from each image using the RGB, HSV, and grayscale color spaces.
- Features-based clustering is utilized based on the selected optimal handcrafted features, where the whole gallery is split into k clusters to optimize the searching process against each probe image; consequently, each gallery image has a learned relationship with the kth cluster.
- The deep features are extracted using pre-trained CNN models such as AlexNet, ResNet101, VGG19, InceptionResNetV2, and DenseNet201. The deeply learned features effectively handle the inherited ambiguities in pedestrian appearance(s) and provide better generalization capability for an input image.
- The optimal features are selected using maximum entropy and PCA methods. To create a single robust feature vector, the selected features are fused using serial and parallel based fusion methods.
c) In the classification step, handcrafted (color, texture, shape, and LOMO) and deep (CNN) features are supplied to machine learning classifiers (k-nearest neighbor (KNN), SVM, ensemble, and discriminant analysis) for pedestrian gender classification. In the case of person ReID, these features are supplied to the Quadratic-Chi (QC) cross-bin histogram distance measure to find an accurate match for the probe image from the selected cluster instead of the whole gallery. A minimal illustrative sketch of this fusion-selection-classification pipeline is given after this list.
d) The experimental results of all methods are based on improving the abovementioned major steps over existing methods, thereby supporting the contributions of this study. In this regard, all methods obtained competitive and better results than state-of-the-art methods in the literature.
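The following minimal sketch illustrates the fusion-selection-classification idea of contributions (b) and (c), assuming the feature matrices are already computed; SelectKBest with mutual information stands in for the entropy-controlled selection described above, and the parameter values are illustrative assumptions.

# Minimal sketch: serial fusion of handcrafted and deep features, selection, and SVM classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

def fuse_and_classify(handcrafted, deep, labels, k_best=200, n_pca=50):
    """Serially fuse two feature blocks, select/compress them, and train an SVM."""
    fused = np.hstack([handcrafted, deep])                       # serial (concatenation) fusion
    selector = SelectKBest(mutual_info_classif, k=min(k_best, fused.shape[1]))
    selected = selector.fit_transform(fused, labels)             # relevance-based feature selection
    pca = PCA(n_components=min(n_pca, selected.shape[1], len(labels)))
    compact = pca.fit_transform(selected)                        # dimensionality reduction
    clf = SVC(kernel='rbf').fit(compact, labels)
    return clf, selector, pca

# Toy usage: 40 samples, 300-D handcrafted and 512-D deep features, binary labels.
rng = np.random.default_rng(1)
hand, deep = rng.normal(size=(40, 300)), rng.normal(size=(40, 512))
y = rng.integers(0, 2, size=40)
model, sel, pca = fuse_and_classify(hand, deep, y)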
1.7 List of Publications
1.8 Thesis Organization
This Ph.D. thesis is organized into five chapters. Figure 1.7 illustrates the organization, and a chapter-wise description is given below:
Chapter 2 gives a review of state-of-the-art techniques for person ReID and pedestrian gender classification using machine learning approaches. This chapter also identifies the research gaps in the existing relevant studies.
Chapter 4 presents the experimental results for the proposed methods. This chapter also includes a detailed discussion and a comprehensive comparison with state-of-the-art studies under three consistent headings: evaluation protocols, dataset presentation, and results evaluation.
Chapter 2 Literature Review
2.1 Introduction
In this chapter, the relevant literature is presented for two different scenarios, namely person ReID and pedestrian gender classification, for appearance based pedestrian analysis. For this reason, the subsequent literature is divided into multiple sections that discuss the related state-of-the-art methods. Figure 2.1 shows the organization of chapter 2: the first section presents literature methods for person ReID, and the second section covers the main developments related to full-body and parts-based approaches for pedestrian gender classification. Both tasks are concluded in the discussion and analysis section. This section-wise review will give the reader a clear insight into the state-of-the-art methods for the given tasks of pedestrian analysis. A summary of the chapter is given in the last section.
Appearance based ReID methods have become a hot topic in the relevant research domain of feature representation. The second category, metric learning based methods, learns an optimal metric from a given set of features; this optimal metric learning process is also called optimal subspace learning. Besides the traditional ways of ReID, deep learning based methods are also used for person ReID in the context of deep feature representation and model learning. In addition, some other methods such as graph and dictionary learning are also used to perform person ReID. The general model for person ReID is shown in Figure 2.2.
The appearance based methods mainly focus on feature representation. In this regard, many efforts have been made towards robust feature representations that distinguish a person's appearance across changes in pose, view, and illumination. The main categories of feature representation include color, texture, shape, saliency, and their combinations. Zhao et al. [78] utilized a probabilistic distribution of salience which is robust under non-overlapping camera views. In another work, the authors used learning of mid-level filters from automatically selected clusters of patches for ReID [79]. Wang et al. [80] presented a model in which discriminative fragments are first selected from highly noisy frames of pedestrians, and then distinct space-time features are extracted to perform person ReID. To handle pose and illumination changes, Geng et al. [81] divided the full-body image of the pedestrian into two parts: upper and lower. Then, parts-based feature engineering is performed to select different features from these parts for preserving salience. Moreover, color information is accurately represented by considering the present regions and their adjacent regions in the salience color descriptor (SCD).
For robust feature representation, Liao et al. [77] introduced LOMO features to fully examine and maximize the occurrence of local features horizontally. An et al. [82] formulated a robust canonical correlation analysis (ROCCA) method to match people from different views in a coherent subspace. Local features are obtained to design a codebook which comprises HOG, LBP, color, and HS histogram features [83]. Similarly, researchers applied multi-model feature encoding, hyper-graph construction, and learning [84] for person ReID. Wang et al. [68] formulated a descriptor for ReID named fusion of multi-channel features (FMCF); in this descriptor, color information is captured from the hue and saturation channels, shape information from grayscale, and texture detail is computed from the value channel. A multi-kernel subspace learning (MKSSL) strategy based method is designed to handle the complex variations issue in person ReID [85].
Cho et al. [86] proposed a method to estimate the target poses; then, four representative features are extracted from four different views: front, back, left, and right. Later, ReID is performed by computing matching scores and weights. Zhang et al. [87] presented the null Foley-Sammon transfer (NFST) method and proposed a supervised model to learn a null discriminative space in order to address the SSS problem in person ReID; they also extended this model for semi-supervised learning. Pedestrian matching is performed using reference descriptors (RDs) produced with a reference set [88]. They projected gallery and probe data with a regularized canonical correlation analysis (RCCA) projection to learn the projection matrix, and an additional re-ranking step is used to increase the recognition rates at different ranks. Furthermore, l2-regularized sparse representation is adopted for matching by [89]; in this approach, the stability of the sparse coefficients is maintained for better performance.
An et al. [19] suggested a method in which the appearance features are integrated with soft biometric features to represent the feature descriptor. Chahla et al. [90] applied prototype formation for person ReID; the presented technique uses color categorization to handle appearance variation in terms of clothing colors. In addition, for the robustness of the feature descriptors, the proposed technique captures the relationship between color channels using a quaternionic local binary pattern (QLBP).
Li et al. [91] proposed a method that extracts salient regions from an image of a person and clusters the extracted regions by applying “log squares log density gradient clustering”. Similarly, Zhang et al. [92] used a color invariant algorithm for ReID to cluster the segmented areas of an image that exploits color for clustering; however, it is not robust against low contrast images. An et al. [72] investigated multiple coding schemes (LBP, HSV, LAB, and SCN) for intermediate data synthesis; they used group sparse representation to produce particular intermediate data. Shah et al. [93] introduced another approach that utilizes hexagonal SIFT and color information in combination with time tree learning algorithms for robust feature representation; however, it is unable to handle pose variations above 60°. Chu et al. [94] suggested a technique that splits the image into local sub-regions considering two directions to overcome the risks of mismatching and pose variations. However, it does not handle illumination changes. Nanda et al. [95] presented a three-fold framework to specifically solve the issue of illumination changes, but it failed to tackle complex backgrounds and diverse variations in pose. Moreover, they extracted color and texture information from segmented parts of the body. Then, inlier-set based ReID is performed [96], which handles partial occlusion, viewpoint, and illumination variations effectively.
Gray et al. [97] considered color and texture channels, binning information, and location for a combination of localized features for recognition. Bazzani et al. [98] introduced the symmetry driven accumulation of local features (SDALF) approach, using symmetric and asymmetric perceptual principles to overcome environmental variances for the appearance based ReID problem. To improve recognition rates, Ye et al. [99] proposed an approach to combine the semantic attributes of body parts with LOMO features. Semantic features also help to improve person ReID recognition rates [100-102]. A few other appearance based methods also exploit the fusion of different features, such as ensembles of color invariant features, salient matching, kernel-based features, camera correlation aware feature augmentation, and learned vs handcrafted features [73, 103-106], for person ReID. Hashim et al. [107] formulated a method named mask-improved symmetry-driven accumulation of local features (MSDALF) using simultaneous image matching based on the stable marriage algorithm (SMA). They also computed the results using greedy matching (GM) and Hungarian matching (HM) in combination with MSDALF. The summary of appearance based methods for person ReID is given in Table 2.1.
Table 2.1: Summary of appearance based methods for person ReID

Ref. | Year | Method | Datasets and Rank-1 results (%)
[89] | | Quantized mean values of Lab and HSV color spaces, SDC, and LBP features are extracted; CCA is used to learn the projected subspace. | VIPeR 32.9, CUHK01 31.3, PRID201 27.0
[19] | | LOMO and soft biometric features are used to re-identify the pedestrian views. | VIPeR 43.9
[90] | 2017 | Color categorization, QLBP, and prototype formation are adopted for single-shot ReID. | VIPeR 28.0
[84] | | Multi-model feature encoding and multi-hyper-graph fusion for ReID. | VIPeR 44.5, GRID 19.8
[72] | 2018 | LBP, HSV, LAB, and SCN feature coding schemes are used to observe the influence of these schemes in the ReID process. | VIPeR 34.5, CUHK01 34.0
[91] | | Salient region feature extraction and clustering. | PRID201 52.3, MARS 62.1
[96] | | Semantic partition is carried out; then hue-weighted saturation, CbCrY, texture, eight Gabor, and thirteen Schmid filters are used. | VIPeR 43.3, iLIDS-VID 35.8, CUHK01 44.5
[99] | 2019 | LOMO features are combined with semantic attributes of the pedestrian image; the impact of upper and lower body parts is effectively utilized. | VIPeR 44.7, QMUL-GRID 25.2, PRID2011 28.2
[73] | | Handcrafted vs learned features are obtained for person ReID. | CUHK01 35.0
[107] | 2020 | HSV color histogram and modified features named MSDALF are extracted; SMA is used with simultaneous matching. | VIPeR 40.4, CUHK01 27.0, iLIDS-VID 34.4, PRID 16.5
The metric learning based person ReID methods aim to acquire a discriminative metric for better performance. Metric learning methods operate on a given feature representation; hence, the learned metric relies heavily on reliable features. In this regard, existing efforts include partial least squares (PLS) [108], probabilistic distance comparison (PRDC) [109], local Fisher discriminant analysis (LFDA) [110], adaptive Fisher discriminant analysis (AFDA) [111], deformable patch based metric learning (DPML) [112], transferred metric learning [113], and relative distance comparison (RDC) [114] approaches, which build reliable feature representations to learn an effective metric for person ReID. For a more stable feature representation against viewpoint changes, LOMO features are extracted and then a cross-view quadratic discriminant analysis (XQDA) metric learning based method is implemented [77]. Xie et al. [115] formulated a method using adaptive metric learning (AML) with probe specific re-ranking (PSR). Furthermore, Liu et al. [116] proposed a method named multi-modality metrics learning (M3L) that attempts to investigate the effect of multi-modality metrics in relation to long-run surveillance scenes for person ReID. Region based metric learning is performed by [117] for person ReID by using positive neighbors from imbalanced unlabeled data. Ma et al. [118] presented a method named triplet-based manifold discriminative distance learning (TMD2L) to learn a discriminative distance, which effectively handles the low illumination issue in person ReID. Feng et al. [119] preferred the second-order Hessian energy function because of its extrapolating power and richer null space, and thus used a metric learning based approach for person ReID. Jia et al. [120] suggested a semi-supervised method to enable view specific transformations and update projections with graph regularization. In addition, Hessian regularized distance metric learning [119], joint dictionary and metric learning [75, 121], and relative distance metric learning (RDML) based on clustering centralization and projection vectors learning (CCPVL) [122] achieved significant performance on publicly available datasets for person ReID. A two-stage metric learning method named QR keep-it-simple-and-straightforward (QR-KISS) [123] is used to investigate the performance of different feature descriptors such as LOMO and the salient color names based color descriptor (SCNCD). To address the cross-domain issue for ReID, Gu et al. [74] proposed topology properties preserved LFDA (TPPLFDA), which reliably projects the cross-domain data into a lower dimensional subspace. Ma et al. [124] split the training data into positive and negative pairs and proposed an extreme metric learning (EML) method based on adaptive asymmetric and diversity regularization for ReID. A few other methods addressed the ReID task by ranking methods [125, 126]. The existing metric learning based methods for person ReID are described in Table 2.2. Furthermore, graph learning based approaches have also achieved significant performance in the ReID process. For example, in dynamic hybrid graph matching (DHGM), the authors utilized content and context information with metric learning for person ReID [76]. An et al. [127] proposed a hyper-graph matching method in which pairwise relationships are discovered in higher order for both gallery and probe sets. Similarly, the authors used multi-model feature encoding, hyper-graph construction, and learning [84] for person ReID.
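To make the idea of learning a Mahalanobis-type metric concrete, the following is a minimal KISS-style sketch in the spirit of this family of methods; it is an illustration only and not a reimplementation of XQDA or any specific approach cited above.

# Minimal metric learning sketch: learn M from the covariances of feature differences over
# positive (same person) and negative pairs, then score pairs with d(x, y) = (x - y)^T M (x - y).
import numpy as np

def learn_kiss_metric(diffs_pos, diffs_neg, eps=1e-6):
    """diffs_* are arrays of feature differences x_i - x_j for positive/negative pairs."""
    cov_pos = np.cov(diffs_pos, rowvar=False) + eps * np.eye(diffs_pos.shape[1])
    cov_neg = np.cov(diffs_neg, rowvar=False) + eps * np.eye(diffs_neg.shape[1])
    return np.linalg.inv(cov_pos) - np.linalg.inv(cov_neg)

def metric_distance(x, y, M):
    d = x - y
    return float(d @ M @ d)

# Toy usage: positive pairs differ by small noise, negative pairs by a large offset.
rng = np.random.default_rng(2)
pos = 0.1 * rng.normal(size=(200, 8))
neg = rng.normal(size=(200, 8)) + 2.0
M = learn_kiss_metric(pos, neg)
print(metric_distance(np.zeros(8), 0.05 * np.ones(8), M) <
      metric_distance(np.zeros(8), 2.0 * np.ones(8), M))   # expected: True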
Table 2.2: Summary of metric learning based approaches for person ReID
Ref. | Year | Method | Datasets and Rank-1 results (%)
[77] | 2015 | HSV and scale invariant local ternary pattern (SILTP) based features with XQDA metric learning. | VIPeR 40.0, GRID 18.9, CUHK03 52.2
[111] | | Clustering is integrated with discriminant analysis to preserve sample diversity and local structure. | SAIVT-SoftBio 44.4, PRID2011 43.0, iLIDS-VID 37.5
[127] | | A common space is learned and mapped; then hyper-graph matching is applied for ReID. | VIPeR 34.1, CUHK01 35.0
[112] | 2016 | Metric learning is performed using local feature representations; pose and occlusion issues are handled by implementing a deformable patch-based model. | VIPeR 41.6, CUHK01 35.8
[115] | | Discriminative information is explored through a better treatment of negative samples. | VIPeR 43.0, PRID450S 62.5
[116] | 2017 | A clustering based multi-modality mining method is used which automatically determines the modalities of illumination changes. | VIPeR 30.2, VIPeR+PRID450S 29.3
[122] | | Clustering centralization and projection vectors are used for person ReID. | VIPeR 16.6, CUHK01 18.5, 3DpeS 24.5, CAVIAR4REID 13.1, Town center 62.2, Market-1501 26.9
[117] | 2018 | Positive regions are assessed using region metric learning for person ReID. | CUHK01 39.6, PRID450D 26.1, VIPeR 25.7, Market-1501 40.5
[118] | | A manifold learning technique is used to preserve the intrinsic geometry structure of low illumination data. | LIPS 57.8, LI-PRID2011 32.4, L1-iLIDS-VID 29.6
[123] | 2019 | Applied a two-stage metric learning method (QR-KISS) with different feature descriptors for ReID. | VIPeR 42.2, PRID450S 57.1, CUHK01 44.9
[119] | | Preferred the second-order Hessian energy function because of its extrapolating power and richer null space; used a metric learning approach for person ReID. | VIPeR 20.3, CUHK01 21.9
[120] | | Labeled and unlabeled data with learned projections build cross-view correspondence for ReID. | VIPeR 44.8, PRID450S 68.2, PRID201 35.4, Market-1501 74.8
[75] | | Utilized dictionary based projection transformation learning for ReID. | VIPeR 42.5, PRID450S 64.1, CUHK03 60.5, QMUL-GRID 22.4
[124] | 2020 | Split the training data into positive and negative pairs, then apply EML for ReID. | VIPeR 44.3, PRID450S 63.5, GRID 19.47
[74] | | Used single and multi-source domains to transfer data and perform cross-domain transfer ReID with TPPLFDA. | VIPeR 42.3, iLIDS-VID 33.8
[76] | | Context and content information is investigated in a graph and metric learning approach for ReID. | PRID2011 76.9, iLIDS-VID 36.8
utilized to highlight common patterns. The summary of existing deep learning based methods for person ReID is presented in Table 2.3. In addition, numerous CNN based trained models are available that provide better discriminative features. For instance, the authors in [131] proposed a deep features body and part (DFBP) based method in a sensor network to obtain discriminative features. Zhang et al. [132] introduced a local embedded deep feature (LEDF) based method for feature learning by extracting local and holistic summing maps. Similarly, Huang et al. [133] proposed a model where the human body is used for deep feature learning. All these deep learning methods compute the likeness scores of two input images without any explicit feature extraction process; thus, they have the potential to handle inherited appearance changes. For semi-supervised feature representation, Xin et al. [134] presented a method using multiple CNN models for clustering, called multi-view clustering, for accurate label estimation. To address the image-to-video person ReID problem, feature extraction and similarity measurement are studied separately in [135]. Yuan et al. [136] proposed a deep multi-instance learning (DMIL) based approach to handle complex variations in pedestrian appearances. Ordinal deep features are used to measure the distance in [137]; this method achieved a better recognition rate under diverse viewpoints. Hu et al. [138] combined DenseNet convolutional and batch normalization layers to design a lightweight image classifier (LWC). Recently, pre-trained models such as ResNet50, AlexNet, VGG16, and InceptionV4 were used, attaining 47.1%, 25.4%, 37.2%, and 43.5% rank-1 recognition rates on the CUHK01 dataset [139]. Similarly, results are computed on other datasets such as CUHK03, VIPeR, PRID, Market-1501, iLIDS-VID, and 3DPeS. As an example, rank-1 recognition rates on the CUHK01, iLIDS-VID, and VIPeR datasets are reported in Table 2.3. A semantics aligning network (SAN) [64] is proposed which comprises an encoder and a decoder; the encoder is used for ReID and the decoder is applied to reconstruct dense semantics. A pooling fusion block is added to the proposed pose invariant convolutional baseline (PICB) model to learn distinct features for ReID [140]. Zhang et al. [43] used limited labeled data to learn cross-view features in the suggested approach of a similarity embedded cycle generative adversarial network (SECGAN). Moreover, researchers discussed a cross-modal feature generating and target information preserving transfer network (CMGTN) for person ReID [141]. Lv et al. [142] utilized learned spatio-temporal patterns with visual features in a Bayesian fusion for unsupervised cross-domain person ReID.
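To illustrate the likeness-score idea behind such Siamese-style models, the following is a minimal sketch assuming PyTorch; the tiny encoder, its layer sizes, and the random inputs are illustrative assumptions, not the architectures reviewed above.

# Minimal sketch of Siamese-style likeness scoring: one shared encoder embeds both images and
# the match score is their cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # global pooling -> (N, 32, 1, 1)
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

def likeness_score(encoder, img_a, img_b):
    """Cosine similarity between the embeddings of two pedestrian crops (higher = more alike)."""
    with torch.no_grad():
        za, zb = encoder(img_a), encoder(img_b)
    return F.cosine_similarity(za, zb).item()

encoder = SharedEncoder().eval()
a, b = torch.randn(1, 3, 128, 64), torch.randn(1, 3, 128, 64)   # two random "crops"
print(likeness_score(encoder, a, b))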
Table 2.3: Summary of deep learning based approaches for person ReID
Ref. | Year | Method | Datasets and Rank-1 results (%)
[130] | 2016 | Siamese CNN model for ReID. | Market-1501 72.9, CUHK01 63.9, VIPeR 36.2
[135] | 2017 | Built an end-to-end deep CNN model to integrate feature and metric learning for ReID. | iLIDS-VID 40.0, PRID2011 73.3, MARS 55.2
[136] | | Convolutional feature representation is presented using multi DMIL. | VIPeR 31.1, ETHZ 77.7, CUHK01 31.8
[142] | 2018 | Utilized spatio-temporal patterns and extracted visual features in the ReID process. | GRID 64.1, Market-1501 73.1
[137] | | Deep feature ranking is implemented for ReID. | CUHK01 44.3, Market-1501 83.1, Duke 73.4
[138] | 2019 | Combined DenseNet convolutional and batch normalization layers to design LWC for person ReID. | VIPeR 44.3, CUHK01 (as source dataset) 46.7
[139] | | Multiple CNN models are trained to re-identify the pedestrian image. | CUHK01 47.1, VIPeR 23.2, iLIDS 36.2
[64] | | Addressed the semantic misalignment problem in large-scale datasets for person ReID. | CUHK03 80.1, Market-1501 96.1, MSMT17 79.2, Duke 87.9
[43] | 2020 | Learned cross-view features using a cycle GAN. | VIPeR 40.2, Market-1501 62.3
[141] | | Designed a deep model to extract cross-modal features for person ReID while incorporating a target preserving loss into the model. | iLIDS-VID 38.4, MARS 41.2, PRID2011 42.7
Figure 2.3: Pipeline for full-body appearance and face-based gender classification
effectively utilized low-level information of the input image for object and place recognition [159]. Later, the computed low-level features are supplied to classifiers, for instance SVM [59, 160] and linear discriminant analysis (LDA) [161], for learning a discriminative model. Besides these approaches, CNN models attained superior results in full-face based gender classification methodologies. For instance, Duan et al. [162] presented an ensemble of features based approach which consists of a CNN and an extreme learning machine (ELM) for gender and age classification. A few other approaches that performed gender classification using different deep CNN architectures are discussed in [163-165]. The aforementioned discussion concludes that capturing a clear full-face image in a real environment is a difficult task. Also, face-based gender recognition fails when the camera acquires an image from the left, right, or backside of the pedestrian. Alternatively, body-parts/full-body views of a pedestrian can be more useful for gender classification compared to face-based gender classification. In parallel, capturing full-body appearances (images) of pedestrians under specific FoVs in multi-camera networks is comparatively easy.
The existing approaches also consider different parts of the body, such as the legs, torso, and upper and lower body, to classify gender. In this regard, Cao et al. [166] proposed a part-based gender recognition (PBGR) method using body patches and their characteristics. They utilized raw pixels, an edge map, and HOG to characterize the gender image. In addition, fine grid sampling is applied to partition the gender image into patches, and HOG features of each patch are extracted. Later, the patch-based features are combined and supplied to AdaBoost and random forest (RF) classifiers for gender prediction. As a result, PBGR outperformed the individual feature-classifier combinations in terms of mean accuracy (M-ACC) for frontal, back, and non-fixed (frontal and back) view images, respectively. Moreover, they also presented results for different feature and classifier combinations separately. In all experiments, this approach adopted 5-fold cross-validation to measure the M-ACC for restricted views of gender, namely frontal, back, and mixed/non-fixed views. It is noted that the researchers did not consider gender images with side views in their experiments, and it was the first attempt where patch-wise computed features were fed to an AdaBoost classifier to investigate silhouette characteristics for pedestrian gender classification. Raza et al. [145] trained a CNN model by utilizing upper-body foreground images. As a result, supplying upper body images to the model produced slightly better overall accuracy (O-ACC) and M-ACC on frontal, rear, and mixed view images, respectively. Ng et al. [71] introduced a parts-based CNN model which investigates different regions corresponding to the upper, middle, and lower halves of the pedestrian body image. Extensive experiments are performed to increase the classification accuracy, and better results are obtained because of the combination of CNN based global and local information. However, training the CNN network for each region separately introduces overhead. Yaghoubi et al. [49] suggested a region-based CNN methodology for gender prediction on the PETA and MIT datasets. They used body key points to estimate the gender pose, whereas head, legs, and torso based information is supplied to learn appropriate classification models. The authors also utilized a score fusion strategy, which produced better results on the PETA dataset compared to networks trained on raw, head, and polygon images. They suggested that the segmented full-body image of the pedestrian is suitable for gender prediction. This technique has also been evaluated on the MIT dataset with the same settings, achieving better accuracies on frontal, rear, and mixed view images. The fine-tuned models rely heavily on the extracted regions of interest and segmented body parts; for images with complex backgrounds, performance may therefore decrease. Parts-based gender classification methods are summarized in Table 2.4.
Table 2.4: Summary of parts-based approaches using frontal, back, and mixed view images of a pedestrian for gender classification; (-) indicates that no reported result is available

Ref. | Year | Method | Dataset | Frontal | Back | Mixed
[166] | 2008 | Raw information with AdaBoost | MIT | 71.7 M-ACC | 61.5 M-ACC | -
[166] | 2008 | Edge map features with AdaBoost | MIT | 71.1 M-ACC | 70.1 M-ACC | -
[166] | 2008 | HOG features with AdaBoost | MIT | 70.9 M-ACC | 63.0 M-ACC | -
[166] | 2008 | Parts-based features with AdaBoost | MIT | 71.9 M-ACC | 71.2 M-ACC | -
[166] | 2008 | Raw information with RF | MIT | 72.8 M-ACC | 60.2 M-ACC | -
[166] | 2008 | Edge map features with RF | MIT | 73.1 M-ACC | 63.9 M-ACC | -
[166] | 2008 | HOG features with RF | MIT | 73.1 M-ACC | 65.8 M-ACC | -
[166] | 2008 | Part-based features with RF | MIT | 73.2 M-ACC | 65.4 M-ACC | -
[166] | 2008 | PBGR with HOG and AdaBoost | MIT | 76.0 M-ACC | 74.6 M-ACC | 75.0 M-ACC
[145] | 2017 | Upper body image and deep CNN model | MIT | 83.3 O-ACC, 80.5 M-ACC | 82.3 O-ACC, 82.3 M-ACC | 82.8 O-ACC, 81.4 M-ACC
[71] | 2019 | Model with CNN-1 | MIT & APiS mixed | 79.5 M-ACC | 85.0 M-ACC | 82.5 M-ACC
[71] | 2019 | Model with CaffeNet | MIT & APiS mixed | 80.0 M-ACC | 87.0 M-ACC | 83.8 M-ACC
[71] | 2019 | Model with VGG-19 | MIT & APiS mixed | 84.4 M-ACC | 88.9 M-ACC | 86.8 M-ACC
[49] | 2019 | Raw images supplied to base-net | PETA | 89.1 M-ACC | 89.9 M-ACC | 75.9 M-ACC
[49] | 2019 | Raw images supplied to frontal-net, rear-net, and lateral-net | PETA | 90.5 M-ACC | 93.0 M-ACC | 77.2 M-ACC
[49] | 2019 | Head images supplied to frontal-net, rear-net, and lateral-net | PETA | 88.7 M-ACC | 90.1 M-ACC | 77.3 M-ACC
[49] | 2019 | Polygon images supplied to frontal-net, rear-net, and lateral-net | PETA | 91.2 M-ACC | 91.4 M-ACC | 76.0 M-ACC
[49] | 2019 | Fusion strategy | PETA | 92.1 M-ACC | 93.5 M-ACC | 80.1 M-ACC
[49] | 2019 | Pedestrian gender recognition network (PGRN) using raw images, head-cropped and polygon-shape regions | MIT | 90.0 M-ACC | 87.9 M-ACC | 89.0 M-ACC
This section discusses the approaches that perform gender classification using the full-body appearance of a pedestrian as input. In this scenario, wearing items, body shape, hairstyle, shoes, and coats are taken as strong cues when recognizing appearance based pedestrian gender. In short, full-body appearance based pedestrian gender recognition approaches are mainly split into two groups: i) traditional approaches that adopt feature extraction followed by classification [60], and ii) deep CNN based approaches [61]. According to the relevant literature, deep CNNs have attained reliable results in many areas of computer vision under diverse appearance conditions and different camera settings [167]. The traditional approaches compute different types of features, such as color, shape, texture, and deeply learned information, to represent gender images. For example, Collins et al. [168] proposed a gender classification technique to examine different local features and their combinations with a linear SVM. They first cropped images of the MIT dataset to a size of 106 × 45 and then performed an investigation on cropped and un-cropped images of size 128 × 64 for gender prediction. The proposed strategy revealed average results on the un-cropped MIT dataset in terms of O-ACC using pyramid histogram of words (PHOW), PHOG, canny HOG, local HSV (LHSV) color histogram, and pixel PHOG features, respectively. The authors utilized the VIPeR dataset for the first time for gender prediction and obtained satisfactory performance in terms of class-wise accuracy (CW-ACC) and O-ACC on frontal view images of pedestrians. However, the feature combinations CHOG+LHSV and PiHOG+LHSV attained better O-ACC on both datasets. Table 2.5 summarizes the results in terms of CW-ACC and O-ACC on the MIT and VIPeR datasets for gender classification.
Table 2.5: Summary of existing methods' results using full-body frontal view images of pedestrians for gender classification with 5-fold cross-validation (male = 123, female = 123 un-cropped and cropped images of the MIT dataset; male = 292, female = 291 un-cropped images of the VIPeR dataset)

Ref. | Year | Method | Dataset (image type) | CW-ACC Male (%) | CW-ACC Female (%) | O-ACC (%)
[168] | 2009 | PHOW with SVM | MIT (un-cropped) | 65.3 | 72.4 | 64.0
[168] | 2009 | PHOG with SVM | MIT (un-cropped) | 56.0 | 68.0 | 56.9
[168] | 2009 | CHOG with SVM | MIT (un-cropped) | 70.7 | 81.0 | 74.2
[168] | 2009 | LHSV with SVM | MIT (un-cropped) | 73.1 | 54.0 | 60.9
[168] | 2009 | PiHOG with SVM | MIT (un-cropped) | 80.2 | 83.0 | 79.1
[168] | 2009 | CHOG+LHSV with SVM | MIT (un-cropped) | 80.2 | 82.6 | 75.8
[168] | 2009 | PiHOG+LHSV with SVM | MIT (un-cropped) | 81.1 | 88.7 | 80.2
[168] | 2009 | PHOW with SVM | MIT (cropped) | 70.3 | 63.0 | 63.7
[168] | 2009 | PHOG with SVM | MIT (cropped) | 54.0 | 70.7 | 60.0
[168] | 2009 | CHOG with SVM | MIT (cropped) | 73.8 | 77.5 | 72.5
[168] | 2009 | LHSV with SVM | MIT (cropped) | 74.4 | 68.9 | 66.5
[168] | 2009 | PiHOG with SVM | MIT (cropped) | 82.2 | 86.5 | 81.7
[168] | 2009 | CHOG+LHSV with SVM | MIT (cropped) | 77.4 | 87.3 | 76.3
[168] | 2009 | PiHOG+LHSV with SVM | MIT (cropped) | 78.8 | 95.9 | 84.1
[168] | 2009 | PHOW with SVM | VIPeR (un-cropped) | 75.1 | 63.0 | 66.1
[168] | 2009 | PHOG with SVM | VIPeR (un-cropped) | 72.1 | 50.5 | 57.9
[168] | 2009 | CHOG with SVM | VIPeR (un-cropped) | 74.5 | 80.9 | 77.4
[168] | 2009 | LHSV with SVM | VIPeR (un-cropped) | 77.8 | 74.8 | 73.2
[168] | 2009 | PiHOG with SVM | VIPeR (un-cropped) | 80.1 | 80.1 | 78.4
[168] | 2009 | CHOG+LHSV with SVM | VIPeR (un-cropped) | 80.5 | 82.9 | 78.5
[168] | 2009 | PiHOG+LHSV with SVM | VIPeR (un-cropped) | 84.3 | 88.8 | 83.1
Guo et al. [60] presented an approach to better handle pose changes (frontal and rear view changes) for gender classification. They used biologically inspired features (BIF) with two bands and four orientations, integrated with various manifold learning methods. This approach utilized orthogonal locality preserving projections (OLPP) based 117, PCA based 300, locality sensitive discriminant analysis (LSDA) based 150, and marginal Fisher analysis (MFA) based 117 features with a linear SVM and 5-fold cross-validation for view-based (frontal, back, and mixed) gender recognition. Geelen et al. [169] utilized low-level information of the full-body image and extracted different features such as HSV, LBP, and HOG. They also examined different combinations of these features for gender classification using two supervised methods, SVM and RF. Antipov et al. [47] presented a technique to examine familiar, unfamiliar, and cross-dataset settings for gender prediction. They utilized HOG features and an SVM classifier with a linear kernel, which produced poor results in terms of area under the curve (AUC) and M-ACC on the unfamiliar dataset compared to the familiar and cross-dataset settings. From the relevant literature, the year-wise main developments in the context of traditional approaches for full-body appearance based pedestrian gender classification are summarized in Table 2.6.
Table 2.6: Summary of gender classification methods using handcrafted features and different classifiers with full-body frontal, back, and mixed view images; (-) indicates that no reported result is available

Ref. | Year | Method | Dataset | Frontal | Back | Mixed
[60] | 2009 | BIF, PCA with linear SVM | MIT | 79.1 O-ACC | 82.8 O-ACC | 79.2 O-ACC
[60] | 2009 | BIF, OLPP with linear SVM | MIT | 78.3 O-ACC | 82.8 O-ACC | 77.1 O-ACC
[60] | 2009 | BIF, LSDA with linear SVM | MIT | 79.5 O-ACC | 84.0 O-ACC | 78.2 O-ACC
[60] | 2009 | BIF+MFA with linear SVM | MIT | 79.1 O-ACC | 81.7 O-ACC | 75.2 O-ACC
[60] | 2009 | Gender recognition in each view, where BIF+LSDA is used for the frontal and back views and BIF+PCA for the mixed view images | MIT | - | - | 80.6 O-ACC
[169] | 2015 | HOG with linear SVM | MIT | 81.2 O-ACC, 76.9 M-ACC | 77.5 O-ACC, 72.6 M-ACC | 78.9 O-ACC, 75.9 M-ACC
[169] | 2015 | LBP with linear SVM | MIT | 78.4 O-ACC, 73.5 M-ACC | 77.7 O-ACC, 71.9 M-ACC | 76.1 O-ACC, 68.5 M-ACC
[169] | 2015 | HSV with linear SVM | MIT | 69.4 O-ACC, 65.0 M-ACC | 71.5 O-ACC, 63.5 M-ACC | 71.3 O-ACC, 64.8 M-ACC
[169] | 2015 | LBP and HSV with linear SVM | MIT | 77.6 O-ACC, 73.9 M-ACC | 78.3 O-ACC, 72.7 M-ACC | 77.6 O-ACC, 73.7 M-ACC
[169] | 2015 | HSV and HOG with linear SVM | MIT | 81.6 O-ACC, 79.0 M-ACC | 80.5 O-ACC, 74.1 M-ACC | 80.9 O-ACC, 75.3 M-ACC
[169] | 2015 | HOG and LBP with linear SVM | MIT | 81.2 O-ACC, 76.6 M-ACC | 80.3 O-ACC, 75.5 M-ACC | 79.8 O-ACC, 76.6 M-ACC
[169] | 2015 | HOG, LBP and HSV with linear SVM | MIT | 81.0 O-ACC, 73.9 M-ACC | 82.7 O-ACC, 79.3 M-ACC | 80.1 O-ACC, 76.7 M-ACC
[47] | 2015 | HOG features with SVM | Cross-dataset | - | - | 88 AUC, 80.0 M-ACC
[47] | 2015 | HOG features with SVM | Familiar | - | - | 84 AUC, 72 M-ACC
[47] | 2015 | HOG features with SVM | Unfamiliar | - | - | 64 AUC, 82.0 M-ACC
This model utilizes two convolution layers to obtain feature maps, a sub-sampling layer for downsampling, and a fully connected (FC) layer before the output layer for classification. According to the model settings, a CNN is trained for gender prediction on the MIT dataset and achieves an acceptable O-ACC. The model performs better for small-sized homogeneous datasets. Ng et al. [182] investigated full-body pedestrian images with different color spaces, including grayscale, RGB, and YUV, for image representation. Later on, these representations are fed separately to a CNN as input to train a model for gender classification. The method obtained promising results on the MIT dataset with the grayscale image representation compared to RGB and YUV. In another study, the authors investigated a training strategy [183] for gender prediction; this strategy, followed by pre-training, showed good results despite the limitation of a small amount of labeled training data. Antipov et al. [47] presented a method to investigate handcrafted and learned feature extraction schemes for gender classification using SSS familiar, unfamiliar, and cross datasets. This study showed that both feature extraction schemes performed equally on the SSS homogeneous dataset, but learned features gave better results on unfamiliar datasets. Raza et al. [145] initially examined pedestrian parsing on pedestrian images using deep de-compositional neural networks (DDNN). The technique produced two types of parsed images having foreground views. Later on, the parsed images of pedestrians are given to the proposed CNN model for view-wise gender classification. In another study, the same research group extended the previous work and utilized a stacked sparse auto-encoder (SSAE) [46] for gender classification based on the full-body appearance of the pedestrian. Initially, a HOG based feature map is computed and then supplied to the pedestrian parsing phase to create a silhouette of the input image. The output of the parsing phase, in the form of a binary parsed pedestrian map, is fed to a two-layer SSAE model followed by a softmax classifier to predict the pedestrian image as male or female. Cai et al. [45] presented a hybrid technique named HOG assisted deep feature learning (HDFL), which examines deeply learned features and weighted HOG features for gender classification. According to the relevant literature, HDFL has achieved the best results in terms of AUC. In another study, they presented a fusion method called the deep-learned and handcrafted feature fusion network (DHFFN) [15]; the authors achieved significant results by combining handcrafted and deep characteristics of an input image. Further, to overcome variations in scene and viewpoint for PGC, they also proposed a cascading scene and viewpoint feature learning (CSVFL) method and showed considerable results with the combination of deep and handcrafted characteristics of an input image [69]. Ng et al. [71] introduced a methodology to investigate the full body of a pedestrian for gender prediction. Extensive experiments are performed using CNN-1, CNN-2, CNN-3, CaffeNet, and VGG-19 to increase the classification accuracy, and better results are attained with the VGG-19 model. From the relevant literature, the year-wise main developments in the context of deep learning and hybrid approaches for full-body appearance based pedestrian gender classification are summarized in Table 2.7.
Table 2.7: Summary of deep learning and hybrid methods for pedestrian gender classification with full-body frontal, back, and mixed view images; (-) indicates that no reported result is available

Ref. | Year | Method | Dataset | Frontal | Back | Mixed
[61] | 2013 | Deep CNN | MIT | - | - | 80.4 O-ACC
[182] | 2013 | Deep CNN, grayscale, RGB, YUV | MIT | - | - | 81.5 O-ACC
[47] | 2015 | Deep learning (AlexNet-CNN) | Cross-dataset | - | - | 90 AUC, 82.0 M-ACC
[47] | 2015 | Deep learning (AlexNet-CNN) | Familiar | - | - | 91 AUC, 85.0 M-ACC
[47] | 2015 | Deep learning (AlexNet-CNN) | Unfamiliar | - | - | 85 AUC, 79.0 M-ACC
[47] | 2015 | Deep learning (Mini-CNN) | Cross-dataset | - | - | 86 AUC, 79.0 M-ACC
[47] | 2015 | Deep learning (Mini-CNN) | Familiar | - | - | 88 AUC, 80.0 M-ACC
[47] | 2015 | Deep learning (Mini-CNN) | Unfamiliar | - | - | 80 AUC, 75.0 M-ACC
[183] | 2017 | Deep CNN-e configuration | MIT | - | - | 81.5 O-ACC
[145] | 2017 | Full-body image is forwarded to CNN model | MIT | 82.1 O-ACC, 81.1 M-ACC | 81.3 O-ACC, 81.7 M-ACC | 82.0 O-ACC, 80.7 M-ACC
[45] | 2018 | Deep CNN, HOG assisted features | PETA | - | - | 95 AUC
[15] | 2018 | Deep CNN, HOG, PCA | PETA | - | - | 95 AUC
[46] | 2018 | SSAE | PETA | - | - | 92 AUC
[46] | 2018 | SSAE | MIT | 82.9 O-ACC, 80.4 M-ACC, 89 AUC | 81.8 O-ACC, 80.8 M-ACC, 90 AUC | 82.4 O-ACC, 81.6 M-ACC, 89 AUC
[71] | 2019 | VGG-19 model | MIT & APiS mixed | - | - | 85.38 M-ACC
[69] | 2021 | Scene and viewpoint feature learning | MIT | 84.4 O-ACC | 85.9 O-ACC | 85.2 O-ACC
[69] | 2021 | Scene and viewpoint feature learning | VIPeR | 81.9 O-ACC | 84.7 O-ACC | 80.1 O-ACC
[69] | 2021 | Scene and viewpoint feature learning | PETA | 92.4 O-ACC | 94.6 O-ACC | 92.7 O-ACC
The traditional and deep CNN approaches for pedestrian gender recognition have shown promising results individually, but they still face classical issues, for example, discriminative feature representation, lower classification accuracy, and SSS for model learning. Traditional approaches are effective at examining the low-level information of an image, whereas deep CNN based features are more appropriate in the case of large variations in pedestrian poses, illumination changes, and diverse appearances, and they have stronger feature representation capabilities for a given image. In addition, a fusion strategy effectively characterizes the image information for gender classification. However, even with a fusion strategy, the low-level information of an input image alone is not sufficient to handle issues such as low image resolution, viewpoint changes, and diversity in pedestrian appearances, so considering high-level information is crucial. Moreover, SSS datasets have a class-wise imbalanced distribution of data, which directly affects the classification performance in terms of O-ACC as well as the accuracy of the minority class. Considering these challenges, there is still a need to design a robust method that effectively automates the pedestrian gender recognition process. In this regard, two methods named J-LDFR and PGC-FSDTF are proposed for gender prediction, in which investigations are performed using different low-level feature extraction schemes and the deeply learned features of already trained CNN models. Feature selection and fusion methods are also incorporated in the proposed methodologies for dimensionality reduction, optimal feature selection, and compact representation. The first PGC method is tested on large-scale and small-scale datasets, whereas the second PGC method is evaluated on thirteen imbalanced, augmented balanced, and customized balanced datasets; both methodologies successfully attain competitive performance with existing methods.
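The evaluation metrics reported throughout this chapter can be made concrete with the following minimal sketch, which interprets M-ACC as the mean of the class-wise accuracies (an assumption consistent with the tables above) and shows how O-ACC can remain high on imbalanced data while the minority class suffers.

# Minimal sketch of the O-ACC / CW-ACC / M-ACC metrics on an imbalanced toy dataset.
import numpy as np

def accuracies(y_true, y_pred, classes=(0, 1)):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    o_acc = np.mean(y_true == y_pred)                                   # overall accuracy
    cw_acc = [np.mean(y_pred[y_true == c] == c) for c in classes]       # class-wise accuracies
    m_acc = float(np.mean(cw_acc))                                      # mean (balanced) accuracy
    return o_acc, cw_acc, m_acc

# Imbalanced toy example: 90 male and 10 female samples; the classifier predicts only "male".
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
print(accuracies(y_true, y_pred))   # O-ACC = 0.90, CW-ACC = [1.0, 0.0], M-ACC = 0.50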
2.5 Summary
This chapter provides a comprehensive review of the existing work for both research tasks of pedestrian analysis and covers state-of-the-art methods along with recent developments in person ReID and pedestrian gender classification. Similarly, feature extraction/learning, feature selection, classification, and matching related to various benchmark techniques for both tasks are described and tabulated with the evaluation metrics and datasets used in experimentation. After an extensive study of the literature, we have developed three methodologies for appearance based pedestrian analysis, as discussed in chapter 3.
Chapter 3 Proposed Methodologies
3.1 Introduction
This chapter presents the proposed methodologies for appearance based pedestrian analysis using image processing and machine learning approaches. The prime concern of these methodologies is to re-identify and classify pedestrian full-body images using normal and low-quality images with improved performance. Figure 3.1 shows the organization of the chapter, the three proposed methodologies, and their highlights. In this research work, the full-body appearance of a pedestrian is analyzed using the proposed methodologies for person ReID and pedestrian gender classification.
All the proposed methodologies are listed below, and their details are discussed in the following sections of this chapter.
- A method for person ReID with features-based clustering and deep features.
- A method for pedestrian gender classification using joint low-level and deep CNN feature representation.
- A method for pedestrian gender classification on imbalanced and SSS datasets using parallel and serial fusion of selected deep and traditional features.
The objective of the feature representation module is to extract color, texture, and shape features from each gallery image; once passed to the FFS method, these features generate the OFS without altering or losing vital features. The purpose of the feature clustering module is to split the gallery images into k consensus clusters in an efficient manner. In addition, it is responsible for extracting the deep features of each clustered sample to handle the inherited ambiguities of person appearance(s). Finally, the feature matching module aims to find an accurate match for a probe image from the filtered gallery subset (consensus cluster). A detailed description of these modules is given in the following subsections.
Figure 3.2: FCDF framework for person ReID consisting of three modules, where a)
feature representation module is applied to compute different types of
features from R, G, B, H, S, and V channels, and then optimal features are
selected using novel FFS method, b) feature clustering module is used to
split whole gallery into different consensus clusters for gallery search
optimization, whereas deep features of each cluster sample are also
examined, and c) feature matching module includes classification of
corresponding cluster(s), probe deep features and finally similarity
measure is applied to obtain recognition rates at different ranks
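The gallery-search-optimization idea behind the clustering and matching modules can be sketched as follows; KMeans and a nearest-neighbour search within the predicted cluster stand in here for the consensus clustering and SVM based cluster selection of the FCDF framework, so this is an illustrative simplification only.

# Minimal sketch: cluster the gallery features, then match a probe only inside its cluster.
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(gallery_feats, k=4, seed=0):
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(gallery_feats)

def match_in_cluster(probe_feat, gallery_feats, km):
    """Restrict the search to the probe's cluster and return the best-matching gallery index."""
    cluster = km.predict(probe_feat[None, :])[0]
    members = np.where(km.labels_ == cluster)[0]                  # gallery subset for this cluster
    dists = np.linalg.norm(gallery_feats[members] - probe_feat, axis=1)
    return members[np.argmin(dists)]

# Toy usage: 200 gallery vectors, one probe that is a slightly perturbed copy of gallery image 17.
rng = np.random.default_rng(3)
gallery = rng.normal(size=(200, 32))
km = build_clusters(gallery, k=4)
best = match_in_cluster(gallery[17] + 0.01 * rng.normal(size=32), gallery, km)
print(best)   # expected: 17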
3.2.1 Feature Representation
In the process of person ReID, discriminative appearance cues are integrated to form a robust feature representation. The feature representation module extracts handcrafted features from each gallery image and generates an OFS by passing these features to the FFS method, while taking care not to alter or lose vital features. It exploits color, texture patterns, orientation detail, and spatial structural information for feature encoding. The module performs three functions, feature extraction, fusion, and selection, which are explained in the following subsections.
To extract color features, two commonly used measures, the mean and variance, are considered for each channel of the RGB and HSV color spaces. To compute the color features, in the first step, the color spaces are separated into their respective channels: red, green, blue, hue, saturation, and value. The sample mean and variance of each channel are formulated through Eq. (3.1) and Eq. (3.2).

\mu = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \xi_{i,j}}{M}   (3.1)

\sigma^{2} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} (\xi_{i,j}-\mu)^{2}}{M}   (3.2)
where \mu and \sigma^{2} denote the sample mean and sample variance of each extracted channel, and \xi_{i,j} is the related channel of the utilized color spaces. The parameters i and j denote the rows and columns of each channel, and M = m \times n is the total number of pixels, with m rows and n columns. These extracted features face a few challenges, such as a high correlation between color channels and perceptual non-uniformity [184]. To overcome these challenges, the standard deviation (SD) and singular value decomposition (SVD) are further calculated for each channel of the selected color spaces. The formulation of the SD features is defined by Eq. (3.3).

\sigma = \sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{n} (\xi_{i,j}-\mu)^{2}}{M}}   (3.3)
where \sigma depicts the standard deviation of each input channel. Moreover, SVD based structural projections [185] are applied as follows. Let m \times n be the dimensions of a real matrix M used as an input channel, where m and n are the rows and columns of the input channel/matrix. Here, M is expressed using Eq. (3.4).

M = U S V^{T}   (3.4)

Here, U and V represent the left and right singular vector matrices, and S denotes the diagonal matrix of singular values (\sigma_{k}) in descending order. Both U and V are orthogonal matrices, u_{i} are the left and v_{i} are the right singular vectors as formulated in Eq. (3.5) and Eq. (3.6), and r is the rank of M. Eq. (3.4) can be rewritten as Eq. (3.8).

\xi = U^{\xi} S^{\xi} V^{\xi T} = \sum_{i=1}^{r_{\xi}} \sigma_{i}^{\xi} u_{i}^{\xi} v_{i}^{\xi T}   (3.9)
where \xi \in \{R, G, B, H, S, V\} denotes the channels, and r_{\xi} is the rank of each query channel. The projection bases u_{i} v_{i}^{T} and projection coefficients \sigma_{i} preserve the structural and energy information of the decomposed query channel. Lastly, the color feature vector is acquired by fusing these features. The mathematical representation of the color features is given in Eqs. (3.10)-(3.15).
Here, 𝐹𝑉_μ, 𝐹𝑉_{σ²}, and 𝐹𝑉_σ are the extracted feature vectors that represent the mean, variance, and standard deviation of each channel respectively, where dimension 𝑑 = 6. We use 𝑓_m, 𝑓_v, and 𝑓_s to denote the mean, variance, and standard deviation feature of a single channel respectively. These features are computed from all channels 𝑅, 𝐺, 𝐵, 𝐻, 𝑆, and 𝑉 separately. Moreover, the mathematical representation of the SVD feature vector is given in Eq. (3.13).
Here, 𝐹𝑉_{SVD} represents the SVD feature vector of any single channel, where 𝑑 = 48 and 𝑓_{svd} is used to represent the values of the SVD feature vector. Eq. (3.14) is developed to compute SVD features from multiple channels, where Ŝ is a function which computes SVD features and consequently the feature vector of each channel is obtained, and 𝑑 describes the dimension of one channel's SVD features based on the size of the input image. In addition, the SVD features of the six channels are serially fused to generate a combined SVD feature vector named 𝐹𝑉_{SVD_{1×d}}, where 𝑑 = 288 as per the calculation of six channels, 6 × 48 = 288. Finally, all extracted features (mean = 6, variance = 6, standard deviation = 6, and SVD = 288) are combined to yield a compact color vector that preserves the color information of all channels of the input image. The mathematical representation of the combined color feature vector is expressed in Eq. (3.15), where 𝐹𝑉_{color(CFv)} denotes the serially fused color feature vector of size 1×306.
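To make the computation concrete, a minimal sketch of this color-feature pipeline is given below, assuming a NumPy/OpenCV environment; the function name color_features, the use of OpenCV for color conversion, and keeping 48 singular values per channel are illustrative choices rather than the exact original implementation.

```python
import cv2
import numpy as np

def color_features(bgr_image, n_svd=48):
    """Mean, variance, SD and SVD features from the R, G, B, H, S, V channels (1 x 306)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Six channels: R, G, B, H, S, V (OpenCV loads images in BGR order)
    channels = [bgr_image[:, :, 2], bgr_image[:, :, 1], bgr_image[:, :, 0],
                hsv[:, :, 0], hsv[:, :, 1], hsv[:, :, 2]]
    means, variances, stds, svd_feats = [], [], [], []
    for ch in channels:
        ch = ch.astype(np.float64)
        means.append(ch.mean())        # Eq. (3.1)
        variances.append(ch.var())     # Eq. (3.2)
        stds.append(ch.std())          # Eq. (3.3)
        s = np.linalg.svd(ch, compute_uv=False)   # singular values of the channel, Eq. (3.4)
        svd_feats.append(s[:n_svd])    # 48 values per channel -> 6 x 48 = 288
    return np.concatenate([means, variances, stds] + svd_feats)   # 6 + 6 + 6 + 288 = 306

# Usage (hypothetical file path); a 128 x 48 crop yields exactly 48 singular values per channel:
# img = cv2.resize(cv2.imread("probe.jpg"), (48, 128))
# cfv = color_features(img)    # shape (306,)
```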
For texture analysis, a well-known grayscale-invariant texture descriptor named LBP [186] is utilized, which extracts texture details and structural information from the neighbors around the center pixel value. It performs well under changing illumination conditions. In LBP, the corresponding information of image pixels and its dominant orientation are used to extract structural details in a confined path; however, it ignores the correlation of neighboring points. Therefore, an advancement of LBP in the form of local extrema patterns (LEP) [68, 187] is implemented in this work for textural, orientation, and spatial structural information. The LEP descriptor considers edge information in the 0°, 45°, 90°, and 135° directions. In a particular direction, it assigns a value of 1 when the neighboring pixel value is greater or less than the middle pixel (considered separately); otherwise, it assigns a value of 0. By considering the value of the center
pixel 𝑃𝑐𝑣 and its corresponding values of neighbor pixels 𝑃𝑛𝑣 , LEP code is computed
using Eqs. (3.16)-(3.19) as follows.
P_{nv}' = P_{nv} - P_{cv}    (3.16)
where nv ∈ {1, 2, …, 8} indexes the neighbors in a 3×3 window around the center pixel cv, and P_{nv}' denotes the corresponding differences from the center pixel value.
P_{nv}' = I_2(P_x', P_{x+4}'), \quad x = 1 + \theta/45^{\circ}, \quad \forall\theta \in \{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\}    (3.17)
where P_x'(θ) denotes the pixel difference at a particular direction in the 3×3 window I_2.
I_2(P_x', P_{x+4}') = \begin{cases} 1, & P_x' \times P_{x+4}' \geq 0 \\ 0, & \text{else} \end{cases}    (3.18)
Here, I_2 represents the binary codes at positions P_x' and P_{x+4}'.
LEP(P_{cv}) = \sum_{\theta} 2^{\theta/45^{\circ}} \times P_x'(\theta), \quad \forall\theta \in \{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\}    (3.19)
The LEP descriptor is specifically designed to get information about spatial correlation
across the center and its neighboring points. The LBP and LEP code generation using
a 3×3 pattern window is shown in Figure 3.3. The LBP code generation process is simple and straightforward; in the case of LEP code generation, however, positive values are marked with an inward arrow and negative values with an outward arrow in the 3×3 differences window. Based on these arrows, binary codes are computed: a binary code of 1 is assigned when both arrows point in the same direction (either inward or outward along a particular direction) and 0 otherwise. The binary codes are multiplied by weights using Eq. (3.19), and the LEP feature vector, denoted by 𝐹𝑉_{LEP(TFv)}, is obtained with dimension 1×256.
HOG is a commonly used feature descriptor for object detection [188]. Dalal and Triggs
[189] presented this descriptor which extracts HOG features by considering complete
dense grid locations and orientations in an image. HOG features are originally
computed for human detection [190] and later used in several domains [45, 191, 192].
To compute HOG features in this work, the algorithm begins with a localized area of
an image in which gradients are calculated horizontally and vertically. Afterwards, the
magnitude 𝑀𝑔 and orientation of gradients 𝜃 are calculated at each pixel using Eq.
(3.20) and Eq. (3.21).
M_g = \sqrt{g_x^{2} + g_y^{2}}    (3.20)
\theta = \arctan\left(\frac{g_y}{g_x}\right)    (3.21)
Considering an image size of 128 × 48, the pedestrian image is split into non-overlapping 8 × 8 cells, and overlapping patches of 2 × 2 cells are formed, giving 15 × 5 patches per image. A histogram of gradients with 9 accumulation bins is calculated for each cell, so each patch contributes 36 values and 15 × 5 × 36 = 2700 HOG features are generated in total. The objective of patch-by-patch feature extraction is to provide a compact representation, while the histogram of every patch makes the representation more robust to noise. Finally, the HOG feature vector 𝐹𝑉_{HOG(SFv)} with dimension 1×2700 is composed of these histograms taken from the patches.
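A hedged sketch of this HOG configuration is shown below, assuming scikit-image is available; the L2-Hys block normalisation is scikit-image's default rather than a setting stated in the text, and on a 128 × 48 grayscale crop the call returns the 2700-dimensional vector described above.

```python
import numpy as np
from skimage.feature import hog

def hog_features(gray_128x48):
    """HOG with 8x8 cells, 2x2-cell blocks and 9 bins -> 15 x 5 x 36 = 2700 values."""
    return hog(gray_128x48,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm='L2-Hys',
               feature_vector=True)

fv = hog_features(np.random.rand(128, 48))   # stand-in for a grayscale pedestrian crop
print(fv.shape)                              # (2700,)
```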
decision level fusion at the classification/recognition phase [185]; and 2) features selection, which selects the OFS using the idea of maximum entropy.
Figure 3.4: Proposed feature extraction, fusion, and max-entropy based selection of
features
Let 𝐹𝑉_{color(CFv)}, 𝐹𝑉_{LEP(TFv)}, and 𝐹𝑉_{HOG(SFv)} represent the three feature vectors, namely color, texture, and shape, respectively. If the first feature vector 𝐶𝐹𝑣 has dimension 𝑑_q, the second feature vector 𝑇𝐹𝑣 has dimension 𝑑_r, the third feature vector 𝑆𝐹𝑣 has dimension 𝑑_s, and 𝐺 is the total number of gallery images, then 𝐶𝐹𝑣, 𝑇𝐹𝑣, and 𝑆𝐹𝑣 can be written as Eqs. (3.22)-(3.24).
CF_v = \begin{bmatrix} cf_{(1,1)} & \cdots & cf_{(1,d_q)} \\ \vdots & \ddots & \vdots \\ cf_{(G,1)} & \cdots & cf_{(G,d_q)} \end{bmatrix}    (3.22)
TF_v = \begin{bmatrix} tf_{(1,1)} & \cdots & tf_{(1,d_r)} \\ \vdots & \ddots & \vdots \\ tf_{(G,1)} & \cdots & tf_{(G,d_r)} \end{bmatrix}    (3.23)
SF_v = \begin{bmatrix} sf_{(1,1)} & \cdots & sf_{(1,d_s)} \\ \vdots & \ddots & \vdots \\ sf_{(G,1)} & \cdots & sf_{(G,d_s)} \end{bmatrix}    (3.24)
where cf, tf, and sf denote the feature indices of the feature vectors 𝐶𝐹𝑣, 𝑇𝐹𝑣, and 𝑆𝐹𝑣 respectively. Then, these feature vectors are concatenated by using Eq. (3.25).
where 𝐹𝑉_{1×d} denotes the fused feature vector (FFV), 𝐹𝑉_{G×d} denotes the FFV of all gallery images, 𝑑 = (𝐶𝐹𝑣 + 𝑇𝐹𝑣 + 𝑆𝐹𝑣), and 𝐺 is the total number of samples/images belonging to the gallery. The size of the computed FFV is 1×3262, which is large in dimension.
To reduce the dimension, many feature reduction approaches have been applied in numerous pattern recognition tasks such as classification, detection, and recognition; features selection is also used for this purpose. In this regard, an entropy controlled features selection approach is applied in this work, which is rarely used in the existing literature on person ReID. The objective of features selection is to select distinct features to build a discriminative descriptor or model [193] having spatial, structural, and statistical information about the observation. Likewise, features selection is utilized to find a sufficient set of optimal features, instead of all or several features, that has the potential to improve results. Hence, for optimal features selection, experiments are conducted by selecting the best features from the color, texture, and shape feature vectors.
Initially, the most suitable feature combination is chosen as the one producing the highest results under a common similarity measure; for this purpose, the Canberra distance [68] is adopted. For experimentation, three datasets, VIPeR, CUHK01, and iLIDS-VID, are used as described in section 4.2.2, and the corresponding results are shown in Table 3.2. The mathematical description of entropy controlled features selection is presented in Eq. (3.26) and Eq. (3.27).
where 𝑓𝑖 and 𝑓𝑗 are current and next features respectively, 𝑃𝑅 represents the probability
of computed features, 𝑑 is the dimension of feature vector, and ẟ denotes the entropy
controlled features.
Here, 𝑂𝐹𝑆𝑑 represents maximum score-based OFS having dimension 𝑑 and 𝑣 denotes
number of selected features from larger feature space 𝑓𝑑 . This process is repeated for
each feature space.
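The following sketch illustrates one plausible reading of this entropy controlled selection, in which each feature dimension is scored by the entropy of its histogram over the gallery and the v highest-scoring dimensions are kept; the histogram bin count and the per-dimension scoring rule are assumptions, not details taken from the thesis.

```python
import numpy as np

def entropy_select(features, v, bins=32):
    """Keep the v feature dimensions with the highest histogram entropy.

    features : (G, d) matrix of gallery feature vectors
    Returns (selected_matrix, selected_indices).
    """
    scores = np.zeros(features.shape[1])
    for j in range(features.shape[1]):
        p, _ = np.histogram(features[:, j], bins=bins)   # P = Hist(f), cf. Eq. (3.49)
        p = p[p > 0].astype(np.float64)                  # drop zero entries
        p /= p.sum()
        scores[j] = -np.sum(p * np.log2(p))              # Shannon entropy as the score
    idx = np.argsort(scores)[::-1][:v]                   # descending score order
    return features[:, idx], idx

# e.g. keep 1252 of the 3262 fused colour + texture + shape features:
# ofs, keep_idx = entropy_select(fused_gallery_features, v=1252)
```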
Table 3.2: Experimental results using handcrafted features on the three selected datasets (the top recognition rate at each rank was highlighted in bold in the original table)

FS no. | VIPeR (rank 1 / 10 / 20 / 50) | CUHK01 (rank 1 / 10 / 20 / 50) | iLIDS-VID (rank 1 / 10 / 20 / 50)
1 | 18.3 / 39.7 / 47.0 / 56.8 | 21.7 / 38.1 / 44.6 / 50.1 | 16.3 / 35.4 / 41.2 / 46.7
2 | 20.1 / 41.3 / 50.2 / 58.1 | 24.5 / 41.0 / 48.3 / 53.2 | 17.0 / 37.8 / 42.4 / 49.3
3 | 24.6 / 46.7 / 52.6 / 59.2 | 26.3 / 44.0 / 52.9 / 56.6 | 20.7 / 41.0 / 45.1 / 51.8
4 | 25.9 / 47.2 / 54.4 / 62.4 | 27.7 / 46.6 / 53.1 / 57.5 | 21.9 / 43.1 / 52.0 / 55.1
5 | 27.0 / 51.4 / 56.3 / 62.8 | 28.8 / 49.0 / 55.1 / 60.3 | 23.4 / 45.8 / 51.8 / 56.8
6 | 26.1 / 50.1 / 56.8 / 61.0 | 26.2 / 48.2 / 53.9 / 58.0 | 22.8 / 45.1 / 50.6 / 56.4
Afterwards, all best selected features are fused into one matrix, which returns a feature vector of size 1×1252 known as the OFS. Moreover, it is also observed that the total number of selected features provides sufficient and reliable information for image representation. The selected 𝑂𝐹𝑆_d is supplied to a K-means module [193] for
features-based clustering of each gallery image. Later on, OFS is integrated with deep
features of cluster sample, for a more reliable representation of gallery image.
For test data, the search operation responds considerably faster when clustering is performed on correctly selected optimal features. The objective of this module is to split the dataset into clusters based on the selected feature subsets. The feature clustering module consists of two parts: features-based cluster formation and deep feature extraction.
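A minimal sketch of the features-based cluster formation step is given below, assuming scikit-learn's K-means as the clustering backend; the number of clusters k and the helper name form_consensus_clusters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def form_consensus_clusters(ofs_gallery, k):
    """Split the gallery OFS matrix (G x 1252) into k consensus clusters."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(ofs_gallery)            # cluster index per gallery image
    clusters = {c: np.where(labels == c)[0] for c in range(k)}
    return labels, clusters

# labels, clusters = form_consensus_clusters(ofs_gallery, k=4)   # k is dataset dependent
```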
Existing work shows that a variety of pre-trained deep CNN models have been implemented to solve classification and object detection problems as well as the person ReID task [131, 199]. These models have the potential to extract features from the given data efficiently and automatically. AlexNet is one of the simplest and most foundational pre-trained CNN models, introduced by Krizhevsky et al. [128], and it can be easily trained and optimized compared to more complex CNN architectures such as VGGNet [200] and GoogLeNet [201].
The architecture and parameter settings of the network are depicted in Figure 3.6. As the parameters of the AlexNet model are already well tested, standard parameters are utilized to
learn deep features without any optimization. In this work, an RGB image is resized
into 227 × 227 × 3 dimensions and then a bi-cubic interpolation algorithm is applied
for equalization of image details.
where Z_x^l represents the output channel values up to point x at layer l, z_y^{l-1} shows the input channel values up to point y at layer l−1, and f_{xy}^{l} represents the convolutional filter between the x-th and y-th feature maps. The bias b_x^l is added to move the activation function towards successful learning. For activation of neurons, the rectified linear unit (ReLU) is applied through Eq. (3.29).
Max pooling is another important step used for downsampling in CNN architectures. It
is quite simple and does not have a learning process. Max pooling is applied after the
first, second, and fifth layers in the network. Figure 3.7 shows max pooling operation
on a given sample feature region. It simply applies 𝑘 × 𝑘 sized filter and selects a
maximum value through Eq. (3.30).
Z_{pqc} = \max_{(i,j) \in M_{p,q}} U_{ijc}    (3.30)
where M_{p,q} denotes the pooling region with indices i, j, U_{ijc} is the corresponding region of the feature map for color space channel c, and Z_{pqc} is the pixel value obtained as the output of the max pooling operation. After the first two pooling layers, local contrast divisive normalization (LCDN) is applied by considering the interaction between channels C (multi-channel images), where the variance of the local area M_{ab} is computed by applying Eq. (3.31).
\sigma_{ab}^{2} = \frac{1}{C} \sum_{c=0}^{C-1} \sum_{(i,j) \in M_{ab}} w_{ijc} \left( x_{a+i,\,b+j,\,c} - \bar{x}_{ab} \right)^{2}    (3.31)
Z_{abc} = \frac{x_{abc} - \bar{x}_{ab}}{\max(\hat{c}, \sigma_{ab})}    (3.32)
The remaining part of the architecture consists of three FC layers, where the first two layers deal with the extracted features of the previous layers and decrease the dimensionality of these features from 9216 to 4096. The response of the FC-7 layer of the trained network is utilized as deep features. A deep architecture is applied because, in a multi-camera network,
the appearance of given pedestrian changes due to various reasons such as camera
settings, background clutter, viewpoints, and variation in illumination and pose
changes.
To handle these appearance issues and inherent ambiguities, a sufficiently deep architecture is desirable. Consequently, the CNN model is trained using a stochastic gradient learning algorithm commonly used in various CNN models. The formulation of deep feature extraction considering all k consensus clusters and their sample images belonging to the gallery is given in Eq. (3.34). Finally, the deep CNN model is applied and feature extraction on each sample image of the consensus clusters ɠ_cc is carried out using Eq. (3.35).
where 𝐷𝐹_{1×d} denotes the deep features having dimension 𝑑 and 𝑠𝑖 depicts the 𝑖-th sample image of cluster 𝑘, extracted by applying the ɠ_cc operation. Similarly, this procedure is applied across all the consensus clusters to extract the deep features of each sample image. PCA is then used to decrease the dimension of the deep features. The reduced deep features filter out noise by discarding unnecessary information and preserving discriminative information. Empirically, 1000 deep features are selected and fused with the selected OFS to obtain the feature vector which is then used for all the experiments. The deep features selection is discussed in section 4.2.3.1, and results are shown in Tables (4.2)-(4.4). Thus, each gallery image belongs to a particular consensus cluster and is represented by its optimal features subset 𝑂𝐹𝑆_{select}, reduced deep features 𝐷𝐹_{i×d}, and cluster index 𝑘. Hence, all gallery images are represented by the concatenation of these features, and the size of the FFV becomes 1×2252 together with the cluster index. Mathematically, it is formulated through Eq. (3.36), and the complete steps of the proposed FCDF framework are presented in Algorithm 3.2.
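The sketch below outlines this fusion step under the stated dimensions (1252 selected handcrafted features, deep features reduced to 1000 components by PCA, giving 1×2252 per gallery image); scikit-learn is assumed, and the function name is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_cluster_features(ofs_gallery, deep_gallery, cluster_labels, n_deep=1000):
    """Reduce deep features with PCA and fuse them with the OFS per gallery image.

    ofs_gallery    : (G, 1252) selected handcrafted features (OFS)
    deep_gallery   : (G, 4096) FC-7 responses of the deep CNN
    cluster_labels : (G,) consensus-cluster index of each gallery image
    Note: n_deep must not exceed min(G, deep feature dimension) for PCA.
    """
    deep_reduced = PCA(n_components=n_deep).fit_transform(deep_gallery)
    fused = np.hstack([ofs_gallery, deep_reduced])     # 1252 + 1000 = 2252 per image
    return fused, cluster_labels                       # cluster index kept alongside
```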
Figure 3.7: Max pooling operation
where 𝐹𝐹𝑉_v(fused) represents the final feature vector for ReID, which consists of the OFS, the DF feature vector, and the 𝑘 cluster index.
In the feature matching module, the first step is to find a consensus cluster, followed by the extraction of deep features of the test image. Later on, the OFS and deep features of the probe image are concatenated. Finally, a similarity measure is deployed to get the target match from the selected cluster or neighbor clusters. The details of all steps are given in the following subsections.
For cluster selection, the probe OFS is initially provided to the learned model to select the target cluster. For this purpose, a multi-class SVM is applied to assign a score to all k consensus clusters using a score-to-posterior-probability transformation function. Based on the highest score, any consensus cluster may qualify as the target cluster for a given probe image. During this process, the target image may belong to another cluster; in this situation, the nearest neighbor consensus clusters (those with minimum distance) are taken rather than only the cluster with the highest score, although not more than half of the total clusters are considered. Thus, the
probability of finding the target image increases when neighbor consensus clusters are
taken. The deep features of probe image are also selected and integrated with probe
OFS. Then, searching is applied against the given probe image across n consensus
clusters instead of whole gallery to effectively optimize the gallery search, where 𝑛 ∈
[1,2,3].
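A hedged sketch of this cluster selection step is shown below, assuming scikit-learn's SVC with Platt scaling as the score-to-posterior-probability transformation; the RBF kernel mirrors the radial basis kernel mentioned earlier, and n ∈ [1, 3] candidate clusters are returned as in the text.

```python
import numpy as np
from sklearn.svm import SVC

def select_clusters(ofs_gallery, cluster_labels, probe_ofs, n=2):
    """Rank consensus clusters for a probe image via SVM posterior probabilities."""
    svm = SVC(kernel='rbf', probability=True)          # posterior via Platt scaling
    svm.fit(ofs_gallery, cluster_labels)
    probs = svm.predict_proba(probe_ofs.reshape(1, -1))[0]
    ranked = svm.classes_[np.argsort(probs)[::-1]]     # clusters ordered by posterior
    return ranked[:n]                                  # top cluster plus its nearest neighbours

# candidate_clusters = select_clusters(ofs_gallery, labels, probe_ofs, n=2)
```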
The last phase of the presented framework compares the probe image within the consensus cluster to find the accurate match. The QC cross-bin histogram distance measure [202] is used for this purpose, where two histogram distance properties are utilized across the bins: 1) the similarity matrix quantization invariance property, which includes cross-bin relationships of features, and 2) the sparseness invariance property, which obtains distances between full histograms (e.g., probe and cluster feature vectors). Let the probe feature vector be Ip = {X1, X2, …, Xn} and the cluster feature vector be Ic = {Y1, Y2, …, Yn}; the distance D between a probe feature vector and a cluster feature vector is calculated by Eq. (3.37).
where d denotes the feature vector dimension. Once all the distances between the probe image and the samples of the target cluster(s) are computed, ReID ranks are obtained by sorting the computed distances in ascending order. Consequently, the gallery sample with the minimum distance D in the cluster(s) is declared as the accurate match.
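The matching step can be summarised as in the sketch below; a chi-square-style bin-wise distance is used here only as a stand-in for the QC cross-bin histogram distance of Eq. (3.37), and the ascending sort implements the rank ordering described above.

```python
import numpy as np

def rank_matches(probe_fv, cluster_fvs, cluster_ids):
    """Rank candidate gallery images of the selected cluster(s) by distance to the probe."""
    eps = 1e-12
    # Stand-in bin-wise distance (chi-square style), not the QC cross-bin measure itself
    d = np.sum((cluster_fvs - probe_fv) ** 2 /
               (np.abs(cluster_fvs) + np.abs(probe_fv) + eps), axis=1)
    order = np.argsort(d)                       # smallest distance first = rank 1
    return [cluster_ids[i] for i in order], d[order]

# ranked_ids, dists = rank_matches(probe_fv, gallery_subset, subset_ids)
```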
This method utilizes handcrafted features (color, HOG, and LEP) for cluster formation. The extracted deep features of each cluster sample are fused for a more distinct feature representation. The proposed FCDF method optimizes the gallery search by applying cluster-based probe matching instead of whole-gallery matching. To further analyze pedestrians, gender classification is considered as the next step of this dissertation. The subsequent methodology based on J-LDFR is discussed in section 3.3.
3.3.1 Data Preprocessing
The preprocessing applied for each feature representation (FR) scheme is summarized below; resizing uses bi-cubic interpolation, and non-linear median filtering is part of the preprocessing pipeline.

Feature type | FR | Original image size | Resized image | Input images
Low-level | LOMO | Vary | 128 × 64 × 3 | HSV and grayscale
Low-level | HOG | Vary | 128 × 64 × 3 | grayscale
High-level (deep) | VGG19 | Vary | 224 × 224 × 3 | RGB
High-level (deep) | ResNet101 | Vary | 224 × 224 × 3 | RGB
The visual description of an input image is a fundamental step for automated image recognition systems. These systems use efficient and effective visual cues in numerous computer vision tasks. In this context, known visual feature representations such as shape, texture, color, deep, and structural information are often used to compute distinct image features. In pedestrian gender analysis, feature representations are investigated by extracting either low-level features or deep CNN features. Both feature extraction
schemes are introduced in the proposed framework for robust feature representation of
gender image, described in the subsequent sections.
Various handcrafted features are commonly studied in the area of pattern recognition. Specifically, these features are used to represent low-level information (e.g., shape, color, and texture) of an input image. Moreover, they serve as a complementary part of feature fusion and provide a reliable representation of an image. Considering these facts, two well-known descriptors, the HOG descriptor for shape information and the LOMO descriptor for color and texture information, are selected for the low-level feature representation of the input image. Besides, HOG and LOMO features jointly handle the issues of rotation, viewpoint, and illumination variances in images.
HOG feature representation is initially used for pedestrian detection by Dalal et al.
[203], and currently it is used in many areas such as person ReID [68], railway track
detection [204], and other applications [205-207].
In this manuscript, illumination and rotation invariant local features of gender images are computed using the HOG feature descriptor. Initially, the whole input image is split into different blocks; afterwards, gradients of the image are computed block by block, where each block consists of 2 × 2 cells. Considering the gradient information in a single block, a local histogram of orientations of each cell is calculated with 9 accumulation bins. Meanwhile, the local 1-D histogram over the pixels of each cell is individually normalized using the 𝐿1-norm along with an interpolation operation.
The final histogram is then computed by concatenating the local histograms of each cell. In the final histogram, gradient strengths vary due to local illumination changes; therefore, the overall HOG features are normalized using the L2-norm. Finally, a feature vector with 3780 HOG features is obtained. The HOG feature representation is shown in Figure 3.9. From this larger HOG feature vector, a subset of 1000 features is empirically selected for two reasons: 1) the reduced dimension of the HOG feature vector discards irrelevant features, and 2) only sufficiently contributing HOG features are passed to the succeeding fusion stage. Next, the LOMO descriptor, initially introduced by [77] and designed for pedestrian ReID, is also utilized for low-level feature representation.
Figure 3.9: Process of formation of low-level (HOG) features
The LOMO descriptor effectively examines the horizontal occurrence of local features and maximizes these local pixel values to efficiently handle viewpoint changes, while a Retinex transform is applied to effectively tackle illumination changes in the input image. Recently, many approaches have integrated LOMO features and achieved outstanding results [54, 67, 208-210]. Thus, this work also adopts the LOMO feature representation as low-level information, together with HOG features, for gender prediction. The formation of LOMO features is shown in Figure 3.10 (a) and Figure 3.10 (b).
The LOMO feature representation utilizes a sliding sub-window of size 10×10 with a stride of 5 pixels over the 128×64 image to depict local information. The purpose of the sliding window is to handle viewpoint changes in the input image. Each sub-window is described by two types of information: 1) scale-invariant local ternary pattern (SILTP) [211] histograms, and 2) HSV based 8×8×8 color bin histograms. Both types of information describe the texture and color details of the input image. A three-level pyramid is utilized for multi-scale information and obtains the features at different levels. The information from each window and scale is then combined to produce the LOMO feature representation.
Figure 3.10: Process of formation of low-level (LOMO) features, a) overview of feature
extraction procedure using two different LOMO representation schemes
such as HSV and SILTP based representations including feature fusion
step to compute LOMO feature vector, and b) basic internal representation
to calculate combined histogram from different patches of input image
[77]
Besides, these procedures retain the local features of a person from the selected regions, as in [77]. The calculated LOMO feature representation achieves large invariance to viewpoint and illumination changes. In this work, LOMO features are obtained from each gender image. As the feature dimension is high (26960 features in total), dimensionality reduction is carried out on the computed LOMO features, and 600 best features are empirically selected from the larger LOMO feature vector to facilitate fusion with sufficient and optimal LOMO features.
Already-trained deep CNN models have shown promising results in different areas of computer vision with both small and large datasets [212-214]. In the related literature, two well-known procedures, transfer learning and feature mining, are applied using already-trained CNN models. Transfer learning is a much faster and more suitable procedure than training from scratch, whereas feature mining is another useful and fast procedure when high-level representations of an input image are desired and supplied to train an image classifier. This procedure extracts deeply learned image features using deep CNN models. Therefore, aiming to acquire a high-level feature depiction of the gender image, the deep CNN feature extraction procedure is utilized in this work. Generally, deep CNN models consist of diverse layers (blocks) such as convolutional, pooling, normalization, ReLU, and FC layers with a single softmax function.
The proposed framework considers two already-trained CNN models, VGG19 and ResNet101, to compute deep feature representations. Both models apply a stack of 3 × 3 convolutional filters with stride 1 during the convolution operation. The depth of a CNN model enables learning of more complex information from the input image. Moreover, each model has different characteristics that establish its significance. In this work, the objective is to examine the deep information of two CNN models of different depths by considering their feature representations at the FC layers. So, the feature representations of the FC7 and FC1000 layers of the VGG19 and ResNet101 models are utilized, respectively. The aforementioned types of layers are often used in the design of a deep CNN model. Initially, the CL obtains local features of the input image. This local feature extraction process is formulated in Eq. (3.38).
z_i^{L} = b_i^{L} + \sum_{j=1}^{x_i^{L-1}} F_{i,j}^{L} \times y_j^{L-1}    (3.38)
The max pooling operation selects the highest value from the pooling region. Max-pooling between CLs is applied to reduce the number of irrelevant features, the computational load, and the overfitting issue.
Similarly, average pooling is another type of pooling that computes the average of the filter values rather than the maximum value. Both max-pooling and average-pooling layers are included in the ResNet101 model, whereas VGG19 comprises only max-pooling. For instance, let a single feature map of a PL be denoted by 𝑅. Before pooling is applied, 𝑅 = 𝑅₀, …, 𝑅ᵢ can be considered as a collection of small local regions, where i is controlled by both the size of the pooling regions and the dimension of the input feature maps. A local region 𝑅ᵢ is selected, where 𝑖 is an index between 0 and 𝑚. Mathematically, the local regions are represented through Eq. (3.39).
R_i = \{x_1, x_2, \ldots, x_{K \times K}\}    (3.39)
where 𝐾 denotes the size of a pooling region (filter size) and 𝑥 represents a component of the pooling region. At each pooling layer, the max and average pooling operations use different calculation procedures for the components in each pooling region. Eq. (3.40) and Eq. (3.41) express the max pooling and average pooling operations respectively.
MP_i^{L} = \max_{1 \le j \le K \times K} (x_j)    (3.40)
AP_i^{L} = \frac{1}{K \times K} \sum_{j=1}^{K \times K} x_j    (3.41)
where MP_i^L and AP_i^L denote the outputs of the max pooling and average pooling operations respectively over a K × K pooling region at layer L. The max pooling operation only chooses the maximum value from the pooling region, whereas the average pooling operation computes the average of all pooling region values. Also, the other layers, including the ReLU layer, the FC layer FC_i^{(l)}, and the FC layer FC_j^{(l)}, are represented in Eqs. (3.42)-(3.44).
ReLU_i^{(l)} = \max(0, p_i^{l-1})    (3.42)
FC_i^{(l)} = f(D1_i^{(l)}) \quad \text{with} \quad D1_i^{(l)} = \sum_{r=1}^{MP_i^{(l-1)}} w_{i,r}^{(l)} \left( FC_i^{(l-1)} \right)_r    (3.43)
FC_j^{(l)} = f(D2_j^{(l)}) \quad \text{with} \quad D2_j^{(l)} = \sum_{r=1}^{AP_j^{(l-1)}} w_{j,r}^{(l)} \left( FC_j^{(l-1)} \right)_r    (3.44)
where ReLU_i^{(l)} represents the output of the ReLU layer, and FC_i^{(l)} and FC_j^{(l)} denote the responses of the FC7 and FC1000 layers respectively, using the VGG19 and ResNet101 models. The FC
layer efficiently depicts the higher-level information and most of the researchers
consider the output of FC layers as deep features to apply for pattern recognition, person
ReID, and image classification tasks [172-174, 215].
The selected VGG19 and ResNet101 pre-trained deep CNN models are used to acquire deep features of gender images. Firstly, the VGG19 model [216] comprises different blocks/layers: 16 CLs, 18 ReLU layers, 5 PLs, 3 FC layers, 2 dropout layers, and a softmax classifier for input image prediction. This model is learned on the ImageNet dataset, where the size of the input image at the input layer is 224 × 224 × 3. ReLU is utilized after each CL in the VGG19 model. Both ReLU and dropout operations are performed before the FC7 layer, where the dropout value is 0.5. In the max-pooling operation, the pool size and stride are selected as (2, 2) with zero padding.
Secondly, ResNet101 is a pre-trained model proposed by He et al. [217] for image recognition. This model consists of an input layer, 104 CLs, 100 ReLU layers, 104 batch normalization layers, 33 addition layers, MP, AP, FC, softmax, and output layers. The size of the input images used by this model is 224 × 224 × 3. The ResNet101 model utilizes a pool and stride size of (7, 7) with zero padding in the average pooling operation, and a pool size of (3, 3), stride of (2, 2), and padding of (0, 1, 0, 1) in the max-pooling operation.
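A hedged sketch of extracting these FC7 and FC1000 responses is given below, assuming a PyTorch/torchvision environment (this chapter's description is framework-agnostic); slicing VGG19's classifier up to its second 4096-unit linear layer is one way to obtain the FC7 response.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet preprocessing assumed by both pre-trained models (224 x 224 x 3 input)
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def deep_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)        # (1, 3, 224, 224)
    # VGG19: run up to the second 4096-unit FC layer (FC7 response)
    f = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
    fc7 = vgg.classifier[:4](f)                   # (1, 4096)
    # ResNet101: the final FC layer response (FC1000)
    fc1000 = resnet(x)                            # (1, 1000)
    return fc7.squeeze(0).numpy(), fc1000.squeeze(0).numpy()
```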
These responses are used as deep feature representations that capture deeper information about gender images to handle misclassification of gender due to pose variations. The extracted deep feature vectors of size 𝑁 × 4096 and 𝑁 × 1000 are obtained from VGG19 and ResNet101, respectively. This computed deeper information helps to reduce false gender predictions due to variations in pose and illumination. In this process, four kinds of features are obtained from a gender image, as shown in Figure 3.11. Let 𝐹𝑉_{HOG(Hv)}, 𝐹𝑉_{LOMO(Lv)}, 𝐹𝑉_{Deep1(DVv)}, and 𝐹𝑉_{Deep2(DRv)} represent the four feature vectors, namely HOG, LOMO, deepV (VGG19), and deepR (ResNet101), respectively. If the HOG feature vector 𝐻𝑣 has dimension 𝑑_w, the LOMO feature vector 𝐿𝑣 has dimension 𝑑_x, the deepV feature vector 𝐷𝑉𝑣 has dimension 𝑑_y, the deepR feature vector 𝐷𝑅𝑣 has dimension 𝑑_z, and 𝑁 represents the total images in the selected dataset, then 𝐻𝑣, 𝐿𝑣, 𝐷𝑉𝑣, and 𝐷𝑅𝑣 can be calculated through Eqs. (3.45)-(3.48).
H_v = \begin{bmatrix} h_{(1,1)} & \cdots & h_{(1,d_w)} \\ \vdots & \ddots & \vdots \\ h_{(N,1)} & \cdots & h_{(N,d_w)} \end{bmatrix}    (3.45)
L_v = \begin{bmatrix} l_{(1,1)} & \cdots & l_{(1,d_x)} \\ \vdots & \ddots & \vdots \\ l_{(N,1)} & \cdots & l_{(N,d_x)} \end{bmatrix}    (3.46)
DV_v = \begin{bmatrix} dv_{(1,1)} & \cdots & dv_{(1,d_y)} \\ \vdots & \ddots & \vdots \\ dv_{(N,1)} & \cdots & dv_{(N,d_y)} \end{bmatrix}    (3.47)
DR_v = \begin{bmatrix} dr_{(1,1)} & \cdots & dr_{(1,d_z)} \\ \vdots & \ddots & \vdots \\ dr_{(N,1)} & \cdots & dr_{(N,d_z)} \end{bmatrix}    (3.48)
where ℎ, 𝑙, 𝑑𝑣, and 𝑑𝑟 represent the feature indices of the extracted 𝐻v, 𝐿v, 𝐷𝑉v, and 𝐷𝑅v feature vectors respectively. However, merging these feature vectors may increase the feature dimension, which may require more execution time and influence classification accuracy due to irrelevant information. To address these issues, feature reduction is performed, which computes optimal features from the given feature vectors. These optimal features comprise the discriminant features that are desirable to correctly classify gender images. In this manuscript, an entropy controlled method is applied to select the best features subset from the extracted feature vectors. To the best of our knowledge, entropy based features selection has not been used before in the existing literature for the pedestrian gender classification task.
Figure 3.11: Complete design of proposed low-level and deep feature extraction from
gender images for joint feature representation. The proposed framework
J-LDFR selects maximum score-based features and then fusion is applied
to generate a robust feature vector that has both low-level and deep feature
representations. Selected classifiers are applied to evaluate these feature
representations for gender prediction
Maximum entropy is described as follows. Let ℎ1 , ℎ2 , … , ℎ𝑁 represent features from
feature space H, 𝑙1 , 𝑙2 , … , 𝑙𝑁 represent features from feature space L,
𝑑𝑣1 , 𝑑𝑣2 , … , 𝑑𝑣𝑁 are features from the feature space DV and 𝑑𝑟1 , 𝑑𝑟2 , … , 𝑑𝑟𝑁 denote
features from feature space DR, where H∈HOG, L∈LOMO, DV∈VGG19, and
DR∈ResNet101 feature vectors. The dimension of each feature vector is 1 × 3780,
1 × 26960, 1 × 4096, and 1 × 1000 for HOG, LOMO, VGG19, and ResNet101 respectively. The optimal features selection using the maximum entropy method is given
in Eqs. (3.49)-(3.58).
𝑃 = 𝐻𝑖𝑠𝑡(𝑓) (3.49)
where 𝑃 denotes the histogram counts of a feature vector 𝑓 such that 𝑓 ∈ 𝐼, and 𝐼 depicts the total feature vectors in a given feature space. Meanwhile, zero entries are removed from 𝑃 and the matrix is returned in the form of bin values. Then, the entropy is computed using Eq. (3.50), where 𝐺 ∈ (ℎ_N, 𝑙_N, 𝑑𝑣_N, 𝑑𝑟_N), and the maximum entropy controlled method obtains a feature vector with 𝑁 × 𝑀 dimensions. It controls the randomness of each feature space. Finally, the scores of each feature vector are arranged in descending order. The entropy information of each feature vector is calculated through Eqs. (3.51)-(3.54).
𝐻𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐻), ẟ) (3.55)
𝐿𝑣 = 𝑚𝑎𝑥(𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐿), ẟ) (3.56)
where 𝐻𝑣, 𝐿𝑣, 𝐷𝑉𝑣, and 𝐷𝑅𝑣 indicate the selected subsets of features and ẟ represents the top features from the computed entropy information. The joint feature representation is computed using Eq. (3.59), where d shows the FFV dimension; the dimension of the FFV is 1 × 3600 (see section 4.3.3).
Later, this FFV is supplied to different classifiers for classification. In this work,
supervised learning methods are used to train the data for gender prediction.
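As an illustration of this classification stage, the sketch below cross-validates a degree-3 polynomial (cubic) SVM on the joint feature vectors using scikit-learn; the 10-fold protocol and the scikit-learn analogue of the cubic SVM are assumptions here, and the actual evaluation protocol is described in chapter 4.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def evaluate_gender_classifier(ffv, labels, folds=10):
    """Cross-validate a cubic-kernel SVM on the 1 x 3600 joint feature vectors.

    ffv    : (N, 3600) fused HOG + LOMO + VGG19 + ResNet101 features
    labels : (N,) gender labels (e.g. 0 = female, 1 = male)
    """
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel='poly', degree=3, C=1.0))   # cubic SVM analogue
    return cross_val_score(clf, ffv, labels, cv=folds).mean()
```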
Table 3.4: Proposed J-LDFR framework: selected features subset dimensions, classifiers, and their parameter settings

Proposed framework: J-LDFR
Features subset: low-level feature representations (HOG 1000 features and LOMO 600 features) + deep feature representations (VGG19 1000 features and ResNet101 1000 features)

Classifier | Parameters | Values
Linear discriminant | Covariance structure | Full
Ensemble subspace discriminant | Ensemble method | Subspace
Ensemble subspace discriminant | Learner type | Discriminant
Ensemble subspace discriminant | Number of learners | 30
Ensemble subspace discriminant | Subspace dimension | 1800
Fine-KNN | Distance metric | Euclidean
Fine-KNN | Distance weight | Equal
Fine-KNN | Standardize data | True
Fine-KNN | Number of neighbors | 1
Median-KNN | Distance metric | Euclidean
Median-KNN | Distance weight | Equal
Median-KNN | Standardize data | True
Median-KNN | Number of neighbors | 10
Cosine-KNN | Distance metric | Euclidean
Cosine-KNN | Distance weight | Equal
Cosine-KNN | Standardize data | True
Cosine-KNN | Number of neighbors | 10
Linear-SVM | Kernel function | Linear
Linear-SVM | Kernel scale | Automatic
Linear-SVM | Box constraint level | 1
Linear-SVM | Multiclass method | one-vs-one
Linear-SVM | Standardize data | True
Medium Gaussian-SVM | Kernel function | Gaussian
Medium Gaussian-SVM | Kernel scale | 60
Medium Gaussian-SVM | Box constraint level | 1
Medium Gaussian-SVM | Multiclass method | one-vs-one
Medium Gaussian-SVM | Standardize data | True
Quadratic-SVM | Kernel function | Quadratic
Quadratic-SVM | Kernel scale | Automatic
Quadratic-SVM | Box constraint level | 1
Quadratic-SVM | Multiclass method | one-vs-one
Quadratic-SVM | Standardize data | True
Cubic-SVM | Kernel function | Cubic
Cubic-SVM | Kernel scale | Automatic
Cubic-SVM | Box constraint level | 1
Cubic-SVM | Multiclass method | one-vs-one
Cubic-SVM | Standardize data | True
Therefore, all the experiments in this work are executed using the cubic, medium Gaussian, and quadratic SVM methods. This method focuses on robust joint feature representation for pedestrian gender classification using large-scale and small-scale datasets. Pedestrian gender classification is further investigated in the next methodology as an imbalanced binary classification problem with SSS datasets.
3.4 Proposed Method for Pedestrian Gender Classification on
Imbalanced and Small Sample Datasets using Parallel and Serial
Fusion of Selected Deep and Traditional Features (PGC-FSDTF)
An approach named pedestrian gender classification using the fusion of selected deep and traditional features (PGC-FSDTF) is proposed for gender prediction. It consists of five main steps: 1) data preparation, 2) traditional feature extraction, 3) deep CNN feature extraction and parallel fusion, 4) features selection and fusion, and 5) application of classification methods. An overview of the proposed approach is shown in Figure 3.12. In the subsequent sections, these steps are discussed in detail.
Data preparation is an essential step in pattern recognition tasks and is widely used to support later processing stages for accurate predictions. The purpose of data preparation is to refine the information for an accurate match. Usually, these refinements are implemented as components of data preparation, which typically includes data augmentation, data profiling, data preprocessing, and data cleansing. Two components of data preparation are adopted here: 1) data augmentation and 2) data preprocessing. Data augmentation is utilized for the equal distribution of data across gender classes, and data preprocessing is used to enhance the visual quality, foreground information, and resizing of gender images. The description of these steps is given in the following subsections.
Pedestrian analysis datasets such as MIT and PKU-Reid contain SSS in which each pedestrian appearance is captured from multiple non-overlapping cameras. As a result, the number of captured views of each pedestrian (male/female) increases the total number of images in these datasets. These datasets are used for different pedestrian analysis tasks such as gender prediction, person ReID, and attribute recognition. MIT is a sub-dataset of PETA that comprises 888 gender images, in which 600 pedestrian images belong to the male class and 288 pedestrian images belong to the female class. This dataset is widely used for pedestrian attribute recognition, such as clothing style and gender. In the MIT dataset, it can be observed that the class-wise distribution of data is imbalanced. This inequality identifies two research issues: a) a class-imbalanced dataset and b) an SSS dataset.
Figure 3.12: An overview of the proposed PGC-FSDTF framework for pedestrian
gender classification
To further study SSS datasets for pedestrian gender classification, another such dataset named PKU-Reid is chosen. The PKU-Reid dataset is preferably used for pedestrian re-identification [225-227] and consists of 114 individuals (70 males and 44 females), where the appearance of each individual is captured in eight directions; resultantly, 1824 images are collected from two non-overlapping cameras. In this research study, all images of the PKU-Reid dataset are labeled for the male and female classes. Consequently, 1120 male and 704 female images are obtained from a total of 1824 images, and this newly prepared dataset is named PKU-Reid-IB, such that the class-wise data in this dataset is also imbalanced. Hence, both MIT and PKU-Reid-IB are IB-SSS datasets
which are used in this work for pedestrian gender classification. As discussed earlier, class-wise variation in samples leads to an imbalanced classification problem that causes poor predictive performance, specifically for the class with fewer samples. If the imbalanced classification problem is severe, then it is more challenging to suggest a robust approach. Besides, a dataset with a small sample space is another problem for researchers when training a model. To handle these problems, a data augmentation process is selected to enhance class-wise data by considering the existing data of one or both gender classes. Hence, this process is applied to both the imbalanced MIT and PKU-Reid datasets and controlled with a random oversampling (ROS) technique to generate synthetic data from existing gender images.
One such geometric transformation uses a matrix defined, for example, as [1 0 0; .3 1 0; 0 0 1] (a shear). Resultantly, this operation yields the transformed image according to the applied geometric transformation object.
chosen from the given set of male/female images instead of a single operation. This means that four images will be generated from a single image, as shown in Figure 3.13.
Figure 3.13: Proposed 1vs1 and 1vs4 strategies for data augmentation
As the 1vs1 and 1vs4 strategies are applied to generate augmented data from both gender classes of the MIT and PKU-Reid datasets, a mixed (1vs1 + 1vs4) strategy is also adopted for data augmentation. To apply this strategy, the class with fewer samples is first identified. In this work, the female class of both the MIT and PKU-Reid datasets has fewer samples than the male class. Therefore, in the second step, images from only the female class are randomly selected to create different sets, as described in Table 3.5. Then, all selected operations are executed. For example, using the female class of the PKU-Reid dataset, the 1vs1 strategy is applied to 64 images and the 1vs4 strategy to 40 images. Similarly, using the MIT dataset, 50 and 28 images are randomly chosen to apply the 1vs1 and 1vs4 strategies respectively. Based on these strategies, augmentation is performed for the balanced distribution of data in both classes to handle the imbalanced binary classification problem for gender prediction. As this research work uses the existing IB-SSS datasets and newly prepared imbalanced and augmented balanced SSS datasets for gender classification, this may be considered a novel contribution to the body of knowledge. Furthermore, the impact of data augmentation is analyzed for pedestrian gender classification.
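A minimal sketch of the 1vs1 and 1vs4 strategies is given below, assuming Pillow for the image operations; the particular set of four operations (flip, rotation, shear, translation) is illustrative, since only the shear matrix example survives in this chapter, and the shear coefficients mirror that example.

```python
import random
from PIL import Image

# Illustrative operations; the shear coefficients mirror the [1 0 0; .3 1 0; 0 0 1] example
def flip(img):      return img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
def rotate(img):    return img.rotate(10)
def shear(img):     return img.transform(img.size, Image.Transform.AFFINE, (1, 0.3, 0, 0, 1, 0))
def translate(img): return img.transform(img.size, Image.Transform.AFFINE, (1, 0, 5, 0, 1, 0))

OPS = [flip, rotate, shear, translate]

def augment(img, strategy="1vs1"):
    """Return 1 (1vs1) or 4 (1vs4) synthetic images for one minority-class sample."""
    if strategy == "1vs1":
        return [random.choice(OPS)(img)]       # one randomly chosen operation
    return [op(img) for op in OPS]             # 1vs4: apply all four operations

# minority_img = Image.open("female_001.png")   # hypothetical file name
# synthetic = augment(minority_img, strategy="1vs4")
```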
Table 3.5: Augmentation statistics for imbalanced and small sample MIT, and PKU-
Reid datasets, class-wise selected number of samples in a single set for data
augmentation, and resultantly, total augmented images and total images
Data preprocessing is an important step for object classification and recognition. In this
work, data preprocessing is used to enhance visual quality and foreground information
of input images. It also optimizes environmental effects such as illumination variations,
light effects, and poor contrast related issues. For each feature extraction, original
gender image is resized as described in Table 3.6. Traditional feature extraction
schemes of PHOG and HSV Histogram used grayscale and HSV color spaces
respectively. For deep feature extraction, both pre-trained CNN models exploit RGB
images as input during the implementation. PETA, VIPeR, and cross-datasets are used
with a small number of samples. These datasets are more challenging due to complexity
and imagery variations. Therefore, a step of contrast adjustment is implemented to
balance high and low contrast in gender images. Moreover, median filter is applied to
remove noisy information from gender images. This step is only applied to the images
of PETA, VIPeR, and cross-datasets before execution of all feature extraction schemes.
Table 3.6: Description of preprocessing for each feature representation scheme
The traditional feature extraction step is usually used to extract basic level information
such as texture, color, and shape of an input image for detection and classification [68,
229]. In this study, two traditional methods namely pyramid HOG [230] and HSV
histogram are utilized to compute shape and color features. The output of both
traditional feature extraction schemes corresponds to an individual feature vector. The
description of both feature extraction schemes is given in subsequent sections.
Presently, PHOG based features have been effectively applied for different pattern
recognition tasks such as detection [231, 232], recognition [21, 233], and classification
[234, 235]. PHOG features mainly comprise of edge based features at multiple pyramid
levels. These features play a vital role because they are insensitive to local geometric,
illumination, and pose variations. Therefore, the proposed approach utilizes PHOG based features to detect distinct characteristics, such as clothing style and carried items, from gender images and classify an image as male or female. The PHOG feature extraction process exploits multiple representations of an input image in a pyramid fashion to preserve spatial layout information, together with the well-known HOG descriptor to construct the local shape of the image. PHOG is an advanced version of the HOG descriptor. To extract HOG features, the image is divided into several blocks, and gradients of the image are computed for each block consisting of 2 × 2 cells. Considering the gradient information in a single block, a local histogram of orientations of each cell is calculated. The final HOG features are then obtained by concatenating the cell-wise histograms. The computed HOG features are capable of characterizing local shape based information. This information is then merged with pyramid based spatial layout information of the gender image for accurate gender prediction. The Canny edge detection approach is adopted to capture the edge based information present in the gender image.
Figure 3.14: PHOG feature extraction scheme
Then, level by level, the full gender image is split into 4^L sub-blocks, where L ∈ {0, 1, 2, 3}, acquiring \sum_{L=0}^{3} 4^{L} = \frac{4^{L+1}-1}{3}\big|_{L=3} = 85 sub-block histograms in total for the PHOG feature vector. For example, at level 0, the whole image is considered as one block because 4^0 is equal to one, and the histogram is computed by considering the complete image as a single block. Similarly, at level 1, the gender image is split using 4^1, and resultantly the image is divided into four sub-blocks. Subsequently, the histograms of these sub-blocks are calculated, for a total of five histograms that comprise the previous one histogram of level 0 and the four histograms of level 1. Executing the next levels in the same way, 21 histograms (5 previous and 16 of level 2) for level 2 and 85 histograms (21 previous and 64 of level 3) for level 3 are calculated and normalized using the L2-norm. The histograms computed at the different pyramid levels 𝐿 are shown in Figure 3.14. All the steps followed in the PHOG feature extraction scheme are given in Algorithm 3.3.
In recent research studies, the perceptual HSV color space is adopted instead of RGB, Luv, Lab, etc., and is mostly exploited to compute HSV histogram based color information for image retrieval [236, 237]. This color information can resist multiple types of changes in an image such as size, direction, rotation, distortion, and noise [238]. Thus, an HSV histogram based color feature extraction scheme is used for the gender classification task.
Algorithm 3.3: PHOG feature extraction
Input: Training and testing images
Output: Normalized pyramid HOG feature vector (PHOG_FV)
Begin
Step# 1: Initially, set the values of the required parameters: levels 𝐿 = 3, histogram bin size 𝑏𝑠 = 8, angle = 360, and roi = [1; 128; 1; 64].
Step# 2: Convert resized input RGB image into a grayscale image.
Step# 3: Compute edge features from grayscale image using a canny edge
detector.
Step# 4: Obtain a matrix with histogram values (𝑚ℎ𝑣 ) and a matrix with
gradient values (𝑚𝑔𝑣 ).
Step# 5: Using 𝑚ℎ𝑣 , 𝑚𝑔𝑣 , 𝐿, 𝑎𝑛𝑔𝑙𝑒 and 𝑏𝑠 , compute PHOG features over
given 𝑟𝑜𝑖.
Step# 6: Normalize computed PHOG features.
Step# 7: Finally, obtain the normalized PHOG feature vector (PHOG_FV) with dimension 1 × 680.
End
HSV histogram based color features are acquired by following three steps: 1) convert the gender RGB image into the HSV color space using Eqs. (3.60)-(3.62), where 𝐻 represents hue, 𝑆 represents saturation, and 𝑉 represents brightness; 2) perform color quantization to decrease the feature vector dimension and reduce computational complexity, where the quantization process minimizes the number of colors and levels used in an image; and 3) compute the histogram of each quantized image according to the applied intervals for the H, S, and V channels with 8, 2, and 2 bins respectively. The computed histogram shows the frequency distribution of the quantized HSV values of each pixel in a given image, as shown in Figure 3.15.
H = \cos^{-1}\left( \frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\sqrt{(R-G)^{2}+(R-B)(G-B)}} \right)    (3.60)
S = 1 - \frac{3\left[\min(R, G, B)\right]}{R+G+B}    (3.61)
V = \frac{R+G+B}{3}    (3.62)
Figure 3.15: HSV histogram based color features extraction
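A short sketch of this 8 × 2 × 2 quantized histogram is given below, assuming OpenCV; note that OpenCV stores 8-bit hue in the range [0, 180), which the bin ranges reflect, and the normalisation step is an assumption.

```python
import cv2

def hsv_histogram(bgr_image):
    """Quantized HSV colour histogram with 8 x 2 x 2 = 32 bins (cf. Eqs. (3.60)-(3.62))."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # 8 hue bins, 2 saturation bins, 2 value bins over OpenCV's HSV ranges
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 2, 2], [0, 180, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / max(hist.sum(), 1.0)        # frequency distribution, shape (32,)

# colour_fv = hsv_histogram(cv2.imread("pedestrian.png"))   # hypothetical file name
```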
CNN architectures can automatically learn different types of distinct properties by using the backpropagation method. Typically, these architectures are assembled from different building blocks such as CLs, PLs, transition layers (TL), an activation unit (AU), and FC layers. The convolutional block is formed of a CL, an AU, and a PL, in which the CL operates on the input image with the nominated kernel. The kernel size may be set as 3 × 3, 5 × 5, or 7 × 7 pixels. Thus, the input of the next layer (n₂ × n₃) along with the filter is applied to the image. Activation maps with distinctive features are produced as a result of the convolution
process. Each CL enables the use of a filter. The output Z_x^l of layer (l) comprises n_1^l feature-maps of size n_2^l × n_3^l. The x-th feature-map, denoted Z_x^l, is computed using Eq. (3.63), where b_x^l and f_{x,y}^l denote the bias matrix and the filter, respectively.
Z_x^{l} = f_m\!\left( b_x^{l} + \sum_{y=1}^{f_x^{l-1}} f_{x,y}^{l} \times Z_y^{l-1} \right)    (3.63)
The PL diminishes the image size while retaining the essential image information, and it also limits the number of parameters. A PL has two settings: (1) a particular filter size f^l and (2) an overlapping/non-overlapping sliding window sw^l. The PL receives input data of size n_1^{l-1} × n_2^{l-1} × n_3^{l-1} and yields an output of size n_1^l × n_2^l × n_3^l. Briefly, the operation of the PL is shown in Eqs. (3.64)-(3.66).
𝑛1𝑙 = 𝑛1𝑙−1 (3.64)
Moreover, a nonlinear node named ReLU allows complex relationships in the data to be learned. The FC layers flatten the deeply learned information from the earlier layers into a single vector. They also implement the update of weights and provide a value for each label. The fully connected layers in a CNN essentially form a multilayer perceptron mapping n_1^{l-1} × n_2^{l-1} × n_3^{l-1}. The operational procedure of the FC layers is shown in Eq. (3.67).
y_x^{l} = f(Z_x^{l}) \quad \text{with} \quad Z_x^{l} = \sum_{y=1}^{f_1^{l-1}} w_{x,y}^{l}\, y_y^{l-1}    (3.67)
Transfer learning with fine-tuning is often used to solve the SSS dataset problem, where a model is already trained on a very large-scale dataset, for instance, ImageNet [128, 243]. It is a much faster and more suitable procedure for training a model on new data than training a model from scratch. On the other hand, already-trained CNN models can be utilized as deep feature extractors, and their applications are also reported in recent research for different pattern recognition tasks [244, 245]. In this context, existing studies utilized, independently or in combination, the early, middle, and last FC layer(s) of an already-trained CNN model to extract the deeply learned information of an input image and
then used this information or subset of information to train a new classifier such as
SVM and KNN.
InceptionResNetV2 (IRNV2) is a CNN model, 164 layers deep, trained on a huge amount of images collected from the ImageNet database [246]. It is a hybrid model that combines the inception structure with residual connections. The model takes images of size 299 × 299 × 3 at the input layer, and its output provides learned rich information and an estimated value for each class. The benefits of IRNV2 include converting inception modules to residual inception blocks, adding more inception modules, and adding a new type of inception module (inception-A) after the stem module. A schematic view of the InceptionResNetV2 architecture (compressed) is shown in Figure 3.17.
DenseNet201 (DN201) is a CNN model, 201 layers deep, that accepts an input size of 224 × 224 × 3 and has been evaluated on the SVHN, CIFAR-10, CIFAR-100, and ImageNet databases [247]. This network was designed to achieve deep and wide CNN architectures that can enhance the performance of deep CNNs. In this way, DenseNet (DN) is an improvement over ResNet (RN) that comprises dense connections between layers to transfer collective knowledge and enable feature reuse.
Hence, each network layer obtains information from all preceding layers and sends it to all subsequent layers. This encourages maximum flow of information from a particular layer to the next, including feature reuse within the network. Unlike traditional convolutional networks with 𝑙 layers having 𝑙 connections, DN has 𝑙(𝑙 + 1)/2 direct connections. Moreover, DN can improve performance by alleviating vanishing gradient problems and through implicit deep supervision, model compactness, and a reduction in the parameter count. A schematic view of the DenseNet201 architecture is given in Figure 3.18.
This section covers deep CNN feature extraction from the two already-trained models and their fusion. As mentioned above, two different deep networks, IRNV2 and DN201, are selected for deep CNN feature extraction. It is noticeable that networks with different architectures have the strength to acquire diverse characteristics because of their different depths and structures. Furthermore, strong ImageNet performance does not necessarily transfer to further tasks; therefore, multiple networks may be required. Keeping this in view, both networks are utilized in this work, where the FC layer of each network is considered for deep feature extraction. The common aspect of these deeper networks is that they each have only one FC layer at the end, with a feature matrix of size 1 × 1000, as shown in Figure 3.19. Consequently, two deep feature vectors, IRNV2_FCL and DN201_FCL, are obtained from the InceptionResNetV2 and DenseNet201 CNN models, respectively.
The use of deep feature fusion is motivated by the following two reasons: (1) it is useful for acquiring more powerful and richer features than those obtained from an individual model, and (2) parallel fused features are more expressive because they reflect the distinct properties of two different deep networks. Keeping in view these benefits, the deeply learned features of both models are merged by applying parallel fusion based on two procedures: (1) maximum score-based fusion and (2) average score-based fusion. The objective of this research work is to represent the gender image with joint distinctive and average feature depictions under two diverse depths. Both feature fusion procedures follow two steps to compute the maximum score-based and average score-based deep feature vectors: (1) utilization of an overlapped sliding window of size 2 × 2 on both deep feature vectors (first row values belong to the IRNV2_FCL feature vector and second row values belong to the DN201_FCL feature vector), and (2) implementation of both methods to compute the maximum score-based deep feature vector (MaxDeep_FV) and the average score-based deep feature vector (AvgDeep_FV).
where 𝑁 represents the total number of samples and 𝑀 represents the dimension of the deep feature vector. Before the fusion operations, a feature concatenation 𝑍𝑘 is formed to hold the extracted deep information of both models. The mathematical representation is provided in Eq. (3.70).
Z_k = \begin{bmatrix} x_{k,j} \\ y_{k,j} \end{bmatrix}, \quad k = 1 \ldots N, \quad j = 1 \ldots M    (3.70)
The maximum score-based fusion procedure chooses the maximum response from the generated feature window of size 2 × 2. Similarly, the average score-based fusion procedure computes the average response over the generated feature window of size 2 × 2. Both procedures are executed step by step with an overlapped sliding window, and the response of each feature window is stored to produce merged information with a resulting dimension of 1 × 1000 for both MaxDeep_FV and AvgDeep_FV. The fused deep features deal with different challenges such as pose changes, viewpoint variations, and environmental effects because of the rich and powerful information of the two deep networks.
One of the important factors in enhancing classification rates is the availability and utilization of distinct features from the extracted feature vector. It is notable that extracted feature vectors have high dimensions and contain irrelevant information not related to robust modeling. Moreover, this irrelevant information not only decreases the overall performance of the classifier but also increases the computational cost. In existing studies, different features selection techniques such as CCA, entropy, PCA, and DCA are commonly applied for dimensionality reduction. These techniques also provide a suitable way to select an OFS from a large feature vector by discarding irrelevant information. Therein, a few techniques have been implemented to compute the relationship between two or more feature representations. Features selection is a key step that is implemented to improve classifier performance without much loss of the computed features. Keeping in view these reasons, PCA and entropy based features selection methods are used in this study before the feature fusion of the applied feature extraction schemes. The prime objective of applying a features selection method is to choose an OFS from the applied feature vector(s) that preserves the distinct representations of the original feature vector while reducing the dimension by eliminating the redundancy and noise present in it. Entropy and PCA are straightforward methods; PCA additionally provides the benefit of minimizing the reconstruction error with no strong assumptions about the use of the reduced feature vector. Thus, the contribution of each applied feature vector is selected from the PCA and entropy controlled feature vector(s) and explored for the pedestrian gender classification task.
The proposed approach utilizes the traditional feature vectors of size 𝑁 × 680 and 𝑁 ×
32 obtained from gender images with PHOG and HSV histogram based feature
extraction schemes respectively. In addition, deep learned information at FC layer of
two deeper networks IRNV2 and DN201 is used as deep features as presented in the
previous subsection with the name of IRNV2_FCL and DN201_FCL deep feature
vectors. Then, a parallel fusion of deep feature vectors is implemented to combine
deeply learned information of both networks. As a result, two fused feature vectors are
generated and used as MaxDeep_FV and AvgDeep_FV, each of size 𝑁 × 1000 such
that N denotes total number of images in the selected dataset. Fused deep feature vector
is more expressive and has distinct representation because it contains maximum score
and average score-based features from two deeper networks instead of a single network.
The deeper network based fused deep features help to reduce false positive rate due to
large variations in appearance under non-overlapping camera settings.
where 𝑓 represents a numerical value (feature) with 𝑛 dimensions. As mentioned earlier, the entropy based features selection defined in Eqs. (3.49)-(3.58), which responds to a specific feature vector with maximum numerical values, is utilized. The mathematical representation is given in Eqs. (3.77)-(3.80), where 𝑃𝐻𝑣, 𝐻𝐻𝑣, 𝑀𝐷𝑣, and 𝐴𝐷𝑣 indicate the selected subsets of features and ẟ represents the top features set from the computed entropy based information.
Similarly, score-based top 200 features are selected from each PCA controlled feature vector PH, MD, and AD, and 30 features are taken from the HSV histogram based feature vector HH. Mathematically, selection of the top feature subsets (FSs) is performed through Eqs. (3.85)-(3.88):

PH_v = PCA(PH, δ) (3.85)
HH_v = PCA(HH, δ) (3.86)
MD_v = PCA(MD, δ) (3.87)
AD_v = PCA(AD, δ) (3.88)

where PH_v, HH_v, MD_v, and AD_v indicate the selected subsets of features and δ represents the set of top features obtained from the computed PCA based information.
where FFV_(1×d) denotes the fused feature vector of a single image, d = 630 is the FFV dimension, and FFV_(N×d) represents the fused feature vectors of all N sample images. The selected FFV is then supplied to different classifiers for classification. This work uses supervised learning methods that are trained on the data for gender prediction. The performance of the proposed approach using both feature selection methods is tested separately, and the empirical analysis shows that the PCA based selected OFSs outperform the entropy based selected OFSs for gender prediction (see results and evaluation Section 4.4.3).
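Under the dimensions stated above (200 PHOG, 30 HSV histogram, 200 MaxDeep, and 200 AvgDeep features), the serial fusion into a 630-dimensional FFV can be sketched as follows; the variable names are hypothetical.

import numpy as np

def serial_fusion(ph_v, hh_v, md_v, ad_v):
    """Concatenate the selected subsets (200 PHOG + 30 HSV + 200 MaxDeep +
    200 AvgDeep) into one 630-dimensional FFV per image."""
    return np.concatenate([ph_v, hh_v, md_v, ad_v], axis=1)

N = 100                                   # placeholder number of images
ffv = serial_fusion(np.random.rand(N, 200), np.random.rand(N, 30),
                    np.random.rand(N, 200), np.random.rand(N, 200))
assert ffv.shape == (N, 630)              # FFV of size N x 630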
This section describes the selected SVM classifier, which is commonly used in classification problems on text and image datasets. The main reasons for choosing the SVM classifier are: 1) it improves the generalization ability of the learning machine and gives reliable classification rates on large-scale and even SSS data, 2) it offers efficient prediction speed, especially for binary classification tasks, a flexible model, and memory efficiency, and 3) it is capable of mitigating over-learning (overfitting) [248, 249]. The SVM classifier provides a set of kernel functions to transform the data into the targeted form. Keeping this in view, this work utilizes linear, quadratic, cubic, and medium Gaussian kernels with SVM. Aiming to establish a reliable classification model for gender prediction, extensive experimentation is done on SVM based classification methods, namely linear SVM (L-SVM), quadratic SVM (Q-SVM), cubic SVM (C-SVM), and medium Gaussian SVM (M-SVM), to test the performance of the proposed PGC-FSDTF approach. All these methods are executed under their default settings.
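A hedged sketch of the four SVM variants is shown below using scikit-learn stand-ins (the thesis experiments use MATLAB classifiers with default settings); the polynomial degrees approximate the quadratic and cubic kernels, gamma="scale" stands in for the medium Gaussian kernel scale, and the data are placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 630)              # placeholder FFVs
y = np.tile([0, 1], 100)                  # placeholder gender labels

# Stand-ins for the four kernels named above
models = {
    "L-SVM": SVC(kernel="linear"),
    "Q-SVM": SVC(kernel="poly", degree=2),
    "C-SVM": SVC(kernel="poly", degree=3),
    "M-SVM": SVC(kernel="rbf", gamma="scale"),
}
for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=10).mean()     # 10-fold cross-validation
    print(f"{name}: {acc:.3f}")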
3.5 Summary
In this research work, the full-body appearance of pedestrians is examined for person ReID and pedestrian gender classification tasks. Person ReID consists of feature representation, features-based clustering, and feature matching, whereas pedestrian gender classification consists of data preparation, feature extraction/learning, feature selection, fusion, and classification. Considering existing challenges and limitations, three methods are proposed in this dissertation for full-body appearance based pedestrian analysis, as described in this chapter. The significance of the proposed methodologies is summarized below:
Method I performs person ReID on full-body diverse appearances of pedestrians. This method presents a robust way to re-identify pedestrians, where cluster-based probe matching is carried out to optimize the gallery search. To the best of our knowledge, the FFS approach is proposed for the first time for cluster formation in ReID. The recognition rates are improved by acquiring PCA based deep features of each cluster sample, which are fused with the low-level feature subset. Cluster-based probe matching, instead of searching the whole gallery, further improves the recognition rates.
Method II performs pedestrian gender classification using low-level and deep
feature representations, jointly. The method is applicable for both large-scale
and small-scale datasets. The features selection on the combination of low-level
and deep feature representations is used to enhance O-ACC, CW-ACC, and
AUC.
Method III performs pedestrian gender classification on imbalanced, augmented, and customized balanced SSS datasets. The color and PHOG features are fused with deep features acquired from deeper CNN networks in a parallel manner. Two different feature selection methods are used to investigate the performance of the proposed method for gender prediction on SSS datasets. A unique data augmentation strategy is also proposed to handle the imbalanced binary classification problem. In addition, thirteen different datasets are prepared from existing datasets in the pedestrian analysis domain to investigate the robustness of the proposed method.
Chapter 4 Results and Discussions
4.1 Introduction
This chapter presents the experimental setup and results obtained after empirical
analysis of proposed methodologies for person ReID and pedestrian gender
classification tasks (as discussed in chapter 3). The evaluation of each task is described
separately in subsequent sections. Figure 4.1 provides the organization of this chapter
including section wise highlights. Firstly, performance evaluation protocols and
implementation settings are described.
Then, the publicly available datasets used in this study are presented in detail. Finally, the results of all proposed methodologies are discussed and compared with state-of-the-art methods from the literature. All methods are implemented using MATLAB 2019a and executed on a Core i5-7400 desktop system with 16 GB of RAM and a GeForce GTX 1080 GPU.
The commonly used protocol named cumulative matching characteristics (CMC) [97] is considered the standard metric for evaluating person ReID research. In the evaluation process, all datasets are initially separated into training and testing sets. Besides, the training set (gallery images) of each dataset is divided into k consensus clusters. Then, random initialization of the cluster centers is performed 150 times to address the problem of cluster center selection. A suitable number of clusters, k = 6, is obtained using the self-tuning algorithm [194] across the selected datasets. Later on, a probe image is matched with the classified cluster to increase the retrieval probability of the most similar image compared with retrieving a similar image from the whole gallery. For each test, one-trial CMC results are recorded at all ranks of true matches. This validation process is iterated ten times (10 trials), and mean results are presented using CMC curves to show stable statistics. To assess the proposed FCDF framework, 316, 486, and 150 pedestrian images are selected randomly from the VIPeR, CUHK01, and iLIDS-VID datasets, respectively, for training and testing.
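For reference, a single-trial CMC curve can be computed from a probe-to-gallery distance matrix as sketched below; the random distances and identity labels are placeholders for the actual matching scores.

import numpy as np

def cmc_curve(dist, gallery_ids, probe_ids, max_rank=50):
    """Single-trial CMC: for each probe, find the rank at which its true
    identity first appears in the distance-sorted gallery list."""
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                       # closest gallery entries first
        rank = np.where(gallery_ids[order] == pid)[0][0]
        if rank < max_rank:
            hits[rank:] += 1                              # a match at rank r also counts at r+1, r+2, ...
    return hits / len(probe_ids)

# Placeholder distances for 150 probe/gallery identity pairs (e.g., the iLIDS-VID setting)
ids = np.arange(150)
dist = np.random.rand(150, 150)
cmc = cmc_curve(dist, ids, ids)
print("rank-1:", cmc[0], "rank-10:", cmc[9], "rank-20:", cmc[19])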
The datasets evaluated in this study are benchmark and publicly available: VIPeR [97], CUHK01 [113], and iLIDS-VID [80]. More details of these datasets are provided in the subsections. All these datasets are challenging due to different camera settings, changes in viewpoint, low contrast/resolution, and variations in the observed pose of a person. These issues make person ReID a more challenging problem.
The multi-shot CUHK01 (China University of Hong Kong 01) [113] dataset consists
of 971 identities taken from two disjoint cameras CAM1 and CAM2 in a campus
environment. CAM1 captures front and back views of each pedestrian and CAM2
captures their side views (left and right). As a whole, this dataset contains four different
views of each identity with a total of 3884 pedestrian images and is challenging because
of viewpoint and pose changes. For experiments, it is split into two equal sets of 486
images as a gallery set as well as probe set.
The imagery library for intelligent detection systems video re-identification (i-LIDS-
VID) [80] is considered as a benchmark dataset for video analytics (VA) systems. It
has been collected by the Centre for Applied Science and Technology (CAST) in partnership with the Centre for the Protection of National Infrastructure (CPNI). The i-LIDS contains a library of CCTV video footage based around "scenarios" central to the UK government's requirements. The footage accurately represents real operating conditions and potential threats. The library covers two types of scenario: 1) event detection scenarios, comprising parked vehicle, doorway surveillance, sterile zone, abandoned baggage, and new technology scenarios, and 2) a tracking scenario, the multiple camera tracking scenario (MCTS), holding nearly 50 hours of footage provided by five cameras deployed in an airport. From this collection, a subset of images is selected in this work as the iLIDS-VID dataset, which contains 600 images of 300 identities such that each identity has two images captured under non-overlapping camera settings. This dataset is highly challenging because of photometric variations and environmental effects such as diverse occlusions, large variations in appearance from one camera to another, and people wearing backpacks, carrying luggage, or pushing carts. Table 4.1 lists the main particulars (total images, image pairs, image size, and challenges) of the selected person ReID datasets.
Table 4.1: Statistics of datasets for person ReID
Name | Total images | Image pairs | Image size | Challenges
VIPeR [97] | 1264 (632 identity pairs) | 316 | 128×48 | Pose variations, occlusion, and illumination changes
CUHK01 [113] | 3884 (971 identity pairs) | 486 | 160×60 | Front, back, and side views with illumination changes and pose variations
iLIDS-VID [80] | 600 (300 identity pairs) | 150 | 128×48 | Clothing similarity, partial occlusion, illumination changes, and background clutter
For the evaluation of the proposed work, different combinations of deep features are initially fused with the optimal feature subset OFS_select. For this purpose, all selected datasets are taken into account and the recognition rate is computed at different ranks (rank-1, rank-10, rank-20, and rank-50), where the probe image is matched within the classified cluster(s) instead of the whole gallery set. In the subsequent tables, the top recognition rate at each rank is written in bold.
The experimental results using the concatenation of optimal handcrafted and deep features are shown in Tables (4.2)-(4.4). During the experiments, different numbers of top PCA based deep features are chosen to find the best deep feature combination with OFS_select. The outcomes of these experiments show that varying the number of deep features helps to find a suitable set of deep features that provides the best recognition rates.
Table 4.2: Experimental results using deep features (from higher to lower dimension) on VIPeR dataset
Selected deep features   Rank-1   Rank-10   Rank-20   Rank-50
4000 30.2 72.1 80.0 86.8
3000 33.6 76.6 82.2 89.1
2000 36.6 79.4 87.6 92.2
1500 40.0 84.2 89.4 95.4
1000 46.8 87.2 93.2 98.5
750 43.6 86.6 92.0 97.9
Table 4.3: Experimental results using deep features (from higher to lower dimension) on CUHK01 dataset
Selected deep features   Rank-1   Rank-10   Rank-20   Rank-50
4000 36.7 75.1 87.6 90.1
3000 39.5 76.0 88.3 92.2
2000 41.7 78.0 89.9 93.6
1500 43.3 79.6 91.1 94.5
1000 48.1 81.6 92.4 96.1
750 47.2 80.2 90.1 95.0
Table 4.4: Experimental results using deep features (from higher to lower dimension) on i-LIDS-VID dataset
Selected deep features   Rank-1   Rank-10   Rank-20   Rank-50
4000 33.3 72.4 79.2 87.7
3000 35.0 73.8 82.4 90.3
2000 36.7 75.0 83.1 92.8
1500 38.9 77.1 84.8 93.5
1000 40.6 78.5 86.9 93.8
750 47.2 80.2 90.1 95.0
Moreover, this process removes irrelevant and redundant information through the PCA based approach. According to these results, 1000 deep features combined with the handcrafted OFS provide distinct and sufficient information for the single-shot person ReID process. Thus, the contribution of deep features with the OFS is desirable for cluster-based probe matching.
For comparison, relevant state-of-the-art methods are considered such as RBS [81],
Preid PFCC [90], HGL ReID [127], RLML [250], NP ReId [95], DPML [112], Multi-
scale CNN [251], Multi-HG [84], RD [88], Sparse [252], Soft-Bio [19], ML common
subspace [72], AML-PSR [115], semi supervised-ReID+XQDA [253], LOMOXQDA
[77], RDML-CCPVL [122], Inlier-set group modeling [96], FMCF [68], Salience [61],
RF+DDA+AC [254], MARPML [42], PaMM [86], DVDL [255], AFDA [111],
Midlevel [79], Salmatch [78], Semantic [83], ROCCA [82], DVR [80], Hessian [119],
RN+XQDA+RR+DFR, PAN+XQDA+RR+DFR [137], SSS with fully-
supervised+LOMO, semi supervised+LOMO and fusion [87], MKSSL,
LOMO+MKSSL and MKSSL-MRank [256], DMIL [136], SSSVM [99], VS-SSL with
GOG, LOMO and combined [120], ML-ReID+LOMO [75], MSDALF+HM,
HSV+SMA and MSDALF+SMA [107], TPPLFDA [74], SECGAN_s, SECGAN_c,
SECGAN [43], EML [124], LWA [138], QRKISS-FFN M1&M2 [123], SCNN
(handcrafted and learned features) [73], ResNet-50 and inceptionV4 [139],
BRM2L(GOG and FTCNN) [257], UCDTL [258], TLSTP [142], P2SNET [135],
TMSL [259], PHDL [260], CTC-GAN [261], TMD2L [118], HRPID [262], DHGM-
average pooling and regularized minimum [76], and CMGTN [141] on selected
datasets. The existing approaches used the selected datasets in different combinations
for evaluation and validation. For instance, the approaches that consider VIPeR do not
use CUHK01 and iLIDS-VID datasets, and vice-versa. The computed results are
presented for VIPeR, CUHK01, and iLIDS-VID datasets in Figure (4.2), Figure (4.3),
and Figure (4.4), respectively. Also, the presented results of proposed framework are
assessed with existing approaches at ranks 1, 10, 20, and 50. For further clarity, the
quantitative performance comparison is tabulated in Tables (4.5)-(4.7) with
corresponding VIPeR, CUHK01, and iLIDS-VID datasets.
a) VIPeR Dataset: According to the results presented in Table 4.5, the proposed
framework achieves a 46.82% rank-1 matching rate on VIPeR that outperforms
previous rank-1 results of RBS [81], NP ReId [95], DPML [112], Soft-Bio [19], AML-
PSR [115], RD [88], Sparse [252], Inlier-set group modeling [96], semantic attributes
[99], semi-supervised-ReID+XQDA [253], LOMOXQDA [187], MKSSL and
MKSSL-MRank [85] which achieve 28.4%, 43.3%, 41.4%, 43.9%, 45.1%, 33.2%,
32.9%, 41.3%, 44.7%, 40.5%, 40.0%, 40.6%, and 42.3% respectively. Similarly, results
of proposed FCDF framework are better than the graph learning methods such as HGL
ReID[127] and Multi-HG [84] as well as metric learning methods such as RLML [250],
EML [124], DMIL [136] and RDML-CCPVL [122]. Moreover, the proposed
framework obtains improved results as compared to those methods where LOMO
descriptor is used for person ReID such as SSS with fully-supervised+LOMO, semi
supervised+LOMO and fusion [87], and LOMO+MKSSL [85]. In addition, the
proposed framework outperforms the recent state-of-the-art methods such as SSSVM
[99], VS-SSL with GOG, LOMO and combined [120], ML-ReID+LOMO [75],
InceptionV4 [139], HSV+SMA [107], MSDALF+SMA [107], TPPLFDA [74],
SECGAN_s, SECGAN_c, and SECGAN [43]. Furthermore, the computed results in
comparison with existing methods are also presented using CMC as shown in Figure
4.2. The results on VIPeR confirm that, even though the presented framework has limited training data, it achieves significant performance on this challenging dataset at rank-1 and rank-10. These outcomes on the VIPeR dataset confirm the usefulness of the presented FCDF framework in comparison to state-of-the-art approaches.
Figure 4.2: CMC curves of existing and proposed FCDF method on VIPeR dataset
Table 4.5: Performance comparison in terms of top matching rates (%) of existing methods including the proposed FCDF method on the VIPeR dataset (p=316); dash (-) represents that no reported result is available
Methods   Year   Rank-1   Rank-10   Rank-20   Rank-50
RBS [81] 2015 28.4 64.5 76.2 -
LOMOXQDA [187] 2015 40.0 68.9 81.5 91.1
RLML [250] 2015 35.2 81.6 - 90.8
HGL ReID [127] 2016 34.1 79.7 90.1 98.1
DPML [112] 2016 41.4 80.9 90.4 -
RD [88] 2016 33.2 78.3 88.4 97.5
Sparse [252] 2016 32.9 75.9 89.2 96.8
SSS fully-supervised + LOMO [87] 2016 42.2 82.9 92.0 -
SSS semi-supervised + LOMO [87] 2016 31.6 72.7 84.9 -
SSS + Fusion [87] 2016 41.0 81.6 91.0 -
NP ReId [95] 2017 43.3 85.2 96.0 -
Multi-HG [84] 2017 44.5 83.0 92.4 99.1
Soft-Bio [19] 2017 43.9 86.5 94.5 99.6
Preid PFCC [90] 2017 28.0 - - -
AML-PSR [115] 2017 45.1 73.5 85.3 90.5
MKSSL [85] 2017 40.6 78.1 85.9 -
LOMO+MKSSL [85] 2017 31.2 62.9 72.8 -
MKSSL-MRank [85] 2017 42.3 74.4 80.6 -
Semi supervised-ReID+XQDA [253] 2018 40.5 64.4 - -
RDML-CCPVL [122] 2018 13.3 60.0 83.3 -
ML common subspace [72] 2018 34.5 80.5 90.4 98.0
DMSCNN [136] 2018 31.1 71.7 86.3 -
DMIL [136] 2018 31.1 71.7 84.7 -
Hessian [119] 2019 20.3 52.1 68.1 -
Multi-scale CNN [251] 2019 22.6 57.6 71.2 -
Inlier-set group modeling [96] 2019 41.3 85.2 94.0 -
SSSVM [99] 2019 44.7 84.5 92.0 -
Semantic attributes [99] 2019 44.7 84.5 92.0 -
VS-SSL with GOG [120] 2020 43.9 80.9 87.8 -
VS-SSL with LOMO [120] 2020 35.1 67.4 77.6 -
VS-SSL Combined [120] 2020 44.8 79.3 86.1 -
ML-ReID + LOMO [75] 2020 42.5 81.5 91.2 -
InceptionV4 [139] 2020 23.2 41.5 - -
HSV + SMA [107] 2020 40.4 - - -
MSDALF + SMA [107] 2020 40.3 - - -
TPPLFDA [74] 2020 42.3 76.3 89.7 -
SECGAN_s [43] 2020 36.8 75.9 87.3 -
SECGAN_c [43] 2020 37.5 74.6 86.8 -
SECGAN [43] 2020 40.2 77.3 89.7 -
EML [124] 2020 44.3 85.2 92.3 -
Proposed (FCDF) 46.8 87.2 93.2 98.5
b) CUHK01 Dataset: The proposed FCDF method is tested on CUHK01 dataset in
which 486 images are randomly selected while taking one image per person from the
gallery set. As per the results illustrated in Figure 4.3, the presented approach
outperforms the existing methodologies.
Figure 4.3: CMC curves of existing and proposed FCDF method on CUHK01 dataset
Table 4.6: Performance comparison in terms of top matching rates (%) of existing methods including the proposed FCDF method on the CUHK01 dataset (p=486); dash (-) represents that no reported result is available
Method   Year   Rank-1   Rank-10   Rank-20   Rank-50
Salmatch[78] 2013 28.4 55.6 67.9 -
Midlevel [79] 2014 34.3 64.9 74.9 -
Semantic [83] 2015 32.7 64.4 76.3 -
ROCCA [82] 2015 29.7 66.0 76.7 -
Sparse [252] 2016 31.3 68.3 78.1 87.6
HGL ReID [127] 2016 35.0 69.2 80.6 91.5
RD [88] 2016 31.1 68.5 79.1 90.3
FMCF [68] 2016 25.0 39.0 46.6 -
DPML [112] 2016 35.8 70.9 79.5 -
NP ReId[95] 2017 44.5 80.3 92.3 -
ML common subspace[72] 2018 34.0 69.7 80.5 91.6
Semi supervised-ReID+XQDA[253] 2018 37.0 64.4 - -
DMS CNN[136] 2018 31.8 69.9 80.7 -
RDML-CCPVL [122] 2018 18.5 57.1 79.8 -
DMIL [136] 2018 31.8 69.9 80.9 -
RN+XQDA+RR+DFR [137] 2019 44.3 71.4 80.3 -
PAN+ XQDA+RR+DFR [137] 2019 42.6 73.2 80.6 -
Inlier-set group modeling [96] 2019 44.5 80.3 92.3 -
Hessian [119] 2019 21.9 52.0 62.2 -
LWA [138] 2019 46.7 - - -
QRKISS-FFN (M1) [123] 2019 42.1 70.7 - -
QRKISS-FFN (M2) [123] 2019 44.9 73.8 - -
HSV + SMA [107] 2020 27.0 - - -
MSDALF + SMA [107] 2020 21.7 - - -
SCNN (learned features) [73] 2020 31.0 74.0 86.0 -
SCNN (handcrafted features) [73] 2020 22.0 57.0 68.0 -
ResNet-50 [139] 2020 47.1 71.4 - -
InceptionV4 [139] 2020 43.5 61.3 - -
BRM2L(GOG) [257] 2020 45.3 86.5 90.0 -
BRM2L(FTCNN) [257] 2020 47.4 88.3 98.3 -
Proposed (FCDF) 48.1 81.6 92.4 96.1
Similarly, FCDF acquires better rank-1 results when evaluated with previous descriptor
results of FMCF [68] with 25.0% and RD [88] with 31.1%. In comparison to other
recent approaches such as DMIL [136], LWA [138], QRKISS-FFN M1&M2 [123],
MSDALF+SMA and HSV+SMA [107], SCNN (handcrafted and learned features) [73],
ResNet-50 and inceptionV4 [139], and BRM2L(GOG and FTCNN) [257] proposed
rank-1 results are superior. The improved results are due not only to the optimal handcrafted features but also to gallery search optimization. Moreover, the use of deep features with the OFS effectively handles appearance diversity, which proves the significance of the proposed FCDF framework.
c) iLIDS-VID Dataset: The results are also computed using the iLIDS-VID dataset,
having numerous issues in captured images such as variations in pose, random
occlusions, and clothing similarities. For performance comparison of proposed FCDF
framework with relevant methods, 150 image pairs of the dataset are used. The results
based CMC curves depict the evaluation of FCDF framework with the existing
approaches at ranks 1, 10, 20, and 50 as shown in Figure 4.4.
Figure 4.4: CMC curves of existing and proposed FCDF method on the iLIDS-VID
dataset
According to results shown in Table 4.7, rank-1 matching rate of 40.67% is attained
that outperforms the previous results reported by MS-color & LBP+DVR [80], semi-
supervised-ReID+XQDA [253], NP ReId [95], RF+DDA+AC [254], MARPML [42],
PaMM [86], AFDA [111], Inlier-set group modeling [96], and CMGTN [141] with
34.5%, 39.3%, 35.8%, 36.4%, 36.9%, 30.3%, 37.5%, 34.8%, and 38.4% matching rates
at rank-1 respectively.
Table 4.7: Performance comparison in terms of top matching rates (%) of existing methods and the proposed FCDF method on the iLIDS-VID dataset (p=150); dash (-) represents that no reported result is available
Method   Year   Rank-1   Rank-10   Rank-20   Rank-50
Salience [61] 2013 10.2 35.5 52.9 -
DVR [80] 2014 23.3 55.3 68.4 -
Salience + DVR [80] 2014 30.9 65.1 77.1 -
MS color & LBP+DVR [80] 2014 34.5 67.5 77.5 -
DVDL [255] 2015 25.9 57.3 68.9 -
AFDA [111] 2015 37.5 73.0 81.8 -
UCDTL [258] 2016 21.5 53.7 73.6 -
PaMM [86] 2016 30.3 70.3 82.7 -
NP ReId [95] 2017 35.8 59.5 72.3 -
RF+DDA+AC [254] 2017 36.4 59.0 73.9 91.7
P2SNET [135] 2017 40.0 78.1 90.0 -
TMSL [259] 2017 39.5 75.4 86.5 -
PHDL [260] 2017 28.2 65.9 80.4 -
MARPML [42] 2018 36.9 77.1 89.9 -
Semi supervised-ReID+XQDA[253] 2018 39.3 71.1 - -
TLSTP [142] 2018 30.3 66.6 79.8 -
Multi-scale CNN [251] 2019 33.6 69.7 84.2 -
Inlier-set group modeling [96] 2019 34.8 59.5 72.3 -
CTC-GAN [261] 2019 35.3 72.0 82.9 -
TMD2L [118] 2019 29.6 74.0 86.2 -
HRPID [262] 2019 23.9 67.1 83.7 -
TPPLFDA [74] 2020 33.8 66.8 78.5 -
MSDALF + HM [107] 2020 34.4 - - -
MSDALF + SMA [107] 2020 34.3 - - -
DHGM-average pooling [76] 2020 33.5 - - -
DHGM-regularized minimum [76] 2020 40.0 - - -
CMGTN [141] 2020 38.4 74.8 86.3 -
InceptionV4 [139] 2020 32.9 74.2 - -
Proposed (FCDF) 40.6 78.5 86.9 93.8
The proposed FCDF framework attained significant recognition rates at different ranks
for ReID problem. According to the computed results on VIPeR dataset, it outperforms
the previous methods of AML-PSR, Multi-HG, Semantic attribute, NP ReId, Soft-Bio,
Inlier-set group modeling, DPML, and LOMOXQDA with margins of 1.63%, 2.26%,
2.10%, 3.46%, 2.90%, 5.47%, 5.36%, and 6.82% respectively. Using VIPeR dataset,
this research work attained better results at rank-1 and rank-10 whereas NP ReId and
Soft-Bio perform best at rank-20 and rank-50, respectively. The recognition rates at rank-1 and rank-10 confirm that the proposed FCDF framework consistently outperforms under illumination and viewpoint variations. Using the CUHK01 dataset, experiments are conducted at ranks 1, 10, 20, and 50; the proposed FCDF framework attained higher recognition rates at rank-1 and rank-50, whereas BRM2L (FTCNN) performs best at rank-10 and rank-20. At rank-1 on CUHK01, the proposed method obtained a higher recognition rate than NP ReId, RN+XQDA+RR+DFR,
PAN+ XQDA+RR+DFR, Inlier-set group modeling, Hessian and DPML with
significant margins of 3.62%, 3.81%, 5.46%, 3.62%, 26.18% and 12.29% respectively.
Similarly, on iLIDS-VID dataset, the proposed FCDF framework shows improved
results as compared to previous results reported by semi-supervised-ReID+XQDA, NP
ReId, RF+DDA+AC, MARPML, AFDA, and Inlier-set group modeling with margins
of 1.29%, 4.80%, 4.27%, 3.77%, 3.17%, and 5.80% respectively. The proposed
framework performs best at three ranks including 1, 10, and 50 whereas P2SNET
performs best at rank-20. The recognition rate is also examined using the
target/corresponding cluster (initially classified cluster) and number of neighbor
cluster(s) during probe matching. For this purpose, experiments are conducted in which
gallery-based probe matching is performed at ranks 1, 10, 20, and 50. The prime
concern of these experiments is to compare cluster-based probe matching with gallery-
based probe matching. Tables (4.8)-(4.10) show the comparison of both cluster and
gallery-based probe matching results at ranks 1, 10, 20, and 50. The results show that
cluster-based probe matching rates are higher than gallery-based probe matching rates
at all ranks, for instance at rank-1 by margins of 9.2%, 6.57%, and 5.95% on VIPeR,
CUHK01, and iLIDS-VID datasets, respectively. According to the results, recognition
rates are improved using cluster-based probe matching technique, hence it proves that
recognition rate relies on accurate classification of the corresponding cluster (CC) and
its nearest neighbor cluster(s) (NNC’s). All probe images are matched either with
classified CC or NNC’s (cluster with minimum distance from the corresponding
cluster). Notably, for gallery search optimization, no more than three clusters, including the CC, are considered for probe matching.
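A simplified sketch of this cluster-based probe matching is given below: a multi-class SVM first classifies the probe into a cluster, and at most three cluster candidates are searched with a simple distance standing in for the cross-bin histogram similarity measure. The synthetic gallery, cluster sizes, and names are illustrative assumptions rather than the actual FCDF pipeline.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
k = 6                                                     # consensus clusters, as in the text
centres = rng.normal(size=(k, 64)) * 5
gallery = {c: (centres[c] + rng.normal(size=(50, 64)),    # gallery features per cluster
               np.arange(c * 50, (c + 1) * 50))           # identity labels per cluster
           for c in range(k)}
X = np.vstack([gallery[c][0] for c in range(k)])
y = np.repeat(np.arange(k), 50)
clf = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)   # cluster classifier

def cluster_probe_match(probe, max_clusters=3):
    """Search the classified cluster first, then up to two neighbouring
    clusters, instead of scanning the whole gallery."""
    order = np.argsort(-clf.decision_function([probe]).ravel())[:max_clusters]
    best_id, best_d = None, np.inf
    for c in order:
        feats, ids = gallery[c]
        d = np.linalg.norm(feats - probe, axis=1)          # stand-in for the cross-bin histogram distance
        if d.min() < best_d:
            best_d, best_id = d.min(), ids[np.argmin(d)]
    return best_id

print(cluster_probe_match(gallery[2][0][0]))               # resolves to an identity in cluster 2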
Table 4.8: Cluster and gallery-based probe matching results of proposed FCDF
framework on VIPeR dataset
Probe matching   Rank-1   Rank-10   Rank-20   Rank-50
Proposed cluster-based 46.82 87.24 93.23 98.58
Proposed gallery-based 37.62 77.89 85.25 92.12
Table 4.9: Cluster and gallery-based probe matching results of proposed FCDF
framework on CUHK01 dataset
Probe matching   Rank-1   Rank-10   Rank-20   Rank-50
Proposed cluster-based 48.14 81.68 92.47 96.13
Proposed gallery-based 41.57 76.11 85.84 91.41
Table 4.10: Cluster and gallery-based probe matching results of proposed FCDF
framework on the iLIDS-VID dataset
Probe matching   Rank-1   Rank-10   Rank-20   Rank-50
Proposed cluster-based 40.67 78.50 86.92 93.87
Proposed gallery-based 34.72 73.91 81.17 88.26
Mathematically, the CC and NNC matching rates are calculated through Eq. (4.1) and Eq. (4.2):
CC = X_Ip / P      (4.1)
where X_Ip represents the number of probe images that are accurately matched in the target matched cluster and P denotes the total number of images in the probe set.
NNC = n_c / K      (4.2)
where n_c denotes the neighbor clusters over which the cluster search is applied to find the accurate image pair, and K is the number of consensus clusters. The cluster-wise matching results are presented
in Figure 4.5. The experimental results computed using the aforementioned equations
show that the success rate of accurate image matching using CC on VIPeR, CUHK01
and iLIDS-VID datasets is 42%, 47%, and 38%, respectively against the probe set.
However, once the CC is mismatched, the first NNC is considered. In this case, the success rate of accurate image matching on the VIPeR, CUHK01, and iLIDS-VID datasets is recorded as 60%, 64%, and 54%, respectively, against the probe set. Similarly, if both the CC and the first NNC are exhausted, the second NNC is considered; in this situation the success rate on the VIPeR, CUHK01, and iLIDS-VID datasets is observed as 83%, 89%, and 79%, respectively, against the probe set.
Figure 4.5: Performance comparison of CC and NNC based searching against all probe
images
Figure 4.6: Selected image pairs (column-wise) from VIPeR, CUHK01, and iLIDS-
VID datasets with challenging conditions such as a) Improper image
appearances, b) different background and foreground information, c)
drastic illumination changes, and d) pose variations including lights effects
It is worth mentioning that a few image pairs show false matching results (where the appearance of the same pair of images looks dissimilar) because of drastic changes in pose and viewpoint. For further clarity, the abovementioned challenges are shown in Figure 4.6. Similarly, the background and foreground information of images is sometimes barely differentiable, making it difficult for the proposed framework to re-identify the person accurately. These challenging conditions limit the recognition rate.
Six evaluation metrics including O-ACC, M-ACC, AUC, false positive rate (FPR), hit
rate/ sensitivity/ true positive rate (TPR) or recall, precision or positive predictive value
(PPV) are selected to estimate the performance of proposed J-LDFR framework for
gender classification. Moreover, training time, prediction time, and CW-ACC are also
calculated and presented in the empirical analysis. The mathematical representation of
these metrics is given in Table 4.11.
Table 4.11: Mathematical representation of the selected evaluation metrics
S. No | Metric | Equation
1 | O-ACC | Overall accuracy = (TP + TN) / (TP + TN + FP + FN) × 100
3 | AUC | AUC = ∫ (from ∞ to −∞) TPR(t) FPR′(t) dt
4 | FPR | FPR = FP / (TN + FP) × 100
5 | TPR | Hit rate / Sensitivity / Recall / TPR = TP / (TP + FN) × 100
6 | PPV | Precision / PPV = TP / (TP + FP) × 100
where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively. The three metrics O-ACC, M-ACC, and AUC are the key metrics for assessing the performance of a system; higher accuracy or AUC scores confirm the significance of the model with respect to prediction performance. The remaining two metrics are defined as follows: 1) precision is considered a measure of exactness or quality, and 2) recall is taken as a measure of completeness or quantity. For testing the proposed J-LDFR framework, a 10-fold cross-validation approach is applied, which is considered a standard procedure for the assessment of models. All the experiments are performed on an Intel Core i7-7700 CPU @ 3.60 GHz desktop computer with 16 GB RAM and an NVIDIA GeForce GTX 1070, using MATLAB 2018b with the pretrained deep CNN networks.
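For clarity, the Table 4.11 metrics can be computed from a binary confusion matrix as in the following sketch; scikit-learn is used here purely for illustration, and treating class 1 as the positive class is an assumption.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def gender_metrics(y_true, y_pred, y_score):
    """Compute the Table 4.11 protocol values from binary labels/scores
    (class 1 is treated as the positive class here)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "O-ACC": 100 * (tp + tn) / (tp + tn + fp + fn),
        "FPR":   100 * fp / (tn + fp),
        "TPR":   100 * tp / (tp + fn),
        "PPV":   100 * tp / (tp + fp),
        "AUC":   100 * roc_auc_score(y_true, y_score),
    }

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred  = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3])
print(gender_metrics(y_true, y_pred, y_score))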
In this research work, two datasets named pedestrian attribute recognition (PETA)
[263] and MIT [166] are selected to test the proposed J-LDFR framework. The
challenges of these datasets include diverse variations such as different camera settings,
variation in viewpoint, illumination changes, resolution, body deformation, and scene
(indoor/outdoor) effects. Figure 4.7 shows a few sample images from the selected
datasets as an example of abovementioned challenges of pedestrian gender
classification [263].
Figure 4.7: Sample pedestrian images selected from sub-datasets of the PETA dataset; each column represents the genders (male and female) from one sub-dataset, with the upper row showing male images and the lower row showing female images
4.3.2.1 PETA Dataset Presentation
The PETA dataset has annotated images of pedestrians taken from ten existing datasets. Combining these datasets yields a total of 19000 images, where the sub-dataset shares are iLIDS 2%, CUHK 24%, 3DPeS 5%, GRID 7%, CAVIAR4REID 6%, MIT 5%, PRID 6%, SARC3D 1%, VIPeR 7%, and Town Center 37%, with 477, 4563, 1012, 1275, 1220, 888, 1134, 200, 1264, and 6967 images, respectively. In this research work, 8472 images of each gender (male and female) are chosen from the total of 19000 images. This equal distribution of samples per class is investigated for gender prediction. Also, male and female annotations are used in terms of mixed views. The PETA dataset is described in Table 4.12 in terms of dataset name, class-wise images, total images, view, image size, and scenario.
The experiments are also done on the imbalanced MIT dataset separately and the
computed results are compared with existing studies. MIT dataset is described in Table
4.13 in terms of dataset name, class-wise images, total images, view, image size, and
scenario.
Detailed results of the proposed J-LDFR framework are computed on the selected datasets. For pedestrian gender classification, each feature vector is confined according to different feature subset dimensions (FSD), as described in Table 4.14. For experimentation, a trial of five testing FSs with different dimensions is selected from the entropy controlled low-level and deep CNN feature representations. In the subsequent tables, the results of the best classifier are written in bold.
Table 4.15: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on PETA dataset
Table 4.16: Performance evaluation of proposed J-LDFR method using different
classifiers and test FSs on MIT dataset
Moreover, the training and prediction time under all applied classification methods is
computed and shown in Figure 4.8, where (a) presents the training (sec) on PETA and
MIT datasets, and (b) presents prediction time (obs/sec) on PETA and MIT datasets,
respectively. Since the experimental results confirm that the feature combination from test FS number 4 produces better results than the other test FSs, this feature combination is used for the rest of the experiments in this work.
Figure 4.8: Proposed J-LDFR method (a) training and (b) prediction time using
different classifiers on PETA and MIT datasets
The main objective of this test is to estimate the performance of low-level features, and
their fusion using supposed FSs. The selected FSs of HOG and LOMO are supplied to
SVM classifier and the results under selected evaluation protocols are computed. It is
observed that C-SVM provides better results as compared to M-SVM and Q-SVM
when the integration of LOMO and HOG features is considered. In case of C-SVM, O-ACC, AUC, precision, and recall are recorded as 85.5%, 93%, 87.7%, and 84%, respectively, on the PETA dataset using the LOMO+HOG feature representation. The
performance of different low-level features and their integration results using cubic,
medium, and quadratic SVM on the PETA dataset is illustrated in Tables (4.17)-(4.19).
The medium and quadratic SVM classifiers produced O-ACC with a value of 84.9%,
and 84.3%, respectively when the integration of LOMO and HOG features is applied.
The low-level feature representations are also tested on the MIT dataset using the same classification methods, evaluation protocols, and settings. According to the LOMO and HOG feature fusion results on the MIT dataset, C-SVM again provides better results than the other classification methods, with O-ACC 79%, AUC 82%, precision 89.8%, and recall 80.3%. In addition, these evaluation protocols are also tested on the remaining two classification methods, M-SVM and Q-SVM, as illustrated in Tables (4.20)-
(4.22). Using entropy controlled low-level feature representations, comparison between
HOG and LOMO separately on datasets PETA and MIT is shown in Figure 4.9 and
Figure 4.10, respectively. The joint low-level feature representation result using LOMO
and HOG is presented subsequently in Figure 4.11 for both PETA and MIT datasets.
The joint low-level feature representation results as described in Tables (4.17)-(4.19)
show improved performance on the PETA dataset. In case of HOG, the improvement
of 2.7%, 3%, 3.3%, and 2.2% is recorded for O-ACC, AUC, precision, and recall
respectively, whereas in case of LOMO the improvement of 3.3%, 3%, 2.6%, and 3.5%
is recorded for O-ACC, AUC, precision, and recall respectively using C-SVM.
Table 4.20: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO 600 70.5 70 84.5 75.0
HOG 1000 78.0 82 89.5 80.2
LOMO+HOG 1600 79.0 82 89.8 80.3
Table 4.21: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO 600 68.7 70 98.3 68.7
HOG 1000 77.1 81 97.8 75.5
LOMO+HOG 1600 77.5 82 98.0 75.7
Table 4.22: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO 600 72.3 72 87.6 75.3
HOG 1000 75.9 80 88.8 78.3
LOMO+HOG 1600 77.4 82 90.3 79.1
Figure 4.10: Performance evaluation of proposed method on MIT dataset using entropy
controlled low-level feature representations, individually
Figure 4.11: Performance evaluation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled low-level feature representations, jointly
The improvement in computed results proves that the LOMO has the strength to deal
with viewpoint variations. Similarly, other classification methods also produced
exceptional results using the abovementioned joint low-level feature representation and
settings. Comparing Figure 4.11 with Figure 4.9 and Figure 4.10, it is obvious that the
joint low-level feature representation outperforms the single feature representations on both selected datasets, and thus contributes to the improved performance of pedestrian gender classification. Since the PETA dataset consists of several datasets in which diverse appearances with pose and illumination variations are major issues, deeply learned feature representations of CNN models are more suitable in these cases. Consequently, deep feature representations are taken along with low-level feature representations in this framework. The details of the deep feature representations, with their performance evaluation, are given in the following subsection.
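As an illustrative sketch only (the thesis extracts these features in MATLAB with pretrained networks), the 1000-dimensional FC-layer outputs of VGG19 and ResNet101 could be obtained with torchvision as follows; the image path is hypothetical and a recent torchvision version is assumed.

import torch
from PIL import Image
import torchvision.models as models
import torchvision.transforms as T

# Pretrained VGG19 and ResNet101; their final FC layers give 1000-d outputs.
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
resnet101 = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def fc_features(img_path):
    """Return the two 1000-dimensional FC-layer feature vectors of one image."""
    x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg19(x).squeeze(0), resnet101(x).squeeze(0)

# v, r = fc_features("pedestrian.png")   # hypothetical image path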
Considering the LOMO and HOG FSs, including their joint representation (LOMO+HOG), it is observed that gender prediction performance degrades under appearance variations such as pose and illumination. To handle these issues, an entropy controlled deep CNN feature representation is proposed. The impact of deep feature representations is also examined using two different deep CNN models. The results are computed for both deep feature representations, separately and jointly, considering the
selected datasets, evaluation protocols, and classification methods. In case of deep
feature representations separately, the performance of proposed framework on PETA
and MIT datasets is described in Tables (4.23)-(4.25) and (4.26)-(4.28) respectively.
Table 4.24: Performance evaluation of the proposed J-LDFR method on PETA dataset
using M-SVM classifiers with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
VGG19 1000 82.5 90 83.5 81.7
ResNet101 1000 84.1 92 84.1 84.2
VGG19+ResNet101 2000 85.4 93 86.1 84.1
Table 4.25: Performance evaluation of the proposed J-LDFR method on PETA dataset
using Q-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
VGG19 1000 82.3 90 82.5 82.2
ResNet101 1000 84.2 92 84.5 83.9
VGG19+ResNet101 2000 85.6 93 86.1 85.2
Using entropy controlled deep feature representations, the comparison between VGG19 and ResNet101 separately on the PETA and MIT datasets is shown in Figure 4.12 and Figure 4.13, respectively.
(VGG19+ResNet101) as described in Tables (4.23)-(4.25) showed improved results on
PETA dataset. In case of ResNet101, the improvement of 1.2%, 1.0%, 1.5%, and 1.2%
is recorded for O-ACC, AUC, precision, and recall respectively, whereas in case of
VGG19, improved results of 3.4%, 3%, 4.3%, and 3.3% are attained for O-ACC, AUC,
precision, and recall respectively. Similarly, joint deep feature representation
(VGG19+ResNet101) as described in Tables (4.26)-(4.28) indicated enhanced results
on MIT dataset. In case of ResNet101, the improvement of 2.1%, 1.8%, and 1.0% is
observed for O-ACC, precision, and recall respectively, whereas considering VGG19,
enhanced results 3.7%, 4%, 2.3%, and 2.7% are seen for O-ACC, AUC, precision, and
recall respectively. Also, other classification methods produced better results using the
aforementioned joint deep CNN feature representation and settings.
Table 4.26: Performance evaluation of proposed J-LDFR method on MIT dataset using
C-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
VGG19 1000 72.5 75 84.5 77.0
ResNet101 1000 74.3 79 85.0 78.7
VGG19+ResNet101 2000 76.2 79 86.8 79.7
Table 4.27: Performance evaluation of proposed J-LDFR method on MIT dataset using
M-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
VGG19 1000 72.5 76 96.5 72.1
ResNet101 1000 73.5 78 96.0 73.1
VGG19+ResNet101 2000 74.8 79 96.0 74.2
Table 4.28: Performance evaluation of proposed J-LDFR method on MIT dataset using
Q-SVM classifier with 10-fold cross-validation
Features type   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
VGG19 1000 73.2 76 87.5 76.3
ResNet101 1000 75.1 80 88.5 77.7
VGG19+ResNet101 2000 76.8 81 89.1 79.1
The computed results using deep CNN feature representation individually on PETA
and MIT datasets are shown in Figure 4.12 and Figure 4.13 whereas joint deep feature
representation (VGG19+ResNet101) based results are presented in Figure 4.14.
Comparing Figure 4.14 with Figure 4.12 and Figure 4.13, it is apparent that the joint deep feature representation outperforms the single deep feature representations on both selected datasets, and thus supports the improved performance of pedestrian gender classification.
Figure 4.13: Performance evaluation of proposed J-LDFR method on MIT dataset using
entropy controlled deep feature representations, separately
Figure 4.14: Performance estimation of proposed J-LDFR method on PETA and MIT
datasets using entropy controlled deep feature representations, jointly
Since deep feature representations provide detailed information about the input image, they have the potential to handle pose and illumination issues effectively, as reflected in the computed results. In addition, it is observed that deep CNN feature representations provide better results than traditional feature representations.
In the preceding sections, low-level and deep feature representations of input image for
gender classification are investigated either separately or jointly. The proposed
framework achieved superior results with joint feature representation. The computed
results show improvement when joint representation of either low-level features or deep
features is compared to individual feature representations. This further motivates investigating the joint versions of low-level and deep feature representations in different combinations. According to the calculated results, the joint feature representation
(HOG+LOMO+VGG19+ ResNet101) increased overall performance on PETA and
MIT datasets and achieved desirable results as presented in Tables (4.29)-(4.31) and
(4.32)-(4.34) respectively, which confirms that joint feature representation makes the
proposed framework more robust. The second best results are obtained using
HOG+LOMO+ResNet101 combinations.
Table 4.29: Performance evaluation of proposed J-LDFR method on PETA dataset
using C-SVM classifier with 10-fold cross-validation
Joint feature representation   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG 1600 85.5 93 87.7 84.0
LOMO+ResNet101 1600 87.8 95 88.7 87.1
LOMO+VGG19 1600 86.5 94 87.4 85.8
HOG+ResNet101 2000 87.8 95 88.5 87.2
HOG+VGG19 2000 86.6 94 87.3 86.1
VGG19+ResNet101 2000 85.7 93 86.3 85.4
HOG+LOMO+VGG19 2600 87.6 95 88.8 86.7
HOG+LOMO+ResNet101 2600 88.9 95 90.0 88.0
LOMO+VGG19+ResNet101 2600 88.1 95 89.0 87.4
HOG+VGG19+ResNet101 3000 88.4 95 88.9 88.1
HOG+LOMO+VGG19+ResNet101 3600 89.3 96 90.0 88.7
In comparison, for the full feature combination LOMO+HOG+VGG19+ResNet101, the proposed framework achieved 89.3% O-ACC, 96% AUC, 90% precision, and 88.7% recall on the PETA dataset. A similar investigation is carried out for the MIT dataset, where the results of the entropy controlled LOMO, HOG, VGG19, and ResNet101 feature representations and their fusion are computed in terms of O-ACC, AUC, precision, and recall.
The joint feature representation approach considering both low-level and deep CNN features significantly improves the results, which confirms the robustness of the proposed J-LDFR framework for gender classification. For instance, comparing Figure 4.15 and Figure 4.16, it is obvious that the deep CNN feature representation is effective when high-level information of the input image is required, whereas utilizing the low-level feature representation contributes to enhancing the performance. In addition, it is observed from the results that the high-level deep features of two different models preserve distinct clues of a gender image, and thus assist in enabling reliable classification of a sample by different classification methods.
Figure 4.15: Proposed evaluation results using individual feature representation and
JFR on PETA dataset
Table 4.32: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using cubic-SVM classifier with 10-fold cross-validation
Joint feature representation   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG 1600 79 82 89.8 80.3
LOMO+ResNet101 1600 76.4 81 87.5 79.5
LOMO+VGG19 1600 75.6 80 87.8 78.5
HOG+ResNet101 2000 79.3 84 90.0 81.3
HOG+VGG19 2000 81.0 85 90.6 82.8
VGG19+ResNet101 2000 76.2 79 86.8 79.7
HOG+LOMO+VGG19 2600 81.0 85 90.3 83.0
HOG+LOMO+ResNet101 2600 78.2 84 89.8 80.2
LOMO+VGG19+ResNet101 2600 76.9 81 87.1 80.3
HOG+VGG19+ResNet101 3000 79.4 85 89.3 81.8
HOG+LOMO+VGG19+ResNet101 3600 82.0 86 91.2 84.0
Table 4.33: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using medium-SVM classifier with 10-fold cross-validation
Joint feature representation   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG 1600 77.5 82 98.0 75.7
LOMO+ResNet101 1600 73.9 80 97.6 72.8
LOMO+VGG19 1600 73.1 79 97.8 72.2
HOG+ResNet101 2000 77.8 84 96.8 76.5
HOG+VGG19 2000 76.4 85 97.3 75.1
VGG19+ResNet101 2000 74.8 79 96.0 74.2
HOG+LOMO+VGG19 2600 76.1 85 98.0 74.6
HOG+LOMO+ResNet101 2600 76.4 84 97.5 75.0
LOMO+VGG19+ResNet101 2600 74.8 81 96.3 74.1
HOG+VGG19+ResNet101 3000 77.1 85 96.1 76.2
HOG+LOMO+VGG19+ResNet101 3600 78.0 85 97.8 75.9
Table 4.34: Performance evaluation of the proposed J-LDFR method on the MIT
dataset using quadratic-SVM classifier with 10-fold cross-validation
Joint feature representation   Selected FSD   O-ACC (%)   AUC (%)   PPV (%)   TPR (%)
LOMO+HOG 1600 77.4 82 90.3 79.1
LOMO+ResNet101 1600 77.0 82 89.6 79.1
LOMO+VGG19 1600 76.1 80 88.5 78.7
HOG+ResNet101 2000 79.4 84 90.8 80.9
HOG+VGG19 2000 80.7 85 91.5 82.1
VGG19+ResNet101 2000 76.8 81 89.1 79.1
HOG+LOMO+VGG19 2600 80.4 85 91.1 81.8
HOG+LOMO+ResNet101 2600 80.2 85 92.1 81.1
LOMO+VGG19+ResNet101 2600 77.1 82 88.2 80.0
HOG+VGG19+ResNet101 3000 80.3 85 90.5 82.1
HOG+LOMO+VGG19+ResNet101 3600 81.4 86 91.0 82.7
The proposed framework also calculates CW-ACC to show the robustness of joint low-
level and deep CNN feature representation.
With 10-fold cross-validation, the confusion matrices under C-SVM are shown in Table 4.35 and Table 4.36 for the PETA and MIT datasets, respectively. The male and female classification accuracies are 90% and 89% on the PETA dataset, whereas they are 92% and 62% on the MIT dataset.
Figure 4.16: Proposed evaluation results using individual feature representation and
JFR on the MIT dataset
Table 4.35: Confusion matrix using C-SVM on PETA dataset
True classes \ Predicted classes | Male | Female
Male | 90% | 10%
Female | 11% | 89%
Table 4.36: Confusion matrix using C-SVM on MIT dataset
True classes \ Predicted classes | Male | Female
Male | 92% | 8%
Female | 38% | 62%
Figure 4.18: AUC on MIT dataset
4.3.3.4 Comparison with State-of-the-art Methods
To validate the usefulness of proposed framework, the results are compared with
various state-of-the-art pedestrian gender classification approaches such as HOG, Mini-
CNN, AlexNet-CNN [47], Hierarchical ELM [264], GoogleNet [201], ResNet50 [217],
SSAE [46], W-HOG, DFL, HDFL [45] and DFN, DHFFN [15]. In this regard, only
full-body appearance based methods are considered for pedestrian gender classification
using PETA dataset and AUC evaluation protocol. To the best of our knowledge,
DHFFN [15] is the most recent method reported for pedestrian gender classification on
the PETA dataset. The existing and computed results on the PETA dataset are
illustrated in Table 4.37.
Table 4.37: Performance comparison with existing methods using PETA dataset
In addition, the formation of robust joint feature representation is made possible by the
successful use of entropy controlled FSs. Moreover, the computed results on MIT
dataset are compared with previous results as illustrated in Table 4.38. The comparison
shows that the proposed J-LDFR framework, using C-SVM under the same settings, outperforms the existing methods. The existing methods such as HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, HOG-LBP-HSV [169], PBGR [166], BIO-PCA, BIO-OLPP, BIO-LSDA, BIO-MFA, BIF-PCA [60], CNN [61], CNN-e [183], and PiHOG-LHSV [168] are chosen for
comparison. Furthermore, CNN and CaffeNet with the upper (U), middle (M), and
lower (L) body patches based results are also taken for comparison [71]. The only
appearance based methods are chosen because they were reported in literature for
pedestrian gender classification using MIT dataset. Overall, comparing with existing
methods such as HOG, LBP, HSV, LBP-HSV, HOG-HSV, HOG-LBP, HOG-LBP-
HSV, PBGR, BIO-PCA, BIO-OLPP, BIO-LSDA, BIO-MFA, BIF-PCA, CNN, CNN-
e, PiHOG-LHSV, CNN-1, CNN-2, CNN-3, and CaffeNet, the proposed framework
outperformed all existing methods in terms of O-ACC with improvement of 3.1%,
5.9%, 10.7%, 4.4%, 1.1%, 2.2%, 1.9%, 7%, 2.8%, 4.9%, 3.8%, 6.8%, 1.4%, 1.6%,
0.5%, 9.7%, 1.2%, 0.8%, 0.7%, and 1.1% respectively. The comparison of existing and
computed results is shown in Figure 4.20. J-LDFR framework attained 77.3% M-ACC,
which is better among seven available existing methods including HOG, LBP, HSV,
LBP-HSV, HOG-HSV, HOG-LBP and HOG-LBP-HSV with improvement of 1.4%,
8.8%, 12.5%, 3.6%, 2%, 0.7% and 0.6% respectively.
Figure 4.19: Comparison of existing and proposed results in terms of AUC on PETA
dataset
Table 4.38: Performance comparison with existing methods using MIT dataset, dash
(-) represents that no reported result is available
Methods Year O-ACC (%) M-ACC (%)
PBGR [166] 2008 75.0 -
PiHOG-LHSV [168] 2009 72.3 -
BIO-PCA [60] 2009 79.2 -
BIO-OLPP [60] 2009 77.1 -
BIO-LSDA [60] 2009 78.2 -
BIO-MFA [60] 2009 75.2 -
BIF-PCA [60] 2009 80.6 -
CNN [61] 2013 80.4 -
HOG [169] 2015 78.9 75.9
LBP [169] 2015 76.1 68.5
HSV [169] 2015 71.3 64.8
LBP-HSV [169] 2015 77.6 73.7
HOG-HSV [169] 2015 80.9 75.3
HOG-LBP [169] 2015 79.8 76.6
HOG-LBP-HSV [169] 2015 80.1 76.7
CNN-e [183] 2017 81.5 -
U+M+L (CNN-1) [71] 2019 80.8 -
U+M+L (CNN-2) [71] 2019 81.2 -
U+M+L (CNN-3) [71] 2019 81.3 -
U+M+L (CaffeNet) [71] 2019 80.9 -
Proposed J-LDFR 82.0 77.3
Figure 4.20: Comparison of existing and proposed results in terms of overall accuracy
on MIT dataset
In addition, the overall results of the proposed J-LDFR framework are listed in Table 4.39. The better outcome of the J-LDFR framework is because of the joint feature representation comprising illumination and rotation invariant HOG features, LOMO features, and deep feature representations of two CNN models. This further confirms that deep CNN feature representations are complementary to low-level feature representations, hence contributing to the improved results.
Table 4.39: Proposed J-LDFR method results on PETA and MIT datasets
Datasets | O-ACC (%) | M-ACC (%) | AUC (%) | CW-ACC Male (%) | CW-ACC Female (%)
PETA | 89.3 | 89.5 | 96 | 90 | 89
MIT | 82.0 | 77.3 | 86 | 92 | 62
The proposed J-LDFR framework achieved 89.3% O-ACC, 89.5% M-ACC, and 96%
AUC on the PETA dataset. The results on the MIT dataset are 82% O-ACC, 77.3% M-ACC,
and 86% AUC. The proposed J-LDFR method also obtained noteworthy results in terms
of CW-ACC as 90% and 89% on PETA dataset for male and female classes
respectively. Similarly, CW-ACC is 92% and 62% on MIT dataset for male and female
classes respectively. To the best of our knowledge, J-LDFR framework outcomes are
better than existing full-body appearance based pedestrian gender classification results
in terms of O-ACC, M-ACC, AUC, and CW-ACC on large scale PETA and small scale
MIT datasets.
Out of many evaluation protocols, six are taken into account for evaluating the proposed
method, which are already described in section 4.3.1 and mathematically represented
in Table 4.11. Few other evaluation protocols such as F1-score, negative predictive
value (NPV), specificity/true negative rate (TNR), balanced accuracy (B-ACC), and
false negative rate (FNR) are also selected to estimate the performance of proposed
PGC-FSDTF approach for gender prediction. Furthermore, training time and CW-ACC
are also calculated and presented in experiments. The mathematical representation of
F1-score, NPV, TNR, B-ACC, FNR in equation form is provided in Table 4.40.
Table 4.40: Mathematical representation of the additional evaluation metrics
S. No | Metric | Equation
1 | F1 | F-measure/F1-score = (2 × precision × recall) / (precision + recall)
2 | NPV | NPV = TN / (TN + FN) × 100
3 | TNR | TNR/Specificity = TN / (TN + FP) × 100
4 | B-ACC | Balanced accuracy = ((TPR + TNR) / 2) × 100
5 | FNR | FNR = FN / (FN + TP) × 100
Evaluation protocols like AUC and accuracies (overall, mean, balanced, and class-wise) are key metrics to assess the performance of the proposed approach; higher accuracy or AUC scores confirm the significance of the model with respect to prediction performance. In the present research, balanced accuracy is a particularly suitable metric for the classification tasks because it is robust against imbalanced data. To the best of our knowledge, B-ACC is used for the first time in this work for gender classification. It is a powerful metric when the class-wise data distribution is imbalanced [265]; here, B-ACC is reported for both imbalanced and balanced data. In all investigations, the following procedure is followed: first, the entropy and PCA based selected FSs are taken and then serially fused to produce the FFV; secondly, after FSF, multiple classifiers (as discussed in Section 3.4.5) are trained on the proposed FFV. All the classifiers are executed with default settings.
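A short sketch of how B-ACC behaves on imbalanced data is given below, using the MIT-IB class counts purely as an example; the predictions and the positive-class convention are synthetic assumptions.

import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Synthetic predictions over the MIT-IB class counts (600 male = 1, 288 female = 0)
y_true = np.array([1] * 600 + [0] * 288)
y_pred = np.array([1] * 560 + [0] * 40 + [1] * 100 + [0] * 188)

# B-ACC averages TPR and TNR, so it is not dominated by the majority class
print("B-ACC:", balanced_accuracy_score(y_true, y_pred))
print("F1   :", f1_score(y_true, y_pred))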
For evaluation, a 10-fold cross-validation approach is applied, which partitions each dataset into ten subsamples, where a single subsample is used for testing and the remaining ones for training.
Consequently, in each fold, new data is randomly selected for training and testing,
whereas different accuracies are obtained using the classifiers. Finally, AUC, overall,
average, and balanced accuracies of the selected classifier are calculated including other
abovementioned evaluation protocols. Moreover, the performance of multiple
classifiers on different datasets is tabulated and discussed in the following subsection.
All the experiments are performed on Intel ® Core i7-7700 CPU @ 3.60 GHz desktop
computer, 16-GB RAM and NVIDIA GeForce GTX1070 having MATLAB2018b
including deep CNN trained networks.
In this work, mainly five pedestrian analysis datasets MIT [166], PKU-Reid [225],
PETA [263], VIPeR [97], and cross-dataset [45] are selected to test the proposed
approach PGC-FSDTF for pedestrian gender classification. For gender prediction,
analyzing pedestrian full-body appearance is more challenging because of variations in
viewpoint angle changes from 0° to 315° , different camera settings, low resolution,
pose changes, and light effects.
Figure 4.21: Gender wise pair of sample images collected from MIT, PKU-Reid, PETA,
VIPeR, and cross-datasets where column represents the gender (male and
female) from each dataset, upper row shows images of male, and lower
row shows images of female
Figure 4.21 shows a few sample images from the selected datasets where mixed-views
of the pedestrian show these challenges clearly. A detailed explanation of selected
datasets is given in subsections.
All the augmented datasets are also taken for experiments while comparing state-of-
the-art methods with MIT-IB dataset. Table 4.41 shows the statistics of MIT-IB and
augmented MIT datasets. Annotation for male and female classes is used only in terms
of mixed (front and back) views in all experiments. As discussed above, the MIT dataset has class-imbalance and SSS issues. Moreover, this dataset is challenging due to the frontal and back views of gender images combined with environmental effects such as low resolution and lighting conditions, which make the gender classification task harder. Figure 4.22 shows a few images from the MIT/MIT-IB and augmented MIT
datasets as an example so that the diversity in the appearance of pedestrians can be
observed from these sample images.
Figure 4.22: Sample images of pedestrian with back and front views (a) MIT/MIT-IB
dataset (b) augmented MIT datasets. First two rows represent male
images, and next two rows represent female images
Table 4.41: Statistics of MIT/MIT-IB dataset samples based imbalanced and augmented balanced small sample datasets for pedestrian gender classification
Dataset | #Males | #Females | Total images | View | Image size | Scenario
MIT-IB | 600 | 288 | 888 | Mixed | 128×64 | Outdoor
MIT-BROS-1 | 864 | 864 | 1728 | Mixed | 128×64 | Outdoor
MIT-BROS-2 | 864 | 864 | 1728 | Mixed | 128×64 | Outdoor
MIT-BROS-3 | 600 | 600 | 1200 | Mixed | 128×64 | Outdoor
For the PKU-Reid collection, the per-camera female images are 352, giving a total of 352×2 = 704 under the two cameras. Based on these collections, the images are categorized into a male class with a total of 1120 images and a female class with a total of 704 images. It is worth mentioning that the collection represented in this work as the PKU-Reid-IB dataset is equally suitable for pedestrian gender classification after this categorization. The collection contains mixed (front, back, and side) views of each gender image. The class-wise variation in the collected samples shows that the prepared PKU-Reid-IB dataset is imbalanced, because the number of female class samples is approximately 37% lower than the number of male class samples. To balance the data, a random sampling procedure is applied
to equalize class-wise data distribution. For this purpose, a random oversampling
procedure is implemented for data augmentation of PKU-Reid dataset. Subsequently,
each class of gender is customized and two additional balanced datasets are prepared
for gender prediction: 1) balanced with random oversampling-1 (PKU-Reid-BROS-1)
and balanced with random oversampling-2 (PKU-Reid-BROS-2) to perceive
robustness of the proposed approach on balanced datasets. Only the female class of
PKU-Reid dataset is augmented and data augmentation operations are applied to
balance female class as equal to male class. For this purpose, another dataset named
(PKU-Reid-BROS-3) is created to examine the proposed approach. The only difference
among these augmented datasets (PKU-Reid-BROS-1, PKU-Reid-BROS-2, and PKU-
Reid-BROS-3) is a procedure to select samples from single/both classes of PKU-Reid
dataset. Then, data augmentation operations are carried out with suggested strategies
(as discussed in chapter 3, Section 3.4.1.1). Table 4.42 shows the statistics of PKU-
Reid-IB and augmented PKU-Reid datasets.
Table 4.42: Statistics of the imbalanced PKU-Reid dataset and the augmented balanced small sample datasets for pedestrian gender classification

Dataset           #Males  #Females  Total images  View   Image size  Scenario
PKU-Reid-IB       1120    704       1824          Mixed  128×64      Indoor
PKU-Reid-BROS-1   1300    1300      2600          Mixed  128×64      Indoor
PKU-Reid-BROS-2   1300    1300      2600          Mixed  128×64      Indoor
PKU-Reid-BROS-3   1120    1120      2240          Mixed  128×64      Indoor
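As an illustration of the balancing step described above, the following is a minimal Python sketch (an assumption for clarity, not the thesis implementation) of random oversampling, where minority-class images are re-drawn with replacement until both classes contain the same number of samples.

import numpy as np

def random_oversample(minority_imgs, target_count, seed=0):
    # Return a list of length target_count built from the minority-class images.
    rng = np.random.default_rng(seed)
    extra = target_count - len(minority_imgs)
    picks = rng.choice(len(minority_imgs), size=extra, replace=True)
    return list(minority_imgs) + [minority_imgs[i] for i in picks]

# Example for PKU-Reid-IB: grow the 704 female images to 1120 to match the male class.
# balanced_females = random_oversample(female_imgs, target_count=1120)

In the proposed pipeline the selected samples are further passed through the data augmentation operations of Chapter 3; the sketch above covers only the sampling part.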
To validate the proposed approach, these augmented datasets are also used in the experiments, and their performance is compared with that of the imbalanced PKU-Reid-IB dataset. Figure 4.23 shows a few sample images from the PKU-Reid-IB and augmented PKU-Reid datasets so that the pedestrian appearances under viewpoint angle changes can be observed.
Figure 4.23: Sample images of pedestrians: (a) the first and second rows show male and female samples, respectively, from the PKU-Reid-IB dataset; (b) the first and second rows show male and female samples, respectively, from the augmented datasets; and (c) male (top) and female (bottom) images showing pedestrians at viewpoint angles from 0° to 315°, i.e., eight directions in total
The PETA dataset has annotated pedestrian images taken from ten existing datasets. It contains 19000 images in total and is widely used as a large-scale dataset for pedestrian attribute recognition. Its sub-datasets contribute 477 images from iLIDS, 4563 from CUHK, 1012 from 3DPeS, 1275 from GRID, 1220 from CAVIAR4REID, 888 from MIT, 1134 from PRID, 200 from SARC3D, 1264 from VIPeR, and 6967 from Town Center, corresponding to approximately 2%, 24%, 5%, 7%, 6%, 5%, 6%, 1%, 7%, and 37% of the total, respectively. The gender images in these datasets vary in terms of camera angle, viewpoint, illumination, resolution, body deformation, and scene (indoor/outdoor) [263]. Some sample images from these datasets are shown in Figure 4.24. In the literature, many studies have conducted experiments on the PETA dataset for pedestrian gender classification, using at most 8472 images per class for the male and female classes [46, 48, 266].
Figure 4.24: Sample pedestrian images; each column shows one gender pair (male and female) selected from a customized SSS PETA dataset, with the upper row showing male images and the lower row showing female images
Table 4.43: Statistics of the customized PETA-SSS-1 and PETA-SSS-2 datasets, derived from PETA samples, for pedestrian gender classification
The VIPeR dataset is also considered for the experiments. It is a sub-dataset of PETA and is widely used for person ReID [18, 215, 267]. This dataset was also examined in [168] for pedestrian gender classification, where only frontal views of pedestrians were categorized, yielding 292 male and 291 female samples. The original VIPeR dataset comprises 632 pedestrians, each with two different views captured by two disjoint cameras, CAM1 and CAM2. The views of these cameras are challenging because of viewpoint angle variations from 45° to 180°, different camera settings, pose changes, and lighting effects, as shown in Figure 4.24. These views are normalized to 128×64 pixels. In this work, a new dataset named VIPeR-SSS is created by inspecting the CAM1 and CAM2 images. For this purpose, the VIPeR images are annotated and categorized into male and female classes for pedestrian gender classification, resulting in 720 male and 544 female images. To maintain a class-wise equal distribution of data, 544 images are randomly chosen from the male class. Thus, an equal distribution of male and female classes with 544 images each is used to test the performance of the proposed method. It is also worth noting that the prepared VIPeR-SSS dataset is suitable as another SSS dataset for pedestrian gender classification experiments. This dataset also contains mixed (frontal, back, and side) views of each gender image. The statistics of the customized VIPeR-SSS dataset are shown in Table 4.44.
Table 4.44: Statistics of the customized VIPeR-SSS dataset, derived from VIPeR samples, for pedestrian gender classification

Dataset     #Males  #Females  Total images  View   Image size  Scenario
VIPeR-SSS   544     544       1088          Mixed  128×48      Outdoor
mixed views of gender images selected from the same sub-datasets of the PETA dataset. The statistics of the cross-datasets are described in Table 4.45.
Figure 4.25: Gender-wise sample pedestrian images; each column shows one gender pair (male and female) selected from sub-datasets of the PETA dataset, with the upper row showing two male images and the lower row showing two female images
Table 4.45: Statistics of cross-datasets for pedestrian gender classification
Figure 4.26: An overview of the selected, imbalanced, augmented balanced, and
customized datasets with the class-wise (male and female) distribution of
samples for pedestrian gender classification (a) imbalanced and
augmented balanced SSS datasets and (b) customized balanced SSS
datasets
Using the selected features, multiple classifiers, L-SVM, Q-SVM, C-SVM, and M-SVM, are trained with the standard settings and configurations described in Chapter 3. In addition, different evaluation protocols, for example PPV, TPR, F1, NPV, TNR, FPR, FNR, O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, are used to assess the robustness of the proposed approach. The results are presented in multiple tables containing the gender classification rates under these common evaluation protocols. The objective of these investigations is to demonstrate higher classification rates and the robustness of the proposed approach for accurate gender prediction. As a novel research contribution, results are also provided on several augmented and customized SSS datasets, namely MIT-BROS-1, MIT-BROS-2, MIT-BROS-3, PKU-Reid-IB, PKU-Reid-BROS-1, PKU-Reid-BROS-2, PKU-Reid-BROS-3, PETA-SSS-1, PETA-SSS-2, VIPeR-SSS, and cross-dataset-1, with a balanced distribution of data. In this work, all experiments are conducted using entropy and PCA based selected feature sets (vectors), namely PHOG_FV, HSV-Hist_FV, MaxDeep_FV, and AvgDeep_FV, and their fusion. The fused feature vector (FFV) is provided to the selected classifiers, and the results are analyzed using the evaluation protocols. Comparisons with state-of-the-art methods are also made on the abovementioned datasets (MIT/MIT-IB and cross-dataset). In the subsequent tables, the best computed results are written in bold.
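For readers unfamiliar with the classifier shorthand, the following is a hedged Python sketch of the four SVM variants; the mapping of L-, Q-, C-, and M-SVM to linear, quadratic, cubic, and medium-Gaussian kernels follows the usual MATLAB Classification Learner naming and is stated here as an assumption, as are the default hyperparameters.

from sklearn.svm import SVC

def make_classifiers():
    return {
        "L-SVM": SVC(kernel="linear"),                         # linear kernel
        "Q-SVM": SVC(kernel="poly", degree=2, gamma="scale"),  # quadratic kernel
        "C-SVM": SVC(kernel="poly", degree=3, gamma="scale"),  # cubic kernel
        "M-SVM": SVC(kernel="rbf", gamma="scale"),             # medium Gaussian kernel
    }

# The FFV supplied to each classifier is assumed to be the serial (column-wise)
# concatenation of the selected subsets, e.g.
# ffv = np.hstack([phog_fv, hsv_hist_fv, maxdeep_fv, avgdeep_fv])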
In this section, the proposed PGC-FSDTF approach classifies each data instance into the male or female class on the MIT-IB (imbalanced), MIT-BROS-1 (balanced), MIT-BROS-2 (balanced), and MIT-BROS-3 (balanced) datasets using multiple classifiers. Two kinds of experiments are performed on each dataset, and the computed outcomes are evaluated using standard performance evaluation metrics. These experiments are performed to 1) observe the performance of the entropy and PCA based feature sets on imbalanced and balanced SSS datasets, and 2) validate the ability of the proposed approach to reduce false positives and to improve the accuracies and AUC. The results obtained on the MIT-IB dataset are then compared with state-of-the-art approaches from the literature. Existing studies use the MIT dataset for gender prediction with 600 male and 288 female images; in this study, the MIT dataset is designated as imbalanced and renamed MIT-IB to highlight its imbalanced nature. The following subsections discuss the results on the abovementioned imbalanced dataset and the three balanced datasets.
a) Performance Evaluation on MIT-IB Dataset: Results are calculated on the imbalanced, SSS MIT-IB dataset using the proposed PGC-FSDTF method with the settings described above. The results computed with the L-SVM, Q-SVM, C-SVM, and M-SVM classifiers under the evaluation protocols PPV, TPR, F1, NPV, TNR, FPR, and FNR are shown in Table 4.46 and Table 4.47. The results illustrate the role of both feature selection methods, i.e., the entropy and PCA based feature sets, on the MIT-IB dataset. In the case of entropy based selected features, Q-SVM and M-SVM prove to be better classifiers for accurate gender prediction than the other applied classifiers. The empirical evaluation reveals that the proposed approach with Q-SVM produces better results in terms of F1, TNR, and FPR with 61.9%, 80.2%, and 19.7%, respectively, while M-SVM outperforms the other classifiers in terms of TPR, NPV, and FNR with 83.9%, 96.3%, and 16.1%, respectively. Among the applied classifiers, the C-SVM classifier presents the best PPV, with a value of 55.1%. In the case of PCA based selected features, M-SVM shows better performance than the other selected classifiers. According to the computed results, the M-SVM classifier exhibits significant performance in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR with 79.8%, 97.0%, 87.6%, 98.8%, 91.1%, 8.9%, and 2.9%, respectively. The proposed approach is also tested with the other selected classifiers; L-SVM, Q-SVM, and C-SVM show acceptable results but produce lower values of PPV, with 20.4%, 12.5%, and 22.2%, respectively, as compared to the M-SVM classifier. Both the entropy and PCA based outcomes in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR are shown in Table 4.46.
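The per-class protocols used throughout these tables can be read directly off a binary confusion matrix; a minimal Python sketch is given below, under the assumption that the male class is treated as the positive class and that class-wise accuracy equals the recall of the corresponding class (the thesis's exact averaging convention may differ).

from sklearn.metrics import confusion_matrix

def protocol_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    ppv = tp / (tp + fp)              # precision
    tpr = tp / (tp + fn)              # recall / sensitivity
    f1 = 2 * ppv * tpr / (ppv + tpr)
    npv = tn / (tn + fn)
    tnr = tn / (tn + fp)              # specificity
    return {"PPV": ppv, "TPR": tpr, "F1": f1, "NPV": npv, "TNR": tnr,
            "FPR": 1 - tnr, "FNR": 1 - tpr,
            "CW-ACC male": tpr, "CW-ACC female": tnr}   # assumed class-wise accuracies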
Results are also recorded for the different accuracies (O-ACC, M-ACC, and B-ACC), the AUC, and the time using the entropy and PCA based FSs, as presented in Table 4.47. Overall, Q-SVM and M-SVM outperform the other classifiers, L-SVM and C-SVM. In the case of entropy based FSs, the Q-SVM classifier yields the best O-ACC, M-ACC, and AUC among the applied supervised methods, with 78.4%, 72.1%, and 81%, respectively. Moreover, higher outcomes are observed on the M-SVM classifier in terms of B-ACC with 80.5% and CW-ACC male with 96%, whereas Q-SVM produces the best CW-ACC female with 55%. Among the selected supervised methods, L-SVM requires the least time, 1.99 sec.
Likewise, results are obtained using the PCA based FSs, where M-SVM outperforms the other selected classifiers. According to the computed results, the M-SVM classifier shows better performance in terms of O-ACC, M-ACC, B-ACC, AUC, time, CW-ACC male, and CW-ACC female with 92.7%, 89.4%, 94.1%, 93%, 1.91 sec, 99%, and 80%, respectively. The corresponding results of the other selected supervised methods are also computed to examine the performance of the proposed approach; the better performance under the respective evaluation protocols strengthens the case for the proposed approach with PCA based FSs. The best entropy and PCA based results of the two leading classifiers are presented in Figure 4.27, where the proposed method obtains a higher O-ACC than the other classifiers. This comparison provides an insight into the different accuracies and the AUC for both feature selection methods.
Figure 4.27: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on imbalanced MIT-IB dataset
According to these results, the PCA based FSs show improved results compared with the entropy based FSs on the imbalanced, SSS MIT-IB dataset. For example, with the PCA based FSs, the M-SVM results exceed the entropy based results of the Q-SVM classifier by 14.3%, 17.4%, 17.6%, 12%, 9%, and 25% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively. Other notable improvements of 10.2%, 11.0%, 14.9%, 10%, 7%, and 12% are recorded for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively, when the Q-SVM results with PCA based FSs are compared with those based on entropy. These improvements verify that PCA based feature selection followed by serial fusion boosts the performance compared with the entropy based FSs. Similarly, the corresponding classification methods also produce better results using the same FSs and settings. Moreover, it is observed that the Q-SVM and M-SVM classifiers exhibit a higher AUC than the other supervised methods on the entropy and PCA based FSs, respectively; the M-SVM classifier with PCA based FSs attains the highest AUC of 93%, as shown in Figure 4.28. An important factor is the B-ACC, with values of 80.5% for M-SVM (entropy based) and 94.1% for M-SVM (PCA based), which is acceptable even for an imbalanced data distribution. The MIT-IB dataset is imbalanced and also challenging due to variations in pose, illumination, low contrast, etc.; therefore, the better performance under the B-ACC metric strengthens the case for the proposed approach.
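Because the PCA based selection followed by serial fusion recurs in all of the following comparisons, a minimal Python sketch of that step is given here; the retained dimensionality (a 98% explained-variance threshold) and the use of one PCA per feature set are assumptions for illustration rather than the thesis settings.

import numpy as np
from sklearn.decomposition import PCA

def pca_select_and_fuse(train_feature_sets, test_feature_sets, n_components=0.98):
    # Each argument is a list of (n_samples, dim_i) arrays, one per feature set
    # (e.g. PHOG_FV, HSV-Hist_FV, MaxDeep_FV, AvgDeep_FV).
    fused_train, fused_test = [], []
    for f_tr, f_te in zip(train_feature_sets, test_feature_sets):
        pca = PCA(n_components=n_components).fit(f_tr)   # fit on training data only
        fused_train.append(pca.transform(f_tr))
        fused_test.append(pca.transform(f_te))
    # serial fusion = column-wise concatenation of the reduced feature sets
    return np.hstack(fused_train), np.hstack(fused_test)

In the entropy based variant only the selection criterion changes; the serial fusion step is the same.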
Figure 4.28: Best AUC for males and females on imbalanced MIT-IB dataset using
PCA based selected features set
84%, and 82%, respectively, using the entropy based selected FSs. However, in the case of the PCA based results, the M-SVM classifier shows better performance than the other applied classifiers and the entropy based results. For example, M-SVM achieves O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 89.2%, 89.2%, 89.2%, 96%, 89%, and 90%, respectively, with the PCA based selected FSs.
It is evident from the computed results that the combination of the proposed features leads to higher classification results when used with the M-SVM classifier and PCA based selected features. The combination of traditional and deep features has better discrimination potential for classifying the gender image. Furthermore, a large number of features is not always helpful; therefore, selecting the important features with PCA outperforms entropy based selection for pedestrian gender classification. The other applied classifiers also show adequate performance and are validated against the entropy and PCA based selected FSs, as presented in Table 4.49. For comparison, Figure 4.29 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, and CW-ACC female) of the multiple classifiers. According to Figure 4.29, the PCA based FSs exhibit better results than the entropy based FSs on the balanced, SSS MIT-BROS-1 dataset. For example, the PCA based M-SVM results exceed the entropy based results of the C-SVM classifier by 6.6%, 6.6%, 6.6%, 5%, 5%, and 8% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively, verifying that the PCA based selected FSs combination enhances the performance compared with the entropy based FSs.
Table 4.49: Performance of the proposed PGC-FSDTF method on the balanced MIT-BROS-1 dataset (male = 864 and female = 864 images) using different accuracies, AUC, and time

Method                            Classifier  O-ACC (%)  M-ACC (%)  B-ACC (%)  AUC (%)  Time (sec)  CW-ACC M (%)  CW-ACC F (%)
Entropy based features selection  L-SVM       78.1       78.1       78.3       86       4.31        82            74
                                  Q-SVM       81.3       81.3       81.4       90       4.95        84            79
                                  C-SVM       82.6       82.6       82.6       91       4.92        84            82
                                  M-SVM       80.4       80.4       80.5       88       4.21        83            78
Similarly, the other selected supervised methods also attain reliable results using the same FSs and settings. Moreover, it can be noted that the C-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs, respectively. The M-SVM classifier with PCA based FSs achieves the highest AUC of 96%, as depicted in Figure 4.30. The corresponding AUCs of the other selected classifiers are also computed and tabulated in Table 4.49.
Figure 4.29: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced MIT-BROS-1 dataset
Figure 4.30: Best AUC for males and females on balanced MIT-BROS-1 dataset using
PCA based selected FSs
Table 4.50: Performance of the proposed PGC-FSDTF method on the balanced MIT-BROS-2 dataset (male = 864 and female = 864 images) using different evaluation protocols

Method                           Classifier  PPV (%)  TPR (%)  F1 (%)  NPV (%)  TNR (%)  FPR (%)  FNR (%)
Entropy based selected features  L-SVM       73.8     80.4     77.0    82.0     75.8     24.1     19.5
                                 Q-SVM       78.0     84.1     80.9    85.3     79.5     20.4     15.8
                                 C-SVM       80.2     83.3     81.7    84.0     80.9     19.0     16.6
                                 M-SVM       77.0     82.6     79.7    83.7     78.5     21.4     17.3
When the PCA based selected features are used, the M-SVM classifier gives superior classification results compared with the other selected supervised methods and the entropy based results. M-SVM attains O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 90.3%, 90.3%, 90.4%, 97%, 91%, and 90%, respectively, with the PCA based selected FSs. The other classification methods, L-SVM, Q-SVM, and C-SVM, produce lower results than M-SVM. From Table 4.50 and Table 4.51, it is apparent that the proposed features assist in better classification, specifically when the M-SVM classifier is trained with the PCA based selected features. In this study, the combination of selected deep and traditional features provides the FFV with distinct features for accurate gender prediction, which is the main reason for the strong results. The other applied classifiers also display appropriate results and are validated against the entropy and PCA based selected FSs, as presented in Table 4.50 and Table 4.51. Hence, the comparison of the entropy and PCA based results of the two leading classifiers confirms that the proposed approach achieves a higher O-ACC. As presented in Figure 4.31, the PCA based FSs exhibit better results than the entropy based FSs on the balanced, SSS MIT-BROS-2 dataset. This means that the proposed approach delivers reliable results on both augmented datasets, where the 1vs1 strategy is applied for MIT-BROS-1 and the 1vs4 strategy for MIT-BROS-2 to include augmented images for an equal distribution of data. For example, when the entropy based C-SVM results are compared with the PCA based M-SVM results, the M-SVM outcomes are higher by 8.2%, 8.2%, 8.3%, 7%, 7%, and 10% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively. The M-SVM classifier has the highest CW-ACC male of 91% and CW-ACC female of 90% with the PCA based FSs when compared with the other classifiers and the entropy based results. These satisfactory improvements confirm that the PCA based selected FSs are more reliable for gender prediction than the entropy based selected features.
Figure 4.31: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced MIT-BROS-2 dataset
Figure 4.32: Best AUC for males and females on balanced MIT-BROS-2 dataset using
PCA based selected FS
Similarly, the other selected classifiers also attain acceptable results using the same selected FSs and settings. Moreover, it is observed that the C-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs, respectively. The M-SVM classifier with PCA based FSs achieves the highest AUC of 97% for both the male and female classes, as depicted in Figure 4.32. The corresponding AUCs of the other selected classifiers are also calculated, as described previously in Table 4.51. Thus, it is concluded that the C-SVM and M-SVM classifiers show a higher AUC than the other selected classifiers.
of all classifiers are also computed with both the entropy and PCA based selected features and tabulated in Table 4.52.
For comparison, the selected classifiers are trained and tested using the entropy and PCA based FSs for gender prediction with the additional evaluation protocols O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, and the results are shown in Table 4.53. According to these results, the Q-SVM classifier exhibits better performance with the entropy based selected FSs, attaining O-ACC, M-ACC, B-ACC, and AUC values of 83.7%, 83.7%, 84.2%, and 91%, respectively, whereas M-SVM produces a higher CW-ACC male of 91% with the best time of 2.4 sec, and C-SVM gives the best CW-ACC female with 80%. In parallel, for the PCA based results, the M-SVM classifier outperforms the other applied classifiers and the entropy based results, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 93.7%, 93.7%, 93.9%, 98%, 97%, and 90%, respectively, with the best time of 2.35 sec. Again, the combination of the proposed PCA based selected FSs, namely the max deep, average deep, and traditional features, leads to higher classification results; this feature combination has distinct properties that classify the gender image precisely. The other applied classifiers also show acceptable performance and are validated against the entropy and PCA based selected FSs, as shown in Table 4.53. Figure 4.33 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, and CW-ACC female) of the different classifiers, verifying that the PCA based FSs produce better results than the entropy based FSs on the balanced, SSS MIT-BROS-3 dataset. For example, the PCA based M-SVM results are higher by 10.0%, 10.0%, 9.7%, 7%, 8%, and 11% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively, when compared with the other classifiers and the entropy based results.
Also, the M-SVM classifier achieves the highest CW-ACC male of 97% and CW-ACC female of 90% with the PCA based FSs, compared with L-SVM and Q-SVM. These significant improvements confirm that the PCA based selected FSs combination enhances the performance compared with the entropy based FSs.
Figure 4.33: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced MIT-BROS-3 dataset
Figure 4.34: Best AUC for males and females on balanced MIT-BROS-3 dataset using
PCA based selected features subsets
Similarly, the other selected supervised methods also attain reliable results using the same FSs and settings. Furthermore, the Q-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs, respectively. The corresponding AUCs of the other selected classifiers are also calculated, as described in Table 4.53. Therefore, it is determined that the Q-SVM and M-SVM classifiers show a higher AUC than the other selected classifiers, as shown in Figure 4.34; the M-SVM classifier with PCA based FSs achieves the highest AUC, 98%, for both the male and female classes, among the other classifiers and the entropy based results.
under an unconstrained environment. In this study, the PKU-Reid dataset is labeled as the imbalanced PKU-Reid-IB dataset and is considered for pedestrian gender classification for the first time. The experimental results, along with a detailed discussion of the datasets mentioned in this section, are given in the subsequent sections.
Results are verified using additional evaluation protocols, namely the accuracies (O-ACC, M-ACC, and B-ACC), AUC, time, and CW-ACC, with the entropy and PCA based FSs, as presented in Table 4.55, in which the Q-SVM and M-SVM classifiers outperform the other applied classifiers. Considering the entropy based results, the Q-SVM classifier yields the best outcomes among the applied classifiers, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 88.3%, 87.0%, 88.3%, 95%, 93%, and 81%, respectively.
The parallel results of the other applied classifiers are also obtained to assess the significance of the proposed approach. Likewise, results are acquired using the PCA based FSs, where M-SVM outperforms the other selected classifiers: the M-SVM classifier shows better performance in terms of O-ACC, M-ACC, B-ACC, AUC, time, CW-ACC male, and CW-ACC female with 89.7%, 88.2%, 90.0%, 96%, 1.92 sec, 95%, and 93%, respectively. The corresponding results of the other selected classifiers are also calculated for the evaluation of the proposed approach; the better performance under the respective evaluation protocols strengthens the case for the proposed technique when the PCA based FSs are utilized. Both the entropy and PCA based experimental results are depicted in Figure 4.35 with the different accuracies and AUCs for a comparison of the two feature selection methods. The PCA based FSs show improved results compared with the entropy based FSs on the imbalanced, SSS PKU-Reid-IB dataset. For example, with the PCA based FSs, the M-SVM results exceed the entropy based results of the Q-SVM classifier by 1.4%, 1.2%, 1.7%, 1%, 2%, and 12% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively.
It is evident from the computed results that the entropy and PCA based selected features perform almost equally well, with only a slight difference in terms of O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male, as shown in Table 4.55. The small improvements confirm that PCA based feature selection followed by serial fusion enhances the performance compared with the entropy based FSs. Similarly, the corresponding classification methods also produce better results using the same FSs and settings. Further, the Q-SVM and M-SVM classifiers exhibit a higher AUC than the other supervised methods on the entropy and PCA based FSs, respectively; the M-SVM classifier with PCA based FSs attains the highest AUC of 96%, as shown in Figure 4.36. An important factor is the B-ACC, with 88.3% on the Q-SVM classifier (entropy based) and 90% on the M-SVM classifier (PCA based), which is acceptable even for an imbalanced data distribution. Since the PKU-Reid-IB dataset is imbalanced and also challenging due to variations in viewing angle, environmental conditions, low contrast, etc., the better performance under the B-ACC metric strengthens the case for the proposed method.
Figure 4.35: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on imbalanced PKU-Reid-IB dataset
Figure 4.36: Best AUC for males and females on imbalanced PKU-Reid-IB dataset
using PCA based selected FSs
FSs. In comparison, the M-SVM classifier requires less time, 5.05 sec, than the other applied classifiers.
In the case of the PCA based results, the C-SVM classifier gives better performance than the other applied classifiers and the entropy based results: C-SVM attains O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male values of 91.6%, 91.6%, 91.6%, 97%, and 94%, respectively, with the PCA based selected feature subsets. In addition, the M-SVM classifier achieves the highest CW-ACC female of 90%, and Q-SVM requires less time, 12.61 sec, than the other classifiers. Again, the computed results show that the combination of the PCA based proposed FSs leads to higher classification performance with the C-SVM classifier.
The combination of traditional and deep features has better discrimination potential for classifying the gender image, even on an augmented balanced dataset. Furthermore, a large number of features is not always helpful; therefore, selecting the important features with PCA outperforms entropy based selection for pedestrian gender classification. The other applied classifiers also give adequate performance and are validated against the entropy and PCA based selected FSs, as shown in Table 4.57. For comparison, Figure 4.37 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, and CW-ACC female) of the multiple classifiers.
According to Figure 4.37, the PCA based FSs exhibit better results than the entropy based FSs on the balanced, SSS PKU-Reid-BROS-1 dataset. For example, the PCA based C-SVM results exceed the entropy based C-SVM results by 2.1%, 2.1%, 2.1%, 1%, and 5% for O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male, respectively, whereas the entropy based results are better in terms of time and CW-ACC female, with 7.05 sec and 90%, respectively, on the C-SVM classifier. Here, C-SVM produces the better CW-ACC for the male class, 94%, and the M-SVM classifier for the female class, 90%, using the PCA based selected FSs. These improvements verify that the PCA based selected FSs combination enhances the performance for O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male, while the entropy based FSs perform better in terms of time and CW-ACC female.
Figure 4.37: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on the balanced PKU-Reid-BROS-1 dataset
Figure 4.38: Best AUC for males and females on balanced PKU-Reid-BROS-1 dataset
using PCA based selected FSs
Hence, Figure 4.37 depicts the best entropy and PCA based results of the two leading classifiers in terms of O-ACC. Similarly, the other selected classifiers also attain reliable results using the same FSs and settings. Moreover, it is observed that the C-SVM classifier exhibits a higher AUC than the other classification methods using both the entropy and PCA based FSs. According to the computed results, the C-SVM classifier with PCA based FSs shows a superior AUC of 97%, as presented in Figure 4.38. The corresponding AUCs of the other selected classifiers are also calculated, as previously shown in Table 4.57. Thus, it is determined that the C-SVM classifier gives a higher AUC than the other selected classifiers. Overall, the PCA based results are better than the entropy based results.
Results are also calculated using other selected classifiers which showed satisfactory
performance as tabulated in Table 4.58.
According to the results, the C-SVM classifier again exhibits the best entropy based results, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 87.3%, 87.3%, 87.3%, 94%, 88%, and 87%, respectively. Furthermore, the M-SVM classifier requires less time, 3.04 sec, than the other selected classifiers. When the PCA based selected features are applied, the M-SVM classifier shows superior classification results in comparison with the other selected supervised methods and the entropy based results, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 92.2%, 92.2%, 92.2%, 97%, 93%, and 91%, respectively, with the PCA based selected FSs. According to the results shown in Table 4.58 and Table 4.59, it is apparent that the proposed FSs assist in better classification, especially when the M-SVM classifier is trained on the PCA based selected FSs; the entropy based selected FSs give lower results than the PCA based ones. Therefore, the combination of deep and traditional features provides the FFV with distinct features for accurate gender prediction. The other applied classifiers portray appropriate results and are validated against the entropy and PCA based selected FSs, as presented in Table 4.59. It can be seen that the proposed approach achieves a higher O-ACC, as shown in Figure 4.39, which provides an insight into the different accuracies and the AUC for both feature selection methods. The PCA based FSs exhibit better results than the entropy based FSs on the balanced, SSS PKU-Reid-BROS-2 dataset. These experiments confirm that the proposed approach produces reliable outcomes on the augmented dataset PKU-Reid-BROS-2, as previously shown for the augmented dataset PKU-Reid-BROS-1.
Figure 4.39: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced PKU-Reid-BROS-2 dataset
When the PCA based M-SVM results are compared with the entropy based C-SVM results, the M-SVM outcomes are higher by 4.9%, 4.9%, 4.9%, 3%, 5%, and 4% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively. These satisfactory improvements confirm that the PCA based selected FSs are more reliable for gender prediction than the entropy based selected features.
In parallel, the other selected classifiers also achieve acceptable results using the same selected FSs and settings. Further, the C-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs, respectively; the M-SVM classifier with PCA based FSs attains the highest AUC of 97%, as shown in Figure 4.40. The corresponding AUCs of the other selected classifiers are also calculated, as previously described in Table 4.59. Therefore, it is concluded that the C-SVM and M-SVM classifiers are preferred over the other selected classifiers because they achieve a higher AUC.
Figure 4.40: Best AUC for males and females on balanced PKU-Reid-BROS-2 dataset
using PCA based selected FSs
9.1%, whereas the C-SVM classifier attains superior results of 89.1% PPV, 89.8% F1, 89.3% TNR, and 10.6% FPR. With the PCA based outcomes, outstanding results are obtained in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR with 88.9%, 94.8%, 91.7%, 95.2%, 89.6%, 10.4%, and 5.1%, respectively, compared with the entropy based results and the other selected classifiers. The achieved results reveal the better performance of the PCA based selected features over the entropy based selected features. Moreover, the corresponding results of the other classifiers are also computed with both the entropy and PCA based selected features and tabulated in Table 4.60.
For a comparison of the two feature selection methods, the selected classifiers are trained and tested using the entropy and PCA based FSs for gender prediction with the additional evaluation protocols O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, and the results are shown in Table 4.61. The obtained results verify that the C-SVM classifier performs best with the entropy based selected FSs, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 90.0%, 90.0%, 90.1%, 97%, 91%, and 89%, respectively. In parallel, with the PCA based results, the M-SVM classifier outperforms the other applied classifiers and the entropy based results, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 92.1%, 92.1%, 92.2%, 98%, 95%, and 89%, respectively, with the best time of 7.78 sec. Again, the combination of the proposed PCA based selected FSs, namely the max deep, average deep, and traditional features, produces higher classification results; the selected feature combination has distinct properties that classify the gender image precisely on the customized datasets as well.
The other applied classifiers also give acceptable performance and are validated against the entropy and PCA based selected FSs.
The best entropy and PCA based results of the two leading classifiers are shown in Figure 4.41, where the proposed technique achieves a superior O-ACC; the figure provides an insight into the different accuracies and the AUC for both feature selection methods. According to Figure 4.41, the PCA based FSs display better results than the entropy based FSs on the balanced, SSS PKU-Reid-BROS-3 dataset. For example, the PCA based M-SVM results exceed the entropy based Q-SVM results by 2.1%, 2.1%, 2.1%, 1%, and 1% for O-ACC, M-ACC, B-ACC, AUC, and CW-ACC male, respectively. The computed CW-ACC of 89% is the same on both the C-SVM and M-SVM classifiers. These improvements confirm that the PCA based selected FSs combination slightly improves the performance compared with the entropy based results.
Similarly, the other classification methods also attain reliable results with the same selected FSs and settings. Further, the C-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs, respectively. The corresponding AUCs of the other selected classifiers are also computed, as described previously in Table 4.61. It is concluded that the C-SVM and M-SVM classifiers show a higher AUC than the other selected classifiers; the M-SVM classifier with PCA based FSs achieves the highest AUC, 98%, for both the male and female classes, among the other selected classifiers and the entropy based results. Only this best AUC is shown in Figure 4.42.
Figure 4.41: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced PKU-Reid-BROS-3 dataset
Figure 4.42: Best AUC for males and females on balanced PKU-Reid-BROS-3 dataset
using PCA based selected FSs
Later, results are computed on the test samples and tabulated using different evaluation protocols, as presented in Tables 4.62–4.65. The purpose of these experiments is to observe the performance of the proposed method on challenging balanced and SSS (B-SSS) datasets. The following subsections cover the evaluation of these datasets.
Similarly, in the case of the PCA based results, the best values are obtained on the M-SVM classifier, with PPV 88.5%, TPR 89.0%, F1 88.7%, TNR 88.6%, FPR 11.3%, and FNR 10.9%, whereas the best NPV, 89.5%, is obtained on the C-SVM classifier. It is apparent from the results on the balanced PETA-SSS-1 dataset that the PCA based selected FSs outperform the entropy based selected FSs. For example, with PCA the FPR and FNR are 11.3% and 10.9%, which are better rates than the entropy based rates and the respective rates of the other selected classifiers. Thus, the proposed approach achieves the best PPV, TPR, F1, etc. using the PCA based selected features. Results are also computed under the other evaluation protocols, including O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC. The calculated figures are shown in Table 4.63, from which it is noted that C-SVM and M-SVM outperform the other classifiers. For example, for the entropy based results, the proposed approach attains higher values of O-ACC 84.0%, M-ACC 84.0%, AUC 91%, and CW-ACC female 83% on the C-SVM classifier, whereas B-ACC 85.5% is obtained on the M-SVM classifier and CW-ACC male 94% on the L-SVM classifier. Similarly, with the PCA based results, superior values of O-ACC 88.3%, M-ACC 88.3%, B-ACC 88.8%, AUC 95%, and CW-ACC female 88% are obtained on the M-SVM classifier with a shorter time of 4.52 sec, while the best CW-ACC male of 91% is obtained on the C-SVM classifier.
Figure 4.43: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on the balanced PETA-SSS-1 dataset
Figure 4.44: Best AUC for males and females on the balanced PETA-SSS-1 dataset
using PCA based selected FSs
Moreover, Figure 4.44 depicts the best average AUC of 95% achieved by the proposed approach using the PCA based FSs. The corresponding AUCs of all other classifiers are also computed for performance evaluation, as presented previously in Table 4.63. The C-SVM classifier shows a 91% AUC with the entropy based results, and the M-SVM classifier gives 95% with the PCA based results, hence exhibiting the higher AUC. Thus, it is concluded that the M-SVM classifier provides a better AUC on the PCA based selected FSs than the other classifiers.
b) Performance Evaluation on PETA-SSS-2 Dataset: To observe the robustness of the proposed approach, additional investigations are presented on another balanced dataset, PETA-SSS-2, which has more samples than PETA-SSS-1. The same selected classifiers, evaluation protocols, and settings are used in these investigations, and the entropy and PCA based results are computed and tabulated for discussion. According to the Q-SVM classifier results shown in Table 4.64, the entropy based feature selection method achieves higher rates, with PPV 84.2%, TPR 84.5%, F1 84.5%, NPV 84.9%, TNR 84.3%, FPR 15.6%, and FNR 15.1%, which are higher than the best PCA based results and those of the other classifiers. Nevertheless, the M-SVM classifier outperforms L-SVM, Q-SVM, and C-SVM, with PPV 80.3%, TPR 94.6%, F1 86.9%, NPV 95.4%, TNR 82.9%, FPR 17.0%, and FNR 5.3%, when the PCA based selected features are supplied for training and testing. The corresponding results of all classifiers are also computed with both the entropy and PCA based selected features and presented in Table 4.64.
Other evaluation protocols are also utilized to check the performance of the proposed approach, including O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC. The computed scores for these protocols are shown in Table 4.65, where the Q-SVM classifier obtains the highest values of O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female with 84.5%, 84.5%, 84.5%, 92%, 85%, and 84% when the entropy based selected features are used, while M-SVM requires the least time, 6.66 sec, among the classifiers. In parallel, M-SVM attains O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 87.9%, 87.9%, 88.8%, 95%, 95%, and 80%, respectively, using the PCA based selected FSs. The combination of the proposed selected features performs well in the SSS setting even when the samples are collected from the more challenging PETA dataset; the proposed feature combination classifies the gender image precisely when tested on PETA images. The other applied classifiers also show adequate performance and are validated against the entropy and PCA based selected FSs, as shown in Table 4.65.
For comparison, Figure 4.45 depicts the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, and CW-ACC female) of the multiple classifiers. The PCA based FSs display higher results than the entropy based FSs on the more challenging balanced, SSS PETA-SSS-2 dataset. For example, the PCA based M-SVM results exceed the entropy based Q-SVM results by 3.4%, 3.4%, 4.3%, 3%, 10%, and 4% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively. This noteworthy improvement proves that the PCA based selected FSs combination enhances the performance compared with the entropy based FSs. Likewise, the other selected supervised methods also attain better results in parallel using the same settings and FSs.
Further, it is noted that the Q-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs, respectively; the M-SVM classifier with PCA based FSs achieves the highest AUC of 95%, as shown in Figure 4.46. The corresponding AUCs of the other selected classifiers are also calculated, as shown previously in Table 4.65. Thus, the Q-SVM and M-SVM classifiers show a higher AUC than the other selected classifiers.
Figure 4.45: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced PETA-SSS-2 dataset
Figure 4.46: Best AUC for males and females on balanced PETA-SSS-2 dataset using
PCA based selected FSs
c) Performance Evaluation on VIPeR-SSS Dataset: The proposed approach is also verified on the customized VIPeR-SSS dataset for pedestrian gender classification. The entropy and PCA based selected FSs combinations are provided to the selected classifiers, and the acquired results are presented in Tables 4.66 and 4.67. In the case of the entropy based FSs, the M-SVM classifier shows values of 75.8% PPV, 73.9% TPR, 74.9% F1, 73.3% NPV, 75.2% TNR, 24.7% FPR, and 26.0% FNR. With the PCA based outcomes, superior results are attained in terms of PPV, TPR, F1, NPV, TNR, FPR, and FNR with 90.4%, 98.7%, 94.4%, 98.8%, 91.2%, 18.7%, and 11.2%, respectively, compared with the entropy based results and the other classifiers. The obtained results confirm that the PCA based selected features perform much better than the entropy based selected features. Moreover, the corresponding results of all classifiers are also computed using both the entropy and PCA based selected features, as shown in Table 4.66.
Moreover, the selected classifiers are trained and tested with the entropy and PCA based FSs for gender prediction using the additional evaluation protocols O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC, with the results shown in Table 4.67. According to these results, the M-SVM classifier gives the best performance with the entropy based selected FSs, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 74.6%, 74.6%, 74.6%, 81%, 73%, and 76%, respectively. With the PCA based results, the M-SVM classifier again outperforms the other applied classifiers and the entropy based results, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 94.6%, 94.6%, 95.0%, 97%, 99%, and 90%, respectively. Again, the combination of the proposed PCA based selected FSs, namely the max deep, average deep, and traditional features, produces higher classification results; the selected feature combination has distinct properties that classify the gender image accurately even on the customized VIPeR dataset. Acceptable performance is also observed on the other classifiers and validated against the entropy and PCA based selected FSs, as shown in Table 4.67.
Figure 4.47 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, and CW-ACC female) of the top two classifiers, according to which the PCA based FSs give better results than the entropy based FSs on the balanced, SSS VIPeR-SSS dataset. For example, the PCA based M-SVM results exceed the entropy based M-SVM results by 20.0%, 20.0%, 20.4%, 16%, 26%, and 14% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively, and M-SVM attains excellent CW-ACC values of 99% for males and 90% for females. These improvements make it evident that the PCA based selected FSs combination, and the contribution of each FS, enhance the performance over the entropy based results. Likewise, the other selected classification methods show reliable results in parallel with the same FSs and settings. The M-SVM classifier exhibits a higher AUC than the other classification methods for both the entropy and PCA based FSs. All the other corresponding AUCs are calculated as well for the performance evaluation of the remaining classifiers, as described previously in Table 4.67. The M-SVM classifier with PCA based FSs shows the highest AUC, 97%, for both the male and female classes, among the other classifiers and the entropy based method; only this best AUC for both classes is shown in Figure 4.48. It is evident from the results that the proposed method has superior performance on the VIPeR-SSS dataset.
Figure 4.47: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced VIPeR-SSS dataset
Figure 4.48: Best AUC for males and females on balanced VIPeR-SSS dataset using
PCA based selected features subsets
In this section, the proposed approach is tested on two balanced cross-datasets using the same classifiers and evaluation protocols. All results are computed separately on each dataset and presented in Tables 4.68–4.71. The purpose of these experiments is to confirm the strength of the proposed approach on datasets with a small number of samples, for instance a total of 175 samples for each class (male/female), as used by [45]. The proposed method is also verified on a new dataset, named cross-dataset-1, in which each class consists of 350 samples. The results and discussion for both datasets are given in the following subsections.
are attained using the PCA based FSs. According to the computed results, the M-SVM classifier outperforms the others by achieving better O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 90.8%, 90.8%, 90.9%, 95%, 93%, and 89%, respectively, whereas the C-SVM classifier gives the best time of 0.47 sec.
For example, with the PCA based FSs, the M-SVM results exceed the entropy based Q-SVM results by 17.7%, 17.7%, 17.6%, 14%, 24%, and 11% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively. For comparison, both the entropy and PCA based experimental results are shown in Figure 4.49, which provides an insight into the best results of the two classifiers under both feature selection methods. Other notable improvements of 11.8%, 11.8%, 11.9%, 12%, 20%, and 3% are observed for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female when the C-SVM results are compared under the two feature selection methods. These improvements prove that PCA based feature selection boosts the performance in comparison with the entropy based FSs. Likewise, the other classification methods also show improved results using the same FSs and settings. Further, the Q-SVM and M-SVM classifiers show a higher AUC than the other supervised methods on the entropy and PCA based FSs, respectively; the M-SVM classifier with PCA based FSs attains the best AUC of 95%, as shown in Figure 4.50.
Figure 4.49: Entropy and PCA based proposed PGC-FSDTF method results
comparison in terms of O-ACC, M-ACC, B-ACC, AUC, CW-ACC
male, and CW-ACC female on balanced cross-dataset, best outcomes of
two classifiers
Figure 4.50: Best AUC for males and females on balanced cross-dataset using PCA
based selected FSs
b) Performance Evaluation on Cross-dataset-1: The proposed technique is also examined on cross-dataset-1. The entropy and PCA based computed results are presented in Table 4.70 and Table 4.71. With the entropy based FSs, the proposed approach shows the best performance under the Q-SVM classifier, with PPV, TPR, F1, NPV, TNR, FPR, and FNR values of 84.9%, 82.5%, 83.6%, 82.0%, 84.4%, 15.5%, and 17.5%, respectively. In parallel, the PCA based results with the M-SVM classifier outperform the entropy based results and the other classifiers, attaining PPV, TPR, F1, NPV, TNR, FPR, and FNR values of 88.2%, 92.2%, 90.2%, 92.5%, 88.7%, 11.2%, and 7.7%, respectively. From these experiments, it can be seen that the PCA based selected features perform much better than the entropy based selected features. Moreover, the corresponding results of all classifiers are also computed using both the entropy and PCA based selected features and shown in Table 4.70. The proposed technique is further assessed using more evaluation protocols, namely O-ACC, M-ACC, B-ACC, AUC, time, and CW-ACC. The results under these protocols, shown in Table 4.71, reveal that the Q-SVM classifier performs best with the entropy based selected FSs, attaining O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 83.4%, 83.4%, 83.4%, 90%, 82%, and 85%, respectively. In the case of the PCA based results, the M-SVM classifier shows improved performance over the other classifiers and the entropy based results, with O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female values of 90.4%, 90.4%, 90.5%, 96%, 89%, and 90%, respectively, with the PCA based selected FSs.
It is evident from the computed results that the combination of the proposed deep and traditional features leads to higher classification results when it is used with PCA controlled feature selection and the M-SVM classifier; this combination has the potential to classify the gender image precisely. The other applied classifiers also show notable performance and are validated against the entropy and PCA based selected FSs, as shown in Table 4.71.
Figure 4.51 shows the entropy and PCA based AUC and accuracies (O-ACC, M-ACC, B-ACC, CW-ACC male, and CW-ACC female) of the multiple classifiers. According to this figure, better results are achieved with the PCA based FSs than with the entropy based FSs on the balanced, SSS cross-dataset-1. For example, the PCA based M-SVM results exceed the entropy based Q-SVM results by 7.0%, 7.0%, 7.1%, 6%, 7%, and 5% for O-ACC, M-ACC, B-ACC, AUC, CW-ACC male, and CW-ACC female, respectively. These noteworthy improvements confirm that the PCA based selected FSs combination improves the performance compared with the entropy based FSs.
Similarly, the other selected supervised methods also attain reliable results using the same FSs and settings. Moreover, the Q-SVM and M-SVM classifiers exhibit a higher AUC than the other classification methods using the entropy and PCA based FSs. All the other corresponding AUCs are also calculated to show the performance of the selected classifiers, as previously described in Table 4.71 and shown in Figure 4.52. It is concluded that the Q-SVM and M-SVM classifiers with PCA based FSs show a higher AUC, up to 96%, than the rest of the classifiers and the entropy based method.
Figure 4.52: Best AUC for males and females on balanced cross-dataset-1 using PCA
based selected FSs
Table 4.72: Performance comparison with state-of-the-art methods on MIT/MIT-IB
dataset, dash (-) represents that no reported result is available
Methods Year AUC (%) O-ACC (%) M-ACC (%)
PBGR [166] 2008 - 75.0 -
PiHOG-LHSV [168] 2009 - 72.3 -
BIO-PCA [60] 2009 - 79.2 -
BIO-OLPP [60] 2009 - 77.1 -
BIO-LSDA [60] 2009 - 78.2 -
BIO-MFA [60] 2009 - 75.2 -
BIF-PCA [60] 2009 - 80.6 -
CNN [61] 2013 - 80.4 -
HOG [169] 2015 - 78.9 75.9
LBP [169] 2015 - 76.1 68.5
HSV [169] 2015 - 71.3 64.8
LBP-HSV [169] 2015 - 77.6 73.7
HOG-HSV [169] 2015 - 80.9 75.3
HOG-LBP [169] 2015 - 79.8 76.6
HOG-LBP-HSV [169] 2015 - 80.1 76.7
CNN-e [183] 2017 - 81.5 -
U+M+L (CNN-1) [71] 2019 - 80.8 -
U+M+L (CNN-2) [71] 2019 - 81.2 -
U+M+L (CNN-3) [71] 2019 - 81.3 -
U+M+L (CaffeNet) [71] 2019 - 80.9 -
J-LDFR [266] 2020 86 82.0 77.3
CSVFL [69] 2020 - 85.2 -
Proposed PGC-FSDTF 93 92.7 89.4
Figure 4.53: Performance comparison in terms of overall accuracy between proposed
PGC-FSDTF method and existing methods on MIT/MIT-IB dataset
Figure 4.54: Comparison of training and prediction time of PGC-FSDTF with J-LDFR
higher AUC compared with the existing approaches, with a difference of 4% minimum and 16% maximum. The superior results of the PGC-FSDTF approach are due to the PCA based selection of distinct information from the proposed FSs, in which the applied PHOG FS is insensitive to local geometric, illumination, and pose variations; it captures significant information at both the local and global levels by combining the HOG descriptor with a pyramid structure, which contributes effectively to gender prediction. In addition, the HSV based histogram contains color information that complements the traditional features, because color information resists multiple types of change in an image, such as size, direction, rotation, distortion, and noise. Furthermore, two different CNN architectures are used to provide deeper-level information about the gender image according to the depth of each CNN architecture.
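To illustrate the handcrafted side of this representation, the following Python sketch computes a simplified two-level PHOG-style descriptor and an HSV histogram; the bin counts, cell sizes, and pyramid depth are placeholder values, since the exact PHOG and HSV settings of the thesis are defined in Chapter 3.

import numpy as np
from skimage.color import rgb2gray, rgb2hsv
from skimage.feature import hog

def phog_like(img_rgb):
    gray = rgb2gray(img_rgb)
    h, w = gray.shape
    feats = [hog(gray, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))]               # level 0: whole image
    for ys in (slice(0, h // 2), slice(h // 2, h)):     # level 1: four quadrants
        for xs in (slice(0, w // 2), slice(w // 2, w)):
            feats.append(hog(gray[ys, xs], orientations=9,
                             pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.concatenate(feats)

def hsv_hist(img_rgb, bins=32):
    hsv = rgb2hsv(img_rgb)                              # channel values in [0, 1]
    return np.concatenate([np.histogram(hsv[..., c], bins=bins,
                                        range=(0, 1), density=True)[0]
                           for c in range(3)])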
The use of parallel fused deep features is more robust because it provides distinct information (max features) and average information (average features) from two CNN architectures instead of a single CNN architecture. Hence, the results suggest that combining the features of two CNN architectures leads to better classification rates. The superior results also verify that the fusion of all FSs provides better discrimination ability and identifies the contribution of each FS to robust pedestrian gender classification. The proposed PGC-FSDTF method addresses the SSS dataset problem for pedestrian gender classification. These SSS datasets mainly face three issues: (1) lack of data generalization, (2) class-wise imbalanced data, and (3) model learning. The first two issues are addressed using data augmentation operations and random oversampling.
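A minimal sketch of the parallel max/average fusion of the deep features is given below, assuming that the FC-layer vectors of the two CNN architectures have already been brought to a common length d; the alignment step used in the thesis is not reproduced here.

import numpy as np

def fuse_deep(feats_vgg19, feats_resnet101):
    # Both inputs: (n_samples, d) arrays of FC-layer activations.
    max_deep_fv = np.maximum(feats_vgg19, feats_resnet101)   # distinct (max) information
    avg_deep_fv = (feats_vgg19 + feats_resnet101) / 2.0      # average information
    return max_deep_fv, avg_deep_fv

Under the stated assumptions, the two outputs play the role of the MaxDeep_FV and AvgDeep_FV feature sets referred to above.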
Figure 4.55: Performance comparison in terms of AUC between the proposed PGC-
FSDTF method and existing methods on cross-datasets
In this work, an equal distribution of data is achieved using four data augmentation
operations that add synthetic data (more samples) to an SSS class. Typically, only the
minority class is considered for data augmentation, but here data is synthesized for both
gender classes for two reasons: (1) to minimize bias while applying data augmentation
operations, and (2) the performance of the male gender class may be affected while
trying to improve the performance of the female gender class, and vice versa. To
generate unbiased balanced data, three different strategies (1vs1, 1vs4, and mixed) are
presented, wherein the selected data augmentation operations are implemented
according to these strategies one by one; a generic balancing sketch is given below.
Consequently, three augmented datasets are generated from each imbalanced dataset
for the experiments. Finally, all datasets are classified as two class-wise imbalanced,
six augmented balanced, and five non-augmented/customized balanced SSS datasets.
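The sketch below illustrates only the general over-sampling loop used to balance a small class with augmented copies; the four operations and the random selection policy are placeholders, not the exact 1vs1/1vs4/mixed procedures described above.

```python
# Hedged sketch of balancing a small-sample-size (SSS) gender class with
# synthetic samples. The four operations and the random selection policy are
# placeholders; only the generic over-sampling loop is shown.
import random
import numpy as np

def hflip(img):     return img[:, ::-1]
def rot90(img):     return np.rot90(img)
def add_noise(img): return img + np.random.normal(0, 5, img.shape)
def shift(img):     return np.roll(img, 4, axis=1)

OPERATIONS = [hflip, rot90, add_noise, shift]

def balance_with_augmentation(majority, minority):
    """Grow the minority class with augmented copies until both classes match."""
    synthetic = list(minority)
    while len(synthetic) < len(majority):
        src = random.choice(minority)        # sample an original image
        op = random.choice(OPERATIONS)       # pick an augmentation operation
        synthetic.append(op(src))
    return majority, synthetic

males = [np.zeros((128, 64)) for _ in range(60)]   # placeholder image lists
females = [np.zeros((128, 64)) for _ in range(15)]
males, females = balance_with_augmentation(males, females)
print(len(males), len(females))  # 60 60
```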
This work has utilized different SVM kernels to classify gender. The superior results in
terms of O-ACC and AUC on these datasets validate the performance of the proposed
PGC-FSDTF approach for gender classification when the C-SVM and M-SVM
classifiers are trained on PCA based selected FSs. This study aimed to design a robust
approach that accurately classifies the gender image under imbalanced and SSS dataset
conditions. In parallel, detailed investigations are reported on these datasets using
entropy based FSs with the selected classifiers to test the performance of the proposed
approach. Moreover, it is observed that the proposed approach outperforms existing
studies on the MIT-IB dataset and cross-dataset with significant improvements.
In this study, deeply learned information is extracted from the FC layers of two different
CNN architectures rather than training a model from scratch or relying on transfer
learning. Deep feature representations are robust against large intra-class variations,
pose changes, and different illumination conditions. Multiple selected feature schemes
are combined in the proposed PGC-FSDTF approach, which produces improved
outcomes without demanding additional data for model learning. For a comprehensive
analysis, the applied datasets are classified into two categories: (1) imbalanced and
augmented balanced datasets, and (2) customized/non-augmented balanced datasets.
The empirical analysis is elaborated in the next section.
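As a hedged illustration of FC-layer feature extraction, the following sketch assumes PyTorch/torchvision (>= 0.13) with ImageNet pre-trained VGG19 and ResNet-101 as stand-in backbones; it shows how such deep representations can be obtained without training a model from scratch and is not the exact thesis implementation.

```python
# Minimal sketch, assuming PyTorch/torchvision (>= 0.13) with ImageNet
# pre-trained VGG19 and ResNet-101 as stand-in backbones: deep features are
# taken from the FC level instead of training a model from scratch.
import torch
import torch.nn as nn
from torchvision import models, transforms

vgg = models.vgg19(weights="DEFAULT").eval()
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-D output
resnet = models.resnet101(weights="DEFAULT").eval()
resnet.fc = nn.Identity()                                              # 2048-D pooled output

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def deep_features(pil_image):
    """Return two FC-level feature vectors for one pedestrian image."""
    x = preprocess(pil_image).unsqueeze(0)
    return vgg(x).squeeze(0), resnet(x).squeeze(0)
```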
unequal distribution of data, is examined on the MIT-IB and PKU-Reid-IB datasets.
These datasets are also challenging due to variations in pose, illumination, and low
contrast. Considering the PKU-Reid based augmented balanced datasets, the proposed
approach performed best in terms of O-ACC, M-ACC, and B-ACC with 92.2% for each
evaluation metric on the PKU-Reid-BROS-2 dataset, such that the O-ACC is 0.6% and
2.3% higher than on the PKU-Reid-BROS-1 and PKU-Reid-BROS-3 datasets,
respectively.
It is also observed from the empirical results that the gender classification improvement
on PKU-Reid-BROS-2 was achieved when more samples were added to the minority
class (female) based on the 1vs4 strategy while implementing the data augmentation
operations. The proposed approach maintains the classification accuracies with minor
differences whether the data is augmented for a single gender class or for both.
Moreover, the proposed approach achieved its best AUC of 98% on both mixed-strategy
augmented datasets, MIT-BROS-3 and PKU-Reid-BROS-3.
It was also noticed from the overall outcomes that a performance improvement was
attained on the extremely imbalanced MIT-IB dataset when BROS was applied using
the mixed strategy rather than augmenting data with the 1vs1 and 1vs4 strategies.
Similarly, in the case of the PKU-Reid-BROS-3 dataset, the performance improvement
was achieved when BROS was applied using the 1vs4 and mixed strategies rather than
augmenting data with the 1vs1 strategy. Considering these findings, it can be concluded
that the proposed approach achieves better results on imbalanced SSS datasets and can
also be applied effectively to augmented balanced datasets. Hence, the better
performance under multiple evaluation measures strengthens the case for the proposed
approach.
The proposed approach also yields better classification rates, with significant
improvements, than existing studies reported on the cross-dataset. In addition, the
proposed approach achieved comparable classification results on the newly customized
cross-dataset-1. This dataset has double the number of samples per class (350)
compared to the cross-dataset. However, in terms of performance, there is only a slight
difference between the existing cross-dataset and the newly customized cross-dataset-1.
The experimental outcomes validate that the proposed approach can achieve
classification improvements on the customized SSS cross-dataset-1. These results also
confirm that PGC-FSDTF maintains better performance on the cross-dataset than on
the newly prepared cross-dataset-1. The proposed PGC-FSDTF approach is also
evaluated on three other balanced SSS datasets, where gender-wise samples are
collected from the more challenging PETA and VIPeR datasets. Using the PETA
dataset, 864 and 1300 samples per class are randomly selected for the PETA-SSS-1 and
PETA-SSS-2 datasets, respectively. These numbers were chosen to cross-check the
performance of the proposed approach on the same number of class-wise samples as
used in the augmented balanced datasets. Interestingly, the proposed approach yields
better results on both the PETA-SSS-1 and PETA-SSS-2 datasets, acquiring the same
AUC of 95% on both. In addition, the proposed approach attained O-ACC, M-ACC,
and B-ACC of 88.8% each on the PETA-SSS-1 dataset, whereas on the PETA-SSS-2
dataset the PGC-FSDTF approach acquired slightly lower accuracies. The notable
factor on these datasets is the class-wise accuracy: comparing both datasets, the
CW-ACC for males is higher by 6% on the PETA-SSS-2 dataset, while the CW-ACC
for females is better by 8% on the PETA-SSS-1 dataset, owing to class-wise similarities
in the randomly collected data.
The VIPeR dataset is customized and renamed as the VIPeR-SSS dataset; this prepared
dataset contains a balanced distribution of SSS data. On it, the proposed approach
achieved its best results among all applied datasets in terms of O-ACC, M-ACC, and
B-ACC with 94.7%, 94.7%, and 95.0%, respectively, and the other evaluation metrics
also showed significant performance. To the best of our knowledge, the PGC-FSDTF
approach outperforms the existing approaches reported in the literature for pedestrian
gender classification in terms of O-ACC, M-ACC, AUC, CW-ACC male, and CW-ACC
female on the MIT-IB dataset and cross-dataset. According to the results, the proposed
approach achieved an O-ACC of 92.7%, M-ACC of 89.4%, B-ACC of 94.1%, AUC of
93%, CW-ACC male of 99%, and CW-ACC female of 80% on the MIT-IB dataset. The
improved results acquired on the cross-dataset show an O-ACC of 90.8%, M-ACC of
90.8%, B-ACC of 90.9%, AUC of 85%, CW-ACC male of 93%, and CW-ACC female
of 89%, as described in Table 4.76.
Table 4.76: Proposed PGC-FSDTF approach results on MIT-IB dataset and cross-dataset
Dataset          O-ACC (%)   M-ACC (%)   B-ACC (%)   AUC (%)   CW-ACC Male (%)   CW-ACC Female (%)
MIT-IB           92.7        89.4        94.1        93        99                80
Cross-dataset    90.8        90.8        90.9        95        93                89
Further, the training time and prediction time of the proposed method on the applied
datasets are efficient. Considering optimal values of training and prediction time, the
proposed PGC-FSDTF obtained the best AUC using the entropy and PCA based feature
subsets, as shown in Figure 4.56 and Figure 4.57. Across all applied datasets, Figure
4.56 shows the training time in seconds, and Figure 4.57 shows the prediction speed in
observations per second (obs/sec) using the entropy and PCA based feature subsets.
Figure 4.58: Complete overview of the proposed PGC-FSDTF method results in terms
of best CW-ACC on selected, customized, and augmented datasets where the
proposed approach achieved superior AUC: (a) CW-ACC on customized
balanced SSS datasets, and (b) CW-ACC on imbalanced and augmented
balanced SSS datasets
4.5 Discussion
In this thesis, one method named FCDF is presented for person ReID, and two methods,
namely J-LDFR and PGC-FSDTF, are presented for pedestrian gender classification. In
person ReID, the issues of gallery search optimization, environmental effects, and
different camera settings are examined. A robust three-stage method is proposed to
address these issues. The first stage of the proposed method extracts distinct features
using FFS. The second stage splits the dataset into clusters based on the selected feature
subset, such that the extracted deep features are fused with the OFS to handle appearance
issues and inherent ambiguities. In the third stage, feature matching is introduced to
search for the probe image match within the classified cluster. This stage provides
gallery search optimization because the search is cluster-based rather than gallery-based
for probe image matching. The proposed FCDF approach is tested on the VIPeR,
CUHK01, and iLIDS-VID datasets and obtains significant results at different ranks
compared to state-of-the-art methods. Recognition rates at rank-1 are shown in Table
4.77. As this work implements person ReID on different pedestrian datasets, the next
objective was to examine datasets that are subsets of the large-scale PETA dataset and
the MIT dataset, which are widely reported in the existing literature for investigating
full-body appearances of pedestrians for gender classification. Therefore, in the second
proposed method, J-LDFR, the large-scale (PETA) and small-scale (MIT) datasets are
selected for PGC.
Table 4.77: Summary of proposed methods including tasks, datasets, and results
accurate gender prediction. The computed results surpass the relevant existing methods
by significant margins on both the large-scale and small-scale datasets.
To the best of our knowledge, no prior study addresses the binary imbalanced
classification and SSS problem for PGC. Therefore, the third proposed method performs
PGC on IB-SSS datasets. In this regard, the contribution of this research is twofold:
(1) generating synthetic data to effectively handle the binary imbalanced classification
and SSS problem for PGC, and (2) investigating multiple low-level and high-level
feature extraction schemes, feature selection (PCA and entropy), and fusion (parallel
and serial) strategies to accurately classify gender under the same settings and protocols
but with different types of datasets, including imbalanced, augmented balanced, and
customized balanced. The suggested PGC-FSDTF produced equally good results on
these datasets. The details about the datasets and results in terms of O-ACC and AUC
are given in Table 4.77.
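For illustration only, the sketch below contrasts the two feature selection routes mentioned above on placeholder data; the entropy criterion used here (Shannon entropy of per-feature histograms) is an assumption and not necessarily the exact criterion of the proposed methods.

```python
# Hedged illustration of the two feature selection routes: a PCA projection and
# an entropy based ranking that keeps the highest-entropy feature dimensions.
# The Shannon-entropy-of-histogram criterion is an assumption for illustration.
import numpy as np
from sklearn.decomposition import PCA

def entropy_select(X: np.ndarray, k: int, bins: int = 32) -> np.ndarray:
    """Keep the k columns of X whose value histograms have the highest entropy."""
    scores = []
    for j in range(X.shape[1]):
        hist, _ = np.histogram(X[:, j], bins=bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        scores.append(-(p * np.log2(p)).sum())
    keep = np.argsort(scores)[-k:]
    return X[:, keep]

X = np.random.rand(500, 1024)                     # placeholder fused feature matrix
X_entropy = entropy_select(X, k=256)              # entropy based subset
X_pca = PCA(n_components=256).fit_transform(X)    # PCA based subset
print(X_entropy.shape, X_pca.shape)
```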
4.6 Summary
In this chapter, the proposed methodologies are evaluated on several publicly available
challenging datasets. Different experiments are performed to validate the performance
of the proposed methods. The results based on the selected feature subsets are assessed
with a number of evaluation protocols, such as CMC, O-ACC, M-ACC, B-ACC, and
AUC, using handcrafted and deep learning features, which are fed to a distance measure
and different classifiers for matching and classification of full-body pedestrian images.
The probe matching results in person ReID are obtained from a cluster instead of the
whole gallery, whereas the pedestrian gender classification results are obtained using
10-fold cross-validation.
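A minimal sketch of this 10-fold cross-validation protocol is given below (scikit-learn, placeholder features and labels); M-ACC is taken here as the mean of the class-wise accuracies, which is an assumption about its definition, and the RBF-kernel SVM is only a placeholder classifier.

```python
# Hedged sketch of the 10-fold cross-validation protocol with overall (O-ACC)
# and class-wise (CW-ACC) accuracies; M-ACC is assumed here to be the mean of
# the class-wise accuracies. Features, labels, and the RBF-kernel SVM are placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(600, 256)        # placeholder selected feature subset
y = np.random.randint(0, 2, 600)    # placeholder gender labels (0 = male, 1 = female)

o_acc, cw_male, cw_female = [], [], []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    pred = SVC(kernel="rbf").fit(X[tr], y[tr]).predict(X[te])
    o_acc.append((pred == y[te]).mean())
    cw_male.append((pred[y[te] == 0] == 0).mean())
    cw_female.append((pred[y[te] == 1] == 1).mean())

print("O-ACC:", np.mean(o_acc))
print("CW-ACC male/female:", np.mean(cw_male), np.mean(cw_female))
print("M-ACC:", (np.mean(cw_male) + np.mean(cw_female)) / 2)
```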
Chapter 5 Conclusion and Future Work
5.1 Conclusion
Accurate pedestrian ReID and gender classification are challenging tasks owing to the
high diversity of pedestrian appearances under non-overlapping camera settings. To
overcome these challenges, three methods are proposed in this thesis and evaluated on
benchmark datasets.
The first method addresses the person ReID problem and its challenges using OFS
based feature clustering and deeply learned features. A novel FCDF framework is
presented that considers single-image based ReID (one probe and one gallery image per
identity). The OFS is chosen to handle the challenges of illumination and viewpoint
variations, and a reliable and effective method called FFS is proposed for its selection.
A deep CNN model is also used to extract discriminative cues from the FC-7 layer,
which effectively handles large appearance changes. For probe matching, gallery search
optimization is performed using a features-based clustering technique to improve
recognition rates. A cross-bin histogram distance measure is utilized to obtain an
accurate image pair from the cluster(s). Moreover, the proposed framework handles the
challenges of the chosen datasets more efficiently. According to the computed results,
the recognition rates increase significantly at different ranks on the VIPeR, CUHK01,
and iLIDS-VID datasets, which shows the robustness of the FCDF framework.
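To illustrate the cluster-based gallery search idea, a minimal sketch is given below; k-means and a chi-square-style histogram distance are used here only as stand-ins for the consensus clustering and cross-bin measure of FCDF, with placeholder feature vectors.

```python
# Minimal sketch of cluster-based gallery search: gallery features are grouped
# with k-means, the probe is matched only within its assigned cluster, and a
# chi-square-style histogram distance stands in for the cross-bin measure.
# All feature vectors are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

gallery = np.abs(np.random.rand(316, 512))   # placeholder gallery feature vectors
probe = np.abs(np.random.rand(512))          # placeholder probe feature vector

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(gallery)
cluster_id = kmeans.predict(probe[None, :])[0]        # classify the probe's cluster
members = np.where(kmeans.labels_ == cluster_id)[0]   # search only that cluster

ranking = sorted(members, key=lambda i: chi2_distance(probe, gallery[i]))
print("Top-5 gallery matches:", ranking[:5])
```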
framework is tested by comparing its results with various state-of-the-art methods. The
cross-validation results demonstrate that J-LDFR is superior and outperforms existing
pedestrian gender classification methods.
The third proposed method, PGC-FSDTF, obtained superior results on SSS imbalanced,
augmented balanced, and customized balanced datasets for pedestrian gender
classification. Different techniques are implemented in the data preparation module for
image smoothing, noise removal, and balanced distribution of samples in both classes.
Traditional and deep features are extracted, and then FSF is carried out to compute a
single robust feature vector. An SVM classifier with different kernels is applied to
classify the gender image. The experimental outcomes show that the suggested method
outperforms state-of-the-art methods on the original MIT dataset and cross-dataset. In
addition, it achieves noteworthy results on eleven imbalanced, augmented balanced, and
customized balanced datasets.
This thesis has validated the proposed methods on several benchmark pedestrian datasets,
such as VIPeR, CUHK01, iLIDS-VID, PETA, MIT/MIT-IB, MIT-BROS-1, MIT-
BROS-2, MIT-BROS-3, PKU-Reid-IB, PKU-Reid-BROS-1, PKU-Reid-BROS-2,
PKU-Reid-BROS-3, PETA-SSS-1, PETA-SSS-2, VIPeR-SSS, cross-dataset, and
cross-dataset-1 in terms of different evaluation protocols such as CMC, O-ACC, M-
ACC, B-ACC, AUC, F1-score, FPR, TPR, PPV, NPV, TNR, and FNR. It is concluded
that the proposed methods obtained improved results in terms of CMC, accuracy, and
AUC as compared with existing methods for pedestrian ReID and pedestrian gender
classification tasks.
investigated with more robust low-level and high-level feature extraction, feature
optimization, and feature fusion techniques for gender classification. Finally, extensive
pre-processing steps or data augmentation techniques may improve the classification
and recognition results. Researchers may apply shallow and deep learning based
schemes for low-level and deep feature extraction to test other pedestrian analysis tasks
such as pedestrian detection and pedestrian activity and action recognition. This will
further validate the usefulness of the feature engineering implemented in the different
proposed methods. In addition, unsupervised learning can be used to automatically label
pedestrian and non-pedestrian images based on deep features.
Chapter 6 References
[1] E. Cosgrove, "One billion surveillance cameras will be watching around the
world in 2021"
[2] M. P. Ashby, "The value of CCTV surveillance cameras as an investigative tool:
An empirical analysis," European Journal on Criminal Policy and Research,
vol. 23, pp. 441-459, 2017.
[3] M. Souded, "People detection, tracking and re-identification through a video
camera network," 2013.
[4] Q. Ma, Y. Kang, W. Song, Y. Cao, and J. Zhang, "Deep Fundamental Diagram
Network for Real-Time Pedestrian Dynamics Analysis," in Traffic and
Granular Flow 2019, ed: Springer, pp. 195-203, 2020.
[5] Z. Ji, Z. Hu, E. He, J. Han, and Y. Pang, "Pedestrian attribute recognition based
on multiple time steps attention," Pattern Recognition Letters, vol. 138, pp. 170-
176, 2020.
[6] P. Pandey and J. V. Aghav, "Pedestrian Activity Recognition Using 2-D Pose
Estimation for Autonomous Vehicles," in ICT Analysis and Applications, ed:
Springer, pp. 499-506, 2020.
[7] S. Kazeminia, C. Baur, A. Kuijper, B. van Ginneken, N. Navab, S. Albarqouni,
et al., "GANs for medical image analysis," Artificial Intelligence in Medicine,
p. 101938, 2020.
[8] X. Ma, Y. Niu, L. Gu, Y. Wang, Y. Zhao, J. Bailey, et al., "Understanding
adversarial attacks on deep learning based medical image analysis systems,"
Pattern Recognition, p. 107332, 2020.
[9] K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, K. Nikolaou, et al.,
"MedGAN: Medical image translation using GANs," Computerized Medical
Imaging and Graphics, vol. 79, p. 101684, 2020.
[10] D. Caballero, R. Calvini, and J. M. Amigo, "Hyperspectral imaging in crop
fields: precision agriculture," in Data Handling in Science and Technology. vol.
32, ed: Elsevier, pp. 453-473, 2020.
[11] A. Przybylak, R. Kozłowski, E. Osuch, A. Osuch, P. Rybacki, and P.
Przygodziński, "Quality Evaluation of Potato Tubers Using Neural Image
Analysis Method," Agriculture, vol. 10, p. 112, 2020.
[12] H. Tian, T. Wang, Y. Liu, X. Qiao, and Y. Li, "Computer vision technology in
agricultural automation—A review," Information Processing in Agriculture,
vol. 7, pp. 1-19, 2020.
[13] V. Tsakanikas and T. Dagiuklas, "Video surveillance systems-current status and
future trends," Computers & Electrical Engineering, vol. 70, pp. 736-753, 2018.
[14] M. Baqui, M. D. Samad, and R. Löhner, "A novel framework for automated
monitoring and analysis of high density pedestrian flow," Journal of Intelligent
Transportation Systems, vol. 24, pp. 585-597, 2020.
[15] L. Cai, J. Zhu, H. Zeng, J. Chen, and C. Cai, "Deep-learned and hand-crafted
features fusion network for pedestrian gender recognition," in Proceedings of
ELM-2016, ed: Springer, pp. 207-215, 2018.
[16] X. Li, L. Telesca, M. Lovallo, X. Xu, J. Zhang, and W. Song, "Spectral and
informational analysis of pedestrian contact force in simulated overcrowding
conditions," Physica A: Statistical Mechanics and its Applications, p. 124614,
2020.
[17] L. Yang, G. Hu, Y. Song, G. Li, and L. Xie, "Intelligent video analysis: A
Pedestrian trajectory extraction method for the whole indoor space without
blind areas," Computer Vision and Image Understanding, p. 102968, 2020.
[18] C. Zhao, X. Wang, W. Zuo, F. Shen, L. Shao, and D. Miao, "Similarity learning
with joint transfer constraints for person re-identification," Pattern Recognition,
vol. 97, p. 107014, 2020.
[19] L. An, X. Chen, S. Liu, Y. Lei, and S. Yang, "Integrating appearance features
and soft biometrics for person re-identification," Multimedia Tools and
Applications, vol. 76, pp. 12117-12131, 2017.
[20] H. J. Galiyawala, M. S. Raval, and A. Laddha, "Person retrieval in surveillance
videos using deep soft biometrics," in Deep Biometrics, ed: Springer, pp. 191-
214, 2020.
[21] P. P. Sarangi, B. S. P. Mishra, and S. Dehuri, "Fusion of PHOG and LDP local
descriptors for kernel-based ear biometric recognition," Multimedia Tools and
Applications, vol. 78, pp. 9595-9623, 2019.
[22] N. Ahmadi and G. Akbarizadeh, "Iris tissue recognition based on GLDM feature
extraction and hybrid MLPNN-ICA classifier," Neural Computing and
Applications, vol. 32, pp. 2267-2281, 2020.
[23] M. Regouid, M. Touahria, M. Benouis, and N. Costen, "Multimodal biometric
system for ECG, ear and iris recognition based on local descriptors,"
Multimedia Tools and Applications, vol. 78, pp. 22509-22535, 2019.
[24] R. F. Soliman, M. Amin, and F. E. Abd El-Samie, "Cancelable Iris recognition
system based on comb filter," Multimedia Tools and Applications, vol. 79, pp.
2521-2541, 2020.
[25] P. S. Chanukya and T. Thivakaran, "Multimodal biometric cryptosystem for
human authentication using fingerprint and ear," Multimedia Tools and
Applications, vol. 79, pp. 659-673, 2020.
[26] K. M. Sagayam, D. N. Ponraj, J. Winston, E. Jeba, and A. Clara,
"Authentication of Biometric System using Fingerprint Recognition with
Euclidean Distance and Neural Network Classifier," Project: Hand posture and
gesture recognition techniques for virtual reality applications: a survey, 2019.
[27] T. B. Moeslund, S. Escalera, G. Anbarjafari, K. Nasrollahi, and J. Wan,
"Statistical Machine Learning for Human Behaviour Analysis," ed:
Multidisciplinary Digital Publishing Institute, 2020.
[28] S. Gupta, K. Thakur, and M. Kumar, "2D-human face recognition using SIFT
and SURF descriptors of face’s feature regions," The Visual Computer, pp. 1-
10, 2020.
[29] N. Jahan, P. K. Bhuiyan, P. A. Moon, and M. A. Akbar, "Real Time Face
Recognition System with Deep Residual Network and KNN," in 2020
International Conference on Electronics and Sustainable Communication
Systems (ICESC), pp. 1122-1126, 2020.
[30] F. Zhao, J. Li, L. Zhang, Z. Li, and S.-G. Na, "Multi-view face recognition using
deep neural networks," Future Generation Computer Systems, pp.375-380,
2020.
[31] I. Chtourou, E. Fendri, and M. Hammami, "Walking Direction Estimation for
Gait Based Applications," Procedia Computer Science, vol. 126, pp. 759-767,
2018.
[32] A. Derbel, N. Mansouri, Y. B. Jemaa, B. Emile, and S. Treuillet, "Comparative
study between spatio/temporal descriptors for pedestrians recognition by gait,"
in International Conference Image Analysis and Recognition, pp. 35-42, 2013.
[33] M. Huang, H.-Z. Li, and X. Wu, "Three-dimensional pedestrian dead reckoning
method based on gait recognition," International Journal of Simulation and
Process Modelling, vol. 13, pp. 537-547, 2018.
[34] Z. Li, J. Xiong, and X. Ye, "Gait Energy Image Based on Static Region
Alignment for Pedestrian Gait Recognition," in Proceedings of the 3rd
International Conference on Vision, Image and Signal Processing, pp. 1-6,
2019.
[35] Y. Makihara, G. Ogi, and Y. Yagi, "Geometrically Consistent Pedestrian
Trajectory Extraction for Gait Recognition," in 2018 IEEE 9th International
Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1-11,
2018.
[36] B. Wang, T. Su, X. Jin, J. Kong, and Y. Bai, "3D reconstruction of pedestrian
trajectory with moving direction learning and optimal gait recognition,"
Complexity, vol. 2018, 2018.
[37] H. Wang, Y. Fan, B. Fang, and S. Dai, "Generalized linear discriminant analysis
based on euclidean norm for gait recognition," International Journal of
Machine Learning and Cybernetics, vol. 9, pp. 569-576, 2018.
[38] L.-F. Shi, C.-X. Qiu, D.-J. Xin, and G.-X. Liu, "Gait recognition via random
forests based on wearable inertial measurement unit," Journal of Ambient
Intelligence and Humanized Computing, pp. 1-12, 2020.
[39] S. Islam, T. Qasim, M. Yasir, N. Bhatti, H. Mahmood, and M. Zia, "Single-and
two-person action recognition based on silhouette shape and optical point
descriptors," Signal, Image and Video Processing, vol. 12, pp. 853-860, 2018.
[40] S. Govardhan and A. Vasuki, "Wavelet based iterative deformable part model
for pedestrian detection," Multimedia Tools and Applications, pp. 1-15, 2018.
[41] A. Li, L. Liu, K. Wang, S. Liu, and S. Yan, "Clothing attributes assisted person
reidentification," IEEE Transactions on Circuits and Systems for Video
Technology, vol. 25, pp. 869-878, 2015.
[42] S.-M. Li, C. Gao, J.-G. Zhu, and C.-W. Li, "Person Reidentification Using
Attribute-Restricted Projection Metric Learning," IEEE Transactions on
Circuits and Systems for Video Technology, vol. 28, pp. 1765-1776, 2018.
[43] X. Zhang, X.-Y. Jing, X. Zhu, and F. Ma, "Semi-supervised person re-
identification by similarity-embedded cycle GANs," Neural Computing and
Applications, pp. 1-10, 2020.
[44] M. Raza, Z. Chen, S.-U. Rehman, P. Wang, and P. Bao, "Appearance based
pedestrians’ head pose and body orientation estimation using deep learning,"
Neurocomputing, vol. 272, pp. 647-659, 2018.
[45] L. Cai, J. Zhu, H. Zeng, J. Chen, C. Cai, and K.-K. Ma, "Hog-assisted deep
feature learning for pedestrian gender recognition," Journal of the Franklin
Institute, vol. 355, pp. 1991-2008, 2018.
[46] M. Raza, M. Sharif, M. Yasmin, M. A. Khan, T. Saba, and S. L. Fernandes,
"Appearance based pedestrians’ gender recognition by employing stacked auto
encoders in deep learning," Future Generation Computer Systems, vol. 88, pp.
28-39, 2018.
[47] G. Antipov, S.-A. Berrani, N. Ruchaud, and J.-L. Dugelay, "Learned vs. hand-
crafted features for pedestrian gender recognition," in Proceedings of the 23rd
ACM international conference on Multimedia, pp. 1263-1266, 2015.
[48] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Pedestrian gender classification using
combined global and local parts-based convolutional neural networks," Pattern
Analysis and Applications, pp. 1-12, 2018.
[49] E. Yaghoubi, P. Alirezazadeh, E. Assunção, J. C. Neves, and H. Proença,
"Region-Based CNNs for Pedestrian Gender Recognition in Visual
Surveillance Environments," in 2019 International Conference of the
Biometrics Special Interest Group (BIOSIG), pp. 1-5, 2019.
[50] R. Q. Mínguez, I. P. Alonso, D. Fernández-Llorca, and M. Á. Sotelo,
"Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process
Dynamical Models and Pedestrian Activity Recognition," IEEE Transactions
on Intelligent Transportation Systems, pp.1803-1814, 2018.
[51] R. Cui, G. Hua, A. Zhu, J. Wu, and H. Liu, "Hard Sample Mining and Learning
for Skeleton-Based Human Action Recognition and Identification," IEEE
Access, vol. 7, pp. 8245-8257, 2019.
[52] Z. Gao, T.-t. Han, H. Zhang, Y.-b. Xue, and G.-p. Xu, "MMA: a multi-view and
multi-modality benchmark dataset for human action recognition," Multimedia
Tools and Applications, pp. 1-22, 2018.
[53] R. Vezzani, D. Baltieri, and R. Cucchiara, "People reidentification in
surveillance and forensics: A survey," ACM Computing Surveys (CSUR), vol.
46, p. 29, 2013.
[54] Y. Chen, S. Duffner, A. Stoian, J.-Y. Dufour, and A. Baskurt, "Deep and low-
level feature based attribute learning for person re-identification," Image and
Vision Computing, vol. 79, pp. 25-34, 2018.
[55] Y. Sun, M. Zhang, Z. Sun, and T. Tan, "Demographic analysis from biometric
data: Achievements, challenges, and new frontiers," IEEE transactions on
pattern analysis and machine intelligence, vol. 40, pp. 332-351, 2017.
[56] G. Azzopardi, A. Greco, A. Saggese, and M. Vento, "Fusion of domain-specific
and trainable features for gender recognition from face images," IEEE access,
vol. 6, pp. 24171-24183, 2018.
[57] S. Mane and G. Shah, "Facial Recognition, Expression Recognition, and Gender
Identification," in Data Management, Analytics and Innovation, ed: Springer,
pp. 275-290, 2019.
[58] J. Cheng, Y. Li, J. Wang, L. Yu, and S. Wang, "Exploiting effective facial
patches for robust gender recognition," Tsinghua Science and Technology, vol.
24, pp. 333-345, 2019.
[59] A. Geetha, M. Sundaram, and B. Vijayakumari, "Gender classification from
face images by mixing the classifier outcome of prime, distinct descriptors,"
Soft Computing, vol. 23, pp. 2525-2535, 2019.
[60] G. Guo, G. Mu, and Y. Fu, "Gender from body: A biologically-inspired
approach with manifold learning," in Asian Conference on Computer Vision,
pp. 236-245, 2009.
[61] R. Zhao, W. Ouyang, and X. Wang, "Unsupervised salience learning for person
re-identification," in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3586-3593, 2013.
[62] H. Yang, X. Wang, J. Zhu, W. Ma, and H. Su, "Resolution adaptive feature
extracting and fusing framework for person re-identification," Neurocomputing,
vol. 212, pp. 65-74, 2016.
[63] S. Gong, M. Cristani, C. C. Loy, and T. M. Hospedales, "The re-identification
challenge," in Person re-identification, ed: Springer, pp. 1-20, 2014.
[64] X. Jin, C. Lan, W. Zeng, G. Wei, and Z. Chen, "Semantics-Aligned
Representation Learning for Person Re-Identification," in AAAI, pp. 11173-
11180, 2020.
[65] L. An, X. Chen, and S. Yang, "Multi-graph feature level fusion for person re-
identification," Neurocomputing, pp.39-45, 2017.
[66] L. An, X. Chen, S. Yang, and X. Li, "Person Re-identification by Multi-
hypergraph Fusion," IEEE Transactions on Neural Networks and Learning
Systems, pp.2763-2774, 2016.
[67] Q.-Q. Ren, W.-D. Tian, and Z.-Q. Zhao, "Person Re-identification Based on
Feature Fusion," in International Conference on Intelligent Computing, pp. 65-
73, 2019.
[68] X. Wang, C. Zhao, D. Miao, Z. Wei, R. Zhang, and T. Ye, "Fusion of multiple
channel features for person re-identification," Neurocomputing, vol. 213, pp.
125-136, 2016.
[69] L. Cai, H. Zeng, J. Zhu, J. Cao, Y. Wang, and K.-K. Ma, "Cascading Scene and
Viewpoint Feature Learning for Pedestrian Gender Recognition," IEEE Internet
of Things Journal, pp.3014-3026, 2020.
[70] L. Cai, H. Zeng, J. Zhu, J. Cao, J. Hou, and C. Cai, "Multi-view joint learning
network for pedestrian gender classification," in 2017 International Symposium
on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 23-
27, 2017.
[71] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Pedestrian gender classification using
combined global and local parts-based convolutional neural networks," Pattern
Analysis and Applications, vol. 22, pp. 1469-1480, 2019.
[72] L. An, Z. Qin, X. Chen, and S. Yang, "Multi-level common space learning for
person re-identification," IEEE Transactions on Circuits and Systems for Video
Technology, vol. 28, pp. 1777-1787, 2018.
[73] C. Chahla, H. Snoussi, F. Abdallah, and F. Dornaika, "Learned versus
handcrafted features for person re-identification," International Journal of
Pattern Recognition and Artificial Intelligence, vol. 34, p. 2055009, 2020.
[74] X. Gu, T. Ni, W. Wang, and J. Zhu, "Cross-domain transfer person re-
identification via topology properties preserved local fisher discriminant
analysis," Journal of Ambient Intelligence and Humanized Computing, pp. 1-
11, 2020.
[75] H. Li, W. Zhou, Z. Yu, B. Yang, and H. Jin, "Person re-identification with
dictionary learning regularized by stretching regularization and label
consistency constraint," Neurocomputing, vol. 379, pp. 356-369, 2020.
[76] X. Xu, Y. Chen, and Q. Chen, "Dynamic Hybrid Graph Matching for
Unsupervised Video-based Person Re-identification," International Journal on
Artificial Intelligence Tools, vol. 29, p. 2050004, 2020.
[77] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal
occurrence representation and metric learning," in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 2197-2206., 2015.
[78] R. Zhao, W. Ouyang, and X. Wang, "Person re-identification by salience
matching," in Proceedings of the IEEE International Conference on Computer
Vision, pp. 2528-2535, 2013.
[79] R. Zhao, W. Ouyang, and X. Wang, "Learning mid-level filters for person re-
identification," in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 144-151, 2014.
[80] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video
ranking," in European Conference on Computer Vision, pp. 688-703, 2014.
[81] Y. Geng, H.-M. Hu, G. Zeng, and J. Zheng, "A person re-identification
algorithm by exploiting region-based feature salience," Journal of Visual
Communication and Image Representation, vol. 29, pp. 89-102, 2015.
[82] L. An, S. Yang, and B. Bhanu, "Person re-identification by robust canonical
correlation analysis," IEEE signal processing letters, vol. 22, pp. 1103-1107,
2015.
[83] L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, and Q. Tian, "Query-adaptive late
fusion for image search and person re-identification," in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 1741-1750,
2015.
[84] L. An, X. Chen, S. Yang, and X. Li, "Person re-identification by multi-
hypergraph fusion," IEEE transactions on neural networks and learning
systems, vol. 28, pp. 2763-2774, 2017.
[85] X. Yang, M. Wang, R. Hong, Q. Tian, and Y. Rui, "Enhancing Person Re-
identification in a Self-trained Subspace," arXiv preprint arXiv:1704.06020,
2017.
[86] Y.-J. Cho and K.-J. Yoon, "Improving person re-identification via pose-aware
multi-shot matching," in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1354-1362, 2016.
[87] L. Zhang, T. Xiang, and S. Gong, "Learning a discriminative null space for
person re-identification," in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1239-1248, 2016.
[88] L. An, M. Kafai, S. Yang, and B. Bhanu, "Person reidentification with reference
descriptor," IEEE Transactions on Circuits and Systems for Video Technology,
vol. 26, pp. 776-787, 2016.
[89] L. An, X. Chen, S. Yang, and B. Bhanu, "Sparse representation matching for
person re-identification," Information Sciences, vol. 355, pp. 74-89, 2016.
[90] C. Chahla, H. Snoussi, F. Abdallah, and F. Dornaika, "Discriminant quaternion
local binary pattern embedding for person re-identification through prototype
formation and color categorization," Engineering Applications of Artificial
Intelligence, vol. 58, pp. 27-33, 2017.
[91] T. Li, L. Sun, C. Han, and J. Guo, "Salient Region-Based Least-Squares Log-
Density Gradient Clustering for Image-To-Video Person Re-Identification,"
IEEE ACCESS, vol. 6, pp. 8638-8648, 2018.
[92] L. Zhang, K. Li, Y. Zhang, Y. Qi, and L. Yang, "Adaptive image segmentation
based on color clustering for person re-identification," Soft Computing, vol. 21,
pp. 5729-5739, 2017.
[93] J. H. Shah, M. Lin, and Z. Chen, "Multi-camera handoff for person re-
identification," Neurocomputing, vol. 191, pp. 238-248, 2016.
[94] H. Chu, M. Qi, H. Liu, and J. Jiang, "Local region partition for person re-
identification," Multimedia Tools and Applications, pp. 1-17, 2017.
[95] A. Nanda, P. K. Sa, S. K. Choudhury, S. Bakshi, and B. Majhi, "A neuromorphic
person re-identification framework for video surveillance," IEEE Access, vol.
5, pp. 6471-6482, 2017.
[96] A. Nanda, P. K. Sa, D. S. Chauhan, and B. Majhi, "A person re-identification
framework by inlier-set group modeling for video surveillance," Journal of
Ambient Intelligence and Humanized Computing, vol. 10, pp. 13-25, 2019.
[97] D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an
ensemble of localized features," Computer Vision–ECCV 2008, pp. 262-275,
2008.
[98] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, "Person re-
identification by symmetry-driven accumulation of local features," in Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2360-
2367, 2010.
[99] X. Ye, W.-y. Zhou, and L.-a. Dong, "Body Part-Based Person Re-identification
Integrating Semantic Attributes," Neural Processing Letters, vol. 49, pp. 1111-
1124, 2019.
[100] J. Dai, Y. Zhang, H. Lu, and H. Wang, "Cross-view semantic projection learning
for person re-identification," Pattern Recognition, vol. 75, pp. 63-76, 2018.
[101] C.-H. Kuo, S. Khamis, and V. Shet, "Person re-identification using semantic
color names and rankboost," in Applications of Computer Vision (WACV), 2013
IEEE Workshop on, pp. 281-287, 2013.
[102] X. Ye, W.-y. Zhou, and L.-a. Dong, "Body Part-Based Person Re-identification
Integrating Semantic Attributes," Neural Processing Letters, pp. 1-14, 2018.
[103] I. Kviatkovsky, A. Adam, and E. Rivlin, "Color invariants for person
reidentification," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 35, pp. 1622-1634, 2013.
[104] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li, "Salient color names for
person re-identification," in European conference on computer vision, pp. 536-
551, 2014.
[105] F. Xiong, M. Gou, O. Camps, and M. Sznaier, "Person re-identification using
kernel-based metric learning methods," in European conference on computer
vision, pp. 1-16, 2014.
[106] Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai, "Person re-identification by
camera correlation aware feature augmentation," IEEE transactions on pattern
analysis and machine intelligence, vol. 40, pp. 392-408, 2018.
[107] N. M. Z. Hashim, Y. Kawanishi, D. Deguchi, I. Ide, and H. Murase,
"Simultaneous image matching for person re‐identification via the stable
marriage algorithm," IEEJ Transactions on Electrical and Electronic
Engineering, vol. 15, pp. 909-917, 2020.
[108] W. R. Schwartz and L. S. Davis, "Learning discriminative appearance-based
models using partial least squares," in Computer Graphics and Image
Processing (SIBGRAPI), 2009 XXII Brazilian Symposium on, pp. 322-329,
2009.
[109] W.-S. Zheng, S. Gong, and T. Xiang, "Person re-identification by probabilistic
relative distance comparison," in Computer vision and pattern recognition
(CVPR), 2011 IEEE conference on, pp. 649-656, 2011.
[110] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, "Local fisher
discriminant analysis for pedestrian re-identification," in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 3318-3325,
2013.
[111] Y. Li, Z. Wu, S. Karanam, and R. J. Radke, "Multi-Shot Human Re-
Identification Using Adaptive Fisher Discriminant Analysis," in BMVC, p. 2,
2015.
[112] S. Bąk and P. Carr, "Person re-identification using deformable patch metric
learning," in 2016 IEEE Winter Conference on Applications of Computer Vision
(WACV), pp. 1-9, 2016.
[113] W. Li, R. Zhao, and X. Wang, "Human reidentification with transferred metric
learning," in Asian Conference on Computer Vision, pp. 31-44, 2012.
[114] W.-S. Zheng, S. Gong, and T. Xiang, "Reidentification by relative distance
comparison," IEEE transactions on pattern analysis and machine intelligence,
vol. 35, pp. 653-668, 2013.
[115] Y. Xie, H. Yu, X. Gong, and M. D. Levine, "Adaptive Metric Learning and
Probe-Specific Reranking for Person Reidentification," IEEE Signal Processing
Letters, vol. 24, pp. 853-857, 2017.
[116] X. Liu, X. Ma, J. Wang, and H. Wang, "M3L: Multi-modality mining for metric
learning in person re-Identification," Pattern Recognition, 2017.
[117] J. Li, A. J. Ma, and P. C. Yuen, "Semi-supervised Region Metric Learning for
Person Re-identification," International Journal of Computer Vision, pp. 1-20,
2018.
[118] F. Ma, X. Zhu, X. Zhang, L. Yang, M. Zuo, and X.-Y. Jing, "Low illumination
person re-identification," Multimedia Tools and Applications, vol. 78, pp. 337-
362, 2019.
[119] G. Feng, W. Liu, D. Tao, and Y. Zhou, "Hessian Regularized Distance Metric
Learning for People Re-Identification," Neural Processing Letters, pp. 1-14,
2019.
[120] J. Jia, Q. Ruan, Y. Jin, G. An, and S. Ge, "View-specific Subspace Learning and
Re-ranking for Semi-supervised Person Re-identification," Pattern
Recognition, p. 107568, 2020.
[121] Q. Zhou, S. Zheng, H. Ling, H. Su, and S. Wu, "Joint dictionary and metric
learning for person re-identification," Pattern Recognition, vol. 72, pp. 196-206,
2017.
[122] T. Ni, Z. Ding, F. Chen, and H. Wang, "Relative Distance Metric Leaning Based
on Clustering Centralization and Projection Vectors Learning for Person Re-
Identification," IEEE Access, vol. 6, pp. 11405-11411, 2018.
[123] C. Zhao, Y. Chen, Z. Wei, D. Miao, and X. Gu, "QRKISS: a two-stage metric
learning via QR-decomposition and KISS for person re-identification," Neural
Processing Letters, vol. 49, pp. 899-922, 2019.
[124] W. Ma, H. Han, Y. Kong, and Y. Zhang, "A New Date-Balanced Method Based
on Adaptive Asymmetric and Diversity Regularization in Person Re-
Identification," International Journal of Pattern Recognition and Artificial
Intelligence, vol. 34, p. 2056004, 2020.
[125] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo, "Person re-
identification by iterative re-weighted sparse ranking," IEEE transactions on
pattern analysis and machine intelligence, vol. 37, pp. 1629-1642, 2015.
[126] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by
discriminative selection in video ranking," IEEE transactions on pattern
analysis and machine intelligence, vol. 38, pp. 2501-2514, 2016.
[127] L. An, X. Chen, and S. Yang, "Person re-identification via hypergraph-based
matching," Neurocomputing, vol. 182, pp. 247-254, 2016.
[128] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with
deep convolutional neural networks," in Advances in neural information
processing systems, pp. 1097-1105, 2012.
[129] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional
networks for accurate object detection and segmentation," IEEE transactions
on pattern analysis and machine intelligence, vol. 38, pp. 142-158, 2016.
[130] R. R. Varior, M. Haloi, and G. Wang, "Gated siamese convolutional neural
network architecture for human re-identification," in European conference on
computer vision, pp. 791-808, 2016.
[131] Z. Zhang and T. Si, "Learning deep features from body and parts for person re-
identification in camera networks," EURASIP Journal on Wireless
Communications and Networking, vol. 2018, p. 52, 2018.
[132] Z. Zhang and M. Huang, "Learning local embedding deep features for person
re-identification in camera networks," EURASIP Journal on Wireless
Communications and Networking, vol. 2018, p. 85, 2018.
[133] Y. Huang, H. Sheng, Y. Zheng, and Z. Xiong, "DeepDiff: Learning deep
difference features on human body parts for person re-identification,"
Neurocomputing, vol. 241, pp. 191-203, 2017.
[134] X. Xin, J. Wang, R. Xie, S. Zhou, W. Huang, and N. Zheng, "Semi-supervised
person Re-Identification using multi-view clustering," Pattern Recognition, vol.
88, pp. 285-297, 2019.
[135] G. Wang, J. Lai, and X. Xie, "P2snet: Can an image match a video for person
re-identification in an end-to-end way?," IEEE Transactions on Circuits and
Systems for Video Technology, vol. 28, pp. 2777-2787, 2017.
[136] C. Yuan, C. Xu, T. Wang, F. Liu, Z. Zhao, P. Feng, et al., "Deep multi-instance
learning for end-to-end person re-identification," Multimedia Tools and
Applications, vol. 77, pp. 12437-12467, 2018.
[137] J. Nie, L. Huang, W. Zhang, G. Wei, and Z. Wei, "Deep Feature Ranking for
Person Re-Identification," IEEE Access, vol. 7, pp. 15007-15017, 2019.
[138] Y. Hu, X. Cai, D. Huang, and Y. Liu, "LWA: A lightweight and accurate
method for unsupervised Person Re-identification," in 2019 6th International
Conference on Systems and Informatics (ICSAI), pp. 1314-1318, 2019.
[139] R. Aburasain, "Application of convolutional neural networks in object
detection, re-identification and recognition," Loughborough University, 2020.
[140] W. Zhang, L. Huang, Z. Wei, and J. Nie, "Appearance feature enhancement for
person re-identification," Expert Systems with Applications, vol. 163, p. 113771,
2021.
[141] X. Zhang, S. Li, X.-Y. Jing, F. Ma, and C. Zhu, "Unsupervised domain adaption
for image-to-video person re-identification," Multimedia Tools and
Applications, pp. 1-18, 2020.
[142] J. Lv, W. Chen, Q. Li, and C. Yang, "Unsupervised cross-dataset person re-
identification by transfer learning of spatial-temporal patterns," in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7948-
7956, 2018.
[143] M. Afifi and A. Abdelhamed, "AFIF4: deep gender classification based on
adaboost-based fusion of isolated facial features and foggy faces," Journal of
Visual Communication and Image Representation, vol. 62, pp. 77-86, 2019.
[144] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "A review of facial gender recognition,"
Pattern Analysis and Applications, vol. 18, pp. 739-755, 2015.
[145] M. Raza, C. Zonghai, S. U. Rehman, G. Zhenhua, W. Jikai, and B. Peng, "Part-
wise pedestrian gender recognition via deep convolutional neural networks,"
2017.
[146] C. Zhao, X. Wang, W. K. Wong, W. Zheng, J. Yang, and D. Miao, "Multiple
Metric Learning based on Bar-shape Descriptor for Person Re-Identification,"
Pattern Recognition, 2017.
[147] X. Gao, F. Gao, D. Tao, and X. Li, "Universal blind image quality assessment
metrics via natural scene statistics and multiple kernel learning," IEEE
transactions on neural networks and learning systems, vol. 24, 2013.
[148] C. BenAbdelkader and P. Griffin, "A local region-based approach to gender
classification from face images," in 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR'05)-Workshops, pp. 52-
52, 2005.
[149] E. Eidinger, R. Enbar, and T. Hassner, "Age and gender estimation of unfiltered
faces," IEEE Transactions on Information Forensics and Security, vol. 9, pp.
2170-2179, 2014.
[150] H. A. Ahmed, T. A. Rashid, and A. Sidiq, "Face behavior recognition through
support vector machines," International Journal of Advanced Computer Science
and Applications, vol. 7, pp. 101-108, 2016.
[151] N. Sun, W. Zheng, C. Sun, C. Zou, and L. Zhao, "Gender classification based
on boosting local binary pattern," in International Symposium on Neural
Networks, pp. 194-201, 2006.
[152] C. Shan, "Learning local binary patterns for gender classification on real-world
face images," Pattern recognition letters, vol. 33, pp. 431-437, 2012.
[153] J.-G. Wang, J. Li, W.-Y. Yau, and E. Sung, "Boosting dense SIFT descriptors
and shape contexts of face images for gender recognition," in 2010 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition-
Workshops, pp. 96-102, 2010.
[154] J. Bekios-Calfa, J. M. Buenaposada, and L. Baumela, "Robust gender
recognition by exploiting facial attributes dependencies," Pattern Recognition
Letters, vol. 36, pp. 228-234, 2014.
[155] B. Patel, R. Maheshwari, and B. Raman, "Compass local binary patterns for
gender recognition of facial photographs and sketches," Neurocomputing, vol.
218, pp. 203-215, 2016.
[156] X. Li, X. Zhao, Y. Fu, and Y. Liu, "Bimodal gender recognition from face and
fingerprint," in 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pp. 2590-2597, 2010.
[157] S. E. Bekhouche, A. Ouafi, F. Dornaika, A. Taleb-Ahmed, and A. Hadid,
"Pyramid multi-level features for facial demographic estimation," Expert
Systems with Applications, vol. 80, pp. 297-310, 2017.
[158] C. P. Divate and S. Z. Ali, "Study of Different Bio-Metric Based Gender
Classification Systems," in 2018 International Conference on Inventive
Research in Computing Applications (ICIRCA), pp. 347-353, 2018.
[159] A. M. Ali and T. A. Rashid, "Kernel Visual Keyword Description for Object
and Place Recognition," in Advances in Signal Processing and Intelligent
Recognition Systems, ed: Springer, pp. 27-38, 2016.
[160] B. Moghaddam and M.-H. Yang, "Learning gender with support faces," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 707-
711, 2002.
[161] J. Bekios-Calfa, J. M. Buenaposada, and L. Baumela, "Revisiting linear
discriminant techniques in gender recognition," IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 33, pp. 858-864, 2010.
[162] M. Duan, K. Li, C. Yang, and K. Li, "A hybrid deep learning CNN–ELM for
age and gender classification," Neurocomputing, vol. 275, pp. 448-461, 2018.
[163] A. Dhomne, R. Kumar, and V. Bhan, "Gender Recognition Through Face Using
Deep Learning," Procedia computer science, vol. 132, pp. 2-10, 2018.
[164] K. Zhang, L. Tan, Z. Li, and Y. Qiao, "Gender and smile classification using
deep convolutional neural networks," in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops, pp. 34-38, 2016.
[165] J. Mansanet, A. Albiol, and R. Paredes, "Local deep neural networks for gender
recognition," Pattern Recognition Letters, vol. 70, pp. 80-86, 2016.
[166] L. Cao, M. Dikmen, Y. Fu, and T. S. Huang, "Gender recognition from body,"
in Proceedings of the 16th ACM international conference on Multimedia, pp.
725-728, 2008.
[167] K. Ahmad, A. Sohail, N. Conci, and F. De Natale, "A Comparative study of
Global and Deep Features for the analysis of user-generated natural disaster
related images," in 2018 IEEE 13th Image, Video, and Multidimensional Signal
Processing Workshop (IVMSP), pp. 1-5, 2018.
[168] M. Collins, J. Zhang, P. Miller, and H. Wang, "Full body image feature
representations for gender profiling," in 2009 IEEE 12th International
Conference on Computer Vision Workshops, ICCV Workshops, pp. 1235-1242,
2009.
[169] C. D. Geelen, R. G. Wijnhoven, and G. Dubbelman, "Gender classification in
low-resolution surveillance video: in-depth comparison of random forests and
SVMs," in Video Surveillance and Transportation Imaging Applications, p.
94070M, 2015.
[170] V. A. Sindagi and V. M. Patel, "A survey of recent advances in cnn-based single
image crowd counting and density estimation," Pattern Recognition Letters,
vol. 107, pp. 3-16, 2018.
[171] C. Li, J. Guo, F. Porikli, and Y. Pang, "Lightennet: A convolutional neural
network for weakly illuminated image enhancement," Pattern Recognition
Letters, vol. 104, pp. 15-22, 2018.
[172] M. Rashid, M. A. Khan, M. Sharif, M. Raza, M. M. Sarfraz, and F. Afza,
"Object detection and classification: a joint selection and fusion strategy of deep
convolutional neural network and SIFT point features," Multimedia Tools and
Applications, vol. 78, pp. 15751-15777, 2019.
[173] M. A. Khan, T. Akram, M. Sharif, M. Awais, K. Javed, H. Ali, et al., "CCDF:
Automatic system for segmentation and recognition of fruit crops diseases
based on correlation coefficient and deep CNN features," Computers and
electronics in agriculture, vol. 155, pp. 220-236, 2018.
[174] M. Sharif, M. Attique Khan, M. Rashid, M. Yasmin, F. Afza, and U. J. Tanik,
"Deep CNN and geometric features-based gastrointestinal tract diseases
detection and classification from wireless capsule endoscopy images," Journal
of Experimental & Theoretical Artificial Intelligence, pp. 1-23, 2019.
[175] T. A. Rashid, P. Fattah, and D. K. Awla, "Using accuracy measure for
improving the training of lstm with metaheuristic algorithms," Procedia
Computer Science, vol. 140, pp. 324-333, 2018.
[176] T. A. Rashid, "Convolutional neural networks based method for improving
facial expression recognition," in The International Symposium on Intelligent
Systems Technologies and Applications, pp. 73-84, 2016.
[177] A. S. Shamsaldin, P. Fattah, T. A. Rashid, and N. K. Al-Salihi, "A Study of The
Convolutional Neural Networks Applications," UKH Journal of Science and
Engineering, vol. 3, pp. 31-40, 2019.
[178] M. A. Uddin and Y.-K. Lee, "Feature fusion of deep spatial features and
handcrafted spatiotemporal features for human action recognition," Sensors,
vol. 19, p. 1599, 2019.
[179] X. Fan and T. Tjahjadi, "Fusing dynamic deep learned features and handcrafted
features for facial expression recognition," Journal of Visual Communication
and Image Representation, vol. 65, p. 102659, 2019.
[180] M.-I. Georgescu, R. T. Ionescu, and M. Popescu, "Local learning with deep and
handcrafted features for facial expression recognition," IEEE Access, vol. 7, pp.
64827-64836, 2019.
[181] A. M. Hasan, H. A. Jalab, F. Meziane, H. Kahtan, and A. S. Al-Ahmad,
"Combining deep and handcrafted image features for MRI brain scan
classification," IEEE Access, vol. 7, pp. 79959-79967, 2019.
[182] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Comparing image representations for
training a convolutional neural network to classify gender," in 2013 1st
International Conference on Artificial Intelligence, Modelling and Simulation,
pp. 29-33, 2013.
[183] C.-B. Ng, Y.-H. Tay, and B.-M. Goi, "Training strategy for convolutional neural
networks in pedestrian gender classification," in Second International
Workshop on Pattern Recognition, p. 104431A, 2017.
[184] M. Tkalcic and J. F. Tasic, Colour spaces: perceptual, historical and
applicational background vol. 1: IEEE, 2003.
[185] A. Hu, R. Zhang, D. Yin, and Y. Zhan, "Image quality assessment using a SVD-
based structural projection," Signal Processing: Image Communication, vol. 29,
pp. 293-302, 2014.
[186] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and
rotation invariant texture classification with local binary patterns," IEEE
Transactions on pattern analysis and machine intelligence, vol. 24, pp. 971-
987, 2002.
[187] M. Verma, B. Raman, and S. Murala, "Local extrema co-occurrence pattern for
color and texture image retrieval," Neurocomputing, vol. 165, pp. 255-269,
2015.
[188] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection,"
in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, pp. 886-893, 2005.
[189] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented
histograms of flow and appearance," in European conference on computer
vision, pp. 428-441, 2006.
[190] S. Zhang and X. Wang, "Human detection and object tracking based on
Histograms of Oriented Gradients," in Natural Computation (ICNC), 2013
Ninth International Conference on, pp. 1349-1353, 2013.
[191] O. Déniz, G. Bueno, J. Salido, and F. De la Torre, "Face recognition using
histograms of oriented gradients," Pattern Recognition Letters, vol. 32, pp.
1598-1603, 2011.
[192] K. Seemanthini and S. Manjunath, "Human Detection and Tracking using HOG
for Action Recognition," Procedia Computer Science, vol. 132, pp. 1317-1326,
2018.
[193] M. Dash and P. W. Koot, "Feature selection for clustering," in Encyclopedia of
database systems, ed: Springer, pp. 1119-1125, 2009.
[194] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Advances
in neural information processing systems, pp. 1601-1608, 2005.
[195] S. Ben-David, D. Pál, and H. U. Simon, "Stability of k-means clustering," in
International Conference on Computational Learning Theory, pp. 20-34, 2007.
[196] C. Zhong, X. Yue, Z. Zhang, and J. Lei, "A clustering ensemble: Two-level-
refined co-association matrix with path-based transformation," Pattern
Recognition, vol. 48, pp. 2699-2709, 2015.
[197] Y. D. Zhang, S. Chen, S. H. Wang, J. F. Yang, and P. Phillips, "Magnetic
resonance brain image classification based on weighted‐type fractional Fourier
transform and nonparallel support vector machine," International Journal of
Imaging Systems and Technology, vol. 25, pp. 317-327, 2015.
[198] L. Chen, C. P. Chen, and M. Lu, "A multiple-kernel fuzzy c-means algorithm
for image segmentation," IEEE Transactions on Systems, Man, and
Cybernetics, Part B (Cybernetics), vol. 41, pp. 1263-1274, 2011.
[199] S. Wu, Y.-C. Chen, X. Li, A.-C. Wu, J.-J. You, and W.-S. Zheng, "An enhanced
deep feature representation for person re-identification," in Applications of
Computer Vision (WACV), 2016 IEEE Winter Conference on, pp. 1-8, 2016.
[200] Y. Gan, "Facial Expression Recognition Using Convolutional Neural Network,"
in Proceedings of the 2nd International Conference on Vision, Image and Signal
Processing, p. 29, 2018.
[201] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., "Going
deeper with convolutions," in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1-9, 2015.
[202] O. Pele and M. Werman, "The quadratic-chi histogram distance family," in
European conference on computer vision, pp. 749-762, 2010.
[203] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection,"
2005.
[204] Z. Qi, Y. Tian, and Y. Shi, "Efficient railway tracks detection and turnouts
recognition method using HOG features," Neural Computing and Applications,
vol. 23, pp. 245-254, 2013.
[205] K. W. Chee and S. S. Teoh, "Pedestrian Detection in Visual Images Using
Combination of HOG and HOM Features," in 10th International Conference on
Robotics, Vision, Signal Processing and Power Applications, pp. 591-597,
2019.
[206] Y. Wei, Q. Tian, J. Guo, W. Huang, and J. Cao, "Multi-vehicle detection
algorithm through combining Harr and HOG features," Mathematics and
Computers in Simulation, vol. 155, pp. 130-145, 2019.
[207] K. Firuzi, M. Vakilian, B. T. Phung, and T. R. Blackburn, "Partial discharges
pattern recognition of transformer defect model by LBP & HOG features," IEEE
Transactions on Power Delivery, vol. 34, pp. 542-550, 2018.
[208] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, "Joint detection and identification
feature learning for person search," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 3415-3424, 2017.
[209] J. Xu, L. Luo, C. Deng, and H. Huang, "Bilevel distance metric learning for
robust image recognition," in Advances in Neural Information Processing
Systems, pp. 4198-4207, 2018.
[210] I. N. Junejo, "A Deep Learning Based Multi-color Space Approach for
Pedestrian Attribute Recognition," in Proceedings of the 2019 3rd International
Conference on Graphics and Signal Processing, pp. 113-116, 2019.
[211] S. Liao, G. Zhao, V. Kellokumpu, M. Pietikäinen, and S. Z. Li, "Modeling pixel
process with scale invariant local patterns for background subtraction in
complex scenes," in 2010 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pp. 1301-1306, 2010.
[212] S. Arora and M. Bhatia, "A Robust Approach for Gender Recognition Using
Deep Learning," in 2018 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), pp. 1-6, 2018.
[213] R. D. Labati, E. Muñoz, V. Piuri, R. Sassi, and F. Scotti, "Deep-ECG:
Convolutional neural networks for ECG biometric recognition," Pattern
Recognition Letters, 2018.
[214] E.-J. Cheng, K.-P. Chou, S. Rajora, B.-H. Jin, M. Tanveer, C.-T. Lin, et al.,
"Deep Sparse Representation Classifier for facial recognition and detection
system," Pattern Recognition Letters, vol. 125, pp. 71-77, 2019.
[215] M. Fayyaz, M. Yasmin, M. Sharif, J. H. Shah, M. Raza, and T. Iqbal, "Person
re-identification with features-based clustering and deep features," Neural
Computing and Applications, pp. 1-22, 2019.
[216] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural
networks for the scene classification of high-resolution remote sensing
imagery," Remote Sensing, vol. 7, pp. 14680-14707, 2015.
[217] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image
recognition," in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 770-778, 2016.
[218] M. A. Khan, T. Akram, M. Sharif, M. Y. Javed, N. Muhammad, and M. Yasmin,
"An implementation of optimized framework for action classification using
multilayers neural network on selected fused features," Pattern Analysis and
Applications, pp. 1-21, 2018.
[219] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text
classification," in IJCAI-99 workshop on machine learning for information
filtering, pp. 61-67, 1999.
[220] C. L. Morais, K. M. Lima, and F. L. Martin, "Uncertainty estimation and
misclassification probability for classification models based on discriminant
analysis and support vector machines," Analytica Chimica Acta, vol. 1063, pp.
40-46, 2019.
[221] K. Radhika and S. Varadarajan, "Ensemble Subspace Discriminant
Classification of Satellite Images," 2018.
[222] A. S. Shirkhorshidi, S. Aghabozorgi, and T. Y. Wah, "A comparison study on
similarity and dissimilarity measures in clustering continuous data," PLoS ONE,
vol. 10, 2015.
[223] K. Lekdioui, R. Messoussi, Y. Ruichek, Y. Chaabi, and R. Touahni, "Facial
decomposition for expression recognition using texture/shape descriptors and
SVM classifier," Signal Processing: Image Communication, vol. 58, pp. 300-
312, 2017.
[224] X.-X. Niu and C. Y. Suen, "A novel hybrid CNN–SVM classifier for
recognizing handwritten digits," Pattern Recognition, vol. 45, pp. 1318-1325,
2012.
[225] L. Ma, H. Liu, L. Hu, C. Wang, and Q. Sun, "Orientation driven bag of
appearances for person re-identification," arXiv preprint arXiv:1605.02464,
2016.
[226] W.-L. Wei, J.-C. Lin, Y.-Y. Lin, and H.-Y. M. Liao, "What makes you look like
you: Learning an inherent feature representation for person re-identification,"
in 2019 16th IEEE International Conference on Advanced Video and Signal
Based Surveillance (AVSS), pp. 1-6, 2019.
[227] D. Moctezuma, E. S. Tellez, S. Miranda-Jiménez, and M. Graff, "Appearance
model update based on online learning and soft-biometrics traits for people re-
identification in multi-camera environments," IET Image Processing, vol. 13,
pp. 2162-2168, 2019.
[228] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for
deep learning," Journal of Big Data, vol. 6, p. 60, 2019.
[229] W. Hou, Y. Wei, Y. Jin, and C. Zhu, "Deep features based on a DCNN model
for classifying imbalanced weld flaw types," Measurement, vol. 131, pp. 482-
489, 2019.
[230] N. Hussain, M. A. Khan, M. Sharif, S. A. Khan, A. A. Albesher, T. Saba, et al.,
"A deep neural network and classical features based scheme for objects
recognition: an application for machine inspection," Multimedia Tools and
Applications, https://doi.org/10.1007/s11042-020-08852-3, 2020.
[231] N. Gour and P. Khanna, "Automated glaucoma detection using GIST and
pyramid histogram of oriented gradients (PHOG) descriptors," Pattern
Recognition Letters, 2019.
[232] I. Murtza, D. Abdullah, A. Khan, M. Arif, and S. M. Mirza, "Cortex-inspired
multilayer hierarchy based object detection system using PHOG descriptors and
ensemble classification," The Visual Computer, vol. 33, pp. 99-112, 2017.
[233] W. El-Tarhouni, L. Boubchir, M. Elbendak, and A. Bouridane, "Multispectral
palmprint recognition using Pascal coefficients-based LBP and PHOG
descriptors with random sampling," Neural Computing and Applications, vol.
31, pp. 593-603, 2019.
[234] D. A. Abdullah, M. H. Akpinar, and A. Sengür, "Local feature descriptors based
ECG beat classification," Health Information Science and Systems, vol. 8, p. 20, 2020.
[235] T. Xu, H. Zhang, C. Xin, E. Kim, L. R. Long, Z. Xue, et al., "Multi-feature
based benchmark for cervical dysplasia classification evaluation," Pattern
Recognition, vol. 63, pp. 468-475, 2017.
[236] P. Liu, J.-M. Guo, K. Chamnongthai, and H. Prasetyo, "Fusion of color
histogram and LBP-based features for texture image retrieval and
classification," Information Sciences, vol. 390, pp. 95-111, 2017.
[237] Y. Mistry, D. Ingole, and M. Ingole, "Content based image retrieval using
hybrid features and various distance metric," Journal of Electrical Systems and
Information Technology, vol. 5, pp. 874-888, 2018.
[238] N. Danapur, S. A. A. Dizaj, and V. Rostami, "An efficient image retrieval based
on an integration of HSV, RLBP, and CENTRIST features using ensemble
classifier learning," Multimedia Tools and Applications, pp. 1-24, 2020.
[239] H. Fujiyoshi, T. Hirakawa, and T. Yamashita, "Deep learning-based image
recognition for autonomous driving," IATSS Research, vol. 43, pp. 244-252,
2019.
[240] J.-H. Choi and J.-S. Lee, "EmbraceNet: A robust deep learning architecture for
multimodal classification," Information Fusion, vol. 51, pp. 259-270, 2019.
[241] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, "Resunet-a: a deep
learning framework for semantic segmentation of remotely sensed data," ISPRS
Journal of Photogrammetry and Remote Sensing, vol. 162, pp. 94-114, 2020.
[242] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, et al., "Deep learning
for generic object detection: A survey," International Journal of Computer
Vision, vol. 128, pp. 261-318, 2020.
[243] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, et al., "Imagenet
large scale visual recognition challenge," International Journal of Computer
Vision, vol. 115, pp. 211-252, 2015.
[244] M. Toğaçar, B. Ergen, Z. Cömert, and F. Özyurt, "A deep feature learning
model for pneumonia detection applying a combination of mRMR feature
selection and machine learning models," IRBM, 2019.
[245] B. Yuan, L. Han, X. Gu, and H. Yan, "Multi-deep features fusion for high-
resolution remote sensing image scene classification," Neural Computing and
Applications, pp. 1-17, 2020.
[246] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-
ResNet and the impact of residual connections on learning," arXiv preprint
arXiv:1602.07261, 2016.
[247] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely
connected convolutional networks," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.
[248] I. Lizarazo, "SVM‐based segmentation and classification of remotely sensed
data," International Journal of Remote Sensing, vol. 29, pp. 7277-7283, 2008.
[249] B. S. Bhati and C. Rai, "Analysis of Support Vector Machine-based Intrusion
Detection Techniques," Arabian Journal for Science and Engineering, vol. 45,
pp. 2371-2383, 2020.
[250] V. E. Liong, J. Lu, and Y. Ge, "Regularized local metric learning for person re-
identification," Pattern Recognition Letters, vol. 68, pp. 288-296, 2015.
[251] X. Yang and P. Chen, "Person re-identification based on multi-scale
convolutional network," Multimedia Tools and Applications, pp. 1-15, 2019.
[252] H. Wang, D. Oneata, J. Verbeek, and C. Schmid, "A robust and efficient video
representation for action recognition," International Journal of Computer
Vision, vol. 119, pp. 219-238, 2016.
[253] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for
action recognition," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, pp. 1510-1517, 2018.
[254] C. Su, S. Zhang, F. Yang, G. Zhang, Q. Tian, W. Gao, et al., "Attributes driven
tracklet-to-tracklet person re-identification using latent prototypes space
mapping," Pattern Recognition, vol. 66, pp. 4-15, 2017.
[255] S. Karanam, Y. Li, and R. J. Radke, "Person re-identification with
discriminatively trained viewpoint invariant dictionaries," in Proceedings of the
IEEE International Conference on Computer Vision, pp. 4516-4524, 2015.
[256] X. Yang, M. Wang, R. Hong, Q. Tian, and Y. Rui, "Enhancing person re-
identification in a self-trained subspace," ACM Transactions on Multimedia
Computing, Communications, and Applications (TOMM), vol. 13, pp. 1-23,
2017.
[257] C. Ying and X. Xiaoyue, "Matrix Metric Learning for Person Re-identification
Based on Bidirectional Reference Set," vol. 42, pp. 394-402, 2020.
[258] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, et al., "Unsupervised
cross-dataset transfer learning for person re-identification," in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1306-
1315, 2016.
[259] D. Zhang, W. Wu, H. Cheng, R. Zhang, Z. Dong, and Z. Cai, "Image-to-video
person re-identification with temporally memorized similarity learning," IEEE
Transactions on Circuits and Systems for Video Technology, vol. 28, pp. 2622-
2632, 2017.
[260] X. Zhu, X.-Y. Jing, X. You, W. Zuo, S. Shan, and W.-S. Zheng, "Image to video
person re-identification by learning heterogeneous dictionary pair with feature
projection matrix," IEEE Transactions on Information Forensics and Security,
vol. 13, pp. 717-732, 2017.
[261] B. Yu and N. Xu, "Urgent image-to-video person reidentification by cross-
media transfer cycle generative adversarial networks," Journal of Electronic
Imaging, vol. 28, p. 013052, 2019.
[262] F. Ma, X. Zhu, Q. Liu, C. Song, X.-Y. Jing, and D. Ye, "Multi-view coupled
dictionary learning for person re-identification," Neurocomputing, vol. 348, pp.
16-26, 2019.
[263] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at
far distance," in Proceedings of the 22nd ACM international conference on
Multimedia, pp. 789-792, 2014.
[264] W. Zhu, J. Miao, L. Qing, and G.-B. Huang, "Hierarchical extreme learning
machine for unsupervised representation learning," in 2015 International Joint
Conference on Neural Networks (IJCNN), pp. 1-8, 2015.
[265] A. Ali-Gombe and E. Elyan, "MFC-GAN: class-imbalanced dataset
classification using multiple fake class generative adversarial network,"
Neurocomputing, vol. 361, pp. 212-221, 2019.
[266] M. Fayyaz, M. Yasmin, M. Sharif, and M. Raza, "J-LDFR: joint low-level and
deep neural network feature representations for pedestrian gender
classification," Neural Computing and Applications, 2020.
[267] Z. Wang, J. Jiang, Y. Yu, and S. i. Satoh, "Incremental re-identification by
cross-direction and cross-ranking adaption," IEEE Transactions on Multimedia,
vol. 21, pp. 2376-2386, 2019.
[268] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-
scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
Appendix A
A.1 English Proof Reading Certificate