
Face Recognition on Small-scale Datasets

Chi-Bien Chu, Minh-Hien Le, Van-Toan Vo, Minh-Triet Tran


Faculty of Information Technology
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Vietnam National University, Ho Chi Minh City, Vietnam
19120002@student.hcmus.edu.vn, 19120225@student.hcmus.edu.vn, 19120690@student.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Abstract—Face recognition algorithms have achieved both accuracy and efficiency on large-scale datasets with millions of images. ArcFace [1] is the state-of-the-art method on large-scale datasets. It adds a softmax classifier layer at the end of the backbone to minimize the geodesic distance between all embedding features of the same identity. This method solves the problem of the vanilla softmax classifier when the number of identities increases. However, it also requires a commensurately large number of images per identity.

In some situations, we need to cluster face images where the dataset is limited due to the difficulty of collecting data in the real world. In this study, we perform experiments on the problem of face recognition on three small-scale datasets, respectively belonging to three domains: normal human face images, human face sketch images, and cartoon characters' faces. Each of the above datasets contains hundreds of identities, where each identity has only 200 images.

The result on ArcFace [1] matches our assumption that the loss function cannot converge even on the training set. FaceNet [2], in contrast, proved more effective in this case, since such a sample-to-sample approach is less sensitive to the number of identities or the number of images per identity. On all three datasets, FaceNet [2] achieves a true-accept accuracy of over 80%.

I. INTRODUCTION

Face recognition systems have gained significant momentum with the recent outstanding developments of deep convolutional neural networks (DCNNs) [3]–[6]. There are two major lines of training DCNNs for face recognition: those that use a softmax classifier to train a multi-class classifier that can differentiate distinct identities in the training set [1], [7]–[10]; and those that directly learn an embedding, such as the triplet loss [2]. Both of these methods obtain exceptional results on face recognition using large-scale datasets and complex DCNN architectures.

However, in practice there are still scenarios in which real-world applications cannot obtain a large amount of training data. Specifically, while existing works [11], [12] put much effort into real human faces, face images in other domains, such as portrait sketches or animation, are limited. As interest in the field of face recognition has heightened, the question arises as to whether inadequate data would degrade the performance of well-known systems. Accurately recognizing human faces using a limited dataset would help businesses minimize the cost and time of collecting and processing data.

This paper compares ArcFace, representing the margin-based softmax methods, with a sample-to-sample comparison method, FaceNet, in terms of effectiveness when training on a small quantity of data in three different domains. We reimplemented the two mentioned methods and applied them to three small-scale datasets where each identity has only 200 images.

We observed that while ArcFace cannot converge due to the lack of data, a sample-to-sample comparison strategy like FaceNet [2] gave a positive result on these datasets (Table I). On a subset of the well-known CASIA-WebFace dataset introduced in [13], FaceNet achieved a true-accept accuracy of 85.12%. On a subset of iCartoonFace [14], a polished, large-scale, challenging benchmark dataset for cartoon face recognition, it achieved 90.20%. Additionally, we applied a GAN-based algorithm [15] to CASIA-WebFace to generate a portrait sketch dataset, on which FaceNet reached 81.10% accuracy.

The remainder of this paper is organized as follows: in Section II we outline the related works; Section III reviews the main ideas of FaceNet; the datasets, evaluation metrics, and results are discussed in Section IV; finally, Section V presents the conclusions of this paper and future work.
II. RELATED WORKS

For the face recognition problem, margin-based softmax methods [1], [7]–[10] focused on incorporating a margin penalty into a more feasible framework, the softmax loss, which relies on extensive sample-to-class comparisons. More specifically, SphereFace [8] proposed an angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to a better discriminative power of the trained model. CosFace [10] directly adds a cosine margin penalty to the target logit, which obtains better performance compared to SphereFace [8]. ArcFace [1] has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere, and obtains more discriminative features for face recognition than the above methods. However, the limitation of these sample-to-class approaches is their sensitivity to the size of the training dataset, which means that they require large resources.

FaceNet [2] uses the triplet loss to exploit triplet data such that faces from the same identity are closer than faces from different identities by a clear Euclidean distance margin. Even though there can be a combinatorial explosion in the number of triplets, especially for large-scale datasets, we consider this technique well suited for small-scale datasets where the number of images per identity is limited. We conduct experiments to verify this; the results are shown in Section IV-D.

Heterogeneous Face Recognition (HFR) alludes to matching cross-domain faces. Nevertheless, HFR is confronted with challenges from large domain discrepancy and insufficient heterogeneous data. To overcome this, some subspace-learning-based methods [16]–[19] map multimodal data into a common feature space to eliminate the large discrepancies of cross-modality image pairs.

Though general face recognition does not encounter the same data limitation problem, there are still cases where large datasets are not available. Several studies [20]–[22] focus on this problem and achieve satisfactory results. Debayan Deb et al. [20] fine-tune FaceNet to solve the problem of child face recognition over a time span. Mohannad A. Abuzneid et al. [21] enhance face recognition by employing a back-propagation neural network (BPNN) and feature extraction based on the correlation between the training images, and generate a new key dataset from the original one. Lichun Yu and Jinqing Liu [22] propose an improved VGGNet algorithm to reduce the dependence of DCNNs on large-scale training data.

III. FACENET

FaceNet [2] is a facial feature extraction method that aims to reduce existing difficulties and achieve great results in the field of face recognition. It employs a deep convolutional network along with the triplet loss, which is visualized in Fig. 1, to stand among the most outstanding face recognition methods. The method aims to map every face image into a feature space such that the squared distance between all faces of the same identity, regardless of the image conditions, is small, whereas the squared distance between images of distinct identities is large. The following section covers the main ideas of the triplet loss.

Fig. 1: FaceNet model structure (taken from the original FaceNet paper [2])

A. Triplet loss

The key idea of FaceNet [2] is the triplet loss function. This loss function is motivated by the context of nearest-neighbor classification, which means the distance between two images in the embedding space defines the similarity of the images. It ensures that the L2-distance between an image x_i^a (anchor) of a specific person and all other images x_i^p (positive) of that same person is minimized, while the L2-distance to face images x_i^n (negative) of different identities is maximized. This constraint can be defined as:

∥f(x_i^a) − f(x_i^p)∥_2^2 + α < ∥f(x_i^a) − f(x_i^n)∥_2^2    (1)

where f(x) represents the embedding of an image in the training set and α is the margin that is enforced between positive and negative pairs. In other words, similar images are closer to each other than distinct images in the embedding space, as shown in Fig. 2.

Fig. 2: Triplet loss (taken from the original FaceNet paper [2])

The triplet loss function can then be defined as

L = Σ_{i=1}^{N} [ ∥f(x_i^a) − f(x_i^p)∥_2^2 − ∥f(x_i^a) − f(x_i^n)∥_2^2 + α ]_+    (2)

Generating all possible triplets would not be ideal, as there would be many triplets already satisfying the triplet condition (Equation 1), hence not contributing to training the model and resulting in slower convergence. The triplet selection strategy is discussed in the next section.

B. Triplet selection

The authors suggest that it is crucial to select images that violate the triplet constraint, which improves the model and speeds up convergence. Specifically, hard positives x_i^p (positives with a large distance to the anchor) and hard negatives x_i^n (negatives with a small distance to the anchor) should be selected such that

x_i^p = argmax_{x_i^p} ∥f(x_i^a) − f(x_i^p)∥_2^2    (3)

x_i^n = argmin_{x_i^n} ∥f(x_i^a) − f(x_i^n)∥_2^2    (4)

Nevertheless, computing every viable argmin and argmax over the whole training set is inefficient as it is time-consuming; additionally, mislabelled and low-quality images would dominate the hard positives and hard negatives. There are several options which we can employ to avoid this issue. Here, we follow the original FaceNet [2] paper by generating triplets online, using large mini-batches and computing only the argmin and argmax within a mini-batch.
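To make Equations (1)–(4) concrete, the following is a minimal PyTorch-style sketch of the triplet loss with in-batch hard mining. It is our own illustration, not the authors' released code; the function names and the assumption that every identity occurs at least twice per batch are ours.

```python
import torch

def pairwise_sq_dists(emb: torch.Tensor) -> torch.Tensor:
    """Squared L2 distances between all pairs of embeddings in a batch."""
    # ∥a − b∥² = ∥a∥² − 2·a·b + ∥b∥²
    sq = (emb ** 2).sum(dim=1, keepdim=True)               # (B, 1)
    return (sq - 2.0 * emb @ emb.T + sq.T).clamp(min=0.0)  # (B, B)

def batch_hard_triplet_loss(emb: torch.Tensor,
                            labels: torch.Tensor,
                            margin: float = 0.5) -> torch.Tensor:
    """Triplet loss (Eq. 2) with in-batch hard mining (Eqs. 3 and 4).

    Assumes each identity appears at least twice in the batch, so every
    anchor has at least one positive and one negative.
    """
    d = pairwise_sq_dists(emb)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # (B, B) identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    # Hardest positive per anchor: argmax over same-identity pairs (Eq. 3).
    hardest_pos = d.masked_fill(~(same & ~eye), 0.0).max(dim=1).values
    # Hardest negative per anchor: argmin over cross-identity pairs (Eq. 4).
    hardest_neg = d.masked_fill(same, float("inf")).min(dim=1).values

    # Hinge of Eq. 2, averaged over anchors; `margin` is α of Eq. 1.
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```

Mining only within the mini-batch is what makes this the online counterpart of the full-dataset argmin/argmax that the paper rules out as too expensive.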
IV. EXPERIMENT

A. Datasets

It is quite simple to collect small-scale datasets to train the model for human face clustering. For example, we can use a subset of a large-scale dataset that is a well-known benchmark for general face problems. Another strategy is using Generative Adversarial Networks [23] to synthesize images belonging to other domains, such as portrait sketches. In this study, we focus on the limitation of the number of face images per identity. For each experiment, we use a dataset consisting of face images of several hundred identities, where each identity has 200 images. Fig. 3 shows some examples from the three datasets used in the three experiments. Each dataset is briefly described below.

We conduct our first experiment on CASIA-WebFace, introduced in [13]. This dataset contains hundreds of thousands of human face images belonging to 10575 real identities, crawled from the internet by the Institute of Automation, Chinese Academy of Sciences (CASIA). We use a subset of 60800 images of 304 different identities for the experiment.

Our second experiment is conducted on a dataset of portrait sketch images. Discriminating identities in sketch images is more challenging because the identity features hidden in each sketch image are more deficient than in normal photos. We use a GAN-based method [15] to synthesize the dataset. More specifically, we fed all the images from the first experiment into the GAN [15] to obtain sketch images.

iCartoonFace [14] is the largest-scale, high-quality, richly annotated benchmark dataset in the field of cartoon face recognition. We use a subset of 88400 cartoon face images of 442 different identities for the third experiment.

Fig. 3: Some example images of the datasets. The three rows respectively correspond to CASIA-WebFace, the synthesized face sketches, and iCartoonFace.

B. Implementation

We train an Inception-ResNet-V1 [24] backbone from scratch with the triplet loss for each experiment. We set the same hyperparameters and configurations for all experiments, as described below. Diagrams of the training and testing stages are shown in Fig. 4 and Fig. 5.

Fig. 4: The workflow of the training phase

Fig. 5: The workflow of the testing phase

Each dataset has several hundred identities and is separated into a training set, a validation set, and a test set by the images within an identity. In other words, every identity appears in all of the above-mentioned sets. As mentioned in Section IV-A, each identity in each dataset has exactly 200 images. We take 150 face images for the training set, 30 face images for the validation set, and the remaining 20 images for the test set.

To select triplets for training or validating the model, we split all images into mini-batches of size 50, then iterate over all pairs of images in each mini-batch (see the sketch below). We further impose the constraint that each identity has a maximum of 10 images in each mini-batch. This helps to balance the numbers of positive and negative pairs, avoiding imbalance when training the model.
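As an illustration of this data preparation, the sketch below performs the per-identity 150/30/20 split and builds mini-batches of 50 images with at most 10 images per identity. It is a simplified reading of the procedure described above, with our own naming; the paper does not publish its data pipeline.

```python
import random
from collections import defaultdict

def split_per_identity(images_by_id, seed=0):
    """Split each identity's 200 images into 150 train / 30 val / 20 test."""
    rng = random.Random(seed)
    train, val, test = {}, {}, {}
    for identity, paths in images_by_id.items():
        paths = list(paths)
        rng.shuffle(paths)
        train[identity] = paths[:150]
        val[identity] = paths[150:180]
        test[identity] = paths[180:200]
    return train, val, test

def capped_minibatches(split, batch_size=50, cap_per_id=10, seed=0):
    """Yield (path, identity) mini-batches with at most cap_per_id
    images of any single identity per batch."""
    rng = random.Random(seed)
    pool = [(p, i) for i, paths in split.items() for p in paths]
    rng.shuffle(pool)
    batch, counts = [], defaultdict(int)
    for path, identity in pool:
        if counts[identity] >= cap_per_id:
            continue  # simplification: a real sampler would defer the image
        batch.append((path, identity))
        counts[identity] += 1
        if len(batch) == batch_size:
            yield batch
            batch, counts = [], defaultdict(int)
    if batch:
        yield batch
```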
The input images are resized to 160 × 160, and their pixel values are normalized with µ = 0.5 and σ = 0.5. The output embedding vector has a size of 128 float elements. The margin α in Equation (1) is set to 0.5. We train each model for 20 epochs with the Adam optimizer, using a learning rate of 0.001, β1 = 0.9, and β2 = 0.999.
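These settings map onto standard PyTorch components roughly as follows, reusing `batch_hard_triplet_loss` and `capped_minibatches` from the earlier sketches. The `backbone` here is only a stand-in: the paper trains Inception-ResNet-V1 [24], whose full definition is out of scope, so any encoder producing 128-d embeddings can be substituted.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Preprocessing: resize to 160x160, then normalize with µ = 0.5, σ = 0.5.
preprocess = transforms.Compose([
    transforms.Resize((160, 160)),
    transforms.ToTensor(),                        # pixel values to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Stand-in encoder; the paper uses Inception-ResNet-V1 [24] trained from
# scratch. Any module mapping a 3x160x160 image to a 128-d embedding fits.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 128),
)

optimizer = torch.optim.Adam(backbone.parameters(), lr=0.001,
                             betas=(0.9, 0.999))
MARGIN, EPOCHS = 0.5, 20                          # α of Eq. (1); 20 epochs

def train(batches_per_epoch):
    """batches_per_epoch() yields (images, labels) tensors of preprocessed
    mini-batches, e.g. assembled from capped_minibatches above."""
    for _ in range(EPOCHS):
        for images, labels in batches_per_epoch():
            loss = batch_hard_triplet_loss(backbone(images), labels, MARGIN)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```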
After the training in each experiment is done, we select the model with the best score on the validation set for testing. Since the total number of pairs of images is still quite large, we sample only 10000 positive pairs and 10000 negative pairs, and then compute statistics over the L2 distances of all sampled pairs. As mentioned before, a positive pair is a pair of images of the same identity, whereas a negative pair is a pair of images of two different identities. The details of the experimental results are given in Section IV-D.
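The sampled-pair statistics behind Fig. 7 can be gathered along these lines; this is our own sketch, assuming per-identity tensors of test-set embeddings.

```python
import random

def sample_pair_distances(emb_by_id, n_pairs=10000, seed=0):
    """Squared L2 distances for sampled positive and negative pairs.

    emb_by_id maps an identity to a tensor of shape (n_images, 128)
    holding the embeddings of that identity's test images.
    """
    rng = random.Random(seed)
    ids = list(emb_by_id)
    pos, neg = [], []
    for _ in range(n_pairs):
        # Positive pair: two distinct images of the same identity.
        i = rng.choice(ids)
        a, b = rng.sample(range(len(emb_by_id[i])), 2)
        pos.append(((emb_by_id[i][a] - emb_by_id[i][b]) ** 2).sum().item())
        # Negative pair: one image from each of two different identities.
        i, j = rng.sample(ids, 2)
        a = rng.randrange(len(emb_by_id[i]))
        b = rng.randrange(len(emb_by_id[j]))
        neg.append(((emb_by_id[i][a] - emb_by_id[j][b]) ** 2).sum().item())
    return pos, neg
```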
We also set up experiments for ArcFace [1] with the same datasets. General hyperparameters such as the input image size, embedding vector size, batch size, and Adam's parameters are kept as above. However, the training does not obtain any acceptable result, as shown in Fig. 6. From the third epoch on, the loss value of the softmax layer is stuck around 5.0, and the classification accuracy is even worse: it is always zero. This result is not too surprising to us: with hundreds of identities, the softmax layer requires a large enough number of images per identity, while our datasets have only a few hundred images per identity.
Fig. 6: The change of the loss value over the iterations

C. Evaluation metrics

Given a pair of face images, the squared L2 distance D(x_i, x_j) between their embeddings is compared against a threshold d to classify the pair as same or different. All face pairs (i, j) of the same identity are denoted by P_same, whereas all pairs of different identities are denoted by P_diff.

The face pairs (i, j) that are correctly classified as same at threshold d form the set of true accepts:

TA(d) = {(i, j) ∈ P_same, with D(x_i, x_j) ≤ d}    (5)

The face pairs (i, j) that are incorrectly classified as same at threshold d form the set of false accepts:

FA(d) = {(i, j) ∈ P_diff, with D(x_i, x_j) ≤ d}    (6)

The validation rate VAL(d) and the false-accept rate FAR(d) for a given face distance d are then defined as

VAL(d) = |TA(d)| / |P_same|,    FAR(d) = |FA(d)| / |P_diff|    (7)
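With the sampled distances from above, Equation (7) reduces to a few lines; the sweep helper shows how one could pick the per-dataset thresholds d reported in Table I (the paper reads them off the histograms of Fig. 7).

```python
def val_far(pos_dists, neg_dists, d):
    """VAL(d) and FAR(d) from Equation (7) for a threshold d."""
    val = sum(x <= d for x in pos_dists) / len(pos_dists)
    far = sum(x <= d for x in neg_dists) / len(neg_dists)
    return val, far

def sweep(pos_dists, neg_dists, thresholds):
    """Tabulate the VAL/FAR trade-off over candidate thresholds."""
    for d in thresholds:
        val, far = val_far(pos_dists, neg_dists, d)
        print(f"d={d:.2f}  VAL={val:.4f}  FAR={far:.4f}")

# Example: sweep(pos, neg, [x / 10 for x in range(5, 21)])
```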

D. Results

The results are shown in Fig. 7, where the horizontal axis corresponds to the distance between the two embedding vectors of a pair of images, and the vertical axis shows the number of pairs at each distance. Based on Fig. 7, we choose an appropriate L2-distance threshold d for each dataset and calculate the evaluation metrics, as shown in Table I:

Fig. 7: The statistics of the L2-distances between pairs of embedding vectors. For each dataset, we sample 10000 positive pairs and 10000 negative pairs from the testing subset.

TABLE I: Evaluation metrics with the best thresholds

Dataset                     d        VAL(d)   FAR(d)
CASIA-WebFace               1.1000   0.8512   0.1369
CASIA-WebFace (sketch)      1.1000   0.8110   0.1923
iCartoonFace [14]           0.8000   0.9020   0.0687

On portrait sketches, the face recognition problem becomes more challenging because much of the identity information is omitted compared to normal face photos (Fig. 3), so it is expected that the results on the face sketch dataset are worse than on the real one (CASIA-WebFace). However, this result is acceptable, especially given that the amount of training data is quite small. On the cartoon face images, the FaceNet [2] algorithm performs quite well, since the information provided by an image is quite good for classification.

Here are some illustrations from the three datasets (Figs. 8, 9 and 10); each pair of images has its corresponding distance below it. We divide them into four cases:

• True-accept: Given a pair of two images of the same person, the calculated L2-distance between the two embedding vectors is less than the chosen threshold, i.e., the predicted result is correct.
• False-accept: Given a pair of two images of two different persons, the predicted result is incorrect.
• True-reject: Given a pair of two images of two different persons, the predicted result is correct.
• False-reject: Given a pair of two images of the same person, the predicted result is incorrect.
Fig. 8: Illustration of the results on CASIA-WebFace (threshold d = 1.1)

Fig. 9: Illustration of the results on CASIA-WebFace-Sketch (threshold d = 1.1)

Fig. 10: Illustration of the results on iCartoonFace (threshold d = 0.8)

V. CONCLUSIONS AND FUTURE WORK

In this paper, we conduct experiments on face recognition with small-scale datasets, and the obtained results are quite positive. The major conclusion of this study is that when the amount of data is limited, the “sample-to-sample” comparison strategy [2] is more effective than the “sample-to-class” comparison [1], [7]–[10]. Especially with the sketch image dataset, where the identification information in each image is insufficient, the FaceNet algorithm still provides more than 80% accuracy on the test set.

These results confirm our expectations. In this paper, we only work on a subset of CASIA-WebFace, synthesized face sketch images, and a subset of iCartoonFace. In the future, we will expand our work by increasing the type and the number of test datasets. Additionally, we will experiment with leveraging weights pre-trained on large-scale datasets to fit new datasets that can belong to different domains. Further frameworks can also be developed to verify the potential of multi-domain exploitation. We believe that face recognition using limited datasets is of great importance for practical applications in the future.

REFERENCES

[1] J. Deng, J. Guo, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” CoRR, vol. abs/1801.07698, 2018. [Online]. Available: http://arxiv.org/abs/1801.07698
[2] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” CoRR, vol. abs/1503.03832, 2015. [Online]. Available: http://arxiv.org/abs/1503.03832
[3] L. Yu and J. Liu, “Face recognition based on deep learning of small data set,” Journal of Physics: Conference Series, vol. 1624, p. 052004, Oct. 2020.
[4] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” CoRR, vol. abs/1611.05431, 2016. [Online]. Available: http://arxiv.org/abs/1611.05431
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
[7] J. Deng, Y. Zhou, and S. Zafeiriou, “Marginal loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” CoRR, vol. abs/1704.08063, 2017. [Online]. Available: http://arxiv.org/abs/1704.08063
[9] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[10] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” CoRR, vol. abs/1801.09414, 2018. [Online]. Available: http://arxiv.org/abs/1801.09414
[11] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments,” in Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, Oct. 2008. [Online]. Available: https://hal.inria.fr/inria-00321923
[12] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” CoRR, vol. abs/1607.08221, 2016. [Online]. Available: http://arxiv.org/abs/1607.08221
[13] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation
from scratch,” CoRR, vol. abs/1411.7923, 2014. [Online]. Available:
http://arxiv.org/abs/1411.7923
[14] Y. Zheng, Y. Zhao, M. Ren, H. Yan, X. Lu, J. Liu, and J. Li, “Cartoon
face recognition: A benchmark dataset,” in Proceedings of the 28th ACM
International Conference on Multimedia, 2020, pp. 2264–2272.
[15] C. Chen, W. Liu, X. Tan, and K. K. Wong, “Semi-supervised learning
for face sketch synthesis in the wild,” CoRR, vol. abs/1812.04929,
2018. [Online]. Available: http://arxiv.org/abs/1812.04929
[16] A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face
recognition with pose, low-resolution and sketch,” in CVPR 2011, 2011,
pp. 593–600.
[17] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discrim-
inant analysis,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 38, no. 1, pp. 188–194, 2016.
[18] X. Huang, Z. Lei, M. Fan, X. Wang, and S. Z. Li, “Regularized discrim-
inative spectral regression method for heterogeneous face matching,”
IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 353–362,
2013.
[19] Y. Tian, C. Yan, X. Bai, and J. Zhou, “Heterogeneous face recognition
via grassmannian based nearest subspace search,” in 2017 IEEE Inter-
national Conference on Image Processing (ICIP), 2017, pp. 1077–1081.
[20] D. Deb, N. Nain, and A. K. Jain, “Longitudinal study of child face
recognition,” CoRR, vol. abs/1711.03990, 2017. [Online]. Available:
http://arxiv.org/abs/1711.03990
[21] M. A. Abuzneid and A. Mahmood, “Enhanced human face recognition using LBPH descriptor, multi-KNN, and back-propagation neural network,” IEEE Access, vol. 6, pp. 20641–20651, May 2018. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8334532
[22] L. Yu and J. Liu, “Face recognition based on deep learning of small
data set,” Journal of Physics: Conference Series, vol. 1624, p. 052004,
10 2020.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,”
2014.
[24] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the impact of residual connections on learning,” CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
