Face Recognition On Small-Scale Datasets
Abstract—Face recognition algorithms have achieved both accuracy and efficiency on large-scale datasets with millions of images. ArcFace [1] is the state-of-the-art method on large-scale datasets. It adds a softmax classifier layer at the end of the backbone to minimize the geodesic distance between all embedding features of the same identity. This method solves the problem of the vanilla softmax classifier when the number of identities increases. However, it also requires a commensurately large number of images per identity. In some situations, we need to cluster face images where the dataset is limited because collecting data in the real world is difficult. In this study, we perform experiments on the problem of face recognition on three small-scale datasets, belonging respectively to three domains: normal human face images, human face sketch images, and cartoon characters' faces. Each of these datasets contains hundreds of identities, where each identity has only 200 images. The result on ArcFace [1] matches our assumption that the loss function cannot converge even on the training set. Hence, FaceNet [2] proves more effective in this case, since such a sample-to-sample approach is less sensitive to the number of identities or the number of images per identity. For all three datasets, FaceNet [2] achieves a true-accept accuracy of over 80%.

I. INTRODUCTION

Face recognition systems have gained significant momentum since recent outstanding developments of deep convolutional neural networks (DCNNs) [3]–[6]. There are two major lines of training DCNNs for face recognition: those that use a softmax classifier to train a multi-class classifier that can differentiate distinct identities in the training set [1], [7]–[10]; and those that directly learn an embedding, such as the triplet loss [2]. Both of these methods obtain exceptional results on face recognition using large-scale datasets and complex DCNN architectures.

However, in practice there are still real-world scenarios in which obtaining a large amount of training data is a challenging problem. Specifically, while existing works [11], [12] put much effort into real human faces, face images in other domains, such as portrait sketches or animation, are limited. As interest in the field of face recognition has heightened, the question arises as to whether inadequate data would degrade the performance of well-known systems. Accurately recognizing human faces using a limited dataset would help businesses minimize the cost and time of collecting and processing data.

This paper examines ArcFace, representing the margin-based softmax methods, and a sample-to-sample comparison method, FaceNet, in terms of effectiveness with a small quantity of training data on three different domains. We reimplemented the two mentioned methods and applied them to three small-scale datasets where each identity has only 200 images.

We observed that while ArcFace cannot converge due to the lack of data, a sample-to-sample comparison strategy like FaceNet [2] gave a positive result on these datasets (Table I). On a subset of the well-known CASIA-WebFace dataset introduced in [13], FaceNet achieved a true-accept accuracy of 85.12%. On a subset of iCartoonFace [14], a polished, large-scale, challenging benchmark dataset for cartoon face recognition, it achieved 90.20%. Additionally, we applied a GAN-based algorithm [15] to CASIA-WebFace to generate a portrait sketch dataset, on which FaceNet reached 81.10% accuracy.

The remainder of this paper is organized as follows: Section II outlines the related works; Section III reviews the main ideas of FaceNet; the datasets, evaluation metrics, and results are discussed in Section IV; finally, Section V presents the conclusions of this paper and future work.

II. RELATED WORKS

For the face recognition problem, margin-based softmax methods [1], [7]–[10] focus on incorporating a margin penalty into a feasible framework, the softmax loss, which performs extensive sample-to-class comparisons. More specifically, SphereFace [8] proposed an angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to better discriminative power of the trained model. CosFace [10] directly adds a cosine margin penalty to the target logit, which obtains better performance than SphereFace [8]. ArcFace [1] has a clear geometric interpretation due to its exact correspondence to the geodesic distance on the hypersphere, and obtains more discriminative features for face recognition than the above methods. However, the limitation of these sample-to-class approaches is their sensitivity to the size of the training dataset, which means they require large amounts of data.

FaceNet [2] uses the triplet loss to exploit triplet data such that faces from the same identity are closer than faces from different identities by a clear Euclidean distance margin.
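The margin-based softmax variants discussed above differ only in how they modify the target-class logit before the softmax. A minimal NumPy sketch of that idea (our own illustration, not any paper's official implementation; the scale `s` and margin values are illustrative assumptions):

```python
import numpy as np

def margin_logit(theta, kind, m, s=64.0):
    """Modify the target-class logit cos(theta) with a margin penalty.

    theta: angle between the embedding and its class centre.
    kind:  which margin-based softmax variant to illustrate.
    m:     margin hyperparameter; s: feature scale (illustrative value).
    """
    if kind == "softmax":        # plain softmax: no margin
        return s * np.cos(theta)
    if kind == "sphereface":     # multiplicative angular margin: cos(m * theta)
        return s * np.cos(m * theta)
    if kind == "cosface":        # additive cosine margin: cos(theta) - m
        return s * (np.cos(theta) - m)
    if kind == "arcface":        # additive angular margin: cos(theta + m)
        return s * np.cos(theta + m)
    raise ValueError(kind)

# Every margin variant shrinks the target logit, so the network must pull the
# embedding closer to its class centre to reach the same softmax probability.
theta = 0.5
assert margin_logit(theta, "arcface", 0.5) < margin_logit(theta, "softmax", 0.0)
```

The additive angular margin of ArcFace acts directly on the angle, which is what gives it the geodesic-distance interpretation mentioned above.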
Even though there can be a combinatorial explosion in the number of triplets, especially for large-scale datasets, we consider this technique well suited for small-scale datasets where the number of images per identity is limited. We conduct experiments to verify this; the results are shown in Section IV. Yu and Liu [22] propose an improved VGGNet algorithm to reduce the dependence of DCNNs on large-scale training data.

III. FACENET

FaceNet [2] is a facial feature extraction method that aims to reduce existing difficulties and achieve strong results in the field of face recognition. It employs a deep convolutional network along with the triplet loss, visualized in Fig. 1, and stands among the most outstanding face recognition methods. The method aims to map every face image into a feature space such that the squared distance between all faces of the same identity, regardless of the image conditions, is small, whereas the squared distance between images of distinct identities is large. The following section covers the main ideas of the triplet loss.

Fig. 1: FaceNet model structure (taken from the original FaceNet paper [2])

A. Triplet loss

The L2-distance between the anchor face image x_i^a and the face image x_i^p (positive) of the same person would be minimized, while the L2-distance between the anchor and the face images x_i^n (negative) of different identities would be maximized. This constraint can be defined as:

∥f(x_i^a) − f(x_i^p)∥²₂ + α < ∥f(x_i^a) − f(x_i^n)∥²₂   (1)

Generating all possible triplets would not be ideal, as many triplets would already satisfy the triplet condition (Equation 1) and hence not contribute to training the model, resulting in slower convergence. The triplet selection strategy is discussed in the next section.

B. Triplet selection

The authors suggest that it is crucial to select images violating the triplet constraint, which improves the model and speeds up convergence. Specifically, x_i^p (hard positives, with a high distance to the anchor) and x_i^n (hard negatives, with a low distance to the anchor) should be selected such that

x_i^p = argmax_{x_i^p} ∥f(x_i^a) − f(x_i^p)∥²₂   (3)

x_i^n = argmin_{x_i^n} ∥f(x_i^a) − f(x_i^n)∥²₂   (4)

Nevertheless, computing every viable argmin and argmax over the whole training set is inefficient and time-consuming; in addition, mislabelled and low-quality images would dominate the hard positives and hard negatives. There are several options we can employ to avoid this issue. Here, we follow the original FaceNet [2] paper by generating triplets online, using large mini-batches, and computing only the argmin and argmax within each mini-batch.
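The in-batch mining of hard positives and hard negatives, combined with the triplet constraint, can be sketched as follows (a minimal NumPy illustration under our own simplified assumptions; the embedding values are made up and this is not the original FaceNet implementation):

```python
import numpy as np

def pairwise_sq_dists(emb):
    """Squared L2 distances between all pairs of embeddings (rows of emb)."""
    sq = np.sum(emb ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T

def mine_triplet_loss(emb, labels, alpha=0.2):
    """For each anchor, select the hardest positive (argmax over the same
    identity, Eq. 3) and hardest negative (argmin over other identities,
    Eq. 4) within the mini-batch, then apply the margin constraint of
    Eq. 1 as a hinge loss."""
    d = pairwise_sq_dists(emb)
    n = len(labels)
    losses = []
    for a in range(n):
        same = (labels == labels[a]) & (np.arange(n) != a)
        diff = labels != labels[a]
        if not same.any() or not diff.any():
            continue  # anchor has no valid positive or negative in this batch
        hard_pos = d[a][same].max()  # hardest positive distance
        hard_neg = d[a][diff].min()  # hardest negative distance
        losses.append(max(hard_pos - hard_neg + alpha, 0.0))
    return float(np.mean(losses))

# Toy mini-batch: two identities whose embeddings are already well separated,
# so every mined triplet satisfies Eq. 1 and the hinge loss is zero.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
labels = np.array([0, 0, 1, 1])
loss = mine_triplet_loss(emb, labels)
```

In practice the same mining is done on embeddings produced by the backbone for each large mini-batch, exactly as described in the paragraph above.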
images belonging to other domains such as portrait sketches. In this study, we focus on the case where the number of face images per identity is limited. For each experiment, we use a dataset consisting of face images of several hundred identities, each with 200 images. Fig. 8 shows some examples from the three datasets used in the three experiments. Each dataset is briefly described below.
We conduct our first experiment on CASIA-WebFace, introduced in [13]. This dataset contains hundreds of thousands of human face images belonging to 10575 real identities, crawled from the internet by the Institute of Automation, Chinese Academy of Sciences (CASIA). We use a subset of 60800 images of 304 different identities for this experiment.

Fig. 4: The workflow of the training phase
Our second experiment is conducted on a dataset of portrait sketch images. Discriminating identities in sketch images is more challenging because the identity features in each sketch image are scarcer than in normal face images. We use a GAN-based method [15] to synthesize the dataset; more specifically, we fed all the images from the first experiment into the GAN [15] to obtain sketch images.

iCartoonFace [14] is the largest-scale, high-quality, richly annotated benchmark dataset in the field of cartoon face recognition. We use a subset of 88400 cartoon face images of 442 different identities for the third experiment.

Fig. 5: The workflow of the testing phase
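Building a fixed-size subset of this kind (exactly 200 images per identity) amounts to a simple grouping pass. A hypothetical sketch, where `samples` is assumed to be a list of `(image_path, identity)` pairs and the function name is our own:

```python
import random
from collections import defaultdict

def make_subset(samples, per_identity=200, seed=0):
    """Keep only identities with at least `per_identity` images and sample
    exactly `per_identity` images from each of them."""
    by_id = defaultdict(list)
    for path, ident in samples:
        by_id[ident].append(path)
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    subset = {}
    for ident, paths in by_id.items():
        if len(paths) >= per_identity:
            subset[ident] = rng.sample(paths, per_identity)
    return subset
```

Identities with fewer than 200 images are simply dropped, which keeps the number of images per identity constant across the whole subset.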
Fig. 6: The change of the loss value over the iterations
same and different. All face pairs (i, j) of the same identity are denoted with P_same, whereas all pairs of different identities are denoted with P_diff. With face pairs (i, j) that were correctly classified as same at threshold d, we define the set of true-accepts as:

TA(d) = {(i, j) ∈ P_same, with D(x_i, x_j) ≤ d}

where D(x_i, x_j) is the L2-distance between the embeddings of images x_i and x_j.
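Given the pairwise distances, the corresponding rates follow directly from these set definitions. A minimal sketch (our own illustration of the metric; the arrays of pair distances are made up):

```python
import numpy as np

def true_accept_rate(dists_same, d):
    """Fraction of same-identity pairs (P_same) whose embedding distance is
    at most the threshold d, i.e. |TA(d)| / |P_same|."""
    return float(np.mean(dists_same <= d))

def false_accept_rate(dists_diff, d):
    """Fraction of different-identity pairs (P_diff) wrongly accepted at d."""
    return float(np.mean(dists_diff <= d))

# With d = 1.1, two of the three same-identity pairs below fall under the
# threshold, so the true-accept rate is 2/3.
ta = true_accept_rate(np.array([0.5, 0.9, 1.3]), 1.1)
```

The same pattern extends to the false-reject and true-reject rates by counting the complementary sets.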
D. Results

The results are shown in Fig. 7, where the horizontal x-axis corresponds to the distance between the two embedding vectors of a pair of images, and the vertical y-axis shows the number of pairs at each distance. Based on Fig. 7, we choose an appropriate L2-distance threshold d for each dataset and calculate the evaluation metrics shown in Table I.

TABLE I: Evaluation metrics with the best thresholds

Recognition on the sketch dataset is more challenging because much of the identity information has been omitted compared to normal face photos (Fig. 3), so it is expected that the results on the face sketch dataset are worse than on the real one (CASIA-WebFace). However, this result is acceptable, especially given that the amount of training data is quite small. On the cartoon face images, the FaceNet [2] algorithm performs quite well, since the information an image provides is good for classification.

Fig. 7: The statistics of the L2-distances between pairs of embedding vectors. For each dataset, we sample 10000 positive pairs and 10000 negative pairs from the testing subset.

Here are some illustrations from the three datasets (Fig. 8, 9, and 10); each pair of images has its corresponding distance below it. We divide them into four cases:

• True-accept: Given a pair of two images of the same person, the predicted result is correct.
• False-reject: Given a pair of two images of the same person, the predicted result is incorrect.
• False-accept: Given a pair of two images of different persons, the predicted result is incorrect.
• True-reject: Given a pair of two images of different persons, the predicted result is correct.
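Choosing the per-dataset threshold d from the distance statistics amounts to a simple sweep over candidate values. A hedged sketch (our own illustration; the candidate grid and the pair-accuracy criterion are assumptions, not necessarily the paper's exact procedure):

```python
import numpy as np

def best_threshold(dists_same, dists_diff, candidates):
    """Pick the threshold maximising pair-classification accuracy:
    same-identity pairs should fall at or below d, different-identity
    pairs above it."""
    best_d, best_acc = None, -1.0
    for d in candidates:
        correct = np.sum(dists_same <= d) + np.sum(dists_diff > d)
        acc = correct / (len(dists_same) + len(dists_diff))
        if acc > best_acc:
            best_d, best_acc = d, acc
    return best_d, best_acc

# Well-separated toy distances: some threshold between the two groups
# classifies every pair correctly.
d, acc = best_threshold(np.array([0.2, 0.4, 0.6]),
                        np.array([1.0, 1.2, 1.4]),
                        np.linspace(0.0, 2.0, 21))
```

When the positive and negative distance histograms overlap, as they do for the sketch dataset, no threshold reaches perfect accuracy and the sweep simply returns the best trade-off.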
Fig. 8: Illustration of the results on CASIA-WebFace (threshold d = 1.1)

Fig. 10: Illustration of the results on iCartoonFace (threshold d = 0.8)
R EFERENCES
[12] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A
dataset and benchmark for large-scale face recognition,” CoRR, vol.
abs/1607.08221, 2016. [Online]. Available: http://arxiv.org/abs/1607.
08221
[13] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation
from scratch,” CoRR, vol. abs/1411.7923, 2014. [Online]. Available:
http://arxiv.org/abs/1411.7923
[14] Y. Zheng, Y. Zhao, M. Ren, H. Yan, X. Lu, J. Liu, and J. Li, “Cartoon
face recognition: A benchmark dataset,” in Proceedings of the 28th ACM
International Conference on Multimedia, 2020, pp. 2264–2272.
[15] C. Chen, W. Liu, X. Tan, and K. K. Wong, “Semi-supervised learning
for face sketch synthesis in the wild,” CoRR, vol. abs/1812.04929,
2018. [Online]. Available: http://arxiv.org/abs/1812.04929
[16] A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face
recognition with pose, low-resolution and sketch,” in CVPR 2011, 2011,
pp. 593–600.
[17] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discrim-
inant analysis,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 38, no. 1, pp. 188–194, 2016.
[18] X. Huang, Z. Lei, M. Fan, X. Wang, and S. Z. Li, “Regularized discrim-
inative spectral regression method for heterogeneous face matching,”
IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 353–362,
2013.
[19] Y. Tian, C. Yan, X. Bai, and J. Zhou, “Heterogeneous face recognition
via grassmannian based nearest subspace search,” in 2017 IEEE Inter-
national Conference on Image Processing (ICIP), 2017, pp. 1077–1081.
[20] D. Deb, N. Nain, and A. K. Jain, “Longitudinal study of child face
recognition,” CoRR, vol. abs/1711.03990, 2017. [Online]. Available:
http://arxiv.org/abs/1711.03990
[21] M. A. Abuzneid and A. Mahmood, “Enhanced human face recognition using LBPH descriptor, multi-KNN, and back-propagation neural network,” IEEE Access, vol. 6, pp. 20641–20651, May 2018. [Online]. Available:
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8334532
[22] L. Yu and J. Liu, “Face recognition based on deep learning of small data set,” Journal of Physics: Conference Series, vol. 1624, p. 052004, Oct. 2020.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,”
2014.
[24] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet
and the impact of residual connections on learning,” CoRR, vol.
abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.
07261