ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Stefanos Zafeiriou
Imperial College London
s.zafeiriou@imperial.ac.uk
Abstract
One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins into well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods, on over 10 face recognition benchmarks including a new large-scale image database with trillion-level pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training code, pre-trained models and training logs¹, which will help reproduce the results in this paper.

Figure 1. Based on the centre [18] and feature [37] normalisation, all identities are distributed on a hypersphere. To enhance intra-class compactness and inter-class discrepancy, we consider four kinds of Geodesic Distance (GDis) constraint. (A) Margin-Loss: insert a geodesic distance margin between the sample and centres. (B) Intra-Loss: decrease the geodesic distance between the sample and the corresponding centre. (C) Inter-Loss: increase the geodesic distance between different centres. (D) Triplet-Loss: insert a geodesic distance margin between triplet samples. In this paper, we propose an Additive Angular Margin Loss (ArcFace), which exactly corresponds to the geodesic distance (Arc) margin penalty in (A), to enhance the discriminative power of the face recognition model. Extensive experimental results show that the strategy of (A) is the most effective.

1. Introduction

Face representation using Deep Convolutional Neural Network (DCNN) embedding is the method of choice for face recognition [32, 33, 29, 24]. DCNNs map the face image, typically after a pose normalisation step [45], into a feature that has small intra-class and large inter-class distance.

There are two main lines of research to train DCNNs for face recognition: those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier [33, 24, 6], and those that learn an embedding directly, such as the triplet loss [29]. With large-scale training data and elaborate DCNN architectures, both the softmax-loss-based methods [6] and the triplet-loss-based methods [29] can obtain excellent performance on face recognition.

∗ denotes equal contribution to this work.
¹ https://github.com/deepinsight/insightface
However, both the softmax loss and the triplet loss have some drawbacks. For the softmax loss: (1) the size of the linear transformation matrix W ∈ R^{d×n} increases linearly with the identity number n; (2) the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss: (1) there is a combinatorial explosion in the number of face triplets, especially for large-scale datasets, leading to a significant increase in the number of iteration steps; (2) semi-hard sample mining is a quite difficult problem for effective model training.

Several variants [38, 9, 46, 18, 37, 35, 7, 34, 27] have been proposed to enhance the discriminative power of the softmax loss. Wen et al. [38] pioneered the centre loss, the Euclidean distance between each feature vector and its class centre, to obtain intra-class compactness, while inter-class dispersion is guaranteed by the joint penalisation of the softmax loss. Nevertheless, updating the actual centres during training is extremely difficult, as the number of face classes available for training has recently increased dramatically.

By observing that the weights of the last fully connected layer of a classification DCNN trained on the softmax loss bear conceptual similarities to the centres of each face class, the works in [18, 19] proposed a multiplicative angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to better discriminative power of the trained model. Even though SphereFace [18] introduced the important idea of angular margin, their loss function required a series of approximations in order to be computed, which resulted in unstable training of the network. To stabilise training, they proposed a hybrid loss function which includes the standard softmax loss. Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence. CosFace [37, 35] directly adds a cosine margin penalty to the target logit, which obtains better performance than SphereFace while admitting a much easier implementation and relieving the need for joint supervision from the softmax loss.

In this paper, we propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated in Figure 2, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. The advantages of the proposed ArcFace can be summarised as follows:

Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and the arc on the normalised hypersphere. We intuitively illustrate what happens in the 512-D space by analysing the angle statistics between features and weights.

Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks, including large-scale image and video datasets.

Easy. ArcFace only needs several lines of code, as given in Algorithm 1, and is extremely easy to implement in computational-graph-based deep learning frameworks, e.g. MxNet [8], Pytorch [25] and Tensorflow [4]. Furthermore, contrary to the works in [18, 19], ArcFace does not need to be combined with other loss functions to have stable performance, and can easily converge on any training dataset.

Efficient. ArcFace adds only negligible computational complexity during training. Current GPUs can easily support millions of identities for training, and the model-parallel strategy can easily support many more.

2. Proposed Approach

2.1. ArcFace

The most widely used classification loss function, softmax loss, is presented as follows:

L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i+b_j}},  (1)

where x_i ∈ R^d denotes the deep feature of the i-th sample, belonging to the y_i-th class. The embedding feature dimension d is set to 512 in this paper, following [38, 46, 18, 37]. W_j ∈ R^d denotes the j-th column of the weight W ∈ R^{d×n} and b_j ∈ R^n is the bias term. The batch size and the class number are N and n, respectively. The traditional softmax loss is widely used in deep face recognition [24, 6]. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations [30, 48] and age gaps [22, 49]) and large-scale test scenarios (e.g. million [15, 39, 21] or trillion pairs [2]).

For simplicity, we fix the bias b_j = 0 as in [18]. Then, we transform the logit [26] as W_j^T x_i = ||W_j|| ||x_i|| cos θ_j, where θ_j is the angle between the weight W_j and the feature x_i. Following [18, 37, 36], we fix the individual weight ||W_j|| = 1 by l2 normalisation. Following [28, 37, 36, 35], we also fix the embedding feature ||x_i|| by l2 normalisation and re-scale it to s.
Figure 2. Training a DCNN for face recognition supervised by the ArcFace loss. Based on the feature x_i and weight W normalisation, we get cos θ_j (the logit) for each class as W_j^T x_i. We calculate arccos(cos θ_{y_i}) to get the angle between the feature x_i and the ground-truth weight W_{y_i}. In fact, W_j provides a kind of centre for each class. Then, we add an angular margin penalty m to the target (ground-truth) angle θ_{y_i}. After that, we calculate cos(θ_{y_i} + m) and multiply all logits by the feature scale s. The logits then go through the softmax function and contribute to the cross-entropy loss.
Algorithm 1 The pseudo-code of ArcFace on MxNet
Input: Feature scale s, margin parameter m in Eq. 3, class number n, ground-truth ID gt.
1. x = mx.symbol.L2Normalization(x, mode='instance')
2. W = mx.symbol.L2Normalization(W, mode='instance')
3. fc7 = mx.sym.FullyConnected(data=x, weight=W, no_bias=True, num_hidden=n)
4. original_target_logit = mx.sym.pick(fc7, gt, axis=1)
5. theta = mx.sym.arccos(original_target_logit)
6. marginal_target_logit = mx.sym.cos(theta + m)
7. one_hot = mx.sym.one_hot(gt, depth=n, on_value=1.0, off_value=0.0)
8. fc7 = fc7 + mx.sym.broadcast_mul(one_hot, mx.sym.expand_dims(marginal_target_logit - original_target_logit, 1))
9. fc7 = fc7 * s
Output: Class-wise affinity score fc7.
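For readers of Pytorch [25], the same nine steps can be written as a self-contained module. The sketch below is our own minimal rendering of Algorithm 1 (the class and variable names are ours, not from the released code); s = 64 and m = 0.5 are the commonly used ArcFace settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    # Additive angular margin on the target logit: s * cos(theta_{y_i} + m).
    def __init__(self, embedding_dim=512, num_classes=1000, s=64.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, embedding_dim))  # class centres
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Steps 1-3: l2-normalise x and W; the dot product then equals cos(theta_j).
        cos = F.linear(F.normalize(x), F.normalize(self.W))
        # Steps 4-6: recover the target angle and add the margin m.
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        marginal = torch.cos(theta + self.m)
        # Steps 7-9: replace only the ground-truth logit, then scale by s.
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).bool()
        logits = self.s * torch.where(one_hot, marginal, cos)
        return F.cross_entropy(logits, labels)

A single call, loss = ArcFaceHead(512, n)(embeddings, labels), thus reproduces the class-wise affinity scores of Algorithm 1 followed by the softmax cross-entropy.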
The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of s:
L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}.  (2)
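Since the embedding features are distributed around each class centre on the hypersphere, we can add an additive angular margin penalty m between x_i and W_{y_i} to simultaneously enhance the intra-class compactness and inter-class discrepancy. This yields the ArcFace loss implemented by Algorithm 1 (its Eq. 3):

L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}.  (3)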
Figure 4. Target logit analysis. (a) θ_j distributions from start to end during ArcFace training. (b) Target logit curves for softmax, SphereFace, ArcFace, CosFace and the combined margin penalty (cos(m_1 θ + m_2) − m_3). [Legend (m_1, m_2, m_3): Softmax (1.00, 0.00, 0.00); SphereFace (m = 4, λ = 5) and SphereFace (1.35, 0.00, 0.00); ArcFace (1.00, 0.50, 0.00); CosFace (1.00, 0.00, 0.35); CM1 (1.00, 0.30, 0.20); CM2 (0.90, 0.40, 0.15). x-axis: Angle between the Feature and Target Centre.]

Other loss functions can be designed based on the angular representation of features and weight vectors. For example, we can design a loss to enforce intra-class compactness and inter-class discrepancy on the hypersphere. As shown in Figure 1, we compare with three other losses in this paper.

Intra-Loss is designed to improve the intra-class compactness by decreasing the angle/arc between the sample and the ground-truth centre:

L_5 = L_2 + \frac{1}{\pi N}\sum_{i=1}^{N}\theta_{y_i}.  (5)
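Read concretely, the Eq. 5 penalty is just the mean arc between each sample and its ground-truth centre, scaled by 1/π. A minimal sketch under our own naming (cos_target is assumed to hold cos θ_{y_i} for each sample in the batch), to be added to L_2:

import math
import torch

def intra_penalty(cos_target):
    # theta_{y_i}: arc between each sample and its ground-truth centre.
    theta = torch.acos(cos_target.clamp(-1 + 1e-7, 1 - 1e-7))
    # Eq. 5 penalty term: (1 / (pi * N)) * sum_i theta_{y_i}.
    return theta.mean() / math.pi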
[Figure: angle distributions of positive and negative pairs on (a) LFW (99.83%), (b) CFP-FP (98.37%) and (c) AgeDB (98.15%); x-axis: Angles Between Positive and Negative Pairs.] These distributions show that the additive angular margin penalty can notably enhance the discriminative power of deeply learned features.

Table 6. Face identification and verification results of different methods on MegaFace Challenge 1, using FaceScrub as the probe set. "Id" refers to the rank-1 face identification accuracy with 1M distractors, and "Ver" refers to the face verification TAR at 1e-6 FAR. "R" refers to data refinement on both the probe set and the 1M distractors. ArcFace obtains state-of-the-art performance under both small and large protocols.
For comparison, we train ArcFace on CASIA and MS1MV2 under the small protocol and large protocol, respectively. In Table 6, ArcFace trained on CASIA achieves the best single-model identification and verification performance, not only surpassing the strong baselines (e.g. SphereFace [18] and CosFace [37]) but also outperforming other published methods [38, 17].

As we observed an obvious performance gap between identification and verification, we performed a thorough manual check of the whole MegaFace dataset and found many face images with wrong labels, which significantly affects the performance. Therefore, we manually refined the whole MegaFace dataset and report the correct performance of ArcFace on MegaFace. On the refined MegaFace, ArcFace still achieves the best performance on both identification and verification.

Figure 8. CMC and ROC curves of different models on MegaFace. Results are evaluated on both the original and refined MegaFace datasets. [(a) CMC, x-axis: Rank; (b) ROC, x-axis: False Positive Rate; curves: CASIA, ResNet50, ArcFace; MS1MV2, ResNet100, ArcFace; MS1MV2, ResNet100, CosFace; each Original and Refine.]
Table 7. 1:1 verification TAR (@FAR=1e-4) on the IJB-B and IJB-C datasets.

Method | IJB-B | IJB-C
ResNet50 [6] | 0.784 | 0.825
SENet50 [6] | 0.800 | 0.840
ResNet50+SENet50 [6] | 0.800 | 0.841
MN-v [42] | 0.818 | 0.852
MN-vc [42] | 0.831 | 0.862
ResNet50+DCN(Kpts) [41] | 0.850 | 0.867
ResNet50+DCN(Divs) [41] | 0.841 | 0.880
SENet50+DCN(Kpts) [41] | 0.846 | 0.874
SENet50+DCN(Divs) [41] | 0.849 | 0.885
VGG2, R50, ArcFace | 0.898 | 0.921
MS1MV2, R100, ArcFace | 0.942 | 0.956

Table 8. Identification and verification results (%) on the Trillion-Pairs dataset ([Dataset*, ResNet100, ArcFace]).

Method | Id (@FPR=1e-3) | Ver (@FPR=1e-9)
CASIA | 26.643 | 21.452
MS1MV2 | 80.968 | 78.600
DeepGlint-Face | 80.331 | 78.586
MS1MV2+Asian | 84.840 (1st) | 80.540
CIGIT IRSEC | 84.234 (2nd) | 81.558 (1st)
Results on IJB-B and IJB-C. The IJB-B dataset [39] contains 1,845 subjects with 21.8K still images and 55K frames from 7,011 videos. In total, there are 12,115 templates with 10,270 genuine matches and 8M impostor matches. The IJB-C dataset [21] is a further extension of IJB-B, having 3,531 subjects with 31.3K still images and 117.5K frames from 11,779 videos. In total, there are 23,124 templates with 19,557 genuine matches and 15,639K impostor matches.

On the IJB-B and IJB-C datasets, we employ the VGG2 dataset as the training data and ResNet50 as the embedding network to train ArcFace, for a fair comparison with the most recent methods [6, 42, 41]. In Table 7, we compare the TAR (@FAR=1e-4) of ArcFace with the previous state-of-the-art models [6, 42, 41]. ArcFace obviously boosts the performance on both IJB-B and IJB-C (by about 3∼5%, which is a significant reduction in the error). Drawing support from more training data (MS1MV2) and a deeper neural network (ResNet100), ArcFace further improves the TAR (@FAR=1e-4) to 94.2% and 95.6% on IJB-B and IJB-C, respectively. In Figure 9, we show the full ROC curves of the proposed ArcFace on IJB-B and IJB-C², and ArcFace achieves impressive performance even at FAR=1e-6, setting a new baseline.

[Figure 9. ROC curves on IJB-B and IJB-C for MS1MV2, ResNet100, ArcFace and VGG2, ResNet50, ArcFace; x-axis: False Positive Rate from 10^-6 to 10^-3; y-axis: True Positive Rate.]

² https://github.com/deepinsight/insightface/tree/master/Evaluation/IJB

Results on Trillion-Pairs. The Trillion-Pairs dataset [2] provides 1.58M images from Flickr as the gallery set and 274K images from 5.7k LFW [13] identities as the probe set. Every pair between the gallery and probe set is used for evaluation (0.4 trillion pairs in total). In Table 8, we compare the performance of ArcFace trained on different datasets. The proposed MS1MV2 dataset obviously boosts the performance compared to CASIA, and even slightly outperforms the DeepGlint-Face dataset, which has double the number of identities. When combining all identities from MS1MV2 and the Asian celebrities from DeepGlint, ArcFace achieves the best identification performance, 84.840% (@FPR=1e-3), and comparable verification performance to the most recent submission (CIGIT IRSEC) on the leaderboard.
Results on iQIYI-VID. The iQIYI-VID challenge [20] contains 565,372 video clips (training set 219,677, validation set 172,860, and test set 172,835) of 4,934 identities from iQIYI variety shows, films and television dramas. The length of each video ranges from 1 to 30 seconds. The dataset supplies multi-modal cues, including face, cloth, voice, gait and subtitles, for character identification. The challenge employs MAP@100 as the evaluation metric: with each person ID in the training set used as a query, MAP (Mean Average Precision) is the mean, over all queries, of the average precision of that person's videos retrieved from the test set.

As shown in Table 9, ArcFace trained on the combined MS1MV2 and Asian datasets with ResNet100 sets a high baseline (MAP = 79.80%). Based on the embedding feature for each training video, we train an additional three-layer fully connected network with a classification loss to get the customised feature descriptor on the iQIYI-VID dataset. The MLP learned on the iQIYI-VID training set further boosts the MAP, with additional gains from the model ensemble and context features from off-the-shelf models (Table 9).

Table 9. MAP (%) on the iQIYI-VID test set.

Method | MAP (%)
MS1MV2+Asian, ResNet100, ArcFace | 79.80
+ Ensemble | 88.26
+ Context | 88.65 (1st)
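The MAP@100 protocol can be made concrete with a short sketch (our own illustration, not the challenge's official scorer): for each query identity, compute the average precision over its top-100 retrieved videos, then average over all queries.

def average_precision(ranked_hits, num_relevant):
    # ranked_hits: 1 if the video at this rank belongs to the query identity, else 0.
    hits, total = 0, 0.0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            hits += 1
            total += hits / rank  # precision accumulated at each correct retrieval
    return total / max(num_relevant, 1)

def map_at_100(rankings, relevant_counts):
    # rankings: one 0/1 hit list per query identity, sorted by retrieval score.
    aps = [average_precision(r[:100], n) for r, n in zip(rankings, relevant_counts)]
    return sum(aps) / len(aps)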
5.1. Parallel Acceleration

Can we apply ArcFace on large-scale identities? Yes. The centre matrix W grows linearly with the identity number and quickly dominates GPU memory, so we employ a parallel acceleration strategy [44] to relieve this problem. We optimise our training code³ to easily and efficiently support million-level identities on a single machine by parallel acceleration on both the feature x (known as the general data-parallel strategy) and the centre W (which we name the centre-parallel strategy). As shown in Figure 10, our parallel acceleration on both the feature x and the centre W can significantly decrease the GPU memory consumption and accelerate the training speed. Even for one million identities trained on 8*1080ti (11GB) GPUs, our implementation (ResNet 50, batch size 8*64, feature dimension 512 and float point 32) can still run at 800 samples per second. Compared to the approximate acceleration method proposed in [47], our implementation has no performance drop.

Figure 10. Parallel acceleration on both feature x and centre W. [Panel (b): Training Speed; x-axis: Identity Number in the Training Data, up to 10^6.] Setting: ResNet 50, batch size 8*64, feature dimension 512, float point 32, GPU 8*P40 (24GB).

In Figure 11, we illustrate the main calculation steps of the parallel acceleration by simple matrix partition, which can be easily grasped and reproduced by beginners [3].

(1) Get the feature (x). Face embedding features are aggregated into one feature matrix (batch size 8*64 × feature dimension 512) from the 8 GPU cards. The size of the aggregated feature matrix is only 1MB, and the communication cost is negligible when we transfer the feature matrix.

(2) Get the similarity score matrix (score = xW). We copy the feature matrix onto each GPU, and concurrently multiply the feature matrix by the centre sub-matrix (feature dimension 512 × identity number 1M/8) to get the similarity score sub-matrix (batch size 512 × identity number 1M/8) on each GPU. The similarity score matrix goes forward to calculate the ArcFace loss and the gradient. Here, we conduct a simple matrix partition on the centre matrix and the similarity score matrix along the identity dimension, and there is no communication cost on the centre and similarity score matrices. Both the centre sub-matrix and the similarity score sub-matrix are only 256MB on each GPU.

(3) Get the gradient on the centre (dW). We transpose the feature matrix on each GPU, and concurrently multiply the transposed feature matrix by the gradient sub-matrix of the similarity score.

(4) Get the gradient on the feature (dx). We concurrently multiply the gradient sub-matrix of the similarity score by the transposed centre sub-matrix, and sum up the outputs from the 8 GPU cards to get the gradient on the feature x.

Considering the communication cost (MB level), our implementation of ArcFace can be easily and efficiently trained on millions of identities by clusters (a small NumPy sketch of steps (1)-(4) is given after Section 5.2).

Figure 11. Parallel calculation by simple matrix partition: (a) x; (b) score = xW; (d) dx = dscore W^T. Setting: ResNet 50, batch size 8*64, feature dimension 512, float point 32, identity number 1 million, GPU 8*1080ti (11GB). Communication cost: 1MB (feature x). Training speed: 800 samples/second.

³ https://github.com/deepinsight/insightface/tree/master/recognition
5.2. Feature Space Analysis

Is the 512-d hypersphere space large enough to hold large-scale identities? Theoretically, yes. Assuming that the identity centres W_j follow a spherical uniform distribution, the expectation of the nearest-neighbour separation [5] is

E[\theta(W_j)] \to n^{-\frac{2}{d-1}}\,\Gamma\!\left(1+\frac{1}{d-1}\right)\left(\frac{\Gamma\!\left(\frac{d}{2}\right)}{2\sqrt{\pi}\,(d-1)\,\Gamma\!\left(\frac{d-1}{2}\right)}\right)^{-\frac{1}{d-1}},  (7)

where d is the space dimension, n is the identity number, and θ(W_j) = min_{1≤i,j≤n, i≠j} arccos(W_i, W_j). In Figure 12, we give E[θ(W_j)] in the 128-d, 256-d and 512-d space with the class number ranging from 10K to 100M. The high-dimensional space is so large that E[θ(W_j)] decreases only slowly even as the class number increases exponentially.
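Eq. 7 is straightforward to evaluate numerically. The sketch below is our own (lgamma keeps the Γ ratios stable at d = 512); it prints the asymptotic expectation in degrees over the dimensions and class-number range plotted in Figure 12:

import math

def expected_separation_deg(n, d):
    # Asymptotic E[theta(W_j)] of Eq. 7 for n identity centres in R^d, in degrees.
    # log of the bracketed constant: Gamma(d/2) / (2*sqrt(pi)*(d-1)*Gamma((d-1)/2)).
    log_c = (math.lgamma(d / 2)
             - math.log(2 * math.sqrt(math.pi) * (d - 1))
             - math.lgamma((d - 1) / 2))
    log_e = (-2.0 / (d - 1)) * math.log(n) \
            + math.lgamma(1.0 + 1.0 / (d - 1)) \
            - log_c / (d - 1)
    return math.degrees(math.exp(log_e))

for d in (128, 256, 512):
    print(d, [round(expected_separation_deg(n, d), 1) for n in (10**4, 10**6, 10**8)])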
[Figure 12. E[θ(W_j)] (degrees) in the 128-d, 256-d and 512-d space; x-axis: Random Individual Numbers, 10^4 to 10^8; y-axis range 50-90.]
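Before concluding, the matrix-partition bookkeeping of Section 5.1, steps (1)-(4), can be made explicit with a small NumPy sketch (our own illustration; shapes are scaled down, while the paper's setting of batch size 8*64 = 512 with 1M identities in float32 gives the 1MB feature matrix and 256MB sub-matrices quoted there):

import numpy as np

k, batch, dim, n_ids = 8, 64, 512, 8000     # 8 simulated GPUs, toy sizes
x = np.random.randn(batch, dim).astype(np.float32)         # (1) aggregated feature matrix
W = [np.random.randn(dim, n_ids // k).astype(np.float32)   # centre sub-matrix per GPU
     for _ in range(k)]

# (2) each GPU multiplies the copied feature matrix by its own centre sub-matrix.
score = [x @ W_i for W_i in W]                             # similarity score sub-matrices

# The ArcFace loss on the partitioned scores would yield dscore; a stand-in here.
dscore = [np.random.randn(*s.shape).astype(np.float32) for s in score]

# (3) gradient on each centre sub-matrix: x^T @ dscore_i, with no cross-GPU traffic.
dW = [x.T @ ds for ds in dscore]

# (4) gradient on the feature: sum over GPUs of dscore_i @ W_i^T.
dx = sum(ds @ W_i.T for ds, W_i in zip(dscore, W))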
6. Conclusions

In this paper, we proposed an Additive Angular Margin Loss function, which can effectively enhance the discriminative power of feature embeddings learned via DCNNs for face recognition. In the most comprehensive experiments reported in the literature, we demonstrate that our method consistently outperforms the state-of-the-art.

References

[1] http://data.mxnet.io/models/.
[2] http://trillionpairs.deepglint.com/overview.
[3] Stanford CS class CS231n: Convolutional neural networks for visual recognition. http://cs231n.github.io/neural-networks-case-study/.
[4] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[5] J. S. Brauchart, A. B. Reznikov, E. B. Saff, I. H. Sloan, Y. G. Wang, and R. S. Womersley. Random point sets on the sphere: hole radii, covering, and separation. Experimental Mathematics, 2018.
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In FG, 2018.
[7] B. Chen, W. Deng, and J. Du. Noisy softmax: Improving the generalization ability of DCNN via postponing the early softmax saturation. In CVPR, 2017.
[8] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, 2015.
[9] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In CVPR Workshop, 2017.
[10] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[11] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. arXiv:1610.02915, 2016.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[15] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
[16] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv:1506.07310, 2015.
[17] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song. Learning towards minimum hyperspherical energy. In NIPS, 2018.
[18] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[19] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[20] Y. Liu, P. Shi, B. Peng, H. Yan, Y. Zhou, B. Han, Y. Zheng, C. Lin, J. Jiang, and Y. Fan. iqiyi-vid: A large dataset for multi-modal person identification. arXiv:1811.07548, 2018.
[21] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, and J. Cheney. Iarpa janus benchmark-c: Face dataset and protocol. In ICB, 2018.
[22] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected in-the-wild age database. In CVPR Workshop, 2017.
[23] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In ICIP, 2014.
[24] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
[26] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017.
[27] X. Qi and L. Zhang. Face recognition via centralized coordinate learning. arXiv:1801.05678, 2018.
[28] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507, 2017.
[29] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[30] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
[31] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[32] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[33] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[34] W. Wan, Y. Zhong, T. Li, and J. Chen. Rethinking feature distribution for loss functions in image classification. arXiv:1803.02988, 2018.
[35] F. Wang, W. Liu, H. Liu, and J. Cheng. Additive margin softmax for face verification. IEEE Signal Processing Letters, 2018.
[36] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: l2 hypersphere embedding for face verification. arXiv:1704.06369, 2017.
[37] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[38] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[39] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C. Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, and K. Allen. Iarpa janus benchmark-b face dataset. In CVPR Workshop, 2017.
[40] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, 2011.
[41] W. Xie, S. Li, and A. Zisserman. Comparator networks. In ECCV, 2018.
[42] W. Xie and A. Zisserman. Multicolumn networks for face recognition. In BMVC, 2018.
[43] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv:1411.7923, 2014.
[44] D. Zhang. A distributed training solution for face recognition. DeepGlint, 2018.
[45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 2016.
[46] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. In ICCV, 2017.
[47] X. Zhang, L. Yang, J. Yan, and D. Lin. Accelerated training for massive classification via dynamic class selection. In AAAI, 2018.
[48] T. Zheng and W. Deng. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Technical report, 2018.
[49] T. Zheng, W. Deng, and J. Hu. Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments. arXiv:1708.08197, 2017.