
ArcFace: Additive Angular Margin Loss for Deep Face Recognition

Jiankang Deng*              Jia Guo*           Niannan Xue
Imperial College London     InsightFace        Imperial College London
j.deng16@imperial.ac.uk     guojia@gmail.com   n.xue15@imperial.ac.uk

Stefanos Zafeiriou
Imperial College London
s.zafeiriou@imperial.ac.uk

* denotes equal contribution to this work.

arXiv:1801.07698v3 [cs.CV] 9 Feb 2019

Abstract
One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins into well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks, including a new large-scale image database with a trillion-level number of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training codes, pre-trained models and training logs¹, which will help reproduce the results in this paper.

Figure 1. Based on the centre [18] and feature [37] normalisation, all identities are distributed on a hypersphere. To enhance intra-class compactness and inter-class discrepancy, we consider four kinds of Geodesic Distance (GDis) constraint. (A) Margin-Loss: insert a geodesic distance margin between the sample and centres. (B) Intra-Loss: decrease the geodesic distance between the sample and the corresponding centre. (C) Inter-Loss: increase the geodesic distance between different centres. (D) Triplet-Loss: insert a geodesic distance margin between triplet samples. In this paper, we propose an Additive Angular Margin Loss (ArcFace), which exactly corresponds to the geodesic distance (Arc) margin penalty in (A), to enhance the discriminative power of the face recognition model. Extensive experimental results show that the strategy of (A) is the most effective.

1. Introduction

Face representation using Deep Convolutional Neural Network (DCNN) embedding is the method of choice for face recognition [32, 33, 29, 24]. DCNNs map the face image, typically after a pose normalisation step [45], into a feature that has small intra-class and large inter-class distance.

There are two main lines of research to train DCNNs for face recognition: those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier [33, 24, 6], and those that learn an embedding directly, such as the triplet loss [29]. Based on large-scale training data and elaborate DCNN architectures, both the softmax-loss-based methods [6] and the triplet-loss-based methods [29] can obtain excellent performance on face recognition.

¹ https://github.com/deepinsight/insightface
However, both the softmax loss and the triplet loss have some drawbacks. For the softmax loss: (1) the size of the linear transformation matrix W ∈ R^{d×n} increases linearly with the identity number n; (2) the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss: (1) there is a combinatorial explosion in the number of face triplets, especially for large-scale datasets, leading to a significant increase in the number of iteration steps; (2) semi-hard sample mining is a quite difficult problem for effective model training.

Several variants [38, 9, 46, 18, 37, 35, 7, 34, 27] have been proposed to enhance the discriminative power of the softmax loss. Wen et al. [38] pioneered the centre loss, the Euclidean distance between each feature vector and its class centre, to obtain intra-class compactness while the inter-class dispersion is guaranteed by the joint penalisation of the softmax loss. Nevertheless, updating the actual centres during training is extremely difficult as the number of face classes available for training has recently dramatically increased.

By observing that the weights of the last fully connected layer of a classification DCNN trained on the softmax loss bear conceptual similarities with the centres of each face class, the works in [18, 19] proposed a multiplicative angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to a better discriminative power of the trained model. Even though SphereFace [18] introduced the important idea of angular margin, their loss function required a series of approximations in order to be computed, which resulted in unstable training of the network. In order to stabilise training, they proposed a hybrid loss function which includes the standard softmax loss. Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence. CosFace [37, 35] directly adds a cosine margin penalty to the target logit, which obtains better performance compared to SphereFace but admits a much easier implementation and relieves the need for joint supervision from the softmax loss.

In this paper, we propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated in Figure 2, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. The advantages of the proposed ArcFace can be summarised as follows:

Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalised hypersphere. We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.

Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.

Easy. ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in computational-graph-based deep learning frameworks, e.g. MxNet [8], Pytorch [25] and Tensorflow [4]. Furthermore, contrary to the works in [18, 19], ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training dataset.

Efficient. ArcFace only adds negligible computational complexity during training. Current GPUs can easily support millions of identities for training, and the model parallel strategy can easily support many more identities.

2. Proposed Approach

2.1. ArcFace

The most widely used classification loss function, softmax loss, is presented as follows:

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}, \qquad (1)$$

where x_i ∈ R^d denotes the deep feature of the i-th sample, belonging to the y_i-th class. The embedding feature dimension d is set to 512 in this paper following [38, 46, 18, 37]. W_j ∈ R^d denotes the j-th column of the weight W ∈ R^{d×n} and b_j ∈ R^n is the bias term. The batch size and the class number are N and n, respectively. The traditional softmax loss is widely used in deep face recognition [24, 6]. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations [30, 48] and age gaps [22, 49]) and large-scale test scenarios (e.g. million [15, 39, 21] or trillion pairs [2]).

For simplicity, we fix the bias b_j = 0 as in [18]. Then, we transform the logit [26] as W_j^T x_i = ||W_j|| ||x_i|| cos θ_j, where θ_j is the angle between the weight W_j and the feature x_i. Following [18, 37, 36], we fix the individual weight ||W_j|| = 1 by l2 normalisation. Following [28, 37, 36, 35], we also fix the embedding feature ||x_i|| by l2 normalisation and re-scale it to s.
Figure 2. Training a DCNN for face recognition supervised by the ArcFace loss. Based on the feature x_i and weight W normalisation, we get the cos θ_j (logit) for each class as W_j^T x_i. We calculate arccos(cos θ_{y_i}) to get the angle between the feature x_i and the ground-truth weight W_{y_i}. In fact, W_j provides a kind of centre for each class. Then, we add an angular margin penalty m on the target (ground-truth) angle θ_{y_i}. After that, we calculate cos(θ_{y_i} + m) and multiply all logits by the feature scale s. The logits then go through the softmax function and contribute to the cross-entropy loss.
Algorithm 1 The Pseudo-code of ArcFace on MxNet
Input: Feature Scale s, Margin Parameter m in Eq. 3, Class Number n, Ground-Truth ID gt.
1. x = mx.symbol.L2Normalization(x, mode='instance')
2. W = mx.symbol.L2Normalization(W, mode='instance')
3. fc7 = mx.sym.FullyConnected(data=x, weight=W, no_bias=True, num_hidden=n)
4. original_target_logit = mx.sym.pick(fc7, gt, axis=1)
5. theta = mx.sym.arccos(original_target_logit)
6. marginal_target_logit = mx.sym.cos(theta + m)
7. one_hot = mx.sym.one_hot(gt, depth=n, on_value=1.0, off_value=0.0)
8. fc7 = fc7 + mx.sym.broadcast_mul(one_hot, mx.sym.expand_dims(marginal_target_logit - original_target_logit, 1))
9. fc7 = fc7 * s
Output: Class-wise affinity score fc7.
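For readers outside MxNet, a minimal PyTorch sketch of the same nine steps might look as follows. This is our own illustrative translation of Algorithm 1, not the released code; the clamping epsilon is a numerical-safety detail we add.

```python
import torch
import torch.nn.functional as F

def arcface_logits(x, W, labels, s=64.0, m=0.5):
    """Margin-injected logits, mirroring the nine steps of Algorithm 1.

    x: (N, d) embeddings, W: (d, n) centre weights, labels: (N,) class ids.
    Feed the result to a standard cross-entropy loss.
    """
    x = F.normalize(x, dim=1)                               # 1. l2-normalise features
    W = F.normalize(W, dim=0)                               # 2. l2-normalise centres
    cos = x @ W                                             # 3. cos(theta_j) per class
    target = cos.gather(1, labels[:, None])                 # 4. pick the target logit
    theta = torch.acos(target.clamp(-1 + 1e-7, 1 - 1e-7))   # 5. theta_{y_i}
    marginal = torch.cos(theta + m)                         # 6. cos(theta_{y_i} + m)
    one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)  # 7. one-hot mask
    cos = cos + one_hot * (marginal - target)               # 8. swap in target logit
    return cos * s                                          # 9. re-scale by s

# usage: loss = F.cross_entropy(arcface_logits(emb, W, labels), labels)
```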

The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of s.

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j=1,j\neq y_i}^{n}e^{s\cos\theta_j}}. \qquad (2)$$

As the embedding features are distributed around each feature centre on the hypersphere, we add an additive angular margin penalty m between x_i and W_{y_i} to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty in the normalised hypersphere, we name our method ArcFace.

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,j\neq y_i}^{n}e^{s\cos\theta_j}}. \qquad (3)$$
We select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As illustrated in Figure 3, the softmax loss provides roughly separable feature embeddings but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss obviously enforces a more evident gap between the nearest classes.

Figure 3. Toy examples under the softmax and ArcFace loss on 8 identities with 2D features. Dots indicate samples and lines refer to the centre direction of each identity. Based on the feature normalisation, all face features are pushed to the arc space with a fixed radius. The geodesic distance gap between closest classes becomes evident as the additive angular margin penalty is incorporated.

2.2. Comparison with SphereFace and CosFace

Numerical Similarity. In SphereFace [18, 19], ArcFace, and CosFace [37, 35], three different kinds of margin penalty are proposed: multiplicative angular margin m1, additive angular margin m2, and additive cosine margin m3, respectively. From the view of numerical analysis, the different margin penalties, whether added on the angle [18] or on the cosine [37], all enforce intra-class compactness and inter-class diversity by penalising the target logit [26]. In Figure 4(b), we plot the target logit curves of SphereFace, ArcFace and CosFace under their best margin settings. We only show these target logit curves within [20°, 100°] because the angles between W_{y_i} and x_i start from around 90° (random initialisation) and end at around 30° during ArcFace training, as shown in Figure 4(a). Intuitively, there are three factors in the target logit curves that affect the performance: the starting point, the end point and the slope.
[Figure 4: (a) θ_j distributions (counts versus the angle between the feature and target centre) at the start, middle and end of ArcFace training; (b) target logit curves over [20°, 100°]; plot panels omitted.]
Figure 4. Target logit analysis. (a) θ_j distributions from start to end during ArcFace training. (b) Target logit curves, given as (m1, m2, m3), for Softmax (1.00, 0.00, 0.00), SphereFace (m = 4, λ = 5), SphereFace (1.35, 0.00, 0.00), ArcFace (1.00, 0.50, 0.00), CosFace (1.00, 0.00, 0.35), CM1 (1.00, 0.30, 0.20) and CM2 (0.90, 0.40, 0.15), i.e. the combined margin penalty cos(m1·θ + m2) − m3.

By combining all of the margin penalties, we implement SphereFace, ArcFace and CosFace in a united framework with m1, m2 and m3 as the hyper-parameters:

$$L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(m_1\theta_{y_i}+m_2)-m_3)}}{e^{s(\cos(m_1\theta_{y_i}+m_2)-m_3)}+\sum_{j=1,j\neq y_i}^{n}e^{s\cos\theta_j}}. \qquad (4)$$

As shown in Figure 4(b), by combining all of the above-mentioned margins (cos(m1·θ + m2) − m3), we can easily get some other target logit curves which also have high performance.
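The combined target logit is a one-line function; a small sketch of how the named settings instantiate it (the printed values are illustrative evaluations, not results from the paper):

```python
import math

def combined_target_logit(theta, m1=1.0, m2=0.0, m3=0.0):
    """Target logit of Eq. 4: cos(m1*theta + m2) - m3.
    (1.0, 0.5, 0.0) gives ArcFace, (1.35, 0.0, 0.0) our SphereFace
    variant, and (1.0, 0.0, 0.35) CosFace."""
    return math.cos(m1 * theta + m2) - m3

theta = math.radians(60)  # e.g. evaluating the Figure 4(b) curves at 60 degrees
for name, ms in [("ArcFace", (1.0, 0.5, 0.0)), ("SphereFace", (1.35, 0.0, 0.0)),
                 ("CosFace", (1.0, 0.0, 0.35)), ("CM1", (1.0, 0.3, 0.2))]:
    print(name, round(combined_target_logit(theta, *ms), 3))
```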
Geometric Difference. Despite the numerical similarity between ArcFace and previous works, the proposed additive angular margin has a better geometric attribute, as the angular margin has an exact correspondence to the geodesic distance. As illustrated in Figure 5, we compare the decision boundaries under the binary classification case. The proposed ArcFace has a constant linear angular margin throughout the whole interval. By contrast, SphereFace and CosFace only have a nonlinear angular margin.

Figure 5. Decision margins of different loss functions under the binary classification case. The dashed line represents the decision boundary, and the grey areas are the decision margins.

The minor difference in margin designs can have a "butterfly effect" on the model training. For example, the original SphereFace [18] employs an annealing optimisation strategy. To avoid divergence at the beginning of training, joint supervision from softmax is used in SphereFace to weaken the multiplicative margin penalty. We implement a new version of SphereFace without the integer requirement on the margin by employing the arc-cosine function instead of using the complex double angle formula. In our implementation, we find that m = 1.35 can obtain similar performance compared to the original SphereFace without any convergence difficulty.

2.3. Comparison with Other Losses

Other loss functions can be designed based on the angular representation of features and weight-vectors. For example, we can design a loss to enforce intra-class compactness and inter-class discrepancy on the hypersphere. As shown in Figure 1, we compare with three other losses in this paper.

Intra-Loss is designed to improve the intra-class compactness by decreasing the angle/arc between the sample and the ground-truth centre:

$$L_5 = L_2 + \frac{1}{\pi N}\sum_{i=1}^{N}\theta_{y_i}. \qquad (5)$$

Inter-Loss targets enhancing the inter-class discrepancy by increasing the angle/arc between different centres:

$$L_6 = L_2 - \frac{1}{\pi N(n-1)}\sum_{i=1}^{N}\sum_{j=1,j\neq y_i}^{n}\arccos(W_{y_i}^T W_j). \qquad (6)$$

The Inter-Loss here is a special case of the Minimum Hyper-spherical Energy (MHE) method [17]. In [17], both hidden layers and output layers are regularised by MHE. In the MHE paper, a special case of loss function was also proposed by combining the SphereFace loss with the MHE loss on the last layer of the network.

Triplet-Loss aims at enlarging the angle/arc margin between triplet samples. In FaceNet [29], a Euclidean margin is applied on the normalised features. Here, we employ the triplet-loss with the angular representation of our features as arccos(x_i^{pos} · x_i) + m ≤ arccos(x_i^{neg} · x_i).
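Under our reading of Eqs. 5 and 6, the two auxiliary penalty terms might be computed as follows (a sketch with hypothetical helper names; x are raw embeddings and W the centre matrix; the first term is added to L2, the second subtracted):

```python
import math
import torch
import torch.nn.functional as F

def intra_penalty(x, W, labels):
    """Eq. 5 penalty term (sketch): mean of theta_{y_i} / pi."""
    cos = F.normalize(x, dim=1) @ F.normalize(W, dim=0)
    theta = torch.acos(cos.gather(1, labels[:, None]).clamp(-1 + 1e-7, 1 - 1e-7))
    return theta.mean() / math.pi

def inter_penalty(W, labels):
    """Eq. 6 penalty term (sketch): mean angle between W_{y_i} and all
    other centres, scaled by 1/pi."""
    Wn = F.normalize(W, dim=0)                               # (d, n)
    cos = (Wn[:, labels].T @ Wn).clamp(-1 + 1e-7, 1 - 1e-7)  # (N, n)
    theta = torch.acos(cos)
    N, n = theta.shape
    mask = F.one_hot(labels, n).bool()                       # drop the j = y_i terms
    return theta.masked_fill(mask, 0.0).sum() / (math.pi * N * (n - 1))
```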
CosFace only have a nonlinear angular margin.
3. Experiments
3.1. Implementation Details
Datasets. As given in Table 1, we separately employ CASIA [43], VGGFace2 [6], MS1MV2 and DeepGlint-Face (including MS1M-DeepGlint and Asian-DeepGlint) [2] as our training data in order to conduct a fair comparison with other methods. Please note that the proposed MS1MV2 is a semi-automatically refined version of the MS-Celeb-1M dataset [10]. To the best of our knowledge, we are the first to employ ethnicity-specific annotators for large-scale face image annotation, as the boundary cases (e.g. hard samples and noisy samples) are very hard to distinguish if the annotator is not familiar with the identity. During training, we use efficient face verification datasets (e.g. LFW [13], CFP-FP [30], AgeDB-30 [22]) to check the improvement from different settings. Besides the most widely used LFW [13] and YTF [40] datasets, we also report the performance of ArcFace on the recent large-pose and large-age datasets (e.g. CPLFW [48] and CALFW [49]). We also extensively test the proposed ArcFace on large-scale image datasets (e.g. MegaFace [15], IJB-B [39], IJB-C [21] and Trillion-Pairs [2]) and video datasets (iQIYI-VID [20]).
Datasets                #Identity   #Image/Video
CASIA [43]              10K         0.5M
VGGFace2 [6]            9.1K        3.3M
MS1MV2                  85K         5.8M
MS1M-DeepGlint [2]      87K         3.9M
Asian-DeepGlint [2]     94K         2.83M
LFW [13]                5,749       13,233
CFP-FP [30]             500         7,000
AgeDB-30 [22]           568         16,488
CPLFW [48]              5,749       11,652
CALFW [49]              5,749       12,174
YTF [40]                1,595       3,425
MegaFace [15]           530 (P)     1M (G)
IJB-B [39]              1,845       76.8K
IJB-C [21]              3,531       148.8K
Trillion-Pairs [2]      5,749 (P)   1.58M (G)
iQIYI-VID [20]          4,934       172,835

Table 1. Face datasets for training and testing. "(P)" and "(G)" refer to the probe and gallery set, respectively.
Experimental Settings. For data preprocessing, we follow the recent papers [18, 37] to generate normalised face crops (112 × 112) by utilising five facial points. For the embedding network, we employ the widely used CNN architectures ResNet50 and ResNet100 [12, 11]. After the last convolutional layer, we explore the BN [14]-Dropout [31]-FC-BN structure to get the final 512-D embedding feature. In this paper, we use ([training dataset, network structure, loss]) to facilitate understanding of the experimental settings.

We follow [37] to set the feature scale s to 64 and choose the angular margin m of ArcFace at 0.5. All experiments in this paper are implemented with MXNet [8]. We set the batch size to 512 and train models on four NVIDIA Tesla P40 (24GB) GPUs. On CASIA, the learning rate starts from 0.1 and is divided by 10 at 20K and 28K iterations; the training process finishes at 32K iterations. On MS1MV2, we divide the learning rate at 100K and 160K iterations and finish at 180K iterations. We set momentum to 0.9 and weight decay to 5e-4.
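For concreteness, the CASIA recipe above might be wired up as follows. This is a PyTorch sketch under our assumptions: `model`, `loader` and the `arcface_logits` helper from earlier are hypothetical names, and the centre matrix `W` is assumed to be registered among the model parameters.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# lr divided by 10 at 20K and 28K iterations; training stops at 32K
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20_000, 28_000], gamma=0.1)

for step, (images, labels) in enumerate(loader):
    if step == 32_000:
        break
    logits = arcface_logits(model(images), W, labels, s=64.0, m=0.5)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # stepped per iteration, matching the schedule above
```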
During testing, we only keep the feature embedding network without the fully connected layer (160MB for ResNet50 and 250MB for ResNet100) and extract the 512-D features (8.9 ms/face for ResNet50 and 15.4 ms/face for ResNet100) for each normalised face. To get the embedding features for templates (e.g. IJB-B and IJB-C) or videos (e.g. YTF and iQIYI-VID), we simply calculate the feature centre of all images from the template or all frames from the video. Note that overlap identities between the training set and the test set are removed for strict evaluation, and we only use a single crop for all testing.
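The feature-centre aggregation is a one-liner; a sketch (the trailing l2-normalisation is our assumption, for cosine comparison):

```python
import torch
import torch.nn.functional as F

def template_embedding(frame_features):
    """Average the (N, 512) per-frame features into one template/video
    descriptor, then l2-normalise it."""
    centre = frame_features.mean(dim=0)
    return F.normalize(centre, dim=0)
```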
3.2. Ablation Study on Losses

In Table 2, we first explore the angular margin setting for ArcFace on the CASIA dataset with ResNet50. The best margin observed in our experiments was 0.5. Using the proposed combined margin framework in Eq. 4, it is easier to set the margins of SphereFace and CosFace, which we found to have optimal performance when set at 1.35 and 0.35, respectively. Our implementations of both SphereFace and CosFace lead to excellent performance without any observed difficulty in convergence. The proposed ArcFace achieves the highest verification accuracy on all three test sets. In addition, we performed extensive experiments with the combined margin framework (some of the best performance was observed for CM1 (1, 0.3, 0.2) and CM2 (0.9, 0.4, 0.15)), guided by the target logit curves in Figure 4(b). The combined margin framework led to better performance than individual SphereFace and CosFace but was upper-bounded by the performance of ArcFace.

Besides the comparison with margin-based methods, we conduct a further comparison between ArcFace and other losses which aim at enforcing intra-class compactness (Eq. 5) and inter-class discrepancy (Eq. 6). As the baseline we have chosen the softmax loss, and we have observed a performance drop on CFP-FP and AgeDB-30 after weight and feature normalisation. By combining the softmax with the intra-class loss, the performance improves on CFP-FP and AgeDB-30. However, combining the softmax with the inter-class loss only slightly improves the accuracy. The fact that Triplet-loss outperforms Norm-Softmax loss indicates the importance of margin in improving the performance. However, employing the margin penalty within triplet samples is less effective than inserting the margin between samples and centres as in ArcFace. Finally, we incorporate the Intra-loss, Inter-loss and Triplet-loss into ArcFace, but no improvement is observed, which leads us to believe that ArcFace is already enforcing intra-class compactness, inter-class discrepancy and classification margin.

Loss Functions          LFW     CFP-FP  AgeDB-30
ArcFace (0.4)           99.53   95.41   94.98
ArcFace (0.45)          99.46   95.47   94.93
ArcFace (0.5)           99.53   95.56   95.15
ArcFace (0.55)          99.41   95.32   95.05
SphereFace [18]         99.42   -       -
SphereFace (1.35)       99.11   94.38   91.70
CosFace [37]            99.33   -       -
CosFace (0.35)          99.51   95.44   94.56
CM1 (1, 0.3, 0.2)       99.48   95.12   94.38
CM2 (0.9, 0.4, 0.15)    99.50   95.24   94.86
Softmax                 99.08   94.39   92.33
Norm-Softmax (NS)       98.56   89.79   88.72
NS+Intra                98.75   93.81   90.92
NS+Inter                98.68   90.67   89.50
NS+Intra+Inter          98.73   94.00   91.41
Triplet (0.35)          98.98   91.90   89.98
ArcFace+Intra           99.45   95.37   94.73
ArcFace+Inter           99.43   95.25   94.55
ArcFace+Intra+Inter     99.43   95.42   95.10
ArcFace+Triplet         99.50   95.51   94.40

Table 2. Verification results (%) of different loss functions ([CASIA, ResNet50, loss*]).
To get a better understanding of ArcFace's superiority, we give the detailed angle statistics on the training data (CASIA) and test data (LFW) under different losses in Table 3. We find that (1) W_j is nearly synchronised with the embedding feature centre for ArcFace (14.29°), but there is an obvious deviation (44.26°) between W_j and the embedding feature centre for Norm-Softmax. Therefore, the angles between the W_j's cannot absolutely represent the inter-class discrepancy on the training data; the embedding feature centres calculated by the trained network are more representative. (2) Intra-Loss can effectively compress intra-class variations but also brings in smaller inter-class angles. (3) Inter-Loss can slightly increase the inter-class discrepancy on both W (directly) and the embedding network (indirectly), but also raises intra-class angles. (4) ArcFace already has very good intra-class compactness and inter-class discrepancy. (5) Triplet-Loss has similar intra-class compactness but inferior inter-class discrepancy compared to ArcFace. In addition, ArcFace has a more distinct margin than Triplet-Loss on the test set, as illustrated in Figure 6.

          NS      ArcFace  IntraL  InterL  TripletL
W-EC      44.26   14.29    8.83    46.85   -
W-Inter   69.66   71.61    31.34   75.66   -
Intra1    50.50   38.45    17.50   52.74   41.19
Inter1    59.23   65.83    24.07   62.40   50.23
Intra2    33.97   28.05    12.94   35.38   27.42
Inter2    65.60   66.55    26.28   67.90   55.94

Table 3. The angle statistics under different losses ([CASIA, ResNet50, loss*]). Each column denotes one particular loss. "W-EC" refers to the mean of angles between W_j and the corresponding embedding feature centre. "W-Inter" refers to the mean of minimum angles between the W_j's. "Intra1" and "Intra2" refer to the mean of angles between x_i and the embedding feature centre on CASIA and LFW, respectively. "Inter1" and "Inter2" refer to the mean of minimum angles between embedding feature centres on CASIA and LFW, respectively.
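The Table 3 statistics are straightforward to recompute from a trained model. A sketch under our reading of the definitions (hypothetical helper; assumes every class appears at least once in x):

```python
import torch
import torch.nn.functional as F

def angle_stats(x, W, labels):
    """Table 3 statistics (degrees): W-EC, W-Inter, Intra, Inter."""
    deg = lambda c: torch.rad2deg(torch.acos(c.clamp(-1 + 1e-7, 1 - 1e-7)))
    n = W.shape[1]
    Wn = F.normalize(W, dim=0).T                      # (n, d) weight centres
    xn = F.normalize(x, dim=1)                        # (N, d) embeddings
    ec = torch.stack([F.normalize(xn[labels == j].mean(0), dim=0)
                      for j in range(n)])             # embedding feature centres

    def min_angles(c):                                # per-centre nearest-neighbour angle
        eye = torch.eye(len(c), dtype=torch.bool)
        return deg((c @ c.T).masked_fill(eye, -1.0).max(1).values)

    return {"W-EC": deg((Wn * ec).sum(1)).mean(),
            "W-Inter": min_angles(Wn).mean(),
            "Intra": deg((xn * ec[labels]).sum(1)).mean(),
            "Inter": min_angles(ec).mean()}
```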
[Figure 6: angle histograms of positive and negative pairs for (a) ArcFace and (b) Triplet-Loss; plot panels omitted.]
Figure 6. Angle distributions of all positive pairs and random negative pairs (~0.5M) from LFW. Red area indicates positive pairs while blue indicates negative pairs. All angles are represented in degrees. ([CASIA, ResNet50, loss*])

3.3. Evaluation Results

Results on LFW, YTF, CALFW and CPLFW. The LFW [13] and YTF [40] datasets are the most widely used benchmarks for unconstrained face verification on images and videos. In this paper, we follow the unrestricted with labelled outside data protocol to report the performance. As reported in Table 4, ArcFace trained on MS1MV2 with ResNet100 beats the baselines (e.g. SphereFace [18] and CosFace [37]) by a significant margin on both LFW and YTF, which shows that the additive angular margin penalty can notably enhance the discriminative power of deeply learned features, demonstrating the effectiveness of ArcFace.

Method                      #Image  LFW     YTF
DeepID [32]                 0.2M    99.47   93.20
Deep Face [33]              4.4M    97.35   91.4
VGG Face [24]               2.6M    98.95   97.30
FaceNet [29]                200M    99.63   95.10
Baidu [16]                  1.3M    99.13   -
Center Loss [38]            0.7M    99.28   94.9
Range Loss [46]             5M      99.52   93.70
Marginal Loss [9]           3.8M    99.48   95.98
SphereFace [18]             0.5M    99.42   95.0
SphereFace+ [17]            0.5M    99.47   -
CosFace [37]                5M      99.73   97.6
MS1MV2, R100, ArcFace       5.8M    99.83   98.02

Table 4. Verification performance (%) of different methods on LFW and YTF.

Besides the LFW and YTF datasets, we also report the performance of ArcFace on the recently introduced CPLFW [48] and CALFW [49] datasets, which show higher pose and age variations with the same identities as LFW. Among all of the open-sourced face recognition models, the ArcFace model is evaluated as the top-ranked face recognition model, as shown in Table 5, outperforming counterparts by an obvious margin.
In Figure 7, we illustrate the angle distributions (predicted by the ArcFace model trained on MS1MV2 with ResNet100) of both positive and negative pairs on LFW, CFP-FP, AgeDB-30, YTF, CPLFW and CALFW. We can clearly find that the intra-variance due to pose and age gaps significantly increases the angles between positive pairs, thus making the best threshold for face verification increase and generating more confusion regions in the histogram.

Method                      LFW     CALFW   CPLFW
HUMAN-Individual            97.27   82.32   81.21
HUMAN-Fusion                99.85   86.50   85.24
Center Loss [38]            98.75   85.48   77.48
SphereFace [18]             99.27   90.30   81.40
VGGFace2 [6]                99.43   90.57   84.00
MS1MV2, R100, ArcFace       99.82   95.45   92.08

Table 5. Verification performance (%) of open-sourced face recognition models on LFW, CALFW and CPLFW.
[Figure 7: angle histograms of positive and negative pairs on (a) LFW (99.83%), (b) CFP-FP (98.37%), (c) AgeDB (98.15%), (d) YTF (98.02%), (e) CPLFW (92.08%) and (f) CALFW (95.45%); plot panels omitted.]
Figure 7. Angle distributions of both positive and negative pairs on LFW, CFP-FP, AgeDB-30, YTF, CPLFW and CALFW. Red area indicates positive pairs while blue indicates negative pairs. All angles are represented in degrees. ([MS1MV2, ResNet100, ArcFace])

Results on MegaFace. The MegaFace dataset [15] includes 1M images of 690K different individuals as the gallery set and 100K photos of 530 unique individuals from FaceScrub [23] as the probe set. On MegaFace, there are two testing scenarios (identification and verification) under two protocols (large or small training set). The training set is defined as large if it contains more than 0.5M images. For a fair comparison, we train ArcFace on CASIA and MS1MV2 under the small protocol and large protocol, respectively.

In Table 6, ArcFace trained on CASIA achieves the best single-model identification and verification performance, not only surpassing the strong baselines (e.g. SphereFace [18] and CosFace [37]) but also outperforming other published methods [38, 17].

As we observed an obvious performance gap between identification and verification, we performed a thorough manual check of the whole MegaFace dataset and found many face images with wrong labels, which significantly affects the performance. Therefore, we manually refined the whole MegaFace dataset and report the correct performance of ArcFace on MegaFace. On the refined MegaFace, ArcFace still clearly outperforms CosFace and achieves the best performance on both verification and identification.

Under the large protocol, ArcFace surpasses FaceNet [29] by a clear margin and obtains comparable results on identification and better results on verification compared to CosFace [37]. Since CosFace employs private training data, we retrain CosFace on our MS1MV2 dataset with ResNet100. Under this fair comparison, ArcFace shows superiority over CosFace and forms an upper envelope of CosFace under both identification and verification scenarios, as shown in Figure 8.

Methods                         Id (%)   Ver (%)
Softmax [18]                    54.85    65.92
Contrastive Loss [18, 32]       65.21    78.86
Triplet [18, 29]                64.79    78.32
Center Loss [38]                65.49    80.14
SphereFace [18]                 72.729   85.561
CosFace [37]                    77.11    89.88
AM-Softmax [35]                 72.47    84.44
SphereFace+ [17]                73.03    -
CASIA, R50, ArcFace             77.50    92.34
CASIA, R50, ArcFace, R          91.75    93.69
FaceNet [29]                    70.49    86.47
CosFace [37]                    82.72    96.65
MS1MV2, R100, ArcFace           81.03    96.98
MS1MV2, R100, CosFace           80.56    96.56
MS1MV2, R100, ArcFace, R        98.35    98.48
MS1MV2, R100, CosFace, R        97.91    97.91

Table 6. Face identification and verification evaluation of different methods on MegaFace Challenge 1 using FaceScrub as the probe set. "Id" refers to the rank-1 face identification accuracy with 1M distractors, and "Ver" refers to the face verification TAR at 10^-6 FAR. "R" refers to data refinement on both the probe set and the 1M distractors. ArcFace obtains state-of-the-art performance under both the small and large protocols.
ResNet100. Under fair comparison, ArcFace shows supe-
[Figure 8: (a) CMC and (b) ROC curves for [CASIA, ResNet50, ArcFace] and [MS1MV2, ResNet100, ArcFace/CosFace] on the original and refined MegaFace; plot panels omitted.]
Figure 8. CMC and ROC curves of different models on MegaFace. Results are evaluated on both the original and refined MegaFace dataset.
Results on IJB-B and IJB-C. The IJB-B dataset [39] contains 1,845 subjects with 21.8K still images and 55K frames from 7,011 videos. In total, there are 12,115 templates with 10,270 genuine matches and 8M impostor matches. The IJB-C dataset [21] is a further extension of IJB-B, having 3,531 subjects with 31.3K still images and 117.5K frames from 11,779 videos. In total, there are 23,124 templates with 19,557 genuine matches and 15,639K impostor matches.

Method                       IJB-B   IJB-C
ResNet50 [6]                 0.784   0.825
SENet50 [6]                  0.800   0.840
ResNet50+SENet50 [6]         0.800   0.841
MN-v [42]                    0.818   0.852
MN-vc [42]                   0.831   0.862
ResNet50+DCN(Kpts) [41]      0.850   0.867
ResNet50+DCN(Divs) [41]      0.841   0.880
SENet50+DCN(Kpts) [41]       0.846   0.874
SENet50+DCN(Divs) [41]       0.849   0.885
VGG2, R50, ArcFace           0.898   0.921
MS1MV2, R100, ArcFace        0.942   0.956

Table 7. 1:1 verification TAR (@FAR=1e-4) on the IJB-B and IJB-C datasets.
On the IJB-B and IJB-C datasets, we employ the VGG2 dataset as the training data and ResNet50 as the embedding network to train ArcFace for a fair comparison with the most recent methods [6, 42, 41]. In Table 7, we compare the TAR (@FAR=1e-4) of ArcFace with the previous state-of-the-art models [6, 42, 41]. ArcFace can obviously boost the performance on both IJB-B and IJB-C (about 3~5%, which is a significant reduction in the error). Drawing support from more training data (MS1MV2) and a deeper neural network (ResNet100), ArcFace can further improve the TAR (@FAR=1e-4) to 94.2% and 95.6% on IJB-B and IJB-C, respectively. In Figure 9, we show the full ROC curves of the proposed ArcFace on IJB-B and IJB-C², and ArcFace achieves impressive performance even at FAR=1e-6, setting a new baseline.

² https://github.com/deepinsight/insightface/tree/master/Evaluation/IJB

[Figure 9: ROC curves on (a) IJB-B and (b) IJB-C for [MS1MV2, ResNet100, ArcFace] and [VGG2, ResNet50, ArcFace]; plot panels omitted.]
Figure 9. ROC curves of the 1:1 verification protocol on the IJB-B and IJB-C datasets.

Results on Trillion-Pairs. The Trillion-Pairs dataset [2] provides 1.58M images from Flickr as the gallery set and 274K images from 5.7K LFW [13] identities as the probe set. Every pair between the gallery and probe set is used for evaluation (0.4 trillion pairs in total). In Table 8, we compare the performance of ArcFace trained on different datasets. The proposed MS1MV2 dataset obviously boosts the performance compared to CASIA and even slightly outperforms the DeepGlint-Face dataset, which has double the number of identities. When combining all identities from MS1MV2 and the Asian celebrities from DeepGlint, ArcFace achieves the best identification performance of 84.840% (@FPR=1e-3) and comparable verification performance compared to the most recent submission (CIGIT IRSEC) on the leaderboard.

Method            Id (@FPR=1e-3)   Ver (@FPR=1e-9)
CASIA             26.643           21.452
MS1MV2            80.968           78.600
DeepGlint-Face    80.331           78.586
MS1MV2+Asian      84.840 (1st)     80.540
CIGIT IRSEC       84.234 (2nd)     81.558 (1st)

Table 8. Identification and verification results (%) on the Trillion-Pairs dataset. ([Dataset*, ResNet100, ArcFace])

Results on iQIYI-VID. The iQIYI-VID challenge [20] contains 565,372 video clips (training set 219,677, validation set 172,860, and test set 172,835) of 4,934 identities from iQIYI variety shows, films and television dramas. The length of each video ranges from 1 to 30 seconds. This dataset supplies multi-modal cues, including face, cloth, voice, gait and subtitles, for character identification. The iQIYI-VID dataset employs MAP@100 as the evaluation indicator: for each person ID in the training set (used as the query), MAP (Mean Average Precision) is the mean of the average precision of the corresponding videos retrieved from the test set.
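A sketch of this metric for a single query, under our reading of the protocol (the truncation and normalisation details are our assumptions):

```python
def map_at_100(retrieved, relevant):
    """Average precision at 100 for one query identity: `retrieved` is the
    ranked list of predicted video ids, `relevant` the set of ground-truth
    video ids. The challenge score would be the mean over all queries."""
    hits, precisions = 0, []
    for rank, vid in enumerate(retrieved[:100], start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(relevant), 100) if relevant else 0.0
```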
As shown in Table 9, ArcFace trained on the combined MS1MV2 and Asian datasets with ResNet100 sets a high baseline (MAP = 79.80%). Based on the embedding feature for each training video, we train an additional three-layer fully connected network with a classification loss to get a customised feature descriptor on the iQIYI-VID dataset. The MLP learned on the iQIYI-VID training set significantly boosts the MAP by 6.60%. Drawing support from the model ensemble and context features from an off-the-shelf object and scene classifier [1], our final result surpasses the runner-up by a clear margin (0.99%).
Method                         MAP (%)
MS1MV2+Asian, R100, ArcFace    79.80
+ MLP                          86.40
+ Ensemble                     88.26
+ Context                      88.65 (1st)
Other Participant              87.66 (2nd)

Table 9. MAP of our method on the iQIYI-VID test set. "MLP" refers to a three-layer fully connected network trained on the iQIYI-VID training data.

4. Conclusions

In this paper, we proposed an Additive Angular Margin Loss function, which can effectively enhance the discriminative power of feature embeddings learned via DCNNs for face recognition. In the most comprehensive experiments reported in the literature, we demonstrate that our method consistently outperforms the state-of-the-art. Code and details have been released under the MIT license.
5. Appendix

5.1. Parallel Acceleration

Can we apply ArcFace to large-scale identities? Yes, millions of identities are not a problem.

The concept of the centre W is indispensable in ArcFace, but the parameter size of W is proportional to the number of classes. When there are millions of identities in the training data, ArcFace confronts substantial training difficulties, e.g. excessive GPU memory consumption and massive computational cost, possibly at a prohibitive level.

In our implementation³, we employ a parallel acceleration strategy [44] to relieve this problem. We optimise our training code to easily and efficiently support million-level identities on a single machine by parallel acceleration on both the feature x (known as the general data parallel strategy) and the centre W (which we name the centre parallel strategy). As shown in Figure 10, parallel acceleration on both feature x and centre W can significantly decrease the GPU memory consumption and accelerate the training speed. Even for one million identities trained on 8*1080ti (11GB) GPUs, our implementation (ResNet 50, batch size 8*64, feature dimension 512 and float point 32) can still run at 800 samples per second. Compared to the approximate acceleration method proposed in [47], our implementation has no performance drop.

³ https://github.com/deepinsight/insightface/tree/master/recognition

[Figure 10: (a) GPU memory consumption (GB) and (b) training speed (samples/second) versus the identity number, for data parallelism on the feature x alone and for parallel acceleration on both x and W; plot panels omitted.]
Figure 10. Parallel acceleration on both feature x and centre W. Setting: ResNet 50, batch size 8*64, feature dimension 512, float point 32, GPU 8*P40 (24GB).

In Figure 11, we illustrate the main calculation steps of the parallel acceleration by simple matrix partition, which can be easily grasped and reproduced by beginners [3]; a runnable sketch follows the list below.

(1) Get feature (x). Face embedding features are aggregated into one feature matrix (batch size 8*64 × feature dimension 512) from the 8 GPU cards. The size of the aggregated feature matrix is only 1MB, and the communication cost is negligible when we transfer the feature matrix.

(2) Get the similarity score matrix (score = xW). We copy the feature matrix onto each GPU, and concurrently multiply the feature matrix by the centre sub-matrix (feature dimension 512 × identity number 1M/8) to get the similarity score sub-matrix (batch size 512 × identity number 1M/8) on each GPU. The similarity score matrix goes forward to calculate the ArcFace loss and the gradient. Here, we conduct a simple matrix partition on the centre matrix and the similarity score matrix along the identity dimension, so there is no communication cost on the centre and similarity score matrices. Both the centre sub-matrix and the similarity score sub-matrix are only 256MB on each GPU.

(3) Get the gradient on the centre (dW). We transpose the feature matrix on each GPU, and concurrently multiply the transposed feature matrix by the gradient sub-matrix of the similarity score.

(4) Get the gradient on the feature (dx). We concurrently multiply the gradient sub-matrix of the similarity score by the transposed centre sub-matrix and sum up the outputs from the 8 GPU cards to get the gradient on the feature x.

[Figure 11: matrix-partition diagrams for (a) x, (b) score = xW, (c) dW = x^T·dscore and (d) dx = dscore·W^T; diagrams omitted.]
Figure 11. Parallel calculation by simple matrix partition. Setting: ResNet 50, batch size 8*64, feature dimension 512, float point 32, identity number 1 million, GPU 8*1080ti (11GB). Communication cost: 1MB (feature x). Training speed: 800 samples/second.

Considering the communication cost (MB level), our implementation of ArcFace can be easily and efficiently trained on millions of identities by clusters.
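A single-process NumPy sketch of steps (1)-(4) above (illustrative only: the per-GPU shards are just list entries here, the loss gradient is a random stand-in, and the identity count is scaled down from the paper's 1M):

```python
import numpy as np

n_gpu, batch, dim, n_id = 8, 512, 512, 8_000   # paper setting: n_id = 1M, 1M/8 per GPU

# (1) features aggregated from all GPUs into one (batch, dim) matrix
x = np.random.randn(batch, dim).astype(np.float32)

# centre W is partitioned along the identity dimension, one shard per GPU
W_shards = [np.random.randn(dim, n_id // n_gpu).astype(np.float32)
            for _ in range(n_gpu)]

# (2) each GPU computes its own (batch, n_id/n_gpu) score sub-matrix
scores = [x @ W for W in W_shards]

# stand-in for the gradient of the ArcFace loss w.r.t. each score sub-matrix
dscores = [np.random.randn(*s.shape).astype(np.float32) for s in scores]

# (3) gradient on each centre shard: dW = x^T . dscore (no communication)
dW_shards = [x.T @ ds for ds in dscores]

# (4) gradient on the feature: dx = sum over GPUs of dscore . W^T
dx = sum(ds @ W.T for ds, W in zip(dscores, W_shards))
```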
5.2. Feature Space Analysis

Is the 512-d hypersphere space large enough to hold large-scale identities? Theoretically, yes.

We assume that the identity centres W_j follow a spherical uniform distribution. Then, the expectation of the nearest neighbour separation [5] is

$$E[\theta(W_j)] \to n^{-\frac{2}{d-1}}\,\Gamma\!\Big(1+\frac{1}{d-1}\Big)\left(\frac{\Gamma(\frac{d}{2})}{\sqrt{\pi}\,(d-1)\,\Gamma(\frac{d-1}{2})}\right)^{-\frac{1}{d-1}}, \qquad (7)$$

where d is the space dimension, n is the identity number, and θ(W_j) = min_{1≤i,j≤n, i≠j} arccos(W_i, W_j). In Figure 12, we give E[θ(W_j)] in the 128-d, 256-d and 512-d space with the class number ranging from 10K to 100M. The high-dimensional space is so large that E[θ(W_j)] decreases slowly when the class number increases exponentially.

[Figure 12: E[θ(W_j)] (minimum angle between individuals, in degrees) versus the number of random individuals (10^4 to 10^8) for 128-d, 256-d and 512-d; plot omitted.]
Figure 12. The high-dimensional space is so large that the mean of the nearest angles decreases slowly when the class number increases exponentially.
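Eq. 7 is easy to evaluate numerically; a sketch using log-gamma to avoid overflow at d = 512 (the formula is our reconstruction of the garbled original, so treat the exact constants with care):

```python
import math

def expected_min_angle(n, d):
    """Expected nearest-neighbour angle (radians) of n uniformly random
    centres on the unit hypersphere in R^d, per our reading of Eq. 7."""
    log_c = (math.lgamma(d / 2) - 0.5 * math.log(math.pi)
             - math.log(d - 1) - math.lgamma((d - 1) / 2))
    return (n ** (-2.0 / (d - 1)) * math.gamma(1 + 1.0 / (d - 1))
            * math.exp(-log_c / (d - 1)))

for n in (10**4, 10**6, 10**8):
    print(f"n={n:>9}: {math.degrees(expected_min_angle(n, 512)):.1f} deg")
```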

References

[1] http://data.mxnet.io/models/.
[2] http://trillionpairs.deepglint.com/overview.
[3] Stanford CS class CS231n: Convolutional neural networks for visual recognition. http://cs231n.github.io/neural-networks-case-study/.
[4] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[5] J. S. Brauchart, A. B. Reznikov, E. B. Saff, I. H. Sloan, Y. G. Wang, and R. S. Womersley. Random point sets on the sphere: hole radii, covering, and separation. Experimental Mathematics, 2018.
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In FG, 2018.
[7] B. Chen, W. Deng, and J. Du. Noisy softmax: improving the generalization ability of dcnn via postponing the early softmax saturation. In CVPR, 2017.
[8] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, 2015.
[9] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In CVPR Workshop, 2017.
[10] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[11] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. arXiv:1610.02915, 2016.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[15] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
[16] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv:1506.07310, 2015.
[17] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song. Learning towards minimum hyperspherical energy. In NIPS, 2018.
[18] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[19] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[20] Y. Liu, P. Shi, B. Peng, H. Yan, Y. Zhou, B. Han, Y. Zheng, C. Lin, J. Jiang, and Y. Fan. iqiyi-vid: A large dataset for multi-modal person identification. arXiv:1811.07548, 2018.
[21] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, and J. Cheney. Iarpa janus benchmark-c: Face dataset and protocol. In ICB, 2018.
[22] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected in-the-wild age database. In CVPR Workshop, 2017.
[23] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In ICIP, 2014.
[24] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
[26] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017.
[27] X. Qi and L. Zhang. Face recognition via centralized coordinate learning. arXiv:1801.05678, 2018.
[28] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507, 2017.
[29] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[30] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
[31] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
[32] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[33] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[34] W. Wan, Y. Zhong, T. Li, and J. Chen. Rethinking feature distribution for loss functions in image classification. arXiv:1803.02988, 2018.
[35] F. Wang, W. Liu, H. Liu, and J. Cheng. Additive margin softmax for face verification. IEEE Signal Processing Letters, 2018.
[36] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: l2 hypersphere embedding for face verification. arXiv:1704.06369, 2017.
[37] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[38] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[39] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C. Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, and K. Allen. Iarpa janus benchmark-b face dataset. In CVPR Workshop, 2017.
[40] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, 2011.
[41] W. Xie, S. Li, and A. Zisserman. Comparator networks. In ECCV, 2018.
[42] W. Xie and A. Zisserman. Multicolumn networks for face recognition. In BMVC, 2018.
[43] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv:1411.7923, 2014.
[44] D. Zhang. A distributed training solution for face recognition. DeepGlint, 2018.
[45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 2016.
[46] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. In ICCV, 2017.
[47] X. Zhang, L. Yang, J. Yan, and D. Lin. Accelerated training for massive classification via dynamic class selection. In AAAI, 2018.
[48] T. Zheng and W. Deng. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Technical report, 2018.
[49] T. Zheng, W. Deng, and J. Hu. Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments. arXiv:1708.08197, 2017.
