Circle Loss: A Unified Perspective of Pair Similarity Optimization

Abstract

This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity sp and minimize the between-class similarity sn. We find a majority of loss functions, including the triplet loss and the softmax plus cross-entropy loss, embed sn and sp into similarity pairs and seek to reduce (sn − sp). Such an optimization manner is inflexible, because the penalty strength on every single similarity score is restricted to be equal. Our intuition is that if a similarity score deviates far from the optimum, it should be emphasized. To this end, we simply re-weight each similarity to highlight the less-optimized similarity scores. It results in a Circle loss, which is named due to its circular decision boundary. The Circle loss has a unified formula for two elemental deep feature learning approaches, i.e., learning with class-level labels and pair-wise labels. Analytically, we show that the Circle loss offers a more flexible optimization approach towards a more definite convergence target, compared with the loss functions optimizing (sn − sp). Experimentally, we demonstrate the superiority of the Circle loss on a variety of deep feature learning tasks. On face recognition, person re-identification, as well as several fine-grained image retrieval datasets, the achieved performance is on par with the state of the art.

Figure 1: Comparison between the popular optimization manner of reducing (sn − sp) and the proposed optimization manner of reducing (αnsn − αpsp). (a) Reducing (sn − sp) is prone to inflexible optimization (A, B and C all have equal gradients with respect to sn and sp), as well as ambiguous convergence status (both T and T′ on the decision boundary are acceptable). (b) With (αnsn − αpsp), the Circle loss dynamically adjusts its gradients on sp and sn, and thus benefits from a flexible optimization process. For A, it emphasizes increasing sp; for B, it emphasizes reducing sn. Moreover, it favors a specified point T on the circular decision boundary for convergence, setting up a definite convergence target.

1. Introduction

This paper holds a similarity optimization view towards two elemental deep feature learning approaches, i.e., learning from data with class-level labels and from data with pair-wise labels. The former employs a classification loss function (e.g., Softmax plus cross-entropy loss [25, 16, 36]) to optimize the similarity between samples and weight vectors. The latter leverages a metric loss function (e.g., triplet loss [9, 22]) to optimize the similarity between samples. In our interpretation, there is no intrinsic difference between these two learning approaches. They both seek to minimize the between-class similarity sn, as well as to maximize the within-class similarity sp.
From this viewpoint, we find that many popular loss functions (e.g., triplet loss [9, 22], Softmax loss and its variants [25, 16, 36, 29, 32, 2]) share a similar optimization pattern. They all embed sn and sp into similarity pairs and seek to reduce (sn − sp). In (sn − sp), increasing sp is equivalent to reducing sn. We argue that this symmetric optimization manner is prone to the following two problems.
• Lack of flexibility for optimization. The penalty strength on sn and sp is restricted to be equal. Given the specified loss functions, the gradients with respect to sn and sp are of the same amplitude (as detailed in Section 2). In some corner cases, e.g., sp is small and sn already approaches 0 ("A" in Fig. 1 (a)), the loss keeps on penalizing sn with a large gradient. This is inefficient and irrational.

• Ambiguous convergence status. Optimizing (sn − sp) leads to a decision boundary parallel to sn = sp. Every point on this boundary is an equally acceptable convergence status (e.g., both T and T′ in Fig. 1 (a)), so the convergence target remains ambiguous.

Being simple, Circle loss intrinsically reshapes the characteristics of deep feature learning from the following three aspects:

First, a unified loss function. From the unified similarity pair optimization perspective, we propose a unified loss function for two elemental learning approaches, learning with class-level labels and learning with pair-wise labels.
Second, flexible optimization. During training, the gradient back-propagated to sn (sp) will be amplified by αn (αp). The less-optimized similarity scores will have larger weighting factors and consequently get larger gradients. As shown in Fig. 1 (b), the optimizations on A, B and C differ from each other.
Third, definite convergence status. On the circular decision boundary, Circle loss favors a specified convergence status ("T" in Fig. 1 (b)), as demonstrated in Section 3.3. Correspondingly, it sets up a definite optimization target and benefits the separability.
The main contributions of this paper are summarized as follows:

• We propose Circle loss, a simple loss function for deep feature learning. By re-weighting each similarity score under supervision, Circle loss benefits deep feature learning with flexible optimization and a definite convergence target.
• We present Circle loss with compatibility to both class-level labels and pair-wise labels. Circle loss degenerates to triplet loss or Softmax loss with slight modifications.
• We conduct extensive experiments on a variety of deep feature learning tasks, e.g., face recognition, person re-identification, car image retrieval and so on. On all these tasks, we demonstrate the superiority of Circle loss, with performance on par with the state of the art.

2. A Unified Perspective

Deep feature learning aims to maximize the within-class similarity sp, as well as to minimize the between-class similarity sn. Under the cosine similarity metric, for example, we expect sp → 1 and sn → 0.
To this end, learning with class-level labels and learning with pair-wise labels are two paradigms of approaches and are usually considered separately. Given class-level labels, the first one basically learns to classify each training sample to its target class with a classification loss, e.g., L2-Softmax [21], Large-margin Softmax [15], Angular Softmax [16], NormFace [30], AM-Softmax [29], CosFace [32], ArcFace [2]. In contrast, given pair-wise labels, the second one directly learns pair-wise similarity in the feature space in an explicit manner, e.g., contrastive loss [5, 1], triplet loss [9, 22], Lifted-Structure loss [19], N-pair loss [24], Histogram loss [27], Angular loss [33], Margin based loss [38], Multi-Similarity loss [34] and so on.

Figure 2: The gradients of the loss functions. (a) Triplet loss. (b) AMSoftmax loss. (c) The proposed Circle loss. Both triplet loss and AMSoftmax loss present a lack of flexibility for optimization. The gradients with respect to sp (left) and sn (right) are restricted to be equal and undergo a sudden decrease upon convergence (the similarity pair B). For example, at A, the within-class similarity score sp already approaches 1, yet still incurs a large gradient. Moreover, the decision boundaries are parallel to sp = sn, which allows ambiguous convergence. In contrast, the proposed Circle loss assigns different gradients to the similarity scores, depending on their distances to the optimum. For A (both sn and sp are large), Circle loss lays emphasis on optimizing sn. For B, since sn significantly decreases, Circle loss reduces its gradient and thus enforces a mild penalty. Circle loss has a circular decision boundary, and promotes an accurate convergence status.

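Both learning approaches pursue the same targets (sp → 1 and sn → 0), and they can be unified into a single loss function, referred to as Eq. 1 in the text. The following is a sketch of that unified loss, reconstructed from the surrounding description (assuming K within-class scores s_p^i, L between-class scores s_n^j, a scale factor γ and a margin m):

\mathcal{L}_{uni} = \log \Big[ 1 + \sum_{i=1}^{K} \sum_{j=1}^{L} \exp\big( \gamma (s_n^j - s_p^i + m) \big) \Big]
                  = \log \Big[ 1 + \sum_{j=1}^{L} \exp\big( \gamma (s_n^j + m) \big) \sum_{i=1}^{K} \exp\big( -\gamma s_p^i \big) \Big].

It iterates through every similarity pair and penalizes (s_n^j − s_p^i), which is exactly the symmetric optimization manner analyzed above.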

Given class-level labels, where there is a single within-class score sp and (N − 1) between-class scores (N being the number of classes), Eq. 1 degenerates to AM-Softmax [29, 32], an important variant of Softmax loss, sketched below.
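A sketch of this degenerated form, under the same assumptions as the reconstruction of Eq. 1 above:

\mathcal{L}_{am} = \log \Big[ 1 + \sum_{j=1}^{N-1} \exp\big( \gamma (s_n^j + m) \big) \exp\big( -\gamma s_p \big) \Big]
                 = -\log \frac{ \exp\big( \gamma (s_p - m) \big) }{ \exp\big( \gamma (s_p - m) \big) + \sum_{j=1}^{N-1} \exp\big( \gamma s_n^j \big) }.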

Moreover, with m = 0, Eq. 2 further degenerates to NormFace [30]. By replacing the cosine similarity with inner product and setting γ = 1, it finally degenerates to Softmax loss (i.e., softmax plus cross-entropy loss).

Specifically, we note that in Eq. 3, the "Σ exp(·)" operation is utilized by Lifted-Structure loss [19], N-pair loss [24], Multi-Similarity loss [34] and so on, to conduct "soft" hard mining among samples. Enlarging γ gradually reinforces the mining intensity and when γ → +∞, it results in the canonical hard mining in [22, 8].
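As a quick check on this claim, the "soft" mining behavior follows from the standard log-sum-exp limit:

\lim_{\gamma \to +\infty} \frac{1}{\gamma} \log \Big[ \sum_{i} \exp(\gamma x_i) \Big] = \max_i x_i,

so a larger γ concentrates the loss (and its gradient) on the hardest similarity pairs, and in the limit only the hardest pair contributes, recovering canonical hard mining.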

Gradient analysis. Eq. 2 and Eq. 3 show that triplet loss, Softmax loss and its several variants can be interpreted as specific cases of Eq. 1. In other words, they all optimize (sn − sp). Under the toy scenario where there is only a single sp and a single sn, we visualize the gradients of triplet loss and AMSoftmax loss in Fig. 2 (a) and (b), from which we draw the following observations:

• First, before the loss reaches its decision boundary (upon which the gradients vanish), the gradients with respect to sp and sn are equal to each other. The status A has {sn, sp} = {0.8, 0.8}, indicating good within-class compactness. However, A still receives a large gradient with respect to sp. This leads to a lack of flexibility during optimization.
• Second, the gradients stay (roughly) constant before convergence and undergo a sudden decrease upon convergence. The status B lies closer to the decision boundary and is better optimized, compared with A. However, the loss functions (both triplet loss and AM-Softmax loss) enforce approximately equal penalties on A and B. It is another evidence of inflexibility.

These problems originate from the optimization manner of minimizing (sn − sp), in which reducing sn is equivalent to increasing sp. In the following Section 3, we will transfer such an optimization manner into a more general one to facilitate higher flexibility.

3. A New Loss Function

3.1. Self-paced Weighting

We consider enhancing the optimization flexibility by allowing each similarity score to learn at its own pace, depending on its current optimization status. We first neglect the margin term m in Eq. 1 and transfer the unified loss function (Eq. 1) into the proposed Circle loss by re-weighting each similarity score independently, as sketched below.
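A sketch of the resulting formula, reconstructed from this description (with [·]+ denoting the cut-off-at-zero operation, so that the weights are non-negative):

\mathcal{L}_{circle} = \log \Big[ 1 + \sum_{j=1}^{L} \exp\big( \gamma \alpha_n^j s_n^j \big) \sum_{i=1}^{K} \exp\big( -\gamma \alpha_p^i s_p^i \big) \Big],

in which α_n^j and α_p^i are self-paced weights determined by the distance of each score to its optimum (O_p for s_p^i and O_n for s_n^j):

\alpha_p^i = \big[ O_p - s_p^i \big]_+, \qquad \alpha_n^j = \big[ s_n^j - O_n \big]_+.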

Re-scaling the cosine similarity under supervision is a common practice in modern classification losses [21, 30, 29, 32, 39, 40]. Conventionally, all the similarity scores share an equal scale factor γ. The non-normalized weighting operation in Circle loss can also be interpreted as a specific scaling operation. Different from the other loss functions, Circle loss re-weights (re-scales) each similarity score independently and thus allows different learning paces. We empirically show that Circle loss is robust to various γ settings in Section 4.5.

Discussions. We notice another difference beyond the scaling strategy. The output of the softmax function in a classification loss is conventionally interpreted as the probability of a sample belonging to a certain class. Since the probabilities are based on comparing each similarity score against all the similarity scores, equal re-scaling is a prerequisite for a fair comparison. Circle loss abandons such a probability-related interpretation and holds a similarity pair optimization perspective instead. Correspondingly, it gets rid of the constraint of equal re-scaling and allows more flexible optimization.

3.2. Within-class and Between-class Margins

In loss functions optimizing (sn − sp), adding a margin m reinforces the optimization [15, 16, 29, 32]. Since sn and −sp are in symmetric positions, a positive margin on sn is equivalent to a negative margin on sp, so only a single margin m is required. In Circle loss, sn and sp are in asymmetric positions. Naturally, it requires respective margins for sn and sp, as formulated below.
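A sketch of the margin-equipped form and its decision boundary, reconstructed from the descriptions in this section and in Section 3.3 (Δ_n and Δ_p denote the between-class and within-class margins):

\mathcal{L}_{circle} = \log \Big[ 1 + \sum_{j=1}^{L} \exp\big( \gamma \alpha_n^j (s_n^j - \Delta_n) \big) \sum_{i=1}^{K} \exp\big( -\gamma \alpha_p^i (s_p^i - \Delta_p) \big) \Big].

Setting O_p = 1 + m, O_n = −m, Δ_p = 1 − m and Δ_n = m reduces the hyper-parameters to a single relaxation factor m and, in the binary case, yields the circular decision boundary referred to as Eq. 8:

s_n^2 + (s_p - 1)^2 = 2 m^2,

i.e., a circle centered at (s_n, s_p) = (0, 1) whose radius is controlled by m.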

With the decision boundary defined in Eq. 8, we have another intuitive interpretation of Circle loss. It aims to optimize sp → 1 and sn → 0. The parameter m controls the radius of the decision boundary and can be viewed as a relaxation factor. In other words, Circle loss expects the relaxed targets sketched below.
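A sketch of these relaxed targets, as a reconstruction consistent with the relaxation-factor interpretation and the circular boundary above:

s_p^i > 1 - m \qquad \text{and} \qquad s_n^j < m.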

Hence there are only two hyper-parameters, i.e., the scale factor γ and the relaxation margin m. We will experimentally analyze the impacts of m and γ in Section 4.5.
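To make the formulation concrete, below is a minimal PyTorch-style sketch of the pair-wise form of Circle loss, assuming sp and sn are 1-D tensors of cosine similarities for one anchor. It is an illustrative sketch rather than the released code; the function name, the detaching of the weights, and the example values are our own choices.

import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                m: float = 0.25, gamma: float = 256.0) -> torch.Tensor:
    """Pair-wise Circle loss for one anchor (a sketch, not the official code).

    sp: within-class similarity scores; sn: between-class similarity scores.
    """
    # Self-paced weights alpha_p = [O_p - s_p]_+ and alpha_n = [s_n - O_n]_+,
    # with optima O_p = 1 + m and O_n = -m; detached so they act as constants.
    ap = torch.clamp(1 + m - sp.detach(), min=0.0)
    an = torch.clamp(sn.detach() + m, min=0.0)

    # Margins Delta_p = 1 - m and Delta_n = m.
    logit_p = -gamma * ap * (sp - (1 - m))
    logit_n = gamma * an * (sn - m)

    # log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
    return F.softplus(torch.logsumexp(logit_n, dim=0) +
                      torch.logsumexp(logit_p, dim=0))

# Example: one anchor with 2 positive and 3 negative similarity scores.
sp = torch.tensor([0.8, 0.6], requires_grad=True)
sn = torch.tensor([0.4, 0.1, 0.3], requires_grad=True)
loss = circle_loss(sp, sn)
loss.backward()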

3.3. The Advantages of Circle Loss

The gradients of Circle loss with respect to s_n^j and s_p^i are derived as follows:
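(A sketch, treating the weights α_n^j and α_p^i as constants during back-propagation; Z denotes the product of the two exponential sums inside the logarithm of the reconstructed loss above.)

\frac{\partial \mathcal{L}_{circle}}{\partial s_n^j} = \frac{ \gamma \alpha_n^j \exp\big( \gamma \alpha_n^j (s_n^j - \Delta_n) \big) \sum_{i} \exp\big( -\gamma \alpha_p^i (s_p^i - \Delta_p) \big) }{ 1 + Z },

\frac{\partial \mathcal{L}_{circle}}{\partial s_p^i} = \frac{ -\gamma \alpha_p^i \exp\big( -\gamma \alpha_p^i (s_p^i - \Delta_p) \big) \sum_{j} \exp\big( \gamma \alpha_n^j (s_n^j - \Delta_n) \big) }{ 1 + Z }.

Each gradient is scaled by its own weight α, so a less-optimized score receives a larger gradient, and the gradient decays automatically as the score approaches its relaxed optimum.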

Under the toy scenario of binary classification (or only a single sn and a single sp), we visualize the gradients under different settings of m in Fig. 2 (c), from which we draw the following three observations:
• Balanced optimization on sn and sp. We recall that the loss functions minimizing (sn − sp) always have equal gradients on sp and sn and are inflexible. In contrast, Circle loss presents dynamic penalty strength. Within a specified similarity pair {sn, sp}, if sp is better optimized in comparison to sn (e.g., A = {0.8, 0.8} in Fig. 2 (c)), Circle loss assigns a larger gradient to sn (and vice versa), so as to decrease sn with higher priority. The experimental evidence of balanced optimization is presented in Section 4.6.

• Gradually-attenuated gradients. At the start of training, the similarity scores deviate far from the optimum and gain large gradients (e.g., "A" in Fig. 2 (c)). As the training gradually approaches convergence, the gradients on the similarity scores correspondingly decay (e.g., "B" in Fig. 2 (c)), yielding a mild optimization. The experimental result in Section 4.5 shows that the learning effect is robust to various settings of γ (in Eq. 6), which we attribute to the automatically-attenuated gradients.

• A (more) definite convergence target. Circle loss has a circular decision boundary and favors T rather than T′ (Fig. 1) for convergence. This is because T has the smallest gap between sp and sn, compared with all the other points on the decision boundary. In other words, T′ has a larger gap between sp and sn and is inherently more difficult to maintain. In contrast, losses that minimize (sn − sp) have a homogeneous decision boundary, that is, every point on the decision boundary is of the same difficulty to reach. Experimentally, we observe that Circle loss leads to a more concentrated similarity distribution after convergence, as detailed in Section 4.6 and Fig. 5.

4. Experiment

We comprehensively evaluate the effectiveness of Circle loss under two elemental learning approaches, i.e., learning with class-level labels and learning with pair-wise labels. For the former approach, we evaluate our method on face recognition (Section 4.2) and person re-identification (Section 4.3) tasks. For the latter approach, we use the fine-grained image retrieval datasets (Section 4.4), which are relatively small and encourage learning with pair-wise labels. We show that Circle loss is competent under both settings. Section 4.5 analyzes the impact of the two hyper-parameters, i.e., the scale factor γ in Eq. 6 and the relaxation factor m in Eq. 8. We show that Circle loss is robust under reasonable settings. Finally, Section 4.6 experimentally confirms the characteristics of Circle loss.

4.1. Settings

Face recognition. We use the popular dataset MS-Celeb-1M [4] for training. The native MS-Celeb-1M data is noisy and has a long-tailed data distribution. We clean the dirty samples and exclude the tail identities (≤ 3 images per identity). It results in 3.6M images and 79.9K identities. For evaluation, we adopt the MegaFace Challenge 1 (MF1) [12], IJB-C [17], LFW [10], YTF [37] and CFP-FP [23] datasets and the official evaluation protocols. We also polish the probe set and 1M distractors on MF1 for more reliable evaluation, following [2]. For data pre-processing, we resize the aligned face images to 112 × 112 and linearly normalize the pixel values of RGB images to [−1, 1] [36, 15, 32]. We only augment the training samples by random horizontal flip. We choose the popular residual networks [6] as our backbones. All the models are trained with 182k iterations. The learning rate starts from 0.1 and is reduced by 10× at 50%, 70% and 90% of the total iterations, respectively. The default hyper-parameters of our method are γ = 256 and m = 0.25 if not specified. For all the model inference, we extract the 512-D feature embeddings and use cosine distance as the metric.

Person re-identification. Person re-identification (re-ID) aims to spot the appearance of the same person in different observations. We evaluate our method on two popular datasets, i.e., Market-1501 [41] and MSMT17 [35]. Market-1501 contains 1,501 identities, 12,936 training images and 19,732 gallery images captured with 6 cameras. MSMT17 contains 4,101 identities and 126,411 images captured with 15 cameras, and presents a long-tailed sample distribution. We adopt two network structures, i.e., a global feature learning model backboned on ResNet50 and a part-feature model named MGN [31]. We use MGN with consideration of its competitive performance and relatively concise structure. The original MGN uses a Softmax loss on each part feature branch for training. Our implementation concatenates all the part features into a single feature vector for simplicity. For Circle loss, we set γ = 256 and m = 0.25.

Fine-grained image retrieval. We use three datasets for evaluation on fine-grained image retrieval, i.e., CUB-200-2011 [28], Cars196 [14] and Stanford Online Products [19]. Cars196 contains 16,183 images belonging to 196 classes of cars. The first 98 classes are used for training and the last 98 classes are used for testing. CUB-200-2011 has 200 different classes of birds. We use the first 100 classes with 5,864 images for training and the last 100 classes with 5,924 images for testing. SOP is a large dataset consisting of 120,053 images belonging to 22,634 classes of online products. The training set contains 11,318 classes with 59,551 images, and the remaining 11,316 classes with 60,499 images are used for testing. The experimental setup follows [19]. We use BN-Inception [11] as the backbone to learn 512-D embeddings. We adopt the P-K sampling strategy [8] to construct mini-batches with P = 16 and K = 5. For Circle loss, we set γ = 80 and m = 0.4.

Table 1: Identification rank-1 accuracy (%) on MFC1 dataset with different backbones and loss functions.

Table 2: Face verification accuracy (%) on LFW, YTF and CFP-FP with ResNet34 backbone.

Table 3: Comparison of true accept rates (%) on the IJB-C 1:1 verification task.

Table 4: Evaluation of Circle loss on re-ID task. We report R-1 accuracy (%) and mAP (%).

4.2. Face Recognition

For the face recognition task, we compare Circle loss against several popular classification loss functions, i.e., vanilla Softmax, NormFace [30], AM-Softmax [29] (or CosFace [32]), and ArcFace [2]. Following the original papers [29, 2], we set γ = 64, m = 0.35 for AM-Softmax and γ = 64, m = 0.5 for ArcFace.
We report the rank-1 accuracy on the MegaFace Challenge 1 dataset (MFC1) in Table 1. On all the three backbones, Circle loss marginally outperforms the counterparts. For example, with ResNet34 as the backbone, Circle loss surpasses the most competitive one (ArcFace) by +0.13%. With ResNet100 as the backbone, while ArcFace achieves a high rank-1 accuracy of 98.36%, Circle loss still outperforms it by +0.14%.
Table 2 summarizes face verification results on LFW [10], YTF [37] and CFP-FP [23]. We note that performance on these datasets is already near saturation. Specifically, ArcFace is higher than AM-Softmax by +0.05%, +0.03% and +0.07% on the three datasets, respectively. Circle loss remains the best one, surpassing ArcFace by +0.05%, +0.06% and +0.18%, respectively.
We further compare Circle loss with AM-Softmax on the IJB-C 1:1 verification task in Table 3. Our implementation of ArcFace is unstable on this dataset and achieves abnormally low performance, so we did not compare Circle loss against ArcFace. With ResNet34 as the backbone, Circle loss significantly surpasses AM-Softmax by +1.30% and +4.92% on "TAR@FAR=1e-4" and "TAR@FAR=1e-5", respectively. With ResNet100 as the backbone, Circle loss still maintains considerable superiority.

4.3. Person Re-identification

We evaluate Circle loss on the re-ID task in Table 4. MGN [31] is one of the state-of-the-art methods and is featured for learning multi-granularity part-level features. Originally, it uses both Softmax loss and triplet loss to facilitate a joint optimization. Our implementations of "MGN (ResNet50) + AMSoftmax" and "MGN (ResNet50) + Circle loss" only use a single loss function for simplicity.
We make three observations from Table 4. First, comparing Circle loss against the state of the art, we find that Circle loss achieves competitive re-ID accuracy, with a concise setup (no auxiliary loss functions). We note that "JDGL" is slightly higher than "MGN + Circle loss" on MSMT17 [35]. JDGL [42] uses a generative model to augment the training data, and significantly improves re-ID over the long-tailed dataset. Second, comparing "Circle loss" with "AMSoftmax", we observe the superiority of Circle loss, which is consistent with the experimental results on the face recognition task. Third, comparing "ResNet50 + Circle loss" against "MGN + Circle loss", we find that part-level features bring incremental improvement to Circle loss. It implies that Circle loss is compatible with the part-model specially designed for re-ID.

Table 5: Comparison with state of the art on CUB-200-2011, Cars196 and Stanford Online Products. R@K(%) is reported.

Figure 3: Impact of two hyper-parameters. In (a), Circle loss presents high robustness on various settings of scale factor γ. In (b), Circle loss surpasses the best performance of both AMSoftmax and ArcFace within a large range of relaxation factor m.

Figure 4: The change of sp and sn values during training. We linearly lengthen the curves within the first 2k iterations to highlight the initial training process (in the green zone). During the early training stage, Circle loss rapidly increases sp, because sp deviates far from the optimum at the initialization and thus attracts higher optimization priority.

4.4. Fine-grained Image Retrieval

We evaluate the compatibility of Circle loss with pair-wise labeled data on three fine-grained image retrieval datasets, i.e., CUB-200-2011, Cars196, and Stanford Online Products. On these datasets, the majority of methods [19, 18, 3, 20, 13, 34] adopt the encouraged setting of learning with pair-wise labels. We compare Circle loss against these state-of-the-art methods in Table 5. We observe that Circle loss achieves competitive performance on all of the three datasets. Among the competing methods, LiftedStruct [19] and Multi-Simi [34] are specially designed with elaborate hard mining strategies for learning with pair-wise labels. HDC [18], ABIER [20] and ABE [13] benefit from model ensembles. In contrast, the proposed Circle loss achieves performance on par with the state of the art, without any bells and whistles.

4.5. Impact of the Hyper-parameters

Figure 5: Visualization of the similarity distribution after convergence. The blue dots mark the similarity pairs crossing the decision boundary during the whole training process. The green dots mark the similarity pairs after convergence. (a) AMSoftmax seeks to minimize (sn − sp). During training, the similarity pairs cross the decision boundary through a wide passage. After convergence, the similarity pairs scatter in a relatively large region in the (sn, sp) space. In (b) and (c), Circle loss has a circular decision boundary. The similarity pairs cross the decision boundary through a narrow passage and gather into a relatively concentrated region.

We analyze the impact of two hyper-parameters, i.e., the scale factor γ in Eq. 6 and the relaxation factor m in Eq. 8, on face recognition tasks.
The scale factor γ determines the largest scale of each similarity score. The concept of the scale factor is critical in a lot of variants of Softmax loss. We experimentally evaluate its impact on Circle loss and make a comparison with several other loss functions involving scale factors. We vary γ from 32 to 1024 for both AMSoftmax and Circle loss. For ArcFace, we only set γ to 32, 64 and 128, as it becomes unstable with larger γ in our implementation. The results are visualized in Fig. 3. Compared with AM-Softmax and ArcFace, Circle loss exhibits high robustness on γ. The main reason for the robustness of Circle loss on γ is the automatic attenuation of gradients. As the training progresses, the similarity scores approach the optimum. Consequently, the weighting scales along with the gradients automatically decay, maintaining a mild optimization.
The relaxation factor m determines the radius of the circular decision boundary. We vary m from −0.2 to 0.3 (with 0.05 as the interval) and visualize the results in Fig. 3 (b). It is observed that under all the settings from −0.1 to 0.25, Circle loss surpasses the best performance of ArcFace, as well as AMSoftmax, presenting a considerable degree of robustness.

4.6. Investigation of the Characteristics

Analysis of the optimization process. To intuitively understand the learning process, we show the change of sn and sp during the whole training process in Fig. 4, from which we draw two observations:
First, at the initialization, all the sn and sp scores are small. It is because in the high-dimensional feature space, randomized features are prone to be far away from each other [40, 7]. Correspondingly, sp gets significantly larger weights (compared with sn), and the optimization on sp dominates the training, incurring a fast increase of similarity values in Fig. 4. This phenomenon evidences that Circle loss maintains a flexible and balanced optimization.
Second, at the end of training, Circle loss achieves both better within-class compactness and between-class discrepancy (on the training set), compared with AMSoftmax. Considering the fact that Circle loss achieves higher performance on the testing set, we believe that it indicates better optimization.

Analysis of the convergence. We analyze the convergence status of Circle loss in Fig. 5. We investigate two issues: how the similarity pairs consisting of sn and sp cross the decision boundary during training, and how the similarity pairs distribute in the (sn, sp) space after convergence. The results are shown in Fig. 5. In Fig. 5 (a), AMSoftmax loss adopts the optimal setting of m = 0.35. In Fig. 5 (b), Circle loss adopts a compromised setting of m = 0.325. The decision boundaries of (a) and (b) are tangent to each other, allowing an intuitive comparison. In Fig. 5 (c), Circle loss adopts its optimal setting of m = 0.25. Comparing Fig. 5 (b) and (c) against Fig. 5 (a), we find that Circle loss presents a relatively narrower passage on the decision boundary, as well as a more concentrated distribution for convergence (especially when m = 0.25). It indicates that Circle loss facilitates more consistent convergence for all the similarity pairs, compared with AMSoftmax loss. This phenomenon confirms that Circle loss has a more definite convergence target, which promotes better separability in the feature space.

5. Conclusion

This paper provides two insights into the optimization process for deep feature learning. First, a majority of loss functions, including the triplet loss and popular classification losses, conduct optimization by embedding the between-class and within-class similarity into similarity pairs. Second, within a similarity pair under supervision, each similarity score favors different penalty strength, depending on its distance to the optimum. These insights result in Circle loss, which allows the similarity scores to learn at different paces. The Circle loss benefits deep feature learning with high flexibility in optimization and a more definite convergence target. It has a unified formula for two elemental learning approaches, i.e., learning with class-level labels and learning with pair-wise labels. On a variety of deep feature learning tasks, e.g., face recognition, person re-identification, and fine-grained image retrieval, the Circle loss achieves performance on par with the state of the art.

References

[1] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 1:539–546, 2005.
[2] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[3] W. Ge. Deep metric learning with hierarchical triplet loss. In The European Conference on Computer Vision (ECCV), September 2018.
[4] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, 2016.
[5] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742. IEEE, 2006.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] L. He, Z. Wang, Y. Li, and S. Wang. Softmax dissection: Towards understanding intra- and inter-class objective for embedding learning. CoRR, abs/1908.01281, 2019.
[8] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[9] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[10] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[13] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon. Attention-based ensemble for deep metric learning. In The European Conference on Computer Vision (ECCV), September 2018.
[14] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
[16] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[17] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. Iarpa janus benchmark-c: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158–165. IEEE, 2018.
[18] H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[19] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
[20] M. Opitz, G. Waltner, H. Possegger, and H. Bischof. Deep metric learning with bier: Boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.
[21] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[22] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[23] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[24] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
[25] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
[26] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In The European Conference on Computer Vision (ECCV), September 2018.
[27] E. Ustinova and V. S. Lempitsky. Learning deep embeddings with histogram loss. In NIPS, 2016.
[28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[29] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[30] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049. ACM, 2017.