Improved Deep Metric Learning with Multi-class N-pair Loss Objective

Kihyuk Sohn
NEC Laboratories America, Inc.
ksohn@nec-labs.com

Abstract

Deep metric learning has gained much popularity in recent years, following the success of deep learning. However, existing frameworks of deep metric learning based on contrastive loss and triplet loss often suffer from slow convergence, partially because they employ only one negative example while not interacting with the other negative classes in each update. 
In this paper, we propose to address this problem with a new metric learning objective called multi-class N-pair loss. The proposed objective function firstly generalizes triplet loss by allowing joint comparison among more than one negative example – more specifically, N−1 negative examples – and secondly reduces the computational burden of evaluating deep embedding vectors via an efficient batch construction strategy using only N pairs of examples, instead of (N+1)×N. We demonstrate the superiority of our proposed loss over the triplet loss as well as other competing loss functions on a variety of tasks across several visual recognition benchmarks, including fine-grained object recognition and verification, image clustering and retrieval, and face verification and identification.

1 Introduction

Distance metric learning aims to learn an embedding representation of the data that keeps similar data points close and dissimilar data points far apart in the embedding space [15, 30]. With the success of deep learning [13, 20, 23, 5], deep metric learning has received a lot of attention. Compared to standard distance metric learning, it learns a nonlinear embedding of the data using deep neural networks, and it has shown significant benefit by learning deep representations using contrastive loss [3, 7] or triplet loss [27, 2] for applications such as face recognition [24, 22, 19] and image retrieval [26]. Although yielding promising progress, such frameworks often suffer from slow convergence and poor local optima, partially because the loss function employs only one negative example while not interacting with the other negative classes in each update. Hard negative data mining could alleviate the problem, but it is expensive to evaluate embedding vectors in a deep learning framework during the hard negative example search. 
As for experimental results, only a few works have reported strong empirical performance using these loss functions alone [19, 26]; many have combined them with a softmax loss to train deep networks [22, 31, 18, 14, 32].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Deep metric learning with (left) triplet loss and (right) (N+1)-tuplet loss. Embedding vectors f of deep networks are trained to satisfy the constraints of each loss. Triplet loss pulls in the positive example while pushing away one negative example at a time. In contrast, (N+1)-tuplet loss pushes away N−1 negative examples all at once, based on their similarity to the input example.

To address this problem, we propose an (N+1)-tuplet loss that is optimized to identify a positive example from N−1 negative examples. Our proposed loss extends triplet loss by allowing joint comparison among more than one negative example; when N=2, it is equivalent to triplet loss. One immediate concern with the (N+1)-tuplet loss is that it quickly becomes intractable when scaling up, since the number of examples to evaluate in each batch grows quadratically with the number of tuplets and their length N. To overcome this, we propose an efficient batch construction method that only requires 2N examples, instead of (N+1)N, to build N tuplets of length N+1. We unify the (N+1)-tuplet loss with our proposed batch construction method to form a novel, scalable, and effective deep metric learning objective, called the multi-class N-pair loss (N-pair-mc loss). Since the N-pair-mc loss already considers comparisons to N−1 negative examples in its training objective, negative data mining is unnecessary when learning from datasets that are small or medium-scale in terms of the number of output classes. 
For datasets with a large number of output classes, we propose a hard negative "class" mining scheme which greedily adds examples to form a batch from a class that violates the constraint with the previously selected classes in the batch.
In experiments, we demonstrate the superiority of our proposed N-pair-mc loss over the triplet loss as well as other competing metric learning objectives on visual recognition, verification, and retrieval tasks. Specifically, we report much improved recognition and verification performance on our fine-grained car and flower recognition datasets. In comparison to the softmax loss, the N-pair-mc loss is as competitive for recognition but significantly better for verification. Moreover, we demonstrate substantial improvement on image clustering and retrieval tasks on the Online product [21], Car-196 [12], and CUB-200 [25] datasets, as well as face verification and identification accuracy on the LFW database [8].

2 Preliminary: Distance Metric Learning

Let x ∈ X be an input data point and y ∈ {1, ···, L} be its output label. We use x^+ and x^− to denote positive and negative examples of x, meaning that x and x^+ are from the same class and x^− is from a different class than x. The kernel f(·; θ): X → R^K takes x and generates an embedding vector f(x). We often omit x from f(x) for simplicity, while f inherits all superscripts and subscripts.
Contrastive loss [3, 7] takes pairs of examples as input and trains a network to predict whether two inputs are from the same class or not. Specifically, the loss is written as follows:

L^m_cont(x_i, x_j; f) = 1{y_i = y_j} ||f_i − f_j||_2^2 + 1{y_i ≠ y_j} max(0, m − ||f_i − f_j||_2)^2    (1)

where m is a margin parameter imposing that the distance between examples from different classes be larger than m. 
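As a concrete illustration, Equation (1) can be sketched in a few lines of NumPy; the function name and the default margin value are ours, not from the paper:

```python
import numpy as np

def contrastive_loss(f_i, f_j, same_class, m=1.0):
    """Contrastive loss of Eq. (1): pull same-class pairs together,
    push different-class pairs at least a margin m apart."""
    d = np.linalg.norm(f_i - f_j)       # Euclidean distance ||f_i - f_j||_2
    if same_class:
        return d ** 2                   # 1{y_i = y_j} * ||f_i - f_j||_2^2
    return max(0.0, m - d) ** 2         # 1{y_i != y_j} * max(0, m - ||f_i - f_j||_2)^2
```

For identical embeddings, the loss is zero when the pair shares a class and m^2 when it does not; once a negative pair is farther than m apart, its loss vanishes.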
Triplet loss [27, 2, 19] shares a similar spirit with contrastive loss, but is composed of triplets, each consisting of a query, a positive example (to the query), and a negative example:

L^m_tri(x, x^+, x^−; f) = max(0, ||f − f^+||_2^2 − ||f − f^−||_2^2 + m)    (2)

Compared to contrastive loss, triplet loss only requires the difference of (dis-)similarities between positive and negative examples to the query point to be larger than a margin m. Despite their wide use, both loss functions are known to suffer from slow convergence, and they often require an expensive data sampling method to provide nontrivial pairs or triplets to accelerate training [2, 19, 17, 4].

3 Deep Metric Learning with Multiple Negative Examples

The fundamental philosophy behind triplet loss is the following: for an input (query) example, we desire to shorten the distances between its embedding vector and those of positive examples while enlarging the distances to those of negative examples. However, during one update, the triplet loss only compares an example with one negative example while ignoring negative examples from the rest of the classes. As a consequence, the embedding vector for an example is only guaranteed to be far from the selected negative class but not necessarily the others. Thus we can end up only differentiating an example from a limited selection of negative classes yet still maintain a small distance from many other classes. In practice, the hope is that, after looping over sufficiently many randomly sampled triplets, the final distance metric can be balanced correctly; but individual updates can still be unstable and convergence can be slow. Specifically, towards the end of training, most randomly selected negative examples can no longer yield a non-zero triplet loss error.
An evident way to improve the vanilla triplet loss is to select a negative example that violates the triplet constraint. However, hard negative data mining can be expensive with a large number of output classes for deep metric learning. We seek an alternative: a loss function that recruits multiple negatives for each update, as illustrated by Figure 1. In this case, an input example is compared against negative examples from multiple classes and it needs to be distinguishable from all of them at the same time. Ideally, we would like the loss function to incorporate examples across every class all at once. But this is usually not attainable for large-scale deep metric learning due to the memory bottleneck of the neural-network-based embedding. Motivated by this thought process, we propose a novel, computationally feasible loss function, illustrated by Figure 2, which approximates our ideal loss by pushing N examples simultaneously.

3.1 Learning to identify from multiple negative examples

We formalize our proposed method, which is optimized to identify a positive example from multiple negative examples. Consider an (N+1)-tuplet of training examples {x, x^+, x_1, ···, x_{N−1}}: x^+ is a positive example to x and {x_i}_{i=1}^{N−1} are negative. The (N+1)-tuplet loss is defined as follows:

L({x, x^+, {x_i}_{i=1}^{N−1}}; f) = log(1 + Σ_{i=1}^{N−1} exp(f^T f_i − f^T f^+))    (3)

where f(·; θ) is an embedding kernel defined by a deep neural network. 
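The (N+1)-tuplet loss of Equation (3) is straightforward to compute from the query embedding, its positive, and the N−1 negatives; the sketch below (our naming, NumPy only) mirrors the formula term by term:

```python
import numpy as np

def tuplet_loss(f, f_pos, f_negs):
    """(N+1)-tuplet loss of Eq. (3).
    f      : (K,)      query embedding
    f_pos  : (K,)      positive embedding
    f_negs : (N-1, K)  negative embeddings, one per negative class
    """
    # similarity differences f^T f_i - f^T f^+, one per negative
    diffs = f_negs @ f - f @ f_pos
    # log(1 + sum_i exp(diffs_i)), computed stably with log1p
    return np.log1p(np.exp(diffs).sum())
```

When every negative is far less similar to the query than the positive, the sum of exponentials vanishes and the loss approaches log(1) = 0; a negative as similar as the positive contributes a full unit inside the logarithm.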
Recall that it is desirable for the tuplet loss to involve negative examples across all classes, but this is impractical when the number of output classes L is large; even if we restrict the number of negative examples per class to one, it is still too heavy-lifting to perform standard optimization, such as stochastic gradient descent (SGD), with a mini-batch size as large as L.
When N = 2, the corresponding (2+1)-tuplet loss highly resembles the triplet loss, as there is only one negative example for each pair of input and positive examples:

L_(2+1)-tuplet({x, x^+, x_1}; f) = log(1 + exp(f^T f_1 − f^T f^+));    (4)

L_triplet({x, x^+, x_1}; f) = max(0, f^T f_1 − f^T f^+).    (5)

Indeed, under mild assumptions, we can show that an embedding f minimizes L_(2+1)-tuplet if and only if it minimizes L_triplet, i.e., the two loss functions are equivalent.¹ When N > 2, we further argue the advantages of the (N+1)-tuplet loss over the triplet loss. We compare the (N+1)-tuplet loss with the triplet loss in terms of partition function estimation of an ideal (L+1)-tuplet loss, where an (L+1)-tuplet loss coupled with a single example per negative class can be written as follows:

log(1 + Σ_{i=1}^{L−1} exp(f^T f_i − f^T f^+)) = −log( exp(f^T f^+) / (exp(f^T f^+) + Σ_{i=1}^{L−1} exp(f^T f_i)) )    (6)

Equation (6) is similar to the multi-class logistic loss (i.e., softmax loss) formulation when we view f as a feature vector, f^+ and the f_i's as weight vectors, and the denominator on the right-hand side of Equation (6) as a partition function of the likelihood P(y = y^+). We observe that the partition function corresponding to the (N+1)-tuplet approximates that of the (L+1)-tuplet, and the larger the value of N, the more accurate the approximation. 
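One way to see the connection between Equations (4) and (5): the (2+1)-tuplet loss is the softplus log(1 + e^x) applied to the similarity gap x = f^T f_1 − f^T f^+, which is a smooth upper bound of the hinge max(0, x) used by the zero-margin triplet loss. A quick numerical check (our code, not from the paper):

```python
import numpy as np

# similarity gaps x = f^T f_1 - f^T f^+ over a wide range
x = np.linspace(-10.0, 10.0, 2001)
softplus = np.log1p(np.exp(x))    # (2+1)-tuplet loss of Eq. (4)
hinge = np.maximum(0.0, x)        # triplet loss of Eq. (5), margin 0

# the smooth loss upper-bounds the hinge everywhere ...
assert np.all(softplus >= hinge)
# ... and the two agree asymptotically far from the decision boundary
assert softplus[-1] - hinge[-1] < 1e-4
```

Unlike the hinge, the softplus keeps a non-zero gradient even for triplets that already satisfy the constraint, which is consistent with the paper's later choice of this "smooth upper bound" formulation in the experiments.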
Therefore, it naturally follows that the (N+1)-tuplet loss is a better approximation than the triplet loss to an ideal (L+1)-tuplet loss.

3.2 N-pair loss for efficient deep metric learning

Suppose we directly apply the (N+1)-tuplet loss to the deep metric learning framework. When the batch size of SGD is M, there are M×(N+1) examples to be passed through f at one update. Since the number of examples to evaluate for each batch grows with the product of M and N, it again becomes impractical to scale the training for very deep convolutional networks.
Now, we introduce an effective batch construction to avoid excessive computational burden. Let {(x_1, x_1^+), ···, (x_N, x_N^+)} be N pairs of examples from N different classes, i.e., y_i ≠ y_j, ∀i ≠ j. We build N tuplets, denoted {S_i}_{i=1}^N, from the N pairs, where S_i = {x_i, x_1^+, x_2^+, ···, x_N^+}. Here, x_i is the query for S_i, x_i^+ is the positive example, and x_j^+, j ≠ i, are the negative examples.

¹We assume f to have unit norm in Equation (5) to avoid degeneracy.

Figure 2: (a) Triplet loss, (b) (N+1)-tuplet loss, and (c) multi-class N-pair loss, with training batch construction. Assuming each pair belongs to a different class, the N-pair batch construction in (c) leverages all 2×N embedding vectors to build N distinct (N+1)-tuplets with {f_i}_{i=1}^N as their queries; thereafter, we congregate these N distinct tuplets to form the N-pair-mc loss. For a batch consisting of N distinct queries, triplet loss requires 3N passes to evaluate the necessary embedding vectors, (N+1)-tuplet loss requires (N+1)N passes, and our N-pair-mc loss only requires 2N.

Figure 2(c) illustrates this batch construction process. 
The corresponding (N+1)-tuplet loss, which we refer to as the multi-class N-pair loss (N-pair-mc), can be formulated as follows:²

L_N-pair-mc({(x_i, x_i^+)}_{i=1}^N; f) = (1/N) Σ_{i=1}^N log(1 + Σ_{j≠i} exp(f_i^T f_j^+ − f_i^T f_i^+))    (7)

The mathematical formulation of our N-pair loss shares similar spirits with other existing methods, such as neighbourhood component analysis (NCA) [6] and the triplet loss with lifted structure [21].³ Nevertheless, our batch construction is designed to achieve the utmost potential of such an (N+1)-tuplet loss when using deep CNNs as the embedding kernel on datasets that are large both in terms of training data and number of output classes. Therefore, the proposed N-pair-mc loss is a novel framework consisting of two indispensable components: the (N+1)-tuplet loss, as the building-block loss function, and the N-pair construction, as the key to enabling highly scalable training. Later, in Section 4.4, we empirically show the advantage of our N-pair-mc loss framework in comparison to other variations of mini-batch construction methods.
Finally, we note that the tuplet batch construction is not specific to the (N+1)-tuplet loss. We call the set of loss functions using the tuplet construction method an N-pair loss. For example, when integrated into the standard triplet loss, we obtain the following one-vs-one N-pair loss (N-pair-ovo):

L_N-pair-ovo({(x_i, x_i^+)}_{i=1}^N; f) = (1/N) Σ_{i=1}^N Σ_{j≠i} log(1 + exp(f_i^T f_j^+ − f_i^T f_i^+)).    (8)

3.2.1 Hard negative class mining

Hard negative data mining is considered an essential component of many triplet-based distance metric learning algorithms [19, 17, 4] to improve convergence speed or the final discriminative performance. 
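Under the N-pair batch construction, both Equation (7) and Equation (8) reduce to simple operations on the N×N matrix of similarities between queries and positives; a minimal NumPy sketch (our naming, without the symmetric term or L2 regularization discussed later):

```python
import numpy as np

def n_pair_mc_loss(f, f_pos):
    """Multi-class N-pair loss of Eq. (7).
    f, f_pos : (N, K) query and positive embeddings; row i of f_pos is
    the positive for row i of f and a negative for every other row."""
    logits = f @ f_pos.T                        # logits[i, j] = f_i^T f_j^+
    diffs = logits - np.diag(logits)[:, None]   # f_i^T f_j^+ - f_i^T f_i^+
    off = ~np.eye(len(f), dtype=bool)           # exclude the j = i term
    return np.mean(np.log1p(np.sum(np.exp(diffs) * off, axis=1)))

def n_pair_ovo_loss(f, f_pos):
    """One-vs-one N-pair loss of Eq. (8): one softplus term per negative."""
    logits = f @ f_pos.T
    diffs = logits - np.diag(logits)[:, None]
    off = ~np.eye(len(f), dtype=bool)
    return np.mean(np.sum(np.log1p(np.exp(diffs)) * off, axis=1))
```

Note that the mc loss couples all negatives inside one logarithm, whereas the ovo loss treats each negative independently; since 1 + Σ_j a_j ≤ Π_j (1 + a_j) for a_j ≥ 0, the mc loss never exceeds the ovo loss on the same batch.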
When the number of output classes is not too large, such mining may be unnecessary for the N-pair loss, since examples from most of the negative classes are already considered jointly. When we train on a dataset with many output classes, the N-pair loss can benefit from carefully selected impostor examples.
Evaluating deep embedding vectors for multiple examples from a large number of classes is computationally demanding. Moreover, for the N-pair loss, one theoretically needs N classes that are negative to one another, which substantially adds to the challenge of hard negative search. To overcome this difficulty, we propose negative "class" mining, as opposed to negative "instance" mining, which greedily selects negative classes in a relatively efficient manner.
More specifically, the negative class mining for N-pair loss can be executed as follows:

²We also consider the symmetric loss to Equation (7) that swaps f and f^+ to maximize the efficacy.
³Our N-pair batch construction can be seen as a special case of lifted structure [21] where the batch includes only positive pairs that are from disjoint classes. Besides, the loss function in [21] is based on the max-margin formulation, whereas we optimize the log probability of identification loss directly.

1. Evaluate Embedding Vectors: randomly choose a large number C of output classes; for each class, randomly pass a few (one or two) examples to extract their embedding vectors.
2. Select Negative Classes: select one class randomly from the C classes from step 1. Next, greedily add a new class that violates the triplet constraint the most w.r.t. the selected classes till we reach N classes. When a tie appears, we randomly pick one of the tied classes [28].
3. 
Finalize N-pair: draw two examples from each class selected in step 2.

3.2.2 L2 norm regularization of embedding vectors

The numerical value of f^T f^+ can be influenced not only by the direction of f^+ but also by its norm, even though the classification decision should be determined merely by the direction. Normalization can be a solution to avoid such a situation, but it is too stringent for our loss formulation since it bounds the value of |f^T f^+| to be less than 1 and makes the optimization difficult. Instead, we regularize the L2 norm of the embedding vectors to be small.

4 Experimental Results

We assess the impact of our proposed N-pair loss functions, namely the multi-class N-pair loss (N-pair-mc) and the one-vs-one N-pair loss (N-pair-ovo), on several generic and fine-grained visual recognition and verification tasks. As a baseline, we also evaluate the performance of triplet loss with negative data mining⁴ (triplet-nm). In our experiments, we draw a pair of examples from each of two different classes and then form two triplets: each with one of the positive examples as query, the other one as positive, and (any) one of the negative examples as negative. Thus, a batch of 2N training examples can produce (2N/4)×2 = N triplets, which is more efficient than the formulation in Equation (2), where we need 3N examples to form N triplets. We adopt the smooth upper bound of triplet loss in Equation (4), instead of the large-margin formulation [27], in all our experiments to be consistent with the N-pair-mc losses.
We use Adam [11] for mini-batch stochastic gradient descent with data augmentation, namely horizontal flips and random crops. For evaluation, we extract a feature vector and compute the cosine similarity for verification. 
When more than one feature vector is extracted via horizontal flips or from multiple crops, we use the cosine similarity averaged over all possible combinations between the feature vectors of the two examples. For all our experiments except face verification, we use ImageNet-pretrained GoogLeNet⁵ [23] for network initialization; for face verification, we use the same network architecture as CasiaNet [31] but trained from scratch without the last fully-connected layer for softmax classification. Our implementation is based on Caffe [10].

4.1 Fine-grained visual object recognition and verification

We evaluate deep metric learning algorithms on fine-grained object recognition and verification tasks. Specifically, we consider car and flower recognition problems on the following databases:

• Car-333 [29] dataset is composed of 164,863 images of cars from 333 model categories collected from the internet. Following the experimental protocol [29], we split the dataset into 157,023 images for training and 7,840 for testing.

• Flower-610 dataset contains 61,771 images of flowers from 610 different flower species; among all collected, 58,721 images are used for training and 3,050 for testing.

We train networks for 40k iterations with 144 examples per batch. This corresponds to 72 pairs per batch for N-pair losses. We perform 5-fold cross-validation on the training set and report the average performance on the test set. We evaluate both recognition and verification accuracy. Specifically, we consider a verification setting with varying numbers of negative examples from different classes, and count success only when the positive example is closer to the query example than every negative example. Since the recognition task is involved, we also evaluate the performance of deep networks trained with softmax loss. 
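The verification rule described at the start of this section (cosine similarity averaged over all combinations of the two examples' feature vectors) can be written compactly; a sketch under our naming:

```python
import numpy as np

def avg_cosine_similarity(feats_a, feats_b):
    """Verification score between two examples, each represented by one or
    more feature vectors (e.g., from horizontal flips or multiple crops).
    Averages cosine similarity over all |A| x |B| cross-combinations."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return float(np.mean(a @ b.T))   # mean over all pairs of feature vectors
```

A pair is then verified as the same class when this score exceeds a threshold, or, in the ranked setting above, when the positive scores higher than every negative.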
The summary results are given in Table 1.

⁴Throughout the experiments, negative data mining refers to negative class mining for both triplet and N-pair loss, instead of negative instance mining.
⁵https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet

Database     Metric         triplet      triplet-nm   72-pair-ovo  72-pair-mc   softmax
Car-333      Recognition    70.24±0.38   83.22±0.09   86.84±0.13   88.37±0.05   88.69±0.20 / 89.21±0.16†
             VRF (neg=1)    96.78±0.04   97.39±0.07   98.09±0.07   97.92±0.06   96.19±0.07
             VRF (neg=71)   48.96±0.35   65.14±0.24   73.05±0.25   76.02±0.30   55.36±0.30
Flower-610   Recognition    71.55±0.26   82.85±0.22   84.10±0.42   85.57±0.25   84.59±0.21 / 84.38±0.28†
             VRF (neg=1)    98.73±0.03   99.15±0.03   99.32±0.03   99.50±0.02   98.72±0.04
             VRF (neg=71)   73.04±0.13   83.13±0.15   87.42±0.18   88.63±0.14   78.44±0.33

Table 1: Mean recognition and verification accuracy with standard error on the test sets of the Car-333 and Flower-610 datasets. The recognition accuracy of all models is evaluated using a kNN classifier; for models with a softmax classifier, we also evaluate recognition accuracy using the softmax classifier (†). The verification accuracy (VRF) is evaluated at different numbers of negative examples.

We observe consistent improvement of the 72-pair loss models over the triplet loss models. Although negative data mining brings substantial improvement to the baseline models, the performance is not as competitive as that of the 72-pair loss models. Moreover, the 72-pair loss models are trained without negative data mining, and thus should be more effective for the deep metric learning framework. Between the N-pair loss models, the multi-class loss (72-pair-mc) shows better performance than the one-vs-one loss (72-pair-ovo). 
As discussed in Section 3.1, the superior performance of the multi-class formulation is expected, since the N-pair-ovo loss is decoupled in the sense that an individual loss is generated for each negative example independently.
Compared to the softmax loss, the recognition performance of the 72-pair-mc loss models is competitive, being slightly worse on Car-333 but better on Flower-610. However, the performance of the softmax loss model breaks down severely on the verification task. We argue that the representation of a model trained with a classification loss is not optimal for verification tasks. For example, examples near the classification decision boundary can still be classified correctly, but are prone to be missed for verification when there are examples from a different class near the boundary.

4.2 Distance metric learning for unseen object recognition

Distance metric learning allows learning a metric that can generalize to unseen categories. We highlight this aspect of deep metric learning on several visual object recognition benchmarks. Following the experimental protocol in [21], we evaluate on the following three datasets:

• Stanford Online Product [21] dataset is composed of 120,053 images from 22,634 online product categories, and is partitioned into 59,551 images of 11,318 categories for training and 60,502 images of 11,316 categories for testing.

• Stanford Car-196 [12] dataset is composed of 16,185 images of cars from 196 model categories. The first 98 model categories are used for training and the rest for testing.

• Caltech-UCSD Birds (CUB-200) [25] dataset is composed of 11,788 images of birds from 200 different species. Similarly, we use the first 100 categories for training.

Unlike in Section 4.1, the object categories between the train and test sets are disjoint. 
This makes the problem more challenging, since deep networks can easily overfit to the categories in the train set, and generalizing the distance metric to unseen object categories can be difficult.
We closely follow the experimental setting of [21]. For example, we initialize the network using ImageNet-pretrained GoogLeNet and train for 20k iterations using the same network architecture (e.g., 64-dimensional embedding for the Car-196 and CUB-200 datasets and 512-dimensional embedding for the Online product dataset) and the same number of examples (e.g., 120 examples) per batch. Besides, we use Adam for stochastic optimization, and other hyperparameters such as the learning rate are tuned accordingly via 5-fold cross-validation on the train set. We report the performance for both clustering and retrieval tasks in Table 2, using the F1 and normalized mutual information (NMI) [16] scores for clustering as well as the recall@K [9] score for retrieval.
We observe a similar trend as in Section 4.1. The triplet loss model performs the worst among all losses considered. Negative data mining can help the model escape from a poor local optimum, but the N-pair loss models outperform it even without the additional computational cost of negative data mining. The performance of the N-pair loss further improves when combined with the proposed negative data mining. Overall, we improve by 9.6% on F1 score, 1.99% on NMI score, and 14.41% on recall@1 score on the Online product dataset compared to the baseline triplet loss models. 
Lastly, our model outperforms the triplet loss with lifted structure [21], which demonstrates the effectiveness of the proposed N-pair batch construction.

Online product
          triplet   triplet-nm   triplet-lifted   60-pair-ovo   60-pair-ovo   60-pair-mc   60-pair-mc
                                 structure [21]                 -nm                        -nm
F1        19.59     24.27        25.6             23.13         25.31         26.53        28.19
NMI       86.11     87.23        87.5             86.98         87.45         87.77        88.10
K=1       53.32     62.39        61.8             60.71         63.85         65.25        67.73
K=10      72.75     79.69        79.9             78.74         81.22         82.15        83.76
K=100     87.66     91.10        91.1             91.03         91.89         92.60        92.98
K=1000    96.43     97.25        97.3             97.50         97.51         97.92        97.81

          Car-196                                              CUB-200
          triplet   triplet-nm   60-pair-ovo   60-pair-mc      triplet   triplet-nm   60-pair-ovo   60-pair-mc
F1        24.73     27.86        33.52         33.55           21.88     24.37        25.21         27.24
NMI       58.25     59.94        63.87         63.95           55.83     57.87        58.55         60.39
K=1       53.84     61.62        69.52         71.12           43.30     46.47        48.73         50.96
K=2       66.02     73.48        78.76         79.74           55.84     58.58        60.48         63.34
K=4       75.91     81.88        85.80         86.48           67.30     71.03        72.08         74.29
K=8       84.18     87.81        90.94         91.60           77.48     80.17        81.62         83.22

Table 2: F1, NMI, and recall@K scores on the test sets of the Online product [21], Car-196 [12], and CUB-200 [25] datasets. F1 and NMI scores are averaged over 10 different random seeds for k-means clustering, but standard errors are omitted due to space limits. The best performing model and those with overlapping standard errors are bold-faced.

               triplet       triplet-nm    192-pair-ovo   192-pair-mc   320-pair-mc
VRF            95.88±0.30    96.68±0.30    96.92±0.24     98.27±0.19    98.33±0.17
Rank-1         55.14         60.93         66.21          88.58         90.17
DIR@FAR=1%     25.96         34.60         34.14          66.51         71.76

Table 3: Mean verification accuracy (VRF) with standard error, rank-1 accuracy of closed-set identification, and DIR@FAR=1% rate of open-set identification [1] on the LFW dataset. 
The number of examples per batch is fixed to 384 for all models except the 320-pair-mc model.

4.3 Face verification and identification

Finally, we apply our deep metric learning algorithms to face verification and identification: the problem of determining whether two face images are of the same identity (verification), and the problem of identifying the face image of the same identity from a gallery with many negative examples (identification). We train our networks on the WebFace database [31], which is composed of 494,414 images from 10,575 identities, and evaluate the quality of embedding networks trained with different metric learning objectives on the Labeled Faces in the Wild (LFW) [8] database. We follow the network architecture in [31]. All networks are trained for 240k iterations, while the learning rate is decreased from 0.0003 to 0.0001 and 0.00003 at 160k and 200k iterations, respectively. We report the performance of face verification. The summary results are provided in Table 3.
The triplet loss model shows 95.88% verification accuracy, but its performance breaks down on the identification tasks. Although negative data mining helps, the improvement is limited. Compared to these, the N-pair-mc loss model improves the performance by a significant margin. Furthermore, we observe additional improvement by increasing N to 320, obtaining 98.33% for verification, 90.17% for closed-set, and 71.76% for open-set identification accuracy. It is worth noting that, although it shows better performance than the baseline triplet loss models, the N-pair-ovo loss model performs much worse than the N-pair-mc loss on this problem.
Interestingly, the N-pair-mc loss model also outperforms the model trained with combined contrastive loss and softmax loss, whose verification accuracy is reported as 96.13% [31]. 
Since this model is trained on the same dataset using the same network architecture, this clearly demonstrates the effectiveness of our proposed metric learning objectives on face recognition tasks. Nevertheless, other works have reported higher accuracy for face verification. For example, [19] demonstrated 99.63% test-set verification accuracy on the LFW database using a triplet network trained with hundreds of millions of examples, and [22] reported 98.97% by training multiple deep neural networks from different facial keypoint regions with combined contrastive loss and softmax loss. Since our contribution is complementary to the scale of the training data and to the network architecture, replacing the existing training objectives with our proposal is expected to bring further improvement.

Figure 3: Training curves of the triplet, 192-pair-ovo, and 192-pair-mc loss models on the WebFace database. We measure both (a) triplet and 192-pair loss as well as (b) triplet and 192-way classification accuracy.

          Online product        Car-196                       CUB-200
          30×4      60×2        10×12    30×4     60×2        10×12    30×4     60×2
F1        25.01     26.53       29.87    31.92    33.55       26.66    27.54    27.24
NMI       87.40     87.77       61.84    62.94    63.87       59.37    60.43    60.39
K=1       63.58     65.25       65.49    69.30    71.12       49.65    50.91    50.96

               32×12         64×6          96×4          192×2
VRF            97.57±0.33    97.98±0.22    98.25±0.25    98.27±0.19
Rank-1         79.61         83.96         87.53         88.58
DIR@FAR=1%     56.46         64.38         66.22         66.51

Table 4: F1, NMI, and recall@1 scores on the Online product, Car-196, and CUB-200 datasets, and verification and rank-1 accuracy on the LFW database. 
For a model denoted N x M, N refers to the number of distinct classes in each batch and M to the number of positive examples per class.

Finally, we provide training curves in Figure 3. Since the difference in triplet loss between models is relatively small, we also measure the 192-pair loss (and accuracy) of the three models every 5k iterations. We observe significantly faster training progress with the 192-pair-mc loss than with the triplet loss; it takes only 15k iterations to reach the loss that the triplet loss model attains at convergence (240k iterations).
4.4 Analysis on tuplet construction methods
In this section, we highlight the importance of the proposed tuplet construction strategy using N pairs of examples by conducting control experiments with different numbers of distinguishable classes per batch while keeping the total number of examples per batch fixed. For example, if we use N/2 different classes per batch rather than N, we select 4 examples from each class instead of a pair. Since the N-pair loss is not defined to handle multiple positive examples, we follow the definition of NCA [6] in these experiments:

\[ L = \frac{1}{2N} \sum_i -\log \frac{\sum_{j \neq i : y_j = y_i} \exp(f_i^\top f_j)}{\sum_{j \neq i} \exp(f_i^\top f_j)} \qquad (9) \]

We repeat the experiments in Sections 4.2 and 4.3 and provide the summary results in Table 4. We observe a certain degree of performance drop as we decrease the number of classes. Nevertheless, all of these results are substantially better than those of the triplet loss, confirming the importance of training with multiple negative classes and suggesting that one should train with as many negative classes as possible.
5 Conclusion
Triplet loss has been widely used for deep metric learning, despite its somewhat unsatisfactory convergence.
We present a scalable novel objective, the multi-class N-pair loss, for deep metric learning, which significantly improves upon the triplet loss by pushing away multiple negative examples jointly at each update. We demonstrate the effectiveness of the N-pair-mc loss on fine-grained visual recognition and verification, as well as visual object clustering and retrieval.
Acknowledgments
We express our sincere thanks to Wenling Shang for her support in many parts of this work, from algorithm development to paper writing. We also thank Junhyuk Oh and Paul Vernaza for helpful discussion.

References
[1] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Transactions on Information Forensics and Security, 9(12):2144-2157, 2014.

[2] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking.
Journal of Machine Learning Research, 11:1109-1135, 2010.

[3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.

[4] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR, 2016.

[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1-1, 2015.

[6] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.

[7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.

[8] G. B. Huang, M. Narayana, and E. Learned-Miller. Towards unconstrained face recognition. In CVPR Workshop, 2008.

[9] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117-128, 2011.

[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[12] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop, 2013.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[14] J. Liu, Y. Deng, T. Bai, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. CoRR, abs/1506.07310, 2015.

[15] D. G. Lowe. Similarity metric learning for a variable-kernel classifier.
Neural Computation, 7(1):72-85, 1995.

[16] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008.

[17] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov. Hamming distance metric learning. In NIPS, 2012.

[18] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.

[19] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.

[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[21] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.

[22] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.

[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[24] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.

[25] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[26] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.

[27] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005.

[28] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pages 2764-2770, 2011.

[29] S. Xie, T. Yang, X. Wang, and Y. Lin.
Hyper-class augmented and regularized deep learning for fine-grained image classification. In CVPR, 2015.

[30] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, 2003.

[31] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. CoRR, abs/1411.7923, 2014.

[32] X. Zhang, F. Zhou, Y. Lin, and S. Zhang. Embedding label structures for fine-grained feature representation. In CVPR, 2016.