NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1013 Improved Deep Metric Learning with Multi-class N-pair Loss Objective

### Reviewer 1

#### Summary

The paper proposes an improved loss function for training deep metric learning models that considers one target example, one positive example that belongs to the same class as the target, and N-1 negative examples that belong to different classes than the target. This can be seen as a generalization of the popular triplet loss, which uses a target, one positive example, and one negative example. To make training efficient, a form of the loss that uses N pairs of examples from N different classes, called multi-class N-pair loss, is proposed. For tasks having a very large number of target classes, a variation on hard negative mining is proposed to ensure that difficult impostors are emphasized in the training process. Experiments on a number of visual object recognition, visual object verification, unseen object recognition, face verification, and face identification tasks show that the proposed loss usually outperforms competing losses such as the triplet loss, triplet loss with hard negative data mining, softmax, and contrastive+softmax losses.

#### Qualitative Assessment

Having read the other reviews and the rebuttal from the authors, I still believe that this paper should be accepted, but as a poster, not an oral. The primary contribution of the paper is the N-pair selection, which as the authors point out in their rebuttal, reduces the number of network passes for the loss computation from O(N^2) to O(N). If the paper has any problems, it is that this point is obscured somewhat by the discussion of the N+1-tuplet loss, which as the other reviewers point out, is closely related to NCA. Nevertheless, the value of the N-pair selection is sufficient to justify publication -- it is a nice idea that would probably be adopted and extended by other researchers. ========== The proposed loss function makes sense, and the arguments laid out to support it are clear and logical. The experimental evaluation of the proposed loss is thoroughly done and well designed, and the results are quite convincing. In Section 3.2.2 I was left wondering why the regularization does not attempt to force the embeddings to have a fixed 2-norm of 1.0 instead of simply requiring them to have a small 2-norm. There are a few issues with the writing of the paper that I have listed below. Correcting them would improve the quality of the paper. Globally: "tuples" is more standard terminology than "tuplets". Line 68: "in quadratic to" -> "quadratically with" Line 138: "it is still too heavy-lifting to perform" -> "it is still too computationally expensive to perform" Line 159: "not realizable consistent" -> "not realizably consistent" Line 182: "grows in quadratic to" -> "grows quadratically with" Line 248: "adapt the smooth upper bound" -> "adopt the smooth upper bound" Line 427: "multi-calss" -> "multi-class"

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

This paper proposes a training scheme for large-scale, deep metric learning, where training batches are constructed to provide multiple negative similarity observations for each point. The motivation for this approach is that it makes more efficient use of labeled training data, and this appears to be supported by experiments on a variety of data sets.

#### Qualitative Assessment

I found this paper well written and easy to follow. The motivation makes intuitive sense, and the arguments seem well supported by the experimental results. I would like to see a little more detail in the description of the experiment conditions. At line 285, the authors claim to train the networks for 40K iterations; is there any validation used here for early stopping? Alternately, is this sufficiently many steps for the triplet networks (which receive comparably less information per step) to converge? It seems plausible that the triplet networks are have not yet converged, which could invalidate the results (especially for the CUB200 set with early stopping at 3K). If this is not the case, then some discussion of this would strengthen the paper.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper aims to improve existing deep metric learning techniques which typically suffer from slow convergence problem and high computational complexity by using additional negative data mining process. For this purpose, it extends the existing triplet loss to a tuplet loss that could be considered as one variant of softmax loss formulation, and employs a shared comparison scheme between the involved samples and multiple negative examples in a mini-batch. Since the comparison scheme merely requires a smaller number of passes, the new learning strategy somehow reduces the complexity of training each batch of selected samples. Besides, this paper also proposes a hard negative class mining strategy to greedily add the class that violates the previously selected classes the most to the training batch. In the experimental evaluations, the work studies the effectiveness of the proposed method by using three different settings of visual recognition tasks.

#### Qualitative Assessment

As I was beginning to read this paper I felt generally positive about it. Although its technical contribution is not large, I felt that there was merit in what the authors were describing. However, as I read on I found that there was a series of increasingly serious concerns. The first one is the motivation of this paper. In the line 109-111, this paper claimed that the existing deep metric learning methods generally only compare an example with one randomly selected negative example while ignoring negative examples from the classes. As far as I know, this is not really true because there exit a few deep metric learning methods such as [21] that also uses multiple negative examples from different classes. In the other word, I don’t know why the paper did not present the case of (N+1)-triplet loss at all. The second is that the designed N-pair selection strategy in batches can be not only coupled with the proposed loss function but also work with the existing triplet loss. Unfortunately, the authors merely exploit the new pair comparison scheme in the setting of the proposed tuplet loss function. This would lead the readers highly misunderstand the real contributions of this work. With the unclear study on the proposed pair selection scheme for deep metric learning, the evaluations are also insufficient. As discussed before, the paper should compare the cases of (N+1)-triplet, (N+1)-tuplet, N-pair mc-triplet which are very important to study the exact improvement of the proposed method. Last but not least, to substantiate the original motivations for higher convergence rate and lower computational complexity, the convergence behavior and running time comparison should be studied in the experimental part. Overall, my main concerns are the unclear motivation, undercooked techniques and inconclusive evaluations of the presented work. So, in my opinion, the quality of this paper is thought to be not enough to be published in NIPS.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

This work proposes an N-pairs loss function for training deep metric learning models. Similarly to previous work in deep metric learning, it learns a non-linear transformation of the examples that encourages examples of the same category to be close in the output space. At the difference of previous approaches, the optimized loss function considers N-1 negative examples at every step rather than just one. This is supposed to lead to faster convergence and to a more discriminative representation of the data than the one obtained with contrastive or triplet based losses. The equivalence between the triplet-loss and 2-pair loss.

#### Qualitative Assessment

The paper is well written and easy to follow. The relevant literature is properly cited even though few important references are missing (See next). I have three main criticisms: - (Novelty) The idea of using several negative examples is not new. It has been successfully used in Neighbourhood Component Analysis (NCA). The explanation about the difference between the proposal and NCA is not convincing (subsection 3.3). Why is N-pairs less prone to having an empty set of positive examples than NCA (data preparation can be achieved the same way)? Therefore, since the N-pairs loss is the main contribution of the paper, this questions the amount of novelty in the paper. - (Main concern) The N-pair loss is mainly compared to the triplet or the 2-pair loss (which it extends). In both of these approaches, at each round the negative example is sampled randomly. This AUC sampling scheme is problematic when the number of classes is large. In such situations, most of the classes are well separated and the constrained in the triplet-loss is often not violated. Hence the gradient is often zero and learning is slow. The same issue is faced when learning embeddings[1][2]. Using a larger number of negatives as in the N-pair loss is one way of solving the problem because as the number of negative increases, the loss is likely to be larger. Another way of solving the problem is to use WARP sampling as in [2][3][4] instead of AUC sampling. WARP sampling scans the negative examples until one that violates the constraint (for triplet-loss) is found. This results in faster convergence because the loss is often non-zero. WARP sampling has been successfully used in both learning embeddings[2]3] and metric learning[4]. Therefore, this is an obvious competitor that should be cited and compared to. It is not sure that N-pair loss converges faster of leads to more accurate models than triplet loss with WARP sampling. The key here seems to be how the sampling is performed rather than how many pairs are samples. - (Experiments) The N-pairs loss is motivated in this work by the ability to deal with problems having intractable output distributions or very large number of classes (lines 011-014 abstract, lines 095-102 introduction, etc.). However, all the experiments have been performed on datasets with a relatively small number of classes. In addition, how much faster does N-pair loss (for various N) converge compared to 2-pair or triplet loss? [1] A Discriminative Kernel-based Model to Rank Images from Text Queries, Grangier et al. [2] WSABIE: Scaling Up To Large Vocabulary Image Annotation, Weston et al. [3] Training highly multiclass classifiers, Gupta et al. [4] Efficient Learning of Mahalanobis Metrics for Ranking, Lim and Lanckriet

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 5

#### Summary

This paper proposes a new metric learning objective called multi-class N-pair loss, which generalizes triplet loss by allowing joint comparison among more than one negative examples, such as N-1 negative examples. It also reduces the computational complexity of evaluating deep embedding vectors through using only N pairs of examples, instead of (N+1)N. Extensive experiments show the proposed multi-class N-pair loss outperforms the triplet loss and other competing loss functions on various tasks of fine-grained object recognition and verification, image clustering and retrieval, and face verification and identification.

#### Qualitative Assessment

Strengths: 1. The paper tries to solve an important problem, i.e., negative example selection in loss function of deep metric learning. 2. Extensive experiments are conducted to analyze the proposed loss on various tasks including fine-grained object recognition and verification, image clustering and retrieval, and face verification and identification. 3. The paper is well written and is easy to follow in most sections. Weaknesses: 1. The novelty is a little weak. It is not clear what's the significant difference and advantage compared to NCA [6] and "Small codes and large image databases for recognition", A. Torralba et al., 2008, which used NCA in deep learning. 2. In the experiment of face recognition, some state-of-the art references are missing, such as Baidu' work "Targeting Ultimate Accuracy: Face Recognition via Deep Embedding", http://vis-www.cs.umass.edu/lfw/results.html#baidu. In that work, the triplet loss is also used and it reported the result trained on the dataset containing 9K identities and 450K images, which is similar with Webface. The VRF can achieve 98.65% on LFW which is better than the result in Table 3 in this paper.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 6

#### Summary

Update: My review remains the same after rebuttal. The rebuttal did help to clear some questions. I think the paper can be accepted, provided the authors include the comments and points discussed in the revised manuscript. Summary of the paper: The paper proposed a new multi-class pair/triplet loss for metric learning with CNN. The novelty of their work lies in the proposed loss function and the efficient construction of the mini-batches for it. Their method improves over the standard triplet loss, which pushes the impostor of one offending class at a time. In contrast their loss pushes impostors of multiple classes for each update. They claim, that learning while pushing multiple impostor classes leads to a better metric. They verify their claim experimentally and benchmark their methods against relevant baselines. Overall the paper is well written, well-presented and provides clear experimental results supporting their claim. Advantages: - Novel loss function for metric learning, which works well over standard approaches - Easy to implement and understand - Simple method for efficient mini-batch construction Disadvantages: - Method might not be extensible to cases where we only have similar and dissimilar constraints as it needs labels for constructing the tuple in an efficient manner.

#### Qualitative Assessment

Novelty/Originality: The novelty of the paper is two fold. First, they analyze the widely used triplet loss and its variants. Doing so, they propose a modification for the triplet loss, coined as the tuplet loss. Second, for the proposed tuplet loss, they propose and efficient method for forming the mini-batch of and SGD. This is quite simple and straightforward to implement, and does not come at a high computational cost which come into play when people use strategies like hard negative mining. There are few things which I would like the authors to respond to: Q1: Strong usage of "Slow Convergence issue" In line 047, they mention that people have evaluated combining softmax loss with contrastive or triplet loss for overcoming Slow Convergence issue. This is mentioned in the abstract and other places as well. However, this is not entirely true. There are other reasons for which this combination has been tried. For example, in [24], it was shown that while training the CNN for face dataset, using both classification and identification (pairwise loss) gave superior performance than using either alone. They showed that using both losses they could obtain a regularization and hence a better solution, but their goal was not to correct "slow convergence". Also, the statement in conclusion at L427 is too bold stating performance of triplet loss as "unsatisfactory convergence". This is too strong and should be revised, given in their own results the triplet loss performs quite decently in some cases. Q2: In section 3.2.2 they perform regularization on the embedding vectors to be $\ell_2$ regularized. I was curious, if they tried using cosine similarity directly to avoid this explicit regularization? Q3: In Table 1, I would like to see the results when negative mining is added for ovo and mc, the same way it was done in Table 2. Q4: Just to be clear, in section 4.2 when the authors discuss results in Table 2, they mention negative data mining. I am assuming this refers to negative class mining. If that is the case, it should be mentioned as that to avoid confusion given in L222 a clear distinction was made between them. Also, in the methods where no negative class mining is mentioned, how is it replaced? by random classes. This should be made explicit if that is the case? Q5: In table 2, why were all the methods stopped at 20k iterations? I find it strange that the optimal number of early stopping iterations is the same for all these different loss functions? Can the authors explain which method was chosen to set this 20k iteration? Also, I think this should be revised and the number of iterations should be set by cross-validation for each loss independently.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)