{"title": "Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 1567, "page_last": 1578, "abstract": "Deep learning algorithms can fare poorly when the training dataset suffers from heavy class-imbalance but the testing criterion requires good generalization on less frequent classes. We design two novel methods to improve performance in such scenarios. First, we propose a theoretically-principled label-distribution-aware margin (LDAM) loss motivated by minimizing a margin-based generalization bound. This loss replaces the standard cross-entropy objective during training and can be applied with prior strategies for training with class-imbalance such as re-weighting or re-sampling. Second, we propose a simple, yet effective, training schedule that defers re-weighting until after the initial stage, allowing the model to learn an initial representation while avoiding some of the complications associated with re-weighting or re-sampling. We test our methods on several benchmark vision tasks including the real-world imbalanced dataset iNaturalist 2018. Our experiments show that either of these methods alone can already improve over existing techniques and their combination achieves even better performance gains.", "full_text": "Learning Imbalanced Datasets with\n\nLabel-Distribution-Aware Margin Loss\n\nKaidi Cao\n\nStanford University\n\nkaidicao@stanford.edu\n\ncolinwei@stanford.edu\n\nColin Wei\n\nStanford University\n\nAdrien Gaidon\n\nToyota Research Institute\n\nadrien.gaidon@tri.global\n\nNikos Arechiga\n\nToyota Research Institute\n\nnikos.arechiga@tri.global\n\nTengyu Ma\n\nStanford University\n\ntengyuma@stanford.edu\n\nAbstract\n\nDeep learning algorithms can fare poorly when the training dataset suffers from\nheavy class-imbalance but the testing criterion requires good generalization on less\nfrequent classes. 
We design two novel methods to improve performance in such\nscenarios. First, we propose a theoretically-principled label-distribution-aware\nmargin (LDAM) loss motivated by minimizing a margin-based generalization\nbound. This loss replaces the standard cross-entropy objective during training\nand can be applied with prior strategies for training with class-imbalance such as\nre-weighting or re-sampling. Second, we propose a simple, yet effective, training\nschedule that defers re-weighting until after the initial stage, allowing the model to\nlearn an initial representation while avoiding some of the complications associated\nwith re-weighting or re-sampling. We test our methods on several benchmark\nvision tasks including the real-world imbalanced dataset iNaturalist 2018. Our\nexperiments show that either of these methods alone can already improve over\nexisting techniques and their combination achieves even better performance gains1.\n\n1\n\nIntroduction\n\nModern real-world large-scale datasets often have long-tailed label distributions [51, 28, 34, 12,\n15, 50, 40]. On these datasets, deep neural networks have been found to perform poorly on less\nrepresented classes [17, 51, 5]. This is particularly detrimental if the testing criterion places more\nemphasis on minority classes. For example, accuracy on a uniform label distribution or the minimum\naccuracy among all classes are examples of such criteria. These are common scenarios in many\napplications [7, 42, 20] due to various practical concerns such as transferability to new domains,\nfairness, etc.\nThe two common approaches for learning long-tailed data are re-weighting the losses of the examples\nand re-sampling the examples in the SGD mini-batch (see [5, 21, 10, 17, 18, 9] and the references\ntherein). 
They both devise a training loss that is in expectation closer to the test distribution, and therefore can achieve better trade-offs between the accuracies of the frequent classes and the minority classes. However, because we have fundamentally less information about the minority classes and the models deployed are often huge, over-fitting to the minority classes appears to be one of the challenges in improving these methods.

We propose to regularize the minority classes more strongly than the frequent classes so that we can improve the generalization error of minority classes without sacrificing the model's ability to fit

1 Code available at https://github.com/kaidic/LDAM-DRW.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: For binary classification with a linearly separable classifier, the margin γ_i of the i-th class is defined to be the minimum distance of the data in the i-th class to the decision boundary. We show that the test error with the uniform label distribution is bounded by a quantity that scales in 1/(γ_1 √n_1) + 1/(γ_2 √n_2). As illustrated here, fixing the direction of the decision boundary leads to a fixed γ_1 + γ_2, but the trade-off between γ_1, γ_2 can be optimized by shifting the decision boundary. As derived in Section 3.1, the optimal trade-off is γ_i ∝ n_i^{-1/4}, where n_i is the sample size of the i-th class.

the frequent classes. Implementing this general idea requires a data-dependent or label-dependent regularizer — which in contrast to standard ℓ_2 regularization depends not only on the weight matrices but also on the labels — to differentiate frequent and minority classes.
The theoretical understanding of data-dependent regularizers is sparse (see [57, 43, 2] for a few recent works). We explore one of the simplest and most well-understood data-dependent properties: the margins of the training examples. Encouraging a large margin can be viewed as regularization, as standard generalization error bounds (e.g., [4, 59]) depend on the inverse of the minimum margin among all the examples. Motivated by the question of generalization with respect to minority classes, we instead study the minimum margin per class and obtain per-class and uniform-label test error bounds.2 Minimizing the obtained bounds gives an optimal trade-off between the margins of the classes. See Figure 1 for an illustration in the binary classification case.

Inspired by the theory, we design a label-distribution-aware loss function that encourages the model to have the optimal trade-off between per-class margins. The proposed loss extends the existing soft margin loss [53] by encouraging the minority classes to have larger margins. As a label-dependent regularization technique, our modified loss function is orthogonal to the re-weighting and re-sampling approach.
In fact, we also design a deferred re-balancing optimization procedure that allows us to combine the re-weighting strategy with our loss (or other losses) in a more efficient way.

In summary, our main contributions are (i) we design a label-distribution-aware loss function to encourage larger margins for minority classes, (ii) we propose a simple deferred re-balancing optimization procedure to apply re-weighting more effectively, and (iii) our practical implementation shows significant improvements on several benchmark vision tasks, such as artificially imbalanced CIFAR and Tiny ImageNet [1], and the real-world large-scale imbalanced dataset iNaturalist'18 [52].

2 Related Works

Most existing algorithms for learning imbalanced datasets can be divided into two categories: re-sampling and re-weighting.

Re-sampling. There are two types of re-sampling techniques: over-sampling the minority classes (see, e.g., [46, 60, 5, 6] and references therein) and under-sampling the frequent classes (see, e.g., [17, 23, 5] and the references therein). The downside of under-sampling is that it discards a large portion of the data and thus is not feasible when data imbalance is extreme. Over-sampling is effective in a lot of cases but can lead to over-fitting of the minority classes [9, 10]. Stronger data augmentation for minority classes can help alleviate the over-fitting [9, 61].

Re-weighting. Cost-sensitive re-weighting assigns (adaptive) weights to different classes or even different samples. The vanilla scheme re-weights classes proportionally to the inverse of their frequency [21, 22, 55]. Re-weighting methods tend to make the optimization of deep models difficult under extreme data imbalance and in large-scale scenarios [21, 22]. Cui et al. [10] observe

2 The same technique can also be used for other test label distributions as long as the test label distribution is known.
See Section C.5 for some experimental results.

that re-weighting by inverse class frequency yields poor performance on frequent classes, and thus propose re-weighting by the inverse effective number of samples. This is the main prior work that we empirically compare with.

Another line of work assigns weights to each sample based on its individual properties. Focal loss [35] down-weights the well-classified examples; Li et al. [31] suggest an improved technique which down-weights examples with either very small gradients or large gradients, because examples with small gradients are well-classified and those with large gradients tend to be outliers.

In a recent work [6], Byrd and Lipton study the effect of importance weighting and show that, empirically, importance weighting does not have a significant effect when no regularization is applied, which is consistent with the theoretical prediction in [48] that logistic regression without regularization converges to the max-margin solution. In our work, we explicitly encourage rare classes to have larger margins, and therefore we do not converge to a max-margin solution. Moreover, in our experiments, we apply non-trivial ℓ_2-regularization to achieve the best generalization performance. We also found that deferred re-weighting (or deferred re-sampling) is more effective than re-weighting and re-sampling from the beginning of training.

In contrast, and orthogonally to the papers above, our main technique aims to improve the generalization of the minority classes by applying additional regularization that is orthogonal to the re-weighting scheme. We also propose a deferred re-balancing optimization procedure to improve the optimization and generalization of a generic re-weighting scheme.

Margin loss. The hinge loss is often used to obtain a “max-margin” classifier, most notably in SVMs [49].
Recently, Large-Margin Softmax [37], Angular Softmax [38], and Additive Margin Softmax [53] have been proposed to minimize intra-class variation in predictions and enlarge the inter-class margin by incorporating the idea of angular margin. In contrast to the class-independent margins in these papers, our approach encourages bigger margins for minority classes. Uneven margins for imbalanced datasets are also proposed and studied in [32] and the recent work [25, 33]. Our theory puts this idea on a firmer theoretical footing by providing a concrete formula for the desired margins of the classes alongside good empirical progress.

Label shift in domain adaptation. The problem of learning imbalanced datasets can also be viewed as a label shift problem in transfer learning or domain adaptation (for which we refer the readers to the survey [54] and the references therein). In a typical label shift formulation, the difficulty is to detect and estimate the label shift; after estimating the label shift, re-weighting or re-sampling is applied. We are addressing a largely different question: can we do better than re-weighting or re-sampling when the label shift is known? In fact, our algorithms can be used to replace the re-weighting steps of some of the recent interesting work on detecting and correcting label shift [36, 3].

Distributionally robust optimization (DRO) is another technique for domain adaptation (see [11, 16, 8] and the references therein). However, the formulation assumes no knowledge of the target label distribution beyond a bound on the amount of shift, which makes the problem very challenging. We here assume knowledge of the test label distribution, using which we design efficient methods that can scale easily to large-scale vision datasets with significant improvements.

Meta-learning. Meta-learning is also used to improve the performance on imbalanced datasets or in few-shot learning settings.
We refer the readers to [55, 47, 56] and the references therein. So far, we generally believe that our approaches that modify the losses are more computationally efficient than meta-learning based approaches.

3 Main Approach

3.1 Theoretical Motivations

Problem setup and notations. We assume the input space is R^d and the label space is {1, . . . , k}. Let x denote the input and y denote the corresponding label. We assume that the class-conditional distribution P(x | y) is the same at training and test time. Let P_j denote the class-conditional distribution, i.e. P_j = P(x | y = j). We will use P_bal to denote the balanced test distribution which first samples a class uniformly and then samples data from P_j.

For a model f : R^d → R^k that outputs k logits, we use L_bal[f] to denote the standard 0-1 test error on the balanced data distribution:

    L_bal[f] = Pr_{(x,y)∼P_bal}[f(x)_y < max_{ℓ≠y} f(x)_ℓ]    (1)

Similarly, the error L_j for class j is defined as L_j[f] = Pr_{(x,y)∼P_j}[f(x)_y < max_{ℓ≠y} f(x)_ℓ]. Suppose we have a training dataset {(x_i, y_i)}_{i=1}^n. Let n_j be the number of examples in class j. Let S_j = {i : y_i = j} denote the example indices corresponding to class j.

Define the margin of an example (x, y) as

    γ(x, y) = f(x)_y − max_{j≠y} f(x)_j    (2)

Define the training margin for class j as:

    γ_j = min_{i∈S_j} γ(x_i, y_i)    (3)

We consider the separable cases (meaning that all the training examples are classified correctly) because neural networks are often over-parameterized and can fit the training data well. We also note that the minimum margin of all the classes, γ_min = min{γ_1, . . . , γ_k}, is the classical notion of training margin studied in the past [27].

Fine-grained generalization error bounds. Let F be the family of hypothesis class. Let C(F) be
Let C(F) be\nsome proper complexity measure of the hypothesis class F. There is a large body of recent work\non measuring the complexity of neural networks (see [4, 13, 57] and references therein), and our\ndiscussion below is orthogonal to the precise choices. When the training distribution and the test\ndistribution are the same, the typical generalization error bounds scale in C(F)/\nn. That is, in our\ncase, if the test distribution is also imbalanced as the training distribution, then\n\n\u221a\n\nimbalanced test error (cid:46) 1\n\u03b3min\n\n(cid:114)C(F)\n\nn\n\nNote that the bound is oblivious to the label distribution, and only involves the minimum margin\nacross all examples and the total number of data points. We extend such bounds to the setting\nwith balanced test distribution by considering the margin of each class. As we will see, the more\n\ufb01ne-grained bound below allows us to design new training loss function that is customized to the\nimbalanced dataset.\nTheorem 1 (Informal and simpli\ufb01ed version of Theorem 2). With high probability (1 \u2212 n\u22125) over\nthe randomness of the training data, the error Lj for class j is bounded by\n\nlog n\u221a\nnj\nwhere we use (cid:46) to hide constant factors. As a direct consequence,\n\nLj[f ] (cid:46) 1\n\u03b3j\n\n+\n\nk(cid:88)\n\nj=1\n\nLbal[f ] (cid:46) 1\nk\n\nC(F)\nnj\n\n+\n\nlog n\u221a\nnj\n\n(cid:33)\n\n(4)\n\n(5)\n\nClass-distribution-aware margin trade-off. The generalization error bound (4) for each class\nsuggests that if we wish to improve the generalization of minority classes (those with small nj\u2019s),\nwe should aim to enforce bigger margins \u03b3j\u2019s for them. However, enforcing bigger margins for\nminority classes may hurt the margins of the frequent classes. What is the optimal trade-off between\nthe margins of the classes? 
An answer for the general case may be difficult, but fortunately we can obtain the optimal trade-off for the binary classification problem.

With k = 2 classes, we aim to optimize the balanced generalization error bound provided in (5), which can be simplified to (by removing the low-order term (log n)/√n_j and the common factor C(F))

    1/(γ_1 √n_1) + 1/(γ_2 √n_2)    (6)

At first sight, because γ_1 and γ_2 are complicated functions of the weight matrices, it appears difficult to understand the optimal margins. However, we can figure out the relative scales between γ_1 and γ_2. Suppose γ_1, γ_2 > 0 minimize the equation above; we observe that any γ'_1 = γ_1 − δ and γ'_2 = γ_2 + δ (for δ ∈ (−γ_2, γ_1)) can be realized by the same weight matrices with a shifted bias term (see Figure 1 for an illustration). Therefore, for γ_1, γ_2 to be optimal, they should satisfy

    1/(γ_1 √n_1) + 1/(γ_2 √n_2) ≤ 1/((γ_1 − δ) √n_1) + 1/((γ_2 + δ) √n_2)    (7)

The equation above implies that

    γ_1 = C/n_1^{1/4}, and γ_2 = C/n_2^{1/4}    (8)

for some constant C. Please see a detailed derivation in Section A.

Fast rate vs. slow rate, and the implication on the choice of margins. The bound in Theorem 1 may not necessarily be tight. The generalization bounds that scale in 1/√n (or 1/√n_i here with imbalanced classes) are generally referred to as the “slow rate” and those that scale in 1/n are referred to as the “fast rate”. With deep neural networks, and when the model is sufficiently big, it is possible that some of these bounds can be improved to the fast rate. See [58] for some recent development.
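As a quick numeric sanity check of the trade-off in (8), the one-dimensional objective (6) can be minimized by brute force over the split of a fixed total margin; the minimizer's ratio should match (n_2/n_1)^{1/4}. The sketch below is illustrative only: the margin budget and grid resolution are arbitrary choices, not quantities from the paper.

```python
import math

def optimal_margin_split(n1, n2, budget=1.0, steps=200_000):
    """Minimize 1/(g1*sqrt(n1)) + 1/(g2*sqrt(n2)) over g1 + g2 = budget
    by grid search, mirroring Eq. (6) with the direction of the decision
    boundary (hence g1 + g2) held fixed."""
    best_val, best_g1 = float("inf"), None
    for i in range(1, steps):
        g1 = budget * i / steps
        g2 = budget - g1
        val = 1.0 / (g1 * math.sqrt(n1)) + 1.0 / (g2 * math.sqrt(n2))
        if val < best_val:
            best_val, best_g1 = val, g1
    return best_g1, budget - best_g1

g1, g2 = optimal_margin_split(n1=10_000, n2=100)
# Theory (Eq. (8)) predicts g1/g2 = (n2/n1)**0.25 ~ 0.316
print(round(g1 / g2, 3))
```

With a 100:1 sample-size ratio, the minority class receives a margin roughly 3.16 times larger than the frequent class, matching γ_i ∝ n_i^{-1/4}.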
In those cases, we can derive the optimal trade-off of the margins to be γ_i ∝ n_i^{−1/3}.

3.2 Label-Distribution-Aware Margin Loss

Inspired by the trade-off between the class margins in Section 3.1 for two classes, we propose to enforce a class-dependent margin for multiple classes of the form

    γ_j = C/n_j^{1/4}    (9)

We will design a soft margin loss function to encourage the network to have the margins above. Let (x, y) be an example and f be a model. For simplicity, we use z_j = f(x)_j to denote the model's output (logit) for the j-th class.

The most natural choice would be a multi-class extension of the hinge loss:

    L_LDAM-HG((x, y); f) = max(max_{j≠y}{z_j} − z_y + Δ_y, 0)    (10)
    where Δ_j = C/n_j^{1/4} for j ∈ {1, . . . , k}    (11)

Here C is a hyper-parameter to be tuned. In order to tune the margin more easily, we effectively normalize the logits (the input to the loss function) by normalizing the last hidden activation to ℓ_2 norm 1, and normalizing the weight vectors of the last fully-connected layer to ℓ_2 norm 1, following the previous work [53]. Notice that we then scale the logits by a constant s = 10, also following [53].

Empirically, the non-smoothness of the hinge loss may pose difficulties for optimization. The smooth relaxation of the hinge loss is the following cross-entropy loss with enforced margins:

    L_LDAM((x, y); f) = −log [ e^{z_y − Δ_y} / ( e^{z_y − Δ_y} + Σ_{j≠y} e^{z_j} ) ]    (12)
    where Δ_j = C/n_j^{1/4} for j ∈ {1, . . . , k}    (13)

In the previous work [37, 38, 53], where the training set is usually balanced, the margin Δ_y is chosen to be a label-independent constant C, whereas our margin depends on the label distribution.

Remark: Attentive readers may find the loss L_LDAM somewhat reminiscent of re-weighting, because in the binary classification case — where the model outputs a single real number which is passed through a sigmoid to be converted into a probability — both approaches change the gradient of an example by a scalar factor. However, we remark on two key differences: the scalar factor introduced by re-weighting depends only on the class, whereas the scalar introduced by L_LDAM also depends on the output of the model; and for multiclass classification problems, the proposed loss L_LDAM affects the gradient of the example in a more involved way than only introducing a scalar factor. Moreover, recent work has shown that, under separability assumptions, the logistic loss, with weak regularization [59] or without regularization [48], gives the max-margin solution, which is in turn not affected by any re-weighting by its definition. This further suggests that the loss L_LDAM and re-weighting may complement each other, as we have seen in the experiments. (Re-weighting would affect the margin in the non-separable case, which is left for future work.)

3.3 Deferred Re-balancing Optimization Schedule

Cost-sensitive re-weighting and re-sampling are two well-known and successful strategies to cope with imbalanced datasets because, in expectation, they effectively make the imbalanced training distribution closer to the uniform test distribution.
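Before turning to the schedule, the margin loss of Eq. (12)-(13) in Section 3.2 can be made concrete with a short NumPy sketch for batched logits and hard labels. The values of C and the use of NumPy (rather than the paper's PyTorch implementation) are illustrative assumptions; s = 10 follows the logit scaling described above.

```python
import numpy as np

def ldam_loss(logits, labels, class_counts, C=0.5, s=10.0):
    """Cross-entropy with label-distribution-aware margins (Eq. (12)):
    subtract Delta_y = C / n_y**(1/4) from the true-class logit,
    scale by s, then take the usual softmax cross-entropy."""
    deltas = C / np.asarray(class_counts, dtype=float) ** 0.25  # Delta_j, Eq. (13)
    labels = np.asarray(labels)
    z = np.array(logits, dtype=float)
    idx = np.arange(len(labels))
    z[idx, labels] -= deltas[labels]   # larger enforced margin for rarer classes
    z = s * z
    z -= z.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()
```

With C = 0 this reduces to the standard (scaled) cross-entropy; with C > 0, a correctly classified rare-class example keeps incurring loss until its margin exceeds Δ_y.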
The known issues with applying these techniques are (a) re-sampling the examples in minority classes often causes heavy over-fitting to the minority classes when the model is a deep neural network, as pointed out in prior work (e.g., [10]), and (b) weighting up the minority classes' losses can cause difficulties and instability in optimization, especially when the classes are extremely imbalanced [10, 21]. In fact, Cui et al. [10] develop a novel and sophisticated learning rate schedule to cope with the optimization difficulty.

We observe empirically that re-weighting and re-sampling are both inferior to the vanilla empirical risk minimization (ERM) algorithm (where all training examples have the same weight) before annealing the learning rate, in the following sense: the features produced before annealing the learning rate by re-weighting and re-sampling are worse than those produced by ERM. (See Figure 6 for an ablation study of the feature quality, performed by training linear classifiers on top of the features on a large balanced dataset.)

Inspired by this, we develop a deferred re-balancing training procedure (Algorithm 1), which first trains using vanilla ERM with the LDAM loss before annealing the learning rate, and then deploys a re-weighted LDAM loss with a smaller learning rate. Empirically, the first stage of training leads to a good initialization for the second stage of training with re-weighted losses. Because the loss is non-convex and the learning rate in the second stage is relatively small, the second stage does not move the weights very far. Interestingly, with our LDAM loss and deferred re-balancing training, the vanilla re-weighting scheme (which re-weights by the inverse of the number of examples in each class) works as well as the re-weighting scheme introduced in prior work [10]. We also found that with our re-weighting scheme and LDAM, we are less sensitive to early stopping than [10].

Algorithm 1 Deferred Re-balancing Optimization with LDAM Loss
Require: Dataset D = {(x_i, y_i)}_{i=1}^n. A parameterized model f_θ
 1: Initialize the model parameters θ randomly
 2: for t = 1 to T_0 do
 3:     B ← SampleMiniBatch(D, m)    ▷ a mini-batch of m examples
 4:     L(f_θ) ← (1/m) Σ_{(x,y)∈B} L_LDAM((x, y); f_θ)
 5:     f_θ ← f_θ − α ∇_θ L(f_θ)    ▷ one SGD step
 6:     Optional: α ← α/τ    ▷ anneal learning rate by a factor τ if necessary
 7:
 8: for t = T_0 to T do
 9:     B ← SampleMiniBatch(D, m)    ▷ a mini-batch of m examples
10:     L(f_θ) ← (1/m) Σ_{(x,y)∈B} n_y^{−1} · L_LDAM((x, y); f_θ)    ▷ standard re-weighting by frequency
11:     f_θ ← f_θ − α (m / Σ_{(x,y)∈B} n_y^{−1}) ∇_θ L(f_θ)    ▷ one SGD step with re-normalized learning rate

4 Experiments

We evaluate our proposed algorithm on artificially created versions of IMDB review [41], CIFAR-10, CIFAR-100 [29] and Tiny ImageNet [45, 1] with controllable degrees of data imbalance, as well as a real-world large-scale imbalanced dataset, iNaturalist 2018 [52]. Our core algorithm is developed using PyTorch [44].

Table 1: Top-1 validation errors on the imbalanced IMDB review dataset. Our proposed approach LDAM-DRW outperforms the baselines.

Approach  | Error on positive reviews | Error on negative reviews | Mean Error
ERM       |  2.86 | 70.78 | 36.82
RS        |  7.12 | 45.88 | 26.50
RW        |  5.20 | 42.12 | 23.66
LDAM-DRW  |  4.91 | 30.77 | 17.84

Baselines. We compare our methods with standard training and several state-of-the-art techniques and their combinations that have been widely adopted to mitigate the issues with training on imbalanced datasets: (1) Empirical risk minimization (ERM) loss: all the examples have the same weights; by default, we use the standard cross-entropy loss.
(2) Re-Weighting (RW): we re-weight each sample by the inverse of the sample size of its class, and then re-normalize to make the weights 1 on average in the mini-batch. (3) Re-Sampling (RS): each example is sampled with probability proportional to the inverse sample size of its class. (4) CB [10]: the examples are re-weighted or re-sampled according to the inverse of the effective number of samples in each class, defined as (1 − β^{n_i})/(1 − β), instead of inverse class frequencies. This idea can be combined with either re-weighting or re-sampling. (5) Focal: we use the recently proposed focal loss [35] as another baseline. (6) SGD schedule: by SGD, we refer to the standard schedule where the learning rate is decayed by a constant factor at certain steps; we use a standard learning rate decay schedule.

Our proposed algorithm and variants. We test combinations of the following techniques proposed by us. (1) DRW and DRS: following the proposed training Algorithm 1, we use the standard ERM optimization schedule until the last learning rate decay, and then apply re-weighting or re-sampling for optimization in the second stage. (2) LDAM: the proposed Label-Distribution-Aware Margin losses as described in Section 3.2.

When two of these methods can be combined, we concatenate the acronyms with a dash in between as an abbreviation. The main algorithm we propose is LDAM-DRW. Please refer to Section B for additional implementation details.

4.1 Experimental results on IMDB review dataset

The IMDB review dataset consists of 50,000 movie reviews for binary sentiment classification [41]. The original dataset contains an evenly distributed number of positive and negative reviews. We manually created an imbalanced training set by removing 90% of the negative reviews. We train a two-layer bidirectional LSTM with the Adam optimizer [26].
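The per-class weight computations behind the RW and CB baselines and the DRW stage gating above reduce to a few lines. The sketch below (pure Python, helper names are our own) renormalizes each weight vector to average 1, matching the mini-batch normalization described for RW; treat it as an illustration rather than the paper's exact implementation.

```python
def inverse_frequency_weights(class_counts):
    """RW baseline: weight each class by 1/n_j, renormalized to mean 1."""
    w = [1.0 / n for n in class_counts]
    mean = sum(w) / len(w)
    return [x / mean for x in w]

def effective_number_weights(class_counts, beta=0.9999):
    """CB [10]: weight by the inverse effective number
    (1 - beta**n_j) / (1 - beta), renormalized to mean 1."""
    w = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    mean = sum(w) / len(w)
    return [x / mean for x in w]

def drw_weights(class_counts, epoch, defer_until):
    """DRW: uniform weights in the first stage; inverse-frequency
    weights after the last learning-rate decay at `defer_until`."""
    if epoch < defer_until:
        return [1.0] * len(class_counts)
    return inverse_frequency_weights(class_counts)
```

As β → 0 the effective numbers all approach 1 and CB weighting becomes uniform; as β → 1 it approaches inverse-frequency weighting.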
The results are reported in Table 1.

4.2 Experimental results on CIFAR

Imbalanced CIFAR-10 and CIFAR-100. The original versions of CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 validation images of size 32×32, with 10 and 100 classes, respectively. To create their imbalanced versions, we reduce the number of training examples per class and keep the validation set unchanged. To ensure that our methods apply to a variety of settings, we consider two types of imbalance: long-tailed imbalance [10] and step imbalance [5]. We use the imbalance ratio ρ to denote the ratio between the sample sizes of the most frequent and least frequent classes, i.e., ρ = max_i{n_i}/min_i{n_i}. Long-tailed imbalance follows an exponential decay in sample sizes across different classes. In the step imbalance setting, all minority classes have the same sample size, as do all frequent classes. This gives a clear distinction between minority classes and frequent classes, which is particularly useful for ablation studies. We further define the fraction of minority classes as μ. By default we set μ = 0.5 for all experiments.

Table 2: Top-1 validation errors of ResNet-32 on imbalanced CIFAR-10 and CIFAR-100.
The combination of our two techniques, LDAM-DRW, achieves the best performance, and each of them individually is beneficial when combined with other losses or schedules.

Dataset         | Imbalanced CIFAR-10             | Imbalanced CIFAR-100
Imbalance Type  | long-tailed   | step            | long-tailed   | step
Imbalance Ratio | 100   | 10    | 100   | 10      | 100   | 10    | 100   | 10
ERM             | 29.64 | 13.61 | 36.70 | 17.50   | 61.68 | 44.30 | 61.45 | 45.37
Focal [35]      | 29.62 | 13.34 | 36.09 | 16.36   | 61.59 | 44.22 | 61.43 | 46.54
LDAM            | 26.65 | 13.04 | 33.42 | 15.00   | 60.40 | 43.09 | 60.42 | 43.73
CB RS           | 29.45 | 13.21 | 38.14 | 15.41   | 66.56 | 44.94 | 66.23 | 46.92
CB RW [10]      | 27.63 | 13.46 | 38.06 | 16.20   | 66.01 | 42.88 | 78.69 | 47.52
CB Focal [10]   | 25.43 | 12.90 | 39.73 | 16.54   | 63.98 | 42.01 | 80.24 | 49.98
HG-DRS          | 27.16 | 14.03 | 29.93 | 14.85   |   -   |   -   |   -   |   -
LDAM-HG-DRS     | 24.42 | 12.72 | 24.53 | 12.82   |   -   |   -   |   -   |   -
M-DRW           | 24.94 | 13.57 | 27.67 | 13.17   | 59.49 | 43.78 | 58.91 | 44.72
LDAM-DRW        | 22.97 | 11.84 | 23.08 | 12.19   | 57.96 | 41.29 | 54.64 | 40.54

Table 3: Validation errors on iNaturalist 2018 of various approaches. Our proposed method LDAM-DRW demonstrates significant improvements over the previous state of the art. We include ERM-DRW and LDAM-SGD for the ablation study.

Loss          | Schedule | Top-1 | Top-5
ERM           | SGD      | 42.86 | 21.31
CB Focal [10] | SGD      | 38.88 | 18.97
ERM           | DRW      | 36.27 | 16.55
LDAM          | SGD      | 35.42 | 16.48
LDAM          | DRW      | 32.00 | 14.82

We report the top-1 validation errors of various methods for the imbalanced versions of CIFAR-10 and CIFAR-100 in Table 2. Our proposed approach is LDAM-DRW, but we also include various combinations of our two techniques with other losses and training schedules for our ablation study. We first show that the proposed label-distribution-aware margin cross-entropy loss is superior to the pure cross-entropy loss and to one of its variants tailored for imbalanced data, focal loss, when no data-rebalancing learning schedule is applied.
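The long-tailed and step imbalance profiles described in the CIFAR setup can be generated as follows. This is a sketch of one natural parameterization of the exponential decay; the authors' exact rounding may differ.

```python
def long_tailed_counts(n_max, num_classes, rho):
    """Exponentially decaying per-class sample sizes with imbalance
    ratio rho = max_i n_i / min_i n_i."""
    return [int(round(n_max * rho ** (-i / (num_classes - 1))))
            for i in range(num_classes)]

def step_counts(n_max, num_classes, rho, mu=0.5):
    """Step imbalance: a mu-fraction of classes are minority classes
    with n_max/rho examples each; the rest keep n_max examples."""
    n_minority = int(round(num_classes * mu))
    return ([n_max] * (num_classes - n_minority)
            + [int(round(n_max / rho))] * n_minority)
```

For CIFAR-10 with n_max = 5000 and ρ = 100, the long-tailed profile runs from 5000 examples for the head class down to 50 for the tail class, while the step profile with μ = 0.5 gives five classes of 5000 examples and five of 50.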
We also demonstrate that our full pipeline outperforms the previous state of the art by a large margin. To further demonstrate that the proposed LDAM loss is essential, we compare it with regularizing by a uniform margin across all classes under the settings of cross-entropy loss and hinge loss. We use M-DRW to denote the algorithm that uses a cross-entropy loss with uniform margin [53] in place of LDAM; namely, the Δ_j in equation (13) is chosen to be a tuned constant that does not depend on the class j. Hinge loss (HG) suffers from optimization issues with 100 classes, so we constrain its experiment setting to CIFAR-10 only.

Imbalanced but known test label distribution: We also test the performance of an extension of our algorithm in the setting where the test label distribution is known but not uniform. Please see Section C.5 for details.

4.3 Visual recognition on iNaturalist 2018 and imbalanced Tiny ImageNet

We further verify the effectiveness of our method on large-scale imbalanced datasets. The iNaturalist species classification and detection dataset [52] is a real-world large-scale imbalanced dataset which has 437,513 training images with a total of 8,142 classes in its 2018 version. We adopt the official training and validation splits for our experiments. The training datasets have a long-tailed label distribution and the validation set is designed to have a balanced label distribution. We use ResNet-50 as the backbone network for all experiments on iNaturalist 2018. Table 3 summarizes the top-1 validation errors for iNaturalist 2018. Notably, our full pipeline is able to outperform the ERM baseline

Figure 2: Per-class top-1 error on CIFAR-10 with step imbalance (ρ = 100, μ = 0.5). Classes 0-F to 4-F are frequent classes, and the rest are minority classes. Under this extremely imbalanced setting RW suffers from under-fitting, while RS over-fits on minority examples.
In contrast, the proposed algorithm exhibits strong generalization on minority classes while keeping the performance on frequent classes almost unaffected. This suggests we succeeded in regularizing minority classes more strongly.

Figure 3: Imbalanced training errors (dotted lines) and balanced test errors (solid lines) on CIFAR-10 with long-tailed imbalance (ρ = 100). We decay the learning rate at epoch 160 for all algorithms. Our DRW schedule uses ERM before annealing the learning rate and thus performs worse than RW and RS before that point, as expected. However, it outperforms the others significantly after annealing the learning rate. See Section 4.4 for more analysis.

by 10.86% and the previous state of the art by 6.88% in top-1 error. Please refer to Appendix C.2 for results on imbalanced Tiny ImageNet.

4.4 Ablation study

Evaluating generalization on minority classes. To better understand the improvement of our algorithms, we show per-class errors of different methods in Figure 2 on imbalanced CIFAR-10. Please see the caption there for discussions.

Evaluating the deferred re-balancing schedule. We compare the learning curves of the deferred re-balancing schedule with other baselines in Figure 3. In Figure 6 of Section C.3, we further show that even though ERM in the first stage has slightly worse or comparable balanced test error compared to RW and RS, the features (the penultimate-layer activations) learned by ERM are in fact better than those learned by RW and RS. This agrees with our intuition that the second stage of DRW, starting from better features, adjusts the decision boundary and locally fine-tunes the features.

5 Conclusion

We propose two methods for training on imbalanced datasets: a label-distribution-aware margin loss (LDAM) and a deferred re-weighting (DRW) training schedule.
Our methods achieve significantly improved performance on a variety of benchmark vision tasks. Furthermore, we provide a theoretically-principled justification of LDAM by showing that it optimizes a uniform-label generalization error bound. For DRW, we believe that deferring re-weighting lets the model avoid the drawbacks associated with re-weighting or re-sampling until after it learns a good initial representation (see some analysis in Figure 3 and Figure 6). However, the precise explanation for DRW's success is not yet fully theoretically clear, and we leave this as a direction for future work.

Acknowledgements Toyota Research Institute ("TRI") provided funds and computational resources to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We thank Percy Liang and Michael Xie for helpful discussions at various stages of this work.

References

[1] Tiny ImageNet visual recognition challenge. URL https://tiny-imagenet.herokuapp.com.

[2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

[3] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl0r3R9KX.

[4] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

[5] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski.
A systematic study of the class imbalance\n\nproblem in convolutional neural networks. Neural Networks, 106:249\u2013259, 2018.\n\n[6] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In\n\nInternational Conference on Machine Learning, 2019.\n\n[7] Kaidi Cao, Yu Rong, Cheng Li, Xiaoou Tang, and Chen Change Loy. Pose-robust face recognition via deep\nresidual equivariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 5187\u20135196, 2018.\n\n[8] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. arXiv\n\npreprint arXiv:1907.02056, 2019.\n\n[9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic\n\nminority over-sampling technique. Journal of arti\ufb01cial intelligence research, 16:321\u2013357, 2002.\n\n[10] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on\neffective number of samples. In The IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2019.\n\n[11] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against\n\nmixture covariate shifts.\n\n[12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The\npascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303\u2013338,\n2010.\n\n[13] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural\n\nnetworks. arXiv preprint arXiv:1712.06541, 2017.\n\n[14] Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew\nTulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv\npreprint arXiv:1706.02677, 2017.\n\n[15] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 
Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.

[16] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1934–1943, 2018.

[17] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263–1284, 2008.

[18] Haibo He and Yunqian Ma. Imbalanced learning: foundations, algorithms, and applications. John Wiley & Sons, 2013.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[20] J Henry Hinnefeld, Peter Cooman, Nat Mammo, and Rupert Deese. Evaluating fairness metrics in the presence of dataset bias. arXiv preprint arXiv:1809.09245, 2018.

[21] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5375–5384, 2016.

[22] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Deep imbalanced learning for face recognition and attribute prediction. IEEE transactions on pattern analysis and machine intelligence, 2019.

[23] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002.

[24] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pages 793–800, 2009.

[25] Salman Khan, Munawar Hayat, Syed Waqas Zamir, Jianbing Shen, and Ling Shao.
Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 103–112, 2019.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[27] Vladimir Koltchinskii, Dmitry Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.

[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[30] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[31] Buyu Li, Yu Liu, and Xiaogang Wang. Gradient harmonized single-stage detector. arXiv preprint arXiv:1811.05181, 2018.

[32] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz Kandola. The perceptron algorithm with uneven margins. In ICML, volume 2, pages 379–386, 2002.

[33] Zeju Li, Konstantinos Kamnitsas, and Ben Glocker. Overfitting of neural nets under class imbalance: Analysis and improvements for segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 402–410. Springer, 2019.

[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755.
Springer, 2014.

[35] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

[36] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3128–3136, 2018.

[37] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016.

[38] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.

[39] Yu Liu, Hongyang Li, and Xiaogang Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.

[40] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.

[41] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pages 142–150. Association for Computational Linguistics, 2011.

[42] Michele Merler, Nalini Ratha, Rogerio S Feris, and John R Smith. Diversity in faces. arXiv preprint arXiv:1901.10436, 2019.

[43] Vaishnavh Nagarajan and Zico Kolter. Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=Hygn2o0qKX.\n\n[44] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming\n\nLin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.\n\n[45] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,\nAndrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.\nInternational journal of computer vision, 115(3):211\u2013252, 2015.\n\n[46] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep\nconvolutional neural networks. In European conference on computer vision, pages 467\u2013482. Springer,\n2016.\n\n[47] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net:\n\nLearning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379, 2019.\n\n[48] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit\nbias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822\u20132878,\n2018.\n\n[49] Johan AK Suykens and Joos Vandewalle. Least squares support vector machine classi\ufb01ers. Neural\n\nprocessing letters, 9(3):293\u2013300, 1999.\n\n[50] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian\nBorth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817,\n2015.\n\n[51] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classi\ufb01cation in the wild. arXiv\n\npreprint arXiv:1709.01450, 2017.\n\n[52] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro\nPerona, and Serge Belongie. The inaturalist species classi\ufb01cation and detection dataset. 
In Proceedings of\nthe IEEE conference on computer vision and pattern recognition, pages 8769\u20138778, 2018.\n\n[53] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face veri\ufb01cation.\n\nIEEE Signal Processing Letters, 25(7):926\u2013930, 2018.\n\n[54] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135\u2013153,\n\n2018.\n\n[55] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In Advances in Neural\n\nInformation Processing Systems, pages 7029\u20137039, 2017.\n\n[56] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary\ndata. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278\u2013\n7286, 2018.\n\n[57] Colin Wei and Tengyu Ma. Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz\n\nAugmentation. arXiv e-prints, art. arXiv:1905.03684, May 2019.\n\n[58] Colin Wei and Tengyu Ma. Improved sample complexities for deep networks and robust classi\ufb01cation via\n\nan all-layer margin. arXiv preprint arXiv:1910.04284, 2019.\n\n[59] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks.\n\narXiv preprint arXiv:1810.05369, 2018.\n\n[60] Q Zhong, C Li, Y Zhang, H Sun, S Yang, D Xie, and S Pu. Towards good practices for recognition &\n\ndetection. In CVPR workshops, 2016.\n\n[61] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Domain adaptation for semantic segmentation via\n\nclass-balanced self-training. 
arXiv preprint arXiv:1810.07911, 2018.