{"title": "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences", "book": "Advances in Neural Information Processing Systems", "page_first": 15013, "page_last": 15022, "abstract": "We introduce a temperature into the exponential function and replace the softmax output layer of the neural networks by a high-temperature generalization. Similarly, the logarithm in the loss we use for training is replaced by a low-temperature logarithm. By tuning the two temperatures, we create loss functions that are non-convex already in the single layer case. When replacing the last layer of the neural networks by our bi-temperature generalization of the logistic loss, the training becomes more robust to noise. We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large datasets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method that uses the Tsallis divergence.", "full_text": "Robust Bi-Tempered Logistic Loss\n\nBased on Bregman Divergences\n\nEhsan Amid \u2039: Manfred K. Warmuth \u2039: Rohan Anil : Tomer Koren :\u00a7\n\n\u2039 Department of Computer Science, University of California, Santa Cruz\n\n\u00a7 School of Computer Science, Tel Aviv University, Tel Aviv, Israel\n\n{eamid,manfred,rohananil,tkoren}@google.com\n\n: Google Brain\n\nAbstract\n\nWe introduce a temperature into the exponential function and replace the softmax\noutput layer of the neural networks by a high-temperature generalization. Similarly,\nthe logarithm in the loss we use for training is replaced by a low-temperature\nlogarithm. By tuning the two temperatures, we create loss functions that are\nnon-convex already in the single layer case. When replacing the last layer of\nthe neural networks by our bi-temperature generalization of the logistic loss, the\ntraining becomes more robust to noise. 
We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large datasets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method that uses the Tsallis divergence.\n\n1 Introduction\n\nThe logistic loss, also known as the softmax loss, has been the standard choice in training deep neural networks for classification. The loss involves the application of the softmax function on the activations of the last layer to form the class probabilities, followed by the relative entropy (a.k.a. the Kullback-Leibler (KL) divergence) between the true labels and the predicted probabilities. The logistic loss is known to be a convex function of the activations (and consequently, the weights) of the last layer.\n\nAlthough desirable from an optimization standpoint, convex losses have been shown to be prone to outliers [15], as the loss of each individual example increases without bound as a function of the activations. These outliers may correspond to extreme examples that lead to large gradients, or misclassified training examples that are located far away from the classification boundary. Requiring a convex loss function at the output layer thus seems somewhat arbitrary, in particular since convexity in the last layer's activations does not guarantee convexity with respect to the parameters of the network outside the last layer. Another issue arises due to the exponentially decaying tail of the softmax function that assigns probabilities to the classes. In the presence of mislabeled training examples near the classification boundary, the short tail of the softmax probabilities forces the classifier to stretch the decision boundary towards the noisy training examples. 
In contrast, heavy-tailed alternatives for the softmax probabilities have been shown to significantly improve the robustness of the loss to these examples [8].\n\nThe logistic loss is essentially the negative logarithm of the predicted class probabilities, which are computed as the normalized exponentials of the inputs. In this paper, we tackle both shortcomings of the logistic loss, pertaining to its convexity as well as its tail-lightness, by replacing the logarithm and the exponential functions with their corresponding “tempered” versions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Tempered logarithm and exponential functions, and the bi-tempered logistic loss: (a) log_t function, (b) exp_t function, bi-tempered logistic loss when (c) t2 = 1.2 fixed and t1 ≤ 1, and (d) t1 = 0.8 fixed and t2 ≥ 1.\n\nWe define the function log_t : R+ → R with temperature parameter t ≥ 0 as in [16]:\n\nlog_t(x) := (1/(1 − t)) (x^(1−t) − 1) .   (1)\n\nThe log_t function is monotonically increasing and concave. The standard (natural) logarithm is recovered at the limit t → 1. Unlike the standard log, the log_t function is bounded from below by −1/(1 − t) for 0 ≤ t < 1. This property will be used to define bounded loss functions that are significantly more robust to outliers. Similarly, our heavy-tailed alternative for the softmax function is based on the tempered exponential function. The function exp_t : R → R+ with temperature t ≥ 0 is defined as the inverse¹ of log_t, that is,\n\nexp_t(x) := [1 + (1 − t) x]_+^(1/(1−t)) ,   (2)\n\nwhere [ · ]_+ = max{ · , 0}. The standard exp function is again recovered at the limit t → 1. Compared to the exp function, a heavier tail (for negative values of x) is achieved for t > 1. 
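The tempered pair in (1) and (2) is straightforward to compute directly. The sketch below (NumPy; the helper names `log_t` and `exp_t` are ours, not from the paper's code release) implements both, falling back to the standard log/exp at t = 1.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm (1): log_t(x) = (x^(1-t) - 1) / (1 - t); natural log at t = 1."""
    if t == 1.0:
        return np.log(x)
    return (np.power(x, 1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential (2): exp_t(x) = [1 + (1-t) x]_+^(1/(1-t)); exp at t = 1."""
    if t == 1.0:
        return np.exp(x)
    return np.power(np.maximum(1.0 + (1.0 - t) * x, 0.0), 1.0 / (1.0 - t))

# For 0 <= t < 1, log_t is bounded below by -1/(1 - t); for t > 1, exp_t has a
# heavier (polynomial rather than exponential) tail for negative arguments.
```

For example, `log_t(x, 0.5)` never drops below −2, and `exp_t(-5.0, 2.0)` equals 1/6, far larger than `exp(-5)`; this is exactly the tail-heaviness exploited later for the tempered softmax.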
We use this property to define heavy-tailed analogues of softmax probabilities at the output layer.\n\nThe vanilla logistic loss can be viewed as a logarithmic (relative entropy) divergence that operates on a “matching” exponential (softmax) probability assignment [11, 12]. Its convexity then stems from classical convex duality, using the fact that the probability assignment function is the gradient of the dual function to the negative entropy on the simplex. When the log_t1 and exp_t2 are substituted instead, this duality still holds whenever t1 = t2, albeit with a different Bregman divergence, and the induced loss remains convex². However, for t1 < t2, the loss becomes non-convex in the output activations. In particular, 0 ≤ t1 < 1 leads to a bounded loss, while t2 > 1 provides tail-heaviness. Figure 1 illustrates the tempered log_t and exp_t functions as well as examples of our proposed bi-tempered logistic loss function for a two-class problem expressed as a function of the activation of the first class. The true label is assumed to be class one.\n\nTempered generalizations of the logistic regression have been introduced before [7, 8, 22, 2]. The most recent two-temperature method [2] is based on the Tsallis divergence and contains all the previous methods as special cases. However, the Tsallis based divergences do not result in proper loss functions. In contrast, we show that the Bregman based construction introduced in this paper is indeed proper, which is a requirement for many real-world applications.\n\n1.1 Our replacement of the softmax output layer in neural networks\n\nConsider an arbitrary classification model with multiclass softmax output. We are given training examples of the form (x, y), where x is a fixed dimensional input vector and the target y is a probability vector over k classes. 
In practice, the targets are often one-hot encoded binary vectors in k dimensions. Each input x is fed to the model, resulting in a vector z of inputs to the final softmax layer. This layer typically has one trainable weight vector w_i per class i and yields the predicted class probability\n\nŷ_i = exp(â_i) / ∑_{j=1}^k exp(â_j) = exp(â_i − log ∑_{j=1}^k exp(â_j)), for linear activation â_i = w_i · z for class i.\n\n¹When 0 ≤ t < 1, the domain of exp_t needs to be restricted to −1/(1 − t) ≤ x for the inverse property to hold.\n²In a restricted domain when t1 = t2 < 1, as discussed later.\n\nFigure 2: Logistic vs. robust bi-tempered logistic loss: (a) noise-free labels, (b) small-margin label noise, (c) large-margin label noise, and (d) random label noise. The temperature values (t1, t2) for the bi-tempered loss are shown above each figure: (a) bounded & heavy-tail (0.2, 4.0), (b) only heavy-tail (1.0, 4.0), (c) only bounded (0.2, 1.0), and (d) bounded & heavy-tail (0.2, 4.0).\n\nWe first replace the softmax function by a generalized heavy-tailed version that uses the exp_t2 function with t2 > 1, which we call the tempered softmax function:\n\nŷ_i = exp_t2(â_i − λ_t2(â)), where λ_t2(â) ∈ R is s.t. ∑_{j=1}^k exp_t2(â_j − λ_t2(â)) = 1 .\n\nThis requires computing the normalization value λ_t2(â) (for each example) via a binary search or an iterative procedure like the one given in Appendix A. 
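A minimal way to compute the normalization is the binary search just mentioned (the fixed-point method of Appendix A is an alternative). The sketch below (NumPy; function names are ours, and this is an illustration for t > 1 rather than the paper's released implementation) brackets λ between max_i a_i, where the partition sum is at least one, and max_i a_i − log_t(1/k), where every term is at most 1/k, and then bisects.

```python
import numpy as np

def log_t(x, t):
    # Tempered logarithm: (x^(1-t) - 1) / (1 - t), for t != 1.
    return (np.power(x, 1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    # Tempered exponential: [1 + (1-t) x]_+^(1/(1-t)), for t != 1.
    return np.power(np.maximum(1.0 + (1.0 - t) * x, 0.0), 1.0 / (1.0 - t))

def tempered_softmax(a, t, n_iters=60):
    """Solve sum_j exp_t(a_j - lam) = 1 for lam by bisection (assumes t > 1),
    then return the normalized tempered probabilities exp_t(a - lam)."""
    lo = np.max(a)                            # here the sum is >= exp_t(0) = 1
    hi = np.max(a) - log_t(1.0 / len(a), t)   # here every term is <= 1/k
    for _ in range(n_iters):
        mid = 0.5 * (lo + hi)
        # The partition sum is decreasing in lam because exp_t is increasing.
        if np.sum(exp_t(a - mid, t)) > 1.0:
            lo = mid
        else:
            hi = mid
    return exp_t(a - 0.5 * (lo + hi), t)
```

For â = (1, 0, −1) and t2 = 1.5, the returned vector sums to one (up to bisection tolerance) and preserves the ordering of the activations.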
The relative entropy between the true label y and prediction ŷ is replaced by the tempered version with temperature range 0 ≤ t1 < 1,\n\n∑_{i=1}^k ( y_i (log_t1 y_i − log_t1 ŷ_i) − (1/(2 − t1)) (y_i^(2−t1) − ŷ_i^(2−t1)) )  =  −log_t1 ŷ_c − (1/(2 − t1)) (1 − ∑_{i=1}^k ŷ_i^(2−t1))  if y is one-hot,\n\nwhere c = argmax_i y_i is the index of the one-hot class. In later sections we prove various properties of this loss. When t1 = t2 = 1, then it reduces to the vanilla relative entropy loss with softmax. Also when 0 ≤ t1 < 1, then the loss is bounded, while t2 > 1 gives the tempered softmax function a heavier tail.\n\n1.2 An illustration\n\nWe provide some intuition on why both boundedness of the loss as well as tail-heaviness of the tempered softmax are crucial for robustness. For this, we train a small two-layer feed-forward neural network on a synthetic binary classification problem in two dimensions. The network has 10 and 5 units in the first and second layer, respectively³. Figure 2(a) shows the results of the logistic and our bi-tempered logistic loss on the noise-free dataset. The network converges to a desirable classification boundary (the white stripe in the figure) using both loss functions. In Figure 2(b), we illustrate the effect of adding small-margin label noise to the training examples, targeting those examples that reside near the noise-free classification boundary. The logistic loss clearly follows the noisy examples by stretching the classification boundary. On the other hand, using only the tail-heavy tempered softmax function (t2 = 4 while t1 = 1, i.e. KL divergence as the divergence) can handle the noisy examples by producing more uniform class probabilities. 
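In the one-hot case, the tempered loss above needs only the predicted probabilities and the true class index. A small sketch (NumPy; the helper `log_t` and the function name are ours, not the paper's released code):

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; natural log at t = 1."""
    if t == 1.0:
        return np.log(x)
    return (np.power(x, 1.0 - t) - 1.0) / (1.0 - t)

def bi_tempered_loss_one_hot(y_hat, c, t1):
    """One-hot form of the tempered relative entropy:
    -log_t1(y_hat_c) - (1/(2 - t1)) * (1 - sum_i y_hat_i^(2 - t1))."""
    y_hat = np.asarray(y_hat, dtype=float)
    return -log_t(y_hat[c], t1) - (1.0 - np.sum(np.power(y_hat, 2.0 - t1))) / (2.0 - t1)
```

At t1 = 1, the second term vanishes (the probabilities sum to one) and the loss reduces to the usual −log ŷ_c; for 0 ≤ t1 < 1, the loss stays finite even as ŷ_c → 0, since −log_t1 is bounded above by 1/(1 − t1) on (0, 1].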
Next, we show the effect of large-margin noisy examples in Figure 2(c), targeting examples that are located far away from the noise-free classification boundary. The convexity of the logistic loss causes the network to be highly affected by the noisy examples that are located far away from the boundary. In contrast, only the boundedness of the loss (t1 = 0.2 while t2 = 1, meaning that the outputs are vanilla softmax probabilities) reduces the effect of the outliers by allocating at most a finite amount of loss to each example. Finally, we show the effect of random label noise that includes both small-margin and large-margin noisy examples in Figure 2(d). Clearly, the logistic loss fails to handle the noise, while our bi-tempered logistic loss successfully recovers the appropriate boundary. Note that for random noise, we exploit both boundedness of the loss (t1 = 0.2 < 1) as well as the tail-heaviness of the probability assignments (t2 = 4 > 1).\n\n³An interactive visualization of the bi-tempered loss is available at: https://google.github.io/bi-tempered-loss/\n\nThe theoretical background as well as our treatment of the softmax layer of the neural networks are developed in later sections. In particular, we show that special discrete choices of the temperatures result in a large variety of divergences commonly used in machine learning. As we show in our experiments, tuning the two temperatures as continuous parameters is crucial.\n\n1.3 Summary of the experiments\n\nWe perform experiments by adding synthetic label noise to the MNIST and CIFAR-100 datasets and compare the results of our robust bi-tempered loss to the vanilla logistic loss. 
Our bi-tempered loss is significantly more robust to label noise (when trained on noisy data and test accuracy is measured w.r.t. the clean data): it provides 98.56% and 62.55% accuracy on MNIST and CIFAR-100, respectively, when trained with 40% label noise (compared to 97.64% and 53.17%, respectively, obtained using the logistic loss). The bi-tempered loss also yields improvement over the state-of-the-art results on the ImageNet-2012 dataset using both the Resnet18 and Resnet50 architectures (see Table 2).\n\n2 Preliminaries\n\n2.1 Convex duality and Bregman divergences on the simplex\n\nWe start by briefly reviewing some basic background in convex analysis. For a continuously-differentiable strictly convex function F : D → R, with convex domain D ⊆ R^k, the Bregman divergence [3] between y, ŷ ∈ D induced by F is defined as\n\n∆_F(y, ŷ) = F(y) − F(ŷ) − (y − ŷ) · f(ŷ) ,\n\nwhere f(ŷ) := ∇F(ŷ) denotes the gradient of F at ŷ (sometimes called the link function of F). Clearly ∆_F(y, ŷ) ≥ 0 and ∆_F(y, ŷ) = 0 iff y = ŷ. Also, the Bregman divergence is always convex in the first argument, with ∇_y ∆_F(y, ŷ) = f(y) − f(ŷ), but not generally in its second argument. The Bregman divergence generalizes many well-known divergences such as the squared Euclidean distance ∆_F(y, ŷ) = (1/2) ||y − ŷ||_2^2 (with F(y) = (1/2) ||y||_2^2) and the Kullback–Leibler divergence ∆_F(y, ŷ) = ∑_i (y_i log (y_i/ŷ_i) − y_i + ŷ_i) (with F(y) = ∑_i (y_i log y_i − y_i)). The Bregman divergence is typically not symmetric, i.e. 
∆_F(y, ŷ) ≠ ∆_F(ŷ, y). Additionally, the Bregman divergence is invariant to adding affine functions to the convex function F: ∆_{F+A}(y, ŷ) = ∆_F(y, ŷ), where A(y) = b + c · y for arbitrary b ∈ R, c ∈ R^k.\n\nFor every differentiable strictly convex function F (with domain D ⊆ R^k_+), there exists a convex dual function F* : D* → R such that for dual parameter pairs (y, a), a ∈ D*, the following holds: a = f(y) and y = f*(a) = ∇F*(a) = f^(−1)(a). However, we are mainly interested in the dual of the function F when the domain is restricted to the probability simplex S^k := {y ∈ R^k_+ | ∑_{i=1}^k y_i = 1}. Let F̌* : Ď* → R denote the convex conjugate of the restricted function F : D ∩ S^k → R,\n\nF̌*(a) = sup_{y' ∈ D ∩ S^k} ( y' · a − F(y') ) = sup_{y' ∈ D} inf_{λ ∈ R} ( y' · a − F(y') + λ (1 − ∑_{i=1}^k y'_i) ) ,\n\nwhere we introduced a Lagrange multiplier λ ∈ R to enforce the linear constraint ∑_{i=1}^k y'_i = 1. At the optimum, the following relationships hold between the primal and dual variables:\n\nf(y) = a − λ(a) 1 and y = f^(−1)(a − λ(a) 1) = f̌*(a) ,   (3)\n\nwhere λ(a) is chosen so that it satisfies the constraint. Note the dependence of the optimum λ on a.\n\n2.2 Matching losses\n\nNext, we recall the notion of a matching loss [11, 12, 4, 17]. It arises as a natural way of defining a loss function over activations â ∈ R^k, by first mapping them to a probability distribution over class labels using a transfer function s : R^k → S^k, and then computing a divergence ∆_F between this distribution and the correct target labels. 
The idea behind the following definition is to “match” the transfer function and the divergence via duality.⁴\n\nDefinition 1 (Matching Loss). Let F : S^k → R be a continuously-differentiable, strictly convex function and let s : R^k → S^k be a transfer function such that ŷ = s(â) denotes the predicted probability distribution based on the activations â. Then the loss function\n\nL_F(â | y) := ∆_F(y, s(â)) ,\n\nis called the matching loss for s, if s = f̌* = ∇F̌*.\n\nNote that f̌* is no longer one-to-one since f̌*(â + R 1) = f̌*(â) (see Appendix D for more details). However, w.l.o.g. we can constrain the domain of the function to â ∈ dom(f̌*) ∩ {a' ∈ R^k | a' · 1 = 0} to obtain a one-to-one mapping. The matching loss is useful due to the following property.\n\nProposition 1. The matching loss L_F(â | y) is convex w.r.t. the activations â ∈ dom(f̌*) ∩ {a' ∈ R^k | a' · 1 = 0}.\n\nProof. Note that F̌* is a strictly convex function and the following relation holds between the divergences induced by F and F̌* (see proof of Proposition 4 in Appendix D):\n\n∆_F(y, ŷ) = ∆_{F̌*}((f̌*)^(−1)(ŷ), (f̌*)^(−1)(y)).   (4)\n\nThus for any â in the range of (f̌*)^(−1),\n\n∆_F(y, f̌*(â)) = ∆_{F̌*}(â, (f̌*)^(−1)(y)).\n\nThe claim now follows from the convexity of ∆_{F̌*} w.r.t. its first argument.\n\nThe original motivating example for the matching loss was the logistic loss [11, 12]. 
It can be obtained as the matching loss for the softmax function\n\nŷ_i = [f̌*(â)]_i = exp(â_i) / ∑_{j=1}^k exp(â_j) ,\n\nwhich corresponds to the relative entropy (KL) divergence\n\nL_F(â | y) = ∆_F(y, f̌*(â)) = ∑_{i=1}^k y_i (log y_i − log ŷ_i) = ∑_{i=1}^k (y_i log y_i − y_i â_i) + log ( ∑_{i=1}^k exp(â_i) ) ,\n\ninduced from the negative entropy function F(y) = ∑_{i=1}^k (y_i log y_i − y_i). We next define a family of convex functions F_t parameterized by a temperature t ≥ 0. The matching loss L_{F_t}(â | y) = ∆_{F_t}(y, f̌_t*(â)) for the link function f̌_t* of F̌_t* is convex in the activations â. However, by letting the temperature t2 of f̌_{t2}* be larger than the temperature t1 of F_{t1}, we construct bounded non-convex losses with heavy-tailed transfer functions.\n\n⁴Originally in [11, 12], the matching loss was defined as a simple integral over the transfer function s = f^(−1): L_F(â | y) = ∫_{s^(−1)(y)}^{â} (s(z) − y) · dz. Our new duality based definition handles additional linear constraints.\n\n3 Tempered Matching Loss\n\nWe start by introducing a generalization of the relative entropy divergence, denoted by ∆_{F_t}, induced by a strictly convex function F_t : R^k_+ → R with a temperature parameter t ≥ 0. The convex function F_t is chosen so that its gradient takes the form⁵ f_t(y) := ∇F_t(y) = log_t y. Via simple integration, we obtain that\n\nF_t(y) = ∑_{i=1}^k ( y_i log_t y_i + (1/(2 − t)) (1 − y_i^(2−t)) ) .\n\n⁵Here, the log_t function is applied elementwise.\n\nIndeed, F_t is a convex function since ∇²F_t(y) = diag(y^(−t)) ≻ 0 for any y ∈ R^k_+. In fact, F_t is strongly convex, for 0 ≤ t ≤ 1:\n\nLemma 1. The function F_t, with 0 ≤ t ≤ 1, is B^(−t)-strongly convex over the set {y ∈ R^k_+ : ||y||_{2−t} ≤ B} w.r.t. the L_{2−t}-norm.\n\nSee Appendix B for a proof. The Bregman divergence induced by F_t is then given by\n\n∆_{F_t}(y, ŷ) = ∑_{i=1}^k ( y_i log_t y_i − y_i log_t ŷ_i − (1/(2 − t)) y_i^(2−t) + (1/(2 − t)) ŷ_i^(2−t) )   (5)\n= ∑_{i=1}^k ( (1/((1 − t)(2 − t))) y_i^(2−t) − (1/(1 − t)) y_i ŷ_i^(1−t) + (1/(2 − t)) ŷ_i^(2−t) ) .\n\nThe second form may be recognized as the β-divergence [5] with parameter β = 2 − t. The divergence (5) includes many well-known divergences such as the squared Euclidean, KL, and Itakura-Saito divergences as special cases. A list of additional special cases is given in Table 3 of Appendix C.\n\nThe following corollary is a direct consequence of the strong convexity of F_t.\n\nCorollary 1. Let max(||y||_{2−t}, ||ŷ||_{2−t}) ≤ B for 0 ≤ t < 1. Then\n\n(1/(2 B^t)) ||y − ŷ||²_{2−t} ≤ ∆_{F_t}(y, ŷ) ≤ (B^t / (2 (1 − t)²)) ||y^(1−t) − ŷ^(1−t)||²_{(2−t)/(1−t)} .\n\nSee Appendix B for a proof. Thus for 0 ≤ t < 1, ∆_{F_t}(y, ŷ) is upper-bounded by 2 B^(2−t)/(1 − t)². Note that boundedness on the simplex also implies boundedness in the L_{2−t}-ball. Thus, Corollary 1 immediately implies the boundedness of the divergence ∆_{F_t}(y, ŷ) with 0 ≤ t < 1 over the simplex. Alternate parameterizations of the family {F_t} of convex functions and their corresponding Bregman divergences are discussed in Appendix C.\n\n3.1 Tempered softmax function\n\nNow, let us consider the convex function F_t(y) when its domain is restricted to the probability simplex S^k. We denote the constrained dual of F_t(y) by F̌_t*(a),\n\nF̌_t*(a) = sup_{y' ∈ S^k} ( y' · a − F_t(y') ) = sup_{y' ∈ R^k_+} inf_{λ_t ∈ R} ( y' · a − F_t(y') + λ_t (1 − ∑_{i=1}^k y'_i) ) .   (6)\n\nFollowing our discussion in Section 2.1 and using (3), the transfer function induced by F̌_t* is⁶\n\ny = exp_t(a − λ_t(a) 1), with λ_t(a) s.t. ∑_{i=1}^k exp_t(a_i − λ_t(a)) = 1.   (7)\n\n3.2 Matching loss of tempered softmax\n\nFinally, we derive the matching loss function L_{F_t}. Plugging (7) into (5), we have\n\nL_t(â | y) = ∆_{F_t}(y, exp_t(â − λ_t(â) 1)) .\n\nRecall that by Proposition 1, this loss is convex in the activations â ∈ dom(f̌*) ∩ {a' ∈ R^k | a' · 1 = 0}. In general, λ_t(a) does not have a closed form solution. However, it can be easily approximated via an iterative method, e.g., a binary search. An alternative (fixed-point) algorithm for computing λ_t(a) for t > 1 is given in Algorithm 1 of Appendix A.\n\n4 Robust Bi-Tempered Logistic Loss\n\nA more interesting class of loss functions can be obtained by introducing a “mismatch” between the temperature of the divergence function (5) and the temperature of the probability assignment function, i.e. the tempered softmax (7). 
That is, we consider loss functions of the following type:\n\n∀ 0 ≤ t1 < 1 < t2 : L^{t2}_{t1}(â | y) := ∆_{F_{t1}}(y, exp_{t2}(â − λ_{t2}(â) 1)), with λ_{t2}(â) s.t. ∑_{i=1}^k exp_{t2}(â_i − λ_{t2}(â)) = 1.   (8)\n\nWe call this the Bi-Tempered Logistic Loss. As illustrated in our two-dimensional example in Section 1, both properties are crucial for handling noisy examples. The derivative of the bi-tempered loss is given in Appendix E. In the following, we discuss the properties of this loss for classification.\n\n⁶Note that due to the simplex constraint, the link function y = f̌_t*(a) = ∇F̌_t*(a) = exp_t(a − λ_t(a) 1) is different from f_t^(−1)(a) = f_t*(a) = ∇F_t*(a) = exp_t(a), i.e., the gradient of the unconstrained dual.\n\n4.1 Properness and Monte-Carlo sampling\n\nLet PUK(x, y) denote the (unknown) joint probability distribution of the observed variable x ∈ R^m and the class label y ∈ [k]. The goal of discriminative learning is to approximate the posterior distribution of the labels PUK(y | x) via a parametric model P(y | x; Θ) parameterized by Θ. Thus the model fitting can be expressed as minimizing the following expected loss between the data and the model's label probabilities\n\nE_{PUK(x)}[ ∆(PUK(y | x), P(y | x; Θ)) ] ,   (9)\n\nwhere ∆(PUK(y | x), P(y | x; Θ)) is any divergence measure between PUK(y | x) and P(y | x; Θ). We use ∆ := ∆_{F_{t1}} as the divergence and P(i | x; Θ) := P(y = i | x; Θ) = exp_{t2}(â_i − λ_{t2}(â)), where â is the activation vector of the last layer given input x and Θ is the set of all weights of the network. Ignoring the constant terms w.r.t. 
Θ, our loss (9) becomes\n\nE_{PUK(x)}[ ∑_i ( −PUK(i | x) log_t P(i | x; Θ) + (1/(2 − t)) P(i | x; Θ)^(2−t) ) ]   (10a)\n= −E_{PUK(x,y)}[ log_t P(y | x; Θ) ] + E_{PUK(x)}[ (1/(2 − t)) ∑_i P(i | x; Θ)^(2−t) ]   (10b)\n≈ (1/N) ∑_n ( −log_t P(y_n | x_n; Θ) + (1/(2 − t)) ∑_i P(i | x_n; Θ)^(2−t) ) ,   (10c)\n\nwhere from (10b) to (10c), we perform a Monte-Carlo approximation of the expectation w.r.t. PUK(x, y) using samples {(x_n, y_n)}_{n=1}^N. Thus, (10c) is an unbiased approximation of the expected loss (9), and is therefore a proper loss [20].\n\nFollowing the same approximation steps for the Tsallis divergence used in [2], we have\n\nE_{PUK(x)}[ −∑_i PUK(i | x) log_t ( P(i | x; Θ) / PUK(i | x) ) ] ≈ −(1/N) ∑_n log_t ( P(y_n | x_n; Θ) / PUK(y_n | x_n) ) ,\n\nwhere the left-hand side is the Tsallis divergence ∆^Tsallis_t(PUK(y | x), P(y | x; Θ)). This approximation, due to the fact that log_t(a/b) ≠ log_t a − log_t b in general, requires access to the (unknown) conditional distribution PUK(y | x). In this case, the approximation −(1/N) ∑_n log_t P(y_n | x_n; Θ) proposed in [2] by setting PUK(y_n | x_n) to 1 is not an unbiased estimator of (9) and is therefore not proper.\n\n4.2 Bayes-risk consistency\n\nAnother important property of a multiclass loss is the Bayes-risk consistency [19]. Bayes-risk consistency of the two-temperature logistic loss based on the Tsallis divergence was shown in [2]. As expected, the tempered Bregman loss (8) is also Bayes-risk consistent even in the non-convex case.\n\nProposition 2. 
The multiclass bi-tempered logistic loss L^{t2}_{t1}(â | y) is Bayes-risk consistent.\n\n5 Experiments\n\nWe demonstrate the practical utility of the bi-tempered logistic loss function on a wide variety of image classification tasks. For moderate-size experiments, we use the MNIST dataset of handwritten digits [14] and CIFAR-100, which contains real-world images from 100 different classes [13]. We use ImageNet-2012 [6], which has 1000 classes, for large scale image classification. All experiments are carried out using the TensorFlow [1] framework. We use P100 GPUs for small-scale experiments and Cloud TPU-v2 for larger scale ImageNet-2012 experiments. An implementation of the bi-tempered logistic loss is available online at: https://github.com/google/bi-tempered-loss.\n\n5.1 Corrupted labels experiments\n\nFor our moderate size datasets, i.e. MNIST and CIFAR-100, we introduce noise by artificially corrupting a fraction of the labels and producing a new set of labels for each noise level. 
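The corruption step can be sketched as follows (NumPy; the paper specifies only the corrupted fraction per noise level, so flipping to a uniformly drawn different class is our assumption, and the function name is illustrative):

```python
import numpy as np

def corrupt_labels(labels, noise_level, num_classes, seed=0):
    """Return a copy of `labels` with a `noise_level` fraction flipped to a
    uniformly drawn *different* class (one plausible corruption scheme)."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels, copy=True)
    num_flip = int(noise_level * len(noisy))
    idx = rng.choice(len(noisy), size=num_flip, replace=False)
    # Adding a nonzero offset mod num_classes guarantees the label changes.
    offsets = rng.integers(1, num_classes, size=num_flip)
    noisy[idx] = (noisy[idx] + offsets) % num_classes
    return noisy
```

A separate corrupted label set would be produced for each noise level (0.1 through 0.5), while the test set stays clean.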
For all experiments, we compare our bi-tempered loss function against the logistic loss.\n\nTable 1: Top-1 accuracy on a clean test set for MNIST and CIFAR-100 datasets where a fraction of the training labels are corrupted. Columns give the label noise level.\n\nDataset | Loss | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5\nMNIST | Logistic | 99.40 | 98.96 | 98.70 | 98.50 | 97.64 | 96.13\nMNIST | Bi-Tempered (0.5, 4.0) | 99.24 | 99.13 | 99.02 | 98.62 | 98.56 | 97.69\nCIFAR-100 | Logistic | 74.03 | 69.94 | 66.39 | 63.00 | 53.17 | 52.96\nCIFAR-100 | Bi-Tempered (0.8, 1.2) | 75.30 | 73.30 | 70.69 | 67.45 | 62.55 | 57.80\n\nTable 2: Top-1 accuracy on ImageNet-2012 with Resnet-18 and 50 architectures.\n\nModel | Logistic | Bi-tempered (0.9, 1.05)\nResnet18 | 71.333 ± 0.069 | 71.618 ± 0.163\nResnet50 | 76.332 ± 0.105 | 76.748 ± 0.164\n\nFor MNIST, we use a CNN with two convolutional layers of size 32 and 64 with a mask size of 5, followed by two fully-connected layers of size 1024 and 10. We apply max-pooling after each convolutional layer with a window size equal to 2 and use dropout during training with keep probability equal to 0.75. We use the AdaDelta optimizer [21] with 500 epochs and batch size of 128 for training. For CIFAR-100, we use a Resnet-56 [10] model without batch norm from [9] with the SGD + momentum optimizer, trained for 50k steps with batch size of 128, and use the standard learning rate staircase decay schedule. For both experiments, we report the test accuracy of the checkpoint which yields the highest accuracy on an identically label-noise corrupted validation set. We search over a set of learning rates for each experiment. For both experiments, we exhaustively search over a number of temperatures within the range [0.5, 1) and (1.0, 4.0] for t1 and t2, respectively. The results are presented in Table 1 where we report the top-1 accuracy on a clean test set. 
As can be\nseen, the bi-tempered loss outperforms the logistic loss for all noise levels (including the noise-free\ncase for CIFAR-100). Using our bi-tempered loss function the model is able to continue to perform\nwell even for high levels of label noise whereas the accuracy of the logistic loss drops immediately\nwith a much smaller level of noise.\n\n5.2 Large scale experiments\n\nWe train state-of-the-art Resnet-18 and Resnet-50 models on the ImageNet-2012 dataset. Note that\nthe ImageNet-2012 dataset is inherently noisy due to some amount of mislabeling. We train on a\n4x4 CloudTPU-v2 device with a batch size of 4096. All experiments were trained for 180 epochs,\nand use the SGD + momentum optimizer with staircase learning rate decay schedule. The results are\npresented in Table 2. For both architectures we see a signi\ufb01cant gain in the top-1 accuracy using the\nrobust bi-tempered loss.\n\n6 Conclusion and Future Work\n\nNeural networks on large standard datasets have been optimized along with a large variety of variables\nsuch as architecture, transfer function, choice of optimizer, and label smoothing to name just a few.\nWe proposed a new variant by training the network with tunable loss functions. We do this by\n\ufb01rst developing convex loss functions based on temperature dependent logarithm and exponential\nfunctions. When both temperatures are the same, then a construction based on the notion of \u201cmatching\nloss\u201d leads to loss functions that are convex in the last layer. However by letting the temperature of\nthe new tempered softmax function be larger than the temperature of the tempered log function used\nin the divergence, we construct tunable losses that are non-convex in the last layer. 
Our construction remedies two issues simultaneously: the bounded tempered loss functions can handle large-margin outliers, while the heavy tail of the new tempered softmax function handles small-margin mislabeled examples. At this point, we took a number of benchmark datasets and networks that have been heavily optimized for the logistic loss paired with the vanilla softmax, and simply replaced the loss in the last layer by our new construction. By trying only a small number of temperature pairs, we already achieved significant improvements. We believe that a systematic “joint optimization” of all commonly tuned variables could yield significant further improvements; this is, of course, a more long-term goal. We also plan to explore the idea of annealing the temperature parameters over the course of training.

Acknowledgement

We would like to thank Jerome Rony for pointing out that early stopping improves the accuracy of the logistic loss in the noisy MNIST experiment. This research was partially supported by NSF grant IIS-1546459.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Ehsan Amid, Manfred K. Warmuth, and Sriram Srinivasan.
Two-temperature logistic regression based on the Tsallis divergence. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 19), 2019.

[3] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[4] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005.

[5] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[7] Nan Ding. Statistical machine learning in the t-exponential family of distributions. PhD thesis, Purdue University, 2013.

[8] Nan Ding and S. V. N. Vishwanathan. t-logistic regression. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS'10), pages 514–522, Cambridge, MA, USA, 2010.

[9] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on Learning Representations (ICLR), 2017.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[11] D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291–1304, November 1999.

[12] J. Kivinen and M. K. Warmuth.
Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3):301–329, 2001.

[13] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[14] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 1999.

[15] Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. In Proceedings of the 25th International Conference on Machine Learning, pages 608–615. ACM, 2008.

[16] Jan Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A, 316:323–334, 2002.

[17] M. D. Reid and R. C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), pages 897–904, 2009.

[18] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.

[19] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(May):1007–1025, 2007.

[20] Robert C. Williamson, Elodie Vernet, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(223):1–52, 2016.

[21] Matthew D. Zeiler. AdaDelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[22] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels.
In Advances in Neural Information Processing Systems, pages 8778–8788, 2018.