{"title": "Learning with Average Top-k Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 497, "page_last": 505, "abstract": "In this work, we introduce the average top-$k$ (\\atk) loss as a new ensemble loss for supervised learning. The \\atk loss provides a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss. Furthermore, the \\atk loss combines the advantages of them and can alleviate their corresponding drawbacks to better adapt to different data distributions. We show that the \\atk loss affords an intuitive interpretation that reduces the penalty of continuous and convex individual losses on correctly classified data. The \\atk loss can lead to convex optimization problems that can be solved effectively with conventional sub-gradient based method. We further study the Statistical Learning Theory of \\matk by establishing its classification calibration and statistical consistency of \\matk which provide useful insights on the practical choice of the parameter $k$. We demonstrate the applicability of \\matk learning combined with different individual loss functions for binary and multi-class classification and regression using synthetic and real datasets.", "full_text": "Learning with Average Top-k Loss\n\nYanbo Fan3,4,1 , Siwei Lyu1\u2217, Yiming Ying2 , Bao-Gang Hu3,4\n1Department of Computer Science, University at Albany, SUNY\n\n2Department of Mathematics and Statistics, University at Albany, SUNY\n\n3National Laboratory of Pattern Recognition, CASIA\n4University of Chinese Academy of Sciences (UCAS)\n\n{yanbo.fan,hubg}@nlpr.ia.ac.cn, slyu@albany.edu, yying@albany.edu\n\nAbstract\n\nIn this work, we introduce the average top-k (ATk) loss as a new aggregate loss\nfor supervised learning, which is the average over the k largest individual losses\nover a training dataset. 
We show that the ATk loss is a natural generalization of the two widely used aggregate losses, namely the average loss and the maximum loss, but can combine their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function of all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the ATk loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of MATk learning on the classification calibration of the ATk loss and the error bounds of ATk-SVM. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets.

1 Introduction

Supervised learning concerns the inference of a function f : X → Y that predicts a target y ∈ Y from data/features x ∈ X using a set of labeled training examples {(x_i, y_i)}_{i=1}^n. This is typically achieved by seeking a function f that minimizes an aggregate loss formed from individual losses evaluated over all training samples.
To be more specific, the individual loss for a sample (x, y) is given by ℓ(f(x), y), in which ℓ is a nonnegative bivariate function that evaluates the quality of the prediction made by function f. For example, for binary classification (i.e., y_i ∈ {±1}), commonly used forms for the individual loss include the 0-1 loss, I_{yf(x)≤0}, which is 1 when y and f(x) have different signs and 0 otherwise, the hinge loss, max(0, 1 − yf(x)), and the logistic loss, log₂(1 + exp(−yf(x))), all of which can be further simplified as the so-called margin loss, i.e., ℓ(y, f(x)) = ℓ(yf(x)).
For regression, the squared difference (y − f(x))² and the absolute difference |y − f(x)| are the two most popular forms for the individual loss, which can be simplified as ℓ(y, f(x)) = ℓ(|y − f(x)|). Usually the individual loss is chosen to be a convex function of its input, but recent works also propose various types of non-convex individual losses (e.g., [10, 15, 27, 28]).
The supervised learning problem is then formulated as min_f {L(L_z(f)) + Ω(f)}, where L(L_z(f)) is the aggregate loss that accumulates all individual losses over the training samples, i.e., L_z(f) = {ℓ_i(f)}_{i=1}^n, with ℓ_i(f) being the shorthand notation for ℓ(f(x_i), y_i), and Ω(f) is the regularizer on f. However, in contrast to the plethora of types of individual losses, there are only a few choices when we consider the aggregate loss:

∗Corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Comparison of different aggregate losses on 2D synthetic datasets with n = 200 samples for binary classification, on a balanced but multi-modal dataset with outliers (top) and an imbalanced dataset with outliers (bottom), with logistic loss (left) and hinge loss (right). Outliers in the data are shown as an enlarged × and the optimal Bayes classifications are shown as shaded areas. The figures in the second and fourth columns show the misclassification rate of ATk vs.
k for each case.

• the average loss: L_avg(L_z(f)) = (1/n) Σ_{i=1}^n ℓ_i(f), i.e., the mean of all individual losses;
• the maximum loss: L_max(L_z(f)) = max_{1≤i≤n} ℓ_i(f), i.e., the largest individual loss;
• the top-k loss [20]: L_top-k(L_z(f)) = ℓ_[k](f) for 1 ≤ k ≤ n, i.e., the k-th largest (top-k) individual loss.

The average loss is unarguably the most widely used aggregate loss, as it is an unbiased approximation to the expected risk and leads to empirical risk minimization in learning theory [1, 7, 22, 25, 26]. Further, minimizing the average loss affords simple and efficient stochastic gradient descent algorithms [3, 21]. On the other hand, the work in [20] shows that constructing the learning objective based on the maximum loss may lead to improved performance for data with separate typical and rare sub-populations. The top-k loss [20] generalizes the maximum loss, as L_max(L_z(f)) = L_top-1(L_z(f)), and can alleviate the sensitivity to outliers of the latter.
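As a concrete reference, the three aggregate losses above can be written in a few lines of NumPy. This is a sketch for illustration; the function names are ours, not from the paper:

```python
import numpy as np

def average_loss(losses):
    # mean of all individual losses
    return np.mean(losses)

def maximum_loss(losses):
    # largest individual loss
    return np.max(losses)

def top_k_loss(losses, k):
    # k-th largest individual loss, i.e., the order statistic l_[k]
    return np.sort(losses)[::-1][k - 1]
```

Note that `top_k_loss` returns a single order statistic rather than a sum, which is the source of the non-convexity issue discussed next.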
However, unlike the average loss or the maximum loss, the top-k loss in general does not lead to a convex learning objective, as it is not a convex function of the individual losses L_z(f).
In this work, we propose a new type of aggregate loss that we term the average top-k (ATk) loss, which is the average of the largest k individual losses, defined as:

L_avt-k(L_z(f)) = (1/k) Σ_{i=1}^k ℓ_[i](f). (1)

We refer to learning objectives based on minimizing the ATk loss as MATk learning.
The ATk loss generalizes the average loss (k = n) and the maximum loss (k = 1), yet it is less susceptible to their corresponding drawbacks, i.e., it is less sensitive to outliers than the maximum loss and can adapt to imbalanced and/or multi-modal data distributions better than the average loss. This is illustrated with two toy examples of synthesized 2D data for binary classification in Fig.1 (see supplementary materials for a complete illustration). As these plots show, the linear classifier obtained with the maximum loss is not optimal due to the existence of outliers, while the linear classifier corresponding to the average loss has to accommodate the requirement to minimize individual losses across all training data, and sacrifices smaller sub-clusters of data (e.g., the rare population of the + class in the top row and the smaller dataset of the − class in the bottom row). In contrast, using the ATk loss with k = 10 can better protect such smaller sub-clusters and leads to linear classifiers closer to the optimal Bayesian linear classifier. This is also corroborated by the plots of the corresponding misclassification rate of ATk vs.
k in Fig.1, which show that the minimum misclassification rates occur at k values other than 1 (maximum loss) or n (average loss).
The ATk loss is a tight upper bound of the top-k loss, as L_avt-k(L_z(f)) ≥ L_top-k(L_z(f)) with equality holding when k = 1 or when ℓ_i(f) is constant, and it is a convex function of the individual losses (see Section 2). Indeed, we can express ℓ_[k](f) as the difference of two convex functions, kL_avt-k(L_z(f)) − (k−1)L_avt-(k−1)(L_z(f)), which shows that in general L_top-k(L_z(f)) is not convex with regard to the individual losses.

²We define the top-k element of a set S = {s_1, ..., s_n} as s_[k], such that s_[1] ≥ s_[2] ≥ ... ≥ s_[n].

In the sequel, we will provide a detailed analysis of the ATk loss and MATk learning. First, we establish a reformulation of the ATk loss as the minimum of the average of the individual losses over all training examples transformed by a hinge function. This reformulation leads to a simple and effective stochastic gradient-based algorithm for MATk learning, and interprets the effect of the ATk loss as shifting down the individual loss and truncating it at zero to reduce the undesirable penalty on correctly classified data. When combined with the hinge function as the individual loss, the ATk aggregate loss leads to a new variant of the SVM algorithm that we term ATk-SVM, which generalizes the C-SVM and the ν-SVM algorithms [19].
We further study the learning theory of MATk learning, focusing on the classification calibration of the ATk loss function and error bounds of the ATk-SVM algorithm. This provides a theoretical lower bound on k for reliable classification performance. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets.
The main contributions of this work can be summarized as follows.

• We introduce the ATk loss for supervised learning, which can balance the pros and cons of the average and maximum losses, and allows the learning algorithm to better adapt to imbalanced and multi-modal data distributions.
• We provide an algorithm and an interpretation of the ATk loss, suggesting that most existing learning algorithms can take advantage of it without significant increase in computation.
• We further study the theoretical aspects of the ATk loss on classification calibration and error bounds of minimum average top-k learning for ATk-SVM.
• We perform extensive experiments to validate the effectiveness of MATk learning.

2 Formulation and Interpretation

The original ATk loss, though intuitive, is not convenient to work with because of the sorting procedure involved. This also obscures its connection with the statistical view of supervised learning as minimizing the expectation of the individual loss with regards to the underlying data distribution. Yet, it affords an equivalent form, which is based on the following result.

Lemma 1 (Lemma 1, [16]). Σ_{i=1}^k x_[i] is a convex function of (x_1, ..., x_n). Furthermore, for x_i ≥ 0 and i = 1, ..., n, we have Σ_{i=1}^k x_[i] = min_{λ≥0} { kλ + Σ_{i=1}^n [x_i − λ]_+ }, where [a]_+ = max{0, a} is the hinge function.

For completeness, we include a proof of Lemma 1 in supplementary materials.
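Lemma 1 is easy to verify numerically: the sum of the k largest entries coincides with the variational form on the right-hand side, whose minimum over λ is attained at λ = x_[k], so it suffices to scan candidate values of λ taken from the data. A small sketch (function names are ours):

```python
import numpy as np

def sum_top_k(x, k):
    # direct sum of the k largest entries of x
    return np.sort(x)[::-1][:k].sum()

def sum_top_k_variational(x, k):
    # min over lambda >= 0 of  k*lam + sum_i [x_i - lam]_+  (for x_i >= 0).
    # The minimum is attained at lam = x_[k], so candidates drawn from
    # the data values (plus 0) are sufficient to find it exactly.
    candidates = np.concatenate(([0.0], x))
    return min(k * lam + np.maximum(x - lam, 0.0).sum() for lam in candidates)
```

Both forms agree on any nonnegative input, which is exactly the identity used to derive the reformulation (2) below.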
Using Lemma 1, we can reformulate the ATk loss (1) as

L_avt-k(L_z(f)) = (1/k) Σ_{i=1}^k ℓ_[i](f) ∝ min_{λ≥0} { (1/n) Σ_{i=1}^n [ℓ_i(f) − λ]_+ + (k/n) λ }. (2)

In other words, the ATk loss is equivalent to the minimum of the average of individual losses that are shifted and truncated by the hinge function controlled by λ. This sheds more light on the ATk loss, which is particularly easy to illustrate in the context of binary classification using the margin losses, ℓ(f(x), y) = ℓ(yf(x)).
In binary classification, the "gold standard" individual loss is the 0-1 loss I_{yf(x)≤0}, which exerts a constant penalty 1 on examples that are misclassified by f and no penalty on correctly classified examples. However, the 0-1 loss is difficult to work with as it is neither continuous nor convex. In practice, it is usually replaced by a surrogate convex loss. Such convex surrogates afford efficient algorithms, but as continuous and convex upper bounds of the 0-1 loss, they typically also penalize correctly classified examples, i.e., for y and x that satisfy yf(x) > 0, ℓ(yf(x)) > 0, whereas I_{yf(x)≤0} = 0 (Fig.2). This implies that when the average of individual losses across all training examples is minimized, examples correctly classified by f that are "too close" to the classification boundary may be sacrificed to reduce the average loss, as shown in Fig.1.
In contrast, after the individual loss is combined with the hinge function, i.e., [ℓ(yf(x)) − λ]_+ with λ > 0, it has the effect of "shifting down" the original individual loss function and truncating it at zero, see Fig.2.
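The shift-and-truncate effect inside the reformulation (2) can be illustrated at the individual-loss level with the hinge loss (a sketch; function names are ours):

```python
def hinge(margin):
    # individual hinge loss on the margin m = y * f(x)
    return max(0.0, 1.0 - margin)

def shifted_loss(loss, margin, lam):
    # [loss(m) - lam]_+ : the individual loss shifted down by lam
    # and truncated at zero, as it appears inside Eq.(2)
    return max(0.0, loss(margin) - lam)
```

With λ = 0.5, a correctly classified point with margin 0.7 has hinge loss 0.3 but shifted loss 0, so it no longer pulls on the decision boundary, while a misclassified point with margin −1 still incurs a shifted loss of 1.5.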
The transformation of the individual loss reduces the penalties of all examples, and in particular benefits correctly classified data. If such examples are "far enough" from the decision boundary, their penalty becomes zero, as in the 0-1 loss. This alleviates the likelihood of misclassification on those rare sub-populations of data that are close to the decision boundary.

Figure 2: The ATk loss interpreted at the individual loss level. The shaded area corresponds to data/target with correct classification.

Algorithm: The reformulation of the ATk loss in Eq.(2) also facilitates the development of optimization algorithms for minimum ATk learning. As practical supervised learning problems usually use a parametric form of f, as f(x; w), where w is the parameter, the corresponding minimum ATk objective becomes

min_{w, λ≥0} { (1/n) Σ_{i=1}^n [ℓ(f(x_i; w), y_i) − λ]_+ + (k/n) λ + Ω(w) }. (3)

It is not hard to see that if ℓ(f(x; w), y) is convex with respect to w, the objective function in Eq.(3) is jointly convex in w and λ. This leads to an immediate stochastic (projected) gradient descent [3, 21] for solving (3). For instance, with Ω(w) = (1/2C)||w||², where C > 0 is a regularization factor, at the t-th iteration, the corresponding MATk objective can be minimized by first randomly sampling (x_it, y_it) from the training set and then updating the parameters as

w^(t+1) ← w^(t) − η_t ( ∂_w ℓ(f(x_it; w^(t)), y_it) · I_[ℓ(f(x_it; w^(t)), y_it) > λ^(t)] + w^(t)/C )
λ^(t+1) ← [ λ^(t) − η_t ( k/n − I_[ℓ(f(x_it; w^(t)), y_it) > λ^(t)] ) ]_+ (4)

where ∂_w ℓ(f(x; w), y) denotes the sub-gradient with respect to w, and η_t ∼ 1/√t is the step size.
ATk-SVM: As a general aggregate loss, the ATk loss can be combined with any functional form for individual losses. In the case of binary classification, the ATk loss combined with the individual hinge loss for a prediction function f from a reproducing kernel Hilbert space (RKHS) [18] leads to the ATk-SVM model. Specifically, we consider f as a member of an RKHS H_K with norm ||·||_K, which is induced from a reproducing kernel K : X × X → R. Using the individual hinge loss, [1 − y_i f(x_i)]_+, the corresponding MATk learning objective in the RKHS becomes

min_{f∈H_K, λ≥0} (1/n) Σ_{i=1}^n [ [1 − y_i f(x_i)]_+ − λ ]_+ + (k/n) λ + (1/2C) ||f||²_K, (5)

where C > 0 is the regularization factor. Furthermore, the outer hinge function in (5) can be removed due to the following result.
Lemma 2.
For a ≥ 0, b ≥ 0, there holds [ [a − ℓ]_+ − b ]_+ = [a − b − ℓ]_+.

Proof of Lemma 2 can be found in the supplementary materials. In addition, note that for any minimizer (f_z, λ_z) of (5), setting f(x) = 0, λ = 1 in the objective function of (5), we have (k/n) λ_z ≤ (1/n) Σ_{i=1}^n [ [1 − y_i f_z(x_i)]_+ − λ_z ]_+ + (k/n) λ_z + (1/2C) ||f_z||²_K ≤ k/n, so we have 0 ≤ λ_z ≤ 1, which means that the minimization can be restricted to 0 ≤ λ ≤ 1. Using these results and introducing ρ = 1 − λ, Eq.(5) can be rewritten as

min_{f∈H_K, 0≤ρ≤1} (1/n) Σ_{i=1}^n [ρ − y_i f(x_i)]_+ − (k/n) ρ + (1/2C) ||f||²_K. (6)

The ATk-SVM objective generalizes several existing SVM models. For example, when k = n, it equals the standard C-SVM [5]. When C = 1 and with the condition K(x_i, x_i) ≤ 1 for any i, ATk-SVM reduces to ν-SVM [19] with ν = k/n. Furthermore, similar to the conventional SVM model, the dual form of (6) leads to a convex quadratic programming problem that can be solved efficiently. See supplementary materials for more detailed explanations.
Choosing k. The number of top individual losses in the ATk loss is a critical parameter that affects the learning performance. In concept, using the ATk loss will be no worse than using the average or maximum losses, as they correspond to specific choices of k. In practice, k can be chosen during training from a validation dataset, as in the experiments in Section 4. As k is an integer, a simple grid search usually suffices to find a satisfactory value. Besides, Theorem 1 in Section 3 establishes a theoretical lower bound for k to guarantee reliable classification based on the Bayes error.
If we have information about the proportion of outliers, we can also narrow the search space of k based on the fact that the ATk loss is the convex upper bound of the top-k loss, similar to [20].

3 Statistical Analysis

In this section, we address the statistical properties of the ATk objective in the context of binary classification. Specifically, we investigate the classification calibration [1] property of the general ATk objective, and derive bounds for the misclassification error of the ATk-SVM model in the framework of statistical learning theory (e.g., [1, 7, 23, 26]).

3.1 Classification Calibration under the ATk Loss

We assume the training data z = {(x_i, y_i)}_{i=1}^n are i.i.d. samples from an unknown distribution p on X × {±1}. Let p_X be the marginal distribution of p on the input space X. Then, the misclassification error of a classifier f : X → {±1} is denoted by R(f) = Pr(y ≠ f(x)) = E[I_{yf(x)≤0}]. The Bayes error is given by R* = inf_f R(f), where the infimum is over all measurable functions. No function can achieve less risk than the Bayes rule f_c(x) = sign(η(x) − 1/2), where η(x) = Pr(y = 1|x) [8].
In practice, one uses a surrogate loss ℓ : R → [0, ∞) which is convex and upper-bounds the 0-1 loss. The population ℓ-risk (generalization error) is given by E_ℓ(f) = E[ℓ(yf(x))]. Denote the optimal ℓ-risk by E*_ℓ = inf_f E_ℓ(f). A very basic requirement for using such a surrogate loss ℓ is the so-called classification calibration (point-wise form of Fisher consistency) [1, 14].
Specifically, a loss ℓ is classification calibrated with respect to distribution p if, for any x, the minimizer f*_ℓ = arg inf_f E_ℓ(f) has the same sign as the Bayes rule f_c(x), i.e., sign(f*_ℓ(x)) = sign(f_c(x)) whenever f_c(x) ≠ 0.
An appealing result concerning the classification calibration of a loss function ℓ was obtained in [1], which states that ℓ is classification calibrated if ℓ is convex, differentiable at 0 and ℓ'(0) < 0. In the same spirit, we investigate the classification calibration property of the ATk loss. Specifically, we first obtain the population form of the ATk objective using the infinite-sample limit of (2):

(1/n) Σ_{i=1}^n [ℓ(y_i f(x_i)) − λ]_+ + (k/n) λ  →  E[ [ℓ(yf(x)) − λ]_+ ] + νλ,  as n → ∞ with k/n → ν.

We then consider the optimization problem

(f*, λ*) = arg inf_{f, λ≥0} E[ [ℓ(yf(x)) − λ]_+ ] + νλ, (7)

where the infimum is taken over all measurable functions f : X → R. We say the ATk (aggregate) loss is classification calibrated with respect to p if f* has the same sign as the Bayes rule f_c. The following theorem establishes such conditions.
Theorem 1. Suppose the individual loss ℓ : R → R_+ is convex, differentiable at 0 and ℓ'(0) < 0. Without loss of generality, assume that ℓ(0) = 1.
Let (f*, λ*) be defined as in (7).

(i) If ν > E*_ℓ, then the ATk loss is classification calibrated.
(ii) If, moreover, ℓ is monotonically decreasing and the ATk aggregate loss is classification calibrated, then ν ≥ ∫_{η(x)≠1/2} min(η(x), 1 − η(x)) dp_X(x).

The proof of Theorem 1 can be found in the supplementary materials. Parts (i) and (ii) of the above theorem address, respectively, the sufficient and necessary conditions on ν such that the ATk loss becomes classification calibrated. Since ℓ is an upper-bound surrogate of the 0-1 loss, the optimal ℓ-risk E*_ℓ is larger than the Bayes error R*, i.e., E*_ℓ ≥ R*. In particular, if the individual loss ℓ is the hinge loss, then E*_ℓ = 2R*. Part (i) of the above theorem indicates that the ATk aggregate loss is classification calibrated if ν = lim_{n→∞} k/n is larger than the optimal generalization error E*_ℓ associated with the individual loss. The choice of k > nE*_ℓ thus guarantees classification calibration, which gives a lower bound on k. This result also provides a theoretical underpinning of the sensitivity to outliers of the maximum loss (the ATk loss with k = 1). If the probability of the set {x : η(x) = 1/2} is zero, then R* = ∫_X min(η(x), 1 − η(x)) dp_X(x) = ∫_{η(x)≠1/2} min(η(x), 1 − η(x)) dp_X(x). Theorem 1 indicates that in this case, if the maximum loss is calibrated, one must have 1/n ≈ ν ≥ R*. In other words, as the number of training data increases, the Bayes error has to be arbitrarily small, which is consistent with the empirical observation that the maximum loss works well in the well-separable data setting but is sensitive to outliers and non-separable data.

3.2 Error Bounds of ATk-SVM

We next study the excess misclassification error of the ATk-SVM model, i.e., R(sign(f_z)) − R*. Let (f_z, ρ_z) be the minimizer of the ATk-SVM objective (6) in the RKHS setting. Let f_H be the minimizer of the generalization error over the RKHS H_K, i.e., f_H = argmin_{f∈H_K} E_h(f), where we use the notation E_h(f) = E[ [1 − yf(x)]_+ ] to denote the ℓ-risk of the hinge loss. In the finite-dimensional case, the existence of f_H follows from the direct method in the variational calculus, as E_h(·) is lower bounded by zero, coercive, and weakly sequentially lower semi-continuous by its convexity. For an infinite-dimensional H_K, we assume the existence of f_H. We also assume that E_h(f_H) < 1, since even a naïve zero classifier can achieve E_h(0) = 1. Denote the approximation error by A(H_K) = inf_{f∈H_K} E_h(f) − E_h(f_c) = E_h(f_H) − E_h(f_c), and let κ = sup_{x∈X} √(K(x, x)). The main theorem can be stated as follows.
Theorem 2. Consider the ATk-SVM in RKHS (6). For any ε ∈ (0, 1] and µ ∈ (0, 1 − E_h(f_H)), choose k = ⌈n(E_h(f_H) + µ)⌉.
Then, it holds that

Pr{ R(sign(f_z)) − R* ≥ µ + A(H_K) + ε + (1 + C_κ,H)/√(nµ) } ≤ 2 exp( −nµ²ε² / (1 + C_κ,H)² ),

where C_κ,H = κ(2√(2C) + 4||f_H||_K).

The complete proof of Theorem 2 is given in the supplementary materials. The main idea is to show that ρ_z is bounded from below by a positive constant with high probability, and then bound the excess misclassification error R(sign(f_z)) − R* by E_h(f_z/ρ_z) − E_h(f_c). If K is a universal kernel, then A(H_K) = 0 [23]. In this case, let µ = ε ∈ (0, 1 − E_h(f_H)); then from Theorem 2 we have

Pr{ R(sign(f_z)) − R* ≥ 2ε + (1 + C_κ,H)/√(nε) } ≤ 2 exp( −nε⁴ / (1 + C_κ,H)² ).

Consequently, choosing C such that lim_{n→∞} C/n = 0, which is equivalent to lim_{n→∞} (1 + C_κ,H)²/n = 0, R(sign(f_z)) can be arbitrarily close to the Bayes error R*, with high probability, as long as n is sufficiently large.

4 Experiments

We have demonstrated that the ATk loss provides a continuum between the average loss and the maximum loss, which can potentially alleviate their drawbacks. A natural question is whether such an advantage actually benefits practical learning problems. In this section, we demonstrate the behaviors of MATk learning coupled with different individual losses for binary classification and regression on synthetic and real datasets, with minimizing the average loss and the maximum loss treated as special cases for k = n and k = 1, respectively.
For simplicity, in all experiments, we use homogeneous linear prediction functions f(x) = wᵀx with parameters w and the Tikhonov regularizer Ω(w) = (1/2C)||w||², and optimize the MATk learning objective with the stochastic gradient descent method given in (4).
Binary Classification: We conduct experiments on binary classification using eight benchmark datasets from the UCI³ and KEEL⁴ data repositories to illustrate the potential effects of using the ATk loss in practical learning to adapt to different underlying data distributions. A detailed description of the datasets is given in the supplementary materials. The standard individual logistic loss and hinge loss are combined with different aggregate losses. Note that the average loss combined with the individual logistic loss corresponds to the logistic regression model, and the average loss combined with the individual hinge loss leads to the C-SVM algorithm [5].
For each dataset, we randomly sample 50%, 25%, 25% of the examples as training, validation and testing sets, respectively. During training, we select the parameters C (regularization factor) and k (number of top losses) on the validation set. Parameter C is searched on a grid of log10 scale in the range [10⁻⁵, 10⁵] (extended when the optimal value is on the boundary), and k is searched on a grid of log10 scale in the range [1, n].
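For concreteness, the stochastic (projected) sub-gradient updates (4) for a linear model with the individual hinge loss can be sketched as follows. This is a minimal sketch, not the exact experimental protocol; the epoch count, seed, and step-size schedule constant are illustrative assumptions:

```python
import numpy as np

def matk_sgd(X, y, k, C=1.0, epochs=50, seed=0):
    # MATk objective (3) with f(x) = w.x, individual hinge loss, and
    # Tikhonov regularizer (1/2C)||w||^2, optimized by the updates (4).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    lam = 0.0
    t = 1
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / np.sqrt(t)                       # eta_t ~ 1/sqrt(t)
            loss_i = max(0.0, 1.0 - y[i] * X[i].dot(w))  # individual hinge loss
            active = loss_i > lam                        # indicator [l_i(w) > lambda]
            grad_w = (-y[i] * X[i] if active else 0.0) + w / C
            w = w - eta * grad_w
            lam = max(0.0, lam - eta * (k / n - float(active)))  # project to lam >= 0
            t += 1
    return w, lam
```

The inner step mirrors (4): only samples whose current loss exceeds λ contribute a data term to the w update, and λ is driven toward the k-th largest loss by the (k/n − indicator) term.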
We use k* to denote the optimal k selected from the validation set.

³https://archive.ics.uci.edu/ml/datasets.html
⁴http://sci2s.ugr.es/keel/datasets.php

Table 1: Average misclassification rate (%) of different learning objectives over 8 datasets. The best results are shown in bold, with results that are not significantly different from the best results underlined.

            Logistic Loss                             Hinge Loss
            Maximum      Average      ATk*            Maximum      Average      ATk*
Monk        22.41(2.95)  20.46(2.02)  16.76(2.29)    22.04(3.08)  18.61(3.16)  17.04(2.77)
Australian  19.88(6.64)  14.27(3.22)  11.70(2.82)    19.82(6.56)  14.74(3.10)  12.51(4.03)
Madelon     47.85(2.51)  40.68(1.43)  39.65(1.72)    48.55(1.97)  40.58(1.86)  40.18(1.64)
Splice      23.57(1.93)  17.25(0.93)  16.12(0.97)    23.40(2.10)  16.25(1.12)  16.23(0.97)
Spambase    21.30(3.05)  8.36(0.97)   8.36(0.97)     21.03(3.26)  7.40(0.72)   7.40(0.72)
German      28.24(1.69)  25.36(1.27)  23.28(1.16)    27.88(1.61)  24.16(0.89)  23.80(1.05)
Titanic     26.50(3.35)  22.77(0.82)  22.44(0.84)    25.45(2.52)  22.82(0.74)  22.02(0.77)
Phoneme     28.67(0.58)  25.50(0.88)  24.17(0.89)    28.81(0.62)  22.88(1.01)  22.88(1.01)

Figure 3: Plots of misclassification rate on the testing set vs. k on four datasets (Monk, Australian, Splice, Phoneme).

We report the average performance over 10 random splittings of training/validation/testing for each dataset with MATk learning objectives formed from the individual logistic loss and hinge loss. Table 1 gives the experimental results in terms of misclassification rate (results in terms of other classification quality metrics are given in the supplementary materials). Note that on these datasets, the average loss consistently outperforms the maximum loss, but the performance can be further improved with the ATk loss, which is more adaptable to different data distributions. This advantage of the ATk loss
This advantage of the ATk loss\nis particularly conspicuous for datasets Monk and Australian.\nTo further understand the behavior of MATk learning on individual datasets, we show plots of mis-\nclassi\ufb01cation rate on testing set vs. k for four representative datasets in Fig.3 (in which C is \ufb01xed to\n102 and k \u2208 [1, n]). As these plots show, on all four datasets, there is a clear range of k value with\nbetter classi\ufb01cation performance than the two extreme cases k = 1 and k = n, corresponding to the\nmaximum and average loss, respectively. To be more speci\ufb01c, when k = 1, the potential noises and\noutliers will have the highest negative effects on the learned classi\ufb01er and the related classi\ufb01cation\nperformance is very poor. As k increases, the negative effects of noises and outliers will reduce and\nthe classi\ufb01cation performance becomes better, this is more signi\ufb01cant on dataset Monk, Australian\nand Splice. However, if k keeps increase, the classi\ufb01cation performance may decrease (e.g., when\nk = n). This may because that as k increases, more and more well classi\ufb01ed samples will be includ-\ned and the non-zero loss on these samples will have negative effects on the learned classi\ufb01er (see\nour analysis in Section 2), especi\ufb01cally for dataset Monk, Australian and Phoneme.\nRegression. Next, we report experimental results of linear regression on one synthetic dataset\n(Sinc) and three real datasets from [4], with a detailed description of these datasets given in sup-\nplementary materials. The standard square loss and absolute loss are adopted as individual losses.\nNote that average loss coupled with individual square loss is standard ridge regression model and\naverage loss coupled with individual absolute loss reduces to \u03bd-SVR [19]. 
We normalize the target output to [0, 1] and report the root mean square error (RMSE) in Table 2, with the optimal C and k* obtained by a grid search as in the classification case (performance in terms of mean absolute error (MAE) is given in the supplementary materials). Similar to the classification case, using the ATk loss usually improves performance in comparison to the average loss or the maximum loss.

           |              Square Loss                        |             Absolute Loss
           Maximum         Average         ATk*                 Maximum         Average         ATk*
Sinc       0.2790(0.0449)  0.1147(0.0060)  0.1139(0.0057)      0.1916(0.0771)  0.1188(0.0067)  0.1161(0.0060)
Housing    0.1531(0.0226)  0.1065(0.0132)  0.1050(0.0132)      0.1498(0.0125)  0.1097(0.0180)  0.1082(0.0189)
Abalone    0.1544(0.1012)  0.0800(0.0026)  0.0797(0.0026)      0.1243(0.0283)  0.0814(0.0029)  0.0811(0.0027)
Cpusmall   0.2895(0.0722)  0.1001(0.0035)  0.0998(0.0037)      0.2041(0.0933)  0.1170(0.0061)  0.1164(0.0062)

Table 2: Average RMSE on four datasets. The best results are shown in bold, with results that are not significantly different from the best underlined.

5 Related Works

Most work on learning objectives focuses on designing individual losses, and only a few works are dedicated to new forms of aggregate losses. Recently, aggregate losses that take the order of training data into account have been proposed in curriculum learning [2] and self-paced learning [11, 9], which organize the training process in several passes, with samples included gradually from easy to hard. It is interesting to note that each pass of self-paced learning [11] is equivalent to minimizing the average of
the k smallest individual losses, i.e., (1/k) ∑_{i=n−k+1}^{n} ℓ_[i](f), which we term the average bottom-k loss, in contrast to the average top-k loss in our case. In [20], the pros and cons of the maximum loss and the average loss are compared, and the top-k loss, i.e., ℓ_[k](f), is advocated as a remedy to the problems of both. However, unlike the ATk loss, in general neither the average bottom-k loss nor the top-k loss is convex with respect to the individual losses.
Minimizing top-k errors has also been used in individual losses. For ranking problems, the work of [17, 24] describes a form of individual loss that gives more weight to the top examples in a ranked list. In multi-class classification, the top-1 loss is commonly used, which incurs a penalty when the top-1 predicted class is not the same as the target class label [6]. This has been further extended in [12, 13] to the top-k multi-class loss, in which, for a class label that can take m different values, the classifier is penalized only when the correct value does not appear among the k most confident predicted values. As individual losses, these works are complementary to the ATk loss and can be combined with it to improve learning performance.

6 Discussion

In this work, we introduce the average top-k (ATk) loss as a new aggregate loss for supervised learning, which is the average over the k largest individual losses over a training dataset. We show that the ATk loss is a natural generalization of the two widely used aggregate losses, namely the average loss and the maximum loss, and that it combines their advantages and mitigates their drawbacks to better adapt to different data distributions.
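The convexity contrast drawn in Section 5 between the ATk loss and the related aggregates can be checked numerically. In this small sketch (ours; the aggregate definitions follow the text, the function names are our own), both the top-k loss ℓ_[k] and the average bottom-k loss violate the convexity inequality f((a+b)/2) ≤ (f(a)+f(b))/2 on a simple pair of loss vectors, while the ATk aggregate satisfies it:

```python
import numpy as np

def avg_top_k(l, k):     # ATk: mean of the k largest individual losses (convex in l)
    return np.sort(l)[-k:].mean()

def avg_bottom_k(l, k):  # one self-paced pass: mean of the k smallest losses
    return np.sort(l)[:k].mean()

def top_k(l, k):         # top-k loss of [20]: the k-th largest individual loss
    return np.sort(l)[-k]

a, b = np.array([2.0, 0.0]), np.array([0.0, 2.0])
mid = (a + b) / 2  # the midpoint [1.0, 1.0]

# top-k (k = 2): f(mid) = 1.0 but (f(a) + f(b)) / 2 = 0.0, so convexity fails.
print(top_k(mid, 2), (top_k(a, 2) + top_k(b, 2)) / 2)
# average bottom-k (k = 1): f(mid) = 1.0 > 0.0, convexity fails as well.
print(avg_bottom_k(mid, 1), (avg_bottom_k(a, 1) + avg_bottom_k(b, 1)) / 2)
# ATk (k = 2): f(mid) = 1.0 <= (f(a) + f(b)) / 2 = 1.0, consistent with convexity.
print(avg_top_k(mid, 2), (avg_top_k(a, 2) + avg_top_k(b, 2)) / 2)
```

This is why minimizing the ATk aggregate remains amenable to standard convex (sub-gradient) solvers, whereas the other two aggregates are not convex in the individual losses.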
We demonstrate that the ATk loss can better protect small subsets of hard samples from being swamped by a large number of easy ones, especially for imbalanced problems. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the ATk loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further study the theoretical aspects of the ATk loss, namely classification calibration and the error bounds of minimum average top-k learning for ATk-SVM. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets.
There are many interesting questions left unanswered regarding the use of the ATk loss as a learning objective. Currently, we use conventional gradient-based algorithms for its optimization, but we are investigating special instantiations of MATk learning for which more efficient optimization methods can be developed. Furthermore, the ATk loss can also be used for unsupervised learning problems (e.g., clustering), which is a focus of our subsequent study. It is also of practical importance to combine the ATk loss with other successful learning paradigms such as deep learning, and to apply it to large-scale real-life datasets. Lastly, it would be very interesting to derive error bounds of MATk with general individual loss functions.

7 Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work was completed when the first author was a visiting student at SUNY Albany, supported by a scholarship from the University of Chinese Academy of Sciences (UCAS).
Siwei Lyu is supported by the National Science Foundation (NSF, Grant IIS-1537257), and Yiming Ying is supported by the Simons Foundation (#422504) and the 2016-2017 Presidential Innovation Fund for Research and Scholarship (PIFRS) program from SUNY Albany. This work is also partially supported by the National Science Foundation of China (NSFC, Grant 61620106003) for Bao-Gang Hu and Yanbo Fan.

References
[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, 2009.
[3] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. TIST, 2(3):27, 2011.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.
[7] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5(1):59–85, 2005.
[8] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[9] Y. Fan, R. He, J. Liang, and B.-G. Hu. Self-paced learning: An implicit regularization perspective. In AAAI, pages 1877–1833, 2017.
[10] R. He, W.-S. Zheng, and B.-G. Hu. Maximum correntropy criterion for robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1561–1576, 2011.
[11] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models.
In NIPS, pages 1189–1197, 2010.
[12] M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. In NIPS, pages 325–333, 2015.
[13] M. Lapin, M. Hein, and B. Schiele. Loss functions for top-k error: Analysis and insights. In CVPR, pages 1468–1477, 2016.
[14] Y. Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
[15] H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In NIPS, pages 1049–1056, 2009.
[16] W. Ogryczak and A. Tamir. Minimizing the sum of the k largest functions in linear time. Information Processing Letters, 85(3):117–122, 2003.
[17] C. Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 10(Oct):2233–2271, 2009.
[18] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[19] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
[20] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why. In ICML, 2016.
[21] N. Srebro and A. Tewari. Stochastic optimization for machine learning. ICML Tutorial, 2010.
[22] I. Steinwart. On the optimal parameter choice for ν-support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1274–1284, 2003.
[23] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[24] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In ICML, pages 1057–1064, 2009.
[25] V. Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.
[26] Q. Wu, Y.
Ying, and D.-X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6(2):171–192, 2006.
[27] Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
[28] Y. Yu, M. Yang, L. Xu, M. White, and D. Schuurmans. Relaxed clipping: A global training method for robust regression and classification. In NIPS, pages 2532–2540, 2010.