{"title": "Online and Stochastic Gradient Methods for Non-decomposable Loss Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 694, "page_last": 702, "abstract": "Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k, F-measure etc. Compared to point loss functions such as hinge-loss, these offer much more fine grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online to batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions. We instantiate two popular losses, Prec @k and pAUC, in our model and prove sublinear regret bounds for both of them. Our proofs require a novel structural lemma over ranked lists which may be of independent interest. We then develop scalable stochastic gradient descent solvers for non-decomposable loss functions. We show that for a large family of loss functions satisfying a certain uniform convergence property (that includes Prec @k, pAUC, and F-measure), our methods provably converge to the empirical risk minimizer. Such uniform convergence results were not known for these losses and we establish these using novel proof techniques. 
We then use extensive experimentation on real-life and benchmark datasets to establish that our method can be orders of magnitude faster than a recently proposed cutting plane method.", "full_text": "Online and Stochastic Gradient Methods for Non-decomposable Loss Functions

Purushottam Kar*    Harikrishna Narasimhan†    Prateek Jain*
*Microsoft Research, INDIA
†Indian Institute of Science, Bangalore, INDIA
{t-purkar,prajain}@microsoft.com, harikrishna@csa.iisc.ernet.in

Abstract

Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k, F-measure etc. Compared to point loss functions such as hinge-loss, these offer much more fine-grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online-to-batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions. We instantiate two popular losses, Prec@k and pAUC, in our model and prove sublinear regret bounds for both of them. Our proofs require a novel structural lemma over ranked lists which may be of independent interest. We then develop scalable stochastic gradient descent solvers for non-decomposable loss functions.
We show that for a large family of loss functions satisfying a certain uniform convergence property (that includes Prec@k, pAUC, and F-measure), our methods provably converge to the empirical risk minimizer. Such uniform convergence results were not known for these losses and we establish these using novel proof techniques. We then use extensive experimentation on real-life and benchmark datasets to establish that our method can be orders of magnitude faster than a recently proposed cutting plane method.

1 Introduction

Modern learning applications frequently require a level of fine-grained control over prediction performance that is not offered by traditional "per-point" performance measures such as hinge loss. Examples include datasets with mild to severe label imbalance such as spam classification, wherein positive instances (spam emails) constitute a tiny fraction of the available data, and learning tasks such as those in medical diagnosis which make it imperative for learning algorithms to be sensitive to class imbalances. Other popular examples include ranking tasks where precision in the top ranked results is valued more than overall precision/recall characteristics. The performance measures of choice in these situations are those that evaluate algorithms over the entire dataset in a holistic manner. Consequently, these measures are frequently non-decomposable over data points. More specifically, for these measures, the loss on a set of points cannot be expressed as the sum of losses on individual data points (unlike hinge loss, for example). Popular examples of such measures include F-measure, Precision@k, (partial) area under the ROC curve etc.
Despite their success in these domains, non-decomposable loss functions are not nearly as well understood as their decomposable counterparts.
The study of point loss functions has led to a deep understanding of their behavior in batch and online settings and tight characterizations of their generalization abilities. The same cannot be said for most non-decomposable losses. For instance, in the popular online learning model, it is difficult to even instantiate a non-decomposable loss function, as defining the per-step penalty itself becomes a challenge.

1.1 Our Contributions

Our first main contribution is a framework for online learning with non-decomposable loss functions. The main hurdle in this task is a proper definition of instantaneous penalties for non-decomposable losses. Instead of resorting to canonical definitions, we set up our framework in a principled way that fulfills the objectives of an online model. Our framework has a very desirable characteristic that allows it to recover existing online learning models when instantiated with point loss functions. Our framework also admits online-to-batch conversion bounds.
We then propose an efficient Follow-the-Regularized-Leader [1] algorithm within our framework. We show that for loss functions that satisfy a generic "stability" condition, our algorithm is able to offer vanishing O(1/√T) regret. Next, we instantiate within our framework convex surrogates for two popular performance measures, namely Precision at k (Prec@k) and partial area under the ROC curve (pAUC) [2], and show, via a stability analysis, that we do indeed achieve sublinear regret bounds for these loss functions. Our stability proofs involve a structural lemma on sorted lists of inner products which proves Lipschitz continuity properties for measures on such lists (see Lemma 2) and might be useful for analyzing non-decomposable loss functions in general.
A key property of online learning methods is their applicability in designing solvers for offline/batch problems.
With this goal in mind, we design a stochastic gradient-based solver for non-decomposable loss functions. Our methods apply to a wide family of loss functions (including Prec@k, pAUC and F-measure) that were introduced in [3] and have been widely adopted [4, 5, 6] in the literature. We design several variants of our method and show that our methods provably converge to the empirical risk minimizer of the learning instance at hand. Our proofs involve uniform convergence-style results which were not known for the loss functions we study and require novel techniques, in particular the structural lemma mentioned above.
Finally, we conduct extensive experiments on real-life and benchmark datasets with pAUC and Prec@k as performance measures. We compare our methods to state-of-the-art methods that are based on cutting plane techniques [7]. The results establish that our methods can be significantly faster, all the while offering comparable or higher accuracy values. For example, on a KDD 2008 challenge dataset, our method was able to achieve a pAUC value of 64.8% within 30 ms, whereas it took the cutting plane method more than 1.2 seconds to achieve a comparable performance.

1.2 Related Work

Non-decomposable loss functions such as Prec@k, (partial) AUC and F-measure, owing to their demonstrated ability to give better performance in situations with label imbalance, have generated significant interest within the learning community. From their role in early works as indicators of performance on imbalanced datasets [8], their importance has risen to a point where they have become the learning objectives themselves. Due to their complexity, methods that try to indirectly optimize these measures are very common, e.g. [9], [10] and [11], which study the F-measure. However, such methods frequently seek to learn a complex probabilistic model, a task arguably harder than the one at hand itself.
On the other hand are algorithms that perform optimization directly via structured losses. Starting from the seminal work of [3], this approach has received a lot of interest for measures such as the F-measure [3], average precision [4], pAUC [7] and various ranking losses [5, 6]. These formulations typically use cutting plane methods to design dual solvers.
We note that the learning and game theory communities are also interested in non-additive notions of regret and utility. In particular, [12] provides a generic framework for online learning with non-additive notions of regret, with a focus on showing regret bounds for mixed strategies in a variety of problems. However, even polynomial time implementation of their strategies is difficult in general. Our focus, on the other hand, is on developing efficient online algorithms that can be used to solve large scale batch problems. Moreover, it is not clear how (if at all) the loss functions considered here (such as Prec@k) can be instantiated in their framework.
Recently, online learning for AUC maximization has received some attention [13, 14]. Although AUC is not a point loss function, it still decomposes over pairs of points in a dataset, a fact that [13] and [14] crucially use. The loss functions in this paper do not exhibit any such decomposability.

2 Problem Formulation

Let x_{1:t} := {x_1, ..., x_t}, x_i ∈ R^d and y_{1:t} := {y_1, ..., y_t}, y_i ∈ {−1, 1} be the observed data points and true binary labels. We will use ŷ_{1:t} := {ŷ_1, ..., ŷ_t}, ŷ_i ∈ R to denote the predictions of a learning algorithm. We shall, for the sake of simplicity, restrict ourselves to linear predictors ŷ_i = w^T x_i for parameter vectors w ∈ R^d. A performance measure P : {−1, 1}^t × R^t → R_+ shall be used to evaluate the predictions of the learning algorithm against the true labels.
Our focus shall be on non-decomposable performance measures such as Prec@k, partial AUC etc. Since these measures are typically non-convex, convex surrogate loss functions are used instead (we will use the terms loss function and performance measure interchangeably). A popular technique for constructing such loss functions is the structural SVM formulation [3] given below. For simplicity, we shall drop mention of the training points and use the notation ℓ_P(w) := ℓ_P(x_{1:T}, y_{1:T}, w).

    ℓ_P(w) = max_{ȳ ∈ {−1,+1}^T} ∑_{i=1}^T (ȳ_i − y_i) x_i^T w − P(ȳ, y).    (1)

Precision@k. The Prec@k measure ranks the data points in order of the predicted scores ŷ_i and then returns the number of true positives in the top ranked positions. This is valuable in situations where there are very few positives. To formalize this, for any predictor w and set of points x_{1:t}, define S(x, w) := {j : w^T x > w^T x_j} to be the set of points which w ranks above x. Then define

    T_{β,t}(x, w) = 1 if |S(x, w)| < ⌈βt⌉, and 0 otherwise,    (2)

i.e. T_{β,t}(x, w) is non-zero iff x is in the top-β fraction of the set. Then we define¹

    Prec@k(w) := ∑_{j : T_{k,t}(x_j, w) = 1} I[y_j = 1].

The structural surrogate for this measure is then calculated as²

    ℓ_Prec@k(w) = max_{ȳ ∈ {−1,+1}^t : ∑_i (ȳ_i + 1) = 2kt} ∑_{i=1}^t (ȳ_i − y_i) x_i^T w − ∑_{i=1}^t y_i ȳ_i.    (3)

Partial AUC. This measures the area under the ROC curve with the false positive rate restricted to the range [0, β]. This is in contrast to AUC, which considers the entire range [0, 1] of false positive rates. pAUC is useful in medical applications such as cancer detection where a small false positive rate is desirable.
Let us extend notation to use the indicator T⁻_{β,t}(x, w) to select the top β fraction of the negatively labeled points, i.e. T⁻_{β,t}(x, w) = 1 iff |{j : y_j < 0, w^T x > w^T x_j}| ≤ ⌈βt⁻⌉, where t⁻ is the number of negatives. Then we define

    pAUC(w) = ∑_{i : y_i > 0} ∑_{j : y_j < 0} T⁻_{β,t}(x_j, w) · I[x_i^T w ≥ x_j^T w].    (4)

Let φ : R → R_+ be any convex, monotone, Lipschitz, classification surrogate. Then we can obtain convex surrogates for pAUC(w) by replacing the indicator functions above with φ(·):

    ℓ_pAUC(w) = ∑_{i : y_i > 0} ∑_{j : y_j < 0} T⁻_{β,t}(x_j, w) · φ(x_i^T w − x_j^T w).    (5)

It can be shown [7, Theorem 4] that the structural surrogate for pAUC is equivalent to (5) with φ(c) = max(0, 1 − c), the hinge loss function. In the next section we will develop an online learning framework for non-decomposable performance measures and instantiate loss functions such as ℓ_Prec@k and ℓ_pAUC in our framework. Then in Section 4, we will develop stochastic gradient methods for non-decomposable loss functions and prove error bounds for the same. There we will focus on a much larger family of loss functions including Prec@k, pAUC and F-measure.

¹An equivalent definition considers k to be the number of top ranked points instead.
²[3] uses a slightly modified, but equivalent, definition that considers labels to be Boolean.

3 Online Learning with Non-decomposable Loss Functions

We now present our online learning framework for non-decomposable loss functions.
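Both surrogates above reduce to simple sorting computations: the maximization in (3) is attained by assigning ȳ_i = +1 to the ⌈kt⌉ largest values of x_i^T w − y_i, and (5) only needs the top-β fraction of the negative scores. A minimal NumPy sketch (the function names are ours, and we fix φ to the hinge loss as one concrete choice):

```python
import numpy as np

def prec_at_k_surrogate(X, y, w, k):
    """Structural SVM surrogate for Prec@k, eq. (3).

    The constrained max over label vectors ybar with exactly
    ceil(k * t) entries equal to +1 is attained by assigning +1 to
    the largest values of z_i = x_i^T w - y_i.
    """
    t = len(y)
    m = int(np.ceil(k * t))            # number of +1 entries in ybar
    scores = X @ w
    z = scores - y                     # gain from setting ybar_i = +1
    ybar = -np.ones(t)
    ybar[np.argsort(-z)[:m]] = 1.0     # top-m points get ybar_i = +1
    return float(np.dot(ybar - y, scores) - np.dot(y, ybar))

def pauc_surrogate(X, y, w, beta):
    """Hinge surrogate for (0, beta)-partial AUC, eq. (5)."""
    scores = X @ w
    pos, neg = scores[y > 0], scores[y < 0]
    m = int(np.ceil(beta * len(neg)))
    top_neg = np.sort(neg)[::-1][:m]   # top-beta fraction of negatives
    margins = pos[:, None] - top_neg[None, :]
    return float(np.maximum(0.0, 1.0 - margins).sum())
```

For a different φ, only the last line of pauc_surrogate changes; both functions return the unnormalized sums of (3) and (5), leaving any normalization to the caller.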
Traditional online learning takes place in several rounds, in each of which the player proposes some w_t ∈ W while the adversary responds with a penalty function L_t : W → R and a loss L_t(w_t) is incurred. The goal is to minimize the regret, i.e. ∑_{t=1}^T L_t(w_t) − min_{w ∈ W} ∑_{t=1}^T L_t(w). For point loss functions, the instantaneous penalty L_t(·) is encoded using a data point (x_t, y_t) ∈ R^d × {−1, 1} as L_t(w) = ℓ_P(x_t, y_t, w). However, for (surrogates of) non-decomposable loss functions such as ℓ_pAUC and ℓ_Prec@k, the definition of instantaneous penalty itself is not clear and remains a challenge.
To guide us in this process we turn to some properties of standard online learning frameworks. For point losses, we note that the best solution in hindsight is also the batch optimal solution. This is equivalent to the condition arg min_{w ∈ W} ∑_{t=1}^T L_t(w) = arg min_{w ∈ W} ℓ_P(x_{1:T}, y_{1:T}, w) for non-decomposable losses. Also, since the batch optimal solution is agnostic to the ordering of points, we should expect ∑_{t=1}^T L_t(w) to be invariant to permutations within the stream. By pruning away several naive definitions of L_t using these requirements, we arrive at the following definition:

    L_t(w) = ℓ_P(x_{1:t}, y_{1:t}, w) − ℓ_P(x_{1:(t−1)}, y_{1:(t−1)}, w).    (6)

It turns out that the above is a very natural penalty function as it measures the amount of "extra" penalty incurred due to the inclusion of x_t into the set of points. It can be readily verified that ∑_{t=1}^T L_t(w) = ℓ_P(x_{1:T}, y_{1:T}, w), as required. Also, this penalty function seamlessly generalizes online learning frameworks since for point losses, we have ℓ_P(x_{1:t}, y_{1:t}, w) = ∑_{i=1}^t ℓ_P(x_i, y_i, w) and thus L_t(w) = ℓ_P(x_t, y_t, w). We note that our framework also recovers the model for online AUC maximization used in [13] and [14].
The notion of regret corresponding to this penalty is

    R(T) = (1/T) ∑_{t=1}^T L_t(w_t) − min_{w ∈ W} (1/T) ℓ_P(x_{1:T}, y_{1:T}, w).

We note that L_t, being the difference of two loss functions, is non-convex in general and thus standard online convex programming regret bounds cannot be applied in our framework. Interestingly, as we show below, by exploiting structural properties of our penalty function, we can still get efficient low-regret learning algorithms, as well as online-to-batch conversion bounds in our framework.

3.1 Low Regret Online Learning

We propose an efficient Follow-the-Regularized-Leader (FTRL) style algorithm in our framework. Let w_1 = arg min_{w ∈ W} ‖w‖_2^2 and consider the following update:

    w_{t+1} = arg min_{w ∈ W} ∑_{τ=1}^t L_τ(w) + (η/2) ‖w‖_2^2 = arg min_{w ∈ W} ℓ_P(x_{1:t}, y_{1:t}, w) + (η/2) ‖w‖_2^2.    (FTRL)

We would like to stress that despite the non-convexity of L_t, the FTRL objective is strongly convex if ℓ_P is convex and thus the update can be implemented efficiently by solving a regularized batch problem on x_{1:t}. We now present our regret bound analysis for the FTRL update given above.
Theorem 1. Let ℓ_P(·, w) be a convex loss function and W ⊆ R^d be a convex set. Assume w.l.o.g. ‖x_t‖_2 ≤ 1 ∀t. Also, for the penalty function L_t in (6), let |L_t(w) − L_t(w′)| ≤ G_t · ‖w − w′‖_2 for all t and all w, w′ ∈ W, for some G_t > 0. Suppose we use the update step given in (FTRL) to obtain w_{t+1}, 0 ≤ t ≤ T − 1. Then for all w*, we have

    (1/T) ∑_{t=1}^T L_t(w_t) ≤ (1/T) ℓ_P(x_{1:T}, y_{1:T}, w*) + ‖w*‖_2 √(2 ∑_{t=1}^T G_t^2) / T.

See Appendix A for a proof.
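Since the (FTRL) objective is strongly convex whenever ℓ_P is convex, each update can be computed by any regularized batch solver. As an illustration, a projected subgradient solver in NumPy; the gradient callback, iteration budget and Euclidean-ball choice of W are our assumptions, not prescribed by the paper:

```python
import numpy as np

def ftrl_step(loss_grad, X, y, eta, w_dim, radius=1.0, iters=200, lr=0.1):
    """One FTRL update: approximately solve
    argmin_{||w||_2 <= radius}  l_P(x_{1:t}, y_{1:t}, w) + (eta/2) ||w||^2.

    `loss_grad(X, y, w)` returns a (sub)gradient of the convex surrogate
    on the prefix seen so far; plain projected subgradient descent with
    decaying step sizes stands in for any batch solver.
    """
    w = np.zeros(w_dim)
    for it in range(1, iters + 1):
        g = loss_grad(X, y, w) + eta * w   # gradient of regularized objective
        w -= (lr / it) * g                 # decaying step size
        norm = np.linalg.norm(w)
        if norm > radius:                  # project back onto the ball W
            w *= radius / norm
    return w
```

For ℓ_pAUC or ℓ_Prec@k, loss_grad can be obtained via Danskin's theorem (the gradient of a max of linear functions is the gradient of the active term).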
The above result requires the penalty function L_t to be Lipschitz continuous, i.e. be "stable" w.r.t. w. Establishing this for point losses such as hinge loss is relatively straightforward. However, the same becomes non-trivial for non-decomposable loss functions, as L_t is now the difference of two loss functions, both of which involve Ω(t) data points. A naive argument would thus only be able to show G_t ≤ O(t), which would yield vacuous regret bounds. Instead, we now show that for the surrogate loss functions for Prec@k and pAUC, this Lipschitz continuity property does indeed hold. Our proofs crucially use a structural lemma given below that shows that sorted lists of inner products are Lipschitz at each fixed position.
Lemma 2 (Structural Lemma). Let x_1, ..., x_t be t points with ‖x_i‖_2 ≤ 1 ∀i. Let w, w′ ∈ W be any two vectors. Let z_i = ⟨w, x_i⟩ − c_i and z′_i = ⟨w′, x_i⟩ − c_i, where c_i ∈ R are constants independent of w, w′. Also, let {i_1, ..., i_t} and {j_1, ..., j_t} be orderings of indices such that z_{i_1} ≥ z_{i_2} ≥ ··· ≥ z_{i_t} and z′_{j_1} ≥ z′_{j_2} ≥ ··· ≥ z′_{j_t}. Then for any 1-Lipschitz increasing function g : R → R (i.e. |g(u) − g(v)| ≤ |u − v| and u ≤ v ⇔ g(u) ≤ g(v)), we have, ∀k, |g(z_{i_k}) − g(z′_{j_k})| ≤ 3‖w − w′‖_2.
See Appendix B for a proof. Using this lemma we can show that the Lipschitz constant for ℓ_Prec@k is bounded by G_t ≤ 8, which gives us a O(√(1/T)) regret bound for Prec@k (see Appendix C for the proof). In Appendix D, we show that the same technique can be used to prove a stability result for the structural SVM surrogate of the Precision-Recall Break-Even Point (PRBEP) performance measure [3] as well. The case of pAUC is handled similarly.
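The heart of Lemma 2 is easy to check numerically: for g the identity, the k-th largest of the z_i can move by at most max_i |⟨w − w′, x_i⟩| ≤ ‖w − w′‖_2 when ‖x_i‖_2 ≤ 1, comfortably inside the stated 3‖w − w′‖_2 bound. A small sanity check of ours (not the paper's proof):

```python
import numpy as np

def position_shift(X, c, w, w2):
    """max_k |z_(k) - z'_(k)|, where z_i = <w, x_i> - c_i and z_(k)
    denotes the k-th largest entry of the list."""
    z = np.sort(X @ w - c)[::-1]
    z2 = np.sort(X @ w2 - c)[::-1]
    return float(np.max(np.abs(z - z2)))
```

The check relies on the standard fact that sorting is 1-Lipschitz with respect to the sup-norm, so the value of position_shift is bounded by ‖w − w′‖_2 whenever every ‖x_i‖_2 ≤ 1.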
However, since pAUC discriminates between positives and negatives, our previous analysis cannot be applied directly. Nevertheless, we can obtain the following regret bound for pAUC (a proof will appear in the full version of the paper).
Theorem 3. Let T_+ and T_− resp. be the number of positive and negative points in the stream and let w_{t+1}, 0 ≤ t ≤ T − 1 be obtained using the FTRL algorithm (FTRL). Then we have

    (1/(β T_+ T_−)) ∑_{t=1}^T L_t(w_t) ≤ min_{w ∈ W} (1/(β T_+ T_−)) ℓ_pAUC(x_{1:T}, y_{1:T}, w) + O(√(1/T_+ + 1/T_−)).

Notice that the above regret bound depends on both T_+ and T_− and the regret becomes large even if one of them is small. This is actually quite intuitive because if, say, T_+ = 1 and T_− = T − 1, an adversary may wish to provide the lone positive point in the last round. Naturally the algorithm, having only seen negatives till now, would not be able to perform well and would incur a large error.

3.2 Online-to-batch Conversion

To present our bounds we generalize our framework slightly: we now consider the stream of T points to be composed of T/s batches Z_1, ..., Z_{T/s} of size s each. Thus, the instantaneous penalty is now defined as L_t(w) = ℓ_P(Z_1, ..., Z_t, w) − ℓ_P(Z_1, ..., Z_{t−1}, w) for t = 1 ... T/s, and the regret becomes R(T, s) = (1/T) ∑_{t=1}^{T/s} L_t(w_t) − min_{w ∈ W} (1/T) ℓ_P(x_{1:T}, y_{1:T}, w). Let R_P denote the population risk for the (normalized) performance measure P. Then we have:
Theorem 4. Suppose the sequence of points (x_t, y_t) is generated i.i.d. and let w_1, w_2, ..., w_{T/s} be an ensemble of models generated by an online learning algorithm upon receiving these T/s batches.
Suppose the online learning algorithm has a guaranteed regret bound R(T, s). Then for w̄ = (s/T) ∑_{t=1}^{T/s} w_t, any w* ∈ W, ε ∈ (0, 0.5] and δ > 0, with probability at least 1 − δ,

    R_P(w̄) ≤ (1 + ε) R_P(w*) + R(T, s) + e^{−Ω(sε²)} + Õ(√(s ln(1/δ) / T)).

In particular, setting s = Õ(√T) and ε = (1/T)^{1/4} gives us, with probability at least 1 − δ,

    R_P(w̄) ≤ R_P(w*) + R(T, √T) + Õ(√(ln(1/δ) / T)).

We conclude by noting that for Prec@k and pAUC, R(T, √T) ≤ O((1/T)^{1/4}) (see Appendix E).

4 Stochastic Gradient Methods for Non-decomposable Losses

The online learning algorithms discussed in the previous section present attractive guarantees in the sequential prediction model but are required to solve batch problems at each stage. This rapidly

Algorithm 1 1PMB: Single-Pass with Mini-batches
Input: step length scale η, buffer B of size s
Output: a good predictor w ∈ W
1: w_0 ← 0, B ← ∅, e ← 0
2: while stream not exhausted do
3:   collect s data points (x^e_1, y^e_1), ..., (x^e_s, y^e_s) in buffer B    // start a new epoch
4:   set step length η_e ← η/√e
5:   w_{e+1} ← Π_W [w_e + η_e ∇_w ℓ_P(x^e_{1:s}, y^e_{1:s}, w_e)]    // Π_W projects onto the set W
6:   flush buffer B
7:   e ← e + 1
8: end while
9: return w̄ = (1/e) ∑_{i=1}^e w_i

Algorithm 2 2PMB: Two-Passes with Mini-batches
Input: step length scale η, buffers B_+, B_− of size s
Output: a good predictor w ∈ W
Pass 1: B_+ ← ∅
1: collect a random sample of positive points x^+_1, ..., x^+_s in B_+
Pass 2: w_0 ← 0, B_− ← ∅, e ← 0
2: while stream of negative points not exhausted do
3:   collect s negative points x^{e−}_1, ..., x^{e−}_s in buffer B_−    // start a new epoch
4:   set step length η_e ← η/√e
5:   w_{e+1} ← Π_W [w_e + η_e ∇_w ℓ_P(x^{e−}_{1:s}, x^+_{1:s}, w_e)]
6:   flush buffer B_−
7:   e ← e + 1
8: end while
9: return w̄ = (1/e) ∑_{i=1}^e w_i
(Algorithm 1 concludes with the same last four steps, applied to its buffer B.)

becomes infeasible for large scale data. To remedy this, we now present memory-efficient stochastic gradient descent methods for batch learning with non-decomposable loss functions. The motivation for our approach comes from mini-batch methods used to make learning methods for point loss functions amenable to distributed computing environments [15, 16]; we exploit these techniques to offer scalable algorithms for non-decomposable loss functions.
Single-pass Method with Mini-batches. The method assumes access to a limited memory buffer and takes a single pass over the data stream. The stream is partitioned into epochs. In each epoch, the method accumulates points in the stream, uses them to form gradient estimates and takes descent steps. The buffer is flushed after each epoch. Algorithm 1 describes the 1PMB method. Gradient computations can be done using Danskin's theorem (see Appendix H).
Two-pass Method with Mini-batches. The previous algorithm is unable to exploit relationships between data points across epochs, which may help improve performance for loss functions such as pAUC. To remedy this, we observe that several real-life learning scenarios exhibit mild to severe label imbalance (see Table 2 in Appendix H), which makes it possible to store all or a large fraction of the points of the rare label.
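Before turning to the two-pass variant, the 1PMB template can be rendered compactly in code. A sketch under our own conventions (gradient callback and Euclidean-ball W are assumptions; Algorithm 1 writes the update with a '+', and we write the equivalent descent step on a loss):

```python
import numpy as np

def one_pass_mini_batch(stream, grad_fn, dim, eta=1.0, s=500, radius=10.0):
    """A compact rendering of the 1PMB method (Algorithm 1).

    `stream` yields (x, y) pairs; `grad_fn(X, y, w)` returns a
    (sub)gradient of the surrogate loss l_P on the buffered mini-batch.
    """
    w = np.zeros(dim)
    iterates, xs, ys, e = [], [], [], 0
    for x, y in stream:
        xs.append(x); ys.append(y)
        if len(xs) == s:                  # buffer full: epoch complete
            e += 1
            step = eta / np.sqrt(e)       # decaying step length
            w = w - step * grad_fn(np.array(xs), np.array(ys), w)
            nrm = np.linalg.norm(w)
            if nrm > radius:              # projection onto W
                w *= radius / nrm
            iterates.append(w.copy())
            xs, ys = [], []               # flush the buffer
    return np.mean(iterates, axis=0) if iterates else w
```

The averaged iterate mirrors the "return w̄ = (1/e) ∑ w_i" step; plugging in a Danskin-style gradient of ℓ_Prec@k or ℓ_pAUC recovers the method analyzed in Section 4.1.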
Our two pass method exploits this by utilizing two passes over the data: the first pass collects all (or a random subset of) the points of the rare label using some stream sampling technique [13]. The second pass then goes over the stream, restricted to the non-rare label points, and performs gradient updates. See Algorithm 2 for details of the 2PMB method.

4.1 Error Bounds

Given a set of n labeled data points (x_i, y_i), i = 1 ... n and a performance measure P, our goal is to approximate the empirical risk minimizer w* = arg min_{w ∈ W} ℓ_P(x_{1:n}, y_{1:n}, w) as closely as possible. In this section we shall show that our methods 1PMB and 2PMB provably converge to the empirical risk minimizer. We first introduce the notion of uniform convergence for a performance measure.
Definition 5. We say that a loss function ℓ demonstrates uniform convergence with respect to a set of predictors W if for some α(s, δ) = poly(1/s, log(1/δ)), when given a set of s points x̄_1, ..., x̄_s chosen randomly from an arbitrary set of n points {(x_1, y_1), ..., (x_n, y_n)}, then w.p. at least 1 − δ, we have

    sup_{w ∈ W} |ℓ_P(x_{1:n}, y_{1:n}, w) − ℓ_P(x̄_{1:s}, ȳ_{1:s}, w)| ≤ α(s, δ).

Such uniform convergence results are fairly common for decomposable loss functions such as the squared loss, logistic loss etc. However, the same is not true for non-decomposable loss functions, barring a few exceptions [17, 10]. To bridge this gap, below we show that a large family of surrogate loss functions for popular non-decomposable performance measures does indeed exhibit uniform convergence. Our proofs require novel techniques and do not follow from traditional proof progressions. However, we first show how we can use these results to arrive at an error bound.
Theorem 6.
Suppose the loss function ℓ is convex and demonstrates α(s, δ)-uniform convergence. Also suppose we have an arbitrary set of n points which are randomly ordered; then the predictor w̄ returned by 1PMB with buffer size s satisfies, w.p. 1 − δ,

    ℓ_P(x_{1:n}, y_{1:n}, w̄) ≤ ℓ_P(x_{1:n}, y_{1:n}, w*) + 2α(s, sδ/n) + O(√(s/n)).

We would like to stress that the above result does not assume i.i.d. data and works for arbitrary datasets so long as they are randomly ordered. We can show similar guarantees for the two pass method as well (see Appendix F). Using regularized formulations, we can also exploit logarithmic regret guarantees [18], offered by online gradient descent, to improve this result; however, we do not explore those considerations here. Instead, we now look at specific instances of loss functions that possess the desired uniform convergence properties. As mentioned before, due to the combinatorial nature of these performance measures, our proofs do not follow from traditional methods.

Figure 1: Comparison of stochastic gradient methods with the cutting plane (CP) and projected subgradient (PSG) methods on partial AUC maximization tasks on (a) PPI, (b) KDDCup08, (c) IJCNN and (d) Letter. The epoch lengths/buffer sizes for 1PMB and 2PMB were set to 500.
Figure 2: Comparison of stochastic gradient methods with the cutting plane (CP) method on Prec@k maximization tasks on (a) PPI, (b) KDDCup08, (c) IJCNN and (d) Letter. The epoch lengths/buffer sizes for 1PMB and 2PMB were set to 500.

Theorem 7 (Partial Area under the ROC Curve).
For any convex, monotone, Lipschitz, classification surrogate φ : R → R_+, the surrogate loss function for the (0, β)-partial AUC performance measure defined as follows exhibits uniform convergence at the rate α(s, δ) = O(√(log(1/δ)/s)):

    (1/(⌈βn_−⌉ n_+)) ∑_{i : y_i > 0} ∑_{j : y_j < 0} T⁻_{β,t}(x_j, w) · φ(x_i^T w − x_j^T w).

See Appendix G for a proof sketch. This result covers a large family of surrogate loss functions such as hinge loss (5), logistic loss etc. Note that the insistence on including only top ranked negative points introduces a high degree of non-decomposability into the loss function. A similar result for the special case β = 1 is due to [17]. We extend the same to the more challenging case of β < 1.
Theorem 8 (Structural SVM loss for Prec@k). The structural SVM surrogate for the Prec@k performance measure (see (3)) exhibits uniform convergence at the rate α(s, δ) = O(√(log(1/δ)/s)).
We defer the proof to the full version of the paper. The above result can be extended to a large family of performance measures introduced in [3] that have been widely adopted [10, 19, 8], such as F-measure, G-mean, and PRBEP. The above indicates that our methods are expected to output models that closely approach the empirical risk minimizer for a wide variety of performance measures.
In the next section we verify that this is indeed the case on several real-life and benchmark datasets.

5 Experimental Results

We evaluate the proposed stochastic gradient methods on several real-world and benchmark datasets.

Table 1: Comparison of training time (secs) and accuracies (in brackets) of 1PMB, 2PMB and cutting plane methods for pAUC (in [0, 0.1]) and Prec@k maximization tasks on the KDD Cup 2008 dataset.

Measure  | 1PMB         | 2PMB         | CP
pAUC     | 0.10 (68.2)  | 0.15 (69.6)  | 0.39 (62.5)
Prec@k   | 0.49 (42.7)  | 0.55 (38.7)  | 23.25 (40.8)

Figure 3: Performance of 1PMB and 2PMB on the PPI dataset with varying epoch/buffer sizes for pAUC tasks.

Performance measures: We consider three measures: 1) partial AUC in the false positive range [0, 0.1], 2) Prec@k with k set to the proportion of positives (PRBEP), and 3) F-measure.
Algorithms: For partial AUC, we compare against the state-of-the-art cutting plane (CP) and projected subgradient (PSG) methods proposed in [7]; unlike the (online) stochastic methods considered in this work, the PSG method is a 'batch' algorithm which, at each iteration, computes a subgradient-based update over the entire training set. For Prec@k and F-measure, we compare our methods against cutting plane methods from [3].
We used structural SVM surrogates for all the measures.
Datasets: We used several datasets for our experiments (see Table 2 in Appendix H); of these, KDDCup08 is from the KDD Cup 2008 challenge and involves a breast cancer detection task [20], PPI contains data for a protein-protein interaction prediction task [21], and the remaining datasets are taken from the UCI repository [22].
Parameters: We used 70% of the dataset for training and the remaining for testing, with the results averaged over 5 random train-test splits. Tunable parameters such as the step length scale were chosen using a small validation set. The epoch lengths/buffer sizes were set to 500 in all experiments. Since a single iteration of the proposed stochastic methods is very fast in practice, we performed multiple passes over the training data (see Appendix H for details).
The results for the pAUC and Prec@k maximization tasks are shown in Figures 1 and 2. We found the proposed stochastic gradient methods to be several orders of magnitude faster than the baseline methods, all the while achieving comparable or better accuracies. For example, for the pAUC task on the KDD Cup 2008 dataset, the 1PMB method achieved an accuracy of 64.81% within 0.03 seconds, while even after 0.39 seconds, the cutting plane method could only achieve an accuracy of 62.52% (see Table 1). As expected, the (online) stochastic gradient methods were also faster than the 'batch' projected subgradient descent method for pAUC. We found similar trends on Prec@k (see Figure 2) and F-measure maximization tasks as well.
For F-measure tasks on the KDD Cup 2008 dataset, for example, the 1PMB method achieved an accuracy of 35.92 within 12 seconds whereas, even after 150 seconds, the cutting plane method could only achieve an accuracy of 35.25.
The proposed stochastic methods were also found to be robust to changes in epoch lengths (buffer sizes) up to the point where excessively long epochs would cause the number of updates, as well as accuracy, to dip (see Figure 3). The 2PMB method was found to offer higher accuracies for pAUC maximization on several datasets (see Table 1 and Figure 1), and to be more robust to changes in buffer size (Figure 3). We defer results on more datasets and performance measures to the full version of the paper.
The cutting plane methods were generally found to exhibit a zig-zag behaviour in performance across iterates. This is because these methods solve the dual optimization problem for a given performance measure; hence the intermediate models do not necessarily yield good accuracies. On the other hand, (stochastic) gradient based methods directly make progress on the primal optimization problem, and hence provide good intermediate solutions as well. This can be advantageous in scenarios with a time budget on the training phase.

Acknowledgements

The authors thank Shivani Agarwal and Bharath K. Sriperumbudur for their comments and helpful discussions. They thank András György for pointing out an omission in the proof of Theorem 1. They also thank the anonymous reviewers for their suggestions. HN acknowledges support from a Google India PhD Fellowship.

References

[1] Alexander Rakhlin. Lecture Notes on Online Learning. http://www-stat.wharton.upenn.edu/~rakhlin/papers/online_learning.pdf, 2009.
[2] Harikrishna Narasimhan and Shivani Agarwal.
A Structural SVM Based Approach for Optimizing Partial AUC. In ICML, 2013.
[3] Thorsten Joachims. A Support Vector Method for Multivariate Performance Measures. In ICML, 2005.
[4] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A Support Vector Method for Optimizing Average Precision. In SIGIR, 2007.
[5] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured Learning for Non-Smooth Ranking Losses. In KDD, 2008.
[6] Brian McFee and Gert Lanckriet. Metric Learning to Rank. In ICML, 2010.
[7] Harikrishna Narasimhan and Shivani Agarwal. SVMpAUCtight: A New Support Vector Method for Optimizing Partial AUC Based on a Tight Convex Upper Bound. In KDD, 2013.
[8] Miroslav Kubat and Stan Matwin. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In ICML, 1997.
[9] Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. An Exact Algorithm for F-Measure Maximization. In NIPS, 2011.
[10] Nan Ye, Kian Ming A. Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-Measures: A Tale of Two Approaches. In ICML, 2012.
[11] Krzysztof Dembczyński, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. Optimizing the F-Measure in Multi-Label Classification: Plug-in Rule Approach versus Structured Loss Minimization. In ICML, 2013.
[12] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online Learning: Beyond Regret. In COLT, 2011.
[13] Purushottam Kar, Bharath K. Sriperumbudur, Prateek Jain, and Harish Karnick. On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions.
In ICML, 2013.
[14] Peilin Zhao, Steven C. H. Hoi, Rong Jin, and Tianbao Yang. Online AUC Maximization. In ICML, 2011.
[15] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction Using Mini-Batches. Journal of Machine Learning Research, 13:165–202, 2012.
[16] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-Efficient Algorithms for Statistical Optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
[17] Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36:844–874, 2008.
[18] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic Regret Algorithms for Online Convex Optimization. In COLT, pages 499–513, 2006.
[19] Sophia Daskalaki, Ioannis Kopanas, and Nikolaos Avouris. Evaluation of Classifiers for an Uneven Class Distribution Problem. Applied Artificial Intelligence, 20:381–417, 2006.
[20] R. Bharath Rao, Oksana Yakhnenko, and Balaji Krishnapuram. KDD Cup 2008 and the Workshop on Mining Medical Data. SIGKDD Explorations Newsletter, 10(2):34–38, 2008.
[21] Yanjun Qi, Ziv Bar-Joseph, and Judith Klein-Seetharaman. Evaluation of Different Biological Data and Computational Classification Methods for Use in Protein Interaction Prediction. Proteins, 63:490–500, 2006.
[22] A. Frank and Arthur Asuncion. The UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2010. University of California, Irvine, School of Information and Computer Sciences.
[23] Ankan Saha, Prateek Jain, and Ambuj Tewari. The Interplay between Stability and Regret in Online Learning. CoRR, abs/1211.6158, 2012.
[24] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML, pages 928–936, 2003.
[25] Robert J. Serfling.
Probability Inequalities for the Sum in Sampling without Replacement. Annals of Statistics, 2(1):39–48, 1974.
[26] Dimitri P. Bertsekas. Nonlinear Programming: 2nd Edition. Belmont, MA: Athena Scientific, 2004.