{"title": "Online Gradient Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 2458, "page_last": 2466, "abstract": "We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm which converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.", "full_text": "Online Gradient Boosting\n\nAlina Beygelzimer\n\nYahoo Labs\n\nNew York, NY 10036\n\nElad Hazan\n\nPrinceton University\nPrinceton, NJ 08540\n\nSatyen Kale\nYahoo Labs\n\nNew York, NY 10036\n\nbeygel@yahoo-inc.com\n\nehazan@cs.princeton.edu\n\nsatyen@yahoo-inc.com\n\nHaipeng Luo\n\nPrinceton University\nPrinceton, NJ 08540\n\nhaipengl@cs.princeton.edu\n\nAbstract\n\nWe extend the theory of boosting for regression problems to the online\nlearning setting. Generalizing from the batch setting for boosting, the no-\ntion of a weak learning algorithm is modeled as an online learning algorithm\nwith linear loss functions that competes with a base class of regression func-\ntions, while a strong learning algorithm is an online learning algorithm with\nsmooth convex loss functions that competes with a larger class of regres-\nsion functions. 
Our main result is an online gradient boosting algorithm\nthat converts a weak online learning algorithm into a strong one where the\nlarger class of functions is the linear span of the base class. We also give a\nsimpler boosting algorithm that converts a weak online learning algorithm\ninto a strong one where the larger class of functions is the convex hull of\nthe base class, and prove its optimality.\n\n1\n\nIntroduction\n\nBoosting algorithms [21] are ensemble methods that convert a learning algorithm for a base\nclass of models with weak predictive power, such as decision trees, into a learning algorithm\nfor a class of models with stronger predictive power, such as a weighted majority vote over\nbase models in the case of classi\ufb01cation, or a linear combination of base models in the case\nof regression.\n\nBoosting methods such as AdaBoost [9] and Gradient Boosting [10] have found tremendous\npractical application, especially using decision trees as the base class of models. These\nalgorithms were developed in the batch setting, where training is done over a \ufb01xed batch of\nsample data. However, with the recent explosion of huge data sets which do not \ufb01t in main\nmemory, training in the batch setting is infeasible, and online learning techniques which\ntrain a model in one pass over the data have proven extremely useful.\n\nA natural goal therefore is to extend boosting algorithms to the online learning setting.\nIndeed, there has already been some work on online boosting for classi\ufb01cation problems [20,\n11, 17, 12, 4, 5, 2]. Of these, the work by Chen et al. [4] provided the \ufb01rst theoretical study\nof online boosting for classi\ufb01cation, which was later generalized by Beygelzimer et al. 
[2] to obtain optimal and adaptive online boosting algorithms.\n\nHowever, extending boosting algorithms for regression to the online setting has been elusive and has escaped theoretical guarantees thus far. In this paper, we rigorously formalize the setting of online boosting for regression and then extend the very commonly used gradient boosting methods [10, 19] to the online setting, providing theoretical guarantees on their performance.\n\nThe main result of this paper is an online boosting algorithm that competes with any linear combination of the base functions, given an online linear learning algorithm over the base class. This algorithm is the online analogue of the batch boosting algorithm of Zhang and Yu [24], and in fact our algorithmic technique, when specialized to the batch boosting setting, provides exponentially better convergence guarantees.\n\nWe also give an online boosting algorithm that competes with the best convex combination of base functions. This is a simpler algorithm which is analyzed along the lines of the Frank-Wolfe algorithm [8]. While the algorithm has weaker theoretical guarantees, it can still be useful in practice. We also prove that this algorithm obtains the optimal regret bound (up to constant factors) for this setting.\n\nFinally, we conduct some proof-of-concept experiments which show that our online boosting algorithms do obtain performance improvements over different classes of base learners.\n\n1.1 Related Work\n\nWhile the theory of boosting for classification in the batch setting is well-developed (see [21]), the theory of boosting for regression is comparatively sparse. The foundational theory of boosting for regression can be found in the statistics literature [14, 13], where boosting is understood as a greedy stagewise algorithm for fitting of additive models. 
The goal is to achieve the performance of linear combinations of base models, and to prove convergence to the performance of the best such linear combination.\n\nWhile the earliest works on boosting for regression such as [10] do not have such convergence proofs, later works such as [19, 6] do have convergence proofs but without a bound on the speed of convergence. Bounds on the speed of convergence have been obtained by Duffy and Helmbold [7] relying on a somewhat strong assumption on the performance of the base learning algorithm. A different approach to boosting for regression was taken by Freund and Schapire [9], who give an algorithm that reduces the regression problem to classification and then applies AdaBoost; the corresponding proof of convergence relies on an assumption on the induced classification problem which may be hard to satisfy in practice. The strongest result is that of Zhang and Yu [24], who prove convergence to the performance of the best linear combination of base functions, along with a bound on the rate of convergence, making essentially no assumptions on the performance of the base learning algorithm. Telgarsky [22] proves similar results for logistic (or similar) loss using a slightly simpler boosting algorithm.\n\nThe results in this paper are a generalization of the results of Zhang and Yu [24] to the online setting. However, we emphasize that this generalization is nontrivial and requires different algorithmic ideas and proof techniques. Indeed, we were not able to directly generalize the analysis in [24] by simply adapting the techniques used in recent online boosting work [4, 2], but we made use of the classical Frank-Wolfe algorithm [8]. 
On the other hand, while an important part of the convergence analysis for the batch setting is to show statistical consistency of the algorithms [24, 1, 22], in the online setting we only need to study the empirical convergence (that is, the regret), which makes our analysis much more concise.\n\n2 Setup\n\nExamples are chosen from a feature space X, and the prediction space is R^d. Let ‖·‖ denote some norm in R^d. In the setting for online regression, in each round t for t = 1, 2, . . . , T, an adversary selects an example x_t ∈ X and a loss function ℓ_t : R^d → R, and presents x_t to the online learner. The online learner outputs a prediction y_t ∈ R^d, obtains the loss function ℓ_t, and incurs loss ℓ_t(y_t).\nLet F denote a reference class of regression functions f : X → R^d, and let C denote a class of loss functions ℓ : R^d → R. Also, let R : N → R_+ be a non-decreasing function. We say that the function class F is online learnable for losses in C with regret R if there is an online learning algorithm A that, for every T ∈ N and every sequence (x_t, ℓ_t) ∈ X × C for t = 1, 2, . . . , T chosen by the adversary, generates predictions¹ A(x_t) ∈ R^d such that\n\nΣ_{t=1}^T ℓ_t(A(x_t)) ≤ inf_{f∈F} Σ_{t=1}^T ℓ_t(f(x_t)) + R(T).   (1)\n\nIf the online learning algorithm is randomized, we require the above bound to hold with high probability.\n\nThe above definition is simply the online generalization of standard empirical risk minimization (ERM) in the batch setting. A concrete example is 1-dimensional regression, i.e. the prediction space is R. For a labeled data point (x, y⋆) ∈ X × R, the loss for the prediction y ∈ R is given by ℓ(y⋆, y) where ℓ(·,·) is a fixed loss function that is convex in the second argument (such as squared loss, logistic loss, etc.). Given a batch of T labeled data points {(x_t, y⋆_t) | t = 1, 2, . . . , T} and a base class of regression functions F (say, the set of bounded-norm linear regressors), an ERM algorithm finds the function f ∈ F that minimizes Σ_{t=1}^T ℓ(y⋆_t, f(x_t)).\n\nIn the online setting, the adversary reveals the data (x_t, y⋆_t) in an online fashion, only presenting the true label y⋆_t after the online learner A has chosen a prediction y_t. Thus, setting ℓ_t(y_t) = ℓ(y⋆_t, y_t), we observe that if A satisfies the regret bound (1), then it makes predictions with total loss almost as small as that of the empirical risk minimizer, up to the regret term. If F is the set of all bounded-norm linear regressors, for example, the algorithm A could be online gradient descent [25] or online Newton Step [16].\nAt a high level, in the batch setting, “boosting” is understood as a procedure that, given a batch of data and access to an ERM algorithm for a function class F (this is called a “weak” learner), obtains an approximate ERM algorithm for a richer function class F′ (this is called a “strong” learner). Generally, F′ is the set of finite linear combinations of functions in F. The efficiency of boosting is measured by how many times, N, the base ERM algorithm needs to be called (i.e., the number of boosting steps) to obtain an ERM algorithm for the richer function class within the desired approximation tolerance. Convergence rates [24] give bounds on how quickly the approximation error goes to 0 as N → ∞.\nWe now extend this notion of boosting to the online setting in the natural manner. To capture the full generality of the techniques, we also specify a class of loss functions that the online learning algorithm can work with. 
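Before formalizing the boosting reduction, here is a concrete sketch of the kind of base online learner A referred to above: online gradient descent for squared loss over bounded-norm linear regressors. The class and parameter names are our own illustration, not from the paper.

```python
import numpy as np

class OnlineGradientDescent:
    """A sketch of an online learner A over bounded-norm linear regressors.

    In each round it predicts w . x_t, then observes the true label and takes
    a gradient step on the squared loss, projecting back onto the norm ball.
    """

    def __init__(self, dim, radius=1.0, lr=0.1):
        self.w = np.zeros(dim)
        self.radius = radius  # bound on ||w||
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y_true):
        y_pred = self.predict(x)
        grad = 2.0 * (y_pred - y_true) * x  # gradient of (w.x - y_true)^2 in w
        self.w -= self.lr * grad
        norm = np.linalg.norm(self.w)
        if norm > self.radius:  # project back onto the ball of radius `radius`
            self.w *= self.radius / norm

# One pass over a toy stream with a noise-free linear target.
rng = np.random.default_rng(0)
learner = OnlineGradientDescent(dim=2)
total_loss = 0.0
for t in range(500):
    x = rng.normal(size=2)
    y_true = 0.5 * x[0] - 0.25 * x[1]
    total_loss += (learner.predict(x) - y_true) ** 2
    learner.update(x, y_true)
```

The cumulative squared loss here plays the role of the left-hand side of the regret bound (1), with the best linear regressor in hindsight as the comparator.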
Informally, an online boosting algorithm is a reduction that, given access to an online learning algorithm A for a function class F and loss function class C with regret R, and a bound N on the total number of calls made in each iteration to copies of A, obtains an online learning algorithm A′ for a richer function class F′, a richer loss function class C′, and (possibly larger) regret R′. The bound N on the total number of calls made to all the copies of A corresponds to the number of boosting stages in the batch setting, and in the online setting it may be viewed as a resource constraint on the algorithm. The efficacy of the reduction is measured by R′, which is a function of R, N, and certain parameters of the comparator class F′ and loss function class C′. We desire online boosting algorithms such that R′(T)/T → 0 quickly as N → ∞ and T → ∞. We make the notions of richness in the above informal description more precise now.\n\nComparator function classes. A given function class F is said to be D-bounded if for all x ∈ X and all f ∈ F, we have ‖f(x)‖ ≤ D. Throughout this paper, we assume that F is symmetric:² i.e. if f ∈ F, then −f ∈ F, and it contains the constant zero function, which we denote, with some abuse of notation, by 0.\n\n¹There is a slight abuse of notation here. A(·) is not a function but rather the output of the online learning algorithm A computed on the given example using its internal state.\n²This is without loss of generality; as will be seen momentarily, our base assumption only requires an online learning algorithm A for F for linear losses ℓ_t. By running the Hedge algorithm on two copies of A, one of which receives the actual loss functions ℓ_t and the other receives −ℓ_t, we get an algorithm which competes with negations of functions in F and the constant zero function as well. 
Furthermore, since the loss functions are convex (indeed, linear) this can be made into a deterministic reduction by choosing the convex combination of the outputs of the two copies of A with mixing weights given by the Hedge algorithm.\n\nGiven F, we define two richer function classes F′: the convex hull of F, denoted CH(F), is the set of convex combinations of a finite number of functions in F, and the span of F, denoted span(F), is the set of linear combinations of finitely many functions in F.\n\nFor any f ∈ span(F), define ‖f‖₁ := inf{max{1, Σ_{g∈S} |w_g|} : f = Σ_{g∈S} w_g g, S ⊆ F, |S| < ∞, w_g ∈ R}. Since functions in span(F) are not bounded, it is not possible to obtain a uniform regret bound for all functions in span(F): rather, the regret of an online learning algorithm A for span(F) is specified in terms of regret bounds for individual comparator functions f ∈ span(F), viz.\n\nR_f(T) := Σ_{t=1}^T ℓ_t(A(x_t)) − Σ_{t=1}^T ℓ_t(f(x_t)).\n\nLoss function classes. The base loss function class we consider is L, the set of all linear functions ℓ : R^d → R with Lipschitz constant bounded by 1. A function class F that is online learnable with the loss function class L is called online linear learnable for short. The richer loss function class we consider is denoted by C and is a set of convex loss functions ℓ : R^d → R satisfying some regularity conditions specified in terms of certain parameters described below.\nWe define a few parameters of the class C. For any b > 0, let B^d(b) = {y ∈ R^d : ‖y‖ ≤ b} be the ball of radius b. The class C is said to have Lipschitz constant L_b on B^d(b) if for all ℓ ∈ C and all y ∈ B^d(b) there is an efficiently computable subgradient ∇ℓ(y) with norm at most L_b. 
Next, C is said to be β_b-smooth on B^d(b) if for all ℓ ∈ C and all y, y′ ∈ B^d(b) we have\n\nℓ(y′) ≤ ℓ(y) + ∇ℓ(y) · (y′ − y) + (β_b/2) ‖y − y′‖².\n\nNext, define the projection operator Π_b : R^d → B^d(b) as Π_b(y) := arg min_{y′∈B^d(b)} ‖y − y′‖, and define ε_b := sup_{y∈R^d, ℓ∈C} (ℓ(Π_b(y)) − ℓ(y)) / ‖Π_b(y) − y‖.\n\n3 Online Boosting Algorithms\n\nThe setup is that we are given a D-bounded reference class of functions F with an online linear learning algorithm A with regret bound R(·). For normalization, we also assume that the output of A at any time is bounded in norm by D, i.e. ‖A(x_t)‖ ≤ D for all t. We further assume that for every b > 0, we can compute³ a Lipschitz constant L_b, a smoothness parameter β_b, and the parameter ε_b for the class C over B^d(b). Furthermore, the online boosting algorithm may make up to N calls per iteration to any copies of A it maintains, for a given budget parameter N.\n\nGiven this setup, our main result is an online boosting algorithm, Algorithm 1, competing with span(F). The algorithm maintains N copies of A, denoted A^i, for i = 1, 2, . . . , N. Each copy corresponds to one stage in boosting. When it receives a new example x_t, it passes it to each A^i and obtains their predictions A^i(x_t), which it then combines into a prediction for y_t using a linear combination. At the most basic level, this linear combination is simply the sum of all the predictions scaled by a step size parameter η. Two tweaks are made to this sum in step 8 to facilitate the analysis:\n\n1. While constructing the sum, the partial sum y^{i−1}_t is multiplied by a shrinkage factor (1 − σ^i_t η). This shrinkage term is tuned using an online gradient descent algorithm in step 14. 
The goal of the tuning is to induce the partial sums y^{i−1}_t to be aligned with a descent direction for the loss functions, as measured by the inner product ∇ℓ_t(y^{i−1}_t) · y^{i−1}_t.\n\n2. The partial sums y^i_t are made to lie in B^d(B), for some parameter B, by using the projection operator Π_B. This is done to ensure that the Lipschitz constant and smoothness of the loss function are suitably bounded.\n\nAlgorithm 1 Online Gradient Boosting for span(F)\nRequire: Number of weak learners N, step size parameter η ∈ [1/N, 1].\n1: Let B = min{ηND, inf{b ≥ D : ηβ_b b² ≥ ε_b D}}.\n2: Maintain N copies of the algorithm A, denoted A^i for i = 1, 2, . . . , N.\n3: For each i, initialize σ^i_1 = 0.\n4: for t = 1 to T do\n5:   Receive example x_t.\n6:   Define y^0_t = 0.\n7:   for i = 1 to N do\n8:     Define y^i_t = Π_B((1 − σ^i_t η) y^{i−1}_t + η A^i(x_t)).\n9:   end for\n10:  Predict y_t = y^N_t.\n11:  Obtain loss function ℓ_t and suffer loss ℓ_t(y_t).\n12:  for i = 1 to N do\n13:    Pass loss function ℓ^i_t(y) = (1/L_B) ∇ℓ_t(y^{i−1}_t) · y to A^i.\n14:    Set σ^i_{t+1} = max{min{σ^i_t + α_t ∇ℓ_t(y^{i−1}_t) · y^{i−1}_t, 1}, 0}, where α_t = 1/(L_B B √t).\n15:  end for\n16: end for\n\n³It suffices to compute upper bounds on these parameters.\n\nOnce the boosting algorithm makes the prediction y_t and obtains the loss function ℓ_t, each A^i is updated using a suitably scaled linear approximation to the loss function at the partial sum y^{i−1}_t, i.e. the linear loss function (1/L_B) ∇ℓ_t(y^{i−1}_t) · y. This forces A^i to produce predictions that are aligned with a descent direction for the loss function.\n\nFor lack of space, we provide the analysis of the algorithm in Section B in the supplementary material. The analysis yields the following regret bound for the algorithm:\nTheorem 1. Let η ∈ [1/N, 1] be a given parameter. 
Let B = min{ηND, inf{b ≥ D : ηβ_b b² ≥ ε_b D}}. Algorithm 1 is an online learning algorithm for span(F) and losses in C with the following regret bound for any f ∈ span(F):\n\nR′_f(T) ≤ (1 − η/‖f‖₁)^N Δ_0 + 3ηβ_B B² ‖f‖₁ T + L_B ‖f‖₁ R(T) + 2 L_B B ‖f‖₁ √T,\n\nwhere Δ_0 := Σ_{t=1}^T ℓ_t(0) − ℓ_t(f(x_t)).\n\nThe regret bound in this theorem depends on several parameters such as B, β_B and L_B. In applications of the algorithm for 1-dimensional regression with commonly used loss functions, however, these parameters are essentially modest constants; see Section 3.1 for calculations of the parameters for various loss functions. Furthermore, if η is appropriately set (e.g. η = (log N)/N), then the average regret R′_f(T)/T clearly converges to 0 as N → ∞ and T → ∞. While the requirement that N → ∞ may raise concerns about computational efficiency, this is in fact analogous to the guarantee in the batch setting: the algorithms converge only when the number of boosting stages goes to infinity. Moreover, our lower bound (Theorem 4) shows that this is indeed necessary.\nWe also present a simpler boosting algorithm, Algorithm 2, that competes with CH(F). Algorithm 2 is similar to Algorithm 1, with some simplifications: the final prediction is simply a convex combination of the predictions of the base learners, with no projections or shrinkage necessary. While Algorithm 1 is more general, Algorithm 2 may still be useful in practice when a bound on the norm of the comparator function is known in advance, using the observations in Section 4.2. Furthermore, its analysis is cleaner and easier to understand for readers who are familiar with the Frank-Wolfe method, and this serves as a foundation for the analysis of Algorithm 1. This algorithm has an optimal (up to constant factors) regret bound as given in the following theorem, proved in Section A in the supplementary material. 
The upper bound in this theorem is proved along the lines of the Frank-Wolfe [8] algorithm, and the lower bound using information-theoretic arguments.\n\nTheorem 2. Algorithm 2 is an online learning algorithm for CH(F) for losses in C with the regret bound\n\nR′(T) ≤ (8 β_D D² / N) T + L_D R(T).\n\nFurthermore, the dependence of this regret bound on N is optimal up to constant factors.\n\nThe dependence of the regret bound on R(T) is unimprovable without additional assumptions: otherwise, Algorithm 2 would be an online linear learning algorithm over F with better than R(T) regret.\n\nAlgorithm 2 Online Gradient Boosting for CH(F)\n1: Maintain N copies of the algorithm A, denoted A^1, A^2, . . . , A^N, and let η_i = 2/(i+1) for i = 1, 2, . . . , N.\n2: for t = 1 to T do\n3:   Receive example x_t.\n4:   Define y^0_t = 0.\n5:   for i = 1 to N do\n6:     Define y^i_t = (1 − η_i) y^{i−1}_t + η_i A^i(x_t).\n7:   end for\n8:   Predict y_t = y^N_t.\n9:   Obtain loss function ℓ_t and suffer loss ℓ_t(y_t).\n10:  for i = 1 to N do\n11:    Pass loss function ℓ^i_t(y) = (1/L_D) ∇ℓ_t(y^{i−1}_t) · y to A^i.\n12:  end for\n13: end for\n\nUsing a deterministic base online linear learning algorithm. If the base online linear learning algorithm A is deterministic, then our results can be improved, because our online boosting algorithms are also deterministic, and using a standard simple reduction, we can now allow C to be any set of convex functions (smooth or not) with a computable Lipschitz constant L_b over the domain B^d(b) for any b > 0.\nThis reduction converts arbitrary convex loss functions into linear functions: viz. if y_t is the output of the online boosting algorithm, then the loss function provided to the boosting algorithm as feedback is the linear function ℓ′_t(y) = ∇ℓ_t(y_t) · y. 
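This linearization step can be sketched in a few lines; the function and variable names here are our own illustration, not from the paper.

```python
import numpy as np

def linearized_feedback(grad_fn, y_t):
    """Return the linear loss y -> grad_l(y_t) . y obtained by linearizing
    a convex loss at the point y_t actually played by the booster."""
    g = grad_fn(y_t)  # the subgradient is evaluated once, at y_t
    return lambda y: float(np.dot(g, y))

# Example: squared loss l(y) = (y - 3)^2 in one dimension, with y_t = 1.
grad_fn = lambda y: 2.0 * (y - 3.0)
lin_loss = linearized_feedback(grad_fn, np.array([1.0]))
value = lin_loss(np.array([1.0]))  # gradient at 1 is -4, so the linear loss at 1 is -4
```

By convexity, any learner with small regret against this linear loss also makes progress on the original convex loss, which is what the reduction exploits.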
This reduction immediately implies that the base online linear learning algorithm A, when fed loss functions (1/L_D) ℓ′_t, is already an online learning algorithm for CH(F) with losses in C with the regret bound R′(T) ≤ L_D R(T).\nAs for competing with span(F), since linear loss functions are 0-smooth, we obtain the following easy corollary of Theorem 1:\nCorollary 1. Let η ∈ [1/N, 1] be a given parameter, and set B = ηND. Algorithm 1 is an online learning algorithm for span(F) for losses in C with the following regret bound for any f ∈ span(F):\n\nR′_f(T) ≤ (1 − η/‖f‖₁)^N Δ_0 + L_B ‖f‖₁ R(T) + 2 L_B B ‖f‖₁ √T,\n\nwhere Δ_0 := Σ_{t=1}^T ℓ_t(0) − ℓ_t(f(x_t)).\n\n3.1 The parameters for several basic loss functions\n\nIn this section we consider the application of our results to 1-dimensional regression, where we assume, for normalization, that the true labels of the examples and the predictions of the functions in the class F are in [−1, 1]. In this case ‖·‖ denotes the absolute value norm. Thus, in each round, the adversary chooses a labeled data point (x_t, y⋆_t) ∈ X × [−1, 1], and the loss for the prediction y_t ∈ [−1, 1] is given by ℓ_t(y_t) = ℓ(y⋆_t, y_t) where ℓ(·,·) is a fixed loss function that is convex in the second argument. Note that D = 1 in this setting. We give examples of several such loss functions below, and compute the parameters L_b, β_b and ε_b for every b > 0, as well as B from Theorem 1.\n\n1. Linear loss: ℓ(y⋆, y) = −y⋆y. We have L_b = 1, β_b = 0, ε_b = 1, and B = ηN.\n2. p-norm loss, for some p ≥ 2: ℓ(y⋆, y) = |y⋆ − y|^p. We have L_b = p(b+1)^{p−1}, β_b = p(p−1)(b+1)^{p−2}, ε_b = max{p(1−b)^{p−1}, 0}, and B = 1.\n3. Modified least squares: ℓ(y⋆, y) = ½ max{1 − y⋆y, 0}². We have L_b = b+1, β_b = 1, ε_b = max{1−b, 0}, and B = 1.\n4. Logistic loss: ℓ(y⋆, y) = ln(1 + exp(−y⋆y)). We have L_b = exp(b)/(1+exp(b)), β_b = 1/4, ε_b = exp(−b)/(1+exp(−b)), and B = min{ηN, ln(4/η)}.\n\n4 Variants of the boosting algorithms\n\nOur boosting algorithms and the analysis are considerably flexible: it is easy to modify the algorithms to work with a different (and perhaps more natural) kind of base learner which does greedy fitting, or incorporate a scaling of the base functions which improves performance. Also, when specialized to the batch setting, our algorithms provide better convergence rates than previous work.\n\n4.1 Fitting to actual loss functions\n\nThe choice of an online linear learning algorithm over the base function class in our algorithms was made to ease the analysis. In practice, it is more common to have an online algorithm which produces predictions with comparable accuracy to the best function in hindsight for the actual sequence of loss functions. In particular, a common heuristic in boosting algorithms such as the original gradient boosting algorithm by Friedman [10] or the matching pursuit algorithm of Mallat and Zhang [18] is to build a linear combination of base functions by iteratively augmenting the current linear combination via greedily choosing a base function and a step size for it that minimizes the loss with respect to the residual label. Indeed, the boosting algorithm of Zhang and Yu [24] also uses this kind of greedy fitting algorithm as the base learner.\nIn the online setting, we can model greedy fitting as follows. We first fix a step size α ≥ 0 in advance. Then, in each round t, the base learner A receives not only the example x_t, but also an offset y′_t ∈ R^d for the prediction, and produces a prediction A(x_t) ∈ R^d, after which it receives the loss function ℓ_t and suffers loss ℓ_t(y′_t + αA(x_t)). The predictions of A satisfy\n\nΣ_{t=1}^T ℓ_t(y′_t + αA(x_t)) ≤ inf_{f∈F} Σ_{t=1}^T ℓ_t(y′_t + αf(x_t)) + R(T),\n\nwhere R is the regret. Our algorithms can be made to work with this kind of base learner as well. 
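The offset-based interface described above can be sketched as follows; the class, its parameters, and the choice of squared loss are our own illustration, not from the paper.

```python
import numpy as np

class GreedyFittingLearner:
    """Sketch of an offset-based base learner: in round t it receives
    (x_t, offset), predicts A(x_t), and is charged the actual loss of the
    combined prediction offset + alpha * A(x_t). The base class here is
    1-d linear predictors trained by gradient descent on squared loss."""

    def __init__(self, dim, alpha=0.5, lr=0.05):
        self.w = np.zeros(dim)
        self.alpha = alpha  # the step size fixed in advance
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, offset, y_true):
        # squared loss of the combined prediction offset + alpha * A(x)
        combined = offset + self.alpha * self.predict(x)
        # chain rule: d/dw (combined - y)^2 = 2 (combined - y) * alpha * x
        self.w -= self.lr * 2.0 * (combined - y_true) * self.alpha * x

learner = GreedyFittingLearner(dim=1)
# Repeated rounds drive the learner toward fitting the residual
# (y_true - offset) / alpha, here (1.0 - 0.25) / 0.5 = 1.5.
for _ in range(2000):
    learner.update(np.array([1.0]), offset=0.25, y_true=1.0)
```

This matches the greedy-fitting heuristic: the learner effectively regresses on the residual left after the current partial sum.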
The details can be found in Section C.1 of the supplementary material.\n\n4.2 Improving the regret bound via scaling\n\nGiven an online linear learning algorithm A over the function class F with regret R, for any scaling parameter λ > 0 we trivially obtain an online linear learning algorithm, denoted A_λ, over a λ-scaling of F, viz. λF := {λf | f ∈ F}, simply by multiplying the predictions of A by λ. The corresponding regret scales by λ as well, i.e. it becomes λR.\nThe performance of Algorithm 1 can be improved by using such an online linear learning algorithm over λF for a suitably chosen scaling λ ≥ 1 of the function class F. The regret bound from Theorem 1 improves because the 1-norm of f measured with respect to λF, i.e. ‖f‖′₁ = max{1, ‖f‖₁/λ}, is smaller than ‖f‖₁, but degrades because the parameter B′ = min{ηNλD, inf{b ≥ λD : ηβ_b b² ≥ ε_b λD}} is larger than B. But, as detailed in Section C.2 of the supplementary material, in many situations the improvement due to the former compensates for the degradation due to the latter, and overall we can get improved regret bounds using a suitable value of λ.\n\n4.3 Improvements for batch boosting\n\nOur algorithmic technique can be easily specialized and modified to the standard batch setting with a fixed batch of training examples and a base learning algorithm operating over the batch, exactly as in [24]. The main difference compared to the algorithm of [24] is the use of the shrinkage variables σ^i_t to scale the coefficients of the weak hypotheses appropriately. While a seemingly innocuous tweak, this allows us to derive analogous bounds to those of Zhang and Yu [24] on the optimization error that show that our boosting algorithm converges exponentially faster. 
A detailed comparison can be found in Section C.3 of the supplementary material.\n\n5 Experimental Results\n\nIs it possible to boost in an online fashion in practice with real base learners? To study this question, we implemented and evaluated Algorithms 1 and 2 within the Vowpal Wabbit (VW) open source machine learning system [23]. The three online base learners used were VW's default linear learner (a variant of stochastic gradient descent), two-layer sigmoidal neural networks with 10 hidden units, and regression stumps.\n\nRegression stumps were implemented by doing stochastic gradient descent on each individual feature, and predicting with the best-performing non-zero valued feature in the current example.\n\nAll experiments were done on a collection of 14 publicly available regression and classification datasets (described in Section D in the supplementary material) using squared loss. The only parameters tuned were the learning rate and the number of weak learners, as well as the step size parameter for Algorithm 1. Parameters were tuned based on progressive validation loss on half of the dataset; reported is the progressive validation loss on the remaining half. Progressive validation is a standard online validation technique, where each training example is used for testing before it is used for updating the model [3].\n\nThe following table reports the average and the median, over the datasets, relative improvement in squared loss over the respective base learner. Detailed results can be found in Section D in the supplementary material.\n\nBase learner        Avg. (Alg. 1)  Avg. (Alg. 2)  Median (Alg. 1)  Median (Alg. 2)\nSGD                 1.65%          1.33%          0.03%            0.29%\nRegression stumps   20.22%         15.9%          10.45%           13.69%\nNeural networks     7.88%          0.72%          0.72%            0.33%\n\nNote that both SGD (stochastic gradient descent) and neural networks are already very strong learners. 
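The progressive validation protocol used above is easy to state in code; the helper names and the toy mean-predicting learner below are our own illustration, not from the paper.

```python
class RunningMean:
    """A toy online learner that predicts the mean of the labels seen so far."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self, x):
        return self.total / self.count if self.count else 0.0
    def update(self, x, y_true):
        self.total += y_true
        self.count += 1

def progressive_validation_loss(learner, stream, loss):
    """Progressive validation: each example is scored BEFORE the model is
    updated on it, so the average is an honest online estimate [3]."""
    total, n = 0.0, 0
    for x, y in stream:
        total += loss(learner.predict(x), y)  # test first...
        learner.update(x, y)                  # ...then train
        n += 1
    return total / n

avg = progressive_validation_loss(
    RunningMean(),
    stream=[(None, 1.0), (None, 1.0), (None, 1.0)],
    loss=lambda p, y: (p - y) ** 2,
)
# Only the first round is mispredicted (0 vs 1), so the average loss is 1/3.
```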
Naturally, boosting is much more effective for regression stumps, which is a weak base learner.\n\n6 Conclusions and Future Work\n\nIn this paper we generalized the theory of boosting for regression problems to the online setting and provided online boosting algorithms with theoretical convergence guarantees. Our algorithmic technique also improves convergence guarantees for batch boosting algorithms. We also provide experimental evidence that our boosting algorithms do improve prediction accuracy over commonly used base learners in practice, with greater improvements for weaker base learners. The main remaining open question is whether the boosting algorithm for competing with the span of the base functions is optimal in any sense, similar to our proof of optimality for the boosting algorithm for competing with the convex hull of the base functions.\n\nReferences\n[1] Peter L. Bartlett and Mikhail Traskin. AdaBoost is consistent. JMLR, 8:2347–2368, 2007.\n[2] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In ICML, 2015.\n[3] Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In COLT, pages 203–208, 1999.\n[4] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justifications. In ICML, 2012.\n[5] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In ICML, 2014.\n[6] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. In COLT, 2000.\n[7] Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learning, 47(2/3):153–200, 2002.\n[8] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Res. Logis. 
Quart., 3:95–110, 1956.\n[9] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1):119–139, August 1997.\n[10] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), October 2001.\n[11] Helmut Grabner and Horst Bischof. On-line boosting and vision. In CVPR, volume 1, pages 260–267, 2006.\n[12] Helmut Grabner, Christian Leistner, and Horst Bischof. Semi-supervised on-line boosting for robust tracking. In ECCV, pages 234–247, 2008.\n[13] Trevor Hastie and Robert J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.\n[14] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.\n[15] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. JMLR, 15(1):2489–2512, 2014.\n[16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.\n[17] Xiaoming Liu and Ting Yu. Gradient feature selection for online boosting. In ICCV, pages 1–8, 2007.\n[18] Stéphane G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, December 1993.\n[19] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In NIPS, 2000.\n[20] Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In AISTATS, pages 105–112, 2001.\n[21] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.\n[22] Matus Telgarsky. Boosting with the logistic loss is consistent. In COLT, 2013.\n[23] VW. 
URL https://github.com/JohnLangford/vowpal_wabbit/.\n[24] Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. Annals of Statistics, 33(4):1538–1579, 2005.\n[25] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.\n", "award": [], "sourceid": 1455, "authors": [{"given_name": "Alina", "family_name": "Beygelzimer", "institution": "Yahoo Labs"}, {"given_name": "Elad", "family_name": "Hazan", "institution": "Princeton University"}, {"given_name": "Satyen", "family_name": "Kale", "institution": "Yahoo Labs"}, {"given_name": "Haipeng", "family_name": "Luo", "institution": "Princeton University"}]}