{"title": "The Fast Convergence of Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1593, "page_last": 1601, "abstract": "This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/\u03b5\u00b2)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/\u03b5)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/\u03b5)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/\u03b5), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest.", "full_text": "The Fast Convergence of Boosting\n\nDepartment of Computer Science and Engineering\n\nMatus Telgarsky\n\nUniversity of California, San Diego\n\n9500 Gilman Drive, La Jolla, CA 92093-0404\n\nmtelgars@cs.ucsd.edu\n\nAbstract\n\nThis manuscript considers the convergence rate of boosting under a large class of\nlosses, including the exponential and logistic losses, where the best previous rate\nof convergence was O(exp(1/\u270f2)). First, it is established that the setting of weak\nlearnability aids the entire class, granting a rate O(ln(1/\u270f)). Next, the (disjoint)\nconditions under which the in\ufb01mal empirical risk is attainable are characterized\nin terms of the sample and weak learning class, and a new proof is given for the\nknown rate O(ln(1/\u270f)). 
Finally, it is established that any instance can be de-\ncomposed into two smaller instances resembling the two preceding special cases,\nyielding a rate O(1/\u270f), with a matching lower bound for the logistic loss. The\nprincipal technical hurdle throughout this work is the potential unattainability of\nthe in\ufb01mal empirical risk; the technique for overcoming this barrier may be of\ngeneral interest.\n\n1\n\nIntroduction\n\nBoosting is the task of converting inaccurate weak learners into a single accurate predictor. The\nexistence of any such method was unknown until the breakthrough result of Schapire [1]: under\na weak learning assumption, it is possible to combine many carefully chosen weak learners into a\nmajority of majorities with arbitrarily low training error. Soon after, Freund [2] noted that a single\nmajority is enough, and that O(ln(1/\u270f)) iterations are both necessary and suf\ufb01cient to attain accu-\nracy \u270f. Finally, their combined effort produced AdaBoost, which attains the optimal convergence\nrate (under the weak learning assumption), and has an astonishingly simple implementation [3].\nIt was eventually revealed that AdaBoost was minimizing a risk functional, speci\ufb01cally the expo-\nnential loss [4]. Aiming to alleviate perceived de\ufb01ciencies in the algorithm, other loss functions\nwere proposed, foremost amongst these being the logistic loss [5]. Given the wide practical suc-\ncess of boosting with the logistic loss, it is perhaps surprising that no convergence rate better than\nO(exp(1/\u270f2)) was known, even under the weak learning assumption [6]. The reason for this de-\n\ufb01ciency is simple: unlike SVM, least squares, and basically any other optimization problem con-\nsidered in machine learning, there might not exist a choice which attains the minimal risk! 
This reliance is carried over from convex optimization, where the assumption of attainability is generally made, either directly, or through stronger conditions like compact level sets or strong convexity [7].

Convergence rate analysis provides a valuable mechanism to compare and improve minimization algorithms. But there is a deeper significance with boosting: a convergence rate of O(ln(1/ε)) means that, with a combination of just O(ln(1/ε)) predictors, one can construct an ε-optimal classifier, which is crucial to both the computational efficiency and statistical stability of this predictor.

The contribution of this manuscript is to provide a tight convergence theory for a large class of losses, including the exponential and logistic losses, which has heretofore resisted analysis. The goal is a general analysis without any assumptions (attainability of the minimum, or weak learnability); however, this manuscript also demonstrates how the classically understood scenarios of attainability and weak learnability can be understood directly from the sample and the weak learning class.

The organization is as follows. Section 2 provides a few pieces of background: how to encode the weak learning class and sample as a matrix, boosting as coordinate descent, and the primal objective function. Section 3 then gives the dual problem, max entropy. Given these tools, section 4 shows how to adjust the weak learning rate to a quantity which is useful without any assumptions.
The first step towards convergence rates is then taken in section 5, which demonstrates that the weak learning rate is in fact a mechanism to convert between the primal and dual problems.

The convergence rates then follow: section 6 and section 7 discuss, respectively, the conditions under which classical weak learnability and (disjointly) attainability hold, both yielding the rate O(ln(1/ε)), and finally section 8 shows how the general case may be decomposed into these two, and the conflicting optimization behavior leads to a degraded rate of O(1/ε). The last section will also exhibit an Ω(1/ε) lower bound for the logistic loss.

1.1 Related Work

The development of general convergence rates has a number of important milestones in the past decade. The first convergence result, albeit without any rates, is due to Collins et al. [8]; the work considered the improvement due to a single step, and as its update rule was less aggressive than the line search of boosting, it appears to imply general convergence. Next, Bickel et al. [6] showed a rate of O(exp(1/ε²)), under assumptions of bounded second derivatives on compact sets, which are also necessary here.

Many extremely important cases have also been handled. The first is the original rate of O(ln(1/ε)) for the exponential loss under the weak learning assumption [3]. Next, Rätsch et al. [9] showed, for a class of losses similar to those considered here, a rate of O(ln(1/ε)) when the loss minimizer is attainable. The current manuscript provides another mechanism to analyze this case (with the same rate), which is crucial to being able to produce a general analysis. And, very recently, parallel to this work, Mukherjee et al. [10] established the general convergence under the exponential loss, with a rate of Θ(1/ε).
The same matrix, due to Schapire [11], was used to show the lower bound there as for the logistic loss here; their upper bound proof also utilized a decomposition theorem.

It is interesting to mention that, for many variants of boosting, general convergence rates were known. Specifically, once it was revealed that boosting is trying to be not only correct but also have large margins [12], much work was invested into methods which explicitly maximized the margin [13], or penalized variants focused on the inseparable case [14, 15]. These methods generally impose some form of regularization [15], which grants attainability of the risk minimizer, and allows standard techniques to grant general convergence rates. Interestingly, the guarantees in those works cited in this paragraph are O(1/ε²).

2 Setup

A view of boosting, which pervades this manuscript, is that the action of the weak learning class upon the sample can be encoded as a matrix [9, 15]. Let a sample S := {(x_i, y_i)}_{i=1}^m ⊆ (X × Y)^m and a weak learning class H be given. For every h ∈ H, let S|h denote the projection onto S induced by h; that is, S|h is a vector of length m, with coordinates (S|h)_i = y_i h(x_i). If the set of all such columns {S|h : h ∈ H} is finite, collect them into the matrix A ∈ R^{m×n}. Let a_i denote the ith row of A, corresponding to the example (x_i, y_i), and let {h_j}_{j=1}^n index the set of weak learners corresponding to columns of A. It is assumed, for convenience, that entries of A are within [−1, +1]; relaxing this assumption merely scales the presented rates by a constant.

The setting considered in this manuscript is that this finite matrix can be constructed. Note that this can encode infinite classes, so long as they map to only k < ∞ values (in which case A has at most k^m columns). As another example, if the weak learners are binary, and H has VC dimension d, then Sauer's lemma grants that A has at most (m + 1)^d columns.
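As a concrete illustration of this encoding, the matrix A can be built directly from a sample and a finite set of weak learners. The sketch below uses a hypothetical sample and two hypothetical decision stumps; nothing beyond the definition (A)_{ij} = y_i h_j(x_i) above is taken from the paper.

```python
# Sketch: building the margin matrix A with (A)_{ij} = y_i * h_j(x_i).
# The sample and the two stumps are hypothetical, chosen only to
# illustrate the construction from section 2.

def margin_matrix(sample, hypotheses):
    """Rows index examples (x_i, y_i); columns index weak learners h_j."""
    return [[y * h(x) for h in hypotheses] for (x, y) in sample]

# A toy sample in (X x Y)^m with Y = {-1, +1}.
sample = [(-2, -1), (-1, -1), (1, +1), (3, +1)]

# Two binary decision stumps; each maps into {-1, +1}, so the entries
# of A lie in [-1, +1] as assumed in the text.
hypotheses = [
    lambda x: 1 if x >= 0 else -1,   # h_1: threshold at 0
    lambda x: 1 if x >= 2 else -1,   # h_2: threshold at 2
]

A = margin_matrix(sample, hypotheses)
# h_1 classifies every example correctly, so its column is all +1.
print(A)  # [[1, 1], [1, 1], [1, -1], [1, 1]]
```

Since h_1's column is entrywise positive, this toy instance is separable, i.e. a vector λ with Aλ > 0 exists.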
This matrix view of boosting is thus similar to the interpretation of boosting performing descent on functional space, but the class complexity and finite sample have been used to reduce the function class to a finite object [16, 5].

Routine BOOST.
Input: Convex function f ∘ A.
Output: Approximate primal optimum λ.
1. Initialize λ_0 := 0_n.
2. For t = 1, 2, . . ., while ∇(f ∘ A)(λ_{t−1}) ≠ 0_n:
   (a) Choose column j_t := argmax_j |∇(f ∘ A)(λ_{t−1})^T e_j|.
   (b) Line search: α_t approximately minimizes α ↦ (f ∘ A)(λ_{t−1} + α e_{j_t}).
   (c) Update λ_t := λ_{t−1} + α_t e_{j_t}.
3. Return λ_{t−1}.

Figure 1: ℓ¹ steepest descent [17, Algorithm 9.4] applied to f ∘ A.

To make the connection to boosting, the missing ingredient is the loss function. Let G_0 denote the set of loss functions g satisfying: g is twice continuously differentiable, g″ > 0 (which implies strict convexity), and lim_{x→∞} g(x) = 0. (A few more conditions will be added in section 5 to prove convergence rates, but these properties suffice for the current exposition.) Crucially, the exponential loss exp(−x) from AdaBoost and the logistic loss ln(1 + exp(−x)) are in G_0 (and the eventual G).

Boosting determines some weighting λ ∈ R^n of the columns of A, which correspond to weak learners in H. The (unnormalized) margin of example i is thus ⟨a_i, λ⟩ = e_i^T Aλ, where e_i is an indicator vector. Since the prediction on x_i is 1[⟨a_i, λ⟩ ≥ 0], it follows that Aλ > 0_m (where 0_m is the zero vector) implies a training error of zero. As such, boosting solves the minimization problem

    inf_{λ∈R^n} Σ_{i=1}^m g(⟨a_i, λ⟩) = inf_{λ∈R^n} Σ_{i=1}^m g(e_i^T Aλ) = inf_{λ∈R^n} f(Aλ) = inf_{λ∈R^n} (f ∘ A)(λ) =: f̄_A,        (2.1)

where f : R^m → R is the convenience function f(x) = Σ_i g((x)_i), and in the present problem denotes the (unnormalized) empirical risk. f̄_A will denote the optimal objective value.

The infimum in eq. (2.1) may well not be attainable.
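The routine of fig. 1 can be sketched numerically. The sketch below is a minimal, non-authoritative instance for the exponential loss g(x) = exp(−x): it picks the steepest coordinate and then uses a crude ternary line search, which is an assumption of this sketch standing in for the approximate line search (the paper itself later uses a quadratic-upper-bound step).

```python
import math

# Minimal sketch of the BOOST routine (fig. 1) for the exponential loss,
# minimizing f(A lam) = sum_i exp(-<a_i, lam>) by greedy coordinate descent.
# The bounded ternary line search is an assumption of this sketch.

def f(A, lam):
    return sum(math.exp(-sum(a[j] * lam[j] for j in range(len(lam)))) for a in A)

def grad(A, lam):
    # (nabla (f o A))(lam)_j = sum_i g'(<a_i, lam>) * a_{ij}, with g' = -g here.
    m, n = len(A), len(A[0])
    margins = [sum(A[i][j] * lam[j] for j in range(n)) for i in range(m)]
    return [sum(-math.exp(-margins[i]) * A[i][j] for i in range(m)) for j in range(n)]

def boost(A, iters=50):
    n = len(A[0])
    lam = [0.0] * n
    for _ in range(iters):
        g = grad(A, lam)
        j = max(range(n), key=lambda k: abs(g[k]))   # steepest coordinate
        if g[j] == 0.0:                              # stopping condition
            break
        step = lambda a: f(A, lam[:j] + [lam[j] + a] + lam[j + 1:])
        lo, hi = -10.0, 10.0                         # bounded ternary search
        for _ in range(60):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if step(m1) < step(m2):
                hi = m2
            else:
                lo = m1
        lam[j] += (lo + hi) / 2
    return lam

# On a separable instance (some lam with A lam > 0), the risk heads to 0.
A = [[1, 1], [1, 1], [1, -1], [1, 1]]
lam = boost(A)
print(f(A, lam) < 0.05)  # True
```

On this toy separable instance the objective is driven toward 0 without ever attaining it, previewing the unattainability discussed next.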
Suppose there exists λ′ such that Aλ′ > 0_m (theorem 6.1 will show that this is equivalent to the weak learning assumption). Then

    0 ≤ inf_{λ∈R^n} f(Aλ) ≤ inf {f(Aλ) : λ = cλ′, c > 0} = inf_{c>0} f(c(Aλ′)) = 0.

On the other hand, for any λ ∈ R^n, f(Aλ) > 0. Thus the infimum is never attainable when weak learnability holds.

The template boosting algorithm appears in fig. 1, formulated in terms of f ∘ A to make the connection to coordinate descent as clear as possible. To interpret the gradient terms, note that

    (∇(f ∘ A)(λ))_j = (A^T ∇f(Aλ))_j = Σ_{i=1}^m g′(⟨a_i, λ⟩) y_i h_j(x_i),

which is the negation of the expected correlation of h_j with the target labels according to an unnormalized distribution with weights −g′(⟨a_i, λ⟩). The stopping condition ∇(f ∘ A)(λ) = 0_n means: either the distribution is degenerate (it is exactly zero), or every weak learner is uncorrelated with the target.

As such, eq. (2.1) represents an equivalent formulation of boosting, with one minor modification: the column (weak learner) selection has an absolute value. But note that this is the same as closing H under complementation (i.e., for any h ∈ H, there exists h′ with h(x) = −h′(x)), which is assumed in many theoretical treatments of boosting.

In the case of the exponential loss with binary weak learners, the line search step has a convenient closed form; but for other losses, or even for the exponential loss but with confidence-rated predictors, there may not be a closed form. Moreover, this univariate search problem may lack a minimizer. To produce the eventual convergence rates, this manuscript utilizes a step size minimizing an upper bounding quadratic (which is guaranteed to exist); if instead a standard iterative line search guarantee were used, rates would only degrade by a constant factor [17, section 9.3.1].

As a final remark, consider the rows {a_i}_{i=1}^m of A as a collection of m points in R^n. Due to the form of g, BOOST is therefore searching for a halfspace, parameterized by a vector λ, which contains all of the points. Sometimes such a halfspace may not exist, and g applies a smoothly increasing penalty to points that are farther and farther outside it.

3 Dual Problem

This section provides the convex dual to eq. (2.1). The relevance of the dual to convergence rates is as follows. First, although the primal optimum may not be attainable, the dual optimum is always attainable; this suggests a strategy of mapping the convergence strategy to the dual, where there exists a clear notion of progress to the optimum. Second, this section determines the dual feasible set, the space of dual variables, or what the boosting literature typically calls unnormalized weights. Understanding this set is key to relating weak learnability, attainability, and general instances.

Before proceeding, note that the dual formulation will make use of the Fenchel conjugate h*(φ) = sup_{x∈dom(h)} ⟨x, φ⟩ − h(x), a concept taking a central place in convex analysis [18, 19]. Interestingly, the Fenchel conjugates to the exponential and logistic losses are respectively the Boltzmann-Shannon and Fermi-Dirac entropies [19, Commentary, section 3.3], and thus the dual is explicitly performing entropy maximization (cf. lemma C.2). As a final piece of notation, denote the kernel of a matrix B ∈ R^{m×n} by Ker(B) = {λ ∈ R^n : Bλ = 0_m}.

Theorem 3.1. For any A ∈ R^{m×n} and g ∈ G_0 with f(x) = Σ_i g((x)_i),

    inf {f(Aλ) : λ ∈ R^n} = sup {−f*(−ψ) : ψ ∈ Φ_A},        (3.2)

where Φ_A := Ker(A^T) ∩ R^m_+ is the dual feasible set. The dual optimum ψ̄_A is unique and attainable. Lastly, f*(φ) = Σ_{i=1}^m g*((φ)_i).

The dual feasible set Φ_A = Ker(A^T) ∩ R^m_+ has a strong interpretation. Suppose ψ ∈ Φ_A; then ψ is a nonnegative vector (since ψ ∈ R^m_+), and, for any j, 0 = (ψ^T A)_j = Σ_{i=1}^m ψ_i y_i h_j(x_i). That is to say, every nonzero feasible dual vector provides an (unnormalized) distribution upon which every weak learner is uncorrelated!

Furthermore, recall that the weak learning assumption states that under any weighting of the input, there exists a correlated weak learner; as such, weak learnability necessitates that the dual feasible set contains only the zero vector.

There is also a geometric interpretation. Ignoring the constraint, −f*(−ψ) attains its maximum at some rescaling of the uniform distribution (for details, please see lemma C.2). As such, the constrained dual problem is aiming to write the origin as a high entropy convex combination of the points {a_i}_{i=1}^m.

4 A Generalized Weak Learning Rate

The weak learning rate was critical to the original convergence analysis of AdaBoost, providing a handle on the progress of the algorithm. Recall that the quantity appeared in the denominator of the convergence rate, and a weak learning assumption critically provided that this quantity is nonzero. This section will generalize the weak learning rate to a quantity which is always positive, without any assumptions.

Note briefly that this manuscript will differ slightly from the norm in that weak learning will be a purely sample-specific concept. That is, the concern here is convergence, and all that matters is the sample S = {(x_i, y_i)}_{i=1}^m, as encoded in A; it doesn't matter if there are wild points outside this sample, because the algorithm has no access to them.

This distinction has the following implication. The usual weak learning assumption states that there exists no uncorrelating distribution over the input space.
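An uncorrelating distribution, i.e. a nonzero element of Φ_A = Ker(A^T) ∩ R^m_+, is easy to exhibit and verify on a toy instance. The instance below is hypothetical, chosen so that the same point effectively appears with both labels and every column cancels under uniform weights.

```python
# Sketch: verifying a dual feasible vector psi in Ker(A^T) and R^m_+.
# Hypothetical instance: two rows whose signs are opposite, so each
# weak learner's column cancels under the uniform weighting.

A = [
    [+1, -1],   # an example (x, +1): h_1 correct, h_2 wrong
    [-1, +1],   # the same x with label -1: every sign flips, since a_i carries y_i
]
psi = [0.5, 0.5]

# psi is nonnegative ...
assert all(p >= 0 for p in psi)

# ... and A^T psi = 0: each weak learner has zero correlation with the
# labels under the weights psi, so psi is a nonzero element of Phi_A.
correlations = [sum(psi[i] * A[i][j] for i in range(len(A))) for j in range(2)]
print(correlations)  # [0.0, 0.0]
```

By the discussion above, the existence of such a nonzero ψ rules out weak learnability for this instance.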
This of course implies that any training sample S used by the algorithm will also have this property; however, it suffices that there is no distribution over the input sample S which uncorrelates the weak learners from the target.

Returning to task, the weak learning assumption posits the existence of a constant, the weak learning rate γ, which lower bounds the correlation of the best weak learner with the target for any distribution. Stated in terms of the matrix A,

    0 < γ = inf_{ψ∈R^m_+, ‖ψ‖₁=1} max_{j∈[n]} Σ_{i=1}^m (ψ)_i y_i h_j(x_i) = inf_{ψ∈R^m_+\{0_m}} ‖A^T ψ‖_∞ / ‖ψ‖₁ = inf_{ψ∈R^m_+\{0_m}} ‖A^T ψ‖_∞ / ‖ψ − 0_m‖₁.        (4.1)

The only way this quantity can be positive is if no ψ ∈ R^m_+ \ {0_m} lies in Ker(A^T) ∩ R^m_+ = Φ_A, meaning the dual feasible set is exactly {0_m}. As such, one candidate adjustment is to simply replace {0_m} with the dual feasible set:

    γ′ := inf_{φ∈R^m_+\Φ_A} ‖A^T φ‖_∞ / inf_{ψ∈Φ_A} ‖φ − ψ‖₁.

Indeed, by the forthcoming proposition 4.3, γ′ > 0 as desired. Due to technical considerations which will be postponed until the various convergence rates, it is necessary to tighten this definition with another set.

Definition 4.2. For a given matrix A ∈ R^{m×n} and set S ⊆ R^m, define

    γ(A, S) := inf { ‖A^T φ‖_∞ / inf_{ψ∈S∩Ker(A^T)} ‖φ − ψ‖₁ : φ ∈ S \ Ker(A^T) }.

Crucially, for the choices of S pertinent here, this quantity is always positive.

Proposition 4.3. Let A ≠ 0_{m×n} and polyhedron S be given. If S \ Ker(A^T) ≠ ∅ and S has nonempty interior, then γ(A, S) ∈ (0, ∞).

To simplify discussion, the following projection and distance notation will be used in the sequel:

    P^p_C(x) ∈ Argmin_{y∈C} ‖y − x‖_p,        D^p_C(x) = ‖x − P^p_C(x)‖_p,

with some arbitrary choice made when the minimizer is not unique.

5 Prelude to Convergence Rates: Three Alternatives

The pieces are in place to finally sketch how the convergence rates may be proved.
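Before that, the classical rate of eq. (4.1) can be probed numerically on a small instance. The brute-force grid below only upper-estimates the infimum over the simplex, so it is an illustration rather than a certified computation; the matrix is hypothetical, with m = 2 examples and n = 2 confidence-rated weak learners.

```python
# Sketch: numerically estimating the weak learning rate gamma of eq. (4.1)
# for a small hypothetical matrix A (entries in [-1, +1]). A grid over the
# probability simplex only approximates the infimum, so this is an
# illustration, not a certified value.

A = [[+1.0, -1.0],
     [-0.5, +1.0]]   # separable: lam = (1, 0.7) gives A lam > 0 entrywise

def weighted_correlations(psi):
    # (A^T psi)_j for a distribution psi over the two examples.
    return [sum(psi[i] * A[i][j] for i in range(2)) for j in range(2)]

steps = 10_000
estimate = min(
    max(abs(c) for c in weighted_correlations([p / steps, 1 - p / steps]))
    for p in range(steps + 1)
)
print(round(estimate, 3))  # about 1/7 for this instance
```

For this matrix the balancing distribution ψ = (3/7, 4/7) equalizes the two weak learners' absolute correlations, so the estimate settles near 1/7; in particular γ > 0, matching separability.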
This section identifies how the weak learning rate γ(A, S) can be used to convert the standard gradient guarantees into something which can be used in the presence of no attainable minimum. To close, three basic optimization scenarios are identified, which lead to the following three sections on convergence rates. But first, it is a good time to define the final loss function class.

Definition 5.1. Every g ∈ G satisfies the following properties. First, g ∈ G_0. Next, for any x ∈ R^m satisfying f(x) ≤ f(Aλ_0), and for any coordinate (x)_i, there exist constants η > 0 and β > 0 such that g″((x)_i) ≤ η g((x)_i) and g((x)_i) ≤ β |g′((x)_i)|.

The exponential loss is in this class with η = β = 1, since it is (up to sign) a fixed point with respect to the differentiation operator. Furthermore, as is verified in remark F.1 of the full version, the logistic loss is also in this class, with η = 2^m/(m ln(2)) and β ≤ 1 + 2^m. Intuitively, η and β encode how similar some g ∈ G is to the exponential loss, and thus these parameters can degrade radically. However, outside the weak learnability case, the other terms in the bounds here will also incur a penalty of the form e^m for the exponential loss, and there is some evidence that this is unavoidable (see the lower bounds in Mukherjee et al. [10] or the upper bounds in Rätsch et al. [9]).

Next, note how the standard guarantee for coordinate descent methods can lead to guarantees on the progress of the algorithm in terms of dual distances, thanks to γ(A, S).

Proposition 5.2. For any t, A ≠ 0_{m×n}, S ⊇ {−∇f(Aλ_t)} with γ(A, S) > 0, and g ∈ G,

    f(Aλ_{t+1}) − f̄_A ≤ f(Aλ_t) − f̄_A − γ(A, S)² D¹_{S∩Ker(A^T)}(−∇f(Aλ_t))² / (2η f(Aλ_t)).

Proof. The stopping condition grants ∇f(Aλ_t) ∉ Ker(A^T). Thus, by definition of γ(A, S),

    γ(A, S) = inf_{φ∈S\Ker(A^T)} ‖A^T φ‖_∞ / D¹_{S∩Ker(A^T)}(φ) ≤ ‖A^T ∇f(Aλ_t)‖_∞ / D¹_{S∩Ker(A^T)}(−∇f(Aλ_t)).

Combined with a standard guarantee of coordinate descent progress (cf. lemma F.2),

    f(Aλ_t) − f(Aλ_{t+1}) ≥ ‖A^T ∇f(Aλ_t)‖_∞² / (2η f(Aλ_t)) ≥ γ(A, S)² D¹_{S∩Ker(A^T)}(−∇f(Aλ_t))² / (2η f(Aλ_t)).

Subtracting f̄_A from both sides and rearranging yields the statement.

(a) Weak learnability. (b) Attainability. (c) General case.

Figure 2: Viewing the rows {a_i}_{i=1}^m of A as points in R^n, boosting seeks a homogeneous halfspace, parameterized by a normal λ ∈ R^n, which contains all m points. The dual, on the other hand, aims to express the origin as a high entropy convex combination of the rows. The convergence rate and dynamics of this process are controlled by ψ̄_A, which dictates one of the three above scenarios.

Recall the interpretation of boosting closing section 2: boosting seeks a halfspace, parameterized by λ ∈ R^n, which contains the points {a_i}_{i=1}^m. Progress onward from proposition 5.2 will be divided into three cases, each distinguished by the kind of halfspace which boosting can reach.

These cases appear in fig. 2. The first case is weak learnability: positive margins can be attained on each example, meaning a halfspace exists which strictly contains all points. Boosting races to push all these margins unboundedly large, and has a convergence rate O(ln(1/ε)). Next is the case that no halfspace contains the points within its interior: either any such halfspace has the points on its boundary, or no such halfspace exists at all (the degenerate choice λ = 0_n). This is the case of attainability: boosting races towards finite margins at the rate O(ln(1/ε)).

The final situation is a mix of the two: there exists a halfspace with some points on the boundary, some within its interior.
Boosting will try to push some margins to infinity, and keep others finite. These two desires are at odds, and the rate degrades to O(1/ε). Less metaphorically, the analysis will proceed by decomposing this case into the previous two, applying the above analysis in parallel, and then stitching the result back together. It is precisely while stitching up that an incompatibility arises, and the rate degrades. This is no artifact: a lower bound will be shown for the logistic loss.

6 Convergence Rate under Weak Learnability

To start this section, the following result characterizes weak learnability, including the earlier relationship to the dual feasible set (specifically, that it is precisely the origin), and, as analyzed by many authors, the relationship to separability [1, 9, 15].

Theorem 6.1. For any A ∈ R^{m×n} and g ∈ G the following conditions are equivalent:

    ∃λ ∈ R^n : Aλ ∈ R^m_{++},        (6.2)
    inf_{λ∈R^n} f(Aλ) = 0,        (6.3)
    ψ̄_A = 0_m,        (6.4)
    Φ_A = {0_m}.        (6.5)

The equivalence means the presence of any of these properties suffices to indicate weak learnability. The last two statements encode the usual distributional version of the weak learning assumption. The first encodes the fact that there exists a homogeneous halfspace containing all points within its interior; this encodes separability, since removing the factor y_i from the definition of a_i will place all negative points outside the halfspace. Lastly, the second statement encodes the fact that the empirical risk approaches zero.

Theorem 6.6. Suppose there exists λ′ with Aλ′ > 0_m, and g ∈ G; then γ(A, R^m_+) > 0, and for all t,

    f(Aλ_t) − f̄_A ≤ f(Aλ_0) (1 − γ(A, R^m_+)² / (2β²η))^t.

Proof. By theorem 6.1, R^m_+ ∩ Ker(A^T) = Φ_A = {0_m}, which combined with g ≤ β|g′| gives

    D¹_{Φ_A}(−∇f(Aλ_t)) = inf_{ψ∈Φ_A} ‖−∇f(Aλ_t) − ψ‖₁ = ‖∇f(Aλ_t)‖₁ ≥ f(Aλ_t)/β.

Plugging this and f̄_A = 0 (again by theorem 6.1), along with the polyhedron R^m_+ ⊇ −∇f(R^m) (whereby γ(A, R^m_+) > 0 by proposition 4.3), into proposition 5.2 gives

    f(Aλ_{t+1}) ≤ f(Aλ_t) − γ(A, R^m_+)² f(Aλ_t) / (2β²η) = f(Aλ_t) (1 − γ(A, R^m_+)² / (2β²η)),

and recursively applying this inequality yields the result.

Since the present setting is weak learnability, note by (4.1) that the choice of polyhedron R^m_+ grants that γ(A, R^m_+) is exactly the original weak learning rate. When specialized for the exponential loss (where η = β = 1), the bound becomes (1 − γ(A, R^m_+)²/2)^t, which exactly recovers the bound of Schapire and Singer [20], although via different analysis.

In general, solving for t in the expression

    ε = (f(Aλ_t) − f̄_A) / (f(Aλ_0) − f̄_A) ≤ (1 − γ(A, S)²/(2β²η))^t ≤ exp(−γ(A, S)² t / (2β²η))

reveals that t ≤ (2β²η / γ(A, S)²) ln(1/ε) iterations suffice to reach error ε. Recall that β and η, in the case of the logistic loss, have only been bounded by quantities like 2^m. While it is unclear if this analysis of β and η was tight, note that it is plausible that the logistic loss is slower than the exponential loss in this scenario, as it works less in initial phases to correct minor margin violations.

7 Convergence Rate under Attainability

Theorem 7.1. For any A ∈ R^{m×n} and g ∈ G, the following conditions are equivalent:

    ∀λ ∈ R^n : Aλ ∉ R^m_+ \ {0_m},        (7.2)
    f ∘ A has minimizers,        (7.3)
    ψ̄_A ∈ R^m_{++},        (7.4)
    Φ_A ∩ R^m_{++} ≠ ∅.        (7.5)

Interestingly, as revealed in (7.4) and (7.5), attainability entails that the dual has fully interior points, and furthermore that the dual optimum is interior. On the other hand, under weak learnability, eq. (6.4) provided that the dual optimum has zeros at every coordinate.
As will be made clear in section 8, the primal and dual weights have the following dichotomy: either the margin ⟨a_i, λ⟩ goes to infinity and (ψ̄_A)_i goes to zero, or the margin stays finite and (ψ̄_A)_i goes to some positive value.

Theorem 7.6. Suppose A ≠ 0_{m×n}, g ∈ G, and the infimum of eq. (2.1) is attainable. Then there exists a (compact) tightest axis-aligned rectangle C containing the initial level set, and f is strongly convex with modulus c > 0 over C. Finally, γ(A, −∇f(C)) > 0, and for all t,

    f(Aλ_t) − f̄_A ≤ (f(0_m) − f̄_A) (1 − c γ(A, −∇f(C))² / (η f(Aλ_0)))^t.

In other words, t ≤ (η f(Aλ_0) / (c γ(A, −∇f(C))²)) ln(1/ε) iterations suffice to reach error ε. The appearance of a modulus of strong convexity c (i.e., a lower bound on the eigenvalues of the Hessian of f) may seem surprising, and sketching the proof illuminates its appearance and subsequent function.

When the infimum is attainable, every margin ⟨a_i, λ⟩ converges to some finite value. In fact, they all remain bounded: (7.2) provides that no halfspace contains all points, so if one margin becomes positive and large, another becomes negative and large, giving a terrible objective value. But objective values never increase with coordinate descent. To finish the proof, strong convexity (i.e., quadratic lower bounds in the primal) grants quadratic upper bounds in the dual, which can be used to bound the dual distance in proposition 5.2, and yield the desired convergence rate.
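The attainable case can be seen on the smallest possible toy instance; the instance below (a single weak learner which is right on one example and wrong on its mirror image, an assumption of this sketch) has an attained minimizer with finite margins, in contrast to the separable case.

```python
import math

# Sketch: attainability (theorem 7.1) on a hypothetical toy instance.
# One column, two rows a_1 = +1 and a_2 = -1: no halfspace strictly
# contains both points, and the logistic empirical risk attains its
# infimum at finite margins.

def f(lam):
    return math.log(1 + math.exp(-lam)) + math.log(1 + math.exp(lam))

# Scan a coarse grid: the minimum sits at lam = 0 with value 2 ln 2,
# and the objective grows in both directions, so margins stay bounded.
grid = [i / 100 for i in range(-500, 501)]
best = min(grid, key=f)
print(best, round(f(best), 4))  # 0.0 and 2 ln 2, about 1.3863
```

Here the dual optimum is the strictly positive weighting of the two examples, matching (7.4); compare the separable sketch earlier, where the objective could be pushed toward zero forever.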
This approach fails under weak learnability: some primal weights grow unboundedly, all dual weights shrink to zero, and no compact set contains all margins.

8 General Convergence Rate

The final characterization encodes two principles: the rows of A may be partitioned into two matrices A_0, A_+ which respectively satisfy theorem 6.1 and theorem 7.1, and that these two subproblems affect the optimization problem essentially independently.

Theorem 8.1. Let A_0 ∈ R^{z×n}, A_+ ∈ R^{p×n}, and g ∈ G be given. Set m := z + p, and A ∈ R^{m×n} to be the matrix obtained by stacking A_0 on top of A_+. The following conditions are equivalent:

    (∃λ ∈ R^n : A_0λ ∈ R^z_{++} ∧ A_+λ = 0_p) ∧ (∀λ ∈ R^n : A_+λ ∉ R^p_+ \ {0_p}),        (8.2)
    (inf_{λ∈R^n} f(Aλ) = inf_{λ∈R^n} f(A_+λ)) ∧ (inf_{λ∈R^n} f(A_0λ) = 0) ∧ f ∘ A_+ has minimizers,        (8.3)
    ψ̄_A = [ψ̄_{A_0}; ψ̄_{A_+}] with ψ̄_{A_0} = 0_z ∧ ψ̄_{A_+} ∈ R^p_{++},        (8.4)
    (Φ_{A_0} = {0_z}) ∧ (Φ_{A_+} ∩ R^p_{++} ≠ ∅) ∧ (Φ_A = Φ_{A_0} × Φ_{A_+}).        (8.5)

To see that any matrix A falls into one of the three scenarios here, fix a loss function g, and recall from theorem 3.1 that ψ̄_A is unique. In particular, the set of zero entries in ψ̄_A exactly specifies which of the three scenarios hold, the current scenario allowing for simultaneous positive and zero entries. Although this reasoning made use of ψ̄_A, note that it is A which dictates the behavior: in fact, as is shown in remark I.1 of the full version, the decomposition is unique.

Returning to theorem 8.1, the geometry of fig. 2c is provided by (8.2) and (8.5). The analysis will start from (8.3), which allows the primal problem to be split into two pieces, which are then individually handled precisely as in the preceding sections. To finish, (8.5) will allow these pieces to be stitched together.

Theorem 8.6. Suppose A ≠ 0_{m×n}, g ∈ G, ψ̄_A ∉ R^m_{++} ∪ {0_m}, and the notation from theorem 8.1.
Set w := sup_t ‖∇f(A_+λ_t) + P¹_{Φ_{A_+}}(−∇f(A_+λ_t))‖₁. Then w < ∞, and there exists a tightest cube C_+ ⊇ {x ∈ R^p : f(x) ≤ f(Aλ_0)}; let c > 0 be the modulus of strong convexity of f over C_+. Then γ(A, R^z_+ × (−∇f(C_+))) > 0, and for all t,

    f(Aλ_t) − f̄_A ≤ 2 f(Aλ_0) / ((t + 1) min{1, γ(A, R^z_+ × (−∇f(C_+)))² / ((β + w/(2c))² η)}).

(In the case of the logistic loss, w ≤ sup_{x∈R^m} ‖∇f(x)‖₁ ≤ m.)

As discussed previously, the bounds deteriorate to O(1/ε) because the finite and infinite margins sought by the two pieces A_0, A_+ are in conflict. For a beautifully simple, concrete case of this, consider the following matrix, due to Schapire [11]:

    S := [ −1  +1 ]
         [ +1  −1 ]
         [ +1  +1 ].

The optimal solution here is to push both coordinates of λ unboundedly positive, with margins approaching (0, 0, ∞). But pushing any coordinate λ_i too quickly will increase the objective value, rather than decreasing it. In fact, this instance will provide a lower bound, and the mechanism of the proof shows that the primal weights grow extremely slowly, as O(ln(t)).

Theorem 8.7. Using the logistic loss and exact line search, for any t ≥ 1, f(Sλ_t) − f̄_S ≥ 1/(8t).

Acknowledgement

The author thanks Sanjoy Dasgupta, Daniel Hsu, Indraneel Mukherjee, and Robert Schapire for valuable conversations. The NSF supported this work under grants IIS-0713540 and IIS-0812598.

References

[1] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, July 1990.

[2] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.

[3] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997.

[4] Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11:1493–1517, October 1999.

[5] Jerome Friedman, Trevor Hastie, and Robert Tibshirani.
Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 1998.

[6] Peter J. Bickel, Yaacov Ritov, and Alon Zakai. Some theory for generalized boosting algorithms. Journal of Machine Learning Research, 7:705–732, 2006.

[7] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72:7–35, 1992.

[8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.

[9] Gunnar Rätsch, Sebastian Mika, and Manfred K. Warmuth. On the convergence of leveraging. In NIPS, pages 487–494, 2001.

[10] Indraneel Mukherjee, Cynthia Rudin, and Robert Schapire. The convergence rate of AdaBoost. In COLT, 2011.

[11] Robert E. Schapire. The convergence rate of AdaBoost. In COLT, 2010.

[12] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322–330, 1997.

[13] Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In COLT, pages 334–350, 2002.

[14] Manfred K. Warmuth, Karen A. Glocer, and Gunnar Rätsch. Boosting algorithms for maximizing the soft margin. In NIPS, 2007.

[15] Shai Shalev-Shwartz and Yoram Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In COLT, pages 311–322, 2008.

[16] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Functional gradient techniques for combining hypotheses. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221–246, Cambridge, MA, 2000. MIT Press.

[17] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[18] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer Publishing Company, Incorporated, 2001.

[19] Jonathan Borwein and Adrian Lewis. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated, 2000.

[20] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.

[21] George B. Dantzig and Mukund N. Thapa. Linear Programming 2: Theory and Extensions. Springer, 2003.

[22] Adi Ben-Israel. Motzkin's transposition theorem, and the related theorems of Farkas, Gordan and Stiemke. In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement III. 2002.
", "award": [], "sourceid": 913, "authors": [{"given_name": "Matus", "family_name": "Telgarsky", "institution": null}]}