{"title": "The Tradeoffs of Large Scale Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 161, "page_last": 168, "abstract": "This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation--estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.", "full_text": "The Tradeoffs of Large Scale Learning\n\nL\u00e9on Bottou\nNEC laboratories of America\nPrinceton, NJ 08540, USA\nleon@bottou.org\n\nOlivier Bousquet\nGoogle Z\u00fcrich\n8002 Zurich, Switzerland\nolivier.bousquet@m4x.org\n\nAbstract\n\nThis contribution develops a theoretical framework that takes into account the\neffect of approximate optimization on learning algorithms. The analysis shows\ndistinct tradeoffs for the case of small-scale and large-scale learning problems.\nSmall-scale learning problems are subject to the usual approximation\u2013estimation\ntradeoff. Large-scale learning problems are subject to a qualitatively different\ntradeoff involving the computational complexity of the underlying optimization\nalgorithms in non-trivial ways.\n\n1 Motivation\n\nThe computational complexity of learning algorithms has seldom been taken into account by\nlearning theory. Valiant [1] states that a problem is \u201clearnable\u201d when there exists a probably\napproximately correct learning algorithm with polynomial complexity. 
Whereas much progress has been\nmade on the statistical aspect (e.g., [2, 3, 4]), very little has been said about the complexity side of\nthis proposal (e.g., [5]).\n\nComputational complexity becomes the limiting factor when one envisions large amounts of training\ndata. Two important examples come to mind:\n\n\u2022 Data mining exists because competitive advantages can be achieved by analyzing the\nmasses of data that describe the life of our computerized society. Since virtually every\ncomputer generates data, the data volume is proportional to the available computing power.\nTherefore one needs learning algorithms that scale roughly linearly with the total volume\nof data.\n\n\u2022 Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our\nbiological brains can learn quite efficiently from the continuous streams of perceptual data\ngenerated by our six senses, using limited amounts of sugar as a source of power. This\nobservation suggests that there are learning algorithms whose computing time requirements\nscale roughly linearly with the total volume of data.\n\nThis contribution finds its source in the idea that approximate optimization algorithms might be\nsufficient for learning purposes. The first part proposes a new decomposition of the test error where\nan additional term represents the impact of approximate optimization. In the case of small-scale\nlearning problems, this decomposition reduces to the well known tradeoff between approximation\nerror and estimation error. In the case of large-scale learning problems, the tradeoff is more\ncomplex because it involves the computational complexity of the learning algorithm. The second part\nexplores the asymptotic properties of the large-scale learning tradeoff for various prototypical\nlearning algorithms under various assumptions regarding the statistical estimation rates associated with\nthe chosen objective functions. 
This part clearly shows that the best optimization algorithms are not\nnecessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well\nregardless of the assumed rate for the statistical estimation error.\n\n2 Approximate Optimization\n\n2.1 Setup\n\nFollowing [6, 2], we consider a space of input-output pairs (x, y) \u2208 X \u00d7 Y endowed with a\nprobability distribution P (x, y). The conditional distribution P (y|x) represents the unknown relationship\nbetween inputs and outputs. The discrepancy between the predicted output \u0177 and the real output\ny is measured with a loss function \u2113(\u0177, y). Our benchmark is the function f* that minimizes the\nexpected risk\n\n$$E(f) = \\int \\ell(f(x), y) \\, dP(x, y) = \\mathbb{E}[\\ell(f(x), y)],$$\n\nthat is,\n\n$$f^*(x) = \\arg\\min_{\\hat{y}} \\mathbb{E}[\\ell(\\hat{y}, y) \\mid x].$$\n\nAlthough the distribution P (x, y) is unknown, we are given a sample S of n independently drawn\ntraining examples (xi, yi), i = 1 . . . n. We define the empirical risk\n\n$$E_n(f) = \\frac{1}{n} \\sum_{i=1}^{n} \\ell(f(x_i), y_i) = \\mathbb{E}_n[\\ell(f(x), y)].$$\n\nOur first learning principle consists in choosing a family F of candidate prediction functions and\nfinding the function fn = arg min_{f \u2208 F} En(f ) that minimizes the empirical risk. Well known\ncombinatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently\nrestrictive. Since the optimal function f* is unlikely to belong to the family F, we also define\nf*_F = arg min_{f \u2208 F} E(f ). For simplicity, we assume that f*, f*_F and fn are well defined and unique.\nWe can then decompose the excess error as\n\n$$\\mathbb{E}[E(f_n) - E(f^*)] = \\mathbb{E}[E(f^*_{\\mathcal{F}}) - E(f^*)] + \\mathbb{E}[E(f_n) - E(f^*_{\\mathcal{F}})] = \\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{est}}, \\qquad (1)$$\n\nwhere the expectation is taken with respect to the random choice of training set. 
The approximation\nerror Eapp measures how closely functions in F can approximate the optimal solution f*. The\nestimation error Eest measures the effect of minimizing the empirical risk En(f ) instead of the\nexpected risk E(f ). The estimation error is determined by the number of training examples and by\nthe capacity of the family of functions [2]. Large families\u00b9 of functions have smaller approximation\nerrors but lead to higher estimation errors. This tradeoff has been extensively discussed in the\nliterature [2, 3] and leads to excess errors that scale between the inverse and the inverse square root of\nthe number of examples [7, 8].\n\n2.2 Optimization Error\n\nFinding fn by minimizing the empirical risk En(f ) is often a computationally expensive operation.\nSince the empirical risk En(f ) is already an approximation of the expected risk E(f ), it should\nnot be necessary to carry out this minimization with great accuracy. For instance, we could stop an\niterative optimization algorithm long before its convergence.\nLet us assume that our minimization algorithm returns an approximate solution f\u0303n such that\n\n$$E_n(\\tilde{f}_n) < E_n(f_n) + \\rho$$\n\nwhere \u03c1 \u2265 0 is a predefined tolerance. An additional term Eopt = E[E(f\u0303n) \u2212 E(fn)] then appears\nin the decomposition of the excess error E = E[E(f\u0303n) \u2212 E(f*)]:\n\n$$\\mathcal{E} = \\mathbb{E}[E(f^*_{\\mathcal{F}}) - E(f^*)] + \\mathbb{E}[E(f_n) - E(f^*_{\\mathcal{F}})] + \\mathbb{E}[E(\\tilde{f}_n) - E(f_n)] = \\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{opt}}. \\qquad (2)$$\n\nWe call this additional term the optimization error. It reflects the impact of the approximate optimization\non the generalization performance. Its magnitude is comparable to \u03c1 (see section 3.1).\n\n\u00b9 We often consider nested families of functions of the form Fc = {f \u2208 H, \u2126(f ) \u2264 c}. 
Then, for each\nvalue of c, function fn is obtained by minimizing the regularized empirical risk En(f ) + \u03bb\u2126(f ) for a suitable\nchoice of the Lagrange coefficient \u03bb. We can then control the estimation-approximation tradeoff by choosing\n\u03bb instead of c.\n\n2.3 The Approximation\u2013Estimation\u2013Optimization Tradeoff\n\nThis decomposition leads to a more complicated compromise. It involves three variables and two\nconstraints. The constraints are the maximal number of available training examples and the maximal\ncomputation time. The variables are the size of the family of functions F, the optimization accuracy\n\u03c1, and the number of examples n. This is formalized by the following optimization problem:\n\n$$\\min_{\\mathcal{F}, \\rho, n} \\; \\mathcal{E} = \\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{opt}} \\quad \\text{subject to} \\quad n \\le n_{\\max}, \\;\\; T(\\mathcal{F}, \\rho, n) \\le T_{\\max}. \\qquad (3)$$\n\nThe number n of training examples is a variable because we could choose to use only a subset of\nthe available training examples in order to complete the optimization within the allotted time. This\nhappens often in practice. Table 1 summarizes the typical evolution of the quantities of interest when\nthe three variables F, n, and \u03c1 increase.\n\nTable 1: Typical variations when F, n, and \u03c1 increase.\n\n                              F      n      \u03c1\nEapp (approximation error)    \u2198\nEest (estimation error)       \u2197      \u2198\nEopt (optimization error)     \u00b7 \u00b7 \u00b7  \u00b7 \u00b7 \u00b7  \u2197\nT    (computation time)       \u2197      \u2197      \u2198\n\nThe solution of the optimization program (3) depends critically on which budget constraint is active:\nconstraint n < nmax on the number of examples, or constraint T < Tmax on the training time.\n\n\u2022 We speak of a small-scale learning problem when (3) is constrained by the maximal number\nof examples nmax. Since the computing time is not limited, we can reduce the optimization\nerror Eopt to insignificant levels by choosing \u03c1 arbitrarily small. 
The excess error is then\ndominated by the approximation and estimation errors, Eapp and Eest. Taking n = nmax,\nwe recover the approximation-estimation tradeoff that is the object of abundant literature.\n\n\u2022 We speak of a large-scale learning problem when (3) is constrained by the maximal\ncomputing time Tmax. Approximate optimization, that is, choosing \u03c1 > 0, can possibly achieve\nbetter generalization because more training examples can be processed during the allowed\ntime. The specifics depend on the computational properties of the chosen optimization\nalgorithm through the expression of the computing time T (F, \u03c1, n).\n\n3 The Asymptotics of Large-scale Learning\n\nIn the previous section, we have extended the classical approximation-estimation tradeoff by taking\ninto account the optimization error. We have given an objective criterion to distinguish small-scale\nand large-scale learning problems. In the small-scale case, we recover the classical tradeoff between\napproximation and estimation. The large-scale case is substantially different because it involves\nthe computational complexity of the learning algorithm. In order to clarify the large-scale learning\ntradeoff with sufficient generality, this section makes several simplifications:\n\n\u2022 We are studying upper bounds of the approximation, estimation, and optimization\nerrors (2). It is often accepted that these upper bounds give a realistic idea of the actual\nconvergence rates [9, 10, 11, 12]. 
Another way to find comfort in this approach is to say\nthat we study guaranteed convergence rates instead of the possibly pathological special\ncases.\n\n\u2022 We are studying the asymptotic properties of the tradeoff when the problem size increases.\nInstead of carefully balancing the three terms, we write E = O(Eapp) + O(Eest) + O(Eopt)\nand only need to ensure that the three terms decrease with the same asymptotic rate.\n\n\u2022 We are considering a fixed family of functions F and therefore avoid taking into account\nthe approximation error Eapp. This part of the tradeoff covers a wide spectrum of practical\nrealities such as choosing models and choosing features. In the context of this work, we do\nnot believe we can meaningfully address this without discussing, for instance, the thorny\nissue of feature selection. Instead we focus on the choice of optimization algorithm.\n\n\u2022 Finally, in order to keep this paper short, we consider that the family of functions F is\nlinearly parametrized by a vector w \u2208 R^d. We also assume that x, y and w are bounded,\nensuring that there is a constant B such that 0 \u2264 \u2113(fw(x), y) \u2264 B and \u2113(\u00b7, y) is Lipschitz.\n\nWe first explain how the uniform convergence bounds provide convergence rates that take the\noptimization error into account. Then we discuss and compare the asymptotic learning properties of\nseveral optimization algorithms.\n\n3.1 Convergence of the Estimation and Optimization Errors\n\nThe optimization error Eopt depends directly on the optimization accuracy \u03c1. However, the accuracy\n\u03c1 involves the empirical quantity En(f\u0303n) \u2212 En(fn), whereas the optimization error Eopt involves\nits expected counterpart E(f\u0303n) \u2212 E(fn). 
This section discusses the impact of the optimization\nerror Eopt and of the optimization accuracy \u03c1 on generalization bounds that leverage the uniform\nconvergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2]).\nIn this discussion, we use the letter c to refer to any positive constant. Multiple occurrences of the\nletter c do not necessarily imply that the constants have identical values.\n\n3.1.1 Simple Uniform Convergence Bounds\n\nRecall that we assume that F is linearly parametrized by w \u2208 R^d. Elementary uniform convergence\nresults then state that\n\n$$\\mathbb{E}\\left[ \\sup_{f \\in \\mathcal{F}} |E(f) - E_n(f)| \\right] \\le c \\sqrt{\\frac{d}{n}},$$\n\nwhere the expectation is taken with respect to the random choice of the training set.\u00b2 This result\nimmediately provides a bound on the estimation error:\n\n$$\\mathcal{E}_{\\mathrm{est}} = \\mathbb{E}\\left[ \\big( E(f_n) - E_n(f_n) \\big) + \\big( E_n(f_n) - E_n(f^*_{\\mathcal{F}}) \\big) + \\big( E_n(f^*_{\\mathcal{F}}) - E(f^*_{\\mathcal{F}}) \\big) \\right] \\le 2 \\, \\mathbb{E}\\left[ \\sup_{f \\in \\mathcal{F}} |E(f) - E_n(f)| \\right] \\le c \\sqrt{\\frac{d}{n}}.$$\n\nThis same result also provides a combined bound for the estimation and optimization errors:\n\n$$\\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{opt}} = \\mathbb{E}[E(\\tilde{f}_n) - E_n(\\tilde{f}_n)] + \\mathbb{E}[E_n(\\tilde{f}_n) - E_n(f_n)] + \\mathbb{E}[E_n(f_n) - E_n(f^*_{\\mathcal{F}})] + \\mathbb{E}[E_n(f^*_{\\mathcal{F}}) - E(f^*_{\\mathcal{F}})]$$\n$$\\le c \\sqrt{\\frac{d}{n}} + \\rho + 0 + c \\sqrt{\\frac{d}{n}} = c \\left( \\rho + \\sqrt{\\frac{d}{n}} \\right).$$\n\nUnfortunately, this convergence rate is known to be pessimistic in many important cases. 
More\nsophisticated bounds are required.\n\n3.1.2 Faster Rates in the Realizable Case\n\nWhen the loss function \u2113(\u0177, y) is positive, with probability 1 \u2212 e^(\u2212\u03c4) for any \u03c4 > 0, relative uniform\nconvergence bounds state that\n\n$$\\sup_{f \\in \\mathcal{F}} \\frac{E(f) - E_n(f)}{\\sqrt{E(f)}} \\le c \\sqrt{\\frac{d}{n} \\log \\frac{n}{d} + \\frac{\\tau}{n}}.$$\n\nThis result is very useful because it provides faster convergence rates O(log n/n) in the realizable\ncase, that is when \u2113(fn(xi), yi) = 0 for all training examples (xi, yi). We have then En(fn) = 0,\nEn(f\u0303n) \u2264 \u03c1, and we can write\n\n$$E(\\tilde{f}_n) - \\rho \\le c \\sqrt{E(\\tilde{f}_n)} \\sqrt{\\frac{d}{n} \\log \\frac{n}{d} + \\frac{\\tau}{n}}.$$\n\nViewing this as a second degree polynomial inequality in variable \u221aE(f\u0303n), we obtain\n\n$$E(\\tilde{f}_n) \\le c \\left( \\rho + \\frac{d}{n} \\log \\frac{n}{d} + \\frac{\\tau}{n} \\right).$$\n\nIntegrating this inequality using a standard technique (see, e.g., [13]), we obtain a better convergence\nrate of the combined estimation and optimization error:\n\n$$\\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{opt}} = \\mathbb{E}[E(\\tilde{f}_n) - E(f^*_{\\mathcal{F}})] \\le \\mathbb{E}[E(\\tilde{f}_n)] = c \\left( \\rho + \\frac{d}{n} \\log \\frac{n}{d} \\right).$$\n\n\u00b2 Although the original Vapnik-Chervonenkis bounds have the form c \u221a((d/n) log(n/d)), the logarithmic term can\nbe eliminated using the \u201cchaining\u201d technique (e.g., [10]).\n\n3.1.3 Fast Rate Bounds\n\nMany authors (e.g., [10, 4, 12]) obtain fast statistical estimation rates in more general conditions.\nThese bounds have the general form\n\n$$\\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{est}} \\le c \\left( \\mathcal{E}_{\\mathrm{app}} + \\left( \\frac{d}{n} \\log \\frac{n}{d} \\right)^{\\alpha} \\right) \\quad \\text{for} \\quad \\frac{1}{2} \\le \\alpha \\le 1. \\qquad (4)$$\n\nThis result holds when one can establish the following variance condition:\n\n$$\\forall f \\in \\mathcal{F} \\quad \\mathbb{E}\\left[ \\big( \\ell(f(X), Y) - \\ell(f^*_{\\mathcal{F}}(X), Y) \\big)^2 \\right] \\le c \\left( E(f) - E(f^*_{\\mathcal{F}}) \\right)^{2 - \\frac{1}{\\alpha}}. \\qquad (5)$$\n\nThe convergence rate of (4) is described by the exponent \u03b1 which is determined by the 
quality of\nthe variance bound (5). Works on fast statistical estimation identify two main ways to establish such\na variance condition.\n\n\u2022 Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee\net al. [14] establish an O(log n/n) rate using the squared loss \u2113(\u0177, y) = (\u0177 \u2212 y)\u00b2.\n\n\u2022 Making assumptions on the data distribution. In the case of pattern recognition problems,\nfor instance, the \u201cTsybakov condition\u201d indicates how cleanly the posterior distributions\nP (y|x) cross near the optimal decision boundary [11, 12]. The realizable case discussed in\nsection 3.1.2 can be viewed as an extreme case of this.\n\nDespite their much greater complexity, fast rate estimation results can accommodate the optimization\naccuracy \u03c1 using essentially the methods illustrated in sections 3.1.1 and 3.1.2. We then obtain a\nbound of the form\n\n$$\\mathcal{E} = \\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{opt}} = \\mathbb{E}[E(\\tilde{f}_n) - E(f^*)] \\le c \\left( \\mathcal{E}_{\\mathrm{app}} + \\left( \\frac{d}{n} \\log \\frac{n}{d} \\right)^{\\alpha} + \\rho \\right). \\qquad (6)$$\n\nFor instance, a general result with \u03b1 = 1 is provided by Massart [13, theorem 4.2]. Combining this\nresult with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the\nfollowing result:\n\n$$\\mathcal{E} = \\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{opt}} = \\mathbb{E}[E(\\tilde{f}_n) - E(f^*)] \\le c \\left( \\mathcal{E}_{\\mathrm{app}} + \\frac{d}{n} \\log \\frac{n}{d} + \\rho \\right). \\qquad (7)$$\n\nSee also [15, 4] for more bounds taking into account the optimization accuracy.\n\n3.2 Gradient Optimization Algorithms\n\nWe now discuss and compare the asymptotic learning properties of four gradient optimization\nalgorithms. Recall that the family of functions F is linearly parametrized by w \u2208 R^d. Let w*_F and wn\ncorrespond to the functions f*_F and fn defined in section 2.1. 
In this section, we assume that the\nfunctions w \u21a6 \u2113(fw(x), y) are convex and twice differentiable with continuous second derivatives.\nConvexity ensures that the empirical cost function C(w) = En(fw) has a single minimum.\nTwo matrices play an important role in the analysis: the Hessian matrix H and the gradient\ncovariance matrix G, both measured at the empirical optimum wn.\n\n$$H = \\frac{\\partial^2 C}{\\partial w^2}(w_n) = \\mathbb{E}_n\\left[ \\frac{\\partial^2 \\ell(f_{w_n}(x), y)}{\\partial w^2} \\right], \\qquad (8)$$\n\n$$G = \\mathbb{E}_n\\left[ \\left( \\frac{\\partial \\ell(f_{w_n}(x), y)}{\\partial w} \\right) \\left( \\frac{\\partial \\ell(f_{w_n}(x), y)}{\\partial w} \\right)' \\right]. \\qquad (9)$$\n\nThe relation between these two matrices depends on the chosen loss function. In order to summarize\nthem, we assume that there are constants \u03bbmax \u2265 \u03bbmin > 0 and \u03bd > 0 such that, for any \u03b7 > 0,\nwe can choose the number of examples n large enough to ensure that the following assertion is true\nwith probability greater than 1 \u2212 \u03b7:\n\n$$\\mathrm{tr}(G H^{-1}) \\le \\nu \\quad \\text{and} \\quad \\mathrm{EigenSpectrum}(H) \\subset [\\lambda_{\\min}, \\lambda_{\\max}] \\qquad (10)$$\n\nThe condition number \u03ba = \u03bbmax/\u03bbmin is a good indicator of the difficulty of the optimization [16].\nThe condition \u03bbmin > 0 avoids complications with stochastic gradient algorithms. Note that this\ncondition only implies strict convexity around the optimum. For instance, consider a loss\nfunction \u2113 obtained by smoothing the well known hinge loss \u2113(z, y) = max{0, 1 \u2212 yz} in a small\nneighborhood of its non-differentiable points. Function C(w) is then piecewise linear with smoothed\nedges and vertices. It is not strictly convex. However its minimum is likely to be on a smoothed\nvertex with a non singular Hessian. When we have strict convexity, the argument of [12, theorem 12]\nyields fast estimation rates \u03b1 \u2248 1 in (4) and (6). 
This is not necessarily the case here.\nThe four algorithms considered in this paper use information about the gradient of the cost function\nto iteratively update their current estimate w(t) of the parameter vector.\n\n\u2022 Gradient Descent (GD) iterates\n\n$$w(t+1) = w(t) - \\eta \\frac{\\partial C}{\\partial w}(w(t)) = w(t) - \\eta \\frac{1}{n} \\sum_{i=1}^{n} \\frac{\\partial}{\\partial w} \\ell(f_{w(t)}(x_i), y_i)$$\n\nwhere \u03b7 > 0 is a small enough gain. GD is an algorithm with linear convergence [16].\nWhen \u03b7 = 1/\u03bbmax, this algorithm requires O(\u03ba log(1/\u03c1)) iterations to reach accuracy \u03c1.\nThe exact number of iterations depends on the choice of the initial parameter vector.\n\n\u2022 Second Order Gradient Descent (2GD) iterates\n\n$$w(t+1) = w(t) - H^{-1} \\frac{\\partial C}{\\partial w}(w(t)) = w(t) - \\frac{1}{n} H^{-1} \\sum_{i=1}^{n} \\frac{\\partial}{\\partial w} \\ell(f_{w(t)}(x_i), y_i)$$\n\nwhere matrix H\u207b\u00b9 is the inverse of the Hessian matrix (8). This is more favorable than\nNewton\u2019s algorithm because we do not evaluate the local Hessian at each iteration but\nsimply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear\noptimization algorithm with quadratic convergence [16]. When the cost is quadratic, a\nsingle iteration is sufficient. In the general case, O(log log(1/\u03c1)) iterations are required to\nreach accuracy \u03c1.\n\n\u2022 Stochastic Gradient Descent (SGD) picks a random training example (xt, yt) at each\niteration and updates the parameter w on the basis of this example only,\n\n$$w(t+1) = w(t) - \\frac{\\eta}{t} \\frac{\\partial}{\\partial w} \\ell(f_{w(t)}(x_t), y_t).$$\n\nMurata [17, section 2.2] characterizes the mean ES[w(t)] and variance VarS[w(t)] with\nrespect to the distribution implied by the random examples drawn from the training set S at\neach iteration. 
Applying this result to the discrete training set distribution for \u03b7 = 1/\u03bbmin,\nwe have \u03b4w(t)\u00b2 = O(1/t) where \u03b4w(t) is a shorthand notation for w(t) \u2212 wn.\nWe can then write\n\n$$\\mathbb{E}_S[C(w(t)) - \\inf C] = \\mathbb{E}_S\\left[ \\mathrm{tr}\\big( H \\, \\delta w(t) \\, \\delta w(t)' \\big) \\right] + o\\left( \\frac{1}{t} \\right)$$\n$$= \\mathrm{tr}\\big( H \\, \\mathbb{E}_S[\\delta w(t)] \\, \\mathbb{E}_S[\\delta w(t)]' + H \\, \\mathrm{Var}_S[w(t)] \\big) + o\\left( \\frac{1}{t} \\right)$$\n$$\\le \\frac{\\mathrm{tr}(GH)}{t} + o\\left( \\frac{1}{t} \\right) \\le \\frac{\\nu \\kappa^2}{t} + o\\left( \\frac{1}{t} \\right). \\qquad (11)$$\n\nTherefore the SGD algorithm reaches accuracy \u03c1 after less than \u03bd\u03ba\u00b2/\u03c1 + o(1/\u03c1) iterations\non average. The SGD convergence is essentially limited by the stochastic noise induced\nby the random choice of one example at each iteration. Neither the initial value of the\nparameter vector w nor the total number of examples n appear in the dominant term of this\nbound! When the training set is large, one could reach the desired accuracy \u03c1 measured on\nthe whole training set without even visiting all the training examples. This is in fact a kind\nof generalization bound.\n\n\u2022 Second Order Stochastic Gradient Descent (2SGD) replaces the gain \u03b7 by the inverse of\nthe Hessian matrix H:\n\n$$w(t+1) = w(t) - \\frac{1}{t} H^{-1} \\frac{\\partial}{\\partial w} \\ell(f_{w(t)}(x_t), y_t).$$\n\nUnlike standard gradient algorithms, using the second order information does not change\nthe influence of \u03c1 on the convergence rate but improves the constants. Using again [17,\ntheorem 4], accuracy \u03c1 is reached after \u03bd/\u03c1 + o(1/\u03c1) iterations.\n\n
Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second-to-last\ncolumn (time to optimize) with the last column (time to reach the excess test error \u03b5).\nLegend: n number of examples; d parameter dimension; \u03ba, \u03bd see equation (10).\n\nAlgorithm  Cost of one   Iterations          Time to reach                 Time to reach\n           iteration     to reach \u03c1          accuracy \u03c1                    E \u2264 c (Eapp + \u03b5)\nGD         O(nd)         O(\u03ba log(1/\u03c1))       O(nd\u03ba log(1/\u03c1))               O((d\u00b2\u03ba/\u03b5^(1/\u03b1)) log\u00b2(1/\u03b5))\n2GD        O(d\u00b2 + nd)    O(log log(1/\u03c1))     O((d\u00b2 + nd) log log(1/\u03c1))    O((d\u00b2/\u03b5^(1/\u03b1)) log(1/\u03b5) log log(1/\u03b5))\nSGD        O(d)          \u03bd\u03ba\u00b2/\u03c1 + o(1/\u03c1)      O(d\u03bd\u03ba\u00b2/\u03c1)                     O(d\u03bd\u03ba\u00b2/\u03b5)\n2SGD       O(d\u00b2)         \u03bd/\u03c1 + o(1/\u03c1)        O(d\u00b2\u03bd/\u03c1)                      O(d\u00b2\u03bd/\u03b5)\n\nFor each of the four gradient algorithms, the first three columns of table 2 report the time for a single\niteration, the number of iterations needed to reach a predefined accuracy \u03c1, and their product, the\ntime needed to reach accuracy \u03c1. 
These asymptotic results are valid with probability 1, since the\nprobability of their complement is smaller than \u03b7 for any \u03b7 > 0.\nThe fourth column bounds the time necessary to reduce the excess error E below c (Eapp + \u03b5) where c\nis the constant from (6). This is computed by observing that choosing \u03c1 \u223c ((d/n) log(n/d))^\u03b1 in (6) achieves\nthe fastest rate for \u03b5, with minimal computation time. We can then use the asymptotic equivalences\n\u03c1 \u223c \u03b5 and n \u223c (d/\u03b5^(1/\u03b1)) log(1/\u03b5). Setting the fourth column expressions to Tmax and solving for \u03b5 yields\nthe best excess error achieved by each algorithm within the limited time Tmax. This provides the\nasymptotic solution of the Estimation\u2013Optimization tradeoff (3) for large scale problems satisfying\nour assumptions.\n\nThese results clearly show that the generalization performance of large-scale learning systems\ndepends on both the statistical properties of the estimation procedure and the computational properties\nof the chosen optimization algorithm. Their combination leads to surprising consequences:\n\n\u2022 The SGD and 2SGD results do not depend on the estimation rate \u03b1. When the estimation\nrate is poor, there is less need to optimize accurately. That leaves time to process more\nexamples. A potentially more useful interpretation leverages the fact that (11) is already a\nkind of generalization bound: its fast rate trumps the slower rate assumed for the estimation\nerror.\n\n\u2022 Second order algorithms bring little asymptotic improvement in \u03b5. Although the\nsuperlinear 2GD algorithm improves the logarithmic term, all four algorithms are dominated by\nthe polynomial term in (1/\u03b5). However, there are important variations in the influence of\nthe constants d, \u03ba and \u03bd. 
These constants are very important in practice.\n\n\u2022 Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite\nbeing the worst optimization algorithms. This had been described before [18] and observed\nin experiments.\n\nIn contrast, since the optimization error Eopt of small-scale learning systems can be reduced to\ninsignificant levels, their generalization performance is solely determined by the statistical properties\nof their estimation procedure.\n\n4 Conclusion\n\nTaking into account budget constraints on both the number of examples and the computation time,\nwe find qualitative differences between the generalization performance of small-scale learning\nsystems and large-scale learning systems. The generalization properties of large-scale learning systems\ndepend on both the statistical properties of the estimation procedure and the computational\nproperties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient\nalgorithms.\n\nConsiderable refinements of this framework can be expected. Extending the analysis to\nregularized risk formulations would make results on the complexity of primal and dual optimization\nalgorithms [19, 20] directly exploitable. The choice of surrogate loss function [7, 12] could also have a\nnon-trivial impact in the large-scale case.\n\nAcknowledgments Part of this work was funded by NSF grant CCR-0325463.\n\nReferences\n\n[1] Leslie G. Valiant. A theory of the learnable. Proc. of the 1984 STOC, pages 436\u2013445, 1984.\n[2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics.\nSpringer-Verlag, Berlin, 1982.\n[3] St\u00e9phane Boucheron, Olivier Bousquet, and G\u00e1bor Lugosi. Theory of classification: a survey of recent\nadvances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005.\n[4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. 
Probability Theory and Related Fields,\n135(3):311\u2013334, 2006.\n[5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177\u2013192,\n1988.\n[6] Richard O. Duda and Peter E. Hart. Pattern Classification And Scene Analysis. Wiley and Son, 1973.\n[7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk\nminimization. The Annals of Statistics, 32:56\u201385, 2004.\n[8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir,\neditors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture\nNotes in Computer Science, pages 279\u2013294, Bertinoro, Italy, June 2005. Springer-Verlag.\n[9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine.\nNeural Computation, 6(5):851\u2013876, 1994.\n[10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of\nLearning Algorithms. PhD thesis, Ecole Polytechnique, 2002.\n[11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics,\n32(1), 2004.\n[12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds.\nJournal of the American Statistical Association, 101(473):138\u2013156, March 2006.\n[13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Facult\u00e9 des\nSciences de Toulouse, series 6, 9(2):245\u2013303, 2000.\n[14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with\nsquared loss. IEEE Transactions on Information Theory, 44(5):1974\u20131980, 1998.\n[15] Shahar Mendelson. A few notes on statistical learning theory. 
In Shahar Mendelson and Alexander J.\nSmola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer\nScience, pages 1\u201340. Springer-Verlag, Berlin, 2003.\n[16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods For Unconstrained Optimization and\nNonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.\n[17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural\nNetworks. Cambridge University Press, Cambridge, UK, 1998.\n[18] L\u00e9on Bottou and Yann Le Cun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul,\nand Bernhard Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press,\nCambridge, MA, 2004.\n[19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD\u201906, Philadelphia, PA,\nUSA, August 20-23 2006. ACM.\n[20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and\nrun time for support vector machines. Journal of Machine Learning Research, 7:733\u2013769, 2006.\n", "award": [], "sourceid": 726, "authors": [{"given_name": "L\u00e9on", "family_name": "Bottou", "institution": ""}, {"given_name": "Olivier", "family_name": "Bousquet", "institution": ""}]}