{"title": "Regularized Nonlinear Acceleration", "book": "Advances in Neural Information Processing Systems", "page_first": 712, "page_last": 720, "abstract": "We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple and small linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.", "full_text": "Regularized Nonlinear Acceleration\n\nDamien Scieur\n\nINRIA & D.I., UMR 8548,\n\nAlexandre d\u2019Aspremont\nCNRS & D.I., UMR 8548,\n\n\u00c9cole Normale Sup\u00e9rieure, Paris, France.\n\ndamien.scieur@inria.fr\n\n\u00c9cole Normale Sup\u00e9rieure, Paris, France.\n\naspremon@di.ens.fr\n\nFrancis Bach\n\nINRIA & D.I., UMR 8548,\n\n\u00c9cole Normale Sup\u00e9rieure, Paris, France.\n\nfrancis.bach@inria.fr\n\nAbstract\n\nWe describe a convergence acceleration technique for generic optimization prob-\nlems. Our scheme computes estimates of the optimum from a nonlinear average\nof the iterates produced by any optimization method. The weights in this average\nare computed via a simple and small linear system, whose solution can be updated\nonline. This acceleration scheme runs in parallel to the base algorithm, provid-\ning improved estimates of the solution on the \ufb02y, while the original optimization\nmethod is running. 
Numerical experiments are detailed on classical classification\nproblems.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nSuppose we want to solve the following optimization problem\n\n$\\min_{x \\in \\mathbb{R}^n} f(x)$ (1)\n\nin the variable x \u2208 Rn, where f (x) is strongly convex with respect to the Euclidean norm with\nparameter \u00b5, and has a Lipschitz continuous gradient with parameter L with respect to the same norm.\nThis class of functions is often encountered, for example in regression where f (x) is of the form\n\nf (x) = L(x) + \u2126(x),\n\nwhere L(x) is a smooth convex loss function and \u2126(x) is a smooth strongly convex penalty function.\nAssume we solve this problem using an iterative algorithm of the form\n\n$x_{i+1} = g(x_i)$, for $i = 1, \\ldots, k$, (2)\n\nwhere xi \u2208 Rn and k is the number of iterations. Here, we will focus on the problem of estimating the\nsolution to (1) by tracking only the sequence of iterates xi produced by an optimization algorithm.\nThis will lead to an acceleration of the speed of convergence, since we will be able to extrapolate\nmore accurate solutions without any calls to the oracle g(x).\nSince the publication of Nesterov\u2019s optimal first-order smooth convex minimization algorithm [1], a\nsignificant effort has been focused on either providing more explicit interpretable views on current\nacceleration techniques, or on replicating these complexity gains using different, more intuitive\nschemes. Early efforts were focused on directly extending the original acceleration result in [1] to\nbroader function classes [2], allowing for generic metrics, line searches or simpler proofs [5, 6], producing\nadaptive accelerated algorithms [7], etc. More recently however, several authors [8, 9] have started\nusing classical results from control theory to obtain numerical bounds on convergence rates that\nmatch the optimal rates. 
Others have studied the second order ODEs obtained as the limit for small\nstep sizes of classical accelerated schemes, to better understand their convergence [10, 11]. Finally,\nrecent results have also shown how to wrap classical algorithms in an outer optimization loop, to\naccelerate convergence [12] and reach optimal complexity bounds.\nHere, we take a signi\ufb01cantly different approach to convergence acceleration stemming from classical\nresults in numerical analysis. We use the iterates produced by any (converging) optimization algorithm,\nand estimate the solution directly from this sequence, assuming only some regularity conditions on the\nfunction to minimize. Our scheme is based on the idea behind Aitken\u2019s \u22062 algorithm [13], generalized\nas the Shanks transform [14], whose recursive formulation is known as the \u03b5-algorithm [15] (see e.g.\n[16, 17] for a survey). In a nutshell, these methods \ufb01t geometrical models to linearly converging\nsequences, then extrapolate their limit from the \ufb01tted model.\nIn a sense, this approach is more statistical in nature. It assumes an approximately linear model\nholds for iterations near the optimum, and estimates this model using the iterates. In fact, Wynn\u2019s\nalgorithm [15] is directly connected to the Levinson-Durbin algorithm [18, 19] used to solve Toeplitz\nsystems recursively and \ufb01t autoregressive models (the Shanks transform solves Hankel systems, but\nthis is essentially the same problem [20]). The key difference here is that estimating the autocovariance\noperator is not required, as we only focus on the limit. 
Moreover, the method presents strong links\nwith the conjugate gradient when applied to unconstrained quadratic optimization.\nWe start from a slightly different formulation of these techniques known as minimal polynomial\nextrapolation (MPE) [17, 21] which uses the minimal polynomial of the linear operator driving\niterations to estimate the optimum by nonlinear averaging (i.e., using weights in the average which are\nnonlinear functions of the iterates). So far, for all the techniques cited above, no proofs of convergence\nof these estimates were given in the case where the iterates made the estimation process unstable.\nOur contribution here is to add a regularization in order to produce explicit bounds on the distance to\noptimality by controlling the stability through the regularization parameter, thus explicitly quantifying\nthe acceleration provided by these techniques. We show in several numerical examples that\nthese stabilized estimates often speed up convergence by an order of magnitude. Furthermore, this\nacceleration scheme runs in parallel to the original algorithm, providing improved estimates of\nthe solution on the fly, while the original method is progressing.\nThe paper is organized as follows. In section 2.1 we recall basic results behind MPE for linear\niterations, and in section 2.2 we introduce a formulation of the approximate version of MPE and\nmake a link with the conjugate gradient method. Then, in section 2.3, we generalize these results to\ngeneric nonlinear iterations and show, in section 2.4, how to fully control the impact of nonlinearity.\nWe use these results to derive explicit bounds on the acceleration performance of our estimates.\n\n2 Approximate Minimal Polynomial Extrapolation\n\nIn what follows, we recall the key arguments behind minimal polynomial extrapolation (MPE) as\nderived in [22] or also [21]. 
We also explain a variant called approximate minimal polynomial\nextrapolation (AMPE) which allows us to control the number of iterates used in the extrapolation, and hence\nreduces its computational complexity. We begin with a simple description of the method for linear\niterations, then extend these results to the generic nonlinear case. Finally, we fully characterize the\nacceleration factor provided by a regularized version of AMPE, using regularity properties of the\nfunction f (x), and the result of a Chebyshev-like, tractable polynomial optimization problem.\n\n2.1 Linear Iterations\n\nHere, we assume that the iterative algorithm in (2) is in fact linear, with\n\n$x_i = A(x_{i-1} - x^*) + x^*$, (3)\n\nwhere A \u2208 Rn\u00d7n (not necessarily symmetric) and x\u2217 \u2208 Rn. We assume that 1 is not an eigenvalue\nof A, implying that (3) admits a unique fixed point x\u2217. Moreover, if we assume that $\\|A\\|_2 < 1$, then\nxk converges to x\u2217 at rate $\\|x_k - x^*\\|_2 \\le \\|A\\|_2^k \\|x_0 - x^*\\|_2$. We now recall the minimal polynomial\nextrapolation (MPE) method as described in [21], starting with the following definition.\n\nDefinition 2.1 Given A \u2208 Rn\u00d7n, s.t. 1 is not an eigenvalue of A and v \u2208 Rn, the minimal polynomial\nof A with respect to the vector v is the lowest degree polynomial p(x) such that\n\np(A)v = 0, p(1) = 1.\n\nNote that the degree of p(x) is always at most n and the condition p(1) = 1 makes p unique. Notice\nthat because we assumed that 1 is not an eigenvalue of A, having p(1) = 1 is not restrictive since\nwe can normalize each minimal polynomial with the sum of its coefficients (see Lemma A.1 in\nthe supplementary material). 
Given an initial iterate x0, MPE starts by forming a matrix U whose\ncolumns are the increments xi+1 \u2212 xi, with\n\n$u_i = x_{i+1} - x_i = (A - I)(x_i - x^*) = (A - I) A^i (x_0 - x^*)$. (4)\n\nNow, let p be the minimal polynomial of A with respect to the vector u0 (where p has coefficients ci\nand degree d), and U = [u0, u1, ..., ud]. So\n\n$\\sum_{i=0}^d c_i u_i = \\sum_{i=0}^d c_i A^i u_0 = p(A) u_0 = 0, \\quad p(1) = \\sum_{i=0}^d c_i = 1$. (5)\n\nWe can thus solve the system $U c = 0$, $\\sum_i c_i = 1$ to find p. In this case, the fixed point x\u2217 can be\ncomputed exactly as follows\n\n$0 = \\sum_{i=0}^d c_i A^i u_0 = \\sum_{i=0}^d c_i A^i (A - I)(x_0 - x^*) = (A - I) \\sum_{i=0}^d c_i A^i (x_0 - x^*) = (A - I) \\sum_{i=0}^d c_i (x_i - x^*)$.\n\nHence, using the fact that 1 is not an eigenvalue of A and p(1) = 1,\n\n$(A - I) \\sum_{i=0}^d c_i (x_i - x^*) = 0 \\;\\Leftrightarrow\\; \\sum_{i=0}^d c_i (x_i - x^*) = 0 \\;\\Leftrightarrow\\; \\sum_{i=0}^d c_i x_i = x^*$.\n\nThis means that x\u2217 is obtained by averaging iterates using the coefficients in c. The averaging in this\ncase is called nonlinear, since the coefficients of c vary with the iterates themselves.\n\n2.2 Approximate Minimal Polynomial Extrapolation (AMPE)\n\nSuppose now that we only compute a fraction of the iterates xi used in the MPE procedure. While the\nnumber of iterates k might be smaller than the degree of the minimal polynomial of A with respect\nto u0, we can still try to make the quantity pk(A)u0 small, where pk(x) is now a polynomial of degree\nat most k. The corresponding difference matrix U = [u0, u1, ..., uk] \u2208 Rn\u00d7(k+1) is rectangular.\nThis is also known as the Eddy-Me\u0161ina method [3, 4] or reduced rank extrapolation with arbitrary k\n(see [21, \u00a710]). The objective here is similar to (5), but the system is now overdetermined because\nk < deg(p). 
We will thus choose c to make $\\|U c\\|_2 = \\|p(A) u_0\\|_2$, for some polynomial p such that\np(1) = 1, as small as possible, which means solving for\n\n$c^* \\triangleq \\mathop{\\mathrm{argmin}}_{c} \\|U c\\|_2 \\quad \\text{s.t. } \\mathbf{1}^T c = 1$ (AMPE)\n\nin the variable c \u2208 Rk+1. The optimal value $\\|U c^*\\|_2$ of this problem is decreasing with k, satisfies\n$\\|U c^*\\|_2 = 0$ when k is greater than the degree of the minimal polynomial, and controls the\napproximation error in x\u2217 with equation (4). Setting ui = (A \u2212 I)(xi \u2212 x\u2217), we have\n\n$\\|\\sum_{i=0}^k c_i^* x_i - x^*\\|_2 = \\|(I - A)^{-1} \\sum_{i=0}^k c_i^* u_i\\|_2 \\le \\|(I - A)^{-1}\\|_2 \\|U c^*\\|_2$.\n\nWe can get a crude bound on $\\|U c^*\\|_2$ from Chebyshev polynomials, using only an assumption on\nthe range of the spectrum of the matrix A. Assume A is symmetric, $0 \\preceq A \\preceq \\sigma I \\prec I$ and deg(p) \u2264 k.\nThen\n\n$\\|U c^*\\|_2 = \\|p^*(A) u_0\\|_2 \\le \\|u_0\\|_2 \\min_{p: p(1)=1} \\|p(A)\\|_2 \\le \\|u_0\\|_2 \\min_{p: p(1)=1} \\max_{A: 0 \\preceq A \\preceq \\sigma I} \\|p(A)\\|_2$, (6)\n\nwhere p\u2217 is the polynomial with coefficients c\u2217. Since A is symmetric, we have A = Q\u039bQT where\nQ is unitary. 
We can thus simplify the objective function:\n\n$\\max_{A: 0 \\preceq A \\preceq \\sigma I} \\|p(A)\\|_2 = \\max_{\\Lambda: 0 \\preceq \\Lambda \\preceq \\sigma I} \\|p(\\Lambda)\\|_2 = \\max_{\\Lambda: 0 \\preceq \\Lambda \\preceq \\sigma I} \\max_i |p(\\lambda_i)| = \\max_{\\lambda: 0 \\le \\lambda \\le \\sigma} |p(\\lambda)|$.\n\nWe thus have\n\n$\\|U c^*\\|_2 \\le \\|u_0\\|_2 \\min_{p: p(1)=1} \\max_{\\lambda: 0 \\le \\lambda \\le \\sigma} |p(\\lambda)|$.\n\nSo we must find a polynomial which takes small values in the interval [0, \u03c3]. Chebyshev\npolynomials are known to be the polynomials for which the maximal value in the interval [\u22121, 1] is\nthe smallest. Let Ck be the Chebyshev polynomial of degree k. By definition, Ck(x) is a monic\npolynomial\u00b9 which solves\n\n$C_k(x) = \\mathop{\\mathrm{argmin}}_{p: p \\text{ is monic}} \\max_{x \\in [-1, 1]} |p(x)|$.\n\nWe can thus use a variant of Ck(x) in order to solve the minimax problem\n\n$\\min_{p: p(1)=1} \\max_{\\lambda: 0 \\le \\lambda \\le \\sigma} |p(\\lambda)|$.\n\nThe solution of this problem is given in [23] and admits an explicit formulation:\n\n$T_k(x) = \\frac{C_k(t(x))}{C_k(t(1))}, \\quad t(x) = \\frac{2x - \\sigma}{\\sigma}$. (7)\n\nNote that t(x) is simply a linear mapping from interval [0, \u03c3] to [\u22121, 1]. 
Moreover,\n\n$\\min_{p: p(1)=1} \\max_{\\lambda: 0 \\le \\lambda \\le \\sigma} |p(\\lambda)| = \\max_{\\lambda: 0 \\le \\lambda \\le \\sigma} |T_k(\\lambda)| = |T_k(\\sigma)| = \\frac{2\\zeta^k}{1 + \\zeta^{2k}}$, (8)\n\nwhere \u03b6 is\n\n$\\zeta = (1 - \\sqrt{1 - \\sigma})/(1 + \\sqrt{1 - \\sigma}) < \\sigma$. (9)\n\nSince $\\|u_0\\|_2 = \\|(A - I)(x_0 - x^*)\\|_2 \\le \\|A - I\\|_2 \\|x_0 - x^*\\|_2$, we can bound (6) by\n\n$\\|U c^*\\|_2 \\le \\|u_0\\|_2 \\min_{p: p(1)=1} \\max_{\\lambda: 0 \\le \\lambda \\le \\sigma} |p(\\lambda)| \\le \\|A - I\\|_2 \\frac{2\\zeta^k}{1 + \\zeta^{2k}} \\|x_0 - x^*\\|_2$.\n\nThis leads to the following proposition.\nProposition 2.2 Let A be symmetric, $0 \\preceq A \\preceq \\sigma I \\prec I$ and c\u2217 be the solution of (AMPE). Then\n\n$\\|\\sum_{i=0}^k c_i^* x_i - x^*\\|_2 \\le \\kappa(A - I) \\frac{2\\zeta^k}{1 + \\zeta^{2k}} \\|x_0 - x^*\\|_2$, (10)\n\nwhere \u03ba(A \u2212 I) is the condition number of the matrix A \u2212 I and \u03b6 is defined in (9).\nNote that, when solving quadratic optimization problems, the rate in this bound matches that obtained\nusing the optimal method in [6]. Also, the bound on the rate of convergence of this method is exactly\nthe one of the conjugate gradient with an additional factor \u03ba(A \u2212 I).\n\nRemark: This method presents a strong link with the conjugate gradient. Denote $\\|v\\|_B = \\sqrt{v^T B v}$\nthe norm induced by the positive definite matrix B. By definition, at the k-th iteration, the conjugate\ngradient computes an approximation s of x\u2217 which follows\n\n$s = \\mathop{\\mathrm{argmin}} \\|x - x^*\\|_A \\quad \\text{s.t. } x \\in K_k$,\n\n
where $K_k = \\mathrm{span}\\{A x^*, A^2 x^*, \\ldots, A^k x^*\\}$ is called a Krylov subspace. Since x \u2208 Kk, we have that x\nis a linear combination of the vectors spanning Kk, so $x = \\sum_{i=1}^k c_i A^i x^* = q(A) x^*$, where q(x) is a\npolynomial of degree k and q(0) = 0. So conjugate gradient solves\n\n$s = \\mathop{\\mathrm{argmin}}_{q: q(0)=0} \\|q(A) x^* - x^*\\|_A = \\mathop{\\mathrm{argmin}}_{\\hat{q}: \\hat{q}(0)=1} \\|\\hat{q}(A) x^*\\|_A$,\n\nwhich is very similar to equation (AMPE). However, while conjugate gradient has access to an oracle\nwhich gives the result of the product between matrix A and any vector v, the AMPE procedure can\nonly use the iterations produced by (3) (meaning that the AMPE procedure does not need to know A).\nMoreover, we analyze the convergence of AMPE in another norm ($\\|\\cdot\\|_2$ instead of $\\|\\cdot\\|_A$). These\ntwo reasons explain why a condition number appears in the rate of convergence of AMPE (10).\n\n\u00b9 A monic polynomial is a univariate polynomial in which the coefficient of highest degree is equal to 1.\n\n2.3 Nonlinear Iterations\n\nWe now go back to the general case where the iterative algorithm is nonlinear, with\n\n$\\tilde{x}_{i+1} = g(\\tilde{x}_i)$, for $i = 1, \\ldots, k$, (11)\n\nwhere \u02dcxi \u2208 Rn and the function g has a symmetric Jacobian at point x\u2217. We also assume that the\nmethod has a unique fixed point written x\u2217 and linearize these iterations around x\u2217, to get\n\n$\\tilde{x}_i - x^* = A(\\tilde{x}_{i-1} - x^*) + e_i$, (12)\n\nwhere A is now the Jacobian matrix (i.e., the first derivative) of g taken at the fixed point x\u2217 and\nei \u2208 Rn is a second order error term $\\|e_i\\|_2 = O(\\|\\tilde{x}_{i-1} - x^*\\|_2^2)$. Note that, by construction, the\nlinear and nonlinear models share the same fixed point x\u2217. 
We write xi the iterates that would be\nobtained using the asymptotic linear model (starting at x0)\n\nxi \u2212 x\u2217 = A(xi\u22121 \u2212 x\u2217).\n\nRunning the algorithm described in (11), we thus observe the iterates \u02dcxi and build \u02dcU from their\ndifferences. As in (AMPE) we then compute \u02dcc using matrix \u02dcU and finally estimate\n\n$\\tilde{x}^* = \\sum_{i=0}^k \\tilde{c}_i \\tilde{x}_i$.\n\nIn this case, our estimate for x\u2217 is based on the coefficient \u02dcc, computed using the iterates \u02dcxi. We will\nnow decompose the error made by the estimation by comparing it with the estimation which comes\nfrom the linear model:\n\n$\\|\\sum_{i=0}^k \\tilde{c}_i \\tilde{x}_i - x^*\\|_2 \\le \\|\\sum_{i=0}^k (\\tilde{c}_i - c_i) x_i\\|_2 + \\|\\sum_{i=0}^k \\tilde{c}_i (\\tilde{x}_i - x_i)\\|_2 + \\|\\sum_{i=0}^k c_i x_i - x^*\\|_2$. (13)\n\nThis expression shows us that the precision is comparable to the precision of the AMPE process in\nthe linear case (third term) with some perturbation. Also, if $\\|e_i\\|_2$ is small then $\\|x_i - \\tilde{x}_i\\|_2$ is small\nas well. But we need more information about $\\|c\\|_2$ and $\\|\\tilde{c} - c\\|_2$ if we want to go further.\nWe now show the following proposition computing the perturbation \u2206c = (\u02dcc\u2217 \u2212 c\u2217) of the optimal\nsolution of (AMPE), c\u2217, induced by E = \u02dcU \u2212 U. It will allow us to bound the first term on the\nright-hand side of (13) (see proof A.2 in the Appendix). For simplicity, we will use P = \u02dcU T \u02dcU \u2212 U T U.\nProposition 2.3 Let c\u2217 be the optimal solution to (AMPE)\n\n$c^* = \\mathop{\\mathrm{argmin}}_{\\mathbf{1}^T c = 1} \\|U c\\|_2$\n\nfor some matrix U \u2208 Rn\u00d7k. 
Suppose U becomes \u02dcU = U + E and write c\u2217 + \u2206c the perturbed\nsolution to (AMPE). Let M = \u02dcU T \u02dcU and the perturbation matrix P = \u02dcU T \u02dcU \u2212 U T U. Then,\n\n$\\Delta c = -\\left(I - \\frac{M^{-1} \\mathbf{1} \\mathbf{1}^T}{\\mathbf{1}^T M^{-1} \\mathbf{1}}\\right) M^{-1} P c^*$. (14)\n\nWe see here that the perturbation can be potentially large. Even if $\\|c^*\\|_2$ and $\\|P\\|_2$ can be potentially\nsmall, $\\|M^{-1}\\|_2$ is huge in general. It can be shown that U T U (the square of a Krylov-like matrix)\npresents an exponential condition number (see [24]) because the minimal eigenvalue decays very fast.\nMoreover, the eigenvalues are perturbed by P , leading to a potentially huge perturbation \u2206c, especially\nif $\\|P\\|_2$ is comparable to (or bigger than) \u03bbmin(U T U ).\n\n2.4 Regularized AMPE\n\nThe condition number of the matrix U T U in problem (AMPE) can be arbitrarily large. Indeed, this\ncondition number is related to the one of Krylov matrices which has been proved in [24] to be\nexponential in k. Consequently, this conditioning problem coupled with nonlinear errors leads to\nhighly unstable solutions c\u2217 (which we observe in our experiments). We thus study a regularized\nformulation of problem (AMPE), which reads\n\n$\\text{minimize } c^T (U^T U + \\lambda I) c \\quad \\text{subject to } \\mathbf{1}^T c = 1$ (RMPE)\n\nThe solution of this problem may be computed with a linear system, and the regularization parameter\ncontrols the norm of the solution, as shown in the following Lemma (see proof A.3 in Appendix).\n\nLemma 2.4 Let $c_\\lambda^*$ be the optimal solution of problem (RMPE). 
Then\n\n$c_\\lambda^* = \\frac{(U^T U + \\lambda I)^{-1} \\mathbf{1}}{\\mathbf{1}^T (U^T U + \\lambda I)^{-1} \\mathbf{1}}$ and $\\|c_\\lambda^*\\|_2 \\le \\sqrt{\\frac{\\lambda + \\|U\\|_2^2}{k \\lambda}}$. (15)\n\nThis allows us to obtain the following corollary extending Proposition 2.3 to the regularized AMPE\nproblem in (RMPE), showing that the perturbation of c is now controlled by the regularization\nparameter \u03bb.\nCorollary 2.5 Let $c_\\lambda^*$, defined in (15), be the solution of problem (RMPE). Then the solution of\nproblem (RMPE) for the perturbed matrix \u02dcU = U + E is given by $c_\\lambda^* + \\Delta c_\\lambda$ where\n\n$\\Delta c_\\lambda = -W M_\\lambda^{-1} P c_\\lambda^* = -M_\\lambda^{-1} W^T P c_\\lambda^*$ and $\\|\\Delta c_\\lambda\\|_2 \\le \\frac{\\|P\\|_2}{\\lambda} \\|c_\\lambda^*\\|_2$,\n\nwhere $M_\\lambda = (U^T U + P + \\lambda I)$ and $W = \\left(I - \\frac{M_\\lambda^{-1} \\mathbf{1} \\mathbf{1}^T}{\\mathbf{1}^T M_\\lambda^{-1} \\mathbf{1}}\\right)$ is a projector of rank k \u2212 1.\n\nThese results lead us to the following simple algorithm.\n\nAlgorithm 1 Regularized Approximate Minimal Polynomial Extrapolation (RMPE)\nInput: Sequence {x0, x1, ..., xk+1}, parameter \u03bb > 0\n\nCompute U = [x1 \u2212 x0, ..., xk+1 \u2212 xk]\nSolve the linear system (U T U + \u03bbI)z = 1\nSet c = z/(zT 1)\n\nOutput: $\\sum_{i=0}^k c_i x_i$, the approximation of the fixed point x\u2217\n\nThe computational complexity (with online updates or in batch mode) of the algorithm is $O(n k^2)$\nand some strategies (batch and online) are discussed in the Appendix A.3. Note that the algorithm\nnever calls the oracle g(x). It means that, in an optimization context, the acceleration does not require\nf (x) or f\u2032(x) to compute the extrapolation. 
Moreover, it does not need a priori information on the\nfunction, for example L and \u00b5 (unlike Nesterov\u2019s method).\n\n2.5 Convergence Bounds on Regularized AMPE\n\nTo fully characterize the convergence of our estimate sequence, we still need to bound the last term\non the right-hand side of (13), namely $\\|\\sum_{i=0}^k c_i x_i - x^*\\|_2$. A coarse bound can be provided using\nChebyshev polynomials, however the norm of the Chebyshev coefficients grows exponentially as k\ngrows. Here we refine this bound to better control the quality of our estimate.\nLet $g'(x^*) \\preceq \\sigma I$. Consider the following Chebyshev-like optimization problem, written\n\n$S(k, \\alpha) \\triangleq \\min_{\\{q \\in \\mathbb{R}_k[x]: q(1)=1\\}} \\left\\{ \\max_{x \\in [0, \\sigma]} ((1 - x) q(x))^2 + \\alpha \\|q\\|_2^2 \\right\\}$, (16)\n\nwhere Rk[x] is the ring of polynomials of degree at most k and q \u2208 Rk+1 is the vector of coefficients\nof the polynomial q(x). This problem can be solved exactly using a semidefinite solver because it\ncan be reduced to an SDP (see Appendix A.4 for the details of the reduction). Our main result\nbelow shows how S(k, \u03b1) bounds the error between our estimate of the optimum constructed using\nthe iterates \u02dcxi in (RMPE) and the optimum x\u2217 of problem (1).\nProposition 2.6 Let matrices X = [x0, x1, ..., xk], \u02dcX = [x0, \u02dcx1, ..., \u02dcxk], E = (X \u2212 \u02dcX) and scalar\n$\\kappa = \\|(A - I)^{-1}\\|_2$. Suppose $\\tilde{c}_\\lambda^*$ solves problem (RMPE)\n\n$\\text{minimize } c^T (\\tilde{U}^T \\tilde{U} + \\lambda I) c \\quad \\text{subject to } \\mathbf{1}^T c = 1 \\quad \\Rightarrow \\quad \\tilde{c}_\\lambda^* = \\frac{(\\tilde{U}^T \\tilde{U} + \\lambda I)^{-1} \\mathbf{1}}{\\mathbf{1}^T (\\tilde{U}^T \\tilde{U} + \\lambda I)^{-1} \\mathbf{1}}$ (17)\n\nin the variable c \u2208 Rk+1, with parameters \u02dcU \u2208 Rn\u00d7(k+1). 
Assume A symmetric with $0 \\preceq A \\prec I$.\nThen\n\n$\\|\\tilde{X} \\tilde{c}_\\lambda^* - x^*\\|_2 \\le \\left( \\kappa^2 + \\frac{1}{\\lambda} \\left(1 + \\frac{\\|P\\|_2}{\\lambda}\\right)^2 \\left(\\|E\\|_2 + \\kappa \\frac{\\|P\\|_2}{2\\sqrt{\\lambda}}\\right)^2 \\right)^{1/2} \\left(S(k, \\lambda / \\|x_0 - x^*\\|_2^2)\\right)^{1/2} \\|x_0 - x^*\\|_2$,\n\nwhere P is defined in Corollary 2.5 and S(k, \u03b1) is defined in (16).\n\nWe have that $S(k, \\lambda / \\|x_0 - x^*\\|_2^2)^{1/2}$ is similar to the value Tk(\u03c3) (see (8)) so our algorithm achieves a\nrate similar to the Chebyshev acceleration up to some multiplicative scalar. We thus need to choose\n\u03bb so that this multiplicative scalar is not too high (while keeping $S(k, \\lambda / \\|x_0 - x^*\\|_2^2)^{1/2}$ small).\nWe can analyze the behavior of the bound if we start close to the optimum. Assume\n\n$\\|E\\|_2 = O(\\|x_0 - x^*\\|_2^2), \\quad \\|U\\|_2 = O(\\|x_0 - x^*\\|_2) \\;\\Rightarrow\\; \\|P\\|_2 = O(\\|x_0 - x^*\\|_2^3)$.\n\nThis case is encountered when minimizing a smooth strongly convex function with Lipschitz-continuous\nHessian using the fixed-step gradient method (this case is discussed in detail in the Appendix,\nsection A.6). Also, let $\\lambda = \\beta \\|P\\|_2$ for \u03b2 > 0 and $\\|x_0 - x^*\\|_2$ small. 
We can thus approximate the\nright parenthesis of the bound by\n\n$\\lim_{\\|x_0 - x^*\\|_2 \\to 0} \\left(\\|E\\|_2 + \\kappa \\frac{\\|P\\|_2}{2\\sqrt{\\lambda}}\\right) = \\lim_{\\|x_0 - x^*\\|_2 \\to 0} \\left(\\|E\\|_2 + \\frac{\\kappa \\sqrt{\\|P\\|_2}}{2\\sqrt{\\beta}}\\right) = \\frac{\\kappa \\sqrt{\\|P\\|_2}}{2\\sqrt{\\beta}}$.\n\nTherefore, the bound on the precision of the extrapolation is approximately equal to\n\n$\\|\\tilde{X} \\tilde{c}_\\lambda^* - x^*\\|_2 \\lesssim \\kappa \\left(1 + \\frac{(1 + \\frac{1}{\\beta})^2}{4\\beta^2}\\right)^{1/2} \\sqrt{S\\left(k, \\frac{\\beta \\|P\\|_2}{\\|x_0 - x^*\\|_2^2}\\right)} \\|x_0 - x^*\\|_2$.\n\nAlso, if we use equation (8), it is easy to see that\n\n$\\sqrt{S(k, 0)} \\le \\min_{\\{q \\in \\mathbb{R}_k[x]: q(1)=1\\}} \\max_{x \\in [0, \\sigma]} |q(x)| = T_k(\\sigma) = \\frac{2\\zeta^k}{1 + \\zeta^{2k}}$,\n\nwhere \u03b6 is defined in (9). So, when $\\|x_0 - x^*\\|_2$ is close to zero, the regularized version of AMPE\ntends to converge as fast as AMPE (see equation (10)) up to a small constant.\n\n3 Numerical Experiments\n\nWe test our methods on a regularized logistic regression problem written\n\n$f(w) = \\sum_{i=1}^m \\log(1 + \\exp(-y_i \\xi_i^T w)) + \\frac{\\tau}{2} \\|w\\|_2^2$,\n\nwhere \u039e = [\u03be1, ..., \u03bem]T \u2208 Rm\u00d7n is the design matrix and y is a $\\{-1, 1\\}^m$ vector of labels. We used\nthe Madelon UCI dataset, setting $\\tau = 10^2$ (in order to have a ratio L/\u03c4 equal to $10^9$). We solve this\nproblem using several algorithms: the fixed-step gradient method for strongly convex functions [6,\nTh. 
2.1.15] using stepsize 2/(L + \u00b5), where $L = \\|\\Xi\\|_2^2/4 + \\tau$ and \u00b5 = \u03c4, the accelerated gradient\nmethod for strongly convex functions [6, Th. 2.2.3] and our nonlinear acceleration of the gradient\nmethod iterates using RMPE in Proposition 2.6 with restarts.\nThis last algorithm is implemented as follows. We do k steps (in the numerical experiments, k is\ntypically equal to 5) of the gradient method, then extrapolate a solution $\\tilde{X} \\tilde{c}_\\lambda^*$ where $\\tilde{c}_\\lambda^*$ is computed\nby solving the RMPE problem (17) on the gradient iterates \u02dcX, with regularization parameter \u03bb\ndetermined by a grid search. Then, this extrapolation becomes the new starting point of the gradient\nmethod. We consider it as one iteration of RMPEk using k gradient oracle calls. We also analyze the\nimpact of an inexact line-search (described in Appendix A.7) performed after this procedure.\nThe results are reported in Figure 1. Using very few iterates, the solutions computed using our estimate\n(a nonlinear average of the gradient iterates) are markedly better than those produced by the Nesterov-accelerated\nmethod. This is only partially reflected by the theoretical bound from Proposition 2.6\nwhich shows significant speedup in some regions but remains highly conservative (see Figure 3 in\nsection A.6). Also, Figure 2 shows us the impact of regularization. The AMPE process becomes\nunstable because of the condition number of matrix M, which impacts the precision of the estimate.\n\n4 Conclusion and Perspectives\n\nIn this paper, we developed a method which is able to accelerate, under some regularity conditions,\nthe convergence of a sequence {xi} without any knowledge of the algorithm which generates this\nsequence. 
The regularization parameter present in the acceleration method can be computed easily\nusing some inexact line-search.\n\nFigure 1: Solving logistic regression on UCI Madelon dataset (500 features, 2000 data points)\nusing the gradient method, Nesterov\u2019s accelerated method and RMPE with k = 5 (with and without\nline search over the stepsize), with penalty parameter \u03c4 equal to $10^2$ (condition number equal\nto $1.2 \\cdot 10^9$). Here, we see that our algorithm has a similar behavior to the conjugate gradient:\nunlike Nesterov\u2019s method, where we need to provide parameters \u00b5 and L, the RMPE algorithm\nadapts itself to the spectrum of $g'(x^*)$ (so it can exploit the good local strong convexity\nparameter), without any prior specification. We can, for example, observe this behavior when the\nglobal strong convexity parameter is bad but not the local one.\n\nFigure 2: Logistic regression on Madelon UCI Dataset, solved using Gradient method, Nesterov\u2019s\nmethod and AMPE (i.e. RMPE with \u03bb = 0). The condition number is equal to $1.2 \\cdot 10^9$. We see that\nwithout regularization, AMPE is unstable because $\\|(\\tilde{U}^T \\tilde{U})^{-1}\\|_2$ is huge (see Proposition 2.3).\n\nThe algorithm itself is simple. By solving only a small linear system we are able to compute a good\nestimate of the limit of the sequence {xi}. Also, we showed (using the gradient method on logistic\nregression) that the strategy which consists in alternating the algorithm and the extrapolation method\ncan lead to impressive results, significantly improving the rate of convergence.\nFuture work will consist in improving the performance of the algorithm by exploiting the structure of\nthe noise matrix E in some cases (for example, using the gradient method, the norm of the column\nEk in the matrix E is decreasing when k grows), extending the algorithm to the constrained case, the\nstochastic case and to the non-symmetric case. 
We will also try to refine the term (16) present in the\ntheoretical bound.\n\nAcknowledgment. The research leading to these results has received funding from the European\nUnion\u2019s Seventh Framework Programme (FP7-PEOPLE-2013-ITN) under grant agreement no 607290\nSpaRTaN, as well as support from ERC SIPA and the chaire \u00c9conomie des nouvelles donn\u00e9es with\nthe data science joint research initiative with the fonds AXA pour la recherche.\n\n[Figure 1 plots: f(xk) \u2212 f(x\u2217) versus gradient oracle calls and versus CPU time (sec.); curves: Gradient, Nesterov, Nest. + backtrack, RMPE5, RMPE5 + LS.]\n\n[Figure 2 plot: f(xk) \u2212 f(x\u2217) versus gradient oracle calls; curves: Gradient, Nesterov, RMPE5, AMPE5.]\n\nReferences\n[1] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2). Soviet\n\nMathematics Doklady, 27(2):372\u2013376, 1983.\n\n[2] AS Nemirovskii and Y. E Nesterov. Optimal methods of smooth convex minimization. USSR Computational\n\nMathematics and Mathematical Physics, 25(2):21\u201330, 1985.\n\n[3] Me\u0161ina, M. [1977], \u2018Convergence acceleration for the iterative solution of the equations x = ax + f\u2019,\n\nComputer Methods in Applied Mechanics and Engineering 10(2), 165\u2013173.\n\n[4] Eddy, R. [1979], \u2018Extrapolating to the limit of a vector sequence\u2019, Information linkage between applied\n\nmathematics and industry pp. 387\u2013396.\n\n[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.\n\nSIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[6] Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.\n[7] Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming,\n\n152(1-2):381\u2013404, 2015.\n\n[8] Yoel Drori and Marc Teboulle. Performance of \ufb01rst-order methods for smooth convex minimization: a\n\nnovel approach. 
Mathematical Programming, 145(1-2):451\u2013482, 2014.\n\n[9] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57\u201395, 2016.\n\n[10] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov\u2019s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510\u20132518, 2014.\n\n[11] Andre Wibisono and Ashia C. Wilson. On accelerated methods in optimization. Technical report, 2015.\n\n[12] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366\u20133374, 2015.\n\n[13] Alexander Craig Aitken. On Bernoulli\u2019s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289\u2013305, 1927.\n\n[14] Daniel Shanks. Non-linear transformations of divergent and slowly convergent sequences. Journal of Mathematics and Physics, 34(1):1\u201342, 1955.\n\n[15] Peter Wynn. On a device for computing the e_m(S_n) transformation. Mathematical Tables and Other Aids to Computation, 10(54):91\u201396, 1956.\n\n[16] C. Brezinski. Acc\u00e9l\u00e9ration de la convergence en analyse num\u00e9rique. Lecture Notes in Mathematics, (584), 1977.\n\n[17] Avram Sidi, William F. Ford, and David A. Smith. Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis, 23(1):178\u2013196, 1986.\n\n[18] N. Levinson. The Wiener RMS error criterion in filter design and prediction. Appendix B of N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, 1949.\n\n[19] James Durbin. The fitting of time-series models. Revue de l\u2019Institut International de Statistique, pages 233\u2013244, 1960.\n\n[20] Georg Heinig and Karla Rost. 
Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra and its Applications, 435(1):1\u201359, 2011.\n\n[21] David A. Smith, William F. Ford, and Avram Sidi. Extrapolation methods for vector sequences. SIAM Review, 29(2):199\u2013233, 1987.\n\n[22] Stan Cabay and L. W. Jackson. A polynomial extrapolation method for finding limits and antilimits of vector sequences. SIAM Journal on Numerical Analysis, 13(5):734\u2013752, 1976.\n\n[23] Gene H. Golub and Richard S. Varga. Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik, 3(1):157\u2013168, 1961.\n\n[24] Evgenij E. Tyrtyshnikov. How bad are Hankel matrices? Numerische Mathematik, 67(2):261\u2013269, 1994.\n\n[25] Y. Nesterov. Squared functional systems and optimization problems. In High Performance Optimization, pages 405\u2013440. Springer, 2000.\n\n[26] J. B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796\u2013817, 2001.\n\n[27] P. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000.\n\n[28] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. MPS-SIAM Series on Optimization. SIAM, 2001.\n", "award": [], "sourceid": 403, "authors": [{"given_name": "Damien", "family_name": "Scieur", "institution": "INRIA - ENS"}, {"given_name": "Alexandre", "family_name": "d'Aspremont", "institution": "CNRS - Ecole Normale Sup\u00e9rieure"}, {"given_name": "Francis", "family_name": "Bach", "institution": "INRIA - Ecole Normale Superieure"}]}