{"title": "Analysis of Krylov Subspace Solutions of Regularized Non-Convex Quadratic Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 10705, "page_last": 10715, "abstract": "We provide convergence rates for Krylov subspace solutions to the trust-region and cubic-regularized (nonconvex) quadratic problems. Such solutions may be efficiently computed by the Lanczos method and have long been used in practice. We prove error bounds of the form $1/t^2$ and $e^{-4t/\\sqrt{\\kappa}}$, where $\\kappa$ is a condition number for the problem, and $t$ is the Krylov subspace order (number of Lanczos iterations). We also provide lower bounds showing that our analysis is sharp.", "full_text": "Analysis of Krylov Subspace Solutions of Regularized\n\nNonconvex Quadratic Problems\n\nDepartment of Electrical Engineering\n\nDepartments of Statitstics and Electrical Engineering\n\nYair Carmon\n\nStanford University\n\nyairc@stanford.edu\n\nJohn C. Duchi\n\nStanford University\n\njduchi@stanford.edu\n\nAbstract\n\nWe provide convergence rates for Krylov subspace solutions to the trust-region\nand cubic-regularized (nonconvex) quadratic problems. Such solutions may be\nef\ufb01ciently computed by the Lanczos method and have long been used in practice.\nWe prove error bounds of the form 1/t2 and e4t/p\uf8ff, where \uf8ff is a condition\nnumber for the problem, and t is the Krylov subspace order (number of Lanczos\niterations). We also provide lower bounds showing that our analysis is sharp.\n\n1\n\nIntroduction\n\nConsider the potentially nonconvex quadratic function\n\nfA,b(x) :=\n\n1\n2\n\nxT Ax + bT x,\n\nminimize\n\nx\n\nwhere A 2 Rd\u21e5d and b 2 Rd. We wish to solve regularized minimization problems of the form\n\nfA,b(x) subject tokxk \uf8ff R and minimize\n\n(1)\nwhere R and \u21e2 0 are regularization parameters. 
These problems arise primarily in the family of trust-region and cubic-regularized Newton methods for general nonlinear optimization problems [11, 29, 18, 9], which optimize a smooth function $g$ by sequentially minimizing local models of the form
$$g(x_i + \Delta) \approx g(x_i) + \nabla g(x_i)^T \Delta + \frac{1}{2}\Delta^T \nabla^2 g(x_i)\Delta = g(x_i) + f_{\nabla^2 g(x_i),\, \nabla g(x_i)}(\Delta),$$
where $x_i$ is the current iterate and $\Delta \in \mathbb{R}^d$ is the search direction. Such models tend to be unreliable for large $\|\Delta\|$, particularly when $\nabla^2 g(x_i) \not\succeq 0$. Trust-region and cubic regularization methods address this by constraining and regularizing the direction $\Delta$, respectively.

Both classes of methods and their associated subproblems are the subject of substantial ongoing research [19, 21, 5, 1, 25]. In the machine learning community, there is growing interest in using these methods for minimizing (often nonconvex) training losses, handling the large finite-sum structure of learning problems by means of sub-sampling [32, 23, 3, 38, 36].

The problems (1) are challenging to solve in high-dimensional settings, where direct decomposition (or even storage) of the matrix $A$ is infeasible. In some scenarios, however, computing matrix-vector products $v \mapsto Av$ is feasible. Such is the case when $A$ is the Hessian of a neural network, where $d$ may be in the millions and $A$ is dense, and yet we can compute Hessian-vector products efficiently on batches of training data [31, 33].

In this paper we consider a scalable approach for approximately solving (1), which consists of minimizing the objective in the Krylov subspace of order $t$,
$$\mathcal{K}_t(A, b) := \mathrm{span}\{b, Ab, \ldots, A^{t-1}b\}. \qquad (2)$$

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

This requires only $t$ matrix-vector products, and the Lanczos method allows one to efficiently find the solution to problems (1) over $\mathcal{K}_t(A, b)$ (see, e.g., [17, 9, Sec. 2]).
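The Lanczos method referenced above builds an orthonormal basis of $\mathcal{K}_t(A, b)$ using exactly $t$ matrix-vector products, and represents $A$ on that basis by a tridiagonal matrix. The following is a minimal sketch under our own naming; it omits the re-orthogonalization that robust finite-precision implementations typically require.

```python
import numpy as np

def lanczos(matvec, b, t):
    """Return Q (orthonormal basis of K_t(A, b)) and tridiagonal T = Q^T A Q,
    accessing A only through `matvec`, called t times."""
    d = b.shape[0]
    Q = np.zeros((d, t))
    alpha, beta = np.zeros(t), np.zeros(t - 1)
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(t):
        w = matvec(Q[:, j])              # the only access to A
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        if j < t - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

# Small symmetric test instance (random; purely illustrative).
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = (M + M.T) / 2
b = rng.standard_normal(50)
Q, T = lanczos(lambda v: A @ v, b, 10)
```

Solving (1) restricted to $\mathcal{K}_t(A,b)$ then reduces to a $t$-dimensional problem in the tridiagonal matrix $T$, which is what makes the approach scalable.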
Krylov subspace methods are familiar in numerous large-scale numerical problems, including conjugate gradient methods, eigenvector problems, and solving linear systems [20, 26, 35, 14].

It is well known that, with exact arithmetic, the order-$d$ subspace $\mathcal{K}_d(A, b)$ generically contains the global solutions to (1). However, until recently the literature contained no guarantees on the rate at which the suboptimality of the solution approaches zero as the subspace dimension $t$ grows. This is in contrast to the two predominant Krylov subspace method use-cases, convex quadratic optimization [14, 27, 28] and eigenvector finding [24], where such rates of convergence have been known for decades. Zhang et al. [39] make substantial progress on this gap, establishing bounds implying a linear rate of convergence for the trust-region variant of problem (1).

In this work we complete the picture, proving that the optimality gap of the order-$t$ Krylov subspace solution to either of the problems (1) is bounded by both $e^{-4t/\sqrt{\kappa}}$ and $t^{-2}\log^2(\|b\|/|u_{\min}^T b|)$. Here $\kappa$ is a condition number for the problem that naturally generalizes the classical condition number of the matrix $A$, and $u_{\min}$ is an eigenvector of $A$ corresponding to its smallest eigenvalue. Using randomization, we may replace $|u_{\min}^T b|$ with a term proportional to $1/\sqrt{d}$, circumventing the well-known "hard case" of the problem (1) (see Section 2.5). Our analysis both leverages and unifies the known results for convex quadratic and eigenvector problems, which constitute special cases of (1).

Related work  Zhang et al. [39] show that the error of certain polynomial approximation problems bounds the suboptimality of Krylov subspace solutions to the trust-region variant of the problems (1), implying convergence at a rate exponential in $t/\sqrt{\kappa}$.
Based on these bounds, the authors propose novel stopping criteria for subproblem solutions in the trust-region optimization method, showing good empirical results. However, the bounds of [39] become weak for large $\kappa$ and vacuous in the hard case where $\kappa = \infty$.

Prior works develop algorithms for solving (1) with convergence guarantees that hold in the hard case. Hazan and Koren [19], Ho-Nguyen and Kılınç-Karzan [21], and Agarwal et al. [1] propose algorithms that obtain error roughly $t^{-2}$ after computing $t$ matrix-vector products. The different algorithms these papers propose all essentially reduce the problems (1) to a sequence of eigenvector and convex quadratic problems to which standard algorithms apply. In previous work [5], we analyze gradient descent, a direct, local method, for the cubic-regularized problem. There, we show a rate of convergence roughly $t^{-1}$, reflecting the well-known complexity gap between gradient descent (respectively, the power method) and conjugate gradient (respectively, Lanczos) methods [35, 14].

Our development differs from this prior work in the following ways.

1. We analyze a practical approach, implemented in efficient optimization libraries [16, 25], with essentially no tuning parameters. Previous algorithms [19, 21, 1] are convenient for theoretical analysis but less conducive to efficient implementation; each has several parameters that require tuning, and we are unaware of numerical experiments with any of the approaches.

2. We provide both linear ($e^{-4t/\sqrt{\kappa}}$) and sublinear ($t^{-2}$) convergence guarantees. In contrast, the papers [19, 21, 1] provide only a sublinear rate; Zhang et al. [39] provide only the linear rate.

3. Our analysis applies to both the trust-region and cubic regularization variants in (1), while [19, 21, 39] consider only the trust-region problem, and [39, 5] consider only cubic regularization.

4.
We provide lower bounds, for adversarially constructed problem instances, showing our convergence guarantees are tight to within numerical constants. By a resisting oracle argument [27], these bounds apply to any deterministic algorithm that accesses $A$ via matrix-vector products.

5. Our arguments are simple and transparent, and we leverage established results on convex optimization and the eigenvector problem to give short proofs of our main results.

Paper organization  In Section 2 we state and prove our convergence rate guarantees for the trust-region problem. Then, in Section 3 we quickly transfer those results to the cubic-regularized problem by showing that it always has a smaller optimality gap. Section 4 gives our lower bounds, stated for cubic regularization but immediately applicable to the trust-region problem by the same optimality gap bound. Finally, in Section 5 we illustrate our analysis with some numerical experiments.

Notation  For a symmetric matrix $A \in \mathbb{R}^{d \times d}$ and vector $b$ we let $f_{A,b}(x) := \frac{1}{2}x^T A x + b^T x$. We let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the minimum and maximum eigenvalues of $A$, and let $u_{\min}(A), u_{\max}(A)$ denote their corresponding (unit) eigenvectors, dropping the argument $A$ when clear from context. For integer $t \ge 1$ we let $\mathcal{P}_t := \{c_0 + c_1 x + \cdots + c_{t-1}x^{t-1} \mid c_i \in \mathbb{R}\}$ be the polynomials of degree at most $t - 1$, so that the Krylov subspace (2) is $\mathcal{K}_t(A, b) = \{p(A)b \mid p \in \mathcal{P}_t\}$. We use $\|\cdot\|$ to denote the Euclidean norm on $\mathbb{R}^d$ and the $\ell_2$-operator norm on $\mathbb{R}^{d \times d}$. Finally, we denote $(z)_+ := \max\{z, 0\}$ and $(z)_- := \min\{z, 0\}$.

2 The trust-region problem

Fixing a symmetric matrix $A \in \mathbb{R}^{d \times d}$, vector $b \in \mathbb{R}^d$ and trust-region radius $R > 0$, we let
$$s^{tr}_\star \in \operatorname*{argmin}_{x \in \mathbb{R}^d,\, \|x\| \le R} f_{A,b}(x) = \frac{1}{2}x^T A x + b^T x$$
denote a solution (global minimizer) of the trust-region problem. Letting $\lambda_{\min}, \lambda_{\max}$ denote the extremal eigenvalues of $A$, $s^{tr}_\star$ admits the following characterization [11, Ch. 7]: $s^{tr}_\star$ solves problem (1) if and only if there exists $\lambda_\star$
such that
$$(A + \lambda_\star I)s^{tr}_\star = -b, \qquad \lambda_\star \ge (-\lambda_{\min})_+, \qquad\text{and}\qquad \lambda_\star\left(R - \|s^{tr}_\star\|\right) = 0. \qquad (3)$$
The optimal Lagrange multiplier $\lambda_\star$ always exists and is unique, and if $\lambda_\star > -\lambda_{\min}$ the solution $s^{tr}_\star$ is unique and satisfies $s^{tr}_\star = -(A + \lambda_\star I)^{-1}b$. Letting $u_{\min}$ denote the eigenvector of $A$ corresponding to $\lambda_{\min}$, the characterization (3) shows that $u_{\min}^T b \ne 0$ implies $\lambda_\star > -\lambda_{\min}$.

Now, consider the Krylov subspace solutions, and for $t > 0$, let
$$s^{tr}_t \in \operatorname*{argmin}_{x \in \mathcal{K}_t(A,b),\, \|x\| \le R} f_{A,b}(x) = \frac{1}{2}x^T A x + b^T x$$
denote a minimizer of the trust-region problem in the Krylov subspace of order $t$. Gould et al. [17] show how to compute the Krylov subspace solution $s^{tr}_t$ in time dominated by the cost of computing $t$ matrix-vector products using the Lanczos method (see also Section A of the supplement).

2.1 Main result

With the notation established above, our main result follows.

Theorem 1. For every $t > 0$,
$$f_{A,b}(s^{tr}_t) - f_{A,b}(s^{tr}_\star) \le 36\left[f_{A,b}(0) - f_{A,b}(s^{tr}_\star)\right] \exp\left(-4t\sqrt{\frac{\lambda_{\min} + \lambda_\star}{\lambda_{\max} + \lambda_\star}}\right), \qquad (4)$$
and
$$f_{A,b}(s^{tr}_t) - f_{A,b}(s^{tr}_\star) \le \frac{(\lambda_{\max} - \lambda_{\min})\|s^{tr}_\star\|^2}{(t - \frac{1}{2})^2}\left[4 + \frac{\mathbb{I}_{\{\lambda_{\min} < 0\}}}{8}\log^2\left(\frac{4\|b\|^2}{(u_{\min}^T b)^2}\right)\right]. \qquad (5)$$

Theorem 1 characterizes two convergence regimes: linear (4) and sublinear (5). Linear convergence occurs when $t \gtrsim \sqrt{\kappa}$, where $\kappa = \frac{\lambda_{\max} + \lambda_\star}{\lambda_{\min} + \lambda_\star} \ge 1$ is the condition number for the problem. There, the error decays exponentially and falls beneath $\epsilon$ in roughly $\sqrt{\kappa}\log\frac{1}{\epsilon}$ Lanczos iterations. Sublinear convergence occurs when $t \lesssim \sqrt{\kappa}$, and there the error decays polynomially and falls beneath $\epsilon$ in roughly $1/\sqrt{\epsilon}$ iterations. For worst-case problem instances this characterization is tight to constant factors, as we show in Section 4.

The guarantees of Theorem 1 closely resemble the well-known guarantees for the conjugate gradient method [35], which they include as the special case $R = \infty$ and $\lambda_{\min} \ge 0$.
For convex problems, the\nradius constraint kxk \uf8ff R always improves the conditioning of the problem, as max\n;\nmin max+?\nmin+?\nthe smaller R is, the better conditioned the problem becomes. For non-convex problems, the sublin-\near rate features an additional logarithmic term that captures the role of the eigenvector umin. The\n\n3\n\n\f\ufb01rst rate (4) is similar to those of Zhang et al. [39, Thm. 4.11], though with somewhat more explicit\ndependence on t.\nIn the \u201chard case,\u201d which corresponds to uT\nminb = 0 and min + ? = 0 (cf. [11, Ch. 7]), both the\nbounds in Theorem 1 become vacuous, and indeed str\nt may not converge to the global minimizer in\nthis case. However, as the bound (5) depends only logarithmically on uT\nminb, it remains valid even\nextremely close to the hard case. In Section 2.5 we describe two simple randomization techniques\nwith convergence guarantees that are valid in the hard case as well.\n\n2.2 Proof sketch\nOur analysis reposes on two elementary observations. First, we note that Krylov subspaces are\ninvariant to shifts by scalar matrices, i.e. Kt(A, b) = Kt(A, b) for any A, b, t where 2 R, and\n\nA := A + I.\n\nSecond, we observe that for every point x and 2 R\n\n\n2\n\n(str\n?2 kxk2)\n\n? ) +\n\n2 (kstr\n\nfA,b(x) fA,b(str\n\n? ) = fA,b(x) fA,b(str\n\n(6)\nOur strategy then is to choose such that A \u232b 0, and then use known results to \ufb01nd yt 2\nKt(A, b) = Kt(A, b) that rapidly reduces the \u201cconvex error\u201d term fA,b(yt) fA,b(str\n? ). We then\n?k2 kxtk2) is small.\nadjust yt to obtain a feasible point xt such that the \u201cnorm error\u201d term \nTo establish linear convergence, we take = ? and adjust the norm of yt by taking xt = (1 \u21b5)yt\nfor some small \u21b5 that guarantees xt is feasible and that the \u201cnorm error\u201d term is small. 
To establish\nsublinear convergence we set = min and take xt = yt + \u21b5 \u00b7 zt, where zt is an approximation\nfor umin within Kt(A, b), and \u21b5 is chosen to make kxtk = kstr\n?k. This means the \u201cnorm error\u201d\nvanishes, while the \u201cconvex error\u201d cannot increase too much, as Aminzt \u21e1 Aminumin = 0.\nOur approach for proving the sublinear rate of convergence is inspired by Ho-Nguyen and K\u0131l\u0131nc\u02db-\nKarzan [21], who also rely on Nesterov\u2019s method in conjunction with Lanczos-based eigenvector\napproximation. The analysis in [21] uses an algorithmic reduction, proposing to apply the Lanc-\nzos method (with a random vector instead of b) to approximate umin and min, then run Nesterov\u2019s\nmethod on an approximate version of the \u201cconvex error\u201d term, and then use the approximated eigen-\nvector to adjust the norm of the result. We instead argue that all the ingredients for this reduction\nalready exist in the Krylov subspace Kt(A, b), obviating the need for explicit eigenvector estimation\nor actual application of accelerated gradient descent.\n\n2.3 Building blocks\nOur proof uses the following classical results.\nLemma 1 (Approximate matrix inverse). Let \u21b5, satisfy 0 <\u21b5 \uf8ff , and let \uf8ff = /\u21b5. For\nany t 1 there exists a polynomial p of degree at most t 1, such that for every M satisfying\n\u21b5I M I ,\n\nkI M p(M )k \uf8ff 2e2t/p\uf8ff.\n\nLemma 2 (Convex trust-region problem). Let t 1, M \u232b 0, v 2 Rd and r 0, and let fM,v(x) =\n2 xT M x + vT x. There exists xt 2K t(M, v) such that\n\n1\n\nkxtk \uf8ff r and fM,v(xt) min\nkxk\uf8ffr\n\nfM,v(x) \uf8ff\n\n4max(M ) \u00b7 r2\n\n(t + 1)2\n\n.\n\nLemma 3 (Finding eigenvectors, [24, Theorem 4.2]). Let M \u232b 0 be such that uT M u = 0 for some\nunit vector u 2 Rd, and let v 2 Rd. For every t 1 there exists zt 2K t(M, v) such that\n\nkztk = 1 and zT\n\nt M zt \uf8ff\n\n2 )2 log2 2 + 4 kvk2\n(uT v)2! 
.\n\nkMk\n16(t 1\n\nWhile these lemmas are standard, their explicit forms are useful, and we prove them in Section C.1\nin the supplement. Lemmas 1 and 3 are consequences of uniform polynomial approximation results\n(cf. supplement, Sec. B). To prove Lemma 2 we invoke Tseng\u2019s results on a variant of Nesterov\u2019s\naccelerated gradient method [37], arguing that its iterates lie in the Krylov subspace.\n\n4\n\n\f2.4 Proof of Theorem 1\nLinear convergence Recalling the notation A? = A+?I, let yt = p(A?)b = p(A?)A?str\n? ,\nfor the p 2P t which Lemma 1 guarantees to satisfy kp(A?)A? Ik \uf8ff 2e2t/p\uf8ff(A? ). Let\n\nxt = (1 \u21b5)yt, where \u21b5 = kytk kstr\n?k\n\nmax{kstr\n\n?k ,kytk}\n?k for any value of kytk. Moreover\n= k(p(A?)A? I)str\n?k\n\nso that we are guaranteed kxtk \uf8ff kstr\n\n|\u21b5| = |k ytk kstr\n?k|\n\n?k ,kytk} \uf8ff kyt str\n?k\nkstr\n?k\n\nmax{kstr\n\nwhere the last transition used kp(A?)A? Ik \uf8ff 2e2t/p\uf8ff(A? ).\n2kA1/2\nSince b = A?str\n = ? and kxtk \uf8ff kstr\n\n? , we have fA? ,b(x) = fA? ,b(str\n\n?k therefore implies\n\n? ) + 1\n\n?\n\nkstr\n?k\n\n,\n\n\uf8ff 2e2t/p\uf8ff(A? ),\n\n(x str\n\n? )k2. The equality (6) with\n\n2\n\n1\n\n?\n\n?\n\nkstr\n\nstr\n\n?\n\n? ) \uf8ff\n\n(7)\n\n(8)\n\n(xt str\n\n(xt str\n\nWe also have,\n\n? )\n\nWhen kytk kstr\n\nfA,b(xt) fA,b(str\n\n?k we have kxtk = kstr\n\n? kxtk).\n\n?k and the second term vanishes. When kytk < kstr\n?k,\n\n2A1/2\n? (str\n+ ?str\n? \u21b52 \uf8ff 4e4t/p\uf8ff(A? )str\n? kytk) =str\n? .\n? kytk kytk\n?k \u00b7 (str\nstr\n? kxtk =str\n?\n? ) =([1 \u21b5]p(A?)A? I) A1/2\nA1/2\n? e2t/p\uf8ff(A? ),\n? \uf8ff 6A1/2\n? + |\u21b5|A1/2\n\uf8ff (1 + |\u21b5|)(p(A?)A? I) A1/2\n?2\u2318 e4t/p\uf8ff(A? ),\n? ) \uf8ff\u21e318str\n\n(9)\nwhere in the \ufb01nal transition we used our upper bounds on \u21b5 and kp(A?)A? Ik, as well as\n|\u21b5|\uf8ff 1. 
Substituting the bounds (8) and (9) into inequality (7), we have\n\n(10)\n?k2\nand the \ufb01nal bound follows from recalling that fA,b(0) fA,b(str\n2 kstr\nand substituting \uf8ff(A?) = (max + ?)/(min + ?). To conclude the proof we note that (1 \n\u21b5)p(A?) = (1 \u21b5)p(A + ?I) = \u02dcp(A) for some \u02dcp 2P t, so that xt 2K t(A, b) and kxtk \uf8ff R,\nand therefore fA,b(str\nSublinear convergence Let A0 := A minI \u232b 0 and apply Lemma 2 with M = A0, v = b and\nr = kstr\n\n?k to obtain yt 2K t(A0, b) = Kt(A, b) such that\n\n? + 4?str\n\nfA,b(xt) fA,b(str\n\nt ) \uf8ff fA,b(xt).\n\nT A?str\n\nT A?str\n\n? + ?\n\n? ) = 1\n\n2 str\n?\n\nstr\n\n?\n\nstr\n\n?\n\nstr\n\n?\n\n?\n\nkytk \uf8ffstr\n\n? and fA0,b(yt) fA0,b(str\n\n? ) \uf8ff fA0,b(yt) min\nkxk\uf8ffkstr\n?k\n\nfA0,b(x) \uf8ff\n\n?k2\n4kA0kkstr\n(t + 1)2\n\n. (11)\n\nIf min 0, equality (6) with = min along with (11) means we are done, recalling that\nkA0k = max min. For min < 0, apply Lemma 3 with M = A0 and v = b to obtain\nzt 2K t(A, b) such that\n\nkztk = 1 and zT\n\nt A0zt \uf8ff kA0k\n16(t 1\n\n2 )2 log2 4 kbk2\n\nminb)2! .\n\n(uT\n\n(12)\n\nWe form the vector\n\nand choose \u21b5 to satisfy\n\nxt = yt + \u21b5 \u00b7 zt 2K t(A, b),\n\nkxtk =str\n\n? and \u21b5 \u00b7 zT\n\nt (A0yt + b) = \u21b5 \u00b7 zT\n\nt rfA0,b(yt) \uf8ff 0.\n\nWe may always choose such \u21b5 because kytk \uf8ff kstr\n?k has both\na non-positive and a non-negative solution in \u21b5. Moreover because kztk = 1 we have that |\u21b5|\uf8ff\n\n?k and therefore kyt + \u21b5ztk = kstr\n\n5\n\n\f2kstr\ngives us,\n\n?k. 
The property \u21b5 \u00b7 zT\n\nt rfA0,b(yt) \uf8ff 0 of our construction of \u21b5 along with r2fA0,b = A0,\n\nfA0,b(xt) = fA0,b(yt) + \u21b5 \u00b7 zT\n\nt rfA0,b(yt) +\n\n\u21b52\n2\n\nzT\nt A0zt \uf8ff fA0,b(yt) +\n\n\u21b52\n2\n\nzT\nt A0zt.\n\nSubstituting this bound along with kxtk = kstr\n\n?k and \u21b52 \uf8ff 4kstr\n\n?k2 into (6) with = min gives\n\nt A0zt.\nSubstituting in the bounds (11) and (12) concludes the proof for the case min < 0.\n\n? ) \uf8ff fA0,b(yt) fA0,b(str\n\nfA,b(xt) fA,b(str\n\n? ) + 2str\n\n?2 zT\n\n2.5 Randomizing away the hard case\nKrylov subspace solutions may fail to converge to global solution when both ? = min and\nminb = 0, the so-called hard case [11, 30]. Yet as with eigenvector methods [24, 14], simple\nuT\nrandomization approaches allow us to handle the hard case with high probability, at the modest\ncost of introducing to the error bounds a logarithmic dependence on d. Here we describe two such\napproaches.\nIn the \ufb01rst approach, we draw a spherically symmetric random vector v, and consider the joint\nKrylov subspace\n\nK2t(A,{b, v}) := span{b, Ab, . . . , At1b, v, Av, . . . , At1v}.\n\nThe trust-region and cubic-regularized problems (1) can be solved ef\ufb01ciently in K2t(A,{b, v}) using\nthe block Lanczos method [12, 15]; we survey this technique in Section A.1 in the supplement. The\nanalysis in the previous section immediately implies the following convergence guarantee.\nCorollary 2. Let v be uniformly distributed on the unit sphere in Rd, and\n\n\u02c6str\nt 2\n\nargmin\n\nx2Kbt/2c(A,{b,v}),kxk\uf8ffR\n\nfA,b(x).\n\nFor any > 0,\n\nfA,b(\u02c6str\n\nt ) fA,b(str\n\n? ) \uf8ff\n\n(max min)R2\n\n(t 1)2\n\n\"16 + 2 \u00b7 I{min<0} log2 2pd\n !#\n\n(13)\n\n2 , d1\n\nwith probability at least 1 with respect to the random choice of v.\nProof. In the preceding proof of sublinear convergence, apply Lemma 2 on Kbt/2c(A, b)\nand Lemma 3 on Kbt/2c(A, v);\nthe constructed solution is in Kbt/2c(A,{b, v}). 
To bound\nminv|2/kvk2, note that its distribution is Beta( 1\nminv|2/kvk2 2/d with\n|uT\nprobability greater than 1 (cf. [5, Lemma 4.6]).\nCorollary 2 implies we can solve the trust-region problem to \u270f accuracy in roughly \u270f1/2 log d\nmatrix-vector products, even in the hard case. The main drawback of this randomization approach\nis that half the matrix-vector products are expended on the random vector; when the problem is\nwell-conditioned or when |uT\nminb|/kbk is not extremely small, using the standard subspace solution\nis nearly twice as fast.\nThe second approach follows the proposal [5] to construct a perturbed version of the linear term b,\ndenoted \u02dcb, and solve the problem instance (A, \u02dcb, R) in the Krylov subspace Kt(A, \u02dcb).\nCorollary 3. Let v be uniformly distributed on the unit sphere in Rd, let > 0 and let\n\n2 ) and therefore |uT\n\nLet \u02dcstr\n\nt 2 argminx2Kt(A,\u02dcb),kxk\uf8ffR fA,\u02dcb(x) := 1\n(max min)R2\nfA,b(\u02dcstr\n\nt ) fA,b(str\n\n? ) \uf8ff\n\n(t 1\n2 )2\n\n\u02dcb = b + \u00b7 v.\n\"4 +\n\nwith probability at least 1 with respect to the random choice of v.\n\n2 xT Ax + \u02dcbT x. For any > 0,\n\nI{min<0}\n\n2\n\nlog2 2k\u02dcbkpd\n\n !# + 2R\n\n(14)\n\n6\n\n\fmin\n\nSee section C.2 in the supplement for a short proof, which consists of arguing that fA,b and fA,\u02dcb\n\u02dcb|. For\ndeviate by at most R at any feasible point, and applying a probabilistic lower bound on |uT\nany desired accuracy \u270f, using Corollary 3 with = \u270f/(4R) shows we can achieve this accuracy, with\nconstant probability, in a number of Lanczos iterations that scales as \u270f1/2 log(d/\u270f2). Compared to\nthe \ufb01rst approach, this rate of convergence is asymptotically slightly slower (by a factor of log 1\n\u270f ),\nand moreover requires us to decide on a desired level of accuracy in advance. However, the second\napproach avoids the 2x slowdown that the \ufb01rst approach exhibits on easier problem instances. 
In Section 5 we compare the two approaches empirically.

We remark that the linear convergence guarantee (4) continues to hold for both randomization approaches. For the second approach, this is due to the fact that small perturbations to $b$ do not drastically change the condition number, as shown in [5]. However, this also means that we cannot expect a good condition number when perturbing $b$ in the hard case. Nevertheless, we believe it is possible to show that, with randomization, Krylov subspace methods exhibit linear convergence even in the hard case, where the condition number is replaced by the normalized eigen-gap $(\lambda_{\max} - \lambda_{\min})/(\lambda_2 - \lambda_{\min})$, with $\lambda_2$ the smallest eigenvalue of $A$ larger than $\lambda_{\min}$.

3 The cubic-regularized problem

We now consider the cubic-regularized problem
$$\operatorname*{minimize}_{x \in \mathbb{R}^d} \; \hat f_{A,b,\rho}(x) := f_{A,b}(x) + \frac{\rho}{3}\|x\|^3 = \frac{1}{2}x^T A x + b^T x + \frac{\rho}{3}\|x\|^3.$$
Any global minimizer of $\hat f_{A,b,\rho}$, denoted $s^{cr}_\star$, admits the characterization [9, Theorem 3.1]
$$\nabla \hat f_{A,b,\rho}(s^{cr}_\star) = \left(A + \rho\|s^{cr}_\star\| I\right) s^{cr}_\star + b = 0 \qquad\text{and}\qquad \rho\|s^{cr}_\star\| \ge -\lambda_{\min}. \qquad (15)$$
Comparing this characterization to its counterpart (3) for the trust-region problem, we see that any instance $(A, b, \rho)$ of cubic regularization has an equivalent trust-region instance $(A, b, R)$, with $R = \|s^{cr}_\star\|$. These instances are equivalent in that they have the same set of global minimizers. Evidently, the equivalent trust-region instance has optimal Lagrange multiplier $\lambda_\star = \rho\|s^{cr}_\star\|$. Moreover, at any trust-region feasible point $x$ (satisfying $\|x\| \le R = \|s^{cr}_\star\| = \|s^{tr}_\star\|$), the cubic-regularization optimality gap is smaller than its trust-region equivalent,
$$\hat f_{A,b,\rho}(x) - \hat f_{A,b,\rho}(s^{cr}_\star) = f_{A,b}(x) - f_{A,b}(s^{tr}_\star) + \frac{\rho}{3}\left(\|x\|^3 - \|s^{tr}_\star\|^3\right) \le f_{A,b}(x) - f_{A,b}(s^{tr}_\star).$$
Letting $s^{cr}_t$ denote the minimizer of $\hat f_{A,b,\rho}$ in $\mathcal{K}_t(A, b)$ and letting $s^{tr}_t$ denote the Krylov subspace solution of the equivalent trust-region problem, we conclude that
$$\hat f_{A,b,\rho}(s^{cr}_t) - \hat f_{A,b,\rho}(s^{cr}_\star) \le \hat f_{A,b,\rho}(s^{tr}_t) - \hat f_{A,b,\rho}(s^{cr}_\star) \le f_{A,b}(s^{tr}_t) - f_{A,b}(s^{tr}_\star); \qquad (16)$$
cubic-regularization Krylov subspace solutions always have a smaller optimality gap than their trust-region equivalents. The guarantees of Theorem 1 therefore apply to $\hat f_{A,b,\rho}(s^{cr}_t) - \hat f_{A,b,\rho}(s^{cr}_\star)$ as well, and we arrive at the following corollary.

Corollary 4. For every $t > 0$,
$$\hat f_{A,b,\rho}(s^{cr}_t) - \hat f_{A,b,\rho}(s^{cr}_\star) \le 36\left[\hat f_{A,b,\rho}(0) - \hat f_{A,b,\rho}(s^{cr}_\star)\right] \exp\left(-4t\sqrt{\frac{\lambda_{\min} + \rho\|s^{cr}_\star\|}{\lambda_{\max} + \rho\|s^{cr}_\star\|}}\right), \qquad (17)$$
and
$$\hat f_{A,b,\rho}(s^{cr}_t) - \hat f_{A,b,\rho}(s^{cr}_\star) \le \frac{(\lambda_{\max} - \lambda_{\min})\|s^{cr}_\star\|^2}{(t - \frac{1}{2})^2}\left[4 + \frac{\mathbb{I}_{\{\lambda_{\min} < 0\}}}{8}\log^2\left(\frac{4\|b\|^2}{(u_{\min}^T b)^2}\right)\right]. \qquad (18)$$

Proof. Use the slightly stronger bound (10) derived in the proof of Theorem 1 together with the inequality
$$18\, s^{tr\,T}_\star A_{\lambda_\star} s^{tr}_\star + 4\lambda_\star\|s^{tr}_\star\|^2 \le 36\left[\tfrac{1}{2}\, s^{cr\,T}_\star A_{\lambda_\star} s^{cr}_\star + \tfrac{1}{6}\rho\|s^{cr}_\star\|^3\right] = 36\left[\hat f_{A,b,\rho}(0) - \hat f_{A,b,\rho}(s^{cr}_\star)\right].$$

Here too it is possible to randomly perturb $b$ and obtain a guarantee for cubic regularization that applies in the hard case. In [5] we carry out such analysis for gradient descent, and show that perturbations to $b$ with norm $\sigma$ can increase $\|s^{cr}_\star\|^2$ by at most $2\sigma/\rho$ [5, Lemma 4.6]. Thus the cubic-regularization equivalent of Corollary 3 amounts to replacing $R^2$ with $\|s^{cr}_\star\|^2 + 2\sigma/\rho$ in (14).

We note briefly, without giving a full analysis, that Corollary 4 shows that the practically successful Adaptive Regularization using Cubics (ARC) method [9] can find $\epsilon$-stationary points in roughly $\epsilon^{-7/4}$ Hessian-vector product operations (with proper randomization and subproblem stopping criteria).
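For intuition on the characterization (15), on small dense instances one can solve the cubic-regularized problem directly by bisecting on the scalar $\lambda = \rho\|s\|$: since $\|(A + \lambda I)^{-1}b\|$ decreases in $\lambda$ while $\lambda/\rho$ increases, there is a unique crossing in the easy case ($u_{\min}^T b \ne 0$). The sketch below is our illustrative stand-in for that easy case, not the Lanczos-based approach the paper analyzes, and the function name is ours.

```python
import numpy as np

def solve_cubic_small(A, b, rho, tol=1e-12):
    """Minimize 0.5 x'Ax + b'x + (rho/3)||x||^3 for small dense symmetric A,
    using the stationarity condition (A + rho*||s|| I) s = -b with
    rho*||s|| >= -lambda_min, via bisection on lam = rho*||s||."""
    d = len(b)
    lam_min = np.linalg.eigvalsh(A)[0]
    lo = max(-lam_min, 0.0)              # feasibility: lam >= (-lambda_min)_+
    hi = lo + rho + np.linalg.norm(b) + 1.0
    # Grow hi until ||s(hi)|| < hi/rho, so the crossing lies inside [lo, hi].
    while np.linalg.norm(np.linalg.solve(A + hi * np.eye(d), -b)) > hi / rho:
        hi *= 2.0
    for _ in range(200):                 # bisection on the scalar lam
        lam = 0.5 * (lo + hi)
        s = np.linalg.solve(A + lam * np.eye(d), -b)
        if np.linalg.norm(s) > lam / rho:
            lo = lam                     # ||s(lam)|| too large: increase lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return s

# Indefinite example: lambda_min = -1, so the solution must have rho*||s|| >= 1.
A = np.diag([-1.0, 2.0])
b = np.array([1.0, 1.0])
s_cr = solve_cubic_small(A, b, rho=1.0)
```

This direct approach costs a dense eigendecomposition and repeated linear solves, which is exactly what the Krylov subspace method avoids in high dimensions.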
Researchers have given such guarantees for a number of algorithms that are mainly theoreti-\ncal [1, 8], as well as variants of accelerated gradient descent [6, 22], which while more practical still\nrequire careful parameter tuning. In contrast, ARC requires very little tuning and it is encouraging\nthat it may also exhibit the enhanced Hessian-vector product complexity \u270f7/4, which is at least\nnear-optimal [7].\n\n4 Lower bounds\n\nWe now show that the guarantees in Theorem 1 and Corollary 4 are tight up to numerical constants\nfor adversarially constructed problems. We state the result for the cubic-regularization problem;\ncorresponding lower bounds for the trust-region problem are immediate from the optimality gap\nrelation (16).1\nTo state the result, we require a bit more notation. Let L map cubic-regularization problem instances\nof the form (A, b, \u21e2) to the quadruple (min, max, ?, ) = L(A, b, \u21e2) such that min, max are\nthe extremal eigenvalues of A and the solution scr\n? k = ?,\nand \u02c6fA,b,\u21e2(0) \u02c6fA,b,\u21e2(scr\n? ) = . Similarly let L0 map an instance (A, b, \u21e2) to the quadruple\n(min, max,\u2327, R ) where now kscr\nminb| = \u2327, with umin an eigenvector of A\ncorresponding to eigenvalue min.\nWith this notation in hand, we state our lower bounds. (See supplemental section D for a proof.)\nTheorem 5. Let d, t 2 N with t < d and min, max, ?, be such that min \uf8ff max, ? >\n(min)+, and > 0. There exists (A, b, \u21e2) such that L(A, b, \u21e2) = (min, max, ?, ) and for\nall s 2K t(A, b),\n\n\u02c6fA,b,\u21e2(x) satis\ufb01es \u21e2kscr\n\n? k = R and kbk /|uT\n\n? = argminx\n\n\u02c6fA,b,\u21e2(s) \u02c6fA,b,\u21e2(scr\n\n? ) >\n\n1\n\nKh \u02c6fA,b,\u21e2(0) \u02c6fA,b,\u21e2(scr\n\n? )i exp\u21e2\n\n4t\n\np\uf8ff 1 ,\n\n. 
Alternatively, for any \u2327 1 and R > 0, there exists\n\n?\n\n3(?+min) and \uf8ff = ?+max\n?+min\n\nwhere K = 1 +\n(A, b, \u21e2) such that L0(A, b, \u21e2) = (min, max,\u2327, R ) and for s 2K t(A, b),\n\u02c6fA,b,\u21e2(s) \u02c6fA,b,\u21e2(scr\nand\n\n? ) > min((max) min,\n\nmax min\n16(t 1\n\n2 )2 log2 kbk2\n\n(uT\n\nminb)2!) kscr\n? k2\n32\n\n(19)\n\n, (20)\n\n(21)\n\n\u02c6fA,b,\u21e2(s) \u02c6fA,b,\u21e2(scr\n\n? ) >\n\n? k2\n(max min)kscr\n\n.\n\n16(t + 1\n\n2 )2\n\nThe lower bounds (19) matches the linear convergence guarantee (17) to within a numerical constant,\nas we may choose max, min and ? so that \uf8ff is arbitrary and K < 2. Similarly, lower bounds (20)\nand (21) match the sublinear convergence rate (18) for min < 0 and min 0 respectively. Our\nproof \ufb02ows naturally from minimax characterizations of uniform polynomial approximations (Lem-\nmas 4 and 5 in the supplement), which also play a crucial role in proving our upper bounds.\nOne consequence of the lower bound (19) is the existence of extremely badly conditioned instances,\nsay with \uf8ff = (100d)2 and K = 3/2, such that in the \ufb01rst d 1 iterations it is impossible to decrease\nthe initial error by more than a factor of 2 (the initial error may be chosen arbitrarily large as well).\nHowever, since these instances have \ufb01nite condition number we have scr\n? 2K d(A, b), and so the\nerror supposedly drops to 0 at the dth iteration. This seeming discontinuity stems from the fact that\n1To obtain the correct prefactor in the trust-region equivalent of lower bound (19) we may use the fact that\n\n\u02c6fA,b,\u21e2(0) \u02c6fA,b,\u21e2(scr\n\n? ) = 1\n\n2 bT A1\n\n? b + \u21e2\n\n6 kscr\n\n? k3 1\n\n3 ( 1\n\n2 bT A1\n\n? b + ?\n\n2 R2) = 1\n\n3 (fA,b(0) fA,b(str\n\n? )).\n\n8\n\n\f(a)\n\n(b)\n\nFigure 1: Optimality gap of Krylov subspace solutions on random cubic-regularization problems,\nversus subspace dimension t. 
(a) Columns show ensembles with different condition numbers \uf8ff, and\nrows differ by scaling of t. Thin lines indicate results for individual instances, and bold lines indicate\nensemble median and maximum suboptimality. (b) Each line represents median suboptimality, and\nshaded regions represent inter-quartile range. Different lines correspond to different randomization\nsettings.\n\nin this case scr\n? depends on the Lanczos basis of Kd(A, b) through a very badly conditioned linear\nsystem and cannot be recovered with \ufb01nite-precision arithmetic. Indeed, running Krylov subspace\nmethods for d iterations with inexact arithmetic often results in solutions that are very far from exact,\nwhile guarantees of the form (17) are more robust to roundoff errors [4, 13, 35].\nWhile we state the lower bounds in Theorem 5 for points in the Krylov subspace Kt(A, b), a clas-\nsical \u201cresisting oracle\u201d construction due to Nemirovski and Yudin [27, Chapter 7.2] (see also [26,\n\u00a710.2.3]) shows that (for d > 2t) these lower bounds hold also for any deterministic method that\naccesses A only through matrix-vector products, and computes a single matrix-vector product per\niteration. The randomization we employ in Corollaries 2 and 3 breaks the lower bound (20) when\nmin < 0 and kbk /|uT\nminb| is very large, so there is some substantial power from randomization in\nthis case. However, Simchowitz [34] recently showed that randomization cannot break the lower\nbounds for convex quadratics (min 0 and \u21e2 = 0).\n5 Numerical experiments\n\n? k)/(min + \u21e2kscr\n\nTo see whether our analysis applies to non-worst case problem instances, we generate 5,000 ran-\ndom cubic-regularization problems with d = 106 and controlled condition number \uf8ff = (max +\n? k) (see Section E in the supplement for more details). We repeat the ex-\n\u21e2kscr\nperiment three times with different values of \uf8ff and summarize the results in Figure 1a. 
As seen in the figure, about 20 Lanczos iterations suffice to solve even the worst-conditioned instances to about 10% accuracy, and 100 iterations give accuracy better than 1%. Moreover, for t ≳ √κ, the approximation error decays exponentially with precisely the rate 4/√κ predicted by our analysis, for almost all the generated problems. For t ≲ √κ, the error decays approximately as t^{-2}. We conclude that the rates characterized by Theorem 1 are relevant beyond the worst case.

We conduct an additional experiment to test the effect of randomization for "hard case" instances, where κ = ∞. We generate such problem instances (see details in Section E) and compare the joint subspace randomization scheme (Corollary 2) to the perturbation scheme (Corollary 3) with different perturbation magnitudes; the results are shown in Figure 1b. For any fixed target accuracy, some perturbation magnitudes yield faster convergence than the joint subspace scheme. However, for any fixed magnitude, optimization eventually hits a noise floor due to the perturbation, while the joint subspace scheme continues to improve. Choosing the magnitude requires striking a balance: if it is too large, the noise floor is high and might even be worse than no perturbation at all; if it is too small, escaping the unperturbed error level takes too long, and the method might falsely declare convergence. A practical heuristic for safely choosing the perturbation magnitude is an interesting topic for future research.

Acknowledgments

We thank the anonymous reviewers for several helpful questions and suggestions. Both authors were supported by NSF-CAREER Award 1553086 and the Sloan Foundation. YC was partially supported by the Stanford Graduate Fellowship.

References

[1] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing, 2017.

[2] Z.
Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS '17, 2017.

[3] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg. Convergence rate analysis of a stochastic trust region method for nonconvex optimization. arXiv:1609.07428 [math.OC], 2016.

[4] C. Musco, C. Musco, and A. Sidford. Stability of the Lanczos method for matrix function approximation. arXiv:1708.07788 [cs.DS], 2017.

[5] Y. Carmon and J. C. Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv:1612.00547 [math.OC], 2016.

[6] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[7] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: First order methods. arXiv:1711.00841 [math.OC], 2017.

[8] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for non-convex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018. URL https://arxiv.org/abs/1611.00756.

[9] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, Series A, 127:245–295, 2011.

[10] E. S. Coakley and V. Rokhlin. A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Applied and Computational Harmonic Analysis, 34(3):379–414, 2013.

[11] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. MPS-SIAM Series on Optimization. SIAM, 2000.

[12] J. Cullum and W. E. Donath.
A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices. In 1974 IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes, volume 13, pages 505–509. IEEE, 1974.

[13] V. Druskin and L. Knizhnerman. Error bounds in the simple Lanczos procedure for computing functions of symmetric matrices and eigenvalues. U.S.S.R. Computational Mathematics and Mathematical Physics, 31(7):970–983, 1991.

[14] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[15] G. H. Golub and R. Underwood. The block Lanczos method for computing eigenvalues. In Mathematical Software, pages 361–377. Elsevier, 1977.

[16] N. I. M. Gould, D. Orban, and P. L. Toint. GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization. ACM Transactions on Mathematical Software (TOMS), 29(4):353–372, 2003.

[17] N. I. M. Gould, S. Lucidi, M. Roma, and P. L. Toint. Solving the trust-region subproblem using the Lanczos method. SIAM Journal on Optimization, 9(2):504–525, 1999.

[18] A. Griewank. The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical report NA/12, 1981.

[19] E. Hazan and T. Koren. A linear-time algorithm for trust region problems. Mathematical Programming, Series A, 158(1):363–381, 2016.

[20] M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 1952.

[21] N. Ho-Nguyen and F. Kılınç-Karzan. A second-order cone based approach for solving the trust-region subproblem and its variants. arXiv:1603.03366 [math.OC], 2016.

[22] C. Jin, P. Netrapalli, and M. I. Jordan.
Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv:1711.10456 [cs.LG], 2017.

[23] J. M. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[24] J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.

[25] F. Lenders, C. Kirches, and A. Potschka. trlib: A vector-free implementation of the GLTR method for iterative solution of the trust region problem. Optimization Methods and Software, 33(3):420–449, 2018.

[26] A. Nemirovski. Efficient methods in convex programming. Technion: The Israel Institute of Technology, 1994.

[27] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[28] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.

[29] Y. Nesterov and B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Series A, 108:177–205, 2006.

[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.

[31] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[32] J. Regier, M. I. Jordan, and J. McAuliffe. Fast black-box variational inference through stochastic trust-region optimization. In Advances in Neural Information Processing Systems 31, 2017.

[33] N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.

[34] M. Simchowitz. On the randomized complexity of minimizing a convex quadratic function. arXiv:1807.09386 [cs.LG], 2018.

[35] L. N. Trefethen and D. Bau III. Numerical Linear Algebra. SIAM, 1997.

[36] N.
Tripuraneni, M. Stern, C. Jin, J. Regier, and M. I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. arXiv:1711.02838 [cs.LG], 2017.

[37] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008. URL http://www.mit.edu/~dimitrib/PTseng/papers/apgm.pdf.

[38] Z. Yao, P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Inexact non-convex Newton-type methods. arXiv:1802.06925 [math.OC], 2018.

[39] L.-H. Zhang, C. Shen, and R.-C. Li. On the generalized Lanczos trust-region method. SIAM Journal on Optimization, 27(3):2110–2142, 2017.