{"title": "First-order methods almost always avoid saddle points: The case of vanishing step-sizes", "book": "Advances in Neural Information Processing Systems", "page_first": 6474, "page_last": 6483, "abstract": "In a series of papers [Lee et al 2016], [Panageas and Piliouras 2017], [Lee et al 2019], it was established that some of the most commonly used first order methods almost surely (under random initializations) and with step-size being small enough, avoid strict saddle points, as long as the objective function $f$ is $C^2$ and has Lipschitz gradient. The key observation was that first order methods can be studied from a dynamical systems perspective, in which instantiations of Center-Stable manifold theorem allow for a global analysis. The results of the aforementioned papers were limited to the case where the step-size $\\alpha$ is constant, i.e., does not depend on time (and typically bounded from the inverse of the Lipschitz constant of the gradient of $f$). It remains an open question whether or not the results still hold when the step-size is time dependent and vanishes with time.\n\nIn this paper, we resolve this question on the affirmative for gradient descent, mirror descent, manifold descent and proximal point. The main technical challenge is that the induced (from each first order method) dynamical system is time non-homogeneous and the stable manifold theorem is not applicable in its classic form. 
By exploiting the dynamical systems structure of the aforementioned first order methods, we are able to prove a stable manifold theorem that is applicable to time non-homogeneous dynamical systems and generalize the results in [Lee et al 2019] for time dependent step-sizes.", "full_text": "First-order methods almost always avoid saddle\n\npoints: The case of vanishing step-sizes\n\nIoannis Panageas\n\nSUTD\n\nSingapore\n\nGeorgios Piliouras\n\nSUTD\n\nSingapore\n\nioannis@sutd.edu.sg\n\ngeorgios@sutd.edu.sg\n\nXiao Wang\n\nSUTD\n\nSingapore\n\nxiao_wang@sutd.edu.sg\n\nAbstract\n\nIn a series of papers [17, 22, 16], it was established that some of the most commonly\nused \ufb01rst order methods almost surely (under random initializations) and with step-\nsize being small enough, avoid strict saddle points, as long as the objective function\nf is C 2 and has Lipschitz gradient. The key observation was that \ufb01rst order methods\ncan be studied from a dynamical systems perspective, in which instantiations of\nCenter-Stable manifold theorem allow for a global analysis. The results of the\naforementioned papers were limited to the case where the step-size \u03b1 is constant,\ni.e., does not depend on time (and bounded from the inverse of the Lipschitz\nconstant of the gradient of f). It remains an open question whether or not the\nresults still hold when the step-size is time dependent and vanishes with time.\nIn this paper, we resolve this question on the af\ufb01rmative for gradient descent, mirror\ndescent, manifold descent and proximal point. 
The main technical challenge is that the dynamical system induced by each first-order method is time non-homogeneous, and the stable manifold theorem is not applicable in its classic form. By exploiting the dynamical systems structure of the aforementioned first-order methods, we are able to prove a stable manifold theorem that is applicable to time non-homogeneous dynamical systems and generalize the results in [16] to vanishing step-sizes.\n\n1 Introduction\n\nNon-convex optimization has been studied extensively in recent years and has been one of the main focuses of the Machine Learning community. The reason behind this interest is that in many applications of interest, one has to deal with the optimization of a non-convex landscape. One of the key obstacles of non-convex optimization is the existence of numerous saddle points (which can outnumber the local minima [10, 24, 6]). Avoiding them is a fundamental challenge for ML [14]. Recent progress [11, 16] has shown that under mild regularity assumptions on the objective function, first-order methods such as gradient descent can provably avoid the so-called strict saddle points1.\n\n1These are saddle points where the Hessian of the objective admits at least one direction of negative curvature. Such a property has been shown to hold in a wide range of objective functions, see [11, 29, 28, 13, 12, 3] and references therein.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn particular, a unified theoretical framework is established in [16] to analyze the asymptotic behavior of first-order optimization algorithms such as gradient descent, mirror descent, proximal point, coordinate descent and manifold descent. It is shown that under random initialization, the aforementioned methods avoid strict saddle points almost surely. 
The proof exploits a powerful theorem from the dynamical systems literature, the so-called Stable-manifold theorem (see the supplementary material for a statement of this theorem). For example, given a C² (twice continuously differentiable) function f with L-Lipschitz gradient, the gradient descent method\n\nxk+1 = g(xk) := xk − α∇f(xk)\n\navoids strict saddle points almost surely, under the assumption that the stepsize is constant and 0 < α < 1/L. The crux of the proof in [16] is the Stable-manifold theorem for the time-homogeneous2 dynamical system xk+1 = g(xk). The Stable-manifold theorem implies that the dynamical system g avoids its unstable fixed points, and together with the fact that the unstable fixed points of g coincide with the strict saddles of f, the claim follows. Results of a similar flavor can be shown for the Expectation Maximization algorithm [19], Multiplicative Weights Update [18, 23] and for min-max optimization [9].\n\nIn many applications/algorithms, however, the stepsize is adaptive or vanishing/diminishing (meaning limk αk = 0, e.g., αk = 1/k or 1/√k). Such applications include stochastic gradient descent (see [27] for an analysis of SGD for convex functions), urn models and stochastic approximation [25], gradient descent [4], and online learning algorithms like multiplicative weights update [1, 15] (which is an instantiation of Mirror Descent with the entropic regularizer). It is also important to note that the choice of the stepsize is crucial in the aforementioned applications, as changing the stepsize can change the convergence properties (transition from convergence to oscillations/chaos [20, 21, 8, 5]), the rate of convergence [20], as well as the system efficiency [7].\n\nThe proof in [16] does not carry over when the stepsize depends on time, because the Stable-manifold theorem is not applicable. 
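To make this behavior concrete, the following small numerical sketch (ours; the objective and step-size schedules are illustrative choices) runs gradient descent on f(x, y) = x² − y², whose unique critical point (0, 0) is a strict saddle: with non-summable step-sizes αk = 1/(k+1) the iterate escapes the saddle, while with summable step-sizes αk = 1/(k+1)² it converges, but to a non-critical point.

```python
import numpy as np

def grad_descent(alpha, x0, steps):
    """Run x_{k+1} = x_k - alpha(k) * grad f(x_k) for f(x, y) = x^2 - y^2."""
    x = np.array(x0, dtype=float)
    for k in range(steps):
        grad = np.array([2 * x[0], -2 * x[1]])  # gradient of x^2 - y^2
        x = x - alpha(k) * grad
    return x

x0 = (0.5, 1e-3)  # a generic start with a tiny component on the unstable axis

# Non-summable step-sizes (Omega(1/k)): the iterate is repelled from the saddle (0, 0).
escaped = grad_descent(lambda k: 1.0 / (k + 1), x0, 200)

# Summable step-sizes (O(1/k^2)): the iterate converges, but to a NON-critical point.
stuck = grad_descent(lambda k: 1.0 / (k + 1) ** 2, x0, 200)

print(abs(escaped[1]))  # grows along the negative-curvature direction
print(stuck, np.linalg.norm([2 * stuck[0], -2 * stuck[1]]))  # limit has nonzero gradient
```

The summable case illustrates Remark 3.1 below: the iterates converge because the total movement is finite, not because a critical point was reached.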
Hence, whether the results of [16] hold for vanishing step-sizes remained unresolved. This was stated explicitly as an open question in [16]. Our work resolves this question in the affirmative. Our main result is stated below informally.\n\nTheorem 1.1 (Informal). Gradient Descent, Mirror Descent, Proximal Point and Manifold Descent, with vanishing step-size αk of order Ω(1/k), avoid the set of strict saddle points (isolated and non-isolated) almost surely under random initialization.\n\nOrganization of the paper. The paper is organized as follows: In Section 2 we give important definitions for the rest of the paper; in Section 3 we provide intuition and a technical overview of our results; in Section 4 we show a new Stable-manifold theorem that is applicable to a class of time non-homogeneous dynamical systems; and finally in Section 5 we show how this new manifold theorem can be applied to Gradient Descent, Mirror Descent, Proximal Point and Manifold Descent. Due to space constraints, most of the proofs can be found in the supplementary material.\n\nNotation. Throughout this paper, we denote by N the set of nonnegative integers and by R the set of real numbers, by ‖·‖ the Euclidean norm, by bold x a vector, by B(x, δ) the open ball centered at x with radius δ, by g(k, x) the update rule of an optimization algorithm indexed by k ∈ N, by g̃(m, n, x) the composition g(m, ..., g(n + 1, g(n, x))...) for m ≥ n, by ∇f the gradient of f : R^d → R and by ∇²f(x) the Hessian of f at x, and by Dxg(k, x) the differential with respect to the variable x.\n\n2 Preliminaries\n\nIn this section we provide all necessary definitions that will be needed for the rest of the paper.\n\nDefinition 2.1 (Time (non)-homogeneous). We call a dynamical system xk+1 = g(xk) time homogeneous, since g does not depend on the step k. 
Furthermore, we call a dynamical system xk+1 = g(k, xk) time non-homogeneous, as g depends on k.\n\nDefinition 2.2 (Critical point). Given a C² (twice continuously differentiable) function f : X → R where X is an open, convex subset of R^d, the following definitions are provided for completeness.\n\n2This means that g does not depend on time. In the dynamical systems/differential equations literature such systems are called "autonomous", whereas time-dependent systems are called "non-autonomous".\n\n1. A point x* is a critical point of f if ∇f(x*) = 0.\n\n2. A critical point is a local minimum if there is a neighborhood U around x* such that f(x*) ≤ f(x) for all x ∈ U, and a local maximum if f(x*) ≥ f(x) for all x ∈ U.\n\n3. A critical point is a saddle point if for all neighborhoods U around x*, there are x, y ∈ U such that f(x) ≤ f(x*) ≤ f(y).\n\n4. A critical point x* is isolated if there is a neighborhood U around x* such that x* is the only critical point in U.\n\nThis paper focuses on saddle points that have directions of strictly negative curvature, captured by the notion of a strict saddle.\n\nDefinition 2.3 (Strict Saddle). A critical point x* of f is a strict saddle if λmin(∇²f(x*)) < 0 (the minimum eigenvalue of the Hessian computed at the critical point is negative).\n\nLet X* be the set of strict saddle points of the function f; we follow Definition 2 of [16] for the global stable set of X*.\n\nDefinition 2.4 (Global Stable Set and fixed points). Given a dynamical system (e.g., gradient descent xk+1 = xk − αk∇f(xk))\n\nxk+1 = g(k, xk),     (1)\n\nthe global stable set W^s(X*) of X* is the set of initial conditions for which the sequence xk converges to a strict saddle. 
This is defined as:\n\nW^s(X*) = {x0 : lim_{k→∞} xk ∈ X*}.\n\nMoreover, z is called a fixed point of the system (1) if z = g(k, z) for all natural numbers k.\n\nDefinition 2.5 (Manifold). A C^k-differentiable, d-dimensional manifold is a topological space M, together with a collection of charts {(Uα, φα)}, where each φα is a C^k-diffeomorphism from an open subset Uα ⊂ M to R^d. The charts are compatible in the sense that, whenever Uα ∩ Uβ ≠ ∅, the transition map φα ◦ φβ^{-1} : φβ(Uβ ∩ Uα) → R^d is C^k.\n\n3 Intuition and Overview\n\nIn this section we illustrate why gradient descent and related first-order methods do not converge to saddle points, even for time varying/vanishing step-sizes αk of order Ω(1/k).\n\n3.1 Intuition\n\nConsider the case of a quadratic, f(x) = (1/2) xᵀAx, where A = diag(λ1, ..., λd) is a d × d, non-singular, diagonal matrix with at least one negative eigenvalue. Let λ1, ..., λj be the positive eigenvalues of A (the first j) and λj+1, ..., λd be the non-positive ones. It is clear that x* = 0 is the unique critical point of f and that the Hessian ∇²f is A everywhere (and hence at the critical point). Moreover, it is clear that x* is a strict saddle point (not a local minimum).\n\nGradient descent with step-sizes αk (where αk ≥ 0 for all k and lim_{k→∞} αk = 0) has the following form:\n\nxk+1 = xk − αkAxk = (I − αkA)xk.\n\nAssuming that x0 is the starting point, it holds that xk+1 = (∏_{t=0}^{k} (I − αtA)) x0. We conclude that\n\nxk+1 = diag( ∏_{t=0}^{k} (1 − λ1αt), ..., ∏_{t=0}^{k} (1 − λdαt) ) x0.     (2)\n\nWe examine when it is true that lim_{k→∞} xk = x*. It is clear that ∏_{t=0}^{∞} (1 − λαt) = e^{∑_{t=0}^{∞} ln(1 − λαt)}, which has the same convergence properties as\n\ne^{−λ ∑_{t=0}^{∞} αt}.     (3)\n\nFor λ > 0, the term (3) converges to zero if and only if ∑_{t=0}^{∞} αt = +∞, which is true if αt is Ω(1/t). Moreover, for λ = 0 the term (3) remains constant (independently of the choice of the stepsizes αk), and for λ < 0 the term (3) diverges when αt is Ω(1/t). Therefore, for αk being Ω(1/k) we conclude that lim_{k→∞} xk = 0 exactly when the initial point x0 satisfies x0^i = 0 (i-th coordinate of x0) for every i with λi ≤ 0.\n\nHence, if x0 ∈ E^s := span(e1, . . . , ej)3, then xt converges to the saddle point x*, and if x0 has a component outside E^s then gradient descent diverges. For the example above, the global stable set of x* is the subspace E^s, which is of measure zero since E^s is not full dimensional.\n\nRemark 3.1 (αk of order O(1/k^{1+ε})). In the case where αk is a sequence of stepsizes that converges to zero at a rate 1/k^{1+ε} for some ε > 0 (for example 1/k², 1/2^k, etc.), the sum ∑_{t=0}^{∞} αt converges, and hence in our example above we conclude that lim_{k→∞} xk exists, i.e., xk converges, but not necessarily to a critical point.\n\n3.2 Technical Overview\n\nThe stability of non-homogeneous (i.e., non-autonomous) systems, at least for the case of continuous-time systems, has been the subject of intensive investigation ([2] and references therein). 
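Before moving to the general case, the step-size dichotomy derived in Section 3.1 is easy to check numerically; the sketch below (ours, with an illustrative eigenvalue λ = ±0.5) evaluates the diagonal products appearing in (2) under a non-summable and a summable schedule:

```python
def product(lam, alpha, steps):
    """Evaluate prod_{t=0}^{steps-1} (1 - lam * alpha(t)) factor by factor."""
    p = 1.0
    for t in range(steps):
        p *= 1.0 - lam * alpha(t)
    return p

harmonic = lambda t: 1.0 / (t + 1)       # Omega(1/k): the sum of step-sizes diverges
summable = lambda t: 1.0 / (t + 1) ** 2  # O(1/k^2): the sum of step-sizes converges

# lam > 0 (stable direction): the product vanishes iff the steps are non-summable.
print(product(0.5, harmonic, 10 ** 5))   # tends to 0
print(product(0.5, summable, 10 ** 5))   # stays bounded away from 0
# lam < 0 (unstable direction): with Omega(1/k) steps the product blows up.
print(abs(product(-0.5, harmonic, 100)))
```

This matches the trichotomy for the term (3): convergence to zero, a nonzero constant, or divergence, depending on the sign of λ and the summability of the step-sizes.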
Although\nsome work on discrete-time systems exists [26], this area is less developed and as far as we know\nno explicit connections to optimization applications have been made before. Moreover, as far as\ngradient descent, mirror descent, etc are concerned, the corresponding dynamical system that needs\nto be analyzed is more complicated when the objective function is not quadratic and the analysis of\nprevious subsection does not apply.\nSuppose we are given a function f that is C 2, and 0 is a saddle point of f. The Taylor expansion of\nthe gradient descent in a neighborhood of 0 is as follows:\n\nxk+1 = (I \u2212 \u03b1k\u22072f (0))xk + \u03b7(k, xk),\n\n(4)\n\nwhere \u03b7(k, 0) = 0 and \u03b7(k, x) is of order o((cid:107)x(cid:107)) around 0 for all naturals k.\nDue to the error term \u03b7(k, xk), the approach for quadratic functions does not imply the existence\nof the stable manifold. Inspired by the proof of Stable-manifold theorem for time homogeneous\nODEs, we prove a Stable-manifold theorem for discrete time non-homogeneous dynamical system\n(4). In words, we prove the existence of a manifold W s that is not of full dimension (it has the\nsame dimension as Es, where Es denotes the subspace that is spanned by the eigenvectors with\ncorresponding positive eigenvalues of matrix \u22072f (0)).\nTo show this, we derive the expression of (2) for the general function f to be:\n\nk(cid:88)\n\ni=0\n\nA (k, i + 1) \u03b7 (i, xi) ,\n\nxk+1 = A (k, 0) x0 +\n\nwhere A (m, n) =(cid:0)I \u2212 \u03b1m\u22072f (0)(cid:1) ...(cid:0)I \u2212 \u03b1n\u22072f (0)(cid:1) for m \u2265 n, and A (m, n) = I if m < n.\n\nNext, we generate a sequence {xk}k\u2208N from (5) with an initial point x0 = (x+\n0 \u2208 Es\n0 \u2208 Eu. If this sequence converges to 0, the equation (5) induces an operator T on the space\nand x\u2212\nof sequences converging to 0, and the sequence {xk}k\u2208N is the \ufb01xed point of T . 
This is the so-called Lyapunov–Perron method (see the supplementary material for a brief overview of the method). By the Banach fixed point theorem (see the supplementary material for the statement of the theorem), it can be proved that the sequence {xk}k∈N (as the fixed point of T) exists and is unique. Furthermore, this implies that there is a unique x0^− corresponding to x0^+, i.e., there exists a well-defined function ϕ : E^s → E^u such that x0^− = ϕ(x0^+).\n\n4 Stable Manifold Theorem for Time Non-homogeneous Dynamical Systems\n\nWe start this section by showing the main technical result of this paper. This is a new stable manifold theorem that works for time non-homogeneous dynamical systems and is used to prove our main result (Theorem 1.1) for Gradient Descent, Mirror Descent, Proximal Point and Manifold Descent. The\n\n3{e1, ..., ed} denotes the standard orthogonal basis of R^d.\n\nproof of this theorem exploits the structure of the aforementioned first-order methods as dynamical systems.\n\nTheorem 4.1 (A new stable manifold theorem). Let H be a d × d real diagonal matrix with at least one negative eigenvalue, i.e., H = diag{λ1, ..., λd} with λ1 ≥ λ2 ≥ ... ≥ λs > 0 ≥ λs+1 ≥ ... ≥ λd and assume λd < 0. Let η(k, x) be a continuously differentiable function such that η(k, 0) = 0 and for each ε > 0 there exists a neighborhood of 0 in which it holds that\n\n‖η(k, x) − η(k, y)‖ ≤ αk ε ‖x − y‖, for all naturals k.     (6)\n\nLet {αk}k∈N be a sequence of positive real numbers of order Ω(1/k) that converges to zero. We define the time non-homogeneous dynamical system\n\nxk+1 = g(k, xk), where g(k, x) = (I − αkH)x + η(k, x).     (7)\n\nSuppose that E = E^s ⊕ E^u, where E^s is the span of the eigenvectors corresponding to positive eigenvalues of H, and E^u is the span of the eigenvectors corresponding to nonpositive eigenvalues of H. Then there exists a neighborhood U of 0 and a C¹-manifold V(0) in U that is tangent to E^s at 0, such that for all x0 ∈ V(0), lim_{k→∞} xk = 0. Moreover, ∩_{k=0}^{∞} g̃^{-1}(k, 0, U) ⊂ V(0).\n\nWe can generalize Theorem 4.1 to the case where the matrix H is diagonalizable and to any fixed point x* (instead of 0, using a shifting argument). The statement is given below.\n\nCorollary 4.2. Let {αk}k∈N be a sequence of positive real numbers that converges to zero; additionally, αk ∈ Ω(1/k). Let g(k, x) : R^d → R^d be C¹ maps for all k ∈ N and let\n\nxk+1 = g(k, xk)     (8)\n\nbe a time non-homogeneous dynamical system. Assume x* is a fixed point, i.e., g(k, x*) = x* for all k ∈ N. Suppose the Taylor expansion of g(k, x) at x* in some neighborhood of x*,\n\ng(k, x) = g(k, x*) + Dxg(k, x*)(x − x*) + θ(k, x),     (9)\n\nsatisfies:\n\n1. Dxg(k, x*) = I − αkG, with G real diagonalizable with at least one negative eigenvalue;\n\n2. For each ε > 0, there exists an open neighborhood centered at x* of radius δ > 0, denoted B(x*, δ), such that\n\n‖θ(k, u1) − θ(k, u2)‖ ≤ αk ε ‖u1 − u2‖     (10)\n\nfor all k ∈ N and all u1, u2 ∈ B(x*, δ).\n\nThen there exists an open neighborhood U of x* and a C¹-manifold W(x*) in U, with codimension at least one, such that for x0 ∈ W(x*), lim_{k→∞} xk = x*. Moreover, ∩_{k=0}^{∞} g̃^{-1}(k, 0, U) ⊂ W(x*).\n\nProof. Since G is diagonalizable, there exists an invertible matrix Q such that G = Q^{-1}HQ, hence QGQ^{-1} = H, where H = diag{λ1, ..., λd} (i.e., H is a diagonal matrix with entries λ1, ..., λd). Consider the map z = ϕ(x) = Q(x − x*). ϕ induces a new dynamical system in terms of z as follows:\n\nQ^{-1}zk+1 = (I − αkG)Q^{-1}zk + θ(k, Q^{-1}zk + x*).\n\nMultiplying by Q from the left on both sides, we have\n\nzk+1 = Q(I − αkG)Q^{-1}zk + Qθ(k, Q^{-1}zk + x*) = (I − αkH)zk + θ̂(k, zk),     (11)\n\nwhere θ̂(k, zk) = Qθ(k, Q^{-1}zk + x*). Denote by q(k, z) = (I − αkH)z + θ̂(k, z) the update rule given by equation (11). In order to apply Theorem 4.1, we next verify that θ̂(k, ·) satisfies condition (6) of Theorem 4.1 for all k ∈ N. It essentially suffices to verify that, given any ε > 0, there exists a δ′ > 0 such that\n\n‖θ̂(k, w1) − θ̂(k, w2)‖ = ‖Qθ(k, Q^{-1}w1 + x*) − Qθ(k, Q^{-1}w2 + x*)‖ ≤ αk ε ‖w1 − w2‖     (12)\n\nfor all w1, w2 ∈ B(0, δ′). Let us elaborate. 
According to (10) of condition 2, for any given ε > 0, since ε/(‖Q‖‖Q^{-1}‖) is also a small positive number, there exists a δ > 0 such that\n\n‖θ(k, u1) − θ(k, u2)‖ ≤ αk (ε/(‖Q‖‖Q^{-1}‖)) ‖u1 − u2‖\n\nfor all u1, u2 ∈ B(x*, δ). Denote V = Q(B(x*, δ) − x*), i.e.,\n\nV = {w ∈ R^d : w = Q(u − x*) for some u ∈ B(x*, δ)},\n\nand it is easy to see that 0 ∈ V. Since Q(u − x*) is a diffeomorphism (the composition of a translation and a linear isomorphism) from the open ball B(x*, δ) to R^d, V is an open neighborhood (not necessarily a ball) of 0. Therefore, there exists an open ball at 0 with radius δ′, denoted B(0, δ′), such that B(0, δ′) ⊂ V. Next we show that B(0, δ′) satisfies inequality (12). By the definition of V, for any w1, w2 ∈ B(0, δ′) ⊂ V, there exist u1, u2 ∈ B(x*, δ) such that\n\nw1 = Q(u1 − x*), w2 = Q(u2 − x*),     (13)\n\nand the inverse transformation is given by u1 = Q^{-1}w1 + x*, u2 = Q^{-1}w2 + x*. Plugging into inequality (12), we have\n\n‖θ̂(k, w1) − θ̂(k, w2)‖ = ‖Qθ(k, Q^{-1}w1 + x*) − Qθ(k, Q^{-1}w2 + x*)‖\n= ‖Qθ(k, u1) − Qθ(k, u2)‖\n≤ ‖Q‖ ‖θ(k, u1) − θ(k, u2)‖\n≤ ‖Q‖ αk (ε/(‖Q‖‖Q^{-1}‖)) ‖u1 − u2‖\n= ‖Q‖ αk (ε/(‖Q‖‖Q^{-1}‖)) ‖(Q^{-1}w1 + x*) − (Q^{-1}w2 + x*)‖\n≤ ‖Q‖ αk (ε/(‖Q‖‖Q^{-1}‖)) ‖Q^{-1}‖ ‖w1 − w2‖ = αk ε ‖w1 − w2‖.\n\nThus the verification is complete. As a consequence of Theorem 4.1, there exists a C¹-manifold V(0) such that for all z0 ∈ V(0), lim_{k→∞} q̃(k, 0, z0) = 0. For the neighborhood ϕ^{-1}(B(0, δ′)) of x*, denote by W(x*) the local stable set of the dynamical system given by g(k, x), i.e.,\n\nW(x*) = {x0 ∈ ϕ^{-1}(B(0, δ′)) : lim_{k→∞} g̃(k, 0, x0) = x*}.\n\nWe claim that W(x*) ⊂ ϕ^{-1}(V(0)), and the proof is as follows: Suppose x0 ∈ W(x*); then the sequence {xk}k∈N generated by xk+1 = g(k, xk) with initial condition x0 converges to x*. The map ϕ induces a sequence {zk}k∈N, where z0 = ϕ(x0) and\n\nzk+1 = ϕ(xk+1) = ϕ(g(k, xk))     (14)\n= Q(x* + (I − αkG)(xk − x*) + θ(k, xk) − x*)     (15)\n= Q(I − αkG)Q^{-1}zk + Qθ(k, Q^{-1}zk + x*)   (since xk = ϕ^{-1}(zk) = Q^{-1}zk + x*)     (16)\n= (I − αkH)zk + θ̂(k, zk).     (17)\n\nSince zk = ϕ(xk) and xk → x*, we have that zk → 0. This implies that the sequence zk generated by zk+1 = q(k, zk) with initial condition z0 converges to 0, meaning that z0 = ϕ(x0) ∈ V(0). Therefore W(x*) ⊂ ϕ^{-1}(V(0)). Let U = ϕ^{-1}(B(0, δ′)), and the proof is complete.\n\nWe conclude this section with the following corollary, which can be proved using standard arguments about the separability of R^d (every open cover has a countable subcover). We denote by W^s(A*) the set of initial conditions for which the given dynamical system g converges to a fixed point x* such that the matrix Dxg(k, x*) has an eigenvalue with absolute value greater than one for all k.\n\nCorollary 4.3. Let g(k, x) : R^d → R^d be the mappings defined in Corollary 4.2. Then W^s(A*) has Lebesgue measure zero.\n\n5 Applications\n\nIn this section, we apply Theorem 4.1 (or its Corollary 4.2) to four of the most commonly used first-order methods and we prove that each one of them avoids strict saddle points, even with vanishing stepsize αk of order Ω(1/k).\n\n5.1 Gradient Descent\n\nLet f(x) : R^d → R be a real-valued C² function, and let g(k, x) = x − αk∇f(x) be the update rule of gradient descent, where {αk}k∈N is a sequence of positive real numbers. Then\n\nxk+1 = xk − αk∇f(xk)     (18)\n\nis a time non-homogeneous dynamical system.\n\nTheorem 5.1. 
Let xk+1 = g(k, xk) be the gradient descent algorithm defined by equation (18), and let {αk}k∈N be a sequence of positive real numbers of order Ω(1/k) that converges to zero. Then the stable set of strict saddle points has Lebesgue measure zero.\n\nProof. We need to verify that the Taylor expansion of g(k, x) at x* satisfies the conditions of Corollary 4.2. Condition 1 is obvious, since the Hessian ∇²f(x*) is diagonalizable and has at least one negative eigenvalue. It suffices to verify condition 2. Consider the Taylor expansion of g(k, x) in a neighborhood U of x*:\n\ng(k, x) = g(k, x*) + Dxg(k, x*)(x − x*) + θ(k, x)\n= x* + (I − αk∇²f(x*))(x − x*) + θ(k, x)\n= x − αk∇²f(x*)(x − x*) + θ(k, x).\n\nSo we can write θ(k, x) = g(k, x) − x + αk∇²f(x*)(x − x*), and then the differential of θ(k, x) with respect to x is Dxθ(k, x) = Dx(g(k, x) − x) + αk∇²f(x*) = −αk∇²f(x) + αk∇²f(x*). From the Fundamental Theorem of Calculus and the chain rule for multivariable functions, we have\n\nθ(k, x) − θ(k, y) = ∫_0^1 (d/dt) θ(k, tx + (1 − t)y) dt = ∫_0^1 Dzθ(k, z)|_{z=tx+(1−t)y} · (x − y) dt.\n\nBy the assumption that f is C², the Hessian ∇²f(x) is continuous everywhere. Then, for any given ε > 0, there exists an open ball B(x*) centered at x* such that ‖∇²f(x) − ∇²f(x*)‖ ≤ ε for all x ∈ B(x*). This implies that ‖Dxθ(k, x)‖ ≤ αk ε for all x ∈ B(x*). Since tx + (1 − t)y ∈ B(x*) if x, y ∈ B(x*), we have that ‖Dzθ(k, z)|_{z=tx+(1−t)y}‖ ≤ αk ε for all t ∈ [0, 1]. By the Cauchy–Schwarz inequality, we have\n\n‖θ(k, x) − θ(k, y)‖ = ‖∫_0^1 Dzθ(k, z)|_{z=tx+(1−t)y} · (x − y) dt‖ ≤ (∫_0^1 ‖Dzθ(k, z)|_{z=tx+(1−t)y}‖ dt) · ‖x − y‖ ≤ αk ε ‖x − y‖,\n\nand the verification is complete. By Corollary 4.2 and Corollary 4.3, we conclude that the stable set of strict saddle points has Lebesgue measure zero.\n\n5.2 Mirror Descent\n\nWe consider the mirror descent algorithm in this section. Let D be a convex open subset of R^d, and M = D ∩ A for some affine space A. Given a function f : M → R and a mirror map Φ, the mirror descent algorithm with vanishing step-size is defined as\n\nxk+1 = g(k, xk) := h(∇Φ(xk) − αk∇f(xk)),     (19)\n\nwhere h(x) = argmax_{z∈M} ⟨z, x⟩ − Φ(z).\n\nDefinition 5.2 (Mirror Map). We say that Φ is a mirror map if it satisfies the following properties.\n\n• Φ : D → R is C² and strictly convex.\n• The gradient of Φ is surjective onto R^d, that is, ∇Φ(D) = R^d.\n• ∇Φ diverges on the relative boundary of M, that is, lim_{x→∂M} ‖∇Φ(x)‖ = ∞.\n\nTheorem 5.3. Let xk+1 = g(k, xk) be the mirror descent algorithm defined by equation (19), and let {αk}k∈N be a sequence of positive real numbers of order Ω(1/k) that converges to zero. Then the stable set of strict saddle points has Lebesgue measure zero.\n\n5.3 Proximal Point\n\nThe proximal point algorithm is given by the iteration\n\nxk+1 = g(k, xk) := argmin_z f(z) + (1/(2αk)) ‖xk − z‖².     (20)\n\nTheorem 5.4. Let xk+1 = g(k, xk) be the proximal point algorithm defined by equation (20), and let {αk}k∈N be a sequence of positive real numbers of order Ω(1/k) that converges to zero. Then the stable set of strict saddle points has Lebesgue measure zero.\n\n5.4 Manifold Gradient Descent\n\nLet M be a submanifold of R^d, and let TxM be the tangent space of M at x. Let PM and PTxM be the orthogonal projectors onto M and TxM respectively. Assume that f : M → R is extendable to a neighborhood of M and let f̄ be a smooth extension of f to R^d. Suppose that the Riemannian metric on M is induced by the Euclidean metric of R^d; then the Riemannian gradient ∇Rf(x) is the projection of the gradient of f̄ at x onto TxM, i.e., ∇Rf(x) = PTxM ∇f̄(x). Then the manifold gradient descent algorithm is:\n\nxk+1 = g(k, xk) := PM(xk − αk PTxkM ∇f̄(xk)).     (21)\n\nTheorem 5.5. Let xk+1 = g(k, xk) be the manifold gradient descent defined by equation (21), and let {αk}k∈N be a sequence of positive real numbers of order Ω(1/k) that converges to zero. Then the stable set of strict saddle points has measure zero.\n\nFor the case when M is not a submanifold of R^d, the manifold gradient descent algorithm depends on the Riemannian metric R defined intrinsically, i.e., R is not induced by any ambient metric. Given f : M → R, the Riemannian gradient ∇Rf is defined to be the unique vector field such that R(∇Rf, X) = ∂X f for all vector fields X on M. 
In a local coordinate system x(p) = (x_1, ..., x_d), p ∈ M, the Riemannian gradient is written as

∇_Rf(x) = (R^{1j} ∂f/∂x_j, ..., R^{dj} ∂f/∂x_j) = (R^{ij}) · ∇f(x),

where (R^{ij}) is the inverse of the metric matrix at the point x and R^{ij} ∂f/∂x_j = Σ_j R^{ij} ∂f/∂x_j by the Einstein summation convention. The update rule (in a local coordinate system) is then

x_{k+1} = g(k, x_k) := x_k − α_k (R^{ij}) · ∇f(x_k).    (22)

Theorem 5.6. Let x_{k+1} = g(k, x_k) be the manifold gradient descent defined by equation (22), and let {α_k}_{k∈ℕ} be a sequence of positive real numbers of order Ω(1/k) that converges to zero. Then the stable set of strict saddle points has measure zero.

6 Conclusion

In this paper, we generalize the results of [16] to the case of vanishing step-sizes. We showed that if the step-size α_k converges to zero with order Ω(1/k), then gradient descent, mirror descent, proximal point and manifold descent still avoid strict saddles. We believe this is an important result that was missing from the literature, since in practice vanishing or adaptive step-sizes are commonly used. Our main result boils down to the proof of a stable manifold theorem (Theorem 4.1) that works for time non-homogeneous dynamical systems and might be of independent interest. We leave as an open question the case of block coordinate descent (as it also appears in [16]).

7 Acknowledgements

Ioannis Panageas acknowledges SRG ISTD 2018 136 and NRF fellowship for AI. Georgios Piliouras and Xiao Wang acknowledge MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07. We thank Tony Roberts for pointers to the literature on stability of non-autonomous dynamical systems.

Figure 1: Steps of Gradient Descent for x² − y².
(0, 0) is a strict saddle. Stepsizes 1/√k and 1/k (blue, green) avoid (0, 0) (blue faster than green). Stepsize 1/k⁴ (red) converges to a non-critical point.

References

[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. In Theory of Computing, 2012.

[2] Luis Barreira and Claudia Valls. Stability of Nonautonomous Differential Equations, volume 1926. Springer, 2008.

[3] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

[4] Sébastien Bubeck. Theory of convex optimization for machine learning. CoRR, abs/1405.4980, 2014.

[5] Vaggos Chatziafratis, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Depth-width trade-offs for ReLU networks via Sharkovsky's theorem. ArXiv, 2019.

[6] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[7] Thiparat Chotibut, Fryderyk Falniowski, Michał Misiurewicz, and Georgios Piliouras. The route to chaos in routing games: Population increase drives period-doubling instability, chaos & inefficiency with price of anarchy equal to one, 2019.

[8] Thiparat Chotibut, Fryderyk Falniowski, Michał Misiurewicz, and Georgios Piliouras. Family of chaotic maps from game theory, 2018. Manuscript available at https://arxiv.org/abs/1807.06831.

[9] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems 31, pages 9256–9266, 2018.

[10] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points; online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[12] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

[13] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[14] Michael I. Jordan. Dynamical, symplectic and stochastic perspectives on gradient-based optimization. 2018.

[15] Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Multiplicative updates outperform generic no-regret learning in congestion games. In ACM Symposium on Theory of Computing (STOC), 2009.

[16] J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order methods almost always avoid saddle points. Mathematical Programming, 2019.

[17] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

[18] Ruta Mehta, Ioannis Panageas, and Georgios Piliouras. Natural selection as an inhibitor of genetic diversity: Multiplicative weights updates algorithm and a conjecture of haploid genetics [working paper abstract]. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS, page 73, 2015.

[19] Sai Ganesh Nagarajan and Ioannis Panageas. On the analysis of EM for truncated mixtures of two Gaussians. CoRR, abs/1902.06958, 2019.

[20] Kamil Nar and Shankar Sastry. Step size matters in deep learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 3440–3448, 2018.

[21] G. Palaiopanos, I. Panageas, and G. Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In NIPS, 2017.

[22] Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. In Innovations of Theoretical Computer Science (ITCS), 2017.

[23] Ioannis Panageas, Georgios Piliouras, and Xiao Wang. Multiplicative weights updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 4961–4969, 2019.

[24] Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv:1405.4604, 2014.

[25] Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pages 698–712, 1990.

[26] Christian Pötzsche and Martin Rasmussen. Computation of nonautonomous invariant and inertial manifolds. Numerische Mathematik, 112(3):449, 2009.

[27] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[28] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

[29] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885–914, 2017.
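The experiment of Figure 1 can be sketched in a few lines: gradient descent on f(x, y) = x² − y², whose only critical point (0, 0) is a strict saddle, run with step-sizes 1/√k and 1/k (both of order Ω(1/k), so the saddle is avoided) and with 1/k⁴ (which vanishes too fast, so the iterates converge to a non-critical point). The starting point and iteration count below are our own illustrative choices, not taken from the paper.

```python
import numpy as np

# Gradient descent on f(x, y) = x^2 - y^2; (0, 0) is a strict saddle.
# Step-sizes of order Omega(1/k) escape along the unstable y-direction,
# while 1/k^4 is summable and the iterates stall at a non-critical point.
def gradient_descent(stepsize, z0=(0.5, 1e-3), iters=2000):
    z = np.array(z0, dtype=float)       # illustrative starting point
    for k in range(1, iters + 1):
        grad = np.array([2.0 * z[0], -2.0 * z[1]])  # gradient of x^2 - y^2
        z = z - stepsize(k) * grad
    return z

z_fast = gradient_descent(lambda k: 1.0 / np.sqrt(k))   # escapes fastest
z_escape = gradient_descent(lambda k: 1.0 / k)          # also escapes
z_stall = gradient_descent(lambda k: 1.0 / k ** 4)      # stalls near the start
```

The 1/√k run escapes faster than the 1/k run, matching the blue-versus-green comparison in the caption, while the 1/k⁴ run ends at a point with nonzero gradient.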