{"title": "On the Curved Geometry of Accelerated Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1766, "page_last": 1775, "abstract": "In this work we propose a differential geometric motivation for Nesterov's accelerated gradient method (AGM) for strongly-convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, the AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, where the accelerated gradient method arises from the natural block-implicit Euler discretization of an ODE on the manifold. We provide an analysis of the convergence rate of this ODE for quadratic objectives.", "full_text": "On the Curved Geometry of Accelerated Optimization\n\nAaron Defazio\n\nFacebook AI Research\n\nNew York\n\nAbstract\n\nIn this work we propose a differential geometric motivation for Nesterov\u2019s\naccelerated gradient method (AGM) for strongly-convex problems. By considering\nthe optimization procedure as occurring on a Riemannian manifold with a natural\nstructure, the AGM method can be seen as the proximal point method applied\nin this curved space. This viewpoint can also be extended to the continuous time\ncase, where the accelerated gradient method arises from the natural block-implicit\nEuler discretization of an ODE on the manifold. We provide an analysis of the\nconvergence rate of this ODE for quadratic objectives.\n\n1 Introduction\nThe core algorithms of convex optimization are gradient descent (GD) and the accelerated gradient\nmethod (AGM). These methods are rarely used directly; more often they occur as the building blocks\nfor distributed, composite, or non-convex optimization. In order to build upon these components, it is\nhelpful to understand not just how they work, but why. The gradient method is well understood in this\nsense. 
It is commonly viewed as following a direction of steepest descent or as minimizing a quadratic\nupper bound. These interpretations provide a motivation for the method as well as suggesting a\npotential convergence proof strategy.\nThe accelerated gradient method in contrast has an identity crisis. Its equational form is remarkably\nmalleable, allowing for many different ways of writing the same updates. We list a number of these\nforms in Table 1. Nesterov\u2019s original motivation for the AGM method used the concept of estimate\nsequences. Unfortunately, estimate sequences do not necessarily yield the simplest accelerated\nmethods when generalized, such as for the composite case [Beck and Teboulle, 2009, Nesterov,\n2007], and they have not been successfully applied in the important finite-sum (variance reduced)\noptimization setting.\nBecause of the complexity of estimate sequences, the AGM method is commonly motivated as a form\nof momentum. This view is flawed as a way of introducing the AGM method from first principles, as\nthe most natural way of using momentum yields the heavy ball method instead:\n\n$$x^{k+1} = x^k - \\gamma \\nabla f(x^k) + \\beta (x^k - x^{k-1}),$$\n\nwhich arises from discretizing the physics of a particle in a potential well with additional friction.\nThe heavy-ball method does not achieve an accelerated convergence rate on general convex problems,\nsuggesting that momentum, per se, is not the reason for acceleration. Another contemporary view is\nthe linear-coupling interpretation of Allen-Zhu and Orecchia [2017], which views the AGM method\nas an interpolation between gradient descent and mirror descent. 
We take a more geometric view in\nour interpretation.\nIn this work we motivate the AGM by introducing it as an application of the proximal-point method:\n\n$$x^k = \\arg\\min_x \\left\\{ f(x) + \\frac{\\eta}{2} \\left\\| x - x^{k-1} \\right\\|^2 \\right\\}.$$\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe proximal point (PP) method is perhaps as foundational as the gradient descent method, although\nit sees even less direct use as each step requires solving a regularized subproblem, in contrast to the\nclosed form steps for GD and AGM. The PP method gains remarkable convergence rate properties in\nexchange for the computational difficulty, including convergence for any positive step-size.\nWe construct the AGM by applying a dual form of the proximal point method in a curved space.\nEach step follows a geodesic on a manifold in a sense we make precise in Section 4. We use the\nterm curved with respect to a coordinate system, rather than a coordinate-free notion of curvature\nsuch as the Riemannian curvature. We first give a brief introduction to the concepts from differential\ngeometry necessary to understand our motivation. The equational form that our argument yields is\nmuch closer to those that have been successfully applied in practice, particularly for the minimization\nof finite sums [Lan and Zhou, 2017, Zhang and Xiao, 2017].\n\n2 Connections\nAn (affine) connection is a type of structure on a manifold that can be used to define and compute\ngeodesics. Geodesics in this sense represent curves of zero acceleration. These geodesics are more\ngeneral concepts than Riemannian geodesics induced by the Riemannian connection, not necessarily\nrepresenting the shortest path in any metric. 
Indeed, we will define multiple connections on the same\nmanifold that lead to completely different geodesics.\nGiven an n-dimensional coordinate system, a connection is defined by n^3 numbers at every point x\non the manifold, called the connection coefficients (or Christoffel symbols) \u0393^k_{ij}(x). A geodesic\npath \u03b3 : [0, 1] \u2192 M (in our case M = R^n) between two points x and y can then be computed as the\nunique solution \u03b3(t) = x(t) to the system of ordinary differential equations [Lee, 1997, Page 58, Eq\n4.11]:\n\n$$\\frac{d^2 \\gamma^i}{dt^2} \\doteq \\frac{d^2 x^i}{dt^2} + \\sum_{j,k} \\Gamma^i_{jk}(x) \\frac{dx^j}{dt} \\frac{dx^k}{dt} = 0,$$\n\nwith boundary conditions x(0) = x and x(1) = y. Here x^i denotes the ith component of x expressed\nin the same coordinate system as the connection.\n\n3 Divergences induce Hessian manifold structure\nLet \u03c6 be a smooth strongly convex function defined on R^n. The Bregman divergence generated by \u03c6:\n\n$$B_\\phi(x, y) = \\phi(x) - \\phi(y) - \\langle \\nabla \\phi(y), x - y \\rangle,$$\n\nand its derivatives can be used to define a remarkable amount of structure on the domain of \u03c6. In\nparticular, we can define a Riemannian manifold, together with two dually flat connections with\nbiorthogonal coordinate systems [Amari and Nagaoka, 2000, Shima, 2007]. This structure is also\nknown as a Hessian manifold. Topologically it is M = R^n with the following additional geometric\nstructures.\n\nRiemannian structure\nRiemannian manifolds have the additional structure of a metric tensor (a generalized dot-product),\ndefined on their tangent spaces. 
We denote the vector space of tangent vectors at a point x as T_xM.\nIf we express the tangent vectors with respect to the Euclidean basis, the metric at a point x is a\nquadratic form with the Hessian matrix H(x) = \u2207^2_x B_\u03c6(x, y) = \u2207^2\u03c6(x) of \u03c6 at x:\n\n$$g_x(u, v) = u^T H(x) v.$$\n\nBiorthogonal coordinate systems\nCentral to the notion of a manifold is the invariance to the choice of coordinate system. We can\nexpress a point on the manifold as well as a point in the tangent space using any coordinate system\nthat is most convenient. Of course, when we wish to perform calculations on the manifold we must\nbe careful to express all quantities in that coordinate system. Euclidean coordinates e_i are the most\nnatural on our Hessian manifold; however, there is another coordinate system which is naturally dual\nto e_i, and ties the manifold structure directly to duality theory in optimization.\n\nTable 1: Equivalent forms of Nesterov\u2019s method for \u00b5-strongly convex, L-smooth f. 
Proofs of the\nstated relations are available in the appendix.\n\nForm: Nesterov [2013] form I\nAlgorithm: $y^k = \\frac{\\alpha \\gamma v^k + \\gamma x^k}{\\alpha \\mu + \\gamma}$, $x^{k+1} = y^k - \\frac{1}{L} \\nabla f(y^k)$, $v^{k+1} = (1 - \\alpha) v^k + \\frac{\\alpha \\mu}{\\gamma} y^k - \\frac{\\alpha}{\\gamma} \\nabla f(y^k)$.\nRelations: $\\alpha_{Nes} = \\sqrt{\\mu / L}$, $\\gamma_{Nes} = \\mu$.\n\nForm: Nesterov [2013] form II\nAlgorithm: $x^{k+1} = y^k - \\frac{1}{L} \\nabla f(y^k)$, $y^{k+1} = x^{k+1} + \\beta (x^{k+1} - x^k)$.\nRelations: $\\beta_{Nes} = \\frac{\\sqrt{L} - \\sqrt{\\mu}}{\\sqrt{L} + \\sqrt{\\mu}}$.\n\nForm: Sutskever et al. [2013]\nAlgorithm: $p^{k+1} = \\beta p^k - \\frac{1}{L} \\nabla f(x^k + \\beta p^k)$, $x^{k+1} = x^k + p^{k+1}$.\nRelations: $p^{k+1}_{Sut} = x^{k+1}_{Nes} - x^k_{Nes}$, $y^k_{Nes} = x^k_{Sut} + \\beta p^k_{Sut}$.\n\nForm: Modern Momentum^1\nAlgorithm: $p^{k+1} = \\beta p^k + \\nabla f(x^k)$, $x^{k+1} = x^k - \\frac{1}{L} (\\nabla f(x^k) + \\beta p^{k+1})$.\nRelations: $x^k_{mod} = x^k_{Sut} + \\beta p^k_{Sut} = y^k_{Nes}$, $p^k_{mod} = -L p^k_{Sut}$.\n\nForm: Auslender and Teboulle [2006]\nAlgorithm: $y^k = (1 - \\theta) \\hat{x}^k + \\theta z^k$, $z^{k+1} = z^k - \\frac{\\gamma}{\\theta} \\nabla f(y^k)$, $\\hat{x}^{k+1} = (1 - \\theta) \\hat{x}^k + \\theta z^{k+1}$.\nRelations: $\\theta_{AT} = 1 - \\beta_{Nes}$, $\\hat{x}^k_{AT} = x^k_{Nes}$, $y^k_{AT} = y^k_{Nes} = x^k_{mod}$, $\\gamma_{AT} = 1/L$.\n\nForm: Lan and Zhou [2017]\nAlgorithm: $\\tilde{x}^k = \\alpha (x^{k-1} - x^{k-2}) + x^{k-1}$, $\\underline{x}^k = \\frac{\\tilde{x}^k + \\tau \\underline{x}^{k-1}}{1 + \\tau}$, $g^k = \\nabla f(\\underline{x}^k)$, $x^k = x^{k-1} - \\frac{1}{\\eta} g^k$.\nRelations: $x^k_{Lan} = z^k_{AT}$, $\\underline{x}^k_{Lan} = y^k_{AT}$, $\\eta_{Lan} = \\frac{\\theta_{AT}}{\\gamma_{AT}}$, $\\tau_{Lan} = \\frac{1 - \\theta_{AT}}{\\theta_{AT}}$, $\\alpha_{Lan} = 1 - \\theta_{AT}$.\n\nRecall that for a convex function \u03c6 we may define the convex conjugate \u03c6\u2217(y) =\nmax_x {\u27e8x, y\u27e9 \u2212 \u03c6(x)}. 
The dual coordinate system we define simply identifies each point x, when\nexpressed in Euclidean (\u201cprimal\u201d) coordinates, with the vector of \u201cdual\u201d coordinates:\n\n$$y = \\nabla \\phi(x).$$\n\nOur assumptions of smoothness and strong convexity imply this is a one-to-one mapping, with inverse\ngiven by x = \u2207\u03c6\u2217(y). The remarkable fact that the gradient of the conjugate is the inverse of the\ngradient is a key building block of the theory in this paper.\nThe notion of biorthogonality refers to natural tangent space coordinates of these two systems. A\ntangent vector v at a point x can be converted to a vector u of dual (tangent space) coordinates using\nmatrix multiplication with the Hessian [Shima, 2007]:\n\n$$u = H(x) v. \\tag{1}$$\n\nGiven the definition of the metric above, it is easy to see that if we have two vectors v_1 and v_2, we\nmay express v_2 in dual coordinates u_2 so that the metric tensor takes the simple form:\n\n$$g_x(v_1, v_2) = v_1^T H(x) v_2 = v_1^T H(x) \\left( H(x)^{-1} u_2 \\right) = v_1^T u_2,$$\n\nwhich is the biorthogonal relation between the two tangent space coordinate systems.\n\nFootnote 1: PyTorch & TensorFlow (for instance) implement this form. Evaluating the gradient and function at the\ncurrent iterate x^k instead of a shifted point makes it more consistent with gradient descent when wrapped in a\ngeneric optimization layer.\n\nFigure 1: Illustrative geodesics for f(x) = (1/4) \u2016Ax\u2016^4, with A = [2, 1; 1, 3]. Viewing them from both\ncoordinate systems highlights the duality. Contour lines are for f and f\u2217 respectively.\n\nDual Flat Connections\nThere is an obvious connection \u0393^{(E)} we can apply to the Hessian manifold, the Euclidean connection\nthat trivially identifies straight lines in R^n as geodesics. Normally when we perform gradient descent\nin R^n we are implicitly following a geodesic of this connection. 
The connection coefficients \u0393^{(E)k}_{ij}\nare all zero when this connection is expressed in Euclidean coordinates. A connection that has\n\u0393^k_{ij} = 0 with respect to some coordinate system is a flat connection.\nThe Hessian manifold admits another flat connection, which we will call the dual connection, as it\ncorresponds to straight lines in the dual coordinate system established above. In particular each dual\ngeodesic can be expressed in primal coordinates \u03b3(t) as a solution to the equation:\n\n$$\\nabla \\phi(\\gamma(t)) = a t + b,$$\n\nfor vectors a, b representing the initial velocity and point respectively (both represented in dual\ncoordinates) that depend on the boundary conditions. This is quite easy to solve using the relation\n(\u2207\u03c6)^{\u22121} = \u2207\u03c6\u2217 discussed above. For instance, a geodesic \u03b3 : [0, 1] \u2192 M between two arbitrary\npoints x and y under the dual connection could be computed explicitly in Euclidean coordinates as:\n\n$$\\gamma(t) = \\nabla \\phi^*\\left( t \\nabla \\phi(y) + (1 - t) \\nabla \\phi(x) \\right). \\tag{2}$$\n\nIf we instead know the initial velocity v we can find the endpoint with:\n\n$$y = \\nabla \\phi^*\\left( \\nabla \\phi(x) + H(x) v \\right). \\tag{3}$$\n\nThe flatness of the dual connection \u0393^{(D)} is crucial to its computability in practice. 
If we instead try\nto compute the geodesic in Euclidean coordinates using the geodesic ODE, we have to work with the\nconnection coefficients which involve third derivatives of \u03c6 (taking the form of double those of the\nRiemannian connection \u0393^{(R)}):\n\n$$\\Gamma^{(D)k}_{ij}(x) = 2 \\Gamma^{(R)k}_{ij} = \\left[ H(x)^{-1} (\\nabla H(x))_i \\right]_{kj}.$$\n\nThe Riemannian connection\u2019s geodesics are similarly difficult to compute directly from the ODE\n(they also can\u2019t generally be expressed in a simpler form).\n\n4 Bregman proximal operators follow geodesics\nBregman divergences arise in optimization primarily through their use in proximal steps. A Bregman\nproximal operation balances finding a minimizer of a given function f with maintaining proximity to\na given point x^{k\u22121}, measured using a Bregman divergence instead of a distance metric:\n\n$$x^k = \\arg\\min_x \\left\\{ f(x) + \\rho B_\\phi(x, x^{k-1}) \\right\\}.$$\n\nA core application of this would be the mirror descent step [Nemirovski and Yudin, 1983, Beck and\nTeboulle, 2003], where the operation is applied to a linearized version of f instead of f directly:\n\n$$x^k = \\arg\\min_x \\left\\{ \\langle x, \\nabla f(x^{k-1}) \\rangle + \\rho B_\\phi(x, x^{k-1}) \\right\\}.$$\n\nBregman proximal operations can be interpreted as geodesic steps with respect to the dual connection.\nThe key idea is that given an input point x^{k\u22121}, they output a point x such that the velocity of the\nconnecting geodesic is equal to \u2212(1/\u03c1)\u2207f(x) at x. This velocity is measured in the flat coordinate\nsystem of the connection, the dual coordinates. 
To see why, consider a geodesic \u03b3(t) = (1 \u2212 t)\u2207\u03c6(x^{k\u22121}) + t\u2207\u03c6(x^k). Here x^{k\u22121} and x^k are in primal coordinates and \u03b3(t) is in dual coordinates.\nThe velocity is (d/dt) \u03b3(t) = \u2207\u03c6(x^k) \u2212 \u2207\u03c6(x^{k\u22121}). Contrast this with the optimality condition of the Bregman\nprox (Equation 3):\n\n$$-\\frac{1}{\\rho} \\nabla f(x^k) = \\nabla \\phi(x^k) - \\nabla \\phi(x^{k-1}).$$\n\nFor instance, when using the Euclidean penalty the step is:\n\n$$x^k = \\arg\\min_x \\left\\{ f(x) + \\frac{\\rho}{2} \\left\\| x - x^{k-1} \\right\\|^2 \\right\\}.$$\n\nThe final velocity is just x^k \u2212 x^{k\u22121}, and so x^k \u2212 x^{k\u22121} = \u2212(1/\u03c1)\u2207f(x^k), which is the solution of the\nproximal operation.\n\n[Figure 1 panels: Euclidean (primal) coordinates; Dual coordinates. Legend: Riemannian geodesic, Dual connection geodesic, Euclidean geodesic.]\n\n5 Primal-Dual form of the proximal point method\nThe proximal point method is the building block from which we will construct the accelerated\ngradient method. Consider the basic form of the proximal point method applied to a strongly convex\nfunction f. At each step, the iterate x^k is constructed from x^{k\u22121} by solving the proximal operation\nsubproblem given an inverse step size parameter \u03b7:\n\n$$x^k = \\arg\\min_x \\left\\{ f(x) + \\frac{\\eta}{2} \\left\\| x - x^{k-1} \\right\\|^2 \\right\\}. \\tag{4}$$\n\nThis step can be considered an implicit form of the gradient step, where the gradient is evaluated at\nthe end-point of the step instead of the beginning:\n\n$$x^k = x^{k-1} - \\frac{1}{\\eta} \\nabla f(x^k),$$\n\nwhich is just the optimality condition of the subproblem in Equation 4, found by setting the derivative\n\u2207f(x) + \u03b7x \u2212 \u03b7x^{k\u22121} to zero. 
A remarkable property of the proximal operation becomes apparent\nwhen we rearrange this formula, namely that the solution to the operation is not a single point but a\nprimal-dual pair, whose weighted sum is equal to the input point:\n\n$$x^k + \\frac{1}{\\eta} \\nabla f(x^k) = x^{k-1}.$$\n\nIf we define g^k = \u2207f(x^k), the primal-dual pair obeys a duality relation: g^k = \u2207f(x^k) and\nx^k = \u2207f\u2217(g^k), which allows us to interchange primal and dual quantities freely. Indeed we may\nwrite the condition in a dual form as:\n\n$$\\nabla f^*(g^k) + \\frac{1}{\\eta} g^k = x^{k-1}, \\tag{5}$$\n\nwhich is the optimality condition for the proximal operation:\n\n$$g^k = \\arg\\min_g \\left\\{ f^*(g) + \\frac{1}{2\\eta} \\left\\| g - \\eta x^{k-1} \\right\\|^2 \\right\\}.$$\n\nOur goal in this section is to express the proximal point method in terms of a dual step, and while this\nequation involves the dual function f\u2217, it is not a step in the sense that g^k is formed by a proximal\noperation from g^{k\u22121}.\nWe can manipulate this formula further to get an update of the form we want, by simply adding and\nsubtracting (1/\u03b7) g^{k\u22121} on the right-hand side of Equation 5:\n\n$$\\nabla f^*(g^k) + \\frac{1}{\\eta} g^k = \\frac{1}{\\eta} g^{k-1} + \\left( x^{k-1} - \\frac{1}{\\eta} g^{k-1} \\right),$$\n\nwhich gives the updates:\n\n$$g^k = \\arg\\min_g \\left\\{ f^*(g) - \\left\\langle g, x^{k-1} - \\frac{1}{\\eta} g^{k-1} \\right\\rangle + \\frac{1}{2\\eta} \\left\\| g - g^{k-1} \\right\\|^2 \\right\\},$$\n\n$$x^k = x^{k-1} - \\frac{1}{\\eta} g^k.$$\n\nWe call this the primal-dual form of the proximal point method.\n\n6 Acceleration as a change of geometry\nThe proximal point method is rarely used in practice due to the difficulty of computing the solution to\nthe proximal subproblem. It is natural then to consider modifications of the subproblem to make it\nmore tractable. 
The subproblem becomes particularly simple if we replace the proximal operation\nwith a Bregman proximal operation with respect to f\u2217,\n\n$$g^k = \\arg\\min_g \\left\\{ f^*(g) - \\left\\langle g, x^{k-1} - \\frac{1}{\\eta} g^{k-1} \\right\\rangle + \\tau B_{f^*}(g, g^{k-1}) \\right\\}.$$\n\nWe have additionally changed the penalty parameter to a new constant \u03c4, which is necessary as the\nchange to the Bregman divergence changes the scaling of distances. We discuss this further below.\nRecall from Section 4 that Bregman proximal operations follow geodesics. The key idea is that we\nare now following a geodesic in the dual connection of \u03c6 = f\u2217, using the notation of Section 3, which\nis a straight line in the primal coordinates of f due to the flatness of the connection (Section 3). Due\nto the flatness property, a simple closed-form solution can be derived by equating the derivative to 0:\n\n$$\\nabla f^*(g^k) - \\left[ x^{k-1} - \\frac{1}{\\eta} g^{k-1} \\right] + \\tau \\nabla f^*(g^k) - \\tau \\nabla f^*(g^{k-1}) = 0,$$\n\n$$\\text{therefore} \\quad g^k = \\nabla f\\left( (1 + \\tau)^{-1} \\left[ x^{k-1} - \\frac{1}{\\eta} g^{k-1} + \\tau \\nabla f^*(g^{k-1}) \\right] \\right).$$\n\nThis formula gives g^k in terms of the derivative of known quantities, as \u2207f\u2217(g^{k\u22121}) is known from\nthe previous step as the point at which we evaluated the derivative. We will denote this argument\nto the derivative operation y, so that g^k = \u2207f(y^k). It no longer holds that g^k = \u2207f(x^k) after the\nchange of divergence. Using this relation, y can be computed each step via the update:\n\n$$y^k = \\frac{x^{k-1} - \\frac{1}{\\eta} g^{k-1} + \\tau y^{k-1}}{1 + \\tau}.$$\n\nIn order to match the accelerated gradient method exactly we need some additional flexibility in the\nstep size used in the y^k update. 
To this end we introduce an additional constant \u03b1 in front of g^{k\u22121},\nwhich is 1 for the proximal point variant. The full method is as follows:\n\nBregman form of the accelerated gradient method\n\n$$\\begin{aligned} y^k &= \\frac{x^{k-1} - \\frac{\\alpha}{\\eta} g^{k-1} + \\tau y^{k-1}}{1 + \\tau}, \\\\ g^k &= \\nabla f(y^k), \\\\ x^k &= x^{k-1} - \\frac{1}{\\eta} g^k. \\end{aligned} \\tag{6}$$\n\nThis is very close to the equational form of Nesterov\u2019s method explored by Lan and Zhou [2017], with\nthe change that they assume an explicit regularizer is used, whereas we assume strong convexity of f.\nIndeed we have chosen our notation so that the constants match. This form is algebraically equivalent\nto other known forms of the accelerated gradient method for appropriate choice of constants. Table 1\nshows the direct relation between the many known ways of writing the accelerated gradient method\nin the strongly-convex case (proofs of these relations are in the Appendix). When f is \u00b5-strongly\nconvex and L-smooth, existing theory implies an accelerated geometric convergence rate of at least\n1 \u2212 \u221a(\u00b5/L) for the parameter settings [Nesterov, 2013]:\n\n$$\\eta = \\sqrt{\\mu L}, \\qquad \\tau = \\frac{L}{\\eta}, \\qquad \\alpha = \\frac{\\tau}{1 + \\tau}.$$\n\nIn contrast, the primal-dual form of the proximal point method achieves at least that convergence rate\nfor parameters:\n\n$$\\eta = \\sqrt{\\mu L}, \\qquad \\tau = \\frac{1}{\\eta}, \\qquad \\alpha = 1.$$\n\nThe difference in \u03c4 arises from the difference in the scaling of the Bregman penalty compared to\nthe Euclidean penalty. The Bregman generator f\u2217 is strongly convex with constant 1/L whereas\nthe Euclidean generator (1/2)\u2016\u00b7\u2016^2 is strongly convex with constant 1, so the change in scale requires\nrescaling by L.\n\n6.1 Interpretations\nAfter the change in geometry, the g update no longer gives a dual point that is directly the gradient of\nthe primal iterate. 
However, notice that the term we are attempting to minimize in the g step:\n\n$$f^*(g) - \\left\\langle g, x^{k-1} - \\frac{\\alpha}{\\eta} g^{k-1} \\right\\rangle,$$\n\nhas a fixed point of \u2207f\u2217(g^k) = x^{k\u22121} \u2212 (\u03b1/\u03b7) g^k, which is precisely an \u03b1-weighted version of the proximal\npoint\u2019s key property from Equation 5. Essentially we have relaxed the proximal-point method. Instead\nof this relation holding precisely at every step, we are constantly taking steps which pull g\ncloser to satisfying it.\n\n6.2 Inertial form\nThe primal-dual view of the proximal point method can also be written in terms of the quantity\nz^{k\u22121} = x^{k\u22121} \u2212 (\u03b1/\u03b7) g^{k\u22121} instead of x^{k\u22121}. This form is useful for the construction of ODEs that model\nthe discrete dynamics. Under this change of variables the updates are:\n\n$$\\begin{aligned} g^k &= \\arg\\min_g \\left\\{ f^*(g) - \\langle g, z^{k-1} \\rangle + \\frac{1}{2\\eta} \\left\\| g - g^{k-1} \\right\\|^2 \\right\\}, \\\\ z^k &= z^{k-1} - \\frac{1}{\\eta} g^k - \\frac{\\alpha}{\\eta} \\left( g^k - g^{k-1} \\right). \\end{aligned} \\tag{7}$$\n\n6.3 Relation to the heavy ball method\nConsider Equation 6 with \u03b1 = 0, which removes the over-extrapolation before the proximal operation.\nIf we define \u03b2 = \u03c4/(1+\u03c4) we may write the method as:\n\n$$x^k = x^{k-1} - \\frac{1}{\\eta} f'(y^{k-1}), \\qquad y^k = \\beta y^{k-1} + (1 - \\beta) x^k.$$\n\nWe can eliminate x^k from the y^k update above by plugging in the x^k step equation, then using the y^k\nupdate from the previous step in the form (1 \u2212 \u03b2) x^{k\u22121} = y^{k\u22121} \u2212 \u03b2y^{k\u22122}:\n\n$$\\begin{aligned} y^k &= \\beta y^{k-1} + (1 - \\beta) \\left( x^{k-1} - \\frac{1}{\\eta} f'(y^{k-1}) \\right) \\\\ &= \\beta y^{k-1} - (1 - \\beta) \\frac{1}{\\eta} f'(y^{k-1}) + \\left[ y^{k-1} - \\beta y^{k-2} \\right] \\\\ &= y^{k-1} - (1 - \\beta) \\frac{1}{\\eta} f'(y^{k-1}) + \\beta \\left[ y^{k-1} - y^{k-2} \\right]. \\end{aligned}$$\n\nThis has the exact form of the heavy ball method with step size (1 \u2212 \u03b2)/\u03b7 and momentum \u03b2. We\ncan also derive the heavy ball method by starting from the saddle-point expression for f:\n\n$$\\min_x f(x) = \\min_x \\max_g \\left\\{ \\langle x, g \\rangle - f^*(g) \\right\\}.$$\n\nThe alternating-block gradient descent/ascent method on the objective \u27e8x, g\u27e9 \u2212 f\u2217(g) with step-size\n\u03b3 is simply:\n\n$$g^k = g^{k-1} + \\frac{1}{\\gamma} \\left[ x^{k-1} - \\nabla f^*(g^{k-1}) \\right], \\qquad x^k = x^{k-1} - \\gamma g^k.$$\n\nIf we instead perform a Bregman proximal update in the dual geometry for the g part, we arrive at the\nsame equations as we had for the primal-dual proximal point method but with \u03b1 = 0, yielding the\nheavy ball method. In order to get the accelerated gradient method instead of the heavy ball method,\nthe extra inertia that arises from starting from the proximal point method instead of the saddle point\nformulation appears to be crucial.\n\n7 Dual geometry in continuous time\nThe inertial form (Equation 7) of the proximal point method can be formulated as an ODE in a very\nnatural way, by mapping z^k \u2212 z^{k\u22121} \u2192 \u02d9z and g^k \u2212 g^{k\u22121} \u2192 \u02d9g, and taking x and g to be at time t.\nThis is the inverse of the Euler class of discretizations applied separately to the two terms, which is\nthe most natural way to discretize an ODE. The resulting proximal point ODE is:\n\n$$\\dot{g} = f_g(z, g, t) \\doteq -\\frac{1}{\\tau} \\nabla f^*(g) + \\frac{1}{\\tau} z, \\qquad \\dot{z} = f_z(z, g, t) \\doteq -\\frac{1}{\\eta} g - \\frac{\\alpha}{\\eta} \\dot{g}.$$\n\nWe have suppressed the dependence on t of each quantity for notational simplicity. We can treat g\nmore formally as a point g \u2208 M on a Hessian manifold M. 
Then the solution for the g variable of\nthe ODE is a curve \u03b3(t) : I \u2192 TM from an interval I to the tangent bundle on the manifold so the\nvelocity \u02d9\u03b3(t) \u2208 T_gM obeys the ODE: \u02d9\u03b3(t) = f_g(z, g, t). The right hand side of the ODE is a point\nin the tangent space of the manifold at \u03b3(t), expressed in Euclidean coordinates.\nWe can now apply the same partial change of geometry that we used in the discrete case. We will\nconsider the quantity f_g(z, g, t) to be a tangent vector in dual tangent space coordinates for the\n\u03c6 = f\u2217 Hessian manifold, instead of its primal tangent space coordinates (which would leave the\nODE unchanged). The variable g remains in primal coordinates with respect to \u03c6, so we must add to\nthe ODE a change of coordinates for the tangent vector, yielding:\n\n$$\\dot{g} = \\left( \\nabla^2 f^*(g) \\right)^{-1} f_g(z, g, t),$$\n\nwhere we have used the inverse of Equation 1, with \u03c6 = f\u2217. We can rewrite this as:\n\n$$f_g(z, g, t) = \\nabla^2 f^*(g) \\dot{g} = \\frac{d}{dt} \\nabla f^*(g),$$\n\ngiving the AGM ODE system:\n\n$$\\frac{d}{dt} \\nabla f^*(g) = -\\frac{1}{\\tau} \\nabla f^*(g) + \\frac{1}{\\tau} z, \\qquad \\dot{z} = -\\frac{1}{\\eta} g - \\frac{\\alpha}{\\eta} \\dot{g}.$$\n\nIt is now easily seen that the implicit Euler update for the g variable with z fixed corresponds to\nthe solution of the Bregman proximal operation considered in the discrete case. So this ODE is a\nnatural continuous time analogue to the accelerated gradient method.\n\nConvergence in continuous time\nThe natural analogy to convergence in continuous time is\nknown as the decay rate of the ODE. A sufficient condition\nfor an ODE with parameters u = [z; g] to decay with\nconstant \u03c1 is:\n\n$$\\| u(t) - u^* \\| \\le \\exp(-t\\rho) \\| u(0) - u^* \\|,$$\n\nwhere u\u2217 is a fixed point. 
We can relate this to the discrete\ncase by noting that exp(\u2212t\u03c1) = lim_{k\u2192\u221e} (1 \u2212 (t/k)\u03c1)^k, so\ngiven our discrete-time convergence rate is proportional to\n(1 \u2212 \u221a(\u00b5/L))^k, we would expect values of \u03c1 proportional\nto \u221a(\u00b5/L) if the ODE behaves similarly to the discrete\nprocess. We have been able to establish this result for both\nthe proximal and AGM ODEs for quadratic objectives\n(proof in the Appendix in the supplementary material).\nTheorem 1. The proximal and AGM ODEs decay with\nat least the following rates for \u00b5-strongly convex and\nL-smooth quadratic objective functions when using the same hyper-parameters as in the discrete\ncase:\n\n$$\\rho_{prox} \\ge \\frac{\\sqrt{\\mu}}{\\sqrt{\\mu} + \\sqrt{L}}, \\qquad \\rho_{AGM} \\ge \\frac{1}{2} \\sqrt{\\frac{\\mu}{L}}.$$\n\nFigure 2: Paths for the quadratic problem f(x) = (1/2) x^T A x with A = [2, 1; 1, 3]. [Legend: Prox ODE, AGM ODE, Prox, AGM.]\n\nFigure 2 contrasts the convergence of the discrete and continuous variants. The two methods have\nquite distinct paths whose shape is shared by their ODE counterparts.\n\n8 Related Work\nThe application of Bregman divergence to the analysis of continuous time views of the accelerated\ngradient method has recently been explored by Wibisono et al. [2016] and Wilson et al. [2018]. Their\napproaches do not use the Bregman divergence of f\u2217, a key factor of our approach. In their work, the Bregman\ndivergence of a function \u03c6 occurs explicitly as a term in a Hamiltonian, in contrast to our view of \u03c6\nas curving space. The accelerated gradient method has been shown to be modeled by a momentum\nODE of the form X\u0308 + c(t)X\u0307 + \u2207f(X) = 0 by Su et al. [2014]. Natural discretizations of their ODE\nresult in the heavy-ball method instead of the accelerated gradient method, unlike our form which\ncan produce both based on the choice of \u03b1. 
The fine-grained properties of momentum ODEs have\nalso been studied in the quadratic case by Scieur et al. [2017].\nA primal-dual form of the regularized accelerated gradient method appears in Lan and Zhou [2017].\nOur form can be seen as a special case of their form when the regularizer is zero. Our work extends\ntheirs, providing an understanding of the role that geometry plays in unifying acceleration and implicit\nsteps.\nThe Riemannian connection induced by a function has been heavily explored in the optimization\nliterature as part of the natural gradient method [Amari, 1998], although other connections on this\nmanifold are less explored. The dual-flat connections have primarily seen use in the information-geometry\nsetting for optimization over distributions [Amari and Nagaoka, 2000].\nThe accelerated gradient method is not the only way to achieve accelerated rates among first order\nmethods. Other techniques include the Geometric descent method of Bubeck et al. [2015], where a\nbounding ball is updated at each step that encloses two other balls, a very different approach. The\nmethod described by Nemirovski and Yudin [1983] is also notable as being closer to the conjugate\ngradient method than other accelerated approaches, but at the expense of requiring a 2D search\ninstead of a 1D line search at each step.\n\n9 Conclusion\n\nWe believe the tools of differential geometry may provide a new and insightful avenue for the analysis\nof accelerated optimization methods. The analysis we provide in this work is a first step in this\ndirection. The advantage of the differential geometric approach is that it provides high level tools that\nmake the derivation of acceleration easier to state. This derivation, from the proximal point method\nto the accelerated gradient method, is in our opinion not nearly as mysterious as the other known\napproaches to understanding acceleration.\n\nReferences\nZeyuan Allen-Zhu and Lorenzo Orecchia. 
Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS \u201917, 2017. Full version available at http://arxiv.org/abs/1407.1537.\n\nShun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251\u2013276, 1998.\n\nShun-Ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.\n\nAlfred Auslender and Marc Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697\u2013725, 2006.\n\nAmir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.\n\nAmir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.\n\nStephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Society for Industrial and Applied Mathematics, 1994.\n\nS\u00e9bastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov\u2019s accelerated gradient descent, 2015.\n\nGuanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, pages 1\u201349, 2017.\n\nJohn M. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer-Verlag, 1997.\n\nArkadi Nemirovski and David Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.\n\nYu. Nesterov. Accelerating the cubic regularization of Newton\u2019s method on convex problems. Mathematical Programming, 2008.\n\nYurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Soviet Mathematics Doklady, volume 27, pages 372\u2013376, 1983.\n\nYurii Nesterov. Gradient methods for minimizing composite objective function. 
Technical report, CORE, 2007.\n\nYurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.\n\nErnest K Ryu and Stephen Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1):3\u201343, 2016.\n\nDamien Scieur, Vincent Roulet, Francis Bach, and Alexandre d\u2019Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems 30, 2017.\n\nHirohiko Shima. The Geometry of Hessian Structures. World Scientific, 2007.\n\nWeijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov\u2019s accelerated gradient method: Theory and insights. Advances in Neural Information Processing Systems, 2014.\n\nIlya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139\u20131147, 2013.\n\nAndre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351\u2013E7358, 2016.\n\nAshia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2018.\n\nYuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. The Journal of Machine Learning Research, 18(1):2939\u20132980, 2017.", "award": [], "sourceid": 1030, "authors": [{"given_name": "Aaron", "family_name": "Defazio", "institution": "Facebook AI Research"}]}