{"title": "Accelerated Mirror Descent in Continuous and Discrete Time", "book": "Advances in Neural Information Processing Systems", "page_first": 2845, "page_last": 2853, "abstract": "We study accelerated mirror descent dynamics in continuous and discrete time. Combining the original continuous-time motivation of mirror descent with a recent ODE interpretation of Nesterov's accelerated method, we propose a family of continuous-time descent dynamics for convex functions with Lipschitz gradients, such that the solution trajectories are guaranteed to converge to the optimum at a $O(1/t^2)$ rate. We then show that a large family of first-order accelerated methods can be obtained as a discretization of the ODE, and these methods converge at a $O(1/k^2)$ rate. This connection between accelerated mirror descent and the ODE provides an intuitive approach to the design and analysis of accelerated first-order algorithms.", "full_text": "Accelerated Mirror Descent\n\nin Continuous and Discrete Time\n\nWalid Krichene\n\nUC Berkeley\n\nAlexandre M. Bayen\n\nUC Berkeley\n\nPeter L. Bartlett\n\nUC Berkeley and QUT\n\nwalid@eecs.berkeley.edu\n\nbayen@berkeley.edu\n\nbartlett@berkeley.edu\n\nAbstract\n\nWe study accelerated mirror descent dynamics in continuous and discrete time.\nCombining the original continuous-time motivation of mirror descent with a re-\ncent ODE interpretation of Nesterov\u2019s accelerated method, we propose a family of\ncontinuous-time descent dynamics for convex functions with Lipschitz gradients,\nsuch that the solution trajectories converge to the optimum at a O(1/t2) rate. We\nthen show that a large family of \ufb01rst-order accelerated methods can be obtained as\na discretization of the ODE, and these methods converge at a O(1/k2) rate. 
This\nconnection between accelerated mirror descent and the ODE provides an intuitive\napproach to the design and analysis of accelerated \ufb01rst-order algorithms.\n\nIntroduction\n\n1\nWe consider a convex optimization problem, minimizex\u2208X f (x), where X \u2286 Rn is convex and\nclosed, f is a C 1 convex function, and \u2207f is assumed to be Lf -Lipschitz. Let f (cid:63) be the minimum\nof f on X . Many convex optimization methods can be interpreted as the discretization of an ordinary\ndifferential equation, the solutions of which are guaranteed to converge to the set of minimizers. Per-\nhaps the simplest such method is gradient descent, given by the iteration x(k+1) = x(k)\u2212s\u2207f (x(k))\nfor some step size s, which can be interpreted as the discretization of the ODE \u02d9X(t) = \u2212\u2207f (X(t)),\nwith discretization step s. The well-established theory of ordinary differential equations can provide\nguidance in the design and analysis of optimization algorithms, and has been used for unconstrained\noptimization [8, 7, 13], constrained optimization [27] and stochastic optimization [25]. In particular,\nproving convergence of the solution trajectories of an ODE can often be achieved using simple and\nelegant Lyapunov arguments. The ODE can then be carefully discretized to obtain an optimiza-\ntion algorithm for which the convergence rate can be analyzed by using an analogous Lyapunov\nargument in discrete time.\nIn this article, we focus on two families of \ufb01rst-order methods: Nesterov\u2019s accelerated method [22],\nand Nemirovski\u2019s mirror descent method [19]. First-order methods have become increasingly im-\nportant for large-scale optimization problems that arise in machine learning applications. Nesterov\u2019s\naccelerated method [22] has been applied to many problems and extended in a number of ways, see\nfor example [23, 20, 21, 4]. 
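To make the discretization viewpoint concrete, the gradient descent iteration above is a forward-Euler step on the ODE Ẋ(t) = −∇f(X(t)); the sketch below is our illustration (not part of the paper), with an arbitrary quadratic objective:

```python
import numpy as np

def gradient_descent(grad_f, x0, s, iters):
    """Forward-Euler discretization of the gradient flow X' = -grad_f(X):
    x(k+1) = x(k) - s * grad_f(x(k)), with discretization step s."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - s * grad_f(x)
    return x

# Illustration on f(x) = (1/2)||x||^2, whose gradient flow contracts to 0.
x_final = gradient_descent(lambda x: x, x0=[1.0, -2.0], s=0.1, iters=200)
```

Here the step size s plays the role of the discretization step of the ODE; for this quadratic, any s < 2 yields a contraction.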
The mirror descent method also provides an important generalization of the gradient descent method to non-Euclidean geometries, as discussed in [19, 3], and has many applications in convex optimization [6, 5, 12, 15], as well as online learning [9, 11]. An intuitive understanding of these methods is of particular importance for the design and analysis of new algorithms. Although Nesterov's method has been notoriously hard to explain intuitively [14], progress has been made recently: in [28], Su et al. give an ODE interpretation of Nesterov's method. However, this interpretation is restricted to the original method [22], and does not apply to its extensions to non-Euclidean geometries. In [1], Allen-Zhu and Orecchia give another interpretation of Nesterov's method, as performing, at each iteration, a convex combination of a mirror step and a gradient step. Although it covers a broader family of algorithms (including non-Euclidean geometries), this interpretation still requires an involved analysis, and lacks the simplicity and elegance of ODEs. We provide a new interpretation which has the benefits of both approaches: we show that a broad family of accelerated methods (which includes those studied in [28] and [1]) can be obtained as a discretization of a simple ODE, which converges at a O(1/t²) rate. This provides a unified interpretation, which could potentially simplify the design and analysis of first-order accelerated methods.\n\nThe continuous-time interpretation [28] of Nesterov's method and the continuous-time motivation of mirror descent [19] both rely on a Lyapunov argument. They are reviewed in Section 2.
By combining these ideas, we propose, in Section 3, a candidate Lyapunov function V(X(t), Z(t), t) that depends on two state variables: X(t), which evolves in the primal space E = Rn, and Z(t), which evolves in the dual space E*, and we design coupled dynamics of (X, Z) to guarantee that (d/dt)V(X(t), Z(t), t) ≤ 0. Such a function is said to be a Lyapunov function, in reference to [18]; see also [16]. This leads to a new family of ODE systems, given in Equation (5). We prove the existence and uniqueness of the solution to (5) in Theorem 1. Then we prove in Theorem 2, using the Lyapunov function V, that the solution trajectories are such that f(X(t)) − f⋆ = O(1/t²). In Section 4, we give a discretization of these continuous-time dynamics, and obtain a family of accelerated mirror descent methods, for which we prove the same O(1/k²) convergence rate (Theorem 3) using a Lyapunov argument analogous to (though more involved than) the continuous-time case. We give, as an example, a new accelerated method on the simplex, which can be viewed as performing, at each step, a convex combination of two entropic projections with different step sizes. This ODE interpretation of accelerated mirror descent gives new insights and allows us to extend recent results such as the adaptive restarting heuristics proposed by O'Donoghue and Candès in [24], which are known to empirically improve the convergence rate. We test these methods on numerical examples in Section 5 and comment on their performance.\n\n2 ODE interpretations of Nemirovski's mirror descent method and Nesterov's accelerated method\n\nProving convergence of the solution trajectories of an ODE often involves a Lyapunov argument. For example, to prove convergence of the solutions to the gradient descent ODE Ẋ(t) = −∇f(X(t)), consider the Lyapunov function V(X(t)) = (1/2)‖X(t) − x⋆‖² for some minimizer x⋆.
Then the time derivative of V(X(t)) is given by\n\n(d/dt)V(X(t)) = ⟨Ẋ(t), X(t) − x⋆⟩ = ⟨−∇f(X(t)), X(t) − x⋆⟩ ≤ −(f(X(t)) − f⋆),\n\nwhere the last inequality is by convexity of f. Integrating, we have V(X(t)) − V(x0) ≤ tf⋆ − ∫₀ᵗ f(X(τ))dτ, thus by Jensen's inequality, f((1/t)∫₀ᵗ X(τ)dτ) − f⋆ ≤ (1/t)∫₀ᵗ f(X(τ))dτ − f⋆ ≤ V(x0)/t, which proves that f((1/t)∫₀ᵗ X(τ)dτ) converges to f⋆ at a O(1/t) rate.\n\n2.1 Mirror descent ODE\n\nThe previous argument was extended by Nemirovski and Yudin in [19] to a family of methods called mirror descent. The idea is to start from a non-negative function V, then to design dynamics for which V is a Lyapunov function. Nemirovski and Yudin argue that one can replace the Lyapunov function V(X(t)) = (1/2)‖X(t) − x⋆‖² by a function on the dual space, V(Z(t)) = Dψ*(Z(t), z⋆), where Z(t) ∈ E* is a dual variable for which we will design the dynamics (z⋆ is the value of Z at equilibrium), and the corresponding trajectory in the primal space is X(t) = ∇ψ*(Z(t)). Here ψ* is a convex function defined on E*, such that ∇ψ* maps E* to X, and Dψ*(Z(t), z⋆) is the Bregman divergence associated with ψ*, defined as Dψ*(z, y) = ψ*(z) − ψ*(y) − ⟨∇ψ*(y), z − y⟩. The function ψ* is said to be ℓ-strongly convex w.r.t. a reference norm ‖·‖* if Dψ*(z, y) ≥ (ℓ/2)‖z − y‖*² for all y, z, and it is said to be L-smooth w.r.t. ‖·‖* if Dψ*(z, y) ≤ (L/2)‖z − y‖*². For a review of properties of Bregman divergences, see Chapter 11.2 in [11], or Appendix A in [2].\n\nBy definition of the Bregman divergence, we have\n\n(d/dt)V(Z(t)) = (d/dt)Dψ*(Z(t), z⋆) = (d/dt)(ψ*(Z(t)) − ψ*(z⋆) − ⟨∇ψ*(z⋆), Z(t) − z⋆⟩) = ⟨∇ψ*(Z(t)) − ∇ψ*(z⋆), Ż(t)⟩ = ⟨X(t) − x⋆, Ż(t)⟩.\n\nTherefore, if the dual variable Z obeys the dynamics Ż = −∇f(X), then\n\n(d/dt)V(Z(t)) = −⟨∇f(X(t)), X(t) − x⋆⟩ ≤ −(f(X(t)) − f⋆)\n\nand by the same argument as in the gradient descent ODE, V is a Lyapunov function and f((1/t)∫₀ᵗ X(τ)dτ) − f⋆ converges to 0 at a O(1/t) rate. The mirror descent ODE system can be summarized by\n\nX = ∇ψ*(Z)\nŻ = −∇f(X)   (1)\nX(0) = x0, Z(0) = z0 with ∇ψ*(z0) = x0\n\nNote that since ∇ψ* maps into X, X(t) = ∇ψ*(Z(t)) remains in X. Finally, the unconstrained gradient descent ODE can be obtained as a special case of the mirror descent ODE (1) by taking ψ*(z) = (1/2)‖z‖², for which ∇ψ* is the identity, in which case X and Z coincide.\n\n2.2 ODE interpretation of Nesterov's accelerated method\n\nIn [28], Su et al.
show that Nesterov's accelerated method [22] can be interpreted as a discretization of a second-order differential equation, given by\n\nẌ + ((r + 1)/t)Ẋ + ∇f(X) = 0\nX(0) = x0, Ẋ(0) = 0   (2)\n\nThe argument uses the following Lyapunov function (up to reparameterization): E(t) = (t²/r)(f(X) − f⋆) + (r/2)‖X + (t/r)Ẋ − x⋆‖², which is proved to be a Lyapunov function for the ODE (2) whenever r ≥ 2. Since E is decreasing along trajectories of the system, it follows that for all t > 0, E(t) ≤ E(0) = (r/2)‖x0 − x⋆‖², therefore f(X(t)) − f⋆ ≤ (r/t²)E(t) ≤ (r/t²)E(0) = (r²/(2t²))‖x0 − x⋆‖², which proves that f(X(t)) converges to f⋆ at a O(1/t²) rate. One should note in particular that the squared Euclidean norm is used in the definition of E(t) and, as a consequence, discretizing the ODE (2) leads to a family of unconstrained, Euclidean accelerated methods. In the next section, we show that by combining this argument with Nemirovski's idea of using a general Bregman divergence as a Lyapunov function, we can construct a much more general family of ODE systems which have the same O(1/t²) convergence guarantee. And by discretizing the resulting dynamics, we obtain a general family of accelerated methods that are not restricted to the unconstrained Euclidean geometry.\n\n3 Continuous-time Accelerated Mirror Descent\n\n3.1 Derivation of the accelerated mirror descent ODE\n\nWe consider a pair of dual convex functions, ψ defined on X and ψ* defined on E*, such that ∇ψ*: E* → X. We assume that ψ* is Lψ*-smooth with respect to ‖·‖*, a reference norm on the dual space.
Consider the function\n\nV(X(t), Z(t), t) = (t²/r)(f(X(t)) − f⋆) + rDψ*(Z(t), z⋆)   (3)\n\nwhere Z is a dual variable for which we will design the dynamics, and z⋆ is its value at equilibrium. Taking the time-derivative of V, we have\n\n(d/dt)V(X(t), Z(t), t) = (2t/r)(f(X) − f⋆) + (t²/r)⟨∇f(X), Ẋ⟩ + r⟨Ż, ∇ψ*(Z) − ∇ψ*(z⋆)⟩.\n\nAssume that Ż = −(t/r)∇f(X). Then, the time-derivative of V becomes\n\n(d/dt)V(X(t), Z(t), t) = (2t/r)(f(X) − f⋆) − t⟨∇f(X), −(t/r)Ẋ + ∇ψ*(Z) − ∇ψ*(z⋆)⟩.\n\nTherefore, if Z is such that ∇ψ*(Z) = X + (t/r)Ẋ, and ∇ψ*(z⋆) = x⋆, then\n\n(d/dt)V(X(t), Z(t), t) = (2t/r)(f(X) − f⋆) − t⟨∇f(X), X − x⋆⟩ ≤ (2t/r)(f(X) − f⋆) − t(f(X) − f⋆) = −t((r − 2)/r)(f(X) − f⋆)   (4)\n\nand it follows that V is a Lyapunov function whenever r ≥ 2.
The proposed ODE system is then\n\nẊ = (r/t)(∇ψ*(Z) − X),\nŻ = −(t/r)∇f(X),   (5)\nX(0) = x0, Z(0) = z0, with ∇ψ*(z0) = x0.\n\nIn the unconstrained Euclidean case, taking ψ*(z) = (1/2)‖z‖², we have ∇ψ*(z) = z, thus Z = X + (t/r)Ẋ, and the ODE system is equivalent to (d/dt)(X + (t/r)Ẋ) = −(t/r)∇f(X), which is equivalent to the ODE (2) studied in [28], which we recover as a special case.\n\nWe also give another interpretation of ODE (5): the first equation is equivalent to t^r Ẋ + rt^(r−1) X = rt^(r−1) ∇ψ*(Z), or, in integral form, t^r X(t) = r∫₀ᵗ τ^(r−1) ∇ψ*(Z(τ))dτ, which can be written as X(t) = ∫₀ᵗ w(τ)∇ψ*(Z(τ))dτ / ∫₀ᵗ w(τ)dτ, with w(τ) = τ^(r−1). Therefore the coupled dynamics of (X, Z) can be interpreted as follows: the dual variable Z accumulates gradients with a t/r rate, while the primal variable X is a weighted average of ∇ψ*(Z(τ)) (the "mirrored" dual trajectory), with weights proportional to τ^(r−1). This also gives an interpretation of r as a parameter controlling the weight distribution. It is also interesting to observe that the weights are increasing if and only if r ≥ 2. Finally, with this averaging interpretation, it becomes clear that the primal trajectory X(t) remains in X, since ∇ψ* maps into X and X is convex.\n\n3.2 Solution of the proposed dynamics\n\nFirst, we prove existence and uniqueness of a solution to the ODE system (5), defined for all t > 0. By assumption, ψ* is Lψ*-smooth w.r.t. ‖·‖*, which is equivalent (see e.g. [26]) to ∇ψ* being Lψ*-Lipschitz. Unfortunately, due to the r/t term in the expression of Ẋ, the function (X, Z, t) ↦ (Ẋ, Ż) is not Lipschitz at t = 0, and we cannot directly apply the Cauchy-Lipschitz existence and uniqueness theorem. However, one can work around it by considering a sequence of approximating ODEs, similarly to the argument used in [28].\n\nTheorem 1. Suppose f is C¹, and that ∇f is Lf-Lipschitz, and let (x0, z0) ∈ X × E* such that ∇ψ*(z0) = x0. Then the accelerated mirror descent ODE system (5) with initial condition (x0, z0) has a unique solution (X, Z), in C¹([0, ∞), Rn).\n\nWe will show existence of a solution on any given interval [0, T] (uniqueness is proved in the supplementary material). Let δ > 0, and consider the smoothed ODE system\n\nẊ = (r/max(t, δ))(∇ψ*(Z) − X),\nŻ = −(t/r)∇f(X),   (6)\nX(0) = x0, Z(0) = z0 with ∇ψ*(z0) = x0.\n\nSince the functions (X, Z) ↦ −(t/r)∇f(X) and (X, Z) ↦ (r/max(t, δ))(∇ψ*(Z) − X) are Lipschitz for all t ∈ [0, T], by the Cauchy-Lipschitz theorem (Theorem 2.5 in [29]), the system (6) has a unique solution (Xδ, Zδ) in C¹([0, T]). In order to show the existence of a solution to the original ODE, we use the following Lemma (proved in the supplementary material).\n\nLemma 1. Let t0 = 1/(2√(Lf Lψ*)). Then the family of solutions ((Xδ, Zδ)|[0,t0]), δ ≤ t0, is equi-Lipschitz-continuous and uniformly bounded.\n\nProof of existence.
Consider the family of solutions ((Xδi, Zδi)), with δi = t0·2^(−i), i ∈ N, restricted to [0, t0]. By Lemma 1, this family is equi-Lipschitz-continuous and uniformly bounded, thus by the Arzelà-Ascoli theorem, there exists a subsequence ((Xδi, Zδi))i∈I that converges uniformly on [0, t0] (where I ⊂ N is an infinite set of indices). Let (X̄, Z̄) be its limit. Then we prove that (X̄, Z̄) is a solution to the original ODE (5) on [0, t0].\n\nFirst, since for all i ∈ I, Xδi(0) = x0 and Zδi(0) = z0, it follows that X̄(0) = lim Xδi(0) = x0 and Z̄(0) = lim Zδi(0) = z0 (limits taken over i ∈ I, i → ∞), thus (X̄, Z̄) satisfies the initial conditions. Next, let t1 ∈ (0, t0), and let (X̃, Z̃) be the solution of the ODE (5) on t ≥ t1, with initial condition (X̄(t1), Z̄(t1)). Since (Xδi(t1), Zδi(t1))i∈I → (X̄(t1), Z̄(t1)) as i → ∞, then by continuity of the solution w.r.t. initial conditions (Theorem 2.8 in [29]), we have that for some ε > 0, Xδi → X̃ uniformly on [t1, t1 + ε). But we also have Xδi → X̄ uniformly on [0, t0], therefore X̄ and X̃ coincide on [t1, t1 + ε), therefore X̄ satisfies the ODE on [t1, t1 + ε). And since t1 is arbitrary in (0, t0), this concludes the proof of existence.\n\n3.3 Convergence rate\n\nIt is now straightforward to establish the convergence rate of the solution.\n\nTheorem 2. Suppose that f has Lipschitz gradient, and that ψ* is a smooth distance generating function. Let (X(t), Z(t)) be the solution to the accelerated mirror descent ODE (5) with r ≥ 2. Then for all t > 0, f(X(t)) − f⋆ ≤ r²Dψ*(z0, z⋆)/t².\n\nProof.
By construction of the ODE, V(X(t), Z(t), t) = (t²/r)(f(X(t)) − f⋆) + rDψ*(Z(t), z⋆) is a Lyapunov function. It follows that for all t > 0, (t²/r)(f(X(t)) − f⋆) ≤ V(X(t), Z(t), t) ≤ V(x0, z0, 0) = rDψ*(z0, z⋆).\n\n4 Discretization\n\nNext, we show that with a careful discretization of this continuous-time dynamics, we can obtain a general family of accelerated mirror descent methods for constrained optimization. Using a mixed forward/backward Euler scheme (see e.g. Chapter 2 in [10]), we can discretize the ODE system (5) using a step size √s as follows. Given a solution (X, Z) of the ODE (5), let tk = k√s, and x(k) = X(tk) = X(k√s). Approximating Ẋ(tk) with (X(tk + √s) − X(tk))/√s, we propose the discretization\n\n(x(k+1) − x(k))/√s = (r/(k√s))(∇ψ*(z(k)) − x(k+1)),\n(z(k+1) − z(k))/√s + (k√s/r)∇f(x(k+1)) = 0.   (7)\n\nThe first equation can be rewritten as x(k+1) = (x(k) + (r/k)∇ψ*(z(k)))/(1 + r/k) (note the independence on s, due to the time-scale invariance of the first ODE). In other words, x(k+1) is a convex combination of ∇ψ*(z(k)) and x(k) with coefficients λk = r/(r + k) and 1 − λk = k/(r + k). To summarize, our first discrete scheme can be written as\n\nx(k+1) = λk∇ψ*(z(k)) + (1 − λk)x(k), λk = r/(r + k),\nz(k+1) = z(k) − (ks/r)∇f(x(k+1)).   (8)\n\nSince ∇ψ* maps into the feasible set X, starting from x(0) ∈ X guarantees that x(k) remains in X for all k (by convexity of X).
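As a quick illustration of the scheme (8) (our sketch, not from the paper; the paper modifies this scheme below before proving the O(1/k²) rate), we can instantiate ∇ψ* as the entropic (softmax) mirror map used later in Section 4.4; by the remark above, the iterates then remain in the simplex by construction. The objective and parameter choices are arbitrary:

```python
import numpy as np

def softmax(z):
    """Entropic mirror map grad_psi*(z): maps the dual space into the simplex."""
    w = np.exp(z - z.max())  # shift for numerical stability
    return w / w.sum()

def amd_scheme8(grad_f, z0, s=0.01, r=3, iters=200):
    """Unmodified discrete scheme (8):
    x(k+1) = lam_k * grad_psi*(z(k)) + (1 - lam_k) * x(k), lam_k = r/(r+k),
    z(k+1) = z(k) - (k*s/r) * grad_f(x(k+1))."""
    z = np.asarray(z0, dtype=float)
    x = softmax(z)  # x(0) = grad_psi*(z(0))
    for k in range(iters):
        lam = r / (r + k)
        x = lam * softmax(z) + (1 - lam) * x   # convex combination: stays feasible
        z = z - (k * s / r) * grad_f(x)        # dual variable accumulates gradients
    return x

# Toy quadratic over the simplex, f(x) = (1/2)||x - c||^2 with c in the simplex.
c = np.array([0.6, 0.3, 0.1])
x_out = amd_scheme8(lambda x: x - c, z0=np.zeros(3))
```

Every iterate is a convex combination of points of the form ∇ψ*(z), so feasibility holds regardless of the step size; the convergence rate, however, is only established for the modified scheme of Section 4.1.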
Note that by duality, we have ∇ψ*(x*) = arg max_{x∈X} ⟨x, x*⟩ − ψ(x), and if we additionally assume that ψ is differentiable on the image of ∇ψ*, then ∇ψ = (∇ψ*)⁻¹ (Theorem 23.5 in [26]), thus if we write z̃(k) = ∇ψ*(z(k)), the second equation can be written as\n\nz̃(k+1) = ∇ψ*(∇ψ(z̃(k)) − (ks/r)∇f(x(k+1))) = arg min_{x∈X} ψ(x) − ⟨∇ψ(z̃(k)) − (ks/r)∇f(x(k+1)), x⟩ = arg min_{x∈X} (ks/r)⟨∇f(x(k+1)), x⟩ + Dψ(x, z̃(k)).\n\nWe will eventually modify this scheme in order to be able to prove the desired O(1/k²) convergence rate. However, we start by analyzing this version. Motivated by the continuous-time Lyapunov function (3), and using the correspondence t ≈ k√s, we consider the potential function E(k) = V(x(k), z(k), k√s) = (k²s/r)(f(x(k)) − f⋆) + rDψ*(z(k), z⋆).
Then we have\n\nE(k+1) − E(k) = ((k+1)²s/r)(f(x(k+1)) − f⋆) − (k²s/r)(f(x(k)) − f⋆) + r(Dψ*(z(k+1), z⋆) − Dψ*(z(k), z⋆))\n= (k²s/r)(f(x(k+1)) − f(x(k))) + (s(1 + 2k)/r)(f(x(k+1)) − f⋆) + r(Dψ*(z(k+1), z⋆) − Dψ*(z(k), z⋆)).\n\nAnd through simple algebraic manipulation, the last term can be bounded as follows:\n\nDψ*(z(k+1), z⋆) − Dψ*(z(k), z⋆)\n= Dψ*(z(k+1), z(k)) + ⟨∇ψ*(z(k)) − ∇ψ*(z⋆), z(k+1) − z(k)⟩ (by definition of the Bregman divergence)\n= Dψ*(z(k+1), z(k)) + ⟨(k/r)(x(k+1) − x(k)) + x(k+1) − x⋆, −(ks/r)∇f(x(k+1))⟩ (by the discretization (8))\n≤ Dψ*(z(k+1), z(k)) + (k²s/r²)(f(x(k)) − f(x(k+1))) + (ks/r)(f⋆ − f(x(k+1))) (by convexity of f).\n\nTherefore we have E(k+1) − E(k) ≤ −(s[(r − 2)k − 1]/r)(f(x(k+1)) − f⋆) + rDψ*(z(k+1), z(k)). Comparing this expression with the expression (4) of (d/dt)V(X(t), Z(t), t) in the continuous-time case, we see that we obtain an analogous expression, except for the additional Bregman divergence term rDψ*(z(k+1), z(k)), and we cannot immediately conclude that V is a Lyapunov function. This can be remedied by the following modification of the discretization scheme.\n\n4.1 A family of discrete-time accelerated mirror descent methods\n\nIn the expression (8) of x(k+1) = λk z̃(k) + (1 − λk)x(k), we propose to replace x(k) with x̃(k), obtained as a solution to a minimization problem x̃(k) = arg min_{x∈X} γs⟨∇f(x(k)), x⟩ + R(x, x(k)), where R is a regularization function that satisfies the following assumptions: there exist 0 < ℓR ≤ LR such that for all x, x' ∈ X, (ℓR/2)‖x − x'‖² ≤ R(x, x') ≤ (LR/2)‖x − x'‖². In the Euclidean case, one can take R(x, x') = ‖x − x'‖²/2, in which case ℓR = LR = 1 and the x̃ update becomes a prox-update. In the general case, one can take R(x, x') = Dφ(x, x') for some distance generating function φ which is ℓR-strongly convex and LR-smooth, in which case the x̃ update becomes a mirror update. The resulting method is summarized in Algorithm 1.
This algorithm is a generalization of Allen-Zhu and Orecchia's interpretation of Nesterov's method in [1], where x(k+1) is a convex combination of a mirror descent update and a gradient descent update.\n\nAlgorithm 1 Accelerated mirror descent with distance generating function ψ*, regularizer R, step size s, and parameter r ≥ 3\n1: Initialize x̃(0) = x0, z̃(0) = x0 (or z(0) ∈ (∇ψ)⁻¹(x0)).\n2: for k ∈ N do\n3: x(k+1) = λk z̃(k) + (1 − λk)x̃(k), with λk = r/(r + k).\n4: z̃(k+1) = arg min_{z̃∈X} (ks/r)⟨∇f(x(k+1)), z̃⟩ + Dψ(z̃, z̃(k)). (If ψ is non-differentiable, z(k+1) = z(k) − (ks/r)∇f(x(k+1)) and z̃(k+1) = ∇ψ*(z(k+1)).)\n5: x̃(k+1) = arg min_{x̃∈X} γs⟨∇f(x(k+1)), x̃⟩ + R(x̃, x(k+1))\n\n4.2 Consistency of the modified scheme\n\nOne can show that given our assumptions on R, x̃(k) = x(k) + O(s). Indeed, we have\n\n(ℓR/2)‖x̃(k) − x(k)‖² ≤ R(x̃(k), x(k)) ≤ R(x(k), x(k)) + γs⟨∇f(x(k)), x(k) − x̃(k)⟩ ≤ γs‖∇f(x(k))‖*‖x̃(k) − x(k)‖,\n\ntherefore ‖x̃(k) − x(k)‖ ≤ s(2γ‖∇f(x(k))‖*/ℓR), which proves the claim. Using this observation, we can show that the modified discretization scheme is consistent with the original ODE (5), that is, the difference equations defining x(k) and z(k) converge, as s tends to 0, to the ordinary differential equations of the continuous-time system (5).
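A minimal sketch of Algorithm 1 (ours, not from the paper), with the mirror and regularized steps supplied as callables; the Euclidean instantiation below takes both ψ* and φ quadratic, so both steps reduce to gradient steps, and the parameter choices satisfy the step-size conditions of Theorem 3 for this objective:

```python
import numpy as np

def accelerated_md(grad_f, mirror_step, reg_step, x0, s, gamma, r=3, iters=1000):
    """Sketch of Algorithm 1. mirror_step(g, z, eta) should return
    argmin_u eta*<g, u> + D_psi(u, z); reg_step(g, x, eta) should return
    argmin_u eta*<g, u> + R(u, x)."""
    x_tilde = np.asarray(x0, dtype=float)
    z_tilde = x_tilde.copy()
    for k in range(iters):
        lam = r / (r + k)
        x = lam * z_tilde + (1 - lam) * x_tilde  # step 3: convex combination
        g = grad_f(x)
        z_tilde = mirror_step(g, z_tilde, k * s / r)  # step 4: mirror update
        x_tilde = reg_step(g, x, gamma * s)           # step 5: regularized update
    return x_tilde

# Euclidean case: D_psi and R both quadratic, so both steps are gradient steps.
euclidean_step = lambda g, v, eta: v - eta * g
x_hat = accelerated_md(lambda x: x - 2.0, euclidean_step, euclidean_step,
                       x0=np.zeros(2), s=0.05, gamma=1.0)
```

For f(x) = (1/2)‖x − 2‖² we have Lf = ℓR = LR = Lψ* = 1, so γ = 1 ≥ LRLψ* and s = 0.05 ≤ ℓR/(2Lfγ) = 0.5, and the O(1/k²) bound of Theorem 3 applies.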
The difference equations of Algorithm 1 are equivalent to (7) in which x(k) is replaced by x̃(k), i.e.\n\n(x(k+1) − x̃(k))/√s = (r/(k√s))(∇ψ*(z(k)) − x(k+1)),\n(z(k+1) − z(k))/√s = −(k√s/r)∇f(x(k+1)).\n\nNow suppose there exist C¹ functions (X, Z), defined on R+, such that X(tk) ≈ x(k) and Z(tk) ≈ z(k) for tk = k√s. Then, using the fact that x̃(k) = x(k) + O(s), we have (x(k+1) − x̃(k))/√s = (x(k+1) − x(k))/√s + O(√s) ≈ (X(tk + √s) − X(tk))/√s + O(√s) = Ẋ(tk) + o(1), and similarly, (z(k+1) − z(k))/√s ≈ Ż(tk) + o(1), therefore the difference equation system can be written as\n\nẊ(tk) + o(1) = (r/tk)(∇ψ*(Z(tk)) − X(tk + √s)),\nŻ(tk) + o(1) = −(tk/r)∇f(X(tk + √s)),\n\nwhich converges to the ODE (5) as s → 0.\n\n4.3 Convergence rate\n\nTo prove convergence of the algorithm, consider the modified potential function\n\nẼ(k) = V(x̃(k), z(k), k√s) = (k²s/r)(f(x̃(k)) − f⋆) + rDψ*(z(k), z⋆).\n\nLemma 2. If γ ≥ LRLψ* and s ≤ ℓR/(2Lfγ), then for all k ≥ 0,\n\nẼ(k+1) − Ẽ(k) ≤ ((2k + 1 − kr)s/r)(f(x̃(k+1)) − f⋆).\n\nAs a consequence, if r ≥ 3, Ẽ is a Lyapunov function for k ≥ 1. This lemma is proved in the supplementary material.\n\nTheorem 3. The discrete-time accelerated mirror descent Algorithm 1 with parameter r ≥ 3 and step sizes γ ≥ LRLψ*, s ≤ ℓR/(2Lfγ), guarantees that for all k > 0,\n\nf(x̃(k)) − f⋆ ≤ (r/(sk²))Ẽ(1) ≤ r²Dψ*(z0, z⋆)/(sk²) + (f(x0) − f⋆)/k².\n\nProof.
The first inequality follows immediately from Lemma 2. The second inequality follows from a simple bound on Ẽ(1), proved in the supplementary material.\n\n4.4 Example: accelerated entropic descent\n\nWe give an instance of Algorithm 1 for simplex-constrained problems. Suppose that X = Δn = {x ∈ Rn+ : Σ_{i=1}^n xi = 1} is the n-simplex. Taking ψ to be the negative entropy on Δ, we have for x ∈ X, z ∈ E*,\n\nψ(x) = Σ_{i=1}^n xi ln xi + δ(x|Δ), ψ*(z) = ln(Σ_{i=1}^n e^{zi}), ∂ψ(x) = (1 + ln xi)i + Ru, ∇ψ*(z)i = e^{zi}/Σ_{j=1}^n e^{zj},\n\nwhere δ(·|Δ) is the indicator function of the simplex (δ(x|Δ) = 0 if x ∈ Δ and +∞ otherwise), and u ∈ Rn is a normal vector to the affine hull of the simplex. The resulting mirror descent update is a simple entropy projection and can be computed exactly in O(n) operations, and ψ* can be shown to be 1-smooth w.r.t. ‖·‖∞, see for example [3, 6]. For the second update, we take R(x, y) = Dφ(x, y) where φ is a smoothed negative entropy function defined as follows: let ε > 0, and let φ(x) = Σ_{i=1}^n (xi + ε) ln(xi + ε) + δ(x|Δ). Although no simple, closed-form expression is known for ∇φ*, it can be computed efficiently, in O(n log n) time using a deterministic algorithm, or O(n) expected time using a randomized algorithm, see [17]. Additionally, φ satisfies our assumptions: it is 1/(1 + nε)-strongly convex and 1-smooth w.r.t. ‖·‖∞.
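For concreteness (our sketch, not from the paper), the entropy projection in the z̃ update has the standard closed form z̃(k+1)_i ∝ z̃(k)_i · exp(−(ks/r)∇f(x(k+1))_i), computable in O(n):

```python
import numpy as np

def entropic_mirror_step(g, x, eta):
    """argmin_u eta*<g, u> + D_psi(u, x) over the simplex, with psi the
    negative entropy; closed form: u_i proportional to x_i * exp(-eta * g_i)."""
    w = x * np.exp(-eta * (g - g.min()))  # subtract min(g) so exponents are <= 0
    return w / w.sum()

x = np.full(4, 0.25)                # uniform point of the 4-simplex
g = np.array([1.0, 0.0, 0.0, 0.0])  # gradient penalizing the first coordinate
x_new = entropic_mirror_step(g, x, eta=1.0)
```

The shift by min(g) only rescales the unnormalized weights, so it does not change the result but avoids overflow; the smoothed update involving ∇φ* has no such closed form, as noted above.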
The resulting accelerated mirror descent method on the simplex can then be implemented efficiently, and by Theorem 3 it is guaranteed to converge in O(1/k²) whenever γ ≥ 1 and s ≤ 1/(2(1 + nε)Lfγ).\n\n5 Numerical Experiments\n\nWe test the accelerated mirror descent method in Algorithm 1 on simplex-constrained problems in Rn, n = 100, with two different objective functions: a simple quadratic f(x) = ⟨x − x⋆, Q(x − x⋆)⟩, for a random positive semi-definite matrix Q, and a log-sum-exp function given by f(x) = ln(Σ_{i=1}^I e^{⟨ai, x⟩ + bi}), where each entry in ai ∈ Rn and bi ∈ R is i.i.d. normal. We implement the accelerated entropic descent algorithm proposed in Section 4.4, and include the (non-accelerated) entropic descent for reference. We also adapt the gradient restarting heuristic proposed by O'Donoghue and Candès in [24], as well as the speed restart heuristic proposed by Su et al. in [28].\n\nFigure 1: Evolution of f(x(k)) − f⋆ on simplex-constrained problems, using different accelerated mirror descent methods with entropy distance generating functions. (a) Weakly convex quadratic, rank 10. (b) Log-sum-exp. (c) Effect of the parameter r.\n\nAlgorithm 2 Accelerated mirror descent with restart\n1: Initialize l = 0, x̃(0) = z̃(0) = x0.\n2: for k ∈ N do\n3: x(k+1) = λl z̃(k) + (1 − λl)x̃(k), with λl = r/(r + l).\n4: z̃(k+1) = arg min_{z̃∈X} (ks/r)⟨∇f(x(k+1)), z̃⟩ + Dψ(z̃, z̃(k)).\n5: x̃(k+1) = arg min_{x̃∈X} γs⟨∇f(x(k+1)), x̃⟩ + R(x̃, x(k+1)).\n6: l ← l + 1\n7: if Restart condition then\n8: z̃(k+1) ← x(k+1), l ← 0
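The two restart conditions used in our experiments (the gradient restart of [24] and the speed restart of [28]) are simple predicates on consecutive iterates; a minimal sketch (ours, not from the paper):

```python
import numpy as np

def gradient_restart(x_next, x, grad_x):
    """Gradient restart [24]: restart when the step moves against -grad f,
    i.e. <x(k+1) - x(k), grad f(x(k))> > 0."""
    return float(np.dot(x_next - x, grad_x)) > 0.0

def speed_restart(x_next, x, x_prev):
    """Speed restart [28]: restart when the iterates slow down,
    i.e. ||x(k+1) - x(k)|| < ||x(k) - x(k-1)||."""
    return np.linalg.norm(x_next - x) < np.linalg.norm(x - x_prev)
```

In Algorithm 2, a triggered condition resets z̃(k+1) to x(k+1) and the internal counter l to 0, which restarts the momentum.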
The generic restart method is given in Algorithm 2. The restart conditions are the following: (i) gradient restart: $\langle x^{(k+1)} - x^{(k)}, \nabla f(x^{(k)})\rangle > 0$, and (ii) speed restart: $\|x^{(k+1)} - x^{(k)}\| < \|x^{(k)} - x^{(k-1)}\|$.
The results are given in Figure 1. The accelerated mirror descent method exhibits a polynomial convergence rate, which is empirically faster than the $O(1/k^2)$ rate predicted by Theorem 3. The method also exhibits oscillations around the set of minimizers; increasing the parameter $r$ seems to reduce the period of the oscillations and results in a trajectory that is initially slower but faster for large $k$, see Figure 1-c. The restarting heuristics alleviate the oscillations and empirically speed up convergence. We also visualized, for each experiment, the trajectory of the iterates $x^{(k)}$ for each method, projected on a 2-dimensional hyperplane. The corresponding videos are included in the supplementary material.

6 Conclusion

By combining the Lyapunov argument that motivated mirror descent with the recent ODE interpretation [28] of Nesterov's method, we proposed a family of ODE systems for minimizing convex functions with a Lipschitz gradient, which are guaranteed to converge at a $O(1/t^2)$ rate, and proved existence and uniqueness of a solution. Then, by discretizing the ODE, we proposed a family of accelerated mirror descent methods for constrained optimization, and proved an analogous $O(1/k^2)$ rate when the step size is small enough.
The connection with the continuous-time dynamics motivates a more detailed study of the ODE (5), such as studying the oscillatory behavior of its solution trajectories, its convergence rates under additional assumptions such as strong convexity, and a rigorous study of the restart heuristics.

Acknowledgments

We gratefully acknowledge the NSF (CCF-1115788, CNS-1238959, CNS-1238962, CNS-1239054, CNS-1239166), the ARC (FL110100281 and ACEMS), and the Simons Institute Fall 2014 Algorithmic Spectral Graph Theory Program.

References
[1] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv, 2014.
[2] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705–1749, December 2005.
[3] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, May 2003.
[4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[5] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.
[6] Aharon Ben-Tal, Tamar Margalit, and Arkadi Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM J. on Optimization, 12(1):79–108, January 2001.
[7] Anthony Bloch, editor. Hamiltonian and Gradient Flows, Algorithms, and Control. American Mathematical Society, 1994.
[8] A. A. Brown and M. C. Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62(2):211–224, 1989.
[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[10] J. C. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd, 2008.
[11] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge, 2006.
[12] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML), June 2011.
[13] U. Helmke and J. B. Moore. Optimization and Dynamical Systems. Communications and Control Engineering Series. Springer-Verlag, 1994.
[14] Anatoli Juditsky. Convex Optimization II: Algorithms, Lecture Notes. 2013.
[15] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst., 1(1):17–58, 2011.
[16] H. K. Khalil. Nonlinear Systems. Macmillan Pub. Co., 1992.
[17] Walid Krichene, Syrine Krichene, and Alexandre Bayen. Efficient Bregman projections onto the simplex. In 54th IEEE Conference on Decision and Control, 2015.
[18] A. M. Lyapunov. General Problem of the Stability of Motion. Control Theory and Applications Series. Taylor & Francis, 1992.
[19] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, 1983.
[20] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[21] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
[23] Yurii Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.
[24] Brendan O'Donoghue and Emmanuel Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
[25] M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In CDC 2012, pages 6793–6800, 2012.
[26] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[27] J. Schropp and I. Singer. A dynamical systems approach to constrained minimization. Numerical Functional Analysis and Optimization, 21(3-4):537–551, 2000.
[28] Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In NIPS, 2014.
[29] Gerald Teschl. Ordinary Differential Equations and Dynamical Systems, volume 140. American Mathematical Soc., 2012.