{"title": "Variational PDEs for Acceleration on Manifolds and Application to Diffeomorphisms", "book": "Advances in Neural Information Processing Systems", "page_first": 3793, "page_last": 3803, "abstract": "We consider the optimization of cost functionals on manifolds and derive a variational approach to accelerated methods on manifolds. We demonstrate the methodology on the infinite-dimensional manifold of diffeomorphisms, motivated by registration problems in computer vision. We build on the variational approach to accelerated optimization by Wibisono, Wilson and Jordan, which applies in finite dimensions, and generalize that approach to infinite dimensional manifolds. We derive the continuum evolution equations, which are partial differential equations (PDE), and relate them to simple mechanical principles. Our approach can also be viewed as a generalization of the $L^2$ optimal mass transport problem. Our approach evolves an infinite number of particles endowed with mass, represented as a mass density. The density evolves with the optimization variable, and endows the particles with dynamics. This is different than current accelerated methods where only a single particle moves and hence the dynamics does not depend on the mass. We derive the theory, compute the PDEs for acceleration, and illustrate the behavior of this new accelerated optimization scheme.", "full_text": "Variational PDEs for Acceleration on Manifolds and\n\nApplication to Diffeomorphisms\n\nGanesh Sundaramoorthi\n\nUnited Technologies Research Center\n\nEast Hartford, CT 06118\n\nsundarga1@utrc.utc.com\n\nAnthony Yezzi\n\nSchool of Electrical & Computer Engineering\n\nGeorgia Institute of Technology, Atlanta, GA 30332\n\nayezzi@ece.gatech.edu\n\nAbstract\n\nWe consider the optimization of cost functionals on in\ufb01nite dimensional mani-\nfolds and derive a variational approach to accelerated methods on manifolds. 
We demonstrate the methodology on the infinite-dimensional manifold of diffeomorphisms, motivated by optical flow problems in computer vision. We build on a variational approach to accelerated optimization in finite dimensions, and generalize that approach to infinite dimensional manifolds. We derive the continuum evolution equations, which are partial differential equations (PDE), and relate them to mechanical principles. A particular case of our approach can be viewed as a generalization of the L2 optimal mass transport problem. Our approach evolves an infinite number of particles endowed with mass, represented as a mass density. The density evolves with the optimization variable, and endows the particles with dynamics. This is different from current accelerated methods, where only a single particle moves and hence the dynamics does not depend on mass. We derive the theory and the PDEs for acceleration, and illustrate the behavior of this new scheme.

1 Introduction

Accelerated optimization methods have gained wide interest within the machine learning and optimization communities (e.g., [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]). They are known for optimal convergence rates among schemes that use only gradient (first order) information in the convex case. In the non-convex case, they appear to provide robustness to shallow local minima. The intuitive idea is that a particle with mass moving in an energy landscape will gain momentum, surpass shallow local minima, and settle in deeper and wider local extrema. These methods have so far only been used in finite dimensional optimization problems. In this paper, we consider the generalization of these methods to infinite dimensional manifolds. We are motivated by applications in computer vision, e.g., segmentation, 3D reconstruction, and optical flow.
In these problems, optimization is over infinite dimensional manifolds (e.g., curves, surfaces, mappings). Recently there has been interest within machine learning in optimization on finite dimensional manifolds, such as matrix groups, e.g., [14, 15, 16], in a non-variational framework, and the methodology presented here can be adapted to those problems as well.

Recent work [17] has shown that the continuum limit of accelerated methods may be formulated with variational principles. The resulting optimal continuum path is defined by an ordinary differential equation (ODE), which when discretized appropriately yields Nesterov's method [13]. The optimization problem is an action integral, which integrates the Bregman Lagrangian over paths. The Bregman Lagrangian consists of a generalization of kinetic and potential energies. The kinetic energy is defined using the Bregman divergence; the potential energy is the cost function that is to be optimized. We build on the approach of [17] by formulating accelerated optimization with an action integral, but we generalize that approach to manifolds. Our approach is general for manifolds, but we illustrate the idea here for the case of the infinite dimensional manifold of diffeomorphisms of Rn. To do this, we forgo the Bregman Lagrangian framework in [17] since that assumes that the variable over which one optimizes is embedded in Rn, which is not the case for infinite dimensional manifolds. Instead, we adopt the formulation of action integrals in physics [18, 19], where kinetic energies are defined through Riemannian metrics, which allows generalization beyond Euclidean geometries.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our contributions are specifically: 1. We present a novel variational approach to accelerated optimization on manifolds. 2.
We adapt our approach to accelerated optimization on the infinite dimensional manifold of diffeomorphisms, i.e., smooth invertible mappings. 3. We introduce a Riemannian metric for the purpose of acceleration on diffeomorphisms, which defines the kinetic energy of a mass distribution. The metric is the same one in the fluid mechanics formulation of the L2 mass transport problem [20]. 4. We derive the PDE for accelerated optimization of any cost functional defined on diffeomorphisms, and relate it to fluid mechanics principles. 5. We present a numerical discretization, which requires entropy schemes [21], and show the advantage over gradient descent.

1.1 Related Work

Optimal Mass Transport: Our work relates to the optimal mass transport problem (e.g., [22, 23, 20, 24]). In this problem, two probability densities ρ0, ρ1 in Rn are given, and the goal is to find a transformation M : Rn → Rn so that the pushforward of ρ0 by M results in ρ1 and M has minimal cost. The cost is defined as ∫_{Rn} |M(x) − x|^p ρ0(x) dx, where p ≥ 1. The value of the minimum cost is called the Lp Wasserstein distance. For p = 2, [20] has shown that mass transport can be formulated as a fluid mechanics problem. In particular, the Wasserstein distance can be formulated as a distance arising from a Riemannian metric on the space of probability densities. The tangent space consists of vector fields that infinitesimally displace the density, and the metric is the kinetic energy of the mass distribution as it is displaced by the velocity, given by ∫_{Rn} (1/2) ρ(x)|v(x)|^2 dx.

Optimal mass transport computes the optimal path that minimizes the integral of kinetic energy. In our work, we minimize a potential on the manifold of diffeomorphisms, with acceleration. We associate a mass density in Rn to the diffeomorphism that, as we optimize the potential, moves in Rn via a push-forward of the diffeomorphism. This evolution arises as the stationary condition of the action integral, i.e., the difference of the kinetic and potential energies. One choice of kinetic energy that we explore is the L2 Riemannian mass transport metric. The main difference of our approach is that we compute stationary paths of the path integral of kinetic minus potential energies.

Diffeomorphic Registration: Our work relates to diffeomorphic image registration [25, 26], where the goal is to compute registration (pixel-wise correspondence) between images as diffeomorphisms. The optimization problem is formed on a path of velocity fields, which generates a diffeomorphism. The goal is to minimize ∫_0^1 ‖v‖^2 dt, where v is a time varying vector field, and the optimization is subject to the constraint that the mapping φ maps one image to the other, i.e., I1 = I0 ∘ φ^{−1}. This minimizes an action integral where the action contains only a kinetic energy. The norm is a Sobolev norm to ensure that the generated diffeomorphism is smooth. The problem is solved by a Sobolev gradient descent (see also [27, 28, 29]) on the space of paths, which gives a geodesic.

Our framework instead uses accelerated gradient descent. Like [25, 26], it is derived from an action integral, but the action has both a kinetic energy and a potential energy. In our work, one choice of kinetic energy is an L2 metric weighted by mass rather than a Sobolev metric.
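The Monge transport cost discussed above is easy to exercise numerically. Below is a minimal 1D sketch of the cost ∫ |M(x) − x|^p ρ0(x) dx; the grid, density, and map M are illustrative choices, not from the paper:

```python
import numpy as np

# Hedged sketch: discrete Riemann-sum approximation of the Monge transport
# cost  cost(M) = \int |M(x) - x|^p rho0(x) dx  on a 1D grid.
# All names and sizes here are illustrative assumptions.

x = np.linspace(0.0, 1.0, 201)          # 1D grid on [0, 1]
dx = x[1] - x[0]
rho0 = np.ones_like(x)                   # uniform density on [0, 1]
rho0 /= np.sum(rho0) * dx                # normalize so \int rho0 dx = 1

def transport_cost(M, p=2):
    """Approximate \\int |M(x) - x|^p rho0(x) dx with a Riemann sum."""
    return np.sum(np.abs(M - x) ** p * rho0) * dx

M_translate = x + 0.25                   # pure translation by 0.25
cost = transport_cost(M_translate, p=2)  # for a translation, |M(x) - x|^2 = 0.25^2
```

For a pure translation the integrand is constant, so the p = 2 cost equals the squared displacement, 0.0625, matching the intuition that the L2 Wasserstein distance between a density and its translate is the translation length.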
One of our motivations in this work is to get the regularizing effects of Sobolev norms without using Sobolev norms, since that requires inverting differential operators in the optimization, which can be computationally expensive. Our approach allows us to generate diffeomorphisms without using Sobolev norms.

Optical Flow: The purpose of this paper is to derive the variational framework for acceleration on manifolds, in particular diffeomorphisms. To demonstrate our ideas numerically, we show gains over gradient descent on one possible application - variational optical flow (e.g., [30, 31, 32, 33, 34, 29]). Optical flow, i.e., determining pixel-wise correspondence between images, is fundamental to computer vision and remains a challenge to solve, due to its non-convexity. The dominant general approach to these problems involves iterative linearization of the cost functional around the current accumulated optical flow and its solution for an incremental displacement. This approach, in conjunction with image pyramids, has been successful in many cases, but there are still many cases, e.g., large displacements of thin structures, where it fails.

Accelerated Optimization in Infinite Dimensions: We present general methodology for accelerated optimization on infinite-dimensional manifolds, and illustrate it on diffeomorphisms (see [35] for an extended version of this paper). The case of the manifold of curves and surfaces is considered in [36]. Convergence rates of the resulting PDEs are not explored in this paper. However, in [37], we rigorously analyze stability conditions and time step restrictions for discretizations of accelerated PDEs and show that they are significantly more generous compared to gradient descent.

2 Background for Accelerated Optimization on Manifolds

Differential Geometry: We review basic manifold theory [38]. A manifold M is a space in which every point p ∈ M has an (invertible) mapping fp from a neighborhood of p to a model space that is a linear normed vector space, with the additional compatibility condition that if the neighborhoods for p and q overlap then the mapping fp ∘ fq^{−1} is differentiable. Intuitively, a manifold is a space that locally appears flat. The manifold may be finite or infinite dimensional when the model spaces are finite or infinite dimensional, respectively. A tangent vector at a point p ∈ M is an equivalence class, [γ], of curves γ : [0, 1] → M under the equivalence that γ(0) = p and (fp ∘ γ)′(0) is the same for each curve γ ∈ [γ]. The tangent space at p, denoted TpM, is the set of such equivalence classes; intuitively, these are the possible directions of movement at the point p on the manifold. The tangent bundle, denoted TM, is TM = {(p, v) : p ∈ M, v ∈ TpM}, i.e., the space formed from the collection of all points and tangent spaces.

A Riemannian manifold is a manifold that has an inner product (called the metric) on each tangent space TpM, varying smoothly on M. A Riemannian manifold allows one to formally define the lengths of curves γ : [−1, 1] → M on the manifold. This allows one to construct paths of critical length, called geodesics, which generalize the constant velocity (straight-line) paths of Euclidean space. The Riemannian metric also allows one to define gradients of functions g : M → R defined on the manifold: the gradient ∇g(p) ∈ TpM is defined to be the vector that satisfies d/dε g(γ(ε))|_{ε=0} = ⟨∇g(p), γ′(0)⟩, where γ(0) = p, the left hand side is the directional derivative, and the right hand side is the inner product.

Mechanics on Manifolds: We now review some of the formalism of classical mechanics on manifolds [18, 19].
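Before turning to mechanics, the defining identity of the Riemannian gradient can be checked in a small finite-dimensional example: on the unit sphere with the induced metric, the gradient is the Euclidean gradient projected onto the tangent space. The function g, the point p, and the direction w below are illustrative choices, not from the paper:

```python
import numpy as np

# Hedged illustration: on S^2 with the induced metric, grad g(p) is the
# Euclidean gradient of g projected onto T_p S^2. We verify the identity
#   d/de g(gamma(e))|_{e=0} = <grad g(p), gamma'(0)>
# along a great circle gamma through p. All specific choices are illustrative.

def g(q):
    return q[0] + 2.0 * q[1] ** 2

def euclidean_grad_g(q):
    return np.array([1.0, 4.0 * q[1], 0.0])

p = np.array([0.0, 1.0, 0.0])             # a point on the unit sphere
w = np.array([1.0, 0.0, 0.0])             # unit tangent vector at p (w . p = 0)

eg = euclidean_grad_g(p)
riem_grad = eg - np.dot(eg, p) * p        # project out the normal component

gamma = lambda e: np.cos(e) * p + np.sin(e) * w   # great circle, gamma(0) = p
eps = 1e-6
dir_deriv = (g(gamma(eps)) - g(gamma(-eps))) / (2 * eps)
pairing = float(np.dot(riem_grad, w))     # <grad g(p), gamma'(0)>
```

The central difference of g along the curve agrees with the inner product of the projected gradient with the curve's initial velocity, which is exactly the defining property stated above.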
The subject of mechanics describes the principles governing the evolution of a particle that moves on a manifold M. The equations governing a particle are Newton's laws. There are two viewpoints in mechanics, namely the Lagrangian and Hamiltonian viewpoints, which formulate more general principles from which to derive Newton's equations. In this paper, we use the Lagrangian formulation to derive equations of motion for accelerated optimization on the manifold of diffeomorphisms. Lagrangian mechanics obtains equations of motion through variational principles, which makes it easier to generalize Newton's laws beyond simple particle systems in R3, especially to the case of manifolds. In Lagrangian mechanics, one starts with a Lagrangian L : TM → R where M is a Riemannian manifold. One says that a curve γ : [−1, 1] → M is a motion in a Lagrangian system with Lagrangian L if it is an extremal of A = ∫ L(γ(t), γ̇(t)) dt. The previous integral is called an action integral. Hamilton's principle of stationary action states that the motion in the Lagrangian system satisfies the condition that δA = 0, where δ denotes the variation, for all variations of A induced by variations of the path γ that keep endpoints fixed. The variation is defined as δA := d/ds A(γ̃(t, s))|_{s=0}, where γ̃ : [−1, 1]^2 → M is a smooth family of curves (a variation of γ) on the manifold such that γ̃(t, 0) = γ(t). The stationary conditions give rise to what is known as Lagrange's equations. A natural Lagrangian has the special form L = T − U where T : TM → R+ is the kinetic energy and U : M → R is the potential energy. The kinetic energy is defined as T(v) = (1/2)⟨v, v⟩, where ⟨·,·⟩ is the Riemannian metric. In the case that one has a particle system in R3, i.e., a collection of particles with masses mi, in a natural Lagrangian system, one can show that Hamilton's principle of stationary action is equivalent to Newton's law of motion, i.e., that d/dt (mi ṙi) = −∂U/∂ri, where ri is the trajectory of the ith particle, and ṙi is the velocity. This states that mass times acceleration is the force, which is given by minus the derivative of the potential in a conservative system. Thus, Hamilton's principle is more general and allows us to more easily derive equations of motion for more general systems, in particular those on manifolds.

In this paper, we will consider Lagrangian non-autonomous systems where the Lagrangian is also an explicit function of time t, i.e., L : TM × R → R. In particular, the kinetic and potential energies can both be explicit functions of time: T : TM × R → R and U : M × R → R. Autonomous systems have an energy conservation property and do not converge; for instance, one can think of a moving pendulum with no friction, which oscillates forever. Since we wish to minimize an objective functional, we want the system to converge, and Lagrangian non-autonomous systems allow for this.

Variational Approach to Accelerated Optimization in Rn: We review the variational formulation of accelerated gradient descent by [17]. This approach is based on the Bregman Lagrangian:

L(X, V, t) = e^{a(t)+γ(t)} [ d(X + e^{−a(t)}V, X) − e^{b(t)}U(X) ],

where the potential energy U represents the cost to be minimized, and d(y, x) = h(y) − h(x) − ∇h(x) · (y − x) where h is a given convex function. In the Euclidean case, where h(x) = (1/2)|x|^2 and d(y, x) = (1/2)|y − x|^2, this simplifies to

L(X, V, t) = e^{γ(t)} [ e^{−a(t)}|V|^2/2 − e^{a(t)+b(t)}U(X) ],

where T = (1/2)|V|^2 is the kinetic energy of a unit mass particle in Rn. Nesterov's methods [13, 39, 40, 12, 41, 11] belong to a subfamily of Bregman Lagrangians with various choices of a, b, γ.

The Bregman Lagrangian assumes that the underlying manifold is a subset of Rn, which is not true of many manifolds including diffeomorphisms1. Therefore, we use the mechanics formulation, which provides a formalism for general metrics through the Riemannian distance.

3 Accelerated Optimization on Manifolds Applied to Diffeomorphisms

In this section, we use mechanics on manifolds to generalize accelerated optimization to infinite dimensional manifolds, in particular the manifold of diffeomorphisms of Rn for general n. Diffeomorphisms, denoted Diff(Rn), are smooth mappings φ : Rn → Rn whose inverse exists and is also smooth. The cost functional (the potential) is denoted U(φ) and our framework applies to any potential. In the first sub-section, we give the formulation and evolution equations for the case of acceleration without energy dissipation, i.e., the Lagrangian is autonomous, since it is relevant for the case of energy dissipation. In the second sub-section, we formulate the dissipative case, which generalizes [17] to diffeomorphisms. All proofs are found in the supplementary material.

3.1 Acceleration Without Energy Dissipation

Formulation of the Action Integral: To formulate the action, we define the kinetic energy T on the space of diffeomorphisms, which is defined on the tangent space, denoted TφDiff(Rn), to Diff(Rn) at a diffeomorphism φ. The tangent space at φ is the set of perturbations v of φ that preserve the diffeomorphism property, i.e., for all small ε, φ + εv is a diffeomorphism.
One can show that

TφDiff(Rn) = {v : φ(Rn) → Rn : v is in a Sobolev space},   (1)

which is a set of smooth vector fields on φ(Rn) in which the vector field at each point φ(x) displaces φ(x) infinitesimally. Since φ is a diffeomorphism, we have that φ(Rn) = Rn. However, we write v : φ(Rn) → Rn to emphasize that the velocity fields in the tangent space are defined on the deformed domain, so that v is an Eulerian velocity.

We note a result from [42], which will be the basis of our derivation of accelerated optimization on Diff(Rn). Any (orientable) diffeomorphism may be generated by integrating a time-varying smooth vector field over time, i.e.,

∂t φt(x) = vt(φt(x)),   x ∈ Rn,   (2)

where ∂t denotes the partial derivative with respect to t, φt denotes a time varying family of diffeomorphisms evaluated at the time t, and vt is a time varying collection of vector fields evaluated at time t. The path t → φt(x) for a fixed x represents the trajectory of a particle starting at x.

The space on which the kinetic energy is defined is now clear, but one more ingredient is needed before we can define the kinetic energy. Any accelerated method will need a notion of mass, as a mass-less ball will not accelerate. We imagine that an infinite number of particles with mass, densely distributed in Rn, exist and are displaced by the velocity field v at every point. We represent the mass distribution with a mass density ρ : Rn → R+, which is the mass divided by volume as the volume shrinks.

1The Bregman distance can be generalized to manifolds using the exponential and logarithmic maps. However, for many manifolds, including diffeomorphisms, computing these maps requires solving an optimization problem.
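Equation (2) can be exercised numerically: integrating particle trajectories under a smooth field produces an invertible map. A minimal 1D sketch follows; in one dimension, strict monotonicity of x → φ(x) certifies invertibility. The field v, grid, and step counts are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of generating a (1D) mapping by integrating a smooth velocity
# field, as in  d/dt phi_t(x) = v_t(phi_t(x)).  The specific field and step
# sizes are illustrative choices, not from the paper.

def v(y, t):
    return 0.5 * np.sin(np.pi * y)       # smooth, vanishes at y = 0 and y = 1

x = np.linspace(0.0, 1.0, 101)           # particles seeded on [0, 1]
phi = x.copy()
dt = 1e-3
for k in range(1000):                    # forward-Euler particle trajectories
    phi = phi + dt * v(phi, k * dt)

# In 1D, strict monotonicity of x -> phi(x) certifies invertibility; the
# endpoints stay (essentially) fixed since v vanishes there.
monotone = bool(np.all(np.diff(phi) > 0))
```

With a small enough step, each Euler update y → y + dt·v(y) has positive derivative, so the composed map remains strictly increasing, i.e., invertible.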
During the evolution to optimize the potential U, the particles are displaced continuously and thus the density of these particles will in general change over time. We will assume that the system of particles in Rn is closed and so we impose mass preservation, i.e., ∫_{Rn} ρ(x) dx = 1. The evolution of a time varying density ρt as it is deformed by a time varying velocity is given by the continuity equation, which is a local form of the conservation of mass, given by

∂t ρ(x) + div(ρ(x)v(x)) = 0,   x ∈ Rn,   (3)

where div(·) denotes the divergence operator, given by div(F) = Σ_{i=1}^n ∂_{xi} F^i, where ∂_{xi} is the partial derivative with respect to the ith coordinate and F^i is the ith component of the vector field.

We now have the ingredients to define the kinetic energy. We present one choice to illustrate the idea of accelerated optimization. We define the kinetic energy as

T(v) = (1/2) ∫_{φ(Rn)} ρ(x)|v(x)|^2 dx,   (4)

which matches the definition of the kinetic energy of a system of particles in physics.

We define the action integral, which is defined on paths on Diff(Rn). A path of diffeomorphisms is φ : [0,∞) × Rn → Rn. We denote the diffeomorphism at a time t as φt. Since diffeomorphisms are generated by velocity fields, we can define the action on paths of velocity fields. A path of velocity fields is given by v : [0,∞) × Rn → Rn. The kinetic energy is dependent on the mass density; thus, a path of densities ρ : [0,∞) × Rn → R+ is required, which represents the mass distribution as it is deformed. The action integral is

A = ∫ [T(vt) − U(φt)] dt,   (5)

where the integral is over time.
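The continuity equation (3) and the mass-preservation constraint can be exercised with a simple conservative upwind scheme. A hedged 1D sketch with periodic boundaries follows; the grid, velocity field, and step counts are illustrative assumptions:

```python
import numpy as np

# Hedged sketch: conservative upwind discretization of the 1D continuity
# equation  d/dt rho + d/dx(rho v) = 0  with periodic boundaries. Because the
# update is a difference of fluxes, total mass is conserved to rounding error.
# Grid sizes and the velocity are illustrative choices, not from the paper.

N = 200
dx = 1.0 / N
x = np.arange(N) * dx
rho = 1.0 + 0.5 * np.cos(2 * np.pi * x)      # positive initial density
v = 0.3 * np.ones(N)                          # constant rightward velocity

mass0 = np.sum(rho) * dx
dt = 0.5 * dx / np.max(np.abs(v))             # CFL-respecting time step
for _ in range(100):
    flux = rho * v                            # v > 0: upwind uses the left cell
    rho = rho - dt / dx * (flux - np.roll(flux, 1))

mass1 = np.sum(rho) * dx
```

The flux-difference form makes the sum over cells telescope, so mass is preserved exactly in exact arithmetic, mirroring the constraint ∫ ρ dx = 1; the CFL bound also keeps the density positive.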
The action is implicitly a function of three paths, i.e., vt, φt and ρt. These paths are coupled as φt depends on vt through (2), and ρt depends on vt through (3).

Stationary Conditions for the Action: We treat the computation of the stationary conditions of the action as a constrained optimization problem with respect to the two aforementioned constraints. To do this, it is easier to formulate the action in terms of the path of the inverse diffeomorphisms φt^{−1}, which we will call ψt. This is because the non-linear PDE constraint (2) can be equivalently reformulated as the following linear transport PDE in the inverse mappings:

∂t ψt(x) + [Dψt(x)]vt(x) = 0,   x ∈ Rn,   (6)

where D denotes the derivative (Jacobian) operator. To derive the stationary conditions with respect to the constraints, we use the method of Lagrange multipliers. We denote by λ : [0,∞) × Rn → Rn the Lagrange multiplier according to (6). We denote μ : [0,∞) × Rn → R as the Lagrange multiplier for the continuity equation (3). Because we would like to be able to have possibly discontinuous solutions of the continuity equation, we formulate it in its weak form by multiplying the constraint by the Lagrange multiplier and integrating by parts, thereby removing the derivatives on the possibly discontinuous ρ. This gives the action integral with Lagrange multipliers as

A = ∫ [T(v) − U(φ)] dt + ∫∫_{Rn} λ^T [∂t ψ + (Dψ)v] dx dt − ∫∫_{Rn} [∂t μ + ∇μ · v] ρ dx dt,   (7)

where we have omitted the subscripts to avoid cluttering the notation. Notice that the potential U is now a function of ψ, and the action depends on ρ, ψ, v and the Lagrange multipliers μ, λ.

We now compute variations of A as we perturb the paths by variations δρ, δv and δφ along the paths.
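The integration-by-parts step used to obtain the weak form has an exact discrete counterpart (summation by parts): on a periodic grid, the adjoint of the forward difference is minus the backward difference. A minimal sketch, with grid size and test functions as illustrative assumptions:

```python
import numpy as np

# Hedged sketch: discrete analogue of the integration by parts used to put the
# continuity-equation constraint into weak form. On a periodic 1D grid,
#   sum_i mu_i (D+ f)_i dx = - sum_i (D- mu)_i f_i dx
# holds exactly (telescoping), where D+/D- are forward/backward differences
# and f stands in for the flux rho * v. All choices here are illustrative.

N = 64
dx = 1.0 / N
x = np.arange(N) * dx
mu = np.sin(2 * np.pi * x)                  # Lagrange-multiplier test function
f = np.cos(4 * np.pi * x) + 2.0             # stands in for rho * v

fwd = lambda u: (np.roll(u, -1) - u) / dx   # D+ (periodic)
bwd = lambda u: (u - np.roll(u, 1)) / dx    # D- (periodic)

lhs = np.sum(mu * fwd(f)) * dx              # "mu times d/dx(rho v)"
rhs = -np.sum(bwd(mu) * f) * dx             # "-(d/dx mu) times rho v"
```

This is why the weak form can move all derivatives off the possibly discontinuous ρ and onto the smooth multiplier μ without changing the value of the constraint term.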
The variation with respect to ρ is defined as δA · δρ = d/dε A(ρ + ε δρ, v, ψ)|_{ε=0}, and the other variations are defined in a similar fashion. This results in the following theorem.

Theorem 3.1 (Evolution Equations for the Path of Least Action). The stationary conditions for the path of the action integral (5) subject to the constraints (2) on the mapping and the continuity equation (3) are given by the forward evolution equation

ρ Dv/Dt = −∇U(φ),   or   ∂t v = −(Dv)v − (1/ρ)∇U(φ),   (8)

where Df/Dt := ∂t f + (Df)v is the material derivative. The previous equation describes the velocity evolution. The forward evolution equation for the diffeomorphism is given by (2), that of its inverse mapping is given by (6), and the forward evolution of its density is given by (3).

The first equation in (8) is an analogue of Newton's equations. Indeed, the equation says that the time rate of change of velocity along trajectories generated by the velocity field, multiplied by density, is equal to minus the gradient of the potential, which is Newton's 2nd law.

Viscosity Solution and Regularity: The evolution equations given by Theorem 3.1 maintain the mapping φt as a diffeomorphism. This is because we define the solution as the viscosity solution (e.g., [43, 44, 21]). The viscosity solution is defined as follows. Define

∂t vε = −(Dvε)vε + ε∆vε − ρ^{−1}∇U(φ),   (9)

where ∆ denotes the spatial Laplacian, which is a smoothing operator. This leads to a smooth (C∞) solution due to the smoothing properties of the Laplacian. The viscosity solution is v = lim_{ε→0} vε.
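As a side check on the transport structure of (8): with ∇U = 0 the velocity evolution reduces to ∂tv = −(Dv)v, the inviscid Burgers equation. An upwind discretization of the kind suggested by the entropy framework can be sketched in 1D; the grid, data, and step counts are illustrative assumptions (for everywhere-positive v the upwind difference uses the left neighbor):

```python
import numpy as np

# Hedged sketch: with grad U = 0, the velocity equation in Theorem 3.1 becomes
# d/dt v + (Dv) v = 0 (inviscid Burgers). Below, an upwind step in 1D with
# periodic boundaries; since each update is a convex combination of neighbors
# under the CFL bound, the scheme obeys a discrete maximum principle.
# All grid and data choices are illustrative, not from the paper.

N = 400
dx = 1.0 / N
x = np.arange(N) * dx
v = 0.2 + 0.1 * np.sin(2 * np.pi * x)     # smooth, strictly positive velocity

dt = 0.5 * dx / np.max(np.abs(v))         # CFL-respecting time step
for _ in range(50):
    # v > 0 everywhere, so upwind differences use the left neighbor
    v = v - dt * v * (v - np.roll(v, 1)) / dx

vmax, vmin = float(np.max(v)), float(np.min(v))
```

Advection without forcing creates no new extrema, so the evolved velocity stays within its initial range [0.1, 0.3]; this is the discrete analogue of the entropy-respecting behavior the numerical scheme relies on.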
We approximate the effect of small ε by using entropy conditions in our numerical implementation. Since the velocity is smooth (C∞), the integral of a smooth vector field is a diffeomorphism [42].

3.2 Acceleration with Energy Dissipation

We now present the case of a non-autonomous Lagrangian. We consider time varying scalar functions a, b : [0,∞) → R+, and define the action integral as follows:

A = ∫ [at T(vt) − bt U(φt)] dt,   (10)

where at, bt denote the values of the scalars at time t. It can be shown that the stationary conditions result in the following evolution:

Theorem 3.2 (Evolution Equations for the Path of Least Action). The stationary conditions for the path of the action integral (10) subject to the constraints (2) on the mapping and the continuity equation (3) are given by the forward evolution equation

a ∂t v + a(Dv)v + (∂t a)v = −(b/ρ)∇U(φ),   (11)

which describes the evolution of the velocity. The same evolution equations as in Theorem 3.1 for the mappings (2) and (6), and the density (3), hold.

If we consider the case at = e^{γt−αt}, bt = e^{αt+βt+γt}, where αt = log p − log t, βt = p log t + log C, γt = p log t, with p = 2, which was considered in [17] as the continuum limit of Nesterov's original scheme in finite dimensions, then we arrive at the following evolution equation:

∂t v = −(3/t)v − (Dv)v − (1/ρ)∇U(φ).   (12)

This evolution equation is the same as the evolution equation for the non-dissipative case (8), except for the term −(3/t)v. One can interpret the latter term as a frictional dissipative term.

4 Experiments

We now show empirical evidence to illustrate the behavior of our accelerated optimization by comparing it to gradient descent.
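Before the PDE experiments, the finite-dimensional analogue of (12), ẍ = −(3/t)ẋ − ∇U(x), can be integrated directly to see the accelerated dynamics: momentum builds, the iterate overshoots, and the (3/t) friction damps the oscillations. The toy potential, step size, and horizon below are illustrative assumptions:

```python
import numpy as np

# Hedged sketch: finite-dimensional analogue of the dissipative evolution (12),
#   dv/dt = -(3/t) v - grad U(x),   dx/dt = v,
# integrated with semi-implicit Euler on the toy potential U(x) = |x|^2 / 2.
# Step size, horizon, and initial data are illustrative choices.

def grad_U(x):
    return x                               # gradient of U(x) = 0.5 * |x|^2

x = np.array([1.0])
v = np.zeros(1)
dt = 0.01
U0 = 0.5 * float(x @ x)
for k in range(1, 2001):                   # t runs over (0, 20]
    t = k * dt
    v = v + dt * (-(3.0 / t) * v - grad_U(x))   # momentum update with friction
    x = x + dt * v                              # position update
U_final = 0.5 * float(x @ x)
```

The trajectory oscillates through the minimum with a decaying envelope (for this ODE the amplitude decays on the order of t^{-3/2}), so the final potential is orders of magnitude below its initial value.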
For illustration, we consider:

U(φ) = (1/2) ∫_{Rn} |I1(φ(x)) − I0(x)|^2 dx + (α/2) ∫_{Rn} |∇(φ(x) − x)|^2 dx,   (13)

where α > 0 is a weight, and I0, I1 are images; this is the classical Horn & Schunck energy. The first term is the data fidelity, which measures how close φ deforms I1 back to I0 through the squared norm, and the second term penalizes non-smoothness of the displacement field, given by φ(x) − x.

We compare to standard (Riemannian L2) gradient descent to illustrate how much one can gain by incorporating acceleration, which requires little additional effort over gradient descent. Over gradient descent, acceleration requires only updating the velocity by the velocity evolution in the previous section, and the density evolution. Both these evolutions are cheap to compute since they only involve local updates. We discretize using forward Euler and entropy conditions (see Supplementary for details), and choose the step size to satisfy CFL conditions. For gradient descent we choose ∆t < 1/(4α); for accelerated gradient descent we have the additional evolution of the velocity (12), and our numerical scheme has CFL condition ∆t < 1/(4α · max_{x∈Ω}{|v(x)|, |Dv(x)|}). The initialization is φ(x) = ψ(x) = x, v(x) = 0, and ρ(x) = 1/|Ω|, where |Ω| is the area of the image.

Figure 1: Convergence Comparison: Two binary images are registered. [Left]: The two images are a square and a translation of it. [Middle and right]: The first image is a square and the second image is a translated and non-uniformly scaled version of the square in the first image. [Left and middle]: The cost functional to be minimized versus the iteration number is shown for both gradient descent (GD) and accelerated gradient descent (AGD).
[Right]: The image reconstruction error ‖I1 ∘ φ − I0‖ for the non-uniformly scaled squares is shown.

I1   I0   I1 ∘ φgd   I1 ∘ φagd

Figure 2: Convergence Comparison: Two MR cardiac images are registered. [Left]: A plot of the potential versus the iteration number in the minimization using gradient descent (GD) and accelerated gradient descent (AGD). [Right]: The original images and the back-warped images using the recovered diffeomorphisms. Note that I1 ∘ φ should appear close to I0.

Convergence analysis: In this experiment, the images are two white squares against a black background. The images are 50 × 50 pixels, and the square (of size 20 × 20) in the first image is translated by 10 pixels to form the second image. Small images are chosen because gradient descent is impractically slow for larger image sizes (e.g., 256 × 256). Figure 1 (left) shows the plot of the potential energy (13) for both gradient descent and accelerated gradient descent. Here α = 5. Accelerated gradient descent very quickly accelerates to a global minimum, surpasses the global minimum and then oscillates until the friction term slows it down, and then it converges very quickly. Gradient descent slowly decreases the energy and eventually converges.

We now repeat the same experiment, but with different images. We choose the images again to be 50 × 50. The first image has a square that is 17 × 17 and the second image has a rectangle of size 20 × 14, translated by 8 pixels. We choose α = 2. A plot of the results is shown in Figure 1 (middle). Again accelerated gradient descent accelerates very quickly at the start, then oscillates, the oscillations die down, and then it converges. The potential is not zero as the flow is not a translation and thus the regularity term is non-zero.
Gradient descent converges faster than in the translation case because the smaller α permits a larger step size; however, it stalls at a high-energy configuration and has not fully converged. We verify this by plotting the first term of the potential, which is zero for accelerated gradient descent at convergence, indicating a perfect match; gradient descent has an error of 50, indicating that its flow does not fully warp I1 to I0.

We repeat the experiment with cardiac MR images, where the image transformation is a general diffeomorphism. We choose α = 0.02. A plot of potential versus iterations for both methods is shown in Figure 2. Convergence is quicker for AGD, though both schemes converge to a similar solution.

Figure 3: [Left]: Convergence Comparison as a Function of Regularity: Two binary images are registered with varying amounts of regularization α for gradient descent (GD) and accelerated gradient descent (AGD). [Right]: Convergence Comparison as a Function of Image Size: We vary the size (height and width) of the image and compare GD with AGD.

Figure 4: Analysis of Stability to Noise: We register noisy images with varying amounts of noise. We plot the error in the recovered flow of both GD and AGD versus the level of noise. [Left]: The first image is formed from a square and the second image is the same square but translated. 
[Right]: The first image is a square and the second image is the non-uniformly scaled and translated square.

Convergence analysis versus parameter settings: We now analyze the convergence of accelerated gradient descent and gradient descent as a function of the regularity α and the image size. First, we analyze an image pair of size 50 × 50 in which one image has a 16 × 16 square and the other has the same square translated by 7 pixels. We vary α and analyze the convergence. The left plot of Figure 3 shows the number of iterations until convergence versus the regularity α. As α increases, the number of iterations for both gradient descent and accelerated gradient descent increases; however, it grows more slowly for accelerated gradient descent. In all cases, the recovered flow matches the ground-truth flow. Next, we analyze the number of iterations to convergence versus the image size. We consider binary images with 16 × 16 squares translated by 7 pixels, and vary the image size from 50 × 50 to 200 × 200 with α = 8 fixed. The right plot of Figure 3 shows the number of iterations to convergence versus the image size. Gradient descent is impractically slow for all sizes considered, and its iteration count quickly increases with image size, whereas accelerated gradient descent shows little growth with respect to the image size.

Analysis of Robustness to Noise: We simulate undesirable local minima using salt-and-pepper noise. We consider images of size 50 × 50, fix α = 1, and vary the noise level. One could increase α to improve robustness to noise; however, we are interested in understanding the robustness of the optimization algorithms themselves. We consider a 16 × 16 square in a binary image and the same square translated by 4 pixels in the second image. 
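The salt-and-pepper corruption and the flow-error metric used in this robustness study might look as follows. Both `salt_and_pepper` and `flow_error` are hypothetical helper names for illustration; the exact noise model and error metric used in the experiments may differ.

```python
import numpy as np

def salt_and_pepper(img, level, rng=None):
    """Set a fraction `level` of pixels to the image max (salt) or min (pepper)."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = img.copy()
    corrupt = rng.random(img.shape) < level   # which pixels to corrupt
    salt = rng.random(img.shape) < 0.5        # salt vs. pepper, 50/50
    out[corrupt & salt] = img.max()
    out[corrupt & ~salt] = img.min()
    return out

def flow_error(phi, phi_true):
    """Mean endpoint error between a recovered map phi and the ground truth,
    both of shape (H, W, 2)."""
    return float(np.mean(np.linalg.norm(phi - phi_true, axis=-1)))
```

For a binary image, every corrupted pixel lands on one of the two existing intensities, so the noise directly creates spurious local matches for the data term rather than merely perturbing intensities.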
We plot the error in the flow versus the noise level. The result is shown in the left plot of Figure 4, which shows that accelerated gradient descent degrades much more slowly than gradient descent. We repeat the experiment with different images, one with a 15 × 15 square and the other with a 20 × 10 rectangle translated by 5 pixels. The result is plotted in the right of Figure 4, and a similar trend as in the previous experiment is observed.

Acknowledgments

This research was partially funded by ARO W911NF-18-1-0281 and NSF CCF-1526848.

References

[1] S. Bubeck, Y. T. Lee, and M. Singh, “A geometric alternative to Nesterov's accelerated gradient descent,” CoRR, vol. abs/1506.08187, 2015.

[2] N. Flammarion and F. Bach, “From averaging to acceleration, there is only a step-size,” in Proceedings of Machine Learning Research, vol. 40, 2015, pp. 658–695.

[3] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Math. Program., vol. 156, no. 1-2, pp. 59–99, 2016.

[4] C. Hu, W. Pan, and J. T. Kwok, “Accelerated gradient methods for stochastic optimization and online learning,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 781–789.

[5] S. Ji and J. 
Ye, “An accelerated gradient method for trace norm minimization,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML '09, 2009, pp. 457–464.

[6] V. Jojic, S. Gould, and D. Koller, “Accelerated dual decomposition for MAP inference,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML '10, 2010, pp. 503–510.

[7] W. Su, S. Boyd, and E. Candès, “A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights,” in Advances in Neural Information Processing Systems, 2014, pp. 2510–2518.

[8] W. Krichene, A. Bayen, and P. L. Bartlett, “Accelerated mirror descent in continuous and discrete time,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2845–2853.

[9] H. Li and Z. Lin, “Accelerated proximal gradient methods for nonconvex programming,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 379–387.

[10] L. F. Yang, R. Arora, V. Braverman, and T. Zhao, “The physical systems behind optimization algorithms,” arXiv preprint arXiv:1612.02803, 2016.

[11] Y. Nesterov, “Smooth minimization of non-smooth functions,” Math. Program., vol. 103, no. 1, pp. 127–152, 2005.

[12] ——, “Accelerating the cubic regularization of Newton's method on convex problems,” Math. Program., vol. 112, no. 1, pp. 159–181, 2008.

[13] ——, “A method of solving a convex programming problem with convergence rate O(1/k²),” in Soviet Mathematics Doklady, vol. 27, 1983, pp. 372–376.

[14] H. Zhang, S. J. Reddi, and S. 
Sra, “Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds,” in Advances in Neural Information Processing Systems, 2016, pp. 4592–4600.

[15] Y. Liu, F. Shang, J. Cheng, H. Cheng, and L. Jiao, “Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds,” in Advances in Neural Information Processing Systems, 2017, pp. 4875–4884.

[16] R. Hosseini and S. Sra, “An alternative to EM for Gaussian mixture models: Batch and stochastic Riemannian optimization,” arXiv preprint arXiv:1706.03267, 2017.

[17] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proceedings of the National Academy of Sciences, vol. 113, no. 47, pp. E7351–E7358, 2016.

[18] V. I. Arnol'd, Mathematical methods of classical mechanics. Springer Science & Business Media, 2013, vol. 60.

[19] J. E. Marsden and T. S. Ratiu, Introduction to mechanics and symmetry: a basic exposition of classical mechanical systems. Springer Science & Business Media, 2013, vol. 17.

[20] J.-D. Benamou and Y. Brenier, “A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem,” Numerische Mathematik, vol. 84, no. 3, pp. 375–393, 2000.

[21] J. A. Sethian, Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science. Cambridge University Press, 1999, vol. 3.

[22] C. Villani, Topics in optimal transportation. American Mathematical Soc., 2003, no. 58.

[23] W. Gangbo and R. J. McCann, “The geometry of optimal transportation,” Acta Mathematica, vol. 177, no. 2, pp. 113–161, 1996.

[24] S. Angenent, S. Haker, and A. Tannenbaum, “Minimizing flows for the Monge-Kantorovich problem,” SIAM Journal on Mathematical Analysis, vol. 
35, no. 1, pp. 61–97, 2003.

[25] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, “Computing large deformation metric mappings via geodesic flows of diffeomorphisms,” International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.

[26] M. I. Miller, A. Trouvé, and L. Younes, “Geodesic shooting for computational anatomy,” Journal of Mathematical Imaging and Vision, vol. 24, no. 2, pp. 209–228, 2006.

[27] G. Sundaramoorthi, A. Yezzi, and A. C. Mennucci, “Sobolev active contours,” International Journal of Computer Vision, vol. 73, no. 3, pp. 345–366, 2007.

[28] G. Charpiat, P. Maurel, J.-P. Pons, R. Keriven, and O. Faugeras, “Generalized gradients: Priors on minimization flows,” International Journal of Computer Vision, vol. 73, no. 3, pp. 325–344, 2007.

[29] Y. Yang and G. Sundaramoorthi, “Shape tracking with occlusions via coarse-to-fine region-based Sobolev descent,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 5, pp. 1053–1066, 2015.

[30] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.

[31] M. J. Black and P. Anandan, “The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields,” Computer Vision and Image Understanding, vol. 63, no. 1, pp. 75–104, 1996.

[32] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in European Conference on Computer Vision. Springer, 2004, pp. 25–36.

[33] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An improved algorithm for TV-L1 optical flow,” in Statistical and Geometrical Approaches to Visual Motion Analysis. Springer, 2009, pp. 23–45.

[34] D. Sun, S. Roth, and M. J. 
Black, “Secrets of optical flow estimation and their principles,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2432–2439.

[35] G. Sundaramoorthi and A. Yezzi, “Accelerated optimization in the PDE framework: Formulations for the manifold of diffeomorphisms,” ArXiv e-prints, vol. abs/1804.02307, Apr. 2018. [Online]. Available: http://arxiv.org/abs/1804.02307

[36] A. J. Yezzi and G. Sundaramoorthi, “Accelerated optimization in the PDE framework: Formulations for the active contour case,” CoRR, vol. abs/1711.09867, 2017. [Online]. Available: http://arxiv.org/abs/1711.09867

[37] M. Benyamin, J. Calder, G. Sundaramoorthi, and A. Yezzi, “Accelerated PDEs for efficient solution of regularized inversion problems,” vol. abs/1810.00410, 2018. [Online]. Available: http://arxiv.org/abs/1810.00410

[38] M. P. Do Carmo, Riemannian geometry. Birkhäuser, 1992.

[39] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 1st ed. Springer Publishing Company, Incorporated, 2014.

[40] ——, “Gradient methods for minimizing composite functions,” Math. Program., vol. 140, no. 1, pp. 125–161, 2013.

[41] Y. Nesterov and B. T. Polyak, “Cubic regularization of Newton method and its global performance,” Math. Program., vol. 108, no. 1, pp. 177–205, 2006.

[42] D. G. Ebin and J. Marsden, “Groups of diffeomorphisms and the motion of an incompressible fluid,” Annals of Mathematics, pp. 102–163, 1970.

[43] M. G. Crandall and P.-L. Lions, “Viscosity solutions of Hamilton-Jacobi equations,” Transactions of the American Mathematical Society, vol. 277, no. 1, pp. 1–42, 1983.

[44] E. Rouy and A. Tourin, “A viscosity solutions approach to shape-from-shading,” SIAM Journal on Numerical Analysis, vol. 29, no. 3, pp. 
867–884, 1992.