{"title": "Accelerated First-order Methods for Geodesically Convex Optimization on Riemannian Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 4868, "page_last": 4877, "abstract": "In this paper, we propose an accelerated first-order method for geodesically convex optimization, which is the generalization of the standard Nesterov's accelerated method from Euclidean space to nonlinear Riemannian space. We first derive two equations and obtain two nonlinear operators for geodesically convex optimization instead of the linear extrapolation step in Euclidean space. In particular, we analyze the global convergence properties of our accelerated method for geodesically strongly-convex problems, which show that our method improves the convergence rate from O((1-\\mu/L)^{k}) to O((1-\\sqrt{\\mu/L})^{k}). Moreover, our method also improves the global convergence rate on geodesically general convex problems from O(1/k) to O(1/k^{2}). Finally, we give a specific iterative scheme for matrix Karcher mean problems, and validate our theoretical results with experiments.", "full_text": "Accelerated First-order Methods for Geodesically\nConvex Optimization on Riemannian Manifolds\n\nYuanyuan Liu1, Fanhua Shang1\u2217, James Cheng1, Hong Cheng2, Licheng Jiao3\n1Dept. of Computer Science and Engineering, The Chinese University of Hong Kong\n\n2Dept. 
of Systems Engineering and Engineering Management,\nThe Chinese University of Hong Kong, Hong Kong\n3Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education,\nSchool of Artificial Intelligence, Xidian University, China\n{yyliu, fhshang, jcheng}@cse.cuhk.edu.hk; hcheng@se.cuhk.edu.hk; lchjiao@mail.xidian.edu.cn\n\nAbstract\n\nIn this paper, we propose an accelerated first-order method for geodesically convex optimization, which generalizes the standard Nesterov's accelerated method from Euclidean space to nonlinear Riemannian space. We first derive two equations and obtain two nonlinear operators for geodesically convex optimization, in place of the linear extrapolation step used in Euclidean space. In particular, we analyze the global convergence properties of our accelerated method for geodesically strongly convex problems, and show that it improves the convergence rate from O((1−µ/L)^k) to O((1−√(µ/L))^k). Moreover, our method also improves the global convergence rate on geodesically (general) convex problems from O(1/k) to O(1/k^2). Finally, we give a specific iterative scheme for matrix Karcher mean problems, and validate our theoretical results with experiments.\n\n1 Introduction\n\nIn this paper, we study the following Riemannian optimization problem:\n\nmin f(x) such that x ∈ X ⊂ M,   (1)\n\nwhere (M, g) denotes a Riemannian manifold with Riemannian metric g, X ⊂ M is a nonempty, compact, geodesically convex set, and f: X → R is geodesically convex (G-convex) and geodesically L-smooth (G-L-smooth). Here, a G-convex function may be non-convex in the usual Euclidean sense but convex along the manifold, and can therefore be minimized to global optimality. [5] presented G-convexity and G-convex optimization on geodesic metric spaces, though without any attention to global complexity analysis. 
As discussed in [11], the topic of \"geometric programming\" may be viewed as a special case of G-convex optimization. [25] developed theoretical tools to recognize and generate G-convex functions, as well as cone-theoretic fixed-point optimization algorithms. However, none of these three works provided a global convergence rate analysis for their algorithms. Very recently, [31] provided a global complexity analysis of first-order algorithms for G-convex optimization, and designed the following Riemannian gradient descent rule:\n\nx_{k+1} = Exp_{x_k}(−η gradf(x_k)),\n\nwhere k is the iteration index, Exp_{x_k} is the exponential map at x_k (see Section 2 for details), η is a step-size (learning rate), and gradf(x_k) is the Riemannian gradient of f at x_k ∈ X.\n\n∗Corresponding author.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper, we extend Nesterov's accelerated gradient descent method [19] from Euclidean space to nonlinear Riemannian space. Below, we first introduce Nesterov's method and its variants for convex optimization in Euclidean space, which can be viewed as a special case of our method when M = R^d and the metric is the Euclidean inner product. Many real-world applications involve large data sets; as data sets and problems grow, accelerating first-order methods is of both practical and theoretical interest. The earliest first-order method for minimizing a convex function f is perhaps the gradient method. Thirty years ago, Nesterov [19] proposed an accelerated gradient method of the following form: starting with x_0 and y_0 = x_0, for any k ≥ 1,\n\nx_k = y_{k−1} − η∇f(y_{k−1}),\ny_k = x_k + τ_k(x_k − x_{k−1}),   (2)\n\nwhere 0 ≤ τ_k ≤ 1 is the momentum parameter. 
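As a concrete reference point, scheme (2) can be run on a toy strongly convex quadratic. The sketch below (function and variable names are ours, not from the paper) uses η = 1/L and the strongly convex momentum τ_k ≡ (1−√(µ/L))/(1+√(µ/L)) discussed next, and compares it with plain gradient descent:

```python
import numpy as np

def nesterov(A, x0, eta, tau, iters):
    """Nesterov's scheme (2): gradient step at y, then linear extrapolation."""
    x_prev, y = x0.copy(), x0.copy()
    for _ in range(iters):
        x = y - eta * (A @ y)       # x_k = y_{k-1} - eta * grad f(y_{k-1})
        y = x + tau * (x - x_prev)  # y_k = x_k + tau_k (x_k - x_{k-1})
        x_prev = x
    return x_prev

def gradient_descent(A, x0, eta, iters):
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * (A @ x)
    return x

# f(x) = 0.5 x^T A x with L = 100, mu = 1; the minimizer is x* = 0.
A = np.diag([100.0, 1.0])
x0 = np.ones(2)
L, mu = 100.0, 1.0
tau = (1 - np.sqrt(mu / L)) / (1 + np.sqrt(mu / L))
x_acc = nesterov(A, x0, 1 / L, tau, 200)
x_gd = gradient_descent(A, x0, 1 / L, 200)
```

After 200 iterations the accelerated iterate is orders of magnitude closer to the minimizer than the plain gradient iterate, mirroring the O((1−√(µ/L))^k) versus O((1−µ/L)^k) rates discussed below.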
For a fixed step-size η = 1/L, where L is the Lipschitz constant of ∇f, this scheme with τ_k = (k−1)/(k+2) exhibits the optimal convergence rate f(x_k) − f(x⋆) ≤ O(L‖x⋆ − x_0‖^2 / k^2) for general convex (i.e., non-strongly convex) problems [20], where x⋆ is any minimizer of f. In contrast, standard gradient descent only achieves a rate of O(1/k). This improvement relies on the momentum term τ_k(x_k − x_{k−1}) together with the particularly tuned coefficient (k−1)/(k+2) ≈ 1 − 3/k. Inspired by the success of Nesterov's momentum, there has been much work on first-order accelerated methods; see [2, 8, 21, 26, 27] for examples. In addition, for strongly convex problems with τ_k ≡ (1−√(µ/L))/(1+√(µ/L)), Nesterov's accelerated gradient method attains a convergence rate of O((1−√(µ/L))^k), while standard gradient descent achieves the slower linear rate O((1−µ/L)^k). It is then natural to ask whether an accelerated method in nonlinear Riemannian space can attain the same convergence rates as its Euclidean counterpart (e.g., Nesterov's accelerated method [20]).\n\n1.1 Motivation and Challenges\n\nZhang and Sra [31] proposed an efficient Riemannian gradient descent (RGD) method, which attains convergence rates of O((1 − µ/L)^k) and O(1/k) for geodesically strongly convex and geodesically convex problems, respectively. Hence, there remain gaps in convergence rates between RGD and Nesterov's accelerated method.\nAs discussed in [31], a long-standing question is whether the famous Nesterov accelerated gradient descent algorithm has a counterpart in nonlinear Riemannian spaces. 
Compared with standard gradient descent in Euclidean space, Nesterov's accelerated gradient method involves a linear extrapolation step, y_k = x_k + τ_k(x_k − x_{k−1}), which improves its convergence rates for both strongly convex and non-strongly convex problems. It is clear that φ_k(x) := f(y_k) + ⟨∇f(y_k), x − y_k⟩ is a linear function in Euclidean space, while its counterpart in nonlinear space, φ_k(x) := f(y_k) + ⟨gradf(y_k), Exp^{−1}_{y_k}(x)⟩_{y_k}, is a nonlinear function, where Exp^{−1}_{y_k} is the inverse of the exponential map Exp_{y_k}, and ⟨·,·⟩_{y_k} is the inner product (see Section 2 for details). Therefore, in nonlinear Riemannian spaces there is no trivial analogue of such a linear extrapolation step. In other words, although Riemannian geometry provides tools that enable generalization of the Euclidean algorithms mentioned above [1], we must overcome some fundamental geometric hurdles to analyze the global convergence properties of our accelerated method, as in [31].\n\n1.2 Contributions\n\nTo answer the above open problem in [31], in this paper we propose a general accelerated first-order method for nonlinear Riemannian spaces, which is in essence a generalization of the standard Nesterov accelerated method. We summarize the key contributions of this paper as follows.\n• We first present a general Nesterov-style accelerated iterative scheme in nonlinear Riemannian spaces, in which the linear extrapolation step in (2) is replaced by a nonlinear operator. Furthermore, we derive two equations and obtain two corresponding nonlinear operators for the geodesically strongly convex and geodesically convex cases, respectively.\n• We provide a global convergence analysis of our accelerated algorithms, which shows that they attain convergence rates of O((1−√(µ/L))^k) and O(1/k^2) for geodesically strongly convex and geodesically convex objectives, respectively.\n• Finally, we present a specific iterative scheme for matrix Karcher mean problems. Our experimental results verify the effectiveness and efficiency of our accelerated method.\n\n2 Notation and Preliminaries\n\nWe first introduce some key notations and definitions from Riemannian geometry (see [23, 30] for details). A Riemannian manifold (M, g) is a real smooth manifold M equipped with a Riemannian metric g. Let ⟨w_1, w_2⟩_x = g_x(w_1, w_2) denote the inner product of w_1, w_2 ∈ T_xM; the norm of w ∈ T_xM is defined as ‖w‖_x = √(g_x(w, w)), where the metric g induces an inner product structure on each tangent space T_xM associated with every x ∈ M. A geodesic is a constant-speed curve γ: [0, 1] → M that is locally distance minimizing. Let y ∈ M and w ∈ T_xM; an exponential map y = Exp_x(w): T_xM → M maps w to y on M such that there is a geodesic γ with γ(0) = x, γ(1) = y and γ̇(0) = w. If there is a unique geodesic between any two points in X ⊂ M, the exponential map has an inverse Exp^{−1}_x: X → T_xM, i.e., w = Exp^{−1}_x(y), and the geodesic is the unique shortest path, with ‖Exp^{−1}_x(y)‖_x = ‖Exp^{−1}_y(x)‖_y = d(x, y), where d(x, y) is the geodesic distance between x, y ∈ X. 
Parallel transport Γ^y_x: T_xM → T_yM maps a vector w ∈ T_xM to Γ^y_x w ∈ T_yM, and preserves inner products and norms; that is, ⟨w_1, w_2⟩_x = ⟨Γ^y_x w_1, Γ^y_x w_2⟩_y and ‖w_1‖_x = ‖Γ^y_x w_1‖_y, where w_1, w_2 ∈ T_xM.\nA function f is geodesically convex (G-convex) if for any x, y ∈ X and any geodesic γ with γ(0) = x, γ(1) = y and γ(t) ∈ X for t ∈ [0, 1], we have f(γ(t)) ≤ (1 − t)f(x) + tf(y). An equivalent definition is\n\nf(y) ≥ f(x) + ⟨gradf(x), Exp^{−1}_x(y)⟩_x,\n\nwhere gradf(x) is the Riemannian gradient of f at x. A function f: X → R is geodesically µ-strongly convex (µ-strongly G-convex) if for any x, y ∈ X,\n\nf(y) ≥ f(x) + ⟨gradf(x), Exp^{−1}_x(y)⟩_x + (µ/2)‖Exp^{−1}_x(y)‖^2_x.\n\nA differentiable function f is geodesically L-smooth (G-L-smooth) if its gradient is L-Lipschitz, i.e.,\n\nf(y) ≤ f(x) + ⟨gradf(x), Exp^{−1}_x(y)⟩_x + (L/2)‖Exp^{−1}_x(y)‖^2_x.\n\n3 An Accelerated Method for Geodesically Convex Optimization\n\nIn this section, we propose a general acceleration method for geodesically convex optimization, which can be viewed as a generalization of the famous Nesterov accelerated method from Euclidean space to Riemannian space. Nesterov's accelerated method involves a linear extrapolation step as in (2), whereas in nonlinear Riemannian spaces there is no simple analogue of such a linear extrapolation, so some standard analysis techniques do not work in nonlinear space. 
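A concrete instance of parallel transport, again on the SPD manifold of Section 5: along the geodesic from X to Y it admits the known closed form Γ^Y_X(w) = E w Eᵀ with E = (Y X^{−1})^{1/2} (a standard formula stated here as an illustration, not derived in the paper). The sketch below checks numerically that inner products are preserved, as required by the definition above:

```python
import numpy as np

def sym_fun(A, f):
    # apply a scalar function to a symmetric matrix via its eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def spd_transport(X, Y, w):
    # parallel transport of w in T_X M to T_Y M along the geodesic, with
    # E = (Y X^{-1})^{1/2} computed stably as X^{1/2}(X^{-1/2} Y X^{-1/2})^{1/2} X^{-1/2}
    Xh = sym_fun(X, np.sqrt)
    Xmh = sym_fun(X, lambda v: 1.0 / np.sqrt(v))
    E = Xh @ sym_fun(Xmh @ Y @ Xmh, np.sqrt) @ Xmh
    return E @ w @ E.T

def spd_inner(X, a, b):
    # affine-invariant inner product <a, b>_X = tr(a X^{-1} b X^{-1})
    Xi = np.linalg.inv(X)
    return np.trace(a @ Xi @ b @ Xi)
```

One can verify algebraically that E X Eᵀ = Y, from which metric preservation follows: ⟨Γw_1, Γw_2⟩_Y = ⟨w_1, w_2⟩_X.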
Motivated by this, we derive two equations that bridge this gap for the geodesically strongly convex and geodesically convex cases, and then propose generalized Nesterov algorithms for geodesically convex optimization by solving these two equations.\nWe first propose to replace the classical Nesterov scheme in (2) with the following update rules for geodesically convex optimization in Riemannian space:\n\nx_k = Exp_{y_{k−1}}(−η gradf(y_{k−1})),\ny_k = S(y_{k−1}, x_k, x_{k−1}),   (3)\n\nwhere y_k, x_k ∈ X, S denotes a nonlinear operator, and y_k = S(y_{k−1}, x_k, x_{k−1}) is obtained by solving one of the two proposed equations ((4) and (5) below, which also yield the key tools for our convergence analysis) for the strongly G-convex and general G-convex cases, respectively. Different from the Riemannian gradient descent rule (e.g., x_{k+1} = Exp_{x_k}(−η gradf(x_k))), the Nesterov acceleration technique enters through the update rule of y_k. Compared with the Nesterov scheme in (2), the main difference is the update rule of y_k: ours is an implicit iteration, as shown below, while that of (2) is explicit.\n\nFigure 1: Illustration of the geometric interpretation of Equations (4) and (5).\n\nAlgorithm 1 Accelerated method for strongly G-convex optimization\nInput: µ, L\nInitialize: x_0, y_0, η.\n1: for k = 1, 2, . . . , K do\n2:   Compute the gradient at y_{k−1}: g_{k−1} = gradf(y_{k−1});\n3:   x_k = Exp_{y_{k−1}}(−η g_{k−1});\n4:   y_k = S(y_{k−1}, x_k, x_{k−1}) by solving (4).\n5: end for\nOutput: x_K\n\n3.1 Geodesically Strongly Convex Cases\n\nWe first design the following equation with respect to y_k ∈ X for the µ-strongly G-convex case:\n\n(1 − √(µ/L)) Γ^{y_{k−1}}_{y_k} Exp^{−1}_{y_k}(x_k) − β Γ^{y_{k−1}}_{y_k} gradf(y_k) = (1 − √(µ/L))^{3/2} Exp^{−1}_{y_{k−1}}(x_{k−1}),   (4)\n\nwhere β = 4√(µL^{−1})/L > 0. Figure 1(a) illustrates the geometric interpretation of the proposed equation (4) for the strongly G-convex case, where u_k = (1−√(µ/L))Exp^{−1}_{y_k}(x_k), v_k = −β gradf(y_k), and w_{k−1} = (1−√(µ/L))^{3/2}Exp^{−1}_{y_{k−1}}(x_{k−1}). The vectors u_k, v_k ∈ T_{y_k}M are parallel transported to T_{y_{k−1}}M, and the sum of their parallel translations equals w_{k−1} ∈ T_{y_{k−1}}M, which is exactly equation (4). We design an accelerated first-order algorithm for solving geodesically strongly convex problems, as shown in Algorithm 1. In real applications, the proposed equation (4) can be manipulated into simpler forms. 
For example, we give a specific equation for the problem of averaging real symmetric positive definite matrices below.\n\n3.2 Geodesically Convex Cases\n\nLet f be G-convex and G-L-smooth, and let the diameter of X be bounded by D (i.e., max_{x,y∈X} d(x, y) ≤ D). The variable y_k ∈ X can be obtained by solving the following equation:\n\nΓ^{y_{k−1}}_{y_k} [ (k/(α−1)) Exp^{−1}_{y_k}(x_k) − D ĝ_k ] = ((k−1)/(α−1)) Exp^{−1}_{y_{k−1}}(x_{k−1}) − D ĝ_{k−1} + ((k+α−2)η/(α−1)) g_{k−1},   (5)\n\nwhere g_{k−1} = gradf(y_{k−1}), ĝ_k = g_k/‖g_k‖_{y_k}, and α ≥ 3 is a given constant. Figure 1(b) illustrates the geometric interpretation of the proposed equation (5) for the G-convex case, where u_k = (k/(α−1))Exp^{−1}_{y_k}(x_k) − D ĝ_k and v_{k−1} = ((k+α−2)η/(α−1)) g_{k−1}. We also present an accelerated first-order algorithm for solving geodesically convex problems, as shown in Algorithm 2.\n\nAlgorithm 2 Accelerated method for general G-convex optimization\nInput: L, D, α\nInitialize: x_0, y_0, η.\n1: for k = 1, 2, . . . , K do\n2:   Compute the gradient at y_{k−1}: g_{k−1} = gradf(y_{k−1}) and ĝ_{k−1} = g_{k−1}/‖g_{k−1}‖_{y_{k−1}};\n3:   x_k = Exp_{y_{k−1}}(−η g_{k−1});\n4:   y_k = S(y_{k−1}, x_k, x_{k−1}) by solving (5).\n5: end for\nOutput: x_K\n\n3.3 Key Lemmas\n\nFor the Nesterov accelerated scheme in (2) with τ_k = (k−1)/(k+2) (i.e., the general convex case) in Euclidean space, the following result from [3, 20] plays a key role in the convergence analysis:\n\n(2/(k+2))⟨∇f(y_k), z_k − x⋆⟩ − (η/2)‖∇f(y_k)‖^2 = (2/(η(k+2)^2)) [ ‖z_k − x⋆‖^2 − ‖z_{k+1} − x⋆‖^2 ],   (6)\n\nwhere z_k = ((k+2)/2) y_k − (k/2) x_k. Correspondingly, using the proposed equations (4) and (5) we obtain the following analysis tools for our convergence analysis; in other words, equations (7) and (8) below can be viewed as the Riemannian-space counterparts of (6).\nLemma 1 (Strongly G-convex). Let f: X → R be geodesically µ-strongly convex and G-L-smooth, let {y_k} satisfy equation (4), and define\n\nz_k = (1 − √(µ/L)) Exp^{−1}_{y_k}(x_k) ∈ T_{y_k}M.\n\nThen the following results hold:\n\nΓ^{y_{k−1}}_{y_k}(z_k − β gradf(y_k)) = (1 − √(µ/L))^{1/2} z_{k−1},\n\n−⟨gradf(y_k), z_k⟩_{y_k} + (β/2)‖gradf(y_k)‖^2_{y_k} = (1/(2β))(1 − √(µ/L))‖z_{k−1}‖^2_{y_{k−1}} − (1/(2β))‖z_k‖^2_{y_k}.   (7)\n\nFor general G-convex objectives, we have the following result.\nLemma 2 (General G-convex). 
If f: X → R is G-convex and G-L-smooth, the diameter of X is bounded by D, and {y_k} satisfies equation (5), define\n\nz_k = (k/(α−1)) Exp^{−1}_{y_k}(x_k) − D ĝ_k ∈ T_{y_k}M.\n\nThen the following results hold:\n\nΓ^{y_k}_{y_{k+1}} z_{k+1} = z_k + ((k+α−1)η/(α−1)) gradf(y_k),\n\n((α−1)/(k+α−1))⟨gradf(y_k), −z_k⟩_{y_k} − (η/2)‖gradf(y_k)‖^2_{y_k} = ((α−1)^2/(2η(k+α−1)^2)) ( ‖z_k‖^2_{y_k} − ‖z_{k+1}‖^2_{y_{k+1}} ).   (8)\n\nThe proofs of Lemmas 1 and 2 are provided in the Supplementary Materials.\n\n4 Convergence Analysis\n\nIn this section, we analyze the global convergence properties of the proposed algorithms (Algorithms 1 and 2) for both geodesically strongly convex and general convex problems.\nLemma 3. If f: X → R is G-convex and G-L-smooth for any x ∈ X, and {x_k} is the sequence produced by Algorithm 1 or 2 with η ≤ 1/L, then the following result holds:\n\nf(x_{k+1}) ≤ f(x) + ⟨gradf(y_k), −Exp^{−1}_{y_k}(x)⟩_{y_k} − (η/2)‖gradf(y_k)‖^2_{y_k}.\n\nThe proof of this lemma can be found in the Supplementary Materials. For the geodesically strongly convex case, we have the following result.\nTheorem 1 (Strongly G-convex). Let x⋆ be the optimal solution of Problem (1), and let {x_k} be the sequence produced by Algorithm 1. 
If f: X → R is geodesically µ-strongly convex and G-L-smooth, then\n\nf(x_{k+1}) − f(x⋆) ≤ (1 − √(µ/L))^k [ f(x_0) − f(x⋆) + (1/(2β))(1 − √(µ/L))‖z_0‖^2_{y_0} ],\n\nwhere z_0 is defined in Lemma 1.\n\nTable 1: Comparison of convergence rates for geodesically convex optimization algorithms.\n\nAlgorithms                      | RGD [31]                      | RSGD [31]  | Ours\nStrongly G-convex and smooth    | O((1 − min{1/c, µ/L})^k)      | O(c/(c+k)) | O((1 − √(µ/L))^k)\nGeneral G-convex and smooth     | O(1/k)                        | O(1/√k)    | O(1/k^2)\n\nThe proof of Theorem 1 can be found in the Supplementary Materials. From this theorem we see that the proposed algorithm attains a linear convergence rate of O((1−√(µ/L))^k) for geodesically strongly convex problems, which matches that of its Euclidean-space counterparts and is significantly faster than non-accelerated algorithms such as [31] (i.e., O((1−µ/L)^k)), as shown in Table 1. For the geodesically non-strongly convex case, we have the following result.\nTheorem 2 (General G-convex). Let {x_k} be the sequence produced by Algorithm 2. If f: X → R is G-convex and G-L-smooth, and the diameter of X is bounded by D, then\n\nf(x_{k+1}) − f(x⋆) ≤ ((α − 1)^2/(2η(k + α − 2)^2)) ‖z_0‖^2_{y_0},\n\nwhere z_0 = −D ĝ_0, as defined in Lemma 2.\n\nThe proof of Theorem 2 can be found in the Supplementary Materials. 
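To make the gap between the O((1−µ/L)^k) and O((1−√(µ/L))^k) rates in Table 1 concrete, a back-of-the-envelope count of the iterations needed to drive the contraction factor below 10^{-6} at condition number L/µ = 10^4:

```python
import math

mu, L, eps = 1.0, 1e4, 1e-6
# smallest k with (1 - mu/L)^k <= eps (non-accelerated rate)
k_rgd = math.ceil(math.log(eps) / math.log(1 - mu / L))
# smallest k with (1 - sqrt(mu/L))^k <= eps (accelerated rate)
k_acc = math.ceil(math.log(eps) / math.log(1 - math.sqrt(mu / L)))
```

The non-accelerated rate needs on the order of a hundred thousand iterations, the accelerated one on the order of a thousand, roughly a √(L/µ) = 100-fold reduction.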
Theorem 2 shows that for general G-convex objectives, our acceleration method improves the theoretical convergence rate from O(1/k) (e.g., RGD [31]) to O(1/k^2), which matches the optimal rate for general convex settings in Euclidean space. Please see the details in Table 1, where the parameter c is defined in [31].\n\n5 Application to Matrix Karcher Mean Problems\n\nIn this section, we give a specific accelerated scheme for a type of conic geometric optimization problem [25], namely the matrix Karcher mean problem. Specifically, the loss function of the Karcher mean problem for a set of N symmetric positive definite (SPD) matrices {W_i}_{i=1}^N is defined as\n\nf(X) := (1/(2N)) Σ_{i=1}^N ‖log(X^{−1/2} W_i X^{−1/2})‖^2_F,   (9)\n\nwhere X ∈ P := {Z ∈ R^{d×d} : Z = Z^T ≻ 0}. The loss function f is known to be non-convex in Euclidean space but geodesically 2N-strongly convex. The inner product of two tangent vectors at a point X on the manifold is given by\n\n⟨ζ, ξ⟩_X = tr(ζX^{−1}ξX^{−1}), ζ, ξ ∈ T_XP,   (10)\n\nwhere tr(·) is the trace of a real square matrix. For any matrices X, Y ∈ P, the Riemannian distance is defined as\n\nd(X, Y) = ‖log(X^{−1/2} Y X^{−1/2})‖_F.\n\n5.1 Computation of Y_k\n\nFor the accelerated update rules in (3) for Algorithm 1, we need to compute Y_k by solving equation (4). For the specific problem in (9) with the inner product in (10), however, we can derive a simpler way to solve for Y_k. We first give the following properties:\nProperty 1. For the loss function f in (9) with the inner product in (10), we have\n1. Exp^{−1}_{Y_k}(X_k) = Y_k^{1/2} log(Y_k^{−1/2} X_k Y_k^{−1/2}) Y_k^{1/2};\n2. gradf(Y_k) = (1/N) Σ_{i=1}^N Y_k^{1/2} log(Y_k^{1/2} W_i^{−1} Y_k^{1/2}) Y_k^{1/2};\n3. ⟨gradf(Y_k), Exp^{−1}_{Y_k}(X_k)⟩_{Y_k} = ⟨U, V⟩;\n4. ‖gradf(Y_k)‖^2_{Y_k} = ‖U‖^2_F,\nwhere U = (1/N) Σ_{i=1}^N log(Y_k^{1/2} W_i^{−1} Y_k^{1/2}) ∈ R^{d×d}, and V = log(Y_k^{−1/2} X_k Y_k^{−1/2}) ∈ R^{d×d}.\nProof. Here we only prove Result 1 of Property 1; the proofs of the other results are provided in the Supplementary Materials. The inner product (10) on the Riemannian manifold leads to the following exponential map:\n\nExp_X(ξ_X) = X^{1/2} exp(X^{−1/2} ξ_X X^{−1/2}) X^{1/2},   (11)\n\nwhere ξ_X ∈ T_XP denotes a tangent vector with this geometry, and tangent vectors ξ_X are expressed as (see [17] for details)\n\nξ_X = X^{1/2} sym(∆) X^{1/2}, ∆ ∈ R^{d×d},\n\nwhere sym(·) extracts the symmetric part of its argument, that is, sym(A) = (A^T + A)/2. We can thus set Exp^{−1}_{Y_k}(X_k) = Y_k^{1/2} sym(∆_{X_k}) Y_k^{1/2} ∈ T_{Y_k}P. By the definition of Exp^{−1}_{Y_k}(X_k), we have Exp_{Y_k}(Exp^{−1}_{Y_k}(X_k)) = X_k, that is,\n\nExp_{Y_k}(Y_k^{1/2} sym(∆_{X_k}) Y_k^{1/2}) = X_k.   (12)\n\nUsing (11) and (12), we have\n\nsym(∆_{X_k}) = log(Y_k^{−1/2} X_k Y_k^{−1/2}) ∈ R^{d×d}.\n\nTherefore,\n\nExp^{−1}_{Y_k}(X_k) = Y_k^{1/2} sym(∆_{X_k}) Y_k^{1/2} = Y_k^{1/2} log(Y_k^{−1/2} X_k Y_k^{−1/2}) Y_k^{1/2} = −Y_k log(X_k^{−1} Y_k),\n\nwhere the last equality holds because log(X^{−1}YX) = X^{−1} log(Y) X.\n\nResult 3 in Property 1 shows that the inner product of two tangent vectors at Y_k equals the Euclidean inner product of the two matrices U, V ∈ R^{d×d}. 
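The loss (9) and the gradient formula of Property 1 can be sketched numerically as follows. Helper names are ours; as a sanity check, the gradient is compared against a finite-difference directional derivative of f along the exponential map (11):

```python
import numpy as np

def sym_fun(A, f):
    # apply a scalar function to a symmetric matrix via its eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def spd_exp(X, xi):
    # Exp_X(xi) = X^{1/2} exp(X^{-1/2} xi X^{-1/2}) X^{1/2}, as in (11)
    Xh = sym_fun(X, np.sqrt)
    Xmh = sym_fun(X, lambda v: 1.0 / np.sqrt(v))
    return Xh @ sym_fun(Xmh @ xi @ Xmh, np.exp) @ Xh

def karcher_loss(X, Ws):
    # f(X) = 1/(2N) sum_i ||log(X^{-1/2} W_i X^{-1/2})||_F^2, as in (9)
    Xmh = sym_fun(X, lambda v: 1.0 / np.sqrt(v))
    return sum(np.linalg.norm(sym_fun(Xmh @ W @ Xmh, np.log), 'fro') ** 2
               for W in Ws) / (2 * len(Ws))

def karcher_grad(Y, Ws):
    # gradf(Y) = Y^{1/2} U Y^{1/2} with
    # U = (1/N) sum_i log(Y^{1/2} W_i^{-1} Y^{1/2})  (Property 1, Result 2)
    Yh = sym_fun(Y, np.sqrt)
    U = sum(sym_fun(Yh @ np.linalg.inv(W) @ Yh, np.log) for W in Ws) / len(Ws)
    return Yh @ U @ Yh
```

Two consequences worth checking: the gradient vanishes at the Karcher mean of a single matrix (its own mean), and ⟨gradf(Y), ξ⟩_Y matches d/dt f(Exp_Y(tξ)) at t = 0.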
Thus, we can reformulate (4) as follows:\n\n(1 − √(µ/L)) log(Y_k^{−1/2} X_k Y_k^{−1/2}) − (β/N) Σ_{i=1}^N log(Y_k^{1/2} W_i^{−1} Y_k^{1/2}) = (1 − √(µ/L))^{3/2} log(Y_{k−1}^{−1/2} X_{k−1} Y_{k−1}^{−1/2}),   (13)\n\nwhere β = 4√(µL^{−1})/L. Then Y_k can be obtained by solving (13). From a numerical perspective, log(Y_k^{1/2} W_i^{−1} Y_k^{1/2}) can be approximated by log(Y_{k−1}^{1/2} W_i^{−1} Y_{k−1}^{1/2}), and then Y_k is given by\n\nY_k = X_k^{1/2} exp^{−1}[ (1 − √(µ/L))^{1/2} log(Y_{k−1}^{−1/2} X_{k−1} Y_{k−1}^{−1/2}) + (δβ/N) Σ_{i=1}^N log(Y_{k−1}^{1/2} W_i^{−1} Y_{k−1}^{1/2}) ] X_k^{1/2},   (14)\n\nwhere δ = 1/(1 − √(µ/L)), and Y_k ∈ P.\n\n6 Experiments\n\nIn this section, we validate the performance of our accelerated method for averaging SPD matrices under the Riemannian metric, i.e., the matrix Karcher mean problem (9), and compare it against two state-of-the-art methods: Riemannian gradient descent (RGD) [31] and limited-memory Riemannian BFGS (LRBFGS) [29]. The matrix Karcher mean problem has been widely applied in areas such as elasticity [18], radar signal and image processing [6, 15, 22], and medical imaging [9, 7, 13]. This problem is geodesically strongly convex, but non-convex in Euclidean space.\nOther methods for solving this problem include the relaxed Richardson iteration algorithm [10], the approximated joint diagonalization algorithm [12], and Riemannian stochastic gradient descent (RSGD) [31]. 
Since all three of these methods achieve performance similar to RGD, especially in data science applications where N is large and a relatively small optimization error suffices [31], we only report the experimental results of RGD. The step-size η of both RGD and LRBFGS is selected with a line search as in [29] (see [29] for details), while η of our accelerated method is set to 1/L. For all algorithms, we initialize X with the arithmetic mean of the data set, as in [29].\n\nFigure 2: Comparison of RGD, LRBFGS and our accelerated method for solving geodesically strongly convex Karcher mean problems on data sets with d = 100 (first row) and d = 200 (second row). The vertical axis shows the distance in log scale, and the horizontal axis the number of iterations (left) or running time (right).\n\nThe input synthetic data are random SPD matrices of size 100×100 or 200×200 generated with the technique in [29] or the matrix mean toolbox [10], and all matrices are explicitly normalized so that their norms equal 1. We report the experimental results of RGD, LRBFGS and our accelerated method on the two data sets in Figure 2, where N is set to 100 and the condition number of each matrix W_i is set to 10^2. Figure 2 shows the evolution of the distance between the exact Karcher mean X∗ and the current iterate X_k (i.e., dist(X∗, X_k)) with respect to the number of iterations and the running time (seconds). 
We can observe that our method consistently converges much faster than RGD, which empirically verifies the theoretical result in Theorem 1 that our accelerated method has a much faster convergence rate than RGD. Although LRBFGS outperforms our method in terms of the number of iterations, our accelerated method converges much faster than LRBFGS in terms of running time.\n\n7 Conclusions\n\nIn this paper, we proposed a general Nesterov-style accelerated gradient method for nonlinear Riemannian spaces, generalizing the famous Nesterov accelerated method from Euclidean space. We derived two equations and presented two accelerated algorithms for geodesically strongly convex and general G-convex optimization problems, respectively. In particular, our theoretical results show that our accelerated method attains the same convergence rates as the standard Nesterov accelerated method in Euclidean space for both the strongly G-convex and G-convex cases. Finally, we presented a specific iteration scheme for solving matrix Karcher mean problems, which are in essence non-convex in Euclidean space, and the numerical results verify the efficiency of our accelerated method.\nIn future work, we can extend our accelerated method to the stochastic setting using variance reduction techniques [14, 16, 24, 28], and apply it to more geodesically convex problems, e.g., general G-convex problems with a non-smooth regularization term as in [4]. In addition, we can replace the exponential map by computationally cheap retractions, together with corresponding theoretical guarantees [31]. 
An interesting direction for future work is to design accelerated schemes for non-convex optimization in Riemannian space.\n\nAcknowledgments\n\nThis research is supported in part by Grants (CUHK 14206715 & 14222816) from the Hong Kong RGC, the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), and the National Natural Science Foundation of China (No. 61573267).\n\nReferences\n[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, N.J., 2009.\n[2] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In STOC, pages 1200–1205, 2017.\n[3] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k^2. SIAM J. Optim., 26:1824–1834, 2015.\n[4] D. Azagra and J. Ferrera. Inf-convolution and regularization of convex functions on Riemannian manifolds of nonpositive curvature. Rev. Mat. Complut., 2006.\n[5] M. Bacak. Convex analysis and optimization in Hadamard spaces. Walter de Gruyter GmbH & Co KG, 2014.\n[6] F. Barbaresco. New foundation of radar Doppler signal processing based on advanced differential geometry of symmetric spaces: Doppler matrix CFAR radar application. In RADAR, 2009.\n[7] P. G. Batchelor, M. Moakher, D. Atkinson, F. Calamante, and A. Connelly. A rigorous framework for diffusion tensor calculus. Magn. Reson. Med., 53:221–225, 2005.\n[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. 
Imaging Sci., 2(1):183\u2013202, 2009.\n\n[9] R. Bhatia. Positive de\ufb01nite matrices, Princeton Series in Applied Mathematics. Princeton\n\nUniversity Press, Princeton, N.J., 2007.\n\n[10] D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive de\ufb01nite\n\nmatrices. Linear Algebra Appl., 438:1700\u20131710, 2013.\n\n[11] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming.\n\nOptim. Eng., 8:67\u2013127, 2007.\n\n[12] M. Congedo, B. Afsari, A. Barachant, and M. Moakher. Approximate joint diagonalization and\n\ngeometric mean of symmetric positive de\ufb01nite matrices. PloS one, 10:e0121423, 2015.\n\n[13] P. T. Fletcher and S. Joshi. Riemannian geometry for the statistical analysis of diffusion tensor\n\ndata. Signal Process., 87:250\u2013262, 2007.\n\n[14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance\n\nreduction. In NIPS, pages 315\u2013323, 2013.\n\n[15] J. Lapuyade-Lahorgue and F. Barbaresco. Radar detection using Siegel distance between\n\nautoregressive processes, application to HF and X-band radar. In RADAR, 2008.\n\n[16] Y. Liu, F. Shang, and J. Cheng. Accelerated variance reduced stochastic ADMM. In AAAI,\n\npages 2287\u20132293, 2017.\n\n[17] G. Meyer, S. Bonnabel, and R. Sepulchre. Regression on \ufb01xed-rank positive semide\ufb01nite\n\nmatrices: A Riemannian approach. J. Mach. Learn. Res., 12:593\u2013625, 2011.\n\n[18] M. Moakher. On the averaging of symmetric positive-de\ufb01nite tensors. J. Elasticity, 82:273\u2013296,\n\n2006.\n\n[19] Y. Nesterov. A method of solving a convex programming problem with convergence rate\n\nO(1/k2). Soviet Mathematics Doklady, 27:372\u2013376, 1983.\n\n9\n\n\f[20] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic\n\nPubl., Boston, 2004.\n\n[21] Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140:125\u2013\n\n161, 2013.\n\n[22] X. 
Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. Interna-\n\ntional Journal of Computer Vision, 66:41\u201366, 2006.\n\n[23] P. Petersen. Riemannian Geometry. Springer-Verlag, New York, 2016.\n\n[24] F. Shang. Larger is better: The effect of learning rates enjoyed by stochastic optimization with\n\nprogressive variance reduction. arXiv:1704.04966, 2017.\n\n[25] S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive de\ufb01nite\n\nmatrices. SIAM J. Optim., 25(1):713\u2013739, 2015.\n\n[26] W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov\u2019s accelerated\n\ngradient method: Theory and insights. J. Mach. Learn. Res., 17:1\u201343, 2016.\n\n[27] P. Tseng. On aacelerated proximal gradient methods for convex-concave optimization. 2008.\n\n[28] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance\n\nreduction. SIAM J. Optim., 24(4):2057\u20132075, 2014.\n\n[29] X. Yuan, W. Huang, P.-A. Absil, and K. Gallivan. A Riemannian limited-memory BFGS\nalgorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147\u2013\n2157, 2016.\n\n[30] H. Zhang, S. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian\n\nmanifolds. In NIPS, pages 4592\u20134600, 2016.\n\n[31] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. 
In COLT, pages\n\n1617\u20131638, 2016.\n\n10\n\n\f", "award": [], "sourceid": 2518, "authors": [{"given_name": "Yuanyuan", "family_name": "Liu", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Fanhua", "family_name": "Shang", "institution": "The Chinese University of Hong Kong"}, {"given_name": "James", "family_name": "Cheng", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Hong", "family_name": "Cheng", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Licheng", "family_name": "Jiao", "institution": "Xidian University"}]}