{"title": "Communication Efficient Parallel Algorithms for Optimization on Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 3574, "page_last": 3584, "abstract": "The last decade has witnessed an explosion in the development of models, theory and computational algorithms for ``big data'' analysis. In particular, distributed inference has served as a natural and dominating paradigm for statistical inference. However, the existing literature on parallel inference almost exclusively focuses on Euclidean data and parameters. While this assumption is valid for many applications, it is increasingly more common to encounter problems where the data or the parameters lie on a non-Euclidean space, like a manifold for example. Our work aims to fill a critical gap in the literature by generalizing parallel inference algorithms to optimization on manifolds. We show that our proposed algorithm is both communication efficient and carries theoretical convergence guarantees. In addition, we demonstrate the performance of our algorithm to the estimation of Fr\\'echet means on simulated spherical data and the low-rank matrix completion problem over Grassmann manifolds applied to the Netflix prize data set.", "full_text": "Communication Efficient Parallel Algorithms for Optimization on Manifolds\n\nBayan Saparbayeva\n\nDepartment of Applied and Computational Mathematics and Statistics\n\nUniversity of Notre Dame\n\nNotre Dame, Indiana 46556, USA\n\nbsaparba@nd.edu\n\nMichael Minyi Zhang\n\nDepartment of Computer Science\n\nPrinceton University\n\nPrinceton, New Jersey 08540, USA\n\nmz8@cs.princeton.edu\n\nLizhen Lin\n\nDepartment of Applied and Computational Mathematics and Statistics\n\nUniversity of Notre Dame\n\nNotre Dame, Indiana 46556, USA\n\nlizhen.lin@nd.edu\n\nAbstract\n\nThe last decade has witnessed an explosion in the development of models, theory\nand computational algorithms for \u201cbig data\u201d analysis. 
In particular, distributed\ncomputing has served as a natural and dominating paradigm for statistical inference.\nHowever, the existing literature on parallel inference almost exclusively focuses\non Euclidean data and parameters. While this assumption is valid for many applications,\nit is increasingly more common to encounter problems where the data\nor the parameters lie on a non-Euclidean space, like a manifold for example. Our\nwork aims to fill a critical gap in the literature by generalizing parallel inference\nalgorithms to optimization on manifolds. We show that our proposed algorithm is\nboth communication efficient and carries theoretical convergence guarantees. In\naddition, we demonstrate the performance of our algorithm on the estimation of\nFr\u00e9chet means on simulated spherical data and the low-rank matrix completion\nproblem over Grassmann manifolds applied to the Netflix prize data set.\n\n1 Introduction\n\nA natural representation for many statistical and machine learning problems is to assume the parameter\nof interest lies on a more general space than the Euclidean space. Typical examples of this situation\ninclude diffusion matrices in large scale diffusion tensor imaging (DTI) which are 3 \u00d7 3 positive\ndefinite matrices, now commonly used in neuroimaging for clinical trials [1]. In computer vision,\nimages are often preprocessed or reduced to a collection of subspaces [11, 27], or a digital image\ncan be represented by a set of k-landmarks, forming landmark based shapes [13]. One may also\nencounter data that are stored as orthonormal frames [8], surfaces [15], curves [16], and networks [14].\nIn addition, parallel inference has become popular in overcoming the computational burden arising\nfrom the storage, processing and computation of big data, resulting in a vast literature in statistics and\nmachine learning dedicated to this topic. 
The general scheme in the frequentist setting is to divide the\ndata into subsets, obtain estimates from each subset which are combined to form an ultimate estimate\nfor inference [9, 30, 17]. In the Bayesian setting, the subset posterior distributions are first obtained\nin the dividing step, and these subset posterior measures or the MCMC samples from each subset\nposterior are then combined for final inference [20, 29, 28, 21, 25, 22]. Most of these methods are\n\u201cembarrassingly parallel\u201d and often do not require communication across different machines or\nsubsets. Some communication efficient algorithms have also been proposed with prominent methods\nincluding [12] and [26].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nDespite tremendous advancement in parallel inference, previous work largely focuses only on\nEuclidean data and parameter spaces. To better address challenges arising from inference of big\nnon-Euclidean data or data with non-Euclidean parameters, there is a crucial need for developing\nvalid and efficient inference methods including parallel or distributed inference and algorithms that\ncan appropriately incorporate the underlying geometric structure.\nFor a majority of applications, the parameter spaces fall into the general category of manifolds, whose\ngeometry is well-characterized. Although there is a recent literature on inference of manifold-valued\ndata including methods based on Fr\u00e9chet means or model based methods [3, 4, 5, 2, 18] and even\nscalable methods for certain models [23, 19, 24], there is still a vital lack of general parallel algorithms\non manifolds. We aim to fill this critical gap by introducing our parallel inference strategy. 
The\nnovelty of our paper is that our algorithm is generalizable to a wide range of loss functions for manifold\noptimization problems and that we can parallelize the algorithm by splitting the data across processors.\nFurthermore, our theoretical development does not rely on previous results. In fact, generalizing\nTheorem 1 to the manifold setting requires machinery entirely different from that of previous work.\nNotably, our parallel optimization algorithm has several key features:\n\n(1) Our parallel algorithm efficiently exploits the geometric information of the data or parameters.\n(2) The algorithm minimizes expensive inter-processor communication.\n(3) The algorithm has theoretical guarantees in approximating the true optimizer, characterized in terms of convergence rates.\n(4) The algorithm has outstanding practical performance in simulation studies and real data examples.\n\nOur paper is organized as follows: In Section 2 we introduce related work to the topic of parallel\ninference. Next we present our proposed parallel optimization framework in Section 3 and present\ntheoretical convergence results for our parallel algorithm in Section 4. In Section 5, we consider a\nsimulation study of estimating the Fr\u00e9chet means on the spheres and a real data example using the\nNetflix prize data set. The paper ends with a conclusion and discussion of future work in Section 6.\n\n2 Related work\n\nIn the typical \u201cbig data\u201d scenario, it is usually the case that the entire data set cannot fit onto one\nmachine. Hence, parallel inference algorithms with provably good theoretic convergence properties\nare crucial for this situation. In such a setting, we assume that we have $N = mn$ identically distributed\nobservations $\\{x_{ij} : i = 1,\\dots,n,\\ j = 1,\\dots,m\\}$, which are divided i.i.d. into $m$ subsets $X_j = \\{x_{ij},\\ i = 1,\\dots,n\\}$, $j = 1,\\dots,m$, and stored in $m$ separate machines. 
While it is important to consider inference\nproblems when the data are not i.i.d. distributed across processors, we will only consider the i.i.d.\nsetting as a simplifying assumption for the theory.\nFor a loss function $\\mathcal{L} : \\Theta \\times D \\to \\mathbb{R}$, each machine $j$ has access to a local loss function, $\\mathcal{L}_j(\\theta) = \\frac{1}{n}\\sum_{i=1}^{n} \\mathcal{L}(\\theta, x_{ij})$, where $D$ is the data space. Then, the local loss functions are combined into a\nglobal loss function $\\mathcal{L}_N(\\theta) = \\frac{1}{m}\\sum_{j=1}^{m} \\mathcal{L}_j(\\theta)$. For our intended optimization routine, we are actually\nlooking for the minimizer of an expected loss function $\\mathcal{L}^*(\\theta) = \\mathbb{E}_{x \\in D}\\, \\mathcal{L}(\\theta, x)$. In the parallel setting,\nwe cannot investigate $\\mathcal{L}^*$ directly and we may only analyze it through $\\mathcal{L}_N$. However, calculating\nthe total loss function directly and exactly requires excessive inter-processor communication, which\ncarries a huge computational burden as the number of processors increases. Thus, we must approximate\nthe true parameter $\\theta^* = \\arg\\min_{\\theta \\in \\Theta} \\mathcal{L}^*(\\theta)$ by an empirical risk minimizer $\\hat\\theta = \\arg\\min_{\\theta \\in \\Theta} \\mathcal{L}_N(\\theta)$.\nIn this work, we focus on generalizing a particular parallel inference framework, the Iterative Local\nEstimation Algorithm (ILEA) [12], to manifolds. This algorithm optimizes an approximate, surrogate\nloss function instead of the global loss function as a way to avoid processor communication. 
The\nidea of the surrogate function starts from the Taylor series expansion of $\\mathcal{L}_N$\n\n$$\\mathcal{L}_N\\big(\\bar\\theta + t(\\theta - \\bar\\theta)\\big) = \\mathcal{L}_N(\\bar\\theta) + t\\langle \\nabla\\mathcal{L}_N(\\bar\\theta), \\theta - \\bar\\theta \\rangle + \\sum_{s=2}^{\\infty} \\frac{t^s}{s!} \\nabla^s \\mathcal{L}_N(\\bar\\theta)(\\theta - \\bar\\theta)^{\\otimes s}.$$\n\nThe global high-order derivatives $\\nabla^s \\mathcal{L}_N(\\bar\\theta)$ ($s \\ge 2$) are replaced by local high-order derivatives\n$\\nabla^s \\mathcal{L}_1(\\bar\\theta)$ ($s \\ge 2$) from the first machine\n\n$$\\tilde{\\mathcal{L}}(\\theta) = \\mathcal{L}_N(\\bar\\theta) + \\langle \\nabla\\mathcal{L}_N(\\bar\\theta), \\theta - \\bar\\theta \\rangle + \\sum_{s=2}^{\\infty} \\frac{1}{s!} \\nabla^s \\mathcal{L}_1(\\bar\\theta)(\\theta - \\bar\\theta)^{\\otimes s}.$$\n\nSo the approximation error is\n\n$$\\tilde{\\mathcal{L}}(\\theta) - \\mathcal{L}_N(\\theta) = \\sum_{s=2}^{\\infty} \\frac{1}{s!} \\big(\\nabla^s \\mathcal{L}_1(\\bar\\theta) - \\nabla^s \\mathcal{L}_N(\\bar\\theta)\\big)(\\theta - \\bar\\theta)^{\\otimes s} = \\frac{1}{2}\\Big\\langle \\theta - \\bar\\theta, \\big(\\nabla^2 \\mathcal{L}_1(\\bar\\theta) - \\nabla^2 \\mathcal{L}_N(\\bar\\theta)\\big)(\\theta - \\bar\\theta) \\Big\\rangle + O\\big(\\|\\theta - \\bar\\theta\\|^3\\big) = O\\Big(\\frac{1}{\\sqrt{n}} \\|\\theta - \\bar\\theta\\|^2 + \\|\\theta - \\bar\\theta\\|^3\\Big).$$\n\nThe infinite sum $\\sum_{s=2}^{\\infty} \\frac{1}{s!} \\nabla^s \\mathcal{L}_1(\\bar\\theta)(\\theta - \\bar\\theta)^{\\otimes s}$ in $\\tilde{\\mathcal{L}}(\\theta)$ can be replaced by $\\mathcal{L}_1(\\theta) - \\mathcal{L}_1(\\bar\\theta) - \\langle \\nabla\\mathcal{L}_1(\\bar\\theta), \\theta - \\bar\\theta \\rangle$:\n\n$$\\tilde{\\mathcal{L}}(\\theta) = \\mathcal{L}_1(\\theta) - \\big(\\mathcal{L}_1(\\bar\\theta) - \\mathcal{L}_N(\\bar\\theta)\\big) - \\langle \\nabla\\mathcal{L}_1(\\bar\\theta) - \\nabla\\mathcal{L}_N(\\bar\\theta), \\theta - \\bar\\theta \\rangle.$$\n\nWe can omit the additive constant $-\\big(\\mathcal{L}_1(\\bar\\theta) - \\mathcal{L}_N(\\bar\\theta)\\big) + \\langle \\nabla\\mathcal{L}_1(\\bar\\theta) - \\nabla\\mathcal{L}_N(\\bar\\theta), \\bar\\theta \\rangle$, which does not depend on $\\theta$. 
Thus the surrogate loss\nfunction $\\tilde{\\mathcal{L}}(\\theta)$ is defined as\n\n$$\\tilde{\\mathcal{L}}(\\theta) = \\mathcal{L}_1(\\theta) - \\langle \\nabla\\mathcal{L}_1(\\bar\\theta) - \\nabla\\mathcal{L}_N(\\bar\\theta), \\theta \\rangle.$$\n\nThus, the surrogate minimizer $\\tilde\\theta = \\arg\\min_{\\Theta} \\tilde{\\mathcal{L}}$ approximates the empirical risk minimizer $\\hat\\theta$.\n[12] show that the consequent surrogate minimizers have a provably good convergence rate to $\\hat\\theta$ given\nthe following regularity conditions:\n\n1. The parameter space $\\Theta$ is a compact and convex subset of $\\mathbb{R}^d$. Besides, $\\theta^* \\in \\mathrm{int}(\\Theta)$ and $R = \\sup_{\\theta \\in \\Theta} \\|\\theta - \\theta^*\\| > 0$,\n2. The Hessian matrix $I(\\theta) = \\nabla^2 \\mathcal{L}^*(\\theta)$ is invertible at $\\theta^*$, that is there exist constants $(\\mu_-, \\mu_+)$ such that\n\n$$\\mu_- I_d \\preceq I(\\theta^*) \\preceq \\mu_+ I_d,$$\n\n3. For any $\\delta > 0$, there exists $\\varepsilon > 0$, such that\n\n$$\\inf \\mathbb{P}\\Big\\{ \\inf_{\\|\\theta - \\theta^*\\| \\ge \\delta} \\big|\\mathcal{L}(\\theta) - \\mathcal{L}(\\theta^*)\\big| \\ge \\varepsilon \\Big\\} = 1,$$\n\n4. For a ball around the true parameter $U(\\rho) = \\{\\theta : \\|\\theta - \\theta^*\\| \\le \\rho\\}$ there exist constants $(G, L)$ and a function $K(x)$ such that\n\n$$\\mathbb{E}\\|\\nabla\\mathcal{L}(\\theta)\\|^{16} \\le G^{16}, \\qquad \\mathbb{E}\\,|||\\nabla^2\\mathcal{L}(\\theta) - I(\\theta)||| \\le L^{16}, \\qquad |||\\mathcal{L}(\\theta, x) - \\mathcal{L}(\\theta', x)||| \\le K(x)\\,\\|\\theta - \\theta'\\|,$$\n\nfor all $\\theta, \\theta' \\in U(\\rho)$,\n\nwhich leads to the following theorem:\nTheorem 1. 
Suppose that the standard regularity conditions hold and the initial estimator $\\bar\\theta$ lies in the\nneighborhood $U(\\rho)$ of $\\theta^*$. Then the minimizer $\\tilde\\theta$ of the surrogate loss function $\\tilde{\\mathcal{L}}(\\theta)$ satisfies\n\n$$\\|\\tilde\\theta - \\hat\\theta\\| \\le C_2\\big(\\|\\bar\\theta - \\hat\\theta\\| + \\|\\hat\\theta - \\theta^*\\| + |||\\nabla^2\\mathcal{L}_1(\\theta^*) - \\nabla^2\\mathcal{L}_N(\\theta^*)|||\\big)\\,\\|\\bar\\theta - \\hat\\theta\\|,$$\n\nwith probability at least $1 - C_1 m n^{-8}$, where the constants $C_1$ and $C_2$ are independent of $(m, n, N)$.\n\n3 Parallel optimizations on manifolds\n\nOur work aims to generalize the typical gradient descent optimization framework to manifold\noptimization. In particular, we will use the ILEA framework as our working example to generalize\nparallel optimization algorithms. Instead of working with $\\mathbb{R}^d$, we have a $d$-dimensional manifold $M$.\nWe also consider a surrogate loss function $\\tilde{\\mathcal{L}}_j : \\Theta \\times Z \\to \\mathbb{R}$, where $\\Theta$ is a subset of the manifold $M$, that\napproximates the global loss function $\\mathcal{L}_N$. Here we choose to optimize $\\tilde{\\mathcal{L}}_j$ on the $j$th machine; that\nis, on different iterations we optimize on different machines for efficient exploration, unlike the\nprevious algorithm, where the surrogate function is always optimized on the first machine.\nTo generalize the idea of moving along a gradient on the manifold $M$, we use the retraction map,\nwhich is not necessarily the exponential map that one would typically use in manifold gradient\ndescent, but shares several important properties with the exponential map. Namely, a retraction on $M$\nis a smooth mapping $R : TM \\to M$ with the following properties\n\n1. 
$R_\\theta(0_\\theta) = R(\\theta, 0_\\theta) = \\theta$, where $R_\\theta$ is the restriction of $R$ from $TM$ to the point $\\theta$ and the tangent space $T_\\theta M$, and $0_\\theta$ denotes the zero vector on $T_\\theta M$,\n2. $DR_\\theta(0_\\theta) = DR(\\theta, 0_\\theta) = \\mathrm{id}_{T_\\theta M}$, where $\\mathrm{id}_{T_\\theta M}$ denotes the identity mapping on $T_\\theta M$.\n\nWe also demand that\n\n1. For any $\\theta_1, \\theta_2 \\in M$, the curves $R_{\\theta_1}\\big(t R^{-1}_{\\theta_1} \\theta_2\\big)$ and $R_{\\theta_2}\\big(s R^{-1}_{\\theta_2} \\theta_1\\big)$, where $s, t \\in [0,1]$, must coincide,\n2. The triangle inequality holds, that is for any $\\theta_1, \\theta_2, \\theta_3 \\in M$, it is the case that $d_R(\\theta_1, \\theta_2) \\le d_R(\\theta_2, \\theta_3) + d_R(\\theta_3, \\theta_1)$, where $d_R(\\theta_1, \\theta_2)$ is the length of the curve $R_{\\theta_1}\\big(t R^{-1}_{\\theta_1} \\theta_2\\big)$ for $t \\in [0,1]$.\n\nOur construction starts with Taylor\u2019s formula for $\\mathcal{L}_N$ on the manifold $M$\n\n$$\\mathcal{L}_N(\\theta) = \\mathcal{L}_N(\\bar\\theta) + \\langle \\nabla\\mathcal{L}_N(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle + \\sum_{s=2}^{\\infty} \\frac{1}{s!} \\nabla^s \\mathcal{L}_N(\\bar\\theta)(\\log_{\\bar\\theta}\\theta)^{\\otimes s}.$$\n\nBecause we split the data across machines, evaluating the derivatives $\\nabla^s \\mathcal{L}_N(\\bar\\theta)$ requires excessive\nprocessor communication. We want to reduce the amount of communication by replacing the global\nhigh-order derivatives $\\nabla^s \\mathcal{L}_N(\\bar\\theta)$ ($s \\ge 2$) with the high-order local derivatives $\\nabla^s \\mathcal{L}_j(\\bar\\theta)$. 
This gives us\nthe following surrogate to $\\mathcal{L}_N$\n\n$$\\tilde{\\mathcal{L}}_j(\\theta) = \\mathcal{L}_N(\\bar\\theta) + \\langle \\nabla\\mathcal{L}_N(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle + \\sum_{s=2}^{\\infty} \\frac{1}{s!} \\nabla^s \\mathcal{L}_j(\\bar\\theta)(\\log_{\\bar\\theta}\\theta)^{\\otimes s}.$$\n\nThen we have the following approximation error\n\n$$\\tilde{\\mathcal{L}}_j(\\theta) - \\mathcal{L}_N(\\theta) = \\frac{1}{2}\\big\\langle \\log_{\\bar\\theta}\\theta, \\big(\\nabla^2\\mathcal{L}_j(\\bar\\theta) - \\nabla^2\\mathcal{L}_N(\\bar\\theta)\\big)\\log_{\\bar\\theta}\\theta \\big\\rangle + O\\big(d_g(\\bar\\theta, \\theta)^3\\big) = O\\Big(\\frac{1}{\\sqrt{n}}\\, d_g(\\bar\\theta, \\theta)^2 + d_g(\\bar\\theta, \\theta)^3\\Big).$$\n\nWe replace $\\sum_{s=2}^{\\infty} \\frac{1}{s!} \\nabla^s \\mathcal{L}_j(\\bar\\theta)(\\log_{\\bar\\theta}\\theta)^{\\otimes s}$ with $\\mathcal{L}_j(\\theta) - \\mathcal{L}_j(\\bar\\theta) - \\langle \\nabla\\mathcal{L}_j(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle$:\n\n$$\\tilde{\\mathcal{L}}_j(\\theta) = \\mathcal{L}_N(\\bar\\theta) + \\langle \\nabla\\mathcal{L}_N(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle + \\mathcal{L}_j(\\theta) - \\mathcal{L}_j(\\bar\\theta) - \\langle \\nabla\\mathcal{L}_j(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle = \\mathcal{L}_j(\\theta) + \\big(\\mathcal{L}_N(\\bar\\theta) - \\mathcal{L}_j(\\bar\\theta)\\big) + \\langle \\nabla\\mathcal{L}_N(\\bar\\theta) - \\nabla\\mathcal{L}_j(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle.$$\n\nSince we are not interested in the value of $\\tilde{\\mathcal{L}}_j$ but in its minimizer, we omit the additive constant\n$\\big(\\mathcal{L}_N(\\bar\\theta) - \\mathcal{L}_j(\\bar\\theta)\\big)$ and redefine $\\tilde{\\mathcal{L}}_j$ as\n\n$$\\tilde{\\mathcal{L}}_j(\\theta) := \\mathcal{L}_j(\\theta) - \\langle \\nabla\\mathcal{L}_j(\\bar\\theta) - \\nabla\\mathcal{L}_N(\\bar\\theta), \\log_{\\bar\\theta}\\theta \\rangle.$$\n\n
Then we can\ngeneralize the exponential map $\\exp_{\\bar\\theta}$ and the inverse exponential map $\\log_{\\bar\\theta}$ to the retraction map $R_{\\bar\\theta}$\nand the inverse retraction map $R^{-1}_{\\bar\\theta}$, which is also called the lifting, and redefine $\\tilde{\\mathcal{L}}_j$\n\n$$\\tilde{\\mathcal{L}}_j(\\theta) := \\mathcal{L}_j(\\theta) - \\langle \\nabla\\mathcal{L}_j(\\bar\\theta) - \\nabla\\mathcal{L}_N(\\bar\\theta), R^{-1}_{\\bar\\theta}\\theta \\rangle.$$\n\nTherefore we have the following generalization of the Iterative Local Estimation Algorithm (ILEA)\nfor the manifold $M$:\n\nAlgorithm 1: ILEA for Manifolds\nInitialize $\\theta_0 = \\bar\\theta$;\nfor $s = 0, 1, \\dots, T-1$ do\n    Transmit the current iterate $\\theta_s$ to local machines $\\{M_j\\}_{j=1}^{m}$;\n    for $j = 1, \\dots, m$ do\n        Compute the local gradient $\\nabla\\mathcal{L}_j(\\theta_s)$ at machine $M_j$;\n        Transmit the local gradient $\\nabla\\mathcal{L}_j(\\theta_s)$ to machine $M_s$;\n    Calculate the global gradient $\\nabla\\mathcal{L}_N(\\theta_s) = \\frac{1}{m}\\sum_{j=1}^{m} \\nabla\\mathcal{L}_j(\\theta_s)$ in machine $M_s$;\n    Form the surrogate function $\\tilde{\\mathcal{L}}_s(\\theta) = \\mathcal{L}_s(\\theta) - \\langle R^{-1}_{\\theta_s}\\theta, \\nabla\\mathcal{L}_s(\\theta_s) - \\nabla\\mathcal{L}_N(\\theta_s) \\rangle$;\n    Update $\\theta_{s+1} \\in \\arg\\min \\tilde{\\mathcal{L}}_s$;\nReturn $\\theta_T$\n\n4 Convergence rates of the algorithm\n\nTo establish some theoretical convergence rates on our algorithm, we consequently have to impose\nsome regularity conditions on the parameter space $\\Theta$, the loss function $\\mathcal{L}$ and the population\nrisk $\\mathcal{L}^*$. We must establish these conditions specifically for manifolds instead of simply using\nthe regularity conditions placed on Euclidean spaces. 
For example, in the manifold the Hessians\n\u22072L (\u03b8, x),\u22072L (\u03b8(cid:48)\n, x) are de\ufb01ned in different tangent spaces meaning there cannot be any linear\nexpressions of the second-order derivatives.\nIn the manifold for any \u03be \u2208 T\u03b8(cid:48) M we can de\ufb01ne the vector \ufb01eld as \u03be(\u03b8) = D(R\u22121\ntake the covariant derivative of \u03be(\u03b8) along the retraction R\u03b8(cid:48) t R\u03b8(cid:48) \u03b8 :\n\n)\u03be. We can also\n\n\u03b8(cid:48)\n\n\u03b8\n\n\u03be(R\u03b8(cid:48) t R\u03b8(cid:48) \u03b8) =\n\u2207\n\nD(cid:161)R\n\n\u03b8(cid:48) (R\u03b8(cid:48) t R\u03b8(cid:48) \u03b8)(cid:162)\u22121\n\n\u22121\n\nR\u22121\n\u03b8(cid:48) \u03b8\n\n(cid:179)\n\nD\n\nR\u03b8(cid:48) t R\u03b8(cid:48) \u03b8\u03b8(cid:48)(cid:180)\n\nR\u22121\n\n\u03be = \u2207D(t, \u03b8, \u03b8(cid:48)\n\n)\u03be.\n\n(1)\n\nThe expression (1) de\ufb01nes the linear map \u2207D(t, \u03b8, \u03b8(cid:48)\n) from T\u03b8(cid:48) M to TR\u03b8(cid:48) t R\u03b8(cid:48) \u03b8M and want to impose\nsome conditions to this map. Finally, we impose the following regularity conditions on the parameter\nspace \u0398, the loss function L and the population risk L \u2217.\n\nD(cid:161)R\n\n\u03b8(cid:48) (R\u03b8(cid:48) t R\u03b8(cid:48) \u03b8)(cid:162)\u22121\n\n\u22121\n\n\u2207\n\nR\u22121\n\u03b8(cid:48) \u03b8\n\n\u03b81, \u03b82 \u2208 \u0398 curves R\u03b81 t R\u03b81\nalso demand that there exists L\n\n1. The parameter space \u0398 is a compact and R-convex subset of M, which means that for any\n\u03b82 must be within \u0398 for any \u03b81, \u03b82 \u2208 M and\n(cid:48)\n\n\u03b82 and exp\u03b81 t log\u03b81\n(cid:48) \u2208 (cid:82) such that\ndR(\u03b81, \u03b82) \u2264 L\n\ndg (\u03b81, \u03b82),\n\nwhere dg (\u03b81, \u03b82) is the geodesic distance,\n\n2. 
The matrix I (\u03b8) = \u22072L \u2217\n\n(\u03b8) is invertible at \u03b8\u2217\n\u00b5\u2212id\u03b8\u2217 (cid:185) I (\u03b8\u2217\n\n: \u2203 constants \u00b5\u2212, \u00b5+ \u2208 (cid:82) such that\n) (cid:185) \u00b5+id\u03b8\u2217,\n\n3. For any \u03b4 > 0, there exists \u03b5 > 0 such that\n\n(cid:110)\n\ninf (cid:80)\n\ninf\n\ndg (\u03b8\u2217,\u03b8)\u2265\u03b4\n\n(cid:175)(cid:175)L (\u03b8)\u2212 L (\u03b8\u2217\n\n4. There exist constants (G,L) and a function K (x) such that for all \u03b8, \u03b8(cid:48) \u2208 U and t \u2208 [0,1]\n\n)(cid:175)(cid:175) \u2265 \u03b5\n\n(cid:111) = 1,\n(cid:69)(cid:147)(cid:147)\u22072L (\u03b8, D)\u2212 I (\u03b8)(cid:147)(cid:147)16 \u2264 L16,\n\u03b8(cid:48)(cid:162)\u22121(cid:147)(cid:147)(cid:147) \u2264 K (x)dR \u00af\u03b8 (\u03b8, \u03b8(cid:48)\n, x)(cid:161)DR\u22121\n\u03b8(cid:48) \u02c6\u03b8(cid:162)(cid:147)(cid:147)(cid:147) \u2264 K (x)dR \u00af\u03b8 (\u03b8, \u03b8(cid:48)\n, x)(cid:161)DR\u22121\n\n\u2217\u2207L (R\u03b8(cid:48) t R\u03b8(cid:48) \u03b8, x) \u2225\u2264 K (x)dR(\u03b8, \u03b8(cid:48)\n\n),\n\n),\n\n),\n\n\u02c6\u03b8\n\n\u2225 x \u2225= 1}. 
Moreover,\n\n(cid:69) \u2225 \u2207L (\u03b8, D) \u222516\u2264 G16,\n\n(cid:147)(cid:147)(cid:147)(cid:161)DR\u22121\n\u02c6\u03b8(cid:162)\u2217\u22072L (\u03b8, x)(cid:161)DR\u22121\n(cid:147)(cid:147)(cid:147)(cid:161)DR\u22121\n\n\u02c6\u03b8(cid:162)\u2217\u22072L (\u03b8, x)(DR\u22121\n\n\u2225 \u2207D(t, \u03b8, \u03b8(cid:48)\n\n\u03b8(cid:162)\u22121 \u2212(cid:161)DR\u22121\n\u02c6\u03b8(cid:162)\u2212(cid:161)DR\u22121\n\n\u03b8(cid:48) \u02c6\u03b8(cid:162)\u2217\u22072L (\u03b8(cid:48)\n\u03b8(cid:48) \u02c6\u03b8(cid:162)\u2217\u22072L (\u03b8(cid:48)\n\n)\n\n\u02c6\u03b8\n\n\u03b8\n\n\u03b8\n\n\u03b8\n\nwhere (cid:129)(cid:129) is a spectral norm of matrices, (cid:129)A(cid:129) = sup{\u2225 Ax \u2225: x \u2208 (cid:82)n,\nK satis\ufb01es (cid:69)K \u2264 K 16 for some constant K > 0.\n\n5\n\n\fGiven these conditions, we have the following theorem:\nTheorem 2. If the standard regularity conditions holds, the initial estimator \u00af\u03b8 lies in the neighborhood\nU of \u03b8\u2217 and\n\n(cid:147)(cid:147)(cid:147)(cid:161)DR\u22121\n\u02c6\u03b8(cid:162)\u2212(cid:161)DR\u22121\n\n\u03b8\u2217 \u02c6\u03b8(cid:162)\u2217(cid:161)\u22072 \u02dcLs(\u03b8\u2217\n\u03b8(cid:48) \u02c6\u03b8(cid:162)\u2217\u22072 \u02dcLs(\u03b8(cid:48)\n\n\u03b8\u2217 \u02c6\u03b8(cid:162)(cid:147)(cid:147)(cid:147) \u2264 \u03c1\u00b5\u2212R\u2212\n)(cid:162)(cid:161)DR\u22121\n\u03b8(cid:48) \u02c6\u03b8(cid:162)(cid:147)(cid:147)(cid:147) \u2264 K (x)dR \u00af\u03b8 (\u03b8, \u03b8(cid:48)\n\n4\n\n,\n\n)\u2212 I (\u03b8\u2217\n\n, x)(cid:161)DR\u22121\n\n\u03b8\n\n\u03b8\u2217 \u02c6\u03b8)(cid:162)\u22121(cid:147)(cid:147)(cid:147) , then any minimizer \u02dc\u03b8 of the surrogate loss function \u02dcLs(\u03b8)\n\n),\n\n(cid:147)(cid:147)(cid:147)(cid:161)DR\u22121\n\n\u03b8\n\nwhere R\u2212 =\n\nsatis\ufb01es\n\ndR( \u02dc\u03b8, \u02c6\u03b8) \u2264 C2\n\n\u02c6\u03b8(cid:162)\u2217\u22072 \u02dcLs(\u03b8, 
x)(DR\u22121\n(cid:147)(cid:147)(cid:147)(cid:161)(DR\u22121\n(cid:179)\n1+ dR( \u00af\u03b8, \u02c6\u03b8)+ dR(\u03b8\u2217\n\n\u03b8\u2217 \u02c6\u03b8)\u2217(DR\u22121\n\n1\n\ndR( \u00af\u03b8, \u02c6\u03b8),\nC3\nwith probability at least 1\u2212C1mn\n\u22128, where constants C1,C2 and C3 are independent of (m,n, N ).\n\n\u02c6\u03b8\n\n)\u2212\u22072LN (\u03b8\u2217\n\n, \u02c6\u03b8)+\n\n(cid:147)(cid:147)(cid:147)(cid:161)DR\u22121\n\n\u03b8\u2217 \u02c6\u03b8(cid:162)\u2217(cid:161)\u22072Ls(\u03b8\u2217\n\n)(cid:162)(cid:161)DR\u22121\n\n\u03b8\u2217(cid:162)\u22121(cid:147)(cid:147)(cid:147)(cid:180)\n\n5 Simulation study and data analysis\n\nTo examine the quality of our parallel algorithm we \ufb01rst apply it to the estimation of Fr\u00e9chet means\non spheres, which has closed form expressions for the estimation of the extrinsic mean (true empirical\nminimizer). In addition, we apply our algorithm to Net\ufb02ix movie-ranking data set as an example\nof optimization over Grassmannian manifolds in the low-rank matrix completion problem. In the\nfollowing results, we demonstrate the utility of our algorithm both for high dimensional manifold-\nvalued data (Section 5.1) and Euclidean space data with non-Euclidean parameters (Section 5.2).\nWe wrote the code for our implementations in Python and carried out the parallelization of the code\nthrough MPI1[7].\n\n5.1 Estimation of Fr\u00e9chet means on manifolds\n\nWe \ufb01rst consider the estimation problem of Fr\u00e9chet means [10] on manifolds. In particular, the\nmanifold under consideration is the sphere in which we wish to estimate both the extrinsic and\nintrinsic mean [3]. Let M be a general manifold and \u03c1 be a distance on M which can be an intrinsic\ndistance, by employing a Riemannian structure of M, or an extrinsic distance, via some embedding J\nonto some Euclidean space. 
Also, let $x_1, \\dots, x_N$ be a sample of points on the hypersphere $S^d$; the sample\nFr\u00e9chet mean of $x_1, \\dots, x_N$ is defined as\n\n$$\\hat\\theta = \\arg\\min_{\\theta \\in M = S^d} \\sum_{i=1}^{N} \\rho^2(\\theta, x_i), \\tag{2}$$\n\nwhere $\\rho$ is some distance on the sphere.\nThe extrinsic distance, for our spherical example, is defined to be $\\rho(x, y) = \\|J(x) - J(y)\\| = \\|x - y\\|$\nwith $\\|\\cdot\\|$ as the Euclidean distance and the embedding map $J(x) = x \\in \\mathbb{R}^{d+1}$ as the identity map. We\ncall $\\hat\\theta$ the extrinsic Fr\u00e9chet mean on the sphere. We choose this example in our simulation, as we\nknow the true global optimizer which is given by $\\bar{x}/\\|\\bar{x}\\|$ where $\\bar{x}$ is the standard sample mean of\n$x_1, \\dots, x_N$ in Euclidean distance. The intrinsic Fr\u00e9chet mean, on the other hand, is defined by taking\nthe distance $\\rho$ to be the geodesic distance (or the arc length). In this case we compare the estimator\nobtained from the parallel algorithm with the optimizer obtained from a gradient descent algorithm\nalong the sphere applied to the entire data set. Although the spherical case may be an \u201ceasy\u201d setting\nas it has a Betti number of zero, we chose this example so that we have a ground truth to compare our\nresults with; in fact, our method performs favorably even when the dimensionality of the data is high and\nas we increase the number of processors.\nFor this example, we simulate one million observations from a 100-dimensional von Mises distribution\nprojected onto the unit sphere with mean sampled randomly from $N(0, I)$ and a precision of 2. 
For\nthe extrinsic mean example, the closed form expression of the sample mean acts as a \u201cground truth\u201d\nto which we can compare our results. In both the extrinsic and intrinsic mean examples, we run 20\ntrials of our algorithm over 1, 2, 4, 6, 8 and 10 processors. For the extrinsic mean simulations we\ncompare our results to the true global optimizer in terms of root mean squared error (RMSE) and for\nthe intrinsic mean simulations we compare our distributed results to the single processor results, also\nin terms of RMSE.\nAs we can see in Figure 1, even if we divide our observations across as many as 10 processors we still\nobtain favorable results for the estimation of the Fr\u00e9chet mean in terms of RMSE to the ground truth\nfor the extrinsic mean case and the single processor results for the intrinsic mean case. To visualize\nthis comparison, we show in Figure 2 an example of our method\u2019s performance on two dimensional\ndata so that we may see that our optimization results yield a very close estimate to the true global\noptimizer.\n\n1 Our code is available at https://github.com/michaelzhang01/parallel_manifold_opt\n\nFigure 1: Extrinsic mean comparison (left) and intrinsic mean comparison (right) on spheres in $S^{99}$\n\nFigure 2: Extrinsic mean results on $S^1$, for one (left) and ten (right) processors\n\n5.2 Real data analysis: the Netflix example\n\nNext, we consider an application of our algorithm to the Netflix movie rating dataset. This dataset\nof over a million entries, $X \\in \\mathbb{R}^{M \\times N}$, consists of $M = 17770$ movies and $N = 480189$ users, in which\nonly a sparse subset of the users and movies have ratings. 
In order to build better recommendation\nsystems for users, we can frame the problem of predicting users\u2019 ratings for movies as a low-rank\nmatrix completion problem by learning a point $U \\in \\mathrm{Gr}(M, r)$ on the rank-$r$ Grassmannian manifold which,\nfor the set of observed entries $(i, j) \\in \\Omega$, optimizes the loss function\n\n$$L(U) = \\frac{1}{2} \\sum_{(i,j) \\in \\Omega} \\big((UW)_{ij} - X_{ij}\\big)^2 + \\frac{\\lambda^2}{2} \\sum_{(i,j) \\notin \\Omega} (UW)_{ij}^2, \\tag{3}$$\n\nwhere $W$ is an $r$-by-$N$ matrix. Each user $k$ has the loss function $\\mathcal{L}(U, k) = \\frac{1}{2}\\,|c_k \\circ (U w_k(U) - X_k)|^2$,\nwhere $\\circ$ is the Hadamard product, $(w_k)_i = W_{ik}$, and\n\n$$(c_k)_i = \\begin{cases} 1, & \\text{if } (i,k) \\in \\Omega \\\\ \\lambda, & \\text{if } (i,k) \\notin \\Omega, \\end{cases} \\qquad (X_k)_i = \\begin{cases} X_{ik}, & \\text{if } (i,k) \\in \\Omega \\\\ 0, & \\text{if } (i,k) \\notin \\Omega, \\end{cases}$$\n\n$$w_k(U) = \\big(U^T \\mathrm{diag}(c_k \\circ c_k)\\, U\\big)^{-1} U^T \\big(c_k \\circ c_k \\circ X_k\\big).$$\n\nThis results in the following gradient\n\n$$\\nabla\\mathcal{L}(U, k) = \\big(c_k \\circ c_k \\circ (U w_k(U) - X_k)\\big)\\, w_k(U)^T = \\mathrm{diag}(c_k \\circ c_k)\\,(U w_k(U) - X_k)\\, w_k(U)^T.$$\n\nWe can assume that $N = pq$; then for each local machine $M_j$, $j = 1, \\dots, p$, we have the local function\n$\\mathcal{L}_j(U) = \\frac{1}{q} \\sum_{k=(j-1)q+1}^{jq} \\mathcal{L}(U, k)$. So the global function is\n\n$$\\mathcal{L}_N(U) = \\frac{1}{p} \\sum_{j=1}^{p} \\mathcal{L}_j(U) = \\frac{1}{pq} \\sum_{k=1}^{pq} \\mathcal{L}(U, k) = \\frac{1}{N} L(U).$$\n\nFor iterations $s = 0, 1, \\dots, P-1$ we have $\\nabla\\mathcal{L}_j(U_s) = \\frac{1}{q} \\sum_{k=(j-1)q+1}^{jq} \\nabla\\mathcal{L}(U_s, k)$. Therefore the global\ngradient is $\\nabla\\mathcal{L}_N(U_s) = \\frac{1}{p} \\sum_{j=1}^{p} \\nabla\\mathcal{L}_j(U_s)$. 
Instead of the logarithm map we will use the inverse\nretraction map\n\n$$R^{-1}_{[U]} : \\mathrm{Gr}(m, r) \\to T_{[U]}\\mathrm{Gr}(m, r), \\qquad [V] \\mapsto V - U(U^T U)^{-1} U^T V,$$\n\nwhich gives us the following surrogate function\n\n$$\\tilde{\\mathcal{L}}_s(V) = \\mathcal{L}_s(V) - \\langle V - U_s(U_s^T U_s)^{-1} U_s^T V, \\nabla\\mathcal{L}_s(U_s) - \\nabla\\mathcal{L}_N(U_s) \\rangle = \\mathcal{L}_s(V) - \\langle V, \\nabla\\mathcal{L}_s(U_s) - \\nabla\\mathcal{L}_N(U_s) \\rangle$$\n\nand its gradient\n\n$$\\nabla\\tilde{\\mathcal{L}}_s(V) = \\nabla\\mathcal{L}_s(V) - \\big(I_m - V(V^T V)^{-1} V^T\\big)\\big(\\nabla\\mathcal{L}_s(U_s) - \\nabla\\mathcal{L}_N(U_s)\\big).$$\n\nTo optimize with respect to our loss function, we have to find $U_{s+1} = \\arg\\min \\tilde{\\mathcal{L}}_s$. To do this, we\nmove according to the steepest descent by taking step size $\\lambda_0$ in the direction $\\nabla\\tilde{\\mathcal{L}}_s(U_s)$ by taking the\nretraction, $U_{s+1} = R_{[U_s]}\\big(\\lambda_0 \\nabla\\tilde{\\mathcal{L}}_s(U_s)\\big)$.2\nFor our example we set the matrix rank to $r = 10$ and the regularization parameter to $\\lambda = 0.1$ and\ndivided the data randomly across 4 processors. Figure 3 shows that we can perform distributed\nmanifold gradient descent in this complicated problem and we can reach convergence fairly quickly\n(after about 1000 seconds).\n\n6 Conclusion\n\nWe propose in this paper a communication efficient parallel algorithm for general optimization\nproblems on manifolds which is applicable to many different manifold spaces and loss functions.\nMoreover, our proposed algorithm can explore the geometry of the underlying space efficiently and\nperforms well in simulation studies and practical examples, all while having theoretical convergence\nguarantees.\nIn the age of \u201cbig data\u201d, the need for distributable inference algorithms is crucial as we cannot reliably\nexpect entire datasets to sit on a single processor anymore. Despite this, much of the previous work\nin parallel inference has only focused on data and parameters in Euclidean space. 
Realistically, much of the data that we are interested in is better modeled by manifolds, and thus we need fast inference algorithms that are provably suitable for situations beyond the Euclidean setting. In future work, we aim to extend the situations under which parallel inference algorithms are generalizable to manifolds and to demonstrate more critical problems (in neuroscience or computer vision, for example) in which parallel inference is a crucial solution.\n\n$^2$ We select the step size parameter according to the modified Armijo algorithm seen in [6].\n\nFigure 3: Test set RMSE of the Netflix example over time, evaluated on 10 trials.\n\nAcknowledgments\nBayan Saparbayeva was partially supported by DARPA N66001-17-1-4041. Michael Zhang was supported by NSF grant 1447721. Lizhen Lin acknowledges the support from NSF grants IIS 1663870, DMS Career 1654579 and a DARPA grant N66001-17-1-4041.\n\nReferences\n[1] Andrew L. Alexander, Jee Eun Lee, Mariana Lazar, and Aaron S. Field. Diffusion tensor imaging of the brain. Neurotherapeutics, 4(3):316\u2013329, 2007.\n\n[2] Abhishek Bhattacharya and Rabi Bhattacharya. Nonparametric Inference on Manifolds: With Applications to Shape Spaces. IMS Monograph #2. Cambridge University Press, 2012.\n\n[3] Rabi Bhattacharya and Lizhen Lin. Omnibus CLTs for Fr\u00e9chet means and nonparametric inference on non-Euclidean spaces. Proceedings of the American Mathematical Society, 145:413\u2013428, 2017.\n\n[4] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics, 31(1):1\u201329, 2003.\n\n[5] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds: II. The Annals of Statistics, 33:1225\u20131259, 2005.\n\n[6] Nicolas Boumal. Optimization and estimation on manifolds. 
PhD thesis, Universit\u00e9 catholique de Louvain, 2014.\n\n[7] Lisandro Dalc\u00edn, Rodrigo Paz, and Mario Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108\u20131115, 2005.\n\n[8] T. Downs, J. Liebman, and W. Mackay. Statistical methods for vectorcardiogram orientations. International Symposium on Vectorcardiography, pages 216\u2013222, 1971.\n\n[9] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57:592\u2013606, 2012.\n\n[10] Maurice Fr\u00e9chet. Les \u00e9l\u00e9ments al\u00e9atoires de nature quelconque dans un espace distanci\u00e9. Ann. Inst. H. Poincar\u00e9, 10:215\u2013310, 1948.\n\n[11] Jeffrey Ho, Kuang-Chih Lee, Ming-Hsuan Yang, and David Kriegman. Visual tracking using learned linear subspaces. In Computer Vision and Pattern Recognition, volume 1, pages I-782\u2013I-789, June 2004.\n\n[12] Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 2018.\n\n[13] David G. Kendall. Shape manifolds, Procrustean metrics, and complex projective spaces. Bull. of the London Math. Soc., 16:81\u2013121, 1984.\n\n[14] Eric Kolaczyk, Lizhen Lin, Steven Rosenberg, and Jackson Walters. Averages of Unlabeled Networks: Geometric Characterization and Asymptotic Behavior. ArXiv e-prints, September 2017.\n\n[15] Sebastian Kurtek, Eric Klassen, John C. Gore, Zhaohua Ding, and Anuj Srivastava. Elastic geodesic paths in shape space of parameterized surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1717\u20131730, Sept 2012.\n\n[16] Sebastian Kurtek, Anuj Srivastava, Eric Klassen, and Zhaohua Ding. Statistical modeling of curves using shapes and related features. 
Journal of the American Statistical Association, 107(499):1152\u20131165, 2012.\n\n[17] Jason D. Lee, Yuekai Sun, Qiang Liu, and Jonathan E. Taylor. Communication-efficient sparse regression: a one-shot approach. CoRR, abs/1503.04337, 2015.\n\n[18] Lizhen Lin, Vinayak Rao, and David B. Dunson. Bayesian nonparametric inference on the Stiefel manifold. Statistica Sinica, 27:535\u2013553, 2017.\n\n[19] Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. Distributed matrix completion and robust factorization. The Journal of Machine Learning Research, 16(1):913\u2013960, 2015.\n\n[20] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. Journal of Machine Learning Research, 18(124):1\u201340, 2017.\n\n[21] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 623\u2013632, 2014.\n\n[22] Christopher Nemeth, Chris Sherlock, et al. Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis, 13(2):507\u2013530, 2018.\n\n[23] Benjamin Recht and Christopher R\u00e9. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201\u2013226, 2013.\n\n[24] Hesamoddin Salehian, Rudrasis Chakraborty, Edward Ofori, David Vaillancourt, and Baba C. Vemuri. An efficient recursive estimator of the Fr\u00e9chet mean on a hypersphere with applications to medical image analysis. In Mathematical Foundations of Computational Anatomy, volume 3, 2015.\n\n[25] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. 
Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78\u201388, 2016.\n\n[26] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, 2014.\n\n[27] G. Prabhu Teja and Sundaram Ravi. Face recognition using subspaces techniques. In International Conference on Recent Trends In Information Technology, pages 103\u2013107, April 2012.\n\n[28] Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451\u2013459, 2015.\n\n[29] Michael Minyi Zhang, Henry Lam, and Lizhen Lin. Robust and parallel Bayesian model selection. Computational Statistics and Data Analysis, 127:229\u2013247, 2018.\n\n[30] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321\u20133363, 2013.\n", "award": [], "sourceid": 1817, "authors": [{"given_name": "Bayan", "family_name": "Saparbayeva", "institution": "University Notre Dame"}, {"given_name": "Michael", "family_name": "Zhang", "institution": "Princeton University"}, {"given_name": "Lizhen", "family_name": "Lin", "institution": "The University of Notre Dame"}]}