{"title": "Stochastic Gradient Geodesic MCMC Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 3009, "page_last": 3017, "abstract": "We propose two stochastic gradient MCMC methods for sampling from Bayesian posterior distributions defined on Riemann manifolds with a known geodesic flow, e.g. hyperspheres. Our methods are the first scalable sampling methods on these manifolds, with the aid of stochastic gradients. Novel dynamics are conceived and 2nd-order integrators are developed. By adopting embedding techniques and the geodesic integrator, the methods do not require a global coordinate system of the manifold and do not involve inner iterations. Synthetic experiments show the validity of the methods, and their application to the challenging inference for spherical topic models indicates practical usability and efficiency.", "full_text": "Stochastic Gradient Geodesic MCMC Methods\n\nChang Liu†, Jun Zhu†, Yang Song‡∗\n\n† Dept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research; State Key Lab for Intell. Tech. & Systems, Tsinghua University, Beijing, China\n\n‡ Dept. of Physics, Tsinghua University, Beijing, China\n\n{chang-li14@mails, dcszj@}.tsinghua.edu.cn; songyang@stanford.edu\n\nAbstract\n\nWe propose two stochastic gradient MCMC methods for sampling from Bayesian posterior distributions defined on Riemann manifolds with a known geodesic flow, e.g. hyperspheres. Our methods are the first scalable sampling methods on these manifolds, with the aid of stochastic gradients. Novel dynamics are conceived and 2nd-order integrators are developed. By adopting embedding techniques and the geodesic integrator, the methods do not require a global coordinate system of the manifold and do not involve inner iterations. 
Synthetic experiments show the validity of the methods, and their application to the challenging inference for spherical topic models indicates practical usability and efficiency.\n\n1 Introduction\n\nDynamics-based Markov Chain Monte Carlo methods (D-MCMCs) are sampling methods that use dynamics simulation for state transition in a Markov chain. They have become a workhorse for Bayesian inference, with well-known examples like Hamiltonian Monte Carlo (HMC) [22] and stochastic gradient Langevin dynamics (SGLD) [29]. Here we consider variants for sampling from distributions defined on Riemann manifolds. Overall, geodesic Monte Carlo (GMC) [7] stands out for its notable performance on manifolds with a known geodesic flow, such as the simplex, the hypersphere and the Stiefel manifold [26, 16]. Its applicability to manifolds with no global coordinate system (e.g. hyperspheres) is enabled by the embedding technique, and its geodesic integrator eliminates inner iteration (within one step of dynamics simulation) to ensure efficiency. It is also used for efficient sampling from constrained distributions [17]. Constrained HMC (CHMC) [6] aims at manifolds defined by a constraint in some R^n. It covers all common manifolds, but inner iteration makes it less appealing. Other D-MCMCs involving Riemann manifolds, e.g. Riemann manifold Langevin dynamics (RMLD) and Riemann manifold HMC (RMHMC) [13], are invented for better performance but still for the task of sampling in Euclidean space, where the target variable is treated as the global coordinates of some distribution manifold. Although they can be used to sample on non-Euclidean Riemann manifolds by replacing the distribution manifold with the target manifold, a global coordinate system of the target manifold is required. Moreover, RMHMC suffers from expensive inner iteration.\n\nHowever, GMC scales undesirably to large datasets, which are becoming common. 
An effective strategy to scale up D-MCMCs is to randomly sample a subset of the data to estimate a noisy but unbiased stochastic gradient, giving stochastic gradient MCMC methods (SG-MCMCs). Welling et al. [29] pioneered this direction by developing stochastic gradient Langevin dynamics (SGLD). Chen et al. [9] apply the idea to HMC with stochastic gradient HMC (SGHMC), where a non-trivial dynamics with friction has to be conceived. Ding et al. [10] propose stochastic gradient Nosé-Hoover thermostats (SGNHT) to automatically adapt the friction to the noise by a thermostat. To unify the dynamics used for SG-MCMCs, Ma et al. [19] develop a complete recipe to formulate the dynamics.\n\n∗JZ is the corresponding author; YS is with the Department of Computer Science, Stanford University, CA.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTable 1: A summary of some D-MCMCs. –: sampling on manifold not supported; †: the integrators are not in the SSI scheme (it is unclear whether the claimed “2nd-order” is equivalent to ours); ‡: 2nd-order integrators for SGHMC and mSGNHT are developed by [8] and [18], respectively.\n\nmethods | stochastic gradient | no inner iteration | no global coordinates | order of integrator\nGMC [7] | × | √ | √ | 2nd\nRMLD [13] | × | √ | × | 1st\nRMHMC [13] | × | × | × | 2nd†\nCHMC [6] | × | × | √ | 2nd†\nSGLD [29] | √ | √ | – | 1st\nSGHMC [9] / SGNHT [10] | √ | √ | – | 1st‡\nSGRLD [23] / SGRHMC [19] | √ | √ | × | 1st\nSGGMC / gSGNHT (proposed) | √ | √ | √ | 2nd\n\nIn this paper, we present two SG-MCMCs for manifolds with a known geodesic flow: stochastic gradient geodesic Monte Carlo (SGGMC) and geodesic stochastic gradient Nosé-Hoover thermostats (gSGNHT). 
They are the first scalable sampling methods on manifolds with a known geodesic flow and no global coordinate system. We use the recipe [19] to tackle the non-trivial task of conceiving the dynamics. Our novel dynamics are also suitable for developing 2nd-order integrators by adopting the symmetric splitting integrator (SSI) [8] scheme. A key property of a Kth-order integrator is that the bias of the expected sample average at iteration L can be upper bounded by L^{−K/(K+1)} and the mean square error by L^{−2K/(2K+1)} [8], so a higher-order integrator basically performs better. Our integrators also incorporate the geodesic integrator to avoid inner iteration. Our methods can also be used to scalably sample from constrained distributions [17], like GMC.\n\nThere exist other SG-MCMCs on Riemann manifolds, e.g. SGRLD [23] and SGRHMC [19], the stochastic gradient versions of RMLD and RMHMC respectively. But they also require the Riemann manifold to have a global coordinate system, like their original versions as mentioned above. So basically they cannot draw samples from hyperspheres, while our methods can. Technically, SGRLD/SGRHMC (and RMLD/RMHMC) sample in the coordinate space, so a global coordinate system is needed for them to be valid. The explicit use of the Riemann metric tensor also makes the methods more difficult to implement. Our methods (and GMC) sample in the isometrically embedded space, where the whole manifold is represented and the Riemann metric tensor is implicitly embodied by the isometric embedding. Moreover, our integrators are of a higher order. Tab. 1 summarizes the key properties of the aforementioned D-MCMCs, where our advantages are clearly shown.\n\nFinally, we apply our samplers to perform inference for spherical admixture models (SAM) [24]. SAM defines a hierarchical generative process to describe data that are expressed as unit vectors (i.e., elements on the hypersphere). 
The task of posterior inference is to identify a set of latent topics, which are also unit vectors. This task is highly challenging due to a non-conjugate structure and the strict manifold constraints. None of the existing MCMC methods is both applicable to the task and scalable. We demonstrate that our methods are the most efficient methods to learn SAM on large datasets, with a good performance in testing data perplexity.\n\n2 Preliminaries\n\nWe briefly review the basics of SG-MCMCs. Consider a Bayesian model with latent variable q, prior π0(q) and likelihood π(x|q). Given a dataset D = {x_d}_{d=1}^D, sampling from the posterior π(q|D) by D-MCMCs requires computing the gradient of the potential energy ∇U(q) ≜ −∇log π(q|D) = −∇log π0(q) − Σ_{d=1}^D ∇log π(x_d|q), which is linear in the data size D and thus not scalable. SG-MCMCs address this challenge by randomly drawing a subset S of D to build the stochastic gradient ∇_q Ũ(q) ≜ −∇_q log π0(q) − (D/|S|) Σ_{x∈S} ∇_q log π(x|q), a noisy but unbiased estimate. Under the i.i.d. assumption on D, the central limit theorem holds: in the sense of convergence in distribution for large D,\n\n∇_q Ũ(q) = ∇_q U(q) + N(0, V(q)), (1)\n\nwhere we use N(·,·) to denote a Gaussian random variable and V(q) is some covariance matrix. The gradient noise raises challenging restrictions on the SG-MCMC dynamics. Ma et al. [19] then provide a recipe to construct correct dynamics. 
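The stochastic gradient estimate ∇_q Ũ(q) above can be sketched as follows (a minimal illustration with hypothetical `grad_log_prior`/`grad_log_lik` callbacks, not tied to any particular model from the paper):\n\n```python
import numpy as np

def stochastic_grad_U(q, data, grad_log_prior, grad_log_lik, batch_size, rng):
    # Noisy but unbiased estimate of grad U(q) = -grad log pi0(q) - sum_d grad log pi(x_d|q):
    # a random minibatch S, with its likelihood term rescaled by D/|S|.
    D = len(data)
    idx = rng.choice(D, size=batch_size, replace=False)
    g = -grad_log_prior(q)
    g -= (D / batch_size) * sum(grad_log_lik(x, q) for x in data[idx])
    return g

# Toy check: prior N(0,1), likelihood N(q,1), so the full-batch estimate
# coincides with the exact gradient q + sum_d (q - x_d).
rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0])
grad_log_prior = lambda q: -q
grad_log_lik = lambda x, q: x - q
g_full = stochastic_grad_U(0.5, data, grad_log_prior, grad_log_lik, 3, rng)
```\n\nWith `batch_size = D` the estimate is exact; for smaller batches it is unbiased over the random subset, which is the property Eqn. (1) formalizes.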
It claims that for a random variable z, given a Hamiltonian H(z), a skew-symmetric matrix (curl matrix) Q(z) and a positive definite matrix (diffusion matrix) D(z), the dynamics defined by the following stochastic differential equation (SDE)\n\ndz = f(z)dt + √(2D(z)) dW(t) (2)\n\nhas the unique stationary distribution π(z) ∝ exp{−H(z)}, where W(t) is the Wiener process and\n\nf(z) = −[D(z) + Q(z)]∇_z H(z) + Γ(z), Γ_i(z) = Σ_j (∂/∂z_j)(D_ij(z) + Q_ij(z)). (3)\n\nThe above dynamics is compatible with stochastic gradients. For SG-MCMCs, z is usually an augmentation of the target variable q, and the Hamiltonian usually follows the form H(z) = T(z) + U(q). Referring to Eqn. (1), ∇_q H̃(z) = ∇_q H(z) + N(0, V(q)) and f̃(z) = f(z) + N(0, B(z)), where B(z) is the covariance matrix of the Gaussian noise passed from ∇_z H̃(z) to f̃(z) through Eqn. (3). We informally rewrite dW(t) as N(0, dt) and express dynamics Eqn. (2) as\n\ndz = f(z)dt + N(0, 2D(z)dt) = f(z)dt + N(0, B(z)dt²) + N(0, 2D(z)dt − B(z)dt²) = f̃(z)dt + N(0, 2D(z)dt − B(z)dt²). (4)\n\nThis tells us that the same dynamics can be exactly expressed with the stochastic gradient. Moreover, the recipe is complete: for any continuous Markov process defined by Eqn. (2) with a unique stationary distribution π(z) ∝ exp{−H(z)}, there exists a skew-symmetric matrix Q(z) so that Eqn. (3) holds.\n\n3 Stochastic Gradient Geodesic MCMC Methods\n\nWe now formally develop our SGGMC and gSGNHT. We will describe the task settings, develop the dynamics, and show how to simulate them with 2nd-order integrators and stochastic gradients.\n\n3.1 Technical Descriptions of the Settings\n\nWe first describe a Riemann manifold. The main concepts are depicted in Fig. 1. Let M be an m-dim Riemann manifold, which is covered by a set of local coordinate systems. 
Denote one of them by (N, Φ), where N ⊆ M is an open subset, and Φ : N → Ω, Q ↦ q, with Ω ≜ Φ(N) ⊆ R^m, Q ∈ N and q ∈ Ω, is a homeomorphism. Additionally, transition mappings between any two intersecting local coordinate systems are required to be smooth. Denote the Riemann metric tensor under (N, Φ) by G(q), an m × m symmetric positive-definite matrix. Another way to describe M is through an embedding — a diffeomorphism Ξ : M → Ξ(M) ⊆ R^n (n ≥ m). In (N, Φ), Ξ can be embodied by a more sensible mapping ξ ≜ Ξ∘Φ⁻¹ : R^m → R^n, q ↦ x, which links the coordinate space and the embedded space. For convenience, we only consider isometric embeddings (whose existence is guaranteed [21]): Ξ such that G(q)_ij = Σ_{l=1}^n (∂ξ_l(q)/∂q_i)(∂ξ_l(q)/∂q_j), 1 ≤ i, j ≤ m, holds for any local coordinate system. Common manifolds are subsets of some R^n, in which case the identity mapping (as Ξ) from R^n (where M is defined) to R^n (the embedded space) is isometric.\n\nFigure 1: An illustration of manifold M with local coordinate system (N, Φ) and embedding Ξ. See text for details.\n\nTo define a distribution on a Riemann manifold, from which we want to sample, we need a measure. In the coordinate space R^m, Ω naturally possesses the Lebesgue measure λ_m(dq), and the probability density can be defined on Ω, which we denote as π(q). In the embedded space R^n, Ξ(N) naturally possesses the Hausdorff measure H_m(dx), and we denote the probability density w.r.t. this measure as π_H(x). The relation between them is given by π_H(ξ(q)) = π(q)/√|G(q)|.\n\n3.2 The Dynamics\n\nWe now construct our dynamics using the recipe [19] so that they naturally have the desired stationary distribution, leading to correct samples. It is important to note that the recipe only suits dynamics in a Euclidean space, so we can only develop the dynamics in the coordinate space, not in the embedded space Ξ(M), which is generally not Euclidean. However, it is advantageous to simulate the dynamics in the embedded space (see Sec. 3.3).\n\nDynamics for SGGMC Define the momentum in the coordinate space p ∈ R^m and the augmented variable z = (q, p) ∈ R^{2m}. Define the Hamiltonian² H(z) = U(q) + (1/2) log|G(q)| + (1/2) p⊤G(q)⁻¹p, where U(q) ≜ −log π(q). We define the Hamiltonian so that the canonical distribution π(z) ∝ exp{−H(z)}, marginalized w.r.t. p, recovers the target distribution π(q). For a symmetric positive definite n × n matrix C, define the diffusion matrix D(z) and the curl matrix Q(z) as\n\nD(z) = (0, 0; 0, M(q)⊤CM(q)), Q(z) = (0, −I; I, 0),\n\nwhere we define M(q)_{n×m} : M(q)_ij = ∂ξ_i(q)/∂q_j. So from Eqn. (2, 3), the dynamics\n\ndq = G⁻¹p dt,\ndp = −∇_q U dt − (1/2)∇_q log|G| dt − M⊤CM G⁻¹p dt − (1/2)∇_q[p⊤G⁻¹p] dt + N(0, 2M⊤CM dt) (5)\n\nhas a unique stationary distribution π(z) ∝ exp{−H(z)}.\n\n²Another derivation of the momentum and the Hamiltonian, originating from physics in both the coordinate and embedded spaces, is provided in Appendix C.\n\nDynamics for gSGNHT Define z = (q, p, ξ) ∈ R^{2m+1}, where ξ ∈ R is the thermostat. For a positive C ∈ R, define the Hamiltonian H(z) = U(q) + (1/2) log|G(q)| + (1/2) p⊤G(q)⁻¹p + (m/2)(ξ − C)², whose marginalized canonical distribution is π(q), as desired. Define D(z) and Q(z) as\n\nD(z) = (0, 0, 0; 0, CG(q), 0; 0, 0, 0), Q(z) = (0, −I, 0; I, 0, p/m; 0, −p⊤/m, 0).\n\nThen by Eqn. (2, 3) the proper dynamics of gSGNHT is\n\ndq = G⁻¹p dt,\ndp = −∇_q U dt − (1/2)∇_q log|G| dt − ξp dt − (1/2)∇_q[p⊤G⁻¹p] dt + N(0, 2CG dt),\ndξ = ((1/m) p⊤G⁻¹p − 1) dt. (6)\n\nThese two dynamics are novel. They are extensions of the dynamics of SGHMC and SGNHT to Riemann manifolds, respectively. Conceiving the dynamics in this form is also intended for the convenience of developing 2nd-order geodesic integrators, which differs from SGRHMC.\n\n3.3 Simulation with 2nd-order Geodesic Integrators\n\nIn this part we develop our integrators by following the symmetric splitting integrator (SSI) scheme [8], which is guaranteed to be of 2nd order. The idea of SSI is to first split the dynamics into parts that are each analytically solvable, and then alternately simulate each part exactly with the analytic solutions. Although also an SSI, the integrator of GMC does not fit our dynamics, where diffusion arises. 
But we adopt its embedding technique to get rid of any local coordinate system, thus lifting the global coordinate system requirement. So we will solve and simulate the split dynamics in the isometrically embedded space, where everything is expressed by the position x = ξ(q) and the velocity v = ẋ (which is actually the momentum in the isometrically embedded space, see Appendix C; the overhead dot means time derivative), instead of q and p.\n\nIntegrator for SGGMC We first split dynamics (5) into sub-SDEs, each analytically solvable:\n\nA: dq = G⁻¹p dt, dp = −(1/2)∇_q[p⊤G⁻¹p] dt;  B: dq = 0, dp = −M⊤CM G⁻¹p dt;  O: dq = 0, dp = −∇_q U(q) dt − (1/2)∇_q log|G(q)| dt + N(0, 2M⊤CM dt).\n\nAs noted in GMC, the solution of dynamics A is the geodesic flow of the manifold [1]. Intuitively, dynamics A describes motion with no force, so a particle moves freely on the manifold, e.g. uniform motion in Euclidean space, or motion along great circles (velocity rotating with the varying tangents along the trajectory) on the hypersphere S^{d−1} ≜ {x ∈ R^d | ‖x‖ = 1} (‖·‖ denotes the ℓ2-norm). The evolution of the position and velocity of this kind is the geodesic flow. We require an explicit form of the geodesic flow in the embedded space. 
For S^{d−1},\n\nx(t) = x(0) cos(αt) + (v(0)/α) sin(αt),\nv(t) = −α x(0) sin(αt) + v(0) cos(αt) (7)\n\nis the geodesic flow expressed by the embedded variables x and v, where α = ‖v(0)‖.\n\nBy details in [7] or Appendix A, dynamics B and O are solved as\n\nB: x(t) = x(0), v(t) = expm{−Λ(x(0))Ct} v(0);  O: x(t) = x(0), v(t) = v(0) + Λ(x(0))[−∇_x U_H(x(0)) t + N(0, 2Ct)],\n\nwhere U_H(x) ≜ −log π_H(x), expm{·} is the matrix exponential, and Λ(x) is the projection onto the tangent space at x in the embedded manifold. For R^n, Λ(x) = I_n (the identity mapping in R^n), and for S^{n−1} embedded in R^n, Λ(x) = I_n − xx⊤ (see Appendix A.3).\n\nWe further reduce dynamics B for scalar C: v(t) = Λ(x(0)) exp{−Ct} v(0) = exp{−Ct} v(0), by noting that exp{−Ct} is a scalar and v(0) already lies in the tangent space at x(0). To illustrate this form, we expand the exponential for small t and get v(t) = (1 − Ct)v(0), which is exactly the action of a friction dissipating energy to control the injected noise, as proposed in SGHMC. Our investigation reveals that this form holds generally for v as the momentum in the isometrically embedded space, but not for the usual momentum p in the coordinate space. 
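As a sanity check, the geodesic flow (7) on S^{d−1} is straightforward to implement in the embedded space; the sketch below (our illustration, not the authors' code) verifies that it keeps x on the sphere, keeps v tangent, and preserves the speed α:\n\n```python
import numpy as np

def sphere_geodesic_flow(x, v, t):
    # Eqn. (7): free motion along a great circle with constant speed alpha = ||v||.
    alpha = np.linalg.norm(v)
    if alpha == 0.0:
        return x.copy(), v.copy()
    xt = x * np.cos(alpha * t) + (v / alpha) * np.sin(alpha * t)
    vt = -alpha * x * np.sin(alpha * t) + v * np.cos(alpha * t)
    return xt, vt

x0 = np.array([1.0, 0.0, 0.0])
v0 = np.array([0.0, 0.3, 0.4])   # tangent at x0 since x0 . v0 = 0, speed alpha = 0.5
x1, v1 = sphere_geodesic_flow(x0, v0, 2.0)
```\n\nBecause the flow is exact, no renormalization of x is needed during simulation, in contrast to projection-based schemes.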
In SGHMC, v and p are indistinguishable, but in our case v can only lie in the tangent space while p is arbitrary in R^m.\n\nIntegrator for gSGNHT We split dynamics (6) in a similar way:\n\nA: dq = G⁻¹p dt, dp = −(1/2)∇_q[p⊤G⁻¹p] dt, dξ = ((1/m) p⊤G⁻¹p − 1) dt;  B: dq = 0, dp = −ξp dt, dξ = 0;  O: dq = 0, dp = −∇_q U dt − (1/2)∇_q log|G| dt + N(0, 2CG dt), dξ = 0.\n\nFor dynamics A, the solution of q and p is again the geodesic flow. To solve for ξ, we first figure out that for dynamics A, p⊤G⁻¹p is constant: (d/dt)[p⊤G(q)⁻¹p] = ∇_q[p⊤G(q)⁻¹p]⊤ q̇ + 2[G(q)⁻¹p]⊤ ṗ = −2ṗ⊤q̇ + 2q̇⊤ṗ = 0. Alternatively, we note that (1/2) p⊤G⁻¹p = (1/2) v⊤v is the kinetic energy³, conserved by motion with no force. Now the evolution of ξ can be solved as ξ(t) = ξ(0) + ((1/m) v(0)⊤v(0) − 1) t.\n\nDynamics O is identical to that of SGGMC. Dynamics B can be solved similarly with only v updated: v(t) = exp{−ξ(0)t} v(0). Expansion of this recovers the dissipation of energy by an adaptive friction as proposed by SGNHT, and we extend it to an embedded space.\n\nNow we consider incorporating the stochastic gradient. Only the common dynamics O is affected. Similar to Eqn. 
(1), we express the stochastic gradient as ∇_x Ũ_H(x) = ∇_x U_H(x) + N(0, V(x)), then reformulate the solution of dynamics O as\n\nv(t) = v(0) + Λ(x(0)) · [−∇_x Ũ_H(x(0)) t + N(0, 2Ct − V(x(0))t²)]. (8)\n\nTo estimate the usually unknown V(x), a simple way is just to take it as zero, in the sense that V(x)t² is a higher-order infinitesimal of 2Ct for t as a small simulation step size. Another way to estimate V(x) is by the empirical Fisher information, as is done in [2].\n\nFinally, as SSI suggests, we simulate the complete dynamics by exactly simulating these solutions alternately in an “ABOBA” pattern: for a time step size of ε, each A and B sub-step advances by ε/2 and the O step by ε. As with other SG-MCMCs, we omit the unscalable Metropolis-Hastings test, but the consistency of, e.g., estimation by averaging over samples drawn from SG-MCMCs is still guaranteed [8]. Algorithms for SGGMC and gSGNHT are listed in Appendix E.\n\n4 Application to the Spherical Admixture Model\n\nWe now apply SGGMC/gSGNHT to solve the challenging task of posterior inference in the Spherical Admixture Model (SAM) [24]. SAM is a Bayesian topic model for spherical data (each datum lies on some S^{d−1}), such as the tf-idf representation of text data. It enables more feature representations for hierarchical Bayesian models, and has the benefit over Latent Dirichlet Allocation (LDA) [5] of directly modeling the absence of words. The structure of SAM is shown in Fig. 2. Each document v_d, each topic β_k, the corpus mean μ and the hyper-parameter m are all in S^{V−1} with V the vocabulary size. 
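Before detailing SAM, the “ABOBA” simulation above can be made concrete. Below is a minimal self-contained sketch of SGGMC on S^{d−1} for a vMF-like target π_H(x) ∝ exp{κμ⊤x}; it is our own illustration, not the authors' released algorithm, and it uses the exact gradient (V(x) = 0 in Eqn. (8)) with a scalar C:\n\n```python
import numpy as np

def sggmc_step(x, v, grad_U, eps, C, rng):
    # One ABOBA step of SGGMC on the unit sphere S^{d-1}.
    def A(x, v, t):                      # geodesic flow, Eqn. (7)
        a = np.linalg.norm(v)
        if a == 0.0:
            return x, v
        return (x * np.cos(a * t) + (v / a) * np.sin(a * t),
                -a * x * np.sin(a * t) + v * np.cos(a * t))
    def B(v, t):                         # friction for scalar C: v <- exp(-C t) v
        return np.exp(-C * t) * v
    def O(x, v, t):                      # gradient kick + noise, projected to the tangent space
        proj = np.eye(len(x)) - np.outer(x, x)            # Lambda(x) = I - x x^T
        noise = rng.normal(size=len(x)) * np.sqrt(2.0 * C * t)
        return v + proj @ (-grad_U(x) * t + noise)
    x, v = A(x, v, eps / 2)              # A and B advance by eps/2, O by eps
    v = B(v, eps / 2)
    v = O(x, v, eps)
    v = B(v, eps / 2)
    x, v = A(x, v, eps / 2)
    return x, v

# Target: pi_H(x) prop. to exp{kappa mu^T x}, i.e. U_H(x) = -kappa mu^T x.
kappa, mu = 5.0, np.array([0.0, 0.0, 1.0])
grad_U = lambda x: -kappa * mu
rng = np.random.default_rng(1)
x, v = np.array([1.0, 0.0, 0.0]), np.zeros(3)
samples = []
for i in range(4000):
    x, v = sggmc_step(x, v, grad_U, eps=0.1, C=1.0, rng=rng)
    if i >= 1000:
        samples.append(x)
samples = np.asarray(samples)
```\n\nSince every sub-step is solved exactly, the samples stay on the sphere up to floating-point error, and their mean direction should align with μ.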
Each topic proportion θ_d is in the (K − 1)-dim simplex with K the number of topics.\n\n³p⊤G⁻¹p = (G⁻¹p)⊤G(G⁻¹p) = q̇⊤(M⊤M)q̇ = (M q̇)⊤(M q̇) = v⊤v for an isometric embedding.\n\nSAM uses the von Mises-Fisher distribution (vMF) (see e.g. [20]) to model variables on hyperspheres. The vMF on S^{d−1} with mean μ ∈ S^{d−1} and concentration parameter κ ∈ R+ has pdf (w.r.t. the Hausdorff measure) vMF(x|μ, κ) = c_d(κ) exp{κ μ⊤x}, where c_d(κ) = κ^{d/2−1}/((2π)^{d/2} I_{d/2−1}(κ)) and I_r(·) denotes the modified Bessel function of the first kind and order r.\n\nThen the generating process of SAM is:\n\n• Draw μ ∼ vMF(μ|m, κ0);\n• For k = 1, . . . , K, draw topic β_k ∼ vMF(β_k|μ, σ);\n• For d = 1, . . . , D, draw θ_d ∼ Dir(θ_d|α) and v_d ∼ vMF(v_d|v̄(β, θ_d), κ),\n\nwhere v̄(β, θ_d) ≜ βθ_d/‖βθ_d‖, with β ≜ (β_1, . . . , β_K), is an approximate spherical weighted mean of the topics. The joint distribution of (v ≜ (v_1, . . . , v_D), β, θ ≜ (θ_1, . . . , θ_D), μ) is then known.\n\nFigure 2: An illustration of the SAM model structure.\n\nThe inference task is to estimate the topic posterior π(β|v). As it is intractable, [24] provides a mean-field variational inference method and solves an optimization problem under the spherical constraint, which is tackled by repeatedly normalizing. However, this treatment is not applicable to most sampling methods since it may corrupt the distribution of the samples. 
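The vMF normalizer c_d(κ) above involves a Bessel function that grows like e^κ, so evaluating it naively overflows for large κ. A numerically stable log-density sketch (our illustration, assuming SciPy's exponentially scaled Bessel function `ive`; not SAM's released code):\n\n```python
import numpy as np
from scipy.special import ive  # ive(r, k) = I_r(k) * exp(-k), so log I_r(k) = log ive(r, k) + k

def vmf_logpdf(x, mu, kappa):
    # log vMF(x|mu, kappa) on S^{d-1} w.r.t. the Hausdorff measure.
    d = len(mu)
    r = d / 2.0 - 1.0
    log_cd = (r * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi)
              - (np.log(ive(r, kappa)) + kappa))
    return log_cd + kappa * float(mu @ x)
```\n\nOn S¹ (d = 2) the density reduces to exp{κ cos θ}/(2π I₀(κ)), so integrating exp(vmf_logpdf) around the circle should return 1, and the log-density stays finite even for very large κ.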
[24] tries a simple adaptive Metropolis-Hastings sampler with undesirable results, and no further attempt at sampling methods appears. Due to the lack of a global coordinate system for the hypersphere, most Riemann manifold samplers, including SGRLD and SGRHMC, fail. To our knowledge, only CHMC and GMC are suitable, yet they are not scalable. Our samplers are appropriate for the task, with the advantage of scalability.\n\nNow we present our inference method, which uses SGGMC/gSGNHT to directly sample from π(β|v). First we note that μ can be collapsed analytically and the marginalized distribution of (v, β, θ) is:\n\nπ(v, β, θ) = c_V(κ0) c_V(σ)^K c_V(‖m̄(β)‖)⁻¹ Π_{d=1}^D Dir(θ_d|α) vMF(v_d|v̄(β, θ_d), κ), (9)\n\nwhere m̄(β) ≜ κ0 m + σ Σ_{k=1}^K β_k. To sample from π(β|v) using our samplers, we only need a stochastic estimate of the gradient of the potential energy ∇_β U(β|v) ≜ −∇_β log π(β|v), which can be estimated by adopting the technique used in [11]:\n\n∇_β log π(β|v) = (1/π(β|v)) ∇_β ∫ π(β, θ|v) dθ = ∫ (π(β, θ|v)/π(β|v)) (∇_β π(β, θ|v)/π(β, θ|v)) dθ = E_{π(θ|β,v)}[∇_β log π(β, θ|v)],\n\nwhere ∇_β log π(β, θ|v) = ∇_β log π(v, β, θ) is known, and the expectation can be estimated by averaging over a set of samples {θ^{(n)}}_{n=1}^N from π(θ|v, β): ∇_β U(β|v) ≈ −(1/N) Σ_{n=1}^N ∇_β log π(v, β, θ^{(n)}). To draw {θ^{(n)}}_{n=1}^N, noting the simplex constraint and that the target distribution π(θ|v, β) is known up to a constant multiplier, we use GMC for this task.\n\nTo scale up, we use a subset {d(s)}_{s=1}^S of indices of randomly chosen items from the whole dataset to get a stochastic estimate of each ∇_β log π(v, β, θ^{(n)}). 
The final stochastic gradient is:\n\n∇_β Ũ(β|v) ≈ ∇_β log c_V(‖m̄(β)‖) − κ (D/(NS)) Σ_{n=1}^N Σ_{s=1}^S ∇_β[v_{d(s)}⊤ v̄(β, θ^{(n)}_{d(s)})]. (10)\n\nThe inference algorithm for SAM by SGGMC/gSGNHT is summarized in Alg. 3 in Appendix E.\n\n5 Experiments\n\nWe present empirical results on both synthetic and real datasets to show the accuracy and efficiency of our methods. All target densities are expressed in the embedded space w.r.t. the Hausdorff measure, so we omit the subscript “H”. Synthetic experiments are run only for SGGMC, since the advantage of using thermostats has been shown by [10] and the effectiveness of gSGNHT is presented on the real datasets. Detailed settings of the experiments are provided in Appendix F.\n\n5.1 Toy Experiment\n\nWe first present the utility and check the correctness of SGGMC by a greenhouse experiment with known stochastic gradient noise. Consider sampling from a circle (S¹) for easy visualization. We set the target distribution such that the potential energy is U(x) = −log(exp{5μ₁⊤x} + 2 exp{5μ₂⊤x}), where x, μ₁, μ₂ ∈ S¹ and μ₁ = −μ₂ = π/3 (angle from the +x direction).\n\nFigure 4: (a-b): True and empirical densities for π(v1|D) and π(v2|D). 
produced by corrupting with N(0, 1000I), whose variance is used as V(x) in Eqn. (8) for sampling. Fig. 3(a) shows 100 samples from SGGMC and the empirical distribution of 10,000 samples in the embedded space R^2. True and empirical distributions are compared in Fig. 3(b) in angle space (the local coordinate space). We see no obvious corruption of the result when the stochastic gradient is used.

It should be stressed that although it is possible to apply scalable methods like SGRLD in spherical coordinate systems (which are almost global ones), working out the form of e.g. the Riemann metric tensor is troublesome, and special treatments such as reflection at boundaries have to be considered. Numerical instability at boundaries also tends to appear. All these issues get even worse in higher dimensions. Our methods work in embedded spaces, so all these issues are bypassed and the methods extend elegantly to high dimensions.

Figure 3: Toy experiment results: (a) samples by SGGMC and their empirical distribution in the embedded space; (b) comparison of true and empirical distributions in angle space.

5.2 Synthetic Experiment
We then test SGGMC on a simple Bayesian posterior estimation task. We adopt a model with a structure similar to the one used in [29]. Consider a mixture model of two vMFs on S^1 with equal weights:

π(v1) = vMF(v1|e1, κ1), π(v2) = vMF(v2|e1, κ2), π(xi|v1, v2) ∝ vMF(xi|v1, κx) + vMF(xi|μ, κx),

where e1 = (1, 0) and μ ≜ (v1 + v2)/‖v1 + v2‖. The task is to infer the posterior π(v1, v2|D), where D = {xi}_{i=1}^{D=100} is our synthetic data, generated from the likelihood with v1 = −π/24, v2 = π/8 and κ1 = κ2 = κx = 20 by GMC. SGGMC uses the empirical Fisher information in the way of [2] for V(x) in Eqn. (8), and uses a batch size of 10. Fig. 4(a-b) show the true and empirical marginal posteriors of v1 and v2, and Fig. 4(c) presents the empirical joint posterior by samples from SGGMC together with its true density. We see that samples from SGGMC exhibit no observable corruption when a mini-batch is used, and fully explore the two modes and the strong correlation of v1 and v2.⁴

Figure 4: (a-b) true and empirical marginal posteriors of v1 and v2; (c) true (left) and empirical by SGGMC (right) densities for π(v1, v2|D).

⁴Appendix D provides a rationale on the shape of the joint posterior.

5.3 Spherical Admixture Models
Setups For baselines, we compare with the mean-field variational inference (VI) by [24] and its stochastic version (StoVI) based on [15], as well as GMC methods. It is problematic for GMC to directly sample from the target distribution π(β|v), since the potential energy, which is required for the Metropolis-Hastings (MH) test in GMC, is hard to estimate. An approximate Monte Carlo estimation is provided in Appendix B, and the corresponding method for SAM is GMC-apprMH. An alternative is GMC-bGibbs, which adopts blockwise Gibbs sampling to alternately sample from π(β|θ, v) and π(θ|β, v) (both known up to a constant multiplier) using GMC.

We evaluate the methods by log-perplexity, the average negative log-likelihood on a held-out test set Dtest. Variational methods produce a single point estimate β̂, for which log-perp = −(1/|Dtest|) Σ_{d∈Dtest} log π(vd|β̂). Sampling methods draw a set of samples {β^(m)}_{m=1}^{M}, for which log-perp = −(1/|Dtest|) Σ_{d∈Dtest} log((1/M) Σ_{m=1}^{M} π(vd|β^(m))). In both cases the intractable π(vd|β) needs to be estimated. By noting that π(vd|β) = ∫ π(vd, θd|β) dθd = E_{π(θd|β)}[π(vd|β, θd)], we estimate it by averaging π(vd|β, θd^(n)) (exactly known from the generating process) over samples {θd^(n)}_{n=1}^{N} drawn from π(θd|β) = π(θd) = Dir(α), the prior of θd. The log-perplexity is not comparable among different models, so we exclude LDA from our baselines.

We show the performance of all methods on a small and a large dataset. Hyper-parameters of SAM are fixed during training and set the same for all methods. V(x) in Eqn. (8) is taken to be zero for SGGMC/gSGNHT. All sampling methods are implemented⁵ in C++ and fairly parallelized by OpenMP. VI/StoVI are run with the MATLAB code by [24] and we only use their final scores for comparison. Appendix F gives further implementation details, including techniques to avoid overflow.

On the small dataset The small dataset is the 20News-different dataset used by [24], which consists of 3 categories from the 20Newsgroups dataset. It is small (1,666 training and 1,107 test documents), so we have the chance to see the eventual results of all methods. We use 20 topics and a batch size of 50. Fig. 5(a) shows the performance of all methods. We can see that our SGGMC and gSGNHT perform better than others.
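For concreteness, the synthetic data generator of Sec. 5.2 can be sketched in a few lines; this is our own illustration, not the released implementation. On S^1 the vMF distribution coincides with the von Mises distribution over angles, so NumPy's sampler suffices (variable names such as `theta1` are ours):

```python
import numpy as np

# Sketch of the synthetic data generator for the two-vMF mixture on S^1.
rng = np.random.default_rng(0)
kappa = 20.0                                   # kappa_1 = kappa_2 = kappa_x = 20
theta1, theta2 = -np.pi / 24, np.pi / 8        # angles of the true v1, v2
v1 = np.array([np.cos(theta1), np.sin(theta1)])
v2 = np.array([np.cos(theta2), np.sin(theta2)])
mu = (v1 + v2) / np.linalg.norm(v1 + v2)       # mu = (v1 + v2) / ||v1 + v2||
mu_angle = np.arctan2(mu[1], mu[0])

# Equal-weight mixture: each x_i comes from vMF(x|v1, kappa_x) or
# vMF(x|mu, kappa_x); on S^1 the vMF is the von Mises distribution of angles.
D = 100
comp = rng.integers(0, 2, size=D)
angles = rng.vonmises(np.where(comp == 0, theta1, mu_angle), kappa)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # D unit vectors on S^1
```

Sampling the posterior π(v1, v2|D) given such an `X` is then the job of GMC or SGGMC.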
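The Monte Carlo log-perplexity used for evaluation above can likewise be sketched in Python (the paper's released implementation is in C++; `log_perplexity` and the `likelihood` placeholder below are our own illustrative names):

```python
import numpy as np

def log_perplexity(test_docs, beta_samples, likelihood, alpha, n_theta=50, rng=None):
    """Monte Carlo estimate of the log-perplexity on a held-out set.

    `likelihood(v, beta, theta)` stands in for the model-specific
    pi(v_d | beta, theta_d), known exactly from the generating process;
    the intractable pi(v_d | beta) is estimated by averaging it over
    theta_d drawn from the prior Dir(alpha).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    total = 0.0
    for v in test_docs:
        per_beta = []
        for beta in beta_samples:              # posterior samples beta^(m)
            thetas = rng.dirichlet(alpha, size=n_theta)
            per_beta.append(np.mean([likelihood(v, beta, th) for th in thetas]))
        total += np.log(np.mean(per_beta))     # log of the averaged likelihoods
    return -total / len(test_docs)
```

Passing a single-element `beta_samples` list containing a variational point estimate recovers the VI formula as a special case.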
VI converges swiftly but cannot go any lower due to the intrinsic gap between the mean-field variational distribution and the true posterior. StoVI converges more slowly than VI in this small-scale case, and exhibits the same limit. All sampling methods eventually go below the variational methods, and ours go the lowest. gSGNHT shows its benefit by outperforming SGGMC under the same setting. For our methods, an appropriately smaller batch size achieves a better result due to the speed-up from subsampling. Note that even the full-batch SGGMC and gSGNHT outperform the GMC variants. This may be because the randomness in the dynamics helps the sampler jump from one local mode to another for better exploration.

Figure 5: Evolution of log-perplexity along wall time of all methods on (a) the 20News-different dataset and (b) the 150K Wikipedia subset.

On the large dataset For the large dataset, we use a subset of the Wikipedia dataset with 150K training and 1K test documents, to challenge the scalability of all the methods. We use 50 topics and a batch size of 100. Fig. 5(b) shows the outcome. We see that the gap between our methods and the other baselines gets larger, indicating our better scalability. The bounded curves of VI/StoVI, the advantage of using thermostats, and the subsampling speed-up appear again. Our full-batch versions are still better than the GMC variants. GMC-apprMH and GMC-bGibbs scale badly; they converge slowly in this case.

6 Conclusions and Discussions
We propose SGGMC and gSGNHT, SG-MCMC methods for scalable sampling from manifolds with known geodesic flow. They are saliently efficient in their applications. Novel dynamics are constructed and 2nd-order geodesic integrators are developed. We apply the methods to the SAM topic model for more accurate and scalable inference.
Synthetic experiments verify the validity, and experiments for SAM on real-world data show an obvious advantage in accuracy over variational inference methods and in scalability over other applicable sampling methods. Broad applications of our methods remain possible, including models involving the vMF distribution (e.g. mixtures of vMFs [4, 14, 28], DP mixtures of vMFs [12, 3, 27]), constrained distributions [17] (e.g. the truncated Gaussian), and distributions on the Stiefel manifold (e.g. Bayesian matrix completion [25]), where the ability to scale up will be appealing.

Acknowledgments
The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), National NSF of China Projects (Nos. 61620106010, 61322308, 61332007), the Youth Top-notch Talent Support Program, and the Tsinghua Initiative Scientific Research Program (No. 20141080934).

⁵All the codes and data can be found at http://ml.cs.tsinghua.edu.cn/~changliu/sggmcmc-sam/.

References
[1] Ralph Abraham and Jerrold E. Marsden. Foundations of mechanics. Benjamin/Cummings Publishing Company, Reading, Massachusetts, 1978.
[2] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380, 2012.
[3] Nguyen Kim Anh, Nguyen The Tam, and Ngo Van Linh. Document clustering using Dirichlet process mixture model of von Mises-Fisher distributions.
In The 4th International Symposium on Information and Communication Technology (SoICT 2013), pages 131–138, 2013.
[4] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[6] Marcus A. Brubaker, Mathieu Salzmann, and Raquel Urtasun. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 161–172, 2012.
[7] Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
[8] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2269–2277, 2015.
[9] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1683–1691, 2014.
[10] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
[11] Chao Du, Jun Zhu, and Bo Zhang. Learning deep generative models with doubly stochastic MCMC. arXiv preprint arXiv:1506.04557, 2015.
[12] Kaushik Ghosh, Rao Jammalamadaka, and Ram C. Tiwari. Semiparametric Bayesian techniques for problems in circular data. Journal of Applied Statistics, 30(2):145–161, 2003.
[13] Mark Girolami and Ben Calderhead.
Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[14] Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
[15] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[16] I. M. James. The topology of Stiefel manifolds, volume 24. Cambridge University Press, 1976.
[17] Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical Hamiltonian Monte Carlo for constrained target distributions. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 629–637, 2014.
[18] Chunyuan Li, Changyou Chen, Kai Fan, and Lawrence Carin. High-order stochastic gradient thermostats for Bayesian learning of deep models. arXiv preprint arXiv:1512.07662, 2015.
[19] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2899–2907, 2015.
[20] Kanti V. Mardia and Peter E. Jupp. Distributions on spheres. Directional Statistics, pages 159–192, 2000.
[21] John Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, pages 20–63, 1956.
[22] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
[23] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
[24] Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney.
Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 903–910, 2010.
[25] Yang Song and Jun Zhu. Bayesian matrix completion via adaptive relaxed spectral regularization. In The 30th AAAI Conference on Artificial Intelligence (AAAI-16), 2016.
[26] Eduard L. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Mathematici Helvetici, 8(1):305–353, 1935.
[27] Julian Straub, Jason Chang, Oren Freifeld, and John W. Fisher III. A Dirichlet process mixture model for spherical data. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 930–938, 2015.
[28] Jalil Taghia, Zhanyu Ma, and Arne Leijon. Bayesian estimation of the von Mises-Fisher mixture model with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1701–1715, 2014.
[29] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.