{"title": "Towards Unifying Hamiltonian Monte Carlo and Slice Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 1741, "page_last": 1749, "abstract": "We unify slice sampling and Hamiltonian Monte Carlo (HMC) sampling, demonstrating their connection via the Hamiltonian-Jacobi equation from Hamiltonian mechanics. This insight enables extension of HMC and slice sampling to a broader family of samplers, called Monomial Gamma Samplers (MGS). We provide a theoretical analysis of the mixing performance of such samplers, proving that in the limit of a single parameter, the MGS draws decorrelated samples from the desired target distribution. We further show that as this parameter tends toward this limit, performance gains are achieved at a cost of increasing numerical difficulty and some practical convergence issues. Our theoretical results are validated with synthetic data and real-world applications.", "full_text": "Towards Unifying Hamiltonian Monte Carlo\n\nand Slice Sampling\n\nYizhe Zhang, Xiangyu Wang, Changyou Chen, Ricardo Henao, Kai Fan, Lawrence Carin\n\n{yz196,xw56,changyou.chen, ricardo.henao, kf96 , lcarin} @duke.edu\n\nDuke University\n\nDurham, NC, 27708\n\nAbstract\n\nWe unify slice sampling and Hamiltonian Monte Carlo (HMC) sampling, demon-\nstrating their connection via the Hamiltonian-Jacobi equation from Hamiltonian\nmechanics. This insight enables extension of HMC and slice sampling to a broader\nfamily of samplers, called Monomial Gamma Samplers (MGS). We provide a\ntheoretical analysis of the mixing performance of such samplers, proving that in\nthe limit of a single parameter, the MGS draws decorrelated samples from the\ndesired target distribution. We further show that as this parameter tends toward this\nlimit, performance gains are achieved at a cost of increasing numerical dif\ufb01culty\nand some practical convergence issues. 
Our theoretical results are validated with synthetic data and real-world applications.

1 Introduction

Markov Chain Monte Carlo (MCMC) sampling [1] stands as a fundamental approach for probabilistic inference in many computational statistical problems. In MCMC one typically seeks to design methods to efficiently draw samples from an unnormalized density function. Two popular auxiliary-variable sampling schemes for this task are Hamiltonian Monte Carlo (HMC) [2, 3] and the slice sampler [4]. HMC exploits gradient information to propose samples along a trajectory that follows Hamiltonian dynamics [3], introducing momentum as an auxiliary variable. Extending the random-walk proposal associated with Metropolis-Hastings sampling [4], HMC is often able to propose large moves with acceptance rates close to one [2]. Recent attempts toward improving HMC have leveraged geometric manifold information [5] and have used better numerical integrators [6]. Limitations of HMC include sensitivity to parameter tuning and restriction to continuous distributions. These issues can be partially addressed by using adaptive approaches [7, 8], and by transforming sampling from discrete distributions into sampling from continuous ones [9, 10].

Seemingly distinct from HMC, the slice sampler [4] alternates between drawing conditional samples based on a target distribution and a uniformly distributed slice variable (the auxiliary variable). One problem with the slice sampler is the difficulty of solving for the slice interval, i.e., the domain of the uniform distribution, especially in high dimensions. As a consequence, adaptive methods are often applied [4]. Alternatively, one recent attempt to perform efficient slice sampling on latent Gaussian models samples from a high-dimensional elliptical curve parameterized by a single scalar [11]. 
It has been shown that in some cases slice sampling is more efficient than Gibbs sampling and Metropolis-Hastings, due to the adaptability of the sampler to the scale of the region currently being sampled [4].

Despite the success of slice sampling and HMC, little research has been performed to investigate their connections. In this paper we use the Hamilton-Jacobi equation from classical mechanics to show that slice sampling is equivalent to HMC with a (simply) generalized kinetic function. Further, we also show that different settings of the HMC kinetic function correspond to generalized slice sampling, with a non-uniform conditional slicing distribution. Based on this relationship, we develop theory to analyze the newly proposed broad family of auxiliary-variable-based samplers. We prove that under this special family of distributions for the momentum in HMC, as the distribution becomes more heavy-tailed, the one-step autocorrelation of samples from the target distribution converges asymptotically to zero, leading to potentially decorrelated samples. While of limited practical impact, this theoretical result provides insights into the properties of the proposed family of samplers. We also elaborate on the practical tradeoff between increased computational complexity and improved theoretical sampling efficiency. In the experiments, we validate our theory on both synthetic data and real-world problems, including Bayesian Logistic Regression (BLR) and Independent Component Analysis (ICA), for which we compare the mixing performance of our approach with that of standard HMC and slice sampling.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Solving Hamiltonian dynamics via the Hamilton-Jacobi equation

A Hamiltonian system consists of a kinetic function K(p) with momentum variable p ∈ R, and a potential energy function U(x) with coordinate x ∈ R. 
We elaborate on multivariate cases in the Appendix. The dynamics of a Hamiltonian system are completely determined by a set of first-order Partial Differential Equations (PDEs) known as Hamilton's equations [12]:

∂p/∂τ = −∂H(x, p, τ)/∂x ,    ∂x/∂τ = ∂H(x, p, τ)/∂p ,    (1)

where H(x, p, τ) = K(p(τ)) + U(x(τ)) is the Hamiltonian, and τ is the system time. Solving (1) gives the dynamics of x(τ) and p(τ) as a function of system time τ. In a Hamiltonian system governed by (1), H(·) is a constant for every τ [12]. A specified H(·), together with the initial point {x(0), p(0)}, defines a Hamiltonian trajectory {{x(τ), p(τ)} : ∀τ} in {x, p} space.

It is well known that in many practical cases, a direct solution to (1) may be difficult [13]. Alternatively, one might seek to transform the original HMC system {H(·), x, p, τ} to a dual space {H'(·), x', p', τ}, in hope that the transformed PDEs in the dual space become simpler than the original PDEs in (1). One promising approach consists of using the Legendre transformation [12]. This family of transformations defines a unique mapping between primed and original variables, where the system time, τ, is identical. In the transformed space, the resulting dynamics are often simpler than in the original Hamiltonian system.

An important property of the Legendre transformation is that the form of (1) is preserved in the new space [14], i.e., ∂p'/∂τ = −∂H'(x', p', τ)/∂x' and ∂x'/∂τ = ∂H'(x', p', τ)/∂p'. To guarantee a valid Legendre transformation between the original Hamiltonian system {H(·), x, p, τ} and the transformed Hamiltonian system {H'(·), x', p', τ}, both systems should satisfy Hamilton's principle [13], which is equivalent to Hamilton's equations (1). 
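As a quick numerical illustration of the conservation property above (our own sketch; the quadratic potential U(x) = x²/2, the kinetic function K(p) = p², and the RK4 integrator are illustrative choices, not from the paper):

```python
import numpy as np

# Check that Hamilton's equations (1) keep H(x, p) = K(p) + U(x) constant.
# Illustrative choices: U(x) = x^2 / 2, K(p) = p^2, classical RK4 integration.

def hamilton_rhs(state):
    x, p = state
    return np.array([2.0 * p,   # dx/dtau =  dH/dp = 2p
                     -x])       # dp/dtau = -dH/dx = -x

def rk4_step(state, h):
    k1 = hamilton_rhs(state)
    k2 = hamilton_rhs(state + 0.5 * h * k1)
    k3 = hamilton_rhs(state + 0.5 * h * k2)
    k4 = hamilton_rhs(state + h * k3)
    return state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def hamiltonian(state):
    x, p = state
    return p ** 2 + 0.5 * x ** 2

state = np.array([1.0, 0.5])
h0 = hamiltonian(state)          # 0.75 for this initial point
for _ in range(1000):
    state = rk4_step(state, 0.01)
drift = abs(hamiltonian(state) - h0)
```

Along the trajectory the drift in H stays at integrator accuracy, far below the scale of H itself; this constancy of H(·) is used throughout this section.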
The form of this Legendre transformation is not unique. One possibility is to use a generating function approach [13], which requires the transformed variables to satisfy p · ∂x/∂τ − H(x, p, τ) = p' · ∂x'/∂τ − H'(x', p', τ) + dG(x, x', p', τ)/dτ, where dG(x, x', p', τ)/dτ follows from the chain rule and G(·) is a Type-2 generating function defined as G(·) ≜ −x' · p' + S(x, p', τ) [14], with S(x, p', τ) being Hamilton's principal function [15], defined below. The following holds due to the independence of x, x' and p' in the previous transformation (after replacing G(·) by its definition):

p = ∂S(x, p', τ)/∂x ,    x' = ∂S(x, p', τ)/∂p' ,    H'(x', p', τ) = H(x, p, τ) + ∂S(x, p', τ)/∂τ .    (2)

We then obtain the desired Legendre transformation by setting H'(x', p', τ) = 0. The resulting (2) is known as the Hamilton-Jacobi equation (HJE). We refer the reader to [13, 12] for extensive discussions on the Legendre transformation and the HJE.

Recall from above that the Legendre transformation preserves the form of (1). Since H'(x', p', τ) = 0, {x', p'} are time-invariant (constant for every τ). Importantly, the time-invariant point {x', p'} corresponds to a Hamiltonian trajectory in the original space, and it defines the initial point {x(0), p(0)} in the original space {x, p}; hence, given {x', p'}, one may update the point along the trajectory by specifying the time τ. A new point {x(τ), p(τ)} in the original space along the Hamiltonian trajectory, with system time τ, can be determined from the transformed point {x', p'} by solving (2). One typically specifies the kinetic function as K(p) = p^2 [2], and Hamilton's principal function as S(x, p', τ) = W(x) − p'τ, where W(x) is a function to be determined (defined below). 
From (2) and the definition of S(·), we can write

H(x, p, τ) + ∂S(x, p', τ)/∂τ = H(x, p, τ) − p' = U(x) + [∂S(x, p', τ)/∂x]^2 − p' = 0 ,    (3)

where the second equality is obtained by replacing H(x, p, τ) = U(x(τ)) + K(p(τ)), and the third equality by replacing p from (2) into K(p(τ)). From (3), p' = H(x, p, τ) represents the total Hamiltonian in the original space {x, p}, and uniquely defines a Hamiltonian trajectory in {x, p}. Define X ≜ {x : H(·) − U(x) ≥ 0} as the slice interval, which for constant p' = H(x, p, τ) corresponds to the set of valid coordinates in the original space {x, p}. Solving (3) for W(x) gives

W(x) = ∫_{xmin}^{x(τ)} f(z)^{1/2} dz + C ,    f(z) = H(·) − U(z) if z ∈ X, and f(z) = 0 if z ∉ X ,    (4)

where xmin = min{x : x ∈ X} and C is a constant. In addition, from (2) we have

x' = ∂S(x, p', τ)/∂p' = ∂W(x)/∂H − τ = (1/2) ∫_{xmin}^{x(τ)} f(z)^{−1/2} dz − τ ,    (5)

where the second equality is obtained by substituting S(·) by its definition, and the third equality by applying Fubini's theorem to (4). Hence, for constant {x', p' = H(x, p, τ)}, equation (5) uniquely defines x(τ) in the original space, for a specified system time τ.

3 Formulating HMC as a Slice Sampler

3.1 Revisiting HMC and Slice Sampling

Suppose we are interested in sampling a random variable x from an unnormalized density function f(x) ∝ exp[−U(x)], where U(x) is the potential energy function. Hamiltonian Monte Carlo (HMC) augments the target density with an auxiliary momentum random variable p, that is independent of x. 
The distribution of p is specified as ∝ exp[−K(p)], where K(p) is the kinetic energy function. Define H(x, p) = U(x) + K(p) as the Hamiltonian. We have omitted the dependency of H(·), x and p on the system time τ for simplicity. HMC iteratively performs dynamic evolving and momentum resampling steps, by sampling xt from the target distribution and pt from the momentum distribution (Gaussian, as K(p) = p^2), respectively, for iterations t = 1, 2, . . . Figure 1 illustrates two iterations of this procedure. Starting from point {xt(0), pt(0)} at the t-th (discrete) iteration, HMC leverages the Hamiltonian dynamics, governed by Hamilton's equations in (1), to propose the next sample {xt(τt), pt(τt)} at system time τt. The position in HMC at iteration t + 1 is updated as xt+1(0) = xt(τt) (dynamic evolving). A new momentum pt+1(0) is resampled independently from a Gaussian distribution (assuming K(p) = p^2), establishing the next initial point {xt+1(0), pt+1(0)} for iteration t + 1 (momentum resampling). The latter point corresponds to the initial point of a new trajectory because the Hamiltonian H(·) is commensurately updated. This means that trajectories correspond to distinct values of H(·).

Typically, numerical integrators such as the leap-frog method [2] are employed to numerically approximate the Hamiltonian dynamics. In practice, a random number (uniformly drawn from a fixed range) of discrete numerical integration steps (leap-frog steps) is often used (corresponding to a random time τt along the trajectory), which has been shown to have better convergence properties than a single leap-frog step [16]. The discretization error introduced by the numerical integration is corrected by a Metropolis-Hastings (MH) step.

Slice sampling is conceptually simpler than HMC. 
It augments the target unnormalized density f(x) with a random variable y, with joint distribution expressed as p(x, y) = Z1^{−1}, s.t. 0 < y < f(x), where Z1 = ∫ f(x)dx is the normalization constant, and the marginal distribution of x exactly recovers the target normalized distribution f(x)/Z1. To sample from the target density, slice sampling iteratively performs a conditional sampling step from p(x|y) and samples a slice from p(y|x). At iteration t, starting from xt, a slice yt is uniformly drawn from (0, f(xt)). Then, the next sample xt+1, at iteration t + 1, is uniformly drawn from the slice interval {x : f(x) > yt}.

Figure 1: Representation of HMC sampling. Points {xt(0), pt(0)} and {xt+1(0), pt+1(0)} represent HMC samples at iterations t and t + 1, respectively. The trajectories for t and t + 1 correspond to distinct Hamiltonian levels Ht(·) and Ht+1(·), denoted as black and red lines, respectively.

HMC and slice sampling both augment the target distribution with auxiliary variables and can propose long-range moves with high acceptance probability.

3.2 Formulating HMC as a Slice Sampler

Consider the dynamic evolving step in HMC, i.e., {xt(0), pt(0)} ↦ {xt(τ), pt(τ)} in Figure 1. From Section 2, the Hamiltonian dynamics in {x, p} space with initial point {x(0), p(0)} can be performed by mapping to {x', p'} space and updating {x(τ), p(τ)} via selecting a τ and solving (5). As we show in the Appendix, from (5) and in univariate cases* the Hamiltonian dynamics has period ∫_X [H(·) − U(z)]^{−1/2} dz and is symmetric along p = 0 (due to the symmetric form of the kinetic function). Also from (5), the system time, τ, is sampled uniformly from a half-period of the Hamiltonian dynamics, i.e., τ ∼ Uniform(−x', −x' + (1/2) ∫_X [H(·) − U(z)]^{−1/2} dz). Intuitively, −x' is the “anchor” of the initial point {x(0), p(0)}, w.r.t. 
the start of the first half period, i.e., when (1/2) ∫_{xmin}^{x(τ)} [H(·) − U(z)]^{−1/2} dz = 0. Further, we need only consider half a period because, for a symmetric kinetic function K(p) = p^2, the Hamiltonian dynamics for the two half-periods are mirrored [14]. For the same reason, Figure 1 only shows half of the {x, p} space, when p ≥ 0.

Given the sampled τ and the constant {x', p'}, equation (5) can be solved for x* ≜ x(τ), i.e., the value of x at time τ. Interestingly, the integral in (5) can be interpreted as (up to a normalization constant) a cumulative density function (CDF) of x(τ). By the inverse-CDF transform sampling method, uniformly sampling τ from half of a period and solving for x* from (5) is equivalent to directly sampling x* from the following density:

p(x*|H(·)) ∝ [H(·) − U(x*)]^{−1/2} ,    s.t. H(·) − U(x*) ≥ 0 .    (6)

We note that this transformation does not make the analytic solution of x(τ) generally tractable. However, it provides the basic setup to reveal the connection between the slice sampler and HMC. In the momentum resampling step of HMC, i.e., {xt(τ), pt(τ)} ↦ {xt+1(0), pt+1(0)} in Figure 1, and using the previously described kinetic function, K(p) = p^2, resampling corresponds to drawing p from a Gaussian distribution [2].

The algorithm to analytically sample from the HMC (analytic HMC) proceeds as follows: at iteration t, momentum pt is drawn from a Gaussian distribution. The previously sampled value of xt−1 and the newly sampled pt yield a Hamiltonian Ht(·). Then, the next sample xt is drawn from (6). This procedure relates HMC to the slice sampler. To clearly see the connection, we denote yt = e^{−Ht(·)}. Instead of directly sampling {p, x} as just described, we sample {y, x} instead. 
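The claim that uniform sampling of the system time matches the density (6) can be checked on a toy quadratic potential (our own sketch: with U(x) = x² the exact dynamics are x(τ) = √H cos(2τ + φ), and (6) becomes the arcsine law):

```python
import numpy as np

rng = np.random.default_rng(0)
H, n = 2.0, 200_000

# Route 1: tau uniform over a half period (length pi/2 here), exact dynamics
# started at the turning point x(0) = -sqrt(H), p(0) = 0.
tau = rng.uniform(0.0, np.pi / 2.0, size=n)
x_dyn = -np.sqrt(H) * np.cos(2.0 * tau)

# Route 2: x* drawn directly from (6), p(x*|H) ∝ [H - x*^2]^(-1/2),
# via the inverse CDF of the arcsine law on (-sqrt(H), sqrt(H)).
u = rng.uniform(0.0, 1.0, size=n)
x_cdf = np.sqrt(H) * np.cos(np.pi * (1.0 - u))

# The two routes agree in distribution; e.g. both variances are close to H/2.
```

Both routes produce the same arcsine-shaped marginal over the slice interval, which is exactly the equivalence used in the text.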
By substituting Ht(·) with yt in (6), the conditional updates for this new sampling procedure can be rewritten as below, yielding the HMC slice sampler (HMC-SS), with conditional distributions defined as

Sampling a slice:       p(yt|xt) = 1/[Γ(a) f(xt)] · [log f(xt) − log yt]^{a−1} ,    s.t. 0 < yt < f(xt) ,    (7)

Conditional sampling:   p(xt+1|yt) = 1/Z2(yt) · [log f(xt+1) − log yt]^{a−1} ,    s.t. f(xt+1) > yt ,    (8)

where a = 1/2 (other values of a are considered below), f(x) = e^{−U(x)} is an unnormalized density, and Z1 ≜ ∫ f(x)dx and Z2(y) ≜ ∫_{f(x)>y} [log f(x) − log y]^{a−1} dx are the normalization constants.

Comparing these two procedures, analytic HMC and HMC-SS, we see that resampling the momentum in analytic HMC corresponds to sampling a slice in HMC-SS. Further, the dynamic evolving in HMC corresponds to the conditional sampling in HMC-SS. We have thus shown that HMC can be equivalently formulated as a slice sampler via (7) and (8).

3.3 Reformulating the Standard Slice Sampler from HMC-SS

In standard slice sampling (described in Section 3.1), both the conditional sampling and the slice are drawn from uniform distributions. However, those for HMC-SS in (7) and (8) are non-uniform distributions. Interestingly, if we change a in (7) and (8) from a = 1/2 to a = 1, we obtain the desired uniform distributions for standard slice sampling. This key observation leads us to consider a generalized form of the kinetic function for HMC, described below.

* For multidimensional cases, the Hamiltonian dynamics are semi-periodic, yet a similar conclusion still holds. Details are discussed in the Appendix.

Consider the generalized family of kinetic functions K(p) = |p|^{1/a} with a > 0. One may rederive equations (3)-(8) using this generalized kinetic energy. 
As shown in the Appendix, these equations remain unchanged, except that each isolated 2 in these equations is replaced by 1/a, and each exponent −1/2 is replaced by a − 1.

Sampling p (for the momentum resampling step) with the generalized kinetics corresponds to drawing p from π(p; m, a) = 1/[2 m^a Γ(a + 1)] · exp[−|p|^{1/a}/m], with m = 1. All of the formulation in the paper still holds for arbitrary m; see the Appendix for details. We denote this distribution the monomial Gamma (MG) distribution, MG(a, m), where m is the mass parameter and a is the monomial parameter. Note that this is equivalent to the zero-mean exponential power distribution described in [17]. We summarize some properties of the MG distribution in the Appendix.

To generate random samples from the MG distribution, one can draw G ∼ Gamma(a, m) and a uniform sign variable S ∼ {−1, 1}; then S · G^a follows the MG(a, m) distribution. We call the HMC sampler based on the generalized kinetic function K(p; a, m) the Monomial Gamma Hamiltonian Monte Carlo (MG-HMC) sampler. The algorithm to analytically sample from the MG-HMC is shown in Algorithm 1. 
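The Gamma-plus-sign construction just described is straightforward to implement; a NumPy sketch (the helper name sample_mg is ours):

```python
import numpy as np

def sample_mg(a, m, size, rng):
    """Draw from MG(a, m): G ~ Gamma(shape=a, scale=m), uniform sign S,
    return S * G**a, so that |p|^(1/a) ~ Gamma(a, m)."""
    g = rng.gamma(shape=a, scale=m, size=size)
    s = rng.choice([-1.0, 1.0], size=size)
    return s * g ** a

rng = np.random.default_rng(1)
p = sample_mg(a=2.0, m=1.5, size=500_000, rng=rng)

# Sanity checks: the density exp(-|p|^(1/a)/m) is symmetric about zero,
# and |p|^(1/a) is Gamma(a, m), so E[|p|^(1/a)] = a * m (= 3.0 here).
mean_p = p.mean()
mean_root = np.mean(np.abs(p) ** 0.5)
```

With a = 1/2 the same construction returns Gaussian momenta, recovering standard HMC as a special case.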
The only difference between this procedure and the one previously described is the momentum resampling step: in analytic HMC, p is drawn from a Gaussian instead of MG(a, m). Note, however, that the Gaussian distribution is a special case of MG(a, m) with a = 1/2.

Algorithm 1: MG-HMC with HJE
for t = 1 to T do
    Resample momentum: pt ∼ MG(a, m).
    Compute Hamiltonian: Ht = U(xt−1) + K(pt).
    Find X ≜ {x : x ∈ R; U(x) ≤ Ht(·)}.
    Dynamic evolving: draw xt|Ht(·) ∝ [Ht(·) − U(xt)]^{a−1}, x ∈ X.

Algorithm 2: MG-SS
for t = 1 to T do
    Sampling a slice: sample yt from (7).
    Conditional sampling: sample xt from (8).

Interestingly, when a = 1, the Monomial Gamma Slice sampler (MG-SS) in Algorithm 2 recovers exactly the same update formulas as standard slice sampling, described in Section 3.1, where the conditional distributions in (7) and (8) are both uniform. When a ≠ 1, we iteratively alternate between sampling from the non-uniform distributions (7) and (8), for both the auxiliary (slicing) variable y and the target variable x.

Using the same argument as in the convergence analysis of standard slice sampling [4], the iterative sampling procedure in (7) and (8) converges to an invariant joint distribution (detailed in the Appendix). Further, the marginal distribution of x recovers the target distribution f(x)/Z1, while the marginal distribution of y is given by p(y) = Z2(y)/[Γ(a)Z1].

The MG-SS can be divided into three broad regimes: 0 < a < 1, a = 1 and a > 1 (illustrated in the Appendix). When 0 < a < 1, the conditional distribution p(yt|xt) is skewed towards the current unnormalized density value f(xt). The conditional draw from p(xt+1|yt) encourages taking samples with smaller density values (inefficient moves), within the domain of the slice interval X. 
On the other hand, when a > 1, draws of yt tend to take smaller values, while draws of xt+1 encourage sampling points with large density values (efficient moves). The case a = 1 corresponds to the conventional slice sampler. Intuitively, setting a to be small makes the auxiliary variable, yt, stay close to f(xt), so that f(xt+1) is close to f(xt). As a result, a larger a seems more desirable. This intuition is justified in the following sections.

4 Theoretical analysis

We analyze theoretical properties of the MG sampler. All the proofs, as well as the ergodicity properties of analytic MG-SS, are given in the Appendix.

One-step autocorrelation of analytic MG-SS. We present results for the univariate distribution case: p(x) ∝ e^{−U(x)}. We first investigate the impact of the monomial parameter a on the one-step autocorrelation function (ACF), ρx(1) ≜ ρ(xt, xt+1) = [E(xt xt+1) − (E x)^2]/Var(x), as a → ∞. Theorem 1 characterizes the limiting behavior of ρ(xt, xt+1).

Theorem 1. For a univariate target distribution, i.e., exp[−U(x)] has finite integral over R, under certain regularity conditions, the one-step autocorrelation of the MG-SS parameterized by a asymptotically approaches zero as a → ∞, i.e., lim_{a→∞} ρx(1) = 0.

In the Appendix we also show that lim_{a→∞} ρ(yt, yt+1) = 0. In addition, we show that ρ(yt, yt+h) is a non-negative decreasing function of the time lag in discrete steps, h.

Effective sample size. The variance of a Monte Carlo estimator is determined by its Effective Sample Size (ESS) [18], defined as ESS = N/(1 + 2 Σ_{h=1}^{∞} ρx(h)), where N is the total number of samples and ρx(h) is the h-step autocorrelation function, which can be calculated in a recursive manner. We prove in the Appendix that ρx(h) is non-negative. Further, assuming the MG sampler is uniformly ergodic and ρx(h) is monotonically decreasing, it can be shown that lim_{a→∞} ESS = N. 
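The qualitative content of Theorem 1, that the lag-1 autocorrelation falls as a grows, can be probed with an exponential target, for which (7) and (8) are tractable. The reparameterization below (a Gamma(a, 1) draw for the slice and an inverse-CDF power-of-uniform draw for the conditional) is our own derivation, and the code is a sketch rather than the paper's implementation; under it the lag-1 ACF comes out near 1/(a + 1):

```python
import numpy as np

def mg_ss_exponential(a, theta, n, rng):
    """Analytic MG-SS chain for f(x) ∝ exp(-x/theta), x >= 0.
    Slice step:       u ~ Gamma(a, 1) gives y = f(x) exp(-u).
    Conditional step: x' has density ∝ (L - x')^(a-1) on [0, L) with
    L = -theta * log(y) = x + theta * u, sampled by inversion as
    x' = L * (1 - V**(1/a)), V ~ Uniform(0, 1)."""
    xs = np.empty(n)
    x = theta
    for t in range(n):
        u = rng.gamma(a, 1.0)
        L = x + theta * u                      # f(x') > y  <=>  x' < L
        x = L * (1.0 - rng.uniform() ** (1.0 / a))
        xs[t] = x
    return xs

def lag1_acf(xs):
    mu = xs.mean()
    return np.mean((xs[:-1] - mu) * (xs[1:] - mu)) / xs.var()

rng = np.random.default_rng(2)
stats = {}
for a in (1.0, 3.0):
    xs = mg_ss_exponential(a, theta=2.0, n=200_000, rng=rng)
    stats[a] = (xs.mean(), lag1_acf(xs))
```

For a = 1 (the standard slice sampler) the lag-1 ACF sits near 1/2; for a = 3 it drops toward 1/4, while the chain mean stays at θ.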
When the ESS approaches the full sample size, N, the resulting sampler delivers excellent mixing efficiency [5]. Details and further discussion are provided in the Appendix.

Case study. To examine a specific 1D example, we consider sampling from the exponential distribution, Exp(θ), with energy function given by U(x) = x/θ, where x ≥ 0. This case has analytic ρx(h) and ESS. After some algebra (details in the Appendix),

ρx(1) = 1/(a + 1) ,    ρx(h) = 1/(a + 1)^h ,    ESS = N a/(a + 2) ,    x̂h(x0) ≜ E[xh|x0] = θ + (x0 − θ)/(a + 1)^h .

These results are in agreement with Theorem 1 and the related arguments on ESS and on the monotonicity of the autocorrelation w.r.t. a. Here x̂h(x0) denotes the expectation of the h-lag sample, starting from any x0. The relative difference [x̂h(x0) − θ]/(x0 − θ) decays exponentially in h, with a factor of 1/(a + 1). In fact, ρx(1) for the exponential family class of models introduced in [19], with potential energy U(x) = x^ω/θ, where x ≥ 0 and ω, θ > 0, can be analytically calculated. The result, provided in the Appendix, indicates that for this family, ρx(1) decays at a rate of O(a^{−1}).

MG-HMC mixing performance. In theory, the analytic MG-HMC (where the dynamics in (5) are solved exactly) is expected to have the same theoretical properties as the analytic MG-SS in unimodal cases, since they are derived from the same setup. However, the mixing performance of the two methods can differ significantly when sampling from a multimodal distribution, due to the fact that the Hamiltonian dynamics may get “trapped” in a single closed trajectory (one of the modes) with low energy, whereas the analytic MG-SS does not suffer from this problem, as it is able to sample from disjoint slice intervals (one per mode). This is a well-known property of slice sampling [4] that arises from (7) and (8). 
However, if a is large enough, as we show in the Appendix, the probability of entering a low-energy level associated with more than one Hamiltonian trajectory, which restricts movement between modes, is arbitrarily small. As a result, the analytic MG-HMC with a large value of a is able to approach the stationary mixing performance of MG-SS.

5 MG sampling in practice

MG-HMC with numerical integrator. In practice, MG-SS (performing Algorithm 2) requires: 1) analytically solving for the slice interval X, which is typically infeasible in multivariate cases [4]; or 2) analytically computing the integral Z2(y) over X, implied by the non-uniform conditionals of MG-SS. These are usually computationally infeasible, though adaptive estimation of X could be done using schemes like the “doubling” and “shrinking” strategies from the slice sampling literature [4].

It is more convenient to perform approximate MG-HMC using a numerical integrator, as in traditional HMC, i.e., in each iteration the momentum p is first initialized by sampling from MG(a, m), then second-order Störmer-Verlet integration [2] is performed for the Hamiltonian dynamics updates:

pt+1/2 = pt − (ε/2) ∇U(xt) ,    xt+1 = xt + ε ∇K(pt+1/2) ,    pt+1 = pt+1/2 − (ε/2) ∇U(xt+1) ,    (9)

where ∇K(p) = sign(p) · (1/(ma)) |p|^{1/a−1}. When a = 1, [∇K(p)]_d = 1/m for every dimension d, independent of x and p. To avoid moving on a grid when a = 1, we employ a random step size ε drawn from a uniform distribution on a non-negative range (r1, r2), as suggested in [2].

No free lunch. With a numerical integrator for MG-HMC, however, the argument for choosing a large a (of great theoretical advantage, as discussed in the previous section) faces practical issues. First, a large value of a leads to a less accurate numerical integrator. This is because, as a gets larger, the trajectory of the total Hamiltonian becomes “stiffer”, i.e., its maximum curvature becomes larger. When a > 1/2, the Hamiltonian trajectory in the phase space (x, p) has at least 2D non-differentiable points (“turnovers”), where D denotes the total dimension, one at each intersection with a hyperplane p(d) = 0, d ∈ {1, . . . , D}. As a result, directly applying Störmer-Verlet integration leads to high integration error as D becomes large.

Second, if the sampler is initialized in the tail region of a light-tailed target distribution, MG-HMC with a > 1 may converge arbitrarily slowly to the true target distribution, i.e., the burn-in period can take an arbitrarily long time. For example, with a > 1, ∇U(x0) can be very large when x0 is in the light-tailed region, leading the update x0 + ∇K(p0 + ∇U(x0)) to be arbitrarily close to x0, i.e., the sampler does not move.

To ameliorate these issues, we provide mitigating strategies. For the first (numerical) issue, we propose two possibilities: 1) As an analog of the “reflection” action of [2], in (9), whenever the d-th dimension(s) of the momentum changes sign, we “recoil” the point in these dimension(s) to the previous iteration and negate the momentum of these dimension(s), i.e., x(d)_{t+1} = x(d)_t , p(d)_{t+1} = −p(d)_t . 2) Substituting the kinetic function K(p) with a “softened” kinetic function, and using importance sampling to sample the momentum. The details and a comparison between the “reflection” action and the “softened” kinetics are discussed in the Appendix.

For the second (convergence) issue, we suggest using a step-size decay scheme, e.g., ε = max(ε1 ρ^t, ε0). In our experiments we use (ε1, ρ) = (10^6, 0.9), where ε0 is problem-specific. 
This approach empirically alleviates the slow convergence problem; however, we note that a more principled way would be to adaptively select a during sampling, which is left for future investigation.

As a compromise between theoretical gains and practical issues, we suggest setting a = 1 (the HMC implementation of a slice sampler) when the dimension is relatively large. This is because, in our experiments, when a > 1, numerical errors and convergence issues tend to overwhelm the theoretical mixing performance gains described in Section 4.

Figure 2: Theoretical and empirical ρx(1) and ESS of the exponential distribution (a,b), N+ (c,d) and Gamma (e).

6 Experiments

6.1 Simulation studies

1D unimodal problems. We first evaluate the performance of the MG sampler on several univariate distributions: 1) Exponential distribution, U(x) = θx, x ≥ 0. 2) Truncated Gaussian, U(x) = θx^2, x ≥ 0. 3) Gamma distribution, U(x) = −(r − 1) log x + θx. Note that the performance of the sampler does not depend on the scale parameter θ > 0. We compare the empirical ρx(1) and ESS of the analytic MG-SS and MG-HMC with their theoretical values. In the Gamma distribution case, analytic derivation of the autocorrelations and ESS is difficult, so we resort to a numerical approach to compute ρx(1) and ESS. 
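A toy numerical MG-HMC in the spirit of update (9), with an MH correction, can be sketched for a standard Gaussian target (our sketch: the target, step sizes, trajectory lengths and m = 1 are illustrative, and the reflection/softening fixes discussed above are omitted):

```python
import numpy as np

def grad_U(x):                 # U(x) = x^2 / 2, standard Gaussian target
    return x

def U(x):
    return 0.5 * x ** 2

def K(p, a, m=1.0):            # generalized kinetic |p|^(1/a) / m
    return np.abs(p) ** (1.0 / a) / m

def grad_K(p, a, m=1.0):       # sign(p) |p|^(1/a - 1) / (m a), as in (9)
    return np.sign(p) * np.abs(p) ** (1.0 / a - 1.0) / (m * a)

def mg_hmc(a, n_iter, n_leap, eps_range, rng):
    x, xs, accepts = 0.0, [], 0
    for _ in range(n_iter):
        p = rng.choice([-1.0, 1.0]) * rng.gamma(a, 1.0) ** a  # p ~ MG(a, 1)
        eps = rng.uniform(*eps_range)                         # random step size
        x_new, p_new = x, p - 0.5 * eps * grad_U(x)
        for l in range(n_leap):
            x_new = x_new + eps * grad_K(p_new, a)
            if l < n_leap - 1:
                p_new = p_new - eps * grad_U(x_new)
        p_new = p_new - 0.5 * eps * grad_U(x_new)
        dH = U(x_new) + K(p_new, a) - U(x) - K(p, a)
        if np.log(rng.uniform()) < -dH:                       # MH correction
            x, accepts = x_new, accepts + 1
        xs.append(x)
    return np.array(xs), accepts / n_iter

rng = np.random.default_rng(3)
results = {}
for a in (0.5, 1.0):           # a = 0.5 is standard HMC; a = 1 the slice case
    xs, rate = mg_hmc(a, 20_000, 30, (0.05, 0.15), rng)
    results[a] = (xs.mean(), xs.var(), rate)
```

Both settings recover the N(0, 1) moments; with a = 1 the kinetic gradient is a constant-speed sign(p), which is exactly why the random step size is needed to avoid grid effects.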
Details are provided in the Appendix. Each method is run for 30,000 iterations with 10,000 burn-in samples. The number of leap-frog steps is set to be uniformly drawn from (100 − l, 100 + l) with l = 20, as suggested by [16]. We also compared MG-HMC (a = 1) with standard slice sampling using the doubling and shrinking scheme [4]. As expected, the resulting ESS (not shown) for these two methods is almost identical. The experimental settings and results are provided in the Appendix. The acceptance rates decrease from around 0.98 to around 0.77 for each case as a grows from 0.5 to 4, as shown in Figure 2(a)-(d).

The results for analytic MG-SS match well with the theoretical results; however, MG-HMC seems to suffer from practical difficulties when a is large, evidenced by results gradually deviating from the theoretical values. This issue is more evident in the Gamma case (see Figure 2(e)), where ρx(1) first decreases and then increases. Meanwhile, the acceptance rates decrease from 0.9 to 0.5.

Figure 3: 10 MC samples by MG-HMC from a 2D distribution and different a.

Table 1: ESS of MG-HMC for 1D and 2D bimodal distributions.

1D and 2D bimodal problems. We further conduct simulation studies to evaluate the efficiency of MG-HMC when sampling 1D and 2D multimodal distributions. In the univariate case, the potential energy is given by U(x) = x^4 − 2x^2; in the bivariate case, U(x) = 0.2 × (x1 + x2)^2 + 0.01 × (x1 + x2)^4 − 0.4 × (x1 − x2)^2. We show in the Appendix that if the energy function is symmetric along x = C, where C is a constant, then in theory the analytic MG-SS will have ESS equal to the total sample size. However, as shown in Section 4, the analytic MG-HMC is expected to have an ESS less than that of its corresponding analytic MG-SS, and the gap between the analytic MG-HMC 
As a result, despite numerical difficulties, we expect the MG-HMC based on numerical integration to have better mixing performance with large a. To verify our theory, we run MG-HMC with a = {0.5, 1, 2} for 30,000 iterations with 10,000 burn-in samples. The parameter settings and the acceptance rates are detailed in the Appendix.
Empirically, we find that the efficiency of HMC is significantly improved with large a, as shown in Table 1, which coincides with the theory in Section 4. From Figure 3, we observe that the MG-HMC sampler with monomial parameter a = {1, 2} is better at jumping between modes of the target distribution than standard HMC, which confirms the theory in Section 4. We also compared MG-HMC (a = 1) with standard SS [4]. As expected, in the 1D case the standard SS yields ESS close to the full sample size, while in the 2D case the resulting ESS is lower than that of MG-HMC (a = 1) (details are provided in the Appendix).
6.2 Real data
Bayesian logistic regression  We evaluate our methods on 6 real-world datasets from the UCI repository [20]: German credit (G), Australian credit (A), Pima Indian (P), Heart (H), Ripley (R) and Caravan (C) [21]. Feature dimensions range from 7 to 87, and the number of data instances is between 250 and 5822. All datasets are normalized to have zero mean and unit variance. Gaussian priors N(0, 100I) are imposed on the regression coefficients. We draw 5000 iterations with 1000 burn-in samples for each experiment. The leap-frog steps are set to be uniformly drawn from (100 − l, 100 + l) with l = 20. Other experimental settings (m and ε) are provided in the Appendix.
Results in terms of minimum ESS are summarized in Table 2. Prediction accuracies estimated via cross-validation are almost identical across all settings (reported in the Appendix).
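The Bayesian logistic regression target above is fully specified by its potential energy (the negative log-posterior) and gradient, which is all an (MG-)HMC sampler needs. Below is a minimal sketch with labels y in {-1, +1} and the N(0, 100 I) prior; the function name and the log-sum-exp stabilization are our choices, and the paper's feature preprocessing is not reproduced.

```python
import numpy as np

def blr_potential(beta, X, y, prior_var=100.0):
    """Negative log-posterior U(beta) for Bayesian logistic regression
    with labels y in {-1, +1} and a N(0, prior_var * I) prior, together
    with its gradient.

    U(beta) = sum_i log(1 + exp(-y_i x_i^T beta)) + beta^T beta / (2 * prior_var)
    """
    z = y * (X @ beta)
    # log(1 + exp(-z)) computed stably via logaddexp
    U = np.logaddexp(0.0, -z).sum() + beta @ beta / (2.0 * prior_var)
    sig = 1.0 / (1.0 + np.exp(z))            # sigma(-z_i)
    grad = -(X.T @ (y * sig)) + beta / prior_var
    return U, grad
```

A quick finite-difference check of the gradient against U is a useful sanity test before plugging the pair into any Hamiltonian sampler, since a mismatched gradient silently degrades acceptance rates.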
It can be seen that MG-HMC with a = 1 outperforms (in terms of ESS) the other two settings, a = 0.5 and a = 2, indicating that increased numerical difficulties counter the theoretical gains when a becomes large. This can also be seen by noting that the acceptance rates drop from around 0.9 to around 0.7 as a increases from 0.5 to 2. The dimensionality also seems to have an impact on the optimal setting of a: in the high-dimensional dataset Caravan, the improvement of MG-HMC with a = 1 is less significant compared with the other datasets, and a = 2 seems to suffer more from numerical difficulties. Comparisons between MG-HMC (a = 1) and standard slice sampling are provided in the Appendix. In general, standard slice sampling with adaptive search underperforms relative to MG-HMC (a = 1).

Table 1: ESS of MG-HMC for 1D and 2D bimodal distributions.
           a = 0.5   a = 1    a = 2
1D  ρx(1)  0.60      0.43     0.11
    ESS    5175      10157    24298
2D  ρx(1)  0.67      0.60     0.53
    ESS    4691      16349    18007

Table 2: Minimum ESS for each method (dimensionality indicated in parentheses). Left: BLR; Right: ICA.
                 A (15)  G (25)  H (14)  P (8)  R (7)  C (87)            ICA (25)
MG-HMC (a=0.5)   3524    3124    3447    3317   4664   33 (median 3987)  2677
MG-HMC (a=1)     4591    4308    4353    3434   4424   36 (median 4531)  3029
MG-HMC (a=2)     4315    1490    3646    4226   1490   7 (median 740)    1534

ICA  We finally evaluate our methods on the MEG [22] dataset for Independent Component Analysis (ICA), with 17,730 time points and 25 feature dimensions. All experiments are based on 5000 MCMC samples. The acceptance rates for a = (0.5, 1, 2) are (0.98, 0.97, 0.77). Running time is almost identical for different a. Settings (including m and ε) are provided in the Appendix.
As shown in Table 2, when a = 1, MG-HMC has better mixing performance than the other settings.
7 Conclusion
We demonstrated the connection between HMC and slice sampling, introducing a new method for implementing a slice sampler via an augmented form of HMC. With few modifications to standard HMC, our MG-HMC can be seen as a drop-in replacement for any scenario where HMC and its variants apply, for example, Hamiltonian Variational Inference (HVI) [23]. We showed the theoretical advantages of our method over standard HMC, as well as the numerical difficulties associated with it. Several future extensions can be explored to mitigate numerical issues, e.g., performing MG-HMC on the Riemann manifold [5] so that step sizes can be adaptively chosen, and using a high-order symplectic numerical method [24, 25] to reduce the discretization error introduced by the integrator.

References
[1] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2004.
[2] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
[3] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2), 1987.
[4] Radford M Neal. Slice sampling. Annals of Statistics, 2003.
[5] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2), 2011.
[6] Wei-Lun Chao, Justin Solomon, Dominik Michels, and Fei Sha. Exponential integration for Hamiltonian Monte Carlo. In ICML, 2015.
[7] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1), 2014.
[8] Ziyu Wang, Shakir Mohamed, and Nando de Freitas. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In ICML, 2013.
[9] Ari Pakman and Liam Paninski. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In NIPS, 2013.
[10] Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles A Sutton. Continuous relaxations for discrete Hamiltonian Monte Carlo. In NIPS, 2012.
[11] Iain Murray, Ryan Prescott Adams, and David JC MacKay. Elliptical slice sampling. ArXiv, 2009.
[12] Vladimir Igorevich Arnol'd. Mathematical Methods of Classical Mechanics, volume 60. Springer Science & Business Media, 2013.
[13] Herbert Goldstein. Classical Mechanics. Pearson Education India, 1965.
[14] John Robert Taylor. Classical Mechanics. University Science Books, 2005.
[15] LD Landau and EM Lifshitz. Mechanics, 1st edition. Pergamon Press, Oxford, 1976.
[16] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. ArXiv, January 2016.
[17] Saralees Nadarajah. A generalized normal distribution. Journal of Applied Statistics, 32(7), 2005.
[18] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
[19] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 1996.
[20] Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.
[21] Peter van der Putten and Maarten van Someren. COIL challenge 2000: The insurance company case. Sentient Machine Research, 9, 2000.
[22] Ricardo Vigário, Veikko Jousmäki, M Hämäläinen, R Haft, and Erkki Oja. Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In NIPS, 1998.
[23] Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. ArXiv, 2014.
[24] Michael Striebel, Michael Günther, Francesco Knechtli, and Michèle Wandelt.
Accuracy of symmetric partitioned Runge-Kutta methods for differential equations on Lie-groups. ArXiv, December 2011.
[25] Chengxiang Jiang and Yuhao Cong. A sixth order diagonally implicit symmetric and symplectic Runge-Kutta method for solving Hamiltonian systems. Journal of Applied Analysis and Computation, 5(1), 2015.
[26] Ivar Ekeland and Jean-Michel Lasry. On the number of periodic trajectories for a Hamiltonian flow on a convex energy surface. Annals of Mathematics, 1980.
[27] Luke Tierney and Antonietta Mira. Some adaptive Monte Carlo methods for Bayesian inference. Statistics in Medicine, 18(17-18), 1999.
[28] Richard Isaac. A general version of Doeblin's condition. The Annals of Mathematical Statistics, 1963.
[29] Eric Cancès, Frédéric Legoll, and Gabriel Stoltz. Theoretical and numerical comparison of some sampling methods for molecular dynamics. ESAIM: Mathematical Modelling and Numerical Analysis, 41(2), 2007.
[30] Alicia A Johnson. Geometric Ergodicity of Gibbs Samplers. PhD thesis, University of Minnesota, 2009.
[31] Gareth O Roberts and Jeffrey S Rosenthal. Markov-chain Monte Carlo: Some practical implications of theoretical results. Canadian Journal of Statistics, 26(1), 1998.
[32] Jeffrey S Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430), 1995.
[33] Michael Betancourt, Simon Byrne, and Mark Girolami. Optimizing the integrator step size for Hamiltonian Monte Carlo. ArXiv, 2014.
[34] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4), 2000.
[35] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget.
ArXiv, 2013.