{"title": "Semi-Separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "Sampling from hierarchical Bayesian models is often difficult for MCMC methods, because of the strong correlations between the model parameters and the hyperparameters. Recent Riemannian manifold Hamiltonian Monte Carlo (RMHMC) methods have significant potential advantages in this setting, but are computationally expensive. We introduce a new RMHMC method, which we call semi-separable Hamiltonian Monte Carlo, which uses a specially designed mass matrix that allows the joint Hamiltonian over model parameters and hyperparameters to decompose into two simpler Hamiltonians. This structure is exploited by a new integrator which we call the alternating blockwise leapfrog algorithm. The resulting method can mix faster than simpler Gibbs sampling while being simpler and more efficient than previous instances of RMHMC.", "full_text": "Semi-Separable Hamiltonian Monte Carlo\n\nfor Inference in Bayesian Hierarchical Models\n\nYichuan Zhang\n\nSchool of Informatics\nUniversity of Edinburgh\n\nCharles Sutton\n\nSchool of Informatics\nUniversity of Edinburgh\n\nY.Zhang-60@sms.ed.ac.uk\n\nc.sutton@inf.ed.ac.uk\n\nAbstract\n\nSampling from hierarchical Bayesian models is often dif\ufb01cult for MCMC meth-\nods, because of the strong correlations between the model parameters and\nthe hyperparameters. Recent Riemannian manifold Hamiltonian Monte Carlo\n(RMHMC) methods have signi\ufb01cant potential advantages in this setting, but are\ncomputationally expensive. We introduce a new RMHMC method, which we call\nsemi-separable Hamiltonian Monte Carlo, which uses a specially designed mass\nmatrix that allows the joint Hamiltonian over model parameters and hyperparam-\neters to decompose into two simpler Hamiltonians. 
This structure is exploited by a new integrator which we call the alternating blockwise leapfrog algorithm. The resulting method can mix faster than simpler Gibbs sampling while being simpler and more efficient than previous instances of RMHMC.\n\n1 Introduction\n\nBayesian statistics provides a natural way to manage model complexity and control overfitting, with modern problems involving complicated models with a large number of parameters. One of the most powerful advantages of the Bayesian approach is hierarchical modeling, which allows partial pooling across a group of datasets, allowing groups with little data to borrow information from similar groups with larger amounts of data. However, such models pose problems for Markov chain Monte Carlo (MCMC) methods, because the joint posterior distribution is often pathological due to strong correlations between the model parameters and the hyperparameters [3]. For example, one of the most powerful MCMC methods is Hamiltonian Monte Carlo (HMC). However, for hierarchical models even the mixing speed of HMC can be unsatisfactory in practice, as has been noted several times in the literature [3, 4, 11]. Riemannian manifold Hamiltonian Monte Carlo (RMHMC) [7] is a recent extension of HMC that aims to efficiently sample from challenging posterior distributions by exploiting local geometric properties of the distribution of interest. However, it is computationally too expensive to be applicable to large scale problems.\n\nIn this work, we propose a simplified RMHMC method, called Semi-Separable Hamiltonian Monte Carlo (SSHMC), in which the joint Hamiltonian over parameters and hyperparameters has special structure, which we call semi-separability, that allows it to be decomposed into two simpler, separable Hamiltonians. This condition allows for a new efficient algorithm which we call the alternating blockwise leapfrog algorithm. 
Compared to Gibbs sampling, SSHMC can make significantly larger moves in hyperparameter space due to shared terms between the two simple Hamiltonians. Compared to previous RMHMC methods, SSHMC yields simpler and more computationally efficient samplers for many practical Bayesian models.\n\n2 Hierarchical Bayesian Models\n\nLet D = {D_i}_{i=1}^N be a collection of data groups, where the ith data group is a collection of iid observations y_i = {y_{ij}}_{j=1}^{N_i} and their inputs x_i = {x_{ij}}_{j=1}^{N_i}. We assume the data follows a parametric distribution p(y_i|x_i, θ_i), where θ_i is the model parameter for group i. The parameters are assumed to be drawn from a prior p(θ_i|φ), where φ is the hyperparameter with a prior distribution p(φ). The joint posterior over model parameters θ = (θ_1, ..., θ_N) and hyperparameters φ is then\n\np(θ, φ|D) ∝ p(φ) ∏_{i=1}^N p(y_i|x_i, θ_i) p(θ_i|φ).    (1)\n\nThis hierarchical Bayesian model is popular because the parameters θ_i for each group are coupled, allowing the groups to share statistical strength. However, this property causes difficulties when approximating the posterior distribution. In the posterior, the model parameters and hyperparameters are strongly correlated. In particular, φ usually controls the variance of p(θ|φ) to promote partial pooling, so the variance of θ|φ, D depends strongly on φ. This causes difficulties for many MCMC methods, such as the Gibbs sampler and HMC. An illustrative example of pathological structure in hierarchical models is the Gaussian funnel distribution [11]. Its density function is defined as p(x, v) = ∏_{i=1}^n N(x_i|0, e^{−v}) N(v|0, 3²), where x is the vector of low-level parameters and v is the variance hyperparameter. 
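To make the funnel's geometry concrete, here is a short NumPy sketch of its log-density (our own illustration, not code from the paper; the function name and the choice of dimension n are ours). Note that e^{−v} is the variance of each x_i, so the conditional scale of x changes by orders of magnitude as v moves over its prior range.

```python
import numpy as np

def funnel_logpdf(x, v):
    """Log-density of the Gaussian funnel
    p(x, v) = prod_i N(x_i | 0, e^{-v}) * N(v | 0, 3^2).
    The variance of each x_i is e^{-v}, so log-variance is -v."""
    x = np.asarray(x)
    n = x.size
    # sum_i log N(x_i | 0, e^{-v})
    log_lik = -0.5 * n * np.log(2 * np.pi) + 0.5 * n * v - 0.5 * np.exp(v) * np.sum(x ** 2)
    # log N(v | 0, 9)
    log_prior = -0.5 * np.log(2 * np.pi * 9.0) - v ** 2 / 18.0
    return log_lik + log_prior
```

Evaluating this at, say, x = 3 with v = −3 (wide conditional) versus v = 3 (narrow conditional) shows the sharp variation in scale that defeats a single global step size.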
The pathological correlation between x and v is illustrated by Figure 1.\n\n3 Hamiltonian Monte Carlo on Posterior Manifold\n\nHamiltonian Monte Carlo (HMC) is a gradient-based MCMC method with auxiliary variables. To generate samples from a target density π(z), HMC constructs an ergodic Markov chain with the invariant distribution π(z, r) = π(z)π(r), where r is an auxiliary variable. The most common choice of π(r) is a Gaussian distribution N(0, G^{−1}) with precision matrix G. Given the current sample z, the transition kernel of the HMC chain includes three steps: first, sample r ∼ π(r); second, propose a new sample (z′, r′) by simulating the Hamiltonian dynamics; and finally, accept the proposed sample with probability α = min{1, π(z′, r′)/π(z, r)}, otherwise leave z unchanged. The last step is a Metropolis-Hastings (MH) correction. Define H(z, r) := − log π(z, r). The Hamiltonian dynamics is defined by the differential equations (ż, ṙ) = (∂_r H, −∂_z H), where z is called the position and r is called the momentum.\n\nIt is easy to see that Ḣ(z, r) = ∂_z H ż + ∂_r H ṙ = 0, which is called the energy preservation property [10, 11]. In physics, H(z, r) is known as the Hamiltonian energy, and it decomposes into the sum of the potential energy U(z) := − log π(z) and the kinetic energy K(r) := − log π(r). The most widely used discretized simulation in HMC is the leapfrog algorithm, which is given by the recursion\n\nr(τ + ε/2) = r(τ) − (ε/2) ∇_z U(z(τ)),    (2a)\nz(τ + ε) = z(τ) + ε ∇_r K(r(τ + ε/2)),    (2b)\nr(τ + ε) = r(τ + ε/2) − (ε/2) ∇_z U(z(τ + ε)),    (2c)\n\nwhere ε is the step size of the discretized simulation time. 
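The recursion (2a)-(2c) can be coded directly. The following is a minimal sketch for the separable case K(r) = ½ rᵀGr with a constant precision matrix G (helper name and argument names are ours; consecutive momentum half-steps inside the trajectory are merged into full steps, as is standard):

```python
import numpy as np

def leapfrog(z, r, grad_U, G, eps, L):
    """Leapfrog recursion (2a)-(2c) for the separable Hamiltonian
    H(z, r) = U(z) + 0.5 * r^T G r, where G is the precision matrix
    of the momentum, so grad_r K(r) = G @ r."""
    r = r - 0.5 * eps * grad_U(z)          # (2a): initial half-step for r
    for step in range(L):
        z = z + eps * (G @ r)              # (2b): full step for z
        scale = eps if step < L - 1 else 0.5 * eps
        r = r - scale * grad_U(z)          # (2c), merged with the next (2a)
    return z, r
```

Two properties worth checking numerically: the discretized trajectory approximately conserves H, and negating the final momentum and integrating again returns to the starting point (time reversibility), which is what makes the MH correction valid.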
After L steps from the current sample (z(0), r(0)) = (z, r), the new sample is proposed as the last point (z′, r′) = (z(Lε), r(Lε)). In Hamiltonian dynamics, the matrix G is called the mass matrix. If G is constant w.r.t. z, then z and r are independent in π(z, r). In this case we say that H(z, r) is a separable Hamiltonian. In particular, we use the term standard HMC to refer to HMC using the identity matrix as G. Although HMC methods often outperform other popular MCMC methods, they may mix slowly if there are strong correlations between variables in the target distribution. Neal [11] showed that HMC can mix faster if G is not the identity matrix. Intuitively, such a G acts like a preconditioner. However, if the curvature of π(z) varies greatly, a global preconditioner can be inadequate.\n\nFor this reason, recent work, notably that on Riemannian manifold HMC (RMHMC) [7], has considered non-separable Hamiltonian methods, in which G(z) varies with position z, so that z and r are no longer independent in π(z, r). The resulting Hamiltonian H(z, r) = − log π(z, r) is called a non-separable Hamiltonian. For example, for Bayesian inference problems, Girolami and Calderhead [7] proposed using the Fisher information matrix (FIM) of π(θ), which is the metric tensor of the posterior manifold. However, for a non-separable Hamiltonian, the simple leapfrog dynamics (2a)-(2c) do not yield a valid MCMC method, as they are no longer reversible. Simulation of general non-separable systems requires the generalized leapfrog integrator (GLI) [7], which requires computing higher-order derivatives to solve a system of non-linear differential equations. The computational cost of GLI in general is O(d³), where d is the number of parameters, which is prohibitive for large d.\n\nIn hierarchical models, there are two ways to sample the posterior using HMC. 
One way is to sample the joint posterior π(θ, φ) directly. The other way is to alternately sample the conditionals π(θ|φ) and π(φ|θ), simulating from each conditional distribution using HMC. This strategy is called HMC within Gibbs [11]. In either case, HMC chains tend to mix slowly in hyperparameter space, because the huge variation of potential energy across different hyperparameter values can easily overwhelm the kinetic energy in separable HMC [11]. Hierarchical models also pose a challenge to RMHMC, if we want to sample the model parameters and hyperparameters jointly. In particular, the closed-form FIM of the joint posterior π(θ, φ) is usually unavailable. Due to this problem, even sampling some toy models like the Gaussian funnel using RMHMC becomes challenging. Betancourt [2] proposed a new metric that uses a transformed Hessian matrix of π(θ), and Betancourt and Girolami [3] demonstrate the power of this method for efficiently sampling hyperparameters of hierarchical models on some simple benchmarks like the Gaussian funnel. However, the transformation requires computing an eigendecomposition of the Hessian matrix, which is infeasible in high dimensions.\n\nBecause of these technical difficulties, RMHMC for hierarchical models is usually used within a block Gibbs sampling scheme, alternating between θ and φ. This RMHMC within Gibbs strategy is useful because the simulation of the non-separable dynamics for the conditional distributions may have much lower computational cost than that for the joint distribution. However, as we have discussed, in hierarchical models these variables tend to be very strongly correlated, and it is well-known that Gibbs samplers mix slowly in such cases [13]. 
So, the Gibbs scheme limits the true power of RMHMC.\n\n4 Semi-Separable Hamiltonian Monte Carlo\n\nIn this section we propose a non-separable HMC method that does not have the limitations of Gibbs sampling and that scales to relatively high dimensions, based on a novel property that we will call semi-separability. We introduce new HMC methods that rely on semi-separable Hamiltonians, which we call semi-separable Hamiltonian Monte Carlo (SSHMC).\n\n4.1 Semi-Separable Hamiltonian\n\nIn this section, we define the semi-separable Hamiltonian system. Our target distribution will be the posterior π(θ, φ) = p(θ, φ|D) of a hierarchical model (1), where θ ∈ R^n and φ ∈ R^m. Let r_θ ∈ R^n and r_φ ∈ R^m be the momentum variables corresponding to θ and φ respectively. The non-separable Hamiltonian is defined as\n\nH(θ, φ, r_θ, r_φ) = U(θ, φ) + K(r_θ, r_φ|θ, φ),    (3)\n\nwhere the potential energy is U(θ, φ) = − log π(θ, φ) and the kinetic energy is K(r_θ, r_φ|θ, φ) = − log N(r_θ, r_φ; 0, G(θ, φ)^{−1}), which includes the normalization term log |G(θ, φ)|. The mass matrix G(θ, φ) can be an arbitrary p.d. matrix. For example, previous work on RMHMC [7] has chosen G(θ, φ) to be the FIM of the joint posterior π(θ, φ), resulting in an HMC method that requires O((m + n)³) time; this limits applications of RMHMC to large scale problems. To attack these computational challenges, we introduce restrictions on the mass matrix G(θ, φ) to enable efficient simulation. In particular, we restrict G(θ, φ) to have the form 
\n\nG(θ, φ) = ( G_θ(φ, x)   0 ;  0   G_φ(θ) ),\n\nthat is, a block-diagonal matrix where G_θ and G_φ are the precision matrices of r_θ and r_φ, respectively. Importantly, we restrict G_θ(φ, x) to be independent of θ and G_φ(θ) to be independent of φ. If G has these properties, we call the resulting Hamiltonian a semi-separable Hamiltonian. A semi-separable Hamiltonian is still in general non-separable, as the two random vectors (θ, φ) and (r_θ, r_φ) are not independent.\n\nThe semi-separability property has important computational advantages. First, because G is block diagonal, the cost of matrix operations reduces from O((n + m)^k) to O(n^k). Second, and more important, substituting the restricted mass matrix into (3) results in the potential and kinetic energy:\n\nU(θ, φ) = − Σ_i [log p(y_i|θ_i, x_i) + log p(θ_i|φ)] − log p(φ),    (4)\n\nK(r_θ, r_φ|θ, φ) = ½ [r_θᵀ G_θ(x, φ) r_θ + r_φᵀ G_φ(θ) r_φ + log |G_θ(x, φ)| + log |G_φ(θ)|].    (5)\n\nIf we fix (θ, r_θ) or (φ, r_φ), the non-separable Hamiltonian (3) can be seen as a separable Hamiltonian plus some constant terms. 
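As a sketch of the computational point (our own code, not the paper's), the kinetic energy (5) under the block-diagonal restriction can be evaluated block by block; the full (n + m)-dimensional matrix never needs to be formed or factored:

```python
import numpy as np

def kinetic_energy(r_theta, r_phi, G_theta, G_phi):
    """Kinetic energy of Eq. (5), as written there, for the block-diagonal
    (semi-separable) mass matrix: quadratic forms and log-determinants are
    computed per block."""
    quad = r_theta @ (G_theta @ r_theta) + r_phi @ (G_phi @ r_phi)
    logdet = np.linalg.slogdet(G_theta)[1] + np.linalg.slogdet(G_phi)[1]
    return 0.5 * (quad + logdet)
```

By construction this agrees with evaluating the same expression using the dense block matrix diag(G_θ, G_φ), but with two small determinants in place of one large one.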
In particular, define the notation\n\nA(r_θ|φ) = ½ r_θᵀ G_θ(x, φ) r_θ,    A(r_φ|θ) = ½ r_φᵀ G_φ(θ) r_φ.\n\nThen, considering (φ, r_φ) as fixed, the non-separable Hamiltonian H in (3) differs from the following separable Hamiltonian\n\nH_1(θ, r_θ) = U_1(θ|φ, r_φ) + K_1(r_θ|φ),    (6)\n\nU_1(θ|φ, r_φ) = − Σ_i [log p(y_i|θ_i, x_i) + log p(θ_i|φ)] + A(r_φ|θ) + ½ log |G_φ(θ)|,    (7)\n\nK_1(r_θ|φ) = A(r_θ|φ)    (8)\n\nonly by some constant terms that do not depend on (θ, r_θ). What this means is that any update to (θ, r_θ) that leaves H_1 invariant leaves the joint Hamiltonian H invariant as well. An example is the leapfrog dynamics on H_1, where U_1 is considered the potential energy, and K_1 the kinetic energy. Similarly, if (θ, r_θ) are fixed, then H differs from the following separable Hamiltonian\n\nH_2(φ, r_φ) = U_2(φ|θ, r_θ) + K_2(r_φ|θ),    (9)\n\nU_2(φ|θ, r_θ) = − Σ_i log p(θ_i|φ) − log p(φ) + A(r_θ|φ) + ½ log |G_θ(x, φ)|,    (10)\n\nK_2(r_φ|θ) = A(r_φ|θ)    (11)\n\nonly by terms that are constant with respect to (φ, r_φ).\n\nNotice that H_1 and H_2 are coupled by the terms A(r_θ|φ) and A(r_φ|θ). Each of these terms appears in the kinetic energy of one of the separable Hamiltonians, but in the potential energy of the other one. We call these terms auxiliary potentials because they are potential energy terms introduced by the auxiliary variables. These auxiliary potentials are key to our method (see Section 4.3).\n\n4.2 Alternating Block-wise Leapfrog Algorithm\n\nNow we introduce an efficient SSHMC method that exploits the semi-separability property. 
As described in the previous section, any update to (θ, r_θ) that leaves H_1 invariant also leaves the joint Hamiltonian H invariant, as does any update to (φ, r_φ) that leaves H_2 invariant. So a natural idea is simply to alternate between simulating the Hamiltonian dynamics for H_1 and that for H_2. Crucially, even though the total Hamiltonian H is not separable in general, both H_1 and H_2 are separable. Therefore when simulating H_1 and H_2, the simple leapfrog method can be used, and the more complex GLI method is not required.\n\nWe call this method the alternating block-wise leapfrog algorithm (ABLA), shown in Algorithm 1. In that algorithm, the function leapfrog returns the result of the leapfrog dynamics (2a)-(2c) for the given starting point, Hamiltonian, and step size. We call each iteration of the loop from 1, ..., L an ABLA step. For simplicity, we have shown one leapfrog step for H_1 and H_2 per ABLA step, but in practice it is useful to use multiple leapfrog steps per ABLA step. ABLA has discretization error due to the leapfrog discretization, so the MH correction is required. If it were possible to simulate H_1 and H_2 exactly, then H would be preserved exactly and there would be no need for an MH correction.\n\nTo show that the SSHMC method by ABLA preserves the distribution π(θ, φ), we also need to show that ABLA is a time-reversible and volume-preserving transformation in the joint space of (θ, r_θ, φ, r_φ). Let X = X_{θ,r_θ} × X_{φ,r_φ}, where (θ, r_θ) ∈ X_{θ,r_θ} and (φ, r_φ) ∈ X_{φ,r_φ}. Obviously, any reversible and volume-preserving transformation in a subspace of X is also reversible and volume-preserving in X. It is easy to see that each leapfrog step in the ABLA algorithm is reversible and volume-preserving in either X_{θ,r_θ} or X_{φ,r_φ}. 
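The alternation just described can be sketched as follows. This is our own schematic, not the paper's implementation: for simplicity the mass blocks are held constant (so the auxiliary potentials and log-determinant terms reduce to constants and the scheme degenerates to blockwise leapfrog); in the general semi-separable case those terms enter through the user-supplied gradients of (7) and (10). All function and argument names are ours.

```python
import numpy as np

def leapfrog_step(q, p, grad_U, G, eps):
    """One leapfrog step for a separable Hamiltonian U(q) + 0.5 * p^T G p."""
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * (G @ p)
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def abla(theta, phi, r_t, r_p, grad_U1, grad_U2, G_theta, G_phi, eps, L):
    """L ABLA steps: a palindromic composition of leapfrog updates for the
    separable Hamiltonians H1 (acting on theta, r_theta) and H2 (acting on
    phi, r_phi).  grad_U1(theta, phi, r_phi) and grad_U2(phi, theta, r_theta)
    are gradients of the potentials (7) and (10)."""
    for _ in range(L):
        theta, r_t = leapfrog_step(theta, r_t, lambda t: grad_U1(t, phi, r_p), G_theta, eps / 2)
        phi, r_p = leapfrog_step(phi, r_p, lambda f: grad_U2(f, theta, r_t), G_phi, eps)
        theta, r_t = leapfrog_step(theta, r_t, lambda t: grad_U1(t, phi, r_p), G_theta, eps / 2)
    return theta, phi, r_t, r_p
```

A full SSHMC transition would draw fresh momenta, run `abla`, and then apply the MH correction using the joint Hamiltonian H, exactly as the text prescribes.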
One more property of integrator of interest is\n\nReferences\n\n[1] K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.\n[2] M. J. Betancourt. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. ArXiv e-prints, Dec. 2012.\n[3] M. J. Betancourt and M. Girolami. Hamiltonian Monte Carlo for Hierarchical Models. ArXiv e-prints, Dec. 2013.\n[4] K. Choo. Learning hyperparameters for neural network models using Hamiltonian dynamics. PhD thesis, Citeseer, 2000.\n[5] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis-Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253-268, 2005.\n[6] C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, pages 473-483, 1992.\n[7] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011. ISSN 1467-9868. doi: 10.1111/j.1467-9868.2010.00765.x. URL http://dx.doi.org/10.1111/j.1467-9868.2010.00765.x.\n[8] M. D. Hoffman and A. Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, in press.\n[9] S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361-393, 1998.\n[10] B. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics, volume 14. Cambridge University Press, 2004.\n[11] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113-162, 2011.\n[12] A. Pakman and L. Paninski. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems 26, pages 2490-2498, 2013.\n[13] C. P. Robert and G. Casella. Monte Carlo statistical methods, volume 319. Citeseer, 2004.\n[14] Z. Wang, S. Mohamed, and N. de Freitas. Adaptive Hamiltonian and Riemann manifold Monte Carlo samplers. In International Conference on Machine Learning (ICML), pages 1462-1470, 2013. URL http://jmlr.org/proceedings/papers/v28/wang13e.pdf. JMLR W&CP 
28(3):1462-1470, 2013.\n[15] Y. Zhang, C. Sutton, A. Storkey, and Z. Ghahramani. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems (NIPS), 2012.\n\nAlgorithm 1 SSHMC by ABLA\nRequire: (θ, φ)\nSample r_θ ∼ N(0, G_θ(φ, x)^{−1}) and r_φ ∼ N(0, G_φ(θ)^{−1})\nfor l in 1, 2, ..., L do\n  (θ^{(l+ε/2)}, r_θ^{(l+ε/2)}) ← leapfrog(θ^{(l)}, r_θ^{(l)}, H_1, ε/2)\n  (φ^{(l+ε)}, r_φ^{(l+ε)}) ← leapfrog(φ^{(l)}, r_φ^{(l)}, H_2, ε)\n  (θ^{(l+ε)}, r_θ^{(l+ε)}) ← leapfrog(θ^{(l+ε/2)}, r_θ^{(l+ε/2)}, H_1, ε/2)\nend for\nDraw u ∼ U(0, 1)\nif u < min{1, exp(H(θ, φ, r_θ, r_φ) − H(θ′, φ′, r_θ′, r_φ′))} then accept the proposed sample, else keep (θ, φ)"}