{"title": "Quasi-Newton Methods for Markov Chain Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 2393, "page_last": 2401, "abstract": "The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasi-Newton approximations from the optimization literature, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives.", "full_text": "Quasi-Newton Methods\n\nfor Markov Chain Monte Carlo\n\nYichuan Zhang and Charles Sutton\n\nSchool of Informatics\nUniversity of Edinburgh\n\nY.Zhang-60@sms.ed.ac.uk, csutton@inf.ed.ac.uk\n\nAbstract\n\nThe performance of Markov chain Monte Carlo methods is often sensitive to the\nscaling and correlations between the random variables of interest. An important\nsource of information about the local correlation and scale is given by the Hessian\nmatrix of the target distribution, but this is often either computationally expensive\nor infeasible. In this paper we propose MCMC samplers that make use of quasi-\nNewton approximations, which approximate the Hessian of the target distribution\nfrom previous samples and gradients generated by the sampler. 
A key issue is that\nMCMC samplers that depend on the history of previous states are in general not\nvalid. We address this problem by using limited memory quasi-Newton methods,\nwhich depend only on a fixed window of previous samples. On several real world\ndatasets, we show that the quasi-Newton sampler is more effective than standard\nHamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require\nhigher-order derivatives.\n\n1 Introduction\n\nThe design of effective approximate inference methods for continuous variables often requires considering the curvature of the target distribution. This is especially true of Markov chain Monte Carlo (MCMC) methods. For example, it is well known that the Gibbs sampler mixes extremely poorly on distributions that are strongly correlated. In a similar way, the performance of a random walk Metropolis-Hastings algorithm is sensitive to the variance of the proposal distribution. Many samplers can be improved by incorporating second-order information about the target distribution. For example, several authors have used a Metropolis-Hastings algorithm in which the Hessian is used to form a covariance for a Gaussian proposal [3, 11]. Recently, Girolami and Calderhead [5] have proposed a Hamiltonian Monte Carlo method that can require computing higher-order derivatives of the target distribution.\nUnfortunately, second derivatives can be inconvenient or infeasible to obtain, and the quadratic cost of manipulating a d \u00d7 d Hessian matrix can also be prohibitive. An appealing idea is to approximate the Hessian matrix using first-order information from previous samples, in a manner similar to quasi-Newton methods from the optimization literature. 
However, samplers that depend on the history of previous samples must be carefully designed in order to guarantee that the chain converges to the target distribution.\nIn this paper, we present quasi-Newton methods for MCMC that are based on approximations to the Hessian from first-order information. In particular, we present a Hamiltonian Monte Carlo algorithm in which the variance of the momentum variables is based on a BFGS approximation. The key point is that we use a limited memory approximation, in which only a small window of previous samples is used to approximate the Hessian. This makes it straightforward to show that our samplers are valid, because the samples are distributed as an order-k Markov chain. Second, by taking advantage of the special structure in the Hessian approximation, the samplers require only linear time and linear space in the dimensionality of the problem. Although this is a very natural approach, we are unaware of previous MCMC methods that use quasi-Newton approximations. In general we know of very few MCMC methods that make use of the rich set of approximations from the numerical optimization literature (some exceptions include [7, 11]). On several logistic regression data sets, we show that the quasi-Newton samplers produce samples of higher quality than standard HMC, but with significantly less computation time than methods that require higher-order derivatives.\n\n2 Background\n\nIn this section we provide background on Hamiltonian Monte Carlo. An excellent recent tutorial is given by Neal [9]. Let x be a random variable on the state space X = R^d with target probability distribution \u03c0(x) \u221d exp(L(x)), and let p be a Gaussian random variable on P = R^d with density p(p) = N(p|0, M), where M is the covariance matrix. In general, Hamiltonian Monte Carlo (HMC) defines a stationary Markov chain on the augmented state space X \u00d7 P with invariant distribution p(x, p) = \u03c0(x)p(p). 
The sampler is defined using a Hamiltonian function, which up to a constant is the negative log density of (x, p), given as follows:\n\nH(x, p) = \u2212L(x) + (1/2) p^T M^{-1} p.  (1)\n\nBy analogy to physical systems, the first term on the RHS is called the potential energy, the second term is called the kinetic energy, the state x is called the position variable, and p the momentum variable. Finally, we will call the covariance M the mass matrix. The most common mass matrix is the identity matrix I. Samples in HMC are generated as follows. First, the state p is resampled from its marginal distribution N(p|0, M). Then, given the current state (x, p), a new state (x\u2217, p\u2217) is generated by a deterministic simulation of Hamiltonian dynamics:\n\n\u02d9x = M^{-1} p;  \u02d9p = \u2207_x L(x).  (2)\n\nOne common approximation to this dynamical system is given by the leapfrog algorithm. A single iteration of the leapfrog algorithm is given by the recursive formulas\n\np(\u03c4 + \u03b5/2) = p(\u03c4) + (\u03b5/2) \u2207_x L(x(\u03c4)),  (3)\nx(\u03c4 + \u03b5) = x(\u03c4) + \u03b5 M^{-1} p(\u03c4 + \u03b5/2),  (4)\np(\u03c4 + \u03b5) = p(\u03c4 + \u03b5/2) + (\u03b5/2) \u2207_x L(x(\u03c4 + \u03b5)),  (5)\n\nwhere \u03b5 is the step size and \u03c4 is a discrete time variable. The leapfrog algorithm is initialised at the current sample, that is, (x(0), p(0)) = (x, p). After L leapfrog steps (3)-(5), the final state (x(L\u03b5), p(L\u03b5)) is used as the proposal (x\u2217, p\u2217) in a Metropolis-Hastings correction with acceptance probability min[1, exp(H(x, p) \u2212 H(x\u2217, p\u2217))]. The step size \u03b5 and the number of leapfrog steps L are the two parameters of HMC.\nIn many applications, different components of x may have different scales and be highly correlated. Tuning HMC in such a situation can be very difficult. 
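To make the leapfrog recursion (3)-(5) concrete, the following is an illustrative sketch (our own code, not the authors' implementation): the trailing and leading momentum half-steps of consecutive iterations are merged, and `grad_log_density` is a hypothetical callable computing \u2207_x L(x).

```python
import numpy as np

def leapfrog(x, p, grad_log_density, eps, L, M_inv):
    """Simulate L leapfrog steps (3)-(5) of the Hamiltonian dynamics (2)."""
    x, p = x.copy(), p.copy()
    p = p + 0.5 * eps * grad_log_density(x)        # initial half-step, eq. (3)
    for i in range(L):
        x = x + eps * (M_inv @ p)                  # position update, eq. (4)
        g = grad_log_density(x)
        # merged half-steps between iterations; a final half-step, eq. (5)
        p = p + (eps if i < L - 1 else 0.5 * eps) * g
    return x, p
```

Because the integrator is reversible and volume-preserving, the resulting proposal can be corrected with the acceptance probability min[1, exp(H(x, p) \u2212 H(x\u2217, p\u2217))] stated above; for small \u03b5 the Hamiltonian is nearly conserved along the trajectory.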
However, the performance of HMC can be improved by multiplying the state x by a non-singular matrix A. If A is chosen well, the transformed state x' = Ax may at least locally be better conditioned, i.e., the new variables x' may be less correlated and have similar scales, so that sampling becomes easier. In the context of HMC, this transformation is equivalent to changing the mass matrix M. This is because the Hamiltonian dynamics of the system (Ax, p) with mass matrix M are isomorphic to the dynamics on (x, A^T p), which is equivalent to defining the state as (x, p) and using the mass matrix M' = A^T MA. For a more detailed version of this argument, see the tutorial of Neal [9]. So in this paper we will concentrate on tuning M on the fly during sampling.\nNow, if L has a constant Hessian B (or nearly so), then a reasonable choice of transformation is to choose A so that B = AA^T, because then the Hessian of the log density over x' will be nearly the identity. This corresponds to the choice M = B. For more general functions without a constant Hessian, this argument suggests the idea of employing a mass matrix M(x) that is a function of the position. In this case the Hamiltonian function can be\n\nH(x, p) = \u2212L(x) + (1/2) log((2\u03c0)^d |M(x)|) + (1/2) p^T M(x)^{-1} p,  (6)\n\nwhere the second term on the RHS comes from the normalisation factor of the Gaussian momentum variable.\n\n3 Quasi-Newton Approximations for Sampling\n\nIn this section, we describe the Hessian approximation that is used in our samplers. It is based on the well-known BFGS approximation [10], but there are several customizations that we must make to use it within a sampler. 
First we explain quasi-Newton methods in the context of optimization. To minimise a function f : R^d \u2192 R, quasi-Newton methods search for the minimum of f(x) by generating a sequence of iterates x_{k+1} = x_k \u2212 \u03b1_k H_k \u2207f(x_k), where H_k is an approximation to the inverse Hessian at x_k, which is computed from the previous function values and gradients. One of the most popular large scale quasi-Newton methods is limited-memory BFGS (L-BFGS) [10]. Given the previous m iterates x_{k\u2212m+1}, x_{k\u2212m+2}, . . . , x_k, the L-BFGS approximation H_{k+1} is\n\nH_{k+1} = (I \u2212 s_k y_k^T / (s_k^T y_k)) H_k (I \u2212 y_k s_k^T / (s_k^T y_k)) + s_k s_k^T / (s_k^T y_k),  (7)\n\nwhere s_k = x_{k+1} \u2212 x_k and y_k = \u2207f_{k+1} \u2212 \u2207f_k. The base case of the recursion is typically chosen as H_{k\u2212m} = \u03b3I for some \u03b3 > 0. If m = k, then this is called the BFGS formula, and typically it is implemented by storing the full d \u00d7 d matrix H_k. If m < k, however, this is called limited-memory BFGS, and can be implemented much more efficiently. It can be seen that the BFGS formula (7) is a rank-two update to the previous Hessian approximation H_k. Therefore H_{k+1} is a diagonal matrix plus a rank-2m matrix, so the matrix-vector product H_k \u2207f(x_k) can be computed in linear time O(md). Typically the product Hv is implemented by a special two-loop recursive algorithm [10].\nIn contrast to optimization methods, most sampling methods need a factorized form of H_k to draw samples from N(0, H_k). More precisely, we adopt the factorisation H_k = S_k S_k^T, so that we can generate a sample as p = S_k z where z \u223c N(0, I). The matrix operations to obtain S_k, e.g. a Cholesky decomposition, cost O(d^3). To avoid this cost, we need a way to compute S_k that does not require constructing the matrix H_k explicitly. 
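The two-loop recursion for the product H_k v can be sketched as follows. This is an illustrative implementation under our own naming (not the authors' code), assuming the curvature pairs (s_i, y_i) are stored oldest first and the base case is H_{k\u2212m} = \u03b3I:

```python
import numpy as np

def lbfgs_two_loop(v, s_list, y_list, gamma=1.0):
    """Compute H_k @ v in O(md) time, where H_k is the BFGS inverse-Hessian
    approximation (7) built from the pairs (s_i, y_i) with base case gamma*I."""
    rhos = [1.0 / (s @ y) for s, y in zip(s_list, y_list)]
    q = v.copy()
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):   # newest first
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    r = gamma * q                                  # base case H_{k-m} = gamma * I
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)                          # second loop, oldest first
        r = r + (a - b) * s
    return r
```

The result agrees with applying the dense update (7) m times and then multiplying by v, but the d \u00d7 d matrix is never formed.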
Fortunately there is a variant of the BFGS formula that maintains S_k directly [2], which is\n\nH_{k+1} = S_{k+1} S_{k+1}^T;  S_{k+1} = (I \u2212 p_k q_k^T) S_k\nB_{k+1} = C_{k+1} C_{k+1}^T;  C_{k+1} = (I \u2212 u_k t_k^T) C_k\n\nwhere B_k = H_k^{-1} denotes the Hessian matrix approximation. Again, we will use a limited-memory version of these updates, in which the recursion is stopped at H_{k\u2212m} = \u03b3I.\nAs for the running time of the above approximation, computing S_k requires O(m^2 d) time and O(md) space, so it is still linear in the dimensionality. The matrix-vector product S_{k+1} z can be computed by a sequence of inner products S_{k+1} z = \u220f_{i=k\u2212m}^{k} (I \u2212 p_i q_i^T) S_{k\u2212m} z, in time O(md).\nA second issue is that we need H_k to be positive definite if it is to be used as a covariance matrix. It can be shown [10] that H_k is positive definite if for all i \u2208 {k \u2212 m + 1, . . . , k} we have s_i^T y_i > 0. For a convex function f, an optimizer can be arranged so that this condition always holds, but we cannot do this in a sampler. Instead, we first sort the previous samples {x_i} in ascending order with respect to L(x), and then check if there are any adjacent pairs (x_i, x_{i+1}) such that the resulting s_i and y_i have s_i^T y_i \u2264 0. If this happens, we remove the point x_{i+1} from the memory and recompute s_i, y_i using x_{i+2}, and so on. In this way we can ensure that H_k is always positive definite.\nAlthough we have described BFGS as relying on a memory of \u201cprevious\u201d points, e.g., previous iterates of an optimization algorithm, or previous samples of an MCMC chain, in principle the BFGS equations can be used to generate a Hessian approximation from any set of points X = {x_1, . . . , x_m}. To emphasize this, we will write H_BFGS : X \u21a6 H_k for the function that maps a \u201cpseudo-memory\u201d X to the inverse Hessian H_k. 
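As an aside, a single product-form update of the factor S_k can be sketched in code. This is an illustration under our reconstruction of the coefficient vectors in (8)-(11), with p_k = s_k/(s_k^T y_k) and q_k = y_k + \u221a(s_k^T y_k / s_k^T B_k s_k) B_k s_k, not the authors' implementation; for simplicity B_k s_k is obtained here with a dense solve, which a practical implementation would avoid.

```python
import numpy as np

def bfgs_sqrt_update(S, s, y):
    """Rank-one product-form update S_{k+1} = (I - p q^T) S_k, so that
    H_{k+1} = S_{k+1} S_{k+1}^T is the BFGS update (7) of H_k = S_k S_k^T."""
    H = S @ S.T
    Bs = np.linalg.solve(H, s)                  # B_k s_k with B_k = H_k^{-1}
    p = s / (s @ y)
    q = y + np.sqrt((s @ y) / (s @ Bs)) * Bs
    return S - np.outer(p, q @ S)               # (I - p q^T) S_k

def sample_momentum(S, rng):
    """Draw p ~ N(0, H_k) as p = S_k z with z ~ N(0, I)."""
    return S @ rng.standard_normal(S.shape[1])
```

One can verify numerically that S_{k+1} S_{k+1}^T reproduces the dense BFGS update (7) exactly, which is how we sanity-checked the reconstructed coefficient vectors.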
This function first sorts the points x_i \u2208 X by L(x_i), then computes s_i = x_{i+1} \u2212 x_i and y_i = \u2207L(x_{i+1}) \u2212 \u2207L(x_i), then filters the x_i as described above so that s_i^T y_i > 0 for all i, and finally computes the Hessian approximation H_k using the recursion (8)\u2013(11):\n\np_k = s_k / (s_k^T y_k)  (8)\nq_k = y_k + \u221a(s_k^T y_k / (s_k^T B_k s_k)) B_k s_k  (9)\nu_k = B_k s_k \u2212 \u221a((s_k^T B_k s_k) / (s_k^T y_k)) y_k  (10)\nt_k = s_k / (s_k^T B_k s_k)  (11)\n\n4 Quasi-Newton Markov Chain Monte Carlo\n\nIn this section, we describe two new quasi-Newton samplers. They both follow the same structure, which we describe now. Intuitively, we want to use the characteristics of the target distribution to accelerate the exploration of the region with high probability mass. The previous samples provide information about the target distribution, so it is reasonable to use them to adapt the kernel. However, naively tuning the sampling parameters using all previous samples may lead to an invalid chain, that is, a chain that does not have \u03c0 as its invariant distribution.\nOur samplers use a simple solution to this problem. Rather than adapting the kernel using all of the previous samples in the Markov chain, we adapt using a limited window of K previous samples. The chain as a whole is then an order-K Markov chain. It is easiest to analyze this chain by converting it into a first-order Markov chain over an enlarged space. Specifically, we build a Markov chain in the K-fold product space X^K with stationary distribution p(x_{1:K}) = \u220f_{i=1:K} \u03c0(x_i). We denote a state of this chain by x_{t\u2212K+1}, x_{t\u2212K+2}, . . . , x_t. We use the short-hand notation x^{(t)}_{1:K\\i} for the subset of x^{(t)}_{1:K} excluding x^{(t)}_i.\nOur samplers then update one component of x^{(t)}_{1:K} per iteration, in a Gibbs-like fashion. We define a transition kernel T_i that only updates the ith component of x^{(t)}_{1:K}, that is:\n\nT_i(x^{(t)}_{1:K}, x'_{1:K}) = \u03b4(x^{(t)}_{1:K\\i}, x'_{1:K\\i}) B(x^{(t)}_i, x'_i | x^{(t)}_{1:K\\i}),  (12)\n\nwhere B(x_i, x'_i | x_{1:K\\i}) is called the base kernel; it is an MCMC kernel in X and adapts with x^{(t)}_{1:K\\i}. If B leaves \u03c0(x_i) invariant for all fixed values of x_{1:K\\i}, it is straightforward to show that T_i leaves p invariant. Then, the sampler as a whole updates each of the components x^{(t)}_i in sequence, so that the method as a whole is described by the kernel\n\nT(x_{1:K}, x'_{1:K}) = T_1 \u25e6 T_2 \u25e6 . . . \u25e6 T_K(x_{1:K}, x'_{1:K}),  (13)\n\nwhere T_i \u25e6 T_j denotes the composition of kernels T_i and T_j. Because each kernel T_i leaves p(x_{1:K}) invariant, the composition kernel T also leaves p(x_{1:K}) invariant. Such an adaptive scheme is equivalent to using an ensemble of K chains and changing the kernel of each chain with the state of the others. It is called ensemble-chain adaptation (ECA) in this paper. One early example of ECA is found in [4]. To simplify the analysis of the validity of the chain, we assume the base kernel B is irreducible in one iteration. This assumption is satisfied by many popular MCMC kernels.\n\n4.1 Using BFGS within Metropolis-Hastings\n\nA simple way to incorporate quasi-Newton approximations within MCMC is to use the Metropolis-Hastings (M-H) algorithm. The intuition is to fit the Gaussian proposal distribution to the target distribution, so that points in a high probability region are more likely to be proposed. We will call this algorithm MHBFGS. Specifically, the proposal distribution of MHBFGS is defined as q(x'|x^{(t)}_{1:K}) = N(x'; \u00b5, \u03a3), where the proposal mean \u00b5 = \u00b5(x^{(t)}_{1:K}) and covariance \u03a3 = \u03a3(x^{(t)}_{1:K}) depend on the state of all K chains.\nSeveral choices for the mean function are possible. One simple choice is to use one of the samples in the window as the mean, e.g., \u00b5(x^{(t)}_{1:K}) = x^{(t)}_1. Another, potentially better, choice is a Newton step from x_t. For the covariance function at \u00b5, we use the BFGS approximation \u03a3(x_{1:K}) = H_BFGS(x_{1:K}). The proposal x' of T_1 is accepted with probability\n\n\u03b1(x^{(t)}_1, x') = min(1, [\u03c0(x') q(x^{(t)}_1 | x^{(t)}_{2:K}, x')] / [\u03c0(x^{(t)}_1) q(x' | x^{(t)}_1, x^{(t)}_{2:K})]).  (14)\n\nIf x' is rejected, x^{(t)}_1 is duplicated as the new sample. Because the Gaussian proposal q(A | x^{(t)}_1, x^{(t)}_{2:K}) has positive probability for all A \u2286 X, the M-H kernel is irreducible within one iteration. Because the M-H algorithm with acceptance ratio defined as (14) leaves \u03c0(x) invariant, MHBFGS is a valid method that leaves p(x_{1:K}) invariant. Although MHBFGS is simple and intuitive, in preliminary experiments we have found that the MHBFGS sampler may converge slowly in high dimensions.\n\nAlgorithm 1 HMCBFGS\nInput: Current memory (x^{(t)}_1, x^{(t)}_2, . . . , x^{(t)}_K)\nOutput: Next memory (x^{(t+1)}_1, x^{(t+1)}_2, . . . , x^{(t+1)}_K)\n1: p \u223c N(0, B_BFGS(x^{(t)}_{2:K}))\n2: (x\u2217, p\u2217) \u2190 Leapfrog(x^{(t)}_1, p) using (3)-(5)\n3: u \u223c Unif[0, 1]\n4: if u \u2264 exp(H(x^{(t)}_1, p | x_{2:K}) \u2212 H(x\u2217, p\u2217 | x_{2:K})) then\n5:   x^{(t+1)}_K \u2190 x\u2217\n6: else\n7:   x^{(t+1)}_K \u2190 x^{(t)}_1\n8: end if\n9: x^{(t+1)}_{1:K\u22121} \u2190 x^{(t)}_{2:K}\n10: return (x^{(t+1)}_1, x^{(t+1)}_2, . . . , x^{(t+1)}_K)\n\nIn general, Metropolis-Hastings with a Gaussian proposal can suffer from random walk behavior, even if the true Hessian is used. 
For this reason, we next incorporate BFGS into a more sophisticated sampling algorithm.\n\n4.2 Using BFGS within Hamiltonian Monte Carlo\n\nBetter convergence speed can be achieved by incorporating BFGS within the HMC kernel. The high-level idea is to start with the MHBFGS algorithm, but to replace the Gaussian proposal with a simulation of Hamiltonian dynamics. However, we will need to be a bit careful in order to ensure that the Hamiltonian is separable, because otherwise we would need to employ a generalized leapfrog integrator [5], which is significantly more expensive.\nThe new samples in HMCBFGS are generated as follows. As before, we update one component of x^{(t)}_{1:K} at a time. Say that we are updating component i. First we sample a new value of the momentum variable p \u223c N(0, B_BFGS(x^{(t)}_{1:K\\i})). It is important that when constructing the BFGS approximation, we do not use the value x^{(t)}_i that we are currently resampling. Then we simulate the Hamiltonian dynamics starting at the point (x^{(t)}_i, p) using the leapfrog method (3)\u2013(5). The Hamiltonian energy used for this dynamics is simply\n\nH_i(x^{(t)}_{1:K}, p) = \u2212L(x^{(t)}_i) + (1/2) p^T B_BFGS(x^{(t)}_{1:K\\i})^{-1} p.  (15)\n\nThis yields a proposed value (x\u2217, p\u2217). Finally, the proposal is accepted with probability min[1, exp(H_i(x^{(t)}_i, p) \u2212 H_i(x\u2217, p\u2217))], for H_i in (15), and p\u2217 is discarded after the M-H correction. This procedure is summarized in Algorithm 1.\nThis procedure is an instance of the general ECA scheme described above, with base kernel\n\nB(x_i, x'_i | x_{1:K\\i}) = \u222b\u222b \u02c6B(x_i, p_i, x'_i, p'_i | x_{1:K\\i}) dp_i dp'_i,\n\nwhere \u02c6B(x_i, p_i, x'_i, p'_i | x_{1:K\\i}) is a standard HMC kernel with mass matrix B_BFGS(x_{1:K\\i}) that includes sampling p_i. The Hamiltonian energy function of \u02c6B given by (15) is separable, which means that x_i appears only in the potential energy. 
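A full sweep of Algorithm 1 can be sketched in a few lines. This is a dense, illustrative sketch rather than the authors' O(Kd) implementation: the inverse Hessian is built explicitly from the memory window via the dense recursion (7), and we work with the gradient of the potential U(x) = \u2212L(x) so that the curvature condition s^T y > 0 holds for log-concave targets. The function names and the simple leapfrog loop are our own choices.

```python
import numpy as np

def bfgs_inv_hessian(window, grad_U, log_density, gamma=1.0):
    """Dense H_BFGS from a memory window: sort points by L(x), form curvature
    pairs (s, y), drop pairs with s^T y <= 0, and apply the recursion (7)."""
    X = sorted(window, key=log_density)
    d = len(X[0])
    H = gamma * np.eye(d)
    for xa, xb in zip(X[:-1], X[1:]):
        s, y = xb - xa, grad_U(xb) - grad_U(xa)
        if s @ y <= 0:            # curvature condition violated: skip this pair
            continue
        rho = 1.0 / (s @ y)
        V = np.eye(d) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)
    return H

def hmcbfgs_step(window, log_density, grad_U, eps, L, gamma=1.0, rng=None):
    """One sweep of Algorithm 1: resample the momentum with mass matrix
    B_BFGS(x_{2:K}), simulate the dynamics from x_1, and slide the window."""
    rng = rng or np.random.default_rng()
    x = window[0]
    H = bfgs_inv_hessian(window[1:], grad_U, log_density, gamma)  # M^{-1}
    M = np.linalg.inv(H)                                          # mass matrix
    p = rng.multivariate_normal(np.zeros(len(x)), M)
    energy = lambda x_, p_: -log_density(x_) + 0.5 * p_ @ H @ p_  # eq. (15)
    h0 = energy(x, p)
    # leapfrog steps (3)-(5) with M^{-1} = H and dp/dt = -grad_U(x)
    xn, pn = x.copy(), p - 0.5 * eps * grad_U(x)
    for i in range(L):
        xn = xn + eps * (H @ pn)
        pn = pn - (eps if i < L - 1 else 0.5 * eps) * grad_U(xn)
    accepted = rng.uniform() <= np.exp(h0 - energy(xn, pn))
    return window[1:] + [xn if accepted else x]
```

Note how the window slides regardless of acceptance: the first K \u2212 1 entries of the new memory are always the old x^{(t)}_{2:K}, matching line 9 of Algorithm 1.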
It is easy to see that B is a valid kernel in X, so as an ECA method, HMCBFGS leaves p(x_{1:K}) = \u220f_i \u03c0(x_i) invariant.\nIt is interesting to consider whether the method is valid in the augmented space X^K \u00d7 P^K, i.e., whether Algorithm 1 leaves the distribution\n\np(x_{1:K}, p_{1:K}) = \u220f_{i=1}^{K} \u03c0(x_i) N(p_i; 0, B_BFGS(x^{(t)}_{1:K\\i}))\n\ninvariant. Interestingly, this is not true, because every update to x_i changes the Gaussian factors for the momentum variables p_j for j \u2260 i in a way that the Metropolis-Hastings correction in lines 4\u20138 does not consider. So despite the auxiliary variables, it is easiest to establish validity in the original space.\nHMCBFGS has the advantage of being a simple approach that uses only gradients, and it is computationally efficient: the cost of all matrix operations (namely in lines 1 and 2 of Algorithm 1) scales as O(Kd). But, being an ECA method, HMCBFGS has the disadvantage that the larger the number of chains K, the more the updates are \u201cspread across\u201d the chains, so that each chain gets a small number of updates during a fixed amount of computation time. In Section 6 we will evaluate empirically whether this potential drawback is outweighed by the advantages of using approximate second-order information.\n\n5 Related Work\n\nGirolami and Calderhead [5] propose a new HMC method called Riemannian manifold Hamiltonian Monte Carlo (RMHMC), where M(x) can be any positive definite matrix. In their work, M(x) is chosen to be the expected Fisher information matrix, and the experimental results show that RMHMC can converge much faster than many other MCMC methods. Girolami and Calderhead adopted a generalised leapfrog method that is a reversible and volume-preserving approximation to the non-separable Hamiltonian. 
However, such a method may require computing third-order derivatives of L, which can be infeasible in many applications.\nBarthelme and Chopin [1] pointed out the possibility of using an approximate BFGS Hessian in RMHMC for computational efficiency. Similarly, Roy [14] suggested iteratively updating the local metric approximation. Roy also emphasized the potential effect of such an iterative approximation on validity, the main problem that we address here. An early example of ECA is adaptive direction sampling (ADS) [4], in which each sample is taken along a random direction that is chosen based on the samples from a set of chains. However, the validity of ADS can be established only when the size of the ensemble is greater than the number of dimensions; otherwise the samples are trapped in a subspace. HMCBFGS avoids this problem because the BFGS Hessian approximation is full rank.\nThere has been a large amount of interest in adaptive MCMC methods that accumulate information from all previous samples. These methods must be designed carefully, because if the kernel is adapted with the full sampling history in a naive way, the sampler can be invalid [13]. A well-known example of a correct adaptive algorithm is the Adaptive Metropolis algorithm [6], which adapts the Gaussian proposal of a Metropolis-Hastings algorithm based on the empirical covariance of previous samples in a way that maintains ergodicity. To remain valid, the adaptation of the kernel must keep decreasing over time. In practice, the parameters of the kernel in many diminishing adaptive methods converge to a single value over the entire state space. This can be problematic if we want the sampler to adapt to local characteristics of the target distribution, e.g., if different regions of the target distribution have different curvature. 
Using a finite memory of recent samples, our method avoids this problem.\n\n6 Experiments\n\nWe test HMCBFGS on two different models, Bayesian logistic regression and Bayesian conditional random fields (BCRFs). We compare HMCBFGS to standard HMC, which uses the identity mass matrix, and to RMHMC, which requires computing the Hessian matrix. All methods are implemented in Java.\u00b9 We do not report results from MHBFGS because preliminary experiments showed that it was much worse than either HMC or HMCBFGS. The datasets for Bayesian logistic regression are the ones used for RMHMC in [5]. For HMC and HMCBFGS we employ the random step size \u03b5 \u223c Unif[0.9\u03b5\u0302, \u03b5\u0302], where \u03b5\u0302 is the maximum step size. For RMHMC, we used a fixed \u03b5 = 0.5 for all datasets, which follows the setting in [5].\nFor HMC and HMCBFGS we tuned L on one data set (the German data set) and used that value on all datasets. We chose the smallest number of leaps that did not degrade the performance of the sampler: L was chosen to be 40 for HMC and 20 for HMCBFGS. For RMHMC, we employed L = 6 leaps, following Girolami and Calderhead [5]. For HMCBFGS, we heuristically chose the number of ensemble chains K to be slightly higher than d/2.\n\n\u00b9Our implementation was based on the Matlab code of RMHMC of Girolami and Calderhead and was checked against the original Matlab version.\n\nFor each method, we drew 5000 samples after 1000 burn-in samples. The convergence speed is measured by effective sample size (ESS) [5], which summarizes the amount of autocorrelation across different lags over all dimensions.\u00b2 A more detailed description of ESS can be found in [5]. Because HMCBFGS displays more correlation within individual chains than across chains, we calculate the ESS separately for the individual chains in the ensemble, and the overall ESS is simply the sum over the individual chains. 
The final ESS on each data set is obtained by averaging over 10 runs using different initialisations.\n\nESS       HMC    HMCBFGS  RMHMC\nMin       3312   3643     4819\nMean      3862   4541     4950\nMax       4445   4993     5000\nTime (s)  7.56   4.74     483.00\nES/s      739    1470     107\n\nTable 1: Performance of MCMC samplers on Bayesian logistic regression, as measured by effective sample size (ESS). Higher is better. Averaged over five datasets. ES/s is the number of effective samples per second.\n\nDataset     D   N     HMC   HMCBFGS  RMHMC\nAustralian  15  690   396   818      18\nGerman      25  1000  255   397      3\nHeart       14  532   1054  2009     54\nPima        8   270   591   1383     118\nRipley      7   250   1396  2745     344\n\nTable 2: Effective samples per second on Bayesian logistic regression. D is the number of regression coefficients and N is the size of the training data set.\n\nThe results on ESS averaged over five datasets for Bayesian logistic regression are given in Table 1. Our ESS numbers for HMC and RMHMC essentially replicate the results in [5]. RMHMC achieves the highest minimum, mean, and maximum ESS, all very close to the total number of samples (5000). However, because HMC and our method only require computing the gradient, they outperform RMHMC in terms of mean ESS per second. HMCBFGS gains a 10%, 17%, and 12% increase in minimum, mean, and maximum ESS over HMC, while needing only half the number of leapfrog steps of HMC. The detailed performance of the methods on each dataset is shown in Table 2.\nThe second model that we use is a Bayesian CRF on a small natural language dataset of FAQs from Usenet [8]. A linear-chain CRF is used with a Gaussian prior on the parameters. The model has 120 parameters. This model has been used previously [12, 15]. In a CRF it is intractable to compute the Hessian matrix exactly, so RMHMC is infeasible. For HMCBFGS we use K = 5 ensemble chains. Each method is again tested 10 times with different initial points. 
For each chain we draw 8000 samples with 1000 burn-in. We use the step size \u03b5 = 0.02 and the number of leaps L = 10 for both HMC and HMCBFGS. This parameter setting gives an 84% acceptance rate for both HMC and HMCBFGS (averaged over the 10 runs).\nFigure 1 shows the sample trajectory plots for HMC and HMCBFGS on seven randomly selected dimensions. It is clear that HMCBFGS exhibits markedly less autocorrelation than HMC. The ESS statistics in Table 3 give a quantitative evaluation of the performance of HMC and HMCBFGS. The results suggest that the BFGS approximation dramatically reduces the sample autocorrelation with only a small increase in computational overhead on this dataset.\nFinally, we evaluate the scalability of the methods on the highly correlated 1000-dimensional Gaussian N(0, 11^T + 4). Using an ensemble of K = 5 chains, the samples from HMCBFGS are less correlated than those from HMC along the direction of the largest eigenvalue (Figure 2).\n\n\u00b2We use the code from [5] to compute the ESS of samples.\n\nESS       HMC    HMCBFGS\nMin       3      26\nMean      9      438\nMax       25     5371\nTime (s)  35743  37387\nES/h      1      42\n\nTable 3: Performance of MCMC samplers on Bayesian CRFs, as measured by effective sample size (ESS). Higher is better. ES/h is the number of effective samples per hour.\n\nFigure 1: Sample trace plot of 7000 samples from the posterior of a Bayesian CRF using HMC (left) and our method HMCBFGS (right) from a single run of each sampler (each line represents a dimension).\n\nFigure 2: ACF plot of samples projected onto the direction of the largest eigenvector of the 1000-dimensional Gaussian, using HMC (left) and HMCBFGS (right).\n\n7 Discussion\n\nTo the best of our knowledge, this paper presents the first adaptive MCMC methods to employ quasi-Newton approximations. Naive attempts at combining these ideas (such as MHBFGS) do not work well. On the other hand, HMCBFGS is more effective than the state-of-the-art sampler on several real world data sets. 
Furthermore, HMCBFGS works well on a high-dimensional model, where full second-order methods are infeasible, with little extra overhead over regular HMC.\nAs for future work, our current method may not work well in regions where the negative log density is not convex, because there the true Hessian is not positive definite. Another potential issue is that the asymptotic independence between the chains in ECA methods may lead to poor Hessian approximations. On a brighter note, our work raises the interesting possibility that quasi-Newton methods, which are almost exclusively used within the optimization literature, may be useful more generally.\n\nAcknowledgments\n\nWe thank Iain Murray for many useful discussions, and Mark Girolami for detailed comments on an earlier draft.\n\nReferences\n[1] S. Barthelme and N. Chopin. Discussion on Riemannian Manifold Hamiltonian Monte Carlo. Journal of the Royal Statistical Society, B (Statistical Methodology), 73:163\u2013164, 2011. doi: 10.1111/j.1467-9868.2010.00765.x.\n\n[2] K. Brodlie, A. Gourlay, and J. Greenstadt. Rank-one and rank-two corrections to positive definite matrices expressed in product form. IMA Journal of Applied Mathematics, 11(1):73\u201382, 1973.\n\n[3] S. Chib, E. Greenberg, and R. Winkelmann. Posterior simulation and Bayes factors in panel count data models. Journal of Econometrics, 86(1):33\u201354, June 1998. URL http://ideas.repec.org/a/eee/econom/v86y1998i1p33-54.html.\n\n[4] W. R. Gilks, G. O. Roberts, and E. I. George. Adaptive direction sampling. The Statistician, 43(1):179\u2013189, 1994.\n\n[5] M. Girolami and B. Calderhead. 
Riemannian manifold Hamiltonian Monte Carlo (with discussion). Journal of the Royal Statistical Society, B (Statistical Methodology), 73:123\u2013214, 2011. doi: 10.1111/j.1467-9868.2010.00765.x.\n\n[6] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223\u2013242, 2001.\n\n[7] J. S. Liu, F. Liang, and W. H. Wong. The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449):121\u2013134, 2000.\n\n[8] A. McCallum. Frequently asked questions data set. http://www.cs.umass.edu/~mccallum/data/faqdata.\n\n[9] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press, 2010.\n\n[10] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, New York, 1999. ISBN 0-387-98793-2.\n\n[11] Y. Qi and T. P. Minka. Hessian-based Markov chain Monte Carlo algorithms. In First Cape Cod Workshop on Monte Carlo Methods, September 2002.\n\n[12] Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In Artificial Intelligence and Statistics (AISTATS), Barbados, January 2005.\n\n[13] G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive MCMC. Journal of Applied Probability, 44(2):458\u2013475, 2007.\n\n[14] D. M. Roy. Discussion on Riemannian Manifold Hamiltonian Monte Carlo. Journal of the Royal Statistical Society, B (Statistical Methodology), 73:194\u2013195, 2011. doi: 10.1111/j.1467-9868.2010.00765.x.\n\n[15] M. Welling and S. Parise. Bayesian random fields: The Bethe-Laplace approximation. In Uncertainty in Artificial Intelligence (UAI), 2006.\n", "award": [], "sourceid": 1274, "authors": [{"given_name": "Yichuan", "family_name": "Zhang", "institution": null}, {"given_name": "Charles", "family_name": "Sutton", "institution": null}]}