{"title": "Gradient-based Adaptive Markov Chain Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 15730, "page_last": 15739, "abstract": "We introduce a gradient-based learning method to automatically adapt Markov chain Monte Carlo (MCMC) proposal distributions to intractable targets. We define a maximum entropy regularised objective function, referred to as generalised speed measure, which can be robustly optimised over the parameters of the proposal distribution by applying stochastic gradient optimisation. An advantage of our method compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. This is a highly desirable property of any adaptation strategy because the adaptation starts in early iterations even if the initial proposal distribution is far from optimum. We apply the framework for learning multivariate random walk Metropolis and Metropolis-adjusted Langevin proposals with full covariance matrices, and provide empirical evidence that our method can outperform other MCMC algorithms, including Hamiltonian Monte Carlo schemes.", "full_text": "Gradient-based Adaptive Markov Chain Monte Carlo\n\nMichalis K. Titsias\n\nDeepMind\nLondon, UK\n\nmtitsias@google.com\n\nPetros Dellaportas\n\nDepartment of Statistical Science\nUniversity College of London, UK\nDepartment of Statistics, Athens\n\nUniv. of Econ. and Business, Greece\nand The Alan Turing Institute, UK\n\nAbstract\n\nWe introduce a gradient-based learning method to automatically adapt Markov\nchain Monte Carlo (MCMC) proposal distributions to intractable targets. We de\ufb01ne\na maximum entropy regularised objective function, referred to as generalised speed\nmeasure, which can be robustly optimised over the parameters of the proposal dis-\ntribution by applying stochastic gradient optimisation. 
An advantage of our method compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. This is a highly desirable property of any adaptation strategy because the adaptation starts in early iterations even if the initial proposal distribution is far from optimum. We apply the framework for learning multivariate random walk Metropolis and Metropolis-adjusted Langevin proposals with full covariance matrices, and provide empirical evidence that our method can outperform other MCMC algorithms, including Hamiltonian Monte Carlo schemes.

1 Introduction

Markov chain Monte Carlo (MCMC) is a family of algorithms that provide a mechanism for generating dependent draws from arbitrarily complex distributions. The basic setup of an MCMC algorithm in any probabilistic (e.g. Bayesian) inference problem, with an intractable target density π(x), is as follows. A discrete time Markov chain {Xt}, t = 0, 1, 2, . . ., with transition kernel Pθ, appropriately chosen from a collection of π-invariant kernels {Pθ(·,·)}, θ ∈ Θ, is generated and the ergodic averages µt(F) = t⁻¹ Σ_{i=0}^{t−1} F(Xi) are used as approximations to Eπ(F) for any real-valued function F. Although in principle this sampling setup is simple, the actual implementation of any MCMC algorithm requires a careful choice of Pθ because the properties of µt depend on θ. In common kernels that lead to random walk Metropolis (RWM), Metropolis-adjusted Langevin (MALA) or Hamiltonian Monte Carlo (HMC) algorithms, the kernels Pθ are specified through an accept-reject mechanism in which the chain moves from time t to time t + 1 by first proposing a candidate value y from a density qθ(y|x), accepting it with some probability α(xt, y) and setting xt+1 = y, or rejecting it and setting xt+1 = xt.
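In code, one accept-reject transition of this kind looks as follows (a minimal numpy sketch; `log_pi`, `sample_q` and `log_q` are placeholders standing in for a concrete target and proposal, not the paper's implementation):

```python
import numpy as np

def mh_step(x, log_pi, sample_q, log_q, rng):
    """One Metropolis-Hastings transition x_t -> x_{t+1}."""
    y = sample_q(x, rng)  # propose a candidate y ~ q(y|x)
    # log acceptance probability: min{0, log[pi(y) q(x|y)] - log[pi(x) q(y|x)]}
    log_alpha = min(0.0, log_pi(y) - log_pi(x) + log_q(x, y) - log_q(y, x))
    if np.log(rng.uniform()) < log_alpha:
        return y, True            # accept: x_{t+1} = y
    return x, False               # reject: x_{t+1} = x_t

# Example: random walk proposal N(y | x, 0.5^2 I) on a standard normal target.
rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * np.sum(x**2)
sample_q = lambda x, rng: x + 0.5 * rng.standard_normal(x.shape)
log_q = lambda y, x: 0.0          # symmetric proposal: the q-ratio cancels
x = np.zeros(2)
for _ in range(200):
    x, accepted = mh_step(x, log_pi, sample_q, log_q, rng)
```

Working with the logarithm of the acceptance ratio, as above, avoids under/overflow when the log densities are large in magnitude.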
Since \u03b8 directly affects this acceptance probability, it is clear that one should\nchoose \u03b8 so that the chain does not move too slowly or rejects too many proposed values y because in\nboth these cases convergence to the stationary distribution will be slow. This has been recognised as\nearly as in [22] and has initiated exciting research that has produced optimum average acceptance\nprobabilities for a series of algorithms; see [30, 31, 32, 15, 6, 8, 34, 7, 35, 9]. Such optimal average\nacceptance probabilities provide basic guidelines for adapting single step size parameters to achieve\ncertain average acceptance rates.\nMore sophisticated adaptive MCMC algorithms that can learn a full set of parameters \u03b8, such as\na covariance matrix, borrow information from the history of the chain to optimise some criterion\nre\ufb02ecting the performance of the Markov chain [14, 5, 33, 13, 2, 1, 4]. Such methods are typically\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fnon gradient-based and the basic strategy they use is to sequentially \ufb01t the proposal q\u03b8(y|x) to the\nhistory of states xt\u22121, xt, . . . , by ignoring also the rejected state values. This can result in very slow\nadaptation because the initial Markov chain simulations are based on poor initial \u03b8 and the generated\nstates, from which \u03b8 is learnt, are highly correlated and far from the target. The authors in [34] call\nsuch adaptive strategies \u2018greedy\u2019 in the sense that they try to adapt too closely to initial information\nfrom the output and take considerable time to recover from misleading initial information.\nIn this paper, we develop faster and more robust gradient-based adaptive MCMC algorithms that\nmake use of the gradient of the target, \u2207 log \u03c0(x), and they learn from both actual states of the\nchain and proposed (and possibly rejected) states. 
The key idea is to de\ufb01ne and maximise w.r.t. \u03b8\nan entropy regularised objective function that promotes high acceptance rates and high values for\nthe entropy of the proposal distribution. This objective function, referred to as generalised speed\nmeasure, is inspired by the speed measure of the in\ufb01nite-dimensional limiting diffusion process that\ncaptures the notion of speed in which a Markov chain converges to its stationary distribution [32]. We\nmaximise this objective function by applying stochastic gradient variational inference techniques such\nas those based on the reparametrisation trick [19, 29, 40]. An advantage of our algorithm compared\nto traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values\nare rejected. In fact, the adaptation can be faster when candidate values y are rejected since then we\nmake always full use of the gradient \u2207 log \u03c0(y) evaluated at the rejected y. This allows the adaptation\nto start in early iterations even if the initial proposal distribution is far from optimum and the chain is\nnot moving. We apply the method for learning multivariate RWM and MALA proposals where we\nadapt full covariance matrices parametrised ef\ufb01ciently using Cholesky factors. In the experiments\nwe demonstrate our algorithms to multivariate Gaussian targets and Bayesian logistic regression and\nempirically show that they outperform several other baselines, including advanced HMC schemes.\n\n2 Gradient-based adaptive MCMC\nAssume a target distribution \u03c0(x), known up to some unknown normalising constant, where x \u2208 Rn\nis the state vector. 
To sample from π(x) we consider the Metropolis-Hastings (M-H) algorithm, which generates new states by sampling from a proposal distribution qθ(y|x) that depends on parameters θ, and accepts or rejects each proposed state using the standard M-H acceptance probability

    α(x, y; θ) = min{ 1, [π(y) qθ(x|y)] / [π(x) qθ(y|x)] }.    (1)

While the M-H algorithm defines a Markov chain that converges to the target distribution, the efficiency of the algorithm depends heavily on the choice of the proposal distribution qθ(y|x) and the setting of its parameters θ.
Here, we develop a framework for stochastic gradient-based adaptation, or learning, of qθ(y|x) that maximises an objective function inspired by the concept of speed measure, which underlies the theoretical foundations of MCMC optimal tuning [30, 31]. Given that the chain is at state x we would like: (i) to propose big jumps in the state space, and (ii) to accept these jumps with high probability. Assuming for now that the proposal has the standard isotropic random walk form qσ(y|x) = N(y|x, σ²I), the speed measure is defined as

    sσ(x) = σ² × α(x; σ),    (2)

where σ² denotes the variance, also called the step size, of the proposal distribution, and α(x; σ) is the average acceptance probability when starting at x, i.e. α(x; σ) = ∫ α(x, y; σ) qσ(y|x) dy. To learn a good value for the step size we could maximise the speed measure in Eq. 2, which intuitively promotes high variance for the proposal distribution together with high acceptance rates. In the theory of optimal MCMC tuning, sσ(x) is averaged under the stationary distribution π(x) to obtain a global speed measure value sσ = ∫ π(x) sσ(x) dx. For simple targets and with increasing dimension, this measure is maximised when σ² is set to a value that leads to the acceptance probability 0.234 [30, 31]. This subsequently leads to the popular heuristic for tuning random walk proposals: tune σ² so that on average the proposed states are accepted with probability 1/4. Similar heuristics have been obtained for tuning the step sizes of more advanced schemes such as MALA and HMC, where 0.574 is considered optimal for MALA [32] and 0.651 for HMC [24, 9].
While this notion of speed measure from Eq. 2 is intuitive, it is only suitable for tuning proposals having a single step size. Thus, in order to learn arbitrary proposal distributions qθ(y|x), where θ is a vector of parameters, we need to define suitable generalisations of the speed measure. Suppose, for instance, that we wish to tune a Gaussian with a full covariance matrix, i.e. qΣ(y|x) = N(y|x, Σ). To achieve this we need to generalise the objective in Eq. 2 by replacing σ² with some functional F(Σ) that depends on the full covariance. An obvious choice is the average squared jump distance ||y − x||², given by the trace tr(Σ) = Σ_i σi². However, this is problematic, since it can lead to learning proposals with very poor mixing. To see this, note that since the trace is a sum of variances it can take high values even when some of the components of x have very low variance, e.g. for some xi it holds σi² ≈ 0, which can result in very low sampling efficiency or even non-ergodicity.
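The trade-off encoded by the speed measure of Eq. 2 is easy to see numerically, since α(x; σ) is a simple Monte Carlo average; a small numpy sketch on a toy standard normal target (illustrative only, not part of the paper's algorithm):

```python
import numpy as np

def speed_measure(x, sigma, log_pi, n_samples=2000, seed=0):
    """Monte Carlo estimate of s_sigma(x) = sigma^2 * alpha(x; sigma)
    for the random walk proposal q(y|x) = N(y | x, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    y = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    log_ratio = np.array([log_pi(yi) for yi in y]) - log_pi(x)
    alpha = np.exp(np.minimum(0.0, log_ratio)).mean()  # average acceptance
    return sigma**2 * alpha

log_pi = lambda x: -0.5 * np.sum(x**2)   # 10-D standard normal target
x = np.zeros(10)
# Tiny steps are almost always accepted but barely move; huge steps are
# almost always rejected: the speed measure peaks at an intermediate sigma.
s_small, s_mid, s_large = (speed_measure(x, s, log_pi) for s in (0.01, 0.8, 5.0))
```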
In order to define better functionals F(Σ), we wish to exploit the intuition that for MCMC all components of x need to jointly perform (relative to their scale) big jumps, a requirement that is better captured by the determinant |Σ| or, more generally, by the entropy of the proposal distribution.
Therefore, we introduce a generalisation of the speed measure having the form

    sθ(x) = exp{β Hqθ(y|x)} × α(x; θ) = exp{β Hqθ(y|x)} × ∫ α(x, y; θ) qθ(y|x) dy,    (3)

where Hqθ(y|x) = −∫ qθ(y|x) log qθ(y|x) dy denotes the entropy of the proposal distribution and β > 0 is a hyperparameter. Note that when the proposal distribution is a full Gaussian qΣ(y|x) = N(y|x, Σ), then exp{β Hq(y|x)} = const × |Σ|^(β/2), which depends on the determinant of Σ. The quantity sθ(x), referred to as the generalised speed measure, trades off high entropy of the proposal distribution against high acceptance probability. The hyperparameter β plays the crucial role of balancing the relative strengths of these two terms. As discussed in the next section, we can efficiently adapt β in order to achieve a desirable average acceptance rate.
In the following we make use of the above generalised speed measure to derive a variational learning algorithm for adapting the parameters θ using stochastic gradient-based optimisation.

2.1 Maximising the generalised speed measure using variational inference

During the MCMC iterations we collect the pairs of vectors (xt, yt), t > 0, where xt is the state of the chain at time t and yt is the corresponding proposed next state (if accepted, xt+1 = yt). When the chain has converged, each xt follows the stationary distribution π(x); otherwise it follows some distribution that progressively converges to π(x).
In either case we view the sequence of pairs (xt, yt) as non-iid data based on which we wish to perform gradient-based updates of the parameters θ. In practice such updates can be performed with diminishing learning rates or, more safely, stopped completely after some number of burn-in iterations to ensure convergence. Specifically, given the current state xt we wish to take a step towards maximising sθ(xt) in Eq. 3 or, equivalently, its logarithm

    log sθ(xt) = log ∫ α(xt, y; θ) qθ(y|xt) dy + β Hqθ(y|xt).    (4)

The second term is just the entropy of the proposal distribution, which typically will be analytically tractable, while the first term involves an intractable integral. To approximate the first term we work similarly to variational inference [18, 10] and lower bound it using Jensen's inequality,

    log sθ(xt) ≥ Fθ(xt) = ∫ qθ(y|xt) log min{ 1, [π(y) qθ(xt|y)] / [π(xt) qθ(y|xt)] } dy + β Hqθ(y|xt)    (5)
                        = ∫ qθ(y|xt) min{ 0, log [π(y)/π(xt)] + log [qθ(xt|y)/qθ(y|xt)] } dy + β Hqθ(y|xt).    (6)

To take a step towards maximising Fθ we can apply standard stochastic variational inference techniques such as the score function method or the reparametrisation trick [11, 26, 28, 19, 29, 40, 20]. Here, we focus on the case where qθ(y|xt) is a reparametrisable distribution, such that y = Tθ(xt, ε), where Tθ is a deterministic transformation and ε ∼ p(ε).
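The reparametrisation y = Tθ(x, ε) is what lets Monte Carlo gradients flow through the proposal; the following toy one-dimensional check (unrelated to any specific target in the paper) illustrates the trick by differentiating an expectation E[f(y)], with y = µ + σε, with respect to σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8
eps = rng.standard_normal(100_000)   # eps ~ p(eps) = N(0, 1)
y = mu + sigma * eps                 # y = T_theta(eps) with theta = (mu, sigma)

# For f(y) = y^2, d/dsigma f(mu + sigma*eps) = 2 * y * eps; averaging these
# per-sample derivatives gives an unbiased estimate of d/dsigma E[f(y)].
grad_est = np.mean(2.0 * y * eps)

# Analytic value: E[(mu + sigma*eps)^2] = mu^2 + sigma^2, derivative 2*sigma.
grad_true = 2.0 * sigma
```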
Fθ(xt) can be re-written as

    Fθ(xt) = ∫ p(ε) min{ 0, log [π(Tθ(xt, ε))/π(xt)] + log [qθ(xt|Tθ(xt, ε))/qθ(Tθ(xt, ε)|xt)] } dε + β Hqθ(y|xt).

Since MCMC at the t-th iteration proposes a specific state yt constructed as εt ∼ p(εt), yt = Tθ(xt, εt), an unbiased estimate of the exact gradient ∇θ Fθ(xt) can be obtained by

    ∇θ Fθ(xt, εt) = ∇θ min{ 0, log [π(Tθ(xt, εt))/π(xt)] + log [qθ(xt|Tθ(xt, εt))/qθ(Tθ(xt, εt)|xt)] } + β ∇θ Hqθ(y|xt),

which is used to make a gradient update for the parameters θ.

Algorithm 1 Gradient-based Adaptive MCMC
Input: target π(x); reparametrisable proposal qθ(y|x) s.t. y = Tθ(x, ε), ε ∼ p(ε); initial x0; desired average acceptance probability α*.
Initialise θ, β = 1.
for t = 1, 2, 3, . . . do
: Propose εt ∼ p(εt), yt = Tθ(xt, εt).
: Adapt θ: θ ← θ + ρt ∇θ Fθ(xt, εt).
: Accept or reject yt using the standard M-H ratio to obtain xt+1.
: Set αt = 1 if yt was accepted and αt = 0 otherwise.
: Adapt hyperparameter β: β ← β[1 + ρβ(αt − α*)]  # default value for ρβ = 0.02.
end for

Note that the first term in the stochastic gradient is analogous to differentiating through a rectified linear hidden unit (ReLU) in neural networks: if log [π(yt)/π(xt)] + log [qθ(xt|yt)/qθ(yt|xt)] ≥ 0 the gradient is zero (this is the case when yt is accepted with probability one), while otherwise the gradient of the first term reduces to

    ∇θ log π(Tθ(xt, εt)) + ∇θ log [qθ(xt|Tθ(xt, εt))/qθ(Tθ(xt, εt)|xt)].

The value of the hyperparameter β trades off between large acceptance probability and large entropy of the proposal distribution. This hyperparameter cannot be optimised by maximising the variational objective Fθ (doing so typically drives β to a very small value so that the acceptance probability becomes close to one, but the chain barely moves since the entropy is very low). Thus, β needs to be updated so as to steer the average acceptance probability of the chain towards a certain desired value α*. The value of α* can be determined based on the specific MCMC proposal we are using and by following standard recommendations in the literature, as discussed in the previous section. For instance, when we use RWM, α* can be set to 1/4 (see Section 2.2), while for gradient-based MALA (see Section 2.3) α* can be set to 0.55.
Pseudocode for the general procedure is given in Algorithm 1. We set the learning rate ρt using RMSProp [39]: at each iteration t we set ρt = η/(1 + √Gt), where η is the baseline learning rate, and the updates of Gt depend on the gradient estimate ∇θ Fθ(xt, εt) as Gt = 0.9 Gt−1 + 0.1 [∇θ Fθ(xt, εt)]².

2.2 Fitting a full covariance Gaussian random walk proposal

We now specialise to the case where the proposal distribution is a random walk Gaussian qL(y|x) = N(y|x, LL⊤), where the parameter L is a lower triangular matrix with positive diagonal, i.e. a Cholesky factor. This distribution is reparametrisable since y ≡ TL(x, ε) = x + Lε, ε ∼ N(ε|0, I).
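Sampling this proposal through its reparametrisation is a single matrix-vector product per draw; a brief numpy sketch (the particular L below is an arbitrary illustrative Cholesky factor):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# An arbitrary lower-triangular Cholesky factor with positive diagonal.
L = np.tril(0.3 * rng.standard_normal((n, n)))
np.fill_diagonal(L, np.abs(np.diag(L)) + 0.5)

x = np.zeros(n)                      # current state
eps = rng.standard_normal((10_000, n))
y = x + eps @ L.T                    # y = T_L(x, eps) = x + L eps, eps ~ N(0, I)

# The proposed states are distributed as N(y | x, L L^T).
emp_cov = np.cov(y, rowvar=False)
```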
At the t-th iteration, when the state is xt, the lower bound becomes

    FL(xt) = ∫ N(ε|0, I) min{0, log π(xt + Lε) − log π(xt)} dε + β Σ_{i=1}^n log Lii + const.    (7)

Here, the proposal distribution has cancelled out from the M-H ratio, since it is symmetric, while Hqθ(y|xt) = log |L| + const and log |L| = Σ_{i=1}^n log Lii. By making use of the MCMC proposed state yt = xt + Lεt we can obtain an unbiased estimate of the exact gradient ∇L FL(xt),

    ∇L FL(xt, εt) = [∇yt log π(yt) εt⊤]lower + β diag(1/L11, . . . , 1/Lnn),   if log π(yt) < log π(xt),
    ∇L FL(xt, εt) = β diag(1/L11, . . . , 1/Lnn),                              otherwise,

where yt = xt + Lεt, the operation [A]lower zeros the upper triangular part (above the main diagonal) of a square matrix, and diag(1/L11, . . . , 1/Lnn) is a diagonal matrix with elements 1/Lii. The first case of this gradient, i.e. when log π(yt) < log π(xt), has a very similar structure to the stochastic reparametrisation gradient obtained when fitting a variational Gaussian approximation [19, 29, 40], with the difference that here we centre the corresponding approximation, i.e. the proposal qL(yt|xt), at the current state xt instead of having a global variational mean parameter. Interestingly, this first case, where MCMC rejects many samples (or even gets stuck at the same value for a long time), is when learning can be fastest, since the term ∇yt log π(yt) εt⊤ transfers information about the curvature of the target to the covariance of the proposal. When we start getting high acceptance rates, the second case, i.e. log π(yt) ≥ log π(xt), will often be activated, so that the gradient will often reduce to the term β diag(1/L11, . . . , 1/Lnn) that encourages the entropy of the proposal to become large. The ability to learn from rejections is in sharp contrast with traditional non-gradient-based adaptive MCMC methods, which can become very slow when MCMC has high rejection rates. This is because these methods typically learn from the history of state vectors xt, ignoring the information from the rejected states. The algorithm for learning the full random walk Gaussian follows precisely the general structure of Algorithm 1. For the average acceptance rate α* we use the value 1/4.

2.3 Fitting a full covariance MALA proposal

Here, we specialise to a full covariance, also called preconditioned, MALA of the form qL(y|x) = N(y|x + (1/2)LL⊤∇x log π(x), LL⊤), where the covariance matrix is parametrised by the Cholesky factor L. Again this distribution is reparametrisable, according to y ≡ TL(x, ε) = x + (1/2)LL⊤∇log π(x) + Lε, ε ∼ N(ε|0, I). At the t-th iteration, when the state is xt, the reparametrised lower bound simplifies significantly and reduces to

    FL(xt) = ∫ N(ε|0, I) min{ 0, log π(xt + (1/2)LL⊤∇log π(xt) + Lε) − log π(xt) − (1/2)( ||(1/2)L⊤[∇log π(xt) + ∇log π(y)] + ε||² − ||ε||² ) } dε + β Σ_{i=1}^n log Lii + const,

where ||·|| denotes the Euclidean norm and, in the term ∇log π(y), y = xt + (1/2)LL⊤∇log π(xt) + Lε. Then, based on the proposed state yt = TL(xt, εt), we can obtain the unbiased gradient estimate ∇FL(xt, εt) similarly to the previous section.
In general, such an estimate can be very expensive, because the presence of L inside ∇log π(yt) means that we need to compute the matrix of second derivatives, or Hessian, ∇∇log π(yt). We have found that an alternative procedure which stops the gradient inside ∇log π(yt) (i.e. it views ∇log π(yt) as a constant w.r.t. L) has small bias and works well in practice. In fact, as we show in the experiments, this approximation is not only computationally much faster but, remarkably, also leads to better adaptation compared to the exact Hessian procedure, presumably because ignoring the gradient inside ∇log π(yt) reduces the variance. Furthermore, the gradient w.r.t. L used by this fast approximation can be computed very efficiently with a single O(n²) operation (an outer vector product; see Supplement), while each iteration of the algorithm requires overall at most four O(n²) operations. For these gradient-based adaptive MALA schemes, β in Algorithm 1 is adapted to obtain an average acceptance rate of roughly α* = 0.55.

3 Related Work

The connection of our method with traditional adaptive MCMC methods has been discussed in Section 1. Here, we discuss additional related works that make use of gradient-based optimisation and specialised objective functions or algorithms to train MCMC proposal distributions.
The work in [21] proposed a criterion to tune MCMC proposals based on maximising a modified version of the expected squared jumped distance, ∫ qθ(y|xt) ||y − xt||² α(xt, y; θ) dy, previously considered in [27]. Specifically, the authors in [21] first observe that the expected squared jumped distance may not encourage mixing across all dimensions of x,¹ and then try to resolve this by including a reciprocal term (see Section 4.2 in their paper).
The generalised speed measure proposed in this paper is rather different from such criteria, since it encourages joint exploration of all dimensions of x by applying maximum entropy regularisation, which by construction penalises "dimensions that do not move" (the entropy becomes minus infinity in such cases). Another important difference is that in our method the optimisation is performed in log space, by propagating gradients through the logarithm of the M-H acceptance probability, i.e. through log α(xt, y; θ) and not through α(xt, y; θ). This is exactly analogous to other numerically stable objectives, such as variational lower bounds and log likelihoods, and like those our method leads to numerically stable optimisation for arbitrarily large dimensionality of x and complex targets π(x).

¹Because the additive form of ||y − xt||² = Σ_i (yi − xt,i)² implies that even when some dimensions might not be moving at all (the corresponding distance terms are zero), the overall sum can still be large.
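For concreteness, Algorithm 1 specialised to the full-covariance random walk proposal of Section 2.2 fits in a short script. The sketch below (numpy; the learning rates, iteration count and the positivity safeguard on the diagonal of L are illustrative choices, not the paper's settings) also works entirely with log π and log α, in line with the log-space optimisation discussed above:

```python
import numpy as np

def adaptive_rwm(prec, n_iter=5000, alpha_star=0.25, eta=5e-4, rho_beta=0.02, seed=0):
    """Algorithm 1 with the closed-form Section 2.2 gradient, for the proposal
    q_L(y|x) = N(y | x, L L^T) on a toy Gaussian target pi(x) = N(0, prec^{-1})."""
    rng = np.random.default_rng(seed)
    log_pi = lambda x: -0.5 * x @ prec @ x
    grad_log_pi = lambda x: -prec @ x
    n = prec.shape[0]
    L = 0.1 / np.sqrt(n) * np.eye(n)        # initial Cholesky factor
    beta = 1.0                              # entropy-regularisation weight
    G = np.zeros((n, n))                    # RMSProp accumulator
    x = np.zeros(n)
    for _ in range(n_iter):
        eps = rng.standard_normal(n)
        y = x + L @ eps                     # reparametrised proposal
        log_ratio = log_pi(y) - log_pi(x)   # log alpha = min(0, log_ratio)
        # Stochastic gradient of F_L(x_t): the entropy term always contributes,
        # the target term only when the proposal may be rejected.
        grad = beta * np.diag(1.0 / np.diag(L))
        if log_ratio < 0:
            grad += np.tril(np.outer(grad_log_pi(y), eps))
        G = 0.9 * G + 0.1 * grad**2         # RMSProp learning rates
        L = L + eta / (1.0 + np.sqrt(G)) * grad
        # Keep the diagonal positive (practical safeguard, not in the paper).
        np.fill_diagonal(L, np.maximum(np.diag(L), 1e-3))
        accepted = np.log(rng.uniform()) < min(0.0, log_ratio)
        if accepted:
            x = y
        # Adapt beta towards the desired average acceptance rate alpha_star.
        beta *= 1.0 + rho_beta * ((1.0 if accepted else 0.0) - alpha_star)
    return L, x

# Strongly anisotropic 2-D Gaussian target: standard deviations 1 and 0.1.
prec = np.diag([1.0, 100.0])
L, x = adaptive_rwm(prec)
```

After adaptation the learned Cholesky factor reflects the target's scales: the first diagonal entry (loose dimension) ends up larger than the second (tight dimension).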
The entropy regularised objective we introduced is different and\nit can adapt arbitrary MCMC proposal distributions, and not just the independent M-H sampler.\nThere has been also work to learn \ufb02exible MCMC proposals using neural networks [38, 21, 16, 36].\nFor instance, [38] use volume preserving \ufb02ows and an adversarial objective, [21] use the modi\ufb01ed\nexpected jumped distance, discussed earlier, to learn neural network-based extensions of HMC, while\n[16, 36] use auxiliary variational inference. The need to train neural networks can add a signi\ufb01cant\ncomputational cost, and from the end-user point of view these neural adaptive samplers might be\nhard to tune especially in high dimensions. Notice that the generalised speed measure we proposed\nin this paper could possibly be used to train neural adaptive samplers as well. However, to really\nobtain practical algorithms we need to ensure that training has small cost that does not overwhelm\nthe possible bene\ufb01ts in terms of effective sample size.\nFinally, the generalised speed measure that is based on entropy regularisation shares similarities with\nother used objectives for learning probability distributions, such as in variational Bayesian inference,\nwhere the variational lower bound includes an entropy term [18, 10] and reinforcement learning (RL)\nwhere maximum-entropy regularised policy gradients are able to estimate more explorative policies\n[37, 23]. Further discussion on the resemblance of our algorithm with RL is given in the Supplement.\n4 Experiments\nWe test the gradient-based adaptive MCMC methods in several simulated and real data. We investigate\nthe performance of two instances of the framework: the gradient-based adaptive random walk\n(gadRWM) detailed in Section 2.2 and the corresponding MALA (gadMALA) detailed in Section\n2.3. 
For gadMALA we consider the exact reparametrisation method that requires evaluation of the Hessian (gadMALAe) and the fast approximate variant (gadMALAf). These schemes are compared against: (i) standard random walk Metropolis (RWM) with proposal N(y|x, σ²I); (ii) an adaptive MCMC (AM) scheme that fits a proposal of the form N(y|x, Σ) (we consider a computationally efficient version based on updating the Cholesky factor of Σ; see Supplement); (iii) a standard MALA proposal N(y|x + (1/2)σ²∇log π(x), σ²I); (iv) Hamiltonian Monte Carlo (HMC) with a fixed number of leapfrog steps (either 5, 10, or 20); and (v) the state-of-the-art no-U-turn sampler (NUTS) [17], arguably the most efficient adaptive HMC method, which automatically determines the number of leapfrog steps. We provide our own MATLAB implementation² of all algorithms, apart from NUTS, for which we use a publicly available implementation.

4.1 Illustrative experiments

To visually illustrate the gradient-based adaptive samplers we consider a correlated 2-D Gaussian target with covariance matrix Σ = [1 0.99; 0.99 1] and a 51-dimensional Gaussian target obtained by evaluating the squared exponential kernel plus small white noise, i.e. k(xi, xj) = exp{−(xi − xj)²/(2 · 0.16)} + 0.01δi,j, on a regular grid in [0, 4]. The first two panels in Figure 1 show the true covariance together with the adapted covariances obtained by gadRWM for two different settings of the average acceptance rate α* in Algorithm 1, which also illustrates the adaptation of the entropy-regularisation hyperparameter β that is learnt to obtain a certain α*. The remaining two plots illustrate the ability to learn a highly correlated 51-dimensional covariance matrix (with eigenvalues ranging from 0.01 to 12.07) by applying our most advanced gadMALAf scheme.

4.2 Quantitative results

Here, we compare all algorithms in some standard benchmark problems, such as Bayesian logistic regression, and report effective sample size (ESS) together with other quantitative scores.
Experimental settings. In all experiments, for AM and the gradient-based adaptive schemes the Cholesky factor L was initialised to a diagonal matrix with values 0.1/√n in the diagonal, where n is the dimensionality of x. For the AM algorithm the learning rate was set to 0.001/(1 + t/T), where t is the number of iterations and T (the value 4000 was used in all experiments) serves to keep the learning rate nearly constant for the first T training iterations. For the gradient-based adaptive algorithms

²https://github.com/mtitsias/gadMCMC.

Figure 1: The green contours in the first two panels (from left to right) show the 2-D Gaussian target, while the blue contours show the learned covariance, LL⊤, after adapting for 2 × 10⁴ iterations using gadRWM and targeting acceptance rates α* = 0.25 and α* = 0.4, respectively. For α* = 0.25 the adapted blue contours show that the proposal matches the shape of the target but has higher entropy/variance, and the hyperparameter β obtained the value 7.4. For α* = 0.4 the blue contours shrink a bit and β is reduced to 2.2 (since a higher acceptance rate requires smaller entropy). The third panel shows the exact 51 × 51 covariance matrix and the last panel shows the adapted one, after running our most efficient gadMALAf scheme for 2 × 10⁵ iterations.
In both experiments L was initialised to a diagonal matrix with 0.1/√n in the diagonal.

Figure 2: Panels in the first row show trace plots, obtained by different schemes, across the last 2 × 10⁴ sampling iterations for the most difficult-to-sample dimension x100. The panels in the second row show the estimated values of the diagonal of L obtained by the different adaptive schemes. Notice that the real Gaussian target has diagonal covariance matrix Σ = diag(s1², . . . , s100²), where the si are uniform in the range [0.01, 1].

we use RMSprop (see Section 2.1), where η was set to 0.00005 for gadRWM and to 0.00015 for the gadMALA schemes. NUTS uses its own fully automatic adaptive procedure that determines both the step size and the number of leapfrog steps [17]. For all experiments and algorithms (apart from NUTS) we consider 2 × 10⁴ burn-in iterations and 2 × 10⁴ iterations for collecting samples. The adaptation of L or σ² takes place only during the burn-in iterations and then stops, i.e. at the sample-collection stage these parameters are kept fixed. For NUTS, which has its own internal tuning procedure, 500 burn-in iterations are sufficient before collecting 2 × 10⁴ samples. The computational time reported for all algorithms in the tables corresponds to the overall running time, i.e. the time for performing jointly all burn-in and sample-collection iterations.
Neal's Gaussian target. We first consider an example used in [24], where the target is a zero-mean multivariate Gaussian with diagonal covariance matrix Σ = diag(s1², . . . , s100²), where the stds si take values in the linear grid 0.01, 0.02, . . . , 1.
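This target, its log density and its gradient are each one line; a numpy sketch following the description above:

```python
import numpy as np

s = np.linspace(0.01, 1.0, 100)            # stds 0.01, 0.02, ..., 1

def log_pi(x):                             # N(0, diag(s^2)), up to a constant
    return -0.5 * np.sum((x / s) ** 2)

def grad_log_pi(x):
    return -x / s**2

# A single isotropic step size must shrink towards the smallest std (0.01)
# to keep acceptance reasonable, so the x_100 coordinate (std 1) then needs
# on the order of 10^4 such steps to traverse its marginal.
```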
This is a challenging example because the different scaling of the dimensions means that schemes using an isotropic step σ^2 will be adapted to the smallest dimension x1, while the chain at the higher dimensions, such as x100, will be moving slowly, exhibiting high autocorrelation and small effective sample size. The first row of Figure 3 shows the trace plot across iterations of the dimension x100 for some of the adaptive schemes, including an HMC scheme that uses 20 leapfrog steps. Clearly, the gradient-based adaptive methods show much smaller autocorrelation than AM. This is because they achieve a more efficient adaptation of the Cholesky factor L, which ideally should become proportional to a diagonal matrix with the linear grid 0.01, 0.02, . . . , 1 in the main diagonal. The second row of Figure 3 shows the diagonal elements of L, from which we can observe that all gradient-based schemes lead to more accurate adaptation, with gadMALAf being the most accurate.

Furthermore, Table 1 provides quantitative results such as minimum, median and maximum ESS computed across all dimensions of the state vector x, running times and an overall efficiency score

Table 1: Comparison in Neal's Gaussian example (dimensionality was n = 100; see panel above) and the Caravan binary classification dataset, where the latter consists of 5822 data points (dimensionality was n = 87; see panel below). All numbers are averages across ten repeats, where also one standard deviation is given for the Min ESS/s score.
From the three HMC schemes we report only the best one in each case.

Method      Time(s)  Accept Rate  ESS (Min, Med, Max)           Min ESS/s (1 st.d.)
(Neal's Gaussian)
gadMALAf        8.7    0.556      (1413.4, 1987.4, 2580.8)      161.70 (15.07)
gadMALAe       10.0    0.541      (922.2, 2006.3, 2691.1)        92.34 (7.11)
gadRWM          7.0    0.254      (27.5, 66.9, 126.9)             3.95 (0.66)
AM              2.3    0.257      (8.7, 48.6, 829.1)              3.71 (0.87)
RWM             2.2    0.261      (2.9, 8.4, 2547.6)              1.31 (0.06)
MALA            3.1    0.530      (2.9, 10.0, 12489.2)            0.95 (0.03)
HMC-20         49.7    0.694      (306.1, 1537.8, 19732.4)        6.17 (3.35)
NUTS          360.5    >0.7       (18479.6, 20000.0, 20000.0)    51.28 (1.64)
(Caravan)
gadMALAf       23.1    0.621      (228.1, 750.3, 1114.7)          9.94 (2.64)
gadMALAe       95.1    0.494      (66.6, 508.3, 1442.7)           0.70 (0.16)
gadRWM         22.6    0.234      (5.3, 34.3, 104.5)              0.23 (0.06)
AM             20.0    0.257      (3.2, 11.8, 62.5)               0.16 (0.01)
RWM            15.3    0.242      (3.0, 9.3, 52.5)                0.20 (0.03)
MALA           22.8    0.543      (4.4, 28.3, 326.0)              0.19 (0.05)
HMC-10        225.5    0.711      (248.3, 2415.7, 19778.7)        1.10 (0.12)
NUTS         1412.1    >0.7       (7469.5, 20000.0, 20000.0)      5.29 (0.38)

Min ESS/s (i.e. the ESS for the slowest-moving component of x divided by running time – the last column in the table), which allows us to rank the different algorithms. All results are averages after repeating the simulations 10 times under different random initialisations. From the table it is clear that the gadMALA algorithms give the best performance, with gadMALAf being overall the most effective.

Bayesian logistic regression. We consider binary classification where, given a set of training examples {y_i, s_i}_{i=1}^n, we assume a logistic regression log-likelihood log p(y|w, s) = Σ_{i=1}^n y_i log σ(s_i) + (1 − y_i) log(1 − σ(s_i)), where σ(s_i) = 1/(1 + exp(−w⊤s_i)), s_i is the input vector and w the parameters. We place a Gaussian prior on w and we wish to sample from the posterior distribution over w.
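For reference, the log posterior and the gradient consumed by the MALA-type proposals take the following form. This is a generic sketch in our own notation, not the authors' code, and the unit-variance N(0, I) prior is an assumption for illustration:

```python
import numpy as np

def log_posterior(w, S, y):
    """Unnormalised log posterior for Bayesian logistic regression.

    S is the (n, d) matrix of inputs s_i, y the (n,) binary labels and
    w the (d,) parameter vector with an assumed N(0, I) Gaussian prior.
    """
    z = S @ w
    # y*log(sigma(z)) + (1-y)*log(1-sigma(z)) = y*z - log(1 + e^z),
    # computed stably via logaddexp.
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * w @ w
    return log_lik + log_prior

def grad_log_posterior(w, S, y):
    """Gradient of the log posterior: S^T (y - sigma(S w)) - w."""
    p = 1.0 / (1.0 + np.exp(-(S @ w)))
    return S.T @ (y - p) - w
```

Both functions cost one pass over the n data points per evaluation, which is why per-iteration cost (and hence Min ESS/s rather than raw ESS) is the relevant comparison on the larger Caravan dataset.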
We considered six binary classification datasets (Australian Credit, Heart, Pima Indian, Ripley, German Credit and Caravan), with the number of examples ranging from n = 250 to n = 5822 and the dimensionality of the state/parameter vector ranging from 3 to 87. Table 1 shows results for the most challenging Caravan dataset, where the dimensionality of w is 87, while the remaining five tables are given in the Supplement. From all tables we observe that gadMALAf is the most effective and outperforms all other methods. While NUTS always achieves very high ESS, it is still outperformed by gadMALAf because of its high computational cost, i.e. NUTS might need to use a very large number of leapfrog steps (each requiring re-evaluating the gradient of the log target) per iteration. Further results, including a higher 785-dimensional example on MNIST, are given in the Supplement.

5 Conclusion

We have presented a new framework for gradient-based adaptive MCMC that makes use of an entropy-regularised objective function that generalises the concept of speed measure. We have applied this method for learning RWM and MALA proposals with full covariance matrices.

Some topics for future research are the following. Firstly, to deal with very high dimensional spaces it would be useful to consider low-rank parametrisations of the covariance matrices in RWM and MALA proposals. Secondly, it would be interesting to investigate whether our method can be used to tune the so-called mass matrix in HMC samplers. However, in order for this to lead to practical and scalable algorithms we have to come up with schemes that avoid the computation of the Hessian, as we have successfully done for MALA.
Finally, in order to reduce the variance of the stochastic gradients and to further speed up the adaptation, especially in high dimensions, our framework could possibly be combined with parallel computing, as used for instance in deep reinforcement learning [12].

References

[1] Christophe Andrieu and Yves Atchade. On the efficiency of adaptive MCMC algorithms. Electronic Communications in Probability, 12:336–349, 2007.

[2] Christophe Andrieu and Éric Moulines. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.

[3] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, December 2008.

[4] Yves Atchade, Gersende Fort, Eric Moulines, and Pierre Priouret. Adaptive Markov chain Monte Carlo: theory and methods. Preprint, 2009.

[5] Yves F. Atchadé and Jeffrey S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms. Bernoulli, 11(5):815–828, 2005.

[6] Mylène Bédard. Weak convergence of Metropolis algorithms for non-iid target distributions. The Annals of Applied Probability, pages 1222–1244, 2007.

[7] Mylène Bédard. Efficient sampling using Metropolis algorithms: Applications of optimal scaling results. Journal of Computational and Graphical Statistics, 17(2):312–332, 2008.

[8] Mylène Bédard. Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234. Stochastic Processes and their Applications, 118(12):2198–2222, 2008.

[9] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.

[10] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[11] P. Carbonetto, M. King, and F.
Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems, 2009.

[12] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1407–1416, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[13] Paolo Giordani and Robert Kohn. Adaptive independent Metropolis–Hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010.

[14] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.

[15] Heikki Haario, Eero Saksman, and Johanna Tamminen. Componentwise adaptation for high dimensional MCMC. Computational Statistics, 20(2):265–273, 2005.

[16] Raza Habib and David Barber. Auxiliary variational MCMC. In International Conference on Learning Representations, 2019.

[17] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[18] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, USA, 1999.

[19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[20] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.

[21] Daniel Levy, Matt D. Hoffman, and Jascha Sohl-Dickstein.
Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.

[22] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[24] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

[25] Kirill Neklyudov, Pavel Shvechikov, and Dmitry Vetrov. Metropolis–Hastings view on variational inference and adversarial training. arXiv preprint arXiv:1810.07151, 2018.

[26] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.

[27] Cristian Pasarica and Andrew Gelman. Adaptively scaling the Metropolis algorithm using expected squared jumped distance. Statistica Sinica, pages 343–364, 2010.

[28] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.

[29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[30] Gareth O. Roberts, Andrew Gelman, and Walter R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.

[31] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.

[32] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various Metropolis–Hastings algorithms. Statistical Science, pages 351–367, 2001.

[33] Gareth O. Roberts and Jeffrey S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458–475, 2007.

[34] Gareth O. Roberts and Jeffrey S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.

[35] Jeffrey S. Rosenthal. Optimal proposal distributions and adaptive MCMC. In Handbook of Markov Chain Monte Carlo, pages 114–132. Chapman and Hall/CRC, 2011.

[36] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.

[37] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[38] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.

[39] T. Tieleman and G. Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4, 2012.

[40] M. K. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.