{"title": "Sample Adaptive MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 9066, "page_last": 9077, "abstract": "For MCMC methods like Metropolis-Hastings, tuning the proposal distribution is important in practice for effective sampling from the target distribution \\pi. In this paper, we present Sample Adaptive MCMC (SA-MCMC), a MCMC method based on a reversible Markov chain for \\pi^{\\otimes N} that uses an adaptive proposal distribution based on the current state of N points and a sequential substitution procedure with one new likelihood evaluation per iteration and at most one updated point each iteration. The SA-MCMC proposal distribution automatically adapts within its parametric family to best approximate the target distribution, so in contrast to many existing MCMC methods, SA-MCMC does not require any tuning of the proposal distribution. Instead, SA-MCMC only requires specifying the initial state of N points, which can often be chosen a priori, thereby automating the entire sampling procedure with no tuning required. Experimental results demonstrate the fast adaptation and effective sampling of SA-MCMC.", "full_text": "Sample Adaptive MCMC\n\nMichael H. Zhu\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\nmhzhu@cs.stanford.edu\n\nAbstract\n\nFor MCMC methods like Metropolis-Hastings, tuning the proposal distribution is\nimportant in practice for effective sampling from the target distribution \u03c0. In this\npaper, we present Sample Adaptive MCMC (SA-MCMC), a MCMC method based\non a reversible Markov chain for \u03c0\u2297N that uses an adaptive proposal distribution\nbased on the current state of N points and a sequential substitution procedure\nwith one new likelihood evaluation per iteration and at most one updated point\neach iteration. 
The SA-MCMC proposal distribution automatically adapts within\nits parametric family to best approximate the target distribution, so in contrast to\nmany existing MCMC methods, SA-MCMC does not require any tuning of the\nproposal distribution. Instead, SA-MCMC only requires specifying the initial state\nof N points, which can often be chosen a priori, thereby automating the entire\nsampling procedure with no tuning required. Experimental results demonstrate the\nfast adaptation and effective sampling of SA-MCMC.\n\n1\n\nIntroduction\n\nMarkov Chain Monte Carlo (MCMC) methods are a large class of sampling-based algorithms that\ncan be applied to solve integration problems in high-dimensional spaces [1]. The goal of MCMC\nmethods is to sample from a probability distribution \u03c0(\u03b8) (known up to some normalization constant)\nby constructing a Markov chain with limiting distribution \u03c0(\u03b8) that visits points \u03b8 with a frequency\nproportional to the corresponding probability \u03c0(\u03b8).\nFor MCMC methods like Metropolis-Hastings [2, 3], the choice of the proposal distribution q(\u00b7|\u03b8(k))\nis important in practice for effective sampling from the target distribution. Metropolis-Hastings\n(MH) is generally used with random walk proposals where local moves based on q(\u00b7|\u03b8(k)) are used\nto globally simulate the target distribution \u03c0(\u03b8). A suboptimal choice for the scale or shape of\nthe proposal can lead to inef\ufb01cient sampling, yet the design of an optimal proposal distribution is\nchallenging when the properties of the target distribution are unknown, especially in high-dimensional\nspaces.\nGelman et al. 
[4, 5] recommend a two-phase approach where the covariance matrix of the proposal distribution in phase 2 is proportional to the covariance of the posterior samples from phase 1. Adaptive MCMC methods such as Adaptive Metropolis [6] continually adapt the proposal distribution based on the entire history of past states. However, such a sampler is no longer based on a valid Markov chain, so the usual MCMC convergence theorems do not apply, and the validity of the sampler must be proved for each specific algorithm under specific technical assumptions [7, 8]. In this paper, we propose Sample Adaptive MCMC, a method whose adaptation depends only on the current state of N points and whose proposal is an adaptive approximation of the target distribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Related work\n\nOur substitution procedure is related to the Sample Metropolis-Hastings (SMH) algorithm by Liang et al. [9, ch. 5] and Lewandowski [10]. For N = 1, the SMH algorithm reduces to Metropolis-Hastings, and our substitution procedure reduces to the method of Barker [11]. The SMH algorithm has also been used by Martino et al. [12, 13] in the context of adaptive importance sampling, and with independent SMH proposals by Martino et al. [14], who propose a family of orthogonal parallel MCMC methods in which vertical MCMC chains run in parallel with random-walk proposals and share information through horizontal MCMC steps that span all of the chains using independent proposals.\nParallel tempering [15] runs parallel MCMC chains targeting the posterior distribution at different temperatures. Many previous works have studied MCMC methods which simulate from \u03c0\u2297N (a sample of size N from \u03c0). Early works include the Adaptive Direction Sampler by Gilks et al. 
[16], the Normal Kernel Coupler by Warnes [17], and the pinball sampler by Mengersen and Robert [18]. In the Normal Kernel Coupler, Warnes [17] first selects one of the N points in the state to update, uses a kernel density estimate constructed from the state of N points to propose a new point, and finally accepts or rejects the proposed swap according to the Metropolis-Hastings acceptance probability. Goodman and Weare [19] propose an ensemble MCMC sampler with affine invariance. Griffin and Walker [20] present a method for adaptation in MH by letting the joint density be the product of a proposal density and \u03c0\u2297N and then sampling this augmented density using a Gibbs sampler including a Metropolis step. Their work is related to the works of Cai et al. [21] and Keith et al. [22]. Leimkuhler et al. [23] propose an Ensemble Quasi-Newton sampler using gradient information based on time discretization of an SDE that can incorporate covariance information from the other walkers.\nThe Multiple-Try Metropolis method [24, 25] first proposes M potential candidates, randomly chooses one of the candidates with probability proportional to its weight to be the potential move, and finally accepts or rejects the move according to a generalized MH ratio. Neal et al. [26] propose a new Markov chain method for sampling from the posterior of a hidden state sequence in a non-linear dynamical system by first proposing a pool of candidate states and then using dynamic programming (DP) with an embedded HMM. Tjelmeland [27] describes a general framework for running MCMC with multiple proposals in each iteration and using all proposed states to estimate mean values. Neal [28] proposes an MCMC scheme which first stochastically maps the current state \u03b8 to an ensemble (\u03b81, . . . , \u03b8N), applies an MCMC update to the ensemble, and finally stochastically selects a single state. 
Calderhead [29] presents a general construction for parallelizing MH algorithms.\nPopulation Monte Carlo [30] is an iterated importance sampling scheme with a state of N points where the proposal distribution can be adapted for each point and at each iteration in any way and a resampling step based on the importance sampling weights is used to update the state. Adaptive Importance Sampling [31, 32, 33] represents a class of methods, including PMC, based on importance sampling with adaptive proposals. Our work is also inspired by particle filters [34, 35] and PMCMC [36] which combines standard MCMC methods with a particle filter based inner loop for joint parameter and state estimation in state-space models.\n\n3 Sample Adaptive MCMC\n\nWe now present the Sample Adaptive MCMC algorithm. Let p(\u03b8) be the target probability density, known up to some normalization constant, and let \u03c0(\u03b8) = p(\u03b8)/\u222b p(\u03b8\u2032)d\u03b8\u2032. The state of the SA-MCMC Markov chain consists of N points at each iteration. We denote the state at iteration k by S(k) = (\u03b8(k)_1, \u03b8(k)_2, . . . , \u03b8(k)_N). Define \u00b5(S) = (1/N) \u2211_{n=1}^N \u03b8n to be the mean of the N points in the state S. Define \u03a3(S) to be the sample covariance matrix of the N points in the state S. Optionally, we can also consider a diagonal approximation of \u03a3(S) that is non-zero along the diagonal and zero elsewhere. When proposing a new point \u03b8\u2032, the proposal distribution q(\u00b7|\u00b5(S(k)), \u03a3(S(k))) is a function of the mean and covariance of all N points in the current state S(k). In our experiments, we use a Gaussian or Gaussian scale-mixture family as our adaptive family of proposal distributions. After proposing \u03b8\u2032, the algorithm might reject the proposed point \u03b8\u2032 or substitute any of the N current points with \u03b8\u2032. For example, the algorithm might substitute \u03b8(k)_1 with \u03b8\u2032 so that the new state becomes S(k+1) = (\u03b8\u2032, \u03b8(k)_2, . . . , \u03b8(k)_N). The probabilities of substituting each of the N points with the proposed point and the probability of rejecting the proposed point are all constructed so that the stationary distribution of the SA-MCMC Markov chain is \u03c0\u2297N(\u03b81, . . . , \u03b8N) = \u220f_{n=1}^N \u03c0(\u03b8n).\n\nFigure 1: Illustration of one iteration of SA-MCMC for N = 3. After the proposed point \u03b8N+1 \u223c q(\u00b7|\u00b5(S), \u03a3(S)) is sampled, the sets S\u22121, . . . , S\u2212(N+1) are used to calculate the substitution probabilities \u03bb1, . . . , \u03bbN+1. One of the sets S\u22121, . . . , S\u2212(N+1) is chosen to be the next state with probability proportional to \u03bbn.\n\nAlgorithm 1 Sample Adaptive MCMC\nRequire: p(\u03b8), q0(\u00b7), q(\u00b7|\u00b5(S), \u03a3(S)), N, \u03ba, K\n1: Initialize S(0) \u2190 (\u03b81, . . . , \u03b8N) where \u03b8n \u223c q0(\u00b7) for n = 1, . . . , N\n2: for k = 1 to \u03ba + K do\n3: Let S = (\u03b81, . . . , \u03b8N) \u2190 S(k\u22121)\n4: Sample \u03b8N+1 \u223c q(\u00b7|\u00b5(S), \u03a3(S))\n5: Let S\u2212n \u2190 (S with \u03b8n replaced by \u03b8N+1) for n = 1, . . . , N. Let S\u2212(N+1) \u2190 S.\n6: Let \u03bbn \u2190 q(\u03b8n|\u00b5(S\u2212n), \u03a3(S\u2212n))/p(\u03b8n) for n = 1, . . . , N + 1\n7: Sample j \u223c J with P[J = n] = \u03bbn / \u2211_{i=1}^{N+1} \u03bbi for 1 \u2264 n \u2264 N + 1\n8: Let S(k) \u2190 S\u2212j\n9: end for\n10: Return \u222a_{k=\u03ba+1,...,\u03ba+K} S(k)\n\nThe SA-MCMC algorithm is presented in Algorithm 1 and illustrated in Figure 1. The initialization distribution for initializing the N points is q0(\u00b7). The sets S\u2212n, with \u03b8n replaced by the proposed point, are the N + 1 possibilities for the next state depending on which of the current N points gets replaced, if any. One of the sets S\u22121, . . . , S\u2212(N+1) is chosen to be the next state with probability proportional to \u03bbn. The number of burn-in iterations is \u03ba, and the number of estimation iterations is K. For any function h(\u03b8) satisfying \u222b |h(\u03b8)| \u03c0(\u03b8)d\u03b8 < \u221e, we can estimate \u222b h(\u03b8)\u03c0(\u03b8)d\u03b8 by the sample average (1/K) \u2211_{k=\u03ba+1}^{\u03ba+K} (1/N) \u2211_{n=1}^N h(\u03b8(k)_n).\nThe likelihood ratio q(\u03b8n|\u00b5(S\u2212n), \u03a3(S\u2212n))/p(\u03b8n) used to compute the substitution probability \u03bbn corresponds to the inverse of the importance weight used in Metropolis-Hastings, p(\u03b8\u2032)/q(\u03b8\u2032|\u03b8(k)). Thus, points with low importance weight (i.e. points with low likelihood under the target distribution relative to the proposal) are likely to be replaced by points with higher importance weight. Since the proposed point is compared with the N points in the state before deciding which point to remove, this generally leads to higher acceptance rates compared to having a state with only one point, which is advantageous in problems where evaluating the target density is computationally expensive.\nThe initialization distribution q0 determines the initial positions of the N points and also the initial mean and scale structure of the proposal distribution. In practice, q0 can be chosen either based on intuition about where the parameters are likely to have high probability under the target distribution or by hyperparameter search. Note that while we must choose q0 carefully depending on the problem, we do not have to tune the proposal distribution. 
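A full iteration of Algorithm 1 is compact enough to sketch directly. The following is an illustrative NumPy version with an adaptive diagonal-Gaussian proposal, working in log space for numerical stability; it is a minimal sketch, not the fast implementation referenced in Appendix 6, and the N(3, 1) target and all constants in the usage example are our own illustrative choices:

```python
import numpy as np

def sa_mcmc(log_p, q0_sample, N, kappa, K, rng):
    """Minimal SA-MCMC sketch (Algorithm 1) with a diagonal-Gaussian
    adaptive proposal fit to the mean and variance of the current state."""
    S = np.stack([q0_sample(rng) for _ in range(N)])   # state: N points, shape (N, d)
    d = S.shape[1]
    out = []
    for k in range(kappa + K):
        mu, var = S.mean(axis=0), S.var(axis=0, ddof=1)
        theta_new = mu + np.sqrt(var) * rng.standard_normal(d)   # theta_{N+1}
        log_lam = np.empty(N + 1)
        for n in range(N):
            # lambda_n = q(theta_n | mu(S_-n), Sigma(S_-n)) / p(theta_n)
            S_minus = S.copy()
            S_minus[n] = theta_new
            mu_n, var_n = S_minus.mean(axis=0), S_minus.var(axis=0, ddof=1)
            log_q = -0.5 * np.sum(np.log(2 * np.pi * var_n) + (S[n] - mu_n) ** 2 / var_n)
            log_lam[n] = log_q - log_p(S[n])
        # n = N + 1 corresponds to S_-(N+1) = S, i.e. rejecting the proposal
        log_q = -0.5 * np.sum(np.log(2 * np.pi * var) + (theta_new - mu) ** 2 / var)
        log_lam[N] = log_q - log_p(theta_new)
        w = np.exp(log_lam - log_lam.max())            # stable normalization
        j = rng.choice(N + 1, p=w / w.sum())           # P[J = n] proportional to lambda_n
        if j < N:
            S[j] = theta_new                           # substitute at most one point
        if k >= kappa:                                 # keep only estimation iterations
            out.append(S.copy())
    return np.concatenate(out)

# Usage: sample from an unnormalized N(3, 1) target in 1D.
rng = np.random.default_rng(1)
draws = sa_mcmc(log_p=lambda th: -0.5 * np.sum((th - 3.0) ** 2),
                q0_sample=lambda r: r.standard_normal(1),
                N=20, kappa=500, K=2000, rng=rng)
```

Because log_p only needs to be known up to an additive constant, the normalization of the target cancels out of the substitution probabilities.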
In practice, this usually makes SA-MCMC easier to use since an optimal scale structure for the initialization distribution q0 in SA-MCMC is generally more intuitive than the optimal scale structure (step size) of the random walk proposal in Metropolis-Hastings. In many cases, an optimal choice for q0 can even be chosen a priori based on knowledge of the data and the model, thereby automating the entire sampling procedure with no tuning required.\n\nThe state of N points, the adaptive proposal distribution q(\u00b7|\u00b5(S), \u03a3(S)) based on the current state, and the substitution procedure enable both fast adaptation of the proposal distribution and effective sampling from the target distribution. We first explain how SA-MCMC can transition quickly from its initial state of N points to an empirical representation of the target distribution during the burn-in phase, as illustrated in Figure 2. For example, consider the case where some or even all N points are initialized far away from the high-probability region of the target distribution. In this case, the points \u03b8n farthest away from the target distribution will have a much smaller value for p(\u03b8n), leading to large \u03bbn and a high probability of substitution. 
As the points farthest away from the target distribution are replaced, the N points in the state and the corresponding adaptive proposal distribution gradually narrow in on or shift towards the high-probability region of the target distribution. As another example, consider the case where the initial mean is specified correctly but the initial variance is too small. In this case, the points in the center have a high probability of substitution, since they have a much larger value for q(\u03b8n) than points in the tails (while the values for p(\u03b8n) are comparable), so the variance of the N points in the state gradually increases to the variance of the target distribution. Thus, we see that the form of the substitution probability enables the initial state of N points to adapt to the target distribution under many different initial conditions.\nAfter this burn-in phase, the N points in the state form an empirical representation of the target distribution, and the proposal distribution approximates the target distribution. Using our form for the proposal distribution, when the N points in the state represent a mode of the target distribution, \u00b5(S) approximates the mean and \u03a3(S) approximates the covariance structure of the mode of the target distribution. As the shape of the proposal distribution approximates the shape of the mode of the target distribution, this enables very effective sampling. 
Since the substitution probability \u03bbn is an inverse importance weight, p(\u03b8n) favors keeping points closer to the mode of the target distribution while q(\u03b8n|\u00b5(S\u2212n), \u03a3(S\u2212n)) favors keeping points farther away from the mode of the proposal distribution relative to its covariance \u03a3(S\u2212n), balancing each other to ensure that the N points in the state are distributed according to and are approximate samples from the target distribution.\n\nTheory Let \u03c0(\u03b8) = p(\u03b8)/\u222b p(\u03b8\u2032)d\u03b8\u2032 be the target density. Proposition 1 demonstrates that the SA-MCMC Markov chain satisfies the detailed balance condition with respect to \u03c0\u2297N(\u03b81, . . . , \u03b8N) = \u220f_{n=1}^N \u03c0(\u03b8n), thus establishing \u03c0\u2297N as the stationary density of the chain. We then prove that under general conditions on the target distribution and a family of proposal distributions with diagonal covariance matrices, SA-MCMC using a diagonal covariance matrix is ergodic, allowing us to prove convergence in total variation norm to \u03c0\u2297N and the law of large numbers for estimating expectations with respect to \u03c0 by sample averages. The convergence guarantees for SA-MCMC are proven under the same assumptions on the target distribution as for Metropolis-Hastings. Theorem 1 is based on the theorem of Athreya et al. [37] and the textbook by Robert and Casella [38]. The proofs are given in Appendix 1 and 2. We note that our detailed balance proof is closely related to the detailed balance proof for Sample Metropolis-Hastings given by Lewandowski [10] and Martino et al. [12].\n\nProposition 1. The SA-MCMC Markov chain from Algorithm 1 with target density \u03c0(\u03b8) = p(\u03b8)/\u222b p(\u03b8\u2032)d\u03b8\u2032 satisfies the detailed balance condition with respect to \u03c0\u2297N(\u03b81, . . . , \u03b8N) = \u220f_{n=1}^N \u03c0(\u03b8n). 
Hence, \u03c0\u2297N is the stationary density of the chain, and the chain is reversible.\n\nTheorem 1. Let {S(k)} be the SA-MCMC Markov chain with diagonal covariance matrix from Algorithm 1 with target density \u03c0(\u03b8) = p(\u03b8)/\u222b p(\u03b8\u2032)d\u03b8\u2032, proposal density q(\u00b7|\u00b5(s), diag(\u03a3(s))), and N \u2265 3. Denote the conditional density of S(k) given S(0) by fk(\u00b7|\u00b7). Let h(\u03b8) be any function satisfying \u222b |h(\u03b8)| \u03c0(\u03b8)d\u03b8 < \u221e. If\n(A1) \u03c0 is bounded and positive on every compact set of its support E \u2286 R^d, and\n(A2) for all a, b, \u03b4 > 0, there exist \u03b5_1, \u03b5_2 > 0 such that if a < \u03c3_j < b and |x_j \u2212 \u00b5_j| < \u03b4 for j \u2208 1, . . . , d, then \u03b5_1 < q(x | \u00b5, diag(\u03c3^2)) < \u03b5_2,\nthen the SA-MCMC Markov chain is ergodic, and\n(1) lim_{K\u2192\u221e} sup_C |\u222b_C f_K(s|s0)ds \u2212 \u222b_C \u03c0\u2297N(s)ds| = 0 for [\u03c0\u2297N]-almost all s0, and\n(2) P_{s0}[ lim_{K\u2192\u221e} (1/K) \u2211_{k=1}^K (1/N) \u2211_{n=1}^N h(\u03b8(k)_n) = \u222b h(\u03b8)\u03c0(\u03b8)d\u03b8 ] = 1 for [\u03c0\u2297N]-almost all s0.\n\nRemark 1. The convergence result for Metropolis-Hastings can be proved under the assumptions (A1) and \u2203\u03b5, \u03b4 > 0 such that if \u2016x \u2212 y\u2016 < \u03b4, then q(y|x) > \u03b5 [39, 38]. (A2) is a generalization for a family of proposal distributions with different means and scales (e.g. location-scale families).\n\nOur next theorem is stated in a more general form for a proposal distribution q(\u00b7|\u03b3(S)). We prove the uniform ergodicity of SA-MCMC assuming q(\u03b8|\u03b3)/\u03c0(\u03b8) is bounded above and below. The proof is based on the proof of Lemma 1 in the working paper by Chan and Lai [40] and is in Appendix 3.\n\nTheorem 2. Let \u03c0 be a positive target density on the parameter space \u0398, and let {q(\u00b7|\u03b3) : \u03b3 \u2208 \u0393} be a family of positive proposal densities, with \u0393 a convex Euclidean set. Let \u03bb(\u03b8|\u03b3) = q(\u03b8|\u03b3)/\u03c0(\u03b8), and let the proposal density be q(\u00b7|\u03b3(S)), where \u03b3(S) = N^{\u22121} \u2211_{\u03b8\u2208S} \u03b3(\u03b8) for some continuous \u03b3 : \u0398 \u2192 \u0393. Let f_k denote the joint densities of (\u03b8(k)_1, . . . , \u03b8(k)_N) and \u03c0\u2297N the product density of \u03c0 on \u0398^N. If there exist constants 0 < a < b < \u221e such that a \u2264 \u03bb(\u03b8|\u03b3) \u2264 b for all \u03b8 \u2208 \u0398 and \u03b3 \u2208 \u0393, then \u2016f_k \u2212 \u03c0\u2297N\u2016_TV \u2264 2(1 \u2212 C)^{\u230ak/N\u230b} for C = N! (a/((N+1)b))^N a^N.\n\nMengersen and Tweedie [41] prove that the Independent MH (IMH) algorithm with independent proposal distribution q(\u00b7) is uniformly ergodic if there exists a constant \u03b1 > 0 such that q(\u03b8)/\u03c0(\u03b8) \u2265 \u03b1 for all \u03b8 \u2208 \u0398, in which case \u2016F^k(\u03b8(0), \u00b7) \u2212 \u03c0\u2016_TV \u2264 2(1 \u2212 \u03b1)^k. Holden et al. [42] propose an Adaptive Independent MH (AIMH) algorithm where the proposal distribution q_k(\u00b7|h_{k\u22121}) at iteration k depends on the history h_{k\u22121}. To preserve the invariance of the sampler, h_k must be constructed from h_{k\u22121} by appending the previous state of the chain if the transition is accepted and the rejected proposed point if the transition is rejected. They prove that the convergence is geometric if there exists a constant \u03b1 > 0 such that q_k(\u03b8|h_{k\u22121})/\u03c0(\u03b8) \u2265 \u03b1 for all \u03b8, h_{k\u22121}, k. 
Uniform ergodicity is proven by lower bounding the one-step probability of transitioning to the target density each iteration.\nThe conditions above for IMH and AIMH essentially require that the proposal densities have uniformly heavier tails than the target. We note that a mixture proposal distribution, with the main distribution and a fat-tailed distribution with a small mixing proportion, can be used as a safeguard to guarantee this lower bound [43, 44, 45]. For our algorithm, in practice, we observe that this condition (i.e. \u03bb(\u03b8|\u03b3) \u2265 a) is also crucial for SA-MCMC. In the proof and the corresponding bound, this condition corresponds to the term a^N in C. Formally, our proof also requires the assumption \u03bb(\u03b8|\u03b3) \u2264 b to lower bound the acceptance probability of N substitutions by the term (a/((N+1)b))^N in C corresponding to Line 7 in Algorithm 1. In practice, we find that this condition is not necessary for effective sampling. A proposed point with significantly larger \u03bb is unlikely to be accepted in the first place, and if accepted, the point is likely to be replaced quickly, so the worst-case bound of (N+1)b in the denominator likely understates the practical performance of the algorithm. To support this conclusion, we conduct extensive experiments on t-distributions with different degrees of freedom in Appendix 4 and observe that only the assumption \u03bb(\u03b8|\u03b3) \u2265 a is necessary in practice.\nWhile IMH and AIMH have weaker assumptions for uniform ergodicity in theory, we note that IMH and AIMH fail to work for any of the examples in our paper since they are not adaptive enough for an independent proposal distribution to work in high-dimensional spaces. 
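The fat-tailed mixture safeguard mentioned above can be sketched as follows; mixing a small-probability wide Gaussian into the main proposal keeps q(\u03b8)/\u03c0(\u03b8) bounded below over a much wider region. The component form and the eps and tail_scale values are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def defensive_logpdf(x, mu, var, eps=0.05, tail_scale=10.0):
    """Log density of a defensive mixture proposal:
    (1 - eps) * N(mu, diag(var)) + eps * N(mu, tail_scale^2 * diag(var))."""
    def gauss_logpdf(v):
        return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mu) ** 2 / v)
    return np.logaddexp(np.log1p(-eps) + gauss_logpdf(var),
                        np.log(eps) + gauss_logpdf(tail_scale ** 2 * var))

def defensive_sample(mu, var, rng, eps=0.05, tail_scale=10.0):
    """Draw from the mixture: pick the wide component with probability eps."""
    scale = tail_scale if rng.random() < eps else 1.0
    return mu + scale * np.sqrt(var) * rng.standard_normal(mu.shape)
```

Far from mu the wide component dominates, so the mixture density decays at the tail_scale rate rather than at the rate of the narrow main component.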
We elaborate in Appendix 5.\n\nImplementation A fast, numerically stable implementation of SA-MCMC is given in Appendix 6.\n\n4 Experimental results\n\nWe first illustrate the adaptive nature of SA-MCMC on toy 1D distributions and then present experimental results for the Bayesian linear regression and Bayesian logistic regression models. Our goal is to sample from the posterior distribution p(\u03b8|y) of the parameters \u03b8 given the data y. We assume only that we can compute p(\u03b8|y) for any \u03b8 up to a normalization constant; we do not assume any other information or structure of the model as in the setup for Metropolis-Hastings. We will compare the performance of Metropolis-Hastings (MH), Adaptive Metropolis (AM), Multiple-Try Metropolis (MTM), and SA-MCMC (SA). As a benchmark, we also compare to the No-U-Turn Sampler (NUTS) [46] using the implementation in RStan 2.19.2 [47], which is a state-of-the-art Hamiltonian Monte Carlo (HMC) method [48]. Note that unlike all of the other MCMC methods in this paper, NUTS uses the gradient of the target density at every step and is based on discretizations of continuous-time stochastic dynamics.\n\nFigure 2: Adaptation of the SA-MCMC proposal distribution (green) to three target distributions (red).\n\nFigure 3: Bayesian linear regression. (left) Comparison of ESS/second for each parameter. 
(right) Standard deviation of the SA proposal distribution (blue bar), averaged over iterations, for each parameter compared with the ground truth posterior standard deviation (black line).\n\nWe now describe the experimental setup. For Metropolis-Hastings (MH), we use an isotropic normal distribution as the proposal distribution, qMH(\u00b7|\u03b8) = N(\u03b8, \u03c3_{q,MH}^2 I), with scale parameter \u03c3_{q,MH}, and initialize \u03b8(0) \u223c q0,MH(\u00b7) = N(0, \u03c3_{q,MH}^2 I). We tune \u03c3_{q,MH} to make the acceptance rate close to the optimal value of 23% [49]. For Adaptive Metropolis (AM), we use the optimal MH proposal distribution during the burn-in (non-adaptive) phase and then use the proposal distribution qAM(\u00b7|\u03b8(1), . . . , \u03b8(k\u22121)) = N(\u03b8(k\u22121), s_{AM}^2 \u03a3(k\u22121)) at iteration k with scale parameter s_{AM} and sample covariance matrix \u03a3(k\u22121) of the past samples (\u03b8(1), . . . , \u03b8(k\u22121)). We tune s_{AM} to make the acceptance rate close to the optimal value of 23%. For Multiple-Try Metropolis (MTM), we use the optimal MH proposal distribution with 3 tries. Finally, for SA-MCMC (SA), we use q0,SA(\u00b7) = N(0, \u03c3_{q0,SA}^2 I) with scale parameter \u03c3_{q0,SA} as the distribution for initializing the N starting points. For the proposal distribution, when using the full covariance matrix, we use the Gaussian family q(\u00b7|\u00b5(S), \u03a3(S)) = N(\u00b7|\u00b5(S), \u03a3(S)). When using the diagonal covariance matrix, we use a Gaussian scale-mixture family q(\u00b7|\u00b5(S), \u03a3(S)) = \u2211_i p_i N(\u00b7|\u00b5(S), c_i diag(\u03a3(S))) with c = [1/2, 1, 2] and p = [1/3, 1/3, 1/3], which we observed works better empirically for logistic regression.\nFor each of the MCMC methods, we run 16 chains to assess convergence and calculate Effective Sample Size (ESS) divided by the total running time in seconds. For each chain, we run 100,000 burn-in iterations and then collect 1,000,000 samples. For NUTS, we use 10,000 burn-in iterations and 100,000 samples. To assess convergence, we calculate the Gelman and Rubin potential scale reduction statistic, R\u0302, for each dimension and ensure that all of the R\u0302 values are close to 1 [50]. We calculate ESS for each dimension using samples from all of our chains following Gelman et al. [5]. In our experiments, we compute R\u0302 and ESS using RStan [47]. Since SA-MCMC has a state consisting of N points, we compute ESS for SA-MCMC as N times the effective sample size of the history of the mean of the N points as in Goodman and Weare [19, p. 73-74]. Our experiments and timing are done on an Intel Xeon E5-2640v3 using Julia v0.64 [51], except for NUTS which uses Stan C++.\n\nToy 1D examples We first demonstrate the adaptive nature of SA-MCMC in three different cases in Figure 2. In the first example, the target distribution is N(0, 1) and our proposal distribution is N(\u221210, 10^2). Even though our guess of the mean is far away from the true mean, SA-MCMC is able to quickly hone in on the high-probability region of the target distribution. In the second example, the target is N(0, 3^2) and our proposal is N(\u22124, 1). Though we start with an incorrect mean and an underestimate of the variance, SA-MCMC is able to adapt to the target. In the third example, the target is N(0, 1) and our proposal is N(\u22125, 1). Even when there is little overlap in the densities of the proposal and the target, the proposal is able to move to the target. 
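The SA-MCMC ESS convention described above can be sketched as follows. The single-chain autocorrelation estimator below uses Geyer's initial-positive-sequence truncation and is a simplification of the multi-chain RStan estimator used in the paper; only the convention (N times the ESS of the history of the state mean) is the point here:

```python
import numpy as np

def ess(x):
    """Simplified effective sample size of a 1D chain via Geyer's
    initial-positive-sequence truncation of the autocorrelations."""
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # lags 0..n-1
    rho = acov / acov[0]
    s = 0.0
    for t in range(1, n - 1, 2):
        pair = rho[t] + rho[t + 1]   # sum of adjacent autocorrelations
        if pair < 0:                 # truncate at the first negative pair
            break
        s += pair
    return n / (1.0 + 2.0 * s)

def sa_mcmc_ess(states):
    """SA-MCMC ESS per dimension: N times the ESS of the history of the
    mean of the N points (states has shape (K, N, d))."""
    K, N, d = states.shape
    means = states.mean(axis=1)      # (K, d) history of the state mean
    return np.array([N * ess(means[:, j]) for j in range(d)])
```

For nearly independent states the estimate approaches N times the number of estimation iterations, which is why high acceptance rates translate into high ESS under this convention.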
The adaptivity of SA-MCMC demonstrated here enables tuning-free MCMC, as SA-MCMC can quickly transition from its initial state of N points to an empirical representation of the target distribution during the burn-in phase.\n\nBayesian linear regression We consider a Bayesian linear regression model where the regression parameters have i.i.d. Laplace priors. To study the adaptivity of SA-MCMC, we generate a synthetic dataset where the posterior standard deviation of the regression parameters varies. The true regression parameters \u03b2 are sampled from i.i.d. Laplace(0, 1) priors. For the feature matrix X, each entry of column j \u2265 1 is sampled i.i.d. from N(0, (j + 1)^2/4). The dependent variables are generated with a high noise level as y \u223c N(X\u03b2 + \u03b20, 10^2). For our experiment, we consider 10 regression parameters and a dataset of 10,000 points with an 80%/20% train/test split.\n\nTable 1: Comparison of ESS/second for Bayesian logistic regression on (top) 11-dim MNIST 7s vs 9s using 10 features computed with PCA (bottom) 7-dim adult census income\n\n(top: MNIST 7s vs 9s) | MH | MTM | AM (diag) | AM (full) | SA (diag) | SA (full) | NUTS\nmin(ESS)/s | 13 | 5 | 17 | 37 | 23 | 278 | 54\nmedian(ESS)/s | 21 | 9 | 23 | 38 | 52 | 290 | 105\ns/chain | 733 | 3651 | 734 | 742 | 782 | 1112 | 1160\nHyperparameters | q=.02 | q=.02, M=3 | q=.02, s=.6 | q=.02, s=.7 | q0=1, N=40 | q0=1, N=150 | Stan\nAcceptance rate | 23% | 48% | 24% | 26% | 75% | 98.9% | \u2014\n(bottom: adult census income) | MH | MTM | AM (diag) | AM (full) | SA (diag) | SA (full) | NUTS\nmin(ESS)/s | 1.4 | 0.6 | 13 | 16 | 67 | 151 | 40\nmedian(ESS)/s | 17 | 7 | 15 | 17 | 89 | 158 | 49\ns/chain | 2198 | 10951 | 2205 | 2217 | 2283 | 2509 | 2989\nHyperparameters | q=.016 | q=.016, M=3 | q=.016, s=.8 | q=.016, s=.85 | q0=1, N=40 | q0=1, N=150 | Stan\nAcceptance rate | 26% | 52% | 21% | 24% | 89% | 99.2% | \u2014\n\nThe ESS/second for each regression parameter is presented in Figure 3 (left) for each MCMC method. We use SA and AM with diagonal covariance matrices for this experiment. 
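The synthetic dataset described above can be generated as follows; the Laplace(0, 1) draw for the intercept \u03b20 is our assumption, since the paper does not state its prior explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 10

# True regression parameters from i.i.d. Laplace(0, 1) priors
beta = rng.laplace(loc=0.0, scale=1.0, size=d)
beta0 = rng.laplace(loc=0.0, scale=1.0)   # intercept (prior is our assumption)

# Each entry of column j (1-indexed) is i.i.d. N(0, (j + 1)^2 / 4), so later
# columns have larger scale and hence smaller posterior standard deviation
col_sd = np.arange(2, d + 2) / 2.0        # (j + 1) / 2 for j = 1..d
X = rng.standard_normal((n, d)) * col_sd

# High-noise responses: y ~ N(X beta + beta0, 10^2)
y = X @ beta + beta0 + 10.0 * rng.standard_normal(n)

# 80%/20% train/test split
split = int(0.8 * n)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]
```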
The hyperparameters are\n(MH) q=.03; (MTM) q=.03, M=3; (AM) q=.03, s=.7; (SA) q0=1, N=40. SA-MCMC achieves very\nhigh ESS/second as the SA multivariate Gaussian proposal distribution adapts within its parametric\nfamily to match the posterior standard deviation of each regression parameter, as shown in Figure 3\n(right). For this reason, the ESS of SA is nearly constant across the regression parameters. AM and\nNUTS are also able to adapt for this problem. Since MH and MTM only use a single scale parameter\nfor the proposal distribution and cannot adapt, MH and MTM are very inef\ufb01cient in sampling certain\ncoordinates. When comparing min(ESS)/second, MH\u2019s is 62, MTM\u2019s is 25, AM\u2019s is 387, NUTS\u2019s\nis 365, and SA\u2019s is 2329. Under this metric, SA is 6x more ef\ufb01cient than AM, 6.4x than NUTS,\n38x than MH, and 94x than MTM. The average running time in seconds for each chain is (MH) 66;\n(MTM) 341; (AM) 70; (NUTS) 493; (SA) 114. Finally, we emphasize that no tuning is required for\nSA since a Gaussian initialization distribution with standard deviation of 1 suf\ufb01ces.\n\nBayesian logistic regression We consider a Bayesian logistic regression model for binary classi-\n\ufb01cation where the prior on the regression coef\ufb01cients is Gaussian. For our experiments, we use a\nstandard multivariate Gaussian as the prior. We \ufb01rst present results on two large-scale, real-world\ndatasets: classifying digits 7 vs. 9 on the MNIST dataset, and predicting whether an adult\u2019s income\nexceeds $50K/year based on the census income dataset from the UCI repository [52]. The MNIST\ntraining set consists of 12,214 images, and after scaling the pixel values to the range [0, 1], we reduce\nthe dimensionality of the image from 784 to 10 using PCA similar to Korattikara et al. [53]. The\nresulting classi\ufb01cation accuracy is around 93%. 
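For reference, the unnormalized log posterior that every sampler in these logistic regression experiments evaluates can be sketched as follows (binary labels y in {0, 1} and the standard multivariate Gaussian prior stated above; absorbing the bias into a column of ones in X is our convention):

```python
import numpy as np

def log_posterior(theta, X, y):
    """Unnormalized log posterior for Bayesian logistic regression with a
    standard Gaussian prior; X is (n, d), theta is (d,), y is in {0, 1}."""
    logits = X @ theta
    # Bernoulli log-likelihood with a numerically stable log(1 + exp(.))
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.sum(theta ** 2)   # N(0, I) prior, up to a constant
    return log_lik + log_prior
```

This is exactly the black-box access assumed in the experiments: the density is available only up to a normalization constant, and no gradients are required except for NUTS.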
The census income training set has 32,561 data points, and we use 6 continuous features as predictors (we exclude fnlwgt and include gender). We standardize each feature in the feature matrix to zero mean and unit variance. Visualizations of the posterior distributions are presented in Appendix 7.

The ESS/second results, as well as hyperparameters and acceptance rates, for each MCMC method are presented in Table 1. Overall, the high acceptance rates of 98.9% and 99.2% for SA-MCMC using the full covariance matrix indicate that the posterior distributions are approximated well by Gaussian distributions that can be captured by the adaptive proposal family, leading to high ESS/s for SA-MCMC. For the MNIST dataset, when comparing min(ESS)/second, we see that SA (full) is 5.2x more efficient than NUTS, 7.6x than AM (full), 21x than MH, and 52x than MTM. Since a few dimensions of the posterior are highly correlated, using a full covariance matrix for AM and SA improves ESS. NUTS is adversely affected by the high correlation, and its min(ESS) is around half of its median(ESS). For the census income dataset, when comparing min(ESS)/second, we see that SA (full) is 3.8x more efficient than NUTS, 9.4x than AM (full), 106x than MH, and 263x than MTM. MH and MTM are extremely inefficient in this case because these 2 algorithms are non-adaptive and one of the regression coefficients has a posterior standard deviation of 0.072 while the other 6 regression coefficients have posterior standard deviations of 0.013-0.020. Thus, we see a real-world example of the scenario we presented with Bayesian linear regression.

Figure 4: Plot of ESS/s and acceptance rate for SA-MCMC (full) versus N on MNIST. [Figure omitted: two panels plot median and min ESS/second and acceptance rate against N from 50 to 300.]

Figure 5: Impact of MCMC hyperparameter on ESS for MNIST. The ratio ESS(h)/ESS(1) measures the drop in ESS using 0.02h for q in MH, 0.7h for s in AM, and 1h for q0 in SA. [Figure omitted: ESS(h)/ESS(1) against the multiplier h for SA (full), AM (full), and MH on log-log axes.]

In Figure 4, we plot ESS/s and acceptance rate for SA-MCMC (full) as a function of N for the MNIST dataset. Note that the acceptance rate approaches 1 as N increases. The ESS/s also increases as we increase N from 40 to 150 with a similar curvature as the acceptance rate plot. Past N = 200, the ESS/s starts to slowly decline. In Figure 5, we study the impact of MCMC hyperparameter tuning on ESS for the MNIST dataset. We define ESS(1) to be the median ESS using the optimal hyperparameter in Table 1: 0.02 for q in MH, 0.7 for s in AM (full), and 1 for q0 in SA (full). We define ESS(h) to be the median ESS using the hyperparameter 0.02h for q in MH, 0.7h for s in AM (full), and 1h for q0 in SA (full) and plot the ratio ESS(h)/ESS(1) as we vary h. In this experiment, we use N = 150 and 500k burn-in iterations followed by 1 million estimation iterations. For any value of q0 from 10⁻³ to 10¹, SA-MCMC can adapt perfectly to the target distribution during the burn-in phase and maintains optimal ESS. In contrast, both AM and MH suffer from suboptimal hyperparameters with ESS dropping significantly. In Appendix 8, we present results for SA-MCMC and NUTS on MNIST across a range of dimensions.
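Throughout, ESS refers to the standard autocorrelation-based effective sample size. As a rough illustration of the metric, the following is a minimal single-chain estimator in the spirit of Geyer's initial-sequence truncation; it is a simplification of what tools such as Stan actually compute, and the function name and truncation rule are ours:

```python
import numpy as np

def ess(x):
    """Effective sample size of a 1-D chain via autocorrelations,
    truncated in the spirit of Geyer's initial positive sequence."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariances via FFT (biased estimator), normalized to autocorrelations.
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conjugate(f))[:n] / n
    rho = acov / acov[0]
    # Integrated autocorrelation time: sum consecutive pairs while positive.
    tau = 1.0
    for t in range(1, n // 2):
        pair = rho[2 * t - 1] + rho[2 * t]
        if pair < 0:
            break
        tau += 2 * pair
    return n / tau
```

For i.i.d. draws the estimate is close to the chain length n, while a strongly autocorrelated chain (e.g. an AR(1) process with coefficient 0.9) yields an ESS roughly 19x smaller, which is why ESS/second rather than raw iterations/second is the comparison metric in the tables.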
When comparing minimum ESS/second, we note that SA-MCMC (full) outperforms NUTS up to dimension 50 on MNIST.

Finally, we present results on two higher-dimensional, large-scale datasets: predicting forest cover type from cartographic variables using the covtype.binary dataset,¹ and distinguishing electron neutrinos (signal) from muon neutrinos (background) based on the MiniBooNE dataset from the UCI repository [52]. The covtype dataset has a total of 581,012 data points, and we use an 80% training and 20% test split. There are 54 features in total, with 10 real-valued features and 44 binary features. The MiniBooNE dataset has 130,065 data points and 50 real-valued features. For MiniBooNE, we normalize each feature to zero mean and unit variance. The covtype and MiniBooNE datasets lead to extremely challenging sampling problems. The condition number of the posterior covariance matrix is around 340,000 for covtype and 140,000 for MiniBooNE.

For this experiment, we first run Newton's method to obtain a point estimate of the posterior mode and then initialize MH, AM (full), and SA (full) around this point estimate. Specifically, if we let θ̃ be the point estimate, then we initialize θ(0) ∼ q0,MH(·) = N(θ̃, σ²_{q,MH} I) for MH and θ(0) ∼ q0,SA(·) = N(θ̃, σ²_{q0,SA} I) for SA. MH with an isotropic normal distribution as the proposal distribution is not able to sample all of the dimensions effectively, with ESS and R_hat detecting non-convergence in several dimensions. Since NUTS with the default options in Stan does not work well for this problem, we run NUTS with a dense mass matrix instead of a diagonal mass matrix. The ESS/second results are presented in Table 2. When comparing min(ESS)/s, SA outperforms AM by 31x on covtype and 11x on MiniBooNE and outperforms NUTS by 24x on covtype and 147x on MiniBooNE.
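The mode-based initialization just described can be sketched for logistic regression with a standard multivariate normal prior. This is our own generic implementation of the recipe; the paper's exact Newton setup and the helper names here are assumptions:

```python
import numpy as np

def map_estimate(X, y, n_iter=25):
    """Newton's method for the posterior mode of logistic regression
    with a standard multivariate normal prior; y has entries in {0, 1}."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p) - theta            # gradient of the log-posterior
        W = p * (1.0 - p)
        H = -(X.T * W) @ X - np.eye(d)          # Hessian of the log-posterior
        theta = theta - np.linalg.solve(H, grad)
    return theta

def init_points(X, y, N, sigma0=1.0, seed=0):
    """Draw the N initial SA-MCMC points from N(theta_hat, sigma0^2 I)."""
    rng = np.random.default_rng(seed)
    theta_hat = map_estimate(X, y)
    return theta_hat + sigma0 * rng.normal(size=(N, len(theta_hat)))
```

The prior term regularizes the Hessian, so the Newton iteration is well-conditioned even when the classes are nearly separable; the same point estimate serves as the center of the initialization distribution for MH, AM (full), and SA (full).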
Thus, SA samples effectively from this high-dimensional, challenging posterior distribution without requiring any tuning of the initialization distribution.

¹https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Table 2: Comparison of ESS/second for Bayesian logistic regression on (left) 55-dim cover type (right) 51-dim MiniBooNE between AM (full), SA (full), and NUTS with a dense mass matrix.

                          Cover type                      MiniBooNE
                          AM        SA        NUTS        AM        SA        NUTS
    min(ESS)/s            0.075     2.34      0.099       0.31      3.35      0.023
    median(ESS)/s         0.078     2.81      0.114       0.38      6.59      0.039
    s/chain               52,469    65,537    25,143      28,178    26,627    33,584
    s/chain (burn-in)     4,770     5,958     16,980      1,342     2,421     19,051
    s/chain (estimation)  47,699    59,579    8,163       26,836    24,206    14,533
    # iter. (burn-in)     100,000   100,000   500         100,000   100,000   500
    # iter. (estimation)  1,000,000 1,000,000 2,000       2,000,000 1,000,000 2,000
    Hyperparameters       q=.004    q0=1      Stan        q=.007    q0=1      Stan
                          s=.32     N=1,000   (dense)     s=.33     N=1000    (dense)
    Acceptance rate       25.1%     99.3%     —           25.7%     90.5%     —

While we use 500 burn-in and 2,000 estimation iterations for NUTS, the running time of NUTS in the burn-in phase is larger than in the estimation phase due to the number of likelihood evaluations. Note that our ESS/second calculation is based on the total running time of the algorithm, including burn-in. We note that it is possible that further tuning or other techniques could improve the performance of NUTS.

5 Discussion

Our experimental results demonstrate the strong empirical performance of SA-MCMC with zero tuning compared to MH, MTM, AM, and NUTS with extensive tuning on Bayesian linear regression and Bayesian logistic regression.
SA-MCMC achieves this by maintaining a state of N points and using an adaptive proposal distribution q(·|μ(S), Σ(S)) depending on the current state. The SA-MCMC substitution procedure for the N points guarantees that the proposal distribution adapts within its parametric family to best approximate the target distribution. For example, when using a Gaussian family of proposal distributions, SA-MCMC is well-suited for posterior inference tasks where the posterior distribution can be approximated well by a Gaussian distribution. In these cases, SA-MCMC is very efficient as the draws from the proposal distribution approximate draws from the target distribution. While we focused on proposal families of the form q(·|μ(S), Σ(S)) in this paper, more generally, our method can be extended to proposals of the form q(·|γ(S)) where γ(S) = N⁻¹ Σ_{θ∈S} γ(θ) (as proved in Theorem 2) to tackle other problems. Future extensions of this work include using a family of mixture distributions as the proposal family and learning the optimal mixture distribution (within a given family) [54, 55] and combining SA-MCMC updates with other MCMC updates, such as with NKC in the Parallel Metropolis-Hastings Coupler [56].

The computational complexity per iteration of SA-MCMC is one likelihood evaluation and the computation of the substitution probabilities in time O(N d) with a diagonal covariance matrix or time O(N d²) with the full covariance matrix, where d is the dimension. The computational complexity per iteration of MH and AM is one likelihood evaluation plus O(d) with a diagonal covariance matrix or O(d²) with the full covariance matrix.
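Because at most one of the N points changes per iteration, the proposal parameters μ(S) and Σ(S) need not be recomputed from scratch. The following sketch shows the O(d) bookkeeping for the diagonal-covariance case when a point is substituted; this is our illustration of the idea, not necessarily the paper's implementation, which may, for instance, guard differently against numerical drift:

```python
import numpy as np

class DiagProposalStats:
    """Running mean and per-coordinate second moment of the N current points,
    updated in O(d) when one point theta_old is replaced by theta_new."""

    def __init__(self, points):
        self.N = len(points)
        self.sum = points.sum(axis=0)
        self.sumsq = (points ** 2).sum(axis=0)

    def substitute(self, theta_old, theta_new):
        # Only the swapped point's contribution changes: O(d) work.
        self.sum += theta_new - theta_old
        self.sumsq += theta_new ** 2 - theta_old ** 2

    def mean_var(self):
        mu = self.sum / self.N
        # Diagonal of Sigma(S). The sum-of-squares form can lose precision
        # for chains with large means; a sketch only.
        var = self.sumsq / self.N - mu ** 2
        return mu, var
```

After each substitution, `mean_var()` returns the parameters of the adapted proposal in O(d), so the per-iteration cost is dominated by the likelihood evaluation and the O(N d) substitution probabilities.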
The computational complexity per iteration of MTM with M tries is 2M − 1 likelihood evaluations plus O(d) or O(d²).

While the adaptivity of AM is based on the entire history of past samples, the adaptivity of SA-MCMC is based on the current state of N points, which offers theoretical and experimental advantages. For SA-MCMC, the Markovian property of the chain and the reversibility of the chain are preserved, and standard MCMC convergence theory can be applied. With AM, the first stage is MH, so the MH proposal distribution during the non-adaptive phase still has to be tuned. Using a sequential substitution framework, SA-MCMC is a principled adaptive MCMC method that only requires specifying an initialization distribution. In many cases, the initialization distribution for SA-MCMC can be chosen a priori, thereby automating the entire sampling procedure with no tuning required. Experimental results demonstrate the fast adaptation and effective sampling of SA-MCMC.

Acknowledgments

I would like to thank my advisor, Professor Tze Leung Lai, for introducing me to this research area and for supporting me throughout this project. I would like to thank Tze Leung Lai and Hock Peng Chan for providing a working paper with the algorithm and some theoretical derivations including the proof for uniform ergodicity. Finally, I would like to thank the anonymous reviewers for their valuable feedback.

References

[1] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

[2] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[3] W K Hastings. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57\n\n(1):97\u2013109, 1970.\n\n[4] Andrew Gelman, Gareth O Roberts, and Walter R Gilks. Ef\ufb01cient Metropolis jumping rules. Bayesian\n\nStatistics, 5(599-608):42, 1996.\n\n[5] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian\n\nData Analysis. Chapman and Hall/CRC, 2013.\n\n[6] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):\n\n223\u2013242, 2001.\n\n[7] Christophe Andrieu and \u00c9ric Moulines. On the ergodicity properties of some adaptive MCMC algorithms.\n\nThe Annals of Applied Probability, 16(3):1462\u20131505, 2006.\n\n[8] Gareth O Roberts and Jeffrey S Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo\n\nalgorithms. Journal of Applied Probability, 44(2):458\u2013475, 2007.\n\n[9] Faming Liang, Chuanhai Liu, and Raymond Carroll. Advanced Markov chain Monte Carlo methods:\n\nlearning from past samples, volume 714. John Wiley & Sons, 2011.\n\n[10] Andrew Lewandowski. Population Monte Carlo methods with applications in Bayesian statistics. PhD\n\nthesis, 2010.\n\n[11] Av A Barker. Monte Carlo calculations of the radial distribution functions for a proton-electron plasma.\n\nAustralian Journal of Physics, 18(2):119\u2013134, 1965.\n\n[12] Luca Martino, Victor Elvira, David Luengo, and Jukka Corander. An adaptive population importance\nsampler: Learning from uncertainty. IEEE Transactions on Signal Processing, 63(16):4422\u20134437, 2015.\n\n[13] Luca Martino, Victor Elvira, David Luengo, and Jukka Corander. Layered adaptive importance sampling.\n\nStatistics and Computing, 27(3):599\u2013623, 2017.\n\n[14] Luca Martino, V\u00edctor Elvira, David Luengo, Jukka Corander, and Francisco Louzada. Orthogonal parallel\n\nMCMC methods for sampling and optimization. Digital Signal Processing, 58:64\u201384, 2016.\n\n[15] Charles J Geyer. Markov chain Monte Carlo maximum likelihood. 
1991.

[16] Walter R Gilks, Gareth O Roberts, and Edward I George. Adaptive direction sampling. Journal of the Royal Statistical Society: Series D (The Statistician), 43(1):179–189, 1994.

[17] Gregory R Warnes. The Normal Kernel Coupler: An adaptive Markov Chain Monte Carlo method for efficiently sampling from multi-modal distributions. PhD thesis, 2000.

[18] Kerrie L Mengersen and Christian P Robert. IID sampling using self-avoiding population Monte Carlo: the pinball sampler. Bayesian Statistics, 7:277–292, 2003.

[19] Jonathan Goodman and Jonathan Weare. Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, 5(1):65–80, 2010.

[20] Jim E Griffin and Stephen G Walker. On adaptive Metropolis–Hastings methods. Statistics and Computing, 23(1):123–134, 2013.

[21] Bo Cai, Renate Meyer, and François Perron. Metropolis–Hastings algorithms with adaptive proposals. Statistics and Computing, 18(4):421–433, 2008.

[22] Jonathan M Keith, Dirk P Kroese, and George Y Sofronov. Adaptive independence samplers. Statistics and Computing, 18(4):409–420, 2008.

[23] Benedict Leimkuhler, Charles Matthews, and Jonathan Weare. Ensemble preconditioning for Markov chain Monte Carlo simulation. Statistics and Computing, 28(2):277–290, 2018.

[24] Jun S Liu, Faming Liang, and Wing Hung Wong. The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449):121–134, 2000.

[25] Luca Martino. A review of multiple try MCMC algorithms for signal processing. Digital Signal Processing, 2018.

[26] Radford M Neal, Matthew J Beal, and Sam T Roweis. Inferring state sequences for non-linear systems with embedded hidden Markov models. In Advances in Neural Information Processing Systems, pages 401–408, 2004.

[27] Hakon Tjelmeland.
Using all Metropolis\u2013Hastings proposals to estimate mean values. Technical report,\n\n2004.\n\n[28] Radford M Neal. MCMC using ensembles of states for problems with fast and slow variables such as\n\nGaussian process regression. arXiv preprint arXiv:1101.0387, 2011.\n\n[29] Ben Calderhead. A general construction for parallelizing Metropolis-Hastings algorithms. Proceedings of\n\nthe National Academy of Sciences, 111(49):17408\u201317413, 2014.\n\n[30] Olivier Capp\u00e9, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Population Monte Carlo.\n\nJournal of Computational and Graphical Statistics, 13(4):907\u2013929, 2004.\n\n[31] M\u00f3nica F Bugallo, Luca Martino, and Jukka Corander. Adaptive importance sampling in signal processing.\n\nDigital Signal Processing, 47:36\u201349, 2015.\n\n[32] Monica F Bugallo, Victor Elvira, Luca Martino, David Luengo, Joaquin Miguez, and Petar M Djuric.\nAdaptive importance sampling: the past, the present, and the future. IEEE Signal Processing Magazine, 34\n(4):60\u201379, 2017.\n\n[33] V\u00edctor Elvira, Luca Martino, David Luengo, and M\u00f3nica F Bugallo. Generalized multiple importance\n\nsampling. Statistical Science, 34(1):129\u2013155, 2019.\n\n[34] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian\nBayesian state estimation. In IEE Proceedings F-radar and signal processing, volume 140, pages 107\u2013113.\nIET, 1993.\n\n[35] Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American\n\nStatistical Association, 93(443):1032\u20131044, 1998.\n\n[36] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods.\n\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269\u2013342, 2010.\n\n[37] Krishna B Athreya, Hani Doss, and Jayaram Sethuraman. On the convergence of the Markov chain\n\nsimulation method. 
The Annals of Statistics, 24(1):69–100, 1996.

[38] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer New York, 2004.

[39] Gareth O Roberts and Richard L Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996.

[40] Hock Peng Chan and Tze Leung Lai. MCMC with sequential substitutions: theory and applications. Working paper, 2015.

[41] Kerrie L Mengersen and Richard L Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.

[42] Lars Holden, Ragnar Hauge, and Marit Holden. Adaptive independent Metropolis–Hastings. The Annals of Applied Probability, 19(1):395–413, 2009.

[43] Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.

[44] Wentao Li, Zhiqiang Tan, and Rong Chen. Two-stage importance sampling with mixture proposals. Journal of the American Statistical Association, 108(504):1350–1365, 2013.

[45] Wentao Li, Rong Chen, and Zhiqiang Tan. Efficient sequential Monte Carlo with multiple proposals and control variates. Journal of the American Statistical Association, 111(513):298–313, 2016.

[46] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[47] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

[48] Radford M Neal. MCMC using Hamiltonian dynamics.
Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

[49] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.

[50] Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.

[51] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.

[52] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2019. URL http://archive.ics.uci.edu/ml.

[53] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, pages 181–189, 2014.

[54] Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.

[55] David Luengo and Luca Martino. Fully adaptive Gaussian mixture Metropolis-Hastings algorithm. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6148–6152. IEEE, 2013.

[56] Fernando Llorente, Luca Martino, and David Delgado. Parallel Metropolis–Hastings coupler. IEEE Signal Processing Letters, 26(6):953–957, 2019.