{"title": "Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 7847, "page_last": 7856, "abstract": "A key task in Bayesian machine learning is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). One prevalent example of this is sampling posteriors in parametric \ndistributions, such as latent-variable generative models.  However sampling (even very approximately) can be #P-hard.\n\nClassical results (going back to Bakry and Emery) on sampling focus on log-concave distributions, and show a natural Markov chain called Langevin diffusion mix in polynomial time.  However, all log-concave distributions are uni-modal, while in practice it is very common for the distribution of interest to have multiple modes.\nIn this case, Langevin diffusion suffers from torpid mixing. \n\nWe address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for a mixture of (strongly) log-concave distributions of the same shape. In particular, our technique applies to the canonical multi-modal distribution: a mixture of gaussians (of equal variance). Our algorithm efficiently samples from these distributions given only access to the gradient of the log-pdf. 
To the best of our knowledge, this is the first result that proves fast mixing for multimodal distributions.", "full_text": "Beyond Log-concavity: Provable Guarantees for\n\nSampling Multi-modal Distributions using Simulated\n\nTempering Langevin Monte Carlo\n\nDuke University, Computer Science Department\n\nRong Ge\n\nrongge@cs.duke.edu\n\nPrinceton University, Mathematics Department\n\nHolden Lee\n\nholdenl@princeton.edu\n\nMassachusetts Institute of Technology, Applied Mathematics and IDSS\n\nAndrej Risteski\n\nristeski@mit.edu\n\nAbstract\n\nA key task in Bayesian machine learning is sampling from distributions that are\nonly speci\ufb01ed up to a partition function (i.e., constant of proportionality). One\nprevalent example of this is sampling posteriors in parametric distributions, such\nas latent-variable generative models. However sampling (even very approximately)\ncan be #P-hard.\nClassical results (going back to [B\u00c985]) on sampling focus on log-concave dis-\ntributions, and show a natural Markov chain called Langevin diffusion mixes in\npolynomial time. However, all log-concave distributions are uni-modal, while in\npractice it is very common for the distribution of interest to have multiple modes.\nIn this case, Langevin diffusion suffers from torpid mixing.\nWe address this problem by combining Langevin diffusion with simulated temper-\ning. The result is a Markov chain that mixes more rapidly by transitioning between\ndifferent temperatures of the distribution. We analyze this Markov chain for a\nmixture of (strongly) log-concave distributions of the same shape. In particular, our\ntechnique applies to the canonical multi-modal distribution: a mixture of gaussians\n(of equal variance). Our algorithm ef\ufb01ciently samples from these distributions\ngiven only access to the gradient of the log-pdf. 
To the best of our knowledge, this\nis the first result that proves fast mixing for multimodal distributions in this setting.\nFor the analysis, we introduce novel techniques for proving spectral gaps based on\ndecomposing the action of the generator of the diffusion. Previous approaches rely\non decomposing the state space as a partition of sets, while our approach can be\nthought of as decomposing the stationary measure as a mixture of distributions (a\n\u201csoft partition\u201d).\nAdditional materials for the paper can be found at http://tiny.cc/glr17. Note\nthat the proof and results have been improved and generalized from the precursor\nat http://www.arxiv.org/abs/1710.02736. See Section ?? for a comparison.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f1 Introduction\n\nSampling is a fundamental task in Bayesian statistics, and dealing with multimodal distributions is\na core challenge. One common technique to sample from a probability distribution is to define a\nMarkov chain with that distribution as its stationary distribution. This general approach is called\nMarkov chain Monte Carlo. However, in many practical problems, the Markov chain does not mix\nrapidly, and we obtain samples from only one part of the support of the distribution.\nPractitioners have dealt with this problem through a variety of heuristics. A popular family of\napproaches involves changing the temperature of the distribution. However, there has been little\ntheoretical analysis of such methods. We give provable guarantees for a temperature-based method\ncalled simulated tempering when it is combined with Langevin diffusion.\nMore precisely, the setup we consider is sampling from a distribution given up to a constant of\nproportionality. This is inspired by sampling a posterior distribution over the latent variables of a\nlatent-variable Bayesian model with known parameters. In such models, the observable variables\n\ud835\udc65 follow a distribution \ud835\udc5d(\ud835\udc65) which has a simple and succinct form given the values of some latent\nvariables \u210e, i.e., the joint \ud835\udc5d(\u210e, \ud835\udc65) factorizes as \ud835\udc5d(\u210e)\ud835\udc5d(\ud835\udc65|\u210e) where both factors are explicit. Hence,\nthe posterior distribution \ud835\udc5d(\u210e|\ud835\udc65) has the form \ud835\udc5d(\u210e|\ud835\udc65) = \ud835\udc5d(\u210e)\ud835\udc5d(\ud835\udc65|\u210e)/\ud835\udc5d(\ud835\udc65). Although the numerator is easy\nto evaluate, the denominator \ud835\udc5d(\ud835\udc65) = \u222b_\u210e \ud835\udc5d(\u210e)\ud835\udc5d(\ud835\udc65|\u210e) can be NP-hard to approximate even for simple\nmodels like topic models [SR11]. Thus the problem is intractable without structural assumptions.\nPrevious theoretical results on sampling have focused on log-concave distributions, i.e., distributions\nof the form \ud835\udc5d(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)} for a convex function \ud835\udc53(\ud835\udc65). This is analogous to convex optimization\nwhere the objective function \ud835\udc53(\ud835\udc65) is convex. Recently, there has been renewed interest in analyzing\na popular Markov chain for sampling from such distributions, when given gradient access to \ud835\udc53, a\nnatural setup for the posterior sampling task described above. In particular, a Markov chain called\nLangevin Monte Carlo (see Section 2.1), popular with Bayesian practitioners, has been proven to\nwork, with various rates depending on the precise properties of \ud835\udc53 [Dal16, DM16, Dal17].\nYet, just as many interesting optimization problems are nonconvex, many interesting sampling\nproblems are not log-concave. A log-concave distribution is necessarily uni-modal: its density function\nhas only one local maximum, which is necessarily a global maximum. 
This fails to capture many\ninteresting scenarios. Many simple posterior distributions are neither log-concave nor uni-modal, for\ninstance, the posterior distribution of the means for a mixture of gaussians, given a sample of points\nfrom the mixture of gaussians. In a more practical direction, complicated posterior distributions asso-\nciated with deep generative models [RMW14] and variational auto-encoders [KW13] are believed to\nbe multimodal as well.\nIn this work we initiate an exploration of provable methods for sampling \u201cbeyond log-concavity,\u201d\nin parallel to optimization \u201cbeyond convexity\u201d. As worst-case results are prohibited by hardness\nresults, we must make assumptions on the distributions of interest. As a \ufb01rst step, we consider a\nmixture of strongly log-concave distributions of the same shape. This class of distributions captures\nthe prototypical multimodal distribution, a mixture of Gaussians with the same covariance matrix.\nOur result is also robust in the sense that even if the actual distribution has density that is only close\nto a mixture that we can handle, our algorithm can still sample from the distribution in polynomial\ntime. Note that the requirement that all Gaussians have the same covariance matrix is in some sense\nnecessary: in Appendix K we show that even if the covariance of two components differ by a constant\nfactor, no algorithm (with query access to \ud835\udc53 and \u2207\ud835\udc53) can achieve the same robustness guarantee in\npolynomial time.\n\n1.1 Problem statement\n\nWe formalize the problem of interest as follows.\nProblem 1.1. Let \ud835\udc53 : R\ud835\udc51 \u2192 R be a function. 
Given query access to \u2207\ud835\udc53(\ud835\udc65) and \ud835\udc53(\ud835\udc65) at any point\n\ud835\udc65 \u2208 R^\ud835\udc51, sample from the probability distribution with density function \ud835\udc5d(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)}.\nIn particular, consider the case where \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)} is the density function of a mixture of strongly\nlog-concave distributions that are translates of each other. That is, there is a base function \ud835\udc530 : R^\ud835\udc51 \u2192 R,\ncenters \ud835\udf071, \ud835\udf072, . . . , \ud835\udf07\ud835\udc5a \u2208 R^\ud835\udc51, and weights \ud835\udc641, \ud835\udc642, . . . , \ud835\udc64\ud835\udc5a (with \u2211_{\ud835\udc56=1}^{\ud835\udc5a} \ud835\udc64\ud835\udc56 = 1) such that\n\n\ud835\udc53(\ud835\udc65) = \u2212log(\u2211_{\ud835\udc56=1}^{\ud835\udc5a} \ud835\udc64\ud835\udc56 \ud835\udc52^{\u2212\ud835\udc530(\ud835\udc65\u2212\ud835\udf07\ud835\udc56)}).\n\nFor notational convenience, we will define \ud835\udc53\ud835\udc56(\ud835\udc65) = \ud835\udc530(\ud835\udc65 \u2212 \ud835\udf07\ud835\udc56).\nThe function \ud835\udc530 specifies a basic \u201cshape\u201d around the modes, and the means \ud835\udf07\ud835\udc56 indicate the locations\nof the modes.\nWithout loss of generality we assume the mode of the distribution \ud835\udc52^{\u2212\ud835\udc530(\ud835\udc65)} is at 0 (\u2207\ud835\udc530(0) =\n0). We also assume \ud835\udc530 is twice differentiable, and for any \ud835\udc65 the Hessian is sandwiched between\n\ud835\udf05\ud835\udc3c \u2aaf \u2207^2\ud835\udc530(\ud835\udc65) \u2aaf \ud835\udc3e\ud835\udc3c. Such functions are called \ud835\udf05-strongly-convex, \ud835\udc3e-smooth functions. The\ncorresponding distributions \ud835\udc52^{\u2212\ud835\udc530(\ud835\udc65)} are strongly log-concave distributions. 
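As a concrete instance of Problem 1.1, the following sketch evaluates f and its gradient for a mixture of spherical Gaussians, that is, under the assumption that the base shape is f0(y) = |y|^2 / (2 sigma^2). The function name mixture_potential and its arguments are illustrative choices rather than notation from the paper, and points are plain Python lists. A sampler in the setting above would only see the returned values of f and its gradient, never the centers or weights.

```python
import math

def mixture_potential(x, centers, weights, sigma=1.0):
    '''Return (f(x), grad f(x)) for f(x) = -log sum_i w_i exp(-f0(x - mu_i)),
    assuming the spherical Gaussian shape f0(y) = |y|^2 / (2 sigma^2).'''
    # Exponents -f0(x - mu_i), one per mixture component.
    exps = [-sum((xj - mj) ** 2 for xj, mj in zip(x, mu)) / (2 * sigma ** 2)
            for mu in centers]
    # Log-sum-exp with the largest exponent factored out for stability.
    top = max(exps)
    terms = [w * math.exp(e - top) for w, e in zip(weights, exps)]
    total = sum(terms)
    f = -(top + math.log(total))
    # grad f is the responsibility-weighted average of grad f0(x - mu_i),
    # and grad f0(y) = y / sigma^2 for this choice of f0.
    resp = [t / total for t in terms]
    grad = [sum(r * (x[j] - mu[j]) for r, mu in zip(resp, centers)) / sigma ** 2
            for j in range(len(x))]
    return f, grad
```

With two equally weighted components centered at -2 and 2 in one dimension, the gradient at the midpoint 0 vanishes by symmetry, even though 0 is a low-density point between the modes rather than a mode.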
1\n\n(1)\n\n1.2 Our results\n\nWe show that there is an efficient algorithm that can sample from this distribution given just access to\n\ud835\udc53(\ud835\udc65) and \u2207\ud835\udc53(\ud835\udc65).\nTheorem 1.2 (main). Given \ud835\udc53(\ud835\udc65) as defined in Equation (1), where the base function \ud835\udc530 satisfies\nfor any \ud835\udc65, \ud835\udf05\ud835\udc3c \u2aaf \u2207^2\ud835\udc530(\ud835\udc65) \u2aaf \ud835\udc3e\ud835\udc3c, and \u2016\ud835\udf07\ud835\udc56\u2016 \u2264 \ud835\udc37 for all \ud835\udc56 \u2208 [\ud835\udc5a], there is an algorithm (given as\nAlgorithm 2 with appropriate setting of parameters) with running time poly(\ud835\udc64min, \ud835\udc37, \ud835\udc51, 1/\ud835\udf00, 1/\ud835\udf05, \ud835\udc3e),\nwhich given query access to \u2207\ud835\udc53 and \ud835\udc53, outputs a sample from a distribution within TV-distance \ud835\udf00 of\n\ud835\udc5d(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)}.\nNote that importantly the algorithm does not have direct access to the mixture parameters \ud835\udf07\ud835\udc56, \ud835\udc64\ud835\udc56, \ud835\udc56 \u2208\n[\ud835\udc5a] (otherwise the problem would be trivial). Sampling from this mixture is thus non-trivial:\nalgorithms that are based on making local steps (such as the ball-walk [LS93, Vem05] and Langevin\nMonte Carlo) cannot move between different components of the Gaussian mixture when the\nGaussians are well-separated. In the algorithm we use simulated tempering (see Section 2.2), which is\na technique that adjusts the \u201ctemperature\u201d of the distribution in order to move between different\ncomponents.\nOf course, requiring the distribution to be exactly a mixture of log-concave distributions is a very\nstrong assumption. 
Our results can be generalized to all functions that are \u201cclose\u201d to a mixture of\nlog-concave distributions.\nMore precisely, assume the function \ud835\udc53 satisfies the following properties:\n\n\u2203 \ud835\udc53\u0303 : R^\ud835\udc51 \u2192 R where \u2016\ud835\udc53\u0303 \u2212 \ud835\udc53\u2016_\u221e \u2264 \u0394, \u2016\u2207\ud835\udc53\u0303 \u2212 \u2207\ud835\udc53\u2016_\u221e \u2264 \ud835\udf0f and \u2016\u2207^2\ud835\udc53\u0303(\ud835\udc65) \u2212 \u2207^2\ud835\udc53(\ud835\udc65)\u2016_2 \u2264 \ud835\udf0f, \u2200\ud835\udc65 \u2208 R^\ud835\udc51    (2)\n\nand \ud835\udc53\u0303(\ud835\udc65) = \u2212log(\u2211_{\ud835\udc56=1}^{\ud835\udc5a} \ud835\udc64\ud835\udc56 \ud835\udc52^{\u2212\ud835\udc530(\ud835\udc65\u2212\ud835\udf07\ud835\udc56)})    (3)\n\nwhere \u2207\ud835\udc530(0) = 0, and \u2200\ud835\udc65, \ud835\udf05\ud835\udc3c \u2aaf \u2207^2\ud835\udc530(\ud835\udc65) \u2aaf \ud835\udc3e\ud835\udc3c.    (4)\n\nThat is, \ud835\udc52^{\u2212\ud835\udc53} is within an \ud835\udc52^\u0394 multiplicative factor of an (unknown) mixture of log-concave distributions.\nOur theorem can be generalized to this case.\nTheorem 1.3 (general case). For function \ud835\udc53(\ud835\udc65) that satisfies Equations (2), (3) and (4), there is\nan algorithm (given as Algorithm 2 with appropriate setting of parameters) that runs in time\npoly(\ud835\udc64min, \ud835\udc37, \ud835\udc51, 1/\ud835\udf00, \ud835\udc52^\u0394, \ud835\udf0f, 1/\ud835\udf05, \ud835\udc3e), which given query access to \u2207\ud835\udc53 and \ud835\udc53, outputs a sample \ud835\udc65 from\na distribution that has TV-distance at most \ud835\udf00 from \ud835\udc5d(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)}.\nBoth main theorems may seem simple. 
In particular, one might conjecture that it is easy to use local\nsearch algorithms to find all the modes. However, in Section J we give a few examples to show that\nsuch simple heuristics do not work (e.g. random initialization is not enough to find all the modes).\n\n1 On a first read, we recommend concentrating on the case \ud835\udc530(\ud835\udc65) = (1/(2\ud835\udf0e^2))\u2016\ud835\udc65\u2016^2. This corresponds to the case\nwhere all the components are spherical Gaussians with mean \ud835\udf07\ud835\udc56 and covariance matrix \ud835\udf0e^2\ud835\udc3c.\n\n3\n\n\fThe assumption that all the mixture components share the same \ud835\udc530 (hence when applied to Gaussians,\nall Gaussians have the same covariance) is also necessary. In Section K, we give an example where for a\nmixture of two Gaussians, even if the covariance only differs by a constant factor, any algorithm that\nachieves similar guarantees as Theorem 1.3 must take exponential time.\n\n2 Overview of algorithm\n\nOur algorithm combines Langevin diffusion, a chain for sampling from distributions in the form\n\ud835\udc5d(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)} given only gradient access to \ud835\udc53, and simulated tempering, a heuristic used for tackling\nmultimodality. We briefly define both of these and recall what is known for both of these techniques.\nFor technical prerequisites on Markov chains, the reader can refer to Appendix B.\nThe basic idea to keep in mind is the following: A Markov chain with local moves such as Langevin\ndiffusion gets stuck in a local mode. 
Creating a \u201cmeta-Markov chain\u201d which changes the temperature\n(the simulated tempering chain) can exponentially speed up mixing.\n\n2.1 Langevin dynamics\nLangevin Monte Carlo is an algorithm for sampling from \ud835\udc5d \u221d \ud835\udc52^{\u2212\ud835\udc53} given access to the gradient of the\nlog-pdf, \u2207\ud835\udc53.\nThe continuous version, overdamped Langevin diffusion (often simply called Langevin diffusion), is\na stochastic process described by the stochastic differential equation (henceforth SDE)\n\n\ud835\udc51\ud835\udc4b_\ud835\udc61 = \u2212\u2207\ud835\udc53(\ud835\udc4b_\ud835\udc61) \ud835\udc51\ud835\udc61 + \u221a2 \ud835\udc51\ud835\udc4a_\ud835\udc61    (5)\n\nwhere \ud835\udc4a\ud835\udc61 is the Wiener process (Brownian motion). For us, the crucial fact is that Langevin dynamics\nconverges to the stationary distribution given by \ud835\udc5d(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)}.\nSubstituting \ud835\udefd\ud835\udc53 for \ud835\udc53 in (5) gives the Langevin diffusion process for inverse temperature \ud835\udefd, which\nhas stationary distribution \u221d \ud835\udc52^{\u2212\ud835\udefd\ud835\udc53(\ud835\udc65)}. Equivalently we can consider the temperature as changing the\nmagnitude of the noise:\n\n\ud835\udc51\ud835\udc4b_\ud835\udc61 = \u2212\u2207\ud835\udc53(\ud835\udc4b_\ud835\udc61) \ud835\udc51\ud835\udc61 + \u221a(2\ud835\udefd^{\u22121}) \ud835\udc51\ud835\udc4a_\ud835\udc61.\n\nOf course algorithmically we cannot run a continuous-time process, so we run a discretized version of\nthe above process: namely, we run a Markov chain where the random variable at time \ud835\udc61 is described\nas\n\n\ud835\udc4b_{\ud835\udc61+1} = \ud835\udc4b_\ud835\udc61 \u2212 \ud835\udf02\u2207\ud835\udc53(\ud835\udc4b_\ud835\udc61) + \u221a(2\ud835\udf02) \ud835\udf09_\ud835\udc58,    \ud835\udf09_\ud835\udc58 \u223c \ud835\udc41(0, \ud835\udc3c)    (6)\n\nwhere \ud835\udf02 is the step size. (The reason for the \u221a\ud835\udf02 scaling is that running Brownian motion for \ud835\udf02 of\nthe time scales the standard deviation by \u221a\ud835\udf02.) This is analogous to how gradient descent is a discretization of\ngradient flow.\n\n2.1.1 Prior work on Langevin dynamics\n\nFor Langevin dynamics, convergence to the stationary distribution is a classic result [Bha78]. Fast\nmixing for log-concave distributions is also a classic result: [B\u00c985, BBCG08] show that\nlog-concave distributions satisfy Poincar\u00e9 and log-Sobolev inequalities, which characterize the rate\nof convergence: if \ud835\udc53 is \ud835\udefc-strongly convex, then the mixing time is on the order of 1/\ud835\udefc. Of course,\nalgorithmically, one can only run a \u201cdiscretized\u201d version of the Langevin dynamics. Analyses of the\ndiscretization are more recent: [Dal16, DM16, Dal17, DK17, DMM18] give running time bounds\nfor sampling from a log-concave distribution over R^\ud835\udc51, and [BEL18] give an algorithm to sample\nfrom a log-concave distribution restricted to a convex set by incorporating a projection. We note\nthese analyses and ours are for the simplest kind of Langevin dynamics, the overdamped case; better\nrates are known for underdamped dynamics ([CCBJ17]), if a Metropolis-Hastings rejection step is\nused ([DCWY18]), and for Hamiltonian Monte Carlo which takes into account momentum ([MS17]).\n[RRT17] consider arbitrary non-log-concave distributions with certain regularity and decay properties,\nbut the mixing time is exponential in general; furthermore, it has long been known that\ntransitioning between different modes can take exponentially long, a phenomenon known as meta-stability\n[BEGK02, BEGK04, BGK05]. The Holley-Stroock Theorem (see e.g. 
[BGL13]) shows that guarantees\nfor mixing extend to distributions \ud835\udc52^{\u2212\ud835\udc53(\ud835\udc65)} where \ud835\udc53(\ud835\udc65) is a \u201cnice\u201d function that is close to a convex\nfunction in \ud835\udc3f^\u221e distance; however, this does not address more global deviations from convexity.\n[MV17] consider a more general model with multiplicative noise.\n\n2.2 Simulated tempering\n\nFor distributions that are far from being log-concave and have many deep modes, additional techniques\nare necessary. One proposed heuristic, out of many, is simulated tempering, which swaps between\nMarkov chains that are different temperature variants of the original chain. The intuition is that\nthe Markov chains at higher temperature can move between modes more easily, and hence, the\nhigher-temperature chain acts as a \u201cbridge\u201d to move between modes.\nIndeed, Langevin dynamics corresponding to a higher temperature distribution (with \ud835\udefd\ud835\udc53 rather\nthan \ud835\udc53, where \ud835\udefd < 1) mixes faster. (Here, we use terminology from statistical physics, letting \ud835\udf0f\ndenote the temperature and \ud835\udefd = 1/\ud835\udf0f denote the inverse temperature.) A high temperature flattens\nout the distribution. However, we can\u2019t simply run Langevin at a higher temperature because the\nstationary distribution is wrong; the simulated tempering chain combines Markov chains at different\ntemperatures in a way that preserves the stationary distribution.\nWe can define simulated tempering with respect to any sequence of Markov chains \ud835\udc40\ud835\udc56 on the same\nspace \u2126. 
Think of \ud835\udc40\ud835\udc56 as the Markov chain corresponding to temperature \ud835\udc56, with stationary distribution\nproportional to \ud835\udc52^{\u2212\ud835\udefd_\ud835\udc56 \ud835\udc53}.\nThen we define the simulated tempering Markov chain as follows.\n\n\u2219 The state space is \u2126 \u00d7 [\ud835\udc3f]: \ud835\udc3f copies of the state space (in our case R^\ud835\udc51), one copy for each\ntemperature.\n\u2219 The evolution is defined as follows.\n\n1. If the current point is (\ud835\udc65, \ud835\udc56), then evolve according to the \ud835\udc56th chain \ud835\udc40\ud835\udc56.\n2. Propose swaps with some rate \ud835\udf06. When a swap is proposed, attempt to move to a\nneighboring chain, \ud835\udc56\u2032 = \ud835\udc56 \u00b1 1. With probability min{\ud835\udc5d\ud835\udc56\u2032(\ud835\udc65)/\ud835\udc5d\ud835\udc56(\ud835\udc65), 1}, the transition is\nsuccessful. Otherwise, stay at the same point. This is a Metropolis-Hastings step; its\npurpose is to preserve the stationary distribution.2\n\nThe crucial fact to note is that the stationary distribution is a \u201cmixture\u201d of the distributions\ncorresponding to the different temperatures. Namely:\nProposition 2.1. [MP92, Nea96] If the \ud835\udc40\ud835\udc58, 1 \u2264 \ud835\udc58 \u2264 \ud835\udc3f are reversible Markov chains with stationary\ndistributions \ud835\udc5d\ud835\udc58, then the simulated tempering chain \ud835\udc40 is a reversible Markov chain with stationary\ndistribution\n\n\ud835\udc5d(\ud835\udc65, \ud835\udc56) = (1/\ud835\udc3f) \ud835\udc5d\ud835\udc56(\ud835\udc65).\n\nThe typical setting of simulated tempering is as follows. The Markov chains come from a smooth\nfamily of Markov chains with parameter \ud835\udefd \u2265 0, and \ud835\udc40\ud835\udc56 is the Markov chain with parameter \ud835\udefd\ud835\udc56,\nwhere 0 \u2264 \ud835\udefd1 \u2264 . . . \u2264 \ud835\udefd\ud835\udc3f = 1. 
We are interested in sampling from the distribution when \ud835\udefd is large\n(\ud835\udf0f is small). However, the chain suffers from torpid mixing in this case, because the distribution is\nmore peaked. The simulated tempering chain uses smaller \ud835\udefd (larger \ud835\udf0f) to help with mixing. For us,\nthe stationary distribution at inverse temperature \ud835\udefd is \u221d \ud835\udc52\u2212\ud835\udefd\ud835\udc53 (\ud835\udc65).\n\n2.2.1 Prior work on simulated tempering\n\nProvable results of this heuristic are few and far between. [WSH09, Zhe03] lower-bound the spectral\ngap for generic simulated tempering chains, using a Markov chain decomposition technique due to\n[MR02]. However, for the Problem 1.1 that we are interested in, the spectral gap bound in [WSH09]\nis exponentially small as a function of the number of modes. Drawing inspiration from [MR02], we\nestablish a Markov chain decomposition technique that overcomes this.\n\n2 This can be de\ufb01ned as either a discrete or continuous Markov chain. For a discrete chain, we propose\na swap with probability \ud835\udf06 and follow the current chain with probability 1 \u2212 \ud835\udf06. For a continuous chain, the\ntime between swaps is an exponential distribution with decay \ud835\udf06 (in other words, the times of the swaps forms a\nPoisson process). Note that simulated tempering is traditionally de\ufb01ned for discrete Markov chains, but we will\nuse the continuous version. See De\ufb01nition C.1 for the formal de\ufb01nition.\n\n5\n\n\fOne issue that comes up in simulated tempering is estimating the partition functions; various methods\nhave been proposed for this [PP07, Lia05].\n\n2.3 Main algorithm\n\nOur algorithm is intuitively the following. Take a sequence of inverse temperatures \ud835\udefd\ud835\udc56, starting at a\nsmall value and increasing geometrically towards 1. Run simulated tempering Langevin on these\ntemperatures, suitably discretized. 
Take the samples that are at the \ud835\udc3fth temperature.\nNote that there is one complication: the standard simulated tempering chain assumes that we can\ncompute the ratio between temperatures \ud835\udc5d\ud835\udc56\u2032(\ud835\udc65)/\ud835\udc5d\ud835\udc56(\ud835\udc65). However, we only know the probability density\nfunctions up to a normalizing factor (the partition function). To overcome this, we note that if we use\nthe ratios \ud835\udc5f\ud835\udc56\u2032\ud835\udc5d\ud835\udc56\u2032(\ud835\udc65)/(\ud835\udc5f\ud835\udc56\ud835\udc5d\ud835\udc56(\ud835\udc65)) instead, for \u2211_{\ud835\udc56=1}^{\ud835\udc3f} \ud835\udc5f\ud835\udc56 = 1, then the chain converges to the stationary distribution\nwith \ud835\udc5d(\ud835\udc65, \ud835\udc56) = \ud835\udc5f\ud835\udc56\ud835\udc5d\ud835\udc56(\ud835\udc65). Thus, it suffices to estimate each partition function up to a constant factor. We\ncan do this inductively: running the simulated tempering chain on the first \u2113 levels, we can estimate\nthe partition function \ud835\udc4d_{\u2113+1}; then we can run the simulated tempering chain on the first \u2113 + 1 levels.\nThis is what Algorithm 2 does when it calls Algorithm 1 as a subroutine.\nA formal description of the algorithm follows.\n\nAlgorithm 1 Simulated tempering Langevin Monte Carlo\n\nINPUT: Temperatures \ud835\udefd_1, . . . , \ud835\udefd_\u2113; partition function estimates \ud835\udc4d\u0302_1, . . . , \ud835\udc4d\u0302_\u2113; step size \ud835\udf02, time \ud835\udc47, rate\n\ud835\udf06, variance of initial distribution \ud835\udf0e0.\nOUTPUT: A random sample \ud835\udc65 \u2208 R^\ud835\udc51 (approximately from the distribution \ud835\udc5d_\u2113(\ud835\udc65) \u221d \ud835\udc52^{\u2212\ud835\udefd_\u2113 \ud835\udc53(\ud835\udc65)}).\nLet (\ud835\udc56, \ud835\udc65) = (1, \ud835\udc650) where \ud835\udc650 \u223c \ud835\udc41(0, \ud835\udf0e0^2 \ud835\udc3c).\nLet \ud835\udc5b = 0, \ud835\udc47_0 = 0.\nwhile \ud835\udc47_\ud835\udc5b < \ud835\udc47 do\n  Determine the next transition time: Draw \ud835\udf09_{\ud835\udc5b+1} from the exponential distribution \ud835\udc5d(\ud835\udc65) = \ud835\udf06\ud835\udc52^{\u2212\ud835\udf06\ud835\udc65},\n  \ud835\udc65 \u2265 0.\n  Let \ud835\udf09_{\ud835\udc5b+1} \u2190 min{\ud835\udc47 \u2212 \ud835\udc47_\ud835\udc5b, \ud835\udf09_{\ud835\udc5b+1}}, \ud835\udc47_{\ud835\udc5b+1} = \ud835\udc47_\ud835\udc5b + \ud835\udf09_{\ud835\udc5b+1}.\n  Let \ud835\udf02\u2032 = \ud835\udf09_{\ud835\udc5b+1}/\u2308\ud835\udf09_{\ud835\udc5b+1}/\ud835\udf02\u2309 (the largest step size \u2264 \ud835\udf02 that evenly divides into \ud835\udf09_{\ud835\udc5b+1}).\n  Repeat \u2308\ud835\udf09_{\ud835\udc5b+1}/\ud835\udf02\u2309 times: Update \ud835\udc65 according to \ud835\udc65 \u2190 \ud835\udc65 \u2212 \ud835\udf02\u2032\ud835\udefd_\ud835\udc56\u2207\ud835\udc53(\ud835\udc65) + \u221a(2\ud835\udf02\u2032) \ud835\udf09, \ud835\udf09 \u223c \ud835\udc41(0, \ud835\udc3c).\n  If \ud835\udc47_{\ud835\udc5b+1} < \ud835\udc47 (i.e., the end time has not been reached), let \ud835\udc56\u2032 = \ud835\udc56 \u00b1 1 with probability 1/2. If \ud835\udc56\u2032 is\n  out of bounds, do nothing. If \ud835\udc56\u2032 is in bounds, make a type 2 transition, where the acceptance\n  ratio is min{(\ud835\udc52^{\u2212\ud835\udefd_{\ud835\udc56\u2032} \ud835\udc53(\ud835\udc65)}/\ud835\udc4d\u0302_{\ud835\udc56\u2032}) / (\ud835\udc52^{\u2212\ud835\udefd_\ud835\udc56 \ud835\udc53(\ud835\udc65)}/\ud835\udc4d\u0302_\ud835\udc56), 1}.\n  \ud835\udc5b \u2190 \ud835\udc5b + 1.\nend while\nIf the final state is (\u2113, \ud835\udc65) for some \ud835\udc65 \u2208 R^\ud835\udc51, return \ud835\udc65. Otherwise, re-run the chain.\n\nAlgorithm 2 Main algorithm\n\nINPUT: A function \ud835\udc53 : R^\ud835\udc51 \u2192 R, satisfying assumption (2), to which we have gradient access.\nOUTPUT: A random sample \ud835\udc65 \u2208 R^\ud835\udc51.\nLet 0 \u2264 \ud835\udefd_1 < \u00b7\u00b7\u00b7 < \ud835\udefd_\ud835\udc3f = 1 be a sequence of inverse temperatures satisfying (117) and (118).\nLet \ud835\udc4d\u0302_1 = 1.\nfor \u2113 = 1 \u2192 \ud835\udc3f do\n  Run the simulated tempering chain in Algorithm 1 with temperatures \ud835\udefd_1, . . . , \ud835\udefd_\u2113, partition\n  function estimates \ud835\udc4d\u0302_1, . . . , \ud835\udc4d\u0302_\u2113, step size \ud835\udf02, time \ud835\udc47, and rate \ud835\udf06 given by Lemma G.2.\n  If \u2113 = \ud835\udc3f, return the sample.\n  If \u2113 < \ud835\udc3f, repeat to get \ud835\udc5b = \ud835\udc42(\ud835\udc3f^2 ln(1/\ud835\udeff)) samples, and let \ud835\udc4d\u0302_{\u2113+1} = \ud835\udc4d\u0302_\u2113 ((1/\ud835\udc5b) \u2211_{\ud835\udc57=1}^{\ud835\udc5b} \ud835\udc52^{(\u2212\ud835\udefd_{\u2113+1}+\ud835\udefd_\u2113)\ud835\udc53(\ud835\udc65_\ud835\udc57)}).\nend for\n\n6\n\n\f3 Overview of the proof techniques\n\nWe summarize the main ingredients and crucial techniques in the proof, while the full proofs are\nincluded in the appendices.\n\nStep 1: Define a continuous version of the simulated tempering Markov chain (Definition C.1,\nLemma C.2), where transition times are real numbers determined by an exponential waiting time\ndistribution.\n\nStep 2: Prove a new decomposition theorem (Theorem D.2) for bounding the spectral gap (or\nequivalently, the mixing time) of the simulated tempering chain we define. This is the main technical\ningredient, and also a result of independent interest.\nWhile decomposition theorems have appeared in the Markov Chain literature (e.g. [MR02]), typically\none partitions the state space, and bounds the spectral gap using (1) the probability flow of the chain\ninside the individual sets, and (2) between different sets.\nIn our case, we decompose the Markov chain itself; this includes a decomposition of the stationary\ndistribution into components. (More precisely, we show a decomposition theorem on the generator of\nthe tempering chain.) We would like to do this because in our setting, the stationary distribution is\nexactly a mixture distribution (Problem 1.1).\nOur Markov chain decomposition theorem bounds the spectral gap (mixing time) of a simulated\ntempering chain in terms of the spectral gap (mixing time) of two chains:\n\n1. \u201ccomponent\u201d chains on the mixture components\n2. 
a \u201cprojected\u201d chain whose state space is the set of components, and which captures the action of the chain between components as well as the $\\chi^2$-divergence between the mixture components.\n\nThis means that if the Markov chain on the individual components mixes rapidly, and the \u201cprojected\u201d chain mixes rapidly, then the simulated tempering chain mixes rapidly as well. (Note [MR02, Theorem 1.2] does partition into mixture components, but they only consider the special case where the components are laid out in a chain.)\n\nThe mixing time of a continuous Markov chain is quantified by a Poincar\u00e9 inequality.\n\nTheorem (Simplified version of Theorem D.2). Consider the simulated tempering chain $M$ with rate $\\lambda = \\frac{1}{C}$, where the Markov chain at the $i$th level (temperature) is $M_i = (\\Omega, \\mathscr{L}_i)$ with stationary distribution $p_i$, for $1 \\le i \\le L$. Suppose we have a decomposition of the Markov chain at each level, $p_i M_i = \\sum_{j=1}^m w_{i,j} p_{i,j} M_{i,j}$, where $\\sum_{j=1}^m w_{i,j} = 1$. If each $M_{i,j}$ satisfies a Poincar\u00e9 inequality with constant $C$, and the $\\chi^2$-projected chain $\\bar{M}$ satisfies a Poincar\u00e9 inequality with constant $\\bar{C}$, then $M$ satisfies a Poincar\u00e9 inequality with constant $O(C(1 + \\bar{C}))$.\n\nHere, the projected chain $\\bar{M}$ is the chain on $[L] \\times [m]$ with probability flow in the same and adjacent levels given by\n\n$$P((i, j), (i, j')) = \\frac{w_{i,j'}}{\\chi^2_{\\mathrm{sym}}(p_{i,j}, p_{i,j'})} \\tag{7}$$\n\n$$P((i, j), (i \\pm 1, j)) = \\frac{\\min\\left\\{ \\frac{w_{i \\pm 1, j}}{w_{i,j}}, 1 \\right\\}}{\\chi^2_{\\mathrm{sym}}(p_{i,j}, p_{i \\pm 1, j})}, \\tag{8}$$\n\nwhere $\\chi^2_{\\mathrm{sym}}(p, q) = \\max\\{\\chi^2(p \\| q), \\chi^2(q \\| p)\\}$.\n\nThe decomposition theorem is the reason why we use a slightly different simulated tempering chain, which is allowed to transition at arbitrary times, with some rate $\\lambda$. 
Such a chain \u201ccomposes\u201d nicely with the decomposition of the Langevin chain, and allows better control of the Dirichlet form of the tempering chain, which governs the mixing time.\n\nStep 3: Finally, we need to apply the decomposition theorem to our setup, namely a distribution which is a mixture of strongly log-concave distributions. The \u201ccomponents\u201d of the decomposition in our setup are simply the mixture components $e^{-f_0(x - \\mu_j)}$. We rely crucially on the fact that Langevin diffusion on a mixture distribution decomposes into Langevin diffusion on the individual components.\n\nWe actually first analyze the hypothetical simulated tempering Langevin chain on $\\widetilde{p}_i \\propto \\sum_{j=1}^m w_j e^{-\\beta_i f_0(x - \\mu_j)}$ (Theorem E.1), i.e., where the stationary distribution for each temperature is a mixture. Then in Lemma E.5 we compare to the actual simulated tempering Langevin chain that we can run, where $p_i \\propto p^{\\beta_i}$. To do this, we use the fact that $p_i$ is off from $\\widetilde{p}_i$ by a factor of at most $\\frac{1}{w_{\\min}}$. (This is where the factor of $w_{\\min}$ comes in.)\n\nTo use our Markov chain decomposition theorem, we need to show two things:\n\n1. The component chains mix rapidly: this follows from the classic fact that Langevin diffusion mixes rapidly for log-concave distributions.\n\n2. 
The projected chain mixes rapidly: The \u201cprojected\u201d chain is defined as having more probability flow between mixture components in the same or adjacent temperatures which are close together in $\\chi^2$-divergence.\nBy choosing the temperatures close enough, we can ensure that the corresponding mixture components in adjacent temperatures are close in $\\chi^2$-divergence. By choosing the highest temperature large enough, we can ensure that all the mixture components at the highest temperature are close in $\\chi^2$-divergence.\nFrom this it follows that we can easily get from any component to any other (by traveling up to the highest temperature and then back down). Thus the projected chain mixes rapidly by the method of canonical paths (Theorem B.4).\n\nNote that the equal variance (for gaussians) or shape (for general log-concave distributions) condition is necessary here. For gaussians with different variance, the Markov chain can fail to mix between components at the highest temperature. This is because scaling the temperature changes the variance of all the components equally, and preserves their ratio (which is not equal to 1).\n\nStep 4: We analyze the error from discretization (Lemma F.1), and choose parameters so that it is small. We show that in Algorithm 2 we can inductively estimate the partition functions. When we have all the estimates, we can run the simulated tempering chain on all the temperatures to get the desired sample.\n\n4 Conclusion\n\nWe initiated a study of sampling \u201cbeyond log-concavity.\u201d In so doing, we developed a new general technique to analyze simulated tempering, a classical algorithm used in practice to combat multi-modality but that has seen little theoretical analysis. The technique is a new decomposition lemma for Markov chains based on decomposing the Markov chain rather than just the state space. 
We have analyzed simulated tempering with Langevin diffusion, but note that the technique can be applied to any other Markov chain with a notion of temperature.\n\nOur result is the first result in its class (sampling multimodal, non-log-concave distributions with gradient oracle access). Admittedly, distributions encountered in practice are rarely mixtures of distributions with the same shape. However, we hope that our techniques may be built on to provide guarantees for more practical probability distributions. An exciting research direction is to provide (average-case) guarantees for probability distributions encountered in practice, such as posteriors for clustering, topic models, and Ising models. For example, the posterior distribution for a mixture of gaussians can have exponentially many terms, but may perhaps be tractable in practice. Another interesting direction is to study other temperature heuristics used in practice, such as particle filters [Sch12, DMHW+12, PJT15, GDM+17], annealed importance sampling [Nea01], and parallel tempering [WSH09].\n\nReferences\n\n[BBCG08] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin. A simple proof of the Poincar\u00e9 inequality for a large class of probability measures including the log-concave case. Electron. Commun. Probab., 13:60\u201366, 2008.\n\n[B\u00c985] Dominique Bakry and Michel \u00c9mery. Diffusions hypercontractives. In S\u00e9minaire de Probabilit\u00e9s XIX 1983/84, pages 177\u2013206. Springer, 1985.\n\n[BEGK02] Anton Bovier, Michael Eckhoff, V\u00e9ronique Gayrard, and Markus Klein. Metastability and low lying spectra in reversible Markov chains. Communications in Mathematical Physics, 228(2):219\u2013255, 2002.\n\n[BEGK04] Anton Bovier, Michael Eckhoff, V\u00e9ronique Gayrard, and Markus Klein. 
Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399\u2013424, 2004.\n\n[BEL18] S\u00e9bastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757\u2013783, 2018.\n\n[BGK05] Anton Bovier, V\u00e9ronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. Journal of the European Mathematical Society, 7(1):69\u201399, 2005.\n\n[BGL13] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013.\n\n[Bha78] RN Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. The Annals of Probability, pages 541\u2013553, 1978.\n\n[CCBJ17] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.\n\n[Dal16] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.\n\n[Dal17] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678\u2013689, Amsterdam, Netherlands, 07\u201310 Jul 2017. PMLR.\n\n[DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Proceedings of the 2018 Conference on Learning Theory, PMLR 75, 2018.\n\n[DK17] Arnak S Dalalyan and Avetik G Karagulyan. 
User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.\n\n[DM16] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. 2016.\n\n[DMHW+12] Pierre Del Moral, Peng Hu, Liming Wu, et al. On the concentration properties of interacting particle processes. Foundations and Trends\u00ae in Machine Learning, 3(3\u20134):225\u2013389, 2012.\n\n[DMM18] Alain Durmus, Szymon Majewski, and B\u0142a\u017cej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018.\n\n[GDM+17] Fran\u00e7ois Giraud, Pierre Del Moral, et al. Nonasymptotic analysis of adaptive and annealed Feynman\u2013Kac particle models. Bernoulli, 23(1):670\u2013709, 2017.\n\n[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.\n\n[Lia05] Faming Liang. Determination of normalizing constants for simulated tempering. Physica A: Statistical Mechanics and its Applications, 356(2-4):468\u2013480, 2005.\n\n[LS93] L\u00e1szl\u00f3 Lov\u00e1sz and Mikl\u00f3s Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359\u2013412, 1993.\n\n[MP92] Enzo Marinari and Giorgio Parisi. Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.\n\n[MR02] Neal Madras and Dana Randall. Markov chain decomposition for convergence rate analysis. Annals of Applied Probability, pages 581\u2013606, 2002.\n\n[MS17] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.\n\n[MV17] Oren Mangoubi and Nisheeth K Vishnoi. Convex optimization with nonconvex oracles. arXiv preprint arXiv:1711.02621, 2017.\n\n[Nea96] Radford M Neal. 
Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353\u2013366, 1996.\n\n[Nea01] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125\u2013139, 2001.\n\n[PJT15] Daniel Paulin, Ajay Jasra, and Alexandre Thiery. Error bounds for sequential Monte Carlo samplers for multimodal distributions. arXiv preprint arXiv:1509.08775, 2015.\n\n[PP07] Sanghyun Park and Vijay S Pande. Choosing weights for simulated tempering. Physical Review E, 76(1):016703, 2007.\n\n[RMW14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278\u20131286, 2014.\n\n[RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674\u20131703, 2017.\n\n[Sch12] Nikolaus Schweizer. Non-asymptotic error bounds for sequential MCMC methods. 2012.\n\n[SR11] David Sontag and Dan Roy. Complexity of inference in latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1008\u20131016, 2011.\n\n[Vem05] Santosh Vempala. Geometric random walks: a survey. Combinatorial and Computational Geometry, 52(573-612):2, 2005.\n\n[WSH09] Dawn B Woodard, Scott C Schmidler, and Mark Huber. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, pages 617\u2013640, 2009.\n\n[Zhe03] Zhongrong Zheng. On swapping and simulated tempering algorithms. 
Stochastic Processes and their Applications, 104(1):131\u2013154, 2003.", "award": [], "sourceid": 4881, "authors": [{"given_name": "Holden", "family_name": "Lee", "institution": "Princeton"}, {"given_name": "Andrej", "family_name": "Risteski", "institution": "MIT"}, {"given_name": "Rong", "family_name": "Ge", "institution": "Duke University"}]}