{"title": "Adaptive Bayesian Sampling with Monte Carlo EM", "book": "Advances in Neural Information Processing Systems", "page_first": 1256, "page_last": 1266, "abstract": "We present a novel technique for learning the mass matrices in samplers obtained from discretized dynamics that preserve some energy function. Existing adaptive samplers use Riemannian preconditioning techniques, where the mass matrices are functions of the parameters being sampled. This leads to significant complexities in the energy reformulations and resultant dynamics, often leading to implicit systems of equations and requiring inversion of high-dimensional matrices in the leapfrog steps. Our approach provides a simpler alternative, by using existing dynamics in the sampling step of a Monte Carlo EM framework, and learning the mass matrices in the M step with a novel online technique. We also propose a way to adaptively set the number of samples gathered in the E step, using sampling error estimates from the leapfrog dynamics. Along with a novel stochastic sampler based on Nos\\'{e}-Poincar\\'{e} dynamics, we use this framework with standard Hamiltonian Monte Carlo (HMC) as well as newer stochastic algorithms such as SGHMC and SGNHT, and show strong performance on synthetic and real high-dimensional sampling scenarios; we achieve sampling accuracies comparable to Riemannian samplers while being significantly faster.", "full_text": "Adaptive Bayesian Sampling with Monte Carlo EM\n\nAnirban Roychowdhury, Srinivasan Parthasarathy\n\nDepartment of Computer Science and Engineering\n\nroychowdhury.7@osu.edu, srini@cse.ohio-state.edu\n\nThe Ohio State University\n\nAbstract\n\nWe present a novel technique for learning the mass matrices in samplers obtained\nfrom discretized dynamics that preserve some energy function. Existing adaptive\nsamplers use Riemannian preconditioning techniques, where the mass matrices are\nfunctions of the parameters being sampled. 
This leads to significant complexities in the energy reformulations and resultant dynamics, often leading to implicit systems of equations and requiring inversion of high-dimensional matrices in the leapfrog steps. Our approach provides a simpler alternative, by using existing dynamics in the sampling step of a Monte Carlo EM framework, and learning the mass matrices in the M step with a novel online technique. We also propose a way to adaptively set the number of samples gathered in the E step, using sampling error estimates from the leapfrog dynamics. Along with a novel stochastic sampler based on Nosé-Poincaré dynamics, we use this framework with standard Hamiltonian Monte Carlo (HMC) as well as newer stochastic algorithms such as SGHMC and SGNHT, and show strong performance on synthetic and real high-dimensional sampling scenarios; we achieve sampling accuracies comparable to Riemannian samplers while being significantly faster.

1 Introduction

Markov Chain Monte Carlo sampling is a well-known set of techniques for learning complex Bayesian probabilistic models that arise in machine learning. Typically used in cases where computing the posterior distributions of parameters in closed form is not feasible, MCMC techniques that converge reliably to the target distributions offer a provably correct way (in an asymptotic sense) to draw samples of target parameters from arbitrarily complex probability distributions. A recently proposed method in this domain is Hamiltonian Monte Carlo (HMC) [1, 2], which formulates the target density as an "energy function" augmented with auxiliary "momentum" parameters, and uses discretized Hamiltonian dynamics to sample the parameters while preserving the energy function. The resulting samplers perform noticeably better than random walk-based methods in terms of sampling efficiency and accuracy [1, 3]. 
For use in stochastic settings, where one uses random minibatches of the data to calculate the gradients of likelihoods for better scalability, researchers have used Fokker-Planck correction steps to preserve the energy in the face of stochastic noise [4], as well as used auxiliary "thermostat" variables to control the effect of this noise on the momentum terms [5, 6]. As with the batch setting, these methods have exploited energy-preserving dynamics to sample more efficiently than random walk-based stochastic samplers [4, 7, 8].

A primary (hyper-)parameter of interest in these augmented energy function-based samplers is the "mass" matrix of the kinetic energy term; as noted by various researchers [1, 3, 6, 8, 9], this matrix plays an important role in the trajectories taken by the samplers in the parameter space of interest, thereby affecting the overall efficiency. While prior efforts have set this to the identity matrix or some other pre-calculated value [4, 5, 7], recent work has shown that there are significant gains to be had in efficiency as well as convergent accuracy by reformulating the mass in terms of the target parameters to be sampled [3, 6, 8], thereby making the sampler sensitive to the underlying geometry. This is done by imposing a positive definite constraint on the adaptive mass, and using it as the metric of the Riemannian manifold of probability distributions parametrized by the target parameters. This constraint also satisfies the condition that the momenta be sampled from a Gaussian with the mass as the covariance.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Often called Riemannian preconditioning, this idea has been applied in both batch [3] as well as stochastic settings [6, 8] to derive HMC-based samplers that adaptively learn the critically important mass matrix from the data.

Although robust, these reformulations often lead to significant complexities in the resultant dynamics; one can end up solving an implicit system of equations in each half-step of the leapfrog dynamics [3, 6], along with inverting large (O(D²)) matrices. This is sometimes sidestepped by performing fixed point updates at the cost of additional error, or restricting oneself to simpler formulations that honor the symmetric positive definite constraint, such as a diagonal matrix [8]. While this latter choice ameliorates a lot of the added complexity, it is clearly suboptimal in the context of adapting to the underlying geometry of the parameter space. Thus we would ideally need a mechanism to robustly learn this critical mass hyperparameter from the data without significantly adding to the computational burden.

We address this issue in this work with the Monte Carlo EM (MCEM) [10, 11, 12, 13] framework. An alternative to the venerable EM technique, MCEM is used to locally optimize maximum likelihood problems where the posterior probabilities required in the E step of EM cannot be computed in closed form. In this work, we perform existing dynamics derived from energy functions in the Monte Carlo E step while holding the mass fixed, and use the stored samples of the momentum term to learn the mass in the M step. We address the important issue of selecting appropriate E-step sampling iterations, using error estimates to gradually increase the sample sizes as the Markov chain progresses towards convergence. 
Combined with an online method to update the mass using sample covariance estimates in the M step, this gives a clean and scalable adaptive sampling algorithm that performs favorably compared to the Riemannian samplers. In both our synthetic experiments and a high dimensional topic modeling problem with a complex Bayesian nonparametric construction [14], our samplers match or beat the Riemannian variants in sampling efficiency and accuracy, while being close to an order of magnitude faster.

2 Preliminaries

2.1 MCMC with Energy-Preserving Dynamics

In Hamiltonian Monte Carlo, the energy function is written as

H(θ, p) = −L(θ) + (1/2) pᵀ M⁻¹ p.   (1)

Here X is the observed data, and θ denotes the model parameters. L(θ) = log p(X|θ) + log p(θ) denotes the log likelihood of the data given the parameters along with the Bayesian prior, and p denotes the auxiliary "momentum" mentioned above. Note that the second term in the energy function, the kinetic energy, is simply the kernel of a Gaussian with the mass matrix M acting as covariance. Hamilton's equations of motion are then applied to this energy function to derive the following differential equations, with the dot accent denoting a time derivative:

θ̇ = M⁻¹ p,   ṗ = ∇L(θ).

These are discretized using the generalized leapfrog algorithm [1, 15] to create a sampler that is both symplectic and time-reversible, up to a discretization error that is quadratic in the stepsize.

Machine learning applications typically see the use of very large datasets for which computing the gradients of the likelihoods in every leapfrog step followed by a Metropolis-Hastings correction ratio is prohibitively expensive. 
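As a concrete illustration of these discretized dynamics, the sketch below implements a single leapfrog trajectory with a Metropolis-Hastings correction for the energy (1), with the mass M held fixed. It is a minimal sketch for a fixed mass, not the adaptive algorithm developed later in the paper; the function names and the Gaussian example target are our own.

```python
import numpy as np

def leapfrog(theta, p, grad_L, Minv, eps, n_steps):
    """Leapfrog integration of H(theta, p) = -L(theta) + 0.5 p^T M^{-1} p."""
    p = p + 0.5 * eps * grad_L(theta)            # initial half step for momentum
    for i in range(n_steps):
        theta = theta + eps * (Minv @ p)         # full step for position
        if i < n_steps - 1:
            p = p + eps * grad_L(theta)          # interior full momentum steps
    p = p + 0.5 * eps * grad_L(theta)            # final half step for momentum
    return theta, p

def hmc_step(theta, log_L, grad_L, M, Minv, eps, n_steps, rng):
    """One HMC transition: draw p ~ N(0, M), integrate, accept/reject."""
    p0 = rng.multivariate_normal(np.zeros(theta.shape[0]), M)
    h0 = -log_L(theta) + 0.5 * p0 @ Minv @ p0
    theta_new, p_new = leapfrog(theta, p0, grad_L, Minv, eps, n_steps)
    h1 = -log_L(theta_new) + 0.5 * p_new @ Minv @ p_new
    if np.log(rng.uniform()) < h0 - h1:          # Metropolis-Hastings correction
        return theta_new
    return theta
```

Since the discretization error is quadratic in the stepsize, h1 − h0 stays small for reasonable stepsizes and most proposals are accepted.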
To address this, one uses random "minibatches" of the dataset in each iteration [16], allowing some stochastic noise for improved scalability, and removes the Metropolis-Hastings (M-H) correction steps [4, 7]. To preserve the system energy in this context one has to additionally apply Fokker-Planck corrections to the dynamics [17]. The stochastic sampler in [4] uses these techniques to preserve the canonical Gibbs energy above (1). Researchers have also used the notion of "thermostats" from the molecular dynamics literature [9, 18, 19, 20] to further control the behavior of the momentum terms in the face of stochastic noise; the resulting algorithm [5] preserves an energy of its own [21] as well.

2.2 Adaptive MCMC using Riemannian Manifolds

As mentioned above, learning the mass matrices in these MCMC systems is an important challenge. Researchers have traditionally used Riemannian manifold reformulations to address this, and integrate the updating of the mass into the sampling steps. In [3] the authors use this approach to derive adaptive variants of first-order Langevin dynamics as well as HMC. For the latter the reformulated energy function can be written as:

H_gc(θ, p) = −L(θ) + (1/2) pᵀ G(θ)⁻¹ p + (1/2) log{(2π)^D |G(θ)|},   (2)

where D is the dimensionality of the parameter space. Note that the momentum variable p can be integrated out to recover the desired marginal density of θ, in spite of the covariance being a function of θ. In the machine learning literature, the authors of [8] used a diagonal G(θ) to produce an adaptive variant of the algorithm in [7], whereas the authors in [6] derived deterministic and stochastic algorithms from a Riemannian variant of the Nosé-Poincaré energy [9], with the resulting adaptive samplers preserving symplecticness as well as canonical system temperature.

2.3 Monte Carlo EM

The EM algorithm [22] is widely used to learn maximum likelihood parameter estimates for complex probabilistic models. In cases where the expectations of the likelihoods required in the E step are not tractable, one can use Monte Carlo simulations of the posterior instead. The resulting Monte Carlo EM (MCEM) framework [10] has been widely studied in the statistics literature, with various techniques developed to efficiently draw samples and estimate Monte Carlo errors in the E step [11, 12, 13]. For instance, the expected log-likelihood is usually replaced with the following Monte Carlo approximation: Q(θ|θ_t) = (1/m) Σ_{l=1}^{m} log p(X, u^t_l | θ), where u represents the latent augmentation variables used in EM, and m is the number of samples taken in the E step. While applying this framework, one typically has to carefully tune the number of samples gathered in the E step, since the potential distance from the stationary distribution in the early phases would necessitate drawing relatively fewer samples, and progressively more as the sampler nears convergence.

In this work we leverage this MCEM framework to learn M in (1) and similar energies using samples of p; the discretized dynamics constitute the E step of the MCEM framework, with suitable updates to M performed in the corresponding M step. 
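This division of labor can be sketched as a generic loop: the E step runs whatever energy-preserving sampler is chosen, holding M fixed, and the M step re-estimates M from the stored momenta. A minimal skeleton, with a hypothetical `sample_momenta` routine standing in for the actual dynamics:

```python
import numpy as np

def mcem_mass_update(sample_momenta, M0, n_rounds, s_count):
    """Monte Carlo EM loop for the kinetic mass M.

    sample_momenta(M, n): E step; runs the chosen energy-preserving
    sampler with M held fixed and returns n stored momentum samples.
    """
    M = M0
    for _ in range(n_rounds):
        P = sample_momenta(M, s_count)   # E step: dynamics with M fixed
        M = np.cov(P, rowvar=False)      # M step: momenta are ~ N(0, M)
    return M
```

For instance, if the chain has converged so that the stored momenta are draws from N(0, C), the M step recovers C.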
We also use a novel mechanism to dynamically adjust the sample count by using sampling errors estimated from the gathered samples, as described next.

3 Mass-Adaptive Sampling with Monte Carlo EM

3.1 The Basic Framework

Riemannian samplers start off by reformulating the energy function, making the mass a function of θ and adding suitable terms to ensure constancy of the marginal distributions. Our approach is fundamentally different: we cast the task of learning the mass as a maximum likelihood problem over the space of symmetric positive definite matrices. For instance, we can construct the following problem for standard HMC:

max_{M ≻ 0}  L(θ) − (1/2) pᵀ M⁻¹ p − (1/2) log|M|.   (3)

Recall that the joint likelihood is p(θ, p) ∝ exp(−H(θ, p)), H(·,·) being the energy from (1). Then, we use correct samplers that preserve the desired densities in the E step of a Monte Carlo EM (MCEM) framework, and use the obtained samples of p in the corresponding M step to perform suitable updates for the mass M. Specifically, to wrap the standard HMC sampler in our framework, we perform the generalized leapfrog steps [1, 15] to obtain proposal updates for θ, p followed by Metropolis-Hastings corrections in the E step, and use the obtained p values in the M step. The resultant adaptive sampling method is shown in Alg. 1.

Note that this framework can also be applied to stochastic samplers that preserve the energy, up to standard discretization errors. We can wrap the SGHMC sampler [4] in our framework as well, since it uses Fokker-Planck corrections to approximately preserve the energy (1) in the presence of stochastic noise. We call the resulting method SGHMC-EM, and specify it in Alg. 3 in the supplementary.

As another example, the SGNHT sampler [5] is known to preserve a modified Gibbs energy [21]; therefore we can propose the following max-likelihood problem for learning the mass:

max_{M ≻ 0}  L(θ) − (1/2) pᵀ M⁻¹ p − (1/2) log|M| + µ(ξ − ξ̄)²/2,   (4)

where ξ is the thermostat variable, and µ, ξ̄ are constants chosen to preserve correct marginals. The SGNHT dynamics can be used in the E step to maintain the above energy, and we can use the collected p samples in the M step as before. We call the resultant method SGNHT-EM, as shown in Alg. 2. Note that, unlike standard HMC above, we do not perform Metropolis-Hastings correction steps on the gathered samples for these cases. As shown in the algorithms, we collect one set of momenta samples per epoch, after the leapfrog iterations. We use S_count to denote the number of such samples collected before running an M-step update.

The advantage of this MCEM approach over the parameter-dependent Riemannian variants is twofold:
1. The existing Riemannian adaptive algorithms in the literature [3, 6, 8] all start by modifying the energy function, whereas our framework does not have any such requirement. As long as one uses a sampling mechanism that preserves some energy with correct marginals for θ, in a stochastic sense or otherwise, it can be used in the E step of our framework.
2. The primary disadvantage of the Riemannian algorithms is the added complexity in the dynamics derived from the modified energy functions. One typically ends up using generalized leapfrog dynamics [3, 6], which can lead to implicit systems of equations; to solve these one either has to use standard solvers that have complexity at least cubic in the dimensionality [23, 24], with scalability issues in high dimensional datasets, or use fixed point updates with worsened error guarantees. 
An alternative approach is to use diagonal covariance matrices, as mentioned earlier, which ignores the coordinate correlations. Our MCEM approach sidesteps all these issues by keeping the existing dynamics of the desired E step sampler unchanged. As shown in the experiments, we can match or beat the Riemannian samplers in accuracy and efficiency by using suitable sample sizes and M step updates, with significantly improved sampling complexities and runtimes.

3.2 Dynamic Updates for the E-step Sample Size

Algorithm 1 HMC-EM
  Input: θ(0), ϵ, LP_S, S_count
  · Initialize M;
  repeat
    · Sample p(t) ∼ N(0, M);
    for i = 1 to LP_S do
      · p(i) ← p(i+ϵ−1), θ(i) ← θ(i+ϵ−1);
      · p(i+ϵ/2) ← p(i) − (ϵ/2) ∇_θ H(θ(i), p(i));
      · θ(i+ϵ) ← θ(i) + ϵ ∇_p H(θ(i), p(i+ϵ/2));
      · p(i+ϵ) ← p(i+ϵ/2) − (ϵ/2) ∇_θ H(θ(i+ϵ), p(i+ϵ/2));
    end for
    · Set (θ(t+1), p(t+1)) from (θ(LP_S+ϵ), p(LP_S+ϵ)) using Metropolis-Hastings;
    · Store MC-EM sample p(t+1);
    if (t + 1) mod S_count = 0 then
      · Update M using MC-EM samples;
    end if
    · Update S_count as described in the text;
  until forever

We now turn our attention to the task of learning the sample size in the E step from the data. The nontriviality of this issue is due to the following reasons: first, we cannot let the sampling dynamics run to convergence in each E step without making the whole process prohibitively slow; second, we have to account for the correlation among successive samples, especially early on in the process when the Markov chain is far from convergence, possibly with "thinning" techniques; and third, we may want to increase the sample count as the chain matures and gets closer to the stationary distribution, and use relatively fewer samples early on.

To this end, we leverage techniques derived from the MCEM literature in statistics [11, 13, 25] to first evaluate a suitable "test" function of the target parameters at certain subsampled steps, using the gathered samples and current M step estimates. We then use confidence intervals created around these evaluations to gauge the relative effect of successive MCEM estimates over the Monte Carlo error. If the updated values of these functions using newer M-step estimates lie in these intervals, we increase the number of samples collected in the next MCEM loop.

Specifically, similar to [13], we start off with the following test function for HMC-EM (Alg. 1): q(·) = [M⁻¹p, ∇L(θ)]. We then subsample some timesteps as mentioned below, evaluate q at those steps, and create confidence intervals using sample means and variances:

m_S = (1/S) Σ_{s=1}^{S} q_s,   v_S = (1/S) Σ_{s=1}^{S} q_s² − m_S²,   C_S := m_S ± z_{1−α/2} v_S,

where S denotes the subsample count, z_{1−α/2} is the (1 − α) critical value of a standard Gaussian, and C_S the confidence interval mentioned earlier. For SGNHT-EM (Alg. 2), we use the following test function: q(·) = [M⁻¹p, ∇L(θ) + ξM⁻¹p, pᵀM⁻¹p], derived from the SGNHT dynamics.

Algorithm 2 SGNHT-EM
  Input: θ(0), ϵ, A, LP_S, S_count
  · Initialize ξ(0), p(0) and M;
  repeat
    for i = 1 to LP_S do
      · p(i+1) ← p(i) − ϵ ξ(i) M⁻¹ p(i) + ϵ ∇̃L(θ(i)) + √(2A) N(0, ϵ);
      · θ(i+1) ← θ(i) + ϵ M⁻¹ p(i+1);
      · ξ(i+1) ← ξ(i) + ϵ [(1/D) p(i+1)ᵀ M⁻¹ p(i+1) − 1];
    end for
    · Set (θ(t+1), p(t+1), ξ(t+1)) = (θ(LP_S+1), p(LP_S+1), ξ(LP_S+1));
    · Store MC-EM sample p(t+1);
    if (t + 1) mod S_count = 0 then
      · Update M using MC-EM samples;
    end if
    · Update S_count as described in the text;
  until forever

One can adopt the following method described in [25]: choose the subsampling offsets {t_1, . . . , t_S} as t_s = Σ_{i=1}^{s} x_i, where x_i − 1 ∼ Poisson(ν i^d), with suitably chosen ν ≥ 1 and d > 0. 
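In code, the subsampling offsets and the interval test might be sketched as follows; a scalar test statistic is used for brevity (the paper's q is vector-valued), z, ν, d, and S_I are the constants above, and the function names are ours:

```python
import numpy as np

def subsample_offsets(S, nu, d, rng):
    """Offsets t_s = sum_{i<=s} x_i with x_i - 1 ~ Poisson(nu * i^d)."""
    x = 1 + rng.poisson(nu * np.arange(1, S + 1) ** d)
    return np.cumsum(x)

def update_sample_count(q_vals, q_new, S, S_I, z=1.96):
    """Grow S when the re-evaluated test statistic q_new falls inside the
    band m_S +/- z * v_S built from the subsampled q evaluations."""
    m = float(np.mean(q_vals))
    v = float(np.mean(np.square(q_vals)) - m ** 2)
    if m - z * v <= q_new <= m + z * v:
        S = S + S // S_I          # Monte Carlo noise dominates: add samples
    return S
```

The offsets are strictly increasing since each x_i ≥ 1, so distinct timesteps are always tested.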
We found both this and a fixed set of S offsets to work well in our experiments.

With the subsamples collected using this mechanism, we calculate the confidence intervals as described earlier. The assumption is that this interval provides an estimate of the spread of q due to the Monte Carlo error. We then perform the M-step, and evaluate q using the updated M-step estimates. If this value lies in the previously calculated confidence bound, we increase S as S = S + S/S_I in the following iteration to overcome the Monte Carlo noise. See [11, 13] for details on these procedures. Values for the constants ν, α, d, S_I, as well as initial estimates for S, are given in the supplementary. Running values for S are denoted S_count hereafter.

3.3 An Online Update for the M-Step

Next we turn our attention to the task of updating the mass matrices using the collected momenta samples. As shown in the energy functions above, the momenta are sampled from zero-mean normal distributions, enabling us to use standard covariance estimation techniques from the literature. However, since we are using discretized MCMC to obtain these samples, we have to address the variance arising from the Monte Carlo error, especially during the burn-in phase. To that end, we found a running average of the updates to work well in our experiments; in particular, we updated the inverse mass matrix, denoted as M_I, at the kth M-step as:

M_I^(k) = (1 − κ^(k)) M_I^(k−1) + κ^(k) M_I^(k,est),   (5)

where M_I^(k,est) is a suitable estimate computed from the gathered samples in the kth M-step, and {κ^(k)} is a step sequence satisfying some standard assumptions, as described below. Note that the M_I's correspond to the precision matrix of the Gaussian distribution of the momenta; updating this during the M-step also removes the need to invert the mass matrices during the leapfrog iterations. Curiously, we found the inverse of the empirical covariance matrix to work quite well as M_I^(k,est) in our experiments.

These updates also induce a fresh perspective on the convergence of the overall MCEM procedure. Existing convergence analyses in the statistics literature fall into three broad categories: a) the almost sure convergence presented in [26] as t → ∞ with increasing sample sizes, b) the asymptotic angle presented in [27], where the sequence of MCEM updates is analyzed as an approximation to the standard EM sequence as the sample size, referred to as S_count above, tends to infinity, and c) the asymptotic consistency results obtained from multiple Gibbs chains in [28], by letting the chain counts and iterations tend to ∞. Our analysis differs from all of these, by focusing on the maximum likelihood situations noted above as convex optimization problems, and using SGD convergence techniques [29] for the sequence of iterates M_I^(k).

Proposition 1. Assume the M_I^(k,est)'s provide an unbiased estimate of ∇J, and have bounded eigenvalues. Let inf_{‖M_I − M_I*‖₂ > ϵ} ∇J(M_I) > 0 ∀ϵ > 0. Further, let the sequence {κ^(k)} satisfy Σ_k κ^(k) = ∞, Σ_k (κ^(k))² < ∞. Then the sequence {M_I^(k)} converges to the MLE of the precision almost surely.

Recall that the (negative) precision is a natural parameter of the normal distribution written in exponential family notation, and that the log-likelihood is a concave function of the natural parameters for this family; this makes max-likelihood a convex optimization problem over the precision, even in the presence of linear constraints [30, 31]. Therefore, this implies that the problems (3), (4) have a unique maximum, denoted by M_I* above. Also note that the update (5) corresponds to a first order update on the iterates with an L2-regularized objective, with unit regularization parameter; this is denoted by J(M_I) in the proposition. That is, J is the energy preserved by our sampler(s), as a function of the mass (precision), augmented with an L2 regularization term. The resultant strongly convex optimization problem can be analyzed using SGD techniques under the assumptions noted above; we provide a proof in the supplementary for completeness.

We should note here that the "stochasticity" in the proof does not refer to the stochastic gradients of L(θ) used in the leapfrog dynamics of Algorithms 2 through 5; instead we think of the collected momenta samples as a stochastic minibatch used to compute the gradient of the regularized energy, as a function of the covariance (mass), allowing us to deal with the Monte Carlo error indirectly. 
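A sketch of this M-step, under the paper's choice of the inverse empirical covariance as M_I^(k,est) and an assumed step sequence κ^(k) = 1/(k+1), one valid choice satisfying the conditions of Proposition 1:

```python
import numpy as np

def m_step_update(Minv_prev, momenta, k):
    """Online update (5) for the inverse mass (precision) M_I."""
    kappa = 1.0 / (k + 1)        # sum of kappa diverges, sum of squares converges
    est = np.linalg.inv(np.cov(momenta, rowvar=False))  # inverse empirical covariance
    return (1.0 - kappa) * Minv_prev + kappa * est
```

Since M_I is the precision of the momenta, the leapfrog steps can consume it directly, with no matrix inversion per iteration.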
Also note that our assumption on the unbiasedness of the M_I^(k,est) estimates is similar to [26], and distinct from assuming that the MCEM samples of θ are unbiased; indeed, it would be difficult to make this latter claim, since stochastic samplers in general are known to have a convergent bias.

3.4 Nosé-Poincaré Variants

We next develop a stochastic version of the dynamics derived from the Nosé-Poincaré Hamiltonian, followed by an MCEM variant. This allows for a direct comparison of the Riemann manifold formulation and our MCEM framework for learning the kinetic masses, in a stochastic setting with thermostat controls on the momentum terms and desired properties like reversibility and symplecticness provided by generalized leapfrog discretizations. The Nosé-Poincaré energy function can be written as [6, 9]:

H_NP = s ( −L(θ) + (1/2) (p/s)ᵀ M⁻¹ (p/s) + q²/(2Q) + gkT log s − H₀ ),   (6)

where L(θ) is the joint log-likelihood, s is the thermostat control, p and q the momentum terms corresponding to θ and s respectively, and M and Q the respective mass terms. See [6, 9] for descriptions of the other constants. Our goal is to learn both M and Q using the MCEM framework, as opposed to [6], where both were formulated in terms of θ. 
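For concreteness, the energy (6) can be evaluated directly; this sketch (our own notation, with the product gkT passed as a single constant) is useful for checking conservation when testing an implementation of the dynamics:

```python
import numpy as np

def nose_poincare_energy(log_L, theta, p, s, q, Minv, Q, gkT, H0):
    """H_NP = s * (-L(theta) + 0.5 (p/s)^T M^{-1} (p/s) + q^2/(2Q) + gkT*log(s) - H0)."""
    kinetic = 0.5 * (p / s) @ Minv @ (p / s)
    return s * (-log_L(theta) + kinetic + q ** 2 / (2.0 * Q) + gkT * np.log(s) - H0)
```

With s = 1 and q = 0 the thermostat terms vanish and the value reduces to the Gibbs energy (1) shifted by H₀.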
To that end, we propose the following system of equations for the stochastic scenario:

p^{t+ϵ/2} = p + (ϵ/2) [ s ∇̃L(θ) − (B(θ)/√s) M⁻¹ (p^{t+ϵ/2}/s) ],

(ϵ/(4Q)) (q^{t+ϵ/2})² + (1 + (ϵ/2) A(θ) s) q^{t+ϵ/2} − q − (ϵ/2) [ (1/2) (p^{t+ϵ/2}/s)ᵀ M⁻¹ (p^{t+ϵ/2}/s) − gkT (1 + log s) + L̃(θ) + H₀ ] = 0,

θ^{t+ϵ} = θ + ϵ M⁻¹ p^{t+ϵ/2} (1/2) (1/s + 1/s^{t+ϵ}),

s^{t+ϵ} = s + (ϵ/2) (q^{t+ϵ/2}/Q) (s + s^{t+ϵ}),

p^{t+ϵ} = p^{t+ϵ/2} + (ϵ/2) [ s^{t+ϵ} ∇̃L(θ^{t+ϵ}) − (B(θ^{t+ϵ})/√s^{t+ϵ}) M⁻¹ (p^{t+ϵ/2}/s^{t+ϵ}) ],

q^{t+ϵ} = q^{t+ϵ/2} + (ϵ/2) [ (1/2) (p^{t+ϵ/2}/s^{t+ϵ})ᵀ M⁻¹ (p^{t+ϵ/2}/s^{t+ϵ}) − (q^{t+ϵ/2})²/(2Q) − A(θ) s^{t+ϵ} q^{t+ϵ/2} − gkT (1 + log s^{t+ϵ}) + L̃(θ^{t+ϵ}) + H₀ ],   (7)

where t + ϵ/2 denotes the half-step dynamics, ∇̃ and L̃ signify noisy stochastic estimates, and A(θ) and B(θ) denote the stochastic noise terms, necessary for the Fokker-Planck corrections [6]. Note that we only have to solve a quadratic equation for q^{t+ϵ/2}, with the other updates also being closed-form, as opposed to the implicit system of equations in [6].

Proposition 2. 
The dynamics (7) preserve the Nosé-Poincaré energy (6).

The proof is a straightforward application of the Fokker-Planck corrections for stochastic noise to the Hamiltonian dynamics derived from (6), and is provided in the supplementary. With these dynamics, we first develop the SG-NPHMC algorithm (Alg. 4 in the supplementary) as a counterpart to SGHMC and SGNHT, and wrap it in our MCEM framework to create SG-NPHMC-EM (Alg. 5 in the supplementary). As we shall demonstrate shortly, this EM variant performs comparably to SGR-NPHMC from [6], while being significantly faster.

4 Experiments

In this section we compare the performance of the MCEM-augmented variants of HMC, SGHMC as well as SGNHT with their standard counterparts, where the mass matrices are set to the identity matrix. We call these augmented versions HMC-EM, SGHMC-EM, and SGNHT-EM respectively. As baselines for the synthetic experiments, in addition to the standard samplers mentioned above, we also evaluate RHMC [3] and SGR-NPHMC [6], two recent algorithms based on dynamic Riemann manifold formulations for learning the mass matrices. In the topic modeling experiment, for scalability reasons we evaluate only the stochastic algorithms, including the recently proposed SGR-NPHMC, and omit HMC, HMC-EM and RHMC. Since we restrict the discussions in this paper to samplers with second-order dynamics, we do not compare our methods with SGLD [7] or SGRLD [8].

4.1 Parameter Estimation of a 1D Standard Normal Distribution

In this experiment we aim to learn the parameters of a unidimensional standard normal distribution in both batch and stochastic settings, using 5,000 data points generated from N(0, 1), analyzing the impact of our MC-EM framework along the way. We compare all the algorithms mentioned so far: HMC, HMC-EM, SGHMC, SGHMC-EM, SGNHT, SGNHT-EM, SG-NPHMC, SG-NPHMC-EM along with RHMC and SGR-NPHMC. 
The generative model consists of normal-Wishart priors on the mean µ and precision τ, with posterior distribution p(µ, τ|X) ∝ N(X|µ, τ)W(τ|1, 1), where W denotes the Wishart distribution. We run all the algorithms for the same number of iterations, discarding the first 5,000 as "burn-in". Batch sizes were fixed to 100 for all the stochastic algorithms, along with 10 leapfrog iterations across the board. For SGR-NPHMC and RHMC, we used the observed Fisher information plus the negative Hessian of the prior as the tensor, with one fixed point iteration on the implicit system of equations arising from the dynamics of both. For HMC we used a fairly high learning rate of 1e-2. For SGHMC and SGNHT we used A = 10 and A = 1 respectively. For SGR-NPHMC we used A, B = 0.01.

We show the RMSE numbers collected from post-burn-in samples as well as per-iteration runtimes in Table 1. An "iteration" here refers to a complete E step, with the full quota of leapfrog jumps. The improvements afforded by our MCEM framework are immediately noticeable; HMC-EM matches the errors obtained from RHMC, in effect matching the sample distribution, while being much faster (an order of magnitude) per iteration. The stochastic MCEM algorithms show markedly better performance as well; SGNHT-EM in particular beats SGR-NPHMC in RMSE-τ while being significantly faster due to simpler updates for the mass matrices.

Table 1: RMSE of the sampled means, precisions and per-iteration runtimes (in milliseconds) from runs on synthetic Gaussian data.

METHOD         TIME       RMSE (µ)   RMSE (τ)
HMC            0.417 ms   0.0196     0.0197
HMC-EM         0.423 ms   0.0115     0.0104
RHMC           5.748 ms   0.0111     0.0089
SGHMC          0.133 ms   0.1590     0.1646
SGHMC-EM       0.132 ms   0.0713     0.2243
SG-NPHMC       0.514 ms   0.0326     0.0433
SG-NPHMC-EM    0.498 ms   0.0274     0.0354
SGR-NPHMC      3.145 ms   0.0240     0.0308
SGNHT          0.148 ms   0.0344     0.0335
SGNHT-EM       0.148 ms   0.0317     0.0289

Accuracy improvements are particularly noticeable for the high learning rate regimes for HMC, SGHMC and SG-NPHMC.

Table 1: RMSE of the sampled means and precisions, and per-iteration runtimes (in milliseconds), from runs on synthetic Gaussian data.

METHOD         TIME       RMSE (µ)   RMSE (τ)
HMC            0.417ms    0.0196     0.0197
HMC-EM         0.423ms    0.0115     0.0104
RHMC           5.748ms    0.0111     0.0089
SGHMC          0.133ms    0.1590     0.1646
SGHMC-EM       0.132ms    0.0713     0.2243
SG-NPHMC       0.514ms    0.0326     0.0433
SG-NPHMC-EM    0.498ms    0.0274     0.0354
SGR-NPHMC      3.145ms    0.0240     0.0308
SGNHT          0.148ms    0.0344     0.0335
SGNHT-EM       0.148ms    0.0317     0.0289

4.2 Parameter Estimation in 2D Bayesian Logistic Regression

Next we present some results obtained from a Bayesian logistic regression experiment, using both synthetic and real datasets. For the synthetic case, we used the same methodology as [6]; we generated 2,000 observations from a mixture of two normal distributions with means at [1, −1] and [−1, 1], with mixing weights set to (0.5, 0.5) and the covariance set to I. We then classify these points using a linear classifier with weights {W0, W1} = [1, −1], and attempt to learn these weights using our samplers. We put N(0, 10I) priors on the weights, and used the metric tensor described in §7 of [3] for the Riemannian samplers. In the (generalized) leapfrog steps of the Riemannian samplers, we opted to use 2 or 3 fixed point iterations to approximate the solutions to the implicit equations. Along with this synthetic setup, we also fit a Bayesian LR model to the Australian Credit and Heart datasets from the UCI repository, for additional runtime comparisons.
The Australian Credit dataset contains 690 datapoints of dimensionality 14, and the Heart dataset has 270 13-dimensional datapoints.

Table 2: RMSE of the two regression parameters, for the synthetic Bayesian logistic regression experiment. See text for details.

METHOD         RMSE (W0)   RMSE (W1)
HMC            0.0456      0.1290
HMC-EM         0.0145      0.0851
RHMC           0.0091      0.0574
SGHMC          0.2812      0.2717
SGHMC-EM       0.2804      0.2583
SG-NPHMC       0.4945      0.4263
SG-NPHMC-EM    0.0990      0.4229
SGR-NPHMC      0.1901      0.1925
SGNHT          0.2035      0.1921
SGNHT-EM       0.1983      0.1729

For the synthetic case, we discard the first 10,000 samples as burn-in, and calculate RMSE values from the remaining samples. Learning rates were chosen from {1e−2, 1e−4, 1e−6}, and values of the stochastic noise terms were selected from {0.001, 0.01, 0.1, 1, 10}. Leapfrog steps were chosen from {10, 20, 30}. For the stochastic algorithms we used a batch size of 100.

The RMSE numbers for the synthetic dataset are shown in Table 2, and the per-iteration runtimes for all the datasets are shown in Table 3. We initialized S_count to 300 for HMC-EM, SGHMC-EM, and SGNHT-EM, and to 200 for SG-NPHMC-EM. The MCEM framework noticeably improves the accuracy in almost all cases, with no computational overhead. Note the improvement for SG-NPHMC in terms of RMSE for W0.
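The synthetic setup and the reported error metric can be sketched in a few lines. Everything here follows the description in the text; the `rmse` helper, the per-coordinate reading of "RMSE of the sampled samples against the true weights", and the placeholder sample array `S` are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup from the text: 2,000 points from an equal-weight mixture of
# N([1, -1], I) and N([-1, 1], I), labeled by a linear classifier w = [1, -1].
n = 2000
z = rng.random(n) < 0.5
means = np.where(z[:, None], [1.0, -1.0], [-1.0, 1.0])
X = means + rng.standard_normal((n, 2))
w_true = np.array([1.0, -1.0])
y = (X @ w_true > 0).astype(float)        # labels fed to the Bayesian LR model

def rmse(samples, w_true):
    """Per-coordinate RMSE of post-burn-in weight samples against the true
    weights, in the spirit of Table 2 (our reading of the metric)."""
    err = samples - w_true
    return np.sqrt(np.mean(err ** 2, axis=0))

# e.g., with hypothetical post-burn-in samples S of shape (n_samples, 2):
S = w_true + 0.05 * rng.standard_normal((5000, 2))
r = rmse(S, w_true)
```

In the experiments, `S` would be the retained samples from each sampler after discarding burn-in.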
For the runtime calculations, we set all samplers to 10 leapfrog steps, and fixed S_count to the values mentioned above. The comparisons with the Riemannian algorithms tell a clear story: though we do get somewhat better accuracy with these samplers, they are orders of magnitude slower. In our synthetic case, for instance, each iteration of RHMC (consisting of all the leapfrog steps and the M-H ratio calculation) takes more than a second, using 10 leapfrog steps and 2 fixed point iterations for the implicit leapfrog equations, whereas both HMC and HMC-EM are simpler and much faster. Also note that the M-step calculations for our MCEM framework involve a single-step closed-form update for the precision matrix, using the collected samples of p once every S_count sampling steps; thus we can amortize the cost of the M-step over the previous S_count iterations, leading to negligible changes to the per-sample runtimes.

Table 3: Per-iteration runtimes (in milliseconds) for the Bayesian logistic regression experiments, on both synthetic and real datasets.

METHOD         TIME (SYNTH)   TIME (AUS)   TIME (HEART)
HMC            1.435ms        0.987ms      0.791ms
HMC-EM         1.428ms        0.970ms      0.799ms
RHMC           1550ms         367ms        209ms
SGHMC          0.200ms        0.136ms      0.112ms
SGHMC-EM       0.203ms        0.141ms      0.131ms
SG-NPHMC       0.731ms        0.512ms      0.403ms
SG-NPHMC-EM    0.803ms        0.525ms      0.426ms
SGR-NPHMC      6.720ms        4.568ms      3.676ms
SGNHT          0.302ms        0.270ms      0.166ms
SGNHT-EM       0.306ms        0.251ms      0.175ms

4.3 Topic Modeling using a Nonparametric Gamma Process Construction

Next we turn our attention to a high-dimensional topic modeling experiment using a nonparametric Gamma process construction. We elect to follow the experimental setup described in [6]. Specifically, we use the Poisson factor analysis framework of [32].
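The amortized M-step noted in the runtime discussion above — a single closed-form refresh of the mass from the momenta collected over the last S_count sampling steps — can be sketched as follows. The function name, the ridge regularizer, and the plain second-moment estimator are our illustrative assumptions under a p ~ N(0, M) momentum model; the paper's exact estimator is given in its M-step derivation:

```python
import numpy as np

def m_step_mass_update(P, ridge=1e-6):
    """Closed-form M-step for the kinetic mass matrix, assuming the momenta
    collected in the E step are modeled as p ~ N(0, M).  The MLE of M is the
    empirical second moment of the momenta; the leapfrog updates use its
    inverse (the precision), so we return both.  `ridge` (our assumption)
    keeps the estimate invertible.  P has shape (S_count, d)."""
    S, d = P.shape
    M = P.T @ P / S + ridge * np.eye(d)
    M_inv = np.linalg.inv(M)
    return M, M_inv

# Usage: every S_count sampling steps, refresh M from the momenta buffer.
rng = np.random.default_rng(1)
true_M = np.diag([1.0, 4.0])
P = rng.multivariate_normal(np.zeros(2), true_M, size=300)   # S_count = 300
M_hat, M_inv_hat = m_step_mass_update(P)
```

Because this is a single matrix product and inverse performed once per S_count samples, its cost amortizes to a negligible per-sample overhead, as reported in Table 3.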
Denoting the vocabulary as V, and the documents in the corpus as D, we model the observed counts of the vocabulary terms as D_{V×N} ∼ Poi(ΦΘ), where Θ_{K×N} models the counts of K latent topics in the documents, and Φ_{V×K} denotes the factor loading matrix, which encodes the relative importance of the vocabulary terms in the latent topics. Following standard Bayesian convention, we model the columns of Φ as φ_{·,k} ∼ Dirichlet(α), using normalized Gamma variables: φ_{v,k} = γ_v / (Σ_v γ_v), with γ_v ∼ Γ(α, 1). Then we have θ_{n,k} ∼ Γ(r_k, p_n/(1 − p_n)); we put β(a_0, b_0) priors on the document-specific mixing probabilities p_n. We then set the r_k's to the atom weights generated by the constructive Gamma process definition of [14]; we refer the reader to that paper for the details of the formulation. This leads to a rich nonparametric construction of this Poisson factor analysis model for which closed-form Gibbs updates are infeasible, thereby providing a testing application area for the stochastic MCMC algorithms. We omit the Metropolis-Hastings correction-based HMC and RHMC samplers in this evaluation due to poor scalability.

Figure 1: Test perplexities plotted against (a) post-burnin iterations and (b) wall-clock time for the 20-Newsgroups dataset. See text for experimental details.

We use count matrices from the 20-Newsgroups and Reuters Corpus Volume 1 corpora [33]. The former has 2,000 words and 18,845 documents, while the second has a vocabulary of size 10,000 over 804,414 documents. We used a chronological 60−40 train-test split for both datasets. Following standard convention for stochastic algorithms, after each minibatch we learn document-specific parameters from 80% of the test set, and calculate test perplexities on the remaining 20%.
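The generative process just described can be sketched at toy scale. The Gamma-process atom weights r_k of [14] are replaced here by placeholder positive draws, and the Gamma scale parametrization is our assumption; everything else follows the model spec above:

```python
import numpy as np

rng = np.random.default_rng(2)
V, K, N = 50, 5, 20          # vocabulary size, latent topics, documents (toy scale)
alpha, a0, b0 = 0.5, 1.0, 1.0

# Topic-word matrix Phi: columns are normalized Gamma variables, i.e. Dirichlet.
gamma_vk = rng.gamma(alpha, 1.0, size=(V, K))
Phi = gamma_vk / gamma_vk.sum(axis=0, keepdims=True)

# Atom weights r_k would come from the Gamma-process construction of [14];
# here we draw placeholder positive weights instead (illustration only).
r = rng.gamma(1.0, 1.0, size=K)

# Document-specific mixing probabilities and topic counts;
# theta_{n,k} ~ Gamma(r_k, p_n / (1 - p_n)) read with a scale parameter.
p = rng.beta(a0, b0, size=N)
Theta = rng.gamma(r[:, None], (p / (1.0 - p))[None, :])     # shape (K, N)

# Observed word counts: D ~ Poisson(Phi Theta), shape (V, N).
D = rng.poisson(Phi @ Theta)
```

At experiment scale (V = 2,000 or 10,000, K = 100), these latent variables are what the stochastic samplers infer from minibatches of the count matrix D.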
Test perplexity, a commonly used measure for such evaluations, is detailed in the supplementary. As noted in [14], the atom weights have three sets of components: the E_k's, the T_k's and the hyperparameters α, γ and c. As in [6], we ran three parallel chains for these parameters, collecting samples of the momenta from the T_k and hyperparameter chains for the MCEM mass updates. We kept the mass of the E_k chain fixed to I_K, and chose K = 100 as the number of latent topics. We initialized S_count, the E-step sample size in our algorithms, to 50 for NPHMC-EM and 100 for the rest. Increasing S_count over time yielded fairly minor improvements, hence we kept it fixed to the values above for simplicity. Additional details on batch sizes, learning rates, stochastic noise estimates, leapfrog iterations etc. are provided in the supplementary. For the 20-Newsgroups dataset we ran all algorithms for 1,500 burn-in iterations, and collected samples for the next 1,500 steps thereafter, with a stride of 100, for perplexity calculations. For the Reuters dataset we used 2,500 burn-in iterations. Note that for all these algorithms, an "iteration" corresponds to a full E-step with a stochastic minibatch.

The numbers obtained at the end of the runs are shown in Table 4, along with per-iteration runtimes. The post-burnin perplexity-vs-iteration plots from the 20-Newsgroups dataset are shown in Figure 1. We can see significant improvements from the MCEM framework for all samplers, with that of SGNHT being highly pronounced (719 vs 757); indeed, the SG-NPHMC samplers have lower perplexities (712) than those obtained by SGR-NPHMC (723), while being close to an order of magnitude faster per iteration for 20-Newsgroups even when the latter used diagonalized metric tensors, ostensibly by avoiding implicit systems of equations in the leapfrog steps to learn the kinetic masses.
The framework yields nontrivial improvements for the Reuters dataset as well.

Table 4: Test perplexities and per-iteration runtimes on the 20-Newsgroups and Reuters datasets.

METHOD         20-NEWS   REUTERS   TIME (20-NEWS)
SGHMC          759       996       0.047s
SGHMC-EM       738       972       0.047s
SGNHT          757       979       0.045s
SGNHT-EM       719       968       0.045s
SGR-NPHMC      723       952       0.410s
SG-NPHMC       714       958       0.049s
SG-NPHMC-EM    712       947       0.049s

5 Conclusion
We propose a new theoretically grounded approach to learning the mass matrices in Hamiltonian-based samplers, including both standard HMC and stochastic variants, using a Monte Carlo EM framework. In addition to a newly proposed stochastic sampler, we augment certain existing samplers with this technique to devise a set of new algorithms that learn the kinetic masses dynamically from the data in a flexible and scalable fashion. Experiments conducted on synthetic and real datasets demonstrate the efficacy and efficiency of our framework, when compared to existing Riemannian manifold-based samplers.

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions. This material is based upon work supported by the National Science Foundation under Grant No. DMS-1418265. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References
[1] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, pages 113–162. Chapman & Hall / CRC Press, 2011.

[2] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[3] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123\u2013\n214, 2011.\n\n[4] T. Chen, E. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In Proceedings\nof The 31st International Conference on Machine Learning (ICML), pages 1683\u20131691, 2014.\n\n[5] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian Sampling using\nStochastic Gradient Thermostats. In Advances in Neural Information Processing Systems (NIPS)\n27, pages 3203\u20133211, 2014.\n\n[6] A. Roychowdhury, B. Kulis, and S. Parthasarathy. Robust Monte Carlo Sampling using\nRiemannian Nos\u00e9-Poincar\u00e9 Hamiltonian Dynamics. In Proceedings of The 33rd International\nConference on Machine Learning (ICML), pages 2673\u20132681, 2016.\n\n[7] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics.\nIn Proceedings of The 28th International Conference on Machine Learning (ICML), pages\n681\u2013688, 2011.\n\n[8] S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the\nProbability Simplex. In Advances in Neural Information Processing Systems (NIPS) 26, pages\n3102\u20133110, 2013.\n\n[9] S. D. Bond, B. J. Leimkuhler, and B. B. Laird. The Nos\u00e9-Poincar\u00e9 Method for Constant\n\nTemperature Molecular Dynamics. J. Comput. Phys, 151:114\u2013134, 1999.\n\n[10] G. C. G. Wei and M. A. Tanner. A Monte Carlo Implementation of the EM Algorithm and the\nPoor Man\u2019s Data Augmentation Algorithms. Journal of the American Statistical Association,\n85:699\u2013704, 1990.\n\n[11] J. G. Booth and J. P. Hobert. Maximizing Generalized Linear Mixed Model Likelihoods with\nan Automated Monte Carlo EM Algorithm. Journal of the Royal Statistical Society Series B,\n61(1):265\u2013285, 1999.\n\n[12] C. E. McCulloch. Maximum Likelihood Algorithms for Generalized Linear Mixed Models.\n\nJournal of the American Statistical Association, 92(437):162\u2013170, 1997.\n\n[13] R. A. Levine and G. 
Casella. Implementations of the Monte Carlo EM Algorithm. Journal of Computational and Graphical Statistics, 10(3):422–439, 2001.

[14] A. Roychowdhury and B. Kulis. Gamma Processes, Stick-Breaking, and Variational Inference. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 800–808, 2015.

[15] B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge University Press, 2004.

[16] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[17] L. Yin and P. Ao. Existence and Construction of Dynamical Potential in Nonequilibrium Processes without Detailed Balance. Journal of Physics A: Mathematical and General, 39(27):8593, 2006.

[18] D. Frenkel and B. Smit. Understanding Molecular Simulations: From Algorithms to Applications, 2nd Edition. Academic Press, 2001.

[19] B. Leimkuhler and C. Matthews. Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Springer, 2015.

[20] W. G. Hoover. Canonical dynamics: Equilibrium phase-space distributions. Physical Review A (General Physics), 31(3):1695–1697, 1985.

[21] A. Jones and B. Leimkuhler. Adaptive stochastic methods for sampling driven molecular systems. Journal of Chemical Physics, 135(8):084125, 2011.

[22] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B, 39(1):1–38, 1977.

[23] J. D. Dixon. Exact solution of linear equations using P-adic expansions. Numerische Mathematik, 40(1):137–141, 1982.

[24] W. Eberly, M. Giesbrecht, P. Giorgi, A. Storjohann, and G. Villard. Solving sparse rational linear systems. In Proceedings of the 2006 International Symposium on Symbolic and Algebraic Computation (ISSAC), pages 63–70, 2006.

[25] C. P. Robert, T.
Ryd\u00e9n, and D. M. Titterington. Convergence Controls for MCMC Algorithms,\nWith Applications to Hidden Markov Chains. Journal of Statistical Computation and Simulation,\n64:327\u2013355, 1999.\n\n[26] G. Fort and E. Moulines. Convergence of the Monte Carlo Expectation Maximization for\n\nCurved Exponential Families. The Annals of Statistics, 31(4):1220\u20131259, 2003.\n\n[27] K. S. Chan and J. Ledolter. Monte Carlo EM Estimation for Time Series Models Involving\n\nCounts. Journal of the American Statistical Association, 90(429):242\u2013252, 1995.\n\n[28] R. P. Sherman, Y.-Y. K. Ho, and S. R. Dalal. Conditions for convergence of Monte Carlo\nEM sequences with an application to product diffusion modeling . The Econometrics Journal,\n2(2):248\u2013267, 1999.\n\n[29] L. Bottou. On-line Learning and Stochastic Approximations. In On-line Learning in Neural\n\nNetworks, pages 9\u201342. Cambridge University Press, 1998.\n\n[30] C. Uhler. Geometry of maximum likelihood estimation in Gaussian graphical models. Annals\n\nof Statistics, 40:238\u2013261, 2012.\n\n[31] A. P. Dempster. Covariance selection. Biometrics, 28:157\u2013175, 1972.\n\n[32] M. Zhou and L. Carin. Negative Binomial Process Count and Mixture Modeling. IEEE Trans.\n\nPattern Anal. Mach. Intell., 37(2):307\u2013320, 2015.\n\n[33] N. Srivastava, R. Salakhutdinov, and G. E. Hinton. Modeling documents with deep Boltzmann\nmachines. In Proceedings of the 29th Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI),\npages 616\u2013624, 2013.\n\n11\n\n\f", "award": [], "sourceid": 841, "authors": [{"given_name": "Anirban", "family_name": "Roychowdhury", "institution": "Ohio State University"}, {"given_name": "Srinivasan", "family_name": "Parthasarathy", "institution": "The Ohio State University"}]}