{"title": "Variational Consensus Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 1207, "page_last": 1215, "abstract": "Practitioners of Bayesian statistics have long depended on Markov chain Monte Carlo (MCMC) to obtain samples from intractable posterior distributions. Unfortunately, MCMC algorithms are typically serial, and do not scale to the large datasets typical of modern machine learning. The recently proposed consensus Monte Carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel (Scott et al, 2013). A fixed aggregation function then combines these samples, yielding approximate posterior samples. We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. The resulting objective contains an intractable entropy term; we therefore derive a relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions. We illustrate the advantages of our algorithm on three inference tasks from the literature, demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step. Our algorithm achieves a relative error reduction (measured against serial MCMC) of up to 39% compared to consensus Monte Carlo on the task of estimating 300-dimensional probit regression parameter expectations; similarly, it achieves an error reduction of 92% on the task of estimating cluster comembership probabilities in a Gaussian mixture model with 8 components in 8 dimensions. Furthermore, these gains come at moderate cost compared to the runtime of serial MCMC, achieving near-ideal speedup in some instances.", "full_text": "Variational Consensus Monte Carlo\n\nMaxim Rabinovich, Elaine Angelino, and Michael I. 
Jordan\n\n{rabinovich, elaine, jordan}@eecs.berkeley.edu\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nAbstract\n\nPractitioners of Bayesian statistics have long depended on Markov chain Monte\nCarlo (MCMC) to obtain samples from intractable posterior distributions.\nUnfortunately, MCMC algorithms are typically serial, and do not scale to the large\ndatasets typical of modern machine learning. The recently proposed consensus\nMonte Carlo algorithm removes this limitation by partitioning the data and drawing\nsamples conditional on each partition in parallel [22]. A fixed aggregation\nfunction then combines these samples, yielding approximate posterior samples.\nWe introduce variational consensus Monte Carlo (VCMC), a variational Bayes\nalgorithm that optimizes over aggregation functions to obtain samples from a distribution\nthat better approximates the target. The resulting objective contains an\nintractable entropy term; we therefore derive a relaxation of the objective and\nshow that the relaxed problem is blockwise concave under mild conditions. We\nillustrate the advantages of our algorithm on three inference tasks from the literature,\ndemonstrating both the superior quality of the posterior approximation\nand the moderate overhead of the optimization step. Our algorithm achieves a\nrelative error reduction (measured against serial MCMC) of up to 39% compared\nto consensus Monte Carlo on the task of estimating 300-dimensional probit regression\nparameter expectations; similarly, it achieves an error reduction of 92%\non the task of estimating cluster comembership probabilities in a Gaussian mixture\nmodel with 8 components in 8 dimensions. 
Furthermore, these gains come\nat moderate cost compared to the runtime of serial MCMC, achieving near-ideal\nspeedup in some instances.\n\n1 Introduction\n\nModern statistical inference demands scalability to massive datasets and high-dimensional models.\nInnovation in distributed and stochastic optimization has enabled parameter estimation in this setting,\ne.g. via stochastic [3] and asynchronous [20] variants of gradient descent. Achieving similar\nsuccess in Bayesian inference – where the target is a posterior distribution over parameter values,\nrather than a point estimate – remains computationally challenging.\nTwo dominant approaches to Bayesian computation are variational Bayes and Markov chain Monte\nCarlo (MCMC). Within the former, scalable algorithms like stochastic variational inference [11]\nand streaming variational Bayes [4] have successfully imported ideas from optimization. Within\nMCMC, adaptive subsampling procedures [2, 14], stochastic gradient Langevin dynamics [25], and\nFirefly Monte Carlo [16] have applied similar ideas, achieving computational gains by operating\nonly on data subsets. These algorithms are serial, however, and thus cannot take advantage of\nmulticore and multi-machine architectures. This motivates data-parallel MCMC algorithms such as\nasynchronous variants of Gibbs sampling [1, 8, 12].\nOur work belongs to a class of communication-avoiding data-parallel MCMC algorithms. These\nalgorithms partition the full dataset X_{1:N} into K disjoint subsets X_{I_{1:K}}, where X_{I_k} denotes the data\nassociated with core k. 
Each core samples from a subposterior distribution,\n\np_k(θ_k) ∝ p(X_{I_k} | θ_k) p(θ_k)^{1/K},   (1)\n\nand then a centralized procedure combines the samples into an approximation of the full posterior.\nDue to their efficiency, such procedures have recently received substantial attention [18, 22, 24].\nOne of these algorithms, consensus Monte Carlo (CMC), requires communication only at the start\nand end of sampling [22]. CMC proceeds from the intuition that subposterior samples, when aggregated\ncorrectly, can approximate full posterior samples. This is formally backed by the factorization\n\np(θ | x_{1:N}) ∝ p(θ) ∏_{k=1}^{K} p(X_{I_k} | θ) = ∏_{k=1}^{K} p_k(θ).   (2)\n\nIf one can approximate the subposterior densities p_k, using kernel density estimates for instance [18],\nit is therefore possible to recombine them into an estimate of the full posterior.\nUnfortunately, the factorization does not make it immediately clear how to aggregate on the level of\nsamples without first having to obtain an estimate of the densities p_k themselves. CMC alters (2) to\nuntie the parameters across partitions and plug in a deterministic link F from the θ_k to θ:\n\np(θ | x_{1:N}) ≈ ∏_{k=1}^{K} p_k(θ_k) · δ_{θ=F(θ_1,...,θ_K)}.   (3)\n\nThis approximation and an aggregation function motivated by a Gaussian approximation lie at the\ncore of the CMC algorithm [22].\nThe introduction of CMC raises numerous interesting questions whose answers are essential to its\nwider application. Two among these stand out as particularly vital. First, how should the aggregation\nfunction be chosen to achieve the closest possible approximation to the target posterior? Second,\nwhen model parameters exhibit structure or must conform to constraints (if they are, for example,\npositive semidefinite covariance matrices or labeled centers of clusters), how can the weighted\naveraging strategy of Scott et al. 
[22] be modified to account for this structure?\nIn this paper, we propose variational consensus Monte Carlo (VCMC), a novel class of data-parallel\nMCMC algorithms that allow both questions to be addressed. By formulating the choice of aggregation\nfunction as a variational Bayes problem, VCMC makes it possible to adaptively choose the aggregation\nfunction to achieve a closer approximation to the true posterior. The flexibility of VCMC\nlikewise supports nonlinear aggregation functions, including structured aggregation functions applicable\nto inference problems that are not purely vectorial.\nAn appealing benefit of the VCMC point of view is a clarification of the untying step leading\nto (3). In VCMC, the approximate factorization corresponds to a variational approximation to\nthe true posterior. This approximation can be viewed as the joint distribution of (θ_1, ..., θ_K)\nand θ in an augmented model that assumes conditional independence between the data partitions\nand posits a deterministic mapping from partition-level parameters to the single global parameter.\nThe added flexibility of this point of view makes it possible to move beyond subposteriors and include\nalternative forms of (3) within the CMC framework. In particular, it is possible to define\np_k(θ_k) = p(θ_k) p(X_{I_k} | θ_k), using partial posteriors in place of subposteriors (cf. [23]). Although\nextensive investigation of this issue is beyond the scope of this paper, we provide some evidence\nin Section 6 that partial posteriors are a better choice in some circumstances and demonstrate that\nVCMC can provide substantial gains in both the partial posterior and subposterior settings.\nBefore proceeding, we outline the remainder of this paper. Below, in §2, we review CMC and\nrelated data-parallel MCMC algorithms. 
Next, we cast CMC as a variational Bayes problem in §3.\nWe define the variational optimization objective in §4, addressing the challenging entropy term\nby relaxing it to a concave lower bound, and give conditions under which this leads to a blockwise\nconcave maximization problem. In §5, we define several aggregation functions, including novel\nones that enable aggregation of structured samples, e.g. positive semidefinite matrices and mixture\nmodel parameters. In §6, we evaluate the performance of VCMC and CMC relative to serial MCMC.\nWe replicate experiments carried out by Scott et al. [22] and execute more challenging experiments\nin higher dimensions and with more data. Finally, in §7, we summarize our approach and discuss\nseveral open problems generated by this work.\n\n2 Related work\n\nWe focus on data-parallel MCMC algorithms for large-scale Bayesian posterior sampling. Several\nrecent research threads propose schemes in the setting where the posterior factors as in (2).\nIn general, these parallel strategies are approximate relative to serial procedures, and the specific\nalgorithms differ in terms of the approximations employed and amount of communication required.\nAt one end of the communication spectrum are algorithms that fit into the MapReduce model [7].\nFirst, K parallel cores sample from K subposteriors, defined in (1), via any Monte Carlo sampling\nprocedure. The subposterior samples are then aggregated to obtain approximate samples from the\nfull posterior. This leads to the challenge of designing proper and efficient aggregation procedures.\nScott et al. [22] propose consensus Monte Carlo (CMC), which constructs approximate posterior\nsamples via weighted averages of subposterior samples; our algorithms are motivated by this work.\nLet θ_{k,t} denote the t-th subposterior sample from core k. 
In CMC, the aggregation function averages\nacross each set of K samples {θ_{k,t}}_{k=1}^{K} to produce one approximate posterior sample θ̂_t. Uniform\naveraging is a natural but naïve heuristic that can in fact be improved upon via a weighted average,\n\nθ̂ = F(θ_{1:K}) = ∑_{k=1}^{K} W_k θ_k,   (4)\n\nwhere in general, θ_k is a vector and W_k can be a matrix. The authors derive weights motivated by the\nspecial case of a Gaussian posterior, where each subposterior is consequently also Gaussian. Let Σ_k\nbe the covariance of the k-th subposterior. This suggests weights W_k = Σ_k^{-1} equal to the subposteriors'\ninverse covariances. CMC treats arbitrary subposteriors as Gaussians, aggregating with\nweights given by the empirical estimates Σ̂_k^{-1} computed from the observed subposterior samples.\nNeiswanger et al. [18] propose aggregation at the level of distributions rather than samples. Here, the\nidea is to form an approximate posterior via a product of density estimates fit to each subposterior,\nand then sample from this approximate posterior. The accuracy and computational requirements\nof this approach depend on the complexity of these density estimates. Wang and Dunson [24]\ndevelop alternative data-parallel MCMC methods based on applying a Weierstrass transform to each\nsubposterior. These Weierstrass sampling procedures introduce auxiliary variables and additional\ncommunication between computational cores.\n\n3 Consensus Monte Carlo as variational inference\n\nGiven the distributional form of the CMC framework (3), we would like to choose F so that the\ninduced distribution on θ is as close as possible to the true posterior. 
This is precisely the problem\naddressed by variational Bayes, which approximates an intractable posterior p(θ | X) by the solution\nq* to the constrained optimization problem\n\nmin D_KL(q || p(· | X)) subject to q ∈ Q,\n\nwhere Q is the family of variational approximations to the distribution, usually chosen to make both\noptimization and evaluation of target expectations tractable. We thus view the aggregation problem\nin CMC as a variational inference problem, with the variational family given by all distributions\nQ = Q_ℱ = {q_F : F ∈ ℱ}, where each F is in some function class ℱ and defines a density\n\nq_F(θ) = ∫_{Ω^K} ∏_{k=1}^{K} p_k(θ_k) · δ_{θ=F(θ_1,...,θ_K)} dθ_{1:K}.\n\nIn practice, we optimize over finite-dimensional ℱ using projected stochastic gradient descent\n(SGD).\n\n4 The variational optimization problem\n\nStandard optimization of the variational Bayes objective uses the evidence lower bound (ELBO)\n\nlog p(X) = log E_q[p(θ, X) / q(θ)] ≥ E_q[log (p(θ, X) / q(θ))]\n= log p(X) − D_KL(q || p(· | X)) =: L_VB(q).   (5)\n\nWe can therefore recast the variational optimization problem in an equivalent form as\n\nmax L_VB(q) subject to q ∈ Q.\n\nUnfortunately, the variational Bayes objective L_VB remains difficult to optimize. Indeed, by writing\n\nL_VB(q) = E_q[log p(θ, X)] + H[q]\n\nwe see that optimizing L_VB requires computing an entropy H[q] and its gradients. We can deal with\nthis issue by deriving a lower bound on the entropy that relaxes the objective further.\nConcretely, suppose that every F ∈ ℱ can be decomposed as F(θ_{1:K}) = ∑_{k=1}^{K} F_k(θ_k), with\neach F_k a differentiable bijection. Since the θ_k come from subposteriors conditioning on different\nsegments of the data, they are independent. 
The entropy power inequality [6] therefore implies\n\nH[q] ≥ max_{1≤k≤K} H[F_k(θ_k)] = max_{1≤k≤K} (H[p_k] + E_{p_k}[log det[J(F_k)(θ_k)]])   (6)\n≥ min_{1≤k≤K} H[p_k] + max_{1≤k≤K} E_{p_k}[log det[J(F_k)(θ_k)]]\n≥ min_{1≤k≤K} H[p_k] + (1/K) ∑_{k=1}^{K} E_{p_k}[log det[J(F_k)(θ_k)]] =: H̃[q],   (7)\n\nwhere J(f) denotes the Jacobian of the function f. The proof can be found in the supplement.\nThis approach gives an explicit, easily computed approximation to the entropy, and this approximation\nis a lower bound, allowing us to interpret it simply as a further relaxation of the original\ninference problem. Furthermore, and crucially, it decouples p_k and F_k, thereby making it possible\nto optimize over F_k without estimating the entropy of any p_k. We note additionally that if we are\nwilling to sacrifice concavity, we can use the tighter lower bound on the entropy given by (6).\nPutting everything together, we can define our relaxed variational objective as\n\nL(q) = E_q[log p(θ, X)] + H̃[q].   (8)\n\nMaximizing this function is the variational Bayes problem we consider in the remainder of the paper.\n\nConditions for concavity. Under certain conditions, the problem posed above is blockwise concave.\nTo see when this holds, we use the language of graphical models and exponential families. To\nderive the result in the greatest possible generality, we decompose the variational objective as\n\nL_VB = E_q[log p(θ, X)] + H[q] ≥ L̃ + H̃[q]\n\nand prove concavity directly for L̃, then treat our choice of relaxed entropy (7). We emphasize that\nwhile the entropy relaxation is only defined for decomposed aggregation functions, concavity of the\npartial objective holds for arbitrary aggregation functions. 
All proofs are in the supplement.\nSuppose the model distribution is specified via a graphical model G, so that θ = (θ^u)_{u∈V(G)}, such\nthat each conditional distribution is defined by an exponential family\n\nlog p(θ^u | θ^{par(u)}) = log h^u(θ^u) + ∑_{u'∈par(u)} (θ^{u'})^T T^{u'→u}(θ^u) − log A^u(θ^{par(u)}).\n\nIf each of these conditional densities is log-concave in θ^u, we can guarantee that the log\nlikelihood is concave in each θ^u individually.\nTheorem 4.1 (Blockwise concavity of the variational cross-entropy). Suppose that the model distribution\nis specified by a graphical model G in which each conditional probability density is a\nlog-concave exponential family. Suppose further that the variational aggregation function family\nsatisfies ℱ = ∏_{u∈V(G)} ℱ^u such that we can decompose each aggregation function across nodes via\n\nF(θ) = (F^u(θ^u))_{u∈V(G)}, F ∈ ℱ and F^u ∈ ℱ^u.\n\nIf each ℱ^u is a convex subset of some vector space H^u, then the variational cross-entropy L̃ is\nconcave in each F^u individually.\n\nAssuming that the aggregation function can be decomposed into a sum over functions of individual\nsubposterior terms, we can also prove concavity of our entropy relaxation (7).\n\nTheorem 4.2 (Concavity of the relaxed entropy). Suppose ℱ = ∏_{k=1}^{K} ℱ_k, with each function\nF ∈ ℱ decomposing as F(θ_1, ..., θ_K) = ∑_{k=1}^{K} F_k(θ_k) for unique bijective F_k ∈ ℱ_k. Then the\nrelaxed entropy (7) is concave in F.\n\nAs a result, we derive concavity of the variational objective in a broad range of settings.\nCorollary 4.1 (Concavity of the variational objective). 
Under the hypotheses of Theorems 4.1 and\n4.2, the variational Bayes objective L = L̃ + H̃ is concave in each F^u individually.\n\n5 Variational aggregation function families\n\nThe performance of our algorithm depends critically on the choice of aggregation function family ℱ.\nThe family must be sufficiently simple to support efficient optimization, expressive to capture the\ncomplex transformation from the set of subposteriors to the full posterior, and structured to preserve\nstructure in the parameters. We now illustrate some aggregation functions that meet these criteria.\n\nVector aggregation. In the simplest case, θ ∈ R^d is an unconstrained vector. Then, a linear aggregation\nfunction F_W = ∑_{k=1}^{K} W_k θ_k makes sense, and it is natural to impose constraints to make this\nsum behave like a weighted average, i.e., each W_k ∈ S^d_+ is a positive semidefinite (PSD) matrix\nand ∑_{k=1}^{K} W_k = I_d. For computational reasons, it is often desirable to restrict to diagonal W_k.\n\nSpectral aggregation. Cases involving structure exhibit more interesting behavior. Indeed, if our\nparameter is a PSD matrix Λ ∈ S^d_+, applying the vector aggregation function above to the flattened\nvector form vec(Λ) of the parameter does not suffice. Denoting elementwise matrix product as ⊙,\nwe note that this strategy would in general lead to F_W(Λ_{1:K}) = ∑_{k=1}^{K} W_k ⊙ Λ_k ∉ S^d_+.\nWe therefore introduce a more sophisticated aggregation function that preserves PSD structure. For\nthis, given symmetric A ∈ R^{d×d}, define R(A) and D(A) to be orthogonal and diagonal matrices,\nrespectively, such that A = R(A)^T D(A) R(A). Impose further, and crucially, the canonical\nordering D(A)_{11} ≥ ··· ≥ D(A)_{dd}. 
We can then define our spectral aggregation function by\n\nF^spec_W(Λ_{1:K}) = ∑_{k=1}^{K} R(Λ_k)^T [W_k D(Λ_k)] R(Λ_k).\n\nAssuming W_k ∈ S^d_+, the output of this function is guaranteed to be PSD, as required. As above, we\nrestrict the set of W_k to the matrix simplex {(W_k)_{k=1}^{K} : W_k ∈ S^d_+, ∑_{k=1}^{K} W_k = I}.\n\nCombinatorial aggregation. Additional complexity arises with unidentifiable latent variables\nand, more generally, models with multimodal posteriors. Since this class encompasses many popular\nmodels in machine learning, including factor analysis, mixtures of Gaussians and multinomials,\nand latent Dirichlet allocation (LDA), we now show how our framework can accommodate them.\nFor concreteness, suppose now that our model parameters are given by θ ∈ R^{L×d}, where L denotes\nthe number of global latent variables (e.g. cluster centers). We introduce discrete alignment parameters\na_k that indicate how latent variables associated with partitions map to global latent variables.\nEach a_k is thus a one-to-one correspondence [L] → [L], with a_{kℓ} denoting the index on worker\ncore k of cluster center ℓ. For fixed a, we then obtain the variational aggregation function\n\nF_a(θ_{1:K}) = (∑_{k=1}^{K} W_{kℓ} θ_{k a_{kℓ}})_{ℓ=1}^{L}.\n\nOptimization can then proceed in an alternating manner, switching between the alignments a_k\nand the weights W_k, or in a greedy manner, fixing the alignments at the start and optimizing\nthe weight matrices. In practice, we do the latter, aligning using a simple heuristic objective\n\nO(a) = ∑_{k=2}^{K} ∑_{ℓ=1}^{L} ‖θ̄_{k a_{kℓ}} − θ̄_{1ℓ}‖_2^2,\n\nwhere θ̄_{kℓ} denotes the mean value of cluster center ℓ on\npartition k. As O suggests, we set a_{1ℓ} = ℓ. Minimizing O via the Hungarian algorithm [15] leads\nto good alignments.\n\nFigure 1: High-dimensional probit regression (d = 300). 
Moment approximation error for the\nuniform and Gaussian averaging baselines and VCMC, relative to serial MCMC, for subposteriors\n(left) and partial posteriors (right); note the different vertical axis scales. We assessed three\ngroups of functions: first moments, with f(β) = β_j for 1 ≤ j ≤ d; pure second moments, with\nf(β) = β_j^2 for 1 ≤ j ≤ d; and mixed second moments, with f(β) = β_i β_j for 1 ≤ i < j ≤ d. For\nbrevity, results for pure second moments are relegated to Figure 5 in the supplement.\n\n6 Empirical evaluation\n\nWe now evaluate VCMC on three inference problems, in a range of data and dimensionality conditions.\nIn the vector parameter case, we compare directly to the simple weighting baselines corresponding\nto previous work on CMC [22]; in the other cases, we compare to structured analogues of\nthese weighting schemes. Our experiments demonstrate the advantages of VCMC across the whole\nrange of model dimensionality, data quantity, and availability of parallel resources.\n\nBaseline weight settings. Scott et al. [22] studied linear aggregation functions with fixed weights,\n\nW_k^unif = (1/K) · I_d and W_k^gauss ∝ diag(Σ̂_k)^{-1},   (9)\n\ncorresponding to uniform averaging and Gaussian averaging, respectively, where Σ̂_k denotes the\nstandard empirical estimate of the covariance. These are our baselines for comparison.\n\nEvaluation metrics. Since the goal of MCMC is usually to estimate event probabilities and function\nexpectations, we evaluate algorithm accuracy for such estimates, relative to serial MCMC output. 
For each model, we consider a suite of test functions f ∈ F (e.g. low degree polynomials,\ncluster comembership indicators), and we assess the error of each algorithm A using the metric\n\nε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|.\n\nIn the body of the paper, we report median values of ε_A, computed within each test function class.\nThe supplement expands on this further, showing quartiles for the differences in ε_VCMC and ε_CMC.\n\nBayesian probit regression. We consider the nonconjugate probit regression model. In this case,\nwe use linear aggregation functions as our function class. For computational efficiency, we also\nlimit ourselves to diagonal W_k. We use Gibbs sampling on the following augmented model:\n\nβ ∼ N(0, σ^2 I_d),   Z_n | β, x_n ∼ N(β^T x_n, 1),   Y_n | Z_n, β, x_n = 1 if Z_n > 0, and 0 otherwise.\n\nThis augmentation allows us to implement an efficient and rapidly mixing Gibbs sampler, where\n\nβ | x_{1:N} = X, z_{1:N} = z ∼ N(Σ X^T z, Σ),   Σ = (σ^{-2} I_d + X^T X)^{-1}.\n\nWe run two experiments: the first using a data generating distribution from Scott et al. [22],\nwith N = 8500 data points and d = 5 dimensions, and the second using N = 10^5 data points and\nd = 300 dimensions. As shown in Figure 1 and, in the supplement,^1 Figures 4 and 5, VCMC decreases\nthe error of moment estimation compared to the baselines, with substantial gains starting\nat K = 25 partitions (and increasing with K). We also run the high-dimensional experiment using\npartial posteriors [23] in place of subposteriors, and observe substantially lower errors in this case.\n\nFigure 2: High-dimensional normal-inverse Wishart model (d = 100). (Far left, left, right) Moment\napproximation error for the uniform and Gaussian averaging baselines and VCMC, relative to serial\nMCMC. 
Letting ρ_j denote the jth largest eigenvalue of Λ^{-1}, we assessed three groups of functions:\nfirst moments, with f(Λ) = ρ_j for 1 ≤ j ≤ d; pure second moments, with f(Λ) = ρ_j^2 for 1 ≤ j ≤ d;\nand mixed second moments, with f(Λ) = ρ_i ρ_j for 1 ≤ i < j ≤ d. (Far right) Graph of error in\nestimating E[ρ_j] as a function of j (where ρ_1 ≥ ρ_2 ≥ ··· ≥ ρ_d).\n\nNormal-inverse Wishart model. To compare directly to prior work [22], we consider the normal-inverse\nWishart model\n\nΛ ∼ Wishart(ν, V),   X_n | µ, Λ ∼ N(µ, Λ^{-1}).\n\nHere, we use spectral aggregation rules as our function class, restricting to diagonal W_k for computational\nefficiency. We run two sets of experiments: one using the covariance matrix from Scott\net al. [22], with N = 5000 data points and d = 5 dimensions, and one using a higher-dimensional\ncovariance matrix designed to have a small spectral gap and a range of eigenvalues, with N = 10^5\ndata points and d = 100 dimensions. In both cases, we use a form of projected SGD, using 40\nsamples per iteration to estimate the variational gradients and running 25 iterations of optimization.\nWe note that because the mean µ is treated as a point-estimated parameter, one could sample Λ\nexactly using normal-inverse Wishart conjugacy [10]. As Figure 2 shows,^2 VCMC improves both\nfirst and second posterior moment estimation as compared to the baselines. Here, the greatest gains\nfrom VCMC appear at large numbers of partitions (K = 50, 100). We also note that uniform and\nGaussian averaging perform similarly because the variances do not differ much across partitions.\n\nMixture of Gaussians. A substantial portion of Bayesian inference focuses on latent variable\nmodels and, in particular, mixture models. 
We therefore evaluate VCMC on a mixture of Gaussians,\n\nθ_{1:L} ∼ N(0, τ^2 I_d),   Z_n ∼ Cat(π),   X_n | Z_n = z ∼ N(θ_z, σ^2 I_d),\n\nwhere the mixture weights π and the prior and likelihood variances τ^2 and σ^2 are assumed known.\nWe use the combinatorial aggregation functions defined in Section 5; we set L = 8, τ = 2, σ = 1,\nand π uniform and generate N = 5 × 10^4 data points in d = 8 dimensions, using the model\nfrom Nishihara et al. [19]. The resulting inference problem is therefore L × d = 64-dimensional.\nAll samples were drawn using the PyStan implementation of Hamiltonian Monte Carlo (HMC).\nAs Figure 3a shows, VCMC drastically improves moment estimation compared to the baseline\nGaussian averaging (9). To assess how VCMC influences estimates of cluster membership probabilities,\nwe generated 100 new test points from the model and analyzed cluster comembership\nprobabilities for all pairs in the test set. Concretely, for each x_i and x_j in the test data, we estimated\nP[x_i and x_j belong to the same cluster]. Figure 3a shows the resulting boost in accuracy:\nwhen σ = 1, VCMC delivers estimates close to those of serial MCMC, across all numbers of partitions;\nthe errors are larger for σ = 2. Unlike in the previous models, uniform averaging here outperforms\nGaussian averaging, and indeed is competitive with VCMC.\n\nAssessing computational efficiency. The efficiency of VCMC depends on that of the optimization\nstep, which depends on factors including the step size schedule, number of samples used per iteration\nto estimate gradients, and size of data minibatches used per iteration. Extensively assessing the\ninfluence of all these factors is beyond the scope of this paper, and is an active area of research both\nin general and specifically in the context of variational inference [13, 17, 21]. 
Here, we provide\nan initial assessment of the computational efficiency of VCMC, taking the probit regression and\nGaussian mixture models as our examples, using step sizes and sample numbers from above, and\neschewing minibatching on data points.\n\n^1 Due to space constraints, we relegate results for d = 5 to the supplement.\n^2 Due to space constraints, we compare to the d = 5 experiment of Scott et al. [22] in the supplement.\n\n(a) Mixture of Gaussians (d = 8, L = 8).\n\n(b) Error versus timing and speedup measurements.\n\nFigure 3: (a) Expectation approximation error for the uniform and Gaussian baselines and VCMC.\nWe report the median error, relative to serial MCMC, for cluster comembership probabilities of\npairs of test data points, for (left) σ = 1 and (right) σ = 2, where we run the VCMC optimization\nprocedure for 50 and 200 iterations, respectively. When σ = 2, some comembership probabilities\nare estimated poorly by all methods; we therefore only use the 70% of comembership probabilities\nwith the smallest errors across all the methods. (b) (Left) VCMC error as a function of number of\nseconds of optimization. The cost of optimization is nonnegligible, but still moderate compared to\nserial MCMC, particularly since our optimization scheme only needs small batches of samples and\ncan therefore operate concurrently with the sampler. (Right) Error versus speedup relative to serial\nMCMC, for both CMC with Gaussian averaging (small markers) and VCMC (large markers).\n\nFigure 3b shows timing results for both models. For the probit regression, while the optimization\ncost is not negligible, it is significantly smaller than that of serial sampling, which takes over 6000\nseconds to produce 1000 effective samples.^3 Across most numbers of partitions, approximately 25\niterations (corresponding to less than 1500 seconds of wall clock time) suffice to give errors close\nto those at convergence. 
For the mixture, on the other hand, the computational cost of optimization\nis minimal compared to serial sampling. We can see this in the overall speedup of VCMC relative\nto serial MCMC: for sampling and optimization combined, low numbers of partitions (K ≤ 25)\nachieve speedups close to the ideal value of K, and large numbers (K = 50, 100) still achieve good\nspeedups of about K/2. The cost of the VCMC optimization step is thus moderate and, when the\nMCMC step is expensive, small enough to preserve the linear speedup of embarrassingly parallel\nsampling. Moreover, since the serial bottleneck is an optimization, we are optimistic that performance,\nboth in terms of number of iterations and wall clock time, can be significantly improved by\nusing techniques like data minibatching [9], adaptive step sizes [21], or asynchronous updates [20].\n\n7 Conclusion and future work\n\nThe flexibility of variational consensus Monte Carlo (VCMC) opens several avenues for further\nresearch. Following previous work on data-parallel MCMC, we used the subposterior factorization.\nOur variational framework can accommodate more general factorizations that might be more\nstatistically or computationally efficient – e.g. the factorization used by Broderick et al. [4]. We\nalso introduced structured sample aggregation, and analyzed some concrete instantiations. Complex\nlatent variable models would require more sophisticated aggregation functions – e.g. ones that account\nfor symmetries in the model [5] or lift the parameter to a higher dimensional space before\naggregating. Finally, recall that our algorithm – again following previous work – aggregates in a\nsample-by-sample manner, cf. (4). Other aggregation paradigms may be useful in building approximations\nto multimodal posteriors or in boosting the statistical efficiency of the overall sampler.\n\nAcknowledgments. We thank R.P. Adams, N. Altieri, T. Broderick, R. 
Giordano, M.J. Johnson, and S.L. Scott for helpful discussions. E.A. is supported by the Miller Institute for Basic Research in Science, University of California, Berkeley. M.R. is supported by a Hertz Foundation Fellowship, generously endowed by Google, and an NSF Graduate Research Fellowship. Support for this project was provided by Amazon and by ONR under the MURI program (N00014-11-1-0688).\n\n3 We ran the sampler for 5100 iterations, including 100 burn-in steps, and kept every fifth sample.\n\nReferences\n[1] A. U. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems 21, pages 81\u201388, 2008.\n[2] R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 2nd edition, 1990.\n[4] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems 26, pages 1727\u20131735, 2013.\n[5] T. Campbell and J. P. How. Approximate decentralized Bayesian inference. In 30th Conference on Uncertainty in Artificial Intelligence, 2014.\n[6] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.\n[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107\u2013113, Jan. 2008.\n[8] F. Doshi-Velez, D. A. Knowles, S. Mohamed, and Z. Ghahramani. Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems 22, pages 1294\u20131302, 2009.\n[9] J. C. Duchi, E. Hazan, and Y. Singer. 
Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n[10] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC, 2013.\n[11] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303\u20131347, May 2013.\n[12] M. Johnson, J. Saunderson, and A. Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In Advances in Neural Information Processing Systems 26, pages 2715\u20132723, 2013.\n[13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315\u2013323, 2013.\n[14] A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n[15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83\u201397, 1955.\n[16] D. Maclaurin and R. P. Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of 30th Conference on Uncertainty in Artificial Intelligence, 2014.\n[17] S. Mandt and D. M. Blei. Smoothed gradients for stochastic variational inference. In Advances in Neural Information Processing Systems 27, pages 2438\u20132446, 2014.\n[18] W. Neiswanger, C. Wang, and E. Xing. Asymptotically exact, embarrassingly parallel MCMC. In 30th Conference on Uncertainty in Artificial Intelligence, 2014.\n[19] R. Nishihara, I. Murray, and R. P. Adams. Parallel MCMC with generalized elliptical slice sampling. Journal of Machine Learning Research, 15:2087\u20132112, 2014.\n[20] F. Niu, B. Recht, C. R\u00e9, and S. Wright. 
Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pages 693\u2013701, 2011.\n[21] R. Ranganath, C. Wang, D. M. Blei, and E. P. Xing. An adaptive learning rate for stochastic variational inference. In Proceedings of the 30th International Conference on Machine Learning, pages 298\u2013306, 2013.\n[22] S. L. Scott, A. W. Blocker, and F. V. Bonassi. Bayes and big data: The consensus Monte Carlo algorithm. In Bayes 250, 2013.\n[23] H. Strathmann, D. Sejdinovic, and M. Girolami. Unbiased Bayes for big data: Paths of partial posteriors. arXiv:1501.03326, 2015.\n[24] X. Wang and D. B. Dunson. Parallel MCMC via Weierstrass sampler. arXiv:1312.4605, 2013.\n[25] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, 2011.\n", "award": [], "sourceid": 742, "authors": [{"given_name": "Maxim", "family_name": "Rabinovich", "institution": "UC Berkeley"}, {"given_name": "Elaine", "family_name": "Angelino", "institution": "Harvard"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}