{"title": "Parallelizing MCMC with Random Partition Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 459, "abstract": "The modern scale of data has brought new challenges to Bayesian inference. In particular, conventional MCMC algorithms are computationally very expensive for large data sets. A promising approach to solve this problem is embarrassingly parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset. The subset posterior draws are then aggregated via some combining rules to obtain the final approximation. Existing EP-MCMC algorithms are limited by approximation accuracy and difficulty in resampling. In this article, we propose a new EP-MCMC algorithm PART that solves these problems. The new algorithm applies random partition trees to combine the subset posterior draws, which is distribution-free, easy to resample from and can adapt to multiple scales. We provide theoretical justification and extensive experiments illustrating empirical performance.", "full_text": "Parallelizing MCMC with Random Partition Trees\n\nDept. of Statistical Science\n\nDept. of Computer Science\n\nXiangyu Wang\n\nDuke University\n\nxw56@stat.duke.edu\n\nguo@cs.duke.edu\n\nkheller@stat.duke.edu\n\nFangjian Guo\n\nDuke University\n\nKatherine A. Heller\n\nDept. of Statistical Science\n\nDuke University\n\nDavid B. Dunson\n\nDept. of Statistical Science\n\nDuke University\n\ndunson@stat.duke.edu\n\nAbstract\n\nThe modern scale of data has brought new challenges to Bayesian inference. In\nparticular, conventional MCMC algorithms are computationally very expensive\nfor large data sets. A promising approach to solve this problem is embarrassingly\nparallel MCMC (EP-MCMC), which \ufb01rst partitions the data into multiple subsets\nand runs independent sampling algorithms on each subset. 
The subset posterior\ndraws are then aggregated via some combining rules to obtain the final approximation.\nExisting EP-MCMC algorithms are limited by approximation accuracy and\ndifficulty in resampling. In this article, we propose a new EP-MCMC algorithm\nPART that solves these problems. The new algorithm applies random partition\ntrees to combine the subset posterior draws, which is distribution-free, easy to\nresample from and can adapt to multiple scales. We provide theoretical justification\nand extensive experiments illustrating empirical performance.\n\n1 Introduction\n\nBayesian methods are popular for their success in analyzing complex data sets. However, for large\ndata sets, Markov Chain Monte Carlo (MCMC) algorithms, widely used in Bayesian inference, can\nsuffer from huge computational expense. With large data, there is increasing time per iteration,\nincreasing time to convergence, and difficulties with processing the full data on a single machine due\nto memory limits. To ameliorate these concerns, various methods such as stochastic gradient Monte\nCarlo [1] and sub-sampling based Monte Carlo [2] have been proposed. Among directions that have\nbeen explored, embarrassingly parallel MCMC (EP-MCMC) seems most promising. EP-MCMC\nalgorithms typically divide the data into multiple subsets and run independent MCMC chains\nsimultaneously on each subset. The posterior draws are then aggregated according to some rules to\nproduce the final approximation. This approach is clearly more efficient as now each chain involves\na much smaller data set and the sampling is communication-free. The key to a successful EP-MCMC\nalgorithm lies in the speed and accuracy of the combining rule.\nExisting EP-MCMC algorithms can be roughly divided into three categories. The first relies on\nasymptotic normality of posterior distributions. 
[3] propose a \u201cConsensus Monte Carlo\u201d algorithm,\nwhich produces the final approximation by a weighted average over all subset draws. This approach is\neffective when the posterior distributions are close to Gaussian, but can suffer from large bias when\nskewness or multiple modes are present. The second category relies on calculating an appropriate\nvariant of a mean or median of the subset posterior measures [4, 5]. These approaches rely on\nasymptotics (size of data increasing to infinity) to justify accuracy, and lack guarantees in finite\nsamples. The third category relies on the product density equation (PDE) in (1). Assuming X is the\nobserved data and \u03b8 is the parameter of interest, when the observations are iid conditioned on \u03b8, for\nany partition of X = X^(1) \u222a X^(2) \u222a \u00b7\u00b7\u00b7 \u222a X^(m), the following identity holds,\n\np(\u03b8|X) \u221d \u03c0(\u03b8) p(X|\u03b8) \u221d p(\u03b8|X^(1)) p(\u03b8|X^(2)) \u00b7\u00b7\u00b7 p(\u03b8|X^(m)),    (1)\n\nif the priors on the full data and subsets satisfy \u03c0(\u03b8) = \u220f_{i=1}^{m} \u03c0_i(\u03b8). [6] propose using kernel density\nestimation on each subset posterior and then combining via (1). They use an independent Metropolis\nsampler to resample from the combined density. [7] apply the Weierstrass transform directly to (1)\nand develop two sampling algorithms based on the transformed density. These methods guarantee\nthat the approximating density converges to the true posterior density as the number of posterior draws\nincreases. However, as both are kernel-based, the two methods are limited by two major drawbacks.\nThe first is the inefficiency of resampling. Kernel density estimators are essentially mixture\ndistributions. 
Assuming we have collected 10,000 posterior samples on each machine, then multiplying\njust two densities already yields a mixture distribution containing 10^8 components, each of which\nis associated with a different weight. The resampling requires the independent Metropolis sampler\nto search over an exponential number of mixture components, and it is likely to get stuck at one\n\u201cgood\u201d component, resulting in high rejection rates and slow mixing. The second is the sensitivity\nto bandwidth choice, with one bandwidth applied to the whole space.\nIn this article, we propose a novel EP-MCMC algorithm termed \u201cparallel aggregation random trees\u201d\n(PART), which solves the above two problems. The algorithm inhibits the explosion of mixture\ncomponents so that the aggregated density is easy to resample. In addition, the density estimator is\nable to adapt to multiple scales and thus achieve better approximation accuracy. In Section 2, we\nmotivate the new methodology and present the algorithm. In Section 3, we present error bounds and\nprove consistency of PART in the number of posterior draws. Experimental results are presented in\nSection 4. Proofs and part of the numerical results are provided in the supplementary materials.\n\n2 Method\n\nRecall the PDE identity (1) in the introduction. When data set X is partitioned into m subsets\nX = X^(1) \u222a \u00b7\u00b7\u00b7 \u222a X^(m), the posterior distribution of the ith subset can be written as\n\nf^(i)(\u03b8) \u221d \u03c0(\u03b8)^{1/m} p(X^(i)|\u03b8),    (2)\n\nwhere \u03c0(\u03b8) is the prior assigned to the full data set. Assuming observations are iid given \u03b8, the\nrelationship between the full data posterior and subset posteriors is captured by\n\np(\u03b8|X) \u221d \u03c0(\u03b8) \u220f_{i=1}^{m} p(X^(i)|\u03b8) \u221d \u220f_{i=1}^{m} f^(i)(\u03b8).    (3)\n\nDue to the flaws of applying kernel-based density estimation to (3) mentioned above, we propose\nto use random partition trees or multi-scale histograms. Let F_K be the collection of all R^p-partitions\nformed by K disjoint rectangular blocks, where a rectangular block takes the form of\nA_k := (l_{k,1}, r_{k,1}] \u00d7 (l_{k,2}, r_{k,2}] \u00d7 \u00b7\u00b7\u00b7 \u00d7 (l_{k,p}, r_{k,p}] \u2286 R^p for some l_{k,q} < r_{k,q}. A K-block histogram\nis then defined as\n\n\u02c6f^(i)(\u03b8) = \u2211_{k=1}^{K} [n_k^(i) / (N|A_k|)] 1(\u03b8 \u2208 A_k),    (4)\n\nwhere {A_k : k = 1, 2, \u00b7\u00b7\u00b7, K} \u2208 F_K are the blocks and N, n_k^(i) are the total number of posterior\nsamples on the ith subset and of those inside the block A_k respectively (assuming the same N across\nsubsets). We use | \u00b7 | to denote the area of a block. Assuming each subset posterior is approximated\nby a K-block histogram, if the partition {A_k} is restricted to be the same across all subsets, then the\naggregated density after applying (3) is still a K-block histogram (illustrated in the supplement),\n\n\u02c6p(\u03b8|X) = (1/Z) \u220f_{i=1}^{m} \u02c6f^(i)(\u03b8) = (1/Z) \u2211_{k=1}^{K} ( \u220f_{i=1}^{m} n_k^(i)/|A_k| ) 1(\u03b8 \u2208 A_k) = \u2211_{k=1}^{K} w_k g_k(\u03b8),    (5)\n\nwhere Z = \u2211_{k=1}^{K} ( \u220f_{i=1}^{m} n_k^(i) ) / |A_k|^{m\u22121} is the normalizing constant, w_k\u2019s are the updated weights,\nand g_k(\u03b8) = unif(\u03b8; A_k) is the block-wise distribution. Common histogram blocks across subsets\ncontrol the number of mixture components, leading to simple aggregation and resampling\nprocedures. Our PART algorithm consists of space partitioning followed by density aggregation, with\naggregation simply multiplying densities across subsets for each block and then normalizing.\n\n2.1 Space Partitioning\n\nTo find good partitions, our algorithm recursively bisects (not necessarily evenly) a previous block\nalong a randomly selected dimension, subject to certain rules. Such partitioning is multi-scale and\nrelated to wavelets [8]. Assume we are currently splitting the block A along the dimension q and\ndenote the posterior samples in A by {\u03b8_j^(i)}_{j\u2208A} for the ith subset. The cut point on dimension q is\ndetermined by a partition rule \u03c6({\u03b8_{j,q}^(1)}, {\u03b8_{j,q}^(2)}, \u00b7\u00b7\u00b7, {\u03b8_{j,q}^(m)}). The resulting two blocks are subject\nto further bisecting under the same procedure until one of the following stopping criteria is met:\n(i) n_k/N < \u03b4_\u03c1 or (ii) the area of the block |A_k| becomes smaller than \u03b4_{|A|}. The algorithm returns a\ntree with K leaves, each corresponding to a block A_k. Details are provided in Algorithm 1.\n\nAlgorithm 1 Partition tree algorithm\nprocedure BUILDTREE({\u03b8_j^(1)}, {\u03b8_j^(2)}, \u00b7\u00b7\u00b7, {\u03b8_j^(m)}, \u03c6(\u00b7), \u03b4_\u03c1, \u03b4_a, N, L, R)\n    D \u2190 {1, 2, \u00b7\u00b7\u00b7, p}\n    while D not empty do\n        Draw q uniformly at random from D.    \u25b7 Randomly choose the dimension to cut\n        \u03b8*_q \u2190 \u03c6({\u03b8_{j,q}^(1)}, {\u03b8_{j,q}^(2)}, \u00b7\u00b7\u00b7, {\u03b8_{j,q}^(m)}),  T.n^(i) \u2190 Cardinality of {\u03b8_j^(i)} for all i\n        if \u03b8*_q \u2212 L_q > \u03b4_a, R_q \u2212 \u03b8*_q > \u03b4_a and min(\u2211_j 1(\u03b8_{j,q}^(i) \u2264 \u03b8*_q), \u2211_j 1(\u03b8_{j,q}^(i) > \u03b8*_q)) > N\u03b4_\u03c1 for all i then\n            L\u2032 \u2190 L, L\u2032_q \u2190 \u03b8*_q, R\u2032 \u2190 R, R\u2032_q \u2190 \u03b8*_q    \u25b7 Update left and right boundaries\n            T.L \u2190 BUILDTREE({\u03b8_j^(1) : \u03b8_{j,q}^(1) \u2264 \u03b8*_q}, \u00b7\u00b7\u00b7, {\u03b8_j^(m) : \u03b8_{j,q}^(m) \u2264 \u03b8*_q}, \u00b7\u00b7\u00b7, N, L, R\u2032)\n            T.R \u2190 BUILDTREE({\u03b8_j^(1) : \u03b8_{j,q}^(1) > \u03b8*_q}, \u00b7\u00b7\u00b7, {\u03b8_j^(m) : \u03b8_{j,q}^(m) > \u03b8*_q}, \u00b7\u00b7\u00b7, N, L\u2032, R)\n            return T\n        else\n            D \u2190 D \\ {q}    \u25b7 Try cutting at another dimension\n        end if\n    end while\n    T.L \u2190 NULL, T.R \u2190 NULL, return T    \u25b7 Leaf node\nend procedure\n\nIn Algorithm 1, \u03b4_{|A|} is replaced by the minimum edge length of a block \u03b4_a (possibly different across\ndimensions). Quantities L, R \u2208 R^p are the left and right boundaries of the samples respectively, which\ntake the sample minimum/maximum when the support is unbounded. 
We consider two choices for\nthe partition rule \u03c6(\u00b7): maximum (empirical) likelihood partition (ML) and median/KD-tree\npartition (KD).\n\nMaximum Likelihood Partition (ML) ML-partition searches for partitions by greedily maximizing\nthe empirical log likelihood at each iteration. For m = 1 we have\n\n\u03b8* = \u03c6_ML({\u03b8_{j,q}, j = 1, \u00b7\u00b7\u00b7, n}) = arg max_{n_1+n_2=n, A_1\u222aA_2=A} (n_1/(n|A_1|))^{n_1} (n_2/(n|A_2|))^{n_2},    (6)\n\nwhere n_1 and n_2 are counts of posterior samples in A_1 and A_2, respectively. The solution to (6) falls\ninside the set {\u03b8_j}. Thus, a simple linear search after sorting samples suffices (by book-keeping the\nordering, sorting the whole block once is enough for the entire procedure). For m > 1, we have\n\n\u03c6_{q,ML}(\u00b7) = arg max_{\u03b8* \u2208 \u222a_{i=1}^{m} {\u03b8_j^(i)}} \u220f_{i=1}^{m} (n_1^(i)/(n^(i)|A_1|))^{n_1^(i)} (n_2^(i)/(n^(i)|A_2|))^{n_2^(i)},    (7)\n\nsimilarly solved by a linear search. This is dominated by sorting and takes O(n log n) time.\n\nMedian/KD-Tree Partition (KD) Median/KD-tree partition cuts at the empirical median of\nposterior samples. When there are multiple subsets, the median is taken over pooled samples to force\n{A_k} to be the same across subsets. Searching for the median takes O(n) time [9], which is faster than\nML-partition, especially when the number of posterior draws is large. The same partitioning strategy\nis adopted by KD-trees [10].\n\n2.2 Density Aggregation\n\nGiven a common partition, Algorithm 2 aggregates all subsets in one stage. However, assuming a\nsingle \u201cgood\u201d partition for all subsets is overly restrictive when m is large. Hence, we also consider\npairwise aggregation [6, 7], which recursively groups subsets into pairs, combines each pair with\nAlgorithm 2, and repeats until one final set is obtained. 
Run time of PART is dominated by space\npartitioning (BUILDTREE), with normalization and resampling very fast.\n\nAlgorithm 2 Density aggregation algorithm (drawing N\u2032 samples from the aggregated posterior)\nprocedure ONESTAGEAGGREGATE({\u03b8_j^(1)}, {\u03b8_j^(2)}, \u00b7\u00b7\u00b7, {\u03b8_j^(m)}, \u03c6(\u00b7), \u03b4_\u03c1, \u03b4_a, N, N\u2032, L, R)\n    T \u2190 BUILDTREE({\u03b8_j^(1)}, {\u03b8_j^(2)}, \u00b7\u00b7\u00b7, {\u03b8_j^(m)}, \u03c6(\u00b7), \u03b4_\u03c1, \u03b4_a, N, L, R), Z \u2190 0\n    ({A_k}, {n_k^(i)}) \u2190 TRAVERSELEAF(T)\n    for k = 1, 2, \u00b7\u00b7\u00b7, K do\n        \u02dcw_k \u2190 \u220f_{i=1}^{m} n_k^(i) / |A_k|^{m\u22121}, Z \u2190 Z + \u02dcw_k    \u25b7 Multiply inside each block\n    end for\n    w_k \u2190 \u02dcw_k/Z for all k    \u25b7 Normalize\n    for t = 1, 2, \u00b7\u00b7\u00b7, N\u2032 do\n        Draw k with weights {w_k} and then draw \u03b8_t \u223c g_k(\u03b8)\n    end for\n    return {\u03b8_1, \u03b8_2, \u00b7\u00b7\u00b7, \u03b8_{N\u2032}}\nend procedure\n\n2.3 Variance Reduction and Smoothing\n\nRandom Tree Ensemble Inspired by random forests [11, 12], the full posterior is estimated by\naveraging T independent trees output by Algorithm 1. Smoothing and averaging can reduce\nvariance and yield better approximation accuracy. The trees can be built in parallel and resampling in\nAlgorithm 2 only additionally requires picking a tree uniformly at random.\nLocal Gaussian Smoothing As another approach to increase smoothness, the blockwise uniform\ndistribution in (5) can be replaced by a Gaussian distribution g_k = N(\u03b8; \u00b5_k, \u03a3_k), with mean and\ncovariance estimated \u201clocally\u201d by samples within the block. A multiplied Gaussian approximation\nis used: \u03a3_k = (\u2211_{i=1}^{m} (\u02c6\u03a3_k^(i))^{\u22121})^{\u22121}, \u00b5_k = \u03a3_k (\u2211_{i=1}^{m} (\u02c6\u03a3_k^(i))^{\u22121} \u02c6\u00b5_k^(i)), where \u02c6\u03a3_k^(i) and \u02c6\u00b5_k^(i) are estimated\nwith the ith subset. We apply both random tree ensembles and local Gaussian smoothing in all\napplications of PART in this article unless explicitly stated otherwise.\n\n3 Theory\n\nIn this section, we provide consistency theory (in the number of posterior samples) for histograms\nand the aggregated density. We do not consider the variance reduction and smoothing modifications\nin these developments for simplicity in exposition, but extensions are possible. Section 3.1\nprovides error bounds on ML and KD-tree partitioning-based histogram density estimators constructed\nfrom N independent samples from a single joint posterior; modified bounds can be obtained for\nMCMC samples incorporating the mixing rate, but will not be considered here. Section 3.2 then\nprovides corresponding error bounds for our PART-aggregated density estimators in the one-stage\nand pairwise cases. Detailed proofs are provided in the supplementary materials.\nLet f(\u03b8) be a p-dimensional posterior density function. Assume f is supported on a measurable\nset \u2126 \u2286 R^p. Since one can always transform \u2126 to a bounded region by scaling, we simply assume\n\u2126 = [0, 1]^p as in [8, 13] without loss of generality. We also assume that f \u2208 C^1(\u2126).\n\n3.1 Space partitioning\n\nMaximum likelihood partition (ML) For a given K, ML partition solves the following problem:\n\n\u02c6f_ML = arg max (1/N) \u2211_{k=1}^{K} n_k log(n_k/(N|A_k|)),  s.t. n_k/N \u2265 c_0\u03c1, |A_k| \u2265 \u03c1/D,    (8)\n\nfor some c_0 and \u03c1, where D = \u2016f\u2016_\u221e < \u221e. We have the following result.\nTheorem 1. Choose \u03c1 = 1/K^{1+1/(2p)}. 
For any \u03b4 > 0, if the sample size satisfies N >\n2(1 \u2212 c_0)^{\u22122} K^{1+1/(2p)} log(2K/\u03b4), then with probability at least 1 \u2212 \u03b4, the optimal solution to (8)\nsatisfies\n\nD_KL(f \u2016 \u02c6f_ML) \u2264 (C_1 + 2 log K) K^{\u22121/(2p)} + C_2 max{log D, 2 log K} \u221a( (K/N) log(3eN/K) log(8/\u03b4) ),\n\nwhere C_1 = log D + 4pLD with L = \u2016f\u2032\u2016_\u221e and C_2 = 48\u221a(p + 1).\nWhen multiple densities f^(1)(\u03b8), \u00b7\u00b7\u00b7, f^(m)(\u03b8) are present, our goal of imposing the same partition\non all functions requires solving a different problem,\n\n(\u02c6f_ML^(i))_{i=1}^{m} = arg max \u2211_{i=1}^{m} (1/N_i) \u2211_{k=1}^{K} n_k^(i) log(n_k^(i)/(N_i|A_k|)),  s.t. n_k^(i)/N_i \u2265 c_0\u03c1, |A_k| \u2265 \u03c1/D,    (9)\n\nwhere N_i is the number of posterior samples for function f^(i). A result similar to Theorem 1 for (9)\nis provided in the supplementary materials.\n\nMedian partition/KD-tree (KD) The KD-tree \u02c6f_KD cuts at the empirical median for different\ndimensions. We have the following result.\n\nTheorem 2. For any \u03b5 > 0, define r_\u03b5 = log_2(1 + 1/(2 + 3L/\u03b5)). For any \u03b4 > 0, if N >\n32e^2 (log K)^2 K log(2K/\u03b4), then with probability at least 1 \u2212 \u03b4, we have\n\n\u2016\u02c6f_KD \u2212 f_KD\u2016_1 \u2264 \u03b5 + pLK^{\u2212r_\u03b5/p} + 4e log K \u221a( (2K/N) log(2K/\u03b4) ).\n\nIf f(\u03b8) is further lower bounded by some constant b_0 > 0, we can then obtain an upper bound on\nthe KL-divergence. Define r_{b_0} = log_2(1 + 1/(2 + 3L/b_0)) and we have\n\nD_KL(f \u2016 \u02c6f_KD) \u2264 (pLD/b_0) K^{\u2212r_{b_0}/p} + 8e log K \u221a( (2K/N) log(2K/\u03b4) ).\n\nWhen there are multiple functions and the median partition is performed on pooled data, the partition\nmight not happen at the empirical median on each subset. However, as long as the partition quantiles\nare upper and lower bounded by \u03b1 and 1 \u2212 \u03b1 for some \u03b1 \u2208 [1/2, 1), we can establish results similar\nto Theorem 2. The result is provided in the supplementary materials.\n\n3.2 Posterior aggregation\n\nThe previous section provides estimation error bounds on individual posterior densities, through\nwhich we can bound the distance between the true posterior conditional on the full data set and the\naggregated density via (3). Assume we have m density functions {f^(i), i = 1, 2, \u00b7\u00b7\u00b7, m} and intend\nto approximate their aggregated density f_I = \u220f_{i\u2208I} f^(i) / \u222b \u220f_{i\u2208I} f^(i), where I = {1, 2, \u00b7\u00b7\u00b7, m}.\nNotice that for any I\u2032 \u2286 I, f_{I\u2032} = p(\u03b8 | \u222a_{i\u2208I\u2032} X^(i)). Let D = max_{I\u2032\u2286I} \u2016f_{I\u2032}\u2016_\u221e, i.e., D is an upper\nbound on all posterior densities formed by a subset of X. Also define Z_{I\u2032} = \u222b \u220f_{i\u2208I\u2032} f^(i). These\nquantities depend only on the model and the observed data (not posterior samples). We denote \u02c6f_ML\nand \u02c6f_KD by \u02c6f as the following results apply similarly to both methods.\nThe \u201cone-stage\u201d aggregation (Algorithm 2) first obtains an approximation for each f^(i) (via either\nML-partition or KD-partition) and then computes \u02c6f_I = \u220f_{i\u2208I} \u02c6f^(i) / \u222b \u220f_{i\u2208I} \u02c6f^(i).\n\nTheorem 3 (One-stage aggregation). Denote the average total variation distance between f^(i) and\n\u02c6f^(i) by \u03b5. Assume the conditions in Theorems 1 and 2 and for ML-partition\n\n\u221aN \u2265 32 c_0^{\u22121} \u221a(2(p + 1)) K^{3/2 + 1/(2p)} \u221a( log(3eN/K) log(8/\u03b4) )\n\nand for KD-partition\n\nN > 128e^2 K (log K)^2 log(K/\u03b4).\n\nThen with high probability the total variation distance between f_I and \u02c6f_I is bounded by\n\u2016f_I \u2212 \u02c6f_I\u2016_1 \u2264 (2/Z_I) m (2D)^{m\u22121} \u03b5, where Z_I is a constant that does not depend on the posterior samples.\n\nThe approximation error of Algorithm 2 increases dramatically with the number of subsets. To\nameliorate this, we introduced the pairwise aggregation strategy in Section 2, for which we have the\nfollowing result.\nTheorem 4 (Pairwise aggregation). Denote the average total variation distance between f^(i) and\n\u02c6f^(i) by \u03b5. Assume the conditions in Theorem 3. Then with high probability the total variation\ndistance between f_I and \u02c6f_I is bounded by \u2016f_I \u2212 \u02c6f_I\u2016_1 \u2264 (4C_0 D)^{log_2 m + 1} \u03b5, where\nC_0 = max_{I\u2033\u2282I\u2032\u2286I} Z_{I\u2033} Z_{I\u2032\\I\u2033} / Z_{I\u2032} is a constant that does not depend on posterior samples.\n\n4 Experiments\n\nIn this section, we evaluate the empirical performance of PART^1 and compare the two algorithms\nPART-KD and PART-ML to the following posterior aggregation algorithms.\n\n1. Simple averaging (average): each aggregated sample is an arithmetic average of m samples\ncoming from the m subsets.\n\n2. Weighted averaging (weighted): also called Consensus Monte Carlo algorithm [3],\nwhere each aggregated sample is a weighted average of m samples. The weights are\noptimally chosen for a Gaussian posterior.\n\n3. Weierstrass rejection sampler (Weierstrass): subset posterior samples are passed through\na rejection sampler based on the Weierstrass transform to produce the aggregated\nsamples [7]. 
We use its R package^2 for experiments.\n\n4. Parametric density product (parametric): aggregated samples are drawn from a multivariate\nGaussian, which is a product of Laplace approximations to subset posteriors [6].\n\n5. Nonparametric density product (nonparametric): aggregated posterior is approximated\nby a product of kernel density estimates of subset posteriors [6]. Samples are drawn with\nan independent Metropolis sampler.\n\n6. Semiparametric density product (semiparametric): similar to the nonparametric, but\nwith subset posteriors estimated semiparametrically [6, 14].\n\nAll experiments except the two toy examples use adaptive MCMC [15, 16]^3 for posterior sampling.\nFor PART-KD/ML, one-stage aggregation (Algorithm 2) is used only for the toy examples (results\nfrom pairwise aggregation are provided in the supplement). For other experiments, pairwise\naggregation is used, which draws 50,000 samples for intermediate stages and halves \u03b4_\u03c1 after each stage\nto refine the resolution (the value of \u03b4_\u03c1 listed below is for the final stage). The random ensemble of\nPART consists of 40 trees.\n\n4.1 Two Toy Examples\n\nThe two toy examples highlight the performance of our methods in terms of (i) recovering multiple\nmodes and (ii) correctly locating posterior mass when subset posteriors are heterogeneous. The\nPART-KD/PART-ML results are obtained from Algorithm 2 without local Gaussian smoothing.\n\n1 MATLAB implementation available from https://github.com/richardkwo/random-tree-parallel-MCMC\n2 https://github.com/wwrechard/weierstrass\n3 http://helios.fmi.fi/~lainema/mcmc/\n\nBimodal Example Figure 1 shows an example consisting of m = 10 subsets. 
Each subset\nconsists of 10,000 samples drawn from a mixture of two univariate normals\n0.27 N(\u00b5_{i,1}, \u03c3_{i,1}^2) + 0.73 N(\u00b5_{i,2}, \u03c3_{i,2}^2), with the means and standard deviations slightly different across subsets, given by\n\u00b5_{i,1} = \u22125 + \u03f5_{i,1}, \u00b5_{i,2} = 5 + \u03f5_{i,2} and \u03c3_{i,1} = 1 + |\u03b4_{i,1}|, \u03c3_{i,2} = 4 + |\u03b4_{i,2}|, where \u03f5_{i,l} \u223c N(0, 0.5),\n\u03b4_{i,l} \u223c N(0, 0.1) independently for i = 1, \u00b7\u00b7\u00b7, 10 and l = 1, 2. The resulting true combined\nposterior (red solid) consists of two modes with different scales. In Figure 1, the left panel shows the\nsubset posteriors (dashed) and the true posterior; the right panel compares the results with various\nmethods to the truth. A few are omitted in the graph: average and weighted average overlap with\nparametric, and Weierstrass overlaps with PART-KD/PART-ML.\n\nFigure 1: Bimodal posterior combined from 10 subsets. Left: the true posterior and subset posteriors\n(dashed). Right: aggregated posterior output by various methods compared to the truth. Results are\nbased on 10,000 aggregated samples.\n\nRare Bernoulli Example We consider N = 10,000 Bernoulli trials x_i \u223c Ber(\u03b8), iid, split into m =\n15 subsets. The parameter \u03b8 is chosen to be 2m/N so that on average each subset only contains\n2 successes. By random partitioning, the subset posteriors are rather heterogeneous as plotted in\ndashed lines in the left panel of Figure 2. The prior is set as \u03c0(\u03b8) = Beta(\u03b8; 2, 2). The right\npanel of Figure 2 compares the results of various methods. PART-KD, PART-ML and Weierstrass\ncapture the true posterior shape, while parametric, average and weighted average are all biased. The\nnonparametric and semiparametric methods produce flat densities near zero (not visible in Figure 2\ndue to the scale).\n\nFigure 2: The posterior for the probability \u03b8 of a rare event. 
Left: the full posterior (solid) and\nm = 15 subset posteriors (dashed). Right: aggregated posterior output by various methods. All\nresults are based on 20,000 aggregated samples.\n\n4.2 Bayesian Logistic Regression\nSynthetic dataset The dataset {(x_i, y_i)}_{i=1}^{N} consists of N = 50,000 observations in p = 50\ndimensions. All features x_i \u2208 R^{p\u22121} are drawn from N_{p\u22121}(0, \u03a3) with p = 50 and \u03a3_{k,l} = 0.9^{|k\u2212l|}.\nThe model intercept is set to \u22123 and the other coefficients \u03b8*_j are drawn randomly from N(0, 5^2).\nConditional on x_i, y_i \u2208 {0, 1} follows p(y_i = 1) = 1/(1 + exp(\u2212\u03b8*^T [1, x_i])). The dataset is\nrandomly split into m = 40 subsets. For both full chain and subset chains, we run adaptive MCMC\nfor 200,000 iterations after 100,000 burn-in. Thinning by 4 results in T = 50,000 samples.\nThe samples from the full chain (denoted as {\u03b8_j}_{j=1}^{T}) are treated as the ground truth. To compare the\naccuracy of different methods, we resample T points {\u02c6\u03b8_j} from each aggregated posterior and then\ncompare them using the following metrics: (1) RMSE of posterior mean \u2016(1/(pT)) (\u2211_j \u02c6\u03b8_j \u2212 \u2211_j \u03b8_j)\u2016_2;\n(2) approximate KL divergence D_KL(p(\u03b8) \u2016 \u02c6p(\u03b8)) and D_KL(\u02c6p(\u03b8) \u2016 p(\u03b8)), where \u02c6p and p are both\napproximated by multivariate Gaussians; (3) the posterior concentration ratio, defined as\nr = \u221a( \u2211_j \u2016\u02c6\u03b8_j \u2212 \u03b8*\u2016_2^2 / \u2211_j \u2016\u03b8_j \u2212 \u03b8*\u2016_2^2 ), which measures how the posterior spreads out around the true\nvalue (with r = 1 being ideal). The result is provided in Table 1. Figure 4 shows the D_KL(p \u2016 \u02c6p)\nversus the length of subset chains supplied to the aggregation algorithm. The results of PART are\nobtained with \u03b4_\u03c1 = 0.001, \u03b4_a = 0.0001 and 40 trees. Figure 3 showcases the aggregated posterior\nfor two parameters in terms of joint and marginal distributions.\n\nMethod | RMSE | D_KL(p \u2016 \u02c6p) | D_KL(\u02c6p \u2016 p) | r\nPART (KD) | 0.587 | 3.95 \u00d7 10^2 | 6.45 \u00d7 10^2 | 3.94\nPART (ML) | 1.399 | 8.05 \u00d7 10^1 | 5.47 \u00d7 10^2 | 9.17\naverage | 29.93 | 2.53 \u00d7 10^3 | 5.41 \u00d7 10^4 | 184.62\nweighted | 38.28 | 2.60 \u00d7 10^4 | 2.53 \u00d7 10^5 | 236.15\nWeierstrass | 6.47 | 7.20 \u00d7 10^2 | 2.62 \u00d7 10^3 | 39.96\nparametric | 10.07 | 2.46 \u00d7 10^3 | 6.12 \u00d7 10^3 | 62.13\nnonparametric | 25.59 | 3.40 \u00d7 10^4 | 3.95 \u00d7 10^4 | 157.86\nsemiparametric | 25.45 | 2.06 \u00d7 10^4 | 3.90 \u00d7 10^4 | 156.97\n\nTable 1: Accuracy of posterior aggregation on logistic regression.\n\nFigure 3: Posterior of \u03b8_1 and \u03b8_17.\n\nReal datasets We also run experiments on two real datasets: (1) the Covertype dataset^4 [17]\nconsists of 581,012 observations in 54 dimensions, and the task is to predict the type of forest cover\nwith cartographic measurements; (2) the MiniBooNE dataset^5 [18, 19] consists of 130,065\nobservations in 50 dimensions, whose task is to distinguish electron neutrinos from muon neutrinos with\nexperimental data. For both datasets, we reserve 1/5 of the data as the test set. The training set\nis randomly split into m = 50 and m = 25 subsets respectively for Covertype and MiniBooNE.\nFigure 5 shows the prediction accuracy versus total runtime (parallel subset MCMC + aggregation\ntime) for different methods. For each MCMC chain, the first 20% of iterations are discarded before\naggregation as burn-in. 
The aggregated chain is required to be of the same length as the subset chains.\nAs a reference, we also plot the result for the full chain and lasso [20] run on the full training set.\n\nFigure 4: Approximate KL divergence between the full chain and\nthe combined posterior versus the\nlength of subset chains.\n\nFigure 5: Prediction accuracy versus total runtime (running\nchain + aggregation) on Covertype and MiniBooNE datasets\n(semiparametric is not compared due to its long running time).\nPlots against the length of chain are provided in the supplement.\n\n5 Conclusion\n\nIn this article, we propose a new embarrassingly-parallel MCMC algorithm PART that can efficiently\ndraw posterior samples for large data sets. PART is simple to implement, efficient in subset\ncombining and has theoretical guarantees. Compared to existing EP-MCMC algorithms, PART has\nsubstantially improved performance. Possible future directions include (1) exploring other multi-scale\ndensity estimators which share similar properties as partition trees but with a better approximation\naccuracy (2) developing a tuning procedure for choosing good \u03b4_\u03c1 and \u03b4_a, which are essential to the\nperformance of PART.\n\n4 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html\n5 https://archive.ics.uci.edu/ml/machine-learning-databases/00199\n\nReferences\n[1] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In\n\nProceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.\n\n[2] Dougal Maclaurin and Ryan P Adams. Firefly Monte Carlo: Exact MCMC with subsets of\n\ndata. 
In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2014.\n\n[3] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16, 2013.\n\n[4] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.\n\n[5] Sanvesh Srivastava, Volkan Cevher, Quoc Tran-Dinh, and David B Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38, 2015.\n\n[6] Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 623\u2013632, Corvallis, Oregon, 2014. AUAI Press.\n\n[7] Xiangyu Wang and David B Dunson. Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.\n\n[8] Linxi Liu and Wing Hung Wong. Multivariate density estimation based on adaptive partitioning: Convergence rate, variable selection and spatial adaptation. arXiv preprint arXiv:1401.2597, 2014.\n\n[9] Manuel Blum, Robert W Floyd, Vaughan Pratt, Ronald L Rivest, and Robert E Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448\u2013461, 1973.\n\n[10] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509\u2013517, 1975.\n\n[11] Leo Breiman. Random forests. Machine Learning, 45(1):5\u201332, 2001.\n\n[12] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123\u2013140, 1996.\n\n[13] Xiaotong Shen and Wing Hung Wong. 
Convergence rate of sieve estimates. The Annals of Statistics, pages 580\u2013615, 1994.\n\n[14] Nils Lid Hjort and Ingrid K Glad. Nonparametric density estimation with a parametric start. The Annals of Statistics, pages 882\u2013904, 1995.\n\n[15] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman. DRAM: efficient adaptive MCMC. Statistics and Computing, 16(4):339\u2013354, 2006.\n\n[16] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, pages 223\u2013242, 2001.\n\n[17] Jock A Blackard and Denis J Dean. Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables. In Proc. Second Southern Forestry GIS Conf, pages 189\u2013199, 1998.\n\n[18] Byron P Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2):577\u2013584, 2005.\n\n[19] M. Lichman. UCI machine learning repository, 2013.\n\n[20] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267\u2013288, 1996.\n", "award": [], "sourceid": 340, "authors": [{"given_name": "Xiangyu", "family_name": "Wang", "institution": "Duke University"}, {"given_name": "Fangjian", "family_name": "Guo", "institution": "Duke University"}, {"given_name": "Katherine", "family_name": "Heller", "institution": "Duke University"}, {"given_name": "David", "family_name": "Dunson", "institution": "Duke University"}]}