{"title": "Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process", "book": "Advances in Neural Information Processing Systems", "page_first": 1294, "page_last": 1302, "abstract": "Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, Bayesian inference methods often require high-dimensional averages and can be slow to compute, especially with the potentially unbounded representations associated with nonparametric models. We address the challenge of scaling nonparametric Bayesian inference to the increasingly large datasets found in real-world applications, focusing on the case of parallelising inference in the Indian Buffet Process (IBP). Our approach divides a large data set between multiple processors. The processors use message passing to compute likelihoods in an asynchronous, distributed fashion and to propagate statistics about the global Bayesian posterior. This novel MCMC sampler is the first parallel inference scheme for IBP-based models, scaling to datasets orders of magnitude larger than had previously been possible.", "full_text": "Large Scale Nonparametric Bayesian Inference:\nData Parallelisation in the Indian Buffet Process\n\nFinale Doshi-Velez\u2217\n\nUniversity of Cambridge\nCambridge, CB21PZ, UK\nfinale@alum.mit.edu\n\nDavid Knowles\u2217\n\nUniversity of Cambridge\nCambridge, CB21PZ, UK\n\ndak33@cam.ac.uk\n\nShakir Mohamed\u2217\n\nUniversity of Cambridge\nCambridge, CB21PZ, UK\n\nsm694@cam.ac.uk\n\nZoubin Ghahramani\nUniversity of Cambridge\nCambridge, CB21PZ, UK\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nNonparametric Bayesian models provide a framework for \ufb02exible probabilistic\nmodelling of complex datasets. Unfortunately, the high-dimensional averages re-\nquired for Bayesian methods can be slow, especially with the unbounded repre-\nsentations used by nonparametric models. 
We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

1 Introduction

From information retrieval to recommender systems, from bioinformatics to financial market analysis, the amount of data available to researchers has exploded in recent years. While large, these datasets are often still sparse: for example, a biologist may have expression levels from thousands of genes from only a few people. A ratings database may contain millions of users and thousands of movies, but each user may have only rated a few movies. In such settings, Bayesian methods provide a robust approach to drawing inferences and making predictions from sparse information. At the heart of Bayesian methods is the idea that all unknown quantities should be averaged over when making predictions. Computing these high-dimensional averages is thus a key challenge in scaling Bayesian inference to large datasets, especially for nonparametric models.

Advances in multicore and distributed computing provide one answer to this challenge: if each processor can consider only a small part of the data, then inference in these large datasets might become more tractable.
However, such data parallelisation of inference is nontrivial\u2014while simple models\nmight only require pooling a small number of suf\ufb01cient statistics [1], inference in more complex\nmodels might require the frequent communication of complex, high-dimensional probability distri-\nbutions between processors. Building on work on approximate asynchronous multicore inference\nfor topic models [2], we develop a message passing framework for data-parallel Bayesian inference\napplicable to a variety of models, including matrix factorization and the Indian Buffet Process (IBP).\n\n\u2217 Authors contributed equally.\n\n1\n\n\fNonparametric models are attractive for large datasets because they automatically adapt to the com-\nplexity of the data, relieving the researcher from the need to specify aspects of the model such as the\nnumber of latent factors. Much recent work in nonparametric Bayesian modelling has focused on\nthe Chinese restaurant process (CRP), which is a discrete distribution that can be used to assign data\npoints to an unbounded number of clusters. However, many real-world datasets have observations\nthat may belong to multiple clusters\u2014for example, a gene may have multiple functions; an image\nmay contain multiple objects. The IBP [3] is a distribution over in\ufb01nite sparse binary matrices that\nallows data points to be represented by an unbounded number of sparse latent features or factors.\nWhile the parallelisation method we present in this paper is applicable to a broad set of models, we\nfocus on inference for the IBP because of its unique challenges and potential.\n\nMany serial procedures have been developed for inference in the IBP, including variants of Gibbs\nsampling [3, 4], which may be augmented with Metropolis split-merge proposals [5], slice sam-\npling [6], particle \ufb01ltering [7], and variational inference [8]. 
With the exception of the accelerated Gibbs sampler of [4], these methods have been applied only to datasets with fewer than 1,000 observations.

To achieve efficient parallelisation, we exploit an idea recently introduced in [4], which maintains a distribution over parameters while sampling. Coupled with a message passing scheme over processors, this idea enables the computations for inference to be distributed over many processors with little loss in accuracy. We demonstrate our approach on a problem with 100,000 observations. The largest application of IBP inference to date, our work opens the use of the IBP and similar models to a variety of data-intensive applications.

2 Latent Feature Model

The IBP can be used to define models in which each observation is associated with a set of latent factors or features. A binary feature-assignment matrix Z represents which observations possess which hidden features, where Znk = 1 if observation n has feature k and Znk = 0 otherwise. For example, the observations might be images and the hidden features could be possible objects in those images. Importantly, the IBP allows the set of such possible hidden features to be unbounded.

To generate a sample from the IBP, we first imagine that the rows of Z (the observations) are customers and the columns of Z (the features) are dishes in an infinite buffet. The first customer takes the first Poisson(α) dishes. Each following customer n tries each previously sampled dish k with probability mk/n, where mk is the number of customers who tried dish k before customer n, and also takes Poisson(α/n) new dishes. The value Znk records whether customer n tried dish k. This generative process allows an unbounded set of features but guarantees that a finite dataset will contain a finite number of features with probability one.
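The buffet metaphor above translates directly into code. The following is a minimal sketch of a draw from the IBP prior using numpy; the function name and interface are illustrative, not part of the paper's implementation:

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Draw a feature-assignment matrix Z from the IBP prior.

    Customer n takes each previously sampled dish k with probability
    m_k / n (m_k = number of earlier customers who took dish k), then
    takes Poisson(alpha / n) new dishes.  Returns an
    (n_customers x K) binary matrix, where K is the random number of
    features generated.
    """
    rng = np.random.default_rng(rng)
    counts = []   # m_k: how many customers have taken dish k so far
    rows = []     # dishes taken by each customer, in order of creation
    for n in range(1, n_customers + 1):
        taken = [rng.random() < m / n for m in counts]  # old dishes
        n_new = rng.poisson(alpha / n)                  # new dishes
        counts = [m + t for m, t in zip(counts, taken)] + [1] * n_new
        rows.append(taken + [True] * n_new)
    K = len(counts)
    Z = np.zeros((n_customers, K), dtype=int)
    for n, row in enumerate(rows):
        Z[n, :len(row)] = row   # later dishes stay 0 for customer n
    return Z

Z = sample_ibp(100, alpha=3.0, rng=0)
```

Because each new dish is taken by the customer who creates it, every column of the returned Z contains at least one 1, and the expected number of features grows as α · H_N = O(α log N), matching the growth rate quoted later for message sizes.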
The process is also exchangeable: the order in which customers visit the buffet has no impact on the distribution of Z. Finally, if the effect of possessing a feature is independent of the feature index, the model is also exchangeable in the columns of Z.

We associate with the feature-assignment matrix Z a feature matrix A whose rows parameterise the effect that possessing each feature has on the data. Given these matrices, we write the probability of the data as P(X|Z, A). Our work requires that P(A|X, Z) can be computed or approximated efficiently by an exponential family distribution. Specifically, we apply our techniques to both a fully-conjugate linear-Gaussian model and a non-conjugate Bernoulli model.

Linear Gaussian Model. We model an N × D real-valued data matrix X as a product

X = ZA + ε,    (1)

where Z is the binary feature-assignment matrix and A is a K × D real-valued matrix with an independent Gaussian prior N(0, σ²_a) on each element (see cartoon in Figure 1(a)). Each element of the N × D noise matrix ε is independent with a N(0, σ²_x) distribution. Given Z and X, the posterior on the features A is Gaussian, with mean and covariance

μ_A = (Z^T Z + (σ²_x / σ²_a) I)^{-1} Z^T X,    Σ_A = σ²_x (Z^T Z + (σ²_x / σ²_a) I)^{-1}.    (2)

Bernoulli Model.
We use a leaky, noisy-or likelihood for each element of an N × D binary matrix X:

P(Xnd = 1 | Z, A) = 1 − ε λ^(Σ_k Znk Akd).    (3)

Figure 1: Diagrammatic representation of the model structure and the message passing process. (a) Representation of the linear-Gaussian model: the data X is generated from the product of the feature-assignment matrix Z and the feature matrix A, plus noise ε. In the Bernoulli model, the product ZA adjusts the probability of X = 1. (b) Message passing process: processors send sufficient statistics of the likelihood up to the root, which calculates and sends the (exact) posterior back to the processors.

Each element of the A matrix is binary with independent Bernoulli(pA) priors. The parameters ε and λ determine how "leaky" and how "noisy" the or-function is, respectively. Typical hyperparameter values are ε = 0.95 and λ = 0.2. The posterior P(A|X, Z) cannot be computed in closed form; however, a mean-field variational posterior in which we approximate P(A|X, Z) as a product of independent Bernoulli variables, Π_{k,d} q_kd(a_kd), can be readily derived.

3 Parallel Inference

We describe both synchronous and asynchronous procedures for approximate, parallel inference in the IBP that combine MCMC with message passing. We first partition the data among the processors, using X^p to denote the subset of observations X assigned to processor p. We use Z^p to denote the latent features associated with the data on processor p.
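Concretely, for the linear-Gaussian model the exact feature posterior that message passing must reproduce is the Gaussian of equation (2). A minimal numpy sketch, with an illustrative function name:

```python
import numpy as np

def feature_posterior(X, Z, sigma_x, sigma_a):
    """Posterior mean and covariance of A for the linear-Gaussian model:

        mu_A    = (Z'Z + (sigma_x^2 / sigma_a^2) I)^{-1} Z'X
        Sigma_A = sigma_x^2 (Z'Z + (sigma_x^2 / sigma_a^2) I)^{-1}
    """
    K = Z.shape[1]
    M = Z.T @ Z + (sigma_x**2 / sigma_a**2) * np.eye(K)
    M_inv = np.linalg.inv(M)
    mu_A = M_inv @ Z.T @ X          # K x D posterior mean
    Sigma_A = sigma_x**2 * M_inv    # K x K posterior covariance
    return mu_A, Sigma_A
```

Note that the posterior depends on the data only through Z^T Z and Z^T X, which is exactly why per-processor sufficient statistics can be summed in the message passing scheme described next.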
In [4], the distribution P(A|X−n, Z−n) was used to derive an accelerated sampler for Zn, where n indexes the nth observation and −n denotes the set of all observations except n. In our parallel inference approach, each processor p maintains a distribution P^p(A|X−n, Z−n), a local approximation to P(A|X−n, Z−n). The distributions P^p are updated via message passing between the processors. The inference alternates between three steps:

• Message passing: processors communicate to compute the exact P(A|X, Z).
• Gibbs sampling: processors sample a new set of Z^p's in parallel.
• Hyperparameter sampling: a root processor resamples the global hyperparameters.

The sampler is approximate because during Gibbs sampling all processors resample elements of Z at the same time; their posteriors P^p(A|X, Z) are then no longer the true P(A|X, Z).

Message Passing. We use Bayes' rule to factorise the posterior over features P(A|Z, X):

P(A|Z, X) ∝ P(A) Π_p P(X^p|Z^p, A).    (4)

If the prior P(A) and the likelihoods P(X^p|Z^p, A) are conjugate exponential family models, then the sufficient statistics of P(A|Z, X) are the sum of the sufficient statistics of each term on the right-hand side of equation (4). For example, the sufficient statistics in the linear-Gaussian model are means and covariances; in the Bernoulli model, they are counts of how often each element Akd equals one. The linear-Gaussian messages have size O(K² + KD), and the Bernoulli messages O(KD), where K is the number of features. For nonparametric models such as the IBP, the number of features K grows as O(log N). This slow growth means that messages remain small, even for large datasets. The most straightforward way to compute the full posterior is to arrange the processors in a tree architecture, as belief propagation is then exact.
The message s from processor p to processor q is:

s_{p→q} = l_p + Σ_{r∈N(p)\q} s_{r→p},

where N(p)\q denotes the processors attached to p other than q, and l_p are the sufficient statistics from processor p. A dummy neighbour containing the statistics of the prior is connected to an (arbitrarily designated) root processor. Also passed are the feature counts m^p_k = Σ_{n∈X^p} Znk, the popularity of feature k within processor p. (See Figure 1(b) for a cartoon.)

Gibbs Sampling. In general, Znk can be Gibbs-sampled using Bayes' rule:

P(Znk|Z−nk, X) ∝ P(Znk|Z−nk) P(X|Z).

The probability P(Znk|Z−nk) depends on the size of the dataset N and the number of observations mk using feature k. At the beginning of the Gibbs sampling stage, each processor has the correct values of mk. We compute m^{−p}_k = mk − m^p_k and, as the processor's internal feature counts m^p_k are updated, approximate mk ≈ m^{−p}_k + m^p_k. This approximation assumes that m^{−p}_k stays fixed during the current stage (a good approximation for popular features).

The collapsed likelihood P(X|Z), integrating out the feature values A, is given by

P(X|Z) ∝ ∫_A P(Xn|Zn, A) P(A|Z−n, X−n) dA,

where the partial posterior P(A|Z−n, X−n) ∝ P(A|Z, X) / P(Xn|Zn, A). In conjugate models, P(A|Z−n, X−n) can be efficiently computed by subtracting observation n's contribution to the sufficient statistics.¹ For non-conjugate models, we can use an exponential family distribution Q(A) to approximate P(A|X, Z) during message passing. A draw A ∼ Q^{−p}(A) is then used to initialise an uncollapsed Gibbs sampler, and the output samples of A are used to compute sufficient statistics for the likelihood P(X|Z).
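For the conjugate linear-Gaussian case, a single collapsed Gibbs update can be sketched as below. This is a naive version that recomputes the collapsed marginal likelihood (Griffiths and Ghahramani's formula) from scratch for both settings of Znk; the accelerated sampler of [4] avoids this cost via cached statistics. Function names, the `m_minus` argument (the processor's approximate count m^{−p}_k plus its local count excluding n), and the small log guard are illustrative:

```python
import numpy as np

def log_collapsed_lik(X, Z, sx, sa):
    """log P(X | Z) for the linear-Gaussian model, integrating out A."""
    N, D = X.shape
    K = Z.shape[1]
    M = Z.T @ Z + (sx**2 / sa**2) * np.eye(K)
    H = np.eye(N) - Z @ np.linalg.inv(M) @ Z.T
    _, logdet = np.linalg.slogdet(M)
    return (-0.5 * N * D * np.log(2 * np.pi)
            - (N - K) * D * np.log(sx) - K * D * np.log(sa)
            - 0.5 * D * logdet
            - np.trace(X.T @ H @ X) / (2 * sx**2))

def gibbs_step_znk(X, Z, n, k, m_minus, N_total, sx, sa, rng):
    """Resample one entry Z[n, k] in place.

    m_minus: count of observations other than n using feature k
    (in the parallel sampler, the approximate global count).
    """
    logp = np.empty(2)
    for v in (0, 1):
        Z[n, k] = v
        prior = m_minus / N_total if v == 1 else 1 - m_minus / N_total
        logp[v] = np.log(prior + 1e-12) + log_collapsed_lik(X, Z, sx, sa)
    p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))  # P(Z[n,k] = 1 | rest)
    Z[n, k] = int(rng.random() < p1)
    return Z[n, k]
```

In the parallel sampler each processor runs such updates over its own block X^p while treating the out-of-processor counts m^{−p}_k as fixed, which is the source of the approximation discussed above.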
In both cases, new features are added as described in [3].

Hyperparameter Resampling. The IBP concentration parameter α and the hyperparameters of the likelihood can also be sampled during inference. Resampling α depends only on the total number of active features; thus it can easily be resampled at the root and propagated to the other processors. In the linear-Gaussian model, the posteriors on the noise and feature variances (starting from gamma priors) depend on various squared errors, which can also be computed in a distributed fashion.

For more general, non-conjugate models, resampling the hyperparameters requires two steps. In the first step, a hyperparameter value is proposed by the root and propagated to the processors. The processors each compute the likelihood of the current and proposed hyperparameter values and propagate these values back to the root. The root evaluates a Metropolis step for the hyperparameters and propagates the decision back to the leaves. This two-step approach introduces a latency in the resampling but does not require any additional message-passing rounds.

Asynchronous Operation. So far we have discussed message passing, Gibbs sampling, and hyperparameter resampling as if they occur in separate phases. In practice, these phases may occur asynchronously: between its Gibbs sweeps, each processor updates its feature posterior based on the most recent messages it has received and sends likelihood messages to its parent. Likewise, the root continuously resamples hyperparameters and propagates the values down through the tree. While this adds another layer of approximation, the asynchronous form of message passing allows faster processors to share information and perform more inference on their data instead of waiting for slower processors.

Implementation Note. When performing parallel inference in the IBP, a few factors need to be considered with care.
Other parallel inference schemes for nonparametric models, such as that for the HDP [2], have simply matched features by their index, that is, they assumed that the ith feature on processor p was also the ith feature on processor q. In the IBP, we find that this indiscriminate feature merging is often disastrous when adding or deleting features: if none of the observations on a particular processor are using a feature, we cannot simply delete that column of Z^p and shift the other features over, as doing so destroys the alignment of features across processors.

¹In the IBP, only the linear-Gaussian model exhibits this conjugate structure. However, many other matrix factorization models (such as PCA) often have this conjugate form.

4 Comparison to Exact Metropolis

Because all Z^p's are sampled at once, the posteriors P^p(A|X, Z) used by each processor in Section 3 are no longer exact. Below we show how Metropolis–Hastings (MH) steps can make the parallel sampler exact, but introduce significant computational overheads both in computing the transition probabilities and in the message passing. We argue that trying to do exact inference is a poor use of computational resources (especially as any finite chain will not be exact); empirically, the approximate sampler behaves similarly to the MH sampler while finding higher-likelihood regions in the data.

Exact Parallel Metropolis Sampler. Ideally, we would simply add an MH accept/reject step after each stage of the approximate inference to make the sampler exact. Unfortunately, the approximate sampler makes several non-independent random choices in each stage of the inference, making the reverse proposal inconvenient to compute. We circumvent this issue by fixing the random seed, making the initial stage of the approximate sampler a deterministic function, and then adding independent random noise to create a proposal distribution.
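Once the noise distribution is fixed (each entry of the proposal independently keeps the deterministic Gibbs output with high probability and flips otherwise), the transition probability factorises over entries of the matrix. A minimal sketch, where p_keep = 0.99 is an assumed value and the function name is illustrative:

```python
import numpy as np

def log_proposal_prob(Z_hat, Z_prime, p_keep=0.99):
    """log Q(Z -> Z') for an entrywise noisy proposal: each entry of
    Z' independently equals the deterministic Gibbs output Z_hat with
    probability p_keep, and is flipped with probability 1 - p_keep."""
    agree = (Z_hat == Z_prime)
    return (agree.sum() * np.log(p_keep)
            + (~agree).sum() * np.log(1 - p_keep))
```

Because the same fixed-seed Gibbs map is applied to Z' to obtain the reverse proposal, the backward probability has exactly the same entrywise form, so both terms of the MH ratio reduce to counting agreements and disagreements.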
This approach makes both the forward and reverse transition probabilities simple to compute.

Formally, let Ẑ^p be the matrix output after a set of Gibbs sweeps on Z^p. We use all the Ẑ^p's to propose a new matrix Z′. The acceptance probability of the proposal is

min(1, [P(X|Z′) P(Z′) Q(Z′ → Z)] / [P(X|Z) P(Z) Q(Z → Z′)]),    (5)

where the terms P(X|Z) and P(Z) are readily computed in a distributed fashion. For the transition distribution Q, we note that if we fix the random seed r, the matrix Ẑ^p produced by the Gibbs sweeps in the processor is a deterministic function of the input matrix Z^p. The proposal Z^p′ is a (stochastic) noisy representation of Ẑ^p in which, for example,

P(Z^p′_nk = 1) = 0.99 if Ẑ^p_nk = 1,    P(Z^p′_nk = 1) = 0.01 if Ẑ^p_nk = 0,    (6)

for k ≤ K, where K should be at least the number of features in Ẑ^p; we set Z^p′_nk = 0 for k > K. (See the cartoon in Figure 2.)

To compute the backward probability, we take Z^p′ and apply the same number of Gibbs sampling sweeps with the same random seed r. The resulting Ẑ^p′ is a deterministic function of Z^p′, and the backward probability Q(Z^p′ → Z^p) is the probability of going from Ẑ^p′ to Z^p under (6). While the transition probabilities can be computed in a distributed, asynchronous fashion, all of the processors must synchronise when deciding whether to accept the proposal.

Experimental Comparison. To compare the exact Metropolis and approximate inference techniques, we ran each inference type on 1000 block images of [3] on 5 simulated processors. Each test was repeated 25 times. For each of the 25 tests, we created a held-out dataset by setting elements of the last 100 images as missing values.
For the first 50 test images, we set all even-numbered dimensions as the missing elements, and for the last 50 images, all odd-numbered dimensions. Each sampler was run for 10,000 iterations with 5 Gibbs sweeps per iteration; statistics were collected from the second half of the chain. To keep the probability of an acceptance reasonable, we allowed each processor to change only a small part of its Z^p during each sweep: the feature assignments Zn for 1, 5, or 10 data points.

In Table 1, we see that the approximate sampler runs about five times faster than the exact samplers while achieving comparable (or better) predictive likelihoods and reconstruction errors on held-out data. Both the acceptance rates and the predictive likelihoods fall as the exact sampler tries to take larger steps, suggesting that the difference between the approximate and exact samplers' performance on predictive likelihood is due to poor mixing by the exact sampler. Figure 3 shows empirical CDFs for the number of features k, the IBP concentration parameter α, the noise variance σ²_n, and the feature variance σ²_a.
The approximate sampler (black) produces similar CDFs to the various exact Metropolis samplers (gray) for the variances; the concentration parameter is smaller, but the feature counts are similar to the single-processor case.

Figure 2: Cartoon of the MH proposal: Gibbs sweeps with a fixed seed map Z^p deterministically to Ẑ^p, to which random noise is added to give the proposal Z^p′.

Table 1: Evaluation of exact and approximate methods.

Method        Time (s)   Test L2 Error   Test Log Likelihood   MH Accept Proportion
MH, n = 1     717        0.0468          0.1098                0.1106
MH, n = 5     1075       0.0488          0.0893                0.0121
MH, n = 10    1486       0.0555          0.0196                0.0062
Approximate   179        0.0487          0.1292                -

Figure 3: Empirical CDFs for (a) the active feature count k, (b) the IBP concentration α, (c) the noise variance σ²_x, and (d) the feature variance σ²_a. The solid black line is the approximate sampler; the three solid gray lines are the MH samplers with n equal to 1, 5, and 10 (lighter shades indicate larger n). The approximate sampler and the MH samplers for smaller n have similar CDFs; the n = 10 MH sampler's differing CDF indicates that it did not mix in 7500 iterations (reasonable since its acceptance rate was 0.0062).

5 Analysis of Mixing Properties

We ran a series of experiments on 10,000 36-dimensional block images of [3] to study the effects of various sampler configurations on the running time, performance, and mixing properties of the sampler. 5000 elements of the data matrix were held out as test data. Figure 4 shows test log-likelihoods using 1, 7, 31 and 127 parallel processors simulated in software, using 1000 outer iterations with 5 inner Gibbs iterations each. The parallel samplers have test likelihoods similar to the serial algorithm, with significant savings in running time. The characteristic shape of the test likelihood, similar across all testing regimes, indicates how the features are learned: initially, a large number of features are added, which improves the test likelihood; a refinement phase, in which excess features are pruned, provides further improvements.

Figure 4 shows hairiness-index plots for each of the test cases after thinning and burn-in.
The hairiness index, based on the CUSUM method for monitoring MCMC convergence [9, 10], monitors how often the derivatives of sampler statistics (in our case, the number of features, the test likelihood, and α) change in sign; infrequent changes in sign indicate that the sampler may not have mixed. The outer bounds on the plots are the 95% confidence bounds. The index stays within the bounds, suggesting that the chains are mixing.

Figure 4: Change in test log-likelihood for various numbers of processors (1, 7, 31, and 127) over the simulation time, with 5 inner and 1000 outer iterations. The corresponding hairiness index plots are shown on the left.

Finally, we considered the trade-off between mixing and running time as the numbers of outer iterations and inner Gibbs iterations were varied. Each combination of inner and outer iterations was set so that the total number of Gibbs sweeps through the data was 5000. Mixing efficiency was measured via the effective number of samples per sample [10], which evaluates what fraction of the samples are independent (ideally, we would want all samples to be independent, but MCMC produces dependent chains). Running time for Gibbs sampling was taken to be the time required by the slowest processor (since all processors must synchronise before message passing); the total time reflected the Gibbs time and the message-passing time. As seen in Figure 5, completing fewer inner Gibbs iterations per outer iteration results in faster mixing, which is sensible as the processors are communicating about their data more often.

Figure 5: Effects of changing the number of inner iterations on (a) the effective sample size per outer iteration and (b) the total running time (Gibbs and message passing), for inner/outer combinations ranging from i = 50, o = 100 to i = 1, o = 5000, plotted against the number of processors.

Table 2: Test log-likelihoods on real-world datasets for the serial, synchronous and asynchronous inference types.

Dataset         N        D      Description                                       Serial p=1   Synch p=16   Async p=16
AR Faces [11]   2600     1598   faces with lighting, accessories (real-valued)    -4.74        -4.77        -4.84
Piano [12]      57931    161    STDFT of a piano recording (real-valued)          -1.435       -1.182       -1.228
Flickr [13]     100000   1000   indicators of image tags (binary-valued)          —            -0.0584
However, having fewer inner iterations requires more\nfrequent message passing; as the number of processors becomes large, the cost of message passing\nbecomes a limiting factor.2\n\n6 Real-world Experiments\n\nWe tested our parallel scheme on three real world datasets on a 16 node cluster using the Matlab\nDistributed Computing Engine, using 3 inner Gibbs iterations per outer iteration. The \ufb01rst dataset\nwas a set of 2,600 frontal face images with 1,598 dimensions [11]. While not extremely large,\nthe high-dimensionality of the dataset makes it challenging for other inference approaches. The\npiano dataset [12] consisted of 57,931 samples from a 161-dimensional short-time discrete Fourier\ntransform of a piano piece. Finally, the binary-valued Flickr dataset [13] indicated whether each\nof 1000 popular keywords occurred in the tags of 100,000 images from Flickr. Performance was\nmeasured using test likelihoods and running time. Test likelihoods look only at held-out data and\nthus they allow us to \u2018honestly\u2019 evaluate the model\u2019s \ufb01t. Table 2 summarises the data and shows that\nall approaches had similar test-likelihood performance.\n\nIn the faces and music datasets, the Gibbs time per iteration improved almost linearly as the number\nof processors increased (\ufb01gure 6). For example, we observed a 14x-speedup for p = 16 in the music\ndataset. Meanwhile, the message passing time remained small even with 16 processors\u20147% of the\nGibbs time for the faces data and 0.1% of the Gibbs time for the music data. However, waiting for\nsynchronisation became a signi\ufb01cant factor in the synchronous sampler. Figure 6(c) compares the\ntimes for running inference serially, synchronously and asynchronously with 16 processors. The\n\n2We believe part of the timing results may be an artifact, as the simulation overestimates the message passing\n\ntime. 
In the actual parallel system (Section 6), the cost of message passing was negligible.

Figure 6: Timing analysis for the parallel samplers: bar charts comparing sampling and waiting times per outer iteration for 1, 2, 4, 8 and 16 processors on (a) the faces dataset and (b) the music dataset, and (c) a timing comparison of the serial (p = 1), synchronous (p = 16) and asynchronous (p = 16) approaches.

asynchronous inference is 1.64 times faster than the synchronous case, reducing the computational time from 11.8s per iteration to 7.2s.

7 Discussion and Conclusion

As datasets grow, parallelisation is an increasingly attractive and important feature for doing inference. Not only does it allow multiple processors and multicore technologies to be leveraged for large-scale analyses, but it also reduces the amount of data and associated structures that each processor needs to keep in memory. Existing work has focused both on general techniques to efficiently split variables across processors in undirected graphical models [14] and factor graphs [15], and on specific models such as LDA [16, 17].
Our work falls in between: we leverage properties of a specific kind of parallelisation, namely data parallelisation, for a fairly broad class of models.

Specifically, we describe a parallel inference procedure that allows nonparametric Bayesian models based on the Indian Buffet Process to be applied to large datasets. The IBP poses specific challenges for data parallelisation in that the dimensionality of the representation changes during inference and may be unbounded. Our contribution is an algorithm for data parallelisation that leverages a compact representation of the feature posterior; this representation approximately decorrelates the data stored on each processor, thus limiting the communication bandwidth required between processors. While we focused on the IBP, the ideas presented here are applicable to more general problems in unsupervised learning, including bilinear models such as PCA, NMF, and ICA.

Our sampler is approximate, and we show that in conjugate models it behaves similarly to an exact sampler, but with much less computational overhead. However, as seen in the Bernoulli case, variational message passing for non-conjugate data does not always produce good results if the approximating distribution is a poor match for the true feature posterior. Determining when variational message passing is successful is an interesting question for future work. Other interesting directions include approaches for dynamically optimising the network topology (for example, slower processors could be moved lower in the tree). Finally, we note that a middle ground between the synchronous and asynchronous operations we presented might be a system that gives each processor a certain amount of time, instead of a certain number of iterations, to do Gibbs sweeps. Further study along these avenues should lead to even more efficient data-parallel Bayesian inference techniques.

References

[1] C. Chu, S. Kim, Y. Lin, Y. Yu, G.
Bradski, A. Ng, and K. Olukotun, "Map-reduce for machine learning on multicore," in Advances in Neural Information Processing Systems, p. 281, MIT Press, 2007.

[2] A. Asuncion, P. Smyth, and M. Welling, "Asynchronous distributed learning of topic models," in Advances in Neural Information Processing Systems 21, 2008.

[3] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," in Advances in Neural Information Processing Systems, vol. 16, 2006.

[4] F. Doshi-Velez and Z. Ghahramani, "Accelerated inference for the Indian buffet process," in International Conference on Machine Learning, 2009.

[5] E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis, "Modeling dyadic data with binary latent factors," in Advances in Neural Information Processing Systems, vol. 19, pp. 977–984, 2007.

[6] Y. W. Teh, D. Görür, and Z. Ghahramani, "Stick-breaking construction for the Indian buffet process," in Proceedings of the Intl. Conf. on Artificial Intelligence and Statistics, vol. 11, pp. 556–563, 2007.

[7] F. Wood and T. L. Griffiths, "Particle filtering for nonparametric Bayesian matrix factorization," in Advances in Neural Information Processing Systems, vol. 19, pp. 1513–1520, 2007.

[8] F. Doshi-Velez, K. T. Miller, J. Van Gael, and Y. W. Teh, "Variational inference for the Indian buffet process," in Proceedings of the Intl. Conf. on Artificial Intelligence and Statistics, vol. 12, pp. 137–144, 2009.

[9] S. P. Brooks and G. O. Roberts, "Convergence assessment techniques for Markov chain Monte Carlo," Statistics and Computing, vol. 8, pp. 319–335, 1998.

[10] C. R. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, second ed., 2004.

[11] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Trans. Pattern Anal.
Mach. Intelligence, vol. 23, pp. 228–233, 2001.

[12] G. E. Poliner and D. P. W. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP J. Appl. Signal Process., vol. 2007, no. 1, pp. 154–154, 2007.

[13] T. Kollar and N. Roy, "Utilizing object-object and object-scene context when planning to find things," in International Conference on Robotics and Automation, 2009.

[14] J. Gonzalez, Y. Low, and C. Guestrin, "Residual splash for optimally parallelizing belief propagation," in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (D. van Dyk and M. Welling, eds.), vol. 5, pp. 177–184, JMLR, 2009.

[15] D. Stern, R. Herbrich, and T. Graepel, "Matchbox: Large scale online Bayesian recommendations," in 18th International World Wide Web Conference (WWW2009), April 2009.

[16] R. Nallapati, W. Cohen, and J. Lafferty, "Parallelized variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability," in ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, (Washington, DC, USA), pp. 349–354, IEEE Computer Society, 2007.

[17] D. Newman, A. Asuncion, P. Smyth, and M. Welling, "Distributed inference for Latent Dirichlet Allocation," in Advances in Neural Information Processing Systems 20 (J. Platt, D. Koller, Y. Singer, and S. Roweis, eds.), pp. 1081–1088, Cambridge, MA: MIT Press, 2008.