{"title": "Slice sampling normalized kernel-weighted completely random measure mixture models", "book": "Advances in Neural Information Processing Systems", "page_first": 2240, "page_last": 2248, "abstract": "A number of dependent nonparametric processes have been proposed to model non-stationary data with unknown latent dimensionality. However, the inference algorithms are often slow and unwieldy, and are in general highly specific to a given model formulation. In this paper, we describe a wide class of nonparametric processes, including several existing models, and present a slice sampler that allows efficient inference across this class of models.", "full_text": "Slice sampling normalized kernel-weighted\ncompletely random measure mixture models\n\nNicholas J. Foti\n\nDepartment of Computer Science\n\nDartmouth College\nHanover, NH 03755\n\nnfoti@cs.dartmouth.edu\n\nSinead A. Williamson\n\nDepartment of Machine Learning\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nsinead@cs.cmu.edu\n\nAbstract\n\nA number of dependent nonparametric processes have been proposed to model\nnon-stationary data with unknown latent dimensionality. However, the inference\nalgorithms are often slow and unwieldy, and are in general highly speci\ufb01c to a\ngiven model formulation. In this paper, we describe a large class of dependent\nnonparametric processes, including several existing models, and present a slice\nsampler that allows ef\ufb01cient inference across this class of models.\n\n1\n\nIntroduction\n\nNonparametric mixture models allow us to bypass the issue of model selection, by modeling data\nusing a random number of mixture components that can grow if we observe more data. However,\nsuch models work on the assumption that data can be considered exchangeable. This assumption\noften does not hold in practice as distributions commonly vary with some covariate. 
For example,\nthe proportions of different species may vary across geographic regions, and the distribution over\ntopics discussed on Twitter is likely to evolve over time.\nRecently, there has been increasing interest in dependent nonparametric processes [1], that extend\nexisting nonparametric distributions to non-stationary data. While a nonparametric process is a dis-\ntribution over a single measure, a dependent nonparametric process is a distribution over a collection\nof measures, which may be associated with values in a covariate space. The key property of a de-\npendent nonparametric process is that the measure at each covariate value is marginally distributed\naccording to a known nonparametric process.\nA number of dependent nonparametric processes have been developed in the literature ([2] \u00a76).\nFor example, the single-p DDP [1] de\ufb01nes a collection of Dirichlet processes with common atom\nsizes but variable atom locations. The order-based DDP [3] constructs a collection of Dirichlet\nprocesses using a common set of beta random variables, but permuting the order in which they are\nused in a stick-breaking construction. The Spatial Normalized Gamma Process (SNGP) [4] de\ufb01nes\na gamma process on an augmented space, such that at each covariate location a subset of the atoms\nare available. This creates a dependent gamma process, that can be normalized to obtain a dependent\nDirichlet process. The kernel beta process (KBP) [5] de\ufb01nes a beta process on an augmented space,\nand at each covariate location modulates the atom sizes using a collection of kernels, to create a\ncollection of dependent beta processes.\nUnfortunately, while such models have a number of appealing properties, inference can be challeng-\ning. 
While there are many similarities between existing dependent nonparametric processes, most of\nthe inference schemes that have been proposed are highly speci\ufb01c, and cannot be generally applied\nwithout signi\ufb01cant modi\ufb01cation.\n\n1\n\n\fThe contributions of this paper are twofold. First, in Section 2 we describe a general class of de-\npendent nonparametric processes, based on de\ufb01ning completely random measures on an extended\nspace. This class of models includes the SNGP and the KBP as special cases. Second, we develop\na slice sampler that is applicable for all the dependent probability measures in this framework. We\ncompare our slice sampler to existing inference algorithms, and show that we are able to achieve\nsuperior performance over existing algorithms. Further, the generality of our algorithm mean we are\nable to easily modify the assumptions of existing models to better \ufb01t the data, without the need to\nsigni\ufb01cantly modify our sampler.\n\n2 Constructing dependent nonparametric models using kernels\n\nIn this section, we describe a general class of dependent completely random measures, that includes\nthe kernel beta process as a special case. We then describe the class of dependent normalized random\nmeasures obtained by normalizing these dependent completely random measures, and show that the\nSNGP lies in this framework.\n\n2.1 Kernel CRMs\n\nA completely random measure (CRM) [6, 7] is a distribution over discrete1 measures B on some\nmeasurable space \u2126 such that, for any disjoint subsets Ak \u2282 \u2126, the masses B(Ak) are independent.\nCommonly used examples of CRMs include the gamma process, the generalized gamma process, the\nbeta process, and the stable process. A CRM is uniquely categorized by a L\u00b4evy measure \u03bd(d\u03c9, d\u03c0)\non \u2126\u00d7 R+, which controls the location and size of the jumps. 
We can interpret a CRM as a Poisson process on Ω × R+ with mean measure ν(dω, dπ).
Let Ω = (X × Θ), and let Π = {(μ_k, θ_k, π_k)}_{k=1}^∞ be a Poisson process on the space X × Θ × R+ with associated product σ-algebra. The space has three components: X, a bounded space of covariates; Θ, a space of parameter values; and R+, the space of atom masses. Let the mean measure of Π be described by the positive Lévy measure ν(dμ, dθ, dπ). While the construction herein applies for any such Lévy measure, we focus on the class of Lévy measures that factorize as ν(dμ, dθ, dπ) = R0(dμ)H0(dθ)ν0(dπ). This corresponds to the class of homogeneous CRMs, where the size of an atom is independent of its location in Θ × X, and covers most CRMs encountered in the literature.
We assume that X is a discrete space with P unique values, μ*_p, in order to simplify the exposition, and without loss of generality we assume that R0(X) = 1. Additionally, let K(·,·) : X × X → [0, 1] be a bounded kernel function. Though any such kernel may be used, for concreteness we only consider a box kernel and square exponential kernel defined as

• Box kernel: K(x, μ) = 1(||x − μ|| < W), where we call W the width.
• Square exponential kernel: K(x, μ) = exp(−ψ ||x − μ||^2), for ||·|| a dissimilarity measure, and ψ > 0 a fixed constant.

Using the setup above we define a kernel-weighted CRM (KCRM) at a fixed covariate x ∈ X and for A measurable as

  B_x(A) = Σ_{m=1}^∞ K(x, μ_m) π_m δ_{θ_m}(A)    (1)

which is seen to be a CRM on Θ by the mapping theorem for Poisson processes [8]. For a fixed set of observations (x_1, . . . , x_G)^T we define B(A) = (B_{x_1}(A), . . .
, B_{x_G}(A))^T as the vector of measures of the KCRM at the observed covariates. CRMs are characterized by their characteristic function (CF) [9], which for the CRM B can be written as

  E[exp(−v^T B(A))] = exp( − ∫_{X × A × R+} (1 − exp(−v^T K_μ π)) ν(dμ, dθ, dπ) )    (2)

where v ∈ R^G and K_μ = (K(x_1, μ), . . . , K(x_G, μ))^T. Equation 2 is easily derived from the general form of the CF of a Poisson process [8] and by noting that the one-dimensional CFs are exactly those of the individual B_{x_i}(A). See [5] for a discussion of the dependence structure between B_x and B_{x'} for x, x' ∈ X.

1 With, possibly, a deterministic continuous component.

Taking ν0 to be the Lévy measure of a beta process [10] results in the KBP. Alternatively, taking ν0 as the Lévy measure of a gamma process, ν_GaP [11], and K(·,·) as the box kernel, we recover the unnormalized form of the SNGP.

2.2 Kernel NRMs

A distribution over probability measures can be obtained by starting from a CRM, and normalizing the resulting random measure. Such distributions are often referred to as normalized random measures (NRM) [12]. The most commonly used example of an NRM is the Dirichlet process, which can be obtained as a normalized gamma process [11]. Other CRMs yield NRMs with different properties – for example a normalized generalized gamma process can have heavier tails than a Dirichlet process [13].
We can define a class of dependent NRMs in a similar manner, starting from the KCRM defined
Since each marginal measure Bx of B is a CRM, we can normalize it by its total mass,\nBx(\u0398), to produce a NRM\n\nPx(A) = Bx(A)/Bx(\u0398) =\n\n\u03b4\u03b8m(A)\n\n(3)\n\nThis formulation of a kernel NRM (KNRM) is similar to that in [14] for Ornstein-Uhlenbeck NRMs\n(OUNRM). While the OUNRM framework allows for arbitrary CRMs, in theory, extending it to ar-\nbitrary kernel functions is non-trivial. A fundamental difference between OUNRMs and normalized\nKCRMs is that the marginals of an OUNRM follow a speci\ufb01ed process, whereas the marginals of a\nKCRM may be different than the underlying CRM.\nA common use in statistics and machine learning for NRMs is as prior distributions for mixture\nmodels with an unbounded number of components [15]. Analogously, covariate-dependent NRMs\ncan be used as priors for mixture models where the probability of being associated with a mixture\ncomponent varies with the covariate [4, 14]. For concreteness, we limit ourselves to a kernel gamma\nprocess (KGaP) which we denote as B \u223c KGaP(K, R0, H0, \u03bdGaP), although the slice sampler can\nbe adapted to any normalized KCRM.\nSpeci\ufb01cally, we observe data {(xj, yj)}N\nj=1 where xj \u2208 X denotes the covariate of observation j\nand yj \u2208 Rd denotes the quantities we wish to model. Let x\u2217\ng denote the gth unique covariate value\namong all the xj which induces a partition on the observations so that observation j belongs to group\ng if xj = x\u2217\nEach observation is associated with a mixture component which we denote as sg,i which is drawn\naccording to a normalized KGaP on a parameter space \u0398, such that (\u03b8, \u03c6) \u2208 \u0398, where \u03b8 is a mean\nand \u03c6 a precision. Conditional on sg,i, each observation is then drawn from some density q(\u00b7|\u03b8, \u03c6)\nwhich we assume to be N(\u03b8, \u03c6\u22121). The full model can then be speci\ufb01ed as\n\ng. 
We denote the ith observation corresponding to x*_g as y_{g,i}. The full model can then be specified as

  P_g(A) | B ~ B_g(A) / B_g(Θ)
  s_{g,i} | P_g ~ Σ_{m=1}^∞ [ K(x*_g, μ_m)π_m / Σ_{l=1}^∞ K(x*_g, μ_l)π_l ] δ_m    (4)
  (θ*_m, φ*_m) ~ H0(dθ, dφ)
  y_{g,i} | s_{g,i}, {(θ*, φ*)} ~ q(y_{g,i} | θ*_{s_{g,i}}, φ*_{s_{g,i}})

If K(·,·) is a box kernel, Eq. 4 describes a SNGP mixture model [4].

3 A slice sampler for dependent NRMs

The slice sampler of [16] allows us to perform inference in arbitrary NRMs. We extend this slice sampler to perform inference in the class of dependent NRMs described in Sec. 2.2. The slice sampler can be used with any underlying CRM, but for simplicity we concentrate on an underlying gamma process, as described in Eq. 4. In the supplement we also derive a Rao-Blackwellized estimator of the predictive density for unobserved data using the output from the slice sampler. We use this estimator to compute predictive densities in the experiments.
Analogously to [16] we introduce a set of auxiliary slice variables – one for each data point. Each data point can only belong to clusters corresponding to atoms larger than its slice variable. The set of slice variables thus defines a minimum atom size that need be represented, ensuring a finite number of instantiated atoms.
We extend this idea to the KNRM framework. Note that, in this case, an atom will exhibit different sizes at different covariate locations. We refer to these sizes as the kernelized atom sizes, K(x*_g, μ)π, obtained by applying a kernel K, evaluated at location x*_g, to the raw atom π. Following [16], we introduce a local slice variable u_{g,i}.
This allows us to write the joint distribution over the data points y_{g,i}, their cluster allocations s_{g,i} and their slice variables u_{g,i} as

  f(y, u, s | π, μ, θ, φ) = Π_{g=1}^G V_g^{n_g - 1} e^{-V_g B_{Tg}} Π_{i=1}^{n_g} 1(u_{g,i} < K(x*_g, μ_{s_{g,i}}) π_{s_{g,i}}) q(y_{g,i} | θ_{s_{g,i}}, φ_{s_{g,i}})    (5)

where B_{Tg} = B_{x*_g}(Θ) = Σ_{m=1}^∞ K(x*_g, μ_m)π_m and V_g ~ Ga(n_g, B_{Tg}) is an auxiliary variable2. See the supplement and [16, 17] for a complete derivation.
In order to evaluate Eq. 5, we need to evaluate B_{Tg}, the total mass of the unnormalized CRM at each covariate value. This involves summing over an infinite number of atoms – which we do not wish to represent. Define 0 < L = min{u_{s_{g,i}}}. This gives the smallest possible (kernelized) atom size to which data can be attached. Therefore, if we instantiate all atoms with raw size greater than L, we will include all atoms associated with occupied clusters. For any value of L, there will be a finite number M of atoms above this threshold. From these M raw atoms, we can obtain the kernelized atoms above the slice corresponding to a given data point.
We must obtain the remaining mass by marginalizing over all kernelized atoms that are below the slice (see the supplement). We can split this mass into (a) the mass due to atoms that are not instantiated (i.e. whose kernelized value is below the slice at all covariate locations) and (b) the mass due to currently instantiated atoms (i.e. atoms whose kernelized value is above the slice at at least one covariate location)3. As we show in the supplement, the first term, (a), corresponds to atoms (π, μ) where π < L, the mass of which can be written as

  Σ_{μ* ∈ X} R0(μ*) ∫_0^L (1 − exp(−V^T K_{μ*} π)) ν0(dπ)    (6)

where V = (V_1, . . . , V_G)^T.
This can be evaluated numerically for many CRMs including gamma and generalized gamma processes [16]. The second term, (b), consists of realized atoms {(π_k, μ_k)} such that K(x*_g, μ_k)π_k < L at covariate x*_g. We use a Monte Carlo estimate for (b) that we describe in the supplement. For box kernels term (b) vanishes, and we have found that even for the square exponential kernel ignoring this term yields good results.

3.1 Sampling equations

Having specified the joint distribution in terms of a finite measure with a random truncation point L, we can now describe a sampler that samples in turn from the conditional distributions for the auxiliary variables V_g, the gamma process parameter α = H0(Θ), the instantiated raw atom sizes π_m and corresponding locations in covariate space μ_m and in parameter space (θ_m, φ_m), and the slice variables u_{g,i}. We define some simplifying notation: K_μ = (K(x*_1, μ), . . . , K(x*_G, μ))^T; B_+ = (B_{+1}, . . . , B_{+G})^T and B_* = (B_{*1}, . . . , B_{*G})^T, where B_{+g} = Σ_{m=1}^M K(x*_g, μ_m)π_m and B_{*g} = Σ_{m=M+1}^∞ K(x*_g, μ_m)π_m, so that B_{Tg} = B_{+g} + B_{*g}; and n_{g,m} = |{s_{g,i} : s_{g,i} = m, i ∈ 1, . . . , n_g}|.

• Auxiliary variables V_g: The full conditional distribution for V_g is given by

  p(V_g | n_g, V_{-g}, B_+, B_*) ∝ V_g^{n_g - 1} exp(−V^T B_+) E[exp(−V^T B_*)], V_g > 0    (7)

which we sample using Metropolis-Hastings moves, as in [18].

2 We parametrize the gamma distribution so that X ~ Ga(a, b) has mean a/b and variance a/b^2.
3 If X were not bounded there would be a third term consisting of raw atoms > L that when kernelized fall below the slice everywhere.
These can be ignored by a judicious choice of the space X and the allowable kernel widths.

• Gamma process parameter α: The conditional distribution for α is given by

  p(α | K, V, μ, π) ∝ p(α) α^K e^{-α [ ∫_L^∞ ν0(dπ) + ∫_0^L ∫_X (1 − exp(−V^T K_μ π)) R0(dμ) ν0(dπ) ]}    (8)

If p(α) = Ga(a0, b0) then the posterior is also a gamma distribution with parameters

  a = a0 + K    (9)
  b = b0 + ∫_L^∞ ν0(dπ) + ∫_X ∫_0^L (1 − exp(−V^T K_μ π)) ν0(dπ) R0(dμ)    (10)

where the first integral in Eq. 10 can be evaluated for many processes of interest and the second integral can be evaluated as in Eq. 6.

• Raw atom sizes π_m: The posterior for atoms associated with occupied clusters is given by

  p(π_m | n_{g,m}, μ_m, V, B_+) ∝ π_m^{Σ_{g=1}^G n_{g,m}} exp( −π_m Σ_{g=1}^G V_g K(x*_g, μ_m) ) ν0(π_m)    (11)

For an underlying gamma or generalized gamma process, the posterior of π_m will be given by a gamma distribution due to conjugacy [16]. There will also be a number of atoms with raw size π_m > L that do not have associated data. The number of such atoms is Poisson distributed with mean α ∫_A exp(−V^T K_μ π) ν0(dπ) R0(dμ), where A = {(μ, π) : K(x*_g, μ)π > L, for some g}, and which can be computed using the approach described for Eq.
6.

• Raw atom covariate locations μ_m: Since we assume a finite set of covariate locations, we can sample μ_m according to the discrete distribution

  p(μ_m | n_{g,m}, V, B_+) ∝ Π_{g=1}^G K(x*_g, μ_m)^{n_{g,m}} exp( −π_m Σ_{g=1}^G V_g K(x*_g, μ_m) ) R0(μ_m)    (12)

• Slice variables u_{g,i}: Sampled as u_{g,i} | {π}, {μ}, s_{g,i} ~ Un[0, K(x*_g, μ_{s_{g,i}}) π_{s_{g,i}}].
• Cluster allocations s_{g,i}: The prior on s_{g,i} cancels with the prior on u_{g,i}, yielding

  p(s_{g,i} = m | y_{g,i}, u_{g,i}, θ_m, π_m, μ_m) ∝ q(y_{g,i} | θ_m, φ_m) 1(u_{g,i} < K(x*_g, μ_m)π_m)    (13)

where only a finite number of m need be evaluated.
• Parameter locations: Can be sampled as in a standard mixture model [16].

4 Experiments

We evaluate the performance of the proposed slice sampler in the setting of covariate dependent density estimation. We assume the statistical model in Eq. 4 and consider a univariate Gaussian distribution as the data generating distribution. We use both synthetic and real data sets in our experiments and compare the slice sampler to a Gibbs sampler for a finite approximation to the model (see the supplement for details of the model and sampler) and to the original SNGP sampler.
We assess the mixing characteristics of the sampler using the integrated autocorrelation time τ of the number of clusters used by the sampler at each iteration after a burn-in period, and by the predictive quality of the collected samples on held-out data. The integrated autocorrelation time of samples drawn from an MCMC algorithm controls the Monte Carlo error inherent in a sample drawn from the MCMC algorithm. It can be shown that in a set of T samples from the MCMC algorithm, there are in effect only T/(2τ) "independent" samples.
Therefore, lower values of τ are deemed better. We obtain an estimate ˆτ of the integrated autocorrelation time following [19].
We assess the predictive performance of the collected samples from the various algorithms by computing a Monte Carlo estimate of the predictive log-likelihood of a held-out data point under the model. Specifically, for a held out point y* we have

  log p(y* | y) ≈ (1/T) Σ_{t=1}^T log( Σ_{m=1}^{M^(t)} w_m^(t) q( y* | θ_m^(t), φ_m^(t) ) )    (14)

Table 1: Results of the samplers using different kernels. Entries are of the form "average predictive density / average number of clusters used / ˆτ", where two standard errors are shown in parentheses. Results are averaged over 5 hold-out data sets.

              Synthetic                      CMB                           Motorcycle
  Slice Box   -2.70 (0.12) / 11.6 / 2442     -0.15 (0.11) / 14.4 / 2465    -0.90 (0.28) / 10.3 / 2414
  SNGP        -2.67 (0.12) / 43.3 / 2488     -0.22 (0.14) / 79.1 / 2495    NA
  Finite Box  -2.78 (0.15) / 11.7 / 2497     -0.41 (0.14) / 18.2 / 2444    -1.19 (0.16) / 16.4 / 2352
  Slice SE    NA                             -0.28 (0.07) / 14.7 / 2447    -0.87 (0.28) / 8.2 / 2377
  Finite SE   NA                             -0.29 (0.05) / 9.5 / 2491     -0.99 (0.19) / 7.3 / 2159

Figure 1: Left: Synthetic data. Middle: Trace plots of the number of clusters used by the three samplers. Right: Histogram of truncation point L.

The weight w_m^(t) is the probability of choosing atom m for sample t. We did not use the Rao-Blackwellized estimator to compute Eq. 14 for the slice sampler, to achieve fair comparisons (see the supplement for the results using the Rao-Blackwellized estimator).

4.1 Synthetic data

We generated synthetic data from a dynamic mixture model with 12 components (Figure 1). Each component has an associated location, μ_k, that can take the value of any of ten uniformly spaced time stamps, t_j ∈ [0, 1].
The components are active according to the kernel K(x, μ_k) = 1(|x − μ_k| < 0.2) – i.e. components are active for two time stamps around their location. At each time stamp, t_j, we generate 60 data points. For each data point we choose a component, k, such that 1(|t_j − μ_k| < 0.2) = 1, and then generate that data point from a Gaussian distribution with mean μ_k and variance 10. We use 50 of the generated data points per time stamp as a training set and hold out 10 data points for prediction.
Since the SNGP is a special case of the normalized KGaP, we compare the finite and slice samplers, which are both conditional samplers, to the original marginal sampler proposed in [4]. We use the basic version of the SNGP that uses fixed-width kernels, as we assume fixed-width kernel functions for simplicity. The implementation of the SNGP sampler we used also only allows for fixed component variances, so we fix all φ_k = 1/10, the true data-generating precision. We use the true kernel function that was used to generate the data as the kernel for the normalized KGaP model.
We ran the slice sampler for 10,000 burn-in iterations and subsequently collected 5,000 samples. We truncated the finite version of the model to 100 atoms and ran the sampler for 5,000 burn-in iterations and collected 5,000 samples. The SNGP sampler was run for 2,000 burn-in iterations and 5,000 samples were collected.4 The predictive log-likelihood, mean number of clusters used and ˆτ are shown in the "Synthetic" column of Table 1.
We see that all three algorithms find a region of the posterior that gives predictive estimates of a similar quality. The autocorrelation estimates for the three samplers are also very similar. This might seem surprising, since the SNGP sampler uses sophisticated split-merge moves to improve mixing, which have no analogue in the slice sampler.
In addition, we note that although the per-iteration mixing performance is comparable, the average time per 100 iterations was ~10 seconds for the slice sampler, ~30 seconds for the SNGP sampler, and ~200 seconds for the finite sampler. Even with only 100 atoms the finite sampler is much more expensive than the slice and SNGP samplers.5

4 No thinning was performed in any of the experiments in this paper.

We also observe (Figure 1) that both the slice and finite samplers use essentially the true number of components underlying the data, and that the SNGP sampler uses on average twice as many components. The finite sampler finds a posterior mode with 13 clusters and rarely makes small moves from that mode. The slice sampler explores modes with 10-17 clusters, but never makes large jumps away from this region. The SNGP sampler explores the largest number of used clusters, ranging from 23-40; however, it has not explored regions that use fewer clusters.
Figure 1 also depicts the distribution of the variable truncation level L over all samples in the slice sampler. This suggests that a finite model that discards atoms with π_k < 10^{-18} introduces negligible truncation error. However, this value of L corresponds to ≈ 10^{18} atoms in the finite model, which is computationally intractable. To keep the computation times reasonable we were only able to use 100 atoms, a far cry from the number implied by L.
In Figure 2 (Left) we plot estimates of the predictive density at each time stamp for the slice (a), finite (b) and SNGP (c) samplers.
All three samplers capture the evolving structure of the distribution.\nHowever, the \ufb01nite sampler seems unable to discard unneeded components. This is evidenced by\nthe small mass of probability that spans times [0, 0.8] when the data that the component explains only\nexists at times [0.2, 0.5]. The slice and SNGP samplers seem to both provide reasonable explanations\nfor the distribution, with the slice sampler tending to provide smoother estimates.\n\n4.2 Real data\n\nAs well as providing an alternative inference method for existing models, our slice sampler can be\nused in a range of models that fall under the general class of KNRMs. To demonstrate this, we\nuse the \ufb01nite and slice versions of our sampler to learn two kernel DPs, one using a box kernel,\nK(x, \u00b5) = 1 (|x \u2212 \u00b5| < 0.2) (the setting in the SNGP), and the other using a square exponential\nkernel K(x, \u00b5) = exp(\u2212200(x \u2212 \u00b5)2), which has support approximately on [\u00b5 \u2212 .2, \u00b5 + .2]. The\nkernel was chosen to be somewhat comparable to the box kernel, however, this kernel allows the\nin\ufb02uence of an atom to diminish gradually as opposed to being constant. We compare to the SNGP\nsampler for the box kernel model, but note that this sampler is not applicable to the exponential\nkernel model.\nWe compare these approaches on two real-world datasets:\n\n\u2022 Cosmic microwave background radiation (CMB)[20]: TT power spectrum measure-\nments, \u03b7, from the cosmic microwave background radiation (CMB) at various \u2018multipole\nmoments\u2019, denoted M. Both variables are considered continuous and exhibit dependence.\nWe rescale M to be in [0, 1] and standardize \u03b7 to have mean 0 and unit variance.\n\n\u2022 Motorcycle crash data [21]. This data set records the head acceleration, A, at various\ntimes during a simulated motorcycle crash. 
We normalize time to [0, 1] and standardize A\nto have mean 0 and unit variance.\n\nBoth datasets exhibit local heteroskedasticity, which cannot be captured using the SNGP. For the\nCMB data, we consider only the \ufb01rst 600 multipole moments, where the variance is approximately\nconstant, allowing us to compare the SNGP sampler to the other algorithms. For all models we\n\ufb01xed the observation variance to 0.02, which we estimated from the standardized data. To ease the\ncomputational burden of the samplers we picked 18 time stamps in [0.05, 0.95], equally spaced 0.05\napart and assigned each observation to the time stamp closest to its associated value of M. This\nstep is by no means necessary, but the running time of the algorithms improves signi\ufb01cantly. For the\n\n5Sampling the cluster means and assignments is the slowest step for the SNGP sampler taking about 3\nseconds. The times reported here only performed this step every 25 iterations achieving reasonable results. If\nthis step were performed every iteration the results may improve, but the computation time will explode.\n\n7\n\n\fFigure 2: Left: Predictive density at each time stamp for synthetic data using the slice (a), \ufb01nite (b)\nand SNGP (c) samplers. The scales of all three axis are identical. Middle: Mean and 95% CI of\npredictive distribution for all three samplers on CMB data using the box kernel. Right: Mean and\n95% CI of predictive distribution using the square exponential kernel.\n\nmotorcycle data, there was no regime of constant variance, so we only compare the slice and \ufb01nite\ntruncation samplers6.\nFor each dataset and each model/sampler, the held-out predictive log-likelihood, the mean number\nof used clusters and \u02c6\u03c4 are reported in Table 1. The mixing characteristics of the chain are similar to\nthose obtained for the synthetic data. We see in Table 1 that the box kernel and the square exponential\nkernel produce similar results on the CMB data. 
However, the kernel width was not optimized and\ndifferent values may prove to yield superior results. For the motorcycle data we see a noticeable\ndifference between using the box and square exponential kernels where using the latter improves the\nheld-out predictive likelihood and results in both samplers using fewer components on average.\nFigure 2 shows the predictive distributions obtained on the CMB data. Looking at the mean and 95%\nCI of the predictive distribution (middle) we see that when using the box kernel the SNGP actually\n\ufb01ts the data the best. This is most likely due to the fact that the SNGP is using more atoms than\nthe slice or \ufb01nite samplers. We show that the square exponential kernel (right) gives much smoother\nestimates and appears to \ufb01t the data better, using the same number of atoms as were learned with the\nbox kernel (see Table 1). We note that the slice sampler took \u223c 20 seconds per 100 iterations while\nthe \ufb01nite sampler used \u223c 150 seconds.\n\n5 Conclusion\n\nWe presented the class of normalized kernel CRMs, a type of dependent normalized random mea-\nsure. This class generalizes previous work by allowing more \ufb02exibility in the underlying CRM and\nkernel function used to induce dependence. We developed a slice sampler to perform inference on\nthe in\ufb01nite dimensional measure and compared this method with samplers for a \ufb01nite approxima-\ntion and for the SNGP. We found that the slice sampler yields samples with competitive predictive\naccuracy at a fraction of the computational cost.\nThere are many directions for future research. Incorporating reversible-jump moves [22] such as\nsplit-merge proposals should allow the slice sampler to explore larger regions of the parameter space\nwith a limited decrease in computational ef\ufb01ciency. 
A similar methodology may yield efficient inference algorithms for KCRMs such as the KBP, extending the existing slice sampler for the Indian Buffet Process [23].

Acknowledgments

NF was funded by grant AFOSR FA9550-11-1-0166. SW was funded by grants NIH R01GM087694 and AFOSR FA9550010247.

Footnote 6: The SNGP could still be used to model this data; however, then we would be comparing the models as opposed to the samplers.

References
[1] S.N. MacEachern. Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, 1999.
[2] D. Dunson. Nonparametric Bayes applications to biostatistics. In N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, editors, Bayesian Nonparametrics. Cambridge University Press, 2010.
[3] J.E. Griffin and M.F.J. Steel. Order-based dependent Dirichlet processes. JASA, 101(473):179–194, 2006.
[4] V. Rao and Y.W. Teh. Spatial normalized gamma processes. In NIPS, 2009.
[5] L. Ren, Y. Wang, D. Dunson, and L. Carin. The kernel beta process. In NIPS, 2011.
[6] J.F.C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
[7] A. Lijoi and I. Prünster. Models beyond the Dirichlet process. Technical Report 129, Collegio Carlo Alberto, 2009.
[8] J.F.C. Kingman. Poisson processes. OUP, 1993.
[9] B. Fristedt and L.F. Gray. A Modern Approach to Probability Theory. Probability and Its Applications. Birkhäuser, 1997.
[10] N.L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18:1259–1294, 1990.
[11] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.
[12] E. Regazzini, A. Lijoi, and I. Prünster.
Distributional results for means of normalized random measures with independent increments. Annals of Statistics, 31(2):560–585, 2003.
[13] A. Lijoi, R.H. Mena, and I. Prünster. Controlling the reinforcement in Bayesian non-parametric mixture models. JRSS B, 69(4):715–740, 2007.
[14] J.E. Griffin. The Ornstein-Uhlenbeck Dirichlet process and other time-varying processes for Bayesian nonparametric inference. Technical report, Department of Statistics, University of Warwick, 2007.
[15] S. Favaro and Y.W. Teh. MCMC for normalized random measure mixture models. Submitted, 2012.
[16] J. E. Griffin and S. G. Walker. Posterior simulation of normalized random measure mixtures. Journal of Computational and Graphical Statistics, 20(1):241–259, 2011.
[17] L.F. James, A. Lijoi, and I. Prünster. Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36(1):76–97, 2009.
[18] J.E. Griffin, M. Kolossiatis, and M.F.J. Steel. Comparing distributions using dependent normalized random measure mixtures. Technical report, University of Warwick, 2010.
[19] M. Kalli, J.E. Griffin, and S.G. Walker. Slice sampling mixture models. Statistics and Computing, 21(1):93–105, 2011.
[20] C.L. Bennett et al. First year Wilkinson Microwave Anisotropy Probe (WMAP) observations: Preliminary maps and basic results. Astrophysics Journal Supplement, 148:1, 2003.
[21] B.W. Silverman. Some aspects of the spline smoothing approach to non-parametric curve fitting. JRSS B, 47:1–52, 1985.
[22] P.J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
[23] Y.W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process.
In AISTATS, volume 11, 2007.