{"title": "Priors for Diversity in Generative Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2996, "page_last": 3004, "abstract": "Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in a topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model.", "full_text": "Priors for Diversity in Generative Latent Variable Models\n\nJames Y. Zou\nSchool of Engineering and Applied Sciences\nHarvard University\nCambridge, MA 02138\njzou@fas.harvard.edu\n\nRyan P. Adams\nSchool of Engineering and Applied Sciences\nHarvard University\nCambridge, MA 02138\nrpa@seas.harvard.edu\n\nAbstract\n\nProbabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in a topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model.\n\n1 Introduction\n\nThe probabilistic generative model is an important tool for statistical learning because it enables rich data to be explained in terms of simpler latent structure. 
The discovered structure can be useful in its own right, for explanatory purposes and visualization, or it may be useful for improving generalization to unseen data. In the latter case, we might think of the inferred latent structure as providing a feature representation that summarizes complex high-dimensional interaction into a simpler form.\nThe core assumption behind the use of latent variables as features, however, is that the salient statistical properties discovered by unsupervised learning will be useful for discriminative tasks. This requires that the features span the space of possible data and represent diverse characteristics that may be important for discrimination. Diversity, however, is difficult to express within the generative framework. Most often, one builds a model where the feature representations are independent a priori, with the hope that a good fit to the data will require employing a variety of latent variables. There is reason to think that this does not always happen in practice, and that during unsupervised learning, model capacity is often spent improving the density around the common cases, not allocating new features. For example, in a generative clustering model based on a mixture distribution, multiple mixture components will often be used for a single \u201cintuitive group\u201d in the data, simply because the shape of the component\u2019s density is not a close fit to the group\u2019s distribution. A generative mixture model will happily use many of its components to closely fit the density of a single group of data, leading to a highly redundant feature representation. Similarly, when applied to a text corpus, a topic model such as latent Dirichlet allocation [1] will place large probability mass on the same stop words across many topics, in order to fine-tune the probability assigned to the common case. 
In both of these situations, we would like the latent groupings to uniquely correspond to characteristics of the data: that a group of data should be explained by one mixture component, and that common stop words should be one category of words among many. This intuition expresses a need for diversity in the latent parameters of the model that goes beyond what is highly likely under the posterior distribution implied by an independent prior.\nIn this paper, we propose a modular approach to diversity in generative probabilistic models by replacing the independent prior on latent parameters with a determinantal point process (DPP). The determinantal point process enables a modeler to specify a notion of similarity on the space of interest, which in this case is a space of possible latent distributions, via a positive definite kernel. The DPP then assigns probabilities to particular configurations of these distributions according to the determinant of the Gram matrix. This construction naturally leads to a generative latent variable model in which diverse sets of latent parameters are preferred over redundant sets.\nThe determinantal point process is a convenient statistical tool for constructing a tractable point process with repulsive interaction. The DPP is more general than the Poisson process (see, e.g., [2]), which has no interaction, but more tractable than Strauss [3] and Gibbs/Markov [4] processes (at the cost of only being able to capture anticorrelation). Hough et al. [5] provide a useful survey of probabilistic properties of the determinantal point process; for statistical properties, see, e.g., Scardicchio et al. [6] and Lavancier et al. [7]. There has also been recent interest in using the DPP within machine learning for modeling sets of structures [8], and for conditionally producing diverse collections of objects [9]. 
The approach we propose here is different from this previous work in that we are suggesting the use of a determinantal point process within a hierarchical model, and using it to enforce diversity among latent variables, rather than as a mechanism for diversity across directly observed discrete structures.\n\n2 Diversity in Generative Latent Variable Models\n\nIn this paper we consider generic directed probabilistic latent variable models that produce distributions over a set of $N$ data, denoted $\\{x_n\\}_{n=1}^N$, which live in a sample space $\\mathcal{X}$. Each of these data has a latent discrete label $z_n$, which takes a value in $\\{1, 2, \\cdots, J\\}$. The latent label indexes into a set of parameters $\\{\\theta_j\\}_{j=1}^J$. The parameters determined by $z_n$ then produce the data according to a distribution $f(x_n \\mid \\theta_{z_n})$. Typically we use independent priors for the $\\theta_j$, here denoted by $\\pi(\\cdot)$, but the distribution over the latent indices $z_n$ may be more structured. Taken together this leads to the generic joint distribution:\n\n$$p(\\{x_n, z_n\\}_{n=1}^N, \\{\\theta_j\\}_{j=1}^J) = p(\\{z_n\\}_{n=1}^N) \\left[ \\prod_{n=1}^N f(x_n \\mid \\theta_{z_n}) \\right] \\prod_{j=1}^J \\pi(\\theta_j). \\qquad (1)$$\n\nThe details of each distribution are problem-specific, but this general framework appears in many contexts. For example, in a typical mixture model, the $z_n$ are drawn independently from a multinomial distribution and the $\\theta_j$ are the component-specific parameters. In an admixture model such as latent Dirichlet allocation (LDA) [1], the $\\theta_j$ may be \u201ctopics\u201d, or distributions over words. In an admixture, the $z_n$ may share structure based on, e.g., being words within a common set of documents.\nThese models are often thought of as providing a principled approach for feature extraction. 
At training time, one either finds the maximum of the posterior distribution $p(\\{\\theta_j\\}_{j=1}^J \\mid \\{x_n\\}_{n=1}^N)$ or collects samples from it, while integrating out the data-specific latent variables $z_n$. Then when presented with a test case $x^\\star$, one can construct a conditional distribution over the corresponding unknown variable $z^\\star$, which is now a \u201cfeature\u201d that might usefully summarize many related aspects of $x^\\star$. However, this interpretation of the model is suspect; we have not asked the model to make the $z_n$ variables explanatory, except as a byproduct of improving the training likelihood. Different $\\theta_j$ may assign essentially identical probabilities to the same datum, resulting in ambiguous features.\n\n[Figure 1 panels: (a) Independent Points; (b) DPP Points; (c) Independent Gaussians; (d) DPP Gaussians; (e) Independent Multinomials; (f) DPP Multinomials]\nFigure 1: Illustrations of the determinantal point process prior. (a) 25 independent uniform draws in the unit square; (b) a draw from a DPP with 25 points; (c) ten Gaussian distributions with means uniformly drawn on the unit interval; (d) ten Gaussian distributions with means distributed according to a DPP using the probability product kernel; (e) five random discrete distributions; (f) five random discrete distributions drawn from a DPP on the simplex with the probability product kernel [10].\n\n2.1 Measure-Valued Determinantal Point Process\n\nIn this work we propose an alternative to the independence assumption of the standard latent variable model. Rather than specifying $p(\\{\\theta_j\\}_{j=1}^J) = \\prod_j \\pi(\\theta_j)$, we will construct a determinantal point process on sets of component-specific distributions $\\{f(x \\mid \\theta_j)\\}_{j=1}^J$. Via the DPP, it will be possible for us to specify a preference for sets of distributions that have minimal overlap, as determined via a positive-definite kernel function between distributions. 
In the case of the simple parametric families for $f(\\cdot)$ that we consider here, it is appropriate to think of the DPP as providing a \u201cdiverse\u201d set of parameters $\\theta = \\{\\theta_j\\}_{j=1}^J$, where the notion of diversity is expressed entirely in terms of the resulting probability measure on the sample space $\\mathcal{X}$. After MAP inference with this additional structure, the hope is that the $\\theta_j$ will explain substantially different regions of $\\mathcal{X}$ (appropriately modulated by the likelihood) and lead to improved, non-redundant feature extraction at test time.\nWe will use $\\Theta$ to denote the space of possible $\\theta$. A realization from a point process on $\\Theta$ produces a random finite subset of $\\Theta$. To construct a determinantal point process, we first define a positive definite kernel on $\\Theta$, which we denote $K : \\Theta \\times \\Theta \\to \\mathbb{R}$. The probability density associated with a particular finite $\\theta \\subset \\Theta$ is given by\n\n$$p(\\theta \\subset \\Theta) \\propto |K_\\theta|, \\qquad (2)$$\n\nwhere $K_\\theta$ is the $|\\theta| \\times |\\theta|$ positive definite Gram matrix that results from applying $K(\\theta, \\theta')$ to the elements of $\\theta$. The eigenspectrum of the kernel on $\\Theta$ must be bounded to $[0, 1]$. The kernels we will focus on in this paper are composed of two parts: 1) a positive definite correlation function $R(\\theta, \\theta')$, where $R(\\theta, \\theta) = 1$, and 2) the \u201cprior kernel\u201d $\\sqrt{\\pi(\\theta)\\pi(\\theta')}$, which expresses our marginal preferences for some parameters over others. These are combined to form the kernel of interest:\n\n$$K(\\theta, \\theta') = R(\\theta, \\theta') \\sqrt{\\pi(\\theta)\\pi(\\theta')}, \\qquad (3)$$\n\nwhich leads to the matrix form $K_\\theta = \\Pi R_\\theta \\Pi$, where $\\Pi = \\mathrm{diag}([\\sqrt{\\pi(\\theta_1)}, \\sqrt{\\pi(\\theta_2)}, \\cdots])$.\nNote that if $R(\\theta, \\theta') = 0$ when $\\theta \\neq \\theta'$, this construction recovers the Poisson process with intensity measure $\\pi(\\theta)$. Note also in this case that if the cardinality of $\\theta$ is predetermined, then this recovers the traditional independent prior. More interesting, however, are $R(\\theta, \\theta')$ with off-diagonal structure that induces interaction within the set. Such kernels will always induce repulsion of the points so that diverse subsets of $\\Theta$ will tend to have higher probability under the prior. See Fig. 1 for illustrations of the difference between independent samples and the DPP for several different settings.\n\n2.2 Kernels for Probability Distributions\n\nThe determinantal point process framework allows us to construct a generative model for repulsion, but as with other kernel-based priors, we must define what \u201crepulsion\u201d means. A variety of positive definite functions on probability measures have been defined, but in this work we will use the probability product kernel [10]. This kernel is a natural generalization of the inner product for probability distributions. The basic kernel has the form\n\n$$K(\\theta, \\theta' ; \\rho) = \\int_{\\mathcal{X}} f(x \\mid \\theta)^\\rho f(x \\mid \\theta')^\\rho \\, dx \\qquad (4)$$\n\nfor $\\rho > 0$. 
As we require a correlation kernel, we use the normalized variant given by\n\n$$R(\\theta, \\theta' ; \\rho) = K(\\theta, \\theta' ; \\rho) / \\sqrt{K(\\theta, \\theta ; \\rho) K(\\theta', \\theta' ; \\rho)}. \\qquad (5)$$\n\nThis kernel has convenient closed forms for several distributions of interest, which makes it an ideal building block for the present model.\n\n2.3 Replicated Determinantal Point Process\n\nA property that we often desire from our prior distributions is that they have the ability to become arbitrarily strong. That is, under the interpretation of a Bayesian prior as \u201cinferences from previously-seen data\u201d, we would like to be able to imagine an arbitrary amount of such data and construct a highly-informative prior when appropriate. Unfortunately, the standard determinantal point process does not provide a knob to turn to increase its strength arbitrarily.\nFor example, take a DPP on a Euclidean space and consider a point $t$, an arbitrary unit vector $w$ and a small scalar $\\epsilon$. Construct two pairs of points using a $\\delta > 1$: a \u201cnear\u201d pair $\\{t, t + \\epsilon w\\}$, and a \u201cfar\u201d pair $\\{t, t + \\epsilon \\delta w\\}$. We wish to find some small $\\epsilon$ such that the \u201cfar\u201d configuration is arbitrarily more likely than the \u201cnear\u201d configuration under the DPP. That is, we would like the ratio of determinants\n\n$$r(\\epsilon) = \\frac{p(\\{t, t + \\epsilon \\delta w\\})}{p(\\{t, t + \\epsilon w\\})} = \\frac{1 - R(t, t + \\epsilon \\delta w)^2}{1 - R(t, t + \\epsilon w)^2} \\qquad (6)$$\n\nto be unbounded as $\\epsilon$ approaches zero. The objective is to have a scaling parameter that can cause the determinantal prior to be arbitrarily strong relative to the likelihood terms. 
If we perform a Taylor expansion of the numerator and denominator around $\\epsilon = 0$, we get\n\n$$r(\\epsilon) \\approx \\frac{1 - \\left( R(t, t) + 2 \\delta w \\epsilon \\left[ \\frac{d}{d\\tilde{t}} R(t, \\tilde{t}) \\right]_{\\tilde{t}=t} \\right)}{1 - \\left( R(t, t) + 2 w \\epsilon \\left[ \\frac{d}{d\\tilde{t}} R(t, \\tilde{t}) \\right]_{\\tilde{t}=t} \\right)} = \\delta. \\qquad (7)$$\n\nWe can see that, when near zero, this ratio captures the difference in distances, but not in a way that can be rescaled to greater effect. This means that there exist finite data sets that we cannot overwhelm with any DPP prior. To address this issue, we augment the determinantal point process with an additional parameter $\\lambda > 0$, so that the probability of a finite subset $\\theta \\subset \\Theta$ becomes\n\n$$p(\\theta \\subset \\Theta) \\propto |K_\\theta|^\\lambda. \\qquad (8)$$\n\nFor integer $\\lambda$, it can be viewed as a set of $\\lambda$ identical \u201creplicated realizations\u201d from determinantal point processes, leaving our generative view intact. The replicate of $\\theta$ is just $\\theta^\\lambda = \\{\\lambda \\text{ copies of } \\theta\\}$ and the corresponding $K_{\\theta^\\lambda}$ is a $\\lambda|\\theta| \\times \\lambda|\\theta|$ block diagonal matrix where each block is a replicate of $K_\\theta$. This maps well onto the view of a prior as pseudo-data; our replicated DPP asserts that we have seen $\\lambda$ previous such data sets. As in other pseudo-count priors, we do not require in practice that $\\lambda$ be an integer, and under a penalized log likelihood view of MAP inference, it can be interpreted as a parameter for increasing the effect of the determinantal penalty.\n\n2.4 Determinantal Point Process as Regularization.\n\nIn addition to acting as a prior over distributions in the generative setting, we can also view the DPP as a new type of \u201cdiversity\u201d regularizer on learning. The goal is to solve\n\n$$\\theta^\\star = \\operatorname{argmin}_{\\theta \\subset \\Theta} \\; L(\\theta ; \\{x_n\\}_{n=1}^N) - \\lambda \\ln |K_\\theta|, \\qquad (9)$$\n\nchoosing the best set of parameters $\\theta$ from $\\Theta$. Here $L(\\cdot)$ is a loss function that depends on the data and the discrimination function, with parameters $\\theta$. From Eqn. (3),\n\n$$\\ln |K_\\theta| = \\ln |R_\\theta| + \\sum_{\\theta_j \\in \\theta} \\ln \\pi(\\theta_j). \\qquad (10)$$\n\nIf $L(\\cdot) = -\\ln p(\\{x_n\\}_{n=1}^N \\mid \\theta)$, then the resulting optimization is simply MAP estimation. In this framework, we can combine the DPP penalty with any other regularizer on $\\theta$, for example the sparsity-inducing $\\ell_1$ regularizer. In the following sections, we give empirical evidence that this diversity improves generalization performance.\n\nFigure 2: Schematic of DPP-LDA. We replace the standard plate notation for i.i.d. topics in LDA with a \u201cdouble-struck plate\u201d to indicate a determinantal point process.\n\n3 MAP Inference\n\nIn what follows, we fix the cardinality $|\\theta|$. Viewing the kernel $K_\\theta$ as a function of $\\theta$, the gradient is $\\frac{\\partial}{\\partial \\theta} \\log |K_\\theta| = \\mathrm{trace}\\left(K_\\theta^{-1} \\frac{\\partial K_\\theta}{\\partial \\theta}\\right)$. This allows application of general gradient-based optimization algorithms for inference. In particular, we can optimize $\\theta$ as a modular component within an off-the-shelf expectation maximization (EM) algorithm. Here we examine two illustrative examples of generative latent variable models into which we can directly plug our DPP-based prior.\nDiversified Latent Dirichlet Allocation. Latent Dirichlet allocation (LDA) [1] is an immensely popular admixture model for text and, increasingly, for other kinds of data that can be treated as a \u201cbag of words\u201d. LDA constructs a set of topics (distributions over the vocabulary) and asserts that each word in the corpus is explained by one of these topics. 
The topic-word assignments are unobserved, but LDA attempts to find structure by requiring that only a small number of topics be represented in any given document.\nIn the standard LDA formulation, the topics are $K$ discrete distributions $\\beta_k$ over a vocabulary of size $V$, where $\\beta_{kv}$ is the probability of topic $k$ generating word $v$. There are $M$ documents and the $m$th document has $N_m$ words. Document $m$ has a latent multinomial distribution over topics, denoted $\\theta_m$, and each word in the document $w_{mn}$ has a topic index $z_{mn}$ drawn from $\\theta_m$. While classical LDA uses independent Dirichlet priors for the $\\beta_k$, here we \u201cdiversify\u201d latent Dirichlet allocation by replacing this prior with a DPP. That is, we introduce a correlation kernel\n\n$$R(\\beta_k, \\beta_{k'}) = \\frac{\\sum_{v=1}^V (\\beta_{kv} \\beta_{k'v})^\\rho}{\\sqrt{\\sum_{v=1}^V \\beta_{kv}^{2\\rho}} \\sqrt{\\sum_{v=1}^V \\beta_{k'v}^{2\\rho}}}, \\qquad (11)$$\n\nwhich approaches one as $\\beta_k$ becomes more similar to $\\beta_{k'}$. In the application below of DPP-LDA, we use $\\rho = 0.5$. We use $\\pi(\\beta_k) = \\mathrm{Dirichlet}(\\alpha)$, and write the resulting prior as $p(\\beta) \\propto |K_\\beta|$. We call this model \u201cDPP-LDA\u201d, and illustrate it with a graphical model in Figure 2. We use a \u201cdouble-struck plate\u201d in the graphical model to represent the DPP, and highlight how it can be used as a drop-in replacement for the i.i.d. assumption.\nTo perform MAP learning of this model, we construct a modified version of the standard variational EM algorithm. 
As in variational EM for LDA, we define a factored approximation\n\n$$q(\\theta_m, z_m \\mid \\gamma_m, \\phi_m) = q(\\theta_m \\mid \\gamma_m) \\prod_{n=1}^{N_m} q(z_{mn} \\mid \\phi_{mn}). \\qquad (12)$$\n\nIn this approximation, each document $m$ has a Dirichlet approximation to its posterior over topics, given by $\\gamma_m$. $\\phi_m$ is an $N_m \\times K$ matrix in which the $n$th row, denoted $\\phi_{mn}$, is a multinomial distribution over topics for word $w_{mn}$. For the current estimate of $\\beta_{kv}$, $\\gamma_m$ and $\\phi_m$ are iteratively optimized. See Blei et al. [1] for more details. Our extension of variational EM to include the DPP does not require alteration of these steps.\nThe inclusion of the determinantal point process prior does, however, affect the maximization step. The diversity prior introduces an additional penalty on $\\beta$, so that the M-step requires solving\n\n$$\\beta^\\star = \\operatorname{argmax}_\\beta \\left\\{ \\sum_{m=1}^M \\sum_{n=1}^{N_m} \\sum_{k=1}^K \\sum_{v=1}^V \\phi_{mnk} w_{mn}^{(v)} \\ln \\beta_{kv} + \\lambda \\ln |K_\\beta| \\right\\}, \\qquad (13)$$\n\nsubject to the constraints that each row of $\\beta$ sum to 1. For $\\lambda = 0$, this optimization procedure yields the standard update for vanilla LDA, $\\beta^\\star_{kv} \\propto \\sum_{m=1}^M \\sum_{n=1}^{N_m} \\phi_{mnk} w_{mn}^{(v)}$. For $\\lambda > 0$ we use gradient descent to find a locally optimal $\\beta$.\n\nTable 1: Top ten words from representative topics learned in LDA and DPP-LDA.\nLDA, \u201ctypical\u201d: the, to, and, it, of, is, in, that, for, you\nLDA, \u201ctypical\u201d: the, to, and, in, of, is, it, for, that, can\nDPP-LDA, \u201cstop words\u201d: the, of, that, you, by, one, all, but, do, my, which, and, in, at, from, some, their, with, your, who\nDPP-LDA, \u201cChristianity\u201d: jesus, matthew, prophecy, christians, church, messiah, psalm, isaiah, prophet, lord\nDPP-LDA, \u201cspace\u201d: space, nasa, astronaut, mm, mission, pilot, shuttle, military, candidates, ww\nDPP-LDA, \u201cOS\u201d: file, pub, usr, available, export, font, lib, directory, format, server\nDPP-LDA, \u201cpolitics\u201d: ms, myers, god, president, but, package, options, dee, believe, groups\n\nDiversified Gaussian Mixture Model. The mixture model is a popular model for generative clustering and density estimation. Given $J$ components, the probability of the data is given by\n\n$$p(x_n \\mid \\theta) = \\sum_{j=1}^J \\chi_j f(x_n \\mid \\theta_j). \\qquad (14)$$\n\nTypically, the $\\theta_j$ are taken to be independent in the prior. Here we examine determinantal point process priors for the $\\theta_j$ in the case where the components are Gaussian.\nFor Gaussian mixture models, the DPP prior is particularly tractable. As in DPP-LDA, we use the probability product kernel, which in this case also has a convenient closed form [10]. Let $f_1 = \\mathcal{N}(\\mu_1, \\Sigma_1)$ and $f_2 = \\mathcal{N}(\\mu_2, \\Sigma_2)$ be two Gaussians; the product kernel is\n\n$$K(f_1, f_2) = (2\\pi)^{(1-2\\rho)\\frac{D}{2}} \\rho^{-\\frac{D}{2}} |\\hat{\\Sigma}|^{\\frac{1}{2}} (|\\Sigma_1||\\Sigma_2|)^{-\\frac{\\rho}{2}} \\exp\\left( -\\frac{\\rho}{2} \\left( \\mu_1^T \\Sigma_1^{-1} \\mu_1 + \\mu_2^T \\Sigma_2^{-1} \\mu_2 - \\hat{\\mu}^T \\hat{\\Sigma} \\hat{\\mu} \\right) \\right),$$\n\nwhere $\\hat{\\Sigma} = (\\Sigma_1^{-1} + \\Sigma_2^{-1})^{-1}$ and $\\hat{\\mu} = \\Sigma_1^{-1} \\mu_1 + \\Sigma_2^{-1} \\mu_2$. In the special case of a fixed, isotropic covariance $\\sigma^2 I$ and $\\rho = 1$, the kernel is\n\n$$K(f(\\cdot \\mid \\mu), f(\\cdot \\mid \\mu')) = \\frac{1}{(4\\pi\\sigma^2)^{D/2}} e^{-\\|\\mu - \\mu'\\|^2 / (4\\sigma^2)}, \\qquad (15)$$\n\nwhere $D$ is the data dimensionality.\nIn the standard EM algorithm for Gaussian mixtures, one typically introduces latent binary variables $z_{nj}$, which indicate that datum $n$ belongs to component $j$. The E-step computes the responsibility vector $\\gamma(z_{nj}) = E[z_{nj}] \\propto \\chi_j \\mathcal{N}(x_n \\mid \\mu_j, \\Sigma_j)$. This step is identical for DPP-GMM. 
The update for the component weights is also the same: $\\chi_j = \\frac{1}{N} \\sum_{n=1}^N \\gamma(z_{nj})$. The difference between this procedure and the standard EM approach is that the M-step for the DPP-GMM optimizes the objective function (summarizing $\\{\\mu_j, \\Sigma_j\\}_{j=1}^J$ by $\\theta$ for clarity):\n\n$$\\theta^\\star = \\operatorname{argmax}_{\\theta \\in \\Theta} \\left\\{ \\sum_{n=1}^N \\sum_{j=1}^J \\gamma(z_{nj}) \\left[ \\ln \\chi_j + \\ln \\mathcal{N}(x_n \\mid \\mu_j, \\Sigma_j) \\right] + \\lambda \\ln |K_\\theta| \\right\\}. \\qquad (16)$$\n\nClosely related to DPP-GMM is DPP-K-means. The kernel acts on the set of centroids as in Eqn. (15), with $\\sigma^2$ now just a constant scaling term. Let $\\theta = \\{\\mu_j\\}$ and let $z_{nj}$ be the hard assignment indicator; the maximization step is then\n\n$$\\theta^\\star = \\operatorname{argmax}_{\\theta \\in \\Theta} \\left\\{ -\\sum_{n=1}^N \\sum_{j=1}^J z_{nj} \\|x_n - \\mu_j\\|^2 + \\lambda \\ln |K_\\theta| \\right\\}. \\qquad (17)$$\n\nWith the product kernel, the similarity between two Gaussians decays exponentially as the distance between their means increases. In practice, we find that when the number of mixture components $|\\theta|$ is large, $K_\\theta$ is well approximated by a sparse matrix.\n\nFigure 3: Effect of $\\lambda$ on classification error.\nFigure 4: Effect of centroid distance on test error.\n\n4 Experiment I: diversified topic modeling.\n\nWe tested LDA and DPP-LDA on the unfiltered 20 Newsgroup corpus, without removing any stop-words. A common frustration with vanilla LDA is that applying LDA to unfiltered data returns topics that are dominated by stop-words. This frustrating phenomenon occurs even as the number of topics is varied from $K = 5$ to $K = 50$. The first two columns of Table 1 show the ten most frequent words from two representative topics learned by LDA using $K = 25$. 
Stop-words occur frequently across all documents and thus are unhelpfully correlated with topic-specific informative keywords. We repeated the experiments after removing a list of the 669 most common stop-words. However, the topics inferred by regular LDA are still dominated by secondary stop-words that are not informative.\nDPP-LDA automatically groups common stop words into a few topics. By finding stop-word-specific topics, the majority of the remaining topics are available for more informative words. Table 1 shows a sample of topics learned by DPP-LDA on the unfiltered 20 Newsgroup corpus ($K = 25$, $\\lambda = 10^4$). As we vary $K$ or increase $\\lambda$ we observe robust grouping of stop-words into a few topics. High frequency words that are common across many topics significantly increase the similarity between the topics, as measured by the product kernel on the $\\beta$ distributions. This similarity incurs a large penalty in the DPP and so the objective actively pushes the parameters of LDA away from regions where stop words occupy large probability mass across many topics.\nFeatures learned from DPP-LDA lead to better document classification. It is common to use the $\\gamma_m$, the document-specific posterior distribution over topics, as feature vectors in document classification. We inferred $\\{\\gamma_{m,\\mathrm{train}}\\}$ on training documents from DPP-LDA variational EM, and then trained a support vector machine (SVM) classifier on $\\{\\gamma_{m,\\mathrm{train}}\\}$ with the true topic labels from 20 Newsgroups. On test documents, we fixed the parameters $\\alpha$ and $\\beta$ to the values inferred from the training set, and used variational EM to find MAP estimates of $\\{\\gamma_{m,\\mathrm{test}}\\}$. The mean test classification accuracy for a range of $\\lambda$ values is plotted in Figure 3. The setting $\\lambda = 0$ corresponds to vanilla LDA. In each trial, we use the same training set for DPP-LDA on a range of $\\lambda$ values. 
DPP-LDA with $\\lambda = 1$ consistently outperforms LDA in test classification ($p < 0.001$, binomial test). Large values of $\\lambda$ decrease classification performance.\n\n[Figure 3 axes: $\\lambda$ (horizontal) against test accuracy (vertical); curves for k=20 and k=25. Figure 4 axes: % increase in inter-centroid distance (horizontal) against % change in test accuracy (vertical); annotations $\\lambda$=0.1, $\\lambda$=0.001, $\\lambda$=0.01.]\n\n5 Experiment II: diverse clustering.\n\nMixture models are often a useful way to learn features for classification. The recent work of Coates et al. [11], for example, shows that even simple K-means works well as a method of extracting features for image labeling. In that work, K-means gave state-of-the-art results on the CIFAR-10 object recognition task. Coates et al. achieved these results using a patch-wise procedure in which random patches are sampled from images for training. Each patch is a 6-by-6 square, represented as a point in a 36 dimensional vector space. Patches from the training images are combined and clustered using K-means. Each patch is then represented by a binary K-dimensional feature vector: the kth entry is one if the patch is closer to the centroid k than its average distance to centroids. Roughly half of the feature entries are zero. Patches from the same image are then pooled to construct one feature vector for the whole image. An SVM is trained on these image features to perform classification.\n\nTable 2: Test classification accuracy on CIFAR-10 dataset.\ntraining set size   K     K-means   DPP K-means   gain (%)   $\\lambda$\n500                 30    34.81     36.21         1.4        0.01\n1000                30    43.32     44.27         0.95       0.01\n2000                60    52.05     52.55         0.50       0.01\n5000                150   61.03     61.23         0.20       0.001\n10000               300   66.36     66.65         0.29       0.001\n\nWe reason that DPP-K-means may produce more informative features since the cluster centroids will repel each other into more distinct positions in pixel space. 
We replicated the experiments from Coates et al., using their publicly-available code for identical pre- and post-processing. With this setup, $\\lambda = 0$ recovers regular K-means, and reproduces the results from Coates et al. [11]. We applied DPP-K-means to the CIFAR-10 dataset, while varying the size of the training set. For each training set size, we ran regular K-means for a range of values of $K$ and selected the $K$ that gives the best test accuracy for K-means. Then we compared the performance with DPP-K-means using the same $K$. For up to 10000 images in the training set, DPP-K-means leads to better test classification accuracy compared to the simple K-means. The comparisons are performed on matched settings: for a given randomly sampled training set and a centroid initialization, we generate the centroids from both K-means and DPP-K-means. The two sets of centroids were used to extract features and train classifiers, which are then tested on the same test set of images. DPP-K-means consistently outperforms K-means in generalization accuracy ($p < 0.001$, binomial test). For example, for a training set of size 1000 with $K = 30$, we ran 100 trials, each with a random training set and initialization; DPP-K-means outperformed K-means in 94 trials. As expected given its role as a regularizer, improvement from DPP-K-means is more significant for smaller training sets. For the full CIFAR-10 with 50000 training images, DPP-K-means does not consistently outperform K-means.\nNext we ask if there is a pattern between how far the DPP pushes apart the centroids and classification accuracy on the test set. Focusing on 1000 training images and $K = 30$, for each randomly sampled training set and centroid initialization, we compute the mean inter-centroid distance for K-means and DPP-K-means. We compute the test accuracy for each set of centroids. Fig. 4 bins the relative increase in inter-centroid distance into 10 bins. 
For each bin, we show the 25th, 50th, and 75th percentiles of the change in test accuracy. Test accuracy is maximized when the inter-centroid distances increase by about 14% relative to the K-means centroids, corresponding to \u03bb = 0.01.\n\n6 Discussion.\n\nWe have introduced a general approach to incorporating a preference for diversity into generative probabilistic models. We showed how a determinantal point process can be integrated as a modular component into existing learning algorithms, and discussed its general role as a diversity regularizer. We investigated two settings where diversity can be useful: learning topics from documents, and clustering image patches. Plugging a DPP into latent Dirichlet allocation allows LDA to automatically group stop-words into a few categories, enabling more informative topics in the other categories. In both document and image classification tasks, there exists an intermediate regime of diversity (as controlled by the hyperparameter \u03bb) that leads to consistent improvement in accuracy compared with standard i.i.d. models. A computational bottleneck can come from inverting the M \u00d7 M kernel matrix K, where M is the number of latent distributions. However, in many settings such as LDA, M is much smaller than the data size. We expect that there are many other settings where DPP-based diversity can be usefully introduced into a generative probabilistic model: in the emission parameters of HMMs and more general time series models, and as a mechanism for transfer learning.\n\nReferences\n\n[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n\n[2] J. F. C. Kingman. Poisson Processes. Oxford University Press, Oxford, United Kingdom, 1993.\n\n[3] David J. Strauss. A model for clustering. Biometrika, 62(2):467\u2013475, August 1975.\n\n[4] Jesper M\u00f8ller and Rasmus Plenge Waagepetersen. 
Statistical Inference and Simulation for Spatial Point Processes. Monographs on Statistics and Applied Probability. Chapman and Hall/CRC, Boca Raton, FL, 2004.\n\n[5] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and B\u00e1lint Vir\u00e1g. Determinantal processes and independence. Probability Surveys, 3:206\u2013229, 2006.\n\n[6] Antonello Scardicchio, Chase E. Zachary, and Salvatore Torquato. Statistical properties of determinantal point processes in high-dimensional Euclidean spaces. Physical Review E, 79(4), 2009.\n\n[7] Fr\u00e9d\u00e9ric Lavancier, Jesper M\u00f8ller, and Ege Rubak. Statistical aspects of determinantal point processes. http://arxiv.org/abs/1205.4818, 2012.\n\n[8] Alex Kulesza and Ben Taskar. Structured determinantal point processes. In Advances in Neural Information Processing Systems 23, 2011.\n\n[9] Alex Kulesza and Ben Taskar. Learning determinantal point processes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.\n\n[10] Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Journal of Machine Learning Research, 5:819\u2013844, 2004.\n\n[11] Adam Coates, Honglak Lee, and Andrew Ng. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.", "award": [], "sourceid": 1357, "authors": [{"given_name": "James", "family_name": "Kwok", "institution": null}, {"given_name": "Ryan", "family_name": "Adams", "institution": null}]}