{"title": "Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3158, "page_last": 3166, "abstract": "Links between probabilistic and non-probabilistic learning algorithms can arise by performing small-variance asymptotics, i.e., letting the variance of particular distributions in a graphical model go to zero. For instance, in the context of clustering, such an approach yields precise connections between the k-means and EM algorithms.  In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models.  Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models.  We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis.", "full_text": "Small-Variance Asymptotics for Exponential Family\n\nDirichlet Process Mixture Models\n\nKe Jiang, Brian Kulis\nDepartment of CSE\n\nThe Ohio State University\n\n{jiangk,kulis}@cse.ohio-state.edu\n\nMichael I. Jordan\n\nDepartments of EECS and Statistics\nUniversity of California at Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nSampling and variational inference techniques are two standard methods for in-\nference in probabilistic models, but for many problems, neither approach scales\neffectively to large-scale data. 
An alternative is to relax the probabilistic model\ninto a non-probabilistic formulation which has a scalable associated algorithm.\nThis can often be achieved by performing small-variance asymptotics, i.e., letting\nthe variance of particular distributions in the model go to zero. For instance, in\nthe context of clustering, such an approach yields connections between the k-means\nand EM algorithms. In this paper, we explore small-variance asymptotics\nfor exponential family Dirichlet process (DP) and hierarchical Dirichlet process\n(HDP) mixture models. Utilizing connections between exponential family distributions\nand Bregman divergences, we derive novel clustering algorithms from the\nasymptotic limit of the DP and HDP mixtures that feature the scalability of existing\nhard clustering methods as well as the flexibility of Bayesian nonparametric\nmodels. We focus on special cases of our analysis for discrete-data problems, including\ntopic modeling, and we demonstrate the utility of our results by applying\nvariants of our algorithms to problems arising in vision and document analysis.\n\n1 Introduction\nAn enduring challenge for machine learning is the development of algorithms that scale to truly\nlarge data sets. While probabilistic approaches, particularly Bayesian models, are flexible from\na modeling perspective, the lack of scalable inference methods can limit their applicability on some data.\nFor example, in clustering, algorithms such as k-means are often preferred in large-scale settings\nover probabilistic approaches such as Gaussian mixtures or Dirichlet process (DP) mixtures, as the\nk-means algorithm is easy to implement and scales to large data sets.\nIn some cases, links between probabilistic and non-probabilistic models can be made by applying\nasymptotics to the variance (or covariance) of distributions within the model. 
For instance, connections\nbetween probabilistic and standard PCA can be made by letting the covariance of the data\nlikelihood in probabilistic PCA tend toward zero [1, 2]; similarly, the k-means algorithm may be\nobtained as a limit of the EM algorithm when the covariances of the Gaussians corresponding to\neach cluster go to zero. Besides providing a conceptual link between seemingly quite different\napproaches, small-variance asymptotics can yield useful alternatives to probabilistic models when\nthe data size becomes large, as the non-probabilistic models often exhibit more favorable scaling\nproperties. The use of such techniques to derive scalable algorithms from rich probabilistic models\nis still emerging, but provides a promising approach to developing scalable learning algorithms.\nThis paper explores such small-variance asymptotics for clustering, focusing on the DP mixture.\nExisting work has considered asymptotics over the Gaussian DP mixture [3], leading to k-means-like\nalgorithms that do not fix the number of clusters upfront. This approach, while an important\nfirst step, raises the question of whether we can perform similar asymptotics over distributions other\nthan the Gaussian. We answer in the affirmative by showing how such asymptotics may be applied\nto the exponential family distributions for DP mixtures; such analysis opens the door to a new class\nof scalable clustering algorithms and utilizes connections between Bregman divergences and exponential\nfamilies. We further extend our approach to hierarchical nonparametric models (specifically,\nthe hierarchical Dirichlet process (HDP) [4]), and we view a major contribution of our analysis to\nbe the development of a general hard clustering algorithm for grouped data.\nOne of the primary advantages of generalizing beyond the Gaussian case is that it opens the door\nto novel scalable algorithms for discrete-data problems. 
For instance, visual bag-of-words [5] has\nbecome a standard representation for images in a variety of computer vision tasks, but many existing\nprobabilistic models in vision cannot scale to the size of data sets now commonly available. Similarly,\ntext document analysis models (e.g., LDA [6]) almost exclusively address discrete-data problems.\nOur analysis covers such problems; for instance, a particular special case of our analysis is a hard\nversion of HDP topic modeling. We demonstrate the utility of our methods by exploring applications\nin text and vision.\nRelated Work: In the non-Bayesian setting, asymptotics for the expectation-maximization algorithm\nfor exponential family distributions were studied in [7]. The authors showed a connection between\nEM and a general k-means-like algorithm, where the squared Euclidean distance is replaced\nby the Bregman divergence corresponding to the exponential family distribution of interest. Our results\nmay be viewed as generalizing this approach to the Bayesian nonparametric setting. As discussed\nabove, our results may also be viewed as generalizing the approach of [3], where the asymptotics\nwere performed for the DP mixture with a Gaussian likelihood, leading to a k-means-like algorithm\nwhere the number of clusters is not fixed upfront. Note that our setting is considerably more\ninvolved than either of these previous works, particularly since we will require an appropriate technique\nfor computing an asymptotic marginal likelihood. 
Other connections between hard clustering\nand probabilistic models were explored in [8], which proposes a \u201cBayesian k-means\u201d algorithm by\nperforming a maximization-expectation algorithm.\n\n2 Background\nIn this section, we briefly review exponential family distributions, Bregman divergences, and the\nDirichlet process mixture model.\n\n2.1 The Exponential Family\nConsider the exponential family with natural parameter \u03b8 = {\u03b8j}_{j=1}^d \u2208 R^d; then the exponential\nfamily probability density function can be written as [9]:\n\np(x| \u03b8) = exp(\u27e8x, \u03b8\u27e9 \u2212 \u03c8(\u03b8) \u2212 h(x)),\n\nwhere \u03c8(\u03b8) = log \u222b exp(\u27e8x, \u03b8\u27e9 \u2212 h(x)) dx is the log-partition function. Here we assume for\nsimplicity that x is a minimal sufficient statistic for the natural parameter \u03b8. \u03c8(\u03b8) can be utilized to\ncompute the mean and covariance of p(x| \u03b8); in particular, the expected value is given by \u2207\u03c8(\u03b8),\nand the covariance is \u2207^2\u03c8(\u03b8).\nConjugate Priors: In a Bayesian setting, we will require a prior distribution over the natural parameter\n\u03b8. A convenient property of the exponential family is that a conjugate prior distribution of\n\u03b8 exists; in particular, given any specific distribution in the exponential family, the conjugate prior\ncan be parametrized as [11]:\n\np(\u03b8 | \u03c4, \u03b7) = exp(\u27e8\u03b8, \u03c4\u27e9 \u2212 \u03b7\u03c8(\u03b8) \u2212 m(\u03c4, \u03b7)).\n\nHere, the \u03c8(\u00b7) function is the same as that of the likelihood function. Given a data point xi, the\nposterior distribution of \u03b8 has the same form as the prior, with \u03c4 \u2192 \u03c4 + xi and \u03b7 \u2192 \u03b7 + 1.\nRelationship to Bregman Divergences: Let \u03c6 : S \u2192 R be a differentiable, strictly convex function\ndefined on a convex set S \u2286 R^d. 
The Bregman divergence for any pair of points x, y \u2208 S is defined\nas D\u03c6(x, y) = \u03c6(x) \u2212 \u03c6(y) \u2212 \u27e8x \u2212 y, \u2207\u03c6(y)\u27e9, and can be viewed as a generalized distortion measure.\nAn important result connecting Bregman divergences and exponential families was discussed\nin [7] (see also [10, 11]), where a bijection between the two was established. A key consequence\nof this result is that we can equivalently parameterize both p(x| \u03b8) and p(\u03b8 | \u03c4, \u03b7) in terms of the\nexpectation \u00b5:\n\np(x| \u03b8) = p(x| \u00b5) = exp(\u2212D\u03c6(x, \u00b5)) f\u03c6(x),\np(\u03b8 | \u03c4, \u03b7) = p(\u00b5| \u03c4, \u03b7) = exp(\u2212\u03b7 D\u03c6(\u03c4/\u03b7, \u00b5)) g\u03c6(\u03c4, \u03b7),\n\nwhere \u03c6(\u00b7) is the Legendre-conjugate function of \u03c8(\u00b7) (denoted as \u03c6 = \u03c8\u2217), f\u03c6(x) = exp(\u03c6(x) \u2212\nh(x)), and \u00b5 is the expectation parameter which satisfies \u00b5 = \u2207\u03c8(\u03b8) (and also \u00b5 = \u03b8\u2217). The\nBregman divergence representation provides a natural way to parametrize exponential family\ndistributions by their expectation parameters and, as we will see, it is convenient to work\nwith this form.\n\n2.2 Dirichlet Process Mixture Models\nThe Dirichlet Process (DP) mixture model is a Bayesian nonparametric mixture model [12]; unlike\nmost parametric mixture models (Bayesian or otherwise), the number of clusters in a DP mixture is\nnot fixed upfront. 
Using the exponential family parameterized by the expectation \u00b5c, the likelihood\nfor a data point can be expressed as the following infinite mixture:\n\np(x) = \u2211_{c=1}^{\u221e} \u03c0c p(x| \u00b5c) = \u2211_{c=1}^{\u221e} \u03c0c exp(\u2212D\u03c6(x, \u00b5c)) f\u03c6(x).\n\nEven though there are conceptually an infinite number of clusters, the nonparametric prior over the\nmixing weights causes the weights \u03c0c to decay exponentially. Moreover, a simple collapsed Gibbs\nsampler can be employed for performing inference in this model [13]; this Gibbs sampler will form\nthe basis of our asymptotic analysis. Given a data set {x1, ..., xn}, the state of the Markov chain\nis the set of cluster indicators {z1, ..., zn} as well as the cluster means of the currently-occupied\nclusters (the mixing weights have been integrated out). The Gibbs updates for zi, (i = 1, . . . , n),\nare given by the following conditional probabilities:\n\nP(zi = c | z\u2212i, xi, \u00b5) = [n\u2212i,c / (Z(n \u2212 1 + \u03b1))] p(xi | \u00b5c),\nP(zi = cnew | z\u2212i, xi, \u00b5) = [\u03b1 / (Z(n \u2212 1 + \u03b1))] \u222b p(xi | \u00b5) dG0,\n\nwhere Z is the normalizing constant, n\u2212i,c is the number of data points (excluding xi) that are\ncurrently assigned to cluster c, G0 is a prior over \u00b5, and \u03b1 is the concentration parameter that\ndetermines how likely we are to start a new cluster. If we choose to start a new cluster during the\nGibbs update, we sample its mean from the posterior distribution obtained from the prior distribution\nG0 and the single observation xi. 
After performing Gibbs moves on the cluster indicators, we update\nthe cluster means \u00b5c by sampling from the posterior of \u00b5c given the data points assigned to cluster c.\n\n3 Hard Clustering for Exponential Family DP Mixtures\nOur goal is to analyze what happens as we perform small-variance asymptotics on the exponential\nfamily DP mixture when running the collapsed Gibbs sampler described earlier, and we begin by\nconsidering how to scale the covariance in an exponential family distribution. Given an exponential\nfamily distribution p(x| \u03b8) with natural parameter \u03b8 and log-partition function \u03c8(\u03b8), consider a\nscaled exponential family distribution whose natural parameter is \u02dc\u03b8 = \u03b2\u03b8 and log-partition function\nis \u02dc\u03c8( \u02dc\u03b8) = \u03b2\u03c8( \u02dc\u03b8/\u03b2), where \u03b2 > 0. The following result characterizes the relationship between the\nmean and covariance of the original and scaled exponential family distributions.\nLemma 3.1. Denote \u00b5(\u03b8) as the mean, and cov(\u03b8) as the covariance, of p(x| \u03b8) with log-partition\n\u03c8(\u03b8). Given a scaled exponential family with \u02dc\u03b8 = \u03b2\u03b8 and \u02dc\u03c8( \u02dc\u03b8) = \u03b2\u03c8( \u02dc\u03b8/\u03b2), the mean \u02dc\u00b5( \u02dc\u03b8) of the\nscaled distribution is \u00b5(\u03b8) and the covariance, \u02dccov( \u02dc\u03b8), is cov(\u03b8)/\u03b2.\nThis lemma follows directly from\n\n\u02dc\u00b5( \u02dc\u03b8) = \u2207_{\u02dc\u03b8} \u02dc\u03c8( \u02dc\u03b8) = \u03b2 \u2207_{\u02dc\u03b8} \u03c8( \u02dc\u03b8/\u03b2) = \u2207_{\u03b8} \u03c8( \u02dc\u03b8/\u03b2) = \u2207_{\u03b8} \u03c8(\u03b8) = \u00b5(\u03b8),\n\nand\n\n\u02dccov( \u02dc\u03b8) = \u2207^2_{\u02dc\u03b8} \u02dc\u03c8( \u02dc\u03b8) = \u03b2 \u2207_{\u02dc\u03b8}(\u2207_{\u02dc\u03b8} \u03c8( \u02dc\u03b8/\u03b2)) = (1/\u03b2) \u00d7 \u2207^2_{\u03b8} \u03c8( \u02dc\u03b8/\u03b2) = (1/\u03b2) \u00d7 \u2207^2_{\u03b8} \u03c8(\u03b8) = cov(\u03b8)/\u03b2. 
It is perhaps intuitively simpler to observe what happens to the distribution using the\nBregman divergence representation. Recall that the generating function \u03c6 for the Bregman divergence\nis given by the Legendre-conjugate of \u03c8. Using standard properties of convex conjugates, we\nsee that the conjugate of \u02dc\u03c8 is simply \u02dc\u03c6 = \u03b2\u03c6. The Bregman divergence representation for the scaled\ndistribution is given by\n\np(x| \u02dc\u03b8) = p(x| \u02dc\u00b5) = exp(\u2212D\u02dc\u03c6(x, \u02dc\u00b5)) f\u02dc\u03c6(x) = exp(\u2212\u03b2D\u03c6(x, \u00b5)) f\u03b2\u03c6(x),\n\nwhere the last equality follows from Lemma 3.1 and the fact that, for a Bregman divergence,\nD\u03b2\u03c6(\u00b7,\u00b7) = \u03b2D\u03c6(\u00b7,\u00b7). Thus, as \u03b2 increases under the above scaling, the mean is fixed while the\ndistribution becomes increasingly concentrated around the mean.\nNext we consider the prior distribution under the scaled exponential family. When scaling by \u03b2, we\nalso need to scale the hyper-parameters \u03c4 and \u03b7, namely \u03c4 \u2192 \u03c4 /\u03b2 and \u03b7 \u2192 \u03b7/\u03b2. 
This gives the\nfollowing prior written using the Bregman divergence, where we are now explicitly conditioning on\n\u03b2:\n\np( \u02dc\u03b8 | \u03c4, \u03b7, \u03b2) = exp(\u2212(\u03b7/\u03b2) D\u02dc\u03c6((\u03c4/\u03b2)/(\u03b7/\u03b2), \u00b5)) g\u02dc\u03c6(\u03c4/\u03b2, \u03b7/\u03b2) = exp(\u2212\u03b7 D\u03c6(\u03c4/\u03b7, \u00b5)) g\u02dc\u03c6(\u03c4/\u03b2, \u03b7/\u03b2).\n\nFinally, we compute the marginal likelihood for x by integrating out \u02dc\u03b8, as it will be necessary for\nthe Gibbs sampler. Standard algebraic manipulations yield the following:\n\np(x| \u03c4, \u03b7, \u03b2) = \u222b p(x| \u02dc\u03b8) \u00d7 p( \u02dc\u03b8 | \u03c4, \u03b7, \u03b2) d\u02dc\u03b8\n= f\u02dc\u03c6(x) \u00b7 g\u02dc\u03c6(\u03c4/\u03b2, \u03b7/\u03b2) \u00b7 A(\u02dc\u03c6,\u03c4,\u03b7,\u03b2)(x) \u222b exp(\u2212(\u03b2 + \u03b7) D\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u02dc\u00b5( \u02dc\u03b8))) d\u02dc\u03b8\n= f\u02dc\u03c6(x) \u00b7 g\u02dc\u03c6(\u03c4/\u03b2, \u03b7/\u03b2) \u00b7 A(\u02dc\u03c6,\u03c4,\u03b7,\u03b2)(x) \u00b7 \u03b2^d \u00b7 \u222b exp(\u2212(\u03b2 + \u03b7) D\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u00b5(\u03b8))) d\u03b8.   (1)\n\nHere, A(\u02dc\u03c6,\u03c4,\u03b7,\u03b2)(x) = exp(\u2212(\u03b2\u03c6(x) + \u03b7\u03c6(\u03c4/\u03b7) \u2212 (\u03b2 + \u03b7)\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7)))), which arises when combining\nthe Bregman divergences from the 
likelihood and the prior.\nNow we make the following key insight, which will allow us to perform the necessary asymptotics.\nWe can write the integral from the last line above (denoted I below) via Laplace\u2019s method. Since\nD\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u00b5) has a local minimum (which is global in this case) at \u02c6\u03b8 = \u02c6\u00b5\u2217 = ((\u03b2x + \u03c4)/(\u03b2 + \u03b7))\u2217, we have:\n\nI = exp(\u2212(\u03b2 + \u03b7) D\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u02c6\u00b5)) (2\u03c0/(\u03b2 + \u03b7))^{d/2} |\u2202^2 D\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u02c6\u00b5) / \u2202\u03b8\u2202\u03b8^T|^{\u22121/2} + O(1/\u03b2)\n= (2\u03c0/(\u03b2 + \u03b7))^{d/2} |\u2202^2 D\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u02c6\u00b5) / \u2202\u03b8\u2202\u03b8^T|^{\u22121/2} + O(1/\u03b2),   (2)\n\nwhere \u2202^2 D\u03c6((\u03b2x + \u03c4)/(\u03b2 + \u03b7), \u02c6\u00b5) / \u2202\u03b8\u2202\u03b8^T = cov( \u02c6\u03b8) is the covariance matrix of the likelihood function instantiated at \u02c6\u03b8\nand approaches cov(x\u2217) when \u03b2 goes to \u221e. Note that the exponential term equals one since the\ndivergence inside is 0.\n3.1 Asymptotic Behavior of the Gibbs Sampler\nWe now have the tools to consider the Gibbs sampler for the exponential family DP mixture as we\nlet \u03b2 \u2192 \u221e. As we will see, we will obtain a general k-means-like hard clustering algorithm which\nutilizes the appropriate Bregman divergence in place of the squared Euclidean distance, and also can\nvary the number of clusters. 
Recall the conditional probabilities for performing Gibbs moves on the\ncluster indicators zi, where we now are considering the scaled distributions:\n\nP(zi = c | z\u2212i, xi, \u03b2, \u00b5) = (n\u2212i,c / Z) exp(\u2212\u03b2D\u03c6(xi, \u00b5c)) f\u02dc\u03c6(xi),\nP(zi = cnew | z\u2212i, xi, \u03b2, \u00b5) = (\u03b1 / Z) p(xi | \u03c4, \u03b7, \u03b2),\n\nwhere Z is a normalization factor, and the marginal probability p(xi | \u03c4, \u03b7, \u03b2) is given by the derivations\nin (1) and (2). Now, we consider the asymptotic behavior of these probabilities as \u03b2 \u2192 \u221e. We\nnote that\n\nlim_{\u03b2\u2192\u221e} (\u03b2xi + \u03c4)/(\u03b2 + \u03b7) = xi,   and   lim_{\u03b2\u2192\u221e} A(\u02dc\u03c6,\u03c4,\u03b7,\u03b2)(xi) = exp(\u2212\u03b7(\u03c6(\u03c4/\u03b7) \u2212 \u03c6(xi))),\n\nand that the Laplace approximation error term goes to zero as \u03b2 \u2192 \u221e. Further, we define \u03b1 as a\nfunction of \u03b2, \u03b7, and \u03c4 (but independent of the data):\n\n\u03b1 = (g\u02dc\u03c6(\u03c4/\u03b2, \u03b7/\u03b2) \u00b7 (2\u03c0/(\u03b2 + \u03b7))^{d/2} \u00b7 \u03b2^d)^{\u22121} \u00b7 exp(\u2212\u03b2\u03bb),\n\nfor some \u03bb. After canceling out the f\u02dc\u03c6(xi) terms from all probabilities, we can then write the Gibbs\nprobabilities as\n\nP(zi = c | z\u2212i, xi, \u03b2, \u00b5) = n\u2212i,c \u00b7 exp(\u2212\u03b2D\u03c6(xi, \u00b5c)) / (Cxi \u00b7 exp(\u2212\u03b2\u03bb) + \u2211_{j=1}^{k} n\u2212i,j \u00b7 exp(\u2212\u03b2D\u03c6(xi, \u00b5j))),\nP(zi = cnew | z\u2212i, xi, \u03b2, \u00b5) = Cxi \u00b7 exp(\u2212\u03b2\u03bb) / (Cxi \u00b7 exp(\u2212\u03b2\u03bb) + \u2211_{j=1}^{k} n\u2212i,j \u00b7 exp(\u2212\u03b2D\u03c6(xi, \u00b5j))),\n\nwhere Cxi approaches a positive, finite constant for a given xi as \u03b2 \u2192 \u221e. 
Now, all of the above\nprobabilities will become binary as \u03b2 \u2192 \u221e. More specifically, all the k + 1 values will be increasingly\ndominated by the smallest value of {D\u03c6(xi, \u00b51), . . . , D\u03c6(xi, \u00b5k), \u03bb}. As \u03b2 \u2192 \u221e, only\nthe smallest of these values will receive a non-zero probability. That is, the data point xi will be\nassigned to the nearest cluster with a divergence at most \u03bb. If the closest mean has a divergence\ngreater than \u03bb, we start a new cluster containing only xi.\nNext, we show that sampling \u00b5c from the posterior distribution is achieved by simply computing\nthe empirical mean of a cluster in the limit. During Gibbs sampling, once we have performed one\ncomplete set of Gibbs moves on the cluster assignments, we need to sample the \u00b5c conditioned on\nall assignments and observations. If we let nc be the number of points assigned to cluster c, then the\nposterior distribution (parameterized by the expectation parameter) for cluster c is\n\np(\u00b5c | X, z, \u03c4, \u03b7, \u03b2) \u221d p(Xc | \u00b5c, \u03b2) \u00d7 p(\u00b5c | \u03c4, \u03b7, \u03b2) \u221d exp(\u2212(\u03b2nc + \u03b7) D\u03c6((\u2211_{i=1}^{nc} \u03b2x^c_i + \u03c4)/(\u03b2nc + \u03b7), \u00b5c)),\n\nwhere X is all the data, Xc = {x^c_1, ..., x^c_{nc}} is the set of points currently assigned to cluster c, and z\nis the set of all current assignments. We can see that the mass of the posterior distribution becomes\nconcentrated around the sample mean (1/nc) \u2211_{i=1}^{nc} x^c_i as \u03b2 \u2192 \u221e. In other words, after we determine the\nassignments of data points to clusters, we update the means as the sample mean of the data points in\neach cluster. This is equivalent to the standard k-means cluster mean update step.\n3.2 Objective function and algorithm\nFrom the above asymptotic analysis of the Gibbs sampler, we observe a new algorithm which can\nbe utilized for hard clustering. 
It is as simple as the popular k-means algorithm, but also provides\nthe ability to adapt the number of clusters depending on the data as well as incorporate different\ndistortion measures. The algorithm description is as follows:\n\n\u2022 Initialization: input data x1, . . . , xn, \u03bb > 0, and \u00b51 = (1/n) \u2211_{i=1}^{n} xi.\n\u2022 Assignment: for each data point xi, compute the Bregman divergence D\u03c6(xi, \u00b5c) to all\nexisting clusters. If min_c D\u03c6(xi, \u00b5c) \u2264 \u03bb, then zi,c0 = 1 where c0 = argmin_c D\u03c6(xi, \u00b5c);\notherwise, start a new cluster and set zi,cnew = 1;\n\u2022 Mean Update: compute the cluster mean for each cluster, \u00b5j = (1/|lj|) \u2211_{x\u2208lj} x, where lj is\nthe set of points in the j-th cluster.\n\nWe iterate between the assignment and mean update steps until local convergence. Note that the\ninitialization used here, placing all data points into a single cluster, is not necessary, but is one\nnatural way to initialize the algorithm. Also note that the algorithm depends heavily on the choice\nof \u03bb; heuristics for selecting \u03bb were briefly discussed for the Gaussian case in [3], and we will follow\nthis approach (generalized in the obvious way to Bregman divergences) for our experiments.\n\nWe can easily show that the underlying objective function for our algorithm is quite similar to that\nin [3], replacing the squared Euclidean distance with an appropriate Bregman divergence. Recall\nthat the squared Euclidean distance is the Bregman divergence corresponding to the Gaussian distribution.\nThus, the objective function in [3] can be seen as a special case of our work. 
The objective\nfunction optimized by our derived algorithm is the following:\n\nmin_{{lc}_{c=1}^{k}} \u2211_{c=1}^{k} \u2211_{x\u2208lc} D\u03c6(x, \u00b5c) + \u03bbk,   (3)\n\nwhere k is the total number of clusters, \u03c6 is the conjugate function of the log-partition function of\nthe chosen exponential family distribution, and \u00b5c is the sample mean of cluster c. The penalty term\n\u03bb controls the tradeoff between the likelihood and the model complexity, where a large \u03bb favors\nsmall model complexity (i.e., fewer clusters) while a small \u03bb favors more clusters. Given the above\nobjective function, our algorithm can be shown to monotonically decrease the objective function\nvalue until convergence to a local minimum. We omit the proof here as it is almost identical to the\nproof of Theorem 3.1 in [3].\n\n4 Extension to Hierarchies\nA key benefit of the Bayesian approach is its natural ability to form hierarchical models. In the context\nof clustering, a hierarchical mixture allows one to cluster multiple groups of data: each group\nis clustered into a set of local clusters, but these local clusters are shared among the groups (i.e.,\nsets of local clusters across groups form global clusters, with a shared global mean). For Bayesian\nnonparametric mixture models, one way of achieving such hierarchies arises via the hierarchical\nDirichlet Process (HDP) [4], which provides a nonparametric approach to allow sharing of clusters\namong a set of DP mixtures.\nIn this section, we will briefly sketch out the extension of our analysis to the HDP mixture, which\nyields a natural extension of our methods to groups of data. Given space considerations, and the fact\nthat the resulting algorithm turns out to reduce to Algorithm 2 from [3] with the squared Euclidean\ndistance replaced by an appropriate Bregman divergence, we will omit the full specification of the\nalgorithm here. 
However, despite the similarity to the existing Gaussian case, we do view the extension\nto hierarchies as a promising application of our analysis. In particular, our approach opens\nthe door to hard hierarchical algorithms over discrete data, such as text, and we briefly discuss an\napplication of our derived algorithm to topic modeling.\nWe assume that there are J data sets (groups) which we index by j = 1, ..., J. Data point xij\nrefers to data point i from set j. The HDP model can be viewed as clustering each data set into\nlocal clusters, but where each local cluster is associated to a global mean. Global means may be\nshared across data sets. When performing the asymptotics, we require variables for the global means\n(\u00b51, ..., \u00b5g), the associations of data points to local clusters, zij, and the associations of local clusters\nto global means, vjt, where t indexes the local clusters for a data set. A standard Gibbs sampler\nconsiders updates on all of these variables, and in the nonparametric setting does not fix the number\nof local or global clusters.\nThe tools from the previous section may be nearly directly applied to the hierarchical case. As\nopposed to the flat model, the hard HDP requires two parameters: a value \u03bbtop which is utilized\nwhen starting a global (top-level) cluster, and a value \u03bbbottom which is utilized when starting a local\ncluster. The resulting hard clustering algorithm first performs local assignment moves on the zij,\nthen updates the local cluster assignments, and finally updates all global means.\nThe resulting objective function that is monotonically minimized by our algorithm is given as follows:\n\nmin_{{lc}_{c=1}^{k}} \u2211_{c=1}^{k} \u2211_{xij\u2208lc} D\u03c6(xij, \u00b5c) + \u03bbbottom t + \u03bbtop k,   (4)\n\nwhere k is the total number of global clusters and t is the total number of local clusters. 
The bottom-level\npenalty term \u03bbbottom controls both the number of local and top-level clusters, where larger\n\u03bbbottom tends to give fewer local clusters and more top-level clusters. Meanwhile, the top-level\npenalty term \u03bbtop, as in the one-level case, controls the tradeoff between the likelihood and model\ncomplexity.\n\nFigure 1: (Left) Example images from the ImageNet data (Persian cat and elephant categories). Each\nimage is represented via a discrete visual-bag-of-words histogram. Clustering via an asymptotic\nmultinomial DP mixture considerably outperforms the asymptotic Gaussian DP mixture; see text\nfor details. (Right) Elapsed time per iteration in seconds of our topic modeling algorithm when\nrunning on the NIPS data, as a function of the number of topics.\n\n5 Experiments\nWe conclude with a brief set of experiments highlighting applications of our analysis to discrete-data\nproblems, namely image clustering and topic modeling. For all experiments, we randomly permute\nthe data points at each iteration, as this tends to improve results (as discussed previously, unlike\nstandard k-means, the order in which the data points are processed impacts the resulting clusters).\n\nImage Clustering. We first explore an application of our techniques to image clustering, focusing\non the ImageNet data [14]. We utilize a subset of this data for quantitative experiments, sampling\n100 images from each of 10 different categories of this data set (Persian cat, African elephant, fire engine,\nmotor scooter, wheelchair, park bench, cello, French horn, television, and goblet), for a total of 1000\nimages. Each image is processed via the standard visual-bag-of-words pipeline: SIFT descriptors are computed densely\nover patches of the image, and the resulting SIFT vectors are quantized into 1000 visual words.\nWe use the resulting histograms as our discrete representation for an image, as is standard. 
Some\nexample images from this data set are shown in Figure 1.\nWe explore whether the discrete version of our hard clustering algorithm based on a multinomial\nDP mixture outperforms the Gaussian mixture version (i.e., DP-means); this will validate our generalization\nbeyond the Gaussian setting. For both the Gaussian and multinomial cases, we utilize a\nfarthest-first approach both for selecting \u03bb and for initializing the clusters (see [3] for a discussion\nof farthest-first for selecting \u03bb).\nWe compute the normalized mutual information (NMI) between the true clusters and the results of\nthe two algorithms on this difficult data set. The Gaussian version performs poorly, achieving an\nNMI of .06 on this data, whereas the hard multinomial version achieves a score of .27. While the\nmultinomial version is far from perfect, it performs significantly better than DP-means. Scalability\nto large data sets is clearly feasible, given that the method scales linearly in the number of data\npoints. Note that comparisons between the Gibbs sampler and the corresponding hard clustering\nalgorithm for the Gaussian case were considered in [3], where experiments on several data sets\nshowed comparable clustering accuracy results between the sampler and the hard clustering method.\nFurthermore, for a fully Bayesian model that places a prior on the concentration parameter, the\nsampler was shown to be considerably slower than the corresponding hard clustering method. Given\nthe similarity of the sampler for the Gaussian and multinomial case, we expect similar behavior\nwith the multinomial Gibbs sampler.\n\nIllustration: Scalable Hard Topic Models. We also highlight an application to topic modeling,\nby providing some qualitative results over two common document collections. Utilizing our general\nalgorithm for a hard version of the multinomial HDP is straightforward. 
To apply the hard\nhierarchical algorithm to topic modeling, we simply use the discrete KL-divergence in the hard\nexponential family HDP, since topic modeling for text uses a multinomial distribution for the data\nlikelihood.\nTo test topic modeling using our asymptotic approach, we performed analyses using the NIPS 1-12\n(http://www.cs.nyu.edu/~roweis/data.html) and the NYTimes [15] datasets. For the NIPS dataset, we use\nthe whole dataset, which contains 1,740 documents, 13,649 words in the vocabulary, and 2,301,375 total\nwords. For the NYTimes dataset, we randomly sampled 2,971 documents with 10,171 vocabulary words and\n853,451 words in total; we also eliminated low-frequency words (those with fewer than ten occurrences).\nThe prevailing metric for the goodness of topic models is perplexity; however, this is based on the predictive\nprobability, which has no counterpart in the hard clustering case. Furthermore, ground truth for topic\nmodels is difficult to obtain. This makes quantitative comparisons difficult for topic modeling, and\nso we focus on qualitative results. Some sample topics (with the corresponding top 15\nterms) discovered by our approach from both the NIPS and NYTimes datasets are given in Table 1;\nthe topics appear to be quite reasonable. We also highlight the scalability of our\napproach: the number of iterations needed for convergence on these data sets ranges from 13 to 25,\nand each iteration completes in under one minute (see the right side of Figure 1). In contrast, for\nsampling methods, it is notoriously difficult to detect convergence, and generally a large number of\niterations is required. Thus, we expect this approach to scale favorably to large data sets.\n\nTable 1: Sample topics inferred from the NIPS and NYTimes datasets by our hard multinomial HDP\nalgorithm.\n\nNIPS\n1. neurons, memory, patterns, activity, response, neuron, stimulus, firing, cortex, recurrent, pattern, spike, stimuli, delay, responses\n2. neural, networks, state, weight, states, results, synaptic, threshold, large, time, systems, activation, small, work, weights\n3. training, hidden, recognition, layer, performance, probability, parameter, error, speech, class, weights, trained, algorithm, approach, order\n4. cells, visual, cell, orientation, cortical, connection, receptive, field, center, tuning, low, ocular, present, dominance, fields\n5. energy, solution, methods, function, solutions, local, equations, minimum, hopfield, temperature, adaptation, term, optimization, computational, procedure\n6. noise, classifier, classifiers, note, margin, noisy, regularization, generalization, hypothesis, multiclasses, prior, cases, boosting, fig, pattern\n\nNYTimes\n1. team, game, season, play, games, point, player, coach, win, won, guy, played, playing, record, final\n2. percent, campaign, money, fund, quarter, federal, public, pay, cost, according, income, half, term, program, increase\n3. president, power, government, country, peace, trial, public, reform, patriot, economic, past, clear, interview, religious, early\n4. family, father, room, line, shares, recount, told, mother, friend, speech, expression, won, offer, card, real\n5. company, companies, stock, market, business, billion, firm, computer, analyst, industry, internet, chief, technology, customer, number\n6. right, human, decision, need, leadership, foundation, number, question, country, strike, set, called, support, law, train\n\n6 Conclusion\nWe considered a general small-variance asymptotic analysis for the exponential family DP and\nHDP mixture models. Crucially, this analysis allows us to move beyond the Gaussian distribution\nin such models, and opens the door to new clustering applications, such as those involving discrete\ndata. 
Our analysis uses connections between Bregman divergences and exponential families,\nand results in a simple and scalable hard clustering algorithm that may be viewed as generalizing\nexisting non-Bayesian Bregman clustering algorithms [7] as well as the DP-means algorithm [3].\nDue to the prevalence of discrete data in modern computer vision and information retrieval, we\nhope our algorithms will find use in a variety of large-scale data analysis tasks. We plan to\ncontinue to focus on the difficult problem of quantitative evaluations comparing probabilistic and\nnon-probabilistic methods for clustering, particularly for topic models. We also plan to compare\nour algorithms with recent online inference schemes for topic modeling, particularly the online\nLDA [16] and online HDP [17] algorithms.\n\nAcknowledgements. This work was supported by NSF award IIS-1217433 and by the ONR under\ngrant number N00014-11-1-0688.\n\nReferences\n[1] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611\u2013622, 1999.\n\n[2] S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, 1998.\n\n[3] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning, 2012.\n\n[4] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581, 2006.\n\n[5] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.\n\n[6] D. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n\n[7] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. 
Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n\n[8] K. Kurihara and M. Welling. Bayesian k-means as a \u201cMaximization-Expectation\u201d algorithm. Neural Computation, 21(4):1145\u20131172, 2008.\n\n[9] O. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley Publishers, 1978.\n\n[10] J. Forster and M. K. Warmuth. Relative expected instantaneous loss bounds. In Proceedings of the 13th Conference on Computational Learning Theory, 2000.\n\n[11] A. Agarwal and H. Daume. A geometric view of conjugate priors. Machine Learning, 81(1):99\u2013113, 2010.\n\n[12] N. Hjort, C. Holmes, P. Mueller, and S. Walker. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, Cambridge, UK, 2010.\n\n[13] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249\u2013265, 2000.\n\n[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.\n\n[15] A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.\n\n[16] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, 2010.\n\n[17] C. Wang, J. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.", "award": [], "sourceid": 1454, "authors": [{"given_name": "Ke", "family_name": "Jiang", "institution": null}, {"given_name": "Brian", "family_name": "Kulis", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}