{"title": "Coupling Nonparametric Mixtures via Latent Dirichlet Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 55, "page_last": 63, "abstract": "Mixture distributions are often used to model complex data. In this paper, we develop a new method that jointly estimates mixture models over multiple data sets by exploiting the statistical dependencies between them. Specifically, we introduce a set of latent Dirichlet processes as sources of component models (atoms), and for each data set, we construct a nonparametric mixture model by combining sub-sampled versions of the latent DPs. Each mixture model may acquire atoms from different latent DPs, while each atom may be shared by multiple mixtures. This multi-to-multi association distinguishes the proposed method from prior constructions that rely on tree or chain structures, allowing mixture models to be coupled more flexibly. In addition, we derive a sampling algorithm that jointly infers the model parameters and present experiments on both document analysis and image modeling.", "full_text": "Coupling Nonparametric Mixtures via\n\nLatent Dirichlet Processes\n\nDahua Lin\nMIT CSAIL\n\ndhlin@mit.edu\n\nJohn Fisher\nMIT CSAIL\n\nfisher@csail.mit.edu\n\nAbstract\n\nMixture distributions are often used to model complex data. In this paper, we de-\nvelop a new method that jointly estimates mixture models over multiple data sets\nby exploiting the statistical dependencies between them. Speci\ufb01cally, we intro-\nduce a set of latent Dirichlet processes as sources of component models (atoms),\nand for each data set, we construct a nonparametric mixture model by combining\nsub-sampled versions of the latent DPs. 
Each mixture model may acquire atoms\nfrom different latent DPs, while each atom may be shared by multiple mixtures.\nThis multi-to-multi association distinguishes the proposed method from previous\nones that require the model structure to be a tree or a chain, allowing more \ufb02exible\ndesigns. We also derive a sampling algorithm that jointly infers the model param-\neters and present experiments on both document analysis and image modeling.\n\n1\n\nIntroduction\n\nMixture distributions have been widely used for statistical modeling of complex data. Classical\nformulations specify the number of components a priori, leading to dif\ufb01culties in situations where\nthe number is either unknown or hard to estimate in advance. Bayesian nonparametric models,\nnotably those based on Dirichlet processes (DPs) [14, 16], have emerged as an important method to\naddress this issue. The basic idea of DP mixture models is to use a sample of a DP, which is itself a\ndistribution over a countably in\ufb01nite set, as the prior for component parameters.\nOne signi\ufb01cant assumption underlying a DP mixture model is that observations are in\ufb01nitely ex-\nchangeable. This assumption does not hold in the cases with multiple groups of data, where sam-\nples in different groups are generally not exchangeable. Among various approaches to this issue,\nhierarchical Dirichlet processes (HDPs) [20], which organize DPs into a tree with parents acting as\nthe base measure for children, is one of the most popular. HDPs have been extended in a variety of\nways. Kim and Smyth [9] incorporated group-speci\ufb01c random perturbations, allowing component\nparameters to vary across different groups. Ren et al. [17] proposed dynamic HDPs, which combine\nthe DP at a previous time step with a new one at the current time step.\nOther methods have also been developed. MacEachern [13] proposed a DDP model that allows pa-\nrameters to vary following a stochastic process. 
Grif\ufb01n and Steel [6] proposed the order-based DDP,\nwhere atoms can be weighted differently via the permutation of the Beta variables for stick-breaking.\nChung and Dunson [3] carried this approach further, using local predictors to select subsets of atoms.\nRecently, the connections between Poisson, Gamma, and Dirichlet processes have been exploited.\nRao and Teh [15] proposed the spatially normalized Gamma process, where a set of dependent DPs\ncan be derived by normalizing restricted projections of an auxiliary Gamma process over overlap-\nping sub-regions. Lin et al [12] proposed a new construction of dependent DPs, which supports\ndynamic evolution of a DP through operations on the underlying Poisson processes.\nOur primary goal here is to describe multiple groups of data through coupled mixture models. Shar-\ning statistical properties across different groups allows for more reliable model estimation, especially\n\n1\n\n\fwhen the observed samples in each group are limited or noisy. From a probabilistic standpoint, this\nframework can be obtained by devising a joint stochastic process that generates DPs with mutual\ndependency. Particularly, it is desirable to have a design that satis\ufb01es three properties: (1) Sharing\nof mixture components (atoms) between groups. (2) The marginal distribution of atoms for each\ngroup remains a DP. (3) Flexible con\ufb01guration of inter-group dependencies. For example, the prior\nweight of a common atom can vary across groups.\nAchieving these goals simultaneously is nontrivial. Whereas several existing constructions [3,6, 12,\n15] meet the \ufb01rst two properties, they impose restrictions on the model structure (e.g. the groups need\nto be arranged into a tree or a chain). We present a new framework to address this issue. Speci\ufb01cally,\nwe express mixture models for each group as a stochastic combination over a set of latent DPs. 
The multi-to-multi association between data groups and latent DPs provides much greater flexibility in model configuration than prior work (we provide a detailed comparison in section 3.2). We also derive an MCMC sampling method to infer model parameters from grouped observations.

2 Background

We provide a review of Dirichlet processes in order to lay the theoretical foundations of the method described herein. We also discuss the related construction of dependent DPs proposed by [12], which exploits the connection between Poisson and Dirichlet processes to support various operations.

A Dirichlet process, denoted by DP(αB), is a distribution over probability measures, characterized by a concentration parameter α and a base measure B over an underlying space Ω. Each sample path D ∼ DP(αB) is itself a distribution over Ω. Sethuraman [18] showed that D is almost surely discrete (with countably infinite support), and can be expressed as

D = ∑_{k=1}^∞ πk δφk, with πk = vk ∏_{l=1}^{k−1} (1 − vl), vk ∼ Beta(1, α).   (1)

This is known as the stick-breaking representation of a DP. This discrete nature makes a DP particularly suited to serve as a prior for component parameters in mixture models.

Generally, in a DP mixture model, each data sample xi is considered to be generated from a component model with parameter θi, denoted by G(θi). The component parameters are samples from D, which is itself a realization of a DP. The formulation is given below:

D ∼ DP(αB), θi ∼ D, xi ∼ G(θi).   (2)

As D is an infinite series, it is infeasible to instantiate D. As such, the Chinese restaurant process, given by Eq. 3, is often used to directly sample the component parameters, with D integrated out. In Eq.(2), the index i ranges over the samples, i = 1, . . .
, n.

p(θi | θ/i) = ∑_{k=1}^{K/i} m/i(k) / (α + n − 1) · δφk + α / (α + n − 1) · B.   (3)

Here, θ/i denotes all component parameters except θi, K/i denotes the number of distinct atoms among them, and m/i(k) denotes the number of occurrences of the atom φk. When xi is given, the likelihood of generating xi conditioned on θi can be incorporated, resulting in a modulated sampling scheme, described below. Let f(xi; φ) denote the likelihood of generating xi w.r.t. G(φ), and f(xi; B) the marginal likelihood w.r.t. the parameter prior B. Then, with probability proportional to m/i(k) f(xi; φk), we set θi = φk, and with probability proportional to α f(xi; B), we draw a new atom from B(·|xi), the posterior parameter distribution given xi.

Recently, Lin et al. [12] proposed a new construction of DPs based on the connections between Poisson, Gamma, and Dirichlet processes. The construction provides three operations to derive new DPs from existing ones, which we will use to develop the coupled DP model. Here, we provide a brief review of these operations.

(1) Superposition. Let Dk ∼ DP(αk Bk) for k = 1, . . . , K be independent DPs and (c1, . . . , cK) ∼ Dir(α1, . . . , αK). Then the stochastic convex combination of these DPs as below remains a DP:

c1 D1 + · · · + cK DK ∼ DP(α1 B1 + · · · + αK BK).   (4)

Figure 2: The reformulated model for Gibbs sampling contains latent DPs, groups of data, and atoms. Each sample xti is attached a label zti that assigns it an atom φzti. To generate zti, we draw a latent DP (from Mult(ct)) and choose a label therefrom.
In sampling, Hs is integrated out, resulting in mutual dependency among the zti, as in the Chinese restaurant process.

Figure 1: The graphical model of the coupled DP formulation in a case with four groups and two latent DPs. Each mixture model Dt inherits atoms from Hs with probability qts, resulting in Eq.(7).

(2) Sub-sampling. Let D = ∑_{k=1}^∞ πk δφk ∼ DP(αB). One obtains a new DP by sub-sampling D via independent Bernoulli trials. Given a sub-sampling probability q, one draws a binary value rk with Pr(rk = 1) = q for each atom φk to decide whether to retain it, resulting in a DP:

Sq(D) ≜ ∑_{k: rk=1} π′k δφk ∼ DP(αqB).   (5)

Here, Sq denotes the sub-sampling operation (with probability q), and π′k is the re-normalized coefficient for φk, given by π′k = πk / ∑_{k′} rk′ πk′.

(3) Transition. Given D = ∑_{k=1}^∞ πk δφk ∼ DP(αB), perturbing the locations of the atoms following a probabilistic transition kernel T also yields a new DP, given by T(D) ≜ ∑_{k=1}^∞ πk δT(φk).

While these operations were originally developed to evolve a DP along a Markov chain, we show in the next section that they can also be utilized to construct models with different structures.

3 Coupled Nonparametric Mixture Models

Our primary goal is to develop a joint formulation over group-wise DP mixture models, where components are shared across different groups while the weights and parameters of shared components vary across groups. We propose a new construction, illustrated in Figure 1. Suppose there are M groups of data, each with a mixture model. They are coupled by ML latent DPs.
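Before the formal definition, the coupling can be previewed with a truncated stick-breaking sketch (an illustrative finite approximation, not the exact infinite-dimensional construction; all names, the truncation level, and the parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K=50):
    """Truncated stick-breaking weights for a draw from DP(alpha * B)."""
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return w / w.sum()  # renormalize the finite truncation

ML, M, K = 2, 4, 50
alphas = np.full(ML, 1.0)                    # concentration parameters alpha_s
q = rng.uniform(0.2, 1.0, size=(M, ML))      # inheritance probabilities q_ts

# Latent DPs H_s: weights over K shared atoms phi_1..phi_K (atom values omitted).
H = np.stack([stick_breaking(alphas[s], K) for s in range(ML)])

# Each group t: Bernoulli(q_ts) sub-sampling of each H_s, renormalization,
# then a stochastic convex combination with Dirichlet(alpha_s * q_ts) weights.
D = []
for t in range(M):
    c = rng.dirichlet(alphas * q[t])
    Dt = np.zeros(K)
    for s in range(ML):
        keep = rng.random(K) < q[t, s]       # indicators r_k
        if keep.any():
            Dt += c[s] * H[s] * keep / (H[s] * keep).sum()
    D.append(Dt / Dt.sum())
```

Each resulting `D[t]` is a probability vector over the shared atoms; atoms dropped by sub-sampling in every source simply receive zero weight in that group.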
The generative formulation is then described as follows. First, generate ML latent DPs independently, as

Hs ∼ DP(αs B), for s = 1, . . . , ML.   (6)

Second, generate M dependent DPs, one for each group of data, by combining the sub-sampled versions of the latent DPs through stochastic convex combination. For each t = 1, . . . , M,

Dt = ∑_{s=1}^{ML} cts Sqts(Hs), with (ct1, . . . , ctML) ∼ Dir(α1 qt1, . . . , αML qtML).   (7)

Intuitively, for each group of data (say the t-th), we choose a subset of atoms from each latent source and bring them together to generate Dt. Here, qts is the prior probability that an atom in Hs will be inherited by Dt. Note that this formulation can be further extended to Dt = ∑_s cts Tt(Sqts(Hs)), where Tt is a probabilistic transition kernel. Using the transition operation, this extension allows parameters to vary across different groups. In particular, when the atom φk is inherited by Dt, the atom parameter would be an adapted version drawn from Tt(φk, ·) instead of φk itself.

Third, generate the component parameters and data samples in the standard way, as

θt,i | Dt ∼ Dt, and xt,i | θt,i ∼ G(θt,i), for i = 1, . . . , nt, t = 1, . . . , M.   (8)

Here, xt,i is the i-th data sample in the t-th group, and θt,i is the associated atom parameter.

3.1 Theoretical Analysis

The following theorems (proofs provided in the supplementary material) demonstrate that, as a result of the construction above, the marginal distribution of Dt is a DP:

Theorem 1. The stochastic process Dt given by Eq.(7) has Dt ∼ DP(βt B), with βt = ∑_{s=1}^{ML} αs qts.

We also show that they are dependent, with the covariance given by the theorem below.

Theorem 2.
Let t1 ≠ t2 and U be a measurable subset of Ω. Then

Cov(Dt1(U), Dt2(U)) = 1/(βt1 βt2) ∑_{s=1}^{ML} (αs qt1s qt2s)² / (αs qt1s qt2s + 1) · B(U)(1 − B(U)).   (9)

It can be seen that the hyper-parameters influence the model characteristics in different ways. The inheritance probabilities (i.e. the q-values) control how closely the models are coupled. Two models are strongly coupled if there exists a subset of latent DPs from which both inherit atoms with high probability, while their coupling is much weaker if the associated q-values are set differently. The latent concentration parameters (i.e. the values of αs) control how frequently new atoms are created. Generally, higher values of αs lead to more atoms being associated with the data, resulting in finer clusters. Another important factor is ML, the number of latent DPs. A large number of latent DPs provides fine-grained control of the model configuration, at the cost of increased complexity.

3.2 Comparison with Other Models

We review related approaches and discuss how they differ from the method proposed here. Similar to this work, HDPs [20] model grouped data. However, such models must be arranged into a tree, i.e. each child can have only one parent. Our model allows the mixture model for each group to inherit from multiple sources, making it applicable to more general contexts.

It is worth emphasizing that enabling inheritance from multiple parents is not just a straightforward extension, as it entails both theoretical and practical challenges. First, combining atoms from multiple DPs while guaranteeing that the resultant process remains a DP requires careful design of the formulation (e.g. the combination coefficients should come from a Dirichlet distribution, and each parent DP should be properly sub-sampled).
Second, the sampling procedure has to determine the source of each atom, which, again, is nontrivial and needs special algorithmic design (see section 4) to maintain detailed balance.

SNΓP [15] defines a Gamma process G over an extended space. For each group t, a DP Dt is derived through normalized restriction of G to a measurable subset, and the DPs derived on overlapping subsets are dependent. Though motivated differently, this construction can be reduced to a formulation of the form Dt = ∑_{j∈Rt} ctj Hj, where Rt is the subset of latent DPs used for Dt. Compared to Eq.(7), it is essentially a special case of the present construction without sub-sampling (i.e. all q-values equal 1). Consequently, the combination coefficients have to satisfy (ctj)_{j∈Rt} ∼ Dir((αj)_{j∈Rt}), implying that the relative weights of two latent sources are restricted to be the same in all groups that inherit from both. In contrast, the approach here allows the weights of the latent DPs to vary across groups. Also, SNΓP does not allow atom parameters to vary across groups.

4 Sampling Algorithm

This section introduces a Gibbs sampling algorithm to jointly estimate the mixture models of multiple groups. Overall, this algorithm is an extension of the Chinese restaurant process, with several new aspects: (1) The conditional probability of a label depends on the total number of samples associated with its atom over the entire corpus (instead of that within a specific group). Note that it also differs from HDP, where such probabilities depend on the number of associated tables. (2) Each group maintains a distribution over the latent DPs to choose from, which reflects the different contributions of these sources. (3) It leverages the sub-sampling operation to explicitly control the model complexity.
In particular, each group maintains indicators of whether particular atoms are inherited; as a consequence, atoms that are deemed irrelevant are put out of scope. (4) As there are multiple latent DPs, there is uncertainty, for each atom, about which latent DP it comes from. We have a specific step that takes this into account, which allows reassigning an atom to a different source.

We first set up the notation. Recall that there are M groups of data, and ML latent DPs to link between them. The observations in the t-th group are xt1, . . . , xtnt. We use φk to denote an atom. Note that the index k is a globally unique identifier of the atom, which is not changed during atom relocation. Since an atom may correspond to multiple data samples, instead of instantiating the parameter θti for each data sample xti, we attach to xti an indicator zti that associates the sample with a particular atom. This is equivalent to setting θti = φzti. To facilitate the sampling process, for each atom φk we maintain an indicator sk specifying the latent DP that contains it, and a set of counters {mtk}, where mtk equals the number of associated data samples in the t-th group. We also maintain a set Is for Hs (the s-th latent DP), which contains the indices of all atoms therein.

The model in Eqs.(7) and (8) can then be reformulated, as shown in Figure 2. It consists of four steps: (1) Generate latent DPs: for each s = 1, . . . , ML, we draw Hs ∼ DP(αs B). (2) Generate the combination coefficients: for each group t, we draw (ct1, . . . , ctML) ∼ Dir(α1 qt1, . . . , αML qtML), which gives the group-specific prior over the sources for the t-th group. (3) Decide inheritance: for each atom φk, we draw a binary variable rtk with Pr(rtk = 1) = qtsk to indicate whether φk is inherited by the t-th group. Here, sk is the index of the latent DP that φk is from.
(4) Generate data: to generate xti, we first choose a latent DP by drawing u ∼ Mult(ct1, . . . , ctML), and then draw an atom from Hu, using it to produce xti. Based on this formulation, we derive the following Gibbs sampling steps to update the atom parameters and other hidden variables.

(1) Update labels. Recall that each data sample xti is associated with a label variable zti that indicates the atom accounting for xti. To draw zti, we first have to choose a particular latent DP as the source (we denote the index of this DP by uti). Let z/ti denote all labels except zti, and rt denote the inheritance indicators. Then, the likelihood of xti (with Hs integrated out) is

p(xti | uti = s, rt, z/ti) = 1/(wst/i + qts αs) · ( ∑_{k∈Is: rtk=1} m*k/ti f(xti; φk) + qts αs f(xti; B) ).   (10)

Here, m*k/ti is the total number of samples associated with φk in all groups (excluding xti), wst/i = ∑_{k∈Is: rtk=1} m*k/ti, f(xti; φk) is the pdf at xti w.r.t. φk, and f(xti; B) = ∫ f(xti; θ) B(θ) dθ. Derivations of this and other sampling formulas are in the supplemental document. Hence,

p(uti = s | others) ∝ p(uti = s | ct) p(xti | uti = s, z/ti) = cts p(xti | uti = s, z/ti).   (11)

Here, ct = (ct1, . . . , ctML) is the group-specific prior over latent sources. Once a latent DP is chosen (using the formula above), we can then draw a particular atom. This is similar to the Chinese restaurant process: with probability proportional to m*k/ti f(xti; φk), we set zti = k, and with probability proportional to qts αs f(xti; B), we draw a new atom from B(·|xti). Only atoms that are contained in Hs and have rtk = 1 (inherited by Dt) can be drawn at this step. When a label zti is changed, we have to modify the relevant quantities accordingly, such as mtk, ws, and Is.
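A simplified sketch of this label update may be helpful. The following assumes generic likelihood callables `f` and `f_B` and toy bookkeeping structures; all names are hypothetical, and the updates of the counters after a draw are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_label(x, ct, q_t, alphas, atoms_in, r_t, m_star, f, f_B):
    """One Gibbs draw of (u_ti, z_ti): pick a latent DP, then an atom.

    ct       : mixing weights of group t over the ML latent DPs
    q_t      : inheritance probabilities q_ts for this group
    alphas   : concentration parameters alpha_s
    atoms_in : list (per source s) of atom indices k in I_s
    r_t      : dict k -> inheritance indicator r_tk
    m_star   : dict k -> corpus-wide count m*_{k/ti}
    f, f_B   : likelihood f(x; phi_k) and marginal likelihood under B
    """
    ML = len(ct)
    # Eq.(10): p(x | u = s) with H_s integrated out
    px = np.empty(ML)
    for s in range(ML):
        ks = [k for k in atoms_in[s] if r_t.get(k, 0) == 1]
        w = sum(m_star[k] for k in ks) + q_t[s] * alphas[s]
        px[s] = (sum(m_star[k] * f(x, k) for k in ks)
                 + q_t[s] * alphas[s] * f_B(x)) / w
    # Eq.(11): choose the source DP
    p_u = np.asarray(ct) * px
    u = rng.choice(ML, p=p_u / p_u.sum())
    # CRP-style draw within H_u: an existing atom vs. a new atom from B(.|x)
    ks = [k for k in atoms_in[u] if r_t.get(k, 0) == 1]
    weights = np.asarray([m_star[k] * f(x, k) for k in ks]
                         + [q_t[u] * alphas[u] * f_B(x)], dtype=float)
    idx = rng.choice(len(weights), p=weights / weights.sum())
    return u, (ks[idx] if idx < len(ks) else None)  # None => create a new atom
```

Returning `None` stands in for drawing a fresh atom from the posterior B(·|x), after which the new atom would be assigned to source u, as described next.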
Moreover, when a new atom φk is created, it is initially assigned to the latent DP that generates it (i.e. we set sk = uti).

(2) Update inheritance indicators. If an atom φk is associated with some data in the t-th group, then we know for sure that it is inherited by Dt, and thus we can set rtk = 1. However, if φk is not observed in the group, this does not imply rtk = 0. For such an atom (suppose it is from Hs), we have

Pr(rtk = 1 | others) / Pr(rtk = 0 | others) = qts · p(zt | rtk = 1, others) / ((1 − qts) · p(zt | rtk = 0, others)) = qts/(1 − qts) · γ(τs/t, nt) / γ(τs/t + m*k/t, nt).   (12)

Here, τs/t = qts αs + ∑_{k′∈Is−{k}} m*k′/t, where m*k′/t is the number of samples associated with k′ in all other groups (excluding those in the t-th group), and γ is the function defined by γ(τ, n) = ∏_{i=0}^{n−1} (τ + i) = Γ(τ + n)/Γ(τ). Intuitively, when m*k is large (indicating that φk appears frequently in other groups) or nt is large, φk is likely to appear in the t-th group if it is inherited. Under such circumstances, if φk is not seen, then it is probably not inherited.

Figure 3: Model structures.

Figure 4: The results on NIPS data obtained with training sets of different sizes.

Figure 5: The results on NIPS data using M-LDP, with different σ values.

(3) Update combination coefficients. The coefficients ct = (ct1, . . . , ctML) reflect the relative contribution of each latent DP to the t-th group. A priori, ct follows a Dirichlet distribution (see Eq.(7)). Given zt, the labels of all samples in the t-th group, we have

ct | zt ∼ Dir(α1 qt1 + ∑_{k∈I1} mtk, . . .
, αML qtML + ∑_{k∈IML} mtk).   (13)

Here, ∑_{k∈Is} mtk is the total number of samples in the t-th group that are associated with Hs.

(4) Update atom parameters. Given all the labels, we update the atoms by re-drawing their parameters from the posterior distributions. Let Xk denote the set of all data samples associated with the k-th atom; then we draw φk ∼ B(·|Xk), where B(·|Xk) denotes the posterior distribution conditioned on Xk, with pdf given by B(φ|Xk) ∝ B(φ) ∏_{x∈Xk} f(x; φ).

(5) Reassign atoms. In this model, each atom is almost surely from a unique latent DP (i.e. it never comes from two distinct sources). This leads to an important question: how do we assign atoms to latent DPs? Initially, an atom is assigned to the latent DP from which it is generated. This is not necessarily optimal. Here, we treat the assignment of each atom as a variable. Consider an atom φk, with sk indicating its source DP. Then, we have

p(sk = j | others) ∝ ∏_{t: rtk=1} qtj · ∏_{t: rtk=0} (1 − qtj).   (14)

When an atom φk that was in Hs is reassigned to Hs′, we have to move the index k from Is to Is′.

5 Experiments

The framework developed in this paper provides a generic tool for modeling grouped data. In this section, we present experiments on two applications: document analysis and scene modeling. The primary goal is to demonstrate the key distinctions between the proposed approach and other nonparametric methods, and to study how the new design influences empirical performance.

5.1 Document Analysis

Topic models [1, 2, 7, 20] have been widely used for statistical analysis of documents. In general, a topic model comprises a set of topics, each associated with a multinomial distribution from which words can be independently generated.
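This generative view can be sketched in a few lines (a generic finite topic-mixture sketch with hypothetical names, not yet the coupled model of this section):

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_document(topic_weights, topic_word, n_words=20):
    """Draw one topic per word from the topic mixture, then a word id
    from that topic's multinomial (generic topic-model sketch)."""
    zs = rng.choice(len(topic_weights), size=n_words, p=topic_weights)
    vocab_size = topic_word.shape[1]
    return np.array([rng.choice(vocab_size, p=topic_word[z]) for z in zs])
```

In the nonparametric versions discussed here, the fixed `topic_weights` vector is replaced by a DP sample over a countably infinite set of topics.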
Here, we formulate a Coupled Topic Model by extending LDA [2] to model multiple groups of documents. Specifically, it associates the t-th group with a mixture of topics, characterized by a DP sample Dt. Given this, the words in a document are generated independently, each from a topic drawn from Dt. To exploit the statistical dependency between groups, we further introduce a set of latent DPs to link these mixtures, as described above. The NIPS (1-17) database [5], which contains 2484 papers published from 1987 to 2003, is used in our experiments. We clean the data by removing words that occur fewer than 10 times over the corpus and those that appear in more than 2000 papers, resulting in a reduced vocabulary of 11729 words. The data are divided into 17 groups, one for each year.

We perform experiments on several configurations, with different ways of connecting latent sources and data groups, as illustrated in Figure 3. (1) Single Latent DP (S-LDP): there is only one latent DP, connected to all groups, with q-values set to 0.5. Though its structure is similar to HDP, the formulation is actually different: HDP generates group-specific mixtures by using the latent DP as the base measure, while our model involves explicit sub-sampling. (2) Multi Latent DP (M-LDP): there are two types of latent DPs, local and global ones. The local latent DPs are introduced to help share statistical strength among groups close to each other, so as to capture the intuition that papers published in consecutive years are more likely to share topics than those published in distant years.
The inheritance probability from a local latent DP Hs to Dt is set as qts = exp(−|t − s|/σ). Also, recognizing that some topics may be shared across the entire corpus, we introduce a global latent DP, from which every group inherits atoms with the same probability, allowing distant groups to be connected. This design illustrates the flexibility of the proposed framework and how one can leverage this flexibility to address practical needs.

For comparison, we also consider another setting of q-values under the M-LDP structure: qts = I(|t − s| ≤ σ), that is, Dt and Hs are connected only when |t − s| ≤ σ, in which case qts = 1. Under this special setting, the formulation reduces to SNΓP [15]. We also test HDP following exactly the settings given in [20]: α0 ∼ Gamma(0.1, 0.1) and γ ∼ Gamma(5, 0.1). Other design parameters are set as follows. We place a weak prior over αs for each latent DP, as αs ∼ Gamma(0.1, 0.1), and periodically update its value. The base distribution B is taken to be Dir(1), which is a uniform distribution over the probability simplex.

The first experiment compares the different methods on training sets of various sizes. We divide all papers into two disjoint halves, for training and testing respectively. In each test, models are estimated on a subset of a specific size randomly chosen from the training corpus. The learned models are then tested on both the training subset and the held-out testing set, so as to study the gap between empirical and generalization performance, measured in terms of perplexity. From Figure 4, we observe: (1) In general, as the training set size increases, the perplexity evaluated on the training set increases and that on the testing set decreases. However, this convergence is faster when local coupling is used (e.g. in SNΓP and M-LDP).
This suggests that the sharing of statistical strength through local latent DPs improves the reliability of the estimation, especially when the training data are limited. (2) Even when the training set size is increased to 1200, the methods using local coupling still yield lower perplexity than the others. This is partly ascribed to the model structure. For example, papers published in consecutive years tend to share many topics, whereas recent papers share fewer topics with those published a decade earlier. A set of local latent DPs may capture such relations more effectively than a single global one. (3) The proposed method under the M-LDP setting outperforms the other methods, including SNΓP. In M-LDP, the contribution of Hs to Dt decreases gracefully as |t − s| increases. This encourages each latent DP to be locally focused, while allowing the atoms therein to be shared across the entire corpus, which is enabled through the use of explicit sub-sampling. SNΓP, instead, provides no mechanism to vary the contributions of the latent DPs, and has to impose a hard limit on their spans to achieve locality. Whereas this issue could be addressed through multiple levels of latent nodes with different spans, doing so would increase the complexity, and thus the risk of overfitting.

For M-LDP, recall that we set qts = exp(−|t − s|/σ). Here, σ is an important design parameter that controls the range of local coupling. The results acquired with different σ values are shown in Figure 5. Optimal performance is attained when the choice of σ balances the need to share atoms against the desire to keep the latent DPs locally focused. Generally, the optimal σ depends on the data; when the training set is limited, one may increase its value to enlarge the coupling range.

5.2 Scene Modeling

Scene modeling is an important task in computer vision.
Among various approaches, topic models that build upon bag-of-features image representations [4, 11, 21] have become increasingly popular and are widely used for statistical modeling of visual scenes. Along this trend, Dirichlet processes have also been employed to discover visual topics from observed scenes [10, 19].

Figure 6: Example images from all eight categories selected for the experiment.

Figure 7: The results on SUN data, with training sets of different sizes.

We apply the proposed method to jointly model the topics in multiple scene categories. Rather than pursuing an optimal scene model, we primarily aim at comparing different nonparametric methods for mixture model estimation under a reasonable setting. We choose a subset of the SUN database [22]. The selected set comprises eight outdoor categories: mountain snowy, hill, boardwalk, swamp, water cascade, ocean, coast and sky. The number of images in each category ranges from 50 to 100. Figure 6 shows some example images. We can see that some categories are similar (e.g. ocean and coast, boardwalk and swamp), while others are largely different. To derive the image representation, PCA-SIFT [8] descriptors are densely extracted from each training image, then pooled together and quantized using K-means into 512 visual words. In this way, each image is represented as a histogram with 512 bins.

All the methods mentioned above are compared. For M-LDP, we introduce a global latent DP to capture common topics, with q-values set uniformly to 0.5, and a set of local latent DPs, one for each category. The prior probability of inheriting from the corresponding local latent DP is 1.0, and that from other local DPs is 0.2. Whereas no prior knowledge about the similarity between categories is assumed, the latent DPs incorporated in this way still provide a mechanism for local coupling.
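The inheritance configuration just described can be written down directly as a q-matrix (a sketch; the function name and argument defaults are ours, not part of the paper):

```python
import numpy as np

def scene_q_matrix(n_categories=8, q_global=0.5, q_own=1.0, q_other=0.2):
    """Inheritance probabilities q_ts for the scene experiment's M-LDP setup:
    one global latent DP plus one local latent DP per category."""
    # Column 0: the global latent DP; columns 1..n: the local latent DPs.
    q = np.full((n_categories, n_categories + 1), q_other)
    q[:, 0] = q_global                                        # global source
    q[np.arange(n_categories), np.arange(1, n_categories + 1)] = q_own
    return q
```

Row t then gives the prior probability that group t inherits an atom from each latent source, with its own local DP at 1.0 and the others at 0.2.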
For SNΓP, we use 28 latent DPs, each connected to a pair of categories. Again, we divide the data into two disjoint halves, one for training and one for testing, and evaluate performance in terms of perplexity. The results are shown in Figure 7, where we observe trends similar to those seen on the NIPS data: local coupling helps model estimation, and our model under the M-LDP setting further reduces the perplexity (from 37 to 31, compared with SNΓP). This is due to the more flexible configuration of local coupling, which allows the weights of the latent DPs to vary.

6 Conclusion

We have presented a principled approach to modeling grouped data, in which the mixture models for different groups are coupled via a set of latent DPs. The proposed framework allows each mixture model to inherit from multiple latent DPs, and each latent DP to contribute differently to different groups, thus providing great flexibility for model design. The experiments on both document analysis and image modeling have clearly demonstrated the utility of such flexibility. In particular, the proposed method supports a variety of modeling choices, e.g. the use of latent DPs with different connection patterns, substantially improving the effectiveness of the estimated models. While q-values are treated here as design parameters, it should be possible to extend this framework to incorporate prior models over these and other parameters.
Such extensions should lead to constructions with richer structure capable of addressing more complex problems.

Acknowledgements

This research was partially supported by the Office of Naval Research Multidisciplinary Research Initiative (MURI) program, award N000141110688, and by DARPA award FA8650-11-1-7154.

References

[1] David Blei and John Lafferty. Correlated topic models. In Proc. of NIPS'06, 2006.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] Yeonseung Chung and David B. Dunson. The local Dirichlet process. Annals of the Institute of Statistical Mathematics, 63(1):59–80, 2009.
[4] Li Fei-Fei. A Bayesian hierarchical model for learning natural scene categories. In Proc. of CVPR'05, 2005.
[5] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. JMLR, 8, 2007.
[6] J. E. Griffin and M. F. J. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101(473):179–194, March 2006.
[7] Thomas Hofmann. Probabilistic latent semantic indexing. In Proc. of ACM SIGIR'99, 1999.
[8] Yan Ke and Rahul Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proc. of CVPR'04, 2004.
[9] Seyoung Kim and Padhraic Smyth. Hierarchical Dirichlet processes with random effects. In Proc. of NIPS'06, 2006.
[10] Jyri J. Kivinen, Erik B. Sudderth, and Michael I. Jordan. Learning multiscale representations of natural scenes using Dirichlet processes. In Proc. of CVPR'07, 2007.
[11] S. Lazebnik, C. Schmid, and J. Ponce.
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. of CVPR'06, 2006.
[12] Dahua Lin, Eric Grimson, and John Fisher. Construction of dependent Dirichlet processes based on Poisson processes. In Proc. of NIPS'10, 2010.
[13] Steven N. MacEachern. Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science, 1999.
[14] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[15] Vinayak Rao and Yee Whye Teh. Spatial normalized Gamma processes. In Proc. of NIPS'09, 2009.
[16] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Proc. of NIPS'00, 2000.
[17] Lu Ren, David B. Dunson, and Lawrence Carin. The dynamic hierarchical Dirichlet process. In Proc. of ICML'08, 2008.
[18] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650, 1994.
[19] Erik B. Sudderth, Antonio Torralba, William Freeman, and Alan Willsky. Describing visual scenes using transformed Dirichlet processes. In Proc. of NIPS'05, 2005.
[20] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[21] Chang Wang, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In Proc. of CVPR'09, 2009.
[22] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. of CVPR'10, 2010.