{"title": "Infinite State Bayes-Nets for Structured Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1608, "abstract": null, "full_text": "In\ufb01nite State Bayesian Networks\n\nMax Welling\u2217, Ian Porteous, Evgeniy Bart\u2020\n\nDonald Bren School of Information and Computer Sciences\n\n{welling,iporteou}@ics.uci.edu, bart@caltech.edu\n\nUniversity of California Irvine\nIrvine, CA 92697-3425 USA\n\nAbstract\n\nA general modeling framework is proposed that uni\ufb01es nonparametric-Bayesian\nmodels, topic-models and Bayesian networks. This class of in\ufb01nite state Bayes\nnets (ISBN) can be viewed as directed networks of \u2018hierarchical Dirichlet\nprocesses\u2019 (HDPs) where the domain of the variables can be structured (e.g. words\nin documents or features in images). We show that collapsed Gibbs sampling can\nbe done ef\ufb01ciently in these models by leveraging the structure of the Bayes net\nand using the forward-\ufb01ltering-backward-sampling algorithm for junction trees.\nExisting models, such as nested-DP, Pachinko allocation, mixed membership sto-\nchastic block models as well as a number of new models are described as ISBNs.\nTwo experiments have been performed to illustrate these ideas.\n\n1 Introduction\n\nBayesian networks remain the cornerstone of modern AI. They have been applied to a wide range\nof problems both in academia as well as in industry. A recent development in this area is a class\nof Bayes nets known as topic models (e.g. LDA [1]) which are well suited for structured data such\nas text or images. A recent statistical sophistication of topic models is a nonparametric extension\nknown as HDP [2], which adaptively infers the number of topics based on the available data.\n\nThis paper has the goal of bridging the gap between these three developments. 
We propose a general modeling paradigm, the \u201cin\ufb01nite state Bayes net\u201d (ISBN), that incorporates these three aspects. We consider models where the variables may have the nested structure of documents and images, may have in\ufb01nite discrete state spaces, and where the random variables are related through the intuitive causal dependencies of a Bayes net. ISBNs can be viewed as collections of HDP \u201cmodules\u201d connected together to form a network. Inference in these networks is achieved through a two-stage Gibbs sampler, which combines the \u201cforward-\ufb01ltering-backward-sampling\u201d algorithm [3] extended to junction trees and the direct assignment sampler for HDPs [2].\n\n2 Bayes Net Structure for ISBN\nConsider observed random variables xA \u225c {xa}, a = 1..A. These variables can take values in an arbitrary domain. In the following we will assume that xa is sampled from a (conditional) distribution in the exponential family. We will also introduce hidden (unobserved, latent) variables {zb}, b = 1..B, which always take discrete values. The indices a, b thus index the nodes of the Bayesian network.\nWe will introduce a separate index, e.g. na, to label observations. In the simplest setting we assume IID data, n = i, i.e. N independent identically distributed observations for each variable.\n\n\u2217On sabbatical at Radboud University Nijmegen, Netherlands, Dept. of Biophysics.\n\u2020Joint appointment at California Institute of Technology, USA, Dept. of Electrical Engineering.\n\nFigure 1: Graphical representation of (a) Unstructured in\ufb01nite state Bayesian network, (b) HDP, (c) H2DP.\n\nWe will however also be interested in more structured data, such as words in documents, where the index n can be decomposed into e.g. n = (j, ij). In this notation we think of j as labelling a document and ij as labelling a word in document j. To simplify notation we will often write n = (ji). 
It is straightforward to generalize to deeper nestings of indices, e.g. n = (k, jk, ijk) = (kji), where k can index e.g. books, j chapters and i words. We interpret this as the observed structure in the data, as opposed to the latent structure which we seek to infer. The unobserved structure is labelled with the discrete \u201cassignment variables\u201d za_n which assign the object indexed by n to latent groups (a.k.a. topics, factors, components).\nThe assignment variables z together with the observed variables x are organized into a Bayes net, where dependencies are encoded by the usual \u201cconditional probability tables\u201d (CPTs), which we denote with \u03c6a_{xa|\u2118a} for observed variables and \u03c0b_{zb|\u2118b} for latent variables1. Here, \u2118a denotes the joint state of all the parent variables of xa or zb. When a vertical bar is present we normalize over the variables to the left of it, e.g. \u2211_{xa} \u03c6a_{xa|\u2118a} = 1, \u2200a, \u2118a. Note that CPTs are considered random variables and may themselves be indexed by (a subset of) n, e.g. \u03c6_{xa|\u2118a,j}.\nWe assume that each \u03c0b is sampled from a Dirichlet prior: e.g. \u03c0_{zb|\u2118b} \u223c D[\u03b1b\u03c4_{zb}] independently and identically for all values of \u2118b. The distribution \u03c4 itself is Dirichlet distributed, \u03c4_{za} \u223c D[\u03b3a/K^a], where K^a is the number of states for variable za. We can put gamma priors on \u03b1a, \u03b3a and consider them as random variables as well, but to keep things simple we will consider them \ufb01xed here. We refer to [4] for algorithms to learn them from data and to [5] and [2] for ways to infer them through sampling. In section 5 we further discuss these hierarchical priors.\n\nIn drawing BNs we will not include the plates to avoid cluttering the \ufb01gures. However, it is always possible to infer the number of times variables in the BN are replicated by looking at its indices. 
For instance, the variable node labelled with \u03c0_{z1|z2,j} in Fig.3a stands for K(2) \u00d7 J IID copies of \u03c01 sampled from \u03c4 1.\n\n3 Networks of HDPs\n\nIn Fig.1b we have drawn the \ufb01nite version of the HDP. Here \u03c6 is a distribution over words, one for each topic value z, and is often referred to as a \u201ctopic distribution\u201d. Topic values are generated from a document-speci\ufb01c distribution \u03c0, which in turn is generated from a \u201cmother distribution\u201d over topics \u03c4. As was shown in [2] one can take the in\ufb01nite limit K \u2192 \u221e in this model and arrive at the HDP. We will return to this in\ufb01nite limit when we describe Gibbs sampling. In the following we will use the same graphical model for \ufb01nite and in\ufb01nite versions of ISBNs.\n\n1We will often avoid writing the super-indices a, b when it is clear from the context, e.g. \u03c6a_{xa|\u2118a} = \u03c6_{xa|\u2118a}.\n\nFigure 2: Graphical representation for (a) BiHDP, (b) Mixed membership stochastic block model and (c) the \u201cmultimedia\u201d model.\n\nOne of the key features of the HDP is that topics are shared across all documents indexed by j. The reason for this is the distribution \u03c4: new states are \u201cinvented\u201d at this level and become available to all other documents. In other words, there is a single state space for all copies of \u03c0. One can interpret j as an instantiation of a dummy, fully observed random variable \u03b9. 
We could add this node to the BN as a parent of z (since \u03c0 depends on it) and reinterpret the statement of sharing topics as a fully connected transition matrix from states of \u03b9 to states of z. This idea can be extended to a combination of fully observed parent variables and multiple unobserved parent variables, e.g. z with parents {z2, z3, \u03b9}. Moreover, the child variables do not have to be observed either, so we can also replace x by z. In this fashion we can connect together multiple vertical stacks \u03c4 \u2192 \u03c6 \u2192 z, where each such module is part of a \u201cvirtual-HDP\u201d in which the joint child states act as virtual data and the joint parent states act as virtual document labels. Examples are given in Figs. 1a (in\ufb01nite extension of a Bayes net with IID data items) and 3a (in\ufb01nite extension of Pachinko Allocation).\n\n4 Inference\n\nTo simplify the presentation we will now restrict attention to a Bayesian network where all CPTs are shared across all data-items (see Fig.1a). In this case data is unstructured, assumed IID and indexed by a \ufb02at index n = i. Instead of going through the detailed derivation, which is an extension of the derivation in [2] for HDP, we will describe the sampling process in the following.\n\nThere is a considerable body of empirical evidence which con\ufb01rms that marginalizing out the variables \u03c0, \u03c6 will result in improved inference (e.g. [6, 7]). In this collapsed space, we sample two sets of variables alternatingly, {z} on the one hand and {\u03c4} on the other. First, we focus on the latter given z and notice that all \u03c4 are conditionally independent given z, x.\nSampling \u03c4|(z, x): Given x, z we can compute count matrices2 N_{zb|\u2118b} and N_{xa|\u2118a} as N_{zb=k|\u2118b=l} = \u2211_i I[zb_i = k \u2227 \u2118b_i = l], and similarly for N_{xa|\u2118a}. 
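As a concrete illustration, such a count matrix can be accumulated from the current assignments in a few lines; this is a generic NumPy sketch with hypothetical array names, not code from the paper:

```python
import numpy as np

def count_matrix(child_states, parent_states, K, L):
    """N[k, l] = number of data cases i with child assignment k and
    joint parent state l (the count matrix N_{z=k|parent=l})."""
    N = np.zeros((K, L), dtype=np.int64)
    # unbuffered scatter-add: one increment per data case
    np.add.at(N, (child_states, parent_states), 1)
    return N
```

For example, `count_matrix(np.array([0, 1, 1]), np.array([0, 0, 1]), 2, 2)` yields `[[1, 0], [1, 1]]`.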
Given these counts, for each value of k, l, we now create the following vector: v_{kl} = \u03b1\u03c4_k/(\u03b1\u03c4_k + n_{k|l} \u2212 1) with n_{k|l} = [1, 2, .., N_{k|l}]. We then draw N_{k|l} Bernoulli random variables with probability of success given by the elements of v, which we call3 s^t_{k|l}, and compute their sum across t: S_{k|l} = \u2211_t s^t_{k|l}. This procedure is equivalent to running a Chinese restaurant process (CRP) with N_{k|l} customers and only keeping track of how many tables become occupied. We will denote it with S_{k|l} \u223c A[N_{k|l}, \u03b1\u03c4_k] after Antoniak [8]. Next we compute S_k = \u2211_l S_{k|l} and sample \u03c4 from a Dirichlet distribution, \u03c4 \u223c D[\u03b3, S_1, .., S_K]. Note that \u03c4 is a distribution over K^a + 1 states, where we now denote with K^a the number of occupied states. If the state corresponding to \u03b3 is picked, a new state is created and we increment K^a \u2190 K^a + 1. If on the other hand a state becomes empty, we remove it from the list and we decrement K^a \u2190 K^a \u2212 1.\n\n2Note that these can also be used to compute Rao-Blackwellised estimates of \u03c0 and \u03c6, i.e. E[\u03c0_{zb|\u2118b}] = (\u03b1b\u03c4b_z + N_{zb|\u2118b})/(\u03b1b + N_{\u2118b}) and similarly for \u03c6.\n3These variables are so-called auxiliary variables to facilitate sampling \u03c4.\n\nFigure 3: Graphical representation for (a) Pachinko Allocation and (b) Nested DP.\n\n
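The Antoniak draw S_{k|l} \u223c A[N_{k|l}, \u03b1\u03c4_k] described above is cheap to implement; a minimal sketch (generic NumPy, hypothetical names, not the authors' code):

```python
import numpy as np

def antoniak_sample(N, alpha_tau, rng):
    """Draw S ~ A[N, alpha_tau]: the number of occupied tables after a
    Chinese restaurant process seats N customers with mass alpha_tau."""
    if N == 0:
        return 0
    n = np.arange(1, N + 1)                    # customers 1..N
    p_new = alpha_tau / (alpha_tau + n - 1.0)  # prob. customer n opens a new table
    return int((rng.uniform(size=N) < p_new).sum())
```

Since the first customer always opens a table (p = 1), S lies between 1 and N; summing such draws over l gives the S_k that enter the Dirichlet update for \u03c4.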
This will allow assignment variables to add or remove states adaptively4.\nSampling z|(\u03c4, x): The conditional probability of all {zi, xi} variables jointly (for \ufb01xed i) is given by\n\nP(x_i, z_i | z^{\u00aci}, x^{\u00aci}, \u03c4, \u03b1) = \u220f_b [(\u03b1^b \u03c4_{z^b_i} + N^{\u00aci}_{z^b_i|\u2118^b_i}) / (\u03b1^b + N^{\u00aci}_{\u2118^b_i})] \u220f_a F(x^a_i | x^{a,\u00aci}, \u2118^{a,\u00aci})    (1)\n\nwhere N^{\u00aci}_{z^b_i|\u2118^b_i} is the number of data-cases assigned to group z^b_i for variable b and its parents assigned to group \u2118^b_i, where we exclude data-case i from this count. Also,\n\nF(x^a_i | x^{a,\u00aci}, \u2118^{a,\u00aci}_i = k) = [\u222b d\u03c6_k P(x^a_i|\u03c6_k) \u220f_{i'\u2260i: \u2118^a_{i'}=k} P(x^a_{i'}|\u03c6_k) P(\u03c6_k)] / [\u222b d\u03c6_k \u220f_{i'\u2260i: \u2118^a_{i'}=k} P(x^a_{i'}|\u03c6_k) P(\u03c6_k)]    (2)\n\nImportantly, equation 1 follows the structure of the original Bayes net, where each term has the form of a conditional distribution P(z^a_i | \u2118^a_i) and is based on suf\ufb01cient statistics collected over all the other data-cases. Hence, we can use the structure of the Bayes net to sample the assignment variables jointly across the BN (for data-case i). The general technique that allows one to exploit network structure is \u2018forward-\ufb01ltering-backward-sampling\u2019 (FFBS) [3]. Assume for instance that the network is a tree. In that case we \ufb01rst propagate information from the leaves to the root, computing the probabilities P(zb|{xb\u2193}) as we go, where \u2018\u2193\u2019 means that we compute a marginal conditioned on \u2018downstream\u2019 evidence. When we reach the root we draw a sample from P(zroot|{xb}). 
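For intuition, the filter-then-sample recursion is easy to state for a plain chain z_1 \u2192 ... \u2192 z_T with a fixed transition matrix and per-step evidence. The sketch below is the textbook FFBS of [3], not the authors' junction-tree extension, and all names are hypothetical:

```python
import numpy as np

def ffbs_chain(pi0, A, lik, rng):
    """FFBS for a discrete chain. pi0: (K,) initial distribution;
    A: (K, K) transition matrix; lik: (T, K) evidence p(x_t | z_t = k)."""
    T, K = lik.shape
    alpha = np.empty((T, K))
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                       # forward filtering
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    z = np.empty(T, dtype=int)
    z[-1] = rng.choice(K, p=alpha[-1])          # sample at the last node
    for t in range(T - 2, -1, -1):              # backward sampling
        w = alpha[t] * A[:, z[t + 1]]
        z[t] = rng.choice(K, p=w / w.sum())
    return z
```

With A = I and evidence pinning the final state, the whole sampled path is forced through that state, which makes a convenient sanity check.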
Finally, we work our way back to the leaves, conditioning on drawn samples (which summarize upstream information) and using the marginal probabilities P(zb|{xb\u2193}) cached during the \ufb01ltering phase to represent downstream evidence. For networks with higher treewidth we can extend this technique to junction trees. Alternatively, one can use cut-set sampling techniques [9].\n\n5 ISBN for Structured Data\n\nIn section 2 we introduced an index n to label the known structure of the data. The simplest nontrivial example is given by the HDP, where n = (ji) indexes e.g. documents and words. In this case the CPT \u03c0_{z|j} is not shared across all data, but is speci\ufb01c to a document. Next consider Fig.1c where n = (kji) is labelling for instance words (i) in chapters (j) in books (k). The \ufb01rst level CPT \u03c0_{z|kj} is speci\ufb01c to chapters (and hence books) and is sampled from a Dirichlet distribution with mean given by a second level CPT \u03c1_{z|k} speci\ufb01c to books, which in turn is sampled from a Dirichlet distribution with mean \u03c4_z, which \ufb01nally is sampled from a Dirichlet prior with parameters \u03b3. Sampling occurs again in two phases: sampling \u03c1, \u03c4|x, z and z|\u03c1, \u03c4, x while marginalizing out \u03c0, \u03c6.\nTo sample from \u03c1, \u03c4 we compute counts N_{u|m,jk}, which is the number of times words were assigned in chapter j and book k to the joint state z = u, \u2118 = m. We then work our way up the stack, sampling new count arrays S, R as we go, and then down again sampling the CPTs (\u03c4, \u03c1) using these count arrays5. Note that this is just one step of Gibbs sampling from P(\u03c4, \u03c1|z, x) and does not (unlike the other phase for z) generate an equilibrium sample from this conditional distribution.\n\n\u2191: s_{u|jkm} \u223c A[N_{u|jkm}, \u03b1\u03c1_{u|k}] \u2192 S_{u|k} = \u2211_{j,m} s_{u|jkm} \u2192 r_{u|k} \u223c A[S_{u|k}, \u03b2\u03c4_u] \u2192 R_u = \u2211_k r_{u|k}\n\u2193: \u03c4_u \u223c D[(\u03b3, R_u)] \u2192 \u03c1_{u|k} \u223c D[\u03b2\u03c4_u + S_{u|k}]    (3)\n\n4We adopted the name \u2018in\ufb01nite state Bayesian network\u2019 because the (K^a + 1)th state actually represents an in\ufb01nite pool of indistinguishable states.\n\nA similar procedure is de\ufb01ned for the priors of \u03c6 and extensions to deeper stacks are straightforward. If all z variables carry the same index n, sampling z_n given the hierarchical priors is very similar to the FFBS procedure described in the previous section, except that the count arrays may carry a subset of the indices from n, e.g. N^{\u00acijk}_{z|\u2118jk}. Since these counts are speci\ufb01c to a chapter they are typically smaller, resulting in higher variance for the samples z. If two neighboring z variables carry different subsets of n labels, e.g. node z0_j in Fig.2c, the conditional distributions are harder to compute. The general rule is to identify and remove all z\u2032 variables that are impacted by changing the value for z under consideration, e.g. {z1_{jw}, \u2200w} \u222a {z2_{jf}, \u2200f} in Fig.2c if we resample z0_j. To compute the conditional probability we set z = k and add the impacted variables z\u2032 back into the system, one-by-one in an arbitrary order, assigning them to their old values.\nIt is also instructive to place DP priors (instead of HDP priors) of the form D[\u03b1b/K b] directly on \u03c0 (skipping \u03c4). In taking the in\ufb01nite limit the conditional distribution for existing states zb becomes directly proportional to N_{zb|\u2118b} (the \u03b1b\u03c4_{zb} term is missing). 
This has the effect that a new state zb = k that was discovered for some parent state \u2118b = l will not be available to other parent states, simply because N_{k|l\u2032} = 0, l\u2032 \u2260 l. The result is that the state space forks into a tree structure as we move down the Bayes net. When the network structure is a linear chain, this model is equivalent to the \u2018nested-DP\u2019 introduced in [10] as a prior on tree-structures. The corresponding Bayes net is depicted in Fig.3b. A chain of length 1 is of course just a Dirichlet process mixture model. A DP prior is certainly appropriate for nodes zb with CPTs that do not depend on other parents or additional labels, e.g. nodes z3 and z4 in Fig.1a. Interestingly, an HDP would also be possible and would result in a different model. We will however follow the convention that we use the minimum depth necessary for modelling the structure of the data.\n\n6 Examples\nExample: HDP Perhaps the simplest example is an HDP itself, see Fig.1b. It consists of a single topic node and a single observation node. If we make \u03c6 depend on the item index i, i.e. \u03c6_{x|z,i}, we obtain the in\ufb01nite version of the \u2018user rating pro\ufb01le\u2019 (URP) model [11]. If we make \u03c6 depend on j instead and add a prior: \u03c8_{x|z} \u2192 \u03c6_{x|z,j}, we obtain an \u201cHDP with random effects\u201d [12] which has the bene\ufb01t that shared topics across documents can vary slightly relative to each other.\nExample: In\ufb01nite State Chains The \u2018Pachinko allocation model\u2019 (PAM) [13] consists of a linear chain of assignment variables with document-speci\ufb01c transition probabilities, see Fig.3a. It was proposed to model correlations between topics. The in\ufb01nite version of this is clearly an example of an ISBN. An equivalent Chinese restaurant process formulation was published in [14]. 
A slight variation on this architecture was described in [15] (POM). Here, images are modeled as mixtures over parts and parts are modeled as mixtures over visual words. Finally, a visual word is a distribution over features. POM is only subtly different from PAM (see Fig.3a) in that parts are not image-speci\ufb01c distributions over words, and so the distribution \u03c0_{z1|z2} does not depend on j.\nExample: BiHDP This model, depicted in Fig.2a, has a data variable x_{ji} and two parent topic variables z1_{ji} and z2_{ji}. One can think of j as the customer index and i as the product index (and no IID repeated index). The value of x is the rating of that customer for that product. The hidden variables z1_{ji} and z2_{ji} represent product groups and customer groups. Every data entry is assigned to both a customer group and a product group, which together determine the factor from which we sample the rating. Note that the difference between the assignment variables is that their corresponding CPTs \u03c0_{z1,j} and \u03c0_{z2,i} depend on j and i respectively. Extensions are easily conceived. For instance, instead of two modalities, we can model multiple modalities (e.g. customers, products, year). Also, single topics can be generalized to hierarchies of topics, so every branch becomes a PAM. Note that for unobserved x_{ji} values (not all products have been rated by all customers) the corresponding za_{ji}, zb_{ji} are \u201cdangling\u201d and can be integrated out. The result is that we should skip that variable in the Gibbs sampler.\n\n5Teh\u2019s code npbayes-r21 (available from his website) does in fact implement this sampling process.\n\nExample: The Mixed-Membership Stochastic Block Model [16] This model is depicted in Fig.2b. The main difference with HDP is that (like BiHDP) \u03c0 depends on two parent states z_{i\u2192j} and z_{j\u2192i}, by which we mean that item i has chosen topic z_{i\u2192j} to interact with item j and vice versa. 
However, (unlike BiHDP) those topic states share a common distribution \u03c0. Indices only\nrun over distinct pairs i > j. These features make the model suitable for modeling social interac-\ntion networks or protein-protein interaction networks. The hidden variables jointly label the type of\ninteraction that was used to generate \u2018matrix-element\u2019 xij.\nExample: The Multimedia Model\nIn the above examples we had a single observed variable in\nthe graphical model (repeated over ij). The model depicted in Fig.2c has two observed variables\nand an assignment variable that is not repeated over items. We can think of the middle node z0\nj as\nthe class label for a web-page j. The left branch can then model words on the web-page while the\nright branch can model visual features on the web-page. Since no sharing is required for z0\nj we used\na Dirichlet prior. The other variables have the usual HDP priors.\n\n7 Experiments\n\nTo illustrate the ideas we implemented two models: BiHDP of Fig.2a and the \u201cprobabilistic object\nmodel\u201d (POM), explained in the previous section.\nMarket Basket Data\nIn this experiment we investigate the performance of BiHDP on a synthetic\nmarket basket dataset. We used the IBM Almaden association and sequential patterns generator to\ncreate this dataset [17]. This is a standard synthetic transaction dataset generator often used by the\nassociation research community. The generated data consists of purchases from simulated groups\nof customers who have similar buying habits. Similarity of buying habits refers to the fact that\ncustomers within a group buy similar groups of items. For example, items like strawberries and\ncream are likely to be in the same item group and thus are likely to be purchased together in the\nsame market basket. 
The following parameters were used to generate data for our experiments: 1M transactions, 10K customers, 1K different items, 4 items per transaction on average, 4 item groups per customer group on average, 50 market basket patterns, 50 customer patterns. Default values were used for the remaining parameters.\n\nThe two assignment variables correspond to customers and items respectively. For a given pair of customer and item groups, a binomial distribution was used to model the probability of a customer group making a purchase from that item group. A collapsed Gibbs sampler was used to \ufb01t the model. After 1000 epochs the system converged to 278 customer groups and 39 item factors. Fig.4 shows the results. As can be seen, most item groups correspond directly to the hidden ground truth data. The conclusion is that the model can successfully learn the hidden structure in the data.\nLearning Visual Vocabularies LDA has also gained popularity for modeling images as collections of features. The visual vocabulary is usually determined in a preprocessing step where k-means is run to cluster features collected from the training set. In [15] a different approach was proposed in which the visual word vocabulary was learned jointly with \ufb01tting the parameters of the model. This can have the bene\ufb01t that the vocabulary is better adapted to suit the needs of the model. Our extension of their PLSA-based model is the in\ufb01nite state model given by Fig.3a with 2 hidden variables (instead of 3) and \u03c0_{z1|z2} independent of j. x is modeled as a Gaussian-distributed random variable over feature values, z1 represents the word identity and z2 is the topic index.\nWe used the Harris interest-point detector and 21\u00d721 patches centered on each interest point as input to the algorithm. We normalized the patches to have zero mean. Next we reduced the dimensionality of detections from 441 to 100 using PCA. 
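The normalization and PCA step can be sketched as follows; this is a generic reimplementation under stated assumptions (patches already flattened to 441-vectors), not the authors' code:

```python
import numpy as np

def preprocess_patches(patches, d=100):
    """patches: (n, 441) array of flattened 21x21 patches.
    Zero-mean each patch, then project onto the top-d PCA directions."""
    X = patches - patches.mean(axis=1, keepdims=True)  # zero-mean per patch
    Xc = X - X.mean(axis=0)                            # center features for PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components
    return Xc @ Vt[:d].T                               # (n, d) descriptors
```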
The procedure described above generates a set of between 50 and 400 detections (each 100-dimensional) per image. Experiments were performed using the Caltech-4 and \u2018turntable\u2019 datasets. For Caltech-4 we used 130 randomly sampled images from each of the 4 categories for training. LDA was \ufb01t using 500 visual words and 50 parts (which we found to give the best results). The turntable database contains images of 15 toy objects. The objects were placed on a turntable and photographed every 5 degrees. We have used angles 15, 20, 25, 35, 40, and 45 for training, and angles 10, 30, and 50 for testing. LDA used 15 topics and 200 visual words (which again was optimal).\n\nLearned: 223, 619, 271, 448, 39, 390 | True: 223, 271, 448, 39, 427, 677\nLearned: 364, 250, 718, 952, 326, 802 | True: 364, 718, 952, 326, 542, 98\nLearned: 159, 563, 780, 995, 103, 216, 598, 72 | True: 159, 563, 780, 995, 103, 216, 542, 72\nLearned: 227, 130, 862, 991, 904, 213 | True: 227, 130, 862, 991, 904, 213\nLearned: 953, 175, 956, 385, 269, 14, 64 | True: 953, 175, 956, 385, 269, 14, 956\nLearned: 49, 657, 906, 604, 229 | True: 49, 657, 906, 604, 229\nLearned: 295, 129, 662, 922, 705, 210 | True: 295, 129, 662, 922, 705, 68\nLearned: 886, 460, 471, 933, 544 | True: 886, 460, 471, 933, 917\nLearned: 489, 818, 927, 378, 64, 710 | True: 489, 818, 927, 378, 64, 247\nLearned: 776, 224, 139, 379 | True: 776, 224, 139, 379\n\nFigure 4: The 10 most popular item groups learned by the BiHDP model (left) compared to ground truth item groups for market basket data (right). Learned items are ordered by decreasing popularity. Ground truth items have no associated weight; therefore, they were ordered to facilitate comparison with the left column. Non-matching items are shown in boldface.\n\nFigure 5: Precision Recall curves for Caltech-4 dataset (left) and turntable dataset (right). Solid curve represents POM and dashed curve represents LDA.\n\n
LDA was then \ufb01tted to both datasets using Gibbs sampling. We initialized POM with the output of LDA to make sure the comparison involved similar modes of the distribution.\n\nThe precision-recall curves for these datasets are shown in Fig.5. Images were labelled by choosing the majority class across the 11 most similar retrieved images. Similarity was measured as the probability of the query image given the part probabilities of the retrieved image.\n\nThese experiments show that ISBNs can be successfully implemented. We are not interested in claiming superiority of ISBNs, but rather hope to convey that ISBNs are a convenient tool to design models and to facilitate the search for the number of latent states.\n\n8 Discussion\n\nWe have presented a uni\ufb01ed framework to organize the fast-growing class of \u2018topic models\u2019. By merging ideas from Bayes nets, nonparametric Bayesian statistics and topic models we have arrived at a convenient framework to 1) extend existing models to in\ufb01nite state spaces, 2) reason about and design new models and 3) derive ef\ufb01cient inference algorithms that exploit the structure of the underlying Bayes net.\n\nNot every topic model naturally \ufb01ts the mold of an ISBN. For instance, the in\ufb01nite HMM [18] is like a POM model with emission states, but with a single transition probability shared across time. When marginalizing out \u03c0 this has the effect of coupling all z variables. An ef\ufb01cient sampler for this model was introduced in [19]. Also, in [10, 20] models were studied where a word can be emitted at any node corresponding to a topic variable z. 
We would need an extra switching variable to \ufb01t this into the ISBN framework.\n\nWe are currently working towards a graphical interface where one can design ISBN models by attaching together HkDP modules and where the system will automatically perform the inference necessary for the task at hand.\n\nAcknowledgements\n\nThis material is based upon work supported by the National Science Foundation under Grant No. 0447903 and No. 0535278 and by ONR under Grant No. 00014-06-1-0734.\n\nReferences\n\n[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[2] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. To appear in Journal of the American Statistical Association, 2006.\n[3] S. L. Scott. Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association, 97:337\u2013351, 2002.\n[4] T. Minka. Estimating a Dirichlet distribution. Technical report, 2000.\n[5] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577\u2013588, 1995.\n[6] T. L. Grif\ufb01ths and M. Steyvers. A probabilistic approach to semantic representation. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002.\n[7] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, volume 19, 2006.\n[8] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152\u20131174, 1974.\n[9] B. Bidyuk and R. Dechter. Cycle-cutset sampling for Bayesian networks. In Sixteenth Canadian Conf. on AI, 2003.\n[10] D. Blei, T. L. Grif\ufb01ths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. 
In Neural Information Processing Systems 16, 2004.\n[11] B. Marlin. Modeling user rating pro\ufb01les for collaborative \ufb01ltering. In Advances in Neural Information Processing Systems 16, 2004.\n[12] S. Kim and P. Smyth. Hierarchical Dirichlet processes with random effects. In NIPS, volume 19, 2006.\n[13] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, pages 577\u2013584, 2006.\n[14] W. Li, A. McCallum, and D. Blei. Nonparametric Bayes Pachinko allocation. In UAI, 2007.\n[15] D. Larlus and F. Jurie. Latent mixture vocabularies for object categorization. In British Machine Vision Conference, 2006.\n[16] E. Airoldi, D. Blei, E. Xing, and S. Fienberg. A latent mixed membership model for relational data. In LinkKDD \u201905: Proceedings of the 3rd international workshop on Link discovery, pages 82\u201389, 2005.\n[17] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. In Proc. of the ACM-SIGMOD 1993 Intl Conf on Management of Data, 1993.\n[18] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The in\ufb01nite hidden Markov model. In NIPS, pages 577\u2013584, 2001.\n[19] Y. W. Teh, D. G\u00f6r\u00fcr, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Arti\ufb01cial Intelligence and Statistics, volume 11, 2007.\n[20] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with Pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, 2007.\n", "award": [], "sourceid": 100, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Ian", "family_name": "Porteous", "institution": null}, {"given_name": "Evgeniy", "family_name": "Bart", "institution": null}]}