{"title": "The Infinite Gamma-Poisson Feature Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1513, "page_last": 1520, "abstract": "We address the problem of factorial learning which associates a set of latent causes or features with the observed data. Factorial models usually assume that each feature has a single occurrence in a given data point. However, there are data such as images where latent features have multiple occurrences, e.g. a visual object class can have multiple instances shown in the same image. To deal with such cases, we present a probability model over non-negative integer valued matrices with possibly unbounded number of columns. This model can play the role of the prior in an nonparametric Bayesian learning scenario where both the latent features and the number of their occurrences are unknown. We use this prior together with a likelihood model for unsupervised learning from images using a Markov Chain Monte Carlo inference algorithm.", "full_text": "The In\ufb01nite Gamma-Poisson Feature Model\n\nMichalis K. Titsias\n\nSchool of Computer Science,\nUniversity of Manchester, UK\nmtitsias@cs.man.ac.uk\n\nAbstract\n\nWe present a probability distribution over non-negative integer valued matrices\nwith possibly an in\ufb01nite number of columns. We also derive a stochastic process\nthat reproduces this distribution over equivalence classes. This model can play\nthe role of the prior in nonparametric Bayesian learning scenarios where multiple\nlatent features are associated with the observed data and each feature can have\nmultiple appearances or occurrences within each data point. Such data arise nat-\nurally when learning visual object recognition systems from unlabelled images.\nTogether with the nonparametric prior we consider a likelihood model that ex-\nplains the visual appearance and location of local image patches. 
Inference with this model is carried out using a Markov chain Monte Carlo algorithm.

1 Introduction

Unsupervised learning using mixture models assumes that one latent cause is associated with each data point. This assumption can be quite restrictive, and a useful generalization is to consider factorial representations which assume that multiple causes have generated the data [11]. Factorial models are widely used in modern unsupervised learning algorithms; see e.g. algorithms that model text data [2, 3, 4]. Algorithms for learning factorial models should deal with the problem of specifying the size of the representation. Bayesian learning and especially nonparametric methods such as the Indian buffet process [7] can be very useful for solving this problem.

Factorial models usually assume that each feature occurs once in a given data point. This is insufficient for modelling the precise generation mechanism of several kinds of data such as images. An image can contain views of multiple object classes such as cars and humans, and each class may have multiple occurrences in the image. To deal with features having multiple occurrences, we introduce a probability distribution over sparse non-negative integer valued matrices with a possibly unbounded number of columns. Each matrix row corresponds to a data point and each column to a feature, similarly to the binary matrix used in the Indian buffet process [7]. Each element of the matrix can be zero or a positive integer and expresses the number of times a feature occurs in a specific data point. This model is derived by considering a finite gamma-Poisson distribution and taking the infinite limit for equivalence classes of non-negative integer valued matrices. We also present a stochastic process that reproduces this infinite model.
This process uses the Ewens's distribution [5] over integer partitions, which was introduced in the population genetics literature and is equivalent to the distribution over partitions of objects induced by the Dirichlet process [1].

The infinite gamma-Poisson model can play the role of the prior in a nonparametric Bayesian learning scenario where both the latent features and the number of their occurrences are unknown. Given this prior, we consider a likelihood model which is suitable for explaining the visual appearance and location of local image patches. Introducing a prior for the parameters of this likelihood model, we apply Bayesian learning using a Markov chain Monte Carlo inference algorithm and show results on some image data.

2 The finite gamma-Poisson model

Let X = {X_1, ..., X_N} be some data where each data point X_n is a set of attributes. In section 4 we specify X_n to be a collection of local image patches. We assume that each data point is associated with a set of latent features and each feature can have multiple occurrences. Let z_nk denote the number of times feature k occurs in the data point X_n. Given K features, Z = {z_nk} is an N × K non-negative integer valued matrix that collects together all the z_nk values so that each row corresponds to a data point and each column to a feature. Given that z_nk is drawn from a Poisson with a feature-specific parameter λ_k, Z follows the distribution

P(Z \mid \{\lambda_k\}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \frac{\lambda_k^{z_{nk}} \exp\{-\lambda_k\}}{z_{nk}!} = \prod_{k=1}^{K} \frac{\lambda_k^{m_k} \exp\{-N\lambda_k\}}{\prod_{n=1}^{N} z_{nk}!},    (1)

where m_k = \sum_{n=1}^{N} z_{nk}. We further assume that each λ_k parameter follows a gamma distribution that favors sparsity (in a sense that will be explained shortly):

G\left(\lambda_k; \frac{\alpha}{K}, 1\right) = \frac{\lambda_k^{\frac{\alpha}{K}-1} \exp\{-\lambda_k\}}{\Gamma(\frac{\alpha}{K})}.    (2)

The hyperparameter α itself is given a vague gamma prior G(α; α_0, β_0). Using the above equations we can easily integrate out the parameters {λ_k} as follows:

P(Z \mid \alpha) = \prod_{k=1}^{K} \frac{\Gamma(m_k + \frac{\alpha}{K})}{\Gamma(\frac{\alpha}{K}) \, (N+1)^{m_k + \frac{\alpha}{K}} \prod_{n=1}^{N} z_{nk}!},    (3)

which shows that, given the hyperparameter α, the columns of Z are independent. Note that the above distribution is exchangeable since reordering the rows of Z does not alter the probability. Also, as K increases the distribution favors sparsity. This can be shown by taking the expectation of the sum of all elements of Z. Since the columns are independent, this expectation is K \sum_{n=1}^{N} E(z_{nk}), and E(z_{nk}) is given by

E(z_{nk}) = \sum_{z_{nk}=0}^{\infty} z_{nk} \, NB\left(z_{nk}; \frac{\alpha}{K}, \frac{1}{2}\right) = \frac{\alpha}{K},    (4)

where NB(z_nk; r, p), with r > 0 and 0 < p < 1, denotes the negative binomial distribution over non-negative integers

NB(z_{nk}; r, p) = \frac{\Gamma(r + z_{nk})}{z_{nk}! \, \Gamma(r)} \, p^r (1-p)^{z_{nk}},    (5)

which has a mean equal to r(1-p)/p. Using Equation (4), the expectation of the sum of the z_nks is αN and is independent of the number of features. As K increases, Z becomes sparser, and α controls the sparsity of this matrix.

There is an alternative way of deriving the joint distribution P(Z|α) according to the following generative process:

(\theta_1, \ldots, \theta_K) \sim D\left(\frac{\alpha}{K}\right), \quad \lambda \sim G(\lambda; \alpha, 1),
L_n \sim \mathrm{Poisson}(\lambda), \quad (z_{n1}, \ldots, z_{nK}) \sim \binom{L_n}{z_{n1} \ldots z_{nK}} \prod_{k=1}^{K} \theta_k^{z_{nk}}, \quad n = 1, \ldots, N,

where D(α/K) denotes the symmetric Dirichlet. Marginalizing out θ and λ gives rise to the same distribution P(Z|α). The above process generates a gamma random variable and multinomial parameters and then samples the rows of Z independently by using the Poisson-multinomial pair. The connection with the Dirichlet-multinomial pair implies that the infinite limit of the gamma-Poisson model must be related to the Dirichlet process. In the next section we see how this connection is revealed through the Ewens's distribution [5].

Models that combine gamma and Poisson distributions are widely applied in statistics. We point out that the above finite model shares similarities with the techniques presented in [3, 4] that model text data.

3 The infinite limit and the stochastic process

To express the probability distribution in (3) for infinitely many features K, we need to consider equivalence classes of Z matrices, similarly to [7]. The association of columns in Z with features defines an arbitrary labelling of the features. Given that the likelihood p(X|Z) is not affected by relabelling the features, there is an equivalence class of matrices that can all be reduced to the same standard form after column reordering. We define the left-ordered form of non-negative integer valued matrices as follows. We assume that for any possible z_nk it holds that z_nk ≤ c − 1, where c is a sufficiently large integer. We define h = (z_1k ... z_Nk) as the integer number associated with column k, expressed in a numeral system with base c. The left-ordered form is defined so that the columns of Z appear from left to right in decreasing order according to the magnitude of their numbers.

Starting from Equation (3), we wish to define the probability distribution over matrices constrained to a left-ordered standard form. Let K_h be the multiplicity of the column with number h; for example, K_0 is the number of zero columns.
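As an illustration, the left-ordered form can be computed by reading each column as a base-c number (with z_1k as the most significant digit) and sorting the columns by that key in decreasing order. The following is a minimal Python sketch; the function name `left_ordered` and the use of NumPy are our own choices, not from the paper, and the integer keys may overflow for very large N:

```python
import numpy as np

def left_ordered(Z, c=None):
    """Return Z with columns sorted into the left-ordered form: each
    column is read as a base-c number (z_1k most significant) and
    columns are arranged in decreasing order of that number."""
    Z = np.asarray(Z)
    if c is None:
        c = Z.max() + 1  # any base larger than every entry works
    # Base-c weight of each row position, row 1 most significant.
    weights = c ** np.arange(Z.shape[0] - 1, -1, -1)
    keys = weights @ Z                      # one integer per column
    order = np.argsort(-keys, kind="stable")
    return Z[:, order]

Z = np.array([[0, 2, 1],
              [1, 0, 1]])
print(left_ordered(Z))  # columns reordered by their base-3 numbers
```

Here the column keys are 1, 6 and 4, so the columns are rearranged into the order (6, 4, 1).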
An equivalence class [Z] consists of K! / \prod_{h=0}^{c^N-1} K_h! different matrices that are generated from the distribution in (3) with equal probabilities and can be reduced to the same left-ordered form. Thus, the probability of [Z] is

P([Z]) = \frac{K!}{\prod_{h=0}^{c^N-1} K_h!} \prod_{k=1}^{K} \frac{\Gamma(m_k + \frac{\alpha}{K})}{\Gamma(\frac{\alpha}{K}) \, (N+1)^{m_k + \frac{\alpha}{K}} \prod_{n=1}^{N} z_{nk}!}.    (6)

We assume that the first K_+ features are represented, i.e. m_k > 0 for k ≤ K_+, while the remaining K − K_+ features are unrepresented, i.e. m_k = 0 for k > K_+. The infinite limit of (6) is derived by following a strategy similar to the one used for expressing the distribution over partitions of objects as a limit of the Dirichlet-multinomial pair [6, 9]. The limit takes the following form:

P(Z \mid \alpha) = \frac{1}{\prod_{h=1}^{c^N-1} K_h!} \, \frac{\alpha^{K_+}}{(N+1)^{m+\alpha}} \, \frac{\prod_{k=1}^{K_+} (m_k - 1)!}{\prod_{k=1}^{K_+} \prod_{n=1}^{N} z_{nk}!},    (7)

where m = \sum_{k=1}^{K_+} m_k. This expression defines an exchangeable joint distribution over non-negative integer valued matrices with infinitely many columns in a left-ordered form. Next we present a sequential stochastic process that reproduces this distribution.

3.1 The stochastic process

The distribution in Equation (7) can be derived from a simple stochastic process that constructs the matrix Z sequentially, so that the data arrive one at a time in a fixed order. The steps of this stochastic process are discussed below.

When the first data point arrives, all the features are unrepresented. We sample feature occurrences from the set of unrepresented features as follows. Firstly, we draw an integer g_1 from the negative binomial NB(g_1; α, 1/2), which has a mean value equal to α. g_1 is the total number of feature occurrences for the first data point. Given g_1, we randomly select a partition (z_11, ..., z_1K_1) of the integer g_1 into parts¹, i.e. z_11 + ... + z_1K_1 = g_1 and 1 ≤ K_1 ≤ g_1, by drawing from Ewens's distribution [5] over integer partitions, which is given by

P(z_{11}, \ldots, z_{1K_1}) = \alpha^{K_1} \frac{\Gamma(\alpha)}{\Gamma(g_1 + \alpha)} \, \frac{g_1!}{z_{11} \times \cdots \times z_{1K_1}} \prod_{i=1}^{g_1} \frac{1}{v_i^{(1)}!},    (8)

where v_i^{(1)} is the multiplicity of integer i in the partition (z_11, ..., z_1K_1). The Ewens's distribution is equivalent to the distribution over partitions of objects induced by the Dirichlet process and the Chinese restaurant process, since we can derive one from the other using simple combinatorial arguments. The difference between them is that the former is a distribution over integer partitions while the latter is a distribution over partitions of objects.

¹The partition of a positive integer is a way of writing this integer as a sum of positive integers where order does not matter, e.g. the partitions of 3 are: (3), (2,1) and (1,1,1).

Let K_{n−1} be the number of represented features when the nth data point arrives. For each feature k, with k ≤ K_{n−1}, we choose z_nk based on the popularity of this feature in the previous n − 1 data points. This popularity is expressed by the total number of occurrences for the feature k, given by m_k = \sum_{i=1}^{n-1} z_{ik}. Particularly, we draw z_nk from NB(z_nk; m_k, n/(n+1)), which has a mean value equal to m_k/n. Once we have sampled from all represented features, we need to consider a sample from the set of unrepresented features. Similarly to the first data point, we first draw an integer g_n from NB(g_n; α, n/(n+1)), and subsequently we select a partition of that integer by drawing from the Ewens's formula. This process produces the following distribution:

P(Z \mid \alpha) = \frac{1}{\prod_{i=1}^{g_1} v_i^{(1)}! \times \cdots \times \prod_{i=1}^{g_N} v_i^{(N)}!} \, \frac{\alpha^{K_+}}{(N+1)^{m+\alpha}} \, \frac{\prod_{k=1}^{K_+} (m_k - 1)!}{\prod_{k=1}^{K_+} \prod_{n=1}^{N} z_{nk}!},    (9)

where {v_i^{(n)}} are the integer multiplicities for the nth data point which arise when we draw from the Ewens's distribution. Note that the above expression does not have exactly the same form as the distribution in Equation (7) and is not exchangeable, since it depends on the order in which the data arrive. However, if we consider only the left-ordered class of matrices generated by the stochastic process, then we obtain the exchangeable distribution in Equation (7). Note that a similar situation arises with the Indian buffet process.

3.2 Conditional distributions

When we combine the prior P(Z|α) with a likelihood model p(X|Z) and we wish to do inference over Z using Gibbs-type sampling, we need to express conditionals of the form P(z_nk | Z_{−(nk)}, α), where Z_{−(nk)} = Z \ z_nk. We can derive such conditionals by taking limits of the conditionals for the finite model or by using the stochastic process.

Suppose that for the current value of Z there exist K_+ represented features, i.e. m_k > 0 for k ≤ K_+. Let m_{−n,k} = \sum_{\tilde{n} \neq n} z_{\tilde{n}k}. When m_{−n,k} > 0, the conditional of z_nk is given by NB(z_nk; m_{−n,k}, N/(N+1)). In all other cases, we need a special conditional that samples from new features² and accounts for all k such that m_{−n,k} = 0. This conditional draws an integer g_n from NB(g_n; α, N/(N+1)) and then determines the occurrences for the new features by choosing a partition of the integer g_n using the Ewens's distribution. Finally, the conditional p(α|Z), which can be directly expressed from Equation (7) and the prior of α, is given by

p(\alpha \mid Z) \propto G(\alpha; \alpha_0, \beta_0) \, \frac{\alpha^{K_+}}{(N+1)^{\alpha}}.    (10)

Typically the likelihood model does not depend on α and thus the above quantity is also the posterior conditional of α given the data and Z.

²Features of this kind are the unrepresented features (k > K_+) as well as all the unique features that occur only in the data point n (i.e. m_{−n,k} = 0, but z_nk > 0).

4 A likelihood model for images

An image can contain multiple objects of different classes. Each object class can have more than one occurrence, i.e. multiple instances of the class may appear simultaneously in the image. Unsupervised learning should deal with the unknown number of object classes in the images and also the unknown number of occurrences of each class in each image separately. If object classes are the latent features, what we wish to infer is the underlying feature occurrence matrix Z. We consider an observation model that is a combination of latent Dirichlet allocation [2] and Gaussian mixture models. Such a combination has been used before [12]. Each image n is represented by d_n local patches that are detected in the image, so that X_n = (Y_n, W_n) = {(y_ni, w_ni), i = 1, ..., d_n}. y_ni is the two-dimensional location of patch i and w_ni is an indicator vector (i.e. it is binary and satisfies \sum_{\ell=1}^{L} w_{ni}^{\ell} = 1) that points into a set of L possible visual appearances. X, Y, and W denote all the data, the locations, and the appearances, respectively. We will describe the probabilistic model starting from the joint distribution of all variables, which is given by

p(\text{joint}) = p(\alpha) P(Z \mid \alpha) \, p(\{\theta_k\} \mid Z) \times \prod_{n=1}^{N} \Big[ p(\pi_n \mid Z_n) \, p(m_n, \Sigma_n \mid Z_n) \prod_{i=1}^{d_n} P(s_{ni} \mid \pi_n) \, P(w_{ni} \mid s_{ni}, \{\theta_k\}) \, p(y_{ni} \mid s_{ni}, m_n, \Sigma_n) \Big].    (11)

Figure 1: Graphical model for the joint distribution in Equation (11).

The graphical representation of this distribution is depicted in Figure 1. We now explain all the pieces of this joint distribution following the causal structure of the graphical model. Firstly, we generate α from its prior and then we draw the feature occurrence matrix Z using the infinite gamma-Poisson prior P(Z|α). The matrix Z defines the structure for the remaining part of the model. The parameter vector θ_k = {θ_k1, ..., θ_kL} describes the appearance of the local patches W for the feature (object class) k. Each θ_k is generated from a symmetric Dirichlet, so that the whole set of {θ_k} vectors is drawn from p({θ_k}|Z) = \prod_{k=1}^{K_+} D(θ_k|γ), where γ is the hyperparameter of the symmetric Dirichlet and is common to all features. Note that the feature appearance parameters {θ_k} depend on Z only through the number of represented features K_+, which is obtained by counting the non-zero columns of Z.

The parameter vector π_n = {π_nkj} defines the image-specific mixing proportions for the mixture model associated with image n. To see how this mixture model arises, notice that a local patch in image n belongs to a certain occurrence of a feature. We use the double index kj to denote the jth occurrence of feature k, where j = 1, ..., z_nk and k ∈ {k̃ : z_nk̃ > 0}. This mixture model has M_n = \sum_{k=1}^{K_+} z_{nk} components, i.e. as many as the total number of feature occurrences in image n. The assignment variable s_ni = {s_ni^kj}, which takes M_n values, indicates the feature occurrence of patch i. π_n is drawn from a symmetric Dirichlet given by p(π_n|Z_n) = D(π_n|β/M_n), where Z_n denotes the nth row of Z and β is a hyperparameter shared by all images.
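The sequential construction of Section 3.1 can be sketched in code. The following Python fragment draws Z row by row: popularity-based negative binomial draws for represented features, then a negative binomial total for new features partitioned by the Ewens's distribution, which is sampled here via the equivalent Chinese restaurant process seating scheme mentioned above. All names are our own and this is a sketch, not the author's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_partition(g, alpha):
    """Partition the integer g by seating g customers with a Chinese
    restaurant process; the table sizes follow Ewens's distribution."""
    tables = []
    for i in range(g):
        if rng.random() < alpha / (i + alpha):   # open a new table
            tables.append(1)
        else:                                    # join a table, prob ∝ size
            probs = np.array(tables) / i
            tables[rng.choice(len(tables), p=probs)] += 1
    return tables

def sample_Z(N, alpha):
    """Sequentially construct Z (a sketch of Section 3.1)."""
    Z = []   # rows so far; grows in width as new features appear
    m = []   # total occurrences m_k of each represented feature
    for n in range(1, N + 1):
        # represented features: NB(z; m_k, n/(n+1)), mean m_k/n
        row = [int(rng.negative_binomial(mk, n / (n + 1))) for mk in m]
        # new features: total g_n, then an Ewens partition of g_n
        g = int(rng.negative_binomial(alpha, n / (n + 1)))
        new_parts = crp_partition(g, alpha) if g > 0 else []
        m.extend([0] * len(new_parts))
        row.extend(new_parts)
        for k, z in enumerate(row):              # update popularities
            m[k] += z
        Z = [r + [0] * (len(row) - len(r)) for r in Z]  # pad old rows
        Z.append(row)
    return np.array(Z, dtype=int)

Z = sample_Z(N=5, alpha=2.0)
print(Z)
```

Because every new feature is created with at least one occurrence, each column of the sampled matrix has a positive sum, matching the convention that represented features satisfy m_k > 0.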
Notice that π_n depends only on the nth row of Z.

The parameters (m_n, Σ_n) determine the image-specific distribution for the locations {y_ni} of the local patches in image n. We assume that each occurrence of a feature forms a Gaussian cluster of patch locations. Thus y_ni follows an image-specific Gaussian mixture with M_n components. We assume that the component kj has mean m_nkj and covariance Σ_nkj; m_nkj describes object location and Σ_nkj object shape. m_n and Σ_n collect all the means and covariances of the clusters in image n. Given that any object can be anywhere in the image and have arbitrary scale and orientation, (m_nkj, Σ_nkj) should be drawn from a quite vague prior. We use a conjugate normal-Wishart prior for the pair (m_nkj, Σ_nkj), so that

p(m_n, \Sigma_n \mid Z_n) = \prod_{k: z_{nk} > 0} \prod_{j=1}^{z_{nk}} N(m_{nkj} \mid \mu, \tau \Sigma_{nkj}) \, W(\Sigma_{nkj}^{-1} \mid v, V),    (12)

where (μ, τ, v, V) are hyperparameters shared by all features and images. The assignment s_ni, which determines the allocation of a local patch to a certain feature occurrence, follows a multinomial:

P(s_{ni} \mid \pi_n) = \prod_{k: z_{nk} > 0} \prod_{j=1}^{z_{nk}} (\pi_{nkj})^{s_{ni}^{kj}}.

Similarly, the observed data pair (w_ni, y_ni) of a local image patch is generated according to

P(w_{ni} \mid s_{ni}, \{\theta_k\}) = \prod_{k=1}^{K_+} \prod_{\ell=1}^{L} \theta_{k\ell}^{\, w_{ni}^{\ell} \sum_{j=1}^{z_{nk}} s_{ni}^{kj}}

and

p(y_{ni} \mid s_{ni}, m_n, \Sigma_n) = \prod_{k: z_{nk} > 0} \prod_{j=1}^{z_{nk}} \left[ N(y_{ni} \mid m_{nkj}, \Sigma_{nkj}) \right]^{s_{ni}^{kj}}.

The hyperparameters (γ, β, μ, τ, v, V) take fixed values that give vague priors and they are not depicted in the graphical model shown in Figure 1.

Since we have chosen conjugate priors, we can analytically marginalize out from the joint distribution all the parameters {π_n}, {θ_k}, {m_n} and {Σ_n} and obtain p(X, S, Z, α).
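To make the per-image generative structure of Equation (11) concrete, the following Python sketch samples the patches of a single image given one row of Z. For brevity, the conjugate normal-Wishart prior on each location cluster is replaced by a fixed vague Gaussian over locations; this substitution, together with all function and parameter names, is an assumption made only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_image(z_n, theta, d_n, beta=1.0, L=20, img_size=100.0):
    """Sample d_n (location, appearance) patch pairs for one image.
    z_n: occurrence counts for the represented features (a row of Z).
    theta: (K+, L) appearance multinomials, one per feature.
    NOTE: the normal-Wishart prior is replaced by a fixed vague
    Gaussian over cluster means (illustration-only assumption)."""
    occ = [(k, j) for k, zk in enumerate(z_n) for j in range(zk)]
    M_n = len(occ)                          # mixture components M_n
    pi_n = rng.dirichlet([beta / M_n] * M_n)
    means = rng.uniform(0, img_size, size=(M_n, 2))   # one cluster per occurrence
    cov = np.eye(2) * (img_size / 20.0) ** 2
    patches = []
    for _ in range(d_n):
        s = rng.choice(M_n, p=pi_n)         # occurrence this patch belongs to
        k, _j = occ[s]
        w = rng.choice(L, p=theta[k])       # discrete visual appearance
        y = rng.multivariate_normal(means[s], cov)    # 2-D location
        patches.append((y, w))
    return patches

theta = rng.dirichlet(np.ones(20), size=3)  # 3 features, 20 appearances
patches = generate_image(z_n=[2, 0, 1], theta=theta, d_n=50)
print(len(patches))
```

Here the row z_n = (2, 0, 1) yields M_n = 3 mixture components: two occurrences of the first feature and one of the third, each with its own location cluster but sharing its feature's appearance multinomial.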
Marginalizing out the assignments S is generally intractable, and the MCMC algorithm discussed next produces samples from the posterior P(S, Z, α|X).

4.1 MCMC inference

Inference with our model involves expressing the posterior P(S, Z, α|X) over the feature occurrences Z, the assignments S and the parameter α. Note that the joint P(S, Z, α, X) factorizes according to p(α) P(Z|α) P(W|S, Z) \prod_{n=1}^{N} P(S_n|Z_n) \, p(Y_n|S_n, Z_n), where S_n denotes the assignments associated with image n. Our algorithm mainly uses Gibbs-type sampling from conditional posterior distributions. Due to space limitations we briefly discuss the main points of this algorithm.

The MCMC algorithm processes the rows of Z iteratively and updates their values. A single step can change an element of Z by one, so that |z_nk^new − z_nk^old| ≤ 1. Initially Z is such that M_n = \sum_{k=1}^{K_+} z_{nk} ≥ 1 for any n, which means that at least one mixture component explains the data of each image. The proposal distribution for changing the z_nks ensures that this constraint remains satisfied.

Suppose we wish to sample a new value for z_nk using the joint model p(S, Z, α, X). Simply writing P(z_nk|S, Z_{−(nk)}, α, X) is not useful, since when z_nk changes the number of states the assignments S_n can take also changes. This is clear since z_nk is a structural variable that affects the number of components M_n = \sum_{k=1}^{K_+} z_{nk} of the mixture model associated with image n and the assignments S_n. On the other hand, the dimensionality of the assignments S_{−n} = S \ S_n of all other images is not affected when z_nk changes.
To deal with the above, we marginalize out S_n and sample z_nk from the marginalized posterior conditional P(z_nk|S_{−n}, Z_{−(nk)}, α, X), which is computed according to

P(z_{nk} \mid S_{-n}, Z_{-(nk)}, \alpha, X) \propto P(z_{nk} \mid Z_{-(nk)}, \alpha) \sum_{S_n} P(W \mid S, Z) \, p(Y_n \mid S_n, Z_n) \, P(S_n \mid Z_n),    (13)

where P(z_nk|Z_{−(nk)}, α) for the infinite case is computed as described in section 3.2, while computing the sum requires an approximation. This sum is a marginal likelihood and we apply importance sampling, using as importance distribution the posterior conditional P(S_n|S_{−n}, Z, W, Y_n) [10]. Sampling from P(S_n|S_{−n}, Z, W, Y_n) is carried out by applying local Gibbs sampling moves and global Metropolis moves that allow two occurrences of different features to exchange their data clusters. In our implementation we consider a single sample drawn from this posterior distribution, so that the sum is approximated by P(W|S_n^*, S_{−n}, Z) \, p(Y_n|S_n^*, Z_n), where S_n^* is a sample accepted after a burn-in period. In addition to the scans that update Z and S, we add a few Metropolis-Hastings steps that update the hyperparameter α using the posterior conditional given by Equation (10).

5 Experiments

In the first experiment we use a set of 10 artificial images. We consider four features that have the regular shapes shown in Figure 2. The discrete patch appearances correspond to pixels and can take 20 possible grayscale values. Each feature has its own multinomial distribution over the appearances.
To generate an image we first decide to include each feature with probability 0.5. Then for each included feature we randomly select the number of occurrences from the range [1, 3]. For each feature occurrence we select the pixels using the appearance multinomial and place the respective feature shape in a random location so that feature occurrences do not occlude each other. The first row of Figure 2 shows a training image (left), the locations of pixels (middle) and the discrete appearances (right). The MCMC algorithm was initialized with K_+ = 1, α = 1 and z_n1 = 1, n = 1, ..., 10. The third row of Figure 2 shows how K_+ (left) and the sum of all z_nks (right) evolve through the first 500 MCMC iterations. Within the first 20 iterations the algorithm has visited the matrix Z that was used to generate the data, and it then stabilizes. For 86% of the samples K_+ is equal to four. For the state (Z, S) that is most frequently visited, the second row of Figure 2 shows the localizations of all different feature occurrences in three images.

Figure 2: The first row shows a training image n (left), the locations of pixels Y_n (middle) and the discrete appearances W_n (right). The second row shows the localizations of all feature occurrences in three images; below each image the corresponding row of Z is also shown (here: 1 3 3 1, 3 2 3 0 and 0 2 1 2). The third row shows how K_+ (left) and the sum of all z_nks (right) evolve through the first 500 MCMC iterations.

Figure 3: The leftmost plot on the first row shows the locations of detected patches and the bounding boxes in one of the annotated images. The remaining five plots show examples of detections and localizations of the three most dominant features (including the car category) in five non-annotated images.
Each ellipse is drawn using the posterior mean values for a pair (m_nkj, Σ_nkj) and illustrates the predicted location and shape of a feature occurrence. Note that ellipses with the same color correspond to different occurrences of the same feature.

In the second experiment we consider 25 real images from the UIUC³ cars database. We used the patch detection method presented in [8] and we constructed a dictionary of 200 visual appearances by clustering the SIFT [8] descriptors of the patches using K-means. Locations of detected patches are shown in the first row (left) of Figure 3. We partially labelled some of the images. In particular, for 7 out of the 25 images we annotated the car views using bounding boxes (Figure 3). This allows us to specify seven elements of the first column of the matrix Z (the first feature will correspond to the car category). These z_nk values, plus the assignments of all patches inside the boxes, do not change during sampling. Also, the patches that lie outside the boxes in all annotated images are not allowed to be part of car occurrences. This is achieved by applying partial Gibbs sampling updates and Metropolis moves when sampling the assignments S. The algorithm is initialized with K_+ = 1, stabilizes after 30 iterations, and then fluctuates between nine and twelve features. To keep the plots uncluttered, Figure 3 shows the detections and localizations of only the three most dominant features (including the car category) in five non-annotated images. The red ellipses correspond to different occurrences of the car feature, the green ones to a tree feature and the blue ones to a street feature.

6 Discussion

We presented the infinite gamma-Poisson model, which is a nonparametric prior over non-negative integer valued matrices with an infinite number of columns.
We discussed the use of this prior for unsupervised learning where multiple features are associated with our data and each feature can have multiple occurrences within each data point. The infinite gamma-Poisson prior can be used for other purposes as well. For example, an interesting application can be Bayesian matrix factorization, where a matrix of observations is decomposed into a product of two or more matrices with one of them being a non-negative integer valued matrix.

References

[1] C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174, 1974.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[3] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In UAI, 2004.
[4] J. Canny. GaP: A factor model for discrete data. In SIGIR, pages 122–129. ACM Press, 2004.
[5] W. Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87–112, 1972.
[6] P. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28:355–377, 2001.
[7] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS 18, 2006.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[9] R. M. Neal. Bayesian mixture modeling. In 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, pages 197–211, 1992.
[10] M. A. Newton and A. E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B, 56:3–48, 1994.
[11] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7:51–71, 1995.
[12] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Describing visual scenes using transformed Dirichlet processes. In NIPS 18, 2006.

³Available from http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/.
", "award": [], "sourceid": 331, "authors": [{"given_name": "Michalis", "family_name": "Titsias", "institution": null}]}