{"title": "Linearly constrained Bayesian matrix factorization for blind source separation", "book": "Advances in Neural Information Processing Systems", "page_first": 1624, "page_last": 1632, "abstract": "We present a general Bayesian approach to probabilistic matrix factorization subject to linear constraints. The approach is based on a Gaussian observation model and Gaussian priors with bilinear equality and inequality constraints. We present an efficient Markov chain Monte Carlo inference procedure based on Gibbs sampling. Special cases of the proposed model are Bayesian formulations of non-negative matrix factorization and factor analysis. The method is evaluated on a blind source separation problem. We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.", "full_text": "Linearly constrained Bayesian matrix factorization\n\nfor blind source separation\n\nMikkel N. Schmidt\n\nDepartment of Engineering\nUniversity of Cambridge\n\nmns@imm.dtu.dk\n\nAbstract\n\nWe present a general Bayesian approach to probabilistic matrix factorization sub-\nject to linear constraints. The approach is based on a Gaussian observation model\nand Gaussian priors with bilinear equality and inequality constraints. We present\nan ef\ufb01cient Markov chain Monte Carlo inference procedure based on Gibbs sam-\npling. Special cases of the proposed model are Bayesian formulations of non-\nnegative matrix factorization and factor analysis. The method is evaluated on a\nblind source separation problem. 
We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.

1 Introduction

Source separation problems arise when a number of signals are mixed together, and the objective is to estimate the underlying sources based on the observed mixture. In the supervised, model-based approach to source separation, examples of isolated sources are used to train source models, which are then combined in order to separate a mixture. In contrast, in unsupervised, blind source separation, only very general information about the sources is available. Instead of estimating models of the sources, blind source separation is based on relatively weak criteria such as minimizing correlations, maximizing statistical independence, or fitting data subject to constraints.

Under the assumptions of linear mixing and additive noise, blind source separation can be expressed as a matrix factorization problem,

\underset{I\times J}{X} = \underset{I\times K}{A}\,\underset{K\times J}{B} + \underset{I\times J}{N}, \quad \text{or equivalently,} \quad x_{ij} = \sum_{k=1}^{K} a_{ik} b_{kj} + n_{ij},   (1)

where the subscripts below the matrices denote their dimensions. The columns of A represent K unknown sources, and the elements of B are the mixing coefficients. Each of the J columns of X contains an observation that is a mixture of the sources plus additive noise represented by the columns of N. The objective is to estimate the sources, A, as well as the mixing coefficients, B, when only the data matrix, X, is observed.
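As a quick numerical illustration of the mixing model in Eq. (1), here is a small numpy sketch (arbitrary illustrative sizes, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 6, 8, 2                       # illustrative dimensions

A = rng.random((I, K))                  # K unknown sources in the columns of A
B = rng.random((K, J))                  # mixing coefficients
N = 0.1 * rng.standard_normal((I, J))   # additive Gaussian noise

X = A @ B + N                           # observed mixtures, Eq. (1)

# Element-wise form: x_ij = sum_k a_ik * b_kj + n_ij
assert np.allclose(X[0, 0], sum(A[0, k] * B[k, 0] for k in range(K)) + N[0, 0])
```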
In a Bayesian formulation, the aim is not to compute a single value for A and B, but to infer their posterior distribution under a set of model assumptions. These assumptions are specified in the likelihood function, p(X|A, B), which expresses the probability of the data given the factorizing matrices, and in the prior, p(A, B), which describes available knowledge before observing the data. Depending on the specific choice of likelihood and priors, matrix factorizations with different characteristics can be devised.

Non-negative matrix factorization (NMF), which is distinguished from other matrix factorization techniques by its non-negativity constraints, has been shown to decompose data into meaningful, interpretable parts [3]; however, a parts-based decomposition is not necessarily useful, unless it finds the "correct" parts.

[Figure 1 shows eight panels: (a) linear subspace (no constraints); (b) affine subspace (Σ_k b_kj = 1); (c) simplicial cone (b_kj ≥ 0); (d) simplicial cone in the non-negative orthant (a_ik ≥ 0, b_kj ≥ 0); (e) polytope (b_kj ≥ 0, Σ_k b_kj = 1); (f) polytope in the non-negative orthant (a_ik ≥ 0, b_kj ≥ 0, Σ_k b_kj = 1); (g) polytope on the unit simplex (a_ik ≥ 0, Σ_i a_ik = 1, b_kj ≥ 0, Σ_k b_kj = 1); (h) polytope in the unit hypercube (0 ≤ a_ik ≤ 1, b_kj ≥ 0, Σ_k b_kj = 1).]

Figure 1: Examples of model spaces that can be attained using matrix factorization with different linear constraints on A and B. The red hatched area indicates the feasible region for the source vectors (columns of A), dots are examples of specific positions of source vectors, and the black hatched area is the corresponding feasible region for the data vectors. Special cases include (a) factor analysis and (d) non-negative matrix factorization.
The main contribution of this paper is to show that specifying relevant constraints other than non-negativity substantially changes the quality of the results obtained using matrix factorization. Some intuition about how the incorporation of different constraints affects the matrix factorization can be gained by considering their geometric implications. Figure 1 shows how different linear constraints on A and B constrain the model space. For example, if the mixing coefficients are constrained to be non-negative, the data is modelled as lying in the simplicial cone spanned by the sources, and if the mixing coefficients are further constrained to sum to unity, the data is modelled as lying in the convex polytope whose vertices are the sources.

In this paper, we develop a general and flexible framework for Bayesian matrix factorization, in which the unknown sources and the mixing coefficients are treated as hidden variables. Furthermore, we allow any number of linear equality or inequality constraints to be specified. On an unsupervised image separation problem, we demonstrate that when relevant constraints are specified, the method finds interpretable features that are remarkably different from features computed using other matrix factorization techniques.

The proposed method is related to recently proposed Bayesian matrix factorization techniques: Bayesian matrix factorization based on Gibbs sampling has been demonstrated [7, 8] to scale up to very large datasets and to avoid the problem of overfitting associated with non-Bayesian techniques. Bayesian methods for non-negative matrix factorization have also been proposed, based either on variational inference [1] or on Gibbs sampling [4, 9]. The latter can be seen as special cases of the algorithm proposed here.

The paper is structured as follows: In Section 2, the linearly constrained Bayesian matrix factorization model is described. Section 3 presents an inference procedure based on Gibbs sampling.
In Section 4, the method is applied to an unsupervised source separation problem and compared to other existing matrix factorization methods. We discuss our results and conclude in Section 5.

2 The linearly constrained Bayesian matrix factorization model

In the following, we describe the linearly constrained Bayesian matrix factorization model. We make specific choices for the likelihood and priors that keep the formulation general while allowing for efficient inference based on Gibbs sampling.

2.1 Noise model

We choose an i.i.d. zero-mean Gaussian noise model,

p(n_{ij}) = \mathcal{N}(n_{ij} \mid 0, v_{ij}) = \frac{1}{\sqrt{2\pi v_{ij}}} \exp\Big({-}\frac{n_{ij}^2}{2 v_{ij}}\Big),   (2)

where, in the most general formulation, each matrix element has its own variance, v_{ij}; however, the variance parameters can easily be joined, e.g., to have a single noise variance per row or just one overall variance, which corresponds to an isotropic noise model. The noise model gives rise to the likelihood, i.e., the probability of the observations given the parameters of the model. The likelihood is given by

p(x \mid \theta) = \prod_{i=1}^{I} \prod_{j=1}^{J} \mathcal{N}\Big(x_{ij} \,\Big|\, \sum_{k=1}^{K} a_{ik} b_{kj},\, v_{ij}\Big) = \prod_{i=1}^{I} \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi v_{ij}}} \exp\Big({-}\frac{\big(x_{ij} - \sum_{k=1}^{K} a_{ik} b_{kj}\big)^2}{2 v_{ij}}\Big),   (3)

where \theta = \{A, B, \{v_{ij}\}\} denotes all parameters in the model. For the noise variance parameters we choose conjugate inverse-gamma priors,

p(v_{ij}) = \mathcal{IG}(v_{ij} \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} v_{ij}^{-(\alpha+1)} \exp\Big({-}\frac{\beta}{v_{ij}}\Big).   (4)

2.2 Priors for sources and mixing coefficients

We now define the prior distribution for the factorizing matrices, A and B. To simplify the notation, we specify the matrices by vectors a = vec(A^\top) = [a_{11}, a_{12}, \ldots, a_{IK}]^\top and b = vec(B) = [b_{11}, b_{21}, \ldots, b_{KJ}]^\top.
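In numpy terms, these two vectorizations correspond to the following small sketch (illustrative arrays, not code from the paper):

```python
import numpy as np

I, K, J = 3, 2, 3
A = np.arange(I * K).reshape(I, K)      # sources
B = np.arange(K * J).reshape(K, J)      # mixing coefficients

# a = vec(A^T) stacks the columns of A^T, i.e. the rows of A: [a11, a12, ..., aIK]
a = A.T.flatten(order="F")
# b = vec(B) stacks the columns of B: [b11, b21, ..., bKJ]
b = B.flatten(order="F")

assert np.array_equal(a, A.flatten())    # equivalent: row-wise stacking of A
assert np.array_equal(b, B.T.flatten())  # equivalent: column-wise stacking of B
```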
We choose a Gaussian prior over a and b subject to inequality constraints, Q, and equality constraints, R,

p(a, b) \propto \begin{cases} \mathcal{N}\Bigg(\begin{bmatrix} a \\ b \end{bmatrix} \,\Bigg|\, \underbrace{\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}}_{\equiv \mu},\ \underbrace{\begin{bmatrix} \Sigma_a & \Sigma_{ab} \\ \Sigma_{ab}^\top & \Sigma_b \end{bmatrix}}_{\equiv \Sigma}\Bigg), & \text{if } Q(a, b) \le 0,\ R(a, b) = 0, \\ 0, & \text{otherwise.} \end{cases}   (5)

In slight abuse of notation, we refer to \mu and \Sigma as the mean and covariance matrix, although the actual mean and covariance of a and b depend on the constraints.

In the most general formulation, the constraints, Q : R^{IK} \times R^{KJ} \to R^{N_Q} and R : R^{IK} \times R^{KJ} \to R^{N_R}, are biaffine maps that define N_Q inequality and N_R equality constraints jointly in a and b. Specifically, each inequality constraint has the form

Q_m(a, b) = q_m + a^\top q_m^{(a)} + b^\top q_m^{(b)} + a^\top Q_m^{(ab)} b \le 0.   (6)

By rearranging terms and combining the N_Q constraints in matrix notation, we may write

\underbrace{\big[\, q_1^{(a)} + Q_1^{(ab)} b \ \cdots \ q_{N_Q}^{(a)} + Q_{N_Q}^{(ab)} b \,\big]^\top}_{\equiv Q_a^\top} a \le \underbrace{\begin{bmatrix} -q_1 - b^\top q_1^{(b)} \\ \vdots \\ -q_{N_Q} - b^\top q_{N_Q}^{(b)} \end{bmatrix}}_{\equiv q_a}, \qquad \text{i.e.,} \quad Q_a^\top a \le q_a,   (7)

from which it is clear that the constraints are linear in a. Likewise, the constraints can be rearranged to a linear form in b.
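As a concrete check of the rearrangement in Eq. (7), here is a small numpy sketch (random illustrative dimensions and coefficients, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
IK, KJ, NQ = 4, 3, 2    # illustrative dimensions

# One bilinear inequality per m: q_m + a.T @ qa_m + b.T @ qb_m + a.T @ Qab_m @ b <= 0
q   = rng.standard_normal(NQ)
qa  = rng.standard_normal((IK, NQ))
qb  = rng.standard_normal((KJ, NQ))
Qab = rng.standard_normal((NQ, IK, KJ))

b = rng.standard_normal(KJ)

# Rearranged linear form in a, Eq. (7): Qa.T @ a <= qa_vec
Qa = qa + np.stack([Qab[m] @ b for m in range(NQ)], axis=1)
qa_vec = -q - qb.T @ b

# Check that both forms agree for an arbitrary a
a = rng.standard_normal(IK)
lhs_bilinear = np.array([q[m] + a @ qa[:, m] + b @ qb[:, m] + a @ Qab[m] @ b
                         for m in range(NQ)])
assert np.allclose(lhs_bilinear, Qa.T @ a - qa_vec)
```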
The equality constraints, R, are defined analogously.

This general formulation of the priors allows all elements of a and b to have prior dependencies both through their covariance matrix, \Sigma_{ab}, and through the joint constraints; however, in some applications it is not relevant or practical to specify all of these dependencies in advance. We may restrict the model such that a and b are independent a priori by setting \Sigma_{ab}, Q_m^{(ab)}, and R_m^{(ab)} to zero, and restricting q_m^{(a)} = 0 for all m where q_m^{(b)} \ne 0 and vice versa. Furthermore, we can decouple the elements of A, or groups of elements such as rows or columns, by choosing \Sigma_a, Q_a, and R_a to have an appropriate block structure. Similarly, we can decouple elements of B.

[Figure: graphical model with nodes a_{ik} (plate k = 1...K), b_{kj} (plate j = 1...J), x_{ij}, and v_{ij} (plates i = 1...I, j = 1...J), and hyper-parameters \mu, \Sigma, Q, R and \alpha, \beta.]

Figure 2: Graphical model for linearly constrained Bayesian matrix factorization, when A and B are independent in the prior. White and grey nodes represent latent and observed variables, respectively, and arrows indicate stochastic dependencies. The colored plates denote repeated variables over the indicated indices.

2.3 Posterior distribution

Having specified the model and the prior densities, we can now write the posterior, which is the distribution of the parameters conditioned on the observed data and hyper-parameters. The posterior is given by

p(\theta \mid x, \psi) \propto p(a, b)\, p(x \mid \theta) \prod_{i=1}^{I} \prod_{j=1}^{J} p(v_{ij}),   (8)

where \psi = \{\alpha, \beta, \mu, \Sigma, Q, R\} denotes all hyper-parameters in the model. A graphical representation of the model is given in Figure 2.

3 Inference

In a Bayesian framework, we are interested in computing the posterior distribution over the parameters, p(\theta \mid x, \psi). The posterior, given in Eq.
(8), is only known up to a multiplicative constant, and direct computation of this normalizing constant involves integrating over the unnormalized posterior, which is not analytically tractable. Instead, we approximate the posterior distribution using Markov chain Monte Carlo (MCMC).

3.1 Gibbs sampling

We propose an inference procedure based on Gibbs sampling. Gibbs sampling is applicable when the joint density of the parameters is not known, but the parameters can be partitioned into groups such that their posterior conditional densities are known. We iteratively sweep through the groups of parameters and generate a random sample for each, conditioned on the current value of the others. This procedure forms a homogeneous Markov chain, and its stationary distribution is exactly the joint posterior.

In the following, we derive the posterior conditional densities required in the Gibbs sampler. First, we consider the noise variances, v_{ij}. Due to the choice of conjugate prior, the posterior density is an inverse-gamma,

p(v_{ij} \mid \theta \backslash v_{ij}) = \mathcal{IG}(v_{ij} \mid \bar{\alpha}, \bar{\beta}),   (9)

\bar{\alpha} = \alpha + \tfrac{1}{2}, \qquad \bar{\beta} = \beta + \tfrac{1}{2}\Big(x_{ij} - \sum_{k=1}^{K} a_{ik} b_{kj}\Big)^2,   (10)

from which samples can be generated using standard acceptance-rejection methods.

Next, we consider the factorizing matrices, represented by the vectors a and b. We only discuss generating samples from a, since the sampling procedure for b is identical due to the symmetry of the model. Conditioned on b, the prior density of a is a constrained Gaussian,

p(a \mid b) \propto \begin{cases} \mathcal{N}(a \mid \tilde{\mu}_a, \tilde{\Sigma}_a), & \text{if } Q_a^\top a \le q_a,\ R_a^\top a = r_a, \\ 0, & \text{otherwise,} \end{cases}   (11)

\tilde{\mu}_a = \mu_a + \Sigma_{ab} \Sigma_b^{-1} (b - \mu_b), \qquad \tilde{\Sigma}_a = \Sigma_a - \Sigma_{ab} \Sigma_b^{-1} \Sigma_{ab}^\top,   (12)

where we have used Eq. (7) and the standard result for a conditional Gaussian density.
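Eq. (12) is the usual Schur-complement formula for conditioning a joint Gaussian; a minimal numpy sketch with illustrative dimensions (not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
na, nb = 3, 2                      # illustrative sizes of a and b

# Joint Gaussian over (a, b): mean and a positive-definite covariance
mu_a, mu_b = rng.standard_normal(na), rng.standard_normal(nb)
M = rng.standard_normal((na + nb, na + nb))
Sigma = M @ M.T + (na + nb) * np.eye(na + nb)
Sigma_a, Sigma_ab, Sigma_b = Sigma[:na, :na], Sigma[:na, na:], Sigma[na:, na:]

b = rng.standard_normal(nb)

# Eq. (12): moments of the Gaussian factor of p(a | b)
mu_tilde_a = mu_a + Sigma_ab @ np.linalg.solve(Sigma_b, b - mu_b)
Sigma_tilde_a = Sigma_a - Sigma_ab @ np.linalg.solve(Sigma_b, Sigma_ab.T)

assert np.all(np.linalg.eigvalsh(Sigma_tilde_a) > 0)   # still positive definite
```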
In the special case when a and b are independent in the prior, we simply have \tilde{\mu}_a = \mu_a and \tilde{\Sigma}_a = \Sigma_a. Further, conditioning on the data leads to the final expression for the posterior conditional density of a,

p(a \mid x, \theta \backslash a) \propto \begin{cases} \mathcal{N}(a \mid \bar{\mu}_a, \bar{\Sigma}_a), & \text{if } Q_a^\top a \le q_a,\ R_a^\top a = r_a, \\ 0, & \text{otherwise,} \end{cases}   (13)

\bar{\Sigma}_a = \big( \tilde{\Sigma}_a^{-1} + \tilde{B} V^{-1} \tilde{B}^\top \big)^{-1}, \qquad \bar{\mu}_a = \bar{\Sigma}_a \big( \tilde{\Sigma}_a^{-1} \tilde{\mu}_a + \tilde{B} V^{-1} x \big),   (14)

where V = diag(v_{11}, v_{12}, \ldots, v_{IJ}) and \tilde{B} = diag(B, \ldots, B) is a block-diagonal matrix with I repetitions of B.

The Gibbs sampler proceeds iteratively: First, the noise variances are generated from the inverse-gamma density in Eq. (9); second, a is generated from the constrained Gaussian density in Eq. (13); and third, b is generated from a constrained Gaussian analogous to Eq. (13).

3.2 Sampling from a constrained Gaussian

An essential component in the proposed matrix factorization method is an algorithm for generating random samples from a multivariate Gaussian density subject to linear equality and inequality constraints. With a slight change of notation, we consider generating x \in R^N from the density

p(x) \propto \begin{cases} \mathcal{N}(x \mid \mu_x, \Sigma_x), & \text{if } Q_x^\top x \le q_x,\ R_x^\top x = r_x, \\ 0, & \text{otherwise.} \end{cases}   (15)

A similar problem has previously been treated by Geweke [2], who proposes a Gibbs sampling procedure that handles at most N inequality constraints and no equality constraints. Rodriguez-Yam et al. [6] extend the method in [2] to an arbitrary number of inequality constraints, but do not provide an algorithm for handling equality constraints.
Here, we present a general Gibbs sampling procedure that handles any number of equality and inequality constraints.

The equality constraints restrict the distribution to an affine subspace of dimensionality N − R, where R is the number of linearly independent constraints. The conditional distribution on that subspace is a Gaussian subject to inequality constraints. To handle the equality constraints, we map the distribution onto this subspace. Using the singular value decomposition (SVD), we can robustly compute an orthonormal basis, T, for the constraints, as well as its orthogonal complement, T_\perp,

R_x = U S V^\top = \begin{bmatrix} T \\ T_\perp \end{bmatrix}^\top \begin{bmatrix} S_T & 0 \\ 0 & 0 \end{bmatrix} V^\top,   (16)

where S_T = diag(s_1, \ldots, s_R) holds the R non-zero singular values. We now define a transformed variable, y, that is related to x by

y = T_\perp (x - x_0), \quad y \in R^{N-R},   (17)

where x_0 is some vector that satisfies the equality constraints, e.g., computed using the pseudo-inverse, x_0 = (R_x^\top)^{\dagger} r_x. This transformation ensures that for any value of y, the corresponding x satisfies the equality constraints. We can now compute the distribution of y conditioned on the equality constraints, which is Gaussian subject to inequality constraints,

p(y \mid R_x^\top x = r_x) \propto \begin{cases} \mathcal{N}(y \mid \mu_y, \Sigma_y), & \text{if } Q_y^\top y \le q_y, \\ 0, & \text{otherwise,} \end{cases}   (18)

\mu_y = \Lambda (\mu_x - x_0), \quad \Sigma_y = \Lambda \Sigma_x T_\perp^\top, \quad Q_y = T_\perp Q_x, \quad q_y = q_x - Q_x^\top x_0,   (19)

where \Lambda = T_\perp \big(I - \Sigma_x T^\top (T \Sigma_x T^\top)^{-1} T\big).

We introduce a second transformation with the purpose of reducing the correlations between the variables. This may potentially improve the sampling procedure, because Gibbs sampling can suffer from slow mixing when the distribution is highly correlated.
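Before turning to the decorrelating transformation, the equality-constraint elimination of Eqs. (16)–(17) can be sketched in numpy (random illustrative constraints, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N, NR = 5, 2                           # illustrative dimension and constraint count

Rx = rng.standard_normal((N, NR))      # equality constraints: Rx.T @ x == rx
rx = rng.standard_normal(NR)

# SVD gives an orthonormal basis T for the constraints and its complement T_perp
U, s, Vt = np.linalg.svd(Rx)
R = int(np.sum(s > 1e-12))             # number of linearly independent constraints
T, T_perp = U[:, :R].T, U[:, R:].T

x0 = np.linalg.pinv(Rx.T) @ rx         # a particular solution of Rx.T @ x = rx

# For any y in R^(N-R), x = T_perp.T @ y + x0 satisfies the equality constraints
y = rng.standard_normal(N - R)
x = T_perp.T @ y + x0
assert np.allclose(Rx.T @ x, rx)
```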
Correlations between the elements of y are due to both the Gaussian covariance structure and the inequality constraints; however, for simplicity we only decorrelate with respect to the covariance of the underlying unconstrained Gaussian. To this end, we define the transformed variable, z, given by

z = L^{-\top} (y - \mu_y),   (20)

where L is the Cholesky factorization of the covariance matrix, L L^\top = \Sigma_y. The distribution of z is then a standard Gaussian subject to inequality constraints,

p(z \mid R_x^\top x = r_x) \propto \begin{cases} \mathcal{N}(z \mid 0, I), & \text{if } Q_z^\top z \le q_z, \\ 0, & \text{otherwise,} \end{cases}   (21)

Q_z = L Q_y, \qquad q_z = q_y - Q_y^\top \mu_y.   (22)

We can now sample from z using a Gibbs sampling procedure by sweeping over the elements z_i and generating samples from their conditional distributions, which are univariate truncated standard Gaussians,

p(z_i \mid z \backslash z_i) = \frac{\sqrt{2/\pi}\, \exp(-z_i^2/2)}{\mathrm{erf}\big(u_i/\sqrt{2}\big) - \mathrm{erf}\big(\ell_i/\sqrt{2}\big)} \propto \begin{cases} \mathcal{N}(z_i \mid 0, 1), & \ell_i \le z_i \le u_i, \\ 0, & \text{otherwise.} \end{cases}   (23)

Samples from this density can be generated using standard methods such as inverse transform sampling (transforming a uniform random variable by the inverse cumulative density function); the efficient mixed rejection sampling algorithm proposed by Geweke [2]; or slice sampling [5].

The upper and lower points of truncation can be computed by rewriting the constraints Q_z^\top z \le q_z for a single element z_i,

\underbrace{[Q_z]_{i:}}_{d}\, z_i \le \underbrace{q_z - [Q_z]_{\tilde{i}:}^\top z_{\tilde{i}}}_{n},   (24)

\ell_i = \max\big\{ -\infty,\ n_k/d_k : d_k < 0 \big\} \le z_i \le \min\big\{ \infty,\ n_k/d_k : d_k > 0 \big\} = u_i,   (25)–(26)

where [Q_z]_{i:} denotes the i-th row of Q_z, [Q_z]_{\tilde{i}:} denotes all rows except the i-th, and z_{\tilde{i}} denotes the vector of all elements of z except the i-th.

Finally, when a sample of z has been generated after a number of Gibbs sweeps, it can be
transformed into a sample of the original variable, x, using

x = T_\perp^\top (L^\top z + \mu_y) + x_0.   (27)

The sampling procedure is illustrated in Figure 3.

Figure 3: Gibbs sampling from a multivariate Gaussian density subject to linear constraints. a) Two-dimensional Gaussian subject to three inequality constraints. b) The conditional distribution of x_1 given x_2 = x^* is a truncated Gaussian with truncation points \ell and u. c) Gibbs sampling proceeds iteratively by sweeping over the dimensions and sampling from the conditional distribution in each dimension conditioned on the current value in the other dimensions.

4 Experiments

We demonstrate the proposed linearly constrained Bayesian matrix factorization method on a blind image separation problem, and compare it to two other matrix factorization techniques: independent component analysis (ICA) and non-negative matrix factorization (NMF).

Data We used a subset of the MNIST dataset, which consists of 28 × 28 pixel grayscale images of handwritten digits (see Figure 4.a). We selected the first 800 images of each digit, 0–9, which gave us 8,000 unique images. From these images we created 4,000 image mixtures by adding the grayscale intensities of the images two and two, such that the different digits were combined in equal proportion. We rescaled the mixed images so that their pixel intensities were in the 0–1 interval, and arranged the vectorized images as the columns of the matrix X ∈ R^{I×J}, where I = 784 and J = 4,000. Examples of the image mixtures are shown in Figure 4.b.

Task The objective is to factorize the data matrix in order to find a number of source images that explain the data. Ideally, the sources should correspond to the original digits.
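The mixture construction described in the Data paragraph can be sketched in numpy (with a random stand-in for the MNIST images, and assuming a per-image rescaling to the 0–1 interval):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for the 8,000 vectorized 28x28 digit images (one image per column)
digits = rng.random((784, 8000))

# Mix the images two and two by adding their grayscale intensities
mixtures = digits[:, 0::2] + digits[:, 1::2]            # 784 x 4,000

# Rescale so that pixel intensities lie in the 0-1 interval
X = mixtures / mixtures.max(axis=0, keepdims=True)

assert X.shape == (784, 4000)
assert 0.0 <= X.min() and X.max() <= 1.0
```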
We cannot hope to find exactly 10 sources that each correspond to a digit, because there are large variations in how each digit is written. For that reason, we used 40 hidden sources in our experiments, which allowed on average 4 exemplars of each digit.

Method For comparison, we factorized the mixed image data using two standard matrix factorization techniques: ICA, where we used the FastICA algorithm, and NMF, where we used Lee and Seung's multiplicative update algorithm [3]. The sources determined using these methods are shown in Figure 4.c–d.

For the linearly constrained Bayesian matrix factorization, we used an isotropic noise model. We chose a decoupled prior for A and B with zero mean, \mu = 0, and unit diagonal covariance matrix, \Sigma = I. The hidden sources were constrained to have the same range of pixel intensities as the image mixtures, 0 \le a_{ik} \le 1. This constraint allows the sources to be interpreted as images, since pixel intensities outside this interval are not meaningful. The mixing coefficients were constrained to be non-negative, b_{kj} \ge 0, and to sum to unity, \sum_{k=1}^{K} b_{kj} = 1; thus, the observed data is modeled as a convex combination of the sources. The constraints ensure that only additive combinations of the sources are allowed, and introduce a negative correlation between the mixing coefficients. This negative correlation implies that if one source contributes more to a mixture, the other sources must correspondingly contribute less. The idea behind this constraint is that it will lead the sources to compete, as opposed to collaborate, to explain the data. A geometric interpretation of the constraints is illustrated in Figure 1.h: The data vectors are modeled by a convex polytope in the non-negative unit hypercube, and the hidden sources are the vertices of this polytope. We computed 10,000 Gibbs samples, which appeared sufficient for the sampler to converge.
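Each of these Gibbs samples ultimately reduces to univariate truncated standard normal draws, Eq. (23); a minimal inverse-transform sampler for such draws (a sketch using Python's standard library, not the paper's implementation) looks as follows:

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal, mean 0 and sigma 1

def sample_truncated_std_normal(lo, hi, rng=random):
    """Draw z ~ N(0, 1) restricted to [lo, hi] by inverse transform sampling."""
    u = rng.uniform(STD_NORMAL.cdf(lo), STD_NORMAL.cdf(hi))
    return STD_NORMAL.inv_cdf(u)

draws = [sample_truncated_std_normal(-0.5, 2.0) for _ in range(1000)]
assert all(-0.5 <= z <= 2.0 for z in draws)
```

For truncation intervals far out in the tails, inverse transform sampling loses precision, which is why the text also mentions Geweke's mixed rejection sampler [2] and slice sampling [5] as alternatives.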
The results of the matrix factorization are shown in Figure 4.e, which displays a single sample of A at the last iteration.

Results In ICA (see Figure 4.c) the sources are not constrained to be non-negative, and therefore do not have a direct interpretation as images. Most of the computed sources are complex patterns that appear to be dominated by a single digit. While ICA certainly does find structure in the data, the estimated sources lack a clear interpretation.

The sources computed using NMF (see Figure 4.d) have the property which Lee and Seung [3] refer to as a parts-based representation. Spatially, the sources are local as opposed to global. The decomposition has an intuitive interpretation: Each source is a short line segment or a dot, and the different digits can be constructed by combining these parts.

Linearly constrained Bayesian matrix factorization (see Figure 4.e) computes sources with a very clear and intuitive interpretation: Almost all of the 40 computed sources visually resemble handwritten digits, and are thus well aligned with the sources that were used to generate the mixtures. Compared to the original data, the computed sources are a bit bolder and have slightly smeared edges.

Figure 4: Data and results of the analyses of an image separation problem. a) The MNIST digits data (20 examples shown) used to generate the mixture data. b) The mixture data consists of 4,000 images of two mixed digits (20 examples shown). c) Sources computed using independent component analysis (color indicates sign). d) Sources computed using non-negative matrix factorization. e) Sources computed using linearly constrained Bayesian matrix factorization (details explained in the text).
Two sources stand out: One is a black blob of approximately the same size as the digits, and the other is an all-white feature; these are useful for adjusting the brightness.

5 Conclusions

We presented a linearly constrained Bayesian matrix factorization method as well as an inference procedure for this model. On an unsupervised image separation problem, we have demonstrated that the method finds sources that have a clear and interpretable meaning. As opposed to ICA and NMF, our method finds sources that visually resemble handwritten digits.

We formulated the model in general terms, which allows specific prior information to be incorporated in the factorization. The Gaussian priors over the sources can be used if knowledge is available about the covariance structure of the sources, e.g., if the sources are known to be smooth. The constraints we used in our experiments were separate for A and B, but the framework allows bilinearly dependent constraints to be specified, which can be used, for example, to specify constraints in the data domain, i.e., on the product AB.

As a general framework for constrained Bayesian matrix factorization, the proposed method has applications in many areas other than blind source separation. Interesting applications include blind deconvolution, music transcription, spectral unmixing, and collaborative filtering. The method can also be used in a supervised source separation setting, where the distributions over sources and mixing coefficients are learned from a training set of isolated sources. It is an interesting challenge to develop methods for learning relevant constraints from data.

References

[1] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009. doi: 10.1155/2009/785152.

[2] J. Geweke.
Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints and the evaluation of constraint probabilities. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 571–578, 1991.

[3] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, October 1999. doi: 10.1038/44565.

[4] S. Moussaoui, D. Brie, A. Mohammad-Djafari, and C. Carteret. Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling. IEEE Transactions on Signal Processing, 54(11):4133–4145, Nov 2006. doi: 10.1109/TSP.2006.880310.

[5] R. M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

[6] G. Rodriguez-Yam, R. Davis, and L. Scharf. Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. Technical report, Colorado State University, Fort Collins, 2004.

[7] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pages 1257–1264, 2008.

[8] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning (ICML), pages 880–887, 2008.

[9] M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In International Conference on Independent Component Analysis and Signal Separation, volume 5441 of Lecture Notes in Computer Science (LNCS), pages 540–547. Springer, 2009. doi: 10.1007/978-3-642-00599-2_68.
", "award": [], "sourceid": 352, "authors": [{"given_name": "Mikkel", "family_name": "Schmidt", "institution": null}]}