{"title": "Sparse Overcomplete Latent Variable Decomposition of Counts Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1313, "page_last": 1320, "abstract": "An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and also do not have a provision to control the \"expressiveness\" of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.", "full_text": "Sparse Overcomplete Latent Variable Decomposition of Counts Data\n\nMadhusudana Shashanka\nMars, Incorporated\nHackettstown, NJ\nshashanka@cns.bu.edu\n\nBhiksha Raj\nMitsubishi Electric Research Labs\nCambridge, MA\nbhiksha@merl.com\n\nParis Smaragdis\nAdobe Systems\nNewton, MA\nparis@adobe.com\n\nAbstract\n\nAn important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the \u201cexpressiveness\u201d of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity.
We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.\n\n1 Introduction\n\nA frequently encountered problem in many fields is the analysis of histogram data to extract meaningful latent factors from it. For text analysis, where the data represent counts of word occurrences from a collection of documents, popular techniques include Probabilistic Latent Semantic Analysis (PLSA; [6]) and Latent Dirichlet Allocation (LDA; [2]). These methods extract components that can be interpreted as topics characterizing the corpus of documents. Although they are primarily motivated by the analysis of text, these methods can be applied to analyze arbitrary count data. For example, images can be interpreted as histograms of multiple draws of pixels, where each draw corresponds to a \u201cquantum of intensity\u201d. PLSA allows us to express the distributions that underlie such count data as mixtures of latent components. Extensions to PLSA include methods that attempt to model how these components co-occur (e.g., LDA, the Correlated Topic Model [1]).\n\nOne of the main limitations of these models is related to the number of components they can extract. Realistically, it may be expected that the number of latent components in the process underlying any dataset is unrestricted. However, the number of components that can be discovered by LDA or PLSA is restricted by the cardinality of the data, e.g., by the vocabulary of the documents or the number of pixels of the image analyzed. Any analysis that attempts to find an overcomplete set of a larger number of components encounters the problem of indeterminacy and is liable to result in meaningless or trivial solutions.
The second limitation of the models is related to the \u201cexpressiveness\u201d of the extracted components, i.e., the information content in them. Although the methods aim to find \u201cmeaningful\u201d latent components, they do not actually provide any control over the information content in the components.\n\nIn this paper, we present a learning formulation that addresses both these limitations by employing the notion of sparsity. Sparse coding refers to a representational scheme where, of a set of components that may be combined to compose data, only a small number are combined to represent any particular instance of the data (although the specific set of components may change from instance to instance). In our problem, this translates to permitting the generating process to have an unrestricted number of latent components, but requiring that only a small number of them contribute to the composition of the histogram represented by any data instance. In other words, the latent components must be learned such that the mixture weights with which they are combined to generate any data have low entropy \u2013 a set with low entropy implies that only a few mixture weight terms are significant. This addresses both the limitations. Firstly, it largely eliminates the problem of indeterminacy, permitting us to learn an unrestricted number of latent components. Secondly, estimation of low-entropy mixture weights forces more information onto the latent components, thereby making them more expressive.\n\nThe basic formulation we use to extract latent components is similar to PLSA. We use an entropic prior to manipulate the entropy of the mixture weights. We formulate the problem in a maximum a posteriori framework and derive inference algorithms. We use an artificial dataset to illustrate the effects of sparsity on the model.
We show through simulations that sparsity can lead to components that are more representative of the true nature of the data compared to conventional maximum likelihood learning. We demonstrate through experiments on images that the latent components learned in this manner are more informative, enabling us to predict unobserved data. We also demonstrate that they are more discriminative than those learned using regular maximum likelihood methods. We then present conclusions and avenues for future work.\n\n2 Latent Variable Decomposition\n\nConsider an F \u00d7 N count matrix V. We will consider each column of V to be the histogram of an independent set of draws from an underlying multinomial distribution over F discrete values. Each column of V thus represents counts in a unique data set. V_{fn}, the f-th row entry of V_n, the n-th column of V, represents the count of f (or the f-th discrete symbol that may be generated by the multinomial) in the n-th data set. For example, if the columns of V represent word count vectors for a collection of documents, V_{fn} would be the count of the f-th word of the vocabulary in the n-th document in the collection.\n\nWe model all data as having been generated by a process that is characterized by a set of latent probability distributions that, although not directly observed, combine to compose the distribution of any data set. We represent the probability of drawing f from the z-th latent distribution by P(f|z), where z is a latent variable. To generate any data set, the latent distributions P(f|z) are combined in proportions that are specific to that set. Thus, each histogram (column) in V is the outcome of draws from a distribution that is a column-specific composition of P(f|z).
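Concretely, the generative process just described can be simulated by drawing each column of V from its own mixture multinomial. The following is a minimal sketch; function and variable names are our illustrative assumptions, not the authors' code.

```python
import numpy as np

def sample_counts(W, G, total_counts, seed=0):
    """Simulate the generative process of Section 2: for each data set n,
    draw total_counts[n] symbols from the mixture P_n(f) = sum_z P(f|z) P_n(z).
    W: (F, R) basis distributions P(f|z); G: (R, N) mixture weights P_n(z).
    An illustrative sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    P = W @ G  # (F, N): one multinomial distribution per column of V
    F, N = P.shape
    V = np.stack([rng.multinomial(total_counts[n], P[:, n]) for n in range(N)],
                 axis=1)
    return V   # (F, N) count matrix
```

Each column of the returned matrix sums to the requested number of draws, mirroring the "histogram of an independent set of draws" view of the data.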
We can define the distribution underlying the n-th column of V as\n\nP_n(f) = \u03a3_z P(f|z) P_n(z),    (1)\n\nwhere P_n(f) represents the probability of drawing f in the n-th data set in V, and P_n(z) is the mixing proportion signifying the contribution of P(f|z) towards P_n(f).\n\nEquation 1 is functionally identical to that used for Probabilistic Latent Semantic Analysis of text data [6]^1: if the columns V_n of V represent word count vectors for documents, P(f|z) represents the z-th latent topic in the documents. Analogous interpretations may be proposed for other types of data as well. For example, if each column of V represents one of a collection of images (each of which has been unraveled into a column vector), the P(f|z)'s would represent the latent \u201cbases\u201d that compose all images in the collection. In maintaining this latter analogy, we will henceforth refer to P(f|z) as the basis distributions for the process.\n\nGeometrically, the normalized columns of V (obtained by scaling the entries of V_n to sum to 1.0), \u00afV_n, which we refer to as data distributions, may be viewed as F-dimensional vectors that lie in an (F \u2212 1) simplex. The distributions P_n(f) and basis distributions P(f|z) are also F-dimensional vectors in the same simplex. The model expresses P_n(f) as points within the convex hull formed by the basis distributions P(f|z). The aim of the model is to determine P(f|z) such that the model\n\n^1 PLSA actually represents the joint distribution of n and f as P(n, f) = P(n) \u03a3_z P(f|z) P(z|n).
However, the maximum likelihood estimate of P(n) is simply the fraction of all observations from all data sets that occurred in the n-th data set, and does not affect the estimation of P(f|z) and P(z|n).\n\n[Figure 1: two ternary-plot panels, \u201c2 Basis Vectors\u201d and \u201c3 Basis Vectors\u201d; legend: Simplex Boundary, Data Points, Basis Vectors, Approximation/Convex Hull.]\n\nFigure 1: Illustration of the latent variable model. Panels show 3-dimensional data distributions as points within the standard 2-simplex given by {(001), (010), (100)}. The left panel shows a set of 2 basis distributions (compact code) derived from the 400 data points. The right panel shows a set of 3 basis distributions (complete code). The model approximates data distributions as points lying within the convex hull formed by the basis distributions. Also shown are two data points (marked by + and \u00d7) and their approximations by the model (respectively shown by \u2666 and (cid:3)).\n\nP_n(f) for any data distribution \u00afV_n approximates it closely. Since P_n(f) is constrained to lie within the simplex defined by P(f|z), it can only model \u00afV_n accurately if the latter also lies within the hull. Any \u00afV_n that lies outside the hull is modeled with error. Thus, the objective of the model is to identify P(f|z) such that they form a convex hull surrounding the data distributions. This is illustrated in Figure 1 for a synthetic data set of 400 3-dimensional data distributions.\n\n2.1 Parameter Estimation\n\nGiven count matrix V, we estimate P(f|z) and P_n(z) to maximize the likelihood of V.
This can be done through iterations of equations derived using the Expectation Maximization (EM) algorithm:\n\nP_n(z|f) = P_n(z) P(f|z) / \u03a3_{z\u2032} P_n(z\u2032) P(f|z\u2032),    (2)\n\nand\n\nP(f|z) = \u03a3_n V_{fn} P_n(z|f) / \u03a3_{f\u2032} \u03a3_n V_{f\u2032n} P_n(z|f\u2032),    P_n(z) = \u03a3_f V_{fn} P_n(z|f) / \u03a3_{z\u2032} \u03a3_f V_{fn} P_n(z\u2032|f).    (3)\n\nA detailed derivation is shown in the supplemental material. The EM algorithm guarantees that the above multiplicative updates converge to a local optimum.\n\n2.2 Latent Variable Model as Matrix Factorization\n\nWe can write the model given by equation (1) in matrix form as p_n = W g_n, where p_n is a column vector indicating P_n(f), g_n is a column vector indicating P_n(z), and W is a matrix with the (f, z)-th element corresponding to P(f|z). If we characterize V by R basis distributions, W is an F \u00d7 R matrix. Concatenating all column vectors p_n and g_n as matrices P and G respectively, one can write the model as P = WG, where G is an R \u00d7 N matrix. It is easy to show (as demonstrated in the supplementary material) that the maximum likelihood estimator for P(f|z) and P_n(z) attempts to minimize the Kullback-Leibler (KL) distance between the normalized data distribution \u00afV_n and P_n(f), weighted by the total count in V_n. In other words, the model of Equation (1) actually represents the decomposition\n\nV \u2248 WGD = WH,    (4)\n\nwhere D is an N \u00d7 N diagonal matrix whose n-th diagonal element is the total number of counts in V_n, and H = GD. The astute reader might recognize the decomposition of equation (4) as Non-negative Matrix Factorization (NMF; [8]). In fact, equations (2) and (3) can be shown to be equivalent to one of the standard update rules for NMF.\n\nRepresenting the decomposition in matrix form immediately reveals one of the shortcomings of the basic model.
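The maximum-likelihood updates of Equations (2) and (3) can be run compactly in the matrix form V \u2248 WG. The following is a minimal NumPy sketch; variable names and the simultaneous-update ordering are our illustrative choices, not the authors' code.

```python
import numpy as np

def plsa_em(V, R, n_iter=100, seed=0):
    """Maximum-likelihood estimation of the model P = WG (Eqs. 2-3).
    V is an (F, N) nonnegative count matrix; columns of W are P(f|z)
    and columns of G are P_n(z). A minimal sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, R)); W /= W.sum(axis=0, keepdims=True)
    G = rng.random((R, N)); G /= G.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        P = W @ G                               # model distributions P_n(f)
        ratio = V / np.maximum(P, 1e-12)        # V_fn / P_n(f), the E-step quantity
        # multiplicative M-step, equivalent to Eqs. (2)-(3) / a KL-NMF update
        W, G = W * (ratio @ G.T), G * (W.T @ ratio)
        W /= W.sum(axis=0, keepdims=True)       # renormalize columns to distributions
        G /= G.sum(axis=0, keepdims=True)
    return W, G
```

Every iteration keeps each column of W and G a valid probability distribution, mirroring the normalizations in Equation (3).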
If R, the number of basis distributions, is equal to F, then a trivial solution exists that achieves perfect decomposition: W = I, H = V, where I is the identity matrix (although the algorithm may not always arrive at this solution). However, this solution is no longer of any utility to us, since our aim is to derive basis distributions that are characteristic of the data, whereas the\n\n[Figure 2: a 2-simplex with 7 basis distributions A\u2013G and a data point \u2018+\u2019; annotation: the enclosing triangles for \u2018+\u2019 are ABG, ABD, ABE, ACG, ACD, ACE, and ACF.]\n\nFigure 2: Illustration of the effect of sparsifying H on the dataset shown in Figure 1. A\u2013G represent 7 basis distributions. The \u2018+\u2019 represents a typical data point. It can be accurately represented by any set of three or more bases that form an enclosing polygon, and there are many such polygons. However, if we restrict the number of bases used to enclose \u2018+\u2019 to be minimized, only the 7 enclosing triangles shown remain as valid solutions. By further imposing the restriction that the entropy of the mixture weights with which the bases (corners) must be combined to represent \u2018+\u2019 must be minimum, only one triangle is obtained as the unique optimal enclosure.\n\ncolumns of W in this trivial solution are not specific to any data, but represent the dimensions of the space the data lie in. For overcomplete decompositions where R > F, the solution becomes indeterminate \u2013 multiple perfect decompositions are possible.\n\nThe indeterminacy of the overcomplete decomposition can, however, be greatly reduced by imposing a restriction that the approximation for any \u00afV_n must employ the minimum number of basis distributions required. By further imposing the constraint that the entropy of g_n must be minimized, the indeterminacy of the solution can often be eliminated, as illustrated by Figure 2.
This principle, which is related to the concept of sparse coding [5], is what we will use to derive overcomplete sets of basis distributions for the data.\n\n3 Sparsity in the Latent Variable Model\n\nSparse coding refers to a representational scheme where, of a set of components that may be combined to compose data, only a small number are combined to represent any particular input. In the context of basis decompositions, the goal of sparse coding is to find a set of bases for any data set such that the mixture weights with which the bases are combined to compose any data are sparse. Different metrics have been used to quantify the sparsity of the mixture weights in the literature. Some approaches minimize variants of the L_p norm of the mixture weights (e.g., [7]), while other approaches minimize various approximations of the entropy of the mixture weights.\n\nIn our approach, we use entropy as a measure of sparsity. We use the entropic prior, which has been used in the maximum entropy literature (see [9]), to manipulate entropy. Given a probability distribution \u03b8, the entropic prior is defined as P_e(\u03b8) \u221d e^{\u2212\u03b1 H(\u03b8)}, where H(\u03b8) = \u2212\u03a3_i \u03b8_i log \u03b8_i is the entropy of the distribution and \u03b1 is a weighting factor. Positive values of \u03b1 favor distributions with lower entropies, while negative values of \u03b1 favor distributions with higher entropies. Imposing this prior during maximum a posteriori estimation is a way to manipulate the entropy of the distribution. The distribution \u03b8 could correspond to the basis distributions P(f|z) or the mixture weights P_n(z) or both. A sparse code would correspond to having the entropic prior on P_n(z) with a positive value for \u03b1. Below, we consider the case where both the basis vectors and mixture weights have the entropic prior, to keep the exposition general.\n\n3.1 Parameter Estimation\n\nWe use the EM algorithm to derive the update equations.
Let us examine the case where both P(f|z) and P_n(z) have the entropic prior. The set of parameters to be estimated is given by \u039b = {P(f|z), P_n(z)}. The a priori distribution over the parameters, P(\u039b), corresponds to the entropic priors. We can write log P(\u039b), the log-prior, as\n\n\u03b1 \u03a3_z \u03a3_f P(f|z) log P(f|z) + \u03b2 \u03a3_n \u03a3_z P_n(z) log P_n(z),    (5)\n\n[Figure 3: six ternary-plot panels. Top row: 3, 7, and 10 basis vectors without sparsity; bottom row: 7 basis vectors with sparsity parameters 0.01, 0.05, and 0.3.]\n\nFigure 3: Illustration of the effect of sparsity on the synthetic data set from Figure 1. For visual clarity, we do not display the data points. Top panels: Decomposition without sparsity. Sets of 3 (left), 7 (center), and 10 (right) basis distributions were obtained from the data without employing sparsity. In each case, 20 runs of the estimation algorithm were performed from different initial values. The convex hulls formed by the bases from each of these runs are shown in the panels from left to right. Notice that increasing the number of bases enlarges the sizes of the convex hulls, none of which characterize the distribution of the data well. Bottom panels: Decomposition with sparsity. The panels from left to right show the 20 sets of estimates of 7 basis distributions, for increasing values of the sparsity parameter for the mixture weights. The convex hulls quickly shrink to compactly enclose the distribution of the data.\n\nwhere \u03b1 and \u03b2 are parameters indicating the degree of sparsity desired in P(f|z) and P_n(z), respectively.
As before, we can write the E-step as\n\nP_n(z|f) = P_n(z) P(f|z) / \u03a3_{z\u2032} P_n(z\u2032) P(f|z\u2032).    (6)\n\nThe M-step reduces to the equations\n\n\u03be / P(f|z) + \u03b1 + \u03b1 log P(f|z) + \u03c1_z = 0,    \u03c9 / P_n(z) + \u03b2 + \u03b2 log P_n(z) + \u03c4_n = 0,    (7)\n\nwhere we have let \u03be represent \u03a3_n V_{fn} P_n(z|f), \u03c9 represent \u03a3_f V_{fn} P_n(z|f), and \u03c1_z, \u03c4_n are Lagrange multipliers. The above M-step equations are systems of simultaneous transcendental equations for P(f|z) and P_n(z). Brand [3] proposes a method to solve such equations using the Lambert W function [4]. It can be shown that P(f|z) and P_n(z) can be estimated as\n\n\u02c6P(f|z) = (\u2212\u03be/\u03b1) / W(\u2212\u03be e^{1+\u03c1_z/\u03b1} / \u03b1),    \u02c6P_n(z) = (\u2212\u03c9/\u03b2) / W(\u2212\u03c9 e^{1+\u03c4_n/\u03b2} / \u03b2).    (8)\n\nEquations (7), (8) form a set of fixed-point iterations that typically converge in 2-5 iterations [3].\n\nThe final update equations are given by equation (6) and the fixed-point equation pairs (7), (8). Details of the derivation are provided in the supplemental material. Notice that the above equations reduce to the maximum likelihood updates of equations (2) and (3) when \u03b1 and \u03b2 are set to zero. More generally, the EM algorithm aims to minimize the KL distance between the true distribution of the data and that of the model, i.e., it attempts to arrive at a model that conserves the entropy of the data, subject to the a priori constraints. Consequently, reducing the entropy of the mixture weights P_n(z) to obtain a sparse code results in increased entropy (information) in the basis distributions P(f|z).\n\n3.2 Illustration of the Effect of Sparsity\n\nThe effect and utility of sparse overcomplete representations is demonstrated by Figure 3. In this example, the data (from Figure 1) have four distinct quadrilaterally located clusters.
This structure cannot be accurately represented by three or fewer basis distributions, since they can, at best, specify\n\n[Figure 4: three panels, A (Occluded Faces), B (Reconstructions), and C (Original Test Images).]\n\nFigure 4: Application of latent variable decomposition for reconstructing faces from occluded images (CBCL Database). (A) Example of a random subset of 36 occluded test images. Four 6 \u00d7 6 patches were removed from the images in several randomly chosen configurations (corresponding to the rows). (B) Reconstructed faces from a sparse-overcomplete basis set of 1000 learned components (sparsity parameter = 0.1). (C) Original test images shown for comparison.\n\na triangular simplex, as demonstrated by the top left panel in the figure. Simply increasing the number of bases without constraining the sparsity of the mixture weights does not provide meaningful solutions. However, increasing the sparsity quickly results in solutions that accurately characterize the distribution of the data.\n\nA clearer intuition is obtained when we consider the matrix form of the decomposition in Equation 4. The goal of the decomposition is often to identify a set of latent distributions that characterize the underlying process that generated the data V. When no sparsity is enforced on the solution, the trivial solution W = I, H = V is obtained at R = F. In this solution, the entire information in V is borne by H and the bases W become uninformative, i.e., they no longer contain information about the underlying process.\n\nHowever, by enforcing sparsity on H, the information in V is transferred back to W, and non-trivial solutions are possible for R > F. As R increases, however, W becomes more and more data-like. At R = N, another trivial solution is obtained: W = V and H = D (i.e., G = I). The columns of W now simply represent (scaled versions of) the specific data V rather than the underlying process. For R > N, the solutions become indeterminate.
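The sparse mixture-weight estimate of Equations (7) and (8) can be computed with a short fixed-point loop. Below is a minimal sketch for a single data set and \u03b2 > 0, using SciPy's Lambert W; the Lagrange-multiplier update shown is one common way to close the fixed point, an assumption on our part rather than the authors' exact recipe.

```python
import numpy as np
from scipy.special import lambertw

def sparse_weight_update(omega, beta, n_iter=10):
    """One M-step of Eqs. (7)-(8) for the mixture weights of a single data
    set, with sparsity parameter beta > 0. omega[z] = sum_f V_fn P_n(z|f)
    comes from the E-step. A sketch after Brand's Lambert-W method; the
    tau update is our assumed closure of the fixed point."""
    theta = omega / omega.sum()              # initialize at the ML estimate
    for _ in range(n_iter):
        # tau from summing theta_z * (omega_z/theta_z + beta + beta*log(theta_z) + tau) = 0
        tau = -omega.sum() - beta - beta * (theta * np.log(theta)).sum()
        # Lambert-W solution of omega/theta + beta + beta*log(theta) + tau = 0;
        # the k = -1 branch picks the low-entropy root when beta > 0
        arg = -(omega / beta) * np.exp(1.0 + tau / beta)
        theta = -(omega / beta) / lambertw(arg, k=-1).real
        theta = np.maximum(theta, 1e-16)
        theta /= theta.sum()                 # enforce normalization
    return theta
```

In the limit \u03b2 \u2192 0 the solution approaches the maximum-likelihood estimate \u03c9 / \u03a3_z \u03c9_z; increasing \u03b2 drives the returned weights toward lower entropy.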
By enforcing sparsity, we have thus increased the implicit limit on the number of bases that can be estimated without indeterminacy from the smaller dimension of V to the larger one.\n\n4 Experimental Evaluation\n\nWe hypothesize that if the learned basis distributions are characteristic of the process that generates the data, they must not only generalize to explain new data from the process, but also enable prediction of components of the data that were not observed. Secondly, the bases for a given process must be worse at explaining data that have been generated by any other process. We test both these hypotheses below. In both experiments we utilize images, which we interpret as histograms of repeated draws of pixels, where each draw corresponds to a quantum of intensity.\n\n4.1 Face Reconstruction\n\nIn this experiment we evaluate the ability of the overcomplete bases to explain new data and predict the values of unobserved components of the data. Specifically, we use it to reconstruct occluded portions of images. We used the CBCL database consisting of 2429 frontal-view face images hand-aligned in a 19 \u00d7 19 grid. We preprocessed the images by linearly scaling the grayscale intensities so that the pixel mean and standard deviation were 0.25, and then clipped them to the range [0, 1]. 2000 images were randomly chosen as the training set. 100 images from the remaining 429 were randomly chosen as the test set. To create occluded test images, we removed 6 \u00d7 6 grids in ten random configurations for 10 test faces each, resulting in 100 occluded images. We created 4 sets of test images, where each set had one, two, three or four 6 \u00d7 6 patches removed. Figure 4A represents the case where 4 patches were removed from each face.\n\nIn a training stage, we learned sets of K \u2208 {50, 200, 500, 750, 1000} basis distributions from the training data.
Sparsity was not used in the compact (R < F) cases (50 and 200 bases), and sparsity\n\n[Figure 5: two panels, each showing Basis Vectors, Mixture Weights, and the resulting Pixel Image. Figure 6: three panels, \u03b2 = 0, \u03b2 = 0.2, and \u03b2 = 0.5.]\n\nFigure 5: 25 basis distributions (represented as images) extracted for class \u201c2\u201d from training data without sparsity on mixture weights (left panel, sparsity parameter = 0) and with sparsity on mixture weights (right panel, sparsity parameter = 0.2). Basis images combine in proportion to the mixture weights shown to result in the pixel images shown.\n\nFigure 6: 25 basis distributions learned from training data for class \u201c3\u201d with increasing sparsity parameters on the mixture weights. The sparsity parameter was set to 0, 0.2 and 0.5, respectively. Increasing the sparsity parameter of the mixture weights produces bases which are holistic representations of the input (histogram) data instead of parts-like features.\n\nwas imposed (parameter = 0.1) on the mixture weights in the overcomplete cases (500, 750 and 1000 basis vectors).\n\nThe procedure for estimating the occluded regions of a test image has two steps. In the first step, we estimate the distribution underlying the image as a linear combination of the basis distributions. This is done by iterations of Equations 2 and 3 to estimate P_n(z) (the bases P(f|z), being already known, stay fixed) based only on the pixels that are observed (i.e., we marginalize out the occluded pixels). The combination of the bases P(f|z) and the estimated P_n(z) gives us the overall distribution P_n(f) for the image. The occluded value at any pixel f is estimated as the expected number of counts at that pixel, given by P_n(f) (\u03a3_{f\u2032\u2208Fo} V_{f\u2032}) / (\u03a3_{f\u2032\u2208Fo} P_n(f\u2032)), where V_f represents the value of the image at the f-th pixel and Fo is the set of observed pixels.
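The two-step procedure above can be sketched as follows. Renormalizing the bases over the visible pixels is our way of realizing the marginalization of the occluded pixels; names are illustrative, not the authors' code.

```python
import numpy as np

def reconstruct_occluded(v, W, observed, n_iter=200, seed=0):
    """Fill in occluded pixels of one image, per Section 4.1: estimate the
    mixture weights from visible pixels only (bases W fixed), then scale
    the model distribution to the observed mass. v: length-F count vector;
    W: (F, R) bases P(f|z); observed: boolean mask. A minimal sketch."""
    rng = np.random.default_rng(seed)
    vo = v[observed].astype(float)
    m = W[observed].sum(axis=0)               # visible mass of each basis
    Wt = W[observed] / m                      # bases renormalized over visible pixels
    g = rng.random(W.shape[1]); g /= g.sum()  # mixture weights P_n(z)
    for _ in range(n_iter):
        p = Wt @ g                            # model restricted to visible pixels
        g *= Wt.T @ (vo / np.maximum(p, 1e-12))   # EM update on visible pixels only
        g /= g.sum()
    g = g / m; g /= g.sum()                   # map weights back to the full bases
    P = W @ g                                 # full-image distribution P_n(f)
    est = P * vo.sum() / P[observed].sum()    # expected counts, scaled to observed mass
    out = v.astype(float).copy()
    out[~observed] = est[~observed]
    return out
```

The visible pixels are returned unchanged; only the occluded entries are replaced by their expected counts under the fitted model.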
Figure 4B shows the reconstructed faces for the sparse-overcomplete case of 1000 basis vectors. Figure 7A summarizes the results for all cases. Performance is measured by mean Signal-to-Noise Ratio (SNR), where the SNR for an image was computed as the ratio of the sum of squared pixel intensities of the original image to the sum of squared errors between the original image pixels and the reconstruction.\n\n4.2 Handwritten Digit Classification\n\nIn this experiment we evaluate the specificity of the bases to the process represented by the training data set, through a simple example of handwritten digit classification. We used the USPS Handwritten Digits database, which has 1100 examples for each digit class. We randomly chose 100 examples from each class and separated them as the test set. The remaining examples were used for training. During training, separate sets of basis distributions P^k(f|z) were learned for each class, where k represents the index of the class. Figure 5 shows 25 basis images extracted for the digit \u201c2\u201d. To classify any test image v, we attempted to compute the distribution underlying the image using the bases for each class (by estimating the mixture weights P^k_v(z), keeping the bases fixed, as before). The \u201cmatch\u201d of the bases to the test instance was indicated by the likelihood L_k of the image, computed using P^k(f) = \u03a3_z P^k(f|z) P^k_v(z) as L_k = \u03a3_f v_f log P^k(f). Since we expect the bases for the true class of the image to best compose it, we expect the likelihood for the correct class to be maximum. Hence, the image v was assigned to the class for which the likelihood was highest.\n\n[Figure 7: panel A, \u201cReconstruction Experiment\u201d \u2013 mean SNR vs. number of basis components (50 to 1000) for 1-4 deleted patches; panel B, \u201cClassification Experiment\u201d \u2013 percentage error vs. sparsity parameter (0 to 0.3) for 25 to 200 bases.]\n\nFigure 7: (A) Results of the face reconstruction experiment. Mean SNR of the reconstructions is shown as a function of the number of basis vectors and the test case (number of deleted patches, shown in the legend). Notice that the sparse-overcomplete codes consistently perform better than the compact codes. (B) Results of the classification experiment. The legend shows the number of basis distributions used. Notice that imposing sparsity almost always leads to better classification performance. In the case of 100 bases, the error rate comes down by almost 50% when a sparsity parameter of 0.3 is imposed.\n\nResults are shown in Figure 7B. As one can see, imposing sparsity improves classification performance in almost all cases. Figure 6 shows three sets of basis distributions learned for class \u201c3\u201d with different sparsity values on the mixture weights. As the sparsity parameter is increased, the bases tend to be holistic representations of the input histograms. This is consistent with the improved classification performance: as the representation of the basis distributions gets more holistic, the more unlike the bases of other classes they become. Thus, there is less chance that the bases of one class can compose an image of another class, thereby improving performance.\n\n5 Conclusions\n\nIn this paper, we have presented an algorithm for sparse extraction of overcomplete sets of latent distributions from histogram data.
We have used entropy as a measure of sparsity and employed the entropic prior to manipulate the entropy of the estimated parameters. We showed that sparse-overcomplete components can lead to an improved characterization of data and can be used in applications such as classification and inference of missing data. We believe further improved characterization may be achieved by the imposition of additional priors that represent known or hypothesized structure in the data; this will be the focus of future research.\n\nReferences\n\n[1] DM Blei and JD Lafferty. Correlated Topic Models. In NIPS, 2006.\n[2] DM Blei, AY Ng, and MI Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[3] ME Brand. Pattern Discovery via Entropy Minimization. In Uncertainty 99: AISTATS 99, 1999.\n[4] RM Corless, GH Gonnet, DEG Hare, DJ Jeffrey, and DE Knuth. On the Lambert W Function. Advances in Computational Mathematics, 1996.\n[5] DJ Field. What is the Goal of Sensory Coding? Neural Computation, 1994.\n[6] T Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42:177\u2013196, 2001.\n[7] PO Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research, 5, 2004.\n[8] DD Lee and HS Seung. Algorithms for Non-negative Matrix Factorization. In NIPS, 2001.\n[9] J Skilling. Classic Maximum Entropy. In J Skilling, editor, Maximum Entropy and Bayesian Methods. Kluwer Academic, 1989.\n", "award": [], "sourceid": 1036, "authors": [{"given_name": "Madhusudana", "family_name": "Shashanka", "institution": null}, {"given_name": "Bhiksha", "family_name": "Raj", "institution": null}, {"given_name": "Paris", "family_name": "Smaragdis", "institution": null}]}