{"title": "Training sparse natural image models with a fast Gibbs sampler of an extended state space", "book": "Advances in Neural Information Processing Systems", "page_first": 1124, "page_last": 1132, "abstract": "We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models.", "full_text": "Training sparse natural image models with a fast\n\nGibbs sampler of an extended state space\n\nLucas Theis\n\nWerner Reichardt Centre\n\nfor Integrative Neuroscience\nlucas@bethgelab.org\n\nJascha Sohl-Dickstein\n\nRedwood Center\n\nfor Theoretical Neuroscience\njascha@berkeley.edu\n\nMatthias Bethge\n\nWerner Reichardt Centre\n\nfor Integrative Neuroscience\n\nmatthias@bethgelab.org\n\nAbstract\n\nWe present a new learning strategy based on an ef\ufb01cient blocked Gibbs sampler\nfor sparse overcomplete linear models. 
Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models.\n\n1 Introduction\n\nHere we study learning and inference in the overcomplete linear model given by\n\nx = As,  p(s) = \u220f_i fi(si),  (1)\n\nwhere A \u2208 RM\u00d7N , N \u2265 M, and each marginal source distribution fi may depend on additional parameters. Our goal is to find parameters which maximize the model\u2019s log-likelihood, log p(x), for a given set of observations x.\nMost of the literature on overcomplete linear models assumes observations corrupted by additive Gaussian noise, that is, x = As + \u03b5 for a Gaussian distributed random variable \u03b5. Note that this is a special case of the model discussed here, as we can always represent this noise by making some of the sources Gaussian.\nWhen the observations are image patches, the source distributions fi(si) are typically assumed to be sparse or leptokurtotic [e.g., 2, 20, 28]. 
Examples include the Laplace distribution, the Cauchy distribution, and Student\u2019s t-distribution.\n\nFigure 1: A: In the noiseless overcomplete linear model, the posterior distribution over hidden sources s lives on a linear subspace. The two parallel lines indicate two different subspaces for different values of x. For sparse source distributions, the posterior will generally be heavy-tailed and multimodal, as can be seen on the right. B: A graphical model representation of the overcomplete linear model extended by two sets of auxiliary variables (Equations 2 and 3). We perform blocked Gibbs sampling between \u03bb and z to sample from the posterior distribution over all latent variables given an observation x. For a given \u03bb, the posterior over z becomes Gaussian while for given z, the posterior over \u03bb becomes factorial and is thus easy to sample from.\n\nA large family of leptokurtotic distributions which also contains the aforementioned distributions as a special case is formed by Gaussian scale mixtures (GSMs),\n\nfi(si) = \u222b_0^\u221e gi(\u03bbi) N(si; 0, \u03bbi^\u22121) d\u03bbi,  (2)\n\nwhere gi(\u03bbi) is a univariate density over precisions \u03bbi. In the following, we will concentrate on linear models whose marginal source distributions can be represented as GSMs. For a detailed description of the representational power of GSMs, see Andrews and Mallows\u2019 paper [1].\nDespite the apparent simplicity of the linear model, inference over the latent variables is computationally hard except for a few special cases, such as when all sources are Gaussian distributed. In particular, the posterior distribution over sources p(s | x) is constrained to a linear subspace and can have multiple modes with heavy tails (Figure 1A).\nInference can be simplified by assuming additive Gaussian noise, constraining the source distributions to be log-concave, or making crude approximations to the posterior. 
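To make the role of Equation 2 concrete, the following minimal sketch draws samples from a finite Gaussian scale mixture; the mixture weights and precision values are our own illustrative toy choices, not parameters from the paper. A high-precision component with large weight produces the many near-zero values, while a rare low-precision component produces the heavy tails:

```python
import numpy as np

def sample_finite_gsm(precisions, weights, num_samples, seed=0):
    # Finite Gaussian scale mixture (Equation 2 with a discrete g_i):
    # first draw a precision lambda from the prior, then s ~ N(0, 1 / lambda).
    rng = np.random.default_rng(seed)
    lam = rng.choice(precisions, size=num_samples, p=weights)
    return rng.normal(0.0, 1.0 / np.sqrt(lam))

# Mostly a narrow 'spike' (precision 100), occasionally a broad 'slab' (precision 0.1).
s = sample_finite_gsm(np.array([100.0, 0.1]), np.array([0.9, 0.1]), 100000)
excess_kurtosis = np.mean(s**4) / np.mean(s**2)**2 - 3.0  # > 0 means leptokurtotic
```

For these toy weights the sample kurtosis far exceeds the Gaussian value, i.e. the marginal is strongly leptokurtotic, which is exactly the regime where the posterior of the linear model becomes hard to handle.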
Here, however, we would like to exhaust the full potential of the linear model. On this account, we use Markov chain Monte Carlo (MCMC) methods to obtain samples with which we represent the posterior distribution. While computationally more demanding than many other methods, this allows us, at least in principle, to approximate the posterior to arbitrary precision.\nOther approximations often introduce strong biases and preclude learning of meaningful source distributions. Using MCMC, on the other hand, we can study the model\u2019s optimal sparseness and overcompleteness level in a more objective fashion as well as evaluate the model\u2019s log-likelihood.\nHowever, multiple modes and heavy tails also pose challenges to MCMC methods. General purpose methods are therefore likely to be slow. In the following, we will describe an efficient blocked Gibbs sampler which exploits the specific structure of the sparse linear model.\n\n2 Sampling and inference\n\nIn this section, we first review the nullspace sampling algorithm of Chen and Wu [4], which solves the problem of sampling from a linear subspace in the noiseless case of the overcomplete linear model. We then introduce an additional set of auxiliary variables which leads to an efficient blocked Gibbs sampler.\n\n2.1 Nullspace sampling\n\nThe basic idea behind the nullspace sampling algorithm is to extend the overcomplete linear model by an additional set of variables z which essentially makes it complete (Figure 1B),\n\n[x; z] = [A; B] s,  (3)\n\nwhere B \u2208 R(N\u2212M)\u00d7N and square brackets denote concatenation. If in addition to our observation x we knew the unobserved variables z, we could perform inference as in the complete case by simply solving the above linear system, provided the concatenation of A and B is invertible. 
If the rows of A and B are orthogonal, AB^T = 0, or, in other words, B spans the nullspace of A, we have\n\ns = A+x + B+z,  (4)\n\nwhere A+ and B+ are the pseudoinverses [24] of A and B, respectively. The marginal distributions over x and s do not depend on our choice of B, which means we can choose B freely. An orthogonal basis spanning the nullspace of A can be obtained from A\u2019s singular value decomposition [4].\nMaking use of Equation 4, we can equally well try to obtain samples from the posterior p(z | x) instead of p(s | x). In contrast to the latter, this distribution has full support and is not restricted to just a linear subspace,\n\np(z | x) \u221d p(z, x) \u221d p(s) = \u220f_i fi(wi^T x + vi^T z),  (5)\n\nwhere wi^T and vi^T are the i-th rows of A+ and B+, respectively. Chen and Wu [4] used Metropolis-adjusted Langevin (MALA) sampling [25] to sample from p(z | x).\n\n2.2 Blocked Gibbs sampling\n\nThe fact that the marginals fi(si) are expressed as Gaussian mixtures (Equation 2) can be used to derive an efficient blocked Gibbs sampler. The Gibbs sampler alternately samples nullspace representations z and precisions of the source marginals \u03bb. The key observation here is that given the precisions \u03bb, the distribution over x and z becomes Gaussian, which makes sampling from the posterior distribution tractable.\nA similar idea was pursued by Olshausen and Millman [21], who modeled the source distributions with mixtures of Gaussians and conditionally Gibbs sampled precisions one by one. However, a change in one of the precision variables entails larger computational costs, so that this algorithm is most efficient if only a few Gaussians are used and the probability of changing precisions is small. In contrast, here we update all precision variables in parallel by conditioning on the nullspace representation z. 
This makes it feasible to use a large or even infinite number of precisions.\nConditioned on a data point x and a corresponding nullspace representation z, the distribution over precisions \u03bb becomes factorial,\n\np(\u03bb | x, z) = p(\u03bb | s) \u221d p(s | \u03bb)p(\u03bb) = \u220f_i N(si; 0, \u03bbi^\u22121) gi(\u03bbi),  (6)\n\nwhere we have used the fact that we can perfectly recover the sources given x and z (Equation 4). Using a finite number of precisions \u03d1ik with prior probabilities \u03c0ik, for example, the posterior probability of \u03bbi being \u03d1ij becomes\n\np(\u03bbi = \u03d1ij | x, z) = N(si; 0, \u03d1ij^\u22121) \u03c0ij / \u2211_k N(si; 0, \u03d1ik^\u22121) \u03c0ik.  (7)\n\nConditioned on \u03bb, s is Gaussian distributed with diagonal covariance \u039b^\u22121 = diag(\u03bb^\u22121). As a linear transformation of s, the distribution over x and z is also Gaussian with covariance\n\n\u03a3 = [A\u039b^\u22121A^T, A\u039b^\u22121B^T; B\u039b^\u22121A^T, B\u039b^\u22121B^T] = [\u03a3xx, \u03a3xz; \u03a3xz^T, \u03a3zz].  (8)\n\nUsing standard Gaussian identities, we obtain\n\np(z | x, \u03bb) = N(z; \u00b5z|x, \u03a3z|x),  (9)\n\nwhere \u00b5z|x = \u03a3xz^T \u03a3xx^\u22121 x and \u03a3z|x = \u03a3zz \u2212 \u03a3xz^T \u03a3xx^\u22121 \u03a3xz. We use the following computationally efficient method to conditionally sample Gaussian distributions [8, 14]:\n\n[x'; z'] \u223c N(0, \u03a3),  z = z' + \u03a3xz^T \u03a3xx^\u22121 (x \u2212 x').  (10)\n\nIt can easily be shown that z has the desired distribution of Equation 9. Together, Equations 7 and 9 implement a rapidly mixing blocked Gibbs sampler. 
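Putting Equations 4, 7, and 10 together, one sweep of the blocked Gibbs sampler can be sketched as follows. This is a minimal NumPy sketch, not the paper's reference implementation; the function names, toy dimensions, and precision grid are our own:

```python
import numpy as np

def blocked_gibbs_sweep(x, z, A, B, precisions, weights, rng):
    # precisions, weights: (N, K) arrays holding K candidate precisions per
    # source and their prior probabilities (the finite GSM of Equation 7).
    s = np.linalg.pinv(A) @ x + np.linalg.pinv(B) @ z   # recover sources (Equation 4)

    # Step 1: sample precisions given sources; the posterior is factorial (Equation 7).
    log_p = 0.5 * np.log(precisions) - 0.5 * precisions * s[:, None]**2 + np.log(weights)
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    lam = np.array([rng.choice(pr, p=pi) for pr, pi in zip(precisions, p)])

    # Step 2: sample z given precisions via conditional Gaussian sampling (Equation 10):
    # draw a joint sample (x', z') from N(0, Sigma) and shift it to match x.
    s_prime = rng.normal(0.0, 1.0 / np.sqrt(lam))
    x_prime, z_prime = A @ s_prime, B @ s_prime
    Lam_inv = np.diag(1.0 / lam)
    Sigma_xx = A @ Lam_inv @ A.T
    Sigma_xz = A @ Lam_inv @ B.T
    z = z_prime + Sigma_xz.T @ np.linalg.solve(Sigma_xx, x - x_prime)
    return z, lam

rng = np.random.default_rng(0)
M, N, K = 2, 3, 5
A = rng.normal(size=(M, N))
B = np.linalg.svd(A)[2][M:]                 # orthogonal basis of A's nullspace
precisions = np.tile(np.logspace(-1.0, 2.0, K), (N, 1))
weights = np.full((N, K), 1.0 / K)
x = rng.normal(size=M)
z, lam = blocked_gibbs_sweep(x, rng.normal(size=N - M), A, B, precisions, weights, rng)
```

Since the nullspace basis returned by the SVD has orthonormal rows, B+ is simply B^T here; a quick sanity check is that the sources recovered from any sampled z map back exactly onto the observation x.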
However, the computational cost of solving\nEquation 10 is larger than for a single Markov step in other sampling methods such as MALA. We\nempirically show in the results section that for natural image patches the bene\ufb01ts of blocked Gibbs\nsampling outweigh its computational costs.\nA closely related sampling algorithm was proposed by Park and Casella [23] for implementing\nBayesian inference in the linear regression model with Laplace prior. The main differences here are\nthat we also consider the noiseless case by exploiting the nullspace representation, that instead of\nusing a \ufb01xed Laplace prior we will use the sampler to learn the distribution over source variables,\nand that we apply the algorithm in the context of image modeling. Related ideas were also discussed\nby Papandreou and Yuille [22], Schmidt et al. [27], and others.\n\n3 Learning\n\nIn the following, we describe a learning strategy for the overcomplete linear model based on the idea\nof persistent Markov chains [26, 32, 36], which already has led to improved learning strategies for\na number of different models [e.g., 6, 12, 29, 32].\nFollowing Girolami [11] and others, we use expectation maximization (EM) [7] to maximize the\nlikelihood of the overcomplete linear model. Instead of a variational approximation, here we use the\nblocked Gibbs sampler to sample a hidden state z for every data point x in the E-step. Each M-step\nthen reduces to maximum likelihood learning as in the complete case, for which many algorithms\nare available. Due to the sampling step, this variant of EM is known as Monte Carlo EM [34].\nDespite our efforts to make sampling ef\ufb01cient, running the Markov chain till convergence can still\nbe a costly operation due to the generally large number of data points and high dimensionality of\nposterior samples. 
To further reduce computational costs, we developed a learning strategy which\nmakes use of persistent Markov chains and only requires a few sampling steps in every iteration.\nInstead of starting the Markov chain anew in every iteration, we initialize the Markov chain with\nthe samples of the previous iteration. This approach is based on the following intuition. First, if the\nmodel changes only slightly, the posterior will change only slightly. As a result, the samples from\nthe previous iteration will provide a good initialization and fewer updates of the Markov chain will\nbe suf\ufb01cient to reach convergence. Second, if updating the Markov chain has only a small effect\non the posterior samples z, also the distribution of the complete data (x, z) will change very little.\nThus, the optimal parameters of the previous M-step will be close to optimal in the current M-step.\nThis causes an inef\ufb01cient Markov chain to automatically slow down the learning process, so that the\nposterior samples will always be close to the stationary distribution.\nEven updating the Markov chain only once results in a valid EM strategy, which can be seen as\nfollows. EM can be viewed as alternately optimizing a lower bound to the log-likelihood with\nrespect to model parameters \u03b8 and an approximating posterior distribution q [18]:\n\nF [q, \u03b8] = log p(x; \u03b8) \u2212 DKL [q(z | x) || p(z | x, \u03b8)] .\n\n(11)\n\nEach M-step increases F for \ufb01xed q while each E-step increases F for \ufb01xed \u03b8. This is repeated\nuntil a local optimum is reached. Importantly, local maxima of F are also local maxima of the\nlog-likelihood, log p(x; \u03b8).\nInterestingly, improving the lower bound F with respect to q can be accomplished by driving the\nMarkov chain with our Gibbs sampler or some other transition operator [26]. This can be seen\n\n4\n\n\fFigure 2: A: The average energy of posterior samples for different sampling methods after deter-\nministic initialization. 
Depending on the initialization, the average energy can be initially too low or too high. Gray lines correspond to different hyperparameter choices for the HMC sampler; red and brown lines indicate the manually picked best performing HMC and MALA samplers. The dashed line represents an unbiased estimate of the true average posterior energy. B: Autocorrelation functions for Gibbs sampling and the best HMC and MALA samplers.\n\nby using the fact that application of a transition operator T to any distribution cannot increase its Kullback-Leibler (KL) divergence to a stationary distribution [5, 15]:\n\nDKL[Tq(z | x) || p(z | x, \u03b8)] \u2264 DKL[q(z | x) || p(z | x, \u03b8)],  (12)\n\nwhere Tq(z | x) = \u222b q(z0 | x) T(z | z0, x) dz0 and T(z | z0, x) is the probability density of making a transition from z0 to z. Hence, each Gibbs update of the hidden states implicitly increases F. In practice, of course, we only have access to samples from Tq and will never compute it explicitly.\nThis shows that the algorithm converges provided the log-likelihood is bounded. This stands in contrast to other contexts where persistent Markov chains have been successful but training can diverge [10]. To guarantee not only convergence but convergence to a local optimum of F, we would also have to prove DKL[T^n q(z | x) || p(z | x, \u03b8)] \u2192 0 for n \u2192 \u221e. Unfortunately, most results on MCMC convergence deal with convergence in total variation, which is weaker than convergence in KL divergence.\n\n4 Results\n\nWe trained several linear models on log-transformed, centered and symmetrically whitened image patches extracted from van Hateren\u2019s dataset of natural images [33]. 
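As an aside, the contraction property of Equation 12, which underlies the persistent learning strategy of Section 3, is easy to verify numerically on a small discrete chain. The toy transition matrix below is our own illustration and plays no role in the paper's experiments:

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
T = rng.random((4, 4)) + 0.1
T /= T.sum(axis=0, keepdims=True)        # column-stochastic: q_new = T @ q_old
evals, evecs = np.linalg.eig(T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                           # stationary distribution, T @ pi = pi

# Repeatedly applying T can never increase D_KL(q || pi) (Equation 12).
q = np.array([0.7, 0.1, 0.1, 0.1])
divergences = []
for _ in range(20):
    divergences.append(kl(q, pi))
    q = T @ q
```

The recorded divergences decrease monotonically towards zero, mirroring the argument that each Gibbs update of the persistent chains implicitly increases the lower bound F.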
We explicitly modeled the DC component of the whitened image patches using a mixture of Gaussians and constrained the remaining components of the linear basis to be orthogonal to the DC component.\nFor faster convergence, we initialized the linear basis with the sparse coding algorithm of Olshausen and Field [19], which corresponds to learning with MAP inference and fixed marginal source distributions. After initialization, we optimized the basis using L-BFGS [3] during each M-step and updated the representation of the posterior using 2 steps of Gibbs sampling in each E-step. To represent the source marginals, we used finite GSMs (Equation 7) with 10 precisions \u03d1ij each and equal prior weights, that is, \u03c0ij = 0.1. The source marginals were initialized by fitting them to samples from the Laplace distribution and later optimized using 10 iterations of standard EM at the beginning of each M-step.\n\n4.1 Performance of the blocked Gibbs sampler\n\nWe compared the sampling performance of our Gibbs sampler to MALA sampling\u2014as used by Chen and Wu [4]\u2014as well as HMC sampling [9], which is a generalization of MALA. The HMC sampler has two parameters: a step width and a number of so-called leapfrog steps. In addition, we slightly randomized the step width to avoid problems with periodicity [17], which added an additional parameter to control the degree of randomization. After manually determining a reasonable range for the parameters of HMC, we picked 40 parameter sets for each model to test against our Gibbs sampler.\n\nFigure 3: We trained models with up to four times overcomplete representations using either Laplace marginals or GSM marginals. A four times overcomplete basis set is shown in the center. 
Basis vectors were normalized so that the corresponding source distributions had unit variance.\nThe left plot shows the norms of the learned basis vectors. With \ufb01xed Laplace marginals, the al-\ngorithm produces a basis which is barely overcomplete. However, with GSM marginals the model\nlearns bases which are at least three times overcomplete. The right panel shows log-densities of the\nsource distributions corresponding to basis vectors inside the dashed rectangle. For reference, each\nplot also contains a Laplace distribution of equal variance.\n\nThe algorithms were tested on one toy model and one two times overcomplete model trained on\n8 \u00d7 8 image patches. The toy model employed 1 visible unit and 3 hidden units with exponential\npower distributions whose exponents were 0.5. The entries of its basis matrix were randomly drawn\nfrom a Gaussian distribution with mean 1 and standard deviation 0.2.\nFigure 2 shows trace plots and autocorrelation functions for the different sampling methods. The\ntrace plots were generated by measuring the negative log-density (or energy) of posterior samples\nfor a \ufb01xed set of visible states over time, \u2212 log p(x, zt), and averaging over data points. Autocorre-\nlation functions were estimated from single Markov chain runs of equal duration for each sampler\nand data point. All Markov chains were initialized using 100 burn-in steps of Gibbs sampling, inde-\npendent of the sampler used to generate the autocorrelation functions. Finally, we averaged several\nautocorrelation functions corresponding to different data points (see Supplementary Section 1 for\nmore information).\nFor both models we observed faster convergence with Gibbs sampling than with the best MALA\nor HMC samplers (Figure 2). The image model in particular bene\ufb01ted from replacing MALA by\nHMC. Still, even the best HMC sampler produced more correlated samples than the blocked Gibbs\nsampler. 
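For readers who want to reproduce this kind of comparison, the autocorrelation functions shown in Figure 2B can be estimated in a few lines. This sketch is our own and demonstrates the estimator on an AR(1) chain whose true lag-k autocorrelation is 0.9^k:

```python
import numpy as np

def autocorrelation(chain, max_lag):
    # Normalized autocorrelation of a scalar chain at lags 0 .. max_lag - 1.
    x = chain - chain.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var for k in range(max_lag)])

# AR(1) chain: chain[t] = rho * chain[t - 1] + noise[t].
rng = np.random.default_rng(3)
rho, n = 0.9, 200000
noise = rng.normal(size=n)
chain = np.empty(n)
chain[0] = noise[0]
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + noise[t]
acf = autocorrelation(chain, 10)
```

Applied to the energy traces of competing samplers, slower-decaying curves of this kind correspond to more strongly correlated, less efficient chains.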
While the best HMC sampler reached an autocorrelation of 0.05 after about 64 seconds, it took only about 26 seconds with the blocked Gibbs sampler (right-hand side of Figure 2B).\nAll tests were performed on a single core of an AMD Opteron 6174 machine with 2.20 GHz and implementations written in Python and NumPy.\n\n4.2 Sparsity and overcompleteness\n\nBerkes et al. [2] found that even for very sparse choices of the Student-t prior, the representations learned by the linear model are barely overcomplete if a variational approximation to the posterior is used. Similar results and even undercomplete representations were obtained by Seeger [28] with the Laplace prior. The results of these studies suggest that the optimal basis set is not very overcomplete. On the other hand, basis sets obtained with other, often cruder approximations are often highly overcomplete. In the following, we revisit the question of optimal overcompleteness and support our findings with quantitative measurements.\nConsistent with the study of Seeger [28], if we fix the source distributions to be Laplacian, our algorithm learns representations which are only slightly overcomplete (Figure 3). However, much more overcomplete representations were obtained when the source distributions were learned from the data. This is in line with the results of Olshausen and Millman [21], who used mixtures of two and three Gaussians as source distributions and obtained two times overcomplete representations for 8 \u00d7 8 image patches.\n\nFigure 4: A comparison of different models for natural image patches. While using overcomplete representations (OLM) yields substantial improvements over the complete linear model (LM), it still cannot compete with other models of natural image patches. GSM here refers to a single multivariate Gaussian scale mixture, that is, an elliptically contoured distribution with very few parameters (see Supplementary Section 3). Log-likelihoods are reported for non-whitened image patches. Average log-likelihood and standard error of the mean (SEM) were calculated from log-probabilities of 10000 test data points.\n\nFigure 3 suggests that with GSMs as source distributions, the model can make use of three and up to four times overcomplete representations. Our quantitative evaluations confirmed a substantial improvement of the two-times overcomplete model over the complete model. Beyond this, however, the improvements quickly become negligible (Figure 4).\nThe source distributions discovered by our algorithm were extremely sparse and resembled spike-and-slab distributions, generating mostly values close to zero with the occasional outlier. Source distributions of low-frequency components generally had narrower peaks than those of high-frequency components (Figure 3).\n\n4.3 Model comparison\n\nTo compare the performance of the overcomplete linear model to the complete linear model and other image models, we would like to evaluate the overcomplete linear model\u2019s log-likelihood on a test set of images. However, to do this, we would have to integrate out all hidden units, which we cannot do analytically. One way to nevertheless obtain an unbiased estimate of p(x) is by introducing a tractable distribution as follows:\n\np(x) = \u222b p(x, z) dz = \u222b q(z | x) p(x, z)/q(z | x) dz.  (13)\n\nWe can then estimate the above integral by sampling states zn from q(z | x) and averaging over p(x, zn)/q(zn | x), a technique called importance sampling. The closer q(z | x) is to p(z | x), the more efficient the estimator will be.\nA procedure for constructing distributions q(z | x) from transition operators such as our Gibbs sampling operator is annealed importance sampling (AIS) [16]. 
AIS starts with a simple and tractable distribution and successively brings it closer to p(z | x). The computational and statistical efficiency of the estimator depends on the efficiency of the transition operator. Here, we used our Gibbs sampler and constructed intermediate distributions by interpolating between a Gaussian distribution and the overcomplete linear model. For the four-times overcomplete model, we used 300 intermediate distributions and 300 importance samples to estimate the density of each data point.\nWe find that the overcomplete linear model is still worse than, for example, a single multivariate GSM with separately modeled DC component (Figure 4; see also Supplementary Section 3).\n\n[Figure 4 shows bar plots of log-likelihood \u00b1 SEM in bit/pixel as a function of overcompleteness, for 16 \u00d7 16 and 8 \u00d7 8 image patches, comparing Gaussian, GSM, LM, OLM, and PoT models.]\n\nAn alternative overcomplete generalization of the complete linear model is the family of products of experts (PoE) [13]. Instead of introducing additional source variables, a PoE can have more factors than visible units,\n\ns = Wx,  p(x) \u221d \u220f_i fi(si),  (14)\n\nwhere W \u2208 RN\u00d7M and each factor is also called an expert. For N = M, the PoE is equivalent to the linear model (Equation 1). In contrast to the overcomplete linear model, the prior over hidden sources s here is in general not factorial.\nA popular choice of PoE in the context of natural images is the product of Student-t (PoT) distributions, in which experts have the form fi(si) = (1 + si^2)^\u2212\u03b1i [35]. To train the PoT, we used a persistent variant of minimum probability flow learning [29, 31]. We used AIS in combination with HMC to evaluate each PoT model [30]. 
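On a toy model where the marginal likelihood is known in closed form, the plain importance-sampling identity of Equation 13 (without annealing) can be sketched as follows; the joint distribution and the proposal q below are our own illustrative choices, not the AIS construction used in the paper:

```python
import numpy as np

def log_normal(x, mean, var):
    # Log-density of a univariate Gaussian.
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean)**2 / var

# Toy joint: z ~ N(0, 1) and x | z ~ N(z, 1), so the true marginal is p(x) = N(x; 0, 2).
rng = np.random.default_rng(2)
x = 1.3
z = rng.normal(x / 2.0, 1.0, size=200000)   # proposal q(z | x), roughly posterior-shaped
log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, x / 2.0, 1.0)
log_px = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
```

The estimate agrees closely with the exact value log N(1.3; 0, 2); the better q matches the true posterior (here N(x/2, 1/2)), the lower the variance, which is precisely why AIS invests effort in constructing a good q from the transition operator.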
We \ufb01nd that the PoT is better suited for modeling the\nstatistics of natural images and takes better advantage of overcomplete representations (Figure 4).\nWhile both the estimator for the PoT and the estimator for the overcomplete linear model are con-\nsistent, the former tends to overestimate and the latter tends to underestimate the average log-\nlikelihood. It is thus crucial to test convergence of both estimates if any meaningful comparison\nis to be made (see Supplementary Section 2).\n\n5 Discussion\n\nWe have shown how to ef\ufb01ciently perform inference, training and evaluation in the sparse overcom-\nplete linear model. While general purpose sampling algorithms such as MALA or HMC have the\nadvantage of being more widely applicable, we showed that blocked Gibbs sampling can be much\nfaster when the source distributions are sparse, as for natural images.\nAnother advantage of our sampler is that it is parameter free. Choosing suboptimal parameters\nfor the HMC sampler can lead to extremely poor performance. Which parameters are optimal can\nchange from data point to data point and over time as the model is trained. Furthermore, monitoring\nthe convergence of the Markov chains can be problematic [28]. We showed that by training a model\nwith a persistent variant of Monte Carlo EM, even the number of sampling steps performed in each\nE-step becomes much less crucial for the success of training.\nOptimizing and evaluating the likelihood of overcomplete linear models is a challenging problem.\nTo our knowledge, our study is the \ufb01rst to show a clear advantage of the overcomplete linear model\nover its complete counterpart on natural images. At the same time, we demonstrated that with the\nassumptions of a factorial prior, the overcomplete linear model underperforms other generalizations\nof the complete linear model. Yet it is easy to see how our algorithm could be extended to other,\nmuch better performing models. 
For instance, models in which multiple sources are modeled jointly\nby a multivariate GSM, or bilinear models with two sets of latent variables.\nCode for training and evaluating overcomplete linear models is available at\n\nhttp://bethgelab.org/code/theis2012d/.\n\nAcknowledgments\n\nThe authors would like to thank Bruno Olshausen, Nicolas Heess and George Papandreou for helpful\ncomments. This study was \ufb01nancially supported by the Bernstein award (BMBF; FKZ: 01GQ0601),\nthe German Research Foundation (DFG; priority program 1527, BE 3848/2-1), and a DFG-NSF\ncollaboration grant (TO 409/8-1).\n\nReferences\n[1] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical\n\nSociety, Series B, 36(1):99\u2013102, 1974.\n\n[2] P. Berkes, R. Turner, and M. Sahani. On sparsity and overcompleteness in image models. Advances in\n\nNeural Information Processing Systems, 20, 2008.\n\n[3] R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization.\n\nSIAM Journal on Scienti\ufb01c and Statistical Computing, 16(5):1190\u20131208, 1995.\n\n8\n\n\f[4] R.-B. Chen and Y. N. Wu. A null space method for over-complete blind source separation. Computational\n\nStatistics & Data Analysis, 51(12):5519\u20135536, 2007.\n\n[5] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.\n[6] B. J. Culpepper, J. Sohl-Dickstein, and B. A. Olshausen. Building a better probabilistic model of images\n\nby factorization. Proceedings of the International Conference on Computer Vision, 13, 2011.\n\n[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM\n\nalgorithm. Journal of the Royal Statistical Society, Series B, 39(1):1\u201338, 1977.\n\n[8] A. Doucet. A note on ef\ufb01cient conditional simulation of Gaussian distributions, 2010.\n[9] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. 
Physics Letters B, 195\n\n(2):216\u2013222, 1987.\n\n[10] A. Fischer and C. Igel. Empirical analysis of the divergence of Gibbs sampling based learning algorithms\nfor restricted Boltzmann machines. Proceedings of the 20th International Conference on Arti\ufb01cial Neural\nNetworks, 2010.\n\n[11] M. Girolami. A variational method for learning sparse and overcomplete representations. Neural Com-\n\nputation, 13(11):2517\u20132532, 2001.\n\n[12] N. Heess, N. Le Roux, and J. Winn. Weakly supervised learning of foreground-background segmentation\n\nusing masked rbms. International Conference on Arti\ufb01cial Neural Networks, 2011.\n\n[13] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation,\n\n14(8):1771\u20131800, 2002.\n\n[14] Y. Hoffman and E. Ribak. Constrained realizations of Gaussian \ufb01elds: a simple algorithm. The Astro-\n\nphysical Journal, 380:L5\u2013L8, 1991.\n\n[15] I. Murray and R. Salakhutdinov. Notes on the KL-divergence between a Markov chain and its equilibrium\n\ndistribution, 2008.\n\n[16] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125\u2013139, 2001.\n[17] R. M. Neal. MCMC using Hamiltonian Dynamics, pages 113\u2013162. Chapman & Hall/CRC Press, 2011.\n[18] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justi\ufb01es incremental, sparse, and other\n\nvariants, pages 355\u2013368. MIT Press, 1998.\n\n[19] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive \ufb01eld properties by learning a sparse\n\ncode for natural images. Nature, 381:607\u2013609, 1996.\n\n[20] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by\n\nV1? Vision Research, 37(23):3311\u20133325, 1997.\n\n[21] B. A. Olshausen and K. J. Millman. Learning sparse codes with a mixture-of-Gaussians prior. Advances\n\nin Neural Information Processing Systems, 12, 2000.\n\n[22] G. Papandreou and A. L. Yuille. 
Gaussian sampling by local perturbations. Advances in Neural Informa-\n\ntion Processing Systems, 23, 2010.\n\n[23] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):\n\n681\u2013686, 2008.\n\n[24] R. Penrose. A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society,\n\n51:406\u2013413, 1955.\n\n[25] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin diffusions and their discrete\n\napproximations. Bernoulli, 2(4):341\u2013363, 1996.\n\n[26] B. Sallans. A hierarchical community of experts. Master\u2019s thesis, University of Toronto, 1998.\n[27] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on MRFs in low-level vision. Proceedings of\n\nthe IEEE Conference on Computer Vision and Pattern Recognition, 2010.\n\n[28] M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine\n\nLearning Research, 9:759\u2013813, 2008.\n\n[29] J. Sohl-Dickstein. Persistent minimum probability \ufb02ow, 2011.\n[30] J. Sohl-Dickstein and B. J. Culpepper. Hamiltonian annealed importance sampling for partition function\n\nestimation, 2012.\n\n[31] J. Sohl-Dickstein, P. Battaglino, and M. R. DeWeese. Minimum probability \ufb02ow learning. Proceedings\n\nof the 28th International Conference on Machine Learning, 2011.\n\n[32] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient.\n\nProceedings of the 25th International Conference on Machine Learning, 2008.\n\n[33] J. H. van Hateren and A. van der Schaaf. Independent component \ufb01lters of natural images compared with\nsimple cells in primary visual cortex. Proc. of the Royal Society B: Biological Sciences, 265(1394), 1998.\n[34] G. C. G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man\u2019s\ndata augmentation algorithms. Journal of the American Statistical Association, 85(411):699\u2013704, 1990.\n[35] M. 
Welling, G. Hinton, and S. Osindero. Learning sparse topographic representations with products of\n\nStudent-t distributions. Advances in Neural Information Processing Systems, 15, 2003.\n\n[36] L. Younes. Parametric inference for imperfectly observed Gibbsian \ufb01elds. Probability Theory and Related\n\nFields, 1999.\n\n9\n\n\f", "award": [], "sourceid": 540, "authors": [{"given_name": "Lucas", "family_name": "Theis", "institution": null}, {"given_name": "Jascha", "family_name": "Sohl-dickstein", "institution": null}, {"given_name": "Matthias", "family_name": "Bethge", "institution": null}]}