{"title": "Generating more realistic images using gated MRF's", "book": "Advances in Neural Information Processing Systems", "page_first": 2002, "page_last": 2010, "abstract": "Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks.", "full_text": "Generating more realistic images using gated MRF\u2019s\n\nMarc\u2019Aurelio Ranzato\n\nVolodymyr Mnih\n\nGeoffrey E. Hinton\n\nDepartment of Computer Science\n\n{ranzato,vmnih,hinton}@cs.toronto.edu\n\nUniversity of Toronto\n\nAbstract\n\nProbabilistic models of natural images are usually evaluated by measuring per-\nformance on rather indirect tasks, such as denoising and inpainting. 
A more di-\nrect way to evaluate a generative model is to draw samples from it and to check\nwhether statistical properties of the samples match the statistics of natural images.\nThis method is seldom used with high-resolution images, because current models\nproduce samples that are very different from natural images, as assessed by even\nsimple visual inspection. We investigate the reasons for this failure and we show\nthat by augmenting existing models so that there are two sets of latent variables,\none set modelling pixel intensities and the other set modelling image-speci\ufb01c pixel\ncovariances, we are able to generate high-resolution images that look much more\nrealistic than before. The overall model can be interpreted as a gated MRF where\nboth pair-wise dependencies and mean intensities of pixels are modulated by the\nstates of latent variables. Finally, we con\ufb01rm that if we disallow weight-sharing\nbetween receptive \ufb01elds that overlap each other, the gated MRF learns more ef\ufb01-\ncient internal representations, as demonstrated in several recognition tasks.\n\n1\n\nIntroduction and Prior Work\n\nThe study of the statistical properties of natural images has a long history and has in\ufb02uenced many\n\ufb01elds, from image processing to computational neuroscience [1]. In this work we focus on proba-\nbilistic models of natural images. These models are useful for extracting representations [2, 3, 4]\nthat can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be\nused in applications like denoising and inpainting. 
Our main focus, however, will be on improving\nthe quality of the generative model, rather than exploring its possible applications.\nMarkov Random Fields (MRF\u2019s) provide a very general framework for modelling natural images.\nIn an MRF, an image is assigned a probability which is a normalized product of potential functions,\nwith each function typically being de\ufb01ned over a subset of the observed variables. In this work we\nconsider a very versatile class of MRF\u2019s in which potential functions are de\ufb01ned over both pixels\nand latent variables, thus allowing the states of the latent variables to modulate or gate the effective\ninteractions between the pixels. This type of MRF, which we dub gated MRF, was proposed as an\nimage model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family (see footnote 1)\ncould be learned for small image patches and their work was extended to high-resolution images by\nRoth and Black [6] who also demonstrated its success in some practical applications [7].\n\n1 Product of Student\u2019s t models (without pooling) may not appear to have latent variables but each potential\ncan be viewed as an in\ufb01nite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the\nlatent variable.\n\nBesides their practical use, these models were speci\ufb01cally designed to match the statistical properties\nof natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several\nauthors [10, 7] have proposed that these models should be evaluated by generating images and\nchecking whether the samples match the statistical properties observed in natural images. It is,\ntherefore, very troublesome that none of the existing models can generate good samples, especially\nfor high-resolution images (see for instance \ufb01g. 
2 in [7] which is one of the best models of high-\nresolution images reported in the literature so far).\nIn fact, as our experiments demonstrate the\ngenerated samples from these models are more similar to random images than to natural images!\nWhen MRF\u2019s with gated interactions are applied to small image patches, they actually seem to\nwork moderately well, as demonstrated by several authors [11, 12, 13]. The generated patches have\nsome coherent and elongated structure and, like natural image patches, they are predominantly very\nsmooth with sudden outbreaks of strong structure. This is unsurprising because these models have\na built-in assumption that images are very smooth with occasional strong violations of smooth-\nness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by\nreplicating \ufb01lters across the image has proven to be dif\ufb01cult. The receptive \ufb01elds that are learned\nno longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of\nthe long range structure that is so typical of natural images [7]. The success of these methods in\napplications such as denoising is a poor measure of the quality of the generative model that has been\nlearned: Setting the parameters to random values works almost as well for eliminating independent\nGaussian noise [17], because this can be done quite well by just using a penalty for high-frequency\nvariation.\nIn this work, we show that the generative quality of these models can be drastically improved by\njointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using\ntwo sets of latent variables, one that gates pair-wise interactions between pixels and another one that\nsets the mean intensities of pixels, as we already proposed in some earlier work [4]. 
Here, we show\nthat this modelling choice is crucial to make the gated MRF work well on high-resolution images.\nFinally, we show that the most widely used method of sharing weights in MRF\u2019s for high-resolution\nimages is overly constrained. Earlier work considered homogeneous MRF\u2019s in which each potential\nis replicated at all image locations. This has the subtle effect of making learning very dif\ufb01cult\nbecause of strong correlations at nearby sites. Following Gregor and LeCun [18] and also Tang and\nEliasmith [19], we keep the number of parameters under control by using local potentials, but unlike\nRoth and Black [6] we only share weights between potentials that do not overlap.\n\n2 Augmenting Gated MRF\u2019s with Mean Hidden Units\n\nA Product of Student\u2019s t (PoT) model [15] is a gated MRF de\ufb01ned on small image patches that\ncan be viewed as modelling image-speci\ufb01c, pair-wise relationships between pixel values by using\nthe states of its latent variables. It is very good at representing the fact that two pixels have very\nsimilar intensities but very poor at modelling what these intensities are. Failure to model the\nmean also leads to impoverished modelling of the covariances when the input images have non-\nzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same\nlimitation since it only differs from PoT in the distribution of its latent variables: The posterior over\nthe latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT.\nWe explain the fundamental limitation of these models by using a simple toy example: Modelling\ntwo-pixel images using a cRBM with only one binary hidden unit, see \ufb01g. 1.\nThis cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a\ncovariance that is determined by the state of the latent variable. 
Since the latent variable is binary, the\ncRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable\nuses the pairwise relationship between pixels to decide which of the two covariance matrices should\nbe used to model each image. When the input data is pre-processed by making each image have zero\nmean intensity (the empirical histogram is shown in the \ufb01rst row and \ufb01rst column), most images lie\nnear the origin because most of the time nearby pixels are strongly correlated. Less frequently we\nencounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long\ntails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (\ufb01rst row\nand second column): one that is spherical and tight at the origin for smooth images and another one\nthat has a covariance elongated along the anti-diagonal for structured images.\nIf, however, the whole set of images is normalized by subtracting from every pixel the mean value\nof all pixels over all images (second row and \ufb01rst column), the cRBM fails at modelling structured\nimages (second row and second column). It can \ufb01t a Gaussian to the smooth images by discovering\n\nFigure 1: In the \ufb01rst row, each image is zero mean. In the second row, the whole set of data points is centered\nbut each image can have non-zero mean. The \ufb01rst column shows 8x8 images picked at random from natural\nimages. The images in the second column are generated by a model that does not account for mean intensity.\nThe images in the third column are generated by a model that has both \u201cmean\u201d and \u201ccovariance\u201d hidden units.\nThe contours in the \ufb01rst column show the negative log of the empirical distribution of (tiny) natural two-pixel\nimages (x-axis being the \ufb01rst pixel and the y-axis the second pixel). 
The plots in the other columns are toy\nexamples showing how each model could represent the empirical distribution using a mixture of Gaussians\nwith components that have one of two possible covariances (corresponding to the state of a binary \u201ccovariance\u201d\nlatent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can better represent\nstructured images (edge images lie along the anti-diagonal and are \ufb01tted by the Gaussians shown in red) while\nthe other models (PoT and cRBM) fail, especially when each image can have non-zero mean.\n\nthe direction of strong correlation along the main diagonal, but it is very likely to fail to discover the\ndirection of anti-correlation, which is crucial to represent discontinuities, because structured images\nwith different mean intensity appear to be evenly spread over the whole input space.\nIf the model has another set of latent variables that can change the means of the Gaussian distribu-\ntions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models),\nthen the model can represent both changes of mean intensity and the correlational structure of pixels\n(see last column). The mean latent variables effectively subtract off the relevant mean from each\ndata-point, letting the covariance latent variable capture the covariance structure of the data. As\nbefore, the covariance latent variable needs only to select between two covariance matrices.\nIn fact, experiments on real 8x8 image patches con\ufb01rm these conjectures. Fig. 1 shows samples\ndrawn from PoT and mPoT. mPoT (and similarly mcRBM [4]) is not only better at modelling zero\nmean images but it can also represent images that have non-zero mean intensity well.\nWe now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. 
In PoT [9]\nthe energy function is:\n\nE^{PoT}(x, h^c) = \\sum_i [ h^c_i (1 + \\frac{1}{2} (C_i^T x)^2) + (1 - \\gamma) \\log h^c_i ]    (1)\n\nwhere x is a vectorized image patch, h^c is a vector of Gamma \u201ccovariance\u201d latent variables, C is\na \ufb01lter bank matrix and \u03b3 is a scalar parameter. The joint probability over input pixels and latent\nvariables is proportional to exp(\u2212E^{PoT}(x, h^c)). Therefore, the conditional distribution over the\ninput pixels is a zero-mean Gaussian with covariance equal to:\n\n\\Sigma^c = (C \\, diag(h^c) \\, C^T)^{-1}.    (2)\n\nIn order to make the mean of the conditional distribution non-zero, we de\ufb01ne mPoT as the normal-\nized product of the above zero-mean Gaussian that models the covariance and a spherical covariance\nGaussian that models the mean. The overall energy function becomes:\n\nE^{mPoT}(x, h^c, h^m) = E^{PoT}(x, h^c) + E^m(x, h^m)    (3)\n\nFigure 2: Illustration of different choices of weight-sharing scheme for a RBM. Links converging to one latent\nvariable are \ufb01lters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A)\nGlobal, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods\nto which \ufb01lters are applied. Cells with the same color share the same parameters. F) 256 \ufb01lters learned by\na Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each \ufb01lter has size\n16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position\n(i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away from each other. Best viewed in color.\n\nwhere h^m is another set of latent variables that are assumed to be Bernoulli distributed (but other\ndistributions could be used). 
The new energy term is:\n\nE^m(x, h^m) = \\frac{1}{2} x^T x - \\sum_j h^m_j W_j^T x    (4)\n\nyielding the following conditional distribution over the input pixels:\n\np(x | h^c, h^m) = N(\\Sigma (W h^m), \\Sigma), \\Sigma = ((\\Sigma^c)^{-1} + I)^{-1}    (5)\n\nwith \\Sigma^c de\ufb01ned in eq. 2. As desired, the conditional distribution has non-zero mean (see footnote 2).\nPatch-based models like PoT have been extended to high-resolution images by using spatially lo-\ncalized \ufb01lters [6]. While we can subtract off the mean intensity from independent image patches to\nsuccessfully train PoT, we cannot do that on a high-resolution image because overlapping patches\nmight have different mean. Unfortunately, replicating potentials over the image ignoring variations\nof mean intensity has been the leading strategy to date [6] (see footnote 3). This is the major reason why generation\nof high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by\nexplicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM.\n\n3 Weight-Sharing Schemes\n\nBy integrating out the latent variables, we can write the density function of any gated MRF as a\nnormalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate\ndifferent ways of constraining the parameters of the potentials of a generic MRF.\nGlobal: The obvious way to extend a patch-based model like PoT to high-resolution images is to\nde\ufb01ne potentials over the whole image; we call this scheme global. 
This is not practical because\n1) the number of parameters grows about quadratically with the size of the image making training\ntoo slow, 2) we do not need to model interactions between very distant pairs of pixels since their\ndependence is negligible, and 3) we would not be able to use the model on images of different size.\n\n2 The need to model the means was clearly recognized in [21] but they used conjunctive latent features that\nsimultaneously represented a contribution to the \u201cprecision matrix\u201d in a speci\ufb01c direction and the mean along\nthat same direction.\n\n3 The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively\nreplaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities\nthat are cleaned up by the patterns of correlation enforced by the covariance latent variables.\n\nConv:\nThe most popular way to handle big images is to de\ufb01ne potentials on small subsets of\nvariables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while\nsharing their parameters at each image location [23, 24, 6]. This yields a convolutional weight-\nsharing scheme, also called homogeneous \ufb01eld in the statistics literature. This choice is justi\ufb01ed\nby the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of\nnumber of parameters, but also rather inef\ufb01cient in terms of latent representation. First, if there are\nN \ufb01lters at each location and these \ufb01lters are stepped by one pixel then the internal representation\nis about N times overcomplete. The internal representation has not only high computational cost,\nbut it is also highly redundant. Since the input is mostly smooth and the parameters are the same\nacross space, the latent variables are strongly correlated as well. 
This inef\ufb01ciency turns out to be\nparticularly harmful for a model like PoT causing the learned \ufb01lters to become \u201crandom\u201d looking\n(see \ufb01g. 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If\nthe \ufb01lter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and\nwrite: p(y) = \\prod_i S(y_i), where y_i = C_i^T x and S is a Student\u2019s t distribution. In other words, there\nis an underlying assumption that \ufb01lter outputs are independent. However, if the \ufb01lters of matrix C\nare shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the\nConv weight-sharing scheme forces the model to \ufb01nd \ufb01lters that make \ufb01lter outputs as independent\nas possible, which explains the very high-frequency patterns that are usually discovered [6].\nLocal: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of\npossibilities. For instance, we can de\ufb01ne potentials on a small subset of input variables but, unlike\nConv, each potential can have its own set of parameters, as shown in \ufb01g. 2-B. This is called local,\nor inhomogeneous \ufb01eld. Compared to Conv the number of parameters increases only slightly but\nthe number of latent variables required and their redundancy is greatly reduced. In fact, the model\nlearns different receptive \ufb01elds at different locations as a better strategy for representing the input,\nespecially when the number of potentials is limited (see also \ufb01g. 2-F).\nTConv: Local would not allow the model to be trained and tested on images of different resolution,\nand it might seem wasteful not to exploit the translation invariant property of images. We therefore\nadvocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in\n\ufb01g. 2-C and E [18]. Each \ufb01lter tiles the image without overlaps with copies of itself (i.e. 
the stride\nequals the \ufb01lter diameter). 
This reduces spatial redundancy of latent variables and allows the input\nimages to have arbitrary size. At the same time, different \ufb01lters do overlap with each other in order\nto avoid tiling artifacts. Fig. 2-F shows \ufb01lters that were (jointly) learned by a Restricted Boltzmann\nMachine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme.\n\n4 Experiments\n\nWe train gated MRF\u2019s with and without mean hidden units using different weight-sharing schemes.\nThe training procedure is very similar in all cases. We perform approximate maximum likelihood by\nusing Fast Persistent Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid\nMonte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use\nHMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For\nmPoT this is:\n\nF^{mPoT}(x) = -\\log p(x) + const. = \\sum_{k,i} \\gamma \\log(1 + \\frac{1}{2} (C_{ik}^T x_k)^2) + \\frac{1}{2} x^T x - \\sum_{k,j} \\log(1 + \\exp(W_{jk}^T x_k))    (6)\n\nwhere the index k runs over spatial locations and x_k is the k-th image patch. FPCD keeps samples,\ncalled negative particles, that it uses to represent the model distribution. These particles are all\nupdated after each weight update. For each mini-batch of data-points a) we compute the derivative\nof the free energy at the training samples, b) we update the negative particles by running HMC for\none HMC step consisting of 20 leapfrog steps. We start at the previous set of negative particles and\nuse as parameters the sum of the regular parameters and a small perturbation vector, c) we compute\nthe derivative of the free energy at the negative particles, and d) we update the regular parameters\nby using the difference of gradients between step a) and c) while the perturbation vector is updated\nusing the gradient from c) only. 
The perturbation is also strongly decayed to zero and is subject to a\nlarger learning rate. The aim is to encourage the negative particles to explore the space more quickly\nby slightly and temporarily raising the energy at their current position. Note that the use of FPCD\nas opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be\ncrucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32\nsamples using gray-scale images of approximate size 160x160 pixels randomly cropped from the\nBerkeley segmentation dataset [28]. We perform 160,000 weight updates decreasing the learning rate\nby a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance\n\nFigure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoT-\nTConv. On the side also i) a subset of 8x8 \u201ccovariance\u201d \ufb01lters learned by mPoT-TConv (the plot below shows\nhow the whole set of \ufb01lters tiles a small patch; each bar corresponds to a Gabor \ufb01t of a \ufb01lter and colors identify\n\ufb01lters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution\nimage is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 \u201cmean\u201d\n\ufb01lters learned by the same mPoT-TConv, iii) \ufb01lters learned by PoT-Conv and iv) by PoT-TConv.\n\n\ufb01lters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 4), and 0.001 for the\nother parameters (\u03b3 of eq. 1). During training we condition on the borders and initialize the negative\nparticles at zero in order to avoid artifacts at the border of the image. 
We learn 8x8 \ufb01lters and\npre-multiply the covariance \ufb01lters by a whitening transform retaining 99% of the variance; we also\nnormalize the norm of the covariance \ufb01lters to prevent some of them from decaying to zero during\ntraining4.\nWhenever we use the TConv weight-sharing scheme the model learns covariance \ufb01lters that mostly\nresemble localized and oriented Gabor functions (see \ufb01g. 3-i and iv), while the Conv weight-sharing\nscheme learns structured but poorly localized high-frequency patterns (see \ufb01g. 3-iii) [6]. The TConv\nmodels re-use the same 8x8 \ufb01lters every 8 pixels and apply a diagonal offset of 2 pixels between\nneighboring \ufb01lters with different weights in order to reduce tiling artifacts. There are 4 sets of \ufb01lters,\neach with 64 \ufb01lters for a total of 256 covariance \ufb01lters (see bottom plot of \ufb01g. 3). Similarly, we have\n4 sets of mean \ufb01lters, each with 32 \ufb01lters. These \ufb01lters have usually non-zero mean and exhibit\non-center off-surround and off-center on-surround patterns, see \ufb01g. 3-ii.\nIn order to draw samples from the learned models, we run HMC for a long time (10,000 iterations,\neach composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in \ufb01g. 3 A)-\nD). Without modelling the mean intensity, samples lack structure and do not seem much different\nfrom those that would be generated by a simple Gaussian model merely \ufb01tting the second order\nstatistics (see \ufb01g. 3 in [1] and also \ufb01g. 2 in [7]). By contrast, structure, sharp boundaries and some\nsimple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT\nand mHPoT which differs from mPoT by having a second layer pooling matrix on the squared\ncovariance \ufb01lter outputs [11].\nA more quantitative comparison is reported in table 1. 
We \ufb01rst compute marginal statistics of \ufb01lter\nresponses using the generated images, natural images from the test set, and random images. The\nstatistics are the normalized histogram of individual \ufb01lter responses to 24 Gabor \ufb01lters (8 orienta-\ntions and 3 scales). We then calculate the KL divergence between the histograms on random images\nand generated images and the KL divergence between the histograms on natural images and gener-\nated images. The table also reports the average difference of energies between random images and\nnatural images. All results demonstrate that models that account for mean intensity generate images\n\n4 The code used in the experiments can be found at the \ufb01rst author\u2019s web-page.\n\nMODEL | F(R) \u2212 F(T) (10^4) | KL(R || G) | KL(T || G) | KL(R || G) \u2212 KL(T || G)\nPoT - Conv | 2.9 | 0.3 | 0.6 | -0.3\nPoT - TConv | 2.8 | 0.4 | 1.0 | -0.6\nmPoT - TConv | 5.2 | 1.0 | 0.2 | 0.8\nmHPoT - TConv | 4.9 | 1.7 | 0.8 | 0.9\nmcRBM - TConv | 3.5 | 1.5 | 1.0 | 0.5\n\nTable 1: Comparing MRF\u2019s by measuring: difference of energy (negative log ratio of probabilities) between\nrandom images (R) and test natural images (T), the KL divergence between statistics of random images (R) and\ngenerated images (G), KL divergence between statistics of test natural images (T) and generated images (G),\nand difference of these two KL divergences. 
Statistics are computed using 24 Gabor \ufb01lters.\n\nthat are closer to natural images than to random images, whereas models that do not account for the\nmean (like the widely used PoT-Conv) produce samples that are actually closer to random images.\n\n4.1 Discriminative Experiments on Weight-Sharing Schemes\n\nIn future work, we intend to use the features discovered by the generative model for recognition.\nTo understand how the different weight sharing schemes affect recognition performance we have\ndone preliminary tests using the discriminative performance of a simpler model on simpler data. We\nconsider one of the simplest and most versatile models, namely the RBM [29]. Since we also aim\nto test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets\nsuch as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object\ncategories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR\n10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples\nper class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training\nsamples per class and 1000 test samples per class.\nThe energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3\n(R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST\nuses logistic units for the pixels and the energy function is again the same as before but without any\nquadratic term. All models are trained in an unsupervised way to approximately maximize the\nlikelihood in the training set using Contrastive Divergence [32]. They are then used to represent\neach input image with a feature vector (mean of the posterior over the latent variables) which is\nfed to a multinomial logistic classi\ufb01er for discrimination. 
Models are compared in terms of: 1)\nrecognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general,\nassuming \ufb01lters much smaller than the input image and assuming equal number of latent variables,\nConv, TConv and Local models process each sample faster than Global by a factor approximately\nequal to the ratio between the area of the image and the area of the \ufb01lters, which can be very large\nin practice.\nIn the \ufb01rst set of experiments reported on the left of \ufb01g. 4 we study the internal representation in\nterms of discrimination and dimensionality using the MNIST dataset. For each choice of dimension-\nality all models are trained using the same number of operations. This is set to the amount necessary\nto complete one epoch over the training set using the Global model. This experiment shows that: 1)\nLocal outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv\ndoes not perform as well as Local probably because the translation invariant assumption is clearly\nviolated for these relatively small, centered, images, 3) Conv performs well only when the internal\nrepresentation is very high dimensional (10 times overcomplete) otherwise it severely under\ufb01ts, 4)\nGlobal performs well when the representation is compact but its performance degrades rapidly as\nthis increases because it needs more than the allotted training time. The right hand side of \ufb01g. 4\nshows how the recognition performance evolves as we increase the number of operations (or train-\ning time) using models that produce a twice overcomplete internal representation. With only very\nfew \ufb01lters Conv still under\ufb01ts and it does not improve its performance by training for longer, but\nGlobal does improve and eventually it reaches the performance of Local. If we look at the crossing\nof the error rate at 2% we can see that Local is about 4 times faster than Global. 
To summarize, Local\nprovides more compact representations than Conv and is much faster than Global while achieving\nsimilar performance in discrimination. Also, Local can easily scale to larger images while Global\ncannot.\n\nFigure 4: Experiments on MNIST using RBM\u2019s with different weight-sharing schemes. Left: Error rate as\na function of the dimensionality of the latent representation. Right: Error rate as a function of the number of\noperations (normalized to those needed to perform one epoch in the Global model); all models have a twice\novercomplete latent representation.\n\nSimilar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the\nsame protocol introduced in earlier work by Krizhevsky [22], the RBM\u2019s are trained in an unsuper-\nvised way on a subset of the 80 million tiny images dataset [33] and then \u201c\ufb01ne-tuned\u201d on the CIFAR\n10 dataset by supervised back-propagation of the error through the linear classi\ufb01er and feature ex-\ntractor. All models produce an approximately 10,000 dimensional internal representation to make a\nfair comparison. Models using local \ufb01lters learn 16x16 \ufb01lters that are stepped every pixel. Again,\nwe do not experiment with the TConv weight-sharing scheme because the image is not large enough\nto allow enough replicas.\nSimilarly to \ufb01g. 3-iii the Conv weight-sharing scheme was very dif\ufb01cult to train and did not produce\nGabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31]\nfor these RBM\u2019s. By contrast, both Local and Global produce Gabor-like \ufb01lters similar to those\nshown in \ufb01g. 2-F. The model trained with the Conv weight-sharing scheme yields an accuracy equal\nto 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respec-\ntively. 
Although Local and Global have similar performance, training with the Local weight-sharing\nscheme took under an hour while using the Global weight-sharing scheme required more than a day.\n\n5 Conclusions and Future Work\n\nThis work is motivated by the poor generative quality of currently popular MRF models of natural\nimages. These models generate images that are actually more similar to white noise than to natural\nimages. Our contribution is to recognize that current models can benefit from 1) the addition of\na simple model of the mean intensities and from 2) the use of a less constrained weight-sharing\nscheme. By augmenting these models with an extra set of latent variables that model mean intensity\nwe can generate samples that look much more realistic: they are characterized by smooth regions,\nsharp boundaries and some simple high frequency texture. We validate our approach by comparing\nthe statistics of filter outputs on natural images and generated images.\nIn the future, we plan to integrate these MRF\u2019s into deeper hierarchical models and to use their\ninternal representation to perform object recognition in high-resolution images. The hope is to\nfurther improve generation by capturing longer range dependencies and to exploit this to better cope\nwith missing values and ambiguous sensory inputs.\n\nReferences\n[1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing,\npages 431\u2013441, 2005.\n[2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.\n[3] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,\n313(5786):504\u2013507, 2006.\n[4] M. Ranzato and G.E. Hinton. 
Modeling pixel means and covariances using factorized third-order Boltzmann\nmachines. In CVPR, 2010.\n[5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In\nNIPS, 2000.\n[6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.\n[7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on MRFs in low-level vision. In CVPR, 2010.\n[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of\nimages. PAMI, 6:721\u2013741, 1984.\n[9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of\nStudent-t distributions. In NIPS, 2003.\n[10] S.C. Zhu and D. Mumford. Prior learning and Gibbs reaction diffusion. PAMI, pages 1236\u20131250, 1997.\n[11] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics.\nNeural Comp., 18:344\u2013381, 2006.\n[12] S. Osindero and G.E. Hinton. Modeling image patches with a directed hierarchy of Markov random\nfields. In NIPS, 2008.\n[13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural\nscenes. Nature, 457:83\u201386, 2009.\n[14] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: a strategy employed by\nV1? Vision Research, 37:3311\u20133325, 1997.\n[15] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete\nrepresentations. JMLR, 4:1235\u20131260, 2003.\n[16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007.\n[17] S. Roth and M.J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205\u2013229, 2009.\n[18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local\nreceptive fields. arXiv:1006.0448, 2010.\n[19] C. Tang and C. Eliasmith. 
Deep networks for robust visual recognition. In ICML, 2010.\n[20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted Boltzmann machines for modeling\nnatural images. In AISTATS, 2010.\n[21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended\nfields-of-experts. In BMVC, 2009.\n[22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp.\nScience, Univ. of Toronto.\n[23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay\nneural networks. IEEE Acoustics Speech and Signal Proc., 37:328\u2013339, 1989.\n[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.\nProceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML,\n2009.\n[26] R.M. Neal. Bayesian learning for neural networks. Springer-Verlag, 1996.\n[27] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In\nICML, 2008.\n[28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.\n[29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to\ninformation retrieval. In NIPS, 2005.\n[30] http://yann.lecun.com/exdb/mnist/.\n[31] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable\nunsupervised learning of hierarchical representations. In Proc. ICML, 2009.\n[32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation,\n14:1771\u20131800, 2002.\n[33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric\nobject and scene recognition. 
PAMI, 30:1958\u20131970, 2008.\n", "award": [], "sourceid": 1163, "authors": [{"given_name": "Marc'aurelio", "family_name": "Ranzato", "institution": null}, {"given_name": "Volodymyr", "family_name": "Mnih", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}